The Structure of Spoken Language

Using an innovative approach, this book focuses on a widely debated area of
phonetics and phonology: intonation, and specifically its relation to metrics,
its interface with syntax, and whether it can be attributed more to phonetics or
phonology, or equally to both. Drawing on data from six Romance languages
(French, Italian, Spanish, Portuguese, Catalan, and Romanian), whose rich
intonation patterns have long been of interest to linguists, Philippe Martin
challenges the assumptions of traditional phonological approaches, and
re-evaluates the data in favor of a new usage-based model of intonation. He
proposes a unified description of the sentence prosodic structure, focusing on
the dynamic and cognitive aspects of both production and perception of
intonation in speech, leading to a unified grammar of sentence intonation in
Romance languages. This book will be welcomed by researchers and advanced
students in phonetics and phonology.
Philippe Martin is a Professor in the Linguistics Department at the
Université Paris Diderot.
The Structure of Spoken Language
Intonation in Romance
Philippe Martin
University Printing House, Cambridge CB2 8BS, United Kingdom
Cambridge University Press is part of the University of Cambridge.

It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781107036185
© Philippe Martin 2015
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2015
Printed in the United Kingdom by TJ International Ltd. Padstow Cornwall
A catalogue record for this publication is available from the British Library
Library of Congress Cataloguing in Publication data
Martin, Philippe, 1944– author.
The structure of spoken language : intonation in Romance / Philippe Martin.
pages cm
ISBN 978-1-107-03618-5 (hardback)
1. Romance languages – Phonetics – Intonation. 2. Romance languages –
Phonology. 3. Romance languages – Phonology, Historical. 4. Romance
languages – Spoken Romance languages. 5. Intonation (Phonetics)
6. Biolinguistics. I. Title.
PC81.5.M27 2015
440′.0415–dc23
2015012063
ISBN 978-1-107-03618-5 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication,
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Contents
List of figures and maps page x
List of tables xix
Preface xxi
Acknowledgments xxv
Key concepts xxvii
1 Introduction 1
The respiratory cycle 1
The source-filter model of phonation 3
Emotions 5
Voiced and unvoiced speech sounds 7
Laryngeal frequency 7
Fundamental frequency and melodic curve 7
Intensity 9
Spectrographic analysis 9
Syllabic duration 10
Syntax and prosody 11
The prosodic structure: the structure of spoken language 13
Stressed syllables 13
Intonation and syntax 14
Brain waves and prosody 14
A Copernican change 15
From laboratory to spontaneous speech 16
Reading and listening 16
Romance languages 17
2 The role of technological advances 20
The kymograph 20
The spectrograph 21
Fundamental frequency tracking 23
First results 23
Electroencephalography and brain waves 27
Transcription and alignment of speech 27
3 Transcription systems 29
Acoustic and perceived data 30
Obtaining data: pitch curves 30
Selecting data 32
Historical background 32
The AMPER project 36
The Prosogram 36
ToBI 38
INTSINT and Momel 40
Analor 41
Transcription as theory 42
Perception and interpretation 43
A phonological transcription system 44
4 The Autosegmental-Metrical Prosodic Structure 46
A brief description 46
Properties 48
Applying the concept 51
Questions and remarks 54
The prosodic structure revisited 56
5 The Incremental Prosodic Structure 59
Melodic curves 59
The stress group 61
The prosodic word 62
Syllabic chunking 63
The time dimension 64
Conversion of syllabic chunks 65
The syllable in the stress group 66
The stress group in the sentence 68
Classes of conclusive contours 68
Basic modalities 68
Modality variants 69
Alternative questions 71
Iconicity of conclusive contours 71
Imperative contour 73
Implicative contour 73
Contour of surprise 74
Contour of doubt 75
The Incremental Prosodic Structure 76
Independence 79
Prosodic events 79
Properties 81
Prosodic phrasing 82
Planarity 83
Connexity 83
Domain 84
Neutralization 85
Differentiation in the time domain 85
Differentiation of prosodic events 86
The dynamic prosodic structure 87
The Incremental Storage-Concatenation process 88
Preplanning 90
Melodic contours features 90
One prosodic word 91
Two prosodic words 92
Three prosodic words ended with C0 92
Contrast of melodic slope 94
Three prosodic words ended with C1 94
Prosodic structure constraints 96
The arc accentuel in French 96
Stress clash 97
Minimum duration of prosodic words 98
Maximum duration of prosodic words 99
Eurhythmy 101
Word alignment 104
Syntactic clash 105
Experimental data 106
Brain waves and prosodic structure 107
Theta brain waves and the perception of syllables 107
Delta brain waves and stressed syllables 108
Delta brain waves frequency range 110
Prosodic structure constraints and brain waves 111
Stress groups and brain waves 113
Constraints revisited 113
Sequential sentence structuration by prosody and syntax 115
A simple example: telephone numbers 116
6 Lexical stress in Romance languages 120
Stress and accent 120
Stress in various languages 121
Stressed syllables in Latin 122
Stressed syllables in Romance languages (other than French) 123
Orthographic convention and homographs 124
Rules for word stress placement 126
A statistical approach 127
A phonological-phonetic approach 127
A phonological approach 128
A morphophonetic approach 128
A morphological approach 128
French 130
Secondary accent and arc accentuel 131
The groupe de sens… 131
Stress variations in Romance languages 132
7 The Incremental Prosodic Structure in six Romance languages 133
EuRom4 and EuRom5 134
The process of reading 135
Note on figures 136
The melodic contours of Romance languages 137
Inventory 138
Processing prosodic information 141
Prosodic structures in Romance languages 144
Identification of prosodic contours 144
Complex contour 146
Sequences of two prosodic words 150
Sequences of three prosodic words 160
Sequences of four prosodic words and more 185
Coordination, enumeration, parenthesis 192
Coordination 192
Enumeration 198
Parenthesis 200
An example of AM prosodic analysis in French 203
An example of ISC prosodic processing in French 208
Conclusion 212
8 Macrosyntax 214
A first approach 215
Three current models for macrosyntax 217
The theory of la lingua in atto 220
Text macrosyntax and prosodic macrosyntax 221
Merging text and intonation 222
Dysfluencies 224
Ponctuants 225
The prosodic eraser 226
Use of dysfluencies 226
Deletions 227
Additions 228
Text and prosodic macrosegments 230
Examples of macrosyntactic analysis 232
French 233
Italian 241
Portuguese 245
Conclusion 248
9 Applications 249
Teaching French prosodic structure 249
Silent reading 252
Eye movement 253
Subvocalization 253
Delta wave synchronization 255
10 Conclusion 256
Quotes from Frédéric Dard (San Antonio) 256
11 WinPitch 259
Sound recording made clear 259
Sound and video 260
Transcription and alignment on the fly 261
Data mining for large speech corpora 262
Acoustic analysis 266
Prosodic morphing 270
Automatic segmentation 270
Interface with other software 270
References 272
Analyzed corpora 285
Author index 287
Subject index 290
List of figures and maps
Figures
1.1 Respiration cycle, without phonation (top) and with
phonation (bottom) page 2
1.2 An example of an out-of-breath speaker (NS), interrupting
the phonation process by pauses longer than usual 3
1.3 Source-filter model of phonation 4
1.4 Interactions in the source-filter model between phonation and
emotions 4
1.5 Extreme cases of the emotion–phonology relationship:
emotion dominates phonology (extreme stress or anger), and
phonology dominates emotions 6
1.6 An example of melodic curve, interrupted at segments without
voicing (including pauses and silence), with the fundamental
frequency (top), intensity (middle) and wave (bottom) curves 8
1.7 Narrowband spectrogram to visualize harmonics
corresponding to the fundamental frequency curve 10
1.8 Staircase duration curves showing the evolution of syllabic
duration 11
1.9 Bézier duration curves showing the evolution of syllabic
duration 12
1.10 Map: Romance dialects 18
2.1 Rousselot kymograph 21
2.2 Measure of the laryngeal period, directly (top), or indirectly
from the duration of a packet of ten periods 22
2.3 Spectrogram printed on thermo-sensitive paper 22
2.4 The ten basic intonation patterns for French by Delattre (1966) 24
2.5 An example of analysis with the software Waves™ of the
sentence Jim builds a big daisy-chain (from ToBI, 1999) 25
2.6 An example of a fundamental frequency curve with a wide
band spectrogram displayed underneath (from Delais-
Roussarie et al., 2015) 26
2.7 WinPitch display 26
3.1 Pitch curve using a linear scale in Hz 31
3.2 Melody example of perceived pitch curve transcribed on a
musical range for English (Jones, 1909) 33
3.3 Musical transcription used by Fónagy and Magdics (1963) for
English 33
3.4 Unstressed syllables, static tones, and contours for English
(Armstrong & Ward, 1931) 34
3.5 Unstressed syllables, static tones, and contours for German
(von Essen, 1956) 34
3.6 Stressed and final syllables pitch transcribed as static tones in
English (Pike, 1945) 34
3.7 Melody contours of groups for English oral (Palmer &
Blandford, 1924) 35
3.8 Melody movements by syllable for English (Bolinger, 1961) 35
3.9 Simplified musical range with four levels for French: 1 Low, 2
Average, 3 High, 4 Acute (Léon & Martin, 1969) 35
3.10 Example of comparison of melody movements on all the
syllables of a statement read by two speakers from the Aosta
Valley 36
3.11 Prosogram for the automatic determination of pitch starting
from the syllabic segmentation and of the glissando threshold
(Mertens, 2004) 37
3.12 Variations of rising melody contours transcribed with the ToBI
notation 39
3.13 In the ToBI transcription, a high tone can correspond to a rising
(on the left) or a falling contour (on the right) 40
3.14 An example of an intonative period boundary detection from
the four parameters (from Lacheret-Dujour & Victorri, 2002) 42
4.1 An example of metrical grid 47
4.2 Degrees of stress obtained by counting the number of stress
nodes 48
4.3 The (revised) Autosegmental-Metrical Prosodic Structure 51
5.1 Micromelodic fundamental frequency dips (circled) due to the
presence of a voiced stop [d] in the sequence … actividades
denunciadas por Traffic 60
5.2 Two examples in French (c’est ma maman “this is my mother,”
and c’est mon papa “this is my father”) 67
5.3 Variants of modality melodic contours located on the last
stressed syllable (declarative case) or the last syllable, stressed
or not (interrogative case) 70
5.4 Voulez-vous du thé du café ou du chocolat? 71
5.5 Voulez-vous du thé du café du chocolat? 71
5.6 Declarative vs. imperative conclusive melodic contours 73
5.7 Emphasis on a declarative contour gives an imperative
contour; emphasis on an interrogative contour gives a surprise
contour 74
5.8 Implicative contour (evidence), a moderate rise followed by a
large fall 74
5.9 Bell shaping on a declarative contour gives an evidence
contour, and on an interrogative contour gives a doubt contour 75
5.10 Interrogative vs. surprise conclusive melodic contours 75
5.11 Implicative interrogative contour (doubt), a large rise followed
by a moderate fall 75
5.12 A non-planar partial structure [A [B] C], not well-formed for a
prosodic structure 83
5.13 A prosodic structure without connexity, well-formed for a
prosodic structure with non-integrated parentheses 84
5.14 A domain between two consecutive C0 contours, where
contours C1 inside the domain must be realized phonetically
with enough similarities 85
5.15 Three basic planar hierarchic configurations 93
5.16 Three basic prosodic hierarchical configurations 95
5.17 Prosodic word shortest duration in ms for various speech styles
(Martin, 2014b), political discourse, narrative, conference,
radio news, university lecture 99
5.18 A car license plate difficult to read 99
5.19 A telephone number difficult to read and to remember 99
5.20 Prosodic word longest duration in ms for various speech styles
(Martin, 2014b) 102
5.21 Two ways to obtain eurhythmicity: balancing the number of
syllables or varying the speech rate 103
5.22 Syllabic duration in function of the number of syllables in
prosodic words 104
5.23 A phrasing non-congruent to syntax, involving a syntactic clash 106
5.24 Example of EEG spectral analysis (channel 28 or Pz) of evoked
potential for a stimulus of a sequence of pure tones with no
temporal structure 109
5.25 Example of EEG spectral analysis (channel 28 or Pz) of evoked
potential for a stimulus of a sequence of pure tones with a
temporal structure 109
5.26 EEG Theta waves synchronized by Delta pulses 111
5.27 The seven syllables constraint is actually a duration constraint,
governed by Delta waves 114
5.28 Delta waves synchronize the transfer of chunks of syllables
from short-term memory 114
5.29 The eurhythmicity constraint is linked to the relative stability
of consecutive periods of Delta waves 115
5.30 Chunks of syllables of more than four syllables must be
identified in long-term memory 115
5.31 A hierarchy which contradicts the graphic structure 118
5.32 A simple enumeration by juxtaposition of the three stress
groups 119
5.33 A structure corresponding to the usual graphic representation
522 4436 119
7.1 The process of reading schematized 136
7.2 Marking of stressed melodic contours and boundary tones 137
7.3 An example of a long enumeration made from a sequence of
groups of two prosodic words 140
7.4 Some examples of terminal conclusive contours in various
regional realizations (Turin, Rome, Palermo, Naples, Florence) 141
7.5 An example of melodic contours sequence, showing the
contrast of melodic slope in French 142
7.6 anticoncezionali 146
7.7 que hay una provocación a la discriminación 146
7.8 que componen Malásia 147
7.9 nos estados unidos 147
7.10 prepozitionala 147
7.11 An example of prosodic structure 150
7.12 L’idée était simple 151
7.13 L’idea era semplice 151
7.14 La idea era simple 152
7.15 La idea era simple 152
7.16 A ideia era simples 153
7.17 Ideea era simplă 153
7.18 In pericolo poi 154
7.19 cuando se constate 154
7.20 apelidado de Óscar 155
7.21 Com les formigues 155
7.22 La più grande delle piroghe misura cinque metri 156
7.23 Siguiendo Cn este modelo 156
7.24 Segundo a especialista 157
7.25 na cidade de York 157
7.26 Nascido no Japão 158
7.27 les garçons de piste 158
7.28 La plus grande des pirogues mesure cinq mètres 159
7.29 mais les scientifiques japonais 159
7.30 La cuinera de Sant Pol de Mar 160
7.31 Configuration I, Cx Cx C0 161
7.32 Les romans ont un début et une fin 161
7.33 on rend cette interdiction strictement inefficace 162
7.35 ainsi sa nouvelle gamme de combinés présentés lundi 163
7.36 A saturation of melodic contrasts in the long syntagm
anniversaire de la mort au combat en dix-sept cent dix-huit du
roi Charles Douze 163
7.37 Configuration II, Cy Cx C0 164
7.38 de se livrer à des affrontements en règle 164
7.39 cette maladie est devenue une pathologie changeante et
multiforme 165
7.40 Configuration II, Cy Cx C1 165
7.41 C’est au travers de cette relation qu’il instaurera à ces deux
personnes 166
7.42 Configuration III, Cx Cy C0 166
7.43 Certains de ces bâtiments préfabriqués se sont révélés
dangereux 167
7.44 Neuf cents policiers n’ont pu, cependant, empêcher les
bagarres recherchées de part et d’autre 167
7.46 A saturation of melodic contrasts in the long syntagm
anniversaire de la mort au combat en dix-sept cent dix-huit du
roi Charles Douze 168
7.48 B, antiparasiti C anticoncezionali 169
7.49 I romanzi hanno un inizio e una fine 170
7.50 Los romances tienen un inicio y un fin 170
7.51 Els romanços tenen un inici i un final 171
7.52 Os romances têm um início e um fim 171
7.53 Avião de papel no Espaço 172
7.54 Configuration I, Cx Cx Cc 172
7.55 Um cão de raça terra nova 173
7.56 é permitido matar um escocês 173
7.57 Os vídeos sobre Cn actividades paranormais 174
7.58 La recomendación plantea a los Estados 174
7.59 Mișcarea separatistă bască a comis noi atentate 175
7.60 i després que el vedell ataqués un dels homes que el volia
lligar 175
7.61 Configuration I, Cx Cx C1 176
7.62 A escolha da carreira profissional 176
7.63 Configuration II, Cy Cx C0 177
7.64 in coppie nelle quali il padre è sieropositivo 177
7.65 sarà arruolato dai carabinieri e addestrato 178
7.66 probabilmente sfuggito al controllo del padrone 178
7.67 Aceasta este o dilemă insolubilă 179
7.68 Configuration II, Cy Cx Cc 179
7.69 Situația periferică a Portugaliei o menține într-o poziție
marginală în raport cu fluxurile din Est 180
7.70 i després que el vedell ataqués un dels homes que el volia
lligar 180
7.71 Unele dintre aceste clădiri prefabricate s-au dovedit
periculoase 181
7.73 Romanele Cc au un început C1 și un sfârșit 182
7.74 Alarmă la școala britanică 182
7.75 Configuration III, Cx Cy Cc 183
7.76 che trasferirsi in USA 183
7.77 Pocs minuts després de les set de la tarda, el vedell 184
7.78 probabilmente sfuggito al controllo del padrone 184
7.79 Un grup de cercetători germani a rezolvat enigma 185
7.80 les médecins de l’Académie des sciences médicales 186
7.81 El catalán es la ochenta y ocho lengua del mundo 186
7.82 L’acadèmia de la llengua catalana l’Institut d’Estudis
Catalans IEC 187
7.83 Le programme de recherche a débuté en deux mille deux 187
7.84 Poche zampate per attirare l’attenzione del piantone del
comando provinciale dei carabinieri 188
7.85 Una nuova e divertente ginnastica con la palla 189
7.86 es necesario alfabetizar a cuatro millones de personas
cada año 189
7.87 La resolución propone que se pongan en marcha dispositivos
nacionales de autocontrol 190
7.88 Així abans que acabi el dos mil vuit 190
7.89 Com cerca de sete centímetros de comprimento 191
7.90 In Germania violența rasistă a depășit limita 191
7.91 je vous suggère d’installer des volets des rideaux et des
voilages 193
7.92 je vous conseille d’étudier le néerlandais le danois et le
norvégien 193
7.93 Jamais Barnabé Jean-David ni Mamadou ne seraient prêts à
venir travailler le samedi 194
7.94 Le muret le donjon et l’église sont de style roman 194
7.95 Two possible hierarchical configurations [ABC] and [[A] [B]
[C]] for postverbal accentual units, resulting in similar
sequences of contours: rising, rising, falling 195
7.96 Different groupings of coordinated units A, B, and C, [ABC] and
[[A] [B] [C]], resulting in different sequences of melodic
contours (falling, falling, rising or rising, rising, rising) in
the preverbal case 195
7.97 le vélo le roller ou l’aviron comptent parmi les activités
populaires sur le campus 196
7.98 le muret le donjon et l’église sont de style roman in which the
coordinate units are subject of the Verb Phrase sont de style
roman 196
7.99 A parallel realization where the first stress groups are
coordinated with the conjunction ni, associated in each case
with an emphatic accent 198
7.100 On peut livrer le lavabo la baignoire ou l’évier sans acompte
de votre part 198
7.101 Enumeration in Italian of numbers (1.46, 1.47, 1.44, 2.78,
2.41), sequence of Cc contours ending each prosodic group,
terminated by C0 conclusive on the last item (giorni) 199
7.102 […B C2 antiparasiti Cc] [C C2 anticoncezionali Cc] … [M
Cc antilopi C0]] Primo Levi enumeration C2 Cc, C2 Cc …
Cc C0 199
7.103 Uma equipa de cientistas do instituto de reabilitação de
Chicago 199
7.104 Il fatto che, in quel mondo, gli uomini 200
7.105 che trasferirsi in USA 201
7.106 permitan una utilización más segura de algo que,
evidentemente 201
7.107 Ancora i giapponesi, per contro hanno la più alta incidenza
mondiale 202
7.108 Le donne giapponesi, per esempio, hanno un’incidenza di
tumori alla mammella 203
7.109 Le coléreux garçon ment à sa mère (from Jun & Fougeron,
2002) 204
7.110 le coléreux et mauvais garçon ment à sa mère (from Jun &
Fougeron, 2002) 205
7.111 A counterexample le garçon coléreux ment à sa mère (from
Jun & Fougeron, 2002) 206
7.112 Ou le donjon ou le minaret ou les murailles doivent être
restaurés 207
7.113 Identification of prosodic events 209
7.114 Classification of prosodic events 209
7.115 Retrieving the prosodic structure 210
8.1 Macrosyntactic analysis of text and intonation 223
8.2 Adding a corrective syntagm after the nucleus: [en Angleterre
dans le métro] nucleus 229
8.3 Adding syntactic segment to the prenucleus [mes parents
m’emmenaient] 229
8.4 [Je confirme que le premier ministre Elio Di Rupo m’a parlé
de cette situation et euh de l’inquiétude des brasseurs] C1
[belges] C1 [par rapport à ce qu’était la consommation
française …] 230
8.5 Melodic curve with the pitch movements on stressed syllables
circled 234
8.6 The ISC schema of the example Les vieux graphistes ou
les anciens je devrais dire graphistes pas les vieux
quelquefois lorsqu’ils voient les mises en page de certaines
revues ou de certains journaux ils se mettent les mains sur
la tête 236
9.1 Variants of conclusive contours: declarative, imperative,
implicative, interrogative, surprise, doubt 250
9.2 Three prosodic structures organizing a sequence of three
stress groups 251
11.1 WinPitch command, alignment, navigation, and analysis
windows 260
11.2 Example of video analysis 261
11.3 Assisted alignment by slowing down speech playback 262
11.4 Automatic IPA transcription from orthographic text and
morphological and syntactic labeling 263
11.5 Fine tuning of speech segments limits with the help of a
simultaneously displayed spectrogram 263
11.6 Automatic segmentation from spectrographic transitions 264
11.7 Batch processing of a large set of conjunctions obtained
from a concordance analyzer with their left and right
contexts 265
11.8 Entering the key word “parce que” and selecting a
Transcriber file 266
11.9 Table generated automatically listing the occurrences of the
entered keyword 267
11.10 Automatic generation of text from alignment files and
selection of the entered key word 267
11.11 Command window displaying the available pitch tracking
algorithms that can be used on any user selected sections of
the speech recording 268
11.12 Most common sources of errors for F0 tracking 269
List of tables
5.1 Variants of modality page 69
5.2 Phonological description of modality variants using the
features +/− Rising, +/− Ample, and +/− Bell shaped 70
5.3 Phonological description of modality contours 91
5.4 System of contrasts Cx C0 92
5.5 System of contrasts Cx C1 C0 93
5.6 System of contrasts C1 Cx C0 94
5.7 System of contrasts Cx Cx C1 94
5.8 System of contrasts Cy Cx C1 95
5.9 System of contrasts Cx Cy C1 95
7.1 Processing the prosodic events Cn, C2, C1 and C0 in the
example of Figure 7.5 143
7.2 Static phonological description of Romance languages
melodic contours 148
7.3 Configurations of three successive melodic contours 149
7.4 Percentage of realizations according to the coordination type 195
7.5 Sequence of prosodic events triggering the storage-
concatenation process 211
8.1 Gars and Lablita equivalence 220
8.2 Utterance segmentation into prosodic words indicating
the melodic contours on the last stressed syllable and the
type of contour in the process of storage-concatenation 236
Preface
This book is the culmination of some forty-five years of personal research on
intonation. When I first went to Canada in 1968, I had the rare and precious
opportunity to be hired by Pierre Léon (1926–2013), then a young and
enthusiastic phonetician, who was eager to carry out research on all possible
aspects of intonation, in didactics, phonostylistics, and phonology. Being a young
graduate in electronic engineering, I was given as a first task the job of developing
an acoustic instrument capable of measuring in real time the fundamental
frequency of speech (an acoustic measure of vocal fold vibration frequency).
Being quite new to the field, I thought this assignment would be easy to complete
in some two or three weeks! I realize today, after all these years, that despite the
considerable progress that has been made, the question of fundamental frequency
tracking has still not been completely solved. In fact, new generations of young
specialists in speech signal analysis are regularly tackling the problem, only to
discover that even though systems are becoming more reliable, there is always
room for improvement. The grail in the field would be the availability of an
algorithm which would give reliable results for all cases of speech recording
conditions, or at least all cases where a human listener can perceive the melodic
variations resulting from fundamental frequency change.
Nevertheless, I had no fear at that time, especially as I had received a brand
new PDP 8/I for my exclusive use (the PDP 8/I was at that time a
revolutionary laboratory computer with 8 kBytes of central memory and a 32
kBytes hard disk). I spent all my work and leisure time developing a hybrid
analog-digital system, capable of delivering in real time a fundamental
frequency curve with an extended range (70 Hz to 500 Hz). The curve was
displayed on a high remanence oscilloscope screen and filmed on a large TV
screen. At a time when all acoustic intonation data were obtained either with a
spectrograph (limited to 2.4 seconds of analysis and requiring 3 to 5 minutes of
processing, not to mention the necessary and tedious manual setup) or with a
mingograph (hooked to a not too reliable voice filter operating in a limited
frequency range and requiring an expensive roll of ink or UV photographic
paper to print the results), this was an important achievement which had direct
consequences for my own prosodic research.
Indeed, I was then able to process a very large amount of data, essentially in
French, sometimes playing with the intonation of my own voice and getting the
resulting melodic curves immediately, and was curious to see if some pattern
would emerge from all these trials. In late 1973, I finally got an
idea pertaining to the frequently observed regularity of F0 patterns, an idea that
I formalized with the term contraste de pente (contrast of melodic slope) in
French, and published in the journal Linguistics in 1975 (Martin, 1973, 1975).
At the same time, I coined the terms structure prosodique and hiérarchie
prosodique (prosodic structure and prosodic hierarchy) to name a hierarchical
organization of minimal prosodic units or prosodic words, containing one and
only one prosodic event indicating this hierarchy. Although referring to the
same kind of experimental data in French, papers published at that time (e.g.
Vaissière, 1975; Émerard, 1977) were phonetically rather than phonologically
oriented but brought comparable data.
To my regret, my Linguistics paper had almost no impact on the research
domain in prosody, except in France. Only five years later, however, papers
using the term prosodic structure appeared, but unfortunately without ever
mentioning my earlier work. Later, speech analysis computer software
became popular (Signalyze, WinPitch, Praat, and so on), and phonologists
(essentially based in US universities) were happy to discover a new
playground. To differentiate their activity from that of the phoneticians, who
were not considered seriously by linguists at the time, and to avoid being
confused with them, they called it “laboratory phonology,” which
corresponds to what most phoneticians have actually been doing for a
century or more.
The purpose of this monograph is to present an alternative theoretical
approach that attempts to describe and understand the prosodic organization
of sentences. In this endeavor, I briefly present a critical exposition of the main
aspects of the dominant Autosegmental-Metrical model (henceforth AM),
succinctly describing existing research using this approach for Romance
languages such as European French, Italian, Spanish, Catalan, European
Portuguese, and Romanian (Post, 2000; D’Imperio, 2002; D’Imperio et al.,
2005; Michelas & D’Imperio, 2010; Sosa, 1999; Hualde, 2003; Prieto, 2014;
Frota, 2009). Then I introduce an alternative model, called Incremental
Storage-Concatenation (henceforth ISC), derived from the Storage-Concatenation
model I proposed in 2009 (Martin, 2009). In this model, I highlight some
characteristics, apparently never mentioned in AM descriptions, formalized
as a set of constraints limiting the number of prosodic structures that could be
associated with a given text.
This leads to a concept of intonation that from the start completely
dissociates sentence text from its hierarchical organization by syntax. This
concept departs dramatically from earlier concepts of prosodic structure
conceived under the AM approach, where only one such structure can be
associated with a given syntactic structure, even if its restructuration appears
possible, in order to obtain a better eurhythmicity (Post, 1999).
The set of constraints, originally part of the Storage-Concatenation
framework, i.e. planarity, the seven syllables rule, eurhythmicity, stress clash,
and syntactic clash, made me look for an underlying explanation that would
properly account for the observed constraints. A key aspect is their time
dimension and especially the dynamic process performed by listeners to
recover the prosodic structure intended by the speaker or the writer.
Examining the consequences of the time domain aspect of the process is the
key to a better understanding of the observed data, an aspect that is often
neglected or totally ignored in the current literature. Indeed, the usual
reasoning on the two-dimensional plane of a sheet of paper considerably limits
our understanding of the mechanisms necessarily used by the listener in the
perception of the prosodic structure.
Pushing this exploration further, I related this model to results obtained
recently in the neurolinguistic domain, and particularly those concerning
evoked potential linked to prosodic stimuli. These results lead me to propose
a new and coherent model based not only on the time dynamics of the prosodic
structure but also, and perhaps even more interestingly, on specific cognitive
mechanisms, in particular those involving short-term memory (Gilbert, 2012).
This approach suggests a convincing set of explanations pertaining not only to
the set of constraints relative to the prosodic structure but also to some phonetic
data, such as the duration of minimal units of prosody (defined below as
prosodic words), the minimal and maximal time interval between
consecutive stressed syllables, and even the speed limits of silent reading.
The second part of this book is devoted to applications of the model
presented in the first part to the analysis of data in some Romance languages,
starting with French, often considered the ugly duckling among the
languages of the same family because it lacks lexical stress. This second
part itself is divided according to the type of data analyzed: read/laboratory
speech and spontaneous/non-prepared speech. In this latter set of chapters, I use
a modified macrosyntax approach derived from the GARS (Groupe Aixois de
Recherche en Syntaxe) work (Blanche-Benveniste, 1990, 2000) for both the
text and the prosodic aspects of speaker productions.
I sincerely hope that this book will help both new and experienced
researchers in the field of prosody to restore sentence intonation to its
deserved place in linguistic studies. I will try to show that, far from being
the cherry on the phonological cake that it is for some, intonation is the
essential linguistic base for both speech production and speech perception.
Acknowledgments
I have many people to thank, and in the first place, Pierre Léon (1926–2013)
who, like Obelix, a cartoon character in the adventures of the French comic
book Astérix, “plunged me in a barrel of prosody when I was little.” From
Pierre Léon I learned a lot of facts about intonation in linguistics, stylistics,
phonetics, etc., and about how to survive in the academic world.
In addition, I had the privilege to meet and work with the outstanding linguist
Claire Blanche-Benveniste (1935–2010). She had a tremendous influence on
my research, always encouraging me to improve in our countless fruitful and
pleasant discussions.
Many other people helped me in various ways. In particular, I would like to
thank (in alphabetical order):
Mathieu Avanzi (Université de Neuchâtel) for his numerous useful (and
exacting) comments;
Helen Barton (Cambridge University Press) for her constant support
and encouragement in this project;
Gabriela Bilbiie (Université Paris Diderot) for her help in elaborating
the Romanian corpus;
Victor Boucher (Université de Montréal) for his original and fruitful
views on speech perception and his constant support for this project;
Georges Boulakia (Université Paris Diderot) for his constant friendship
and understanding;
Marie Claude Capt-Artaud (Université de Genève), formerly skeptical
but now convinced;
Emanuela Cresti (Università degli Studi di Firenze) for our discussions
and her constant friendship;
Jeanne-Marie Debaisieux (Université Paris 3) for the trust she placed in
my research;
Élisabeth Delais-Roussarie (Université Paris Diderot) for many inter-
esting discussions;
Didier Demolin (Université de Grenoble) for his indefectible friendship;
José Henri Deulofeu (Université Aix-Marseille) for teaching me what
macrosyntax is and staying my friend;
Helena Dowson (Cambridge University Press) for her patience and

encouragement;
Caterina Falbo (Università degli Studi di Trieste) for her understanding
and inspiration;
Ana Maria Fernández Plana (Universitat de Barcelona) for her help in
elaborating and recording the Catalan corpus;
Jacqueline French (Cambridge University Press) for her tireless efforts
to improve the text;
Aline Germain (Middlebury College) for her trust in my theories;
Annie Gilbert (Université de Montréal) for her illuminating view on
syllables;
Rémi Godement (Université Paris Diderot) for his trust in being one of
my doctoral students;
Sarah Green (Cambridge University Press) for her constant help with
this project;
Jane Leung (University of Toronto) for her patience and impatience (at
times), and her moral support;
Mirian Matta-Machado (Universidade Federal Fluminense) for her
constant friendship and support;
Massimo Moneglia (Università degli Studi di Firenze) for his friend-
ship and exceptional cuisine;
Patricia Perez (Université Paris Diderot) for her friendly participation
in corpora building;
Michaela Pirvulescu (University of Toronto) for her help in elaborating
the Romanian corpus;
Guan Qianwen (Université Paris Diderot) for interesting discussions
about Mandarin prosody;
Tommaso Raso for his constant support and friendship;
Mario Rossi (Université de Provence) for his trust in my software
realizations;
Alexandre Sévigny (McMaster University) for his support in difficult
days, and brilliant suggestions in incremental syntax;
And finally, I would like to thank all my present and past students (master’s
and doctoral), who, by their presence and active participation during my
lectures, encouraged me to write this book, and, of course, all the friends and
colleagues I may have inadvertently forgotten.
Every effort has been made to secure necessary permissions to reproduce
copyright material in this work, though in some cases it has proved impossible
to trace copyright holders. If any omissions are brought to our notice, we will be
happy to include appropriate acknowledgments in any subsequent edition.
Key concepts
To help the reader to quickly evaluate the distance from known (and
dominant?) concepts in the field of intonation studies, the following list
contains the essential nonstandard theoretical points developed in this book.
1. This book is about the structure of spoken language.
2. Spoken language is made of time sequences of syllables organized into
stress groups (basic units of speech are syllables, not phonemes).
3. Stress groups are not necessarily aligned with words or syntactic groups;
however, they are aligned on complete words (i.e. their beginnings and
ends are aligned on beginnings and ends of lexical units – words).
4. Stress groups are also called rhythmic groups, accent groups, prosodic
words, and Accent Phrases in the literature.
5. Prosodic words are segments of prosody associated with and aligned on
stress groups.
6. Prosodic words are organized hierarchically by a prosodic structure.
7. Specific prosodic markers indicate prosodic structures; they allow the
listener to reconstitute dynamically the speaker’s intended prosodic
structure in an incremental time fashion.
8. Prosodic markers are instantiated by prosodic events located on the
stressed and final vowels of stress groups.
9. Prosodic events are instantiated by prosodic contours, described primarily
in acoustic terms of duration, melodic contour, and intensity.
10. (Silent) reading and speaking are described as an Incremental Storage-
Concatenation (ISC) process.
11. Recovering the prosodic structure in (silent) reading mode is a specific
process distinct from listening to speech.
12. Generation of spontaneous speech involves chunks of prosody hosting
syntactic constructions, which in turn host morphological units.
13. There is therefore a precedence of prosody over syntax, and of syntax over
morphology.
14. It follows that the same prosodic structure can host various syntactic and
morphological constructions, i.e. different texts.
15. Conversely, more than one prosodic structure can be associated with a
given text (for example when reading).
16. Prosodic boundaries (between an AP, ip, or IP) do not correspond neces-
sarily to syntactic or macrosyntactic boundaries. Likewise, (macro)syn-
tactic boundaries do not necessarily correspond to prosodic boundaries.
17. Stress shift in stress clash conditions entails a reallocation of stress groups
organized hierarchically in the prosodic structure.
18. Generation of a prosodic structure when reading involves the precedence
of syntax (analyzed by the reader from the written text).
19. Prosodic structures are not necessarily congruent with the sentence syn-
tactic structure. They do not result from restructuration of the prosodic
structure either. Actually they do not coexist with syntax; they precede
syntax.
20. Prosodic markers are subject to neutralization of some of their acoustic
features when partially or totally redundant in a given prosodic structure
configuration.
21. Prosodic markers must be acoustically similar in their respective domain.
22. Acoustic features describing prosodic events ensure a necessary and
sufficient differentiation between prosodic markers (melodic contours) in
the prosodic structure.
23. Prosodic structures are constrained by a set of rules: planarity / seven
syllables / stress clash / syntactic clash / eurhythmy (the latter for read
speech).
24. Neurocognitive properties and processes may explain these constraints.
25. The properties of the prosodic structure and of prosodic markers are
extended to macrosyntax.
26. Broad and narrow focus are subcases of macrosyntax configurations
(Prenucleus, Nucleus, Postnucleus).
27. There is a macrosyntax analysis of sentence intonation (no Prefix, only
prosodic Nucleus, prosodic Parenthesis, and prosodic Postfix).
In order to be compatible with the many other studies on intonation that are
probably familiar to most readers, I use the terms R (prosodic structure root), IP
(intonation phrase), ip (intermediate intonation phrase), and AP (Accent
Phrase) throughout this book whenever possible, despite potential general
conceptual differences.
1 Introduction
The respiratory cycle

One of the most remarkable features of phonation is the disruption of the
normal respiratory cycle. Outside phonation, the normal cycle of respiration
presents a comparable duration for both the inspiration and the expiration
phases (top of Fig. 1.1), whereas during phonation the expiratory phase is
usually much longer than the inspiratory phase.
Indeed, the phonation process results from the air flow generated by the lung
compression during the expiration phase of respiration. This air flow generates
the subglottal pressure needed to produce the vibration of the vocal folds for
voiced sounds (vowels, voiced consonants), friction for fricative consonants,
and intraoral pressure to allow the production of stop consonants. As a result,
phonation is possible only during the expiration phase.
While producing speech, the speaker has to optimize the duration of both the
inspiration and the expiration phases of the respiratory cycle, so that the
expiration phase is the longest possible and the inspiration phase the shortest
possible (bottom of Fig. 1.1). The latter should induce an acceptable duration of
silence that fits with the specific conditions of the speech act (usually at a
syntactic boundary). On the other hand, expiration should correspond to the
speaker’s estimation of the appropriate quantity of expired air necessary to
produce the planned sequence of syllables, with the desired parameter values of
rhythm, intensity, and laryngeal frequency. As a prosodic unit, the breath group
contains the prosodic objects produced during the expiration phase, i.e.
between the consecutive inspiration phases.
The air consumption, and therefore the cycle of respiration, depends on the
speaker’s emotional state, and more specifically on the energy needs related to
the speaker’s emotions. The respiratory cycle will be longer at rest, and will
accelerate when the activity level increases, as it requires more oxygen, until it
gets heavily reduced by physical effort (swimming, carrying heavy objects,
racing, etc.) to a time span that allows only a short phonation time.
Figure 1.1 Respiration cycle, without phonation (top) and with phonation
(bottom). Both panels show the succession of inspiration and expiration
phases along the time axis.
The speaker’s emotionally depressed state consumes less energy with less
pulmonary air volume and allows a slower phonation rhythm while producing
the same number of syllables that a speaker would in a more neutral emotional
state. On the other hand, some types of anger and fear consume a lot of
physiological energy. This state leads to shorter phonation sequences, which
may not even reach the duration of a single sentence, or of a complete syntagm,
and which may end with an unexpected (for the listener) respiratory pause of
considerable duration.
An example of an out-of-breath speaker phonation cycle, when the speaker
needs inspiratory pauses that are longer and more frequent than usual, is given
in Figure 1.2.
It is clear, then, that in order to master her/his speaking activity, the speaker
must constantly control her/his physiological state, which is itself conditioned
by an emotional state, in order to control the air volume drawn into the lungs
and to maintain a sufficient subglottal pressure during expiration. The larger the
pulmonic air flow, the shorter the phonation time, as, for example, during
phonation with high acoustic intensity, a higher than usual laryngeal frequency,
or a large amount of melodic variation. Conversely, a low flow of
pulmonic air will result in low-intensity speech and a lower laryngeal
frequency with reduced melodic variation.
Figure 1.2 An example of an out-of-breath speaker (NS), that is, when a
speaker needs inspiratory pauses that are longer and more frequent than usual:
Mesdames et Messieurs # je vous demande de bien vouloir excuser mon retard
# qui est dû # à la longueur du dialogue que je viens d’avoir avec Monsieur
Poutine [NS, 2007] (“Ladies and Gentlemen # I ask you to excuse my lateness #
which is due # to the long dialog I had with Mr. Putin”).
Pulmonic air expiration requires control of the vocal folds tension (the
word tension stands here for the complex muscular mechanisms controlling the
positioning and the elongation of the vocal folds) in order to compensate for
the diminution of the lung air volume during expiration. As this air volume
diminishes, the subglottal pressure mechanically diminishes as well, since an
adequate and complete lung compression is impossible to achieve. This drop
must be compensated for by the speaker, according to his or her expectation of
the number of syllables to be pronounced in the planned breath group (the
interval between two consecutive inspiration phases). This mechanism is not
always completely compensated for by the muscles controlling the expiration,
which in turn may partly explain the origin of the frequently observed declina-
tion line of the laryngeal frequency, i.e. the tendency of the laryngeal frequency
to be higher at the beginning than at the end of the expiration phase.
The source-filter model of phonation

The speech production mechanism is constrained by the speaker’s specific
physiological and emotional state and, at the same time, by phonological,
syntactic, and semantic constraints on language functions. A particularly
simple model frequently used in speech processing represents the phonation
mechanism by two separate processes: a sound source, and a filter that shapes
the spectral characteristics of the source (Fig. 1.3).
Figure 1.3 Source-filter model of phonation: the source (vocal folds) feeds the
vocal tract filter, which outputs speech.
The characteristics of the speech source are deemed to represent not only the
vocal folds vibrations for voiced sounds (vowels and voiced fricatives such as
[v], [z], [ʒ]), but also the friction noise used to produce consonants such as
[f], [s], [ʃ] (despite the fact that the noise source is not actually localized
in the glottis). The filter in this source-filter model, which represents the
shaping action of the vocal tract on the sound spectrum produced by the source,
possesses characteristics that allow it to shape the amplitude of the harmonics
of voiced sounds, on the one hand, and of the noise regions of fricatives (both
voiced and unvoiced), on the other. Stop consonants such as [p], [t], [k] are
simply not taken into account in this model, although voiced stops [b], [d], [g]
are partially represented by their voiced character. This omission is partially
justified, however, as the perception of stop consonants is based essentially on
the spectral transitions on the following vowel, if any (spectral loci).
Figure 1.4 Interactions in the source-filter model between phonation and
emotions: the emotional state acts on the laryngeal frequency (source), on
articulation and vowel quality (vocal tract filter), and on the source-filter
interaction.
If the speaker’s emotional state has to be taken into account in a model, it is
essential to consider the interactions necessarily existing between the source
and the filter (Fig. 1.4). Indeed, the emotional state has an effect on the
physiological mechanism of phonation, as it affects the respiration cycle, the
volume of air inspired (resulting in the speech rate), the subglottal pressure, and
the tension of the vocal folds, which, in turn, determines the laryngeal fre-
quency and the voice pitch. The position of the articulators is also modified,
conditioning vowel quality. This emotional state affects the muscular tension
responsible for the positioning of the articulators, which are modeled by the
filter. It also produces secondary effects on the source characteristics (for
example, on the control of the laryngeal frequency and the position of the
glottis in the vocal tract).
Emotions
One can easily say that there are as many categories of emotions as there are
authors dealing with the subject. For example, in what may appear as a
continuum, Ekman (1999) distinguishes the following basic categories: Joy,
Sadness, Disgust, Fear, Anger, and Surprise, with secondary emotions resulting
from a mixture of these basic emotions. Shame, for example, can be considered
as a mixed emotion, combining fear and anger directed at oneself. Ekman’s
categories of emotion, like those of many other systems, obviously pertain to
the terminology of emotions, often influenced or even determined by categories
existing in the language of the researchers (cf. color terminology or snow quality
in Inuktitut, etc.).
Physiological constraints linked to various emotions have often been studied
(e.g. Sauleau 2010). Factors prone to influence phonation are salivation, muscular
tension, perspiration, or more globally, blood pressure and cardiac frequency.
The physiological parameters controlling phonation affected by emotions
are essentially as follows:
a. Energy, which acts on voicing and vowel quality;
b. Tension of the vocal folds, which determines the melodic height as well as
vowel quality;
c. Articulation, another factor affecting vowel quality;
d. Speech rate, responsible for the tense or lax mode of articulation;
e. The degree of voicing, characterizing the noise/source ratio (breathy voice);
f. Breath insertion, as an index of irritation, pleasure, fear (iconic value);
g. Uncontrolled muscular movements (shivering) acting on the laryngeal
frequency as well as on vowel quality;
h. Regulation of the respiration cycle, which determines the position and the
length of pauses.
Dominance of an emotional state occurs when linguistic rules and con-
straints are not fulfilled in the realization of vowels, consonants, and the
prosodic structure.
However, emotion affects the whole phonation process (laryngeal source and
vocal tract), as well as all of the syllables, whereas the dialectal or idiosyncratic
variations pertain essentially to stressed (prominent) syllables. Extreme cases of
this process are shown in Figure 1.5. The borderline cases correspond, at one end
of the scale, to “hot” anger and extreme stress, for which emotion disturbs all or
some aspects of the phonological realizations of prosodic markers, and at the
other end of the scale, to synthetic speech based on diphones, totally deprived of
emotional content (contrary to corpus-based speech synthesis, which necessarily
presents traces of the emotions of the speakers involved in building the corpus).
Figure 1.5 Extreme cases of the emotion–phonology relationship: emotion
dominates phonology (extreme stress or anger), and phonology dominates
emotions (diphone speech synthesis). In the case of prosody, emotions influence
acoustical parameters such as fundamental frequency (F0), intensity, and rhythm.
[Diagram: emotions cause physiological perturbations, which perturb the F0,
intensity, and rhythm of the prosodic structure. The degree of phonation control
ranges from low (emotions dominate phonology: extreme stress or anger), through
the coexistence of phonology and emotions, to high (phonology dominates
emotions: synthetic voice based on diphones).]
Verbal communication by speech always includes an emotional component,
and can be placed between these two extreme cases, with emotion affecting the
realization of speech units at various levels. The realization of melodic
contours can also vary
according to the socio-geographical origin of the speakers as well as their
idiosyncratic characteristics, but the speaker’s emotional state can affect the
prosodic structure in various ways, as in the following:
a. Interruption of a stress group (a sequence of syllables with only one stressed
syllable) due to a perturbed control of the respiration cycle, in particular
during the expiration phase (e.g. in the case of extreme fear or anger). Stress
groups can then become difficult or impossible to identify by the listener,
particularly in the case of erroneous syntactic grouping.
b. Sequences of incomplete melodic contours (melodic variations on stressed
syllables), without final conclusive contour (abandons).
c. Insufficient acoustic contrasts in the realization of melodic contours,
preventing a correct identification by the listener and resulting, for example,
in a “flat” prosodic structure (with only one level, leading to an enumeration
structure; see Chapter 5).
In conclusion, the variation in the realizations of linguistic units, and in
particular prosodic ones, due to emotional variations is purely phonetic and does
not affect the linguistic functions ensured by these units, except in extreme
cases (Martin, 2014a).
Voiced and unvoiced speech sounds

As mentioned earlier, speech sounds are produced by a variety of mechanisms
that involve various sound sources: (1) the vibration of the vocal folds for
vowels and “voiced” consonants (except, of course, in whispered speech); (2) a
narrowing of the vocal tract forcing expiratory air to enter a turbulent regime
(the “fricative” consonants such as [f], [s], or [ʃ]); (3) a micro-explosion
caused by the sudden release of a closure of the vocal tract, behind which the
pressure has built up (stop consonants such as [p], [t], or [k]); and (4) a
micro-implosion caused by the sudden release of a closure in the vocal tract,
behind which a depression has been produced by enlarging the enclosed cavity
(clicks). These modes can be combined (with the exception of the implosion).
When the vocal folds vibrate, the sound is called “voiced”; otherwise, it is
called “unvoiced.”
Vowels, produced by vocal fold vibration, are normally voiced (but may be
devoiced in some circumstances, or whispered). Among consonants, however,
only those generated with vibration of the vocal folds, such as [b], [d], and [g]
for stop consonants, [v], [z], and [ʒ] for fricatives, and [m], [n], [ɲ], and [ŋ]
for nasals, are voiced.
Laryngeal frequency
The successive cycles of slow opening and rapid closing of the vocal folds
produce harmonics whose frequencies are integer multiples of the frequency of
vibration of the vocal folds, called laryngeal frequency. Strictly speaking, a
frequency can thus be associated with any segment of a voiced speech sound,
assuming that the frequency value remains constant, which is, of course, never
the case. In fact, the term frequency is somewhat misleading here: strictly, it
corresponds to the inverse of the duration of one cycle of vibration of the vocal
folds, often called the laryngeal period, whereas laryngeal vibration is a
quasi-periodic phenomenon rather than a strictly periodic one.
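The relation between laryngeal period, laryngeal frequency, and harmonics can be illustrated numerically. The short Python sketch below is only an illustration: the function names are hypothetical and the period values are invented for the example, not measured data.

```python
def laryngeal_frequency(period_s):
    """Frequency in Hz, taken as the inverse of one laryngeal period in seconds."""
    return 1.0 / period_s


def harmonic_frequencies(f_laryngeal_hz, count):
    """First `count` harmonics: integer multiples of the laryngeal frequency."""
    return [k * f_laryngeal_hz for k in range(1, count + 1)]


# Hypothetical successive periods of a quasi-periodic voiced segment (seconds).
periods = [0.0080, 0.0081, 0.0079, 0.0080]

# Each cycle yields a slightly different "instantaneous" frequency.
per_cycle = [laryngeal_frequency(p) for p in periods]
print([round(f, 1) for f in per_cycle])

# A single value for the segment assumes a constant period; here, the mean.
mean_period = sum(periods) / len(periods)
print(harmonic_frequencies(round(laryngeal_frequency(mean_period)), 3))
```

The per-cycle values fluctuate around 125 Hz, which is why a single frequency attached to a voiced segment is always an approximation.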
Fundamental frequency and melodic curve

The term fundamental frequency in speech is related to the measure of the
laryngeal frequency derived from Fourier spectral analysis. This type of ana-
lysis decomposes successive small segments of the speech signal (seen through
a “window” time of several tens of milliseconds) into their harmonic compo-
nents, by using the Fourier theorem for harmonic analysis. These components
have frequencies which are integer multiples of the inverse of the time window
duration. A typical value of 30 ms window duration would, for example, give
Fourier harmonic frequencies of 1/0.03 = 33.3 Hz, 66.7 Hz, 100 Hz, etc. The
longer the time window, the finer the frequency analysis resolution, at the
expense of a lower time resolution due to the use of longer windows. A 1
second window would give an excellent frequency resolution of 1 Hz, but
would be unsuitable for speech as many events may occur in a single second of
speech. The commonly retained value of 30 ms results in a sufficient frequency
resolution to “capture” the fundamental frequency of voiced speech sounds by
interpolation. This value compares with the number of frames per second
commonly used in movies to capture body movements (typically 24, 25, or
30 frames per second).
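The trade-off between window duration and frequency resolution is simple arithmetic, sketched below in Python. The function names are hypothetical, and the window durations other than 30 ms and 1 s are chosen only for comparison.

```python
def fourier_bin_spacing(window_s):
    """Spacing in Hz between Fourier harmonics for a window of `window_s` seconds."""
    return 1.0 / window_s


def fourier_harmonics(window_s, count):
    """First `count` Fourier harmonic frequencies: multiples of 1 / window."""
    spacing = fourier_bin_spacing(window_s)
    return [k * spacing for k in range(1, count + 1)]


# Longer windows give a finer frequency resolution but a coarser time resolution.
for window_ms in (10, 30, 100, 1000):
    window_s = window_ms / 1000.0
    print(window_ms, "ms window ->", round(fourier_bin_spacing(window_s), 1), "Hz")

# Fourier harmonics for the 30 ms window discussed in the text.
print([round(f, 1) for f in fourier_harmonics(0.030, 3)])
```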
The speech fundamental frequency, F0 (denoted F “zero”), not to be con-
fused with the Fourier fundamental frequency, corresponds to the first harmo-
nic component found in the Fourier analysis of the signal, but also, by
definition, corresponds to the frequency difference between two consecutive
speech harmonics. As this analysis needs a rather long time window to be
effective, the actual value of the laryngeal period may fluctuate during the time
window needed for the analysis. By moving the analysis time window along the
time axis, values for each successive position of the time window are obtained.
Plotted with the fundamental frequency on the ordinate (vertical axis) and
time on the abscissa (horizontal axis), these values form a melodic curve
(Fig. 1.6).
The melodic curve has a very irregular shape, with numerous ups and downs in
frequency, and is also interrupted in some places. These interruptions
correspond to the absence of a fundamental frequency value, due in turn to the
absence of voicing (unvoiced speech sounds or silence), at least if the measure
is reliable, which is not always the case in adverse recording conditions (e.g.
low signal-to-noise ratio). We observe, for example, a rather large interruption
of the melodic curve in Figure 1.6 corresponding to a silent pause (between the
French words veau and faut) and another corresponding to the location of the
voiceless consonant [k] in the word que, at a time equal to 2 seconds.
Figure 1.6 An example of melodic curve, interrupted at segments without
voicing (including pauses and silence), with the fundamental frequency (top),
intensity (middle), and wave (bottom) curves, for the French utterances
[1] agneau ou veau and [2] faut que le beau rôti soit chaud. Vertical axis:
fundamental frequency, 0 to 300 Hz; horizontal axis: time, 0 to 3 s.
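The construction of a melodic curve by moving an analysis window along the time axis, with interruptions where there is no voicing, can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis method used in this book: it swaps the Fourier harmonic analysis described above for a simple autocorrelation estimate, the signal is synthetic, and every name, threshold, and parameter value (sampling rate, 75 to 400 Hz search range, energy floor) is an assumption made for the example.

```python
import math


def synth_voiced(f0_hz, dur_s, sr):
    """Synthetic voiced segment: a fundamental plus two weaker harmonics."""
    n = int(round(dur_s * sr))
    return [sum(amp * math.sin(2 * math.pi * k * f0_hz * i / sr)
                for k, amp in ((1, 1.0), (2, 0.5), (3, 0.25)))
            for i in range(n)]


def frame_f0(frame, sr, f_min=75.0, f_max=400.0, energy_floor=1e-4):
    """Estimate F0 of one frame by autocorrelation; None if the frame is silent."""
    energy = sum(x * x for x in frame) / len(frame)
    if energy < energy_floor:
        return None  # no voicing detected: the curve is interrupted here
    best_lag, best_val = None, 0.0
    # Search the lag (candidate period in samples) with the highest correlation.
    for lag in range(int(sr / f_max), int(sr / f_min) + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sr / best_lag if best_lag else None


def melodic_curve(signal, sr, window_s=0.030, hop_s=0.010):
    """Slide the analysis window along the time axis; return (time, F0) pairs."""
    win, hop = int(round(window_s * sr)), int(round(hop_s * sr))
    return [(start / sr, frame_f0(signal[start:start + win], sr))
            for start in range(0, len(signal) - win + 1, hop)]


sr = 8000  # sampling rate in Hz (assumed for the example)
signal = (synth_voiced(200.0, 0.3, sr)      # voiced stretch at 200 Hz
          + [0.0] * int(0.1 * sr)           # silent pause
          + synth_voiced(150.0, 0.3, sr))   # voiced stretch at 150 Hz

for t, f0 in melodic_curve(signal, sr)[::5]:
    print(f"{t:.2f} s", "unvoiced" if f0 is None else f"{f0:.0f} Hz")
```

Frames falling entirely within the pause return no value, which corresponds to the interruptions of the melodic curve described above. Real recordings would require a more robust voicing decision than the bare energy threshold assumed here.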
Intensity
Besides the melodic curve resulting from the successive values of F0 plotted
along the time axis, it is also customary to display an intensity curve by
measuring the intensity or the amplitude of each speech segment inside a
time window. The unit of measurement is usually the decibel (dB), a logarith-
mic value relative to some arbitrary reference defined in the instruments or
within the software used. The most commonly used value for the measurement
is relative to the global intensity detected within the time window used in
Fourier analysis.
A noteworthy value is the increase in decibels that results from doubling:
doubling the intensity (power) of a pure tone corresponds to 10 log (2) = 3 dB
(more precisely 3.0102999. . . dB), and doubling its amplitude to
20 log (2) = 6 dB, where log denotes the base-10 logarithm. Conversely, halving
the intensity causes a drop of 3 dB, and halving the amplitude a drop of 6 dB.
The multiplication of the amplitude by a factor of 10 corresponds to an increase
of 20 log (10) = 20 dB, by a factor of 100 to an increase of 40 dB, etc.
The dB is always a relative unit. When the threshold of hearing is used as the
reference, the values are called absolute decibels (dB SPL in English notation,
where SPL stands for Sound Pressure Level, the threshold itself being 0 dB SPL);
they are called relative decibels when the reference differs from this
threshold. Absolute dB are thus dB relative to the threshold of audibility at
1000 Hz.
Since it is sufficient to increase or decrease the amplitude level of sound
reproduction equipment in order to shift the intensity curve up or down,
only relative intensity measures make sense, for example by comparing the
values in dB of two consecutive vowels. Also it is not legitimate to average
intensity values in dB directly, since this unit is logarithmic. Averages should
be computed from the amplitude values. The level I in dB of a sound of amplitude
a relative to a reference amplitude Aref is I = 20 log (a / Aref), so that the
amplitude is recovered from a dB value as a = Aref × 10^(I/20).
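The decibel arithmetic above, and the reason why dB values should not be averaged directly, can be checked with a short Python sketch. The function names and the two vowel levels are hypothetical and serve only as an illustration.

```python
import math


def amplitude_to_db(a, a_ref=1.0):
    """Level in dB of amplitude `a` relative to `a_ref`: I = 20 log10(a / a_ref)."""
    return 20.0 * math.log10(a / a_ref)


def db_to_amplitude(level_db, a_ref=1.0):
    """Inverse relation: a = a_ref * 10 ** (I / 20)."""
    return a_ref * 10.0 ** (level_db / 20.0)


def mean_level_db(levels_db, a_ref=1.0):
    """Average levels by way of amplitudes, not by averaging the dB values."""
    amplitudes = [db_to_amplitude(level, a_ref) for level in levels_db]
    return amplitude_to_db(sum(amplitudes) / len(amplitudes), a_ref)


print(round(amplitude_to_db(2.0), 2))    # doubling the amplitude
print(round(amplitude_to_db(10.0), 2))   # amplitude multiplied by 10

vowels_db = [60.0, 80.0]                 # hypothetical levels of two vowels
print(round(sum(vowels_db) / len(vowels_db), 2))  # naive mean of the dB values
print(round(mean_level_db(vowels_db), 2))         # mean taken on the amplitudes
```

For these two hypothetical vowels the naive mean of the dB values is 70 dB, whereas the mean computed on amplitudes is close to 74.8 dB, since the louder vowel dominates the average.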
Spectrographic analysis
The spectrogram is a three-dimensional graphical representation (time on the
abscissa, frequency on the ordinate, and amplitude coded by colors or levels of
gray) of the Fourier analysis of successive windows of a previously recorded
speech signal. Depending on the length of the time window used, it can display
harmonics (a setting called narrowband) or high concentrations of harmonic
amplitude (a setting called broadband) for voiced sounds. In cases of poor
recording quality, a melodic curve superimposed on a narrowband spectrogram
allows one to validate (or not) the reliability of the analysis, by comparing
visually the evolution of the melodic curve with the harmonics: both must
globally match, as shown in Figure 1.7.
Figure 1.7 Narrowband spectrogram for visualizing harmonics corresponding
to the fundamental frequency curve (same utterances as in Figure 1.6; frequency
axis 0 to 4500 Hz, time axis 0 to 3 s).
It should be noted that the same time window duration is not necessarily
appropriate for all voices in order to obtain a narrowband spectrogram.
Harmonics will be well separated if the duration of the window is sufficient,
and thus different values may be suitable, e.g. 32 ms for a male voice at
130 Hz and 16 ms for a female voice at 250 Hz.
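One way to see why these two window durations suit these two voices is to count how many laryngeal periods each window spans, which equals the ratio of the fundamental frequency to the Fourier harmonic spacing (1 / window duration). The Python sketch below uses a hypothetical function name. That roughly four periods per window are needed is only an inference from the two values quoted above, not a rule stated in the text.

```python
def periods_in_window(f0_hz, window_s):
    """Number of laryngeal periods spanned by the analysis window.

    Equivalently, F0 divided by the Fourier harmonic spacing 1 / window_s.
    """
    return f0_hz * window_s


for label, f0_hz, window_ms in (("male voice", 130.0, 32),
                                ("female voice", 250.0, 16),
                                ("male voice, short window", 130.0, 16)):
    n = periods_in_window(f0_hz, window_ms / 1000.0)
    print(f"{label}: {f0_hz:.0f} Hz, {window_ms} ms -> {n:.2f} periods")
```

Both recommended settings cover about four periods, whereas the 16 ms window applied to the 130 Hz voice covers only about two, so its harmonics would be less clearly separated.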
Syllabic duration
Syllable duration is also a prosodic parameter. Its unit of measure is the
millisecond (ms). Instrumental measurement may seem trivial, but in practice it
is complex to carry out, whether manually or automatically. Indeed, determining
syllabic segment boundaries, even for an expert versed in the joys of visual
inspection of spectrograms, is far from simple and cannot be automated easily.
The main reason is that the problem is ill posed, since consonants and vowels
result from continuous changes of the speaker’s articulatory configuration, as
is the case when we walk, where the moments at which gestures begin and end are
not precisely defined. Likewise, the starting and ending instants of a speech
event should be evaluated from the time they are perceived and the time they
cease to be perceived, and these moments are not necessarily identical for a
listener and for an acoustic speech analyzer.
Indeed, syllabic boundaries result from phonological definitions, while their

material existence spans a range of time and not a specific instant (except
partially for stop consonants). The detailed examination of spectrograms does
not always offer much help, and operators are sometimes fooled by a particu-
larly high amplitude sound level that influences the representation of harmo-
nics. Indeed, a higher playback level may result in a higher intensity dynamic
on the spectrogram, leading one to consider longer duration for segments, as
speech harmonics appear larger due to a change in the minimum grey level,
which would not be visible with a different setup. This effect is similar to a
change in contrast level in black and white photography.
However, algorithms for automatic segmentation of vowels and consonants
do exist. Not surprisingly, their reliability depends heavily on the speech
recording quality (i.e. on a high signal-to-noise ratio). EasyAlign developed
by J.-Ph. Goldman (2011) and IrcamAlign by Lanchantin et al. (2008) belong
to this category. However, an algorithm recently implemented in WinPitch
appears very efficient and quite insensitive to recording quality as it mimics a
human operator for segmentation (see Chapter 11).
The software program WinPitch used in this book displays graphically
the evolution of syllabic and inter-syllabic duration in the form of staircases
(Fig. 1.8) or a Bézier curve (Fig. 1.9).
Syntax and prosody

In the early 1980s, when phonologists started to be interested in prosody, the
doxa (i.e. the dominant way of thinking) considered that units of prosody
(usually stress groups) are organized hierarchically in a prosodic
structure derived from the syntactic structure, the center of everything in
linguistics at the time (Di Cristo & Rossi, 1977; Selkirk, 1978, 1986; Di Cristo,
1998; Rossi, 1999; Fox, 2000; Mertens, 1993, 2008; Bocci, 2013; Di Cristo,
2013, and so on). AM phonology, aimed first at describing the
tonal systems of African languages (Goldsmith, 1976), then came onto the scene,
together with the ToBI (Tone and Break Indices) notation system for tonal
targets (Beckman & Ayers Elam, 1997). This, together with the concept that
sentence intonation could be described as strings of well-formed tonal targets,
considerably obscured research in the field, at least in my view. Data, however
scarce and resulting from sometimes unreliable pitch analysis of coined
laboratory sentences, had to fit into the theoretical grid imposed by the community.
Figure 1.8 Staircase duration curves showing the evolution of syllabic
duration. The vertical scale on the left indicates the syllabic duration in ms.
Figure 1.9 Bézier duration curves showing the evolution of syllabic
duration. The vertical scale on the left indicates the syllabic duration in ms.
Perhaps one of the most disappointing aspects of the AM approach pertains
to the idea, inspired by syntactic theory, that strings of prosodic events could
not have variations, i.e. that one set of tonal targets should be deemed correct
for a given sentence without considering other possibilities. The common use
of very short sentences was equally misleading for the interpretation and
comprehension of sentence intonation. Furthermore, what was lacking was an
explanatory principle specific to sentence prosody, as the one usually proposed
was strongly linked to syntax. Sentence intonation, with its prosodic structure,
was frequently viewed as a crutch helping a locally deficient syntax (in
“ambiguous” sentences, Lehiste, 1979), or, at most, as a cherry on the syntactic
cake. The independence of the prosodic structure from syntax, proclaimed
around 2005 (though already discussed in detail in Rossi et al., 1981), was
difficult for some researchers to accept, and even more so was the more recent
idea that this prosodic structuration would operate before the syntactic
structure. Indeed, this latter view implies that syntax depends, at least
partially, on intonation, rather than intonation depending on syntax.
I strongly felt and still do feel that the dominant AM
model did not give a proper account of the function of the prosodic structure.
A further problem is linked to the use of the
ToBI notation, which appeared more and more as a convenient ad hoc system to
force data to fit into the model. Indeed, whereas simple use of properly aligned
high H* and low L* tonal targets may be satisfactory, transcriptions frequently
appear to have been devised without any convincing link with the data, especially if
these data are illustrated by fundamental frequency curves without any detailed
frequency scale, as was often the case at that time.
The prosodic structure: the structure of spoken language

As mentioned before, what may really be missing in the AM model is a genuine
explanatory principle. Once the prosodic structure is defined and its relative
independence from syntax established, one may legitimately wonder about its
role in the linguistic system. Again focusing on syntax, most studies investigate
the alignment of prosodic groups (i.e. the AM intermediate phrase, ip, and
Intonation Phrase, IP) on syntactic units, leading to the (old) idea that prosodic
structure exists to facilitate the decoding of syntax by listeners. Since most if
not all of these studies were conducted on read text in the framework of laboratory
phonology, few prosodists took notice of the most important and
obvious feature of oral production: it occurs dynamically in time. Indeed,
speech production and perception operate object after object, prosodic event
by prosodic event, with little or no persistence of the already pronounced and
perceived prosodic entities (the same applies, of course, to all linguistic objects
such as syllables, words, or syntagms). By contrast, the “Written Language Bias”
approach assumes that all analysis of prosodic facts must be elaborated from a written
transcription of syllables and words, and a representation on paper of prosodic
data (i.e. fundamental frequency, intensity, and duration curves).
Stressed syllables
Many observers and analysts of the voice, not to mention professional linguists,
noticed a long time ago that in the flow of syllables some were stronger than
others. Some fine ears, i.e. amateur or professional musicians, even noticed
that these “strong” syllables were not necessarily all stronger in the same way
and were differentiated by musical features such as duration and, interestingly,
melody (linked to the variation of laryngeal frequency). Indeed, modern
acoustical analysis revealed that stressed syllables (i.e. strong syllables) do bear some
melodic changes, but this is also the case for the other syllables of the
sentence, whether or not they are perceived as stronger. From these common
observations emerged many descriptions of the phenomena, either purely descriptive
and phonetic, or phonological in their attempt to capture regularities in their
realizations by speakers, focusing on stressed syllable prosodic events.
Intonation and syntax

Obviously, the term prosodic structure derives in part from the most popular
object in contemporary linguistic research, i.e. the syntactic structure. The idea
of extending the concept to prosody and specifically to prosodic events in the
sentence may be traced back to at least 1975, in a paper called “Analyse
phonologique de la phrase française” (Martin, 1975). Actually, the kind of
structure defined in this paper was called a hierarchy, as I was very attentive at
the time to the conceptual differences between hierarchy and structure. (As a
reminder, a hierarchy, as its name implies, refers to a hierarchical grouping of
objects, whereas a structure is a hierarchy where categories of groups or
relations between groups are specified.)
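The distinction can be made concrete with a small data-structure sketch. It is only an illustration under stated assumptions: the class name Group, the function is_structure, and the labels "X", "Y", "Z" are hypothetical, and only categories of groups (not relations between groups) are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Group:
    """A grouping of items; an item is a syllable label or another Group."""
    children: List[Union["Group", str]] = field(default_factory=list)
    category: Optional[str] = None  # None means the group is left unlabeled


def is_structure(node: Union[Group, str]) -> bool:
    """Return True when every group in the tree carries a category label."""
    if isinstance(node, str):
        return True
    return node.category is not None and all(is_structure(c) for c in node.children)


# A bare hierarchy: items are grouped, but the groups are not categorized.
hierarchy = Group([Group(["a", "b"]), Group(["c", "d", "e"])])

# A structure: the same grouping, with hypothetical category labels added.
structure = Group(
    [Group(["a", "b"], category="X"), Group(["c", "d", "e"], category="Y")],
    category="Z",
)

print(is_structure(hierarchy), is_structure(structure))
```

Both objects encode the same grouping; only the second one specifies what kind of group each node is.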
Most of the papers on the subject, published some time later, considered
explicitly or implicitly the prosodic structure as derived from the syntactic
structure, and proposed rules to derive one from the other. The debate about
congruence or non-congruence with syntax started to emerge around the 1980s
(Rossi et al., 1981). Later, prosodic and syntactic structures were analyzed
independently of each other (Martin, 2009) but were still considered as operat-
ing in parallel in speech.
Today, as the prosodic structuration appears to precede the syntactic and
morphological ones in both the generation and the perception processes
(Blanche-Benveniste, 2000; Blanche-Benveniste & Martin, 2011; Martin,
2012b), the flow of structuration in sentence production may be reversed. At
least for the listener, and probably for the speaker, the prosodic structure
delivers the first perceived hierarchical organization of speech segments
divided into temporal groups at the brain processing level well before syntax.
This organization constitutes the base of the structural analysis of speech
prosodic units (i.e. stressed groups) performed by the listener, and local
occurrences of non-congruence between prosody and syntax are handled sepa-
rately in a following processing step through a recalculation of the dependency
relations existing between syntactic and morphological units.
Brain waves and prosody

Observation of some non-obvious characteristics of prosodic structures,
carried out on spontaneous speech and not limited to (read) laboratory speech, led to
the discovery of a set of constraints, some of which had been known (of course
under a different form) at least since the sixteenth century (Meigret, 1550).
Although these constraints had been published many years before, studies
conducted in the AM framework had ignored them, as the then “available theory
could not accommodate them,” as is the case with the contrast of melodic slope
(Jun, 2012, personal communication).
These constraints were discovered and evaluated empirically from acoustic
analysis of numerous recordings, but an explanation based on the interpretation
of electroencephalographic (EEG) data was recently found (see Chapter 7).
This discovery led to a very new concept of prosodic structure, one that abandons the
usual representation on a piece of paper, which ignores the dynamic aspect of the time
axis (the time parameter in most papers on linguistic events is translated into
“to the left,” or “to the right” instead of “before” or “after” to characterize their
position in time). Our perspective changes radically if we reintroduce
the time parameter when considering the generation and perception
of linguistic objects, and particularly prosodic events and their structuration.
A Copernican change
Instead of starting from well-described syntactic facts and properties, I decided
to go the other way around. Martin (1987) and Avanzi (2012), for example,
showed that (1) there is in general more than one single prosodic structure that
can be attached to a given syntactic structure, and (2) it is not reasonable to
assume that a unique prosodic structure can be predicted from textual facts (see
also Bolinger’s famous (1972) paper, “Accent is Predictable (if you’re a Mind-
Reader)”). It seems therefore quite vain to continue piling up rules that would
give better predictions than the ones already available. The key point about this
change of view, a kind of Copernican revolution, is not to consider that the
prosodic structure results from the syntactic structure and some semantic
structure or properties, but instead to view the prosodic structure as the first
hierarchical organization produced by the speaker, in which text and syntax
will be accommodated more or less felicitously in a subsequent or concurrent step in
the speech production process. This view also involves the a priori indepen-
dence of the prosodic structure from syntax.
Many arguments can validate this view, including the following:
1. In general, more than one prosodic structure can be associated with a given
syntactic structure.
2. When reading, speakers try to recover the prosodic structure implicitly or
explicitly designed by the writer (who has a limited number of orthographic
symbols – the punctuation marks – to give indications to the reader).
3. Far from being an accessory (functional and generative syntax treat it
as belonging to phonetics, and as relevant only to performance rather
than competence), the prosodic structure is indispensable to the listener and
the reader alike to process the linguistic information. Even in silent reading,
the reader has to regenerate some prosodic structure from the read text, the
prosodic events (for example melodic contours on stressed syllables) being
heard silently in the reader’s mind. The prosodic structure is therefore
central in the linguistic comprehension process.
The alternative Incremental Storage-Concatenation model (the subject of this
book) proposes a plausible explanation of how the prosodic structure
determines the processing of the linguistic information by the listener. The role of
the prosodic structure appears clearly in this model, highlighting the dynamic
aspects of the process along the time axis, and detailing how prosodic units (the
stress groups) are assembled in hierarchical levels as a function of time.
From laboratory to spontaneous speech

The advance of technology has been and still is of paramount importance in the
development of prosodic analysis, in particular pertaining to the reliability of
pitch curves (representing the evolution of laryngeal frequency) and the overall
duration of speech recordings analyzed. For instance, one of the main reasons
why AM prosodic analysis is currently limited to laboratory speech
and to very short sentences is the near-exclusive use of the speech analysis
software Praat (needed to get published and read). Although an excellent general
purpose speech analysis tool, Praat is not specifically designed for prosodic
analysis of long sentences (not to mention the relatively low reliability of its
pitch curves for degraded recordings). As a result, the idea that speech prosodic
analysis must be conducted in a soundproof room is still very strong today.
Reading and listening

The key difference between reading and listening pertains to the handling of
time. While reading, the reader moves her/his eye fixation point to various
parts of the written sentence, such as punctuation marks and verbal group
boundaries (Martin, 2011). The reader can therefore “accommodate” the time
dimension as she/he reconstructs the prosodic structure without which the
sentence could not be processed mentally (in the case of silent reading) or
orally. The reader is thus able to anticipate the necessary prosodic events such
as rises and falls of the (mental) melodic curve through syntactic backtracking.
The following example, borrowed from Delais-Roussarie (2000), illustrates
this. The sentences Sans attendre les enfants sont partis (“Without waiting, the
children left”) and Sans attendre les enfants ils sont partis (“Without waiting
for the children, they left”) require different phrasing. Obviously, a correct
phrasing of these examples requires preplanning from the speaker, whereas the
reader has to localize pertinent syntactic boundaries from the presence of the
pronoun ils in the second case, and its absence in the first sentence. Either way,
oralization and silent reading require correct phrasing: [Sans attendre] [les
enfants sont partis] (“Without waiting, the children left”), [Sans attendre les
enfants] [ils sont partis] (“Without waiting for the children, they left”), not to
mention the phrasing [Sans attendre] [les enfants] [ils sont partis], which is
also well-formed.
The listener, on the other hand, has limited possibilities to anticipate proso-
dic events to be realized in the near future by the speaker, although, depending
on the prosodic grammar used in the language, some indications may exist. In
French, for example, the mechanism of melodic slope contrast (see Chapter 7)
allows the listener to anticipate the occurrence of the final conclusive contour
as boundary tones of the final prosodic group may present a reversed rising
slope.
This dissimilarity in processing the linguistic information plays an important
role in explaining the differences of realization in pitch, rhythm, and intensity
observed in laboratory (read) and spontaneous speech analysis. However, many
papers are based on reading and written texts, with the deeply held conviction
that the prosodic structure derives from syntactic/semantic/pragmatic condi-
tions observable from a written text as in recent contributions (e.g. Gachet &
Avanzi, 2008, or Delais-Roussarie, 2009 and Delais-Roussarie et al., 2015) on
parenthesis and coordination in read and spontaneous sentences.
Romance languages
No fewer than thirty-two distinct Romance dialects derived from Latin are
currently identified (given here under their French names, in alphabetical
order): Aragonais, Aroumain, Asturien, Bergamasque
(Lombard de l’Est), Bourbonnais, Bourguignon-Morvandiau, Catalan, Corse,
Espagnol, Estrémègne, Français, Franc-Comtois, Francoprovençal valaisan,
Frioulan, Gallo, Galicien, Italien, Léonais, Milanais (Lombard de l’Ouest),
Mirandais, Napolitain, Normand, Occitan, Piémontais, Portugais, Romanche,
Romanesco, Roumain, Sarde, Sicilien, Vénitien, Wallon (Fig. 1.10).
Among those, a dialect becomes a language when a strong
political power adopts it as its official language (Posner, 1996). If we
retain only state languages, just French, Italian, Spanish, Portuguese, and
Romanian remain. But then we miss Quebec French, spoken in Quebec, a part
of Canada, and Catalan, spoken in Catalonia, a part of Spain. Another criterion
could be the production of literary work. In this case, most dialects should be
added to the list.
The most widely spoken Romance languages are Spanish, French,
Portuguese, Italian, and Romanian (97% of speakers). The Romance language
with the largest number of speakers is Spanish (spoken in about thirty-six
countries by about 329 million speakers). French is next, spoken by 250 million
speakers in twenty-five countries, followed closely by Portuguese (240 million
speakers in eight countries), Italian (62 million in ten countries), Romanian
(spoken by 26 million speakers in six countries), and Catalan (12 million
speakers in four countries).
Figure 1.10 Map: Romance dialects (www.romaniaminor.net/mapes/romania.swf)
In this book, the choice is limited to six languages: French, Italian, Spanish,
Catalan, Portuguese, and Romanian (in order of my own competence). French
is a national language in France, Belgium, Switzerland, Quebec, and many
African countries. Italian is spoken in Italy, Portuguese in Portugal, Brazil, and
many African countries such as Mozambique, Angola, Cape Verde, Guinea-
Bissau, and São Tomé and Príncipe. Spanish is a national language in Spain and
many other countries (Equatorial Guinea, Mexico, Cuba, all the South American
countries except Brazil and the three Guianas). Catalan is spoken in Catalonia
and Romanian in Romania and Moldavia.
Of course, many other varieties could have been considered, but as I will
attempt to demonstrate later in this book, few prosodic phonological differ-
ences are to be expected, as those varieties differ essentially by phonetic
characteristics, pertaining to the details of realization of prosodic markers.
Research projects like AMPER (Atlas Multimédia Prosodique de l’Espace
Roman; Contini et al., 2002) aim to obtain very detailed descriptions of subtle
prosodic differences in the realizations of melodic contours, and hope to
finalize the results in a geographic atlas of intonation varieties of Romance
languages. These studies are essentially phonetic, and generally do not consider
the structuring functions of sentence intonation.
The sheer size of the Roman Empire and the relatively slow speed of
communication together with the lack of an enforced grammatical and literary norm
(except in Christian institutions such as monasteries and convents) were essen-
tial factors that allowed the phonetic, lexical, and grammatical characteristics
of Latin to evolve. The presence of substrates spoken in various regions which
had been invaded at various times allowed sporadic influence on the original
lexicon, phonology, and, to a lesser extent, the syntax of the Latin spoken in
administration and commerce. This explains the natural phonetic evolution of
Romance languages.
Similarities in lexicon and syntax are also present (perhaps unexpectedly) in
prosodic features, despite the fact that pronunciation has evolved differently in
various Romance languages. Indeed, the key prosodic characteristics are sur-
prisingly similar, the lexical stress rules are morphologically based everywhere
(except in French, where the lexical stress disappeared), and the prosodic
markers of the prosodic structure use comparable melodic movement accord-
ing to the same grammar. The following chapters give a detailed review of these
similarities.
2 The role of technological advances
In the last six decades, the advent of new and ever more sophisticated speech
analysis tools has given researchers the opportunity to test existing theoretical
phonological models, especially those devoted to sentence intonation.
Complex models elaborated from linguists’ intuitions could be tested against
actual speech data, first in well-defined production conditions (laboratory
speech) and later in various real-life conditions (spontaneous speech).
Technological advances which allowed for acoustic analysis to be quickly
and reliably performed were of paramount importance for the analysis of data
recorded in various discourse production environments. For prosodic research,
the quest for a correct and reliable measure of fundamental frequency has been
pivotal.
Parallel to these advances, the emergence of large spontaneous speech
corpora, gathering speakers’ performances in all kinds of conditions
(monologues, dialogs, public and family contexts, etc.; for example, C-ORAL-ROM,
http://lablita.dit.unifi.it/coralrom/), led to a reconsideration of what had
sometimes been treated as intangible descriptive results, elaborated in isolation
on the basis of researchers’ intuition. Early work on prosody, for instance, was largely
based on intuitive concepts proposed by theorists like Liberman and Prince
(1977), without using much experimental data (which phonologists not
specially trained in experimental phonetics felt to be technically difficult
to obtain).
The kymograph
In 1860, Édouard-Léon Scott de Martinville made the oldest known recording
of the human voice with the phonautograph, seventeen years before
Edison’s phonograph. With this device, a stylus engraves the sound vibration
on a sheet of paper coated with carbon black wound around a rotary cylinder.
Although this system could not reproduce the recorded sound (this was done
much later with a laser tracking device, Rosen, 2008), it allowed a first
acoustical analysis of a human voice and heralded a specific development
for acoustic speech analysis from the kymograph invented by Carl Ludwig in
1847. Later, Etienne-Jules Marey developed a series of instruments dedicated
essentially to the dynamic aspects of speech, including vibration of the vocal
folds (Teston, 2004).
Figure 2.1 Rousselot kymograph.
Indeed, phoneticians were already using laboratory speech in the early
twentieth century. Rousselot (1901, 1908), for instance, used a modified
kymograph (Fig. 2.1) to obtain rudimentary speech waveforms from which it was
possible to derive values of laryngeal frequency as a function of time. This was
done by visual identification of the period or group of periods on the waveform
(Fig. 2.2). The duration of analyzed speech was of course quite limited and
speakers had to be physically present to produce recordings. This approach
was much improved later with high-speed speech wave recording on
photosensitive paper, allowing a reasonable precision in the resulting melodic
curve produced by expert phoneticians.
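The two ways of reading the laryngeal period shown in Fig. 2.2 can be sketched as a short calculation. This is only an illustration: it assumes that the period values of the figure are in milliseconds (4 ms for one period, 42 ms for a packet of ten), and the 0.5 ms reading error is a hypothetical value chosen to show the trade-off.

```python
def f0_from_period(duration_ms, n_periods=1):
    """Mean laryngeal frequency (Hz) over n_periods spanning duration_ms."""
    period_ms = duration_ms / n_periods
    return 1000.0 / period_ms


def reading_error_hz(duration_ms, n_periods, reading_error_ms):
    """Spread in the frequency estimate caused by a +/- error on the measured duration."""
    low = f0_from_period(duration_ms + reading_error_ms, n_periods)
    high = f0_from_period(duration_ms - reading_error_ms, n_periods)
    return high - low


direct = f0_from_period(4.0)         # one period read directly on the waveform
packet = f0_from_period(42.0, 10)    # ten periods measured as one packet

# Hypothetical reading error of 0.5 ms on the waveform in both cases.
direct_spread = reading_error_hz(4.0, 1, 0.5)
packet_spread = reading_error_hz(42.0, 10, 0.5)
print(direct, round(packet, 1), round(direct_spread, 1), round(packet_spread, 1))
```

The direct measure gives one value per period but is very sensitive to the reading error (low frequency resolution), whereas the packet measure is far less sensitive but yields a single average over about 42 ms (low temporal resolution).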
The spectrograph
Later, in the 1950s, thanks to the development of electronic amplifiers, the
sound spectrograph appeared and it became possible to perform an acoustical
analysis of speech segments of 2.4 s from speech recordings made elsewhere
(Fig. 2.3). The information provided by this tool quickly became central in
experimental phonetic studies, leading to a better comprehension of the
phonation mechanism, in particular pertaining to vowel formant frequencies.
Figure 2.2 Measure of the laryngeal period, directly (top), or indirectly from
the duration of a packet of ten periods. Figure annotations: direct measure of
the period, 4 cs -> Fo = 250 Hz (low frequency resolution); measure by packets
of periods (e.g. 10), 42 cs -> T = 4.2 ms, Fo = 238 Hz (low temporal resolution).
Figure 2.3 Spectrogram printed on thermo-sensitive paper. Figure annotations:
0–4,000 Hz (scale magnifier); 45 Hz; 2.4 seconds; 10th harmonic F = 24 mm ->
2400 Hz -> Fo = 240 Hz.
For studies in intonation requiring knowledge of the laryngeal frequency, use
of a “scale magnifier” and a “narrowband” filter allowed the visual identifica-
tion of the fundamental frequency from the other harmonics, but the thickness
of the harmonics in the resulting display (corresponding to a frequency of
45 Hz) prevented satisfactory precision in the measures.
Fundamental frequency tracking

In order to achieve a reasonable precision, the 10th harmonic (for example)
displayed on a narrowband spectrogram had to be visually identified. Measuring
the frequency of this harmonic at regularly spaced time intervals
improved the precision of the fundamental frequency measure, given that the traces
representing the harmonics were rather broad and not very precise. This
operation was quite time-consuming, not to mention that the spectrogram
frequency scale was not always linear.
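The gain in precision obtained by reading a higher harmonic can be sketched numerically. The values are those mentioned for the spectrogram of Fig. 2.3 (a 10th harmonic read at 2400 Hz, traces about 45 Hz thick); the function names are hypothetical, and treating the trace thickness as the reading uncertainty is a simplifying assumption.

```python
def f0_from_harmonic(harmonic_hz, rank):
    """Fundamental frequency deduced from the frequency of the harmonic of a given rank."""
    return harmonic_hz / rank


def f0_uncertainty(trace_width_hz, rank):
    """Uncertainty on the fundamental when the reading is made on the harmonic of that rank."""
    return trace_width_hz / rank


f0 = f0_from_harmonic(2400.0, 10)
on_fundamental = f0_uncertainty(45.0, 1)      # reading the fundamental itself
on_tenth_harmonic = f0_uncertainty(45.0, 10)  # reading the 10th harmonic
print(f0, on_fundamental, on_tenth_harmonic)
```

Since the n-th harmonic lies at n times the fundamental, an uncertainty of a given size on the harmonic is divided by n once converted back to the fundamental.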
This rather tedious evaluation of melodic curves led to the development of
specialized hardware devices (VisiPitch, Pitch Computer, etc.) and software
programs such as “Pitchmeters” and “Pitch analyzers” (Signalyze, Waves,
WinPitch, Praat, Wavesurfer, and so on) with embedded pitch tracking
algorithms (Martin, 2008).
First results
Among the first changes of point of view pertaining to phonology, the use of a
kymograph by Rousselot (1901, 1908) and then by Grammont (1933) led to a
better understanding of the nature of stressed vowels and the importance of their
duration. Later, the advent of the spectrograph made possible one of the first
phonetic if not phonological descriptions of basic intonation patterns in French
based on acoustical analysis (Fig. 2.4) by Delattre (1966). These results are still
used today as a basis for the phonological description of French speech prosody.
The description of pitch contours relied on the analysis of recordings stored
on vinyl that were actually inserted in an issue of The French Review where the
paper was published. Everybody (having access to a spectrograph) could
therefore verify the experimental results and review the hypotheses existing
at the time on sentence intonation.
Later, thanks to the realization of a real-time pitch analyzer working reliably
in a large laryngeal frequency range (70 Hz to 500 Hz; Léon & Martin, 1969),
extensive analysis of French prosodic data led to a first model of sentence
intonation using the concept of prosodic structure (Martin, 1975). In this
model, stress groups (sequences of words with only one stressed syllable) are
assembled hierarchically into a prosodic structure according to specific melo-
dic contours located on stressed syllables.
The emergence of relatively inexpensive computers in the 1980s
brought the development of easier-to-use software pitch analyzers. These
software programs gradually led researchers to apply the phonological AM
(Goldsmith, 1976) model to sentence intonation data, so that the AM model
quickly became dominant in intonation phonology. In this model, the prosodic
structure organizes prosodic events hierarchically in non-recursive levels: a
first level assembles syllables σ into content words Wc (verbs, nouns,
adjectives, and adverbs) and function words Wf (conjunctions, pronouns,
prepositions, articles, verb auxiliaries); a second level groups these words
into accentual phrases (APs); a
third level groups APs into intonation phrases (IPs); and finally a phonological
utterance (PU) groups sequences of intonation phrases (see
Chapter 4).
Example            Pattern                Pitch levels
Si ces œufs        Continuation mineure   2–3
étaient frais,     Continuation majeure   2–4
j'en prendrais.    Finalité               2–1
Qui les vend ?     Interrogation          4–1
C'est bien toi ?   Question               2–4+
Ma jolie?          Echo                   4–4
Evidemment,        Implication            2–4-
Monsieur.          Parenthèse             1–1
Allons donc !      Exclamation            4–1
Prouve-le-moi.     Commandement           4–1
Figure 2.4 The ten basic intonation patterns for French by Delattre (1966).
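The non-recursive layering described for the AM model (syllables, words, APs, IPs, PU) can be sketched as a strictly layered data structure in which each unit may only group units of the level immediately below. This is a minimal illustration: the names LEVELS, group, and count_syllables are hypothetical, the syllable labels are placeholders, and the distinction between content and function words is left out.

```python
LEVELS = ["syllable", "word", "AP", "IP", "PU"]


def level_of(item):
    """Plain strings stand for syllables; other items are (level, children) pairs."""
    return "syllable" if isinstance(item, str) else item[0]


def group(level, *children):
    """Build a unit of the given level from units of the level just below."""
    if level not in LEVELS[1:]:
        raise ValueError(f"unknown grouping level: {level}")
    expected = LEVELS[LEVELS.index(level) - 1]
    for child in children:
        if level_of(child) != expected:
            raise ValueError(f"{level} must group {expected} units, got {level_of(child)}")
    return (level, list(children))


def count_syllables(item):
    """Number of syllables dominated by a unit."""
    if isinstance(item, str):
        return 1
    return sum(count_syllables(child) for child in item[1])


# Placeholder syllables s1..s5 grouped into words, APs, one IP, and one PU.
w1 = group("word", "s1", "s2")
w2 = group("word", "s3")
w3 = group("word", "s4", "s5")
ap1 = group("AP", w1, w2)
ap2 = group("AP", w3)
ip = group("IP", ap1, ap2)
pu = group("PU", ip)
print(count_syllables(pu))
```

Because every level only accepts units of the level just below, an IP cannot directly contain a word or another IP, which is what non-recursive means here.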
The prosodic events are aligned on AP stressed syllables and are described as
sequences of tones transcribed with the ToBI notational system. This system
uses High (H) and Low (L) symbols to represent tone targets as perceived or
observed on fundamental frequency curves obtained from the speech signal
acoustic analysis (an example is given Fig. 2.5).
The prosodic structure in the AM model was applied in the 1990s to
data obtained from the then recently available speech analysis software
Waves™. This opened a new playground for phonologists and syntacticians
interested in sentence intonation, as it offered much simpler access to acoustic
data than the spectrographs available at that time. Unfortunately, the
fundamental frequency curves displayed by Waves were frequently found unreliable
despite the use of high-quality speech recorded in soundproof rooms.
Some of these errors, such as frequency doubling and halving, were so
frequent that their manual correction became part of the ToBI description
manuals (Beckman & Ayers Elam, 1997).
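Frequency doubling and halving show up as sudden jumps of about one octave between neighboring frames of a pitch curve. The sketch below is a naive heuristic written for illustration only; it is not the procedure of the ToBI manuals nor the algorithm of any particular program, and the function name, the tolerance, and the sample values are assumptions.

```python
import math


def fix_octave_jumps(f0_track, tolerance_octaves=0.15):
    """Undo isolated doubling/halving by comparing each voiced frame to the previous one.

    f0_track is a list of frequencies in Hz, with None for unvoiced frames.
    """
    corrected = []
    previous = None
    for value in f0_track:
        if value is None:              # unvoiced frame: keep the gap
            corrected.append(None)
            continue
        if previous is not None:
            jump = math.log2(value / previous)   # jump size in octaves
            if abs(jump - 1.0) <= tolerance_octaves:
                value = value / 2.0    # about one octave up: treat as doubling
            elif abs(jump + 1.0) <= tolerance_octaves:
                value = value * 2.0    # about one octave down: treat as halving
        corrected.append(value)
        previous = value
    return corrected


# Hypothetical track with one doubled frame (244 Hz) and one halved frame (62 Hz).
track = [120.0, 122.0, 244.0, 125.0, None, 62.0, 124.0]
print(fix_octave_jumps(track))
```

Such a rule relies on the first voiced frame being correct and would wrongly flatten a genuine octave leap, which is one reason why manual checking against the harmonics remains useful.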
Figure 2.5 An example of analysis with the software Waves™ of the sentence
Jim builds a big daisy-chain (from ToBI 1999). The erroneous segments (on
big and daisy) are circled.
The generalized use of Praat from 2000 onwards delivered more reliable
data, but users frequently choose a display combining a wideband spectrogram
with the fundamental frequency curve (Fig. 2.6).
The fact that this representation became a kind of standard in the field is
rather unfortunate as even in speech recorded in laboratory conditions errors
occasionally do occur. A simultaneous display of a narrowband spectrogram
would be more advisable, as it would allow even moderately knowledgeable
operators to locate immediately potential errors in the pitch curve from the
observation of voice harmonics displayed simultaneously (see Chapter 1).
More recently, the elaboration of rather large spontaneous speech corpora
has led to the development of more complex software programs such as
WinPitch (from 1996 onwards) embedding various sophisticated tools to
transcribe, annotate, and align recorded data (Fig. 2.7).
A somewhat detailed description of WinPitch is included at the end of this
book. One of the important features of the program is the ability to apply
multiple fundamental frequency tracking algorithms on various segments of
recorded speech. It is then possible to obtain the best fundamental frequency
curves in adverse recording conditions, by selecting appropriate algorithms.
Figure 2.6 An example of a fundamental frequency curve with a wideband
spectrogram displayed underneath (from Delais-Roussarie et al., 2015).
Figure 2.7 WinPitch display.

Electroencephalography and brain waves

In the 1930s and 1940s, researchers observed that the human brain consists of
a very large number of neurons (on the order of 100 billion) interconnected in
groups in specific regions of the brain mass. These interconnections allow an
electrical transfer of the information stored chemically in each neuron. These
transfers induce variations of a small electric potential (in the µV range) that
can be observed through sensors positioned on the subject’s skull (EEG). These
electrical variations are called evoked potentials, as they result from a sensory
stimulation, auditory, visual, or other.
The electrical activity produced by transfers from groups of neurons to other
groups of neurons does not occur haphazardly. First, these transfers operate in
specific frequency ranges linked to specific cognitive activities, and second, they can be
synchronized in phase in specific frequency ranges. Greek letters designate specific
frequency ranges: Alpha, Delta, Beta, Gamma.
Evoked potentials are usually observed with a relatively large number of
sensors (from 32 to more than 256) placed around the subject’s skull according
to location standards. EEG signals are stored in real time and transformed into the
frequency domain with either a (Fast) Fourier or Wavelet transform. The
resulting representation is very similar to spectra obtained in frequency speech
analysis, but in a much lower frequency range.
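The passage from a time signal to frequency ranges can be sketched on a synthetic example. This is only an illustration: the signal is an artificial sinusoid and not EEG data, the band limits are approximate, commonly cited values that are not given in this book, and a plain discrete Fourier transform is used in place of a fast implementation to keep the example self-contained.

```python
import cmath
import math

# Approximate, commonly cited band limits in Hz (assumed, not taken from the book).
BANDS = [("delta", 0.5, 4.0), ("theta", 4.0, 8.0), ("alpha", 8.0, 13.0),
         ("beta", 13.0, 30.0), ("gamma", 30.0, 100.0)]


def dft_magnitudes(samples):
    """Magnitude spectrum (bins 0..N/2) computed with a plain, slow DFT."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(samples, sample_rate_hz):
    """Frequency (Hz) of the strongest component, ignoring the constant term."""
    mags = dft_magnitudes(samples)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate_hz / len(samples)


def band_of(freq_hz):
    """Name of the band containing freq_hz, or None if it falls outside all bands."""
    for name, low, high in BANDS:
        if low <= freq_hz < high:
            return name
    return None


# Synthetic one-second signal: a 10 Hz sinusoid sampled at 100 Hz.
rate = 100
signal = [math.sin(2 * math.pi * 10 * t / rate) for t in range(rate)]
peak = dominant_frequency(signal, rate)
print(peak, band_of(peak))
```

Real EEG analysis works on many channels, long recordings, and noisy signals, for which a fast Fourier or wavelet transform from a numerical library would be used instead.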
With evoked potential techniques, it becomes possible to investigate the
effect of speech perception, and particularly of prosodic events, on brain waves
(see Chapter 7).
Transcription and alignment of speech

When the duration of the speech recordings to be analyzed becomes too large, it
may be impractical and too time-consuming to locate the specific speech
segments corresponding to a selected portion of the text transcription. When
only tape recordings were available to store speech data, retrieving a
particular speech fragment was simply too tedious in practice and prevented many
projects from being considered feasible. Speech-text alignment software of
various types solves this problem by adding a set of bidirectional pointers to
the speech and text files, allowing an efficient retrieval of speech segments
from their corresponding transcription and vice versa. The granularity of the
alignment varies with the application, going from the phone to the syntagm or
sentence level, or even higher to the speech turn level.
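The idea of bidirectional pointers between speech and text can be sketched as a small index that answers both "which unit is spoken at this time?" and "where does this unit occur in the recording?". The class and field names below are hypothetical, and the time values are invented for illustration; no existing aligner is claimed to store its data this way.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedUnit:
    label: str    # transcribed unit (phone, word, sentence, speech turn...)
    start: float  # start time in seconds
    end: float    # end time in seconds


class AlignmentIndex:
    """Two-way lookup between text units and time spans of a recording."""

    def __init__(self, units):
        self.units = sorted(units, key=lambda u: u.start)
        self._starts = [u.start for u in self.units]

    def unit_at(self, time_s):
        """Speech -> text: return the unit spoken at time_s, or None."""
        i = bisect.bisect_right(self._starts, time_s) - 1
        if i >= 0 and time_s < self.units[i].end:
            return self.units[i]
        return None  # silence, or outside the aligned part of the recording

    def spans_of(self, label):
        """Text -> speech: return every (start, end) span of a given unit."""
        return [(u.start, u.end) for u in self.units if u.label == label]


# Invented word-level alignment for the beginning of a French sentence.
index = AlignmentIndex([
    AlignedUnit("agneau", 0.10, 0.55),
    AlignedUnit("ou", 0.55, 0.70),
    AlignedUnit("veau", 0.70, 1.10),
])
word_at_600_ms = index.unit_at(0.60).label
veau_spans = index.spans_of("veau")
```

The same structure works at any granularity, since a unit is only a label with a start and an end time.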
Specialized software programs such as Transcriber (http://trans.sourceforge.net),
Praat (www.praat.org) or WinPitch (www.winpitch.com) have built-in
functions implementing speech-text alignment during the speech transcription
process. Often transcriptions are available beforehand and the alignment has to
be done in an extra step. Some dedicated software does exist (e.g. EasyAlign,
Goldman, 2011; Penn Phonetics Lab Forced Aligner, Yuan & Liberman, 2008),
but it requires speech recordings of reasonably good quality, without too much
echo, overlapping speech, frequency distortion (e.g. from high-compression
mp3 coding), and so on, to operate properly.
WinPitch is an example of another approach to this problem. Instead of
relying on automatic processing using tools available from speech recognition
techniques, its aligner uses the capacity of human operators to handle speech
quality problems, assuming that operators have better overall speech percep-
tion faculties than machines.
The WinPitch alignment engine is based on the following approach: psychoacoustic
experiments have shown that subjects are capable of correlating
moving objects with speech if the speech rate is reduced by at least 30 percent.
Using PSOLA (Pitch Synchronous Overlap and Add), autocorrelation, or
phase vocoder re-synthesis of natural speech, the aligner can play back
speech at an adjustable reduced rate (down to seven times slower),
allowing the user to click on the written words (or any other units) of the text
corpus that correspond to the slowed-down running speech. The speech rate is
adjustable in real time with the mouse wheel, so that the operator can
continuously adapt the output rate to the alignment task.
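Because the operator clicks while listening to slowed-down speech, each click time has to be converted back to a position in the original recording. The sketch below shows one way to do this when the slowdown factor changes during playback. It is only an assumption about how such a conversion could work, not a description of the WinPitch implementation; the function name and the segment format are hypothetical.

```python
def original_time(click_time_s, rate_segments):
    """Map a click time measured during slowed playback to recording time.

    rate_segments: list of (playback_duration_s, slowdown_factor) pairs in
    playing order; a factor of 2.0 means the speech is played twice as slowly.
    """
    elapsed_playback = 0.0
    elapsed_original = 0.0
    for duration, factor in rate_segments:
        if factor < 1.0:
            raise ValueError("slowdown factor must be >= 1")
        if click_time_s <= elapsed_playback + duration:
            # The click falls inside this segment: scale the remainder only.
            return elapsed_original + (click_time_s - elapsed_playback) / factor
        elapsed_playback += duration
        elapsed_original += duration / factor
    raise ValueError("click falls after the end of the playback")


# Invented session: 4 s played twice as slowly, then 6 s three times as slowly.
segments = [(4.0, 2.0), (6.0, 3.0)]
word_boundary_s = original_time(7.0, segments)
```

In this invented session, a click made 7 s into the playback corresponds to 3 s in the original recording (2 s covered by the first segment, 1 s by the second).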
This aligner has the following important features:
1. It will work with degraded quality recordings, which are common in
spontaneous speech corpora (recordings made on the street, with machine
background, multiple speakers, echo, etc.), which is not presently the case
for existing automatic aligners.
2. The error rate will depend on the operator, and is expected to be very low
(5%) compared to automatic methods (which report 25–30% error rates), thanks
to the adjustable speech rate. Graphic tools are available to make an
adjustment if needed.
3. It will be insensitive to the speaker’s dialectal variations, whereas automatic
segmentation and alignment based on Hidden Markov Model speech recognition
technology requires training for each speaker’s voice to be effective,
and is therefore very sensitive to those variations.
4. It integrates user-friendly graphic commands with the mouse to correct
any misalignment errors.
3 Transcription systems
Whether we are aware of it or not, every description of an object implies a
preexisting theoretical point of view of some sort. For example, the description
of the movement of stars and planets implies not only the use of adequate
instruments, such as telescopes, but also the acceptance of some cosmological
model to interpret the data. This model can change in time. In the Middle Ages,
all the universe was assumed to revolve around the earth, whereas today the
movement of planets may be interpreted through the general relativity model,
leading to a different understanding of observed data.
Actually, a description process implies two distinct steps: (a) obtaining the
data by the use of appropriate instruments and (b) selecting among these data
those that allow an interpretation through a chosen theoretical frame. Even if an
interpretation model is not explicitly present, the simple use of statistical
analysis tools would implement hidden selection criteria. However, at times,
the emergence of data that could not fit the theory currently used will provoke a
more or less important redesign of the theoretical framework.
In speech science, too, the use of a specific method to gather and select
acoustic data implies practical and theoretical choices pertaining to the
mode of observation of speech data and to their interpretation. In the case of
intonation, the fundamental frequency is an essential acoustical parameter used
to obtain and select the data. Still, the (often unsuspected) use of a particular
fundamental frequency analysis algorithm may very well affect the gathering
of data slightly or more severely. For instance, a frequency domain approach
based on spectral analysis prevents (by mathematical limitations) the reliable
tracking of fast-changing laryngeal frequency and may thus induce a false
representation, whereas other processes operating in the time domain may
involve other kinds of errors, such as pitch doubling or halving (i.e. reading
twice or half of the actual values). As for the selection of data, the use of
the de facto standard ToBI system (see below) for the manual transcription of
prosodic events introduces a bias (possibly a deliberate one) stemming from the
theoretical choices more or less hidden in this notation system (which prevents
the transcription of the duration of prosodic events, for example).
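Pitch doubling and halving errors can often be spotted because they show up as sudden jumps of about one octave between neighboring voiced frames. The following sketch flags such frames in a list of F0 values. The function name, the one-semitone tolerance, and the example values are assumptions made for illustration, and a genuine octave leap by the speaker would be flagged as well, so the output is a list of candidates to check rather than a correction.

```python
import math


def flag_octave_jumps(f0_hz, tolerance_st=1.0):
    """Return indices of voiced frames lying about one octave away from
    the previous trusted voiced frame (possible doubling or halving).

    f0_hz: list of F0 values in Hz, with 0 or None for unvoiced frames.
    """
    suspects = []
    previous = None
    for i, value in enumerate(f0_hz):
        if not value:
            continue  # unvoiced frame: keep the last voiced reference
        if previous is not None:
            jump_st = 12 * math.log2(value / previous)
            if abs(abs(jump_st) - 12) <= tolerance_st:
                suspects.append(i)
                continue  # do not adopt the suspect value as the reference
        previous = value
    return suspects


# Invented track: frame 3 reads about twice, frame 5 about half the
# neighboring values; frame 2 is unvoiced.
track = [120, 122, 0, 245, 124, 61, 121]
suspect_frames = flag_octave_jumps(track)
```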
Acoustic and perceived data

The relationship between fundamental frequency (or laryngeal frequency) and
perceived melodic height is not linear. Indeed, the perceived height depends on
both the sound frequency and the intensity. Palmer and Holleran (1994), for
example, showed that for multivoiced music, perceived height depended on the
harmonic structure of the musical sound. Furthermore, musical training
enhances the perception of speech height for listeners. As for melodic move-
ments found in stressed vowels, Rossi (1971, 1972, 1978) showed that the
perception of height (for pure sounds and synthetic vowels) depends on the
speed of frequency change. Below a certain glissando threshold, the pitch
movement is perceived as a static tone, whereas above this threshold a melodic
change is perceived. Moreover, the glissando threshold is sensitive to a change
of intensity occurring inside the sound.
Perceived intensity also depends on the sound frequency and the harmonic
composition of speech sound. The curves of Fletcher–Munson, established in
the 1930s, indicate that the perceived intensity of pure sounds, which varies
logarithmically with physical intensity, depends on the frequency of the sound.
The harmonic composition of vowels will then also influence the perception of
their intensity. The Fletcher–Munson curves lead to a new unit of perceived
intensity distinct from the decibel (dB), the phon, defined from these
curves with a reference sound at 1000 Hz. However, the phon is seldom used
by phoneticians, let alone phonologists, as it is too complex to implement in
practice (and also relies on experiments based on the perception of pure tones
and not speech).
Obtaining data: pitch curves

If we don’t trust our ears to transcribe prosodic events, we may (with caution)
rely on the visualization of their acoustic analysis as a starting point and extract
pertinent information from their three essential parameters, intensity, duration,
and fundamental frequency. The latter parameter is commonly represented by
the “pitch curves,” or “melodic curves,” delivered by so-called pitch analyzers.
They are related in a logarithmic fashion to the perception of melodic height by
listeners.
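A common way to express this logarithmic relation is the semitone scale, in which equal frequency ratios give equal intervals. The sketch below converts Hz to semitones relative to a reference frequency; the 100 Hz reference and the function names are assumptions chosen for the example.

```python
import math


def hz_to_semitones(f0_hz, reference_hz=100.0):
    """Express a frequency in semitones above (or below) a reference."""
    return 12.0 * math.log2(f0_hz / reference_hz)


def interval_st(from_hz, to_hz):
    """Size of a pitch movement in semitones (positive when rising)."""
    return 12.0 * math.log2(to_hz / from_hz)


# The same 50 percent rise is one and the same interval in semitones,
# although it spans 50 Hz in one case and 100 Hz in the other.
low_voice_rise = interval_st(100.0, 150.0)
high_voice_rise = interval_st(200.0, 300.0)
octave = hz_to_semitones(200.0)
```

On a linear Hz scale the two rises look very different, which is one reason why the choice of scale can influence the reading of a pitch curve.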
At first one can easily be disoriented by the tormented aspect of
fundamental frequency curves, which furthermore disappear episodically
in passages without laryngeal vibration (pauses or unvoiced consonants
such as [f], [s], [ʃ] or [p], [t], [k]) (see Fig. 3.1). In whispered speech, no
laryngeal vibrations are produced at all, and the pitch curve will remain
hopelessly stuck on the horizontal axis of the graph, representing by
convention the absence of fundamental frequency. Switching from a linear
to a logarithmic frequency scale will also change the graphic aspect of the
melodic curves and possibly influence their interpretation by researchers.
Furthermore, the creaky mode of phonation, due to an irregular or alternating
(i.e. consecutive long and short periods) mode of vibration of the vocal
folds, raises other problems of data interpretation, as in this case the
fundamental frequency is generally not correctly evaluated and is therefore
displayed with erroneous values. The operator will have to
learn how to identify certain remarkable passages, so as to segment the curve
into sections relevant for phonological analysis.
Figure 3.1 Pitch curve using a linear scale in Hz (French sentence agneau ou
veau faut que le beau rôti soit chaud “Lamb or veal, the beautiful roast must
be hot”). The waveform, representing directly the speech sound vibrations, is
displayed at the bottom of the graph.
Moreover, the acoustical measurement of the fundamental frequency is
episodically erroneous. This characteristic, frequently due to poor recording
conditions, unsettles more than one beginner in this field of research. After
all, measuring instruments, acoustic or other, are a priori designed to function
correctly (within the limits specified in their respective user manuals, which are
not necessarily read by everybody). The same applies to pitch analyzers. Many
more or less reliable software programs allow for pitch curve visualization.
While Praat is one of the most popular today, a program like WinPitch, as its
name indicates, is more specifically adapted to prosodic analysis, both through
its multiple fundamental frequency tracking algorithms, which reduce the
risk of errors, and through its assisted alignment process, which is useful for
transcribing large corpora.
The purpose of transcription, as for any other physical phenomenon, is to
filter and select information in order to make data interpretable. Most of the
existing transcription systems import some principle that is implicitly or
explicitly part of the system. For instance, ToBI, a transcription system based
on the perception of high and low pitch targets, inherits from Pike’s (1945)
notation for English as well as from perception experiments carried out at IPO
(Instituut voor Perceptie-Onderzoek) in Holland in the 1970s, using a vocoder
to manipulate speech intonation (’t Hart et al., 1990). Another system, the
Prosogram (Mertens, 2004), is based on the assumed validity of the glissando
threshold for voiced speech sounds (despite the fact that this threshold has been
established for the synthetic isolated vowel [a]). Other systems such as Analor
(Avanzi et al., 2008) or INTSINT (Hirst & Espesser, 1993) assume the validity of
perception tests carried out in specific conditions: Analor relies on a set of
defined acoustic parameters, INTSINT on the equivalence of the overall perceived
sentence intonation with an assumed equivalent quadratic fundamental frequency
function. The AMPER project (Contini et al., 2002) eliminates the rhythmic
parameter, apparently judged less important, by aligning pitch values of
different realizations of read sentences syllable by syllable.
In most cases, syllabic duration and intensity are absent from these
transcriptions, and only pitch movements are retained as pertinent informa-
tion. This is especially true for ToBI, which is the de facto standard for
sentence intonation transcription. However other systems (including the
one used in this book) use a more detailed description of melodic contours
including their duration.
Selecting data
Historical background
Prosodic transcription systems found in the twentieth-century linguistic litera-
ture are mostly of Anglo-Saxon origin and were elaborated by phoneticians as
well as phonologists. Their transcriptions are based on auditory perception,
without the assistance of acoustic analysis devices, which appeared only later.
Some of these systems reveal theoretical options concerning the status of
prominent syllables in the sentence, options which one finds almost everywhere
today. The instruments available at the time (the kymograph since 1847,
the spectrograph after 1950) were complex and tedious to handle and did not
allow in practice the analysis of long stretches of speech. The use of acoustic
analysis spread gradually only after 1970, together with the availability of
personal computers.
Some early prosodic transcriptions were inspired by musical notation
(Figs. 3.2 and 3.3). Through the duration of the musical notes and their
grouping in rhythmic units, this kind of notation in some ways takes
into account the sentence rhythm, an attribute which is seldom found in
contemporary transcriptions.
Figure 3.2 Melody example of perceived pitch curve transcribed on a musical
range for English (Jones, 1909).
Figure 3.3 Musical transcription used by Fónagy and Magdics (1963) for
English.
During the same period, other systems appeared, using the verticality of
graphic space without referring precisely to a musical or frequency scale. These
iconic notations represent non-prominent syllables by points, prominent sylla-
bles as static tones by horizontal strokes, and pitch variations on stressed
syllables by tilted or curved strokes (Figs. 3.4 and 3.5).
Later, the theoretical importance of the stressed syllables began to
appear. As a precursor of the ToBI notation, Pike (1945) transcribed
perceived syllabic prominences by changes of levels relative only to
stressed syllables (Fig. 3.6).
Figure 3.4 Unstressed syllables, static tones, and contours for English
(Armstrong & Ward, 1931).
Figure 3.5 Unstressed syllables, static tones, and contours for German (von
Essen, 1956).
Figure 3.6 Stressed and final syllables pitch transcribed as static tones in
English (Pike, 1945).
A more phonetic notation is found in the grammar of spoken English written
by Palmer and Blandford (1924). The details of pitch variation on and around
the stressed syllables are transcribed by complex contours (Fig. 3.7).
Bolinger (1961, Fig. 3.8) used an intuitive and original notation. Syllables are
transcribed by laying them out on the vertical dimension of the paper according
to their perceived pitch level, and by separating the characters of the same
syllable to indicate a contour. Thus in Figure 3.8, the syllable dark presents a
high and flat tone, whereas the syllable blue presents a slightly rising melody
contour, and black a downward contour.
Figure 3.7 Melody contours of groups for spoken English (Palmer & Blandford,
1924).
Figure 3.8 Melody movements by syllable for English (Bolinger, 1961).
Figure 3.9 Simplified musical range with four levels for French: 1 Low,
2 Average, 3 High, 4 Acute (Léon & Martin, 1969).
The French tradition uses a simplified musical notation with four levels
(Delattre, 1966). These levels, usually numbered from 1 to 4, correspond to
perceived or measured low, average, high, and acute pitch. This notation has
long been used in many phonetic research projects as well as in the
teaching of the intonation of French as a foreign language (M. Léon, 1964)
(Fig. 3.9).
Many other systems were also proposed (Léon & Martin, 1969); however, a
tendency gradually appeared over time to sort the prosodic events related to
syllables in three classes: non-prominent syllables (not accentuated), promi-
nent syllables accentuated thanks to their (perceived) pitch static level, and
prominent syllables accentuated thanks to their pitch variation. One finds this
classification in modern semi-automatic systems of transcription (e.g. the
Prosogram [Mertens, 2004]).
The AMPER project

The AMPER project (AMPER being the acronym of Atlas Multimédia
Prosodique de l’Espace Roman, i.e. Multimedia Prosodic Atlas of the Romance
Area) aims to compare the intonations of many Romance languages
(French, Italian, Portuguese, Romanian, Spanish, Catalan, Provençal, and so
on), including their regional variations (Contini et al., 2002). The project is
inspired by the linguistic atlases published at the beginning of the twentieth
century. In these atlases, a narrow phonetic transcription of a limited number
of words pronounced by speakers of various regions was reported on a detailed
map (Gilliéron & Edmont, 1902–1910). With this principle extended to
transcriptions of intonative variations in the AMPER project, the data are
obtained by recording a limited set of simple SN + SV (noun phrase + verb
phrase) sentences, such as La belle-soeur de Massimo joue de la guitare
andalouse, “Massimo’s stepsister plays guitar from Andalucia,” read by speakers
from the regions considered (Fig. 3.10).
To allow an easier comparison between observed varieties of realizations,
sections of pitch curves are aligned syllable by syllable, therefore removing the
rhythmic parameter from the transcription.
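The effect of aligning pitch curves syllable by syllable can be illustrated with a small time-normalization sketch: each syllable is sampled at a fixed number of relative positions, so two realizations with the same melody but different syllable durations yield the same values. Sampling three points per syllable, the linear interpolation, and all names and numbers below are assumptions made for the example, not the AMPER procedure itself.

```python
def f0_at(track, time_s):
    """Linearly interpolate F0 (Hz) at time_s.

    track: list of (time_s, f0_hz) pairs with strictly increasing times.
    """
    if time_s <= track[0][0]:
        return track[0][1]
    for (t0, f0), (t1, f1) in zip(track, track[1:]):
        if t0 <= time_s <= t1:
            return f0 + (f1 - f0) * (time_s - t0) / (t1 - t0)
    return track[-1][1]


def align_by_syllable(track, syllables, points_per_syllable=3):
    """Sample F0 at evenly spaced relative positions inside each syllable.

    syllables: list of (start_s, end_s). Durations disappear from the result.
    """
    aligned = []
    for start, end in syllables:
        step = (end - start) / points_per_syllable
        aligned.append([
            f0_at(track, start + (k + 0.5) * step)
            for k in range(points_per_syllable)
        ])
    return aligned


# Two invented speakers: same melody (rise to 120 Hz, fall to 90 Hz),
# but a short first syllable for A and a long one for B.
speaker_a = align_by_syllable(
    [(0.0, 100.0), (0.2, 120.0), (0.6, 90.0)], [(0.0, 0.2), (0.2, 0.6)])
speaker_b = align_by_syllable(
    [(0.0, 100.0), (0.4, 120.0), (0.6, 90.0)], [(0.0, 0.4), (0.4, 0.6)])
```

After normalization the two speakers are indistinguishable, which makes the melodies easy to compare but also shows concretely what is lost when rhythm is removed.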
The prosogram
Mertens (2004) implemented in a Praat script an automatic prosodic transcription
operating on pitch curves. This transcription is based on the assumption
that a change in time of the fundamental frequency F0 is perceived either as a
static tone or as a pitch movement according to the speed of variation of F0, and
therefore of the frequency change in the syllable per unit of time. A glissando
threshold value determines whether the perceived pitch is a static tone or a pitch
variation.
Figure 3.10 Example of comparison of melody movements on all the
syllables of a statement read by two speakers of the Valley of Aoste.
Figure 3.11 Prosogram for the automatic determination of pitch starting from
the syllabic segmentation and of the glissando threshold (here equal to
0.32/sec²) (Mertens, 2004).
This threshold was established for pure sounds by Seargent and Harris (1962),
for synthetic vowels initially by Rossi (1971, 1978), and then by ’t Hart (1976),
who used a semitone scale. If the variation (assumed to be linear) is lower than
the threshold, the pitch movement will correspond to a static tone at a level
equivalent to 2/3 of the final frequency of the variation (rising or falling). If it is
higher, it will be perceived as a pitch variation and not as a static tone. Segmenting
the speech into syllable or phoneme-like units allows for the automatic replacement
of the actual pitch curve by either level segments or rising or falling variations,
the latter corresponding to a perceived glissando and consequently to a segment
perceived as stressed (Fig. 3.11).
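The decision between static tone and pitch movement can be sketched as a comparison between the speed of an F0 change and a duration-dependent threshold. The sketch assumes that the threshold has the form G divided by the squared duration, in semitones per second, with G = 0.32 as read from the caption of Figure 3.11, and that the variation is linear between its start and end values. The function name and the example values are hypothetical.

```python
import math


def classify_nucleus(f0_start_hz, f0_end_hz, duration_s, g=0.32):
    """Label a syllabic nucleus 'static', 'rise' or 'fall'.

    Assumed threshold form: g / duration_s**2, in semitones per second.
    """
    change_st = 12 * math.log2(f0_end_hz / f0_start_hz)
    speed_st_per_s = abs(change_st) / duration_s
    threshold = g / duration_s ** 2
    if speed_st_per_s < threshold:
        return "static"
    return "rise" if change_st > 0 else "fall"


# Invented nuclei: a small change over a short vowel stays below the
# threshold, a larger change over a longer vowel is above it.
short_small = classify_nucleus(150.0, 155.0, 0.10)
long_rise = classify_nucleus(150.0, 200.0, 0.20)
long_fall = classify_nucleus(200.0, 150.0, 0.20)
```

Since the duration enters the threshold squared, a small error in the segmentation can move a syllable from one class to the other.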
The problems of interpretation and reliability of the representation by a
prosogram are multiple. It is clear that the results depend directly on the
reliability of the speech segmentation, as segment durations enter directly
into the glissando value for this segment. In addition, the perception of the pitch
of harmonic sounds depends not only on the value of the fundamental fre-
quency but also on the frequency difference existing between two consecutive
harmonics as well as on the distribution of their respective intensities.
The occurrence of nonlinear pitch variation is also a problem. The nonlinear
variations of contours (convex, concave, or bell curved) are common, leading to
errors in the evaluation of a proper glissando threshold, whether the fundamental
frequency scale is linear or logarithmic. Moreover, the variation of intensity
inside the vowel is also a perturbing factor: Rossi (1978) showed that the
threshold of glissando decreases when the intensity increases inside the syllable,
whereas the threshold increases in the event of reduction in the intensity.
In fact this process detects only the differences possibly perceived between
pitch variation and the absence of variation, according to the value adopted for
the parameter of glissando. The interpretation which is often made from the
Prosogram (Simon et al., 2008) tends then to conclude that the only prominent
syllables are those realized with a glissando, although some syllables can be
prominent only because of longer duration, whether pitch variation is perceived
or not. Likewise the bell-curved pitch movements, frequently met in regional
or idiosyncratic realizations (e.g. in political discourse), could be wrongfully
interpreted as below the glissando threshold, as their fundamental frequency
values at the beginning and end of syllables could be close in value (Martin,
2013a).
ToBI
ToBI is one of the most popular systems, used almost everywhere today
for many languages. ToBI (acronym of Tones and Break Indices) is inspired
by earlier systems, notably first by Trager and Smith (1951), who use a
system with four tones, then by the description of Swedish accent by Bruce
(1977), the autosegmental theory elaborated in the wake of J. Goldsmith’s
(1976) work on Igbo (a language spoken in Nigeria), and the work of
Liberman and Prince (1977). This system is local and transcribes only
prosodic events assumed to be pertinent in some theoretical approach, such
as the autosegmental-metrical (AM) model.
A ToBI transcription of a sentence (Beckman et al., 2005) comprises four
tiers:
1. An orthographical tier;
2. A tier of break indices, noted from 1 to 4 on a perceptive scale related to
the cohesion degree perceived between units;
3. A tonal tier where the pitch events are consigned; these include the boundary
tones and the pitch accents;
4. A comment tier.
In an attempt to use universal features, the pitch events are noted by High (H)
and Low (L) tones, which are interpreted phonetically as pitch target points toward
which the considered pitch movements tend, or, according to some practitioners,
as points of inflection of the pitch curve. The attribution of H and L tones rests on
phonological definitions, together with a few phonetic variations.
An underlying phonological system is associated with every sequence with
the following basic elements elaborated for English:
Specific pitch movements on stressed syllables
H* High pitch target (peak accent);
L* Low pitch target (low accent);
L*+H Falling rising pitch movement (scooped accent);
L+H* Rising pitch movement (rising peak accent);
H+!H* High pitch target preceded by a high tone and followed by a fall
in terrace (downstep).
Sentence stress
L- Low, on a boundary of intermediate component
H- High on a boundary of intermediate component
!H High and going down in terrace (downstepping)
Boundary tones
L% Low and sentence final
%L Low and sentence initial
H% High and sentence final
%H High and sentence initial
The problem with this system of transcription stems from the fact that the
transcriber has to carry out simultaneously a perception and a phonological task
since some symbols require the identification of intermediate components in
the prosodic structure.
Moreover, slow or fast pitch rises (or falls), small or large
pitch excursions, and concave or convex variations may all be transcribed by the
same sequences of symbols (Fig. 3.12). The duration of prosodic events is not
transcribed in ToBI, as each one of the F0 movements in Figure 3.12
will be represented a priori by the same sequence L H*: (a), shorter than (b),
will be noted L H*, like (c), which has a smaller excursion than (a). By the
same token (a), (d) concave, and (e) convex will all three be transcribed L H*,
with a possible alternative L+H* for (d).
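The loss of information described here can be made concrete with a toy example: five rises that differ in duration, excursion, or curvature all receive the same tonal label. The data class, the labeling function, and the numerical values are invented for illustration; they only mimic the situation of Figure 3.12 and are not an implementation of ToBI labeling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rise:
    name: str
    duration_s: float
    excursion_st: float
    shape: str  # "linear", "concave" or "convex"


def tonal_label(rise):
    """Toy labeler: a rising movement is a low target followed by a high one.

    Duration, excursion size and curvature are deliberately ignored.
    """
    if rise.excursion_st <= 0:
        raise ValueError("this sketch only handles rising movements")
    return "L H*"


# Invented values mimicking contours (a) to (e).
rises = [
    Rise("a", 0.20, 6.0, "linear"),
    Rise("b", 0.35, 6.0, "linear"),   # longer than (a)
    Rise("c", 0.20, 3.0, "linear"),   # smaller excursion than (a)
    Rise("d", 0.20, 6.0, "concave"),
    Rise("e", 0.20, 6.0, "convex"),
]
labels = {tonal_label(r) for r in rises}
descriptions = {(r.duration_s, r.excursion_st, r.shape) for r in rises}
```

Five distinct acoustic descriptions collapse into a single label, so any contrast carried by duration or amplitude cannot be recovered from the transcription.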
This is especially a problem for French, as some melodic contours do
contrast by the amplitude of frequency change. Their transcription may result
in the same LH* sequence. However, recent adaptation suggests using some
kind of LHH* notation, which in practice means the abandonment of a bitonal
system (Jun, 2012).
Another possibly surprising characteristic of the ToBI system consists in
aligning the high tone H* with the end or the beginning of the stressed syllable,
in order to give a similar account of downward and rising contours, although
the distinction should be transcribed as L H* and H* L. In both cases the
syllable is regarded as high, the H* target being at the beginning of the syllable
or at the end (Fig. 3.13).
Figure 3.12 Variations of rising melody contours transcribed with the ToBI
notation: (a), (b), (c), and (e) are transcribed L H*, and (d) either L H* or L+H*.
Figure 3.13 In the ToBI transcription, a high tone can correspond to a rising
(on the left) or a falling contour (on the right), according to the alignment of
the high tone H* associated with the end or with the beginning of the syllable.
The H* tone as a high pitch point of the syllable can also be positioned in the
middle of the syllable to give an account of a bell-shaped contour. This process
tends to relegate an essential characteristic of pitch movements by representing
them mostly by a High tone. One finds here the old belief that syllabic prominence
is necessarily related to a positive acoustic feature associated with a high pitch.
By specifying the alignment of the tonal target inside the syllable, it is
possible, although in a not very intuitive way, to give an account for complex
F0 curved shapes. A recent attempt to define a French ToBI set of contours held
in Tarragona (Delais-Roussarie et al., 2015) revealed the difficulty of obtaining a
satisfactory set of prosodic events descriptions for French using ToBI symbols.
ToBI has been adapted to many languages: G-ToBI for German (Baumann
et al., 2001); ToDI for Dutch (Gussenhoven, 2005); K-ToBI for Korean (Beckman
& Jun, 1996); F-ToBI for French (Delais-Roussarie et al., 2015); ToBIt for Italian (Avesani,
1995); Sp-ToBI for Spanish (Beckman et al., 2002); Cat-ToBI for Catalan (Prieto,
2009); P-ToBI for Portuguese (Viana & Frota, 2007), and so on.
INTSINT and Momel

INTSINT (an acronym for INternational Transcription System for INTonation:
Hirst & Espesser, 1993; Hirst, 2005) is a hybrid transcription system proceed-
ing initially by a global representation of the complex pitch curve by a quad-
ratic function spline, followed by a symbolic representation of the resulting
function in terms of rises and falls:
a. Representation of the pitch variation – by nature complex and resulting from
multiple micro- and macromelodic processes – by a quadratic function
spline, with multiple regression to minimize the sum of squared differences
between the acoustic F0 curve and the spline function (Momel algorithm).
This graphic function has to meet a certain number of target points defined
manually or automatically, and appears as an approximation of the original
pitch variation. The optimal character of this transcription derives from the
possibility of recovering by re-synthesis a sentence judged perceptively
equivalent to the original from the prosodic point of view.
b. Coding of the spline function by a sequence of T (Top), H (Higher), U
(Upstepped), S (Same), M (Mid), D (Downstepped), L (Lower), B (Bottom)
symbols which encode the pitch variations indicated by the spline function.
This sequence is assumed to give a surface phonological representation. It
will be noticed that, as for ToBI, the temporal dimension is eliminated in this
representation and that the form of pitch contours remains implicit.
Furthermore, the coding of the pitch movements by symbols such as Higher and
Lower is skeletal and thus open to interpretation.
INTSINT appears as a data reduction process using only mathematical
criteria and is therefore neutral with respect to specific theories of intonation.
However, if the main advantage pertains to the elimination of most of the
micromelodic factors in the fundamental frequency curve, the benefits are not
clear compared to raw F0 data.
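The symbolic coding step can be illustrated with a deliberately simplified sketch that turns a sequence of target points into T, M, B, H, L, U, D and S symbols. This is not the published INTSINT coding procedure: the rules below (highest target coded T, lowest coded B, first target coded M, all other targets coded relative to the preceding one) and the two thresholds in semitones are assumptions made only to show how times and contour shapes vanish from the result.

```python
import math


def code_targets(targets_hz, same_st=0.5, small_step_st=3.0):
    """Code a sequence of F0 target points (Hz) with INTSINT-like symbols."""
    top, bottom = max(targets_hz), min(targets_hz)
    symbols = []
    previous = None
    for value in targets_hz:
        if value == top:
            symbol = "T"
        elif value == bottom:
            symbol = "B"
        elif previous is None:
            symbol = "M"
        else:
            step_st = 12 * math.log2(value / previous)
            if abs(step_st) <= same_st:
                symbol = "S"
            elif step_st > 0:
                symbol = "U" if step_st <= small_step_st else "H"
            else:
                symbol = "D" if -step_st <= small_step_st else "L"
        symbols.append(symbol)
        previous = value
    return symbols


# Invented target points for one sentence.
coded = code_targets([120.0, 180.0, 145.0, 144.0, 160.0, 110.0, 90.0])
```

The output keeps the order and relative direction of the targets, but neither their timing nor the shape of the movement between them.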
Analor
Analor (Avanzi et al., 2008) is a tool which has been conceived to model and
process semi-automatically prosodic constructions at the different levels of the
grammatical analysis. It is currently implemented as a Praat script, with a
specific display layout. With a set of assumed appropriate parameters (now
user selectable), it detects automatically boundaries of what the authors call
intonative periods, characterized originally by a melodic change of at least four
semitones, followed by a pause exceeding 300 ms and a difference between the
first melodic value after the pause and the last before the pause of at least three
semitones (Fig. 3.14). More generally, it can also detect prominent syllables,
possible candidates to be stressed, or accented syllables.
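The original boundary criteria quoted above translate directly into a small test. The thresholds (four semitones, 300 ms, three semitones) come from the text; how the melodic change before the pause is measured is not specified there, so taking it between the start and the end of the last melodic movement is an assumption, as are the function and parameter names.

```python
import math


def semitones(from_hz, to_hz):
    """Signed interval between two frequencies, in semitones."""
    return 12 * math.log2(to_hz / from_hz)


def is_period_boundary(gesture_start_hz, last_f0_before_pause_hz,
                       pause_s, first_f0_after_pause_hz,
                       min_gesture_st=4.0, min_pause_s=0.3, min_jump_st=3.0):
    """Check the three criteria for an intonative period boundary."""
    gesture = abs(semitones(gesture_start_hz, last_f0_before_pause_hz))
    jump = abs(semitones(last_f0_before_pause_hz, first_f0_after_pause_hz))
    return (gesture >= min_gesture_st
            and pause_s > min_pause_s
            and jump >= min_jump_st)


# Invented measurements: a rise from 140 to 190 Hz, a 450 ms pause,
# then a restart at 150 Hz.
boundary = is_period_boundary(140.0, 190.0, 0.45, 150.0)
# The same melody without a long enough pause is not detected.
no_pause = is_period_boundary(140.0, 190.0, 0.20, 150.0)
```

The second call shows the dependence on the pause criterion: a contour that is melodically identical is rejected as soon as the pause is too short.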
Improvements in the system are obtained by an automatic adjustment of the
Analor parameters from global properties of the speech sections transcribed
(Avanzi et al., 2011). Of course, results are directly dependent on segmentation
criteria, which may limit their phonological validity. In particular, many
occurrences of continuation majeure (IP boundary tones in AM terminology)
and even of conclusive melodic contours are not followed by a pause, which
limits the usefulness of this approach.
Figure 3.14 An example of an intonative period boundary detection from the
four parameters detailed above (from Lacheret-Dujour & Victorri, 2002). Panel
labels in the display: height of the gesture, duration of the pause, height of
the jump, presence of an “euh”.
Transcription as theory
Transcribing prosodic events obviously reduces the apparent complexity of
acoustic data and makes their interpretation apparently easier, but also
necessarily biased. A particular problem lies in the choice and definition of
phonetic and phonological transcriptions. A phonetic transcription of prosody
should include all the details that are a priori important for further phonological
description, but this implies the elaboration of a satisfactory list of features
without knowing in advance their phonological pertinence. It may then be
simpler to use the acoustic data themselves rather than a transcription of any
flavor. A phonological transcription of prosody, on the other hand, should
represent only data pertinent for the theoretical approach adopted, i.e. only
the features that would give a proper account of the role and function attributed
to prosodic events, for instance to the indication of the sentence prosodic
structure.
A system like ToBI is positioned somewhere between the phonetic and the
phonological domain. Indeed, as reflected by many reports of annotator con-
sistency (Wightman, 2002), transcriptions are made from a lexicon of prosodic
events (Frota, 2009), and are directly inspired by the shape of melodic curves in
the language in question. In this sense, ToBI is close to the principles governing
International Phonetic Alphabet (IPA) elaboration.
Badiou (1969) and Ochs (1979) have demonstrated that the choice of a
transcription system determines the theory that uses this system, whereas, in
linguistics, and in phonology in particular, it should be the reverse: the models
derived from a theory should determine the transcription system used to
analyze the experimental data. Although this observation seems rather obvious,
this opinion is not shared by everybody, as new researchers introduced to the
field accept without question the dominant notation system. Their choice may
of course be pivotal in the interpretation of data later.
The transcription system I adopted reflects directly and explicitly the
theoretical assumptions underlying the analysis, and is therefore clearly
phonological. Indeed, the set of descriptive features chosen directly reflects
the relations of dependency of prosodic events in their prosodic context,
relations directly indicating the sentence prosodic structure. Another
possibility, of course,
would be to ignore this phenomenon at least in a first step, for example by
installing prosodic events in some deep phonological structure, even if it means
proposing adjustment rules that would give a proper account of the observed
data.
I do not believe this is a suitable solution in phonology, as it allows any
appropriate mechanism to be built that would explain prosodic (and other)
data, whatever they are. It is always possible to find a set of rules that
would generate the proper forms of ANY linguistic events from ANY deep
structure form (cf. the Ugly Duckling theorem in logic, Watanabe, 1969).
While applying Occam’s razor principle of simplicity would give some
warning signal to linguists, it seems that working in a deep structure
environment would give too many possibilities in their phonological
descriptions, at the expense, often delayed for better times, of proposing
appropriate rules generating surface structure forms that would correspond
to data in a more satisfactory manner.
Perception and interpretation

While transcribing intonation, the operator does not act as a naive listener and
utilizes, consciously or not, through a binding process, knowledge that the
speaker or the listener does not necessarily have. Among many other possibi-
lities, the operator can listen at will to the segments to be transcribed, some-
times even at slower speed in order to enhance auditory capabilities. The
transcriber may also be influenced by some underlying prosodic theory when
difficult decisions must be made concerning the use of one or another symbol,
the intonative characteristics of his or her native tongue, or even the volume of
sound reproduction.
Experiments on identification of French prominent syllables, for example,
usually show a high rate of discordances in the judgments of the listeners. Poiré
(2006) observed only 19% to 49% agreement in the judgment made on syllabic
prominence on the same text by expert phoneticians.
A question then arises: how can experts, familiar with the techniques of
transcription as well as specialists in prosody, who are free to listen at will to the
speech segments in which they were to detect syllabic prominences, put forth
such divergent judgments? Is syllabic prominence such an elusive
concept?
We know from Saussure that a linguistic object does not exist if it is deprived
of signification. The segmentation and the categorization of the prosodic events
rest on knowledge of which the speaker is, a priori, unconscious. Thus, the
user of the “syllabic prominence” object operates by one process among other
unconscious processes while decoding a sentence. The ordinary listener is not
consciously aware of a possible function of perceived prominences. The
semiotician and linguist Luis Prieto (1975) would say that the listener uses a
system of primary categorization elaborated during the acquisition and the
practice of the language.
The situation is quite different when the ordinary listener becomes conscious
of the operation, for example if asked for a judgment on syllabic prominence,
with instructions such as “do you perceive this syllable as prominent, accen-
tuated, or insistent?” This kind of instruction will bring into play another type
of knowledge, implying secondary systems of categorizations operated by the
listener. It is clearly the strategy adopted by Simon et al. (2008) to obtain 93
percent agreement in the transcription of prominences in a standard text. It
seems illusory to think that the automation of perception will protect us from
an erroneous linguistic interpretation. Such a process will only give us results
based on a set of assumptions, embedded in the perception of software algo-
rithms, which may be linked only remotely to the perception of intonation by
the listeners.
A phonological transcription system

The AM model of the prosodic structure does not give any phonological role to
metrical strong syllables (bearing “pitch accents”) in accent phrases to indicate
its structure. Instead, the variation sometimes described in AM-inspired
work is essentially phonetic, as, for example, in French (Delais-Roussarie et al.,
2015). It then seems only reasonable to propose a more appropriate set of
notations, even at the risk of being rejected by less adventurous readers who
have grown too comfortable in their ToBI armchairs.
To avoid all these difficulties, I will use a top-down approach instead of the
bottom-up approach of the systems described above. Indeed, in a top-down approach, the
choice of levels or contours or other auditory or acoustic features becomes
secondary, provided that the features selected give an account of the starting
assumption. Briefly stated, the representation of intonation starts from
the assumed existence of a prosodic structure, independent from the syntac-
tic structure. This sentence prosodic structure hierarchically and dynami-
cally organizes prosodic words (corresponding to segments of prosody
containing only one primary stressed syllable), and obeys the following
constraints:
1. A minimum and maximum duration for prosodic words. Stress clashes are
resolved either by merging the corresponding stress groups that are clashing
or by inserting a time gap ensuring the minimum prosodic word duration.
2. Eurhythmy, aiming either at balancing the number of syllables in
prosodic groups at the same level in the prosodic structure, or at a rhythmic
adaptation that compensates for an imbalance in the number of syllables in
those groups (essentially for read speech).
3. Syntactic alignment, preventing the prosodic grouping of syntactic units
dominated immediately by different nodes in the syntactic structure.
Violation of this constraint is allowed, but at the expense of more elaborate
cognitive processing by listeners.
In Chapter 5, I will propose some paths to explain the origin of these
constraints.
It follows that a prosodic structure is not necessarily congruent with a given
syntactic structure, and that in general more than one prosodic structure can be
associated with a given syntactic structure and therefore with a given text.
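The point that one text admits several prosodic structures can be illustrated
with a minimal sketch, given here under simplifying assumptions. It enumerates
every way of splitting a sequence of prosodic words into contiguous groups on a
single level, and ranks the splits with a hypothetical eurhythmy score (the
spread between the largest and smallest syllable totals). The function names,
the score, and the syllable counts are illustrative assumptions, not part of
the model; the duration and syntactic alignment constraints are ignored.

```python
from itertools import combinations

def groupings(counts):
    """Yield every split of `counts` into contiguous, non-empty groups.

    `counts` holds one syllable count per prosodic word, in text order.
    """
    n = len(counts)
    for k in range(n):  # k = number of internal group boundaries
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [counts[i:j] for i, j in zip(bounds, bounds[1:])]

def imbalance(grouping):
    """Hypothetical eurhythmy score: spread of syllable totals across groups."""
    totals = [sum(group) for group in grouping]
    return max(totals) - min(totals)

# Four prosodic words with invented syllable counts.
counts = [2, 3, 1, 4]
candidates = list(groupings(counts))
two_groups = [g for g in candidates if len(g) == 2]
best = min(two_groups, key=imbalance)
print(len(candidates), best, imbalance(best))
```

With n prosodic words there are 2^(n-1) such single-level splits, which gives
an idea of how quickly the number of candidate structures grows before any
constraint is applied.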
By a top-down approach, the resulting representation of intonation clearly
reveals the underlying hypotheses made about a function selected by the
transcriber for intonation. By modifying the starting set of assumptions, one
can easily adapt the whole deductive process to represent other characteristics
of intonation, such as stylistic or pragmatic ones.
4 The Autosegmental-Metrical Prosodic Structure
A brief description
Given that the Autosegmental-Metrical (AM) model has been and is still
dominant in the field of intonation phonology, I will briefly recall some of its
essential characteristics and its associated prosodic transcription system, ToBI,
relying on a recent book presenting a relatively up-to-date version of the AM
model (Feldhausen, 2010). The main goal of this model is to “explain the
complexity and the diversity of fundamental frequency (F0) contours” where
pitch is the perceived F0 and intonation the variation of the fundamental
frequency while speaking. Intonation corresponds to “the overall melody of
an utterance, as reflected by its tonal or F0 contour” (Hualde, 2003). In the AM
model, however, only particular points of the utterance are specified for tone.
These points are either prominent syllables or phrasal boundaries at the pho-
nological level. The rest of the contour is filled in by phonetic interpolation
between tonally specified points, assumed to correspond to actual fundamental
frequency values.
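The interpolation step can be sketched as follows. This is only an
illustration: the text above does not specify the interpolation function, so
linear interpolation on a Hz scale is an assumption, and the target times and
F0 values are invented.

```python
def interpolate_f0(targets, times):
    """Linearly interpolate F0 (Hz) at `times` between tonal targets.

    `targets` is a list of (time_s, f0_hz) pairs with strictly increasing
    times. Before the first and after the last target, the nearest target
    value is held (an assumption of this sketch).
    """
    values = []
    for t in times:
        if t <= targets[0][0]:
            values.append(targets[0][1])
            continue
        if t >= targets[-1][0]:
            values.append(targets[-1][1])
            continue
        for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
            if t0 <= t <= t1:
                values.append(f0 + (f1 - f0) * (t - t0) / (t1 - t0))
                break
    return values

# Hypothetical targets: a low tone, a high pitch accent, a low boundary tone.
targets = [(0.0, 120.0), (0.5, 180.0), (1.0, 100.0)]
print(interpolate_f0(targets, [0.25, 0.75]))
```

Only the three target points carry phonological information in this view; every
other F0 value is derived from them.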
There are two types of tonal units (also called prosodic events): pitch
accents and boundary (edge) tones. Pitch accents are associated with
metrically strong syllables of a word or a sentence. They are strictly
locally determined, do not interact with each other, and are categorically
distinct from the other prosodic events. Only two tones, H(igh) and L(ow),
suffice to describe tonal units, which can be monotonal, bitonal, or tritonal.
Bitonal pitch accents combine two basic tones (e.g. L+H*, H*+L, etc.), the *
symbol indicating the association of the tone with a pitch accent and thus
with a metrically strong syllable.
Boundary tones mark the edge of prosodic constituents. Pierrehumbert
(1980) distinguished two kinds of edge tones: final boundary tones (noted
L% or H%, but %L, %H initial boundary tones may also exist), and phrase
accents (annotated L- or H-). Boundary tones mark the start or the end of an
intonation phrase (IP). They are also independent from pitch accents. Their
function is to mark limits (the “left” or “right” side) of IPs, the higher level of
Accent Phrases grouping in the prosodic structure.
Phrase accents are freestanding unstarred tones, not aligned a priori on any
syllable, but their existence was questioned by Beckman and Pierrehumbert
(1986) and replaced by intermediate (intonative) phrases (ip’s) as an additional
component of the prosodic hierarchy. Phrase accents were thus replaced by
ip boundary tones. Today, the actual existence of the ip is debated: according
to Selkirk (2005), ip’s are not marked by specific boundary tones, whereas for
Jun (2005) some markers do exist, for example in Korean or in French
(Michelas & D’Imperio, 2010).
All these concepts operate in the framework of autosegmental-metrical pho-
nology, metrical following Liberman and Prince (1977) and autosegmental
according to the work of Leben (1971) and Goldsmith (1976). Essentially,
autosegmental means that sequences of tones (as annotated with ToBI combina-
tions of H and L symbols) behave independently from the syllabic segments to
which they are associated. Tones are thus autosegments, each tone being asso-
ciated with one or more elements in the tonal structure called the Tone Bearing
Unit (TBU). On the other hand, they are metrical because relative prominence is
assigned through metrical grids to elements within phrases composing the
utterance. The claimed advantage of a metrical grid over a simple assignment
of stressed syllables is to allow degrees of accent determined by rules such as
S(trong) → w(eak) w(eak) . . . s(trong), i.e. a node generates two or more
syllables with the last one being the strongest. In other words, the rightmost
syllable in node branches is supposed to have maximal prominence. From a
hierarchy of syllables up to the root, it is then possible to calculate the degree of
stress of each syllable, as shown in Figure 4.1. By representing these degrees of
stress on a grid, it is also possible to predict events caused by stress clash (i.e. the
presence of two consecutive stressed syllables; cf. Dell, 1984).
The degrees of prominence, represented by the number of symbols * placed
on the grid on top of syllables, are obtained by applying the s → w w . . . s rule
on nodes of a metrical tree congruent to the syntactic structure of the sentence
considered (Fig. 4.2). Counting the number of strong (s) nodes encountered
while percolating up to the root of this tree gives the degree of stress
(1 is the lowest, and 5 the highest stress).
Figure 4.1 An example of metrical grid (from Bonami & Delais-Roussarie,
2006).
[Figure 4.2: binary metrical tree with three levels of w/s branching; the
resulting degrees of stress are 1, 2, 2, and 3.]
Figure 4.2 Degrees of stress obtained by counting the number of stress nodes.
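The counting procedure can be sketched in a few lines. The tree below is one
reading of Figure 4.2 (a root dominating three levels of binary w/s branching)
and is an assumption, as are the tuple representation and the function name.

```python
def stress_degrees(node, strong_above=0):
    """Return (label, degree) for each terminal, left to right.

    `node` is a (label, children) pair. The degree of a terminal is the
    number of nodes labeled "s" on the path from that terminal to the root,
    the terminal itself included.
    """
    label, children = node
    count = strong_above + (1 if label == "s" else 0)
    if not children:
        return [(label, count)]
    terminals = []
    for child in children:
        terminals.extend(stress_degrees(child, count))
    return terminals

def pair(label):
    """A node of the given label dominating a weak and a strong terminal."""
    return (label, [("w", []), ("s", [])])

tree = ("root", [("w", [pair("w"), pair("s")]),
                 ("s", [pair("w"), pair("s")])])

strong = [degree for label, degree in stress_degrees(tree) if label == "s"]
print(strong)
```

On this tree the strong terminals receive the degrees 1, 2, 2, and 3, the
rightmost one being the most prominent.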
Metrical trees are built congruent to the sentence syntactic tree, implicitly
admitting that syntax indirectly determines the degrees of stress. The hypothesis
of congruence has since been abandoned, but the idea that a given prosodic event,
for example a pitch accent, has a dependency relation with another prosodic
event located on its right (i.e. in the future of the current prosodic events) is one of
the founding principles prevailing for the indication of the prosodic structure in
many non-AM approaches (see Martin, 1975, 1987; Mertens, 1987).
Properties
In the AM model, the overall resulting structure is as follows: syllables group
segments, feet group syllables, Accent Phrases (APs) group feet, intermediate
(intonative) phrases (ip’s) group APs, Intonational Phrases (IPs) group inter-
mediate phrases (ip’s), and utterances group intonation phrases. Furthermore,
an AP is assumed to contain either a verb, an adverb, an adjective, or a noun,
and optionally a grammatical word, such as a conjunction, a determiner, etc.
The metrically strongest syllable usually belongs to one of the content (i.e. non-
grammatical) elements. A specific constraint called the Strict Layer Hypothesis
(SLH) applies to this structure (Nespor & Vogel, 1986; Selkirk, 2005). The
SLH specifies that:
1. A given nonterminal unit of the prosodic hierarchy is composed of one or
more units of the immediate lower category.
2. A unit of a given level of the hierarchy is exhaustively contained in the
superordinate unit of which it is a part.
3. The hierarchical structures of prosodic phonology are n-ary branching.
4. The relative prominence relation defined for sister nodes is such that one
node is assigned the value strong (s) and all other nodes are assigned the
value weak (w).
Stated more simply, the prosodic hierarchy is non-recursive (objects of one
layer group objects of another nature, condition 1), it is a structure
(condition 2), this structure is non-binary (n-ary branching is allowed,
condition 3), and the metrical w-s labeling may therefore be specified for
more than two nodes (condition 4), the last one being strong.
The AM model assumes that pitch accents do not interact with each other:
perhaps this is due to the ToBI transcription and the reminiscence of phonemes.
On the contrary, I show throughout this book that pitch accent realizations in
Romance languages are highly dependent on the configuration of the prosodic
structure and that pitch accents do effectively interact with each other. Of
course, it is always possible to relegate the apparent observed dependence of
melodic contours on the complexity of the sentence to some surface structure,
leaving the immutable character of pitch accents (and possible boundary tones)
to a deep structure, with a set of ad hoc rules ensuring the path from the deep
level to the surface level.
Nevertheless, the AM prosodic model appears at first as a reasonable
abstraction built from (limited) data observation, inspired by the then fashion-
able autosegmental and metrical theoretical concepts. The SLH may seem
appropriate, but compared to the current definitions of the syntactic structure,
it just adds the possibility of n-branching, breaking the binary taboo applied to
syntax in those days. The possibility of having a w w . . . s pattern excluding
binarism emerged as an inevitable consequence of considering examples of
simple enumerations of APs in the data.
On the other hand, the intermediate (intonative) phrase (ip) cannot be ended
with boundary tones, even if they had characteristics other than IP boundary
tones, as the prosodic structure would then become (partially) recursive, and
therefore not complying with the SLH. The now extinct phrase accents there-
fore should end ip’s and function as ip boundary tones.
Pierrehumbert (1980) postulated the Intonational Phrase (IP) as the only
intonational constituent in American English, but later work with Beckman
(Pierrehumbert & Beckman, 1988) suggested the need for a second level of
intonationally defined constituent, the intermediate intonational phrase (ip).
This constituent is defined as an intonation contour with one or more pitch
accents and a phrase accent, but no boundary tone.
Researchers gathering more and more data, even if these data belonged to the
laboratory phonology field, felt obliged to consider, timidly at first and then
more loudly, the existence of an intermediate layer in the prosodic structure.
Interestingly, a renowned specialist such as Ladd admitted the need for this
intermediate level, but then changed his mind to justify the first version of
the prosodic structure, and therefore remained inside the mainstream (see Ladd,
1996, and his revised position, 2008, on this question).
Consequently, if this hypothesis is retained, the added layer ip makes the
structure partially recursive, except for the fact that, contrary to IP, ip does not
have boundary tones (or does it?). In any case, I will consider in this book that
IP groups one or more ip’s, and that every ip groups one or more accent
phrases (APs).
In summary, the prosodic structure presents three levels (AP, ip, and IP),
grouped into the Utterance:
1. AP, accent phrases, with one stressed syllable on a content word (verb,
adverb, adjective, or noun);
2. ip, intermediate (intonation) phrase, containing one or more APs;
3. IP, Intonation Phrase, containing one or more ip’s;
4. U, the Utterance, containing one or more IPs.
At this point, there are no assumptions whatsoever about any correspon-
dence with any unit belonging to the syntactic or semantic domains, except that
APs must contain one content (i.e. non-grammatical) word.
The prosodic structure is also planar, which means that the graph that
represents the hierarchical grouping of units, be it AP, ip, or IP, is planar (i.e.
tree branches representing the grouping of APs into ip, of ip’s into IP, or of IPs
into U do not cross when drawn in a two-dimensional space like a piece of
paper).
The prosodic structure is also connected, which means that no unit, whether
AP, ip, or IP is floating, i.e. not belonging to a unit of higher rank, i.e. an AP to
an ip, an ip to an IP, an IP to a U. In reality, especially in spontaneous speech,
IPs can float and not maintain any dependency relation with another IP when
prosodic parentheses are embedded in the sentence (see Chapter 8 on
macrosyntax).
These properties mean, among other things, that a prosodic structure may
contain only one AP, which would be at the same time the unique ip and unique
IP of the prosodic structure.
In summary, the dominant concept of prosodic structure derived from AM
theory organizes Accent Phrases (APs) hierarchically in one or more levels.
These APs are supposed to contain one content word bearing a pitch accent
(verb, adverb, noun, or adjective) around which may revolve one or more
grammatical words (conjunction, pronoun, preposition, etc.). The APs bear a
melodic accent (pitch accent) placed on the AP content word’s metrically
strong syllable.
In a complex prosodic structure, a first assembly of APs forms an inter-
mediate (intonation) phrase (ip), bearing some prosodic marker distinct from a
boundary tone. The grouping of these ip’s constitutes an Intonation Phrase (IP)
ended with a boundary tone. Finally, the sequence of IPs forms the prosodic
structure, terminated by another type of boundary tone, conclusive and ending
the sentence.
A given prosodic structure does not necessarily present all these levels of
hierarchy. A prosodic structure can be “flat,” as an enumeration of a sequence
of APs assembled in one single level. A two-level prosodic structure groups the
APs into IPs, and the IPs into the sentence prosodic structure. A three-level
prosodic structure groups the APs into ip’s, the ip’s into IPs, and the IPs
into the final prosodic structure. In this formalism, the most complex prosodic
structure presents three levels, and in its original form (Selkirk, 1978), with
only two levels, it is not recursive (Fig. 4.3), as each level groups units of
different types (AP and IP).
[Figure 4.3: layered diagram showing, from top to bottom, the Utterance,
Intonational Phrase, intermediate phrase (ip), Prosodic Word (ω), Foot (F),
and Syllable (σ) tiers, above the segmental structure and the tonal structure
(H* L*+H H* H*+L, followed by a final L edge tone).]
Figure 4.3 The (revised) Autosegmental-Metrical Prosodic Structure (Too
many cooks spoil the broth) (from Gussenhoven, 2004, Feldhausen, 2010).
In addition, the definition of the AM prosodic structure in the literature has
more than one version, depending on whether the IP is considered to be the
largest intonative unit in the sentence or not, and on whether the existence of
ip’s is considered or not.
Applying the concept

In French, the definition of the prosodic structure raises troubling questions,
given the final position of stressed syllables in APs and the absence of lexical
stress. The stress located on AP final syllables could be analyzed either as
an instantiation of pitch accents, as a manifestation of boundary tones, or as
both. One of the arguments often put forward to sort out these categories bears
on the glissando criterion
(Rossi, 1971, 1978; Mertens, 2004). If the melodic contour located on these
stressed syllables realizes a glissando value above a certain threshold, i.e. if the
rate of melodic change exceeds this threshold, the contour would be considered
as a boundary tone. If not, it will be classified as the manifestation of a syllabic
prominence (a pitch accent).
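The decision rule just described can be sketched as follows. The sketch
assumes that the rate of melodic change is measured in semitones per second
between the start and the end of the contour, and it takes the threshold as a
parameter supplied by the caller; the function names and the numeric values in
the usage lines are hypothetical and are not taken from Rossi or Mertens.

```python
import math

def semitones(f_start_hz, f_end_hz):
    """Signed interval between two F0 values, in semitones."""
    return 12 * math.log2(f_end_hz / f_start_hz)

def classify_final_contour(f_start_hz, f_end_hz, duration_s, threshold_st_per_s):
    """Apply a glissando-style criterion to the contour on a stressed syllable.

    The contour counts as a boundary tone when its rate of melodic change
    exceeds the threshold, and as a pitch accent otherwise.
    """
    rate = abs(semitones(f_start_hz, f_end_hz)) / duration_s
    return "boundary tone" if rate > threshold_st_per_s else "pitch accent"

# Hypothetical threshold of 20 semitones per second, for illustration only.
print(classify_final_contour(100.0, 200.0, 0.2, 20.0))
print(classify_final_contour(100.0, 103.0, 0.2, 20.0))
```

The classification depends entirely on the threshold chosen, which is why the
sketch leaves that value to the caller.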
Another problem in French, which has only group stress, is that an AP
can contain more than one content (lexical) word, as well as a single
grammatical word, which seems to make the AM approach per se
inapplicable to languages lacking lexical stress (Martin, 2012a). Other
languages such as Korean apparently share this property (Jun, 2005).
Romance languages other than French have lexical stress and therefore pitch
accents, and the final position of stress in an AP is somewhat rare. Even in this
case one can observe a so-called complex contour, showing two consecutive
melodic variations on the same final stressed syllable, the first manifesting the
pitch accent, the second the boundary tone, usually rising if ending sequences
of APs in an IP. This complex melodic variation is particularly evident in Italian
data.
A good summary of operations to describe the intonational system in the
AM framework is given by S. Frota et al. (2013). Accordingly, the under-
standing of the intonation system in any intonation language requires the
knowledge of the following:
• The prosodic structure (specifying domains, edges, and heads);
• An intonational lexicon (inventory of observed pitch accents and edge
tones, and the meanings they convey in context/usage);
• The relevant domain for pitch accent distribution;
• The distributional constraints (“tonotactics”), as some tonal events only
appear in/are banned from certain positions, or may not co-occur with
others;
• The implementation rules;
• Some early/late alignment; spreading/interpolation; contextual upstep (¡) /
downstep (!); compression/truncation.
It seems that the biggest problem pertains to the first step: how can the
prosodic structure be known for a given sentence? In practice, the prosodic
structure is inferred from the sentence syntactic hierarchy, applying the SLH
constraint limiting the number of levels to two. This may explain why in most
AM papers devoted to intonational systems, the examples treated are so short.
Furthermore, they generally admit that sentences permit one and only one
prosodic structure. In examples showing an obvious discrepancy with syntax
(i.e. non-congruence), some restructuring rule will be proposed, allowing a
derivation from a congruent basic form to the structure actually observed in
the data (cf. Post, 1999).
Assuming the congruence hypothesis is always valid (and it is the case
most of the time for the very short read sentences on which the AM-driven
intonational system will be based), it is relatively easy to build an intona-
tional lexicon simply by describing realized sentence intonation patterns
for various conditions with the ToBI notation system, i.e. different sen-
tence modalities, such as declarative, imperative, and interrogative and
their variants. The intonational lexicon can then be completed with a
description of IP boundary tones and of the metrically strongest
syllable of an AP (the stressed syllable).
According to Frota and Prieto (2015), the conditions currently used to gather
experimental data are:
• For boundary tones: Declarative (broad and narrow focus), Interrogative
(broad and narrow focus), Prenuclear interrogative, Postnuclear declarative,
Enumeration question, Wh question, Continuation (and lists) of non-final IP
and ip;
• For accent phrases: the presence of a pitch accent.
The transcription of a given sentence must use the available intonational
lexicon, picking up the best symbol in the pitch accent and boundary tone
lists for the considered language.
The next step pertains to the description of the AP domain for pitch accents,
e.g. noun, verb, adverb, and adjective with their associated function words.
Tonotactics is the next stage, attempting to discover conditional realizations
or deletions of some tonal events after the above steps are completed, and
obtaining a list and possibly a grammar of well-formed sequences.
Finally, refined descriptions will proceed to verify the adequacy to experi-
mental fundamental frequency data, by defining some upstep or downstep tonal
adjustments, etc.
Most of the AM-inspired studies mentioned below present phonological and
phonetic descriptions of tonal events together. However, it is sometimes
difficult to distinguish variants from basic phonological forms.
In Frota et al. (2013), the intonative lexicon is elaborated from a list of
elicited sentences following a standardized questionnaire. From the categories
involved, one can easily observe that the results will mix phonological, syntac-
tical, semantic, and informational data, all obtained on sentences of rather
limited length. Furthermore, the context elaborated before the expected sub-
ject’s realizations may possibly require professional actors to get some reliable
results. Most if not all these categories pertain to the sentence modality rather
than to the prosodic structure itself as a hierarchical organization of APs.
A questionnaire proposes specific contexts of elicitation. Some are straight-
forward, such as alternative questions (called here disjunction interrogative).
For example: context: You bought vanilla and hazelnut ice cream for your
birthday. Ask guests if they want vanilla or hazelnut.
Vous voulez de la glace à la vanille ou à la noisette?
“Do you want vanilla or hazelnut ice cream?”
Other contexts and resulting categories appear quite problematic, as, for exam-
ple, one called “Partial question anti-expectation.” Context: Your neighbor tells
you she dined at a restaurant and she ordered rabbit with onions. Completely
convinced, she says they gave her cat instead of rabbit. You cannot believe her.
Ask her (very surprised) what she says they gave her (from Frota and Prieto, 2015).
Qu’est-ce que tu dis qu’ils t’ont donné? “What do you say they gave you?”
These categories appear somewhat non-orthogonal, viewed from a functional
point of view. Furthermore, the examples seem difficult to read according
to the announced category, even with an explicit context. In many examples,
the intonative meaning could very well be carried by the text itself rather than
by the sentence intonation. It seems that not only a professional actor but also a
professional film or theater director would be required for such experiments (I
personally had to resort to this kind of help for a similar research project).
The mixture of phonetic and phonological aspects in intonation descriptions
of various Romance languages was apparent at the PaPI 2011 meeting (Delais-
Roussarie et al. for French, Frota et al. for Portuguese, Fivela et al. for Italian,
Hualde et al. for Spanish, Prieto et al., 2009, 2014, for Catalan, and Jitcă et al.
for Romanian, see Frota & Prieto, 2015), and the intonative lexicon for various
Romance languages emerged from the Romance ToBI workshop held in
Tarragona, on June 23, 2011.
These descriptions lead to inventories of monotonal and bitonal pitch
accents and boundary tones, attached to specific, essentially semantic
categories, such as narrow and broad focus statements and questions,
counter-expectational echo questions (!), vocative chant, insisting call, etc.,
and all kinds of combinations of falling, level, and rising tones supposedly
describing the overall sentence melodic contours. As an example, the proposed
intonational lexicon for Italian given in the workshop includes, for pitch
accents, Low L*, High H*, Fall H+L*, Upstep fall !H+L*, Rise-fall H*+L,
Rise L+H*, Upstep rise L+!H*, Rise late peak L+>H*, Fall-rise L*+H, L+H*+L
and H+L*+H. For boundary tones, Falling or low L-, Rising or high H-, Mid !H-,
Falling HL- (ending ip’s), and Falling or level low L-L%, Rising or level high
H-H%, Falling or low-rise L-H%, Rising or high-fall H-L%, Up/down to mid or
level mid !H-!H% and Up/down to mid fall !H-L% for IPs.
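The notational conventions behind such a lexicon can be made explicit with a
small sketch. It relies only on the diacritics used in the labels above (a
starred tone for pitch accents, a final hyphen for tones ending ip’s, and the %
sign for IP boundary tones); the function name and the returned category names
are assumptions made for the example.

```python
def classify_label(label):
    """Sort a ToBI-style label into a category from its diacritics alone."""
    if "%" in label:
        return "IP boundary tone"
    if "*" in label:
        return "pitch accent"
    if label.endswith("-"):
        return "ip boundary tone"
    raise ValueError(f"unrecognized label: {label!r}")

for label in ["L*", "L+H*", "L+>H*", "H+L*+H", "!H-", "HL-", "L-H%", "!H-!H%"]:
    print(label, "->", classify_label(label))
```

Such a classifier says nothing about the phonological relevance of a label: it
only reads off the category that the notation itself presupposes.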
Questions and remarks

What does the AM model tell us about the prosodic structure?
1. There exists a hierarchy grouping APs into IPs (many examples are too
short to observe the existence and behavior of ip’s).
2. The AP’s pitch accents have no function other than to signal the presence
of APs themselves in the vicinity of the stressed syllable (i.e. the
strongest syllable in the domain), despite the metrical grid assigning
degrees of stress to pitch accents. Nevertheless, their description remains
purely phonetic.
3. Metrical grids are actually seldom used, as pitch accents are simply
associated with stressed syllables so that in the absence of emphatic
stress, the number of APs equals the number of lexical stressed
syllables.
4. Declarative sentences (statements) end with an L*L% sequence (L* being
a pitch accent, L% a boundary tone). L* is aligned on the last stressed
syllable of the sentence, whereas L% is on the last syllable, stressed or not
(this observation is re-examined in Chapter 7).
5. Likewise, interrogative sentences end with an H*H% sequence, with a
pitch accent H* on the final stressed vowel and a boundary H% tone on the
final syllable.
6. The definition of the intonation phrase (IP) is circular: What is an IP? By
definition, an IP is a segment of prosody ended with a boundary tone; what
is a boundary tone? By definition, a boundary tone is a tone ending an IP.
Adding another condition such as the alignment of a boundary tone on a
syntactic boundary is inevitable, bringing syntax into the definition.
Boundary tones should be defined on prosodic criteria alone.
7. The only contrastive aspects of the analyses pertain to the prosodic struc-
ture modality (i.e. declarative vs. interrogative and their variants).
8. The prosodic structure is a hierarchy (non-recursive or partially recursive
depending on the absence or the presence of an ip). However, what is its
linguistic function (if any)? No explanation principle is given.
9. Why use the ToBI notation system? This is partially justified by the principle
of independence of pitch accents relative to the prosodic structure and their
assumed lack of interaction with each other. Pitch accents are then categori-
cally distinct by design and behave just like phonemes. Differences in
melodic realizations are considered purely phonetic, and described by the
alignment of the target tone on the stressed syllable (or vowel).
10. How does one establish the phonological relevance of tones (pitch accents
and boundary tones)? It seems that in practice they are assumed pertinent
just because they are there, aligned on the strongest syllable (i.e. the
syllable carrying the main stress). The problem is that many descriptions
of sentence intonation based on ToBI annotation tend to
mix phonetics and phonology, as most of the time no criterion is used to
distinguish between them.
11. Is the prosodic structure congruent to syntax? Are there any constraints
limiting the configuration of prosodic structures? Can more than one prosodic
structure be assigned to a single syntactic structure or is it unique?
12. What happened to the time parameter? There is no specification of dura-
tions relative to prosodic events or APs.
13. Should the prosodic structure derive from the syntactic structure?
Apparently, this idea is still predominant in current research.
14. The only aspect of the analyzed sentences not linked to modality pertains to
focalization (broad focus vs. narrow focus), recovering classical descriptions
known under other terminology (e.g. theme-rheme, thème-propos in
French, etc.).
As a model, the AM prosodic structure acts as a filter or a sort of microscope
helping to select the data assumed to be relevant and pertinent just because they
are “seen” through the model. Although this process is frequent in scientific
research, the AM filter does not rely on any explanatory principle other than to
assume that sentence intonation prosodic events are organized hierarchically in
a structure (this had already been stated years ago: see Martin, 1973, 1975,
among others). Other approaches were proposed some twenty years ago for
French, as detailed in Lacheret-Dujour & Beaugendre (1999).
The prosodic structure revisited

A few other questionable points (Martin, 2012a), addressed later in this book,
should be mentioned:
1. The AM prosodic structure is non-recursive. This property has already been
discussed by various authors, and in my opinion originates, among other
reasons, from the fact that very short sentences, in limited number, were
used as experimental justification for this property. Indeed, exclusive use of
limited-length sentences prevents the observation of recursivity, even in
English (Ladd, 2008).
2. Descriptions of prosodic events underlying a prosodic structure do not take
duration parameters into account. The ToBI system has no explicit provision
for describing temporal aspects of sentence intonation other than the perceived
break durations (which are seldom used and are impressionistic in nature).
3. While other transcription systems either are available or could be more or
less easily adapted to fit specific properties of a given language, the quasi-exclusive
use of the ToBI system, even customized, involves an oversimplification
of the description of melodic events. This oversimplification is
sometimes compensated for at a later stage by complex tone alignment rules
aimed at better accounting for the phonetic details of melodic movements.
Furthermore, in practice, transcriptions are often impressionistic: the link
between some specific ToBI sequences and the actual data seems inspired
more by theoretical necessity than by the facts.
4. In many instances, confusion exists between phonological and phonetic
descriptions of prosodic events. Some authors give descriptions so detailed
that they appear purely phonetic rather than phonological.
5. Contextual properties of prosodic events are ignored (they are excluded by
design), as there seems to be a strong underlying assumption that prosodic
events share properties similar to phonemes. This aspect is intriguing, as the
AM approach was proposed to address the possible effect of context in the
realization of melodic contours. Contextual rules may appear only on
the surface structure but are generally not considered as inherent properties,
whether applied to pitch accents or boundary tones.
6. In early versions of the AM framework, the prosodic structure was assumed
to be congruent with the sentence syntactic structure. This implies that only
one prosodic structure could be associated with a given sentence (except for
syntactically “ambiguous” sentences). Even if congruence with syntax is
not necessarily retained today as a hypothesis, it is rare to find an author
considering the possibility of associating more than one prosodic structure
with a given syntactic structure.
7. Like other less-known theoretical approaches, AM ignores a basic property
of sentence intonation, i.e. that it is carried by prosodic events
encoded sequentially by the speaker and decoded sequentially
by the listener. Therefore, it may be misleading to consider prosodic events
on a piece of paper as emerging at once to represent the prosodic structure,
as prosodic events appear in reality one after the other in
a time sequence. This time domain dynamic aspect may modify the vision
of sentence intonation and the prosodic structure.
8. The last point pertains to the quasi-exclusive use of laboratory speech,
generally involving (very) short sentences. This limited choice of data,
justified in the early years by technical limitations, prevents the AM proso-
dists from observing data that would seriously question many aspects of
their approach.
Throughout this book, I will take a different stance, adopting the following main
theoretical points:
1. Prosodic words (associated with stress groups, i.e. APs) are organized into a
structure (a structure is a hierarchy with either relations defined between
classified objects or equivalently labels assigned to objects). This function
of the prosodic structure allows the users (i.e. the speaker and the listener) to
(re)build a hierarchy from linearly produced and received information, i.e. a
string of prosodic events.
2. This structure is recursive.
3. Descriptions of the prosodic structure must be strictly separated from other
structures organizing the language, and especially the syntactic structure. In
spontaneous speech as in read speech, the generation of the syntactic
structure depends on the presence of a simultaneously generated prosodic
structure. One can pronounce a prosodic structure without text, and thus with-
out syntax, but the opposite (i.e. to pronounce a sentence without intonation,
without a prosodic structure) is not actually possible, even in silent reading.
4. Pitch accents and boundary tones must be defined with rigorous phonolo-
gical criteria, not from a questionable and difficult to apply hypothesis
assuming the alternation of weak and strong syllables.
5. Pitch accents and boundary tones both have a role in the indication of the
prosodic structure. In particular, pitch accents interact with each other and
can be phonologically neutralized, and their realization is therefore not
purely phonetic.
6. The contrast of melodic slope applies to all Romance languages as one of the
most perceptually efficient acoustic contrasts. It leads to a phonological
description using necessary and sufficient contrasts independent from envir-
onmental characteristics (e.g. emotional state of the speaker, geographical
and sociological variations) affecting melodic realizations. This description
in turn avoids mixing phonetics and phonology in the account of the
prosodic structure.
7. Prosodic structures are constrained by a set of rules: (a) the minimal and (b)
the maximal time duration between consecutive prosodic events (pitch
accents and boundary tones), which explain the so-called stress clash and
seven-syllable rules found in the literature (on stress clash, see Bally, 1944; on
the seven-syllable rule, Meigret, 1550), (c) the eurhythmy rule (Martin,
1987), and (d) the syntactic clash rule (Martin, 1987).
8. The linguistic function and constraints of the prosodic structure are
explained by a universal cognitive principle, linked to the accumulation of
strings of syllables by the listener’s short-term memory. Departing from the
Written Language Bias, the time dimension should be at the center of the
prosodic structure definition and properties.
9. The real linguistic functions of the prosodic structure in the language coding
and decoding processes are now being explained by recent neurocognitive
findings (in particular, constraints cited under 7).
It appears (in my opinion at least) that the AM model of the sentence
prosodic structure is both naïve and simplistic. It is naïve as it just rewrites data
pertaining to sentence intonation in various languages, data that are well known
and have been frequently observed and reported in numerous papers for perhaps
the past 50 to 70 years. It is simplistic as it does not propose any explanatory
principle as to the linguistic function of the prosodic structure, other than its
implicit role in helping the listener to decode the syntactic hierarchy. Many
phenomena such as rhythmic constraints are generally ignored, as the AM
approach was designed at the start as something just helping syntax and not
worthy of having a phonological existence per se.
5 The Incremental Prosodic Structure
This book is about the structure of spoken language. It is based on the assumed
existence of a prosodic structure, organizing hierarchically minimal units of
prosody, the prosodic words.
This assumption implies:
a. that prosodic words do exist, and
b. that some prosodic markers do indicate the hierarchical organization of
prosodic words in a structure.
This chapter presents the main concepts of the Incremental Prosodic
Structure, a prosodic structure built dynamically along the time axis while
speaking. These concepts are applied to the analysis of Romance language
sentence intonation in Chapter 7.
Melodic curves
Whether we read a text orally or silently, we produce a “music of the sentence”
created by specific melodic height, duration, and intensity accompanying each
syllable emitted orally with vibration of the vocal folds. Some of these syllables
are perceptually “stronger” than others. These “strong” syllables are not neces-
sarily all stronger in the same way; they are differentiated from other strong
syllables by acoustic features such as melody (their perceived height), duration,
and intensity. Modern acoustical analysis shows clearly that stressed syllables
do bear some melodic changes, but this is also the case for the other syllables of
the sentence, whether they are perceived as stronger or not.
Many descriptions of the melodic changes have emerged from these acoustic
analyses, some purely descriptive and phonetic, others phonological in their
attempt to highlight regularities governing their realizations by speakers of a
given language.
Melodic curves, which can be obtained by acoustic analysis with dedicated
speech analysis packages such as Praat, WinPitch, WaveSurfer, and so on,
are usually displayed with various degrees of reliability and detail on a
frequency/time graph. These curves are quite complex and can be described
and studied in many ways. Therefore, their description presupposes some
selection criteria in order to isolate characteristics assumed to be pertinent
according to an implicit or explicit theoretical choice.
[Figure: frequency/time graph of the utterance son las principales actividades denunciadas por Trafic.]
Figure 5.1 Micromelodic fundamental frequency dips (circled) due to the
presence of a voiced stop [d] in the sequence . . . actividades denunciadas por
Traffic.
Phoneticians, for instance, were keen to describe what they called micro-
prosody, i.e. phenomena affecting the melodic curve (actually the fundamental
frequency curve, an estimation of the successive laryngeal periods usually
obtained from harmonic analysis of the speech signal) and directly caused
(and explained) by some peculiarities of the articulation process. For example,
typical dips observed in the fundamental frequency curve corresponding to
voiced stop consonants are easily explained by the momentary closure of the
vocal tract while laryngeal folds are still vibrating. An example is given in
Figure 5.1.
On the other hand, the linguistic description of melodic curves, which was
called for a time macroprosody, appeared not so easy to achieve and obviously
needed some external (linguistic) principle that could be applied to extract
relevant features from the raw acoustic data. In the earliest days, phonologists
started to investigate global properties of the melodic curve through statistical
analysis, possibly linked to specific syntactic or semantic configurations
(Cooper & Sorensen, 1981). Still, a rather evident argument based on perception
experiments (’t Hart et al., 1990) came to light, retaining in the descriptions
the most perceptually prominent segments of the melodic curve, corresponding
to stressed (i.e. strong) syllables. Later, other arguments appeared in order to
retain only the vowel inside the stressed syllables, presumably always voiced
(except in whispered speech, and possibly difficult to analyze in case of creaky
voice). Alternatively, some researchers retained stressed vowels and any voiced
consonants that follow.
Of course, for prosodists who strongly believe in the predominance of
syntax, other natural candidates may serve to define relevant segments of the
melodic curve, among which segments aligned on syntactic boundaries may
appear as excellent choices. In the first days of the AM approach, for instance,
prosodic phrase boundaries were simply aligned on syntactic boundaries.
However, researchers confronted with more and more data gradually gave
prosodic boundaries their independence from syntax, at least in principle.
The stress group

Most if not all theoretical approaches in prosodic phonology implicitly or
explicitly use a common selection criterion to filter the complex experimental
data obtained by acoustical analysis, retaining prosodic events occurring on
stressed syllables (or metrically dominant ones in the AM framework) and on some
lexical word boundaries, to define ultimately the minimal units of prosody, the
prosodic words, and their associated stress groups (which correspond to Accent
Phrases in AM). Of course, once detached from external syntactic characteristics,
other criteria may be needed to describe a prosodic event that could
qualify as relevant in describing the prosodic structure. We can then expect
some debate on these selection criteria.
As stressed vowels occur repeatedly in the flow of syllables, they may be good
candidates to be included in the definition of stress groups as units associated with
prosodic words. Stress groups will then be defined as sequences of syllables with
only one stressed syllable (excluding emphatic stress events). Its counterpart, the
prosodic word, is the segment of prosody that accompanies a stress group.
Therefore, for convenience, when referring to the prosody of a stress group, we
mean the prosodic parameters carried by the syllables of that group. This
prosodic material constitutes the prosodic word (this definition of the prosodic
word was already proposed in Martin, 1975). The stress group appears, then, as the
syllabic side of the minimal prosodic unit (including its morphological or lexical
properties), whereas the prosodic word constitutes its prosodic side. The Accent
Phrase (AP) in the AM approach merges these two sides, syllabic and prosodic.
Properties of lexical words included in a stress group may also be defined.
From observations on English, APs in the AM model were deemed to contain
one single content word (an open-class word, such as a noun, verb, adverb or an
adjective), and optionally one or more grammatical words (closed-class units
such as conjunctions, articles, auxiliaries, etc.), linked by a dependency relation
or by any suitable grammatical model showing their link with the AP content
word. This is a direct consequence of the stress properties of words in lexical
stress languages (like most Romance languages), but it will not hold for non-
lexical stress languages like French. Indeed in French, one single stress group
(and therefore one single AP) can contain more than one content word as in
mon papa est président “my father is president” pronounced with a single final
stressed syllable. Also, one stress group can contain one single grammatical
word, as in moi mon papa il est président, “as for me, my father, he is
president” where the tonic pronoun moi will be stressed in the phrasing
[moi] [mon papa] [il est président]. Likewise, a stress group may contain a
single syllable (e.g. je suis dé-bor-dé “I am overwhelmed,” c’est absolument
fan-tas-tique “it’s absolutely fantastic,” with each syllable of débordé or
fantastique pronounced detached and stressed), or more than one content
word (la fin du film “the end of the movie” with one stressed syllable on film,
or la ville de Paris, “the city of Paris”).
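The definition given above (a stress group is a sequence of syllables containing exactly one stressed syllable, emphatic stress aside) can be illustrated with a minimal Python sketch. It assumes French-style group-final stress, so that each stressed syllable closes a group. The function name, the (syllable, stressed) input format, and the hand-marked stress flags are illustrative assumptions and not part of the analysis itself.

```python
def stress_groups(syllables):
    """Split (syllable, is_stressed) pairs into stress groups.

    Assumption of this sketch: stress is group-final (as in French),
    so every stressed syllable closes the current group. Trailing
    unstressed syllables are returned as a last, incomplete group.
    """
    groups, current = [], []
    for syllable, stressed in syllables:
        current.append(syllable)
        if stressed:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


# Hand-annotated phrasing [moi] [mon papa] [il est président]
utterance = [
    ("moi", True),
    ("mon", False), ("pa", False), ("pa", True),
    ("il", False), ("est", False), ("pré", False), ("si", False), ("dent", True),
]
for group in stress_groups(utterance):
    print(" ".join(group))
```

A group may thus contain a single grammatical word (moi) or several words closed by one final stress, which is the point made about French above.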
The prosodic word

Spoken language always carries prosodic information, manifested from one
syllable to the next by a change of syllabic and vowel duration, change of vowel
intensity, and, for its voiced part, by a change of fundamental frequency (the
acoustic measure of the vocal folds vibration frequency).
Pitch is the perceptual attribute of a speech sound, which enables the listener
to locate the sound on a frequency scale from low to high. The physical
correlate of pitch is the fundamental frequency of the speech sound. In normal
speech, pitch is constantly varying in vowels and reduced consonants, resulting
from changes in the rate of vibration of the vocal folds.
As seen above, the prosodic word is the prosodic counterpart of a stress
group. It defines the minimal linguistic unit of prosody. The prosodic structure
refers to the hierarchical grouping of prosodic words. Prosodic words carry
prosodic events, instantiated by melodic contours, usually located on stress
groups’ stressed and final syllables.
Furthermore, the last prosodic event of a prosodic word is usually aligned on
the last syllable of some content or grammatical word in French, a language
without lexical stress, but can also be aligned on a single syllable of any type of
word in the case of syllabic separation by the speaker. Stress refers to the
relative prominence of a specific syllable inside a stress group, using greater
duration, intensity, or modulation of pitch (or a combination of these para-
meters), characteristics belonging to prosodic words. Intonation in phonology
refers, then, to the evolution of melodic rises and falls across the sentence,
including the duration these changes take (Li et al., 2008).
In more general terms, intonation is defined as the use of pitch variation in
phrases and sentences to reveal (unconsciously) or signal (consciously) speaker
attitudes, sentence type (e.g. statement vs. question), and information structure.
Intonation contributes to the interpretation of utterances but not to the differ-
entiation of words with similar syllabic sequences. However, in this book,
intonation describes more specifically the pitch movements on stressed and,
where applicable, final syllables inside prosodic words, the melodic side of stress
groups. As will be shown, these movements are not generated at random by
the speaker; they belong to the linguistic system.
Syllabic chunking
Recent work in neurocognition (Gilbert & Boucher, 2009; Gilbert, 2012)
shows that, in order to be perceived and memorized, strings of syllables must
be organized in chunks of three to five syllables if these chunks do not
correspond to lexical entries. If presented with larger strings of syllables such
as bisraktoubzachdujpermasrik, the listener will spontaneously segment these
sequences into subgroups of three to five syllables (for example, a sequence of
eight syllables will be segmented into two groups of 3 and 5 syllables, bisraktoub
zachdujpermasrik; 4 and 4, bisraktoubzach dujpermasrik; or 5 and 3,
bisraktoubzachduj permasrik).
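The possible two-way segmentations of an eight-syllable nonsense string can be enumerated mechanically. The following Python sketch is only an illustration: the syllabification of the string and the function name are assumptions, and the size limits of three to five syllables are taken from the figures quoted above.

```python
def two_chunk_splits(syllables, min_size=3, max_size=5):
    """List every split of a syllable string into two chunks whose
    sizes both lie between min_size and max_size (inclusive)."""
    splits = []
    for cut in range(min_size, len(syllables) - min_size + 1):
        left, right = syllables[:cut], syllables[cut:]
        if len(left) <= max_size and len(right) <= max_size:
            splits.append(("".join(left), "".join(right)))
    return splits


# Assumed syllabification of the nonsense string
syllables = ["bis", "rak", "toub", "zach", "duj", "per", "mas", "rik"]
for left, right in two_chunk_splits(syllables):
    print(left, right)
```

With eight syllables, only the 3 + 5, 4 + 4, and 5 + 3 divisions satisfy the size limits.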
This kind of segmentation can also be observed when reading (even silently)
numbers with many digits written without a dividing space, such as
13878376396 (a telephone number in China), whereas spacing between digits
imposes a particular segmentation on the reader, for example 138
7837 6396.
The same principle is applied to sixteen-digit credit card numbers, usually
formatted in four groups of 4 digits: 1234 5678 9012 3456 instead of
1234567890123456, and to car license plates, often alternating digits and
alphabetic letters: 845 KWC 87 and KZ-801-RJ (French license plates).
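The written groupings mentioned above amount to cutting a digit string into chunks of conventional sizes. A small Python sketch, in which the function name and the list-of-sizes parameter are assumptions chosen for illustration:

```python
def group_digits(digits, sizes):
    """Cut a digit string into consecutive groups of the given sizes
    and join them with spaces."""
    if sum(sizes) != len(digits):
        raise ValueError("group sizes must cover the whole string")
    groups, start = [], 0
    for size in sizes:
        groups.append(digits[start:start + size])
        start += size
    return " ".join(groups)


print(group_digits("13878376396", [3, 4, 4]))          # 138 7837 6396
print(group_digits("1234567890123456", [4, 4, 4, 4]))  # 1234 5678 9012 3456
```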
Similar examples were given by Miller (1956) with, for instance, the
sequence IBMCBSCIAIRS. Readers familiar with US culture have no trouble
in dividing this string into the well-known abbreviations IBM, CBS, CIA, and
IRS (International Business Machines, Columbia Broadcasting System,
Central Intelligence Agency, and Internal Revenue Service). Other readers
either would not be capable of remembering the sequence as presented, or
would segment it into smaller parts, such as IBMCBS and CIAIRS, or IBMC,
BSCI and AIRS.
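The contrast between the two kinds of readers can be sketched in Python as a segmentation that first looks for chunks already known to the reader and otherwise falls back on blind fixed-size chunks. Greedy longest-match lookup is used here purely as a convenient stand-in; it is an assumption of the sketch, not a claim about the cognitive mechanism, and the function and parameter names are hypothetical.

```python
def segment(text, known, fallback=3):
    """Segment text using known chunks where possible.

    At each position the longest known chunk is taken; if none
    matches, a blind chunk of `fallback` characters is cut instead.
    """
    chunks, i = [], 0
    max_len = max((len(k) for k in known), default=0)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in known:
                chunks.append(text[i:i + size])
                i += size
                break
        else:
            # No known chunk starts here: cut a fixed-size chunk.
            chunks.append(text[i:i + fallback])
            i += fallback
    return chunks


familiar = {"IBM", "CBS", "CIA", "IRS"}
print(segment("IBMCBSCIAIRS", familiar))         # reader who knows the abbreviations
print(segment("IBMCBSCIAIRS", set(), fallback=4))  # reader who does not
```

The first call recovers IBM, CBS, CIA, and IRS; the second yields the arbitrary chunks IBMC, BSCI, and AIRS.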
The prominence of one syllable inside a large string, realized, for example,
by vowel lengthening, will also determine the segmentation chosen, for example
bisraktoub and zachdujpermasrik if the third syllable is made more prominent
(stressed), or bisraktoubzach and dujpermasrik if the fourth syllable is more
prominent. However, when syllable groups do correspond to strings already
stored in the listener’s long-term memory, their segmentation can result in a
larger number of syllables, the last one not necessarily being the strongest. This
suggests that some triggering mechanism must exist to allow the listener to
determine and realize quickly the segmentation of nonsense sequences of
syllables into what is called temporal groups by Gilbert and Boucher (2007).
As stress is rarely located on the last syllable in lexically stressed languages
(e.g. in Romance languages other than French), another mechanism of seg-
mentation must be considered, possibly using both the presence of stress and
the direct identification of a lexical entry in the listener’s long-term memory.
In summary, the segmentation mechanisms could use the following:

a. The absolute position in the syllabic sequence (e.g. final in French, or in
unidentified chunks);
b. Stress as a morphological marker, with a stressed syllable determined by the
morphological structure of the words in the group (e.g. càpitano vs. capitàno
vs. capitanò in Italian), cf. Chapter 6;
c. Specific rhythmic patterns (e.g. peng2 you3 “friend” or zi4 xing2 che1
“bicycle” in Mandarin where multisyllabic words are grouped
rhythmically);
d. Multiple stressed syllables in one orthographic word (e.g. Übersetzen or
‘Schönheit in German, Ticonderoga in American English);
e. Direct identification of a lexical entry (the most expensive in terms of
processing time).
None of these possibilities may alone explain all syllabic sequences conver-
sion mechanisms in any one language, as multiple triggering schemes may be
used concurrently, which incidentally may explain the remarkable resistance to
noise of the linguistic system in human communication.
The time dimension

To introduce an alternative approach to the AM model, I will consider prosodic
events from the point of view of the listener rather than the linguist. Looking
along the time dimension, the listener perceives a succession of syllables,
possibly identified individually, but most probably stored in short-term mem-
ory before being linguistically processed. A more global identification process
can then take place using features spread over the whole entry rather than on
syllables evaluated one by one. These strings of syllables are accompanied
from time to time by localized prosodic events (i.e. stressed syllables), and also
by some non-localized prosodic characteristics affecting more than one sylla-
ble at a time.
These latter prosodic characteristics affecting the whole sentence pertain to
the speaker’s emotional state and socio-geographical origin (cf. Martin,
2014a), whereas the localized prosodic events located on specific syllables
may have a phonological function. As with other phonological entities, such as
vowels and consonants, the phonetic realization of phonological prosodic
events, i.e. the melodic contours, can vary from speaker to speaker and reflect
geographical, social group, or idiosyncratic usages.
The presence of melodic contours as remarkable prosodic events, instan-
tiated with some specific variations of melody, duration, and intensity of the
affected syllables, makes these syllables special: some will be perceived as
stressed and their position in the flow an inherent part of a lexical entry, whereas
others will be perceived as marking a boundary located at the end of syllabic
strings. A third kind of prosodic event presents neither of the first two characteristics
and will be identified with an iconic value, putting emphasis on the
current string of syllables. Usually, the first type of event is viewed classically
as a property of lexical entries (words), the second as a boundary tone, and the
third type as an emphatic or secondary accent (accent d’insistance in French).
In a dynamic process where syllables occur one after the other in a time
sequence, there is a cognitive limit to the number of syllables the listener can
retain in the short-term memory buffer where syllables are stored, waiting for
further processing. Indeed, sentence processing does not simply operate on
syllables, but groups them in stress groups. Interestingly, this limit does
not affect the actual number of syllables stored, but their cumulated duration.
As mentioned above, experiments suggest that the perception of nonsense
sequences of syllables is limited to four or five if there is no marking by
some prosodic events (Gilbert, 2012), whereas this limit extends to some
seven or eight if such a prosodic mark occurs. As to maximal duration,
measurements made on spontaneous speech data in French suggest a value
of about 1,250 ms (Martin, 2014b).
Conversion of syllabic chunks

What triggers the conversion of the syllables stored in the listener’s short-term
memory into some other memory space, storing not strings of syllables but a
higher linguistic unit corresponding to stress groups? Is this conversion oper-
ated when the short-term buffer is full, or is there another event responsible for
the transfer?
At this point, it is necessary to propose a hypothesis assuming that the
conversion is, at least indirectly, triggered and therefore synchronized, by the
prosodic events located on stressed syllables. If this role is assigned to a
prosodic event always placed at the end of the string to be converted, we
retrieve the well-known demarcative function of stress described a long time
ago for French (Garde, 1968, 2013). If for other languages, however, this role is
devoted to lexical stress, not necessarily placed at the end of the syllabic
chunks, we have to admit either (1) that a possible time delay must exist
between the perception of the stressed syllable in question and the syllabic
conversion itself to process the prosodic information, or (2) that a direct identification
of the syllabic chunk takes place, possibly triggered by its stressed syllable.
This time delay corresponds to the duration of the syllables following the
stressed syllable.
A problem still arises concerning how to explain the syllabic conversion
process for languages without lexical and boundary prosodic events other than
pauses, such as Mandarin or Vietnamese. In this case, another limit must be
invoked, i.e. the minimal duration between consecutive stressed syllables.
Measurements conducted on spontaneous speech in French give a value of about
250 ms (Martin, 2014b). Indeed, this value corresponds roughly to the time gap
between two consecutive Mandarin monosyllabic words, plurisyllabic words
being longer with shorter gaps between their component syllables. Therefore, one
can envision that every monosyllabic word of a tonal language is converted into
a higher rank linguistic unit, whereas for plurisyllabic words or short syntagms,
the process is similar to the one pertaining to compound words in non-tonal
languages, one syllable being stronger than the others (Dell, 2004).
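The two approximate duration limits reported for French (about 250 ms as a minimum and about 1,250 ms as a maximum) can be applied as a simple check on a sequence of prosodic events. The Python sketch below is only illustrative: the event times are invented, the function name is hypothetical, and treating the two values as hard thresholds is a simplifying assumption.

```python
MIN_GAP_MS = 250    # approximate minimal value reported for French
MAX_GAP_MS = 1250   # approximate maximal value reported for French


def check_gaps(event_times_ms, min_gap=MIN_GAP_MS, max_gap=MAX_GAP_MS):
    """Label each interval between consecutive prosodic events as
    'too short', 'too long', or 'ok' relative to the two limits."""
    report = []
    for earlier, later in zip(event_times_ms, event_times_ms[1:]):
        gap = later - earlier
        if gap < min_gap:
            label = "too short"
        elif gap > max_gap:
            label = "too long"
        else:
            label = "ok"
        report.append((gap, label))
    return report


# Invented times (in ms) of successive stressed syllables
times = [0, 400, 550, 2000, 2900]
for gap, label in check_gaps(times):
    print(gap, label)
```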
The above hypothesis suggests that at least one prosodic event acts as
(indirect) trigger for the conversion of stored strings of syllables, their identi-
fication as a prosodic word, and the clearing of the short-term syllabic memory.
Indeed, in the Incremental Storage-Concatenation (ISC) model discussed in
this book, the presence of prosodic events on specific syllables is assumed to
synchronize the transfer of syllabic chunks, the stress groups, into another part
of memory. In this specific region of memory, where information is stored as a
stress group and no longer as a string of syllables, this may involve more than
one lexical unit, for example a verb with its subject and object pronouns.
The coexistence of lexical stress and boundary tones in Romance languages
other than French is assumed to form one and only one prosodic event,
normally spread on more than one syllable, one stressed and the other final,
and on one single final syllable if this syllable is stressed.
In summary, the triggering of prosodic events synchronizing the transfer of
syllabic strings stored in the listener’s short-term memory could be due to the
following:
1. For non-lexical stressed languages such as French: the so-called boundary
tones, i.e. the stressed syllable ending stress groups in French.
2. For lexically stressed languages (all Romance languages other than French):
the combination of lexical stress and boundary stress. These prosodic events
can occur on the final syllable of a stress group if this syllable is stressed (see
examples in Chapter 7).
3. For tone languages (such as Mandarin, Vietnamese): for monosyllabic
words the actual tone on the syllable synchronizes the transfer of that
syllable into a higher rank memory. For multisyllabic words, some promi-
nent syllable (first or last?) plays the same role.
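The triggering mechanism summarized above can be sketched for the French case, where the stressed syllable ends the stress group. In this Python sketch, syllables accumulate in a buffer and a stressed syllable triggers the conversion of the buffer into a stress group. The syllable durations, the class and function names, and the decision to flush the buffer as an unmarked chunk when its cumulated duration would exceed about 1,250 ms are all assumptions made for illustration, not a specification of the ISC model.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    text: str
    duration_ms: int
    stressed: bool = False


def convert_chunks(syllables, max_buffer_ms=1250):
    """Buffer incoming syllables and convert the buffer whenever a
    stressed syllable arrives (group-final stress assumed).

    Returns (trigger, syllable texts) pairs, where trigger is
    'stress', 'overflow' (duration limit reached without stress),
    or 'pending' (input ended before any trigger).
    """
    groups, buffer, elapsed = [], [], 0
    for syl in syllables:
        # Flush first if adding this syllable would exceed the limit.
        if buffer and elapsed + syl.duration_ms > max_buffer_ms:
            groups.append(("overflow", [s.text for s in buffer]))
            buffer, elapsed = [], 0
        buffer.append(syl)
        elapsed += syl.duration_ms
        if syl.stressed:
            groups.append(("stress", [s.text for s in buffer]))
            buffer, elapsed = [], 0
    if buffer:
        groups.append(("pending", [s.text for s in buffer]))
    return groups


# Invented durations for la fin du film / la ville de Paris
stream = [
    Syllable("la", 150), Syllable("fin", 180), Syllable("du", 140),
    Syllable("film", 300, stressed=True),
    Syllable("la", 150), Syllable("ville", 200), Syllable("de", 120),
    Syllable("pa", 160), Syllable("ris", 320, stressed=True),
]
for trigger, group in convert_chunks(stream):
    print(trigger, group)
```

Each stress group is transferred as a whole, whatever the number of lexical words it contains.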
The syllable in the stress group

Experiments done on French revealing the phoneme as part of the linguistic
knowledge of speakers may have been influenced by the acquisition of the
orthographic system. Asked the classical question “how many vowels does
French have?”, native speakers not trained in
phonology would probably cite only five vowels, [a], [e], [i], [o], and [u], the
ones used in writing and inherited from Latin, whereas linguists will give a list
of fifteen or sixteen.
Speakers cannot spontaneously spell isolated phonemes; they have to rely
on the writing system at hand. You cannot spell a word in Mandarin, unless
you refer to its alphabetic Pinyin transcription. On the other hand, if asked, all
naïve native speakers of a language can segment a word into syllables
intuitively, whereas only subjects trained in linguistics can segment it into
phonemes.
One possible reason may originate from the knowledge of single syllable
words, so that multisyllabic words may appear as concatenated monosyllables.
In récréation “recreation” in French, it is easy to find by segmentation (not
based on orthography) the words ré, crée, à, and Sion (“a musical note, created,
to, a city name in Switzerland”) leading to the segmentation ré.cré.a.tion.
Adopting the vowel as the anchor of prosodic events in the syllable has
long been a debatable issue, as the presence of final voiced sonorants such as [l],
[m], or [n] generally prolongs the melodic movement located on the preceding
vowel (Chen, 1970; Raphael et al., 1975). Among the arguments in favor, the
variability of the syllabic structure, and in particular its duration in case of
final sonorants, makes it difficult to consider the overall syllabic duration as a
descriptive phonological feature. A simple example demonstrates this point
(Fig. 5.2).
Comparing the examples c’est ma maman “this is my mother” and c’est mon
papa “this is my father” shows the advantage of adopting the prosodic events
occurring on the vowel only instead of the whole syllable. Whatever the initial
or final consonant, voiced, unvoiced, or absent, the prosodic part of the vowel
remains (relatively) unchanged. Still, voiced sonorants (such as [l], [m],
and [n] in French) may extend the melodic movement of the vowel and
participate in its perception by the listener. As a result, to avoid taking
the syllabic structure into account, only the segments of prosodic events
occurring on stressed and final vowels will be described in the following
chapters.
Figure 5.2 Two examples in French (c’est ma maman “this is my mother,”
and c’est mon papa “this is my father”; example of D. Hirst).
The stress group in the sentence

Spontaneous speech data show that reformulations, repetitions, and dropouts
are structured into stress groups. When a speaker makes a correction by
means of a reformulation, an abort, a repetition, or a hesitation euh, the
restart always involves a complete stress group, and not a part of it that
would lack its first elements. This suggests that, if not completed, the
partial stress group is not processed by the listener’s memory, as the
aborted part may not carry a stressed syllable and therefore a prosodic event
indicating the prosodic structure. The following examples in spontaneous
French given in Blanche-Benveniste (2003) illustrate this mechanism.
In Alors la l’infirmière de temps en temps me m’humectait euh les lèvres “Then
the nurse occasionally me moistened uh my lips,” the partial first stress group
alors la is not memorized by the listener, as it is not stressed on its last syllable la
and is replaced by the reformulated stress group l’infirmière. In Je je j’ai eu les
jambes qui ont tremblé “I I I had legs that trembled,” the reformulation abandons
the first two occurrences of the pronoun je and integrates its elided form j’ in the
stress group j’ai eu les jambes. In the example Je je je vais le faire bientôt “I I I’ll
do it soon,” the subject pronoun je is repeated twice and taken up a third time to
produce a complete stress group je vais le faire. The mechanism may involve
more complex constructions, as in et on les on les cultive comme ça “And they
are they are grown like this,” where the subject on and object les pronouns are
repeated to form a well-formed stress group on les cultive.
These examples suggest that the stress group, and not necessarily the word,
is effectively the primary unit in sentence phrasing. Indeed, the listener cannot
handle the information without a complete sequence of syllables with one
syllable being stressed, a process essential for language processing by the
human brain, as explained below.
Classes of conclusive contours
Basic modalities
A sentence generated by a speaker necessarily involves a modality. It has long
been accepted that sentence modality is directly linked to sentence intonation,
and particularly to its last conclusive melodic contour located on the last
stressed syllable. Basically, sentence modality defines the relation between a
speaker and a listener. In the simplest classification, modality can be declarative
or interrogative. The prosodic structure, assumed in this book to be
independent of the modality of the sentence text (i.e. the one possibly
indicated in the text itself), is correlated with a modality that has no
direct relation to other modality markers (syntactic, morphological)
possibly present in the sentence.
In this classification, basic speaker–listener relations are reduced to either
giving or asking for information. The imperative modality, traditionally
presented as a category per se, may equally be considered an insistent variant
of declaration (Léon, 1993). The lack of specific imperative inflection in the
verbal system of many languages constitutes an argument in favor of this
interpretation.
I will call text a sentence deprived of its prosodic structure. A text corre-
sponds to an utterance without any punctuation, which could be read without
any intonation. Although this is clearly impossible in practice, this abstraction
will be very useful to understand the concept of prosodic structure better.
Examples in French without any morphosyntactic markers of interrogative
modality may help us to grasp the concept. The sentence Max adore les
chocolats “Max loves chocolates,” written without punctuation, needs a mod-
ality conclusive contour on its last stressed syllable to actually constitute a
sentence. This contour can be instantiated by a low and falling melodic contour,
transcribed by a final dot in its written orthographic form Max adore les
chocolats. indicating a declarative modality, or by a rising melodic contour,
transcribed with a question mark Max adore les chocolats? indicating an
interrogative modality.
Modality variants
Depending on the perspective chosen, many variants of the basic modality of
declarative and interrogative contours can be considered (Cresti et al., 2002). I
will select only three variants for each of the basic declarative and interrogative
modalities, retaining as supplementary features the emphasis the speaker puts
(1) on the overall information conveyed in the sentence itself and (2) on the
context and situation of the speech act in which the sentence is pronounced
(Table 5.1).
Table 5.1 Variants of modality (Martin, 1987)

Emphasis        None        On the sentence   On the context and/or situation
Declarative     Assertion   Command           Evidence
Interrogative   Question    Exclamation       Doubt

The emphasis on the declarative corresponds fairly well to the order or
command, usually considered as a basic modality (probably by analogy with
the corresponding verbal mode, which, as already mentioned, incidentally
borrows all its forms from other verbal modes such as indicative and
subjunctive). Emphasis on the context (containing the information built by
previous statements or present in the situation, in short all the information
already supposedly known by the listener) corresponds to the declarative mode
made obvious, i.e. evidence.
The insistence on an interrogative statement corresponds for the speaker to a
surprise or an exclamation (an “insistent interrogation”). Applied to the context
of the sentence, the emphasis yields a question that carries an implication
about the context, thus expressing doubt or unconvinced questioning.
Phonetic observation of the melodic contours (Fig. 5.3), obtained by experi-
mental analysis or by prosodic morphing, leads to a phonological description
using the following binary features added to the feature +/− Rising defining the
basic modalities declarative and interrogative.
The feature +/− Ample, related to the amplitude of the melodic variation of
the contour, differentiates the assertion, −Ample, from the command, +Ample,
for the declarative mode. It likewise differentiates the question, −Ample,
from the exclamation, +Ample, but it is not relevant for distinguishing the
assertion from evidence or the question from doubt.
The feature +/− Bell shaped, referring to an upward then downward variation
(a bell-shaped contour), provides for the declarative mode a differentiation
between the assertion and the command on the one hand, and evidence on the
other hand. For the interrogative mode, this feature differentiates the
question and the exclamation, −Bell shaped, from doubt, +Bell shaped.
The resulting phonological matrix is given in Table 5.2 (Martin, 2009).
Table 5.2 Phonological description of modality variants using the features +/−
Rising, +/− Ample, and +/− Bell shaped

              Assertion   Command   Evidence   Question   Exclamation   Doubt
Rising        −           −         −          +          +             +
Ample         −           +         +/−        −          +             +/−
Bell shaped   −           −         +          −          −             +
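
As an illustration only, the matrix of Table 5.2 can be read as a lookup from feature values to modality variants. The following Python sketch is not part of the original description: the names FEATURES and classify_contour are hypothetical, and the +/− cells of the Ample row are encoded as None, on the assumption that the feature is simply ignored for evidence and doubt.

```python
# Rows of Table 5.2 as (rising, ample, bell_shaped) per modality variant.
# None in the `ample` slot stands for a "+/-" cell (feature not relevant);
# this encoding is an assumption made for the sketch.
FEATURES = {
    "assertion":   (False, False, False),
    "command":     (False, True,  False),
    "evidence":    (False, None,  True),
    "question":    (True,  False, False),
    "exclamation": (True,  True,  False),
    "doubt":       (True,  None,  True),
}


def classify_contour(rising: bool, ample: bool, bell_shaped: bool) -> str:
    """Return the modality variant whose row in the matrix matches the features."""
    for name, (r, a, b) in FEATURES.items():
        # A None entry matches any value of `ample`.
        if r == rising and b == bell_shaped and (a is None or a == ample):
            return name
    raise ValueError("no matching modality variant")
```

For example, classify_contour(rising=True, ample=True, bell_shaped=False) returns "exclamation", and any bell-shaped falling contour is classified as "evidence" whatever its amplitude.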
Figure 5.3 Variants of modality melodic contours located on the last stressed
syllable (declarative case) or the last syllable, stressed or not (interrogative
case).
Figure 5.4 Voulez-vous du thé du café ou du chocolat?
Figure 5.5 Voulez-vous du thé du café du chocolat?
Alternative questions
The concept of independence given a priori to the sentence text and its prosodic
structure allows us to better handle cases long regarded as difficult to analyze
by prosodists and semanticists alike. In a statement in French such as Voulez-
vous du thé du café ou du chocolat? “do you want tea, coffee or chocolate?”
(Fig. 5.4), the text contains an interrogative modality, but the stressed
syllables of thé and café are necessarily associated with two similar rising
(or neutralized flat) melodic contours, while the last syllable of the
sentence, on chocolat, carries a conclusive declarative falling contour.
Without the coordinating conjunction ou “or” (“do you want tea, coffee,
chocolate?”), the example becomes as shown in Figure 5.5, with all three
melodic contours on the stressed syllables rising.
The proposed explanation is simple: in the first case, the text carries an
interrogative modality (due to the inversion of verb and subject in voulez-vous),
but the prosodic structure is declarative; in the second case, the text is
associated with three independent interrogative prosodic structures. The first
prosodic structure contains three prosodic words, the first two carrying a
rising contour contrasting with the final falling contour, while in the second
case the text is segmented into three successive prosodic groups, and thus
three independent sentences with an interrogative modality.
Iconicity of conclusive contours

Iconicity is the conceived or perceived similarity or analogy between a linguis-
tic form and its corresponding meaning. This similarity can challenge the a
priori arbitrariness of the relationship assumed between form and meaning. In
the segmental domain, two well-known examples in French are le glouglou and
faire pschitt, in Italian il gorgoglio and fare pschitt, in Spanish el gorgoteo and
hacer pschitt, etc., which, when pronounced, may sound somewhat similar to
the sound of a liquid poured from a bottle and to the noise made by carbon
dioxide escaping from an opened can of beer.
The classical view on the correlation between sentence modality and
a prosodic event distinguishes between declarative and interrogative
categories, as well as their imperative, implicative, surprise, and doubt
variants. Using a sketchy phonological description, these contours are
respectively:
Basic declarative: low-range falling contour
Imperative: high-range falling contour
Implicative: moderately rising followed by a falling contour
Basic interrogative: low-range rising contour
Surprise: high-range rising contour
Doubt: rising contour followed by a moderately falling contour
A given modality melodic contour is in a relation of opposition with the other
modality contours, whereas, as will be shown later, non-final melodic contours
are in relations of contrast with other non-final contours present in the
sentence.
It has long been assumed that the declarative contour, normally in sentence-
final position, has an iconic value, as its fundamental frequency usually reaches
the lowest value of the whole sentence (Martinet, 1960). This melodic movement
is due to a drop in subglottal pressure which, in the absence of a
counteracting muscular adjustment of vocal fold tension, is generally
accompanied by a downward movement of the speaker’s head (Fónagy, 1983). This
head movement appears as a sign of submission toward the listener and is used
in various cultures such as Greek, where a downward movement and slight
rotation of the speaker’s head signifies a submissive agreement (typically
associated with the Greek word μάλιστα, malista, “indeed”). The falling melodic
contour signals the end of the sentence and the possible relinquishment of a
speech turn by the speaker in control.
On the contrary, a rising contour found in the middle of (rather long)
sentences (the continuation majeure in French) is supposed to be linked to
a rising movement of the speaker’s head, correlated with a gesture indicating
the conservation of power toward the listener, i.e. the conservation of the
speech turn control. This melodic movement can be followed by a short or
long pause, which the speaker may use to breathe in, filling the lungs for a
new phonation sequence. Famous political leaders sure of their power may
perhaps abuse the use of silent pauses in their speeches, as nobody in their
audience would dare to interrupt them. Doing so will also give their audience
the chance to applaud. Indeed, an upward rotation of the head is a sign of
dominance over the audience (Duez, 1997). Again, Greek culture uses this body
gesture to signal strong denial or refusal, and such a head movement is
typically accompanied by an alveolar click.
Figure 5.6 Declarative vs. imperative conclusive melodic contours.
The interrogative rising contour in French leads to another interpretation: it
is realized at the end of the sentence like the declarative falling contour, at a
point of low subglottal pressure, as corroborated by a drop in intensity of 6 dB
or so (Martin, 2009). The melodic rise is then obtained by activating the
phonation muscles that control vocal fold tension. The speaker’s head remains
basically in its normal straight position, with only a slight upward rotation.
Indeed, there is no direct submission involved, as the speaker deliberately
relinquishes control of the speech turn to request an answer from the listener.
Imperative contour
As shown by phonetic studies such as in Léon (1993), the imperative contour
appears as an emphatic variant of the declarative contour (Fig. 5.6).
On the iconic level, the imperative melodic contour is an assertion that
admits (supposedly) no comment or reply, i.e. an emphasized assertion.
Furthermore, its phonetic realization requires considerably more articulatory
effort than the simple declaration, involving a preliminary laryngeal frequency
rise to achieve a large fall afterwards controlled essentially by vocal fold
tension, the simple drop in subglottal pressure being insufficient to achieve a
large frequency swing. This suggests that the imperative contour is linked to
some degree of muscular effort from the speaker, an effort which can be
symbolically linked to moderate to strong violence (Fig. 5.7).
Implicative contour
The implicative melodic contour has frequently been called contour d’évidence
in French work on intonation (Léon, 1993). It can be interpreted on the iconic
level as the speaker asking a rhetorical question, indicated by the moderate rise
of the fundamental frequency, followed by a large fall, correlative of an
assertion (Martin, 2009). In other words, the speaker suggests by this melodic
movement that any question on the content of the assertion should be aborted as
a clear certainty follows immediately, as suggested by the falling part of the
contour (Figs. 5.8 and 5.9).
Figure 5.7 Emphasis on a declarative contour gives an imperative contour;
emphasis on an interrogative contour gives a surprise contour.
Figure 5.8 Implicative contour (evidence), a moderate rise followed by a
large fall.
Contour of surprise
The contour of surprise can be viewed as an emphatic question, instantiated by
an exaggeration of the melodic rise associated with the interrogative modality.
On the iconic level, surprise can be considered as an interrogation almost
deprived of any control, a question which imperiously requires an answer,
which may not necessarily come from the listener (Fig. 5.10).
Figure 5.9 Bell shaping on a declarative contour gives an evidence contour,
and on an interrogative contour gives a doubt contour.
Figure 5.10 Interrogative vs. surprise conclusive melodic contours.
Figure 5.11 Implicative interrogative contour (doubt), a large rise followed
by a moderate fall.
Contour of doubt
The contour of doubt starts as an interrogative contour but ends as a moderately
falling declarative contour. As an emphasis bearing on the context rather than
on the sentence itself, it raises a question at the beginning but ends with a
moderately marked assertion. This combines two contradictory indications, a
strong demand for information and a moderate denial of any answer that can be
the outcome of the demand (Fig. 5.11).
The Incremental Prosodic Structure

Inspired by the work of Karcevski (2000) and Prieto (1975), I introduced
(perhaps for the first time) the concept of prosodic hierarchy in 1973 and
1975 (Martin, 1973, 1975). In the 1975 paper, most of the concepts appearing
later in prosodic phonology are introduced, but often under a different denomi-
nation that may possibly have inspired the AM approach.
I assumed then the existence of a prosodic hierarchy that would group in
successive layers prosodic words, defined as the minimal prosodic unit. These
minimal units contain one and only one prosodic event located on a stressed
syllable, being characterized, among other phonetic parameters, by a melodic
contour. As the concept was applied to French, the stressed syllables were
assumed to be in final position in stress groups.
At first, the prosodic hierarchy (again a structure is a hierarchy equipped
either with labels differentiating the nodes or with relations between the nodes
of the hierarchy) was deemed congruent with the syntactic structure. This
position was modified in Rossi et al. (1981).
In these early papers, the sentence is analyzed in two parts: the proposition
and the phrase, to use Karcevski’s terms. These parts are similar to the two
sides of a banknote: they coexist but cannot be physically separated (acousti-
cally filtering the segmental information may be close to separating intonation
from text). The proposition corresponds to the text of the sentence, i.e. as
appearing in a phonetic transcription deprived of punctuation, for example, and
the phrase is related to everything else, i.e. prosody.
Secondly, the phrase is analyzed into its smallest parts, the minimal units of
signification, which contain one and only one stressed syllable (not an emphatic
stress). These minimal units of signification correspond to stress groups, i.e.
units with only one stressed syllable (conceptually a stress group includes
syllables, whereas a prosodic word includes prosodic material only). The key
point here is that this decomposition into minimal signification units or stress
groups is defined by the speaker, and not necessarily by grammar. This is
especially evident in French. The example le frère de Max a mangé les tartines
can be analyzed in two, three or four minimal units of signification, i.e. (a) le
frère de Max and a mangé les tartines, (b) le frère de Max, a mangé and les
tartines, and (c) le frère, de Max, a mangé and les tartines, according to the
speaker’s speech rate.
These minimal units revive in a way the old concept of groupe de sens in
traditional grammar (where a syntagm is a group of words that produces a
unique meaning) and correspond more or less to the prosodic words, stress
groups (cf. Sabater, 1991), or temporal groups (Gilbert, 2012) found in the
literature. As units of signification, they also imply that some lexical unit
must enter into the composition of each stress group. At this point, no
maximum number of syllables was assumed to limit the size of the minimal units
of signification, whereas other studies (including one published in the
sixteenth century by Meigret, 1550) showed that this limit may be around seven
syllables (plus or minus two), or be related to short-term memory limitations
(Miller, 1956). I will show below that the limit is actually a temporal one,
the number of syllables being a consequence and not the origin of the
phenomenon.
The next step pertaining to the concept of prosodic structure is clearly part
of a top-down approach, as it posits the existence of a hierarchy organizing
the minimal units of signification. This hierarchy results from the speaker’s
usage and is not necessarily determined by syntax. It is then called a prosodic
hierarchy, and when associated with dependency relations between units, it
becomes a prosodic structure (again, a structure is a hierarchy with either
node labels or dependency relations). From a Saussurean point of view,
the signifier would be the sequence of prosodic markers indicating the
prosodic structure and the signified the hierarchical classification of stress
groups.
In the Martin (1975) paper, the prosodic markers are instantiated by melodic
contours, whose acoustic description involves an absolute value of frequency
height, frequency span, and duration. These contours operate by contrast
with the prosodic marker ending (in French) the unit of immediately higher
rank. This final position in sequences of prosodic words in French creates
confusion from an AM point of view, since both pitch accents and boundary
tones are manifested by prosodic events occupying the same position on the
final syllables.
However, there is more. If the only differentiation between prosodic events
(excluding emphatic stress) exists between terminal boundary tones and the
other stressed syllables, there is still an explanation to provide as to how
the hierarchical arrangements realized between syllabic chunks actually occur.
Indeed, the concatenation of these chunks is not realized most of the time as
a single enumeration, but as a hierarchy with more than one level
corresponding to a prosodic structure. I noticed at that time the tendency for
pitch movements on stressed syllables to follow a principle of contrast of
melodic slope, i.e. to be falling if they depend on a rising contour “at their
right,” and to be rising if that contour is falling (Martin, 1975). More
precisely, a contour would fall if it belongs to a larger group ended with a
rising contour, and conversely it would rise if the ending contour is falling.
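
The principle of contrast of melodic slope can be sketched procedurally. The Python fragment below is a minimal illustration only, under the assumptions that the prosodic structure is given as non-empty nested lists of stress groups and that the contrast is strictly binary (rising vs. falling). The function name assign_slopes and this representation are hypothetical, and the sketch ignores neutralized or idiosyncratic realizations.

```python
def assign_slopes(node, ending="falling"):
    """Return [(stress_group, slope), ...] in left-to-right order.

    `node` is either a stress group (a string) or a non-empty list of nodes.
    `ending` is the slope of the contour that ends the group `node`.
    """
    if isinstance(node, str):
        return [(node, ending)]
    opposite = "rising" if ending == "falling" else "falling"
    result = []
    for child in node[:-1]:
        # Non-final members depend on the contour ending the larger group,
        # so their own ending contour takes the opposite slope.
        result.extend(assign_slopes(child, opposite))
    # The last member carries the contour that ends the whole group.
    result.extend(assign_slopes(node[-1], ending))
    return result


# Hypothetical two-level structure with a declarative (falling) conclusive contour.
structure = [["le frère", "de Max"], ["a mangé", "les tartines"]]
slopes = assign_slopes(structure, ending="falling")
```

With this structure, the sketch assigns a falling contour to le frère, a rising contour to de Max, a rising contour to a mangé, and the falling conclusive contour to les tartines.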
The model considers the various prosodic events (not emphatic accents) as
markers determining dynamically the assembly of successive prosodic
words. This process implies that prosodic markers on lexical stressed
syllables and on accent phrase boundaries, possibly combined on one single
syllable, are sufficiently differentiated to give the listener an efficient
indication of how to assemble the strings of syllabic chunks (stress groups)
in a proper hierarchy. In the AM model, the recovery of the prosodic structure
is limited to the presence of (optional) intermediate phrase (ip) and
Intonation Phrase (IP) boundary tones, the details of the hierarchy being
left to syntactic processing, except for a variable degree of stress derived
from the metrical grid associated with IP (which incidentally is derived
from text and may give questionable results). However, examples involving
telephone numbers or other oral realizations deprived of syntactic structure
show that such structures may be recovered by listeners from pitch accents
only.
The whole Incremental Storage-Concatenation (ISC) process is based
on the existence of an effective acoustic set of contrasts between prosodic
markers, essentially instantiated by prosodic contours, to indicate how
successive prosodic words must be put together hierarchically. It is
important at this point to remember that this process is dynamic along the
time axis, i.e. that the contrasts between prosodic markers act locally
against the immediately following prosodic events, and not globally by
considering the whole prosodic structure at once, as it is often presented in
the literature. This local
character of the contrasts to be maintained between prosodic markers may
explain many puzzling variations observed on spontaneous speech, as the
contrasts have to be evaluated only locally (in time) by the listener. This
process takes place in specific time domains defined between prosodic
events of the same class.
Still, one specific prosodic marker must possess acoustic characteristics
such that it can be identified by listeners whatever the conditions of
sentence production, such as particular habits of the speaker. This particular
prosodic event is the final conclusive melodic contour, which is always
located on the final stressed syllable (not necessarily the last syllable) of
the last stress group of the sentence in declarative sentences (see Chapter 7),
or on the final syllable in interrogative sentences, whether this final syllable
is stressed or not.
In summary, similarly to the AM approach, the ISC prosodic structure is
defined as a hierarchy of minimal prosodic units, the prosodic words. Prosodic
words are entities of prosody containing one and only one prosodic marker of
the prosodic hierarchy, i.e. a prosodic object being part of the indication of the
prosodic structure. This unique marker is instantiated by prosodic variations (of
fundamental frequency, duration, and intensity) occurring on the stressed
syllable of the corresponding stress group and possibly (for complex contours
in Romance languages other than French) on both the vowel of the stressed
syllable and the final vowel of the group.
In French, since there is no lexical stress, these prosodic markers are
reduced to boundary tones located on the final stressed syllable of the
prosodic phrases. In the other Romance languages, if there is no boundary
tone, the prosodic markers are realized on the lexical stress of the group. If
there is a boundary tone, the markers are instantiated by two distinct prosodic
events, one on the stressed vowel and the other on the final syllable vowel. If
the last syllable bears the lexical stress and if there is also a boundary tone,
the prosodic realization of the marker combines the two prosodic events
on the same syllable. Many examples are given in Chapter 7 to illustrate
this point.
Independence
The predominance of syntax over the other organizations of linguistic objects
(phonological, morphological, informational), following the Chomskian
model(s) giving a predominant role to the (deep) syntactic structure in the
sentence, affected research on sentence intonation almost from the beginning
of linguistic interest in prosody. Former studies influenced by structuralism
also presented this tendency. Many papers on sentence prosodic structure
(including Martin, 1975) had trouble admitting that the prosodic and syntac-
tic structures were not (always) congruent (cf. Rossi et al., 1981). A lot of
effort has then been devoted to the elaboration of sets of rules to prove that the
prosodic structure is actually derived from syntax, although this may not
always appear obvious from the data (e.g. Longchamp, 1998; Rossi, 1999).
Rather than renounce the primacy of syntax, researchers were keen to propose
elaborated sets of alignment, interpolation, and other fancy phonological
devices that would hopefully result in convincing arguments to derive
prosody from syntax.
Even in classical modality analysis, morphology and syntax are given
priority over prosodic events, as encoded by sentence-final melodic contours.
Many difficult problems, such as those raised by the alternative questions
mentioned above (vous voulez du thé ou du café? “do you want tea or coffee?”
with a falling modality contour indicating a statement), were handled by again
giving precedence to facts visible in written transcription, i.e. morphology
and syntax (Beyssade et al., 2007).
Prosodic events
The whole storage-concatenation principle relies on the ability of the listener
to classify adequately prosodic events located on stressed syllables. This
implies, among other things, that pertinent stressed syllables can be distin-
guished from other prosodic events, for example that emphatic stress can be
differentiated from prosodic contours leading to the storage of the current
sequence of syllables (since the last decoded prosodic events) in the proper
buffer. It also implies that categories of distinct prosodic events are instan-
tiated (realized) with sufficient acoustic differences to be classified correctly
by the listener.
The generation of a prosodic structure is a dynamic process, whose span
does not exceed a certain limit set by the capacity of the speaker’s and
listener’s short-term memory (probably limited to a few stress groups).
In this process, the speaker has a limited choice of fragments to encode
successive chunks of the prosodic structure (corresponding to ip – intermediate
[intonation] phrase – and IP – Intonational Phrase – as defined in the AM
prosodic structure).
In the course of sentence production, the speaker must plan a sequence of
stress groups which necessarily ends with the modality contour C0 (not
considering abandonment or the possibility of adding a deferred complement
after the prosodically defined sentence end; see Chapter 8). In fact, at each
step, the choice is between coding a relationship of parataxis (juxtaposition
of units), of rection (dependency of units), or no relation at all with one
following melodic contour to appear later. The encoding of the chosen relation
always involves a melodic contour “on the right,” i.e. occurring later in
the sentence.
There is therefore necessarily planning, at least locally, until the appearance
of the next contour in a sequence, since by definition the contour classes define
relationships of dependency toward a contour appearing later, immediately or
not, in the sequence.
Many suitable acoustic differences come to mind to differentiate the
realizations of stressed syllables, but in Romance languages, the most
important contrast is accomplished by the so-called melodic slope contrast.
In essence, if a prosodic event C1 is realized normally with a rising melodic
contour, the most efficient acoustic characteristic for another prosodic event
C2 to contrast with C1 is to have a melodic variation of opposite melodic
slope, i.e. falling if C1 is rising, and rising if C1 is falling. However, other
contrasts are possible, essentially in some idiosyncratic cases.
Although this is true for French, other Romance languages with a lexical
stress system have a supplementary possibility by using both stressed and
final syllabic locations, as seen above. This is what essentially differentiates
French from the other Romance languages, as French does not have lexical
stress, only a final group stress. As French leaves the speaker to choose the
phrasing of a given sentence, only constrained by the duration of the resulting
stress groups (and the alignment with lexical word-final syllables), the
storage-concatenation process will be influenced by the duration and the
hierarchy of these stress groups. In the sentence le frère de Max a mangé
les tartines “Max’s brother ate the sandwiches,” for example, the possible
phrasings are as follows:
[le frère de Max a mangé les tartines]: one prosodic word of ten syllables,
difficult to pronounce with a single, final stressed syllable
[le frère de Max] [a mangé les tartines]: two prosodic words of 4 and 6
syllables, probably pronounced with an accelerated speech rate for the second
prosodic word
[le frère] [de Max] [a mangé les tartines]: three prosodic words of 2, 2, and
6 syllables, a quite unbalanced phrasing, not very probable
[le frère de Max] [a mangé] [les tartines]: three prosodic words of 4, 3, and
3 syllables, a somewhat balanced phrasing, a probable realization (in reading
mode)
[le frère] [de Max] [a mangé] [les tartines]: four prosodic words of 2, 2, 3,
and 3 syllables, a phrasing adapted to a slow speech rate.
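
The phrasings listed above are a subset of the ways in which adjacent minimal units can be merged into prosodic words. As a hedged illustration, the Python sketch below enumerates all such groupings for the four minimal units of the example, using the syllable counts given in the list (2, 2, 3, and 3). The names UNITS, phrasings, and syllable_counts are hypothetical, and no constraint on duration or balance is applied.

```python
from itertools import combinations

# Minimal units of the example with their syllable counts (taken from the list above).
UNITS = [("le frère", 2), ("de Max", 2), ("a mangé", 3), ("les tartines", 3)]


def phrasings(units):
    """Yield every grouping of adjacent minimal units into prosodic words."""
    n = len(units)
    for k in range(n):
        # Choose k internal boundaries among the n - 1 possible positions.
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [units[a:b] for a, b in zip(bounds, bounds[1:])]


def syllable_counts(phrasing):
    """Return the number of syllables of each prosodic word in a phrasing."""
    return [sum(count for _, count in word) for word in phrasing]


all_counts = [syllable_counts(p) for p in phrasings(UNITS)]
```

The enumeration produces eight groupings. The five phrasings discussed above are among them; the remaining three (2 and 8 syllables; 7 and 3 syllables; 2, 5, and 3 syllables) are not discussed in the text.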
This flexibility is not fully available in the other Romance languages. For
instance in Italian, il fratello di Max ha mangiato i panini requires all four
lexically stressed syllables to be realized as stressed. The maximal duration
of prosodic words is reached less frequently than in French, only in lexical
words with more than seven syllables or so before the stressed syllable (e.g.
precipitevolissimo “very hasty” does not qualify, but
precipitevolissimevolmente “in a way like someone/something that acts very
hastily” does, and leads to the indicated stress pattern). Most of these cases
are found in the specialized chemical or pharmaceutical domain.
Properties
Although the prosodic structure is usually presented as independent of the flow
of time in linguistic studies (i.e. as if it were completely known at once
from start to end), it should always be remembered that, from the point of view
of the listener, prosodic events are perceived one after the other as time
unfolds. Therefore, their categorization, i.e. the identification of the class
they belong to, is not absolute and based on isolated events but relative: it
depends only on past and expected future events and not on all future events,
which are normally unknown to the listener. The process is similar for
syllables, which are not perceived in isolation but in sequence.
This view is central in the ISC model, as every prosodic event, instantiated
by a prosodic contour, determines a relation of dependency “to the right,” i.e.
toward another prosodic event occurring later on the time axis, until the final
conclusive contour occurs. (The denomination “to the right” refers of course to
the western way of writing; in the writing of other cultures, such as Arabic or
Hebrew, it should be referred to as “to the left” – or even “to the bottom” for old
Chinese writing. This may be a good reason to abandon it).
This means that a prosodic event, say C1 (denoting what is
called in the literature a continuation majeure), is correlative to a dependency
relation toward a future event, be it an event of the same class C1 to be part of
an enumeration, or an event C0 of a higher rank belonging to a class other than
C1, C0 being a conclusive terminal contour. Higher rank means that the existence
of C1 presupposes the occurrence of C0. A contour C1 may also be
followed by a contour of a lower rank, say C2, indicating that the prosodic
word containing C1 will not form a prosodic group with the prosodic word
containing C2. The prosodic word ending with C2 will be merged later in time,
when a contour of higher rank occurs.
The storage-concatenation process (Martin, 2009) is built around this prin-
ciple to determine the prosodic structure of a sentence. It can be briefly
described as follows. Consider C0, C1, C2, Cn as classes of prosodic events,
differentiated by a set of acoustic or perceptive features and indicating depen-
dency relation “to the right” (to a future event). This means Cn indicates a
dependency relation toward C2 occurring after Cn on the time axis, C2 toward
C1, and C1 toward C0. It also means that a sequence of Cn would indicate a
concatenation of all groups bearing Cn until a higher rank contour C2, C1, or
C0 occurs. The same principle applies to all other prosodic events, until the
final C0 terminates the process.
It follows from this mechanism that prosodic events instantiated by melo-
dic contours are identified dynamically, and that a contour C2 becomes C2 as
it is followed by another contour, be that Cn, C2, or C1. In the absence of
direct identification by the listener (in case of extreme differences in phonetic
realization from a particular speaker), a contour C1, for example, will be
identified as such if it is followed by a sequence such as Cn C2 C1. The
occurrence of Cn after C1 in a sequence C1 Cn indicates only that the
corresponding prosodic words do not form a group, and not the level of
embedding of Cn in the prosodic structure. In other words, the contrast C1
Cn includes a relative and not an absolute position in the prosodic structure
(see a detailed example in Chapter 7).
Prosodic phrasing
Prosodic phrasing refers to the segmentation of the syllabic flow into stress
groups corresponding to groups of syllables with only one stressed syllable.
Prosodic hierarchy pertains to the hierarchical assembly of the prosodic
words corresponding to stress groups into larger groups (prosodic phrases)
until the whole intonation line of the sentence is obtained (usually by the advent
of a final conclusive contour).
In this book, there is absolutely nothing that links this hierarchy, which
becomes a structure once dependency relations between groups are defined,
to any syntactic or morphological object, with one exception. Even if the last
(or only) lexical word integrated into a stress group normally has its final
syllables aligned on the end of the stress group (i.e. no stress group would
stop in the middle of a lexical word), other possibilities do exist, for example by
realizing every syllable as stressed, or by putting an extra stress inside long
words (see below). Therefore, the prosodic and syntactic structure (and other
structures existing in the sentence for that matter) are a priori totally indepen-
dent from each other.
Planarity
The planarity constraint (Martin, 1987) forbids prosodic grouping of
stress groups such as [A[B]C] in which prosodic words (and stress groups)
A and C would form a larger prosodic group before integration of C in a
sequence A B C to form the prosodic structure. A prosodic group forms a
larger unit with the group placed “at its right,” i.e. later on the timescale
marked by a prosodic marker (i.e. a melodic contour) of higher rank. It is
therefore not possible, due to the absence of prosodic markers allowing the
indication of a dependency relation going “over” another group (as morpho-
logical markers of gender and numbers for instance) to realize a non-planar
prosodic structure (see Fig. 5.12).
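One way to make the planarity constraint concrete is to number the prosodic words of a sentence along the time axis and to require every prosodic group to cover an uninterrupted run of positions. The sketch below is only an illustration under that assumption; the function name is_planar and the encoding of groups as sets of positions are hypothetical.

```python
def is_planar(groups):
    """Return True if every group covers a contiguous run of positions.

    Each group is a non-empty set of prosodic word positions (0, 1, 2, ...)
    counted along the time axis.
    """
    for group in groups:
        positions = sorted(group)
        if positions != list(range(positions[0], positions[-1] + 1)):
            return False
    return True

# With A = 0, B = 1, C = 2: grouping A with C "over" B is not planar,
# whereas grouping A with B and then adding C is.
print(is_planar([{0, 2}]))             # False
print(is_planar([{0, 1}, {0, 1, 2}]))  # True
```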
Connexity
The dependency “to the right” assumed for every prosodic event in a
prosodic structure implies connexity between the hierarchized prosodic
words and phrases, i.e. that every prosodic word or phrase maintains a
relation of dependency toward another prosodic word or stress group
located “at its right,” i.e. next to it on the time axis (Fig. 5.13).
However, this property ceases to be valid when a prosodic parenthesis,
i.e. a completely independent prosodic structure ended with its own
conclusive contour, is embedded in the main sentence prosodic structure.
Examples in spontaneous speech are given in Chapter 8 on macrosyntax
(Debaisieux & Martin, 2010).
Figure 5.12 A non-planar partial structure [A [B] C], not well-formed for a
prosodic structure.
Figure 5.13 A prosodic structure without connexity, well-formed for a
prosodic structure with non-integrated parentheses.
Domain
In the ISC model, a prosodic domain is defined between two consecutive
prosodic markers belonging to the same class (for instance, two successive
conclusive contours C0, or two successive first-level C1 contours – called
continuation majeure in the French tradition or IP boundary tones in the AM
model). Phonetic realizations of contours of the same class must present
enough similar characteristics that the listener can recognize and identify the
contours as belonging to that class correctly. Outside a domain, for example in
two consecutive domains, the realizations of prosodic markers of the same
class (i.e. phonologically identical) can vary, as long as they are sufficiently
similar to be correctly classified by the listener, and as long as their differences
are such that they are not confused with markers of another class inside the
domain. This rule originates from the mechanism of prosodic markers identi-
fication, which operated dynamically along the time axis. In this process, the
listener has to compare a limited number of successive realizations of markers
in order to proceed to the storage and concatenation of the sequences of
prosodic words. Another consequence of this rule is that final conclusive
contours must be realized the same way in a given speaking community,
although lower-level contours inside the prosodic structure may be different
from speaker to speaker, or from domain to domain for the same speaker
(Fig. 5.14).
Figure 5.14 A domain between two consecutive C0 contours, where contours
C1 inside the domain must be realized phonetically with enough similarities
so that they can be identified as belonging to the same class.
The contours of the same class in the same domain must use the same
subsets of melodic features to contrast (or not) with other contours. In
another domain, contrasts may use different features to differentiate distinct
contours.
Neutralization
The actual realization of an abstract (phonological) prosodic marker is done in
such a way that it is differentiated (acoustically and/or perceptually) from all
the realizations of other prosodic markers that could occur in its place (i.e.
at the same position and in the same context). This is a basic rule in functional
phonology. It simply stipulates that the actual realization of a specific prosodic
contour by the speaker must possess such acoustic characteristics that it may
not be confused by the listener with another contour that could be selected
instead by the speaker (in the same context). As the following discussion will
show, the consequences of this rule are extremely important in prosodic
phonology, whereas they are totally absent from AM descriptions, as pitch
accents are not supposed to interact.
The differentiation rule (syntagmatic axis) reads as follows: the manifestation
(realization) of a prosodic marker by acoustic parameters (again
F0, intensity, and duration variations of either stressed or
prosodic-phrase-final vowels) must contrast only locally in time. This means
that only a limited number of prosodic markers in a time sequence must contrast
(or not) so that their prosodic marker classes are correctly identified by the
listener.
Differentiation in the time domain

The common representation of the prosodic structure is two-dimensional, with
the time assigned to the horizontal axis. As mentioned earlier, this representa-
tion, inspired from the syntactic structure representations, is somewhat
misleading, as it masks the temporal aspect of prosody, suggesting that the
information perceived and processed by the listener is immediate, with a
simultaneous perception of past, present, and future prosodic events. In reality,
prosodic events are perceived in sequence, one after the other, and, as will be
explained later, the processing of past events is quite limited due to the
listener's short-term memory capacity. For the speaker reading a text, however,
past (and future) information is always available to generate the prosodic
structure.
The same observation applies to syntactic structure representations, where
the time domain aspect is often totally ignored. Besides, the written language
bias (Ochs, 1979) appears clearly in the terminology used by prosodists and
syntacticians, when they describe left and right syntactic dislocation, for
example. The usual terminology “dislocation to the left” or “dislocation to
the right,” which refers to anteposition and postposition on the time axis, as in
“Mary, she is nice” and “She is nice, Mary,” reveals the written bias for events
that actually occur “before” and “after.”
Differentiation of prosodic events

Although the AM approach may assign degrees of syllabic stress through
metrical grids, the melodic characteristics of these degrees of stress are not
an essential part of the prosodic structure elaboration process and are therefore
considered by nature phonetic. On the contrary, for the ISC process, the
stressed syllables of melodic contours play an essential role in the indication
of the prosodic structure, meaning that they belong to different phonological
classes to ensure their function.
This hypothesis implies that the acoustic realizations of stressed syllables
must contrast in order to indicate the prosodic structure when boundary tones
are not present. It implies also that the listener must be able to classify stressed
syllables instantiated by melodic contours and derive from this information an
appropriate process to recover the prosodic structure intended by the speaker.
Prosodic events enter a network of relative contrasts and do not need
invariant acoustic characteristics, as long as they are identified by the listener
as belonging to the same class, or more precisely as long as the contrast
between two successive contours is correctly identified. This differential
approach to describing prosodic events takes into account the large variety of
styles in the production of speech. An emotionally depressed speaker, for
instance, will realize little or no melodic variation. In whispered voice,
contrasts between prosodic markers will have to be realized by means other than
laryngeal frequency variations, essentially with segment durations about 50 to
70 percent longer than the corresponding durations in non-whispered
voice (Vercherand et al., 2011). For these reasons, we cannot expect
invariance of prosodic marker realizations, contrary to what many researchers
in this field seem to expect.
The dynamic prosodic structure

The prosodic structure reconsidered dynamically results from a process by
which strings of syllables (the stress groups associated with prosodic words)
are hierarchically stored and merged. This process requires the identification
of each terminal prosodic event as belonging to the conclusive class known
by the listener and also the identification of the type of contrast perceived
between two consecutive contours. The process involves two mechanisms:
(1) storage of the string of syllables perceived by the listener since the last
advent of a prosodic event and (2) concatenation of this string with all strings
belonging to the same level (i.e. whose storage was triggered by same-class
prosodic events), as illustrated above. A more detailed example is given at the
end of this chapter.
In French, among all prosodic events, only those located on the last
syllable of syllabic groups take part in this process, to the exclusion of
events located on the first syllable of lexical words, which are treated as
emphatic accent. This means that the identification of prosodic events
implied in the dynamic prosodic structure elaborated by the listener involves
the identification of a syllable as stressed in order to qualify the associated
prosodic event as a triggering signal for the storage-concatenation
mechanism. For the other Romance languages, this triggering is also done together
with the identification of a boundary tone possibly present on the last
syllable of the group, or by direct identification of the words belonging to the
stress group.
The dynamic view leads to the consideration that each prosodic event,
instantiated by various contrasts of variation and melody height, duration,
intensity, or vocalic quality, appears as a signal triggering (possibly with a
slight delay) the storage of syllables perceived since the appearance of the
last prosodic signal belonging to the same class. This syllabic storage is
accompanied by concatenation of the elements already present at the same
prosodic “address.”
For tonal languages, there is apparently no lexical stress. Although
monosyllabic languages like Mandarin or Vietnamese evolved from a monosyllabic
lexicon to a multisyllabic one (e.g. Mandarin: bicycle 自行车 Zìxíngchē,
bus 公共汽车 Gōnggòng qìchē), none of the components seems to clearly carry a
(primary) stress.
The ISC process is a new concept that may appear at first as very similar to
the AM approach, briefly summarized in Chapter 4. However, the main differ-
ences pertain to the following essential points:
1. It focuses on the sequential aspects of prosodic events on the time axis,
avoiding the Written Language Bias (Linell, 2005) common to most
theoretical studies on prosodic structures.
2. It assumes the a priori total independence of the prosodic structure from
other structures that may exist in the utterance, and in particular the
syntactic structure.
3. It highlights specific constraints pertaining to the prosodic structure, gen-
erally unknown in most research studies on the subject.
4. It departs from the conception that prosodic events should be handled
like phonemes, ignoring the local contrasting effects on phonetic
realizations.
5. It proposes a general principle of explanation for these constraints based on
recent neurolinguistic research on brain waves.
The Incremental Storage-Concatenation process

The Incremental Storage-Concatenation (ISC) model brings another point of
view to the description of the prosodic structure. Instead of a string of tonal
targets transcribed with ToBI symbols organizing a static hierarchy of
prosodic words, sentence intonation is analyzed phonologically as a
sequence of short-lived instantiations of melodic contours. Indeed, prosodic
events are short-lived in the listener’s memory (and the speaker’s memory
alike), the time span being of the order of a second or less. It is therefore
reasonable, instead of considering a prosodic structure represented on a
piece of paper, to view prosodic events as “flashing” time after time, leaving
the listener (contrary to the linguist) a very short amount of time to process
the information and transfer the strings of just perceived syllables into a
higher-level storage.
Pursuing this view a little bit further, the notion of domain becomes
important, but with a different meaning than that commonly found in
AM-inspired work. As mentioned earlier, a domain is defined here as the
time span between two consecutive prosodic events belonging to the same
phonological class. For instance, the domain between two consecutive con-
clusive contours is the whole set of prosodic events occurring in the sentence
(or with the start or the end of the sentence if the prosodic event is unique).
The domain of contours is defined “on the left,” i.e. before the contour to
either the beginning of the sentence or another occurrence of the same
contours of the immediately lower rank. Inside this domain, all contours
must have similar acoustic characteristics so that the listener can classify
these prosodic events in the same category, or successive same-class if they
are not consecutive.
From this definition of prosodic domain, it follows that the process of prosodic
structure reconstruction by the listener is necessarily local, and implies a
sequence of comparisons between two successively perceived prosodic events.
If Cx and Cy are two consecutive prosodic events perceived by the listener:
1. If Cx and Cy are identified as belonging to two phonological classes known
by the listener, there exists a ranking between both melodic contours, i.e. a
class can be weaker than, equal to, or stronger than another class of melodic
contours. The three possible cases are:
a. Cx < Cy
b. Cx = Cy
c. Cx > Cy
2. If only Cx is identified as belonging to a phonological class known by the
listener, then Cy is assumed to belong to a neutralized class Cn. This is
equivalent to case 1(c) above.
3. If only Cy is identified as belonging to a phonological class known by the
listener, then Cx is assumed to belong to a neutralized class Cn. This is
equivalent to case 1(a) above.
4. If neither Cx nor Cy is identified as belonging to a phonological class
known by the listener, both prosodic events are neutralized as Cn. This is
equivalent to case 1(b) above.
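The four cases above reduce to a single comparison once unidentified contours are mapped to the neutralized class. The following sketch assumes the ranking Cn < C2 < C1 < C0 used in this chapter; the function name compare and the use of None for a contour the listener could not classify are illustrative assumptions.

```python
# Phonological ranking assumed for this sketch: Cn < C2 < C1 < C0.
RANK = {"Cn": 0, "C2": 1, "C1": 2, "C0": 3}

def compare(cx, cy):
    """Return '<', '=' or '>' for two consecutive contours Cx and Cy.

    A label that is None (or otherwise unknown) stands for a contour the
    listener could not classify; it is treated as the neutralized class Cn.
    """
    rank_x = RANK.get(cx, RANK["Cn"])
    rank_y = RANK.get(cy, RANK["Cn"])
    if rank_x < rank_y:
        return "<"
    if rank_x == rank_y:
        return "="
    return ">"

print(compare("C2", "C1"))  # case 1(a)
print(compare("C1", None))  # case 2, treated as 1(c)
print(compare(None, "C2"))  # case 3, treated as 1(a)
print(compare(None, None))  # case 4, treated as 1(b)
```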
If we note as Cn, C2, C1, C0 the classes of prosodic events instantiated
by melodic contours, with their phonological hierarchy defined by Cn < C2
< C1 < C0 (see details below), the storage concatenation algorithm is described
by the following simple instructions applied to two consecutive contours, Cx
and Cy, Cx preceding Cy on the time scale:
if Cx < Cy {Concatenate the prosodic word (phrase) containing Cx with the
prosodic word containing Cy and store the result in the Cy buffer};
else if Cx = Cy {Store the prosodic word (phrase) containing Cx and the
prosodic word containing Cy without concatenating them};
else if Cx > Cy {Store the prosodic word containing Cy without merging it
with the prosodic word (phrase) containing Cx}.
More details on the incremental process are given further down. The definition
of the ISC process has interesting consequences for the resulting
configurations of the prosodic structure.
In configurations such as C2 C1 Cn C2 C1 Cc . . . C0, taken as an example,
the comparison between the second contour C1 and the third contour Cn allows
the listener only to evaluate a temporary hierarchical level for the Cn contour.
Indeed, the advent of C2 after Cn lowers the level of Cn in the structure, and the
advent of C1 after C2 lowers both the Cn and C2 levels, to finally get to the
following configuration:
Step 1: [C2 opens a bracket
Step 2: [C2 C1]
Step 3: [C2 C1] Cn
Step 4: [C2 C1] [Cn C2]
Step 5: [C2 C1] [[Cn C2] C1]
Step 6: [[C2 C1] [[Cn C2] C1] C0] end of process
Still, the comparison between two successive melodic contours allows the
correct decoding of the intended prosodic structure from successive temporary
hierarchies of concatenated stress groups. Therefore, a given melodic contour
is not necessarily linked to a specific level in the prosodic structure, as these
levels are defined dynamically. The Incremental Storage-Concatenation pro-
cess is similar to a Shift-Reduce parser.
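The analogy with a shift-reduce parser can be illustrated with a stack. The sketch below is one possible reading of the three instructions given above, not the author's implementation: each incoming contour absorbs (reduces) every stored group of strictly lower rank and is then stored (shifted). The names RANK and isc_parse are hypothetical, and the ranking Cn < C2 < C1 < C0 is taken from the text.

```python
# Phonological ranking assumed for this sketch: Cn < C2 < C1 < C0.
RANK = {"Cn": 0, "C2": 1, "C1": 2, "C0": 3}

def isc_parse(contours):
    """Return the bracketed groups left on the stack after each contour."""
    stack = []   # entries are (rank, bracketed string)
    steps = []
    for contour in contours:
        rank = RANK[contour]
        absorbed = []
        # Cx < Cy: concatenate stored lower-rank groups with the new word.
        while stack and stack[-1][0] < rank:
            absorbed.append(stack.pop()[1])
        absorbed.reverse()  # restore temporal order
        if absorbed:
            label = "[" + " ".join(absorbed + [contour]) + "]"
        else:
            label = contour  # Cx = Cy or Cx > Cy: store without merging
        stack.append((rank, label))
        steps.append([text for _, text in stack])
    return steps

for number, state in enumerate(isc_parse("C2 C1 Cn C2 C1 C0".split()), 1):
    print(f"Step {number}:", " ".join(state))
```

Run on the sequence C2 C1 Cn C2 C1 C0, the stack goes through the same intermediate groupings as Steps 2 to 6 above, which shows that only comparisons between a new contour and already stored groups are needed.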
Preplanning
If for the listener the dynamic restitution of the prosodic structure intended by
the speaker involves only comparing locally consecutive melodic contours, it is
not necessarily the case for the speaker, who has to plan ahead the sequence of
more than two contours if the prosodic structure is relatively complex, i.e. if the
structure has more than one level.
Indeed, if starting with a neutralized contour Cn, the speaker has the
choice between three options (to take the example of French): the simplest,
with one level, Cn C0 (again C0 is the terminal conclusive contour), a two
level hierarchy encoded by Cn C1 C0, and a three level arrangement Cn C2
C1 C0. The realization of a contour C1 after Cn is the most frequent,
whereas Cn C2 C1 C0 requires preplanning of at least four stress groups,
consecutive or not. While this kind of sequence may be observed in read
speech, speakers in spontaneous speech often prefer sequences such as Cn Cn
Cn . . . C2 C1, with a melodic slope contrast affecting only the last two
contours of the sequence, therefore not requiring preplanning over a long
sequence of stress groups.
Melodic contours features

As discussed earlier, focusing the analysis on the time aspect of prosodic events
leads to important consequences, related to interpretations of many phonolo-
gical properties of the prosodic structure.
The assumed existence of dependency relations instantiated by melodic
contours to indicate the sentence prosodic structure implies that these contours
must be differentiated by appropriate prosodic features according to their role
in the structure configuration and the dependency relation they maintain.
Observation of the data suggests that candidate prosodic features should be
linked to melodic characteristics of the contours, such as their duration,
melodic height and span, glissando values, etc.
Instead of analyzing a general configuration and determining the possible
neutralization of some redundant features, I will proceed the other way around
and establish the conditions of contrast that melodic contours should meet to
ensure the indication of prosodic structures of increasing complexity.
From the acoustic analysis of data in French and in the other five Romance
languages, the retained (binary) melodic features are:
The contour lowest frequency, +/−Low
The flat or slightly falling change on the stressed vowel, and a sharp melodic
rise on the final vowel, +/−Complex (not used in French)
The rising or falling melodic change, +/−Rising
The rate of melodic change on the stressed vowel, +/−Glissando
The duration of the stressed vowel, +/−Long
The shape of the stressed vowel contour, +/−Bell shaped
The glissando value characterizing the speed of frequency change. If this
value is above a glissando threshold, the change of frequency is perceived,
if below, a static tone at 2/3 of the extreme value of the contour (rising or
falling) is perceived instead (Rossi, 1971, 1978), +/−Glissando.
The threshold coefficient used for the experimental data in this book is
0.16/s2, an arbitrary value chosen for convenience.
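As an illustration of how a glissando threshold can be applied to a measured contour, the sketch below assumes the widely used formulation in which the threshold, in semitones per second, equals the coefficient divided by the square of the contour duration in seconds. This formulation, the function name is_glissando, and its parameters are assumptions of the sketch; only the coefficient 0.16 comes from the text.

```python
import math

def is_glissando(f_start_hz, f_end_hz, duration_s, coefficient=0.16):
    """Return True if the melodic change exceeds the glissando threshold.

    Assumed formulation: threshold (semitones/s) = coefficient / duration**2.
    """
    semitones = 12 * math.log2(f_end_hz / f_start_hz)
    rate = abs(semitones) / duration_s         # semitones per second
    threshold = coefficient / duration_s ** 2  # semitones per second
    return rate > threshold

# A one-octave rise over 200 ms is far above the threshold (+Glissando);
# a 1 Hz drift over 100 ms is below it (-Glissando, perceived as static).
print(is_glissando(100.0, 200.0, 0.2))
print(is_glissando(100.0, 101.0, 0.1))
```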
One prosodic word

A single prosodic word structure ends with a conclusive contour C0. Being a
modality contour, C0 is opposed in a paradigmatic relation with all other
modality contours that could occur in the same position, i.e. in the same context
and the same situation characterizing the speech act. If this context and situa-
tion are not defined (as is the case in the examples below), C0 must be opposed
by appropriate acoustic features to all the other modality contours and their
variants. Taking the phonological description given above, C0 declarative must
be −Rising, −Ample, and −Bell shaped, with phonetic realizations of these
features opposed to the +Rising, +Ample, and +Bell shaped shared by the other
types of conclusive contours (Table 5.3).
Table 5.3 Phonological description of modality contours
              Assertion  Command  Evidence  Question  Exclamation  Doubt
Rising        −          −        −         +         +            +
Ample         −          +        +/−       −         +            +/−
Bell shaped   −          −        +         −         −            +
Table 5.4 System of contrasts Cx C0
Cx C0
−Low +Low
+/−Complex −Complex
+/−Rising −Rising
+/−Glissando +Glissando
+/−Long +Long
Two prosodic words

The phonological sequence is then Cx C0, Cx being a contour to define
phonologically (excluding the case where the conclusive contour C0 is fol-
lowed by other contours, as Nucleus + Postfix C0 Cx analyzed in Chapter 8 on
macrosyntax). For the sake of simplicity, I will from now on omit the Bell-
shaped feature, which should have a minus value for all melodic contours
inside the prosodic structure.
As C0 inherits from all features implied in the relation of opposition
with all other modality contours, Cx has to be differentiated not only from
C0 but also from all these features as well. Observations (see Chapter 7)
show that speakers may vary in their selection of features contrasting
between Cx and C0, but most frequently use +/− Rising and +/− Low,
and sometimes a combination of both, as well as +/− Long, especially in
French (Table 5.4).
As the contrast −Low +Low is sufficient to ensure the differentiation
with C0, Cx can be +/−Complex, +/−Rising, +/−Glissando, or +/−Long.
A −Glissando value defines a neutralized contour called Cn (the index n for
neutralized). A +Glissando value will involve a +Rising melodic slope in order
to avoid a reduced contrast of final frequency for a contour that would be
−Rising. The resulting −Low, +Rising, +Glissando, +/−Long contour defines a
C1 contour. The feature +Rising may be truncated acoustically when the
stressed vowel is preceded by a stop consonant in the same syllable. As the
contrast of melodic slope (see below) prevents a dependency relation between
C2 and C0 both −Rising, for example (i.e. *C2 C0), the possible sequences are
then Cn C0 and C1 C0.
Three prosodic words ended with C0

There are actually three planar configurations to consider (the prosodic
structures are represented with squared branches and arrows indicating the
direction of the dependency relations between contours, see Fig. 5.15). For
the sake of simplicity, I will consider only French cases, excluding the use of
the +/−Complex feature.
Figure 5.15 Three basic planar hierarchic configurations.
The first configuration Cx Cx C0 (I) implies an enumeration type with two
successive contours Cx. The contrast network is therefore the same as in the
preceding case, except that the feature chosen to contrast with C0 must be the
same, so that the two first contours can be identified as belonging to the same
class, either Cn or C1. These two contours belong to the same domain and must
be phonologically identical.
In the second configuration (II), Cx C1 C0, there are two possible matrices
(see Table 5.5).
The first matrix introduces a contour C2, −Rising, +Glissando contrasting
with C1 +Rising, +Glissando with its melodic slope, whereas the second
solution reveals a contour of reduced melodic swing (−Glissando) and unde-
fined melodic change (+/−Rising, Cn).
In the third configuration (III), C1 Cx C0, one first possible matrix is shown
in Table 5.6.
Table 5.5 System of contrasts Cx C1 C0
First matrix (Cx = C2)
Cx           C1           C0
−Low         −Low         +Low
−Rising      +Rising      −Rising
+Glissando   +Glissando   +Glissando
−Long        +Long        +Long
Second matrix (Cx = Cn)
Cx           C1           C0
−Low         −Low         +Low
+/−Rising    +Rising      −Rising
−Glissando   +Glissando   +Glissando
−Long        +Long        +Long
Table 5.6 System of contrasts C1 Cx C0
C1           Cx           C0
−Low         −Low         +Low
+Rising      +/−Rising    −Rising
+Glissando   −Glissando   +Glissando
+Long        −Long        +Long
This matrix defines a contour contrasting with C1 +Rising, +Glissando with
its reduced melodic swing. Since the value −Glissando does not allow a
differentiation between +Rising and −Rising, this is the only matrix possible,
and C1 Cn C0 is the only possible sequence of contours for this configuration
(again considering French configurations, not using the +/−Complex feature).
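The feature values used in the French matrices above can be summarized as a small decision procedure. The sketch below is only an illustration: it ignores the +/−Complex, +/−Long, and Bell-shaped features, assumes a declarative conclusive contour for C0, and uses a hypothetical function name, classify_contour.

```python
def classify_contour(low, rising, glissando):
    """Map three binary features to a contour class (French, simplified).

    low, rising, glissando are booleans standing for +/-Low, +/-Rising and
    +/-Glissando. +Low is taken to signal the declarative conclusive C0.
    """
    if low:
        return "C0"  # conclusive contour, reaching the lowest frequency
    if not glissando:
        return "Cn"  # below the glissando threshold: neutralized contour
    return "C1" if rising else "C2"  # contrast of melodic slope

print(classify_contour(low=True, rising=False, glissando=True))   # C0
print(classify_contour(low=False, rising=True, glissando=False))  # Cn
print(classify_contour(low=False, rising=True, glissando=True))   # C1
print(classify_contour(low=False, rising=False, glissando=True))  # C2
```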
Contrast of melodic slope

The preceding configurations reveal a basic mechanism involving an efficient
feature to differentiate successive contours: the contrast of melodic slope. This
mechanism tends to favor inverted melodic slope for melodic contours in a
dependency relation. In a sequence Cx Cy, where Cx depends on Cy “on its
right,” Cx will be +Rising if Cy is −Rising (falling), whereas Cx will be
−Rising if Cy is +Rising. This general principle is characteristic of French
intonation, but also of the other Romance languages (see Chapter 7). Again,
contrast involves successive contours in a single domain and not in the overall
sentence or speech turn.
Three prosodic words ended with C1

The reasoning is similar with the phonological matrices shown in Table 5.7.
Table 5.7 System of contrasts Cx Cx C1
First matrix (Cx = Cn)
Cx           Cx           C1
−Low         −Low         +Low
+/−Rising    +/−Rising    +Rising
−Glissando   −Glissando   +Glissando
−Long        −Long        +Long
Second matrix (Cx = C2)
Cx           Cx           C1
−Low         −Low         +Low
−Rising      −Rising      +Rising
+Glissando   +Glissando   +Glissando
Figure 5.16 Three basic prosodic hierarchical configurations.
In configuration (I, Fig. 5.16, Table 5.7), Cx Cx C1, if Cx is neutralized
(−Glissando), it will be instantiated as a neutralized contour Cn. If not, the
sequence will be C2 C2 C1, with C2 a contour above the glissando threshold
contrasting with C1 through the feature −Rising (contrast of melodic slope).
The second case (II, Fig. 5.16, Table 5.8), Cy Cx C1, leads to the sequence
Cn C2 C1, the first contour being neutralized as no further features are
available to ensure another type of contrast for Cy.
The third configuration (III, Fig. 5.16, Table 5.9), Cx Cy C1, gives the
sequence C2 Cn C1, as shown in Table 5.9.
Table 5.8 System of contrasts Cy Cx C1
Cy           Cx           C1
−Low         −Low         +Low
+/−Rising    −Rising      +Rising
−Glissando   +Glissando   +Glissando
Table 5.9 System of contrasts Cx Cy C1
Cx           Cy           C1
−Low         −Low         +Low
−Rising      +/−Rising    +Rising
+Glissando   −Glissando   +Glissando
The Romance languages other than French do have lexical stress, which
offers another combinatorial possibility with a “complex contour.” The
complex contour Cc is realized when the prosodic word (possibly ending a
prosodic phrase) bears a final rising contour, contrasting with a slightly rising
or falling melodic movement on its stressed syllable, usually below the
glissando threshold. This configuration corresponds to a boundary tone in AM
terminology, where two distinct prosodic events occur in the same prosodic
word. Where the lexical stress in the prosodic word is final, the two melodic
events merge and a falling-rising contour takes place. The rising part of Cc is
normally above the glissando threshold, and Cc is phonologically described
as −Low, +Complex, +Rising, +Glissando, and +Long. Furthermore, like C1,
Cc may be followed by a pause, adding an extra phonetic feature.
With the principle of contrast of melodic slope, by combining the different
configurations of three successive contours ending with a complex contour Cc,
the possible sequences are: C2 C2 Cc for the first configuration (I), Cn C2 Cc
for (II), and C2 Cn Cc or C1 Cn Cc for (III). More configurations and examples
are given in Chapter 7.
Prosodic structure constraints

The properties of planarity and connexity of the prosodic structure, and the
concept of domain, were directly derived from the assumed relations of dependency
maintained by melodic contours toward contours of higher phonological rank
(and possibly perceived with a higher degree of stress) located “on their right,”
i.e. in the immediate future in the sequence of prosodic events. However, other
types of properties limit the phrasing into prosodic words, and therefore stress
groups: the minimal and maximal duration of prosodic words (affecting their
number of syllables), the alignment of the last syllables of lexical words with
the end (right boundary) of stress groups in French, the syntactic clash
constraint, and the eurhythmic tendency, in reading mode, to balance the
duration of enunciation of successive prosodic words.
The arc accentuel in French

In French, particularly in public speech situations (political meetings, radio
news, train and airport announcements), stress groups may include two content
words (often an adjective and a noun, or a verb and a noun) with the first
syllable of the first content word stressed, the second content word being
stressed on its last syllable as usual. Some typical examples are: à
protéger la France, la majeure partie, les hautes pressions, etc. (Fónagy,
1979). The first stress of these stress groups then appears as a secondary
(emphatic) accent, as seen before.
This configuration, called arc accentuel, may also result from a stress clash
condition (see below) when the second content word is monosyllabic and
therefore stressed on its first and unique syllable, as in un soulier noir “a
black shoe,” un regard triste “a sad look,” or il arrive tard “he arrives late.” The
regular final stress on the first content word gets shifted to the first syllable.
The arc accentuel forms a larger stress group (AP, prosodic word). The
first stress, shifted to the first syllable of the lexical word, becomes a
secondary (emphatic) accent and therefore no longer functions as a prosodic
marker. However, this configuration is possible only if the newly formed
prosodic word complies with the syntactic clash condition (see below), i.e.
if the two content words involved are dominated by the same node in the
syntactic structure, or put more simply, if the resulting stress group can be
recognized as belonging to the listener’s lexicon. In French examples such as
je bois mon café tôt “I drink my coffee early” or le travail de nuit nuit
“night work harms,” it is not possible to shift the first stress or delete
it. The stress group *café tôt does not belong to the lexicon, and in the
second example, there is no room to shift the stress, and the stress group
obtained by deletion of the first stress, *nuit nuit, does not belong to the
lexicon of French either. The only possible realization is then to stress both
words nuit (the first being a noun, the second a verb), leaving a time gap
between the two words.
Stress clash
The so-called stress clash has been observed in many languages, and in
particular in Romance languages: in Italian onestà sarde, a metà prezzo
(Nespor & Vogel, 1986; Profili & Martin, 1987), in Portuguese café quente,
in French café froid, etc. Its avoidance is also known as a rhythmic rule,
promoting a principle of binary alternation (Liberman & Prince, 1977). A stress
clash refers to a sequence of adjacent stressed syllables and is assumed to be
avoided in most cases. Stress clash rules for shifting or deleting the first
stress involved in the clash were formulated a long time ago (Meigret, 1550;
Prince, 1983) and their consequences discussed for French (Martin, 2009),
Italian (Profili & Martin, 1987), Spanish (Hualde, 2010), and so on.
In French, in response to the question Comment Max aime-t-il son café? (“How
does Max like his coffee?”), the (possible) answer Max aime le café # froid shows
two consecutive stressed syllables, whereas in response to the question Qu’est-ce
que Max aime boire le matin? (“What does Max like to drink in the morning?”), the
answer would be Max aime le café froid, with a shift of the first stress to
the preceding syllable to avoid a stress clash. However, the realization of two
successive stressed syllables in the first case implies the insertion of a short
pause (possibly realized as a glottal stop) whose origin will be explained
later.
As seen above, this latter possibility in French leads to a rephrasing of the
sentence and the merging of the originally separate prosodic words café and
froid into one, café froid, with an emphatic accent on the first syllable. It is
easy to establish
the quality of the apparently “shifted” stress by observing the lack of melodic
change whether the final stress bears a rising or a falling contour (resulting
from a contrast in melodic slope) as in Juliette préfère le café russe, mais Max
préfère le café suisse. The same mechanism applies in Italian (ex.: metà
prezzo → metà prezzo “half price”) or Spanish (sofá cama → sofá cama
“sofa bed”), where the last lexical stress remains in place and the first
becomes an emphatic (or secondary) accent in the newly formed prosodic
word and stress group.
Stress clash requires obvious phonological conditions: in French the first unit
is normally stressed on its last syllable, so stress clash requires it to be followed
by a monosyllabic (stressed) word. In the other Romance languages, the first
word must be stressed on its last syllable, and the following word on its first
syllable, whatever the number of syllables.
But these conditions are not sufficient to induce a stress shift: the shift
can occur only if the group formed by the two consecutive words is
dominated directly by the same node in the syntactic structure (the
syntactic clash condition). This simply means that the group formed by the two
words involved in the stress clash has to form a single unit that can be
transferred to concatenation memory, and be later identified in the listener’s
lexicon.
Whether the first stress involved in a clash is shifted to the first syllable of
the word or elsewhere inside the word, the result is the formation of another,
larger stress group, with a larger number of syllables. The stress clash
constrains the formation of a larger prosodic word when the first stress
involved in a stress clash shifts to the left. The newly formed prosodic
word cannot violate the syntactic clash constraint (see below): if it did,
the conversion of the syllabic memory would be unsuccessful.
Minimum duration of prosodic words

Again, stress clash occurs when two consecutive syllables are stressed (not
necessarily implying content words). Experimental data (Martin, 2014b) show
that a time gap must be present between the syllables involved, in order to
maintain them at least about 250 ms apart, even if the duration of the syllables is
shorter, for example 150 ms.
If we accept the inclusion of this time gap as part of the second stress group,
we can conclude that the minimum duration of a stress group (and therefore of a
prosodic word) is about 250 ms, as we cannot have two successive stressed
syllables closer than this value (Fig. 5.17).
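The arithmetic behind this minimum can be sketched in a few lines. The sketch below is only an illustration: it assumes that the 250 ms interval is measured from the start of one stressed syllable to the start of the next, and the function name is hypothetical.

```python
# Approximate minimum interval between two stressed syllables (value from the text).
MIN_STRESS_INTERVAL_MS = 250

def required_gap_ms(first_syllable_ms, min_interval_ms=MIN_STRESS_INTERVAL_MS):
    """Return the time gap (in ms) to insert after a stressed syllable so that
    the next stressed syllable starts at least `min_interval_ms` after the
    first one started (onset-to-onset measurement is an assumption)."""
    return max(0, min_interval_ms - first_syllable_ms)

# A 150 ms stressed syllable needs a gap of about 100 ms before the next stress.
print(required_gap_ms(150))  # 100
# A syllable already longer than the minimum interval needs no extra gap.
print(required_gap_ms(300))  # 0
```

If the gap is counted as part of the second stress group, as proposed above, that group can never be shorter than about 250 ms.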
Figure 5.17 Prosodic word shortest duration in ms for various speech styles
(Martin, 2014b): political discourse (Pol), narrative (Nar), conference (Cnf),
radio news (Jpa), university lecture (Lec). [Bar chart; vertical axis: shortest
AP duration, 0 to 500 ms.]
Figure 5.18 A car license plate difficult to read.
Figure 5.19 A telephone number difficult to read and to remember.
Maximum duration of prosodic words

A car license plate without spaces between fields (Fig. 5.18, Oran, Algeria) is
difficult to read unless you know the internal structure of the information: the
first five digits are the actual license number, the next three digits give the
registration date (January 2007), and the last two digits correspond to the
wilaya, an administrative district in Algeria (here Oran). The license plate
layout should include spaces, as in 03752 107 31, which is much easier to decode.
Phone numbers give another example. Figure 5.19 gives the phone number
of a small shop in China. Locals have no trouble in decoding the information, as
they know the format of phone numbers: the first three digits for the cell phone
access (138), and then two groups of four digits for the number itself (7837
6396). Again, readers not familiar with the phone number format have to divide
the whole sequence into subgroups, such as 138783 and 76396 for example, in
order to be able to interpret (and write down) the information.
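The two decoding situations can be illustrated with a small sketch. The function name and the representation of a format as a list of field widths are assumptions made for the illustration; the digit strings and field widths are those of the two examples above.

```python
def chunk(digits, field_widths):
    """Split a string of digits into fields of the given widths."""
    if sum(field_widths) != len(digits):
        raise ValueError("field widths do not match the number of digits")
    fields, start = [], 0
    for width in field_widths:
        fields.append(digits[start:start + width])
        start += width
    return fields

# License plate: license number (5 digits), date (3 digits), district (2 digits).
print(" ".join(chunk("0375210731", [5, 3, 2])))   # 03752 107 31
# Phone number: access code (3 digits), then two groups of four digits.
print(" ".join(chunk("13878376396", [3, 4, 4])))  # 138 7837 6396
# A reader unaware of the format may pick an arbitrary division instead.
print(" ".join(chunk("13878376396", [6, 5])))     # 138783 76396
```

A reader who knows the format applies the right field widths at once; a reader who does not still has to chunk the sequence, but with arbitrary boundaries.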
Aside from numbers on a car license plate or phone numbers, the same process
should be applicable to strings of syllables. This would mean that sequences of
more than some four or five syllables cannot be decoded, i.e. handled for further
linguistic processing, without being split into short sequences.
However, when syllables do correspond to strings already stored in the
listener’s long-term memory, their segmentation can result in a larger number
of syllables being identified. This suggests that some triggering mechanism
must exist to allow the listener to determine and realize quickly the
segmentation into stress groups, which Gilbert and Boucher (2007) call temporal
groups: chunks of syllables ending with a stress.
Although words containing more than four syllables are rare in most if not all
languages, French, taken as an example, offers some cases:
Anticonstitutionnellement (“unconstitutionally”), 8 or 9 pronounced syllables,
depending on the realization of a mute e after the 7th syllable
Paraskevidekatriaphobie (“paraskevidekatriaphobia, fear of Friday the 13th”), 10 syllables
Παρασκευή /pa.ɾa.skɛ.ˈvi/ “Friday”
δεκατρείς /ðɛ.ka.ˈtɾis/ “thirteen”
from δεκα /ðɛ.ka./ “ten” and τρείς /ˈtɾis/ “three”
φόβος /ˈfɔ.vɔs/ “fear”
The normal stress pattern of these examples in French assumes a stressed last
syllable: Anticonstitutionnellement, Paraskevidekatriaphobie. However, at least
another stressed syllable must be realized in order to be pronounced and, as I
will discuss later, perceived. We can then have Anticonstitutionnellement
and Paraskevidekatriaphobie or Paraskevidekatriaphobie or even
Paraskevidekatriaphobie for example, depending on the knowledge of the speaker
about the internal morphology of these rare words (the latter realization by
speakers knowledgeable of Modern Greek). A similar effect occurs in other
Romance languages. In Italian, one of the longest words (outside chemical and
pharmaceutical entries, which could as well be written with spaces or hyphens) is
precipitevolissimevolmente, “in a way like someone/something that acts very
hastily.”
Another possible realization would, of course, consist of separate syllables,
so that each syllable would be stressed, as in an.ti.con.sti.tu.tio.nnel.le.ment
and pa.ra.ske.vi.de.ka.tri.a.pho.bie.
This apparent obligation to stress at least one syllable in a sequence of seven
was already noticed in the sixteenth century! Indeed, in his Le tretté de la
grammère françoise, Louis Meigret (1550) coined some unattested words
containing more than seven syllables, in bizarre sentences such as Les
Mégapolitains surreparlementeront quoique nous surreparlementassions
“The Megalopolitans will overparley whatever we will overparley,” inducing
stressed syllables in addition to the normally stressed final syllable of these
artificial (invented) words.
Another similar argument pertains to the rather outdated attempt to define
stress groups as minimal units of meaning in French. In many teaching manuals
of French, stress groups were and are still frequently defined as minimal units of
meaning (unité minimale de sens). Examples such as l’armoire, la petite
armoire, la jolie petite armoire were commonly given to illustrate how learners
of French should identify stress groups and stress their final syllable accordingly.
However, by simply expanding the example as la jolie petite armoire vert-
bouteille, with ten syllables, another syllable must be stressed: la jolie petite
armoire vert-bouteille. This shows clearly that the criterion of sense group
cannot be retained to define stress groups.
It appears then that another reason needs to be found to explain this
constraint limiting the number of syllables in a prosodic word. It has been observed
experimentally that the limit of seven syllables is actually variable, authors
giving values varying from five to nine syllables, depending on the speech rate
realized by the speakers. The actual parameter involved may then be rhythmic
rather than directly linked to the number of syllables, and turning to cognitive
aspects of syllable perception may give us a convincing explanation. As will
be discussed below, the maximal number of syllables in a prosodic word
pertains to the duration of enunciation of the group rather than the number of
syllables it contains. Figure 5.20 gives some values for maximal prosodic
word duration for various speech styles.
Eurhythmy
Eurhythmy refers to the tendency for speakers to either (1) realize consecutive
temporal groups with comparable numbers of syllables or (2) accelerate the
speech rate when temporal groups contain a larger number of syllables and slow
down when they have few syllables. Both strategies achieve the same goal: to
balance the duration of enunciation of successive temporal groups. In Max adore
les chocolats (“Max loves chocolate”) for instance, a eurhythmic realization
would be [Max adore] [les chocolats] to balance the number of syllables of both
groups, whereas a realization congruent with syntax would be [Max] # [adore les
chocolats], whose non-eurhythmicity could be compensated for by insertion of a
pause after Max (since in this case the speech rate is difficult to modify on a
single syllable). These variations are possible due to the lack of lexical
stress in French.
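The first strategy, balancing the number of syllables, can be sketched as a comparison of the possible two-group phrasings. The sketch is only an illustration under stated assumptions: syllable counts are assigned by hand, the article is kept with its noun as one unit, and the function name is hypothetical.

```python
def two_group_phrasings(units):
    """Enumerate the phrasings of `units` into two consecutive groups.

    `units` is a list of (text, syllable_count) pairs. Each result is a tuple
    (imbalance, left_text, right_text, left_syllables, right_syllables),
    sorted so that the most balanced phrasing comes first.
    """
    results = []
    for cut in range(1, len(units)):
        left, right = units[:cut], units[cut:]
        n_left = sum(count for _, count in left)
        n_right = sum(count for _, count in right)
        results.append((abs(n_left - n_right),
                        " ".join(text for text, _ in left),
                        " ".join(text for text, _ in right),
                        n_left, n_right))
    return sorted(results)

# Hand-assigned syllable counts (assumption): Max = 1, adore = 2, les chocolats = 4.
units = [("Max", 1), ("adore", 2), ("les chocolats", 4)]
for imbalance, left, right, n_left, n_right in two_group_phrasings(units):
    print(f"[{left}] [{right}]  {n_left} + {n_right}, imbalance {imbalance}")
```

With these counts the eurhythmic phrasing [Max adore] [les chocolats] (3 + 4) comes out as more balanced than the syntactic one, [Max] [adore les chocolats] (1 + 6).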
Figure 5.20 Prosodic word longest duration in ms for various speech styles
(Martin, 2014b). [Bar chart; horizontal axis: Pol, Nar, Cnf, Jpa, Lec; vertical
axis: 0 to 1,200 ms.]
But there is another aspect of the reading process. In fact, for sentences with
any sizable degree of syntactic complexity, the reader must be a good syntax
expert, helped only by punctuation signs. Due to the limitation of the prosodic
structure to two or three levels, an adaptation must frequently be made in the
event of a more complex syntactic structure with more than two or three levels.
It is then no wonder that only right boundaries of syntactic and prosodic phrases
are effectively realized by the reader, so that a minimal recovery of the
prosodic structure intended by the writer is ensured for the potential listeners,
including the speaker. In any case, this implies that more than one prosodic
structure can be associated with a given read text, except perhaps for sentences
with only one or two prosodic words. (Actually, even two prosodic words can be
assembled differently: as a simple Nucleus grouping two prosodic words, as a
Nucleus followed by a prosodic Postfix, or even as two consecutive prosodic
Nuclei (a deferred complement), as described in Chapter 8 on macrosyntax.)
In this regard, the prediction of a prosodic structure in French is more
problematic. Lacking lexical stress, speakers of French, even in reading mode, may or
may not effectively stress stressable syllables, i.e. the syllables ending predicted
stress groups. The only limit to the predicted variation is the maximal stress group
duration. This duration can be translated into the number of syllables a projected
stress group can contain. If this number is too small, the speaker (and the silent
reader alike) may skip one or more stressable syllables to form a larger stress
group, especially when selecting a faster speech rate, since the stress group
constraint pertains to the total duration and not to the number of syllables.
Incidentally, a possible way to force a reading speaker to effectively stress
stressable syllables in French consists of using long words in the text so that the
number of syllables between two consecutive stressable syllables is close to the
limit of seven, and therefore to the maximal stress group duration when
enunciated, as in, for example, [les éléphanteaux] [de Marie-Antoinette] [se sont
regroupés] [en associations] “the baby elephants of Marie-Antoinette grouped
themselves in associations,” with successive prosodic words of 5, 6, 5, and 5
syllables. But this scheme would not force the speaker to realize a prosodic
structure congruent to syntax.
As already mentioned, eurhythmy can take two forms: the speaker strives to
achieve some comparable duration of successive stress groups either (1) by
compressing syllable duration for large prosodic words and slowing down the
speech rate for small prosodic words, or (2) by realizing a prosodic structure
balancing not only the number of syllables of successive stress groups but also
the number of syllables of successive prosodic phrases, ip’s and IPs (see
Fig. 5.21) (Martin, 1987). This, of course, may result in prosodic structures
non-congruent with syntax. For example: Marie adore les chocolats belges
“Mary adores Belgian chocolates” with the following phrasing: [Marie adore]
[les chocolats belges]. This restructuring seems to be more frequent in read
than in spontaneous speech. In read speech, the eurhythmicity effect is obtained
by restructuring the prosodic structure from an assumed state of congruence with
the syntactic structure, whereas in spontaneous speech, the effect is reached by
maintaining congruence while compressing or extending the syllabic duration
(Martin, 2014b).
Phrasing                                 Type                  Speech rate
[Marie adore] [les chocolats]            eurhythmic (4 + 4)    Average, Average
[Marie] [adore les chocolats]            syntactic (2 + 6)     Slow, Fast
[Antonio mangia] [la zuppa inglese]      eurhythmic (5 + 6)    Average, Average
[Antonio] [mangia la zuppa inglese]      syntactic (3 + 8)     Slow, Fast
Figure 5.21 Two ways to obtain eurhythmicity: balancing the number of
syllables or varying the speech rate.
Figure 5.22 Syllabic duration as a function of the number of syllables in
prosodic words, with regression line (nar-fr narrative style). Standard deviation
is indicated by double arrows (Martin, 2014b). [Plot titled “Syllabic duration
vs. Nb Syllables in Ap (cnf-fr)”; horizontal axis: Nb Syllables, 0 to 9; vertical
axis: Duration, 0 to 600 ms.]
Therefore in languages without lexical stress, like French, the eurhythmy
process produces a compression of syllabic duration when stress groups
include many syllables, and an expansion when the number of syllables is low,
so that the overall stress group duration is compensated for its number of
syllables. Figure 5.22 shows, for example, the average syllabic duration
decreasing from about 250 ms to about 100 ms as stress groups grow from one to
eight syllables.
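The compensation can be made concrete with a rough numerical sketch. It assumes a straight line through two approximate endpoints read from the text (about 250 ms per syllable in a one-syllable stress group, about 100 ms in an eight-syllable group); it is not the regression line of Figure 5.22, and the function name is hypothetical.

```python
def mean_syllable_ms(n_syllables, dur_one=250.0, dur_eight=100.0):
    """Average syllable duration (ms) in a stress group of `n_syllables`,
    linearly interpolated between two assumed endpoint values."""
    slope = (dur_eight - dur_one) / (8 - 1)
    return dur_one + slope * (n_syllables - 1)

# Total stress group duration implied by the interpolation.
for n in range(1, 9):
    per_syllable = mean_syllable_ms(n)
    total = n * per_syllable
    print(f"{n} syllables: {per_syllable:6.1f} ms each, {total:6.1f} ms in total")
```

Under these assumptions the total duration of the stress group stays between 250 ms and about 860 ms for one to eight syllables, i.e. it varies far less than the number of syllables does.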
Examples of stress groups containing up to twelve syllables are found, for
example, in Lehka and Le Gac (2004), but even in this fast speech rate case
their duration is below the limit of about 1,250 ms.
Word alignment
The only constraint linking stress groups to morphology or syntax pertains to
the alignment of words’ last syllables with the end of prosodic words.
Ambiguous application of this principle occurs in puns, whereas violation
may happen in other cases mentioned earlier. Below are some examples in
French:
Ce sont des Mongols fiers de leur passé vs. Ce sont des montgolfières de leur passé
“These are Mongols proud of their past” vs. “These are hot-air balloons from their past”
J’ai vu l’eau tarie dans la fontaine vs. J’ai vu l’otarie dans la fontaine
“I saw the water dried up in the fountain” vs. “I saw the sea lion in the fountain”
Depending on the stress pattern, listeners will perceive, thanks to the asso-
ciated prosodic words, Mongols fiers or montgolfières in the first example, and
l’eau tarie or l’otarie in the second (examples taken from Rossi, 1983). But
again, this constraint is violated for long stress groups, as shown above.
Intonation may resolve ambiguity only if (1) the context and situation provide
no information that could remove the ambiguity, and (2) the incremental process
implemented by storage-concatenation ensures an unambiguous hierarchical
grouping of stress groups.
In a book devoted to linguistic ambiguities in French, C. Fuchs (1996)
gives some examples of syntactic ambiguities (which would cease to be
ambiguous if pronounced). Although in practice these cases seldom occur
orally in real life, the following examples may illustrate how the ISC process
operates:
[Nadine] [couvre] [la corbeille de fleurs] vs. [Nadine] [couvre la corbeille] [de
fleurs] “Nadine covers the flower basket” vs. “Nadine covers the basket with
flowers”
[Il a parlé fort] [spécialement] vs. [Il a parlé] [fort spécialement] “He spoke loudly
specially” vs. “He spoke very specifically”
[Moules marinières] [et frites à volonté] vs. [Moules marinières et frites] [à volonté]
or most probably [Moules marinières] [et frites] [à volonté] (to avoid a stress
group with six syllables) “Mussels, and fries at will” vs. “Mussels and fries, at
will” and “Mussels, and fries, at will”
For all these examples, the ambiguity is resolved by the difference in phrasing
(cf. Boulakia, 1985).
Syntactic clash
An apparent constraint seems to exist between the syntactic and prosodic
structures, governed by the syntactic clash rule (Martin, 1987). This rule defines
properties of sentence phrasing, i.e. the way sequences of syllables are grouped
together to form stress groups, or, seen from a top-down point of view, how
sentences are divided into stress groups. The original definition of the syntactic
clash forbids grouping into one prosodic word two syntactic units that are
dominated immediately (i.e. at the first level) by distinct nodes in the
syntactic structure. For example, in Mary is eating this excellent chocolate, the
following phrasing would not be well formed: [Mary is] [eating this] [excellent
chocolate], since in [Mary is], the auxiliary is is dominated in the syntactic
structure by a node which also dominates eating in the next prosodic
word. Likewise in [eating this], this is dominated immediately by a node
which also dominates chocolate in the next chunk (Fig. 5.23).
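A deliberately simplified reading of this rule can be sketched as follows. The sketch assumes a flat, hand-written toy parse in which each word is mapped to the label of the node immediately dominating it (the labels NP1, VP, and NP2 are hypothetical), and it flags a chunk as soon as its words do not share the same immediate node; a real analysis would need a full syntactic tree.

```python
# Hypothetical toy parse: word -> label of the immediately dominating node.
PARENT = {
    "Mary": "NP1",
    "is": "VP", "eating": "VP",
    "this": "NP2", "excellent": "NP2", "chocolate": "NP2",
}

def has_syntactic_clash(chunk, parent=PARENT):
    """True if the words of a chunk are immediately dominated by distinct nodes."""
    return len({parent[word] for word in chunk}) > 1

def check_phrasing(phrasing, parent=PARENT):
    """Return (chunk text, clash flag) for each chunk of a phrasing."""
    return [(" ".join(chunk), has_syntactic_clash(chunk, parent))
            for chunk in phrasing]

ill_formed = [["Mary", "is"], ["eating", "this"], ["excellent", "chocolate"]]
well_formed = [["Mary"], ["is", "eating"], ["this", "excellent", "chocolate"]]

for text, clash in check_phrasing(ill_formed):
    print(f"[{text}] clash: {clash}")
for text, clash in check_phrasing(well_formed):
    print(f"[{text}] clash: {clash}")
```

With this toy parse, [Mary is] and [eating this] are flagged, whereas [excellent chocolate] and the three chunks of the second phrasing are not.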
[Mary is] [eating this] [excellent chocolate]
Figure 5.23 A phrasing non-congruent to syntax, involving a syntactic clash.
An explanation for this rule is rather easy to find: although Mary is may be
part of the standard lexicon stored in the listener’s memory, as may eating this,
a “correct” chunking allowing the listener to retrieve a lexical entry (which is
not limited to dictionary entries) can normally be recovered by re-evaluating the
syntactic dependency relations existing between the decoded units; however,
this processing takes time (the N400 and P600 effects, Steinhauer et al., 1999)
and may not be possible in practice if the speech rate is relatively fast.
This syntactic alignment constraint is actually the only link between stress
groups and syntax, but, as demonstrated by numerous examples in spontaneous
speech, it can be violated at the expense of a longer cognitive treatment by the
listener (with a P600 effect). It allows us to define the syntactic group contained
in temporal groups as the pertinent minimal unit instead of the word, and
suggests that language users normally manipulate more or less large groups
instead of words as defined in a dictionary.
Experimental data
In a detailed analysis of various styles of spontaneous speech in French
(Martin, 2014b), I made the following observations confirming the points
mentioned above:
a. Successive stressed syllables are found (“stress clash,” corresponding in
French to one single-syllable stress group preceded by any other stress
group), but there is a minimal amount of time between two consecutive
stressed syllables (actually between two consecutive stressed vowels). This
observation confirms the hypothesis about Delta brain waves synchronizing
the perception of AP, for a maximum frequency, i.e. a minimal period of
about 250 ms (see below).
b. Cases where eurhythmy is obtained at the expense of congruence of the
prosodic structure with syntax are rare, so the eurhythmic compensation is
done by compressing the syllabic duration in stress groups with many
syllables. This was already observed empirically by Fónagy & Magdics
(1960), Lehka & Le Gac (2004), Pasdeloup (2004), and more recently Avanzi
et al. (2013). One of the reasons why balancing of the number of syllables is
not frequent in spontaneous data may be that such balancing requires
preplanning, which is more likely in read speech (cf. the read
phrasing [Marie adore] [les chocolats] vs. the spontaneous [Marie]
[adore les chocolats]). It seems that speakers realize eurhythmic phrasing
when the syntactic constraint is weak or absent, i.e. for enumeration, short
read sentences, etc.
c. No cases of syntactic clash were observed, although it is not rare in
multimedia and political speech styles.
Brain waves and prosodic structure

In the years 1930–1940, researchers observed that the human brain consisted of
a very large number of neurons (in the order of 100 billion) interconnected in
groups in specific regions of the brain mass. These interconnections allow a
transfer of chemically stored information in each neuron. These transfers
induce variations of a small electric potential (in the µV range) that can be
observed through electrodes positioned on the subject’s scalp
(electroencephalography, or EEG). These electrical variations are called evoked
potentials, as they result from a sensory stimulation, auditory, visual, or other.
The electrical activity produced by transfers from groups of neurons to other
groups of neurons is not haphazard. First, these transfers operate in specific
frequency ranges linked to specific cognitive activities, and second, they may
synchronize in phase with other waves operating in a different frequency range.
Greek letters designate specific frequency ranges: Alpha, Delta, Theta, Gamma,
and so on. The ranges of interest here are Delta, varying from 1 to 4 Hz, and
Theta, varying from 4 to 10 Hz. It may be more meaningful to consider the
period ranges of Theta and Delta waves instead of their frequency ranges: the
period range of Theta is thus from 100 ms to 250 ms, and the period range of
Delta is from 250 ms to 1,250 ms. Of course these values are approximate, but
they correspond to the figures reported in most studies (Ghitza et al., 2013).
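The conversion between frequency and period ranges is a simple inversion, sketched below. The function names are hypothetical; the band limits in Hz are those given above.

```python
def period_ms(frequency_hz):
    """Period in milliseconds of an oscillation of the given frequency in Hz."""
    return 1000.0 / frequency_hz

def frequency_hz(period_in_ms):
    """Frequency in Hz of an oscillation of the given period in milliseconds."""
    return 1000.0 / period_in_ms

# Frequency limits of the two bands, in Hz, as given in the text.
bands_hz = {"Theta": (4.0, 10.0), "Delta": (1.0, 4.0)}

for name, (f_low, f_high) in bands_hz.items():
    # The highest frequency gives the shortest period, and vice versa.
    print(f"{name}: {period_ms(f_high):.0f} ms to {period_ms(f_low):.0f} ms")

# The longest Delta period quoted in the text, 1,250 ms, expressed in Hz.
print(frequency_hz(1250.0))
```

The inversion gives 100 ms to 250 ms for Theta. For Delta, 1 Hz corresponds to 1,000 ms, while the quoted upper limit of 1,250 ms corresponds to 0.8 Hz, which illustrates that the band limits are indeed approximate.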
Event Related Potential (ERP) is usually observed with a relatively large
number of electrodes (from 32 to more than 256) placed around the subject’s skull
according to location standards. EEG signals are stored in real time and
analyzed in the frequency domain with either a (Fast) Fourier or a Wavelet
transform. The resulting representation is very similar to spectra obtained in
the frequency analysis of speech, but in a lower frequency range.
Theta brain waves and the perception of syllables

Interestingly, the range of variation of syllabic duration and that of Theta
brain wave periods are about the same: 100 ms to 250 ms (syllables can, of
course, be longer in stylistic or affective realizations), and many researchers
in neurolinguistics are intrigued by the correspondence between syllabic
duration and the Theta frequency range, which may not be a simple coincidence.
The current consensus is that the perception of syllables along the time axis is
synchronized by Theta waves (Henry & Obleser, 2012; Ghitza et al., 2013), or
conversely, that Theta waves are entrained by syllabic acoustic landmarks
(Doelling et al., 2014). Indeed, the intelligibility of syllabic sequences is
improved when the Theta bursts are in phase with the sequence of syllables
(Ghitza, 2012; Henry & Obleser, 2012; Ghitza et al., 2013). In a way, this
process can be compared to mechanisms often used in computer circuitry, where a
master clock synchronizes the transfer of information from a set of circuit
outputs, allowing these to vary in response time, as the resulting state is
retained at the same instant for all electrical outputs, whatever the
individual delay values (glitch) for each of them.
This interpretation (Henry & Obleser, 2012) excludes the role sometimes
given to syllables of synchronizing Theta waves themselves by means of a
phase-locked loop (PLL), as suggested by Ghitza (2013). On the contrary, Henry and
Obleser (2012) demonstrated the importance of phase realignment in
response to frequency-modulated auditory stimuli, where this realignment
depends on the instantaneous phase of Delta oscillations, which are themselves
entrained by a spectrally modulated auditory stimulus (actually pure
tones). To be effective, the maximal Theta phase shift should not exceed
about 50 ms, as the Theta period varies by a factor of only about 2, from
100 ms to 250 ms.
Delta brain waves and stressed syllables

In an experiment involving the perception of sequences of pure tones, Boucher
et al. (2015) showed that a structured train of stimuli (by groups of three in
Fig. 5.25, the third stimulus being longer) would synchronize Theta bursts into
well-organized sequences of three, whereas without lengthening this temporal
organization disappears.
Figures 5.24 and 5.25 illustrate the difference pertaining to ERP recordings
resulting from a stimulus of unstructured pure tones (Fig. 5.24) and a structured
sequence organized in four chunks of 3 pure tones (Fig. 5.25), the last tone of
each chunk with a longer duration.
These figures suggest that, as mentioned by Henry and Obleser (2012), an
outside stimulus, here a longer pure tone, synchronizes Delta bursts, which in
turn synchronize Theta bursts.
Figure 5.24 Example of EEG spectral analysis (channel 28 or Pz) of evoked
potential for a stimulus of a sequence of pure tones with no temporal structure
(top of the figure). Spectral analysis (bottom) shows Theta waves (in the range
4 Hz–10 Hz), which is thus not decoded by the listener.
Figure 5.25 Example of EEG spectral analysis (channel 28 or Pz) of evoked
potential for a stimulus of a sequence of pure tones with a temporal structure
(top of the figure). Spectral analysis (bottom) shows Theta waves (in the range
4 Hz–10 Hz) corresponding to the one stimulus. The structure is therefore
decoded by the listener (Boucher et al., 2015).
Extending this reasoning from pure tone stimuli to speech syllables, I
suggest a hypothesis pertaining to the role of stressed syllables synchronizing
Delta waves, which in turn synchronize Theta waves. In this regard, data
obtained from spontaneous speech analysis (see above) are particularly
interesting. When the duration of sequences of syllables between two
consecutive stressed syllables is measured in French for various speech styles,
the range of values extends from 250 ms to 1,250 ms, values very similar to the
range of period values for Delta waves (Martin, 2014b).
This hypothesis may raise some difficulties of interpretation when applied
to tone languages lacking stressed syllables, leading one to consider that the
melodic changes linked to tone realizations (high flat, rising, falling-rising, and
falling in Mandarin) are responsible for the synchronizing of Delta waves,
as would be the case for stressed syllables, realized in non-tonal languages
essentially with a longer syllabic duration and a melodic change (as we will see
in detail for Romance languages). This hypothesis would be compatible with
Henry and Obleser’s (2012) observations.
As stated earlier, accent (as opposed to stress) is an optional syllabic marker
interpreted by the listener as an emphasis put on some syllable, word, or
syntagm. Unlike stressed syllables, which synchronize Delta waves, accented
syllables do not resynchronize Theta bursts, but they induce a so-called “N400”
response, a negative burst (a negative spike) in the brain wave appearing about
400 ms after the stimulus. Originally linked to syntactic or semantic
incongruities in sentences presented to subjects (Steinhauer et al., 1999), the
N400 effect has been shown to correspond to any extra semantic processing of the
stimuli, whereas syntactic reanalysis and repair induce a “P600” wave response,
a positive burst occurring about 600 ms after any syntactic quirk (Wang et al.,
2012). Examples of syntactic repair are given in Chapter 7, where phrasing
separates words belonging to the same syntagm.
The difference between accent and stress is thus clear on both the phonological
and the neurological level: stressed syllables synchronize Delta waves, whereas
accented syllables trigger N400 responses. Delta wave oscillations are always
active, whereas N400 is occasional. The effect of initial accent in French,
investigated by Astésano et al. (2004) and Aguilera et al. (2014), shows similar
N400 responses.
Delta brain waves frequency range

Whereas Delta waves are synchronized (phase-locked) by stressed syllables,
their range of periods is independently limited to between 250 ms and 1,250 ms.
Outside these limits, the synchronizing of syllabic perception could not take
place (Henry & Obleser, 2012). This means that a group stress language like
French is constrained in the distribution of stressed syllables by these values.
As stressed syllables in French are located on words’ final syllables, the
realization of required syllables may need a certain amount of compression to
fit into the longest time window given by Delta waves. As shown in Figure 5.22
above, this compression is somewhat linear (Martin, 2014b), reducing average
syllabic duration to about 100 ms in (rare) occurrences of eight syllables in a
single stress group.
On the other side of the Delta frequency range, the shortest period is about
250 ms, which implies that a stress group (and thus a prosodic word) with only
one syllable has to have a minimum duration of 250 ms, even if the syllable
itself is much shorter (in the order of 150 ms). This accounts for
the time gap observed in so-called stress clashes, when two successive stressed
syllables do occur (as in Portuguese Max ama o seu café quente “Max likes his
coffee hot,” vs. Max ama o seu café quente, “Max likes his hot coffee,” similar
to the example in French discussed earlier).
Figure 5.26 EEG Theta waves synchronized by Delta pulses. [Diagram labels:
syllable; Theta wave; synchronization; Delta wave; temporal group.]
This observation suggests that there is a compelling reason to squeeze
syllables into a limited time window, a reason linked to the frequency range of
Delta waves, which are essential to synchronize the perception of syllables by
Theta waves. It constitutes another argument in favor of considering Delta waves
to be synchronized by stressed syllables (but not by accented syllables). If
stressed syllables did not play this role, there would be no need to keep two
consecutive stressed syllables 250 ms apart (Fig. 5.26).
Prosodic structure constraints and brain waves

This hypothesis leads to a principle of explanation for each of the constraints
limiting the possibilities of segmentation of a sequence of syllables (Martin,
2013b) seen above: the minimal and maximal duration of stress groups,
eurhythmy, alignment of the last syllable of lexical words in French, and
the avoidance of stress groups associated with distinct syntactic units.
Considered from a time domain point of view, these four constraints could
be rephrased as follows:
a. Stress clash is actually permitted if there is enough time spacing between
consecutive stressed syllables.
b. Stress groups cannot exceed a time limit (in the order of the time needed to
utter 7 +/– 2 syllables).
c. Eurhythmicity tends to realize stress groups of comparable duration, balan-
cing the number of syllables in consecutive stress groups.
d. Syntactic alignment allows the listener to decode chunks of syllables more
rapidly, without using syntactic knowledge at a later stage of linguistic
processing.
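The three duration-based constraints (a to c) can be illustrated with a minimal sketch. It assumes that stress group durations are already measured in milliseconds; the function name and the `max_ratio` tolerance used for eurhythmy are hypothetical choices made for illustration, not values given in the text, and constraint d (syntactic alignment) is not modeled.

```python
MIN_PERIOD_MS = 250    # shortest Delta period quoted in the text
MAX_PERIOD_MS = 1250   # longest Delta period quoted in the text


def check_stress_groups(durations_ms, max_ratio=2.0):
    """Return (index, message) pairs for groups breaking a time-domain constraint.

    max_ratio is a hypothetical tolerance for eurhythmy: the largest accepted
    ratio between the durations of two consecutive stress groups.
    """
    if any(d <= 0 for d in durations_ms):
        raise ValueError("durations must be positive")
    problems = []
    # Constraints a and b: each group must fit between the two Delta period limits.
    for i, d in enumerate(durations_ms):
        if d < MIN_PERIOD_MS:
            problems.append((i, "shorter than the minimal Delta period"))
        if d > MAX_PERIOD_MS:
            problems.append((i, "longer than the maximal Delta period"))
    # Constraint c: consecutive groups should have comparable durations.
    for i in range(1, len(durations_ms)):
        a, b = durations_ms[i - 1], durations_ms[i]
        if max(a, b) / min(a, b) > max_ratio:
            problems.append((i, "duration jump breaks eurhythmy"))
    return problems


print(check_stress_groups([400, 500, 450]))
print(check_stress_groups([200, 1300]))
```

The first call reports no problem; the second flags a group that is too short, one that is too long, and the abrupt change of duration between them.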
Focusing on the time domain aspects of these constraints leads to the
following hypothesis, which implicates Delta waves as triggers for the conversion
of sequences of syllables (i.e. stress groups) stored in short-term memory into
some higher rank unit.
The above observations suggest that long-term memory stores not only stress
groups but also more complex (syntactic) groups as well. This would explain
why speakers can produce relatively large strings of words with their corre-
sponding intonation without apparent effort of construction. Likewise, large
chunks of words can be decoded immediately by listeners without apparent
recalculation of the dependency relations existing between “words,” as they
correspond to chunks already stored “as is” in long-term memory.
Reading text and “reading” pictograms do not belong to the same process.
Reading a text is a process attempting to recover the oral information (even in
silent reading) intended by the writer, whereas pictograms do not refer
directly to an oral counterpart. The values of fixation duration in reading reflect
this recovery process, which involves the prosodic structure and thus the syllable
chunking synchronized by Delta brain waves. Reading is faster than speaking,
where the articulatory process is involved and limited in speed by mechanical
constraints. But the mere necessity of recovering the prosodic structure in
silent reading leads to a speed limitation of its own, i.e. about 250 ms between
stress groups, whatever their syllabic size.
Silent reading of written texts involves a process of subvocalization, i.e. the
presence of a voice reading the text in the head of the reader speaking to her/
himself. This process includes not only the sequences of syllables correspond-
ing to the written material, but also sentence intonation. Since subvocalization
cannot be eliminated other than by changing the status of each word into a
pictographic function (as could be the case for a STOP road panel sign), it could
be argued here that sentence intonation is essential to language comprehension,
and more specifically to the conversion of sequences of syllables into higher-
order linguistic units, the stress groups.
Consequently, reading, and in particular silent reading, is constrained by the
same rules as the prosodic structure, and specifically by the minimal duration of
accent phrases. This minimal value, occurring when prosodic words contain
only one syllable, is about 250 ms, a value which corresponds to the minimal
period value of Delta brain waves. This minimal duration of prosodic words also
limits the maximal number of prosodic words that can be processed in silent
reading to about 240 prosodic words per minute, which corresponds to the
maximal number of words per minute that expert fast readers can process
while keeping a reasonable level of comprehension, i.e. about 800 wpm.
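The arithmetic behind these figures can be sketched as follows. The function names are illustrative, and the ratio of words to prosodic words is only what the two quoted figures (240 prosodic words and 800 words per minute) imply together, not a value stated in the text.

```python
MIN_PROSODIC_WORD_MS = 250  # minimal Delta period quoted in the text


def max_prosodic_words_per_minute(min_duration_ms=MIN_PROSODIC_WORD_MS):
    """Upper bound on prosodic words per minute for a given minimal duration."""
    return 60_000 / min_duration_ms


def implied_words_per_prosodic_word(words_per_minute,
                                    min_duration_ms=MIN_PROSODIC_WORD_MS):
    """Average number of words per prosodic word implied by a reading rate."""
    return words_per_minute / max_prosodic_words_per_minute(min_duration_ms)


print(max_prosodic_words_per_minute())                    # 240.0
print(round(implied_words_per_prosodic_word(800), 2))     # 3.33
```

Under these assumptions, a rate of 800 wpm is reached only if each prosodic word spans a little more than three written words on average.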
Stress groups and brain waves

What would be linked to the conversion of stored syllables into higher linguis-
tic units? An average value for (unstressed) syllables duration would be around
100 ms to 130 ms, giving an average duration of the largest stress group of
about 7 * 100 ms = 700 ms or 7 * 130 ms = 910 ms. Of course, these values vary
according to the nature of syllables in question, especially as one of these
syllables is supposed to be stressed, and therefore of longer duration. A good
candidate for triggering the syllable chunks would then be Delta waves, as their
range of frequency variation is 1 Hz to 4 Hz, or in terms of periods, from 250 ms
to 1,250 ms, corresponding to the minimal and maximal values of prosodic
words shown in Figures 5.17 and 5.22.
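To make these orders of magnitude explicit, here is a minimal sketch of the two computations involved: the duration of a stress group of n syllables, and the conversion between frequency and period. The function names are illustrative. Note that a period of 1,250 ms corresponds to 0.8 Hz, slightly below the 1 Hz bound quoted for the frequency range.

```python
def stress_group_duration_ms(n_syllables, syllable_ms):
    """Duration of a stress group made of n syllables of equal average duration."""
    return n_syllables * syllable_ms


def hz_to_period_ms(frequency_hz):
    """Period in milliseconds of an oscillation of the given frequency."""
    return 1000.0 / frequency_hz


def period_ms_to_hz(period_ms):
    """Frequency in hertz of an oscillation of the given period."""
    return 1000.0 / period_ms


print(stress_group_duration_ms(7, 100))   # 700
print(stress_group_duration_ms(7, 130))   # 910
print(hz_to_period_ms(4))                 # 250.0
print(hz_to_period_ms(1))                 # 1000.0
print(period_ms_to_hz(1250))              # 0.8
```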
Indeed, these values correspond nicely to the different aspects of constraints
evoked above. First, the values of Delta wave periods correspond not only to
the maximum duration of stress groups (about 1,250 ms), but also to the
minimum duration value between two consecutive stressed syllables. When a
so-called stress clash occurs, the space between stressed syllables is in the order
of 250 ms, precisely the order of value for the shortest Delta wave periods.
Finally, brain waves exhibit frequency variations, and therefore period
variations, relatively limited from one spike to the next. Referring to the
hypothesis linking stress groups to Delta waves, this would give a proper
account for eurhythmicity observed experimentally on both read and sponta-
neous speech prosodic structure and implied syllabic segmentation.
Indeed, eurhythmy is manifested by a tendency (for the speaker) either (1) to
balance the number of syllables of consecutive stress groups, or conversely, if
this is not possible due to the choice of words or the resulting
non-congruence with syntax, (2) to accelerate the syllabic rhythm when stress
groups are large, and to slow it when they contain few syllables (possibly
adding a pause when the stress group contains only one syllable).
Constraints revisited
The constraints governing stress groups observed on prosodic structures of
both read and spontaneous speech can find a justification – and an explanation –
in recent neurophysiological research work on speech (Steinhauer et al., 1999;
Friederici, 2002; Makuuchi et al., 2009; Giraud & Poeppel, 2012). These
studies, based essentially on EEG, investigate the possible correlations that
may exist between brain activity and the perception and linguistic treatment of
the information by listeners. They also use magnetic resonance imaging in
specific experiments.
Steinhauer et al. (1999), for instance, demonstrated with this technique of
investigation the precedence of prosodic over syntactic processing. Obrig et al.
(2010) as well as Gilbert and Boucher (2007) and Gilbert (2012) showed that
the speech flow is segmented thanks to prosodic tags and with direct
identification of already memorized units.
Figures 5.27 to 5.30 give explicit explanations linking each of the prosodic
structure constraints to a specific property of Delta waves.
Figure 5.27 The seven syllables constraint is actually a duration constraint,
governed by Delta waves, whose periods (250 ms to 1,250 ms) cannot
accommodate more than seven syllables or so (depending on the speaker's
speech rate). [Diagram labels: long-term memory; identification of temporal
groups; temporal groups; maximum duration; Delta, CPS, 250/1250 ms; conversion;
syllables, 100/250 ms; short-term memory; Theta, 100/250 ms; time.]
Figure 5.28 Delta waves synchronize the transfer of chunks of syllables from
short-term memory. The minimum gap between consecutive stressed
syllables is therefore limited by the minimum period value of the Delta waves
(about 250 ms). A lower value would leave Theta bursts desynchronized and
would lower the efficiency of syllable perception. [Diagram labels: long-term
memory; identification of temporal groups; temporal groups; minimum gap; Delta,
CPS, 250/1250 ms; conversion; syllables, 100/250 ms; Theta, 100/250 ms; time.]
Figure 5.29 The eurhythmicity constraint is linked to the relative stability of
consecutive periods of Delta waves (synchronizing the syllables chunks
transfer), as they cannot change their cycle duration from one cycle to the next
by a large amount. [Diagram labels: long-term memory; identification of temporal
groups; temporal groups; balance of chunks duration; Delta, CPS, 250/1250 ms;
conversion; syllables, 100/250 ms; Theta, 100/250 ms; time.]
Figure 5.30 The syntactic clash constraint: To be transferred, chunks of
syllables of more than four syllables must be identified in long-term memory,
which is not directly possible if a syntactic clash occurs, although the
information can be recovered later by the listener (P600 effect). [Diagram
labels: long-term memory; identification of temporal groups; identification;
temporal groups; Delta, CPS, 250/1250 ms; conversion; syllables, 100/250 ms;
Theta, 100/250 ms; time.]
Sequential sentence structuration by prosody and syntax

Independence of the prosodic structure from syntax does not mean that the two
structurations, syntactic and prosodic, operate simultaneously in parallel in the
sentence. On the contrary, as will be shown later, the prosodic structure
operates before syntax (which in passing explains why the prosodic structure is
not necessarily congruent to syntax). It is therefore not surprising that con-
gruence appears more frequently in laboratory read speech, as in this case the
written text syntax is obviously present before sentence intonation.
As mentioned before, whether in read speech, silent reading, or sponta-
neous speech, the prosodic structure is always present as an obligatory
linguistic object in order to allow the listener to process the information
brought by the flow of syllables and access the syntactic information con-
tained in the sentence. The goal is to demonstrate that the elaboration of the
prosodic structuration necessarily present in the sentence actually precedes
the elaboration of the other structures and particularly of the syntactic struc-
ture, whether in the generation process by the speaker or the perception
process by the listener.
Arguments favoring this conclusion are of various kinds and are based on the
following facts:
– The prosodic structure can exist without any words or any syntax whereas
the opposite is not true. Syntax depends on the presence of prosody, but
prosody does not depend on the presence of syntax.
– The flow of syllables must be segmented in chunks in order to be processed
by Delta brain waves. Delta waves synchronize the transfer of sequences of
syllables (the stress group) stored in short-term memory.
– In spontaneous speech, reformulations are (almost) always realized by
retaking a complete stress group and not a selected word (except sometimes
in stylistic applications, which may not be a real reformulation).
– The dynamic process of the prosodic structure generation shows that the
speaker has to choose between a relation of dependence (rection), independence
(parataxis), or the absence of relation between the current prosodic group
(ip or IP in AM terminology) and a later one. This is done by specifying
prosodic contours indicating a dependency relation toward another contour to
occur in the immediate future (i.e. to “the right”).
All these observations lead to the conclusion that the prosodic structure
operates before the syntactic and the other structures of the sentence. The
usual graphic representation and analysis of the prosodic structure considerably
obscure these aspects, leading to the belief that intonation acts as a
supplement to syntax, to be processed by the listener (in reality only the reader)
as another set of syntactic features.
A simple example: telephone numbers

Telephone numbers may constitute one of the best examples to demonstrate the
total independence between intonation and syntax. The same holds for
multiplication tables and for mathematical and chemical formulas, although to a
lesser extent, as some graphical structuration of the information may be present.
Other than these specific examples, it is customary in the field to rely on
“ambiguous” sentences to force the intonation system to reveal itself in cases
where the syntactic structure is ambiguous. The difficulty here resides in the
effective neutralization of the context and the situation, as these artificial cases
(natural speech production is rarely ambiguous) must be read by speakers put in
a carefully controlled environment, to ensure that the context of enunciation is
not neutralizing the intonation efforts to remove the assumed ambiguity of the
experimental sentences (see above).
Telephone numbers are a priori deprived of syntactic or semantic structure.
At the beginning, more than a hundred years ago, phone numbers had only four
digits and could be enunciated in one single stress group with four digits (e.g.
1234 as one two three four), as two stress groups with two prosodic words in
each stress group (e.g. 1234 as twelve thirty-four), but (normally) not by
spelling out the whole number (e.g. 1234 as one thousand two hundred
thirty-four, which requires three stress groups in this example). Depending on
the language, one system may be more economical in terms of numbers of
syllables than another. Like any other speech production, enunciated phone
numbers are subject to the general constraints given earlier, and linked to
properties of Delta waves: minimal and maximal duration of stress groups,
syntactic clash, eurhythmy, and congruence (or non-congruence) with the
graphic hierarchical organization usually defined by usage or the telephone
company in their telephone directories. This hierarchy can be different accord-
ing to the local cultures. In Quebec and in France, for example, enunciation of
similar phone numbers is very different. Furthermore, telephone numbers are
graphically divided in groups of two to four digits (except in some cases in
China), 01 21 18 35 78 in France, 416 922 5835 in Canada, 4563 8739 in Brazil,
and so on.
Again, the governing principle is the same and follows the process of
storage-concatenation given earlier. The listener must be able to process and
store the information in order to remember and dial the telephone number
given. The typographic presence of blank spaces between groups of numbers
determines a kind of structure to be imperatively followed by the speaker to be
understood.
With regard to the number of syllables, the Quebec version is more efficient,
as each digit is pronounced separately as in North American English. For
example, the telephone number 514 522 4436 is pronounced cinq un quatre
cinq deux deux quatre quatre trois six, with ten syllables grouped in three stress
groups as suggested by the graphic layout (only the number 0 would require the
two syllables ze.ro). In France the same number would be pronounced as cinq
cent quatorze cinq cent vingt deux quarante quatre trente six, with thirteen
syllables grouped in four stress groups, as the usage is to spell out numbers
below 100. The three stress groups will be simply enumerated and form a two-
level prosodic structure:
[[cinq un quatre] [cinq deux deux] [quatre quatre trois six]]
“[[five one four] [five two two] [four four three six]]”
or
[[cinq cent quatorze] [cinq cent vingt deux] [quarante quatre trente six]].
“[[five hundred fourteen] [five hundred twenty-two] [forty-four thirty-six]]”
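The Quebec-style digit-by-digit enunciation described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the grouping is taken directly from the blank spaces of the written number, and the syllable counts follow the text's convention that every digit is monosyllabic except zéro.

```python
FRENCH_DIGITS = {
    "0": "zéro", "1": "un", "2": "deux", "3": "trois", "4": "quatre",
    "5": "cinq", "6": "six", "7": "sept", "8": "huit", "9": "neuf",
}
# Syllable counts as assumed in the text: one per digit, two for zéro.
SYLLABLES = {digit: 1 for digit in FRENCH_DIGITS}
SYLLABLES["0"] = 2


def enunciate_digit_by_digit(number):
    """Return one list of digit names per graphic group of the number."""
    return [[FRENCH_DIGITS[d] for d in group] for group in number.split()]


def bracket(groups):
    """Render stress groups as a two-level bracketed structure."""
    return "[" + " ".join("[" + " ".join(g) + "]" for g in groups) + "]"


def syllable_count(number):
    """Total number of syllables, ignoring the blank spaces."""
    return sum(SYLLABLES[d] for d in number if d in SYLLABLES)


groups = enunciate_digit_by_digit("514 522 4436")
print(bracket(groups))
print(len(groups), "stress groups,", syllable_count("514 522 4436"), "syllables")
```

For 514 522 4436 this yields three stress groups and ten syllables, as in the Quebec reading given above.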
A realization with each number stressed is also possible, but much slower in
order to maintain the minimal duration between consecutive stressed syllables:
[[cinq # un # quatre] # [cinq # deux # deux] # [quatre # quatre # trois # six]]
The maximal duration of a stress group constrains the enunciations as well,
with about 1,250 ms as a maximum duration. This limits the possibility of
enunciating every digit separately without putting stress on every syllable, as
*[cinq un quatre cinq deux deux quatre quatre trois six]
since this sequence is too long to be pronounced in less than 1,250 ms, the
maximum duration of a stress group. Therefore, a realization with a
flat structure (a simple enumeration of the successive digits) requires a
pause # between each syllable, as
[cinq # un # quatre # cinq # deux # deux # quatre # quatre # trois # six].
The prosodically defined hierarchy of stress groups must be congruent with
the “syntax” defined graphically. The interpretation of the sequence of numbers
by storage-concatenation requires the segmentation in 3 and 7 syllables or 3, 3
and 4 syllables: the example [cinq cent vingt deux] [[quarante quatre] [trente
six]] “five hundred twenty two forty four thirty six” for the telephone number
5224436 can be segmented in three different ways (Figs. 5.31, 5.32, 5.33):
522 # 4436; 522 # 44 # 36; 522 44 # 36
Figure 5.31 A hierarchy which contradicts the graphic structure, since the
most important frontier of the prosodic grouping corresponds to an absence of
a graphic space, and vice versa: 522 44 36.
Figure 5.32 A simple enumeration by juxtaposition of the three stress groups.
Figure 5.33 A structure corresponding to the usual graphic representation
522 4436.
In the case of the configuration congruent with the graphic layout [522] [[44]
[36]] pronounced cinq cent vingt deux quarante quatre trente six, with nine
syllables, two syllables can be stressed: the final on six, and the other on deux
ending cinq cent vingt deux. One feature is enough to maintain the contrast, the
choice being left to the speaker:
1. A pause after deux, resulting in a eurhythmic prosodic structure vingt deux #
quarante quatre trente six;
2. A contrast of melodic height, with a higher pitch on deux and a lower pitch
on six;
3. A contrast of melodic slope, deux rising vs. six falling;
4. A contrast of duration, with the syllable deux shorter than six.
This last choice may be more difficult to realize, given the intrinsic duration
of single stress groups.
These considerations may throw new light on memory recall experiments
for strings of digits, where speakers of tone languages perform better than
speakers of non-tonal languages: Mandarin and Cantonese speakers recall eight to
ten digits, compared to four digits for English speakers (Chen et al., 2009).
Although the digits are monosyllabic in both types of languages (with the
exception of zero and seven in English), sequences of syllables bearing a tone
are better remembered and processed. In view of the ISC model, this could be
explained by the fact that in tone languages, each monosyllabic digit is
processed as a stress group, whereas for English or French, digits are grouped
in chunks (i.e. stress groups) of four to five syllables.
6 Lexical stress in Romance languages
One of the key features in the description of sentence intonation in Romance

languages other than French resides in the location of lexical stress.
Indeed, whether in the Autosegmental-Metrical or Incremental Storage-
Concatenation models, the metrically strong or stressed syllables are the
places in the stress groups where prosodic events are located. In the AM
model, lexical stress simply defines the location of pitch accents, which
play no role in the indication of the prosodic structure (Frota, 2009;
Feldhausen, 2010). In the ISC model, however, prosodic characteristics
function also as indicators of the prosodic structure, together with boundary
tones.
Stress and accent

Many approaches and models relate the placement of stress in Romance
languages to some phonological characteristics of the syllable. However,
with two very simple principles based on a stress rule in Latin and morpholo-
gical properties of the Romance language considered, it is possible to under-
stand and predict its position in most cases. In this regard, metrical theory leads
to rather strange conclusions, not really fit for Romance languages. The
obstinacy of some researchers to stick to syllabic properties to infer stress
placement rules may constitute a school case in this academic domain.
The general principle, valid for all Romance languages (except French) and
already proposed by Paul Garde (1968, 2013), relies on the morphological
analysis of lexical words into optional prefixes, a stem, and optional suffixes
and inflections:
(Prefix) + Stem + (Suffix) + … + (Suffix) + (Inflection) + … + (Inflection)
The stress rule is based on the notion of stressability, i.e. the possibility for
a syllable to be stressed. The stem is always stressable, and if derived from
Latin, its potential stress location is predictable from the original Latin stress
according to a simple rule given below. Suffixes and inflections are either
stressable (one of their syllables is stressed in at least one morphological

configuration) or non-stressable (none of their syllables can be stressed
whatever the configuration). For suffixes and inflections, this property depends
on their subcategory, i.e. whether they are nominal, adjectival, adverbial, or
verbal suffixes or inflections. For example, the nominal suffix -ic is
non-stressable in Italian (e.g. in repubblica “republic”), but -on is stressable,
as in canzone “song.” Likewise, a verbal inflection can be stressable, as -ò in
cantò, “he sang,” whereas -ano is not stressable in cantano “they sing.” The
general rule is then: “The last (rightmost) stressable morpheme gets the stress.”
The final accentuation rule simply places the stress on the rightmost possible
syllable among the sequence of stressable and non-stressable syllables. The same
principle applies to the root, which gets the stress if followed by
non-stressable suffixes or inflections.
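The general rule can be sketched as a short function. This is a minimal illustration under stated assumptions: the `Morpheme` type and the function name are hypothetical, the morphological segmentation and the stressability of each morpheme are supplied by hand rather than derived, and the choice of the stressed syllable inside the selected morpheme is left out.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    form: str
    stressable: bool


def stressed_morpheme(morphemes):
    """Return the last (rightmost) stressable morpheme of a word."""
    for morpheme in reversed(morphemes):
        if morpheme.stressable:
            return morpheme
    # The stem is always stressable, so a well-formed input never gets here.
    raise ValueError("no stressable morpheme in the word")


# Italian examples from the text: stressable -ò vs. non-stressable -ano.
canto_past = [Morpheme("cant", True), Morpheme("ò", True)]
cantano = [Morpheme("cant", True), Morpheme("ano", False)]

print(stressed_morpheme(canto_past).form)   # ò
print(stressed_morpheme(cantano).form)      # cant
```

With a non-stressable inflection the scan falls back on the stem, which is the case described in the last sentence above.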
Stress in various languages

Since stressed syllables are central in the Incremental Prosodic Structure
approach, let’s consider three types of languages from the point of view of
stressed syllables:
a. The so-called fixed stress languages (Garde, 1968, 2013). The position of
stressed syllables does not depend on the lexicon, but on some fixed rule, as
in French on the last syllable of a lexical word (actually of a stress group);
b. The “variable stress” languages, such as English or Italian, whose stressed
syllable position is part of the lexical entries, and may be linked to
morphological boundaries (as in Romance languages);
c. The tone languages, such as Mandarin or Vietnamese, with a large propor-
tion of monosyllabic words, but with no obvious stressed syllable on
plurisyllabic (compound) words.
French is particularly interesting here as the position of stress in a group of
syllables is apparently free from lexical or even (partially) from syntactic
constraints. Indeed, because in French stress is located on the last (pro-
nounced) syllable of a stress group, this syllable may end a lexical entry
which can not only be a noun, adjective, verb or adverb as it is traditionally
asserted, but also belong to any other grammatical category. Aside from the
often-quoted example of the pronoun subject–verb inversion in imperative
and interrogative sentences (as in donne le “give it” or comprends-tu? “do
you understand?”), showing a last syllable stressed, spontaneous speech data
frequently bring other examples, such as … et on comprend que “… and we
understand that” (Martin, 2014b) or demain hausse de la # temperature
(Weather news, France 2, 1/7/14) “tomorrow, increase of temperature.”
Incidentally, the popular author Frédéric Dard, who in his novels plays
frequently with all aspects of language, wrote the following sentence: une
main conquérante comme Guillaume le, “a conquering hand as William the

[Conqueror]” (San Antonio, cf. Chapter 10), forcing the reader to stress the
last syllable of the sentence le, normally not stressed, simply because le is
the last syllable of the stress group.
In summary, we have two kinds of prosodic units for Romance languages:
a. In French, stress groups end with prosodic events instantiated by
melodic contours, generally (but not always) involving a lengthening of
the last stress group syllable.
b. In the other Romance languages, stress groups contain one prosodic event
similarly instantiated by a melodic contour on the corresponding stressed
syllable, but not necessarily in final position. A second prosodic event can
also take place on the stress group final syllable.
Stressed syllables in Latin

Classical Latin had five phonological vowels, |i| |e| |a| |o| and |u| (the vowels
included in today’s computer Latin fonts). Each vowel can be short or long
so that the vocalic system includes five short and five long vowels. Latin
also has three diphthongs, |aj|, |oj|, and |aw|, written ae, oe, au (Alkire &
Rosen, 2010).
The stress rule is as follows: stress is located on the penultimate syllable if
this syllable is heavy, i.e. long or closed (ended with a consonant). If the
penultimate is light, stress falls on the prepenultimate syllable. Diphthongs
are always long.
If the penultimate syllable is not closed by a consonant and is not a
diphthong, the stress is predictable only if we know that its vowel is long
(which is a property of the lexicon). If the word has only two syllables, stress
falls on the penultimate, and if it has only one syllable, this unique syllable is
stressed (but only if the word is a noun, an adjective, an adverb, or a verb). For
example, in.fer.no has its second syllable closed by the consonant |r| and is
therefore heavy, so that the penultimate is stressed: inferno. The syllable mi in
a.mi.ca “friend” contains a long vowel |i|, and the stress is therefore on the
penultimate: amica. The syllable bu in fa.bu.la “story” has a short vowel |u|
and is therefore light, so the stress of fabula falls on the prepenultimate:
fabula. The word spi.na
“thorn” has only two syllables. Since there is no other possibility, its first
syllable, whether light or heavy, is stressed: spina.
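The Latin rule lends itself to a compact sketch. The function below is an illustration only: its name is hypothetical, the syllabification is supplied by hand, and the weight of the penultimate syllable is passed in as a flag, since vowel length is a lexical property that cannot be read off the spelling. The word fa.bu.la, with a light penultimate, is an example chosen here for the third case.

```python
def latin_stress_index(syllables, penult_heavy):
    """Return the index of the stressed syllable in a Latin lexical word.

    penult_heavy states whether the penultimate syllable is heavy
    (long vowel, diphthong, or closed by a consonant).
    """
    n = len(syllables)
    if n == 0:
        raise ValueError("a word needs at least one syllable")
    if n <= 2:
        # One syllable: the only candidate. Two syllables: the penultimate.
        return 0
    # Three or more syllables: penultimate if heavy, else prepenultimate.
    return n - 2 if penult_heavy else n - 3


examples = [
    (["in", "fer", "no"], True),    # penultimate closed by |r|
    (["a", "mi", "ca"], True),      # penultimate with a long vowel
    (["fa", "bu", "la"], False),    # penultimate with a short vowel
    (["spi", "na"], True),          # two syllables: weight is irrelevant
]
for syllables, heavy in examples:
    stressed = syllables[latin_stress_index(syllables, heavy)]
    print(".".join(syllables), "->", stressed)
```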
The word vir.tus “virtue” (which gave virtù in Italian, virtud in Spanish,
virtut in Catalan, virtude in Portuguese, virtute in Romanian, and vertu in
French) is derived from vir “man” and tus (with a long u), therefore stressed
according to Latin stress rules on the morphological suffix -tu: virtus. So the
rule applies on morphological components in this case, as the general principle
does not predict stress on the ultimate syllable.
Stressed syllables in Romance languages (other than French)

Since the stress systems of Romance languages other than French have
so much in common, they will not be discussed individually but as a
group. The position of stress in Romance languages is restricted to the
six-syllable window at the right edge of the word (six syllables for verbs
and up to four syllables for nouns, adjectives, adverbs, and other gramma-
tical categories).
Each language may have its own terminology to refer to each stress config-
uration, as given below.
Stress on the last syllable (Oxyton)

Italian: tronco: virtù “virtue,” caffè “coffee,” amerò “I will love,”
marked in the orthography by a stress mark
Spanish: agudas: conversar “converse,” pastor “pastor,” oración
“prayer,” sometimes marked in the orthography by a stress mark
Catalan: agudas: nació “nation,” després “after,” valor “valor”
Portuguese: agudas: ruirão “they will collapse”
Romanian: tronco: cercetator “researcher,” cobor “I descend”
Stress on the penultimate syllable (Paroxyton)

Italian: piano: amare “to love,” nazionale “national”
Spanish: llanas: libro “book,” difícil “difficult,” ángel “angel,” some-
times marked in the orthography by a stress mark
Catalan: llanas, plana: Barcelona “Barcelona,” plaça “place,”
lingüista “linguist”
Portuguese: plana: duvida “he doubts,” falaram “they spoke,”
túnel “tunnel”
Romanian: paroxytone: fântâna “fountain”
Stress on the antepenultimate syllable (Proparoxyton)

Italian: sdrucciolo: telefono “telephone,” celebre “famous,” prendilo
“take it”
Spanish: esdrújulas: préstamo “loan,” hipócrita “hypocritical,”
agnóstico “agnostic,” crédito “credit,” always marked in the orthography
by a stress mark
Catalan: proparoxítono, esdrújulas: típica “typical,” política “politics,”
always marked in the orthography by a stress mark
Portuguese: proparoxytone: dúvida “doubt” (noun), dinâmicos
“dynamic,” lâmpada “lamp”
Romanian: proparoxytone: modele “the fashions,” incaleca “to
mount a horse”
Stress on the anteantepenultimate syllable (Preproparoxyton)

Italian: bisdrucciole: caustico “caustic,” fabbricalo “fabricate it”
Spanish: sobreesdrújulas: cómetelo “eat it,” tráemela “bring it to
me,” always marked in the orthography by a stress mark
Catalan: sobreesdrújulas: transpórtaselo “transport it,” trágatelo
“swallow it,” always marked in the orthography by a stress mark
Romanian: doisprezece “twelve,” lingurile “the spoons,” veveriță
“squirrel”
Stress on the anteanteantepenultimate syllable
(Prepreproparoxyton)
Italian: trisdrucciole: fabbricalmelo “fabricate it for me”
Romanian: siaptesprezece “seventeen”
Stress on the anteanteanteantepenultimate syllable
(Preprepreproparoxyton)
Italian: quadrisdrucciole: fabbricalmecelo “fabricate it for me to him”
Romanian: siaptesprezecelea “seventeenth”
Orthographic convention and homographs

Spanish, Catalan, and Portuguese indicate which vowel is stressed in the
orthography of the word, either indirectly from the written ending of the word, from
the presence of a graphic acute accent on the stressed vowel, or from the general
default rule (stress on the penultimate syllable). Any exception to the rule and
any homographic case receive an orthographic accent to avoid ambiguity. In
Romanian, for example, véselă “jovial,” vesélă or veselă “tableware,” albí “to
whiten,” albi “white, masculine plural adjective,” copii “copies,” copii “chil-
dren.” In Italian, even homographs are not distinguished in writing: capitano
“they happen,” capitano “captain.” The configuration of the written ending is
supposed to reflect phonological properties influencing stress location, and it
has had a great impact on the phonology of stress in Romance languages, which
has looked for correlations between stress and the nature of the phonemes
involved. However, as discussed below, the location of stressed syllables is
linked not to phonology but to morphology, a rather obvious fact for learners of
any Romance language but not easily accepted by some linguists.
The paroxyton is the default stress pattern in orthography. Indeed, it corre-
sponds to the most frequent case in Spanish as in all the other Romance
languages (78% of cases in Italian; Sandri & Vivalda, 1981). The rule specifies
that this is the case if the word ends with a vowel, an -s, or an -n, for example
llana “flat,” modo “mode,” medios “media,” examen “review.” Words which
do not end with a vowel, -s, or -n are then oxyton: virtud “virtue,” nacional
“national,” reloj “clock,” feliz “happy.”
All exceptions to the above cases take a graphic accent:

Oxyton exceptions: también “also,” jamás “never,” lección “lesson,”
además “moreover.”
Proparoxyton exceptions: áspera “rough,” esdrújula “antepenulti-
mate,” católico “catholic,” propósito “purpose” (actually all pro-
paroxyton words get a graphic accent).
Then paroxyton words which do not end with a vowel, -s, or -n are also
exceptions: difícil “difficult,” cárcel “prison,” automóvil “automobile,” bíceps
“biceps,” ántrax “anthrax.”
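The written convention above can be sketched in a few lines of Python. This is only an illustration of the orthographic convention, not of any stress rule of the spoken language. The function names (vowel_groups, stress_from_end) and the simplified treatment of vowel groups are assumptions of this sketch: it ignores the silent u of que/gui, word-final y, and monosyllables carrying a diacritic accent.

```python
ACCENTED = set("áéíóú")
STRONG = set("aeo") | ACCENTED   # accented í/ú are treated as strong (hiatus)
VOWELS = STRONG | set("iuü")


def vowel_groups(word):
    """Return the vowel groups (one per written syllable) of a word."""
    word = word.lower()
    groups, i = [], 0
    while i < len(word):
        if word[i] in VOWELS:
            j = i + 1
            # Extend the group unless two strong vowels meet (hiatus).
            while (j < len(word) and word[j] in VOWELS
                   and not (word[j - 1] in STRONG and word[j] in STRONG)):
                j += 1
            groups.append(word[i:j])
            i = j
        else:
            i += 1
    return groups


def stress_from_end(word):
    """Stressed syllable counted from the end (1 = oxyton, 2 = paroxyton...)."""
    groups = vowel_groups(word)
    if not groups:
        raise ValueError(f"no vowel found in {word!r}")
    # A graphic accent always wins over the default convention.
    for k, group in enumerate(groups):
        if ACCENTED & set(group):
            return len(groups) - k
    if len(groups) == 1:
        return 1
    last = word.lower()[-1]
    # Default: paroxyton after a vowel, -n or -s; oxyton otherwise.
    return 2 if (last in VOWELS or last in "ns") else 1


for w in ["llana", "examen", "reloj", "también", "difícil", "católico"]:
    print(w, stress_from_end(w))
```

The loop prints the position of the stressed syllable from the end of each word, for instance 2 for llana and examen, 1 for reloj and también, and 3 for católico.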
It may seem that these orthographic rules capture very clearly a phonological
process that would explain word accentuation in Spanish. However, although
Spanish orthography closely represents the pronunciation, the rules given in
grammars actually say nothing about the applicable stress rules; they define only
the orthographic conventions established from the commonly used accentuation patterns.
Homographs are distinguished graphically in all cases (which does not mean
that they would necessarily be stressed orally):
Monosyllabic
más (adverb of quantity): Quiero más comida “I want more food.”
mas (conjunction): Le pagan, mas no es “You get paid but not more.”
él (personal pronoun): ¿Estuviste con él? “Were you with him?”
el (article): El vino está bueno “The wine is good.”
Plurisyllabic
célebre “famous,” celebre (3rd person present subjunctive of celebrar “to
celebrate”), celebré “I celebrated.”
Written with a hyphen, both components keep their acute stress (if any);
without a hyphen both words keep their original graphic stress:
Cuentakilómetros “odometer,” lógico-matemático “logical-
mathematical”
Spanish grammars, which rely on written forms of isolated words, tell us
that a word pronounced in isolation always carries a stress, but some items lose
their stress when used in connected speech. Nouns, adjectives, verbs, adverbs,
disjunctive pronouns, numerals, and interrogative wh-words (i.e. words
like qué, quién, cuándo, cuál, etc.) are always stressed, whether uttered in isolation
or in connected speech. Words that are never stressed include the following:
1. The definite article, e.g. [la ˈtʃika] la chica “the girl.”
2. Clitic pronouns, e.g. [te lo emˈbje] te lo envié “I sent it to you.”
3. Monosyllabic possessive determiners, e.g. [mi ˈkasa] mi casa “my house.”
4. Monosyllabic prepositions, e.g. [ˈbamos poɾ la kareˈteɾa] vamos por la
carretera “let’s take the main road.”
5. The coordinating conjunctions, e.g. [ˈpan i ˈletʃe] pan y leche “bread and
milk.”
6. The complementizer que, e.g. [esˈta ke ˈaɾðe] está que arde “tempers are
running high.”
7. Non-interrogative wh-words, e.g. [la ˈtʃika a kjen selo ˈðixe] la chica a
quien se lo dije “the girl to whom I said it.”
When it functions as a determiner, the quantifier un/una, etc. is usually
unstressed in connected speech, but it may be stressed for contrastive effect,
as in cómprame un libro (= [ˈun ˈliβɾo]) y no dos “buy me one book and
not two.”
Obviously, these observations confirm that stress pertains to stress groups
rather than to written words taken in isolation.
Rules for word stress placement

Among the Romance languages, Italian has no orthographic marking of word stress
(except in final position), and it has therefore aroused a lot of interest in the field of
automatic language processing, in order, among other applications, to develop
algorithms for text to speech synthesis. Romanian also drew some attention
recently (Ungurean et al., 2009). Indeed, the placement of syllabic stress is of
paramount importance in this application and cannot be inferred directly from
the written material as is the case with other Romance languages (excluding
French). Nowadays of course, with the availability of a very large amount of
computer memory, syllabic stress position can simply be included in a large
lexicon, containing all basic and derived forms, possibly reflecting human
knowledge when acquiring the language.
Still the problem of homographs remains, a problem which is usually
partially solved by labeling each entry with its syntactic category and eliminating
unattested sequences of three or more successive categories, thus
implementing a kind of mini grammar. Automatic placement of stress from text
in Italian must disambiguate homographs with distinct stress patterns, which is
normally only possible if the homographs belong to different grammatical
categories, or if there is another morphological feature (such as distinct agree-
ment in gender or number) allowing the disambiguation. Disambiguation is
usually done with the trigram approach, scanning the text for allowed
sequences of three grammatical categories (as simple as noun, adjective,
verb, adverb) to reject impossible sequences. The correct category can then
usually be restored and the corresponding stress pattern implemented. When
homographs belong to the same syntactic categories, other characteristics, for
example semantic, may be added to each entry.
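The trigram filtering just described can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for the example: the toy lexicon, the category labels (DET and REL are added to the simple categories mentioned above), the two allowed trigrams, and the use of a grave accent to show the stressed vowel. A real system would rely on a large lexicon and on a list of category sequences attested in a corpus.

```python
from itertools import product

# Toy lexicon: each written form maps to its possible (category, stressed form).
LEXICON = {
    "il": [("DET", "il")],
    "cose": [("NOUN", "còse")],
    "che": [("REL", "che")],
    "parla": [("VERB", "pàrla")],
    "capitano": [("NOUN", "capitàno"), ("VERB", "càpitano")],
}

# Hypothetical set of allowed sequences of three categories.
ALLOWED_TRIGRAMS = {
    ("DET", "NOUN", "VERB"),
    ("NOUN", "REL", "VERB"),
}


def disambiguate(words):
    """Return every reading whose category trigrams are all allowed."""
    options = [LEXICON[w] for w in words]
    readings = []
    for combo in product(*options):
        cats = [cat for cat, _ in combo]
        trigrams = zip(cats, cats[1:], cats[2:])
        if all(t in ALLOWED_TRIGRAMS for t in trigrams):
            readings.append([form for _, form in combo])
    return readings


print(disambiguate(["il", "capitano", "parla"]))
print(disambiguate(["cose", "che", "capitano"]))
```

In the first sequence only the noun reading of capitano survives, in the second only the verb reading. Homographs of the same category would both survive this filter, which is why additional features have to be stored in the lexicon.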
A statistical approach
Statistics performed on some 8,000 frequent words in Italian show that 78% of
them are stressed on the penultimate syllable (Sandri & Vivalda, 1981). With
only one rule assigned to the penultimate syllable, the error rate is already
reduced to 22% (Delmonte, 1981).
Some researchers felt that a statistical approach may be the method of choice
for all cases. This method was used by the CSELT (Centro Studi e Laboratori
Telecomunicazioni, now partially Telecom Italia Lab) in 1981.
The approach taken by the CSELT is based on the correlation observed
between orthographic trigrams (sequences of three graphemes) and the location
of the stressed syllable in the word. However, the implementation of these
observations requires no fewer than 250 rules implemented in an augmented
transition network automaton operating from the end of the word. For example,
sequences ending with -isia, -isie, -osia, -psia, -psie, etc. will indicate a stress
on the word penultimate syllable.
The appropriate ranking of these rules and an extensive list of exceptions
allowed the system to reach a correct stress positioning of about 97% in
standardized tests. This method was also used for Romanian, claiming 99%
success (Ungurean et al., 2009. See also Chitoran, 2002 and Chitoran et al.,
2014).
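The general shape of such an ending-based system (an exception list, rules ranked so that the longest ending is tried first, and a default) can be sketched as follows. This is not the CSELT implementation, which used some 250 rules in an augmented transition network. The endings in -isia, -isie, -osia, -psia, and -psie are those quoted above; the ending -bile, the exception entry, and the function name stress_position are assumptions added for illustration.

```python
# Position of the stressed syllable counted from the end of the word
# (1 = last, 2 = penultimate, 3 = antepenultimate).
ENDING_RULES = {
    "isia": 2, "isie": 2, "osia": 2, "psia": 2, "psie": 2,  # quoted in the text
    "bile": 3,          # illustrative extra rule (sensibile), an assumption
}
EXCEPTIONS = {"telefono": 3}     # hypothetical exception list
DEFAULT_POSITION = 2             # penultimate stress is the most frequent case
FINAL_ACCENTED = set("àèéìòù")   # final stress is marked in the orthography


def stress_position(word):
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    if w[-1] in FINAL_ACCENTED:
        return 1
    # Rank the rules so that the longest matching ending applies first.
    for ending in sorted(ENDING_RULES, key=len, reverse=True):
        if w.endswith(ending):
            return ENDING_RULES[ending]
    return DEFAULT_POSITION


for w in ["ipocrisia", "sensibile", "telefono", "virtù", "amare"]:
    print(w, stress_position(w))
```

With only the default rule the function would already be right for the majority of frequent words; the added endings and exceptions are what raise the score, at the cost of a long and unprincipled rule list.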
A phonological-phonetic approach
From a linguistic point of view, it may seem reasonable to think that stress is
linked to the phonological structure of the syllable. It would then be possible to
establish contextual rules to determine the stressed character of a syllable from
its phonetic and phonological structure.
This approach was adopted and implemented by Delmonte (1981) in a text-
to-speech synthesis system for Italian. The problem with this method is that it
requires a large number of rules and exceptions to the rules in order to obtain
satisfactory performance. After phonetization (orthographic-phonetic
conversion), the rules analyze contexts of three elements around a given vowel in each
syllable (starting from the end).
For example, if the vowel [i] belongs to the penultimate syllable, the word is
proparoxyton if [i] is followed by [t, d, l, m, k, t] and if the word does not belong
to a list of exceptions. The word is also proparoxyton if [i] is followed by [l, m,
n, y, t] and the word is a verb followed by a clitic. The word is again
proparoxyton if [i] is followed by [g, r, n, t] and the word is a verb of the first
group in -anare, -agire, -atare, or a noun or an adjective belonging to an
exception list. The word is paroxyton in all other cases.
Other rules apply to the left context of the vowel, leading to a very complex
system not really capturing the assignment of stress mechanism (if one exists).
A phonological approach
Sticking to their phonological guns, some phonologists still attempt to find rules
to predict word stress in Italian based on syllabic properties. This is part of the
tradition aiming to find universal stress rules for a majority of languages (cf. Halle
& Vergnaud, 1987). Cei and Hayes (2012), for example, direct a large project to
find out “How predictable is Italian word stress?” Using a very large corpus and
sophisticated mathematical tools, an optimality treatment of data, and filtering of
borrowed words, they still have not achieved their goals to date. Tackling timidly
some properties of suffixes, they reject this solution because the suffixes represent
lexical properties of morphemes! To gather even more data, they use an “Amazon
Mechanical Turk” platform to assemble a very large set of occurrences in order to
consider possible variations in various regions of Italy. The data are then modeled
with complex statistical tools, such as Bayesian models.
A morphophonetic approach
O. Profili (1987) proposed phonetic rules for noun suffixes which required a
morphological analysis of this category of words. For instance, the suffixes -illo,
-essa are always stressed on their penultimate syllable: distillo “distillate,”
profetessa “prophetess.” Other suffixes are never stressed: -ido, -bile, as shown by
timido “timid,” sensibile “sensitive.” A large number of cases can be correctly
analyzed with this approach, but a problem remains with homographic suffixes,
such as -ino, which may be stressed or not. In piccolino “small,” the suffix + flection
-ino (diminutive masculine singular) is stressed, whereas in amino (3rd person plural
subjunctive of amare “to love”) -ino is unstressed. This approach led to the
solution detailed below.
A morphological approach
The morphological approach, originally suggested by Paul Garde (1968, 2013),
and implemented by Martin (1989) in a software program for the
automatic placement of stressed syllables in Italian, is based on (1) the stress
rules in Latin and (2) a morphological analysis of nouns, adjectives, and verbs
into their morphological structure:
(prefix) + stem + (suffixes) + (flections)
According to a proposal by Paul Garde (1968), suffixes and flections can be
classified as stressable and unstressable, i.e. susceptible to be stressed, or not
susceptible to be stressed. As most lexical entries in Italian are derived from
Latin (thus excluding borrowed words), the stem follows the Latin stress rule as
given above in this chapter.
The stress rule is then very simple: the last stressable morphological element
(stem, suffix, flection) of the word determines the position of the stressed
syllable. Given the relatively large number of homographic suffixes and
flections, the key to success of this method lies in matching corresponding
morphological categories (i.e. suffixes and flections for verbs, nouns, and
adjectives) and a correct morphological analysis.
Things may appear more complicated with homographs belonging either to
distinct grammatical categories or worse (for a computer program) to the same
category. An often-quoted example is sono cose che capitano capitano “these
are things that happen, captain,” where the first capitano is a verb (3rd person
plural of the verb capitare) and is stressed on the fourth syllable from the end,
whereas the second capitano is a noun (here in its singular masculine form) and
is stressed on its penultimate syllable.
Likewise, two stress patterns pertain to ancora: ancora “anchor” and ancora
“he moors,” stressed on the first syllable, and ancora “still,” stressed on the
penultimate syllable. The noun ancora comes from Latin ancora (second
syllable with the light vowel |o|), borrowed from Ancient Greek ἄγκυρα
(stressed on the first syllable), whereas the adverb ancora derives from Latin ad hanc
hōram “at this hour.”
Homographs can belong to the same grammatical category. For example,
principi “princes” and principi “principles,” or turbine “whirlwind” (singular,
il turbine) and turbine “turbines” (plural of la turbina). Furthermore, the
position of the stress may vary with the dialect considered, and sometimes
with the level of language: tenebra “darkness” vs. tenebra as poetic form.
Depending on their nature, i.e. the syntactic category of stems they deter-
mine, homographic suffixes and flections can be stressable or unstressable. For
example, as seen above, the diminutive suffix -in is stressable and by syllabic
segmentation affects the syllable li (piccolo → piccolino, with the morphemes
piccol, in and o), -in being a noun suffix.
However, the same -in is also a marker of the subjunctive and is unstressable
as a verb suffix. In amino, analyzed as am, in, and o, subjunctive present 3rd
person plural of amare “to love,” neither the suffix nor the verb flection is
stressable, resulting in the stress located on the first and only stem syllable.
When derived from Latin, stems are always stressable according to the rules
seen above on their penultimate or prepenultimate syllables. According to
Antonetti and Rossi (1970), 82% of stems (not of words) are stressable on
their last syllable, and only 18% on their penultimate for their root.
The derivations of oper stressable on its first syllable, as derived from Latin
ops “means, resources, power” and opus “business, work” are:
Opera → oper + -a unstressable flection, resulting in opera “opera.”
Operoso → oper + -os stressable nominal suffix + -o unstressable
adjectival masculine singular flection, resulting in operoso “hard-working.”
Operetta → oper + -ett stressable nominal suffix + -a unstressable
nominal feminine singular flection, resulting in operetta “operetta.”
Operosità → oper + -os stressable nominal suffix + -ità suffix stressable on
its last syllable, resulting in operosità “industriousness.”
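The rule “the last stressable morphological element determines the stress” can be sketched in Python on the examples given so far. This is a minimal illustration, not the 1989 program: the Morpheme type, the stress_at field (offset of the stressed vowel inside a stressable morpheme, None for an unstressable one), and the convention of printing the stressed vowel in upper case are all assumptions of this sketch, and the morphological analysis is supplied by hand.

```python
from typing import NamedTuple, Optional


class Morpheme(NamedTuple):
    text: str
    stress_at: Optional[int]  # offset of the stressed vowel; None = unstressable


def place_stress(morphemes):
    """Mark (in upper case) the vowel selected by the last stressable morpheme."""
    offset, stressed = 0, None
    for m in morphemes:
        if m.stress_at is not None:
            # A later stressable element overrides any earlier one.
            stressed = offset + m.stress_at
        offset += len(m.text)
    if stressed is None:
        raise ValueError("no stressable element in the word")
    word = "".join(m.text for m in morphemes)
    return word[:stressed] + word[stressed].upper() + word[stressed + 1:]


OPER = Morpheme("oper", 0)      # stem stressable on its first syllable
CAPIT = Morpheme("capit", 1)    # stem stressable on its first syllable
OS, ETT, AN = Morpheme("os", 0), Morpheme("ett", 0), Morpheme("an", 0)
ITA = Morpheme("ità", 2)        # suffix stressable on its last syllable
A, O = Morpheme("a", None), Morpheme("o", None)   # unstressable flections
ANO = Morpheme("ano", None)     # unstressable verbal flection

print(place_stress([OPER, A]))          # Opera
print(place_stress([OPER, OS, O]))      # operOso
print(place_stress([OPER, ETT, A]))     # operEtta
print(place_stress([OPER, OS, ITA]))    # operositÀ
print(place_stress([CAPIT, AN, O]))     # capitAno (noun)
print(place_stress([CAPIT, ANO]))       # cApitano (verb)
```

The two analyses of capitano yield two different stress positions from the same written form, which is why the choice of the correct morphological analysis remains the critical step.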
In coined words, morphological analysis leads to two (or more) stems, often
borrowed from Greek or English. Both stems can be interdependent (they
cannot appear alone independently), or one or both can be independent. In
the first case, the first stem is stressed on its last syllable and determines the
word stress: telefono (from Greek tele + phonos), sincrono (from Greek syn +
chronos).
If the last stem is independent or if both stems are independent, the last
stressable unit determines the word stress: aeronave “airplane” (from Italian
aero “air” and nave “boat”), aliscafo “hydrofoil” (from Italian ali “wings” and
scafo “hull”).
In general, homographs lead to two (or more) morphological analyses. The
example of capitano gives:
As a noun capitano → capit (from Latin caput “head”) + -an nominal
stressable suffix + -o unstressable nominal masculine singular
flection, resulting in capitano “captain.”
As a verb, capitano → capit + -ano unstressable 3rd person plural
present flection, resulting in capitano (from capitare “to happen”).
As a verb again, capitanó → capit + capitanare + -ó 3rd person
singular past flection, resulting in capitanó “he led.”
French
French has no lexical stress, only a final group stress. Progressively from Old
French, all segmental units following the accented syllables were dropped, with
the exception of a single mute [ə] in certain cases. Some traces of penultimate
accent can be found in words ending with the suffix -ation, such as la nation,
l’exagération, etc. stressed on their penultimate syllable until the middle of
the twentieth century (Martin, 2009). By this process, the position of stress lost
its function of marking morphological boundaries as in the other Romance
languages. From lexical the stress became demarcative in French, indicating
boundaries not of words but of groups of words, content and grammatical
words, or even single syllables.
Therefore, every stressed syllable instantiates in the AM sense a boundary
tone in French, with the exception of emphatic accent. Thus, stress in stress
groups can be located on grammatical words and not only on lexical words, as
in interrogative or imperative forms such as le lui donneras tu? “will you give it
to him?” or donne le lui “give it to him” where the pronouns are stressed simply
as final syllable of a stress group.
Emphatic accent, which – if melodic – is most of the time realized with a
rising melodic contour (but counterexamples do exist; Martin, 2012a), is
located on the first syllable of the first (or only) lexical word of the stress
group. It follows that a problem in analysis may occur when the stress group
contains a lexical word of only one syllable, as in oui c’est exact “yes it’s
correct.” Should it be considered as an emphatic accent, therefore presenting a
melodic rise, or as a stress group boundary tone? An expected melodic rise due
to the principle of melodic slope inversion would leave the ambiguity in the
analysis, whereas a falling melodic contour would classify the prosodic event
as a stress group boundary.
Secondary accent and arc accentuel

In examples of the so-called stress clash, such as un soulier noir “a black shoe” or
la voiture bleue “the blue car,” the stress shifts from the last syllable of soulier and of
voiture to their first syllable to produce a secondary accent, and forms an arc
accentuel “stress arc” as well (Fónagy, 1979). In these cases, the stress shift
provokes a reconfiguration merging the two involved stress groups into one,
with a secondary stress on the first syllable of the content word: un soulier noir,
la voiture bleue (see Chapter 5).
In another example where the secondary accent does not result from a shift
due to a stress clash, les éléphanteaux de Marie-Ségolène “Marie-Ségolène’s
elephant babies,” with an arc accentuel, there are too many syllables, and
therefore too much of a time gap, between the first and the last stressed
syllables for the group stress on éléphanteaux to be omitted. When
the number of syllables between the secondary accent of the first word and the
primary stress of the second is low (i.e. below 7), the primary stress of the first
word is not mandatory: l’éléphant orphelin “the orphan elephant,” or les débats de
l’assemblée “the assembly debates.” In this case the arc accentuel defines a
single stress group.
Although the secondary accent is often presented in recent descriptions of
French stress phonology (e.g. Jun & Fougeron, 2002), it does not represent any
frequent usage in France today, except perhaps in (some) professional multi-
media and public address speech style.
The groupe de sens …

In many contemporary grammar books on French, the location of stress is
defined as marking the last syllable of a groupe de sens (a signification unit).
Besides the fact that defining a groupe de sens is not particularly obvious
(especially for learners of French), it is easy to show that this definition does not
hold for long sense groups. Compare the following examples:
L’armoire “the wardrobe”
La petite armoire “the small wardrobe”
La petite armoire verte “the small green wardrobe”
La petite armoire vert-bouteille “the small bottle-green wardrobe”
However, if the number of syllables of the sense group exceeds a given number
of syllables (on the order of seven, depending on the speech rate, as the limit
depends on the duration of enunciation and not the number of syllables), such
as in La jolie petite armoire vert-bouteille “the pretty small bottle-green
wardrobe,” it is easy to notice that an extra stress must be applied on some
syllable, for example on jolie, or on armoire, thus keeping the sequence of
syllables below the maximal number of syllables in a single stress group.
Stress variations in Romance languages

Coined words (mots savants) borrowed from foreign languages (mostly from
Latin and Ancient Greek in Romance languages) sometimes had their stressed
syllable defined by analogy with existent suffixes found in common words. The
proparoxyton Spanish pelícano (from Latin pelecānus “pelican,” from Ancient
Greek πελεκάνος) may result from a wrong morphological analysis
involving the unstressable suffix -ano.
Despite the morphological general principle, apparent differences in stress
assignment seem to exist between Romance languages in similar words. Most
of the examples pertain to borrowed words, often from Classical Greek. For
example, θεραπεία “therapy,” stressed on the penultimate syllable in Greek,
was introduced in Italian as terapia, in Spanish as terapia, in Catalan as teràpia,
in Portuguese as terapia, and in Romanian as terapie. Actually, in all these cases
the stress is penultimate, but the syllabic segmentation is different: te.ra.pi.a in
Italian, te.ra.pia in Spanish.
Some other examples are: Portuguese polícia, Spanish policía, Italian
polizia, Catalan policia, Romanian poliţie. This word is derived from classical
Greek πολιτεία, itself derived from πολις “city,” Latin polītīa with a long [i],
and therefore stressed on the antepenultimate syllable.
The Portuguese example farmácia, Spanish farmacia, Italian farmacia, from
classical Greek φαρμακεία “the use of drugs”, results in a difference in the
number of syllables (three far.ma.cia, in Portuguese and Spanish, four far.ma.
ci.a in Italian).
Again, different syllabic segmentation in all these occurrences leads to a
penultimate stress for all these languages, and therefore there is no real stress
variation in these examples.
in six Romance languages
This chapter presents a comparative intonation analysis of read sentences in six

Romance languages – European French, Italian, Spanish, Catalan, European
Portuguese, and Romanian. This analysis was conducted in the Incremental
Storage-Concatenation (ISC) framework detailed in Chapter 5. The corpus
consisted of:
1. A set of sixty sentences of increasing syntactic complexity read by two
speakers and extracted from the corpus EuRom4. Sentences were originally
in French, Italian, Spanish, and European Portuguese and were adapted in
Catalan and Romanian.
2. The EuRom4 corpus (French, Italian, Spanish, European Portuguese), containing
twenty-four short stories of increasing length read by four speakers in
each language. The average number of sentences for each language was
about 200.
3. The EuRom5 corpus (French, Italian, Spanish, Catalan, European
Portuguese), consisting of twenty short stories read by two professional
speakers, with about 300 sentences for each language.
4. Additional data came from the French spontaneous speech corpus CFPP
2000, Romance languages C-ORAL-ROM, C-PROM French, Português
Falado, Romanian Rasc, etc. (the complete list is given in the analyzed
corpora section of the References).
The total number of analyzed sentences (all languages) is about 2,600. The
sixty sentences in the first corpus were designed in order to obtain increasing
complexity, keeping syntactic structure and lexical entries as similar as possi-
ble across the languages. The goal is to observe the evolution of pitch move-
ments in prosodic words as indicators of the prosodic structure (i.e. the
prosodic events instantiated by melodic contours on stressed and final stress
group syllables) as the structures become more and more complex. A prosodic
analysis of the sixty sentences of the first corpus was already published in
Martin (2002).
Text transcriptions were available for all corpora, and alignment was performed
with the WinPitch on-the-fly alignment function, in order to obtain a fast
and very efficient alignment of the available text transcriptions with corre-
sponding speech sound segments (see Chapter 11 for details). The alignment
allowed for a fast retrieval of prosodic information from any text segment
selected through simultaneous acoustic analysis.
The ISC model assumes a specific function for stress group stressed and final
syllables in the indication of the prosodic structure, challenging the idea that
pitch accent characteristics belong to the phonetic domain, and also extending
to melodic movements the concept of a stress degree hierarchy inside stress
groups. Indeed, melodic rises and falls are not randomly distributed. Possibly
combined with final pitch movements, they indicate to the listener how to
assemble dynamically the sequences of prosodic words along the time axis in
order to reconstitute the sentence prosodic structure (see Chapter 5).
To avoid possible confusion in terminology with pitch accent used in the AM
model, the pitch movements located on the stressed vowels of stress groups as
well as those on their final syllables are called melodic contours (see Chapter 5).
In Romance languages, in a given stress group, we will necessarily have a
stressed syllable (vowel) melodic contour optionally accompanied with a final-
syllable (vowel) melodic contour. In this case both melodic movements are
considered as one unique prosodic event. In French, with no lexical stress,
stressed and final syllables occupy the same position. Exceptions pertain only
to cases where the last syllable contains a mute [ə].
Furthermore, only the melodic change inside the stressed syllable vowel will
be retained to describe melodic contours. Possible extension of pitch move-
ments on a voiced consonant after a stressed vowel or a pause can be part of
the perceived pitch movement by the listener, but will not be considered in the
phonological description of the melodic contour.
EuRom4 and EuRom5

Analyzed sentences in the first corpus were for the most part extracted and
adapted from the corpora used in the EuRom4 project (1991–1997). This
project, led by Claire Blanche-Benveniste et al. (1997), aimed at designing a
learning method in order to acquire a reading knowledge of Romance lan-
guages by relying on the vocabulary and syntactic similarities, thereby allow-
ing a relative intercomprehension of texts written in any other Romance
language. In the published manual EuRom4 and its revised edition EuRom5
published in 2011 (Bonvino et al., 2011), similar texts pertaining to comparable
journalistic topics are presented to readers side by side, with annotations high-
lighting only the lexical, morphological, and syntactic differences from the
language known by the reader.
The project included some recordings to test oral intercomprehension, pre-
senting sentences of increasing syntactic complexity and an increased number
of stress groups and prosodic words. Later, I added Romanian and Catalan to
the already analyzed set (Martin, 2002, 2004).
The first corpus sentences were designed in such a way as to observe the
realizations of pitch movements relative to more and more complex syntactic
structure. Typical examples in Italian illustrating this increasing complexity go
from l’idea era semplice “the idea was simple” to L’animale sotto l’influenza
del dolore reagisce attivando il sistema nervoso autonomo con il risultato, ad
esempio, di un incremento della temperatura corporea e della frequenza
respiratoria e cardiaca “The animal under the influence of pain reacts by
activating the autonomic nervous system with the result, for example, of an
increase in body temperature and respiratory rate and heart rate.” The results
obtained from the analysis of EuRom4 examples were tested against the full
texts part of the EuRom4 “lessons,” read by at least four speakers in each
language.
The EuRom5 recordings for their part contain numerous read examples of
enumeration, parentheses, and long sentences. Although read by professional
speakers, observed phrasing appeared to be essentially guided by punctuation
signs (comma, colon, semicolon, parentheses, etc.), by verbal group boundaries,
and also by the occurrence of conjunctions of coordination. Relative
pronouns serve also as phrasing boundaries, frequently taking over
punctuation.
The process of reading

Reading, either silently or aloud, is a two-step process. First some proso-
dic structure must be recovered from the written text, as it is not possible
to read aloud or silently without an associated prosodic structure.
Secondly, the text syntactic structure must be decoded dynamically as a
function of time from the recovered prosodic structure, processing the
information according to the ISC model. This back-and-forth mechanism
may not result automatically in a prosody–syntax congruence, as the
inference of the prosodic structure from the text may be shaped by the
prosodic constraints pertaining to the minimal and maximal duration
between two successive syllabic stresses and to the eurhythmicity target
in assembling prosodic words and prosodic groups, as schematized in
Figure 7.1 (see also Chapter 5).
Studies on eye movement in reading show that the reader will attempt to
localize punctuation marks, such as sentence-final dots, commas, etc., as well
as verbal forms (Martin, 2011). For these, a prosodic group’s right boundary
often corresponds to the boundary marking the end of a subject Noun Phrase
and the beginning of a Verb Phrase. This process provides a relative
predictability to the generation of prosodic structures, predictability which does not
normally exist in spontaneous speech, where congruence with syntax is limited
to a few consecutive stress groups.

Figure 7.1 The process of reading schematized (diagram labels: Text, Prosodic Structure, Syntactic Structure, Temporal Constraints).
The production of the prosodic structure while reading is more likely to
integrate eurhythmy as the sequence of stress groups is known in advance from
the text, as well as the assumed size of stress groups, which is not the case for
unprepared spontaneous speech. Moreover, in well-prepared speech mode
(reciting text, for example), the prosodic structure will ideally be totally con-
gruent with the text syntax, provided of course the speaker is a good syntacti-
cian, at least for this purpose (which is normally the case as this expertise
belongs to the basics of learning to read in elementary school).
The prosodic analysis in this chapter pertains only to read sentences (spon-
taneous speech is the object of Chapter 8). As congruence between the prosodic
and syntactic structures is not always guaranteed for read material, even for
“good” speakers, the interpretation of the actual acoustic data from the syntactic
hierarchy is always hypothetical, although in most cases realizations are very
close to congruence between prosody and syntax.
Note on figures
All the figures below (and most in this book) were obtained with the analysis
software WinPitch. In these representations, plain bold curve segments indicate
the F0 sections corresponding to stressed vowels, and bold dotted curve seg-
ments a final boundary tone in a complex Cc contour. Melodic contours below
the glissando threshold are displayed with a lighter bold segment (see Fig. 7.2).
The Incremental Storage-Concatenation process

In describing the sentence intonation coding system for Romance languages,
one has to be careful to remember and apply the main principles of the ISC
model:
Figure 7.2 Marking of stressed melodic contours and boundary tones. Plain
bold curve segments indicate the F0 sections corresponding to stressed
vowels, and bold dotted curve segments a final boundary tone in a complex
Cc contour. In this example, B antiparasitari, “B” (pronounced [bi]) and [a] in
the syllable ta are stressed vowels, whereas [i] in ri is the final vowel carrying
the rising part of a complex contour.
1. The prosodic structure has to be considered by itself, independently from
any other structures organizing linguistic objects in the sentence, and in
particular the syntactic structure.
2. Stressed syllables’ melodic contours are not considered independently from
each other; on the contrary, their realizations depend, among other factors,
on the local complexity of the prosodic structure.
3. From (2), it follows that stressed syllables participate together with
boundary melodic movements (when present) in the encoding of the prosodic
structure.
4. Therefore, there will be no distinction in the functions of stressed syllables
and boundary melodic movements; both indicate the sentence prosodic
structure as complex contours.
5. The ISC process is in essence dynamic and operates on the time axis. This
means that prosodic events pronounced by the speaker and perceived by the
listener occur in a limited time window, and that successive prosodic events
are perceived and decoded as prosodic markers essentially by contrast, and
secondarily by opposition to other prosodic markers (correlative to the
modality contours and their variants).
6. As a tentative criterion, glissando values of melodic contours instantiating
prosodic markers will serve to classify prosodic events for Romance lan-
guages, together with vowel duration, contour height, and melodic change.
The melodic contours of Romance languages

The Incremental Prosodic Structure assumes that stressed vowels and, possibly,
final vowels in stress groups carry specific pitch movements which
indicate the prosodic structure. This process operates stressed syllable by
stressed syllable, melodic contour by melodic contour, allowing the listener
to elaborate the structure of successive prosodic words dynamically, by
assembling prosodic words into prosodic groups step by step.
As in this process successive contours must contrast efficiently relative to
each other if they belong to distinct contour classes, and be sufficiently similar
if they belong to the same class to allow the listener to classify them, the
expected set of acoustic features should include a priori variables that are not
very sensitive to the speaker’s voice characteristics (such as pitch range,
emotional state, etc.). Good candidates may be, in a possible order of efficiency,
vowel duration, direction of pitch movement (rising vs. falling), relative pitch
height, etc. Another likely feature pertains to the possibility for Romance
languages to combine lexically stressed vowels with stress group boundary
melodic movements, a combination that for French, being deprived of lexical
stress, is impossible.
Inventory
In any case, for all read sentences the last stressed vowel indicates the last stress
group of the utterance with an easily recognizable melodic contour, either
perceptually or visually on a melodic curve, falling and/or reaching the lowest
fundamental frequency of the sentence. As it indicates the end of the sentence,
this contour is often called the conclusive terminal contour (examples of
sentences with more than one conclusive contour [complement différé or
epexegesis] are given in Chapter 8 on macrosyntax).
Again, by definition, melodic contours refer to the fundamental frequency
(F0) variation (including their duration) on the stressed syllable vowels (see
Chapter 5). Their definitions belong to the phonological domain, as their actual
phonetic realizations depend on the minimal and sufficient contrast being
maintained in order to allow the listener to group (or not) successive prosodic
words. For example, in a simple prosodic structure with two prosodic words,
the last one being a conclusive declarative contour, the first melodic contour
needs to be differentiated from the last by only one acoustic (and perceived)
feature, for example vowel duration or a non-falling pitch movement (more
than one feature may also be used). In other words, some features characteristic
of the contour may be neutralized, depending on the structure configuration.
When structures are more complex, the network of contrasts that needs to be
maintained also becomes more complex and the realizations of melodic con-
tours more contrasted by using more melodic feature types. The melodic
features retained for the phonological description of the prosodic contours
are: +/− Low, +/− Rising, +/− Complex, +/− Glissando, +/− Long (see defini-
tion in Chapter 5). The inventory of contours is for French: Cn, C2, C1, C0, and
for the other Romance languages, Cn, C1, C2, Cc, C0. Their phonological
descriptions are as follows:
C0 [+Low, −Complex, −Rising, +Glissando, +Long]
Cc [−Low, +Complex, +Rising, +Glissando, +Long]
C1 [−Low, −Complex, +Rising, +Glissando, +/−Long]
C2 [−Low, −Complex, −Rising, +Glissando, +/−Long]
Cn [−Low, −Complex, +/−Rising, −Glissando, −Long]
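As an illustration only, the five descriptions above can be written down as a small lookup table, which makes it easy to see which features are available to contrast two contour classes. The names CONTOURS and contrasting_features are hypothetical, and treating a "+/−" value as unspecified (and therefore never contrastive) is an assumption of this sketch, not a claim made in the text.

```python
# Feature values copied from the phonological descriptions listed above.
# None stands for an unspecified value ("+/-"); this encoding is an assumption.
FEATURES = ("Low", "Complex", "Rising", "Glissando", "Long")

CONTOURS = {
    "C0": {"Low": True,  "Complex": False, "Rising": False, "Glissando": True,  "Long": True},
    "Cc": {"Low": False, "Complex": True,  "Rising": True,  "Glissando": True,  "Long": True},
    "C1": {"Low": False, "Complex": False, "Rising": True,  "Glissando": True,  "Long": None},
    "C2": {"Low": False, "Complex": False, "Rising": False, "Glissando": True,  "Long": None},
    "Cn": {"Low": False, "Complex": False, "Rising": None,  "Glissando": False, "Long": False},
}


def contrasting_features(a, b):
    """Return the features whose specified values differ between contours a and b."""
    fa, fb = CONTOURS[a], CONTOURS[b]
    return [
        f for f in FEATURES
        if fa[f] is not None and fb[f] is not None and fa[f] != fb[f]
    ]


if __name__ == "__main__":
    for pair in (("C1", "C0"), ("C2", "C0"), ("Cn", "C0"), ("C1", "C2")):
        print(pair, contrasting_features(*pair))
```

Under this encoding, C2 and C0 differ only by the feature Low, whereas C1 and C0 differ by both Low and Rising.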
The actual + or − values of these features depend on the contrast to be
maintained (or not) between contours in a domain. A rising contour C1 of
50 Hz span, for example, should not be compared in absolute value of F0
change, but only to its neighbor. However, inside the same domain, defined by
what comes before the contour from either the beginning of the sentence or the
occurrence of a same-class contour, all contours of immediately inferior rank
must have similar acoustic realizations. As an example, all C2 occurrences
inside a domain defined between two contours C1 must share similar features in
order to be perceived as belonging to the same class by the listener.
A contour C1 taken as an example at the beginning of a sentence could then
show a larger change in F0 than another C1 located near the end of the sentence,
realizing a downstep. The downstep effect has been observed and described for
a rather long time, and is largely due to the diminution of the lungs’ volume and
subglottal pressure during sentence production (see Chapter 5). It will be
considered as phonetic and not phonological. The advantage of a localized
approach focused on contrasts between contours is to take into account (by
essentially eliminating them) the possible changes in speaker state of mind,
emotions, etc., influencing the realizations of prosodic events and in particular
the contour melodic span inside the same speech turn, or even inside the same
sentence.
In French there is no complex contour. The ranking between prosodic events
is as follows, C0 being the final conclusive contour: Cn < C2 < C1 < C0.
Contours ranked in this order use an increasing number of salient phonetic
characteristics, such as glissando value, rise or fall, and duration. The
precedence of C1 over C2 originates from the principle of contrast of melodic
slope, where C1 is rising because it depends on a falling C0, and C2 is falling
because it depends on a rising C1.
If a complex contour is present (for all Romance languages excluding
French), the ranking is: Cn < C1 < C2 < Cc < C0.
The complex contour Cc is spread over two syllables (or its characteristics
are merged on a final stressed syllable) and is inserted between C2 and C0. Here
the ranking order of C1 and C2 is inverted, and both contours can select Cc “on
their right,” as Cc is considered ambivalent versus the principle of melodic
slope contrast, i.e. a rising and a falling contour can contrast with it.
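The two rankings can be encoded as ordered lists, which shows concretely how the relative order of C1 and C2 is inverted between French and the other Romance languages. This is a minimal sketch; the names FRENCH_RANKING, ROMANCE_RANKING, and compare_rank are hypothetical.

```python
# Rankings as stated in the text, from lowest to highest rank.
FRENCH_RANKING = ["Cn", "C2", "C1", "C0"]
ROMANCE_RANKING = ["Cn", "C1", "C2", "Cc", "C0"]


def compare_rank(a, b, ranking):
    """Return -1, 0 or 1 when contour a ranks below, equal to, or above contour b."""
    ia, ib = ranking.index(a), ranking.index(b)
    return (ia > ib) - (ia < ib)


if __name__ == "__main__":
    print(compare_rank("C2", "C1", FRENCH_RANKING))   # C2 below C1 in French
    print(compare_rank("C2", "C1", ROMANCE_RANKING))  # C2 above C1 elsewhere
```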
Figure 7.3 B antiparasitari C anticoncezionali, a section of an example of a
long enumeration made from a sequence of groups of two prosodic words,
ended with Cc, except the last one ended with a conclusive contour C0. All
groups are structured with the contours C2 Cc, except the last one by Cc C0,
illustrating the contrast of melodic slope at work (from Italo Svevo, La
Coscienza Di Zeno, read by Moro Silo, Il Narratore, 2006).
Sequences of C1 Cc do occur, especially associated with simple Noun
Phrases. The enumeration example in Figure 7.3 uses sequences of C2 Cc.
The ranking of contours implies transitivity. A contour C1 in a sequence can
be followed by C2 or Cc, maintaining the ranking C1 < C2 in the first case, or
C1 < Cc when a complex contour is present. These rankings operate in their
respective domains.
The terminal conclusive contour C0 has the highest rank as it ends the
sentence intonation structure (C0 is the head of the prosodic structure).
Incidentally, this final conclusive declarative contour is not located on the
last syllable, but on the last stressed syllable (also in French). The ToBI
notation L% generally given to this prosodic event belongs to the phonetic
domain, as in many cases, on the final unstressed syllable in Romance
languages, the F0 value may actually be higher than the last value of the final
falling stressed contour (Martin, 2006). Examples of varieties of Italian are
given in Figure 7.4.
Although this may seem unexpected, the phonological rules ensuring the
indication of the prosodic structure are the same for all Romance languages
analyzed except French. Of course phonetic differences are numerous. For
example, complex contours in Spanish have a shorter stressed syllable contour
than in Italian, where the stressed part is longer (except when this syllable is
final, which implies a flat-rising realization on a single syllable, the rising
part being placed on the final voiced consonant if present).
As seen earlier, the most important phonological difference between French
and the other Romance languages results from the extra combinatorial possi-
bility offered by the last group on account of the existence of a lexical stress,
offering an extra configuration combining the lexical and the boundary melodic
movement resulting in the complex contour Cc. This possibility does not exist
in French, which therefore uses a slightly different prosodic marker system and
a distinct contour ranking.
Figure 7.4 Some examples of terminal conclusive contours in various
regional realizations (Turin, Rome, Palermo, Naples, Florence). It is easy to
observe on these experimental fundamental frequency curves that the pertinent
segment of the conclusive contour is located on the stressed syllables of all
final words, and that the final syllable is not necessarily terminated by a
final L% boundary tone.
Processing prosodic information

In the ISC model, the listener processes the prosodic events instantiated by
melodic contours one at a time. When, after a silence or the occurrence of a
preceding conclusive contour, another sentence is pronounced by the
speaker, a first melodic contour is perceived by the listener. If this contour
is terminal conclusive, the current prosodic sentence processing is termi-
nated and handling of the current sequence of syllables occurs. If not, the
sequence of syllables currently stored in short-term memory is transferred
to another part of memory, and the listener stores a new string of syllables
waiting for a new prosodic event.
When this new melodic contour is perceived, the listener transfers the new
sequence of syllables according to the following three distinct actions:
a. If the new melodic contour belongs to a class ranked below the last contour,
the syllabic sequence is stored in another part of memory, waiting for further
processing.
b. If the new melodic contour belongs to the same class as the last contour,
the syllabic sequence is stored in the same part of memory as the first
sequence.
c. If the new melodic contour belongs to a class ranked higher than the last
contour, the current sequence is concatenated with the already stored
sequence of syllables and the newly formed string of syllables is stored in
the same part of memory as under (b).
This incremental process goes on until a final conclusive contour occurs,
leading to the processing of the complete chain of syllables of the sentence.
Figure 7.5 illustrates a spontaneous speech example in French.
In this example (corpus C-ORAL-ROM French), the sequence of prosodic
events on stressed syllables is C2, C1, Cn, C2, C1, C0, as revealed by the
fundamental frequency curve of Figure 7.5. The ISC process implies that a
certain number of memory buffers must be used to store the intermediate results
of partial concatenation before obtaining the final prosodic structure. The
number of buffers equals the depth of the prosodic structure (four levels,
including the root, in this example). The process then requires buffers M3,
M2, M1, and M0.
Figure 7.5 An example of melodic contours sequence, showing the contrast
of melodic slope in French (et pour répondre à ta question la différence entre
un poney et un cheval c’est sa hauteur).
The sequence of operations showing the state of each memory buffer is as
follows:
Preceding C0 → M3 = 0; M2 = 0; M1 = 0; M0 = 0 //clearing of memory buffers
C2 occurs → M3 = 0; M2 = et pour répondre //first stress group put in M2 memory
C1 occurs → M3 = 0; M2 = 0; M1 = et pour répondre à ta question //second stress group concatenated with M2 buffer, result put in M1 memory
Cn occurs → M3 = la différence; M1 = et pour répondre à ta question
C2 occurs → M3 = 0; M2 = la différence entre un poney; M1 = et pour répondre à ta question
C1 occurs → M3 = 0; M2 = 0; M1 = et pour répondre à ta question la différence entre un poney et un cheval
C0 occurs → M3 = 0; M2 = 0; M1 = 0; M0 = et pour répondre à ta question la différence entre un poney et un cheval c’est sa hauteur
The sequence of events triggered by the successive prosodic contours
(occurring from top to bottom) is given in Table 7.1.
Table 7.1 Processing the prosodic events Cn, C2, C1, and C0 in the example of
Figure 7.5
Prosodic event | M3 (Cn) | M2 (C2) | M1 (C1) | M0 (C0)
C2 | | et pour répondre | 0 | 0
C1 | | | et pour répondre à ta question | 0
Cn | la différence | 0 | |
C2 | | la différence entre un poney | 0 | 0
C1 | | | et pour répondre à ta question la différence entre un poney et un cheval | 0
C0 | | | | et pour répondre à ta question la différence entre un poney et un cheval c’est sa hauteur
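The buffer operations traced for the example of Figure 7.5 can be reproduced with a short simulation. This is only a sketch under stated assumptions: one buffer per contour class (M3 to M0 for Cn to C0), and, when a contour occurs, the content already held in the buffer of its own class is followed by the content of all lower-ranked buffers (highest rank first, which is also the oldest first) and then by the new stress group. The function name isc_process and this exact concatenation order are hypothetical and were chosen to match the trace, not taken from a published implementation.

```python
# French ranking from the text: Cn < C2 < C1 < C0 (buffers M3, M2, M1, M0).
FRENCH_RANK = {"Cn": 0, "C2": 1, "C1": 2, "C0": 3}


def isc_process(events, rank=FRENCH_RANK):
    """Simulate buffer storage and concatenation for (contour, stress group) pairs.

    Returns the final buffers and a trace of the buffer state after each event.
    """
    buffers = {contour: "" for contour in rank}
    trace = []
    for contour, group in events:
        # Start with what is already stored for this contour class.
        parts = [buffers[contour]]
        # Absorb and clear every lower-ranked buffer, highest rank first.
        lower = sorted((c for c in rank if rank[c] < rank[contour]),
                       key=rank.get, reverse=True)
        for c in lower:
            parts.append(buffers[c])
            buffers[c] = ""
        parts.append(group)
        buffers[contour] = " ".join(p for p in parts if p)
        trace.append((contour, dict(buffers)))
    return buffers, trace


if __name__ == "__main__":
    example = [
        ("C2", "et pour répondre"),
        ("C1", "à ta question"),
        ("Cn", "la différence"),
        ("C2", "entre un poney"),
        ("C1", "et un cheval"),
        ("C0", "c'est sa hauteur"),
    ]
    final, steps = isc_process(example)
    for contour, state in steps:
        print(contour, state)
```

Run on the six events of the example, the intermediate states follow the sequence of operations listed above, with the complete sentence ending up in the C0 buffer.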
As detailed elsewhere in this book, in reading mode, realizations of prosodic

markers and the generated prosodic structure result from a recovering process
made by the reader. This process is constrained by the general limits of syllabic
and stress group processing, i.e. the minimal and maximal duration of Delta
waves synchronizing the conversion of strings of syllables into stress groups
(or some other type of higher-rank linguistic unit, or, in a slightly different
model, the synchronization of Theta waves, optimizing the perception of
strings of syllables by the listener). These limits determine the minimum
duration gap between two consecutive stressed syllables as well as the com-
pression of syllabic duration when the interval between stressed syllables
includes a large number of syllables, with a maximum in the order of seven
(Martin, 2014b).
Prosodic structures in Romance languages

To illustrate the mechanisms of the ISC process in the six Romance
languages considered in this book, some examples extracted from the
numerous sentences analyzed are examined below. For each example, the
vowels (and only the vowels) of stressed syllables (i.e. perceptually con-
sidered as prominent and stressed) are highlighted and retained in the
phonological description. As detailed above, the phonetic description of
melodic contours must correspond to the necessary and sufficient contrasts
to differentiate each contour from all the other melodic contours that could
occur in its place, i.e. in the same context. Furthermore, these contrasts are
essentially local and imply only the contrasts that must be maintained with
all the contours on which a given contour depends, either directly or
indirectly. In a sequence C2 C1 C0, for example, C2 must be differentiated
from C1 and C0, C1 only from C0, and C0 from all variants of modality
conclusive contours. As dependency relations act “on the right” of a
contour (i.e. toward the head of each node), the set of acoustic features
of a particular contour depends on future events relative to this contour, a
property that implies the preplanning of prosodic events by the speaker
in both read and spontaneous speech.
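The set of contours from which a given contour must be differentiated can be sketched as a walk to the right along the chain of heads. The sketch assumes that each contour depends on the nearest contour to its right with a strictly higher rank; this is one reading of dependency “on the right,” labeled here as an assumption, and the function name contrast_targets is hypothetical. The contrast of C0 with its modality variants is not modeled.

```python
# French ranking from the text, lowest to highest.
FRENCH_RANKING = ["Cn", "C2", "C1", "C0"]


def contrast_targets(sequence, ranking):
    """For each contour, list the contours it depends on, directly or indirectly.

    Assumption: a contour depends on the nearest strictly higher-ranked
    contour to its right, and indirectly on that contour's own heads.
    """
    rank = {contour: i for i, contour in enumerate(ranking)}
    targets = []
    for i, contour in enumerate(sequence):
        chain = []
        current = rank[contour]
        for later in sequence[i + 1:]:
            if rank[later] > current:
                chain.append(later)
                current = rank[later]
        targets.append(chain)
    return targets


if __name__ == "__main__":
    print(contrast_targets(["C2", "C1", "C0"], FRENCH_RANKING))
    print(contrast_targets(["C1", "Cn", "C0"], FRENCH_RANKING))
```

For the sequence C2 C1 C0 this gives C1 and C0 for C2, and C0 alone for C1, as in the example discussed above.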
Identification of prosodic contours

To function, the whole identification process of prosodic contours implies the
identification of the class of prosodic events from acoustic data, essentially
syllabic (actually vocalic, see below) duration and melodic variation, and
incidentally intensity variation from the preceding syllable(s). The process
requires the existence of a certain number of prosodic event classes, one of
which must be identifiable in all conditions: the final contour C0 (with its
variants declarative Cd and interrogative Ci). The four other classes of contours
are Cc (for Romance languages other than French), C1, C2, and Cn. These
classes may be instantiated differently by melodic features as long as the
necessary and sufficient contrasts between contours are preserved.
In the simplest configuration, the sentence contains one single prosodic word
ended with a declarative contour C0 located on the last stressed syllable.
The next configuration presents two prosodic words, with the first stressed
syllable bearing a prosodic event Cx to be identified. When the first prosodic
event occurs, the only differentiation to be made by the listener pertains to its
belonging to the C0 (Cd declarative or Ci interrogative) class or not. Indeed, if
the prosodic marker belongs to the C0 class and if the next prosodic marker is
also a C0, the two consecutive prosodic words form two independent prosodic
structures, normally attached to two distinct sentences (they could also be
attached to a single syntactic structure, thus one single sentence, organized in
two sections with a deferred complement, see macrosyntax Chapter 8).
If the contour is not a C0 and the next contour is a C0, whatever the
realization, Cc complex, C1 rising, C2 falling, or Cn neutralized, the indicated
prosodic structure will be the same [Cx C0]. In other words, Cx is neutralized in
this configuration and must only be differentiated from the other contours that
can occur in its place, i.e. C0 (with its Cd and Ci variants). This fact is
frequently observed in French read and spontaneous data, where there is no
lexical stress. However, due to the principle of contrast of melodic slope, the
sequence C2 C0 (almost) never occurs, as C2 (+Glissando, −Rising) has a
falling melodic slope similar to that of C0. The C2 and C0 contours may then appear
perceptually too close (although informal perception tests show that listeners
can differentiate both contours taken in isolation).
In read mode, one cannot always expect the prosodic structure to be con-
gruent to the syntactic structure. On the contrary, the reading process allows the
speaker to gather information ahead of what is said by appropriate eye move-
ments, resulting often in eurhythmy not normally observed in spontaneous
speech. Still, some hypotheses can be made as to the probable configuration of
the sentence prosodic structure, especially when proceeding from simple to
more complex sentences.
To ensure a consistent phonological notation, non-final rising and falling
melodic contours are transcribed phonologically as C1 (rising) and C2 (falling)
if their glissando values (in semitones per second) exceed a glissando threshold
(Rossi, 1978; Mertens, 2004). This glissando threshold expressed in
semitones/sec² is assumed to correspond to the threshold of perception of a
change in pitch varying linearly. Actually, this value is parametric and should
be adjusted, as the threshold values were obtained from perception tests
pertaining to a synthetic vowel [a] (Rossi, 1971). The glissando gives only an
indication of the pertinence of the transcription. Therefore, if pitch movement
on the vowel is below the glissando threshold, the contour is transcribed as Cn
(neutralized contour).
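The transcription rule just described can be sketched in a few lines: convert the F0 change on the vowel to semitones, divide by the vowel duration to obtain a rate in semitones per second, and compare it with a threshold. The threshold form k/T² (T being the vowel duration in seconds) and the default value of k are assumptions of this sketch; the text only states that the threshold is parametric and should be adjusted. The function names are hypothetical.

```python
import math


def semitones(f_start_hz, f_end_hz):
    """Pitch interval in semitones between two F0 values (positive if rising)."""
    return 12.0 * math.log2(f_end_hz / f_start_hz)


def glissando_rate(f_start_hz, f_end_hz, duration_s):
    """Rate of pitch change in semitones per second over the vowel."""
    return semitones(f_start_hz, f_end_hz) / duration_s


def transcribe_nonfinal(f_start_hz, f_end_hz, duration_s, k=0.16):
    """Transcribe a non-final contour as C1, C2, or Cn.

    Assumption: the threshold is k / duration**2; k is an illustrative,
    adjustable parameter, not a value prescribed by the text.
    """
    threshold = k / duration_s ** 2
    rate = glissando_rate(f_start_hz, f_end_hz, duration_s)
    if abs(rate) <= threshold:
        return "Cn"  # movement below the glissando threshold: neutralized
    return "C1" if rate > 0 else "C2"


if __name__ == "__main__":
    print(transcribe_nonfinal(120.0, 150.0, 0.15))  # clear rise
    print(transcribe_nonfinal(150.0, 120.0, 0.15))  # clear fall
    print(transcribe_nonfinal(120.0, 121.0, 0.15))  # nearly flat
```

Working in semitones rather than Hz keeps the comparison largely independent of the speaker's register, in line with the requirement stated earlier that the features used should not be very sensitive to voice characteristics.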
Complex contour
The Complex contour Cc is characterized by the realization of a flat (or slightly
rising or falling) melodic contour on the stress group stressed vowel, and a
sharp rise on the final vowel, above the glissando threshold. If the stressed
vowel is in final position in the stress group, both flat and rising melodic shapes
coexist on the same vocalic segment, possibly extended to an adjacent voiced
consonant. This Cc contour presents various realizations apparently linked to
phonetic differences between Romance languages, as illustrated in the follow-
ing figures (Fig. 7.6 to Fig. 7.10).
Figure 7.6 anticoncezionali (Italian, Svevo)
Figure 7.7 que hay una provocación a la discriminación (Spanish, EuRom4 E06-14)
Figure 7.8 que componen Malàsia (Catalan, CAT-ANA56)
Figure 7.9 nos estados unidos (Portuguese, EuRom5 P06-1)
Figure 7.10 prepozitionala (Romanian, RR1 200914)
Table 7.2 Static phonological description of Romance languages
melodic contours
Feature | C0 | Cc | C1 | C2 | Cn
Low | + | − | − | − | −
Complex | − | + | − | − | −
Rising | − | + | + | − | +/–
Glissando | +/– | + | + | + | −
Long | + | + | − | − | −
In the example of Figure 7.7, the last stressed vowel has a falling melodic
contour, whereas the complex rising part is realized by the voiced [n] ending
the stressed syllable.
The examples of a complex contour in five Romance languages (Figs. 7.6 to
7.10) show a slight fall on the stressed vowel and a rise on the final vowel
(with possible extension of the melodic rise on the following consonant if
voiced).
In the read corpora, all retained realizations were declarative, with a “broad”
focus, i.e. without implicative or imperative variations. The presence of
Postfixes or Suffixes is also avoided (see Chapter 8 on macrosyntax for defini-
tions of these macrosegments).
Table 7.2 gives a static phonological description of the melodic contours.
This description uses the binary features +/– Low, +/– Complex, +/– Rising,
+/–Glissando, +/–Long. The feature Low pertains to the value of the funda-
mental frequency reached at the end of the melodic contour. The feature
Complex refers to the realization of two distinct contours on stressed and final
vowels (possibly merged).
In the rest of this chapter, these definitions are applied dynamically in
sequences of three consecutive melodic contours, which have to be
differentiated by necessary and sufficient contrasts from all other contours
that could occur in their place. These contrasts may involve only the minimally
necessary features, and not all the features of Table 7.2.
Experimental data
Table 7.3 gives all possible hierarchical configurations of three prosodic
words (Configurations I, II, and III), excluding cases of saturation resulting
in sequences of neutralized contours as outcome, and sequences where a falling
contour would depend on another falling contour (C0 for example). Sequences
involving two prosodic words can be extracted from this table by retaining two
contours instead of three, and more complex realizations by expanding these
configurations. These combinations bring together the various possible
conditions of operation of the ISC process.
Table 7.3 Configurations of three successive melodic contours
Language | Ended with | Configuration I [A B C] | Configuration II [[A B] C] | Configuration III [A [B C]]
French (10) | C0 | C1 C1 C0; Cn Cn C0 | C2 C1 C0; Cn C1 C0 | C1 Cn C0
French | C1 | C2 C2 C1; Cn Cn C1 | Cn C2 C1 | C2 Cn C1
French | C2 | Cn Cn C2 | – | –
Romance (22) | C0 | Cc Cc C0; C1 C1 C0; Cn Cn C0 | C2 Cc C0; C1 Cc C0; Cn Cc C0; Cn C1 C0 | Cc C1 C0; C1 Cn C0
Romance | Cc | C2 C2 Cc; C1 C1 Cc; Cn Cn Cc | C1 C2 Cc; Cn C2 Cc; Cn C1 Cc | C2 C1 Cc; C2 Cn Cc; C1 Cn Cc
Romance | C2 | Cn Cn C2 | Cn C1 C2 | C1 Cn C2
Romance | C1 | Cn Cn C1 | – | –
The configurations are directly derived from the principles of (1) depen-
dency “to the right,” (2) inversion of melodic slope, and (3) ranking of the
melodic contours (Cn < C2 < C1 < C0 for French, Cn < C1 < C2 < Cc < C0 for
the other Romance languages).
A possible mini grammar producing the sequences of contours includes the
following rewriting rules (limited to two daughters per expanded node):
For French:
C0 → {C1 C0 | Cn C0}
C1 → {C1 C1 | C2 C1 | Cn C1}
C2 → {C2 C2 | Cn C2}
Cn → {Cn Cn}
For the other Romance languages:
C0 → {Cc C0 | C1 C0 | Cn C0}
Cc → {Cc Cc | C2 Cc | C1 Cc | Cn Cc}
C2 → {C2 C2 | C1 C2 | Cn C2}
C1 → {C1 C1 | Cn C1}
Cn → {Cn Cn}
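The rewriting rules for French can be written as a dictionary and expanded mechanically to list the contour sequences of a given length. This is an illustrative sketch only: the names FRENCH_GRAMMAR and yields are hypothetical, and the enumeration is not claimed to reproduce every cell of Table 7.3, since the table applies additional exclusions.

```python
# French rewriting rules from the text: each rule expands a node into two daughters.
FRENCH_GRAMMAR = {
    "C0": [("C1", "C0"), ("Cn", "C0")],
    "C1": [("C1", "C1"), ("C2", "C1"), ("Cn", "C1")],
    "C2": [("C2", "C2"), ("Cn", "C2")],
    "Cn": [("Cn", "Cn")],
}


def yields(symbol, length, grammar):
    """Return the set of contour sequences of exactly `length` derivable from `symbol`."""
    if length == 1:
        return {(symbol,)}
    results = set()
    for left, right in grammar.get(symbol, []):
        # Split the target length between the left and right daughters.
        for k in range(1, length):
            for left_seq in yields(left, k, grammar):
                for right_seq in yields(right, length - k, grammar):
                    results.add(left_seq + right_seq)
    return results


if __name__ == "__main__":
    for seq in sorted(yields("C0", 3, FRENCH_GRAMMAR)):
        print(" ".join(seq))
```

Starting from C0 with three contours, the French rules produce five sequences (C1 C1 C0, Cn Cn C0, C2 C1 C0, Cn C1 C0, and C1 Cn C0).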
Figure 7.11 An example of prosodic structure.
Although definitions of contours are local, it is easy to see that the same
contour, say Cn, can occupy distinct levels in the prosodic hierarchy. For
example, in the context C1 Cn C0, Cn is located one level below C1, which
is itself one level below C0. In another context, for example C2 Cn C1, Cn has
to be on a structure level below both C2 and C1, C2 being below C1. Merging
the two sequences as C2 Cn C1 Cn C0, the corresponding structure is shown in
Figure 7.11.
Sequences of two prosodic words

The possible configurations implying two prosodic words are [C1 C0] and [Cn
C0] if ended with C0; [C2 Cc], [C1 Cc], [Cn Cc] if ended with Cc; and [C2 C1],
[Cn C1] if ended with C1, and [Cn C2] if ended with C2 (*[C2 C0] is not
attested, except in rare emphatic cases in French).
Ended with C0. Figures 7.12 to 7.17 are examples of Cx C0 realizations
in the six Romance languages.
The first contour does not have to be differentiated from any contour other
than the terminal contour. Its realization can therefore take many combinations
of melodic forms, as long as they are perceptually distinct from the conclusive
contour ending the same sentence. One acoustic feature is necessary and
sufficient to ensure this differentiation, for example the duration of the vowel
involved, short for C1 and Cc (on the stressed syllable part), and longer for C0.
Other possible sequences are Cn C0, C1 C0, and Cc C0 (except for French).
In order to be decoded properly, the first prosodic marker, placed on the
stressed syllable (thus on the final syllable for French), must be acoustically
different from all other contours that could occur at the same place (i.e. in the
same context and environment), and in particular from a declarative or
interrogative modality contour. It cannot be confused with a flat melodic contour
characteristic of Postfixes (see Chapter 8), as it is not preceded by a conclusive
contour. Therefore, any realization that would ensure this differentiation will
be observed, as shown in the examples below. The final conclusive contour is
located on the final stressed syllable and not on the final syllable.
An example: “the idea was simple,” with the final conclusive melodic
contour on the final stressed syllable. Stressed syllables C0 are underlined with
a bold line in the text, with a thinner line for stressed contours Cn below the
glissando threshold.
French C1 C0
(1) [L’idée C1 était simple C0] “The idea was simple” (EuRom4 frfn6F)
Figure 7.12 L’idée était simple
Italian Cn C0
(2) [L’idea Cn era semplice C0] “The idea was simple” (EuRom4 itfn6I)
Figure 7.13 L’idea era semplice
Spanish Cn C0
(3) [La idea Cn era simple C0] “The idea was simple” (EuRom4 esfn6E)
Figure 7.14 La idea era simple
Catalan Cn C0
(4) [La idea Cn era simple C0] “The idea was simple” (EuRom4 cafn6C)
Figure 7.15 La idea era simple
Portuguese Cn C0
(5) [A ideia Cn era simples C0] “The idea was simple” (EuRom4 ptfn6P)
Figure 7.16 A ideia era simples
Romanian Cc C0
(6) [Ideea Cc era simplă C0] “The idea was simple” (EuRom4 rofn6R)
Figure 7.17 Ideea era simplă
The sequence of contours complies with the theoretical predictions, giving
as well-formed C1 C0 (Fig. 7.12 French), Cn C0 (Figs. 7.13–7.16 Italian,
Spanish, Catalan, and Portuguese), and Cc C0 (Fig. 7.17 Romanian). This latter
case may indicate a phonostylistic emphasis in the realization of the prosodic
structure or phrasing.
Ended with Cc in Romance languages other than French. As discussed earlier,
Romance languages other than French have an extra combinatorial
possibility by using the complex contour Cc, realized with the phonological
features +Rising, +Long, +Glissando, and of course +Complex. Phonetically, the
contour is realized on two distinct syllables, the lexically stressed and the final
syllable. On the stressed syllable, the pitch is generally slightly falling or rising
(usually below the glissando threshold) and rising on the final syllable (above
the glissando threshold). If the stressed syllable is in final position, a combined
pitch movement takes place, first falling, then rising sharply.
Italian C2 Cc
(7) [In pericolo C2 poi Cc] “In danger then . . .” (EuRom5 I07-2)
Figure 7.18 In pericolo poi in In pericolo, poi, ci sono alcune città africane
come Timbuctù . . .
Spanish C2 Cc
(8) [cuando C2 se constate Cc] “when it was noticed . . .” (EuRom5 E12-1)
Figure 7.19 cuando se constate in que pide a sus 47 Estados miembros que
establezcan algún tipo de sanción cuando se constate que hay una
“provocación a la discriminación” en los mensajes publicitarios.
Portuguese C2 Cc
(9) [apelidado C2 de Óscar Cc] “called Oscar . . .” (EuRom5 P04-1)
Figure 7.20 apelidado de Óscar in Um cão de raça terra nova apelidado de
Óscar, cujo dono é um socialite, vai ser submetido a um lifting aos olhos.
Catalan C2 Cc
(10) [Com C2 les formigues Cc] “Like ants . . .” (EuRom5 C19-1)
Figure 7.21 Com les formigues in Com les formigues, els cucs o els
escarabats, les meduses són en algunes cultures un element bàsic de
l’alimentació.
The C2 Cc pattern appears in all these examples.

Italian C1 Cc
(11) [[La più grande C1 delle piroghe Cc] [misura Cn cinque Cn metri C0]]
“The largest of the canoes measures five meters” (EuRom4 itfn42I)
Figure 7.22 La più grande delle piroghe misura cinque metri.
Spanish C2 Cc
(12) [Siguiendo Cn este modelo Cc] “Following this model . . .” (EuRom5
E11-1)
Figure 7.23 Siguiendo Cn este modelo Cc in Siguiendo este modelo, la
cadena de clínicas estadounidenses . . .
Portuguese C2 Cc
(13) [Segundo C2 a especialista Cc] “According to the expert . . .” (EuRom5
E11-1)
Figure 7.24 Segundo a especialista . . .
Portuguese C2 Cc
(14) [na cidade C2 de York Cc] “in the city of York . . .” (EuRom5 P05-1)
Figure 7.25 na cidade de York in . . . e, na cidade de York é permitido matar
um escocês junto às antigas muralhas . . .
Portuguese C2 Cc
(15) [Nascido C2 no Japão Cc] “Born in Japan . . .” (EuRom5 P06-1)
Figure 7.26 Nascido no Japão in Nascido no Japão, Watanabe, que dá aulas
na universidade, nos EUA, tem
Ended with C1 in French The next example illustrates the prototypical
contrast of melodic slope in French:
French C2 C1
(16) [les garçons C2 de piste C1] “the track boys . . .” (EuRom5 F03-1)
Figure 7.27 les garçons de piste in Les garçons de piste se transforment en
porteur, voltigeur, funambule ou cascadeur, . . .
The sequence of contours is C2 −Rising C1 +Rising; C2 is +Glissando, but
C1 has a curved shape compensating for the actual −Glissando value, together
with a difference of duration (100 ms and 150 ms respectively). The listener
perceiving the first contour C2 −Rising may expect the following contour to
be C1 +Rising. Instead, C0 occurs and the process of recovering the prosodic
structure intended by the speaker is ended.
French C2 C1
(17) [[La plus grande C2 des pirogues C1] [mesure Cn cinq mètres C0]]
“The largest of the canoes measures five meters.” (EuRom4 frfn42F)
Figure 7.28 La plus grande des pirogues mesure cinq mètres.
In these two examples, the contrast of melodic slope is clearly at work, reveal-
ing one important characteristic of French sentence intonation (Martin, 1975).
French C2 C1
(18) [mais les scientifiques C2 japonais C1] “but Japanese scientists . . .”
(EuRom5 F05-1)
Figure 7.29 mais les scientifiques japonais in Mais les scientifiques japonais
ont montré que cet agent était contrôlé par une enzyme.
The example mais les scientifiques japonais “but Japanese scientists”
(Fig. 7.29) shows the use of a contrast of melodic slope C2 C1, and also of
duration (C2 vowel 70 ms, C1 150 ms), illustrating the possibility for the
speaker to select various melodic features as long as the contrast between the
desired successive contours is realized to ensure a proper perception of the
prosodic structure.
Ended with C1 in Romance languages other than French
Catalan Cn C1
(19) [La cuinera de Sant Pol de Mar] “The chef of San Pol de Mar. . .”
(EuRom5 C19-1)
Figure 7.30 La cuinera de Sant Pol de Mar in La cuinera de Sant Pol de Mar,
Carme Ruscalleda . . .
Sequences of three prosodic words

When the prosodic structure organizes three prosodic words, the system of
necessary and sufficient contrasts becomes more complex. Indeed, besides the
common differentiation with the terminal contour, the first two contours must
also contrast with each other. This differentiation depends on the structure
configuration: each contour must be differentiated from all the contours on
which it depends directly or indirectly. Due to the difference in contour
ranking and the absence of the complex contour Cc, French is treated separately
and handled first, followed by the other Romance languages.
French Ranking of contours: Cn < C2 < C1 < C0

Enumerations
The configuration (I) is an enumeration (Fig. 7.31), where the first two
contours belong to the same class, say C1 or Cn: [C1 C1 C0] or [Cn Cn C0],
the first realization being phonetically more marked than the second, as in
Figure 7.35. If ended with C1, the possible prosodic groups are
[C2 C2 C1] and [Cn Cn C1].
Figure 7.31 Configuration I, Cx Cx C0
Ended with C0 The first two contours Cx both contrast with
C0 and must belong to the same class. They can be instantiated by C1
or Cn.
The following example is an enumeration, realized as such in all the other
Romance languages with similar text (except Romanian).
French C1 C1 C0
(20) [Les romans C1 ont un début C1 et une fin C0]
“Novels have a beginning and an end” (EuRom4 frfn9)
Figure 7.32 Les romans ont un début et une fin.

French Cn Cn C0
(21) [[on rend C1] [cette interdiction Cn strictement Cn inefficace C0]]
“we make the ban strictly ineffective” (EuRom4 F_23_16)
Figure 7.33 on rend cette interdiction strictement inefficace.
Ended with C1 Cx can be instantiated by C2 or Cn in this case.

French C2 C2 C1
(22) [ainsi C1] [[sa nouvelle gamme C2 de combinés C2 ] [présentés Cn
lundi C1]]
“so its new range of handsets presented Monday . . .” (EuRom5 F04-1)
with the sequence of contours C1 C2 C2 C1
Figure 7.35 ainsi sa nouvelle gamme de combinés présentés lundi . . .
French Cn Cn C1
(23) [[Le trente Cn novembre C2], [anniversaire Cn de la mort Cn au combat
Cn en dix-sept cent dix-huit Cn du roi Cn Charles XII C1]]
“On 30 November, the anniversary of the death in battle in 1718, King
Charles XII . . .” (F_01_22 EuRom4)
Figure 7.36 This example shows a saturation of melodic contrasts in the long
syntagm anniversaire de la mort au combat en dix-sept cent dix-huit du roi
Charles Douze resulting in Cn neutralized contours on all stress groups’ final
(and stressed) syllables, except the last, carrying C1. The contrast of melodic
slope is realized with the contrast C2 ending Le trente Cn novembre with C1.
Configuration II
The configuration (II) groups the first two prosodic words, which are then
grouped with the third prosodic word. The possible sequences are [[C2 C1]
C0] (Fig. 7.38) and [[Cn C1] C0] (Fig. 7.39) if terminated by C0, and
[[Cn C2] C1] (Fig. 7.41) if terminated by C1.
Ended with C0
Figure 7.37 Configuration II, Cy Cx C0
French C2 C1 C0
(24) [[de se livrer C2 à des affrontements C1] en règle C0]
“to engage in good standing clashes” (EuRom4 F_01_22)
Figure 7.38 de se livrer à des affrontements en règle.

French Cn C1 C0
(25) [[cette maladie C1] [[est devenue C2] [une pathologie Cn changeante
C1]] et multiforme C0]
“this disease has become a changing and multifaceted pathology”
(EuRom4 F_08_04)
Figure 7.39 cette maladie est devenue une pathologie changeante et
multiforme.
Congruence with syntax is ensured until the conjunction et is reached,

putting the two prosodic groups ended with C1 on the same level as the last
prosodic word.
Ended in C1

French Cn C2 C1
(26) [[C’est au travers Cn de cette relation Cn qu’il instaurera C2] à ces deux
personnes C1]
“It is through this relationship that he will build with these two people”
(EuRom4 F_21_16)
Figure 7.41 C’est au travers de cette relation qu’il instaurera à ces deux
personnes.
Configuration III
Ended with C0
Figure 7.42 Configuration III, Cx Cy C0

French C1 Cn C0
(27) [[Certains C1] [de ces bâtiments Cn préfabriqués C1] [se sont révélés Cn
dangereux C0]]
“Some of these prefabricated buildings have proved dangerous” (EuRom4
frfn39F)
Figure 7.43 Certains de ces bâtiments préfabriqués se sont révélés dangereux.
French C1 Cn C0
(28) [[cependant C1][empêcher C2 les bagarres C1] [recherchées Cn de part
et d’autre C0]]
“however, prevent the fights sought by both sides” (EuRom4 F_01_22)
Figure 7.44 Neuf cents policiers n’ont pu, cependant, empêcher les bagarres
recherchées de part et d’autre.
As the sequence C2 C0 violates the principle of contrast of melodic slope, the

melodic contour on recherchées contrasting with C0 can only be Cn, remaining
congruent with syntax.
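The principle of contrast of melodic slope invoked here can be sketched as a simple check. The slope values assigned to each French contour below (C1 rising, C2 and C0 falling, Cn neutralized) and the function name are simplifying assumptions made for illustration; they reduce the feature system to a single rising/falling opposition.

```python
# Assumed melodic slopes for the French contours; Cn is neutralized (no slope).
SLOPE = {"C0": "falling", "C1": "rising", "C2": "falling", "Cn": None}


def respects_slope_contrast(dependent: str, head: str) -> bool:
    """True if a contour may depend on the given head contour without
    violating the contrast of melodic slope."""
    slope_dependent, slope_head = SLOPE[dependent], SLOPE[head]
    if slope_dependent is None or slope_head is None:
        return True  # a neutralized contour has no slope to clash with
    return slope_dependent != slope_head


# The excluded sequence *C2 C0, and the neutralized alternative Cn C0.
print(respects_slope_contrast("C2", "C0"))
print(respects_slope_contrast("Cn", "C0"))
```

In this simplified picture, C2 followed by C0 is rejected because both contours fall, whereas Cn before C0 is accepted because a neutralized contour carries no slope.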
Ended with C1
French C2 Cn C1
(29) [[Le 30 Cn novembre C2], [anniversaire Cn de la mort au combat Cn en
1718 Cn du roi Charles XII C1]]
“On 30 November, the anniversary of the death in battle in 1718, King
Charles XII . . .” (EuRom4 F_01_22)
Figure 7.46 This example shows a saturation of melodic contrasts in the long
syntagm anniversaire de la mort au combat en dix-sept cent dix-huit du roi
Charles Douze resulting in Cn neutralized contours on all stressed groups’
final (and stressed) syllables, except the last, carrying C1. The contrast of
melodic slope is realized with the contrast C2 ending Le trente Cn novembre
with C1.
Romance languages other than French Ranking of contours: Cn < C1 < C2 < Cc < C0
Configuration I Enumerations
Ended with C0 The possible instantiations of Cx are Cc, C1, and Cn.
An example of enumeration in Italian (excerpt; the complete sentence includes
sixteen prosodic groups ended by contours Cc) (Fig. 7.48).
(Svevo)
(30) [B C2 antiparasitari Cc] [C C2 anticoncezionali Cc]
“antiparasites contraceptives . . .”
Figure 7.48 B, antiparasitari C anticoncezionali
These cases normally require the existence of a lower level in the prosodic
structure between the consecutive complex contours Cc (as in Fig. 7.48).
Outside emphatic style, consecutive contours ending single prosodic words
do not need high-level contrasts using Cc and instead use Cn or C1, as in
the following examples.
Italian C1 C1 C0
(31) [I romanzi C1 hanno un inizio C1 e una fine C0]
“The novels have a beginning and an end” (EuRom4 itfn9I)
Figure 7.49 I romanzi hanno un inizio e una fine.
Spanish Cn Cn C0
(32) [Los romances Cn tienen un inicio Cn y un fin C0]
“The novels have a beginning and an end” (EuRom4 esfn9E)
Figure 7.50 Los romances tienen un inicio y un fin.

Catalan Cn Cn C0
(33) [Els romanços Cn tenen un inici Cn i un final C0]
“The novels have a beginning and an end” (EuRom4 cafn9C)
Figure 7.51 Els romanços tenen un inici i un final.
Portuguese Cn Cn C0
(34) [Os romances Cn têm um início Cn e um fim C0]
“The novels have a beginning and an end” (EuRom4 ptfn9P)
Figure 7.52 Os romances têm um início e um fim.

Portuguese Cn Cn C0
(35) [Avião Cn de papel Cn no Espaço C0]
“Paper Airplane in Space” (EuRom5 P02-1)
Figure 7.53 Avião de papel no Espaço.
All these examples illustrate similar phonetic realizations of same-class

contours, Cc, C1, or Cn. Cc will be used preferably if the structure is
more complex, with prosodic words included in each group terminated
by Cc.
Ended with Cc
Figure 7.54 Configuration I, Cx Cx Cc

Portuguese C2 C2 Cc
(36) [Um cão C2 de raça C2 terra nova Cc]
“A dog of Newfoundland breed” (EuRom5 P04-1)
Figure 7.55 Um cão de raça terra nova.
In this sequence, the second contour C2 has a glissando value very close to the
threshold and could be transcribed as Cn instead of C2, corresponding to the
structure [C1 [Cn Cc]] congruent with the syntax of Um cão de raça terra nova.
The next example shows a complex contour Cc occurring on the last and
stressed syllable of the prosodic group (um escocês).
Portuguese C2 C2 Cc
(37) [é permitido C2 matar C2 um escocês Cc]
“it is allowed to kill a Scotsman . . .” (EuRom5 P05-1)
Figure 7.56 é permitido matar um escocês.

Portuguese C2 C2 Cc
(38) [[Os vídeos C2] [sobre Cn actividades C2] paranormais Cc]
“The videos about paranormal activities” (EuRom5 P16-1)
Figure 7.57 Os vídeos sobre Cn actividades paranormais são verdadeiros
casos de sucesso no . . .
Spanish C1 C1 Cc
(39) [La recomendación C1 plantea C1 a los Estados Cc]
“The recommendation poses to the States . . .” (EuRom5 E12-1)
Figure 7.58 La recomendación plantea a los Estados.
This is a classical [C1 C1 Cc] sequence in the form of an enumeration
non-congruent with syntax. The choice between C2 and C1 in this structure may
be due to the variable complexity estimated by the speaker, C2 preceding C1 in
contour ranking and being more marked than C1.
Cn Cn Cc is quite improbable, as this configuration skips two levels
in the rank of melodic contours (C2 C2 Cc and C1 C1 Cc, with the ranking Cn <
C1 < C2 < Cc < C0). However, Figure 7.59 is an example in Romanian.
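The notion of skipping levels in the rank of melodic contours can be made explicit with a short sketch. The list below encodes the ranking given in the text for the Romance languages other than French; the function name is a hypothetical helper introduced for illustration.

```python
# Ranking for Romance languages other than French, from lowest to highest.
RANKING = ["Cn", "C1", "C2", "Cc", "C0"]


def levels_skipped(lower: str, higher: str) -> list:
    """Return the contour ranks lying strictly between two contours."""
    i, j = RANKING.index(lower), RANKING.index(higher)
    if i >= j:
        raise ValueError(f"{lower} does not rank below {higher}")
    return RANKING[i + 1:j]


print(levels_skipped("Cn", "Cc"))  # ['C1', 'C2']
print(levels_skipped("C1", "Cc"))  # ['C2']
print(levels_skipped("C2", "Cc"))  # []
```

A Cn contour depending directly on Cc leaves both C1 and C2 unused, which corresponds to the two skipped levels mentioned above.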
Romanian Cn Cn Cc
(40) [[Mișcarea Cn separatistă Cn bască Cc] [a comis C1 noi atentate C0]]
“The Basque separatist movement committed new bombings” (EuRom4
rofn42)
Figure 7.59 Mișcarea separatistă bască a comis noi atentate.
Ended with C2
Catalan Cn Cn C2
(41) [[[i després Cn que el vedell Cn] ataqués C2] un dels homes Cc] [que el
volia C1 lligar Cc]
“and, after the calf attacked one of the men who wanted to tie him”
(EuRom5 C18-1)
Figure 7.60 i després que el vedell ataqués un dels homes que el volia lligar.
Ended with C1
Portuguese Cn Cn C1
(42) [A escolha Cn da carreira Cn profissional C1]
“The choice of a professional career . . .” (EuRom5 P09-1)
Figure 7.62 A escolha da carreira profissional uma das decisões mais
importantes da vida.
Configuration II
Ended with C0
Italian C2 Cc C0
(43) . . . [[in coppie C2 nelle quali il padre Cc] è sieropositivo C0]
“in couples in which the father is HIV positive” (EuRom4 I_22_17)
Figure 7.64 in coppie nelle quali il padre è sieropositivo.

Italian C2 Cc C0
(44) [[. . . sarà arruolato C2 dai carabinieri Cc] e addestrato C0]
“. . . will be recruited and trained by the police” (EuRom5 I03-1)
Figure 7.65 . . . sarà arruolato dai carabinieri e addestrato.
Italian C1 Cc C0
(45) [[[probabilmente C1] [sfuggito Cn al controllo Cc]] del padrone C0]
“probably escaped the control of the master” (EuRom5 I03-1)
Figure 7.66 probabilmente sfuggito al controllo del padrone.
Cn Cc C0 is not attested in the corpus.
Cn C1 C0 is not attested in the corpus.
Romanian Cn C1 C0
(46) [[Aceasta Cn este o dilemă C1] insolubilă C0]
“This was an insoluble dilemma” (EuRom4 rofn10)
Figure 7.67 Aceasta este o dilemă insolubilă.
Ended with Cc
Figure 7.68 Configuration II, Cy Cx Cc

Romanian C1 C2 Cc
(47) [[Situația C1 periferică C2] a Portugaliei Cc]
“The peripheral situation of Portugal . . .” (EuRom4 rofn40)
Figure 7.69 Situația periferică a Portugaliei o menține într-o poziție
marginală în raport cu fluxurile din Est.
Catalan Cn C2 Cc
(48) [[[i després Cn que el vedell Cn ataqués C2] un dels homes Cc] que el volia
C2 lligar Cc]
“and, after the calf attacked one of the men who wanted to tie him . . .”
(EuRom5 C18-1)
Figure 7.70 i després que el vedell ataqués un dels homes que el volia lligar.
These examples show that, contrary to French, the other Romance languages
use C2 and not C1 as marker of the prosodic dependency relation with the
higher rank melodic contour Cc.
Romanian Cn C1 Cc
(49) [[[Unele C1] [[dintre aceste Cn clădiri C1] prefabricate Cc]] s-au dovedit
Cn periculoase C0]
“Some of these prefabricated buildings have been found to be dangerous”
(EuRom4 rofn46)
Figure 7.71 Unele dintre aceste clădiri prefabricate s-au dovedit
periculoase.
Ended with C2 These are saturation cases. There is no possible set
of contrasts to encode this configuration.
Ended with C1 These are saturation cases.
Configuration III
Ended with C0

Romanian Cc C1 C0
(50) [[Romanele Cc] [[au un început C1] [și un sfârșit C0]]]
“The novels have a beginning and an end” (EuRom4 rofn9R)
Figure 7.73 Romanele Cc au un început C1 și un sfârșit.
This sentence was read with a different prosodic structure from the other
Romance languages, congruent with syntax.
Romanian C1 Cn C0
(51) [Alarmă C1] [la școala Cn britanică C0]
“Alarm in the British school” (EuRom4 rofn9R)
Figure 7.74 Alarmă la școala britanică.

Ended with Cc
Figure 7.75 Configuration III, Cx Cy Cc
Italian C2 C1 Cc
(52) [[che C2] [trasferirsi C1 in USA Cc]]
“which, moved in USA . . .” (EuRom4 I_23_06)
Figure 7.76 che trasferirsi in USA.
This example is prototypical of a falling melodic contour C2 on a
monosyllabic word che just before the beginning (the left boundary) of the
parenthesis trasferirsi in USA. It shows clearly that C2 does contrast with
Cc and not with the next contour C1, forming a group with the whole
parenthesis.
Catalan C2 Cn Cc
(53) [[Pocs Cn minuts C2] [després Cn de les set Cn de la tarda Cc]] el vedell
“A few minutes, after seven in the evening, the calf . . .” (EuRom5 C18-1)
Figure 7.77 Pocs minuts després de les set de la tarda, el vedell.
The prosodic structure segments the text into Pocs minuts and després de les
set de la tarda.
Italian C1 Cn Cc
(54) [[[probabilmente C1] [sfuggito Cn al controllo Cc]] del padrone C0]
“probably escaped the control of the master” (EuRom5 I03-1)
Figure 7.78 probabilmente sfuggito al controllo del padrone.

Romanian C1 Cc C0
(55) [[Unele clădiri C1 s-au dovedit Cc] a fi periculoase C0]
“Some of these buildings have proved dangerous” (EuRom4 rofn8R)
Romanian C1 Cc C0
(56) [[[Un grup C1] [de cercetători Cn germani Cc]] a rezolvat enigma C0]
“A group of German researchers has solved the enigma” (EuRom4
rofn3R)
Figure 7.79 Un grup de cercetători germani a rezolvat enigma.
Ended with C2 This is a saturation case. There is no possible set of
contrasts to encode this configuration.
Ended with C1 This is another saturation case. There is no possible
set of contrasts to encode this configuration.
Sequences of four prosodic words and more

The speaker can choose among various strategies to maintain the necessary
(and sufficient) contrasts between the melodic contours involved.
These strategies derive from the configurations with two or three prosodic words
detailed above. When the prosodic structure becomes more complex, the encoding
system runs short of melodic contrasts between contours, and realizations use
neutralized contours Cn, as shown in the following examples.
a. French C2 C2 C2 C1 An example of saturation, where, due to the
lack of available prosodic contrasting features, the speaker chooses to produce
a list of prosodic words ended with C2 (−Rising) contours, contrasting with the
group final C1 (+Rising) to carry a more complex syntactic structure of les
médecins de l’Académie des sciences médicales.
French
(57) [[les médecins C2] [[de l’Académie Cn des sciences C2] [médicales C1]]
“the doctors of the Academy of Medical Sciences” (EuRom5 F01-1)
Figure 7.80 les médecins de l’Académie des sciences médicales in Les
médecins de l’Académie des sciences médicales pensent pouvoir greffer un
utérus artificiel dans l’abdomen masculin.
The glissando values of all four contours are above the threshold, and a
supplementary contrast between C1 and the C2 contours is ensured by a
difference of duration of the vowels implied: C2 about 100 ms, C1 160 ms.
b. C1 C1 Cn Cn Cn C0
Spanish
(58) [[El catalán C1] [es C1] [[la ochenta Cn y ocho Cn lengua Cn] del
mundo C0]]]
“Catalan is the 88th language of the world” (EuRom5 E16-1)
Figure 7.81 El catalán es la ochenta y ocho lengua del mundo.

c. Cn Cn Cc Cn Cn Cc
Catalan
(59) [L’acadèmia Cn de la llengua Cn catalana Cc] [l’Institut Cn d’Estudis Cn
Catalans Cc] [IEC Cc]
“The Catalan language academy, the Institute of Catalan Studies (IEC) . . .”
(EuRom5 C06-1)
Figure 7.82 L’acadèmia de la llengua catalana l’Institut d’Estudis
Catalans IEC.
The speaker of this example realizes a particular complex contour, with a
rising melodic movement on both the stressed and the final vowels.
d. Cn Cn Cn C0 Another example using the same strategy, with a
succession of falling contours Cn −Glissando contrasting with the final contour
by the +/−Low and +/−Glissando features (C0 being +Low and +Glissando),
as well as by vowel duration (about 90 ms for the first four stressed vowels, and
150 ms for the final vowel of the conclusive contour).
French
(60) [Le programme Cn de recherche Cn a débuté Cn en deux mille deux C0]
“The research program began in two thousand and two” (EuRom5 F05-1)
Figure 7.83 Le programme de recherche a débuté en deux mille deux.

The example Poche zampate per attirare l’attenzione del piantone del
comando provinciale dei carabinieri (EuRom5 I03-1) illustrates a typical
case of melodic feature saturation, with a succession of neutralized contours
Cn ended with a complex contour Cc [[per attirare Cn l’attenzione Cn del
piantone Cn del comando Cn provinciale C2] dei carabinieri Cc] after a first
group [Poche Cn zampate C2]. In this example, the speaker cannot realize a
prosodic structure congruent to the relatively complex syntax (typical of
written texts and seldom, if ever, heard in a non-prepared spontaneous
production).
Italian
(61) [[Poche Cn zampate C2] [per attirare Cn l’attenzione Cn del piantone Cn
del comando Cn provinciale Cn dei carabinieri Cc]]
“A few paw strokes to attract the attention of the sentry of the
provincial command of the carabinieri . . .” (EuRom5 I03-1)
Figure 7.84 Poche zampate per attirare l’attenzione del piantone del
comando provinciale dei carabinieri . . .
The following example Una nuova e divertente ginnastica con la palla

“A new and fun gymnastic exercise with a ball” is realized with the
sequence of contours C1 Cn Cn Cc, with C1 as the first contour, as
expected.
Italian
(62) [[Una nuova C1] [e divertente Cn ginnastica Cn con la palla Cc]] “A new
and fun gymnastic exercise with a ball . . .” (EuRom5 I05-1)
Figure 7.85 Una nuova e divertente ginnastica con la palla.
Example (63) below shows a saturation chosen by the speaker when reading
the complex syntagm: es necesario alfabetizar a cuatro millones de personas
cada año (the two successive vowels [a] in cada año are pronounced as one
unique [a]), with a sequence of six neutralized contours Cn followed by the
final group rising contour C1.
Spanish
(63) [es Cn necesario Cn alfabetizar Cn a cuatro Cn millones Cn de personas
Cn cada año C1]
“it is necessary to teach four million people to read and write each year . . .”
(EuRom5 E02-1)
Figure 7.86 es necesario alfabetizar a cuatro millones de personas cada
año . . .
The next example shows Cc on the stressed and final syllable of autocontrol,
with the rising part of the contour located on the voiced lateral [l] ending the
stressed syllable, in La resolución propone que se pongan en marcha dispositivos
nacionales de autocontrol. It is another case of a structure in [C1 Cc],
with five intermediate prosodic words ending with neutralized contours Cn.
Spanish
(64) [[La resolución C1] [propone Cn que se pongan Cn en marcha Cn
dispositivos Cn nacionales Cn de autocontrol Cc]]
“The resolution proposes to implement national self-monitoring devices
. . .” (EuRom5 E12-1)
Figure 7.87 La resolución propone que se pongan en marcha dispositivos
nacionales de autocontrol.
The following is another clear example of C2 depending on Cc and not on the
immediate vicinity of contours Cn.
Catalan
(65) [[Així C2] [abans Cn que acabi el Cn dos mil vuit Cc]]
“Thus before 2008 . . .” (EuRom5 C20-1)
Figure 7.88 Així abans que acabi el dos mil vuit.

Portuguese
(66) [Com cerca Cn de sete Cn centímetros Cn de comprimento Cc]
“About seven centimeters long . . .” (EuRom5 P02-1)
Figure 7.89 Com cerca de sete centímetros de comprimento.
Romanian
(67) [[[In Germania C1] [violența Cn rasistă Cc]] [a depășit Cn limita C0]]
“Racist violence in Germany has exceeded the limit” (EuRom4 rofn22R)
Figure 7.90 In Germania violența rasistă a depășit limita.
The incremental process indicating the successive grouping of prosodic
words utilizes a set of melodic contours, whose acoustic features implement
the necessary and possibly more than sufficient contrasts to allow the listener
to correctly recover the prosodic structure intended by the speaker. French,
with a ranking of melodic contours Cn < C2 < C1 < C0, and the other
Romance languages, with a ranking of contours Cn < C1 < C2 < Cc < C0,
use the same mechanism to indicate the relations of dependency between
prosodic words inside a given prosodic structure. In all the prosodic
properties of all the Romance languages discussed in this book, the contrast of
melodic slope is at work, preventing occurrences of *C2 C0 sequences, for
example, while the complex contour Cc has its melodic slope feature
neutralized.
In the large number of read sentences analyzed, almost all follow the same
prosodic grammar common to all Romance languages, manifesting at the same
time a high degree of congruence with syntax. As shown in Chapter 8 on
macrosyntax, this will not always be the case for spontaneous speech.
Coordination, enumeration, parenthesis

Some interesting cases derive from the reading of enumeration, coordination,
and parenthesis as indicated in the written text by syntax and punctuation.
Coordination
In a recent paper devoted to the prosodic aspects of coordinated constructions
in French (Mouret et al., 2008), the authors examine three syntactic
configurations implying coordination (simple, duplicated, and juxtaposed),
each in two positions:
1. Postverbal position
a. simple (e.g. il faut décorer Anne-Marie, Jean-Philippe et Ségolène “you
must decorate Anne-Marie, Jean-Philippe and Ségolène”)
b. duplicated (e.g. il faut décorer et Anne-Marie et Jean-Philippe et
Ségolène “you must decorate and Anne-Marie and Jean-Philippe and
Ségolène”)
c. juxtaposition (e.g. il faut décorer Anne-Marie, Jean-Philippe, Ségolène
“you must decorate Anne-Marie, Jean-Philippe, Ségolène”).
2. Preverbal position
a. simple (e.g. Anne-Marie, Jean-Philippe et Ségolène vont être décorés
“Anne-Marie, Jean-Philippe and Ségolène will be decorated”)
b. duplicated (e.g. et Anne-Marie et Jean-Philippe et Ségolène vont être
décorés “And Anne-Marie and Jean-Philippe and Ségolène will be decorated”)
c. juxtaposition (e.g. Anne-Marie, Jean-Philippe, Ségolène vont être décorés
“Anne-Marie, Jean-Philippe, Ségolène will be decorated”).
Seven speakers read 126 sentences of these different types. The acoustic
analysis of the recordings was displayed in order to highlight the melodic
movements occurring on stressed vowels, at the boundaries of the coordinated
units and possibly on conjunctions bearing a secondary stress (emphatic
stress, accent d’insistance).
Figure 7.91 je vous suggère d’installer des volets des rideaux et des
voilages “I suggest you install shutters, curtains and sheers.”
Figure 7.92 je vous conseille d’étudier le néerlandais le danois et le
norvégien “I suggest you learn Dutch, Danish and Norwegian.”
Figures 7.91 and 7.92 illustrate clearly the mechanism of melodic slope
inversion. In postverbal position, all cases show the first two stressed
syllables ending the sequence of coordinated groups with a rising melodic
contour, contrasting with the final falling conclusive declarative contour.
(68) [[Je vous suggère Cn d’installer Cn des volets C1] [des rideaux C1] et des
voilages C0]
(69) [[[je vous conseille C2] [d’étudier Cn le néerlandais C1]] [le danois C1] et le
norvégien C0]
A rising melodic contour contrasting with the terminal conclusive contour
C0 occurs in all occurrences of postverbal realizations.
In preverbal position, however, two melodic patterns occur: one (pattern A)
with three rising melodic contours on the final syllables of the coordinated
units (Fig. 7.93), the other (pattern B) with two initial falling contours
followed by a third rising contour, again on the last syllables of the
coordinated units (Fig. 7.94).
Figure 7.93 Jamais Barnabé Jean-David ni Mamadou ne seraient prêts à
venir travailler le samedi “Never would Barnabé Jean-David, nor Mamadou
be ready to work on a Saturday.”
Figure 7.94 Le muret le donjon et l’église sont de style roman “The wall the
dungeon and the church are Romanesque.”
(70) [[Jamais C2 Barnabé C1] [Jean-David C1] [ni Mamadou C1] [ne seraient prêts
Cn à venir travailler Cn le samedi C0]]
(71) [[Le muret C2 le donjon C2 et l’église C1] sont de style roman C0]
The sequences of melodic contour are distributed according to Table 7.4.
Table 7.4 Percentage of realizations according to the coordination type
Pattern | A B et C | et A et B et C | A B C
A ↗↗↗ | 59% | 44% | 83%
B ↘↘↗ | 41% | 56% | 15%
An explanation
All postverbal cases of coordination present the same melodic pattern, Rise
Rise Fall, whatever the type of coordination: simple, duplicated, or
juxtaposed. The explanation relies on the principle of melodic slope contrast
in French. Whatever the local hierarchy, [A B C] or [[A] [B] [C]] (Fig. 7.95),
the sequence of melodic contours will be the same, as the contrast is realized
with the same falling contour that ends the prosodic word group C in both
cases.
Figure 7.95 Two possible hierarchical configurations [ABC] and [[A] [B]
[C]] for postverbal accentual units, resulting in similar sequences of
contours: rising, rising, falling.
Figure 7.96 Different groupings of the coordinated units A, B, and C, [ABC]
and [[A] [B] [C]], resulting in different sequences of melodic contours in the
preverbal case: falling, falling, rising or rising, rising, rising.
In preverbal position, however, the two possible groupings involve
different sequences of melodic contours. The final C contour can dominate
the first two stress groups A and B (Fig. 7.96 left), or C can be the third
element of a sequence A B C dominated by the final sentence contour C0
(Fig. 7.96 right).
The speaker then has the possibility to indicate different local prosodic
hierarchies with different sequences of melodic contours. In le vélo le roller
ou l’aviron comptent parmi les activités populaires sur le campus “the bicycle
the roller or rowing count as popular activities on the campus” the prosodic
sequence piles the three elements on the same level as subjects of the verbal
phrase [[le vélo C1] [le roller C1] [ou l’aviron C1] [comptent parmi les activités
Cn populaires Cn sur le campus C0]] (Fig. 7.97). In Figure 7.98, however, the
prosodic phrase le muret le donjon et l’église forms a separate group as unique
subject of the verbal phrase [[le muret C2 le donjon C2 et l’église C1] sont de
style roman C0] “the wall, the dungeon and the church are Romanesque.”
Figure 7.97 le vélo C1 le roller C1 ou l’aviron comptent parmi les activités
populaires sur le campus “Bicycle, roller and rowing count as popular
activities on the campus.”
Figure 7.98 le muret le donjon et l’église sont de style roman in which the
coordinate units are subject of the Verb Phrase sont de style roman.
Referring to the storage-concatenation model, the differences appear in the
accumulation of the information in the listener’s memory. In the first case
(Fig. 7.96 left), the three units are grouped together to form a prosodic group,
which will be merged with the other upcoming group terminated by C1 or C0.
In Figure 7.96 right, each unit is ended with a contour C1 and stored with the
next units ended with C1, until C0 occurs to terminate the process. From
the cognitive point of view, the first case reduces the information brought
by the three units into one single global unit (the one ended with C1). In the
second case, each unit, terminated by C1, is left in memory until C0 occurs to
terminate the process.
(72) [[le vélo C1] [le roller C1] [ou l’aviron C1] [comptent parmi les activités Cn
populaires Cn sur le campus C0]]
(73) [[le muret C2 le donjon C2 et l’église C1] sont de style roman C0]
The distribution of realizations suggests that the nature of the implied
conjunction et, ou, or ni is irrelevant in the choice of the speakers. Although
specialists in semantics may reveal subtle differences in the various uses of
the conjunctions, this is not necessarily the case for (non-semantician)
speakers. Furthermore, in final position the result from the point of view of
the storage-concatenation model is the same, and only a subtle distinction may
be argued in subject noun phrase configuration, a distinction that may not be
felt by listeners.
In the example le vélo le roller ou l’aviron comptent parmi les activités
populaires sur le campus (Fig. 7.97), one wonders why the speaker has not
grouped together le roller ou l’aviron. As the material was read, the speaker
had access to the future along the time line and realized a more eurhythmic
phrasing by enunciating le roller and l’aviron separately, on the same level as
the first stress group le vélo, producing a sequence of 3, 3, and 3 syllables,
instead of 3 and (3+3). In this latter case, eurhythmicity would have required a
slower rhythm for le vélo, and an accelerated speech rate for le roller ou
l’aviron.
Other cases
The prosodic structure encoding can also involve the realization of secondary
stress, located on the conjunction of coordination. When the conjunction is
duplicated, all of its occurrences bear a secondary stress (Fig. 7.99).
Figure 7.100 shows that in postverbal but not final position, stress
groups are coordinated and almost always associated with a stack rather
than a grouping. The three coordinate objects of the verb livrer are stacked
in three prosodic words.
Figure 7.99 A parallel realization where the first stress groups are
coordinated with the conjunction ni, associated in each case with an
emphatic accent. En aucun cas ni les volets ni les rideaux ni les voilages ne
permettent de régler ce problème.
Figure 7.100 On ne peut plus livrer la baignoire le lavabo ni l’évier sans
acompte de votre part “We can no longer deliver the bathtub, the washbasin
or the sink without a deposit from you.”
Enumeration
Enumeration implies the use of same class prosodic contours, usually C1 or Cn
in French, Cc or C1 in the other Romance languages, except for the last item.
Some examples are given in Figures 7.101 to 7.103.
(74) 1.46, 1.47, 1.44, 2.78, 2.41
(75) [. . . [B C2 antiparasiti Cc] [C C2 [anticoncezionali Cc] . . . [M Cc
antilopi C0]]
Figure 7.101 Enumeration in Italian of numbers (1.46, 1.47, 1.44, 2.78, 2.41),
sequence of Cc contours ending each prosodic group, terminated by C0
conclusive on the last item giorni (Italian EuRom4 I_09_04).
Figure 7.102 [. . . B C2 [antiparasiti Cc] [C C2] anticoncezionali Cc]] . . . [M
Cc antilopi C0]] Primo Levi enumeration C2 Cc, C2 Cc . . . Cc C0 (Italian Svevo).
Portuguese
(76) [[Uma equipa C2] [de cientistas C2] [do instituto Cn de reabilitação C2]
de Chicago]
“A team of scientists at the Rehabilitation Institute of Chicago” (EuRom5
P14-1)
Figure 7.103 Uma equipa de cientistas do instituto de reabilitação de Chicago.

The system of melodic contours cannot encode a prosodic structure that
would be congruent with syntax in this example. Instead, the speaker
chooses to realize an enumeration of the five prosodic words, each ended
with a C2 contour, contrasting with the final Cc concluding the
enumeration.
Parenthesis
The stress group corresponding to a parenthesis in the text can either be
integrated prosodically and end with a C1 or Cc contour, or appear isolated
and end with a conclusive terminal C0 contour (cf. Gachet & Avanzi, 2008;
Debaiseux & Martin, 2010).
A first example (Fig. 7.104) shows the prosodic integration of the relative
pronoun che in the parenthesis in quel mondo, using a melodic contour C2
contrasting with Cc ending the parenthesis:
Italian
(77) [Il fatto che C2, in quel mondo Cc], [gli uomini Cc]
“The fact that, in that world, men . . .” (EuRom4 I16-21)
Figure 7.104 Il fatto che, in quel mondo, gli uomini.
Another example of prosodic integration is given in Figure 7.105, with a
parenthesis of two prosodic words. The stress group che ends with a falling C2
melodic contour, contrasting with the final contour of the parenthesis Cc,
whereas the contour C1 internal to the parenthesis is also in a dependency
relation with Cc, but at a lower level.
Italian
(78) [[che C2] [trasferirsi C1 in USA Cc]]
“which, moving to the USA . . .” (EuRom4 I_23_06)
Figure 7.105 che trasferirsi in USA.
The same configuration can be found for the other Romance languages, for
instance in Spanish (Fig. 7.106).
Spanish
(79) [. . .[permitan C1 una utilización C2] [más segura Cn de algo Cn que C2]
evidentemente Cc]
“allow a safer use of something that obviously . . .” (EuRom4 E19-17)
Figure 7.106 permitan una utilización más segura de algo que,
evidentemente.
Another configuration using rising contours C1 instead of C2 . . . Cc is
shown in Figure 7.107.
Instead of being integrated into the parenthesis prosodic group,
the relative pronoun is placed in an enumeration sequence ending
with C1.
The prosodic integration of the relative pronoun into the parenthesis
prosodic group contrasts with the enumeration configuration when, for
example, a verb is used as in Figure 7.107.
Italian
(80) [Ancora C1 i giapponesi, Cc] [per contro Cc]
“Yet the Japanese, by contrast . . .” (EuRom4 I_23_06)
Figure 7.107 Ancora i giapponesi, per contro hanno la più alta incidenza
mondiale.
Figure 7.108 is an example with a (rare) sequence C1 C2.
Italian
(81) [[Le donne C1 giapponesi C2] per esempio Cc] [[hanno Cn un’incidenza
C2] [di tumori Cn alla mammella Cc . . . ]]
“Japanese women, for example, have an incidence of breast cancer . . .”
(EuRom4 I_23_06)
Figure 7.108 Le donne giapponesi, per esempio, hanno un’incidenza di
tumori alla mammella . . .
Here we have a C1 C2 sequence, given that C2 Cc (in front of the parenthesis
per esempio) suggests C1 at the beginning of the sentence may have another
function (cf. Frota, 2009), but counterexamples do exist. The rest of the
sequence is congruent with syntax: C2 Cn Cc.
An example of AM prosodic analysis in French

In a frequently quoted paper, Jun and Fougeron (2002) propose an analysis of
French intonation conducted in the AM framework. Briefly, the authors
state that French has metrically strong syllables which are either boundary
tones or emphatic prosodic events, the latter generally located on the first
syllable of content words.
They also use the concept of arc accentuel, described by Fónagy (1979),
which nowadays reflects mainly the journalistic style on radio and TV
channels. It does not represent anything significant in today’s pronunciation
of French speakers (Léon, 2005). Still they associate LHi with some initial
syllables of the content word in an AP (i.e. a stress group), and sometimes
with a clitic preceding the content word if there is a large number of clitics in
the AP.
The right boundary of a prosodic word is marked by a final rise, represented
by LH*. LH* has a double association as shown in Figure 7.9: LH* marks the
right edge of a prosodic word, but H* is also associated with the stressed
syllable of a stress group, i.e. the final full syllable of the last content word of
a stress group. This reflects H*’s association with the most prominent syllable
within a phrase and its demarcative function. They consider this LH* tone to be
a pitch accent because part of the tone is associated with a stressed syllable at
the phrasal level 4.
In summary, the left side of the prosodic word (AP in AM terminology) is
demarcated by an initial rising tone (LHi) and its right side by a final rising
tone (the pitch accent LH*). Syllables in-between Hi and final L (in LH*) get
their melodic curve by interpolation, with a slope inversely correlated with the
number of syllables between the two prosodic events.
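As an illustration only, the interpolation described in this summary can be sketched in a few lines of Python. The sketch assumes a straight-line interpolation in Hz between the Hi target and the L of the final LH*, which is a simplification. The function name interpolate_f0 and the F0 values are hypothetical and are not taken from Jun and Fougeron (2002).

```python
def interpolate_f0(hi_hz, low_hz, n_between):
    """Return assumed F0 values (Hz) for the syllables lying between Hi and L.

    hi_hz and low_hz are the two tonal targets; n_between is the number of
    intervening syllables. Linear interpolation is an assumption of this sketch.
    """
    step = (low_hz - hi_hz) / (n_between + 1)  # F0 change per syllable
    return [hi_hz + step * k for k in range(1, n_between + 1)]


# Hypothetical targets: Hi at 240 Hz, L at 180 Hz.
print(interpolate_f0(240.0, 180.0, 2))  # [220.0, 200.0]
print(interpolate_f0(240.0, 180.0, 5))  # [230.0, 220.0, 210.0, 200.0, 190.0]
```

With the same two targets, two intervening syllables give steps of 20 Hz per syllable and five give steps of 10 Hz: the more syllables, the shallower the slope.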
In their examples, the highest prosodic constituent is IP, which contains one or
more prosodic words. This IP ends with an intonation phrase final L% or H%.
A first example is as follows:
(82) Le coléreux garçon ment à sa mère “The choleric boy lies to his mother.”
This sentence is analyzed with two Intonation Phrases (IP): [Le coléreux
garçon] [ment à sa mère], getting the ToBI transcription of Figure 7.109.
In this example the use of a Hi tone on the first syllable of coléreux is justified
by the marked order Adj + Noun of the constituents, the unmarked being N +
Adj le garçon coléreux. The presence of this emphasis mark turns the two
stress groups involved into a single stress group [Le coléreux garçon]
LHiLH*. The falling melodic contour on the final syllable of coléreux is
explained by the interpolation which must exist between the initial rise Hi
and the initial L in the final prosodic word LH*.
Figure 7.109 Le coléreux garçon ment à sa mère (from Jun & Fougeron,
2002).
Figure 7.110 le coléreux et mauvais garçon ment à sa mère (from Jun &
Fougeron, 2002).
However, the authors’ interpretation associating a Hi on the single syllable
word ment is quite questionable, as this content word has only one syllable, and
if stressed, this single syllable can be a final AP pitch accent (in the
interpretation of the authors).
The next example analyzed by the authors is le coléreux et mauvais garçon
ment à sa mère “The choleric and bad boy lies to his mother.”
Driven by the concept of a pitch accent realized with an H* tonal target, the
sentence is analyzed into three stress groups: [Le coléreux] [et mauvais
garçon] [ment à sa mère] so that each AP receives the sequence Hi H*.
Since it seems rather strange to associate an emphasis Hi on the second
syllable of mauvais, a better interpretation is also possible, with the following
segmentation:
(83) [Le coléreux] [et mauvais] [garçon] [ment à sa mère].
Actually, whether they are considered stressed or not, the final syllables
of le coléreux and et mauvais have a falling melodic contour, as predicted
by the ISC model, in a melodic contour sequence C2 C2 C1 C0 or Cn Cn
C1 C0.
Interestingly, the authors get puzzled by what they call “exceptional
cases” and discuss such an example where an AP ends with a falling pitch
accent.
Indeed, the sentence le garçon coléreux ment à sa mère shows a clear
example of a melodic slope inversion, with a falling contour on garçon (i.e.
C2) contrasting with the rising contour on coléreux (C1) indicating the
structure [[le garçon] [coléreux]] [ment à sa mère].
Not considering the effect of contraste de pente in this example, and in order
to stick to their AM phonological guns and explain this apparent
counterexample, the authors bring an extra constraint: avoid three consecutive H
tones *[Hi*H*Hi*]. The L* tone would then be there to comply with this
constraint, transforming the middle H* of the predicted sequence into an L*. Of
course, this ad hoc condition is easily discarded by evoking enumeration
sentences, such as in French lundi mardi mercredi jeudi je travaille “Monday
Tuesday Wednesday Thursday I work.” The first four -di syllables carry H*
pitch accents, contradicting the *[Hi*H*Hi*] above.
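The argument can be made concrete with a small Python sketch. It assumes, purely for illustration, that a tonal transcription is a flat list of labels and that a tone counts as high when its label starts with H. The function name and the three label sequences are hypothetical simplifications of the transcriptions discussed above.

```python
def has_three_consecutive_high(tones):
    """Return True if three successive tones in the sequence are high (H...)."""
    run = 0
    for tone in tones:
        run = run + 1 if tone.startswith("H") else 0
        if run >= 3:
            return True
    return False


# Simplified sequence predicted for "le garçon coléreux", before and after
# the middle H* is turned into L* to satisfy the constraint.
predicted = ["Hi", "H*", "Hi"]
repaired = ["Hi", "L*", "Hi"]
# Enumeration "lundi mardi mercredi jeudi": four H* pitch accents in a row.
enumeration = ["H*", "H*", "H*", "H*"]

print(has_three_consecutive_high(predicted))    # True
print(has_three_consecutive_high(repaired))     # False
print(has_three_consecutive_high(enumeration))  # True
```

The enumeration is attested and yet fails the check, which is the sense in which the extra constraint is ad hoc.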
Another clear example is given in Figure 7.112.
According to the AM model, and considering that there is no lexical stress in
French, the phonologically pertinent melodic contours are the ones located on
the (effectively) stressed syllables:
(84) Ou le donjon ou le minaret ou les murailles doivent être restaurés
L*L- LH*H% LH* L*L%
“Or the dungeon or the minaret or the walls must be restored”
Interestingly, the overall transcription above is clearly phonetic, but includes
the appropriate phonological transcription as shown.
The optional contour Hi predicted by the model (Jun & Fougeron, 2002) is
placed whenever possible, but this solution is highly debatable. Indeed, the
second LHi introduces a stress clash with the final stressed syllable on donjon.
Figure 7.111 A counterexample le garçon coléreux ment à sa mère (from
Jun & Fougeron, 2002).
Figure 7.112 Ou le donjon ou le minaret ou les murailles doivent être
restaurés “Or the donjon or the minaret or the walls need to be restored”
pronounced by a female speaker from Paris (Delais et al., 2014). Tonal tier:
Hip LHi L* L- Hip Hi L* L- Hip LH*H% LH* LHi L*L%.
Furthermore, the L symbol seems to be there because of the preceding dip in the
fundamental frequency curve, but this dip is phonetic due to the syllable-initial
voiced stop [d] (see Martin, 2008). Likewise, the second Hi aligned on the first
syllable of minaret is actually lower than the F0 value on the preceding ou. An
actual emphatic stress would have provoked a clear bump in the melodic curve.
Another remark pertains to the initial stress on the three conjunctions ou. No
phonological explanation for this fact is given here, although studies on
intonation of coordination in French provide simple explanations (all
conjunctions must be stressed equally – or non-stressed – in this kind of
example; Martin, 2009).
The observed prominences must be of another nature than pitch accents,
namely boundary tones. The problem then pertains to the Single Layer
Hypothesis (SLH), as even in simple sentences with only ip, an IP-specific
boundary tone will occur. This has been somewhat swept under the carpet by
Jun and Fougeron (2002), who managed to analyze very short sentences with
only IPs (see above). In a more general case, however, as already revealed in
Delais et al. (2014), in the sentence Ou le donjon ou le minaret ou les murailles
doivent être restaurés, there are two boundary tones ending the stress groups ou
le donjon and ou le minaret, with a falling melodic curve. These tones are of
another nature than the rising boundary tone that ends the first IP of the
sentence, Ou le donjon ou le minaret ou les murailles.
If the hypothesis of having two different boundary tones is maintained, the
SLH no longer holds, since it contradicts the recursivity observed in this
example (ip groups prosodic words, and IP groups ip’s, even when ip and IP are
reduced to a single component, respectively a stress group and an ip). If the
SLH must be kept at all cost, then the prosodic events on the first prosodic
words must be considered as pitch accents . . . French would then have prosodic
events which are sometimes pitch accents, sometimes boundary tones, the first
being realized with a falling pitch, the second with a rising pitch. Numerous
examples detailed in the next chapter show that this interpretation does not hold
when the prosodic structure becomes more complex.
An example of ISC prosodic processing in French

The analysis of the following example gives another illustration of the ISC
process:
(85) Surtout ce qui m’a fait quand même infiniment plaisir c’est de voir tous les
parents d’élèves du privé comme du public tous les élèves approuver cette charte
“What made me especially extremely happy anyway is to see all the parents of the
private and public school system all students approve this charter”
(The passage was extracted from a TV interview given by Claude Allègre,
Education Minister in France from 1997 to 2000, on the national TV channel
Antenne 2 on March 4, 1999)
The stress groups identified first from auditory perception in this
example are:
[[Surtout C1] [[ce qui m’a fait Cn quand même C2] [infiniment Cn plaisir C1]]
[c’est de voir C2 tous les parents d’élèves C1] [du privé C2 comme du public
C1] [tous les élèves C1] approuver Cn cette charte C0]
Surtout is stressed on its last syllable and therefore appears to be a prosodic
structure marker. Otherwise, its first syllable would have been stressed,
like tous, which carries an emphatic stress.
The storage-concatenation process operates linearly along the time axis (i.e.
as time passes) to recover the prosodic structure encoded by the speaker.
All these stress groups contain fewer than seven syllables, and their duration
remains below 1 second. The following chart shows the contraction of stress
groups according to the number of syllables they contain.
Surtout C1 (“Especially”)
ce qui m’a fait Cn quand même C2 (“what made me”)
infiniment Cn plaisir C1 (“extremely happy”)
c’est de voir C2 (“it is to see”)
tous les parents d’élèves C1 (“all the parents”)
du privé C2 (“of the private”)
comme du public C1 (“and the public school system”)
tous les élèves C1 (“all students”)
approuver Cn (“approve”)
cette charte C0 (“this charter”)
Figure 7.113 Identification of prosodic events.
Figure 7.114 Classification of prosodic events.
Figure 7.115 Retrieving the prosodic structure.
This particular example in French is canonical, in the sense that prosodic
events are instantiated by standard melodic contours conforming to commonly
found phonetic descriptions. Only the final C0 conclusive declarative contour
seems awkward on the melodic curve (see Fig. 7.115), due to overlap with
another speaker pronouncing the word alors before the speaker (CA) had
finished; careful examination of the harmonics displayed on a narrowband
spectrogram shows that the falling fundamental frequency of the contour is
realized as expected.
The storage-concatenation process implies the (hypothetical) existence of
“buffers” of short-term memory keeping temporary information. Let’s call
them C0, C1, C2, and Cn. These buffers keep temporary strings of syllables
terminated by specific prosodic events C0, C1, C2, and Cn. As with the strings
of syllables, there is a maximum number of items (words) that can be stored
(thus remembered by the listener), in the order of seven.
First, at the beginning of the storage-concatenation process, all memory
buffers are cleared. Then along the time axis appear successively strings of
syllables organized (phrased) into stress groups, successively ended with
prosodic events C1, C2, C1, C2, C1, C2, C1, C1, Cn, and C0. The listener
must identify each of these prosodic events as belonging to a specific class
of prosodic events. Only the final conclusive (and in this case declarative)
contour has to be realized according to a specific pattern in the language in
question in order to be always identifiable by the listener. The other prosodic
events may vary in their realizations, according to sociolinguistic parameters,
geographic origins, etc., but will be properly identified by the listeners after
a short period of adaptation if necessary.
Then as the prosodic events are identified, the sequences of syllables ended
with these prosodic events are stored in their appropriate buffers, as shown in
Table 7.5:
Table 7.5 Sequence of prosodic events triggering the storage-concatenation
process
Syllabic sequence | Event | Buffer C2 | Buffer C1 | Buffer C0
Surtout | C1 | - | Surtout | -
ce qui m’a fait quand même | C2 | ce qui m’a fait quand même | - | -
infiniment plaisir | C1 | - | Surtout + ce qui m’a fait quand même infiniment plaisir | -
c’est de voir | C2 | c’est de voir | - | -
tous les parents d’élèves | C1 | - | c’est de voir + tous les parents d’élèves | -
du privé | C2 | du privé | - | -
comme du public | C1 | - | du privé + comme du public | -
tous les élèves | C1 | - | tous les élèves | -
approuver | Cn | - | - | -
cette charte | C0 | - | - | Surtout ce qui m’a fait quand même infiniment plaisir c’est de voir tous les parents d’élèves du privé comme du public tous les élèves approuver cette charte
Sequence of events (see Table 7.5):
Surtout C1| “Especially” ended with prosodic event C1, stored in
buffer C1;
ce qui m’a fait Cn quand même C2| “what made me” ended with C2,
stored in buffer C2;
infiniment Cn plaisir C1| “extremely happy” ended with C1,
concatenated with what was stored in C2 (ce qui m’a fait quand même)
and stored in buffer C1, which now contains ce qui m’a fait quand
même infiniment plaisir. C2 buffer is cleared;
c’est de voir C2| “it is to see” ended with C2, stored in buffer C2;
tous les parents d’élèves C1| “all the parents” ended with C1,
concatenated with what was stored in C2 (c’est de voir) and stored in
buffer C1, which now contains c’est de voir tous les parents
d’élèves. C2 buffer is cleared;
du privé C2| “of the private” ended with C2, stored in buffer C2;
comme du public C1| “like the public school system” ended with C1,
concatenated with what was stored in C2 (du privé) and stored in
buffer C1, which now contains du privé comme du public. C2 buffer is
cleared;
tous les élèves C1| “all students” ended with prosodic event C1, stored
in buffer C1;
approuver Cn| “approve” ended with Cn, stored in buffer Cn;
cette charte C0| “this charter” ended with the terminal conclusive
prosodic event C0, concatenated with all remaining syllabic
sequences stored in buffer C1 to form the whole sentence. All
buffers are then cleared to process the next sentence.
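The sequence of events above can be simulated with a minimal Python sketch. It is offered as one possible reading of the process, not as a definitive implementation. The sketch assumes that an incoming contour empties every buffer of lower rank (using the French ranking Cn < C2 < C1 < C0 given in the text), concatenates their contents with the incoming stress group, and stores the result in its own buffer. Items already in a buffer of the same rank stay side by side as a list. The names RANK, EVENTS and storage_concatenation are hypothetical, and the sentence is returned instead of being cleared so that the result can be inspected.

```python
# French ranking of contours as given in the text: Cn < C2 < C1 < C0.
RANK = {"Cn": 0, "C2": 1, "C1": 2, "C0": 3}

# Stress groups of example (85), each with its final contour.
EVENTS = [
    ("Surtout", "C1"),
    ("ce qui m'a fait", "Cn"),
    ("quand même", "C2"),
    ("infiniment", "Cn"),
    ("plaisir", "C1"),
    ("c'est de voir", "C2"),
    ("tous les parents d'élèves", "C1"),
    ("du privé", "C2"),
    ("comme du public", "C1"),
    ("tous les élèves", "C1"),
    ("approuver", "Cn"),
    ("cette charte", "C0"),
]


def storage_concatenation(events):
    """Simulate the buffers; return the final sentence and a trace of states."""
    buffers = {name: [] for name in RANK}
    trace = []
    for text, contour in events:
        # Lower-ranked buffers, highest rank first: among them, a higher-ranked
        # buffer holds older material, so this order preserves the time line.
        lower = sorted((c for c in RANK if RANK[c] < RANK[contour]),
                       key=RANK.get, reverse=True)
        parts = []
        for name in lower:
            parts.extend(buffers[name])
            buffers[name] = []  # a merged buffer is cleared
        parts.append(text)
        buffers[contour].append(" ".join(parts))
        trace.append({name: list(content) for name, content in buffers.items()})
    return " ".join(buffers["C0"]), trace


sentence, trace = storage_concatenation(EVENTS)
print(trace[4]["C1"])  # buffer C1 after "plaisir C1"
print(sentence)
```

After the fifth event, buffer C1 holds Surtout and ce qui m'a fait quand même infiniment plaisir as two separate items, which corresponds to the third row of Table 7.5. The final C0 event returns the whole sentence.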
In all concatenation stages, the resulting buffer cannot contain more than a
limited number of stress groups. In the final memorization process, one may
imagine that the speaker will remember only the key words: in the above
example, surtout, plaisir, voir, élèves, privé, public, charte, i.e. a limited
number of words (only seven here) that end stress groups with their final
syllable stressed.
Conclusion
By highlighting the dynamic time aspect of the prosodic structure and of the
encoding and decoding process performed by the speaker and the listener, an
unexpected coherence in the mechanism using a limited number of melodic
contours emerged for all the Romance languages considered in this chapter.
The comparison between prosodic realizations of similar sentences reveals the
similarities in the processing of the prosodic structure by both the speaker and
the listener.
In summary, the phonological contours on stressed vowels are:
C0 Terminal (declarative or interrogative);
Cc Complex contour, slightly falling on the stressed syllable and
rising on the final syllable (stressed or not), absent in French;
C1 Rising (above the glissando threshold);
C2 Falling (above the glissando threshold);
Cn Neutralized (below the glissando threshold).
The essential differences between French and the other Romance languages
pertain to:
1. The existence of a complex contour Cc absent in French, which does not
have lexical stress and therefore cannot have a contour positioned on two
distinct syllables;
2. The ranking of prosodic contours in French, Cn < C2 < C1 < C0, and an
inverted ordering C1 < C2 for the other Romance languages: Cn < C1 < C2 < Cc < C0.
Despite these differences, the prosodic grammar operates the same way. By
comparing two successive melodic contours, say Cx and Cy, relative to their
ranking, the listener decides whether or not to assemble the prosodic words involved.
1. If Cx < Cy, the prosodic words attached to Cx and Cy are merged;
2. If Cx = Cy, the prosodic words attached to Cx and Cy are part of a list, to be
terminated by the occurrence of a contour of higher rank;
3. If Cx > Cy, the prosodic word attached to Cy is not merged with the one
attached to Cx.
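The three comparison rules can be illustrated with a short Python sketch. The two rankings are those stated above; the names FRENCH, OTHER_ROMANCE, and relation, as well as the string labels returned, are hypothetical and chosen only for this illustration.

```python
# Contour rankings from lowest to highest, as stated in the text.
FRENCH = ["Cn", "C2", "C1", "C0"]
OTHER_ROMANCE = ["Cn", "C1", "C2", "Cc", "C0"]

def relation(cx, cy, ranking):
    """Compare two successive contours Cx and Cy under a given ranking."""
    rx, ry = ranking.index(cx), ranking.index(cy)
    if rx < ry:
        return "merge"      # rule 1: the prosodic words are merged
    if rx == ry:
        return "list"       # rule 2: the prosodic words belong to a list
    return "no merge"       # rule 3: the word attached to Cy is kept apart

# The same pair of contours is treated differently in the two rankings.
print(relation("C2", "C1", FRENCH))
print(relation("C2", "C1", OTHER_ROMANCE))
```

Because C1 and C2 are inverted in the two rankings, a falling contour followed by a rising one leads to a merge in French but not in the other Romance languages, which is the point made in difference 2 above.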
8 Macrosyntax
For a long time, the mere study of spontaneous speech was not considered
worthy of scientific investigation, as this kind of speech style was regarded as
being full of mistakes and could reflect the bad usage of “uneducated people.”
Although this view is becoming exceptional, many linguists still insist on
considering written text, which is by definition created by “educated” people,
as the sole legitimate production of language for scientific description. Indeed,
syntactic theories often use linguists’ intuition to validate analyzed examples,
and linguists’ intuition naturally follows the rules of written language.
Pioneering work dates back as early as the sixteenth century
(Meigret, 1550), and linguists like Ferdinand Brunot were interested in regional
popular productions as early as the 1910s. A real development of research on spoken
French appeared after the 1968 student uprising in France, with the work of
GARS (Groupe Aixois de Recherche en Syntaxe), initiated by Claire
Blanche-Benveniste (among others). At this time one of the reference titles on the
linguistic analysis (essentially syntactic) of spoken French appeared: the
periodical edited by GARS, appropriately entitled Recherches sur le français parlé
(1977–2004).
Interestingly, one of the key factors in encouraging research on sponta-
neous speech came from developers of computer applications in speech
recognition. Indeed, while some efficient current techniques (for example Siri
for Apple, or Dictanote for Google) rely on the speaker’s context and situation
to recognize single words or short sentences, taking advantage of the
extra-linguistic information they may have on individual speakers, the
automatic recognition of natural, unprepared speech is hampered in existing
systems by the use of a grammar based on written text properties. These
systems give reasonably good results on read text (without the “mistakes”
found in spontaneous speech).
The use of a grammar in the software boosts the recognition rate from about
70 to 75% when no grammar is used to about 90 to 95% with an embedded
syntactic description of the language in question, allowing locally missing
phonetic information to be supplemented by the redundancy present in
written texts.
Keen to reuse their existing algorithms, many developers in computer speech
recognition proceeded by first removing the dysfluencies (“scories” in French)
from the raw speech signal. Indeed, dysfluencies are typical of spontaneous
production. They involve unjustified and unnecessary (from the point of view
of written text purists) repetitions, reformulations, aborts, hesitations,
ponctuants (“well, but, so, . . . ”), etc. The removal of these elements is not
especially easy, and although the recognition rate may improve noticeably, it
generally remains below the scores obtained from read speech. Therefore, the need
to develop grammars describing syntactic specificities of spontaneous speech
appeared, and the concept of a new kind of syntax, called macrosyntax,
emerged.
Despite appearances, spontaneous speech is not easy to define. Indeed, it
cannot just be opposed to read speech, as some speakers can produce natural
non-read speech exactly the same way they would read a text, at least from the
point of view of syntax. Of course, preparation makes the difference, and
professional speakers such as politicians or teachers, not to mention actors,
can simply reproduce already known large segments of discourse the same
way they would when reading a text. I will thus oppose read speech to
unprepared spontaneous speech, keeping in mind that the difference is not
easy or obvious to establish for recordings whose mode of speech production
is uncontrolled or unknown.
A first approach
One of the key elements in macrosyntactic analysis pertains to the definition of
the sentence itself. With the Written Language Bias at work (Linell, 2005), the
sentence is simply defined by an uppercase character at its left boundary, and a
dot (or a question mark) at its end. Of course, this definition cannot totally apply
to oral production, unless we prove that the conclusive declarative and inter-
rogative contours transcribed with an orthodox dot act as reliable boundaries by
their relatively stable features. Likewise, commas in the text represent prosodic
breaks, and no further investigation on this question was felt necessary in most
grammatical studies.
One early approach to macrosyntax is linked to the so-called “left disloca-
tion” and “right dislocation” in sentence configuration. Dislocation in a sen-
tence occurs, in a traditional syntactic description, when a constituent is placed
outside clause boundaries, whereas it could otherwise (in its basic grammatical
form) be an argument or an adjunct placed inside the clause. In English,
examples could be “Romeo and Juliet, they met on the balcony” for a left
dislocation, and “They met on the balcony, Romeo and Juliet” for a right
dislocation. Both of these written examples include a comma, supposedly
transcribing some expected prosodic event, possibly including a pause.
Left dislocation is often described as a process to emphasize elements of the
sentence, or introduce a topic. Some older analyses use the terms “theme” and
“rheme,” the theme being the dislocated element, which introduces the subject of
what is said in the rheme, corresponding to the main clause. When the theme
follows the rheme, there may be a difference depending – for classical syntax –
on the novelty of the information brought by the theme. If this information is
new, the theme is called complément différé in French or epexegesis (Bally,
1944). It then ends with a falling conclusive contour, similar to the one ending
the rheme. If, on the contrary, the information is already present in the context
(what has been said or written before) or in the situation (the environment of
the speech act), the theme placed after the rheme carries a flat melodic contour
and is therefore associated with a Prosodic Postfix (see below), clearly distinct
from a conclusive declarative contour. This flat melodic characteristic may
receive an iconic interpretation, with a reduced melodic movement for seg-
ments containing information already known (at least known by the speaker,
and signaled as such to the listener); cf. Chafe (1976). In the interrogative case,
the theme’s final contour is rising, as a copy of the rheme’s final contour
(Martin, 2009).
Although most linguists would not consider complément différé constructions
as belonging to written language, Bally (1944) proposed using two
successive dots to indicate in writing the successive conclusive contours, the
first one being followed by a lowercase letter to distinguish it from two separate
sentences: je le trouve très intéressant. moi. ce film “I find it very interesting.
me. this movie.” This is opposed to je le trouve très intéressant, moi, ce film
“I find it very interesting, as for me, this movie,” where the choice of dots versus
commas reflects the difference in final prosodic contours, conclusive falling
versus flat.
As a final introductory remark, I would like to add that the development of
better and easier-to-use recording equipment made a much larger quantity of
spontaneous speech data available for analysis, leading to a revision of some
traditional views and the emergence of macrosyntactic analysis (see Chapter 3).
As early as 2000, relatively large spontaneous speech corpora in Romance
languages started to be compiled (e.g. C-ORAL-ROM, 2001, 2005;
C-ORAL-BRASIL; Mittmann et al., 2009; Raso & Mello, 2012). Unfortunately, users of
this equipment were not necessarily well trained technically, sometimes
resulting in recordings of bad to very bad acoustic quality (essentially due to
ill-placed microphones relative to the recorded speakers).
The need for reliable speech fundamental frequency analysis for poor quality
recordings led me to improve the software program WinPitch that I had started
to develop in 1996. Many challenges pertained to the low signal-to-noise ratio,
the frequent presence of significant echo in the recordings, the use of
inappropriate microphones cutting the low frequencies of the speech signal, etc.
The answer took the form of manually selected multi-analysis methods, allow-
ing the recovery of a reliable melodic curve in adverse recording conditions.
The importance of this kind of spontaneous speech recording is enormous, as
we should not underestimate the space of prosodic possibilities in spontaneous
speech. Whereas most researchers in the domain work under laboratory
phonology conditions, analyzing very short constructed sentences, a large number
of unexpected cases produced by speakers in real life made necessary some
radical changes in the theoretical vision, leading among other things to the
consideration of many variants in prosodic realization not expected earlier.
Three current models for macrosyntax

One of the important consequences of the Written Language Bias is that it leads
us to consider spoken languages as a simple mise en oeuvre of the speaker’s
linguistic competence instantiated in written texts. It would not, then, be
surprising to observe in this kind of linguistic production errors and
approximations similar to the ones observed by phoneticians in the variant
realizations of phonemes and syllables. However, there were early exceptions to
this view, such as that of Henri Frei in his Grammaire des fautes published in 1928. Despite
its title, this essay aims to better relate some observed usage of the language,
usage rejected by the dominant cultural power, institutionalized in France by the
Académie française and its current incarnations in schools, newspapers, and
books. Successful books that apparently cover a wide range of possible
occurrences, like Le bon usage by Maurice Grevisse (1936–2011), simply identify all or
most possible uses of grammatical and semantic constructions, but rely only on
written material produced by socially or culturally dominant, accepted authors in
newspapers, magazines, and novels.
For French, as mentioned earlier, one of the most important efforts to
describe spontaneous speech production has been conducted since the 1970s
by the GARS, led by Blanche-Benveniste in Aix-en-Provence (1990, 2000,
2005). Two factors may explain the advent of such an endeavor: (1) the
propagation of new ideas (mainly directed against the establishment, including in
linguistics) after the surge of student protest in 1968 in France and elsewhere;
and (2) the appearance of low-cost sound-recording equipment (cassette tapes
essentially, and sometimes more costly equipment such as the Nagra sound
recorder used by media professionals). Despite the veil of indifference or
hostility (how can a serious linguist be interested in the way more or less
educated people speak? This reproach had already been directed at Ferdinand
Brunot for his campaign of recordings in French villages in the 1910s), the
project grew steadily over time, and today a large part of research
in France and elsewhere is devoted to such investigation, in the framework of
what is now called macrosyntax.
At least three definitions pertain to the term macrosyntax (cf. Avanzi, 2012):
1. C. Blanche-Benveniste (1980): With her associates in GARS in Aix-en-Provence,
Claire Blanche-Benveniste held that two different types of syntactic
dependencies must be considered to describe the properties of spontaneous
speech: (a) the morphosyntactic combinations belonging to the domain of the verb
and its dependents (grammar of categories) and (b) the macrosyntactic
dependencies accounting for the oral and written (long) productions.
2. A. Berrendonner (1990, 2003): For Alain Berrendonner and Marie-José
Béguelin (Groupe de Fribourg), macrosyntax pertains to the succession of
communicative acts, i.e. to the sequences of sentences with their contextual
and praxeologic aspects. This kind of macrosyntax concerns units larger than
sentences analyzed by GARS, and describes the relations (syntactic, semantic,
pragmatic) between sentences in the discourse.
3. E. Cresti – M. Moneglia (Cresti, 2000): For Emanuela Cresti and
Massimo Moneglia, head of the research laboratory LABLITA in Florence,
macrosyntax refers to the pragmatic aspects of sequences of speech acts (cf.
Austin, 1962). The sentence is viewed as an informational and pragmatic unit
of speech analyzed into connected segments of pragmatic and information
value. Their analysis of the sentence uses all linguistic objects encoding the
informational structure, whether syntactic, semantic, prosodic, or pragmatic.
It is possible to find some bridges between these three approaches, all
essentially devoted to the linguistic analysis of spontaneous speech. For
instance, GARS analyzes the sentence into macrosegments: an optional
Prenucleus, a Nucleus, and an optional Postnucleus (originally also called
Prefix, Nucleus, and Postfix). To give a quick definition, the Nucleus is a
segment that can be extracted from the sentence and constitutes a well-formed
autonomous sentence by itself. For Cresti–Moneglia, the sentence is analyzed
into an optional Topic, a Comment, and an optional Appendix. However, the
criteria of analysis are quite different, GARS using relations of (or rather
the lack of) syntactic dependency to segment the sentence into Prenucleus,
Nucleus, and Postnucleus, whereas the communicative approach of Cresti–
Moneglia utilizes semantic, syntactic, and prosodic criteria to identify Topic,
Comment, and Appendix.
All these approaches aim to describe what is observed in spontaneous
speech: a domain lying outside the rection (relations of dependency) between
grammatical categories (GARS), where specific non-dependency relations,
appositions, and detachments cannot be explained with grammar (Fribourg), or where
some configurations are impossible to explain with a simple constituent
grammar (Lablita).
Interestingly, at first only the Lablita group was interested in the intonational
aspects of the analysis, whereas these remained marginal for the GARS team. The
intonational aspects of what was then called the theme–rheme division of the
sentence, where the theme pertains to information belonging to the context or
the situation of the speech act, had already been described by Bally (1944) among
others and largely documented in numerous papers by phoneticians (even in Martin, 1975). In
these analyses, the theme segment carries a somewhat flat melodic curve,
whereas the rheme ends with a conclusive contour (in declarative mode),
corresponding to Appendix and Comment for Lablita.
Likewise, Topic in the Lablita approach does resemble Prenucleus in the
GARS sense, and the two may sometimes be equivalent under specific conditions.
The difference relates to the way these units are defined, i.e. on
semantic, prosodic, or syntactic criteria. In both cases, a sentence can
contain more than one Topic or Prenucleus, but Topics are typically ended with
a specific list melodic contour (actually a C1 or Cc for Romance languages
other than French, as defined in Chapter 7). On the other hand, Prenuclei are
strictly defined by the lack of dependency relation toward macrosegments
that precede or follow (and are therefore not necessarily ended with a specific
melodic contour, although the presence of these prosodic markers is
frequent).
In the GARS macroanalysis, more than one Nucleus can be present on the
text side of a sentence, but only one ends with a conclusive
prosodic marker. For Lablita, there is only one Comment and possibly more
than one Topic and more than one Appendix. However, confusion may arise
when the text and the sentence intonation are considered simultaneously to
define a unique Nucleus (see below).
There is also a difference in the treatment of Parentheses. For Lablita, an
Appendix can occur between a Topic and a Comment, and corresponds to a text
parenthesis integrated in the overall sentence prosodic structure for GARS. A
GARS text parenthesis aligned with a prosodic parenthesis is equivalent to a
Lablita definition of the parenthesis.
In Lablita the intonation allows one to determine the sentence segments in
discourse, as every native speaker is assumed to be capable of identifying
important prosodic breaks, as well as the types of speech acts (type of illocu-
tion, declaration, interrogation, etc., with many possible variants – cf. Cresti et
al., 2002) as they are correlated with specific intonative contours (actually
terminal conclusive contours essentially).
Given that the study of the prosodic aspects started to
develop only recently (Debaisieux et al., 2008), even today most syntacticians
and macrosyntacticians alike tend to consider sentence prosodic events as an
accessory to (macro)syntax, possibly looking at intonation “breaks” only if
the syntax “fails,” i.e. when no syntactic marker indicates a macrosegment
boundary, for example (cf. Lacheret, 2003). I will show below that sentence
intonation deserves a macroanalysis per se, in order to better understand the
apparent coexistence of two independent domains.
The theory of la lingua in atto

The problem with the assumptions underlying the Lablita approach is that the
boundaries of the illocutionary speech acts may not be indicated by a prosodic
break. To rely on prosodic events to identify segments that allow a pragmatic
interpretation supposes that prosody and text function in parallel and are
congruent with each other. This is obviously not the case, as many examples
observed in spontaneous speech data suggest.
The basic units in la lingua in atto are the Comment, the Topic, and the
Appendix. Like the Nucleus, the Comment can appear alone and form a
complete informational unit. Cresti’s prototypical examples are Carlo va a
Roma “Carlo goes to Rome,” declarative, and lavora tutti i giorni? “she works
every day?”, interrogative (see also Scarano, 2003: 41).
The Topic always precedes the Comment and defines the field of application
of the illocutionary act. In a sense the Topic corresponds to the GARS
Prenucleus, but its definition is purely semantic, i.e. neither syntactic nor
prosodic. In the examples Il caffè lo voglio bello forte [oral < Cresti] “the
coffee I want it strong,” and Da domani dieta [oral < Cresti] “Starting tomorrow
diet,” il caffè and da domani are the Topics and lo voglio bello forte and dieta are
the Comments.
Although semantic criteria were originally presented as essential for
segmentation, it may happen that in practice this analysis is rather
performed with syntactic and prosodic criteria, the semantic properties being a
consequence rather than the cause of the segmentation.
The Appendix completes semantically what was included in the Comment
and possibly in the Topic. It corresponds closely to the theme in the
theme-rheme or theme-propos of Bally (1944). Typical examples are è mica sposata
l’Ornella [oral < Cresti], with l’Ornella being the Appendix, “She is not at all
married, Ornella,” and ce l’ho io la ricetta [oral < Cresti], with la ricetta being the
Appendix, “It’s me who has it, the recipe.”
In the Lablita transcription (and in particular in C-ORAL-ROM, 2005), the
boundary tones of Topics, Comments, and Appendixes are transcribed as prosodic
breaks, using only two classes: nonterminal / and terminal //.
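As an illustration of this two-class notation, the following Python sketch splits a transcription into utterances (closed by //) and their internal units (separated by /). It is a simplified assumption about the notation, not a parser for the actual C-ORAL-ROM file format, which contains further annotation. The function name split_breaks is hypothetical, and the placement of the breaks in the sample string merely follows the Topic and Comment division of Cresti's examples quoted above.

```python
def split_breaks(transcript):
    """Split a transcript on terminal (//) and nonterminal (/) breaks.

    Returns a list of utterances, each a list of prosodic units.
    """
    utterances = []
    # Terminal breaks are handled first, so that "//" is never read as two "/".
    for chunk in transcript.split("//"):
        units = [unit.strip() for unit in chunk.split("/") if unit.strip()]
        if units:
            utterances.append(units)
    return utterances

# Illustrative break placement (assumed, not copied from the corpus).
sample = "il caffè / lo voglio bello forte // da domani / dieta //"
for units in split_breaks(sample):
    print(units)
```

In this reading, each inner list corresponds to one utterance, and its items to candidate Topic, Comment, or Appendix units that remain to be labeled.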
In summary, the results of the Lablita macroanalysis of sentences appear to be
quite similar to those of the GARS approach, but they are obtained using
semantic rather than syntactic properties. The equivalence could be summarized
as in Table 8.1.
Table 8.1 GARS and Lablita equivalence
GARS       Prenucleus   Nucleus   Postnucleus
LABLITA    Topic        Comment   Appendix
Text macrosyntax and prosodic macrosyntax

One of the key points characterizing the approach presented here and used in
this book relates to the relationships between syntax and macrosyntax (again as
defined by GARS) on the one hand, and intonation on the other. In this view the
hierarchical organizations of text and intonation are considered to be totally
separate a priori. Both text and intonation have a macrosyntactic organization,
but these organizations, resulting in specific phrasing and grouping of their
respective units, are elaborated independently (and probably sequentially) with
specific properties and constraints. Furthermore, their use, i.e. the process of
decoding the information by the listener, is different.
Therefore, the type of macrosyntax used in this chapter is derived from the
GARS work, with the following important differences:
1. Intonation and prosodic words are not part of (text) macrosegments. They
are not part of morphological or syntactic properties and are a priori totally
independent from the text organization into macrosegments. One of the
reasons for this stems from the fact that the prosodic organization preexists
macrosegments at least for large chunks, produced in sequences by the
speaker to form the whole utterance (see discussion in Chapter 5).
2. The a priori independence of the prosodic and the syntactic structures (and
of all the other structures that may organize the utterance) leads to a
definition of a macro-organization of sentence prosody. As all prosodic markers
instantiated by melodic contours located on stressed vowels in prosodic words
and on boundary tones indicate a dependency relation to the right (i.e. to a
melodic contour occurring later), a dependency relation always exists between
prosodic words, and therefore Prosodic Prefixes similar to text Prenuclei, which
have no syntactic dependency relation with the Nucleus, cannot exist. On the
contrary, all prosodic words and prosodic groups have a direct or indirect
dependency relation toward the final conclusive contour, declarative or inter-
rogative, or its variants. The macrosyntactic analysis of the prosodic line leads
to the existence of one Prosodic Nucleus and no prosodic Prefixes.
3. After the conclusive contour ending the Prosodic Nucleus, prosodic seg-
ments characterized by limited melodic variations and a flat ending contour
can occur. As this type of occurrence can only occur after a Prosodic
Nucleus, the dependency relation acts to the left, i.e. toward the Prosodic
Nucleus that precedes. This prosodic macrosegment is called a Postfix.
4. To avoid confusion in the terminology between text and prosodic macro-
segments, I will use Prenucleus, Nucleus, Parenthesis, and Postnucleus to
denote text macrosegments, and Prosodic Nucleus, Postfix, Infix, and Suffix
for prosodic macrosegments.
5. Following the same principles as those applied to the analysis of text into
macrosegments, the prosodic line can be analyzed into prosodic
macrosegments, i.e. complete well-formed segments of prosody, terminated
by a conclusive declarative or interrogative contour. The prosodic segment
that can form a complete prosodic structure and that has its final terminal
contour associated with the final stress group of the Nucleus is the Prosodic
Nucleus. An independent well-formed prosodic structure embedded inside
the Prosodic Nucleus constitutes another prosodic structure called Prosodic
Parenthesis or Infix. The prosodic phrases occurring after the Prosodic
Nucleus are either Postfixes, if their melodic contours are restrained or
flat, or otherwise Suffixes if ended with a terminal conclusive contour.
Suffixes form another prosodic structure with no dependency relation with
any preceding Prosodic Nucleus. On the other hand, an Infix (Prosodic
Parenthesis) is not a special kind of prosodic structure (cf. for early studies
Nemni, 1980; also Delais-Roussarie, 2009; Gachet & Avanzi, 2008;
Debaisieux & Martin, 2010). This terminology recalls the original terms
used for similar functions assigned to text macrosegments in the GARS
model.
6. Suffixes are well-formed prosodic structures placed after the Prosodic
Nucleus. The only characteristic that differentiates them from a sequence
of two independent prosodic structures associated with two successive
utterances pertains to the syntactic dependency relation that must exist
between the text segments associated with them and the text segment
associated with the Prosodic Nucleus (Avanzi & Martin, 2007). Therefore,
a Suffix is not a special kind of prosodic structure.
7. Postfixes, like Suffixes, are well-formed prosodic structures placed after the
Prosodic Nucleus. The difference with Suffixes lies in the reduced melodic
span of their melodic contours, whether the associated text macrosegment is
identified as a Postnucleus or not. They correspond to the theme in a rheme–
theme configuration.
8. Infixes and Suffixes are convenient designations for independent prosodic
structures inserted either anywhere in the sentence between prosodic words
or after a Prosodic Nucleus.
9. The difference with the standard GARS approach lies in the fact that
Suffixes and Infixes are byproducts of a more basic process which involves
only Prosodic Nuclei and Postfixes.
In summary, the macrosyntactic analysis of the sentence leads to two a priori
independent lines, a text and an intonation line (Fig. 8.1).
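The labeling of the intonation line described in points 3 and 5 above can be sketched as follows. This Python fragment is a deliberately reduced illustration, not the author's procedure: the function name label_macrosegments and the three ending labels ("conclusive", "flat", and any other value for a non-final contour) are assumptions made here. Infixes are not handled, and the syntactic dependency that distinguishes a Suffix from a new utterance (point 6) is not checked. The contour assigned to the first le métro in the example is likewise assumed.

```python
def label_macrosegments(phrases):
    """Label prosodic macrosegments from (text, ending) pairs.

    ending is "conclusive" (terminal contour), "flat" (reduced melodic span),
    or any other value for a non-final contour.
    """
    labelled, pending, nucleus_seen = [], [], False
    for text, ending in phrases:
        pending.append(text)
        if ending == "conclusive":
            # The first conclusive contour closes the Prosodic Nucleus;
            # later ones close Suffixes.
            label = "Suffix" if nucleus_seen else "Prosodic Nucleus"
            nucleus_seen = True
        elif ending == "flat" and nucleus_seen:
            label = "Postfix"   # flat contour after the Prosodic Nucleus
        else:
            continue            # non-final contour: keep accumulating
        labelled.append((label, " ".join(pending)))
        pending = []
    return labelled

example = [
    ("le métro", "rising"),              # assumed non-final contour
    ("c'est sous terre", "conclusive"),
    ("le métro", "flat"),
]
for label, text in label_macrosegments(example):
    print(label, "|", text)
```

The design choice of scanning left to right reflects the claim that a Postfix can only occur after a Prosodic Nucleus, so a flat ending met before any conclusive contour is simply treated as non-final.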
Merging text and intonation

Figure 8.1 Macrosyntactic analysis of text and intonation: le métro, c’est sous terre, le métro “the subway it’s underground the subway” (Raymond Queneau, Zazie dans le métro, Gallimard, Paris, 1959).
Text: Prenucleus – Nucleus – Postnucleus, Parenthesis; le métro (Prenucleus) − c’est sous terre (Nucleus) − le métro (Postnucleus).
Intonation: Prosodic Nucleus – Postfix – Suffix, Infix; le métro ( ) c’est sous terre le métro: Infix, Prosodic Nucleus − Postfix.

There are cases where more than one text Nucleus does exist. In je me levais le
matin j’étais avec des clients je mangeais à midi j’étais avec des clients je
dormais le soir j’étais avec des clients “I woke up in the morning I was with
customers I ate at noon I was with customers I slept in the evening I was with
customers” (corpus Olive), three identical text Nuclei are present in the same
sentence, as revealed by its grid representation:
je me levais le matin j’étais avec des clients.
je mangeais à midi j’étais avec des clients.
je dormais le soir j’étais avec des clients.
Only one of these nuclei has its final clients aligned with the final prosodic word
containing the conclusive contour. From the point of view of the GARS
tradition, all three occurrences are Nuclei.
Should the macrosyntactic model then consider only the last occurrence of
j’étais avec des clients as the sentence Nucleus?
The same problem occurs with parentheses. A text Nucleus can incorporate
one or more parentheses, which are aligned either with prosodic groups or with
an independent Prosodic Nucleus, an Infix (see discussion below). Likewise,
independent Prosodic Nuclei can be associated with text macrosegments which
do not present syntactic characteristics of parentheses.
Should the model differentiate between these three categories of parentheses?
It seems reasonable to give more semantic weight to macrosegments of
text and intonation that match, i.e. which are aligned with each other. For
example, when a text parenthesis is aligned with a Prosodic Nucleus, it
appears at least intuitively that the parenthetical effect is stronger than if the
text parenthesis is integrated in the sentence Nucleus, i.e. in the sentence
prosodic structure.
An example of parenthesis (underlined) prosodically integrated and ending
with a C1 rising contour:
[notre métier c’est pour ça que il y a plus de jeune(s) qui veut venir sur notre métier
C1] [c’est trop dur C0][crfp]
“Our job that’s why there aren’t more young(s) who want(s) to come on our
business is too hard.”
The next example shows a parenthesis associated with an independent
prosodic structure, and thus terminated by a conclusive contour C0:
[et alors à Madras pire hein – quand je suis arrivée donc je je – j’avais décidé
de visiter mais en partant par le sud peut-être par Auroville [bon je me
rappelle pas bien C0][et en arrivant à Madras bon ça a été vraiment bon C0]
[crfp]
“and then in Madras worse eh – when I arrived so I I – I decided to visit but starting
from the south perhaps through Auroville well I do not remember very well and
arriving at Madras well it was really good.”
Likewise, if the sentence contains more than one text Nucleus, only the last
one, whose final stress group is aligned with the conclusive melodic
contour, will be perceived by the listener as the effective Nucleus, especially
since it is the last heard. Further research should be conducted to reach some
conclusion on this point.
Dysfluencies
Written transcription of spontaneous (i.e. non-prepared) speech almost
always reveals the presence of dysfluencies, elements that seem unnecessary
and that are systematically removed in “correct” transcriptions, as they are,
for example, in declarations of politicians in newspapers. These dysfluencies
appear in various forms: hesitations (instantiated by euh in French, for
example, or by a lengthening of the last vowel preceding the hesitation;
Italian um, Spanish um, Catalan um, Portuguese hum, Romanian um),
primers of stress groups followed by reformulations, repetitions, aborts of
the current stress group before its completion, etc. Although they may be
removed to obtain “correct” written text, these dysfluencies belong to the
macrosegments.
Some examples are extracted from the French spontaneous text analyzed
further below.
Hesitations (filler)
bon je reviens sur cette euh ce problème
“well I’ll be back on this uh this problem”
bon ben je la prends et euh et voilà quoi
“well I take it and uh well”
Repetitions
quand même y a des des sous sur le compte
“all the same there is some some money in the account”
c’est pas loin euh tu tu j’y vais à pied
“it’s not far uh you you I go there on foot”
Reprises / Reformulations
bon je reviens sur cette euh ce problème
“well I’ll be back on this uh this problem”
c’est pas loin euh tu tu j’y vais à pied
“it’s not far uh you you I go there on foot”
Aborts
elles marchent elles non non elles ont tendance à non non elles adorent marcher
“they walk they no no they tend to no no they love to walk”
(From Anita Musso, CFPP 2000)
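To make concrete why the removal of dysfluencies mentioned at the beginning of this chapter is not especially easy, here is a naive Python sketch that drops the filler euh and immediate word repetitions from an orthographic transcription. It is an illustration under stated assumptions, not a description of any existing recognition system: the names FILLERS and strip_dysfluencies are hypothetical, and the filter works on written words only, with no access to the speech signal.

```python
# Hypothetical filler inventory; only the French filler cited above is listed.
FILLERS = {"euh"}

def strip_dysfluencies(utterance):
    """Remove fillers and immediate word repetitions from a transcription."""
    cleaned = []
    for word in utterance.split():
        if word.lower() in FILLERS:
            continue  # hesitation filler
        if cleaned and cleaned[-1].lower() == word.lower():
            continue  # immediate repetition of the previous kept word
        cleaned.append(word)
    return " ".join(cleaned)

print(strip_dysfluencies("quand même y a des des sous sur le compte"))
print(strip_dysfluencies("bon je reviens sur cette euh ce problème"))
```

The second call shows the limit of such a filter: the reformulation cette ... ce survives, since detecting it requires grammatical information. Legitimate repetitions such as French nous nous would also be wrongly collapsed, and aborts are not detected at all.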
Ponctuants
Another typical characteristic of spontaneous speech is the presence of
ponctuants, which, unlike dysfluencies, may be left in a “cleaned”
written transcription. A typical (non-exhaustive) list in French is given below
(Morel & Danon-Boileau, 1998):
Tu vois, Hein, Quoi, Enfin, Pour, Alors, Ecoutez, Non mais allo quoi, Attends, Allo, En
tout cas, Ah la la, Ah, Non, Oui, Ouais, Disons, Je veux dire, Bon alors, Et puis,
Ecoutez . . .
“You see, Huh, What, Finally, For, So, Listen, No but hello what, Wait, Hello, In
any case, Oh la la, Oh, No, Yes, Yeah, Let’s say, I mean, Okay so, And then,
Listen . . . ”
As their name suggests, the ponctuants are small expressions usually placed
at the beginning or the end of macrosegments (and sometimes inside macro-
segments) to replace specific prosodic boundaries that would be transcribed by
a punctuation mark in a written transcription.
To be ponctuants, some of these verbal groups, interjections, etc. need to be
associated with a specific intonation. This is the case for tu vois, which ends
with a rising melodic contour.
A famous example comes from a young French reality TV personality
(Nabilla Benattia): non mais allo quoi “no but hello what,” a sentence
containing only ponctuants, pronounced as three stress groups [non mais
Cn] [allo C0] [quoi C0n]. In this example, each ponctuant suggests to the
listener an elliptic content not formulated but easily recoverable: non mais
introduces a contradictory remark pertaining to the context and/or the
situation, allo indicates the need to pay attention to what is going to follow,
as at the beginning of a telephone conversation, and quoi is frequently used as
a final concluding ponctuant. In the original example, allo is associated with
a final conclusive contour, and quoi with the flat terminal contour of a
Postfix. In other examples the ponctuant quoi may also carry a terminal
conclusive contour and therefore end a Prosodic Nucleus.
Ponctuants may be different in other Romance languages:

Italian: tipo “like,” ecco “there,” cioè “actually,” etc.
Spanish: tipo “like,” ecco “there,” cioè, “actually,” etc.
Portuguese: é, hum, então “so,” tipo “like,” bem “well,” etc.
Romanian: deci /detʃʲ/ (“therefore”) is common, especially in school,
and ă /ə/ is also very common (can be lengthened according to the
pause in speech, rendered in writing as ăăă), whereas păi /pəj/ is
widely used by almost anyone.
The prosodic eraser

While writing a text allows multiple changes, reformulations, aborts,
hesitations, etc. without leaving traces (when working on a computer
with a text editor), this is obviously not the case in spontaneous speech
production. When the sequence of words, stress groups, and syntagms is
not known in advance, i.e. when the speech is not prepared and known
by heart, there are still possibilities for the speaker to proceed with
correction that would not involve an impossible time reversal. However,
these processes would leave traces instantiated by the dysfluencies
described above, and also by macro prosodic constructions functioning
as additions.
Use of dysfluencies
Given that dysfluencies leave traces of mechanisms of discourse production,
it becomes possible to explain or at least propose explanations pertaining to
the formation of syntagms in speech (Blanche-Benveniste, 2003; Martin,
2012b). In written production, the hesitations, reformulations, and additions
characteristic of oral production can be crossed out or integrated in the
text (cf. the famous heavily crossed-out manuscripts of nineteenth-century
authors). Today, all these corrections may be totally removed by the use of
text editors, often automatically.
It would seem at first that these dysfluencies would hamper a proper and
easy comprehension of the speaker’s discourse, but this is far from being the
case. The Incremental Storage-Concatenation (ISC) model explains how the
speaker can use tools that prevent the listener’s memorization of stress
groups by abort or reformulation, or, at a higher level (in the prosodic
structure), insert an unplanned addition that is concatenated in the general
prosodic structure, even when the utterance has been signaled as finished
with a conclusive contour (case of the deferred complement –
epexegesis).
Deletions
The conversion of a sequence of syllables into a stress group is synchronized by
a lexical stress (in Romance languages), a group stress (French), or a combina-
tion of lexical and group stress. As mentioned earlier, the text aligned with the
prosodic word, i.e. the stress group, may contain not only a content word (verb,
noun, adjective, or adverb) but also grammatical words and even only gram-
matical words. Furthermore, in French, stress groups can contain more than one
content word. There are also cases of stress groups containing one single
syllable part of a multisyllabic word (in “em-pha-sis” mode).
This conversion of a stress group into a higher linguistic unit cannot take place if
the stressed syllable is not effectively realized in the sequence. In French, this
means that incomplete stress groups will not be processed and kept in the listener’s
memory for further treatment, and in particular for the concatenation to form the
complete prosodic structure. For the other Romance languages, conversion
triggering will occur, but the sequence will (normally) not be recognized as part of the
stress group known by the listener (although missing syllables after the stressed
one may eventually be completed, there won’t be enough time for this operation if
a reformulation occurs immediately). In both cases, the incomplete stress group is
not memorized, which is equivalent in the final result to an erasure from the listener’s
memory. This is thus also equivalent to crossing out a written text segment.
The reformulation of a segment may be accompanied with a morphological
adjustment as in the French example:
Alors la l’infirmière de temps en temps me m’humectait euh les lèvres (Vallier)
“then the the nurse from time to time would moisten euh my lips.”
Alors la → l’infirmière Morphological adjustment by elision of la before a vowel l’.
Je je j’ai eu les jambes qui ont tremblé (Selin)
“I I had legs that trembled.”
Je je → j’ai eu les jambes Morphological adjustment by elision of je before a
vowel j’.
In Blanche-Benveniste’s interpretation, these hesitations-reformulations
would indicate a separation in the lexical and syntactic processes: the syntactic
frame is set, planned, but the choice of the appropriate lexical unit is not
finalized. More importantly, even if there is no phonetic adjustment
by elision in the reformulation, it is the entire stress group which is pronounced,
and not just the missing element finally found by the speaker as, for example, in
Je je je vais le faire bientôt (CL 96) “I I I am going to do it soon.”
Et on les on les cultive comme ça (Choix) “and we them we cultivate them like that.”
Indeed, before the mechanism of storage-concatenation can take place, the
chunking of the sequence of syllables into stress groups must effectively take
place, and the corresponding text must match a sequence of words
known by the listener. If not, decoding is still feasible, but at the expense
of a longer cognitive treatment (N400 and P600, see Chapter 5), revealed
in particular by the presence of time-expensive mismatch negative or
positive brain wave oscillations. The negative N400 corresponds to an
unexpected negative spike in EEG recordings that occurs when a semantic error
exists in processed speech (Steinhauer et al., 1999), whereas the positive
P600 is linked to any extra syntactic processing by the listener (Wang
et al., 2012).
According to the class of prosodic events identified by the listener as coded
by the speaker, stress groups already stored in the listener’s memory are
concatenated at their appropriate level, i.e. with the last prosodic group
bearing the same class of identified prosodic event. This is the basis of the
ISC process. If the prosodic word is incomplete (in French, when the last
syllable is not yet pronounced; in the other Romance languages, when the
well-formed prosodic word is not yet completed), the storage-concatenation
process is suspended, waiting for a new, possibly reformulated, well-formed
sequence of syllables forming a stress group.
This mechanism is distinct from the insertion of a hesitation marker (like uh
in English, euh in French, etc.) or a lengthening of the last pronounced vowel
(or syllable) not finishing the planned stress group. In these cases, the stress
group production is simply interrupted temporarily, but resumes as a well-
formed stress group without erasing all stored syllables that were already
pronounced.
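The difference between an erased stress group and a merely interrupted one
can be made concrete with a small simulation. The sketch below is only an
illustration under simplifying assumptions that are not part of the model as
stated: tokens are whole words rather than syllables, the position of the
final stress is supplied as input, and the listener is assumed to detect a
restart as an explicit event. The names listen, WORD, STRESSED, HESITATION,
and RESTART are hypothetical, and the second input sequence is invented.

```python
from typing import List, Tuple

# Hypothetical event kinds (illustrative, not taken from the ISC model itself).
WORD = "word"              # unstressed material inside the current stress group
STRESSED = "stressed"      # word carrying the final stress that closes the group
HESITATION = "hesitation"  # filler such as "euh": pauses the group, erases nothing
RESTART = "restart"        # abort or reformulation before the stress is realized

Event = Tuple[str, str]


def listen(events: List[Event]) -> List[str]:
    """Return the stress groups that end up stored in the listener's memory."""
    memory: List[str] = []   # completed stress groups, available for concatenation
    pending: List[str] = []  # words heard since the last completed stress group
    for kind, text in events:
        if kind == WORD:
            pending.append(text)
        elif kind == STRESSED:
            # The stressed element is realized: the group is converted and stored.
            pending.append(text)
            memory.append(" ".join(pending))
            pending = []
        elif kind == HESITATION:
            # Production is suspended; what was already heard is kept.
            continue
        elif kind == RESTART:
            # The incomplete group never reaches memory: this is the "eraser".
            pending = []
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return memory


# "Je je je vais le faire bientôt": two aborted primers, then the full group.
# The placement of the stresses is an assumption made for the illustration.
repetition = [
    (WORD, "je"), (RESTART, ""),
    (WORD, "je"), (RESTART, ""),
    (WORD, "je"), (WORD, "vais"), (WORD, "le"), (STRESSED, "faire"),
    (STRESSED, "bientôt"),
]

# Invented sequence: a filler inside a group does not erase the pending words.
hesitation = [(WORD, "la"), (HESITATION, "euh"), (STRESSED, "porte")]

stored_after_repetition = listen(repetition)
stored_after_hesitation = listen(hesitation)
print(stored_after_repetition)
print(stored_after_hesitation)
```

In this toy version the two aborted occurrences of je leave no trace in
memory, whereas the filler euh leaves the pending words untouched, which
mirrors the contrast drawn above between erasure and temporary interruption.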
Additions
In both cases, once the prosodic contour characterizing the position of
the prosodic word in the sentence prosodic structure is pronounced by
the speaker, it enters the storage-concatenation process and cannot be
removed from the listener memory. However, a non-planned stress group
or even a large non-planned stress phrase can be inserted immediately
after by the speaker, provided it ends with the same class prosodic
contour. Therefore, this insertion can take place at any level in the
prosodic structure, and not only after the conclusive terminal contour.
These cases correspond to the epexegesis (deferred complement) and the
Suffix (in GARS macrosyntax) but also to additions made inside the
Prosodic Nucleus.
Epexegesis, from classical Greek epexēgēsis, is defined as the “addition
of words to clarify meaning.” In terms of macrosyntax, it is a process
allowing the speaker to add a sequence of words contained in one or more
stress groups by simply terminating the supplementary stress group by a
melodic contour belonging to the same class as the final contour where
these sequences should have been inserted (Godement & Martin, 2010).
This process allows the speaker to make a correction on the syntactic
structure planned during the utterance process, as in the example of
Figure 8.2.
Figure 8.2 Adding a corrective syntagm after the Nucleus: en Angleterre
dans le métro “stranded in London in the subway,” followed by a Suffix
associated with [qu’on devait prendre pour rejoindre notre ville d’accueil]
“that we had to take to reach our home city.” This construction in Nucleus +
Suffix is characterized by two independent prosodic structures, each ending
with a conclusive melodic contour.
Figure 8.3 Adding a syntactic segment to the Prenucleus mes parents
m’emmenaient “my parents would take me,” followed by [à l’école] “to
school.” The final syllables of the two sequences carry melodic contours
belonging to the same class C1.
This same mechanism of correction is possible before the Nucleus to add
a Prenucleus that should have been included in the preceding Prenucleus
(Fig. 8.3, CFPP, 2000).
Another example is given in Figure 8.4.
Figure 8.4 [Je confirme que le premier ministre Elio Di Rupo m’a parlé de
cette situation et euh de l’inquiétude des brasseurs C1] [belges C1] [par
rapport à ce qu’était la consommation française . . .] [FH]. By ending the
Prenucleus [belges] with the same contour as the one ending the preceding
Prenucleus, the speaker makes a correction relative to what may have been
planned: [. . . l’inquiétude des brasseurs belges] “I confirm that the Prime
Minister Elio Di Rupo told me about this situation and uh of the anxiety of
Belgian brewers.”
Text and prosodic macrosegments

The basic principle consists in opposing a microsyntax of rections (i.e. of
syntactic dependencies) to a macrosyntax beyond rection. This distinction
permits a proper and satisfactory account of many ad hoc appellations of
constituents in “classical” syntax, such as detached, in coordination or
subordination, linking, elliptic, unfinished, etc. Indeed, all these
constituents are described in macrosyntax as macrosegments, with essentially
no dependency relations, no relation of rection, between them. In the
view I have chosen here, the approach developed by GARS, maximal units of
macrosyntax, the macrosegments, are distinguished from discourse analysis
(Berrendonner), dealing with larger scopes than the sentence, or the paragraph
(Morel & Danon-Boileau, 1998) defined by a conclusive prosodic contour.
Actually, the macrosegments, maximal units whose composing elements
maintain relations of rection, can be instantiated by any grammatical cate-
gory: verb, adjective, noun, adverb, pronoun, interjection, etc. In the view of
Deulofeu (2003, 2006), these macrosegments “float” syntactically on the
sentence ocean (the actual image is of “îles flottantes,” floating islands, a
dessert where macrosegments would be pieces of meringue floating on a
prosodic crème anglaise). The crème anglaise corresponds, of course, to
sentence intonation, but maybe the comparison must end there, as the intona-
tion line on which float the macrosegments is not flat and quiet, but on the
contrary behaves like an ocean with (sometimes) furious waves. These waves
are embodied by prosodic events, and particularly melodic contours
placed on stressed syllables. Instead of being quiet and flat, and leaving the
floating macrosegments at the same level, resulting in a simple enumeration
of macrosegments in the sentence, the melodic contours will at times assem-
ble macrosegments to form a group of two or more macrosegments, and at
other times leave them in their paratactic configuration, as a simple
enumeration.
The configuration of macrosegments inside the overall sentence prosodic
structure depends on the size of the macrosegments involved. If the
macrosegment contains few syllables, it may be merged with another macrosegment,
as in moi mon papa (four syllables) with the last syllable stressed in moi mon
papa il est président. If it contains enough syllables to force the speaker to
realize a stressed syllable at the end of the macrosegment, the configuration
depends on the number of stress groups included in the macrosegment. If there
is only one stress group, there is no need for the speaker to specify a
prosodic event other than Cn (a neutralized contour). If the structure inside
the macrosegment is more complex, the usual system of melodic contours will be
realized by the speaker to encode a prosodic word that would set the floating
text macrosegment in the overall sentence prosodic structure.
Various prototypical configurations are given in the following examples:
Prenucleus + Nucleus
[Le lendemain C1] [grande surprise C0] “the next day big surprise”
The Prenucleus is integrated into the following statement by the prosodic
structure, and carries a final melodic contour whose slope is opposite to that
of the terminal descending contour of the utterance.
Nucleus + Postnucleus
[à la caisse C0] [ils se pèsent C0n] “at the till they are weighed”
The Nucleus ends with a descending melodic contour with a sharp variation,
while the Postnucleus has a falling contour with a lower melodic variation
(symbol C0n). The two contours are necessary to signal this structure, which
contrasts with the Nucleus + Suffix configuration shown in the following
example.
Nucleus + Suffix
[j’achète beaucoup de médicaments C0] [qui ne sont pas remboursés C0]
“I buy a lot of drugs which are not reimbursed.”
The sentence presents two independent prosodic structures; the cohesion of
the two to form the whole sentence is provided by a syntactic relationship
implemented by the relative pronoun qui.
Parenthesis
[tout le monde faisait C1] [j’en ai fait moi-même C0] [de l’aviron C0]
“everyone did rowing I did some myself.”
The two parts of the Nucleus, tout le monde faisait and de l’aviron, are
separated by the parenthesis j’en ai fait moi-même, which is aligned on an
independent prosodic structure, ended with a terminal conclusive contour C0.
Autonomous sentences
Sentences can be autonomous if (a) they carry a terminal conclusive contour
and (b) they refer to some information contained either in the sentence or in
the context and/or the situation of the speech act (Martin, 2015).
There are statements where the text does not seem to form a communicatively
independent Nucleus. The statement parce que nous le valons bien “because we
deserve it” (L’Oréal) is introduced by a subordinating conjunction and thus
appears as a Prenucleus (or possibly Postnucleus), but the sentence ends with a
conclusive melodic contour. As the text refers to a cultural context (“we deserve
to buy expensive cosmetics”), the sentence is therefore autonomous.
Statements that end not with a conclusive contour but with an implicative
contour, generally carrying a rising-falling melodic movement, also exist.
This contour directs the interpretation toward the context and/or situation
of the speech act. It is also possible to find statements ending with a
continuation majeure, as in si tu crois que ça m’ennuie “if you think it
bothers me,” which makes the statement possibly autonomous. Other similar
examples call for a follow-up by other participants in the speech act, as in
et donc . . . “and then,” inviting the speaker to go on.
Finally, the Nucleus can end with a ponctuant carrying a conclusive
melodic contour. In French, the example non mais allo quoi “no but hello
what” mentioned earlier that ends with a conclusive contour is therefore
autonomous.
Examples of macrosyntactic analysis

Macrosyntactic analysis of a speech recording follows the steps below:
a. Identify the dysfluencies.
b. Identify the ponctuants.
c. Segment the text in macrosegments.
d. Identify the Nuclei, whose right boundary is aligned with a conclusive
contour C0.
e. Label prosodic events (into Cn, C2, C1, Cc, C0) and identify Prosodic
Nuclei, Postfixes, Infixes, and Suffixes.
f. Label macrosegments (Prenuclei, Nucleus – ended with C0 – Postnuclei).
g. Identify alignments between macrosegment boundaries and prosodic
events.
h. Sort cases of congruence and non-congruence between text macrosegments
and prosodic macrosegments.
i. Identify possible cases of eurhythmy in occurrences of non-congruence.
After sorting the dysfluencies, the next step determines the text nuclei
boundaries, knowing that the right Nucleus boundary is necessarily
aligned with a conclusive melodic contour, low and falling in the declara-
tive case. The Nucleus left boundary corresponds to the absence of any
syntactic dependency relation between any units in the Nucleus toward
any unit of the preceding macrosegment. The status of a macrosegment as
a Nucleus can be easily verified with a sound editor, which allows us to
isolate the corresponding speech segment from the rest of the recording. If
the extracted macrosegment sounds well formed on playback,
and if no extra segment is expected, the macrosegment will be labeled as a
Nucleus, otherwise as a Parenthesis or a Prenucleus. There is also the
possibility of having an absent Nucleus, and of the sentence ending with a
Prenucleus carrying a rising melodic contour (Cn or C1), leaving the
absent Nucleus as part of the context.
In the macrosyntactic segmentation of the text, macrosegments have to
be labeled as Prenuclei, Parentheses, Nuclei, and Postnuclei. Prenuclei and
Parentheses do not have syntactic dependency relations with any Nucleus
element. Parentheses can be associated with independent prosodic struc-
tures (i.e. ended by a conclusive melodic contour) or integrated as a
prosodic group in the whole sentence prosodic structure (Debaisieux &
Martin, 2010).
French
Les vieux graphistes
The first example is from a GARS corpus (Les vieux graphistes).
Les vieux graphistes ou les anciens je devrais dire graphistes pas les vieux
quelquefois lorsqu’ils voient les mises en page de certaines revues ou de
certains journaux ils se mettent les mains sur la tête [AIX-R00PRO001].
“Aged graphic designers or should I say ancient not aged graphic designers some-
times when they see the layouts of some magazines or some newspapers they put
their hands on the head.”
The melodic curve of this spontaneous speech example is given in Figure 8.5.
Figure 8.5 Melodic curve with the pitch movements on stressed syllables
circled. (a) Les vieux graphistes ou les anciens je devrais dire graphistes
pas les vieux. (b) quelquefois lorsqu’ils voient les mises en page de
certaines revues ou de certains journaux ils se mettent les mains sur la tête.
As discussed earlier, a spontaneous speech utterance consists of one or more
macrosegments. The first step in a macrosyntactic analysis involves the
identification of the Nucleus. By definition, the Nucleus can appear by itself
without any other macrosegment and be well formed both syntactically and
prosodically. This property can easily be tested by extracting the
macrosegment considered as Nucleus from the speech recording with a sound
editor. When heard, the extracted macrosegment should appear complete (i.e.
ending with a conclusive terminal contour) and be syntactically well formed.
In the example above, the segment ils se mettent les mains sur la tête
fulfills these two conditions and is accepted as the Nucleus of the utterance.
To identify the other macrosegment boundaries, it is imperative to
separate the identification of the text boundaries from the prosodic
boundaries, as they are not guaranteed to match. For text, one of the easiest
ways consists in detecting the break in the dependency relations binding
syntactic units inside each macrosegment. The absence of these relations
generally indicates a macrosegment boundary. When applied to the
example, five macrosegments (including the already identified Nucleus)
emerge:
(Les vieux graphistes ou les anciens) (je devrais dire graphistes) (pas les vieux)
(quelquefois lorsqu’ils voient les mises en page de certaines revues ou de certains
journaux)
(ils se mettent les mains sur la tête)
The same segmentation has to be done for the prosodic side of the utterance,
in order to obtain the sequence of prosodic words together with their melodic
contours:
[[Les vieux Cn graphistes C2] [ou les anciens Cph [je devrais dire C0]
graphistes C1]]
[pas les vieux C0]
[quelquefois C1]
[[lorsqu’ils voient Cn les mises en page C2] [de certaines revues Cn ou de certains
journaux C1]] [ils se mettent Cn les mains Cn sur la tête C0]
Several segmentations seem possible (e.g. [pas les vieux quelquefois] instead
of [pas les vieux] [quelquefois . . .]), but this first segmentation pertains only to
the text side of the utterance. The prosodic macrosegment analysis may not
reveal boundaries corresponding to the boundaries of the text macrosegment.
Conversely, the prosodic segmentation leads us to identify je devrais dire and
pas les vieux, ending with conclusive contours C0, as parentheses, whereas the
analysis on the text side would give the syntagm je devrais dire graphistes.
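The comparison between the two segmentations can be sketched as a comparison
of boundary positions counted in words. The code below is a minimal
illustration, not the procedure used in the analysis: the choice to treat only
C1 and C0 as major prosodic boundaries is an assumption made here, and the
helper names (text_boundaries, prosodic_boundaries, word_at) are hypothetical.
The segments and contour labels are copied from the two segmentations above.

```python
from typing import List, Set, Tuple

# Text macrosegments of the example, as segmented on the text side.
text_segments: List[str] = [
    "Les vieux graphistes ou les anciens",
    "je devrais dire graphistes",
    "pas les vieux",
    "quelquefois lorsqu’ils voient les mises en page de certaines revues "
    "ou de certains journaux",
    "ils se mettent les mains sur la tête",
]

# Prosodic words with the contour label given in the prosodic segmentation.
prosodic_words: List[Tuple[str, str]] = [
    ("Les vieux", "Cn"), ("graphistes", "C2"), ("ou les anciens", "Cph"),
    ("je devrais dire", "C0"), ("graphistes", "C1"), ("pas les vieux", "C0"),
    ("quelquefois", "C1"), ("lorsqu’ils voient", "Cn"),
    ("les mises en page", "C2"), ("de certaines revues", "Cn"),
    ("ou de certains journaux", "C1"), ("ils se mettent", "Cn"),
    ("les mains", "Cn"), ("sur la tête", "C0"),
]

# Assumption of this sketch: only C1 and C0 mark major prosodic boundaries.
MAJOR_CONTOURS = {"C1", "C0"}


def text_boundaries(segments: List[str]) -> Set[int]:
    """Word positions at which a text macrosegment ends."""
    position, boundaries = 0, set()
    for segment in segments:
        position += len(segment.split())
        boundaries.add(position)
    return boundaries


def prosodic_boundaries(words: List[Tuple[str, str]], contours: Set[str]) -> Set[int]:
    """Word positions at which a prosodic word ends with one of the given contours."""
    position, boundaries = 0, set()
    for text, contour in words:
        position += len(text.split())
        if contour in contours:
            boundaries.add(position)
    return boundaries


all_words = " ".join(text_segments).split()


def word_at(position: int) -> str:
    """Word immediately preceding the boundary at the given position."""
    return all_words[position - 1]


text_b = text_boundaries(text_segments)
pros_b = prosodic_boundaries(prosodic_words, MAJOR_CONTOURS)

congruent = sorted(text_b & pros_b)     # boundaries present on both sides
text_only = sorted(text_b - pros_b)     # text boundary without major contour
prosody_only = sorted(pros_b - text_b)  # major contour inside a text segment

for label, positions in [("congruent", congruent),
                         ("text only", text_only),
                         ("prosody only", prosody_only)]:
    print(label, [(p, word_at(p)) for p in positions])
```

Under these assumptions, the boundaries found only on the prosodic side fall
after je devrais dire and after quelquefois, the two places where the text
above notes that the segmentations diverge.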
The principles of the storage-concatenation process of prosodic words
are then applied, by indicating the type of pitch movement observed. It is
this process which is accomplished first by the listener, revealing the
importance of the prosodic structure as the first organization of utterances
in speech, the (macro)syntactic analysis taking place in a second step
(see Table 8.2).
Remarks:
1. The first stress group Les vieux has a neutralized contour to ensure a
differentiation with the falling contour ending Les vieux graphistes.
2. The group ou les anciens has an emphatic accent on the first syllable of
anciens, which prevents the realization of a group stress on the
last syllable of the word.
3. The segments je devrais dire and pas les vieux are marked by a falling
melodic variation and an intensity drop of about −6 dB on the last syllable.
Since they are well-formed macrosegments on both syntactic and prosodic
levels, they can be isolated and form complete sentences by themselves.
They can also be removed from the whole sentence without modifying the
overall prosodic structure.
This segmentation of the utterance in stress groups, as indicated by
melodic contours on their last stressed syllables, leads to the ISC schema
of Figure 8.6.
Table 8.2 Utterance segmentation into prosodic words indicating the
melodic contours on the last stressed syllable and the type of contour
in the process of storage-concatenation

Prosodic word              Contour           Type
Les vieux                  Neutralized Cn    Level 3
graphistes                 Falling C2        Minor continuation
ou les anciens             –                 Level 3
je devrais dire            Descending C0     Parenthesis
graphistes                 Rising C1         Major continuation
pas les vieux              Descending C0     Parenthesis
quelquefois                Rising C1         Major continuation
lorsqu’ils voient          Neutralized Cn    Level 3
les mises en page          Falling C2        Minor continuation
de certaines revues        Neutralized Cn    Level 3
ou de certains journaux    Rising C1         Major continuation
ils se mettent             Neutralized Cn    Level 3
les mains                  Neutralized Cn    Minor continuation
sur la tête                Descending C0     Final declarative
Figure 8.6 The ISC schema of the example Les vieux graphistes ou les
anciens je devrais dire graphistes pas les vieux quelquefois lorsqu’ils voient
les mises en page de certaines revues ou de certains journaux ils se mettent les
mains sur la tête. On the graph, time flows from top to bottom, and the
horizontal axis represents the different assembly levels (Cn neutralized, C2
minor continuation, C1 major continuation, C0 conclusive final) corresponding
to prosodic falling, flat (neutralized) and rising movements.
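The layout described in the caption can be approximated in plain text. The
sketch below is not a reproduction of the figure: it only places each
prosodic word of Table 8.2 on its own line (time flowing downward) and
indents it according to its contour column (Cn, C2, C1, C0). The names
COLUMNS, TABLE_8_2, and isc_rows are hypothetical, and placing the row with
no contour label in the Cn column is an assumption, made because Table 8.2
gives that row the same type (Level 3) as most Cn rows.

```python
from typing import List, Optional, Tuple

# Columns of the schema, left to right, as named in Figure 8.6.
COLUMNS = ["Cn", "C2", "C1", "C0"]

# Prosodic words and contour labels as listed in Table 8.2.
# None marks the row whose contour cell is "–" in the table.
TABLE_8_2: List[Tuple[str, Optional[str]]] = [
    ("Les vieux", "Cn"),
    ("graphistes", "C2"),
    ("ou les anciens", None),
    ("je devrais dire", "C0"),
    ("graphistes", "C1"),
    ("pas les vieux", "C0"),
    ("quelquefois", "C1"),
    ("lorsqu’ils voient", "Cn"),
    ("les mises en page", "C2"),
    ("de certaines revues", "Cn"),
    ("ou de certains journaux", "C1"),
    ("ils se mettent", "Cn"),
    ("les mains", "Cn"),
    ("sur la tête", "C0"),
]


def isc_rows(table: List[Tuple[str, Optional[str]]], width: int = 4) -> List[str]:
    """One line per prosodic word, indented by the column of its contour."""
    rows: List[str] = []
    for word, contour in table:
        # Assumption: a row without a contour label goes in the Cn column.
        column = 0 if contour is None else COLUMNS.index(contour)
        label = contour if contour is not None else "–"
        rows.append(" " * (width * column) + f"{word} [{label}]")
    return rows


rows = isc_rows(TABLE_8_2)
print("\n".join(rows))
```

Reading the printed lines from top to bottom, each move to the right
corresponds to a contour of a higher class, and the rightmost lines are those
ending with the conclusive contour C0.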
J’y vais à pied
A second example in French (Blanche-Benveniste & Martin, 2011) is an excerpt
from Corpus de Français Parlé Parisien (CFPP, 2000) elaborated under the
direction of Sonia Branca-Rosoff.
The transcription, using standard orthography, is as follows (the speaker is
interviewed about her addiction to using the car for her travel inside Paris):
bon je reviens sur cette euh ce problème qui est un problème euh voilà de d’être chez moi
combien de fois ça m’est arrivé bon ben là tu vas boulevard Voltaire c’est pas loin euh tu
tu j’y vais à pied je suis chez moi je me conditionne dans mon appartement en me disant
j’y vais à pied moi ma voiture elle est garée dans la rue j’ai un stationnement résident je
passe devant je ne peux pas m’empêcher d’ouvrir euh la porte de monter dedans et
d’aller euh à euh voilà cinq minutes en voiture ce qui me mettrait peut-être euh un petit
quart d’heure à pied donc au dernier moment je prends ma voiture sur le coup je me dis
je vais mettre cinq minutes mais le temps de me garer de tourner de faire des ronds pour
pas mal me garer et tout je sais que je suis perdante je le sais que je suis perdante . . . il y a
le oui il y a un côté de facilité de passer devant sa voiture et de se dire bon ben je la
prends et euh et voilà quoi au point où on en est c’est pas très malin faudrait que je c’est
une habitude en tous cas que j’aimerais changer.
“well I come back on this problem which is a problem uh well to be home how many times
it happened to me well you go to Voltaire boulevard it is not far uh you I go by foot I am
home I prepare myself in my apartment thinking I am going on foot my car is parked on
the street I have a resident parking permit I go by my car and I cannot help but open the
door to get into my car and to go uh well five minutes by car for what would take fifteen
minutes by foot so at the last moment I take my car on the spot I think it will take five
minutes but by the time it takes to park to go around to find a good parking spot and
everything I know I know that I am a loser I know I am a loser there is yes there is a facility
to pass near one’s car and tell oneself well I take my car and then well at this stage it is not
very smart it would have it is a habit in any case that I would like to change.”
a. Dysfluencies
Hesitations: euh eight occurrences;
Repetitions: tu tu, followed by one abort;
Reprises and reformulations: cette → ce problème, de → d’être chez
moi, il y a le → il y a un côté;
Aborts: tu tu, et d’aller à, dedans et d’aller euh à, faudrait que je.
b. Ponctuants
Macrosegment initial: bon, voilà, bon ben là;
Macrosegment final: euh voilà, voilà.
c. Text macrosegments
(bon je reviens sur cette euh ce problème qui est un problème (euh voilà) de d’être chez
moi) (combien de fois ça m’est arrivé) (bon ben là tu vas boulevard Voltaire) (c’est
pas loin) (euh tu tu j’y vais à pied) (je suis chez moi) (je me conditionne dans mon
appartement en me disant) (j’y vais à pied) (moi) (ma voiture) (elle est garée dans la
rue) (j’ai un stationnement résident) (je passe devant) (je ne peux pas m’empêcher
d’ouvrir) (euh la porte) (de monter dedans et d’aller euh à euh voilà) (cinq minutes en
voiture) (ce qui me mettrait peut-être euh un petit quart d’heure à pied) (donc au
dernier moment je prends ma voiture) (sur le coup) (je me dis) (je vais mettre cinq
minutes) (mais le temps de me garer) (de tourner) (de faire des ronds pour pas mal me
garer et tout) (je sais que je suis perdante) (je le sais que je suis perdante)
(il y a le) (oui) (il y a un côté de facilité de passer devant sa voiture et de se dire) (bon
ben je la prends et euh et voilà quoi) (au point où on en est) (c’est pas très malin)
(faudrait que je) (c’est une habitude en tous cas que j’aimerais changer)
d. Prosodic groups
The prosodic groups are determined by the ranking of contours in French: Cn <
C2 < C1 < C0. From the labeling of prosodic events located on stressed
syllables, it is then possible to identify the Nuclei, whose left boundary is a
text macrosegment boundary without syntactic relation to the left (i.e. toward
what precedes), and the right boundary is aligned with a conclusive declarative
contour C0. The prosodic structure reorganizes the text macrosegments as
follows:
[bon je reviens Cn sur cette euh ce problème Cn qui est un problème Cn euh voilà C1]
[de d’être chez moi C1]
[combien de fois Cn ça m’est arrivé C1]
[bon ben là tu vas boulevard Voltaire Cn c’est pas loin Cn euh tu tu j’y vais à pied C1]
[je suis chez moi Cn je me conditionne Cn dans mon appartement Cn en me disant Cn
j’y vais à pied C1]
[moi Cn ma voiture Cn elle est garée dans la rue C1]
[j’ai un stationnement Cn résident C1]
[je passe devant C1]
[je ne peux pas m’empêcher d’ouvrir C1]
[euh la porte Cn de monter dedans Cn et d’aller C1 euh à euh voilà C1] [cinq minutes
en voiture C1]
[ce qui me mettrait peut-être Cn euh un petit quart d’heure Cn à pied C0]
The first sentence has 11 prosodic groups ended with C1 (continuation majeure), and
the last group is terminated by a conclusive contour C0. This segmentation defines
a Prosodic Nucleus [ce qui me mettrait peut-être Cn euh un petit quart d’heure Cn
à pied C0].
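The count given for the first sentence can be checked mechanically from the
bracketed notation. The following sketch is only an illustration: it assumes
that each top-level bracket is one prosodic group, that the last contour
label inside a bracket is the contour ending the group, and that a sentence
closes at the first group ending with C0. The function names final_contours
and split_at_conclusive are hypothetical; the groups are copied from the
notation above, together with the group that opens the next sentence.

```python
import re
from typing import List

# Contour labels occurring in the groups copied below.
CONTOUR_LABELS = {"Cn", "C2", "C1", "C0"}

NOTATION = """
[bon je reviens Cn sur cette euh ce problème Cn qui est un problème Cn euh voilà C1]
[de d’être chez moi C1]
[combien de fois Cn ça m’est arrivé C1]
[bon ben là tu vas boulevard Voltaire Cn c’est pas loin Cn euh tu tu j’y vais à pied C1]
[je suis chez moi Cn je me conditionne Cn dans mon appartement Cn en me disant Cn j’y vais à pied C1]
[moi Cn ma voiture Cn elle est garée dans la rue C1]
[j’ai un stationnement Cn résident C1]
[je passe devant C1]
[je ne peux pas m’empêcher d’ouvrir C1]
[euh la porte Cn de monter dedans Cn et d’aller C1 euh à euh voilà C1]
[cinq minutes en voiture C1]
[ce qui me mettrait peut-être Cn euh un petit quart d’heure Cn à pied C0]
[donc au dernier moment C1 je prends ma voiture C0]
"""


def final_contours(notation: str) -> List[str]:
    """Return the last contour label of each bracketed prosodic group."""
    finals: List[str] = []
    for group in re.findall(r"\[([^\[\]]+)\]", notation):
        labels = [token for token in group.split() if token in CONTOUR_LABELS]
        if labels:  # a group without any known label is skipped
            finals.append(labels[-1])
    return finals


def split_at_conclusive(finals: List[str]) -> List[List[str]]:
    """Group final contours into sentences, closing each one at C0."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for contour in finals:
        current.append(contour)
        if contour == "C0":
            sentences.append(current)
            current = []
    if current:  # trailing groups not yet closed by a conclusive contour
        sentences.append(current)
    return sentences


sentences = split_at_conclusive(final_contours(NOTATION))
first_sentence = sentences[0]
print(len(first_sentence), first_sentence.count("C1"), first_sentence[-1])
```

With these inputs the first sentence contains twelve groups, eleven of them
ending with C1 and the last one with C0, in agreement with the description
above.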
[donc au dernier moment C1 je prends ma voiture C0]
A simple prosodic structure with two stress groups.
[sur le coup C1]
[je me dis C1]
[je vais mettre Cn cinq minutes C0]
The text and Prosodic Nuclei [je vais mettre Cn cinq minutes C0] are aligned in
this case.
[mais C1]
[le temps Cn de me garer C1]
[de tourner C1]
[de faire des ronds Cn pour pas mal me garer et tout C1]
[je sais Cemph que je suis perdante C0]
The rising melodic contour on je sais is an emphasis marker (accent d’insis-
tance) and not a continuation majeure contour.
[je le sais C0 que je suis perdante Con]
The prosodic construction in this sentence is Prosodic Nucleus + Suffix, as
indicated by a flat melodic contour on perdante, aligned on the text configura-
tion (je le sais) (que je suis perdante).
[il y a le oui C1]
[il y a un côté de facilité C1]
[de passer devant sa voiture C1]
[et de se dire C1]
[bon ben je la prends C1]
[et euh et voilà C0] [quoi Con]
Following five text PreNuclei ending with a C1 contour, the Nucleus [et euh
et voilà C0] is followed by a Postfix aligned with the ponctuant quoi.
[au point où on en est C1]
[c’est pas très malin C1]
[faudrait que je C1]
[c’est une habitude Cn en tous cas Cn que j’aimerais changer C0]
The last sentence has three Prefixes followed by the Prosodic Nucleus [c’est
une habitude Cn en tous cas Cn que j’aimerais changer C0].
The example also has one text Parenthesis:
[euh voilà C1] integrated in the prosodic structure and embedded in the Prefix [bon je
reviens Cn sur cette euh ce problème Cn qui est un problème Cn [euh voilà C1] de
d’être chez moi C1]
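The bracketed notation used above lends itself to simple automatic processing. The following Python sketch is only an illustration under stated assumptions: the function names and the regular-expression parsing are hypothetical and not part of the analysis itself, and only innermost bracketed groups are considered. It extracts the final contour of each prosodic group and cuts the resulting sequence after every conclusive contour C0.

```python
import re

# Contour labels as they appear in the transcriptions; the word boundaries
# keep a label such as "C0n" from being read as "C0".
CONTOUR = re.compile(r"\b(C0n|Con|Cemph|C0|C1|C2|Cc|Cn)\b")
# Innermost bracketed groups only (an assumption of this sketch).
GROUP = re.compile(r"\[([^\[\]]*)\]")


def final_contours(annotated):
    """Return the last contour label of each innermost bracketed group."""
    finals = []
    for body in GROUP.findall(annotated):
        labels = CONTOUR.findall(body)
        if labels:
            finals.append(labels[-1])
    return finals


def split_at_conclusive(annotated):
    """Cut the sequence of final contours after every conclusive C0."""
    units, current = [], []
    for label in final_contours(annotated):
        current.append(label)
        if label == "C0":
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units


example = "[sur le coup C1] [je me dis C1] [je vais mettre Cn cinq minutes C0]"
print(final_contours(example))
```

Counting the occurrences of "C1" in the list returned for a whole sentence is one way to check figures such as the 11 groups ending with C1 mentioned above.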
e. Alignment text macrosegments – prosodic macrosegments

By aligning the text and prosodic macrosegments, the congruent and non-congruent
sequences appear clearly.
(bon je reviens sur cette euh ce problème qui est un problème) (euh voilà)
[bon je reviens Cn sur cette euh ce problème Cn qui est un problème Cn euh voilà C1]
(de d’être chez moi) (combien de fois ça m’est arrivé)
[de d’être chez moi C1] [combien de fois Cn ça m’est arrivé C1]
(bon ben là tu vas boulevard Voltaire) (c’est pas loin) (euh tu tu j’y vais à pied)
[bon ben là tu vas boulevard Voltaire Cn c’est pas loin Cn euh tu tu j’y vais à pied C1]
(je suis chez moi) (je me conditionne dans mon appartement en me disant)
[je suis chez moi Cn je me conditionne Cn dans mon appartement Cn en me disant Cn]
(j’y vais à pied) (moi) (ma voiture) (elle est garée dans la rue)
[j’y vais à pied C1] [moi Cn ma voiture Cn elle est garée dans la rue C1]
(j’ai un stationnement résident) (je passe devant)
[j’ai un stationnement Cn résident C1] [je passe devant C1]
(je ne peux pas m’empêcher d’ouvrir) (euh la porte)
[je ne peux pas m’empêcher d’ouvrir C1] [euh la porte Cn]
(de monter dedans et d’aller euh à euh voilà) (cinq minutes en voiture)
[de monter dedans Cn et d’aller C1 euh à euh voilà C1] [cinq minutes en voiture C1]
(ce qui me mettrait peut-être euh un petit quart d’heure à pied)
[ce qui me mettrait peut-être Cn euh un petit quart d’heure Cn à pied C0]
(donc au dernier moment je prends ma voiture)
[donc au dernier moment C1 je prends ma voiture C0]
(sur le coup) (je me dis) (je vais mettre cinq minutes)
[sur le coup C1] [je me dis C1] [je vais mettre Cn cinq minutes C0]
(mais le temps de me garer) (de tourner)
[mais C1] [le temps Cn de me garer C1] [de tourner C1]
(de faire des ronds pour pas mal me garer et tout)
[de faire des ronds Cn pour pas mal me garer et tout C1]
(je sais que je suis perdante) (je le sais que je suis perdante)
[je sais Cemph que je suis perdante C0] [je le sais C0 que je suis perdante Con]
(il y a le) (oui)
[il y a le oui C1]
(il y a un côté de facilité de passer devant sa voiture et de se dire)
[il y a un côté de facilité C1] [de passer devant sa voiture C1] [et de se dire C1]
(bon ben je la prends et euh et voilà quoi)
[bon ben je la prends C1] [et euh et voilà C0] [quoi Con]
(au point où on en est) (c’est pas très malin) (faudrait que je)
[au point où on en est C1]
[c’est pas très malin C1] [faudrait que je C1] (c’est une habitude en tous cas que
j’aimerais changer)
[c’est une habitude Cn en tous cas Cn que j’aimerais changer C0]
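The comparison carried out by eye in the alignment above can be approximated mechanically. The sketch below is a hypothetical illustration (the function names and the word-offset representation are assumptions, not the author's procedure): it records the word position at which each text macrosegment and each prosodic group closes, and reports the text boundaries that have no prosodic counterpart.

```python
import re

# Tokens that are contour labels rather than words of the transcription.
LABEL = re.compile(r"^(C0n|Con|Cemph|C0|C1|C2|Cc|Cn)$")


def segment_ends(annotated, open_ch, close_ch):
    """Return the set of word offsets at which a segment closes."""
    ends, n_words = set(), 0
    spaced = annotated.replace(open_ch, " ").replace(close_ch, f" {close_ch} ")
    for token in spaced.split():
        if token == close_ch:
            ends.add(n_words)      # a boundary after n_words words
        elif not LABEL.match(token):
            n_words += 1           # ordinary word: advance the offset
    return ends


def unmatched_text_boundaries(text, prosody):
    """Text macrosegment boundaries not matched by a prosodic group boundary."""
    return sorted(segment_ends(text, "(", ")") - segment_ends(prosody, "[", "]"))


text = "(sur le coup) (je me dis) (je vais mettre cinq minutes)"
prosody = "[sur le coup C1] [je me dis C1] [je vais mettre Cn cinq minutes C0]"
print(unmatched_text_boundaries(text, prosody))
```

An empty result means that every text boundary coincides with a prosodic boundary. A non-empty result lists the word offsets where a text macrosegment ends inside a prosodic group, as happens after the second problème in the first line of the alignment.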
Italian
The example analyzed below was taken from the C-ORAL-ROM corpus, file
IFAMDL09, and consists of one female Tuscan speaker SAB talking to a friend
about her (painful) attendance at a concert the night before. Transcribed with-
out any punctuation, the text appears as follows:
ma io penso non so quanta capienza c’ha comunque io penso dodicimila persone s’era
tutte mh guarda non cascava uno spillo sinceramente e poi tra l’altro c’è stata la menata
che noi ci s’aveva il biglietto per il parterre praticamente quando siamo entrati hanno
aperto i cancelli già in ritardo alle sei e mezzo sicché noi siamo stati due ore lì a aspettare
in fila così quando siamo entrati c’hanno detto che praticamente noi non si poteva andare
sulle gradinate a sedere ma soltanto in mezzo si poteva stare sicché a me mi girava un
po’ le palle perché insomma stare ancora a aspettare fino alle nove e poi tutto il concerto
in piedi insomma era stressante la cosa e poi in piedi hai visto anche se il palco è un po’
rialzato però se ti viene uno davanti alto non vedi nulla specialmente io che non sono ba
insomma che son bassa vero sicché nulla io e quest’altra ragazza che era in macchina
con me s’è detto sai sicché proviamo a andare nelle gradinate e siamo riuscite a sgamare
sicché siamo siamo andate su e nulla ci siamo messe a sedere però logicamente tutti i
posti erano prenotati
“but I think I do not know how much capacity it has in any case I think twelve thousand
people that’s all mh look sincerely a needle could not fall down and then among other
things there was the nuisance that we had tickets for the parterre practically when we
entered they opened the gates already late at six-thirty so we were there two hours to
wait in line like that so when we entered they said that practically we could not go in the
stands but only in the middle to sit so I was feeling a little upset because really to stand
and wait still until nine and then standing throughout the whole concert, it was stressful
and then on tiptoe I saw if the stage was a bit raised if you come in front you don’t see
anything especially that I am I am short actually so nothing me and this other girl who
was in the car with me we thought let’s go to the stands and we were able to sneak in so
we we went on and nothing we started to sit but logically all the seats were booked”
a. Dysfluencies
Repetitions: siamo siamo
Reprises and reformulations: none
Aborts: io che non sono ba → che son bassa
Ponctuants: vero, sai, insomma, sicché
b. Text macrosegments
(ma io penso) (non so quanta capienza c’ha) (comunque io penso) (dodicimila persone
s’era tutte) (mh guarda) (non cascava uno spillo sinceramente) (e poi) (tra l’altro)
(c’è stata la menata che noi ci s’aveva il biglietto per il parterre) (praticamente
quando siamo entrati hanno aperto i cancelli già in ritardo alle sei e mezzo sicché noi
siamo stati due ore lì a aspettare in fila così) (quando siamo entrati) (c’hanno detto
che praticamente noi non si poteva andare sulle gradinate a sedere ma soltanto in
mezzo si poteva stare) (sicché a me mi girava un po’ le palle perché) (insomma stare
ancora a aspettare fino alle nove e poi tutto il concerto in piedi) (insomma era
stressante la cosa) (e poi in piedi hai visto anche se il palco è un po’ rialzato)
(però se ti viene uno davanti alto) (non vedi nulla specialmente io che non sono ba)
(insomma che son bassa vero) (sicché nulla) (io e quest’altra ragazza che era in
macchina con me s’è detto sai) (sicché proviamo a andare nelle gradinate e siamo
riuscite a sgamare) (sicché siamo) (siamo andate su) (e nulla) (ci siamo messe a
sedere però logicamente tutti i posti erano prenotati)
c. Prosodic groups
The prosodic groups are determined by the ranking of contours in Romance
languages: Cn < C1 < C2 < Cc < C0, different from the French one. From the
labeling of prosodic events located on stressed syllables, it is then possible to
identify the Nuclei, whose left boundary is a macrosegment boundary without
syntactic relation to the left (i.e. toward what precedes), and the right boundary
is aligned with a conclusive declarative contour C0. The prosodic structure
reorganizes the text macrosegments as follows:
[ma io C1 penso C2]
[non so C1 [quanta capienza Cn c’ha C2]]
[[comunque Cn io penso Cn dodicimila persone C1] s’era tutte C0]
The text Nucleus (s’era tutte C0) is aligned on a single stress group ending
with C0, whereas the prosodic structure [C1 C2] [C1 [Cn C2]] [[Cn Cn C1] C0]
is congruent with the macrosyntactic organization (ma io penso) (non so quanta
capienza c’ha) (comunque io penso) (dodicimila persone s’era tutte).
[mh guarda Cn non cascava Cn uno spillo C0] [sinceramente Con]
This sentence has a Prosodic Nucleus followed by a Postfix ending with a
flat melodic contour.
[e poi C1]
[tra l’altro C1]
[c’è stata la menata Cn che noi ci s’aveva Cn il biglietto Cn per il parterre Cc]
[praticamente Cn quando siamo entrati C1]
[hanno aperto Cn i cancelli C1]
[già in ritardo C0]
[alle sei e mezzo Cn sicché noi siamo stati due ore Cn lì a aspettare Con in fila
così Con]
The next sentence has the same prosodic organization into Prosodic Nucleus +
Postfix, where all non-final contours are neutralized.
[[quando siamo entrati C1]
[c’hanno detto Cn che praticamente C1]
[noi non si poteva andare C1]] [sulle gradinate Cn a sedere C0]
[[ma soltanto C2 in mezzo C1] [si poteva stare C0]]
This sequence has two conclusive contours C0, indicating an organization in
Prosodic Nucleus + Prosodic Nucleus. The conjunction soltanto marks the link
between the two successive text macrosegments aligned on two independent
prosodic structures. The second prosodic structure is therefore a Suffix.
[sicché a me C1]
[mi girava Cn un po’ Cn le palle C1]
[perché insomma C2 stare C2 ancora C2 a aspettare C1 fino alle nove C2 e poi tutto il
concerto Cc in piedi C1]
[insomma C2 era stressante C1] [la cosa C0]
The last text macrosegment [insomma Cn era stressante C1] [la cosa C0] is the
text Nucleus.
[e poi C1]
[in piedi Cn hai visto anche Cn se il palco Cn è un po’ rialzato C1]
[però C1]
[se ti viene Cn uno davanti alto C1]
[non vedi nulla C0]
[[specialmente io che non sono ba Cn insomma Cn che son bassa C1]]
[vero C0]
This sentence has two Prosodic Nuclei; the text macrosegment corresponding
to the second prosodic structure belongs to the preceding text segment, as
indicated by the adverb specialmente.
[sicché C1 nulla C0]
The conclusive contour is realized with a rise on the stressed syllable of nulla
and a falling contour on its last syllable.
[io e quest’altra Cn ragazza C1]
[che era in macchina Cn con me C1]
[s’è detto sai Cn sicché proviamo C1 a andare C1 nelle gradinate C1]
[e siamo riuscite Cn a sgamare C1] [sicché siamo Cn siamo Cn andate su C1]
[e nulla C1]
[ci siamo messe a sedere C1]
[però Cn logicamente Cn tutti i posti Cn erano prenotati C0]
The last sentence uses only prosodic phrases ended with the contour C1 and not
C2, as all contrasts inside prosodic groups are marked by the neutralized
contour Cn.
As explained in Chapter 7, constructions ending with C2 must contrast with
C1 in more complex structures such as [C1 [Cn C2]], as in [non so C1 [quanta
capienza Cn c’ha C2]] above. To realize a stress phrase at the same level C2,
the speaker has chosen the sequence [C1 C2] in the preceding [ma io C1 penso
C2] instead of [Cn C1] as she did later in most prosodic groups.
From the distribution of prosodic contours in (c), the congruence of the
prosodic structure with the macrosyntactic structure is realized most of the time.
d. Alignment text macrosegments – prosodic macrosegments
By aligning the text and prosodic macrosegments, the congruent and non-
congruent sequences are shown below.
(ma io penso) (non so quanta capienza c’ha)

[ma io C1 penso C2] [non so C1 [quanta capienza Cn c’ha C2]]
(comunque io penso) (dodicimila persone s’era tutte)
[[comunque Cn io penso Cn dodicimila persone C1] s’era tutte C0]
(mh guarda) (non cascava uno spillo sinceramente)
[mh guarda Cn non cascava Cn uno spillo C0] [sinceramente Con]
(e poi) (tra l’altro)
[[e poi C1] [tra l’altro C1]
(c’è stata la menata che noi ci s’aveva il biglietto per il parterre)
[c’è stata la menata Cn che noi ci s’aveva Cn il biglietto Cn per il parterre Cc]]
(praticamente quando siamo entrati)
[[praticamente Cn quando siamo entrati C1]
(hanno aperto i cancelli già in ritardo)
[hanno aperto Cn i cancelli C1] [già in ritardo C0]]
(alle sei e mezzo sicché noi siamo stati due ore lì a aspettare in fila così)
[alle sei e mezzo Cn sicché noi siamo stati due ore Cn lì a aspettare Cn in fila
così C0n]
(quando siamo entrati)
[[quando siamo entrati C1]
(c’hanno detto che praticamente noi non si poteva andare sulle gradinate a sedere ma
soltanto in mezzo si poteva stare)
[c’hanno detto Cn che praticamente C1] [noi non si poteva andare Cc]] [sulle
gradinate Cn a sedere C0] [[ma soltanto Cn in mezzo C1] [si poteva stare C0]]
(sicché a me mi girava un po’ le palle)
[sicché a me C1] [mi girava Cn un po’ Cn le palle C1]
(perché) (insomma stare ancora a aspettare fino alle nove e poi tutto il concerto in piedi)
[perché insomma Cn stare Cn ancora Cn a aspettare Cn fino alle nove Cn e poi tutto il
concerto Cn in piedi C1]
(insomma era stressante la cosa)
[insomma Cn era stressante C1] [la cosa C0]
(e poi in piedi hai visto anche se il palco è un po’ rialzato)
[e poi C1] [in piedi Cn hai visto anche Cn se il palco Cn è un po’ rialzato C1]
(però se ti viene uno davanti alto)
[però C1] [se ti viene Cn uno davanti alto C1]
(non vedi nulla specialmente io che non sono ba) (insomma che son bassa vero)
[non vedi nulla C0] [[specialmente Cn io che non sono ba Cn insomma Cn che son
bassa C1]] [vero C0]
(sicché nulla)
[sicché C1 nulla C0]
(io e quest’altra ragazza che era in macchina con me)
[io e quest’altra Cn ragazza C1] [che era in macchina Cn con me C1]
(s’è detto sai) (sicché proviamo a andare nelle gradinate)
[s’è detto Cn sai Cn sicché proviamo Cn a andare Cn nelle gradinate C1]
(e siamo riuscite a sgamare) (sicché siamo) (siamo andate su)
[e siamo riuscite Cn a sgamare C1] [sicché siamo Cn siamo Cn andate su C1]
(e nulla)
[e nulla C1]
(ci siamo messe a sedere però logicamente tutti i posti erano prenotati)
[ci siamo messe a sedere C1] [però Cn logicamente Cn tutti i posti Cn erano
prenotati C0]
Portuguese
António Costa Quinta
The short example analyzed below is
extracted from the C-ORAL-ROM corpus (2005, ed. Cresti & Moneglia), file
PFAMCV03, and consists of one female European Portuguese speaker GRA.
GRA is a psychologist, recorded in her home in Lisbon. She is talking to two
researchers about ways of addressing people and her flying experiences. The
recording belongs to a collection of spontaneous conversations recorded in
family environments.
The main characteristics of non-prepared speech pertain to the macrosyntactic
organization of the text in Prenuclei, Nucleus, Parenthesis, and Postnuclei. The
text is thus segmented in macrosegments by identifying lack of dependency
relations between syntagms. Then, from the syntactic properties of analyzed
macrosegments, we can extract the potential nuclei and test their characteristics
(illocutionary property, change in modality, etc.). Extracting its macrosyntactic
text Nuclei, the text (transcribed without any punctuation) appears as follows:
terrível não é eu aliás conheço um médico que é o Costa Quinta o António Costa Quinta
conhecido pelo Tó o Tó Costa Quinta que é a mesma coisa que bebe como uma esponja é
dos tais que não não se altera porque é realmente bem educado mas que chega a qualquer
sítio e ao fim de cinco minutos está a falar sobre guerra a guerra de África e até acabar até
se ir embora fala sobre a guerra de África eu acho eu só tenho um termo em francês para
definir um tipo destes é um emmerdeur.
“Terrible is it not. By the way I know a doctor, Costa Quinta, António Costa Quinta, known
as Tó or Tó Costa Quinta, which is the same thing, who drinks like a sponge and is such that
he does not get excited because he is really well-behaved, but when he arrives any-
where, after five minutes, he begins to speak about the war, the war in Africa and until he
finishes, until he leaves, he speaks about the war in Africa. I think, I only have one
term in French to define a type of this kind: it is an emmerdeur.”
a. Dysfluencies
Hesitations: none;
Repetitions: não não;
Reprises and reformulations: guerra a guerra de África, eu acho →
eu só tenho;
Aborts: none.
b. Ponctuants
Macrosegment initial: none
Macrosegment final: não é.
c. Text macrosegments
(terrível não é) (eu aliás conheço um médico que é o Costa Quinta o António Costa
Quinta) (conhecido pelo Tó) (o Tó Costa Quinta que é a mesma coisa) (que bebe
como uma esponja) (é dos tais que não não se altera porque é realmente bem
educado) (mas que chega a qualquer sítio e ao fim de cinco minutos está a falar
sobre guerra a guerra de África) (e até acabar até se ir embora fala sobre a guerra
de África) (eu acho eu só tenho um termo em francês para definir um tipo destes)
(é um emmerdeur)
d. Prosodic groups
[terrível C0] [não é C0n]
[[eu aliás C1 conheço C2] [um médico Cc]]
[[que é C2] [o Costa Cn Quinta Cn o António Cn Costa Quinta Cn conhecido pelo
Tó Cc]]
[o Tó C2 Costa Quinta Cc]
[[que é a mesma coisa C2] [que bebe C1 como uma esponja Cc]]
[[é dos tais C2] [que não Cn não se altera C2] [porque é realmente bem educado Cc]]
[mas que chega C1 a qualquer sítio Cc]
[e ao fim C1 de cinco minutos Cc]
[[está a falar C2] [sobre guerra Cn a guerra Cn de África Cc]]
[[e até Cn acabar C2] [até se ir embora Cc]]
[[fala C1] [sobre a guerra Cn de África C0]]
The above sentence uses complex contours Cc in order to differentiate a long
sequence of prosodic groups. The structure inside these groups is indicated as
expected by C1 and C2 contours as in [[que é a mesma coisa C2] [que bebe C1
como uma esponja Cc]]. The speaker uses C1 inside the group when there is no
other contrast to maintain, as in [mas que chega C1 a qualquer sítio Cc], and C2
and C1 for more complex hierarchies.
[[eu acho C1] [eu só tenho um termo Cn em francês C1]]
[para definir Cn um tipo Cn destes Cn é um emmerdeur C0]
In this last sentence, the structure uses the contrasts between Cn, C1, and C0, a
simpler choice than between C1, C2, and Cc, for example.
e. Alignment text macrosegments – prosodic macrosegments
By aligning the text and prosodic macrosegments, the congruent and non-
congruent sequences are shown below.
(terrível não é)
[terrível C0] [não é C0n]
(eu aliás conheço um médico que é o Costa Quinta)
(o António Costa Quinta) (conhecido pelo Tó)
[[eu aliás C1 conheço C2] [um médico Cc]] [[que é C2] [o Costa Cn Quinta C0n o
António Cn Costa Quinta Cn conhecido pelo Tó Cc]]
(o Tó Costa Quinta que é a mesma coisa)
[o Tó C2 Costa Quinta Cc] [que é a mesma coisa C2]
(que bebe como uma esponja)
[que bebe C1 como uma esponja Cc]
(é dos tais que não não se altera porque é realmente bem educado)
[[é dos tais C2] [que não Cn não se altera C2] [porque é realmente bem
educado Cc]]
(mas que chega a qualquer sítio e ao fim de cinco minutos está a falar sobre guerra a
guerra de África)
[mas que chega C1 a qualquer sítio Cc] [e ao fim C1 de cinco minutos Cc] [[está a
falar C2] [sobre guerra Cn a guerra Cn de África Cc]]
(e até acabar até se ir embora fala sobre a guerra de África)
[[e até Cn acabar C2] [até se ir embora Cc] [fala C1] [sobre a guerra Cn de
África C0]]
(eu acho eu só tenho um termo em francês
[[eu acho C1] [eu só tenho um termo Cn em francês C1]]
para definir um tipo destes) (é um emmerdeur)
[para definir Cn um tipo Cn destes Cn é um emmerdeur C0]
Conclusion
Separating sentence prosody from sentence text is essential in macrosyntax,
as it is in the classical analysis of read sentences. The asymmetry between the two
separately conducted analyses may reveal interesting and surprising speaker
strategies: grouping more than one text macrosegment into the same prosodic
group and, conversely, spreading a single text macrosegment over a number
of prosodic groups. The evaluation of these asymmetries may lead to a better
classification and a better understanding of various styles of spontaneous
speech. Perfect congruence is probably the key to good comprehension by
the audience, given the resulting reduction of cognitive load, since every
occurrence of an N400 (semantic discrepancy) or P600 (syntactic
discrepancy) consumes listeners’ energy.
9 Applications
Teaching French prosodic structure

The prosodic structure operates the first structuration of the sentence by the
speaker, after which successive hierarchical organizations of linguistic units
are built with morphology and syntax markers. The prosodic structure is also
the first information pertaining to the organization of linguistic units received
by the listener. It is therefore most important for the speaker to encode properly
the prosodic structure, as listeners would start to process sentence information
from a proper interpretation of the organization encoded by prosody, proces-
sing stress groups into higher-level linguistic units and assembling these groups
hierarchically to eventually access the overall syntactic structure and meaning
of the whole sentence.
In this prosodic decoding process, the listener will not only decode the
linguistic information brought by the prosodic structure, but also use the
information provided by socio-geographic variations observed in the realization of
prosodic events. These variations are usually detected from the rhythmic changes
in syllable duration in stress groups as well as from the details of the
realization of melodic contours (Martin, 2012c).
They inform listeners about the speaker’s (claimed) social and geographic
characteristics, provided the listener has acquired a categorization grid pertaining
to these categories (Martin, 2013c). These phonetic variations of melodic contour
realizations are similar to the phonetic variations observed in the realization of
vowels in regional French or Italian, for instance.
These observations lead to a radical change in the teaching approach of a
foreign language. Indeed, the most basic feature to acquire and the one that is
altogether of paramount importance is not the lexicon, nor the morphology or
syntax of the language to learn, but rather the specific “music of the sentence,” i.e.
the prosodic encoding proper to the language. The idea is not new, but so far it does
not seem to have received much attention. For French, the steps that should
precede the presentation of any other linguistic units may be described as follows.
Using nonsense sequences of syllables, such as lala, . . . la, for example,
stress groups varying from one to about seven syllables (the number of
syllables in a stress group is limited by short-term memory capabilities) can be
assembled in a somewhat larger number of configurations in order to represent
the “music” of a typical French sentence. A single stress group can thus be
formed with a sequence of one to seven syllables, the last syllable being stressed
(underlined in the following sequence):
la, lala, lalala, lalalala, lalalalala, lalalalalala, lalalalalalala.
Each of these seven stress groups can be pronounced with a falling or a rising
contour, correlated with a declarative or interrogative modality of the sentence
formed with this single stress group, for example declarative lalalala\\, and
interrogative lalalala//. “\\” and “//” symbolize respectively a falling and low
conclusive declarative contour, and a rising conclusive interrogative contour.
The phonological variants of these two basic conclusive contours can also be
considered (Fig. 9.1): a sharp (i.e. with large melodic change) falling contour
for an imperative modality, a bell-curved falling contour for implicative mod-
ality (i.e. an “evidence” contour), a sharp (i.e. with large melodic change) rising
contour for surprise, and a rising contour ending in a bell shape for a doubt
modality variant (see Chapter 5).
These basic sequences can then be made more complex to form prosodic
structures with two, three, . . ., n stress groups, knowing that sentences with
declarative melodic modality (thus ending with \\ whatever the syntactic or
morphologic marks present in the text) with two stress groups can be organized
only one way by the prosodic structure, three stress groups can be structured
prosodically in 3 ways, four stress groups in 11 ways, five stress groups in 45
ways, etc. (Martin, 1987). For example, a sentence composed of three stress
groups can be hierarchically structured as shown in Figure 9.2.
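The counts quoted above (1, 3, 11, 45) can be reproduced by counting every way of grouping an ordered sequence of stress groups into nested groups of at least two members each. The Python sketch below assumes that this is the counting intended; the function name is illustrative. The resulting series appears to coincide with the little Schröder (super-Catalan) numbers.

```python
def count_prosodic_structures(max_groups):
    """Return [t(1), ..., t(max_groups)], where t(n) is the number of
    hierarchical groupings of n ordered stress groups in which every
    group has at least two members."""
    t = [0] * (max_groups + 1)  # t[n]: structures covering n stress groups
    g = [0] * (max_groups + 1)  # g[n]: sequences of one or more structures
                                #       that together cover n stress groups
    if max_groups >= 1:
        t[1] = g[1] = 1         # a single stress group has one structure
    for n in range(2, max_groups + 1):
        # The top-level group is a sequence of at least two sub-structures:
        # choose the size i of the first one, then cover the remaining
        # n - i stress groups with one or more further sub-structures.
        t[n] = sum(t[i] * g[n - i] for i in range(1, n))
        # A sequence covering n groups is either one structure (t[n] ways)
        # or at least two structures (again t[n] ways).
        g[n] = 2 * t[n]
    return t[1:]


print(count_prosodic_structures(5))
```

For three stress groups the function returns 3, corresponding to the flat structure and the two structures with one embedded pair.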
Figure 9.1 Variants of conclusive contours: declarative, imperative,
implicative, interrogative, surprise, doubt.
[lalala / ] [lalala / ] [lalala \\]: rising, rising, conclusive falling
[[lalala - lalala / ] lalala \\]: neutralized flat, rising, conclusive falling
[lalala / [lalala - lalala \\]]: rising, neutralized flat, conclusive falling
Figure 9.2 Three prosodic structures organizing a sequence of three stress
groups (the arrows represent dependency relations “to the right”).
Even without considering the presence of secondary (emphatic) stress
(usually located on the first syllable of content words in French, and
manifested by a rising melodic contour on the associated syllable), the total
number of possible prosodic structures that can organize a given text
independently from syntax is thus very large, but is in practice limited by
constraints imposed by association rules between the syntactic and the
prosodic structures, namely:
1. The stress clash constraint triggered by two consecutive stressed syllables in
the sentence, resulting either in the introduction of a short pause or in a
shift of the first stress to the preceding syllable (if there is one) (see Chapter 7).
This latter possibility simply results from a mechanism indicating the
formation of a larger prosodic word (see Chapter 5).
2. The syntactic clash constraint, which discourages the formation of a stress
group containing syllables corresponding to syntactic units depending on
other units external to the one present in the stress group (see Chapter 7).
This syntactic clash constraint would, for example, prevent a sequence of
stress groups such as *[nous observons la] [médicalisation] “we observe
medicalization” since the article la is syntactically linked to the noun
médicalisation. Note that a somewhat well-known TV journalist in France
(Arte Channel) regularly realizes such stress groups, possibly to obtain a
stylistic effect.
3. Eurhythmicity also constitutes a factor that could favor the association of a
more balanced prosodic structure, i.e. with a similar number of syllables at
various levels (but mainly at the first level) in the prosodic structure. In the
classical example Marie adore les chocolats suisses “Mary loves Swiss
chocolates”, speakers will frequently favor a prosodic grouping such as
[Marie adore] [les chocolats suisses] rather than a structure congruent to
syntax that would be in this example [Marie] [[adore] [les chocolats suisses]].
4. The maximum number of syllables contained in a single stress group is
typically of the order of seven, but is actually limited by the maximum
duration of a stress group (about 1,250 ms).
The application of these constraints limits the number of prosodic structures
obtained by combinatorial analysis and reveals the relative dependency of
prosody on syntax. Software programs such as WinPitch LTL allow the design
of exercises that automatically align the learner’s performance with
a model sentence. This function automatically locates the learner’s errors,
in particular in prosodic realizations. An author mode allows specific
highlighting of melodic contours on the model sentence. Corresponding melodic
contours realized by the learner are then automatically retrieved and
highlighted, giving instantaneous feedback.
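The fourth constraint above can be illustrated with simple arithmetic: if the real limit is the duration of a stress group (about 1,250 ms), the admissible number of syllables depends on how long each syllable lasts. The Python sketch below is hypothetical. The function names are illustrative, and the syllable durations used in the example are assumed values chosen for the arithmetic, not measurements.

```python
MAX_GROUP_MS = 1250  # approximate maximum stress-group duration quoted above


def max_syllables(mean_syllable_ms):
    """Largest number of syllables of the given mean duration (in ms)
    that fits within the maximum stress-group duration."""
    return MAX_GROUP_MS // mean_syllable_ms


def fits_stress_group(syllable_durations_ms):
    """True if the syllables, taken together, do not exceed the limit."""
    return sum(syllable_durations_ms) <= MAX_GROUP_MS


# Assumed, purely illustrative mean syllable durations (ms).
for mean_ms in (250, 175, 125):
    print(mean_ms, max_syllables(mean_ms))
```

With an assumed mean of 175 ms per syllable the limit is seven syllables, which is consistent with the typical order of magnitude given in the text. Slower or faster delivery lowers or raises that number.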
Silent reading
When we read, either silently or aloud, we generate speech sounds according to
the available information given in the written text. In this process, we also
necessarily generate a prosodic structure, which hierarchically organizes stress
groups (minimal syllabic chunks containing a single lexical or group stress),
into stress phrases, called in the Autosegmental-Metrical (AM) model ip
(intermediate phrases) and, at a higher level, IP (Intonational
Phrases), whose sequences constitute the whole utterance intonation.
It is noticeable that this prosodic structure (re)generation is essential to help
the reader to comprehend the text, and there is apparently no way to avoid it.
Therefore, the complete reading process may be constrained by the rules govern-
ing the elaboration of the sentence prosodic structure when speaking, and in
particular the minimal and maximal duration of stress groups (Martin, 2013b).
Indeed in silent reading, it appears impossible to proceed without subvoca-
lization, i.e. without hearing a voice in one’s head that corresponds to a voice
reading the text aloud, including the realization of stressed syllables. For this
reason, silent reading may be subject to the same prosodic constraints as
reading aloud. These constraints may interact with or even supersede constraints
established for eye movement while reading. In particular, they may eventually
lead to a new explanation pertaining to the maximum number of words that can
be processed in fast reading.
However, subvocalization may be avoided somewhat by scanning the text
rapidly, bypassing all markers of the syntactic structure, retaining only assumed
key words and therefore being relieved of the necessity to generate a prosodic
structure for the text comprehension. This is the basis of the speed reading
exercises offered from time to time by specialized institutions.
Eye movement
When reading, the eye proceeds in saccades (short rapid movements) to scan the
text, jumping in steps varying from 1 to 20 characters with an average of 7 to 9
characters (forward and backward). In the process, the most frequent fixations
fall on verbal forms and punctuation marks (periods, commas, semicolons, question
marks, etc.). The eye thus jumps constantly to spot these markers, which will
constitute the possible anchors for the prosodic structure to build (Martin, 2011).
Most laboratory speech research on sentence intonation actually investigates
this process thoroughly on read speech before considering the prosodic features
of spontaneous, non-prepared speech. For example, whereas a period ending a
written sentence is normally associated with a falling conclusive prosodic
contour, the other punctuation marks and the verbal forms must be dynamically
associated with an appropriate prosodic contour such as a
continuation majeure C1, i.e. a boundary tone in the AM model.
Each saccade takes 20 ms to 40 ms to bring the eye to a fixation point,
whereas each fixation lasts between 100 ms and 500 ms (Sereno & Rayner,
2003). The fixation state of the eye allows the fovea, the central part of the
retina, to scan the selected written information with high resolution, whereas
peripheral information is viewed in less detail.
Owing to the complex muscular mechanisms for speech production, oral
(i.e. aloud) reading is slower than silent reading. However, the puzzling
aspect of silent reading lies in its limitations. Despite a number of ques-
tionable commercial claims stating that fast readers could read up to 3,000
words per minute (about 50 words/sec, but by skimming on content words
only), the fast reading process is limited by subvocalization, the effect of
hearing a voice in one’s head while reading silently (which some authors
curiously attribute to the way we learn to read at school; Nowak, 2012).
By scanning only assumed key words, access to the text’s meaning will
derive from only a long list of rapidly selected words, with no hierarchical
organization and therefore no syntactic structure.
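The timing figures quoted in this section give a rough idea of why a claim of 3,000 words per minute is questionable. The Python sketch below is a back-of-the-envelope estimate under explicit assumptions that are not in the text: an average of 6 characters per word including the following space (a hypothetical value), forward saccades only, and one fixation per saccade. The function names are illustrative.

```python
def chars_per_second(chars_per_saccade, saccade_ms, fixation_ms):
    """Characters covered per second when each saccade of the given size
    is followed by one fixation (durations in milliseconds)."""
    return chars_per_saccade * 1000 / (saccade_ms + fixation_ms)


def words_per_minute(chars_per_sec, chars_per_word=6):
    """Convert a character rate to words per minute.
    chars_per_word=6 is an assumed average, not a figure from the text."""
    return chars_per_sec / chars_per_word * 60


# Slowest and fastest combinations for the average saccade size (7 to 9 chars).
slow = words_per_minute(chars_per_second(7, 40, 500))
fast = words_per_minute(chars_per_second(9, 20, 100))
# Most generous case: the largest quoted saccade (20 chars) at the fastest timing.
generous = words_per_minute(chars_per_second(20, 20, 100))

CLAIMED_WPM = 3000
print(round(slow), round(fast), round(generous), CLAIMED_WPM)
```

Under these assumptions the fastest average case comes to 750 words per minute, and even the most generous case stays well below the claimed 3,000, which is compatible with the remark that such rates can only be reached by skimming.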
Subvocalization
Subvocalization does not pertain to the mechanical control of the articulatory
muscles, but to the perception of the speech signal, which is recovered
by reading. The invention of writing has precisely this function, allowing not
only reading aloud but also reading silently, i.e. “to talk to oneself in silence.”
Indeed, writing is a shorthand notation system for speech sounds and not for
articulatory movements, contrary to what supporters of the motor theory of
speech perception claim (Liberman & Mattingly, 1985).
Other systems such as pictograms bypass the generation of speech sounds by
directly associating signifier and signified, giving access to the meaning without
going through language units, syllables, words, prosodic words, syntagms, etc.
A road STOP sign may indeed be read aloud or silently, but is more frequently
directly associated with its meaning, i.e. to stop and give way on the road.
Likewise, well-known dates written with numbers, e.g. 1789, may be read as
“seventeen hundred eighty nine,” but the constant use of symbols not
corresponding directly to syllables and words leads more frequently to direct
access to the signified (the French Bastille Day). The passage to the status of
pictogram depends of course on the familiarity of the reader with the object and
its frequency of occurrence. Fast reading by scanning key words relies on this
process.
Writing systems using ideograms, for example that of Mandarin, also involve
subvocalization in silent reading. Learning Mandarin without being concerned with
the pronunciation of ideograms may be possible but somewhat difficult, as many
words are plurisyllabic, which means the reader must deal with a combination
of ideograms (Marshall Unger, 2003). One could, however, associate other
sounds with ideograms, such as English words, but the mediation
of speech sounds and therefore of a prosodic structure cannot be avoided.
Commercial US-based fast reading “schools” claim that they can remove
subvocalization, or at least minimize it. The underlying idea is to transform
every word into a pictogram, so that it will not be pronounced silently when read.
Other techniques recommend using a pencil to determine eye fixation targets
and accelerating the number of saccades. One application even proposes dis-
playing only lexical words sequentially on a computer screen with a user-
adjustable speed (this approach incidentally corresponds to the definition of
Accent Phrases in the AM model, i.e. one content word for each Accent
Phrase). Comprehension should then be achieved without any prosodic or
syntactic structure linking the read words together. This is equivalent
to reading a list of items.
Faster readers claim speeds from 400 to 800 wpm (words per minute). With
an average of about three (written) words per stress group, 800 wpm
converts into about 267 stress groups per minute, or 267/60 ≈ 4.4 stress groups
per second. The minimal average duration between silently read stress
groups would thus be about 225 ms at 800 wpm, the best-observed performance
for speed readers (Dunning, 2010), again limited by the unavoidable
reconstitution of stress groups and their associated prosodic structure.
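The conversion from reading speed to stress-group timing can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming the average of three written words per stress group mentioned above; the function name is illustrative and does not come from any existing tool.

```python
def stress_group_timing(words_per_minute, words_per_stress_group=3.0):
    """Return (stress groups per second, mean duration of a stress group in ms).

    The default of three words per stress group is the average assumed in the text.
    """
    groups_per_minute = words_per_minute / words_per_stress_group
    groups_per_second = groups_per_minute / 60.0
    duration_ms = 1000.0 / groups_per_second
    return groups_per_second, duration_ms


# The two reading speeds quoted for fast readers.
for wpm in (400, 800):
    rate, duration = stress_group_timing(wpm)
    print(f"{wpm} wpm: {rate:.2f} stress groups/s, about {duration:.0f} ms each")
```

At 800 wpm the sketch gives about 4.4 stress groups per second, i.e. roughly 225 ms per stress group, which is close to the 250 ms minimum discussed in this chapter.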
As no actual acoustical speech production is involved, silent reading is much
faster than reading aloud, where multiple muscular commands must be executed.
Still, although eye saccades and eye fixations can operate much faster, for example
while reading pictograms, subvocalization, unavoidable in silent reading, limits
reading speed. Indeed, since subvocalization implies the generation of sentence
prosodic structures as well as sequences of syllables, a prosodic constraint
limiting the minimum duration of an accent phrase to about 250 ms also limits the
speed of the silent reading process, which necessarily has to go through this
prosodic structure regeneration process. These values correspond closely to the fastest
reading performances cited in the literature, about four stress groups per second.
Delta wave synchronization

The similarity between the range of period variation of Delta brain waves and
the duration between stressed syllables, i.e. from 250 ms to 1,250 ms, leads to
the formulation of a hypothesis linking the two events (see Chapter 5), where
stressed syllables would trigger Delta spikes, and not the reverse where Delta
waves would trigger stressed syllables (although they could synchronize the
perception of stressed syllables). The importance of this synchronization is
confirmed by the production of silent reading stressed syllables, and furthermore
by the minimal duration between consecutive stressed syllables as obtained by
the fastest subvocalization speed. The simple fact that this limit apparently
cannot be exceeded suggests that the Delta wave minimal period is responsible for
the transfer of syllabic chunks into another part of memory storing stress groups.
Interestingly, to exceed the Delta maximum period, either in slow silent
reading or while listening, would involve the generation of a simple list of
stress groups, all placed at the same level in the prosodic structure. This is easy
to notice by slowing down recorded speech. At a slower speech rate, i.e. two or
three times slower than normal, the perception of the stressed character of
syllables and the differentiation of melodic contours become difficult. It may
therefore be close to impossible for the listener to restore the prosodic structure
intended by the speaker, as the melodic contours cannot be identified anymore.
However, the synchronization of Delta waves can still take place, by involving
one spike in sequences of two or three, as determined by the maximal Delta
wave period.
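The effect of slowing speech down can be illustrated numerically. The sketch below only encodes the 250 ms to 1,250 ms range quoted above and tests whether intervals between stressed syllables stay inside it after slowing. The interval values are hypothetical and are not measurements from this book.

```python
DELTA_MIN_MS = 250.0    # shortest interval between stressed syllables quoted in the text
DELTA_MAX_MS = 1250.0   # longest interval quoted in the text


def period_to_hz(period_ms):
    """Convert a period in milliseconds to a frequency in hertz."""
    return 1000.0 / period_ms


def within_delta_range(interval_ms):
    """True if the interval lies inside the 250-1,250 ms range."""
    return DELTA_MIN_MS <= interval_ms <= DELTA_MAX_MS


def slow_down(intervals_ms, factor):
    """Stretch every interval by `factor`, as when recorded speech is slowed down."""
    return [interval * factor for interval in intervals_ms]


# Hypothetical inter-stress intervals at normal speech rate (illustration only).
intervals = [300.0, 450.0, 600.0]
for factor in (1, 2, 3):
    slowed = slow_down(intervals, factor)
    print(factor, slowed, [within_delta_range(t) for t in slowed])
```

With these assumed values, all intervals remain in range at normal speed and at half speed, whereas at one third of the normal speed two of the three intervals exceed 1,250 ms.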
10 Conclusion
Of all the key concepts used in this book to analyze prosodic data, probably the
most important is the separation between sentence text and sentence intonation.
Indeed, the idea of separating intonation from text in the phonological analysis
may at first seem difficult to conceive, as these two linguistic objects are tightly
linked together. However, this simple change of point of view from the traditional
way allowed me to investigate the properties of both speech sound productions
separately, even if they are obviously pronounced at the same time by the speaker.
By proceeding this way, I could establish not only that structures organizing
text and intonation units use different types of markers, but also that these
markers operate independently and do not necessarily appear at the same
time during the generation of the sentence. This arrangement may be part of the
explanation for the extraordinary resistance to noise of human language.
Indeed, a localized production or perception error may affect only one “side”
of the sentence, the morphological or syntactic markers of the text or the intonation
contours, without necessarily affecting the other “side.” Furthermore, adopting
an approach that separates text and intonation brings a much easier way of
analyzing difficult examples observed in spontaneous speech, thereby giving
intonation its deserved place in the phonological world.
As a sort of conclusion, I would like to quote the author Frédéric Dard, now
deceased, who, while writing detective stories, demonstrated his deep under-
standing of many linguistic mechanisms in phonology, syntax, semantics, and
pragmatics simply in order to obtain a comic effect. Here are some examples in
the domain of sentence intonation in French, related to topics discussed in this
book, on word alignment of content words, on Postfixes in macrosyntax, on
minimal duration of stress groups, and on syntactic clash. There is even a
prosodic structure that is apparently impossible to pronounce.
Quotes from Frédéric Dard (San Antonio)

This well-known (at the time) French author of popular detective novels
peppered his novels with literary imitations of popular speech, which do not
always correspond to real spontaneous speech production. However, besides
frequent puns, Frédéric Dard also played with linguistic mechanisms and in
particular with intonative ones, which was rarely noticed by literary reviewers.
Here are some examples.
On last syllabic alignment
When the syntactic clash condition is violated, the prosodic word composing the
sentence cannot be interpreted. The
listener (and the reader in this example) cannot immediately retrieve a valid
entry in lexical long-term memory, a rather tedious reanalysis being necessary
to reconstitute the correct sequence (San Antonio, 1969, Fleur de nave
vinaigrette, Fleuve Noir, 1976, p. 118).
kécequin’faitar tircekon là !
(Qu’est ce qui fait tartir ce con là ! “This guy drives me nuts”).
riendéto nantavec labouilleque ta !
(Rien d’étonnant avec la bouille que t’a. “Nothing surprising with the face you have”).
On postfixes
This example presents three Postfixes. Without an
appropriate intonation, i.e. a conclusive contour on the last syllable of connu
ending the Nucleus, and flat melodic contours on the three proper names, which have
no syntactic link with the Nucleus and only a co-reference relationship with the
subject pronoun il and object pronoun l’, only the orthography, with the masculine
form of the past participle connu, leads one to interpret mademoiselle Sarah as
referring to the context (the question is indeed addressed to this person). The
order of the mentioned characters corresponds to the order of the pronouns in
the Nucleus (Sucette boulevard, Fleuve Noir, 1976, p. 154):
vous savez où qu’il l’a connu, M. Robert, M. Moise, mademoiselle Sarah?
“Do you know where he knew him, Mr. Robert, Mr. Moïse, Miss Sarah?”
On minimal duration of stress groups
This is an example of syllabic
detachment (with about 250 ms between each syllable vowel), interestingly
referred to by one of the characters (Bérurier) as separated words. Indeed,
the three syllables constitute three individual prosodic words, and are considered
here as independent lexical words by Bérurier (Morpion circus, Fleuve Noir, 1983,
p. 133):
Tu veux qu’j’te résume en trois mots? En-ra-gée! (Bérurier)
“Do you want me to summarize in three words: mad!”
On syntactic clash
The next example pertains to syntactic alignment.
With an example such as une main conquérante comme Guillaume le “a
conquering hand like William the [Conqueror],” the reader is forced to stress the
last syllable of the sentence, le, normally not stressed, and to process an awkward
prosodic word Guillaume le, which in turn leads to the implicit restitution of le
Conquérant and finally to the phrasing [une main conquérante] [comme
Guillaume le Conquérant], semantically awkward itself (Remets ton slip,
gondolier, Fleuve Noir, 1977).
An impossible prosodic structure
A final interesting example, using
syntactic engraft (greffe), is given in the book Tarte à la crème story (Fleuve
Noir, 2001, p. 165): Mais onc n’a remarqué mon manège à moi c’est toi. This
sentence results from the concatenation of Mais onc n’a remarqué mon manège
“But no one noticed what I was doing” and mon manège à moi c’est toi “my
ride to me is you,” referring to the lyrics of a famous song performed by Edith
Piaf (mon manège à moi).
In the same sentence, the verbal phrase object mon manège is also part of the
subject of the verbal phrase that follows, mon manège à moi c’est toi, making
the association of a congruent prosodic structure impossible. One of the possible
solutions would then be to duplicate the object mon manège as a subject mon
manège: Mais onc n’a remarqué mon manège, mon manège à moi c’est toi.
11 WinPitch
WinPitch is a software program devoted to acoustic analysis of speech. It
includes, as its name suggests, specialized functions for research in prosody.
WinPitch has been continuously developed since 1995 and runs under
Windows (any flavor) on PC and, with a Windows emulator, on Mac personal
computers. Many original functions allow effective acoustical analysis of
large-scale speech corpora, as demonstrated by its use in the C-ORAL-ROM project
(2001), which assembled large transcribed and aligned recordings of spontaneous
speech dealing with similar topics in French, Italian, Spanish, and
European Portuguese.
The program screen is divided into four sections: (1) a command section,
with specialized windows grouping the essential parameters related to a parti-
cular function (e.g. recording, playback, prosodic morphing, transcription,
alignment, statistical analysis, etc.), (2) a navigation window, displaying the
speech waveform, (3) an analysis window, displaying a spectrogram, funda-
mental frequency, and intensity curves as well as the waveform related to the
speech section selected in the navigation window, and (4) a data retrieval
window displaying text in aligned segments for easy retrieval of corresponding
speech data (Fig. 11.1).
WinPitch can handle stereo signals and display the resulting analyzed
parameters in different colors. The program can also analyze multimedia recordings
(many video formats are supported, such as avi, mp4, wmv, flv, etc.) while
keeping all the other features functional, such as reduced-speed speech
playback.
Sound recording made clear

Among the unique features not found in other popular programs such as
Transcriber or Praat, real-time spectrographic display allows visual monitoring
of speech recordings. This is especially useful as research speech corpora are
seldom recorded by sound engineers, but rather by untrained researchers,
which often results in poor-quality sound recordings (background noise,
echo, wrongly adjusted recording level, microphone filtering, etc.). Poorly
recorded speech samples can make syllabic prominence and fundamental
frequency analysis difficult or impossible.
Figure 11.1 WinPitch command, alignment, navigation, and analysis windows.
Since today most personal computers contain a sound card, it is very
easy to implement a speech-monitoring system by merely adding an
appropriate microphone while running WinPitch. This function is equally
useful for teaching phonetics classes, as spectrograms and fundamental
frequency curves are viewed in real time by learners, making the correlation
between speech sounds and their acoustical analysis easier to grasp.
Sound and video

WinPitch can directly handle long files, either stored in RAM or on the hard
disk, or accessed through a sliding window (appropriate for very large
video files whose sound part exceeds the machine memory capacity). In the
sliding window mode, the user first loads a short sample of the file to identify
its format. The user can then select a starting point and an appropriate
duration to extract and load in memory the sound part of the video file,
automatically modifying the sampling frequency and the recording format
if desired (Fig. 11.2). When played back, all the WinPitch functions operate
on the speech signal, displaying the synchronized video part at the same time.
Furthermore, dedicated converters handle mp3 and CD sound files directly.
Selecting a slower playback sound speed will always result in a synchronized
corresponding video display.
Figure 11.2 Example of video analysis.
Transcription and alignment on the fly

Aside from classical transcription tools (with automatic segmentation in short
sections, automatic segmentation in syllables and phones [see Fig. 11.6], and
user-defined variable playback speed), WinPitch has a unique function allow-
ing easy alignment of recordings already transcribed but not aligned, as fre-
quently found in online corpora or elsewhere.
This function is especially useful in the case of poorly recorded examples, where
automatic alignment is ineffective. It allows the user to click on any unit of text
(whether words, syntagms, or whole sentences) while the speech is played
back at a user-selectable reduced speed (down to seven times slower than real time).
See Figure 11.3.
Figure 11.3 Assisted alignment by slowing down speech playback. At
each mouse click on a unit of text perceived at slower speed (top right
window), bidirectional pointers are generated automatically between the
corresponding speech segment (bottom right window) and the text
database (left window). The mouse wheel controls playback speed, and
a single or double click of the right mouse button replays the current
segment or the preceding segment, respectively.
On-the-fly alignment allows an easy and close to real-time alignment of
already transcribed text even for poorly recorded examples, since the
difficult task of automatic speech recognition is transferred to the more
efficient human recognition capabilities. The whole process avoids a
time-consuming segment-by-segment alignment if the speech transcription
is available but not aligned. Other WinPitch modes of transcription
include automatic segmentation based on silence or pause boundaries,
where the user directly enters the text corresponding to predefined
segments.
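Segmentation based on silence or pause boundaries can be sketched in a few lines. The code below is a minimal illustration of the general technique and is not WinPitch's implementation: the amplitude threshold, the minimum pause duration, and the frame length are assumptions chosen for the demonstration, and the test signal is synthetic.

```python
import math


def segment_on_silence(samples, rate, threshold=0.02, min_pause_s=0.2, frame_s=0.01):
    """Return (start_s, end_s) pairs for stretches of signal separated by pauses.

    A frame is silent when its mean absolute amplitude is below `threshold`;
    a pause is a run of silent frames lasting at least `min_pause_s`.
    """
    frame_len = max(1, int(round(rate * frame_s)))
    n_frames = len(samples) // frame_len
    silent = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        silent.append(sum(abs(x) for x in frame) / frame_len < threshold)

    min_pause_frames = max(1, int(round(min_pause_s / frame_s)))
    segments = []
    start = None      # first frame of the segment currently open, if any
    silent_run = 0    # number of consecutive silent frames seen so far
    for i, is_silent in enumerate(silent):
        if is_silent:
            silent_run += 1
            if start is not None and silent_run == min_pause_frames:
                # The pause is long enough: close the segment where silence began.
                end = i - silent_run + 1
                segments.append((start * frame_len / rate, end * frame_len / rate))
                start = None
        else:
            if start is None:
                start = i
            silent_run = 0
    if start is not None:
        # Close a segment still open at the end, dropping any trailing silence.
        end = n_frames - silent_run
        segments.append((start * frame_len / rate, end * frame_len / rate))
    return segments


# Synthetic demonstration: 0.5 s silence, 1 s tone, 0.5 s silence, 0.5 s tone.
RATE = 1000  # samples per second, kept low so that the demonstration stays fast


def tone(duration_s):
    return [0.5 * math.sin(2 * math.pi * 100 * n / RATE) for n in range(int(duration_s * RATE))]


def silence(duration_s):
    return [0.0] * int(duration_s * RATE)


signal = silence(0.5) + tone(1.0) + silence(0.5) + tone(0.5)
segments = segment_on_silence(signal, RATE)
print(segments)
```

Each returned pair delimits a segment for which the user would then type the corresponding text. On real recordings the threshold would have to be adapted to the background noise level.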
The program also automatically generates an IPA transcription and
morphological and syntactic labeling (Fig. 11.4).
Manual fine-tuning of on-the-fly alignment can be easily accomplished by
displaying an underlying narrowband spectrogram, which can also be used to
align overlapping sections of speech (Fig. 11.5).
Data mining for large speech corpora

Transcribed and aligned data can be easily retrieved: (1) by merely
selecting with the mouse the desired section of text, (2) by selecting an
entry in a dynamically built lexicon containing all text entries (including
IPA transcriptions or morphological and syntactic labeling generated
automatically – see Fig. 11.4), (3) by using a table of text segments, or
(4) by entering the text searched for with its optional left and right
context.
Figure 11.4 Automatic IPA transcription from orthographic text and
morphological and syntactic labeling.
Figure 11.5 Fine-tuning of speech segment limits with the help of a
simultaneously displayed spectrogram (which allows precise segmentation
in the case of overlapping speakers).
Figure 11.6 Automatic segmentation from spectrographic transitions.
Native output formats are XML (for alignment files) and a proprietary WP2
format (which includes all the annotations, text, highlighting, F0 tracking
parameters, etc. as defined by the user).
WinPitch also includes an automatic prominence analyzer operating
from built-in automatic syllabic detection or from automatic syllabic or
phone segmentation. An automated lookup tool is integrated in the
program for automatic syntactic and morphological labeling as well as
IPA transcription, using data extracted from a large lexicon in Excel®
format.
Any speech segment can be labeled and highlighted in a user-defined
color to be exported to Excel in a single mouse click. Sophisticated data
analysis can then be executed later using Excel predefined or user-defined
scripts.
A batch mode allows the automatic playback (and acoustical analysis) of
speech segments as defined in a concordance program, giving the search text
together with its left and right contexts (Fig. 11.7). This mode presently
operates from an Excel file, loads the corresponding sound file, and retrieves
the context-defined text automatically. An interesting application of this batch
mode is described below.
Figure 11.7 Batch processing of a large set of conjunctions obtained from a
concordance analyzer with their left and right contexts (columns: sound file
reference, left context, text to retrieve, right context).
WinPitch functions include an integrated concordancer. Figures 11.7 to
11.10 illustrate the details of the operations involved. In Figure 11.8 the user
enters the key word parce que taken as an example, selects an appropriate
alignment source format (Transcriber *.trs in this example), and clicks on
any of the file names stored in the same directory. This directory should
contain all the alignment files of interest in the same format, together with
their corresponding sound files (six formats are available: Transcriber, Praat,
CRF, Necte, WinPitch, XML). In the case of Praat TextGrid files, the
corresponding sound files must have the same name as their TextGrid counterpart,
as Praat TextGrid files do not contain any reference to their corresponding
speech file.
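The naming constraint on Praat TextGrid files can be checked before a search is run. The following sketch pairs each TextGrid file in a directory with a sound file of the same base name. It is only an illustration: the list of sound extensions and the file names are assumptions, and nothing here reflects how WinPitch itself resolves the files.

```python
import tempfile
from pathlib import Path

# Assumed extensions, for illustration only; not WinPitch's actual list.
SOUND_EXTENSIONS = (".wav", ".mp3")


def pair_textgrids_with_sounds(directory):
    """Map each TextGrid file name to the sound file sharing its base name, or None."""
    directory = Path(directory)
    pairs = {}
    for grid in sorted(directory.glob("*.TextGrid")):
        match = None
        for extension in SOUND_EXTENSIONS:
            candidate = grid.with_suffix(extension)
            if candidate.exists():
                match = candidate
                break
        pairs[grid.name] = match.name if match else None
    return pairs


# Demonstration on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("interview1.TextGrid", "interview1.wav", "interview2.TextGrid"):
        (Path(tmp) / name).touch()
    pairs = pair_textgrids_with_sounds(tmp)

print(pairs)
```

A TextGrid mapped to None in the result has no sound file with a matching name and would need to be renamed or completed.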
An Excel table listing all found occurrences of the key word is immediately
generated (Fig. 11.9). This operation is very fast: in the example of parce que,
scanning 104 files and finding 1,194 occurrences takes less than one
second.
When the user clicks on any line of the Excel table, a specific occurrence of
the keyword is selected together with its left and right contexts. The corre-
sponding text and speech segments are automatically displayed, as shown in
Figures 11.10 and 11.11.
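The principle of such a concordance table (key word with its left and right contexts, linked to a sound file reference) can be sketched as follows. This is a simplified illustration and not the WinPitch concordancer: the aligned transcription is represented by a plain list of tuples rather than parsed from Transcriber or Praat files, punctuation is not handled, and the file names and sentences are invented.

```python
def concordance(segments, keyword, context_words=5):
    """Return one (source, start_s, left, match, right) tuple per occurrence.

    `segments` is a list of (source, start_s, text) tuples standing in for
    an aligned transcription; matching is case-insensitive and word-based.
    """
    key = keyword.lower().split()
    hits = []
    for source, start_s, text in segments:
        words = text.split()
        lowered = [word.lower() for word in words]
        for i in range(len(words) - len(key) + 1):
            if lowered[i:i + len(key)] == key:
                left = " ".join(words[max(0, i - context_words):i])
                match = " ".join(words[i:i + len(key)])
                right = " ".join(words[i + len(key):i + len(key) + context_words])
                hits.append((source, start_s, left, match, right))
    return hits


# Hypothetical aligned segments: (sound file, start time in seconds, text).
aligned = [
    ("rec01.wav", 12.4, "je suis parti parce que il pleuvait"),
    ("rec02.wav", 3.1, "Parce que c'est comme ça"),
    ("rec02.wav", 7.8, "on verra bien demain"),
]
hits = concordance(aligned, "parce que")
for hit in hits:
    print(hit)
```

Each resulting row carries the sound file reference and start time, which is what allows a click on a table line to bring up the corresponding speech segment.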
Integrating this function in one single software package makes possible
specific research topics on prosody that would have been seen as too time-
consuming previously.
Figure 11.8 Entering the key word “parce que” and selecting a Transcriber file.
Acoustic analysis
Since pitch-tracking algorithms are so far prone to errors in adverse recording
conditions, and given that for a particular speech segment some algorithms
are less prone to errors than others are, WinPitch includes six different pitch-
tracking routines to evaluate the fundamental frequency (spectral comb, spec-
tral brush, autocorrelation, AMDF (Average Magnitude Difference Function),
spectral fit, harmonic selection).
These algorithms and their related parameters can be independently
applied on user defined segments of the speech wave, in order to use
the most appropriate scheme in a given speech section of the recording.
The spectral comb and spectral brush are especially resistant to noise and
absence of some harmonics in the spectrum (Fig. 11.12). WinPitch
also includes a scanning feature allowing a quality analysis of the recording
in terms of fundamental frequency coherence, transition, and presence
of creak so that the user can easily retrieve speech segments with F0
tracking problems.
The measurement of fundamental frequency is particularly sensitive
to recorded speech signal distortions due to (1) poor signal-to-noise
ratio, (2) filtering of low frequencies, eliminating low harmonics for
male voices, (3) various spurious components due to room echo in the
recording places, (4) encoding in formats such as mp3 or wma with
excessive compression levels, (5) external sound sources (car engine,
overlapping speech segments, etc.), and (6) presence of creaky segments
where the fundamental frequency is not really defined.
Figure 11.9 Table generated automatically listing the occurrences of the
entered keyword parce que. The whole process takes less than one second
for a list of 104 files and 1194 occurrences found.
Figure 11.10 Automatic generation of text from alignment files and
selection of the entered key word parce que, highlighted with its
immediate context, producing automatically a display of the spectrogram,
intensity, and pitch curves corresponding to the segment retrieved from the
Excel table.
Figure 11.11 Command window displaying the available pitch tracking
algorithms that can be used on any user selected sections of the speech
recording.
To address these potential problems and to ensure the generation of reliable
F0 data, WinPitch has a catalogue of methods applicable independently on
user-selected speech segments:
a. Spectral comb, obtained by correlation of the signal spectrum with a spectral
comb with variable teeth intervals. The frequency range of the harmonics
retained in the computation is user selectable;
b. Spectral brush, obtained by aligning signal harmonics on a selectable time
window followed by a spectral comb analysis;
c. Cepstrum, evaluation of the periodicity of the log spectrum;
d. Harmonic selection followed by spectral comb, with the retained harmonics
selected by the user from a visual inspection on a simultaneously displayed
narrowband spectrogram;
e. Autocorrelation, operating directly on the speech waveform, available in
three flavors, standard, normed Praat, and Yin, with adjustable window
duration;
f. AMDF: average magnitude difference function, with the window length and
the clipping percentage user adjustable;
g. Period analysis: F0 values are obtained from period measurements based on
pitch markers placed automatically in a first pass and later manually
corrected by the user.
Figure 11.12 Most common sources of errors for F0 tracking (Rhap-D0003, PFC).
These various methods give globally comparable results on good-quality
recordings. However, for lower-quality recordings, the main problems of
analysis are listed in Figure 11.12.
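Two of the waveform-based methods listed above, autocorrelation and AMDF, can be illustrated on a clean synthetic signal. The code below is a textbook-style sketch and not WinPitch's implementation: the search range of 75 Hz to 500 Hz, the tolerance used to avoid octave errors in the AMDF, and the test signal (a 200 Hz fundamental with one harmonic) are all assumptions made for the example.

```python
import math


def f0_autocorrelation(frame, rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 (Hz) as the lag that maximizes the autocorrelation of `frame`."""
    lag_min = int(rate / f0_max)
    lag_max = int(rate / f0_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Plain (unnormalized) autocorrelation: longer lags overlap fewer samples,
        # which favors the shortest period over its multiples.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return rate / best_lag


def f0_amdf(frame, rate, f0_min=75.0, f0_max=500.0, tolerance=0.1):
    """Estimate F0 (Hz) from the first deep dip of the average magnitude difference."""
    lag_min = int(rate / f0_max)
    lag_max = int(rate / f0_min)
    values = {}
    for lag in range(lag_min, lag_max + 1):
        overlap = len(frame) - lag
        values[lag] = sum(abs(frame[n] - frame[n + lag]) for n in range(overlap)) / overlap
    lowest, highest = min(values.values()), max(values.values())
    threshold = lowest + tolerance * (highest - lowest)
    # Take the first lag falling under the threshold, then follow the dip down to
    # its local minimum; this avoids choosing a multiple of the true period.
    lag = next(candidate for candidate in range(lag_min, lag_max + 1)
               if values[candidate] <= threshold)
    while lag < lag_max and values[lag + 1] < values[lag]:
        lag += 1
    return rate / lag


# Synthetic test frame: 50 ms of a 200 Hz fundamental plus its second harmonic.
RATE = 8000
frame = [math.sin(2 * math.pi * 200 * n / RATE) + 0.5 * math.sin(2 * math.pi * 400 * n / RATE)
         for n in range(400)]
print(f0_autocorrelation(frame, RATE), f0_amdf(frame, RATE))
```

On such a clean signal both estimators agree, in line with the remark that the methods give comparable results on good-quality recordings. Their behavior diverges on noisy, filtered, or creaky segments, which is the situation that makes a choice between several methods useful.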
To apply one of these methods, the user first selects an F0 tracking
method in the command window (Fig. 11.11). Then a time window is
selected on-screen with the mouse, guided by visual inspection of an
underlying narrowband spectrogram. When the left mouse button is released,
the corresponding segment of the signal is automatically reanalyzed
with the selected method, replacing F0 data with the newly obtained
values. The new F0 curve segment is displayed in a color specific to
the tracking method chosen so that the user can identify visually on the
overall F0 curve the tracking method pertaining to a specific time
segment. Furthermore, when the cursor is moved onscreen, the
command box corresponding to the F0 tracking method used for the wave
segment under the cursor is displayed dynamically, together with all
parameter values used for that tracking method.
A file containing all the information about corrections made can be saved in
text format, as well as a .pitch file describing the corrected pitch curve to be
exported to Praat.
Prosodic morphing
Another interesting feature of WinPitch, devoted more specifically to
prosodic research, is the prosodic morphing tool, where fundamental
frequency, intensity, and syllabic duration can be easily modified with simple
and intuitive graphic commands. The syllabic (or phone) durations, for
example, can be altered by a single mouse move once syllabic or phone
boundaries have been defined automatically or manually (or imported, for
example, from a Praat TextGrid file).
Automatic segmentation
WinPitch has dedicated functions for automatic segmentation of speech
signals into various levels: speech turn, breath group, pause delimited
group, syllable and phone. This latter capability is based on an innovative
approach mimicking the manual segmentation made on spectrograms by
trained phoneticians. It does not require statistical or neural network training like
most other systems, and is therefore independent of the language
analyzed, with few or no parameters to adjust. Ergonomic and easy-to-use
manual correction commands are also available with this segmentation
function.
Interface with other software

WinPitch can import Transcriber, PFC, and Necte files, among others, and read and
save Praat files (old and new TextGrid format). All data can be exported in
ASCII (with Unicode extension) directly as a text file or into Excel with one
mouse click. It can load wav and many other sound or video files directly, with
resampling to any user-selected sampling frequency. This is especially
important to avoid wasting storage space and computing power by using an
unnecessarily high sampling frequency, since 16,000 Hz or 22,050 Hz is sufficient for
speech recordings.
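The storage argument can be made concrete with a short calculation. The sketch below assumes uncompressed 16-bit mono PCM and ignores file headers. The 44,100 Hz and 48,000 Hz rates are added only for comparison and are not taken from the text.

```python
def pcm_size_bytes(duration_s, sampling_rate_hz, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed PCM audio (file header not counted)."""
    return duration_s * sampling_rate_hz * channels * (bits_per_sample // 8)


def nyquist_hz(sampling_rate_hz):
    """Highest frequency that can be represented at this sampling rate."""
    return sampling_rate_hz / 2


ONE_HOUR_S = 3600
for rate in (16000, 22050, 44100, 48000):
    megabytes = pcm_size_bytes(ONE_HOUR_S, rate) / 1_000_000
    print(f"{rate} Hz: {megabytes:.1f} MB per hour, frequencies up to {nyquist_hz(rate):.0f} Hz")
```

Under these assumptions, one hour recorded at 16,000 Hz takes about 115 MB against about 318 MB at 44,100 Hz, while still covering frequencies up to 8,000 Hz.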
Sound files can be edited (segment deletion, copy and paste) and concatenated,
or “glued” together to form a stereo file from two mono files (in cases
where the same event recorded into two independent files must be analyzed
together). Text can be added (in any color and font) on the analysis window
for illustration purposes to be included in a research paper. The resulting
augmented analysis window can then be exported in a picture format to a text
editor such as Word. Segments of the acoustic analysis
(F0, intensity, waveform, spectrogram) can be highlighted and independently
labeled, for paper illustration and for later selection in Excel (or
another program) for further statistical analysis.
WinPitch can be downloaded from www.winpitch.com and after a thirty-day
trial period, an installation key can be obtained from info@winpitch.com.
References
Aguilera, Marion, Radouane El Yagoubi, Robert Espesser & Corine Astésano (2014)
Event-Related Potential investigation of Initial Accent processing in French, in
Nick Campbell, Dafydd Gibbon & Daniel Hirst (eds.), Social and Linguistic
Prosody: Proceedings of the Seventh International Conference on Speech
Prosody, Dublin: Science Foundation Ireland (SFI), 383–387.
Alkire, Ti & Carol Rosen (2010) Romance Languages: A Historical Introduction,
Cambridge University Press.
Antonetti, Pierre & Mario Rossi (1970) Précis de phonétique italienne: synchronie et
diachronie, Aix-en-Provence: La Pensée Universitaire.
Armstrong, Lilias E. & Ida C. Ward (1931) Handbook of English Intonation (2nd edn.),
Cambridge: W. Heffer.
Astésano, Corine, Mireille Besson & Kai Alter (2004) Brain potentials during semantic
and prosodic processing in French, Cognitive Brain Research 18, 172–184.
Austin, John Langshaw (1962) How to Do Things with Words, Oxford University Press.
Avanzi, Mathieu (2012) L’interface prosodie/syntaxe en français, Brussels: Peter Lang.
Avanzi, Mathieu & Philippe Martin (2007) L’intonème conclusif: une fin de phrase en
soi? Cahiers de linguistique française 28, 247–258.
Avanzi, Mathieu, Anne Lacheret-Dujour & Bernard Victorri (2008) ANALOR: a tool
for semi-automatic annotation of French prosodic structure, in Proceedings of
Speech Prosody 2008: Fourth International Conference on Speech Prosody,
Campinas, Brazil, May 6–9, 119–122.
Avanzi, Mathieu, Nicolas Obin, Anne Lacheret & Bernard Victorri (2011). Toward a
continuous modeling of French prosodic structure: using acoustic features to
predict prominence location and prominence degree, in Proceedings of
Interspeech, Florence, August, 2033–2036.
Avanzi, Mathieu, Lucie Rousier-Vercruyssen, Sandra Schwab, Sylvia Gonzalez & Marian
Fossard, et al. (2013) C-PROM-Task: a new annotated dataset for the study of French
speech prosody, in Proceedings TRASP 2013: Tools and Resources for the Analysis of
Speech Prosody, Aix-en-Provence, August 30, 27–30.
Avesani, Cinzia (1995) ToBIt: un sistema di trascrizione per l’intonazione italiana, in
Atti delle 5e Giornate di Studio del Gruppo di Fonetica Sperimentale (A.I.A.),
Povo (TN), Italy, 85–98.
Badiou, Alain (1969) Le concept de modèle: introduction à une épistémologie
matérialiste des mathématiques, Paris: Maspéro.
Bally, Charles (1944) Linguistique générale et linguistique française, Berne:
Francke.
Baumann, Stefan, Martine Grice & Ralf Benzmüller (2001) GToBI: a phonological
system for the transcription of German intonation, in S. Puppel & G. Demenko
(eds.), Prosody 2000: Speech Recognition and Synthesis, Poznan: Adam
Mickiewicz University, 21–28.
Beckman, Mary E. & Gayle Ayers Elam (1997) Guidelines for ToBI Labelling (Version
3, March 1997), The Ohio State University Research Foundation, www.ling.ohio-
state.edu/research/phonetics/E_ToBI/.
Beckman, Mary & Sun-Ah Jun (1996) K-ToBI (Korean ToBI) labelling convention,
(Version 2), Ms., Ohio State University and UCLA, www.linguistics.ucla.edu/
people/jun/sunah.htm.
Beckman, Mary E. & Janet B. Pierrehumbert (1986) Intonational structure in Japanese
and English, Phonology Yearbook 3, 255–309.
Beckman, Mary E., Manuel Díaz-Campos, Julia Tevis McGory & Terrell A. Morgan
(2002) Intonation across Spanish, in the Tones and Break Indices framework,
Probus 14, 9–36.
Beckman, Mary E., Julia Hirschberg & Stefanie Shattuck-Hufnagel (2005) The original
ToBI system and the evolution of the ToBI framework, in Sun-Ah Jun (ed.),
Prosodic Typology: The Phonology of Intonation and Phrasing, Oxford
University Press, 9–54.
Berrendonner, Alain (1990) Pour une macro-syntaxe, Travaux de linguistique 21, 25–36.
(2003) Grammaire de l’écrit vs. grammaire de l’oral: le jeu des composantes micro et
macro-syntaxiques, in A. Rabatel (ed.), Interactions orales en contexte didactique:
mieux(se) comprendre pour mieux (se) parler et pour mieux (s’) apprendre, Lyon:
PUL, 249–264.
Beyssade, Claire, Élisabeth Delais-Roussarie & Jean-Marie Marandin (2007) The
prosody of interrogatives in French, Cahiers de linguistique française 28,
163–175.
Blanche-Benveniste, Claire (1990) Le français parlé: études grammaticales, Paris:
Éditions du CNRS.
(2000) Approches de la langue parlée en français, Paris: Ophrys.
Blanche-Benveniste, Claire (2003) La naissance des syntagmes dans les hésitations et
répétitions du parler, in J. L. Araoui (ed.), Le sens et la mesure: hommages à Benoît
de Cornulier, Paris: Honoré Champion, 40–55.
(2007) Corpus de langue parlée et description grammaticale de la langue, Langage et
société 121–122(3), 129–141.
Blanche-Benveniste, Claire & Philippe Martin (2011) Structuration prosodique,
dernière réorganisation avant énonciation, Langue française 170, 127–142.
Blanche-Benveniste, Claire, André Valli, Maria Antonia Mota, Raffaele Simone,
Elisabetta Bonvino & Isabel Uzcanga de Vivar (1997) EuRom4: metodo di inseg-
namento simultaneo delle lingue romanze, Florence: La Nuova Italia.
Bocci, Giuliano (2013) The Syntax-Prosody Interface, Amsterdam: Benjamins.
Bolinger, Dwight L. (1961) Contrastive accent and contrastive stress, Language 37, 83–96.
(1972) Accent is predictable (if you’re a mind-reader), Language 48(3), 633–644.
Bonami, Olivier & Elisabeth Delais-Roussarie (2006) Metrical phonology in HPSG, in
Stefan Müller (ed.), Proceedings of the Thirteenth International Conference on
Head-Driven Phrase Structure Grammar, Varna, Bulgaria, July 2006, Stanford:
CSLI Online Publications, 39–59.
Bonvino, Elisabetta, Sandrine Caddéo, Eulalia Vilaginés Serra & Salvador Pippa (2011)
EuRom5, Milan: Ulrico Hoepli.
Boucher, Victor, Annie Gilbert & Philippe Martin (forthcoming) Prosodic words and
brain waves.
Boulakia, Georges (1985) Ambigüité et intonation, in C. Fuchs (ed.), Aspects de
l’ambigüité et de la paraphrase dans les langues naturelles, Berne: Peter Lang.
Bruce, Gösta (1977) Swedish word accents in sentence perspective, Travaux de
l’Institut de Linguistique de Lund, 12. Gleerup: Lund University Press.
Brunot, Ferdinand (1911–1914) Archives de la parole, Gallica, http://gallica.bnf.fr/html/
enregistrements-sonores/archives-de-la-parole-ferdinand-brunot-1911–1914.
Cei, Erica & Bruce Hayes (2013) Italian Stress Study,
http://italianstressstudy.blogspot.fr/.
Chafe, Wallace (1976) Givenness, contrastiveness, definiteness, subjects, topics, and
point of view, Linguistic Inquiry, 25–55.
Chen, Matthew (1970) Vowel length variation as a function of the voicing of the
consonant environment, Phonetica 22(3), 129–159.
Chen, Zen-Yong, Patricia E. Cowell, Rosemary Varley & Yi-Ching Wang (2009) A
cross-language study of verbal and visuospatial working memory span, Journal of
Clinical and Experimental Neuropsychology 31(4), 385–391,
DOI: 10.1080/13803390802195195.
Chitoran, Ioana (2002) The Phonology of Romanian: A Constraint-Based Approach,
New York: Mouton de Gruyter.
Chitoran, Ioana, Alina Maria Ciobanu, Liviu P. Dinu & Vlad Niculae (2014) Using a
Machine Learning Model to assess the complexity of stress systems, in Nick
Campbell, Dafydd Gibbon and Daniel Hirst (eds.), Proceedings of the Sixth
Conference on Speech Prosody, Tongji University Press, Shanghai, 331–336.
Contini, Michel, Jean-Pierre Lai, Antonio Romano, Stefania Roullet, Lurdes de Castro
Moutinho, et al. (2002) Un projet d’atlas multimédia prosodique de l’espace
roman, in Speech Prosody 2002: Proceedings of the First International
Conference of Speech Prosody (Aix-en-Provence, April, 11–13, 2002), Aubenas
d’Ardèche: Lienhart, 227–231.
Cooper, William & John Sorensen (1981) Fundamental Frequency in Sentence
Production, New York: Springer.
C-ORAL-BRASIL (2012) Reference Corpus for Informal Spoken Brazilian
Portuguese, in Helena Caseli, Aline Villavicencio, António Teixeira & Fernando
Perdigão (eds.), Computational Processing of the Portuguese Language:
Proceedings of the Tenth International Conference, PROPOR 2012, Coimbra,
Portugal, April 17-20, Lecture Notes on Artificial Intelligence, Vol. 7243,
Springer, 362–368.
C-ORAL-ROM (2001) Corpus de référence pour les langues romanes orale, www.elda.
org/en/proj/coral/fr/coralrom.html.
C-ORAL-ROM (2005) Integrated Reference Corpora for Spoken Romance Languages,
Studies in Corpus Linguistics, 15, ed. Emanuela Cresti & Massimo Moneglia,
Amsterdam: Benjamins.
CFPP (2000) Corpus du Français Parlé Parisien,
http://ed268.univ-paris3.fr/syled/ressources/Corpus-Parole-Paris-PIII/.
Cresti, Emanuela (2000) Corpus di italiano parlato, Florence: Accademia della Crusca.
Cresti, Emanuela, Massimo Moneglia & Philippe Martin (2002) L’intonation des
illocutions naturelles représentatives: analyse et validation perceptive, in Macro-
syntaxe et pragmatique: l’analyse linguistique de l’oral, Lablita: Università di
Firenze, 173–192.
Debaisieux, Jeanne-Marie & Philippe Martin (2010) Les parenthèses: étude macrosyn-
taxique et prosodique sur corpus, in Marie-José Béguelin, Mathieu Avanzi & Gilles
Corminboeuf (eds.), La parataxe, Vol. II: Structures, marquages et exploitations
discursives, Berne: Peter Lang, 307–339.
Debaisieux, Jeanne-Marie, Henri-José Deulofeu & Philippe Martin (2008) Pour une
syntaxe sans ellipse, in Jean-Christophe Pitavy & Michèle Bigot (eds.), Ellipse et
effacement: du schème de phrase aux règles discursives, Publications de
l’Université de Saint-Etienne, 227–235.
Delais-Roussarie, Elisabeth (2000) Vers une nouvelle approche de la structure
prosodique, Langue Française 126(May), 92–112.
(2009) La prosodie des incidentes en français, Cahiers de Grammaire 30, 129–138.
Delais-Roussarie, Elisabeth, Brechtje Post, Mathieu Avanzi, Carolin Buthke, Albert Di
Cristo, et al. (2015) Intonational phonology of French: developing a ToBI system
for French, in Sónia Frota & Pilar Prieto (eds.), Intonational Variation in Romance,
Oxford University Press, 63–100.
Delattre, Pierre (1966) Les dix intonations de base du français, French Review 40, 1–14.
Dell, François (1984) L’accentuation dans les phrases en français, in F. Dell, D. Hirst &
J. R. Vergnaud (eds.), Formes sonores du langage, Paris: Hermann, 65–122.
(2004) On recent claims about stress and tone in Mandarin, Cahiers de Linguistique
Asie Orientale 33(1), 33–63.
Delmonte, Rodolfo (1981) L’accento di parola nella prosodia dell’enunciato dell’ita-
liano standard, Studi di grammatica Italiana 10, 351–394.
Deulofeu, Henri-José (2003) L’approche macrosyntaxique en syntaxe: un nouveau
modèle de rasoir d’Occam contre les notions inutiles, Scolia 16, 77–95.
(2006) Pour une linguistique du rattachement, in Denis Apothéloz, Bernard
Combettes & Franck Neveu (eds.), Actes du colloque international de Nancy
(7–9 juin 2006), Bern/Berlin: Peter Lang, 229–250.
Di Cristo, Albert (1998) Intonation in French, in D. J. Hirst & A. Di Cristo (eds.),
Intonation Systems: A Survey of Twenty Languages, Cambridge University Press,
195–218.
(2013) La prosodie de la parole, Brussels: De Boeck – Solal.
Di Cristo, Albert & Mario Rossi (1977) Propositions pour un modèle d’analyse de
l’intonation, Actes des 8èmes Journées d’Étude sur la Parole (Aix-en-Provence),
1, 323–329.
D’Imperio, Mariapaola (2002) Italian intonation: an overview and some questions,
Probus 14, 37–69.
D’Imperio, Mariapaola, Gorka Elordieta, Sónia Frota, Pilar Prieto & Marina Vigário
(2005) Intonational phrasing in Romance: the role of syntactic and prosodic
structure, in S. Frota, M. Vigário & M. J. Freitas (eds.), Prosodies, Berlin/New
York: Mouton de Gruyter, 59–97.
Doelling, Keith B., Luc H. Arnal, Oded Ghitza & David Poeppel (2014) Acoustic land-
marks drive delta-theta oscillations to enable speech comprehension by facilitating
perceptual parsing, NeuroImage 85(761). DOI: 10.1016/j.neuroimage.2013.06.035.
Duez, Daniele (1997) Acoustic markers of political power, Journal of Psycholinguistic
Research, November(6), 641–654.
Dunning, Brian (2010) Speed Reading, Skeptoid Podcast. Skeptoid Media, Inc., Oct 26,
2010. Accessed online, Dec. 12, 2013, http://skeptoid.com/episodes/4229/.
Ekman, Paul (1999) Basic emotions, in Tim Dalgleish & Michael J. Power (eds.),
Handbook of Cognition and Emotion, New York: Wiley & Sons, 45–60.
Émerard, Françoise (1977) Synthèse par diphones et traitement de la prosodie, Doctoral
thesis (Doctorat de 3ème cycle), Université de Grenoble III, France.
EuRom4 (1991–1997) Projet européen Lingua (CEE) I Institut National de la Langue
Française (INALF), under the direction of Claire Blanche-Benveniste (Université
de Provence) with Università Terza di Roma, Universidad de Salamanca, and
Universidade de Lisboa.
Feldhausen, Ingo (2010) Sentential Form and Prosodic Structure of Catalan,
Amsterdam: Benjamins.
Fónagy, Ivan (1979) L’accent en français: accent probabilitaire, in L’accent en français
contemporain, Studia Phonetica, 15, Paris: Didier, 123–233.
(2003) Fonctions de l’intonation: essai de synthèse, Flambeau 29, 1–20.
Fónagy, Ivan & Judith Fónagy (1983) L’intonation et l’organisation du discours,
Bulletin de la Société de Linguistique de Paris 78, 161–209.
Fónagy, Ivan & Klara Magdics (1960) Speed of utterances in phrases of different
lengths, Language and Speech 3, 179–192.
(1963) Emotional patterns in intonation and music, Zeitschrift für Phonetik 16,
293–326.
Fox, Anthony (2000) Prosodic Features and Prosodic Structure, Oxford University Press.
Frei, Henri (1929/2011) La grammaire des fautes, Presses universitaires de Rennes.
Friederici, Angela D. (2002) Towards a neural basis of auditory sentence processing,
Trends in Cognitive Sciences 6(2), 78–84.
Frota, Sónia (2009) The Prosodic Phonology of European Portuguese, www.fl.ul.pt//
laboratoriofonetica/personal/sfrota/.
Frota, Sónia & Pilar Prieto (2015) Intonation in Romance, Oxford University Press.
Frota, Sónia, Marisa Cruz, Flaviane Svartman, Marina Vigário, Gisela Collischonn, et
al. (2013) Labelling intonational variation across varieties of European and
Brazilian Portuguese, PaPI 2011, Talk presented at the workshop on Romance
ToBI, in PaPI, Phonetics and Phonology in Iberia, Universitat Rovira i Virgili,
Tarragona, Spain.
Fuchs, Catherine (1996) Les ambiguïtés du français, Paris: Ophrys.
Gachet, Frédéric & Mathieu Avanzi (2008) La prosodie des parenthèses en français
spontané, Verbum 30(1), 53–84.
Garde, Paul (1968) L’accent, Paris: PUF; 2nd edn. 2013, Limoges: Lambert-Lucas.
Ghitza, Oded (2012) On the role of theta-driven syllabic parsing in decoding speech:
intelligibility of speech with a manipulated modulation spectrum, Front. Psychol.
3:238. DOI:10.3389/fpsyg.2012.00238.
Ghitza, Oded (2013b) The theta syllable: a unit of speech information defined by cortical
function, Front. Psychol. 4, Article 138.
Ghitza, Oded, Anne-Lise Giraud & David Poeppel (2013) Neuronal oscillations and
speech perception: critical-band temporal envelopes are the essence, Frontiers in
Human Neuroscience 6, www.frontiersin.org, 6, Article 340.
Gilbert, Annie (2012) Le chunking perceptif de la parole: sur la nature du groupement
temporel et son effet sur la mémoire immédiate, Doctoral thesis, Université de
Montréal (March 2011).
Gilbert, Annie & Victor Boucher (2007) What do listeners attend to in hearing prosodic
structures? Investigating the human speech-parser using short-term recall, in H.
van Hamme & R. van Son (eds.), Proceedings of the Eighth Annual Conference of
the International Speech Communication Association (Interspeech 2007),
Antwerp, Belgium, 430–433.
Gilliéron, Jules & Edmond Edmont (1902–1910) Atlas linguistique de la France, 9 vols.,
Paris: Champion.
Giraud, Anne-Lise & David Poeppel (2012) Cortical oscillations and speech processing:
emerging computational principles, Nature neuroscience E-pub, DOI:10.1038/
nn.3063, www.nature.com/neuro/journal/v15/n4/abs/nn.3063.html.
Godement, Rémi & Philippe Martin (2010) Suffixes complexes: quand c’est fini ça
recommence . . ., Actes des Xèmes Journées d’Étude sur la Parole, Mons,
185–188.
Goldman, Jean-Philippe (2011) EasyAlign: an automatic phonetic alignment tool under
Praat, in Proceedings of the Twelfth Annual Conference of the International Speech
Communication Association, Florence, August 27–31, 3233–3236.
Goldsmith, John (1976) Autosegmental phonology, Ph.D. thesis, Massachusetts
Institute of Technology, Dept. of Foreign Literatures and Linguistics.
Grammont, Maurice (1933) Traité de phonétique, Paris: Delagrave.
Grevisse, Maurice ([1936]2011) Le bon usage (15th edn.), Paris: Duculot/De Boeck.
Gussenhoven, Carlos (2004) The Phonology of Tone and Intonation, Cambridge
University Press.
(2005) Transcription of Dutch intonation, in Sun-Ah Jun (ed.), Prosodic Typology:
The Phonology of Intonation and Phrasing, Oxford University Press, 118–145.
Halle, Morris & Jean-Roger Vergnaud (1987) An Essay on Stress, Cambridge, MA: MIT
Press.
’t Hart, Johan (1976) Discriminability of the size of pitch movements in speech, IPO
Progress Report 9, 56–63.
’t Hart, Johan, René Collier & Antonie Cohen (1990) A Perceptual Study of Intonation:
An Experimental-Phonetic Approach to Speech Melody, Cambridge University
Press.
Henry, Molly & Jonas Obleser (2012) Frequency modulation entrains slow neural
oscillations and optimizes human listening behavior, Proceedings of the
National Academy of Sciences of the USA (PNAS), www.pnas.org/content/early/
2012/11/. . ./1213390109.
Hirst, Daniel (2005) Form and function in the representation of speech prosody, in K.
Hirose, D. J. Hirst & J. Sagisaka (eds.), Quantitative Prosody Modeling for Natural
Speech Description and Generation (= special issue of Speech Communication 46
(3–4)), 334–347.
Hirst, Daniel & Robert Espesser (1993) Automatic modelling of fundamental frequency
using a quadratic spline function, Travaux de l’Institut de Phonétique d’Aix 15,
71–85.
Hualde, José I. (2003) El modelo métrico y autosegmental, in P. Prieto (ed.), Teorías de
la entonación, Barcelona: Ariel, 155–184.
(2010) Secondary stress and stress clash in Spanish, in Marta Ortega-Llebaria (ed.),
Selected Proceedings of the Fourth Conference on Laboratory Approaches to Spanish
Phonology, Somerville, MA: Cascadilla Proceedings Project, 11–19, www.lingref.
com/cpp/lasp/4/index.html.
Jones, Daniel (1909) Intonation Curves, Leipzig/Berlin: Teubner.
Jun, Sun-Ah (ed.) (2005) Prosodic Typology: The Phonology of Intonation and
Phrasing, Oxford University Press.
(2012) Prosodic typology revisited: adding macro-rhythm, in Qiuwu Ma,
Hongwei Ding & Daniel Hirst (eds.), Proceedings of Speech Prosody 2012:
Sixth International Conference on Speech Prosody, Tongji University Press,
535–538.
Jun, Sun-Ah & Cécile Fougeron (2002) The realizations of the Accentual Phrase in
French intonation, Probus 14, 147–172.
Karcevski, Serge (2000) Inédits et introuvables, Textes rassemblés et établis par Irina et
Gilles Fougeron, Leuven: Peeters.
Lacheret-Dujour, Anne (2003) La prosodie des circonstants, Leuven: Peeters.
Lacheret-Dujour, Anne & Frédéric Beaugendre (1999) La prosodie du français, Paris:
CNRS Éditions.
Lacheret-Dujour, Anne & Bernard Vittori (2002) La période intonative comme
unité d’analyse pour l’étude du français parlé: modélisation prosodique et enjeux
linguistiques, in M. Charolles, P. Le Goffic & M. A. Morel (eds.), Verbum n°1–2 :
Y-a-t-il une syntaxe au-delà de la phrase? Presses universitaires de Nancy,
55–72.
Ladd, D. Robert (1996) Intonational Phonology, Cambridge Studies in Linguistics,
Cambridge University Press.
(2008) Intonational Phonology (2nd edn.), Cambridge Studies in Linguistics,
Cambridge University Press.
Lanchantin, Pierre, Andrew C. Morris, Xavier Rodet & Christian Veaux (2008)
Automatic phoneme segmentation with relaxed textual constraints, in Proceedings
of the Sixth International Conference on Language Resources and Evaluation
(LREC 08), Marrakech: European Language Resources Association, www.lrec-
conf.org/proceedings/lrec2008/.
Leben, William (1971) The morphophonemics of tone in Hausa, in C.-W. Kim and H.
Stahlke (eds.), Papers in African Linguistics, Edmonton, Alberta: Linguistic
Research, Inc., 201–218.
Lehiste, Ilse (1979) Suprasegmentals, Cambridge, MA: MIT Press.
Lehka, Irina & David Le Gac (2004) Étude d’un marqueur prosodique de l’accent de
banlieue, Actes des XXVèmes Journées d’Etudes sur la Parole, April 2004, Fès,
Morocco, www.afcp-parole.org/doc/Archives_JEP/2004_XXVe_JEP_Fes/actes/
jep2004/Lehka-LeGac.pdf.
Léon, Monique (1964) Exercices systématiques de prononciation française, fascicule 2,
Rythme et intonation, Paris: Hachette et Larousse.
Léon, Pierre (1993) Précis de phonostylistique: parole et expressivité, coll. “Fac
Linguistique,” Paris: Nathan.
(2005) Phonétisme et prononciations du français (5th edn.), Paris: Armand-Colin.
Léon, Pierre & Philippe Martin (1969) Prolégomènes à l’étude des structures intona-
tives, Montréal: Didier.
Li, X., Peter Hagoort & Yufang Yang (2008) Event-related potential evidence on the
influence of accentuation in spoken discourse comprehension, Chinese Journal of
Cognitive Neuroscience 20(5), 906–915.
Liberman, Alvin M. & Ignatius G. Mattingly (1985) The motor theory of speech
perception revised, Cognition 21(1), 1–36.
Liberman, Mark & Alan Prince (1977) On stress and linguistic rhythm, Linguistic
Inquiry 8, 249–336.
Linell, Per (2005) The Written Language Bias in Linguistics, New York: Routledge.
Lonchamp, François (1998) Prédire l’intonation des phrases affirmatives: facteurs
rythmiques et syntaxiques, Verbum 17(1), 37–45.
Makuuchi, Michirou, Jörg Bahlmann, Alfred Anwander & Angela D. Friederici (2009)
Segregating the core computational faculty of human language from working
memory, Proceedings of the National Academy of Sciences of the United States
of America, 106(20), 8362–8367.
Marshall Unger, James (2003) Ideogram: Chinese Characters and the Myth of
Disembodied Meaning, University of Hawai’i Press.
Martin, Philippe (1973) Les problèmes de l’intonation: recherches et méthodes, Langue
française 19 (Sept. 1973), 4–42.
(1975) Analyse phonologique de la phrase française, Linguistics, 146 (Feb.), 35–68.
(1987) Prosodic and rhythmic structures in French, Linguistics, 25(5), 925–949.
(1989) Automatic assignment of lexical stress in Italian, Proc. Eurospeech 89, Paris,
Sept. 27–29, 1989, 222–225.
(2002) Intonation et syntaxe dans les langues romanes, in Macro-syntaxe et pragma-
tique: l’analyse linguistique de l’oral, Lablita, Università di Firenze, 193–220.
(2006) Modelli di analisi e sistemi di etichettatura prosodica, AISV 2005, Analisi
Prosodica, teorie, modelli e sistemi di annotazione, ed. Claudia Crocco, B. Gili
Fivela & R. Savy, Padova: EDK editore, 43–56.
(2008) Phonétique acoustique: introduction à l’analyse acoustique de la parole.
Paris: Armand Colin.
(2009) Intonation du français, Paris: Armand Colin.
(2011) Ponctuation et structure prosodique, Langue Française, 172, 99–114.
(2012a) The Autosegmental-Metrical Prosodic Structure: not fit for French?, in
Qiuwu Ma, Hongwei Ding & Daniel Hirst (eds.), Proceedings of Speech Prosody
2012: Sixth International Conference on Speech Prosody, Tongji University Press,
131–134.
(2012b) La structure prosodique dynamique: rature et insertion de texte dans l’oral
spontané, in S. Caddéo, M.-N. Roubaud, M. Rouquier & F. Sabio (eds.), Penser les
langues avec Claire Blanche-Benveniste, Presses universitaires de Provence, coll.
Langues et langages, 117–125.
(2012c) Les contours de continuation majeure dans l’océan Indien, in A.-C. Simon
(ed.), La variation prosodique régionale en français, Brussels: De Boeck-Duculot,
199–211.
(2013a) Iconicity of melodic contours in French, in S. Hancil and D. Hirst (eds.),
Prosody and Iconicity, Amsterdam: Benjamins, 179–190.
(2013b) Contraintes phonologiques de l’intonation de la phrase réinterprétées à
la lumière des recherches récentes en neurophysiologie, La Linguistique 1,
97–113.
(2013c) Contours mélodiques de continuation majeure à La Réunion, à Maurice et
aux Seychelles, in Gudrun Ledegen (ed.), La variation du français dans les espaces
créolophones et francophones, Paris: L’Harmattan, 31–47.
(2014a) Emotions and prosodic structure: who’s in charge? in Fabienne Baider &
Georgeta Cislaru (eds.), Linguistic Approaches to Emotions in Context,
Amsterdam: John Benjamins, 215–230.
(2014b) Spontaneous speech corpus data validates prosodic constraints, in Campbell,
Gibbon & Hirst (eds.), Social and Linguistic Prosody: Proceedings of the Seventh
International Conference on Speech Prosody, Dublin: Science Foundation Ireland
(SFI), 525–529.
(forthcoming) Phrase autonome et intonation autonome.
Martinet, André (1960) Éléments de linguistique générale, Paris: Armand Colin.
Meigret, Louis (1550) Le tretté de la grammére françoéze, Paris: C. Wechel, http://gallica.
bnf.fr/ark:/12148/bpt6k507854/f1.image.
Mertens, Piet (1987) L’intonation du français, de la description linguistique à la recon-
naissance automatique, Doctoral thesis, Catholic University of Leuven, Belgium.
(1993) Intonational grouping, boundaries, and syntactic structure in French, in D.
House & P. Touati (eds.), Proc. ESCA Workshop on Prosody, September 27–29,
1993, Lund (S) Working Papers 41, Lund University, Dept. of Linguistics,
156–159.
(2004) Un outil pour la transcription de la prosodie dans les corpus oraux, Traitement
Automatique des langues 45(2), 109–130.
(2008) Syntaxe, prosodie et structure informationnelle: une approche prédictive pour
l’analyse de l’intonation dans le discours, Travaux de Linguistique 56(1), 87–124.
Michelas, Amandine & Mariapaola D’Imperio (2010) Durational cues and prosodic
phrasing in French: evidence for the intermediate phrase, in Proceedings of Speech
Prosody 2010: Fifth International Conference on Speech Prosody, Chicago, May
11–14, 2010, http://speechprosody2010.illinois.edu/program.php.
Miller, George A. (1956) The magical number seven, plus or minus two: Some limits on
our capacity for processing information, Psychological Review 63(2), 81–97.
Mittmann, Maryualê Malvessi, Tommaso Raso & Heliana Mello (2009) The C-ORAL-
BRASIL Corpus: Methodological Basis for the Treatment of Spontaneous Speech,
Seventh Brazilian Symposium in Information and Human Language Technology
(STIL 2009), Sao Carlos, Brazil, September 8–11, 2009.
Morel, Mary-Annick & Laurent Danon-Boileau (1998) Grammaire de l’intonation:
l’exemple du français oral, Paris-Gap: Bibliothèque de Faits de Langues, Ophrys.
Mouret, François, Anne Abeillé, Elisabeth Delais-Roussarie, Jean-Marie Marandin &
Hiyon Yoo (2008) Aspects prosodiques des constructions coordonnées en français,
in Actes des XXVIIème Journées d’Études sur la Parole (JEP-TALN 2008),
Avignon, June 2008.
Nemni, Monique (1980) L’identification de l’incise par l’intonation, in Pierre Léon &
Mario Rossi (eds.), Problèmes de prosodie, Studia phonetica No. 18, 103–111.
Nespor, Marina & Irene Vogel (1986, 2007) Prosodic Phonology, Dordrecht: Mouton de
Gruyter.
Nowak, Paul (2012) Speed Reading Tips: 5 Ways to Minimize Subvocalization, www.
irisreading.com/speed-reading/speed-reading-tips-5-ways-to-minimize-subvocali
zation/.
Obrig, Hellmuth, Simone Rossi, Silke Telkemeyer & Isabell Wartenburger (2010) From
acoustic segmentation to language processing: evidence from optical imaging,
Frontiers in Neuroenergetics 2(13), 1–12.
Ochs, Elinor (1979) Transcription as theory, in E. Ochs & B. Schieffelin (eds.),
Developmental Pragmatics, New York: Academic Press, 43–71.
Palmer, Caroline & Susan Holleran (1994) Harmonic, melodic, and frequency height
influences in the perception of multivoiced music, Perception & Psychophysics
56(3), 301–312.
Palmer, Harold E. & F. G. Blandford (1924) A Grammar of Spoken English on a Strictly
Phonetic Basis, Cambridge: Heffer & Sons.
Pasdeloup, Valérie (2004) Le rythme n’est pas élastique: étude préliminaire de
l’influence du débit de parole sur la structuration temporelle, Actes des XXVèmes
Journées d’Etudes sur la Parole, April 2004, Fès, Morocco, www.afcp-parole.org/
doc/Archives_JEP/2004_XXVe_JEP_Fes/actes/jep2004/Pasdeloup.pdf.
Pierrehumbert, Janet B. (1980) The Phonology and Phonetics of English Intonation,
Ph.D. thesis, MIT, http://dspace.mit.edu/handle/1721.1/16065.
Pierrehumbert, Janet B. & Mary E. Beckman (1988) Japanese Tone Structure,
Cambridge, MA: MIT Press.
Pike, Kenneth L. (1945) The Intonation of American English, Ann Arbor: University of
Michigan Publications, Linguistics.
Poiré, François (2006) La perception des proéminences et le codage prosodique, in
Bulletin No. 6, Prosodie du français contemporain: l’autre versant de PFC, ed.
Anne Catherine Simon, Geneviève Caelen-Haumont & Claudine Pagliano, CNRS
& Université de Toulouse-Le Mirail, 69–80.
Posner, Rebecca (1996) The Romance Languages, Cambridge Language Surveys,
Cambridge University Press.
Post, Brechtje (1999) Restructured phonological phrases in French: evidence from clash
resolution, Linguistics 37(1), 41–63.
(2000) Tonal and Phrasal Structures in French Intonation, The Hague: Holland
Academic Graphics.
Praat (2013) www.praat.org.
Prieto, Luis, J. (1975) Pertinence et pratique: essai de sémiologie, Paris: Éditions de Minuit.
Prieto, Pilar (2014) The intonational phonology of Catalan, in Sun-Ah Jun (ed.),
Prosodic Typology 2, Oxford University Press, 43–80.
Prieto, Pilar, Joan Borràs-Comes, Verònica Crespo-Sendra, Paolo Roseano, Rafèu
Sichel-Bazin & Maria del Mar Vanrell (2007) The phonetics and phonology of
intonational phrasing in Romance, in Pilar Prieto, Joan Mascaró & Maria-Josep
Solé (eds.), Segmental and Prosodic Issues in Romance Phonology, ICREA &
Universitat Autònoma de Barcelona, 131–154.
Prieto, Pilar, Lourdes Aguilar, Ignasi Mascaró, Francesc Torres-Tamarit & Maria del
Mar Vanrell (2009) L’etiquetatge prosòdic Cat_ToBI, Estudios de Fonética
Experimental 18, 287–309.
Prince, Alan (1983) Relating to the grid, Linguistic Inquiry 14, 19–100.
Profili, Olga (1987) L’accent et sa prévisibilité, Rapport Syntalit/Italien, CENT Lannion.
Profili, Olga & Philippe Martin (1987) Antonio mangia la zuppa inglese, in Proceedings
XIth ICPhS: The Eleventh International Congress of Phonetic Sciences, Tallinn:
Academy of Sciences of the Estonian SSR.
Raso, Tommaso & Heliana R. Mello (2012) The C-ORAL-BRASIL I: Reference
Corpus for Informal Spoken Brazilian Portuguese, in Helena Caseli, Aline
Villavicencio, António Teixeira & Fernando Perdigão (eds.), Computational
Processing of the Portuguese Language: Proceedings of the Tenth International
Conference, PROPOR 2012, Coimbra, Portugal, April 17–20, Lecture Notes on
Artificial Intelligence, Vol. 7243, Springer, 362–368.
Raso, Tommaso & Heliana Mello (eds.) (2012) C-ORAL-BRASIL I, Belo Horizonte:
Editora UFMG.
Raphael, Laurence J., Michael F. Dorman, Frances Freeman & Charles Tobin (1975)
Vowel and nasal duration as cues to voicing in word-final stop consonants: spectro-
graphic and perceptual studies, Journal of Speech and Hearing Research 18(3),
389–400.
Roca, Iggy (1999) Stress in Romance languages, in Harry van der Hulst (ed.), Word
Prosodic Systems in the Languages of Europe, Berlin/New York: Walter de
Gruyter, 659–811.
Rosen, Jody (2008) Researchers play tune recorded before Edison, New York
Times, March 27, 2008. Available at http://graphics8.nytimes.com/audiosrc/arts/1
860v2.mp3. Original file at www.firstsounds.org/sounds/1860-Scott-Au-Clair-de-
la-Lune.mp3.
Rossi, Mario (1971) Le seuil de glissando ou seuil de perception des variations tonales
pour la parole, Phonetica 23, 1–33.
(1972) Le seuil différentiel de durée, in A. Valdman (ed.), Papers in Linguistics
and Phonetics to the Memory of Pierre Delattre, Paris-La Haye: Mouton,
435–450.
(1978) Interaction of intensity glides and frequency glissandos, Language and Speech
21(4), 384–396.
(1979) Le français, langue sans accent? in I. Fonagy & P. Léon (eds.), L’Accent en
français contemporain, Studia Phonetica 15, Didier, Montréal, 14–53.
(1983) Les accents des Français, ed. Pierre Léon, Fernand Carton, Mario Rossi &
Denis Autesserre, Paris: Hachette.
(1999) L’intonation, le système du français, description et modélisation, Collection
l’Essentiel français, Paris: Ophrys.
Rossi, Mario, Albert Di Cristo, Philippe Martin & Yukihiro Nishinuma (1981)
L’intonation: de l’acoustique à la sémantique, Collection Études Linguistiques,
Vol. 25, Paris: Klincksieck.
Rousselot, Jean-Pierre (L’Abbé) (1901–1908) Principes de phonétique expérimentale,
Vol. II, Paris-Leipzig: Welter.
Sandri, S. & Enrico Vivalda (1981) Automatic stress assignment for Italian text-to-
speech synthesis, CSLT Rapporti Tecnici 8(3), 213–216.
Sauleau, Paul (2010) Physiologie des émotions et de la motivation, Polycopié cours du
Pr Sauleau. Université de Rennes 1.
Scarano, Antonietta (2003) Les constructions en syntaxe segmentée: syntaxe, macro-
syntaxe, et articulation de l’information, in Antonietta Scarano (ed.), Macro-
syntaxe et pragmatique: L’analyse linguistique de l’oral, Rome: Bulzoni,
183–203.
Seargent, R. L. & J. D. Harris (1962) Sensitivity to non-directional frequency modula-
tion, Journal of the Acoustical Society of America 34(10), 1625–1628.
Selkirk, Elisabeth O. (1978) On prosodic structure and its relation to syntactic structure,
in T. Fretheim (ed.), Nordic Prosody, Vol. II, Trondheim: TAPIR, 111–140.
(1986) Derived domains in sentence phonology, Phonology Yearbook 3, 371–405.
(2005) Comments on intonational phrasing in English, in S. Frota, M. Vigario &
J. Freitas (eds.), Prosodies: Selected Papers from the Phonetics and Phonology in
Iberia Conference, 2003, Phonetics and Phonology Series, Berlin: Mouton de
Gruyter, 11–58.
Sereno, Sara & Keith Rayner (2003) Measuring word recognition in reading: eye move-
ments and event-related potentials, Trends in Cognitive Sciences 7(11), 489–493.
Simon, Anne-Catherine, Mathieu Avanzi & Jean-Philippe Goldman (2008) La détection
des proéminences syllabiques: un aller-retour entre l’annotation manuelle et le
traitement automatique, in J. Durand, B. Habert & B. Laks (eds.), Actes du CMLF
2008: 1er Congrès Mondial de Linguistique Française, 2008, Paris, 1685–1698.
DOI: 10.1051/cmlf08256.
Solé Sabater, Maria-Josep (1991) Stress and Rhythm in English, Revista Alicantina de
Estudios Ingleses 4, 145–162.
Sosa, Juan-Manuel (1999) La entonación del español, Madrid: Cátedra.
Steinhauer, Karsten, Kai Alter & Angela D. Friederici (1999) Brain potentials indicate
immediate use of prosodic cues in natural speech processing, Nature Neuroscience
2(2), 191–196.
Teston, Bernard (2004) L’œuvre d’Étienne-Jules Marey et sa contribution à l’émergence
de la phonétique dans les sciences du langage, Travaux Interdisciplinaires du
Laboratoire Parole et Langage 23, 237–266.
ToBI (1999) www.ling.ohio-state.edu/~tobi/.
Trager, George & Henry Smith (1951) An Outline of English Structure, Studies in
Linguistics: Occasional Papers 3, Norman, OK: Battenburg Press.
Transcriber (2014) http://trans.sourceforge.net.
Ungurean, Catalin, Dragos Burileanu & Aurelian Dervis (2009) A statistical approach
to lexical stress assignment for TTS synthesis, International Journal of Speech
Technology 12(2/3), 63–73.
Vaissière, Jacqueline (1975) Caractérisation des variations de la fréquence du fonda-
mental dans les phrases du français, in VIèmes Journées d’Etude sur la Parole,
Toulouse, 39–50.
Vercherand, Géraldine, In-Young Kim & Hi-Yon Yoo (2011) Whispering French and
Korean: a comparative study, Linguistic Research 23(2), 81–95.
Viana, Céu & Sonia Frota et al. (2007) Towards a P_ToBI, http://labfon.letras.ulisboa.
pt/SonseMelodias/P-ToBI/P-ToBI.htm.
Von Essen, Otto (1956) Grundzüge der hochdeutschen Satzintonation, Ratingen-
Düsseldorf: A. Henn Verlag.
Wang, Suiping, Deyuan Mo, Ming Xiang, Ruiping Xu & Hsuan-Chih Chen (2012) The
time course of semantic and syntactic processing in reading Chinese: evidence
from ERPs, Language and Cognitive Processes, iFirst, 1–20.
Watanabe, Satosi (1969) Knowing and Guessing: A Quantitative Study of Inference and
Information, New York: Wiley.
Wightman, Colin W. (2002) ToBI or not ToBI? in B. Bel & I. Marlien (eds.),
Proceedings of the First International Conference on Speech Prosody, Aix-en-
Provence: SProSig, 25–29.
WinPitch (2013) www.winpitch.com.
Wioland, François (1985) Les structures syllabiques du français, Paris: Slatkine-
Champion.
Xu, Yi & Maolin Wang (2009) Organizing syllables into groups: evidence from F0 and
duration patterns in Mandarin, Journal of Phonetics 37(4), 502–520.
Yuan, Jiahong & Mark Liberman (2008) Speaker identification on the SCOTUS corpus,
Proceedings of Acoustics 2008, 5687–5690.
Analyzed corpora
Les accents des français (French) http://accentsdefrance.free.fr/
Archives de la parole (French) http://gallicadossiers.bnf.fr/ArchivesParole/
CFPP 2000 (French) http://cfpp2000.univ-paris3.fr/
C-ORAL-Brasil (Brazilian Portuguese) www.c-oral-brasil.org/
C-ORAL-ROM (French, Spanish, Italian, European Portuguese) http://lablita.dit.unifi.it/coralrom/
Corpus Catalan (Catalan) Author
Corpus Oral de Français Parlé en Suisse Romande (French) http://www11.unine.ch/
Corpus Québécois (Quebec French) Author
Corpus roumain (Romanian) Author
Corpus Svevo (Italian) CD ROM Vita e opere di Italo Svevo e Trieste
C-PROM (French) https://sites.google.com/site/corpusprom/
Eurom4 (French, Spanish, Italian, European Portuguese, including lessons recordings) http://sites.univ-provence.fr/delic/Eurom4/
Eurom5 (French, Spanish, Catalan, Italian, European Portuguese) http://www.eurom5.com/
FLLOC French Learner Language Oral Corpora (FLLOC) (French) http://www.flloc.soton.ac.uk/
Lablita (Italian) http://lablita.dit.unifi.it/
Nocando (Catalan, Italian, Spanish) http://www.upf.edu/pdi/enric-vallduvi/research/
Orfeo (French) http://www.projet-orfeo.fr/
PFC (French) Phonologie du Français contemporain, http://www.projet-pfc.net/
Português Falado, Documentos Autênticos, Gravações áudio com transcrição alinhada (2001)
Ramirez Puerto Rico (Spanish) Author
Rhapsodie (French) Corpus de référence du français, www.projet-rhapsodie.fr/
Romanian anonymous speech corpus (Romanian) (rasc) http://rasc.racai.ro/index.php?page=home
Sounds of the Romanian Language Corpus, www.etc.tuiasi.ro/sibm/romanian_spoken_language/index.htm
Valibel (French) www.uclouvain.be/valibel
VoxForge (French, Spanish, Catalan, Italian, European Portuguese) www.voxforge.org/
TCOF (French) Traitement de Corpus Oraux en Français, www.cnrtl.fr/corpus/tcof
Author index
Abeillé, Anne 280 Caddéo, Sandrine 274, 279

Adam, Marcel 273 Capt-Artaud, Marie-Claude xxv
Aguilera, Marion 110, 272 Cei, Erica 128, 274
Alkire, Ti 122, 272 Chafe, Wallace 216, 274
Alter, Kai 272, 283 Chen, Hsuan-Chih 67, 274, 283
Anwander, Alfred 279 Chitoran, Ioana 127, 274
Armstrong, Lilias 34, 272 Ciobanu, Alina Maria 274
Astésano, Corinne 272 Cohen, Antonie 32, 60, 277
Austin, John 218 Collier, René 32, 60, 277
Avanzi, Mathieu xxv, 15, 17, 32, 37, Cooper, William 60, 274
41, 106, 131, 200, 218, 272, 275, Cresti, Emanuela xxv, 69, 218, 219, 220, 241,
276, 283 245, 274, 275
Avesani, Cinzia 272
Ayers Elam, Gayle 12, 24, 273 D’Imperio, Maria-Paola 275, 280
Danon-Boileau, Laurent 225, 280
Badiou, Alain 43, 272 Dard, Frédéric 256, 257
Bahlmann, Jörg 279 Debaisieux, Jeanne-Marie xxv, 84, 219, 222,
Bailly, Charles 58, 216, 219, 233, 275
220, 272 Delais-Roussarie, Élisabeth xxv, 16, 17,
Baumann, Stefan 273 26, 40, 44, 47, 54, 207, 207, 273,
Beaugendre, Frédéric 278 275, 280
Beckman, Mary 12, 24, 38, 47, 49, Delattre, Pierre x, 23, 24, 35, 275, 282
273, 281 Dell, François 47, 66, 275
Béguelin, Marie-José 218, 275 Delmonte, Rodolfo 127, 275
Benzmüller, Ralf 273 Demolin, Didier xxv
Berrendonner, Alain 218, 230, 273 Dervis, Aurelian 283
Besson, Mireille 272 Deulofeu, Henri-José xxv, 219, 230, 275
Beyssade, Claire 79, 273 Di Cristo, Albert 275, 282
Bilbiie, Gabriela xxv Díaz-Campos, Manuel 273
Blanche-Benveniste, Claire xxiii, xxv, 14, Dorman, Michael F. 282
134, 214, 217, 218, 226, 227, 233, 237, Dunning, Brian 254, 255, 276
273, 279
Blandford, Francis G. 34, 35, 281 Eckman, Paul 5, 276
Bolinger, Dwight 15, 34, 35, 273 Edmont, Edmond 36
Bonami, Olivier 47, 273 Elordieta, Gorka 208, 275
Bonvino, 134, 273, 274 El Yagoubi, Radouane 272
Boucher, Victor xxv, 63, 100, 108, 109, 114, Émerard, Françoise 276
274, 277 Espesser, Robert 32, 40, 272, 277
Boulakia, Georges xxv, 105, 274
Bruce, Gösta 38, 274 Falbo, Caterina xxvi
Brunot, Ferdinand 214, 217, 274 Feldhausen, Ingo 46, 51, 120, 276
Burileanu, Dragos 283 Fernández Plana, Ana Maria xxvi
Fivela, B. Gili 54, 279 Leben, William 47, 278

Fletcher, Harvey 30 Lehiste, Ilse 12, 278
Fónagy, Ivan xi, 33, 72, 98, 106, 131, 203, 276 Lehka, Irina 104, 106, 278
Fónagy, Judith 276 Léon, Monique 35, 278
Fougeron, Cécile 131, 203, 204, 206, 207, Léon, Pierre xxi, xxv, 35, 69, 73, 203, 276,
278 278, 280
Fox, Anthony 276 Li, X. 62, 279
Freeman, Frances 282 Liberman, Alvin 254
Frei, Henri 217, 276 Liberman, Mark 20, 28, 38, 47, 97, 279, 284
Friederici, Angela D. 65, 113, 228, 276, 279 Lilias 272
Frota, Sonia 42, 52, 54, 120, 203, 275, Linne, Per 87, 279
276, 283 Lonchamp, François 279
Fuchs, Catherine 105, 274, 276 Ludwig, Carl 21
Gachet, Frédéric 17, 200, 276 Magdics, Klara xi, 33, 106, 276
Garde, Paul 65, 120, 121, 128, 276 Makuuchi, Michirou 113, 279
Germain, Aline xxvi Marandin, Jean-Marie 273, 280
Ghitza, Oded 107, 108 Marey, Étienne-Jules 21, 283
Gilbert, Annie xxiii, xxvi, 63, 65, 76, 100, 108, Martin, Philippe i, iii, iv, xxii, 14, 15, 35, 38,
109, 114, 122, 274, 277 48, 52, 56, 58, 61, 64, 65, 66, 69, 73, 74,
Gilléron, Jules 36 76, 77, 79, 82, 83, 97, 99, 102, 103, 104,
Giraud, Anne-Lise 276, 277 105, 106, 108, 109, 110, 111, 121, 128,
Godement, Rémi xxvi, 228, 277 130, 131, 133, 135, 140, 144, 159, 200,
Goldman, Jean-Philippe 11, 28, 277, 283 207, 216, 219, 222, 228, 233, 237, 250,
Goldsmith, John 12, 38, 47, 277 252, 253, 272, 273, 274, 275, 277, 278,
Grammont, Maurice 23, 277 279, 280, 281, 282
Grevisse, Maurice 217 Martinet, André 58, 72, 280
Grice, Martine 273 Matta-Machado, Mirian xxvi
Gussenhoven, Carlos 51, 277 Mattingly, Ignatius G. 254, 279
Meigret, Louis 14, 58, 77, 97, 100, 214, 280
Hagoort, Peter 62, 279 Mello, Helena 216, 282
Halle, Morris 128, 277 Mertens, Piet 32, 35, 36, 37, 48, 51, 145, 280
Harris, J. D. 37, 282 Michelas, Amandine 280
‘t Hart 32, 60, 277 Miller, George A. 63, 77, 280
Hayes, Brice 128, 274 Mittmann, Maryualê 216
Henry, Molly 108, 110, 277, 283 Mo, Deyuan 283
Hirschberg, Julia 273 Moneglia, Massimo xxvi, 218, 219, 241, 245,
Hirst, Daniel 32, 40, 67, 122, 274, 275, 277, 274, 275
278, 279, 280 Morel, Mary-Annick 225, 230, 278, 280
Holleran, Susan 30, 281 Morgan, Terrell A. 273
Hualde, José I. 46, 54, 97, 277, 278 Morris, Andrew 277, 278
Mota, Antonia 273
Jassem, Wiktor 122 Mouret, François 192, 280
Jitcă, Doina 54 Munson, W. A. 30
Jones, Daniel 33, 278
Jun, 15, 39, 47, 131, 203, 204, 206, 207, 273, Nemni, Monique 280
277, 278, 281 Nespor, Marina 48, 97, 280
Niculae, Vlad 274
Karcevski, Serge 76, 278 Nishinuma, Yukihiro 282
Kim, In-Young 278, 283 Nowak, Paul 253, 280
Lacheret, Anne 41, 42, 219, 272, 278 Obin, Nicolas 41, 272
Ladd, Robert 49, 56, 278 Obleser, Jonas 277
Lanchantin, Pierre 11, 278 Obrig, Hellmuth 113, 281
Le Gac, David 104, 106, 278 Ochs, Elinor 43, 86, 281
Palmer, Angel 30, 34, 35, 281 Shattuck-Hufnagel, Stefanie 273

Pasdeloup, Valérie 106, 281 Simon, Anne-Catherine 44, 281, 283
Perez, Patricia xxvi Simone, Raffaele 273, 281
Pierrehumbert, Janet 46, 47, 49, Smith, Henry 38, 283
273, 281 Sorensen, John 60, 274
Pike, Kenneth 32, 33, 34, 281 Sosa, Juan-Manuel 283
Pippa, Salvador 274 Steinhauer, Karsten 110, 113, 283
Pirvulescu, Michaela xxvi
Poeppel, David 276, 277 Telkemeyer, Silke 281
Poiré, François 43, 281 Teston, Bernard 21, 283
Posner, Rebecca 17, 281 Tevis Mcgory, Julia 273
Post, Brechtje xxiii, 52, 281 Tobin, Charles 282
Prieto, Luis 44, 76 Trager, George 38, 283
Prieto, Pilar 54, 275, 277, 281
Prince, Alan 20, 38, 47, 97, 279, 281 Unger, J. Marshall 254, 279
Profili, Olga 97, 128, 281 Ungurean, Catalin 127, 283
Uzcanga de Vivar, Isabel 273
Queneau, Raymond 223
Vaissière, Jacqueline 283
Raphael, Laurence 67, 282 Valli, André 273
Raso, Tommaso xxvi, 216, 282 Veaux, Christian 278
Rayner, Keith 253, 283 Vercherand, Géraldine 86, 283
Roca, Iggy 282 Vergnaud, Jean-Roger 128, 275,
Rodet, Xavier 278 277
Rosen, Jody 20, 122, 272, 282 Victorri, Bernard 41, 42, 272, 278
Rossi, Mario xxvi, 12, 14, 30, 37, 51, 76, 79, 91, Vigário, Marina 275, 276
129, 145, 275, 280, 281, 282 Vivalda, Enrico 124, 127, 282
Rousier-Vercruyssen, Lucie 272 Vogel, Irene 48, 97, 280
Rousselot, Jean-Pierre x, 21, 23, 282 von Essen, Otto 34, 283
Sabater, Maria-Josep 76, 283 Wang, Suiping 110, 228, 283

Sandri, S. 124, 127, 282 Ward, Ida C. 34, 272
Satosi, Watanabe 283 Watanabe, Satosi 43, 158, 283
Sauleau, Paul 5, 282 Wightman, Colin W. 42, 283
Scarano, Antonietta 220, 282 Wioland, François 58, 284
Scott de Martinville, Édouard-Léon 20
Seargent, R. L. 37, 282 Xu, Ruiping 283
Selkirk, Lisa 47, 48, 51, 283
Sereno, Sara 253, 283 Yang, Yufang 62, 279
Serra, Carolina 274 Yoo, Hiyon 280, 283
Sévigny, Alexandre xxvi Yuan, Jiahong 28, 284
Subject index
7 syllables, xxviii, 81, 114, 118, 249 declination, 3

accent phrase, xxvii, 44, 45, 46, 50, 77, 112, Delta brain wave, 27, 106, 107, 108, 110, 111,
252, 254, 255 112, 113, 114, 115, 116, 117, 144
alignment, 13, 27, 28, 40, 45, 52, 55, 56, 79, 80, dependency, 14, 43, 48, 50, 61, 77, 80, 81, 82,
96, 104, 106, 111, 112, 133, 256, 257, 259, 83, 90, 92, 93, 94, 96, 106, 112, 116, 144,
260, 261, 262, 264, 265, 267, 277 149, 180, 192, 201, 218, 219, 221, 222, 230,
ambiguous sentences, 12, 57 233, 234, 245, 251, 252
AMPER project, 32, 36 differed complement, 102
Analor, 32, 41, 272 domain, xxii, xxiii, xxviii, 27, 29, 52, 53, 54,
arc accentuel, 96, 131 57, 71, 81, 84, 85, 86, 88, 93, 94, 96, 107,
autosegmental-metrical, xxiii, 12, 13, 14, 41, 111, 112, 120, 134, 138, 139, 140, 218, 256
47, 48, 51, 61, 76, 77, 86, 96, 116, 120, doubt, 70, 72, 75, 123, 250
134, 203 duration, xxiii, xxvii, 1, 2, 10, 11, 12, 13, 16, 21,
22, 23, 27, 29, 32, 38, 45, 56, 58, 59, 62, 64,
boundary tone, 17, 38, 46, 47, 49, 51, 53, 54, 65, 67, 77, 78, 80, 81, 85, 86, 87, 90, 91, 96,
55, 56, 57, 58, 77, 78, 86, 120, 137, 203, 207, 98, 99, 101, 102, 103, 104, 106, 107, 108,
208, 220, 221 110, 111, 112, 113, 114, 115, 117, 118, 119,
brain waves, 27, 88, 106, 107, 108, 112, 113, 132, 135, 137, 138, 139, 144, 150, 158, 159,
116, 274 186, 187, 208, 249, 252, 254, 255, 260, 268,
broad focus, xxviii 270, 282
dysfluencies, 215, 224, 225, 226, 232, 233
Catalan, xxii, xxvi, 17, 19, 36, 54, 122, 123,
124, 132, 133, 135, 152, 153, 155, 160, 171, electroencephalography, 27, 107
175, 180, 184, 186, 187, 190, 224, 276, 281, emotion, 5, 6
285, 286 enumeration, 6, 49, 50, 77, 81, 93, 107, 119,
conclusive, 6, 17, 41, 50, 68, 69, 71, 73, 75, 78, 135, 140, 160, 161, 169, 174, 192, 199, 200,
81, 82, 84, 88, 91, 138, 139, 140, 141, 142, 202, 206, 231
144, 150, 151, 187, 193, 199, 210, 212, 215, eurhythmy, xxviii, 58, 103, 106, 111, 113, 117,
216, 219, 221, 222, 223, 224, 226, 229, 230, 136, 145, 233
232, 233, 234, 235, 242, 250, 253, 257 Eurom4, 133, 135, 199, 285
connexity, 83, 84, 96 Eurom5, 133, 135, 188, 285
continuation majeure, 41, 72, 81, 84, 232, 280 eye movement, 135, 252
contrast of melodic slope, xxii, 58, 77, 92, 94,
96, 119, 139, 140, 142, 145, 158, 159, 163, French, xxii, xxiii, xxv, 9, 17, 19, 23, 24, 31, 35,
168, 192 36, 39, 40, 43, 44, 51, 54, 55, 61, 62, 63, 64,
coordination, 17, 71, 135, 192, 195, 207, 65, 66, 67, 69, 71, 72, 73, 76, 77, 78, 80, 81,
195, 230 84, 87, 91, 92, 93, 94, 95, 96, 97, 98, 100,
101, 103, 104, 105, 106, 108, 110, 111, 120,
declarative, 52, 53, 55, 68, 69, 70, 71, 72, 73, 121, 122, 123, 126, 130, 131, 133, 134, 138,
74, 75, 78, 91, 138, 140, 145, 148, 150, 193, 139, 140, 142, 145, 149, 150, 151, 153, 154,
210, 212, 215, 216, 219, 220, 221, 222, 233, 158, 159, 160, 161, 162, 163, 164, 165, 166,
242, 250 167, 168, 169, 180, 185, 186, 187, 192, 195,
198, 203, 206, 207, 208, 210, 212, 213, narrow focus, xxviii, 53, 55
214, 215, 216, 217, 219, 224, 225, 227, neutralization, xxviii, 58, 91, 117
228, 232, 233, 237, 249, 250, 251, 256, 259,
272, 273, 275, 278, 279, 280, 281, 283, oxyton, 123, 125
285, 286
fundamental frequency, xxi, 7, 8, 10, 13, 20, 22, parenthesis, 17, 50, 83, 183, 192, 200,
23, 24, 25, 26, 29, 30, 31, 36, 38, 41, 46, 53, 202, 203, 219, 222, 223, 224, 232,
60, 62, 72, 73, 78, 138, 142, 148, 207, 210, 233, 245
216, 259, 260, 266, 270, 277 Paroxyton, 123, 124
phonetics, xxv, 15, 20, 44, 55, 273, 281,
homographs, 124, 126, 129, 130 282
phonology, xxi, 6, 12, 19, 23, 42, 43, 46, 47, 48,
iconicity, 71, 279 55, 58, 61, 76, 85, 124, 131, 256, 273, 277,
imperative, 52, 69, 72, 73, 74, 111, 121, 130, 281, 283
148, 234, 250 pitch curve, 16, 30, 36, 267
implicative, 70, 72, 73, 148, 232, 250 Pitchmeter, 23
incremental storage concatenation, xxvii, 78, Planarity, xxiii, xxviii, 83, 96
84, 87, 144, 149, 208, 235, 236 ponctuant, 225, 232
intensity, xxvii, 1, 2, 8, 9, 11, 13, 17, 30, 32, 37, Portuguese, xxii, 17, 19, 36, 54, 97, 110, 122,
59, 62, 64, 73, 78, 85, 87, 144, 235, 259, 267, 123, 124, 132, 133, 153, 155, 157, 158, 171,
270, 271, 282 172, 173, 174, 176, 191, 199, 224, 226, 245,
interrogative, 52, 53, 55, 68, 69, 70, 71, 72, 73, 259, 274, 276, 285, 286
74, 75, 78, 121, 125, 126, 130, 145, 151, 212, Praat, xxii, 16, 23, 25, 27, 31, 36, 41, 59, 259,
215, 216, 221, 222, 250 265, 268, 270, 277
IntSint, 32 preprepreproparoxyton, 124
Italian, xxii, 17, 19, 36, 52, 54, 64, 71, 81, 97, prepreproparoxyton, 124
98, 100, 121, 122, 123, 124, 126, 127, 128, preproparoxyton, 124
130, 132, 133, 135, 140, 151, 153, 154, 156, proparoxyton, 123, 125
169, 170, 177, 178, 183, 184, 188, 189, 199, prosodic boundary, xxviii
200, 201, 202, 203, 224, 226, 241, 259, 274, prosodic constraints, 135, 280
275, 279, 282, 285, 286 prosodic Contour, xxvii
prosodic eraser, 226
kymograph, 20, 21, 23, 32 prosodic Marker, xxvii
prosodic phrasing, 82
laboratory phonology, xxii, 13, 49, 217 prosodic structure, xxii, xxiii, xxvii, xxviii, 5, 6,
laryngeal frequency, 1, 2, 3, 4, 5, 7, 13, 16, 21, 11, 12, 13, 14, 15, 16, 17, 19, 23, 24, 39, 42,
22, 29, 30, 73, 86 44, 45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56,
Latin, 17, 19, 120, 122, 128, 129, 130, 132 57, 58, 59, 61, 62, 68, 69, 71, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 90, 92, 96,
macrosegment, 219, 221, 222, 223, 231, 233, 102, 103, 106, 112, 113, 114, 115, 116, 118,
234, 235, 242, 248 120, 121, 133, 134, 135, 136, 137, 138, 140,
macrosyntax, xxiii, xxv, xxviii, 50, 84, 92, 102, 142, 144, 145, 150, 153, 159, 160, 169, 182,
138, 145, 148, 192, 215, 216, 217, 218, 221, 184, 185, 188, 192, 200, 208, 210, 222, 223,
228, 230, 248, 256 224, 226, 227, 228, 231, 232, 233, 235, 239,
melodic contours, xxviii, 6, 16, 19, 32, 39, 41, 244, 249, 250, 252, 253, 254, 255, 275,
49, 54, 56, 70, 71, 72, 73, 75, 79, 82, 86, 88, 280, 283
89, 90, 91, 92, 94, 96, 122, 133, 134, 137, prosodic word, xxii, xxiii, xxvii, 45, 59, 61,
138, 141, 142, 144, 145, 148, 149, 175, 62, 65, 71, 76, 77, 78, 81, 82, 83, 84, 88, 92,
185, 191, 194, 195, 196, 206, 210, 212, 213, 94, 96, 97, 98, 99, 101, 102, 103, 104, 105,
219, 221, 222, 229, 231, 235, 236, 249, 112, 113, 117, 133, 134, 135, 138, 140, 145,
257, 279 148, 150, 160, 164, 169, 172, 185, 190, 191,
melodic curve, 7, 8, 9, 10, 16, 21, 59, 60, 138, 192, 195, 197, 200, 208, 213, 221, 222, 224,
204, 207, 210, 217, 219, 233 225, 226, 227, 228, 231, 235, 236, 252,
Momel, 40, 41 254, 257
prosogram, 35, 36, 37 121, 133, 136, 142, 144, 145, 192, 214, 215,
216, 217, 218, 220, 225, 226, 233, 248,
respiratory cycle, 1 257, 259
Romance language, xxii, xxiii, 17, 19, 36, 49, stress clash, xxiii, xxviii, 47, 58, 97, 98, 106,
52, 54, 58, 61, 63, 66, 78, 79, 80, 81, 87, 91, 113, 131, 206, 251, 278
94, 95, 97, 98, 100, 110, 120, 121, 122, 123, stressed syllable, 13, 122, 123, 137, 151
124, 126, 130, 132, 133, 134, 136, 137, 138, subvocalization, 112, 252, 253, 254, 255, 280
139, 140, 144, 145, 146, 148, 149, 150, 154, surprise, 5, 72, 75
160, 161, 169, 180, 182, 192, 198, 201, 212, syllabic chunk, 65, 66, 77, 78
213, 214, 216, 219, 226, 227, 228, 282 syllabic duration, 11, 12, 67, 144
Romanian, xxii, xxv, xxvi, 17, 19, 54, 122, 123, syntactic clash, xxiii, xxviii, 6, 58, 98, 105, 106,
124, 126, 127, 132, 133, 135, 153, 161, 175, 107, 115, 117, 251, 257
179, 180, 181, 182, 185, 191, 224, 226, 274,
285, 286 Theta brain wave, 107, 109, 111
ToBI, 12, 13, 24, 25, 29, 31, 32, 33, 38, 39, 40,
sense group, 101, 132 41, 42, 46, 47, 49, 52, 54, 55, 56, 88, 140,
silent reading, xxiii, xxvii, 15, 16, 112, 116, 204, 273, 275, 281, 283
252, 253, 254, 255 Transcriber, 27, 259, 265, 266, 270
Spanish, xxii, 17, 19, 36, 54, 71, 98, 122, 123,
124, 125, 132, 133, 140, 152, 153, 154, 156, voiced, 1, 3, 7, 10, 32, 60, 62, 67, 134, 140, 146,
170, 174, 186, 189, 190, 201, 224, 226, 259, 148, 190, 207
273, 278, 285, 286
spectrograph, xxi, 21, 23, 24, 32 WinPitch, xxii, 11, 23, 25, 26, 27, 28, 31, 59,
spontaneous speech, xxvii, 14, 16, 17, 20, 25, 133, 136, 216, 259, 260, 261, 262, 264, 266,
28, 57, 65, 66, 78, 103, 106, 108, 113, 116, 268, 270
