Unit 1 - Sp. Lg. Proc. & Pros.

1
SLP 203: Speech Language Processing & Prosody

UNIT 1: INTRODUCTION TO SPEECH LANGUAGE PROCESSING
SUBMITTED BY: SWATHI SURESH & VAISHNAVI S
M.SC SLP I YEAR
SUBMITTED BY: DR. SUJA MATHEWS
SUBMITTED ON: 05.11.2020

2
CONTENTS
1.1 Introduction to speech Processing
I. What is Speech Perception?

II. Definition
III. The Principal Issues in Speech Perception
a. Linearity, Lack of Acoustic—Phonetic Invariance, and the Segmentation Problem
b. Units of Analysis in Speech Perception
c. Perceptual Constancy in Speech - The Normalization Problem in Speech
Perception
IV. Categorical Perception
V. Mcgurk Effect
VI. Perceptual organization in speech- Gestalt principles of perceptual grouping
VII. Phonetic organization
1.2 Theoretical approaches to speech perception
VIII. Motor Theory of Speech Perception
IX. Analysis by Synthesis Theory
X. Quantal Theory
XI. Auditory Theory of Vowel Perception
XII. Neurological theories of speech perception
XIII. Pandemonium model

XIV. Direct- realistic Approach
XV. Input to lexicon- Lexical access from spectra (LAFS)
XVI. TRACE
XVII. Dual stream model
XVIII. References
3
1.1 Introduction to speech Processing
I. What is Speech Perception?
The term ‘speech perception’ is used in a variety of contexts. Critical terminological
distinctions should be made.
https://www.youtube.com/watch?v=xY6DBIusIsI
II. DEFINITION:
4
The basic problems in speech perception involve a number of issues surrounding
● The internal representation of the speech signal and
● The perceptual constancy of this representation—the problem of acoustic-phonetic
invariance and
● The phenomena associated with perceptual contrast to identical stimulation.
III. The Principal issues in Speech Perception:
a. Linearity, Lack of Acoustic—Phonetic Invariance, and the Segmentation Problem
b. Units of Analysis in Speech Perception
c. Perceptual Constancy in Speech - The Normalization Problem in Speech Perception
a. Linearity, Lack of Acoustic—Phonetic Invariance, and the
Segmentation Problem:
LINEARITY
PROBLEM
5
A spectrogram of the phrase "I owe you". There are no clearly
distinguishable boundaries between speech sounds.
LACK OF
INVARIANCE
6
SEGMENTATION
PROBLEM
As first discussed by Chomsky and Miller (1963), one of the most important and central
problems in speech perception derives from the fact that the speech signal fails to meet the
conditions of linearity and invariance.

7
● As a consequence of failing to satisfy these two conditions, the basic recognition problem
can be seen as a substantially more complex task for humans to carry out.
● Although humans can perform it effortlessly, the recognition of fluent speech by
machines has thus far proven to be a nearly intractable problem.
Lack of invariance refers to the idea that there is no reliable connection between
the language phoneme and its acoustic manifestation in speech. The same word,
or even single phoneme, can sound completely differently depending on many
factors:
1) Individual differences. Acoustic structure of speech depends a lot on a
speaker's accent, physical and psychological characteristics.
2) Speech conditions.
3) Coarticulation. This is the idea that more than one sound is articulated at once,
so each of them is partly shaped by the sounds surrounding it. The articulates
8
(jaw, tongue, mouth) move from sound to sound, allowing us to speak faster, thus
acoustic structure of each phoneme depends a lot on its 'neighbours'. Consider the
following spectrograms of a sound /d/ in three different positions:
Place of articulation on the /d/ is different every time, thus producing several
'versions' of the same sound - however we as listeners will still hear /d/ every time,
despite big differences in their acoustic structure.
It is extremely difficult to identify acoustic segments and features that uniquely match the
perceived phonemes independently of the surrounding context.
● As a result of coarticulation in speech production, there is typically a great deal of
contextual variability in the acoustic signal correlated with any single phoneme.
● Often a single acoustic segment contains information about several neighboring linguistic
segments (i.e., parallel transmission), and, conversely, the same linguistic segment is
often represented acoustically in quite different ways depending on the surrounding
phonetic context, the rate of speaking, and the talker (i.e., context-conditioned variation).
9
● In addition, the acoustic characteristics of individual speech sounds and words exhibit
even greater variability in fluent speech because of the influence of the surrounding
context than when speech sounds are produced in isolation.
● The context-conditioned variability resulting from coarticulation also presents enormous
problems for segmentation of the speech signal into phonemes or even words based only
on an analysis of the physical signal.
b. Units of Analysis In Speech Perception
In various attempts to solve the joint problems of the lack of invariance and segmentation,
different-sized perceptual units have been proposed. The phonetic feature, phoneme, and syllable
have all been considered at one time or another by various investigators. However, the problems
of a lack of invariance and segmenting a continuous signal are common to all of these units of
analysis (see Pisoni, 1976; Studdert-Kennedy, 1976).
Coarticulation:
● The coarticulation phenomena that carry attributes of one phonetic feature or phoneme
onto the surrounding phonemes are also found for syllables (Öhman, 1966).
● Coarticulation refers to the influence of the muscle movements necessary to produce one
sound onto preceding and succeeding muscle movements and their resulting acoustic
manifestations.
● Thus the production of a stop consonant is conditioned or influenced by the production of
the adjacent phonemes.

10
● The acoustic consequences of coarticulation are that one sound segment may carry
information about a number of phonemes or syllables.
● Coarticulation effects obscure the boundaries between all potential units of analysis in
speech: phonetic features, phonemes, and syllables.
● As a result, syllabic units are difficult to segment by acoustically defined criteria and
show the same context-conditioned variability (i.e., lack of invariance) exhibited by
phonemes.
c. Perceptual Constancy in Speech - The Normalization Problem in
Speech Perception
Talker Variability:
● In addition to the problems arising from the lack of acoustic-phonetic invariance, there
are also two related problems having to do with the normalization of the speech signal.
One is the talker-normalization problem.
● Talkers differ in the length and shape of their vocal tracts, in the articulatory gestures
used for producing various types of phonetic segments, and in the types of coarticulatory
strategies present in their speech.
● As a consequence, substantial differences among talkers have been observed in the
absolute values of the acoustic correlates of many phonetic features.
● Differences in stress and speaking rate, as well as in dialect and affect, also contribute to
differences in the acoustic manifestation of speech.

11
● Despite the inherent variability in the speech signal due to talker differences, listeners are
nevertheless able to perceive speech from a wide range of vocal tracts under diverse sets
of conditions.
● Clearly, then, the invariant properties cannot be absolute physical values encoded in the
stimulus but must be relational in nature.
● Unfortunately, relatively little is known about this form of perceptual normalization, or
about the types of perceptual mechanisms involved in carrying out these computations.
Variability in speaking rate:
● A second normalization problem concerns time and rate normalization in speech
perception. It is well known that the duration of individual segments is influenced
substantially by speaking rate.
● However, the acoustic duration of segments is also affected by the location of various
syntactic boundaries in connected speech, by syllabic stress, and by the component
features of adjacent segments in words (see, e.g., Gaitenby, 1965; Klatt, 1975, 1976,
1979; Lehiste, 1970).
● In addition, there are substantial differences in the duration of segments of words when
produced in sentence contexts compared to the same words spoken in isolation.
● Changes in speech rate are reflected in changes in the number and duration of pauses, in
durational changes of vowels and some consonants, and in deletions and reductions of
some of the acoustic properties that are associated with particular linguistic units (J.L.
Miller, Grosjean, & Lomato, 1984).

12
● For example, changes in VOT and the relative duration of transitions and vowel
steady-states occur with changes in speaking rates (J.L. Miller & Baer, 1983; J.L. Miller,
Green, & Reeves, 1986; Summerfield, 1975).
● The duration of vowels produced in sentences is roughly half the duration of the same
vowels spoken in isolation.
● Speaking rate also influences the duration and acoustic correlates of various phonetic
features and segments.
● Numerous low-level phonetic and phonological effects such as vowel reduction, deletion,
and various types of assimilation phenomena have been well documented in the
literature.
● These effects seem to be influenced a great deal by speaking tempo, dialect, and
surrounding phonetic context.
https://www.youtube.com/watch?v=1nAy-I1OiDU
IV Categorical Perception:
Categorical perception (Repp 1984) is the recognition, along a continuum of data, of categories
meriting individual labels. Thus, for example, a rainbow is a continuous or linear spread of
colour (frequencies) within which certain colours appear to stand out or can be recognised by the
viewer. In fact these colours, it is hypothesised, are not intrinsically isolated or special in the
signal; their identification is a property of the way the viewer is cognitively treating the data.
Categorical Perception Theory is an attempt to model this observation.
Two observations are relevant to understanding categories:
1. Listeners often assign the same label to physically different signals. This occurs, for
example, when two people utter the ‘same’ sound. In fact the sounds are always
13
measurably different, but are perceived to belong to the same category. What is
interesting is that the two speakers also believe they are making identical or very similar
sounds, simply because they are both rendering the same underlying plan for a particular
phonological extrinsic allophone. The identity of the planned utterance within the minds
of the speakers matches the label assigned by the listener.
2. Sounds can be made which are only minutely different and occur on the same production
cline (A cline is a continuous range or slope between values associated with a particular
phenomenon. For example, in English the series of vowels [u, ʊ, ɔ, ɑ] stand on a cline of
back vowels ranging from highest to lowest). Take the example of vowel height, where
there is a cline, say, between high front vowels and low front vowels. Acoustically this
will show up as a cline in the frequency of formant 2 (F2) and also formant 1 (F1).
Listeners will segment the cline into different zones, and assign a distinct label to each.
Here we have the situation where, because phonologically there are two different
segments to be rendered, both speakers and listeners will believe that the difference has
carried over into the physical signal.
Each of these two observations is the reverse of the other. On the one hand different signals
prompt a belief of similarity, and on the other similar signals prompt a belief of difference.
Categorical Perception Theory – and other active or top- down theories – propose that the
explanation for these observations lies in the strength of the contribution made to the perceptual
process by the speaker/listener’s prior knowledge of the structure and workings of the language.
Within certain limits, whatever the nature of the physical signal the predictive nature of the
speaker’s/listener’s model of how the language works will push the data into the relevant
14
category. We speak of such active top- down approaches as being theory- driven, as opposed to
passive, data- driven, bottom- up theories.
So the categories in the minds of both speaker and listener are language dependent. That is, they
vary from language to language, depending on the distribution of sounds within the available
acoustic or articulatory space. Imagine a vector running from just below the alveolar ridge to the
lowest, most forward point the front of the tongue can reach for making vowels. Along this
vector the tongue can move to an infinite number of positions – and we could, in theory at any
rate, instruct a speaker to make all these sounds, or, at least, a large number of sounds along this
vector. If these sounds are recorded, randomised and played to listeners, they will subdivide the
sounds according to tongue positions along the vector which represent sounds in their own
language (Rosner and Pickering 1994).
https://www.youtube.com/watch?v=nNYhCP6NiUg
Mcgurk Effect
To facilitate communication humans have the remarkable ability to combine the visual
information from the mouth with auditory information from voice. When presented with certain
combinations of incongruent speech, such as auditory /ba/ paired with visual /ga/, individuals
may perceive an illusory fused percept /da/. This illusion, first described by McGurk and
MacDonald (1976), has come to be known as the McGurk effect. McGurk and MacDonald
(1976) reported a powerful multisensory illusion occurring with audiovisual speech. McGurk
effect is an acoustic utterance that is heard as another utterance when presented with discrepant
visual articulation (Tiippana, 2014). They recorded a voice articulating a consonant and dubbed
15
it with a face articulating another consonant. Even though the acoustic speech signal was well
recognized alone, it was heard as another consonant after dubbing with incongruent visual
speech.
https://www.youtube.com/watch?v=jtsfidRq2tw
Regarding the definition and interpretation the two main claims of the McGurk effect are, first
the McGurk effect should be defined as a categorical change in auditory perception induced by
incongruent visual speech, resulting in a single percept of hearing something other than what the
voice is saying. Second, when interpreting the McGurk effect, the perception of the unisensory
acoustic and visual stimulus components is also important(Tiippana, 2014) .
There are many variants of the McGurk effect. The best-known case is when dubbing a voice
saying [b] onto a face articulating [g] results in hearing [d]. This is called the fusion effect since
the percept differs from the acoustic and visual components.Here integration results in the
perception of a third consonant, by merging information from audition and vision (Setti et al.,
2013). When A[b]V[g] is heard as [d], the percept is thought to emerge due to fusion of the
features (for the place of articulation) provided via audition (bilabial) and vision (velar), so that a
different, intermediate consonant (alveolar) is perceived (van Wassenhove, 2013).
Another fact is that the other incongruent audiovisual stimuli produce different types of percepts.
For example, a reverse combination of these consonants, A[g]V[b], is heard as [bg], i.e., the
visual and auditory components one after the other. There are other pairings, which result in
hearing according to the visual component, e.g; acoustic [b] presented with visual [d] is heard as
[d].The different variants of the McGurk effect represent the outcome of audiovisual integration.
Tiippana (2014) concluded that when integration takes place, it results in a unified percept,
without access to the individual components that contributed to the percept. Thus, when the
16
McGurk effect occurs, the observer has the subjective experience of hearing a certain utterance,
even though another utterance is presented acoustically.
There is a high degree of individual variability in how frequently the illusion is perceived. Some
individuals almost always perceive the McGurk effect, while others rarely do (Mallick et al.,
2015). Frequent perceivers of the McGurk effect were also more likely to fixate the mouth of the
talker, and there was a significant correlation between McGurk frequency and mouth looking
time (Gurler et al., 2015).To which extent visual input influences the percept depends on how
coherent and reliable information each modality provides. Coherent information is integrated and
weighted, e.g., according to the reliability of each modality, which is reflected in unisensory
discriminability. Converging evidence suggests that the STS is a critical brain area for
multisensory integration of auditory and visual information about both speech and non-speech
stimuli (Barraclough et al., 2005). This suggests that the STS could be a neural locus for the
McGurk effect: if the left STS successfully combines the incongruent auditory and visual
syllables that comprise a McGurk stimulus, a McGurk percept is produced; if the left STS is not
active, then the auditory and visual syllables are not combined and a McGurk percept is not
produced.
Nath and Beauchamp (2012) used blood-oxygen level dependent functional magnetic resonance
imaging (BOLD fMRI) to measure brain activity as McGurk perceivers and non-perceivers were
presented with congruent audiovisual syllables, McGurk audiovisual syllables, and non-McGurk
incongruent syllables.14 healthy right-handed subjects (6 females, mean age 26.1) were
participants of the study. The stimuli consisted of three types of syllables: congruent (auditory
and visual matching) syllables and two types of incongruent syllables (auditory and visual
mismatch).
17
Not all incongruent syllables produce a McGurk percept, defined as a percept not present in the
original stimulus (McGurk and MacDonald, 1976). Just prior to scanning, a behavioral pre-test
was performed. Each subject was presented with 10 trials of McGurk syllables and 10 trials of
non-McGurk incongruent syllables. Auditory stimuli were delivered through headphones at
approximately 70 dB, and visual stimuli were presented on a computer screen. Subjects were
instructed to watch the mouth movements and listen to the speaker. In order to assess perception,
subjects were asked to repeat aloud the perceived syllable. Fused percepts such as “da” and “tha”
were used as indicators of McGurk perception, while responses strictly corresponding to the
visually-presented syllable (“ga”) were not counted as fused McGurk percept. Responses
corresponding to “ba” (the auditory component of the syllable) indicated that the effect was not
perceived (McGurk and MacDonald, 1976). Each subject underwent one fMRI scanning session.
In five subjects, each scan series contained 55 McGurk trials, 55 non-McGurk incongruent trials,
10 target trials (audiovisual “ma”) and 30 trials of fixation baseline. In the remaining nine
subjects, each scan series contained 25 McGurk trials, 25 non-McGurk trials, 25 congruent “ga”
18
trials (auditory + visual “ga”), 25 congruent “ba” trials, 10 target trials (audiovisual “ma”) and 30
trials of fixation baseline. During fixation, the crosshairs were presented at the same position as
the mouth during visual speech to minimize eye movements.On approximately 10% of trials, the
target stimulus (audiovisual “ma”) was presented. Subjects were required to respond to the target
stimulus by pressing a button, but not to other stimuli. Target stimuli were analyzed separately
from other stimuli. This ensures attention to the stimulus while preventing contamination of the
brain response to non-target stimuli by motor planning or execution.
Results:
In the behavioral test, there was a high degree of intersubject variability in McGurk
susceptibility. Three subjects never reported the McGurk percept (0% susceptibility) while two
subjects always reported it (100% susceptibility). Subjects were classified into two groups:
non-perceivers (seven subjects, susceptibility 0–49%) and perceivers (seven subjects,
50%–100%).
For the analysis of FMRI a 2-way ANOVA was performed with BOLD response in the left STS
as the dependent measure. The first factor was the McGurk susceptibility group determined from
behavioral testing (perceivers vs. non-perceivers). The second factor was the stimulus condition
(congruent vs. incongruent syllables). The ANOVA showed significant main effects of both
susceptibility groups and stimulus condition on the STS response. There was no interaction
between susceptibility group and stimulus condition.McGurk perceivers had the highest STS
response to incongruent syllables and non-perceivers had the lowest response. There was no
between-group difference in the response to congruent syllables.Effects of stimulus, across
groups, the STS significantly preferred incongruent to congruent syllables. There was no
significant difference between the responses to the two types of incongruent syllables (McGurk
19
and non-McGurk).Individual subject analyses Next, we examined individual responses to the
syllables (Fig. 2). On individual subject analysis, the subject with the weakest STS response had
the smallest likelihood of experiencing a McGurk percept; the subject with the strongest STS
response had the highest likelihood. Across all subjects, there was a significant positive
correlation between each subject's STS response to incongruent syllables and their likelihood of
experiencing the McGurk percept. . These results suggest that the left STS is a key locus for
interindividual differences in speech perception.
Feng et al. 2019 examined the McGurk effect in 324 native Mandarin speakers, consisting of 73
monozygotic (MZ) and 89 dizygotic (DZ) twin pairs (mean age ± SD = 16.8 ± 2.1 years). 150
underwent an additional testing session approximately 2 years after the initial session. The
McGurk stimuli consisted of nine digital audiovisual recordings, each 2 s long . Each stimulus
was presented eight times in random order. Each stimulus contained an auditory recording of a
syllable and a digital video of the face of the speaker enunciating a different, incongruent
syllable. Subjects were instructed to pay attention to each movie clip and report their percept by
typing it via a standard keyboard into a computer. Subject responses were classified into four
categories: auditory (responses corresponding to the auditory syllables), visual (responses
corresponding to the visual syllables), McGurk (specific responses containing an element not
contained in either the auditory or visual syllable, described in the original paper as “fused”
responses), and other (responses different from the previous three categories, e.g., “a”).
20
From the analysis it is revealed that some participants never perceived the illusion and others
always perceived it. Within participants, perception was similar across time (2-year retest in 150
participants) suggesting that McGurk susceptibility reflects a stable trait rather than short-term
perceptual fluctuations. The examination of the effects of shared genetics and prenatal
environment, compared McGurk susceptibility between MZ and DZ twins. Both twin types had
significantly greater correlation than unrelated pairs suggesting that the genes and environmental
factors shared by twins contribute to individual differences in multisensory speech perception.
Perceptual organization in speech- Gestalt principles of perceptual grouping
For the understanding of organization of visual and auditory forms Wertheimer (1923/1938)
first exposed the principles underlying perceptual grouping and segmentation, which are now
known as ‘the classic principles of perceptual grouping’.
A row of dots at equal distance from one another, is just perceived as a row of dots, without any
particular grouping or segmentation. When some of the inter-dot distances are increased
significantly relative to the others, one immediately perceives a grouping of some dots in pairs,
which become segregated from others. Apparently, elements that are relatively closer together
become grouped together, while elements that are relatively further apart are segregated from
one another, based on the principle of grouping by proximity.

21
When dots are differentiated from one another by size, color or another feature, the dots become
spontaneously grouped again, even with equal distances. In Figure C, for instance, the smaller
filled dots are grouped in pairs and so are the larger open ones. Apparently, elements that are
similar to one another are grouped together, while dissimilar elements are segregated from one
another, based on the principle of grouping by similarity.
With unequal distances and differentiated attributes, the two principles can cooperate or
compete. When both proximity and similarity are working together (Figure D), grouping is
enhanced compared to the conditions where only one principle can play a role (proximity in
Figure B, similarity in Figure C). When they are competing (see Figure E), one might perceive
pairs of dissimilar dots (when grouping by proximity wins) or pairs of similar dots at larger
22
distances (when grouping by similarity wins), with these two tendencies possibly differing in
strength between individuals and switching over time within a single individual.
Even with equal distances, elements become grouped again, when some undergo a particular
change together (e.g., an upward movement), while others do not change or change differently
(e.g., move downward; see Figure F). This principle is grouping by common fate.
Patterns with Continuity law suggest that a person who is looking at it, that the pattern
"continues" even after the end of the physical pattern itself, i.e. our brain can trace the connecting
lines between various elements of the design if there are none.
We perceive lines as a part of a continuous movement inorder to minimize the abrupt changes. In
this given figure we perceive two overlapping wavy lines instead of three shapes linked together.
23
In the law of symmetry items that form symmetrical units are grouped together. In the given
picture three sets of brackets are seen, not six unconnected lines.
The law of closure states that individuals perceive objects such as shapes, letters, pictures, etc.,
as being whole when they are not complete. Specifically, when parts of a whole picture are
missing, our perception fills in the visual gap.

24
In the left arrangement of below given figure, one perceives two identical six-sided shapes, in
different orientations and slightly overlapping, with line patterns a and c grouped together as
forming one figure, and b and d as another (a-c and b-d).In the right arrangement of figure, with
the same line patterns in different positions and relative orientations, one clearly perceives
something different, namely one elongated six-sided shape with a smaller diamond in the middle;
so, now patterns a and d are grouped as one form, and b and c as another (a-d and b-c). When
parts form a larger whole, the whole with a higher degree of regularity are better Gestalts and
they tend to dominate our perception this is the the principle of a good Gestalt.
25
In the Law of Common Region Gestalt law of perceptual organization suggests that elements that
are grouped together within the same region of space tend to be grouped together.
In the given figure, there are three oval shapes drawn on a piece of paper with two dots located at
each end of the oval. The ovals are right next to each other so that the dot at the end of one oval
is actually closer to the dot at the end of a separate oval. Despite the proximity of the dots, the
two that are inside each oval are perceived as being a group rather than the dots that are actually
closest to each other.
Wertheimer also discussed three additional factors, first When confronted with complex shapes,
we tend to reorganize them into simpler components or into a simpler whole. More likely to see
the left image above composed of the simple circle, square and triangle like you see on the right
than as the complex and ambiguous shape of the whole form.

26
Second, if one presents the parametrically different conditions as sequential trials in a
single experiment rather than as a series of separate experiments, the change from one
discrete percept to another would depend on the context of the preceding conditions:
the transition points from percept A to B would be delayed if the preceding stimuli were all
giving rise to percept A, and ambiguous conditions where A and B are equally strong based on
the parametric stimulus differences would yield percepts that go along with the organizations
that were prevailing in previous trials. Hence, in addition to isolated stimulus factors, the set of
trials within which a stimulus is presented also plays a role.This second additional factor is called
the factor of set. Third additional principle, implies that past experience does play a role in
perceptual organization, albeit a limited one. Wertheimer pointed out that this is just one of
several factors. When different factors come into play, it is not easy to predict which of the
possible organizations will have the strongest overall Gestalt qualities (the highest ‘goodness’).
Explicit auditory instances of organizational principles were again offered by Julesz and Hirsh
(1972).Bregman and colleagues have elaborated many of the speculations of Julesz and Hirsh
(1972) empirically. The findings demonstrate the dimensions and parameters of the perceptual
disposition to form groups. For example, a principle of proximity, here set in the frequency
27
domain, was offered to explain the formation of groups observed by Bregman and Campbell
(1971). If successive moments of the signal are similar with respect to frequency spectrum, they
should be considered part of the same source. From a repeating sequence of six 100-ms tones
with frequencies of 2.5 kHz, 2 kHz, 550 Hz, 1.6 kHz, 450 Hz, and 350 Hz, listeners perceived a
pattern forming two concurrent groups: one of the three low tones (550 Hz, 450 Hz, and 350 Hz)
and another of the three high tones (2.5 kHz, 2 kHz, and 1.6 kHz).
The principle of similarity also applies to the formation of auditory groups. The evidence comes
from studies (Bregman & Doehring, 1984; Steiger & Bregman, 1981) in which simple harmonic
relations or similar frequency excursions among a set of tones promoted the formation of
perceptual groups. In an extension of these studies, grouping of dichotically presented tones was
observed when harmonic relations occurred among them (Steiger & Bregman, 1982). Rapidly
repeating tones were grouped because they shared a common fundamental frequency, and even
small departures from this harmonic relation blocked dichotic fusion and also spectral similarity
appears to play a significant role in perceptual organization, as observed by Dannenbring and
Bregman (1976).Similarity between the perceptual attributes of successive events such as pitch,
timbre, loudness and location provides a basis for linking them (Moore and Gockel 2012). It
appears that it is not so much the raw difference that is important, but rather the rate of change;
the slower the rate of change between successive sounds the more similar they are judged
(Winkler, Denham et al. 2012). This leads one to consider that in the auditory modality, the law
of similarity is not separate from what the Gestalt psychologists termed good continuation. Good
continuation means that smooth continuous changes in perceptual attributes favour grouping,
while abrupt discontinuities are perceived as the start of something new. If successive moments
of the signal are similar with respect to frequency spectrum, they should be considered part of
28
the same source. Both the discontinuity and the dissimilarity of successive moments of sound
can be viewed as evidence that there is more than one source involved in producing the acoustic
input.
Good continuation can operate both within a single sound event (e.g., amplitude-modulating a
noise with a relatively high frequency results in the separate perception of a sequence of loud
sounds and a continuous softer sound (Bregman 1990)), and between events (e.g. glides can help
bind successive events (Bregman and Dannenbring 1973)).
Likewise, the principle of common fate was translated into the auditory domain by Bregman and
Pinker (1978).The principle of common fate refers to correlated changes in features; e.g.,
whether they start and/or stop at the same time. This principle has also been termed ‘temporal
coherence’ specifically with regard to correlations over time windows that span longer periods
than individual events. In a test of grouping by relative onsets of tone elements, when brief (147
ms) tones were synchronously onset and offset, they were grouped together; when tones were
offset by 58 ms or more, they were grouped into separate perceptual streams.
Organization by set has also been reported, in cases of musical experience, by Jones (1976)
although in some instances of this kind, perceptual organization also shows evidence of a
symmetry principle. Here, a prior portion of a musical display appears to induce an implicit
expectation about the latter portion of a musical display, in the dimensions both of melodic pitch
and of the temporal attributes of a melody
Both principles of continuity and closure apply to the grouping of simple tone glides reported by
Bregman and Dannenbring (1973, 1977). Group formation here occurred for tones that were
continuous in their frequency contours despite interruptions by silence (up to 20 ms) or by noise
bursts (up to 500 ms). Continuity and closure also operate in the amplitude domain, in which the
29
more abrupt the amplitude rise time, the likelier the occurrence of grouping (Dannenbring,
1976).
Phonetic organization
According to Remez et al. 1994, four points are emphasized for adequate auditory perceptual
organization for speech. First is the Auditory coherence for perceptual coherence, listener
directs attention to the properties internal to streams rather than the properties that differentiate
them (Remez, 1987). That is, acoustic elements that are proximate in frequency are attributed to
the same source; the rate at which the components succeed each other influences their
cohesiveness: The slower the procession, the greater the disposition to cohere. Acoustic elements
must also be similar in frequency changes to be grouped together, not only in the extent of
change but also in the temporal properties. A more subtle form of similarity also promotes
grouping of simultaneous components, namely, when harmonic relations occur among them.
Similarity in spectrum also appears to warrant cohesion of components, such that aperiodic
(noisy) acoustic elements are grouped together and periodic elements are grouped together.
Acoustic elements that occur concurrently must exhibit common onsets and offsets and must
show common (frequency or amplitude) modulation to come together. Continuity of changes in
frequency or in spectrum is also required for elements to form a single perceptual group. In
general, small temporal or spectral departures from these various similarities, continuities,
common modulations, and proximities result in the loss of coherence and the splitting of auditory
elements into separate groups.
The Speech Stream: During the production of speech sounds, the resonances are excited in
common by a pulse train produced by the larynx. This imposes harmonic relations and common
amplitude modulation across the formants (although the formant center frequencies are not
30
harmonically related). The attributes of harmonicity and common modulation may offer the only
basis for grouping the phonated formants as a single coherent stream. In the absence of common
pulsation and harmonic structure, we should find this signal fracturing into multiple perceptually
segregated streams when primitive auditory principles are applied: the first formant forming a
single continuous stream; the second formant splitting from the first as an intermittent stream
with highly varying frequency; and the nasal and third formants segregating from the others,
each varying rather less in frequency than the second formant.
In the spectrogram of the sentence "Why lie when you know I'm your lawyer?", some typical
acoustic attributes are the continuity of the lowest frequency resonance (the first formant) despite
intermittent energy (with gaps exceeding 75 ms) in resonances of higher frequency (the second,
third, and nasal formants), the marked dissimilarity in frequency trajectory of the first formant
and those of higher frequency formants, and the lack of temporal coincidence of large changes in
resonant frequencies.
31
32
The acoustic composition of the sentence depicted in Figure Ib, "The steady drip is worse than a
drenching rain," is considerably more varied. By criteria of spectral and frequency similarity, the
sentence should split into 10 streams: 3 that correspond to the three oral formants; a fourth,
which is composed of the four intermittent and discontinuous occurrences of nasal resonance (in
thaN, dreNchiNG, and raiN); a fifth, which is composed of the noise associated with the voiced
fricatives (in THe and THari); a sixth, which is composed of the noise manifest by the two
unvoiced fricatives (in Steady said worSe); a seventh, which is composed of the noise at the
release of the affricate (in drenCHing); an eighth, which is composed of two consonant release
bursts (in Drip and Drenching); a ninth, which contains the pulsed noise of the voiced fricative
(in the word iS)', and a tenth, which is composed of the consonantal release (in driP). If the
principle of common fate is applied to portions of the spectrum that are modulated by the pulsing
of the larynx, then the oral formants, the nasal formants, and perhaps the voiced fricatives are
grouped as 1 stream, leaving 4 remaining aperiodic streams associated respectively with the
acoustic correlates of unvoiced fricatives, affricates, and consonant releases. Neither continuity,
nor closure, nor symmetry, nor any obvious principle of "goodness" can accomplish the
reduction of this spectrum to the single vocal source that the listener perceives. These
commonplace examples suggest that the principles of the primitive auditory analysis fall short of
explaining the perceptual coherence of a single speech stream.
Perceptual Phenomena: Broadbent and Ladefoged (1957) in their study, investigated the effect
of common pulsing on the formant bands of a sentence. They made a two-formant synthetic
replica of a sentence but presented the first formant to one ear and the second formant to the
other. Despite the different location of the two signals, a rather obvious violation of spatial
similarity, listeners heard a single voice when the formants were excited at the same fundamental
33
frequency. When two different fundamentals were used, listeners reported hearing two voices, as
if fusion was lost; this occurred even when the differently pulsed formants were both presented
to the same ear. When each formant had a different fundamental, this created an impression of
two voices saying the same utterance, rather than two non speechlike buzzes varying in pitch.
Listeners evidently combined the information from each resonance to form phonetic impressions
despite the concurrent perception that each resonance issued from a different vocal source.Here,
the differently excited resonances were phonetically coherent, although they also were split into
two separate perceptual streams in the listener's impression of the auditory scene. These
perceptual streams, which should be segregated from each other according to the account given
by primitive auditory analysis, are combined nonetheless to produce phonetic impressions.
Last is the specific model of organization, which uses a schematic component to reconcile the
simplicity of the primitive analysis with the complexity of the speech signal. It has been
observed that rapid repetition of acoustically identical syllables actually destroys perceptual
stability and phonetic impressions of speech sounds are transformed by repetition. In such
conditions, perceptually segregated streams of auditory elements formed from the acoustic
constituents of the speech signal, much as they had with tones and noise, according to primitive
criteria of similarity, proximity, common fate, and continuity (Remez et al., 1994). In essence,
rapid repetition proved to be more effective in forcing perceptual segregation of acoustic
elements to occur with speech sounds than to produce phonetic impressions.

34
➔ The bottom-up theories are those in which the acoustic signal provides essential and
sufficient information for perceptual recognition. In this approach, the link between the
information received and the perceptual recognition is direct, with no or minimal
intervening stage. This approach is also referred to as data driven, precisely because the
data obtained from the acoustic signal drive or direct, the listener’s perception of speech.
In the top-down approach, the information from the acoustic signal is not sufficient for
perceptual recognition. Higher level information from contextual, linguistic, and
cognitive cues is necessary for accurate speech perception.
➔ The active/passive grouping of general attributes of theories addresses a related
concept—the degree to which information in addition to the acoustic signal is necessary.
Active theories emphasize the cognitive role in perception, including the formation and
testing of hypotheses about the phonetic or linguistic interpretation of the information in
the acoustic signal. In contrast, passive theories postulate a smaller role for cognitive
processing and presume a more automatic perceptual response.
➔ An autonomous theory posits that perceptual processing occurs in the absence of external
data, such as communicative context or general knowledge. In other words, perception
occurs within a closed system. In contrast, interactive theories are open systems, in which
the stages of perceptual processing access data external to that which is contained within
the acoustic signal.

35
➔ Certainly, no theory of speech perception is wholly active or wholly autonomous, just as
no theory is wholly passive or wholly interactive. Rather, these categorizations are simply
one tool that can be applied to the analysis of an otherwise highly complex set of theories
of speech perception.
1.2 Theoretical approaches to speech perception
VIII. Motor theory of speech perception (Liberan et al., 1967)
● Incorporating a biologically based link between perception and production, this
specialization prevents listeners from hearing the signal as an ordinary sound, but enables
them to use the systematic, yet special, relation between signal and gesture to perceive
the gesture.
36
● This theory hypothesizes that listeners actively reconstruct the actual ‘muscle activation
patterns’ which they themselves have as representations of vocal tract gestures. The
reconstruction invokes these muscle activation patterns in terms of sets of descriptors.
● These activation patterns constitute static rather than dynamic representations. And the
coarticulatory processes are evaluated by reference to static linear context rather than
dynamic hierarchically organized structures.
● The diagram shows how stored descriptors of muscle activation patterns for individual
sounds generalized from the listeners own experience contribute together with
coarticulatory descriptors to an active interpretation of the sound wave.
Stored muscle activation pattern descriptors
Soundwave Active interpretation hypothesis about speakers intended muscle activation
patterns
Coarticulatory descriptors
37
● The adaptive function of the perceptual side of this mode, the side with which the motor
theory is directly concerned, is to make the conversion from acoustic signal to gesture
automatically, and so to let listeners perceive phonetic structures without mediation by
(or translation from) the auditory appearances that the sounds might, on purely
psychoacoustic grounds, be expected to have.
● Data from transcranial magnetic stimulation (TMS), functional neuroimaging, and
neurophysiology indicate that frontal motor structures are automatically engaged during
passive speech perception (Fadiga et al., 2002; Hesslow, 2002; Kiefer & Pulvermüller,
2012). For instance, Fadiga et al. (2002) found an increase in motor-evoked potentials
38
recorded from a listener’s tongue muscles during a task in which participants heard
speech sounds but for which there was no explicit motor component.
● Watkins et al. (2003) applied TMS to the face area of the primary motor cortex in order
to elicit motor-evoked potentials in the lip muscles. They found that, in comparison with
control conditions (listening to non-verbal sounds and viewing eye and brow
movements), both listening to and viewing speech increased the size of the motor evoked
potentials. They concluded that both auditory and visual perception of speech leads to
activation of the speech motor system.
● A core prediction of motor theories of speech processing is that damage (whether
temporary or permanent) to the speech motor system should impair auditory speech
processing. Meister et al. (2007) found that when repetitive TMS was used to temporarily
suppress the premotor cortex, participants were impaired at discriminating stop
consonants embedded in noise.
● Also using repetitive TMS, Möttönen and Watkins. (2012) found that temporarily
disrupting the lip representations in the left motor cortex disrupted subjects’ ability to
discriminate between lip-articulated speech sounds, but did not affect those participants’
ability to discriminate sounds that were not lip articulated. Those two studies suggest that
disruption of the speech motor system can (subtly) impair speech sound processing.
● Stasenko et al. (2015) evaluated which aspects of auditory speech processing are
affected, and which are not, in a stroke patient with dysfunction of the speech motor
system. The participant was a 55-year-old, right-handed male who suffered a left
hemisphere ischaemic stroke, affecting the inferior frontal gyrus, premotor cortex, and
primary motor cortex and age and gender matched 6 controls also underwent the same
39
procedures. Participants presented with non-fluent speech that was marked by frequent
articulatory/phonological errors. THe phonological/articulatory errors were present in
both picture naming and word repetition suggests that production processes were in fact
the source of the errors. General procedure included picture naming and tests for the
integrity of auditory speech processing use minimal pair stimuli which uses
discrimination task. And additionally, his general intelligence, spontaneous speech,
verbal fluency, verbal working memory, repetition, reading, and spelling abilities were
assessed and Ultrasound imaging of the tongue was also performed. From analysis of the
results, found that the patient showed a normal phonemic categorical boundary when
discriminating two non-words that differ by a minimal pair (e.g., ADA–AGA). However,
using the same stimuli, the patient was unable to identify or label the non-word stimuli
(using a button-press response). A control task showed that he could identify speech
sounds by speaker gender, ruling out a general labelling impairment. These data suggest
that while the motor system is not causally involved in perception of the speech signal, it
may be used when other cues (e.g., meaning, context) are not available.
IX. Analysis by synthesis theory (Stevens & Halle, 1960)
● In its latest form by Stevens and House (1972), the model proposes that while some
acoustic attributes of speech signal may be simply converted to linguistic data, for large
part some reference is made to the articulatory mechanism during speech perception.
● The components of this model are shown i n Figure. The acoustic signal undergoes
peripheral processing whether it be speech or not. It is at this stage that the presence of
40
the dynamic characteristics of speech are sought and if found, the signals processed
accordingly .
● The peripheral processing transforms the acoustic waveform into a neural time-space
pattern that will be used at later stages of analysis . It is known that at this level more than
simple time-frequency analysis takes place and probably extraction of some of the more
easily discriminated information relevant to' description of certain features , also occurs.
Some normalization of the signal must also take place, so that the output is a set of
normalized attributes that contribute to later identification of linguistic units , (A).
● These auditory patterns are placed in temporary stores to await further processing. The
preliminary analysis component derives the features that are not strongly context
dependent from the signal (A). These features are available after conversion of the results
from peripheral processing and result in a partial specification of a feature matrix of the
utterance, (B).
41
● The control component has access to the results of preliminary processing as well as the
results of analysis of previous parts of the utterance, the lexicon , and the output of the
comparator. With this information, the control unit makes a hypothesis concerning the
representation of the utterance in terms of morphemes. The features are an abstract
quantity that underlie , but are not necessarily identified with, the acoustic attributes or
the production of the signal.
● This hypothesized representation (B) travel s to the generative rules where it is
transformed into a representation of the articulatory instructions that would be necessary
to generate such an utterance.(The articulatory instructions (V) that result , could be used
to control the articulatory mechanisms and produce speech output.)
● The generated patterns (V) are compared by the comparator with the attributes of the
analyzed utterance residing in a temporary store and judged as to their closeness of
match. This information is relayed to the control component and the hypothesized
sequence is either accepted or a new hypothesis is made using the error -detected in the
comparator.
● This loop is transversed until the message has been successfully identified. The model
relies on the comparison of auditory patterns (A) with articulator y instructions (V) that
potentially are able to produce such patterns.
● The comparator must contain a catalogue of such relations. The articulatory gesture is
represented in terms of tactile , proprioceptive sensations and motor-commands.
● Stevens and House state that the catalogue is built up as the child begins to utter sounds,
hear the auditory result and form their association . Therefore the child is aided i n
learning to perceive speech, by being able to articulate it .

42
● Kuhl et al. (2014) investigated motor brain activation, as well as auditory brain
activation, during discrimination of native and nonnative syllables in infants at two ages
(25- 7month old; 24- 11 month old) that straddle the developmental transition from
language-universal to language-specific speech perception. Adults are also tested in Exp.
1 (14- mean age 26.6). MEG data revealed that 7-month old infants activate auditory
(superior temporal) as well as motor brain areas (Broca’s area, cerebellum) in response to
speech, and equivalently for native and nonnative syllables. However, in 11- and
12-mo-old infants, native speech activates auditory brain areas to a greater degree than
nonnative, whereas nonnative speech activates motor brain areas to a greater degree than
native speech. This double dissociation in 11- to 12-mo-old infants matches the pattern of
results obtained in adult listeners. Infant data are consistent with Analysis by Synthesis:
auditory analysis of speech is coupled with synthesis of the motor plans necessary to
produce the speech signal. The findings have implications for: (i) perception-action
theories of speech perception, ie, both auditory and motor components contribute to the
developmental transition in speech perception that occurs at the end of the first year of
life. , (ii) the impact of “motherese” on early language learning, ie,motherese speech,
with its exaggerated acoustic and articulatory features, particularly in one-on-one
settings, enhances the activation of motor brain areas and the generation of internal motor
models of speech, and (iii) the “social-gating” hypothesis (social interaction improves
language learning by motivating infants via the reward systems of the brain) and humans’
development of social understanding (humans evolved brain mechanisms to detect and
interpret humans’ actions, behaviors, movements, and sounds. The present data
contribute to these views by demonstrating that auditory speech activates motor areas in
43
the infant brain. Motor brain activation in response to others’ communicative signals
could assist broader development of social understanding in humans).
X. Quantal Theory (Stevens, 1972)
● According to the QT, nonlinearities exist in the relation between vocal-tract
configurations and acoustic outputs.
● Along an articulatory dimension such as back-cavity length, there are regions where
perturbations in that parameter result in relatively small acoustic changes (e.g., in formant
frequencies) and other regions where comparable articulatory perturbations cause
substantial acoustic changes. These alternating regions of acoustic stability and instability
provide conditions for a kind of optimization of the phonemic inventory.

44
● The situation is represented schematically in Fig, which shows a hypothetical relation
between some acoustic parameter in the sound radiated from the vocal tract and some
articulatory parameter that takes on a series of values as indicated on the abscissa.
● There is a large acoustic (and auditory) difference between regions I and III. Within
regions I and III, however, the acoustic parameter is relatively insensitive to change in the
articulatory parameter. In other words, changes in articulation don’t have much effect on
the speech output. That is, there is a significant acoustic contrast between these two
regions, which are separated by the intermediate region II in which there is a rather
abrupt change in the acoustic parameter. Region II can, in some sense, be considered as a
threshold region such that as the acoustic parameter changes through this region the
auditory response shifts from one type of pattern to another.
● The theory also argues that these alternating regions of acoustic stability and instability
provide conditions for a kind of optimization of the phonemic inventory. If phonemes are
located in the stable regions, then they obviously require less articulatory precision on the
part of the talker. These same phonemes tend to be auditorily quite distinctive because
they are separated by regions of acoustic instability, that is, regions involving a relatively
high rate of acoustic change. This convergence of both talker-oriented and

45
listener-oriented selection criteria leads to a clear preference for certain " quantal"
phonemes such as the point vowels /i/, /a/, and /u/(Quantal vowels).
● In contrast with the centralized vowels such as /^/ and /ae/ (relatively stable, are not
bounded by acoustically unstable regions that would tend to make them highly
distinguishable from nearby vowels), the point vowels are both relatively stable and
relatively distinctive (in the sense that they are separated from nearby vowels by regions
of high acoustic instability).
● Examples of quantal categories over an articulatory continuum:
• Degree of glottal constriction: complete opening for voiceless sounds to less opening
for modal voicing to complete closure for a glottal stop
• Degree of vocal tract constriction: vowel (low V – mid V – high V) to glides to
fricatives to stops
• Place of articulation for vowels (i.e., place of constriction)
● Stevens et al. (2010) has reviewed three aspects of a theory of speech production and
perception: quantal theory, enhancement, and overlap.
● The section on quantal theory makes the claim that every phonological feature or contrast
is associated with its own quantal footprint. This footprint for a given feature is a
discontinuous (or quantal) relation between the displacement of an articulatory parameter
and the acoustical attribute that results from this articulatory movement.
● The second section shows that for a given quantally defined feature, the featural
specification during speech production may be embellished with other gestures that
enhance the quantally defined base. These enhancing gestures, together with the defining
gestures, provide a set of acoustic cues that are potentially available to a listener who
46
must use these cues to aid the identification of features, segments, and words. An
example of this type of enhancement for consonants is the rounding of the lips in the
production of /P/. This rounding tends to lower the natural frequency of the anterior
portion of the vocal tract, so that the frequency of the lowest major spectrum prominence
in the fricative spectrum is in the F3 range, well below the F4 or F5 range for the lowest
spectrum prominence for the contrasting fricative consonant /s/.
● The third section shows that even though rapid speech phenomena can obliterate defining
quantal information from the speech stream, nonetheless that information is recoverable
from the enhancement of the segment. A simple example of articulatory overlap occurs in
an utterance containing a sequence of two stop consonants, as in the casually produced
utterance top tag. In this example, the transition toward the labial closure for /p/ generates
enhancing cues for the labial place of articulation. However, the noise burst that would
normally signal the labial place of articulation is obliterated because the tongue blade
closure for /t/ occurs before the lip closure for /p/ is released, i.e., the two closures
overlap. Any cue for the labial place of articulation immediately prior to the /t/ release is
probably also obscured. In the case of /t/, there is little direct evidence of the presence of
the alveolar place during the time preceding the /t/ release. The alveolar burst, however,
provides strong evidence for alveolar place, as does the transition from this burst into the
following vowel /æ/. Thus some cues exist for /t/, but only weaker cues for /p/. The
‘‘defining’’ cue for /p/ is actually obliterated.
XI. Auditory theory of vowel perception (Fant,1967)
● Auditory theories claim that listeners identify acoustic patterns or features by matching
them to stored acoustic representations.

47
● According to Fant.1967, all of the arguments brought forth in support of motor theory
would fit just as well into sensory-based theories, in which the decoding process proceeds
without postulating the active mediation of speech-motor centers.
● The basic idea in Fant's approach is that the motor and sensory functions become more
and more involved as one proceeds from the peripheral to the central stages of analysis.
● He assumes that the final destination is a "message" that involves brain centers common
to both perception and production. According to Fant, there are separate sensory
(auditory) and motor (articulatory) branches, although he leaves open the possibility of
interaction between these two branches.
● Auditory input is first processed by the ear and is subject to primary auditory analysis.
These incoming auditory signals are then submitted to some kind of direct encoding into
distinctive auditory features (Fant, 1962). Finally, the features are combined in some
unspecified way to form phonemes, syllables, morphemes, and words.

48
● Although much of Fant's concern has been with continued acoustical investigations of the
distinctive features of phonemes, the problems of invariance and segmentation, which are
central issues in speech perception, remain unresolved by the model.
● The auditory-perceptual theory of phonetic recognition by Miller. 1987, is designed to
provide a comprehensive and coherent account of the facts relating acoustic waveforms
to perceived vowel categories.
● It consists of a three-stage process (Miller, 1984a), where the acoustic waveform of
speech is converted by the human listener to a string of category codes that correspond to
the allophones of the language. The theory describes the "bottom-up" aspects of phonetic
perception, and it is conceptual in nature.

49
● Stage 1 of the theory is the transformation of the acoustic waveform to auditory-sensory
dimensions. In stage 1, it is assumed that short-term spectral analyses are performed on
the incoming speech waveform. Each spectrum can be classified as a glottal-source
spectrum, a burst-friction spectrum, or a combination of the two. At each moment, the
spectral envelope patterns of the glottal source and burst-friction sounds are represented
as sensory responses or sensory pointers in a phonetically relevant auditory-perceptual
space.
● Stage 2 is the transformation of the sensory data to perceptual values. In stage 2, these
sensory responses, or sensory pointers, are converted into a unitary perceptual response,
or perceptual pointer, that is also located in the auditory-perceptual space. The perceptual
response is a hypothetical construct or intervening variable that is based on the general
notion that speech inputs are integrated to form a unitary perceptual stream.
● In stage 3, the perceptual variables are converted to phonetic-linguistic categories.
That is in stage 3, segmentation and categorization mechanisms that depend on the
dynamics of the perceptual pointer in relation to perceptual target zones within the
auditory-perceptual space result in a string of category codes that correspond to the
allophones of the language.
XII. Neurological theories of speech perception
● This theory suggests that the phonological attributes of human speech are decoded by
neurosensory receptive fields-"feature detectors"-innately structured to detect, and
respond to, the various distinguishing parameters of the acoustic sound stream.
50
● Menyuk (1968) states: "'If a comparatively small set of features or attributes can describe
the speech sounds in all languages, then, it is hypothesized, these attributes are related to
the physiological capacities of man to produce and perceive sounds.'"
● Abbs & Sussman, 1971 postulated "feature detector" theory of speech perception, this
view does not depend on a particular distinctive feature system, but rather concerns with
the process of auditory decoding of the acoustic speech signal which results in phonetic
identification.
● "Feature detectors" can be broadly defined as organizational configurations of the sensory
nervous system that are highly sensitive to certain parameters of complex stimuli.
Complex acoustic stimuli may be considered as composed of several physical parameters,
for example, intensity, wave-length, direction, form, and temporal patterning.
● Feature detectors can be distinguished from passive acoustic filters simply on the basis of
function. An acoustic filter is specific only to center frequency and resonance band width.
Such stimulus processing is the result of a limited dimensional sampling of the
information spatially and temporally embedded in a speech sound. Feature detectors, on
the other hand, respond to physical parameters that are composed of several different
aspects of the signal, i.e., frequency, intensity, rate of frequency change, rate of intensity
change, and durational characteristics of these attributes. Thus, in contrast to acoustic
filters, feature detectors are simultaneously sensitive to many characteristics of acoustic
stimuli.
● Physiological evidence for complex feature detectors in the auditory systems of
vertebrates and lower forms is also available. Nelson, Erulkar and Bryan (1966) has
found that particular features of sound stimulus patterns appear to be detected and
51
processed by neuronal cells structured to respond to a particular aggregate of physical
characteristics. Reports of lateral inhibition in the auditory system (Katsuki, 1961, 1962;
Nomoto et al., 1964) indicate the existence of a sophisticated sharpening mechanism at
the level of the eighth nerve.
● According to the later evidences a complex spatial configuration of receptor cells,
maximally sensitive to certain physical parameters of the auditory stimulus, can be
postulated which may offer an explanation for the neural decoding process involved in
speech perception. And also the operation of feature detectors might be applied to explain
perception of the distinctive features of speech (Jakobson, Fant, and Halle, 1952).
● In each of these acoustic features, a series of specific spatio-temporal "trade marks" have
been tentatively provided that may serve as identifying stimulus characteristics. Likewise,
each series of space-time stimulus characteristics can conceivably correspond to a
specific neurosensory detector process, especially sensitive to recognize specific stimulus
traits.
● Tonotopic organization has been identified in human auditory cortex using a variety of
imaging techniques. The majority of early studies used only two different stimulus
frequencies (Bilecen et al., 1998; Lauter et al., 1985; Lockwood et al., 1999; Talavage et
al., 2000; Wessinger et al., 1997). These studies suggested a general pattern in which
high frequencies activated medial auditory cortex and low frequencies activated more
anterolateral regions in the superior temporal plane. This pattern has usually been
interpreted as a single low-to-high frequency gradient oriented along Heschl's gyrus
(HG).
52
● Later functional magnetic resonance imaging (fMRI) studies improved on this design by
adding intermediate frequencies, allowing the identification of frequency gradients
(Formisano et al., 2003;Woods et al., 2009). Results and interpretations from these
studies have varied considerably. For example, one study reported a single high-to-low
gradient extending from posterior medial to anterior lateral auditory areas, similar to
earlier studies (Langers, Backes, van Dijk 2007).
● A second study, however, described two mirror-symmetric frequency gradients
(high-low-high) extending approximately along the axis of HG (Formisano et al., 2003).
In a third study, three consistent gradients were reported, none of which clearly follow
the long axis of HG (Talavage et al., 2004).
● Finally, the authors of a fourth study found differences in activation between anterior and
posterior HG as well as medial and lateral differences, but concluded that the observed
activation profile did not represent frequency gradients but instead different functional
regions within the auditory cortex (Schonwiesner, von Cramon, Rubsamen 2002).
● Talavage et al. (2004) done a functional magnetic resonance imaging study using
frequency-swept stimuli to identify progressions of frequency sensitivity across the
cortical surface. The center-frequency of narrow-band, amplitude-modulated noise was
slowly swept between 125 and 8,000 Hz. Areas of cortex exhibiting a progressive change
in response latency with position were considered tonotopically organized. There exist
two main findings. First, six progressions of frequency sensitivity (i.e., tonotopic
mappings) were repeatedly observed in the superior temporal plane. Second, the locations
of the higher- and lower-frequency endpoints of these progressions were approximately
congruent with regions reported to be most responsive to discrete higher- and

53
lower-frequency stimuli. They have concluded that a correspondence between these
progressions and anatomically defined cortical areas, suggesting that in human auditory
cortex exhibit at least six progressions of frequency sensitivity on the human superior
temporal lobe.
XIII. Pandemonium Model
Author & Year: Selfridge’s (1959)
Assumption:
Selfridge's (1959) pandemonium model of pattern analysis assumes that low level image demons
detect physical attributes of the stimulation, middle level computational demons detect features
peculiar to certain objects, and high level cognitive demons detect objects. The model is called
pandemonium because the signalling function of these demons is analogous to yelling, with the
demons who yell the loudest calling the shots, so to speak.
Background:
Perception and cognition are not just matters of connections among the neurons or the events,
things, features, categories, etc. that those neurons carry. Put another way, we are not true digital
computers. It is not a matter of this-or-that and here-or-there. There are also matters of degree,
of quantity, intensity, or volume.
Neurons don't just work on the on-or-off principle. They can fire repeatedly, rapidly, or just
occasionally or rarely. There can be hundreds of synapses telling a neuron to fire, and hundreds
54
more telling it not to. Likewise, the ideas we have in our minds can be powerful and influence
many other ideas, or they can be once-in-a-lifetime flashes of brilliance.
It is important, in order to understand how the mind/brain works, to keep this in mind. One of
the most memorable ways of doing this goes back to the early work on artificial intelligence,
specifically that of Oliver Selfridge (1959. Pandemonium: A paradigm for learning.
In Symposium on the mechanization of thought processes. London: HM Stationary Office)
Explanation:
In the pandemonium model, each letter or digit is represented by a ‘cognitive demon’,
which holds a list of the features that define its shape. The cognitive demons listen for evidence
that matches their description, which comes from ‘feature demons’ that detect individual lines,
such as a horizontal or a vertical, and shout out if they are activated. When a cognitive demon
starts to detect features consistent with its shape, it too begins to shout. So, if a letter such as ‘A’
is presented to the image demon, the feature demons will start providing evidence for it. Some of
this will be consistent with both an ‘A’ and a ‘H’, and the A demon will start shouting, but so too
will the H demon. As more evidence comes in from the feature demons, the evidence for ‘A’ will
be greater than for ‘H’ and the A demon will be shouting the loudest. The decision demon then
makes a decision on which is the most likely letter by seeing which voice dominates the noise
(see Figure 6.2) When information from the features is ambiguous, for example in handwriting,
cognitive demons can take account of word knowledge and context to disambiguate the letter. Of
course Selfridge did not propose that there really were demons in the brain, but the principles of
parallel processing for all features, and levels of excitation in the nervous system, are consistent
with what is known. However, the problem of who listens to the demons and the arrangement of
55
features relative to each other within a shape are crucially important. Two vertical lines and a
horizontal line do not define ‘H’; it is the relation between them that does so, for example ll-, is
not an H. A theory of pattern or object recognition must be able to specify the relation between
parts of an object. However, pandemonium can be considered a precursor for parallel distributed
processing (PDP) models, of which one very influential contribution is the interactive activation
model.
https://www.youtube.com/watch?v=YRqwXaCnxy8
\
56
XIV. Direct- Realistic Approach
Author and Year: Fowler, 1986
Direct realism postulates that speech perception is direct (i.e., happens through the perception of
articulatory gestures), but it is not special. All perception involves direct recovery of the distal
source of the event being perceived (Gibson). In vision, you perceive objects (e.g., trees, cars,
etc.). Likewise with smell you perceive e.g., cookies, roses, etc. Why not in the auditory
perception of speech? So, listeners perceive tongues and lips. The articulatory gestures that are
the objects of speech perception are not intended gestures (as in Motor Theory). Rather, they are
the actual gestures.
A theory of speech perception called direct realism (Fowler, Galantucci, & Saltzman, 2003) is an
alternative to both the motor theory and a general auditory approach. The direct realism
perspective on speech perception is not easy to explain in the absence of substantial background
information (which is not pursued here). Some scientists (Cleary & Pisoni, 2001) question the
57
utility of direct realism as a theory because it is hard to understand how reasonable experimental
tests can be made to support or falsify it.
The direct realism perspective on speech perception takes its inspiration from the pioneering
work of the psychologist J. J. Gibson (1904–1979). Gibson’s theoretical and experimental focus
was on visual perception. Gibson (1968, 1979) rejected the idea of cognitively driven,
“constructed” percepts. He did not like the idea of perception in which percepts—often referred
to as the “objects” of perception—were mediated by cognitive operations to produce a
representation of the external world. Rather, he proposed the idea that animals, including
humans, perceive the visual layouts of environments directly, by linking the stimulation of their
senses (by light waves, in the case of vision) with the sources of the stimulation. For example,
when perceivers are exposed to the patterning of light reflected from a chair, they perceive the
chair directly, rather than interpreting through neural mechanisms the light patterns reaching
their eyes.
Gibson’s view was that objects in the environment structure the medium through which they are
conveyed to the senses. A chair structures the medium (light waves) by the patterns of light
reflected from it, and humans presumably learn that structure and perceive the chair directly,
without cognitive mediation. Direct realists argue that the objects of perception are not the
proximal stimulation, which in the example above is the light reflections at the eye, from the
chair, but rather the distal source of the light reflection. In this example, the distal source is the
chair. More specifically, in Gibson’s view, perceivers do not “process” and “encode” the light
waves via cognitive operations whose output is a symbolic representation.

58
Gibson coined the term “ecological psychology” for this view of perception. The term is
consistent with a “realist” (i.e., ecologically valid) view of how organisms perceive objects. For
Gibson, much of the experimental psychologist’s vocabulary, including terms found in
information processing models such as “encoding” and “representation,” were not much more
than a convenient set of descriptive terms. For Gibson, the terms did not have ecological
validity—that is, they did not represent “real” things, “real” mechanisms that were the stuff of
perception. The motor theory of speech perception requires operations of a special module to
convert (decode) an unstable acoustic signal into a stable articulatory representation.
In the motor theory, the objects of speech perception are the articulatory characteristics (either
places of articulation or articulatory ges tures over time) of phonemes or phoneme sequences as
transformed by the special processor. On the other hand, a general auditory approach to speech
perception requires some processing stages to match the incoming acoustics to stored templates
or features or a statistical model of acoustic speech signals (Klatt, 1989; Massaro & Chen, 2008;
Stevens, 2005). The object of speech perception in the auditory theory is the speech acoustic
signal, plus other sources of sensory information (such as visual information from a speaker’s
face during speech). In both cases, cognitive operations of varying degrees of automaticity are
required for perception of incoming sounds. Fowler (1986, 1996; Fowler, Shankweiler, &
Studdert-Kennedy, 2016, especially pp. 138–143), the leading proponent of direct realism in
speech perception, rejects cognitive “constructions” in the perception of speech sounds. Fowler
agrees with the Gibsonian idea that scientists should reject the idea of perception driven by
cognitive processes that produce an “output symbol.” In the case of speech perception, the
simplest example of such a symbol is a phoneme. Rather, Fowler argues for direct perception of
articulatory gestures. In this case, the objects of speech perception are articulatory gestures, such
59
as movements of the tongue or jaw. The gestures are the distal source and the pressure wave
produced by articulatory movement and that reaches the ears is the proximal stimulation.
Movements of the articulators, in Fowler’s view, structure the medium (air) through which the
information is transmitted to the ears. Listeners track this structure as phonetic gestures unfold
over time. The perception of these gestures is direct, not mediated by other processes. Listeners
literally hear articulatory gestures (or, on another interpretation, the sounds they hear are the
articulatory gestures). The parallel with Gibson’s (1969, 1970) view of visual perception is easy
to see.
In the direct realism theory of speech perception, articulatory gestures are perceived directly. No
special mechanisms are required to make perceptual decisions concerning phonetic events. An
apparent advantage of this perspective is that the acoustic variability for a given vowel, which
seems to make auditory theories cumbersome with the need to “know” all the possible variations
of the vowel’s formant frequencies both within and across speakers , becomes a non-issue. The
directly-perceived gestures for the vowel may have variable formant frequencies, especially
across speakers whose vocal tract lengths are very different (e.g., men versus children), but the
gestures are nearly the same. Adult men have much longer vocal tracts than five-year-old
children and therefore different formant frequencies for any vowel, but the articulatory gestures
for both groups are nearly the same for any vowel. According to this view, children use
articulatory gestures for /i/ similar to those used for the same vowel produced by adult men. The
gestures can be perceived directly across age and sex and are not affected by the acoustic
variability for the vowel from speaker to speaker and other sources of acoustic variation as
discussed earlier. Unfortunately, it is not true that different speakers use the same articulatory
gestures for a specific sound. Using direct measures of speech movement, Johnson, Ladefoged,
60
and Lindau (1993) showed substantial articulatory variation for the same vowel across different
speakers. Similarly, significant speaker-to-speaker variability of tongue movements for
American English “r” before vowels was reported by Westbury, Hashi, and Lindstrom (1998).
There does not appear to be acrossspeaker stability of articulatory gestures for vowels or the
sonorant “r” (sonorant = vowel-like). Direct realism does not have a ready answer at the level of
articulatory gestures to get around the variable acoustic characteristics of a given vowel or other
vowel-like sounds.
https://www.youtube.com/watch?v=JF0ArkVDrT8
ARTICLE
Direct Perceptions of Carol Fowler’s Theoretical Perspective (Whalen, 2016)
Whalen (2016) argues that the work of Carol Fowler is usually criticized for things it does not
claim or treated as if it were not possible to claim what she claims.
Misunderstanding #1: “Direct” means “perfect”:
Small variations in articulation that lead to the same phoneme are often taken to disprove
Direct Realism.
Misunderstanding #2: “Direct” means “the signal is irrelevant”
The second misunderstanding is that creatures using direct perception cannot end up with
acoustically robust articulations. That is, if the signal is relevant, it is assumed that articulation
is irrelevant. One aspect of this mistake is that the information conveyed by the signal is
assumed to be specific articulatory shapes.
Misunderstanding #3: “Direct” doesn’t really mean anything

61
A third misunderstanding is that there is no sense in which perception is “direct.” This mostly
derives from the perfectly correct observation that speech information reaches our brain via
sense organs (for vision, see Ullman, 1980).
XV. Input to Lexicon- Lexical Access from Spectrac(LAFS)
Author & Year: Klatt, 1979
Explanation:
Phonetic processes and representations played no real part in early psycholinguistic models of
spoken-word recognition. Klatt’s (1979) lexical access FROM SPECTRA (LAFS) MODEL was
largely ignored by psychologists. In the LAFS Model spoken-word recognition is accomplished
based on acoustic information alone.
ASSUMPTION:
● Klatt’s Lexical Ac cess From Sp ectra (LAFS) model assumes direct, noninteractive access
of lexical entries based on context-sensitive spectral sections (Klatt, 1980). For example,
spectrograms of the current speech signal are mapped directly onto a lexicon of spectral
templates.
● Klatt’s model assumes that adult listeners have a dictionary of all lawful diphone
sequences in long-term memory. Associated with each diphone sequence is a prototypical

62
spectral representation. Klatt proposes spectral representations of diphone sequences to
overcome the contextual variability of individual segments.
In Klatt’s LAFS model, the listener computes spectral representations of an input word and
compares these representations to the prototypes in memory. Word recognition is accomplished
when a best match is found between the input spectra and the diphone representations. In this
portion of the model, word recognition is accomplished directly on the basis of spectral
representations of the sensory input.
One important aspect of Klatt’s LAFS model is that it explicitly avoids any need to compute a
distinct level of representation corresponding to discrete phonemic segments. Instead, LAFS uses
a precompiled, acoustically-based lexicon of all possible words in a network of diphone power
spectra. These spectral templates are assumed to be context-sensitive units much like
“Wick-elphones” because they are assumed to represent the acoustic correlates of phones in
different phonetic environments (Wickelgren, 1969). Diphones in the LAFS system accomplish
this by encoding the spectral characteristics of the segments themselves and the transitions from
the middle of one segment to the middle of the next segment.
Klatt argues that diphone concatenation is sufficient to capture much of the context-dependent
variability observed for phonetic segments in spoken words. Word recognition in this model is
accomplished by computing a power spectrum of the input speech signal every 10 ms and then
comparing this input spectrum to spectral templates stored in a precompiled network. The basic
idea of LAFS, adapted from the Harpy system, is to find the path through the network that best
63
represents the observed input spectra (Klatt, 1977). This single path is then assumed to represent
the optimal phonetic transcription of the input signal.
https://www.youtube.com/watch?v=x8HIAVTeGNk
XVI. TRACE
Author & Year: Elman and McClelland's (1981, 1983)
The first, Elman and McClelland's (1981, 1983) interactive activation TRACE model of word
recognition, is perhaps one of the most highly interactive theories to date.
The model is called the TRACE model because the network of units forms a dynamic processing
structure called “the Trace,” which serves at once as the perceptual processing mechanism and as
the system’s working memory.
The model is instantiated in two simulation programs. TRACE 1 which deals with short
segments of real speech, and suggests a mechanism for coping with the fact that the cues to the
identity of phonemes vary as a function of context.
TRACE II, simulates a large number of empirical findings on the perception of phonemes and
words and on the interactions of phoneme and word perception. At the phoneme level, TRACE
II simulates the influence of lexical information on the identification of phonemes and accounts
for the fact that lexical effects are found under certain conditions but not others. The model also
shows how knowledge of phonological constraints can be embodied in particular lexical items
but can still be used to influence processing of novel, nonword utterances.
Elman and McClelland's (1981, 1983) model is based on a system of simple processing units
called "nodes." Nodes may stand for features, phonemes, or words. However, nodes at each level
64
are alike in that each has an activation level signifying the degree to which the input is consistent
with the unit that the node represents.
In addition, each node has a resting level and a threshold. In the presence of confirmatory
evidence, the activation level of a node rises toward its threshold; in the absence of such
evidence, activation decays toward the resting level of the node.
Nodes within this system are highly interconnected and when a given node reaches threshold, it
may influence other nodes to which it is connected. Connections between nodes are of two types:
excitatory and inhibitory. Thus a node that has reached threshold may raise the activation of
some of the nodes to which it is connected while lowering the activation of others. Connections
between levels are exclusively excitatory and bidirectional. Thus phoneme nodes may excite
word nodes, and word nodes may in turn excite phoneme nodes.
65
FIG.. A subset of the units in TRACE II. Each rectangle represents a different unit. The labels
indicate the item for which the unit stands, and the horizontal edges of the rectangle indicate
the portion of the Trace spanned by each unit. The input feature specifications for the phrase
“tea cup,” preceded and followed by silence, are indicated for the three illustrated dimensions
by the blackening of the corresponding feature units.
In the figures, each rectangle corresponds to a separate processing unit. The labels on the units
and along the side indicate the spoken object (feature, phoneme, or word) for which each unit
stands. The left and right edges of each rectangle indicate the portion of the input the unit
spans.
At the feature level, there are several banks of feature detectors, one for each of several
66
dimensions of speech sounds. Each bank is replicated for each of several successive moments
in time, or time slices.
At the phoneme level, there are detectors for each of the phonemes.
At the word level, there are detectors for each word. There is one copy of each word detector
centered over every three feature slices.
The entire network of units is called “the Trace,” because the pattern of activation left by a
spoken input is a trace of the analysis of the input at each of the three processing levels.
The Elman and McClelland model illustrates how a highly interactive system may be
conceptualized. In addition, it incorporates notions of both excitation and inhibition. By doing
so, it directly incorporates a mechanism that reduces the possibility of nodes inconsistent with
the evidence being activated while allowing for positive evidence at one level to influence the
activation of nodes at another. Although Elman and McClelland's model is highly interactive, it
is not without constraints. Namely, connections between levels are only excitatory, and within
levels they are only inhibitory.
https://www.youtube.com/watch?v=oK4EQCYdXwM
XVII. Dual Stream Model
Author & Year: Hickok and Poeppel, 2007
The Dual Stream model of speech/language processing holds that there are two functionally
distinct computational/neural networks that process speech/language information, one that

67
interfaces sensory/phonological networks with conceptual-semantic systems, and one that
interfaces sensory/phonological networks with motor-articulatory systems (Hickok & Poeppel,
2000, 2004, 2007).
This model proposes that a ventral stream, which involves structures in the superior and middle
portions of the temporal lobe, is involved in processing speech signals for comprehension
(speech recognition). A dorsal stream, which involves structures in the posterior frontal lobe and
the posterior dorsal-most aspect of the temporal lobe and parietal operculum, is involved in
translating acoustic speech signals into articulatory representations in the frontal lobe, which is
essential for speech development and normal speech production.
STREAMS STRUCTURES FUNCTIONS
Ventral Stream Superior and middle portions Involves in processing speech
of the temporal lobe signals for comprehension
(speech recognition).
Dorsal Stream Posterior frontal lobe and the Involves in translating
posterior dorsal-most aspect acoustic speech signals into
of the temporal lobe and articulatory representations in
parietal operculum the frontal lobe, which is
essential for speech
development and normal
speech production.
68
The suggestion that the dorsal stream has an auditory–motor integration function differs from
earlier arguments for a dorsal auditory ‘where’ system, but is consistent with recent
conceptualizations of the dorsal visual stream and has gained support in recent years .Generally
they propose that speech perception tasks rely to a greater extent on dorsal stream circuitry,
whereas speech recognition tasks rely more on ventral stream circuitry (with shared neural tissue
in the left STG), thus explaining the observed double dissociations. In addition, in contrast to the
typical view that speech processing is mainly left-hemisphere dependent, the model suggests that
the ventral stream is bilaterally organized (although with important computational differences
between the two hemispheres); so, the ventral stream itself comprises parallel processing
streams. This would explain the failure to find substantial speech recognition deficits following
unilateral temporal lobe damage. The dorsal stream, however, is strongly left-dominant, which
explains why production deficits are prominent sequelae of dorsal temporal and frontal lesions,
and why left-hemisphere injury can substantially impair performance in speech perception tasks.
69
The dual-stream model of the functional anatomy of language. a | Schematic diagram of the
dual-stream model. The earliest stage of cortical speech processing involves some form of
spectrotemporal analysis, which is carried out in auditory cortices bilaterally in the
supratemporal plane. These spectrotemporal computations appear to differ between the two
hemispheres. Phonological-level processing and representation involves the middle to posterior
portions of the superior temporal sulcus (STS) bilaterally, although there may be a weak
left-hemisphere bias at this level of processing. Subsequently, the system diverges into two broad
streams, a dorsal pathway (blue) that maps sensory or phonological representations onto
articulatory motor representations, and a ventral pathway (pink) that maps sensory or
phonological representations onto lexical conceptual representations.

70
b | Approximate anatomical locations of the dual-stream model components, specified as
precisely as available evidence allows. Regions shaded green depict areas on the dorsal surface
of the superior temporal gyrus (STG) that are proposed to be involved in spectrotemporal
analysis. Regions shaded yellow in the posterior half of the STS are implicated in
phonological-level processes. Regions shaded pink represent the ventral stream, which is
bilaterally organized with a weak left-hemisphere bias. The more posterior regions of the ventral
stream, posterior middle and inferior portions of the temporal lobes correspond to the lexical
interface, which links phonological and semantic information, whereas the e more anterior
locations correspond to the proposed combinatorial network. Regions shaded blue represent the
dorsal stream, which is strongly left dominant. The posterior region of the dorsal stream
corresponds to an area in the Sylvian fissure at the parietotemporal boundary (area Spt), which is
proposed to be a sensorimotor interface, whereas the more anterior locations in the frontal lobe,
probably involving Broca’s region and a more dorsal premotor site, correspond to portions of the
articulatory network.
aITS, anterior inferior temporal sulcus; aMTG, anterior middle temporal gyrus; pIFG, posterior
inferior frontal gyrus; PM, premotor cortex.

71
https://www.youtube.com/watch?v=uLUOzUYC3u4
ARTICLE
Anatomy of aphasia revisited (Fridriksson et al., 2018)
In this article, they present a follow-up study to our previous work that used lesion data to
reveal the anatomical boundaries of the dorsal and ventral streams supporting speech and
language processing. Specifically, by emphasizing clinical measures, we examine the effect of
cortical damage and disconnection involving the dorsal and ventral streams on aphasic
impairment. The results reveal that measures of motor speech impairment mostly involve
damage to the dorsal stream, whereas measures of impaired speech comprehension are more
strongly associated with ventral stream involvement. Equally important, many clinical tests
that target behaviours such as naming, speech repetition, or grammatical processing rely on
interactions between the two streams. This latter finding explains why patients with seemingly
disparate lesion locations often experience similar impairments on given subtests. Namely,
72
these individuals’ cortical damage, although dissimilar, affects a broad cortical network that
plays a role in carrying out a given speech or language task. The current data suggested this is
a more accurate characterization than ascribing specific lesion locations as responsible for
specific language deficits.
https://www.youtube.com/watch?v=F0vnSsqnax0
References
1. Abbs, J. H., & Sussman, H. M. (1971). Neurophysiological Feature Detectors and Speech
Perception: A Discussion of Theoretical Implications. Journal of Speech Language and
Hearing Research, 14(1), 23.
2. Bever, Thomas & Poeppel, David. (2010). Analysis by Synthesis: A (Re-)Emerging
Program of Research for Language and Vision. Biolinguistics. 4. 174-200.
3. Broadbent, D. E., & Ladefoged, P. (1957). On the fusion of sounds reaching different
sense organs. 2 9, 708-710
4. Casserly, E. D., & Pisoni, D. B. (2010). Speech perception and production. Wiley
Interdisciplinary Reviews: Cognitive Science, 1( 5), 629-647.
5. Feng, G., Zhou, B., Zhou, W., Beauchamp, M. S., & Magnotti, J. F. (2019). A Laboratory
Study of the McGurk Effect in 324 Monozygotic and Dizygotic
6. Fridriksson, J., den Ouden, D. B., Hillis, A. E., Hickok, G., Rorden, C., Basilakos, A., ...
& Bonilha, L. (2018). Anatomy of aphasia revisited. Brain, 141( 3), 848-862.
7. Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature
reviews neuroscience, 8(5), 393-402.

73
8. Hixon, T. J., Weismer, G., & Hoit, J. D. (2018). Preclinical speech science: Anatomy,
physiology, acoustics, and perception. Plural Publishing.
9. Kenneth N. Stevens. (1989) On the quantal nature of speech. Journal of Phonetics,
Volume 17, Issues 1–2,Pages 3-45, ISSN 0095-4470
10. Miller, J. D. (1989). Auditory-perceptual interpretation of the vowel. The Journal of the
Acoustical Society of America, 85(5), 2114–2134. doi:10.1121/1.397862
11. Nath, A. R., & Beauchamp, M. S. (2012). A neural basis for interindividual differences in
the McGurk effect, a multisensory speech illusion. NeuroImage, 59(1), 781–787.
12. Pisoni, D. B., & Luce, P. A. (1987). Acoustic-phonetic representations in word
recognition. Cognition, 25(1-2), 21-52.
13. Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S., & Lang, J. M. (1994). On the
perceptual organization of speech. Psychological Review, 101(1), 129–156.
14. Schwab, E. C., & Nusbaum, H. C. (Eds.). (2013). Pattern recognition by humans and
machines: speech perception (Vol. 1). Academic Press.
15. Styles, E. A. (2005). Attention, perception and memory: an integrated introduction.
Psychology Press.
16. Tiippana, K. (2014). What is the McGurk effect? Frontiers in Psychology, 5.
17. van Wassenhove, V., Grant, K. W., and Poeppel, D. (2007). Temporal window of
integration in auditory-visual speech perception. Neuropsychologia 45, 598–607.
18. Wright, R., Frisch, S., & Pisoni, D. B. (1997). Speech perception. Research on Spoken
Language Processing Progress Report No, 21, 1-50.
19. Whalen, D. H. (2016). Direct perceptions of Carol Fowler's theoretical perspective.
Ecological Psychology, 28( 4), 183-187.

74

Unit 1 - Sp. Lg. Proc. & Pros.

Uploaded by

Copyright:

Available Formats

You might also like

Unit 1 - Sp. Lg. Proc. & Pros.

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unit 1 - Sp. Lg. Proc. & Pros.

Uploaded by

Copyright:

Available Formats

1

SLP 203: Speech Language Processing & Prosody

SUBMITTED BY: SWATHI SURESH & VAISHNAVI S

M.SC SLP I YEAR

SUBMITTED BY: DR. SUJA MATHEWS

SUBMITTED ON: 05.11.2020

I. What is Speech Perception?

VI. Perceptual organization in speech- Gestalt principles of perceptual grouping

VII. Phonetic organization

1.2 Theoretical approaches to speech perception

VIII. Motor Theory of Speech Perception

IX. Analysis by Synthesis Theory

XI. Auditory Theory of Vowel Perception

XII. Neurological theories of speech perception

XIII. Pandemonium model

1.1 Introduction to speech Processing

I. What is Speech Perception?

The term ‘speech perception’ is used in a variety of contexts. Critical terminological

distinctions should be made.

The basic problems in speech perception involve a number of issues surrounding

● The internal representation of the speech signal and

● The perceptual constancy of this representation—the problem of acoustic-phonetic

● The phenomena associated with perceptual contrast to identical stimulation.

III. The Principal issues in Speech Perception:

a. Linearity, Lack of Acoustic—Phonetic Invariance, and the Segmentation Problem

b. Units of Analysis in Speech Perception

c. Perceptual Constancy in Speech - The Normalization Problem in Speech Perception

a. Linearity, Lack of Acoustic—Phonetic Invariance, and the

A spectrogram of the phrase "I owe you". There are no clearly

distinguishable boundaries between speech sounds.

conditions of linearity and invariance.​

● Although humans can perform it effortlessly, the recognition of fluent speech by

machines has thus far proven to be a nearly intractable problem.

or even single phoneme, can sound completely differently depending on many

​1) Individual differences.​ Acoustic structure of speech depends a lot on a

speaker's accent, physical and psychological characteristics.

​2) Speech conditions.

following spectrograms of a sound /d/ in three different positions:

despite big differences in their acoustic structure.

perceived phonemes independently of the surrounding context.

● As a result of coarticulation in speech production, there is typically a great deal of

often represented acoustically in quite different ways depending on the surrounding

context than when speech sounds are produced in isolation.

● The context-conditioned variability resulting from coarticulation also presents enormous

on an analysis of the physical signal.

b. Units of Analysis In Speech Perception

analysis (see Pisoni, 1976; Studdert-Kennedy, 1976).

● Thus the production of a stop consonant is conditioned or influenced by the production of

the adjacent phonemes.

information about a number of phonemes or syllables.

speech: phonetic features, phonemes, and syllables.

show the same context-conditioned variability (i.e., lack of invariance) exhibited by

c. Perceptual Constancy in Speech - The Normalization Problem in

One is the talker-normalization problem.

strategies present in their speech.

● As a consequence, substantial differences among talkers have been observed in the

absolute values of the acoustic correlates of many phonetic features.

conditions of linearity and invariance.

1) Individual differences. Acoustic structure of speech depends a lot on a

2) Speech conditions.

acoustic and visual stimulus components is also important(Tiippana, 2014) .

different, intermediate consonant (alveolar) is perceived (van Wassenhove, 2013).

factors shared by twins contribute to individual differences in multisensory speech perception.

one another, based on the principle of grouping by proximity.

another, based on the principle of grouping by similarity.

cognitive cues is necessary for accurate speech perception.

concept—the degree to which information in addition to the acoustic signal is necessary.

processing and presume a more automatic perceptual response.

data, such as communicative context or general knowledge. In other words, perception

reconstruction invokes these muscle activation patterns in terms of sets of descriptors.

coarticulatory descriptors to an active interpretation of the sound wave.

Soundwave Active interpretation hypothesis about speakers intended muscle activation

automatically, and so to let listeners perceive phonetic structures without mediation by