Unit 1 - Speech Language Processing & Prosody


SLP 203: Speech Language Processing & Prosody


UNIT 1: INTRODUCTION TO SPEECH LANGUAGE PROCESSING

SUBMITTED BY: SWATHI SURESH & VAISHNAVI S

M.SC SLP I YEAR

SUBMITTED TO: DR. SUJA MATHEWS

SUBMITTED ON: 05.11.2020



CONTENTS
1.1 Introduction to speech Processing

I. What is Speech Perception?


II. Definition
III. The Principal Issues in Speech Perception
a. Linearity, Lack of Acoustic—Phonetic Invariance, and the Segmentation Problem
b. Units of Analysis in Speech Perception
c. Perceptual Constancy in Speech - The Normalization Problem in Speech
Perception
IV. Categorical Perception

V. McGurk Effect

VI. Perceptual organization in speech- Gestalt principles of perceptual grouping

VII. Phonetic organization

1.2 Theoretical approaches to speech perception

VIII. Motor Theory of Speech Perception

IX. Analysis by Synthesis Theory

X. Quantal Theory

XI. Auditory Theory of Vowel Perception

XII. Neurological theories of speech perception

XIII. Pandemonium model


XIV. Direct Realist Approach
XV. Input to lexicon- Lexical access from spectra (LAFS)
XVI. TRACE
XVII. Dual stream model
XVIII. References

1.1 Introduction to speech Processing

I. What is Speech Perception?

The term ‘speech perception’ is used in a variety of contexts. Critical terminological

distinctions should be made.

https://www.youtube.com/watch?v=xY6DBIusIsI

II. DEFINITION:

The basic problems in speech perception involve a number of issues surrounding

● The internal representation of the speech signal and

● The perceptual constancy of this representation—the problem of acoustic-phonetic

invariance and

● The phenomena associated with perceptual contrast in response to identical stimulation.

III. The Principal issues in Speech Perception:

a. Linearity, Lack of Acoustic—Phonetic Invariance, and the Segmentation Problem

b. Units of Analysis in Speech Perception

c. Perceptual Constancy in Speech - The Normalization Problem in Speech Perception

a. Linearity, Lack of Acoustic—Phonetic Invariance, and the

Segmentation Problem:

The three problems are the linearity problem, the lack of invariance, and the segmentation problem. A spectrogram of the phrase "I owe you" illustrates them: there are no clearly distinguishable boundaries between speech sounds.

As first discussed by Chomsky and Miller (1963), one of the most important and central problems in speech perception derives from the fact that the speech signal fails to meet the conditions of linearity and invariance.



● As a consequence of failing to satisfy these two conditions, the basic recognition problem turns out to be substantially more complex than it might first appear.

● Although humans can perform it effortlessly, the recognition of fluent speech by

machines has thus far proven to be a nearly intractable problem.

Lack of invariance refers to the idea that there is no reliable connection between a phoneme of the language and its acoustic manifestation in speech. The same word, or even a single phoneme, can sound completely different depending on many factors:

​1) Individual differences.​ Acoustic structure of speech depends a lot on a

speaker's accent, physical and psychological characteristics.

​2) Speech conditions.

​3) Coarticulation.​ This is the idea that more than one sound is articulated at once,

so each of them is partly shaped by the sounds surrounding it. The articulators (jaw, tongue, mouth) move from sound to sound, allowing us to speak faster; thus the acoustic structure of each phoneme depends a lot on its 'neighbours'. Consider the spectrograms of the sound /d/ in three different positions: the acoustic realization of /d/ is different each time, producing several 'versions' of the same sound. However, we as listeners still hear /d/ every time, despite large differences in acoustic structure.
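As a rough numerical illustration of this lack of invariance, the sketch below (Python) compares approximate, textbook-style F2 values for /d/ followed by different vowels. The numbers are illustrative assumptions, not measurements, but they show that no single acoustic value identifies the 'same' consonant.

# Illustrative (approximate) second-formant (F2) values in Hz for /d/ + vowel
# syllables; these are textbook-style approximations, not measured data.
syllables = {
    "di": {"f2_onset": 2200, "f2_vowel": 2600},  # F2 rises into /i/
    "da": {"f2_onset": 1600, "f2_vowel": 1200},  # F2 falls into /a/
    "du": {"f2_onset": 1200, "f2_vowel": 800},   # F2 falls into /u/
}

for name, f in syllables.items():
    direction = "rising" if f["f2_vowel"] > f["f2_onset"] else "falling"
    print(f"/{name}/: F2 onset {f['f2_onset']} Hz, {direction} transition "
          f"toward {f['f2_vowel']} Hz -- still heard as /d/")

# The F2 onsets span roughly 1000 Hz, yet listeners report the same consonant.
onsets = [f["f2_onset"] for f in syllables.values()]
print("Range of F2 onsets for 'the same' /d/:", max(onsets) - min(onsets), "Hz")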

It is extremely difficult to identify acoustic segments and features that uniquely match the

perceived phonemes independently of the surrounding context.

● As a result of coarticulation in speech production, there is typically a great deal of

contextual variability in the acoustic signal correlated with any single phoneme.

● Often a single acoustic segment contains information about several neighboring linguistic

segments (i.e., parallel transmission), and, conversely, the same linguistic segment is

often represented acoustically in quite different ways depending on the surrounding

phonetic context, the rate of speaking, and the talker (i.e., context-conditioned variation).

● In addition, the acoustic characteristics of individual speech sounds and words exhibit

even greater variability in fluent speech because of the influence of the surrounding

context than when speech sounds are produced in isolation.

● The context-conditioned variability resulting from coarticulation also presents enormous

problems for segmentation of the speech signal into phonemes or even words based only

on an analysis of the physical signal.

b. Units of Analysis In Speech Perception

In various attempts to solve the joint problems of the lack of invariance and segmentation,

different-sized perceptual units have been proposed. The phonetic feature, phoneme, and syllable

have all been considered at one time or another by various investigators. However, the problems

of a lack of invariance and segmenting a continuous signal are common to all of these units of

analysis (see Pisoni, 1976; Studdert-Kennedy, 1976).

Coarticulation:

● The coarticulation phenomena that carry attributes of one phonetic feature or phoneme

onto the surrounding phonemes are also found for syllables (Öhman, 1966).

● Coarticulation refers to the influence of the muscle movements necessary to produce one

sound onto preceding and succeeding muscle movements and their resulting acoustic

manifestations.

● Thus the production of a stop consonant is conditioned or influenced by the production of

the adjacent phonemes.



● The acoustic consequences of coarticulation are that one sound segment may carry

information about a number of phonemes or syllables.

● Coarticulation effects obscure the boundaries between all potential units of analysis in

speech: phonetic features, phonemes, and syllables.

● As a result, syllabic units are difficult to segment by acoustically defined criteria and

show the same context-conditioned variability (i.e., lack of invariance) exhibited by

phonemes.

c. Perceptual Constancy in Speech - The Normalization Problem in

Speech Perception

Talker Variability:

● In addition to the problems arising from the lack of acoustic-phonetic invariance, there

are also two related problems having to do with the normalization of the speech signal.

One is the talker-normalization problem.

● Talkers differ in the length and shape of their vocal tracts, in the articulatory gestures

used for producing various types of phonetic segments, and in the types of coarticulatory

strategies present in their speech.

● As a consequence, substantial differences among talkers have been observed in the

absolute values of the acoustic correlates of many phonetic features.

● Differences in stress and speaking rate, as well as in dialect and affect, also contribute to

differences in the acoustic manifestation of speech.



● Despite the inherent variability in the speech signal due to talker differences, listeners are

nevertheless able to perceive speech from a wide range of vocal tracts under diverse sets

of conditions.

● Clearly, then, the invariant properties cannot be absolute physical values encoded in the

stimulus but must be relational in nature.

● Unfortunately, relatively little is known about this form of perceptual normalization, or

about the types of perceptual mechanisms involved in carrying out these computations.
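One concrete way to express the idea that the relevant properties are relational rather than absolute is extrinsic talker normalization, for example z-scoring each talker's formant values against that talker's own distribution (a Lobanov-style normalization). The sketch below is only a minimal illustration of this idea; the formant values for the two hypothetical talkers are invented.

import statistics

def lobanov_normalize(formants_hz):
    """Express a talker's formant values as z-scores relative to that
    talker's own mean and standard deviation (relational, not absolute)."""
    mean = statistics.mean(formants_hz)
    sd = statistics.stdev(formants_hz)
    return [(f - mean) / sd for f in formants_hz]

# Invented F1 values (Hz) for the vowels /i/, /a/, /u/ from two hypothetical
# talkers with very different vocal tracts: the absolute values differ.
talker_a_f1 = [300, 750, 320]
talker_b_f1 = [400, 1000, 430]

print("Talker A normalized F1:", [round(z, 2) for z in lobanov_normalize(talker_a_f1)])
print("Talker B normalized F1:", [round(z, 2) for z in lobanov_normalize(talker_b_f1)])
# After normalization the two talkers' vowel patterns line up closely,
# even though their raw frequencies do not.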

Variability in speaking rate:

● A second normalization problem concerns time and rate normalization in speech

perception. It is well known that the duration of individual segments is influenced

substantially by speaking rate.

● However, the acoustic duration of segments is also affected by the location of various

syntactic boundaries in connected speech, by syllabic stress, and by the component

features of adjacent segments in words (see, e.g., Gaitenby, 1965; Klatt, 1975, 1976,

1979; Lehiste, 1970).

● In addition, there are substantial differences in the duration of segments of words when

produced in sentence contexts compared to the same words spoken in isolation.

● Changes in speech rate are reflected in changes in the number and duration of pauses, in

durational changes of vowels and some consonants, and in deletions and reductions of

some of the acoustic properties that are associated with particular linguistic units (J.L.

Miller, Grosjean, & Lomanto, 1984).



● For example, changes in VOT and the relative duration of transitions and vowel

steady-states occur with changes in speaking rates (J.L. Miller & Baer, 1983; J.L. Miller,

Green, & Reeves, 1986; Summerfield, 1975).

● The duration of vowels produced in sentences is roughly half the duration of the same

vowels spoken in isolation.

● Speaking rate also influences the duration and acoustic correlates of various phonetic

features and segments.

● Numerous low-level phonetic and phonological effects such as vowel reduction, deletion,

and various types of assimilation phenomena have been well documented in the

literature.

● These effects seem to be influenced a great deal by speaking tempo, dialect, and

surrounding phonetic context.
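Rate normalization can be sketched as a decision rule in which the perceptual category boundary depends on the local speaking rate rather than on an absolute duration. The sketch below is purely schematic: the boundary value and the linear rate adjustment are invented for illustration and are not the parameters reported in the studies cited above.

def classify_stop(vot_ms, syllable_ms, base_boundary_ms=25.0, rate_slope=0.05):
    """Toy voiced/voiceless decision for a /b/-/p/ continuum in which the
    VOT boundary shifts with syllable duration (a stand-in for speaking
    rate). All numbers are illustrative, not empirically estimated."""
    boundary = base_boundary_ms + rate_slope * (syllable_ms - 200.0)
    return ("p" if vot_ms > boundary else "b"), round(boundary, 1)

# The same 28-ms VOT token can flip category depending on speaking rate:
for syllable_ms in (120, 200, 320):   # fast, medium, slow syllables
    label, boundary = classify_stop(28, syllable_ms)
    print(f"syllable {syllable_ms} ms -> boundary {boundary} ms -> heard as /{label}/")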

https://www.youtube.com/watch?v=1nAy-I1OiDU

IV. Categorical Perception:

Categorical perception (Repp 1984) is the recognition, along a continuum of data, of categories

meriting individual labels. Thus, for example, a rainbow is a continuous or linear spread of

colour (frequencies) within which certain colours appear to stand out or can be recognised by the

viewer. In fact these colours, it is hypothesised, are not intrinsically isolated or special in the

signal; their identification is a property of the way the viewer is cognitively treating the data.

Categorical Perception Theory is an attempt to model this observation.

Two observations are relevant to understanding categories:

1. Listeners often assign the same label to physically different signals. This occurs, for

example, when two people utter the ‘same’ sound. In fact the sounds are always

measurably different, but are perceived to belong to the same category. What is

interesting is that the two speakers also believe they are making identical or very similar

sounds, simply because they are both rendering the same underlying plan for a particular

phonological extrinsic allophone. The identity of the planned utterance within the minds

of the speakers matches the label assigned by the listener.

2. Sounds can be made which are only minutely different and occur on the same production

cline (A cline is a continuous range or slope between values associated with a particular

phenomenon. For example, in English the series of vowels [u, ʊ, ɔ, ɑ] stand on a cline of

back vowels ranging from highest to lowest). Take the example of vowel height, where

there is a cline, say, between high front vowels and low front vowels. Acoustically this

will show up as a cline in the frequency of formant 2 (F2) and also formant 1 (F1).

Listeners will segment the cline into different zones, and assign a distinct label to each.

Here we have the situation where, because phonologically there are two different

segments to be rendered, both speakers and listeners will believe that the difference has

carried over into the physical signal.

Each of these two observations is the reverse of the other. On the one hand different signals

prompt a belief of similarity, and on the other similar signals prompt a belief of difference.

Categorical Perception Theory – and other active or top-down theories – propose that the

explanation for these observations lies in the strength of the contribution made to the perceptual

process by the speaker/listener’s prior knowledge of the structure and workings of the language.

Within certain limits, whatever the nature of the physical signal the predictive nature of the

speaker’s/listener’s model of how the language works will push the data into the relevant

category. We speak of such active top-down approaches as being theory-driven, as opposed to passive, data-driven, bottom-up theories.

So the categories in the minds of both speaker and listener are language dependent. That is, they

vary from language to language, depending on the distribution of sounds within the available

acoustic or articulatory space. Imagine a vector running from just below the alveolar ridge to the

lowest, most forward point the front of the tongue can reach for making vowels. Along this

vector the tongue can move to an infinite number of positions – and we could, in theory at any

rate, instruct a speaker to make all these sounds, or, at least, a large number of sounds along this

vector. If these sounds are recorded, randomised and played to listeners, they will subdivide the

sounds according to tongue positions along the vector which represent sounds in their own

language (Rosner and Pickering 1994).
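The labeling behaviour described above can be simulated with a simple logistic identification function over an acoustic continuum: the stimulus changes in equal physical steps, yet the reported label switches abruptly at a category boundary. The sketch below is schematic, with invented parameters rather than a model fitted to perceptual data.

import math

def p_category_b(step, boundary=4.5, steepness=2.5):
    """Probability of reporting category B for a stimulus on an 8-step
    continuum; the logistic parameters are invented for illustration."""
    return 1.0 / (1.0 + math.exp(-steepness * (step - boundary)))

for step in range(1, 9):
    p_b = p_category_b(step)
    label = "B" if p_b > 0.5 else "A"
    print(f"step {step}: P(B) = {p_b:.2f} -> labeled {label}")

# Although the physical continuum changes in equal steps, identification
# jumps from near 0% to near 100% around the boundary (steps 4-5), which is
# the hallmark of categorical perception; within-category steps (1-3 or 6-8)
# are labeled identically despite being physically different.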

https://www.youtube.com/watch?v=nNYhCP6NiUg

V. McGurk Effect

To facilitate communication, humans have the remarkable ability to combine visual information from the mouth with auditory information from the voice. When presented with certain

combinations of incongruent speech, such as auditory /ba/ paired with visual /ga/, individuals

may perceive an illusory fused percept /da/. This illusion, first described by McGurk and

MacDonald (1976), has come to be known as the McGurk effect. McGurk and MacDonald

(1976) reported a powerful multisensory illusion occurring with audiovisual speech. The McGurk effect occurs when an acoustic utterance is heard as another utterance when presented with discrepant

visual articulation (​Tiippana, 2014)​. They recorded a voice articulating a consonant and dubbed

it with a face articulating another consonant. Even though the acoustic speech signal was well

recognized alone, it was heard as another consonant after dubbing with incongruent visual

speech.

https://www.youtube.com/watch?v=jtsfidRq2tw

Regarding its definition and interpretation, two main claims have been made about the McGurk effect. First, the McGurk effect should be defined as a categorical change in auditory perception induced by incongruent visual speech, resulting in a single percept of hearing something other than what the voice is saying. Second, when interpreting the McGurk effect, the perception of the unisensory acoustic and visual stimulus components is also important (Tiippana, 2014).

There are many variants of the McGurk effect. The best-known case is when dubbing a voice

saying [b] onto a face articulating [g] results in hearing [d]. This is called the fusion effect since

the percept differs from the acoustic and visual components. Here integration results in the

perception of a third consonant, by merging information from audition and vision (Setti et al.,

2013).​ ​When A[b]V[g] is heard as [d], the percept is thought to emerge due to fusion of the

features (for the place of articulation) provided via audition (bilabial) and vision (velar), so that a

different, intermediate consonant (alveolar) is perceived (​van Wassenhove, 2013​).

Other incongruent audiovisual stimuli produce different types of percepts. For example, a reverse combination of these consonants, A[g]V[b], is heard as [bg], i.e., the visual and auditory components one after the other. There are other pairings which result in hearing according to the visual component; e.g., acoustic [b] presented with visual [d] is heard as [d]. The different variants of the McGurk effect represent the outcome of audiovisual integration.

Tiippana (2014) concluded that when integration takes place, it results in a unified percept,

without access to the individual components that contributed to the percept. Thus, when the

McGurk effect occurs, the observer has the subjective experience of hearing a certain utterance,

even though another utterance is presented acoustically.

There is a high degree of individual variability in how frequently the illusion is perceived. Some

individuals almost always perceive the McGurk effect, while others rarely do (Mallick et al.,

2015). Frequent perceivers of the McGurk effect were also more likely to fixate the mouth of the

talker, and there was a significant correlation between McGurk frequency and mouth looking

time (Gurler et al., 2015). The extent to which visual input influences the percept depends on how coherent and reliable the information provided by each modality is. Coherent information is integrated and

weighted, e.g., according to the reliability of each modality, which is reflected in unisensory

discriminability. Converging evidence suggests that the STS is a critical brain area for

multisensory integration of auditory and visual information about both speech and non-speech

stimuli (Barraclough et al., 2005). This suggests that the STS could be a neural locus for the

McGurk effect: if the left STS successfully combines the incongruent auditory and visual

syllables that comprise a McGurk stimulus, a McGurk percept is produced; if the left STS is not

active, then the auditory and visual syllables are not combined and a McGurk percept is not

produced.

Nath and Beauchamp (2012) used blood-oxygen level dependent functional magnetic resonance

imaging (BOLD fMRI) to measure brain activity as McGurk perceivers and non-perceivers were

presented with congruent audiovisual syllables, McGurk audiovisual syllables, and non-McGurk

incongruent syllables. Fourteen healthy right-handed subjects (6 females, mean age 26.1 years) participated in the study. The stimuli consisted of three types of syllables: congruent (auditory

and visual matching) syllables and two types of incongruent syllables (auditory and visual

mismatch).

Not all incongruent syllables produce a McGurk percept, defined as a percept not present in the

original stimulus (McGurk and MacDonald, 1976). Just prior to scanning, a behavioral pre-test

was performed. Each subject was presented with 10 trials of McGurk syllables and 10 trials of

non-McGurk incongruent syllables. Auditory stimuli were delivered through headphones at

approximately 70 dB, and visual stimuli were presented on a computer screen. Subjects were

instructed to watch the mouth movements and listen to the speaker. In order to assess perception,

subjects were asked to repeat aloud the perceived syllable. Fused percepts such as “da” and “tha”

were used as indicators of McGurk perception, while responses strictly corresponding to the

visually-presented syllable (“ga”) were not counted as fused McGurk percept. Responses

corresponding to “ba” (the auditory component of the syllable) indicated that the effect was not

perceived (McGurk and MacDonald, 1976). Each subject underwent one fMRI scanning session.

In five subjects, each scan series contained 55 McGurk trials, 55 non-McGurk incongruent trials,

10 target trials (audiovisual “ma”) and 30 trials of fixation baseline. In the remaining nine

subjects, each scan series contained 25 McGurk trials, 25 non-McGurk trials, 25 congruent “ga”

trials (auditory + visual “ga”), 25 congruent “ba” trials, 10 target trials (audiovisual “ma”) and 30

trials of fixation baseline. During fixation, the crosshairs were presented at the same position as the mouth during visual speech to minimize eye movements. On approximately 10% of trials, the

target stimulus (audiovisual “ma”) was presented. Subjects were required to respond to the target

stimulus by pressing a button, but not to other stimuli. Target stimuli were analyzed separately

from other stimuli. This ensures attention to the stimulus while preventing contamination of the

brain response to non-target stimuli by motor planning or execution.

Results:

In the behavioral test, there was a high degree of intersubject variability in McGurk

susceptibility. Three subjects never reported the McGurk percept (0% susceptibility) while two

subjects always reported it (100% susceptibility). Subjects were classified into two groups:

non-perceivers (seven subjects, susceptibility 0–49%) and perceivers (seven subjects,

50%–100%).

For the fMRI analysis, a two-way ANOVA was performed with the BOLD response in the left STS as the dependent measure. The first factor was the McGurk susceptibility group determined from behavioral testing (perceivers vs. non-perceivers). The second factor was the stimulus condition (congruent vs. incongruent syllables). The ANOVA showed significant main effects of both susceptibility group and stimulus condition on the STS response. There was no interaction between susceptibility group and stimulus condition. McGurk perceivers had the highest STS response to incongruent syllables and non-perceivers had the lowest response. There was no between-group difference in the response to congruent syllables. Regarding effects of stimulus, across groups the STS significantly preferred incongruent to congruent syllables. There was no significant difference between the responses to the two types of incongruent syllables (McGurk and non-McGurk). In individual subject analyses, the subject with the weakest STS response had the smallest likelihood of experiencing a McGurk percept, while the subject with the strongest STS response had the highest likelihood. Across all subjects, there was a significant positive correlation between each subject's STS response to incongruent syllables and their likelihood of experiencing the McGurk percept. These results suggest that the left STS is a key locus for interindividual differences in speech perception.
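The structure of this group-by-condition analysis can be reproduced in outline with a two-way ANOVA on simulated data. The sketch below (pandas/statsmodels) uses randomly generated BOLD values and only illustrates the form of the analysis, not the actual data of Nath and Beauchamp (2012).

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
rows = []
# Simulate additive main effects of group and condition (no built-in interaction).
for group, group_offset in [("perceiver", 0.4), ("non_perceiver", 0.0)]:
    for subject in range(7):                       # seven subjects per group
        for condition, cond_offset in [("congruent", 0.0), ("incongruent", 0.5)]:
            rows.append({"group": group,
                         "condition": condition,
                         "bold": rng.normal(1.0 + group_offset + cond_offset, 0.2)})
df = pd.DataFrame(rows)

# Two-way ANOVA: BOLD response in the left STS ~ susceptibility group x condition
model = ols("bold ~ C(group) * C(condition)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))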

Feng et al. (2019) examined the McGurk effect in 324 native Mandarin speakers, consisting of 73 monozygotic (MZ) and 89 dizygotic (DZ) twin pairs (mean age ± SD = 16.8 ± 2.1 years). One hundred and fifty participants underwent an additional testing session approximately 2 years after the initial session. The McGurk stimuli consisted of nine digital audiovisual recordings, each 2 s long. Each stimulus

was presented eight times in random order. Each stimulus contained an auditory recording of a

syllable and a digital video of the face of the speaker enunciating a different, incongruent

syllable. Subjects were instructed to pay attention to each movie clip and report their percept by

typing it via a standard keyboard into a computer. Subject responses were classified into four

categories: auditory (responses corresponding to the auditory syllables), visual (responses

corresponding to the visual syllables), McGurk (specific responses containing an element not

contained in either the auditory or visual syllable, described in the original paper as “fused”

responses), and other (responses different from the previous three categories, e.g., “a”).
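The four-way response coding described above can be expressed as a small classification routine. The sketch below is a hypothetical implementation of that scheme; the example trial (auditory /ba/ dubbed onto visual /ga/) is chosen purely for illustration.

def code_response(response, auditory, visual):
    """Classify a typed percept for an incongruent audiovisual trial into
    the four categories used in the coding scheme described above."""
    if response == auditory:
        return "auditory"
    if response == visual:
        return "visual"
    # a 'fused' McGurk response contains an element not present in either
    # the auditory or the visual syllable
    if any(ch not in auditory + visual for ch in response):
        return "mcgurk"
    return "other"

# Hypothetical trial: auditory "ba" dubbed onto visual "ga"
for reported in ("ba", "ga", "da", "a"):
    print(reported, "->", code_response(reported, auditory="ba", visual="ga"))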

The analysis revealed that some participants never perceived the illusion and others always perceived it. Within participants, perception was similar across time (2-year retest in 150 participants), suggesting that McGurk susceptibility reflects a stable trait rather than short-term perceptual fluctuations. To examine the effects of shared genetics and prenatal environment, McGurk susceptibility was compared between MZ and DZ twins. Both twin types showed significantly greater correlations than unrelated pairs, suggesting that the genes and environmental factors shared by twins contribute to individual differences in multisensory speech perception.

VI. Perceptual organization in speech - Gestalt principles of perceptual grouping

To explain the organization of visual and auditory forms, Wertheimer (1923/1938) first set out the principles underlying perceptual grouping and segmentation, which are now known as ‘the classic principles of perceptual grouping’.

A row of dots at equal distances from one another is perceived simply as a row of dots, without any

particular grouping or segmentation. When some of the inter-dot distances are increased

significantly relative to the others, one immediately perceives a grouping of some dots in pairs,

which become segregated from others. Apparently, elements that are relatively closer together

become grouped together, while elements that are relatively further apart are segregated from

one another, based on the ​principle of grouping by proximity.​
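Grouping by proximity can be sketched as a simple rule: break a row of elements into groups wherever the gap to the next element is clearly larger than the typical gap. The sketch below uses invented one-dimensional dot positions purely to illustrate the principle.

def group_by_proximity(positions, gap_factor=1.5):
    """Split a sorted row of 1-D positions into groups wherever a gap is
    more than gap_factor times the median gap (a toy proximity rule)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    median_gap = sorted(gaps)[len(gaps) // 2]
    groups, current = [], [positions[0]]
    for pos, gap in zip(positions[1:], gaps):
        if gap > gap_factor * median_gap:
            groups.append(current)
            current = []
        current.append(pos)
    groups.append(current)
    return groups

# Equal spacing: one undifferentiated row.  Unequal spacing: pairs emerge.
print(group_by_proximity([0, 1, 2, 3, 4, 5, 6, 7]))
print(group_by_proximity([0, 1, 4, 5, 8, 9, 12, 13]))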



When dots are differentiated from one another by size, color or another feature, the dots become

spontaneously grouped again, even with equal distances. In Figure C, for instance, the smaller

filled dots are grouped in pairs and so are the larger open ones. Apparently, elements that are

similar to one another are grouped together, while dissimilar elements are segregated from one

another, based on the ​principle of grouping by similarity.

With unequal distances and differentiated attributes, the two principles can cooperate or

compete. When both proximity and similarity are working together (Figure D), grouping is

enhanced compared to the conditions where only one principle can play a role (proximity in

Figure B, similarity in Figure C). When they are competing (see Figure E), one might perceive

pairs of dissimilar dots (when grouping by proximity wins) or pairs of similar dots at larger

distances (when grouping by similarity wins), with these two tendencies possibly differing in

strength between individuals and switching over time within a single individual.

Even with equal distances, elements become grouped again, when some undergo a particular

change together (e.g., an upward movement), while others do not change or change differently

(e.g., move downward; see Figure F). ​This principle is grouping by common fate.

The law of continuity suggests to a person looking at a pattern that the pattern "continues" even after the end of the physical pattern itself, i.e., our brain can trace the connecting lines between various elements of the design even if there are none. We perceive lines as part of a continuous movement in order to minimize abrupt changes. In the figure, we perceive two overlapping wavy lines instead of three shapes linked together.

In the ​law of symmetry​ items that form symmetrical units are grouped together. In the given

picture three sets of brackets are seen, not six unconnected lines.

The ​law of closure​ states that individuals perceive objects such as shapes, letters, pictures, etc.,

as being whole when they are not complete. Specifically, when parts of a whole picture are

missing, our perception fills in the visual gap.



In the left arrangement of the figure below, one perceives two identical six-sided shapes, in different orientations and slightly overlapping, with line patterns a and c grouped together as forming one figure, and b and d as another (a-c and b-d). In the right arrangement, with the same line patterns in different positions and relative orientations, one clearly perceives something different, namely one elongated six-sided shape with a smaller diamond in the middle; now patterns a and d are grouped as one form, and b and c as another (a-d and b-c). When parts form a larger whole, wholes with a higher degree of regularity are better Gestalts and tend to dominate our perception; this is the principle of a good Gestalt.

The Gestalt law of common region suggests that elements located within the same bounded region of space tend to be grouped together.

In the given figure, there are three oval shapes drawn on a piece of paper with two dots located at

each end of the oval. The ovals are right next to each other so that the dot at the end of one oval

is actually closer to the dot at the end of a separate oval. Despite the proximity of the dots, the

two that are inside each oval are perceived as being a group rather than the dots that are actually

closest to each other.

Wertheimer also discussed three additional factors. First, when confronted with complex shapes, we tend to reorganize them into simpler components or into a simpler whole: we are more likely to see the left image above as composed of the simple circle, square and triangle seen on the right than as the complex and ambiguous shape of the whole form.



Second, if one presents the parametrically different conditions as sequential trials in a

single experiment rather than as a series of separate experiments, the change from one

discrete percept to another would depend on the context of the preceding conditions:

the transition points from percept A to B would be delayed if the preceding stimuli were all

giving rise to percept A, and ambiguous conditions where A and B are equally strong based on

the parametric stimulus differences would yield percepts that go along with the organizations

that were prevailing in previous trials. Hence, in addition to isolated stimulus factors, the set of

trials within which a stimulus is presented also plays a role.This second additional factor is called

the factor of set. Third additional principle, implies that past experience does play a role in

perceptual organization, albeit a limited one. Wertheimer pointed out that this is just one of

several factors. When different factors come into play, it is not easy to predict which of the

possible organizations will have the strongest overall Gestalt qualities (the highest ‘goodness’).

Explicit auditory instances of organizational principles were again offered by Julesz and Hirsh

(1972). Bregman and colleagues have elaborated many of the speculations of Julesz and Hirsh

(1972) empirically. The findings demonstrate the dimensions and parameters of the perceptual

disposition to form groups. For example, a principle of proximity, here set in the frequency

domain, was offered to explain the formation of groups observed by Bregman and Campbell

(1971). If successive moments of the signal are similar with respect to frequency spectrum, they

should be considered part of the same source. From a repeating sequence of six 100-ms tones

with frequencies of 2.5 kHz, 2 kHz, 550 Hz, 1.6 kHz, 450 Hz, and 350 Hz, listeners perceived a

pattern forming two concurrent groups: one of the three low tones (550 Hz, 450 Hz, and 350 Hz)

and another of the three high tones (2.5 kHz, 2 kHz, and 1.6 kHz).
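A crude way to capture this frequency-proximity streaming is to split the tones at the largest gap on a log-frequency scale, which reproduces the high/low grouping reported for this sequence. The sketch below is only an illustration of the principle, not the analysis used by Bregman and Campbell (1971).

import math

tones_hz = [2500, 2000, 550, 1600, 450, 350]   # repeating sequence from the text

# Find the biggest gap in log-frequency and split the tones there.
ordered = sorted(tones_hz)
log_gaps = [math.log2(b / a) for a, b in zip(ordered, ordered[1:])]
split_index = log_gaps.index(max(log_gaps)) + 1

low_stream = [f for f in tones_hz if f in ordered[:split_index]]
high_stream = [f for f in tones_hz if f in ordered[split_index:]]
print("low stream :", low_stream)    # expected: 550, 450, 350 Hz
print("high stream:", high_stream)   # expected: 2500, 2000, 1600 Hz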

The principle of similarity also applies to the formation of auditory groups. The evidence comes

from studies (Bregman & Doehring, 1984; Steiger & Bregman, 1981) in which simple harmonic

relations or similar frequency excursions among a set of tones promoted the formation of

perceptual groups. In an extension of these studies, grouping of dichotically presented tones was

observed when harmonic relations occurred among them (Steiger & Bregman, 1982). Rapidly repeating tones were grouped because they shared a common fundamental frequency, and even small departures from this harmonic relation blocked dichotic fusion. Spectral similarity also appears to play a significant role in perceptual organization, as observed by Dannenbring and Bregman (1976). Similarity between the perceptual attributes of successive events such as pitch,

timbre, loudness and location provides a basis for linking them (Moore and Gockel 2012). It

appears that it is not so much the raw difference that is important, but rather the rate of change;

the slower the rate of change between successive sounds the more similar they are judged

(Winkler, Denham et al. 2012). This leads one to consider that in the auditory modality, the law

of similarity is not separate from what the Gestalt psychologists termed good continuation. Good

continuation means that smooth continuous changes in perceptual attributes favour grouping,

while abrupt discontinuities are perceived as the start of something new. If successive moments

of the signal are similar with respect to frequency spectrum, they should be considered part of

the same source. Both the discontinuity and the dissimilarity of successive moments of sound

can be viewed as evidence that there is more than one source involved in producing the acoustic

input.

Good continuation can operate both within a single sound event (e.g., amplitude-modulating a

noise with a relatively high frequency results in the separate perception of a sequence of loud

sounds and a continuous softer sound (Bregman 1990)), and between events (e.g. glides can help

bind successive events (Bregman and Dannenbring 1973)).

Likewise, the principle of common fate was translated into the auditory domain by Bregman and

Pinker (1978). The principle of common fate refers to correlated changes in features; e.g.,

whether they start and/or stop at the same time. This principle has also been termed ‘temporal

coherence’ specifically with regard to correlations over time windows that span longer periods

than individual events. In a test of grouping by relative onsets of tone elements, when brief (147

ms) tones were synchronously onset and offset, they were grouped together; when tones were

offset by 58 ms or more, they were grouped into separate perceptual streams.
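Grouping by common fate (onset synchrony) can likewise be sketched as a rule that fuses components whose onsets fall within a tolerance and segregates those that are more asynchronous. The 58 ms tolerance comes from the study described above; the example onset times are invented.

def group_by_onset(onsets_ms, tolerance_ms=58):
    """Fuse components whose onsets differ by less than tolerance_ms;
    components offset by tolerance_ms or more go into separate streams."""
    streams = []
    for onset in sorted(onsets_ms):
        if streams and onset - streams[-1][-1] < tolerance_ms:
            streams[-1].append(onset)      # close enough: same stream
        else:
            streams.append([onset])        # too asynchronous: new stream
    return streams

print(group_by_onset([0, 10, 20]))     # near-synchronous onsets -> one stream
print(group_by_onset([0, 60, 130]))    # offsets of 58 ms or more -> three streams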

Organization by set has also been reported, in cases of musical experience, by Jones (1976)

although in some instances of this kind, perceptual organization also shows evidence of a

symmetry principle. Here, a prior portion of a musical display appears to induce an implicit

expectation about the latter portion of a musical display, in the dimensions both of melodic pitch

and of the temporal attributes of a melody.

Both principles of continuity and closure apply to the grouping of simple tone glides reported by

Bregman and Dannenbring (1973, 1977). Group formation here occurred for tones that were

continuous in their frequency contours despite interruptions by silence (up to 20 ms) or by noise

bursts (up to 500 ms). Continuity and closure also operate in the amplitude domain, in which the

more abrupt the amplitude rise time, the likelier the occurrence of grouping (Dannenbring,

1976).

VII. Phonetic organization

According to Remez et al. (1994), four points are emphasized regarding adequate auditory perceptual organization for speech. The first is auditory coherence: for perceptual coherence, the listener directs attention to the properties internal to streams rather than the properties that differentiate them (Remez, 1987). That is, acoustic elements that are proximate in frequency are attributed to

the same source; the rate at which the components succeed each other influences their

cohesiveness: The slower the procession, the greater the disposition to cohere. Acoustic elements

must also be similar in frequency changes to be grouped together, not only in the extent of

change but also in the temporal properties. A more subtle form of similarity also promotes

grouping of simultaneous components, namely, when harmonic relations occur among them.

Similarity in spectrum also appears to warrant cohesion of components, such that aperiodic

(noisy) acoustic elements are grouped together and periodic elements are grouped together.

Acoustic elements that occur concurrently must exhibit common onsets and offsets and must

show common (frequency or amplitude) modulation to come together. Continuity of changes in

frequency or in spectrum is also required for elements to form a single perceptual group. In

general, small temporal or spectral departures from these various similarities, continuities,

common modulations, and proximities result in the loss of coherence and the splitting of auditory

elements into separate groups.

The Speech Stream: ​During the production of speech sounds, the resonances are excited in

common by a pulse train produced by the larynx. This imposes harmonic relations and common

amplitude modulation across the formants (although the formant center frequencies are not

harmonically related). The attributes of harmonicity and common modulation may offer the only

basis for grouping the phonated formants as a single coherent stream. In the absence of common

pulsation and harmonic structure, we should find this signal fracturing into multiple perceptually

segregated streams when primitive auditory principles are applied: the first formant forming a

single continuous stream; the second formant splitting from the first as an intermittent stream

with highly varying frequency; and the nasal and third formants segregating from the others,

each varying rather less in frequency than the second formant.

In the spectrogram of the sentence "Why lie when you know I'm your lawyer?", some typical

acoustic attributes are the continuity of the lowest frequency resonance (the first formant) despite

intermittent energy (with gaps exceeding 75 ms) in resonances of higher frequency (the second,

third, and nasal formants), the marked dissimilarity in frequency trajectory of the first formant

and those of higher frequency formants, and the lack of temporal coincidence of large changes in

resonant frequencies.

The acoustic composition of the sentence depicted in Figure Ib, "The steady drip is worse than a

drenching rain," is considerably more varied. By criteria of spectral and frequency similarity, the

sentence should split into 10 streams: 3 that correspond to the three oral formants; a fourth,

which is composed of the four intermittent and discontinuous occurrences of nasal resonance (in

thaN, dreNchiNG, and raiN); a fifth, which is composed of the noise associated with the voiced fricatives (in THe and THan); a sixth, which is composed of the noise manifest by the two unvoiced fricatives (in Steady and worSe); a seventh, which is composed of the noise at the release of the affricate (in drenCHing); an eighth, which is composed of two consonant release bursts (in Drip and Drenching); a ninth, which contains the pulsed noise of the voiced fricative (in the word iS); and a tenth, which is composed of the consonantal release (in driP). If the

principle of common fate is applied to portions of the spectrum that are modulated by the pulsing

of the larynx, then the oral formants, the nasal formants, and perhaps the voiced fricatives are

grouped as 1 stream, leaving 4 remaining aperiodic streams associated respectively with the

acoustic correlates of unvoiced fricatives, affricates, and consonant releases. Neither continuity,

nor closure, nor symmetry, nor any obvious principle of "goodness" can accomplish the

reduction of this spectrum to the single vocal source that the listener perceives. These

commonplace examples suggest that the principles of the primitive auditory analysis fall short of

explaining the perceptual coherence of a single speech stream.

Perceptual Phenomena: ​ Broadbent and Ladefoged (1957) in their study, investigated the effect

of common pulsing on the formant bands of a sentence. They made a two-formant synthetic

replica of a sentence but presented the first formant to one ear and the second formant to the

other. Despite the different location of the two signals, a rather obvious violation of spatial

similarity, listeners heard a single voice when the formants were excited at the same fundamental

frequency. When two different fundamentals were used, listeners reported hearing two voices, as

if fusion was lost; this occurred even when the differently pulsed formants were both presented

to the same ear. When each formant had a different fundamental, this created an impression of

two voices saying the same utterance, rather than two non speechlike buzzes varying in pitch.

Listeners evidently combined the information from each resonance to form phonetic impressions

despite the concurrent perception that each resonance issued from a different vocal source. Here,

the differently excited resonances were phonetically coherent, although they also were split into

two separate perceptual streams in the listener's impression of the auditory scene. These

perceptual streams, which should be segregated from each other according to the account given

by primitive auditory analysis, are combined nonetheless to produce phonetic impressions.

Last is the specific model of organization, which uses a schematic component to reconcile the

simplicity of the primitive analysis with the complexity of the speech signal. It has been

observed that rapid repetition of acoustically identical syllables actually destroys perceptual

stability and phonetic impressions of speech sounds are transformed by repetition. In such

conditions, perceptually segregated streams of auditory elements formed from the acoustic

constituents of the speech signal, much as they had with tones and noise, according to primitive

criteria of similarity, proximity, common fate, and continuity (​Remez et al., 1994​). In essence,

rapid repetition proved to be more effective in forcing perceptual segregation of acoustic

elements to occur with speech sounds than to produce phonetic impressions.



➔ The ​bottom-up theories​ are those in which the ​acoustic signal provides essential and

sufficient information​ for perceptual recognition. In this approach, the link between the

information received and the perceptual recognition is direct, with no or minimal

intervening stage. This approach is also referred to as data driven, precisely because the

data obtained from the acoustic signal drive or direct, the listener’s perception of speech.

In the ​top-down approach​, the information from the ​acoustic signal is not sufficient for

perceptual recognition. Higher-level information from contextual, linguistic, and

cognitive cues is necessary​ for accurate speech perception.

➔ The active/passive grouping of general attributes of theories addresses a related

concept—​the degree to which information in addition to the acoustic signal is necessary​.

Active theories​ emphasize the​ cognitive role in perception​, ​including the formation and

testing of hypotheses about the phonetic or linguistic interpretation of the information in

the acoustic signal​. In contrast, ​passive theories​ postulate ​a smaller role for cognitive

processing and presume a more automatic perceptual response​.

➔ An​ autonomous theory​ posits that perceptual ​processing occurs in the absence of external

data​, such as communicative context or general knowledge. In other words, perception

occurs within a ​closed system​. In contrast,​ interactive theories​ are open systems, in which

the stages of perceptual processing ​access data external​ to that which is contained within

the acoustic signal.



➔ Certainly, no theory of speech perception is wholly active or wholly autonomous, just as

no theory is wholly passive or wholly interactive. Rather, these categorizations are simply

one tool that can be applied to the analysis of an otherwise highly complex set of theories

of speech perception.

1.2 Theoretical approaches to speech perception

VIII. Motor theory of speech perception (Liberman et al., 1967)

● Incorporating a biologically based link between perception and production, this

specialization prevents listeners from hearing the signal as an ordinary sound, but enables

them to use the systematic, yet special, relation between signal and gesture to perceive

the gesture.

● This theory hypothesizes that ​listeners actively reconstruct the actual ‘muscle activation

patterns’ which they themselves have as representations of vocal tract gestures​. ​The

reconstruction invokes these muscle activation patterns in terms of sets of descriptors​.

● These activation patterns constitute static rather than dynamic representations. And the

coarticulatory processes are evaluated by reference to static linear context rather than

dynamic hierarchically organized structures.

● The diagram shows how stored descriptors of muscle activation patterns for individual sounds, generalized from the listener's own experience, contribute together with coarticulatory descriptors to an active interpretation of the sound wave:

Stored muscle activation pattern descriptors
                 |
                 v
Sound wave --> Active interpretation --> hypothesis about the speaker's intended muscle activation patterns
                 ^
                 |
Coarticulatory descriptors

● The ​adaptive function of the perceptual side of this mode​, the side with which the motor

theory is directly concerned, is ​to make the conversion from acoustic signal to gesture

automatically, and so to let listeners perceive phonetic structures​ without mediation by

(or translation from) the auditory appearances that the sounds might, on purely

psychoacoustic grounds, be expected to have.

● Data from transcranial magnetic stimulation (TMS), functional neuroimaging, and

neurophysiology indicate that frontal motor structures are automatically engaged during

passive speech perception (Fadiga et al., 2002; Hesslow, 2002; Kiefer & Pulvermüller,

2012). For instance, Fadiga et al. (2002) found an increase in motor-evoked potentials

recorded from a listener’s tongue muscles during a task in which participants heard

speech sounds but for which there was no explicit motor component.

● Watkins et al. (2003) applied TMS to the face area of the primary motor cortex in order

to elicit motor-evoked potentials in the lip muscles. They found that, in comparison with

control conditions (listening to non-verbal sounds and viewing eye and brow

movements), both listening to and viewing speech increased the size of the motor evoked

potentials. They concluded that both auditory and visual perception of speech leads to

activation of the speech motor system.

● A core prediction of motor theories of speech processing is that damage (whether

temporary or permanent) to the speech motor system should impair auditory speech

processing. Meister et al. (2007) found that when repetitive TMS was used to temporarily

suppress the premotor cortex, participants were impaired at discriminating stop

consonants embedded in noise.

● Also using repetitive TMS, Möttönen and Watkins (2012) found that temporarily

disrupting the lip representations in the left motor cortex disrupted subjects’ ability to

discriminate between lip-articulated speech sounds, but did not affect those participants’

ability to discriminate sounds that were not lip articulated. Those two studies suggest that

disruption of the speech motor system can (subtly) impair speech sound processing.

● Stasenko et al. (2015) evaluated which aspects of auditory speech processing are

affected, and which are not, in a stroke patient with dysfunction of the speech motor

system. The participant was a 55-year-old, right-handed male who suffered a left

hemisphere ischaemic stroke, affecting the inferior frontal gyrus, premotor cortex, and

primary motor cortex and age and gender matched 6 controls also underwent the same

procedures. The patient presented with non-fluent speech that was marked by frequent articulatory/phonological errors. That the phonological/articulatory errors were present in both picture naming and word repetition suggests that production processes were in fact the source of the errors. The general procedure included picture naming and tests of the integrity of auditory speech processing using minimal-pair stimuli in a discrimination task. Additionally, his general intelligence, spontaneous speech, verbal fluency, verbal working memory, repetition, reading, and spelling abilities were assessed, and ultrasound imaging of the tongue was also performed. Analysis of the results found that the patient showed a normal phonemic categorical boundary when discriminating two non-words that differ by a minimal pair (e.g., ADA–AGA). However,

using the same stimuli, the patient was unable to identify or label the non-word stimuli

(using a button-press response). A control task showed that he could identify speech

sounds by speaker gender, ruling out a general labelling impairment. These data suggest

that while the motor system is not causally involved in perception of the speech signal, it

may be used when other cues (e.g., meaning, context) are not available.

IX. Analysis by synthesis theory (Stevens & Halle, 1960)

● In its latest form by Stevens and House (1972), ​the model proposes that while some

acoustic attributes of speech signal may be simply converted to linguistic data, for large

part some reference is made to the articulatory mechanism during speech perception.

● The components of this model are shown in the figure. The acoustic signal undergoes peripheral processing whether it is speech or not. It is at this stage that the presence of the dynamic characteristics of speech is sought and, if found, the signal is processed accordingly.

● The peripheral processing transforms the acoustic waveform into a neural time-space pattern that will be used at later stages of analysis. It is known that at this level more than simple time-frequency analysis takes place, and probably extraction of some of the more easily discriminated information relevant to the description of certain features also occurs. Some normalization of the signal must also take place, so that the output is a set of normalized attributes (A) that contribute to later identification of linguistic units.

● These auditory patterns are placed in temporary stores to await further processing. The preliminary analysis component derives the features that are not strongly context dependent from the signal (A). These features are available after conversion of the results from peripheral processing and result in a partial specification of a feature matrix of the utterance (B).

● The control component has access to the results of preliminary processing as well as the results of analysis of previous parts of the utterance, the lexicon, and the output of the comparator. With this information, the control unit makes a hypothesis concerning the representation of the utterance in terms of morphemes. The features are an abstract quantity that underlie, but are not necessarily identified with, the acoustic attributes or the production of the signal.

● This hypothesized representation (B) travels to the generative rules, where it is transformed into a representation of the articulatory instructions that would be necessary to generate such an utterance. (The articulatory instructions (V) that result could be used to control the articulatory mechanisms and produce speech output.)

● The generated patterns (V) are compared by the comparator with the attributes of the analyzed utterance residing in a temporary store and judged as to their closeness of match. This information is relayed to the control component, and the hypothesized sequence is either accepted or a new hypothesis is made using the error detected in the comparator.

● This loop is traversed until the message has been successfully identified (a minimal code sketch of such a loop is given at the end of this section). The model relies on the comparison of auditory patterns (A) with articulatory instructions (V) that potentially are able to produce such patterns.

● The comparator must contain a catalogue of such relations. The articulatory gesture is represented in terms of tactile, proprioceptive sensations and motor commands.

● Stevens and House state that the catalogue is built up as the child begins to utter sounds, hears the auditory result and forms the association. Therefore the child is aided in learning to perceive speech by being able to articulate it.



● Kuhl et al. (2014) investigated motor brain activation, as well as auditory brain activation, during discrimination of native and nonnative syllables in infants at two ages (25 seven-month-olds; 24 eleven-month-olds) that straddle the developmental transition from language-universal to language-specific speech perception. Adults were also tested in Experiment 1 (14 adults, mean age 26.6 years). MEG data revealed that 7-month-old infants activate auditory

(superior temporal) as well as motor brain areas (Broca’s area, cerebellum) in response to

speech, and equivalently for native and nonnative syllables. However, in 11- and

12-mo-old infants, native speech activates auditory brain areas to a greater degree than

nonnative, whereas nonnative speech activates motor brain areas to a greater degree than

native speech. This double dissociation in 11- to 12-mo-old infants matches the pattern of

results obtained in adult listeners. Infant data are consistent with Analysis by Synthesis:

auditory analysis of speech is coupled with synthesis of the motor plans necessary to

produce the speech signal. The findings have implications for: (i) perception-action theories of speech perception, i.e., both auditory and motor components contribute to the developmental transition in speech perception that occurs at the end of the first year of life; (ii) the impact of “motherese” on early language learning, i.e., motherese speech, with its exaggerated acoustic and articulatory features, particularly in one-on-one settings, enhances the activation of motor brain areas and the generation of internal motor models of speech; and (iii) the “social-gating” hypothesis (social interaction improves language learning by motivating infants via the reward systems of the brain) and humans’ development of social understanding (humans evolved brain mechanisms to detect and interpret humans’ actions, behaviors, movements, and sounds). The present data

contribute to these views by demonstrating that auditory speech activates motor areas in

the infant brain. Motor brain activation in response to others’ communicative signals

could assist broader development of social understanding in humans).
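The hypothesize-generate-compare loop of the analysis-by-synthesis model can be outlined in code, as noted above. Everything below is a schematic stand-in: the candidate set, the proposal routine, the generative rules, and the acceptance threshold are hypothetical placeholders, not components of the actual Stevens and House model.

def analysis_by_synthesis(auditory_pattern, propose_hypothesis, generative_rules,
                          max_iterations=10, threshold=0.1):
    """Schematic analysis-by-synthesis loop: the control component proposes a
    hypothesis, the generative rules turn it into a predicted pattern, and the
    comparator scores the match against the stored auditory pattern."""
    error = None
    for _ in range(max_iterations):
        hypothesis = propose_hypothesis(auditory_pattern, error)   # control component
        predicted = generative_rules(hypothesis)                   # synthesis
        error = distance(predicted, auditory_pattern)              # comparator
        if error < threshold:
            return hypothesis                                      # hypothesis accepted
    return None                                                    # no acceptable match

def distance(predicted, observed):
    """Toy comparator: mean absolute difference between two feature vectors."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

# Minimal demonstration with toy stand-ins for the model components:
candidates = {"ba": [1.0, 0.2], "da": [0.4, 0.8]}

def propose(pattern, error, _queue=list(candidates)):
    return _queue.pop(0) if _queue else "da"     # try each candidate in turn

def generate(hypothesis):
    return candidates[hypothesis]                # predicted auditory pattern

print(analysis_by_synthesis([0.42, 0.79], propose, generate))   # -> 'da'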

X. Quantal Theory (Stevens, 1972)

● According to the QT, ​nonlinearities exist in the relation between vocal-tract

configurations and acoustic outputs.

● Along an articulatory dimension s​uch as back-cavity length, ​there are regions where

perturbations in that parameter result in relatively small acoustic changes​ (e.g., in formant

frequencies) and ​other regions​ where comparable articulatory perturbations cause

substantial acoustic changes​. T​hese alternating regions of acoustic stability and instability

provide conditions for a kind of optimization of the phonemic inventory.



● The situation is represented schematically in the figure, which shows a hypothetical relation between some acoustic parameter in the sound radiated from the vocal tract and some articulatory parameter that takes on a series of values as indicated on the abscissa (a minimal numerical sketch of such a relation is given at the end of this section).

● There is a large acoustic (and auditory) difference between regions I and III. Within

regions I and III, however, the acoustic parameter is relatively insensitive to change in the

articulatory parameter. In other words, changes in articulation don’t have much effect on

the speech output. ​That is, there is a significant acoustic contrast between these two

regions, which are separated by the intermediate region II in which there is a rather

abrupt change in the acoustic parameter. Region II can, in some sense, be considered as a

threshold region such that as the acoustic parameter changes through this region the

auditory response shifts from one type of pattern to another.

● The theory also argues that ​these alternating regions of acoustic stability and instability

provide conditions for a kind of optimization of the phonemic inventory​. ​If phonemes are

located in the stable regions, then they obviously require less articulatory precision on the

part of the talker. These same phonemes tend to be auditorily quite distinctive because

they are separated by regions of acoustic instability,​ that is, regions involving a relatively

high rate of acoustic change. ​This convergence of both talker-oriented and



listener-oriented selection criteria leads to a clear preference for certain " quantal"

phonemes such as the point vowels /i/, /a/, and /u/(Quantal vowels).

● In contrast with the centralized vowels such as /ʌ/ and /æ/ (which are relatively stable but are not

bounded by acoustically unstable regions that would tend to make them highly

distinguishable from nearby vowels), the point vowels are both relatively stable and

relatively distinctive (in the sense that they are separated from nearby vowels by regions

of high acoustic instability).

● Examples of quantal categories over an articulatory continuum:

• Degree of glottal constriction: complete opening for voiceless sounds to less opening

for modal voicing to complete closure for a glottal stop

• Degree of vocal tract constriction: vowel (low V – mid V – high V) to glides to

fricatives to stops

• Place of articulation for vowels (i.e., place of constriction)

● Stevens et al. (2010) have reviewed three aspects of a theory of speech production and

perception: quantal theory, enhancement, and overlap.

● The section on quantal theory makes the claim that every phonological feature or contrast

is associated with its own quantal footprint. This footprint for a given feature is a

discontinuous (or quantal) relation between the displacement of an articulatory parameter

and the acoustical attribute that results from this articulatory movement.

● The second section shows that for a given quantally defined feature, the featural

specification during speech production may be embellished with other gestures that

enhance the quantally defined base. These enhancing gestures, together with the defining

gestures, provide a set of acoustic cues that are potentially available to a listener who

must use these cues to aid the identification of features, segments, and words. An

example of this type of enhancement for consonants is the rounding of the lips in the

production of /ʃ/. This rounding tends to lower the natural frequency of the anterior

portion of the vocal tract, so that the frequency of the lowest major spectrum prominence

in the fricative spectrum is in the F3 range, well below the F4 or F5 range for the lowest

spectrum prominence for the contrasting fricative consonant /s/.

● The third section shows that even though rapid speech phenomena can obliterate defining

quantal information from the speech stream, nonetheless that information is recoverable

from the enhancement of the segment. A simple example of articulatory overlap occurs in

an utterance containing a sequence of two stop consonants, as in the casually produced

utterance top tag. In this example, the transition toward the labial closure for /p/ generates

enhancing cues for the labial place of articulation. However, the noise burst that would

normally signal the labial place of articulation is obliterated because the tongue blade

closure for /t/ occurs before the lip closure for /p/ is released, i.e., the two closures

overlap. Any cue for the labial place of articulation immediately prior to the /t/ release is

probably also obscured. In the case of /t/, there is little direct evidence of the presence of

the alveolar place during the time preceding the /t/ release. The alveolar burst, however,

provides strong evidence for alveolar place, as does the transition from this burst into the

following vowel /æ/. Thus some cues exist for /t/, but only weaker cues for /p/. The

‘‘defining’’ cue for /p/ is actually obliterated.
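The quantal relation described above lends itself to a simple numerical illustration. The sketch below is not Stevens' model; it uses an arbitrary logistic curve, with made-up parameter values, so that the flat tails play the role of the stable regions I and III and the steep middle plays the role of the threshold region II.

```python
import math

def acoustic_output(articulatory_value, midpoint=0.5, steepness=25.0):
    """Hypothetical quantal mapping: a logistic curve whose flat tails stand
    for the stable regions I and III and whose steep middle stands for the
    unstable threshold region II (all parameter values are invented)."""
    return 1.0 / (1.0 + math.exp(-steepness * (articulatory_value - midpoint)))

def sensitivity(x, dx=0.01):
    """Approximate local slope: how much the acoustic parameter changes for a
    small articulatory perturbation at position x."""
    return (acoustic_output(x + dx) - acoustic_output(x)) / dx

for x in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"articulatory value {x:.1f} -> acoustic {acoustic_output(x):.3f}, "
          f"sensitivity {sensitivity(x):.2f}")
# Near 0.1 and 0.9 the sensitivity is close to zero (regions I and III);
# near 0.5 it is large (region II), which is the quantal pattern.
```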

XI. Auditory theory of vowel perception (Fant, 1967)

● Auditory theories claim that listeners identify acoustic patterns or features by matching

them to stored acoustic representations.



● According to Fant (1967), all of the arguments brought forth in support of motor theory

would fit just as well into sensory-based theories, in which the d​ecoding process proceeds

without postulating the active mediation of speech-motor centers.

● The​ basic idea​ in Fant's approach is that the ​motor and sensory functions become more

and more involved as one proceeds from the peripheral to the central stages of analysis.

● He assumes that the ​final destination is a "message" that involves brain centers common

to both perception and production.​ According to Fant, ​there are separate sensory

(auditory) and motor (articulatory) branches, although he leaves open the possibility of

interaction between these two branches.

● Auditory input is first processed by the ear ​and is subject to ​primary auditory analysis.

These incoming auditory signals are then submitted to some kind of direct encoding into

distinctive auditory features​ (Fant, 1962). ​Finally, the features are combined in some

unspecified way to form phonemes, syllables, morphemes, and words.



● Although much of Fant's concern has been with continued acoustical investigations of the

distinctive features of phonemes, the problems of invariance and segmentation, which are

central issues in speech perception, remain unresolved by the model.

● The auditory-perceptual theory of phonetic recognition by Miller (1987) is designed to

provide a ​comprehensive and coherent account​ of the facts relating acoustic waveforms

to perceived vowel categories.

● It consists of a ​three-stage process​ (Miller, 1984a), where the acoustic waveform of

speech is converted by the human listener to a string of category codes that correspond to

the allophones of the language. The theory describes the "​bottom-up" aspects of phonetic

perception, ​and it is conceptual in nature.



● Stage 1​ of the theory is t​he transformation of the acoustic waveform to auditory-sensory

dimensions​. In stage 1, it is assumed that short-term spectral analyses are performed on

the incoming speech waveform. Each spectrum can be classified as a glottal-source

spectrum, a burst-friction spectrum, or a combination of the two. At each moment, the

spectral envelope patterns of the glottal ​source a​nd burst-friction sounds are represented

as sensory responses or sensory pointers in a phonetically relevant auditory-perceptual

space.

● Stage 2 is the transformation of the sensory data to perceptual values​. In stage 2, these

sensory responses, ​or sensory pointers, are converted into a unitary perceptual response,

or perceptual pointer, that is also located in the auditory-perceptual space. The perceptual

response is a hypothetical construct or intervening variable that is based on the general

notion that ​speech inputs are integrated to form a unitary perceptual stream.

● In stage 3, ​the perceptual variables are converted to phonetic-linguistic categories.

That is in stage 3, segmentation and categorization mechanisms that depend on the

dynamics ​of the perceptual pointer in relation to ​perceptual target​ zones within the

auditory-perceptual space result in a string of category codes that correspond to the

allophones of the language.
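Miller's three stages can be caricatured in a few lines of code. Everything concrete below (the two-dimensional "auditory-perceptual space", the coordinates of the target zones, and the normalization of the input) is invented for illustration; only the three-stage structure, sensory pointer, integrated perceptual pointer, and nearest-target categorization follows the description above.

```python
import math

# Hypothetical target zones in a toy 2-D auditory-perceptual space
# (the coordinates below are made up purely for illustration).
TARGET_ZONES = {
    "i": (0.2, 0.9),
    "a": (0.8, 0.2),
    "u": (0.1, 0.1),
}

def stage1_sensory_pointer(spectrum_peaks):
    """Stage 1 (toy): reduce a short-term spectral description to a
    sensory pointer in the auditory-perceptual space."""
    f1, f2 = spectrum_peaks            # pretend these are formant-like peaks in Hz
    return (f1 / 1000.0, f2 / 3000.0)  # crude normalization to 0-1 coordinates

def stage2_perceptual_pointer(sensory_pointers):
    """Stage 2 (toy): integrate successive sensory pointers into a single
    perceptual pointer by averaging over a short window."""
    xs = [p[0] for p in sensory_pointers]
    ys = [p[1] for p in sensory_pointers]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def stage3_categorize(perceptual_pointer):
    """Stage 3 (toy): assign the pointer to the nearest target zone."""
    return min(TARGET_ZONES,
               key=lambda v: math.dist(perceptual_pointer, TARGET_ZONES[v]))

frames = [(250, 2300), (270, 2250), (260, 2280)]   # hypothetical /i/-like frames
pointers = [stage1_sensory_pointer(f) for f in frames]
print(stage3_categorize(stage2_perceptual_pointer(pointers)))   # -> "i"
```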

XII. Neurological theories of speech perception

● This theory suggests that the phonological attributes of human speech are decoded by neurosensory receptive fields ("feature detectors") innately structured to detect, and respond to, the various distinguishing parameters of the acoustic sound stream.

● Menyuk (1968) states: "If a comparatively small set of features or attributes can describe the speech sounds in all languages, then, it is hypothesized, these attributes are related to the physiological capacities of man to produce and perceive sounds."

● Abbs and Sussman (1971) postulated a "feature detector" theory of speech perception. This view does not depend on a particular distinctive feature system, but rather concerns the process of auditory decoding of the acoustic speech signal that results in phonetic identification.

● "Feature detectors" can be broadly defined as organizational configurations of the sensory

nervous system that are highly sensitive to certain parameters of complex stimuli.

Complex acoustic stimuli may be considered as composed of several physical parameters,

for example, intensity, wave-length, direction, form, and temporal patterning.

● Feature detectors can be distinguished from passive acoustic filters​ simply on the basis of

function. An acoustic filter is specific only to center frequency and resonance bandwidth.

Such stimulus processing is the result of a limited dimensional sampling of the

information spatially and temporally embedded in a speech sound. ​Feature detectors, on

the other hand, ​respond to physical parameters that are composed of several different

aspects of the signal, i.e., frequency, intensity, rate of frequency change, rate of intensity

change, and durational characteristics of these attributes. Thus, in contrast to acoustic

filters, feature detectors are simultaneously sensitive to many characteristics of acoustic

stimuli (a toy contrast between a passive filter and a feature detector is sketched at the end of this section).

● Physiological evidence for complex feature detectors in the auditory systems of

vertebrates and lower forms is also available. Nelson, Erulkar, and Bryan (1966)

found that particular features of sound stimulus patterns appear to be detected and

processed by neuronal cells structured to respond to a particular aggregate of physical

characteristics​.​ ​Reports of lateral inhibition in the auditory system (Katsuki, 1961, 1962;

Nomoto et al., 1964) indicate the existence of a sophisticated sharpening mechanism at

the level of the eighth nerve.

● According to later evidence, a complex spatial configuration of receptor cells, maximally sensitive to certain physical parameters of the auditory stimulus, can be postulated, which may offer an explanation for the neural decoding process involved in speech perception. The operation of feature detectors might likewise be applied to explain perception of the distinctive features of speech (Jakobson, Fant, and Halle, 1952).

● For each of these acoustic features, a series of specific spatio-temporal "trademarks" has been tentatively proposed that may serve as identifying stimulus characteristics. Likewise, each series of space-time stimulus characteristics can conceivably correspond to a specific neurosensory detector process that is especially sensitive to those stimulus traits.

● Tonotopic organization has been identified in human auditory cortex using a variety of

imaging techniques.​ The majority of early studies used only two different stimulus

frequencies (Bilecen et al., 1998; Lauter et al., 1985; Lockwood et al., 1999; Talavage et

al., 2000; Wessinger et al., 1997). ​These studies suggested a general pattern in which

high frequencies activated medial auditory cortex and low frequencies activated more

anterolateral regions in the superior temporal plane​. This pattern has usually been

interpreted as a single low-to-high frequency gradient oriented along Heschl's gyrus

(HG).

● Later functional magnetic resonance imaging (fMRI) studies improved on this design by

adding intermediate frequencies, allowing the identification of frequency gradients

(Formisano et al., 2003; Woods et al., 2009). Results and interpretations from these

studies have varied considerably. For example, one study reported a single high-to-low

gradient extending from posterior medial to anterior lateral auditory areas, similar to

earlier studies (Langers, Backes, van Dijk 2007).

● A second study, however, described two mirror-symmetric frequency gradients

(high-low-high) extending approximately along the axis of HG (Formisano et al., 2003).

In a third study, three consistent gradients were reported, none of which clearly follow

the long axis of HG (Talavage et al., 2004).

● Finally, the authors of a fourth study found differences in activation between anterior and

posterior HG as well as medial and lateral differences, but concluded that the observed

activation profile did not represent frequency gradients but instead different functional

regions within the auditory cortex (Schonwiesner, von Cramon, Rubsamen 2002).

● Talavage et al. (2004) conducted a functional magnetic resonance imaging study using

frequency-swept stimuli to identify progressions of frequency sensitivity across the

cortical surface. The center-frequency of narrow-band, amplitude-modulated noise was

slowly swept between 125 and 8,000 Hz. ​Areas of cortex exhibiting a progressive change

in response latency with position were considered tonotopically organized. There exist

two main findings. First, six progressions of frequency sensitivity (i.e., tonotopic

mappings) were repeatedly observed in the superior temporal plane. Second, the locations

of the higher- and lower-frequency endpoints of these progressions were approximately

congruent with regions reported to be most responsive to discrete higher- and



lower-frequency stimuli. They concluded that these progressions correspond to anatomically defined cortical areas, suggesting that the human auditory cortex exhibits at least six progressions of frequency sensitivity on the superior temporal lobe.
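Returning to the contrast drawn earlier between passive acoustic filters and feature detectors, the toy sketch below makes the difference concrete. The stimulus description, thresholds, and the particular conjunction of properties are all invented; the point is only that the "filter" asks a single question about frequency, whereas the "detector" fires only when several properties of the signal co-occur.

```python
# Toy contrast between a passive acoustic filter and a "feature detector".
# Each frame of a made-up stimulus is described by a centre frequency (Hz),
# an intensity (dB), and the rate of frequency change (Hz/ms).

stimulus = [
    {"freq": 1200, "intensity": 55, "freq_rate": +30},  # rising formant-like glide
    {"freq": 1500, "intensity": 60, "freq_rate": +30},
    {"freq": 1800, "intensity": 62, "freq_rate": +30},
]

def passive_filter(frame, center=1500, bandwidth=400):
    """A passive filter only 'asks' whether energy falls within its band."""
    return abs(frame["freq"] - center) <= bandwidth / 2

def feature_detector(frames):
    """A feature detector responds to a conjunction of properties: enough
    intensity, a rising frequency transition, and a minimum duration
    (here, a minimum number of frames)."""
    loud_enough = all(f["intensity"] > 50 for f in frames)
    rising = all(f["freq_rate"] > 0 for f in frames)
    long_enough = len(frames) >= 3
    return loud_enough and rising and long_enough

print([passive_filter(f) for f in stimulus])   # frame-by-frame band energy only
print(feature_detector(stimulus))              # True: the joint pattern is detected
```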

XIII. Pandemonium Model

Author & Year: Selfridge (1959)

Assumption:

Selfridge's (1959) pandemonium model of pattern analysis assumes that low level image demons

detect physical attributes of the stimulation, middle level computational demons detect features

peculiar to certain objects, and high level cognitive demons detect objects. The model is called

pandemonium because the signalling function of these demons is analogous to yelling, with the

demons who yell the loudest calling the shots, so to speak.

Background:

Perception and cognition are not just matters of connections among the neurons or the events,

things, features, categories, etc. that those neurons carry. Put another way, we are not true digital

computers. It is not a matter of this-or-that and here-or-there. There are also matters of degree,

of quantity, intensity, or volume.

Neurons don't just work on the on-or-off principle. They can fire repeatedly, rapidly, or just

occasionally or rarely. There can be hundreds of synapses telling a neuron to fire, and hundreds

more telling it not to. Likewise, the ideas we have in our minds can be powerful and influence

many other ideas, or they can be once-in-a-lifetime flashes of brilliance.

It is important, in order to understand how the mind/brain works, to keep this in mind. One of

the most memorable ways of doing this goes back to the early work on artificial intelligence,

specifically that of Oliver Selfridge (1959, Pandemonium: A paradigm for learning, in Symposium on the Mechanization of Thought Processes, London: HM Stationery Office).

Explanation:

In the pandemonium model, each letter or digit is represented by a ‘cognitive demon’,

which holds a list of the features that define its shape. The cognitive demons listen for evidence

that matches their description, which comes from ‘feature demons’ that detect individual lines,

such as a horizontal or a vertical, and shout out if they are activated. When a cognitive demon

starts to detect features consistent with its shape, it too begins to shout. So, if a letter such as ‘A’

is presented to the image demon, the feature demons will start providing evidence for it. Some of

this will be consistent with both an ‘A’ and an ‘H’, and the A demon will start shouting, but so too

will the H demon. As more evidence comes in from the feature demons, the evidence for ‘A’ will

be greater than for ‘H’ and the A demon will be shouting the loudest. The decision demon then

makes a decision on which is the most likely letter by seeing which voice dominates the noise

(see Figure 6.2). When information from the features is ambiguous, for example in handwriting,

cognitive demons can take account of word knowledge and context to disambiguate the letter. Of

course Selfridge did not propose that there really were demons in the brain, but the principles of

parallel processing for all features, and levels of excitation in the nervous system, are consistent

with what is known. However, the problem of who listens to the demons and the arrangement of

features relative to each other within a shape are crucially important. Two vertical lines and a

horizontal line do not define ‘H’; it is the relation between them that does so; for example, ‘ll-’ is

not an H. A theory of pattern or object recognition must be able to specify the relation between

parts of an object. However, pandemonium can be considered a precursor for parallel distributed

processing (PDP) models, of which one very influential contribution is the interactive activation

model.
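A minimal sketch of the shouting metaphor is given below. The feature inventory and the scoring rule are invented (Selfridge's actual model learned weighted evidence); the sketch only shows feature demons reporting evidence, cognitive demons shouting in proportion to how well that evidence matches their letter, and a decision demon picking the loudest voice.

```python
# Hypothetical defining features for three letters (not Selfridge's feature set).
LETTER_FEATURES = {
    "A": {"left_diagonal", "right_diagonal", "horizontal_bar"},
    "H": {"left_vertical", "right_vertical", "horizontal_bar"},
    "T": {"top_horizontal", "centre_vertical"},
}

def feature_demons(image):
    """Stand-in for the image and feature demons: here the 'image' is simply
    the set of strokes the feature demons have detected."""
    return set(image)

def cognitive_demons(detected):
    """Each letter demon 'shouts' with a loudness equal to the number of its
    defining features that are present, minus the detected features it lacks."""
    return {letter: len(feats & detected) - len(detected - feats)
            for letter, feats in LETTER_FEATURES.items()}

def decision_demon(shouts):
    """The decision demon listens for the loudest cognitive demon."""
    return max(shouts, key=shouts.get)

detected = feature_demons({"left_diagonal", "right_diagonal", "horizontal_bar"})
shouts = cognitive_demons(detected)
print(shouts)                  # A shouts loudest; H gets partial credit
print(decision_demon(shouts))  # -> "A"
```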

https://www.youtube.com/watch?v=YRqwXaCnxy8


XIV. Direct- Realistic Approach

Author and Year:​ Fowler, 1986

Direct realism postulates that speech perception is direct (i.e., happens through the perception of

articulatory gestures), but it is not special. All perception involves direct recovery of the distal

source of the event being perceived (Gibson). In vision, you perceive objects (e.g., trees, cars,

etc.). Likewise with smell you perceive e.g., cookies, roses, etc. Why not in the auditory

perception of speech? So, listeners perceive tongues and lips. The articulatory gestures that are

the objects of speech perception are not intended gestures (as in Motor Theory). Rather, they are

the actual gestures.

A theory of speech perception called direct realism (Fowler, Galantucci, & Saltzman, 2003) is an

alternative to both the motor theory and a general auditory approach. The direct realism

perspective on speech perception is not easy to explain in the absence of substantial background

information (which is not pursued here). Some scientists (Cleary & Pisoni, 2001) question the

utility of direct realism as a theory because it is hard to understand how reasonable experimental

tests can be made to support or falsify it.

The direct realism perspective on speech perception takes its inspiration from the pioneering

work of the psychologist J. J. Gibson (1904–1979). Gibson’s theoretical and experimental focus

was on visual perception. Gibson (1968, 1979) rejected the idea of cognitively driven,

“constructed” percepts. He did not like the idea of perception in which percepts—often referred

to as the “objects” of perception—were mediated by cognitive operations to produce a

representation of the external world. Rather, he proposed the idea that animals, including

humans, perceive the visual layouts of environments directly, by linking the stimulation of their

senses (by light waves, in the case of vision) with the sources of the stimulation. For example,

when perceivers are exposed to the patterning of light reflected from a chair, they perceive the

chair directly, rather than interpreting through neural mechanisms the light patterns reaching

their eyes.

Gibson’s view was that objects in the environment structure the medium through which they are

conveyed to the senses. A chair structures the medium (light waves) by the patterns of light

reflected from it, and humans presumably learn that structure and perceive the chair directly,

without cognitive mediation. Direct realists argue that the objects of perception are not the

proximal stimulation, which in the example above is the light reflections at the eye, from the

chair, but rather the distal source of the light reflection. In this example, the distal source is the

chair. More specifically, in Gibson’s view, perceivers do not “process” and “encode” the light

waves via cognitive operations whose output is a symbolic representation.



Gibson coined the term “ecological psychology” for this view of perception. The term is

consistent with a “realist” (i.e., ecologically valid) view of how organisms perceive objects. For

Gibson, much of the experimental psychologist’s vocabulary, including terms found in

information processing models such as “encoding” and “representation,” were not much more

than a convenient set of descriptive terms. For Gibson, the terms did not have ecological

validity—that is, they did not represent “real” things, “real” mechanisms that were the stuff of

perception. The motor theory of speech perception requires operations of a special module to

convert (decode) an unstable acoustic signal into a stable articulatory representation.

In the motor theory, the objects of speech perception are the articulatory characteristics (either

places of articulation or articulatory gestures over time) of phonemes or phoneme sequences as

transformed by the special processor. On the other hand, a general auditory approach to speech

perception requires some processing stages to match the incoming acoustics to stored templates

or features or a statistical model of acoustic speech signals (Klatt, 1989; Massaro & Chen, 2008;

Stevens, 2005). The object of speech perception in the auditory theory is the speech acoustic

signal, plus other sources of sensory information (such as visual information from a speaker’s

face during speech). In both cases, cognitive operations of varying degrees of automaticity are

required for perception of incoming sounds. Fowler (1986, 1996; Fowler, Shankweiler, &

Studdert-Kennedy, 2016, especially pp. 138–143), the leading proponent of direct realism in

speech perception, rejects cognitive “constructions” in the perception of speech sounds. Fowler

agrees with the Gibsonian idea that scientists should reject the idea of perception driven by

cognitive processes that produce an “output symbol.” In the case of speech perception, the

simplest example of such a symbol is a phoneme. Rather, Fowler argues for direct perception of

articulatory gestures. In this case, the objects of speech perception are articulatory gestures, such

as movements of the tongue or jaw. The gestures are the distal source and the pressure wave

produced by articulatory movement and that reaches the ears is the proximal stimulation.

Movements of the articulators, in Fowler’s view, structure the medium (air) through which the

information is transmitted to the ears. Listeners track this structure as phonetic gestures unfold

over time. The perception of these gestures is direct, not mediated by other processes. Listeners

literally hear articulatory gestures (or, on another interpretation, the sounds they hear are the

articulatory gestures). The parallel with Gibson’s (1969, 1970) view of visual perception is easy

to see.

In the direct realism theory of speech perception, articulatory gestures are perceived directly. No

special mechanisms are required to make perceptual decisions concerning phonetic events​. An

apparent advantage of this perspective is that the acoustic variability for a given vowel, which

seems to make auditory theories cumbersome with the need to “know” all the possible variations

of the vowel’s formant frequencies both within and across speakers , becomes a non-issue. The

directly-perceived gestures for the vowel may have variable formant frequencies, especially

across speakers whose vocal tract lengths are very different (e.g., men versus children), but the

gestures are nearly the same. Adult men have much longer vocal tracts than five-year-old

children and therefore different formant frequencies for any vowel, but the articulatory gestures

for both groups are nearly the same for any vowel. According to this view, children use

articulatory gestures for /i/ similar to those used for the same vowel produced by adult men. The

gestures can be perceived directly across age and sex and are not affected by the acoustic

variability for the vowel from speaker to speaker and other sources of acoustic variation as

discussed earlier. Unfortunately, it is not true that different speakers use the same articulatory

gestures for a specific sound. Using direct measures of speech movement, Johnson, Ladefoged,

and Lindau (1993) showed substantial articulatory variation for the same vowel across different

speakers. Similarly, significant speaker-to-speaker variability of tongue movements for

American English “r” before vowels was reported by Westbury, Hashi, and Lindstrom (1998).

There does not appear to be across-speaker stability of articulatory gestures for vowels or the

sonorant “r” (sonorant = vowel-like). Direct realism does not have a ready answer at the level of

articulatory gestures to get around the variable acoustic characteristics of a given vowel or other

vowel-like sounds.

https://www.youtube.com/watch?v=JF0ArkVDrT8

ARTICLE

Direct Perceptions of Carol Fowler’s Theoretical Perspective (Whalen, 2016)

Whalen (2016) argues that the work of Carol Fowler is usually criticized for things it does not

claim or treated as if it were not possible to claim what she claims.

Misunderstanding #1: “Direct” means “perfect”:

Small variations in articulation that lead to the same phoneme are often taken to disprove

Direct Realism.

Misunderstanding #2: “Direct” means “the signal is irrelevant”

The second misunderstanding is that creatures using direct perception cannot end up with

acoustically robust articulations. That is, if the signal is relevant, it is assumed that articulation

is irrelevant. One aspect of this mistake is that the information conveyed by the signal is

assumed to be specific articulatory shapes.

Misunderstanding #3: “Direct” doesn’t really mean anything



A third misunderstanding is that there is no sense in which perception is “direct.” This mostly

derives from the perfectly correct observation that speech information reaches our brain via

sense organs (for vision, see Ullman, 1980).

XV. Input to Lexicon - Lexical Access from Spectra (LAFS)

Author & Year:​ Klatt, 1979

Explanation:

Phonetic processes and representations played no real part in early psycholinguistic models of

spoken-word recognition. Klatt’s (1979) Lexical Access From Spectra (LAFS) model was largely ignored by psychologists. In the LAFS model, spoken-word recognition is accomplished

based on acoustic information alone.

ASSUMPTION:

● Klatt’s Lexical Access From Spectra (LAFS) model assumes direct, noninteractive access

of lexical entries based on context-sensitive spectral sections (​Klatt, 1980​). For example,

spectrograms of the current speech signal are mapped directly onto a lexicon of spectral

templates.

● Klatt’s model assumes that adult listeners have a dictionary of all lawful diphone

sequences in long-term memory. Associated with each diphone sequence is a prototypical



spectral representation. Klatt proposes spectral representations of diphone sequences to

overcome the contextual variability of individual segments.

In Klatt’s LAFS model, the listener ​computes spectral representations of an input word and

compares these representations to the prototypes in memory.​ Word recognition is accomplished

when a best match is found between the input spectra and the diphone representations. In this

portion of the model, word recognition is accomplished directly on the basis of spectral

representations of the sensory input.

One important aspect of Klatt’s LAFS model is that it explicitly avoids any need to compute a

distinct level of representation corresponding to discrete phonemic segments. Instead, ​LAFS uses

a precompiled, acoustically-based lexicon of all possible words in a network of diphone power

spectra. ​These spectral templates are assumed to be context-sensitive units much like

“Wickelphones” because they are assumed to represent the acoustic correlates of phones in

different phonetic environments (​Wickelgren, 1969​). ​Diphones in the LAFS system accomplish

this by encoding the spectral characteristics of the segments themselves and the transitions from

the middle of one segment to the middle of the next segment.

Klatt argues that diphone concatenation is sufficient to capture much of the context-dependent

variability observed for phonetic segments in spoken words. Word recognition in this model is

accomplished ​by computing a power spectrum of the input speech signal every 10 ms and then

comparing this input spectrum to spectral templates stored in a precompiled network.​ The basic

idea of LAFS, adapted from the Harpy system, is to find the path through the network that best

represents the observed input spectra (​Klatt, 1977​). This single path is then assumed to represent

the optimal phonetic transcription of the input signal.
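The frame-by-frame matching and best-path search can be sketched as follows. This is not Klatt's implementation: the "spectra" are three-element toy vectors rather than real 10-ms power spectra, the diphone network covers a single invented word, and the scoring is a plain squared distance. Only the overall scheme, matching each incoming spectrum against precompiled diphone templates and keeping the best-scoring path through the network, follows the description above.

```python
# Toy LAFS-style matcher: templates, network, and word are all invented.
NETWORK = {                       # node -> nodes reachable at the next frame
    "#-t": ["#-t", "t-i"],
    "t-i": ["t-i", "i-#"],
    "i-#": ["i-#"],
}
TEMPLATES = {                     # hypothetical spectral template per diphone node
    "#-t": [0.1, 0.9, 0.2],
    "t-i": [0.8, 0.3, 0.1],
    "i-#": [0.7, 0.1, 0.05],
}
START, END = "#-t", "i-#"

def distance(spectrum, template):
    """Squared-difference score between an input spectrum and a template."""
    return sum((s - t) ** 2 for s, t in zip(spectrum, template))

def best_path(frames):
    """Frame-synchronous dynamic programming over the diphone network."""
    scores = {START: distance(frames[0], TEMPLATES[START])}
    paths = {START: [START]}
    for frame in frames[1:]:
        new_scores, new_paths = {}, {}
        for node, score in scores.items():
            for nxt in NETWORK[node]:
                cand = score + distance(frame, TEMPLATES[nxt])
                if nxt not in new_scores or cand < new_scores[nxt]:
                    new_scores[nxt] = cand
                    new_paths[nxt] = paths[node] + [nxt]
        scores, paths = new_scores, new_paths
    return paths[END], scores[END]

frames = [[0.1, 0.85, 0.2], [0.75, 0.3, 0.12], [0.7, 0.12, 0.06]]
print(best_path(frames))   # best diphone sequence for the toy word "tea"
```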

https://www.youtube.com/watch?v=x8HIAVTeGNk

XVI. TRACE

Author & Year: Elman and McClelland (1981, 1983)

The first, Elman and McClelland's (1981, 1983) interactive activation TRACE model of word

recognition, is perhaps one of the most highly interactive theories to date.

The model is called the TRACE model because the network of units forms a dynamic processing

structure called “the Trace,” which serves at once as the perceptual processing mechanism and as

the system’s working memory.

The model is instantiated in two simulation programs. TRACE I deals with short segments of real speech and suggests a mechanism for coping with the fact that the cues to the identity of phonemes vary as a function of context.

TRACE II,​ simulates a large number of empirical findings on the perception of phonemes and

words and on the interactions of phoneme and word perception. At the phoneme level, TRACE

II simulates the influence of lexical information on the identification of phonemes and accounts

for the fact that lexical effects are found under certain conditions but not others. The model also

shows how knowledge of phonological constraints can be embodied in particular lexical items

but can still be used to influence processing of novel, nonword utterances.

Elman and McClelland's (1981, 1983) model is based on a system of simple processing units

called ​"nodes."​ Nodes may stand for features, phonemes, or words. However, nodes at each level

are alike in that each has an activation level signifying the degree to which the input is consistent

with the unit that the node represents.

In addition, ​each node has a resting level and a threshold.​ In the presence of confirmatory

evidence, the activation level of a node rises toward its threshold; in the absence of such

evidence, activation decays toward the resting level of the node.

Nodes within this system are highly interconnected and when a given node reaches threshold, it

may influence other nodes to which it is connected. ​Connections between nodes are of two types:

excitatory and inhibitory​. Thus a node that has reached threshold may raise the activation of

some of the nodes to which it is connected while lowering the activation of others. Connections

between levels are exclusively excitatory and bidirectional. Thus phoneme nodes may excite

word nodes, and word nodes may in turn excite phoneme nodes.

FIG. A subset of the units in TRACE II. Each rectangle represents a different unit. The labels

indicate the item for which the unit stands, and the horizontal edges of the rectangle indicate

the portion of the Trace spanned by each unit. The input feature specifications for the phrase

“tea cup,” preceded and followed by silence, are indicated for the three illustrated dimensions

by the blackening of the corresponding feature units.

In the figures, each rectangle corresponds to a separate processing unit. The labels on the units

and along the side indicate the spoken object (feature, phoneme, or word) for which each unit

stands. The left and right edges of each rectangle indicate the portion of the input the unit

spans.

At the feature level,​ there are several banks of feature detectors, one for each of several

dimensions of speech sounds. Each bank is replicated for each of several successive moments

in time, or time slices.

At the phoneme level,​ there are detectors for each of the phonemes.

At the word level,​ there are detectors for each word. There is one copy of each word detector

centered over every three feature slices.

The entire network of units is called “the Trace,” because the pattern of activation left by a

spoken input is a trace of the analysis of the input at each of the three processing levels.

The Elman and McClelland model illustrates how a highly interactive system may be

conceptualized. In addition, it incorporates notions of both excitation and inhibition. By doing

so, it directly incorporates a mechanism that reduces the possibility of nodes inconsistent with

the evidence being activated while allowing for positive evidence at one level to influence the

activation of nodes at another. Although Elman and McClelland's model is highly interactive, it

is not without constraints. Namely, connections between levels are only excitatory, and within

levels they are only inhibitory.
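A drastically reduced interactive-activation sketch appears below. It is not TRACE (there are no time slices, no feature level, and no thresholds), and the two-word lexicon and parameter values are invented; it only illustrates the bookkeeping described above: excitatory connections between levels, inhibitory connections within a level, and decay of activation toward a resting level.

```python
LEXICON = {"cat": ["k", "a", "t"], "cap": ["k", "a", "p"]}
PHONEMES = ["k", "a", "t", "p"]
REST, DECAY, EXCITE, INHIBIT = 0.0, 0.1, 0.05, 0.04

phon_act = {p: REST for p in PHONEMES}   # phoneme-level activations
word_act = {w: REST for w in LEXICON}    # word-level activations

def step(bottom_up):
    """One update cycle: feature-level input excites phonemes, phonemes and
    words excite each other across levels, units within a level inhibit one
    another, and all activations decay toward their resting level."""
    global phon_act, word_act
    new_phon = {}
    for p in PHONEMES:
        net = bottom_up.get(p, 0.0)
        net += sum(EXCITE * word_act[w] for w, ps in LEXICON.items() if p in ps)
        net -= sum(INHIBIT * phon_act[q] for q in PHONEMES if q != p)
        new_phon[p] = phon_act[p] + net - DECAY * (phon_act[p] - REST)
    new_word = {}
    for w, ps in LEXICON.items():
        net = sum(EXCITE * phon_act[p] for p in ps)
        net -= sum(INHIBIT * word_act[v] for v in LEXICON if v != w)
        new_word[w] = word_act[w] + net - DECAY * (word_act[w] - REST)
    phon_act, word_act = new_phon, new_word

# Present bottom-up evidence for /k/, /a/, then a mostly /t/-like segment.
for evidence in [{"k": 0.3}, {"a": 0.3}, {"t": 0.2, "p": 0.05}]:
    for _ in range(5):
        step(evidence)
print(sorted(word_act.items(), key=lambda kv: -kv[1]))  # "cat" ends up ahead
```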

https://www.youtube.com/watch?v=oK4EQCYdXwM

XVII. Dual Stream Model

Author & Year: ​Hickok and Poeppel, 2007

The Dual Stream model of speech/language processing holds that there are two functionally

distinct computational/neural networks that process speech/language information, one that



interfaces ​sensory/phonological networks with conceptual-semantic systems​, and one that

interfaces s​ensory/phonological networks with motor-articulatory systems​ (Hickok & Poeppel,

2000, 2004, 2007).

This model proposes that ​a ventral stream​, which involves structures in the superior and middle

portions of the temporal lobe, is involved in processing speech signals for comprehension

(speech recognition). A dorsal stream, which involves structures in the posterior frontal lobe and

the posterior dorsal-most aspect of the temporal lobe and parietal operculum, is involved in

translating acoustic speech signals into articulatory representations in the frontal lobe, which is

essential for speech development and normal speech production.

STREAM: Ventral Stream
STRUCTURES: Superior and middle portions of the temporal lobe
FUNCTION: Processes speech signals for comprehension (speech recognition).

STREAM: Dorsal Stream
STRUCTURES: Posterior frontal lobe, the posterior dorsal-most aspect of the temporal lobe, and the parietal operculum
FUNCTION: Translates acoustic speech signals into articulatory representations in the frontal lobe, which is essential for speech development and normal speech production.

The suggestion that the dorsal stream has an auditory–motor integration function differs from

earlier arguments for a dorsal auditory ‘where’ system, but is consistent with recent

conceptualizations of the dorsal visual stream and has gained support in recent years. Generally,

they propose that speech perception tasks rely to a greater extent on dorsal stream circuitry,

whereas speech recognition tasks rely more on ventral stream circuitry (with shared neural tissue

in the left STG), thus explaining the observed double dissociations.​ In addition, in contrast to the

typical view that speech processing is mainly left-hemisphere dependent, the model suggests that

the ventral stream is bilaterally organized (although with important computational differences

between the two hemispheres); so, the ventral stream itself comprises parallel processing

streams. This would explain the failure to find substantial speech recognition deficits following

unilateral temporal lobe damage. The dorsal stream, however, is strongly left-dominant, which

explains why production deficits are prominent sequelae of dorsal temporal and frontal lesions,

and why left-hemisphere injury can substantially impair performance in speech perception tasks.

The dual-stream model of the functional anatomy of language. a | Schematic diagram of the

dual-stream model. The earliest stage of cortical speech processing involves some form of

spectrotemporal analysis, which is carried out in auditory cortices bilaterally in the

supratemporal plane. These spectrotemporal computations appear to differ between the two

hemispheres. Phonological-level processing and representation involves the middle to posterior

portions of the superior temporal sulcus (STS) bilaterally, although there may be a weak

left-hemisphere bias at this level of processing. Subsequently, the system diverges into two broad

streams, a dorsal pathway (blue) that maps sensory or phonological representations onto

articulatory motor representations, and a ventral pathway (pink) that maps sensory or

phonological representations onto lexical conceptual representations.



b | Approximate anatomical locations of the dual-stream model components, specified as

precisely as available evidence allows. Regions shaded green depict areas on the dorsal surface

of the superior temporal gyrus (STG) that are proposed to be involved in spectrotemporal

analysis. Regions shaded yellow in the posterior half of the STS are implicated in

phonological-level processes. Regions shaded pink represent the ventral stream, which is

bilaterally organized with a weak left-hemisphere bias. The more posterior regions of the ventral

stream, posterior middle and inferior portions of the temporal lobes correspond to the lexical

interface, which links phonological and semantic information, whereas the more anterior

locations correspond to the proposed combinatorial network. Regions shaded blue represent the

dorsal stream, which is strongly left dominant. The posterior region of the dorsal stream

corresponds to an area in the Sylvian fissure at the parietotemporal boundary (area Spt), which is

proposed to be a sensorimotor interface, whereas the more anterior locations in the frontal lobe,

probably involving Broca’s region and a more dorsal premotor site, correspond to portions of the

articulatory network.

aITS, anterior inferior temporal sulcus; aMTG, anterior middle temporal gyrus; pIFG, posterior

inferior frontal gyrus; PM, premotor cortex.



https://www.youtube.com/watch?v=uLUOzUYC3u4

ARTICLE

Anatomy of aphasia revisited (Fridriksson et al., 2018)

In this article, the authors present a follow-up study to their previous work that used lesion data to reveal the anatomical boundaries of the dorsal and ventral streams supporting speech and language processing. Specifically, by emphasizing clinical measures, they examine the effect of

cortical damage and disconnection involving the dorsal and ventral streams on aphasic

impairment. The results reveal that measures of motor speech impairment mostly involve

damage to the dorsal stream, whereas measures of impaired speech comprehension are more

strongly associated with ventral stream involvement. Equally important, many clinical tests

that target behaviours such as naming, speech repetition, or grammatical processing rely on

interactions between the two streams. This latter finding explains why patients with seemingly

disparate lesion locations often experience similar impairments on given subtests. Namely,

these individuals’ cortical damage, although dissimilar, affects a broad cortical network that

plays a role in carrying out a given speech or language task. The current data suggest that this is

a more accurate characterization than ascribing specific lesion locations as responsible for

specific language deficits.

https://www.youtube.com/watch?v=F0vnSsqnax0

References

1. Abbs, J. H., & Sussman, H. M. (1971). Neurophysiological Feature Detectors and Speech

Perception: A Discussion of Theoretical Implications. Journal of Speech Language and

Hearing Research, 14(1), 23.

2. Bever, T., & Poeppel, D. (2010). Analysis by synthesis: A (re-)emerging program of research for language and vision. Biolinguistics, 4, 174-200.

3. Broadbent, D. E., & Ladefoged, P. (1957). On the fusion of sounds reaching different sense organs. 29, 708-710.

4. Casserly, E. D., & Pisoni, D. B. (2010). Speech perception and production. ​Wiley

Interdisciplinary Reviews: Cognitive Science,​ ​1(​ 5), 629-647.

5. Feng, G., Zhou, B., Zhou, W., Beauchamp, M. S., & Magnotti, J. F. (2019). ​A Laboratory

Study of the McGurk Effect in 324 Monozygotic and Dizygotic

6. Fridriksson, J., den Ouden, D. B., Hillis, A. E., Hickok, G., Rorden, C., Basilakos, A., ...

& Bonilha, L. (2018). Anatomy of aphasia revisited. ​Brain,​ ​141(​ 3), 848-862.

7. Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. ​Nature

reviews neuroscience​, ​8​(5), 393-402.



8. Hixon, T. J., Weismer, G., & Hoit, J. D. (2018). ​Preclinical speech science: Anatomy,

physiology, acoustics, and perception​. Plural Publishing.

9. Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17(1-2), 3-45.

10. Miller, J. D. (1989). ​Auditory-perceptual interpretation of the vowel. The Journal of the

Acoustical Society of America, 85(5), 2114–2134.​ doi:10.1121/1.397862

11. Nath, A. R., & Beauchamp, M. S. (2012). ​A neural basis for interindividual differences in

the McGurk effect, a multisensory speech illusion. NeuroImage, 59(1), 781–787.

12. Pisoni, D. B., & Luce, P. A. (1987). Acoustic-phonetic representations in word

recognition. ​Cognition​, ​25​(1-2), 21-52.

13. Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S., & Lang, J. M. (1994). ​On the

perceptual organization of speech. Psychological Review, 101(1), 129–156.

14. Schwab, E. C., & Nusbaum, H. C. (Eds.). (2013). ​Pattern recognition by humans and

machines: speech perception​ (Vol. 1). Academic Press.

15. Styles, E. A. (2005). ​Attention, perception and memory: an integrated introduction.​

Psychology Press.

16. Tiippana, K. (2014). ​What is the McGurk effect? Frontiers in Psychology, 5.​

17. van Wassenhove, V., Grant, K. W., and Poeppel, D. (2007). Temporal window of

integration in auditory-visual speech perception. ​Neuropsychologia​ 45, 598–607.

18. Wright, R., Frisch, S., & Pisoni, D. B. (1997). Speech perception. Research on Spoken Language Processing Progress Report No. 21, 1-50.

19. Whalen, D. H. (2016). Direct perceptions of Carol Fowler's theoretical perspective.

Ecological Psychology,​ ​28(​ 4), 183-187.


