Automatic Speech Recognition


Automatic Speech Recognition
Automatic speech recognition
• What is the task?
• What are the main difficulties?
• How is it approached?
• Early history of ASR
• Dragon
• Harpy

2/34
What is the task?
• Process of converting an acoustic stream of
speech input, as gathered by a microphone and
associated electronic equipment, into a text
representation of its component words
• Getting a computer to understand spoken
language
• “Understand” might mean
– React appropriately
– Convert the input speech into another
medium, e.g. text
3/34
How do humans do it?

• Articulation produces sound waves, which the ear conveys to the brain for processing
4/34
How might computers do it?

[Diagram: acoustic waveform → acoustic signal → speech recognition]

• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
5/34
Speech Recognition Process
– Acoustic signal is the data
– Use various forms of signal processing to obtain spectral data (a minimal sketch follows below)
– Generate phonetic units to account for groups of data (phonemes, diphones, demisyllables, syllables, words)
– Combine phonetic units into syllables and words
– Possibly use syntax (and semantic knowledge) to ensure the words make sense

6/34
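The slides do not specify which signal processing is used; purely as an illustration, the Python sketch below shows one common way to obtain spectral data from a digitized waveform (frame it, window it, take a short-time Fourier transform). The function name and all parameter values are assumptions, not part of the original material.

```python
# Illustrative sketch only: turn a digitized waveform into spectral frames
# with a short-time Fourier transform (assumed parameters, not from the slides).
import numpy as np

def spectral_frames(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames and return their magnitude spectra."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples at 16 kHz
    window = np.hamming(frame_len)                   # taper each frame to reduce spectral leakage
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))    # magnitude spectrum of one frame
    return np.array(frames)                          # shape: (num_frames, frame_len // 2 + 1)

# Example: one second of a synthetic 500 Hz tone
t = np.arange(16000) / 16000.0
spectra = spectral_frames(np.sin(2 * np.pi * 500 * t))
print(spectra.shape)
```

Later stages (phonetic units, words, syntax) would operate on frames like these rather than on the raw waveform.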
Issues: Variability in individuals’ speech
• Variation among speakers due to
– Vocal range
– Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc.)
– ACCENT!!! (especially vowel systems, but also consonants, allophones, etc.)
• Variation within speakers due to
– Health, emotional state
– Ambient conditions
• Speech style: formal read vs spontaneous
7/34
Issues: Speaker-(in)dependent systems
• Speaker-dependent systems
– Require “training” to “teach” the system your individual idiosyncrasies
• The more the merrier, but typically nowadays 5 or 10 minutes is enough
• User is asked to pronounce some key words which allow the computer to infer details of the user’s accent and voice
• Fortunately, languages are generally systematic
– More robust
– But less convenient
– And obviously less portable
• Speaker-independent systems
– Language coverage is reduced to compensate for the need to be flexible in phoneme identification
– A clever compromise is to learn on the fly
8/34
Issues: (Dis)continuous speech
• Discontinuous speech is much easier to recognize
– Single words tend to be pronounced more clearly
• Continuous speech involves contextual coarticulation effects
– Weak forms
– Assimilation
– Contractions

9/34
A Speech Waveform to a Sentence
1. Early speech recognition systems attempted to segment the speech waveform into its constituent phones and then to assemble the phones into words.
2. The speech signal was first digitized, and various parameters, such as the frequency or pitch, were extracted.
3. The waveform was then segmented into units containing phones.
4. Using dictionaries that associate the values of waveform parameters with phones and phones with words, the waveform was finally converted into text (a toy sketch of this last step follows below).

10/34
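No code for step 4 appears in the slides; purely as an illustration, here is a toy Python sketch that maps a recognized phone sequence onto words using a pronouncing dictionary. The phone symbols, the PRONUNCIATIONS table, and the greedy longest-match strategy are all invented for the example, not details of the early systems described.

```python
# Toy sketch (assumed data): dictionary lookup from phones to words.
PRONUNCIATIONS = {
    ("r", "eh", "k"): "wreck",
    ("ax",): "a",
    ("n", "ay", "s"): "nice",
    ("b", "iy", "ch"): "beach",
}

def phones_to_words(phones):
    """Greedily match the longest known phone sequence at each position."""
    words, i = [], 0
    max_len = max(len(k) for k in PRONUNCIATIONS)
    while i < len(phones):
        for length in range(max_len, 0, -1):           # try the longest match first
            chunk = tuple(phones[i:i + length])
            if chunk in PRONUNCIATIONS:
                words.append(PRONUNCIATIONS[chunk])
                i += length
                break
        else:
            i += 1                                      # skip a phone we cannot match
    return words

print(phones_to_words(["r", "eh", "k", "ax", "n", "ay", "s", "b", "iy", "ch"]))
# -> ['wreck', 'a', 'nice', 'beach']
```

Ambiguities like the one in this output are exactly what the next slide on homophones is about.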
Disambiguating homophones
• Differences between some phonemes are sometimes very small
• Humans mostly resolve these differences from context and from the need for the utterance to make sense
It’s hard to wreck a nice beach
What dime’s a neck’s drain to stop port?
• Systems can only recognize words that are in their lexicon, so limiting the lexicon is an obvious ploy
• Some ASR systems include a grammar which can help disambiguation
14/34
History

17/34
Radio Rex
• A toy from 1922
• A dog mounted on an iron base with an electromagnet to counteract the force of a spring that would push “Rex” out of his house
• The electromagnet was interrupted if an acoustic signal at 500 Hz was detected
• The sound “e” (/eh/) as found in “Rex” is at about 500 Hz
• Dog comes when called
Isolated Speech Recognition
• Early speech recognition concentrated on
isolated speech because continuous speech
recognition was just not possible
– Even today, accurate continuous speech
recognition is extremely challenging
• Advantages of isolated speech recognition
– The speech signal is segmented, so it is easy to determine the starting point and stopping point of each word
– Co-articulation effects are minimized
– A distinct gap (silence) appears between words
Early Speech Recognition
• Bell Labs (1952) implemented a system for isolated digit recognition (of a single speaker) by comparing the formants of the speech signal to expected frequencies
• RCA Labs implemented an isolated speech recognition system for a single speaker on 10 different syllables
• MIT Lincoln Lab constructed a speaker-independent recognizer for 10 vowels
SR in the 1970s
• ARPA initiated wide-scale research on multi-speaker continuous speech with large vocabularies
– Multi-speaker speech – people speak at different frequencies, particularly if speakers are of different sexes and widely different ages
– Continuous speech – the speech signal for a given sound depends on the preceding and succeeding sounds, and in continuous speech it is extremely hard to find the beginning of a new word, making the search process even more computationally difficult
– Large vocabularies – to this point, speech recognition systems were limited to a few or a few dozen words; ARPA wanted 1000-word vocabularies
• This not only complicated the search process by introducing 10-100 times the complexity in words to match against, but also required the ability to handle ambiguity at the lexical and syntactic levels
– ARPA permitted a restricted syntax, but it still had to allow for a multitude of sentence forms and thus required some natural language capabilities (syntactic parsing)
– ARPA demanded real-time performance
• Four systems were developed out of this research
ARPA Results
• None of the systems were thought to be scalable to larger sizes
• Only Harpy showed promise
• Harpy’s approach provided the most accuracy and this led to the adoption
of HMMs for SR
• On average, HARPY executed about 30 million computer instructions to
deal with one second of speech.
• Using a 0.4-million instructions per second (0.4 MIPS) machine (a DEC
PDP-KA10), it would take over a minute to process a second of speech

System       Words   Speakers            Sentences   Error Rate
Harpy        1011    3 male, 2 female    184         5%
Hearsay II   1011    1 male              22          9%, 26%
HWIM         1097    3 male              124         56%
SRI/SDC      1000    1 male              54          76%
DRAGON
• DRAGON was designed to understand sentences about chess moves.
• It used statistical techniques to make guesses about the most probable
strings of words that might have produced the observed speech signal.
• Suppose we let x stand for a string of words and y stand for the speech
waveform that is produced when x is spoken.
• Because the same speaker may say the same words somewhat
differently on different occasions, and different speakers certainly will
say them differently, the word string x does not completely determine
what the speech waveform y will be.

• For speech recognition, we want to know the probability of a word string x, given the speech signal y, so that we can select the most probable x. That is, we want p(x|y).
• We could use Bayes's rule, as before, to produce the desired probability as follows:

p(x|y) = p(y|x) p(x) / p(y)
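As an illustration of this decision rule only, here is a minimal Python sketch: since p(y) is the same for every candidate x, it suffices to pick the x that maximizes p(y|x) p(x). The candidate strings and all probabilities below are invented, not DRAGON's.

```python
# Illustrative Bayesian decision rule: choose x maximizing p(y|x) * p(x).
# All numbers below are made up for the example.

# p(x): prior probability of each candidate word string (a language model)
PRIOR = {
    "it's hard to recognize speech": 0.70,
    "it's hard to wreck a nice beach": 0.30,
}

# p(y|x): how likely the observed waveform y is, given each word string
# (in a real system this would come from an acoustic model, not a fixed table)
LIKELIHOOD = {
    "it's hard to recognize speech": 0.4,
    "it's hard to wreck a nice beach": 0.5,
}

def most_probable_word_string():
    """Return the candidate x maximizing p(y|x) * p(x); p(y) cancels out."""
    return max(PRIOR, key=lambda x: LIKELIHOOD[x] * PRIOR[x])

print(most_probable_word_string())
# 0.4 * 0.70 = 0.28 beats 0.5 * 0.30 = 0.15, so the first string wins
```

Note how the prior can overrule a slightly better acoustic match, which is the point of combining the two terms.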


Hierarchical levels in speech generation

• DRAGON and other modern speech-recognition systems exploit the hierarchical structure involved in the way a speech waveform is generated.
• At the top of the hierarchy a given semantic idea is
expressed by a string of words obeying the syntactic
rules of the language.
• The string of words, in turn, gives rise to a string of
phones, the phonetic units.
• Finally, the phone string is expressed by a speech
waveform at the bottom of the hierarchy.
HMM in Dragon
• DRAGON combined separate levels of speech into a network consisting of a hierarchy of probabilistic functions of Markov processes.
• At each level, we have a sequence of entities, x1, x2, ..., xn, producing a sequence of other entities, y1, y2, ..., yn.
• It was assumed that each yi was influenced only by xi and xi-1 (the Markov assumption).
• At each level, Bayes's rule was used to compute the probabilities of the x's given the y's.
• Because only the speech waveform at the bottom level was actually observed, the phones and words were said to be hidden. For this reason, the entire network employed hidden Markov models (HMMs).
• DRAGON was the first example of the use of HMMs in AI.
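To make the idea of recovering hidden states concrete, here is a generic Viterbi-decoding sketch for a tiny HMM. The states, observations, and probability tables are invented for the example; this is not DRAGON's actual model or algorithm.

```python
# Generic Viterbi decoding for a small HMM (assumed states and probabilities):
# recover the most probable hidden state sequence from observations.
from math import log

STATES = ["phone_a", "phone_b"]
START = {"phone_a": 0.6, "phone_b": 0.4}                     # p(first state)
TRANS = {"phone_a": {"phone_a": 0.7, "phone_b": 0.3},        # p(next state | current state)
         "phone_b": {"phone_a": 0.4, "phone_b": 0.6}}
EMIT = {"phone_a": {"lo": 0.8, "hi": 0.2},                   # p(observation | state)
        "phone_b": {"lo": 0.3, "hi": 0.7}}

def viterbi(observations):
    """Return the most probable hidden state sequence for the observations."""
    # best[s] = (log-probability, path) of the best path ending in state s
    best = {s: (log(START[s]) + log(EMIT[s][observations[0]]), [s]) for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # pick the predecessor state that maximizes the path probability
            score, path = max(
                (best[p][0] + log(TRANS[p][s]) + log(EMIT[s][obs]), best[p][1] + [s])
                for p in STATES)
            new_best[s] = (score, path)
        best = new_best
    return max(best.values())[1]

print(viterbi(["lo", "lo", "hi", "hi"]))
```

In DRAGON's hierarchy the same idea is applied at each level, with the observations of one level generated by the hidden entities of the level above.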
HMMs for some words

26/34
HARPY
• HARPY was a second system produced at CMU under DARPA's
speech understanding research effort.
• HARPY combined some of the ideas of HEARSAY-I and DRAGON
• HARPY could handle a vocabulary of 1,011 words.
• Instead of using a grammar with the conventional syntactic categories such as Noun, Adjective, and so on, HARPY used what is called a “semantic grammar”
• Example: an SR system able to answer questions about, and to retrieve documents from, a database containing abstracts of AI papers
Semantic grammar: categories such as Topic, Author, Year, and Publisher that were semantically related to data about AI papers
• HARPY's grammar was limited to handle just the set of sentences about authors and papers that HARPY was supposed to be able to recognize.
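Purely as a hypothetical illustration of what a semantic grammar over domain categories might look like (the rules, categories, and vocabulary below are invented, not HARPY's actual grammar):

```python
# Hypothetical, heavily simplified semantic grammar: rules use domain categories
# (QUERY, TOPIC, AUTHOR, YEAR) rather than syntactic ones (Noun, Verb, ...).
GRAMMAR = {
    "QUERY":  [["list", "papers", "about", "TOPIC"],
               ["list", "papers", "by", "AUTHOR", "since", "YEAR"]],
    "TOPIC":  [["speech"], ["vision"], ["planning"]],
    "AUTHOR": [["reddy"], ["newell"]],
    "YEAR":   [["1970"], ["1975"]],
}

def matches(symbols, words):
    """Return True if the word list can be derived from the symbol list."""
    if not symbols:
        return not words
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:                                  # nonterminal: try each expansion
        return any(matches(exp + rest, words) for exp in GRAMMAR[head])
    return bool(words) and words[0] == head and matches(rest, words[1:])

print(matches(["QUERY"], "list papers about speech".split()))         # True
print(matches(["QUERY"], "list papers by reddy since 1975".split()))  # True
print(matches(["QUERY"], "wreck a nice beach".split()))               # False
```

Restricting the grammar to the domain in this way is what let HARPY reject most acoustically plausible but meaningless word strings.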
Harpy: Network of Phones
• HARPY combined all “knowledge sources” into a giant network of phones representing all the possible ways that syntactically legal sentences might be spoken.
• Each “phone node” in the network was paired with a representation of a segment of a speech waveform, called a “spectral template,” expected to be associated with that particular phone.
• These templates were obtained initially by having a speaker read about 700 sentences.
• They could be “tuned” for a new speaker by having the speaker read about 20 selected sentences during a “learning” session.
• HARPY's actual network had 15,000 nodes


Harpy: Search for words in Speech
• The observed speech waveform was first divided into variable-length segments, and a spectral template was computed for each of these segments.
• The recognition process then proceeded as follows:
• The template corresponding to the first segment in the speech waveform was compared against all the templates corresponding to the phones at the beginning of the network.
• The best few matches were noted, and the paths to these nodes were designated the “best one-step” partial paths.
• At the next stage, the template of the next waveform segment was compared against all those phone nodes reachable by extending the “best one-step” paths one more step.
• Using the values of the comparisons computed so far, a set of “best two-step” partial paths was identified.
• This process continued until the end of the network was reached.
• At that time, the very best path found so far could be associated with the words attached to the nodes along that path.
• This word sequence was then produced as HARPY's recognition decision.
Harpy: Beam Search
• HARPY's method of searching for a best path through the network can be compared with the A* heuristic search process.
• Whereas A* kept the entire search “frontier”, HARPY kept on its frontier only those nodes on the best few paths found so far.
• The number of nodes kept on the frontier was a parameter that could be set as needed to control the search.
• HARPY's designers called this technique “beam search” because the nodes visited by the search process were limited to a narrow beam through the network (a minimal sketch follows below).
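A minimal beam-search sketch, assuming an invented phone network and scoring function; it illustrates the pruning idea described above rather than HARPY's actual implementation.

```python
# Illustrative beam search over a tiny phone network (all data invented).
# Each node maps to its successor nodes; segment_score stands in for comparing
# a spectral template of the current segment against a phone node.
NETWORK = {
    "start": ["w1", "r1"],
    "w1": ["eh1"], "eh1": ["k1"], "k1": ["end"],      # one path through the network
    "r1": ["eh2"], "eh2": ["k2"], "k2": ["end"],      # an alternative path
    "end": [],
}

def segment_score(node, segment):
    """Assumed stand-in for a template-to-phone comparison score."""
    return 1.0 if node.startswith(segment) else 0.1

def beam_search(segments, beam_width=2):
    """Keep only the best `beam_width` partial paths at every step."""
    frontier = [(0.0, ["start"])]                      # (cumulative score, path)
    for segment in segments:
        candidates = []
        for score, path in frontier:
            for nxt in NETWORK[path[-1]]:
                candidates.append((score + segment_score(nxt, segment), path + [nxt]))
        # prune: this is the "narrow beam" through the network
        frontier = sorted(candidates, reverse=True)[:beam_width]
    return max(frontier) if frontier else None

print(beam_search(["w", "eh", "k", "e"]))
```

Increasing beam_width trades extra computation for a lower chance of pruning away the path that would have turned out best, which is exactly the control knob described above.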

30/34
