Automatic Speech Recognition
Automatic speech recognition
• What is the task?
• What are the main difficulties?
• How is it approached?
• Early history of ASR
• Dragon
• Harpy
What is the task?
• The process of converting an acoustic stream of speech input, as gathered by a microphone and associated electronic equipment, into a text representation of its component words
• Getting a computer to understand spoken language
• “Understand” might mean
  – React appropriately
  – Convert the input speech into another medium, e.g. text
How do humans do it?
Speech recognition comprises:
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
Speech Recognition Process
– The acoustic signal is the data
– Use various forms of signal processing to obtain spectral data
– Generate phonetic units to account for groups of data (phonemes, diphones, demisyllables, syllables, words)
– Combine phonetic units into syllables and words
– Possibly use syntax (and semantic knowledge) to ensure the words make sense (a toy sketch of this pipeline follows)
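As a rough illustration of these stages, the sketch below frames a digitized signal, summarizes each frame spectrally, labels frames with their nearest phone template, and looks the resulting phone string up in a pronunciation dictionary. The templates, three-band spectral summary, and one-word lexicon are toy placeholders, not values from any real system.

```python
import numpy as np

# Toy placeholders -- a real system would learn spectral templates
# from data and use a full pronunciation dictionary.
PHONE_TEMPLATES = {
    "k":  np.array([0.1, 0.9, 0.3]),
    "ae": np.array([0.8, 0.2, 0.1]),
    "t":  np.array([0.2, 0.8, 0.6]),
}
LEXICON = {("k", "ae", "t"): "cat"}

def spectral_frames(signal, frame_len=256, hop=128):
    """Slice the waveform into frames; summarize each as 3 band energies."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len]))
        bands = np.array_split(spectrum, 3)   # crude 3-band spectral summary
        frames.append(np.array([b.mean() for b in bands]))
    return frames

def nearest_phone(frame):
    """Label a frame with the closest phone template (normalized distance)."""
    f = frame / (np.linalg.norm(frame) + 1e-9)
    best, best_d = None, np.inf
    for phone, tmpl in PHONE_TEMPLATES.items():
        d = np.linalg.norm(f - tmpl / np.linalg.norm(tmpl))
        if d < best_d:
            best, best_d = phone, d
    return best

def recognize_word(signal):
    """Collapse repeated frame labels into a phone string, then look up the word."""
    labels = [nearest_phone(f) for f in spectral_frames(signal)]
    phones = tuple(p for i, p in enumerate(labels) if i == 0 or p != labels[i - 1])
    return LEXICON.get(phones)  # None if the phone string is out of vocabulary
```

Feeding a digitized waveform to recognize_word yields either a word or None; real systems replace nearest-template matching with statistical acoustic models, but the data flow is the same.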
Issues: Variability in individuals’ speech
• Variation among speakers due to
  – Vocal range
  – Voice quality (growl, whisper; physiological elements such as nasality, adenoidality, etc.)
  – ACCENT!!! (especially vowel systems, but also consonants, allophones, etc.)
• Variation within speakers due to
  – Health, emotional state
  – Ambient conditions
• Speech style: formal read vs. spontaneous
Issues: Speaker-(in)dependent systems
• Speaker-dependent systems
  – Require “training” to “teach” the system your individual idiosyncrasies
    • The more the merrier, but typically nowadays 5 or 10 minutes is enough
    • The user is asked to pronounce some key words which allow the computer to infer details of the user’s accent and voice
    • Fortunately, languages are generally systematic
  – More robust
  – But less convenient
  – And obviously less portable
• Speaker-independent systems
  – Language coverage is reduced to compensate for the need to be flexible in phoneme identification
  – A clever compromise is to learn on the fly
Issues: (Dis)continuous speech
• Discontinuous speech is much easier to recognize
  – Single words tend to be pronounced more clearly
• Continuous speech involves contextual coarticulation effects
  – Weak forms
  – Assimilation
  – Contractions
A Speech Waveform to a Sentence
1. Early speech recognition systems attempted to segment the speech waveform into its constituent phones and then to assemble the phones into words.
2. The speech signal was first digitized, and various parameters, such as the frequency or pitch, were extracted.
3. The waveform was then segmented into units containing phones.
4. Using dictionaries that associate the values of waveform parameters with phones, and phones with words, the waveform was finally converted into text.
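To make step 2 concrete, here is a minimal sketch of extracting one such parameter, the pitch, from a single frame via autocorrelation; the sample rate and search range are illustrative assumptions, and real front ends are considerably more robust.

```python
import numpy as np

def estimate_pitch(frame, sample_rate=16000, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame by autocorrelation."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)          # shortest plausible period
    hi = int(sample_rate / fmin)          # longest plausible period
    lag = lo + int(np.argmax(ac[lo:hi]))  # best-matching period in samples
    return sample_rate / lag              # period -> frequency in Hz

# Sanity check: a 200 Hz sine should come back as roughly 200 Hz.
sr = 16000
t = np.arange(1024) / sr
print(round(estimate_pitch(np.sin(2 * np.pi * 200 * t), sr)))  # ~200
```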
Disambiguating homophones
• Differences between some phonemes are sometimes very small
• Humans mostly resolve these differences through context and the need for the result to make sense
  It’s hard to wreck a nice beach
  What dime’s a neck’s drain to stop port?
• Systems can only recognize words that are in their lexicon, so limiting the lexicon is an obvious ploy
• Some ASR systems include a grammar which can help disambiguation (a related sketch follows below)
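The slide mentions a grammar; the sketch below uses a closely related device, a statistical bigram language model, to show how context scores can separate acoustically similar word strings. The probabilities are invented for illustration; a real model would be estimated from a corpus.

```python
import math

# Toy bigram log-probabilities (made up for illustration).
BIGRAM_LOGP = {
    ("to", "recognize"): math.log(0.020),
    ("recognize", "speech"): math.log(0.050),
    ("to", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.100),
    ("a", "nice"): math.log(0.050),
    ("nice", "beach"): math.log(0.020),
}
UNSEEN = math.log(1e-6)  # crude floor for unseen bigrams

def sequence_logp(words):
    """Sum bigram log-probabilities over a candidate word sequence."""
    return sum(BIGRAM_LOGP.get(pair, UNSEEN)
               for pair in zip(words, words[1:]))

# Two candidates the acoustics alone cannot separate:
candidates = [
    ["to", "recognize", "speech"],
    ["to", "wreck", "a", "nice", "beach"],
]
print(" ".join(max(candidates, key=sequence_logp)))  # "to recognize speech"
```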
History
Radio Rex
• A toy from 1922
• A dog mounted on an iron base, with an electromagnet to counteract the force of a spring that would push “Rex” out of his house
• The electromagnet was interrupted if an acoustic signal at 500 Hz was detected
• The vowel sound /eh/, as found in “Rex”, has energy at about 500 Hz
• Dog comes when called
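Rex’s mechanism amounts to a one-frequency detector. As a minimal software analogue, the sketch below measures energy at 500 Hz with the Goertzel algorithm; the sample rate and threshold are illustrative assumptions.

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Power of `samples` at one target frequency (Goertzel algorithm)."""
    w = 2 * math.pi * target_hz / sample_rate
    coeff = 2 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def rex_hears_name(samples, sample_rate=8000, threshold=1e4):
    """'Release the dog' if 500 Hz energy exceeds a (made-up) threshold."""
    return goertzel_power(samples, sample_rate, 500.0) > threshold

# A pure 500 Hz tone triggers Rex; a 2000 Hz tone does not.
sr = 8000
def tone(freq, n=800):
    return [math.sin(2 * math.pi * freq * k / sr) for k in range(n)]

print(rex_hears_name(tone(500)), rex_hears_name(tone(2000)))  # True False
```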
Isolated Speech Recognition
• Early speech recognition concentrated on isolated speech because continuous speech recognition was just not possible
  – Even today, accurate continuous speech recognition is extremely challenging
• Advantages of isolated speech recognition
  – The speech signal is pre-segmented: a distinct gap (silence) appears between words, so it is easy to determine the starting point and stopping point of each word
  – Co-articulation effects are minimized
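That silence gap is what makes segmentation easy. A minimal sketch, assuming a clean signal and a hand-picked energy threshold: frames whose energy exceeds the threshold are grouped into word segments.

```python
import numpy as np

def find_word_segments(signal, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of regions above an energy threshold."""
    n_frames = len(signal) // frame_len
    energies = [np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                for i in range(n_frames)]
    segments, start = [], None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i * frame_len                    # word onset
        elif e <= threshold and start is not None:
            segments.append((start, i * frame_len))  # word offset
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments

# Two "words" (tone bursts) separated by silence are found as two segments.
sr = 8000
t = np.arange(sr // 4) / sr
word = np.sin(2 * np.pi * 440 * t)
silence = np.zeros(sr // 4)
print(len(find_word_segments(np.concatenate([silence, word, silence, word]))))  # 2
```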
Early Speech Recognition
• In 1952, Bell Labs implemented a system for isolated digit recognition (for a single speaker) by comparing the formants of the speech signal to expected frequencies
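A toy version of that comparison might look like the following; the per-digit formant values are rough illustrative placeholders, not the original Bell Labs reference data.

```python
# (F1, F2) in Hz for each digit's main vowel -- illustrative values only.
DIGIT_FORMANTS = {
    "one":   (500, 900),    # /ah/-like vowel
    "two":   (300, 800),    # /uw/-like vowel
    "three": (300, 2300),   # /iy/-like vowel
}

def recognize_digit(f1, f2):
    """Pick the digit whose stored formant pair is closest to the measurement."""
    def distance(digit):
        tf1, tf2 = DIGIT_FORMANTS[digit]
        return (f1 - tf1) ** 2 + (f2 - tf2) ** 2
    return min(DIGIT_FORMANTS, key=distance)

print(recognize_digit(320, 2250))  # "three"
```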
HARPY
• HARPY was a second system produced at CMU under DARPA's speech understanding research effort.
• HARPY combined some of the ideas of HEARSAY-I and DRAGON.
• HARPY could handle a vocabulary of 1,011 words.
• Instead of using a grammar with the conventional syntactic categories such as Noun, Adjective, and so on, HARPY used what is called a “semantic grammar.”
• Example:
  Task: a speech-recognition system able to answer questions about, and to retrieve documents from, a database containing abstracts of AI papers
  Semantic grammar: categories such as Topic, Author, Year, and Publisher that were semantically related to data about AI papers
• HARPY's grammar was limited to handle just the set of sentences about authors and papers that HARPY was supposed to be able to recognize (a toy illustration follows below).
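The sketch below shows what a semantic grammar of this flavor might look like: the nonterminals are task-specific categories such as AUTHOR and TOPIC rather than Noun or Adjective. The categories, rules, and words are invented for illustration and are not HARPY's actual grammar.

```python
import itertools

# Toy semantic grammar: nonterminal -> list of alternative expansions.
GRAMMAR = {
    "QUERY":  [["give", "me", "PAPERS"], ["list", "PAPERS"]],
    "PAPERS": [["papers", "by", "AUTHOR"], ["papers", "about", "TOPIC"]],
    "AUTHOR": [["reddy"], ["newell"]],
    "TOPIC":  [["speech"], ["search"]],
}

def expand(symbol):
    """Yield every terminal word sequence the symbol can derive."""
    if symbol not in GRAMMAR:  # terminal word
        yield [symbol]
        return
    for rule in GRAMMAR[symbol]:
        parts = [list(expand(s)) for s in rule]
        for combo in itertools.product(*parts):
            yield [w for part in combo for w in part]

sentences = list(expand("QUERY"))
print(len(sentences))             # 8 legal sentences in this toy grammar
print(" ".join(sentences[0]))     # "give me papers by reddy"
```

Because the grammar only derives sentences about authors and papers, it sharply constrains what word sequences the recognizer needs to consider.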
Harpy: Network of Phones
• HARPY combined all “knowledge sources” into a giant network of phones representing all the possible ways that syntactically legal sentences might be spoken.
• Each “phone node” in the network was paired with a representation of a segment of a speech waveform, called a “spectral template,” expected to be associated with that particular phone.
• These templates were obtained initially by having a speaker read about 700 sentences.
• They could be “tuned” for a new speaker by having the speaker read about 20 selected sentences during a “learning” session.
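To illustrate decoding over such a network: each node carries a spectral template, and recognition is a search for the path whose templates best match the observed frames; HARPY pruned this search with what became known as beam search. The tiny network, template vectors, and one-network-step-per-frame simplification below are illustrative placeholders, not HARPY's data.

```python
import numpy as np

# Phone network: node -> (spectral template, successor nodes). "END" is final.
NETWORK = {
    "h":  (np.array([0.9, 0.1]), ["eh"]),
    "eh": (np.array([0.2, 0.8]), ["l"]),
    "l":  (np.array([0.5, 0.5]), ["ow", "p"]),
    "ow": (np.array([0.1, 0.9]), ["END"]),   # "h eh l ow" ~ hello
    "p":  (np.array([0.8, 0.2]), ["END"]),   # "h eh l p"  ~ help
}

def decode(frames, start="h", beam_width=3):
    """Beam search: keep the best few (cost, node, path) hypotheses per frame.

    Simplification: the start node is fixed and each frame advances exactly
    one step through the network (no self-loops).
    """
    beam = [(0.0, start, [start])]
    for frame in frames[1:]:
        candidates = []
        for cost, node, path in beam:
            for nxt in NETWORK[node][1]:
                if nxt == "END":
                    continue
                c = cost + np.linalg.norm(frame - NETWORK[nxt][0])
                candidates.append((c, nxt, path + [nxt]))
        beam = sorted(candidates, key=lambda x: x[0])[:beam_width]
    return min(beam, key=lambda x: x[0])[2]

frames = [np.array(v) for v in ([0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9])]
print(decode(frames))  # ['h', 'eh', 'l', 'ow']
```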