Speech Recognition
Definition
Speech recognition is the process of converting
an acoustic signal, captured by a microphone or
a telephone, to a set of words.
The recognised words can be an end in
themselves, as in applications such as
command and control, data entry, and document
preparation.
They can also serve as the input to further
linguistic processing in order to achieve speech
understanding.
Speech Processing
Signal processing:
Convert the audio wave into a sequence of feature vectors
Speech recognition:
Decode the sequence of feature vectors into a sequence of
words
Semantic interpretation:
Determine the meaning of the recognized words
Dialog Management:
Correct errors and help get the task done
Response Generation:
What words to use to maximize user understanding
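The stages above form a pipeline. The sketch below is a minimal, hypothetical illustration of how such a pipeline could be composed; all function names, types, and toy return values are assumptions for illustration, not the API of any particular system.

```python
# A minimal, hypothetical sketch of the processing pipeline described above.
# All names and toy return values are illustrative assumptions.

def signal_processing(audio_samples):
    """Convert the audio wave into a sequence of feature vectors (toy: fixed frames)."""
    frame_size = 160                     # e.g. 10 ms at 16 kHz
    return [audio_samples[i:i + frame_size]
            for i in range(0, len(audio_samples), frame_size)]

def speech_recognition(feature_vectors):
    """Decode feature vectors into words (toy: pretend we always hear one command)."""
    return ["show", "flights", "to", "boston"]

def semantic_interpretation(words):
    """Determine the meaning of the recognized words (toy: slot filling)."""
    return {"intent": "list_flights", "destination": words[-1]}

def dialog_management(meaning):
    """Decide what to do in response, and correct errors if needed."""
    return {"action": "lookup", "destination": meaning["destination"]}

def response_generation(action):
    """Choose words that maximize user understanding."""
    return f"Here are the flights to {action['destination'].title()}."

if __name__ == "__main__":
    audio = [0.0] * 16000                # one second of silence, stand-in for real audio
    print(response_generation(
        dialog_management(
            semantic_interpretation(
                speech_recognition(
                    signal_processing(audio))))))
```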
Dialog Management
Goal: determine what to accomplish in response
to user utterances, e.g.:
Information access
airline schedules, stock quotes, directory
assistance
Problem solving
travel planning, logistics
ATIS system
air travel information retrieval
context management
Speech Recognition
Figure out what a person is saying.
Speaker Verification
Authenticate that a person is who she/he
claims to be.
Limited speech patterns
Speaker Identification
Assigns an identity to the voice of an
unknown person.
Arbitrary speech patterns
Spontaneous vs. Scripted
Spontaneous speech contains disfluencies
and periods of pause and restart, and is
much more difficult to recognise than
speech read from a script.
Enrolment
Some systems require speaker enrolment:
a user must provide samples of his or her
speech before using them. Other systems
are said to be speaker-independent, in that
no enrolment is necessary.
Perplexity
One popular measure of the difficulty of
the task, combining the vocabulary size
and the language model, is perplexity.
It is loosely defined as the geometric mean
of the number of words that can follow a
word after the language model has been
applied (Zue, Cole, and Ward, 1995).
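More formally, for a test sequence of N words, perplexity is the inverse probability of the sequence normalised by its length. The LaTeX below gives the standard formulation (added here for clarity; it is not quoted from the cited source):

```latex
% Standard definition of perplexity for a word sequence w_1 ... w_N
% under a language model P.
PP(w_1, \ldots, w_N)
  = P(w_1, \ldots, w_N)^{-\frac{1}{N}}
  = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})} \right)^{\frac{1}{N}}
```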
Properties of Recognizers
Summary
Speaker Independent vs. Speaker Dependent
Large Vocabulary (2K-200K words) vs. Limited
Vocabulary (2-200)
Continuous vs. Discrete
Speech Recognition vs. Speech Verification
Real Time vs. multiples of real time
Continued
Spontaneous Speech vs. Read Speech
Noisy Environment vs. Quiet Environment
High Resolution Microphone vs. Telephone vs.
Cellphone
Push-and-hold vs. push-to-talk vs. always-listening
Adapt to speaker vs. non-adaptive
Low vs. High Latency
With online incremental results vs. final results
Speaker Models
Speaker Dependent
Speaker Independent
Speaker Adaptive
A TIMELINE OF SPEECH
RECOGNITION
1870s Alexander Graham Bell invented the telephone while
trying to develop a speech recognition system for deaf
people.
1936AT&T's Bell Labs produced the first electronic
speech synthesizer called the Voder (Dudley, Riesz and
Watkins).
This machine was demonstrated in the 1939 World Fairs
by experts that used a keyboard and foot pedals to play
the machine and emit speech.
1969John Pierce of Bell Labs said automatic speech
recognition will not be a reality for several decades
because it requires artificial intelligence.
Early 70s
Early 1970s The Hidden Markov Modeling
(HMM) approach to speech recognition was
invented by Lenny Baum of Princeton University
and shared with several ARPA (Advanced
Research Projects Agency) contractors, including
IBM.
HMM is a complex mathematical pattern-matching
strategy that was eventually adopted
by all the leading speech recognition companies,
including Dragon Systems, IBM, Philips, AT&T
and others.
70+
80+
1982 Covox founded. The company brought digital sound (via
The Voice Master, Sound Master and The Speech Thing)
to the Commodore 64, Atari 400/800, and finally to the
IBM PC in the mid-80s.
1982 Dragon Systems was founded by speech
industry pioneers Drs. Jim and Janet Baker. Dragon
Systems is well known for its long history of speech and
language technology innovations and its large patent
portfolio.
1984 SpeechWorks, the leading provider of over-the-telephone
automated speech recognition (ASR)
solutions, was founded.
90s
95+
1997 Dragon introduced NaturallySpeaking, the first
"continuous speech" dictation software available
(meaning users no longer needed to pause between words
for the computer to understand what they were saying).
1998 Lernout & Hauspie bought Kurzweil. Microsoft
invested $45 million in Lernout & Hauspie to form a
partnership that would eventually allow Microsoft to use
their speech recognition technology in its own systems.
1999 Microsoft acquired Entropic, giving Microsoft
access to what was then known as the "most accurate speech
recognition system".
2000
2000 Lernout & Hauspie acquired Dragon Systems
for approximately $460 million.
2000 TellMe introduced the first worldwide voice
portal.
2000 NetBytel launched the world's first voice
enabler, which included an online ordering
application with real-time Internet integration for
Office Depot.
2000s
2001 ScanSoft closed its acquisition of Lernout
& Hauspie's speech and language assets.
2003 ScanSoft shipped Dragon
NaturallySpeaking 7 Medical, lowering
healthcare costs through highly accurate
speech recognition.
2003 ScanSoft closed a deal to distribute and
support IBM ViaVoice desktop products.
Signal Variability
Speech recognition is a difficult problem, largely because
of the many sources of variability associated with the
signal.
The acoustic realisations of phonemes, the
smallest sound units of which words are
composed, are highly dependent on the context in which
they appear.
These phonetic variations are exemplified by the acoustic
differences of the phoneme /t/ in two, true, and butter in
English.
At word boundaries, contextual variations can be quite
dramatic: devo andare sounds like devandare in
Italian.
More
Acoustic variability can result from changes in
the environment as well as in the position and
characteristics of the transducer.
Within-speaker variability can result from
changes in the speaker's physical and emotional
state, speaking rate, or voice quality.
Differences in socio-linguistic background,
dialect, and vocal tract size and shape can
contribute to across-speaker variability.
Articulation
The science of articulation is concerned with how
phonemes are produced. The focus of articulation is on
the vocal apparatus of the throat, mouth and nose, where
the sounds are produced.
The phonemes themselves need to be classified; the
system most often used in speech recognition is the
ARPABET (Rabiner and Juang, 1993). The ARPABET
was created in the 1970s by and for contractors working
on speech processing for the Advanced Research
Projects Agency of the U.S. Department of Defence.
ARPABET
Consonant Classification
Acoustics
Articulation provides valuable information
about how speech sounds are produced,
but a speech recognition system cannot
analyse movements of the mouth.
Instead, the data source for speech
recognition is the stream of speech itself.
This is an analogue signal: a sound
stream, a continuous flow of sound
waves and silence.
Human Hearing
The human ear can detect frequencies from 20Hz to
20,000Hz but it is most sensitive in the critical frequency
range, 1000Hz to 6000Hz, (Ghitza, 1994).
Research has shown that humans
do not process individual frequencies in isolation.
Instead, we hear groups of frequencies, such as formant
patterns, as cohesive units, and we are capable of
distinguishing them from surrounding sound patterns
(Carrell and Opie, 1992).
This capability, called auditory object formation, or
auditory image formation, helps explain how humans can
discern the speech of individual people at cocktail parties
and separate a voice from noise over a poor telephone
channel (Markowitz, 1995).
Pre-processing Speech
Like all sounds, speech is an analogue
waveform. In order for a recognition system to
act on speech, it must be
represented in a digital manner.
All noise patterns, silences and co-articulation
effects must be captured.
This is accomplished by digital signal
processing. The way the analogue speech is
processed is one of the most complex elements
of a Speech Recognition system.
Recognition Accuracy
To achieve high recognition accuracy, the
speech representation process should
(Markowitz, 1995):
Include all critical data.
Remove redundancies.
Remove noise and distortion.
Avoid introducing new distortions.
Signal Representation
In statistically based automatic speech
recognition, the speech waveform is sampled at
a rate between 6.6 kHz and 20 kHz and
processed to produce a new representation as a
sequence of vectors containing values of what
are generally called parameters.
The vectors typically comprise between 10 and
20 parameters, and are usually computed every
10 or 20 milliseconds.
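As an illustration of this framing, the sketch below slices a waveform sampled at 16 kHz into 25 ms windows every 10 ms and computes a small vector of log band energies per frame using only numpy. It is an assumed, simplified front end for illustration, not the representation used by any particular recogniser.

```python
import numpy as np

# Illustrative front-end sketch: overlapping frames plus a small vector of
# log band energies per frame. Simplified; real systems use richer features.

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D waveform into overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples (10 ms)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

def log_band_energies(frames, n_bands=13):
    """Rough spectral features: log energy in n_bands equal-width FFT bands."""
    window = np.hanning(frames.shape[1])
    spectra = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    bands = np.array_split(spectra, n_bands, axis=1)
    energies = np.stack([band.sum(axis=1) for band in bands], axis=1)
    return np.log(energies + 1e-10)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                       # one second of audio
    audio = np.sin(2 * np.pi * 440 * t)          # a 440 Hz tone as stand-in speech
    frames = frame_signal(audio, sr)             # one frame every 10 ms
    features = log_band_energies(frames)         # 13 parameters per frame
    print(frames.shape, features.shape)          # (n_frames, 400) (n_frames, 13)
```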
Parameter Values
These parameter values are then used in
succeeding stages in the estimation of the
probability that the portion of waveform just
analysed corresponds to a particular phonetic
event that occurs in the phone-sized or whole-word
reference unit being hypothesised.
In practice, the representation and the
probability estimation interact strongly: what one
person sees as part of the representation
another may see as part of the probability
estimation process.
Emotional State
Representations aim to preserve the information
needed to determine the phonetic identity of a
portion of speech while being as impervious as
possible to factors such as speaker differences,
effects introduced by communications channels,
and paralinguistic factors such as the emotional
state of the speaker.
They also aim to be as compact as possible.
Template Matching
Template matching is the oldest and least effective method.
It is a form of pattern recognition.
It was the dominant technology in the 1950s and 1960s.
Each word or phrase in an application is stored as a
template.
The user input is also arranged into templates at the
word level and the best match with a system template is
found.
Although template matching is currently in decline as the
basic approach to recognition, it has been adapted for
use in word-spotting applications. It also remains the
primary technology applied to speaker verification
(Moore, 1982).
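As a concrete illustration of matching input speech against stored templates, the sketch below uses dynamic time warping (DTW), a classic distance measure for this style of recogniser. It is a generic textbook example with made-up feature sequences, not the algorithm of any specific system cited here.

```python
import numpy as np

# Illustrative template matcher using dynamic time warping (DTW), a classic
# distance for comparing feature-vector sequences of different lengths.

def dtw_distance(a, b):
    """Minimum cumulative frame-to-frame distance between sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])     # local distance
            cost[i, j] = d + min(cost[i - 1, j],        # skip a template frame
                                 cost[i, j - 1],        # skip an input frame
                                 cost[i - 1, j - 1])    # match frames
    return cost[n, m]

def recognise(utterance, templates):
    """Return the label of the stored template closest to the input utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    templates = {"yes": rng.normal(size=(20, 13)),      # toy feature sequences
                 "no": rng.normal(size=(15, 13))}
    spoken = templates["yes"] + 0.1 * rng.normal(size=(20, 13))   # noisy "yes"
    print(recognise(spoken, templates))                 # expected: "yes"
```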
Acoustic-Phonetic Recognition
Acoustic-phonetic recognition functions at the
phoneme level. It is an attractive approach to
speech recognition as it limits the number of representations
that must be stored: English has about
forty discernible phonemes, no matter how large
the vocabulary (Markowitz, 1995).
Acoustic-phonetic recognition involves three
steps:
Feature Extraction.
Segmentation and Labelling.
Word-Level recognition.
Stochastic Processing
The term stochastic refers to the process of making a
sequence of non-deterministic selections from among a
set of alternatives.
They are non-deterministic because the choices made during
the recognition process are governed by the
characteristics of the input rather than being specified in advance
(Markowitz, 1995).
Like template matching, stochastic processing requires
the creation and storage of models of each of the items
that will be recognised.
It is based on a series of complex statistical or
probabilistic analyses. These statistics are stored in a
network-like structure called a Hidden Markov Model
(HMM), (Paul, 1990).
HMM
A Hidden Markov Model is made up of states and
transitions. Each state of an HMM holds statistics for a
segment of a word, describing the values and variations that are
found in the model of that word segment. The transitions
allow for speech variations such as:
The prolonging of a word segment, which would cause
several recursive (self-loop) transitions in the recogniser.
The omission of a word segment, which would cause a
transition that skips a state.
Stochastic processing using Hidden Markov Models is
accurate, flexible, and capable of being fully automated
(Rabiner and Juang, 1986).
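To make the statistics-in-states idea concrete, the sketch below implements the standard forward algorithm for a tiny discrete HMM. It is a generic textbook example with made-up probabilities, not the model of any product or paper mentioned above.

```python
import numpy as np

# Generic textbook forward algorithm for a small discrete HMM.
# The probabilities are made up for illustration only.

def forward(observations, initial, transition, emission):
    """Probability of an observation sequence under the HMM (forward algorithm)."""
    alpha = initial * emission[:, observations[0]]           # alpha_1(i)
    for obs in observations[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(obs)
        alpha = (alpha @ transition) * emission[:, obs]
    return alpha.sum()

if __name__ == "__main__":
    initial = np.array([0.6, 0.4])                            # start-state probabilities
    transition = np.array([[0.7, 0.3],                        # self-loops model prolonged
                           [0.0, 1.0]])                       # word segments
    emission = np.array([[0.9, 0.1],                          # P(symbol | state)
                         [0.2, 0.8]])
    print(forward([0, 0, 1], initial, transition, emission))  # likelihood of "0 0 1"
```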
Neural networks
"if speech recognition systems could learn speech
knowledge automatically and represent this knowledge
in a parallel distributed fashion for rapid evaluation
such a system would mimic the function of the human
brain, which consists of several billion simple, inaccurate
and slow processors that perform reliable speech
processing", (Waibel and Hampshire, 1989).
An artificial neural network is a computer program that
attempts to emulate the biological functions of the human
brain. Neural networks are excellent classification systems, and
have been effective with noisy, patterned, variable data
streams containing multiple, overlapping, interacting and
incomplete cues (Markowitz, 1995).
Auditory Models
The aim of auditory models is to allow a speech
recognition system to screen out all noise from the
signal and concentrate on the central speech
pattern, in a similar way to the human brain.
Auditory modelling offers the promise of being
able to develop robust Speech Recognition
systems that are capable of working in difficult
environments.
Currently, it is purely an experimental
technology.
Performance of Speech Recognition Systems
Performance of speech recognition systems is typically
described in terms of word error rate: the number of
substitutions, deletions and insertions needed to turn the
system's output into the reference transcription, divided by the
number of words in the reference. The error types are:
Deletion: the loss of a word within the original speech.
The system outputs "A E I U" while the input was "A E I
O U".
Substitution: the replacement of an element of the input,
such as a word, with another. The system outputs "song"
while the input was "long".
Insertion: the system adds an element, such as a word,
when none was input. The system outputs
"A E I O U" while the input was "A E I U".
Some links
The following are links to major speech
recognition sites and demos.
Telephone Demos
Nuance
http://www.nuance.com
Banking: 1-650-847-7438
Travel Planning: 1-650-847-7427
Stock Quotes: 1-650-847-7423
SpeechWorks
http://www.speechworks.com/demos/demos.htm
Banking: 1-888-729-3366
Stock Trading: 1-800-786-2571
IBM http://www-3.ibm.com/software/speech/
Mutual Funds, Name Dialing: 1-877-VIAVOICE