Speech Recognition
Definition
Speech recognition is the process of converting
an acoustic signal, captured by a microphone or
a telephone, to a set of words.
The recognised words can be an end in
themselves, as in applications such as
command and control, data entry, and document
preparation.
They can also serve as the input to further
linguistic processing in order to achieve speech
understanding.
Speech Processing
Signal processing:
Convert the audio wave into a sequence of feature vectors
Speech recognition:
Decode the sequence of feature vectors into a sequence of
words
Semantic interpretation:
Determine the meaning of the recognized words
Dialog Management:
Correct errors and help get the task done
Response Generation:
What words to use to maximize user understanding
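The stages above form a pipeline. The sketch below is a minimal, hypothetical illustration of how such a pipeline could be composed; all function names, types, and toy return values are assumptions for illustration, not the API of any particular system.

```python
# A minimal, hypothetical sketch of the processing pipeline described above.
# All names and toy return values are illustrative assumptions.

def signal_processing(audio_samples):
    """Convert the audio wave into a sequence of feature vectors (toy: fixed frames)."""
    frame_size = 160                     # e.g. 10 ms at 16 kHz
    return [audio_samples[i:i + frame_size]
            for i in range(0, len(audio_samples), frame_size)]

def speech_recognition(feature_vectors):
    """Decode feature vectors into words (toy: pretend we always hear one command)."""
    return ["show", "flights", "to", "boston"]

def semantic_interpretation(words):
    """Determine the meaning of the recognized words (toy: slot filling)."""
    return {"intent": "list_flights", "destination": words[-1]}

def dialog_management(meaning):
    """Decide what to do in response, and correct errors if needed."""
    return {"action": "lookup", "destination": meaning["destination"]}

def response_generation(action):
    """Choose words that maximize user understanding."""
    return f"Here are the flights to {action['destination'].title()}."

if __name__ == "__main__":
    audio = [0.0] * 16000                # one second of silence, stand-in for real audio
    print(response_generation(
        dialog_management(
            semantic_interpretation(
                speech_recognition(
                    signal_processing(audio))))))
```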
Dialog Management
Goal: determine what to accomplish in response
to user utterances, e.g.:
Information access
airline schedules, stock quotes, directory
assistance
Problem solving
travel planning, logistics
ATIS system
air travel information retrieval
context management
Speech Recognition
Figure out what a person is saying.
Speaker Verification
Authenticate that a person is who she/he
claims to be.
Limited speech patterns
Speaker Identification
Assigns an identity to the voice of an
unknown person.
Arbitrary speech patterns
Spontaneous vs. Scripted
Spontaneous speech contains disfluencies
and periods of pause and restart, and is
much more difficult to recognise than
speech read from a script.
Enrolment
Some systems require speaker enrolment:
a user must provide samples of his or her
speech before using them. Other systems
are said to be speaker-independent, in that
no enrolment is necessary.
Perplexity
One popular measure of the difficulty of
the task, combining the vocabulary size
and the language model, is perplexity.
It is loosely defined as the geometric mean
of the number of words that can follow a
word after the language model has been
applied (Zue, Cole, and Ward, 1995).
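More formally, for a test sequence of N words, perplexity is the inverse probability of the sequence normalised by its length. The LaTeX below gives the standard formulation (added here for clarity; it is not quoted from the cited source):

```latex
% Standard definition of perplexity for a word sequence w_1 ... w_N
% under a language model P.
PP(w_1, \ldots, w_N)
  = P(w_1, \ldots, w_N)^{-\frac{1}{N}}
  = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})} \right)^{\frac{1}{N}}
```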
Properties of Recognizers
Summary
Speaker Independent vs. Speaker Dependent
Large Vocabulary (2K-200K words) vs. Limited
Vocabulary (2-200)
Continuous vs. Discrete
Speech Recognition vs. Speech Verification
Real Time vs. multiples of real time
Continued
Spontaneous Speech vs. Read Speech
Noisy Environment vs. Quiet Environment
High Resolution Microphone vs. Telephone vs.
Cellphone
Push-and-hold vs. push-to-talk vs. always-listening
Adapt to speaker vs. non-adaptive
Low vs. High Latency
With online incremental results vs. final results
Speaker Models
Speaker Dependent
Speaker Independent
Speaker Adaptive
A TIMELINE OF SPEECH
RECOGNITION
1870s Alexander Graham Bell invented the telephone while
trying to develop a speech recognition system for deaf
people.
1936AT&T's Bell Labs produced the first electronic
speech synthesizer called the Voder (Dudley, Riesz and
Watkins).
This machine was demonstrated in the 1939 World Fairs
by experts that used a keyboard and foot pedals to play
the machine and emit speech.
1969John Pierce of Bell Labs said automatic speech
recognition will not be a reality for several decades
because it requires artificial intelligence.
Early 70s
Early 1970s The Hidden Markov Modeling
(HMM) approach to speech recognition was
invented by Lenny Baum of Princeton University
and shared with several ARPA (Advanced
Research Projects Agency) contractors, including
IBM.
HMM is a complex mathematical pattern-matching
strategy that was eventually adopted
by all the leading speech recognition companies,
including Dragon Systems, IBM, Philips, AT&T
and others.
70+
80+
1982 Covox founded. The company brought digital sound (via
The Voice Master, Sound Master and The Speech Thing)
to the Commodore 64, Atari 400/800, and finally to the
IBM PC in the mid-80s.
1982 Dragon Systems was founded by speech
industry pioneers Drs. Jim and Janet Baker. Dragon
Systems is well known for its long history of speech and
language technology innovations and its large patent
portfolio.
1984 SpeechWorks, the leading provider of over-the-telephone
automated speech recognition (ASR)
solutions, was founded.
90s
95+
1997 Dragon introduced NaturallySpeaking, the first
"continuous speech" dictation software available
(meaning users no longer needed to pause between words
for the computer to understand what they were saying).
1998 Lernout & Hauspie bought Kurzweil. Microsoft
invested $45 million in Lernout & Hauspie to form a
partnership that would eventually allow Microsoft to use
their speech recognition technology in its own systems.
1999 Microsoft acquired Entropic, giving Microsoft
access to what was then known as the "most accurate speech
recognition system".
2000
2000 Lernout & Hauspie acquired Dragon Systems
for approximately $460 million.
2000 TellMe introduced the first worldwide voice
portal.
2000 NetBytel launched the world's first voice
enabler, which included an online ordering
application with real-time Internet integration for
Office Depot.
2000s
2001 ScanSoft closed its acquisition of Lernout
& Hauspie's speech and language assets.
2003 ScanSoft shipped Dragon
NaturallySpeaking 7 Medical, lowering
healthcare costs through highly accurate
speech recognition.
2003 ScanSoft closed a deal to distribute and
support IBM ViaVoice desktop products.
Signal Variability
Speech recognition is a difficult problem, largely because
of the many sources of variability associated with the
signal.
The acoustic realisations of phonemes, the
smallest sound units of which words are
composed, are highly dependent on the context in which
they appear.
These phonetic variations are exemplified by the acoustic
differences of the phoneme /t/ in two, true, and butter in
English.
At word boundaries, contextual variations can be quite
dramatic: devo andare sounds like devandare in
Italian.
More
Acoustic variability can result from changes in
the environment as well as in the position and
characteristics of the transducer.
Within-speaker variability can result from
changes in the speaker's physical and emotional
state, speaking rate, or voice quality.
Differences in socio-linguistic background,
dialect, and vocal tract size and shape can
contribute to across-speaker variability.
Articulation
The science of articulation is concerned with how
phonemes are produced. The focus of articulation is on
the vocal apparatus of the throat, mouth and nose, where
the sounds are produced.
The phonemes themselves need to be classified; the
system most often used in speech recognition is the
ARPABET (Rabiner and Juang, 1993). The ARPABET
was created in the 1970s by and for contractors working
on speech processing for the Advanced Research
Projects Agency of the U.S. Department of Defence.
ARPABET
Consonant Classification
Acoustics
Articulation provides valuable information
about how speech sounds are produced,
but a speech recognition system cannot
analyse movements of the mouth.
Instead, the data source for speech
recognition is the stream of speech itself.
This is an analogue signal: a sound
stream, a continuous flow of sound
waves and silence.
Human Hearing
The human ear can detect frequencies from 20Hz to
20,000Hz but it is most sensitive in the critical frequency
range, 1000Hz to 6000Hz, (Ghitza, 1994).
Research has shown that humans
do not process individual frequencies in isolation.
Instead, we hear groups of frequencies, such as formant
patterns, as cohesive units, and we are capable of
distinguishing them from surrounding sound patterns
(Carrell and Opie, 1992).
This capability, called auditory object formation, or
auditory image formation, helps explain how humans can
discern the speech of individual people at cocktail parties
and separate a voice from noise over a poor telephone
channel (Markowitz, 1995).
Pre-processing Speech
Like all sounds, speech is an analogue
waveform. In order for a recognition system to
act on speech, it must be
represented in a digital manner.
All noise patterns, silences and co-articulation
effects must be captured.
This is accomplished by digital signal
processing. The way the analogue speech is
processed is one of the most complex elements
of a Speech Recognition system.
Recognition Accuracy
To achieve high recognition accuracy, the
speech representation process should
(Markowitz, 1995):
Include all critical data.
Remove redundancies.
Remove noise and distortion.
Avoid introducing new distortions.
Signal Representation
In statistically based automatic speech
recognition, the speech waveform is sampled at
a rate between 6.6 kHz and 20 kHz and
processed to produce a new representation as a
sequence of vectors containing values of what
are generally called parameters.
The vectors typically comprise between 10 and
20 parameters, and are usually computed every
10 or 20 milliseconds.
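As an illustration of this framing, the sketch below slices a waveform sampled at 16 kHz into 25 ms windows every 10 ms and computes a small vector of log band energies per frame using only numpy. It is an assumed, simplified front end for illustration, not the representation used by any particular recogniser.

```python
import numpy as np

# Illustrative front-end sketch: overlapping frames plus a small vector of
# log band energies per frame. Simplified; real systems use richer features.

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D waveform into overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples (10 ms)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

def log_band_energies(frames, n_bands=13):
    """Rough spectral features: log energy in n_bands equal-width FFT bands."""
    window = np.hanning(frames.shape[1])
    spectra = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    bands = np.array_split(spectra, n_bands, axis=1)
    energies = np.stack([band.sum(axis=1) for band in bands], axis=1)
    return np.log(energies + 1e-10)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                       # one second of audio
    audio = np.sin(2 * np.pi * 440 * t)          # a 440 Hz tone as stand-in speech
    frames = frame_signal(audio, sr)             # one frame every 10 ms
    features = log_band_energies(frames)         # 13 parameters per frame
    print(frames.shape, features.shape)          # (n_frames, 400) (n_frames, 13)
```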
Parameter Values
These parameter values are then used in
succeeding stages in the estimation of the
probability that the portion of waveform just
analysed corresponds to a particular phonetic
event that occurs in the phone-sized or whole-word
reference unit being hypothesised.
In practice, the representation and the
probability estimation interact strongly: what one
person sees as part of the representation
another may see as part of the probability
estimation process.
Emotional State
Representations aim to preserve the information
needed to determine the phonetic identity of a
portion of speech while being as impervious as
possible to factors such as speaker differences,
effects introduced by communications channels,
and paralinguistic factors such as the emotional
state of the speaker.
They also aim to be as compact as possible.
Template Matching
Template matching is the oldest and least effective method.
It is a form of pattern recognition.
It was the dominant technology in the 1950s and 1960s.
Each word or phrase in an application is stored as a
template.
The user input is also arranged into templates at the
word level and the best match with a system template is
found.
Although template matching is currently in decline as the
basic approach to recognition, it has been adapted for
use in word-spotting applications. It also remains the
primary technology applied to speaker verification
(Moore, 1982).
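As a concrete illustration of matching input speech against stored templates, the sketch below uses dynamic time warping (DTW), a classic distance measure for this style of recogniser. It is a generic textbook example with made-up feature sequences, not the algorithm of any specific system cited here.

```python
import numpy as np

# Illustrative template matcher using dynamic time warping (DTW), a classic
# distance for comparing feature-vector sequences of different lengths.

def dtw_distance(a, b):
    """Minimum cumulative frame-to-frame distance between sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])     # local distance
            cost[i, j] = d + min(cost[i - 1, j],        # skip a template frame
                                 cost[i, j - 1],        # skip an input frame
                                 cost[i - 1, j - 1])    # match frames
    return cost[n, m]

def recognise(utterance, templates):
    """Return the label of the stored template closest to the input utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    templates = {"yes": rng.normal(size=(20, 13)),      # toy feature sequences
                 "no": rng.normal(size=(15, 13))}
    spoken = templates["yes"] + 0.1 * rng.normal(size=(20, 13))   # noisy "yes"
    print(recognise(spoken, templates))                 # expected: "yes"
```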
Acoustic-Phonetic Recognition
Acoustic-phonetic recognition functions at the
phoneme level. It is an attractive approach to
speech recognition as it limits the number of representations
that must be stored: English has about
forty discernible phonemes, no matter how large
the vocabulary (Markowitz, 1995).
Acoustic-phonetic recognition involves three
steps:
Feature Extraction.
Segmentation and Labelling.
Word-Level recognition.
Stochastic Processing
The term stochastic refers to the process of making a
sequence of non-deterministic selections from among a
set of alternatives.
They are non-deterministic because the choices made during
the recognition process are governed by the
characteristics of the input rather than being specified in advance
(Markowitz, 1995).
Like template matching, stochastic processing requires
the creation and storage of models of each of the items
that will be recognised.
It is based on a series of complex statistical or
probabilistic analyses. These statistics are stored in a
network-like structure called a Hidden Markov Model
(HMM), (Paul, 1990).
HMM
A Hidden Markov Model is made up of states and
transitions. Each state of an HMM holds statistics for a
segment of a word, describing the values and variations that are
found in the model of that word segment. The transitions
allow for speech variations such as:
The prolonging of a word segment, which would cause
several recursive (self-loop) transitions in the recogniser.
The omission of a word segment, which would cause a
transition that skips a state.
Stochastic processing using Hidden Markov Models is
accurate, flexible, and capable of being fully automated
(Rabiner and Juang, 1986).
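To make the statistics-in-states idea concrete, the sketch below implements the standard forward algorithm for a tiny discrete HMM. It is a generic textbook example with made-up probabilities, not the model of any product or paper mentioned above.

```python
import numpy as np

# Generic textbook forward algorithm for a small discrete HMM.
# The probabilities are made up for illustration only.

def forward(observations, initial, transition, emission):
    """Probability of an observation sequence under the HMM (forward algorithm)."""
    alpha = initial * emission[:, observations[0]]           # alpha_1(i)
    for obs in observations[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(obs)
        alpha = (alpha @ transition) * emission[:, obs]
    return alpha.sum()

if __name__ == "__main__":
    initial = np.array([0.6, 0.4])                            # start-state probabilities
    transition = np.array([[0.7, 0.3],                        # self-loops model prolonged
                           [0.0, 1.0]])                       # word segments
    emission = np.array([[0.9, 0.1],                          # P(symbol | state)
                         [0.2, 0.8]])
    print(forward([0, 0, 1], initial, transition, emission))  # likelihood of "0 0 1"
```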
Neural networks
"if speech recognition systems could learn speech
knowledge automatically and represent this knowledge
in a parallel distributed fashion for rapid evaluation
such a system would mimic the function of the human
brain, which consists of several billion simple, inaccurate
and slow processors that perform reliable speech
processing", (Waibel and Hampshire, 1989).
An artificial neural network is a computer program that
attempts to emulate the biological functions of the human
brain. Neural networks are excellent classification systems, and
have been effective with noisy, patterned, variable data
streams containing multiple, overlapping, interacting and
incomplete cues (Markowitz, 1995).
Auditory Models
The aim of auditory models is to allow a speech
recognition system to screen out all noise from the
signal and concentrate on the central speech
pattern, in a similar way to the human brain.
Auditory modelling offers the promise of being
able to develop robust Speech Recognition
systems that are capable of working in difficult
environments.
Currently, it is purely an experimental
technology.
Performance of Speech Recognition Systems
Performance of speech recognition systems is typically
described in terms of word error rate: the number of
substitutions, deletions and insertions needed to turn the
system's output into the reference transcription, divided by the
number of words in the reference. The error types are:
Deletion: the loss of a word within the original speech.
The system outputs "A E I U" while the input was "A E I
O U".
Substitution: the replacement of an element of the input,
such as a word, with another. The system outputs "song"
while the input was "long".
Insertion: the system adds an element, such as a word,
when none was input. The system outputs
"A E I O U" while the input was "A E I U".
Some links
The following are links to major speech
recognition sites and demos.
Telephone Demos
Nuance
http://www.nuance.com
Banking: 1-650-847-7438
Travel Planning: 1-650-847-7427
Stock Quotes: 1-650-847-7423
SpeechWorks
http://www.speechworks.com/demos/demos.htm
Banking: 1-888-729-3366
Stock Trading: 1-800-786-2571
IBM http://www-3.ibm.com/software/speech/
Mutual Funds, Name Dialing: 1-877-VIAVOICE