Speech Recognition


Definition
Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words.
The recognised words can be an end in themselves, as for applications such as command and control, data entry, and document preparation.
They can also serve as the input to further linguistic processing in order to achieve speech understanding.

Speech Processing
Signal processing:
Convert the audio wave into a sequence of feature vectors

Speech recognition:
Decode the sequence of feature vectors into a sequence of
words

Semantic interpretation:
Determine the meaning of the recognized words

Dialog Management:
Correct errors and help get the task done

Response generation:
What words to use to maximize user understanding

Speech synthesis (Text to Speech):
Generate synthetic speech from a marked-up word string
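These stages form a pipeline, each feeding the next. A minimal sketch of how such a pipeline might be wired together; the function names, dummy features and canned responses below are illustrative placeholders rather than any particular system's API:

```python
# Hypothetical end-to-end pipeline: each stage is a stub standing in
# for a real component (signal processing, recogniser, parser, etc.).

def extract_features(audio_samples):
    # Signal processing: audio wave -> sequence of feature vectors
    return [[sum(audio_samples[i:i + 160])] for i in range(0, len(audio_samples), 160)]

def decode(feature_vectors):
    # Speech recognition: feature vectors -> word sequence (stubbed)
    return ["what", "is", "the", "weather"]

def interpret(words):
    # Semantic interpretation: words -> meaning representation
    return {"intent": "weather_query"} if "weather" in words else {"intent": "unknown"}

def manage_dialog(meaning):
    # Dialog management: decide what to do next
    return "report_weather" if meaning["intent"] == "weather_query" else "ask_clarification"

def generate_response(action):
    # Response generation: choose the words for the user
    return "Here is today's forecast." if action == "report_weather" else "Could you repeat that?"

if __name__ == "__main__":
    audio = [0.0] * 16000          # one second of silence as dummy input
    feats = extract_features(audio)
    words = decode(feats)
    print(generate_response(manage_dialog(interpret(words))))
```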

Dialog Management
Goal: determine what to accomplish in response to user utterances, e.g.:
Answer user question
Solicit further information
Confirm/clarify user utterance
Notify invalid query
Notify invalid query and suggest alternative
Interface between user/language processing components and the system knowledge base

What you can do with Speech Recognition
Transcription
dictation, information retrieval
Command and control
data entry, device control, navigation, call routing

Information access
airline schedules, stock quotes, directory
assistance

Problem solving
travel planning, logistics

Transcription and Dictation
Transcription is transforming a stream of human speech into computer-readable form
Medical reports, court proceedings, notes
Indexing (e.g., broadcasts)
Dictation is the interactive composition of text
Reports, correspondence, etc.

Speech recognition and understanding
Sphinx system
speaker-independent
continuous speech
large vocabulary

ATIS system
air travel information retrieval
context management

Speech Recognition and Call Centres
Automate services, lower payroll
Shorten time on hold
Shorten agent and client call time
Reduce fraud
Improve customer service

Applications related to Speech Recognition
Speech Recognition
Figure out what a person is saying.
Speaker Verification
Authenticate that a person is who she/he
claims to be.
Limited speech patterns
Speaker Identification
Assigns an identity to the voice of an
unknown person.
Arbitrary speech patterns

Many kinds of Speech Recognition Systems
Speech recognition systems can be characterised by many parameters.
An isolated-word (discrete) speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not.

Spontaneous vs. Scripted
Spontaneous speech contains disfluencies, periods of pause and restart, and is much more difficult to recognise than speech read from a script.

Enrolment
Some systems require speaker enrolment: a user must provide samples of his or her speech before using the system. Other systems are said to be speaker-independent, in that no enrolment is necessary.

Large vs. small vocabularies
Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large with many similar-sounding words.
When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.
The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly.
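As an illustration, such a finite-state network can be written down as an explicit table of permissible successors. The sketch below uses an invented mini-vocabulary and transition table to check whether a word sequence is allowed:

```python
# Toy finite-state language model: for each word (or the start symbol),
# list the words that are permitted to follow it.
SUCCESSORS = {
    "<s>":     {"show", "list"},
    "show":    {"flights", "fares"},
    "list":    {"flights"},
    "flights": {"to", "</s>"},
    "fares":   {"to", "</s>"},
    "to":      {"boston", "denver"},
    "boston":  {"</s>"},
    "denver":  {"</s>"},
}

def allowed(words):
    """Return True if the word sequence is accepted by the network."""
    prev = "<s>"
    for w in words + ["</s>"]:
        if w not in SUCCESSORS.get(prev, set()):
            return False
        prev = w
    return True

print(allowed(["show", "flights", "to", "boston"]))  # True
print(allowed(["show", "boston"]))                   # False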

Perplexity
One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity.
It is loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (Zue, Cole, and Ward, 1995).
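Under that loose definition, perplexity can be estimated as the geometric mean of the per-word branching factors. A small worked example, with purely illustrative successor counts:

```python
import math

# Number of words permitted to follow each word in some hypothetical
# finite-state grammar (illustrative counts only).
branching = {"<s>": 2, "show": 2, "flights": 2, "to": 2, "boston": 1}

# Geometric mean of the branching factors = exp(mean of log counts)
perplexity = math.exp(sum(math.log(n) for n in branching.values()) / len(branching))
print(round(perplexity, 2))  # about 1.74 for these counts
```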

Finally, some external parameters can affect speech recognition system performance. These include the characteristics of the environmental noise and the type and placement of the microphone.

Properties of Recognizers
Summary
Speaker Independent vs. Speaker Dependent
Large Vocabulary (2K-200K words) vs. Limited
Vocabulary (2-200)
Continuous vs. Discrete
Speech Recognition vs. Speech Verification
Real Time vs. multiples of real time

Continued
Spontaneous Speech vs. Read Speech
Noisy Environment vs. Quiet Environment
High Resolution Microphone vs. Telephone vs.
Cellphone
Push-and-hold vs. push-to-talk vs. always-listening
Adapt to speaker vs. non-adaptive
Low vs. High Latency
With online incremental results vs. final results
Dialog Management

Features That Distinguish Products & Applications
Words, phrases, and grammar
Models of the speakers
Speech flow

Vocabulary: How many words
How you add new words
Grammars
Branching Factor (Perplexity)
Available languages

Systems are also defined by Users
Different kinds of users
One-time vs. frequent users
Homogeneity
Technical sophistication
Different users call for different speaker models

Speaker Models
Speaker Dependent
Speaker Independent
Speaker Adaptive

Sample Market: Call Centers
Automate services, lower payroll
Shorten time on hold
Shorten agent and client call time
Reduce fraud
Improve customer service

A TIMELINE OF SPEECH RECOGNITION
1870s: Alexander Graham Bell invents the telephone while trying to develop a speech recognition system for deaf people.
1936: AT&T's Bell Labs produced the first electronic speech synthesizer, called the Voder (Dudley, Riesz and Watkins).
This machine was demonstrated at the 1939 World's Fair by experts who used a keyboard and foot pedals to play the machine and emit speech.
1969: John Pierce of Bell Labs said automatic speech recognition would not be a reality for several decades because it requires artificial intelligence.

Early 70s
Early 1970s: The Hidden Markov Modeling (HMM) approach to speech recognition was invented by Lenny Baum of Princeton University and shared with several ARPA (Advanced Research Projects Agency) contractors, including IBM.
HMM is a complex mathematical pattern-matching strategy that was eventually adopted by all the leading speech recognition companies, including Dragon Systems, IBM, Philips, AT&T and others.

70+
1971: DARPA (Defense Advanced Research Projects Agency) established the Speech Understanding Research (SUR) program to develop a computer system that could understand continuous speech.
Lawrence Roberts, who initiated the program, spent $3 million per year of government funds for 5 years. Major SUR project groups were established at CMU, SRI, MIT's Lincoln Laboratory, Systems Development Corporation (SDC), and Bolt, Beranek and Newman (BBN). It was the largest speech recognition project ever.
1978: The popular toy "Speak and Spell" by Texas Instruments was introduced. Speak and Spell used a speech chip, which led to huge strides in the development of more human-like digital synthesis sound.

80+
1982: Covox founded. The company brought digital sound (via The Voice Master, Sound Master and The Speech Thing) to the Commodore 64, Atari 400/800, and finally to the IBM PC in the mid 80s.
1982: Dragon Systems was founded by speech industry pioneers Drs. Jim and Janet Baker. Dragon Systems is well known for its long history of speech and language technology innovations and its large patent portfolio.
1984: SpeechWorks, the leading provider of over-the-telephone automated speech recognition (ASR) solutions, was founded.

90s
1993: Covox sells its products to Creative Labs, Inc.
1995: Dragon released discrete word dictation-level speech recognition software. It was the first time dictation speech recognition technology was available to consumers. IBM and Kurzweil followed a few months later.
1996: Charles Schwab is the first company to devote resources towards developing a speech recognition IVR system with Nuance. The program, Voice Broker, allows for up to 360 simultaneous customers to call in and get quotes on stocks and options... it handles up to 50,000 requests each day. The system was found to be 95% accurate and set the stage for other companies such as Sears, Roebuck and Co., United Parcel Service of America Inc., and E*Trade Securities to follow in their footsteps.
1996: BellSouth launches the world's first voice portal, called Val and later Info By Voice.

95+
1997: Dragon introduced "NaturallySpeaking", the first "continuous speech" dictation software available (meaning you no longer need to pause between words for the computer to understand what you're saying).
1998: Lernout & Hauspie bought Kurzweil. Microsoft invested $45 million in Lernout & Hauspie to form a partnership that would eventually allow Microsoft to use their speech recognition technology in its systems.
1999: Microsoft acquired Entropic, giving Microsoft access to what was then known as the "most accurate speech recognition system".

2000
2000: Lernout & Hauspie acquired Dragon Systems for approximately $460 million.
2000: TellMe introduces the first worldwide voice portal.
2000: NetBytel launched the world's first voice enabler, which includes an online ordering application with real-time Internet integration for Office Depot.

2000s
2001: ScanSoft closes acquisition of Lernout & Hauspie speech and language assets.
2003: ScanSoft ships Dragon NaturallySpeaking 7 Medical, lowering healthcare costs through highly accurate speech recognition.
2003: ScanSoft closes deal to distribute and support IBM ViaVoice desktop products.

Signal Variability
Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal.
The acoustic realisations of phonemes, the recognition system's smallest sound units of which words are composed, are highly dependent on the context in which they appear.
These phonetic variables are exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in English.
At word boundaries, contextual variations can be quite dramatic: devo andare sounds like devandare in Italian.

More
Acoustic variability can result from changes in
the environment as well as in the position and
characteristics of the transducer.
Within-speaker variability can result from
changes in the speaker's physical and emotional
state, speaking rate, or voice quality.
Differences in socio-linguistic background,
dialect, and vocal tract size and shape can
contribute to across-speaker variability.

What is a speech recognition system?
Speech recognition is generally used as a human-computer interface for other software. When it functions in this role, three primary tasks need to be performed:
Pre-processing, the conversion of spoken input into a form the recogniser can process.
Recognition, the identification of what has been said.
Communication, to send the recognised input to the application that requested it.

How is pre-processing performed?
To understand how the first of these functions is performed, we must examine:
Articulation, the production of the sound.
Acoustics, the stream of the speech itself.
Auditory perception, what characterises the ability to understand spoken input.

Articulation
The science of articulation is concerned with how phonemes are produced. The focus of articulation is on the vocal apparatus of the throat, mouth and nose, where the sounds are produced.
The phonemes themselves need to be classified; the system most often used in speech recognition is the ARPABET (Rabiner and Juang, 1993). The ARPABET was created in the 1970s by and for contractors working on speech processing for the Advanced Research Projects Agency of the U.S. Department of Defense.

ARPABET
Like most phoneme classifications, the ARPABET separates consonants from vowels.
Consonants are characterised by a total or partial blockage of the vocal tract.
Vowels are characterised by strong harmonic patterns and relatively free passage of air through the vocal tract.
Semi-vowels, such as the y in you, fall between consonants and vowels.
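For illustration, here is a tiny pronunciation table using standard ARPABET symbols (as popularised by the CMU Pronouncing Dictionary) for words mentioned in this section:

```python
# A small ARPABET pronunciation table (standard CMU-style symbols).
ARPABET = {
    "two":    ["T", "UW"],
    "true":   ["T", "R", "UW"],
    "butter": ["B", "AH", "T", "ER"],
    "you":    ["Y", "UW"],     # Y is the semi-vowel mentioned above
}

for word, phones in ARPABET.items():
    print(f"{word}: {' '.join(phones)}")
```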

Consonant Classification
Consonant classification uses the:
Point of articulation.
Manner of articulation.
Presence or absence of voicing.

Acoustics
Articulation provides valuable information about how speech sounds are produced, but a speech recognition system cannot analyse movements of the mouth.
Instead, the data source for speech recognition is the stream of speech itself.
This is an analogue signal: a sound stream, a continuous flow of sound waves and silence.

Important Features (Acoustics)
Four important features of the acoustic analysis of speech are (Carter, 1984):
Frequency, the number of vibrations per second a sound produces.
Amplitude, the loudness of the sound.
Harmonic structure, the other frequencies added to the fundamental frequency of a sound that contribute to its quality or timbre.
Resonance.
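A minimal sketch, assuming NumPy, of how two of these features, fundamental frequency and amplitude, might be estimated from a short stretch of signal; the toy signal's harmonics stand in for the harmonic structure of real speech:

```python
import numpy as np

fs = 16000                                   # sampling rate in Hz
t = np.arange(0, 0.05, 1 / fs)               # 50 ms of signal
# A toy "voiced" signal: 120 Hz fundamental plus two weaker harmonics.
x = (1.0 * np.sin(2 * np.pi * 120 * t)
     + 0.4 * np.sin(2 * np.pi * 240 * t)
     + 0.2 * np.sin(2 * np.pi * 360 * t))

spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

f0 = freqs[np.argmax(spectrum)]              # strongest peak ~ fundamental frequency
amplitude = np.sqrt(np.mean(x ** 2))         # RMS amplitude ~ loudness
print(f"estimated F0: {f0:.0f} Hz, RMS amplitude: {amplitude:.2f}")
```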

Auditory perception, hearing speech
"Phonemes tend to be abstractions that are implicitly defined by the pronunciation of the words in the language. In particular, the acoustic realisation of a phoneme may heavily depend on the acoustic context in which it occurs. This effect is usually called co-articulation" (Ney, 1994).
The way a phoneme is pronounced can be affected by its position in a word, neighbouring phonemes and even the word's position in a sentence. This effect is called the co-articulation effect.
The variability in the speech signal caused by co-articulation and other sources makes speech analysis very difficult.

Human Hearing
The human ear can detect frequencies from 20 Hz to 20,000 Hz, but it is most sensitive in the critical frequency range of 1000 Hz to 6000 Hz (Ghitza, 1994).
Research has shown that humans do not process individual frequencies in isolation.
Instead, we hear groups of frequencies, such as formant patterns, as cohesive units, and we are capable of distinguishing them from surrounding sound patterns (Carrell and Opie, 1992).
This capability, called auditory object formation, or auditory image formation, helps explain how humans can discern the speech of individual people at cocktail parties and separate a voice from noise over a poor telephone channel (Markowitz, 1995).

Pre-processing Speech
Like all sounds, speech is an analogue waveform. In order for a recognition system to perform actions on speech, it must be represented digitally.
All noise patterns, silences and co-articulation effects must be captured.
This is accomplished by digital signal processing. The way the analogue speech is processed is one of the most complex elements of a speech recognition system.

Recognition Accuracy
To achieve high recognition accuracy, the speech representation process should (Markowitz, 1995):
Include all critical data.
Remove redundancies.
Remove noise and distortion.
Avoid introducing new distortions.

Signal Representation
In statistically based automatic speech
recognition, the speech waveform is sampled at
a rate between 6.6 kHz and 20 kHz and
processed to produce a new representation as a
sequence of vectors containing values of what
are generally called parameters.
The vectors typically comprise between 10 and
20 parameters, and are usually computed every
10 or 20 milliseconds.
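A minimal sketch, assuming NumPy, of this front-end step: a waveform sampled at 16 kHz is cut into 25 ms frames every 10 ms, and each frame is reduced to a 12-value vector of log band energies. Real front ends typically use mel-spaced filterbanks or MFCCs; the linear bands here are a simplification:

```python
import numpy as np

fs = 16000                      # sampling rate (Hz)
frame_len = int(0.025 * fs)     # 25 ms analysis window
hop = int(0.010 * fs)           # new vector every 10 ms
n_params = 12                   # parameters per vector

signal = np.random.randn(fs)    # stand-in for one second of speech

vectors = []
for start in range(0, len(signal) - frame_len + 1, hop):
    frame = signal[start:start + frame_len] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frame)) ** 2
    # Collapse the power spectrum into n_params linear bands (simplified;
    # real front ends usually use mel-spaced filters).
    bands = np.array_split(power, n_params)
    vectors.append(np.log([b.sum() + 1e-10 for b in bands]))

vectors = np.array(vectors)
print(vectors.shape)            # roughly (98, 12): one 12-value vector every 10 ms
```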

Parameter Values
These parameter values are then used in
succeeding stages in the estimation of the
probability that the portion of waveform just
analysed corresponds to a particular phonetic
event that occurs in the phone-sized or whole-word reference unit being hypothesised.
In practice, the representation and the
probability estimation interact strongly: what one
person sees as part of the representation
another may see as part of the probability
estimation process.

Emotional State
Representations aim to preserve the information
needed to determine the phonetic identity of a
portion of speech while being as impervious as
possible to factors such as speaker differences,
effects introduced by communications channels,
and paralinguistic factors such as the emotional
state of the speaker.
They also aim to be as compact as possible.

Representations used in current speech recognisers concentrate primarily on properties of the speech signal attributable to the shape of the vocal tract rather than to the excitation, whether generated by a vocal-tract constriction or by the larynx.
Representations are sensitive to whether the
vocal folds are vibrating or not (the
voiced/unvoiced distinction), but try to ignore
effects due to variations in their frequency of
vibration.

Future Improvements in Speech Representation
The vast majority of major commercial and experimental systems use representations akin to those described here.
However, in striving to develop better representations, wavelet transforms (Daubechies, 1990) are being explored, and neural network methods are being used to provide non-linear operations on log spectral representations.

Work continues on representations more closely reflecting auditory properties (Greenberg, 1988) and on representations reconstructing articulatory gestures from the speech signal (Schroeter & Sondhi, 1994).
The articulatory approach is attractive because it holds out the promise of a small set of smoothly varying parameters that could deal in a simple and principled way with the interactions that occur between neighbouring phonemes and with the effects of differences in speaking rate and of carefulness of enunciation.

The ultimate challenge is to match the superior performance of human listeners over automatic recognisers.
This superiority is especially marked when there is little material to allow adaptation to the voice of the current speaker, and when the acoustic conditions are difficult.
The fact that it persists even when nonsense words are used shows that it exists at least partly at the acoustic/phonetic level and cannot be explained purely by superior language modelling in the brain.
It confirms that there is still much to be done in developing better representations of the speech signal (Rabiner and Schafer, 1978; Hunt, 1993).

Signal Recognition Technologies
Signal recognition methodologies fall into four categories; most systems will apply one or more in the conversion process.

Template Matching
Template matching is the oldest and least effective method. It is a form of pattern recognition.
It was the dominant technology in the 1950s and 1960s.
Each word or phrase in an application is stored as a template.
The user input is also arranged into templates at the word level, and the best match with a system template is found.
Although template matching is currently in decline as the basic approach to recognition, it has been adapted for use in word-spotting applications. It also remains the primary technology applied to speaker verification (Moore, 1982).
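Word-level template matching is commonly illustrated with dynamic time warping (DTW), which aligns the input against each stored template and picks the template with the lowest alignment cost. A minimal sketch, with made-up one-dimensional feature tracks standing in for real feature vectors:

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two feature sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the input
                                 cost[i][j - 1],      # stretch the template
                                 cost[i - 1][j - 1])  # match step for step
    return cost[len(a)][len(b)]

# Hypothetical word templates (1-D feature tracks for illustration only).
templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 4, 2, 1, 1]}
spoken = [1, 2, 5, 4, 1]   # features extracted from the user's input

best = min(templates, key=lambda w: dtw_distance(spoken, templates[w]))
print(best)
```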

Acoustic-Phonetic Recognition
Acoustic-phonetic recognition functions at the phoneme level. It is an attractive approach to speech recognition as it limits the number of representations that must be stored: in English there are about forty discernible phonemes, no matter how large the vocabulary (Markowitz, 1995).
Acoustic phonetic recognition involves three
steps,
Feature Extraction.
Segmentation and Labelling.
Word-Level recognition.

Acoustic-phonetic recognition supplanted template matching in the early 1970s.
The successful ARPA SUR systems highlighted potential benefits of this approach. Unfortunately, acoustic phonetics was at the time a poorly researched area and many of the expected advances failed to materialise.

The high degree of acoustic similarity among phonemes, combined with phoneme variability resulting from the co-articulation effect and other sources, creates uncertainty with regard to potential phoneme labels (Cole, 1986).
If these problems can be overcome, there is certainly an opportunity for this technology to play a part in future speech recognition systems.

Stochastic Processing
The term stochastic refers to the process of making a
sequence of non-deterministic selections from among a
set of alternatives.
They are non-deterministic because the choices during
the recognition process are governed by the
characteristics of the input and not specified in advance,
(Markowitz, 1995).
Like template matching, stochastic processing requires
the creation and storage of models of each of the items
that will be recognised.
It is based on a series of complex statistical or
probabilistic analyses. These statistics are stored in a
network-like structure called a Hidden Markov Model
(HMM), (Paul, 1990).

HMM
A Hidden Markov Model is made up of states and transitions. Each state of a HMM holds statistics for a segment of a word, which describe the values and variations that are found in the model of that word segment. The transitions allow for speech variations such as:
The prolonging of a word segment, which causes several recursive (self-loop) transitions in the recogniser.
The omission of a word segment, which causes a transition that skips a state.
Stochastic processing using Hidden Markov Models is accurate, flexible, and capable of being fully automated (Rabiner and Juang, 1986).
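A minimal numerical sketch, assuming NumPy, of a three-state left-to-right HMM for one hypothetical word: each state has a self-loop for prolonged segments and a skip transition for omitted segments, and the forward algorithm scores an observation sequence against the model. All probabilities are invented for illustration:

```python
import numpy as np

# Left-to-right HMM with 3 states for a hypothetical word.
# Rows = from-state, columns = to-state.
A = np.array([[0.6, 0.3, 0.1],   # self-loop, next state, skip a state
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
# Emission probabilities for 2 discrete observation symbols per state.
B = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])
pi = np.array([1.0, 0.0, 0.0])   # always start in the first state

def forward(obs):
    """Probability of the observation sequence under the model."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward([0, 0, 1, 1, 0]))  # likelihood of one observation sequence
```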

Neural networks
"if speech recognition systems could learn speech
knowledge automatically and represent this knowledge
in a parallel distributed fashion for rapid evaluation
such a system would mimic the function of the human
brain, which consists of several billion simple, inaccurate
and slow processors that perform reliable speech
processing", (Waibel and Hampshire, 1989).
An artificial neural network is a computer program which attempts to emulate the biological functions of the human brain. Neural networks are excellent classification systems, and have been effective with noisy, patterned, variable data streams containing multiple, overlapping, interacting and incomplete cues (Markowitz, 1995).

Neural networks do not require the complete specification of a problem, learning instead through exposure to large amounts of example data. Neural networks comprise an input layer, one or more hidden layers, and one output layer. The way in which the nodes and layers of a network are organised is called the network's architecture.
The allure of neural networks for speech recognition lies in their superior classification abilities.
Considerable effort has been directed towards the development of networks to do word, syllable and phoneme classification.
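A minimal sketch, assuming NumPy, of the architecture just described: an input layer of acoustic parameters, one hidden layer, and an output layer with one node per phoneme class. The weights are random, so this shows only the structure and the forward pass, not a trained recogniser:

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_phonemes = 12, 32, 40   # e.g. ~40 English phonemes

# Randomly initialised weights: a real system would train these on
# labelled speech frames (e.g. with backpropagation).
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_phonemes))
b2 = np.zeros(n_phonemes)

def classify_frame(features):
    hidden = np.tanh(features @ W1 + b1)          # hidden layer
    scores = hidden @ W2 + b2                     # output layer
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # softmax over phoneme classes
    return int(np.argmax(probs))                  # index of the most likely phoneme

frame = rng.normal(size=n_inputs)                 # one feature vector (stand-in)
print(classify_frame(frame))
```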

Auditory Models
The aim of auditory models is to allow a speech recognition system to screen all noise from the signal and concentrate on the central speech pattern, in a similar way to the human brain.
Auditory modelling offers the promise of being able to develop robust speech recognition systems that are capable of working in difficult environments.
Currently, it is purely an experimental technology.

Performance of Speech Recognition Systems
Performance of speech recognition systems is typically described in terms of word error rate, which combines three kinds of errors:
Deletion: the loss of a word within the original speech. The system outputs "A E I U" while the input was "A E I O U".
Substitution: the replacement of an element of the input, such as a word, with another. The system outputs "song" while the input was "long".
Insertion: the system adds an element, such as a word, when no word was spoken. The system outputs "A E I O U" while the input was "A E I U".

Speech Recognition as Assistive Technology
Main use is as an alternative hands-free data entry mechanism
Very effective
Much faster than switch access
Mainstream technology
Used in many applications where hands are needed for other things, e.g. mobile phone while driving, in surgical theatres

Dictation is a big part of office administration, and commercial speech recognition systems are targeted at this market.

Some interesting facts
Switch access users who were at around 5 words per minute achieved around 80 words per minute with SR
This allowed them to do state exams
SR can be used for environmental control systems around the home, e.g. "Open curtains"

People with speech impairment (dysarthric speech) have shown improved articulation after using SR systems, especially discrete systems

Reasons why SR may fail some people
Crowded room - cannot have everyone talking at once
Too many errors because all noises, coughs, throat clearances etc. are picked up
Speech not good enough to use it
Not enough training
Cognitive overhead too much for some people

Too demanding physically - hard work to talk for a long time
Cannot be bothered with initial enrolment
Drinking - adversely affects vocal cords
Smoking, shouting, dry mouth and illness all affect the vocal tract
Need to drink water
Room must not be too stuffy

Some links
The following are links to major speech recognition resources

Carnegie Mellon Speech Demos
CMU Communicator
Call: 1-877-CMU-PLAN (268-7526), also 268-5144, or x8-1084
The information is accurate; you can use it for your own travel planning
CMU Universal Speech Interface (USI)
CMU Movie Line
Seems to be about apartments now
Call: (412) 268-1185

Telephone Demos
Nuance

http://www.nuance.com
Banking: 1-650-847-7438
Travel Planning: 1-650-847-7427
Stock Quotes: 1-650-847-7423

SpeechWorks
http://www.speechworks.com/demos/demos.htm
Banking: 1-888-729-3366
Stock Trading: 1-800-786-2571

MIT Spoken Language Systems Laboratory
http://www.sls.lcs.mit.edu/sls/whatwedo/applications.html
Travel Plans (Pegasus): 1-877-648-8255
Weather (Jupiter): 1-888-573-8255

IBM http://www-3.ibm.com/software/speech/
Mutual Funds, Name Dialing: 1-877-VIAVOICE
