
Introduction to

Statistical Speech Recognition


Lecture 1

CS 753
Instructor: Preethi Jyothi
Course Plan (I)

• Cascaded ASR System
- Acoustic Model (AM)
- Pronunciation Model (PM)
- Language Model (LM)

• Weighted Finite State Transducers for ASR

• AM: HMMs, DNN and RNN-based models

• PM: Phoneme and Grapheme-based models

• LM: Ngram models (+smoothing), RNNLMs

• Decoding Algorithms, Lattices

[Figure 1.1: A standard automatic speech recognition architecture: the speech waveform undergoes acoustic analysis to produce per-frame acoustic features, which a decoder transcribes by combining an acoustic model, a pronunciation model (e.g., good: g uh d, like: l ay k, is: ih z) and an n-gram language model, yielding output such as "good prose is like a windowpane"]
Course Plan (II)

• End-to-end Neural Models for ASR
- CTC loss function
- Encoder-decoder Architectures with Attention

• Speaker Adaptation

• Speech Synthesis

• Recent Generative Models (GANs, VAEs) for Speech Processing

Check www.cse.iitb.ac.in/~pjyothi/cs753 for latest updates

Moodle will be used for assignment/project-related submissions and all announcements

[Fig. 1: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding the input sequence x into high-level features h; the speller is an attention-based decoder generating the output sequence y]
Image from: Chan et al., Listen, Attend and Spell: A NN for LVCSR, ICASSP 2016
Other Course Info
• Teaching Assistants (TAs):
- Vinit Unni (vinit AT cse)
- Saiteja Nalla (saitejan AT cse)
- Naman Jain (namanjain AT cse)

• TA office hours: Wednesdays, 10 am to 12 pm (tentative)

• Instructor 1-1: Email me to schedule a time

• Readings:
- No fixed textbook. “Speech and Language Processing” by Jurafsky and Martin serves
as a good starting point.
- All further readings will be posted online.

• Audit requirements: Complete all assignments/quizzes and score ≥ 40%


Course Evaluation

• 3 Assignments OR 2 Assignments + 1 Quiz 35%

• At least one programming assignment
- Set up an ASR system based on a recipe & improve said recipe

• Midsem Exam + Final Exam 15% + 25%

• Final Project 20%

• Participation 5%

Attendance Policy? Strongly advised to attend lectures.
Also, participation points hinge on it.
Academic Integrity Policy
Assignments/Exams

• Always cite your sources (be it images, papers or existing code repos).
Follow proper citation guidelines.

• Unless specifically permitted, collaborations are not allowed.

• Do not copy or plagiarise. Will incur significant penalties.


Final Project
• Projects can be on any topic related to speech/audio processing.
Check website for abstracts from a previous offering.

• No individual projects and no more than 3 members in a team.

• Preliminary Project Evaluation (SEP 1-7): Short report detailing project statement,
goals, specific tasks and preliminary experiments

• Final Evaluation: NOV 7-14

- Presentation (Oral or poster session, depending on final class strength)


- Report (Use ML conference style files & provide details about the project)

• Excellent Projects:
- Will earn extra credit that counts towards the final grade
- Can be turned into a research paper
#1: Speech-driven Facial Animation

https://arxiv.org/pdf/1906.06337.pdf, June 2019


Videos from: https://sites.google.com/view/facial-animation
#2: Speech2Gesture

https://arxiv.org/abs/1906.04160, CVPR 2019


Image from: http://people.eecs.berkeley.edu/~shiry/projects/speech2gesture/
#3: Decoding Brain Signals Into Speech

https://www.nature.com/articles/s41586-019-1119-1, April 2019


Introduction to ASR
Automatic Speech Recognition

• Problem statement: Transform a spoken utterance into a sequence of tokens
(words, syllables, phonemes, characters)

• Many downstream applications of ASR. Examples:
- Speech understanding
- Spoken translation
- Audio information retrieval

• Speech exhibits variability at multiple levels: speaker style,
accents, room acoustics, microphone properties, etc.
History of ASR

RADIO REX (1922)


History of ASR

SHOEBOX (IBM, 1962)


[Timeline 1922-2012: 1 word, frequency detector (1922)]
History of ASR

HARPY (CMU, 1976)

[Timeline 1922-2012: 1 word, frequency detector (1922); 16 words, isolated word recognition (1962)]
History of ASR

HIDDEN MARKOV MODELS (1980s)

[Timeline 1922-2012: 1 word, frequency detector; 16 words, isolated word recognition; 1000 words, connected speech]
History of ASR

DEEP NEURAL NETWORK BASED SYSTEMS (>2010), e.g., Siri and Cortana

[Timeline 1922-2012: 1 word, frequency detector; 16 words, isolated word recognition; 1000 words, connected speech; 10K+ words, LVCSR systems]
How are ASR systems evaluated?
• Error rates computed on an unseen test set by comparing W* (decoded
sentence) against Wref (reference sentence) for each test utterance
- Sentence/Utterance error rate (trivial to compute!)
- Word/Phone error rate

• Word/Phone error rate (ER) uses the Levenshtein distance measure: what is the
minimum number of edits (insertions/deletions/substitutions)
required to convert W* to Wref?
On a test set with N instances:

ER = [ Σ_{j=1..N} (Ins_j + Del_j + Sub_j) ] / [ Σ_{j=1..N} ℓ_j ]

Ins_j, Del_j, Sub_j are the numbers of insertions/deletions/substitutions in the jth ASR output
ℓ_j is the total number of words/phones in the jth reference
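
A minimal sketch of this computation in Python (function names are illustrative, not from any particular toolkit):

```python
def edit_distance(ref, hyp):
    """Minimum insertions/deletions/substitutions turning hyp into ref."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]  # d[i][j]: ref[:i] vs hyp[:j]
    for i in range(m + 1):
        d[i][0] = i                            # delete i reference words
    for j in range(n + 1):
        d[0][j] = j                            # insert j hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[m][n]

def error_rate(refs, hyps):
    """ER = total edits over the test set / total reference length."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    length = sum(len(r.split()) for r in refs)
    return edits / length

print(error_rate(["good prose is like a windowpane"],
                 ["good rose is like windowpane"]))  # 2 edits / 6 words ≈ 0.33
```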
NIST STT Benchmark Test History

Remarkable progress in ASR in the last decade

[Figure: NIST benchmark test history, WER (%) on a log scale from 100% down to 1%, by year: Read Speech (1k, 5k, 20k vocabularies), Air Travel Planning Kiosk Speech, Broadcast Speech, Switchboard, Switchboard II and Switchboard Cellular conversational speech, CTS Fisher (UL), CTS Arabic and Mandarin (UL), Meeting Speech (IHM, SDM OV4, MDM OV4), News English (1X, 10X, unlimited), Noisy and Varied Microphone conditions, and non-English News (Arabic 10X, Mandarin 10X), through 2018]

Image from: http://www.itl.nist.gov/iad/mig/publications/ASRhistory/
Statistical Speech Recognition

Pioneer of ASR technology, Fred Jelinek (1932-2010): Cast ASR as a channel coding problem.

Let O be a sequence of acoustic features corresponding to a speech signal.
That is, O = {O1, …, OT}, where Oi ∈ ℝ^d refers to a d-dimensional
acoustic feature vector and T is the length of the sequence.

Let W denote a word sequence. An ASR decoder solves the following problem:

W* = arg max_W Pr(W | O)
   = arg max_W Pr(O | W) · Pr(W)

where the second line follows from Bayes' rule after dropping Pr(O), which does not
depend on W. Pr(O | W) is given by the acoustic model and Pr(W) by the language model.
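
In practice the decoder searches over word sequences using log probabilities, which turns the product into a sum. A toy illustration of the objective, with an invented two-hypothesis set and made-up scores:

```python
# Toy illustration of W* = arg max_W [ log Pr(O | W) + log Pr(W) ].
# The hypothesis set and all scores below are invented for illustration.
hypotheses = {
    "good prose is like a windowpane": (-120.5, -14.2),  # (log Pr(O|W), log Pr(W))
    "good rose is like window pain":   (-118.9, -22.7),
}

w_star = max(hypotheses, key=lambda w: sum(hypotheses[w]))
print(w_star)  # -> good prose is like a windowpane
```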
Simple example of isolated word ASR

• Task: Recognize utterances which consist of speakers saying either "up"
or "down" or "left" or "right" once per recording.

• Vocabulary: Four words, "up", "down", "left", "right"

• Data splits
- Training data: 30 utterances
- Test data: 20 utterances

• Acoustic model: Let's parameterize Prθ(O | W) using a Markov model with parameters θ.
Word-based acoustic model

[Figure 2.1: Standard topology used to represent a phone HMM, here serving as the model for "up": a left-to-right HMM with states 0-4, self-loop probabilities a11, a22, a33, forward transitions a01, a12, a23, a34, and emission distributions b1(·), b2(·), b3(·) generating observations O1, O2, O3, O4, ..., OT]

aij → Transition probability of going from state i to state j

bj(Oi) → Probability of generating Oi from state j

Compute Pr(O | "up") = Σ_Q Pr(O, Q | "up"), summing over all hidden state
sequences Q. An efficient algorithm exists; it will appear in a later class.
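
As a preview, here is a minimal sketch of how that sum can be computed efficiently with a dynamic program (the forward recursion). The toy model uses discrete emissions and invented numbers; real acoustic models emit continuous feature vectors.

```python
import numpy as np

def forward_prob(obs, pi, A, B, exit_p):
    """Pr(O | w) = sum over all state sequences Q, computed frame by frame."""
    alpha = pi * B[:, obs[0]]          # alpha[j] = Pr(o1, q1 = j)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # extend every partial path by one frame
    return float(alpha @ exit_p)       # probability mass reaching the exit state

# Toy parameters for the "up" model (cf. the a_ij and b_j(.) above):
A = np.array([[0.6, 0.4, 0.0],         # A[i, j] = a_ij between emitting states
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.8]])
exit_p = np.array([0.0, 0.0, 0.2])     # a_34: transition into the exit state 4
B = np.array([[0.7, 0.3],              # B[j, k] = b_j(k), discrete emissions
              [0.4, 0.6],              # (real systems emit continuous vectors)
              [0.1, 0.9]])
pi = np.array([1.0, 0.0, 0.0])         # a_01 = 1: always enter at state 1

print(forward_prob([0, 0, 1, 1, 1], pi, A, B, exit_p))
```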
Isolated word recognition

[Figure: one left-to-right HMM per vocabulary word, each with the standard topology of Figure 2.1, scoring Pr(O | "up"), Pr(O | "down"), Pr(O | "left") and Pr(O | "right")]

Given the acoustic features O of an utterance, compute:

w* = arg max_w Pr(O | w)

HMMs are the most commonly used acoustic models in ASR systems today; see
Rabiner (1989) for a comprehensive tutorial of HMMs and their applicability to
ASR. An HMM is defined by its transition probabilities (aij) and its observation
(or emission) probability distributions (bj(Oi)), along with the number of
hidden states.
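
The decision rule is then a comparison of per-word likelihoods. A sketch, reusing forward_prob from the earlier snippet and a hypothetical models dict holding one trained parameter tuple per word:

```python
# `models` maps each vocabulary word to its (pi, A, B, exit_p) parameters,
# e.g. trained separately on recordings of "up", "down", "left", "right".
def recognize(obs, models):
    """Return w* = arg max_w Pr(O | w) over the per-word HMMs."""
    return max(models, key=lambda w: forward_prob(obs, *models[w]))
```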
Small tweak

• Task: Recognize utterances which consist of speakers saying either "up"
or "down" multiple times per recording.

[Figure: the word HMMs for "up" and "down", each with the standard topology of Figure 2.1]
Small tweak

• Task: Recognize utterances which consist of speakers saying either "up"
or "down" multiple times per recording.

[Figure: the "up" and "down" HMMs combined into a single decoding graph]

Search within this graph
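
Searching such a graph uses the same dynamic program as the forward recursion, but with max in place of sum, plus backpointers to recover the best state sequence and hence the word sequence (the Viterbi recursion). A minimal sketch, assuming pi, A and B are hypothetical parameters of the combined graph:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Best state path: like the forward recursion, with max instead of sum."""
    T, S = len(obs), len(pi)
    delta = pi * B[:, obs[0]]            # delta[j] = best score ending in j at t=1
    back = np.zeros((T, S), dtype=int)   # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] * A      # scores[i, j]: best path entering j via i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta.argmax())]         # start from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                    # state sequence, mapped back to words
```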
Small vocabulary ASR
• Task: Recognize utterances which consist of speakers saying one of 1000
words multiple times per recording.

• Not scalable anymore to use words as speech units

• Model using phones instead of words as individual speech units


- Phonemes are abstract, subword units that distinguish one word from another
(minimal pair; e.g. “pan” vs. “can”)
- Phones are the actual sounds as they are realized, not language-specific units

• What's an obvious advantage of using phones over entire words?
Hint: Think of words with zero coverage in the training data (see the sketch below).
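
A toy sketch of why phone units help with unseen words: a pronunciation lexicon maps every word, including words absent from the training data, onto a shared phone inventory, so a word HMM can be assembled by concatenating trained phone HMMs. The phone_hmm stub below stands in for trained per-phone parameters:

```python
# Pronunciations for "good", "like" and "is" follow the earlier slide;
# other entries would be added the same way.
lexicon = {
    "good": ["g", "uh", "d"],
    "like": ["l", "ay", "k"],
    "is":   ["ih", "z"],
}

def phone_hmm(phone):
    return f"HMM({phone})"      # stand-in for a trained phone HMM

def word_hmm(word):
    """Assemble a word model by concatenating its phone models."""
    return [phone_hmm(p) for p in lexicon[word]]

print(word_hmm("good"))         # ['HMM(g)', 'HMM(uh)', 'HMM(d)']
```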
Architecture of an ASR system

[Figure: the speech signal is passed through an Acoustic Feature Generator to produce O; a SEARCH module, guided by the Acoustic Model (phones), the Pronunciation Model and the Language Model, outputs the word sequence W*]
Cascaded ASR ⇒ End-to-end ASR

[Figure: the speech signal is passed through an Acoustic Feature Generator to produce O, which a single model maps directly to the word sequence W*]

Single end-to-end model that directly learns a mapping from speech to text
ASR Progress contd.

AUG '16: https://www.npr.org/sections/alltechconsidered/2016/08/24/491156218/voice-recognition-software-finally-beats-humans-at-typing-study-finds

AUG '17: https://www.microsoft.com/en-us/research/blog/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone/

MAR '19: https://venturebeat.com/2019/04/22/amazons-ai-system-could-cut-alexa-speech-recognition-errors-by-15/
What are some unsolved problems related to ASR?

• State-of-the-art ASR systems do not work well on regional accents, dialects

• Code-switching is hard for ASR systems to deal with

• How do we rapidly build competitive ASR systems for a new language?
Low-resource ASR and keyword spotting.

• How do we recognize speech from meetings where a primary speaker is
speaking amidst other speakers?
Next class: HMMs for Acoustic Modeling
