
Introduction to

Statistical Speech Recognition


Lecture 1

CS 753
Instructor: Preethi Jyothi
Course Plan (I)

• Cascaded ASR System
- Acoustic Model (AM)
- Pronunciation Model (PM)
- Language Model (LM)

• Weighted Finite State Transducers for ASR

• AM: HMMs, DNN and RNN-based models

• PM: Phoneme and Grapheme-based models

• LM: Ngram models (+smoothing), RNNLMs

• Decoding Algorithms, Lattices

[Figure 1.1: A standard automatic speech recognition architecture: the speech waveform undergoes acoustic analysis to produce per-frame acoustic features, which a decoder transcribes by combining an acoustic model, a pronunciation model (e.g., good: g uh d, like: l ay k, is: ih z) and an n-gram language model, yielding output such as "good prose is like a windowpane"]
Course Plan (II)

• End-to-end Neural Models for ASR
- CTC loss function
- Encoder-decoder Architectures with Attention

• Speaker Adaptation

• Speech Synthesis

• Recent Generative Models (GANs, VAEs) for Speech Processing

Check www.cse.iitb.ac.in/~pjyothi/cs753 for latest updates

Moodle will be used for assignment/project-related submissions and all announcements

[Fig. 1: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding the input sequence x into high-level features h; the speller is an attention-based decoder generating the output sequence y]
Image from: Chan et al., Listen, Attend and Spell: A NN for LVCSR, ICASSP 2016
Other Course Info
• Teaching Assistants (TAs):
- Vinit Unni (vinit AT cse)
- Saiteja Nalla (saitejan AT cse)
- Naman Jain (namanjain AT cse)

• TA office hours: Wednesdays, 10 am to 12 pm (tentative)

• Instructor 1-1: Email me to schedule a time

• Readings:
- No fixed textbook. “Speech and Language Processing” by Jurafsky and Martin serves
as a good starting point.
- All further readings will be posted online.

• Audit requirements: Complete all assignments/quizzes and score ≥ 40%


Course Evaluation

• 3 Assignments OR 2 Assignments + 1 Quiz 35%

• At least one programming assignment
- Set up an ASR system based on a recipe & improve said recipe

• Midsem Exam + Final Exam 15% + 25%

• Final Project 20%

• Participation 5%

Attendance Policy? Strongly advised to attend lectures.
Also, participation points hinge on it.
Academic Integrity Policy
Assignments/Exams

• Always cite your sources (be it images, papers or existing code repos).
Follow proper citation guidelines.

• Unless specifically permitted, collaborations are not allowed.

• Do not copy or plagiarise. Will incur significant penalties.


Final Project
• Projects can be on any topic related to speech/audio processing.
Check website for abstracts from a previous offering.

• No individual projects and no more than 3 members in a team.

• Preliminary Project Evaluation (SEP 1-7): Short report detailing project statement,
goals, specific tasks and preliminary experiments

• Final Evaluation: NOV 7-14

- Presentation (Oral or poster session, depending on final class strength)


- Report (Use ML conference style files & provide details about the project)

• Excellent Projects:
- Will earn extra credit that counts towards the final grade
- Can be turned into a research paper
#1: Speech-driven Facial Animation

https://arxiv.org/pdf/1906.06337.pdf, June 2019


Videos from: https://sites.google.com/view/facial-animation
#2: Speech2Gesture

https://arxiv.org/abs/1906.04160, CVPR 2019


Image from: http://people.eecs.berkeley.edu/~shiry/projects/speech2gesture/
#3: Decoding Brain Signals Into Speech

https://www.nature.com/articles/s41586-019-1119-1, April 2019


Introduction to ASR
Automatic Speech Recognition

• Problem statement: Transform a spoken utterance into a sequence of tokens
(words, syllables, phonemes, characters)

• Many downstream applications of ASR. Examples:
- Speech understanding
- Spoken translation
- Audio information retrieval

• Speech exhibits variability at multiple levels: speaker style,
accents, room acoustics, microphone properties, etc.
History of ASR

RADIO REX (1922)


History of ASR

SHOEBOX (IBM, 1962)


[Timeline 1922-2012: 1 word, frequency detector (1922)]
History of ASR

HARPY (CMU, 1976)

[Timeline 1922-2012: 1 word, frequency detector (1922); 16 words, isolated word recognition (1962)]
History of ASR

HIDDEN MARKOV MODELS (1980s)

[Timeline 1922-2012: 1 word, frequency detector; 16 words, isolated word recognition; 1000 words, connected speech]
History of ASR

DEEP NEURAL NETWORK BASED SYSTEMS (>2010), e.g., Siri and Cortana

[Timeline 1922-2012: 1 word, frequency detector; 16 words, isolated word recognition; 1000 words, connected speech; 10K+ words, LVCSR systems]
How are ASR systems evaluated?
• Error rates computed on an unseen test set by comparing W* (decoded
sentence) against Wref (reference sentence) for each test utterance
- Sentence/Utterance error rate (trivial to compute!)
- Word/Phone error rate

• Word/Phone error rate (ER) uses the Levenshtein distance measure: what is the
minimum number of edits (insertions/deletions/substitutions)
required to convert W* to Wref?
On a test set with N instances:

ER = [ Σ_{j=1..N} (Ins_j + Del_j + Sub_j) ] / [ Σ_{j=1..N} ℓ_j ]

Ins_j, Del_j, Sub_j are the numbers of insertions/deletions/substitutions in the jth ASR output
ℓ_j is the total number of words/phones in the jth reference
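
A minimal sketch of this computation in Python (function names are illustrative, not from any particular toolkit):

```python
def edit_distance(ref, hyp):
    """Minimum insertions/deletions/substitutions turning hyp into ref."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]  # d[i][j]: ref[:i] vs hyp[:j]
    for i in range(m + 1):
        d[i][0] = i                            # delete i reference words
    for j in range(n + 1):
        d[0][j] = j                            # insert j hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[m][n]

def error_rate(refs, hyps):
    """ER = total edits over the test set / total reference length."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    length = sum(len(r.split()) for r in refs)
    return edits / length

print(error_rate(["good prose is like a windowpane"],
                 ["good rose is like windowpane"]))  # 2 edits / 6 words ≈ 0.33
```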
NIST STT Benchmark Test History

Remarkable progress in ASR in the last decade

[Figure: NIST benchmark test history, WER (%) on a log scale from 100% down to 1%, by year: Read Speech (1k, 5k, 20k vocabularies), Air Travel Planning Kiosk Speech, Broadcast Speech, Switchboard, Switchboard II and Switchboard Cellular conversational speech, CTS Fisher (UL), CTS Arabic and Mandarin (UL), Meeting Speech (IHM, SDM OV4, MDM OV4), News English (1X, 10X, unlimited), Noisy and Varied Microphone conditions, and non-English News (Arabic 10X, Mandarin 10X), through 2018]

Image from: http://www.itl.nist.gov/iad/mig/publications/ASRhistory/
Statistical Speech Recognition

Pioneer of ASR technology, Fred Jelinek (1932-2010): Cast ASR as a channel coding problem.

Let O be a sequence of acoustic features corresponding to a speech signal.
That is, O = {O1, …, OT}, where Oi ∈ ℝ^d refers to a d-dimensional
acoustic feature vector and T is the length of the sequence.

Let W denote a word sequence. An ASR decoder solves the following problem:

W* = arg max_W Pr(W | O)
   = arg max_W Pr(O | W) · Pr(W)

where the second line follows from Bayes' rule after dropping Pr(O), which does not
depend on W. Pr(O | W) is given by the acoustic model and Pr(W) by the language model.
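
In practice the decoder searches over word sequences using log probabilities, which turns the product into a sum. A toy illustration of the objective, with an invented two-hypothesis set and made-up scores:

```python
# Toy illustration of W* = arg max_W [ log Pr(O | W) + log Pr(W) ].
# The hypothesis set and all scores below are invented for illustration.
hypotheses = {
    "good prose is like a windowpane": (-120.5, -14.2),  # (log Pr(O|W), log Pr(W))
    "good rose is like window pain":   (-118.9, -22.7),
}

w_star = max(hypotheses, key=lambda w: sum(hypotheses[w]))
print(w_star)  # -> good prose is like a windowpane
```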
Simple example of isolated word ASR

• Task: Recognize utterances which consist of speakers saying either "up"
or "down" or "left" or "right" once per recording.

• Vocabulary: Four words, "up", "down", "left", "right"

• Data splits
- Training data: 30 utterances
- Test data: 20 utterances

• Acoustic model: Let's parameterize Prθ(O | W) using a Markov model with parameters θ.
Word-based acoustic model

[Figure 2.1: Standard topology used to represent a phone HMM, here serving as the model for "up": a left-to-right HMM with states 0-4, self-loop probabilities a11, a22, a33, forward transitions a01, a12, a23, a34, and emission distributions b1(·), b2(·), b3(·) generating observations O1, O2, O3, O4, ..., OT]

aij → Transition probability of going from state i to state j

bj(Oi) → Probability of generating Oi from state j

Compute Pr(O | "up") = Σ_Q Pr(O, Q | "up"), summing over all hidden state
sequences Q. An efficient algorithm exists; it will appear in a later class.
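
As a preview, here is a minimal sketch of how that sum can be computed efficiently with a dynamic program (the forward recursion). The toy model uses discrete emissions and invented numbers; real acoustic models emit continuous feature vectors.

```python
import numpy as np

def forward_prob(obs, pi, A, B, exit_p):
    """Pr(O | w) = sum over all state sequences Q, computed frame by frame."""
    alpha = pi * B[:, obs[0]]          # alpha[j] = Pr(o1, q1 = j)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # extend every partial path by one frame
    return float(alpha @ exit_p)       # probability mass reaching the exit state

# Toy parameters for the "up" model (cf. the a_ij and b_j(.) above):
A = np.array([[0.6, 0.4, 0.0],         # A[i, j] = a_ij between emitting states
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.8]])
exit_p = np.array([0.0, 0.0, 0.2])     # a_34: transition into the exit state 4
B = np.array([[0.7, 0.3],              # B[j, k] = b_j(k), discrete emissions
              [0.4, 0.6],              # (real systems emit continuous vectors)
              [0.1, 0.9]])
pi = np.array([1.0, 0.0, 0.0])         # a_01 = 1: always enter at state 1

print(forward_prob([0, 0, 1, 1, 1], pi, A, B, exit_p))
```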
Isolated word recognition

[Figure: one left-to-right HMM per vocabulary word, each with the standard topology of Figure 2.1, scoring Pr(O | "up"), Pr(O | "down"), Pr(O | "left") and Pr(O | "right")]

Given the acoustic features O of an utterance, compute:

w* = arg max_w Pr(O | w)

HMMs are the most commonly used acoustic models in ASR systems today; see
Rabiner (1989) for a comprehensive tutorial of HMMs and their applicability to
ASR. An HMM is defined by its transition probabilities (aij) and its observation
(or emission) probability distributions (bj(Oi)), along with the number of
hidden states.
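
The decision rule is then a comparison of per-word likelihoods. A sketch, reusing forward_prob from the earlier snippet and a hypothetical models dict holding one trained parameter tuple per word:

```python
# `models` maps each vocabulary word to its (pi, A, B, exit_p) parameters,
# e.g. trained separately on recordings of "up", "down", "left", "right".
def recognize(obs, models):
    """Return w* = arg max_w Pr(O | w) over the per-word HMMs."""
    return max(models, key=lambda w: forward_prob(obs, *models[w]))
```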
Small tweak

• Task: Recognize utterances which consist of speakers saying either "up"
or "down" multiple times per recording.

[Figure: the word HMMs for "up" and "down", each with the standard topology of Figure 2.1]
Small tweak

• Task: Recognize utterances which consist of speakers saying either "up"
or "down" multiple times per recording.

[Figure: the "up" and "down" HMMs combined into a single decoding graph]

Search within this graph
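
Searching such a graph uses the same dynamic program as the forward recursion, but with max in place of sum, plus backpointers to recover the best state sequence and hence the word sequence (the Viterbi recursion). A minimal sketch, assuming pi, A and B are hypothetical parameters of the combined graph:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Best state path: like the forward recursion, with max instead of sum."""
    T, S = len(obs), len(pi)
    delta = pi * B[:, obs[0]]            # delta[j] = best score ending in j at t=1
    back = np.zeros((T, S), dtype=int)   # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] * A      # scores[i, j]: best path entering j via i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta.argmax())]         # start from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                    # state sequence, mapped back to words
```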
Small vocabulary ASR
• Task: Recognize utterances which consist of speakers saying one of 1000
words multiple times per recording.

• Not scalable anymore to use words as speech units

• Model using phones instead of words as individual speech units


- Phonemes are abstract, subword units that distinguish one word from another
(minimal pair; e.g. “pan” vs. “can”)
- Phones are the actual sounds as they are realized, not language-specific units

• What's an obvious advantage of using phones over entire words?
Hint: Think of words with zero coverage in the training data (see the sketch below).
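
A toy sketch of why phone units help with unseen words: a pronunciation lexicon maps every word, including words absent from the training data, onto a shared phone inventory, so a word HMM can be assembled by concatenating trained phone HMMs. The phone_hmm stub below stands in for trained per-phone parameters:

```python
# Pronunciations for "good", "like" and "is" follow the earlier slide;
# other entries would be added the same way.
lexicon = {
    "good": ["g", "uh", "d"],
    "like": ["l", "ay", "k"],
    "is":   ["ih", "z"],
}

def phone_hmm(phone):
    return f"HMM({phone})"      # stand-in for a trained phone HMM

def word_hmm(word):
    """Assemble a word model by concatenating its phone models."""
    return [phone_hmm(p) for p in lexicon[word]]

print(word_hmm("good"))         # ['HMM(g)', 'HMM(uh)', 'HMM(d)']
```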
Architecture of an ASR system

[Figure: the speech signal is passed through an Acoustic Feature Generator to produce O; a SEARCH module, guided by the Acoustic Model (phones), the Pronunciation Model and the Language Model, outputs the word sequence W*]
Cascaded ASR ⇒ End-to-end ASR

[Figure: the speech signal is passed through an Acoustic Feature Generator to produce O, which a single model maps directly to the word sequence W*]

Single end-to-end model that directly learns a mapping from speech to text
ASR Progress contd.

AUG '16: https://www.npr.org/sections/alltechconsidered/2016/08/24/491156218/voice-recognition-software-finally-beats-humans-at-typing-study-finds

AUG '17: https://www.microsoft.com/en-us/research/blog/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone/

MAR '19: https://venturebeat.com/2019/04/22/amazons-ai-system-could-cut-alexa-speech-recognition-errors-by-15/
What are some unsolved problems related to ASR?

• State-of-the-art ASR systems do not work well on regional accents, dialects

• Code-switching is hard for ASR systems to deal with

• How do we rapidly build competitive ASR systems for a new language?
Low-resource ASR and keyword spotting.

• How do we recognize speech from meetings where a primary speaker is
speaking amidst other speakers?
Next class: HMMs for Acoustic Modeling
