M 1.4 Introduction To Speech Recognition & Synthesis 2


Introduction to Speech Recognition & Synthesis
Roopavath Jethya
Assistant Professor
CSE Department
School Of Technology
GITAM (Deemed to be) University
Hyderabad

PAGE 1
Agenda

Speech Fundamentals
Speech Analysis
Speech Modelling
Speech Recognition (Speech-to-Text)
Speech Synthesis (Text-to-Speech)

PAGE 2
Block diagram of NLP
Speech is the input.
Automatic Speech Recognition (ASR): the function of ASR is to convert speech to text.
Natural Language Understanding (NLU): the function of NLU is to understand or interpret the text.
Natural Language Generation (NLG): depending on the NLU result, the function of NLG is to give a valid reply.
The output is also speech.
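
A minimal sketch of the pipeline shown in the block diagram. The stub functions below (recognize, understand, generate_reply, synthesize) are hypothetical placeholders for the four stages, not a real library API.

def recognize(audio: bytes) -> str:          # ASR: speech -> text
    return "turn on the light"               # stub result for illustration

def understand(text: str) -> dict:           # NLU: text -> interpretation
    return {"intent": "light_on"}            # stub result for illustration

def generate_reply(meaning: dict) -> str:    # NLG: interpretation -> reply text
    return "Okay, turning on the light."

def synthesize(text: str) -> bytes:          # TTS: reply text -> speech audio
    return text.encode("utf-8")              # stand-in for real audio samples

def voice_assistant(audio_in: bytes) -> bytes:
    return synthesize(generate_reply(understand(recognize(audio_in))))

print(voice_assistant(b"...microphone samples..."))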

PAGE 3
Speech Fundamentals
Speech Fundamentals covers the basics of organizing and presenting speech in a variety of styles.
For example, it looks at each speaker's use of verbal and nonverbal communication in the transmission of ideas, as well as the development of creativity, critical insight, and listening skills.
Basic speech units: phoneme, syllable, word, phrase, sentence, speaking turn.
Respiration: we (normally) speak while breathing out; respiration provides the airflow.
Phonation: the airstream sets the vocal folds in motion, and vibration of the vocal folds produces sound.
Speech processing also depends on natural language processing (the steps in the NLP pipeline above).
PAGE 4
Speech Analysis
Speech analytics is the process of analyzing recorded calls to gather customer information and improve communication and future interactions.
The process is primarily used by customer contact centers to extract information buried in client interactions with an enterprise.
Although speech analytics includes elements of automatic speech recognition (ASR), it is known for analyzing the topic being discussed, which is examined against the emotional character of the speech and the amount and location of speech versus non-speech during the interaction.
Speech analytics provides categorical analysis of recorded phone conversations between a company and its customers.
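
The amount and location of speech versus non-speech can be estimated with a simple energy-based voice activity detector. The sketch below is only an illustration of that idea; the frame size, threshold, and toy signal are arbitrary assumptions, not a production method.

import numpy as np

def energy_vad(signal, sample_rate, frame_ms=30, threshold=0.02):
    """Label each frame as speech (True) or non-speech (False) by RMS energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold

# Toy signal: 0.5 s of silence followed by 0.5 s of "speech-like" noise.
sr = 16000
sig = np.concatenate([np.zeros(sr // 2), 0.1 * np.random.randn(sr // 2)])
flags = energy_vad(sig, sr)
print(f"speech frames: {flags.sum()} of {len(flags)}")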

PAGE 5
Natural Language Processing

It focuses on two speech technologies:

1. Speech Recognition (Speech to Text)

2. Speech Synthesis (Text to Speech)

PAGE 6
Speech Recognition

It enables the recognition and translation of spoken language into text by computers.
It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT).
It transforms a spoken utterance into a sequence of tokens (words, syllables, phonemes, characters).
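
As a concrete illustration, assuming the third-party SpeechRecognition package and an audio file named sample.wav (neither of which is mentioned on the slide), a speech-to-text call might look like this:

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:   # hypothetical WAV recording
    audio = recognizer.record(source)        # read the whole file into memory

# Sends the audio to Google's free web recognizer; requires an internet connection.
text = recognizer.recognize_google(audio)
print(text)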

PAGE 7
Model or Components of an ASR System (Speech Modelling)
Speech is the input; text is the output.
Speech recognition works using algorithms organized into the following models:
1. Acoustic Model
2. Language Model
3. Pronunciation Model
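
These components combine in the standard ASR decoding rule (a textbook formulation, not stated explicitly on the slide). Given acoustic observations $X$, the recognizer searches for the word sequence

\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)

where $P(X \mid W)$ is supplied by the acoustic model (via the pronunciation model's phoneme sequences) and $P(W)$ by the language model.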
PAGE 8
Acoustic Analysis

A spectrogram is generated from the input signal: a time-versus-frequency plot of its energy.
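
A minimal sketch of how such a spectrogram can be computed with SciPy; the sample rate, frame sizes, and test tone are arbitrary choices for illustration rather than values from the slide.

import numpy as np
from scipy.signal import spectrogram

sr = 16000                                   # sample rate in Hz (assumed)
t = np.arange(0, 1.0, 1 / sr)
signal = np.sin(2 * np.pi * 440 * t)         # 1 s test tone instead of real speech

# f: frequency bins, times: frame centres, Sxx: energy per (frequency, time) cell
f, times, Sxx = spectrogram(signal, fs=sr, nperseg=400, noverlap=240)
print(Sxx.shape)                             # (frequency bins, time frames)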

PAGE 9
Acoustic Features

Acoustic features are numerical representations of the audio signal that relate it to the phonemes or other linguistic units that make up speech.
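
One widely used acoustic feature is the MFCC (mel-frequency cepstral coefficients). The sketch below assumes the third-party librosa package and a recording named speech.wav, neither of which appears on the slide.

import librosa

# Load a (hypothetical) speech recording, resampled to 16 kHz.
y, sr = librosa.load("speech.wav", sr=16000)

# 13 mel-frequency cepstral coefficients per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)   # (13, number of frames)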

PAGE 10
Acoustic Model
The acoustic model (AM) represents the relationship between an audio signal and the phonemes or other linguistic units that make up speech; it models the acoustics of speech.
A phoneme is the smallest unit of sound in speech.
For example, the system may be set up with a simple grammar file to recognize the word "house" (whose phonemes are "hh aw s").
The model is learned from a set of audio recordings and their corresponding transcripts.
Pronunciation dictionary excerpt:
HOUSDEN [HOUSDEN] hh aw s d ax n
HOUSE [HOUSE] hh aw s
HOUSE'S [HOUSE'S] hh aw s ix z
HOUSEAL [HOUSEAL] hh aw s …
PAGE 11
PAGE 12
The job of the acoustic model is to predict which sound, or phoneme, from the phone set is being spoken in each frame of audio.
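
A toy sketch of that idea, assuming a PyTorch environment; the layer sizes, 13 input features, and a 40-phoneme set are arbitrary assumptions. A small network scores each feature frame against every phoneme.

import torch
import torch.nn as nn

class FrameAcousticModel(nn.Module):
    """Toy acoustic model: maps one frame of acoustic features to phoneme scores."""
    def __init__(self, n_features=13, n_phonemes=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.ReLU(),
            nn.Linear(128, n_phonemes),
        )

    def forward(self, frames):             # frames: (batch, n_features)
        return self.net(frames)            # unnormalized phoneme scores

model = FrameAcousticModel()
frames = torch.randn(4, 13)                # four random feature frames
probs = torch.softmax(model(frames), dim=-1)
print(probs.argmax(dim=-1))                # most likely phoneme id per frame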

PAGE 13
PAGE 14
Pronunciation Model
Also called the lexicon model.
Pronunciation is defined as how you say a word.
For example, there are differences in how people say the word "tomato".
Given an acoustic recording of a sequence of one or more spoken words, the task is to infer the word(s).
It provides the link between the phonemes and the word.
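
A minimal sketch of a lexicon lookup in Python, reusing the dictionary entries shown on the acoustic-model slide; the reverse mapping from a phoneme sequence back to candidate words is added purely for illustration.

# Lexicon: word -> phoneme sequence (entries taken from the earlier slide).
lexicon = {
    "HOUSDEN": ["hh", "aw", "s", "d", "ax", "n"],
    "HOUSE":   ["hh", "aw", "s"],
    "HOUSE'S": ["hh", "aw", "s", "ix", "z"],
}

def words_for_phonemes(phonemes):
    """Return every word whose lexicon entry matches the phoneme sequence."""
    return [w for w, p in lexicon.items() if p == phonemes]

print(lexicon["HOUSE"])                         # ['hh', 'aw', 's']
print(words_for_phonemes(["hh", "aw", "s"]))    # ['HOUSE']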
PAGE 15

PAGE 16
Language model
It learns which sequences of words are most likely to be spoken, and its job is to predict which words will follow the current words and with what probability.
The language model provides context to distinguish between words and phrases that sound similar.
In the simplest case (a unigram model), the appearance of each word depends only on that word's own probability in the document.
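
A minimal bigram language model sketch; the tiny training corpus below is invented purely for illustration, whereas a real language model is estimated from far more text.

from collections import Counter, defaultdict

corpus = ["recognize speech", "recognize speech today", "wreck a nice beach"]

bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    """P(next word | previous word), estimated from bigram counts."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

print(p_next("recognize", "speech"))   # 1.0 in this toy corpus
print(p_next("wreck", "a"))            # 1.0 in this toy corpus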

PAGE 17
PAGE 18
Decoder
It searches for the best possible word sequence among all possible sequences.
It uses search algorithms (e.g., Viterbi or beam search).
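
A toy sketch of the idea, assuming hypothetical per-segment acoustic scores and bigram language-model log-probabilities; a real decoder searches a far larger space and prunes it with a beam.

from itertools import product
import math

# Hypothetical acoustic scores: log P(audio segment | word) for two segments.
acoustic = [
    {"recognize": -1.0, "wreck": -1.2},     # segment 1
    {"speech": -0.8, "a": -1.5},            # segment 2
]
# Hypothetical bigram log-probabilities: log P(next | previous).
lm = {("recognize", "speech"): -0.2, ("recognize", "a"): -3.0,
      ("wreck", "speech"): -2.5, ("wreck", "a"): -0.5}

best_seq, best_score = None, -math.inf
for seq in product(*(seg.keys() for seg in acoustic)):        # every word sequence
    score = sum(acoustic[i][w] for i, w in enumerate(seq))    # acoustic score
    score += sum(lm.get(pair, -10.0) for pair in zip(seq, seq[1:]))  # LM score
    if score > best_score:
        best_seq, best_score = seq, score

print(best_seq, best_score)   # ('recognize', 'speech') with the best combined score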

PAGE 19

PAGE 20
Why is ASR a challenging problem?

Variability in different dimensions:

Style: Read speech or spontaneous (conversational) speech? Continuous natural speech or command & control?

Speaker characteristics: Rate of speech, accent, prosody (stress, intonation), speaker age, and pronunciation variability even when the same speaker speaks the same word.

Channel characteristics: Background noise, room acoustics, microphone properties, interfering speakers.

Task specifics: Vocabulary size (a very large number of words to be recognized), language-specific complexity, resource limitations.

PAGE 21
Need to do more…

Robust to variation in age, accent, and ability
Handling noisy real-life settings with many speakers (e.g., meetings, parties)
High-quality embedded ASR in mobile devices
Handling new languages
Should handle code-switching/code-mixing in multilingual communities

PAGE 22
... with less

Fast (real-time) decoding using limited computational power/memory
Faster training algorithms
Reduce duplicate efforts across domains/languages
Reduce dependence on language-specific resources
Train with less labeled data

PAGE 23
Speech Synthesis
Speech synthesis, also called voice synthesis, is the electronic
generation of sounds that mimic the human voice.
These sounds can be generated from digital text or from printed
documents.
Speech can also be generated by high-level computers that have
artificial intelligence (AI), in the form of responses to input from
humans or other machines.
A text-to-speech (TTS) system converts normal language text into
speech.
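
A minimal sketch of driving TTS from a script, assuming the third-party pyttsx3 package (an offline TTS wrapper) is installed; the package and the chosen speaking rate are not mentioned on the slide.

import pyttsx3

engine = pyttsx3.init()                 # picks the platform's default TTS voice
engine.setProperty("rate", 150)         # speaking rate in words per minute
engine.say("Speech synthesis converts text into spoken audio.")
engine.runAndWait()                     # blocks until the utterance has been spoken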

PAGE 24
PAGE 25
PAGE 26
