M 1.4 Introduction To Speech Recognition & Synthesis 2
Speech Recognition & Synthesis
Roopavath Jethya
Assistant Professor
CSE Department
School Of Technology
GITAM (Deemed to be University)
Hyderabad
PAGE 1
Agenda
Speech Fundamentals
Speech Analysis
Speech Modelling
Speech Recognition (Speech-to-Text)
Speech Synthesis (Text-to-Speech)
PAGE 2
Block diagram of NLP
Speech is the input.
The function of Natural Language Understanding (NLU) is to understand or interpret the input.
PAGE 3
Speech Fundamentals
Speech Fundamentals emphasizes the fundamentals of organizing and
presenting speeches in a variety of styles.
For example, it covers each student's use of verbal and nonverbal
communication in the transmission of ideas, as well as the development of
creativity, critical insights, and listening skills.
Basic speech units: phoneme, syllable, word, phrase, sentence, speaking
turn
Respiration: We (normally) speak while breathing out. Respiration
provides airflow.
Phonation: Airstream sets vocal folds in motion. Vibration of vocal folds
produces sounds.
Speech processing also depends on natural language processing (the steps in NLP).
PAGE 4
Speech Analysis
Speech analytics is the process of analyzing recorded
calls to gather customer information to improve
communication and future interaction.
The process is primarily used by customer contact
centers to extract information buried in client
interactions with an enterprise.
Although speech analytics includes elements of
automatic speech recognition (ASR), it is known for
analyzing the topic being discussed, weighed against
the emotional character of the speech and the
amount and locations of speech versus non-speech
during the interaction.
Speech analytics provides categorical analysis of
recorded phone conversations between a company and
its customers.
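As a rough illustration of measuring "the amount and locations of speech versus non-speech", the sketch below marks a frame as speech when its average energy exceeds a threshold. The frame length, the threshold, and the synthetic test signal are assumptions chosen for the example; production speech analytics tools use far more robust detectors.

```python
import math

def frame_energies(samples, frame_len):
    """Average energy of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speech_segments(samples, frame_len=80, threshold=0.01):
    """Return (start_frame, end_frame_exclusive) runs whose energy exceeds threshold."""
    energies = frame_energies(samples, frame_len)
    segments, start = [], None
    for idx, energy in enumerate(energies):
        is_speech = energy > threshold
        if is_speech and start is None:
            start = idx                      # a speech run begins
        elif not is_speech and start is not None:
            segments.append((start, idx))    # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments

# Synthetic recording (assumption): silence, a 200 Hz tone standing in for speech, silence.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE) for i in range(240)]
samples = [0.0] * 160 + tone + [0.0] * 160

segments = speech_segments(samples)
total_frames = len(frame_energies(samples, 80))
speech_fraction = sum(end - start for start, end in segments) / total_frames
print(segments, round(speech_fraction, 3))
```

The segment list gives the locations of speech, and the fraction gives the amount.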
PAGE 5
Natural Language Processing
PAGE 6
Speech Recognition
PAGE 7
Model or Components of ASR System
Speech is input
Speech Modelling
Text is Output
PAGE 8
Acoustic Analysis
A spectrogram is generated from the input speech signal.
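A spectrogram can be sketched as follows: the signal is cut into short overlapping frames, each frame is windowed, and the magnitude of its Fourier transform is stored as one row. This is a minimal, standard-library illustration with a naive DFT; the frame length, hop size, and test tone are assumptions, and real systems use an FFT library instead.

```python
import cmath
import math

def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins of one frame (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of rows; each row holds the bin magnitudes of one frame."""
    window = hann(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    rows = []
    for f in range(num_frames):
        start = f * hop
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        rows.append(dft_magnitudes(frame))
    return rows

# Test signal (assumption): a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
FRAME_LEN = 64
tone = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(512)]

spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)
avg = [sum(row[k] for row in spec) / len(spec) for k in range(len(spec[0]))]
peak_bin = max(range(len(avg)), key=avg.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), len(spec[0]), peak_hz)
```

Each row is one time step and each column one frequency bin, so the strongest column here corresponds to the tone's frequency.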
PAGE 9
Acoustic Features
PAGE 10
Acoustic Model
Represents the relationship between an audio signal and the phonemes or
other linguistic units that make up speech.
The acoustic model (AM) models the acoustics of speech.
A phoneme is the smallest unit of sound in speech.
For example, the system may be set up with a simple grammar file to
recognize the word "house" (whose phonemes are "hh aw s").
The model is learned from a set of audio recordings and their
corresponding transcripts.
HOUSDEN [HOUSDEN] hh aw s d ax n
HOUSE [HOUSE] hh aw s
HOUSE'S [HOUSE'S] hh aw s ix z
HOUSEAL [HOUSEAL] hh aw s
PAGE 11
PAGE 12
The job of the acoustic model is to predict which sound, or
phoneme, from the phone set is being spoken in each
frame of audio.
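The per-frame prediction can be pictured with a toy example. The scores below are made-up numbers standing in for the output of a trained acoustic model, and the three-phone set is an assumption taken from the "house" example; the sketch only shows how a phoneme is picked for each frame and how repeated labels collapse into a phoneme sequence.

```python
PHONE_SET = ["hh", "aw", "s"]

# Hypothetical per-frame scores (one row per audio frame, one column per phone).
# A real acoustic model would compute these from acoustic features.
frame_scores = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]

def predict_frames(scores, phone_set):
    """Pick the highest-scoring phone for every frame."""
    return [phone_set[max(range(len(row)), key=row.__getitem__)] for row in scores]

def collapse_repeats(labels):
    """Merge runs of identical consecutive labels into a single label."""
    collapsed = []
    for label in labels:
        if not collapsed or collapsed[-1] != label:
            collapsed.append(label)
    return collapsed

frame_labels = predict_frames(frame_scores, PHONE_SET)
phones = collapse_repeats(frame_labels)
print(frame_labels)
print(phones)
```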
PAGE 13
PAGE 14
Pronunciation Model
Also called the lexicon model.
Pronunciation is defined as how you say a
word.
Example: the pronunciation of the word "tomato";
there is a difference in how people say the word.
Given an acoustic recording of a sequence of
one or more spoken words, the task is to infer
the word(s).
Provides the link between Phonemes and the
word.
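One simple way to picture this link is a dictionary from words to phoneme sequences. The HOUSE entries follow the lexicon listing in the acoustic model slide; the two "tomato" variants are illustrative assumptions added to show that a word can have more than one pronunciation. This is a sketch, not the format of any particular toolkit.

```python
# Toy pronunciation lexicon: word -> list of pronunciation variants.
LEXICON = {
    "HOUSE": [["hh", "aw", "s"]],
    "HOUSE'S": [["hh", "aw", "s", "ix", "z"]],
    "HOUSDEN": [["hh", "aw", "s", "d", "ax", "n"]],
    # Illustrative variants (assumption), not taken from the slides.
    "TOMATO": [
        ["t", "ah", "m", "ey", "t", "ow"],
        ["t", "ah", "m", "aa", "t", "ow"],
    ],
}

def pronunciations(word):
    """Word -> phonemes: all known pronunciation variants of a word."""
    return LEXICON.get(word.upper(), [])

def words_for(phones):
    """Phonemes -> word: every word with a variant matching the phoneme sequence."""
    phones = list(phones)
    return sorted(word for word, variants in LEXICON.items() if phones in variants)

print(pronunciations("house"))
print(words_for(["hh", "aw", "s"]))
print(len(pronunciations("tomato")))
```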
PAGE 15
PAGE 16
Language model
It learns which sequences of words are most likely to
be spoken, and its job is to predict which words will
follow on from the current words and with what
probability.
The language model provides context to distinguish
between words and phrases that sound similar.
In the simplest case, a unigram model, the appearance of each word depends
only on that word's own probability in the document.
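Both ideas can be sketched with simple counts: a unigram probability looks at a word on its own, while a bigram estimate predicts which words follow the current word and with what probability. The three-sentence corpus is an assumption made up for the example; real language models are trained on far larger text collections and use smoothing.

```python
from collections import Counter, defaultdict

# Tiny training corpus (assumption, for illustration only).
corpus = [
    "recognize speech with a model",
    "recognize speech from audio",
    "wreck a nice beach",
]

unigram = Counter()
bigram = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    unigram.update(words)
    for prev, nxt in zip(words, words[1:]):
        bigram[prev][nxt] += 1

def unigram_prob(word):
    """Probability of a word regardless of its context."""
    return unigram[word] / sum(unigram.values())

def next_word_probs(prev):
    """Probability of each word that followed `prev` in the corpus."""
    followers = bigram.get(prev, Counter())
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

print(unigram_prob("speech"))
print(next_word_probs("recognize"))
print(next_word_probs("speech"))
```

With these counts, "speech" is the only word seen after "recognize", which is the kind of context that helps separate similar-sounding phrases.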
PAGE 17
PAGE 18
Decoder
Searches for the best possible word sequence
among all possible sequences.
It uses search algorithms.
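The slides do not name a specific search algorithm; one widely used choice is Viterbi-style dynamic programming, sketched below. All probabilities are made-up values for illustration: each time slot lists candidate words with an acoustic score, and a bigram table plays the role of the language model. The search keeps only the best path ending in each candidate instead of enumerating every sequence.

```python
import math

# Hypothetical acoustic probabilities for the candidate words in each time slot.
slots = [
    {"recognize": 0.45, "wreck": 0.55},
    {"speech": 0.5, "beach": 0.5},
]

# Hypothetical bigram language model probabilities, P(word | previous word).
lm = {
    ("<s>", "recognize"): 0.6, ("<s>", "wreck"): 0.4,
    ("recognize", "speech"): 0.9, ("recognize", "beach"): 0.1,
    ("wreck", "speech"): 0.2, ("wreck", "beach"): 0.8,
}

def viterbi(slots, lm, start="<s>"):
    """Return (log score, word list) of the best-scoring path through the slots."""
    # best[word] = (log score of the best path ending in word, that path)
    best = {start: (0.0, [])}
    for slot in slots:
        new_best = {}
        for word, acoustic in slot.items():
            new_best[word] = max(
                (
                    (score + math.log(lm[(prev, word)]) + math.log(acoustic), path + [word])
                    for prev, (score, path) in best.items()
                ),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])

best_score, best_path = viterbi(slots, lm)
print(best_path, round(math.exp(best_score), 4))
```

In this toy setup the acoustic scores alone slightly favor "wreck" in the first slot, but the combined acoustic and language model score selects "recognize speech".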
PAGE 19
Or Pronunciation model
PAGE 20
Why is ASR a challenging problem?
PAGE 22
... with less
PAGE 23
Speech Synthesis
Speech synthesis, also called voice synthesis, is the electronic
generation of sounds that mimic the human voice.
These sounds can be generated from digital text or from printed
documents.
Speech can also be generated by high-level computers that have
artificial intelligence (AI), in the form of responses to input from
humans or other machines.
A text-to-speech (TTS) system converts normal language text into
speech.
PAGE 24
PAGE 25
PAGE 26