M 1.4 Introduction To Speech Recognition & Synthesis 2


Introduction to Speech Recognition & Synthesis
Roopavath Jethya
Assistant Professor
CSE Department
School Of Technology
GITAM (Deemed to be) University
Hyderabad

PAGE 1
Agenda

Speech Fundamentals
Speech Analysis
Speech Modelling
Speech Recognition (Speech-to-Text)
Speech Synthesis (Text-to-Speech)

PAGE 2
Block diagram of NLP
Speech is the input.
Automatic Speech Recognition (ASR): the function of ASR is to convert speech to text.
Natural Language Understanding (NLU): the function of NLU is to understand or interpret the text.
Natural Language Generation (NLG): depending on the NLU result, the function of NLG is to give a valid reply.
The output is also speech.
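
A minimal sketch of the pipeline shown in the block diagram. The stub functions below (recognize, understand, generate_reply, synthesize) are hypothetical placeholders for the four stages, not a real library API.

def recognize(audio: bytes) -> str:          # ASR: speech -> text
    return "turn on the light"               # stub result for illustration

def understand(text: str) -> dict:           # NLU: text -> interpretation
    return {"intent": "light_on"}            # stub result for illustration

def generate_reply(meaning: dict) -> str:    # NLG: interpretation -> reply text
    return "Okay, turning on the light."

def synthesize(text: str) -> bytes:          # TTS: reply text -> speech audio
    return text.encode("utf-8")              # stand-in for real audio samples

def voice_assistant(audio_in: bytes) -> bytes:
    return synthesize(generate_reply(understand(recognize(audio_in))))

print(voice_assistant(b"...microphone samples..."))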

PAGE 3
Speech Fundamentals
Speech Fundamentals covers the basics of organizing and presenting speech in a variety of styles.
For example, it looks at each speaker's use of verbal and nonverbal communication in the transmission of ideas, as well as the development of creativity, critical insight, and listening skills.
Basic speech units: phoneme, syllable, word, phrase, sentence, speaking turn.
Respiration: we (normally) speak while breathing out; respiration provides the airflow.
Phonation: the airstream sets the vocal folds in motion, and vibration of the vocal folds produces sound.
Speech processing also depends on natural language processing (the steps in the NLP pipeline above).
PAGE 4
Speech Analysis
Speech analytics is the process of analyzing recorded calls to gather customer information and improve communication and future interactions.
The process is primarily used by customer contact centers to extract information buried in client interactions with an enterprise.
Although speech analytics includes elements of automatic speech recognition (ASR), it is known for analyzing the topic being discussed, which is examined against the emotional character of the speech and the amount and location of speech versus non-speech during the interaction.
Speech analytics provides categorical analysis of recorded phone conversations between a company and its customers.
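
The amount and location of speech versus non-speech can be estimated with a simple energy-based voice activity detector. The sketch below is only an illustration of that idea; the frame size, threshold, and toy signal are arbitrary assumptions, not a production method.

import numpy as np

def energy_vad(signal, sample_rate, frame_ms=30, threshold=0.02):
    """Label each frame as speech (True) or non-speech (False) by RMS energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold

# Toy signal: 0.5 s of silence followed by 0.5 s of "speech-like" noise.
sr = 16000
sig = np.concatenate([np.zeros(sr // 2), 0.1 * np.random.randn(sr // 2)])
flags = energy_vad(sig, sr)
print(f"speech frames: {flags.sum()} of {len(flags)}")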

PAGE 5
Natural Language Processing

It focuses on two speech technologies:

1. Speech Recognition (Speech to Text)

2. Speech Synthesis (Text to Speech)

PAGE 6
Speech Recognition

It enables the recognition and translation of spoken language into text by computers.
It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT).
It transforms a spoken utterance into a sequence of tokens (words, syllables, phonemes, characters).
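
As a concrete illustration, assuming the third-party SpeechRecognition package and an audio file named sample.wav (neither of which is mentioned on the slide), a speech-to-text call might look like this:

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:   # hypothetical WAV recording
    audio = recognizer.record(source)        # read the whole file into memory

# Sends the audio to Google's free web recognizer; requires an internet connection.
text = recognizer.recognize_google(audio)
print(text)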

PAGE 7
Model or Components of an ASR System (Speech Modelling)
Speech is the input; text is the output.
Speech recognition works using algorithms organized into the following models:
1. Acoustic Model
2. Language Model
3. Pronunciation Model
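
These components combine in the standard ASR decoding rule (a textbook formulation, not stated explicitly on the slide). Given acoustic observations $X$, the recognizer searches for the word sequence

\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)

where $P(X \mid W)$ is supplied by the acoustic model (via the pronunciation model's phoneme sequences) and $P(W)$ by the language model.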
PAGE 8
Acoustic Analysis

A spectrogram is generated from the input signal: a time-versus-frequency plot of its energy.
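
A minimal sketch of how such a spectrogram can be computed with SciPy; the sample rate, frame sizes, and test tone are arbitrary choices for illustration rather than values from the slide.

import numpy as np
from scipy.signal import spectrogram

sr = 16000                                   # sample rate in Hz (assumed)
t = np.arange(0, 1.0, 1 / sr)
signal = np.sin(2 * np.pi * 440 * t)         # 1 s test tone instead of real speech

# f: frequency bins, times: frame centres, Sxx: energy per (frequency, time) cell
f, times, Sxx = spectrogram(signal, fs=sr, nperseg=400, noverlap=240)
print(Sxx.shape)                             # (frequency bins, time frames)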

PAGE 9
Acoustic Features

Acoustic features are numerical representations of the audio signal that relate it to the phonemes or other linguistic units that make up speech.
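
One widely used acoustic feature is the MFCC (mel-frequency cepstral coefficients). The sketch below assumes the third-party librosa package and a recording named speech.wav, neither of which appears on the slide.

import librosa

# Load a (hypothetical) speech recording, resampled to 16 kHz.
y, sr = librosa.load("speech.wav", sr=16000)

# 13 mel-frequency cepstral coefficients per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)   # (13, number of frames)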

PAGE 10
Acoustic Model
The acoustic model (AM) represents the relationship between an audio signal and the phonemes or other linguistic units that make up speech; it models the acoustics of speech.
A phoneme is the smallest unit of sound in speech.
For example, the system may be set up with a simple grammar file to recognize the word "house" (whose phonemes are "hh aw s").
The model is learned from a set of audio recordings and their corresponding transcripts.
Pronunciation dictionary excerpt:
HOUSDEN [HOUSDEN] hh aw s d ax n
HOUSE [HOUSE] hh aw s
HOUSE'S [HOUSE'S] hh aw s ix z
HOUSEAL [HOUSEAL] hh aw s …
PAGE 11
PAGE 12
The job of the acoustic model is to predict which sound, or phoneme, from the phone set is being spoken in each frame of audio.
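
A toy sketch of that idea, assuming a PyTorch environment; the layer sizes, 13 input features, and a 40-phoneme set are arbitrary assumptions. A small network scores each feature frame against every phoneme.

import torch
import torch.nn as nn

class FrameAcousticModel(nn.Module):
    """Toy acoustic model: maps one frame of acoustic features to phoneme scores."""
    def __init__(self, n_features=13, n_phonemes=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.ReLU(),
            nn.Linear(128, n_phonemes),
        )

    def forward(self, frames):             # frames: (batch, n_features)
        return self.net(frames)            # unnormalized phoneme scores

model = FrameAcousticModel()
frames = torch.randn(4, 13)                # four random feature frames
probs = torch.softmax(model(frames), dim=-1)
print(probs.argmax(dim=-1))                # most likely phoneme id per frame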

PAGE 13
PAGE 14
Pronunciation Model
Also called the lexicon model.
Pronunciation is defined as how you say a word.
For example, there are differences in how people say the word "tomato".
Given an acoustic recording of a sequence of one or more spoken words, the task is to infer the word(s).
It provides the link between the phonemes and the word.
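
A minimal sketch of a lexicon lookup in Python, reusing the dictionary entries shown on the acoustic-model slide; the reverse mapping from a phoneme sequence back to candidate words is added purely for illustration.

# Lexicon: word -> phoneme sequence (entries taken from the earlier slide).
lexicon = {
    "HOUSDEN": ["hh", "aw", "s", "d", "ax", "n"],
    "HOUSE":   ["hh", "aw", "s"],
    "HOUSE'S": ["hh", "aw", "s", "ix", "z"],
}

def words_for_phonemes(phonemes):
    """Return every word whose lexicon entry matches the phoneme sequence."""
    return [w for w, p in lexicon.items() if p == phonemes]

print(lexicon["HOUSE"])                         # ['hh', 'aw', 's']
print(words_for_phonemes(["hh", "aw", "s"]))    # ['HOUSE']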
PAGE 15

PAGE 16
Language model
It learns which sequences of words are most likely to be spoken, and its job is to predict which words will follow the current words and with what probability.
The language model provides context to distinguish between words and phrases that sound similar.
In the simplest case (a unigram model), the appearance of each word depends only on that word's own probability in the document.
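
A minimal bigram language model sketch; the tiny training corpus below is invented purely for illustration, whereas a real language model is estimated from far more text.

from collections import Counter, defaultdict

corpus = ["recognize speech", "recognize speech today", "wreck a nice beach"]

bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    """P(next word | previous word), estimated from bigram counts."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

print(p_next("recognize", "speech"))   # 1.0 in this toy corpus
print(p_next("wreck", "a"))            # 1.0 in this toy corpus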

PAGE 17
PAGE 18
Decoder
It searches for the best possible word sequence among all possible sequences.
It uses search algorithms (e.g., Viterbi or beam search).
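
A toy sketch of the idea, assuming hypothetical per-segment acoustic scores and bigram language-model log-probabilities; a real decoder searches a far larger space and prunes it with a beam.

from itertools import product
import math

# Hypothetical acoustic scores: log P(audio segment | word) for two segments.
acoustic = [
    {"recognize": -1.0, "wreck": -1.2},     # segment 1
    {"speech": -0.8, "a": -1.5},            # segment 2
]
# Hypothetical bigram log-probabilities: log P(next | previous).
lm = {("recognize", "speech"): -0.2, ("recognize", "a"): -3.0,
      ("wreck", "speech"): -2.5, ("wreck", "a"): -0.5}

best_seq, best_score = None, -math.inf
for seq in product(*(seg.keys() for seg in acoustic)):        # every word sequence
    score = sum(acoustic[i][w] for i, w in enumerate(seq))    # acoustic score
    score += sum(lm.get(pair, -10.0) for pair in zip(seq, seq[1:]))  # LM score
    if score > best_score:
        best_seq, best_score = seq, score

print(best_seq, best_score)   # ('recognize', 'speech') with the best combined score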

PAGE 19

PAGE 20
Why is ASR a challenging problem?

Variability in different dimensions:

Style: Read speech or spontaneous (conversational) speech? Continuous natural speech or command & control?

Speaker characteristics: Rate of speech, accent, prosody (stress, intonation), speaker age, and pronunciation variability even when the same speaker speaks the same word.

Channel characteristics: Background noise, room acoustics, microphone properties, interfering speakers.

Task specifics: Vocabulary size (a very large number of words to be recognized), language-specific complexity, resource limitations.

PAGE 21
Need to do more…

Robust to variation in age, accent, and ability
Handling noisy real-life settings with many speakers (e.g., meetings, parties)
High-quality embedded ASR in mobile devices
Handling new languages
Should handle code-switching/code-mixing in multilingual communities

PAGE 22
... with less

Fast (real-time) decoding using limited computational power/memory
Faster training algorithms
Reduce duplicate efforts across domains/languages
Reduce dependence on language-specific resources
Train with less labeled data

PAGE 23
Speech Synthesis
Speech synthesis, also called voice synthesis, is the electronic
generation of sounds that mimic the human voice.
These sounds can be generated from digital text or from printed
documents.
Speech can also be generated by high-level computers that have
artificial intelligence (AI), in the form of responses to input from
humans or other machines.
A text-to-speech (TTS) system converts normal language text into
speech.
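
A minimal sketch of driving TTS from a script, assuming the third-party pyttsx3 package (an offline TTS wrapper) is installed; the package and the chosen speaking rate are not mentioned on the slide.

import pyttsx3

engine = pyttsx3.init()                 # picks the platform's default TTS voice
engine.setProperty("rate", 150)         # speaking rate in words per minute
engine.say("Speech synthesis converts text into spoken audio.")
engine.runAndWait()                     # blocks until the utterance has been spoken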

PAGE 24
PAGE 25
PAGE 26
