Voice Technology Seminar


Voice Technology

Internal Guide: E. Nithya
HOD: C. Murugamani

• Introduction
• Evolution
• Voice recognition through AI
• Voice recognition and algorithm
• Applications
• Pros & cons
• Conclusion
What makes voice technology so popular?

It is a technology for everyone, because speaking
feels natural to users. People expect conversation
and action, so their voice queries are often more
precise and action-oriented. Daily routines do not
keep users from reaching their devices and voice
assistants, which in turn can be accessed anywhere
and at any moment.

Evolution
• In 1952, ‘Audrey’ was invented by Bell Laboratories.
• In 1962, IBM demonstrated its ‘Shoebox’ technology.
• In 1972, Bell Laboratories developed a device
that could understand more than one
person’s voice.
• In 1996, the first voice-activated portal (VAL)
was made by BellSouth.
• In 2008, Google released an application called
‘Google Voice Search’ for iPhones.
• In 2010, Google introduced personalized
recognition on Android devices.
Voice technology: what is it, actually?

When we talk about Alexa, Bixby, or Siri, we are in
fact talking about an interface that covers
multiple software layers, from voice recognition
through AI to voice-enabled applications. Voice
technology is the combination of IoT (devices and
gadgets), AI (services), and UX (interaction),
resulting in a hands-free technology that still
largely resembles science fiction.
Voice Recognition Through AI
• As artificial intelligence (AI) has evolved, it has touched almost all
facets of life and our surroundings.
• Speech recognition is one such technology that is
empowered by AI to add convenience for its users.
• It can convert voice messages to text, and it can also
recognize an individual based on their voice commands.
• Hence, AI-powered speech recognition technology
has gained great importance among tech giants such as
Apple, Microsoft, Amazon, Google, and Facebook.
• Amazon Echo, Siri, and Google Home are some of the apps
and devices that flooded the market with speech recognition.
Voice recognition
• Voice recognition (sometimes referred to as
Automatic Speech Recognition) is the process
by which a computer (or other type of machine)
identifies spoken words.
• Voice recognition is an alternative to traditional
methods of interacting with a computer, such as
textual input through a keyboard.
Speech Recognition
Types of Speech Recognition Systems

• Isolated word
• Connected word
• Continuous speech
• Spontaneous speech
• Voice verification/identification
Speech recognition process
• It involves the following five major steps:
1) Signal processing
2) Speech recognition
3) Semantic interpretation
4) Dialog management
5) Response generation
Signal processing:

• The sound is received through a microphone in the form
of analog electrical signals.
• These signals are converted into a sequence of feature
vectors.
Speech Recognition:

• This is the most important part of the process. Here the
actual recognition is done.
• Decoding is done on the basis of an algorithm such
as a Hidden Markov Model, a neural network, or Dynamic
Time Warping.
Speech Recognition Algorithms
• Dynamic Time Warping
• Hidden Markov Model
• Neural Networks
Dynamic Time Warping
• Dynamic Time Warping is one of the oldest and
most important algorithms in voice recognition.
• The simplest way to recognize an isolated word
sample is to compare it against a number of
stored word templates and determine the “best
match”.
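The template comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dtw_distance` and `recognize`, the one-number-per-frame features, and the two stored templates are assumptions made for the example, not part of the original slides.

```python
def dtw_distance(a, b):
    """Return the Dynamic Time Warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between two frames
            # A frame may be stretched, compressed, or matched one-to-one.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(sample, templates):
    """Pick the word whose stored template is closest to the sample."""
    return min(templates, key=lambda word: dtw_distance(sample, templates[word]))


# Hypothetical templates: one made-up feature value per frame.
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 2.0, 4.0],
}
sample = [1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]  # "yes" spoken more slowly
print(recognize(sample, templates))
```

Because the warping path may repeat frames, the slower sample still aligns with the "yes" template at zero cost, which is the property that makes this method useful for isolated words spoken at different speeds.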
Hidden Markov Model (HMM)

• The most flexible and successful approach to
voice recognition so far has been the Hidden
Markov Model.
• It is a collection of states connected by transitions.

Figure: a simple HMM with 2 states and 2 outputs

Formally, an HMM consists of the following elements:
$\{s\}$ = a set of states.
$\{a_{ij}\}$ = a set of transition probabilities, where $a_{ij}$
is the probability of taking the transition from state $i$ to state $j$.
$\{b_i(u)\}$ = a set of emission probabilities, where $b_i$
is the probability distribution over the acoustic space
describing the likelihood of emitting each possible sound $u$
while in state $i$.
Since $a_{ij}$ and $b_i$ are both probabilities, they must satisfy the
following properties:
$a_{ij} \ge 0,\; b_i(u) \ge 0$ for all $i, j, u$
$\sum_j a_{ij} = 1$ for all $i$
$\sum_u b_i(u) = 1$ for all $i$
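To make these definitions concrete, here is a minimal Python sketch of a 2-state, 2-output HMM. All numbers, the symbols "A" and "B", the `initial` start distribution, and the function names are assumptions chosen for illustration. The sketch checks the probability constraints listed above and then uses the standard forward algorithm to compute how likely the model is to emit a given sequence.

```python
# Hypothetical 2-state HMM with 2 output symbols; all numbers are made up.
initial = [1.0, 0.0]            # assumed: the model always starts in state 0
a = [[0.6, 0.4],                # a[i][j]: probability of moving from i to j
     [0.0, 1.0]]
b = [{"A": 0.7, "B": 0.3},      # b[i][u]: probability of emitting u in state i
     {"A": 0.2, "B": 0.8}]


def satisfies_constraints(a, b, tol=1e-9):
    """Check non-negativity and that each row of a and b sums to 1."""
    rows = list(a) + [list(dist.values()) for dist in b]
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) < tol for row in rows
    )


def forward_probability(observations, initial, a, b):
    """Probability that the HMM emits the given non-empty observation list."""
    n = len(initial)
    # alpha[i] = probability of the observations so far, ending in state i
    alpha = [initial[i] * b[i][observations[0]] for i in range(n)]
    for u in observations[1:]:
        alpha = [
            sum(alpha[i] * a[i][j] for i in range(n)) * b[j][u]
            for j in range(n)
        ]
    return sum(alpha)


print(satisfies_constraints(a, b))
print(forward_probability(["A", "B"], initial, a, b))
```

In a recognizer, one such model would typically be kept per word or sound, and the model with the highest probability for the observed frames would be chosen.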
Neural Networks
• A neural network consists of many simple
processing units, each of which is connected to
many other units.
• Each unit has a numerical activation value.
• In the standard case, the net input $x_j$ for unit $j$ is
just the weighted sum of its inputs:
• $x_j = \sum_i y_i w_{ji}$
Here $y_i$ is the output activation of an incoming unit
and $w_{ji}$ is the weight from unit $i$ to unit $j$.
Figure: unit activations for neural networks
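The weighted sum above is easy to compute directly. The following Python sketch evaluates the net input of a single unit and then applies a sigmoid to obtain its activation. The sigmoid is an assumed choice of activation function (the slides do not say which one is used), and the input and weight values are made up.

```python
import math


def net_input(y, w_j):
    """Weighted sum of incoming activations y_i with weights w_ji."""
    return sum(y_i * w_ji for y_i, w_ji in zip(y, w_j))


def sigmoid(x):
    """Assumed activation function: squashes the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


y = [1.0, 0.5, -1.0]    # output activations of three incoming units (made up)
w_j = [0.2, 0.4, 0.1]   # weights from those units to unit j (made up)

x_j = net_input(y, w_j)  # 0.2 + 0.2 - 0.1
y_j = sigmoid(x_j)       # activation of unit j
print(x_j, y_j)
```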
Semantic Interpretation:
• Here a grammar check is performed.
• It tries to find out whether or not the combination of
words makes sense.

Dialog Management:
• The system tries to correct any errors encountered.
Response Generation:

• After the task is performed, the response or the result of
the task is generated.
Structure of a standard speech recognition system
Raw Speech:
Speech is typically sampled at a high frequency, e.g., 16 kHz
over a microphone or 8 kHz over a telephone.

Signal Analysis:
Raw speech should be initially transformed and
compressed in order to simplify subsequent processing.
Signal analysis converts raw speech to speech frames.
Speech Frames:
The result of signal analysis is a sequence of speech
frames, typically at 10-millisecond intervals, with about
16 coefficients per frame.
The speech frames are used for acoustic analysis.
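As a rough sketch of the framing step, the Python below cuts one second of 16 kHz audio into frames that start every 10 milliseconds. The 25-millisecond frame length, the synthetic 440 Hz test tone, and the function names are assumptions for illustration. To stay short, the sketch computes a single log-energy value per frame instead of the roughly 16 coefficients a real system would extract.

```python
import math

SAMPLE_RATE = 16000                    # 16 kHz microphone speech
HOP = SAMPLE_RATE * 10 // 1000         # a new frame every 10 ms -> 160 samples
FRAME_LEN = SAMPLE_RATE * 25 // 1000   # assumed 25 ms frames -> 400 samples


def split_into_frames(samples, frame_len=FRAME_LEN, hop=HOP):
    """Cut the signal into overlapping frames that start every `hop` samples."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def log_energy(frame):
    """One simple per-frame feature: log of the mean squared amplitude."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(energy + 1e-12)    # small offset avoids log(0) on silence


# One second of a synthetic 440 Hz tone stands in for recorded speech.
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

frames = split_into_frames(signal)
features = [log_energy(frame) for frame in frames]
print(len(frames), len(frames[0]))
```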

Acoustic Models:
In order to analyze the speech frames for their
acoustic content, we need a set of acoustic models.
Figure: acoustic models (template and state representations) for the word “cat”
Acoustic analysis and frame scores:
Acoustic analysis is performed by applying each
acoustic model over each frame of speech, yielding a
matrix of frame scores. Scores are computed
according to the type of acoustic model that is being
used.
The alignment path with the best total
score identifies the word sequence and
segmentation.
Time Alignment:
The process of searching for the best
alignment path is called time alignment.
Word Sequence:
The end result of time alignment is a
word sequence: the sentence
hypothesis for the utterance.
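Time alignment over a matrix of frame scores can be illustrated with a small dynamic-programming search. The Python sketch below assumes a single word model for “cat” with three states (C, A, T) visited left to right, where each frame either stays in the current state or advances to the next one. The score values and the function name `best_alignment` are made up for the example. A full recognizer would search over many word models and keep the best-scoring sequence.

```python
def best_alignment(scores):
    """Find the best left-to-right path through a frame-score matrix.

    scores[s][t] is the score of model state s on frame t (higher is better).
    Returns (total_score, path), where path[t] is the state aligned to frame t.
    """
    n_states, n_frames = len(scores), len(scores[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * n_frames for _ in range(n_states)]
    back = [[0] * n_frames for _ in range(n_states)]
    best[0][0] = scores[0][0]          # the path must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = best[s][t - 1]
            advance = best[s - 1][t - 1] if s > 0 else neg_inf
            if advance > stay:
                best[s][t], back[s][t] = advance + scores[s][t], s - 1
            else:
                best[s][t], back[s][t] = stay + scores[s][t], s
    # Trace back from the last state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[path[-1]][t])
    path.reverse()
    return best[n_states - 1][n_frames - 1], path


# Made-up frame scores for the three states of "cat" over five frames.
scores = [
    [-1, -1, -5, -6, -7],   # C
    [-6, -4, -1, -2, -6],   # A
    [-7, -6, -5, -4, -1],   # T
]
total, path = best_alignment(scores)
print(total, path)
```

The returned path shows which frames belong to which state, which is the information that lets a recognizer mark where each sound begins and ends.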
Voice Recognition Software
• Julius
• Google Now
• S Voice
• Iris (Intelligent Rival Imitator of Siri)
• Dragon NaturallySpeaking
• Windows Speech Recognition
Pros
• Talking is faster than typing.
• It can especially assist people who
have little keyboard skill or experience.
• It assists those who are slow typists or do not have
the time or resources to develop keyboard
skills.
• It helps people with physical disabilities that affect
either data entry or the ability to read what
they have entered.
Cons
• Privacy of recorded voice data.
• Errors and misinterpretation of words.

Conclusion

This technology will bring revolutionary changes to the
modern world and become a pivotal technology. Within five
years, it will become so pervasive in our daily
lives that service environments lacking it will be
considered inferior.
