
As the name suggests, speech recognition means recognising human speech and responding
accordingly. It uses speech signal processing and pattern recognition to identify what a person has
said. Speech recognition technology allows an application to turn a voice signal into the
corresponding text or command by identifying and interpreting it. Speech
recognition has become a key technology in today’s high-tech information technology world.
Speech recognition systems are used for entering simple facts and data, call routing, banking
services, medical services, travel services and so on.

The development process and current state of speech recognition technology
Speech recognition technology was first introduced in the 1950s. Since then, the
technology has advanced considerably. The hidden Markov model (HMM) and the artificial neural network
(ANN) were successfully applied to speech recognition in the 1980s. In 1990, Dragon Systems released
’Dragon Dictate’, which became the world’s first speech recognition software for end users.
Many countries, such as the United States of America, Japan, South Korea and China, along with IBM,
Apple, Microsoft and other well-known companies, invest heavily in the research and development of
practical speech recognition systems. At present, the focus is on compact voice-activated speakers
(such as Alexa), which are used by more than 50% of users. However, the technology is still not
perfect and has an error rate of 5% or below. Devices built on voice recognition are now easy to
find, and they offer greater convenience.

Basic principles and methods of speech recognition technology


Speech recognition methods include various models and networks, such as dynamic time warping,
the hidden Markov model (HMM), vector quantization, the artificial neural network (ANN) and so on.
Some of the stages involved are described below:

1. Pre-processing: The voice signal is converted into an electrical signal by a
microphone for further processing. The system extracts the useful data from the voice signal
and eliminates unnecessary noise. This stage also increases the energy of the input signal at
higher frequencies.

2. Feature extraction: This step finds certain parameters of the human voice and
correlates them with the speech signal. During recognition, the system compares the input
voice data with the voice blueprint and performs the corresponding action.

3. Acoustic model: An acoustic model is used in automatic speech recognition to represent the
link between an audio signal and the phonemes that make up the speech signal.

4. Language model: This model predicts which words are likely to follow a given word
sequence. It also differentiates between words and phrases that sound the same.

5. Pattern classification: This is the process of comparing the unrecognized pattern with the
blueprint voice pattern.
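
The high-frequency boost mentioned under pre-processing is commonly done with a first-order pre-emphasis filter. The sketch below is only an illustration of that idea, not the method of any particular system: the coefficient 0.97, the 8 kHz sampling rate and the two test tones are assumptions chosen for the example.

```python
import math


def pre_emphasis(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1], a first-order high-pass filter."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def energy(samples):
    """Sum of squared sample values."""
    return sum(s * s for s in samples)


# Hypothetical test tones: 100 Hz (low) and 3000 Hz (high), 8 kHz sampling assumed.
RATE = 8000
low = [math.sin(2 * math.pi * 100 * n / RATE) for n in range(800)]
high = [math.sin(2 * math.pi * 3000 * n / RATE) for n in range(800)]

# Ratio of output energy to input energy for each tone.
low_ratio = energy(pre_emphasis(low)) / energy(low)
high_ratio = energy(pre_emphasis(high)) / energy(high)
```

With these settings the low tone is strongly attenuated while the high tone gains energy, which is the effect the pre-processing step describes.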

Speech to Text Conversion


Speech-to-text conversion is the process of converting spoken words into written text that
represents the uttered words immediately after they are spoken. It follows the same principles
and methods described above, using various combinations of techniques.

Some widely used methods for speech-to-text conversion are:

1. Hidden Markov Model (HMM): This model is a finite state machine with a fixed number of
states. Because the voice is a random signal, the purpose of the HMM is to characterize certain
parameters of the voice signal in a well-defined manner. [xyz]

• Recognition accuracy – Recognition is the process of comparing the unknown pattern with
each blueprint (reference) sound pattern and measuring the similarity between the unknown
pattern and each reference pattern. It is important that a recognition system be
independent of the speaker.

• Recognition speed – If the system responds late to the user’s uttered word, its
efficiency decreases, so the system must have good recognition speed.

The signal undergoes the following steps:

• Pre-processing: The input speech signal is divided into speech frames, each giving a
unique sample, and noise is reduced.

• HMM training: Training involves creating a pattern representative of the features of a class
using one or more test patterns that correspond to speech sounds of the same class.

• HMM recognition: This is the process of comparing the unknown test pattern with each
sound class reference pattern and computing a measure of similarity (distance). Maximum
likelihood is used for recognition.
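
The maximum-likelihood recognition step can be sketched for a discrete HMM with the forward algorithm: each sound class has its own model, and the class whose model gives the observation sequence the highest likelihood is chosen. This is a minimal sketch only; the two models, their names (`word_a`, `word_b`) and all probabilities are hypothetical values made up for illustration, and training (Baum-Welch) is not shown.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM using the forward algorithm."""
    n_states = len(start)
    # Initialisation with the first observation symbol.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    # Induction over the remaining symbols.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


def recognise(obs, models):
    """Pick the model name with the highest likelihood for obs."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Hypothetical two-state, left-to-right models over the symbols 0 and 1.
# Each entry is (start probabilities, transition matrix, emission matrix).
TRANS = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    "word_a": ([1.0, 0.0], TRANS, [[0.9, 0.1], [0.2, 0.8]]),  # mostly 0s, then 1s
    "word_b": ([1.0, 0.0], TRANS, [[0.1, 0.9], [0.8, 0.2]]),  # mostly 1s, then 0s
}

best = recognise([0, 0, 1, 1], MODELS)
```

In practice the products of probabilities become very small for long utterances, so implementations usually work with scaled or log probabilities; that refinement is left out here to keep the sketch short.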

[Figure: block diagram of HMM-based speech recognition. Block labels: Speech, Speech Analysis, Feature Vectors, Robust Processing, Hidden Markov Model, Reference Model, Speech Recognition, Recognition Result]

In an HMM, when the observation in state j takes one of M discrete, countable values, it is
described by a set of probabilities $b_{jk}$, $k = 1, 2, \ldots, M$; such a model is known as a
discrete HMM. When the observation is a continuous random variable X, the observation in state j
is described by a probability density function $b_j(X)$, which gives a continuous HMM. A
continuous HMM uses the Baum-Welch algorithm to estimate the model parameters, including A and
the parameters of $b_j(X)$, but the form of $b_j(X)$ must be restricted for the estimation to be
well defined. The most widely used form at present is the Gaussian mixture, which can be
written as follows [number 3 from the 2012 source goes here]:

$$b_j(x) = \sum_{k=1}^{K} c_{jk}\, b_{jk}(x) = \sum_{k=1}^{K} c_{jk}\, \mathcal{N}(x;\, \mu_{jk}, \Sigma_{jk}), \qquad 1 \le j \le N$$

where $c_{jk}$ is the weight of the k-th Gaussian component in state j, and $\mu_{jk}$ and
$\Sigma_{jk}$ are its mean and covariance.

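
As a rough illustration of the Gaussian mixture density for a single state j, the sketch below evaluates the weighted sum of Gaussian components. It assumes diagonal covariance matrices to keep the code short (the formula itself allows full covariances), and the function names and numbers are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance; var holds the diagonal."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def mixture_density(x, weights, means, variances):
    """Weighted sum of Gaussian component densities for one HMM state."""
    return sum(
        c * gaussian_diag(x, m, v)
        for c, m, v in zip(weights, means, variances)
    )


# Hypothetical state with two one-dimensional components.
weights = [0.5, 0.5]           # mixture weights, summing to 1
means = [[0.0], [2.0]]         # component means
variances = [[1.0], [1.0]]     # diagonal covariances
density_at_zero = mixture_density([0.0], weights, means, variances)
```

The mixture weights of a state must sum to 1 so that the result is itself a valid probability density.
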
2. Artificial Neural Network (ANN) Classifier based on Cuckoo Search Optimization: This model
improves communication between the user and the system by helping
to remove unwanted noise. It follows a three-step process: [xyz]

• Pre-processing: Pre-processing of the speech signal is the most important part of speech
recognition. It is carried out to remove unwanted parts of the waveform. The signal is fed to
high-pass filters to remove background noise.

• Feature extraction: Two kinds of acoustic features are extracted from the speech signal: Mel
Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding coefficients (LPCC).

• Classification: An artificial neural network is used as the classifier. The neural network is a
three-layered classifier with n input nodes, l hidden nodes and k output nodes. In CSO (Cuckoo
Search Optimization), the ANN is implemented as a two-layered Feed Forward Backpropagation
Neural Network (FFBNN) with two input units, three hidden units and one output unit.

Here, the input layer takes two inputs, the extracted MFCC and
LPCC features. The network is trained on these inputs and produces a
corresponding output.
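
The network shape described above (two input units, three hidden units, one output unit) can be sketched as a small feed-forward network trained by backpropagation. This is a simplified illustration under several assumptions: each of the two inputs is a single number standing in for an MFCC-derived and an LPCC-derived feature, sigmoid activations and a squared-error loss are used, and the Cuckoo Search Optimization step is not implemented. The class name `TinyFFNN` is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyFFNN:
    """2 input units, 3 hidden units, 1 output unit, sigmoid activations."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
        self.b_hidden = [0.0] * 3
        self.w_out = [rng.uniform(-1, 1) for _ in range(3)]
        self.b_out = 0.0

    def forward(self, x):
        """Return (hidden activations, output activation) for input x."""
        hidden = [
            sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w_hidden, self.b_hidden)
        ]
        output = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update; returns the loss before the update."""
        hidden, output = self.forward(x)
        # Gradient of 0.5 * (output - target)^2 w.r.t. the output pre-activation.
        delta_out = (output - target) * output * (1.0 - output)
        # Hidden deltas are computed with the output weights before they change.
        delta_hidden = [
            delta_out * w * h * (1.0 - h) for w, h in zip(self.w_out, hidden)
        ]
        for k in range(3):
            self.w_out[k] -= lr * delta_out * hidden[k]
            for i in range(2):
                self.w_hidden[k][i] -= lr * delta_hidden[k] * x[i]
            self.b_hidden[k] -= lr * delta_hidden[k]
        self.b_out -= lr * delta_out
        return 0.5 * (output - target) ** 2


# Hypothetical training example: two feature values and a target of 1.0.
net = TinyFFNN()
losses = [net.train_step([0.2, 0.8], 1.0) for _ in range(200)]
```

A real recogniser would feed vectors of coefficients per frame rather than two scalars, and would train on many labelled utterances.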

[Figure: ANN-based recognition flow. Block labels: Speech, Feature Extraction, Training using ANN, Testing, Recognising Speech]

Voice-Based E-mail (V-Mail)
Blind people find technology difficult to use because operating a keyboard is hard for them. As a
result, computers have been impractical for people with disabilities. E-mail is considered
one of the most reliable ways of communicating over the Internet when sending or receiving
important information. This project therefore helps such users to access multimedia and to send
e-mail, since it removes the burden of remembering keyboard keys. With the help of artificial
intelligence, blind people and people with disabilities can communicate easily by e-mail.

• Mail send and compose – The mail is composed through voice-based detection:
the speech is converted to text and the commands are saved on the server.
The mail is thus composed using speech-to-text conversion. Based on the command,
the voice is recognized, converted into text and interpreted by the
application, and the mail is then sent through the mail server to the specified recipient.
• Read e-mail – The user can view the list of conversations in the inbox or open a particular
mail to read its messages. A mail is opened and read at the user’s convenience,
and priority is mostly given to unread mails. When the user chooses a mail by saying its
number, it opens, the text in it is converted into voice and the mail is read aloud. All
these activities take place without the use of a keyboard. [xyz]
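
The compose and read flows above can be sketched in Python once the speech has already been converted to text. This is only an assumed shape for the application logic: the field names (`to`, `subject`, `body`, `read`), the helper names and the small spoken-number table are hypothetical, the speech recognition and text-to-speech steps are not shown, and nothing is actually sent.

```python
from email.message import EmailMessage

# Hypothetical mapping from spoken numbers to list positions.
SPOKEN_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def compose_from_speech(recognised, sender):
    """Build an EmailMessage from a dict of already-recognised text fields."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recognised["to"]
    msg["Subject"] = recognised["subject"]
    msg.set_content(recognised["body"])
    return msg


def unread_first(inbox):
    """Order mails so that unread ones come first, keeping the original order otherwise."""
    return sorted(inbox, key=lambda mail: mail["read"])


def pick_mail(inbox, spoken_number):
    """Return the mail chosen by its spoken position, e.g. 'two' for the second mail."""
    index = SPOKEN_NUMBERS[spoken_number.strip().lower()] - 1
    return inbox[index]
```

The resulting message could then be handed to a mail server, for example with the standard `smtplib` module, and the body of the picked mail passed to a text-to-speech engine.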

Speech to App
Speech recognition technology, which recognizes human speech and converts it to text or
performs a command, has emerged as the ’Next Big Thing’ of the IT industry. It
allows equipment and services to be controlled by voice
without using devices such as a mouse or keyboard. One of the most prominent examples of a mobile
voice interface is Siri, the voice-activated personal assistant built into the iPhone.

The overall processing of our application can be described as follows:

[Fig 2.2: speech to app (figure to be pasted here)]

Here, the voice is converted into a digital signal by the analog-to-digital converter. First, the
project is loaded; the user’s voice is then converted into a digital signal, which passes to the
speech recognition engine. The engine recognizes the voice, and the user logs in by
answering various random questions. The user then performs tasks by
speaking commands, and the machine responds according to each command.
Various commands are available, such as open
notepad, open command prompt, open Google and many more.
When the user says "open command prompt", the speech
recognition engine recognizes the voice and the command prompt opens. We create a large
recognition vocabulary for interaction between the user and the machine, and define a user
dictionary in which all the commands are present; the user interacts with the machine through
these commands. It is
assumed that each utterance consists of a sequence of meaningful and structured words, and our
main goal is to convert the spoken signal into the word sequence as accurately as possible. The
output for an utterance depends on the recognized sequence of words. This task is sometimes known as
speech-to-text conversion.
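
The user dictionary of commands can be pictured as a simple lookup from recognized text to an action. The sketch below is an assumption about how such a table might look, not the application's actual code: the action strings are placeholders, and the function only returns the action instead of launching anything.

```python
# Hypothetical command dictionary: recognised phrase -> action to perform.
COMMANDS = {
    "open notepad": "notepad",
    "open command prompt": "cmd",
    "open google": "https://www.google.com",
}


def normalise(utterance):
    """Lower-case the recognised text and collapse repeated whitespace."""
    return " ".join(utterance.lower().split())


def dispatch(utterance, commands=COMMANDS):
    """Return the action for a recognised utterance, or None if it is not in the dictionary."""
    return commands.get(normalise(utterance))
```

Returning `None` for unknown phrases lets the application ask the user to repeat the command rather than guess.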
