
[Type the document title]



Fig 1.1 Speech recognition

The purpose of this paper is to illustrate advances in artificial intelligence from the perspective of
speech recognition. Prior research has established that speech recognition is one of the most advanced
areas of electrical engineering and computer science. In essence, it deals with the conversion of spoken
words into text.

Speech recognition is also referred to as ASR (automatic speech recognition), STT (speech to text) or simply
computer speech recognition. It has also been described as the field of computer science that deals with
the design and development of computer systems that recognize spoken words. In this regard, speech
recognition (or computer speech recognition, or ASR) has been characterized as the conversion of a speech
signal into a sequence of words with the help of different algorithms and techniques.

Research has documented that these approaches include the artificial intelligence approach, the pattern
recognition approach, and the acoustic-phonetic approach. Of these, artificial intelligence is regarded as
the most rapidly developing and effective technique for accurate speech recognition. This is because
artificial intelligence incorporates algorithmic approaches that foster the coherent conversion of speech
into readable patterns, and vice versa.

This research will assist in understanding the concepts associated with speech recognition. Among all of
these approaches, artificial intelligence is found to be the most effective and integrated approach, and it
has strengthened and improved speech recognition practices.

The following manuscript illustrates the core concept of artificial intelligence, as well as the
technological advances that have occurred in artificial intelligence. In addition, the paper will assist in
understanding and identifying the statistical models for speech recognition.

Humans interact through various means such as sounds, sign language and facial expressions.
However, voice is regarded as the most essential medium that humans use, as it aids communication and
is the medium most commonly used among speakers. Speech is a convenient form of utterance that carries
a specific meaning; it consists of words, which in turn consist of letters associated with sounds. Sound
travels through a medium such as air in the form of waves, which start as small circles at the origin of
the sound.
These circles are driven outward by pressure and spread gradually before they vanish entirely once they
extend over a wide range. Meaningful conversation takes place when speakers talk in a common language,
meaning that the communicators on the sending side and the receiving side share matching keys that help
both parties interpret the message. Researchers have applied this phenomenon and developed it into an
essential part of communication between humans and machines; voice makes the machine easier for the
user to operate and builds a more natural communication between them.
Automatic speech recognition has contributed substantially to the growth of artificial intelligence, which
tries to construct very flexible techniques for operating machines. This enables the user to exchange
information and communicate without the usual input/output devices such as keyboards and remotes.
Speech-oriented input/output approaches are essential in various areas, such as the care of differently
abled people, the operation of vehicles (specifically while driving), and calls for help in emergency
situations.
In this paper, we present an analysis of recent works on ASR, conveying their essential features,
advantages and drawbacks, and we analyse these works to describe our perception of a further feasible
method. Section 2 gives an outline of the automatic speech recognition system. Section 3 covers the
characteristics of speech recognition systems and the meaning of automatic speech recognition. Section 4
illustrates the architecture of automated speech recognition systems, with an explanation of each part's
work. In Sections 5 and 6, we present some current work on automated speech processing and discuss our
perception of the work described in the preceding section. We describe the extent to which hybrid models
and neural network models are used. Finally, we complete the paper with a conclusion and the list of
references that were used to write this paper.



An Overview of Artificial intelligence in speech recognition

2.1 Speech recognition in artificial intelligence

Fig 2.1 AI in speech recognition

Speech recognition is an AI-enhanced technology that converts human speech from analog form to
digital form. Advanced computer programs then use the digital speech for further processing. Speech
recognition amounts to a computer taking dictation, and it is different from natural language processing
(NLP). NLP technology helps to understand the digitized, dictated speech captured by speech recognition.
One technology simply captures speech data; the other attempts to comprehend and respond to it.
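
As a rough illustration of the analog-to-digital conversion described above, the sketch below samples a synthetic sine tone and quantizes it to 16-bit integers. The sine tone stands in for a microphone signal, and the function name, sample rate and tone frequency are illustrative assumptions rather than values taken from this report.

```python
import math

def digitize(frequency_hz, duration_s, sample_rate_hz=16000, bits=16):
    """Sample a pure sine tone and quantize it to signed integers.

    A real system would read samples from a microphone's converter;
    the sine tone is only a stand-in for an analog speech signal.
    """
    max_level = 2 ** (bits - 1) - 1            # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                 # time of the n-th sample
        amplitude = math.sin(2 * math.pi * frequency_hz * t)  # in [-1, 1]
        samples.append(round(amplitude * max_level))
    return samples

pcm = digitize(frequency_hz=440, duration_s=0.01)
print(len(pcm), min(pcm), max(pcm))
```

The resulting list of integers is the kind of digital representation that later processing stages operate on.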

Speech recognition, or speech-to-text, is the ability of a machine or program to identify words spoken
aloud and convert them into readable text. Rudimentary speech recognition software has a limited
vocabulary and may only identify words and phrases when spoken clearly. More sophisticated software can
handle natural speech, different accents and various languages.

Speech recognition uses a broad array of research in computer science, linguistics and computer
engineering. Many modern devices and text-focused programs have speech recognition functions in them to
allow for easier or hands-free use of a device.

Speech recognition and voice recognition are two different technologies and should not be confused:

 Speech recognition is used to identify words in spoken language.

 Voice recognition is a biometric technology for identifying an individual's voice.


Using artificial intelligence and machine learning, speech recognition is fast overcoming challenges such as
poor recording equipment, background noise, and variations in people's voices, accents, dialects, semantics
and contexts. It is also tackling the challenges of understanding human disposition and varying language
elements such as colloquialisms and acronyms. The technology can now reach about 95% accuracy, an
improvement over traditional models of speech recognition and on par with regular human communication.
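
Accuracy figures of this kind are commonly reported through the word error rate (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that metric. The function name and the example sentences are hypothetical, and this report does not state how its 95% figure was measured.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,              # deletion
                             dist[i][j - 1] + 1,              # insertion
                             dist[i - 1][j - 1] + substitution)
    return dist[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("turn on the kitchen light", "turn on kitchen lights")
print(f"WER = {wer:.2f}, word accuracy = {1 - wer:.2f}")
```

In this example one word is dropped and one is substituted, so two of the five reference words are in error.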

Furthermore, it is now an accepted mode of communication, given the large companies that endorse it and
regularly employ speech recognition in their operations. It is estimated that a majority of search engines
will adopt voice technology as an integral aspect of their search mechanism.

2.2 How does speech recognition work

Speech recognition systems use computer algorithms to process and interpret spoken words and convert
them into text. A software program turns the sound a microphone records into written language that
computers and humans can understand, following these four steps:

1. Analyze the audio.

2. Break it into parts.
3. Digitize it into a computer-readable format.
4. Use an algorithm to match it to the most suitable text representation.
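
The four steps above can be sketched, very loosely, as a toy template matcher. This is a minimal illustration and not how production recognizers work: the synthetic tones stand in for recorded words, the two features (energy and zero-crossing rate) are a deliberately crude choice, and all function names are hypothetical.

```python
import math

def normalize(samples):
    """Step 1 (analyze): scale the audio so its peak amplitude is 1."""
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]

def frame_signal(samples, frame_len):
    """Step 2 (break into parts): fixed-length, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_features(frame):
    """Step 3 (digitize): reduce a frame to (energy, zero-crossing rate)."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (len(frame) - 1))

def utterance_features(samples, frame_len=80):
    """Average the per-frame features over the whole utterance."""
    feats = [frame_features(f)
             for f in frame_signal(normalize(samples), frame_len)]
    return tuple(sum(f[i] for f in feats) / len(feats) for i in range(2))

def recognize(samples, templates):
    """Step 4 (match): return the label of the closest stored template."""
    feats = utterance_features(samples)
    return min(templates, key=lambda label: math.dist(feats, templates[label]))

def tone(freq_hz, n=800, rate=8000):
    """Synthetic stand-in for a recorded word (an assumption for the demo)."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]

templates = {"low": utterance_features(tone(200)),
             "high": utterance_features(tone(1500))}
print(recognize(tone(220), templates))
```

Real systems replace the hand-made features with richer spectral features and the nearest-template rule with trained statistical or neural models.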

Speech recognition software must adapt to the highly variable and context-specific nature of human speech.
The software algorithms that process and organize audio into text are trained on different speech patterns,
speaking styles, languages, dialects, accents and phrasings. The software also separates spoken audio from
background noise that often accompanies the signal.

Fig 2.2 Block diagram of speech recognition


 Acoustic model (AM)

One of the most prominent and widely adopted models of speech recognition is the acoustic model (AM).
It has been established that acoustic models capture the characteristics of the basic recognition units.
The recognition units can be at the phoneme level, the syllable level, or the word level. Each choice of
unit brings its own inadequacies and constraints. For LVCSR (large vocabulary continuous speech
recognition) systems, the phoneme has been claimed to be the most favourable unit. Hidden Markov models
(HMMs) and neural networks (NNs) are the widely adopted approaches for the acoustic modeling of speech
recognition systems.
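
To make the role of a hidden Markov model more concrete, the sketch below runs the Viterbi algorithm on a toy two-state HMM and recovers the most likely hidden state sequence for a short sequence of observations. The states, observation symbols and probabilities are invented for illustration; real acoustic models use many phoneme-level states and continuous acoustic features.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for a discrete HMM."""
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]                                # back-pointers for the trace
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = (best[t - 1][prev] * trans_p[prev][s]
                          * emit_p[s][observations[t]])
            back[t][s] = prev
    # Start from the best final state and follow the back-pointers
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy model (all numbers are assumptions made for this example)
states = ["vowel", "consonant"]
start_p = {"vowel": 0.5, "consonant": 0.5}
trans_p = {"vowel": {"vowel": 0.7, "consonant": 0.3},
           "consonant": {"vowel": 0.4, "consonant": 0.6}}
emit_p = {"vowel": {"loud": 0.9, "quiet": 0.1},
          "consonant": {"loud": 0.2, "quiet": 0.8}}

decoded = viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p)
print(decoded)  # ['consonant', 'vowel', 'vowel']
```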

 Language model (LM)

The language model is another significant statistical model of speech recognition. One of its major
objectives is to convey the behaviour of the language: it aims to predict the occurrence of specific word
sequences within the target speech. From the perspective of the recognition engine, this statistical model
helps to narrow the search space to reliable and credible combinations of words.
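
A bigram model is one simple way to predict word sequences in the sense described above. The sketch below estimates the probability of a word given the previous word from a tiny hand-written corpus and uses it to prefer one candidate over another. The corpus, function names and candidate words are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()   # <s> marks sentence start
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def bigram_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    followers = counts.get(prev)
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

corpus = ["turn on the light", "turn on the radio", "turn off the light"]
model = train_bigram_model(corpus)

# Two acoustically similar candidates after the word "the"
candidates = ["light", "night"]
best = max(candidates, key=lambda w: bigram_probability(model, "the", w))
print(best, bigram_probability(model, "the", "light"))
```

Because "night" never follows "the" in this corpus, the model rules it out, which is the search-space reduction the paragraph refers to.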

2.3 Characteristics of speech recognition

Speech recognition systems involve numerous variables, and it is essential to be aware of them in order to
choose an algorithm suitable for the system. The most significant of these variables are described below.
Classification of speech: in a majority of studies, speech is categorized into four types:
 Isolated words: This category typically requires a silence gap between utterances.
 Connected words: Connected-word systems are similar to isolated-word systems; the sole difference is
that they permit separate utterances to run together with only a slight pause between them.
 Continuous speech: Speakers talk more or less normally while the machine determines the content.
These are among the hardest systems to build.
 Spontaneous speech: At a basic level, this can be thought of as speech that is natural sounding and
not robotic or rehearsed.
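
The silence gap that isolated-word systems rely on can be detected with a simple energy threshold. The sketch below splits a signal into word segments wherever the frame energy drops below a threshold. The frame length, threshold and hand-made signal are illustrative assumptions; practical systems use more robust voice activity detection.

```python
def split_on_silence(samples, frame_len=4, threshold=0.1):
    """Return (start, end) sample indices of segments separated by silence."""
    segments = []
    start = None                      # start index of the current segment
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold and start is None:
            start = i                 # speech begins
        elif energy < threshold and start is not None:
            segments.append((start, i))   # speech ended at this frame
            start = None
    if start is not None:             # signal ended while still in speech
        segments.append((start, len(samples)))
    return segments

# Two "words" separated by silence (values are made up for the demo)
signal = ([0.0] * 8 + [0.9, -0.8, 0.7, -0.9] * 2 + [0.0] * 8
          + [0.6, -0.7, 0.8, -0.6] * 3 + [0.0] * 4)
words = split_on_silence(signal)
print(words)  # [(8, 16), (24, 36)]
```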

The size of the vocabulary used in a speech recognition system is essential, as it influences the
complexity and the processing requirements and determines the precision of the system. Some applications
use only a few words, while others need a very large number.
There are no fixed definitions; however, the categories can be described as follows:
 Small vocabulary: around tens of words,
 Medium vocabulary: around hundreds of words,
 Large vocabulary: around thousands of words,
 Very-large vocabulary: around tens of thousands of words.

Speaker dependence:
 Speaker-dependent system: The system requires the user to train it using the user's own voice.
 Speaker-independent system: The system is developed for any general speaker rather than for a
specific speaker.
 Speaker-adaptable system: The system adapts to the traits of the current speaker.

2.4 Speech recognition examples

 Voice Activated Digital Assistants

These are smartphone and computer features such as Siri and Alexa. They are voice activated and draw
information from a vast number of available databases and other digitized sources to respond to commands
or answer questions. These digital assistants transform the way people interact with their devices.

Fig 2.3: Voice activated digital assistant


 Speech Recognition Solutions In Banking

Speech recognition helps banking customers with their personalized queries and responds to requests
about account balances, transactions and payments. It can improve customer satisfaction and loyalty.

 Speech Recognition In Healthcare

Healthcare often demands quick decision-making and responses. Being able to direct patient care
by voice frees the hands of medical professionals and improves both the speed and quality of
healthcare. Less paperwork is needed, health records can be accessed easily, and nursing staff can be
reminded of appointments. Speech recognition can also improve hospital bed administration and patient
data entry, and change service delivery in healthcare.

Fig 2.4: Speech recognition in health care




Voice communication and speech processing have been among the most exciting areas of artificial
intelligence. Speech recognition technology has made it possible for computers to follow human
voice commands and understand human languages. The main goal of the speech recognition area is to
develop techniques and systems for speech input to machines.

The main objective of speech recognition is for a machine to be able to “Listen”, “Understand”, and
“Act upon” the information provided through the voice input. Automatic speaker recognition aims to
analyze, extract, characterize, and recognize information about the speaker’s identity.



Applications, Advantages and Disadvantages

4.1 Applications

Let’s explore the uses of speech recognition applications in different fields:

1. Voice-based speech recognition software is now used to initiate purchases, send emails, and
transcribe meetings, doctor appointments and court proceedings.
2. Virtual assistants or digital assistants and smart home devices use voice recognition software to
answer questions, provide weather news, play music, check traffic, place an order, and so on.
3. Companies like Venmo and PayPal allow customers to make transactions using voice assistants.
Several banks in North America, including Canada, also provide online banking using voice-based software.
4. Ecommerce is significantly powered by voice-based assistants and allows users to make purchases
quickly and seamlessly.
5. Speech recognition is poised to impact transportation services and streamline scheduling, routing,
and navigating across cities.
6. Podcasts, meetings, and journalist interviews can be transcribed using voice recognition. It is also
used to provide accurate subtitles to a video.
7. There has been a huge impact on security through voice biometry where the technology analyses the
varying frequencies, tone and pitch of an individual’s voice to create a voice profile. An example of
this is Switzerland’s telecom company Swisscom which has enabled voice authentication technology
in its call centres to prevent security breaches.
8. Customer care services are increasingly handled by AI-based voice assistants and chatbots, which
automate repeatable tasks.
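
Pitch is one of the voice characteristics mentioned in item 7 above. As a minimal sketch of how a pitch value might be extracted, the code below estimates the fundamental frequency of a signal by picking the peak of its autocorrelation. The synthetic tone, search range and function name are assumptions for illustration; this is not the method of any vendor named above, and a real voice profile combines many more features.

```python
import math

def estimate_pitch(samples, sample_rate, min_hz=50, max_hz=400):
    """Estimate the fundamental frequency by autocorrelation peak picking."""
    min_lag = int(sample_rate / max_hz)     # shortest period considered
    max_lag = int(sample_rate / min_hz)     # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and a copy shifted by `lag` samples
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

rate = 8000
voice = [math.sin(2 * math.pi * 125 * t / rate) for t in range(800)]
pitch = estimate_pitch(voice, rate)
print(f"estimated pitch: {pitch:.1f} Hz")
```

Comparing such measurements against a stored profile is one ingredient of voice authentication.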


4.2 Advantages
 It can help to increase productivity in many businesses, such as in healthcare industries.

 It can capture speech much faster than you can type.

 You can use text-to-speech in real-time.

 The software can spell with the same ability as any other writing tool.

 Helps those who have problems with speech or sight.

4.3 Disadvantages
 Lack of Accuracy and Misinterpretation.
 Time Costs and Productivity.
 Accents and Speech Recognition.
 Background Noise Interference.
 Physical Side Effects.



Speech is the primary and most convenient way for people to communicate. We are at a turning point
where voice and natural language understanding are suddenly at the forefront. The main goal of the speech
recognition area is to develop techniques and systems for speech input to the machine. Since humans
perform speech recognition as a daily activity, it is one of the most integrating areas of machine
intelligence. Speech recognition has had a technological impact on society and is expected to flourish
further in the area of human-machine interaction.

With significant advances in voice technologies, users will need to spend less time performing lengthy
searches or transcribing large volumes of voice data into text. This new technology should also set a new
mark in brand building through new-age voice dynamics enabled by AI. Further innovation can offer
companies in the field of speech recognition a wide horizon of opportunity.


[1] Laszlo, T. (2018). Deep Neural Networks with Linearly Augmented Rectifier Layers for Speech
Recognition, SAMI 2018 IEEE 16th World Symposium on Applied Machine Intelligence and Informatics,
February 7-10, Košice, Herl'any, Slovakia.

[2] Yuki, S., Shinnosuke, T. (2018). Statistical Parametric Speech Synthesis Incorporating Generative
Adversarial Networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1).

[3] Michael, P., James, G., Anantha, P.C. (2018). A Low Power Speech Recognizer and Voice Activity
Detector Using Deep Neural Networks, IEEE Journal of Solid-State Circuits, 53(1).
