
A
Case Study
On
Speech Recognition:
Natural Language Processing
Submitted
For
Masters in Computer Application
At

Submitted To:
Dr. Gourav Bathla
Asst. Professor, MCA

Submitted By:
Naina Nautiyal (SAP ID: 500090567)
Kashish Khandelwal (SAP ID: 500095339)
Misba Parveen (SAP ID: 500095338)

INDEX

1. Contribution to the case study
2. Introduction to Speech Recognition
3. Techniques used in Speech Recognition
4. History of Speech Recognition
5. Applications of Speech Recognition
6. Different Speech Recognition Software
7. Overview of Siri
8. Features of Siri
9. Use of Siri
10. Working of Siri
11. References

Contribution to the case study of Speech Recognition:

1. INTRODUCTION TO SPEECH RECOGNITION – MISBA

2. TECHNIQUES USED IN SPEECH RECOGNITION – KASHISH

3. HISTORY OF SPEECH RECOGNITION – NAINA

4. APPLICATIONS OF SPEECH RECOGNITION – NAINA

5. DIFFERENT SPEECH RECOGNITION SOFTWARE – KASHISH

6. OVERVIEW OF SIRI – MISBA + KASHISH

7. FEATURES OF SIRI – NAINA + MISBA + KASHISH

8. USE OF SIRI – MISBA

9. WORKING OF SIRI – NAINA



MY PART OF THE CONTRIBUTION IN DETAIL:

HISTORY OF SPEECH RECOGNITION:


In this section, I have traced the evolution of speech recognition
technology and how companies are improving speech recognition
accuracy day by day, using certain facts beginning in 1922, when the
first toy to use speech recognition (Radio Rex) was created, up to the
present day, when the tech giants (Apple and Google) rule this domain.

APPLICATION OF SPEECH RECOGNITION:


In this section, I have covered most of the fields and areas, such as
Automotive, Technology, Healthcare, Sales, and Marketing and
Advertising, that use the various applications of speech recognition
technology, saving time and lives and helping businesses and
consumers.

FEATURES OF SIRI:
Together with my group members, I used Siri on different iOS
platforms, and my contribution was to explore the features of Siri
pointwise in that section.

WORKING OF SIRI:
In this section, I first explain the working of Siri with a diagram. I
identified the reasons for Siri's human-like behaviour and briefly
explained how a command given to Siri goes through four stages,
drawing on certain references and research work.

CASE STUDY: SPEECH RECOGNITION


INTRODUCTION
Speech recognition is the process of converting human audio signals into words
or instructions. It is an important line of research in speech signal processing
and pattern recognition. Computer science, artificial intelligence, digital signal
processing, pattern recognition, acoustics, linguistics, and cognitive science are
just a few of the disciplines involved in speech recognition research; it is a
complex, multidisciplinary field, and different lines of research have emerged
based on research tasks under different constraints. Systems can be broken
down into isolated-word, connected-word, and continuous speech recognition
systems according to the demands of the speaker's speaking style; into
speaker-dependent and speaker-independent systems depending on how reliant
they are on a particular speaker; and into small-vocabulary, medium-vocabulary,
large-vocabulary, and unlimited-vocabulary systems based on the size of the
vocabulary.

Many speech recognition apps and devices are available, but more advanced
solutions use artificial intelligence and machine learning. To interpret and
process human speech, they integrate the grammar, syntax, structure, and
composition of audio and voice signals. Ideally, these systems generate better
responses with each encounter and learn as they go.
The best systems also allow organizations to customize and tailor the
technology to their specific requirements: everything from language and
nuances of speech to brand recognition. For example:
• Language weighting: Increase accuracy by weighting specific words that are
spoken frequently (such as product names or industry jargon) over and above
terms already in the core vocabulary.
• Speaker Tagging: Transcript output that cites or tags each speaker's
contributions to a multi-participant conversation.
• Acoustic training: Address the acoustic side of the business. Teach the
system to adapt to the acoustic environment (such as ambient noise in a call
center) and speaker styles (such as pitch, volume, and tempo).
• Profanity Filtering: Use filters to identify certain words or phrases and
sanitize your speech output.

Meanwhile, speech recognition is still developing. Companies like IBM are
making inroads in several areas to further improve human-machine interaction.

TECHNIQUES USED IN SPEECH RECOGNITION


Various algorithms and computing techniques are used to convert speech to
text and improve transcription accuracy. Some of the most widely used
techniques are discussed briefly below:
• Natural Language Processing (NLP): While not a specific speech recognition
algorithm, NLP is the area of artificial intelligence concerned with
human-machine interaction through spoken and written language. Many mobile
devices incorporate speech recognition into their systems to perform voice
searches (as Siri does) or to provide better access to text messages.

• Hidden Markov Models (HMM): Hidden Markov models are based on the
Markov chain model, which states that the probability of the next state depends
only on the current state, not on the states that came before it. While a Markov
chain model is useful for observable events such as text inputs, hidden Markov
models let us incorporate hidden events, such as part-of-speech tags, into a
probabilistic model. They are used as sequence models in speech recognition,
assigning a label to each unit (i.e., word, syllable, sentence, etc.) in the
sequence. These labels create a mapping with the input provided, allowing the
model to determine the most appropriate sequence of labels.
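To make the HMM idea concrete, here is a minimal toy sketch: a two-state model decoded with the Viterbi algorithm, which picks the most likely sequence of hidden labels for a sequence of observed words. All states, words, and probabilities below are invented for illustration, not taken from a real recognizer.

```python
# Toy HMM with a Viterbi decoder: find the most likely sequence of
# hidden labels (here, parts of speech) for a sequence of observations.
states = ["Noun", "Verb"]
start_p = {"Noun": 0.6, "Verb": 0.4}                       # P(first state)
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7},             # P(next | current)
           "Verb": {"Noun": 0.6, "Verb": 0.4}}
emit_p = {"Noun": {"dogs": 0.7, "run": 0.3},               # P(word | state)
          "Verb": {"dogs": 0.1, "run": 0.9}}

def viterbi(obs):
    # V[t][s] = (probability of the best path ending in state s at time t,
    #            the previous state on that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            V[t][s] = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
    # Trace the back-pointers from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["dogs", "run"]))  # ['Noun', 'Verb']
```

With these numbers, "dogs" is most likely emitted by the Noun state and "run" by the Verb state, so the decoder labels the sequence Noun, Verb.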
• N-grams: This type of language model (LM), which assigns probabilities to
sentences or phrases, is by far the most basic. An N-gram is a sequence of N
words. For example, "order a pizza" is a trigram (3-gram), and "please order a
pizza" is a 4-gram. Grammar and the probability of certain word sequences are
used to improve recognition accuracy.
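A toy sketch of the idea, assuming a maximum-likelihood bigram (2-gram) model estimated from a tiny invented corpus; real systems train on far larger text and add smoothing for unseen word pairs:

```python
# Toy bigram language model: estimate P(word | previous word) from counts.
from collections import Counter

corpus = "order a pizza please order a salad please order a pizza".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
contexts = Counter(corpus[:-1])              # counts of each "previous word"

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / contexts[prev]

print(bigram_prob("order", "a"))             # 1.0: "order" always precedes "a"
print(round(bigram_prob("a", "pizza"), 3))   # 0.667: "a pizza" in 2 of 3 cases
```

A recognizer uses such probabilities to prefer likely word sequences over acoustically similar but improbable ones.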
• Neural Networks: Primarily used for deep learning algorithms, neural
networks process training data by mimicking the interconnections of the human
brain through layers of nodes. Each node consists of inputs, weights, bias (or
threshold), and output. If this output value exceeds a given threshold, it "fires"
or activates the node and passes the data to the next layer in the network. Neural
networks learn this mapping function through supervised learning, adjusting it
via gradient descent to minimize a loss function. While neural networks
tend to be more accurate and can accept more data, this comes at a cost in
performance efficiency, as they tend to be slower to train compared to
traditional language models.
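A single node as described above can be sketched in a few lines. The inputs, weights, and bias here are hand-picked for illustration, not learned; a trained network would arrive at its weights via gradient descent.

```python
# One neural-network node: weighted inputs plus a bias, passed through a
# step activation that "fires" (outputs 1) when the total exceeds zero.
def node(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

# Example with two inputs and hand-picked parameters:
print(node([1.0, 0.5], [0.6, 0.4], -0.5))  # 0.6 + 0.2 - 0.5 = 0.3 > 0 -> 1
print(node([0.1, 0.1], [0.6, 0.4], -0.5))  # 0.06 + 0.04 - 0.5 < 0  -> 0
```

In a real network many such nodes are stacked in layers, and smooth activations replace the hard threshold so gradients can flow during training.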

HISTORY OF SPEECH RECOGNITION


Speech recognition has gone from recognizing a single word or a few
syllables to recognizing entire languages! It has certainly come a long way
since the beginning of the 20th century.

• 1922 – The first toy to use speech recognition, the Radio Rex, was created.
Radio Rex was a brown bulldog who came out of his kennel when he heard his
name.

• 1928 - Homer Dudley invented the vocoder (short for "voice coder") at Bell
Labs in New Jersey, the first machine that could generate human speech
electronically when a person typed words on a special keyboard.

• 1939 - Dudley's vocoder was introduced at the World's Fair at the AT&T
building in New York.

• The 1950s – Other labs and scientists developed speech recognition machines
that could recognize 10 syllables or 10 vowels, and speech recognition software
gradually evolved further. Notably, in 1952, Davis et al. at Bell Labs developed
a tool known as the "Audrey" system that could recognize speech (isolated
digits) from a single speaker.

• 1962 – At the World's Fair, IBM exhibited a system that could recognize 16
English words. It was called the "Shoebox" machine.

• 1971-1976 – The Department of Defense saw the importance of speech
recognition and funded the DARPA Speech Understanding Research program.
It contributed funds to Carnegie Mellon University to create Harpy, a machine
that could understand 1,011 words!

• The 1980s – The Hidden Markov Model was groundbreaking because it
recognized that certain unknown sounds could be real words.

• 1990 - Dragon came out with Dragon Dictate, a dictation software that could
recognize human speech and dictate it into a word processing program. It was
very expensive! About $9000! Now you can download the Dragon dictation
app to your smartphone for just a few dollars!

• In 2000, speech recognition software was about 80% accurate.

• Today – Google and Apple are among the companies at the forefront of
speech recognition, with speech recognition capabilities in Google Search,
Google Maps, and Siri on the iPhone. Google now says its speech recognition
accuracy is about 92%.

Applications of Speech Recognition


Many industries today use various applications of speech technology, helping
businesses and consumers save time and even lives. Some examples:
1) Automotive: Speech recognition enhances driver safety by enabling
voice-activated navigation systems and in-car radio search capabilities.
In-vehicle systems such as Ford’s SYNC 3 and GM’s OnStar let drivers
control navigation, climate, audio/visual entertainment, and other
functions with voice commands.

2) Technology: Virtual agents are increasingly integrated into our daily lives,
especially on our mobile devices. We access them with voice commands
through our smartphones, such as Google Assistant or Apple's Siri, for tasks
such as voice search, or through our smart speakers, via Amazon Alexa or
Microsoft Cortana,
to play music. They will continue to be integrated into the everyday products we
use, fuelling the "Internet of Things" movement.

3) Healthcare: Doctors and nurses use dictation apps to capture and record
patient diagnoses and treatment notes. Health professionals may be more
efficient because of speech recognition. It enables them to care for more
patients concurrently. Furthermore, it makes inter-department communication
more efficient and effective by accelerating turnaround time, and saving
healthcare institutions a significant amount of money.

4) Sales: Speech recognition technology has several applications in sales. It can


help a call center transcribe thousands of phone calls between customers and
agents to identify common call patterns and problems. AI chatbots can also talk
to people through a website, answering common questions and addressing basic
requests without having to wait for a contact center agent to be available. In
both cases, speech recognition systems help reduce the time it takes to resolve
consumer issues.

5) Hands-free Interaction with Mobile Devices: Speech recognition also
enables hands-free use of mobile devices such as smartphones and tablets.
This is especially helpful for tasks like making phone calls, texting, and doing
online searches that would otherwise require a lot of manual input. For
instance, users can use
their voice to interact with their mobile device by making statements like "call
John Smith" or "find nearby restaurants."

6) Accessibility: Speech recognition technology has the potential to


significantly improve accessibility, both in terms of encouraging access to
information and services and in terms of enhancing interaction with our
surroundings.

7) Marketing and Advertising: Speech recognition is a valuable campaign


optimization and evaluation tool. For example, it can be used to track calls to
sales or customer support lines to spot areas that require improvement. ASR can
also be used to examine telephone interviews or focus groups that have been
recorded to gain feedback on an item or service.

8) Voice biometrics: Voice biometrics is an emerging speech recognition
technology that uses a person's particular vocal characteristics to verify their
identity. Potential uses include fraud prevention and security.

DIFFERENT SPEECH RECOGNITION (VIRTUAL ASSISTANT) SOFTWARE
Microsoft Cortana - Microsoft released this virtual assistant in 2014. Although
it is available for Android and iOS users, Windows users use it the most. You
can use Cortana to manage your calendar, open apps on the computer, join
Microsoft Teams meetings, and set reminders. Depending on the software
platform and the region where it is used, Cortana is available in English,
Portuguese, French, German, Italian, Spanish, Chinese, and Japanese.

In addition to understanding voice commands and performing tasks, Cortana is
integrated into the Microsoft 365 product family and into Windows 10
(version 2004 and later). Along with conducting typical web searches, Cortana
can organize and manage your daily meetings, appointment timers, and more.

Google Assistant - Google Now marked the start of the industry's creation of
online assistants: users could perform voice searches for information thanks to
the capabilities of Google Search. After a while, Google halted work on that
project and launched Google Assistant in 2016. It was initially built into
Google Pixel smartphones and Google Home smart speakers, and then spread
like wildfire to all new Android phones. The fact that Samsung offers Google
Assistant on its devices in addition to Bixby, its own virtual assistant, says a
lot: the bulk of Android phone users rely on Google Assistant.

Nuance Dragon Assistant and Dragon NaturallySpeaking - Dragon
NaturallySpeaking is speech recognition software created by Nuance
Communications. The Dragon Dictate application mentioned earlier in this
report has evolved over the years and is now known as Dragon
NaturallySpeaking. The company also sells Dragon Assistant, a personal
assistant for PCs. Carleton currently uses Dragon NaturallySpeaking as its
speech-to-text program. For voice-driven computer navigation and word
processing, students may find Dragon a handy option.

Samsung Bixby - In the vein of Apple's Siri, Amazon's Alexa, and Google
Assistant, Bixby is Samsung's own AI-powered personal assistant. You can use
voice and text inputs with Bixby to perform many of the common operations
you'd carry out on your smart device. Although it can be found on several
Samsung products, including TVs and refrigerators, it primarily resides on
Samsung smartphones.

CASE STUDY: COMPANY-SPECIFIC APPLE’S SIRI


OVERVIEW OF SIRI
Siri is the virtual assistant included in Apple Inc.'s iOS, iPadOS, watchOS,
macOS, tvOS, and audioOS operating systems. It uses voice prompts, gesture
control, focus tracking, and a natural-language user interface to answer
questions, make recommendations, and perform actions by delegating requests
to a set of Internet services.

It adapts to individual language usage, searches, and user preferences with
continued use, and returns individualized results.

Siri is a spin-off of a project developed by the SRI International Artificial


Intelligence Center. Its speech recognition engine is provided by Nuance
Communications and uses advanced machine learning technologies to operate.

Its original American, British, and Australian voice actors recorded their
respective voices around 2005, unaware of how the recordings would
eventually be used. Siri was released as a standalone iOS app in February 2010.

Two months later, it was acquired by Apple and integrated into the iPhone 4S
at its release on October 4, 2011, and the standalone app was removed from the
iOS App Store. Since then, Siri has been an integral part of Apple products and
has been adapted to other hardware devices, including newer models of the
iPhone, iPad, iPod Touch, Mac, AirPods, Apple TV, and HomePod.

Siri supports a wide range of user commands, including performing phone


actions, checking basic information, scheduling events and reminders,
manipulating device settings, searching the Internet, navigating areas, and
searching for entertainment information, and can work with applications
integrated into iOS. After iOS 10 was launched in 2016, Apple enabled a
limited group of third-party companies to use Siri, including messaging,
payment, ride-sharing, and Internet calling apps. With the release of iOS 11,
Apple updated Siri's voice and added support for follow-up questions, language
translation, and other third-party actions.

The original release of Siri on the iPhone 4S in 2011 received mixed reviews. It
received praise for its voice recognition and contextual knowledge of user
information, including calendar appointments, but was criticized for requiring
heavy user commands and lacking flexibility. It was also criticized for its lack
of information about certain nearby locations, and for its inability to understand
certain English accents.

In 2016 and 2017, several media reports stated that Siri lacked innovation,
especially compared with newer competing voice assistants. Reports pointed to
Siri's limited feature set, "poor" voice recognition, and underdeveloped service
integration as causes of Apple's difficulties in AI and cloud services; the
complaints were allegedly rooted in stifled development, brought on by Apple's
prioritization of user privacy and by struggles for executive power within the
company.
The death of Steve Jobs, which occurred one day after Siri's premiere, also
cast a shadow over it.

FEATURES OF SIRI
Apple offers a wide variety of voice commands to interact with Siri, including
but not limited to:

• Phone and text actions such as "Call Sarah", "Read new messages", "Set timer
for 10 minutes" and "Email Mom"

• Look up information, such as "What's the weather like today?" and "How
many dollars are in a euro?"

• Find basic facts, including "How many people live in France?" and "How high
is Mount Everest?". Siri usually uses Wikipedia to answer.

• Event scheduling and reminders, including "Schedule an appointment" and


"Remind me..."

• Manipulate device settings such as "Take a picture", "Turn off Wi-Fi" and
"Increase brightness"

• Search the Internet, including "Define...", "Find Images..." and "Search


Twitter..."

• Directions, like "Take me home" and "What's the traffic like on the way
home?"

• Translate words and phrases from English into several languages, for example,
"How do you say where is the nearest hotel in French?"

• Entertainment, such as "What are the basketball games today?", "What movies
are playing near me?" and "What is the content of...?"

• Engage with iOS-integrated apps, with commands such as "Like" and "Pause Apple Music".

• Handle payments with Apple Pay, such as "Apple Pay $25 to Mike for concert
tickets" or "Send $41 to Ivana."

• Siri also offers numerous pre-programmed answers to fun questions. Such
questions include "What is the meaning of life?", to which Siri can reply "All
evidence so far points to it being chocolate"; "Why am I here?", to which it can
reply "I don't know. To be honest, I've wondered about that myself"; and "Will
you marry me?", to which it might respond "My End User License Agreement
does not cover marriage. I'm sorry."

• Siri was initially limited to a female voice; in June 2013, Apple announced
that Siri would offer gender selection, adding a male voice counterpart.

• In September 2014, Apple added the ability for users to speak "Hey Siri" to
activate the assistant without requiring physical handling of the device.

• In September 2015, the "Hey Siri" feature was updated to include personalized
voice recognition, a presumed effort to prevent activation by a non-owner user.

• With the announcement of iOS 10 in June 2016, Apple opened limited access
for third-party developers to Siri through a dedicated application programming
interface (API). The API limits the use of Siri to using third-party messaging
apps, payment apps, ride-sharing apps, and internet calling apps.

• In iOS 11, Siri can handle follow-up questions, supports language translation,
and opens up more third-party actions, including task management.
Additionally, users can type into Siri, and a new privacy-focused "learning on
device" technique improves Siri's suggestions by privately analyzing the
personal usage of various iOS apps.

Siri can use signals such as location, time of day, and movement type (such as
walking, running, or driving) to intelligently predict the right time and place and
suggest actions from your app. Depending on the information your app is
sharing and people's current context, Siri may offer shortcut suggestions on the
lock screen, in search results, or on the Siri watch face. For instance, Siri can
use the Calendar app to add an event shared by your app. Siri can also use
certain types of information to suggest actions that system apps support. Here
are some example scenarios.

• Shortly before 7:30 a.m., Siri can suggest an action to order coffee for
people who use the coffee app every morning.

• After people use a checkout-type app to buy movie tickets, Siri can remind
them to turn on Do Not Disturb shortly before a screening.

• Siri can suggest automation that starts a workout in the user's favorite
exercise app and plays their favorite workout playlist when they enter their
usual gym.

• When people enter the airport after a flight home, Siri can suggest that they
request a ride home from their favorite ride-sharing app.

USE OF SIRI
People can use Siri to get things done while in the car, exercising, using apps
on the device, or interacting with the HomePod. You don't always know the
context in which people are using Siri to perform your app's actions, so
flexibility is key to making sure people have a great experience no matter what
they're doing.

To communicate with people regardless of their current context, you should
provide information that Siri can deliver both vocally and visually. Supporting
voice content as well as on-screen content allows Siri to decide which form of
communication is best for people in their current situation. For instance, if
somebody wearing AirPods says "Hey Siri," Siri can speak to them through
the AirPods.

In voice-only situations, Siri verbally describes the information that would
otherwise be presented on screen. Consider a food delivery app that requires
people to confirm a transaction before completing an order. In a voice-only
scenario, Siri might say, "Your total is fifteen dollars and your order will take
thirty minutes to arrive at your door. Ready to order?" When a screen is
available, Siri can display those details and simply ask, "Ready to order?"

When supporting your app's intents, you are responsible for providing the
voice-only dialogue that describes this kind of on-screen information.

WORKING OF SIRI

Siri's human-like behaviour is the result of a revolutionary union of artificial
intelligence and natural language processing. It is a system made to listen to,
understand, and process users' requests and, where possible, provide an
appropriate outcome. When we ask Siri to do something, the command goes
through four stages:

Stage 1: Voice Recognition

When you speak to Siri, it records your voice, turns it into a data file, and then
sends the file to Apple's servers. It must account for your accent, dialect, and
subtle vocal variations, as well as any speech impediments you may have. It
also has to distinguish your voice from background noise.

Stage 2: Connecting to Apple servers

Siri records your request, converts it to a file, and then sends it to Apple servers
for processing. For this reason, Siri cannot run without an Internet connection.
Your spoken words are processed through many flowchart branches once they
are in the Apple servers to come up with a potential answer. The computers
already have a sizable database of questions and their likely responses, so it is


typically not difficult to find the answer to popular queries like "What is the best
spot in India?" or "Which type of clothes does the American prefer during
snowfall?"

If Siri cannot work out what the user asked for, the command is discarded and
Siri gives a standard response such as "Would you like to search the web for
that?"

Stage 3: Understanding the command’s meaning

In this stage, the system tries to understand what you actually want done.
Natural Language Processing is used to make Siri as intuitive as a machine
can be.

Stage 4: Producing the results

It is great that Siri can understand what you're saying, but what does that
matter if it doesn't actually carry out your request? For Siri to give you the
results you want, other apps on your phone must communicate with it. Take
the example of setting a reminder: in this situation, Siri must "speak" to the
Organiser app to set the reminder at the specified time.

After completing all four stages, Siri reports the result either by speaking or by
displaying text, so that we know the status of the task it performed.
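The four stages described above can be sketched as a toy pipeline. Everything here is an invented illustration: the command table and function names are hypothetical and are not Apple's actual implementation, which runs server-side on real audio.

```python
# Toy four-stage pipeline mirroring the stages described above.
KNOWN_COMMANDS = {  # stand-in for the server-side question/answer database
    "set a reminder": "Reminder set.",
    "what is the weather": "It is sunny today.",
}

def recognize_voice(audio):
    # Stage 1: in reality the audio is converted into a data file; here we
    # pretend the "audio" is already text and just normalize it.
    return audio.strip().lower()

def send_to_server(text):
    # Stage 2: look the request up in the server's database of known queries.
    return KNOWN_COMMANDS.get(text)

def produce_result(answer):
    # Stages 3-4: if the command was understood, act on it; otherwise fall
    # back to the standard response.
    return answer if answer else "Would you like to search the web for that?"

def siri(audio):
    return produce_result(send_to_server(recognize_voice(audio)))

print(siri("Set a reminder"))  # Reminder set.
print(siri("Sing me a song"))  # Would you like to search the web for that?
```

The second call shows the fallback path: a command missing from the database is discarded and answered with the standard web-search prompt.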

REFERENCES
https://en.wikipedia.org/wiki/Siri
https://www.sciencedirect.com/topics/engineering/speech-recognition
https://www.ibm.com/cloud/learn/speech-recognition
https://study.com/academy/lesson/speech-recognition-history-fundamentals.html
https://developer.apple.com/design/human-interface-guidelines/technologies/siri/introduction
"Top Six Use Cases for Automatic Speech Recognition (ASR)" (linkedin.com)
"Speech and Voice Recognition Technology in Healthcare | Voice Search" (delveinsight.com)
"What Is Cortana? A Guide to Microsoft's Virtual Assistant" (businessinsider.com)
"What Is Google Assistant?" (techjunkie.com)
"What is Bixby? A complete guide to Samsung's smart AI assistant" (trustedreviews.com)
https://www.scienceabc.com/innovation/what-is-siri-app-working-apple-eyes-free-artificial-intelligence-voice-recognition-natural-language-processing.html#how-does-siri-work
https://machinelearning.apple.com/research/hey-siri
