CASE STUDY - Speech Recognition
A Case Study on Speech Recognition: Natural Language Processing
Submitted for the Masters in Computer Application
FEATURES OF SIRI:
My group members and I used Siri on different iOS platforms, and my contribution was to explore the features of Siri, which I have listed point by point in this section.
WORKING OF SIRI:
In this section, I explain the working of Siri, starting with a diagram. Drawing on several references and some research work, I identify the reasons for Siri's human-like behavior and briefly describe how a command given to Siri passes through four stages.
Many speech recognition apps and devices are available, but the more advanced solutions use artificial intelligence and machine learning. To interpret and process human speech, they integrate the grammar, syntax, structure, and composition of audio and voice signals.
• Hidden Markov Models (HMM): Hidden Markov models are based on the Markov chain model, which states that the probability of a given state depends only on the current state, not on the states before it. While a Markov chain model is useful for observable events such as text inputs, hidden Markov models let us incorporate hidden events, such as parts of speech, into a probabilistic model. They are used as sequence models in speech recognition, assigning a label to each unit (words, syllables, sentences, etc.) in order. These labels create a mapping with the provided input, allowing the model to determine the most appropriate sequence of labels.
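To make the label-sequencing idea concrete, here is a toy sketch of Viterbi decoding over a hidden Markov model in Python. The states, probabilities, and example words are invented for illustration and are not drawn from any real speech-recognition system.

```python
# Toy Viterbi decoding for a hidden Markov model: given observed words,
# recover the most likely sequence of hidden labels (here, parts of speech).
# All states, probabilities, and words below are illustrative assumptions.

states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {  # P(next state | current state)
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit_p = {  # P(observed word | hidden state)
    "NOUN": {"dogs": 0.5, "bark": 0.1, "run": 0.4},
    "VERB": {"dogs": 0.1, "bark": 0.5, "run": 0.4},
}

def viterbi(observations):
    # V[t][s] = probability of the best label path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Walk backwards from the most probable final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["dogs", "bark"]))  # → ['NOUN', 'VERB']
```

The backpointer table is what lets the model output a whole sequence of labels rather than a single guess per word.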
• N-grams: This type of language model (LM), which assigns probabilities to sentences or phrases, is by far the most basic. An N-gram is a sequence of N words. For example, "order a pizza" is a trigram (3-gram), while "please order a pizza" is a 4-gram. Grammar and the probabilities of certain word sequences are used to improve recognition accuracy.
• Neural Networks: Primarily used in deep learning algorithms, neural networks process training data by mimicking the interconnections of the human brain through layers of nodes. Each node consists of inputs, weights, a bias (or threshold), and an output. If the output value exceeds the given threshold, the node "fires" or activates and passes its data to the next layer of the network. Neural networks learn this mapping function through supervised learning, adjusting via gradient descent on a loss function. While neural networks tend to be more accurate and can accept more data, this comes at a cost in performance efficiency: they tend to be slower to train than traditional language models.
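The node described above can be sketched in a few lines of Python; the weights, bias, and threshold below are invented for illustration.

```python
# A single artificial "node": a weighted sum of inputs plus a bias, passed
# through a threshold activation. Weights and inputs are illustrative.

def node(inputs, weights, bias, threshold=0.0):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The node "fires" (outputs 1) only if the value exceeds the threshold.
    return 1 if total > threshold else 0

# Example: with these weights and bias, the node fires only when both
# inputs are active, behaving like a logical AND gate.
weights = [0.6, 0.6]
bias = -1.0
print(node([1, 1], weights, bias))  # 1
print(node([1, 0], weights, bias))  # 0
```

Training a network amounts to nudging the weights and bias of many such nodes, via gradient descent, until the layered outputs match the labels.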
• 1922 – The first toy to use speech recognition, the Radio Rex, was created.
Radio Rex was a brown bulldog who came out of his kennel when he heard his
name.
• 1928 - Homer Dudley invented the vocoder (short for "voice coder") at Bell
Labs in New Jersey, the first machine that could generate human speech
electronically when a person typed words on a special keyboard.
• 1939 - Dudley's vocoder was introduced at the World's Fair at the AT&T
building in New York.
• the 1950s – Other labs and scientists developed speech recognition machines that could recognize 10 syllables or 10 vowels, and speech recognition software gradually evolved. Notably, in 1952, Davis et al. at Bell Labs developed a tool known as the "Audrey" system that could recognize speech (isolated digits) from a single speaker.
• 1962 – At the World's Fair, IBM exhibited a system that could recognize 16 English words. It was called the "Shoebox" machine.
• 1990 – Dragon released Dragon Dictate, dictation software that could recognize human speech and dictate it into a word processing program. It was very expensive, at about $9,000! Now you can download the Dragon dictation app to your smartphone for just a few dollars.
• Today – Google and Apple are among the companies at the forefront of speech recognition, with capabilities built into Google Search, Google Maps, and Siri on the iPhone. Google now puts its speech recognition accuracy at about 92%.
2) Technology: Virtual agents are increasingly integrated into our daily lives, especially on our mobile devices. We access them through voice commands on our smartphones, via Google Assistant or Apple's Siri, for tasks such as voice search, or through our speakers, via Amazon Alexa or Microsoft Cortana, to play music. They will continue to be integrated into the everyday products we use, fuelling the "Internet of Things" movement.
3) Healthcare: Doctors and nurses use dictation apps to capture and record patient diagnoses and treatment notes. Speech recognition can make health professionals more efficient, enabling them to care for more patients concurrently. It also makes inter-department communication more effective by accelerating turnaround times, saving healthcare institutions a significant amount of money.
Google Assistant – Google Now marked the start of the industry's push into online assistants: thanks to Google Search, users could perform voice searches for information. After a while, Google halted work on that project and launched Google Assistant in 2016. It was initially built into Google Pixel smartphones and Google Home smart speakers, then spread like wildfire to all new Android phones. The fact that Samsung ships Google Assistant on its devices in addition to Bixby, its own virtual assistant, says a lot: with the bulk of Android customers using it, it would be wasteful not to offer Google Assistant.
Samsung Bixby – In the vein of Apple's Siri, Amazon's Alexa, and Google Assistant, Bixby is Samsung's own AI-powered personal assistant. You can use voice and text input with Bixby to perform many of the common operations you'd carry out on your smart device. Although it can be found on several Samsung products, including TVs and refrigerators, Samsung smartphones are where it primarily resides.
Its original American, British, and Australian voice actors recorded their respective voices around 2005, unaware of the eventual use of the recordings. In February 2010, Siri was released as an iOS app. Two months later it was acquired by Apple, which integrated it into the iPhone 4S at that device's release on October 4, 2011, and removed the standalone app from the iOS App Store. Since then, Siri has been an integral part of Apple products and has been adapted to other hardware devices, including newer models of the iPhone, iPad, iPod Touch, Mac, AirPods, Apple TV, and HomePod.
The original release of Siri on the iPhone 4S in 2011 received mixed reviews. It was praised for its voice recognition and contextual knowledge of user information, including calendar appointments, but criticized for requiring stiff user commands and lacking flexibility. It was also criticized for its lack of information about certain nearby locations and for its inability to understand certain English accents.
In 2016 and 2017, several media reports said that Siri lacked innovation, especially against newer competing voice assistants. The reports cited Siri's limited feature set, "poor" voice recognition, and underdeveloped service integration as causes of Apple's difficulties in AI and cloud services; the complaints were allegedly rooted in stifled development, caused by Apple's prioritization of user privacy and struggles for executive power within the company.
The death of Steve Jobs one day after Siri's premiere also cast a shadow over the launch.
FEATURES OF SIRI
Apple offers a wide variety of voice commands to interact with Siri, including
but not limited to:
• Phone and text actions such as "Call Sarah", "Read new messages", "Set timer
for 10 minutes" and "Email Mom"
• Look up a few things, such as "What's the weather like today?" and "How
many dollars are in euros?"
• Find basic facts, including "How many people live in France?" and "How high
is Mount Everest?". Siri usually uses Wikipedia to answer.
• Manipulate device settings such as "Take a picture", "Turn off Wi-Fi" and
"Increase brightness"
• Directions, like "Take me home" and "What's the traffic like on the way home?"
• Translate words and phrases from English into several languages, for example,
"How do you say where is the nearest hotel in French?"
• Entertainment, such as "What are the basketball games today?", "What movies
are playing near me?" and "What is the content of...?"
• Engage with iOS-integrated apps, such as "Like this song" and "Pause Apple Music".
• Handle payments with Apple Pay, such as "Apple Pay $25 to Mike for concert
tickets" or "Send $41 to Ivana."
• Initially limited to female voices, Apple announced in June 2013 that Siri
would include gender selection and add a male voice counterpart.
• In September 2014, Apple added the ability for users to speak "Hey Siri" to
activate the assistant without requiring physical handling of the device.
• In September 2015, the "Hey Siri" feature was updated to include personalized
voice recognition, a presumed effort to prevent activation by a non-owner user.
• With the announcement of iOS 10 in June 2016, Apple opened limited third-party developer access to Siri through a dedicated application programming interface (API). The API restricts Siri's use to third-party messaging apps, payment apps, ride-sharing apps, and internet calling apps.
• In iOS 11, Siri can handle follow-up questions, supports language translation, and opens up more third-party actions, including task management. Additionally, users can type to Siri, and a new privacy-focused "on-device learning" technique improves Siri's suggestions by privately analyzing personal usage of various iOS apps.
Siri can use signals such as location, time of day, and movement type (such as
walking, running, or driving) to intelligently predict the right time and place and
suggest actions from your app. Depending on the information your app is
sharing and people's current context, Siri may offer shortcut suggestions on the
lock screen, in search results, or on the Siri watch face. For instance, Siri can
use the Calendar app to add an event shared by your app. Siri can also use
certain types of information to suggest actions that system apps support. Here
are some example scenarios.
• Shortly before 7:30 a.m., Siri can suggest an action to order coffee for
people who use the coffee app every morning.
• After people use a checkout-type app to buy movie tickets, Siri can remind
them to turn on Do Not Disturb shortly before a screening.
• Siri can suggest automation that starts a workout in the user's favorite
exercise app and plays their favorite workout playlist when they enter their
usual gym.
• When people enter the airport after a flight home, Siri can suggest that they
request a ride home from their favorite ride-sharing app.
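The scenarios above amount to mapping a context (time, location, motion) to candidate actions. A toy rule-based sketch in Python might look like the following; Apple's actual prediction system is proprietary, so the rules and actions here are purely illustrative.

```python
# A toy rule-based sketch of context-driven suggestions, in the spirit of the
# scenarios above. The contexts, rules, and action strings are invented for
# illustration; this is not how Siri actually ranks suggestions.
from datetime import time

def suggest(context):
    # context: dict with optional keys "time", "location", "motion"
    suggestions = []
    if context.get("location") == "gym":
        suggestions.append("Start workout and play workout playlist")
    if context.get("location") == "airport":
        suggestions.append("Request a ride home")
    if context.get("time") and context["time"] < time(7, 30):
        suggestions.append("Order your usual coffee")
    return suggestions

print(suggest({"location": "gym"}))
print(suggest({"time": time(7, 0), "location": "airport"}))
```

A real system would score candidates statistically from past behavior rather than hard-code rules, but the input signals are the same ones the paragraph lists.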
USE OF SIRI
People can use Siri to get things done while in the car, exercising, using apps on the device, or interacting with the HomePod. You don't always know the context in which people are using Siri to perform your app's actions, so flexibility is key to making sure people have a great experience no matter what they're doing. When supporting your intents, you are responsible for providing voice dialogue that describes only the types of information shown on the screen.
WORKING OF SIRI
When you speak to Siri, it records your voice, converts it into a data file, and sends that file to Apple's servers for processing; this is why Siri cannot run without an Internet connection. The system must account for your accent, dialect, and subtle vocal variations, as well as any speech impediments, and it must also separate your voice from background noise. Once your spoken words reach the Apple servers, they are processed through many flowchart branches to arrive at a potential answer.
If Siri cannot work out what the user asked for, the command is discarded and Siri gives a standard response such as "Would you like to search the web for that?"
In this stage, the system tries to understand what you actually want done. Natural Language Processing is used to make Siri as intuitive as a machine can be.
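As a rough illustration of this intent-understanding stage, the Python sketch below maps an utterance to an intent and its parameters using simple keyword and pattern rules. Real NLP systems use statistical models; the patterns and intent names here are invented.

```python
# A highly simplified sketch of intent understanding: map a spoken command
# to an intent plus parameters with keyword rules. The patterns and intent
# names are invented for illustration, not taken from Siri.
import re

def parse_intent(utterance):
    text = utterance.lower()
    m = re.search(r"remind me to (.+) at (\d{1,2}(?::\d{2})?\s*(?:am|pm)?)", text)
    if m:
        return {"intent": "set_reminder", "task": m.group(1), "time": m.group(2)}
    if "weather" in text:
        return {"intent": "get_weather"}
    return {"intent": "unknown"}  # falls back to "search the web?"

print(parse_intent("Remind me to call Mom at 6 pm"))
```

The "unknown" branch corresponds to the discarded-command case mentioned earlier, where Siri offers a web search instead.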
It is great that Siri can understand what you're saying, but what does it matter if it doesn't actually carry out your request? For Siri to give you the results you want, the other apps on your phone must communicate with it. Take the example of setting a reminder: here, Siri must "speak" to the Organiser app to set a reminder at the specified time.
After completing all four stages, Siri reports the result, either by speaking or by displaying text, so that we know the status of the requested task.
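The four stages can be summarised as a single pipeline. Each function below is a stand-in (the real Siri stack is proprietary and server-side), but the flow mirrors the stages described above: record, recognise, understand, dispatch and report.

```python
# A sketch of the four stages as one pipeline. Every function body here is a
# placeholder standing in for a proprietary component; only the flow between
# the stages reflects the description above.

def record_audio():
    return b"fake-audio-bytes"          # stage 1: capture and encode speech

def speech_to_text(audio):
    return "set a reminder for 6 pm"    # stage 2: server-side recognition

def understand(text):
    # stage 3: intent understanding (NLP)
    if "reminder" in text:
        return {"intent": "set_reminder", "text": text}
    return {"intent": "unknown", "text": text}

def dispatch(intent):
    # stage 4: hand off to the appropriate app, then report the result
    if intent["intent"] == "set_reminder":
        return "OK, I'll remind you."
    return "Would you like to search the web for that?"

print(dispatch(understand(speech_to_text(record_audio()))))  # OK, I'll remind you.
```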
REFERENCES
https://en.wikipedia.org/wiki/Siri
https://www.sciencedirect.com/topics/engineering/speech-recognition
https://www.ibm.com/cloud/learn/speech-recognition
https://study.com/academy/lesson/speech-recognition-history-fundamentals.html
https://developer.apple.com/design/human-interface-guidelines/technologies/siri/introduction
Top Six Use Cases for Automatic Speech Recognition (ASR) (linkedin.com)
Speech and Voice Recognition Technology in Healthcare | Voice Search (delveinsight.com)
What Is Cortana? A Guide to Microsoft's Virtual Assistant (businessinsider.com)
What Is Google Assistant? (techjunkie.com)
What is Bixby? A complete guide to Samsung's smart AI assistant (trustedreviews.com)
https://www.scienceabc.com/innovation/what-is-siri-app-working-apple-eyes-free-artificial-intelligence-voice-recognition-natural-language-processing.html#how-does-siri-work
https://machinelearning.apple.com/research/hey-siri