
CHAPTER 1

INTRODUCTION

Today, the development of artificial intelligence (AI) systems that can support
natural human-machine interaction (through voice, gestures, facial expressions,
etc.) is gaining popularity. One of the most studied and popular directions of
interaction is based on the machine's understanding of natural human language. It
is no longer the human who learns to communicate with the machine; rather, the
machine learns to communicate with the human by studying the user's actions, habits
and behaviour, and trying to become a personalized assistant. Virtual assistants
are software programs that help ease your day-to-day tasks, such as showing
weather reports, creating reminders and making shopping lists. They can take
commands via text (online chatbots) or by voice. Voice-based intelligent
assistants need an invoking word, or wake word, to activate the listener,
followed by the command. Well-known virtual assistants include Apple's Siri,
Amazon's Alexa and Microsoft's Cortana.

This system is designed to be used efficiently on desktops. Personal assistant
software improves user productivity by managing the user's routine tasks and by
providing information from online sources. This project was started on the
premise that there is a sufficient amount of openly available data and
information on the web that can be utilized to build a virtual assistant capable
of making intelligent decisions for routine user activities.

1.1 RELATED WORK

Each company that develops an intelligent assistant applies its own specific
methods and approaches, which in turn affect the final product. One assistant
may synthesize speech more naturally, another may perform tasks more accurately
and without additional explanations or corrections, while others may handle a
narrower range of tasks but do so most precisely and exactly as the user wants.
Evidently, there is no universal assistant that performs all tasks equally well.
The set of characteristics an assistant has depends entirely on which area the
developer has paid more attention to. Since all such systems are based on
machine learning methods and are trained on huge amounts of data collected from
various sources, an important role is played by the source of this data, be it
search engines, various information sources or social networks. The amount of
information from different sources determines the character of the resulting
assistant. Despite the different approaches to learning and the different
algorithms and techniques, the principle of building such systems remains
approximately the same.

1.2 PROPOSED PLAN OF WORK

The work started with analyzing the audio commands given by the user through
the microphone. These can be anything, like retrieving information or operating a
computer's internal files. This is an empirical qualitative study, based on
reading the above-mentioned literature and testing its examples. Tests were made
by programming according to books and online resources, with the explicit goal
of finding best practices and a more advanced understanding of a voice assistant.

Fig. 1.1: Voice assistant workflow [7]

Fig. 1.1 shows the workflow of the basic process of the voice assistant. Speech
recognition is used to convert the speech input to text. This text is then fed to
the central processor, which determines the nature of the command and calls the
relevant script for execution. But the complexities don't stop there. Even with
hundreds of hours of input, other factors can play a huge role in whether or not
the software can understand you. Background noise can easily throw a speech
recognition device off track, because the device does not inherently have the
ability to distinguish the ambient sounds it "hears", such as a dog barking or a
helicopter flying overhead, from the user's voice. Engineers have to program that
ability into the device; they collect data on these ambient sounds and "tell" the
device to filter them out. Another factor is the way humans naturally shift the
pitch of their voice to accommodate noisy environments; speech recognition
systems can be sensitive to these pitch changes.

1.3 METHODOLOGY OF VIRTUAL ASSISTANT USING PYTHON

Fig 1.2: Virtual assistant methodology

Speech Recognition module

The system uses Google's online speech recognition service to convert speech
input to text. The speech input from the microphone is temporarily stored in the
system and then sent to the Google cloud for speech recognition. The equivalent
text is then received and fed to the central processor.
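As a minimal sketch of this step, using the same SpeechRecognition package that
the implementation in Chapter 4 builds on (the PyAudio package is assumed here
for microphone access):

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:        # requires the PyAudio package
    audio = recognizer.listen(source)  # record until a pause is detected

# The recorded audio is sent to Google's web service and the transcript returned
print(recognizer.recognize_google(audio))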

Python Backend

The Python backend gets the output from the speech recognition module and then
identifies whether the command calls for an API call or for context extraction.
The result is then sent back to the Python backend, which gives the required
output to the user.

API calls

API stands for Application Programming Interface. An API is a software
intermediary that allows two applications to talk to each other. In other words,
an API is a messenger that delivers your request to the provider that you're
requesting it from and then delivers the response back to you.
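As an illustration only, a minimal sketch of such a request using the requests
library; the endpoint and parameters here are hypothetical, not a real provider:

import requests

# Hypothetical weather API; a real assistant would use an actual provider URL and key
response = requests.get('https://api.example.com/weather', params={'city': 'London'})
if response.ok:
    print(response.json())  # the provider's reply, typically JSON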

Context Extraction

Context extraction (CE) is the task of automatically extracting structured
information from unstructured and/or semi-structured machine-readable
documents. In most cases, this activity concerns processing human language texts
using natural language processing (NLP). Recent activities in multimedia
document processing, like automatic annotation and content extraction out of
images, audio and video, can also be seen as context extraction.

Text-to-speech module

Text-to-Speech (TTS) refers to the ability of computers to read text aloud. A
TTS engine converts written text to a phonemic representation, then converts the
phonemic representation to waveforms that can be output as sound. TTS engines
with different languages, dialects and specialized vocabularies are
available through third-party publishers.
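A minimal sketch of this text-to-waveform step using pyttsx3, the offline TTS
library adopted later in this project:

import pyttsx3

engine = pyttsx3.init()          # initialize the offline TTS engine
engine.setProperty('rate', 120)  # speaking rate in words per minute
engine.say('Hello, how can I help you?')
engine.runAndWait()              # block until the audio has finished playing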

CHAPTER 2

LITERATURE SURVEY

2.1 SPEECH RECOGNITION AND VOICE ASSISTANTS

A literature survey is done in order to gain an understanding of the basic
concepts of the project. It gives an insight into the latest approaches and
theories related to the project. It also provides us with research topics based
on the existing systems. The challenges that come with literature surveys are
the reliability and validity of the research. The survey replicates the original
content of the sources and gives a platform on which to base our opinions. The
survey conducted in the field of speech recognition and voice assistants is
analyzed below.

2.1.1 SPEECH RECOGNITION

Speech recognition is the ability of a machine to recognize words spoken in some
language and translate the words into a machine-readable format like text. A
personal assistant is software that performs tasks for the user. These assistants
may work through text, speech or images. Voice assistants perform operations on
devices based on spoken commands. The most widely used voice assistants are
Apple's Siri, Google Assistant, Amazon's Alexa and Microsoft's Cortana. Largely
installed in smartphones, desktops and standalone devices, these software agents
are commonly integrated with the operating system and provide a means for
improved accessibility in executing any task.

In the early 1960s, IBM's William C. Dersh created "Shoebox", the first ever
speech recognition machine; it could understand a total of sixteen words and
perform mathematical computations based on the spoken command phrase. The
forerunner of today's voice-controlled systems, its hardware consisted of a
microphone into which the user would speak the command. The output was printed
on paper and displayed. Unlike recent innovations, the machine required each
word to be clear, distinct and slow, with pauses in between. However, the next
57 years brought many enhancements in technology, with the advent of the
Internet and cloud computing, resulting in the growth of virtual assistants as
we know them today. The IBM Shoebox started a revolution with its technology. [1]

Fig 2.1: The IBM Shoebox

2.1.2 VOICE ASSISTANTS

With the advent of the Internet and artificial intelligence, voice assistants are
among the most commonly used interfaces today. Innovations like Siri, Alexa and
Cortana have made our personal as well as professional lives easier and more
efficient. Over ninety million people in the US alone use voice assistants at
least once a month. According to research, around 20% of Google searches are
made through the voice medium. As the most preferred way of communication
between humans and devices, voice-based systems are a game changer. They have
digitized gadgets to the extent of being able to find directions, make calls,
book appointments and even order food through a simple spoken command. These
assistants have access to a large collection of online or server data that is
accessed quickly to interpret and perform desired tasks, as per the user's choice.

Research has found that around 600 million people around the world use voice-
based applications at least once a week, in the form of Google Assistant,
Amazon Alexa, Apple Siri or other voice-controlled software. With the
emergence of smart speakers and home appliances, voice assistants have grown in
usage.

In the beginning, search engines needed extremely precise and short queries to
perform basic searches. Nowadays, evolution has led to engines that can predict
and understand the user's words. Providing a hands-free approach, voice
assistants act as a remote control for all our devices. With the rise of IoT
(Internet of Things) and smart applications, such assistants can interact with
not only one device, but automate many appliances in a house. [2]

Fig 2.2: Popular Voice Assistants

Some of the functions that a voice assistant can perform include unlocking
devices, opening applications, making calls, sending messages, fetching the latest
news, capturing images, playing music, setting reminders or alarms, emailing
colleagues, performing online searches, booking tickets and offering
recommendations for food, entertainment and more.

The main reason for the shift from traditional systems to voice user interfaces
is changing user demands. The constant improvement and optimization of speed,
accuracy, efficiency and convenience have led to the need for voice-controlled
systems. Another major contributor is the growth of artificial intelligence in
every phase of our lives. With the increasing number of IoT devices like smart
refrigerators, thermostats, televisions, speakers and microwaves, users' lives
are becoming more connected, and voice assistants are helping to create that
bond between all the appliances.

In the banking sector, voice technology has enabled customers to check their
balance and pay their bills using the voice assistant. In advertising, HR and
marketing, voice assistants will form the interface between clients and companies,
enabling faster, more efficient data analysis and taking care of the repetitive and
transactional phases of the industry. Even in retail, voice assistants have given
customers the ability to order any item online through just a few simple
commands. This also extends to the public transportation sector, where users can
book a taxi or enquire about flight schedules using the voice assistant. Many
industries, including entertainment, are already saturated with products that
attempt to integrate voice technology.

There are many technical challenges that are faced by developers when
attempting to create a voice assistant.

 Firstly, there are separate processes for each step from input to output,
instead of a streamlined, integrated approach that enables machines to
actually understand the commands. This means that the input, recognition,
learning and decoding phases are separate entities. Thus, the machine does
not actually understand the commands, but just goes from one phase to the
next, mapping each input to an output. Due to this, the intelligence of the
machine is limited to a certain level and does not rise to par with human
behaviour.
 Secondly, there is a lack of context-aware responses from the machine. The
machine does not take into consideration the voice of the users, the
intonations, the emotions perceived, the environmental situations or other
contextual factors. This may sometimes lead to answers that do not match
the expectations of the user.
 Thirdly, there is a lack of customization of the agent’s personality as per the
user’s wishes. In most modern voice assistants, it is not possible to change
the wake word, the agent’s gender or personalize the agent according to
multiple users.
 Lastly, there is difficulty in achieving 100% accuracy due to the infinite
variations in speech. The English language alone contains about 40 phonemes,
or distinct sounds, with innumerable accents and variations and vocabulary
counts rising into the millions. Since customers expect voice assistants to
understand quite a few languages, it becomes difficult to achieve 100%
accuracy, due to the amount of data and the endless possibilities in terms of
variations and pronunciations, in addition to voice differences.

There are several key factors that come into play while considering the creation
of an intelligent personal assistant. These include individual and extensive
developments that encompass the following fields: speech-to-text (STT),
text-to-speech (TTS), noise control, speech compression, voice biometrics and
the voice user interface. Each of these areas forms a crucial part of voice
assistant development and needs to be prepared efficiently, with importance
given to accuracy and speed.

Going forward, voice assistants are only going to become more sophisticated.
Some of the improvements to look forward to are personalized and context-aware
responses, efficient voice differentiation, multi-platform integration,
increased voice searches (research predicted that 50% of searches would be via
voice by 2020), notifications given through voice assistants and advanced
security measures [3]. Since the market for this technology is ever growing, it
is difficult for developers to create products that stand out from the existing
systems. There are several open-source developer platforms, released by many
companies, that enable programmers to create and customize their own virtual
assistant. These independent services contain the modules for speech
recognition, the libraries for voice samplings, the algorithms for machine
learning and the means to convert and map text to executable commands.

2.1.3 BENEFITS OF VOICE ASSISTANTS

 Voice assistants offer a hands-free approach to accomplishing tasks.
 They are faster than typing or clicking.
 They increase productivity.
 They can be helpful for people who have disabilities.
 They cut down on delays and social bias in the customer service sector.
 They can be integrated with smart home security systems [4].

2.1.4 DRAWBACKS OF VOICE ASSISTANTS

 Voice assistants have recently been subject to privacy concerns, due to the
listening interface that has access to personal information.
 The machines take time to adjust and learn.
 They are less accurate in the case of a continuous flow of words or background noise.
 They are not always cost effective.
 They are vulnerable to hackers.

2.2 EXISTING SYSTEM

It has been widely observed that voice assistants have taken over many aspects of
our lives. As previously mentioned, there are some very popular and high-demand
products that have changed the face of voice recognition technology.

One of the most used voice assistants is the Google Assistant. It originated as
Google Now and was relaunched as Google Assistant in 2016, with the ability to
engage in two-way conversations. It allows the user to ask queries related to a
wide range of topics, including weather, sports, transportation, traffic,
entertainment, food and business. It performs the search for the user and allows
translation of this information into over 100 languages. It has the power to
integrate with your smart home and perform tasks easily. It also has quizzes and
games for entertainment. It can even remember and remind you about things. It
has now developed the ability to differentiate between voices. The biggest
concerns with Google Assistant have been related to privacy. Each command is
recorded and stored in an internal library in order to make the machine able to
understand and integrate with other appliances, but this could also mean that
Google has access to more of your personal information and can target more ads
your way.

The strongest competitor of Google Assistant is Apple's Siri. Though it can be
integrated with only a few hundred devices, Apple's main game-changing factor is
security and privacy. Not only does it incorporate smart encryption schemes for
device connections, it also makes sure that though the commands and voice
samplings are stored in its system, they cannot be traced back to individual
users. For people living in an all-Apple household, Siri has proven to have the
upper hand in terms of quality and safety.

The latest and biggest seller, Amazon's Alexa, has turned the tables in favour
of cheaper devices. With integration in over 20,000 devices, Alexa has renovated
the home automation scene. It has been able to perform all the major tasks
smartly in comparison to other voice assistants. The biggest downside of using
Alexa has been the privacy concerns: there have been quite a few leaks, and it
records and stores every conversation in its database. Customers also find that
its smartphone integration has been the least appealing compared to other
assistants.

Samsung's Bixby has also stood out as a decent product. It can perform everyday
tasks according to the user's wishes, including online searches, downloading
applications and opening other applications. However, it is only available on
Samsung devices.

Integrated into the Windows operating system, Microsoft’s Cortana is also a fair
contender. Though it is in active development, it performs adequately well. It is not
as efficient and accurate as other assistants, but it is a worthy mention.

An interesting customer self-service aid, Nina is being used by many companies
to simulate service conversations that have a human touch. It has the ability to
understand complex sentences and questions. It provides the necessary answers
efficiently.

There are several other important mentions like Hound, Robin, Ubi Kit, Dragon
Go, SILVIA, Aido, Braina, Lucida and Mycroft. These voice assistants have been
adapted to use the latest technologies and perform seamless automated tasks
through voice input. They have been integrated as applications in desktops and
smartphones, and even used in automobiles, home automation systems and other
hardware devices. In the initial stages of development, the purpose of such
assistants was to perform tasks on smartphones. With smart speakers like Google
Home and Amazon Echo, it has become a trend to decentralize usage and perform
tasks all around the house. This includes turning off lights, starting the
microwave, increasing the air conditioner temperature, turning on the television
or resetting the home alarm system.

2.3 DISADVANTAGES OF EXISTING SYSTEMS

 Privacy concerns: With the large amount of recorded data, companies are
targeting suitable advertisements at consumers. This can be infuriating, as
the purpose of the assistant is diverted.
 Accuracy: Most systems have an accuracy rate of about 70% - 95% for the
English language, with much lower accuracy for other languages. This can lead
to many misunderstandings and a slower learning capacity of the machine.
 Fragmented systems: The various voice assistants have trouble integrating
with devices from other brands, causing inconvenience to buyers. Each
assistant is designed for only one platform.

2.4 PROPOSED SYSTEM

The proposed system is an adaptation of existing open-source voice assistants,
with improved functionality, accuracy and security measures. It aims to parallel
the smaller ranks of Google Assistant in terms of accuracy. The project is a
standalone application that can integrate with any platform and will be released
as open-source software. Through this application, the user will be able to
perform tasks such as searching online, opening the browser application and
conversing with the voice assistant. The application analyzes data quickly and
returns outputs in a matter of seconds. Every time a command is spoken, it
translates the speech into text and displays it on the screen. It provides
appropriate feedback for the user to understand that it has listened and that
the output has been displayed successfully. The Google Speech Recognition
package will be used in Python to convert the speech heard to text. The voice
assistant will prompt the user to start speaking. Then, the input will be
received from the in-built microphone through the Python code, and the
recognized words will be converted to binary for processing. There are
appropriate instructions and error messages on the display to engage the user.

The resultant application will be runnable on personal computers with different
operating systems and will avoid the general threat to security, since it
doesn't save the user's data in any way. The application depends on the Python
programming language, which is discussed below in detail.

2.5 SOFTWARE DESCRIPTION

2.5.1 PYTHON

Python is a high-level, interpreted programming language designed by Guido van
Rossum. It was written mainly to provide a language that has a simple, readable
syntax. Due to shorter code and ease of writing, programmers have increasingly
turned to Python. It has many built-in functions and supports object-oriented,
functional and procedural programming styles. It is also platform independent.
Since it is free and open source, and has vast library support, it can be used
to perform a huge variety of actions, and programmers find it easier to learn
and implement compared to other languages. It also has exception handling and
built-in memory management. Since it is dynamically typed, there are no
declarations, making code compact and concise. Indentation is an essential part
of Python, as it determines the flow of statements.

Python also has artificial intelligence and natural language processing
libraries, which make it useful in these fields. It is also used in information
security and game development, and as the base language for the Raspberry Pi.
However, compared to C/C++, Python is slightly slower, and it does not run in
browsers or on some mobile devices.

Python comes installed on many Linux and macOS computers. If your system
doesn't have Python, the installer can be downloaded from the Download page of
the python.org website. Running the installer and selecting the option to add
Python to the PATH variable will set up the Python files. There are two versions
of Python: 2.x and 3.x, with a few minor differences between them, such as the
print function, error handling and the division operator. There are also
multiple ways to run code in Python. One way is by typing "python" in the
command prompt, which opens the Python shell, where one statement at a time can
be executed. Another way is to execute a Python file containing a sequence of
statements. The third way is to use an IDE (integrated development environment)
like PyCharm; the first two ways are sketched below.
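For example, assuming a file named hello.py (the name is arbitrary):

# hello.py - run as a script from the command prompt with: python hello.py
greeting = 'Hello from a script'
print(greeting)

# The same statements can instead be typed one at a time in the
# interactive shell opened by running: python
# >>> greeting = 'Hello from the shell'
# >>> print(greeting)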

2.5.2 PYCHARM

PyCharm is one of the best, if not the best, full-featured, dedicated, and versatile
IDEs for Python development. It offers a ton of benefits, saving users a lot of time
by helping with routine tasks [5].

Features of PyCharm

A developer will find PyCharm comfortable to work with because of the features
mentioned below:

 Code Completion

PyCharm enables smoother code completion, whether it is for a built-in or an
external package.

 SQLAlchemy as Debugger

You can set a breakpoint, pause in the debugger and see the SQL representation
of the user expression for SQL language code.

 Git Visualization in Editor

When coding in Python, checking recent changes is routine for a developer. You
can check the last commit easily in PyCharm, as blue sections in the editor mark
the difference between the last commit and the current one.

 Package Management

All the installed packages are displayed with proper visual representation.
This includes a list of installed packages and the ability to search and add
new packages.

 Local History

Local History keeps track of changes in a way that complements Git. Local
history in PyCharm gives complete details of what needs to be rolled back and
what is to be added [6].

The following are some of the applications of Python:

 Desktop applications with GUI
 Language development
 Prototyping
 Operating systems

CHAPTER 3
REQUIREMENT ANALYSIS AND DESIGN

3.1 FUNCTIONAL REQUIREMENTS

The requirements that specify what services a system can provide to the end user
are called the functional requirements. These define exactly what functions the
system can do and are closely related to the user requirement specifications.
They may include calculations, data processing, technical operations and other
such functionality that aims to fulfill the application objectives. These are
captured in the form of use cases, which are the system responses to events
triggered by external agents or internal deadlines. Any tracking operations,
legal requirements, interface details, authorization levels, transaction updates
and cancellations, and administrative functions come under functional
requirements. The technical architecture of the system is determined by these
requirements.

The functional requirements of this project are:

 It should listen to the spoken commands and recognize the words heard.
 It should provide appropriate directions to the user for ease of use.
 It should give an acknowledgement of the recognition process through some
communication.
 It should be able to perform the tasks that the user requires through Python
automation.
 It should refresh or reload after every command and clear off any extra cache
memory used.
 It should maintain suitable time limits of execution while recording audio or
providing results.
 It should deliver error messages whenever needed.

3.2 NON-FUNCTIONAL REQUIREMENTS

The requirements that describe how the system runs are the non-functional
requirements. These requirements determine the way the system behaves and the
constraints on it. These criteria are used to judge the system in terms of
performance, reliability and security.

Some of the non-functional requirements of this project are:

3.2.1 EXTENSIBILITY

The design principle that determines the ability of a system to be extended is
called extensibility. The extension can be added functionality or a modification
of existing functionality. Overall, the system is enhanced while existing
working functions are not affected. A lightweight software framework can easily
be extended, as small changes can be added with ease.

To add new functionality to this project, it is sufficient to add statements in
existing modules. Since every major aspect of the project has been implemented
in separate files, it is easy to add new features, even as a new file. The
project is compact and has lightweight processes, ensuring integration with new
components with ease.

3.2.2 MAINTAINABILITY

Maintainability is the ease with which users of a system can modify it and
correct any defects. It also covers adding new features, maximizing efficiency
and recovering from errors. Usually, continuous improvement is required for a
system to be maintainable. Maintainability is closely related to extensibility.
Since the system in this project is modular, it can be maintained and updated
easily.

3.2.3 PERFORMANCE

Performance is assessed using the following specifications:

 Response time: It is the time taken for the system to accept user input and
respond to it by displaying some output. Usually, feedback messages are
displayed within 1 second, which in itself is a noticeable delay. A maximum
of 10 seconds on a dialog window ensures that the user does not lose interest
or train of thought. The response time must also be consistent and not vary
based on the number of concurrent sessions.
 Workload: It is the amount of stress or work that the system can hold at
once. This could be in terms of parallel sessions, number of active users or
number of database transactions. The workload is usually described as the
scenarios that users will most likely encounter. Special cases like error
scenarios, backups and management requests should be taken into
consideration while specifying the workload.
 Throughput: The number of samples or bytes of data that are processed per
second is referred to as throughput. The data processing rate should be as
high as possible to ensure that the outputs are consistent and the user
sustains interest in using the system.
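As a hedged illustration, one way the response-time criterion above could be
checked during testing; process_command stands in for whatever function handles
a parsed command and is not part of the project code shown later:

import time

def timed_call(process_command, query):
    # Measure how long one command takes from input to displayed output
    start = time.perf_counter()
    result = process_command(query)
    elapsed = time.perf_counter() - start
    print(f'Response time: {elapsed:.2f} s')  # compare against the 1-second guideline
    return result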

3.2.4 PLATFORM COMPATIBILITY

A platform is the environment in which software executes. It may be an operating
system, a hardware architecture or even a web browser. Different platforms
access their underlying low-level functionality in different ways. One of the
main objectives of this project was to create a desktop app that can run on any
platform, without separate versions and coding for every operating system. Since
tasks are executed through the Python automation library, which depends solely
on the keyboard, it is safe to say that this project is platform independent.

3.2.5 SECURITY

Security can be summed up as resilience to potential harm caused by malicious
intent or viruses. One of the major drawbacks of Google Assistant is that every
customer recording is stored in a repository and can be accessed and linked to
the exact customer. This opens up the possibility of data leaks. Many people are
not aware of this, but blindly agree to the terms and conditions. To avoid this,
this project doesn't store any user data or recordings anywhere. It works on a
single-command basis, making sure there are no threats to the application or
other user data.

3.2.6 USABILITY

Usability describes how easy or difficult it is to learn to operate the system.
This is often measured in learning time or similar metrics. It defines the user
experience across software and environments. It shouldn't be confused with
user-friendliness, which relates to the accessibility of the application.
Usability is more closely related to how fast users can learn to use the system,
how easily they can re-establish proficiency after a long period of not using
it, and how efficiently they can carry out desired tasks once they learn to use
the software. It also includes how pleasant the UI design looks and the
satisfaction the user gets while using it. With the simple UI and prompt
instructions of this system, it can be learnt and used easily by anyone within
just a couple of trials.

3.3 HARDWARE REQUIREMENTS

 Processor: Any processor above 1 GHz
 RAM: 1 GB
 Hard disk: 10 GB
 Input devices: In-built microphone, mouse and keyboard
 Output device: Monitor or display

3.4 DESIGN GOALS

To compare design alternatives and evaluate outcomes, design goals are defined
as the target of the design phase. The stakeholders discuss the various requirements
and finalize the design goals. Some of the design goals are given below:

3.4.1 EFFICIENCY

The speed and quality of the output provided by the system is known as
efficiency. The amount of resources used should be minimum, the accuracy of the
execution should be high and the system should not cause too many delays. It
should act as a means for improved task automation.

3.4.2 RECOVERY FROM ERRORS

The system should be able to recover quickly from errors and failures. In case the
agent cannot recognize the spoken command, it should provide appropriate
feedback to the user and try again automatically.
3.4.3 MAINTAINABILITY

The cost and ease of maintenance determine how effectively the system will last
in the long run, through multiple iterations of post-release management. The
ability of the system to be maintained by users without additional knowledge or
extra costs helps improve the user experience.

3.5 SYSTEM ARCHITECTURE

The process of designing the system architecture focuses on the breakdown of a
system into various components and their interactions that satisfy functional
and non-functional requirements. The inputs for software architecture design are
the requirements and the hardware architecture. This project has no peripheral
hardware devices, so the software components only interact with the system
microphone and display.

The architecture diagram is given below:

[Diagram: user input flows to Python process 1, which sends the audio to the
Google Speech Recognition API; the recognized text passes through the parser to
Python process 2, which drives the monitor (display).]
Fig. 3.1: System Architecture

3.6 DATA FLOW DIAGRAM

Data flow diagrams (DFDs) are a way of describing the flow of data through the
system from process to process. There are no control decisions or loops in DFDs;
only the input and output of each process are represented. There are two popular
notations for DFDs, Yourdon & Coad and Gane & Sarson, with slight variations in
the symbols used. Since the Yourdon & Coad notation is generally used for the
analysis and design phases, the following diagram is in that format:

Fig. 3.2: Data Flow Diagram [7]

3.7 USE CASE DIAGRAM

To model the functionality of a system, use case diagrams are used. A use case is
a system response to an event that an external agent (actor) triggers. The use case
diagram shows the various actors, use cases and their interactions. The use case
diagram for this project is given below:

Fig. 3.3: Use Case Diagram [7]

CHAPTER 4
IMPLEMENTATION

4.1 DESCRIPTION

4.1.1 PYTHON FILES

The implementation of the speech recognition, as well as the main automation
that lets this project execute the tasks the user wants, is all done in the
Python programming language.

The Python file is called “main.py”. The Python libraries that are used are:

 The "sys" library: For flushing the words heard back to the main process
through standard output. This module comes built-in with Python.
 The "SpeechRecognition" library: This enables the program to recognize our
speech patterns and convert that audio into text for further processing.
 The "subprocess" module: This is used to get system subprocess details used
in various commands, e.g. shutdown, exit, etc. This module comes built-in
with Python.
 wolframalpha: Wolfram Alpha is an engine with expert knowledge of different
fields, including but not limited to mathematics, geography and current
affairs. This library is used to compute expert-level answers using Wolfram's
algorithms, knowledge base and AI technology.
 pyttsx3: This module is used for the conversion of text to speech in a
program. It works offline.
 wikipedia: As we all know, Wikipedia is a great source of knowledge. This
module helps us get information from Wikipedia or perform a Wikipedia search.
 webbrowser: This helps to perform web searches. This module comes built-in
with Python.
 datetime: Used for showing the date and time. This module comes built-in
with Python.

These libraries are installed from the command prompt (cmd) or from within the
PyCharm IDE.
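For example, the third-party packages could be installed with pip from the
command prompt (the built-in modules need no installation); PyAudio is assumed
here because SpeechRecognition's microphone input depends on it:

pip install SpeechRecognition
pip install PyAudio
pip install wolframalpha
pip install pyttsx3
pip install wikipedia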

First we initialize the text-to-speech engine to have it ready to give us our verbal
feedback from our Voice Assistant. We’ll set the voice as well as an activation
word that would wake up the Assistant. The code is given below:

import pyttsx3

# Speech engine initialization
engine = pyttsx3.init()
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[0].id)  # 0 = male, 1 = female
activationWord = 'computer'  # Single word

 .getProperty(): gets a voice/speech property of the engine
 .setProperty(): sets the assistant's voice type [8]

Next we need to have a way for the system to listen to commands, so we’ll define a
method for that:

import speech_recognition as sr

def parseCommand():
    listener = sr.Recognizer()
    print('Listening for a command')

    with sr.Microphone() as source:
        listener.pause_threshold = 2           # seconds of silence that end the phrase
        input_speech = listener.listen(source)

    try:
        print('Recognizing speech...')
        query = listener.recognize_google(input_speech, language='en-GB')
        print(f'The input speech was: {query}')
    except Exception as exception:
        print('I did not quite catch that')
        speak('I did not quite catch that')
        print(exception)
        return 'None'

    return query

[8]


 sr.Recognizer(): creates the recognizer that takes the voice input and
parses it into text.
 sr.Microphone(): accesses the system microphone.
 pause_threshold: sets how long a gap in the speech can be before the
assistant stops listening.
 .listen(): records the input speech.
 .recognize_google(): sends the audio to Google's API to recognize the speech.

Next we'll define our speaking method. This is how we make our text-to-speech
(TTS) library useful:

def speak(text, rate=120):
    engine.setProperty('rate', rate)  # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()

[8]

 .runAndWait(): makes sure the assistant won't recognize a second speech
instruction until it completes the first.

With this, we have our two basic methods: one that understands what is coming
out of the microphone, parsing it with the Google API to convert speech into
text, and a general method that uses the text-to-speech library to make our
assistant speak.

Next we create commands to execute the following functions:

 Website navigation
 Wikipedia queries for general knowledge
 Wolfram Alpha queries
 Note recording
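Before each feature is shown, here is a minimal sketch of how such commands
might be dispatched in a main loop; it assumes the parseCommand, speak and
search_wikipedia functions presented in this chapter, and the example phrases
and URL scheme are illustrative only:

import webbrowser

while True:
    query = parseCommand().lower().split()

    # React only when the activation word begins the command
    if len(query) > 1 and query[0] == activationWord:
        query.pop(0)

        if query[0] == 'say':  # e.g. "computer say hello"
            speak(' '.join(query[1:]))
        elif query[:2] == ['go', 'to'] and len(query) > 2:  # website navigation
            speak('Opening the website')
            webbrowser.open('https://' + query[2])
        elif query[0] == 'wikipedia':  # general-knowledge lookup
            speak(search_wikipedia(' '.join(query[1:])))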

Here's the code to implement the Wikipedia library. It is a method that uses the
library to search Wikipedia for the user's queries. The execution should produce
an array of results, from which the assistant picks the first and most relevant.
We also include an error message for the case where no result can be found on
the website:
import wikipedia

def search_wikipedia(keyword=''):
    searchResults = wikipedia.search(keyword)  # list of matching page titles
    if not searchResults:
        return 'No result received'
    try:
        wikiPage = wikipedia.page(searchResults[0])
    except wikipedia.DisambiguationError as error:
        # Ambiguous keyword: fall back to the first suggested option
        wikiPage = wikipedia.page(error.options[0])
    print(wikiPage.title)
    wikiSummary = str(wikiPage.summary)
    return wikiSummary

[8]
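The Wolfram Alpha queries listed earlier could be handled along similar lines. A
minimal sketch using the wolframalpha package; the app ID is a placeholder that
must be obtained from Wolfram's developer portal:

import wolframalpha

client = wolframalpha.Client('YOUR-APP-ID')  # placeholder app ID, not a real key

def search_wolfram(query=''):
    try:
        result = client.query(query)      # send the question to Wolfram Alpha
        return next(result.results).text  # text of the first result pod
    except StopIteration:
        return 'No result received'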

4.2 TESTING

Testing is done with the purpose of finding out whether the software is complete
and acceptable to the user for delivery and final release. The data integrity,
confidentiality, timeliness, performance, usability and completeness of the
software have been tested, and the final product is deemed ready to be delivered
to the user.

4.2.1 TEST CASE DESIGN

 Test Case 1

Test Title: Response Time

Test ID: T1

Test Priority: High

Test Objective: To make sure that the system's response time is efficient.

Description:

Time is very critical in a voice-based system, since inputs are spoken rather
than typed. The system must reply in a moment; the user must get an instant
response to the query made.

 Test Case 2

Test Title: Accuracy

Test ID: T2

Test Priority: High

Test Objective: To ensure that the answers retrieved by the system are accurate
as per the gathered data.

Description:

A virtual assistant system is mainly used to get precise answers to any question
asked. Getting an answer in a moment is of no use if the answer is not correct.
Accuracy is of utmost importance in a virtual assistant system.

 Test Case 3

Test Title: Approximation

Test ID: T3

Test Priority: Moderate

Test Objective: To check that calculations return approximate answers where
appropriate.

Description:

There are times when a mathematical calculation calls for an approximate value.
For example, if someone asks for the value of PI, the system must respond with
an approximate value rather than the full precise expansion. An exact value in
such cases is undesirable.

Note: A few more test cases may be included, and these test cases are also
subject to change as the final software develops.

CHAPTER 5
CONCLUSION AND FUTURE ENHANCEMENT

5.1 CONCLUSION

The final outcome of this project is an intelligent voice assistant, as
described in the title. It combines natural language processing techniques with
Python programming to present an efficient digital personal assistant. It can
perform everyday tasks on the system based on the user's spoken command. It can
recognize the words, map the speech into text and decide what task to execute
accordingly. The application can perform operations on the device such as
opening websites, writing notes and performing online searches.

5.2 FUTURE ENHANCEMENT

In future versions of this project, some attributes that can be added are:

 Facility for extended conversation (like a chatbot)
 Features for persistent data storage
 Run system commands
 Personalization for different users
 Home automation
 Voice biometrics and security
 Custom AI voice

REFERENCES

[1] Norman, Jeremy. (2004, January 4). William Dersh of IBM unveils Shoebox, an
early application of voice recognition to calculating. Retrieved from
https://www.historyofinformation.com

[2] Voice Assistants: How Artificial Intelligence Assistants Are Changing Our
Lives Every Day. Retrieved from
https://www.smartsheet.com/voice-assistants-artificial-intelligence

[3] Armour, Britt. (2018, November 15). 7 Key Predictions For The Future Of
Voice Assistants And AI. Retrieved from
https://clearbridgemobile.com/7-keypredictions-for-the-future-of-voice-assistants-and-ai/

[4] Kh, Nataliya. (n.d.) 3 Efficient Ways to Supply Your App with a Virtual
Assistant. Retrieved from https://www.cleveroad.com

[5] Rahmonov, Jahongir. PyCharm for Productive Python Development. Real Python.
Retrieved December 30, 2022, from https://www.realpython.com/pycharm_guide/

[6] Tutorialspoint. (2021, September 30). PyCharm Introduction. Retrieved
December 30, 2022, from
https://www.tutorialspoint.com/pycharm/pycharm_introduction/

[7] Lucidchart Tool. Retrieved from https://www.lucidchart.com/

[8] Mikael Codes. Create Your Own AI Assistant in Python. Retrieved from
https://www.github.com
