
Project Report

on

Instant access to healthcare using AI - voice enabled chat bot


Submitted to

Shri Ramdeobaba College of Engineering & Management, Nagpur

(An Autonomous Institute Affiliated to Rashtrasant Tukdoji Maharaj Nagpur University)

for partial fulfillment of the degree in

Bachelor of Engineering
(Information Technology)
Sixth Semester
by

SIMRAN SINGH (23)

PARTHSARTHI PAHUJA (54)

YASH GUPTA (70)

Under the Guidance of

Dr. D.S. Adane

Department of Information Technology

Shri Ramdeobaba College of Engineering & Management,

Nagpur-13

2020-21
CERTIFICATE
This is to certify that the Project Report on

INSTANT ACCESS TO HEALTHCARE USING AI - VOICE


ENABLED CHAT BOT
is a bonafide work and it is submitted to

Shri Ramdeobaba College of Engineering & Management, Nagpur

(An Autonomous Institute Affiliated To Rashtrasant Tukdoji Maharaj Nagpur University)

by

Simran Singh, Parthsarthi Pahuja, Yash Gupta

For partial fulfillment of the degree in

Bachelor of Engineering in Information Technology,

Sixth Semester

during the academic year 2020-21

under the guidance of

Dr. D.S. Adane


Head, Department of Information Technology, RCOEM, Nagpur

Dr. D. S. Adane
Head, Department of Information Technology
RCOEM, Nagpur

Dr. R. S. Pande
Principal
RCOEM, Nagpur

Department of Information Technology

Shri Ramdeobaba College of Engineering & Management,

Nagpur-13

2020-21
ACKNOWLEDGEMENTS

It is our proud privilege to present a project report on "INSTANT ACCESS TO
HEALTHCARE USING AI - VOICE ENABLED CHAT BOT". We take this
opportunity to express our deep sense of gratitude and wholehearted thanks to our guide Dr.
D.S. Adane, Head, Department of Information Technology, Shri Ramdeobaba College of
Engineering and Management, Nagpur, for his valuable guidance, inspiration and
encouragement, which led to the successful completion of our project.

We would like to express our deepest gratitude to Dr. D. S. Adane, Head, Department
of Information Technology, RCOEM, Nagpur, for providing us the opportunity to embark on
this project.

A special word of thanks goes to the entire Department of Information Technology,
RCOEM, Nagpur, for their encouragement and cooperation, which helped us accomplish our work on
time. Finally, we would like to thank and express sincere gratitude towards our Principal Dr.
R.S. Pande for being our source of inspiration throughout this project. We would also like to
thank each and every member involved in the completion of this project.

Name of Projectees
Simran Singh (23)
Parthsarthi Pahuja (54)
Yash Gupta (70)

CONTENTS
Page No.

ABSTRACT iii
LIST OF FIGURES iv
LIST OF TABLES v

CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION TO CHATBOT 1
1.2 ARTIFICIAL INTELLIGENCE IN MEDICINE 2
1.3 FUTURE SCENARIO FOR INDIA 5

CHAPTER 2
OVERVIEW OF HEALTHBOT
2.1 CHATBOTS IN HEALTHCARE INDUSTRY 6
2.2 USE CASES IN HEALTHCARE 7
2.3 CHALLENGES AND LIMITATIONS 9

CHAPTER 3
AIMS AND OBJECTIVES
3.1 PROBLEM STATEMENT 11
3.2 PROPOSED SOLUTION 11

CHAPTER 4
LITERATURE REVIEW
4.1 SURVEY OF EXISTING MODELS 12

CHAPTER 5
METHODOLOGY
5.1 CHATBOT ARCHITECTURE 15
5.2 PHASES AND THEIR WORKING 15
5.3 MODULES 16

CHAPTER 6
NATURAL LANGUAGE PROCESSING
6.1 INTRODUCTION TO NLP 17
6.2 NLP TECHNIQUES 18
6.3 IMPLEMENTATION 19

CHAPTER 7
MACHINE LEARNING
7.1 INTRODUCTION TO ML 21
7.2 RESEARCH ON ML ALGORITHMS 23
7.3 IMPLEMENTATION 30

CHAPTER 8
DATABASE
8.1 DATA IN HEALTHCARE 35
8.2 DATABASE DEVELOPMENT 35
8.3 IMPLEMENTATION 36

CHAPTER 9
CONCLUSION AND REFERENCES
9.1 CONCLUSION 39
9.2 FUTURE WORK 39
9.3 REFERENCE 39

ABSTRACT

With the growing interest of individuals in health, life care, and disease, medical services
have been shifting from a focus on treatment to prevention and health management. The medical
industry is creating additional services for health- and life-promotion programs. This change
represents a paradigm shift in medical services, driven by longer life expectancy, aging,
lifestyle changes, and rising incomes; consequently, the concept of the smart health service
has emerged as a major issue. However, as the quantity of medical information grows and its
complexity intensifies, the limitations of earlier approaches are becoming increasingly
problematic. With recent trends in technology, AI chatbots have made their way into the
healthcare domain. Although healthcare was not the first sector in which experiments with
chatbots were carried out, since the beginning of 2018 many different use cases have emerged
and been tried in this field. A chatbot is an intelligent conversation platform that interacts
with users via a chat interface, and since it can be linked to the major social network
messengers, general users can easily access and receive various health services. The
framework comprises three levels: Natural Language Processing, Machine Learning and Database.
The work then focuses on two supervised Machine Learning algorithms, Random Forest and KNN,
which take user input and provide a diagnosis based on the information stored in the
knowledge base of the system. The project is currently in the development phase, with the
algorithms being tested on ten diseases, and future plans are stated.

LIST OF FIGURES

Sr. No. Description Page No.

Figure 1.1 Example of conversational bot 2

Figure 1.2 Use cases of bots in AI 4

Figure 5.1 Chatbot Architecture 15

Figure 6.1 NLP working 17

Figure 6.3.1 Speech recognition code 19

Figure 6.3.2 Text Pre-processing code 20

Figure 6.3.3 Output of NLP Methods 20

Figure 7.3.1.1 Execution of Random Forest 31

Figure 7.3.1.2 Sample input to the code 31

Figure 7.3.1.3 Output of the following code 32

Figure 7.3.2.1 Execution of K-Nearest Neighbor 33

Figure 7.3.2.2 Sample input to the code 33

Figure 7.3.2.3 Output of the following code 34

Figure 8.3.1 Code for Web Scraping 36

Figure 8.3.2 Code for Exporting Scraped Data to CSV File 37

Figure 8.3.3 Snapshot of Cleaned Training.csv File 37

Figure 8.3.4 Snapshot of Cleaned Testing.csv File 38

LIST OF TABLES

Sr. No. Description Page No.

Table 7.2.1 The difference between supervised learning and unsupervised learning 28
Table 7.2.2 Summary of the reviewed ML algorithms 29

CHAPTER - 1
INTRODUCTION
1.1 INTRODUCTION TO CHATBOT

1.1.1 What is a chatbot?

Several million people enter keywords every day in search engines such as Google and
then have to choose from a list of results, usually in the form of web pages in which it is again
necessary to search for specific information.

A chatbot is a software robot that can reproduce natural language and interact with an individual
through automated conversations. Chatbots allow you to receive a unique answer or a service. In
the literature, chatbots and conversational agents are sometimes distinguished by their level of
understanding of natural language: the former rely on keyword or rule engines, while the
latter are based on machine learning. We use the term chatbot in its generic sense in this
report. The operating model of a chatbot is always the same, whatever its scope, theme
and level:

• Users formulate their queries in natural language via a voice or text interface.
• The chatbot receives the request and its engine interprets it.
• The chatbot provides a unique and qualified answer to the user's query.

The answer may be generic (i.e. the same for everyone), contextualized (adapted to the context,
for example, at a given time and place) or customized (adapted to users, for example, by providing
them with their bank balance).

1.1.2 Types of chatbot

There are three types of chatbot:

Assistants: Provide the user with a predefined answer like in a page for "Frequently Asked
Questions".

Concierges: Provide a contextualized response and facilitate a service to the user, for example by
explaining the steps of an action to be taken.

Advisors: Integrate customized answers to complex requests with automated processes to perform
certain actions.

Figure 1.1: Example of conversational bot

1.1.3 History of Chatbots

Chatbots are in the spotlight today, but the first chatbot, ELIZA, emerged as early as 1964.
Since then, several chatbots have been built in artificial intelligence research to try to
understand and reproduce the human ability to conduct a conversation. Other noteworthy
examples include Jabberwacky in 1982 and A.L.I.C.E. in 1995. Since
2010, the web giants have been launching smart assistants for smartphones and PCs to improve
the user experience. The best known is Siri, released as an iPhone app in 2010 and built into
the iPhone by Apple in 2011. Then there was Google Now in 2012, followed by Cortana at
Microsoft and Alexa at Amazon in 2014. Since 2016, chatbot
solutions have been multiplying, particularly on Facebook Messenger, thanks to the simplification
of chatbot technologies and implementation tools that anyone can use.

1.2 ARTIFICIAL INTELLIGENCE IN MEDICINE

1.2.1 What is Artificial Intelligence?

"Artificial Intelligence is neither a new technology nor a machine". Artificial intelligence
can be understood as outcome-directed thinking: the rapid analysis of live data to achieve an
expected goal. This outcome-directed thinking, made possible by artificial intelligence, breaks
from the confines of the rule-directed approach. The generalized practice of AI can be broken
down into a straightforward process. First, a numerical representation is established for the
target or outcome. Data associated with the target is then gathered, and conditions and
behaviors are investigated to increase the likelihood of achieving the expected target. Multiple
aspects can determine the outcome, so the weight of each aspect's effect is computed. "AI uses the
relative weighting of each aspect to create a prediction (evaluation) formula" (Yano, K. 2017).
Lastly, the formula devised from the weighted aspects is applied to business decisions. AI can
be classified into four groups: "systems that think like humans, systems that act like humans,
systems that think rationally and systems that act rationally". AI is also generally categorized
as strong or weak: strong AI is the production of human-like intelligent systems, whereas weak AI
is the integration of intelligent algorithms embedded within a system. "Machine learning,
deep-learning, natural language processing and neural networks are often summarized under the
term of AI".

1.2.2 Artificial intelligence in medicine

The application of AI in medicine has two main branches: Virtual branch and Physical
branch.

Virtual branch –

The virtual component is represented by Machine Learning (of which Deep Learning is a subset):
mathematical algorithms that improve through experience. There are three types of machine
learning algorithms:

1. Unsupervised (ability to find patterns)

2. Supervised (classification and prediction algorithms based on previous examples)

3. Reinforcement learning (use of sequences of rewards and punishments to form a strategy
for operation in a specific problem space)

Physical branch –

It includes: Physical objects, Medical devices, Sophisticated robots for delivery of care (carebots)/ robots
for surgery.

Figure 1.2: Use cases of bots in AI

1.2.3 Applications of Artificial intelligence in Healthcare

• AI can assist physicians
• Clinical decision making - better clinical decisions
• Replace human judgement in certain functional areas of healthcare (e.g., radiology)
• Up-to-date medical information from journals, textbooks and clinical practices
• Experienced vs fresh clinician
• 24x7 availability of expert
• Early diagnosis
• Prediction of outcome of the disease as well as treatment
• Feedback on treatment
• Reinforce non-pharmacological management
• Reduce diagnostic and therapeutic errors
• Increased patient safety and huge cost savings associated with use of AI
• AI system extracts useful information from a large patient population
• Assist in making real-time inferences for health risk alerts and health outcome prediction

1.3 FUTURE SCENARIO FOR INDIA

• Collaboration between medical and technical institutions
• Stop working in silos
• Government funding – more intelligent and result oriented
• Current status of medical records
  ▪ Incommunicable silos of wasted information for the health system and for
    knowledge acquisition. Laboratories and clinics need to collaborate to accelerate
    the implementation of electronic health records
• Data need to be captured in real time, and institutions should promote their transformation
  into intelligible processes
• New scientific and clinical findings should be shared through open source, and aggregated
  data must be displayed for open access by physicians and scientists and made
  automatically available as point-of-care information
• Integration and interoperability challenges, including ethical, legal and logistical
  concerns, are enormous
  ▪ Simplification, readability and clinical utility of data sets
  ▪ Each result must be questioned for its clinical applicability
  ▪ Aim of increasing their clinical value and decreasing health costs
• Electronic medical or health records
  ▪ Are essential tools for personalized medicine
  ▪ Early detection and targeted prevention, again

CHAPTER - 2
OVERVIEW OF
HEALTHBOT
2.1 CHATBOTS IN HEALTHCARE INDUSTRY

2.1.1 Healthcare Chatbot

Although healthcare was not the first sector in which experiments with chatbots have been
carried out, since the beginning of 2018 we have seen the emergence of and experimentation with
many different use cases in this field. The chatbots thus try to handle several needs, such as
personalized medical follow-up, communication and transmission of test results, dissemination of
information, or even advice to patients or preliminary diagnosis. It is in this context and based on
the project initiated by Sanofi, in partnership with Orange Healthcare and Kap Code, that we are
exploring in this white paper some practical cases of healthcare chatbots and the specificities of
the healthcare sector. The white paper also includes our proposals for evaluating user perception
of these new digital tools.

2.1.2 Proposing Chatbot as an Alternative System

The use of chatbots has spread from consumer customer service to matters of life and
death. Chatbots are entering the healthcare industry and can help solve many of its problems.
A chatbot is a computer program designed to carry on a dialogue with people, particularly on the
Internet. It assists individuals via text messages within websites, applications or instant messaging
and enables businesses to attract, keep and satisfy clients. Bots of this kind are automated systems
for communicating with users. Some chatbots can answer questions such as the following:
"How long is someone infectious after a viral infection?" "How
can I get a prescription?" "How can I find out my blood type (blood group)?" Clinics that
build a chatbot for their sites thereby lower the number of repetitive calls that their specialists
have to answer. This, in turn, enables hospital employees to concentrate on more significant tasks,
which leads to better healthcare service quality. The proposed system will not only provide
personal assistance to patients but also let users keep their previous medical records on the
platform for future use. The platform will provide a conversational experience to patients, as if
a doctor were treating them online.

2.2 USE CASES IN HEALTHCARE

1. Checking Symptoms

Plugging a collection of symptoms into a search engine can yield unclear or
unnecessarily alarming results. Chatbots can ask clarifying questions and factor in
personal details before offering advice. They can also identify when a person
might need urgent care and pass along chat transcripts to providers so that
patients don't have to repeat themselves.

2. Finding health services

Finding health services that are close by and in your care network can be difficult.
Chatbots can personalize their responses based on account information and use
location data to find the nearest relevant services.

3. Medication Guidance

Chatbots aren't replacements for pharmacists, but they can be handy for
sharing basic drug information and reminding patients when to take their
medication. Chatbots can interact over web, social, SMS, and even through your
mobile app so your customers will always see the reminder.

4. Book an appointment

Getting time with your practitioner is typically done through a
phone call. But with demand for digital options increasing, a chatbot that can book
appointments might be just what the doctor ordered. They can hook into your existing
scheduling tools or, if you already have online appointment booking, host that service
inside the chat window.

2.3 CHALLENGES AND LIMITATIONS

1. Obstacle for AI chatbot in the Future –

One of the main hurdles for AI would be its adoption. Healthcare professionals would
have to be educated about the need for AI. They should also be made comfortable working in
an environment where AI is present. Many doctors would not be open to information
provided by a machine, and they would need to be educated to accept AI. Compliance and FDA
regulations can be another major problem. Currently, with AI being only partially
understood, how much importance should be given to AI is also a question
that lurks in the minds of FDA personnel.

2. Difficulties in healthcare AI adoption

The industry is receptive to new ways to improve diagnostics, patient care, and financial
efficiencies. However, AI healthcare companies contend with some significant
challenges with regard to widespread AI adoption in healthcare:
• Case study conundrum
• Black box issue
• Stakeholder complexities
• Current trends

3. Other challenges and limitations

Other challenges and limitations include the near impossibility of giving a machine human
intelligence, time constraints, adequate knowledge representation, the need for very specific
keywords, technological limitations of AI, medical limitations, ethical challenges, the need
for better regulations, misconceptions and overhyping, and human rejection.

4. Data safety and privacy and risk

The Ministry of Health and Family Welfare is working on sector-specific legislation,
tentatively called the Healthcare Data Privacy and Security Act. In 2016, the hacking of a
Mumbai-based diagnostic laboratory database led to the leaking of medical records
(including HIV reports) of over 35,000 patients. Hackers can exploit AI solutions to collect
private and sensitive information such as electronic health records.

5. Common vulnerabilities addressed in chatbot

• Man-in-the-middle
• Chat log stored on user device
• Encryption of messages in transit
• Encryption of data at rest
• Use of external NLP services
• Logging and access rights

CHAPTER - 3
AIMS AND OBJECTIVES
3.1 PROBLEM STATEMENT

People in rural areas, especially in India, face many challenges such as expensive medical
care, lack of infrastructure and the absence of doctors. They have to travel long distances to get
medical assistance. Many more such challenges put people's lives at risk. To address this, we
arrive at the problem statement “Instant access to healthcare using AI - voice enabled chat bot”.

3.2 PROPOSED SOLUTION

For the given problem statement, we propose an "AI - Healthcare Chatbot" which will
provide an instant solution.

• The chatbot will provide a diagnosis to the user based on the symptoms they
  provide.
• The chatbot will assist users in emergency situations. For example,
  if the user's symptoms point to severe chest pain or a heart attack,
  the chatbot will immediately suggest seeking medical attention right away.
• The chatbot will also offer solutions for non-severe medical issues, for example
  suggesting gargling when the diagnosis is a common cold.
• The chatbot will also provide details of the medication to be taken for the diagnosed issue.
• In a place like India, where many people are more comfortable with Hindi, the chatbot
  will offer a Hindi language feature so that users can interact with it in Hindi. This
  will make the chatbot easier to use.

CHAPTER - 4

LITERATURE REVIEW
A healthcare chatbot is a system that helps users learn about their disease, suggests treatment
related to the disease, or gives information about nearby healthcare centres in a cost-effective
and efficient manner. Most researchers have used techniques such as NLP and ML to predict
the disease; the differences arise in the choice of machine learning algorithms and in some
novel functionalities. The works surveyed here are drawn from SCI- or Scopus-indexed journals
and research papers. The survey showed that there are various techniques to build, train and
deploy a chatbot; some of the works analysed are listed below.

4.1 SURVEY OF EXISTING MODELS

4.1.1 Microservice chatbot architecture for chronic patient support


This paper offers a solution based on a microservices architecture for chronic patient
support and eHealth functionalities. A virtual assistant was developed covering the
most common diseases, and novel functionalities such as speech recognition were planned
as additions to the project.

4.1.2 Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: A
mixed-methods study
In this paper the researchers analysed topics such as the understanding and use of
chatbots in healthcare, AI hesitancy, and motivations for using healthcare chatbots. They
also raised issues regarding the accuracy and security of chatbots. The drawback of the
paper is that the researchers did not focus on any particular population and only explored
general views on healthcare chatbots.

4.1.3 Design and development of smart healthcare chatbot application using AI – ML


The developers were mainly concerned about the unavailability of doctors and healthcare
services during COVID-19, so they developed an AI-based chatbot that provides medical
consultation to the end user. The bot consists of two major modules: one extracts
information from the user through voice signals, and the other provides a medicinal remedy by
extracting information from the user query through tokenization. One problem
with this model was data authenticity, as the sources of data were not specified. Including
Deep Learning concepts might also increase the accuracy and efficiency of the model.

4.1.4 Self-diagnosing health care chatbot using machine learning
This project aims at providing basic consultation to a user before consulting a doctor.
The chatbot identifies the symptoms and categorizes them as major or minor; if a symptom is
major, the chatbot suggests that the user consult a doctor. NLP and a decision tree algorithm
were used by the developers to provide the diagnosis.

4.1.5 Design and development of diagnostic chatbot for supporting primary health care
systems
The chatbot was based on supervised learning, using NLP and a
decision tree algorithm. It provided a diagnosis based on the symptoms
entered by the user. It can also connect the user to a
doctor; if the doctor is unavailable, preliminary consultation is provided by the chatbot.
The disadvantage of this model is that it works with only a limited number of diseases, and its
accuracy is low for uncommon diseases.

4.1.6 AI chatbot design during an epidemic like the novel coronavirus


In this paper the researchers proposed a chatbot that acts as a
virtual assistant, measuring the severity of an infection and connecting the patient to a doctor
if the situation becomes serious. The chatbot can also check whether the user is suffering from
COVID-19. If so, it tells the user to consult a doctor; if not, it provides the basic safety
measures the user should follow in order to stay safe.

4.1.7 The smart healthcare prediction using chatbot


The paper proposed a model in which the chatbot asks the user for symptoms and
gives a diagnosis based on its analysis. The Java language, NLP and
ML algorithms were used. The main drawback of the system was that the developers did not compare
the accuracy of various ML algorithms; they simply adopted the first algorithm they tried.

4.1.8 AI healthcare interactive talking agent using NLP


This project focused on the physical fitness of the user. It asks the user to enter their
height and weight, calculates the user's BMI from them, and identifies whether
the user is underweight or overweight. The chatbot can also provide a diet plan to the user. It
uses NLP, focusing mainly on morphology. The drawback of this system is that the input
from the user is not in sequential order, which may lead to incorrect response collection.
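
The BMI step described above reduces to a single formula: weight in kilograms divided by the square of height in metres. A minimal sketch follows; the function names are hypothetical, and the cut-offs of 18.5 and 25 are the commonly cited adult thresholds, assumed here because the reviewed paper's exact values are not stated.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value to a coarse category (assumed cut-offs: 18.5 and 25)."""
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    return "overweight"
```

For example, a user weighing 70 kg at 1.75 m has a BMI of about 22.9, which this sketch labels as normal.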

4.1.9 Text messaging-based medical diagnosis using natural language processing and
fuzzy logic
This system was written in Python and suggests a medical diagnosis using a direct
question-and-answer approach. The developers extracted
data from different standard websites to build their knowledge base. The entire project was
deployed on Telegram. The drawback of the system was that it did not guard against false
positives, i.e. cases of falsely suggesting a disease.

4.1.10 Automated medical chatbot


A medical chatbot that provides diagnosis and remedies based on the symptoms
provided to the system. The system is able to measure the seriousness of the diagnosis and,
if needed, connects the user to a doctor available online. The limitation of the project is
its accuracy of only 56.6%, which is quite low.

CHAPTER - 5
METHODOLOGY
5.1 CHATBOT ARCHITECTURE

Figure 5.1: Chatbot Architecture

5.2 PHASES AND THEIR WORKING

This is the complete architecture of our chatbot. It has three main phases:

• Interaction with user
  ▪ This phase deals with the users, messaging platform and speech recognition
    component of the chatbot. The phase focuses on the conversation with the user.
  ▪ Using the messaging platform (GUI of chatbot), the user can interact with the
    chatbot.
  ▪ The user can interact with the chatbot through a voice message or can type their
    input as a text message.
  ▪ For voice input, the chatbot will convert the voice message into text for further
    processing.
  ▪ If the input is text, it is transferred directly to the NLP component of the
    architecture.

• Processing the Query
  ▪ This phase deals with the NLP component of the chatbot.
  ▪ Text preprocessing is done with the help of NLP in this phase.
  ▪ The input will go through various NLP techniques like tokenization, stemming
    and removal of stopwords to clean the input data.
  ▪ The output of this phase will be the extracted keywords, i.e. symptoms. These
    symptoms will be transferred to the next phase.

• Predicting the output
  ▪ This phase deals with the ML and database component of the chatbot.
  ▪ Prediction is done with the help of ML algorithms in this phase.
  ▪ The input will be fed to the machine learning algorithms so that they can predict the
    corresponding disease as per the user's symptoms.
  ▪ It will build an ML model using the actual training and testing datasets to provide
    accurate results.
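
The three phases above can be sketched as a chain of three functions. This is only an illustration under stated assumptions: the function names, the tiny symptom vocabulary and the single lookup rule are hypothetical stand-ins, and the rule replaces the trained ML model purely to keep the sketch self-contained.

```python
import re
from typing import List

# Hypothetical symptom vocabulary; the real system would derive it from its datasets.
KNOWN_SYMPTOMS = {"fever", "cough", "headache", "nausea"}

# Hypothetical lookup standing in for the trained model (illustrative only).
TOY_RULES = {frozenset({"fever", "cough"}): "common cold (illustrative)"}


def interact(message: str) -> str:
    """Phase 1: receive the user's message as text (voice is assumed already transcribed)."""
    return message.strip()


def process_query(text: str) -> List[str]:
    """Phase 2: tokenize the text and keep only known symptom keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token in KNOWN_SYMPTOMS]


def predict(symptoms: List[str]) -> str:
    """Phase 3: map the extracted symptoms to a label (placeholder for the ML model)."""
    return TOY_RULES.get(frozenset(symptoms), "unknown")


def chatbot_reply(message: str) -> str:
    """Run the three phases in order."""
    return predict(process_query(interact(message)))
```

Calling `chatbot_reply("I have a fever and a bad cough")` passes the message through all three phases and returns the illustrative label attached to that symptom set.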

5.3 MODULES
The project is divided into three modules:
• Natural Language Processing (NLP)
• Machine Learning (ML)
• Database (Datasets)

CHAPTER - 6
NATURAL LANGUAGE
PROCESSING
6.1 INTRODUCTION TO NLP

6.1.1 What is NLP?

Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that makes
human language intelligible to machines. NLP combines the power of linguistics and computer
science to study the rules and structure of language, and create intelligent systems (run on machine
learning and NLP algorithms) capable of understanding, analyzing, and extracting meaning from
text and speech.

6.1.2 What is NLP used for?

NLP is used to understand the structure and meaning of human language by analyzing
different aspects like syntax, semantics, pragmatics, and morphology. Computer science then
transforms this linguistic knowledge into rule-based or machine learning algorithms that can
solve specific problems and perform desired tasks.

6.1.3 How does NLP work?

Figure 6.1: NLP working

By using NLP tools, the input data is pre-processed and data is converted into something
that a machine can understand. Then machine learning algorithms are fed with the outcomes to
train machines to make associations between a particular input and its corresponding output.

In our project, NLP is used to understand the user's input and extract key features, i.e.
symptoms, so that they can be fed to machine learning algorithms to predict the corresponding
disease based on the user's symptoms.

6.2 NLP TECHNIQUES

6.2.1 Tokenization

Tokenization is an essential task in natural language processing used to break up a string
of words into semantically useful units called tokens. Sentence tokenization splits sentences
within a text, and word tokenization splits words within a sentence. Generally, word tokens are
separated by blank spaces and sentence tokens by full stops.

An example of how word tokenization simplifies text:

Sentence: "I have a fever"
After word tokenization: 'I', 'have', 'a', 'fever'
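
The example above can be reproduced with a small regex-based sketch. The two functions are simplified stand-ins written for illustration (they are not the tokenizers of any particular NLP library), and they assume plain English text in which sentences end with `.`, `!` or `?`.

```python
import re


def sentence_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def word_tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation and spaces."""
    return re.findall(r"[A-Za-z0-9']+", sentence)
```

With this sketch, `word_tokenize("I have a fever")` yields the four tokens listed above.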

6.2.2 Lemmatization & Stemming

Both stemming and lemmatization aim to reduce the inflected and derived forms of a word to a
common base form. Stemming usually refers to a crude heuristic process that chops off the ends
of words in the hope of achieving this goal correctly most of the time, and often includes the
removal of derivational affixes. Lemmatization usually refers to doing things properly with the
use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional
endings only and to return the base or dictionary form of a word, which is known as the lemma.
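
The contrast between the two techniques can be illustrated with a deliberately naive sketch. The suffix list and the tiny lemma dictionary below are assumptions made for illustration; real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Suffixes tried in order by the toy stemmer (an assumed, very short list).
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical hand-made vocabulary mapping word forms to their lemmas.
LEMMAS = {"feet": "foot", "aches": "ache", "coughing": "cough"}


def crude_stem(word):
    """Chop a known suffix off the end of a word, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)
```

Here `crude_stem("aches")` gives `"ach"`, which is not a dictionary word, while `lemmatize("aches")` gives `"ache"`; the irregular plural `"feet"` is untouched by the stemmer but mapped to `"foot"` by the lemmatizer.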

6.2.3 Stopword Removal

Removing stop words is an essential step in NLP text processing. It involves filtering out
high-frequency words that add little or no semantic value to a sentence, for example, which, to,
at, for, is, etc. You can even customize lists of stopwords to include words that you want to ignore.
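
Stopword filtering with a customizable list can be sketched as follows. The stopword set here is a short illustrative sample rather than a standard list, and the `extra` parameter is a hypothetical way to add words the caller wants ignored.

```python
# A short illustrative stopword sample; real lists are much longer.
STOPWORDS = {"i", "a", "an", "the", "is", "am", "are", "to", "at",
             "for", "which", "and", "have", "of", "my"}


def remove_stopwords(tokens, extra=()):
    """Drop tokens found in the stopword set, plus any caller-supplied extras."""
    blocked = STOPWORDS | {word.lower() for word in extra}
    return [token for token in tokens if token.lower() not in blocked]
```

Applied to the tokens `['I', 'have', 'a', 'fever']`, only `'fever'` remains, which is the keyword the later stages need.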

6.2.4 Bag of Words & TF-IDF

A bag-of-words model is a way of extracting features from text for use in modeling, such
as with machine learning algorithms.

TF-IDF stands for "Term Frequency - Inverse Document Frequency". This is a technique to
quantify a word in documents; we generally compute a weight for each word which signifies the
importance of the word in the document and corpus.
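
Both ideas can be sketched in a few lines. The sketch assumes one common textbook variant, where the term frequency is the count of a term divided by the document length and the inverse document frequency is log(N / df) for N documents, df of which contain the term. Libraries often apply smoothing or normalization, so their numbers may differ. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one tokenized document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["fever", "cough", "fever"], ["cough", "headache"]]
vocab = sorted({term for doc in docs for term in doc})
```

In this toy corpus "cough" occurs in both documents, so its weight is zero, while "fever" occurs only in the first document and receives a positive weight there.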

6.3 IMPLEMENTATION

For speech recognition, we have implemented Python code that takes voice input
from the user's microphone and converts it into the corresponding text. Here is the code
snippet for speech recognition:

Figure 6.3.1: Speech recognition code
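
A minimal sketch of this kind of routine is given below. It assumes the third-party SpeechRecognition package (with PyAudio for microphone access) and its Google Web Speech backend, which needs an internet connection. The function name, the default language code and the timeout are hypothetical choices, not details taken from the project's own listing.

```python
def transcribe_from_microphone(language="en-IN", timeout=5):
    """Record one utterance from the default microphone and return it as text."""
    # Imported lazily so this module can be loaded without the package installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Sample background noise briefly so quiet speech is still detected.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        audio = recognizer.listen(source, timeout=timeout)
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # The audio could not be understood as speech.
        return None
    except sr.RequestError as exc:
        raise RuntimeError("speech recognition service unavailable") from exc
```

Passing a different language code, for example `language="hi-IN"`, is one possible way to accept Hindi speech with the same routine.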

For text pre-processing, we have used various NLP techniques like tokenization,
stemming, lemmatization and removal of stop words. Here is the code snippet for this:

Figure 6.3.2: Text Pre-processing code

To identify word importance in the user's input, we have implemented two more NLP
methods, Bag of Words and TF-IDF. These methods give a numerical value that
indicates the importance of each word present in the corpus. We have tested these methods on
two statements. Here is the snippet of the output of these methods:

Figure 6.3.3: Output of NLP Methods

CHAPTER - 7
MACHINE LEARNING
7.1 INTRODUCTION TO ML

7.1.1 What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.

The process of learning begins with observations or data, such as examples, direct experience,
or instruction, in order to look for patterns in data and make better decisions in the future based
on the examples that we provide. The primary aim is to allow computers to learn automatically,
without human intervention or assistance, and to adjust their actions accordingly.

However, classic machine learning algorithms treat text as a sequence of keywords, whereas
an approach based on semantic analysis mimics the human ability to understand the meaning
of a text.

7.1.2 Machine Learning Methods

Machine learning algorithms are often categorized as supervised or unsupervised;
semi-supervised and reinforcement learning are also distinguished.

• Supervised machine learning algorithms can apply what has been learned in the past
to new data using labeled examples to predict future events. Starting from the analysis
of a known training dataset, the learning algorithm produces an inferred function to
make predictions about the output values. The system is able to provide targets for any
new input after sufficient training. The learning algorithm can also compare its output
with the correct, intended output and find errors in order to modify the model
accordingly.
• In contrast, unsupervised machine learning algorithms are used when the
information used to train is neither classified nor labeled. Unsupervised learning studies
how systems can infer a function to describe a hidden structure from unlabeled data.
The system doesn't figure out the right output, but it explores the data and can draw
inferences from datasets to describe hidden structures in unlabeled data.
• Semi-supervised machine learning algorithms fall somewhere in between supervised
and unsupervised learning, since they use both labeled and unlabeled data for training,
typically a small amount of labeled data and a large amount of unlabeled data. The
systems that use this method are able to considerably improve learning accuracy.
Usually, semi-supervised learning is chosen when labeling the acquired data requires
skilled and relevant resources, whereas acquiring unlabeled data generally doesn't
require additional resources.
• Reinforcement machine learning is a learning method in which an agent interacts with
its environment by producing actions and discovering errors or rewards. Trial-and-error
search and delayed reward are the most relevant characteristics of reinforcement
learning. This method allows machines and software agents to automatically determine
the ideal behavior within a specific context in order to maximize their performance.
Simple reward feedback is required for the agent to learn which action is best; this is
known as the reinforcement signal.

7.1.3 History of Machine Learning in Healthcare

Research in the 1960s and 1970s produced the first problem-solving program, or expert
system, known as Dendral. While it was designed for applications in organic chemistry, it
provided the basis for a subsequent system MYCIN, considered one of the most significant
early uses of artificial intelligence in medicine. MYCIN and other systems such as
INTERNIST-1 and CASNET did not achieve routine use by practitioners, however.

The 1980s and 1990s brought the proliferation of the microcomputer and new levels of
network connectivity. During this time, there was a recognition by researchers and developers
that AI systems in healthcare must be designed to accommodate the absence of perfect data
and build on the expertise of physicians. Approaches involving fuzzy set theory, Bayesian
networks, and artificial neural networks, have been applied to intelligent computing systems in
healthcare.

Medical and technological advancements over this half-century period that have
enabled the growth of healthcare-related applications of AI include:

• Improvements in computing power, resulting in faster data collection and data
processing
• Widespread implementation of electronic health record systems
• Improvements in natural language processing and computer vision, enabling machines
to replicate human perceptual processes
• Enhanced precision of robot-assisted surgery
• Improvements in deep learning techniques and data logs in rare diseases

7.2 RESEARCH ON ML ALGORITHMS

Machine learning can be introduced as a scientific discipline that focuses on how
computers learn from data and continuously improve themselves. It is mainly based on
probability and statistics, but it can be more powerful than standard statistical methodologies
when it comes to decision making. The information extracted from a dataset and given
to the algorithm is called features. The accuracy of the predictions made by the model
depends on the quality of the features provided to the algorithm. It is the duty of a machine
learning developer to identify the subset of features that best fits the purpose and so increases
the accuracy of the model. This is not an easy task; repeated experiments are needed
to identify that feature subset. When applying a machine learning algorithm, there are
basically three steps to follow: training, validation, and testing. Training is important
because the accuracy of the results depends on the training dataset. The validation dataset
is used during development to tune the model and to compare candidate models. Here it is
important to keep both bias and variance low; because reducing one tends to increase the
other, a good machine learning algorithm must optimize the bias-variance trade-off. The
final performance of the algorithm is then measured on the test dataset, which is held out
until the end. As a start, it is useful to have an idea of the various approaches taken in
machine learning, along with several algorithms that are used extensively for clustering and
classification purposes in machine learning.
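
The three-way partition of a dataset described above can be sketched as follows. The 70/15/15 proportions, the seed and the function name are assumptions chosen for illustration; the report does not state the proportions it used.

```python
import random


def three_way_split(rows, train_pct=70, validation_pct=15, seed=42):
    """Shuffle the rows and cut them into training, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * validation_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the held-out part
    )


train_rows, val_rows, test_rows = three_way_split(range(100))
print(len(train_rows), len(val_rows), len(test_rows))
```

Each row lands in exactly one of the three parts, so no example used for fitting is reused when performance is measured.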

7.2.1 Supervised Learning

In supervised learning, a training set is provided together with the appropriate target
outputs. Classification and regression are the two categories found in supervised learning. In
classification, the trained system allocates inputs into classes. In regression, the outputs
are continuous rather than discrete. The root-mean-squared error is used to evaluate
regression predictions, while accuracy is used to evaluate
classification predictions. Supervised learning has the goal of predicting a known output based
on a common dataset. Tasks performed by supervised learning can most of the time be
performed by a trained person as well. Supervised learning focuses on classification, which
involves choosing among subgroups to best describe a new instance of data, and on prediction,
which involves estimating an unknown parameter. It is often used to estimate and model
risk while finding relationships which are not readily visible to humans. Below are a few
supervised learning algorithms which are widely used in the field of computational biology and
biomedicine.
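
The two evaluation measures named in this subsection can be written in a few lines. The sample values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error for regression predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))


def accuracy(y_true, y_pred):
    """Fraction of classification predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


print(rmse([3.0, 5.0], [2.0, 7.0]))
print(accuracy(["flu", "cold", "flu", "flu"], ["flu", "flu", "flu", "cold"]))
```

A lower RMSE means the continuous predictions are closer to the true values; a higher accuracy means more class labels were predicted correctly.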

K-Nearest Neighbour (KNN)

KNN is a popular supervised classification algorithm which is used in many fields such
as pattern recognition, intrusion detection, and so on. KNN is a simple algorithm which is easy
to understand. Its accuracy can be high, but it is computationally expensive at prediction
time and has a high memory requirement, because the whole training dataset must be
stored and compared against each new instance. A prediction for a new instance is obtained
by first finding the most similar instances and then summarizing the output variable of those
similar instances. For regression, this can be the mean value, and for classification, this
may be the mode value. A distance measure is used to determine the similar instances;
Euclidean distance is the most popular choice. The training dataset should consist of
vectors in a multidimensional feature space, each with a class label.

Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm which is used mainly for
classification problems but also for regression. In this algorithm, the data
items are first plotted as points in an n-dimensional space, with each feature value being a
particular coordinate. It then identifies the hyperplane that separates the datapoints into two
classes in such a way that the marginal distance between the decision hyperplane and the
instances closest to the boundary is maximized [5]. What sets SVM apart from other algorithms
is that it has kernel functions that can map points to other dimensions by using nonlinear
relationships. As it divides the datapoints into two classes, SVM is also known as a
nonprobabilistic binary classifier. SVM is often more accurate than many other algorithms,
but it is best suited to problems with small datasets, because training becomes more complex
and time consuming as the dataset grows. It also does not perform well on noisy data.
To make the classification more efficient, SVM uses only a subset of the training
points (the support vectors). SVM is capable of solving both linear and nonlinear problems;
nonlinear SVM is preferred when the classes cannot be separated by a linear boundary.
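
For reference, the margin maximization described above is usually written as the following optimization problem for the linearly separable case. The symbols are introduced here for illustration and do not appear in the report: x_i are the training points, y_i ∈ {-1, +1} their class labels, and w and b the normal vector and offset of the hyperplane.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n
```

The width of the margin is 2 / ‖w‖, so minimizing ‖w‖ maximizes the margin; the training points for which the constraint holds with equality are the support vectors.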

Decision Trees (DTs)

DT is a supervised algorithm with a tree-like model in which decisions, their possible
consequences, and their outcomes are considered. Each internal node carries a question, and
each branch represents an outcome. The leaf nodes are class labels. When a sample
reaches a leaf node, the label of that node is assigned to the sample.
This approach is suited to simple problems and small datasets. Even though
the algorithm is easy to understand, it has certain issues, such as overfitting and
biased outcomes when working with imbalanced datasets. DT is, however, capable of mapping
both linear and nonlinear relationships.

7.2.2 Classification and Regression Trees (CARTs)

CART is a predictive model in which the output value is predicted based on the
existing values in the constructed tree. The representation for the CART model is a binary tree
in which each internal node represents a single input variable and a split point on that
variable. Leaf nodes contain an output which is used to make predictions.

Logistic Regression (LR)

LR is a popular mathematical modeling procedure which is used for epidemiologic
datasets in the area of machine learning. It passes a weighted combination of the inputs
through the logistic function, learns the coefficients of the logistic regression model from
the training data, and finally makes predictions using that model. This model is a
generalized linear model and has two parts, namely, the linear part and the link function.
The linear part is responsible for carrying out the calculations of the classification model,
and the link function is responsible for delivering the
output of the calculation. LR is a supervised machine learning algorithm which needs a
hypothesis and a cost function. Optimizing the cost function is important, because this is
how the coefficients are learned.
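
The hypothesis and cost function mentioned above can be sketched as follows. The log-loss shown is the usual choice of cost for logistic regression; the function names and the sample numbers are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Hypothesis: the linear part passed through the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def log_loss(weights, bias, X, y):
    """Cost function: average negative log-likelihood over labels in {0, 1}."""
    total = 0.0
    for features, label in zip(X, y):
        p = predict_proba(weights, bias, features)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(y)


print(predict_proba([0.8], -1.0, [2.0]))
print(log_loss([0.0], 0.0, [[1.0], [2.0]], [0, 1]))
```

Training consists of adjusting `weights` and `bias` so that `log_loss` becomes as small as possible, typically with gradient descent.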

Random Forest Algorithm (RFA)

RFA is a widely used machine learning technique which is capable of both regression and
classification. It is a supervised learning algorithm whose underlying methodology is
recursion. In this algorithm, a group of decision trees is created and the bagging method
is used for training. RFA is relatively insensitive to noise and can be used for imbalanced
datasets. Overfitting is also less prominent in RFA than in a single decision tree.

Naive Bayes (NB)

NB is a classification algorithm which is used for binary and multiclass problems. The
NB classifiers are a collection of classification algorithms based on the Bayes theorem.
They all share a common assumption: every pair of features is independent of each other,
given the class. Like SVM, NB is a classifier, but it relies on probabilistic methods.
When there is a new input, a probability is calculated for each class given that input,
and the input is labeled
with the class which has the highest probability.
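
The prediction rule above can be sketched with a small Bernoulli-style classifier over symptom sets. The data, the function names and the use of add-one (Laplace) smoothing are assumptions made for illustration; they are not taken from the project.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """rows: one set of symptoms per example; labels: one class per example."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for symptoms, label in zip(rows, labels):
        feature_counts[label].update(symptoms)
        vocab.update(symptoms)
    return class_counts, feature_counts, vocab


def predict_nb(model, symptoms):
    """Return the class with the highest log-posterior for the given symptoms."""
    class_counts, feature_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        for s in vocab:
            # Add-one smoothing keeps unseen symptoms from zeroing the product.
            p = (feature_counts[label][s] + 1) / (count + 2)
            score += math.log(p if s in symptoms else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


rows = [{"fever", "cough"}, {"fever", "headache"}, {"sneezing", "cough"}, {"sneezing"}]
labels = ["flu", "flu", "cold", "cold"]
model = train_nb(rows, labels)
print(predict_nb(model, {"fever"}), predict_nb(model, {"sneezing"}))
```

Each symptom contributes its own factor to the score, which is exactly the independence assumption: no interaction between symptoms is modeled.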

7.2.3 Unsupervised Learning

When a developer does not have a clear understanding of the data that are involved with
the system, it is not possible to label the data and provide them as the training dataset. In these
cases, the machine learning algorithms themselves can be used to detect similarities and
differences between the data objects. This is the unsupervised approach of machine learning.
In this method, existing patterns will be identified and the data will be clustered according to
the identified patterns. Therefore, in unsupervised learning, the system makes decisions
without being trained by a dataset as no labeled data are being given to the system which could
be used for predictions. Unsupervised learning is thus an attempt to find
naturally occurring patterns or groups within data. The challenging part is to determine whether
the recognized patterns or groups are useful in some way. This is why unsupervised
learning plays a major role in precision medicine. As a simple example, when grouping
individuals according to their genetics, environment, and medical history, certain relationships
among them which were not visible before might get identified by unsupervised machine
learning algorithms. K-means, mean shift, affinity propagation, density-based spatial clustering
of applications with noise (DBSCAN), Gaussian mixture modelling, Markov random fields,
iterative self-organizing data (ISODATA), and fuzzy C-means systems are a few examples for
unsupervised algorithms.

Clustering is an approach in unsupervised learning that divides inputs into
clusters. These clusters are not identified in advance; inputs are grouped based on resemblance.
Clustering approaches are distinguished by the different features that they carry.
They can be partitioning (k-means), hierarchical, grid-based, density-based, or model-based,
and they can be further divided as numerical, discrete, and mixed data types. Inheritance
relationships between clustering algorithms within an approach show common features and
improvements that they make on each other. Speed, minimal parameters, robustness to noise,
outliers, redundancy handling, and object order independence are the desired clustering
features which are required in a clustering algorithm to be implemented within a biomedical
application. Clustering algorithms are used when datasets are too large and complex for manual
analysis. Therefore, they must be fast and they must not be affected by redundant sequences.
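
As a small illustration of the partitioning approach, the following sketch runs k-means (Lloyd's algorithm) on one-dimensional data. The temperatures, the initial centroids and the function name are made up for the example.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


temps = [36.5, 36.8, 37.0, 39.1, 39.4, 40.0]  # made-up body temperatures
centroids, clusters = k_means_1d(temps, centroids=[36.0, 40.0])
print(centroids)
print(clusters)
```

No labels are supplied; the two groups emerge purely from how close the values are to each other.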

Learning Class | Data Type | Usage Type | Output Accuracy / Performance | Affected by Missing Data | Scalable | Cost
Supervised | Labeled | Classification, Regression | High | Yes | Yes, but we need to label large volumes of data automatically. | Expensive
Unsupervised | Unlabeled | Clustering, Transformations | Low | No | Yes, but we need to verify the accuracy of the predicted output. | Inexpensive

Table 7.2.1: The difference between supervised learning and unsupervised learning

Algorithm Name | Learning Type | Used for | Positives | Negatives
K-Nearest Neighbor (K-NN) | Supervised | Classification, Regression | Nonparametric approach. Intuitive to understand. Easy to implement. Does not require explicit training. Can be easily adapted to changes simply by updating its set of labeled observations. | Takes a long time to calculate the similarity between the datasets. The performance is degraded because of imbalanced datasets. The performance is sensitive to the choice of hyperparameter (K value). The information might be lost, so we need to use homogeneous features.
Naïve Bayes (NB) | Supervised | Probabilistic classification | Scanning of data by looking at each feature individually. Collecting simple per-class statistics from each feature helps with increasing the assumptions accuracy. | Requires only a small amount of training data. Determines only the variances of the variables for each class.
Decision Trees (DTs) | Supervised | Prediction, Classification | Easy to implement. Can handle categorical and continuous attributes. Requires little to no data preprocessing. | Sensitive to the imbalanced dataset and noise in the training dataset. Expensive, and needs more memory. Must select the depth of the node carefully to avoid variance and bias.
Random Forest | Supervised | Classification, Regression | Lower correlations across the decision trees. Improves the DT's performance. | Does not work well on high-dimensional, sparse data.
Support Vector Machine (SVM) | Supervised | Binary classification, Nonlinear classification | More effective in high-dimensional space. Using the kernel trick is the real strength of SVM. | Selecting the best hyperplane and kernel trick is not easy.

Table 7.2.2: Summary of the reviewed ML algorithms.

7.3 IMPLEMENTATION

After going through several research papers, we decided to try our data on two
algorithms: Random Forest and K-Nearest Neighbor.

7.3.1 Random Forest

As stated earlier, Random Forest is a classifier that, instead of relying on one decision
tree, takes the prediction from each tree and gives the final output based on the majority
vote of those predictions. A greater number of trees in the forest generally leads to higher
accuracy and reduces the problem of overfitting. Overfitting refers to the scenario where a
machine learning model cannot generalize or fit well on an unseen dataset. It occurs when a
function corresponds too closely to the training dataset and fails to fit additional data,
which may affect the accuracy of predicting future observations.

The forest is constructed as follows. First, choose the number N of decision trees that we
want to build. Then select K random data points from the training set and build the decision
tree associated with the selected data points; repeat these two steps until N trees have been
built. For a new data point, find the prediction of each decision tree and assign the new
data point to the category that wins the majority vote.

Another great quality of the random forest algorithm is that it is very easy to measure the
relative importance of each feature for the prediction. Sklearn provides a tool for this that
measures a feature's importance by looking at how much the tree nodes that use that feature
reduce impurity across all trees in the forest. It computes this score automatically for each
feature after training and scales the results so that the sum of all importances is equal to one.

In the code shown in Figure 7.3.1.1, we fit the Random Forest algorithm to the training set.
To do so, we have imported the RandomForestClassifier class from the sklearn.ensemble module.
The classifier object takes the parameter n_estimators, the required number of trees in the
Random Forest. The default value was 10 in scikit-learn releases before 0.22 and is 100 from
0.22 onwards; we have used 100. In general, a higher number of trees increases the performance
and makes the predictions more stable, but it also slows down the computation. Once the model
is fitted to the training set, we can predict the test result. For prediction, we have created
a new prediction vector y_pred.
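
The fit-and-predict flow described above can be sketched as follows. This is not the project's code: a synthetic dataset from `make_classification` stands in for the symptom table, and the split ratio and random seeds are assumptions. Only `RandomForestClassifier`, `n_estimators` and `y_pred` come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training table (assumption, not the real data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 100 trees, matching the value stated in the text.
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_train, y_train)

# Prediction vector for the held-out rows.
y_pred = classifier.predict(X_test)

# Impurity-based importances, scaled so that they sum to one.
importances = classifier.feature_importances_
print("accuracy:", classifier.score(X_test, y_test))
print("importances:", importances.round(3))
```

The `feature_importances_` attribute is the per-feature score discussed above; features with larger values contributed more to impurity reduction across the forest.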

Figure 7.3.1.1: Execution of Random Forest

Figure 7.3.1.2: Sample input to the code

Figure 7.3.1.3: Output of the following code

7.3.2 K-Nearest Neighbor (K-NN)

K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on the
Supervised Learning technique. The K-NN algorithm assumes similarity between the new
case/data and the available cases and puts the new case into the category that is most similar
to the available categories. The K-NN algorithm stores all the available data and classifies a
new data point based on similarity. This means that when new data appears, it can easily be
classified into a well-suited category by using the K-NN algorithm. The K-NN algorithm can be
used for Regression as well as for Classification, but it is mostly used for Classification
problems. K-NN is a non-parametric algorithm, which means it does not make any assumption
about the underlying data.

The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors

Step-2: Calculate the Euclidean distance from the new data point to each data point in the
training set.

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these K neighbors, count the number of data points in each category.

Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.

Step-6: Our model is ready.
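
The steps above can be sketched in plain Python as follows. The points, labels and function name are made up for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4 and 5: count the categories and return the most frequent one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```

There is no separate training phase: the "model" is simply the stored list of labeled points, which is why prediction cost grows with the size of the training set.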

How to select the value of K in the K-NN Algorithm?

There is no particular way to determine the best value for "K", so we need to try several
values and pick the one that performs best. A commonly used starting value for K is 5. A very
low value such as K=1 or K=2 can be noisy and makes the model sensitive to outliers. Larger
values of K smooth out noise, but they can blur the boundaries between categories and increase
the computation needed for each prediction.

Figure 7.3.2.1: Execution of K-Nearest Neighbor

Figure 7.3.2.2: Sample input to the code

Figure 7.3.2.3: Output of the following code

CHAPTER - 8
DATABASE
There are many changes taking place in the healthcare sector. Healthcare databases
are an important part of running its operations. A database is any record that a
practitioner maintains, whether in paper form or on a computer, and whether the practitioner
is a sole practitioner or a corporate body. With technological innovation, medical facilities
are moving towards online delivery of services.
8.1 Data in healthcare
The healthcare system generates data that requires delicate handling. A patient's life
depends on this information, so it is important for the healthcare provider to be
able to access it in the shortest time possible and to ensure that the information is correct
to the best of their knowledge.
Healthcare data is crucial and difficult to manage and handle for the
following reasons:
1. Efficient management of data is important, since a lot of data has to be stored for each
patient and there are many patients suffering from various diseases, so the database
must also be updated at regular intervals.
2. Data manipulation is a tedious task, as a healthcare database is huge and
needs to be updated every now and then.
3. Because the data is huge, it should be organized, maintained and managed in such a way
that it can be fetched or extracted in the shortest possible time and is
available to the user whenever needed.
4. Since the data relates to a patient's life, there is no room for mistakes in
it.
5. Data security is also important, since the data is sensitive.

8.2 Database development


Database development is the most important step, since the functioning of the chatbot is
completely dependent on data: if the data is missing or undeveloped, the machine
learning algorithms, the NLP module and even the basic functions of the chatbot will not
work.
A database is required at each and every step of the chatbot's operation. Various
types of dataset have to be created; some of them are listed below:

• Training Dataset – Data used to train the machine learning algorithm
• Testing Dataset – A completely new dataset used to check the accuracy of the machine
learning algorithm on inputs it has not seen before
• Question Answer Dataset – Required for basic interaction with the user
• Dialogue Datasets
Of the datasets listed above, the most important are the Training and Testing datasets,
because these are used to train and evaluate the chatbot.
To develop the dataset, the web scraping technique is used to extract data from various
sources available on the internet.

Web scraping, or web harvesting, is a technique used for extracting
data from websites. A web scraper accesses the World Wide Web directly, using the
Hypertext Transfer Protocol or a web browser.
Web scraping can be done in Python using the BeautifulSoup and Pandas
libraries. The scraped data can be stored in CSV, XML or JSON format as per the user's needs.
After the data has been scraped from the various sources, it has to be combined; this step
is called data integration.
After data integration comes the data cleaning step. Data from the internet is often not in
the desired format, and it may contain unwanted characters or text or
repeated data, so it has to be cleaned and properly formatted before it
is used for training the algorithms.
Once the training dataset has been created using Python, the testing dataset is
created as well.
8.3 Implementation
To develop the Training dataset, we performed web scraping on some websites and
extracted the medical data from them. This was done in Python using the
BeautifulSoup and Pandas libraries.
In the web scraping code, the class name of the data was first looked up in the Inspect
panel of the web page and passed as an attribute in the Python code, together with the URL of
the page from which the data was to be extracted. The contents of the table were then read
from the website with the pandas read_html method. If the scraped data is not in tabular form
on the website, it can be converted into tabular form using a DataFrame. Finally, the scraped
data is exported to a file using the to_excel method; note that to_excel writes an Excel
workbook, whereas the pandas to_csv method writes a CSV file directly.

Figure 8.3.1 Code for Web Scrapping

Figure 8.3.2 Code for Exporting Scrapped Data to CSV File
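
The scrape-and-export flow described in this section can be sketched as follows. To keep the example self-contained, an inline HTML string stands in for a downloaded page; the class name, the table contents and the use of `to_csv` for the export are assumptions, not the project's actual code.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML stands in for a fetched page; class name and rows are made up.
HTML = """
<table class="disease-table">
  <tr><th>Disease</th><th>Symptom</th></tr>
  <tr><td>Influenza</td><td>Fever</td></tr>
  <tr><td>Migraine</td><td>Headache</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Locate the table by the class name found in the browser's Inspect panel.
table = soup.find("table", class_="disease-table")

# Collect the text of every header and data cell, row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# First row holds the column names; the rest are records.
df = pd.DataFrame(rows[1:], columns=rows[0])

# Write CSV to an in-memory buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

For a live site, the HTML string would be replaced by the body of an HTTP response for the page URL.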

After the data was scraped, it was cleaned and formatted according to our needs using
Excel formulas and the Find and Replace option.

Excel formula to remove digits from alphanumeric data:

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(B3,1,""),2,""),3,""),4,""),5,""),6,""),7,""),8,""),9,""),0,"")
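
The same cleaning step can be expressed in Python, which the rest of the pipeline already uses. This is an equivalent sketch rather than the method used in the project, and the function name is hypothetical.

```python
import re


def strip_digits(text):
    """Remove every digit 0-9, as the nested SUBSTITUTE formula does for a cell."""
    return re.sub(r"[0-9]", "", text)


print(strip_digits("Fever102 and cough3"))
```

Applied to a whole pandas column, the same pattern can be used with `Series.str.replace` and `regex=True`.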

Figure 8.3.3 Snapshot of Cleaned Training.csv File

Figure 8.3.4 Snapshot of Cleaned Testing.csv File

CHAPTER - 9
CONCLUSION AND REFERENCES
9.1 Conclusion
The proposed system is designed to understand the user's query and, based on the
symptoms reported by the user, to give a proper diagnosis in an efficient and cost-effective
way. The main aim of the model is to provide healthcare services to people living in rural
areas, who often do not have access to such services.
The chatbot is expected to provide assistance in emergency situations and to suggest solutions
for non-severe medical issues until the user can consult a doctor.

9.2 Future work

• At present we have worked on two machine learning algorithms, Random Forest and
KNN, so we need to test the remaining algorithms and finalise the
machine learning algorithm that works best with our database and provides correct
and accurate results.
• The dataset currently has only 10 diseases and their related symptoms, so in future we
will add more diseases and make the system more effective at predicting diseases
for a given set of symptoms.
• Work on the NLP module, followed by integration and deployment of the modules.

9.3 References
1. https://www.sciencedirect.com/science/article/abs/pii/S1532046419302242

2. https://journals.sagepub.com/doi/pdf/10.1177/2055207619871808

3. https://www.jnronline.com/ojs/index.php/about/article/view/423/408

4. https://www.sciencedirect.com/science/article/pii/S1877050920306499

5. http://sersc.org/journals/index.php/IJAST/article/download/19027/9666/

6. AI Chatbot Design during an Epidemic like the Novel Coronavirus. Healthcare (mdpi.com)

7. International Journal of Recent Technology and Engineering (IJRTE)

8. https://www.ijitee.org/wp-content/uploads/papers/v9i1/A4915119119.pdf

9. https://downloads.hindawi.com/journals/jhe/2020/8839524.pdf

10. https://www.researchgate.net/publication/326469944_Automated_Medical_Chatbot
