on
Bachelor of Engineering
(Information Technology)
Sixth Semester
by
Nagpur-13
2020-21
CERTIFICATE
This is to certify that the Project Report on
by
Sixth Semester
Nagpur-13
2020-21
ACKNOWLEDGEMENTS
We would like to express our deepest gratitude to Dr. D. S. Adane, Head, Department
of Information Technology, RCOEM, Nagpur, for providing us with the opportunity to
embark on this project.
Name of Projectees
Simran Singh (23)
Parthsarthi Pahuja (54)
Yash Gupta (70)
CONTENTS
Page No.
ABSTRACT iii
LIST OF FIGURES iv
LIST OF TABLES v
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION TO CHATBOT 1
1.2 ARTIFICIAL INTELLIGENCE IN MEDICINE 2
1.3 FUTURE SCENARIO FOR INDIA 5
CHAPTER 2
OVERVIEW OF HEALTHBOT
2.1 CHATBOTS IN HEALTHCARE INDUSTRY 6
2.2 USE CASES IN HEALTHCARE 7
2.3 CHALLENGES AND LIMITATIONS 9
CHAPTER 3
AIMS AND OBJECTIVES
3.1 PROBLEM STATEMENT 11
3.2 PROPOSED SOLUTION 11
CHAPTER 4
LITERATURE REVIEW
4.1 SURVEY OF EXISTING MODELS 12
CHAPTER 5
METHODOLOGY
5.1 CHATBOT ARCHITECTURE 15
5.2 PHASES AND THEIR WORKING 15
5.3 MODULES 16
CHAPTER 6
NATURAL LANGUAGE PROCESSING
6.1 INTRODUCTION TO NLP 17
6.2 NLP TECHNIQUES 18
6.3 IMPLEMENTATION 19
CHAPTER 7
MACHINE LEARNING
7.1 INTRODUCTION TO ML 21
7.2 RESEARCH ON ML ALGORITHMS 23
7.3 IMPLEMENTATION 30
CHAPTER 8
DATABASE
8.1 DATA IN HEALTHCARE 35
8.2 DATABASE DEVELOPMENT 35
8.3 IMPLEMENTATION 36
CHAPTER 9
CONCLUSION AND REFERENCES
9.1 CONCLUSION 39
9.2 FUTURE WORK 39
9.3 REFERENCES 39
ABSTRACT
With the growing interest of individuals in health, life care, and disease, medical
institution services have been shifting from treatment awareness toward prevention and
health management. The medical industry is developing more services for health- and
lifestyle-promotion programs. This change represents a paradigm shift in medical services,
driven by increased life expectancy, aging, lifestyle changes, and income growth, and
consequently the idea of the smart health service has emerged as a major concern.
However, as the quantity of information grows and the complexity of medical information
intensifies, the constraints of previous approaches become increasingly problematic.
With incoming trends in technology, AI chatbots have managed to pave their way into the
healthcare domain. Although healthcare was not the first sector in which experiments
with chatbots were carried out, since the beginning of 2018 we have seen the emergence
of and experimentation with many different use cases in this field. A chatbot is an
intelligent conversation platform that interacts with users via a chatting interface,
and since its use can be facilitated by linkages with the major social-network service
messengers, general users can easily access and receive various health services. The
framework consists of the following three layers: Natural Language Processing, Machine
Learning and Database. This is followed by a focus on two machine learning algorithms,
Random Forest and KNN, which are supervised learning algorithms that take user input and
provide a diagnosis based on the information stored in the knowledge base of the system.
Currently the project is in the development phase, with the algorithm being tested on
ten diseases, and future plans have been stated.
LIST OF FIGURES
LIST OF TABLES
CHAPTER - 1
INTRODUCTION
1.1 INTRODUCTION TO CHATBOT
Several million people enter keywords every day in search engines such as Google and
then have to choose from a list of results, usually in the form of web pages in which it is again
necessary to search for specific information.
A chatbot is a software robot that can reproduce natural language and interact with an individual
through automated conversations. Chatbots allow you to receive a unique answer or a
service. In the literature, chatbots and conversational agents can be distinguished by
their level of understanding of natural language: the former use keyword or rule
engines, while the latter are based on machine learning. We shall use the term chatbot
in its generic sense in this report. The operating model of a chatbot is always the
same, whatever its scope, theme and level:
- Users formulate their queries in natural language via a voice or text interface.
- The chatbot receives the request, and its engine interprets it to understand it.
- The chatbot provides a unique and qualified answer to the user's query.

The answer may be generic (the same for everyone), contextualized (adapted to the
context, for example at a given time and place) or customized (adapted to the user, for
example by providing them with their bank balance). Three kinds of chatbot can be
distinguished:

- Assistants: provide the user with a predefined answer, as in a "Frequently Asked
Questions" page.
- Concierges: provide a contextualized response and facilitate a service to the user,
for example by explaining the steps of an action to be taken.
- Advisors: integrate customized answers to complex requests with automated processes to
perform certain actions.
Figure 1.1: Example of conversational bot
Chatbots are in the spotlight today, but the first chatbot emerged in 1964 with ELIZA.
Several chatbots have been tested to try to understand and reproduce the human ability to conduct
a conversation, through research on artificial intelligence in computer science. Other noteworthy
chatbots were then created with Jabberwacky in 1982 and A.L.I.C.E. in 1995 for example. Since
2010, the web giants have been launching smart assistants for smartphones and PCs to improve
the user experience. The best known is Siri, launched by Apple on the iPhone in 2010. Then there
was Google Now in 2012, Cortana at Microsoft and Alexa at Amazon in 2014. Since 2016, chatbot
solutions have been multiplying, particularly on Facebook Messenger, thanks to the simplification
of chatbot technologies and implementation tools that anyone can use.
1.2 ARTIFICIAL INTELLIGENCE IN MEDICINE
"Artificial intelligence is neither a new technology nor a machine." Artificial
intelligence is the recognition of outcome direction: the rapid analysis of live data to
achieve an expected goal. Outcome-directed thinking breaks from the confines of the
rule-directed approach, and this is accomplished through artificial intelligence. The
general practice of AI can be broken down into a straightforward process. First, a
numerical representation is established for the target or outcome. Specific data
associated with the target is then gathered, and conditions and behaviors are
investigated to increase the likelihood of achieving the expected target. Multiple
aspects can determine the outcome, and the weight of each aspect's effect is computed.
"AI uses the relative weighting of each aspect to create a prediction (evaluation)
formula" (Yano, K. 2017). Lastly, the formula devised from the weighted aspects is
applied to business decisions. AI can be classified into four groups: "systems that
think like humans, systems that act like humans, systems that think rationally and
systems that act rationally". AI is also commonly categorized as strong or weak: strong
AI is the production of human-like intelligent systems, while weak AI is the integration
of intelligent algorithms embedded within a system. "Machine learning, deep learning,
natural language processing and neural networks are often summarized under the term AI".

The application of AI in medicine has two main branches: the virtual branch and the
physical branch.

Virtual branch –
The virtual component is represented by machine learning (including deep learning):
mathematical algorithms that improve learning through experience. There are three types
of machine learning algorithms:
Physical branch –
It includes physical objects: medical devices and sophisticated robots for the delivery
of care (carebots) or for surgery.
1.3 FUTURE SCENARIO FOR INDIA
CHAPTER - 2
OVERVIEW OF
HEALTHBOT
2.1 CHATBOTS IN HEALTHCARE INDUSTRY
Although healthcare was not the first sector in which experiments with chatbots have been
carried out, since the beginning of 2018 we have seen the emergence of and experimentation with
many different use cases in this field. The chatbots thus try to handle several needs, such as
personalized medical follow-up, communication and transmission of test results, dissemination of
information, or even advice to patients or preliminary diagnosis. It is in this context,
and based on the project initiated by Sanofi in partnership with Orange Healthcare and
Kap Code, that some practical cases of healthcare chatbots and the specificities of the
healthcare sector are explored in this report, together with proposals for evaluating
user perception of these new digital tools.
The use of chat-bots has spread from consumer customer service to matters of life and
death. Chatbots are entering the healthcare industry and can help solve many of its problems.
A chatbot is a computer program designed to carry on a dialogue with people,
particularly on the Internet. It assists individuals via text messages within websites,
applications or instant messaging, and enables businesses to attract, retain and satisfy
clients. Such bots are automated systems for communicating with users. There are
chatbots which can answer questions such as: "How long is someone infectious after a
viral infection?", "How can I get a prescription?", "How can I find out my blood type
(blood group)?" Thus, by building a chatbot for their sites, clinics lower the number of
repetitive calls that their specialists have to answer. This, in turn, enables hospital
employees to concentrate on more significant tasks, leading to better healthcare service
quality. The proposed system will not only provide personal assistance to patients but
will also let users keep their previous medical records on the platform for future use.
The platform will provide a conversational experience to patients, as if a doctor were
treating them online.
2.2 USE CASES IN HEALTHCARE
1. Checking Symptoms
3. Medication Guidance
4. Book an appointment
2.3 CHALLENGES AND LIMITATIONS
One of the main hurdles for AI would be its adoption. Healthcare professionals would
have to be educated about the need for AI, and made comfortable working in an
environment where AI is present. Many doctors would not be open to information provided
by a machine, and would need to be educated to accept AI. Compliance and FDA regulations
can be another major problem. Currently, with AI being only partially understood, the
amount of importance that should be given to AI is also a question that lurks in the
minds of FDA personnel.

The industry is receptive to new ways to improve diagnostics, patient care, and
financial efficiency. However, AI healthcare companies contend with some significant
challenges with regard to widespread AI adoption in healthcare:

- Case study conundrum
- Black box issue
- Stakeholder complexities

Current trends
The Ministry of Health and Family Welfare is working on sector-specific legislation,
tentatively called the Healthcare Data Privacy and Security Act. In 2016, the hacking of
a Mumbai-based diagnostic laboratory database led to the leaking of medical records
(including HIV reports of over 35,000 patients). Hackers can exploit AI solutions to
collect private and sensitive information such as electronic health records. Security
considerations for a healthcare chatbot include:

- Man-in-the-middle attacks
- Chat logs stored on the user device
- Encryption of messages in transit
- Encryption of data at rest
- Use of external NLP services
- Logging and access rights
CHAPTER - 3
AIMS AND OBJECTIVES
3.1 PROBLEM STATEMENT
People in rural areas, especially in India, face many challenges such as expensive
medical care, lack of infrastructure, and the absence of doctors; they often have to
travel long distances to get medical assistance. Many more such challenges compromise
people's lives. To overcome this, we frame the problem statement as: "Instant access to
healthcare using an AI voice-enabled chatbot".

3.2 PROPOSED SOLUTION
For the given problem statement, we propose an "AI Healthcare Chatbot" which will
provide an instant solution:

- The chatbot will provide a diagnosis to the user based on the symptoms they provide.
- The chatbot will assist users in emergency situations. For example, if severe chest
pain or a heart attack is diagnosed from the user's symptoms, the chatbot will
immediately suggest seeking medical attention.
- The chatbot will also offer solutions for non-severe medical issues, for example
suggesting gargling when the diagnosis is a common cold.
- The chatbot will also provide details of the medication to be taken for the diagnosed
issue.
- Since many users in India are more comfortable with Hindi, the chatbot will support
interaction in Hindi, which will ease its use.
CHAPTER- 4
LITERATURE REVIEW
4.1 SURVEY OF EXISTING MODELS
A chatbot in healthcare is a system which assists users in learning about their disease,
gives treatment related to the disease, or gives information about nearby healthcare
centres in a cost-effective and efficient manner. Most researchers have used techniques
such as NLP and ML to predict the disease, but the differences arise in the choice of
machine learning algorithms and some novel functionalities. The research work draws on
verified journals or research papers which are either SCI- or Scopus-certified. Through
this research it was analysed that there are various techniques to build, train and
deploy a chatbot; some of the analysis is listed below.
4.1.4 Self-diagnosing health care chatbot using machine learning
This project aims at providing basic consultation to a user before consulting a doctor.
The chatbot identifies the symptoms and categorizes them as major or minor; if a symptom
is major, the chatbot advises the user to consult a doctor. NLP and a decision tree
algorithm were used by the developers to provide the diagnosis.
4.1.5 Design and development of diagnostic chatbot for supporting primary health care
systems
The chatbot was based on the supervised learning method, using NLP and a decision tree
algorithm. The chatbot provided a diagnosis based on the symptoms entered by the user.
It also includes functionality to connect the user to a doctor; if the doctor is
unavailable, preliminary consultation is provided by the chatbot. The disadvantage of
this model is that it worked with only a limited number of diseases, and its accuracy is
low for uncommon diseases.
4.1.9 Text messaging-based medical diagnosis using natural language processing and
fuzzy logic
This system was designed in Python and is able to diagnose using a direct
question-and-answer approach to suggest a medical diagnosis. The developers extracted
data from different standard websites to build their knowledge base. The entire project
was deployed in the Telegram app. The drawback of the system was that it did not guard
against false positives, i.e., falsely suggesting a disease.
CHAPTER - 5
METHODOLOGY
5.1 CHATBOT ARCHITECTURE
This is the complete architecture of our chatbot. It has three main phases:

Interaction with user
This phase deals with the users, the messaging platform and the speech recognition
component of the chatbot, and focuses on the conversation with the user.

- Using the messaging platform (the GUI of the chatbot), the user can interact with the
chatbot.
- The user can interact with the chatbot through a voice message or can type their input
as a text message.
- For voice input, the chatbot converts the voice message into text for further
processing.
- If the input is text, it is transferred directly to the NLP component of the
architecture.
5.3 MODULES
The project is divided into three modules:
Natural Language Processing (NLP)
Machine Learning (ML)
Database (Datasets)
CHAPTER-6
NATURAL LANGUAGE
PROCESSING
6.1 INTRODUCTION TO NLP
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that makes
human language intelligible to machines. NLP combines the power of linguistics and computer
science to study the rules and structure of language, and create intelligent systems (run on machine
learning and NLP algorithms) capable of understanding, analyzing, and extracting meaning from
text and speech.
NLP is used to understand the structure and meaning of human language by analyzing
different aspects like syntax, semantics, pragmatics, and morphology. Then, computer science
transforms this linguistic knowledge into rule-based, machine learning algorithms that can solve
specific problems and perform desired tasks.
By using NLP tools, the input data is pre-processed and data is converted into something
that a machine can understand. Then machine learning algorithms are fed with the outcomes to
train machines to make associations between a particular input and its corresponding output.
In our project, NLP is used to understand the user's input and extract key features,
i.e. symptoms, so that they can be fed to machine learning algorithms to predict the
corresponding disease based on the user's symptoms.
6.2 NLP TECHNIQUES
6.2.1 Tokenization, Stemming and Lemmatization
Tokenization is the process of splitting raw text into smaller units (tokens), typically
words or sentences, which later processing stages operate on. Stemming usually refers to
a crude heuristic process that chops off the ends of words in the hope of reducing them
to a common base form correctly most of the time, and often includes the removal of
derivational affixes. Lemmatization usually refers to doing things properly with the use
of a vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word, which is
known as the lemma.
Removing stop words is an essential step in NLP text processing. It involves filtering out
high-frequency words that add little or no semantic value to a sentence, for example, which, to,
at, for, is, etc. You can even customize lists of stopwords to include words that you want to ignore.
A bag-of-words model is a way of extracting features from text for use in modeling, such
as with machine learning algorithms.
TF-IDF stands for "Term Frequency – Inverse Document Frequency". It is a technique to
quantify a word in documents: we compute a weight for each word which signifies the
importance of that word in the document and the corpus.
6.3 IMPLEMENTATION
For speech recognition, we have implemented Python code to take voice input from the
user's microphone and convert it into the corresponding text. Here is the code snippet
for speech recognition:
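The snippet itself does not survive in this copy of the report. Below is a minimal sketch of what this step typically looks like, assuming the SpeechRecognition package (import name `speech_recognition`) and its Google Web Speech backend; the helper names are hypothetical, not the project's actual identifiers.

```python
def normalize_transcript(text):
    """Lower-case and collapse whitespace before handing text to the NLP stage."""
    return " ".join(text.lower().split())

def transcribe_from_mic():
    """Record one utterance from the microphone and return its text, or None."""
    import speech_recognition as sr  # lazy import: optional dependency
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
        audio = recognizer.listen(source)
    try:
        # Google Web Speech API backend; requires network access.
        return normalize_transcript(recognizer.recognize_google(audio))
    except sr.UnknownValueError:  # speech was unintelligible
        return None
```

The normalized text then enters the same NLP pipeline as typed input, so voice and text users are handled identically downstream.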
For text pre-processing, we have used various NLP techniques like tokenization,
stemming, lemmatization and removal of stop words. Here is the code snippet for this:
Figure 6.3.2: Text Pre-processing code
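The pre-processing figure is not reproduced in this copy. The sketch below is a dependency-free approximation of the pipeline from Section 6.2 (the project presumably used NLTK-style tools); the stop-word set and suffix list here are illustrative, far smaller than real ones.

```python
import re

# Tiny illustrative stop-word list; real pipelines use a much larger corpus.
STOP_WORDS = {"a", "an", "the", "is", "to", "at", "for",
              "which", "and", "i", "have", "been"}

SUFFIXES = ("ing", "edly", "ed", "es", "s")  # crude stemmer, not Porter

def tokenize(text):
    """Split a sentence into lower-case word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Chop a common suffix off the end of a word (heuristic stemming)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """tokenize -> remove stop words -> stem, as described in Section 6.2."""
    return [stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]
```

For example, `preprocess("I have been coughing badly")` keeps only the content words and reduces them to stems.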
To identify word importance in the user's input, we have implemented two more NLP
methods, Bag of Words and TF-IDF. Using these methods, we obtain a numerical value which
indicates the importance of each word present in the corpus. We have tested these
methods on two statements. Here is a snippet of the output of these methods:
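The output figure is missing here; the following sketch shows how Bag of Words counts and TF-IDF weights can be computed for two short tokenized statements (the symptom words are made up for illustration).

```python
import math
from collections import Counter

def bag_of_words(documents):
    """Term-count vector (a Counter) for each tokenized document."""
    return [Counter(doc) for doc in documents]

def tf_idf(documents):
    """TF-IDF weight per word per document.

    tf(w, d) = count of w in d / length of d
    idf(w)   = log(N / number of documents containing w)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in counts
        })
    return weights

docs = [["fever", "headache", "fever"], ["headache", "cold"]]
w = tf_idf(docs)
# "headache" appears in both documents, so idf = log(2/2) = 0 and its weight
# is zero; "fever" appears only in the first document, so it gets a positive weight.
```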
CHAPTER - 7
MACHINE LEARNING
7.1 INTRODUCTION TO ML
Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience,
or instruction, in order to look for patterns in data and make better decisions in the future based
on the examples that we provide. The primary aim is to allow computers to learn
automatically without human intervention or assistance, and to adjust actions
accordingly.
But, using the classic algorithms of machine learning, text is considered as a sequence of
keywords; instead, an approach based on semantic analysis mimics the human ability to
understand the meaning of a text.
Supervised machine learning algorithms can apply what has been learned in the past
to new data using labeled examples to predict future events. Starting from the analysis
of a known training dataset, the learning algorithm produces an inferred function to
make predictions about the output values. The system is able to provide targets for any
new input after sufficient training. The learning algorithm can also compare its output
with the correct, intended output and find errors in order to modify the model
accordingly.
In contrast, unsupervised machine learning algorithms are used when the
information used to train is neither classified nor labeled. Unsupervised learning studies
how systems can infer a function to describe a hidden structure from unlabeled data.
The system doesn't figure out the right output, but it explores the data and can draw
inferences from datasets to describe hidden structures in unlabeled data.
Semi-supervised machine learning algorithms fall somewhere in between supervised
and unsupervised learning, since they use both labeled and unlabeled data for training
– typically a small amount of labeled data and a large amount of unlabeled data. The
systems that use this method are able to considerably improve learning accuracy.
Usually, semi-supervised learning is chosen when the acquired labeled data requires
skilled and relevant resources in order to train it or learn from it, whereas acquiring
unlabeled data generally doesn't require additional resources.
Reinforcement learning is a method in which an agent interacts with
its environment by producing actions and discovers errors or rewards. Trial-and-error
search and delayed reward are the most relevant characteristics of reinforcement
learning. This method allows machines and software agents to automatically determine
the ideal behavior within a specific context in order to maximize its performance.
Simple reward feedback is required for the agent to learn which action is best; this is
known as the reinforcement signal.
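The reward-feedback loop described above can be illustrated with a minimal epsilon-greedy agent on a two-action bandit; the reward probabilities below are invented purely for illustration (this is not part of the project's chatbot).

```python
import random

def run_bandit(steps=5000, epsilon=0.1, seed=0):
    """Learn which of two actions pays better, from reward feedback alone."""
    rng = random.Random(seed)
    reward_prob = [0.2, 0.8]       # hidden environment: action 1 is better
    value = [0.0, 0.0]             # the agent's estimated value of each action
    counts = [0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:              # explore a random action
            action = rng.randrange(2)
        else:                                   # exploit the current best estimate
            action = 0 if value[0] >= value[1] else 1
        reward = 1.0 if rng.random() < reward_prob[action] else 0.0
        counts[action] += 1
        # incremental mean update: the reinforcement signal adjusts the estimate
        value[action] += (reward - value[action]) / counts[action]
    return value

values = run_bandit()
# After training, the agent's estimate for action 1 exceeds that for action 0.
```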
Research in the 1960s and 1970s produced the first problem-solving program, or expert
system, known as Dendral. While it was designed for applications in organic chemistry, it
provided the basis for a subsequent system MYCIN, considered one of the most significant
early uses of artificial intelligence in medicine. MYCIN and other systems such as
INTERNIST-1 and CASNET did not achieve routine use by practitioners, however.
The 1980s and 1990s brought the proliferation of the microcomputer and new levels of
network connectivity. During this time, there was a recognition by researchers and developers
that AI systems in healthcare must be designed to accommodate the absence of perfect data
and build on the expertise of physicians. Approaches involving fuzzy set theory, Bayesian
networks, and artificial neural networks, have been applied to intelligent computing systems in
healthcare.
Medical and technological advancements occurring over this half-century period that have
enabled the growth of healthcare-related applications of AI include:
- Improvements in natural language processing and computer vision, enabling machines to
replicate human perceptual processes
- Enhanced precision of robot-assisted surgery
- Improvements in deep learning techniques and data logs in rare diseases
7.2 RESEARCH ON ML ALGORITHMS
Supervised learning has the goal of predicting a known output based on a common dataset.
Tasks performed by supervised learning can, most of the time, be performed by a trained
person as well. Supervised learning focuses on classification, which involves choosing
among subgroups to best describe a new instance of data, and prediction, which involves
estimating an unknown parameter. This is often used to estimate and model risk while
finding relationships which are not readily visible to humans. Below are a few
supervised learning algorithms which are widely used in the field of computational
biology and biomedicine.
K-nearest neighbors (KNN) is a popular supervised classification algorithm used in many
fields such as pattern recognition, intrusion detection, and so on. KNN is a simple
algorithm which is easy to understand. Although its accuracy is high, the issues are
that it is computationally expensive and has a high memory requirement, as both testing
and training data need to be stored. A prediction for a new instance is obtained by
first finding the most similar instances and then summarizing the output variable
according to those similar instances. For regression this can be the mean value, and for
classification the mode value. To determine the similar instances, a distance measure is
used; Euclidean distance is the most popular choice. The training dataset should be
vectors in a multidimensional feature space, each with a class label.
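The KNN procedure just described (Euclidean distance, then the mode of the k nearest labels) can be sketched in a few lines; the symptom vectors below are made-up toy data, not the project's dataset.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority label among its k nearest neighbours.

    `train` is a list of (feature_vector, label) pairs; the distance
    measure is Euclidean, the most popular choice noted above.
    """
    dists = sorted((math.dist(x, query), label) for x, label in train)
    top_labels = [label for _, label in dists[:k]]
    return Counter(top_labels).most_common(1)[0][0]  # the mode value

# Toy symptom vectors (invented numbers): [fever, cough]
train = [((1.0, 0.9), "flu"), ((0.9, 1.0), "flu"), ((0.1, 0.2), "cold"),
         ((0.2, 0.1), "cold"), ((0.0, 0.3), "cold")]
print(knn_predict(train, (0.95, 0.95)))  # -> flu
```

Note that no explicit training step exists: adding a labeled observation to `train` immediately changes future predictions, which matches the adaptability listed in the comparison table below.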
The support vector machine (SVM) is a supervised algorithm that separates classes with a
hyperplane; as the training dataset grows larger, training becomes more complex and time
consuming. When data have noise, it cannot perform well. To make classification more
efficient, SVM uses a subset of training points. SVM is capable of solving both linear
and nonlinear problems, but nonlinear SVM is preferred over linear SVM as it has better
performance.
The decision tree (DT) is a supervised algorithm with a tree-like model in which
decisions, possible consequences, and their outcomes are considered. Each internal node
carries a question, each branch represents an outcome, and the leaf nodes are class
labels. When a sample reaches a leaf node, the label of that node is assigned to the
sample. This approach is suited to simple problems and small datasets. Even though the
algorithm is easy to understand, it has certain issues, such as overfitting and biased
outcomes when working with imbalanced datasets. But a DT is capable of mapping both
linear and nonlinear relationships.
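The node/branch/leaf structure described above can be illustrated with a hand-built toy tree; the features, thresholds and labels here are invented for illustration, not learned from data.

```python
# Each internal node asks a question about one feature, each branch is an
# answer, and each leaf carries a class label.
TREE = {
    "question": ("fever", 38.0),            # (feature name, threshold)
    "yes": {
        "question": ("cough", 0.5),
        "yes": {"label": "flu"},
        "no": {"label": "viral fever"},
    },
    "no": {"label": "common cold"},
}

def classify(node, sample):
    """Walk the tree until a leaf is reached; return that leaf's label."""
    while "label" not in node:
        feature, threshold = node["question"]
        node = node["yes"] if sample[feature] > threshold else node["no"]
    return node["label"]

print(classify(TREE, {"fever": 39.2, "cough": 1.0}))  # -> flu
```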
CART (classification and regression trees) is a predictive model in which the output
value is predicted based on the existing values in the constructed tree. The
representation of the CART model is a binary tree in which each internal node represents
a single input variable and a split point on that variable. Leaf nodes contain an output
value which is used to make predictions.
Logistic regression (LR) is a supervised machine learning algorithm which needs a
hypothesis and a cost function, where the hypothesis maps the inputs to the output of
the calculation. It is to be noted that optimizing the cost function is important.
The random forest algorithm (RFA) is a popular machine learning technique capable of
both regression and classification. It is a supervised learning algorithm whose
underlying methodology is recursion: a group of decision trees is created, and the
bagging method is used for training. RFA is insensitive to noise and can be used for
imbalanced datasets. The problem of overfitting is also not prominent in RFA.
Naïve Bayes (NB) is a classification algorithm used for binary and multiclass problems.
NB classifiers are a collection of classifying algorithms based on Bayes' theorem. They
all adhere to a common principle: every pair of features being classified must be
independent of each other. This is somewhat similar to SVM, but the process takes
advantage of statistical methods. When there is a new input, a probabilistic value is
calculated for each class with regard to the given input, and the data is labeled with
the class that has the highest probabilistic value.
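The NB prediction rule described above (compute a probabilistic value per class, pick the highest) can be sketched as follows, using Laplace smoothing on made-up symptom data; this is an illustrative categorical NB, not the project's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Estimate log priors and per-class feature log-likelihoods.

    `samples` is a list of (set_of_symptoms, disease) pairs; Laplace
    smoothing (+1) keeps unseen symptoms from zeroing a probability.
    """
    class_counts = Counter(label for _, label in samples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in samples:
        feat_counts[label].update(feats)
        vocab |= feats
    model = {}
    for label, n in class_counts.items():
        prior = math.log(n / len(samples))
        likes = {f: math.log((feat_counts[label][f] + 1) / (n + 2))
                 for f in vocab}
        model[label] = (prior, likes)
    return model

def predict_nb(model, feats):
    """Pick the class with the highest posterior log-probability."""
    def score(label):
        prior, likes = model[label]
        return prior + sum(likes[f] for f in feats if f in likes)
    return max(model, key=score)

data = [({"fever", "cough"}, "flu"), ({"fever", "ache"}, "flu"),
        ({"sneeze"}, "cold"), ({"sneeze", "cough"}, "cold")]
model = train_nb(data)
print(predict_nb(model, {"fever"}))  # -> flu
```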
When a developer does not have a clear understanding of the data that are involved with
the system, it is not possible to label the data and provide them as the training dataset. In these
cases, the machine learning algorithms themselves can be used to detect similarities and
differences between the data objects. This is the unsupervised approach of machine learning.
In this method, existing patterns will be identified and the data will be clustered according to
the identified patterns. Therefore, in unsupervised learning, the system makes decisions
without being trained by a dataset as no labeled data are being given to the system which could
be used for predictions. It is to be noted that unsupervised learning is an attempt to find
naturally occurring patterns or groups within data. The challenging part in it is to find whether
the recognized patterns or groups are useful in some way. This is the reason for unsupervised
learning to play a major role in precision medicine. As a simple example, when grouping
individuals according to their genetics, environment, and medical history, certain relationships
among them which were not visible before might get identified by unsupervised machine
learning algorithms. K-means, mean shift, affinity propagation, density-based spatial clustering
of applications with noise (DBSCAN), Gaussian mixture modelling, Markov random fields,
iterative self-organizing data (ISODATA), and fuzzy C-means systems are a few examples for
unsupervised algorithms.
Clustering is an approach in unsupervised learning that can be used for dividing inputs
into clusters. These clusters are not identified initially but are grouped based on
resemblance. In clustering, the root approaches are separated according to the different
features that they carry: they can be partitioning (k-means), hierarchical, grid-based,
density-based, or model-based, and they can be further divided by numerical, discrete,
and mixed data types. Inheritance relationships between clustering algorithms within an
approach show common features and the improvements that they make on each other. Speed,
minimal parameters, robustness to noise and outliers, redundancy handling, and
object-order independence are the desired features required of a clustering algorithm
implemented within a biomedical application. Clustering algorithms are used when
datasets are too large and complex for manual analysis; therefore, they must be fast and
must not be affected by redundant sequences.
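As an illustration of the partitioning (k-means) approach named above, here is a plain k-means sketch in pure Python; the points and initial centroids are invented, and the initial centroids are passed in explicitly to keep the run deterministic.

```python
import math
import statistics

def kmeans(points, centroids, iters=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its cluster; repeat for `iters` rounds."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(statistics.fmean(coord) for coord in zip(*cluster))
            if cluster else centroids[i]        # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated toy groups in 2-D
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
```

On this toy data the centroids settle near (0.1, 0.05) and (5.1, 5.0), recovering the two groups without any labels — the "naturally occurring patterns" the text refers to.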
Learning Class | Data Type | Usage Type | Output Accuracy / Performance | Affected by Missing Data | Scalable | Cost
Supervised | Labeled | Classification, Regression | High | Yes | Yes, but large volumes of data need to be labeled automatically | Expensive

Table 7.2.1: The difference between supervised learning and unsupervised learning
Algorithm Name | Learning Type | Used for | Positives | Negatives
K-Nearest Neighbor (K-NN) | Supervised | Classification, Regression | Nonparametric approach. Intuitive to understand. Easy to implement. Does not require explicit training. Can be easily adapted to changes simply by updating its set of labeled observations. | Takes a long time to calculate the similarity between the datasets. Performance is degraded by imbalanced datasets and is sensitive to the choice of hyperparameter (K value). Information might be lost, so homogeneous features are needed.
Naïve Bayes (NB) | Supervised | Probabilistic classification | Scans the data by looking at each feature individually. Collecting simple per-class statistics from each feature helps increase accuracy. Requires only a small amount of training data. | Determines only the variances of the variables for each class, and relies on strong feature-independence assumptions.
Decision Trees (DTs) | Supervised | Prediction, Classification | Easy to implement. Can handle categorical and continuous attributes. Requires little to no data preprocessing. | Sensitive to imbalanced datasets and noise in the training dataset. Expensive, and needs more memory. The depth of the tree must be selected carefully to avoid variance and bias.
Random Forest | Supervised | Classification, Regression | Lower correlations across the decision trees. Improves the DT's performance. | Does not work well on high-dimensional, sparse data.
Support Vector Machine (SVM) | Supervised | Binary classification, Nonlinear classification | More effective in high-dimensional space. Using the kernel trick is the real strength of SVM. | Selecting the best hyperplane and kernel trick is not easy.
7.3 IMPLEMENTATION
After going through several research papers, we decided to try our data on two algorithms, one of them being Random Forest.

As stated earlier, Random Forest is a classifier that, instead of relying on one decision tree, takes the prediction from each tree and gives the final output based on the majority vote of those predictions. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting. Overfitting refers to the scenario where a machine learning model cannot generalize or fit well on an unseen dataset. It occurs when a function corresponds too closely to one dataset and fails to fit additional data, which may reduce the accuracy of predictions on future observations.

The forest is built from binary decision trees, constructed as follows: first, select K random data points from the training set; build the decision trees associated with the selected data points; choose the number N of decision trees to build; and repeat these steps. For a new data point, find the prediction of each decision tree and assign the point to the category that wins the majority of the votes.

Another great quality of the Random Forest algorithm is that it makes it very easy to measure the relative importance of each feature for the prediction. Sklearn provides a great tool for this: it measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest. It computes this score automatically for each feature after training and scales the results so that the sum of all importances equals one.

In the following code, we fit the Random Forest algorithm to the training set. To fit it, we imported the RandomForestClassifier class from the sklearn.ensemble library. The classifier object takes the parameter n_estimators, the required number of trees in the Random Forest; the default is 100 in current versions of scikit-learn (it was 10 in older releases), and we have taken 100. In general, a higher number of trees improves performance and makes the predictions more stable, but it also slows down the computation. Once our model is fitted to the training set, we can predict the test result. For prediction, we created a new prediction vector y_pred.
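The project's actual code appears in the figures below; as an illustrative sketch only, the steps described above can be reproduced with a stand-in labeled dataset (the medical data itself is not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in labeled dataset replacing the project's medical data.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# n_estimators is the number of trees in the forest; we use 100, as in the report.
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(x_train, y_train)

# Predict the test result into a new prediction vector y_pred.
y_pred = classifier.predict(x_test)

# Impurity-based relative importance of each feature; the values sum to one.
importances = classifier.feature_importances_
print(round(importances.sum(), 6))  # 1.0
```

The `feature_importances_` attribute is the sklearn tool mentioned above: it is computed automatically during `fit` and is already normalized.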
Figure 7.3.1.1: Execution of Random Forest
Figure 7.3.1.3: Output of the following code
The K-NN working can be explained on the basis of the following algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
How to select the value of K in the K-NN algorithm?
There is no particular way to determine the best value of K, so we need to try several values and pick the best among them. The most commonly preferred value is K = 5. A very low value such as K = 1 or K = 2 can be noisy and makes the model sensitive to outliers, while a very large K can smooth over the boundaries between categories.
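The figures below show the project's actual code; a minimal sketch of the same idea with scikit-learn's KNeighborsClassifier, using K = 5 on a tiny made-up two-cluster dataset, could look like:

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny hypothetical dataset: [feature1, feature2] -> class 0 or 1.
X = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
y = [0, 0, 0, 1, 1, 1]

# K = 5 neighbors, with distances measured by the Euclidean metric
# (minkowski with p = 2).
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X, y)

# A new point near the second cluster is assigned the majority class
# among its 5 nearest neighbors (3 of class 1 vs. 2 of class 0).
print(knn.predict([[4.9, 5.1]]))  # [1]
```

This mirrors Steps 1 to 5 above: the classifier stores the labeled observations at `fit` time and does the distance calculation and majority vote at `predict` time.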
Figure 7.3.2.3: Output of the following code
CHAPTER-8
DATABASE
There are many changes taking place in the healthcare sector, and healthcare databases are an important part of running its operations. A database here is any record that a practitioner maintains, on paper or on a computer, whether the practitioner is an individual or a corporate body. With technological innovations, medical facilities are leaning towards running their services online.
8.1 Data in healthcare
The healthcare system generates data that requires delicate handling. A patient's life depends on this information, so it is important for the healthcare provider to be able to access it in the shortest time possible and to ensure that the information is correct to the best of their knowledge.
Healthcare data is crucial yet difficult to manage and handle for the following reasons:
1. Efficient management of the data is important, since a lot of data is stored for a single patient and there are many patients suffering from various diseases, so the database must also be updated at regular intervals.
2. Data manipulation is a tedious task, as a healthcare database is huge and needs to be updated frequently.
3. Because the data is huge, it should be organized, maintained, and managed in such a way that it can be fetched or extracted in the shortest possible time and is available to the user whenever needed.
4. Since the data relates to patients' lives, there is no scope for any mistake in it.
5. Data security is equally important, since the data is sensitive.
8.2 Database development
Web scraping, or web harvesting, is a technique used for extracting data from websites. A web scraper accesses the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser.
Web scraping can be done in Python using the BeautifulSoup and pandas libraries. The scraped data can be stored in CSV, XML, or JSON format as per the user's needs. After the data is scraped from various sources, it is combined in a step called data integration.
After data integration comes the data-cleaning step. Data from the internet is rarely in the exact format one wants: it may contain unwanted characters, stray text, or repetitive records, so it must be cleaned and properly formatted before it is used to train the algorithms.
Once the training dataset is created using Python, a testing dataset is created in the same way.
8.3 Implementation
To develop the training dataset, we performed web scraping on some websites and extracted the medical data from them. This was done in Python, using the BeautifulSoup and pandas libraries.
In the web-scraping code, the class name of the data was first looked up in the Inspect section of the web page and passed as an attribute in the Python code, along with the URL of the page from which the data was to be extracted. Through the read_html method provided by pandas, the contents of the table were read from the website. If the scraped data is not present in tabular form on the website, it can be converted into tabular form using a DataFrame. Finally, the scraped data is exported to a CSV file using the to_csv method.
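The report's actual scraping code appears in the figures; purely as an illustrative sketch, the read_html-based approach can be demonstrated on an inline HTML table (the table contents and class name here are made up) instead of a live URL:

```python
import io

import pandas as pd

# A small HTML fragment standing in for a scraped web page (hypothetical data).
html = """
<table class="disease-table">
  <tr><th>Disease</th><th>Symptom</th></tr>
  <tr><td>Flu</td><td>Fever</td></tr>
  <tr><td>Migraine</td><td>Headache</td></tr>
</table>
"""

# read_html parses every <table> in the document into a list of DataFrames;
# attrs selects the table by the class name found via the Inspect section.
tables = pd.read_html(io.StringIO(html), attrs={"class": "disease-table"})
df = tables[0]

# Export the scraped table to a CSV file for use as a training dataset.
df.to_csv("training.csv", index=False)
print(df.shape)  # (2, 2)
```

In the real code the first argument would be the page's URL rather than an in-memory string; everything after `read_html` stays the same.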
Figure 8.3.2 Code for Exporting Scrapped Data to CSV File
After the data was scraped, it was cleaned and formatted according to our needs using Excel commands and the find-and-replace option.
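The cleaning itself was done in Excel; as an alternative sketch only, the same find-and-replace style cleaning can be expressed in pandas (the column name and replacement rules here are hypothetical):

```python
import pandas as pd

# Hypothetical scraped rows with stray characters and inconsistent formatting.
df = pd.DataFrame({"Symptom": [" fever*", "head_ache", "FEVER ", "chest pain*"]})

# Find-and-replace style cleaning: strip stray characters, replace underscores
# with spaces, trim whitespace, and normalize case.
df["Symptom"] = (
    df["Symptom"]
    .str.replace("*", "", regex=False)
    .str.replace("_", " ", regex=False)
    .str.strip()
    .str.lower()
)

# Repetitive data: drop duplicate rows that cleaning has made identical.
df = df.drop_duplicates().reset_index(drop=True)
print(df["Symptom"].tolist())  # ['fever', 'head ache', 'chest pain']
```

Doing this step in code rather than in Excel makes the cleaning repeatable whenever the data is re-scraped.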
Figure 8.3.4 Snapshot of Cleaned Testing.csv File
CHAPTER-9
CONCLUSION AND REFERENCES
9.1 Conclusion
The proposed system is designed to understand the user's query and, based on the symptoms described by the user, give a proper diagnosis in an efficient and cost-effective way. The main aim of the model is to provide a healthcare service to people living in rural areas, because they often do not have access to healthcare services.
The chatbot is expected to provide assistance in emergency situations and to suggest solutions for non-severe medical issues until the user is able to see or consult a doctor.
9.3 References
1. https://www.sciencedirect.com/science/article/abs/pii/S1532046419302242
2. https://journals.sagepub.com/doi/pdf/10.1177/2055207619871808
3. https://www.jnronline.com/ojs/index.php/about/article/view/423/408
4. https://www.sciencedirect.com/science/article/pii/S1877050920306499
5. http://sersc.org/journals/index.php/IJAST/article/download/19027/9666/
6. AI Chatbot Design during an Epidemic like the Novel Coronavirus, Healthcare (mdpi.com)
8. https://www.ijitee.org/wp-content/uploads/papers/v9i1/A4915119119.pdf
9. https://downloads.hindawi.com/journals/jhe/2020/8839524.pdf
10. https://www.researchgate.net/publication/326469944_Automated_Medical_Chatbot