Chronic Kidney Disease Prediction Using Machine Learning Techniques (Documentation)

CHRONIC KIDNEY DISEASE PREDICTION USING MACHINE

LEARNING TECHNIQUES

A Project submitted to Nirmala College for Women (Autonomous)

In partial fulfillment of the requirement for the degree of

Master of Computer Science

To be awarded by Bharathiar University, Coimbatore-641018

SUBMITTED BY

NETHRA.S (REG.NO:222CS002)

GUIDED BY

Mrs. M . LINCY JACQULINE., M.Sc., M.Phil., (Ph.D)

ASSISTANT PROFESSOR

DEPARTMENT OF COMPUTER SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

NIRMALA COLLEGE FOR WOMEN (AUTONOMOUS)

(Accredited with “A++” grade by NAAC)

Affiliated to Bharathiar University

2023-2024
CERTIFICATE

This is to certify that the project entitled “CHRONIC KIDNEY DISEASE PREDICTION
USING MACHINE LEARNING TECHNIQUES”, submitted to Nirmala College for
Women (Autonomous), in partial fulfillment of the requirement for the award of the Degree of
Master of Computer Science is a record of the original project work done by NETHRA.S
(222CS002), during the period 2023-2024 of her study in the Department of Computer Science
at Nirmala College For Women (Autonomous), Coimbatore, under my supervision and
guidance and the project has not formed the basis for the award of any degree to any candidate
of the University.

Signature of the Guide


Mrs. M. LINCY JACQULINE, M.Sc., M.Phil., (Ph.D)

Principal Head of the Department


Rev. Sr. Dr. G.S. MARY FABIOLA Mrs. R. UMA MAHESWARI

M.Sc., M.Phil., PGDCA., Ph.D. SLET MCA., M.Phil., (Ph.D)

Internal Examiner External Examiner


DECLARATION

I, NETHRA.S (222CS002), do hereby declare that the project entitled “CHRONIC KIDNEY
DISEASE PREDICTION USING MACHINE LEARNING TECHNIQUES” submitted to
Nirmala College for Women (Autonomous), in partial fulfillment of the requirements for the
award of the degree of Master of Computer Science to be awarded by Bharathiar University is
a record of original and independent project work done during the period 2023-2024 under the
supervision and guidance of Mrs. M. LINCY JACQULINE, M.Sc., M.Phil., (Ph.D).,
Assistant Professor, Department of Computer Science at Nirmala College for Women
(Autonomous), Coimbatore and it has not formed the basis for the award of any Degree to any
candidate of the University.

PLACE : Coimbatore

DATE :

Signature of the Candidate


NETHRA.S (222CS002)
ACKNOWLEDGEMENT

I take this opportunity to express my gratitude to the people who have contributed to and
encouraged me towards the completion of this project.

I thank God for giving me good health and the wisdom to do this project and complete it
successfully.

I am very grateful to Rev. Sr. Dr. G.S. Mary Fabiola, M.Sc., M.Phil., PGDCA., Ph.D.,
SLET., Principal, Nirmala College for Women (Autonomous), Coimbatore, for her
encouragement and generous help.

I render my heartfelt gratitude to Mrs. R. Uma Maheswari, MCA., M.Phil., (Ph.D)., Head,
Department of Computer Science, Nirmala College for Women, Coimbatore, for allowing me
to do this project and for contributing her encouragement and support.

I also render my heartfelt gratitude to Mrs. M. Lincy Jacquline, M.Sc., M.Phil., (Ph.D).,
Assistant Professor of Department of Computer Science, Nirmala College for Women,
Coimbatore, for allowing me to do this project and for contributing her encouragement and
guidance towards this project and I wish to express my deep sense of gratitude for her valuable
advice and suggestions, able guidance and generous help for successful completion of this
project.
SYNOPSIS

Chronic Kidney Disease (CKD), a gradual decrease in renal function over a period of
several months to years without any major symptoms, is a life-threatening disease. It
progresses in six stages according to the severity level. The stages are defined by the
Glomerular Filtration Rate (GFR), which is in turn estimated from several attributes, such as
age, sex, race and serum creatinine. Among the available models for estimating GFR, the
Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation, which is a linear
model, has been found to be quite efficient because it allows all CKD stages to be detected.

Early detection and treatment of CKD are highly desirable, as they can prevent unwanted
consequences. Machine learning methods are increasingly advocated for the early detection of
symptoms and the diagnosis of several diseases. With the same motivation, the aim of this
study is to predict the various stages of CKD using machine learning classification algorithms
on a dataset obtained from the medical records of affected people. Specifically, we have used
a hybrid of the Bagged Tree classifier and the Random Forest classifier to obtain a sustainable
and practicable model that detects the various stages of CKD with comprehensive medical
accuracy.
CONTENT

S.NO TITLE

ACKNOWLEDGEMENT
SYNOPSIS
1 INTRODUCTION
2 LITERATURE REVIEW
3 METHODOLOGY
3.1 Existing System
3.2 Limitations of Existing System
3.3 Proposed System
3.4 Proposed Method
4 EXPERIMENTAL RESULT
5 CONCLUSION
6 FUTURE WORK
REFERENCES
1. INTRODUCTION

Chronic kidney disease (CKD) is a global public health problem affecting approximately
10% of the world’s population. The prevalence of CKD is 10.8% in China and ranges from
10% to 15% in the United States. According to another study, the prevalence has reached
14.7% in the Mexican adult general population. The disease is characterized by a slow
deterioration in renal function, which eventually causes a complete loss of renal function.
CKD does not show obvious symptoms in its early stages. Therefore, the disease may not be
detected until the kidney loses about 25% of its function. In addition, CKD has high morbidity
and mortality, with a global impact on the human body, and it can induce cardiovascular
disease. CKD is a progressive and irreversible pathologic syndrome. Hence, the prediction and
diagnosis of CKD in its early stages is essential, because it may enable patients to receive
timely treatment that slows the progression of the disease.

Treatment of kidney failure aims to control its causes and slow the progression of renal
failure. If treatment is not sufficient, the patient reaches end-stage renal failure, where the
remaining options are dialysis or a renal transplant. At present, 4 out of every 1000 people in
the United Kingdom suffer from renal failure, and more than 300,000 American patients with
end-stage kidney disease survive on dialysis. Moreover, according to the National Health
Service, kidney disease is more frequent in South Asia and Africa than in other regions.
Because chronic kidney failure is often not detected until it has progressed considerably,
recognizing it at the first stage is extremely important. Through early diagnosis, the function
of each kidney can be kept under control, which decreases the risk of irreversible
consequences. For this reason, routine check-ups and early diagnosis are crucial for patients,
since they can prevent the serious risks of renal failure and related diseases. A blood test is
one of the steps used to detect CKD. The disease can be identified by measuring these factors,
and physicians can then decide on treatment, reducing the rate of progression.

1.1 Machine Learning

Machine Learning (ML) is a category of algorithms that allows software applications to
become more accurate in predicting outcomes without being explicitly programmed.

The basic premise of machine learning is to build algorithms that can receive input data
and use statistical analysis to predict an output while updating outputs as new data becomes
available.

The processes involved in machine learning are similar to those of data mining and predictive
modeling. Both require searching through data to look for patterns and adjusting program
actions accordingly. Many people are familiar with machine learning from shopping on the
internet and being served ads related to their purchase. This happens because recommendation
engines use machine learning to personalize online ad delivery in almost real time. Beyond
personalized marketing, other common machine learning use cases include fraud detection,
spam filtering, network security threat detection, predictive maintenance and building news
feeds.

Machine learning is a concept that allows a machine to learn from examples and experience
without being explicitly programmed. Instead of writing the code yourself, you feed data to a
generic algorithm, and the algorithm builds the logic based on the given data.

Machine learning is a subset of artificial intelligence that focuses mainly on machines
learning from experience and making predictions based on that experience. It enables
computers to make data-driven decisions rather than being explicitly programmed to carry out
a certain task. These programs or algorithms are designed so that they learn and improve over
time when exposed to new data.

Machine learning is the science of getting computers to act without being explicitly
programmed. In the past decade, machine learning has given us self-driving cars, practical
speech recognition, effective web search, and a vastly improved understanding of the human
genome. Machine learning is so pervasive today that you probably use it dozens of times a day
without knowing it.

Machine learning algorithms are often categorized as supervised or unsupervised.
Supervised algorithms require a data scientist or data analyst with machine learning skills to
provide both input and desired output, in addition to furnishing feedback about the accuracy of
predictions during algorithm training. Data scientists determine which variables, or features,
the model should analyze and use to develop predictions. Once training is complete, the
algorithm will apply what was learned to new data.

Unsupervised algorithms do not need to be trained with desired outcome data. Instead,
they use an iterative approach to review data and arrive at conclusions, such as groupings of
similar records. Neural networks, which are trained by an approach called deep learning, are
used for more complex processing tasks, including image recognition, speech-to-text and
natural language generation.

These neural networks work by combing through millions of examples of training data
and automatically identifying often subtle correlations between many variables. Once trained,
the algorithm can use its bank of associations to interpret new data. These algorithms have only
become feasible in the age of big data, as they require massive amounts of training data.

A machine learning algorithm is trained using a training data set to create a model. When
new input data is introduced to the ML algorithm, it makes a prediction on the basis of the
model.

The prediction is evaluated for accuracy and if the accuracy is acceptable, the Machine
Learning algorithm is deployed. If the accuracy is not acceptable, the Machine Learning
algorithm is trained again and again with an augmented training data set. This is just a very
high-level example as there are many factors and other steps involved.
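
The train, evaluate and retrain-or-deploy cycle described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the data is synthetic (it is
not the CKD dataset used in this project), the decision tree is an arbitrary choice of model,
and the acceptance threshold `ACCEPTABLE_ACCURACY` is a hypothetical value.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the CKD dataset used in this project.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train a model on the training data set.
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Make predictions on unseen data and evaluate their accuracy.
accuracy = accuracy_score(y_test, model.predict(X_test))

# Hypothetical acceptance threshold, chosen only for illustration.
ACCEPTABLE_ACCURACY = 0.80
ready_to_deploy = accuracy >= ACCEPTABLE_ACCURACY
print(f"accuracy={accuracy:.3f}, ready_to_deploy={ready_to_deploy}")
```

If `ready_to_deploy` is false, the cycle would be repeated with more or better training data.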

Fig 1.1. Working of Machine Learning

Fig 1.2. Working Process of Machine Learning

1.1.1 Types of Machine Learning

Machine learning is sub-categorized into three types:

• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

Supervised Learning

Supervised learning is learning that is guided by a teacher. A labelled dataset acts as the
teacher, and its role is to train the model or the machine. Once the model is trained, it can
start making predictions or decisions when new data is given to it.

Unsupervised Learning

The model learns through observation and finds structures in the data. Once the model
is given a dataset, it automatically finds patterns and relationships in the dataset by creating
clusters in it. What it cannot do is add labels to the clusters: it cannot say that this is a group
of apples or mangoes, but it will separate all the apples from the mangoes.

Suppose we present images of apples, bananas and mangoes to the model. Based on
patterns and relationships in the images, it creates clusters and divides the dataset into those
clusters. When new data is fed to the model, it assigns it to one of the created clusters.
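
As a rough illustration of this clustering behaviour, the sketch below uses K-means from
scikit-learn on synthetic two-dimensional points. The three blobs stand in for the three kinds
of fruit; the data, the choice of K-means and the number of clusters are assumptions made
only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabelled points forming three groups (stand-ins for three fruits).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# The model only sees X; it is never told which group a point belongs to.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# A new observation is assigned to one of the clusters that were found.
new_point = np.array([[0.0, 0.0]])
assigned_cluster = int(kmeans.predict(new_point)[0])
print("cluster sizes:", np.bincount(labels), "new point ->", assigned_cluster)
```

The cluster indices are arbitrary numbers; as the text says, the model separates the groups
but cannot name them.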

Reinforcement Learning

Reinforcement learning is the ability of an agent to interact with its environment and find
out which actions lead to the best outcome. It follows the trial-and-error method. The agent is
rewarded or penalized with a point for a correct or a wrong answer, and the model trains itself
on the basis of the reward points gained. Once trained, it is ready to make predictions on new
data presented to it.
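
A minimal sketch of this reward-and-penalty loop is an epsilon-greedy agent choosing among
three actions. The environment here is hypothetical: the success probabilities in
`success_probability`, the exploration rate `epsilon` and the number of trials are all
assumed values chosen for illustration.

```python
import random

rng = random.Random(0)

# Hypothetical environment: probability that each action is the "correct answer".
success_probability = [0.2, 0.5, 0.8]
n_actions = len(success_probability)

estimates = [0.0] * n_actions  # running average reward per action
counts = [0] * n_actions       # how often each action was tried
epsilon = 0.1                  # fraction of trials spent exploring

for _ in range(5000):
    if rng.random() < epsilon:
        action = rng.randrange(n_actions)  # explore: try a random action
    else:
        action = max(range(n_actions), key=lambda a: estimates[a])  # exploit

    # Reward of +1 for a correct answer, penalty of -1 for a wrong one.
    reward = 1.0 if rng.random() < success_probability[action] else -1.0

    # Incremental update of the average reward for the chosen action.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = max(range(n_actions), key=lambda a: estimates[a])
print("estimated rewards:", [round(e, 2) for e in estimates], "best:", best_action)
```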

Fig 1.3. Types of Machine Learning

1.1.2 Types of machine learning algorithms

Just as there are nearly limitless uses of machine learning, there is no shortage of
machine learning algorithms. They range from the fairly simple to the highly complex. Here
are a few of the most commonly used models:

• Regression. This class of machine learning algorithm involves identifying a correlation,
generally between two variables, and using that correlation to make predictions about
future data points.
• Decision trees. These models use observations about certain actions and identify an
optimal path for arriving at a desired outcome.
• K-means clustering. This model groups a set of data points into a specified number of
groups based on like characteristics.
• Neural networks. These deep learning models utilize large amounts of training data to
identify correlations between many variables to learn to process incoming data in the
future.
• Reinforcement learning. This area of machine learning involves models iterating over
many attempts to complete a process. Steps that produce favourable outcomes are
rewarded and steps that produce undesired outcomes are penalized until the algorithm
learns the optimal process.

1.1.3 The future of Machine Learning

While machine learning algorithms have been around for decades, they've attained new
popularity as artificial intelligence (AI) has grown in prominence. Deep learning models in
particular power today's most advanced AI applications.

Machine learning platforms are among enterprise technology's most competitive
realms, with most major vendors, including Amazon, Google, Microsoft, IBM and others,
racing to sign customers up for platform services that cover the spectrum of machine learning
activities, including data collection, data preparation, model building, training and application
deployment.

As machine learning continues to increase in importance to business operations and AI
becomes ever more practical in enterprise settings, the machine learning platform wars will
only intensify.

2. LITERATURE REVIEW

2.1 Diagnosis of Chronic Kidney Disease using Random Forest Algorithms

Machine learning (ML) is a category of algorithms that allows software applications to
become more accurate in predicting outcomes without being explicitly programmed. Chronic
kidney disease (CKD) is a slow and progressive loss of kidney function over a period of several
years. Medical tests for other purposes sometimes contain useful information about CKD.
Attributes of different medical tests are investigated to identify which attributes contain
useful information about CKD. A database with several attributes of CKD is analysed with
different techniques. Common Spatial Pattern (CSP) filter and Linear Discriminant Analysis
(LDA) are first used to identify the dominant attributes that could contribute to detecting CKD.
Classification methods are used to identify the dominant attributes. These analyses suggest that
hemoglobin, albumin, specific gravity, hypertension and diabetes mellitus together with serum
creatinine are the most important attributes in the early detection of CKD. Further, they suggest
that in the absence of information on hypertension and diabetes mellitus, the attributes random
blood glucose and blood pressure may be used. The main objective of this project is to
determine kidney function failure by applying the Random Forest algorithm and to classify
chronic and non-chronic kidney diseases.
In contrast, unsupervised machine learning algorithms are used when the information
used to train is neither classified nor labelled. Unsupervised learning studies how systems can
infer a function to describe a hidden structure from unlabelled data. The system doesn’t figure
out the right output, but it explores the data and can draw inferences from datasets to describe
hidden structures from unlabelled data.
Semi-supervised machine learning algorithms fall somewhere in between supervised
and unsupervised learning, since they use both labelled and unlabelled data for training,
typically a small amount of labelled data and a large amount of unlabelled data. Systems that
use this method are able to considerably improve learning accuracy. Usually, semi-supervised
learning is chosen when acquiring labelled data requires skilled and relevant resources,
whereas acquiring unlabelled data generally doesn’t require additional resources.
Reinforcement machine learning is a learning method in which an agent interacts with its
environment by producing actions and discovering errors or rewards. Trial-and-error search
and delayed reward are the most relevant characteristics of reinforcement learning.

2.2 Incorporating temporal EHR data in predictive models for risk stratification of renal
function deterioration

Predictive models built using temporal data in electronic health records (EHRs) can
potentially play a major role in improving management of chronic diseases. However, these
data present a multitude of technical challenges, including irregular sampling of data and
varying length of available patient history. In this paper, we describe and evaluate three
different approaches that use machine learning to build predictive models using temporal EHR
data of a patient.
The first approach is a commonly used non-temporal approach that aggregates values
of the predictors in the patient’s medical history. The other two approaches exploit the temporal
dynamics of the data. The two temporal approaches vary in how they model temporal
information and handle missing data. Using data from the EHR of Mount Sinai Medical Center,
we learned and evaluated the models in the context of predicting loss of estimated glomerular
filtration rate (eGFR), the most common assessment of kidney function.
Our results show that incorporating temporal information in a patient’s medical history can lead
to better prediction of loss of kidney function. They also demonstrate that exactly how this
information is incorporated is important. In particular, our results demonstrate that the relative
importance of different predictors varies over time, and that using multi-task learning to
account for this is an appropriate way to robustly capture the temporal dynamics in EHR data.
Using a case study, we also demonstrate how the multi-task learning based model can yield
predictive models with better performance for identifying patients at high risk of short-term
loss of kidney function.

In this study we presented three different methods to leverage longitudinal data: one
that does not use temporal information and two methods that capture temporal information.
These methods address some of the challenges faced in using EHR data, rather than data from
controlled studies, in building models. These challenges include irregularly sampled data and
varying lengths of patient history. Our results show that exploiting temporal information can
yield improvements in predicting deterioration of kidney function. Our results also demonstrate
that the choice of approach is crucial in successfully learning temporal models that generalize
well. In particular, we showed that a model based on multi-task machine learning can capture
temporal dynamics in EHR data without overfitting compared to other models we evaluated.

Using a case study, we demonstrate the potential clinical utility of the proposed multi-
task learning based temporal model for predicting renal deterioration for patients with
compromised kidney function.

2.3 Prevalence of Chronic Kidney Disease in an Adult Population

One strategy to prevent and manage chronic kidney disease (CKD) is to offer screening
programs. The aim of this study was to determine the prevalence and risk factors of CKD in a
screening program performed in an adult general population. This is a cross-sectional study.
Six hundred and ten adults without previously known CKD were evaluated. Participants were
subjected to a questionnaire, blood pressure measurement and anthropometry. Glomerular
filtration rate was estimated by the CKD-EPI formula, and urine was tested with an albuminuria
dipstick. More than 50% of subjects reported a family history of diabetes mellitus (DM),
hypertension and obesity, and 30% a family history of CKD. DM was self-reported in 19% and
hypertension in 29%. During screening, overweight/obesity was found in 75%; women had a
higher frequency of obesity (41 vs. 34%) and high-risk abdominal waist circumference (87 vs.
75%) than men. Hypertension (both self-reported and diagnosed in screening) was more
frequent in men (49%) than in women (38%). CKD was found in 14.7%: G1, 5.9%; G2, 4.5%;
G3a, 2.6%; G3b, 1.1%; G4, 0.3%; and G5, 0.3%. Glomerular filtration rate was
mildly/moderately reduced in 2.6%, moderately/severely reduced in 1.1%, and severely
reduced in <1%. Abnormal albuminuria was found in 13%. CKD was predicted by DM,
hypertension and male gender. A CKD prevalence of 14.7% was found in this sample of an
adult population, with most patients at early stages.
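
To make the GFR-based categories G1 to G5 concrete, the sketch below estimates GFR from
serum creatinine and maps it to a category. It assumes the 2009 CKD-EPI creatinine equation
and the usual GFR cut-offs (90, 60, 45, 30 and 15 mL/min/1.73 m²); which version of the
equation the reviewed study applied is not stated here, and the function names are
hypothetical. A GFR category on its own is not a diagnosis: categories G1 and G2 count as
CKD only together with other evidence of kidney damage, such as abnormal albuminuria.

```python
def ckd_epi_2009(serum_creatinine_mg_dl, age_years, is_female, is_black=False):
    """Estimated GFR (mL/min/1.73 m^2), assuming the 2009 CKD-EPI creatinine equation."""
    kappa = 0.7 if is_female else 0.9
    alpha = -0.329 if is_female else -0.411
    ratio = serum_creatinine_mg_dl / kappa
    egfr = (
        141.0
        * (min(ratio, 1.0) ** alpha)
        * (max(ratio, 1.0) ** (-1.209))
        * (0.993 ** age_years)
    )
    if is_female:
        egfr *= 1.018
    if is_black:
        egfr *= 1.159
    return egfr


def gfr_category(egfr):
    """Map an eGFR value to the GFR categories G1 to G5 (G3 split into G3a and G3b)."""
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


# Illustrative (made-up) patient: 50-year-old man with serum creatinine 1.0 mg/dL.
egfr = ckd_epi_2009(1.0, 50, is_female=False)
print(f"eGFR = {egfr:.1f}, category = {gfr_category(egfr)}")
```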
One limitation of the present study, and of screening programs of this kind, is its cross-
sectional nature. Diagnosis of CKD requires confirmation, which could reduce the number of
diagnosed cases by screening programs. In addition, when screening is to be performed, several
principles should be considered including the availability of facilities for diagnosis and
treatment and the economic balance in the expenditure of medical care when cases are found.
If a screening is performed but there is not comprehensive or adequate coverage for managing
diagnosed cases, serious ethical problems may arise. In the case of the IMSS system there is
no restriction for diagnosis and treatment of these kinds of patients and they were referred for
medical management as appropriate. Patients without social security were advised to seek
medical attention in other health-care systems. Further studies with larger sample sizes are
needed to corroborate our results and their implications. In conclusion, a percentage CKD
prevalence of 14.7% was found in this sample of the adult population with most patients at
early stages. Screening programs constitute excellent opportunities in the fight against kidney
disease, particularly in populations at high risk.

2.4 Diagnosis of Chronic Kidney Disease Based on Support Vector Machine by Feature
Selection Methods

As Chronic Kidney Disease progresses slowly, early detection and effective treatment
are the only way to reduce the mortality rate. Machine learning techniques are gaining
significance in medical diagnosis because of their classification ability with high accuracy
rates. The accuracy of classification algorithms depends on the use of correct feature selection
algorithms to reduce the dimension of datasets. In this study, the Support Vector Machine
classification algorithm was used to diagnose Chronic Kidney Disease. To diagnose the
Chronic Kidney Disease, two essential types of feature selection methods namely, wrapper and
filter approaches were chosen to reduce the dimension of Chronic Kidney Disease dataset. In
wrapper approach, classifier subset evaluator with greedy stepwise search engine and wrapper
subset evaluator with the Best First search engine were used. In filter approach, correlation
feature selection subset evaluator with greedy stepwise search engine and filtered subset
evaluator with the Best First search engine were used. The results showed that the Support
Vector Machine classifier by using filtered subset evaluator with the Best First search engine
feature selection method has higher accuracy rate (98.5%) in the diagnosis of Chronic Kidney
Disease compared to other selected methods.

In this study, wrapper and filter methods have been applied to the CKD data set. Two
different evaluators have been used for each method. For the filter approach, CfsSubsetEval
with the Greedy Stepwise search engine and FilterSubsetEval with the Best First search engine
have been used. For the wrapper approach, ClassifierSubsetEval with the Greedy Stepwise
search engine and WrapperSubsetEval with the Best First search engine have been used. The
accuracy rate of the SVM classifier on the full training set has been compared with its accuracy
rate on the 4 reduced datasets obtained by the feature selection methods. The results show that
after reducing the dimension of the CKD dataset, the accuracy rate of diagnosis improved with
all 4 methods.

ClassifierSubsetEval with the Greedy Stepwise search engine does not have the highest
accuracy rate (98%); the SVM classifier on 13 attributes of the CKD dataset, selected using
FilterSubsetEval with the Best First feature selection method, has the highest accuracy rate
(98.5%) in CKD diagnosis. Furthermore, with different feature selection methods and
classification algorithms on distinct disease datasets, classification results can differ in
accuracy.
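
The difference between the filter and wrapper approaches can be illustrated with scikit-learn.
This is not the evaluator and search-engine setup named in the reviewed study: `SelectKBest`
is swapped in as a filter method and `SequentialFeatureSelector` as a wrapper method, the data
is synthetic, and the number of selected features is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 12 attributes; not the CKD dataset.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
X = StandardScaler().fit_transform(X)
svm = SVC(kernel="linear")

# Filter approach: score each attribute on its own, independent of the classifier.
X_filter = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Wrapper approach: search attribute subsets by repeatedly training the classifier.
wrapper = SequentialFeatureSelector(
    svm, n_features_to_select=4, direction="forward", cv=3
)
X_wrapper = wrapper.fit_transform(X, y)

# Simplification: selection is done on all rows before cross-validation,
# which gives optimistic scores; a stricter setup would select inside each fold.
scores = {
    name: cross_val_score(svm, data, y, cv=5).mean()
    for name, data in [("all", X), ("filter", X_filter), ("wrapper", X_wrapper)]
}
for name, score in scores.items():
    print(f"{name:8s} mean accuracy = {score:.3f}")
```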

2.5 A new machine learning approach for predicting the response to anemia treatment in
a large cohort of End Stage Renal Disease patients undergoing dialysis

Anemia in Chronic Kidney Disease (CKD) is one of the most common comorbidities in
patients with End Stage Renal Disease (ESRD). Iron supplements and especially
Erythropoiesis Stimulating Agents (ESA) have become the treatment of choice for that anemia.
However, it is very complicated to find an adequate treatment for every patient in each
particular situation since dosage guidelines are based on average behaviours, and thus, they do
not take into account the particular response to those drugs by different patients, although that
response may vary enormously from one patient to another and even for the same patient in
different stages of the anemia. This work proposes an advance with respect to previous works
that have faced this problem using different methodologies (Machine Learning (ML), among
others), since the diversity of the CKD population has been explicitly taken into account in
order to produce a general and reliable model for the prediction of ESA/Iron therapy response.
Furthermore, the ML model makes use of both human physiology and drug pharmacology to
produce a model that outperforms previous approaches, yielding Mean Absolute Errors (MAE)
of the Hemoglobin (Hb) prediction around or lower than 0.6 g/dl in the three countries analysed
in the study, namely, Spain, Italy, and Portugal. This paper has presented a reliable ML
approach to predict Hb values in patients with anemia secondary to CKD. The work is
the result of the authors' long experience with this problem, including some previous works in
which the produced models were not completely satisfactory.
The proposed approach puts together the potential of ML models in general, and the
MLP in particular, to produce accurate models given a representative data set with a better use
of the available information using a priori knowledge of the lifespan of RBC and the effect
produced by Iron and ESAs. The result is an improved approach that outperforms the models
published so far to face this problem.
The successful results and the fact that the model has been tested on a large dialysis
population make the model suitable for application as a support tool in real clinical
practice. Our ongoing work deals with the implementation of the model in pilot clinics spread
over different geographical areas to test the real performance on completely new patients. The
goal is to check whether the model can be finally used in the short term as a decision support
system to be implemented and successfully working in most of FME clinics.

3. METHODOLOGY

3.1 Existing System

Chronic Kidney Disease (CKD) has become a major problem in modern times, and it is dubbed
the silent assassin due to its delayed signs. To overcome these critical issues, early identification
may minimize the prevalence of chronic diseases, though it is quite difficult because of
different kinds of limitations in the dataset. The novelty of our study is that we extracted the
best features from the dataset to provide the best classification models for diagnosing patients
with chronic kidney disease. In our study, we used CKD patients’ clinical datasets to predict
CKD using some popular machine-learning algorithms. In this study, the K-Nearest Neighbour
classification algorithm and Naïve Bayes classifier algorithm were used to diagnose Chronic
Kidney Disease. To diagnose Chronic Kidney Disease, two essential types of feature selection
methods namely, wrapper and filter approaches were chosen to reduce the dimension of the
Chronic Kidney Disease dataset. After selecting features from our dataset, we used a variety of
machine learning models to determine the best classification models.

3.2 Limitations of Existing System

• Unsuitable for large datasets.

• Long training time.

• More features lead to more complexity.

• Zero-frequency problem: Naïve Bayes assigns zero probability to a categorical value
that is not available in the training dataset. A smoothing method is needed to overcome
this problem.
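
The last limitation can be shown with a small calculation. The sketch below assumes Laplace
(add-one) smoothing as the smoothing method, which the text does not name, and uses a made-up
attribute with two categories; the function name is hypothetical.

```python
from collections import Counter


def conditional_probability(value, observed_values, categories, alpha=1.0):
    """Estimate P(value | class) from the values observed for one class.

    alpha = 0 gives the raw frequency; alpha = 1 is Laplace (add-one) smoothing.
    """
    counts = Counter(observed_values)
    return (counts[value] + alpha) / (len(observed_values) + alpha * len(categories))


# Made-up training values of one categorical attribute for a single class.
categories = ["normal", "abnormal"]
observed = ["normal", "normal", "normal"]

# "abnormal" never occurs in training: the raw estimate is zero, which would
# force the whole Naive Bayes product for this class to zero.
unsmoothed = conditional_probability("abnormal", observed, categories, alpha=0.0)
smoothed = conditional_probability("abnormal", observed, categories, alpha=1.0)
print(unsmoothed, smoothed)
```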

3.3 Proposed System

This paper investigates how CKD can be diagnosed by using machine learning (ML)
techniques. ML algorithms have been a driving force in the detection of abnormalities in
different physiological data and are employed with great success in different classification
tasks. In the present study, a number of different ML classifiers are experimentally validated
on a real data set, taken from the Machine Learning Repository, and our findings are compared
with the findings reported in the recent literature. The results are discussed quantitatively and
qualitatively, and our findings reveal that the hybrid of the Bagged Tree classifier and the
Random Forest classifier achieves near-optimal performance in the identification of CKD
subjects.
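
One possible way to combine the two classifiers is sketched below. How the hybrid is actually
assembled is not specified at this point, so the soft-voting combination, the number of trees
and the synthetic data are all assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the CKD dataset.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Bagged trees: many decision trees, each trained on a bootstrap sample.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
)
# Random forest: bagged trees that also sample features at each split.
random_forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Assumed combination rule: average the predicted class probabilities.
hybrid = VotingClassifier(
    estimators=[("bagged_trees", bagged_trees), ("random_forest", random_forest)],
    voting="soft",
)
hybrid.fit(X_train, y_train)

accuracy = accuracy_score(y_test, hybrid.predict(X_test))
importances = hybrid.named_estimators_["random_forest"].feature_importances_
print(f"hybrid accuracy = {accuracy:.3f}")
print("most important feature index:", int(importances.argmax()))
```

The fitted random forest also exposes `feature_importances_`, which gives an estimate of
which variables matter most for the classification.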

3.3.1 Advantages

• The accuracy of Random Forest is generally very high.
• Its efficiency is particularly notable on large data sets.
• It provides an estimate of the important variables in classification.
• The forests generated can be saved and reused.
• By combining multiple base models, it can generalize better to unseen data.
• Robustness: Bagging reduces the impact of outliers and noise in the data by aggregating
predictions from multiple models. This enhances the overall stability and robustness of
the model.

FLOW DESIGN

[Flow diagram: Install Libraries → Chronic Kidney Disease Dataset → Data Preprocessing
(finding and replacing missing values, data visualization, Label Encoder) → Splitting of data
(training data, testing data) → Hybrid Classification (Bagged Tree Classifier, Random Forest
Classifier) → Accuracy → Results (disease or not)]

3.4 Proposed Method
3.4.1 Modules and Descriptions

Importing the packages

For this project, the primary packages are Pandas for working with data, NumPy for
working with arrays, and scikit-learn for splitting the data and for building and evaluating the
classification models. We import all of these packages into our Python environment.

Data Preprocessing

First, we scale the numeric values to the range [0, 1]. Normalization is a
technique often applied as part of data preparation for machine learning. The goal of
normalization is to change the values of numeric columns in the dataset to a common scale,
without distorting differences in the ranges of values. We use the Label Encoder to convert the
categorical (text) attributes into numeric codes. We also remove NaNs, since the model cannot
handle them well: isnan is used as a mask to filter out NaN values, and the data is reshaped
again after the NaNs are removed.
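
The two preprocessing steps described above, filtering out NaN values and scaling to [0, 1], can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code; the helper names and the sample column are hypothetical.

```python
import math

def drop_nans(values):
    """Keep only entries that are not NaN (the isnan test acts as a mask)."""
    return [v for v in values if not math.isnan(v)]

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical numeric column with one missing reading.
column = [48.0, float("nan"), 62.0, 51.0, 90.0]
clean = drop_nans(column)
scaled = min_max_scale(clean)
print(scaled)
```

With NumPy, the same mask is usually written as `values[~np.isnan(values)]`.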
Splitting of Data

In this process, we define the independent variables (X) and the dependent variable
(Y). Using the defined variables, we split the data into a training set and a testing set, which
are then used for modeling and evaluation. The data can be split easily using the
‘train_test_split’ function from scikit-learn.

Training the Data

Next, we create the independent data set (X). This is the data set that is used to
train the machine learning model(s). To do this, we create a variable called ‘X’, convert
the data into a NumPy (np) array after dropping the ‘Prediction’ column, and store the result
in ‘X’. Then we remove the last 30 rows of data from ‘X’, store the new data back into ‘X’,
and print it. Having created the independent data set, we create the dependent data set, called
‘y’. This is the target data, the one that holds the class to be predicted (CKD or not CKD). To
create ‘y’, we convert the ‘Prediction’ column of the data frame into a NumPy array, store it
in a new variable called ‘y’, and remove the last 30 rows of data from ‘y’. Then we print ‘y’
to make sure there are no NaNs. With the cleaned and processed data sets ‘X’ and ‘y’ in hand,
we split them into 65% training and 35% testing data for the model(s).
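
The separation into ‘X’ and ‘y’ and the 65%/35% split can be sketched as below. This is a standard-library illustration under stated assumptions: the records, the feature names ‘f1’ and ‘f2’, the helper names and the seed are all hypothetical; only the ‘Prediction’ column name and the split ratio come from the text.

```python
import random

def split_features_target(rows, target_key):
    """Separate each record into its feature values (X) and its target (y)."""
    X = [[v for k, v in row.items() if k != target_key] for row in rows]
    y = [row[target_key] for row in rows]
    return X, y

def split_train_test(X, y, test_fraction=0.35, seed=0):
    """Shuffle row indices reproducibly, then cut them into test and train parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical records; only the 'Prediction' column name comes from the text.
rows = [{"f1": i, "f2": i * 2, "Prediction": i % 2} for i in range(20)]
X, y = split_features_target(rows, "Prediction")
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))
```

In scikit-learn, the equivalent call is `train_test_split(X, y, test_size=0.35)`.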

3.4.2 Classification Models

A Bagging classifier

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on
random subsets of the original dataset and then aggregates their individual predictions (either
by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be
used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by
introducing randomization into its construction procedure and then making an ensemble out of
it.

This algorithm encompasses several works from the literature. When random subsets
of the dataset are drawn as random subsets of the samples, then this algorithm is known as
Pasting. If samples are drawn with replacement, then the method is known as Bagging. When
random subsets of the dataset are drawn as random subsets of the features, then the method is
known as Random Subspaces. Finally, when base estimators are built on subsets of both
samples and features, then the method is known as Random Patches.
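
The four sampling schemes named above differ only in which rows and columns each base estimator sees. The sketch below makes that concrete; the function name, the fractions and the dataset size are illustrative assumptions rather than part of any library API.

```python
import random

def draw_subset(n_samples, n_features, sample_frac=1.0, feature_frac=1.0,
                replace=False, seed=0):
    """Return the row and column indices one base estimator would be trained on."""
    rng = random.Random(seed)
    k_rows = max(1, int(n_samples * sample_frac))
    k_cols = max(1, int(n_features * feature_frac))
    if replace:
        rows = [rng.randrange(n_samples) for _ in range(k_rows)]  # rows may repeat
    else:
        rows = rng.sample(range(n_samples), k_rows)               # rows are distinct
    cols = sorted(rng.sample(range(n_features), k_cols))
    return rows, cols

N, M = 10, 6  # hypothetical dataset: 10 samples, 6 features
pasting = draw_subset(N, M, sample_frac=0.5)                      # subset of samples
bagging = draw_subset(N, M, replace=True)                         # with replacement
subspaces = draw_subset(N, M, feature_frac=0.5)                   # subset of features
patches = draw_subset(N, M, sample_frac=0.5, feature_frac=0.5)    # both
print(pasting, bagging, subspaces, patches, sep="\n")
```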

Fig 3.1 A Bagging Classifier

The concept behind bagging is to combine the predictions of several base learners to
create a more accurate output. Bagging is the application of the Bootstrap procedure to a
high-variance machine learning algorithm, typically decision trees.

1. Suppose there are N observations and M features. A sample of observations is selected
randomly with replacement (bootstrapping).
2. A subset of features is selected, and a model is created from the sample of observations
and the subset of features.
3. The feature from the subset that gives the best split on the training data is selected.
4. This is repeated to create many models, and every model is trained in parallel.
5. The prediction is given based on the aggregation of predictions from all the models.

This approach can be used with machine learning algorithms that have a high variance,
such as decision trees. A separate model is trained on each bootstrap sample of data and the
average output of those models is used to make predictions. This technique is called bootstrap
aggregation, or bagging for short.
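
The bootstrap-and-aggregate procedure above can be sketched with a deliberately simple base learner. Instead of a full decision tree, this illustration uses a one-feature threshold rule (a decision stump) and aggregates by majority vote. The data, the function names and the number of models are hypothetical, and the feature-subset step is omitted because the toy data has a single feature.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold rule: predict 1 when x > threshold.

    The threshold is chosen among the observed x values to minimise
    training errors on the given (x, label) pairs.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_stumps(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_models)]

def predict(thresholds, x):
    """Aggregate the individual stump predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical one-feature data: label 1 for the larger values.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
model = bagged_stumps(data)
print(predict(model, 0.7), predict(model, 4.2))
```

Each stump sees a slightly different sample, so individual thresholds vary; the vote across all stumps is more stable than any single one, which is the variance reduction the text describes.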

Variance means that an algorithm’s performance is sensitive to the training data, with
high variance suggesting that the more the training data is changed, the more the
performance of the algorithm will vary.

Random Forest classifier

Random forest is like a bootstrapping algorithm combined with the decision tree (CART)
model. Suppose we have 1000 observations in the complete population with 10 variables.
Random forest will try to build multiple CART models using different samples and different
initial variables. It will take a random sample of 100 observations and then choose 5 initial
variables randomly to build a CART model. It will repeat the process, say, about 10 times and
then make a final prediction on each of the observations. The final prediction is a function of
the individual predictions; it can simply be the mean of each prediction.

Fig 3.2 Random Forest classifier

The random forest is a model made up of many decision trees. Rather than just simply
averaging the prediction of trees (which we could call a “forest”), this model uses two key
concepts that give it the name random:

• Random sampling of training data points when building trees
• Random subsets of features considered when splitting nodes

How the Random Forest Algorithm Works

The basic steps involved in performing the random forest algorithm are mentioned below:

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. In case of a regression problem, for a new record, each tree in the forest predicts a value
for Y (output). The final value can be calculated by taking the average of all the values
predicted by all the trees in the forest. Or, in the case of a classification problem, each
tree in the forest predicts the category to which the new record belongs. Finally, the
new record is assigned to the category that wins the majority vote.
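
The random draws and the final aggregation step can be sketched as follows. The tree-building step itself is omitted. The sizes (1000 observations, 10 variables, 100 sampled records, 5 variables, 10 trees) follow the example given earlier in this section. Drawing records with replacement, the helper names, the per-tree outputs and the class labels are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean

def draw_tree_inputs(n_records, n_variables, sample_size, n_selected, rng):
    """Pick the random records and random variables used to grow one tree."""
    # Assumption: records are drawn with replacement (a bootstrap sample).
    records = [rng.randrange(n_records) for _ in range(sample_size)]
    variables = sorted(rng.sample(range(n_variables), n_selected))
    return records, variables

def aggregate_regression(tree_outputs):
    """Regression: the final value is the average of all tree outputs."""
    return mean(tree_outputs)

def aggregate_classification(tree_votes):
    """Classification: the final class is the one with the most votes."""
    return Counter(tree_votes).most_common(1)[0][0]

rng = random.Random(0)
plans = [draw_tree_inputs(1000, 10, 100, 5, rng) for _ in range(10)]

# Hypothetical outputs of five trees for one new record.
tree_values = [2.0, 3.0, 2.5, 3.5, 4.0]
tree_votes = ["ckd", "notckd", "ckd", "ckd", "notckd"]
final_value = aggregate_regression(tree_values)
final_class = aggregate_classification(tree_votes)
print(final_value, final_class)
```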

4. EXPERIMENTAL RESULT

Displaying Dataset
Pandas is used to read the dataset, which is given as a CSV file, and the dataset is
displayed below.

Preprocessing Dataset
Label Encoder is used for preprocessing; it converts text values into numeric codes.
Attributes such as pcc and ba are given as text (strings), but the algorithm accepts only
numerical values as input, so the values of these attributes have to be converted. Here Label
Encoder converts the pcc and ba values (for example, notpresent is converted to 0 and present
is converted to 1). The result is shown below.
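
The mapping described above can be reproduced with a few lines of plain Python. This sketch mimics the behaviour of scikit-learn's `LabelEncoder`, which assigns integer codes to the distinct values in sorted order; the helper names and the sample values are hypothetical.

```python
def fit_label_mapping(values):
    """Map each distinct string to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def encode(values, mapping):
    """Replace every string by its integer code."""
    return [mapping[v] for v in values]

# Hypothetical values of the pcc attribute for five records.
pcc = ["notpresent", "present", "notpresent", "notpresent", "present"]
mapping = fit_label_mapping(pcc)
encoded = encode(pcc, mapping)
print(mapping, encoded)
```

Because "notpresent" sorts before "present", it receives code 0 and "present" receives code 1, matching the example in the text.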

K-Nearest Neighbour Algorithm Implementation

Naïve Bayes Classifier Algorithm Implementation

Hybrid Algorithm (Bagged Tree classifier and Random Forest classifier) Implementation

UI Page Screenshots

Performance Graph

Prediction Results

5. CONCLUSION

The diagnosis of CKD is a difficult challenge. In this work, we have presented a
predictive model using different machine learning algorithms, including NN, RF, SVM, RT,
and BTM, to predict CKD earlier. We mainly focus on the empirical comparison of these
ML algorithms. Based on the empirical results of the applied algorithms, NN,
RF, and SVM have given the highest accuracy on the full dataset. Moreover, comparing overall
accuracy metrics, the Bagged Tree classifier has given the best performance on the full dataset,
and the Random Forest classifier has given the highest performance on the dataset. Our study
has limitations due to the small size of the dataset used. In the future, we will apply our
developed models to datasets of other diseases and try to develop more advanced expert
systems.

6. FUTURE WORK

To enhance Chronic Kidney Disease (CKD) management in the future, integrating
AI-powered predictive analytics could help anticipate disease progression, personalize
treatment plans, and support early intervention strategies. Additionally, advancements in
wearable technology for continuous monitoring of renal function and vital signs could provide
real-time data for more precise management and timely interventions. Furthermore,
implementing telemedicine platforms could improve access to specialized care and facilitate
remote monitoring, especially for patients in rural or underserved areas.

REFERENCES

[1] M. M. Hossain, R. K. Detwiler, E. H. Chang, M. C. Caughey, M. W. Fisher, T. C. Nichols,
E. P. Merricks, R. A. Raymer, M. Whitford, D. A. Bellinger, L. E. Wimsey, and C. M. Gallippi,
“Mechanical anisotropy assessment in kidney cortex using ARFI peak displacement:
Preclinical validation and pilot in vivo clinical results in kidney allografts,” IEEE Trans.
Ultrason., Ferroelectr., Freq. Control, vol. 66, no. 3, pp. 551–562, Mar. 2019.

[2] H. Polat, H. D. Mehr, and A. Cetin, “Diagnosis of chronic kidney disease based on support
vector machine by feature selection methods,” J. Med. Syst., vol. 41, no. 4, p. 55, Apr. 2017.

[3] N. R. Hill, “Global prevalence of chronic kidney disease - A systematic review and
meta-analysis,” PLoS ONE, vol. 11, no. 7, Jul. 2016, Art. no. e0158765.

[4] K. M. Z. Hasan and M. Z. Hasan, “Performance evaluation of ensemble based machine
learning techniques for prediction of chronic kidney disease,” in Emerging Research in
Computing, Information, Communication and Applications. Singapore: Springer, 2019.

[5] M. Alloghani, D. Al-Jumeily, T. Baker, A. Hussain, J. Mustafina, and A. J. Aljaaf,
“Applications of machine learning techniques for software engineering learning and early
prediction of students’ performance,” in Proc. Int. Conf. Soft Comput. Data Sci., Dec. 2018,
pp. 246–258.

[6] A. Singh, G. Nadkarni, O. Gottesman, S. B. Ellis, E. P. Bottinger, and J. V. Guttag,
“Incorporating temporal EHR data in predictive models for risk stratification of renal function
deterioration,” J. Biomed. Informat., vol. 53, pp. 220–228, Feb. 2015.

[7] N. Park, E. Kang, M. Park, H. Lee, H.-G. Kang, H.-J. Yoon, and U. Kang, “Predicting acute
kidney injury in cancer patients using heterogeneous and irregular data,” PLoS ONE, vol. 13,
no. 7, Jul. 2018, Art. no. e0199839.

[8] D. Gupta, S. Khare, and A. Aggarwal, “A method to predict diagnostic codes for chronic
diseases using machine learning techniques,” in Proc. Int. Conf. Comput., Commun. Autom.
(ICCCA), Apr. 2016, pp. 281–287.

CONFERENCE

Mrs. M. Lincy Jacquline, M.Sc., M.Phil., (Ph.D)., Ms. S. Nethra, M.Sc., - “Chronic
Kidney Disease Prediction using Machine Learning Techniques” - International Conference on
New Age Mathematics for Data Science and Artificial Intelligence - Rathinam College of Arts
& Science.
