
DIABETES PREDICTION

CHAPTER 1
INTRODUCTION

In recent years, the global prevalence of diabetes has reached alarming levels, posing a
significant health challenge that demands innovative solutions. With the advent of advanced
technologies, machine learning (ML) has emerged as a powerful tool for predictive analysis in
healthcare. This project aims to harness the potential of ML to develop a robust diabetes
prediction model, providing a proactive and personalized approach to healthcare.

The increasing incidence of diabetes is influenced by a myriad of factors, including lifestyle
choices, genetic predispositions, and environmental variables. Traditional methods of
diagnosis and risk assessment often fall short in providing timely and accurate predictions.
This ML project seeks to bridge this gap by leveraging diverse datasets and sophisticated
algorithms to create a predictive model capable of identifying individuals at risk of
developing diabetes.

By integrating a comprehensive set of features, ranging from demographic information to
lifestyle patterns, this project aims to develop a nuanced understanding of the factors contributing
to diabetes risk. The utilization of ML algorithms will enable the system to learn and adapt,
continually refining its predictive capabilities as more data becomes available.

The ultimate goal of this project is to move beyond reactive healthcare approaches and usher
in an era of proactive and preventive strategies. Early detection of individuals at risk of
diabetes allows for timely interventions, lifestyle modifications, and targeted healthcare plans,
ultimately reducing the burden of the disease on individuals and healthcare systems alike.

As we delve into the realm of ML for diabetes prediction, the potential impact on public
health is substantial. This project represents a pivotal step towards personalized medicine,
where individuals can benefit from tailored interventions based on their unique risk profiles.
The intersection of technology and healthcare has the power to transform how we approach
and manage chronic conditions.

Dept. of CSE(AIML), BKIT Bhalki2023-2024 Page 1



Problem Statement
The increasing prevalence of diabetes worldwide presents a formidable public health
challenge, necessitating proactive strategies for early detection and intervention. Traditional
methods of diabetes risk assessment often rely on retrospective analysis and lack the precision
needed for personalized healthcare. Addressing this gap, our project aims to develop a machine
learning-based solution for diabetes prediction, with a focus on improving accuracy, early
identification, and personalized risk assessment.

DATASET DESCRIPTION

The dataset consists of several medical predictor variables and one target variable, Outcome.
The predictor variables include the number of pregnancies the patient has had, their BMI, insulin
level, age, and so on.

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction: Diabetes pedigree function

Age: Age (years)
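
As a minimal sketch of the schema described above, the following builds a tiny table with the
listed column names plus the Outcome target and separates predictors from the target. The three
rows are made-up illustrative values, not records from the project dataset.

```python
import pandas as pd

# Column names taken from the dataset description above.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET = "Outcome"

# Made-up rows for illustration only (not real patient data).
rows = [
    [2, 120, 70, 25, 80, 28.5, 0.35, 33, 0],
    [5, 165, 82, 32, 150, 34.1, 0.62, 51, 1],
    [0, 98, 64, 20, 60, 22.3, 0.20, 24, 0],
]
df = pd.DataFrame(rows, columns=FEATURES + [TARGET])

X = df[FEATURES]   # predictor variables
y = df[TARGET]     # target variable
print(X.shape, y.shape)
```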


BREAKDOWN OF DATASETS

Before we start data visualization, we need to perform the following steps:


1. Import the packages needed later in the notebook.

2. Mount Google Drive and read the files stored on it.

3. Suppress future warnings raised by seaborn.

4. See all sources of relevant information.

5. View all information.

6. Separate the code from the data.
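
The setup steps above might look roughly like the sketch below. The helper name `load_dataset`
and the in-memory CSV are assumptions made for illustration; in Colab the data would be read from
a path under the mounted Drive folder instead.

```python
import io
import warnings

import pandas as pd

# Silence FutureWarning messages (for example, those surfaced through seaborn).
warnings.filterwarnings("ignore", category=FutureWarning)

# Colab only, left commented because it needs a Colab runtime:
# from google.colab import drive
# drive.mount("/content/drive")


def load_dataset(source):
    """Read a CSV file (path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Stand-in for a file on Drive: a two-row CSV with made-up values.
sample_csv = io.StringIO("Glucose,BMI,Outcome\n120,28.5,0\n165,34.1,1\n")
df = load_dataset(sample_csv)

df.info()          # column names, dtypes, and non-null counts
print(df.head())   # first rows of the table
```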

PROJECT FLOWCHART

1. Loading data and diagnosing the data

2. Data Filtering

3. EDA of raw data to understand internal correlations.

4. Feature Engineering

5. Feature Selection: little selection was needed because the data has few features
(only one feature was eliminated, using the heatmap, to reduce multicollinearity).

6. Model Building

7. Model Training and Testing

8. Model Evaluation & Hyperparameter Tuning

9. Model Deployment


AN INTRODUCTION TO GOOGLE COLAB

Google Colab is a hosted notebook service that lets you write, run, and share Python code within
your browser. It is a version of the popular Jupyter Notebook within the Google suite of tools.
Jupyter Notebooks (and therefore Google Colab) allow you to create a document containing
executable code along with text, images, HTML, LaTeX, etc., which is then stored in your
Google Drive and can be shared with peers and colleagues for editing, commenting, and viewing.

STARTING DOCUMENT

In order to create a new document, access Google Colab at colab.research.google.com. Once
here, you can begin a new document, or “notebook,” in one of two ways. Upon visiting the
site, a box with your recently visited Colab documents will appear. Select “New Notebook”
to begin a new document. Alternatively, to create a new document from any screen, select
“File” in the top left corner, then select “New Notebook” from the dropdown box.

BASIC FUNCTIONS

After you have created a new notebook, you will see an empty code cell. Python code can be
entered into these code cells and executed at any time by either clicking the Play button to the
left of the code cell or by pressing Command/Ctrl + Enter on your keyboard.


At the top of your notebook, you will find two buttons: “+ Code” and “+ Text”. Add a new code
cell by clicking the “+ Code” button in the top left corner of the document. Add a text cell by
clicking the “+ Text” button in the top left corner of the document.

When a cell is selected, a toolbar will appear in the top right corner of the cell. This toolbar
contains functions specific to that cell. Options include moving the cell up and down, adding
comments, and deleting the cell.

SHARING A COLAB NOTEBOOK

As with other Google Apps, Colab Notebooks can be shared. Look for the “Share” button in
the top right-hand corner of the window. Google Colab documents can also be shared in
Google Drive, just as you do with other types of documents.


DATA PREPARATION

EXAMINING NULL / MISSING VALUES

Null values are a common problem in machine learning and deep learning. With sklearn,
TensorFlow, or other machine learning or deep learning packages, null values generally need to
be cleaned up before the data is passed to the framework; otherwise, many estimators raise an
error. We therefore check for null and missing values. The provided dataset contains no
missing values and no null values.
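
A minimal sketch of such a check with pandas is shown below. The table is made up, and one NaN is
planted deliberately so the check has something to report; on the project dataset every count
would be zero according to the text above.

```python
import numpy as np
import pandas as pd

# Made-up values; one NaN is planted so the check has something to find.
df = pd.DataFrame({
    "Glucose": [120, 165, np.nan, 98],
    "BMI": [28.5, 34.1, 30.0, 22.3],
    "Outcome": [0, 1, 1, 0],
})

missing_per_column = df.isnull().sum()   # number of nulls in each column
print(missing_per_column)

has_missing = bool(missing_per_column.sum() > 0)
print("Any missing values:", has_missing)
```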

DATA CLEANING

Data cleaning is the foremost step in any data science project. No data is clean, but most is
useful. Data cleaning is the process of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database. It involves identifying incomplete,
incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting
the dirty or coarse data. To begin our data cleaning, we first check for duplicate values;
there are no duplicate values in the given dataset. We then convert data types, perform
exploratory data analysis, and look for the best-fitting model for the dataset.
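
The duplicate check and data type conversion described above can be sketched as follows. The
table is made up, with one repeated row and an Age column stored as text, purely to show the
calls involved.

```python
import pandas as pd

# Made-up table: the third row repeats the first, and Age is stored as text.
df = pd.DataFrame({
    "Age": ["33", "51", "33"],
    "BMI": [28.5, 34.1, 28.5],
    "Outcome": [0, 1, 0],
})

n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row
df_clean = df.drop_duplicates().reset_index(drop=True)

# Convert a column that was read as text into integers.
df_clean["Age"] = df_clean["Age"].astype(int)
print(n_duplicates, df_clean.dtypes.to_dict())
```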

Tools Used:

The project uses Python as the primary programming language for its flexibility,
extensive libraries, and machine learning frameworks.
Popular libraries and tools such as scikit-learn and Streamlit are employed for data
preprocessing, model building, and deployment.


Software required and specifications

Software:

Python: Primary language for ML.
Libraries: TensorFlow/PyTorch, Scikit-learn, XGBoost, OpenCV, NLTK/SpaCy.
Development: Jupyter, IDEs (PyCharm, VSCode).
Version Control: Git, GitHub / GitLab.
Visualization: Matplotlib, Seaborn, Plotly.

Hardware:

CPU: AMD Ryzen 5
GPU: NVIDIA GTX 10 series or above.
RAM: 16GB
Storage: SSD


CHAPTER 2
EXPLORATORY DATA ANALYSIS

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to
summarize their main characteristics, often using statistical graphics and other data
visualization methods. A statistical model can be used or not, but primarily EDA is for seeing
what the data can tell us beyond the formal modeling, and thereby contrasts with traditional
hypothesis testing. EDA helped us figure out various aspects of, and relationships among, the
target and the independent variables.

Fig 1 : Data Science Process


A correlation matrix is a square matrix that shows the correlation between each pair of
variables. The correlation between two variables can range from -1 to 1, with a correlation of
1 indicating a perfect positive correlation, a correlation of -1 indicating a perfect negative
correlation, and a correlation of 0 indicating no correlation.

A heatmap is a graphical representation of a correlation matrix. The color of each cell in the
heatmap represents the strength and direction of the correlation between the corresponding
variables. How color maps to strength depends on the chosen color scale; in the heatmap used
here, darker colors indicate stronger correlations.


The heatmap with annotations shows the correlation between each pair of numeric variables
in the dataset. The annotations show the exact correlation value for each pair of variables.

Fig 2 : Heatmap
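
A rough sketch of how a correlation matrix and an annotated heatmap can be produced is given
below. The data is synthetic (Insulin is generated to depend on Glucose so that a strong
correlation appears), and the helper name `plot_heatmap` and the `coolwarm` color scale are
illustrative choices rather than the project's actual settings.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, size=200)
df = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 0.8 * glucose + rng.normal(0, 10, size=200),
    "Age": rng.integers(21, 70, size=200),
})

corr = df.corr()   # Pearson correlation between every pair of columns
print(corr.round(2))


def plot_heatmap(corr_matrix):
    """Draw an annotated heatmap; needs seaborn and matplotlib installed."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()
```

Calling `plot_heatmap(corr)` in a notebook draws the figure; the plotting imports are kept inside
the function so the correlation step runs even where seaborn is not installed.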


CHAPTER 3
FEATURE ENGINEERING

Feature engineering is the act of converting raw observations into desired features using
statistical or machine learning approaches. It refers to the manipulation (addition, deletion,
combination, or mutation) of the features in our dataset to improve machine learning model
training, leading to better performance and greater accuracy. Effective feature engineering is
based on sound knowledge of the business problem and the available data sources.

Fig 3 : feature engineering

We encode the categorical data with two encoders and compare the accuracy obtained with each:
1. One Hot Encoder
2. Label Encoder

1. One Hot Encoder:

One-Hot encoding is used in machine learning as a method to quantify categorical data.


One-hot encoding does not impose an order on the categories, but it can greatly increase the
number of columns. For columns with many unique values, consider other techniques such as
label encoding.
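
The column expansion described above can be seen in a small sketch using pandas. The
`ActivityLevel` column is hypothetical and is not one of the dataset's fields.

```python
import pandas as pd

# Hypothetical categorical column, for illustration only.
df = pd.DataFrame({"ActivityLevel": ["low", "high", "medium", "low"]})

# One new 0/1 column is created per distinct category.
one_hot = pd.get_dummies(df, columns=["ActivityLevel"], dtype=int)
print(one_hot)
print("Columns before:", df.shape[1], "after:", one_hot.shape[1])
```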

2. Label Encoder

Label encoding converts labels into numeric form so that they become machine-readable.
Machine learning algorithms can then operate on those labels directly. It is an important
pre-processing step for structured datasets in supervised learning.
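
A short sketch of label encoding with scikit-learn follows. The label values are hypothetical;
the encoder assigns integer codes according to the sorted order of the distinct labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only.
labels = ["low", "high", "medium", "low"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)   # integer code per label

print(list(encoder.classes_))           # distinct labels in sorted order
print(codes.tolist())

# The mapping is reversible.
decoded = encoder.inverse_transform(codes)
```

scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is the
counterpart for feature columns.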

OUTLIER

An outlier is a data point in the dataset that differs significantly from the other data or
observations. It is important to remember that not all outliers are the same. Some have a
strong influence, some none at all. Some are valid and important data values. Some are simply
errors or noise. Many parametric statistics, such as the mean, correlations, and every
statistic based on these, are sensitive to outliers.
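
One common way to flag outliers is the 1.5 x IQR rule, sketched below. This rule is an assumption
chosen for illustration, since the text does not say which detection method the project used, and
the values are made up.

```python
import numpy as np

# Made-up BMI-like values with one extreme entry.
values = np.array([22.3, 24.8, 26.1, 27.5, 28.5, 29.9, 31.2, 34.1, 67.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print("Flagged:", values[is_outlier])

# The mean shifts noticeably when the flagged point is removed.
print("Mean with outlier:", values.mean())
print("Mean without outlier:", values[~is_outlier].mean())
```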


CHAPTER 4
MODEL SELECTION

ALGORITHMS

1. Linear Regression:
Linear regression is one of the easiest and most popular machine learning algorithms. It is a
statistical method that is used for predictive analysis. LR makes predictions for continuous
numeric variables.

Fig 4 : Linear Regression

Linear regression shows the relationship between a dependent variable and one or more
independent variables. The following equation defines an LR line: Y = a + bX, where a is the
intercept and b is the slope.

The model is obtained by fitting a linear equation to the observed data. Before fitting the
model, it is important to check whether there is a relationship between the variables or
features of interest, for example by computing the correlation coefficient between the
numerical variables.
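
As a small illustration of fitting Y = a + bX, the sketch below recovers the intercept and slope
from noise-free made-up points with a least-squares fit and reports the correlation coefficient
mentioned above.

```python
import numpy as np

# Made-up points lying exactly on Y = 2 + 3X, so the fit can be checked by eye.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 + 3.0 * X

# Least-squares fit of a degree-1 polynomial; the slope is returned first.
b, a = np.polyfit(X, Y, deg=1)

# Pearson correlation coefficient between X and Y.
r = np.corrcoef(X, Y)[0, 1]

print("intercept a:", round(a, 6))
print("slope b:", round(b, 6))
print("correlation r:", round(r, 6))
```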


2. Decision Tree Regression:

A decision tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree-shaped model that supports decision-making. It splits a data
set into smaller and smaller subsets while the tree is built up incrementally. The
final tree consists of decision nodes and leaf nodes.
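
The splitting behaviour can be sketched with scikit-learn's `DecisionTreeRegressor` on a made-up
step-shaped target, where a single split is enough to separate the data into two leaves. The
`max_depth` value is an arbitrary illustrative setting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: the target is 10 for x <= 5 and 20 for x > 5.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.where(X.ravel() <= 5, 10.0, 20.0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

print("Number of leaf nodes:", tree.get_n_leaves())
predictions = tree.predict([[3.0], [8.0]])
print("Predictions for x=3 and x=8:", predictions)
```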

Fig 5 : Decision Tree

3. Random Forest Regression:

Random Forest is a popular supervised machine learning algorithm. It is an ensemble method
that builds a number of decision trees on various subsets of the given dataset and, for
regression, averages their predictions to improve predictive accuracy.

Fig 6 : Random Forest


A greater number of trees in the forest generally leads to more stable predictions and reduces
the risk of overfitting, although the gains diminish as more trees are added.
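
The averaging behaviour can be checked with a short sketch using scikit-learn's
`RandomForestRegressor`. The data is synthetic, and the number of trees and the split ratio are
arbitrary illustrative settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# The forest prediction is the mean of the individual tree predictions.
per_tree = np.stack([t.predict(X_test) for t in forest.estimators_])
matches_average = np.allclose(per_tree.mean(axis=0), forest.predict(X_test))
print("Forest equals average of trees:", matches_average)

test_r2 = forest.score(X_test, y_test)   # R-squared on held-out data
print("R^2 on held-out data:", round(test_r2, 3))
```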

4. XGBoost Regressor
XGBoost, which stands for Extreme Gradient Boosting, is an open source library that provides an
efficient and effective implementation of the gradient boosting algorithm. Shortly after its
development and initial release, XGBoost became the go-to method and often the key
component in winning solutions for a range of problems in machine learning competitions.
Regression predictive modeling problems involve predicting a numerical value such as a
dollar amount or a height. XGBoost can be used directly for regression predictive modeling.

XGBoost is one of the fastest implementations of gradient boosted trees.

It does this by tackling one of the major inefficiencies of gradient boosted trees: considering
the potential loss for all possible splits to create a new branch (especially if you consider the
case where there are thousands of features, and therefore thousands of possible splits).
XGBoost tackles this inefficiency by looking at the distribution of features across all data
points in a leaf and using this information to reduce the search space of possible feature
splits.

5. Gradient Boosting Regressor:

Gradient boosting is one of the most popular machine learning algorithms for tabular
datasets. It can capture complex nonlinear relationships between the target and the features,
and some implementations handle missing values, outliers, and high-cardinality categorical
values with little special treatment. While you can build basic gradient boosted trees using
popular libraries such as XGBoost or LightGBM without knowing the details of the algorithm,
understanding how it works helps when you start tuning hyperparameters, customizing loss
functions, and so on, to get better model quality.
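
As a sketch of how boosting improves the fit stage by stage, the example below uses
scikit-learn's `GradientBoostingRegressor` (not XGBoost or LightGBM) on synthetic data. The
hyperparameter values are arbitrary illustrative settings, not the project's tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic nonlinear data for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.3, size=400)

X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=2, random_state=0
)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting stage.
stage_mse = [
    mean_squared_error(y_test, pred) for pred in model.staged_predict(X_test)
]
print("Test MSE after first stage:", round(stage_mse[0], 3))
print("Test MSE after last stage:", round(stage_mse[-1], 3))
```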


Fig 7 : Gradient Boosting


CHAPTER 5
MODEL EVALUATION

Fig 8 : Model Evaluation steps


Root Mean Square Error (RMSE)

RMSE (root mean square error) measures the difference between the values predicted by a
model and the actual values. In other words, it is a measure of the error of a machine
learning algorithm on a regression problem.

RMSE is the square root of the mean square error. It is expressed in the same units as the
target variable, which makes the gap between estimated and actual values easy to interpret.
A lower RMSE indicates a better fit.

Mean Square Error (MSE)


MSE is a risk function that gives the average squared difference between the
predicted and the actual value of a feature or variable.

Mean squared error is computed by averaging the squared residuals over all samples. When it
is used as a training loss, the bias (intercept) term is also a parameter that is updated
during the training process.

R-squared (R²)
R-squared is a statistical measure that represents the goodness of fit of a regression model.
The ideal value for R-squared is 1. The closer the value of R-squared is to 1, the better the
model fits.

R-squared compares the residual sum of squares with the total sum of squares. The
total sum of squares is the sum of the squared vertical distances between the
data points and the mean of the observed values.
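
The three metrics can be computed directly from their definitions, as in the sketch below. The
actual and predicted values are made up so that the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred

mse = np.mean(residuals ** 2)    # mean of squared residuals
rmse = np.sqrt(mse)              # square root of MSE, in the target's units

ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", round(rmse, 4))
print("R-squared:", r2)
```

Here the squared residuals are 1, 0, 1, 0, so the MSE is 0.5; the total sum of squares around
the mean of 6 is 20, which gives an R-squared of 1 - 2/20 = 0.9.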


Fig : Statistical Measure


CHAPTER 6
MODEL DEPLOYMENT WITH STREAMLIT

Streamlit, a powerful Python library, simplifies the intricate process of constructing web
applications tailored for data science and machine learning projects. This chapter delves
into the multifaceted capabilities of Streamlit, highlighting its significance in the
deployment phase of machine learning models.

Data Exploration and Visualization:

Streamlit excels in creating interactive charts, graphs, and visualizations with minimal
Python code. This functionality proves invaluable for delving into data intricacies, allowing
for dynamic and engaging presentations of key findings. The streamlined process enhances
the accessibility of complex data insights.

Data Interaction:

Leveraging widgets like sliders, input boxes, and buttons, Streamlit facilitates user
interaction with data. These elements empower users to actively engage with the content,
enabling scenarios to be demonstrated or granting readers the autonomy to explore data
within the report independently. This level of interactivity enhances the overall user
experience.
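
A rough sketch of how such widgets could collect the eight input features in a sidebar is shown
below. The function names `build_feature_row` and `main`, the widget labels, and the `model`
argument (any fitted object with a `predict` method) are assumptions for illustration and are
not taken from the project's deployed application.

```python
# Feature names in the order assumed for the trained model.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]


def build_feature_row(values):
    """Arrange a dict of inputs into a single row in FEATURES order."""
    return [[float(values[name]) for name in FEATURES]]


def main(model):
    """Render the sidebar form; `model` is any fitted object with .predict()."""
    import streamlit as st  # imported here so the helper works without Streamlit

    st.title("Diabetes Prediction")

    # One numeric input per feature, placed in the sidebar.
    values = {
        name: st.sidebar.number_input(name, min_value=0.0, value=0.0)
        for name in FEATURES
    }

    # Predict only once the user ticks the checkbox.
    if st.sidebar.checkbox("Predict"):
        prediction = model.predict(build_feature_row(values))[0]
        st.write("Predicted outcome:", prediction)
```

To use this as an app, the script would load a trained model, call `main(model)` at module level,
and be launched with `streamlit run`.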

Custom Dashboards:

The versatility of Streamlit extends to crafting custom dashboards that showcase pivotal
metrics and insights derived from the data. Custom dashboards prove invaluable when
aiming to deliver an interactive summary within the report, offering a consolidated view of
key information.

Embedding in Reports:

While Streamlit itself does not generate traditional reports in formats such as PDF or Word,
it seamlessly integrates into web-based reports or documents.


CHAPTER 7
DIABETES PREDICTION

HOMEPAGE

Enter the input values in the sidebar:

Finally, after entering all the data, we make a prediction by ticking the checkbox, as shown
below.

After clicking on the checkbox we get the estimated result: Remove old model


After clicking the submit button


CHAPTER 8

FUTURE SCOPE

Use more data: More data generally leads to more accurate predictions, provided it is
representative and of good quality. You can gather information from various sources, such as
healthcare databases, patient records, and relevant social media platforms.

Explore new learning models: Many different machine learning models are available for
medical predictions. You can try different models to see what works best for your healthcare
data.

Consider Other Factors Affecting Diabetes Prediction: In addition to the factors you include in
your model, there are other elements that can affect diabetes predictions, such as lifestyle
changes, genetic factors, and environmental conditions. You may consider adding these
variables to your model to improve its accuracy.

Create real-time forecasting systems: Real-time forecasting systems would be able to predict
diabetes risk over the next few hours or days. This is beneficial for both healthcare providers
and individuals managing their health.

Develop location-specific prediction methods: Location-specific prediction methods would be
able to predict diabetes risk at different healthcare facilities or regions. This would be more
useful than broader predictions because it allows healthcare professionals and individuals to
plan interventions and manage health resources more efficiently.


CONCLUSION:

In conclusion, leveraging machine learning for diabetes prediction holds significant promise
in advancing healthcare and improving patient outcomes. By incorporating diverse and
abundant data sources, experimenting with various learning models, and considering a
comprehensive set of influencing factors, we can enhance the accuracy and reliability of
diabetes prediction models. The integration of real-time forecasting systems empowers
healthcare professionals and individuals to proactively address and manage diabetes risks
over short-term intervals, facilitating timely interventions.

Furthermore, the development of location-specific prediction methods adds a layer of
precision, enabling tailored strategies for diabetes management at different healthcare
facilities or regions. This not only optimizes resource allocation but also empowers
individuals to make informed decisions about their health.

In the realm of diabetes prediction, the continuous exploration of new technologies, models,
and data sources is essential. As the field of machine learning evolves, it opens avenues for
more sophisticated and personalized approaches to diabetes prediction, ultimately
contributing to more effective preventative measures and personalized healthcare strategies.
By embracing these advancements, we move closer to a future where machine learning plays
a pivotal role in early detection, intervention, and management of diabetes, improving overall
health outcomes and quality of life.


REFERENCES
Project path: https://github.com/navanathmadakatte1/Diabitis

Python
Python is a widely used programming language. You can find more information about
Python at the official Python website: https://www.python.org/

Pandas
Pandas is a powerful data manipulation and analysis library for Python. You can find more
information about Pandas here: https://pandas.pydata.org/

Streamlit
Streamlit is an open-source Python library for creating web applications for data science
and machine learning. You can learn more about Streamlit here:
https://afroze.streamlit.app/

Scikit-learn (sklearn)
Scikit-learn is a machine learning library for Python. It provides simple and efficient
tools for data mining and data analysis. More information can be found at:
https://scikit-learn.org/
