Afroz content

DIABETES PREDICTION
CHAPTER 1
INTRODUCTION
In recent years, the global prevalence of diabetes has reached alarming levels, posing a
significant health challenge that demands innovative solutions. With the advent of advanced
technologies, machine learning (ML) has emerged as a powerful tool for predictive analysis in
healthcare. This project aims to harness the potential of ML to develop a robust diabetes
prediction model, providing a proactive and personalized approach to healthcare.
The increasing incidence of diabetes is influenced by a myriad of factors, including lifestyle
choices, genetic predispositions, and environmental variables. Traditional methods of
diagnosis and risk assessment often fall short in providing timely and accurate predictions.
This ML project seeks to bridge this gap by leveraging diverse datasets and sophisticated
algorithms to create a predictive model capable of identifying individuals at risk of
developing diabetes.
By integrating a comprehensive set of features, ranging from demographic information to
lifestyle patterns, this project aims to develop a nuanced understanding of the factors contributing
to diabetes risk. The utilization of ML algorithms will enable the system to learn and adapt,
continually refining its predictive capabilities as more data becomes available.
The ultimate goal of this project is to move beyond reactive healthcare approaches and usher
in an era of proactive and preventive strategies. Early detection of individuals at risk of
diabetes allows for timely interventions, lifestyle modifications, and targeted healthcare plans,
ultimately reducing the burden of the disease on individuals and healthcare systems alike.
As we delve into the realm of ML for diabetes prediction, the potential impact on public
health is substantial. This project represents a pivotal step towards personalized medicine,
where individuals can benefit from tailored interventions based on their unique risk profiles.
The intersection of technology and healthcare has the power to transform how we approach
and manage chronic conditions.
Dept. of CSE(AIML), BKIT Bhalki2023-2024 Page 1

Problem Statement
The increasing prevalence of diabetes worldwide presents a formidable public health
challenge, necessitating proactive strategies for early detection and intervention. Traditional
methods of diabetes risk assessment often rely on retrospective analysis and lack the precision
needed for personalized healthcare. Addressing this gap, our project aims to develop a machine
learning-based solution for diabetes prediction, with a focus on improving accuracy, early
identification, and personalized risk assessment.
DATASET DESCRIPTION
The dataset consists of several medical predictor variables and one target variable, Outcome.
Predictor variables include the number of pregnancies the patient has had, their BMI, insulin
level, age, and so on.
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)

BREAKDOWN OF DATASETS
Before we start data visualization, we need to perform the following steps:

1. Import the packages needed for the rest of the analysis.
2. Mount Google Drive and read the files stored on it.
3. Suppress future warnings (such as those raised by Seaborn).
4. Review all sources of relevant information.
5. View all of the information.
6. Separate the code from the data.
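The preparation steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual notebook: the two CSV rows are made up so that the snippet runs anywhere, and the Drive mount is shown only as a comment because it works only inside Colab.

```python
import io
import warnings

import pandas as pd

# Silence FutureWarning messages, which plotting libraries often emit.
warnings.filterwarnings("ignore", category=FutureWarning)

# In Colab, Google Drive would be mounted before reading the file:
# from google.colab import drive
# drive.mount("/content/drive")

# Two made-up rows standing in for the real CSV file stored on Drive.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,25,80,28.5,0.45,33,0\n"
    "5,160,82,30,150,35.1,0.80,51,1\n"
)
df = pd.read_csv(sample_csv)

# Inspect the column names, data types, and first rows.
df.info()
print(df.head())

# Split the predictors from the target column named in the dataset description.
X = df.drop(columns="Outcome")
y = df["Outcome"]
```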
PROJECT FLOWCHART
1. Loading data and diagnosing the data
2. Data Filtering
3. EDA of raw data to understand internal correlations.
4. Feature Engineering
5. Feature Selection: used sparingly because our data has a limited number of features
(only one feature was eliminated, using the heatmap, to address the multicollinearity issue)
6. Model Building
7. Model Training and Testing
8. Model Evaluation & Hyperparameter Tuning.
9. Model Deployment

AN INTRODUCTION TO GOOGLE COLAB :
Google Colab is a tool that allows you to write, run, and share Python code within your
browser. It is a version of the popular Jupyter Notebook within the Google suite of tools.
Jupyter Notebooks (and therefore Google Colab) allow you to create a document containing
executable code along with text, images, HTML, LaTeX, etc., which is then stored in your
Google Drive and can be shared with peers and colleagues for editing, commenting, and viewing.
STARTING DOCUMENT
In order to create a new document, access Google Colab at colab.research.google.com. Once
here, you can begin a new document, or “notebook,” in one of two ways. Upon visiting the
site, a box with your recently visited Colab documents will appear. Select “New Notebook”
to begin a new document. Alternatively, to create a new document from any screen, select
“File” in the top left corner, then select “New Notebook” from the dropdown box.
BASIC FUNCTIONS
After you have created a new notebook, you will see an empty code cell. Python code can be
entered into these code cells and executed at any time by either clicking the Play button to the
left of the code cell or by pressing Command/Ctrl + Enter on your keyboard.

At the top of your notebook, you will find two buttons: “+ Code” and “+ Text”. Add a new code
cell by clicking the “+ Code” button in the top left corner of the document. Add a text cell by
clicking the “+ Text” button in the top left corner of the document.
When a cell is selected, a toolbar will appear in the top right corner of the cell. This toolbar
contains functions specific to that cell. Options include moving the cell up and down, adding
comments, and deleting the cell.
SHARING A COLAB NOTEBOOK
As with other Google Apps, Colab Notebooks can be shared. Look for the “share” button in
the top right-hand corner of the window. Google Colab documents can also be shared in
Google Drive, just as you do with other types of documents.

DATA PREPARATION
EXAMINING NULL / MISSING VALUES
Null values are a major problem in machine learning and deep learning. If you are using
scikit-learn, TensorFlow, or any other machine learning or deep learning package, you generally
need to clean up null values before passing your data to the framework; otherwise, many
estimators raise an error. We therefore checked for null and missing values and found none in
the provided dataset.
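A null check of this kind can be sketched with pandas as shown below. The small frame is made up for illustration and deliberately contains one missing value. The extra count of zeros is our own assumption, not a step described in this report: some medical datasets record a missing measurement as 0, so it can be worth looking at as well.

```python
import numpy as np
import pandas as pd

# Made-up sample with one NaN and one suspicious zero.
df = pd.DataFrame(
    {
        "Glucose": [120, 0, 95],
        "BMI": [28.5, 31.2, np.nan],
        "Outcome": [0, 1, 0],
    }
)

# Number of null (NaN) entries per column.
null_counts = df.isnull().sum()
print(null_counts)

# Assumption: a zero in these columns may stand in for a missing measurement.
zero_counts = (df[["Glucose", "BMI"]] == 0).sum()
print(zero_counts)
```

On the real dataset, every entry of `null_counts` being zero corresponds to the result reported above.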
DATA CLEANING
Data cleaning is the foremost step in any data science project. No data is clean, but most is
useful. Data cleaning is the process of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database. It involves identifying incomplete,
incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting
the dirty or coarse data. To begin our data cleaning, we first checked for duplicate values
and found none in the given dataset. We then converted data types, performed exploratory data
analysis, and searched for the best-fitting model for the dataset.
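The duplicate check and the data type conversion can be illustrated as follows. The frame is a made-up example that contains one duplicate row and a numeric column stored as text, so that both operations have something to do.

```python
import pandas as pd

# Made-up sample: the third row repeats the first, and Glucose is stored as text.
df = pd.DataFrame({"Glucose": ["120", "95", "120"], "Age": [33, 41, 33]})

# Count rows that are exact copies of an earlier row.
duplicate_count = int(df.duplicated().sum())

# Drop the duplicates and convert the text column to integers.
cleaned = df.drop_duplicates().reset_index(drop=True)
cleaned["Glucose"] = cleaned["Glucose"].astype(int)

print(duplicate_count)
print(cleaned.dtypes)
```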
Tools Used:
The project leverages Python as the primary programming language for its flexibility,
extensive libraries, and machine learning frameworks.
Popular libraries and tools such as scikit-learn and Streamlit are employed for data preprocessing,
model building, and deployment.

Software required and specifications

Software:
Python: Primary language for ML.

Libraries: TensorFlow/PyTorch, Scikit-learn, XGBoost, OpenCV, NLTK/SpaCy.
Development : Jupyter, IDEs (PyCharm, VSCode).
Version Control : Git, GitHub / GitLab.
Visualization: Matplotlib, Seaborn, Plotly
Hardware:
CPU : AMD Ryzen 5

GPU: NVIDIA GTX 10 series or above.
RAM: 16GB
Storage: SSD

CHAPTER 2
EXPLORATORY DATA ANALYSIS
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to
summarize their main characteristics, often using statistical graphics and other data
visualization methods. A statistical model can be used or not, but primarily EDA is for seeing
what the data can tell us beyond formal modeling, and it thereby contrasts with traditional
hypothesis testing. EDA helped us figure out various aspects of, and relationships among, the
target and the independent variables.
Fig 1 : Data Science Process

A correlation matrix is a square matrix that shows the correlation between each pair of
variables. The correlation between two variables can range from -1 to 1, with a correlation of
1 indicating a perfect positive correlation, a correlation of -1 indicating a perfect negative
correlation, and a correlation of 0 indicating no correlation.
A heatmap is a graphical representation of a correlation matrix. The color of each cell in the
heatmap represents the strength and direction of the correlation between the corresponding
variables. Darker colors indicate stronger correlations.
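To make the range of the correlation coefficient concrete, the short sketch below computes the Pearson correlation by hand for two made-up variables: one that rises with glucose and one that falls with it. The function name `pearson` and the sample values are ours, chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


glucose = [85, 100, 115, 130, 145]
rising = [2 * g + 10 for g in glucose]   # increases exactly with glucose
falling = [300 - g for g in glucose]     # decreases exactly with glucose

print(pearson(glucose, rising))   # close to +1: perfect positive correlation
print(pearson(glucose, falling))  # close to -1: perfect negative correlation
```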

The heatmap with annotations shows the correlation between each pair of numeric variables
in the dataset. The annotations show the exact correlation value for each pair of variables.
Fig 2 : Heatmap
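An annotated heatmap of this kind can be produced along the following lines. This is a sketch on made-up values, not the code behind Fig 2; the helper name `plot_heatmap` and the choice of the "coolwarm" colormap are assumptions. The plotting libraries are imported inside the helper so that the correlation matrix itself can be computed without them.

```python
import pandas as pd

# Made-up sample of three numeric columns.
df = pd.DataFrame(
    {
        "Glucose": [85, 100, 115, 130, 145, 160],
        "BMI": [22.0, 27.5, 24.0, 31.0, 29.5, 35.0],
        "Age": [50, 23, 41, 35, 62, 29],
    }
)

# Square matrix of pairwise Pearson correlations.
corr = df.corr()
print(corr.round(2))


def plot_heatmap(corr_matrix):
    """Draw the correlation matrix as an annotated heatmap."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    # annot=True writes the exact correlation value into each cell.
    ax = sns.heatmap(
        corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1
    )
    ax.set_title("Correlation heatmap")
    plt.show()
    return ax
```

Calling `plot_heatmap(corr)` in a notebook cell displays the figure.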

CHAPTER 3
FEATURE ENGINEERING
Feature engineering is the act of converting raw observations into desired features using
statistical or machine learning approaches. It refers to the manipulation (addition, deletion,
combination, mutation) of our dataset to improve machine learning model training, leading to
better performance and greater accuracy. Effective feature engineering is based on sound
knowledge of the business problem and the available data sources.
Fig 3 : Feature Engineering
We encode categorical data with both encoders and compare the accuracy obtained with each:
1. One Hot Encoder Data
2. Label Encoder Data
1.One Hot Encoder:
One-hot encoding is used in machine learning as a method to quantify categorical data.
The one-hot encoding approach does not impose an order on the categories, but it causes the
number of columns to expand greatly. For columns with many unique values, consider other
techniques such as label encoding.
2.Label Encoder:
Label encoding converts labels into numeric form so that they become machine-readable.
Machine learning algorithms can then operate on those labels directly. It is an important
pre-processing step for structured datasets in supervised learning.
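The difference between the two encoders can be seen on a small example. The `age_group` column below is hypothetical and exists only for illustration; it is not one of the columns listed in the dataset description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column used only to demonstrate the encoders.
df = pd.DataFrame({"age_group": ["young", "middle", "senior", "middle"]})

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(df["age_group"], prefix="age_group").astype(int)
print(one_hot)

# Label encoding: a single integer column; classes are numbered in sorted order.
encoder = LabelEncoder()
labels = encoder.fit_transform(df["age_group"])
print(list(encoder.classes_), labels.tolist())
```

One-hot encoding produced three columns for three categories, while label encoding kept a single column but introduced an artificial ordering of the categories.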
OUTLIER
An outlier is a data point that differs significantly from the other observations in the
dataset. Not all outliers are the same: some have a strong influence, some none at all. Some
are valid and important data values; others are simply errors or noise. Many parametric
statistics, such as the mean, correlations, and every statistic based on these, are sensitive to
outliers.
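The sensitivity of the mean can be demonstrated with a few made-up readings. Adding a single extreme value shifts the mean considerably, while the median, shown for comparison, stays where it was.

```python
import statistics

# Made-up readings, then the same readings with one extreme value added.
readings = [10, 12, 11, 13, 12]
with_outlier = readings + [100]

mean_before = statistics.mean(readings)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(readings)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # the mean more than doubles
print(median_before, median_after)  # the median does not move
```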

CHAPTER 4
MODEL SELECTION
ALGORITHMS
1.Linear Regression :
Linear regression is one of the simplest and most popular machine learning algorithms. It is a
statistical method used for predictive analysis. Linear regression makes predictions for
continuous numeric variables.
Fig 4 : Linear Regression
Linear regression models the relationship between a dependent variable and one or more
independent variables. With a single independent variable, the regression line is defined by
the equation Y = a + bX, where a is the intercept and b is the slope.
The model is fitted by finding the straight line that best matches the observed data. Before
fitting the model, it is important to check whether there is a relationship between the
variables or features of interest; for numerical variables, the correlation coefficient serves
this purpose.
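For a single feature, the coefficients a and b in Y = a + bX have a closed-form least-squares solution. The sketch below computes them on made-up points that lie exactly on the line Y = 2 + 3X; the helper name `fit_line` is ours.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [5, 8, 11, 14, 17]  # generated from Y = 2 + 3X
a, b = fit_line(xs, ys)
print(a, b)
```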

2.Decision Tree Regression :

A decision tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree-structured model that supports decision making. It splits a data
set into smaller and smaller subsets while the tree is incrementally developed. The
final result is a tree with decision nodes and leaf nodes.
Fig 5 : Decision Tree
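A minimal example with scikit-learn shows how a regression tree splits the data into subsets. The data is made up, and the depth limit of 1 is chosen only so that the tree has exactly one decision node and two leaf nodes.

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up data with two clearly separated groups of target values.
X = [[1], [2], [3], [10], [11], [12]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

# max_depth=1 allows a single split: one decision node and two leaves.
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, y)

# Each leaf predicts the mean target value of the training points it holds.
predictions = tree.predict([[2.5], [11.5]])
print(tree.get_n_leaves(), predictions)
```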
3.Random Forest Regression :

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique.
“Random Forest is a classifier that contains a number of decision trees on various subsets of
the given dataset and takes the average to improve the predictive accuracy of the dataset”.
Fig 6 : Random Forest

A greater number of trees in the forest generally leads to more stable accuracy and reduces
the risk of overfitting.
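The averaging behaviour can be checked directly in scikit-learn. In this sketch on synthetic data, the forest's prediction is compared with the mean of the predictions of its individual trees. The data, the number of trees, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: a noisy linear function of two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Each tree is trained on a bootstrap sample of the training data.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

X_new = np.array([[2.0, 5.0], [8.0, 1.0]])

# Predictions of the individual trees, one row per tree.
tree_predictions = np.array([t.predict(X_new) for t in forest.estimators_])
forest_prediction = forest.predict(X_new)

# The forest's output is the average over its trees.
print(forest_prediction, tree_predictions.mean(axis=0))
```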
4.XGBoost Regressor :
XGBoost, which stands for Extreme Gradient Boosting, is an open-source library that provides an
efficient and effective implementation of the gradient boosting algorithm. Shortly after its
development and initial release, XGBoost became the go-to method and often the key
component in winning solutions for a range of problems in machine learning competitions.
Regression predictive modeling problems involve predicting a numerical value such as a
dollar amount or a height. XGBoost can be used directly for regression predictive modeling.
XGBoost is one of the fastest implementations of gradient boosted trees.
It does this by tackling one of the major inefficiencies of gradient boosted trees: considering
the potential loss for all possible splits to create a new branch (especially if you consider the
case where there are thousands of features, and therefore thousands of possible splits).
XGBoost tackles this inefficiency by looking at the distribution of features across all data
points in a leaf and using this information to reduce the search space of possible feature
splits.
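The idea of shrinking the split search space can be illustrated without the library itself. The sketch below is a simplified illustration, not XGBoost's actual implementation: it compares the number of candidate thresholds an exhaustive search would consider for one feature with the number obtained when thresholds are proposed only at quantiles of the feature's distribution. The bin count of 32 is an arbitrary assumption.

```python
import random
import statistics

# Synthetic values for a single continuous feature.
rng = random.Random(0)
feature_values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Exhaustive search: one candidate threshold between each pair of
# adjacent distinct values.
exact_candidates = len(set(feature_values)) - 1

# Distribution-based search: thresholds only at the quantile cut points.
n_bins = 32  # hypothetical bin count
quantile_thresholds = statistics.quantiles(feature_values, n=n_bins)

print(exact_candidates, len(quantile_thresholds))
```

The quantile-based proposal evaluates a few dozen thresholds instead of thousands, at the cost of possibly missing the exact best split.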
5.Gradient Boosting Regressor :
Gradient boosting is one of the most popular machine learning algorithms for tabular
datasets. It is powerful enough to capture nonlinear relationships between the model target
and the features, and many implementations can deal with missing values, outliers, and
high-cardinality categorical features with little special treatment. While you can
build bare-bones gradient boosted trees using popular libraries such as XGBoost or
LightGBM without knowing any details of the algorithm, you still want to know how it
works when you start tuning hyperparameters, customizing the loss function, and so on, to get
better quality from your model.
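How boosting builds up a nonlinear fit tree by tree can be observed with scikit-learn's GradientBoostingRegressor. The sketch below uses synthetic data, and the hyperparameter values are illustrative assumptions rather than the settings used in this project. It records the training error after each added tree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic nonlinear data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# Shallow trees are added one at a time, each correcting the current errors.
model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
)
model.fit(X, y)

# Training error after 1, 2, ..., 100 trees.
staged_mse = [mean_squared_error(y, pred) for pred in model.staged_predict(X)]
print(staged_mse[0], staged_mse[-1])
```

The number of trees, the learning rate, and the tree depth are the hyperparameters most often tuned, which is where knowing how the algorithm works pays off.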

Fig 7 : Gradient Boosting

CHAPTER 5
MODEL EVALUATION
Fig 8 : Model Evaluation Steps

Root Mean Square Error (RMSE)
RMSE (root mean square error) measures the difference between the values predicted by a
model and the actual values. In other words, it is an error metric for measuring the
accuracy and error rate of a machine learning algorithm on a regression problem.
RMSE is the square root of the value obtained from the mean square error function. It helps us
quantify the difference between the estimated and actual values in the same units as the
target variable.
Using RMSE, we can easily measure the performance of the model.
Mean Square Error (MSE)

MSE is a risk function that measures the average squared difference between the
predicted and the actual values of a variable.
Mean Squared Error is calculated in much the same way as the general loss equation from
earlier. We will consider the bias value as well since that is also a parameter that needs to be
updated during the training process.
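Both metrics can be computed in a few lines. The sketch below uses made-up actual and predicted values; the function names are ours and simply mirror the definitions given above.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the units of the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# Squared errors are 1, 0, 4 and 1, so the MSE is 6 / 4 = 1.5.
mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mse, rmse)
```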
R-squared (R²)
R-squared is a statistical measure that represents the goodness of fit of a regression model.
The ideal value for R-squared is 1. The closer the value of R-squared is to 1, the better the
model fits.
R-squared compares the residual sum of squares with the total sum of squares:
R² = 1 - (residual sum of squares / total sum of squares). The total sum of squares is the sum
of the squared vertical distances between the data points and the mean line.
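The comparison of the two sums of squares can be written out directly. The values below are made up, and the function name `r_squared` is ours.

```python
def r_squared(y_true, y_pred):
    """R-squared: 1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_residual / ss_total


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

r2 = r_squared(actual, predicted)
print(r2)

# A perfect model scores 1; always predicting the mean scores 0.
print(r_squared(actual, actual))
print(r_squared(actual, [4.25] * 4))
```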

Fig : Statistical Measure

CHAPTER 6:
MODEL DEPLOYMENT WITH STREAMLIT:
Streamlit, a powerful Python library, simplifies the intricate process of constructing web
applications tailored for data science and machine learning projects. This chapter delves
into the multifaceted capabilities of Streamlit, highlighting its significance in the
deployment phase of machine learning models.
Data Exploration and Visualization:
Streamlit excels in creating interactive charts, graphs, and visualizations with minimal
Python code. This functionality proves invaluable for delving into data intricacies, allowing
for dynamic and engaging presentations of key findings. The streamlined process enhances
the accessibility of complex data insights.
Data Interaction:
Leveraging widgets like sliders, input boxes, and buttons, Streamlit facilitates user
interaction with data. These elements empower users to actively engage with the content,
enabling scenarios to be demonstrated or granting readers the autonomy to explore data
within the report independently. This level of interactivity enhances the overall user
experience.
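A sidebar of input widgets with a prediction checkbox might be sketched as below. This is an illustration under stated assumptions, not the deployed application: the names `FEATURES`, `to_feature_row`, and `main` are hypothetical, the widget ranges are arbitrary, and no trained model is loaded. Streamlit is imported inside `main` so that the helper can be used without it.

```python
# Column order taken from the dataset description.
FEATURES = [
    "Pregnancies",
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI",
    "DiabetesPedigreeFunction",
    "Age",
]


def to_feature_row(inputs):
    """Order the user's inputs to match the columns the model was trained on."""
    return [float(inputs[name]) for name in FEATURES]


def main():
    """Render the page: sidebar inputs and a checkbox that triggers prediction."""
    import streamlit as st

    st.title("Diabetes Prediction")

    # One numeric input per feature, placed in the sidebar.
    inputs = {
        name: st.sidebar.number_input(name, min_value=0.0, value=0.0)
        for name in FEATURES
    }

    if st.checkbox("Predict"):
        row = to_feature_row(inputs)
        # A trained model would be applied to `row` here (not shown).
        st.write("Feature row passed to the model:", row)
```

To try the sketch, add a final line calling `main()` and launch the file with `streamlit run`.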
Custom Dashboards:
The versatility of Streamlit extends to crafting custom dashboards that showcase pivotal
metrics and insights derived from the data. Custom dashboards prove invaluable when
aiming to deliver an interactive summary within the report, offering a consolidated view of
key information.
Embedding in Reports:
While Streamlit itself does not generate traditional reports in formats such as PDF or Word,
it seamlessly integrates into web-based reports or documents.

CHAPTER 7
DIABETES PREDICTION
HOMEPAGE
Enter the input values in the sidebar:
Finally, after entering all the data, we request a prediction by ticking the checkbox, as shown
below.
After clicking the checkbox, we get the estimated result: Remove old model

After clicking the submit button

CHAPTER 8:
FUTURE SCOPE
Use more data: More data generally leads to more accurate predictions. You can
gather information from various sources, such as healthcare databases, patient records, and
relevant social media platforms.
Explore new learning models: Many different machine learning models are available for
medical predictions. You can try different models to see which works best for your healthcare
data.
Consider other factors affecting diabetes prediction: In addition to the factors included in
your model, other elements can affect diabetes predictions, such as lifestyle
changes, genetic factors, and environmental conditions. You may consider adding these
variables to your model to improve its accuracy.
Create real-time forecasting systems: Real-time forecasting systems will be able to predict
diabetes risk over the next few hours or days. This is beneficial for both healthcare providers
and individuals managing their health.
Develop location-specific prediction methods: Location-specific prediction methods will be
able to predict diabetes risk at different healthcare facilities or regions. This will be more useful
than broader predictions because it will allow healthcare professionals and individuals to plan
interventions and manage health resources more efficiently.

CONCLUSION:
In conclusion, leveraging machine learning for diabetes prediction holds significant promise
in advancing healthcare and improving patient outcomes. By incorporating diverse and
abundant data sources, experimenting with various learning models, and considering a
comprehensive set of influencing factors, we can enhance the accuracy and reliability of
diabetes prediction models. The integration of real-time forecasting systems empowers
healthcare professionals and individuals to proactively address and manage diabetes risks
over short-term intervals, facilitating timely interventions.
Furthermore, the development of location-specific prediction methods adds a layer of
precision, enabling tailored strategies for diabetes management at different healthcare
facilities or regions. This not only optimizes resource allocation but also empowers
individuals to make informed decisions about their health.
In the realm of diabetes prediction, the continuous exploration of new technologies, models,
and data sources is essential. As the field of machine learning evolves, it opens avenues for
more sophisticated and personalized approaches to diabetes prediction, ultimately
contributing to more effective preventative measures and personalized healthcare strategies.
By embracing these advancements, we move closer to a future where machine learning plays
a pivotal role in early detection, intervention, and management of diabetes, improving overall
health outcomes and quality of life.

REFERENCES
Project path https://github.com/navanathmadakatte1/Diabitis
Python
Python is a widely used programming language. You can find more information about
Python at the official Python website: https://www.python.org/
Pandas
Pandas is a powerful data manipulation and analysis library for Python. You can find more
information about Pandas here: https://pandas.pydata.org/
Streamlit
Streamlit is an open-source Python library for creating web applications for data science
and machine learning. You can learn more about Streamlit here:
https://afroze.streamlit.app/
Scikit-learn (sklearn)
Scikit-learn is a machine learning library for Python. It provides simple and efficient
tools for data mining and data analysis. More information can be found at:
https://scikit-learn.org/
