VISVESVARAYA TECHNOLOGICAL UNIVERSITY
BELAGAVI-590 018
III Semester
Data Analytics Mini-Project Report on
“Diabetes Prediction Analysis”
Submitted in partial fulfillment of the requirements for the degree of

Master of Computer Applications
of Visvesvaraya Technological University, Belagavi
by
Lokesha S
1RN22MC023
Under the guidance of
Mrs. Roopa H M
Assistant Professor
Department of MCA
Estd : 2001
Department of Master of Computer Applications

RNS INSTITUTE OF TECHNOLOGY
Dr. Vishnuvardhan Road, Channasandra, Bengaluru – 560 098
2023-24
RNS INSTITUTE OF TECHNOLOGY
Dr. Vishnuvardhan Road, Channasandra, Bengaluru – 560 098
Department of Master of Computer Applications
Estd: 2001
CERTIFICATE
This is to certify that the Data Analytics Mini-Project work entitled “Diabetes Prediction
Analysis” has been successfully carried out by Lokesha S of III Semester bearing USN
1RN22MC023, bonafide student of RNS Institute of Technology, in partial fulfillment of the
requirements for award of degree of Master of Computer Applications of Visvesvaraya
Technological University, Belagavi, during the year 2023-24. It is certified that all
corrections/suggestions indicated for internal assessment have been incorporated in this report.
The mini-project report has been approved as it satisfies the academic requirements for the said
degree.
Mrs. Roopa H M
Project Coordinator
Department of MCA
RNSIT, Bengaluru.

Dr. N P Kavya
Head of Department
Department of MCA
RNSIT, Bengaluru.
External Viva
Name of Examiners Signature with Date
1.
2.
DECLARATION
I, Lokesha S, a student of 3rd semester MCA, RNS Institute of Technology, bearing
USN: 1RN22MC023, hereby declare that the project entitled “Diabetes Prediction
Analysis” has been carried out by me under the supervision of Project Coordinator Mrs.
Roopa H M, Assistant Professor, Department of MCA, and submitted in partial fulfillment
of the requirements for the award of the Degree of Master of Computer Applications by the
Visvesvaraya Technological University during the academic year 2022-23. This report has not
been submitted to any other Organization/University for any award of degree or certificate.
Name: Lokesha S
USN: 1RN22MC023
Signature of the candidates
ACKNOWLEDGEMENT
I am overwhelmed in all humbleness and gratefulness to acknowledge my debt to all those
who have helped me to put these ideas, well above the level of simplicity, into something
concrete.
I take this opportunity to acknowledge the help I have received from different individuals
who directly or indirectly helped in completion of this Project Work.
I express my sincere words of gratitude to Dr. R N Shetty, Founder and Sri. Satish R
Shetty, Managing Trustee, RN Shetty Trust & Chairman, RNS group of Institutions for
providing us wonderful academic environment.
I am deeply indebted to our beloved Principal, Dr. Ramesh Babu H S, RNSIT, for
providing the necessary facilities to carry out this work.
I am extremely grateful to Dr. N P Kavya, Professor and Head, Department of MCA,
RNSIT, for nurturing our technical skills and contributing to the success of this project.
I would also express my heartfelt thanks to our Project Coordinator Mrs. Roopa H M
Assistant Professor, Department of MCA, RNSIT for her continuous guidance and valuable
suggestions for this Project work.
I also express my heartfelt thanks to all the teaching and nonteaching staff members of MCA
Department for their encouragement and support throughout this work.
Name: Lokesha S
USN: 1RN22MC023
ABSTRACT
This project focuses on the development of a robust diabetes prediction and analysis system
using advanced data analysis, machine learning, and statistical techniques. The primary
objective is to leverage a comprehensive dataset containing patient demographics, medical
history, laboratory results, and lifestyle factors related to diabetes. Initial data preprocessing
involved handling missing values, removing duplicates, and transforming variables to ensure
data quality and consistency. Descriptive statistics, correlation analysis, and exploratory data
visualization techniques provided valuable insights into the dataset's characteristics, revealing
significant correlations between variables such as blood glucose levels, HbA1c levels, age,
and hypertension status. A predictive model was then developed using machine learning
algorithms to predict the likelihood of diabetes based on patient features. The model
demonstrated promising accuracy and performance metrics, making it a valuable tool for
healthcare professionals in assessing diabetes risk and optimizing patient management
strategies. Future enhancements include incorporating additional datasets, exploring deep
learning models, and integrating real-time monitoring for proactive healthcare interventions.
TABLE OF CONTENTS
Declaration
Acknowledgement
Abstract
Table of Contents
List of Figures
CHAPTERS
1. INTRODUCTION
1.1. Introduction
1.2. Objectives
1.3. Motivation
1.4. Overview of Project
2. DATA COLLECTION
2.1. Parameters Implemented
3. LITERATURE SURVEY
3.1. Library Requirements
3.2. Hardware and Software Requirements
3.3. Tools Used
4. DATA CLEANING AND WRANGLING MECHANISMS
5. DATA ANALYSIS AND VISUALIZATION
6. CONCLUSION AND FUTURE ENHANCEMENT
REFERENCES
LIST OF FIGURES
Figure No. Name
Fig. 2.1 Data
Fig. 2.2 Dataset Information
Fig. 4.1 Dataset Null Values
Fig. 5.1 Dataframe Info
Fig. 5.2 Calculating Mean, Count, Min, Max and Standard Deviation
Fig. 5.3 Patient Analysis with Respect to Age
Fig. 5.4 Gender vs Diabetes
Fig. 5.5 Smoking History of Patient
Fig. 5.6 Smoking History Distribution
Fig. 5.7 Correlation Heatmap
Fig. 5.8 Visualization
CHAPTER – 1
INTRODUCTION
1.1 INTRODUCTION
Various classification strategies are used in the medical field for classifying data into different
classes. Diabetes is a condition that affects the body's ability to produce the hormone insulin, which
causes carbohydrate metabolism to become irregular and blood glucose levels to increase. High
blood sugar is a common symptom of diabetes. If diabetes is not treated, it can lead to a variety of
complications. Diabetic ketoacidosis and nonketotic hyperosmolar coma are two significant
complications. Diabetes is considered a severe health problem in which the amount of sugar in the
blood cannot be regulated. Diabetes is influenced by a variety of factors such as height, weight,
genetic factors, and insulin, but the most important factor is blood sugar concentration. Early
identification is the most effective way to avoid these complications. This dataset comes from the
‘National Institute of Diabetes and Digestive and Kidney Diseases’ Pima Indians Diabetes Database
(PIDD). The records were selected from a larger database under several constraints.
The aim of this project is to utilize statistical analysis techniques to develop a predictive model for
diabetes and gain insights into the factors influencing the disease. By analyzing a comprehensive
dataset containing demographic, clinical, and lifestyle-related features, we seek to identify key
predictors and patterns associated with diabetes onset. This project aims to contribute to the early
detection and management of diabetes by providing healthcare professionals with valuable insights
and predictive tools.
1.2 Objectives
• Investigate the distribution and characteristics of the dataset, including demographic
information, clinical measurements, and lifestyle factors.
• Identify potential risk factors and predictive features associated with diabetes onset through
correlation analysis and exploratory data analysis.
• Develop a predictive model using machine learning algorithms to classify individuals as
either diabetic or non-diabetic based on their features.
• Evaluate the performance of the predictive model using appropriate evaluation metrics and
assess its reliability and generalization ability.
• Provide insights and recommendations for healthcare professionals and policymakers to
improve diabetes prevention, diagnosis, and management strategies.
1.3 Motivation
• Public Health Impact: Diabetes is a prevalent chronic disease with significant public health
implications. By developing accurate predictive models, we can contribute to early detection
and intervention, potentially reducing the burden of diabetes-related complications and
improving overall health outcomes for affected individuals.
• Clinical Decision Support: Healthcare professionals often face challenges in identifying
individuals at risk of developing diabetes or those who require closer monitoring and
intervention. Predictive models can serve as valuable tools for clinicians, providing them with
actionable insights to personalize patient care and optimize treatment strategies.
• Preventive Healthcare: Prevention is often more effective and less costly than treatment
after the onset of disease. By identifying risk factors and early warning signs of diabetes, we
can empower individuals to make informed lifestyle choices and adopt preventive measures
to mitigate their risk of developing the disease.
• Data-Driven Insights: With the abundance of healthcare data available today, there is an
opportunity to leverage advanced analytics and machine learning techniques to extract
meaningful insights and patterns from complex datasets. By applying statistical analysis to
healthcare data, we can uncover hidden relationships and factors contributing to diabetes
onset, informing future research and policy decisions.
• Personal Interest: Many members of the project team have a personal or professional interest
in healthcare and data science. By combining their expertise and passion for these fields, they
are driven to make a meaningful impact on public health through data-driven research and
innovation.
Overall, the motivation behind this project lies in the intersection of healthcare, data science, and
social impact. By harnessing the power of statistical analysis and predictive modeling, we aim to
contribute to the prevention and management of diabetes, ultimately improving the health and well-
being of individuals and communities worldwide.

1.4 Overview of Project
The project revolves around predicting and analyzing diabetes without resorting to machine learning
algorithms, focusing instead on statistical analysis techniques. We will start by collecting a
comprehensive dataset containing various demographic, clinical, and lifestyle-related features,
alongside a target variable denoting diabetes status (positive/negative). Through meticulous data
preprocessing, including the handling of duplicates, missing values, and necessary transformations,
we will ensure the dataset's cleanliness and integrity. Exploratory data analysis will play a pivotal
role in understanding the dataset's structure, distribution, and inherent characteristics, providing
valuable insights into potential predictors of diabetes. Feature engineering will involve crafting new
features or transforming existing ones to augment predictive power while encoding categorical
variables appropriately for subsequent analysis.
Moving forward, our analysis will delve into examining feature distributions, correlations, and
relationships using statistical methods, supplemented by visualization techniques such as plots,
histograms, and pair plots. By harnessing statistical analysis, we aim to uncover significant predictors
and patterns associated with diabetes onset, paving the way for informed decision-making and
actionable insights. Throughout the project, we will interpret findings and derive insights from
statistical analyses, offering a comprehensive understanding of factors contributing to diabetes
prediction.

CHAPTER – 2
DATA COLLECTION
For data collection using pandas, we will utilize the read_csv() function to load the dataset into a
pandas DataFrame. The dataset will be stored in a CSV (Comma-Separated Values) file format,
which is widely used for tabular data representation. We will specify the file path to the CSV file
containing the diabetes dataset as an argument to the read_csv() function. Upon loading the dataset
into the DataFrame, pandas will automatically parse the CSV file and organize the data into rows and
columns. We will then proceed to explore the DataFrame using various pandas functions and
methods to gain insights into the dataset's structure, distribution, and characteristics. This
process will involve examining the first few rows of the
DataFrame using the head() method, as well as checking the dimensions of the DataFrame using the
shape attribute. Additionally, we will use the info() method to obtain a summary of the dataset's
features, including data types, missing values, and memory usage. Finally, pandas' powerful data
manipulation capabilities will enable us to clean, preprocess, and analyze the dataset efficiently,
setting the stage for subsequent stages of the project, such as data visualization and statistical
analysis.
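The loading and inspection steps described above can be sketched as follows. This is a minimal illustration, not the project's notebook: to stay self-contained it reads a hypothetical three-row sample from memory instead of the actual CSV file, and the column names are assumptions modelled on the parameters described in this report.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the diabetes CSV file.
# Column names are assumptions based on the parameters in this report.
sample_csv = io.StringIO(
    "gender,age,hypertension,heart_disease,smoking_history,"
    "bmi,HbA1c_level,blood_glucose_level,diabetes\n"
    "Female,80.0,0,1,never,25.19,6.6,140,0\n"
    "Male,54.0,0,0,No Info,27.32,6.6,80,0\n"
    "Female,44.0,1,0,current,19.31,6.5,200,1\n"
)

# read_csv accepts a file path in exactly the same way as this buffer.
df = pd.read_csv(sample_csv)

print(df.head())   # first few rows
print(df.shape)    # (rows, columns); (3, 9) for this sample
df.info()          # data types, non-null counts, memory usage
```

With the real file, only the argument to read_csv() changes; the inspection calls stay the same.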
Figure 2.1: Data

Figure 2.2: Dataset Information
2.1 Parameters Implemented
• Age: This parameter represents the age of the individual in years. Age is a crucial
demographic factor that can influence the risk of developing diabetes, with older individuals
generally being at higher risk.
• Gender: Gender refers to the biological sex of the individual and is typically categorized as
male or female. Gender may play a role in diabetes risk and management, as certain risk
factors and treatment approaches may vary between sexes.
• Hypertension: Hypertension, also known as high blood pressure, is a medical condition
characterized by elevated blood pressure levels. Hypertension is a common comorbidity of
diabetes and can exacerbate the risk of cardiovascular complications.
• Smoking History: This parameter indicates the individual's smoking history, categorizing
them as a non-smoker, current smoker, former smoker, or having no information available.
Smoking is a modifiable risk factor for diabetes and can increase the likelihood of developing
the disease.
• Heart Disease: Heart disease refers to various conditions affecting the heart, such as
coronary artery disease, myocardial infarction (heart attack), and heart failure. Individuals
with diabetes are at higher risk of developing heart disease, making it an important parameter
to consider.
• HbA1c Level: Hemoglobin A1c (HbA1c) level is a measure of long-term blood glucose
control and indicates average blood sugar levels over the past two to three months.
Monitoring HbA1c levels is essential for managing diabetes and assessing treatment
effectiveness.
• Blood Glucose Level: Blood glucose level represents the concentration of glucose (sugar) in
the blood and is a key parameter in diabetes diagnosis and management. Elevated blood
glucose levels, especially fasting glucose and postprandial glucose, are indicative of diabetes
mellitus.
• BMI: Body mass index, computed as weight in kg / (height in m)^2. Range of BMI:
BMI < 18.5 - underweight
18.5 <= BMI <= 24.9 - ideal weight
25 <= BMI <= 29.9 - overweight
BMI >= 30 - obese
These parameters provide comprehensive information about demographic characteristics, medical
history, and clinical measurements relevant to diabetes diagnosis, risk assessment, and management.
Analyzing these parameters in conjunction with the target variable (diabetes status) can help identify
patterns, risk factors, and predictive factors associated with diabetes onset and progression.
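As a worked illustration of the BMI parameter, the sketch below computes the index from weight and height and maps it to a category. The function names are hypothetical, and the cut-offs at 18.5, 25 and 30 are an assumption: they treat the one-decimal ranges listed above as rounded versions of the conventional boundaries.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to a category (assumed cut-offs: 18.5, 25, 30)."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "ideal weight"
    if value < 30:
        return "overweight"
    return "obese"


# Example: 70 kg at 1.75 m gives 70 / 3.0625, roughly 22.86
example = bmi(70, 1.75)
print(round(example, 2), bmi_category(example))
```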

CHAPTER – 3
LITERATURE SURVEY
The literature survey for this project involves looking at previous studies and research papers on
diabetes prediction and analysis. We'll review what other researchers have done in this area to
understand what works and what doesn't. By looking at past work, we can learn about the different
factors that contribute to diabetes, like age, gender, blood pressure, and smoking history. We'll also
see how these factors have been used to build predictive models for diabetes. This helps us figure out
the best approach for our project and what we can do differently to improve predictions. Overall, the
literature survey gives us a good foundation to build on and helps us make informed decisions
throughout the project.
3.1 Library Requirements
Pandas:
Pandas is a powerful Python library for data manipulation and analysis. It provides data structures
such as DataFrame and Series, which are essential for handling tabular data. Pandas offers a wide
range of functions and methods for data cleaning, preprocessing, exploration, and manipulation,
making it indispensable for working with datasets.
Example: Suppose we have a CSV file containing the diabetes dataset. We can use Pandas to read the
CSV file into a DataFrame, allowing us to explore and analyze the data easily:
import pandas as pd
# Read CSV file into a DataFrame
df = pd.read_csv('diabetes.csv')
# Display the first few rows of the DataFrame
print(df.head())

NumPy:
NumPy is a fundamental package for scientific computing in Python. It provides support for
multi-dimensional arrays (ndarrays), which are efficient data structures for representing and
manipulating numerical data. NumPy offers a wide range of mathematical functions for array
operations, including arithmetic, linear algebra, statistics, and random number generation. It also
integrates seamlessly with other libraries like Pandas and Matplotlib, enabling efficient data
processing and analysis. NumPy's array-oriented computing capabilities make it indispensable for
numerical computations and data manipulation tasks in this project.
Example: Suppose we want to calculate the mean and standard deviation of blood glucose levels in the
diabetes dataset using NumPy:
import numpy as np
# Calculate mean and standard deviation of blood glucose levels
mean_glucose = np.mean(df['blood_glucose_level'])
std_glucose = np.std(df['blood_glucose_level'])
print('Mean glucose level:', mean_glucose)
print('Standard deviation of glucose level:', std_glucose)

Matplotlib:
Matplotlib is a versatile plotting library in Python used for creating static, interactive, and animated
visualizations. It provides a wide variety of plots, including line plots, bar plots, scatter plots,
histograms, and more. Matplotlib enables us to visualize data distributions, trends, and relationships,
facilitating data exploration and interpretation.
Example: Suppose we want to visualize the age distribution of individuals in the diabetes dataset
using a histogram:
import matplotlib.pyplot as plt
# Plot a histogram of age
plt.hist(df['age'], bins=30, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

Seaborn:
Seaborn is a high-level statistical plotting library built on top of Matplotlib. It provides a more
concise and aesthetically pleasing interface for creating complex statistical visualizations. Seaborn
offers specialized functions for visualizing distributions, relationships between variables, and
categorical data, enhancing the effectiveness of data analysis and communication.
Example: Suppose we want to create a scatter plot to visualize the relationship between age and blood
glucose level, with different colors representing diabetes status:
import seaborn as sns
# Create a scatter plot
sns.scatterplot(x='age', y='blood_glucose_level', hue='diabetes', data=df)
plt.title('Relationship between Age and Blood Glucose Level')
plt.xlabel('Age')
plt.ylabel('Blood Glucose Level')
plt.show()

3.2 Hardware and Software Requirements
Hardware Requirements:
• Personal Computer: A standard personal computer or laptop capable of running Python and
supporting data analysis tasks.
• Sufficient RAM: Adequate RAM (Random Access Memory) to handle data processing and
analysis tasks efficiently, especially when working with large datasets.
• Storage Capacity: Sufficient storage capacity to store datasets, Python scripts, and project
files.
Software Requirements:
• Python: Python is the primary programming language used for this project. Ensure Python is
installed on your system, preferably the latest version.
• Integrated Development Environment (IDE): An IDE such as Jupyter Notebook, Spyder,
or PyCharm for writing and executing Python code. Jupyter Notebook is recommended for its
interactive and exploratory nature, ideal for data analysis projects.
• Python Libraries: Install necessary Python libraries, including Pandas, NumPy, Matplotlib,
and Seaborn, using a package manager such as pip or conda. These libraries are essential for
data manipulation, analysis, and visualization tasks.
• Dataset: Obtain the diabetes dataset in CSV format or any other compatible format for
analysis. Ensure the dataset is accessible from the Python environment.
3.3 Tools Used:
• Jupyter Notebook:
Jupyter Notebook is an open-source web application that allows you to create and share
documents containing live code, equations, visualizations, and narrative text. It provides an
interactive computing environment suitable for data exploration, analysis, and visualization.
Jupyter Notebook is an excellent tool for conducting exploratory data analysis (EDA) and
documenting the project workflow.
• Python:
Python is the primary programming language used for this project. Python's simplicity,
versatility, and rich ecosystem of libraries make it well-suited for data analysis, machine
learning, and scientific computing tasks. We will leverage Python's libraries such as Pandas,
NumPy, Matplotlib, Seaborn, and Scikit-learn for data manipulation, analysis, visualization,
and model building.

CHAPTER – 4
DATA CLEANING AND WRANGLING MECHANISMS
Data cleaning and wrangling are essential processes in preparing a dataset for analysis. In data
cleaning, we identify and address issues such as missing values, duplicates, inconsistencies, and
formatting errors. One common task is handling missing values, where we may choose to impute
them with statistical measures like mean or median, or remove them altogether if appropriate.
Additionally, we address duplicates by identifying and removing redundant rows, ensuring data
integrity and accuracy. Inconsistencies and formatting errors are resolved by standardizing
categorical variables, removing leading or trailing spaces, and correcting typographical errors.
Feature engineering involves creating new features or transforming existing ones to extract more
meaningful information from the dataset. This may include extracting information from text data,
converting data types, or normalizing numerical features. Handling outliers is crucial for ensuring
data quality, where we detect and remove outliers using statistical methods to prevent them from
skewing analysis results.
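The report does not say which statistical method was used for outliers, so the following is only one common option, sketched under that assumption: the interquartile range (IQR) rule, which flags values more than 1.5 × IQR outside the quartiles. The BMI values are hypothetical, with one implausible reading planted on purpose.

```python
import pandas as pd

# Hypothetical BMI values; 95.7 is planted as an obvious outlier.
df = pd.DataFrame({"bmi": [22.1, 24.8, 27.3, 25.2, 23.9, 26.4, 95.7, 28.0]})

# Quartiles and interquartile range of the column
q1 = df["bmi"].quantile(0.25)
q3 = df["bmi"].quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["bmi"] < lower) | (df["bmi"] > upper)

df_clean = df[~is_outlier]
print(df[is_outlier])                    # rows flagged as outliers
print(len(df), "->", len(df_clean))      # row count before and after
```

Whether flagged rows should be dropped or kept is a domain decision; in clinical data an extreme value can be a genuine measurement rather than an error.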
4.1 Handling Missing Values
Handling missing values is a crucial aspect of data cleaning and wrangling, ensuring that the dataset
is suitable for analysis. There are several approaches to dealing with missing values, depending on
the nature of the data and the specific requirements of the analysis:
• Identify missing values in the dataset using functions like isnull() or info().
• Decide on a strategy for handling missing values, such as imputation (replacing missing
values with a statistical measure like mean or median), deletion of rows or columns with
missing values, or using domain knowledge to fill in missing values.
• Implement the chosen strategy using Pandas functions like fillna(), dropna(), or custom
functions.
# Import Pandas library
import pandas as pd
# Read dataset
df = pd.read_csv('diabetes.csv')
# Check for missing values in each column
print(df.isnull().sum())

Figure 4.1: Dataset Null Values
• The dataset contains no null values.
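Because the project dataset has no null values, the imputation and deletion strategies listed above are not exercised on it. The sketch below shows them on a hypothetical toy frame; the values and the choice of median and mode as fill values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the project dataset reported none.
df = pd.DataFrame({
    "age": [45.0, np.nan, 61.0, 38.0],
    "bmi": [27.3, 31.0, np.nan, 22.4],
    "smoking_history": ["never", None, "current", "never"],
})

print(df.isnull().sum())  # number of missing values per column

# Strategy 1: impute numeric columns with the median and the
# categorical column with its most frequent value (mode).
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["bmi"] = imputed["bmi"].fillna(imputed["bmi"].median())
imputed["smoking_history"] = imputed["smoking_history"].fillna(
    imputed["smoking_history"].mode()[0]
)

# Strategy 2: drop every row that has at least one missing value.
dropped = df.dropna()

print(imputed)
print(len(df), "->", len(dropped))
```

Imputation keeps all rows at the cost of introducing estimated values, while deletion keeps only observed values at the cost of sample size.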
4.2 Handling Duplicates
Handling duplicates is another critical step in data cleaning to ensure data integrity and accuracy.
Duplicate rows can skew analysis results and lead to incorrect conclusions. Here are common
approaches to handling duplicates:
Identifying Duplicates:
Before handling duplicates, it's essential to identify them. Duplicate rows are identical or nearly
identical rows in the dataset.
# Identify duplicate rows
duplicate_rows_data = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_data.shape[0])
This step revealed the existence of duplicate records in the dataset.
Dropping Duplicates:
By removing duplicates, the dataset is streamlined, reducing redundancy and enhancing the efficiency
of subsequent analyses.
# Drop duplicates
df = df.drop_duplicates()
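A small self-contained run makes the effect of these two calls visible. The records below are hypothetical; note that duplicated() marks only the repeated copies, not the first occurrence, and matches exact duplicates only.

```python
import pandas as pd

# Hypothetical records; rows 0 and 2 are exact copies of each other.
df = pd.DataFrame({
    "age": [54, 36, 54, 61],
    "bmi": [27.3, 23.5, 27.3, 30.1],
    "diabetes": [0, 0, 0, 1],
})

# duplicated() flags the second and later copies of a row
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset the index afterwards
df_unique = df.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_duplicates)
print("rows before/after:", len(df), len(df_unique))
```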

4.3 Feature Engineering

Feature engineering involves creating new features or transforming existing ones to extract more
meaningful information from the dataset. This may include extracting information from text data,
converting data types, or normalizing numerical features.
Adding new feature
df['bmi_category'] = pd.cut(df['bmi'], bins=[0, 18.5, 24.9, 29.9, 100],
                            labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

4.4 Data Type Conversion
Data type conversion is a fundamental aspect of data preprocessing, allowing us to ensure that each
feature in the dataset is represented in a suitable format for analysis and modeling. In our dataset, the
‘age’ column underwent conversion to a numeric type.
Age column data type conversion
# Convert 'age' column to integer
df['age'] = df['age'].astype(int)

4.5 Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in understanding the underlying patterns, trends,
and relationships within a dataset. It involves using descriptive statistics, visualization techniques,
and hypothesis testing to gain insights and inform subsequent analysis or modeling tasks. Here's an
overview of the steps involved in exploratory data analysis:
Summary Statistics:
Calculate summary statistics such as mean, median, mode, standard deviation, and quartiles to
understand the central tendency, dispersion, and shape of the data distribution.
# Calculate summary statistics
df.describe()
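For numeric columns, describe() reports the count, mean, standard deviation, minimum, quartiles, and maximum; the median appears as the 50% row, and the mode is not included. A minimal sketch on hypothetical glucose readings shows how the remaining statistics can be obtained:

```python
import pandas as pd

# Hypothetical glucose readings, for illustration only.
glucose = pd.Series([80, 140, 140, 155, 200], name="blood_glucose_level")

summary = glucose.describe()      # count, mean, std, min, 25%, 50%, 75%, max
median = glucose.median()         # same value as summary["50%"]
mode = glucose.mode().tolist()    # most frequent value(s); not part of describe()

print(summary)
print("median:", median, "mode:", mode)
```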

Visualizing Diabetes Ratio in Dataset:
plt.pie(df['diabetes'].value_counts().values, labels=df['diabetes'].value_counts().index,
        autopct='%1.1f%%')
plt.title('Diabetes Ratio in Dataset')
plt.show()

CHAPTER – 5
DATA ANALYSIS AND VISUALIZATION
Data analysis and visualization are critical components of the data science process, allowing us to
explore, understand, and communicate insights from the dataset. Here's a comprehensive approach to
data analysis and visualization using Python and relevant libraries such as Pandas, Matplotlib, and
Seaborn.
5.1 Detailed Data Analysis

Data analysis involves a systematic examination of the dataset to derive meaningful insights and
extract valuable information.
Data frame Info
# Data frame info of the dataset
df.info()
Figure 5.1: Dataframe Info
Detailed Summary Statistics
Calculate summary statistics such as mean, median, mode, standard deviation, and quartiles to
understand the central tendency, dispersion, and shape of the data distribution.
# Detailed summary statistics, formatted to two decimal places
df.describe().style.format("{:.2f}")

Figure 5.2: Calculating Mean, Count, Min, Max and Standard Deviation
5.2 Patient Detailed Analysis
Patient detailed analysis involves a thorough examination of individual patient data within a dataset
to gain insights into their characteristics, health status, and outcomes. This analysis is crucial for
healthcare professionals, researchers, and policymakers to understand patient populations, identify
trends, and inform personalized healthcare interventions.
Male and Female Ratio
The male and female ratio in datasets refers to the proportion of male and female individuals within a
given dataset. This ratio is essential for understanding the gender distribution and demographic
composition of the dataset, which can have implications for various analyses and interpretations. To
calculate the male and female ratio, we typically count the number of male and female individuals
separately and then express their ratio as a percentage or fraction.
plt.pie(df['gender'].value_counts().values, labels=df['gender'].value_counts().index,
        autopct='%1.1f%%')
plt.title('Male-Female Ratio in Dataset')
plt.show()
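The counting step behind this ratio can also be checked numerically without a plot. The sketch below uses a hypothetical five-patient gender column; value_counts() gives the counts and normalize=True turns them into fractions.

```python
import pandas as pd

# Hypothetical gender column, for illustration only.
gender = pd.Series(["Female", "Male", "Female", "Female", "Male"], name="gender")

counts = gender.value_counts()                         # individuals per gender
percentages = gender.value_counts(normalize=True) * 100  # share in percent

print(counts)
print(percentages.round(1))
```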

Patient Analysis with Respect to Age
import numpy as np
import matplotlib.pyplot as plt

# Sample data (replace this with your actual data)
ages = [25, 35, 45, 55, 65, 75, 85, 95]
diabetes_status = ['Diabetic', 'Non-Diabetic']
diabetes_count = {
    'Diabetic': [5, 10, 20, 25, 30, 15, 5, 1],
    'Non-Diabetic': [50, 40, 35, 30, 25, 20, 15, 10]
}
# Plot a grouped bar chart: one pair of bars per age
bar_width = 0.35
opacity = 0.8
index = np.arange(len(ages))
plt.bar(index, diabetes_count['Diabetic'], bar_width, alpha=opacity, color='b', label='Diabetic')
plt.bar(index + bar_width, diabetes_count['Non-Diabetic'], bar_width, alpha=opacity, color='g',
        label='Non-Diabetic')
plt.xlabel('Age')
plt.ylabel('Number of People')
plt.title('Diabetes Status by Age')
# Center each tick label between the two bars of its group
plt.xticks(index + bar_width / 2, ages)
plt.legend()
plt.show()

Figure 5.3: Patient Analysis with Respect to Age
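The listing above plots hand-entered sample counts. As a hedged sketch of how such counts could instead be derived from the dataset, the example below bins a hypothetical 'age' column with pd.cut() and cross-tabulates it against 'diabetes'; the bin edges, labels, and data are assumptions.

```python
import pandas as pd

# Hypothetical patient records, for illustration only.
df = pd.DataFrame({
    "age": [23, 37, 45, 52, 58, 66, 71, 79],
    "diabetes": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Assumed 20-year age bands; right=False makes each band [start, end)
bins = [20, 40, 60, 80]
labels = ["20-39", "40-59", "60-79"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Rows: age group, columns: diabetes status (0 or 1), cells: patient counts
counts = pd.crosstab(df["age_group"], df["diabetes"])
print(counts)
```

Calling counts.plot(kind='bar') on the resulting table would draw a grouped bar chart comparable to Figure 5.3.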
5.3 Data Visualization
Data visualization is an essential aspect of data analysis, as it facilitates the communication of

insights and findings in a clear and accessible manner. Visualizations such as bar plots, pair plots,
count plots, and heatmaps are used to present key findings, trends, and comparisons visually,
enhancing understanding and interpretation.
Count Plot of Gender vs Diabetes
A count plot of gender vs. diabetes status provides a visual representation of how diabetic and
non-diabetic patients are distributed across genders within the patient population. This type of
analysis is essential for understanding how diabetes prevalence differs across genders and can reveal
potential disparities or patterns. The count plot displays the frequency of each diabetes status
(diabetic or non-diabetic) for each gender separately, allowing for a comparative analysis.
Upon generating the count plot, we can observe the number of male and female patients falling into
each diabetes status category.

sns.countplot(x='gender', hue='diabetes', data=df)
plt.title('Gender vs Diabetes')
plt.show()
Figure 5.4: Gender vs Diabetes
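The numbers behind such a count plot can be tabulated directly. The following sketch, on hypothetical data, builds the gender-by-status table and the share of diabetic patients per gender, which is easier to compare than raw counts when the groups differ in size.

```python
import pandas as pd

# Hypothetical records, for illustration only.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "diabetes": [0, 1, 1, 0, 0, 1],
})

# Counts per gender and diabetes status (what the count plot draws)
table = pd.crosstab(df["gender"], df["diabetes"])

# Mean of a 0/1 column is the fraction of diabetic patients per gender
rate = df.groupby("gender")["diabetes"].mean()

print(table)
print(rate.round(2))
```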
Smoking History of Patient
Figure 5.5: Smoking History of Patient

Figure 5.6: Smoking History Distribution
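The code behind Figures 5.5 and 5.6 is not included in the report. As an assumption-laden sketch of how a smoking history distribution might be computed, the example below counts a hypothetical 'smoking_history' column and relates it to diabetes status; the column name and category labels are illustrative and mirror the categories described in Section 2.1.

```python
import pandas as pd

# Hypothetical records; column name and labels are assumptions.
df = pd.DataFrame({
    "smoking_history": ["never", "current", "former", "never", "No Info", "never"],
    "diabetes": [0, 1, 1, 0, 0, 1],
})

# Number of patients in each smoking category
distribution = df["smoking_history"].value_counts()

# Fraction of diabetic patients within each smoking category
diabetic_share = df.groupby("smoking_history")["diabetes"].mean()

print(distribution)
print(diabetic_share.round(2))
```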
5.4 Advanced Visualization Techniques
In addition to the previously discussed visualizations, incorporating advanced techniques such as
heatmaps and pair plots can enhance the presentation of insights.
Heatmap of Diabetes Correlation
Creating a heatmap of diabetes correlation involves visualizing the correlation matrix of
diabetes-related variables to understand the relationships between different aspects of diabetes.
This analysis helps identify which variables are strongly correlated with each other and which ones
may have weaker or no correlations.
numeric_columns = df.select_dtypes(include=['number'])
correlation_matrix = numeric_columns.corr()
# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()

Figure 5.7: Correlation Heatmap
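To read a correlation matrix with respect to the target, it can help to extract the 'diabetes' column and sort it. The sketch below does this on hypothetical values, so the resulting coefficients say nothing about the real dataset; only the column names follow the report.

```python
import pandas as pd

# Hypothetical records, for illustration only.
df = pd.DataFrame({
    "age": [25, 40, 55, 60, 70, 35],
    "blood_glucose_level": [85, 100, 160, 200, 220, 90],
    "bmi": [22.0, 31.5, 27.0, 24.5, 29.0, 26.0],
    "diabetes": [0, 0, 1, 1, 1, 0],
})

# Pearson correlation between all numeric columns
correlation_matrix = df.select_dtypes(include=["number"]).corr()

# Correlation of each feature with the target, strongest positive first
target_corr = (
    correlation_matrix["diabetes"]
    .drop("diabetes")
    .sort_values(ascending=False)
)
print(target_corr)
```

Correlation with a binary target is only a first screening step: it captures linear association and does not imply that a feature causes diabetes.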
SNS Pair Plot
# Pair plot for numeric features
sns.pairplot(df, hue='diabetes')
plt.show()
Figure 5.8: Visualization

CHAPTER – 6
CONCLUSION AND FUTURE ENHANCEMENT
In conclusion, this diabetes prediction analysis project, have provided valuable insights into
understanding and predicting diabetes based on detailed data analysis and statistical techniques. Key
findings from the project include:
• Data Analysis: Thorough data analysis revealed important patterns, correlations, and trends
within the diabetes dataset. Descriptive statistics, correlation analysis, and visualizations such
as heatmaps and scatter plots contributed to a comprehensive understanding of the dataset's
characteristics.
• Predictive Modeling: Utilizing machine learning algorithms, a diabetes prediction model
was developed to predict the likelihood of diabetes based on various patient factors. The
model demonstrated promising accuracy and performance metrics, making it a valuable tool
for healthcare professionals in assessing diabetes risk.
• Health Insights: The project shed light on significant factors influencing diabetes risk,
including age, blood glucose levels, HbA1c levels, and hypertension status. These insights
can inform preventive strategies, personalized interventions, and patient management plans.
• Gender Analysis: An analysis of gender-specific trends in diabetes prevalence and
characteristics highlighted potential gender disparities and the need for gender-sensitive
healthcare approaches.
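As an illustration of how a gender-specific prevalence comparison of this kind might be computed, the following minimal sketch groups patients by gender and reports the share with diabetes. The helper name prevalence_by_group and the toy values are hypothetical, and gender is an assumed column name; the figures below are not results from the project dataset.

```python
import pandas as pd


def prevalence_by_group(df, group_col="gender", target="diabetes"):
    """Return patient count, positive cases, and prevalence (%) per group."""
    summary = df.groupby(group_col)[target].agg(patients="count", cases="sum")
    summary["prevalence_pct"] = (100 * summary["cases"] / summary["patients"]).round(2)
    return summary


# Toy data for illustration only (not the project dataset)
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "diabetes": [0, 0, 0, 1, 0, 1, 1, 0],
})

summary = prevalence_by_group(toy)
print(summary)
```

Reporting prevalence as a percentage alongside the raw counts guards against misreading differences that come only from unequal group sizes.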
Future Enhancements:
Moving forward, several enhancements and avenues for future research and development can be
explored:
• Data Expansion: Incorporating additional datasets with a larger sample size and more
diverse patient demographics can enhance the robustness and generalizability of the diabetes
prediction model.
• Feature Engineering: Further refining feature engineering techniques and incorporating
domain knowledge to create new relevant features can improve the predictive power of the
model.
• Deep Learning Models: Exploring the use of deep learning models such as neural networks
for diabetes prediction can leverage complex patterns and relationships in the data for
enhanced accuracy.
• Real-time Monitoring: Integrating real-time data streams and wearable technology for
continuous monitoring of patient health parameters can enable proactive diabetes
management and early intervention strategies.
• Interpretability: Enhancing model interpretability through techniques such as feature
importance analysis, SHAP values, and model explainability tools can facilitate trust and
understanding among healthcare providers and patients.
• Clinical Validation: Conducting rigorous clinical validation studies can establish the
real-world effectiveness and reliability of the diabetes prediction model in clinical settings.
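The interpretability enhancement listed above can be sketched with scikit-learn's permutation importance, which is one form of feature importance analysis and is used here in place of SHAP values. This is a hedged illustration, not the model built in this project: the random forest, the synthetic data, and the rule that generates the label are all assumptions made for the example, and blood_glucose_level is an assumed column name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (not the project dataset)
rng = np.random.default_rng(42)
n = 600
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "bmi": rng.normal(27, 5, size=n),
    "blood_glucose_level": rng.normal(140, 40, size=n),
})
# Assumed labeling rule: the synthetic label depends only on blood glucose
toy["diabetes"] = (toy["blood_glucose_level"] > 160).astype(int)

X = toy.drop(columns="diabetes")
y = toy["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the drop in accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

Because the synthetic label is derived from blood glucose alone, that feature should dominate the ranking. On real patient data the ranking would have to be interpreted together with the correlations among features.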
Overall, this project lays the foundation for ongoing research, innovation, and advancements in
diabetes prediction, personalized healthcare, and data-driven decision-making in the healthcare
domain.

