Project 4
CHAPTER 1
1. Introduction
The purpose of this chapter is to provide an overview of diabetes and its effects on the
nation (Ghana) and the world at large.
In addition, the chapter sets out the objectives, methodology, justification, scope
and limitations of the study.
including the development of new treatments, improved diagnostic tools, and
enhanced management strategies. The research conducted in Ghana and around the
world has provided valuable evidence on the burden of diabetes, its risk factors, and
effective prevention and management approaches. However, there is still much
work to be done. Continued research efforts, innovative interventions, and strong
policy support are essential to address the growing diabetes epidemic in Ghana and
globally.
Diabetes is a major metabolic disorder that can adversely affect the entire
body. Undiagnosed diabetes can increase the risk of cardiac stroke,
diabetic nephropathy and other disorders. Millions of people all over the world
are affected by this disease, and early detection is very important for
maintaining a healthy life. The disease is a cause of global concern because the
number of cases is rising rapidly. Machine learning (ML) is a computational method
that learns automatically from experience and improves its performance to make
more accurate predictions. In the current research we applied machine
learning techniques to the Pima Indian diabetes dataset to identify trends and
detect patterns in risk factors using the R data manipulation tool. To classify the
patients as diabetic or non-diabetic, we developed and analyzed five
different predictive models in R. For this purpose we
used supervised machine learning algorithms, namely linear kernel support
vector machine (SVM-linear), radial basis function (RBF) kernel support vector
machine, k-nearest neighbour (k-NN), artificial neural network (ANN) and
multifactor dimensionality reduction (MDR).
diabetes has been increasing in Ghana, with approximately 4.1 million people
estimated to be living with diabetes in 2018. This alarming growth has created a
significant burden on Ghana’s healthcare system, as the costs associated with
diagnosis, treatment, and management of diabetes can be prohibitively expensive.
Furthermore, the inadequate availability of specialized healthcare providers,
including endocrinologists and diabetes educators, has exacerbated the issue.
This statement of research problems seeks to identify the key issues related to
diabetes in Ghana and explore the associated global trends. Data will be collected
from various sources, including government health reports, surveys, and
interviews with healthcare professionals. This data will be used to analyze the
rate of diabetes prevalence, the accessibility of diabetes services, and the cost of
diabetes care in Ghana. Additionally, the research will explore the global trends
in diabetes in order to identify best practices that can be applied in Ghana to
improve the management of diabetes. Ultimately, this research will provide
valuable insights into the challenges posed by diabetes in Ghana and how to
address them.
1.3 Objective
1.4 METHODOLOGY
In this study, we will use logistic regression analysis to model our data.
Logistic regression will be used to classify individuals into two groups, those
with diabetes (1) and those without diabetes (0), and to explain the
relationship between diabetes and the independent variables: Age,
Body Mass Index (BMI), Insulin, Diabetes Pedigree Function, Skin Thickness,
Blood Pressure, Pregnancies and Glucose.
The analysis of this study will be conducted using R Statistical Software. Data
will be gathered from the Internet, libraries, personal notes, lecture notes and
other relevant sources such as the World Health Organization (WHO). All of
these sources will provide valuable insight into the research topic, allowing us to
draw meaningful conclusions.
1.5 JUSTIFICATION
The success of this study will provide valuable insight into the factors that
contribute to Diabetes in Ghana, and beyond. With this knowledge, we can work
towards reducing the prevalence of Diabetes, not just in Ghana, but around the
world. With a greater understanding of the causes of this condition, everyone can
take steps to protect their health and reduce the number of people who suffer
from Diabetes.
1.7 Thesis Organization
Our research study has five chapters. Chapter one deals with the
background of the study, problem statement, objectives of the study, the
methodology, justification, limitations of the study and the organization of the
study. Chapter two reviews the literature related to the study. Chapter three
focuses on the methodology of the study. Topics discussed include the analytical
framework, data source, sample and sampling procedure, logistic regression,
generalized linear model and binary logistic regression, estimating the single
regression model, estimation techniques, marginal effect, definition and
measurement of variables, and data analysis procedure. Chapter four focuses on
data collection and the research findings. Chapter five
presents the summary, conclusions from the findings and recommendations from the
study.
CHAPTER 2
Literature Review
2.1 Introduction
Diabetes is a chronic metabolic disorder characterized by high blood glucose levels due
to insufficient insulin production or impaired insulin function. With the prevalence of
diabetes escalating worldwide, there has been an increasing interest in analyzing
diabetes data to gain insights into the disease's etiology, risk factors, management, and
potential treatments. This literature review aims to explore various analyses conducted
on diabetes data, highlighting the methodologies, findings, and contributions of each
study.
2.2.1 Diabetes
Diabetes mellitus refers to a group of diseases that affect how the body uses blood sugar
(glucose). Glucose is an important source of energy for the cells that make up the muscles
and tissues. It's also the brain's main source of fuel.
The main cause of diabetes varies by type. But no matter what type of diabetes you have, it
can lead to excess sugar in the blood. Too much sugar in the blood can lead to serious
health problems.
Chronic diabetes conditions include type 1 diabetes and type 2 diabetes. Potentially
reversible diabetes conditions include prediabetes and gestational diabetes. Prediabetes
happens when blood sugar levels are higher than normal. But the blood sugar levels aren't
high enough to be called diabetes. And prediabetes can lead to diabetes unless steps are
taken to prevent it. Gestational diabetes happens during pregnancy. But it may go away after
the baby is born.
Presence of ketones in the urine. Ketones are a byproduct of the
breakdown of muscle and fat that happens when there's not enough
available insulin.
Feeling tired and weak.
Feeling irritable or having other mood changes.
Having blurry vision.
Having slow-healing sores.
Getting a lot of infections, such as gum, skin and vaginal infections.
Type 1 diabetes can start at any age. But it often starts during childhood or teen years. Type
2 diabetes, the more common type, can develop at any age. Type 2 diabetes is more
2.3 Genomics and Genetics
Genetic factors also play a role in diabetes susceptibility. Genome-wide association
studies (GWAS) have been performed to identify genetic loci associated with diabetes.
Mahajan et al. (2018) conducted a meta-analysis of GWAS data, discovering novel
genetic variants linked to type 2 diabetes. These findings have furthered our
understanding of the disease's genetic architecture and potential therapeutic targets. [3]
2.8 Social Determinants of Diabetes
Some studies have investigated the impact of social determinants on diabetes
prevalence and outcomes. A review by Berkowitz et al. (2018) examined how factors
like food insecurity, poverty, and access to healthcare services affected diabetes
management in vulnerable populations. Their findings highlighted the need for
addressing social determinants to improve diabetes care and outcomes. [7]
The accuracy of the logistic regression model is typically assessed using various
performance metrics, including:
1. Accuracy: The proportion of correct predictions (both true positives and true
negatives) over the total number of predictions. It provides an overall measure of the
model's correctness in classifying individuals.
2. Sensitivity (True Positive Rate or Recall): The proportion of correctly identified
diabetic individuals (true positives) over the total number of actual diabetics. It
indicates the model's ability to correctly identify positive cases.
3. Specificity (True Negative Rate): The proportion of correctly identified non-diabetic
individuals (true negatives) over the total number of actual non-diabetics. It measures
the model's ability to correctly identify negative cases.
4. Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The AUC-
ROC represents the overall performance of the model across different classification
thresholds. It ranges from 0 to 1, with higher values indicating better model
performance.
5. Precision: The proportion of true positive predictions (diabetic individuals correctly
identified) over the total number of positive predictions (both true positives and false
positives). It provides a measure of the model's precision in identifying diabetic cases.
The higher the accuracy, sensitivity, specificity, and AUC-ROC, the better the logistic
regression model is at accurately predicting the presence or absence of diabetes in the
dataset.
It's important to note that the model's accuracy may vary depending on the quality and
size of the dataset, the choice of predictor variables, and the preprocessing steps
performed on the data. Cross-validation techniques and hold-out validation are
commonly used to assess the generalization performance of the logistic regression
model on unseen data.
When analyzing diabetes data using logistic regression, researchers often report these
metrics along with p-values and confidence intervals for the regression coefficients to
determine the significance and strength of the associations between the predictor
variables and the risk of diabetes. This allows for a comprehensive evaluation of the
logistic regression model's accuracy and performance in predicting diabetes outcomes.
variations in diabetes prevalence among countries, influenced by various factors such as
socio-economic status, urbanization, and lifestyle choices.
Genomics and genetics studies have contributed to a better understanding of diabetes
susceptibility by identifying genetic loci associated with the disease. Through genome-
wide association studies (GWAS), researchers have discovered novel genetic variants
linked to type 2 diabetes, providing valuable insights into potential therapeutic targets.
The review also highlights the importance of analyzing treatment outcomes and
intervention studies to assess the efficacy of anti-diabetic medications and lifestyle
modifications. Understanding treatment responses helps optimize diabetes management
strategies.
Diabetes is associated with various complications and comorbidities, including mental
health issues. Studies have shown a bidirectional association between diabetes and
depression, emphasizing the need for addressing mental health in diabetes management.
The advent of big data has revolutionized diabetes research, enabling real-time data
analysis and predictive modeling. Researchers have demonstrated how big data
analytics can identify early signs of diabetic retinopathy and improve patient care
through timely interventions.
The impact of social determinants on diabetes prevalence and outcomes has also been
studied. Factors like food insecurity, poverty, and access to healthcare services
significantly influence diabetes management, necessitating interventions to address
these social determinants.
The general model accuracy of the logistic regression model was then given and
explained. The specific model accuracies of the cited research works were not given
and it is essential for any interested person to refer to the individual studies for detailed
information on their respective model performance and statistical measures.
In conclusion, the literature review highlights the diverse approaches and
methodologies used to analyze diabetes data, offering valuable insights into disease
patterns, risk factors, treatment outcomes, and potential interventions. These studies
collectively contribute to better diabetes prevention, management, and care strategies,
with the potential to positively impact public health.
CHAPTER 3
Methodology
3.1 Introduction
This chapter describes the methods, data and analytical procedures employed to
attain the objectives of the study. It covers the analytical
framework, data source and acquisition, sampling and sample size, binary logistic
regression, estimation techniques, and the definition and measurement of variables.
3.2 Data Source and Acquisition
To obtain data for our study, we used secondary data collection. Secondary data are
data that have already been collected in the past through primary sources and made
readily available for researchers to use in their own research. Sources of secondary
data include books, personal sources, journals, newspapers, websites and government
records. The analysis is based on data from the National Institute of Diabetes and
Digestive and Kidney Diseases. The objective is to predict, based on diagnostic
measurements, whether a patient has diabetes. The data provide information on the
variables Pregnancies, Glucose, Blood Pressure (BP), Skin Thickness, Insulin, Body
Mass Index (BMI), Diabetes Pedigree Function, Age and Outcome, and are available at
https://www.kaggle.com/datasets/mathchi/diabetes-data-set
The dataset contains 768 rows and 9 columns. The column labels are listed below.
[1] "Pregnancies"
[2] "Glucose"
[3] "BloodPressure"
[4] "SkinThickness"
[5] "Insulin"
[6] "BMI"
[7] "DiabetesPedigreeFunction"
[8] "Age"
[9] "Outcome"
Eight variables are taken as predictors in the dataset. The variable Outcome is the
response, stating whether or not a person has diabetes, with the value
0 for No and 1 for Yes. Number of attributes: 8 plus class.
DEPENDENT      Probability of the individual having diabetes or not      0-normal, 1-high
INDEPENDENT
Age            Age                                                       Years
Variable Description
pregnant : Number of times pregnant
glucose : Plasma glucose concentration (glucose tolerance test)
pressure : Diastolic blood pressure (mm Hg)
triceps : Triceps skin fold thickness (mm)
insulin : 2-hour serum insulin (mu U/ml)
mass : Body mass index (weight in kg/(height in m)^2)
pedigree : Diabetes pedigree function
age : Age (years)
diabetes : Test for diabetes
We then check whether any observations have missing values using colSums(is.na()).
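The missing-value check can be sketched as follows. This is a minimal illustration only: the data frame `diabetes_df` and its values are made up for the example and are not the study data.

```r
# Hypothetical toy data frame standing in for the diabetes data (values are made up).
diabetes_df <- data.frame(
  Glucose       = c(148, 85, NA, 89),
  BloodPressure = c(72, 66, 64, NA),
  BMI           = c(33.6, 26.6, 23.3, 28.1)
)

# is.na() marks each missing cell as TRUE; colSums() counts the TRUE values per column.
missing_per_column <- colSums(is.na(diabetes_df))
print(missing_per_column)

# TRUE if at least one value is missing anywhere in the data frame.
any_missing <- anyNA(diabetes_df)
print(any_missing)
```

A column with a count of zero has no missing entries coded as NA. Values recorded as 0 are not detected by this check.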
The logistic regression model is based on the logistic function (the inverse of the
logit), which maps the linear predictor onto the interval (0, 1). This allows us to
interpret the output as the probability of the event occurring.
In logistic regression, the dependent variable is binary, meaning it can take only two
values, such as "yes" or "no," "success" or "failure." The independent variables can be
continuous or categorical. The goal is to estimate the coefficients of the independent
variables that maximize the likelihood of the observed data.
The logistic regression model assumes that the relationship between the independent
variables and the log-odds of the dependent variable is linear. However, this linearity
assumption can be relaxed by including higher-order terms or interaction terms in the
model.
Once the logistic regression model is fitted, it can be used to make predictions on new
data. The predicted probabilities can be converted into binary outcomes using a
specified cutoff value, such as 0.5. However, the choice of the cutoff value depends on
the specific application and the trade-off between false positives and false negatives.
There are several evaluation metrics that can be used to assess the performance of a
logistic regression model, such as accuracy, precision, recall, and F1 score. These
metrics provide insights into how well the model is able to classify the binary outcome.
In conclusion, logistic regression is a widely used statistical model for predicting binary
outcomes. It provides a flexible framework for modeling the relationship between
independent variables and the probability of the event occurring. By estimating the
coefficients using maximum likelihood estimation, the logistic regression model can
make predictions and evaluate its performance using various metrics.
Some of the instances in which binary logistic regression can be used are:
1. Modelling the probability that a patient is diabetic given some factors.
2. Modelling the factors that determine whether or not a student smokes, drinks, or
takes a particular elective course.
3. Determining the risk factors of accident severity.
4. Establishing the risk factors of marital dissolution, or determining the probability
that a couple will get divorced.
Logistic regression is most appropriate for categorical and binary outcomes because:
1. The response variable Yi takes only the values 0 and 1, and logistic regression
ensures that predicted values lie between 0 and 1.
2. The errors are heteroskedastic.
3. The error terms are not normally distributed.
4. Logistic regression does not need a linear relationship between the predictor and
response variables.
3.6.1 Binary Model
In the simplest case of one predictor X and one binary or dichotomous outcome variable
Y, the logistic regression model predicts the logit of Y from X. Diabetes status (y) is
coded as y = 1 (diabetic) and y = 0 (not diabetic). The method models the log odds of y
using the logistic function. Denote P(y = 1) by p(y), the probability that y = 1.
Logistic regression (LR) is one of the most important predictive models in
classification. To put it simply, logistic regression can be used to model the probability
of diabetes. The key concept of logistic regression is the logit, the natural logarithm of
the odds.
For this dichotomous classification task, we use R to load the
data, split it into training and test datasets, perform data visualization and model
training using the training dataset, and finally evaluate the model using the hold-out
dataset.
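The workflow described above (split, train, evaluate on the hold-out set) can be sketched in base R. This is an illustration under stated assumptions, not the study's actual script: the data are simulated, and the 70/30 split, the seed and the 0.5 cutoff are choices made for the example.

```r
set.seed(123)  # make the simulated data and the split reproducible

# Simulated stand-in for the diabetes data: one predictor and a binary outcome.
n <- 500
glucose <- rnorm(n, mean = 120, sd = 30)
outcome <- rbinom(n, size = 1, prob = plogis(-6 + 0.05 * glucose))
sim_data <- data.frame(glucose = glucose, outcome = outcome)

# Split the rows 70/30 into training and hold-out (test) sets.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- sim_data[train_idx, ]
test_set  <- sim_data[-train_idx, ]

# Fit the logistic regression on the training set only.
fit <- glm(outcome ~ glucose, data = train_set, family = binomial)

# Predicted probabilities on the hold-out set, classified with a 0.5 cutoff.
test_prob  <- predict(fit, newdata = test_set, type = "response")
test_class <- ifelse(test_prob > 0.5, 1, 0)

# Proportion of hold-out cases classified correctly.
test_accuracy <- mean(test_class == test_set$outcome)
print(test_accuracy)
```

Fitting on the training rows only and scoring on the hold-out rows keeps the accuracy estimate from being inflated by data the model has already seen.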
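The simple logistic model has the form:
$$\text{Odds}(y) = \frac{p(y)}{1 - p(y)}$$
$$\text{Logit}\big(p(y)\big) = \ln\left(\frac{p(y)}{1 - p(y)}\right) = \omega$$
$$p(y) = \frac{\exp(\omega)}{1 + \exp(\omega)}$$
With $k$ predictors, the model becomes
$$\ln\left(\frac{p(y)}{1 - p(y)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$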
Where;
$\beta_0$ is the model intercept,
$\beta_1, \dots, \beta_k$ are the coefficients of the predictor variables $X_1, \dots, X_k$, and
$y$ is the binary outcome variable.
The logistic regression model above models the
logarithm of the odds of the outcome variable as a linear combination of the predictor
variables. The model coefficients $\beta_0, \beta_1, \dots, \beta_k$ are estimated using
maximum likelihood estimation.
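For reference, the quantity maximised by maximum likelihood estimation can be written out as below. This is the standard textbook log-likelihood for binary logistic regression and is not taken from the source. The index $i$ over the $n$ observations and the symbols $y_i$, $p_i$ and $X_{ij}$ are notation introduced here for illustration.

```latex
\ell(\beta) = \sum_{i=1}^{n} \Big[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \Big],
\qquad
p_i = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \dots + \beta_k X_{ik})}
           {1 + \exp(\beta_0 + \beta_1 X_{i1} + \dots + \beta_k X_{ik})}
```

The estimates are the coefficient values that make $\ell(\beta)$ as large as possible. There is no closed-form solution, so software finds them iteratively.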
3.6.2 Estimation of Prevalence
Logistic regression can indeed be used to estimate prevalence indirectly. The estimated
prevalence can be calculated using the logistic regression equation and the proportion of
individuals with a predicted probability above a certain threshold.
Let's assume we have a logistic regression model with one independent variable,
denoted as X. The logistic regression equation can be written as:
logit(p) = β0 + β1*X
Where;
Logit (p) represents the log-odds of the event, p represents the probability of the event
(prevalence), and β0 and β1 are the coefficients estimated from the logistic regression
model.
To estimate the prevalence, we need to convert the log-odds back to the probability
scale. This can be done using the inverse of the logistic function, also known as the
sigmoid function:
p = 1 / (1 + exp(-logit(p)))
Now, suppose we have a threshold value p_threshold. We can estimate the prevalence
as the proportion of individuals in our dataset whose predicted probability (calculated
using the logistic regression equation) exceeds the threshold:
Estimated prevalence = (number of individuals with predicted probability > p_threshold) / (total number of individuals)
In summary, the logistic regression equation and the sigmoid function allow us to
estimate the probability (prevalence) of an event based on the coefficients obtained
from the logistic regression model. By setting a threshold, we can determine the
proportion of individuals above that threshold and estimate the prevalence accordingly.
Please note that the threshold value is a subjective choice and can impact the estimated
prevalence. Additionally, this approach assumes that the logistic regression model is
appropriately specified and valid for the data being analyzed.
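The threshold-based calculation described in this section can be sketched in R. The coefficients, predictor values and threshold below are hypothetical values chosen for illustration and are not estimates from the study.

```r
# Hypothetical coefficients for logit(p) = b0 + b1 * X (illustrative values only).
b0 <- -5.0
b1 <- 0.04

# Hypothetical predictor values for six individuals.
x <- c(90, 110, 120, 140, 160, 180)

# Convert the log-odds to probabilities with the sigmoid; plogis(z) = 1 / (1 + exp(-z)).
log_odds <- b0 + b1 * x
p_hat <- plogis(log_odds)

# Estimated prevalence: share of individuals whose predicted probability exceeds the threshold.
p_threshold <- 0.5
prevalence <- mean(p_hat > p_threshold)
print(prevalence)
```

Changing `p_threshold` changes the estimate, which illustrates why the choice of threshold needs to be reported alongside the result.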
3.8.1 Hypotheses Testing
All hypothesis tests and confidence intervals in this study use a 95%
confidence level. When the p-value $< \alpha = 0.05$, the null hypothesis is rejected.
Testing for the significance of individual coefficients is based on the following
hypotheses:
$H_0: \beta_i = 0$
$H_1: \beta_i \neq 0, \quad i = 1, 2, 3, \dots, k$
The p-value of this test can be found from the standard normal table and is then
compared with the level of significance, $\alpha = 0.05$.
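A minimal sketch of this test for a single coefficient, assuming the usual Wald z statistic (the estimate divided by its standard error). The estimate and standard error below are hypothetical and serve only to show the arithmetic.

```r
# Hypothetical estimate and standard error for one coefficient (illustrative values only).
beta_hat <- 0.035
se_beta  <- 0.010

# Wald statistic: approximately standard normal when the null hypothesis is true.
z_stat <- beta_hat / se_beta

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z_stat))

# Decision at the 5% significance level.
alpha <- 0.05
reject_h0 <- p_value < alpha
print(c(z = z_stat, p = p_value))
```

The call to `pnorm()` replaces the lookup in a standard normal table.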
3.9 Model Accuracy
The accuracy of a logistic regression model refers to the proportion of correct
classifications. To estimate the accuracy of the final model, predictions were
made using the test set and the responses rounded to the nearest binary digit.
The results of the prediction were finally summarized in a confusion matrix
and the accuracy of the model calculated as;
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Chapter 4
Data Analysis and Results
4.1 Introduction
This chapter presents the analysis and the results. It includes
descriptive and summary statistics, establishing relationships using odds ratios,
interpretation of relationships, estimation of prevalence, model fitting and
diagnostics.
4.2 Descriptive Statistics
The dataset used was the Pima Indian Diabetes dataset from the Machine Learning
Repository (originally from the National Institute of Diabetes and Digestive and Kidney
Diseases), which contains 8 medical diagnostic attributes and one target variable (i.e.,
Outcome) for 768 female patients, of whom 34.9% (268 patients) have diabetes. The
variance of insulin was quite high in both categories. The dataset is used to predict
whether a person with certain medical diagnostic attributes is likely to have diabetes
or not. It contains 768 rows and 9 columns. All analyses were carried out using
SPSS and R version 4.2.1. The characteristics of the patients
in the study are summarized below.
Variable            Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
Pedigree Function
Age                 21.00    24.00    29.00    33.24    41.00    81.00
Outcome             0.000    0.000    0.000    0.349    1.000    1.000
Figure: Boxplot of Pregnancies and Outcome
4.4.3 Blood Pressure and Outcome
The boxplots below give a visual insight into the association between Blood Pressure
and Outcome.
4.4.4 Skin Thickness and Outcome
The boxplots below give a visual insight into the association between Skin Thickness
and Outcome.
4.4.5 Insulin and Outcome
The boxplots below give a visual insight into the association between Insulin and
Outcome.
4.4.6 BMI and Outcome
The boxplots below give a visual insight into the association between BMI and Outcome.
4.4.6 Age and Outcome
The boxplots below give a visual insight into the association between Age and Outcome.
Frequencies and Percentage of Age
Contingency Table for Diabetes and Age
                   Diabetes Status
Age Group     No Diabetes   Diabetes   Total
Below 30      327           90         417
30-50         127           135        262
Above 50      46            43         89
Total         500           268        768
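One common way to examine a table like this is to compare the share of diabetic cases in each age group and to run a chi-squared test of independence. The sketch below uses the counts from the contingency table above. The chi-squared test is an addition for illustration and is not reported in the source.

```r
# Counts copied from the contingency table above.
age_table <- matrix(
  c(327,  90,
    127, 135,
     46,  43),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    AgeGroup = c("Below 30", "30-50", "Above 50"),
    Status   = c("No Diabetes", "Diabetes")
  )
)

# Row proportions: share of each age group without and with diabetes.
row_props <- prop.table(age_table, margin = 1)
print(round(row_props, 3))

# Pearson chi-squared test of independence between age group and diabetes status.
chi_res <- chisq.test(age_table)
print(chi_res)
```

A small p-value from `chisq.test()` would indicate that diabetes status is not independent of age group in this sample.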
Confusion Matrix
                Predicted
Actual      0      1      Total
0           443    57     500
1           120    148    268
Total       563    205    768
From the table, 268 of the 768 people have diabetes and 500 do not.
Hence, the accuracy of this model is 77%.
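The reported accuracy can be reproduced from the confusion matrix above, together with the other metrics defined in the literature review. The sketch assumes that rows are actual classes and columns are predicted classes, as the table is labelled.

```r
# Cell counts copied from the confusion matrix above (0 = no diabetes, 1 = diabetes).
TN <- 443  # actual 0, predicted 0
FP <- 57   # actual 0, predicted 1
FN <- 120  # actual 1, predicted 0
TP <- 148  # actual 1, predicted 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # share of all cases classified correctly
sensitivity <- TP / (TP + FN)                   # share of actual diabetics detected
specificity <- TN / (TN + FP)                   # share of actual non-diabetics detected
precision   <- TP / (TP + FP)                   # share of predicted diabetics that are correct

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 3))
```

Accuracy comes to 591/768, which rounds to the 77% stated above. Sensitivity is lower than specificity because 120 of the 268 diabetic cases are classified as non-diabetic.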
Signif. Codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 690.52 on 761 degrees of freedom
AIC: 704.52
Number of Fisher Scoring iterations: 5
Analysis of Variance (ANOVA)
Summary
We can see that women in the non-diabetes group (Outcome = 0) have fewer pregnancies
than those in the diabetes group. The distribution of pregnancies in the non-diabetes
group is skewed to the right. The diabetic women appear to have higher glucose
concentrations. The two groups have quite similar blood pressure measurements. There are
many outliers in Insulin in both groups, and the distribution for women with diabetes is
heavily skewed to the right. The diabetic women have a slightly higher BMI than the other
group. The pedigree function distributions in both groups have outliers and are positively
skewed. On average, women in the diabetic group appear older than women in the
non-diabetic group. Let us take a closer look at the variables that have outliers. In
reality, a living person cannot have a blood pressure of zero, so we check how many rows
contain a value of 0 in Blood Pressure.
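The zero-value check can be sketched as follows. The data frame `toy_df` and its values are made up for illustration, and recoding zeros as missing is one possible treatment rather than the method stated in the source.

```r
# Hypothetical toy data frame (made-up values) standing in for the diabetes data.
toy_df <- data.frame(
  BloodPressure = c(72, 0, 64, 0, 80),
  Insulin       = c(0, 94, 168, 0, 88)
)

# Number of rows whose BloodPressure is recorded as 0.
n_zero_bp <- sum(toy_df$BloodPressure == 0)
print(n_zero_bp)

# Count of zeros in every column at once.
zeros_per_column <- colSums(toy_df == 0)
print(zeros_per_column)

# One possible treatment (an assumption for this sketch):
# recode the implausible zeros as missing values.
toy_df$BloodPressure[toy_df$BloodPressure == 0] <- NA
```

After recoding, the zeros show up in a missing-value count and are no longer treated as genuine measurements.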
Reference
1. American Diabetes Association. (2021). Standards of Medical Care in Diabetes—2021. Diabetes
Care, 44(Supplement 1), S1-S232. doi: 10.2337/dc21-S000
2. Centers for Disease Control and Prevention. (2021). National Diabetes Statistics Report, 2020.
Retrieved from https://www.cdc.gov/diabetes/pdfs/data/statistics/national-diabetes-statistics-report.pdf
3. International Diabetes Federation. (2019). IDF Diabetes Atlas, 9th Edition. Retrieved from
https://www.diabetesatlas.org
2. Kleinbaum, D. G., & Klein, M. (2010). Logistic regression: A self-
learning text (3rd ed.). Springer. This book provides a comprehensive
introduction to logistic regression, including discussions on prevalence
estimation.
4. Zhang, Z., & Yu, K. F. (1998). What's the relative risk? A method of
correcting the odds ratio in cohort studies of common outcomes. JAMA,
280(19), 1690-1691. This article introduces a method for estimating
prevalence directly from the odds ratio obtained from logistic regression.
[4] Riaz, M., et al. (2019). Efficacy of Anti-diabetic Agents in Achieving Glycemic
Control and Cardiovascular Outcomes in Type 2 Diabetes Mellitus: A Systematic
Review and Meta-analysis of 204 Studies. PLoS ONE, 14(5), e0216169.
[5] Javanbakht, M., et al. (2020). Bidirectional Association between Depression and
Type 2 Diabetes Mellitus in a Sample of Adults: A Systematic Review and Meta-
analysis. Frontiers in Psychiatry, 11, 562.
[6] Simó-Servat, O., et al. (2021). Big Data Analytics for Early Identification of
Diabetic Retinopathy and Individualized Patient Care. Journal of Diabetes Science and
Technology, 15(3), 734-741.
[7] Berkowitz, S. A., et al. (2018). Addressing Social Determinants of Health in the
Prevention and Management of Type 2 Diabetes: A Review of the Literature. Current
Diabetes Reports, 18(8), 58.