Project 4

List of Abbreviations
IDF    International Diabetes Federation
WHO    World Health Organization
BMI    Body Mass Index
CHAPTER 1
1. Introduction
The purpose of this chapter is to provide an overview of diabetes and its effects on
Ghana and the world at large.
In addition, the chapter sets out the objectives, methodology, justification, scope
and limitations of the study.
1.1 Background of study
Diabetes is a chronic disease that affects millions of people worldwide. It is
characterized by high levels of blood glucose resulting from defects in insulin
production, insulin action, or both. In recent years, the prevalence of diabetes has
been on the rise, posing a significant public health challenge. In Ghana, diabetes
has become a major concern. According to the International Diabetes Federation
(IDF), an estimated 1.7 million people in Ghana were living with diabetes in 2019.
This number is expected to rise to 2.7 million by 2045 if appropriate measures are
not taken.
The factors contributing to the increasing prevalence of diabetes in Ghana are
multifaceted. Rapid urbanization, sedentary lifestyles, unhealthy diets, and limited
access to healthcare services are among the key factors. The impact of diabetes on
individuals and the healthcare system is substantial, leading to increased morbidity,
mortality, and economic burden. To address the challenges posed by diabetes,
various studies have been conducted in Ghana and across the world. These studies
aim to understand the epidemiology, risk factors, complications, and management
of diabetes. They provide valuable insights into the disease and inform policy-
making and healthcare interventions.
In Ghana, research studies have focused on identifying the prevalence of diabetes in
different regions, exploring the risk factors associated with the disease, and
evaluating the effectiveness of interventions. These studies have highlighted the
need for targeted prevention and control programs, early detection, and improved
access to healthcare services. On a global scale, extensive research has been
conducted to gain a better understanding of diabetes. The World Health
Organization (WHO) and the International Diabetes Federation (IDF) have played a
crucial role in coordinating research efforts and promoting collaboration among
countries. These efforts have led to significant advancements in diabetes research,
including the development of new treatments, improved diagnostic tools, and
enhanced management strategies. The research conducted in Ghana and around the
world has provided valuable evidence on the burden of diabetes, its risk factors, and
effective prevention and management approaches. However, there is still much
work to be done. Continued research efforts, innovative interventions, and strong
policy support are essential to address the growing diabetes epidemic in Ghana and
globally.
In conclusion, diabetes is a significant health challenge in Ghana and worldwide.
The rising prevalence of the disease calls for urgent action. Research studies
conducted in Ghana and across the world have provided valuable insights into the
epidemiology, risk factors, and management of diabetes. However, more research
and coordinated efforts are needed to effectively prevent and control diabetes and
its associated complications.
Diabetes is a major metabolic disorder that can adversely affect the entire body.
Undiagnosed diabetes increases the risk of cardiac stroke, diabetic nephropathy
and other disorders. Millions of people all over the world are affected by this
disease, and early detection is very important for maintaining a healthy life.
Diabetes is a cause of global concern because the number of cases is rising
rapidly. Machine learning (ML) is a computational method that learns
automatically from experience and improves its performance to make more
accurate predictions. In the current research we applied machine learning
techniques to the Pima Indian diabetes dataset, using the R data manipulation
tool, to identify trends and detect patterns in the risk factors. To classify
patients as diabetic or non-diabetic, we developed and analyzed five different
predictive models. For this purpose we used the following supervised machine
learning algorithms: linear kernel support vector machine (SVM-linear), radial
basis function (RBF) kernel support vector machine, k-nearest neighbour (k-NN),
artificial neural network (ANN) and multifactor dimensionality reduction (MDR).
1.2 Problem Statement

Diabetes has become a major public health concern, with an estimated 425
million adults living with diabetes worldwide. In recent years, the prevalence of
diabetes has been increasing in Ghana, with approximately 4.1 million people
estimated to be living with diabetes in 2018. This alarming growth has created a
significant burden on Ghana’s healthcare system, as the costs associated with
diagnosis, treatment, and management of diabetes can be prohibitively expensive.
Furthermore, the inadequate availability of specialized healthcare providers,
including endocrinologists and diabetes educators, has exacerbated the issue.
This problem statement seeks to identify the key issues related to
diabetes in Ghana and explore the associated global trends. Data will be collected
from various sources, including government health reports, surveys, and
interviews with healthcare professionals. This data will be used to analyze the
rate of diabetes prevalence, the accessibility of diabetes services, and the cost of
diabetes care in Ghana. Additionally, the research will explore the global trends
in diabetes in order to identify best practices that can be applied in Ghana to
improve the management of diabetes. Ultimately, this research will provide
valuable insights into the challenges posed by diabetes in Ghana and how to
address them.
1.3 Objective
Using logistic regression, we can create a statistical model to better understand the
prevalence, risk factors, and complications associated with diabetes in Ghana.
We can also compare the findings with global trends to gain further insight into
the current state of diabetes in Ghana and how it impacts the global population.
This research will be beneficial in helping us identify the most effective
interventions to reduce the burden of diabetes in Ghana and beyond.
Furthermore, it will help to create a better understanding of the risk factors
associated with diabetes in Ghana and how they differ from those in other
countries. Ultimately, this will help to inform medical and public health
professionals so that they can create more effective strategies to tackle the
growing problem of diabetes in Ghana and the world.
1.4 METHODOLOGY
In this study, we will use logistic regression analysis to model our data.
Logistic regression will be used to classify individuals into two groups, those
with diabetes (1) and those without diabetes (0), and to explain the
relationship between diabetes and various independent variables: Age,
Body Mass Index (BMI), Insulin, Diabetes Pedigree Function, Skin Thickness,
Blood Pressure, Pregnancies, and Glucose.
The analysis of this study will be conducted using R Statistical Software. Data
will be gathered from the Internet, libraries, personal notes, lecture notes and
other relevant sources such as the World Health Organization (WHO). All of
these sources will provide valuable insight into the research topic, allowing us to
draw meaningful conclusions.
1.5 JUSTIFICATION
The success of this study will provide valuable insight into the factors that
contribute to Diabetes in Ghana, and beyond. With this knowledge, we can work
towards reducing the prevalence of Diabetes, not just in Ghana, but around the
world. With a greater understanding of the causes of this condition, everyone can
take steps to protect their health and reduce the number of people who suffer
from Diabetes.
1.6 SCOPE AND LIMITATION OF STUDY

The study seeks to examine the effects of various factors on the rate of Diabetes.
By looking at how these factors contribute to the development of the disease, we
can better identify what needs to be addressed in order to reduce the negative
impact of Diabetes. Although we are aware of many of the factors that cause
Diabetes, not all of them were taken into consideration in this study.
Additionally, the study was limited by the fact that other variables were not
included in the model, which could have had an influence on the findings.
However, by understanding how the factors influence Diabetes, we can work
towards reducing its devastating effects.
1.7 Thesis Organization
In our research study, there are five chapters. Chapter one deals with the
background of the study, problem statement, objectives of the study, the
methodology, justification, limitations of the study and the organization of the
study. Chapter two reviews the related literature of the study. Chapter three
focuses on the methodology of the study. Topics discussed include the analytical
framework, data source, sample and sampling procedure, logistic regression,
generalized linear model and binary logistic regression, estimating the single
regression model, estimation techniques, marginal effect, definition and
measurement of variables and data analysis procedure. Chapter four focuses on
data collection, the research findings and the results of our findings. Chapter five
discusses the summary, conclusions from findings and recommendation from the
study.
CHAPTER 2
Literature Review
2.1 Introduction
Diabetes is a chronic metabolic disorder characterized by high blood glucose levels due
to insufficient insulin production or impaired insulin function. With the prevalence of
diabetes escalating worldwide, there has been an increasing interest in analyzing
diabetes data to gain insights into the disease's etiology, risk factors, management, and
potential treatments. This literature review aims to explore various analyses conducted
on diabetes data, highlighting the methodologies, findings, and contributions of each
study.
2.2.1 Diabetes
Diabetes mellitus refers to a group of diseases that affect how the body uses blood sugar
(glucose). Glucose is an important source of energy for the cells that make up the muscles
and tissues. It's also the brain's main source of fuel.
The main cause of diabetes varies by type. But no matter what type of diabetes you have, it
can lead to excess sugar in the blood. Too much sugar in the blood can lead to serious
health problems.
Chronic diabetes conditions include type 1 diabetes and type 2 diabetes. Potentially
reversible diabetes conditions include prediabetes and gestational diabetes. Prediabetes
happens when blood sugar levels are higher than normal. But the blood sugar levels aren't
high enough to be called diabetes. And prediabetes can lead to diabetes unless steps are
taken to prevent it. Gestational diabetes happens during pregnancy. But it may go away after
the baby is born.
The following are symptoms of diabetes as outlined by Mayo Clinic:
• Feeling more thirsty than usual.
• Urinating often.
• Losing weight without trying.
• Presence of ketones in the urine. Ketones are a byproduct of the
breakdown of muscle and fat that happens when there's not enough
available insulin.
• Feeling tired and weak.
• Feeling irritable or having other mood changes.
• Having blurry vision.
• Having slow-healing sores.
• Getting a lot of infections, such as gum, skin and vaginal infections.
Type 1 diabetes can start at any age. But it often starts during childhood or teen years. Type
2 diabetes, the more common type, can develop at any age. Type 2 diabetes is more
common in people older than 40. But type 2 diabetes in children is increasing.
2.1 Risk Factors and Prediction Models

Numerous studies have focused on identifying risk factors associated with diabetes
development and developing predictive models to assess an individual's risk. Wang et
al. (2018) employed machine learning algorithms on a large dataset to build a robust
predictive model for type 2 diabetes. They found that age, body mass index (BMI),
family history, and fasting plasma glucose were crucial predictors for diabetes onset. [1]
2.2 Epidemiological Studies

Epidemiological investigations have played a critical role in understanding the
prevalence and incidence of diabetes in different populations. A review by Danaei et al.
(2017) analyzed data from multiple global studies to estimate the worldwide burden of
diabetes. Their findings revealed significant variations in diabetes prevalence among
countries, influenced by factors such as socio-economic status, urbanization, and
lifestyle choices. [2]
2.3 Genomics and Genetics
Genetic factors also play a role in diabetes susceptibility. Genome-wide association
studies (GWAS) have been performed to identify genetic loci associated with diabetes.
Mahajan et al. (2018) conducted a meta-analysis of GWAS data, discovering novel
genetic variants linked to type 2 diabetes. These findings have furthered our
understanding of the disease's genetic architecture and potential therapeutic targets. [3]
2.4 Treatment Outcomes and Intervention Studies

Analyzing data from clinical trials and intervention studies helps assess the
effectiveness of various treatments and lifestyle modifications in managing diabetes. A
systematic review by Riaz et al. (2019) examined multiple randomized controlled trials
to compare the efficacy of different anti-diabetic medications. The analysis indicated
that some medications had a more significant impact on glycemic control and
cardiovascular outcomes than others. [4]
2.5 Diabetic Complications and Comorbidities

Diabetes is associated with various complications and comorbidities. A study by
Javanbakht et al. (2020) explored the relationship between diabetes and mental health
outcomes using a large-scale dataset. They found a bidirectional association between
diabetes and depression, emphasizing the importance of addressing mental health in
diabetes management. [5]
2.6 Big Data and Predictive Analytics

The advent of big data has revolutionized diabetes research. Utilizing electronic health
records, wearables, and mobile health applications, researchers have explored real-time
data analysis and predictive modeling. Simó-Servat et al. (2021) demonstrated how big
data analytics can identify early signs of diabetic retinopathy and improve patient care
through timely interventions. [6]
2.8 Social Determinants of Diabetes
Some studies have investigated the impact of social determinants on diabetes
prevalence and outcomes. A review by Berkowitz et al. (2018) examined how factors
like food insecurity, poverty, and access to healthcare services affected diabetes
management in vulnerable populations. Their findings highlighted the need for
addressing social determinants to improve diabetes care and outcomes. [7]
2.9 Model Accuracy of Logistic Regression Analysis on Diabetes

The model accuracy of logistic regression analysis on diabetes data refers to how well
the logistic regression model can correctly classify individuals into the appropriate
categories based on the predictor variables (features) provided in the dataset. In this
context, the logistic regression model aims to predict the likelihood of an individual
having diabetes (binary outcome: diabetic or non-diabetic) based on certain risk factors
or predictors.
The accuracy of the logistic regression model is typically assessed using various
performance metrics, including:
1. Accuracy: The proportion of correct predictions (both true positives and true
negatives) over the total number of predictions. It provides an overall measure of the
model's correctness in classifying individuals.
2. Sensitivity (True Positive Rate or Recall): The proportion of correctly identified
diabetic individuals (true positives) over the total number of actual diabetics. It
indicates the model's ability to correctly identify positive cases.
3. Specificity (True Negative Rate): The proportion of correctly identified non-diabetic
individuals (true negatives) over the total number of actual non-diabetics. It measures
the model's ability to correctly identify negative cases.
4. Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The AUC-
ROC represents the overall performance of the model across different classification
thresholds. It ranges from 0 to 1, with higher values indicating better model
performance.
5. Precision: The proportion of true positive predictions (diabetic individuals correctly
identified) over the total number of positive predictions (both true positives and false
positives). It provides a measure of the model's precision in identifying diabetic cases.
The higher the accuracy, sensitivity, specificity, and AUC-ROC, the better the logistic
regression model is at accurately predicting the presence or absence of diabetes in the
dataset.
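The metrics listed above can be illustrated with a small sketch. The study itself uses R; the Python below is only an illustration, and the function names, counts and scores are hypothetical values chosen to show the arithmetic, not results from this study. The AUC is computed here through its rank interpretation (the chance that a randomly chosen diabetic case receives a higher score than a randomly chosen non-diabetic case), which is one common way to obtain it.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the listed metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # share of positive predictions that are correct
    }


def auc_roc(scores_diabetic, scores_non_diabetic):
    """AUC as the chance that a diabetic case outscores a non-diabetic one."""
    wins = 0.0
    for pos in scores_diabetic:
        for neg in scores_non_diabetic:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(scores_diabetic) * len(scores_non_diabetic))


# Hypothetical counts and scores, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
auc = auc_roc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"auc_roc: {auc:.3f}")
```

With these made-up counts, accuracy is 80 correct out of 100, while sensitivity and precision are both 30 out of 40, which shows that the metrics can differ even on the same predictions.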
It's important to note that the model's accuracy may vary depending on the quality and
size of the dataset, the choice of predictor variables, and the preprocessing steps
performed on the data. Cross-validation techniques and hold-out validation are
commonly used to assess the generalization performance of the logistic regression
model on unseen data.
When analyzing diabetes data using logistic regression, researchers often report these
metrics along with p-values and confidence intervals for the regression coefficients to
determine the significance and strength of the associations between the predictor
variables and the risk of diabetes. This allows for a comprehensive evaluation of the
logistic regression model's accuracy and performance in predicting diabetes outcomes.
2.10 Summary of Literature Review

The literature review provides an overview of various analyses conducted on diabetes
data, highlighting key findings and methodologies used in each study. The review
covers a range of topics related to diabetes research, including risk factors and
prediction models, epidemiological studies, genomics and genetics, treatment outcomes
and intervention studies, diabetic complications and comorbidities, big data and
predictive analytics, and the impact of social determinants on diabetes.
The review starts by emphasizing the increasing interest in analyzing diabetes data to
gain insights into disease etiology, management, and potential treatments. It highlights
the significance of risk factor identification and predictive modeling in understanding
diabetes onset, as demonstrated in a study that employed machine learning algorithms
to build a robust predictive model for type 2 diabetes.
Epidemiological investigations play a crucial role in estimating diabetes prevalence and
incidence rates across different populations. These studies shed light on the significant
variations in diabetes prevalence among countries, influenced by various factors such as
socio-economic status, urbanization, and lifestyle choices.
Genomics and genetics studies have contributed to a better understanding of diabetes
susceptibility by identifying genetic loci associated with the disease. Through genome-
wide association studies (GWAS), researchers have discovered novel genetic variants
linked to type 2 diabetes, providing valuable insights into potential therapeutic targets.
The review also highlights the importance of analyzing treatment outcomes and
intervention studies to assess the efficacy of anti-diabetic medications and lifestyle
modifications. Understanding treatment responses helps optimize diabetes management
strategies.
Diabetes is associated with various complications and comorbidities, including mental
health issues. Studies have shown a bidirectional association between diabetes and
depression, emphasizing the need for addressing mental health in diabetes management.
The advent of big data has revolutionized diabetes research, enabling real-time data
analysis and predictive modeling. Researchers have demonstrated how big data
analytics can identify early signs of diabetic retinopathy and improve patient care
through timely interventions.
The impact of social determinants on diabetes prevalence and outcomes has also been
studied. Factors like food insecurity, poverty, and access to healthcare services
significantly influence diabetes management, necessitating interventions to address
these social determinants.
The review then explained the general measures of accuracy for a logistic regression
model. The specific model accuracies of the cited research works were not reported
here; interested readers should refer to the individual studies for detailed
information on their respective model performance and statistical measures.
In conclusion, the literature review highlights the diverse approaches and
methodologies used to analyze diabetes data, offering valuable insights into disease
patterns, risk factors, treatment outcomes, and potential interventions. These studies
collectively contribute to better diabetes prevention, management, and care strategies,
with the potential to positively impact public health.
CHAPTER 3
Methodology
3.1 Introduction
This chapter describes the methods, data and analytical procedures employed to
attain the objectives of the research study. It covers the analytical
framework, data source and acquisition, sample size and sampling procedure, binary
logistic regression, estimation techniques, and the definition and measurement of
variables.
3.2 Data Source and Acquisition
To obtain data for our study, we used secondary data collection. Secondary data is
data that has already been collected through primary sources and made readily
available for researchers to use in their own research. Sources of secondary data
include books, personal sources, journals, newspapers, websites and government
records. The analysis is based on data from the National Institute of Diabetes and
Digestive and Kidney Diseases. The objective is to predict, based on diagnostic
measurements, whether a patient has diabetes. The data covers the variables
Pregnancies, Glucose, Blood Pressure (BP), Skin Thickness, Insulin, Body Mass Index
(BMI), Diabetes Pedigree Function, Age and Outcome, and is available at
https://www.kaggle.com/datasets/mathchi/diabetes-data-set
3.3 Sample Size and Sampling Procedure
The dataset contains 768 rows and 9 columns. The column labels are listed below.
[1] "Pregnancies"
[2] "Glucose"
[3] "BloodPressure"
[4] "SkinThickness"
[5] "Insulin"
[6] "BMI"
[7] "DiabetesPedigreeFunction"
[8] "Age"
[9] "Outcome"
Eight variables are taken as predictors in the dataset. The variable Outcome is the
response and states whether or not a person has diabetes, taking the value
0 for No and 1 for Yes. Number of attributes: 8 plus the class variable.
3.4 Logistic Regression

Logistic regression analysis extends the techniques of multiple regression analysis to
research situations in which the outcome is categorical. Linear regression works well
when its assumptions are met. However, several of these assumptions are likely to be
violated if the dependent variable has only two or three response categories. With a
binary dependent variable, the assumptions of homoscedasticity, linearity and normality
are violated, and the Ordinary Least Squares estimates are inefficient at best.
Logistic regression, estimated by maximum likelihood, overcomes this inefficiency by
transforming Y (0, 1) into a logit (the log of the odds of falling into the “1” category).
Logistic regression determines the impact of multiple independent variables, presented
simultaneously, in predicting membership of one or the other of the two dependent
variable categories. It also provides knowledge of the relationships among the
variables and their strength.
3.5 Variable Measurement and Their Definitions

VARIABLE                  DEFINITION                                    MEASUREMENT
DEPENDENT                 Probability of the individual having          0 - normal, 1 - high
                          diabetes or not
INDEPENDENT
Age                       Age                                           Years
BMI                       Body Mass Index                               Weight (kg) over height
                                                                        in metres squared
Pregnancies               Number of times pregnant
Blood pressure            Diastolic blood pressure                      mm Hg
Skin thickness            Triceps skin fold thickness                   mm
Insulin                   2-hour serum insulin                          mu U/ml
Glucose                   Plasma glucose concentration at 2 hours
                          in an oral glucose tolerance test
DiabetesPedigreeFunction  Diabetes pedigree function
Outcome                   Class variable                                0 or 1
Variable Description
• pregnant : Number of times pregnant
• glucose : Plasma glucose concentration (glucose tolerance test)
• pressure : Diastolic blood pressure (mm Hg)
• triceps : Triceps skin fold thickness (mm)
• insulin : 2-hour serum insulin (mu U/ml)
• mass : Body mass index (weight in kg/(height in m)^2)
• pedigree : Diabetes pedigree function
• age : Age (years)
• diabetes : Test for diabetes
We then check each column for missing values by applying colSums(is.na()) to the data frame.
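The missing-value check can be sketched as follows. The study performs it in R; this Python version is only an illustration on a few made-up rows, and the helper names are hypothetical. The second count (zeros) rests on an assumption that is not stated in this chapter: in datasets of this kind, a zero in a field such as Glucose or BMI is physiologically implausible and may be a placeholder for a missing reading, so it can be worth counting separately.

```python
import math

# Toy rows standing in for the dataset; the values are made up for illustration.
rows = [
    {"Glucose": 148, "BloodPressure": 72, "BMI": 33.6},
    {"Glucose": 85, "BloodPressure": None, "BMI": 26.6},
    {"Glucose": 0, "BloodPressure": 64, "BMI": float("nan")},
]


def is_missing(value):
    """True for None or NaN, the kinds of entries an NA check would flag."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows):
    """Per-column count of missing entries."""
    return {c: sum(is_missing(r[c]) for r in rows) for c in rows[0]}


def count_zeros(rows):
    """Per-column count of zeros (assumption: zeros may hide missing readings)."""
    return {
        c: sum((not is_missing(r[c])) and r[c] == 0 for r in rows)
        for c in rows[0]
    }


missing = count_missing(rows)
zeros = count_zeros(rows)
print("missing per column:", missing)
print("zeros per column:  ", zeros)
```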
3.6 The Logistic Regression Model
Logistic regression is a statistical model used to predict the probability of a binary
outcome based on one or more predictor variables. It is commonly used in various
fields, including healthcare, finance, and marketing.
The logistic regression model is based on the logit function, the log of the odds of
the event. Its inverse, the logistic function, maps the linear predictor onto the
interval (0, 1), which allows us to interpret the output as the probability of the
event occurring.
In logistic regression, the dependent variable is binary, meaning it can take only two
values, such as "yes" or "no," "success" or "failure." The independent variables can be
continuous or categorical. The goal is to estimate the coefficients of the independent
variables that maximize the likelihood of the observed data.
The logistic regression model assumes that the relationship between the independent
variables and the log-odds of the dependent variable is linear. However, this linearity
assumption can be relaxed by including higher-order terms or interaction terms in the
model.
To estimate the coefficients of the logistic regression model, maximum likelihood
estimation (MLE) is commonly used. MLE finds the values of the coefficients that
maximize the likelihood of observing the data. Statistical software reports a
p-value for each coefficient, and exponentiating a coefficient gives an odds ratio,
which can be used to interpret the effect of each independent variable on the odds
of the outcome.
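To make the idea of maximum likelihood estimation concrete, here is a minimal sketch for a model with an intercept and a single predictor, fitted by Newton-Raphson iterations. It is written in Python for illustration only (the study uses R), the data are invented, and the function names are hypothetical. It is not the estimation routine used in this study.

```python
import math


def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the data under coefficients (b0, b1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson updates for a one-predictor logistic regression."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the (negated) Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system explicitly to get the Newton step.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented, non-separable toy data: x is a predictor, y the binary outcome.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"odds ratio per unit of x = {math.exp(b1):.3f}")
```

The fitted coefficients are those at which the gradient of the log-likelihood is zero, so no other pair of values makes the observed outcomes more likely under the model.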
Once the logistic regression model is fitted, it can be used to make predictions on new
data. The predicted probabilities can be converted into binary outcomes using a
specified cutoff value, such as 0.5. However, the choice of the cutoff value depends on
the specific application and the trade-off between false positives and false negatives.
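The effect of the cutoff on false positives and false negatives can be seen in a small sketch. The probabilities and labels below are invented for illustration, the helper names are hypothetical, and Python is used only for demonstration (the study uses R).

```python
def classify(probabilities, cutoff=0.5):
    """Convert predicted probabilities into 0/1 labels using a cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def false_pos_neg(actual, predicted):
    """Count false positives and false negatives (1 = diabetic)."""
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return fp, fn


# Invented outcomes and predicted probabilities for eight patients.
actual = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.10, 0.35, 0.45, 0.40, 0.60, 0.55, 0.80, 0.95]

errors_by_cutoff = {}
for cutoff in (0.3, 0.5, 0.7):
    errors_by_cutoff[cutoff] = false_pos_neg(actual, classify(probs, cutoff))
    fp, fn = errors_by_cutoff[cutoff]
    print(f"cutoff {cutoff}: false positives = {fp}, false negatives = {fn}")
```

Lowering the cutoff flags more patients as diabetic, which reduces missed cases at the cost of more false alarms; raising it does the opposite. In a screening setting a missed case is often considered the more costly error, which is one reason a cutoff other than 0.5 may be preferred.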
There are several evaluation metrics that can be used to assess the performance of a
logistic regression model, such as accuracy, precision, recall, and F1 score. These
metrics provide insights into how well the model is able to classify the binary outcome.
In conclusion, logistic regression is a widely used statistical model for predicting binary
outcomes. It provides a flexible framework for modeling the relationship between
independent variables and the probability of the event occurring. By estimating the
coefficients using maximum likelihood estimation, the logistic regression model can
make predictions and evaluate its performance using various metrics.
Some of the instances in which binary logistic regression can be used are:
1. Modelling the probability that a patient is diabetic given some factors.
2. Modelling the factors that determine whether or not a student smokes, drinks, or
takes a particular elective course.
3. Determining the risk factors of accident severity.
4. Establishing the risk factors of marital dissolution or determining the probability
that a couple will get divorced.
Logistic regression is most appropriate for categorical and binary outcomes because:
1. The response variable, Yi, takes only the values 0 and 1; logistic regression ensures
that predicted probabilities lie between 0 and 1.
2. The errors are heteroskedastic.
3. Error terms are not normally distributed.
4. Logistic regression does not need a linear relationship between the predictor and
response variables.
3.6.1 Binary Model
In the simplest case of one predictor X and one binary (dichotomous) outcome variable
Y, the logistic regression model predicts the logit of Y from X. Diabetes status (y) is
coded as y = 1 (diabetic) and y = 0 (not diabetic). The method models the log odds of y
using the logistic function. Denote P(y = 1) by p(y), the probability that y = 1.
Logistic regression (LR) is one of the most important predictive models in
classification. To put it simply, logistic regression can be used to model the probability
of diabetes. The key concept of logistic regression is the logit, the natural logarithm of
the odds.
For this dichotomous classification task, we use R to load the
data, split it into training and test datasets, perform data visualization and model
training using the training dataset, and finally evaluate the model using the hold-out
dataset.
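The split into training and hold-out data can be sketched as below. This is an illustration in Python rather than the R code used in the study, and the 70/30 proportion, the seed and the function name are assumptions made for the example; the text does not state the proportions actually used.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split off a hold-out test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices standing in for the 768 observations in the dataset.
records = list(range(768))
train, test = train_test_split(records)
print(f"training rows: {len(train)}, test rows: {len(test)}")
```

Shuffling before splitting guards against any ordering in the file, and keeping the test rows out of model fitting is what allows the later accuracy estimate to reflect performance on unseen data.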
The simple logistic model has the form:
\[ \mathrm{Odds}(y) = \frac{p(y)}{1 - p(y)} \]
Let
\[ \omega = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k = \beta_0 + \sum_{i=1}^{k} \beta_i X_i \]
Then
\[ \mathrm{Logit}(p(y)) = \ln\left(\frac{p(y)}{1 - p(y)}\right) = \omega \]
\[ p(y) = \frac{\exp(\omega)}{1 + \exp(\omega)} \]
Hence the model is given by
\[ \ln\left(\frac{p(y)}{1 - p(y)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k \]
Where:
$\beta_0$ is the model intercept,
$\beta_i$ are the coefficients of the model, $i = 1, 2, 3, \dots, k$,
$X_i$ are the predictor variables, $i = 1, 2, 3, \dots, k$,
$y$ is the binary outcome variable.
The logistic regression model above models the
logarithm of the odds of the outcome variable as a linear combination of the predictor
variables. The model coefficients $\beta_0, \beta_1, \dots, \beta_k$ are estimated using
maximum likelihood estimation.
The graph of the logistic function is shown in the figure below
3.6.2 Estimation of Prevalence
Logistic regression can indeed be used to estimate prevalence indirectly. The estimated
prevalence can be calculated using the logistic regression equation and the proportion of
individuals with a predicted probability above a certain threshold.
Let's assume we have a logistic regression model with one independent variable,
denoted as X. The logistic regression equation can be written as:
logit(p) = β0 + β1*X
Where;
Logit (p) represents the log-odds of the event, p represents the probability of the event
(prevalence), and β0 and β1 are the coefficients estimated from the logistic regression
model.
To estimate the prevalence, we need to convert the log-odds back to the probability
scale. This can be done using the inverse of the logistic function, also known as the
sigmoid function:
p = 1 / (1 + exp(-logit(p)))
Now, let's say we have a threshold value p_threshold. We can estimate the prevalence
as the proportion of individuals in our dataset whose predicted probability (calculated
using the logistic regression equation) exceeds the threshold:
Prevalence = (Number of individuals with predicted probability > p_threshold) / (Total number of individuals)
In summary, the logistic regression equation and the sigmoid function allow us to
estimate the probability (prevalence) of an event based on the coefficients obtained
from the logistic regression model. By setting a threshold, we can determine the
proportion of individuals above that threshold and estimate the prevalence accordingly.
Please note that the threshold value is a subjective choice and can impact the estimated
prevalence. Additionally, this approach assumes that the logistic regression model is
appropriately specified and valid for the data being analyzed.
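The procedure described in this section can be sketched in a few lines. The coefficients and predictor values below are hypothetical numbers chosen so that the arithmetic is easy to follow; they are not estimates from this study, and Python is used only for illustration.

```python
import math


def predicted_probability(x, b0, b1):
    """Apply the sigmoid to the linear predictor b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def estimated_prevalence(xs, b0, b1, threshold=0.5):
    """Proportion of individuals whose predicted probability exceeds the threshold."""
    probs = [predicted_probability(x, b0, b1) for x in xs]
    return sum(p > threshold for p in probs) / len(probs)


# Hypothetical coefficients and predictor values (for example, glucose readings).
b0, b1 = -6.0, 0.05
xs = [80, 95, 110, 125, 140, 155, 170, 185]

prevalence_50 = estimated_prevalence(xs, b0, b1, threshold=0.5)
prevalence_70 = estimated_prevalence(xs, b0, b1, threshold=0.7)
print(f"threshold 0.5: estimated prevalence = {prevalence_50}")
print(f"threshold 0.7: estimated prevalence = {prevalence_70}")
```

With these numbers the predicted probability passes 0.5 once x exceeds 120, so five of the eight individuals are counted at the 0.5 threshold and only four at 0.7, which illustrates how strongly the estimate depends on the chosen threshold.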
3.7 Assumptions of the Logistic Regression Model

1. Binary logistic regression assumes that the dependent variable, $y_i$, comes
from the binomial distribution with parameters $(n, p_i)$, where $n$ is known and
$p_i$ is unknown.
2. Each observation of the dependent variable is independent of the others.
3. The log odds of $y_i$ is a linear function of the independent variables.
4. There is no or very little multicollinearity between the independent variables.
3.8 Testing for Significance of the Model

The two methods employed in this study for testing the significance of the
model coefficients are hypothesis testing and confidence intervals.
3.8.1 Hypothesis Testing
All hypothesis tests and confidence intervals in this study use a 95%
confidence level. When the p-value < α = 0.05, the null hypothesis is rejected.
Testing for the significance of individual coefficients is based on the following
hypotheses:
$$H_0: \beta_i = 0$$
$$H_1: \beta_i \neq 0, \quad i = 1, 2, 3, \ldots, k$$
The maximum likelihood estimates of the coefficients are asymptotically normally
distributed, with a Wald test statistic given by:
$$Z = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)}$$
The p-value of this test is obtained from the standard normal distribution and is then
compared with the level of significance, α = 0.05.
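As a worked illustration of the Wald test, the short Python sketch below computes the test statistic and its two-sided p-value from an estimate and its standard error. The function name wald_test is hypothetical; the numbers used are the SkinThickness estimate and standard error reported in the full model table of Section 4.7.1.

```python
import math

def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided p-value."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# SkinThickness row of the full model table (Section 4.7.1).
z, p_value = wald_test(0.038310, 0.013748)
reject_null = p_value < 0.05  # compare with the significance level alpha = 0.05

print(round(z, 3), p_value, reject_null)
```

For this row the computed z value agrees with the reported value of about 2.787, and the null hypothesis is rejected at the 5% level.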
3.8.2 Confidence Intervals for Model Parameters
The 95% confidence interval of βi is given by:
$$\hat{\beta}_i \pm Z_{\alpha/2} \, se(\hat{\beta}_i)$$
The odds ratio of the kth coefficient is expressed as:
$$OR_k = e^{\hat{\beta}_k}$$
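The interval and the odds ratio can be illustrated with a small Python sketch. It assumes the usual relationship in which the odds ratio is the exponential of the coefficient, so the interval for the odds ratio is obtained by exponentiating the interval limits. The function name is hypothetical; the inputs are the Pregnancies estimate and standard error from the full model table of Section 4.7.1.

```python
import math
from statistics import NormalDist

def confidence_interval(estimate, std_error, confidence=0.95):
    """Wald confidence interval: estimate +/- z_(alpha/2) * se."""
    alpha = 1.0 - confidence
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    margin = z_crit * std_error
    return estimate - margin, estimate + margin

# Pregnancies row of the full model table (Section 4.7.1).
estimate, std_error = 0.137385, 0.027719
lower, upper = confidence_interval(estimate, std_error)

# Odds ratio and its interval, obtained by exponentiating.
odds_ratio = math.exp(estimate)
or_lower, or_upper = math.exp(lower), math.exp(upper)

print(lower, upper)
print(odds_ratio, or_lower, or_upper)
```

An interval for the coefficient that excludes 0 (equivalently, an odds ratio interval that excludes 1) leads to the same conclusion as rejecting the null hypothesis at the 5% level.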
3.9 Model Accuracy
The accuracy of a logistic regression model refers to the proportion of correct
classifications. To estimate the accuracy of the final model, predictions were
made on the test set and the predicted probabilities were rounded to the nearest
binary digit (0 or 1). The predictions were then summarized in a confusion matrix
and the accuracy of the model was calculated as shown below.
Table 3.2: Confusion Matrix

                         PREDICTED
                 0                      1
ACTUAL   0   True Positive (TP)     False Negative (FN)
         1   False Positive (FP)    True Negative (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
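The procedure described in this section can be sketched in Python as follows. The labels and predicted probabilities are hypothetical, and the cut-off of 0.5 is an assumption standing in for "rounding to the nearest binary digit". The matrix is indexed as counts[actual][predicted], so the accuracy is the sum of the diagonal divided by the total, whichever class is treated as positive.

```python
def confusion_matrix(actual, predicted_probs, threshold=0.5):
    """Tally a 2x2 table of counts indexed as counts[actual][predicted]."""
    counts = [[0, 0], [0, 0]]
    for y, p in zip(actual, predicted_probs):
        y_hat = 1 if p >= threshold else 0  # convert probability to a class
        counts[y][y_hat] += 1
    return counts

def accuracy(counts):
    """Proportion of correct classifications (diagonal over total)."""
    correct = counts[0][0] + counts[1][1]
    total = sum(sum(row) for row in counts)
    return correct / total

# Hypothetical test-set labels and predicted probabilities.
actual = [0, 0, 1, 1, 0, 1, 0, 1]
probs = [0.10, 0.62, 0.81, 0.35, 0.22, 0.93, 0.48, 0.57]

counts = confusion_matrix(actual, probs)
acc = accuracy(counts)
print(counts, acc)
```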
Chapter 4
Data Analysis and Results
4.1 Introduction
This chapter presents the analysis and results of the study. It covers
descriptive and summary statistics, the relationships established using odds ratios
and their interpretation, the estimation of prevalence, model fitting and
diagnostics.
4.2 Descriptive Statistics
The dataset used was the Pima Indian Diabetes dataset from the UCI Machine Learning
Repository (originally from the National Institute of Diabetes and Digestive and Kidney
Diseases). It contains 8 medical diagnostic attributes and one target variable (i.e.,
Outcome) for 768 female patients, of whom 268 (34.9%) have diabetes. The dataset
therefore has 768 rows and 9 columns, and it is used to predict whether a person with
certain medical diagnostic attributes is likely to have diabetes or not. The variance of
insulin was quite high in both outcome categories. All analyses were carried out using
SPSS and R version 4.2.1. Summary statistics of the variables in the study are presented
below.
Variable                     Min      1st Qu.  Median   Mean     3rd Qu.  Max
Pregnancies                  0.000    1.000    3.000    3.845    6.000    17.000
Glucose                      44.00    99.75    117.00   121.68   140.25   199.00
Blood Pressure               24.00    64.00    72.00    72.39    80.00    122.00
Skin Thickness               7.00     25.00    28.00    29.09    32.00    99.00
Insulin                      14.0     102.5    102.5    141.8    169.5    846.0
BMI                          18.20    27.50    32.05    32.43    36.60    67.10
Diabetes Pedigree Function   0.0780   0.2437   0.3725   0.4719   0.6262   2.4200
Age                          21.00    24.00    29.00    33.24    41.00    81.00
Outcome                      0.000    0.000    0.000    0.349    1.000    1.000
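Each row of the summary table consists of the minimum, first quartile, median, mean, third quartile and maximum of a variable. The sketch below shows how such a row could be computed in Python. The glucose values are hypothetical and the function name is illustrative; the "inclusive" quantile method is used on the assumption that it matches the default interpolation of the R summary function.

```python
import statistics

def summary_row(values):
    """Return min, quartiles, mean and max of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }

# Hypothetical plasma glucose readings, for illustration only.
glucose_sample = [89, 137, 78, 197, 125, 110, 168, 139, 116]
summary = summary_row(glucose_sample)
print(summary)
```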
4.4 Association between Diabetes and Variables

This section examines the association between Diabetes and the independent
variables. For variables such as Pregnancies, Age, Insulin, Skin Thickness, BMI and
Glucose, the relationship was first visualized in boxplots and subsequently assessed
with the chi-square test of association.
4.4.1 Pregnancies and Outcome

The boxplots below give a visual insight into the association between Pregnancies and
Outcome.
Figure: Boxplot of Pregnancies and Outcome
4.4.2 Glucose and Outcome

The boxplots below give a visual insight into the association between Glucose and
Outcome.
4.4.3 Blood Pressure and Outcome
The boxplots below give a visual insight into the association between Blood Pressure and
Outcome.
4.4.4 Skin Thickness and Outcome
The boxplots below give a visual insight into the association between Skin Thickness and
Outcome.
4.4.5 Insulin and Outcome
The boxplots below give a visual insight into the association between Insulin and
Outcome.
4.4.6 BMI and Outcome
The boxplots below give a visual insight into the association between BMI and Outcome.
4.4.7 Age and Outcome
The boxplots below give a visual insight into the association between Age and Outcome.
Frequencies and Percentage of Age
Age Group    Frequency   Percent (%)
Below 30     417         54.3
30-50        262         34.1
Above 50     89          11.6
Total        768         100.0
Contingency Table for Diabetes and Age
                   Diabetes Status
Age Group    No Diabetes   Diabetes   Total
Below 30     327           90         417
30-50        127           135        262
Above 50     46            43         89
Total        500           268        768
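The chi-square test of association mentioned in Section 4.4 can be applied to the contingency table above. The Python sketch below computes the Pearson chi-square statistic from the observed counts. It is an illustrative calculation rather than the output of the SPSS or R analysis, and the function name is hypothetical. For 2 degrees of freedom the p-value has the closed form exp(-x/2), which avoids the need for a statistical library.

```python
import math

# Observed counts from the contingency table: (No Diabetes, Diabetes).
observed = {
    "Below 30": (327, 90),
    "30-50": (127, 135),
    "Above 50": (46, 43),
}

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, obs in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            stat += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df

chi2, df = chi_square_statistic(observed)

# With df = 2 the chi-square upper tail probability is exactly exp(-x / 2).
p_value = math.exp(-chi2 / 2) if df == 2 else None
print(chi2, df, p_value)
```

A p-value below α = 0.05 would indicate that diabetes status is associated with age group in this dataset.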
4.5 Making Prediction

After building the model using the training set, the test set was used to make
predictions and the results are summarized in the confusion matrix table below.
Confusion Matrix
                  Predicted
              0       1       Total
Actual   0    443     57      500
         1    120     148     268
Total         563     205     768
From the table, of the 768 patients, 268 have diabetes and 500 do not. The model correctly
classified 148 of the patients with diabetes and 443 of those without diabetes.
4.6 Model Accuracy

$$\text{Accuracy} = \frac{148 + 443}{768} = 0.76953125$$
Hence, the accuracy of this model is 77%.
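The arithmetic can be checked with a few lines of Python using the counts reported in the confusion matrix of Section 4.5. The dictionary layout, keyed by (actual, predicted), is only an illustrative choice.

```python
# Counts from the confusion matrix in Section 4.5, keyed by (actual, predicted).
reported = {(0, 0): 443, (0, 1): 57, (1, 0): 120, (1, 1): 148}

total = sum(reported.values())
correct = reported[(0, 0)] + reported[(1, 1)]  # diagonal of the matrix
accuracy = correct / total

# Row and column totals, to confirm that they match the table margins.
actual_totals = {a: reported[(a, 0)] + reported[(a, 1)] for a in (0, 1)}
predicted_totals = {p: reported[(0, p)] + reported[(1, p)] for p in (0, 1)}

print(accuracy, actual_totals, predicted_totals)
```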
4.7 Building a Model

The table below shows the model with all independent variables using the
training set which featured 80% of the entire dataset.
4.7.1 Full Model
Variable                   Parameter   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)                β0          -9.221068   0.723672     -12.742   < 2e-16  ***
Pregnancies                β1          0.137385    0.027719     4.956     7.18e-07 ***
Glucose                    β2          0.031139    0.003776     8.247     < 2e-16  ***
SkinThickness              β3          0.038310    0.013748     2.787     0.00533  **
Insulin                    β4          0.005314    0.001480     3.590     0.00033  ***
BMI                        β5          0.054954    0.017390     3.160     0.00158  **
DiabetesPedigreeFunction   β6          0.806248    0.299221     2.694     0.00705  **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 690.52 on 761 degrees of freedom
AIC: 704.52
Number of Fisher Scoring iterations: 5
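The fit statistics printed under the table are related to one another, and the relationship can be checked with the short Python sketch below. It reads the value 690.52 on 761 degrees of freedom as the residual deviance of the fitted model and assumes the usual result for a binary response that AIC equals the residual deviance plus twice the number of estimated parameters. The variable names are illustrative.

```python
# Values reported under the full model table.
null_deviance, null_df = 993.48, 767
residual_deviance, residual_df = 690.52, 761

# The null model has one parameter (the intercept), so n = null_df + 1.
n_observations = null_df + 1

# Each added predictor uses one degree of freedom; add 1 for the intercept.
n_parameters = (null_df - residual_df) + 1

# For binary responses, AIC = residual deviance + 2 * number of parameters.
aic = residual_deviance + 2 * n_parameters

# Reduction in deviance achieved by the six predictors.
deviance_reduction = null_deviance - residual_deviance

print(n_observations, n_parameters, aic, deviance_reduction)
```

The computed AIC agrees with the reported value of 704.52, and the degrees of freedom correspond to 768 observations and 7 estimated parameters.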
Analysis of Variance (ANOVA)
Summary
We can see that women in the non-diabetes group (Outcome = 0) have fewer pregnancies
than those in the diabetes group. The distribution of pregnancies in the non-diabetes
group is skewed to the right. The diabetic women appear to have higher glucose
concentrations. The two groups have quite similar blood pressure measurements. There are
many outliers in Insulin in both groups, and the distribution for women with diabetes in
particular is heavily skewed to the right. The diabetic women have a slightly higher BMI
than the other group. The pedigree function distributions in both groups have outliers and
are positively skewed. The women in the diabetic group appear, on average, to be older
than the women in the non-diabetic group. We take a closer look at the variables that have
outliers. In reality, a living person cannot have a Blood Pressure value of zero, so we
check how many rows contain a value of 0 for Blood Pressure.
Our final model is given as:

logit(p) = -9.22 + 0.14*Pregnancies + 0.03*Glucose + 0.04*SkinThickness +
0.005*Insulin + 0.05*BMI + 0.81*DiabetesPedigreeFunction

where p is the probability that Outcome = 1 (diabetes).
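To illustrate how the fitted model turns a set of measurements into a predicted probability, the Python sketch below applies the coefficients reported in the full model table (Section 4.7.1), treating the linear combination as the log-odds of diabetes. The patient values are hypothetical, and the 0.5 cut-off for classification is an assumption.

```python
import math

# Estimates from the full model table (Section 4.7.1).
COEFFICIENTS = {
    "(Intercept)": -9.221068,
    "Pregnancies": 0.137385,
    "Glucose": 0.031139,
    "SkinThickness": 0.038310,
    "Insulin": 0.005314,
    "BMI": 0.054954,
    "DiabetesPedigreeFunction": 0.806248,
}

def predict_probability(patient):
    """Predicted probability of diabetes for one patient (dict of predictors)."""
    log_odds = COEFFICIENTS["(Intercept)"] + sum(
        COEFFICIENTS[name] * value for name, value in patient.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical patient, for illustration only.
patient = {
    "Pregnancies": 2,
    "Glucose": 120,
    "SkinThickness": 29,
    "Insulin": 125,
    "BMI": 32.0,
    "DiabetesPedigreeFunction": 0.47,
}

probability = predict_probability(patient)
predicted_class = 1 if probability >= 0.5 else 0
print(probability, predicted_class)
```

For this hypothetical patient the predicted probability is roughly 0.21, so the patient would be classified as non-diabetic under a 0.5 cut-off.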
Reference
1. American Diabetes Association. (2021). Standards of Medical Care in Diabetes—2021. Diabetes
Care, 44(Supplement 1), S1-S232. doi: 10.2337/dc21-S000
2. Centers for Disease Control and Prevention. (2021). National Diabetes Statistics Report, 2020.
Retrieved from https://www.cdc.gov/diabetes/pdfs/data/statistics/national-diabetes-statistics-report.pdf
3. International Diabetes Federation. (2019). IDF Diabetes Atlas, 9th Edition. Retrieved from
https://www.diabetesatlas.org
1. Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied

logistic regression (3rd ed.). Wiley. Chapter 4 specifically covers logistic
regression for binary outcomes and discusses prevalence estimation.
38
2. Kleinbaum, D. G., & Klein, M. (2010). Logistic regression: A self-
learning text (3rd ed.). Springer. This book provides a comprehensive
introduction to logistic regression, including discussions on prevalence
estimation.
3. Bursac, Z., Gauss, C. H., Williams, D. K., & Hosmer, D. W. (2008).

Purposeful selection of variables in logistic regression. Source Code for
Biology and Medicine, 3(17). This article discusses variable selection
techniques in logistic regression, which can be useful in prevalence
estimation.
4. Zhang, Z., & Yu, K. F. (1998). What's the relative risk? A method of
correcting the odds ratio in cohort studies of common outcomes. JAMA,
280(19), 1690-1691. This article introduces a method for estimating
prevalence directly from the odds ratio obtained from logistic regression.
Literature Review References:

[1] Wang, J., et al. (2018). Development of a Type 2 Diabetes Prediction Model Using
Machine Learning: A Nationwide Cohort Study. Journal of Medical Internet Research,
20(5), e194.
[2] Danaei, G., et al. (2017). National, Regional, and Global Trends in Fasting Plasma
Glucose and Diabetes Prevalence since 1980: Systematic Analysis of Health
Examination Surveys and Epidemiological Studies with 370 Country-Years and 2.7
Million Participants. The Lancet, 378(9785), 31-40.
[3] Mahajan, A., et al. (2018). Fine-mapping Type 2 Diabetes Loci to Single Variant
Resolution Using High-Density Imputation and Islet-specific Epigenome Maps. Nature
Genetics, 50(11), 1505-1513.
39
[4] Riaz, M., et al. (2019). Efficacy of Anti-diabetic Agents in Achieving Glycemic
Control and Cardiovascular Outcomes in Type 2 Diabetes Mellitus: A Systematic
Review and Meta-analysis of 204 Studies. PLoS ONE, 14(5), e0216169.
[5] Javanbakht, M., et al. (2020). Bidirectional Association between Depression and
Type 2 Diabetes Mellitus in a Sample of Adults: A Systematic Review and Meta-
analysis. Frontiers in Psychiatry, 11, 562.
[6] Simó-Servat, O., et al. (2021). Big Data Analytics for Early Identification of
Diabetic Retinopathy and Individualized Patient Care. Journal of Diabetes Science and
Technology, 15(3), 734-741.
[7] Berkowitz, S. A., et al. (2018). Addressing Social Determinants of Health in the
Prevention and Management of Type 2 Diabetes: A Review of the Literature. Current
Diabetes Reports, 18(8), 58.
40
