Binary Logistic Regression Concept

CONCEPT ON BINARY LOGISTIC REGRESSION AND ITS ANALYSIS

Submitted by

Siddhanta Subedi

Symbol no:-569/77,4th SEM

Submitted to

Dr. Parvat Uprety


Professor
Central Department of Statistics
Tribhuvan University Kirtipur, Kathmandu

5th December, 2023

INTRODUCTION
Binary logistic regression is a type of regression analysis that is used to estimate the
relationship between a dichotomous dependent variable and dichotomous-, interval-, and
ratio-level independent variables. Many different variables of interest are dichotomous – e.g.,
whether or not someone voted in the last election, whether or not someone is a smoker,
whether or not one has a child, whether or not one is unemployed, etc. These types of
variables are often referred to as discrete or qualitative. Many discrete or qualitative variables
can be thought of as events. Dichotomous or dummy variables are usually coded 1, indicating
“success” or “yes,” and 0, indicating “failure” or “no.” The mean of a dichotomous variable
coded 1 and 0 is equal to the proportion of cases coded as 1, which can also be interpreted as
a probability.

For example: deciding whether or not to offer a loan to a bank customer (outcome = yes or
no), evaluating the risk of cancer (outcome = high or low), or predicting a team's win in a
football match (outcome = yes or no).

LOGISTIC REGRESSION MODEL


Logistic regression is a supervised machine learning algorithm that accomplishes binary
classification tasks by predicting the probability of an outcome, event, or observation. The
model delivers a binary or dichotomous outcome limited to two possible outcomes: yes/no,
0/1, or true/false.

Logistic regression analyzes the relationship between one or more independent variables and
classifies data into discrete classes. It is extensively used in predictive modeling, where the
model estimates the mathematical probability of whether an instance belongs to a specific
category or not.

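The specific form of the logistic model we use is:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

where $\beta_0$ and $\beta_1$ are unknown parameters.

The logit transformation in terms of $\pi(x)$ is:

$$g(x) = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x$$

The logit, g(x), is linear in its parameters, may be continuous, and may range from $-\infty$ to
$+\infty$, depending on the range of x.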
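
The link between the logistic model and its logit can be checked numerically. The following is a minimal Python sketch; the function names and the parameter values b0 = -5.0 and b1 = 0.1 are illustrative assumptions, not estimates from this report.

```python
import math

def logistic(x, b0, b1):
    """Probability pi(x) under a logistic model with intercept b0 and slope b1."""
    g = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-g))  # algebraically equal to e^g / (1 + e^g)

def logit(p):
    """Logit transformation: the log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Hypothetical parameter values, chosen only for illustration
b0, b1 = -5.0, 0.1

for x in (20, 50, 80):
    p = logistic(x, b0, b1)
    # logit(p) recovers the linear predictor b0 + b1 * x
    print(x, round(p, 4), round(logit(p), 4))
```

The probability always stays strictly between 0 and 1, while the logit is unbounded and linear in x.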

Fitting of Logistic Regression:


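In the logistic regression model, the outcome variable is dichotomous or binary. In our
discussion the primary outcome variable is having coronary heart disease (CHD), which is
coded as 1 for Yes and 0 for No. Hence,

$y_i$ = 1 for Yes (having CHD),

$y_i$ = 0 for No (not having CHD)

$x_i$ = Age (explanatory variable)

Thus, the contribution of the pair $(x_i, y_i)$ to the likelihood is given by the expression

$$\pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i}$$

Since the observations are assumed to be independent, the likelihood function is expressed as:

$$l(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i}$$

Also, we use the log-likelihood, which is defined as

$$L(\beta) = \ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i \ln \pi(x_i) + (1 - y_i) \ln[1 - \pi(x_i)] \right\}$$

The required parameters are estimated by maximizing this expression with respect to $\beta$.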
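
The log-likelihood has no closed-form maximum, so it is maximized iteratively. The sketch below uses Newton-Raphson for a single predictor. The data set is a small made-up example (the CHD data are not reproduced in this report), the function names are hypothetical, and SPSS's own estimation routine may differ in detail.

```python
import math

def prob(x, b0, b1):
    """pi(x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(xs, ys, b0, b1):
    """Sum of y*ln(pi) + (1 - y)*ln(1 - pi) over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob(x, b0, b1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, iterations=25):
    """Newton-Raphson updates for (b0, b1), starting from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0             # score vector (first derivatives)
        h00 = h01 = h11 = 0.0     # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = prob(x, b0, b1)
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system: step = information^{-1} * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: age in years and a 0/1 outcome (not the CHD data)
ages = [22, 25, 31, 34, 38, 41, 45, 48, 52, 55, 59, 63]
chd = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

b0_hat, b1_hat = fit(ages, chd)
print(b0_hat, b1_hat, log_likelihood(ages, chd, b0_hat, b1_hat))
```

At the maximum the score equations hold, so the sum of the fitted probabilities equals the number of observed events.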

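TEST OF SIGNIFICANCE
Under the null hypothesis H0 that $\beta_1$ is equal to zero, the G statistic follows a chi-square
distribution with 1 degree of freedom, where the G statistic is given by:

$$G = -2\ln\left[\frac{\text{likelihood without the variable}}{\text{likelihood with the variable}}\right]$$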

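Wald test for significance of $\beta_1$


The Wald test is used to test the significance of an individual coefficient in logistic regression.


$$W = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

SPSS reports the squared form, $W^2$, which is compared with a chi-square distribution with 1
degree of freedom under H0.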

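Confidence interval for $\exp(\beta_1)$

The endpoints of the $100(1-\alpha)\%$ confidence interval are

$$\exp\left[\hat{\beta}_1 - z_{1-\alpha/2}\, SE(\hat{\beta}_1)\right] \quad \text{and} \quad \exp\left[\hat{\beta}_1 + z_{1-\alpha/2}\, SE(\hat{\beta}_1)\right]$$

For the predictive logistic regression model:

$$P(Z) = \frac{e^{Z}}{1 + e^{Z}}$$

where $Z = \beta_0 + \beta_1 x$.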

Calculation and Result discussion:


First, we look at the descriptive status of coronary heart disease (CHD).

Table 1: Status of CHD

CHD      Frequency   Percentage
No       57          57.0
Yes      43          43.0
Total    100         100.0

From the above table, out of 100 participants, 43 have CHD and 57 do not.

For the independent variable, the minimum age of participants is 20 years and the maximum
age is 69 years. The median age of participants was found to be 44 years, with a mean age of
44.38 years.

[Figure: Scatter plot between age and CHD (horizontal axis: Age; vertical axis: CHD)]

Here, we are interested in the association between the age of the participants and the presence
or absence of CHD in the study population. So, we draw a scatter plot with the outcome
variable CHD versus age as the independent variable. The scatter plot of the data is shown in
the figure above.

In the above scatter plot, all points fall on two parallel lines representing the absence of CHD
(y=0) and the presence of CHD (y=1). The scatter plot clearly shows the dichotomous nature
of the outcome variable, but it does not give a clear picture of the relationship between CHD
and age.

Thus, the outcome variable, i.e., status of CHD, is dichotomous or binary, and there is no
linear relationship between the predictor variable (Age) and the response variable (outcome).
So, we choose to use the logistic regression model. The conditional mean of the dichotomous
variable and the range of the logistic function both lie between 0 and 1.

Since the scatter plot does not explain much about the relationship between CHD and age
(the predictor variable), we create intervals for the age variable and calculate the frequency
and the proportion of participants having CHD for each age group.

[Figure: Proportion of participants having CHD by age group (20-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-69)]

The graph gives a reasonable assessment of the functional relationship between the
proportion having CHD and age. With a dichotomous outcome variable, the conditional mean
must be greater than or equal to zero and less than or equal to one, which can be seen in the
figure above. The curve is said to be S-shaped and resembles a plot of the cumulative
distribution of a continuous random variable. The model we use is based on the logistic
distribution. Hence the logistic model is given mathematically by

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

A transformation of $\pi(x)$ is the logit transformation, given by

$$g(x) = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x$$

where the $\beta_j$ are estimated by logistic regression using SPSS.

Fitting the logistic regression model


The results of fitting the logistic regression model to the CHD and AGE data, n = 100

Variable   Coeff.   Std. Err.   Wald     df   p-value   Odds ratio
AGE        0.111    0.024       21.254   1    0.000     1.117
Constant   -5.309   1.134       21.935   1    0.000     0.005
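
The Wald column of the table can be checked roughly from the coefficient and standard error columns. The sketch below assumes that the reported Wald value is the squared ratio of coefficient to standard error, referred to a chi-square distribution with 1 degree of freedom. Because the inputs are rounded to three decimals, the result (about 21.4) only approximates the reported 21.254.

```python
import math

coef, se = 0.111, 0.024   # AGE row of the table, rounded to 3 decimals

z = coef / se             # ratio of estimate to standard error
wald = z ** 2             # squared form, assumed to match the Wald column

# Upper tail of chi-square with 1 df: P[chi2(1) > w] = erfc(sqrt(w / 2))
p_value = math.erfc(math.sqrt(wald / 2.0))

print(round(wald, 2), p_value < 0.001)
```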

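From the table, the estimated values are $\hat{\beta}_0 = -5.309$ and $\hat{\beta}_1 = 0.111$. Also, the
Wald statistic for age (W) is 21.254, which is significant at the 5% level of significance
(p-value < 0.001). The fitted values are given by the equation

$$\hat{\pi}(x) = \frac{e^{-5.309 + 0.111\,\mathrm{AGE}}}{1 + e^{-5.309 + 0.111\,\mathrm{AGE}}}$$

and the estimated logit is given by the equation

$$\hat{g}(x) = -5.309 + 0.111\,\mathrm{AGE}$$

The log-likelihood is computed using these estimates. From the -2 log likelihood of 107.353
given in the table below, it is -53.677.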

-2 Log likelihood   Nagelkerke R Square
107.353             0.341

Also, the log-likelihood for the model containing only the constant term is
[43 ln(43) + 57 ln(57) - 100 ln(100)] = -68.331 (n0 = 57, n1 = 43, n = 100)

Now, G = 2[-53.677 - (-68.331)] = 29.31

The p-value associated with this test is $P[\chi^2(1) > 29.31] < 0.001$; thus, we have evidence
that AGE is a significant variable in predicting CHD.
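
The likelihood ratio calculation above can be reproduced from the counts and the reported -2 log likelihood. This is a minimal sketch using only the figures quoted in this report; the variable names are illustrative.

```python
import math

n1, n0 = 43, 57            # participants with and without CHD
n = n1 + n0

# Log-likelihood of the constant-only model
ll_null = n1 * math.log(n1) + n0 * math.log(n0) - n * math.log(n)

# Log-likelihood of the model with AGE, from the reported -2 log likelihood
ll_model = -107.353 / 2.0

G = 2.0 * (ll_model - ll_null)

# Upper tail of chi-square with 1 df: P[chi2(1) > G] = erfc(sqrt(G / 2))
p_value = math.erfc(math.sqrt(G / 2.0))

print(round(ll_null, 3), round(G, 2), p_value < 0.001)
```
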
Nagelkerke R-square, one of the pseudo R-square measures, is an extension of the traditional
R-square that measures the goodness of fit of the model. From the above table, Nagelkerke
R-square is 0.341, which means that approximately 34.1% of the variation in the dependent
variable, i.e., status of CHD, is explained by the logistic regression model.
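
The reported Nagelkerke value can be recovered from the two log-likelihoods. The sketch below assumes the usual definition, the Cox and Snell R-square divided by its maximum attainable value, and uses the log-likelihoods quoted in this report.

```python
import math

n = 100
ll_null = -68.331          # constant-only model
ll_model = -107.353 / 2.0  # model with AGE

# Cox and Snell R-square
cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

# Largest value Cox and Snell R-square can reach for these data
max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)

nagelkerke = cox_snell / max_cox_snell
print(round(nagelkerke, 3))
```
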
Hosmer and Lemeshow Test:
Null hypothesis (H0): There is no difference between the observed and expected outcomes
based on the logistic regression model.
Alternative hypothesis (H1): There is a significant difference between the observed and
expected outcomes based on the logistic regression model.

Chi-square   p-value
0.890        0.999

Here, the Hosmer-Lemeshow statistic is 0.890 with a p-value of 0.999 (> 0.05), so at the 5%
level of significance we fail to reject the null hypothesis. This means that the model fits the
given data well.
Odds Ratio:
$$OR = e^{\hat{\beta}_1} = e^{0.111} = 1.117$$

From the calculation table we have odds ratio = 1.117, which means that the odds of having
coronary heart disease (CHD) are 1.117 times higher when comparing two groups with a
one-unit difference in the independent variable (in this case, one year of age).

The confidence interval for the OR is given by $\exp[\hat{\beta}_1 \pm z_{1-\alpha/2}\, SE(\hat{\beta}_1)]$ and is
obtained as (1.066, 1.171). This confidence interval suggests that the odds of having CHD
could increase by a factor of as little as 1.066 or as much as 1.171 for a one-year increase in
the age of patients.
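
The odds ratio and its interval follow directly from the AGE coefficient and its standard error. A minimal sketch, assuming a 95% interval with the normal quantile z = 1.96:

```python
import math

coef, se = 0.111, 0.024   # AGE coefficient and standard error from the table
z = 1.96                  # assumed quantile for a 95% interval

odds_ratio = math.exp(coef)
lower = math.exp(coef - z * se)
upper = math.exp(coef + z * se)

print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```

The interval is computed on the log-odds scale and then exponentiated, which is why it is not symmetric around 1.117.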

Fitting Logistic Regression model with AGE groups:
Logistic regression is performed between the status of CHD and the age groups of patients,
taking the age group 20-29 as the reference category. The following results are obtained.
Since Nagelkerke R2 is found to be 0.335, about 33.5% of the variation in the status of CHD
in patients is explained by the age group of patients.

-2 Log likelihood   Nagelkerke R Square
107.961             0.335

Also, the Hosmer and Lemeshow test is performed to see how well the model fits the
observed data. The chi-square statistic is 0.000 with a p-value of 1.000 (> 0.05), so we fail to
reject the null hypothesis and conclude that the model fits the data well.

Hosmer and Lemeshow Test

Chi-square   p-value
0.000        1.000

Fitting logistic regression between status of CHD and age groups with reference age group
20-29

                                                                             95% C.I. for Exp(β)
Age group            Coefficient (β)   Std. error   Wald     p-value   Exp(β)    Lower    Upper
20-29 (Ref. group)                                  21.855   0.003
30-34                0.325             1.299        0.063    0.802     1.385     0.108    17.670
35-39                1.099             1.247        0.776    0.378     3.000     0.260    34.575
40-44                1.504             1.188        1.603    0.205     4.500     0.439    46.170
45-49                2.043             1.192        2.938    0.087     7.714     0.746    79.771
50-54                2.708             1.282        4.460    0.035     15.000    1.215    185.198
55-59                3.376             1.199        7.925    0.005     29.250    2.789    306.811
60-69                3.584             1.318        7.397    0.007     36.000    2.721    476.276
Constant             -2.197            1.054        4.345    0.037     0.111

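From the SPSS result shown in the above table, we can write the fitted logistic model as:

$$\ln\left[\frac{\pi}{1-\pi}\right] = -2.197 + 0.325\,\text{age(30-34)} + 1.099\,\text{age(35-39)} + 1.504\,\text{age(40-44)} + 2.043\,\text{age(45-49)} + 2.708\,\text{age(50-54)} + 3.376\,\text{age(55-59)} + 3.584\,\text{age(60-69)}$$

where each age(.) term equals 1 for participants in that age group and 0 otherwise.
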
The table also shows that the estimated regression coefficients of the age groups 50-54 years, 55-59
years and 60-69 years are significant at the 5% level of significance, with p-values 0.035, 0.005, and
0.007 respectively. The standard errors and Wald statistics are also given in the table.
Furthermore, the odds ratios are given in the column Exp(β). All odds ratios exceed 1, suggesting
that the odds of developing CHD are higher in each of the corresponding age groups than in the
reference age group (20-29 years).
For example, the odds ratio for the age group 30-34 years is 1.385, which suggests that the odds of
developing CHD in the age group 30-34 years are 1.385 times those in the reference group 20-29
years, and so on. The highest odds of developing CHD are in the age group 60-69 years, with odds
36 times those of the reference age group.
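
The odds ratios in the table are the exponentiated coefficients, and the same coefficients give a predicted probability for each age group. The sketch below uses the coefficients from the age-group table; the helper name predicted_probability is hypothetical.

```python
import math

intercept = -2.197          # log-odds of CHD in the reference group (20-29)
coefficients = {            # age-group coefficients from the table
    "30-34": 0.325,
    "35-39": 1.099,
    "40-44": 1.504,
    "45-49": 2.043,
    "50-54": 2.708,
    "55-59": 3.376,
    "60-69": 3.584,
}

# Odds ratio of each group relative to the reference group
odds_ratios = {group: math.exp(b) for group, b in coefficients.items()}

def predicted_probability(group):
    """Predicted probability of CHD; the reference group has coefficient 0."""
    g = intercept + coefficients.get(group, 0.0)
    return 1.0 / (1.0 + math.exp(-g))

for group in ["20-29"] + list(coefficients):
    print(group, round(predicted_probability(group), 3))
```

An odds ratio of 36 compares odds, not probabilities: under this model the predicted probability rises from about 0.10 in the reference group to about 0.80 in the 60-69 group.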

Now, the predicted model with age as the independent variable and CHD as the response
variable is given by

$$\hat{\pi}(x) = \frac{e^{-5.309 + 0.111x}}{1 + e^{-5.309 + 0.111x}}$$

where x is the age of the respondent (in years).
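
The predicted model can be evaluated at any age using the estimates reported above (constant -5.309, AGE coefficient 0.111). A minimal sketch; the function name predict_chd is hypothetical.

```python
import math

B0, B1 = -5.309, 0.111    # fitted constant and AGE coefficient

def predict_chd(age):
    """Predicted probability of CHD at a given age in years."""
    z = B0 + B1 * age
    return 1.0 / (1.0 + math.exp(-z))

# Age at which the predicted probability is 0.5, i.e. where the logit is zero
age_half = -B0 / B1

for age in (30, 50, 65):
    print(age, round(predict_chd(age), 3))
print(round(age_half, 2))
```

Under the fitted model the predicted probability crosses 0.5 at roughly 48 years of age.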

CONCLUSION
From the above calculations and results, we fitted the logistic model to the given data with
the status of having CHD as the dependent variable and AGE as the independent variable,
and found that AGE is a significant variable in predicting CHD.
