
Logistic Regression

Classification Problems

Classification is an important category of problems in which the decision maker would like to classify cases (entities, customers) into two or more groups.

Examples of classification problems:
- Customer profiling (customer segmentation)
- Customer churn
- Credit classification (low, medium, and high risk)
- Employee attrition
- Fraud (classifying a transaction as fraud/no-fraud)
- Stress levels
- Text classification (sentiment analysis)
- The outcome of any binomial or multinomial experiment
Logistic Regression Classification

[Figure: class probability]
Simple Example of Logistic Regression

Linear regression is not appropriate when the response is categorical. Alternatively, logistic regression describes the relationship between a categorical response and a set of predictors. Specifically, we explore applications with a dichotomous (two-valued) response.

Example: Suppose researchers are interested in a potential relationship between patient age and the presence/absence of disease. The data set includes 20 patients:
   age disease
1   25       0
2   29       0
3   30       0
4   31       0
5   32       0
6   41       0
7   41       0
8   42       0
9   44       1
10  49       1
11  50       0
12  59       1
13  60       0
14  62       0
15  68       1
16  72       0
17  79       1
18  80       0
19  81       1
20  84       1

Simple Example of Logistic Regression (cont'd)

[Figure: the plot shows the least squares regression line (straight) and the logistic regression line (curved) for disease on age]
The logistic regression model is

$$P(Y = 1 \mid x) = \pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
- For low values of the predictor, the predicted probability is close to zero but never below 0.
- For high values of the predictor, the predicted probability is close to 1 but never above 1.
- The model is therefore better able to capture the range of probabilities.

Equivalently, writing the coefficients as α and β:

$$P(Y = 1 \mid X = x) = \pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$$

Estimating β with b

- b = 0 implies that P(Y = 1 | x) is the same for each value of x.
- b > 0 implies that P(Y = 1 | x) increases as the value of x increases.
- b < 0 implies that P(Y = 1 | x) decreases as the value of x increases.
With multiple predictors, the model is written in terms of the linear combination z:

$$P(Y = 1) = \pi(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}, \qquad z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
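This function maps any real-valued z to a probability strictly between 0 and 1. A quick sketch in R:

# Logistic (sigmoid) function: maps any real z into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)     # 0.5
sigmoid(-10)   # close to 0, but never negative
sigmoid(10)    # close to 1, but never above 1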
Maximum Likelihood Estimation

Probability of a Y = 1 response, given the data: π(x) = P(Y = 1 | x)
Probability of a Y = 0 response, given the data: 1 − π(x) = P(Y = 0 | x)

The likelihood of a single point (x, y) is

$$[\pi(x)]^{y}\,[1 - \pi(x)]^{1 - y}$$

Since the observations are assumed independent and take on values y_i = 0 or 1, the likelihood function is expressed as a product of such terms:

$$l(\beta \mid \mathbf{x}) = \prod_{i=1}^{n} [\pi(x_i)]^{y_i} \, [1 - \pi(x_i)]^{1 - y_i}$$

Maximum Likelihood Estimation (cont'd)

The log-likelihood form L(β | x) is more computationally tractable:

$$L(\beta \mid \mathbf{x}) = \ln[l(\beta \mid \mathbf{x})] = \sum_{i=1}^{n} \left\{ y_i \ln[\pi(x_i)] + (1 - y_i) \ln[1 - \pi(x_i)] \right\}$$

Finally, the maximum likelihood estimates are found by differentiating L(β | x) with respect to each parameter β_i and setting the result equal to 0. An iteratively reweighted least squares method is applied, since closed-form solutions to these equations do not exist.

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.
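To make this concrete, here is a minimal sketch that maximises the log-likelihood numerically, assuming the 20-patient example is stored in a data frame called patients with columns age and disease:

# Negative log-likelihood L(beta | x) from the formula above
neg_log_lik <- function(beta, x, y) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# optim() minimises, so we negate the log-likelihood to maximise it
fit <- optim(c(0, 0), neg_log_lik, x = patients$age, y = patients$disease)
fit$par  # approx. -4.372 and 0.067, matching the glm output shown later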
Estimation of LR Parameters

$$\frac{\partial \ln L(\beta_0, \beta_1)}{\partial \beta_0} = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} = 0$$

$$\frac{\partial \ln L(\beta_0, \beta_1)}{\partial \beta_1} = \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} \frac{x_i \exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} = 0$$

The above system of equations is solved iteratively to estimate β₀ and β₁.
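In practice these equations are solved by Fisher scoring (iteratively reweighted least squares), which is what R's glm() does internally. Presumably the lr1 model object used in later slides was fit along these lines, assuming the patients data frame from the earlier example:

# Fit the logistic regression of disease on age by Fisher scoring
lr1 <- glm(disease ~ age, family = binomial, data = patients)
summary(lr1)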
Linear vs. Logistic

Function:
- Linear:   $E(Y \mid x) = \beta_0 + \beta_1 x$
- Logistic: $E(Y \mid x) = \pi(x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$, bounded since $\lim_{a \to -\infty} \frac{e^a}{1 + e^a} = 0$ and $\lim_{a \to \infty} \frac{e^a}{1 + e^a} = 1$

Cost function:
- Linear:   sum of squared errors, $SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
- Logistic: likelihood function, $l(\beta \mid \mathbf{x}) = \prod_{i=1}^{n} [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}$

Estimation:
- Linear:   least squares method
- Logistic: maximum likelihood estimation

Parameters:
- Linear:   linear in the parameters
- Logistic: non-linear in the parameters
Linear vs. Logistic (cont'd)

Errors:
- Linear:   errors are normally distributed
- Logistic: because the response is dichotomous, the error takes one of two forms:
  if Y = 1 (which occurs with probability π(x)), then ε = 1 − π(x);
  if Y = 0 (which occurs with probability 1 − π(x)), then ε = 0 − π(x) = −π(x)

Error variance:
- Linear:   constant error variance
- Logistic: Var(ε) = π(x)·(1 − π(x))

Mean response:
- Linear:   the mean response is not constrained between 0 and 1
- Logistic: the mean response should lie between 0 and 1
Interpreting Logistic Regression Output

The fitted logit and probability are

$$\hat{g}(x) = -4.372 + 0.06696\,(\text{age}), \qquad \hat{\pi}(x) = \frac{e^{\hat{g}(x)}}{1 + e^{\hat{g}(x)}} = \frac{e^{-4.372 + 0.06696(\text{age})}}{1 + e^{-4.372 + 0.06696(\text{age})}}$$

To estimate the probability of disease being present in a particular patient of age 50:

$$\hat{g}(50) = -4.372 + 0.06696(50) = -1.024, \qquad \hat{\pi}(50) = \frac{e^{-1.024}}{1 + e^{-1.024}} = 0.26$$

Call:
glm(formula = disease ~ age, family = binomial, data = patients)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6136  -0.6591  -0.4310   0.7856   1.8118

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.37210    1.96555  -2.224   0.0261 *
age          0.06696    0.03223   2.077   0.0378 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25.898  on 19  degrees of freedom
Residual deviance: 20.201  on 18  degrees of freedom
AIC: 24.201

Number of Fisher Scoring iterations: 4
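The same prediction can be obtained directly from the fitted model; a sketch, assuming the lr1 object fit earlier:

# Predicted probability of disease for a patient aged 50
predict(lr1, newdata = data.frame(age = 50), type = "response")
# approx. 0.26, matching the hand calculation above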
Interpreting Logistic Regression Coefficients

The slope coefficient β₁ is interpreted as the change in the value of the logit for a unit increase in the predictor:

$$g(x) = \ln\!\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x$$

$$\beta_1 = g(x + 1) - g(x)$$

If β₁ is positive, then as x increases the logit increases, and hence π(x) increases.
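Equivalently, exp(β₁) is the odds ratio: the multiplicative change in the odds of Y = 1 for a unit increase in x. A quick check with the fitted model, again assuming lr1:

exp(coef(lr1))["age"]  # exp(0.06696) = 1.069: odds rise about 6.9% per year of age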
From the model output:

    Null deviance: 25.898  on 19  degrees of freedom
Residual deviance: 20.201  on 18  degrees of freedom

Deviance is a measure of badness of fit: higher numbers indicate a worse fit.

The null deviance is the deviance when the response variable is predicted by a model that includes only the intercept (grand mean).

The residual deviance is the deviance when the response variable is predicted by the predictor variable(s).
Overall Model Significance

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$
$$H_A: \text{not all } \beta_i = 0$$

Test statistic: null deviance − residual deviance.

Under H₀ it follows a chi-square distribution with k degrees of freedom (one for each predictor).
# Overall significance of the model
with(lr1, null.deviance - deviance)
with(lr1, df.null - df.residual)
with(lr1, pchisq(null.deviance - deviance, df.null - df.residual,
                 lower.tail = FALSE))

Running this for the disease-age model:

> with(lr1, null.deviance - deviance)
[1] 5.696407
> with(lr1, df.null - df.residual)
[1] 1
> with(lr1, pchisq(null.deviance - deviance, df.null - df.residual,
+                  lower.tail = FALSE))
[1] 0.01699968

Since the p-value 0.017 is below 0.05, we reject H₀: the model is significant overall.
Inference: Is a Predictor Significant?

The Wald test is the hypothesis test for assessing the significance of an individual predictor.

Under H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0, the Wald statistic

$$Z_{\text{Wald}} = \frac{b_1}{SE(b_1)}$$

follows a standard normal distribution.

From the coefficient table of the fitted model:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.37210    1.96555  -2.224   0.0261 *
age          0.06696    0.03223   2.077   0.0378 *

For age, Z = 0.06696 / 0.03223 = 2.077 with p-value 0.0378, so age is a significant predictor at the 5% level. A 95% confidence interval for β₁ is

$$b_1 \pm z \cdot SE(b_1) = 0.06696 \pm (1.96)(0.03223) = 0.06696 \pm 0.06317 = (0.00379,\ 0.13013)$$
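The same quantities can be computed from the model object; a sketch, again assuming lr1:

# Wald z-statistic and 95% confidence interval for the age coefficient
b1  <- coef(summary(lr1))["age", "Estimate"]
se1 <- coef(summary(lr1))["age", "Std. Error"]
b1 / se1                   # z = 2.077
b1 + c(-1.96, 1.96) * se1  # (0.00379, 0.13013)
confint.default(lr1)       # Wald intervals for both coefficients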
Example data set (first rows):

  npreg glu bp skin  bmi   ped age type
1     6 148 72   35 33.6 0.627  50  Yes
2     1  85 66   29 26.6 0.351  31   No
3     1  89 66   23 28.1 0.167  21   No
4     3  78 50   32 31.0 0.248  26  Yes
5     2 197 70   45 30.5 0.158  53  Yes
6     5 166 72   19 25.8 0.587  51  Yes
7     0 118 84   47 45.8 0.551  31  Yes

- npreg: number of pregnancies
- glu: plasma glucose concentration in an oral glucose tolerance test
- bp: diastolic blood pressure (mm Hg)
- skin: triceps skin fold thickness (mm)
- bmi: body mass index (weight in kg / (height in m)^2)
- ped: diabetes pedigree function
- age: age in years
- type: Yes or No, diabetic according to WHO criteria
Exercise: refer to the file Pima.te.

Q.1 Check the overall significance of the model. What is the value of the test statistic, and what are its degrees of freedom?
Q.2 Which variables are significant? What are the z-statistics for bp and skin?
Q.3 Predict the probability of disease for a person with the values (npreg = 2, glu = 90, bp = 70, bmi = 44, ped = 0.487, age = 60).
Q.4 Is this person classified as diabetic or not?
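A sketch of how the questions might be approached, assuming Pima.te is the copy of the data shipped with the MASS package:

library(MASS)  # provides Pima.te

# Q.1 and Q.2: overall and per-variable significance
fit <- glm(type ~ ., data = Pima.te, family = binomial)
summary(fit)  # z-statistics and p-values per predictor, including bp and skin
with(fit, null.deviance - deviance)  # chi-square test statistic
with(fit, df.null - df.residual)     # its degrees of freedom
with(fit, pchisq(null.deviance - deviance, df.null - df.residual,
                 lower.tail = FALSE))

# Q.3 and Q.4: Q.3 supplies no value for skin, so refit without it
fit2 <- glm(type ~ npreg + glu + bp + bmi + ped + age,
            data = Pima.te, family = binomial)
p <- predict(fit2,
             newdata = data.frame(npreg = 2, glu = 90, bp = 70,
                                  bmi = 44, ped = 0.487, age = 60),
             type = "response")
p           # predicted probability of diabetes
p >= 0.5    # classify as diabetic using a 0.5 cutoff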
Assumptions

- The dependent variable should be binary.
- Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the less frequent outcome for each independent variable in your model.
- Logistic regression assumes linearity between the independent variables and the log odds.
Cutoff Probability

Given a cutoff probability a, classify an observation as:

$$\hat{Y} = 1 \quad \text{if } P(y = 1) \geq a$$
$$\hat{Y} = 0 \quad \text{if } P(y = 1) < a$$
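Applied to the disease-age example, a sketch (lr1 and patients as before, with a = 0.5):

a <- 0.5
prob <- predict(lr1, type = "response")  # fitted P(Y = 1) for each patient
pred <- ifelse(prob >= a, 1, 0)
table(Actual = patients$disease, Predicted = pred)  # confusion matrix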
Model Evaluation

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Accuracy = (TP + TN) / Total

where TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.

Another criterion: the area under the ROC curve (AUC).
Example confusion matrix:

                          Predicted
Actual        0 (negative)   1 (positive)
0 (negative)            40              4
1 (positive)            14              9

False positives (FP):  4
False negatives (FN): 14
True positives (TP):   9
True negatives (TN):  40

Accuracy:    49/67
Sensitivity: 9/23
Specificity: 40/44
TPR:         9/23
FPR:         4/44
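These metrics are straightforward to compute from the four cell counts; a sketch in R using the table above:

TN <- 40; FP <- 4; FN <- 14; TP <- 9

TP / (TP + FN)                   # sensitivity (TPR): 9/23  = 0.391
TN / (TN + FP)                   # specificity:       40/44 = 0.909
(TP + TN) / (TP + TN + FP + FN)  # accuracy:          49/67 = 0.731
FP / (FP + TN)                   # FPR:               4/44  = 0.091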
Accuracy Paradox

Suppose we have a data set with 100 observations: 20 observations with value "1" and 80 with value "0". A classifier predicts all the observations to be "0".

                          Predicted
Actual        0 (negative)   1 (positive)
0 (negative)       80 (TN)         0 (FP)
1 (positive)       20 (FN)         0 (TP)

accuracy <- (p[1,1] + p[2,2]) / sum(p)
# [1] 0.8

The accuracy is 0.8 even though the classifier never detects a single "1": its sensitivity is 0/20 = 0. High accuracy alone can therefore be misleading on imbalanced data.
Optimal Cutoff

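One common way to choose the cutoff is to maximise Youden's J statistic (sensitivity + specificity − 1) along the ROC curve. A minimal sketch using the pROC package (an assumed choice; the slides do not name one), with lr1 and patients as before:

library(pROC)

roc_obj <- roc(patients$disease, predict(lr1, type = "response"))
coords(roc_obj, "best", best.method = "youden")  # cutoff maximising Youden's J
auc(roc_obj)                                     # area under the ROC curve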
