
Logistic Regression

Classification Problems

Classification is an important category of problems in which the decision maker would like to classify cases (entities, customers) into two or more groups.

Examples of classification problems:
- Customer profiling (customer segmentation)
- Customer churn
- Credit classification (low, medium, and high risk)
- Employee attrition
- Fraud (classifying a transaction as fraud/no-fraud)
- Stress levels
- Text classification (sentiment analysis)
- The outcome of any binomial or multinomial experiment
Logistic Regression Classification

[Figure: class probability]
Simple Example of Logistic Regression

Linear regression is not appropriate when the response is categorical. Alternatively, logistic regression describes the relationship between a categorical response and a set of predictors. Specifically, we explore applications with a dichotomous (two-valued) response.

Example: Suppose researchers are interested in a potential relationship between patient age and the presence/absence of disease. The data set includes 20 patients:
   age disease
1   25       0
2   29       0
3   30       0
4   31       0
5   32       0
6   41       0
7   41       0
8   42       0
9   44       1
10  49       1
11  50       0
12  59       1
13  60       0
14  62       0
15  68       1
16  72       0
17  79       1
18  80       0
19  81       1
20  84       1

Simple Example of Logistic Regression (cont'd)

[Figure: the plot shows the least squares regression line (straight) and the logistic regression line (curved) for disease on age]
The logistic regression model is

$$P(Y = 1 \mid x) = \pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
- For low values of the predictor, the predicted probability is close to zero but never below 0.
- For high values of the predictor, the predicted probability is close to 1 but never above 1.
- The model is therefore better able to capture the range of probabilities.

Equivalently, writing the coefficients as α and β:

$$P(Y = 1 \mid X = x) = \pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$$

Estimating β with b

- b = 0 implies that P(Y = 1 | x) is the same for each value of x.
- b > 0 implies that P(Y = 1 | x) increases as the value of x increases.
- b < 0 implies that P(Y = 1 | x) decreases as the value of x increases.
With multiple predictors, the model is written in terms of the linear combination z:

$$P(Y = 1) = \pi(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}, \qquad z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
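This function maps any real-valued z to a probability strictly between 0 and 1. A quick sketch in R:

# Logistic (sigmoid) function: maps any real z into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)     # 0.5
sigmoid(-10)   # close to 0, but never negative
sigmoid(10)    # close to 1, but never above 1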
Maximum Likelihood Estimation

Probability of a Y = 1 response, given the data: π(x) = P(Y = 1 | x)
Probability of a Y = 0 response, given the data: 1 − π(x) = P(Y = 0 | x)

The likelihood of a single point (x, y) is

$$[\pi(x)]^{y}\,[1 - \pi(x)]^{1 - y}$$

Since the observations are assumed independent and take on values y_i = 0 or 1, the likelihood function is expressed as a product of such terms:

$$l(\beta \mid \mathbf{x}) = \prod_{i=1}^{n} [\pi(x_i)]^{y_i} \, [1 - \pi(x_i)]^{1 - y_i}$$

Maximum Likelihood Estimation (cont'd)

The log-likelihood form L(β | x) is more computationally tractable:

$$L(\beta \mid \mathbf{x}) = \ln[l(\beta \mid \mathbf{x})] = \sum_{i=1}^{n} \left\{ y_i \ln[\pi(x_i)] + (1 - y_i) \ln[1 - \pi(x_i)] \right\}$$

Finally, the maximum likelihood estimates are found by differentiating L(β | x) with respect to each parameter β_i and setting the result equal to 0. An iteratively reweighted least squares method is applied, since closed-form solutions to these equations do not exist.

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.
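To make this concrete, here is a minimal sketch that maximises the log-likelihood numerically, assuming the 20-patient example is stored in a data frame called patients with columns age and disease:

# Negative log-likelihood L(beta | x) from the formula above
neg_log_lik <- function(beta, x, y) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# optim() minimises, so we negate the log-likelihood to maximise it
fit <- optim(c(0, 0), neg_log_lik, x = patients$age, y = patients$disease)
fit$par  # approx. -4.372 and 0.067, matching the glm output shown later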
Estimation of LR Parameters

$$\frac{\partial \ln L(\beta_0, \beta_1)}{\partial \beta_0} = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} = 0$$

$$\frac{\partial \ln L(\beta_0, \beta_1)}{\partial \beta_1} = \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} \frac{x_i \exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} = 0$$

The above system of equations is solved iteratively to estimate β₀ and β₁.
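In practice these equations are solved by Fisher scoring (iteratively reweighted least squares), which is what R's glm() does internally. Presumably the lr1 model object used in later slides was fit along these lines, assuming the patients data frame from the earlier example:

# Fit the logistic regression of disease on age by Fisher scoring
lr1 <- glm(disease ~ age, family = binomial, data = patients)
summary(lr1)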
Linear vs. Logistic

Function:
- Linear:   $E(Y \mid x) = \beta_0 + \beta_1 x$
- Logistic: $E(Y \mid x) = \pi(x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$, bounded since $\lim_{a \to -\infty} \frac{e^a}{1 + e^a} = 0$ and $\lim_{a \to \infty} \frac{e^a}{1 + e^a} = 1$

Cost function:
- Linear:   sum of squared errors, $SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
- Logistic: likelihood function, $l(\beta \mid \mathbf{x}) = \prod_{i=1}^{n} [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}$

Estimation:
- Linear:   least squares method
- Logistic: maximum likelihood estimation

Parameters:
- Linear:   linear in the parameters
- Logistic: non-linear in the parameters
Linear vs. Logistic (cont'd)

Errors:
- Linear:   errors are normally distributed
- Logistic: because the response is dichotomous, the error takes one of two forms:
  if Y = 1 (which occurs with probability π(x)), then ε = 1 − π(x);
  if Y = 0 (which occurs with probability 1 − π(x)), then ε = 0 − π(x) = −π(x)

Error variance:
- Linear:   constant error variance
- Logistic: Var(ε) = π(x)·(1 − π(x))

Mean response:
- Linear:   the mean response is not constrained between 0 and 1
- Logistic: the mean response should lie between 0 and 1
Interpreting Logistic Regression Output

The fitted logit and probability are

$$\hat{g}(x) = -4.372 + 0.06696\,(\text{age}), \qquad \hat{\pi}(x) = \frac{e^{\hat{g}(x)}}{1 + e^{\hat{g}(x)}} = \frac{e^{-4.372 + 0.06696(\text{age})}}{1 + e^{-4.372 + 0.06696(\text{age})}}$$

To estimate the probability of disease being present in a particular patient of age 50:

$$\hat{g}(50) = -4.372 + 0.06696(50) = -1.024, \qquad \hat{\pi}(50) = \frac{e^{-1.024}}{1 + e^{-1.024}} = 0.26$$

Call:
glm(formula = disease ~ age, family = binomial, data = patients)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6136  -0.6591  -0.4310   0.7856   1.8118

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.37210    1.96555  -2.224   0.0261 *
age          0.06696    0.03223   2.077   0.0378 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25.898  on 19  degrees of freedom
Residual deviance: 20.201  on 18  degrees of freedom
AIC: 24.201

Number of Fisher Scoring iterations: 4
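The same prediction can be obtained directly from the fitted model; a sketch, assuming the lr1 object fit earlier:

# Predicted probability of disease for a patient aged 50
predict(lr1, newdata = data.frame(age = 50), type = "response")
# approx. 0.26, matching the hand calculation above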
Interpreting Logistic Regression Coefficients

The slope coefficient β₁ is interpreted as the change in the value of the logit for a unit increase in the predictor:

$$g(x) = \ln\!\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x$$

$$\beta_1 = g(x + 1) - g(x)$$

If β₁ is positive, then as x increases the logit increases, and hence π(x) increases.
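Equivalently, exp(β₁) is the odds ratio: the multiplicative change in the odds of Y = 1 for a unit increase in x. A quick check with the fitted model, again assuming lr1:

exp(coef(lr1))["age"]  # exp(0.06696) = 1.069: odds rise about 6.9% per year of age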
From the model output:

    Null deviance: 25.898  on 19  degrees of freedom
Residual deviance: 20.201  on 18  degrees of freedom

Deviance is a measure of badness of fit: higher numbers indicate a worse fit.

The null deviance is the deviance when the response variable is predicted by a model that includes only the intercept (grand mean).

The residual deviance is the deviance when the response variable is predicted by the predictor variable(s).
Overall Model Significance

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$
$$H_A: \text{not all } \beta_i = 0$$

Test statistic: null deviance − residual deviance.

Under H₀ it follows a chi-square distribution with k degrees of freedom (one for each predictor).
# Overall significance of the model
with(lr1, null.deviance - deviance)
with(lr1, df.null - df.residual)
with(lr1, pchisq(null.deviance - deviance, df.null - df.residual,
                 lower.tail = FALSE))

Running this for the disease-age model:

> with(lr1, null.deviance - deviance)
[1] 5.696407
> with(lr1, df.null - df.residual)
[1] 1
> with(lr1, pchisq(null.deviance - deviance, df.null - df.residual,
+                  lower.tail = FALSE))
[1] 0.01699968

Since the p-value 0.017 is below 0.05, we reject H₀: the model is significant overall.
Inference: Is a Predictor Significant?

The Wald test is the hypothesis test for assessing the significance of an individual predictor.

Under H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0, the Wald statistic

$$Z_{\text{Wald}} = \frac{b_1}{SE(b_1)}$$

follows a standard normal distribution.

From the coefficient table of the fitted model:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.37210    1.96555  -2.224   0.0261 *
age          0.06696    0.03223   2.077   0.0378 *

For age, Z = 0.06696 / 0.03223 = 2.077 with p-value 0.0378, so age is a significant predictor at the 5% level. A 95% confidence interval for β₁ is

$$b_1 \pm z \cdot SE(b_1) = 0.06696 \pm (1.96)(0.03223) = 0.06696 \pm 0.06317 = (0.00379,\ 0.13013)$$
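The same quantities can be computed from the model object; a sketch, again assuming lr1:

# Wald z-statistic and 95% confidence interval for the age coefficient
b1  <- coef(summary(lr1))["age", "Estimate"]
se1 <- coef(summary(lr1))["age", "Std. Error"]
b1 / se1                   # z = 2.077
b1 + c(-1.96, 1.96) * se1  # (0.00379, 0.13013)
confint.default(lr1)       # Wald intervals for both coefficients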
Example data set (first rows):

  npreg glu bp skin  bmi   ped age type
1     6 148 72   35 33.6 0.627  50  Yes
2     1  85 66   29 26.6 0.351  31   No
3     1  89 66   23 28.1 0.167  21   No
4     3  78 50   32 31.0 0.248  26  Yes
5     2 197 70   45 30.5 0.158  53  Yes
6     5 166 72   19 25.8 0.587  51  Yes
7     0 118 84   47 45.8 0.551  31  Yes

- npreg: number of pregnancies
- glu: plasma glucose concentration in an oral glucose tolerance test
- bp: diastolic blood pressure (mm Hg)
- skin: triceps skin fold thickness (mm)
- bmi: body mass index (weight in kg / (height in m)^2)
- ped: diabetes pedigree function
- age: age in years
- type: Yes or No, diabetic according to WHO criteria
Exercise: refer to the file Pima.te.

Q.1 Check the overall significance of the model. What is the value of the test statistic, and what are its degrees of freedom?
Q.2 Which variables are significant? What are the z-statistics for bp and skin?
Q.3 Predict the probability of disease for a person with the values (npreg = 2, glu = 90, bp = 70, bmi = 44, ped = 0.487, age = 60).
Q.4 Is this person classified as diabetic or not?
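A sketch of how the questions might be approached, assuming Pima.te is the copy of the data shipped with the MASS package:

library(MASS)  # provides Pima.te

# Q.1 and Q.2: overall and per-variable significance
fit <- glm(type ~ ., data = Pima.te, family = binomial)
summary(fit)  # z-statistics and p-values per predictor, including bp and skin
with(fit, null.deviance - deviance)  # chi-square test statistic
with(fit, df.null - df.residual)     # its degrees of freedom
with(fit, pchisq(null.deviance - deviance, df.null - df.residual,
                 lower.tail = FALSE))

# Q.3 and Q.4: Q.3 supplies no value for skin, so refit without it
fit2 <- glm(type ~ npreg + glu + bp + bmi + ped + age,
            data = Pima.te, family = binomial)
p <- predict(fit2,
             newdata = data.frame(npreg = 2, glu = 90, bp = 70,
                                  bmi = 44, ped = 0.487, age = 60),
             type = "response")
p           # predicted probability of diabetes
p >= 0.5    # classify as diabetic using a 0.5 cutoff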
Assumptions

- The dependent variable should be binary.
- Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the less frequent outcome for each independent variable in your model.
- Logistic regression assumes linearity between the independent variables and the log odds.
Cutoff Probability

Given a cutoff probability a, classify an observation as:

$$\hat{Y} = 1 \quad \text{if } P(y = 1) \geq a$$
$$\hat{Y} = 0 \quad \text{if } P(y = 1) < a$$
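Applied to the disease-age example, a sketch (lr1 and patients as before, with a = 0.5):

a <- 0.5
prob <- predict(lr1, type = "response")  # fitted P(Y = 1) for each patient
pred <- ifelse(prob >= a, 1, 0)
table(Actual = patients$disease, Predicted = pred)  # confusion matrix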
Model Evaluation

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Accuracy = (TP + TN) / Total

where TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.

Another criterion: the area under the ROC curve (AUC).
Example confusion matrix:

                          Predicted
Actual        0 (negative)   1 (positive)
0 (negative)            40              4
1 (positive)            14              9

False positives (FP):  4
False negatives (FN): 14
True positives (TP):   9
True negatives (TN):  40

Accuracy:    49/67
Sensitivity: 9/23
Specificity: 40/44
TPR:         9/23
FPR:         4/44
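These metrics are straightforward to compute from the four cell counts; a sketch in R using the table above:

TN <- 40; FP <- 4; FN <- 14; TP <- 9

TP / (TP + FN)                   # sensitivity (TPR): 9/23  = 0.391
TN / (TN + FP)                   # specificity:       40/44 = 0.909
(TP + TN) / (TP + TN + FP + FN)  # accuracy:          49/67 = 0.731
FP / (FP + TN)                   # FPR:               4/44  = 0.091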
Accuracy Paradox

Suppose we have a data set with 100 observations: 20 observations with value "1" and 80 with value "0". A classifier predicts all the observations to be "0".

                          Predicted
Actual        0 (negative)   1 (positive)
0 (negative)       80 (TN)         0 (FP)
1 (positive)       20 (FN)         0 (TP)

accuracy <- (p[1,1] + p[2,2]) / sum(p)
# [1] 0.8

The accuracy is 0.8 even though the classifier never detects a single "1": its sensitivity is 0/20 = 0. High accuracy alone can therefore be misleading on imbalanced data.
Optimal Cutoff

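One common way to choose the cutoff is to maximise Youden's J statistic (sensitivity + specificity − 1) along the ROC curve. A minimal sketch using the pROC package (an assumed choice; the slides do not name one), with lr1 and patients as before:

library(pROC)

roc_obj <- roc(patients$disease, predict(lr1, type = "response"))
coords(roc_obj, "best", best.method = "youden")  # cutoff maximising Youden's J
auc(roc_obj)                                     # area under the ROC curve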
