
LOGISTIC REGRESSION

CLASSIFICATION PROBLEMS
Classification is an important category of problems in which the decision maker would like to
classify cases/entities/customers into two or more groups.

Examples of Classification Problems:


 Customer profiling (customer segmentation)
 Customer churn
 Credit classification (low, medium and high risk)
 Employee attrition
 Fraud detection (classifying transactions as fraud / not fraud)
 Stress levels
 Text classification (sentiment analysis)
 Outcome of any binomial or multinomial experiment

3
CHALLENGING CLASSIFICATION PROBLEMS

 Ransomware

 Anomaly Detection

 Image Classification (Medical Devices, Satellite images)

 Text Classification

4
CLASSIFICATION PROBLEMS
 Classification problems are an important category of problems in analytics in which the
response variable (Y) takes a discrete value.
 The primary objective is to predict the class of a customer (or class probability) based
on the values of explanatory variables or predictors.
 Logistic regression is referred to as regression because it takes the output of a linear
regression function as input and uses a sigmoid function to estimate the probability of the
given class.
 The difference between linear regression and logistic regression is that linear regression
outputs a continuous value that can take any real value, while logistic regression predicts
the probability that an instance belongs to a given class.

5
EXAMPLE
 The X-axis of this graph displays the number of years in the company, which is the independent
variable. The Y-axis shows the probability that a person will get promoted; these values
range from 0 to 1.

 A probability of 0 indicates that the person will not get promoted, and 1 tells us that they will get promoted.
 Logistic regression returns an outcome
of 0 (Promoted = No) for probabilities
less than 0.5. A prediction of 1
(Promoted = Yes) is returned for
probabilities greater than or equal to 0.5:

6
EXAMPLE
 You can see that as an employee spends more time working in the company, their chances of
getting promoted increase.

Why do we need logistic regression? Why can't we simply use a straight line to predict whether a person will get promoted?

7
EXAMPLE
 Linear regression is a technique that is commonly used to model problems with continuous output.
Here is an example of how linear regression fits a straight line to model the observed data:

 For classification problems, however, we


cannot fit a straight line like this onto
the data points at hand.

 This is because the line of best fit in linear regression has no upper or lower bound, so the
predictions can become negative or exceed one.

8
EXAMPLE
If we used linear regression to model whether a person will get promoted:

 Linear regression still works for certain values.


 For people with 10 years of work experience, we
get a prediction of 1.
 For 3 years of work experience, the model predicts
a probability of 0.3, which will return an outcome of
0.
 However, since there is no upper bound, observe
how the predictions for employees with 12 years of
work experience exceed 1.
 Similarly, for employees with less than 1 year of
work experience, the predictions become negative.
 Since we want to predict a binary outcome
(yes/no), the predictions need to range from 0 to 1.
It isn’t possible to have negative predictions or
predictions that exceed 1.
9
EXAMPLE
To ensure that the outcome is always a probability, logistic regression uses the sigmoid
function to squash the output of linear regression between 0 and 1:

10
LOGISTIC REGRESSION

Logistic regression is used for:

 Classification
 Discrete choice
 Class probability
12
LOGISTIC REGRESSION - INTRODUCTION
Logistic Function (Sigmoidal function)

 The name logistic regression comes from the logistic distribution function.


 A sigmoid function is a mathematical function having a characteristic "S"-shaped curve or
sigmoid curve.
 Sigmoid function or the logistic function:

1 / (1 + e^(−z)) = e^z / (1 + e^z)

 Mathematically, logistic regression attempts to estimate the conditional probability of an event (or class probability).

13
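A minimal Python sketch of the logistic (sigmoid) function, checking that the two algebraic forms above agree (the function name and sample values are illustrative, not from the slides):

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number z to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(z))                    # 1 / (1 + e^-z)
print(np.exp(z) / (1 + np.exp(z)))   # e^z / (1 + e^z): same values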
BINARY LOGISTIC REGRESSION

P(Y = 1) = π(z) = 1 / (1 + e^(−z)) = e^z / (1 + e^z)

where z = β0 + β1 x1 + β2 x2 + ... + βn xn

Y – response variable that takes only two values; for example, Y = 1 (positive outcome) or Y = 0 (negative outcome).

x1, ..., xn – independent variables (predictors).

14
LOGISTIC REGRESSION WITH ONE EXPLANATORY VARIABLE

P(Y = 1 | X = x) = π(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x))

β1 = 0 implies that P(Y = 1 | x) is the same for every value of x
β1 > 0 implies that P(Y = 1 | x) increases as the value of x increases
β1 < 0 implies that P(Y = 1 | x) decreases as the value of x increases
15
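A small numeric illustration of how the sign of β1 drives P(Y = 1 | x); the β values below are made up for illustration:

import numpy as np

def pi(x, b0, b1):
    # P(Y = 1 | X = x) for a single-predictor logistic regression model
    return np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

x = np.arange(5)
print(pi(x, b0=-2.0, b1=1.0))   # beta1 > 0: probability increases with x
print(pi(x, b0=2.0, b1=-1.0))   # beta1 < 0: probability decreases with x
print(pi(x, b0=0.0, b1=0.0))    # beta1 = 0: same probability (0.5) for every x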
LOGISTIC TRANSFORMATION

 The logistic regression model is given by:

π = e^(β0 + β1 x) / (1 + e^(β0 + β1 x))

Dividing π by (1 − π) gives the odds, the likelihood of an occurrence relative to the likelihood of a non-occurrence:

π / (1 − π) = e^(β0 + β1 x)

Taking the natural logarithm gives a function with linear properties (the link function):

ln( π / (1 − π) ) = β0 + β1 x

16
LOGIT FUNCTION
 The logit function is the logarithmic transformation of the logistic function.
 It is defined as the natural logarithm of odds.
 Logit of a variable π (with value between 0 and 1) is given by:

Logit(π) = ln( π / (1 − π) ) = β0 + β1 x

17
PARAMETER ESTIMATION IN LOGISTIC
REGRESSION (MAXIMUM LIKELIHOOD
ESTIMATE)

18
LIKELIHOOD FUNCTION FOR BINARY LOGISTIC FUNCTION

 One of the major assumptions of simple linear regression and multiple linear
regression is that the residuals follow a normal distribution. However, the
residuals in logistic regression do not follow a normal distribution.
 For example, consider a regression where the response variable Y takes only
two values (0 or 1):

Yi = β0 + β1 x1i + εi

 The error εi = Yi − (β0 + β1 x1i) can take only one of two values for a given xi, so it cannot follow a normal distribution.

19
LIKELIHOOD FUNCTION FOR BINARY LOGISTIC FUNCTION

Probability density function for binary logistic regression is given by:

f(yi) = πi^yi (1 − πi)^(1 − yi)

L(β) = f(y1, y2, ..., yn) = Π (i = 1 to n) πi^yi (1 − πi)^(1 − yi)

ln(L(β)) = Σ (i = 1 to n) yi ln π(xi) + Σ (i = 1 to n) (1 − yi) ln(1 − π(xi))

20
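A short sketch of the log-likelihood above in Python; the class labels and fitted probabilities below are hypothetical:

import numpy as np

def log_likelihood(y, pi):
    # ln L(beta) = sum_i [ y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i) ]
    y, pi = np.asarray(y, dtype=float), np.asarray(pi, dtype=float)
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# hypothetical class labels and fitted probabilities for four observations
print(log_likelihood([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.1]))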
LIKELIHOOD FUNCTION FOR BINARY LOGISTIC FUNCTION

Substituting πi = exp(β0 + β1 xi) / (1 + exp(β0 + β1 xi)) gives:

ln[L(β)] = Σ (i = 1 to n) yi (β0 + β1 xi) − Σ (i = 1 to n) ln(1 + exp(β0 + β1 xi))

21
ESTIMATION OF LR PARAMETERS

∂ln(L(β0, β1)) / ∂β0 = Σ (i = 1 to n) yi − Σ (i = 1 to n) exp(β0 + β1 xi) / (1 + exp(β0 + β1 xi)) = 0

∂ln(L(β0, β1)) / ∂β1 = Σ (i = 1 to n) xi yi − Σ (i = 1 to n) xi exp(β0 + β1 xi) / (1 + exp(β0 + β1 xi)) = 0

The above system of equations is solved iteratively to estimate β0 and β1.

22
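A sketch of the iterative estimation, assuming SciPy is available: the negative of the log-likelihood from the previous slide is minimized numerically over (β0, β1). The data below are made up for illustration:

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, x, y):
    # -ln L(beta) = -[ sum y_i (b0 + b1 x_i) - sum ln(1 + exp(b0 + b1 x_i)) ]
    z = beta[0] + beta[1] * x
    return -(np.sum(y * z) - np.sum(np.log(1 + np.exp(z))))

# hypothetical data (any numeric x and 0/1 y would do)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 0, 1, 1])

result = minimize(neg_log_likelihood, x0=np.zeros(2), args=(x, y), method="BFGS")
print(result.x)   # estimated (beta0, beta1)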
LIMITATIONS OF MLE

 Maximum likelihood estimator may not be unique or may not exist.

 A closed-form solution may not exist in many cases; one may have to use an
iterative procedure to estimate the parameter values.

23
LOGISTIC REGRESSION MODEL DEVELOPMENT

1. Explore the data.
2. Pre-process the data – derive and analyze descriptive statistics.
3. Divide the data into training and validation data.
4. Define the functional form of the relationship.
5. Estimate the regression parameters.
6. Perform diagnostic tests.
7. If the model satisfies the diagnostic tests, stop; otherwise revise the model and repeat.
24
SPACE SHUTTLE CHALLENGER CRASH

 Space shuttle orbiter Challenger (Mission STS-51-L) was the 25th shuttle
launched by NASA on January 28, 1986 (Smith, 1986; Feynman 1988).
 The Challenger crashed 73 seconds into its flight due to the erosion of O-rings
which were part of the solid rocket boosters of the shuttle.
 Before the launch, the engineers at NASA were concerned about the outside
temperature which was very low (the actual launch occurred at 36°F).

25
SPACE SHUTTLE CHALLENGER DATA
Flt Temp Damage Flt Temp Damage
STS-1 66 No STS-41G 78 No
STS-2 70 Yes STS-51-A 67 No
STS-3 69 No STS-51-C 53 Yes
STS-4 80 No STS-51-D 67 No
STS-5 68 No STS-51-B 75 No
STS-6 67 No STS-51-G 70 No
STS-7 72 No STS-51-F 81 No
STS-8 73 No STS-51-I 76 No
STS-9 70 No STS-51-J 79 No
STS-41B 57 Yes STS-61-A 75 Yes
STS-41C 63 Yes STS-61-B 76 No
STS-41D 70 Yes STS-61-C 58 Yes
EXAMPLE

Challenger launch temperature vs damage data

27
LOGISTIC REGRESSION OF CHALLENGER DATA
Let:
 Yi = 0 denote no damage
 Yi = 1 denote damage to the O-ring
 P(Yi = 1) = πi and P(Yi = 0) = 1 − πi.
 We have to estimate P(Yi = 1 | Xi).
 The logistic regression model is given by:

π = e^(β0 + β1 x) / (1 + e^(β0 + β1 x))

Odds (likelihood of damage relative to the likelihood of no damage):

π / (1 − π) = e^(β0 + β1 x)

Logit function:

ln( π / (1 − π) ) = β0 + β1 x

28
LOGISTIC REGRESSION OF CHALLENGER DATA
Variables in the Equation (SPSS output):

                         B        S.E.     Wald    df    Sig.    Exp(B)
 LaunchTemperature    -0.236     0.107    4.832    1    0.028     0.790
 Constant             15.297     7.329    4.357    1    0.037   4398676
 a. Variable(s) entered on step 1: LaunchTemperature.

The estimated model is:

ln( πi / (1 − πi) ) = 15.297 − 0.236 Xi
29
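A sketch of fitting this model in Python, assuming the statsmodels package and the launch data from the table above; the estimates should come out close to the slide's 15.297 and −0.236:

import numpy as np
import statsmodels.api as sm

# launch temperature (deg F) and O-ring damage (1 = damage) from the data table
temp = np.array([66, 70, 69, 80, 68, 67, 72, 73, 70, 57, 63, 70,
                 78, 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58])
damage = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
                   0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])

X = sm.add_constant(temp)             # adds the intercept column
fit = sm.Logit(damage, X).fit()
print(fit.params)                     # intercept and slope, close to 15.297 and -0.236
print(fit.summary())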
CHALLENGER: PROBABILITY OF FAILURE ESTIMATE
Probability of damage to O-ring as a function of launch temperature is:

πi = e^(15.297 − 0.236 Xi) / (1 + e^(15.297 − 0.236 Xi))

30
EXERCISE
 Use the model to calculate the probability that an O-ring will be
damaged at the following ambient temperatures: 51, 53, and 55 degrees
Fahrenheit.

ln( πi / (1 − πi) ) = 15.297 − 0.236 Xi

πi = e^(15.297 − 0.236 Xi) / (1 + e^(15.297 − 0.236 Xi))   (probability of damage to the O-ring as a function of launch temperature)

31
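A short worked computation of the exercise using the fitted equation (values rounded):

import numpy as np

def damage_probability(temp_f):
    # pi = exp(15.297 - 0.236 x) / (1 + exp(15.297 - 0.236 x))
    z = 15.297 - 0.236 * temp_f
    return np.exp(z) / (1 + np.exp(z))

for t in (51, 53, 55):
    print(t, round(float(damage_probability(t)), 3))
# roughly 0.963, 0.942 and 0.910: damage is very likely at these temperatures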
ACTUAL VS. PREDICTED
Flight Launch Damage to Predicted Flight Launch Damage to Predicted
Number Temperature O-ring Probability Number Temperature O-ring Probability

STS 1 66 0 0.4306 STS 41G 78 0 0.0426


STS 2 70 1 0.2274 STS 51A 67 0 0.3740
STS 3 69 0 0.2715 STS 51C 53 1 0.9421
STS 4 80 0 0.0270 STS 51D 67 0 0.3740
STS 5 68 0 0.3206 STS 51B 75 0 0.0829
STS 6 67 0 0.3740 STS 51G 70 0 0.2274
STS 7 72 0 0.1551 STS 51F 81 0 0.0215
STS 8 73 0 0.1266 STS 51I 76 0 0.0667
STS 9 70 0 0.2274 STS 51J 79 0 0.0340
STS 41B 57 1 0.8635 STS 61A 75 1 0.0829
STS 41C 63 1 0.6056 STS 61B 76 0 0.0667
STS 41D 70 1 0.2274 STS 61C 58 1 0.8332
32
INTERPRETATION
ln( π / (1 − π) ) = β0 + β1 x

 β1: change in the log odds ratio. For the given example, it is the change in the likelihood
of damage relative to the likelihood of no damage.

33
INTERPRETATION

β1: change in the log odds ratio.

e^β1: change in the odds ratio.

34
ODDS AND ODDS RATIO

 ODDS: Ratio of two probability values.

odds = π / (1 − π)

 ODDS RATIO: Ratio of two odds.

35
ODDS RATIO

 Assume that X is an independent variable. The odds ratio, OR, is defined as


the ratio of the odds for X = 1 to the odds for X = 0. The odds ratio is given
by:

OR = [ π(1) / (1 − π(1)) ] / [ π(0) / (1 − π(0)) ]

36
ODDS RATIO

OR = [ π(1) / (1 − π(1)) ] / [ π(0) / (1 − π(0)) ] = e^β1

 If OR = 2, then the event is twice as likely to occur when X = 1 compared
with X = 0.

 Odds ratio approximates the relative risk.

37
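A small illustration using the Challenger coefficient: Exp(B) = e^β1 is the odds ratio reported in the regression output above.

import math

beta1 = -0.236                    # launch-temperature coefficient from the Challenger model
odds_ratio = math.exp(beta1)
print(round(odds_ratio, 3))       # ~0.79, the Exp(B) value in the regression output

# each 1 degree F increase multiplies the odds of damage by ~0.79;
# equivalently, each 1 degree F drop multiplies them by about 1 / 0.79 = 1.27
print(round(1 / odds_ratio, 2))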
INTERPRETATION OF LR COEFFICIENTS

 β1 is the change in the log odds ratio for a unit change in the explanatory
variable.
 e^β1 is the factor by which the odds ratio changes for a unit change in the explanatory variable.

β1 = ln[ ( π(x+1) / (1 − π(x+1)) ) / ( π(x) / (1 − π(x)) ) ] = change in the log odds ratio

e^β1 = ( π(x+1) / (1 − π(x+1)) ) / ( π(x) / (1 − π(x)) ) = change in the odds ratio

38
CLASSIFICATION TABLE
 The output from a logistic regression model is the class probability P(Y = 1). Based on the
value of P(Y = 1), the decision maker has to classify the observation as belonging to either
class 1 (positive) or class 0 (negative).
 To classify the observations, the decision maker has to first decide the classification
cut-off probability Pc .
 Whenever the predicted probability of an observation, P(Yi = 1), is less than the
classification cut-off probability, Pc , then the observation is classified as negative (Yi = 0)
 if the predicted probability is greater than or equal to Pc , then the observation is classified
as positive (Yi = 1).

39
SENSITIVITY, SPECIFICITY AND PRECISION

 The ability of the model to correctly classify positives and the ability to correctly classify
negatives are called sensitivity and specificity, respectively.
 Sensitivity = P(model classifies Yi as positive | Yi is positive)

 Sensitivity is calculated using the following equation:

Sensitivity = True Positive (TP) / [ True Positive (TP) + False Negative (FN) ]

 where True Positive (TP) is the number of positives correctly classified as positives by the model
and False Negative (FN) is the number of positives misclassified as negatives by the model.
 Sensitivity is also called recall.

40
SPECIFICITY

 Specificity is the ability of the diagnostic test to correctly classify the test as negative when the
disease is not present. That is:

Specificity = P(model classifies Yi as negative | Yi is negative)

Specificity can be calculated using the following equation:

Specificity = True Negative (TN) / [ True Negative (TN) + False Positive (FP) ]

where True Negative (TN) is the number of negatives correctly classified as negatives by the model
and False Positive (FP) is the number of negatives misclassified as positives by the model.

41
PRECISION

 The decision maker has to consider the tradeoff between sensitivity and
specificity to arrive at an optimal cut-off probability.
 Precision measures the accuracy of positives classified by the model.
Precision = P(Yi is positive | model classifies Yi as positive)

Precision = True Positive (TP) / [ True Positive (TP) + False Positive (FP) ]

42
F-SCORE

 F Score (F Measure) is another measure used in binary logistic regression that


combines both precision and recall and is given by:

F-Score = (2 × Precision × Recall) / (Precision + Recall)

 The interpretation of the F-score is straightforward: the higher the value, the better.

 For any beta parameter (Fβ score), the best possible value is 1 and the worst is 0.

43
SENSITIVITY, SPECIFICITY AND PRECISION
Predicted (cut-off probability = 0.2)

 Observed                            Total    0 (Negative)    1 (Positive)    Percentage Correct
 Damage to O-ring: 0 (Negative)        17       9 (TN)          8 (FP)             52.9%
 Damage to O-ring: 1 (Positive)         7       1 (FN)          6 (TP)             85.7%
 Overall Percentage                                                                62.5%

Sensitivity = TP / (TP + FN) = 6 / (6 + 1) = 0.857
Specificity = TN / (TN + FP) = 9 / (9 + 8) = 0.529
Precision = TP / (TP + FP) = 6 / (6 + 8) = 0.428
Accuracy = (TP + TN) / (TP + FP + FN + TN) = (6 + 9) / 24 = 0.625
44
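A quick check of these numbers in Python, using the TP, FN, TN, FP counts from the classification table above (the F-score from the earlier slide is included as well):

# counts from the classification table at cut-off probability 0.2
TP, FN, TN, FP = 6, 1, 9, 8

sensitivity = TP / (TP + FN)                    # 0.857 (recall)
specificity = TN / (TN + FP)                    # 0.529
precision   = TP / (TP + FP)                    # 0.429
accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.625
f_score     = 2 * precision * sensitivity / (precision + sensitivity)

print(sensitivity, specificity, precision, accuracy, f_score)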
ACCURACY PARADOX
Predicted (cut-off probability = 0.2)

 Observed                            Total    0 (Negative)    1 (Positive)    Percentage Correct
 Damage to O-ring: 0 (Negative)        17       9 (TN)          8 (FP)             52.9%
 Damage to O-ring: 1 (Positive)         7       1 (FN)          6 (TP)             85.7%
 Overall Percentage                                                                62.5%

Predicted (cut-off probability = 0.5)

 Observed                            Total    0 (Negative)    1 (Positive)    Percentage Correct
 Damage to O-ring: 0 (Negative)        17      17 (TN)          0 (FP)            100%
 Damage to O-ring: 1 (Positive)         7       3 (FN)          4 (TP)             57.1%
 Overall Percentage                                                                87.5%

The accuracy paradox in classification problems states that a model with higher overall accuracy may not be a
better model.
45
CUSTOMER SUBSCRIPTION PREDICTION FOR A FINANCIAL SERVICE

 Background: A Bank offers a special term deposit scheme with different benefits, and
they want to improve their subscription rates for this service. The bank is interested in
predicting whether a customer will subscribe to the premium service based on various
features. This prediction will enable the bank to tailor marketing strategies and
promotions to target potential subscribers more effectively.
 Objective: The objective is to build a logistic regression model to predict whether a
customer will subscribe to the premium financial service or not based on features such
as income, account balance, credit score, and transaction history.

46
DATA DESCRIPTION:

 The dataset includes the following features:


 Income (in thousands): Monthly income of the customer.
 Account Balance (in thousands): The average balance in the customer's account.
 Credit Score: The credit score of the customer.
 Transaction History (1-10): A rating of the customer's transaction history (1 being
the lowest and 10 being the highest).
 Age: The age of the customer.
 Subscription (Target): Whether the customer subscribed to the service (1 if
subscribed, 0 if not).

47
DATA

index   Income   Account Balance   Credit Score   Transaction History   Age   Subscription
  0       50           20               700                8             30        1
  1       80           50               750                9             35        1
  2       45           10               650                6             28        0
  3       75           40               720                7             40        1
  4       60           30               680                7             32        0
  5       85           60               800               10             45        1
  6       55           25               700                8             29        0
  7       70           35               760                9             38        1
  8       40           15               630                5             27        0
  9       95           70               820               10             50        1
48
CONCORDANT AND DISCORDANT PAIRS

 Discordant pairs: a pair of positive and negative observations for which
the model has no cut-off probability that classifies both of them correctly is called a
discordant pair.

 Concordant pairs: a pair of positive and negative observations for which the
model has a cut-off probability that classifies both of them correctly is called a
concordant pair.

49
ACTUAL VS. PREDICTED
Flight Launch Damage to Predicted Flight Launch Damage to Predicted
Number Temperature O-ring Probability Number Temperature O-ring Probability

STS 1 66 0 0.4306 STS 41G 78 0 0.0426


STS 2 70 1 0.2274 STS 51A 67 0 0.3740
STS 3 69 0 0.2715 STS 51C 53 1 0.9421
STS 4 80 0 0.0270 STS 51D 67 0 0.3740
STS 5 68 0 0.3206 STS 51B 75 0 0.0829
STS 6 67 0 0.3740 STS 51G 70 0 0.2274
STS 7 72 0 0.1551 STS 51F 81 0 0.0215

STS 8 73 0 0.1266 STS 51I 76 0 0.0667


STS 9 70 0 0.2274 STS 51J 79 0 0.0340
STS 41B 57 1 0.8635 STS 61A 75 1 0.0829
STS 41C 63 1 0.6056 STS 61B 76 0 0.0667
STS 41D 70 1 0.2274 STS 61C 58 1 0.8332

50
RECEIVER OPERATING CHARACTERISTICS (ROC) CURVE

 Receiver operating characteristic (ROC)


curve can be used to understand the
overall worth of a logistic regression
model
 ROC curve is a plot between
sensitivity (true positive rate) in the
vertical axis and 1 – specificity (false
positive rate) in the horizontal axis.

51
RECEIVER OPERATING CHARACTERISTICS (ROC) CURVE
 ROC curve is a plot between sensitivity (true positive
rate) in the vertical axis and 1 – specificity (false positive
rate) in the horizontal axis.
 The area under the ROC curve is called AUC.

 Model with higher AUC is preferred and AUC is


frequently used for model selection
 In our model, the AUC value is 0.794.

 If we use the LR model, then there will be 79.4%


concordant pairs and 20.6% discordant pairs.
 For a randomly selected pair of positive and negative
observations, the probability that the model assigns a higher predicted probability to the
positive observation is 0.794.
52
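A sketch that connects AUC to concordant pairs, using the labels and predicted probabilities from the "Actual vs. Predicted" table (typed in by hand here) and assuming scikit-learn for the cross-check; it should reproduce the 0.794 quoted above:

from itertools import product
from sklearn.metrics import roc_auc_score

# actual labels and predicted probabilities from the "Actual vs. Predicted" table
y_true = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
          0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
y_prob = [0.4306, 0.2274, 0.2715, 0.0270, 0.3206, 0.3740, 0.1551, 0.1266, 0.2274,
          0.8635, 0.6056, 0.2274, 0.0426, 0.3740, 0.9421, 0.3740, 0.0829, 0.2274,
          0.0215, 0.0667, 0.0340, 0.0829, 0.0667, 0.8332]

pos = [p for p, y in zip(y_prob, y_true) if y == 1]   # positives (damage)
neg = [p for p, y in zip(y_prob, y_true) if y == 0]   # negatives (no damage)

concordant = sum(p > n for p, n in product(pos, neg))
ties       = sum(p == n for p, n in product(pos, neg))
auc = (concordant + 0.5 * ties) / (len(pos) * len(neg))

print(auc)                               # ~0.794: fraction of concordant pairs (ties count half)
print(roc_auc_score(y_true, y_prob))     # scikit-learn gives the same value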
AREA UNDER THE ROC CURVE

Example ROC curves with AUC = 0.629 and AUC = 0.801.


53
AREA UNDER ROC CURVE (AUC)
Thumb rule for acceptance of the model:
 AUC of at least 0.7 is required for practical application of
the model.
 AUC of greater than 0.9 implies an outstanding model.
 Caution should be used while selecting models based on
AUC, especially when the data is imbalanced (that is,
data set which has less than 10% positives).
 In case of imbalanced data sets, the AUC may be very
high (greater than 0.9); however, either sensitivity or
specificity values may be poor.
 Area Under the ROC Curve (AUC) is a measure of the
ability of the logistic regression model to discriminate
positives and negatives correctly.
54
AUC

• General rule for acceptance of the model:

• If the area under ROC is:

• 0.5 ⇒ No discrimination

• 0.7 ≤ ROC area < 0.8 ⇒ Acceptable discrimination

• 0.8 ≤ ROC area < 0.9 ⇒ Excellent discrimination

• ROC area ≥ 0.9 ⇒ Outstanding discrimination

55
YOUDEN’S INDEX FOR OPTIMAL CUT-OFF PROBABILITY

 Youden’s Index (1950) is the classification cut-off probability Pc for which the
following function (also known as the J statistic) is maximized:

Youden’s Index = J statistic = max over p of [ Sensitivity(p) + Specificity(p) − 1 ]

58
YOUDEN’S INDEX FOR OPTIMAL CUT-OFF PROBABILITY
ROC curve: the ideal position is the coordinate (0, 1). Youden’s Index corresponds to the point on the curve at the maximum distance from the diagonal line (minimum distance from (0, 1)).

59
YOUDEN’S INDEX FOR OPTIMAL CUT-OFF PROBABILITY
 In the ROC curve, the coordinate (0,1) implies sensitivity = specificity = 1,
which is the ideal model that we would like to use.
 The ROC curve provides information regarding how sensitivity and specificity
change when the classification cut-off probability changes.
 The point on the ROC curve which is at minimum distance from coordinate
(0,1) (or which is at the maximum distance from the diagonal line) will give the
best cut-off probability.
 Youden’s Index is the classification cut-off probability value for which the
distance from the diagonal line to the ROC curve is maximum.
 We use Youden’s Index only when both sensitivity and specificity are equally
important.
60
YOUDEN’S INDEX FOR OPTIMAL CUT-OFF PROBABILITY

Cut-off probability   Sensitivity   Specificity   Youden’s Index
      0.05                1            0.235          0.235
      0.1                 0.857        0.412          0.269
      0.2                 0.857        0.529          0.386
      0.3                 0.571        0.706          0.277
      0.4                 0.571        0.941          0.512
      0.5                 0.571        1              0.571
      0.6                 0.571        1              0.571
      0.7                 0.429        1              0.429
      0.8                 0.429        1              0.429
      0.9                 0.143        1              0.143
      1                   0            1              0

62
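A small script that recomputes Youden’s Index from the sensitivity and specificity columns of this table:

# sensitivity and specificity columns from the table above
cutoffs     = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
sensitivity = [1, 0.857, 0.857, 0.571, 0.571, 0.571, 0.571, 0.429, 0.429, 0.143, 0]
specificity = [0.235, 0.412, 0.529, 0.706, 0.941, 1, 1, 1, 1, 1, 1]

youden = [round(se + sp - 1, 3) for se, sp in zip(sensitivity, specificity)]
print(list(zip(cutoffs, youden)))
print(cutoffs[youden.index(max(youden))])   # 0.5: J peaks at 0.571 (cut-offs 0.5 and 0.6 tie)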
COST-BASED CUT-OFF PROBABILITY
 When sensitivity and specificity are not equally important, we use a cost-based approach to choose the
cut-off probability.
 In the cost-based approach, we assign a penalty cost for misclassification of positives and negatives.
Assume that the cost of misclassifying a negative (0) as a positive (1) is C01 and the cost of
misclassifying a positive (1) as a negative (0) is C10, as shown in the table below:

 Observed     Classified 0     Classified 1
    0             ---              C01
    1             C10              ---

The optimal cut-off probability is the one that minimizes the total penalty cost:

Min over p of [ C01 P01 + C10 P10 ]

where P01 and P10 are the probabilities of classifying a negative as positive and a positive as negative, respectively.
63
COST-BASED CUT-OFF PROBABILITY
 C01 = cost of classifying 0 as 1 = 100
 C10 = cost of classifying 1 as 0 = 200

 Observed     Predicted 0     Predicted 1
    0             P00            P01 (FP)
    1             P10 (FN)       P11

 Cut-off probability    P01     P10     C01       C10       Cost
       0.05             0.85    0.01    100.00    200.00     88.00
       0.10             0.68    0.07    100.00    200.00     81.80
       0.15             0.52    0.11    100.00    200.00     74.20
       0.20             0.44    0.14    100.00    200.00     72.3
       0.25             0.37    0.21    100.00    200.00     78.1
       0.28             0.33    0.23    100.00    200.00     77.8
       0.30             0.29    0.24    100.00    200.00     77.3
       0.35             0.24    0.76    100.00    200.00    175.3
64
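A sketch that recomputes the expected penalty cost for each cut-off from the P01 and P10 columns above; because those columns are rounded, the results differ slightly from the Cost column:

# expected misclassification cost C01*P01 + C10*P10 for each candidate cut-off
C01, C10 = 100, 200       # cost of a false positive, cost of a false negative
cutoffs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.28, 0.30, 0.35]
p01     = [0.85, 0.68, 0.52, 0.44, 0.37, 0.33, 0.29, 0.24]   # P(classify 0 as 1)
p10     = [0.01, 0.07, 0.11, 0.14, 0.21, 0.23, 0.24, 0.76]   # P(classify 1 as 0)

cost = [C01 * a + C10 * b for a, b in zip(p01, p10)]
print(list(zip(cutoffs, cost)))
print(cutoffs[cost.index(min(cost))])   # 0.20 has the lowest expected cost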
LOGISTIC REGRESSION MODEL
DIAGNOSTICS

65
OMNIBUS TESTS

 In the logistic regression model, the likelihood ratio test is used for
checking the statistical significance of the overall model.
 The log-likelihood function for the binary logistic regression model is given by:

LL = Σ (i = 1 to n) Yi ln[π(Zi)] + Σ (i = 1 to n) (1 − Yi) ln[1 − π(Zi)]

 The null and alternate hypothesis are as follows:


 H0: β1 = β2 = β3 = … = βk = 0
 H1: Not all βs are zero.

66
WALD’S TEST

 Wald’s test is used for checking the statistical significance of individual predictor
variables (equivalent to the t-test in the MLR model). The null and alternative
hypotheses for Wald’s test are:
H0: βi = 0
H1: βi ≠ 0

Wald’s test statistic is given by:

W = ( β̂i / Se(β̂i) )^2

67
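A sketch of both diagnostics in Python, assuming statsmodels and the Challenger data from earlier: the likelihood-ratio (omnibus) statistic and the Wald statistics for the individual coefficients, which should come out close to the 4.832 and 4.357 reported in the SPSS output:

import numpy as np
import statsmodels.api as sm

temp = np.array([66, 70, 69, 80, 68, 67, 72, 73, 70, 57, 63, 70,
                 78, 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58])
damage = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
                   0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])

fit = sm.Logit(damage, sm.add_constant(temp)).fit(disp=False)

# omnibus (likelihood-ratio) test of H0: beta1 = 0
print(fit.llr, fit.llr_pvalue)

# Wald statistic for each coefficient: W = (beta_hat / se(beta_hat))^2
print((fit.params / fit.bse) ** 2)   # close to the Wald column (4.357, 4.832)
print(fit.pvalues)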
What are the applications of logistic regression?

68
PYTHON CODE – CUSTOMER PREDICTION
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
import seaborn as sns
import matplotlib.pyplot as plt

69
# Load the dataset
data = {
'Income': [50, 80, 45, 75, 60, 85, 55, 70, 40, 95],
'AccountBalance': [20, 50, 10, 40, 30, 60, 25, 35, 15, 70],
'CreditScore': [700, 750, 650, 720, 680, 800, 700, 760, 630, 820],
'TransactionHistory': [8, 9, 6, 7, 7, 10, 8, 9, 5, 10],
'Age': [30, 35, 28, 40, 32, 45, 29, 38, 27, 50],
'Subscription': [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)
70
# Split the data into features (X) and target variable (y)
X = df.drop('Subscription', axis=1)
y = df['Subscription']

# Standardize numerical features

scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data into training and testing sets
# (stratify on y so the small test set contains both classes; roc_auc_score needs both)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize and train the logistic regression model

model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set

y_pred = model.predict(X_test)
71
from sklearn.metrics import accuracy_score, roc_curve
print('Accuracy score for test data is:', accuracy_score(y_test, y_pred))
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Plot ROC curve
y_proba = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

72
