Logistic Regression - Explained - Towards Data Science
Soner Yildirim
Feb 19 · 8 min read
Logistic regression is a supervised learning algorithm that is mostly used for binary classification problems. Although the word “regression” seems to contradict “classification”, the focus here is on the word “logistic”, referring to the logistic function that performs the classification task in this algorithm. Logistic regression is a simple yet very effective classification algorithm, so it is commonly used for many binary classification tasks. Customer churn, spam email, and website or ad click predictions are some examples of the areas where logistic regression offers a powerful solution.
The basis of logistic regression is the logistic function, also called the sigmoid function,
which takes in any real valued number and maps it to a value between 0 and 1.
The logistic regression model takes a linear equation as input and uses the logistic function and log odds to perform a binary classification task. Before going into detail on logistic regression, it is better to review some concepts in the scope of probability.
Probability
Probability measures the likelihood of an event occurring. For example, if we say “there is a 90% chance that this email is spam”, the probability of the email being spam is 0.9.
Odds
The ratio of the probabilities of the positive class and the negative class. The log odds is the logarithm of the odds.
All these concepts essentially represent the same measure in different ways. In the case of logistic regression, the log odds is used. We will see why the log odds is preferred in the logistic regression algorithm.
A probability of 0.5 means that there is an equal chance for the email to be spam or not spam. Please note that the log odds of a probability of 0.5 is 0. We will use this later.
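As a quick sketch (my own code, not the article's), probability, odds, and log odds can be computed directly:

```python
import math

def odds(p):
    # Odds: ratio of the probability of the positive class
    # to the probability of the negative class
    return p / (1 - p)

def log_odds(p):
    # Log odds (logit): the logarithm of the odds
    return math.log(odds(p))

print(odds(0.9))      # a 90% spam probability gives odds close to 9
print(log_odds(0.5))  # a probability of 0.5 gives log odds of exactly 0.0
```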
. . .
Assume y is the probability of the positive class, given by the sigmoid function y = 1 / (1 + e^(-z)), where z is the linear equation. If z is 0, then y is 0.5. For positive values of z, y is higher than 0.5, and for negative values of z, y is less than 0.5. If the probability of the positive class is more than 0.5 (i.e. more than a 50% chance), we can predict the outcome as the positive class (1). Otherwise, the outcome is the negative class (0).
Note: In binary classification, there are many ways to represent two classes such as
positive/negative, 1/0, True/False.
The table below shows some values of z with corresponding y (probability) values. All
real numbers are mapped between 0 and 1.
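A small sketch (mine, not the article's code) that reproduces such a table with the sigmoid function:

```python
import math

def sigmoid(z):
    # Map any real number z to a probability between 0 and 1
    return 1 / (1 + math.exp(-z))

# Print a few z values with their corresponding probabilities
for z in [-4, -2, 0, 2, 4]:
    print(f"z = {z:+d}  ->  y = {sigmoid(z):.3f}")
```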
If we plot this function, we get the famous S-shaped curve of logistic regression.
We can use the calculated probability “as is”. For example, the output can be that the probability of an email being spam is 95%, or that the probability of a customer clicking on an ad is 70%. However, in most cases, probabilities are used to classify data points. If the probability is greater than 50%, the prediction is the positive class (1). Otherwise, the prediction is the negative class (0).
Everything seems fine so far, except for one issue: it is not always desirable to choose the positive class for all probability values higher than 50%. In the spam email case, we have to be almost sure before classifying an email as spam. Since emails detected as spam go directly to the spam folder, we do not want the user to miss important emails, so emails are not classified as spam unless we are almost certain. On the other hand, classification in a health-related problem requires us to be much more sensitive. Even if we are only slightly suspicious that a cell is malignant, we do not want to miss it. So the value that serves as the threshold between the positive and negative classes is problem-dependent. The good news is that logistic regression allows us to adjust this threshold value.
If we set a high threshold (e.g. 95%), almost all the predictions we make as positive will be correct. However, we will miss some of the positive class and label those examples as negative. If we set a low threshold (e.g. 30%), we will predict almost all of the positive classes correctly. However, we will also classify some of the negative classes as positive.
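As a hedged sketch of this idea (the data and numbers here are synthetic, not from the article), we can get class probabilities from a fitted model and apply our own threshold instead of the default 50%:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic toy data, purely for illustration
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

# Probability of the positive class for each example
proba = model.predict_proba(X)[:, 1]

# Default 0.5 threshold vs. a stricter 0.95 threshold (e.g. for spam filtering)
pred_default = (proba >= 0.5).astype(int)
pred_strict = (proba >= 0.95).astype(int)

print("positives predicted at threshold 0.5 :", pred_default.sum())
print("positives predicted at threshold 0.95:", pred_strict.sum())
```

Raising the threshold can only shrink the set of positive predictions, which is exactly the precision-for-recall trade described above.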
Both of these cases will affect the accuracy of our model. The simplest way to measure accuracy is:
accuracy = number of correct predictions / total number of predictions
However, this is usually not enough to evaluate classification models. In some binary
classification tasks, there is an imbalance between positive and negative classes. Think
about classifying tumors as malignant and benign. Most of the target values (tumors) in
the dataset will be 0 (benign) because malignant tumors are very rare compared to
benign ones. A typical set would include more than 90% benign (0) class. So if the
model predicts all the examples as 0 without making any calculation, the accuracy is
more than 90%. It sounds good but is useless in this case. Therefore, we need other
measures to evaluate classification models. These measures are precision and recall.
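A minimal sketch of the problem described above (the 95/5 split is my own illustrative choice): a “model” that always predicts benign scores high accuracy while catching nothing.

```python
import numpy as np

# Hypothetical imbalanced labels: 95% benign (0), 5% malignant (1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that predicts benign for everything, with no calculation at all
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(f"accuracy = {accuracy:.2f}")  # 0.95, yet every malignant tumor is missed
```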
Precision measures how good our model is when the prediction is positive: precision = TP / (TP + FP). Recall measures how good our model is at correctly predicting positive classes: recall = TP / (TP + FN). Here TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively.
We cannot try to maximize both precision and recall because there is a trade-off
between them. The figures below clearly explain the trade-off:
In both tables, there are 8 negative (0) classes and 11 positive (1) classes. The
prediction of the model and hence precision and recall change according to the
threshold values. The precision and recall values are calculated as below:
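The article's worked tables were images; as an illustrative sketch with hypothetical counts (the TP, FP and FN numbers are my own, not the article's), precision and recall can be computed as:

```python
def precision(tp, fp):
    # Of all points predicted positive, how many were actually positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, how many did the model find?
    return tp / (tp + fn)

# Hypothetical counts: out of 11 actual positives, the model predicts
# 9 examples as positive, 7 of them correctly.
tp, fp, fn = 7, 2, 4
print(f"precision = {precision(tp, fp):.2f}")  # 7/9  -> 0.78
print(f"recall    = {recall(tp, fn):.2f}")     # 7/11 -> 0.64
```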
Increasing precision decreases recall, and vice versa. You can aim to maximize either precision or recall depending on the task. For an email spam detection model, we try to maximize precision because we want to be correct when an email is detected as spam. We do not want to label a normal email as spam (i.e. a false positive). If the number of false positives is low, precision is high.
There is another measure that combines precision and recall into a single number: the F1 score. It is the harmonic mean of precision and recall, calculated as:
F1 = 2 × (precision × recall) / (precision + recall)
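A small sketch (mine, not the article's) computing the F1 score from given precision and recall values:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.9, 0.9), 3))  # balanced precision and recall: 0.9
print(round(f1_score(0.9, 0.1), 3))  # 0.18 -- a low recall drags F1 down
```

Unlike an arithmetic average, the harmonic mean is pulled toward the smaller of the two values, which is why F1 punishes a model that sacrifices recall entirely for precision (or vice versa).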
The F1 score is a more useful measure than accuracy for problems with an uneven class distribution because it takes into account both false positives and false negatives.
Scikit-learn Implementation
I will use one of the datasets available under the datasets module of scikit-learn. First, I will import the dataset and the dependencies:
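The original import cell appeared as an image in the article; a plausible sketch is below. The choice of the breast cancer dataset is my assumption, since the article only says it uses one of scikit-learn's built-in datasets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
```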
Then load the dataset and divide into train and test sets:
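The loading and training cells were also images; a hedged, self-contained reconstruction (dataset choice, split size, and random seed are all my assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a built-in binary classification dataset
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Raise max_iter so the solver converges on this unscaled data
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```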
Scikit-learn provides the classification_report function to calculate precision, recall and
f1-score at the same time. It also shows the number of positive and negative classes in
the support column.
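The report cell itself was an image in the article; here is a sketch of how classification_report is typically used (the breast cancer dataset and the split parameters are my assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Precision, recall, f1-score and support for each class in one call
report = classification_report(y_test, model.predict(X_test))
print(report)
```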
It is worth noting that data preparation, model creation and evaluation in real-life projects are far more complicated and time-consuming than in this very simple example. I just wanted to show you the steps of model creation. In real-life cases, most of your time will be spent on data cleaning and preparation (assuming data collection is done by someone else). You will also need to spend a good amount of time improving the accuracy of your model through hyperparameter tuning and repeated evaluation.
. . .
Thank you for reading. Please let me know if you have any feedback.