
Regression and Classification

In layman's terms, linear regression is used for learning the linear relationship between a target and one or more predictors, and it is probably one of the most popular and well-known inferential algorithms in statistics. Linear regression attempts to model the connection
between two variables by fitting a linear equation to observed data. One variable is viewed as
an explanatory variable, and the other is viewed as a dependent variable.

Regression models are used to describe the relationship between two given variables. There are several types of regression models, such as logistic regression models, nonlinear
regression models, and linear regression models. The linear regression model fits a straight line to
the observed data to establish the relationship between the two variables.

Types of Linear Regression


Normally, linear regression is divided into two types: simple linear regression and multiple linear
regression. For better clarity, we will discuss these types in detail.

1. Simple Linear Regression


It can be described as a method of statistical analysis that can be used to study the relationship
between two quantitative variables.

Primarily, two things can be found out by using the method of simple linear
regression:

1. The strength of the relationship between the given pair of variables. (For example, the
relationship between global warming and the melting of glaciers)
2. The value of the dependent variable at a given value of the independent
variable. (For example, the amount of glacier melting at a certain level of global warming
or temperature)

2. Multiple Linear Regression


In this type of linear regression, we attempt to discover the relationship between two or more
independent variables (inputs) and the corresponding dependent variable (output); the
independent variables can be either continuous or categorical.

This form of regression analysis is helpful in several ways: it helps in forecasting trends and future
values, and in predicting the impact of changes.

Assumptions of Linear Regression


To conduct a simple linear regression, one has to make certain assumptions about the data. This is
because it is a parametric test. The assumptions used while performing a simple linear regression are
as follows:
1. Homogeneity of variance (homoscedasticity) - One of the main assumptions in simple linear
regression is that the size of the error stays constant. In other words, the error size does not
change significantly across the values of the independent variable.

2. Independence of observations - The observations are independent of one another; there are no
hidden relationships among them, and only valid sampling methods were used during the
collection of data.

3. Normality - The data (more precisely, the residuals) follow a normal distribution.


Simple Linear Regression
Applications of Simple Linear Regression
1. Marks scored by students based on number of hours studied (ideally) - Here, marks scored in
exams are the dependent variable and the number of hours studied is the independent variable.
2. Predicting crop yields based on the amount of rainfall- Yield is a dependent variable while the
measure of precipitation is an independent variable.
3. Predicting the salary of a person based on years of experience - Here, experience is the
independent variable and salary is the dependent variable.


Examples of Simple Linear Regression


Now, let’s move towards understanding simple linear regression with the help of an example. We will
take an example of teen birth rate and poverty level data.

This dataset of size n = 51 covers the 50 states and the District of Columbia in the United States
(poverty.txt). The variables are y = year 2002 birth rate per 1000 females 15 to 17 years of age
and x = poverty rate, which is the percent of each state's population living in households with incomes
below the federally defined poverty level. (Data source: Mind On Statistics, 3rd edition, Utts and
Heckard).

Below is the graph (right image) in which you can see that the birth rate (on the vertical axis) shows a
roughly linear relationship, on average, with a positive slope. As the poverty level increases, the birth
rate for 15 to 17-year-old females generally increases as well.
Fig: Example graph of simple linear regression

Here is another graph (left graph) showing a regression line superimposed on the data.

The equation of the fitted regression line is given near the top of the plot. The equation
should be read as describing the "average" birth rate (the "predicted" birth rate would be fine as
well), since a regression equation describes the mean value of y as a function of one or more x-
variables. In statistical notation, the equation is written ŷ = 4.267 + 1.373x.

 The interpretation of the slope (value = 1.373) is that the 15 to 17-year-old birth rate
increases by 1.373 units, on average, for each one-unit (one percent) increase in the poverty
rate.
 The interpretation of the intercept (value = 4.267) is that if there were states with a poverty
rate of 0, the predicted average 15 to 17-year-old birth rate for those states would be 4.267.
Since there are no states with a poverty rate of 0, this interpretation of the intercept is not
practically meaningful for this model.
 In the graph with the regression line present, we also see that s =
5.55057 and R² = 53.3%.
 The value of s tells us roughly the standard deviation of the differences
between the y-values of individual observations and the predictions of y based on
the regression line. The value of R² can be interpreted to mean that poverty rates
"explain" 53.3% of the observed variation in the 15 to 17-year-old average birth rates of the
states.

The adjusted R² value (52.4%) is an adjustment to R² based on the number of x-variables
in the model (just one here) and the sample size. With only a single x-variable, the adjusted R² is not
important.

Multiple Linear Regression


Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable. The goal of
multiple linear regression is to model the linear relationship between the explanatory (independent)
variables and response (dependent) variables. In essence, multiple regression is the extension of
ordinary least-squares (OLS) regression because it involves more than one explanatory variable.
Formula and Calculation of Multiple Linear Regression
yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ

where, for i = 1, ..., n observations:

yi = dependent variable
xi = explanatory variables for the i-th observation
β0 = y-intercept (constant term)
βp = slope coefficient for each explanatory variable
ϵ = the model's error term (also known as the residuals)

What Multiple Linear Regression Can Tell You


Simple linear regression is a function that allows an analyst or statistician to make predictions about
one variable based on the information that is known about another variable. Linear regression can
only be used when one has two continuous variables—an independent variable and a dependent
variable. The independent variable is the parameter that is used to calculate the dependent variable
or outcome. A multiple regression model extends to several explanatory variables.

The multiple regression model is based on the following assumptions:

 There is a linear relationship between the dependent variables and the independent
variables
 The independent variables are not too highly correlated with each other
 yi observations are selected independently and randomly from the population
 Residuals should be normally distributed with a mean of 0 and constant variance σ²

The coefficient of determination (R-squared) is a statistical metric that is used to measure how much
of the variation in outcome can be explained by the variation in the independent variables. R2 always
increases as more predictors are added to the MLR model, even though the predictors may not be
related to the outcome variable.

R2 by itself can't thus be used to identify which predictors should be included in a model and which
should be excluded. R2 can only be between 0 and 1, where 0 indicates that the outcome cannot be
predicted by any of the independent variables and 1 indicates that the outcome can be predicted
without error from the independent variables.

When interpreting the results of multiple regression, beta coefficients are valid while holding all other
variables constant ("all else equal"). The output from a multiple regression can be displayed
horizontally as an equation, or vertically in table form.

Use of Multiple Linear Regression


As an example, an analyst may want to know how the movement of the market affects the price of
ExxonMobil (XOM). In this case, their linear equation will have the value of the S&P 500 index as the
independent variable, or predictor, and the price of XOM as the dependent variable.
In reality, multiple factors predict the outcome of an event. The price movement of ExxonMobil, for
example, depends on more than just the performance of the overall market. Other predictors such as
the price of oil, interest rates, and the price movement of oil futures can affect the price of XOM and
stock prices of other oil companies. To understand a relationship in which more than two variables
are present, multiple linear regression is used.

Multiple linear regression (MLR) is used to determine a mathematical relationship among several
random variables. In other terms, MLR examines how multiple independent variables are related to
one dependent variable. Once each of the independent factors has been determined to predict the
dependent variable, the information on the multiple variables can be used to create an accurate
prediction on the level of effect they have on the outcome variable. The model creates a relationship
in the form of a straight line (linear) that best approximates all the individual data points.

Referring to the MLR equation above, in our example:

 yi = dependent variable—the price of XOM


 xi1 = interest rates
 xi2 = oil price
 xi3 = value of S&P 500 index
 xi4= price of oil futures
 B0 = y-intercept at time zero
 B1 = regression coefficient that measures a unit change in the dependent variable when xi1
changes - the change in XOM price when interest rates change
 B2 = coefficient value that measures a unit change in the dependent variable when xi2
changes—the change in XOM price when oil prices change

The least-squares estimates—B0, B1, B2…Bp—are usually computed by statistical software. Many
variables can be included in the regression model, with each independent variable indexed by a
number: 1, 2, 3, 4...p. The multiple regression model allows an analyst to predict an outcome
based on information provided on multiple explanatory variables.

Still, the model is not always perfectly accurate as each data point can differ slightly from the outcome
predicted by the model. The residual value, E, which is the difference between the actual outcome
and the predicted outcome, is included in the model to account for such slight variations.

Assuming we run our XOM price regression model through statistical software, it returns the
following output:

An analyst would interpret this output to mean if other variables are held constant, the price of XOM
will increase by 7.8% if the price of oil in the markets increases by 1%. The model also shows that the
price of XOM will decrease by 1.5% following a 1% rise in interest rates. R2 indicates that 86.5% of the
variations in the stock price of Exxon Mobil can be explained by changes in the interest rate, oil price,
oil futures, and S&P 500 index.
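
A minimal sketch of how such a regression could be run in Python with statsmodels is shown below. The actual XOM, interest-rate, oil, and S&P 500 series are not provided in the text, so the data here are synthetic and every number is an illustrative assumption.

# A minimal sketch of a multiple linear regression like the XOM example,
# using synthetic stand-in data (all numbers are illustrative assumptions).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
interest_rates = rng.normal(2.0, 0.5, n)   # x1
oil_price      = rng.normal(70, 10, n)     # x2
sp500          = rng.normal(4000, 200, n)  # x3
oil_futures    = rng.normal(72, 10, n)     # x4

# Construct a synthetic "XOM price" with known coefficients plus noise.
xom = (5 - 1.5 * interest_rates + 0.8 * oil_price
       + 0.01 * sp500 + 0.2 * oil_futures + rng.normal(0, 2, n))

X = np.column_stack([interest_rates, oil_price, sp500, oil_futures])
X = sm.add_constant(X)           # adds the intercept term B0
model = sm.OLS(xom, X).fit()     # least-squares estimates of B0..B4

print(model.params)              # B0, B1 (rates), B2 (oil), B3 (S&P), B4 (futures)
print(f"R^2 = {model.rsquared:.3f}, adjusted R^2 = {model.rsquared_adj:.3f}")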
Difference Between Linear and Multiple Regression
Ordinary least squares (OLS) regression compares the response of a dependent variable given a
change in some explanatory variables. However, a dependent variable is rarely explained by only one
variable. In this case, an analyst uses multiple regression, which attempts to explain a dependent
variable using more than one independent variable. Multiple regressions can be linear and nonlinear.

Multiple regressions are based on the assumption that there is a linear relationship between both the
dependent and independent variables. It also assumes no major correlation between the independent
variables.
Least Square Method
The least squares method is the process of finding the best-fitting curve or line of best fit for a set of
data points by minimizing the sum of the squares of the offsets (residuals) of the points from the
curve. In the process of finding the relation between two variables, the trend of outcomes is
estimated quantitatively; this process is termed regression analysis. Curve fitting is one approach
to regression analysis, and least squares is the method of fitting equations that approximate the
curves to the given raw data.

It is quite obvious that the fitting of curves for a particular data set is not always unique. Thus, it is
required to find a curve having minimal deviation from all the measured data points. This is known
as the best-fitting curve and is found by using the least-squares method.

Definition
The least-squares method is a crucial statistical method that is practised to find a regression line or a
best-fit line for the given pattern. This method is described by an equation with specific parameters.
The method of least squares is widely used in estimation and regression. In regression analysis,
this method is considered a standard approach for the approximation of sets of equations having
more equations than unknowns.

The method of least squares defines the solution that minimizes the sum of squares of the
deviations, or errors, in the result of each equation. The formula for the sum of squared errors
helps to quantify the variation in the observed data.

The least-squares method is often applied in data fitting. The best fit result is assumed to reduce the
sum of squared errors or residuals which are stated to be the differences between the observed or
experimental value and corresponding fitted value given in the model.

There are two basic categories of least-squares problems:

 Ordinary or linear least squares


 Nonlinear least squares

These categories depend on whether the residuals are linear or nonlinear in the unknown parameters.
The linear problems are often seen in regression analysis in statistics. On the other hand, the
nonlinear problems are generally handled by iterative refinement, in which the model is
approximated by a linear one at each iteration.

Least Square Method Graph


In linear regression, the line of best fit is a straight line as shown in the following diagram:
The given data points are fitted by minimizing the residuals, or offsets, of each point
from the line. Vertical offsets are generally used in common practice for line, polynomial, surface,
and hyperplane fitting, while perpendicular offsets are used less often.

Least Square Method Formula


The least-square method states that the curve that best fits a given set of observations, is said to be a
curve having a minimum sum of the squared residuals (or deviations or errors) from the given data
points. Let us assume that the given points of data are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) in which all
x’s are independent variables, while all y’s are dependent ones. Also, suppose that f(x) is the fitting
curve and d represents error or deviation from each given point.

Now, we can write:

d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn – f(xn)

The least-squares criterion states that the best-fitting curve has the property that the sum of
squares of all the deviations from the given values must be a minimum, i.e.:

S = d1² + d2² + d3² + … + dn² = ∑(yi – f(xi))² = minimum

When we have to determine the equation of the line of best fit for the given data, we first
use the following formulas.

The equation of least square line is given by Y = a + bX

Normal equation for ‘a’:

∑Y = na + b∑X

Normal equation for ‘b’:

∑XY = a∑X + b∑X²

Solving these two normal equations we can get the required trend line equation.

Thus, we get the line of best fit in the form Y = a + bX.
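
As a small illustration, the two normal equations above can be assembled and solved directly with NumPy; the function name and sample arrays below are illustrative assumptions, not part of the original text.

# A sketch of solving the normal equations for the line Y = a + bX.
import numpy as np

def least_squares_line(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # Normal equations:  sum(y)  = n*a      + b*sum(x)
    #                    sum(xy) = a*sum(x) + b*sum(x^2)
    A = np.array([[n,       x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    a, b = np.linalg.solve(A, rhs)
    return a, b

# Example usage with small made-up data:
print(least_squares_line([1, 2, 3, 4], [2, 4, 5, 7]))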

Solved Example
The Least Squares Model for a set of data (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) passes through the point
(xa, ya) where xa is the average of the xi‘s and ya is the average of the yi‘s. The below example explains
how to find the equation of a straight line or a least square line using the least square method.

Question:

Consider the time series data given below:

xi 8 3 2 10 11 3 6 5 6 8
yi 4 12 1 12 9 4 9 6 1 14

Use the least square method to determine the equation of line of best fit for the data. Then plot the
line.

Solution:

Mean of xi values = (8 + 3 + 2 + 10 + 11 + 3 + 6 + 5 + 6 + 8)/10 = 62/10 = 6.2

Mean of yi values = (4 + 12 + 1 + 12 + 9 + 4 + 9 + 6 + 1 + 14)/10 = 72/10 = 7.2

Straight line equation is y = a + bx.


The normal equations are

∑y = an + b∑x

∑xy = a∑x + b∑x²

x y x² xy
8 4 64 32
3 12 9 36
2 1 4 2
10 12 100 120
11 9 121 99
3 4 9 12
6 9 36 54
5 6 25 30
6 1 36 6
8 14 64 112
∑x = 62 ∑y = 72 ∑x² = 468 ∑xy = 503

Substituting these values in the normal equations,

10a + 62b = 72….(1)

62a + 468b = 503….(2)

(1) × 62 – (2) × 10,

620a + 3844b – (620a + 4680b) = 4464 – 5030

-836b = -566

b = 566/836

b = 283/418

b = 0.677

Substituting b = 0.677 in equation (1),

10a + 62(0.677) = 72

10a + 41.974 = 72

10a = 72 – 41.974

10a = 30.026

a = 30.026/10

a = 3.0026
Therefore, the equation becomes,

y = a + bx

y = 3.0026 + 0.677x

This is the required trend line equation.

Now, we can find the sum of squares of deviations from the obtained values as:

d1 = [4 – (3.0026 + 0.677*8)] = (-4.4186)

d2 = [12 – (3.0026 + 0.677*3)] = (6.9664)

d3 = [1 – (3.0026 + 0.677*2)] = (-3.3566)

d4 = [12 – (3.0026 + 0.677*10)] = (2.2274)

d5 = [9 – (3.0026 + 0.677*11)] =(-1.4496)

d6 = [4 – (3.0026 + 0.677*3)] = (-1.0336)

d7 = [9 – (3.0026 + 0.677*6)] = (1.9354)

d8 = [6 – (3.0026 + 0.677*5)] = (-0.3876)

d9 = [1 – (3.0026 + 0.677*6)] = (-6.0646)

d10 = [14 – (3.0026 + 0.677*8)] = (5.5814)

∑d² = (−4.4186)² + (6.9664)² + (−3.3566)² + (2.2274)² + (−1.4496)² + (−1.0336)² + (1.9354)² +
(−0.3876)² + (−6.0646)² + (5.5814)² = 159.2799
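
This worked example can be checked numerically. The short NumPy sketch below (using np.polyfit as the fitting routine, an assumption for illustration) reproduces a ≈ 3.0026, b ≈ 0.677 and the residual sum of squares ≈ 159.28.

# Verifying the worked example above with NumPy (np.polyfit returns the
# slope first, then the intercept, for a degree-1 fit).
import numpy as np

x = np.array([8, 3, 2, 10, 11, 3, 6, 5, 6, 8], dtype=float)
y = np.array([4, 12, 1, 12, 9, 4, 9, 6, 1, 14], dtype=float)

b, a = np.polyfit(x, y, 1)        # slope b ≈ 0.677, intercept a ≈ 3.0026
print(a, b)

residuals = y - (a + b * x)
print(np.sum(residuals ** 2))     # ≈ 159.28, matching ∑d² above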
Limitations for Least-Square Method
The least-squares method is a very beneficial method of curve fitting. Despite many benefits, it has a
few shortcomings too. One of the main limitations is discussed here.

In regression analysis, which uses the least-squares method for curve fitting, it is
implicitly assumed that the errors in the independent variable are negligible or zero. When the
errors in the independent variables are non-negligible, the model is subject to measurement
error. In that case, the parameter estimates, confidence intervals, and hypothesis tests based on
ordinary least squares can be misleading because of the errors occurring in the independent
variables.

Frequently Asked Questions – FAQs

How do you calculate least squares?


Let us assume that the given points of data are (x_1, y_1), (x_2, y_2), …, (x_n, y_n), in which all x's are
independent variables, while all y's are dependent ones. Also, suppose that f(x) is the fitting curve
and d represents the error, or deviation, from each given point. The least-squares criterion says that
the curve that best fits has the property that the sum of squares of all the deviations from the given
values must be minimum.

How many methods are available for the Least Square?


There are two primary categories of least-squares problems: ordinary (or linear) least squares, and
nonlinear least squares.

What is the principle of least squares?


The least squares principle states that the most probable values of a system of unknown quantities,
upon which observations have been made, are obtained by making the sum of the squares of the
errors a minimum.

What does the least square mean?


The least square method is the process of obtaining the best-fitting curve or line of best fit for the
given data set by reducing the sum of the squares of the offsets (residual part) of the points from the
curve.

What is least square curve fitting?


The least-squares method is a widely used method of curve fitting for a given data set. It is the
most prevalent method used to determine the trend line for given time series data.
Gradient Descent
Gradient descent is one of the most widely used optimization algorithms in machine learning. And yet it
confounds a lot of newcomers.

What is a Cost Function?


It is a function that measures the performance of a model for any given data. Cost Function quantifies
the error between predicted values and expected values and presents it in the form of a single real
number.

After making a hypothesis with initial parameters, we calculate the cost function. Then, with the goal
of reducing the cost function, we modify the parameters by using the gradient descent algorithm over
the given data. A commonly used cost function for regression is the mean squared error:

J(θ) = (1/2m) ∑ (h_θ(x(i)) − y(i))²

where m is the number of training examples, h_θ(x(i)) is the prediction for the i-th example, and y(i) is its true value.
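
As a rough sketch, assuming a one-feature linear hypothesis h(x) = theta0 + theta1·x and the common 1/(2m) scaling, the cost function could be coded as follows; the names and scaling are illustrative conventions, not taken verbatim from the text.

# A minimal sketch of a mean-squared-error cost function for a linear
# hypothesis h(x) = theta0 + theta1 * x.
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error (with 1/2m scaling) between predictions and targets."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    predictions = theta0 + theta1 * x
    errors = predictions - y
    return np.sum(errors ** 2) / (2 * m)

# Example usage with made-up data:
print(cost(0.0, 1.0, [1, 2, 3], [2, 4, 6]))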

What is Gradient Descent?


Let's say you are playing a game where the players are at the top of a mountain and are asked
to reach the lake at the lowest point of the mountain. Additionally, they are blindfolded. So, what
approach do you think would make you reach the lake?

Take a moment to think about this before you read on.

The best way is to observe the ground and find where the land descends. From that position, take a
step in the descending direction and iterate this process until we reach the lowest point.

Finding the lowest point in a hilly landscape.

Gradient descent is an iterative optimization algorithm for finding the local minimum of a function.
To find the local minimum of a function using gradient descent, we must take steps proportional to
the negative of the gradient (move away from the gradient) of the function at the current point. If we
take steps proportional to the positive of the gradient (moving towards the gradient), we will approach
a local maximum of the function, and the procedure is called Gradient Ascent.

Gradient descent was originally proposed by Cauchy in 1847. It is also known as steepest descent.

The goal of the gradient descent algorithm is to minimize the given function (say cost function). To
achieve this goal, it performs two steps iteratively:

1. Compute the gradient (slope), the first-order derivative of the function at that point
2. Take a step (move) in the direction opposite to the gradient, i.e. opposite to the direction in
which the slope increases, from the current point, by alpha times the gradient at that point

Gradient Descent Algorithm

The update rule is θ := θ − α · ∇J(θ), repeated until convergence. Alpha (α) is called the learning
rate, a tuning parameter in the optimization process. It decides the length of the steps.

Plotting the Gradient Descent Algorithm


When we have a single parameter (theta), we can plot the dependent variable cost on the y-axis and
theta on the x-axis. If there are two parameters, we can go with a 3-D plot, with cost on one axis and
the two parameters (thetas) along the other two axes.
Cost along z-axis and parameters(thetas) along x-axis and y-axis (source: Research gate)

It can also be visualized by using contours. This shows a 3-D plot in two dimensions, with the parameters
along both axes and the response as a contour. The value of the response increases away from the
center and is the same along a given ring. The response is directly proportional to the
distance of a point from the center (along a given direction).

Gradient descent using Contour Plot.

Alpha – The Learning Rate


We have the direction we want to move in; now we must decide the size of the step we must take.

The learning rate must be chosen carefully so that we end up at the local minimum.

 If the learning rate is too high, we might OVERSHOOT the minima and keep bouncing, without
reaching the minima
 If the learning rate is too small, the training might turn out to be too long

a) Learning rate is optimal: the model converges to the minimum
b) Learning rate is too small: it takes more time but converges to the minimum
c) Learning rate is higher than the optimal value: it overshoots but converges (1/C < η < 2/C)
d) Learning rate is very large: it overshoots and diverges, moves away from the minima, and
performance decreases on learning

Note: As the gradient decreases while moving towards the local minima, the size of the step decreases.
So, the learning rate (alpha) can be constant over the optimization and need not be varied iteratively.

Local Minima
The cost function may consist of many minimum points. The gradient may settle at any one of the
minima, depending on the initial point (i.e., the initial parameters theta) and the learning rate.
Therefore, the optimization may converge to different points with different starting points and
learning rates.
Convergence of cost function with different starting points

Code Implementation of Gradient Descent in Python

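
The original code listing is not reproduced here, so below is a minimal sketch of batch gradient descent for a one-feature linear model, following the two steps described above. The function name, learning rate, iteration count, and sample data are illustrative assumptions.

# A minimal sketch of batch gradient descent for y ≈ theta0 + theta1 * x,
# minimizing the (1/2m) * sum(errors^2) cost defined earlier.
import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=1000):
    """Fit y ≈ theta0 + theta1 * x by batch gradient descent."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        predictions = theta0 + theta1 * x
        errors = predictions - y
        # Gradients of the cost with respect to each parameter.
        grad0 = errors.sum() / m
        grad1 = (errors * x).sum() / m
        # Step in the direction opposite to the gradient, scaled by alpha.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Example: with a moderate alpha the estimates approach the least-squares
# line; a much larger alpha can overshoot and diverge, as described above.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
print(gradient_descent(x, y, alpha=0.05, n_iters=5000))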

Linear Classification
In the field of machine learning, the goal of statistical classification is to use an object's characteristics
to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification
decision based on the value of a linear combination of the characteristics. An object's characteristics
are also known as feature values and are typically presented to the machine in a vector called a feature
vector. Such classifiers work well for practical problems such as document classification, and more
generally for problems with many variables (features), reaching accuracy levels comparable to non-
linear classifiers while taking less time to train and use.

Definition

If the input feature vector to the classifier is a real vector x, then the output score is

y = f(w · x) = f(∑_j w_j x_j),

where w is a real vector of weights and f is a function that converts the dot product of the two vectors
into the desired output. (In other words, w is a one-form or linear functional mapping x onto R.) The
weight vector w is learned from a set of labeled training samples. Often f is a threshold function,
which maps all values of w · x above a certain threshold to the first class and all other values to the
second class; e.g.,

f(x) = 1 if wᵀx > θ, and 0 otherwise.

The superscript T indicates the transpose and θ is a scalar threshold. A more complex f might give the
probability that an item belongs to a certain class.

For a two-class classification problem, one can visualize the operation of a linear classifier as splitting
a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are
classified as "yes", while the others are classified as "no".

A linear classifier is often used in situations where the speed of classification is an issue, since it is
often the fastest classifier, especially when x is sparse. Also, linear classifiers often work very well
when the number of dimensions in x is large, as in document classification, where each element of
x is typically the number of occurrences of a word in a document (see document-term matrix). In
such cases, the classifier should be well-regularized.
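
A tiny sketch of the decision rule just described, scoring an input with a dot product and thresholding it, might look as follows; the weights and threshold are made-up values, not learned from real data.

# Score an input with a dot product and threshold it into two classes.
import numpy as np

def predict(w, x, threshold=0.0):
    """Return class 1 if w·x exceeds the threshold, else class 0."""
    score = np.dot(w, x)
    return 1 if score > threshold else 0

w = np.array([0.4, -0.2, 1.1])   # weight vector (assumed, not learned here)
x = np.array([1.0, 3.0, 0.5])    # feature vector for one object
print(predict(w, x))             # 0.4*1 - 0.2*3 + 1.1*0.5 = 0.35 > 0, so class 1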

Generative models vs. discriminative models

There are two broad classes of methods for determining the parameters of a linear classifier:
generative and discriminative models. Methods of the former model the joint probability
distribution P(class, x), whereas methods of the latter model the conditional density functions P(class | x).
Examples of such algorithms include:

 Linear Discriminant Analysis (LDA)—assumes Gaussian conditional density models


 Naive Bayes classifier with multinomial or multivariate Bernoulli event models.
The second set of methods includes discriminative models, which attempt to maximize the quality of
the output on a training set. Additional terms in the training cost function can easily perform
regularization of the final model. Examples of discriminative training of linear classifiers include:

 Logistic regression—maximum likelihood estimation of the weights w, assuming that the observed
training set was generated by a binomial model that depends on the output of the classifier.
 Perceptron—an algorithm that attempts to fix all errors encountered in the training set
 Fisher's Linear Discriminant Analysis—an algorithm (different than "LDA") that maximizes the
ratio of between-class scatter to within-class scatter, without any other assumptions. It is in
essence a method of dimensionality reduction for binary classification.
 Support vector machine—an algorithm that maximizes the margin between the decision
hyperplane and the examples in the training set.
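
As a brief illustration, several of the discriminative linear classifiers listed above are available in scikit-learn; the snippet below trains them on synthetic two-class data. The library choice and all data here are assumptions for demonstration, not part of the original text.

# Training a few linear classifiers on synthetic two-class data.
import numpy as np
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),   # class 0 cluster
               rng.normal(+1.5, 1.0, (100, 2))])  # class 1 cluster
y = np.array([0] * 100 + [1] * 100)

for model in (LogisticRegression(), Perceptron(), LinearSVC()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))  # training accuracy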
Logistic Regression
Logistic regression models the probabilities for classification problems with two possible outcomes.
It’s an extension of the linear regression model for classification problems.

Issues with Linear Regression for Classification


The linear regression model can work well for regression, but fails for classification. Why is that? In
case of two classes, you could label one of the classes with 0 and the other with 1 and use linear
regression. Technically it works and most linear model programs will spit out weights for you. But
there are a few problems with this approach:

A linear model does not output probabilities, but it treats the classes as numbers (0 and 1) and fits the
best hyperplane (for a single feature, it is a line) that minimizes the distances between the points and
the hyperplane. So it simply interpolates between the points, and you cannot interpret it as
probabilities.

A linear model also extrapolates and gives you values below zero and above one. This is a good sign
that there might be a smarter approach to classification.

Since the predicted outcome is not a probability, but a linear interpolation between points, there is
no meaningful threshold at which you can distinguish one class from the other. The figure below
gives a good illustration of this issue.

Fig: A linear model classifies tumors as malignant (1) or benign (0) given their size. The lines show
the prediction of the linear model. For the data on the left, we can use 0.5 as classification threshold.
After introducing a few more malignant tumor cases, the regression line shifts and a threshold of 0.5
no longer separates the classes. Points are slightly jittered to reduce over-plotting.
Linear models do not extend to classification problems with multiple classes. You would have to start
labeling the next class with 2, then 3, and so on. The classes might not have any meaningful order, but
the linear model would force a weird structure on the relationship between the features and your
class predictions. The higher the value of a feature with a positive weight, the more it contributes to
the prediction of a class with a higher number, even if classes that happen to get a similar number are
not closer than other classes.

Theory
A solution for classification is logistic regression. Instead of fitting a straight line or hyperplane, the
logistic regression model uses the logistic function to squeeze the output of a linear equation between
0 and 1. The logistic function is defined as:

logistic(η) = 1 / (1 + exp(−η))

And it looks like this:

Fig: The logistic function. It outputs numbers between 0 and 1. At input 0, it outputs 0.5.

The step from linear regression to logistic regression is fairly straightforward. In the linear regression
model, we modelled the relationship between the outcome and the features with a linear equation:

ŷ = β0 + β1x1 + … + βpxp

For classification, we prefer probabilities between 0 and 1, so we wrap the right side of the equation
into the logistic function. This forces the output to assume only values between 0 and 1:

P(y=1) = 1 / (1 + exp(−(β0 + β1x1 + … + βpxp)))
Let us revisit the tumor size example again. But instead of the linear regression model, we use the
logistic regression model:

Fig: The logistic regression model finds the correct decision boundary between malignant and benign
depending on tumor size. The line is the logistic function shifted and squeezed to fit the data.

Classification works better with logistic regression and we can use 0.5 as a threshold in both cases.
The inclusion of additional points does not really affect the estimated curve.
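
A minimal sketch of this tumor-size illustration with scikit-learn is shown below; the sizes, labels (malignant = 1, benign = 0), and library choice are assumptions made for demonstration.

# Fit a logistic regression on made-up tumor sizes and apply the 0.5 threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

sizes  = np.array([[1.0], [1.5], [2.0], [2.5], [3.5], [4.0], [4.5], [9.0], [10.0]])
labels = np.array([0,     0,     0,     0,     1,     1,     1,     1,     1])

clf = LogisticRegression().fit(sizes, labels)
probs = clf.predict_proba(sizes)[:, 1]      # P(malignant | size)
print(np.round(probs, 2))
print((probs >= 0.5).astype(int))           # classification at the 0.5 threshold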

Interpretation
The interpretation of the weights in logistic regression differs from the interpretation of the weights
in linear regression, since the outcome in logistic regression is a probability between 0 and 1. The
weights do not influence the probability linearly any longer. The weighted sum is transformed by the
logistic function to a probability. Therefore, we need to reformulate the equation for the
interpretation so that only the linear term is on the right side of the formula:

log( P(y=1) / (1 − P(y=1)) ) = β0 + β1x1 + … + βpxp

We call the term in the log() function "odds" (probability of event divided by probability of no event),
and wrapped in the logarithm it is called the log odds.

This formula shows that the logistic regression model is a linear model for the log odds. Great! That
does not sound helpful! With a little shuffling of the terms, you can figure out how the prediction
changes when one of the features is changed by 1 unit. To do this, we can first apply the exp()
function to both sides of the equation:

P(y=1) / (1 − P(y=1)) = odds = exp(β0 + β1x1 + … + βpxp)

Then we compare what happens when we increase one of the feature values by 1. But instead of
looking at the difference, we look at the ratio of the two predictions:

odds(xj + 1) / odds(xj) = exp(β0 + β1x1 + … + βj(xj + 1) + … + βpxp) / exp(β0 + β1x1 + … + βjxj + … + βpxp)

We apply the following rule:

exp(a) / exp(b) = exp(a − b)

And we remove many terms:

odds(xj + 1) / odds(xj) = exp(βj(xj + 1) − βjxj) = exp(βj)

In the end, we have something as simple as exp() of a feature weight. A change in a feature by one
unit changes the odds ratio (multiplicative) by a factor of exp(βj). We could also interpret it this way:
A change in xj by one unit increases the log odds ratio by the value of the corresponding weight. Most
people interpret the odds ratio because thinking about the log() of something is known to be hard on
the brain. Interpreting the odds ratio already requires some getting used to. For example, if you have
odds of 2, it means that the probability for y=1 is twice as high as y=0. If you have a weight (= log odds
ratio) of 0.7, then increasing the respective feature by one unit multiplies the odds by exp(0.7)
(approximately 2) and the odds change to 4. But usually you do not deal with the odds and interpret
the weights only as the odds ratios. Because for actually calculating the odds you would need to set a
value for each feature, which only makes sense if you want to look at one specific instance of your
dataset.

These are the interpretations for the logistic regression model with different feature types:

 Numerical feature: If you increase the value of feature xj by one unit, the estimated odds
change by a factor of exp(βj)
 Binary categorical feature: One of the two values of the feature is the reference category (in
some languages, the one encoded as 0). Changing the feature from the reference category
to the other category changes the estimated odds by a factor of exp(βj)
 Categorical feature with more than two categories: One solution to deal with multiple
categories is one-hot-encoding, meaning that each category has its own column. You only
need L-1 columns for a categorical feature with L categories, otherwise it is over-
parameterized. The L-th category is then the reference category. You can use any other
encoding that can be used in linear regression. The interpretation for each category is then
equivalent to the interpretation of binary features.
 Intercept β0: When all numerical features are zero and the categorical features are at the
reference category, the estimated odds are exp(β0). The interpretation of the intercept
weight is usually not relevant.

Example
We use the logistic regression model to predict cervical cancer based on some risk factors. The
following table shows the estimate weights, the associated odds ratios, and the standard error of the
estimates.

Table: The results of fitting a logistic regression model on the cervical cancer dataset. Shown are
the features used in the model, their estimated weights and corresponding odds ratios, and the
standard errors of the estimated weights.

Weight Odds ratio Std. Error

Intercept -2.91 0.05 0.32

Hormonal contraceptives y/n -0.12 0.89 0.30

Smokes y/n 0.26 1.30 0.37

Num. of pregnancies 0.04 1.04 0.10

Num. of diagnosed STDs 0.82 2.27 0.33

Intrauterine device y/n 0.62 1.86 0.40

Interpretation of a numerical feature ("Num. of diagnosed STDs"): An increase in the number of
diagnosed STDs (sexually transmitted diseases) changes (increases) the odds of cancer vs. no cancer
by a factor of 2.27, when all other features remain the same. Keep in mind that correlation does not
imply causation.

Interpretation of a categorical feature (“Hormonal contraceptives y/n”): For women using hormonal
contraceptives, the odds for cancer vs. no cancer are by a factor of 0.89 lower, compared to women
without hormonal contraceptives, given all other features stay the same.

Like in the linear model, the interpretations always come with the clause that ‘all other features stay
the same’.
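
The odds ratios in the table are simply exp() of the weights, which can be checked with a few lines of Python; the dictionary below copies the reported weights purely for illustration.

# Turn the reported logistic regression weights into odds ratios via exp().
import numpy as np

weights = {
    "Intercept": -2.91,
    "Hormonal contraceptives y/n": -0.12,
    "Smokes y/n": 0.26,
    "Num. of pregnancies": 0.04,
    "Num. of diagnosed STDs": 0.82,
    "Intrauterine device y/n": 0.62,
}

for name, w in weights.items():
    # exp(0.82) ≈ 2.27 and exp(-0.12) ≈ 0.89, matching the table above.
    print(f"{name}: weight={w:+.2f}, odds ratio={np.exp(w):.2f}")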
Advantages and Disadvantages
Many of the pros and cons of the linear regression model also apply to the logistic regression model.
Logistic regression has been widely used by many different people, but it struggles with its restrictive
expressiveness (e.g. interactions must be added manually) and other models may have better
predictive performance.

Another disadvantage of the logistic regression model is that the interpretation is more difficult
because the interpretation of the weights is multiplicative and not additive.

Logistic regression can suffer from complete separation. If there is a feature that would perfectly
separate the two classes, the logistic regression model can no longer be trained. This is because the
weight for that feature would not converge, because the optimal weight would be infinite. This is
really a bit unfortunate, because such a feature is really useful. But you do not need machine learning
if you have a simple rule that separates both classes. The problem of complete separation can be
solved by introducing penalization of the weights or defining a prior probability distribution of weights.

On the good side, the logistic regression model is not only a classification model, but also gives you
probabilities. This is a big advantage over models that can only provide the final classification. Knowing
that an instance has a 99% probability for a class compared to 51% makes a big difference.

Logistic regression can also be extended from binary classification to multi-class classification. Then it
is called Multinomial Regression.
