Data Science Simplified Part 4: Simple Linear Regression Models



Pradeep Menon
Jul 30, 2017 · 9 min read

In the previous posts of this series, we discussed the concepts of statistical learning and hypothesis testing. In this article, we dive into linear regression models.

Before we dive in, let us recall some important aspects of statistical learning.

Independent and Dependent variables:

In the context of statistical learning, there are two types of data:

Independent variables: Data that can be controlled directly.

Dependent variables: Data that cannot be controlled directly.

The data that cannot be controlled directly, i.e. the dependent variables, need to be predicted or estimated.

Model:

A model is a transformation engine that helps us express dependent variables as a function of independent variables.

Parameters:

Parameters are ingredients added to the model for estimating the output.

Concept
Linear regression models provide a simple approach towards supervised learning. They are simple yet effective.

Wait, what do we mean by linear?

Linear implies the following: arranged in or extending along a straight or nearly straight line. Linear suggests that the relationship between the dependent and independent variables can be expressed as a straight line.

Recall the geometry lesson from high school. What is the equation of a line?

y = mx + c

Linear regression is nothing but a manifestation of this simple equation.

y is the dependent variable i.e. the variable that needs to be estimated and predicted.

x is the independent variable i.e. the variable that is controllable. It is the input.

m is the slope. It determines the angle of the line. It is the parameter denoted as β.

c is the intercept. A constant that determines the value of y when x is 0.
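This equation translates directly into code. Here is a minimal sketch; the slope and intercept values are arbitrary, chosen only for illustration:

```python
def line(x, m, c):
    """Return y for a given x, slope m, and intercept c."""
    return m * x + c

# With slope m = 2 and intercept c = 5:
print(line(0, 2, 5))  # x = 0 returns the intercept: 5
print(line(3, 2, 5))  # 2 * 3 + 5 = 11
```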

George Box, a famous British statistician, once said:

“All models are wrong, but some are useful.”

Linear regression models are not perfect. They try to approximate the relationship between the dependent and independent variables with a straight line. Approximation leads to errors. Some errors can be reduced. Some errors are inherent in the nature of the problem and cannot be eliminated. These are called the irreducible error: the noise term in the true relationship that cannot fundamentally be reduced by any model.

The same equation of a line can be re-written as:

y = β0 + β1x + ε

β0 and β1 are two unknown constants that represent the intercept and slope. They are the parameters.

ε is the error term.

Formulation
Let us go through an example to explain the
terms and workings of a Linear regression
model.

Fernando is a Data Scientist. He wants to buy a car. He wants to estimate or predict the car price that he will have to pay. He has a friend at a car dealership company. He asks for the prices of various other cars, along with a few characteristics of each car. His friend provides him with some information.

The following are the data provided to him:

make: make of the car.

fuelType: type of fuel used by the car.

nDoor: number of doors.

engineSize: size of the engine of the car.

price: the price of the car.

First, Fernando wants to evaluate if indeed he can predict car price based on engine size. The first set of analysis seeks the answers to the following questions:

Is the price of the car related to the engine size?

How strong is the relationship?

Is the relationship linear?

Can we predict/estimate car price based on engine size?

Fernando does a correlation analysis. Correlation is a measure of how strongly two variables are related. It is measured by a metric called the correlation coefficient. Its value is between -1 and 1.

If the correlation coefficient is a large (> 0.7) positive number, it implies that as one variable increases, the other variable increases as well. A large negative number indicates that as one variable increases, the other variable decreases.
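As a sketch of what a statistical package computes under the hood, the Pearson correlation coefficient can be written in plain Python. The engine sizes and prices below are made-up numbers, not the article's dataset:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative engine sizes and prices:
engine = [97, 109, 130, 152, 181]
price = [7000, 8500, 11000, 14500, 17000]
print(round(pearson_r(engine, price), 3))  # close to 1: a strong positive relationship
```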

He plots the relationship between price and engine size.


Following are the answers to the questions:

Is the price of the car related to the engine size?

Yes, there is a relationship.

How strong is the relationship?

The correlation coefficient is 0.872 => There is a strong relationship.

Is the relationship linear?

A straight line can fit => A decent prediction of price can be made using engine size.

Can we predict/estimate the car price based on engine size?

Yes, car price can be estimated based on engine size.

Fernando now wants to build a linear regression model that will estimate the price of the car based on engine size. Superimposing the equation onto the car price problem, Fernando formulates the following equation for price prediction.

price = β0 + β1 x engine size

Model Building and Interpretation

Model
Recall the earlier discussion on how the data needs to be split into training and testing sets. The training data is used to learn about the data and to create the model. The testing data is used to evaluate the model's performance.

Fernando splits the data into training and test sets. 75% of the data is used for training; the remaining is used for testing. He builds a linear regression model. He uses a statistical package to create the model. The model produces a linear equation that expresses the price of the car as a function of engine size.
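In code, the split-then-fit workflow looks roughly like this. This is a pure-Python sketch using the closed-form least-squares solution; in practice a statistical package would do the fitting, and the (engine size, price) pairs below are invented for illustration:

```python
import random

def fit_ols(xs, ys):
    """Simple least squares: returns the intercept b0 and slope b1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Illustrative (engine size, price) pairs:
data = [(97, 7000), (109, 8500), (130, 11000), (152, 14500),
        (181, 17000), (120, 9800), (140, 12500), (160, 15000)]
random.seed(42)
random.shuffle(data)
cut = int(0.75 * len(data))           # 75% for training
train, test = data[:cut], data[cut:]  # remaining 25% for testing

b0, b1 = fit_ols([x for x, _ in train], [y for _, y in train])
print(f"price = {b0:.1f} + {b1:.1f} * engine size")
```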


The model estimates the parameters:

β0 is estimated as -6870.1

β1 is estimated as 156.9

The linear equation is estimated as:

price = -6870.1 + 156.9 x engine size
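Plugging the estimated parameters back into the equation gives a one-line prediction function:

```python
def predict_price(engine_size):
    """Predicted average car price from the estimated model."""
    b0, b1 = -6870.1, 156.9  # estimates produced by the fitted model
    return b0 + b1 * engine_size

print(round(predict_price(130), 1))  # -6870.1 + 156.9 * 130 = 13526.9
```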

Interpretation

The model provides the equation for predicting the average car price given a specific engine size. This equation means the following:

One unit increase in engine size will increase the average price of the car by 156.9 units.

Evaluation
The model is built. The robustness of the model needs to be evaluated. How can we be sure that the model will be able to predict the price satisfactorily? This evaluation is done in two parts. First, a test to establish the robustness of the model. Second, a test to evaluate the accuracy of the model.

Fernando first evaluates the model on the training data. He gets the following statistics.

There are a lot of statistics in there. Let us focus on the key ones (marked in red). Recall the discussion on hypothesis testing. The robustness of the model is evaluated using hypothesis testing.

H0 and Ha need to be defined. They are defined as follows:

H0 (NULL hypothesis): There is no relationship between x and y i.e. there is no relationship between price and engine size.

Ha (Alternate hypothesis): There is some relationship between x and y i.e. there is a relationship between price and engine size.

β1: The value of β1 determines the relationship between price and engine size. If β1 = 0, then there is no relationship. In this case, β1 is positive. It implies that there is some relationship between price and engine size.

t-stat: The t-stat value is how many standard deviations the coefficient estimate (β1) is away from zero. The further it is from zero, the stronger the relationship between price and engine size, and the more significant the coefficient. In this case, the t-stat is 21.09. It is far enough from zero.
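As a rough sketch of where this number comes from: the t-stat is the estimated slope divided by its standard error, which for simple regression can be computed directly. The data below is invented for illustration, so its t-stat will not match the 21.09 above:

```python
import math

def slope_t_stat(xs, ys):
    """t-statistic for the slope of a simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(rss / (n - 2)) / math.sqrt(sxx)  # standard error of b1
    return b1 / se_b1

engine = [97, 109, 130, 152, 181, 120, 140, 160]
price = [7000, 8500, 11000, 14500, 17000, 9800, 12500, 15000]
print(round(slope_t_stat(engine, price), 2))  # many standard deviations from zero
```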

p-value: The p-value is a probability value. It indicates the chance of seeing the given t-statistic under the assumption that the NULL hypothesis is true. If the p-value is small, e.g. < 0.0001, it implies that the probability that this result is by chance, and there is no relationship, is very low. In this case, the p-value is small. It means that the relationship between price and engine size is not by chance.

With these metrics, we can safely reject the NULL hypothesis and accept the alternate hypothesis. There is a robust relationship between price and engine size.

The relationship is established. How about accuracy? How accurate is the model? To get a feel for the accuracy of the model, a metric named R-squared, or the coefficient of determination, is important.

R-squared or Coefficient of determination: To understand this metric, let us break it down into its components.

Error (e) is the difference between the actual y and the predicted y. The predicted y is denoted as ŷ. This error is evaluated for each observation. These errors are also called residuals.

Then all the residual values are squared and added. This term is called the Residual Sum of Squares (RSS). The lower the RSS, the better it is.

There is another part of the equation of R-squared. To get the other part, first, the mean value of the actual target is computed, i.e. the average price of the car is estimated. Then the differences between the mean value and the actual values are calculated. These differences are then squared and added. This is the Total Sum of Squares (TSS).

R-squared, a.k.a. the coefficient of determination, is computed as 1 - RSS/TSS. This metric explains the fraction of the variance in the dependent variable that the model captures, as opposed to simply predicting the mean of the actual values. This value is between 0 and 1. The higher it is, the better the model can explain the variance.
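The computation above can be sketched in a few lines of Python. The three actual/predicted prices here are made-up numbers for illustration:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - RSS/TSS."""
    mean_y = sum(actual) / len(actual)
    rss = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
    tss = sum((y - mean_y) ** 2 for y in actual)
    return 1 - rss / tss

actual = [7000, 8500, 11000]
predicted = [7200, 8300, 11100]
print(round(r_squared(actual, predicted), 3))  # 0.989: the fit explains ~99% of the variance
```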

Let us look at an example.

In the example above, RSS is computed based on the predicted price for three cars. The RSS value is 41,450,201.63. The mean value of the actual price is 11,021. TSS is calculated as 44,444,546. R-squared is computed as 6.737%. For these three specific data points, the model is only able to explain 6.737% of the variation. Not good enough!!

However, for Fernando's model, it is a different story. The R-squared for the training set is 0.7503 i.e. 75.03%. It means that the model can explain more than 75% of the variation.

Conclusion
Voila!! Fernando has a good model now. It performs satisfactorily on the training data. However, 25% of the variation is still unexplained. There is room for improvement. How about adding more independent variables for predicting the price? When more than one independent variable is added for predicting a dependent variable, a multivariate regression model is created, i.e. a model with more than one variable.

The next installment of this series will delve more into the multivariate regression model. Stay tuned.

Towards Data Science

WRITTEN BY

Pradeep Menon

Director of #BigData and #AI Solutions @ Alibaba Cloud. Impact driven. Executive-level interpersonal skills. Hands-On. #WorldTraveller. #Blogger
