
Linear Regression

Dr. Pradeep Kumar Dadabada


Asst. Professor, Information Systems & Analytics
pkd@iimshillong.ac.in
Predictive Analytics?
 The term “predictive analytics” describes the application of a
statistical or machine learning technique to create a quantitative
prediction about the future. Frequently, supervised machine learning
techniques are used to predict a future value (How long can this
machine run before requiring maintenance?) or to estimate a
probability (How likely is this customer to default on a loan?).

 Predictive analytics starts with a business goal: to use data to reduce
waste, save time, or cut costs. The process harnesses heterogeneous,
often massive, data sets into models that can generate clear,
actionable outcomes to support achieving that goal, such as less
material waste, less stocked inventory, and manufactured product that
meets specifications.
Overview of Regression
 Regression models are generally models that predict a numeric
outcome.
 Regression models are often expressed as a mathematical formula
that captures the relationship between a collection of input variables
and the numeric target variable. This formula can then be applied to
new observations to predict a numeric outcome.
 Interestingly, regression comes from the word “regress,” which
means to move backwards. It was used by Galton (1885) to describe
the tendency of observations to regress (i.e., move back) toward
the average.
 The early research included investigations that separated people into
different classes based on their characteristics. The regression came
from modelling the heights of related people (Crano and Brewer,
2002).
Linear Regression: What?
Simple Linear Regression
Least Squares Estimation (LSE/OLS)
Simple Linear Regression: Example
Example Dataset
 What are the features?
 TV: advertising dollars spent on TV for a single product in
a given market (in thousands of dollars)
 Radio: advertising dollars spent on Radio
 Newspaper: advertising dollars spent on Newspaper
 What is the response?
 Sales: sales of a single product in a given market (in
thousands of widgets)
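The least-squares fit referenced in the LSE/OLS slides can be computed directly from its closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch in plain Python, using made-up TV/Sales numbers (not the actual Advertising data) that lie exactly on the line sales = 7 + 0.05 × TV:

```python
# Closed-form simple linear regression (ordinary least squares):
#   beta1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
#   beta0 = y_mean - beta1 * x_mean
def ols_fit(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Illustrative points lying exactly on sales = 7 + 0.05 * tv
tv = [100.0, 150.0, 200.0, 250.0]
sales = [12.0, 14.5, 17.0, 19.5]
b0, b1 = ols_fit(tv, sales)
print(b0, b1)  # -> 7.0 0.05
```

Because the points here are noise-free, the fit recovers the generating line exactly; on real data the residuals would be nonzero.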
Simple Linear Regression Demo in
Orange
 1. Load “Advertising1.csv” using “File” Widget from “Data”
Toolbar. Select “Sales” as the target variable.
 2. View the data using “Data Table” after connecting it to “File”
Widget.
 3. Drag and Drop “Linear Regression” widget from “Model”
toolbar. Connect this to “Data” widget.
 4. View the regression coefficients after connecting it to “Linear
Regression” Widget.
 5. Connect “Linear Regression” and “File” to “Test and Score”
from “Evaluate”
 6. Connect “Data Table” to “Test and Score”
 7. Connect “Save Data” to “Data Table” to save the predictions to a
CSV file
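For readers without Orange, the same workflow (load data, fit a line, score it, save predictions) can be sketched in plain Python. The inline rows below are made-up stand-ins for “Advertising1.csv”, and the RMSE computed here corresponds to what “Test and Score” reports (Orange would normally cross-validate rather than score on the training data):

```python
import csv
import io
import math

# Stand-in for Advertising1.csv: (TV spend, Sales) pairs, invented for illustration.
rows = [(230.1, 22.1), (44.5, 10.4), (17.2, 9.3), (151.5, 18.5), (180.8, 12.9)]
tv = [r[0] for r in rows]
sales = [r[1] for r in rows]

# Fit simple linear regression by least squares (the Linear Regression widget's job).
n = len(tv)
mx, my = sum(tv) / n, sum(sales) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(tv, sales)) / sum((x - mx) ** 2 for x in tv)
b0 = my - b1 * mx

# Score the fit (Test and Score would normally use cross-validation instead).
preds = [b0 + b1 * x for x in tv]
rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, sales)) / n)

# Save predictions as CSV (the Save Data step).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["TV", "Sales", "Predicted"])
for x, y, p in zip(tv, sales, preds):
    writer.writerow([x, y, round(p, 3)])
print(f"RMSE: {rmse:.3f}")
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice you would open a real file instead of `io.StringIO`.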
Interpreting Model Coefficients
 How do we interpret the TV coefficient (𝛽1)?
 A "unit" increase in TV ad spending is associated with a
0.0492751 "unit" increase in Sales.
 Or more clearly: An additional $1,000 spent on TV ads
is associated with an increase in sales of 49.2751 widgets.
 Note that if an increase in TV ad spending was associated
with a decrease in sales, 𝛽1 would be negative.
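The unit conversion above is just a rescaling of the coefficient, since both TV spend and Sales are measured in thousands. A quick check of the arithmetic (only the slope comes from the slides; no intercept is given, so none is used here):

```python
beta1 = 0.0492751           # Sales (thousands of widgets) per $1,000 of TV spend
extra_tv_dollars = 1000.0   # one "unit" of TV advertising
extra_sales_units = beta1 * (extra_tv_dollars / 1000.0)  # in thousands of widgets
print(round(extra_sales_units * 1000, 4))  # widgets -> 49.2751
```

This confirms the slide's reading: $1,000 more of TV advertising is associated with roughly 49.3 additional widgets sold.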
Interpreting Performance Metrics
 (Root) Mean Squared Error (RMSE/MSE): RMSE is the square root of the mean of the squared
residuals (i.e., the square root of MSE). Lower values of RMSE/MSE/MAE indicate a better fit; each is a good
measure of how accurately the model predicts the response (target variable).

 Ex: RMSE of Model 1: 14.5
    RMSE of Model 2: 16.7
    RMSE of Model 3: 9.8
Model 3 has the lowest RMSE, which tells us that it fits the dataset best out of the three
candidate models.

An RMSE of $24.5 means that the typical gap between the predicted and the actual price is about
$24.5. Because squaring weights large errors more heavily, RMSE is always at least as large as MAE,
so it can be read as a conservative (upper-bound) estimate of the typical prediction error.

MAE (Mean Absolute Error): Suppose a regressor for predicting house prices has an MAE of
$20.5. This has a direct interpretation: the average absolute difference between the predicted
and the actual price is $20.5.

R-squared of the model tells us the proportion of the variance in the response variable that can be
explained by the predictor variable(s) in the model.
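All three metrics can be checked by hand on a toy set of actual vs. predicted values (the numbers below are made up, not from the Advertising data; the predictions are deliberately off by 0.5 each so the results are easy to verify):

```python
import math

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

n = len(actual)
errors = [p - a for p, a in zip(predicted, actual)]
mse = sum(e ** 2 for e in errors) / n      # mean squared error
rmse = math.sqrt(mse)                      # same units as the target
mae = sum(abs(e) for e in errors) / n      # mean absolute error

mean_a = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)                 # unexplained variation
ss_tot = sum((a - mean_a) ** 2 for a in actual)      # total variation
r2 = 1 - ss_res / ss_tot                             # proportion of variance explained

print(rmse, mae, r2)  # -> 0.5 0.5 0.95
```

Note that RMSE equals MAE here only because every error has the same magnitude; with a mix of small and large errors, RMSE would exceed MAE.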
Overfitting and Underfitting
Overfitting in Machine Learning
Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent
that it negatively impacts the performance of the model on new data. This means that the noise
or random fluctuations in the training data are picked up and learned as concepts by the model.
The problem is that these concepts do not apply to new data and negatively impact the model's
ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility
when learning a target function. As such, many nonparametric machine learning algorithms also
include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very
flexible and is subject to overfitting training data. This problem can be addressed by pruning a
tree after it has learned in order to remove some of the detail it has picked up.
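The decision-tree example generalizes: any sufficiently flexible model can memorize training noise. A minimal sketch in pure Python, contrasting a straight-line fit with a degree-9 interpolating polynomial on the same ten points (a deterministic alternating "noise" term is used instead of random noise so the numbers are reproducible; the data are invented):

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the degree-(n-1) polynomial interpolating (xs, ys) at x."""
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

def ols_line(xs, ys):
    """Least-squares straight line: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def rmse(pred, ys):
    return (sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys)) ** 0.5

# Underlying relationship y = 2x, plus deterministic alternating +/-0.2 "noise".
x_train = [i / 9 for i in range(10)]
y_train = [2 * x + 0.2 * (-1) ** i for i, x in enumerate(x_train)]
x_test = [(i + 0.5) / 9 for i in range(9)]  # midpoints between training x's
y_test = [2 * x for x in x_test]            # noise-free truth at the test points

b0, b1 = ols_line(x_train, y_train)         # simple model: a straight line
line_train = rmse([b0 + b1 * x for x in x_train], y_train)
line_test = rmse([b0 + b1 * x for x in x_test], y_test)

# Flexible model: the degree-9 polynomial passes through every noisy point.
interp_train = rmse([lagrange_eval(x_train, y_train, x) for x in x_train], y_train)
interp_test = rmse([lagrange_eval(x_train, y_train, x) for x in x_test], y_test)

print(f"line:   train RMSE {line_train:.3f}, test RMSE {line_test:.3f}")
print(f"interp: train RMSE {interp_train:.6f}, test RMSE {interp_test:.3f}")
```

The interpolant's training error is essentially zero because it memorizes the noise, yet it oscillates wildly between the training points and so does far worse than the straight line on new data, which is exactly the overfitting pattern described above.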
Overfitting and Underfitting
Underfitting in Machine Learning
Underfitting refers to a model that can neither model the training data nor generalize to new
data.

An underfit machine learning model is not a suitable model, and it is easy to spot: it will have
poor performance even on the training data.

Underfitting is often not discussed as it is easy to detect given a good performance metric. The
remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does
provide a good contrast to the problem of overfitting.

A Good Fit in Machine Learning


Ideally, you want to select a model at the sweet spot between underfitting and overfitting.

This is the goal, but it is very difficult to achieve in practice.


Regularized linear models
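Regularized linear models add a penalty on the size of the coefficients to the least-squares objective, trading a little training fit for better generalization. For a single centered predictor, ridge regression (L2 penalty) has the closed form beta_ridge = sum(x*y) / (sum(x^2) + lambda), which makes the shrinkage effect easy to see. A sketch with made-up data (lasso, the L1 variant, has no such closed form in general):

```python
# Ridge regression on one centered predictor: the L2 penalty lam shrinks
# the least-squares coefficient toward zero as lam grows.
def ridge_coef(x, y, lam):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [xi - mx for xi in x]          # center the predictor
    yc = [yi - my for yi in y]          # center the response
    return sum(a * b for a, b in zip(xc, yc)) / (sum(a * a for a in xc) + lam)

# Invented data roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, round(ridge_coef(x, y, lam), 4))
```

With lam = 0 this reduces to the ordinary least-squares slope; larger penalties pull the coefficient progressively toward zero, which is the behavior the regularization option in Orange's Linear Regression widget controls.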
Multiple Linear Regression/Regularized
Regression Demo in Orange
 1. Load “Advertising2.csv” using “File” Widget from “Data” Toolbar.
Select “Sales” as the target variable.
 2. View the data using “Data Table” after connecting it to “File”
Widget.
 3. Drag and Drop “Linear Regression” widget from “Model” toolbar.
Connect this to “Data” widget.
 4. View the regression coefficients after connecting it to “Linear
Regression” Widget.
 5. Connect “Linear Regression” and “File” to “Test and Score” from
“Evaluate”
 6. Double-click the “Linear Regression” widget and select a
regularization option (e.g., Ridge or Lasso)
 7. Connect “Data Table” to “Test and Score”
 8. Connect “Save Data” to “Data Table” to save the predictions to a
CSV file
