
Course : Predictive analytics

Lecture On : Linear Regression


Instructor :

upgrad.com

Session Agendas

In this session we will learn and revise important concepts of Linear Regression.

WHAT IS MEANT BY LINEAR REGRESSION ?

Linear regression is one of the most commonly used methods of predictive analysis. It uses a linear relationship
between a dependent variable (target) and one or more independent variables (predictors) to
predict future values of the target. The prediction rests on the assumption that the target depends on
the predictors, or is causally related to them.

For example, a company may analyze how past advertisements were related to increases in sales
in order to decide about future advertisements. In this example, the dependent variable is
sales, and the independent variable is advertisement expenses.


Types of variables

Dependent variable (predicted or response variable)
o It is the variable that we are trying to predict or to explain using other variables. This is the target
variable that you will estimate through your analysis.

Independent variable(s) (predictor or explanatory variable(s))
o The variables used to explain the dependent variable are termed independent variables.
The dependent variable can be explained using:
▪ one variable - simple regression
▪ multiple variables - multiple regression


Ways to check for a linear relationship

● Scatter plot: If the scatter plot between the dependent and the independent variables shows a linear
rising or falling trend, it means that you can try to fit a line over the plot indicating the linear
relationship between the two variables.

● Correlation coefficient: The correlation coefficient provides a measure of the linear relationship
between the dependent and the independent variable. If the absolute value of the correlation
between the two variables is close to 1, it means that the variables have a strong linear relationship.
(This is valid only for simple linear regression.)
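
As an illustration of the correlation check, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The data values and the names `ad_spend`, `sales`, and `pearson_r` are made up for this example and do not come from the lecture.

```python
import math

# Hypothetical data (not from the lecture): advertising spend and sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(ad_spend, sales)
print(f"r = {r:.3f}")  # an absolute value close to 1 suggests a strong linear relationship
```

A value of r near +1 or -1 supports trying a straight-line fit; a value near 0 suggests that a line is a poor description of the data.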


Simple linear equation

After validating that a linear relationship exists between the two variables (dependent and
independent) using the scatter plot and the correlation coefficient, the next step is to fit a simple
regression model to the given dataset. A simple linear regression equation has the following general form:

Y = β0 + β1 ∗ X

where,
Y : the dependent or predicted variable
X : the independent or predictor variable
β0 : constant or intercept
β1 : the slope of the line
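
The equation above can be fitted by ordinary least squares. The sketch below uses the standard closed-form estimates for β0 and β1; the lecture does not say which fitting method or tool it uses, so treat this as one possible approach. The data and the name `fit_simple_ols` are hypothetical.

```python
def fit_simple_ols(x, y):
    """Least-squares estimates (beta0, beta1) for Y = beta0 + beta1 * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta1 = sxy / sxx                  # slope
    beta0 = mean_y - beta1 * mean_x    # intercept: the line passes through the means
    return beta0, beta1

# Hypothetical data (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_ols(x, y)
print(f"Y = {beta0:.2f} + {beta1:.2f} * X")
```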


Regression coefficients interpretation

● β0 : The beta-nought value is the Y-intercept, that is, the predicted value of Y when X (the independent
variable) is zero. It may not always have a meaningful business interpretation.
● β1 : The beta-one value gives the slope of the regression line that we fit to the dataset. With a
change of one unit in the independent variable, the predicted value of the dependent variable increases or
decreases by |β1|, depending on the sign of β1.
o If the value of β1 is positive (negative), it means that Y and X are positively (negatively) related,
that is, as X increases, the value of Y increases (decreases).
o If the value of β1 is zero, the model finds no linear relationship between X and Y: a change in X does
not change the predicted value of Y.


Modelling: 8-step Process

To build a model, you need to carry out the following steps:

1. Problem definition
2. Selection of relevant variables
3. Data collection
4. Model specification
5. Choice of the fitting method
6. Model fitting
7. Model validation and criticism
8. Using the model to solve the problem

Formulas and RegressIT- Simple_regression_formulas_and_RegressIt_model.xls


Predicting baseball batting averages

Sports analytics is a booming field. Owners, coaches, and fans are using statistical measures and models of
all kinds to study the performance of players and teams. A very simple example is provided by the study of
yearly data on batting averages for individual players in the sport of baseball. The sample used here contains
588 rows of data for a select group of players during the years 1960-2004, and it was obtained from the
Lahman Baseball Database.

Objective : To predict a player’s batting average in a given year from his batting average in the previous year
and/or his cumulative batting average over all previous years for which data is available. A much larger file,
with 4535 rows and 82 columns, covers more players and more statistical measures of performance.


Data description

Each row in the data file contains statistics for a single player for a single year in which the player had at least 400
at-bats and also at least 400 at-bats in the previous year. The latter constraint was imposed to ensure that only regular
players (the best on their teams at their respective positions) were included and also so that the sample size of at-bats
for each player was large. The statistics for the analysis consist of batting average, batting average lagged by one
year, and cumulative batting average lagged by one year. The first few rows look like this:

The term “lagged” means “lagging behind” by a specified number of periods, i.e., an observation of the same variable
in an earlier period. For example, Hank Aaron’s value of 0.292 for BattingAverageLAG1 in 1961 is by definition the
same as the value of BattingAverage for him in 1960. In general in this file, BattingAverageLAG1 in a given row is
equal to BattingAverage in the previous row if the previous row corresponds to the previous year for the same player.
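
The lag rule described above can be sketched as follows. The player names, years, values, and the field name `avg_lag1` are hypothetical and are not taken from the data file; the sketch only shows the rule that a lagged value is filled in when the previous row is the previous year for the same player.

```python
# Hypothetical rows (not from the Lahman data): one record per player per year.
rows = [
    {"player": "Player A", "year": 1960, "avg": 0.290},
    {"player": "Player A", "year": 1961, "avg": 0.310},
    {"player": "Player A", "year": 1963, "avg": 0.305},  # 1962 is missing
    {"player": "Player B", "year": 1960, "avg": 0.270},
    {"player": "Player B", "year": 1961, "avg": 0.281},
]

def add_lag1(rows):
    """Return new rows with 'avg_lag1' set to the same player's average in the
    previous year, or None when the previous row is not that year."""
    ordered = sorted(rows, key=lambda r: (r["player"], r["year"]))
    out = []
    prev = None
    for r in ordered:
        is_previous_year = (
            prev is not None
            and prev["player"] == r["player"]
            and prev["year"] == r["year"] - 1
        )
        out.append({**r, "avg_lag1": prev["avg"] if is_previous_year else None})
        prev = r
    return out

lagged = add_lag1(rows)
for r in lagged:
    print(r["player"], r["year"], r["avg"], r["avg_lag1"])
```

Rows with no usable lag (a player's first year, or a year that follows a gap) get `None` here; in a regression they would have to be excluded.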


Observations

Stage 1: Descriptive analysis


The mean value of batting average is 0.277 (“two seventy seven” in baseball language), and the means of the
lagged averages and lagged cumulative averages are only slightly different. The correlation between batting
average and lagged batting average (0.481) is a little smaller than the correlation between batting average and
lagged cumulative batting average (0.538), i.e., the player’s batting average in a given year is better predicted
by his cumulative history of performance than by his performance in the immediately preceding year.

Stage 2: SLR Model


The estimated slope coefficient is 0.698, which means that a player whose prior cumulative average deviated
from the mean by the amount x is predicted to have a batting average that deviates from the mean by about
0.7x in the current year, i.e., his batting average is predicted to regress-to-the-mean by 30% relative to his prior
cumulative batting average.
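
The regression-to-the-mean reading of the slope can be checked with a small calculation. This is a sketch that assumes the mean of the lagged cumulative average equals the overall mean of 0.277 (the slides say only that the means are slightly different), so the fitted line is taken to pass through (0.277, 0.277). The function name and the example input of 0.300 are hypothetical.

```python
MEAN = 0.277   # mean batting average from Stage 1; assumed to apply to both variables
SLOPE = 0.698  # estimated slope coefficient from Stage 2

def predict_average(prior_cumulative):
    """Predicted batting average under the assumption stated above."""
    return MEAN + SLOPE * (prior_cumulative - MEAN)

prior = 0.300                      # hypothetical prior cumulative average
pred = predict_average(prior)
deviation_before = prior - MEAN    # how far the player was above the mean
deviation_after = pred - MEAN      # how far the prediction is above the mean

print(f"predicted average: {pred:.3f}")
print(f"share of deviation retained: {deviation_after / deviation_before:.3f}")
```

The retained share equals the slope, about 0.7, so roughly 30% of the player's deviation from the mean is predicted to disappear.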

Stage 3: MLR Model


In the second (MLR) model, both coefficients are positive and their sum is 0.708, which is essentially the same as the
slope coefficient in the first (SLR) model. This means that the second model merely reallocates some of the weight on prior
performance to place a little more on the most recent year’s average.
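
A two-predictor model of this kind can be fitted with least squares as sketched below. The data are synthetic and noise-free, generated from made-up coefficients (0.2 and 0.5) purely to show the mechanics; they are not the lecture's data, and the coefficients are not the lecture's results. The sketch assumes NumPy is available.

```python
import numpy as np

# Synthetic predictors (hypothetical values, not from the data file).
lag1 = np.array([0.250, 0.300, 0.280, 0.310, 0.265, 0.295])  # last year's average
cum = np.array([0.260, 0.285, 0.290, 0.300, 0.270, 0.280])   # prior cumulative average

# Noise-free target built from made-up coefficients, so the fit should recover them.
y = 0.08 + 0.2 * lag1 + 0.5 * cum

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(lag1), lag1, cum])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b_lag1, b_cum = coef

print(f"intercept = {b0:.3f}, lag1 = {b_lag1:.3f}, cumulative = {b_cum:.3f}")
print(f"sum of slope coefficients = {b_lag1 + b_cum:.3f}")
```

Comparing the sum of the two slope coefficients with the single slope of the simple model is the same check the slide describes.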


Any Queries?

Thank You!
