Linear Regression (BA)
upgrad.com
Session Agendas
Linear regression is one of the most commonly used methods of predictive analysis. It uses a linear relationship
between a dependent variable (target) and one or more independent variables (predictors) to
predict values of the target. The prediction rests on the assumption that the target depends on the
predictors, whether through association or through a causal link.
Types of variables
● Scatter plot: If the scatter plot between the dependent and the independent variables shows a linear
rising or falling trend, it means that you can try to fit a line over the plot indicating the linear
relationship between the two variables.
● Correlation coefficient: The correlation coefficient provides a measure of the linear relationship
between the dependent and the independent variable. If the absolute value of the correlation
between the two variables is close to 1, it means that the variables have a strong linear relationship.
(This is valid only for simple linear regression.)
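The correlation check described above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name `pearson_r` and the sample values are assumptions chosen only to show the calculation of the Pearson correlation coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative data: y rises almost linearly with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
print(f"correlation coefficient: {r:.3f}")
```

An absolute value of `r` close to 1 suggests that fitting a straight line is worth trying; a value near 0 suggests that a linear model is unlikely to help.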
After validating that a linear relationship exists between the two variables (dependent and
independent) using the scatter plot and the correlation coefficient, the next step is to fit a simple
regression model over the given dataset. A simple linear regression model takes the form of the
general equation of a line:
Y = β0 + β1 ∗ X
where,
Y : the dependent or predicted variable
X : the independent or predictor variable
β0 : constant or intercept
β1 : the slope of the line
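As a minimal sketch of how β0 and β1 might be estimated, the code below uses ordinary least squares. The slides do not name a fitting method at this point, so the choice of least squares, the function name `fit_simple_linear_regression`, and the sample data are all assumptions made for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) for Y = beta0 + beta1 * X using ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X must vary for the slope to be defined")
    beta1 = sxy / sxx                 # slope
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1


# Made-up data lying exactly on the line Y = 1 + 2 * X.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = fit_simple_linear_regression(x, y)
print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
```

With real data the points will not lie exactly on a line, and the fitted β0 and β1 describe the line that minimizes the sum of squared vertical distances to the points.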
● β0 : The beta-naught value is the Y-intercept, that is, the value of Y when X (the independent
variable) is zero. This may not always have business significance.
● β1 : The beta-one value is the slope of the regression line that we fit over the dataset. For a
change of one unit in the independent variable, the value of the dependent variable changes
by β1: it increases if β1 is positive and decreases if β1 is negative.
o If the value of β1 is positive (negative), it means that Y and X are positively (negatively) related,
that is, as X increases, the value of Y increases (decreases).
o If the value of β1 is zero, the fitted line is flat, which means that the model finds no linear
relationship between X and Y.
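The interpretation of the two coefficients can be checked with a small worked example. The coefficient values below are hypothetical, chosen only to illustrate a negative slope; they do not come from any dataset in these slides.

```python
def predict(beta0, beta1, x):
    """Value of Y on the line Y = beta0 + beta1 * X."""
    return beta0 + beta1 * x


# Hypothetical coefficients, for illustration only.
beta0, beta1 = 0.5, -2.0

# Prediction at X = 0 equals the intercept beta0.
at_zero = predict(beta0, beta1, 0.0)

# Moving X up by one unit changes the prediction by exactly beta1.
change = predict(beta0, beta1, 4.0) - predict(beta0, beta1, 3.0)

print(f"prediction at X = 0: {at_zero}")
print(f"change in Y for a one-unit increase in X: {change}")
```

Because β1 is negative here, Y falls as X rises, which matches the negative-relationship case described above.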
1. Problem definition
2. Selection of relevant variables
3. Data collection
4. Model specification
5. Choice of the fitting method
6. Model fitting
7. Model validation and criticism
8. Using the model to solve the problem
Sports analytics is a booming field. Owners, coaches, and fans are using statistical measures and models of
all kinds to study the performance of players and teams. A very simple example is provided by the study of
yearly data on batting averages for individual players in the sport of baseball. The sample used here contains
588 rows of data for a select group of players during the years 1960-2004, and it was obtained from the
Lahman Baseball Database.
Objective: To predict a player’s batting average in a given year from his batting average in the previous year
and/or his cumulative batting average over all previous years for which data is available. A much larger file
with 4535 rows and 82 columns, covering more players and more statistical measures of performance, is also available.
Data description
Each row in the data file contains statistics for a single player for a single year in which the player had at least 400
at-bats and also at least 400 at-bats in the previous year. The latter constraint was imposed to ensure that only regular
players (the best on their teams at their respective positions) were included and also so that the sample size of at-bats
for each player was large. The statistics for the analysis consist of batting average, batting average lagged by one
year, and cumulative batting average lagged by one year. The first few rows look like this:
The term “lagged” means “lagging behind” by a specified number of periods, i.e., an observation of the same variable
in an earlier period. For example, Hank Aaron’s value of 0.292 for BattingAverageLAG1 in 1961 is by definition the
same as the value of BattingAverage for him in 1960. In general in this file, BattingAverageLAG1 in a given row is
equal to BattingAverage in the previous row if the previous row corresponds to the previous year for the same player.
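The lagging rule described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build the actual file: the function name `add_lag1`, the row layout, and every value except Hank Aaron's 0.292 for 1960 are made up.

```python
def add_lag1(rows):
    """Add BattingAverageLAG1 to each row.

    rows: list of dicts with keys 'player', 'year', 'BattingAverage'.
    The lag is filled only when the preceding row is the same player
    in the immediately preceding year; otherwise it is None.
    """
    ordered = sorted(rows, key=lambda r: (r["player"], r["year"]))
    result = []
    previous = None
    for row in ordered:
        lag = None
        if (
            previous is not None
            and previous["player"] == row["player"]
            and previous["year"] == row["year"] - 1
        ):
            lag = previous["BattingAverage"]
        result.append({**row, "BattingAverageLAG1": lag})
        previous = row
    return result


rows = [
    {"player": "Hank Aaron", "year": 1960, "BattingAverage": 0.292},  # value from the text
    {"player": "Hank Aaron", "year": 1961, "BattingAverage": 0.300},  # placeholder value
    {"player": "Player B", "year": 1961, "BattingAverage": 0.280},    # hypothetical player
    {"player": "Player B", "year": 1963, "BattingAverage": 0.275},    # 1962 is missing
]
lagged = add_lag1(rows)
for row in lagged:
    print(row)
```

In this sketch the 1961 row for Hank Aaron receives a lag of 0.292, while the 1963 row for the hypothetical Player B receives no lag because the previous row is not the previous year.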
Observations
Any Queries?
Thank You!