
Regression Analysis

 Once a linear relationship is defined, the independent variable can be used to forecast the dependent variable.

Ŷ = b0 + b1X

• b0 is called the Y intercept - represents the value of Y when X = 0. But be cautious - this interpretation may be incorrect or difficult to estimate, since many data sets do not include X = 0. Think of this value as representing the influences of the many other independent variables that are not included in the equation.
• b1 is called the slope - represents the amount of change in Y when X increases by one unit.
Regression Analysis

 Regression line - the line that best fits a collection of X-Y data points. The regression line minimizes the sum of the squared distances from the points to the line.
 Regression equation - found by the Method of Least Squares, which solves for b0 and b1. Other model-building approaches: stepwise, forward, and backward stepwise selection.
Regression Assumptions
 Y values are normally distributed about the regression
line
 Variance remains constant as X values increase and
decrease. Violation is called heteroscedasticity.
 Error terms (residuals) are independent of one another
- random (no autocorrelation)
 Linear relationship exists between X and Y - nonlinear
techniques are discussed later.
Excel’s Regression Tool

Sales	Advertising
27	20
23	20
31	25
45	28
47	29
42	28
39	31
45	34
57	35
59	36
73	41
84	45

Regression Statistics
Multiple R	0.964212
R Square	0.929705
Adjusted R Square	0.922675
Standard Error	5.039375
Observations	12

ANOVA
	df	SS	MS	F	Significance F
Regression	1	3358.714	3358.714	132.2573	4.35E-07
Residual	10	253.953	25.3953
Total	11	3612.667

	Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%
Intercept	-23.0191	6.316228	-3.64444	0.004504	-37.0925	-8.94566
Advertising	2.280186	0.198272	11.50032	4.35E-07	1.838409	2.721962
 Tools, Data Analysis, Regression - Hint: include labels in the input ranges to help with the interpretation! Plots can also be included (not shown here).
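The coefficients in the Excel output can be checked by hand from the least-squares formulas. A minimal Python sketch using the Sales/Advertising data from the slide (no Excel required):

```python
# Least-squares fit of Sales on Advertising, reproducing the Excel output above.
sales = [27, 23, 31, 45, 47, 42, 39, 45, 57, 59, 73, 84]
advertising = [20, 20, 25, 28, 29, 28, 31, 34, 35, 36, 41, 45]

n = len(sales)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n

# Slope: b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))
sxx = sum((x - x_bar) ** 2 for x in advertising)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar  # the fitted line always passes through (x_bar, y_bar)

print(round(b0, 4), round(b1, 6))  # matches Intercept -23.0191, Advertising 2.280186
```
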
Total Deviation = Explained Variance + Unexplained Variance

(Y − Ȳ) = (Ŷ − Ȳ) + (Y − Ŷ)

[Figure: scatter plot of Sales (Y) versus Advertising (X) comparing a forecasted value to the actual value and the average. The total deviation (Y − Ȳ) splits into the explained portion (Ŷ − Ȳ) and the unexplained portion (Y − Ŷ).]
Data Analysis

 R², or the Coefficient of Determination - equals the proportion of the variance in the dependent variable Y that is explained through the relationship with the independent variable X.
 Explained Variance = Total Variance − Unexplained Variance. We state this as a proportion:

Unadjusted: R² = 1 − Σ(Yᵢ − Ŷ)² / Σ(Yᵢ − Ȳ)²

Adjusted: R² = 1 − S²yx / S²y

Adjusted R² is adjusted for complexity by the degrees of freedom. Unadjusted R² becomes larger as more variables are added to the equation (adding variables decreases the sum of squared errors in the numerator of the ratio). The use of an unadjusted R² may result in believing that additional variables are useful when they are not.
More on R²

 If R² = 1, there is a perfect linear relationship. All the variance in Y is explained by X. All of the data points are on the regression line.
 If R² = 0, there is no linear relationship between X and Y (if this is the case, we should not have run a linear model - and we should have realized this with a correlation coefficient and by graphing - BEFORE running the model!)
 Several ways to calculate. From the ANOVA table: SSR/SST (this is an UNADJUSTED R²).
 Adjusted R² from the ANOVA table: 1 − MSE/(SST/(n−1)).
 The square root of R² gives the magnitude of R, the correlation coefficient; R takes the sign of the slope, which identifies positive and negative relationships.
 R² is useful for making model comparisons.
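Both forms of R² can be verified from the ANOVA sums of squares in the advertising example; a short sketch:

```python
# Unadjusted and adjusted R^2 from the ANOVA sums of squares of the
# advertising example (SSE = 253.953, SST = 3612.667).
sse = 253.953   # unexplained (residual) sum of squares
sst = 3612.667  # total sum of squares
n, k = 12, 2    # 12 observations, 2 estimated parameters (b0 and b1)

r2 = 1 - sse / sst                              # unadjusted: equivalently SSR/SST
r2_adj = 1 - (sse / (n - k)) / (sst / (n - 1))  # 1 - MSE/(SST/(n-1))

print(round(r2, 6), round(r2_adj, 6))  # matches Excel's 0.929705 and 0.922675
```
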
Data Analysis

 Syx, or the Standard Error of the Estimate - a measure of goodness of fit. Measures the actual values (Y) against the fitted values (Ŷ) on the regression line. A lower Syx is a better fit.

Syx = √[ Σ(Yᵢ − Ŷ)² / (n − k) ] = √[ Σe² / (n − k) ]

 k refers to the number of population parameters being estimated - in this case, we have 2: b0 and b1.
 The standard error can also be calculated by taking the square root of the MSE in the ANOVA table: Syx = √MSE.
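The equivalence with the ANOVA table is easy to confirm with the advertising example's numbers; a sketch:

```python
import math

# Standard error of the estimate from the advertising example:
# SSE = 253.953 with n = 12 observations and k = 2 estimated parameters.
sse = 253.953
n, k = 12, 2

syx = math.sqrt(sse / (n - k))  # identical to sqrt(MSE) from the ANOVA table
print(round(syx, 6))            # matches the Excel "Standard Error" of 5.039375
```
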
Residuals

Observation	Predicted Weekly Sales	Residuals	Residuals Squared
1	13.23544	-3.23544	10.46805
2	3.058252	2.941748	8.653879
3	7.419903	-2.4199	5.85593
4	10.32767	1.67233	2.796688
5	8.873786	1.126214	1.268357
6	14.68932	0.31068	0.096522
7	8.873786	-3.87379	15.00622
8	11.78155	0.218447	0.047719
9	17.59709	-0.59709	0.356513
10	16.1432	3.856796	14.87488

Excel will provide the residuals in the output. This table also includes another column that I added - the residuals squared - which is used to determine the standard error of the estimate (Syx).
Confidence Intervals

 Prior to relating Y to X, confidence intervals about future values are based on the standard error of Y. However, in the regression equation, the standard error of the forecast (Sf) gives tighter confidence intervals and greater accuracy.
 Confidence Interval for Y:

Ȳ ± z(α/2) · Sy / √n

 Confidence Interval for Ŷ:

Ŷ ± z(α/2) · Syx · √[ 1 + 1/n + (Xᵢ − X̄)² / Σ(Xᵢ − X̄)² ]

 Use t(α/2) for small sample sizes!
Making Predictions
 Identifying a forecasted point from the regression equation
does not give us an idea of the accuracy of the prediction.
We use the prediction interval to determine accuracy. For
example, a prediction of 8.44 appears to be precise - but
not if the 95% prediction interval allows the forecast to
fall anywhere between 1.75 and 15.15!
 Be careful about making a prediction based on a prediction.
For example, if the X values range between 5 and 15, you
should be cautious about using an X value of 20 - it is
outside the range of the data and possibly outside of the
linear relationship.
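The interval formula above can be sketched in a few lines. This example uses summary values from the advertising data and predicts at the hypothetical point X = 30 (the t value of 2.228 is the tabled t(.025) with 10 degrees of freedom):

```python
import math

# Hypothetical sketch: 95% prediction interval for a new observation at X = 30,
# using summary statistics from the advertising example.
b0, b1 = -23.0191, 2.280186      # fitted coefficients
syx = 5.039375                   # standard error of the estimate
n, x_bar, sxx = 12, 31.0, 646.0  # sample size, mean of X, sum((X - x_bar)^2)
t = 2.228                        # t(.025) with n - 2 = 10 degrees of freedom

x0 = 30.0
y_hat = b0 + b1 * x0
half_width = t * syx * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
print(f"{y_hat:.2f} +/- {half_width:.2f}")  # the point forecast alone hides this spread
```

Note how wide the interval is relative to the point forecast - exactly the caution raised above.
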
Is the Independent Variable Significant?

 Ho: The regression coefficient is not significantly different from zero
 HA: The regression coefficient is significantly different from zero

Ho: B = 0
HA: B ≠ 0

where B is the true slope of the regression line.
Is the Independent Variable Significant?

 The Standard Error of the Estimate is Syx.
 The Standard Error of the Regression Coefficient is Sb:

Sb = Syx / √[ Σ(X − X̄)² ]

t = b1 / Sb

We will use Excel’s p-value for the independent variable to determine significance. If the p-value is less than .05, we reject the null hypothesis and conclude that the independent variable is related to the dependent variable. However, it is important to have an understanding of how the formulas are developed - which is why the formulas and definitions are provided.
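The two formulas above reproduce the t Stat shown in the Excel output for the advertising example; a sketch:

```python
import math

# Standard error of the slope and its t statistic for the advertising example.
syx = 5.039375  # standard error of the estimate
sxx = 646.0     # sum((X - x_bar)^2) for the Advertising column
b1 = 2.280186   # fitted slope

sb = syx / math.sqrt(sxx)  # standard error of the regression coefficient
t_stat = b1 / sb           # compare with Excel's t Stat of 11.50032
print(round(sb, 6), round(t_stat, 2))
```
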
Analyzing it all at once
 What happens if you have a large sample size, a small
R2 (such as .10) and you have determined that the
independent variable is significant?
 What happens with a small sample, large R2 and the
independent variable is NOT significant?
 To test the model, we use the F statistic from the
ANOVA table.
ANOVA Analysis

ANOVA
	df	SS	MS	F	Significance F
Regression	1	174.1752	174.1752	23.44817	0.001284315
Residual	8	59.42476	7.428095
Total	9	233.6

In general (with k = number of parameters estimated):
	df	SS	MS	F
Regression	k−1	Σ(Ŷ − Ȳ)²	SSR/(k−1)	MSR/MSE
Error	n−k	Σ(Y − Ŷ)²	SSE/(n−k)
Total	n−1	Σ(Y − Ȳ)²
F-Test

 Ho: The model is NOT valid - there is NOT a statistical relationship between the dependent and independent variables.
 HA: The model is valid - there is a statistical relationship between the dependent and independent variables.

If the F from the ANOVA table is greater than the F from the F-table, reject Ho: the model is valid. We can also look at the p-value: if the p-value is less than our set α level, we REJECT Ho.
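The F statistic in the ANOVA table on the previous slide follows directly from the sums of squares; a sketch:

```python
# F statistic from the ANOVA table: F = MSR / MSE,
# where MSR = SSR/(k-1) and MSE = SSE/(n-k).
ssr, sse = 174.1752, 59.42476
n, k = 10, 2

msr = ssr / (k - 1)
mse = sse / (n - k)
f_stat = msr / mse
print(round(f_stat, 5))  # matches the table's F of 23.44817
```
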
Durbin-Watson Statistic

 Minitab will provide a DW statistic. This detects autocorrelation in the residuals (et and et−1). The value of DW varies between 0 and 4.
 A value of 2 indicates no autocorrelation.
 A value of 0 indicates positive autocorrelation.
 A value of 4 indicates negative autocorrelation.

DW = Σ(et − et−1)² / Σet²
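The DW formula is simple to hand-roll on a list of residuals (Minitab reports it directly; the residual lists below are made up to illustrate the two extremes):

```python
# Durbin-Watson statistic for a list of residuals.
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den

# Alternating residuals: DW well above 2 (negative autocorrelation).
print(durbin_watson([1, -1, 1, -1, 1, -1]))
# Residuals that stay on one side before switching: DW well below 2
# (positive autocorrelation).
print(durbin_watson([1, 1, 1, -1, -1, -1]))
```
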
Data Transformations
 Curvilinear relationships - fit the data with a curved line
 Transform the X variable (independent) so the
resulting relationship with Y is linear.
 Log of X, Square Root of X, X squared, and reciprocal
of X (or 1/X) are common. The hope is that one of
these transformations will result in a linear relationship.
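The effect of a transformation can be checked with the correlation coefficient, as suggested above. A sketch on hypothetical data where Y grows with the logarithm of X:

```python
import math

# Pearson correlation coefficient, computed from sums of squares.
def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y = log2(x), curvilinear in x but linear in log(x).
x = [1, 2, 4, 8, 16, 32]
y = [0, 1, 2, 3, 4, 5]

r_raw = corr(x, y)                         # weaker linear fit on the raw scale
r_log = corr([math.log(v) for v in x], y)  # essentially perfect after the transform
print(round(r_raw, 3), round(r_log, 3))
```
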
Ok, 18 pages of notes, so where do we start?
 Determine the dependent and independent variables
 Develop scatter plots and determine if linear or nonlinear relationships exist.
Calculate a correlation coefficient. Transform non-linear data.
 Run an autocorrelation and interpret the results - it will be helpful to see if any
patterns exist
 Compute the regression equation. Interpret.
 Understand the difference between standard error of estimate, standard error of
forecast (regression) and standard error of the regression coefficient.
 Evaluate and interpret the adjusted R2
 Test the independent variables for significance
 Evaluate the ANOVA and test the model for significance (F and DW)
 Plot the error terms
 Calculate a prediction and prediction interval
 State final conclusions about the model (if running different models, compare
using MSE, MAD, MAPE, MPE)
