
Regression

Review Lecture
Major Points
• Is there a relationship between x and y?
• What is the strength of this relationship?
• Pearson’s r
• Can we describe this relationship and use it to predict y
from x?
• Regression
• Is the relationship we have described statistically
significant?
• t test
The relationship between x and y
• Correlation: is there a relationship between 2
variables?
• Regression: how well does a certain independent
variable predict the dependent variable?
• CORRELATION ≠ CAUSATION
• In order to infer causality: manipulate the independent
variable and observe the effect on the dependent variable
Regression
• How well a set of data points fits a straight line can
be measured by calculating the distance between
the data points and the line.
• The total error between the data points and the
line is obtained by squaring each distance and then
summing the squared values.
• The regression equation is designed to produce the
minimum sum of squared errors.
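The steps described above can be sketched directly. This is a minimal least-squares fit in Python; the data points are made up for illustration and are not from the slides:

```python
# Minimal least-squares sketch; x and y are hypothetical illustrative data.
def least_squares(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The slope that minimizes the sum of squared vertical distances
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
    a = mean_y - b * mean_x  # the line passes through (mean_x, mean_y)
    return b, a

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, a = least_squares(x, y)
# Total error: square each distance from the line, then sum
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
print(round(b, 3), round(a, 3), round(sse, 3))  # 0.6 2.2 2.4
```

Any other line through these points would give a sum of squared errors larger than 2.4; that is what "best-fitting" means here.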

Regression
• The statistical technique for finding the best-fitting
straight line for a set of data.
• Finds the line that best describes the relationship
for a set of X and Y data.
Regression Analysis
• Question asked: Given one variable, can we predict
values of another variable?
• Examples: Given the weight of a person, can we
predict how tall he/she is? Given the IQ of a person,
can we predict their performance in statistics? Given
a basketball team’s wins, can we predict the
extent of a riot? ...
Regression line
• makes the relationship between variables easier to
see.
• identifies the center, or central tendency, of the
relationship, just as the mean describes central
tendency for a set of scores.
• can be used for a prediction.
The Equation for a Line
Ŷ = bX + a

• b = the slope
• a = the y-intercept
• Ŷ = the predicted value of Y
Regression
• The mathematical equation for a line:
Y = mX + b
Where: Y = the line’s position on the vertical axis at any point
X = the line’s position on the horizontal axis at any point
m = the slope of the line
b = the intercept with the Y axis, where X equals zero


Regression
• The statistics equation for a line:
Ŷ = a + bX
Where: Ŷ = the line’s position on the vertical axis at any point (estimated
value of the dependent variable)
X = the line’s position on the horizontal axis at any point (value of
the independent variable for which you want an estimate of Y)
b = the slope of the line (called the coefficient)
a = the intercept with the Y axis, where X equals zero


Regression
• R2
• Is the improvement obtained by using X (and drawing a line through
the conditional means) in getting as near as possible to everybody’s
value for Y over just using the mean for Y alone.
• Falls between 0 and 1
• 1 means an exact fit (and there is no variation of scores around
the regression line)
• 0 means no relationship (and as much scatter around the line as
in the original Y variable and a flat regression line (slope = 0)
through the mean of Y)
• Would be the same for X regressed on Y as for Y regressed on X
• Can be interpreted as the percentage of variability in Y that is
explained by X.
• Some people get hung up on maximizing R2, but this is misguided:
any effect is still a finding. A small R2 only indicates that you haven’t told
the whole (or much of the) story of the relationship between your
variables.
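These properties of R2 are easy to check numerically. A sketch with hypothetical data (the same least-squares formulas as before):

```python
# R-squared as 1 - SSE/SST; the data is hypothetical, for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # scatter around the line
sst = sum((yi - my) ** 2 for yi in y)                        # scatter around the mean of Y
r2 = 1 - sse / sst
print(round(r2, 3))  # 0.6 -- x explains 60% of the variability in y
```

Here r2 is 0.6: the regression line removes 60% of the variability that was present when we only had the mean of Y to work with.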
Correlation and Regression
Back to the SPSS output:

r2 = [Σ(Y – Ȳ)2 – Σ(Y – Ŷ)2] / Σ(Y – Ȳ)2

71.194 ÷ 154.64 = .460


Correlation and Regression
Back to the SPSS output:

Of course, you get the standard error and t on your output,
…and the p-value too!
Correlation and Regression
Our data’s correlation is .679. How strong is that?

Correlation, r,
is significant.
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?

Before you get too excited about this output, let’s cross off the info that we
are not going to discuss or learn about in class. I’m only trying to give you an
elementary exposure to Regression. The following slides will show you the
things you need to understand, and each of those items will be explained.
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?

Info to help you understand Regression Info that you MUST know for test

Multiple R has no importance by itself; it is the
square of this value, “R Square”, that is important.

“R Square” is also known as the “Coefficient of
Determination” and represents the amount (percent)
of the variance in y that is explained by x. You can
sort of say that this value shows how accurate or
effective our regression equation is. In this case, our
regression equation is 71% accurate or effective in
predicting y, given some value of x.

These coefficients form the Regression Equation. In
this case y=0.549+1.480(x). Thus, the answer to our
question is y=0.549+1.480(0.50)=1.289 people (or 1
rounded to the nearest whole person in a household
that tosses 0.50 lbs. of plastic).
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?

Info to help you understand Regression Info that you MUST know for test

“Adjusted R Square” is the same as “R Square”
EXCEPT that it is adjusted for sample size. When the
sample size (n) is small, this value will be quite a bit
below the unadjusted R Square value. As the sample
size increases, the difference between R Square and
Adjusted R Square becomes negligible. Adjusted R
Square is a more accurate representation of the
accuracy or effectiveness of our regression equation.
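The slide describes the behavior but not the formula. The usual adjustment (assumed here, not given on the slide) is Adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), where k is the number of x variables. A sketch with the slide’s R Square of 0.71 and hypothetical sample sizes:

```python
def adjusted_r2(r2, n, k):
    # k = number of x variables; the penalty shrinks as n grows
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical sample sizes, with the slide's R Square of 0.71:
small_n = adjusted_r2(0.71, 10, 1)    # small n: noticeably below 0.71
large_n = adjusted_r2(0.71, 1000, 1)  # large n: nearly 0.71
print(round(small_n, 3), round(large_n, 3))  # 0.674 0.71
```

This reproduces the slide’s point: the gap between R Square and Adjusted R Square becomes negligible as n increases.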

Standard Error is an estimate of how far off our y
predictions will be (on average) using the regression
equation.

Observations is simply n.

Here we simply see that Excel determines the
appropriate degrees of freedom – nothing for us to
do here.
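For simple regression the standard error of the estimate is usually computed as sqrt(SS Residual / (n − 2)); this formula is assumed here (the slide only describes what the value means). A sketch with hypothetical data:

```python
# Standard error of the estimate on hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se = (ss_resid / (n - 2)) ** 0.5  # typical size of a prediction error
print(round(se, 3))  # 0.894
```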
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?

Info to help you understand Regression Info that you MUST know for test

Significance F shows the alpha level that separates
the regression equation between being statistically
significant and NOT. Thus, if this value is LESS
THAN our alpha value (usually 0.05), then the
regression equation is statistically significant (i.e.,
a low chance of error in predicting y). Conversely,
if this value is greater than our alpha, then the
regression equation is not statistically significant
(i.e., a relatively high chance of error in predicting
y). In this case, since Significance F is much less
than 0.05, we conclude that the regression equation
is statistically significant.

SS requires some explanation – see the
next couple of slides.

You may remember the F test from the section on
hypothesis testing. Excel does an F test to see if the
regression equation is any better than simply using the
average y as the predictor of y. A large F value generally
suggests that the regression equation is effective, but we
rely on the “Significance F” to tell for sure.
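The F value can be reproduced from the sums of squares as F = MS Regression / MS Residual. A sketch using the SS values from the earlier SPSS slide (71.194 explained out of 154.64 total) and a hypothetical n of 10 (the slides do not give n):

```python
ss_reg = 71.194            # explained sum of squares (from the SPSS slide)
ss_total = 154.64          # total sum of squares (from the SPSS slide)
ss_resid = ss_total - ss_reg
n, k = 10, 1               # n is hypothetical; k = number of x variables
ms_reg = ss_reg / k                # mean square regression (SS/df)
ms_resid = ss_resid / (n - k - 1)  # mean square residual (SS/df)
f = ms_reg / ms_resid
print(round(f, 1))  # roughly 6.8
```

A large F like this says the regression line does much better than predicting everyone’s y with the mean of y; the Significance F (p-value of this F) makes that judgment precise.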
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?

SS (Sum of Squares) NOT REQUIRED FOR
YOU TO KNOW

[Figure: scatter plot with the regression line y-hat = 3 + 2x and the mean
line y-bar = 9, marking at one data point the total deviation (y – y-bar),
the explained deviation (y-hat – y-bar, Regression), and the unexplained
deviation (y – y-hat, Residual).]

SS Regression (Explained Deviation) – This is
calculated by first finding the vertical distance
between the regression line and the mean y value
(y-bar) at each data point. Next, those individual
differences are squared and then all added together
(summed) to form the SS (Sum of Squares)
Regression. In practical terms, for the regression
equation to be valid, we should see a relatively
high value here compared to SS Residual.

SS Residual (Unexplained Deviation) – This is
calculated by first finding the vertical distance
between each data point and the regression line.
Since we hope our regression line explains the
relationship between x and y, any difference
between the line and a data point is “unexplained”.
We hope this value is relatively small compared to
SS Regression.

SS Total – This is simply SS Regression + SS
Residual. It can also be calculated by finding the
vertical distance between each data point and
y-bar, squaring each value, and adding them all up.
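The decomposition SS Total = SS Regression + SS Residual can be verified numerically. A sketch with hypothetical data:

```python
# Verify SS Total = SS Regression + SS Residual on hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
y_hat = [a + b * xi for xi in x]
ss_reg = sum((yh - my) ** 2 for yh in y_hat)                 # explained deviation
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained deviation
ss_total = sum((yi - my) ** 2 for yi in y)                   # total deviation
print(round(ss_reg, 3), round(ss_resid, 3), round(ss_total, 3))  # 3.6 2.4 6.0
```

Here 3.6 + 2.4 = 6.0: the explained and unexplained pieces add up exactly to the total scatter around y-bar.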
Multiple Regression
Multiple Regression
• Multiple Regression is a process for defining a linear
relationship between a dependent variable y and two or
more independent variables (x1, x2, x3 . . . , xk)
• The linear equation for a regression problem in which we
have multiple x variables is as follows, where b0 is the
y-intercept and all the other b’s are coefficients associated
with their respective x values:

ŷ = b0 + b1x1 + b2x2 + ... + bkxk
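Fitting this equation from data amounts to solving the normal equations (XᵀX)b = Xᵀy. A self-contained sketch (the two-variable data below is hypothetical, chosen so the fit is exact):

```python
def solve(A, v):
    # Tiny Gaussian elimination with partial pivoting (illustration only).
    n = len(A)
    M = [row[:] + [vi] for row, vi in zip(A, v)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_multiple(rows, y):
    # Prepend 1 to each row for the intercept b0, then solve (X^T X) b = X^T y.
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    XtX = [[sum(r[p] * r[q] for r in X) for q in range(k)] for p in range(k)]
    Xty = [sum(r[p] * yi for r, yi in zip(X, y)) for p in range(k)]
    return solve(XtX, Xty)

rows = [[1, 1], [2, 1], [1, 2], [3, 2], [2, 3]]  # hypothetical (x1, x2) pairs
y = [6, 8, 9, 13, 14]                            # exactly 1 + 2*x1 + 3*x2
coefs = fit_multiple(rows, y)
print([round(c, 6) for c in coefs])  # [1.0, 2.0, 3.0] -> b0, b1, b2
```

Excel’s Regression tool does the equivalent of this solve internally; we never compute it by hand in this class.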
Multiple Regression Guidelines
• More x variables is NOT necessarily better
• Remember that R Square is a measure of how effective
our regression equation is. Therefore, if adding an x
variable does not appreciably increase the R Square value,
then DON’T add it
• Use those x variables (the fewest possible) that give you
the biggest R Square (or Adjusted R Square) value. We
want efficiency so a few variables that provide a big R
Square is best
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Step 1: Construct a “Correlation Matrix” to see which x variables have the strongest
linear relationships with the y variable (weight). Use the Excel function Tools, Data
Analysis, Correlation to construct a correlation matrix. An Excel file containing this
Bear data and Correlation Matrix is on the class website (mrbear.xls).
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.

Step 1: Construct a “Correlation Matrix” to see which x variables have the strongest
linear relationships with the y variable (weight)
• Ideally, we want to pick those few x variables that have strong correlations (close to
-1 or +1) with the y variable, BUT we also want the x variables to NOT be highly
correlated with each other
• The addition of an x variable that is strongly correlated with any x variable(s)
already in a multiple regression equation WILL NOT do much to increase the R
Squared or Adjusted R Square value
• On the other hand, adding an x variable that is strongly correlated with the y
variable, but NOT with any x variables already in the regression equation WILL
increase our R Squared value substantially
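A correlation matrix like the one Excel’s Data Analysis, Correlation tool produces can be sketched as below. The bear measurements here are hypothetical stand-ins, not the mrbear.xls data:

```python
# Pairwise Pearson correlations between every pair of variables.
def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

data = {  # hypothetical measurements, not the class dataset
    "Neck":   [16, 20, 26, 19, 24],
    "Age":    [19, 55, 81, 49, 57],
    "Weight": [80, 140, 260, 120, 220],
}
names = list(data)
matrix = {(p, q): pearson_r(data[p], data[q]) for p in names for q in names}
for p in names:
    print(p, [round(matrix[(p, q)], 3) for q in names])
```

The diagonal is always 1 (each variable correlates perfectly with itself) and the matrix is symmetric, which is why Excel only prints the lower triangle.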
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.

Choosing which x variables to include in a multiple regression problem is often a
subjective decision. I suggest you try whichever variables you think will work best.
For this example, I am going to choose Neck size as my first x variable to predict y
(weight) since it has an awesome correlation of 0.971 with Weight. To pick a second
x variable I want one that is highly correlated with Weight, but NOT with Neck.
Unfortunately, it looks like all the other x variables ARE highly correlated with Neck.
In fact, if this were a real problem we would probably stick with simple regression
and just use Neck to predict Weight. But, since we have to do a multiple regression
problem I think I’ll pick Age as my second x variable. Age is not that highly
correlated with Weight (0.814), but it is also the least highly correlated with Neck
(0.906). Why don’t you try the problem with other x variables and see if you can beat
my Adjusted R Square value.
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.

 For Multiple Regression we use the same
Tools, Data Analysis, Regression function in
Excel that we used for simple Regression.
 The “Y Range” is still simply the range of
cells that contains the y-variable
The “X Range” is the range of cells that
contains ALL the x-variables we want to
include in the regression model. MAKE
SURE the x-variables are in adjacent
columns (you cannot skip columns). Notice
how I now have Neck and Age right next to
each other.
 I like to include the column headings or
labels, so I checked the “labels” box. When
you do this the output also includes the
labels, making it a lot easier to interpret.
Everything else is the same as when we
did simple Regression
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.

Again, before we analyze the output, let’s cross off the info that we are not
going to discuss or learn about in class. The following slides will explain
those things you need to understand.
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.

Just a note: I ran another regression with
Neck size as the only x variable. You can
see the Adjusted R Square is 0.9330 as
compared to the Adjusted R Square of
0.9536 that we got when we included Age
in addition to Neck. I don’t think I would
consider the slight increase in Adjusted R
Square from including Age as really being
worth the trouble of including another
variable. There is no set guideline as to
how much R Square should increase to
justify adding another variable, but going
from an already high R Square to a
slightly higher value does not seem
worthwhile in my opinion.
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.

Info to help you understand Regression Info that you MUST know for test

Multiple R – same story as before

Almost the same story as in simple Regression - “R Square” or
the “Coefficient of Determination” represents the amount
(percent) of variance in y that is explained by the x’s. In this
case our regression equation is 97% accurate or effective in
predicting y (Weight). Adjusted R Square simply adjusts the
R Square value for the number of x variables and the sample
size. Lots of variables and/or small sample sizes reduce the
Adjusted R Square. A few good variables and a large sample
size bring Adjusted R Square very close to R Square.

These coefficients form the Regression Equation. In
this case y=-307.817+26.388(Neck)-1.527(Age).
Thus, if we found a bear with a Neck of 25” and an
Age of 48 months we could predict its Weight as
y=-307.817+26.388(25)-1.527(48)=278.6 lbs.
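This prediction is a direct plug-in of the slide’s fitted coefficients:

```python
# Predict a bear's weight from the slide's multiple regression equation.
b0, b_neck, b_age = -307.817, 26.388, -1.527
weight = b0 + b_neck * 25 + b_age * 48  # neck = 25 in, age = 48 months
print(round(weight, 1))  # 278.6 lbs
```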
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.

Info to help you understand Regression Info that you MUST know for test

Like simple Regression, if the
Significance F is less than alpha, then
our regression equation is statistically
significant.

Same general comments as simple
Regression. Note that MS is SS/df and the F
statistic is calculated as MS Regression / MS
Residual (just in case you are curious).

See next slide for explanation of these
important values.
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.

Info to help you understand Regression Info that you MUST know for test

Excel calculates a t test statistic for each of the x variables (ignore the t Stat for Intercept).
The next column (P-value) is the important one because it provides the test result of this
test statistic as compared to the critical value. The critical values are not shown, but we
don’t need to see them because the P-value effectively tells us whether this test statistic is
inside or outside the critical value (see P-value for more explanation).

The P-value tells us whether the x variable is statistically significant in the regression equation.
If the P-value is less than alpha (usually 0.05) then that x variable is a significant contributor in
the regression equation. If the P-value is greater than alpha then that x variable does not
contribute significantly to the regression equation. In this case, Neck has a P-value much less
than 0.05, so we see that Neck size contributes significantly to our regression equation to predict
a Bear’s weight. On the other hand, Age has a P-value greater than 0.05, so we see that adding
the Age x variable into the equation was not a good idea since Age does not significantly help us
predict a Bear’s weight in our regression equation.
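The decision rule on this slide reduces to a simple comparison. The p-values below are hypothetical stand-ins (the slides say only that Neck’s is much less than 0.05 and Age’s is greater than 0.05):

```python
# Keep an x variable when its coefficient's P-value is below alpha.
alpha = 0.05
p_values = {"Neck": 0.0001, "Age": 0.62}  # hypothetical stand-in values
significant = {name: p < alpha for name, p in p_values.items()}
print(significant)  # {'Neck': True, 'Age': False}
```

By this rule we would drop Age from the equation and use Neck alone to predict Weight.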
