
Chapter 10

Inference About Regression


2

Where We’ve Been… Are Going


• For the last several weeks we have been talking about topics
that fit under the broad category of statistical significance

– We draw a sample from a population


– We use the data from that sample to make some statements about
what we think is true in the population. For example…
• Processing time will decrease
• The mean is greater than 80
• The mean is between 14.7 and 16.8
• Those who drink whiskey have higher avg. incomes vs. those who do not
• OSU graduates are genuinely inferior to U of M graduates (OK, that's not
an actual statistical test, it's just generally true)
3

Remember Back When…


• As part of chapter 2 we studied regression
– Get some sample data, two quantitative variables
– Draw a straight line between the points, try to use one
variable to predict the other
4

Related?
• So you say to yourself “What could
regression have to do with statistical
significance? These things can’t possibly
be related, right?”

• Oh how wrong you are.


• [Insert evil laugh here]
5

Review
• Recall the basics of regression
• We have two quantitative variables
• We can plot them and draw the best fitting line
through the data

ŷ = 0.125x − 41.4
slope: b1 = 0.125    y-intercept: b0 = −41.4
6

Samples & Populations


• We are rarely interested just in the data that we
collected
– It is a sample from a larger population

• We wish to estimate what the relationships are


in the population based on what we see in our
sample
7

Calculations & Estimates


• From our sample we calculate ŷ = b0 + b1x
– ŷ (y-hat) is the estimate from the regression

• We use it to estimate μy|x = β0 + β1x

• So in fact, y-hat is the estimate for the population
average value of y for a given level of x (μy|x). The "|x"
part of the subscript stands for "given a certain
value of x"

• For example the average wage for a given number of


months of experience; the average attractiveness given
the amount of whiskey consumed in a week
8

Notation
• Recall the calculated values from the
SAMPLE are denoted as ŷ = b0 + b1x
where

ŷ = the estimate for y from the regression equation

b1 = the slope of the regression line from the data

b0 = the y-intercept of the line calculated from the data


9

Notation & Interpretation


Sample Statistic | Population Parameter | Pronounced          | Description
x̄                | µ                    | x-bar / mu          | Mean in the population
s                | σ                    | s / sigma           | Std. deviation in the population
p̂                | p                    | p-hat or p-tilde / p | Proportion in the population
b0               | β0                   | Beta-zero           | Y-intercept in the population
b1               | β1                   | Beta-one            | Slope in the population
10

Terminology: Repeating
• Realize that μy|x, β0 (pronounced Beta-naught) and β1
(pronounced Beta-one) are unknown numbers that exist
in the population
– That's why they are Greek letters
– They are called parameters

• y-hat, b0 and b1 are numbers we calculate from data
– Called sample statistics

ŷ is an unbiased estimate for the mean response μy|x

b0 is an unbiased estimate for the intercept β0

b1 is an unbiased estimate for the slope β1


11

Hypothetical Regression
• Let’s suppose I have a (ridiculous)
hypothesis: The more French DNA you
have, the less intelligent you are
12

Population: No Relationship Between


French DNA & Intelligence

So β1=?
13

Relationship in Population, Sample


Relationship In Population:
French & Intelligence Not Related Results from Samples
Because of sampling variation, some of the samples from this
population will have a positive, negative or zero slope.

So just because we find a negative slope in one sample it does
not necessarily mean that the slope in the population is negative.
14

How Do We Decide if β1 is Negative?


Relationship In Population ? Results from Sample
Suppose we don't know the slope in the population.

We obtain a sample and it has a negative slope.

How do we decide if the slope in the population is
negative or not?
15

Hypothesis
• We wish to know whether we believe that
something is true in the population
• We only have a sample of data from that
population
• Sound familiar?

• Ho: β1 = 0
• Ha: β1 < 0
16

How Decide?
• If the p-value is less than the alpha value...
• If the p-value is less than the alpha value...
• If the p-value is less than the alpha value...
• If the p-value is less than the alpha value...
• If the p-value is less than the alpha value...
• If the p-value is less than the alpha value...
• If the p-value is less than the alpha value...
• ...then we reject Ho in favor of Ha
17

Regression– No Relationship
• If there is truly no relationship in the data then
the regression line will have a slope of zero
• So, we can do hypothesis testing on how
different the slope of the line in the sample is
from zero

Recall: residual = observed minus predicted value of y.
If slope = 0 then the residuals will be the distances from
the individual data points to the overall average of the
data, the same as the standard deviation of the y
variable (covered in chapter 1).

[Figure: Example of Ineffective Regression — the best fitting line
is the same as the average of y (y-bar); X is used to try to predict Y]
18

Regression– Relationship
• If x and y are related (lower left) then the
slope of the best fitting line will not be zero
• And the residuals will be smaller than the
standard deviation of y

[Figures: Example of Effective Regression (nonzero slope, small
residuals) vs. Example of Ineffective Regression; a residual is
the vertical distance from a data point to the line]
19

Test for Relationship


• We test for whether the relationship
between x and y is strong enough in our
sample that it would not be due to random
chance alone
• This test is based on dividing
b1 (the slope of the line) by
its standard error, which is based on the
amount of the residuals (roughly)

• If the quotient is large enough then we
reject Ho for Ha
20
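The "divide the slope by a residual-based quantity" idea above is the usual t statistic, t = b1 / SE(b1), where SE(b1) = s / √Σ(x − x̄)². A minimal pure-Python sketch on invented toy data (not from the slides):

```python
import math

# Toy data, invented for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope and intercept: b1 = Sxy / Sxx, b0 = ybar - b1*xbar
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Regression standard error: s = sqrt( sum(residual^2) / (n - 2) )
yhat = [b0 + b1 * xi for xi in x]
s = math.sqrt(sum((yi - yh) ** 2 for yi, yh in zip(y, yhat)) / (n - 2))

# Standard error of the slope, then the t statistic for Ho: beta1 = 0
se_b1 = s / math.sqrt(sxx)
t = b1 / se_b1
print(round(b1, 3), round(t, 3))  # 0.6 2.121
```

If this t is large enough (compared to a t table with n − 2 degrees of freedom), we reject Ho.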

Average vs. Individual Values


• Changing the example: For a given level of experience,
different people earn different amounts of wage
• So there will be a DISTRIBUTION of wages, and this
distribution will have a mean, μy|x, for every value of x

• Regression assumes that these distributions are normal,
so they also have a standard deviation, σ

[Figure: normal distributions of Wage (y) around μy|x at
several values of Experience (x)]
21

The population standard deviation σ
for y at any given value of x represents
the spread of the normal distribution of
the εi around the mean μy|x.
The residual ei is defined as yi − ŷi because ŷi is our
best estimate for μy|xi

The regression standard error, s, for n sample data points is
calculated from the residuals (yi − ŷi)*:

s = √( Σ residual² / (n − 2) ) = √( Σ (yi − ŷi)² / (n − 2) )

s estimates the regression standard deviation σ.

* Recall that yi is the actual data value from the sample and ŷi is the estimate from
the regression line
22
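The formula for s can be checked by hand; here the observed values and their regression predictions are invented for illustration:

```python
import math

# Invented observed y values and their regression predictions
y_obs = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
n = len(y_obs)

# residual = observed minus predicted value of y
residuals = [yo - yh for yo, yh in zip(y_obs, y_hat)]

# s = sqrt( sum(residual^2) / (n - 2) )
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
print(round(s, 4))  # 0.8944
```

Note the n − 2 divisor (two parameters, b0 and b1, were estimated), unlike the n − 1 used for an ordinary standard deviation.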

Conditions for regression inference


• The observations are independent.

• The relationship is indeed linear.

• The standard deviation of y, σ, is the same for all values of x.

• The response y varies normally around its mean.
23

Using residual plots to check regression validity


The residuals (y − ŷ) give useful information about the contribution of
individual data points to the overall pattern of scatter.

We view the residuals in a residual plot.

If the residuals are scattered randomly around 0 with uniform variation, it
indicates that the data fit a linear model, with normally distributed
residuals for each value of x and constant standard deviation σ.


24

Review: No Pattern in Residuals


Residuals are randomly scattered
→ good!

Curved pattern
→ the relationship is not linear.

Change in variability across plot


→ σ not equal for all values of x.
25

Repeating: Hypothesis Testing For Regression

• Suppose that we get a sample of data regarding months of


experience and wages

• There appears to be a relationship between these two


things in our sample

• Similar to what was discussed in previous chapters, we


don’t know if the relationship in our data is due to the fact
that the same relationship exists in the population, or if it
occurred due to random chance alone

• We need a way to make an intelligent decision between


these two possibilities
26

Test For Relationship


• The test
Ho: β1 = 0 vs.
Ha: β1 ≠ 0

• Is the same as testing
Ho: There is no relationship between Y and X in
the population vs.
Ha: There is a relationship
27

Direction of Relationship
• Note that depending on the specifics of the research, the
test for β1 can take different forms
• If we expect that as experience increases then wages
will increase, the form is
Ho: β1 = 0 vs.
Ha: β1 > 0
• If we expect that as individuals smoke more, their life
expectancy will decrease, the form is
Ho: β1 = 0 vs.
Ha: β1 < 0
• If we suspect a relationship exists but don't know the
direction, the test is
Ho: β1 = 0 vs.
Ha: β1 ≠ 0
28

Analysis of Table 10.2


• Software can produce the p-value
for the test of Ho vs. Ha for the
regression
• Depending on the software settings
and whether you are looking for a
one-sided vs. two-sided result you
may have to divide the p-value by 2
or multiply it by 2
29
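The divide-or-multiply-by-2 rule can be made concrete. A sketch with invented numbers: software typically reports a two-sided p-value for Ho: β1 = 0; for a one-sided alternative, halve it when the sample slope points in the direction of Ha, and otherwise the one-sided p-value is 1 minus that half.

```python
def one_sided_p(two_sided_p, b1, ha_direction):
    """Convert a software-reported two-sided p-value for Ho: beta1 = 0
    into a one-sided p-value.  ha_direction is '<' or '>'."""
    matches = (b1 < 0 and ha_direction == "<") or (b1 > 0 and ha_direction == ">")
    return two_sided_p / 2 if matches else 1 - two_sided_p / 2

# Invented example: software prints p = 0.04 (two-sided); sample slope is negative
print(one_sided_p(0.04, b1=-0.31, ha_direction="<"))  # 0.02
print(one_sided_p(0.04, b1=-0.31, ha_direction=">"))  # 0.98
```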

Confidence Interval for β1


• We can estimate β1 using a confidence interval

• The SPSS output indicates that the best guess for the
slope between T-Bill returns and inflation is between
.428 and .826

• We are 95% confident in this estimate

• The standard interpretation of a 95% confidence interval


applies to this situation: What do you think it is for this?
30
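A slope confidence interval is simply b1 ± t* × SE(b1), so it is always centered at the sample slope; as a sanity check, the slide's interval (.428, .826) is centered at the b1 = .627 reported on the next slide. A sketch (the SE and t* below are hypothetical values chosen to be consistent with that interval, not taken from the SPSS output; in practice t* comes from a t table with n − 2 degrees of freedom):

```python
def slope_ci(b1, se_b1, t_star):
    # b1 +/- t* x SE(b1)
    half = t_star * se_b1
    return (b1 - half, b1 + half)

# Sanity check: a CI is centered at b1, so the midpoint of the slide's
# interval (.428, .826) should equal the reported slope .627
lo, hi = 0.428, 0.826
print(round((lo + hi) / 2, 3))  # 0.627

# Hypothetical SE and t* (illustrative only)
print(tuple(round(v, 3) for v in slope_ci(0.627, 0.0985, 2.02)))  # (0.428, 0.826)
```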

Prediction For Y and μy


• Realize that regression gives estimates for a straight
line equation
• So the y-intercept is 2.666 and the slope=.627
• So for a given amount of inflation we can estimate
the return on T-bills
– Suppose we select Inflation=2(%)
– Prediction for T-bill return would be 2.666+2*.627=3.92%
31
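The point prediction above is just the fitted line evaluated at the chosen x:

```python
def predict(b0, b1, x):
    # yhat = b0 + b1 * x
    return b0 + b1 * x

# The slide's T-bill example: intercept 2.666, slope 0.627, inflation = 2 (%)
print(round(predict(2.666, 0.627, 2), 2))  # 3.92
```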

Confidence vs. Prediction Interval


• We can build a 95% confidence interval for
the average T-bill return if Inflation is 2%
– This is (3.18,4.66)
• If we want to predict an individual value
(not an average value) then this is called
the prediction interval and it has a much
wider range
– This interval is (-0.52,8.36)
32

Confidence vs. Prediction Interval


• What’s the difference?
• If there were 10 instances where inflation were 2% in the
future and we measured the average T-bill return across
these 10 years then we expect that average to fall between
3.18 and 4.66
• In any one of the individual years where inflation is 2%,
however, we expect the T-Bill return could be anywhere
between -0.52 and 8.36
• This is because averages have less variability than
individual measurements
– Recall computer assignment 3 where variability of the averages
became much smaller as n increased
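The confidence/prediction distinction comes from two similar margin-of-error formulas: for the mean response the margin is t*·s·√(1/n + (x0 − x̄)²/Sxx), while for an individual response an extra 1 appears under the square root, which is why the prediction interval is much wider. A pure-Python sketch on invented toy data (the T-bill data are not reproduced in the slides, so these numbers will not match the intervals above):

```python
import math

# Invented toy data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

def intervals(x0, t_star):
    """CI for the mean response and PI for an individual response at x0."""
    y0 = b0 + b1 * x0
    leverage = 1 / n + (x0 - xbar) ** 2 / sxx
    ci_half = t_star * s * math.sqrt(leverage)       # mean response
    pi_half = t_star * s * math.sqrt(1 + leverage)   # individual response
    return (y0 - ci_half, y0 + ci_half), (y0 - pi_half, y0 + pi_half)

# t* = 3.182 for 95% confidence with n - 2 = 3 degrees of freedom
ci, pi = intervals(3, 3.182)
print([round(v, 2) for v in ci])  # [2.73, 5.27]
print([round(v, 2) for v in pi])  # [0.88, 7.12]
```

The extra 1 under the square root is the spread of single responses around their mean, which is why the PI stays wide even as n grows while the CI shrinks.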
