
INTRODUCTION TO
STATISTICS & PROBABILITY

Chapter 10: Inference for Regression

Dr. Nahid Sultana

1/24/2023 Copyright© Nahid Sultana 2017-2018


Chapter 10
Inference for Regression

10.1 Simple Linear Regression


10.2 More Detail about Simple Linear Regression



Chapter 10: Inference for Regression

➢ Statistical model for linear regression


➢ Simple linear regression model
➢ Estimating the regression parameters
➢ Confidence intervals and significance tests
➢ Confidence interval for mean response
➢ Prediction interval
➢ Analysis of variance for regression
➢ The ANOVA F test
Simple Linear Regression
Introduction

➢ When a scatterplot shows a linear relationship between a quantitative


explanatory variable x and a quantitative response variable y, we can use
the least-squares line fitted to the data to predict y for a given value of x.
➢ If the data are a random sample from a larger population, we need
statistical inference to answer questions like these:
✓ Is there really a linear relationship between x and y in the population,
or could the pattern we see in the scatterplot plausibly happen just by
chance?

✓ What is the slope (rate of change) that relates y to x in the population,


including a margin of error for our estimate of the slope?

✓ If we use the least-squares regression line to predict y for a given value


of x, how accurate is our prediction (again, with a margin of error)?
✓ Scatterplot: duration and interval of time for
all 222 recorded eruptions in a single month.
✓ The least-squares regression line:
slope 10.36 and y intercept 33.97.

Regarding all 222 eruptions as the
population, this line is the population
regression line (or true regression line)
because it uses all the observations that
month.

Take an SRS of 20 eruptions from the population and calculate the LSRL. How does the slope
of the sample regression line (LSRL) relate to the slope of the population regression
line? (Green points in each graph are the selected points.)

The pattern of variation in the slope b is described by its sampling distribution.


Different sample, different LSRL.

Conditions for Regression Inference

To do inference, think of b0 and b1 as estimates of unknown parameters β0
and β1 that describe the population of interest.

We have n observations on an explanatory variable x and a response


variable y. Our goal is to study or predict the behavior of y for given values
of x.
➢ For any fixed value of x, the response y varies according to a Normal
distribution. Repeated responses y are independent of each other.
➢ The mean response µy has a straight-line relationship with x given by the
population regression line µy = β0 + β1x.
➢ The slope β1 and intercept β0 are unknown parameters.
➢ The standard deviation of y (call it σ) is the same for all values of x. The
value of σ is unknown.
Conditions for Regression Inference (continued)

The value of σ determines


whether the points fall close
to the population regression
line (small σ) or are widely
scattered (large σ).
Simple Linear Regression Model

In the population, the linear regression equation is µy = β0 + β1x.


Sample data fits the simple linear regression model:

Data = Fit + Error


yi = (β0 + β1xi) + εi

where the εi are independent and Normally distributed N(0, σ).

Linear regression assumes equal variance of y (σ is the same for all
values of x).
Estimating the Parameters

µy = β0 + β1x
The intercept β0, the slope β1, and the standard deviation σ of y are the
unknown parameters of the regression model. We rely on the random sample
data to provide unbiased estimates of these parameters.

➢ The value of ŷ from the least-squares regression line is really a prediction of the
mean value of y (µy) for a given value of x.

➢ The least-squares regression line (ŷ = b0 + b1x) obtained from sample data is the
best estimate of the true population regression line (µy = β0 + β1x).

ŷ is an unbiased estimate for the mean response µy.

b0 is an unbiased estimate for the intercept β0.

b1 is an unbiased estimate for the slope β1.
Estimating the Parameters – β0 and β1

The slope β1 for the regression line
represents the change in the response
variable y for an increase of one unit in
the explanatory variable x.

The intercept β0 for the regression line
represents the value of the response
variable y when the explanatory variable
x is zero.

Recall from Chapter 2 the least-squares estimates:

The least-squares estimate for β1 is b1 = r (sy / sx).

The least-squares estimate for β0 is b0 = ȳ − b1x̄.
Example

b1 = r (sy / sx) = 0.40 (8.75 / 3.95) = 0.886
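This arithmetic is easy to check; r, sy, and sx are the values given in the example (the means needed to compute b0 are not provided, so only the slope is computed):

```python
# Summary statistics from the example
r, s_y, s_x = 0.40, 8.75, 3.95

# Least-squares slope estimate: b1 = r * (s_y / s_x)
b1 = r * (s_y / s_x)
print(round(b1, 3))  # 0.886
```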



Estimating the Parameters – σ

The population standard deviation σ for
y at any given value of x represents the
spread of the Normal distribution of the εi
around the mean µy.



Checking the Conditions for
Regression Inference
You can fit a least-squares line to any set of explanatory-response data when
both variables are quantitative. If the scatterplot does not show a roughly
linear pattern, the fitted line may be almost useless.

Before you can trust the results of inference, you must check the conditions for
inference one by one.

✓ The relationship is linear in the population.


✓ The response varies Normally about the population regression line.
✓ Observations are independent.
✓ The standard deviation of the responses is the same for all values of x.
You can check all of the conditions for regression inference by looking at
graphs of the residuals or residual plots.
Using Residual Plots to Check Regression Validity
If the residuals are scattered randomly around 0 with uniform variation, the
data are consistent with a linear model, Normally distributed residuals at
each value of x, and a constant standard deviation σ.
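As a numeric companion to the residual plot, the sketch below fits a least-squares line to a small made-up data set and confirms the residuals center on zero (the fit forces their sum to be essentially zero; it is the plot that reveals curvature or changing spread):

```python
# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept (formulas from the previous slide)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residuals from the fitted line; plot these against x in practice
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(sum(residuals), 9))
```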



Confidence Interval for Regression Slope

A level C confidence interval for the slope β1 is b1 ± t* SEb1, where t* is
the critical value for the t(n − 2) density curve with area C between −t* and t*.


Calculations for Regression Inference

The regression standard error is s = √(Σ residuals² / (n − 2)), and the
standard error of the slope is SEb1 = s / √(Σ(xi − x̄)²).


Example

A line has been fit to data representing cholesterol readings for 28 individuals starting
a cholesterol-reducing drug. The computer provides the following output:

The 95% confidence interval for the slope is


a. 0.6627 ± 2.055 (0.1428).
b. 0.6627 ± 1.96 (0.1428).

c. 0.6627 ± 0.1428.
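Choice (a) is the correct form: with n = 28 observations, the critical value comes from the t distribution with n − 2 = 26 degrees of freedom. A sketch of the computation, assuming scipy is available:

```python
from scipy import stats

# Slope and standard error from the computer output; n = 28 individuals
b1, se_b1, n = 0.6627, 0.1428, 28

# 95% CI uses the t critical value with n - 2 = 26 degrees of freedom
t_star = stats.t.ppf(0.975, df=n - 2)   # about 2.056
lower, upper = b1 - t_star * se_b1, b1 + t_star * se_b1
print(round(t_star, 3), round(lower, 3), round(upper, 3))
```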



Significance Test for Regression Slope

To test the hypothesis H0: β1 = 0, compute the test statistic t = b1 / SEb1,
which has a t(n − 2) distribution when H0 is true.

Note: Software typically provides two-sided P-values.



Testing the Hypothesis of No Relationship



Simple Linear Regression Example

Infants who cry easily may be more easily stimulated than others. This may be a sign of higher
IQ. Child development researchers explored the relationship between the crying of infants 4 to
10 days old and their later IQ test scores. A snap of a rubber band on the sole of the foot
caused the infants to cry. The researchers recorded the crying and measured its intensity by the
number of peaks in the most active 20 seconds. They later measured the children’s IQ at age 3
years using the Stanford-Binet IQ test. A scatterplot and Minitab output for the data from a
random sample of 38 infants are shown below.

Do these data provide convincing evidence that


there is a positive linear relationship between
crying counts and IQ in the population of infants?
Simple Linear Regression Example
We want to perform a test of
H0 : β1 = 0
Ha : β1 > 0
where β1 is the true slope of the population regression line relating cry count to IQ
score.

• The scatterplot suggests a moderately positive linear relationship between crying peaks
and IQ.

✓ IQ scores of individual infants should be independent.


✓ The Normal probability plot of the residuals shows a slight curvature, which suggests that the
responses may not be Normally distributed about the line at each x-value. With a large
sample size (n = 38), however, the t procedures are robust against departures from Normality.
✓ The residual plot shows a fairly equal amount of scatter around the horizontal line at 0 for all
x-values.
Simple Linear Regression Example
With no obvious violations of the conditions, we proceed to inference.

The test statistic and P-value can be found in the Minitab output.

t = b1 / SEb1 = 1.4929 / 0.4870 = 3.07


The Minitab output gives P = 0.004 as the P-value
for a two-sided test. The P-value for the one-sided
test is half of this,
P = 0.002.

The P-value, 0.002, is less than our α = 0.05 significance level, so we have enough evidence to
reject H0 and conclude that there is a positive linear relationship between intensity of crying
and IQ score in the population of infants.
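The test statistic and one-sided P-value can be reproduced from the quantities in the Minitab output (assuming scipy is available):

```python
from scipy import stats

# Slope and standard error from the output; n = 38 infants
b1, se_b1, n = 1.4929, 0.4870, 38

t = b1 / se_b1                          # test statistic, about 3.07
p_one_sided = stats.t.sf(t, df=n - 2)   # upper-tail area, for Ha: beta1 > 0
print(round(t, 2), round(p_one_sided, 3))
```

The upper-tail area matches the slide's approach of halving the software's two-sided P-value.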
Confidence Interval for Mean Response

We can also calculate a confidence interval for the population mean μy of all
responses y when x takes the value x* (within the range of data tested).



Prediction Intervals

One use of regression is for predicting the value of y at some value of x within
the range of data tested. Reliable predictions require statistical inference.

To estimate an individual response y for a given value of x, we use a


prediction interval.

If we sampled repeatedly, we would obtain many different values of y for a
particular x, varying Normally with standard deviation σ around the mean
response µy.



Prediction Interval for a Single Observation

A level C prediction interval for a single observation on y when x takes the
value x* is ŷ ± t* SEŷ, where t* is the critical value for the t(n − 2)
density curve with area C between −t* and t*.




Analysis of Variance for Regression

The regression model is

Data = fit + error

yi = (β0 + β1xi) + εi

where the εi are independent and
Normally distributed N(0, σ), and
σ is the same for all values of x.

It resembles an ANOVA, which also assumes equal variance, where

total sum of squares SST = SS model + SS error, and

DFT = DF model + DF error.



The ANOVA F Test

The ANOVA F statistic is F = MSM / MSE. When H0: β1 = 0 is true, F has the
F(1, n − 2) distribution, and the ANOVA F test is equivalent to the two-sided
t test of H0: β1 = 0.


The ANOVA Table

Source   Sum of squares SS   DF      Mean square MS    F          P-value
Model    SSM                 1       MSM = SSM/DFM     MSM/MSE    Tail area above F
Error    SSE                 n − 2   MSE = SSE/DFE
Total    SST                 n − 1

SST = SSM + SSE    DFT = DFM + DFE    F = MSM/MSE
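The table's identities can be verified mechanically; the sums of squares below are made-up illustrative values:

```python
# Illustrative (assumed) sums of squares for a simple linear regression, n = 12
ssm, sse, n = 90.0, 10.0, 12

sst = ssm + sse               # SST = SSM + SSE
dfm, dfe = 1, n - 2           # model df = 1, error df = n - 2
dft = dfm + dfe               # DFT = DFM + DFE = n - 1

msm = ssm / dfm
mse = sse / dfe
f = msm / mse                 # ANOVA F statistic
print(sst, dft, round(f, 1))  # 100.0 11 90.0
```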


Example
The following is output from a regression analysis. The predictor variable is a
mathematics placement test score, and the response variable is a student’s final grade
in a statistics course.

Course Grade = 29.9 + 2.46 Placement Score


Predictor Coef SE Coef
Constant 29.882 7.304
Placement Score 2.4558 0.4175
S = _____ R-Sq = 72.7% R-Sq(adj) = 70.6%

Analysis of Variance
Source DF SS MS F P
Regression 1 1754.6 1754.6 34.60 0.000
Residual Error 13 659.2 50.7
Total 14 2413.7
What is the value of s, the estimated standard deviation about the regression line?
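From the ANOVA table, s is the square root of the mean square error (MSE = 659.2/13 ≈ 50.7):

```python
import math

mse = 659.2 / 13          # SSE / DFE from the ANOVA table, about 50.7
s = math.sqrt(mse)        # estimated standard deviation about the regression line
print(round(s, 2))  # 7.12
```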
Example
A realtor is trying to assess the prices of homes in a new
development. She wants to know if the age of the house (x)
can explain the selling price (y) of a home, in thousands of
dollars.
x 1 2 3 4 5 6 7 8 9 10
y 245 180 200 200 171 120 115 69 60 47

What are the degrees of freedom for error associated with this model?

Ans: DFE = n − 2 = 10 − 2 = 8
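A short check of the answer, which also estimates the slope from the table above:

```python
# Realtor data: age of house (x, years) vs. selling price (y, $thousands)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [245, 180, 200, 200, 171, 120, 115, 69, 60, 47]

n = len(x)
df_error = n - 2          # two estimated parameters: intercept and slope

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)

print(df_error)           # 8
print(round(b1, 2))       # slope is negative: price falls as the house ages
```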
Example
For the below ANOVA table, what is the missing
value?
Source Df SS MS
Regression 5 6,500 1,300
Error 94 ? 37.234
Total ? 10,000

Ans: DF Total = 5 + 94 = 99 and SS Error = 10,000 − 6,500 = 3,500


Example
For the below ANOVA table, what is the F statistic?

ANOVA
df SS MS F
Regression 1 17200
Error 15 3400
Total 16 20600

Ans: F = MSM / MSE = 17,200 / (3,400/15) = 75.88
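Filling in the table: MS Regression = 17,200/1 and MS Error = 3,400/15, so:

```python
# Values from the ANOVA table
ss_reg, ss_err = 17200, 3400
df_reg, df_err = 1, 15

ms_reg = ss_reg / df_reg
ms_err = ss_err / df_err      # about 226.7
f = ms_reg / ms_err           # F = MSM / MSE
print(round(f, 2))  # 75.88
```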
