
INTRODUCTION TO
STATISTICS & PROBABILITY

Chapter 10: Inference for Regression

Dr. Nahid Sultana

1/24/2023 Copyright© Nahid Sultana 2017-2018


Chapter 10
Inference for Regression

10.1 Simple Linear Regression


10.2 More Detail about Simple Linear Regression



Chapter 10: Inference for Regression

➢ Statistical model for linear regression


➢ Simple linear regression model
➢ Estimating the regression parameters
➢ Confidence intervals and significance tests
➢ Confidence interval for mean response
➢ Prediction interval
➢ Analysis of variance for regression
➢ The ANOVA F test
Simple Linear Regression
Introduction

➢ When a scatterplot shows a linear relationship between a quantitative


explanatory variable x and a quantitative response variable y, we can use
the least-squares line fitted to the data to predict y for a given value of x.
➢ If the data are a random sample from a larger population, we need
statistical inference to answer questions like these:
✓ Is there really a linear relationship between x and y in the population,
or could the pattern we see in the scatterplot plausibly happen just by
chance?

✓ What is the slope (rate of change) that relates y to x in the population,


including a margin of error for our estimate of the slope?

✓ If we use the least-squares regression line to predict y for a given value


of x, how accurate is our prediction (again, with a margin of error)?
✓ Scatterplot: duration and interval of time for
all 222 recorded eruptions in a single month.
✓ The least-squares regression line:
slope 10.36 and y intercept 33.97.

Regarding all 222 eruptions as the
population, this line is the population
regression line (or true regression line)
because it uses all the observations that
month.

Take an SRS of 20 eruptions from the population and calculate the LSRL. How does the slope
of the sample regression line (LSRL) relate to the slope of the population regression
line? (Green points in each graph are the selected points.)

The pattern of variation in the slope b is described by its sampling distribution.


Different sample, different LSRL.

Conditions for Regression Inference

To do inference, think of b0 and b1 as estimates of unknown parameters β0
and β1 that describe the population of interest.

We have n observations on an explanatory variable x and a response


variable y. Our goal is to study or predict the behavior of y for given values
of x.
➢ For any fixed value of x, the response y varies according to a Normal
distribution. Repeated responses y are independent of each other.
➢ The mean response µy has a straight-line relationship with x given by the
population regression line µy = β0 + β1x.
➢ The slope β1 and intercept β0 are unknown parameters.
➢ The standard deviation of y (call it σ) is the same for all values of x. The
value of σ is unknown.
Conditions for Regression Inference (continued)

The value of σ determines


whether the points fall close
to the population regression
line (small σ) or are widely
scattered (large σ).
Simple Linear Regression Model

In the population, the linear regression equation is µy = β0 + β1x.


Sample data fits the simple linear regression model:

Data = Fit + Error


yi = (β0 + β1xi) + εi

where the εi are independent and Normally distributed N(0, σ).

Linear regression assumes equal variance of y (σ is the same for all
values of x).
Estimating the Parameters

µy = β0 + β1x
The intercept β0, the slope β1, and the standard deviation σ of y are the
unknown parameters of the regression model. We rely on the random sample
data to provide unbiased estimates of these parameters.

➢ The value of ŷ from the least-squares regression line is really a prediction of the
mean value of y (µy) for a given value of x.

➢ The least-squares regression line (ŷ = b0 + b1x) obtained from sample data is the
best estimate of the true population regression line (µy = β0 + β1x).

ŷ is an unbiased estimate for the mean response µy.

b0 is an unbiased estimate for the intercept β0.

b1 is an unbiased estimate for the slope β1.
Estimating the Parameters – β0 and β1

The slope β1 for the regression line
represents the change in the response
variable y for an increase of one unit in
the explanatory variable x.

The intercept β0 for the regression line
represents the value of the response
variable y when the explanatory variable
x is zero.

Recall from Chapter 2 the least-squares estimates:

The least-squares estimate for β1 is b1 = r (sy / sx).

The least-squares estimate for β0 is b0 = ȳ − b1x̄.
Example

b1 = r (sy / sx) = 0.40 (8.75 / 3.95) = 0.886
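This arithmetic is easy to check; r, sy, and sx are the values given in the example (the means needed to compute b0 are not provided, so only the slope is computed):

```python
# Summary statistics from the example
r, s_y, s_x = 0.40, 8.75, 3.95

# Least-squares slope estimate: b1 = r * (s_y / s_x)
b1 = r * (s_y / s_x)
print(round(b1, 3))  # 0.886
```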



Estimating the Parameters – σ

The population standard deviation σ for
y at any given value of x represents the
spread of the Normal distribution of the εi
around the mean µy.



Checking the Conditions for
Regression Inference
You can fit a least-squares line to any set of explanatory-response data when
both variables are quantitative. If the scatterplot does not show a roughly
linear pattern, the fitted line may be almost useless.

Before you can trust the results of inference, you must check the conditions for
inference one by one.

✓ The relationship is linear in the population.


✓ The response varies Normally about the population regression line.
✓ Observations are independent.
✓ The standard deviation of the responses is the same for all values of x.
You can check all of the conditions for regression inference by looking at
graphs of the residuals or residual plots.
Using Residual Plots to Check Regression Validity
If the residuals are scattered randomly around 0 with uniform variation, the
data are consistent with a linear model, Normally distributed residuals at
each value of x, and a constant standard deviation σ.
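As a numeric companion to the residual plot, the sketch below fits a least-squares line to a small made-up data set and confirms the residuals center on zero (the fit forces their sum to be essentially zero; it is the plot that reveals curvature or changing spread):

```python
# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept (formulas from the previous slide)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residuals from the fitted line; plot these against x in practice
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(sum(residuals), 9))
```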



Confidence Interval for Regression Slope

A level C confidence interval for the slope β1 is b1 ± t* SEb1, where t* is
the critical value for the t(n − 2) density curve with area C between −t* and t*.


Calculations for Regression Inference

The regression standard error is s = √(Σ residuals² / (n − 2)), and the
standard error of the slope is SEb1 = s / √(Σ(xi − x̄)²).


Example

A line has been fit to data representing cholesterol readings for 28 individuals starting
a cholesterol-reducing drug. The computer provides the following output:

The 95% confidence interval for the slope is


a. 0.6627 ± 2.055 (0.1428).
b. 0.6627 ± 1.96 (0.1428).

c. 0.6627 ± 0.1428.
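Choice (a) is the correct form: with n = 28 observations, the critical value comes from the t distribution with n − 2 = 26 degrees of freedom. A sketch of the computation, assuming scipy is available:

```python
from scipy import stats

# Slope and standard error from the computer output; n = 28 individuals
b1, se_b1, n = 0.6627, 0.1428, 28

# 95% CI uses the t critical value with n - 2 = 26 degrees of freedom
t_star = stats.t.ppf(0.975, df=n - 2)   # about 2.056
lower, upper = b1 - t_star * se_b1, b1 + t_star * se_b1
print(round(t_star, 3), round(lower, 3), round(upper, 3))
```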



Significance Test for Regression Slope

To test the hypothesis H0: β1 = 0, compute the test statistic t = b1 / SEb1,
which has a t(n − 2) distribution when H0 is true.

Note: Software typically provides two-sided P-values.



Testing the Hypothesis of No Relationship



Simple Linear Regression Example

Infants who cry easily may be more easily stimulated than others. This may be a sign of higher
IQ. Child development researchers explored the relationship between the crying of infants 4 to
10 days old and their later IQ test scores. A snap of a rubber band on the sole of the foot
caused the infants to cry. The researchers recorded the crying and measured its intensity by the
number of peaks in the most active 20 seconds. They later measured the children’s IQ at age 3
years using the Stanford-Binet IQ test. A scatterplot and Minitab output for the data from a
random sample of 38 infants are shown below.

Do these data provide convincing evidence that


there is a positive linear relationship between
crying counts and IQ in the population of infants?
Simple Linear Regression Example
We want to perform a test of
H0 : β1 = 0
Ha : β1 > 0
where β1 is the true slope of the population regression line relating cry count to IQ
score.

• The scatterplot suggests a moderately positive linear relationship between crying peaks
and IQ.

✓ IQ scores of individual infants should be independent.


✓ The Normal probability plot of the residuals shows a slight curvature, which suggests that the
responses may not be Normally distributed about the line at each x-value. With a large
sample size (n = 38), however, the t procedures are robust against departures from Normality.
✓ The residual plot shows a fairly equal amount of scatter around the horizontal line at 0 for all
x-values.
Simple Linear Regression Example
With no obvious violations of the conditions, we proceed to inference.

The test statistic and P-value can be found in the Minitab output.

t = b1 / SEb1 = 1.4929 / 0.4870 = 3.07


The Minitab output gives P = 0.004 as the P-value
for a two-sided test. The P-value for the one-sided
test is half of this,
P = 0.002.

The P-value, 0.002, is less than our α = 0.05 significance level, so we have enough evidence to
reject H0 and conclude that there is a positive linear relationship between intensity of crying
and IQ score in the population of infants.
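The test statistic and one-sided P-value can be reproduced from the quantities in the Minitab output (assuming scipy is available):

```python
from scipy import stats

# Slope and standard error from the output; n = 38 infants
b1, se_b1, n = 1.4929, 0.4870, 38

t = b1 / se_b1                          # test statistic, about 3.07
p_one_sided = stats.t.sf(t, df=n - 2)   # upper-tail area, for Ha: beta1 > 0
print(round(t, 2), round(p_one_sided, 3))
```

The upper-tail area matches the slide's approach of halving the software's two-sided P-value.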
Confidence Interval for Mean Response

We can also calculate a confidence interval for the population mean μy of all
responses y when x takes the value x* (within the range of data tested).



Prediction Intervals

One use of regression is for predicting the value of y at some value of x within
the range of data tested. Reliable predictions require statistical inference.

To estimate an individual response y for a given value of x, we use a


prediction interval.

If we sampled repeatedly, we would obtain many different values of y for a
particular x, varying Normally with standard deviation σ around the mean
response µy.



Prediction Interval for a Single Observation

A level C prediction interval for a single observation on y when x takes the
value x* is ŷ ± t* SEŷ, where t* is the critical value for the t(n − 2)
density curve with area C between −t* and t*.




Analysis of Variance for Regression

The regression model is

Data = fit + error

yi = (β0 + β1xi) + εi

where the εi are independent and
Normally distributed N(0, σ), and
σ is the same for all values of x.

It resembles an ANOVA, which also assumes equal variance, where

total sum of squares SST = SS model + SS error, and

DFT = DF model + DF error.



The ANOVA F Test

The ANOVA F statistic is F = MSM / MSE. When H0: β1 = 0 is true, F has the
F(1, n − 2) distribution, and the ANOVA F test is equivalent to the two-sided
t test of H0: β1 = 0.


The ANOVA Table

Source   Sum of squares SS   DF      Mean square MS    F          P-value
Model    SSM                 1       MSM = SSM/DFM     MSM/MSE    Tail area above F
Error    SSE                 n − 2   MSE = SSE/DFE
Total    SST                 n − 1

SST = SSM + SSE    DFT = DFM + DFE    F = MSM/MSE
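The table's identities can be verified mechanically; the sums of squares below are made-up illustrative values:

```python
# Illustrative (assumed) sums of squares for a simple linear regression, n = 12
ssm, sse, n = 90.0, 10.0, 12

sst = ssm + sse               # SST = SSM + SSE
dfm, dfe = 1, n - 2           # model df = 1, error df = n - 2
dft = dfm + dfe               # DFT = DFM + DFE = n - 1

msm = ssm / dfm
mse = sse / dfe
f = msm / mse                 # ANOVA F statistic
print(sst, dft, round(f, 1))  # 100.0 11 90.0
```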


Example
The following is output from a regression analysis. The predictor variable is a
mathematics placement test score, and the response variable is a student’s final grade
in a statistics course.

Course Grade = 29.9 + 2.46 Placement Score


Predictor Coef SE Coef
Constant 29.882 7.304
Placement Score 2.4558 0.4175
S = _____ R-Sq = 72.7% R-Sq(adj) = 70.6%

Analysis of Variance
Source DF SS MS F P
Regression 1 1754.6 1754.6 34.60 0.000
Residual Error 13 659.2 50.7
Total 14 2413.7
What is the value of s, the estimated standard deviation about the regression line?
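From the ANOVA table, s is the square root of the mean square error (MSE = 659.2/13 ≈ 50.7):

```python
import math

mse = 659.2 / 13          # SSE / DFE from the ANOVA table, about 50.7
s = math.sqrt(mse)        # estimated standard deviation about the regression line
print(round(s, 2))  # 7.12
```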
Example
A realtor is trying to assess the prices of homes in a new
development. She wants to know if the age of the house (x)
can explain the selling price (y) of a home, in thousands of
dollars.
x 1 2 3 4 5 6 7 8 9 10
y 245 180 200 200 171 120 115 69 60 47

What are the degrees of freedom for error associated with this model?

Ans: DFE = n − 2 = 10 − 2 = 8
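A short check of the answer, which also estimates the slope from the table above:

```python
# Realtor data: age of house (x, years) vs. selling price (y, $thousands)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [245, 180, 200, 200, 171, 120, 115, 69, 60, 47]

n = len(x)
df_error = n - 2          # two estimated parameters: intercept and slope

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)

print(df_error)           # 8
print(round(b1, 2))       # slope is negative: price falls as the house ages
```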
Example
For the below ANOVA table, what is the missing
value?
Source Df SS MS
Regression 5 6,500 1,300
Error 94 ? 37.234
Total ? 10,000

Ans: DF Total = 5 + 94 = 99 and SS Error = 10,000 − 6,500 = 3,500


Example
For the below ANOVA table, what is the F statistic?

ANOVA
df SS MS F
Regression 1 17200
Error 15 3400
Total 16 20600

Ans: F = MSM / MSE = 17,200 / (3,400/15) = 75.88
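Filling in the table: MS Regression = 17,200/1 and MS Error = 3,400/15, so:

```python
# Values from the ANOVA table
ss_reg, ss_err = 17200, 3400
df_reg, df_err = 1, 15

ms_reg = ss_reg / df_reg
ms_err = ss_err / df_err      # about 226.7
f = ms_reg / ms_err           # F = MSM / MSE
print(round(f, 2))  # 75.88
```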
