
Descriptive Simple Linear Regression

Sections 5.1-5.3

Introduction
Often, we measure two (or more) numerical variables on the same individual. In statistics,
finding relationships between variables is important. Simple linear regression is just one
approach and only models the linear relationship between two numerical variables. Linear
regression, in general, is a very powerful statistical technique and goes far beyond what we can
discuss in this course.¹

Typically, we call one variable the response (or outcome) variable; it measures the outcome of the study. The response variable is denoted by Y; specific values of Y are denoted y. The other variable is the explanatory (or predictor) variable, whose specific values are denoted by x. The explanatory variable is thought to explain the changes we see in the
response variable.²

Warning! Choosing one variable as the explanatory variable does not necessarily mean that a
change in that variable will produce a change in the other variable. Association is not causation.

Relationships between Numerical Variables: Scatterplots


The first step in examining the relationship between two quantitative variables is to use a
graphical display, the scatterplot, to show the association between the two variables. A
scatterplot provides a case-by-case view of data for two numerical variables. Each point
represents a single case measured on the two variables. Typically, we want to assess three
characteristics of the visualized relationship:
• direction (positive or negative)
• form (linear or not)
• strength³ (weak, moderate, strong)
It is also useful to address any unusual features you might see in the scatterplot.

Example Ames Housing Data


Recall the ames50 data set that we worked with in Chapter 1. The data set had information on 9 variables for a random sample of 50 single-family homes that sold between 2006 and 2010 in Ames, Iowa. Here we consider two variables, livingArea (living area in square feet) and salePrice (sales price of the home, $). Is there a linear relationship between the amount of living area and the sales price of a home? Describe the relationship you see in the scatterplot to the right.

¹ Consider Stats 401 or Stats 413 for your follow-up courses to Stats 250.
² Some people call the Y variable the dependent variable and the X variable the independent variable. We stay away from those terms because they have specific meanings in the study of probability.
³ One way to assess the strength of a linear relationship between two quantitative variables is with correlation (which we will discuss shortly).

Describing the Strength (and Direction) of a Linear Relationship with Correlation


The correlation (denoted by r or R) quantifies the strength (and direction) of the linear
relationship between two quantitative variables x and y. Where the standard deviation
describes the variability of a single set of data, the correlation describes the joint variability of
two sets of data, together. A few facts regarding correlation:
• The value of r is always between –1 and +1.
• Values closer to –1 correspond to a strong negative relationship.
• Values closer to 1 correspond to a strong positive relationship.
• Values closer to 0 correspond to no linear relationship.

Figure 5.10 from the textbook allows us to see how the correlation coefficient is related to the
scatter of the points.

Correlation is nearly always calculated with technology.⁴
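In practice, this is a single R function call. A minimal sketch, assuming the ames50 data have been loaded as a data frame with columns named livingArea and salePrice (names assumed here, matching the variables described above):

# correlation between living area and sale price (a value between -1 and 1)
cor(ames50$livingArea, ames50$salePrice)

This should return a value of about 0.86, matching the correlation reported in the summary statistics later in these notes.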


Caution: Correlation and Linearity!⁵
A value of r close to –1 or 1 doesn't in and of itself tell you that there is a linear relationship
between the two variables! You must examine the scatterplot; otherwise, your interpretation
of the correlation coefficient might be wrong. In each of the four scatterplots below, the
observed values of x and y have nearly identical means and standard deviations. The
correlations for the four graphs are also essentially identical, even though the data are very different!

⁴ That means we will not ask you to calculate it by hand.
⁵ The scatterplots here display a famous collection of four data sets known as Anscombe's Quartet.

Modeling a linear relationship with a straight line


Now that we know that correlation measures the strength of a linear relationship between two
variables, we might be interested in describing that linear relationship in more detail. We’ll do
this using something called a regression line.

Let’s return to our ames50 data. Below are a histogram and summary statistics for the sales
prices of the 50 single-family homes in our data set.

Question: What is our best estimate for the average sales price for all single-family homes sold
between 2006 and 2010 in Ames?⁶

Note: You might have thought that we should use the sample median because the distribution
of sales prices is right skewed. However, we want to estimate the population mean (not the
population median), so we need to use the sample mean as our estimate.

If we knew nothing about the relationship between salePrice and livingArea, we would
estimate the sales price of any single-family home sold to be $229,033 (the sample mean),
regardless of living area. We can do better than this!

⁶ Recall that we use sample statistics to estimate population parameters.


In the earlier scatterplot, we saw a positive relationship between sales price and living area
of a home. It makes sense to use our knowledge of this relationship to improve our estimate
of the average sales price for single-family homes sold between 2006 and 2010 in Ames.

The scatterplot below is the same one we examined earlier, with the least-squares regression
line added. Notice that, on average, the sale price increases as living area increases.

In general, we write the (simple) linear regression model for the population as

μ_{Y|X=x} = β₀ + β₁x, where

μ_{Y|X=x} is the mean of our response variable Y when the value of the explanatory variable is X = x.
β₀ is a population parameter denoting the y-intercept; the mean of Y when x = 0.
β₁ is a population parameter denoting the slope; the change in the mean of Y per unit change in x.

When we use sample data to estimate β₀ and β₁, we write the estimated regression line as

ŷ = b₀ + b₁x.⁷

⁷ Alternatively, we could use "hat" notation and write ŷ = β̂₀ + β̂₁x.


How do we figure out the equation of the line that best fits the data? To investigate this, let's
continue our work with the ames50 data set. The scatterplot to the right has two lines for us to
compare, neither of which is the actual least-squares regression line shown earlier.

The equation of the blue, solid line is ŷ = −60,000 + 170x.
The equation of the red, dashed line is ŷ = −10,000 + 140x.

a. Which of these lines do you think is better? Why?

b. Consider the slope of the red, dashed line. How do we interpret it in context of the variables
represented?⁸

c. One of the homes in the ames50 data set has 2475 square feet of living area. Estimate the
sales price of this home using both of the lines.

d. The actual sales price for the home with 2475 square feet of living area was $355,000. How
far off are the estimates from part (c) from the observed sales price?

Note: These differences are called residuals.
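To check parts (c) and (d), here is a small R sketch that plugs 2475 square feet into each candidate line and computes the corresponding residual (observed minus predicted), using the observed sale price of $355,000 given above:

x <- 2475                        # living area (sq ft)
y <- 355000                      # observed sale price ($)
yhat_blue <- -60000 + 170 * x    # prediction from the blue, solid line
yhat_red  <- -10000 + 140 * x    # prediction from the red, dashed line
y - yhat_blue                    # residual for the blue line: 355000 - 360750 = -5750
y - yhat_red                     # residual for the red line:  355000 - 336500 = 18500

By these calculations, the blue line overestimates this home's sale price by $5,750, while the red line underestimates it by $18,500.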

⁸ When we interpret the slope, we need to be careful to talk about association (we do not want to imply causation).


Evaluating the Fit of a Line
Neither the blue line nor the red line estimated the sales price exactly, but both lines sure did
better than the sample mean sales price of $229,033! No line will fit the data exactly,⁹ so we
need to figure out our criteria for determining the equation of the line, ŷ = b₀ + b₁x, that fits the
data "best."

In our housing data, the residuals were calculated as


observed sales price – predicted sales price

Residuals are the leftover variation in the response variable that the model cannot account for:
Data = Fit + Residual

Each observation will have a residual. Observations above the regression line have positive
residuals, while observations below the line have negative residuals. Our goal in picking the
right linear model is for these residuals to be as small as possible.

Notation: When we have a sample of n observations, we denote the ordered pairs by
(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ). If we want to refer to a point in general, we denote it as (xᵢ, yᵢ).

Definition: The residual of the i-th observation (xᵢ, yᵢ) is the difference between the observed response
(yᵢ) and the response we would predict based on the model fit (ŷᵢ):

eᵢ = yᵢ − ŷᵢ

We typically identify ŷᵢ by plugging xᵢ into the model.

We want a line that has small residuals. Think about it: Why do you think small residuals are good?

For any reasonable line, some residuals are positive and others are negative (and some may even be 0). In fact, for the least-squares line the average of the residuals, (1/n)Σeᵢ, is zero, so the sum (or average) of the residuals is not a helpful criterion. It turns out that the line that fits the data "best" is the one that minimizes the sum of the squared residuals (that is, the line that minimizes Σeᵢ²).
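To make this criterion concrete, here is a hedged R sketch that compares the sum of squared residuals for the blue and red candidate lines across the whole sample (again assuming ames50 is a data frame with columns livingArea and salePrice):

# residuals from each candidate line
resid_blue <- ames50$salePrice - (-60000 + 170 * ames50$livingArea)
resid_red  <- ames50$salePrice - (-10000 + 140 * ames50$livingArea)

# sum of squared residuals: smaller means a better fit by this criterion
sum(resid_blue^2)
sum(resid_red^2)

Whichever line yields the smaller sum of squared residuals fits the data better by this criterion; the least-squares regression line is the line that makes this sum as small as mathematically possible.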

Equation of the Line in Simple Linear Regression


There is a line that is even better than either the red or the blue line at minimizing the squared
residuals. In fact, as long as certain conditions are met,¹⁰ this line produces the smallest sum
of squared residuals, Σeᵢ², mathematically possible; as a consequence, it is called the least-squares
regression line.

⁹ Unless there are only two data points, and you don't need statistics for that!
¹⁰ Conditions: For the line to be the best linear regression for a set of data, certain assumptions about the data must be met. We'll get to these later when we come back to inference for regression. For now, it's enough to know that most of these assumptions have to do with the residuals we've looked at over the past few pages.


• Property 1: An estimate of the slope of the line is b₁ = r · (s_y / s_x).
• Property 2: The line must pass through the point (x̄, ȳ); consequently, an estimate of the
y-intercept of the least-squares regression line is b₀ = ȳ − b₁x̄.

The equation for the least-squares regression line is then

ŷ = b₀ + b₁x

Let's use these properties to calculate the least-squares regression line for predicting sales
price when living area is known.

e. The summary statistics for the livingArea and salePrice variables in the ames50 data set are:

Variable       mean       sd
livingArea     1747       539.88
salePrice      229,033    86,060.69
correlation between livingArea and salePrice: r = 0.8641

Step 1: Compute b₁, the slope.

Step 2: Compute b₀, the intercept.

Give the equation of the least-squares regression line.
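If you want to check your Steps 1 and 2, here is a minimal R sketch that uses only the rounded summary statistics in the table above (because of rounding, the results differ slightly from the fitted line reported below):

r     <- 0.8641
x_bar <- 1747;   s_x <- 539.88       # livingArea: mean and sd
y_bar <- 229033; s_y <- 86060.69     # salePrice: mean and sd
b1 <- r * s_y / s_x                  # Property 1: slope
b0 <- y_bar - b1 * x_bar             # Property 2: intercept
c(b0, b1)                            # roughly -11,600 and 137.7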

As you can imagine, calculating the least-squares regression line by hand is quite time-consuming
and does little to add to our conceptual understanding of what the regression line
does. When we have a data set, it's much easier to have R calculate the equation of the least-squares
regression line.

For our ames50 data set, the equation of this line is

estimated sales price ($) = –11,568.89 + 137.74 × (living area, sq ft)
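In R, the entire fit is a single call to lm(). A sketch, assuming the ames50 data frame described earlier:

# fit the least-squares regression line of sale price on living area
fit <- lm(salePrice ~ livingArea, data = ames50)
coef(fit)    # intercept about -11,568.89 and slope about 137.74, as above

The coef() function extracts the estimated intercept b₀ and slope b₁ from the fitted model object.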
Another Example: Predicting the Cost of a LEGO Set
Dr. Miller is an avid AFOL (adult fan of LEGO). They were curious about the relationship
between price and piece count in LEGO sets that they might like to build. They performed the
following analysis about the relationship between price and number of pieces for a random
sample of 30 LEGO sets that were available from the LEGO website in January 2021.

Use the scatterplot to describe the relationship between the price of a LEGO set and the number
of pieces in the set.



What is the equation of the least-squares regression line for these data?

Interpreting the Slope and Intercept


When you learned about slope in your math classes, you learned that the slope is the change in
y for a one-unit change in x. This assumes all points fall exactly on the line. In statistics, we
recognize the natural variability inherent to data, so we need to adjust the interpretation of the
slope accordingly. The slope is the estimated difference in the average outcome of y when
the predictor variable x happens to be one unit larger. For the LEGO data, we estimate that the
average price of a LEGO set is $0.076 higher for each additional piece in the set.

The y-intercept describes the average outcome of y when x = 0 AND the linear model is valid
all the way down to x = 0, which in many applications is not the case. For this example, we would
predict the cost of a LEGO set with 0 pieces to be $9.28. This is obviously nonsensical.

Using the Regression Line to Make Predictions


One of the LEGO sets in the sample has 2480 pieces and costs $169.99. Calculate the residual
for this set. Did the linear model overestimate or underestimate the cost of this LEGO set?
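One way to organize this calculation in R, typing in the rounded intercept and slope reported above (so the numbers below are approximate):

y_obs <- 169.99                 # observed price ($) of the 2480-piece set
y_hat <- 9.28 + 0.076 * 2480    # predicted price from the fitted line
y_obs - y_hat                   # residual: roughly -27.8

The residual is negative (roughly -$28 with these rounded coefficients), so the linear model overestimated the cost of this LEGO set.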

Extrapolation is Treacherous¹¹
This quote from The Colbert Report says it all:
When those blizzards hit the East Coast this winter, it proved to my satisfaction that
global warming was a fraud. That snow was freezing cold. But in an alarming trend,
temperatures this spring have risen. Consider this: On February 6th it was 10 degrees.
Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the
climate debate rages on. (http://www.cc.com/shows/the-colbert-report)

Should we use our model to predict the cost of the new Colosseum set that has 9036 pieces?
The Frankenstein BrickHeadz set with 108 pieces?

¹¹ ISRS, page 232.


Predicting outside of the range of the x-variable is called extrapolation. Extrapolation assumes
that the linear relationship continues outside of the data range. It may be tempting, but don't
extrapolate.¹²

¹² There are a lot of things you shouldn't let your friends do. Remember that "friends don't let friends extrapolate."


Using R² to Describe the Strength of a Fit
We evaluated the strength of the linear relationship between two variables earlier using the
correlation, r. However, it is more common to explain the strength of a linear fit using R²,
called R-squared.

The R² of a linear model describes the amount of variation in the response variable that is
explained by the least-squares regression line.

For example, consider the LEGO data, shown with the regression line in the scatterplot below.
Also included below is R output for the variance for price and variance for the residuals.

The variance of the response variable, price, is s²_price = 5722.299. However, if we apply our
least-squares line, then this model reduces our uncertainty in predicting price using the number of
pieces in the set. The variability in the residuals describes how much variation remains after
using the model: s²_resid = 690.5406.

We saw a reduction of
(s²_price − s²_resid) / s²_price = (5722.299 − 690.5406) / 5722.299 = 5031.758 / 5722.299 = 0.879,
or about 87.9%, in the data's variation by using information about the linear relationship
between price and number of pieces.
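The same calculation in R, using the two variances reported above (the variable names here are just illustrative):

var_price <- 5722.299                   # variance of the response (price)
var_resid <- 690.5406                   # variance of the residuals
(var_price - var_resid) / var_price     # about 0.879

If you have a fitted model object (say, fit), summary(fit)$r.squared reports essentially the same quantity directly.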

Bottom line: The R² value is a measure of how good the linear model is. The closer R² is to
100%, the better.¹³

¹³ When trying to predict measures of human behavior (e.g., in psychology), it is not unusual for an R² of around 10% or 20% to indicate that the explanatory variable is a helpful predictor for the response variable.


Section 5.3 Types of Outliers in Linear Regression
Recall that outliers in regression are observations that fall far from the “cloud” of points. Outliers
are especially important because they can have a strong influence on the least squares line.

Two types of outliers are leverage points and influential points:

• Points that fall horizontally far from the line are points of high leverage; these points
can strongly influence the slope of the least squares line.
• If one of these high leverage points does appear to exert its influence on the slope
of the line, then we call it an influential point.

It is tempting to remove outliers. Don’t do this without a very good reason. Models that ignore
exceptional (and interesting) cases often perform poorly.
Examining Outliers
Consider the following scatterplot for a data set of simulated points. Notice how the regression
line fits the data quite well. The equation for the line is ŷ = 1.0927 + 2.9096x.

Let's see what a non-leverage outlier (unusual in y but not unusual in x) does to our regression
line. In this scatterplot, the point (2.5, 18) was added. Notice that the (turquoise) regression
line with this outlier included is just a smidgen above the original regression line. Our estimate
for the y-intercept is higher, but our estimate for the slope is just about the same:
ŷ = 1.5901 + 2.8946x.

How about a non-influential leverage point (unusual in x but not unusual in y)? What does a
point like this do to our regression line? Now we add the point (8, 25) to our original data.
x = 8 is unusually large in the horizontal direction. It is difficult to distinguish the (deep pink)
regression line with (8, 25) added from the original (black) regression line. Neither the estimate
for the y-intercept nor the estimate for the slope changed much: ŷ = 1.0077 + 2.9500x. The
leverage point hasn't affected our estimate of the slope because it follows the linear trend of
the original data. Thus, the point is not considered to be influential.

So, what does it look like when an outlier is an influential leverage point (unusual in x and does
not fit the linear pattern)? We return to our original data and now add the point (8, 5). This
point is unusual in the horizontal direction and does not follow the linear trend of the original
data. We might expect it to alter both the y-intercept and the slope. Let's see… The (blue)
regression line is very different from the regression line for the original points; the influential
point has had an impact. The equation of the new line is ŷ = 3.7023 + 1.6674x. Indeed, both
estimates in our model changed drastically.
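This behavior is easy to reproduce with a quick simulation. A sketch, where the data below are made up to roughly mimic the example (they are not the actual simulated points used for the figures):

set.seed(1)
x <- runif(20, 0, 5)
y <- 1 + 3 * x + rnorm(20, sd = 1)    # a roughly linear cloud of points
coef(lm(y ~ x))                       # fit to the original points

x_new <- c(x, 8)                      # add the influential point (8, 5)
y_new <- c(y, 5)
coef(lm(y_new ~ x_new))               # both estimates change noticeably

Refitting with the added point pulls the line toward (8, 5), which flattens the slope and raises the intercept, echoing the figures above.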
Caution: Don't ignore outliers when fitting a final model
If there are outliers in the data, they should not be removed or ignored without a good reason.
Whatever final model is fit to the data would not be very helpful if it ignores the most
exceptional cases.
Bottom Line: Above all, clearly communicate the decisions you make when you fit your
model. If you explain your reasoning, others can understand the decisions you have made.
