
Linear Regression

In many applications, scientists try to determine whether
two variables are related. If they are related, the scientists then
try to find an equation that can be used to model the
relationship. For instance, a geologist might want to know
whether there is a relationship between the duration of a
geyser's eruption and the time between eruptions. A first
step in this determination is to collect some data. Data involving
two variables are called bivariate data. The table below shows the
time between eruptions and the duration of the second
eruption for 10 eruptions of the geyser Old Faithful.
Time between eruptions    Duration of eruption
(in seconds), x           (in seconds), y
272                       89
227                       79
237                       83
238                       82
203                       81
270                       85
218                       78
226                       81
250                       85
245                       79
The Least-Squares Regression Line
The least-squares regression line for a set of bivariate data
is the line that minimizes the sum of the squares of the vertical
deviations from each data point to the line.
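The quantity being minimized can be made concrete with a short Python sketch. It evaluates the sum of squared vertical deviations for a few candidate lines against the Old Faithful data. The function name and the three candidate lines are illustrative choices, not part of the original material.

```python
# Old Faithful data: time between eruptions (x) and eruption duration (y), in seconds.
intervals = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
durations = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]


def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical deviations from each point to the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical candidate lines, chosen only to show how the criterion ranks them.
candidates = {
    "y = 82.2 (horizontal line)": (0.0, 82.2),
    "y = 0.10x + 58.0": (0.10, 58.0),
    "y = 0.12x + 53.6": (0.12, 53.6),
}

for label, (a, b) in candidates.items():
    sse = sum_squared_deviations(a, b, intervals, durations)
    print(f"{label}: {sse:.3f}")
```

Among these candidates, the line with the smallest printed value fits best in the least-squares sense. The least-squares regression line is the one whose value is smaller than that of every other possible line.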
The Formula for the Least-Squares Line
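The equation of the least-squares line for the n ordered pairs
(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n)
is y = ax + b, where
\[
a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}
\quad\text{and}\quad
b = \bar{y} - a\bar{x}
\]
Here \bar{x} and \bar{y} denote the means of the x-values and the y-values.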
         x      y       xy       x^2
       272     89   24,208    73,984
       227     79   17,933    51,529
       237     83   19,671    56,169
       238     82   19,516    56,644
       203     81   16,443    41,209
       270     85   22,950    72,900
       218     78   17,004    47,524
       226     81   18,306    51,076
       250     85   21,250    62,500
       245     79   19,355    60,025
Sum  2,386    822  196,636   573,560
The Least-Squares Line
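To apply this formula to the Old Faithful data, we first find the value of
each summation:
\[
\sum x = 2{,}386 \qquad \sum y = 822 \qquad \sum x^2 = 573{,}560 \qquad \sum xy = 196{,}636
\]
Next, with n = 10, we use these values to find the value of a:
\[
\begin{aligned}
a &= \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \\
  &= \frac{10(196{,}636) - (2{,}386)(822)}{10(573{,}560) - (2{,}386)^2} \\
  &= \frac{1{,}966{,}360 - 1{,}961{,}292}{5{,}735{,}600 - 5{,}692{,}996} \\
  &= \frac{5{,}068}{42{,}604} \\
  &\approx 0.1189559666
\end{aligned}
\]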
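We then find the values of \bar{x} and \bar{y}, the means of the x-values and the y-values,
\[
\bar{x} = \frac{\sum x}{n} = \frac{2{,}386}{10} = 238.6
\qquad
\bar{y} = \frac{\sum y}{n} = \frac{822}{10} = 82.2
\]
and use them to find the y-intercept b:
\[
\begin{aligned}
b &= \bar{y} - a\bar{x} \\
  &= 82.2 - 0.1189559666(238.6) \\
  &= 82.2 - 28.38289363 \\
  &= 53.81710637
\end{aligned}
\]
The least-squares line y = ax + b is therefore
\[
y = 0.1189559666x + 53.81710637
\]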
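The hand calculation above can be cross-checked with a small Python sketch that applies the same summation formulas to the Old Faithful data. The function and variable names are illustrative assumptions, not taken from the original slides.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b using the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: a = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: b = y_bar - a * x_bar
    b = sum_y / n - a * (sum_x / n)
    return a, b


# Old Faithful data: time between eruptions (x) and eruption duration (y), in seconds.
intervals = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
durations = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]

slope, intercept = least_squares_line(intervals, durations)
print(f"y = {slope:.10f}x + {intercept:.8f}")
```

The printed coefficients should agree with the values of a and b found by hand, up to rounding in the last digits.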
The Linear Regression Equation
We can now use the regression equation to estimate the
duration of an eruption given the time between eruptions.
For instance, if the time between two eruptions is 200 seconds,
then the estimated duration of the second eruption is
\[
\begin{aligned}
y &= ax + b \\
  &= 0.1189559666(200) + 53.81710637 \\
  &= 23.79119332 + 53.81710637 \\
  &= 77.60829969
\end{aligned}
\]
or about 77.6 seconds.
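As a rough sketch, the estimate can be wrapped in a small Python helper. The helper names are hypothetical, and the coefficients are the rounded values computed earlier. The sketch also flags inputs outside the observed range of 203 to 272 seconds: an interval of 200 seconds lies slightly below the smallest observed value, so the estimate is a mild extrapolation.

```python
# Coefficients of the least-squares line found above (rounded values from the text).
A = 0.1189559666
B = 53.81710637

# Smallest and largest observed times between eruptions, in seconds.
X_MIN, X_MAX = 203, 272


def predict_duration(interval_s):
    """Estimated eruption duration (seconds) for a given time between eruptions."""
    return A * interval_s + B


def is_extrapolation(interval_s):
    """True when the input lies outside the range of the observed x-values."""
    return not (X_MIN <= interval_s <= X_MAX)


estimate = predict_duration(200)
print(f"estimated duration: {estimate:.2f} s, extrapolated: {is_extrapolation(200)}")
```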
TABLE 4.17 Speed for Selected Stride Lengths
a. Adult men
Stride length (m)   2.5   3.0   3.3   3.5   3.8   4.0   4.2   4.5
Speed (m/s)         3.4   4.9   5.5   6.6   7.0   7.7   8.3   8.7
b. Dogs
Stride length (m)   1.5   1.7   2.0   2.4   2.7   3.0   3.2   3.5
Speed (m/s)         3.7   4.4   4.8   7.1   7.7   9.1   8.8   9.9
c. Camels
Stride length (m)   2.5   3.0   3.2   3.4   3.5   3.8   4.0   4.2
Speed (m/s)         2.3   3.9   4.4   5.0   5.5   6.2   7.1   7.6
Find the equation of the least-squares line for the ordered pairs in part a.
[Figure: least-squares line for speed vs. stride length]
Linear Correlation Coefficient
To determine the strength of a linear relationship
between two variables, statisticians use a statistic called
the linear correlation coefficient, which is denoted by the
variable r and is defined as follows.
For the n ordered pairs (x_1, y_1), (x_2, y_2), (x_3, y_3), ...,
(x_n, y_n), the linear correlation coefficient r is given by
\[
r = \frac{n\sum xy - (\sum x)(\sum y)}
         {\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}}
\]
If the linear correlation coefficient r is positive, the
relationship between the variables has a positive
correlation. In this case, if one variable increases, the
other variable also tends to increase. If r is negative, the
linear relationship between the variables has a negative
correlation. In this case, if one variable increases, the
other variable tends to decrease.
Linear Correlation
Figure 4.19 shows some scatter diagrams along with the
type of linear correlation that exists between the x and y
variables. The closer |r| is to 1, the stronger the linear
relationship between the variables.
Example 3 Find a Linear Correlation Coefficient
Find the linear correlation coefficient for stride
length versus speed of an adult man. Use the data in
Table 4.17a. Round your result to the nearest hundredth.
Solution
The ordered pairs are (2.5, 3.4), (3.0, 4.9), (3.3, 5.5),
(3.5, 6.6), (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7).
The number of ordered pairs is n = 8, and
\[
\sum x = 28.8 \qquad \sum y = 52.1 \qquad \sum x^2 = 106.72 \qquad \sum xy = 195.86
\]
The only additional value that is needed is
\[
\sum y^2 = 3.4^2 + 4.9^2 + 5.5^2 + 6.6^2 + 7.0^2 + 7.7^2 + 8.3^2 + 8.7^2 = 362.25
\]
Substituting the above values into the equation for the
linear correlation coefficient gives us
\[
\begin{aligned}
r &= \frac{n\sum xy - (\sum x)(\sum y)}
          {\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}} \\
  &= \frac{8(195.86) - (28.8)(52.1)}
          {\sqrt{8(106.72) - (28.8)^2} \cdot \sqrt{8(362.25) - (52.1)^2}}
\end{aligned}
\]
          x       y       xy      x^2      y^2
        2.5     3.4     8.50     6.25    11.56
        3.0     4.9    14.70     9.00    24.01
        3.3     5.5    18.15    10.89    30.25
        3.5     6.6    23.10    12.25    43.56
        3.8     7.0    26.60    14.44    49.00
        4.0     7.7    30.80    16.00    59.29
        4.2     8.3    34.86    17.64    68.89
        4.5     8.7    39.15    20.25    75.69
Sum   28.80   52.10   195.86   106.72   362.25
Evaluating the numerator and the two expressions under the radicals:
\[
\begin{aligned}
8(195.86) - (28.8)(52.1) &= 1566.88 - 1500.48 = 66.40 \\
8(106.72) - (28.8)^2 &= 853.76 - 829.44 = 24.32 \\
8(362.25) - (52.1)^2 &= 2898.00 - 2714.41 = 183.59
\end{aligned}
\]
so that
\[
r = \frac{66.40}{\sqrt{24.32} \cdot \sqrt{183.59}} \approx 0.993715
\]
Rounded to the nearest hundredth, r ≈ 0.99.
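The arithmetic in this example can be checked with a short Python sketch that applies the summation formula for r to the adult-men data. The function and variable names are illustrative assumptions.

```python
from math import sqrt


def linear_correlation(xs, ys):
    """Linear correlation coefficient r computed from the summation formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Adult men: stride length (m) and speed (m/s).
stride = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]
speed = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]

r = linear_correlation(stride, speed)
print(f"r = {r:.6f}, rounded to the nearest hundredth: {round(r, 2)}")
```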
▼ Check your progress 3
Find the linear correlation coefficient for stride
length versus speed of a camel as given in Table 4.17c.
Round your result to the nearest hundredth.
▼ Properties of the Linear Correlation Coefficient
1. The linear correlation coefficient r is always a real number between -1 and
1, inclusive. In the case in which
■ all of the ordered pairs lie on a line with positive slope, r is 1.
■ all of the ordered pairs lie on a line with negative slope, r is -1.
2. For any set of ordered pairs, the linear correlation coefficient r and the
slope of the least-squares line both have the same sign.
3. Interchanging the variables in the ordered pairs does not change the
value of r. Thus the value of r for the ordered pairs (x_1, y_1), (x_2, y_2), (x_3, y_3),
..., (x_n, y_n) is the same as the value of r for the ordered pairs (y_1, x_1), (y_2,
x_2), (y_3, x_3), ..., (y_n, x_n).
4. The value of r does not depend on the units used. You can change the
units of a variable from, for example, feet to inches and the value of r will
remain the same.
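Properties 1, 3, and 4 can be checked numerically with a minimal Python sketch. It uses the adult-men stride data and two small made-up data sets that lie exactly on a line. All names here are illustrative, and the unit change used is metres to centimetres rather than feet to inches.

```python
from math import isclose, sqrt


def linear_correlation(xs, ys):
    """Linear correlation coefficient r computed from the summation formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Adult men: stride length (m) and speed (m/s).
stride_m = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]
speed = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]
r_original = linear_correlation(stride_m, speed)

# Property 3: interchanging the variables leaves r unchanged.
r_swapped = linear_correlation(speed, stride_m)

# Property 4: changing units (metres to centimetres) leaves r unchanged.
stride_cm = [100 * s for s in stride_m]
r_rescaled = linear_correlation(stride_cm, speed)

# Property 1: points exactly on a line give r = 1 or r = -1 (made-up data).
xs = [1, 2, 3, 4]
r_rising = linear_correlation(xs, [2 * x + 1 for x in xs])
r_falling = linear_correlation(xs, [-2 * x + 1 for x in xs])

print(isclose(r_original, r_swapped), isclose(r_original, r_rescaled))
print(round(r_rising, 6), round(r_falling, 6))
```

Property 2 follows from the formulas themselves: r and the slope a share the numerator n∑xy - (∑x)(∑y), and both denominators are positive whenever they are defined.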
Given the bivariate data
x   10   12   14   15   16
y    8    7    5    4    1
a. Draw a scatter diagram for the data.
b. Find n, ∑x, ∑y, ∑x^2, ∑y^2, and ∑xy.
c. Find a, the slope of the least-squares regression line, and b, the
y-intercept of the least-squares line.
d. Draw the least-squares line on the scatter diagram from part a.
e. Is the point (\bar{x}, \bar{y}) on the least-squares line?
f. Use the equation of the least-squares line to predict the value of y for
x = 8.
g. Find the linear correlation coefficient.
