
Chapter 10 – Linear correlation and regression

10.1 Bivariate data and scatter diagrams


Often two variables are measured simultaneously and the relationships between these variables
are explored. Data sets involving two variables are known as bivariate data sets.

The first step in the exploration of bivariate data is to plot the variables on a graph. From
such a graph, which is known as a scatter diagram (scatter plot, scatter graph), an idea can
be formed about the nature of the relationship.

Examples

1) The number of copies sold (y) of a new book (measured in thousands of units) is
dependent on the advertising budget (x) the publisher commits in a pre-publication
campaign (measured in thousands of Rands). The values of x and y for 12 recently
published books are shown below.

x   8     9.5   7.2   6.5   10    12    11.5  14.8  17.3  27    30    25
y   12.5  18.6  25.3  24.8  35.7  45.4  44.4  45.8  65.3  75.7  72.3  79.2

[Scatter diagram: Advertising budget and copies sold. Horizontal axis: advertising budget;
vertical axis: copies sold.]
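As a rough illustration of this first step, the sketch below draws a text-only scatter diagram of the book data. The function name `ascii_scatter` and the grid size are assumptions made for illustration; in practice a plotting library would normally be used instead.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a rough text scatter plot of (xs, ys) as a list of rows."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each point to a column and row of the character grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(r) for r in grid]

# Advertising budget (x) and copies sold (y) from the example above.
budget = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
copies = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]

plot_rows = ascii_scatter(budget, copies)
for line in plot_rows:
    print("|" + line)
print("+" + "-" * 40)
```

The stars should rise from the lower left to the upper right, which is the pattern the scatter diagram is meant to reveal.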

2) In a study of the relationship between the amount of daily rainfall (x) and the
quantity of air pollution removed (y), the following data were collected.

Rainfall (centimeters)    Quantity removed (micrograms per cubic meter)
4.3                       126
4.5                       121
5.9                       116
5.6                       118
6.1                       114
5.2                       118
3.8                       132
2.1                       141
7.5                       108

[Scatter diagram: Rainfall and quantity removed. Horizontal axis: rainfall; vertical axis:
quantity removed.]

• In both cases the relationship can be described fairly well by a straight line,
i.e. both relationships are linear.

• In the first example y tends to increase as x increases (positive linear relationship).
In the second example y tends to decrease as x increases (negative linear relationship).

• In both examples changes in the values of y are affected by changes in the values
of x (not the other way round). The variable x is known as the explanatory
(independent) variable and the variable y as the response (dependent) variable.

In this section only linear relationships between 2 variables will be explored. The issues to
be explored are

1) Measuring the strength of the linear relationship between the 2 variables (the linear
correlation problem).

2) Finding the equation of the straight line that will best describe the relationship
between the 2 variables (the linear regression problem). Once this line is
determined, it can be used to estimate a value of y for a given value of x (linear
estimation).

10.2 Linear Correlation


The calculation of the coefficient of correlation (r) is based on the closeness of the plotted
points (in the scatter diagram) to the line fitted through them. It can be shown that

–1 ≤ r ≤ 1.

If the plotted points are closely clustered around this line, r will lie close to either 1 or –1
(depending on whether the linear relationship is positive or negative). Perfect positive
correlation occurs when all the plotted points lie on a line with a positive gradient. For this
case r = 1. Perfect negative correlation occurs when the plotted points lie on a line with a
negative gradient. For this case r = –1. The further the plotted points are away from the line,
the closer the value of r will be to 0. Consider the scatter diagrams that follow.

[Scatter diagram: Strong positive correlation (r close to 1)]

[Scatter diagram: Strong negative correlation (r close to –1)]

[Scatter diagram: No pattern (r close to 0)]
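The three cases can be checked numerically. The sketch below assumes the standard sum-based formula for Pearson's r; the function name `correlation` and the small data sets are made up purely for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r from the usual computational (sum-based) formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_down = correlation(xs, [10 - 3 * x for x in xs])  # points on a falling line
r_flat = correlation(xs, [2, 5, 1, 5, 2])           # symmetric values, no linear trend
print(r_up, r_down, r_flat)
```

Points lying exactly on a rising or falling line give r = 1 and r = –1 respectively, while the symmetric third data set gives r = 0.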

For a sample of n pairs of values (x1, y1), (x2, y2), . . . , (xn, yn), the coefficient of
correlation can be calculated from the formula

r = (n∑xy − ∑x∑y) / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}.

Example

Consider the data on the advertising budget (x) and the number of copies sold (y)
considered earlier. For this data r can be calculated in the following way.

x y xy x² y²
8 12.5 100 64 156.25
9.5 18.6 176.7 90.25 345.96
7.2 25.3 182.16 51.84 640.09
6.5 24.8 161.2 42.25 615.04
10 35.7 357 100 1274.49
12 45.4 544.8 144 2061.16
11.5 44.4 510.6 132.25 1971.36
14.8 45.8 677.84 219.04 2097.64
17.3 65.3 1129.69 299.29 4264.09
27 75.7 2043.9 729 5730.49
30 72.3 2169 900 5227.29
25 79.2 1980 625 6272.64
sum 178.8 545 10032.89 3396.92 30656.5

Substituting
n = 12, ∑x = 178.8, ∑y = 545,
∑xy = 10032.89, ∑x² = 3396.92, ∑y² = 30656.5

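into the equation for r gives

r = (12 × 10032.89 − 178.8 × 545) / √[(12 × 3396.92 − 178.8²)(12 × 30656.5 − 545²)]
  = 22948.68 / √(8793.6 × 70853)
  = 0.9194.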



Comment: Strong positive correlation i.e. the increase in the number of copies sold
is closely linked with an increase in advertising budget.
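The hand calculation can be reproduced in a few lines. This is a minimal sketch that assumes the standard sum-based formula for r; the variable names are chosen here for illustration.

```python
from math import sqrt

# Advertising budget (x) and copies sold (y) for the 12 books.
x = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
y = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]

# Column sums, as in the bottom row of the table.
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(round(sum_xy, 2), round(sum_x2, 2), round(sum_y2, 2))
print(round(r, 4))
```

The printed sums should agree with the table, and r should round to 0.9194.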

Coefficient of determination
The strength of the linear relationship between 2 variables can also be expressed by the
square of the correlation coefficient (r²). This quantity, called the coefficient of determination,
is the proportion of variability in the y variable that is accounted for by its linear
relationship with the x variable.

Example
In the above example on copies sold (y) and advertising budget (x), the
coefficient of determination = r² = 0.9194² = 0.8453.
This means that 84.53% of the variability in copies sold is explained by
its linear relationship with advertising budget.
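The "proportion of variability" interpretation can be checked numerically. The sketch below computes r² in deviation form (algebraically equivalent to the sum-based formula) and compares it with the share of variation in y that remains after fitting the least squares line described in the next section. The variable names are assumptions for illustration.

```python
x = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
y = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)  # total variation in y

r_squared = sxy ** 2 / (sxx * syy)

# Least squares line, then the variation in y left unexplained by it.
b = sxy / sxx
a = y_bar - b * x_bar
unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
explained_share = 1 - unexplained / syy

print(round(r_squared, 4), round(explained_share, 4))
```

Both printed values should agree, since for a straight-line fit the explained share of the variation in y equals r².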

10.3 Linear Regression


Finding the equation of the line that best fits the (x, y) points is based on the least squares
principle. This principle can best be explained by considering the scatter diagram below.

The scatter diagram is a plot of the DBH (diameter at breast height measured in inches)
versus the age (years) for 12 oak trees. The data are shown in the following table.

Age (x) 97 93 88 81 75 57 52 45 28 15 12 11
DBH (y) 12.5 12.5 8 9.5 16.5 11 10.5 9 6 1.5 1 1

According to the least squares principle, the line that “best” fits the plotted points is the one
that minimizes the sum of the squares of the vertical deviations (see vertical lines in the
graph) between the plotted y and estimated y (values on the line). For this reason the line
fitted according to this principle is called the least squares line.
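The principle can be illustrated by comparing the sum of squared vertical deviations for the least squares line with that of a few other lines. In this sketch the coefficients are computed with the sum-based formulas given in the next subsection; the alternative lines are arbitrary choices made for comparison only.

```python
age = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
dbh = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]

def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Least squares coefficients from the sum-based formulas.
n = len(age)
sx, sy = sum(age), sum(dbh)
sxy = sum(x * y for x, y in zip(age, dbh))
sxx = sum(x * x for x in age)
b_ls = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a_ls = sy / n - b_ls * sx / n

best = sum_squared_deviations(a_ls, b_ls, age, dbh)

# A few arbitrarily chosen alternative lines (hypothetical, for comparison only).
others = [(0.0, 0.15), (2.0, 0.12), (1.285, 0.14), (a_ls + 0.5, b_ls)]
for a, b in others:
    print(a, b, round(sum_squared_deviations(a, b, age, dbh), 2), ">", round(best, 2))
```

Every alternative line should give a larger sum of squared deviations than the least squares line.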

Calculation of least squares linear regression line

The equation for the line to be fitted to the (x, y) points is

ŷ = a + bx,

where ŷ is the fitted y value (the y value on the line, which generally differs from the
observed y value), a is the y-intercept and b the slope of the line.
It can be shown that the coefficients that define the least squares line can be
calculated from

b = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)   and   a = ȳ − b x̄,

where x̄ = ∑x / n and ȳ = ∑y / n are the sample means of x and y.

Example

For the above data on age (x) and DBH (y) the least squares line can be calculated as shown
below.

x y xy x²
97 12.5 1212.5 9409
93 12.5 1162.5 8649
88 8 704 7744
81 9.5 769.5 6561
75 16.5 1237.5 5625
57 11 627 3249
52 10.5 546 2704
45 9 405 2025
28 6 168 784
15 1.5 22.5 225
12 1 12 144
11 1 11 121
sum 654 99 6877.5 47240

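Substituting

n = 12, ∑x = 654, ∑y = 99,

∑xy = 6877.5, ∑x² = 47240

into the above equations gives

b = (12 × 6877.5 − 654 × 99) / (12 × 47240 − 654²) = 17784 / 139164 = 0.12779

and

a = 99/12 − 0.12779 × 654/12 = 8.25 − 0.12779 × 54.5 = 1.285.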

Therefore the equation of the y on x least squares line that can be used to estimate values
of y (DBH) based on x (age) is
ŷ = 1.285 + 0.12779 x.

Suppose the DBH of a tree aged 90 years is to be estimated. This can be done by
substituting the value of x = 90 into the above equation. Then
ŷ = 1.285 + 0.12779 × 90 = 12.786.
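The fit and the estimate can be reproduced with a short function. This is a minimal sketch of the sum-based formulas for a and b; the function name `least_squares_line` is an assumption made for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x using the sum-based formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # a = y-bar minus b times x-bar
    return a, b

age = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
dbh = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]

a, b = least_squares_line(age, dbh)
estimate_90 = a + b * 90  # estimated DBH (inches) for a tree aged 90 years
print(round(a, 3), round(b, 5), round(estimate_90, 3))
```

Because the code keeps the unrounded coefficients, the estimate may differ from the hand calculation in the last decimal place (about 12.787 rather than 12.786).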

A word of caution

• The linear relationship between y and x is often only valid for values of x within a
certain range, e.g. when estimating the DBH using age as explanatory variable, it
should be taken into account that at some age the tree will stop growing. Assuming a
linear relationship between age and DBH for values beyond the age where the tree
stops growing would be incorrect.

• Only relationships between variables that could plausibly be related in a practical sense
should be explored, e.g. it would be pointless to explore the relationship between the number
of vehicles in New York and the number of divorces in South Africa. Even if data
collected on such variables suggest a relationship, it cannot be of any practical
value.

• If variables are not linearly related, it does not mean that they are not related. There
are many situations where the relationships between variables are non-linear.
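The first caution can be built into an estimate as a simple range check. The sketch below uses the rounded coefficients from the oak-tree example; the function name `estimate_within_range` and the test age of 300 years are assumptions made for illustration.

```python
def estimate_within_range(a, b, x_new, observed_xs):
    """Estimate y-hat = a + b*x_new and report whether x_new lies in the observed range."""
    lo, hi = min(observed_xs), max(observed_xs)
    inside = lo <= x_new <= hi
    return a + b * x_new, inside

age = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
a, b = 1.285, 0.12779  # rounded coefficients from the oak-tree example

for x_new in (90, 300):
    y_hat, inside = estimate_within_range(a, b, x_new, age)
    note = "within observed ages" if inside else "extrapolation: treat with caution"
    print(x_new, round(y_hat, 3), note)
```

The line still returns a number for an age of 300 years, but the flag signals that the estimate falls outside the ages on which the line was fitted.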

Example

A plot of banana consumption (y) versus price (x) is shown in the graph below. A straight
line will not describe this relationship very well, but the non-linear curve shown in the
graph will describe it well.

[Figure: Nonlinear regression example. Banana consumption (y) plotted against price (x),
with a fitted non-linear curve.]

This sequence shows how a nonlinear regression model may be fitted. It uses the banana
consumption example in the first sequence.
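A small numerical sketch shows why a straight line can understate a relationship that is in fact exact. The data below are hypothetical and noise-free, generated from an assumed reciprocal curve y = 2 + 10/x; they are not the banana data, and the reciprocal form is an assumption chosen for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r from the usual computational (sum-based) formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Hypothetical, noise-free data generated from y = 2 + 10/x (not from the source).
price = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
consumption = [2 + 10 / p for p in price]

r_linear = correlation(price, consumption)   # linear correlation of y with x
z = [1 / p for p in price]                   # transformed variable z = 1/x
r_transformed = correlation(z, consumption)  # linear correlation of y with z
print(round(r_linear, 3), round(r_transformed, 3))
```

Although y is completely determined by x here, the linear correlation with x is noticeably weaker than –1, while the correlation with the transformed variable z = 1/x is 1.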
