AN OVERVIEW OF CORRELATION
AND REGRESSION
➢Inferential statistics – determining
whether a relationship between two
or more numerical or quantitative
variables exists
➢Independent variable – variable that
can be controlled or manipulated
➢Dependent variable – variable that
cannot be controlled or manipulated;
its value responds to changes in the
independent variable
EXAMPLE 1
Discuss the relationship between monthly
income and monthly savings
EXAMPLE 2
Discuss the relationship between the number
of counters opened at a bank and
waiting time
6.1 CORRELATION
➢ The scatter diagram
➢ Define correlation.
➢ Correlation coefficient for sample.
(i) Pearson’s product moment correlation coefficient
(ii) Spearman’s rank correlation coefficient
➢ Discuss the different values of r (show the different cases
graphically).
SCATTER DIAGRAM
(OR SCATTERPLOT)
SCATTER PLOT AND ITS USES
➢ Initial tool to study relationship between two
quantitative random variables.
➢ Indicates the degree of linear relationship
(perfect, high, moderate or low) between two
random variables.
➢ If the points are widely scattered, then it
indicates low correlation between the
variables.
➢ The more closely the points follow a linear
pattern, the higher the degree of
relationship.
➢ It also indicates whether the relationship is
linear positive or linear negative.
INTERPRETING SCATTER
PLOTS
➢(Perfect/high/moderate/low)
Positive linear relationship
➢(Perfect/High/moderate/low)
Negative linear relationship
➢Nonlinear relationship
➢No relationship
FIGURE 1: POSITIVE AND NEGATIVE
LINEAR RELATIONSHIPS
BETWEEN X AND Y.
[Two panels, (a) and (b), plot y against x: one with slope b > 0 and one with slope b < 0.]
FIGURE 3: LINEAR CORRELATION
BETWEEN TWO VARIABLES.
[Panels plot y against x for r = 1 and r = -1.]
FIGURE 5: LINEAR CORRELATION
BETWEEN TWO VARIABLES.
FIGURE 7: LINEAR CORRELATION
BETWEEN TWO VARIABLES.
FIGURE 9: LINEAR CORRELATION
BETWEEN TWO VARIABLES.
Nonlinear relationship, r ≈ 0
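The nonlinear case can be checked numerically. The sketch below uses hypothetical data (y = x² on x values symmetric about zero, not taken from the slides) and the usual computational form of Pearson's r. The function name pearson_r is an illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw data, using the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: y depends on x exactly, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(r)  # 0.0
```

Here y is completely determined by x, yet r is 0. This is why r ≈ 0 should be read as "no linear relationship" and the scatter plot should still be inspected.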
EXAMPLE 3
TABLE 1 Incomes (in hundreds of dollars)
and Food Expenditures of Seven
Households
[Table columns: Income, Food Expenditure]
INTERPRETATION OF FIGURE 10
➢ Figure 10 gives the scatter diagram or
scatter plot for the data of Table 1.
➢ Each dot in this diagram represents one
household.
➢ A scatter diagram is helpful in detecting a
relationship between two variables.
➢ By looking at the scatter diagram of
Figure 10, we observe that there exists a
strong/high positive linear relationship
between food expenditure and income.
➢ If a straight line is drawn through the
points, the points will be scattered closely
around the line.
CORRELATION
➢ Correlation produces a measurement that
describes the strength (perfect/high/
moderate/low) of the relationship between
variables.
CORRELATION COEFFICIENT
➢ Measures the strength (Perfect/high/moderate/low)
and direction (positive/negative) of a linear
relationship between a pair of random variables.
Pearson’s Product Moment
Correlation Coefficient, r
➢ Both variables must be quantitative and
normally distributed.
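➢ Calculation for r :
\[ r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} \]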
Interpretation of Pearson Correlation
coefficient, r
r                    Interpretation/Explanation/Comment
r = 0                No relationship
0 < r ≤ 0.5          Low positive linear relationship
-0.5 ≤ r < 0         Low negative linear relationship
0.5 < r < 0.7        Moderate positive linear relationship
-0.7 < r < -0.5      Moderate negative linear relationship
r ≥ 0.7              High positive linear relationship
r ≤ -0.7             High negative linear relationship
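The cut-offs in the table can be sketched as a small function. The name interpret_r is hypothetical, and the "Perfect" label for r = 1 or r = -1 is an assumption drawn from the perfect/high/moderate/low wording used for scatter plots, since the table itself has no such row.

```python
def interpret_r(r):
    """Return the verbal interpretation of a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if r == 0:
        return "No relationship"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1.0:
        strength = "Perfect"   # assumption: not a row of the table
    elif size >= 0.7:
        strength = "High"
    elif size > 0.5:
        strength = "Moderate"
    else:
        strength = "Low"
    return f"{strength} {direction} linear relationship"

print(interpret_r(0.87))   # High positive linear relationship
print(interpret_r(-0.6))   # Moderate negative linear relationship
```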
EXAMPLE 4
A study was carried out to determine the relationship between the
age and the time (in minutes) needed to run a 12 kilometre marathon
event. The following table shows the data recorded.
SOLUTION 4
n = 8
ΣX = 406, ΣX² = 21122
ΣY = 623, ΣY² = 49359
ΣXY = 32195
r = 0.87
There is a high positive linear relationship between the age and the
time (in minutes) needed to run a 12 kilometre marathon event.
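As a check on the arithmetic, the sums listed in this solution can be substituted into the formula for r. This is a minimal sketch; the variable names are illustrative, and X is taken to be age and Y the running time.

```python
import math

# Sums from the solution (X = age, Y = time in minutes)
n = 8
sum_x, sum_x2 = 406, 21122
sum_y, sum_y2 = 623, 49359
sum_xy = 32195

numerator = n * sum_xy - sum_x * sum_y                      # 4622
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *         # 4140
                        (n * sum_y2 - sum_y ** 2))          # 6743
r = numerator / denominator

print(round(r, 2))  # 0.87
```

The unrounded value is about 0.875, which falls in the "high positive" band (r ≥ 0.7).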
Spearman’s rank
correlation Coefficient
➢Spearman’s rank correlation coefficient is suitable for
qualitative data that can be ranked (ordinal data) and for quantitative data.
➢The variables must first be ranked (in either ascending or
descending order, using the same order for both variables).
➢For tied observations, that is, two or more observations
receiving the same score on the same variable, each of them is
assigned the average of the ranks which would have been assigned
had no ties occurred.
\[ s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \]
Interpretation of Spearman’s rank
correlation coefficient, s
s                    Interpretation/Explanation/Comment
s = 0                No relationship
0 < s ≤ 0.5          Low positive linear relationship
-0.5 ≤ s < 0         Low negative linear relationship
0.5 < s < 0.7        Moderate positive linear relationship
-0.7 < s < -0.5      Moderate negative linear relationship
s ≥ 0.7              High positive linear relationship
s ≤ -0.7             High negative linear relationship
EXAMPLE 5
The following information on mathematics score and marketing grade is
obtained from a random sample of ten students as shown in the following
table.

Mathematics score    Marketing grade
30                   F
88                   A
75                   A
90                   A
51                   B
20                   C
51                   C
90                   A
22                   C
51                   B

Use Spearman’s rank correlation to establish whether there is any
relationship between the mathematics score and marketing grade.
SOLUTION 5
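\[ \sum d^2 = 18 \]
\[ s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(18)}{10(10^2 - 1)} = 0.89 \]
Since s ≥ 0.7, there is a high positive relationship between the
mathematics score and marketing grade.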
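The ranking step is not shown in the solution, so the sketch below reconstructs it under stated assumptions: both variables are ranked in descending order (rank 1 = best), grades are ordered A > B > C > F, and ties receive the average of the ranks they occupy. The helper name average_ranks and the numeric grade codes are illustrative only.

```python
def average_ranks(values):
    """Rank values in descending order; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = ((pos + 1) + (end + 1)) / 2   # mean of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

math_scores = [30, 88, 75, 90, 51, 20, 51, 90, 22, 51]
grades = ["F", "A", "A", "A", "B", "C", "C", "A", "C", "B"]
grade_value = {"A": 4, "B": 3, "C": 2, "F": 1}   # assumed ordering A > B > C > F

rank_x = average_ranks(math_scores)
rank_y = average_ranks([grade_value[g] for g in grades])

n = len(math_scores)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2)        # 18.0
print(round(s, 2))   # 0.89
```

Under these assumptions the three scores of 51 occupy positions 5, 6 and 7 and each receives rank 6, and the four A grades each receive rank 2.5, which reproduces Σd² = 18.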
REGRESSION
➢A regression model is a mathematical
equation that describes the relationship
between two or more variables.
ŷ = a + bx
SIMPLE LINEAR REGRESSION
➢Simple Regression
➢Linear Regression
Simple Regression
Linear Regression
[Figure: food expenditure (y-axis) plotted against income (x-axis), showing (a) a linear and (b) a nonlinear relationship.]
Simple linear
Regression analysis
In the regression model
y = A + Bx + ε
A is the constant term or y-intercept, B is the slope, and ε is the
random error term.
Regression line
➢ Analyze the relationship between the two quantitative variables, X
and Y
ŷ = a + bx
➢ a : y-intercept (or constant term)
(i) if x = 0 is in the range (that is, between the minimum and maximum observed values of x), then
the value of a is the mean of the distribution of the response y when x = 0
(ii) if x = 0 is not in the range (not between the minimum and maximum observed values of x),
then the value of a has no practical interpretation
➢ b : slope:
change in the mean of the distribution of the response produced by a unit change in x
➢ ε : random error (the error term of the model y = A + Bx + ε)
The method of least squares
\[ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} \]
EXAMPLE 6
A study was carried out to determine the relationship between the
age and the time (in minutes) needed to run a 12 kilometre marathon
event. The following table shows the data recorded.
SOLUTION 6
n = 8
ΣX = 406, ΣX² = 21122
ΣY = 623, ΣY² = 49359
ΣXY = 32195
Solution (i)
a = 21.2164
b = 1.1164
ŷ = a + bx
ŷ = 21.2164 + 1.1164x
Solution (ii)
a = 21.2164
No practical interpretation for a since x = 0 is not in the range of X.
b = 1.1164
If the age increases by 1 year, the time needed to run a 12 kilometre
marathon event increases by about 1.1 minutes on average.
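The values of a and b can be reproduced from the sums. The slope formula is not legible in the source, so this sketch assumes the standard least squares computational form b = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²), together with a = ΣY/n − bΣX/n. Variable names are illustrative.

```python
# Sums from the solution (X = age, Y = time in minutes)
n = 8
sum_x, sum_x2 = 406, 21122
sum_y = 623
sum_xy = 32195

# Assumed standard least squares slope, then the intercept from the means.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(round(a, 4), round(b, 4))  # 21.2164 1.1164
```

Computing a from the unrounded slope gives 21.2164. Substituting the rounded b = 1.1164 instead gives 21.2177, so it is safer to round only at the end.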
Interpretation of a and b
Slope, b
➢ Change in the mean of the distribution of the response produced by
a unit change in x.
➢ Change in y due to a change of one unit in x.
➢ In general,
o If b is positive, y increases by b units when x increases by 1 unit.
o If b is negative, y decreases by |b| units when x increases by 1 unit.
o Or, for each additional unit of x, y changes by b units.
Constant term or y-intercept, a
➢ If x = 0 is in the range, then a is the mean of the distribution of the
response y.
➢ If x = 0 is not in the range, then a has no practical interpretation.
COEFFICIENT OF
DETERMINATION, r²
➢ Measures the strength of the linear relationship, that is,
how well the model fits.
✓ Coefficient of determination, r² =
(correlation coefficient)²
r² = (r)²
➢ Interpretation :
r² × 100% of the total variation in y is explained by
x; the other (100% – (r² × 100%)) of the variation
is explained by other factors.
EXAMPLE 7
A study was carried out to determine the relationship between the
age and the time (in minutes) needed to run a 12 kilometre marathon
event. The following table shows the data recorded.
SOLUTION 7
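n = 8
ΣX = 406, ΣX² = 21122
ΣY = 623, ΣY² = 49359
ΣXY = 32195
r = 0.87
Coefficient of determination, r² = (r)²
= (0.87)²
= 0.7569
About 75.69% of the total variation in running time is explained by age;
the other 24.31% is explained by other factors.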
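The effect of rounding r before squaring can be seen by computing r² directly from the sums. This is a sketch with illustrative variable names, not part of the original solution.

```python
# Sums from the solution (X = age, Y = time in minutes)
n = 8
sum_x, sum_x2 = 406, 21122
sum_y, sum_y2 = 623, 49359
sum_xy = 32195

sxy = n * sum_xy - sum_x * sum_y      # 4622
sxx = n * sum_x2 - sum_x ** 2         # 4140
syy = n * sum_y2 - sum_y ** 2         # 6743

r_squared = sxy ** 2 / (sxx * syy)    # r² without intermediate rounding
r_squared_rounded = 0.87 ** 2         # r² from r rounded to two decimals

print(round(r_squared, 4))            # 0.7653
print(round(r_squared_rounded, 4))    # 0.7569
```

Either way, roughly three quarters of the variation in running time is explained by age. The two figures differ in the second decimal place only because r was rounded to 0.87 before squaring.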
Estimate the value of dependent variable (Y)
for a given value of independent variable (X).
ŷ = a + bx
EXAMPLE 8
A study was carried out to determine the relationship between the
age and the time (in minutes) needed to run a 12 kilometre marathon
event. The following table shows the data recorded.
SOLUTION 8
n = 8
ΣX = 406, ΣX² = 21122
ΣY = 623, ΣY² = 49359
ΣXY = 32195
Solution (i)
a = 21.2164
b = 1.1164
ŷ = a + bx
ŷ = 21.2164 + 1.1164x
Age = 62 years old
x = 62
ŷ = 21.2164 + 1.1164(62)
= 90.4332 (in minutes)
For a runner aged 62 years, the estimated time needed to run a 12
kilometre marathon event is about 90.4 minutes.
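The estimation step can be written as a one-line function of the fitted line. This sketch uses the rounded coefficients from the solution as defaults; the name predict_time is hypothetical.

```python
def predict_time(age, a=21.2164, b=1.1164):
    """Estimated running time (minutes) from the fitted line y-hat = a + b*x."""
    return a + b * age

estimate = predict_time(62)
print(round(estimate, 4))  # 90.4332
```

Estimates of this kind are most trustworthy when the chosen x lies within the range of ages observed in the data.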