
TOPIC 5:

CORRELATION AND REGRESSION

AN OVERVIEW OF CORRELATION
AND REGRESSION

Correlation and regression are two concepts used to describe the
relationship between variables (independent and dependent variables).

➢Inferential statistics – determining
whether a relationship between two
or more numerical or quantitative
variables exists
➢Independent variable – variable that
can be controlled or manipulated
➢Dependent variable – variable that
cannot be controlled or manipulated

EXAMPLE 1
Discuss the relationship between monthly
income and monthly savings.

➢Independent variable, X : Monthly income
➢Dependent variable, Y : Monthly savings

EXAMPLE 2
Discuss the relationship between the number
of counters opened at a bank and
waiting time.

➢Independent variable, X : Number of counters opened
➢Dependent variable, Y : Waiting time

6.1 CORRELATION
➢ The scatter diagram
➢ Define correlation.
➢ Correlation coefficient for sample.
(i) Pearson’s product moment correlation coefficient
(ii) Spearman’s rank correlation coefficient
➢ Discuss the different values of r (show the different cases
graphically).

SCATTER DIAGRAM
(OR SCATTERPLOT)

x-axis: Independent Variable (X)
y-axis: Dependent Variable (Y)

➢A plot of paired observations is called a scatter diagram (or scatter plot).
➢Given a scatter plot, one should be able to draw the line of best fit.
➢Purpose: to show the trend in the data and to support predictions based on the data.

SCATTER PLOT AND ITS USES
➢ Initial tool to study the relationship between two
quantitative random variables.
➢ Indicates the degree of linear relationship
(perfect, high, moderate or low) between two
random variables.
➢ If the points are widely scattered, this
indicates low correlation between the
variables.
➢ The less scattered the points are around a linear
pattern, the higher the degree of
relationship.
➢ It also indicates whether the linear relationship is
positive or negative.

INTERPRETING SCATTER
PLOTS

➢(Perfect/high/moderate/low)
Positive linear relationship
➢(Perfect/High/moderate/low)
Negative linear relationship
➢Nonlinear relationship
➢No relationship

FIGURE 1: POSITIVE AND NEGATIVE LINEAR RELATIONSHIPS BETWEEN X AND Y.
(a) Positive linear relationship (b > 0).
(b) Negative linear relationship (b < 0).
Both panels plot y against x.

FIGURE 2: NONLINEAR RELATIONS BETWEEN X AND Y.
Panels (a) and (b) each plot y against x.

FIGURE 3: LINEAR CORRELATION BETWEEN TWO VARIABLES.
Perfect positive linear relationship, r = 1 (y plotted against x).

FIGURE 4: LINEAR CORRELATION BETWEEN TWO VARIABLES.
Perfect negative linear relationship, r = -1 (y plotted against x).

FIGURE 5: LINEAR CORRELATION BETWEEN TWO VARIABLES.
Strong positive linear relationship, r is close to 1 (y plotted against x).

FIGURE 6: LINEAR CORRELATION BETWEEN TWO VARIABLES.
Strong negative linear relationship, r is close to -1 (y plotted against x).

FIGURE 7: LINEAR CORRELATION BETWEEN TWO VARIABLES.
Weak positive linear relationship, r is positive but close to 0 (y plotted against x).

FIGURE 8: LINEAR CORRELATION BETWEEN TWO VARIABLES.
Weak negative linear relationship, r is negative and close to 0 (y plotted against x).

FIGURE 9: LINEAR CORRELATION BETWEEN TWO VARIABLES.
Nonlinear relationship, r ≈ 0 (y plotted against x).

EXAMPLE 3

Suppose we take a sample of seven households from a
low-to-moderate-income neighborhood and collect information on
their income and food expenditure for the past month. The
information obtained (in hundreds of dollars) is given in Table 1.


TABLE 1: Incomes and Food Expenditures of Seven Households
(both in hundreds of dollars)

Income    Food Expenditure
35        9
49        15
21        7
39        11
15        5
28        8
25        9

FIGURE 10: Scatterplot.
The scatterplot shows the relationship between Income (horizontal axis)
and Food Expenditure (vertical axis).

INTERPRETATION OF FIGURE 10
➢ Figure 10 gives the scatter diagram (or scatter plot) for the data of Table 1.
➢ Each dot in this diagram represents one household.
➢ A scatter diagram is helpful in detecting a
relationship between two variables.
➢ By looking at the scatter diagram of Figure 10, we observe that
there exists a strong (high) positive linear relationship
between food expenditure and income.
➢ If a straight line is drawn through the points, the points
will be scattered closely around the line.

CORRELATION

➢ Correlation produces a measurement that describes the
strength (perfect/high/moderate/low) of the relationship
between variables.

➢ The relationship can range from very high to no relationship.

CORRELATION COEFFICIENT
➢ Measures the strength (Perfect/high/moderate/low)
and direction (positive/negative) of a linear
relationship between a pair of random variables.

➢ Measured by the coefficient of correlation, ρ (rho).

➢ ρ has a value between -1 and 1.

➢ The sample estimate for ρ is denoted by r (sample coefficient of correlation).

Pearson’s Product Moment Correlation Coefficient, r
➢ Both variables must be quantitative and
normally distributed.

➢ Calculation for r :

$$r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

➢The value of the correlation coefficient always lies in the
range of -1 to 1; that is, -1 ≤ r ≤ 1.

➢r and b (the slope of the regression line) have the same sign.

Interpretation of Pearson correlation coefficient, r

r                   Interpretation/Explanation/Comment
r = 0               No linear relationship
0 < r ≤ 0.5         Low positive linear relationship
-0.5 ≤ r < 0        Low negative linear relationship
0.5 < r < 0.7       Moderate positive linear relationship
-0.7 < r < -0.5     Moderate negative linear relationship
r ≥ 0.7             High positive linear relationship
r ≤ -0.7            High negative linear relationship
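
The cut-offs in the interpretation table can be sketched as a small Python helper. This is only an illustration: the function name `interpret_r` is hypothetical, and it assumes that the "low" bands exclude r = 0, which is reported here as no linear relationship.

```python
def interpret_r(r: float) -> str:
    """Translate a correlation coefficient into the labels of the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "No linear relationship"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= 0.7:
        strength = "High"
    elif size > 0.5:
        strength = "Moderate"
    else:
        # 0 < |r| <= 0.5 (assumed reading of the "low" rows)
        strength = "Low"
    return f"{strength} {direction} linear relationship"


print(interpret_r(0.87))
print(interpret_r(-0.6))
```

The two calls print the labels for a high positive and a moderate negative coefficient.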


EXAMPLE 4
A study was carried out to determine the relationship between
age and the time (in minutes) needed to run a 12 kilometre marathon
event. The following table shows the data recorded.

Age (years)    Time (minutes)
40             61
50             81
66             92
45             70
61             87
48             76
50             88
46             68

Compute the product moment correlation coefficient and explain its
meaning.

SOLUTION 4
n = 8
ΣX = 406
ΣX² = 21122
ΣY = 623
ΣY² = 49359
ΣXY = 32195
r = [8(32195) − (406)(623)] / √{[8(21122) − (406)²][8(49359) − (623)²]}
  = 4622 / √(4140 × 6743)
  = 0.87
There is a high positive linear relationship between age and the
time (in minutes) needed to run a 12 kilometre marathon event.
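
As a cross-check of the hand calculation, the computational formula for r can be sketched in Python. The function name `pearson_r` and the variable names are illustrative choices, not part of the original notes; the data are the eight (age, time) pairs from the table.

```python
from math import sqrt

age = [40, 50, 66, 45, 61, 48, 50, 46]
time_min = [61, 81, 92, 70, 87, 76, 88, 68]


def pearson_r(x, y):
    """Pearson's r from the sums n, ΣX, ΣY, ΣX², ΣY² and ΣXY."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(age, time_min)
print(round(r, 2))
```

Rounded to two decimal places, the result is 0.87, a high positive linear relationship.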


Spearman’s rank correlation coefficient
➢Spearman’s rank correlation coefficient is suitable for
qualitative data that can be ranked (ordinal data) and for quantitative data.
➢The variables must first be ranked (in either ascending or
descending order).
➢For tied observations, that is, two or more observations
receiving the same score on the same variable, each of them is
assigned the average of the ranks which would have been assigned
had no ties occurred.

$$\rho_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d is the difference between the two ranks of an observation and
n is the number of pairs.

Interpretation of Spearman’s rank correlation coefficient, ρs

ρs                  Interpretation/Explanation/Comment
ρs = 0              No relationship
0 < ρs ≤ 0.5        Low positive linear relationship
-0.5 ≤ ρs < 0       Low negative linear relationship
0.5 < ρs < 0.7      Moderate positive linear relationship
-0.7 < ρs < -0.5    Moderate negative linear relationship
ρs ≥ 0.7            High positive linear relationship
ρs ≤ -0.7           High negative linear relationship


EXAMPLE 5
The following information on mathematics score and marketing grade is
obtained from a random sample of ten students, as shown in the following
table.

Mathematics score    Marketing grade
30                   F
88                   A
75                   A
90                   A
51                   B
20                   C
51                   C
90                   A
22                   C
51                   B

Use Spearman’s rank correlation to establish whether there is any
relationship between the mathematics score and marketing grade.

SOLUTION 5

Σd² = 18

$$\rho_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(18)}{10(10^2 - 1)} = 0.89$$

There is a high positive linear relationship between
the mathematics score and marketing grade.
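
The ranking step behind Σd² = 18 is not shown in the solution, so the following Python sketch reproduces it under stated assumptions: both variables are ranked from best (rank 1) to worst, ties receive the average rank, and the grades are coded with A as the best grade. The helper `average_ranks` and the `grade_points` coding are hypothetical names introduced only for this illustration.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share the average rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # rank of the first tied position
        count = ordered.count(v)       # number of tied observations
        ranks.append(first + (count - 1) / 2)
    return ranks


math_score = [30, 88, 75, 90, 51, 20, 51, 90, 22, 51]
grade = ["F", "A", "A", "A", "B", "C", "C", "A", "C", "B"]

# Assumed numeric coding of the grades, used only to order them (A is best).
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
grade_value = [grade_points[g] for g in grade]

rank_x = average_ranks(math_score)
rank_y = average_ranks(grade_value)

sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
n = len(math_score)
rho_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(rho_s, 2))
```

With this ranking, the sum of squared rank differences is 18 and the coefficient rounds to 0.89, in agreement with the solution.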

TOPIC 6.2: SIMPLE LINEAR REGRESSION
-Simple linear regression equation using
the least squares method (LSM)
-The coefficient of determination (r²)
-The regression coefficient b (slope of the regression line)
-Estimate the dependent variable (Y) using the regression line

REGRESSION
➢A regression model is a mathematical
equation that describes the relationship
between two or more variables.

➢ In this chapter, we investigate a response y
which is affected by an independent variable, x.

➢ Regression produces a prediction equation that
expresses y (dependent) as a function of x (independent).

➢ It describes the relationship between variables using a
regression equation.

➢ This topic discusses Simple Linear Regression (SLR):

ŷ = a + bx

SIMPLE LINEAR REGRESSION

➢Simple Regression
➢Linear Regression


Simple Regression

A simple regression model includes only two variables: one
independent and one dependent. The dependent variable is the
one being explained, and the independent variable is the one
used to explain the variation in the dependent variable.

Linear Regression

A (simple) regression model that gives a straight-line
relationship between two variables is called a
linear regression model.

Figure 11: Relationship between food expenditure and income.
(a) Linear relationship.
(b) Nonlinear relationship.
Both panels plot Food Expenditure (vertical axis) against Income (horizontal axis).

Simple linear regression analysis
In the regression model

y = A + Bx + ε

❖ A is called the y-intercept (or constant term).
❖ B is the slope.
❖ ε is the random error term.
❖ The dependent and independent variables are y
and x, respectively.


Simple linear regression analysis

In the model ŷ = a + bx (also called the regression of y on x),
a and b, which are calculated using sample data, are called the
estimates of A and B.

Regression line
➢ Analyze the relationship between the two quantitative variables, X
and Y

ŷ = a + bx
➢ a : y-intercept (or constant term)

(i) if x = 0 is in the range (that is, between the minimum and maximum values of x),
then the value of a is the mean of the distribution of the response y when x = 0

(ii) if x = 0 is not in the range (not between the minimum and maximum values of x),
then the value of a has no practical interpretation

➢ b : slope

change in the mean of the distribution of the response produced by a unit change in x

➢ ε : random error (the term in the model y = A + Bx + ε; it does not appear in the fitted line)

The method of least squares

➢ The equation of the best-fitting line is calculated
using a set of n pairs (Xi, Yi).

➢ We choose our estimates a and b of A and B so that the
sum of the squared vertical distances of the points from
the line is minimized.

➢ We use the equation of a line to describe the
relationship between y and x for a sample of n
pairs, (x, y).

The Least Squares Regression Line (or best-fitting line) ŷ = a + bx

Slope, b:

$$b = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{n\sum X^2 - \left(\sum X\right)^2}$$

Constant term or y-intercept, a:

$$a = \bar{Y} - b\bar{X} = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$

EXAMPLE 6
A study was carried out to determine the relationship between
age and the time (in minutes) needed to run a 12 kilometre marathon
event. The following table shows the data recorded.

Age (years)    Time (minutes)
40             61
50             81
66             92
45             70
61             87
48             76
50             88
46             68

i) Find the least-squares regression equation.
ii) Interpret the values of a and b obtained in part i).

SOLUTION 6
n = 8
ΣX = 406
ΣX² = 21122
ΣY = 623
ΣY² = 49359
ΣXY = 32195
Solution (i)
b = [8(32195) − (406)(623)] / [8(21122) − (406)²] = 4622 / 4140 = 1.1164
a = 623/8 − (4622/4140)(406/8) = 21.2164

ŷ = a + bx

ŷ = 21.2164 + 1.1164x
Solution (ii)
a = 21.2164
There is no practical interpretation for a, since x = 0 is not in the range of X.
b = 1.1164
If age increases by 1 year, the time needed to run a 12 kilometre
marathon event will increase by about 1.1 minutes.
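
The least-squares formulas for b and a can be checked with a short Python sketch. The function name `least_squares_line` is an illustrative choice; the code applies the summation formulas directly to the eight (age, time) pairs and keeps b unrounded when computing a.

```python
age = [40, 50, 66, 45, 61, 48, 50, 46]
time_min = [61, 81, 92, 70, 87, 76, 88, 68]


def least_squares_line(x, y):
    """Return (a, b) for the fitted line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_xy = sum(p * q for p, q in zip(x, y))
    # Slope: [n ΣXY − (ΣX)(ΣY)] / [n ΣX² − (ΣX)²]
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean of Y minus b times mean of X
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares_line(age, time_min)
print(round(a, 4), round(b, 4))
```

Using the unrounded slope gives a = 21.2164; substituting the rounded b = 1.1164 into the intercept formula instead gives 21.2177, so the two values differ only through rounding.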

Interpretation of a and b

Slope, b
➢ Change in the mean of the distribution of the response produced by
a unit change in x.
➢ Change in y due to a change of one unit in x.
➢ In general,
o If b is positive, then when x increases by 1 unit, y will increase by b units.
o If b is negative, then when x increases by 1 unit, y will decrease by |b| units.
o Or, for each additional unit of x, y will change by b units.

Constant term or y-intercept, a
➢ If x = 0 is in the range, then a is the mean of the distribution of the
response y when x = 0.
➢ If x = 0 is not in the range, then a has no practical interpretation.

COEFFICIENT OF DETERMINATION, r²

➢ Measures the strength of the linear relationship, that is,
how well the model fits.
✓ Coefficient of determination, r² = (correlation coefficient)²
➢ Interpretation :
(r² × 100)% of the total variation in y is explained by
x; the other (100 − r² × 100)% of the variation
is explained by other factors.

EXAMPLE 7
A study was carried out to determine the relationship between
age and the time (in minutes) needed to run a 12 kilometre marathon
event. The following table shows the data recorded.

Age (years)    Time (minutes)
40             61
50             81
66             92
45             70
61             87
48             76
50             88
46             68

Compute the coefficient of determination and explain its meaning.

SOLUTION 7
n = 8
ΣX = 406
ΣX² = 21122
ΣY = 623
ΣY² = 49359
ΣXY = 32195
r = 0.87
Coefficient of determination, r² = (r)²
= (0.87)²
= 0.7569

76% of the total variation in the time (in minutes) needed to run a 12
kilometre marathon event is explained by age; the other 24% of the
variation is explained by other factors.
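
The "explained variation" reading of r² can be illustrated with a Python sketch that fits the line and compares the variation left around it with the total variation in y. The variable names are illustrative, and the identity r² = 1 − (unexplained variation / total variation) is the standard result for a least-squares line with an intercept rather than something stated in these notes.

```python
age = [40, 50, 66, 45, 61, 48, 50, 46]
time_min = [61, 81, 92, 70, 87, 76, 88, 68]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(time_min) / n

# Least-squares slope and intercept in deviation form.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, time_min))
s_xx = sum((x - mean_x) ** 2 for x in age)
b = s_xy / s_xx
a = mean_y - b * mean_x

# Total variation in y, and the part left unexplained by the fitted line.
total_variation = sum((y - mean_y) ** 2 for y in time_min)
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(age, time_min))

r_squared = 1 - unexplained / total_variation
print(round(r_squared, 4))
```

Without rounding r first, r² comes out at about 0.765, slightly above the 0.7569 obtained from r = 0.87; both indicate that roughly three quarters of the variation in time is explained by age.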

Estimate the value of the dependent variable (Y)
for a given value of the independent variable (X).

➢ We can predict the value of the dependent variable (Y)
for a given value of the independent variable (X) by
using the equation below.

ŷ = a + bx

EXAMPLE 8
A study was carried out to determine the relationship between
age and the time (in minutes) needed to run a 12 kilometre marathon
event. The following table shows the data recorded.

Age (years)    Time (minutes)
40             61
50             81
66             92
45             70
61             87
48             76
50             88
46             68

i) Estimate the time (in minutes) needed to run a 12 kilometre
marathon event for a person who is 62 years old.

SOLUTION 8
n = 8
ΣX = 406
ΣX² = 21122
ΣY = 623
ΣY² = 49359
ΣXY = 32195
Solution (i)
a = 21.2164
b = 1.1164

ŷ = a + bx

ŷ = 21.2164 + 1.1164x
Age = 62 years
x = 62

ŷ = 21.2164 + 1.1164(62)
= 90.4332 (in minutes)

For a person aged 62 years, the estimated time needed to run a 12
kilometre marathon event is 90.4 minutes.
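
The prediction step can be sketched in Python as follows. The function name `predict_time` and the range check are illustrative additions: the check simply reports whether the requested age lies between the smallest and largest ages in the sample (40 and 66), since the fitted line is only supported by data in that interval.

```python
# Fitted coefficients from the regression equation (rounded to four decimals).
A_HAT = 21.2164
B_HAT = 1.1164

# Smallest and largest ages observed in the sample.
AGE_MIN, AGE_MAX = 40, 66


def predict_time(age_years):
    """Return (estimated minutes, whether the age is inside the sampled range)."""
    estimate = A_HAT + B_HAT * age_years
    inside_sample_range = AGE_MIN <= age_years <= AGE_MAX
    return estimate, inside_sample_range


estimate, inside = predict_time(62)
print(round(estimate, 1), inside)
```

For x = 62 the sketch gives about 90.4 minutes, and the age falls inside the sampled range.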

PAST YEAR QUESTIONS

➢ SEP2013-Question3
➢ OCT2012-Question3
➢ MAC2012-Question3
➢ APR2011-Question3
➢ OCT2010-Question3
