Lecture Notes #4 Correlation

Ateneo de Zamboanga University

MatMod
College of Science & Information Technology
RRGuerrero
Mathematics Department
First Semester, Session 1, AY 2020-2021

III. STATISTICS

a. Correlation
b. Regression
c. Testing of Hypotheses

Linear Correlation
https://www.youtube.com/watch?v=TsKyWOZv7hE
https://www.youtube.com/watch?v=GtV-VYdNt_g

Linear Regression
https://www.youtube.com/watch?v=WWqE7YHR4Jc
https://www.youtube.com/watch?v=zPG4NjIkCjc

A. CORRELATION

Correlation
Correlation measures the strength of the association between two variables, say 𝑥 and 𝑦.

 Before the relationship between two sets of data is computed, it is essential to first discuss
the three types of correlation.
 When one variable increases while the other decreases, the variables are indirectly correlated or
negatively correlated.
 When one variable increases as the other increases, or one decreases as the other decreases,
the two variables are directly correlated or positively correlated.
 When neither pattern appears, the variables have zero correlation.

Example 3a.1: Suppose a ten-item test in English and a ten-item test in Mathematics were administered to
ten students. The scores of the students are tabulated below. It must be determined whether the scores in
the Mathematics test (labelled as variable 𝑥) and the English test (labelled as variable 𝑦) are correlated.

Table 3a.1. Math & Eng scores of 10 students

Student   Mathematics score (x)   English score (y)
1         4                       5
2         5                       4
3         9                       8
4         2                       3
5         8                       9
6         1                       2
7         2                       1
8         7                       6
9         6                       7
10        4                       5

The scatter graph for Table 3a.1 is given by Figure 3a.1. In the graph, the x-axis represents the scores
in Mathematics and the y-axis shows the scores in English. Each point in the graph is an ordered pair
(𝑥, 𝑦) corresponding to the scores obtained by a student in the two subjects.

Figure 3a.1
In Example 3a.1, the corresponding scatter graph (Figure 3a.1) shows an increasing trend, which indicates
a direct correlation between variables 𝑥 and 𝑦.

Example 3a.2: Suppose the scores of the students in those two subjects happen to be as follows with the
corresponding scatter graph:
Table 3a.2

Figure 3a.2

This time the trend of the data is decreasing, hence, the variables are negatively correlated.

Example 3a.3: Suppose the same students have the following scores:

Table 3a.3

Student   Math score (x)   English score (y)
1         2                2
2         3                6
3         4                7
4         2                9
5         5                5
6         6                3
7         6                7
8         8                4
9         9                8
10        3                7

Figure 3a.3

As the corresponding scatter graph in Figure 3a.3 shows, the points are neither increasing nor
decreasing. This graph represents a ZERO correlation.

DEFINITIONS:

3a.1.1 Two variables are positively correlated if their values tend to increase together or decrease
together.

3a.1.2 Two variables are negatively correlated if the values of one variable increase while the values of
the other decrease.

3a.1.3 Two variables are not correlated, or have zero correlation, if one variable shows no tendency to
increase or decrease while the other increases.

 While a scatter plot is a convenient way of inspecting the correlation between two variables,
it does not offer a measure of the strength of the correlation.
 Karl Pearson, an English mathematician & biostatistician, devised a formula that gives a
numerical value to the measure of a correlation.

 This formula not only shows how strongly two data sets are correlated but also reveals whether
the correlation is direct or inverse, or whether the data sets are not correlated.
 The formula named after him is called the Pearson product-moment correlation.

 The degree of correlation between two data sets 𝑥 and 𝑦 is represented by the Pearson
product-moment correlation coefficient 𝑟𝑥𝑦, which can take values from −1 to 1, where 1 represents a
perfect positive linear relationship and −1 a perfect negative linear relationship.

 If the coefficient 𝑟𝑥𝑦 = 0, then there is NO LINEAR RELATIONSHIP between the two variables.
 The Pearson product-moment correlation formula is given by:

\[
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

where 𝑥̅ is the sample mean for the data set {𝑥𝑖} and
𝑦̅ is the sample mean for the data set {𝑦𝑖}.
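
The formula can be sketched directly in Python. This is a minimal illustration, not part of the original notes: the function name `pearson_r` and the variable names are assumptions chosen for readability, and the sample data are the scores of Table 3a.1.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation, computed from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must be paired and hold at least two values")
    n = len(xs)
    x_bar = sum(xs) / n  # sample mean of x
    y_bar = sum(ys) / n  # sample mean of y
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)

# Scores from Table 3a.1
math_scores = [4, 5, 9, 2, 8, 1, 2, 7, 6, 4]
english_scores = [5, 4, 8, 3, 9, 2, 1, 6, 7, 5]
r_example = pearson_r(math_scores, english_scores)
print(round(r_example, 2))  # 0.92
```

The guard against a constant variable is a design choice of this sketch: in that case the denominator is zero and the coefficient is not defined.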

 Below are some scatter diagrams along with the type of linear correlation that exists
between the 𝑥 and 𝑦 variables.
 The closer the absolute value of 𝑟𝑥𝑦 is to 1, the stronger the linear relationship between the
variables.

Figure 3a.4

Consider the data in Example 3a.1. Let us organize the data as shown in the table that follows.
You may also tabulate using Excel.

Math score (x)  English score (y)  (𝑥𝑖 − 𝑥̅)  (𝑦𝑖 − 𝑦̅)  (𝑥𝑖 − 𝑥̅)²  (𝑦𝑖 − 𝑦̅)²  (𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅)
4               5                  −0.8       0          0.64        0           0
5               4                  0.2        −1         0.04        1           −0.2
9               8                  4.2        3          17.64       9           12.6
2               3                  −2.8       −2         7.84        4           5.6
8               9                  3.2        4          10.24       16          12.8
1               2                  −3.8       −3         14.44       9           11.4
2               1                  −2.8       −4         7.84        16          11.2
7               6                  2.2        1          4.84        1           2.2
6               7                  1.2        2          1.44        4           2.4
4               5                  −0.8       0          0.64        0           0
Sum: 48         Sum: 50                                  Sum: 65.6   Sum: 60.0   Sum: 58

\[
r_{xy} = \frac{58}{\sqrt{(65.6)(60)}} = \frac{58}{62.73755} \approx 0.92
\]

 This result conforms with the scatter graph in Example 3a.1.

 The computed correlation coefficient is close to 1; hence, the two variables have a strong positive
correlation.

Given the tabulated data of Examples 3a.2 and 3a.3, compute 𝑟𝑥𝑦.


For Example 3a.2

Math score (x)  English score (y)  (𝑥𝑖 − 𝑥̅)  (𝑦𝑖 − 𝑦̅)  (𝑥𝑖 − 𝑥̅)²  (𝑦𝑖 − 𝑦̅)²  (𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅)
9               3                  4.1        −2.5       16.81       6.25        −10.25
3               6                  −1.9       0.5        3.61        0.25        −0.95
4               7                  −0.9       1.5        0.81        2.25        −1.35
7               4                  2.1        −1.5       4.41        2.25        −3.15
6               2                  1.1        −3.5       1.21        12.25       −3.85
1               9                  −3.9       3.5        15.21       12.25       −13.65
2               8                  −2.9       2.5        8.41        6.25        −7.25
5               4                  0.1        −1.5       0.01        2.25        −0.15
10              2                  5.1        −3.5       26.01       12.25       −17.85
2               10                 −2.9       4.5        8.41        20.25       −13.05
Sum: 49         Sum: 55                                  Sum: 84.9   Sum: 76.5   Sum: −71.5

\[
r_{xy} = \frac{-71.5}{\sqrt{(84.9)(76.5)}} \approx \frac{-71.5}{80.59} \approx -0.89
\]

 The correlation coefficient is close to −1; hence, the two variables have a strong negative
correlation.
 Thus, the corresponding scatter graph in Example 3a.2 is decreasing from left to right.
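
The tabulation for Example 3a.2 can also be reproduced in a few lines of Python instead of a spreadsheet. The sketch below rebuilds the deviation columns and their sums; the variable names are illustrative assumptions, and the data are the scores listed in the table above.

```python
import math

# Scores from the Example 3a.2 table
xs = [9, 3, 4, 7, 6, 1, 2, 5, 10, 2]
ys = [3, 6, 7, 4, 2, 9, 8, 4, 2, 10]

x_bar = sum(xs) / len(xs)  # 49 / 10
y_bar = sum(ys) / len(ys)  # 55 / 10

# One row per student: deviations, squared deviations, and their product
rows = [
    (x - x_bar, y - y_bar, (x - x_bar) ** 2, (y - y_bar) ** 2, (x - x_bar) * (y - y_bar))
    for x, y in zip(xs, ys)
]

sum_sq_x = sum(row[2] for row in rows)  # sum of (x - x_bar)^2
sum_sq_y = sum(row[3] for row in rows)  # sum of (y - y_bar)^2
sum_prod = sum(row[4] for row in rows)  # sum of the cross products
r_xy = sum_prod / math.sqrt(sum_sq_x * sum_sq_y)

print(round(sum_sq_x, 1), round(sum_sq_y, 1), round(sum_prod, 1), round(r_xy, 2))
```

The three sums should agree with the bottom row of the table, and the last value is the coefficient for Example 3a.2.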

For Example 3a.3

Math score (x)  English score (y)  (𝑥𝑖 − 𝑥̅)  (𝑦𝑖 − 𝑦̅)  (𝑥𝑖 − 𝑥̅)²  (𝑦𝑖 − 𝑦̅)²  (𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅)
2               2                  −3         −4         9           16          12
5               5                  0          −1         0           1           0
3               6                  −2         0          4           0           0
6               3                  1          −3         1           9           −3
8               4                  3          −2         9           4           −6
2               9                  −3         3          9           9           −9
9               8                  4          2          16          4           8
3               7                  −2         1          4           1           −2
6               7                  1          1          1           1           1
4               7                  −1         1          1           1           −1
Sum: 48         Sum: 58                                  Sum: 54     Sum: 46     Sum: 0.0

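\[
r_{xy} = \frac{0}{\sqrt{(54)(46)}} = 0
\]

 This conforms with the scatter graph, which is neither increasing nor decreasing, and therefore
the two sets of data are not correlated.
 The deviations in the table are taken from the rounded means 𝑥̅ ≈ 5 and 𝑦̅ ≈ 6. With the exact
means 4.8 and 5.8, the coefficient is about −0.008, which is still practically zero.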
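
As a cross-check on Example 3a.3, the coefficient can be computed without rounding the means by using the raw-sum form r = (nΣxy − ΣxΣy) / √((nΣx² − (Σx)²)(nΣy² − (Σy)²)), which is algebraically equivalent to the definition. The notes do not use this form; it is swapped in here only as an illustration, and the function name `pearson_r_from_sums` is an assumption.

```python
import math

def pearson_r_from_sums(xs, ys):
    """Pearson r from raw sums, so no rounded means are involved."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Scores from the Example 3a.3 table
xs3 = [2, 5, 3, 6, 8, 2, 9, 3, 6, 4]
ys3 = [2, 5, 6, 3, 4, 9, 8, 7, 7, 7]
r3 = pearson_r_from_sums(xs3, ys3)
print(round(r3, 3))  # -0.008
```

The value is not exactly zero, but it is close enough to zero that the conclusion of no correlation is unchanged.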
Definition

3a.1.4 Spearman’s rank-order correlation is the nonparametric version of the Pearson
product-moment correlation.
 Spearman’s correlation coefficient, denoted as 𝜌 and also written as 𝑟𝑠, measures the
STRENGTH and DIRECTION of ASSOCIATION between two ranked variables.

To compute Spearman’s rank correlation coefficient, we use the following formula:

\[
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
\]

where 𝑑𝑖 = difference of paired ranks and
𝑛 = number of paired data.

Example 3a.4. Given the scores in Mathematics and English below, rank the scores and use
Spearman’s rho to compute the correlation coefficient.

Scores in Math  Scores in English  Rank Math  Rank English  𝑑𝑖²
35              38                 3          4             1
64              78                 9          10            1
45              49                 6          6             0
30              26                 2          1             1
28              59                 1          8             49
60              54                 8          7             1
44              33                 5          2             9
50              70                 7          9             4
39              35                 4          3             1
67              45                 10         5             25
                                                            Sum: 92

Solution: Using Spearman’s rank formula with ∑ 𝑑𝑖² = 92 and 𝑛 = 10, we have:

\[
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(92)}{10(10^2 - 1)} = 1 - 0.557576 \approx 0.44
\]

 The correlation coefficient is 0.44, which is a low positive correlation.
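
The ranking and the formula of Example 3a.4 can be sketched in Python as follows. The helper names `ranks` and `spearman_rho` are illustrative assumptions, and the sketch assumes there are no tied scores, since the formula with 𝑑𝑖² is stated here for untied ranks.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); tied values are rejected."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch does not handle tied values")
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho from the squared differences of paired ranks."""
    n = len(xs)
    d_squared = [(rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys))]
    return 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))

# Scores from Example 3a.4
math_scores = [35, 64, 45, 30, 28, 60, 44, 50, 39, 67]
english_scores = [38, 78, 49, 26, 59, 54, 33, 70, 35, 45]

print(ranks(math_scores))     # [3, 9, 6, 2, 1, 8, 5, 7, 4, 10]
print(ranks(english_scores))  # [4, 10, 6, 1, 8, 7, 2, 9, 3, 5]
print(round(spearman_rho(math_scores, english_scores), 2))  # 0.44
```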
The (Phi) 𝜙 COEFFICIENT

For a pair of nominal dichotomous data sets, the phi coefficient is more appropriate to describe
the data than the Pearson product-moment correlation or Spearman’s rank correlation
coefficient.

Its formula is given by:

\[
\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}
\]

Example 3a.5. Find phi for the following contingency table.

Capital Punishment

              Female: Yes   Female: No
Male: Yes     𝑎 = 6         𝑏 = 14
Male: No      𝑐 = 10        𝑑 = 13

Solution: Substitute the values of 𝑎, 𝑏, 𝑐, and 𝑑 in the formula:

\[
\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}
= \frac{(6)(13) - (14)(10)}{\sqrt{(6+14)(10+13)(6+10)(14+13)}}
= \frac{-62}{\sqrt{198{,}720}} \approx \frac{-62}{445.78} \approx -0.139
\]

 The phi coefficient is negative and close to zero, so the association between the two sets of opinions
on capital punishment (for or against) is weak and negative.
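
The substitution in Example 3a.5 can be checked with a short Python sketch. The function name `phi_coefficient` is an illustrative assumption; the arguments follow the cell labels 𝑎, 𝑏, 𝑐, 𝑑 of the contingency table.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient of a 2 x 2 contingency table with cells a, b (top row) and c, d."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator

# Cell counts from Example 3a.5
phi = phi_coefficient(6, 14, 10, 13)
print(round(phi, 3))  # -0.139
```

Note that the square root applies to the whole product of the four marginal totals, here √198,720 ≈ 445.78.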

The POINT-BISERIAL CORRELATION COEFFICIENT

The point-biserial correlation coefficient is a correlation that measures the strength of association
between a continuous-level variable (ratio or interval data) and a binary variable.
Binary variables are variables of nominal scale having only two possible values. They are also called
dichotomous variables. Given two variable sets, in which 𝑥 is the continuous variable and 𝑦 the
dichotomous variable, the formula of the POINT-BISERIAL CORRELATION COEFFICIENT is:

\[
\rho_{xy} = \frac{\bar{x}_1 - \bar{x}_2}{s_x} \sqrt{\frac{n_1 n_2}{n(n-1)}}
\]

where 𝑥̅₁ is the mean of 𝑥 when 𝑦 = 1 (those labelled with 1)
𝑥̅₂ is the mean of 𝑥 when 𝑦 = 2 (those labelled with 2)
𝑛₁ is the number of samples labelled 1 in 𝑦
𝑛₂ is the number of samples labelled 2 in 𝑦
𝑛 is the total number of samples
𝑠𝑥 is the standard deviation of all the 𝑥 values

Remark: The point-biserial correlation coefficient measures the relationship between a real
dichotomous data set and an interval data set.

Example 3a.6. Four girls (1) and five boys (2) of Grade 12 took a 20-item Mathematics achievement
test. The results are given below. Compute the correlation coefficient between gender and the test
scores in the set of data.

Student   Gender   Achievement test result
1         1        10
2         2        9
3         2        10
4         1        17
5         2        18
6         1        8
7         1        10
8         2        12
9         2        19

Solution:
 In this example, the point-biserial correlation is used because the data involve
continuous interval data (the test results) and nominal dichotomous data (gender).

 Let 𝑥 represent the interval data and 𝑦 stand for the dichotomous data.
 The formula to be used is the point-biserial correlation formula.

We obtain the following values:

𝑥̅₁ = 11.25    𝑛₁ = 4
𝑥̅₂ = 13.6     𝑛₂ = 5
𝑛 = 𝑛₁ + 𝑛₂ = 4 + 5 = 9

\[
s_x = \sqrt{\frac{\sum_{i=1}^{9} (x_i - \bar{x})^2}{n - 1}} = 4.245913
\]

\[
\rho_{xy} = \frac{11.25 - 13.6}{4.245913} \sqrt{\frac{(4)(5)}{9(9-1)}} \approx (-0.5535)(0.5270) \approx -0.2917
\]

 A correlation coefficient of −0.2917 indicates a low negative correlation.

 For this specific example, the negative sign reflects that the group labelled 1 (girls) has a lower
mean score than the group labelled 2 (boys). It does not mean that higher scores for the boys
produce lower scores for the girls.
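
The computation in Example 3a.6 can be verified with the following Python sketch. The function name `point_biserial` is an illustrative assumption; the labels 1 and 2 and the use of the sample standard deviation (divisor 𝑛 − 1) follow the formula as stated in these notes.

```python
import math

def point_biserial(scores, labels):
    """Point-biserial correlation for scores paired with group labels 1 and 2."""
    if len(scores) != len(labels) or any(g not in (1, 2) for g in labels):
        raise ValueError("labels must be 1 or 2 and paired with the scores")
    group1 = [x for x, g in zip(scores, labels) if g == 1]
    group2 = [x for x, g in zip(scores, labels) if g == 2]
    n1, n2 = len(group1), len(group2)
    n = n1 + n2
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    mean_all = sum(scores) / n
    # Sample standard deviation of all x values (divisor n - 1)
    s_x = math.sqrt(sum((x - mean_all) ** 2 for x in scores) / (n - 1))
    return (mean1 - mean2) / s_x * math.sqrt(n1 * n2 / (n * (n - 1)))

# Data from Example 3a.6: 1 = girl, 2 = boy
test_results = [10, 9, 10, 17, 18, 8, 10, 12, 19]
gender = [1, 2, 2, 1, 2, 1, 1, 2, 2]
r_pb = point_biserial(test_results, gender)
print(round(r_pb, 4))  # -0.2917
```

The factor under the square root uses the product 𝑛₁𝑛₂ = 20, not the sum 𝑛₁ + 𝑛₂ = 9.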

Different formulas can be used to compute the correlation between two data sets.
o If data 𝑥 and 𝑦 are both interval data, the Pearson product-moment correlation will
suffice.
o For two data sets that are both ordinal, the Spearman’s rank correlation coefficient is
suited.
o For a pair of nominal dichotomous data sets, the phi coefficient is more appropriate to
describe the data than the Pearson product-moment correlation or Spearman’s rank
correlation coefficient.
o But if one real nominal dichotomous data set and one interval data set are involved, the
point-biserial formula should be utilized.
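
The selection rules above can be summarized as a small lookup. This is only a sketch of the four cases listed in these notes: the function name `choose_coefficient` and the level strings "interval", "ordinal", and "dichotomous" are hypothetical labels, and any other combination is deliberately left unhandled.

```python
def choose_coefficient(level_x, level_y):
    """Suggest a correlation coefficient from the measurement levels of two data sets."""
    levels = {level_x, level_y}
    if levels == {"interval"}:
        return "Pearson product-moment"
    if levels == {"ordinal"}:
        return "Spearman rank"
    if levels == {"dichotomous"}:
        return "phi"
    if levels == {"interval", "dichotomous"}:
        return "point-biserial"
    # Combinations not covered by the four rules in these notes
    raise ValueError(f"no rule listed for {level_x!r} with {level_y!r}")

print(choose_coefficient("interval", "dichotomous"))  # point-biserial
```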

Reference: Baltazar, EC et al (2018). Mathematics in the Modern World. C & E Publishing Inc, pp.67-81.
