
Module Five

Correlation

Objectives
At the end of this chapter, the student will be able to:
1. explain the uses of correlation as a measure of relationship or
association;
2. distinguish Pearson Product Moment Correlation Coefficient from
Spearman Correlation Coefficient;
3. analyze and interpret correlation coefficients; and
4. critique a study using correlation coefficients.

Introduction

Correlation, a term used often by most of us, is defined as the degree of relationship
between two or more variables. In this chapter, we will be discussing linear correlation, or the
degree to which a straight line best describes the relationship between two variables. This idea
will become more apparent when we discuss and graph scatter plots.

So far, we have learned statistical techniques useful for describing distributions of single
variables, or univariate distributions. Now, we shall consider statistical techniques that will
permit us to study the joint distribution of two variables, or a bivariate distribution. We will call one
of the variables X and the other Y. Each observation will consist of a pair of X, Y values. Our
interest is in the possible relationship between the values of the two variables. We shall
determine the extent to which the variables “co-relate” or correlate.

Note that the theory and mathematical basis of correlation were developed by Sir Francis
Galton and Karl Pearson towards the end of the nineteenth century. Pearson devised a method
for obtaining the Pearson product-moment coefficient of correlation in order to quantify Galton’s
concept of linear relationship among variables.

The purpose of a linear correlation analysis is to determine whether there is a
relationship between two variables. We may find that: 1) there is a positive correlation,
2) there is a negative correlation, or 3) there is no correlation. These relationships can be easily
visualized by using the scatter diagrams shown in the next few pages.

Relations Between Paired Observations

Suppose a group of five students takes two kinds of tests, designated X and Y. The data
may be presented with the X scores arranged in order of magnitude, from the highest to the
lowest. Consider first the arrangement in which the values of Y fall in the same order; that is,
the student who is highest on X is also highest on Y, and so on. This situation
represents the maximum positive relation between two variables (r = +1). Figure 13 illustrates
the scatter diagram for r approximately +1. This gives a strong positive correlation.
Another possible arrangement is when the values of Y are reversed, such that the student
who is highest on X is lowest on Y, and so on until the student who is lowest on X is highest on
Y. This situation represents the maximum negative relationship between two variables (r = -1).
Figure 14 illustrates the scatter diagram for r approximately -1. This gives a strong negative
correlation.

When the arrangement of Y is strictly random in relation to X, the variables are said to be
independent and their correlation is approximately zero. Figure 15 illustrates the scatter diagram
for r approximately 0. This shows no correlation, or a zero relationship.

Figure 13. A scattergram for r approximately +1

Figure 14. A scattergram for r approximately -1

Figure 15. A scattergram for r approximately 0

Correlation Coefficient

The correlation coefficient is a measure of the degree of relationship between two
variables. The correlation coefficient may be defined as the covariance of X and Y divided by
the square root of the product of the variance of X and the variance of Y. The symbol used to
represent the correlation coefficient is r, sometimes known as the Pearson product-moment r.

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} $$

Notice that the numerator is the covariation between X and Y. As indicated in the
formula, the covariation is the sum of the products of the deviations of each Xi from its own
mean (x = X - X̄) and the deviations of each Yi from its own mean (y = Y - Ȳ). This sum is
divided by the square root of the product of the total variations of X and Y. The
denominator gives the maximum possible absolute value that the covariation could have. Thus, r
represents the ratio of the actual covariation between X and Y to their maximum possible
covariation.
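
The ratio described above can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the five-student scores are hypothetical and do not come from the module. It shows that r reaches +1 or -1 only when the actual covariation equals its maximum possible size.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviation scores: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [v - mean_x for v in xs]          # deviations x = X - mean of X
    dy = [v - mean_y for v in ys]          # deviations y = Y - mean of Y
    covariation = sum(a * b for a, b in zip(dx, dy))
    max_covariation = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariation / max_covariation

# Hypothetical scores for five students (illustration only).
x = [1, 2, 3, 4, 5]
same_order = [2, 4, 6, 8, 10]       # highest on X is highest on Y
reversed_order = [10, 8, 6, 4, 2]   # highest on X is lowest on Y
mixed = [6, 2, 10, 4, 8]            # no clear ordering

print(round(pearson_r(x, same_order), 2))      # 1.0
print(round(pearson_r(x, reversed_order), 2))  # -1.0
print(round(pearson_r(x, mixed), 2))           # 0.3
```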

Calculation of the Correlation Coefficient: “Deviation Method”

The following shows the bivariate distribution of English (X) and Mathematics (Y)
performance scores of 10 high school students. Compute the correlation coefficient.

X 16 14 13 10 8 7 5 4 2 1
Y 5 3 2 8 9 17 6 15 11 14

Table 8 shows the following for n = 10:

Columns (2) and (3) give the values of X and Y.
Columns (4) and (5) give the corresponding deviation scores x = X - X̄ and y = Y - Ȳ, where X̄ = 8 and Ȳ = 9.
Columns (6) and (7) give the values of x² = (X - X̄)² and y² = (Y - Ȳ)².
Column (8) gives the product xy of the paired deviation scores.

Table 8
Calculation of the Correlation Coefficient: Deviation Scores

(1)           (2)   (3)   (4)   (5)   (6)   (7)   (8)

Observation    X     Y     x     y    x²    y²    xy
1             16     5     8    -4    64    16   -32
2             14     3     6    -6    36    36   -36
3             13     2     5    -7    25    49   -35
4             10     8     2    -1     4     1    -2
5              8     9     0     0     0     0     0
6              7    17    -1     8     1    64    -8
7              5     6    -3    -3     9     9     9
8              4    15    -4     6    16    36   -24
9              2    11    -6     2    36     4   -12
10             1    14    -7     5    49    25   -35

Total         80    90     0     0   240   240  -175


$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{-175}{\sqrt{240(240)}} = \frac{-175}{240} = -0.73 $$
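
As a check on the hand computation, the deviation method can be sketched in Python using the English (X) and Mathematics (Y) scores given above. The variable names are chosen for illustration; this is one possible way to lay out the steps, not a prescribed procedure.

```python
from math import sqrt

# Scores of the 10 high school students from the worked example.
X = [16, 14, 13, 10, 8, 7, 5, 4, 2, 1]
Y = [5, 3, 2, 8, 9, 17, 6, 15, 11, 14]

n = len(X)
mean_x = sum(X) / n                   # 8.0
mean_y = sum(Y) / n                   # 9.0

# Columns (4) and (5): deviation scores.
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]

# Totals of columns (6), (7), and (8).
sum_x2 = sum(d * d for d in x)
sum_y2 = sum(d * d for d in y)
sum_xy = sum(a * b for a, b in zip(x, y))

r = sum_xy / sqrt(sum_x2 * sum_y2)

print(sum_x2, sum_y2, sum_xy)         # 240.0 240.0 -175.0
print(round(r, 2))                    # -0.73
```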

Using the Raw Score Formula for r


$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} $$

Table 9
Observation results using the raw score formula

Observation    X     Y    X²     Y²    XY

1             16     5   256     25    80
2             14     3   196      9    42
3             13     2   169      4    26
4             10     8   100     64    80
5              8     9    64     81    72
6              7    17    49    289   119
7              5     6    25     36    30
8              4    15    16    225    60
9              2    11     4    121    22
10             1    14     1    196    14

Total         80    90   880   1050   545

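$$
\begin{aligned}
r &= \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} \\
  &= \frac{10(545) - 80(90)}{\sqrt{\left[10(880) - (80)^2\right]\left[10(1050) - (90)^2\right]}} \\
  &= \frac{5450 - 7200}{\sqrt{(8800 - 6400)(10500 - 8100)}} \\
  &= \frac{-1750}{\sqrt{2400(2400)}} \\
  &= \frac{-1750}{2400} \\
  &= -0.73
\end{aligned}
$$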
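
The raw score formula can also be sketched as a small Python function. The name `pearson_r_raw` is hypothetical; the data are the same X and Y scores used in Table 9, so the result should agree with the deviation method.

```python
from math import sqrt

def pearson_r_raw(xs, ys):
    """Pearson r from raw scores, without computing deviations first."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

X = [16, 14, 13, 10, 8, 7, 5, 4, 2, 1]
Y = [5, 3, 2, 8, 9, 17, 6, 15, 11, 14]

print(round(pearson_r_raw(X, Y), 2))   # -0.73
```

Because only sums of the original scores are needed, this form is convenient for calculators and avoids rounding the means before squaring.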
Table 10
Calculation of the Correlation Coefficient from Ungrouped Data Using Raw Scores

(1)           (2)   (3)   (4)    (5)   (6)

Observation    X     Y    X²     Y²    XY

1              9    19    81    361   171
2              6    17    36    289   102
3              7    15    49    225   105
4              5    13    25    169    65
5              4    11    16    121    44
6              3    12     9    144    36
7              2     9     4     81    18
8              1     7     1     49     7

Total         37   103   221   1439   548

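$$
\begin{aligned}
r &= \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} \\
  &= \frac{8(548) - 37(103)}{\sqrt{\left[8(221) - (37)^2\right]\left[8(1439) - (103)^2\right]}} \\
  &= \frac{4384 - 3811}{\sqrt{(1768 - 1369)(11512 - 10609)}} \\
  &= \frac{573}{\sqrt{399(903)}} \\
  &= 0.95
\end{aligned}
$$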

Spearman Rank Correlation

There are many cases where a dependency between two variables X and Y can be
observed but where the distribution is unknown. A statistic to measure the degree of association
between variables X and Y when their distributions are unknown was developed by a statistician
named Charles Spearman (1863-1945). He called this the rank correlation coefficient. It is based
on the ranks of the observations and does not depend on a specific distribution of X and Y. A statistic
that does not depend on a specific distribution of the variables is called a non-parametric or
distribution-free statistic.

The Spearman’s rank correlation coefficient is defined as


$$ r_s = 1 - \frac{6\sum d^2}{N\left(N^2 - 1\right)} $$

where d denotes the difference between the ranks of X and Y for each subject, and N is the
number of paired observations.

Applying the formula to a random sample of 7 college students and their grades in a high
school mathematics course and in college algebra, we can compute the value of rs as in Table 11.

Table 11. Computation of Spearman Rank Coefficient

Subject    X     Y    Rank X   Rank Y     d      d²

1         85    93      2        1        1      1
2         60    55      6.5      7       -0.5    0.25
3         73    80      5        4        1      1
4         60    75      6.5      5        1.5    2.25
5         75    65      4        6       -2      4
6         90    82      1        3       -2      4
7         83    85      3        2        1      1

Total (∑d²)                                      13.5

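$$
\begin{aligned}
r_s &= 1 - \frac{6\sum d^2}{N\left(N^2 - 1\right)} \\
    &= 1 - \frac{6(13.5)}{7\left(7^2 - 1\right)} \\
    &= 1 - \frac{81}{336} \\
    &= 1 - 0.24 \\
    &= 0.76
\end{aligned}
$$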

Note that the identical scores in distribution X share the same rank. The two scores of
60 share the rank of 6.5, which was determined by finding the midpoint between ranks 6
and 7. If there are three identical scores, we can find their common rank by adding the ranks
assigned to each score and dividing by three.
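
The ranking rule for tied scores and the rank correlation formula can be sketched in Python as follows. The helper names `average_ranks` and `spearman_rs` are hypothetical, and the sketch assumes, as in Table 11, that rank 1 goes to the highest score. The data are the seven subjects from Table 11.

```python
def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1        # first rank position occupied by v
        count = ordered.count(v)            # how many scores are tied at v
        ranks[v] = first + (count - 1) / 2  # mean of the ranks the tied scores occupy
    return [ranks[v] for v in values]

def spearman_rs(xs, ys):
    """Spearman rank correlation using 1 - 6*sum(d^2) / (N*(N^2 - 1))."""
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

X = [85, 60, 73, 60, 75, 90, 83]
Y = [93, 55, 80, 75, 65, 82, 85]

print(average_ranks(X))              # [2.0, 6.5, 5.0, 6.5, 4.0, 1.0, 3.0]
print(round(spearman_rs(X, Y), 2))   # 0.76
```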

The Spearman rank correlation is a product-moment correlation coefficient for ranked
data. This being the case, rs is interpreted in the same manner as the Pearson Product-Moment
Correlation Coefficient.
Here are guides on how to interpret correlation coefficients that arise in real data.

Values between 0.0 and 0.1

• Little or no relationship between X and Y
• Knowing a subject’s value of X gives you essentially no information for predicting Y

Example: The correlation between income and shoe size among 30-39 year-old adult males is
nearly zero.

Values between 0.1 and 0.5

• X and Y are weakly related
• Subjects with higher values of X do tend to have higher values of Y
• Knowing X alone does not give you enough information to predict Y precisely

Example: The correlation between the heights of fathers and sons is about 0.4.
Example: The correlation between verbal SAT and quantitative SAT scores is about 0.5.

Values between 0.5 and 0.9

• X and Y are strongly related
• Subjects with higher values of X tend to have higher values of Y
• Knowing X allows you to predict Y with considerably greater accuracy than if you did
not know X

Example: The correlation between systolic blood pressure measurements taken twice in the same
examination is about 0.6.

Example: The correlation between students’ scores on midterm and final examinations tends to
be about 0.7.

Example: The correlation between self-reported body weight and measured weight tends to be
about 0.7.

Values close to 1.00

• A value of r ≈ 1.00 indicates that X and Y are measuring exactly the same thing,
although perhaps on different scales.

Example: Over the last 30 days, the correlation between

• X = daily high temperature recorded in degrees F, and
• Y = daily high temperature recorded in degrees C

is essentially 1.00.

Meaning of r

r = 1     Perfect positive linear relationship between X and Y. Y increases as X increases.

r = 0     No linear relationship between X and Y. Y does not tend to increase or decrease as X increases.

r = -1    Perfect negative linear relationship between X and Y. Y decreases as X increases.

The sign of r (+ or -) indicates the direction of the relationship between X and Y. The magnitude
of r (how far away from zero it is) indicates the strength of the relationship.

For purposes of this course, we can also say that

• r values between 0.1 and 0.5 indicate that the relationship is “weak”
• r values between 0.5 and 0.9 indicate that the relationship is “strong”
• r values greater than 0.9 indicate that the relationship is “extremely strong”

Interpretation of r

There are two important aspects to consider when interpreting r: the sign of r, which
indicates the direction of the relationship, and the size of r, which indicates the degree of linear
relationship.

The sign of r is simple to interpret, since a positive r means that the variables tend to
increase or decrease together and a negative r means that as one variable increases the
other tends to decrease.

In common usage, correlations of 0.80 and above are considered high, an r of 0.50 is
considered moderate, and an r of 0.30 or below is considered low. However, Daleon (1989)
claimed that a correlation of 0.90 to 1.0 is a very high correlation, indicating a very dependable
relationship; 0.70 to 0.89 is a high correlation, indicating a marked relationship; 0.40 to 0.69 is
a moderate correlation, indicating a substantial relationship; 0.20 to 0.39 is a low
correlation, indicating a definite but small relationship; and less than 0.20 is a negligible
correlation.
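
The verbal scale attributed to Daleon (1989) above can be sketched as a simple lookup in Python. The function name `describe_correlation` is hypothetical, and two assumptions are made that the text does not state: the scale is applied to the magnitude of r (so that negative coefficients are labeled by their absolute value), and a value falling between two printed bands, such as 0.895, is assigned to the lower band.

```python
def describe_correlation(r):
    """Verbal label for r, following the bands attributed to Daleon (1989)."""
    size = abs(r)  # assumption: the scale applies to the magnitude of r
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size >= 0.90:
        return "very high correlation, very dependable relationship"
    if size >= 0.70:
        return "high correlation, marked relationship"
    if size >= 0.40:
        return "moderate correlation, substantial relationship"
    if size >= 0.20:
        return "low correlation, definite but small relationship"
    return "negligible correlation"

print(describe_correlation(-0.73))  # high correlation, marked relationship
print(describe_correlation(0.15))   # negligible correlation
```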

While we can say that an r of 0.90 shows a higher degree of linear relationship than an r of
0.45, we cannot say that the larger r indicates twice as much correlation. The value of r is just
an index of the degree of relationship. In attempting to conceptualize the degree of relationship
represented by the correlation coefficient, it is more meaningful to think of the square of the
correlation, r², sometimes referred to as the coefficient of determination. Here, r² indicates the
proportion of variance in one variable which may be said to be predictable from the other
variable. Thus, (0.90)² = 0.81 indicates that 81% of the variance in one variable may be said to be
predictable from the other variable. The remaining 19% (1 - r²) may be due to other factors
that are not accounted for by the linear relationship between the two variables.
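
A short Python sketch makes the point about r = 0.90 versus r = 0.45 concrete. The function name `coefficient_of_determination` is chosen for illustration only.

```python
def coefficient_of_determination(r):
    """Proportion of variance in one variable predictable from the other."""
    return r ** 2

for r in (0.90, 0.45):
    r2 = coefficient_of_determination(r)
    print(f"r = {r:.2f}   r^2 = {r2:.4f}   1 - r^2 = {1 - r2:.4f}")

# r = 0.90   r^2 = 0.8100   1 - r^2 = 0.1900
# r = 0.45   r^2 = 0.2025   1 - r^2 = 0.7975
```

Although 0.90 is twice 0.45, the proportion of predictable variance is four times as large (0.81 against about 0.20), which is why r itself should not be read as a proportional scale.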
We may test the significance of an obtained correlation by using the t-test given by

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

or

$$ t = r_s\sqrt{\frac{n-2}{1-r_s^2}} $$

with n - 2 degrees of freedom.

When the Minitab output is used, however, the reported probability (p-value) serves as the basis
for testing the significance of the relationship between the two variables.
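
The t statistic for a correlation can be sketched in Python as shown below. The function name `t_statistic` is hypothetical. The example applies it to the English and Mathematics scores, where r = -175/240 and n = 10. The critical value 2.306 is the two-tailed 0.05 value for 8 degrees of freedom taken from a standard t table; it is not computed by the code, and it does not replace the p-value that Minitab reports.

```python
from math import sqrt

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("at least three paired observations are needed")
    if abs(r) >= 1:
        raise ValueError("t is undefined when r is exactly +1 or -1")
    return r * sqrt((n - 2) / (1 - r ** 2))

r = -175 / 240          # correlation from the English/Mathematics example
n = 10
t = t_statistic(r, n)

critical_t = 2.306      # tabled two-tailed value, alpha = 0.05, df = 8
print(round(t, 2))              # -3.01
print(abs(t) > critical_t)      # True
```

Since the absolute value of t exceeds the tabled value, the correlation in that example would be judged significant at the 0.05 level under this test.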
Exercise 5
1. Would you say that the relationship between the following pairs of variables is positive,
negative, or about zero (0)? Explain your answers.

a. Students’ height and their school attendance.
b. Students’ class participation and their study habits.
c. Heights and weights of students in a class.
d. Number of children in the family and children’s height.
e. Students’ subject load and their grades.

In the succeeding numbers, first use a calculator to find the value of r. Then use Minitab
software to test whether the variables are significantly related or not.

2. Compute and interpret the correlation coefficient of the following data:

x 4 5 9 14 18 22 24 28 30

y 16 22 11 16 7 3 5 10 13

3. Thirteen students were given tests in Calculus and Statistics and their scores are shown
below.

Student Scores in Calculus Scores in Statistics

1 18 16
2 16 10
3 17 15
4 11 9
5 24 15
6 24 14
7 26 13
8 15 10
9 25 9
10 13 15
11 30 11
12 12 12
13 26 11

Students’ Hands-on Activity

1. Use Minitab to process the data in your data file. Correlate the interval and ratio
variables in your data file. Analyze and interpret the results.
