Module 11 Unit 1 Correlation Analysis
As you diligently follow this unit, you are expected to be able to explain the importance of
knowing the strength of the relationship between two quantitative variables. Using a
scientific calculator, you will be tasked to calculate correlation coefficients. You will also
learn how to interpret these measures of association.
The word correlation is used in everyday life to denote some form of association. We might
say that we have noticed a correlation between foggy days and attacks of wheeziness.
However, in statistical terms we use correlation to denote association between two
quantitative variables. We also assume that the association is linear, that one variable
increases or decreases a fixed amount for a unit increase or decrease in the other. The
other technique that is often used in these circumstances is regression, which involves
estimating the best straight line to summarize the association.
Correlation analysis can help answer questions such as the following:
(1) Does a relationship exist between job characteristics and job satisfaction of
employees?
(2) Is there a relationship between tax revenue and advertising expenditures?
(3) What is the degree of association between wages and labor force participation of
married women?
(4) Is the height of the eldest son related to the height of his father?
(5) Are the grades of a student related to the number of hours he uses the Internet for
research?
(6) What is the relationship between the total number of members in a cooperative
and the cooperative’s net surplus?
(7) Are consumer characteristics and willingness to adopt mass customization related?
Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 180
Some important points to remember about correlation:
Statistical relationships exist even though a change in one variable is not caused by a
change in the other. A strong correlation can be produced simply by chance, by the
effect of a third variable not considered in the calculation of the correlation, or by a
cause-and-effect relationship. Additional analysis must be performed to determine which
of these three situations actually produced the correlation.
Let us define the two quantitative variables that we are considering here:
Dependent variables, also referred to as response variables, are the quantitative
variables that the investigation or experiment measures to determine their association with the
independent variable.
If you want to know the degree of relationship between two variables that are measured
on at least an interval scale, and the data are obtained from an approximately normally
distributed population, the Pearson correlation coefficient (Pearson product-moment
correlation coefficient) may be used. If the data involve ranks, or if the data are at least
interval but are not approximately normally distributed, the Spearman rank
correlation coefficient is used instead.
The product-moment correlation coefficient is sometimes called Pearson's 𝑟 after its originator and is a measure of linear association. If a
curved line is needed to express the relationship, other and more complicated measures of
the correlation must be used.
The correlation coefficient is measured on a scale that varies from −1 through 0 to +1. A
perfect correlation between two variables is expressed by either +1 or −1. When one
variable increases as the other increases the correlation is positive; when one decreases as
the other increases it is negative. Complete absence of correlation is represented by 0.
To compute the Pearson correlation coefficient, you can use the following formula:
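$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$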
Here, 𝑛 is the number of pairs of observations; ∑𝑥𝑦 means we multiply 𝑥 and 𝑦 together first
before adding the products together; ∑𝑥² means we square the values of 𝑥 first then sum them
up; and ∑𝑦² means we square the values of 𝑦 first before we add them up. Note that the
values of these summations can be easily computed using the statistics mode of a
scientific calculator (choose 𝑦 = 𝑎 + 𝑏𝑥 in the statistics menu). You don’t have to compute
them manually.
The resulting value of the correlation coefficient can be interpreted based on the following
table. When we are trying to determine if the correlation is perfect, very strong, strong,
moderate, weak, very weak or if no correlation exists, we only look at the numerical value
of the coefficient (disregard the sign). Then we add the modifier “positive” or “negative”
depending on the sign of the computed value.
Correlation Coefficient
(absolute value)          Interpretation
1.00                      Perfect (Positive/Negative) Correlation
0.80 – 0.99               Very Strong (Positive/Negative) Correlation
0.60 – 0.79               Strong (Positive/Negative) Correlation
0.40 – 0.59               Moderate (Positive/Negative) Correlation
0.20 – 0.39               Weak (Positive/Negative) Correlation
0.01 – 0.19               Very Weak (Positive/Negative) Correlation
0.00                      No Correlation
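The interpretation table can also be applied in R. The sketch below is only an illustration: the function name interpret_r is hypothetical, and it assumes that a value falling between two bands of the table (for example 0.795) is assigned to the lower band.

```r
# Hypothetical helper: label a correlation coefficient using the table above.
interpret_r <- function(r) {
  stopifnot(is.numeric(r), length(r) == 1, abs(r) <= 1)
  a <- abs(r)  # the strength is judged on the absolute value only
  if (a == 0) {
    return("No Correlation")
  }
  strength <- if (a == 1) {
    "Perfect"
  } else if (a >= 0.80) {
    "Very Strong"
  } else if (a >= 0.60) {
    "Strong"
  } else if (a >= 0.40) {
    "Moderate"
  } else if (a >= 0.20) {
    "Weak"
  } else {
    "Very Weak"
  }
  # the sign of r supplies the modifier "Positive" or "Negative"
  direction <- if (r > 0) "Positive" else "Negative"
  paste(strength, direction, "Correlation")
}

interpret_r(0.45)  # "Moderate Positive Correlation"
```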
The following are the scatter diagrams that show the different types or degrees of
correlation between two variables. The straight line that you see in the figures represents a
trendline, or estimated regression line, which will be discussed in the next module.
[Figure: scatter diagrams with trendlines. Panels: perfect positive correlation; some positive correlation; no correlation; 𝑟 close to −1; 𝑟 = −1.]
Example 1:
A men's tie shop ran 10 sales promotions to determine the number of men's neckties of
a certain type that customers would buy at various prices. The table shows the sales
results. Calculate the coefficient of correlation.
Price, 𝒙    Number of ties sold, 𝒚
649         187
699         149
749         155
799         148
849         130
899         132
949         90
999         99
1,049       69
1,099       51
Solution:
Let us solve for the values needed in the formula for 𝑟:
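The module expects these sums to come from a scientific calculator. As a cross-check, the following R sketch computes the same quantities from the table of Example 1; the variable names are illustrative assumptions.

```r
# Tie-shop data from Example 1
x <- c(649, 699, 749, 799, 849, 899, 949, 999, 1049, 1099)  # prices
y <- c(187, 149, 155, 148, 130, 132, 90, 99, 69, 51)        # ties sold
n <- length(x)                                              # 10 pairs

sum_x  <- sum(x)      # 8740
sum_y  <- sum(y)      # 1210
sum_xy <- sum(x * y)  # 1001640
sum_x2 <- sum(x^2)    # 7845010
sum_y2 <- sum(y^2)    # 162686

# Pearson r from the computational formula
r <- (n * sum_xy - sum_x * sum_y) /
  sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2))
round(r, 4)  # -0.9648
```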
Substituting these into the formula, we obtain:
𝑟 = −0.9648
Hence, there exists a very strong negative correlation between the number of ties
sold and price. [Ideally, as price increases, the number of ties sold is expected to
decrease; or, as price decreases, the number of ties sold is expected to increase.]
It is not sufficient for us, though, to conclude that two variables are indeed statistically
related based only on the computed value of the correlation coefficient. A hypothesis test
would be necessary to determine if there is a significant linear relationship between two
quantitative variables. For the test, we hypothesize that the population correlation
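coefficient, 𝝆, is 0. Thus, the null and the alternative hypotheses are stated as
𝐻0: 𝜌 = 0 (there is no significant linear relationship between the two variables)
𝐻1: 𝜌 ≠ 0 (there is a significant linear relationship between the two variables)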
We set the significance level at 0.05. The test statistic for the analysis is the t-statistic with
formula given by
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
where the test statistic 𝒕 follows a t-distribution with 𝒏 − 𝟐 degrees of freedom. The critical
region for the test for our example is illustrated now as
𝑑𝑓 = 10 − 2 = 8; 𝛼/2 = 0.025
Critical Values: −𝑡𝛼/2 = −2.306 and 𝑡𝛼/2 = 2.306
[Figure: t-distribution with rejection regions of area 0.025 in each tail, below −2.306 and above 2.306.]
Decision Rule:
Reject 𝐻0 if 𝑡𝑐𝑜𝑚𝑝 ≤ −2.306 or 𝑡𝑐𝑜𝑚𝑝 ≥ 2.306
Computing now for the value of the test statistic, we have
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.9648 - 0}{\sqrt{\dfrac{1 - (-0.9648)^2}{10 - 2}}} = -10.377$$
Now, since the computed value of the test statistic, 𝑡 = −10.377, is less than the critical value on the
left tail, −2.306, we reject the null hypothesis.
We then conclude that there is sufficient data or evidence showing that there is a
significant linear relationship between the price and the number of ties sold.
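The test for Example 1 can be retraced in R. This is a minimal sketch using the rounded value 𝑟 = −0.9648 from the manual solution; qt() supplies the critical value that would otherwise be read from a t-table, and the variable names are assumptions.

```r
r <- -0.9648   # Pearson r from Example 1
n <- 10        # number of pairs of observations
alpha <- 0.05  # significance level

# test statistic under H0: rho = 0
t_comp <- (r - 0) / sqrt((1 - r^2) / (n - 2))
# two-tailed critical value with n - 2 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 2)
# decision rule: reject H0 if t_comp <= -t_crit or t_comp >= t_crit
reject_h0 <- abs(t_comp) >= t_crit

round(t_comp, 3)  # -10.377
round(t_crit, 3)  # 2.306
reject_h0         # TRUE
```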
In the Spearman rank correlation, observations are replaced by their ranks in the
calculation of the correlation coefficient. It is used to determine a possible correlation
(consistency) between two ordinal variables.
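$$r_s = 1 - \frac{6\sum d^2}{n\left(n^2 - 1\right)}$$
Here, 𝑑 is the difference between the ranks of each pair of observations and 𝑛 is the number of pairs.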
In ranking the values of 𝑥 and 𝑦, we let the highest value for each variable have rank 1.
Also, when taking the ranks of two (or more) observations with the same value, their
corresponding ranks must be averaged. Note also that when a problem does not specify
what correlation coefficient to compute, we usually compute for the Pearson correlation
coefficient (assuming the data are measured on the interval scale and taken from an
approximately normal population). If, however, the data already gives the rankings for
each variable, compute the Spearman rank correlation coefficient.
The values of 𝑟𝑠 can be interpreted in the same way as the Pearson’s 𝑟 using the table
given previously.
Example 2:
Calculate the Spearman rank correlation coefficient, given the following data on the
number of hours of study for an examination and the grades received by a random
sample of 10 students.
Student                     1   2   3   4   5   6   7   8   9  10
Number of hours of study    8   5  11  13  10   5  18  15   2   8
Grade                      56  44  79  72  70  54  94  85  33  65
Solution:
Let us construct a table to aid us in computing the ranks of each observation of the
variables and the difference in ranks.
𝑥     𝑦     Rank of 𝑥   Rank of 𝑦   𝑑       𝑑²
8     56    6.5         7           −0.5    0.25
5     44    8.5         9           −0.5    0.25
11    79    4           3           1       1
13    72    3           4           −1      1
10    70    5           5           0       0
5     54    8.5         8           0.5     0.25
18    94    1           1           0       0
15    85    2           2           0       0
2     33    10          10          0       0
8     65    6.5         6           0.5     0.25
                                    ∑𝑑² = 3.00
We see from the table that the 𝑥 values equal to 8 both have a rank of 6.5; this is
because the observations were supposed to occupy ranks 6 and 7, but since the
observations are equal, we take the average of 6 and 7 to give us 6.5. The same
holds for the equal 𝑥 values of 5, which were supposed to be ranked 8 and 9; since
both ranks have the same observation, we take the average of 8 and 9 to give us
8.5.
$$r_s = 1 - \frac{6(3)}{10\left(10^2 - 1\right)} = 0.9818$$
Hence, there exists a very strong positive correlation between the number of hours
of study and the grade.
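The ranking and the formula for 𝑟𝑠 can be retraced in R. This is a sketch with illustrative variable names; it relies on the default behavior of rank(), which averages the ranks of tied values.

```r
hours <- c(8, 5, 11, 13, 10, 5, 18, 15, 2, 8)        # number of hours of study
grade <- c(56, 44, 79, 72, 70, 54, 94, 85, 33, 65)   # grade received
n <- length(hours)

# rank() ranks in ascending order, so the values are negated to give the
# highest value rank 1; tied values receive the average of their ranks.
rank_x <- rank(-hours)
rank_y <- rank(-grade)

d <- rank_x - rank_y                     # difference in ranks
sum_d2 <- sum(d^2)                       # 3
r_s <- 1 - 6 * sum_d2 / (n * (n^2 - 1))
round(r_s, 4)                            # 0.9818
```

Because the hours of study contain ties, cor(hours, grade, method = "spearman"), which computes a Pearson correlation on the ranks, gives a slightly lower value (about 0.9817) than the formula above.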
Test of Significance of the Spearman Correlation Coefficient
𝑑𝑓 = 10 − 2 = 8; 𝛼/2 = 0.025
Critical Values: −𝑡𝛼/2 = −2.306 and 𝑡𝛼/2 = 2.306
[Figure: t-distribution with rejection regions of area 0.025 in each tail, below −2.306 and above 2.306.]
Decision Rule:
Reject 𝐻0 if 𝑡𝑐𝑜𝑚𝑝 ≤ −2.306 or 𝑡𝑐𝑜𝑚𝑝 ≥ 2.306
Now, since the computed value of the test statistic t is greater than the critical value on the
right tail, 2.306, we reject the null hypothesis.
We then conclude that there is sufficient data or evidence showing that there is a
significant linear relationship between the number of hours of study and the grade.
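The computed test statistic for Example 2 can be checked in R. This is a sketch under the assumption that the same t formula used for Pearson's 𝑟 is applied to 𝑟𝑠 = 0.9818 with 𝑛 = 10; the variable names are illustrative.

```r
r_s <- 0.9818  # Spearman coefficient from Example 2
n <- 10        # number of students

# assumed: same t statistic as for Pearson's r, with n - 2 degrees of freedom
t_comp <- r_s / sqrt((1 - r_s^2) / (n - 2))
t_crit <- qt(0.975, df = n - 2)  # right-tail critical value, about 2.306

t_comp >= t_crit  # TRUE, so the null hypothesis is rejected
```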
The correlation coefficient can be easily obtained using R by means of the cor() or
cor.test() functions. The cor() function only gives the correlation coefficient while the
cor.test() function returns both the correlation coefficient and the p-value for the test of
significance.
For our examples, it would be best if the data are saved as “.csv” files to be imported later
for the R analysis. For example 1, we have the data “tieprices.csv” while for the second
example we use the “grade.csv” file.
R Script for Example 1.
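A script along the following lines would carry out the analysis. It is a sketch, not the original script: the data frame is typed in so that the code runs on its own, whereas the module reads it from "tieprices.csv" (for instance with read.csv()), and the column names are taken from the printed data.

```r
# Assumption: this data frame mirrors the contents of "tieprices.csv",
# which would normally be loaded with data <- read.csv("tieprices.csv")
data <- data.frame(
  price = c(649, 699, 749, 799, 849, 899, 949, 999, 1049, 1099),
  No..of.Ties.Sold = c(187, 149, 155, 148, 130, 132, 90, 99, 69, 51)
)
head(data)  # first six rows

# correlation coefficient only
cor(data$price, data$No..of.Ties.Sold, method = "pearson")

# correlation coefficient together with the test of significance
res <- cor.test(data$price, data$No..of.Ties.Sold, method = "pearson")
res$estimate  # sample correlation coefficient
res$p.value   # p-value of the two-sided test
```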
price No..of.Ties.Sold
1 649 187
2 699 149
3 749 155
4 799 148
5 849 130
6 899 132
# R Output
From the R output, we can see that there is a very strong negative correlation between the
price and the number of ties sold as indicated by the correlation coefficient value of
−0.9648. This relationship is significant at the 0.05 level (p = 6.431 × 10⁻⁶). The result obtained
supports what we have in our manual solution.
We can also produce a scatterplot of the data for us to have a visual assessment of the
linear relationship.
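# Scatterplot with the fitted regression line and the Pearson correlation coefficient
library(ggpubr)  # provides ggscatter()
ggscatter(data, x = "price", y = "No..of.Ties.Sold",
          add = "reg.line", cor.coef = TRUE,
          cor.method = "pearson",
          xlab = "Price", ylab = "No. of Ties Sold")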
# R Output
## No..of.hours Grade
## 1 8 56
## 2 5 44
## 3 11 79
## 4 13 72
## 5 10 70
## 6 5 54
# R Output
The Spearman correlation of 0.9817 indicates a very strong, positive linear relationship
between the No. of hours of study and Grade. At the 5% significance level, this linear
relationship is highly significant (p = 4.773 × 10⁻⁷).
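A call along these lines would give the Spearman result for Example 2. It is a sketch rather than the original script: the data frame is typed in instead of being read from "grade.csv", and exact = FALSE is an assumption made here because an exact p-value is not available when the data contain ties.

```r
# Assumption: this data frame mirrors the contents of "grade.csv"
data <- data.frame(
  No..of.hours = c(8, 5, 11, 13, 10, 5, 18, 15, 2, 8),
  Grade = c(56, 44, 79, 72, 70, 54, 94, 85, 33, 65)
)

# Spearman rank correlation with an approximate test of significance
res <- cor.test(data$No..of.hours, data$Grade,
                method = "spearman", exact = FALSE)
round(unname(res$estimate), 4)  # 0.9817
res$p.value                     # p-value of the two-sided test
```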
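# Scatterplot with the fitted regression line and the Spearman correlation coefficient
library(ggpubr)  # provides ggscatter()
ggscatter(data, x = "No..of.hours", y = "Grade",
          add = "reg.line", cor.coef = TRUE,
          cor.method = "spearman",
          xlab = "No. of hours of study", ylab = "Grade")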
# R Output
Practice Exercise 11-1
(1) A department of transportation’s study on driving speed and miles per gallon for
midsize automobiles resulted in the following data. Find the Pearson and Spearman
rank correlation coefficient. Interpret these results.
(2) The marketing manager of a large supermarket chain would like to know the
correlation between shelf space and sales of pet food. A random sample of 12
equal-sized stores is selected, with the following results. Find the Pearson and
Spearman rank correlation coefficient. Interpret these results.
Learning Reinforcement Activity No. 11-1: CORRELATION ANALYSIS
Accomplish by December 7, 2020
Using the R software, solve the following problems as directed. Present your solution just as
it was presented in the examples in a .docx file, with file name
LRA11-1<LASTNAME>.docx. Summarize your solution for each problem with a conclusion. For the R
script, save it as LRA11-1<LASTNAME>.R.
1. The dean of a business school undertakes a study to relate starting salary after
graduation to grade point average GPA in major courses. He then randomly selects
records of 10 students shown in the accompanying table. Perform a correlation analysis
using Pearson correlation. (5 points)
Student                                    1   2   3   4   5   6   7   8   9  10
GPA                                       78  81  85  87  75  79  83  88  85  77
Starting salary (in thousands of pesos)   17  18  18  28  17  22  30  34  30  28
2. The following are the numbers of sales contacts made by 9 salespersons during a week
and the number of sales made. Perform a correlation analysis using the Pearson
correlation coefficient and interpret. (5 points)
Salesperson       1    2    3    4    5    6    7    8    9
Sales contacts   71   64  100  105   75   79   82   68  110
Sales            25   14   37   40   18   10   22   12   42
3. The owner of a car wants to study the relationship between the age of a car and its
selling price. Listed below is a random sample of 12 used cars sold at a dealership during
the last year. Perform a correlation analysis using (a) Pearson correlation coefficient; (b)
Spearman rank correlation coefficient. Interpret these results. (10 points)
Car                      1     2     3     4     5     6     7     8     9    10    11    12
Age (years)              9     7    11    12     8     7     8    11    10    12     6     6
Selling Price ($1000)  8.1   6.0   3.6   4.0   5.0  10.0   7.6   8.0   8.0   6.0   8.6   8.0
Congratulations! You just completed Module 11 Unit 1. In the next unit, we use the
association to predict the value of the dependent variable.