
MODULE 11: CORRELATION AND REGRESSION ANALYSIS

UNIT 1: CORRELATION ANALYSIS


(For DECEMBER 4-7)

Learning Outcome: Evaluate the strength of association (or relationship)
between two variables by solving for the correlation coefficient.

As you diligently follow this unit, you are expected to be able to explain the importance of
knowing the strength of the relationship between two quantitative variables. Using a
scientific calculator, you will be tasked to calculate correlation coefficients. You will also
learn how to interpret these measures of association.

The word correlation is used in everyday life to denote some form of association. We might
say that we have noticed a correlation between foggy days and attacks of wheeziness.
However, in statistical terms we use correlation to denote association between two
quantitative variables. We also assume that the association is linear, that one variable
increases or decreases a fixed amount for a unit increase or decrease in the other. The
other technique that is often used in these circumstances is regression, which involves
estimating the best straight line to summarize the association.

Correlation analysis is a group of techniques used to measure the strength of the
association/relationship between variables by means of a single number called a
correlation coefficient. Correlation is a measure of the extent to which there is a linear (or
straight line) relationship between two variables, and deals primarily with the magnitude
and direction of the relationship. A few examples of questions involving the relationship
between two variables are:

(1) Does a relationship exist between job characteristics and job satisfaction of
employees?
(2) Is there a relationship between tax revenue and advertising expenditures?
(3) What is the degree of association between wages and labor force participation of
married women?
(4) Is the height of the eldest son related to the height of his father?
(5) Are the grades of a student related to the number of hours he uses the Internet for
research?
(6) What is the relationship between the total number of members in a cooperative
and the cooperative’s net surplus?
(7) Are consumer characteristics and willingness to adopt mass customization related?

Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 180
Some important points to remember about correlation:

• The correlation coefficient is a value between −1 and +1, inclusive.
• The direction of the relationship is based on the sign of the correlation coefficient:
  "+": positively correlated (sloping upwards; as one variable increases, the other
  increases also)
  "−": negatively correlated (sloping downwards; as one increases, the other
  decreases)
• The strength of the relationship is based on the absolute value of the correlation
  coefficient.
• Correlation does not necessarily imply or prove causation, that is, that
  the change in the value of one variable caused the change in the other
  variable.

Statistical relationships exist even though a change in one variable is not caused by a
change in the other. A strong correlation can be produced simply by chance, by the
effect of a third variable not considered in the calculation of the correlation, or by a
cause-and-effect relationship. Additional analysis must be performed to determine which
of these three situations actually produced the correlation.

Let us define the two quantitative variables that we are considering here:

Independent Variables are frequently referred to as input variables or predictor
variables because they are systematically manipulated by the researcher and are used to
predict the outcome.

Dependent Variables, also referred to as response variables, are the quantitative
variables that the investigation or experiment measures to determine their association with
the independent variable.

If you want to know the degree of relationship between two variables that are measured
on at least an interval scale, and the data are obtained from approximately normally
distributed populations, the Pearson correlation coefficient (Pearson product moment
correlation coefficient) may be used. If the data involve ranks, or if the data are at least
interval but not approximately normally distributed, the Spearman rank
correlation is used instead.

PEARSON CORRELATION COEFFICIENT

The degree of association of two quantitative variables from approximately normally
distributed populations is measured by the Pearson correlation coefficient, denoted by 𝑟. It
is sometimes called Pearson's 𝑟 after its originator and is a measure of linear association. If a
curved line is needed to express the relationship, other and more complicated measures of
the correlation must be used.

The correlation coefficient is measured on a scale that varies from −1 through 0 to +1. A
perfect correlation between two variables is expressed by either +1 or −1. When one
variable increases as the other increases the correlation is positive; when one decreases as
the other increases it is negative. Complete absence of correlation is represented by 0.
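The three landmark values on this scale can be checked with a few lines of R. This is only a small sketch on made-up numbers (the vectors `x` and `y_none` are hypothetical and not taken from the module's examples); it uses the base R function `cor()`.

```r
# Hypothetical toy data, chosen only to illustrate r = +1, r = -1 and r = 0
x <- c(1, 2, 3, 4, 5)

r_pos  <- cor(x, 2 * x + 3)    # points lie exactly on a rising straight line
r_neg  <- cor(x, 10 - 4 * x)   # points lie exactly on a falling straight line

y_none <- c(2, 1, 3, 1, 2)     # values with no linear trend against x
r_zero <- cor(x, y_none)

# Rounded to absorb floating-point noise
print(round(c(positive = r_pos, negative = r_neg, none = r_zero), 4))
```

Any straight line with a positive slope gives +1 and any straight line with a negative slope gives −1, regardless of how steep the line is.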

To compute the Pearson correlation coefficient, you can use the following formula:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

Here, 𝑛 is the number of pairs of observations; ∑𝑥𝑦 means we multiply 𝑥 and 𝑦 together first
before adding the products together; ∑𝑥² means we square the values of 𝑥 first then sum them
up; and ∑𝑦² means we square the values of 𝑦 first before we add them up. In contrast,
(∑𝑥)² and (∑𝑦)² mean we add the values first and then square the total. Note that the
values of these summations can be easily computed using the statistics mode of a
scientific calculator (choose 𝑦 = 𝑎 + 𝑏𝑥 in the statistics menu). You don’t have to compute
them manually.

The resulting value of the correlation coefficient can be interpreted based on the following
table. When we are trying to determine if the correlation is perfect, very strong, strong,
moderate, weak, very weak or if no correlation exists, we only look at the numerical value
of the coefficient (disregard the sign). Then we add the modifier “positive” or “negative”
depending on the sign of the computed value.

Correlation Coefficient
(absolute value)          Interpretation
1.00                      Perfect (Positive/Negative) Correlation
0.80 – 0.99               Very Strong (Positive/Negative) Correlation
0.60 – 0.79               Strong (Positive/Negative) Correlation
0.40 – 0.59               Moderate (Positive/Negative) Correlation
0.20 – 0.39               Weak (Positive/Negative) Correlation
0.01 – 0.19               Very Weak (Positive/Negative) Correlation
0.00                      No Correlation
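The table can be turned into a small lookup helper. The function below is a hypothetical sketch, not part of the module: the name `interpret_r` is made up, and it assumes that a value falling between two printed ranges (for example 0.795) belongs to the lower category, and that any nonzero value below 0.01 counts as very weak.

```r
# Hypothetical helper: label a correlation coefficient using the table above.
# Assumption: each category starts at its printed lower bound.
interpret_r <- function(r) {
  stopifnot(is.numeric(r), length(r) == 1, abs(r) <= 1)
  a <- abs(r)                       # strength uses the absolute value only
  if (a == 0) {
    return("No Correlation")
  } else if (a == 1) {
    strength <- "Perfect"
  } else if (a >= 0.80) {
    strength <- "Very Strong"
  } else if (a >= 0.60) {
    strength <- "Strong"
  } else if (a >= 0.40) {
    strength <- "Moderate"
  } else if (a >= 0.20) {
    strength <- "Weak"
  } else {
    strength <- "Very Weak"
  }
  direction <- if (r > 0) "Positive" else "Negative"   # modifier from the sign
  paste(strength, direction, "Correlation")
}

print(interpret_r(0.45))
print(interpret_r(-0.85))
```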

The following are the scatter diagrams that show the different types or degrees of
correlation between two variables. The straight line that you see in the figures represents a
trendline, or estimated regression line, which will be discussed in the next module.

[Figure: scatter diagrams with trendlines. Panels: perfect positive correlation (𝑟 = 1);
some positive correlation (𝑟 close to 1); no correlation (𝑟 = 0 or close to zero);
perfect negative correlation (𝑟 = −1); some negative correlation (𝑟 close to −1).]

Example 1:
A men's tie shop ran 10 sales promotions to determine the number of men's neckties of
a certain type that customers would buy at various prices. The table shows the sales
results. Calculate the coefficient of correlation.

Price, 𝒙     Number of ties sold, 𝒚
649          187
699          149
749          155
799          148
849          130
899          132
949          90
999          99
1,049        69
1,099        51

Solution:
Let us solve for the values needed in the formula for 𝑟:

𝑛 = 10 (number of pairs of observations)
∑𝑥 = 649 + 699 + 749 + ⋯ + 1099 = 8740
∑𝑥² = 649² + 699² + 749² + ⋯ + 1099² = 7845010
∑𝑦 = 187 + 149 + 155 + ⋯ + 51 = 1210
∑𝑦² = 187² + 149² + 155² + ⋯ + 51² = 162686
∑𝑥𝑦 = (649)(187) + (699)(149) + (749)(155) + ⋯ + (1099)(51) = 1001640

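Substituting these into the formula, we obtain:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{10(1001640) - (8740)(1210)}{\sqrt{\left[10(7845010) - (8740)^2\right]\left[10(162686) - (1210)^2\right]}} $$

𝑟 = −0.9648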

Hence, there exists a very strong negative correlation between the number of ties
sold and price. [Ideally, as price increases, the number of ties sold is expected to
decrease; or, as price decreases, the number of ties sold is expected to increase.]
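The manual computation in Example 1 can be reproduced in R as a check. The sketch below types the data in directly; the variable names (`price`, `ties_sold`, `sum_x`, and so on) are illustrative choices, not names used elsewhere in this module. It builds each summation, applies the formula for 𝑟, and compares the result with the base R function `cor()`.

```r
# Data from Example 1 (typed in by hand; variable names are illustrative)
price     <- c(649, 699, 749, 799, 849, 899, 949, 999, 1049, 1099)
ties_sold <- c(187, 149, 155, 148, 130, 132, 90, 99, 69, 51)

n      <- length(price)          # number of pairs of observations
sum_x  <- sum(price)             # sum of x
sum_x2 <- sum(price^2)           # square first, then sum
sum_y  <- sum(ties_sold)         # sum of y
sum_y2 <- sum(ties_sold^2)       # square first, then sum
sum_xy <- sum(price * ties_sold) # multiply pairs first, then sum

# Pearson correlation coefficient from the summation formula
r_manual <- (n * sum_xy - sum_x * sum_y) /
  sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2))

# Same quantity from the built-in function, for comparison
r_builtin <- cor(price, ties_sold)

print(c(sum_x = sum_x, sum_x2 = sum_x2, sum_y = sum_y,
        sum_y2 = sum_y2, sum_xy = sum_xy))
print(round(c(manual = r_manual, builtin = r_builtin), 4))
```

Both approaches should agree, which is a quick way to catch a mistyped value before moving on to the test of significance.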

Test of Significance of the Correlation Coefficient

It is not sufficient for us, though, to conclude that two variables are indeed statistically
related based only on the computed value of the correlation coefficient. A hypothesis test
would be necessary to determine if there is a significant linear relationship between two
quantitative variables. For the test, we hypothesize that the population correlation
coefficient, 𝝆, is 0. Thus, the null and the alternative hypothesis are stated as

Ho: 𝝆 = 0 (no significant linear relationship)


Ha: 𝝆 ≠ 0 (there is a significant linear relationship)

We set the significance level at 0.05. The test statistic for the analysis is the t-statistic with
formula given by

$$ t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

where the test statistic 𝒕 follows a t-distribution with 𝒏 − 𝟐 degrees of freedom. The critical
region for the test for our example is as follows:

𝑑𝑓 = 10 − 2 = 8; 𝛼/2 = 0.025
Critical Values: −𝑡𝛼/2 = −2.306 and 𝑡𝛼/2 = 2.306
[Figure: t-distribution curve with a rejection region of area 0.025 in each tail, bounded
by −2.306 on the left and 2.306 on the right.]
Decision Rule:
Reject 𝐻0 if 𝑡𝑐𝑜𝑚𝑝 ≤ −2.306 or 𝑡𝑐𝑜𝑚𝑝 ≥ 2.306

Computing now for the value of the test statistic, we have

$$ t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.9648 - 0}{\sqrt{\dfrac{1 - (-0.9648)^2}{10 - 2}}} = -10.377 $$

Now, since the computed value of the test statistic t is less than the critical value on the
left tail, −2.306, we reject the null hypothesis.

We then conclude that there is sufficient data or evidence showing that there is a
significant linear relationship between the price and the number of ties sold.
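The steps of this test (degrees of freedom, critical value, test statistic, decision) can be sketched in R using only the rounded value 𝑟 = −0.9648 and 𝑛 = 10 from Example 1. The variable names are illustrative. `qt()` and `pt()` are the base R quantile and distribution functions for the t-distribution; the p-value line is an addition that the manual solution does not compute.

```r
# Inputs taken from Example 1 (r is the rounded value from the manual solution)
r     <- -0.9648
n     <- 10
alpha <- 0.05
rho0  <- 0                       # hypothesized population correlation under H0

df     <- n - 2                                  # degrees of freedom
t_comp <- (r - rho0) / sqrt((1 - r^2) / df)      # test statistic
t_crit <- qt(1 - alpha / 2, df)                  # two-tailed critical value

# Two-tailed p-value (not part of the manual solution; shown for comparison)
p_value <- 2 * pt(-abs(t_comp), df)

# Decision rule: reject H0 if t_comp <= -t_crit or t_comp >= t_crit
reject_h0 <- abs(t_comp) >= t_crit

print(round(c(t_comp = t_comp, t_crit = t_crit), 3))
print(p_value)
print(reject_h0)
```

Comparing the p-value with 𝛼 leads to the same decision as comparing the test statistic with the critical values.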

SPEARMAN RANK CORRELATION COEFFICIENT

In the Spearman rank correlation, observations are replaced by their ranks in the
calculation of the correlation coefficient. It is used to determine a possible correlation
(consistency) between two ordinal variables.

This results in a simple formula for Spearman's rank correlation, 𝑟𝑠:

$$ r_s = 1 - \frac{6\sum d^2}{n\left(n^2 - 1\right)} $$

where the variables involved in the formula are defined as follows:

𝑑 = difference in the ranks of the two variables for a given individual
    (𝑑 = rank of 𝑥 − rank of 𝑦)
𝑛 = number of pairs of observations

In ranking the values of 𝑥 and 𝑦, we let the highest value for each variable have rank 1.
Also, when taking the ranks of two (or more) observations with the same value, their
corresponding ranks must be averaged. Note also that when a problem does not specify
what correlation coefficient to compute, we usually compute for the Pearson correlation
coefficient (assuming the data are measured on the interval scale and taken from an
approximately normal population). If, however, the data already gives the rankings for
each variable, compute the Spearman rank correlation coefficient.

The values of 𝑟𝑠 can be interpreted in the same way as the Pearson’s 𝑟 using the table
given previously.

Example 2:
Calculate the Spearman rank correlation coefficient, given the following data on the
number of hours of study for an examination and the grades received by a random
sample of 10 students.

Student                    1   2   3   4   5   6   7   8   9   10
Number of hours of study   8   5   11  13  10  5   18  15  2   8
Grade                      56  44  79  72  70  54  94  85  33  65

Solution:
Let us construct a table to aid us in computing the ranks of each observation of the
variables and the difference in ranks.

𝑥     𝑦     Rank of 𝑥   Rank of 𝑦   𝑑       𝑑²
8     56    6.5         7           −0.5    0.25
5     44    8.5         9           −0.5    0.25
11    79    4           3           1       1
13    72    3           4           −1      1
10    70    5           5           0       0
5     54    8.5         8           0.5     0.25
18    94    1           1           0       0
15    85    2           2           0       0
2     33    10          10          0       0
8     65    6.5         6           0.5     0.25
                                    ∑𝑑² =   3.00

We see from the table that the 𝑥 values equal to 8 both have a rank of 6.5; this is
because the observations were supposed to occupy ranks 6 and 7, but since the
observations are equal, we take the average of 6 and 7 to give us 6.5. The same
holds for the equal 𝑥 values of 5, which were supposed to be ranked 8 and 9; since
both ranks have the same observation, we take the average of 8 and 9 to give us
8.5.

Adding the values in the last column, we have ∑𝑑² = 3.00. Computing 𝑟𝑠, we have

$$ r_s = 1 - \frac{6(3)}{10\left(10^2 - 1\right)} = 0.9818 $$

Hence, there exists a very strong positive correlation between the number of hours
of study and the grade.
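The ranking table and the value of 𝑟𝑠 in Example 2 can be rebuilt in R as a check. In this sketch the variable names are illustrative. The base R function `rank()` assigns rank 1 to the smallest value, so the data are negated to give the highest value rank 1, and `ties.method = "average"` averages the ranks of tied observations as described above.

```r
# Data from Example 2 (typed in by hand; variable names are illustrative)
hours <- c(8, 5, 11, 13, 10, 5, 18, 15, 2, 8)
grade <- c(56, 44, 79, 72, 70, 54, 94, 85, 33, 65)

# Highest value gets rank 1; tied values share the average of their ranks
rank_x <- rank(-hours, ties.method = "average")
rank_y <- rank(-grade, ties.method = "average")

d      <- rank_x - rank_y        # difference in ranks for each student
sum_d2 <- sum(d^2)               # sum of squared rank differences
n      <- length(hours)

# Spearman rank correlation from the simple formula
r_s <- 1 - 6 * sum_d2 / (n * (n^2 - 1))

print(data.frame(x = hours, y = grade, rank_x, rank_y, d, d2 = d^2))
print(sum_d2)
print(round(r_s, 4))
```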

Test of Significance of the Spearman Correlation Coefficient

Ho: 𝝆 = 0 (no significant linear relationship)


Ha: 𝝆 ≠ 0 (there is a significant linear relationship)

We set the significance level at 0.05.

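Critical region for the test is shown by

𝑑𝑓 = 10 − 2 = 8; 𝛼/2 = 0.025
Critical Values: −𝑡𝛼/2 = −2.306 and 𝑡𝛼/2 = 2.306
[Figure: t-distribution curve with a rejection region of area 0.025 in each tail, bounded
by −2.306 on the left and 2.306 on the right.]
Decision Rule:
Reject 𝐻0 if 𝑡𝑐𝑜𝑚𝑝 ≤ −2.306 or 𝑡𝑐𝑜𝑚𝑝 ≥ 2.306

Computing now for the value of the test statistic, we have

$$ t = \frac{r_s - \rho}{\sqrt{\dfrac{1 - r_s^2}{n - 2}}} = \frac{0.9818 - 0}{\sqrt{\dfrac{1 - (0.9818)^2}{10 - 2}}} = 14.6219 $$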

Now, since the computed value of the test statistic t is greater than the critical value on the
right tail, 2.306, we reject the null hypothesis.

We then conclude that there is sufficient data or evidence showing that there is a
significant linear relationship between the number of hours of study and the grade.

Performing the Analysis using R

The correlation coefficient can be easily obtained using R by means of the cor() or
cor.test() functions. The cor() function only gives the correlation coefficient while the
cor.test() function returns both the correlation coefficient and the p-value for the test of
significance.

For our examples, it would be best if the data are saved as “.csv” files to be imported later
for the R analysis. For example 1, we have the data “tieprices.csv” while for the second
example we use the “grades.csv” file.
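If the file for Example 1 is not yet available, it can be created from R itself. The sketch below is an illustration under assumptions: the column headers "price" and "No. of Ties Sold" are inferred from the variable names that appear in the R output of this unit, and the file is written to a temporary folder (`tempdir()`) rather than the working directory so that nothing is overwritten.

```r
# Assumed column headers, inferred from the names shown in the R output
tie_data <- data.frame(
  "price"            = c(649, 699, 749, 799, 849, 899, 949, 999, 1049, 1099),
  "No. of Ties Sold" = c(187, 149, 155, 148, 130, 132, 90, 99, 69, 51),
  check.names = FALSE            # keep the header text exactly as typed
)

# Write to a temporary location (an assumption; use your own folder as needed)
csv_path <- file.path(tempdir(), "tieprices.csv")
write.csv(tie_data, csv_path, row.names = FALSE)

# Reading the file back: read.csv() replaces spaces in headers with dots
check <- read.csv(csv_path)
print(names(check))
print(head(check))
```

The same pattern, with headers such as "No. of hours" and "Grade", would produce a file for Example 2.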

R Script for Example 1.

# Load readr package (note: read.csv() used below is a base R function)
library(readr)

# Import "tieprices.csv" into RStudio, assign it to "data"
data <- read.csv("tieprices.csv")

# Check variables of data frame. Note if a numeric variable is classified as
# factor. If so, we need to convert it to a numeric variable.
head(data)

##   price No..of.Ties.Sold
## 1   649              187
## 2   699              149
## 3   749              155
## 4   799              148
## 5   849              130
## 6   899              132

# Since both variables are classified as numeric (<int>), we proceed with the
# correlation analysis.
cor.test(data$price, data$No..of.Ties.Sold, method = "pearson")

# R Output

Pearson's product-moment correlation

data: data$price and data$No..of.Ties.Sold


t = -10.378, df = 8, p-value = 6.431e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9918915 -0.8538984
sample estimates:
cor
-0.9648082

From the R output, we can see that there is a very strong negative correlation between the
price and the number of ties sold as indicated by the correlation coefficient value of
−0.9648. This relationship is significant at the 0.05 level (p = 6.431 × 10⁻⁶). The result obtained
supports what we have in our manual solution.

We can also produce a scatterplot of the data for us to have a visual assessment of the
linear relationship.

# Scatterplot (ggscatter() is provided by the ggpubr package)
library(ggpubr)
ggscatter(data, x = "price", y = "No..of.Ties.Sold",
          add = "reg.line", cor.coef = TRUE,
          cor.method = "pearson",
          xlab = "Price", ylab = "No. of Ties Sold")

# R Output

## `geom_smooth()` using formula 'y ~ x'

R Script for Example 2

# Load readr package (note: read.csv() used below is a base R function)
library(readr)

# Import "grades.csv" file and assign it to "data"
data <- read.csv("grades.csv")

# Check variables of data frame
head(data)

##   No..of.hours Grade
## 1            8    56
## 2            5    44
## 3           11    79
## 4           13    72
## 5           10    70
## 6            5    54

# Proceed with correlation analysis
cor.test(data$No..of.hours, data$Grade, method = "spearman")

# R Output

Spearman's rank correlation rho

data: data$No..of.hours and data$Grade


S = 3.0153, p-value = 4.773e-07
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9817256

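The Spearman correlation of 0.9817 indicates a very strong, positive linear relationship
between the number of hours of study and the grade. At the 5% significance level, this linear
relationship is highly significant (p = 4.773 × 10⁻⁷). The coefficient differs slightly from the
manual result of 0.9818 because the hours of study contain tied values, for which the simple
formula is only approximate.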
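The small gap between R's rho (0.9817256) and the manual value (0.9818) can be examined directly. The sketch below, with illustrative variable names, rests on the understanding that R obtains Spearman's rho as the Pearson correlation of the ranks, and that the reported statistic S corresponds to (𝑛³ − 𝑛)(1 − rho)/6; treat both as assumptions to verify against the `cor.test()` documentation.

```r
# Data from Example 2 (typed in by hand; variable names are illustrative)
hours <- c(8, 5, 11, 13, 10, 5, 18, 15, 2, 8)
grade <- c(56, 44, 79, 72, 70, 54, 94, 85, 33, 65)
n     <- length(hours)

# Spearman's rho as reported by R
rho_r <- cor(hours, grade, method = "spearman")

# Pearson correlation applied to the (tie-averaged) ranks
rho_from_ranks <- cor(rank(hours), rank(grade))

# Simple formula used in the manual solution (exact only when there are no ties)
d          <- rank(hours) - rank(grade)
r_s_simple <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Statistic S recovered from rho (assumed relationship, see lead-in)
S <- (n^3 - n) * (1 - rho_r) / 6

print(round(c(rho_r = rho_r, rho_from_ranks = rho_from_ranks,
              r_s_simple = r_s_simple, S = S), 7))
```

With no tied values the two versions of the coefficient coincide; here the two pairs of tied study hours account for the difference in the fourth decimal place.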

# Scatterplot (ggscatter() is provided by the ggpubr package)
library(ggpubr)
ggscatter(data, x = "No..of.hours", y = "Grade",
          add = "reg.line", cor.coef = TRUE,
          cor.method = "spearman",
          xlab = "No. of hours of study", ylab = "Grade")

# R Output

## `geom_smooth()` using formula 'y ~ x'

Practice Exercise 11-1

(1) A department of transportation’s study on driving speed and miles per gallon for
midsize automobiles resulted in the following data. Find the Pearson and Spearman
rank correlation coefficients. Interpret these results.

Speed (miles per hour) Consumption (miles per gallon)


30 28
50 25
40 25
55 23
30 30
25 32
60 21
25 35
50 26
55 25

(2) The marketing manager of a large supermarket chain would like to know the
correlation between shelf space and sales of pet food. A random sample of 12
equal-sized stores is selected, with the following results. Find the Pearson and
Spearman rank correlation coefficients. Interpret these results.

Store Shelf Space (Feet) Weekly Sales ($)


1 5 160
2 5 220
3 5 140
4 10 190
5 10 240
6 10 260
7 15 230
8 15 270
9 15 280
10 20 260
11 20 290
12 20 310

Learning Reinforcement Activity No. 11-1: CORRELATION ANALYSIS
Accomplish by December 7, 2020

Using the R software, solve the following problems as directed. Present your solution just as
it was presented in the examples in a .docx file, with file name
LRA11-1<LASTNAME>.docx. Summarize your solution for each problem with a conclusion. For the R
script, save it as LRA11-1<LASTNAME>.R.

1. The dean of a business school undertakes a study to relate starting salary after
graduation to grade point average (GPA) in major courses. He then randomly selects
records of 10 students shown in the accompanying table. Perform a correlation analysis
using Pearson correlation. (5 points)
Student                                1   2   3   4   5   6   7   8   9   10
GPA                                    78  81  85  87  75  79  83  88  85  77
Starting salary (thousands of pesos)   17  18  18  28  17  22  30  34  30  28

2. The following are the numbers of sales contacts made by 9 salespersons during a week
and the number of sales made. Perform a correlation analysis using the Pearson
correlation coefficient and interpret. (5 points)
Salesperson      1   2   3    4    5   6   7   8   9
Sales contacts   71  64  100  105  75  79  82  68  110
Sales            25  14  37   40   18  10  22  12  42

3. The owner of a car wants to study the relationship between the age of a car and its
selling price. Listed below is a random sample of 12 used cars sold at a dealership during
the last year. Perform a correlation analysis using (a) Pearson correlation coefficient; (b)
Spearman rank correlation coefficient. Interpret these results. (10 points)
Car                     1    2    3    4    5    6     7    8    9    10   11   12
Age (years)             9    7    11   12   8    7     8    11   10   12   6    6
Selling Price ($1000)   8.1  6.0  3.6  4.0  5.0  10.0  7.6  8.0  8.0  6.0  8.6  8.0

Congratulations! You just completed Module 11 Unit 1. In the next unit, we use the
association to predict the value of the dependent variable.
