Computing The Pearson Correlation Coefficient


One formula for the Pearson correlation coefficient r is as follows:

r = [nΣXY − (ΣX)(ΣY)] / √([nΣX² − (ΣX)²][nΣY² − (ΣY)²])   (10.1)

We present a second formula that is harder to compute but easier to interpret.

r = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² · Σ(Y − Ȳ)²)   (10.2)

Consider the Ad Spending example at the start of this chapter. Many of the (X, Y) points are simultaneously above average, since companies that have higher than average Advertising Spending also have higher than average Impressions. Both (X − X̄) and (Y − Ȳ) are positive for these companies. Therefore, the product (X − X̄)(Y − Ȳ) is positive for these companies. Most of the remaining companies have lower than average Spending and lower than average Impressions. Both (X − X̄) and (Y − Ȳ) are negative for these companies, but the product (X − X̄)(Y − Ȳ) is still positive! Hence the numerator in (10.2) tends to be a large positive number for the Ad Spending data.

If the points were sloped downwards, then high X-values tend to go with low Y-values, and the product (X − X̄)(Y − Ȳ) is negative for these points. This is partly how the correlation formula (10.2) works. The denominator terms have been put in to ensure that r does not go beyond -1 or +1.
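The two formulas can be checked against each other numerically. Below is a minimal Python sketch (the data and function names are hypothetical, for illustration only) that computes r with both the computational form (10.1) and the deviation form (10.2) and confirms they agree:

```python
from math import sqrt

def pearson_r_deviation(xs, ys):
    """Deviation form of r, matching formula (10.2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def pearson_r_computational(xs, ys):
    """Computational form of r, matching formula (10.1)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx, syy = sum(x * x for x in xs), sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

spend = [1, 2, 3, 4, 5]          # hypothetical ad spending
impressions = [2, 4, 5, 4, 6]    # hypothetical impressions

r1 = pearson_r_deviation(spend, impressions)
r2 = pearson_r_computational(spend, impressions)
# The two forms are algebraically identical, so r1 and r2 agree
# up to floating-point rounding (about 0.85 here).
```

The deviation form makes the sign of r easy to interpret, while the computational form only needs the running sums of x, y, xy, x², and y².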

What is Pearson Correlation?


Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson Correlation. The full name is the Pearson Product Moment Correlation or PPMC. It shows the linear relationship between two sets of data. In simple terms, it answers the question, "Can I draw a line graph to represent the data?" Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter "r" for a sample.

The Pearson correlation coefficient can be calculated by hand or on a graphing calculator such as the TI-89.
What are the Possible Values for the Pearson Correlation?
The results will be between -1 and 1. You will very rarely see 0, -1 or 1. You’ll get a number somewhere in between those values. The closer the value of r gets to zero, the greater the variation of the data points around the line of best fit.
High correlation: .5 to 1.0 or -0.5 to -1.0.
Medium correlation: .3 to .5 or -0.3 to -0.5.
Low correlation: .1 to .3 or -0.1 to -0.3.

Potential problems with Pearson correlation.


The PPMC is not able to tell the difference between dependent and independent variables. For example, if you are trying to find the correlation between a high
calorie diet and diabetes, you might find a high correlation of .8. However, you could also work out the correlation coefficient formula with the variables
switched around. In other words, you could say that diabetes causes a high calorie diet. That obviously makes no sense. Therefore, as a researcher you have to be
aware of the data you are plugging in. In addition, the PPMC will not give you any information about the slope of the line; it only tells you whether there is a relationship.
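Because r is symmetric in its two arguments, swapping the variables cannot change the result, which is exactly why the PPMC cannot tell cause from effect. A small Python sketch (with invented numbers, for illustration only) demonstrates this:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r in deviation form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

calories = [1800, 2400, 3000, 3600]   # hypothetical daily calorie intake
risk     = [0.10, 0.25, 0.30, 0.45]   # hypothetical diabetes-risk scores

# Swapping the variables changes nothing: r is symmetric,
# so it cannot tell which variable drives the other.
print(pearson_r(calories, risk) == pearson_r(risk, calories))   # True
```

The correlation itself is strongly positive either way; only outside knowledge about the variables can say which one (if either) is the cause.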
Real Life Example
Pearson correlation is used in thousands of real life situations. For example, scientists in China wanted to know how genetically different weedy rice populations are. The goal was to find out the evolutionary potential of the rice. Pearson’s correlation between the two groups was analyzed. It showed a positive Pearson Product Moment correlation of between 0.783 and 0.895 for weedy rice populations. This figure is quite high, which suggested a fairly strong relationship.
If you’re interested in seeing more examples of PPMC, you can find several studies on the National Institute of Health’s Openi website, which shows results of studies as varied as breast cyst imaging and the role that carbohydrates play in weight loss.
Pearson Correlation: Definition and Easy Steps for Use was last modified: April 1st, 2017 by Andale
By Andale | October 31, 2012 | Correlation Coefficients, Definitions, Pearson's Correlation Coefficient |

The Pearson correlation coefficient is just one of many types of coefficients in the field of statistics. The following lesson provides the formula, examples of when the coefficient is used, its significance and a quiz to assess your knowledge on the topic.

Pearson Correlation Coefficient


The Pearson correlation coefficient is a very helpful statistical formula that measures the strength of the relationship between variables. In the field of statistics, this formula is often referred to as the Pearson R test. When conducting a statistical test between two variables, it is a good idea to compute a Pearson correlation coefficient value to determine just how strong the relationship is between those two variables.
Formula
In order to determine how strong the relationship is between two variables, a formula must be followed to produce what is referred to as the coefficient value. The coefficient value can range between -1.00 and 1.00. If the coefficient value is in the negative range, then that means the relationship between the variables is negatively correlated, or as one value increases, the other decreases. If the value is in the positive range, then that means the relationship between the variables is positively correlated, or both values increase or decrease together. Let's look at the formula for conducting the Pearson correlation coefficient value.
Step one: Make a chart with your data for two variables, labeling the variables (x) and (y), and add three more columns labeled (xy), (x^2), and (y^2). A simple data chart might look like this:

Person   Age (x)   Score (y)   (xy)   (x^2)   (y^2)
1
2
3
More data would be needed, but only three samples are shown for purposes of example.
Step two: Complete the chart using basic multiplication of the variable values.

Person   Age (x)   Score (y)   (xy)   (x^2)   (y^2)
1        20        30          600    400     900
2        24        20          480    576     400
3        17        27          459    289     729


Step three: After you have multiplied all the values to complete the chart, add up all of the columns from top to bottom.

Person   Age (x)   Score (y)   (xy)   (x^2)   (y^2)
1        20        30          600    400     900
2        24        20          480    576     400
3        17        27          459    289     729
Total    61        77          1539   1265    2029


Step four: Use this formula to find the Pearson correlation coefficient value.

r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])

Step five: Once you complete the formula above by plugging in all the correct values, the result is your coefficient value! If the value is a negative number, then there is a negative correlation of relationship strength, and if the value is a positive number, then there is a positive correlation of relationship strength. Note: The above examples only show data for three people, but the ideal sample size to calculate a Pearson correlation coefficient should be more than ten people.
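Steps one through five can be reproduced in a few lines of Python using the data from the chart above; the running totals match the Total row and feed straight into the formula:

```python
from math import sqrt

ages   = [20, 24, 17]   # x
scores = [30, 20, 27]   # y

n = len(ages)
sx, sy = sum(ages), sum(scores)                              # 61, 77
sxy = sum(x * y for x, y in zip(ages, scores))               # 1539
sxx = sum(x * x for x in ages)                               # 1265
syy = sum(y * y for y in scores)                             # 2029

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 2))   # -0.74
```

For this tiny sample the numerator is 3(1539) − 61(77) = −80, giving a negative coefficient: older people in this chart tended to have lower scores.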

Examples
Let's say you were analyzing the relationship between your participants' age and reported level of income. You're curious as to whether there is a positive or negative relationship between someone's age and their income level. After conducting the test, your Pearson correlation coefficient value is +.20. Therefore, you would have a slightly positive correlation between the two variables, but the relationship is weak. You could conclude there is a weak positive correlation between one's age and their income: as people grow older, their income tends to increase slightly as well.
Perhaps you were interested in learning more about the relationship strength of your participants' anxiety score and the number of hours they work each week. After conducting the test, your Pearson correlation coefficient value is -.80. Therefore, you would have a negative correlation between the two variables, and the strength of the relationship is strong. You could confidently conclude there is a strong negative correlation between one's anxiety score and how many hours a week they report working: those who scored high on anxiety would tend to report fewer hours of work per week, while those who scored lower on anxiety would tend to report more hours of work each week.

Significance
A discussion on the Pearson correlation coefficient wouldn't be complete if we didn't talk about statistical significance. When conducting statistical tests, statistical significance must be established to show that the observed result is unlikely to have occurred by chance alone.

Spearman's Rank-Order Correlation


This guide will tell you when you should use Spearman's rank-order correlation to analyse your data, what assumptions
you have to satisfy, how to calculate it, and how to report it. If you want to know how to run a Spearman correlation in
SPSS Statistics, go to our guide here.

When should you use the Spearman's rank-order correlation?


The Spearman's rank-order correlation is the nonparametric version of the Pearson product-moment correlation.
Spearman's correlation coefficient (ρ, also signified by rs) measures the strength and direction of association between two ranked variables.
What are the assumptions of the test?

You need two variables that are either ordinal, interval or ratio (see our Types of Variable guide if you need clarification).
Although you would normally hope to use a Pearson product-moment correlation on interval or ratio data, the Spearman
correlation can be used when the assumptions of the Pearson correlation are markedly violated. However, Spearman's
correlation determines the strength and direction of the monotonic relationship between your two variables rather than
the strength and direction of the linear relationship between your two variables, which is what Pearson's correlation
determines.

What is a monotonic relationship?

A monotonic relationship is a relationship that does one of the following: (1) as the value of one variable increases, so
does the value of the other variable; or (2) as the value of one variable increases, the other variable value decreases.
Examples of monotonic and non-monotonic relationships are presented in the diagram below:

Why is a monotonic relationship important to Spearman's correlation?

Spearman's correlation measures the strength and direction of monotonic association between two variables. Monotonicity is "less restrictive" than linearity. For example, the middle image above shows a relationship that is monotonic, but not linear.

A monotonic relationship is not strictly an assumption of Spearman's correlation. That is, you can run a Spearman's
correlation on a non-monotonic relationship to determine if there is a monotonic component to the association.
However, you would normally pick a measure of association, such as Spearman's correlation, that fits the pattern of the
observed data. That is, if a scatterplot shows that the relationship between your two variables looks monotonic you would
run a Spearman's correlation because this will then measure the strength and direction of this monotonic relationship. On
the other hand if, for example, the relationship appears linear (assessed via scatterplot) you would run a Pearson's
correlation because this will measure the strength and direction of any linear relationship. You will not always be able to
visually check whether you have a monotonic relationship, so in this case, you might run a Spearman's correlation
anyway.

How to rank data?

In some cases your data might already be ranked, but often you will find that you need to rank the data yourself (or
use SPSS Statistics to do it for you). Thankfully, ranking data is not a difficult task and is easily achieved by working
through your data in a table. Let us consider the following example data regarding the marks achieved in a maths and
English exam:
Exam      Marks
English   56  75  45  71  61  64  58  80  76  61
Maths     66  70  40  60  65  56  59  77  67  63

The procedure for ranking these scores is as follows:

First, create a table with four columns and label them as below:

English (mark) Maths (mark) Rank (English) Rank (maths)

56 66 9 4

75 70 3 2

45 40 10 10

71 60 4 7

61 65 6.5 5

64 56 5 9

58 59 8 8

80 77 1 1

76 67 2 3

61 63 6.5 6
You need to rank the scores for maths and English separately. The score with the highest value should be labelled "1"
and the lowest score should be labelled "10" (if your data set has more than 10 cases then the lowest score will be how
many cases you have). Look carefully at the two individuals that scored 61 in the English exam (highlighted in bold).
Notice their joint rank of 6.5. This is because when you have two identical values in the data (called a "tie"), you need to
take the average of the ranks that they would have otherwise occupied. We do this because, in this example, we have no
way of knowing which score should be put in rank 6 and which score should be ranked 7. Therefore, you will notice that
the ranks of 6 and 7 do not exist for English. These two ranks have been averaged ((6 + 7)/2 = 6.5) and assigned to each
of these "tied" scores.
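Tie-averaged ranking is easy to script. The following Python sketch (the helper name is my own, for illustration) ranks the English marks from the table, giving the highest score rank 1 and averaging ranks across ties:

```python
def rank_descending(scores):
    """Rank scores so the highest gets 1; tied scores share the average rank."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for s in scores:
        # Positions this score occupies in the descending order (1-based).
        positions = [i + 1 for i, v in enumerate(ordered) if v == s]
        ranks.append(sum(positions) / len(positions))
    return ranks

english = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
print(rank_descending(english))
# [9.0, 3.0, 10.0, 4.0, 6.5, 5.0, 8.0, 1.0, 2.0, 6.5]
# The two scores of 61 share rank (6 + 7) / 2 = 6.5, matching the table.
```

This quadratic-time approach is fine for small fieldwork samples; for large data sets you would sort once and assign ranks in a single pass.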



What is the definition of Spearman's rank-order correlation?

There are two methods to calculate Spearman's correlation depending on whether: (1) your data does not have tied ranks or (2) your data has tied ranks. The formula for when there are no tied ranks is:

rs = 1 − (6 Σdᵢ²) / (n(n² − 1))

where dᵢ = difference in paired ranks and n = number of cases. The formula to use when there are tied ranks is the Pearson correlation coefficient applied to the ranks:

rs = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²)

where xᵢ and yᵢ are the paired ranks, and x̄ and ȳ are the mean ranks.
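On data without tied ranks the two methods agree exactly. Here is a short Python sketch (with arbitrary example ranks, chosen for illustration) computing both:

```python
from math import sqrt

def spearman_no_ties(rx, ry):
    """rs = 1 - 6*sum(d^2) / (n(n^2 - 1)); valid only when there are no ties."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def pearson_on_ranks(rx, ry):
    """Pearson's formula applied to the ranks; works with or without ties."""
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den

rx = [1, 2, 3, 4, 5]   # example ranks with no ties
ry = [2, 1, 4, 3, 5]
print(spearman_no_ties(rx, ry), pearson_on_ranks(rx, ry))   # 0.8 0.8
```

When ties are present the shortcut formula is only approximate, which is why the Pearson-on-ranks form is the one to use for tied data.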

Spearman's Rank correlation coefficient is used to identify and test the strength of a relationship between two sets of data. It is often used as a statistical method to aid with either proving or disproving a hypothesis, e.g. that the depth of a river does not progressively increase the further it is from the river bank.

The Spearman's Rank Correlation Coefficient is used to discover the strength of a link between two sets of data. This example looks at the
strength of the link between the price of a convenience item (a 50cl bottle of water) and distance from the Contemporary Art Museum in El
Raval, Barcelona.
Example: The hypothesis tested is that prices should decrease with distance from the key area of gentrification surrounding the
Contemporary Art Museum. The line followed is Transect 2 in the map below, with continuous sampling of the price of a 50cl bottle water at
every convenience store.

Map to show the location of environmental gradients for transect lines in El Raval, Barcelona
 
Hypothesis
We might expect to find that the price of a bottle of water decreases as distance from the Contemporary Art Museum increases. Higher
property rents close to the museum should be reflected in higher prices in the shops.
The hypothesis might be written like this:
The price of a convenience item decreases as distance from the Contemporary Art Museum increases.
The more objective scientific research method is always to assume that no such price-distance relationship exists and to express the null
hypothesis as:
there is no significant relationship between the price of a convenience item and distance from the Contemporary Art Museum.
What can go wrong?
Having decided upon the wording of the hypothesis, you should consider whether there are any other factors that may influence the study.
Some factors that may influence prices may include:
 The type of retail outlet. You must be consistent in your choice of retail outlet. For example, bars and restaurants often charge
significantly more for water than a convenience store. You should decide which type of outlet to use and stick with it for all your data
collection.
 Some shops have different prices for the same item: a high tourist and lower local price, dependent upon the shopkeeper's perception
of the customer.
 Shops near main roads may charge more than shops in less accessible back streets, due to the higher rents demanded for main road
retail sites.
 The positive spread effects from other nearby areas of gentrification or from competing areas of tourist attraction.
 The negative spread effects from nearby areas of urban decay.
 Higher prices may be charged during the summer when demand is less flexible, making seasonal comparisons less reliable.
 Cumulative sampling may distort the expected price-distance gradient if several shops cluster within a short area along the transect
line followed by a considerable gap before the next group of retail outlets.
You should mention such factors in your investigation.
Data collected (see data table below) suggests a fairly strong negative relationship as shown in this scatter graph:

Scatter graph to show the change in the price of a convenience item with distance from the Contemporary Art Museum. Roll
over image to see trend line.
The scatter graph shows the possibility of a negative correlation between the two variables and the Spearman's rank correlation technique
should be used to see if there is indeed a correlation, and to test the strength of the relationship.
Spearman’s Rank correlation coefficient
A correlation can easily be drawn as a scatter graph, but the most precise way to compare several pairs of data is to use a statistical test -
this establishes whether the correlation is really significant or if it could have been the result of chance alone.
Spearman’s Rank correlation coefficient is a technique which can be used to summarise the strength and direction (negative or positive) of a
relationship between two variables.
The result will always be between +1 and -1.
Method - calculating the coefficient
 Create a table from your data.
 Rank the two data sets. Ranking is achieved by giving the ranking '1' to the biggest number in a column, '2' to the second biggest
value and so on. The smallest value in the column will get the lowest ranking. This should be done for both sets of measurements.
 Tied scores are given the mean (average) rank. For example, the three tied scores of 1 euro in the example below are ranked fifth in
order of price, but occupy three positions (fifth, sixth and seventh) in a ranking hierarchy of ten. The mean rank in this case is calculated
as (5+6+7) ÷ 3 = 6.
 Find the difference in the ranks (d): This is the difference between the ranks of the two values on each row of the table. The rank of
the second value (price) is subtracted from the rank of the first (distance from the museum).
 Square the differences (d²) to remove negative values, and then sum them (Σd²).
 
Convenience Store   Distance from CAM (m)   Rank distance   Price of 50cl bottle (€)   Rank price   Difference between ranks (d)   d²
1 50 10 1.80 2 8 64

2 175 9 1.20 3.5 5.5 30.25

3 270 8 2.00 1 7 49

4 375 7 1.00 6 1 1

5 425 6 1.00 6 0 0

6 580 5 1.20 3.5 1.5 2.25

7 710 4 0.80 9 -5 25

8 790 3 0.60 10 -7 49

9 890 2 1.00 6 -4 16

10 980 1 0.85 8 -7 49

Σd² = 285.5

Data Table: Spearman's Rank Correlation


 Calculate the coefficient (R) using the formula below. The answer will always be between 1.0 (a perfect positive correlation) and -1.0
(a perfect negative correlation).
When written in mathematical notation the Spearman Rank formula looks like this:

R = 1 − (6 Σd²) / (n³ − n)

Now to put all these values into the formula.

 Find Σd² by adding up all the values in the d² column. In our example this is 285.5. Multiplying this by 6 gives 1713.
 Now for the bottom line of the equation. The value n is the number of sites at which you took measurements. This, in our example, is 10. Substituting these values into n³ − n we get 1000 − 10 = 990.
 We now have the formula: R = 1 − (1713/990), which gives a value for R:
1 − 1.73 = −0.73
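The arithmetic above can be checked with a few lines of Python, using the d² column from the data table:

```python
# d² column from the data table, one value per convenience store.
d_squared = [64, 30.25, 49, 1, 0, 2.25, 25, 49, 16, 49]
n = 10   # number of sites sampled

sum_d2 = sum(d_squared)               # 285.5
R = 1 - (6 * sum_d2) / (n**3 - n)     # 1 - 1713/990
print(round(R, 2))   # -0.73
```

Any arithmetic slip in the d² column (a common fieldwork error) shows up immediately when the scripted total disagrees with the hand total.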

What does this R value of -0.73 mean?


The closer R is to +1 or -1, the stronger the likely correlation. A perfect positive correlation is +1 and a perfect negative correlation is -1.
The R value of -0.73 suggests a fairly strong negative relationship.

A further technique is now required to test the significance of the relationship.


The R value of -0.73 must be looked up on the Spearman Rank significance table below as follows:
 Work out the 'degrees of freedom' you need to use. This is the number of pairs in your sample minus 2 (n-2). In the example it is 8
(10 - 2).
 Now plot your result on the table.
 If it is below the line marked 5%, then it is possible your result was the product of chance and you must reject the hypothesis.
 If it is above the 0.1% significance level, then we can be 99.9% confident the correlation has not occurred by chance.
 If it is above 1%, but below 0.1%, you can say you are 99% confident.
 If it is above 5%, but below 1%, you can say you are 95% confident (i.e. statistically there is a 5% likelihood the result occurred by
chance).

In the example, the value 0.73 gives a significance level of slightly less than 5%. That means that the probability of the relationship you have found being a chance event is about 5 in 100. You can therefore be 95% confident that the relationship is genuine and not the result of chance. The reliability of your sample can be stated in terms of how many researchers completing the same study as yours would be expected to obtain similar results: 95 out of 100.
 The fact two variables correlate cannot prove anything - only further research can actually prove that one thing affects the other.
 Data reliability is related to the size of the sample. The more data you collect, the more reliable your result.

Click Spearman's Rank Significance Graph for a blank copy of the above significance graph.

A comparison of the Pearson and Spearman correlation methods
In This Topic

 What is correlation?
 Comparison of Pearson and Spearman coefficients
 Other nonlinear relationships

What is correlation?

A correlation coefficient measures the extent to which two variables tend to change together. The coefficient describes both the
strength and the direction of the relationship. Minitab offers two different correlation analyses:
Pearson product moment correlation
The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when
a change in one variable is associated with a proportional change in the other variable.

For example, you might use a Pearson correlation to evaluate whether increases in temperature at your production
facility are associated with decreasing thickness of your chocolate coating.

Spearman rank-order correlation


The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a
monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman
correlation coefficient is based on the ranked values for each variable rather than the raw data.

Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a
Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the
number of months they have been employed.

It is always a good idea to examine the relationship between variables with a scatterplot. Correlation coefficients only
measure linear (Pearson) or monotonic (Spearman) relationships. Other relationships are possible.
Comparison of Pearson and Spearman coefficients

The Pearson and Spearman correlation coefficients can range in value from −1 to +1. The Pearson correlation coefficient is +1 only when an increase in one variable is accompanied by an increase in the other variable by a consistent amount. This relationship forms a perfect line. The Spearman correlation coefficient is also +1 in this case.

Pearson = +1, Spearman = +1


If the relationship is that one variable increases when the other increases, but the amount is not consistent, the Pearson
correlation coefficient is positive but less than +1. The Spearman coefficient still equals +1 in this case.

Pearson = +0.851, Spearman = +1

When a relationship is random or non-existent, then both correlation coefficients are nearly zero.
Pearson = −0.093, Spearman = −0.093
If the relationship is a perfect line for a decreasing relationship, then both correlation coefficients are −1.

Pearson = −1, Spearman = −1


If the relationship is that one variable decreases when the other increases, but the amount is not consistent, then the
Pearson correlation coefficient is negative but greater than −1. The Spearman coefficient still equals −1 in this case.

Pearson = −0.799, Spearman = −1
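The increasing-but-inconsistent case is easy to reproduce. A minimal sketch (assuming NumPy and SciPy are available) using a strictly increasing but nonlinear relationship:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# A strictly increasing but nonlinear relationship: y grows much faster than x.
x = np.arange(1.0, 11.0)
y = np.exp(x)

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson  = {r_pearson:.3f}")   # positive but well below +1
print(f"Spearman = {r_spearman:.3f}")  # exactly +1: the ranks agree perfectly
```

Because the ranks of x and y match exactly, Spearman's coefficient is +1 even though the points are far from a straight line.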


Correlation values of −1 or +1 imply an exact linear relationship, like that between a circle's radius and circumference.
However, the real usefulness of correlation lies in quantifying less-than-perfect relationships. Finding that two variables are
correlated often motivates a regression analysis, which tries to describe this type of relationship more precisely.
Other nonlinear relationships

Pearson correlation coefficients measure only linear relationships. Spearman correlation coefficients measure only
monotonic relationships. So a meaningful relationship can exist even if the correlation coefficients are 0. Examine a
scatterplot to determine the form of the relationship.


Coefficient of 0
A scatterplot can show a very strong relationship even when the Pearson coefficient and Spearman coefficient are both approximately 0.
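The coefficient-of-0 case can be reproduced with a symmetric parabola. A sketch, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# A very strong but non-monotonic relationship: a symmetric parabola.
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2   # y is completely determined by x

r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)
print(f"Pearson  = {r_pearson:.3f}")    # ≈ 0
print(f"Spearman = {rho_spearman:.3f}") # ≈ 0
```

Even though y is a perfect function of x, neither coefficient detects the relationship, because the pattern is neither linear nor monotonic.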

 Minitab.co

Statistical analysis often uses probability distributions, and the two topics are often studied together. However, probability theory
contains much that is mostly of mathematical interest and not directly relevant to statistics. Moreover, many topics in statistics are
independent of probability theory.

Displaying and describing data


One of the first things we do in statistics is try to understand sets of data. This involves plotting the data in different ways and summarizing what
we see with measures of center (like mean and median) and measures of spread (like range and standard deviation). This topic focuses on concepts
that are often referred to as "descriptive statistics".

Statistics overview
Categorical data displays
Two-way tables for categorical data
Dot plots and frequency tables
Histograms
Comparing features of distributions
Stem-and-leaf plots
Line graphs
Mean and median: The basics
More on mean and median
Range, Interquartile range (IQR), Mean absolute deviation (MAD)
Box and whisker plots
Population variance and standard deviation
Sample variance and standard deviation

Modeling distributions of data


The normal distribution is the most commonly used model in all of statistics. Learn how to measure position using z-scores and find what percent
of data falls where in a normal distribution.

Describing location in a distribution


Normal distributions

Describing relationships in quantitative data


If we want to explore a relationship between two quantitative variables, we make a scatterplot of the data. The fun doesn't stop there. We describe
pattern, talk about the type of correlation we see, and sometimes fit a line to the data so we can use the pattern to make predictions.

Scatterplots and correlation


Regression
Residuals, least-squares regression, and r-squared

Designing studies
Study design focuses on collecting data properly and making the most valid conclusions we can based on how the data was collected. This topic
explores samples, surveys, and experiments.

Sampling and surveys


Experiments
Probability
Probability tells us how often some event will happen after many repeated trials. This topic covers theoretical, experimental, compound
probability, permutations, combinations, and more!

Basic theoretical probability


Probability using sample spaces
Experimental probability
Basic set operations
Addition rule for probability
Multiplication rule for independent events
Multiplication rule for dependent events
Conditional probability and independence
Counting principle and factorial
Permutations
Combinations
Combinatorics and probability

Random variables
Random variables can be any outcomes from some chance process, like how many heads will occur in a series of 20 flips. We calculate
probabilities of random variables and find the expected value for different types of random variables.

Discrete and continuous random variables and probability models


Expected value
Transforming and combining random variables
Binomial random variables
Poisson distribution

Sampling distributions
A sampling distribution shows every possible result a statistic can take in every possible sample from a population and how often each result
happens. This topic covers how sample proportions and sample means behave in repeated samples.

Sample proportions
Sample means
Confidence intervals (one sample)
Confidence intervals give us a range of plausible values for some unknown value based on results from a sample. This topic covers confidence
intervals for means and proportions.

Estimating a population proportion


Estimating a population mean

Significance tests (one sample)


Significance tests give us a formal process for using sample data to evaluate the likelihood of some claim about a population value. We calculate
p-values to see how likely a sample result is to occur by random chance, and we use p-values to make conclusions about hypotheses.

The idea of significance tests


Tests about a population proportion
Tests about a population mean

Significance tests and confidence intervals (two samples)


Learn how to apply what you know about confidence intervals and significance tests to situations that involve comparing two samples to see if
there is a significant difference between the two populations.

Comparing two proportions


Comparing two means

Inference for categorical data (chi-square tests)


Chi-square tests are a family of significance tests that give us ways to test hypotheses about distributions of categorical data. This topic covers
goodness-of-fit tests to see if sample data fits a hypothesized distribution, and tests for independence between two categorical variables.

Chi-square goodness-of-fit tests


Chi-square tests for homogeneity and association/independence

Advanced regression (inference and transforming)


Advanced regression will introduce you to regression methods when data has a nonlinear pattern.

Nonlinear regression
Analysis of variance (ANOVA)
Analysis of variance, also called ANOVA, is a collection of methods for comparing multiple means across different groups.

In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or
bivariate data. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often
refers to the extent to which two variables have a linear relationship with each other. Familiar examples of dependent phenomena include
the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its
price.
Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility
may produce less power on a mild day based on the correlation between electricity demand and weather. In this example, there is a causal
relationship, because extreme weather causes people to use more electricity for heating or cooling. However, in general, the presence of a
correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causation).
Formally, random variables are dependent if they do not satisfy a mathematical property of probabilistic independence. In informal
parlance, correlation is synonymous with dependence. However, when used in a technical sense, correlation refers to any of several
specific types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree
of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two
variables (which may be present even when one variable is a nonlinear function of the other). Other correlation coefficients have been
developed to be more robust than the Pearson correlation – that is, more sensitive to nonlinear relationships.[1][2][3] Mutual information can
also be applied to measure dependence between two variables.

[Figure: Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. The correlation reflects the noisiness and direction of
a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the
center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.]

Pearson's product-moment coefficient


Main article: Pearson product-moment correlation coefficient
The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or "Pearson's
correlation coefficient", commonly called simply "the correlation coefficient". It is obtained by dividing the covariance of the two variables by
the product of their standard deviations. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton.[4]
The population correlation coefficient ρX,Y between two random variables X and Y with expected values μX and μY and standard
deviations σX and σY is defined as:

ρX,Y = corr(X, Y) = cov(X, Y) / (σX σY) = E[(X − μX)(Y − μY)] / (σX σY)

where E is the expected value operator, cov means covariance, and corr is a widely used alternative notation for the correlation
coefficient.

External video: Lecture 21: Covariance and Correlation, Statistics 110, Harvard University, 49:25, April 29, 2013.

The Pearson correlation is defined only if both of the standard deviations are finite and nonzero. It is a corollary of the Cauchy–Schwarz
inequality that the correlation cannot exceed 1 in absolute value. The correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
The Pearson correlation is +1 in the case of a perfect direct (increasing) linear relationship (correlation), −1 in the case of a perfect
decreasing (inverse) linear relationship (anticorrelation),[5] and some value in the open interval (−1, 1) in all other cases, indicating the
degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated). The
closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true because the correlation coefficient
detects only linear dependencies between two variables. For example, suppose the random variable X is symmetrically distributed
about zero, and Y = X2. Then Y is completely determined by X, so that X and Y are perfectly dependent, but their correlation is zero;
they are uncorrelated. However, in the special case when X and Y are jointly normal, uncorrelatedness is equivalent to independence.
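The X and Y = X² case can be checked numerically. A minimal sketch, assuming NumPy is available:

```python
import numpy as np

# X symmetric about zero, Y = X^2: perfectly dependent yet uncorrelated.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)
x = np.concatenate([x, -x])    # force exact symmetry about zero
y = x ** 2

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(f"cov(X, Y) = {cov_xy:.6f}")   # ≈ 0, so corr(X, Y) ≈ 0
```

The covariance vanishes because E[X·X²] = E[X³] = 0 for a distribution symmetric about zero, even though Y is completely determined by X.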
If we have a series of n measurements of X and Y written as xi and yi for i = 1, 2, ..., n, then the sample correlation coefficient r can be
used to estimate the population Pearson correlation ρ between X and Y. The sample correlation coefficient is written:

r = Σ (xi − x̄)(yi − ȳ) / ((n − 1) sx sy)

where x̄ and ȳ are the sample means of X and Y, and sx and sy are the sample standard deviations of X and Y.
This can also be written as:

r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )
If x and y are results of measurements that contain measurement error, the realistic limits on the correlation coefficient are not
−1 to +1 but a smaller range.[6]
For the case of a linear model with a single independent variable, the coefficient of determination (R squared) is the square
of r, Pearson's product-moment coefficient.
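The sample correlation formula can be verified against a library implementation. A sketch, assuming NumPy is available (the data values are arbitrary):

```python
import numpy as np

def sample_correlation(x, y):
    """Pearson's r computed directly from the definition above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 7.0]

r = sample_correlation(x, y)
print(f"by hand : {r:.6f}")
print(f"numpy   : {np.corrcoef(x, y)[0, 1]:.6f}")  # same value
```

Both computations agree because np.corrcoef implements exactly this normalized-covariance formula.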

Rank correlation coefficients


Main articles: Spearman's rank correlation coefficient and Kendall tau rank correlation coefficient
Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank correlation coefficient
(τ) measure the extent to which, as one variable increases, the other variable tends to increase, without requiring that increase
to be represented by a linear relationship. If, as the one variable increases, the other decreases, the rank correlation
coefficients will be negative. It is common to regard these rank correlation coefficients as alternatives to Pearson's coefficient,
used either to reduce the amount of calculation or to make the coefficient less sensitive to non-normality in distributions.
However, this view has little mathematical basis, as rank correlation coefficients measure a different type of relationship than
the Pearson product-moment correlation coefficient, and are best seen as measures of a different type of association, rather
than as alternative measures of the population correlation coefficient.[7][8]
To illustrate the nature of rank correlation, and its difference from linear correlation, consider the following four pairs of
numbers (x, y):
(0, 1), (10, 100), (101, 500), (102, 2000).
As we go from each pair to the next pair x increases, and so does y. This relationship is perfect, in the sense that an
increase in x is always accompanied by an increase in y. This means that we have a perfect rank correlation, and both
Spearman's and Kendall's correlation coefficients are 1, whereas in this example Pearson product-moment correlation
coefficient is 0.7544, indicating that the points are far from lying on a straight line. In the same way
if y always decreases when x increases, the rank correlation coefficients will be −1, while the Pearson product-moment
correlation coefficient may or may not be close to −1, depending on how close the points are to a straight line. Although in
the extreme cases of perfect rank correlation the two coefficients are both equal (being both +1 or both −1), this is not
generally the case, and so values of the two coefficients cannot meaningfully be compared. [7] For example, for the three
pairs (1, 1) (2, 3) (3, 2) Spearman's coefficient is 1/2, while Kendall's coefficient is 1/3.
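The numbers quoted above can be reproduced directly. A sketch, assuming SciPy is available:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# The four pairs from the text: perfect rank correlation, imperfect linearity.
x, y = [0, 10, 101, 102], [1, 100, 500, 2000]
r_p = pearsonr(x, y)[0]
r_s = spearmanr(x, y)[0]
r_k = kendalltau(x, y)[0]
print(f"Pearson = {r_p:.4f}, Spearman = {r_s:.4f}, Kendall = {r_k:.4f}")
# Pearson = 0.7544, Spearman = 1.0000, Kendall = 1.0000

# The three pairs (1, 1), (2, 3), (3, 2): the two rank coefficients differ.
r_s2 = spearmanr([1, 2, 3], [1, 3, 2])[0]
r_k2 = kendalltau([1, 2, 3], [1, 3, 2])[0]
print(f"Spearman = {r_s2:.4f}, Kendall = {r_k2:.4f}")
# Spearman = 0.5000, Kendall = 0.3333
```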

Other measures of dependence among random variables


See also: Pearson product-moment correlation coefficient §  Variants
The information given by a correlation coefficient is not enough to define the dependence structure between random
variables.[9] The correlation coefficient completely defines the dependence structure only in very particular cases, for
example when the distribution is a multivariate normal distribution. (See diagram above.) In the case of elliptical
distributions it characterizes the (hyper-)ellipses of equal density, however, it does not completely characterize the
dependence structure (for example, a multivariate t-distribution's degrees of freedom determine the level of tail
dependence).
Distance correlation[10][11] was introduced to address the deficiency of Pearson's correlation that it can be zero for dependent
random variables; zero distance correlation implies independence.
The Randomized Dependence Coefficient[12] is a computationally efficient, copula-based measure of dependence between
multivariate random variables. RDC is invariant with respect to non-linear scalings of random variables, is capable of
discovering a wide range of functional association patterns and takes value zero at independence.
The correlation ratio is able to detect almost any functional dependency, and the entropy-based mutual
information, total correlation and dual total correlation are capable of detecting even more general dependencies. These
are sometimes referred to as multi-moment correlation measures, in comparison to those that consider only
second-moment (pairwise or quadratic) dependence.
The polychoric correlation is another correlation applied to ordinal data that aims to estimate the correlation between
theorised latent variables.
One way to capture a more complete view of dependence structure is to consider a copula between them.
The coefficient of determination generalizes the correlation coefficient for relationships beyond simple linear regression.

Sensitivity to the data distribution


Further information: Pearson product-moment correlation coefficient §  Sensitivity to the data distribution
The degree of dependence between variables X and Y does not depend on the scale on which the variables are
expressed. That is, if we are analyzing the relationship between X and Y, most correlation measures are unaffected by
transforming X to a + bX and Y to c + dY, where a, b, c, and d are constants (b and d being positive). This is true of some
correlation statistics as well as their population analogues. Some correlation statistics, such as the rank correlation
coefficient, are also invariant to monotone transformations of the marginal distributions of X and/or Y.

Pearson/Spearman correlation coefficients between X and Y are shown when the two variables' ranges are unrestricted, and when the
range of X is restricted to the interval (0,1).
Most correlation measures are sensitive to the manner in which X and Y are sampled. Dependencies tend to be stronger if
viewed over a wider range of values. Thus, if we consider the correlation coefficient between the heights of fathers and
their sons over all adult males, and compare it to the same correlation coefficient calculated when the fathers are selected
to be between 165 cm and 170 cm in height, the correlation will be weaker in the latter case. Several techniques have
been developed that attempt to correct for range restriction in one or both variables, and are commonly used in meta-
analysis; the most common are Thorndike's case II and case III equations. [13]
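The range-restriction effect in the father/son example can be simulated. A sketch, assuming NumPy is available; the height model below (means, spreads, and the 0.5 slope) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
father = rng.normal(175.0, 7.0, n)             # simulated heights in cm
son = 0.5 * father + rng.normal(87.5, 5.0, n)  # correlated with father

r_full = np.corrcoef(father, son)[0, 1]

mask = (father >= 165.0) & (father <= 170.0)   # restrict the range of X
r_restricted = np.corrcoef(father[mask], son[mask])[0, 1]

print(f"all fathers        : r = {r_full:.3f}")        # ≈ 0.57
print(f"fathers 165-170 cm : r = {r_restricted:.3f}")  # noticeably weaker
```

Restricting the fathers to a narrow height band removes most of the variance in X, so the correlation computed within that band is much weaker, as the text describes.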
Various correlation measures in use may be undefined for certain joint distributions of X and Y. For example, the Pearson
correlation coefficient is defined in terms of moments, and hence will be undefined if the moments are undefined.
Measures of dependence based on quantiles are always defined. Sample-based statistics intended to estimate population
measures of dependence may or may not have desirable statistical properties such as being unbiased, or asymptotically
consistent, based on the spatial structure of the population from which the data were sampled.
Sensitivity to the data distribution can be used to an advantage. For example, scaled correlation is designed to use the
sensitivity to the range in order to pick out correlations between fast components of time series. [14] By reducing the range of
values in a controlled manner, the correlations on long time scale are filtered out and only the correlations on short time
scales are revealed.

Correlation matrices
See also: Covariance matrix §  Correlation matrix
The correlation matrix of n random variables X1, ..., Xn is the n  ×  n matrix whose i,j entry is corr(Xi, Xj). If the measures of
correlation used are product-moment coefficients, the correlation matrix is the same as the covariance matrix of
the standardized random variables Xi / σ (Xi) for i = 1, ..., n. This applies to both the matrix of population correlations (in
which case "σ" is the population standard deviation), and to the matrix of sample correlations (in which case "σ" denotes
the sample standard deviation). Consequently, each is necessarily a positive-semidefinite matrix. Moreover, the correlation
matrix is strictly positive definite if no variable can have all its values exactly generated as a linear combination of the
others.
The correlation matrix is symmetric because the correlation between Xi and Xj is the same as the correlation
between Xj and Xi.
A correlation matrix appears, for example, in one formula for the coefficient of multiple determination, a measure of
goodness of fit in multiple regression.
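The stated properties of a correlation matrix (symmetry, unit diagonal, positive semidefiniteness) are easy to check on simulated data, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(3, 500))        # three random variables, 500 samples
data[2] += 0.8 * data[0]                # make X3 correlated with X1

R = np.corrcoef(data)                   # the 3 x 3 correlation matrix

print(np.allclose(R, R.T))                    # True: symmetric
print(np.allclose(np.diag(R), 1.0))           # True: unit diagonal
print(np.linalg.eigvalsh(R).min() >= -1e-10)  # True: positive semidefinite
```

The matrix is strictly positive definite here because no variable is an exact linear combination of the others.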

Common misconceptions
Correlation and causality
Main article: Correlation does not imply causation
See also: Normally distributed and uncorrelated does not imply independent
The conventional dictum that "correlation does not imply causation" means that correlation cannot be used to infer a causal
relationship between the variables. [15] This dictum should not be taken to mean that correlations cannot indicate the
potential existence of causal relations. However, the causes underlying the correlation, if any, may be indirect and
unknown, and high correlations also overlap with identity relations (tautologies), where no causal process exists.
Consequently, establishing a correlation between two variables is not a sufficient condition to establish a causal
relationship (in either direction).
A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health
in people is less so. Does improved mood lead to improved health, or does good health lead to good mood, or both? Or
does some other factor underlie both? In other words, a correlation can be taken as evidence for a possible causal
relationship, but cannot indicate what the causal relationship, if any, might be.

Correlation and linearity

Four sets of data with the same correlation of 0.816


The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value
generally does not completely characterize their relationship. [16] In particular, if the conditional mean of Y given X, denoted
E(Y | X), is not linear in X, the correlation coefficient will not fully determine the form of E(Y | X).
The image on the right shows scatter plots of Anscombe's quartet, a set of four different pairs of variables created
by Francis Anscombe.[17] The four y variables have the same mean (7.5), variance (4.12), correlation (0.816) and
regression line (y = 3 + 0.5x). However, as can be seen on the plots, the distribution of the variables is very different. The
first one (top left) seems to be distributed normally, and corresponds to what one would expect when considering two
variables correlated and following the assumption of normality. The second one (top right) is not distributed normally; while
an obvious relationship between the two variables can be observed, it is not linear. In this case the Pearson correlation
coefficient does not indicate that there is an exact functional relationship: only the extent to which that relationship can be
approximated by a linear relationship. In the third case (bottom left), the linear relationship is perfect, except for
one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example
(bottom right) shows another example when one outlier is enough to produce a high correlation coefficient, even though
the relationship between the two variables is not linear.
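The shared correlation of the quartet can be verified directly. A sketch, assuming NumPy is available, using the data values as published by Anscombe (1973):

```python
import numpy as np

# Anscombe's quartet: sets 1-3 share the same x values; set 4 differs.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x, x, x, x4]

rs = []
for i, (xi, yi) in enumerate(zip(xs, ys), start=1):
    r = np.corrcoef(xi, yi)[0, 1]
    rs.append(r)
    print(f"set {i}: r = {r:.3f}")   # all four print r = 0.816
```

Despite the identical summary statistics, the four scatterplots look completely different, which is why the text insists on visual examination of the data.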
These examples indicate that the correlation coefficient, as a summary statistic, cannot replace visual examination of the
data. Note that the examples are sometimes said to demonstrate that the Pearson correlation assumes that the data follow
a normal distribution, but this is not correct.[4]
Bivariate normal distribution
If a pair (X, Y) of random variables follows a bivariate normal distribution, the conditional mean E(X | Y) is a linear function
of Y, and the conditional mean E(Y | X) is a linear function of X. The correlation coefficient r between X and Y, along with
the marginal means and variances of X and Y, determines this linear relationship:

E(Y | X) = μY + r (σY / σX)(X − μX)

where μX and μY are the expected values of X and Y, respectively, and σX and σY are the standard deviations of X and Y,
respectively.
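For a standard bivariate normal (both variables with mean 0 and standard deviation 1), the conditional-mean slope reduces to r itself, which can be checked by simulation, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(7)
r = 0.6
cov = [[1.0, r], [r, 1.0]]                      # standard bivariate normal
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=200_000).T

# Empirical E(Y | X): the fitted slope should match r * sigma_y / sigma_x = r.
slope, intercept = np.polyfit(x, y, 1)
print(f"fitted slope = {slope:.3f} (theory: {r})")
```

The least-squares line through the simulated cloud recovers the theoretical slope r·σY/σX, illustrating that the regression of Y on X is exactly linear in the bivariate normal case.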
