
Correlation Analysis

Definition:
Correlation is a statistical measure that indicates the extent to which two or more variables
fluctuate together. A positive correlation indicates the extent to which those variables increase or
decrease in parallel; a negative correlation indicates the extent to which one variable increases as
the other decreases.
When the fluctuation of one variable reliably predicts a similar fluctuation in another variable,
there’s often a tendency to think that means that the change in one causes the change in the other.
However, correlation does not imply causation. There may be an unknown factor that influences
both variables similarly.
Correlation is a statistical technique that can show whether, and how strongly, pairs of variables
are related. Although some correlations are fairly obvious, your data may contain unsuspected
ones. You may also suspect correlations exist but not know which are the strongest. An intelligent
correlation analysis can lead to a greater understanding of your data.
• Correlation is positive or direct when the values increase together, and
• Correlation is negative when one value decreases as the other increases; this is also called
inverse or contrary correlation.

[Figure: five scatter diagrams, labelled a to e]
If the points plotted all lay on a straight line we would have perfect correlation, which could be
positive or negative, as shown in the diagrams above.
a. Strong positive correlation between x and y. The points lie close to a straight line, with y
increasing as x increases.
b. Weak positive correlation between x and y. The trend shown is that y increases as x
increases, but the points are not close to a straight line.
c. No correlation between x and y; the points are distributed randomly on the graph.
d. Weak negative correlation between x and y. The trend shown is that y decreases as x
increases, but the points do not lie close to a straight line.
e. Strong negative correlation. The points lie close to a straight line, with y decreasing as x
increases.
Correlation can have a value between −1 and +1:
1. +1 is a perfect positive correlation
2. 0 is no correlation (the values don't seem linked at all)
3. −1 is a perfect negative correlation
The value shows how strong the correlation is (not how steep the line is), and whether it is
positive or negative.
Types of Correlation:
Correlation can be of following types -
A. Positive and Negative Correlation: If changes in two connected series are in the same
direction, i.e. an increase in one variable is associated with an increase in the other variable,
the correlation is said to be positive. For example, as a father's age increases, his son's age
also increases.
If the two related series change in opposite directions, i.e. an increase in one variable is
associated with a corresponding decrease in the other variable, the correlation is said to be
negative.
B. Linear and Non-Linear: If the amount of change in one variable tends to bear a constant
ratio to the amount of change in the other variable, the correlation is said to be linear: we get
a straight line if the values of these series are plotted on graph paper.
Correlation is called non-linear or curvilinear if the amount of change in one variable does
not bear a constant ratio to the amount of change in the other variable. For example, if we
double the amount of rainfall, the production would not necessarily be doubled.
C. Simple and Multiple Correlation: When only two variables are studied, it is called simple
correlation.
When the combined effect of two or more independent variables on one dependent data series
is studied, it is called multiple correlation. For example, studying the effect of rain, soil, and
temperature on potato production per acre is a multiple correlation.

Correlation: Degree of relationship (-1 <= r <= 1)


Correlation= r Degree of relationship
0 Absence of relationship
0.00001-0.29 Very low
0.30-0.49 Low
0.50-0.69 Moderate
0.70-0.89 High
0.90-0.99 Very High
1 Perfect
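The strength bands in the table above can be expressed as a small helper function. This is an illustrative sketch: the function name `correlation_strength` is ours, and the band boundaries follow the table (treating the "0.00001–0.29" band as everything below 0.30 in absolute value).

```python
# Sketch: map a correlation coefficient r to the strength labels in the
# table above (function name and exact boundary handling are ours).
def correlation_strength(r: float) -> str:
    if not -1 <= r <= 1:
        raise ValueError("r must lie in [-1, 1]")
    a = abs(r)  # strength depends on magnitude, not sign
    if a == 0:
        return "Absence of relationship"
    if a < 0.30:
        return "Very low"
    if a < 0.50:
        return "Low"
    if a < 0.70:
        return "Moderate"
    if a < 0.90:
        return "High"
    if a < 1:
        return "Very High"
    return "Perfect"

print(correlation_strength(0.761))  # High
```

Note that the sign of r only gives the direction; the label comes from |r|, so r = −0.95 and r = 0.95 are both "Very High".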

Correlation Coefficients Pearson and Spearman:


Correlation is a bivariate analysis that measures the strength of association between two
variables. In statistics, the value of the correlation coefficient varies between +1 and −1. When
the value of the correlation coefficient is close to ±1, there is said to be a perfect (or near-perfect)
degree of association between the two variables. As the correlation coefficient moves towards 0,
the relationship between the two variables becomes weaker.
Pearson correlation: Pearson correlation is widely used in statistics to measure the degree of
the relationship between linearly related variables. For example, in the stock market, if we want to
measure how two commodities are related to each other, Pearson correlation is used to measure
the degree of relationship between the two commodities. The following formula is used to
calculate the Pearson correlation coefficient:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²]

Example – Correlation of statistics and science tests


A study is conducted involving 10 students to investigate the association between statistics and
science test marks. The question here is: is there a relationship between the marks gained by the
10 students in the statistics and science tests?

Table: Student marks in statistics and science


Students 1 2 3 4 5 6 7 8 9 10
Statistics 20 23 8 29 14 12 11 21 17 18
Science 20 25 11 24 23 16 12 21 22 26
Note: the marks are out of 30.
Suppose that x denotes the statistics marks and y the science marks.
Table: Calculating the equation parameters

Students    x     y    (x − x̄)  (y − ȳ)  (x − x̄)²  (y − ȳ)²  (x − x̄)(y − ȳ)
1          20    20      2.7       0       7.29        0           0
2          23    25      5.7       5      32.49       25          28.5
3           8    11     -9.3      -9      86.49       81          83.7
4          29    24     11.7       4     136.89       16          46.8
5          14    23     -3.3       3      10.89        9          -9.9
6          12    16     -5.3      -4      28.09       16          21.2
7          11    12     -6.3      -8      39.69       64          50.4
8          21    21      3.7       1      13.69        1           3.7
9          17    22     -0.3       2       0.09        4          -0.6
10         18    26      0.7       6       0.49       36           4.2
Sum       173   200      0.0       0     356.10     252.00       228.00
Mean     17.3    20

Calculating the Pearson correlation coefficient:

r = 228 / √(356.1 × 252) = 0.7611

Alternative solution
The Pearson correlation coefficient is also given by the following equation, using raw sums:

r = (Σxy − (Σx)(Σy)/n) / √[(Σx² − (Σx)²/n)(Σy² − (Σy)²/n)]

Table: Calculating the equation parameters


Students    x     y     x²     y²     xy
1 20 20 400 400 400
2 23 25 529 625 575
3 8 11 64 121 88
4 29 24 841 576 696
5 14 23 196 529 322
6 12 16 144 256 192
7 11 12 121 144 132
8 21 21 441 441 441
9 17 22 289 484 374
10 18 26 324 676 468
sum 173 200 3349 4252 3688
Mean 17.30 20

Calculating the Pearson correlation coefficient by substituting into the aforementioned equation:

r = (3688 − (173)(200)/10) / √[(3349 − (173)²/10)(4252 − (200)²/10)] = 228 / √(356.1 × 252) = 0.7611

The Pearson correlation coefficient r = 0.761, exactly the same as the result from the first equation.
The calculation shows a strong positive correlation (0.761) between the students' statistics and
science marks. This means that as the statistics marks increase, the science marks increase as
well: generally, a student with a high mark in statistics also has a high mark in science, and vice
versa.
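The two forms of the Pearson formula can be checked against each other with a short sketch. This uses only the data from the table above; the variable names are ours.

```python
from math import sqrt

# Sketch: verify the Pearson r for the statistics/science marks using both
# the deviation formula and the raw-sums formula shown above.
x = [20, 23, 8, 29, 14, 12, 11, 21, 17, 18]   # statistics marks
y = [20, 25, 11, 24, 23, 16, 12, 21, 22, 26]  # science marks
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Deviation form: r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² Σ(y − ȳ)²)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # 228
sxx = sum((xi - mx) ** 2 for xi in x)                      # 356.1
syy = sum((yi - my) ** 2 for yi in y)                      # 252
r_dev = sxy / sqrt(sxx * syy)

# Raw-sums form: r = (Σxy − ΣxΣy/n) / √((Σx² − (Σx)²/n)(Σy² − (Σy)²/n))
sx, sy = sum(x), sum(y)
num = sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n
den = sqrt((sum(xi**2 for xi in x) - sx**2 / n) *
           (sum(yi**2 for yi in y) - sy**2 / n))
r_raw = num / den

print(round(r_dev, 4), round(r_raw, 4))  # 0.7611 0.7611
```

Both forms give r = 0.7611, matching the hand calculation; the raw-sums form is just an algebraic rearrangement of the deviation form, so agreement is guaranteed up to floating-point rounding.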
Spearman rank correlation: Spearman rank correlation is a non-parametric test that is used to
measure the degree of association between two variables. It was developed by Spearman, hence
the name. The Spearman rank correlation test makes no assumptions about the distribution of the
data and is the appropriate correlation analysis when the variables are measured on a scale that
is at least ordinal.
The following formula is used to calculate the Spearman rank correlation coefficient:

𝜌 = 1 − 6Σdᵢ² / (n(n² − 1))
Where:
𝜌 = Spearman rank correlation coefficient
dᵢ = the difference between the ranks of corresponding values Xᵢ and Yᵢ
n = number of values in each data set.
The Spearman correlation coefficient, 𝜌, can take values from +1 to −1. A 𝜌 of +1 indicates a
perfect association of ranks, a 𝜌 of zero indicates no association between ranks, and a 𝜌 of −1
indicates a perfect negative association of ranks. The closer 𝜌 is to zero, the weaker the
association between the ranks.
An example of calculating Spearman's correlation
To calculate a Spearman rank-order correlation coefficient on data without any ties, use the
following data:
Students 1 2 3 4 5 6 7 8 9 10
Statistics 20 23 8 29 14 12 11 21 17 18
Science 20 25 11 24 23 16 12 21 22 26

Solution:

Table (2.5) Calculating the Parameters of Spearman rank Equation:


Statistics   Science   Rank          Rank        |d|   d²
(marks)      (marks)   (statistics)  (science)
20 20 4 7 3 9
23 25 2 2 0 0
8 11 10 10 0 0
29 24 1 3 2 4
14 23 7 4 3 9
12 16 8 8 0 0
11 12 9 9 0 0
21 21 3 6 3 9
17 22 6 5 1 1
18 26 5 1 4 16

Where |d| is the absolute difference between ranks and d² is that difference squared. Then
calculate the following:
∑ 𝑑𝑖2 = 9 + 0 + 0 + 4 + 9 + 0 + 0 + 9 + 1 + 16 = 48
Then substitute into the main equation as follows:

𝜌 = 1 − (6 × 48) / (10(10² − 1)) = 1 − 288/990 = 1 − 0.2909
𝜌 = 0.71
Hence, we have 𝜌 = 0.71; this indicates a strong positive relationship between the ranks
individuals obtained in the statistics and science exams. This means the higher you ranked in
statistics, the higher you ranked in science, and vice versa.
So, the Pearson r correlation coefficient = 0.761 and Spearman's correlation = 0.71 for the same
data which means that correlation coefficients for both techniques are approximately equal.
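The rank-difference calculation above can be sketched in a few lines. The ranking helper assigns rank 1 to the largest value, matching the table; the `ranks` function is ours and relies on the data having no ties.

```python
# Sketch: Spearman's rho for the same marks via the rank-difference
# formula rho = 1 - 6*sum(d_i^2) / (n*(n^2 - 1)). Valid here because
# the data contains no tied values.
x = [20, 23, 8, 29, 14, 12, 11, 21, 17, 18]   # statistics marks
y = [20, 25, 11, 24, 23, 16, 12, 21, 22, 26]  # science marks
n = len(x)

def ranks(values):
    # Rank 1 for the largest value, as in the table above (no ties assumed).
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

rx, ry = ranks(x), ranks(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of squared rank differences
rho = 1 - 6 * d2 / (n * (n**2 - 1))              # 1 - 288/990
print(d2, round(rho, 2))  # 48 0.71
```

This reproduces Σdᵢ² = 48 and 𝜌 ≈ 0.71, in line with the Pearson r = 0.761 for the same data.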

2.6 Exercises
A study is conducted involving 14 infants to investigate the association between gestational age at
birth, measured in weeks, and birth weight, measured in grams.

Table (2.6): Gestational age and weight at birth

Infant No.          1     2     3     4     5     6     7     8     9    10    11    12    13    14
Gestational age  34.7    36  29.3  40.1  35.7  42.4  40.3  37.3  40.9  38.3  38.5  41.4  39.7  39.7
Birth weight     1895  2030  1440  2835  3090  3827  3260  2690  3285  2920  3430  3657  3685  3345

Applying an appropriate method, estimate the association between gestational age and infant birth
weight.
(Guide values: 𝑟 = 0.882, 𝜌 = 0.779)
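As a check on the first guide value, the Pearson r for the exercise data can be sketched as below. Spearman's 𝜌 is deliberately not computed here: the gestational age 39.7 occurs twice, so a correct 𝜌 needs tie-corrected (averaged) ranks, which is left as part of the exercise.

```python
from math import sqrt

# Sketch: verify the exercise's guide value r = 0.882 for gestational
# age vs birth weight using the Pearson deviation formula.
age = [34.7, 36, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9, 38.3, 38.5, 41.4, 39.7, 39.7]
wt  = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285, 2920, 3430, 3657, 3685, 3345]
n = len(age)
ma, mw = sum(age) / n, sum(wt) / n

sxy = sum((a - ma) * (w - mw) for a, w in zip(age, wt))
sxx = sum((a - ma) ** 2 for a in age)
syy = sum((w - mw) ** 2 for w in wt)
r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # 0.882
```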
Regression Analysis
Definition:
Regression analysis is one of the most commonly used statistical techniques in the social and
behavioral sciences as well as in the physical sciences. It involves identifying and evaluating the
relationship between a dependent variable and one or more independent variables, which are also
called predictor or explanatory variables. It is particularly useful for assessing and adjusting for
confounding. A model of the relationship is hypothesized, and estimates of the parameter values are
used to develop an estimated regression equation. Various tests are then employed to determine
whether the model is satisfactory. If the model is deemed satisfactory, the estimated regression
equation can be used to predict the value of the dependent variable given values for the
independent variables.
Linear regression explores relationships that can be readily described by straight lines or their
generalization to many dimensions. A surprisingly large number of problems can be solved by
linear regression, and even more by means of transformations of the original variables that result
in linear relationships among the transformed variables.
When there is a single continuous dependent variable and a single independent variable, the
analysis is called simple linear regression. This analysis assumes that there is a linear
association between the two variables. Multiple regression is used to learn more about the
relationship between several independent (predictor) variables and a dependent (criterion)
variable.
Independent variables are characteristics that can be measured directly; these variables are also
called predictor or explanatory variables, used to predict or to explain the behavior of the
dependent variable.
The dependent variable is a characteristic whose value depends on the values of the independent
variables.
Reliability and Validity:
• Does the model make intuitive sense? Is the model easy to understand and interpret?
• Are all coefficients statistically significant? (p-values less than .05)
• Are the signs associated with the coefficients as expected?
• Does the model predict values that are reasonably close to the actual values?
• Is the model sufficiently sound? (High R-square, low standard error, etc.)

Objectives of Regression Analysis:


Regression analysis is used to explain variability in the dependent variable by means of one or
more independent or control variables, to analyze relationships among variables in order to
answer the question of how much the dependent variable changes with changes in each of the
independent variables, and to forecast or predict the value of the dependent variable based on the
values of the independent variables.
The primary objective of regression is to develop a linear relationship between a response variable
and explanatory variables for the purpose of prediction; it assumes that such a functional linear
relationship exists, and when that assumption fails, alternative approaches (such as functional
regression) may be superior.
Simple Regression Model:
Simple linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables. In a cause-and-effect relationship, the
independent variable is the cause and the dependent variable is the effect. Least squares linear
regression is a method for predicting the value of a dependent variable y based on the value of an
independent variable x. Under the least squares method, we try to find the line that minimizes
the sum of the squares of the errors, where the error is the difference between the actual value of
Y and the estimated value of Y, denoted 𝒀̂.
• One variable, denoted (x), is regarded as the predictor, explanatory, or independent
variable.
• The other variable, denoted (y), is regarded as the response, outcome, or dependent
variable.

Simple linear regression equation: 𝒀̂ = 𝒂 + 𝒃𝑿
Where
Y = Dependent Variable
X = Independent/explanatory/predictor variable
a = Constant/Intercept (y-axis)
b = Regression coefficient/ slope

Intercept or Constant
 The intercept is the point at which the regression line intercepts the y-axis.
 The intercept gives the mean of the dependent variable when the slope(s) are zero.
 If the slope(s) are not zero, then the intercept equals the mean of the dependent variable
minus slope × mean of the independent variable.
Slope
 The slope is the change in the dependent variable per unit change in the independent variable.
 A zero slope means that the independent variable has no influence on the dependent
variable.
 If the regression coefficient b is positive, the correlation coefficient r will be positive, and if b
is negative, the coefficient of correlation r will also be negative.
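The last point follows because r = b × (standard deviation of x / standard deviation of y), so b and r always share a sign. A tiny numeric check, using made-up illustrative data (not from the text):

```python
from math import sqrt

# Sketch: the least-squares slope b and the correlation r always share a
# sign, since r = b * (sd of x / sd of y); checked on two toy data sets.
def slope_and_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)  # slope b, correlation r

b_pos, r_pos = slope_and_r([1, 2, 3, 4], [2, 4, 5, 9])   # rising trend
b_neg, r_neg = slope_and_r([1, 2, 3, 4], [9, 5, 4, 2])   # falling trend
print(b_pos > 0 and r_pos > 0, b_neg < 0 and r_neg < 0)  # True True
```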

Distinction between correlation and regression

Correlation (r)                                      Regression (Y = a + bX)
1. Measures the relationship in relative terms       1. Measures the relationship in absolute terms
2. Cause-effect relationship cannot be identified    2. Cause-effect relationship can be identified
3. Forecasting cannot be made                        3. Forecasting can be made
Regression Analysis: An Example

Data on sales and profit of a company are given below:

Sales (Lakh Tk.) Profit (Thousand Tk.)

10 15
12 20
14 20
16 25
20 30

i. Find the equation of regression of profit on sales


ii. Estimate the relationship between sales and profit in relative term
iii. Assess the explanatory power of the independent variable
iv. Estimate the profit when the volume of sales is Tk. 50 lakh
v. Calculate standard error of estimate
Interpret the results

Solutions:

i.The required equation is:


Y=a+bX
Where
Y = Profit (Dependent/Criterion Variable)
X = Sales (Independent/explanatory/predictor variable)
a = Constant/Intercept (y-axis)
b = Regression coefficient/ slope

X Y XY X2 Y2
10 15 150 100 225
12 20 240 144 400
14 20 280 196 400
16 25 400 256 625
20 30 600 400 900
ΣX = 72   ΣY = 110   ΣXY = 1670   ΣX² = 1096   ΣY² = 2550
X̄ = ΣX/n = 72/5 = 14.4

Ȳ = ΣY/n = 110/5 = 22

We know,
b = (ΣXY − nX̄Ȳ) / (ΣX² − nX̄²) = (1670 − (5)(14.4)(22)) / (1096 − (5)(14.4)²)
  = (1670 − 1584) / (1096 − 1036.8) = 86/59.2 = 1.45

a = Ȳ − bX̄ = 22 − (1.45)(14.4) = 22 − 20.88 = 1.12

Hence the desired equation is: Y = 1.12 + 1.45 X

Interpretation:
If the volume of sales increases by Tk. 1 lakh, the profit of the company increases by Tk. 1.45
thousand.

ii. The relationship between the variables in relative term can be estimated with the help of the
coefficient of correlation, r.

r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
   = ((1.12)(110) + (1.45)(1670) − (5)(22)²) / (2550 − (5)(22)²)
   = (123.2 + 2421.5 − 2420) / (2550 − 2420)
   = 124.7 / 130
   = 0.9592

r = √r² = √0.9592 = 0.9794

Interpretation:
r = 0.9794 indicates a very high degree of positive relationship between sales and profit.

iii. The explanatory power of the independent variable can be assessed by the coefficient of
determination (r2), r2 = 0.9592.

Interpretation:
r2 = 0.9592 indicates that 95.92% of the variation in the profit can be explained by the variation in
the sales. Or, Volume of sales explains 95.92% of the variation in the profit.

iv. When X = Tk. 50 lakh, Y = Tk. 73.62 thousand:

Y = 1.12 + 1.45 × 50 = 73.62

Interpretation:
When the volume of sales is Tk. 50 lakh, the profit will be about Tk. 73.62 thousand.

v. Standard error of estimate:

se = √[(ΣY² − aΣY − bΣXY) / (n − 2)]
   = √[(2550 − (1.12)(110) − (1.45)(1670)) / 3]
   = √[(2550 − 123.2 − 2421.5) / 3]
   = √1.7667
   = 1.3291

Interpretation:
The standard deviation of the data points around the regression line is about 1.3291 units.
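The whole worked example can be re-run in a short sketch. One caveat worth making explicit: the text rounds b to 1.45 before computing a, r², the prediction, and se, so its figures (1.12, 0.9592, 73.62, 1.3291) differ slightly from the unrounded results below (about 1.081, 0.961, 73.72, 1.30).

```python
from math import sqrt

# Sketch: the sales/profit regression with unrounded coefficients; the
# text's 1.12 intercept and 73.62 prediction come from rounding b to 1.45.
X = [10, 12, 14, 16, 20]   # sales (lakh Tk.)
Y = [15, 20, 20, 25, 30]   # profit (thousand Tk.)
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

sxy = sum(x * y for x, y in zip(X, Y)) - n * mx * my   # 86
sxx = sum(x * x for x in X) - n * mx * mx              # 59.2
syy = sum(y * y for y in Y) - n * my * my              # 130

b = sxy / sxx               # slope, approx. 1.4527
a = my - b * mx             # intercept, approx. 1.081
r2 = sxy**2 / (sxx * syy)   # coefficient of determination, approx. 0.961
pred_50 = a + b * 50        # predicted profit at sales of Tk. 50 lakh
se = sqrt((1 - r2) * syy / (n - 2))  # standard error of estimate

print(round(b, 4), round(a, 3), round(r2, 3), round(pred_50, 1))
```

Keeping full precision until the final step, as here, is generally preferable to rounding intermediate coefficients.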
