
Correlation and Regression Analysis

Relationship between variables


Two variables are said to be related if the movement (change in the values) in one variable is associated
with the movement (change in the values) in another. The relationship is positive if movement takes place
in the same direction i.e., the increase (or decrease) in the value of one variable is associated with an
increase (or decrease) in the value of the other or vice versa. If this movement takes place in the opposite
direction i.e., increase (or decrease) in the value of one variable causes a decrease (or increase) in the
value of the other or vice versa, the variables are said to be negatively related.
 The distribution in which we consider two variables simultaneously for each item of the series is
known as bivariate distribution.
Some examples of series of positive correlation are:
 Heights and weights of students in a class;
 Household income and expenditure;
 Investment and Profit;
 Levels of fertilizer and yield of crops
 Amount of rainfall and yield of crops.
Some examples of series of negative correlation are:
 Price and demand of goods.
 Supply and Price of goods.
 Volume and pressure of perfect gas;

Dependent and independent variable


A variable is said to be dependent if its values depend on the values of the other variable(s). Variables whose values do not depend on the values of other variables are said to be independent variables. Sometimes the independent variables are termed causal variables and the dependent variables are termed effect variables.

Correlation Analysis

The word Correlation is made of Co- (meaning "together"), and Relation. Correlation analysis is a
statistical technique used to determine the strength of linear relationship between two variables.

 The concept of 'correlation' is a statistical tool for studying the relationship between two
variables, and correlation analysis comprises the various methods and techniques used for studying
and measuring the extent of the relationship between the two variables.

The correlation coefficient is a measure of the strength of the linear relationship between two
quantitative variables. In other words, a measure of the degree of linear relationship is called the
coefficient of correlation. We use the Greek letter ρ (rho) to represent the population correlation
coefficient and r to represent the sample correlation coefficient. Correlation analysis is applicable
when the variables are interdependent.

If X and Y are two variables then the correlation coefficient ρXY between the variables is defined as:

ρXY = (Covariance between X and Y) / √[(Variance of X)(Variance of Y)] = σXY / √(σ²X · σ²Y)

Where, σXY = Covariance between X and Y,
σ²X = Variance of X,
σ²Y = Variance of Y

The sample estimate of ρXY, denoted by r, is obtained as:

r = (Sample covariance between X and Y) / √[(Sample variance of X)(Sample variance of Y)] = SP(x,y) / √[SS(x)·SS(y)]

  = [Σxy − (Σx)(Σy)/n] / √{[Σx² − (Σx)²/n][Σy² − (Σy)²/n]}

Where, (xi, yi), i = 1, 2, 3, . . . . ., n are the n paired observations on X and Y.
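The sample formula above can be checked with a short Python sketch (the function name pearson_r is ours, not from the text):

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient: r = SP(x, y) / sqrt(SS(x) * SS(y))."""
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n      # SP(x, y)
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n                   # SS(x)
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n                   # SS(y)
    return sp / math.sqrt(ss_x * ss_y)
```

For data lying exactly on a rising line, r is +1; for data on a falling line, r is -1.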


If the values of one variable are plotted against other, the resulting scatter exhibits the relationship
between the variables.

Scatter diagram and relationship between variables

A scatter diagram is a graph used to represent and analyze the relationship between two variables. It is
constructed by plotting the pairs of observations on the two variables, with one variable along the X-axis and the
other along the Y-axis (Fig. 1). It is the pictorial representation of the relationship between two variables and
the best way to explore whether any relationship exists and, if it does, to identify the
nature of the relationship – positive, negative, linear or non-linear. The scatter diagram not only helps to
identify the relationship between variables and its nature, it also helps to detect any outliers in
the data set.

Scatter diagrams and Correlation Coefficients

The scatter diagrams below show how different patterns of data produce different degrees of
correlation.

[Eight scatter diagrams of X vs Y, summarized here by their captions:]

Perfect positive correlation (r = +1.0)
Strong positive correlation (r = +0.95)
Weak positive correlation (r = +0.41)
No correlation (r = 0)
No correlation (r = 0)
Perfect negative correlation (r = -1.0)
Strong negative correlation (r = -0.94)
Weak negative correlation (r = -0.44)

Interpretation of correlation coefficient (r)


Scatter plot: A graph of a set of data pairs (x, y) used to determine whether there is a relationship
between the variables x and y (see Fig. 1).

When the two sets of data are strongly linked together we say they have a High Correlation.

 Correlation is Positive when the values increase together, and


 Correlation is Negative when one value decreases as the other increases

1. Perfect positive: If r = + 1, there is a perfect positive linear relationship between x and y values;
all data points fall exactly on a straight line. The slope of the line is positive.

2. Strong Positive correlation: A set of data pairs (x, y) for which as x increases, y tends to increase.
The closer r is to +1, say + 0.95, the stronger the positive association between the two variables.
 A high value of r does not necessarily mean that the variables are related. For example, one may
obtain a correlation coefficient of 0.90 between the height of farmers and farm size but these two
variables are not related at all. Such correlation is termed as spurious correlation.

3. Weak Positive Correlation: The value of Y increases slightly as the value of X increases. The farther r is
from +1, say +0.41, the weaker the positive association between the two variables.

 The correlation becomes weaker as the data points become more scattered.

4. No correlation: A set of data pairs (x, y) for which there is no clear pattern between x and y. There
is no linear relation among the points of the scatter diagram.

 If the data points fall in a random pattern, then the correlation is equal to zero.

 If r = 0, either the variables are independent or the relationship between the variables is not linear.
 If two variables are independent, then r = 0.
 If r = 0, it does not mean that the variables are not related. This simply says that the relationship is
not linear. For example, if Y is related to X by Y = X2, the value of r will be zero although the
relationship between X and Y is perfect and it is quadratic.
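The quadratic case mentioned above is easy to verify numerically; for x values symmetric about zero and y = x², the linear correlation coefficient is exactly zero even though the relationship is perfect (helper name ours):

```python
import math

def pearson_r(x, y):
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]   # perfect quadratic relation Y = X^2
r = pearson_r(x, y)         # linear correlation misses it: r = 0
```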

5. Perfect negative: If r=-1, there is a perfect negative linear relation between x and y values; all
points lie on the line. The slope of the line is negative.

6. Strong Negative correlation: A set of data pairs (x, y) for which as x increases, y tends to decrease.
The closer r is to -1, say -0.94, the stronger the negative association between the two variables.

7. Weak Negative Correlation: The value of Y decreases slightly as the value of X increases. The farther r is
from -1, say -0.44, the weaker the negative association between the two variables.

Properties of the linear correlation coefficient:

1. The linear correlation coefficient is always between -1 and +1 i.e., −1≤ r ≤ +1


2. The value of r is independent of change of origin and scale of measurement.
3. Correlation coefficient is the geometric mean of the two regression coefficients
4. Correlation coefficient is symmetric with respect to the two variables, i.e., rxy = ryx
5. The linear correlation coefficient is a unitless measure of association.

....................................................................................
 Karl Pearson's correlation coefficient between x and y is

rxy = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² · Σ(yi − ȳ)²] ...................(1)

Let ui = (xi − a)/h, where a is origin and h is scale of variable xi
⇒ xi − x̄ = h(ui − ū) ................... (2)

Again vi = (yi − b)/k, where b is origin and k is scale of variable yi
⇒ yi − ȳ = k(vi − v̄) ................... (3)

Using (2) and (3) in (1) we get

rxy = hk·Σ(ui − ū)(vi − v̄) / √[h²Σ(ui − ū)² · k²Σ(vi − v̄)²] = Σ(ui − ū)(vi − v̄) / √[Σ(ui − ū)² · Σ(vi − v̄)²] = ruv

Thus the value of r is independent of change of origin and scale of measurement.
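The invariance to origin and scale can also be checked numerically; a minimal sketch with positive scale factors h and k (helper name and data ours):

```python
import math

def pearson_r(x, y):
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)

x = [2, 4, 7, 9]
y = [5, 11, 14, 20]
u = [(xi - 5) / 2 for xi in x]    # change of origin a = 5, scale h = 2
v = [(yi - 10) / 3 for yi in y]   # change of origin b = 10, scale k = 3
r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)            # equals r_xy (h, k > 0)
```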
Test of Significance

1. Test on correlation coefficient:

H0: ρ = 0
H1: ρ ≠ 0

r n2
The test statistic is t  which is distributed as Student‟s t with (n-2) df.
2
1 r

Conclusion: If computed t is greater than or equal to the tabulated t with same df and at 5% (or 1%) level
of significance, H0 may be rejected at 5% (or 1%) level of significance otherwise, H0 may be accepted.

Example:
H0: = 0 (the correlation in the population is 0)
H1: ≠ 0 (the correlation in the population is not 0)

r n 2
The test statistic is t  with n - 2 degrees of freedom
1  r2

Computing t, we get
r n -2 0.91 7  2 , say n = 7, r = 0.91
t   4.90
1 - r2 1  (0.91)2

The computed t (4.90) falls within the rejection region; therefore, we reject H0. This
means the correlation in the population is not zero.
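The computation in this example can be reproduced in a couple of lines (variable names ours):

```python
import math

# Test of H0: rho = 0 with n = 7 paired observations and sample r = 0.91
n, r = 7, 0.91
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # about 4.9, as in the worked example
```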

Simple Linear Regression
Regression analysis, in general sense, means the estimation or prediction of the unknown value of one
variable from the known value of the other variable. It is one of the most important statistical tools
which are extensively used in almost all sciences.

Prediction or estimation is one of the major problems in almost all the spheres of human activity. The
estimation or prediction of future production, consumption, prices, investments, sales, profits, income
etc. are of very great importance to business professionals. Similarly, population estimates and
population projections, GNP, Revenue and Expenditure etc. are essential for economists and efficient
planning of an economy.

Regression analysis was explained by M. M. Blair as follows: “Regression analysis is a mathematical


measure of the average relationship between two or more variables in terms of the original units of the
data.”

......................................................................................
Regression Model: A mathematical (or theoretical) equation that shows the linear relation between
the independent or explanatory variable and the dependent or response variable. The simple linear
regression model is Y = α + βX + ε,

Where
 ε is called the error term, with mean 0 and variance σ².
 Y is called the response (dependent) variable,
 X is called the predictor (explanatory) variable,
 α is the intercept of the true regression line.
 β is the slope of the true regression line, or regression coefficient.

 Slope is the average amount of change in Y for one unit of increase in X.


 Intercept is the value of Y when X = 0.
 e.g., Y = 2 + 5X
......................................................................................

In simple linear regression a single independent variable (x) is used to predict the value of a dependent
variable (y). The equation that describes how y is related to x and an error term is called the regression
model. The regression model used in simple linear regression follows.

Simple linear regression model


y  α  βx  ε ................. (1)

The variables in the model are Y, the response variable; X, the predictor variable; and ε, the residual
error, which is an unmeasured variable. The parameters in the model are α, the Y-intercept, and β, the
regression coefficient.

In practice, the parameter values are not known, and must be estimated using sample data. Sample
statistics ( a and b ) are computed as estimates of the population parameters α and β . The estimated
regression equation for simple linear regression follows.

Estimated simple linear regression equation


yˆ  a  bx ................
(2)
The values of a and b are obtained using the method of least squares. It can be verified that the best
6
estimate of β is given by
 x y
 xy 
n SP(XY)
̂  b  
2 SS(X)
2  x 
x 
n
The best estimate of α is given by ̂  a  y  bx

x y
x and y
n n

The regression line indicates the average value of the dependent variable, Y associated with a
particular value of the independent variable, X.

 The regression coefficient (b) is the average change in the dependent variable (Y) for a 1-unit
change in the independent variable (X). It is the slope of the regression line.
 b, the regression coefficient of y on x, indicates the change in the value of y (dependent
variable) for a unit change of x (independent variable):
 b = 0: y does not change for changing x.
 b = 1: one unit change of x results in one unit change in the value of y.
 b = 2: one unit change of x results in two units change in the value of y.

Properties of regression coefficient:

1. Regression coefficients are independent of change of origin but not of scale of measurement.
2. The geometric mean of two regression coefficients is the correlation coefficient.
3. The arithmetic mean of the regression coefficients is greater than the correlation coefficient.
4. If one of the regression coefficients is greater than one the other must be less than one and vice
versa.
........................................................................

 Regression coefficient of y on x is

byx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² ...................(1)

Let ui = (xi − a)/h, where a is origin and h is scale of variable xi
⇒ xi − x̄ = h(ui − ū) ................... (2)

Again vi = (yi − b)/k, where b is origin and k is scale of variable yi
⇒ yi − ȳ = k(vi − v̄) ................... (3)

Using (2) and (3) in (1) we get

byx = hk·Σ(ui − ū)(vi − v̄) / [h²Σ(ui − ū)²] = (k/h)·bvu

Similarly, we can prove that bxy = Σ(xi − x̄)(yi − ȳ) / Σ(yi − ȳ)² = (h/k)·buv

 Regression coefficients are independent of change of origin but not of scale of measurement.

rxy = √(byx · bxy)

So, the geometric mean of the two regression coefficients is the correlation coefficient.
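The geometric-mean property can be verified numerically on any small data set (data and names ours):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
sp = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(p * p for p in x) - sum(x) ** 2 / n
ss_y = sum(q * q for q in y) - sum(y) ** 2 / n
b_yx = sp / ss_x                      # regression coefficient of y on x
b_xy = sp / ss_y                      # regression coefficient of x on y
r = sp / math.sqrt(ss_x * ss_y)       # correlation coefficient
# sqrt(b_yx * b_xy) equals |r|
```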

The Coefficient of Determination

The coefficient of determination (denoted by R2) is a key output of regression analysis. It is


interpreted as the proportion of the variance in the dependent variable that is predictable from the
independent variable.

Coefficient of determination = R² = (SS due to regression) / (Total SS)

Where, SS due to regression = b·SP(x,y)
Total SS = SS(y)

(R² is often multiplied by 100 and reported as a percentage.)

 The coefficient of determination, R2, measures the proportion of the total variation in the
dependent variable that is explained by the independent variable.
 The coefficient of determination ranges from 0 to 1.
 An R2 of 0 means that the dependent variable cannot be predicted from the independent
variable.
 An R2 of 1 means the dependent variable can be predicted without error from the independent
variable i.e., the regression line explains 100% of the variation in the dependent variable.
 An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An
R2 of 0.10 means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20
means that 20 percent is predictable; and so on.
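The definition above translates into a few lines of Python (data and names ours); note that for simple linear regression R² also equals r²:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
sp = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n   # SP(x, y)
ss_x = sum(p * p for p in x) - sum(x) ** 2 / n                # SS(x)
ss_y = sum(q * q for q in y) - sum(y) ** 2 / n                # SS(y) = Total SS
b = sp / ss_x
r2 = b * sp / ss_y   # (SS due to regression) / (Total SS)
```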

Application of Regression Analysis

There are numerous fields of study where regression analysis can be applied.

 An economist may want to ascertain how the price of a commodity (P) relates to its supply (S)
and to obtain a measure of that relationship.
 An educator may want to predict the performance of a student in a course, say mathematics, from
his scores in an intelligence test.
 An agriculturist may want to forecast yield per hectare for a rice farm from the amount of
fertilizer he has used and the number of times his field has been irrigated.
 A pathologist may want to predict the development of a certain fungal disease in rice plants on the
basis of seed treatment with different doses of a fungicide.

The objectives of regression analysis

The main objectives of regression analysis are:

 To determine whether one variable Y depends on another variable X and if these two variables
are related, to get a measure of the relationship
 To know the shape of the curve representing the relationship or dependence of Y on X,
 To predict the value of the dependent variable Y from some known values of the independent
variable X
 If there is more than one independent variable, to determine the contribution of the independent
variables to the dependent variable individually or collectively.

Types of Regression

1. Simple Regression – if there is only one independent variable.


2. Multiple Regression – if there is more than one independent variable.
3. Linear Regression – if the relationship between dependent and independent variable(s) is linear in
independent variable(s). e.g.,

 y =  + x  simple linear regression
 y =  + 1x1 + 2x2  multiple linear regression

4. Non-linear Regression – if the relationship between dependent and independent variable(s) is non-
linear in independent variable(s). e.g.,
 y = x
 y = x
 y = ex
 y =  + 1x + 2x2

Test on regression coefficient

H0: β = 0
H1: β ≠ 0

b β
The test statistic is t  , which is distributed as Student‟s t with (n-2) df.
s2
ss(x)

where s2  residual mean square


ss(y)  b  sp(xy)
=
n 2

ss(x)   x 2 
 x2
n
 y2
ss(y)   y 2 
n
 x y
sp(xy)   xy 
n
Conclusion: If computed t is greater than or equal to the tabulated t with same df and at 5% (or 1%) level
of significance, H0 is rejected at 5% (or 1%) level of significance otherwise H0 may be accepted.
Example:
Wheat yield(ton/ha) Fertilizer level(kg/ha)
4.0 100
5.0 200
5.0 300
7.0 400
6.5 500
6.5 600
8.0 700

The estimated regression equation is ŷ = a + bx.

We have b = SP(x,y)/SS(x) = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n] = 0.0059 and a = ȳ − b·x̄ = 3.64.

Thus the estimated regression line is ŷ = 3.64 + 0.0059x.
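The estimates for the wheat data can be reproduced directly from the table (variable names ours):

```python
x = [100, 200, 300, 400, 500, 600, 700]   # fertilizer level (kg/ha)
y = [4.0, 5.0, 5.0, 7.0, 6.5, 6.5, 8.0]   # wheat yield (t/ha)
n = len(x)
sp = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n   # SP(x, y)
ss_x = sum(p * p for p in x) - sum(x) ** 2 / n                # SS(x)
b = sp / ss_x                       # slope, about 0.0059
a = sum(y) / n - b * sum(x) / n     # intercept, about 3.64
```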


The slope of the estimated regression equation ( b  0.0059) is positive, implying that wheat yield (y)

9
increases 0.0059 t/ha for 1 kg change of fertilizer (x).

 The slope of a regression model represents the average change in Y per unit X.

Test on regression coefficient:

Hypothesis:
H0: β = 0
H1: β ≠ 0
b β
The test statistic is t  , with (n-2) d.f.
s2
ss(x)
= 5.23, 5 df

For 5 df and a two-tailed test, the critical value is t0.01 = 4.03.

Since t = 5.23 > 4.03, reject H0; conclude that the true slope is not zero.
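The test statistic for the wheat example can be computed from the same data (variable names ours):

```python
import math

x = [100, 200, 300, 400, 500, 600, 700]   # fertilizer level (kg/ha)
y = [4.0, 5.0, 5.0, 7.0, 6.5, 6.5, 8.0]   # wheat yield (t/ha)
n = len(x)
sp = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(p * p for p in x) - sum(x) ** 2 / n
ss_y = sum(q * q for q in y) - sum(y) ** 2 / n
b = sp / ss_x
s2 = (ss_y - b * sp) / (n - 2)     # residual mean square
t = b / math.sqrt(s2 / ss_x)       # test of H0: beta = 0; about 5.23
```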

Presentation and interpretation of the result:

[Figure: scatter plot of wheat yield (ton/ha) against fertilizer level (kg/ha) with fitted line
y = 3.64** + 0.0059**x, R² = 0.85]

Fig. 1: Relationship between fertilizer level and yield of wheat

The scatter diagram of yield against fertilizer level clearly indicated that the relationship between
fertilizer level and yield is linear and hence a linear regression of yield on fertilizer level was
estimated and the result is presented in Fig. 1. The regression of yield of wheat on fertilizer level was
obtained as Y = 3.64** + 0.0059**x (n = 7, R2 = 0.85). Both regression coefficient and intercept were
significant at 1% probability level. The coefficient of determination was 0.85 meaning that 85% of
total variation in yield of wheat is explained by the fertilizer level(x). For every 1 kg/ha increase in
fertilizer level, there was an increase of grain yield by 0.0059 t/ha. The result indicates that under
similar climate and soils, the farmers can harvest more than 3.64 t/ha of wheat without application of
any fertilizer.

Distinctions between correlation and regression:

Sl. no. | Correlation | Regression
1. | Correlation means the linear relationship of two or more variables with each other. | Regression means determination of the estimated value of the dependent variable with respect to one or more independent variables.
2. | It measures the degree of linear relationship between two variables. | It does not measure the degree of relationship between two variables.
3. | Does not determine any cause-and-effect relationship. | It determines the cause-and-effect relationship of the variables.
4. | In correlation, the variables are dependent on each other. | In regression, the influenced variable is called the dependent variable and the other variables are independent.
5. | In correlation, both variables must be random. | In regression, only the dependent variable needs to be random.
6. | Correlation coefficient is independent of change of origin and scale. | Regression coefficients are independent of change of origin but not of scale.
7. | rxy = ryx | bxy ≠ byx
8. | −1 ≤ r ≤ +1 | It has no upper and lower limits.

1. Correlation quantifies the degree to which two variables are related. Correlation does
not fit a line through the data.

2. In correlation, we see only the relationship between two or more variables without
being concerned about which variables are independent or dependent; and in regression
analysis it is very important to know which is the dependent variable and which is the
independent variable to estimate the regression.

3. With correlation we don't have to think about cause and effect. We simply quantify
how well two variables relate to each other. With regression, we do have to think about
cause and effect as the regression line is determined as the best way to predict Y from
X.

4. From correlation we can only get an index describing the linear relationship between
two variables; in regression we can predict the relationship between more than two
variables and can use it to identify which variables x can predict the outcome variable
y.

Example:

Height of plant(cm) Age of plant(days)


3.5 1
16 3
20.9 5
22.8 6
29 9
32.9 11
34.2 12
38 15
41.3 17
44.6 20
46.7 21

The estimated regression equation is ŷ = a + bx.

We have b = SP(x,y)/SS(x) = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n] = 1.88 and a = ȳ − b·x̄ = 9.44.

Thus the estimated regression line is ŷ = 9.44 + 1.88x.
The slope of the estimated regression equation ( b  1.88) is positive, implying that height of plant (y)
increases 1.88 cm for 1 day change in age of plant (x).
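These estimates, along with R², can be reproduced from the plant-height data (variable names ours):

```python
age = [1, 3, 5, 6, 9, 11, 12, 15, 17, 20, 21]                        # days
height = [3.5, 16, 20.9, 22.8, 29, 32.9, 34.2, 38, 41.3, 44.6, 46.7]  # cm
n = len(age)
sp = sum(p * q for p, q in zip(age, height)) - sum(age) * sum(height) / n
ss_x = sum(p * p for p in age) - sum(age) ** 2 / n
ss_y = sum(q * q for q in height) - sum(height) ** 2 / n
b = sp / ss_x                            # slope, about 1.88
a = sum(height) / n - b * sum(age) / n   # intercept, about 9.44
r2 = b * sp / ss_y                       # coefficient of determination, about 0.942
```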

[Figure: scatter plot of height of plant (cm) against age (days) with fitted line
y = 9.44** + 1.88**x (n = 11, R² = 0.942)]

Fig. 1: Relationship between age of plant and height of plant

The scatter diagram of height against age of plant clearly indicated that the relationship between age
and height of plant is linear and hence a linear regression of height on age was estimated and the result
is presented in Fig. 1. The regression of height of plant on age was obtained as Y = 9.44** + 1.88**x
(n =11, R2 = 0.942). Both regression coefficient and intercept were significant at 1% probability level.
The coefficient of determination was 0.942, meaning that 94.2% of the total variation in height of plant is
explained by the age (x). For every 1-day increase in age, there was an increase of height by 1.88 cm.
