6 Correlation and Regression Analysis
Correlation Analysis
The word correlation is made up of co- (meaning "together") and relation. Correlation analysis is a
statistical technique used to determine the strength of the linear relationship between two variables.
It covers the methods and techniques used for studying and measuring the extent of that relationship.
The correlation coefficient is a measure of the strength (degree) of the linear relationship between two
quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation
coefficient and r to represent the sample correlation coefficient. Correlation analysis is applicable
when the variables are interdependent.
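If X and Y are two variables, the correlation coefficient $\rho_{XY}$ between them is defined as:
$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
The sample estimate of $\rho_{XY}$, denoted by r, is obtained as:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$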
A scatter diagram is a graph used to represent and analyze the relationship between two variables. It is
constructed by plotting the pairs of observations, with one variable along the X-axis and the other along
the Y-axis (Fig. 1). It is the best way to explore whether any relationship exists between two variables
and, if it does, to identify its nature: positive or negative, linear or non-linear. The scatter diagram
also helps to detect outliers in the data set.
The scatter diagrams below show how different patterns of data produce different degrees of
correlation.
[Figure: two scatter diagrams of X vs Y]
When the two sets of data are strongly linked together we say they have a High Correlation.
1. Perfect positive: If r = + 1, there is a perfect positive linear relationship between x and y values;
all data points fall exactly on a straight line. The slope of the line is positive.
2. Strong positive correlation: A set of data pairs (x, y) for which y tends to increase as x increases.
The closer r is to +1, say +0.95, the stronger the positive association between the two variables.
A high value of r does not necessarily mean that the variables are causally related. For example, one may
obtain a correlation coefficient of 0.90 between the height of farmers and farm size even though these two
variables have no real connection. Such a correlation is termed a spurious correlation.
3. Weak positive correlation: The value of Y increases slightly as the value of X increases. The farther
r is from +1, say 0.41, the weaker the positive association between the two variables.
The correlation becomes weaker as the data points become more scattered.
4. No correlation: A set of data pairs (x, y) for which there is no clear pattern between x and y. There
is no linear relation among the points of the scatter diagram.
If the data points fall in a random pattern, the correlation is close to zero.
If two variables are independent, then ρ = 0. The converse does not hold: r = 0 means only that there is
no linear relationship, not that the variables are unrelated. For example, if Y = X² and the X values are
symmetric about zero, r will be zero although the relationship between X and Y is perfect (it is
quadratic).
5. Perfect negative: If r=-1, there is a perfect negative linear relation between x and y values; all
points lie on the line. The slope of the line is negative.
6. Strong Negative correlation: A set of data pairs (x, y) for which as x increases, y tends to decrease.
The closer r is to -1, say -0.94, the stronger the negative association between the two variables.
7. Weak negative correlation: The value of Y decreases slightly as the value of X increases. The
farther r is from -1, say -0.44, the weaker the negative association between the two variables.
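The cases listed above can be checked numerically. The following is a minimal sketch, assuming small made-up data sets; the function name `pearson_r` and the x values are illustrative and do not come from these notes. It computes r for an exact line with positive slope, an exact line with negative slope, and a perfect quadratic relation on x values that are symmetric about zero.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products and sums of squares of deviations from the means
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Hypothetical x values, symmetric about zero (an assumption for illustration)
x = [-3, -2, -1, 0, 1, 2, 3]

r_pos = pearson_r(x, [2 * v + 1 for v in x])   # all points on a rising line
r_neg = pearson_r(x, [5 - 4 * v for v in x])   # all points on a falling line
r_quad = pearson_r(x, [v ** 2 for v in x])     # perfect but non-linear relation

print(r_pos, r_neg, r_quad)
```

The quadratic case shows why r near zero should be read as "no linear relationship" rather than "no relationship".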
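Karl Pearson's correlation coefficient between x and y is
$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \qquad (1)$$
Let $u_i = \dfrac{x_i - a}{h}$, where a is the origin and h is the scale of variable $x_i$. Then
$$x_i - \bar{x} = h(u_i - \bar{u}) \qquad (2)$$
Again, let $v_i = \dfrac{y_i - b}{k}$, where b is the origin and k is the scale of variable $y_i$. Then
$$y_i - \bar{y} = k(v_i - \bar{v}) \qquad (3)$$
Substituting (2) and (3) into (1), the factors h and k cancel (for h, k > 0), so $r_{xy} = r_{uv}$: the
correlation coefficient is independent of change of origin and scale.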
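The effect of the transformations $u_i = (x_i - a)/h$ and $v_i = (y_i - b)/k$ can be checked numerically. This is a minimal sketch under stated assumptions: the data, the origins a and b, and the scales h and k are all made up for illustration, and `pearson_r` is a hypothetical helper name. With positive h and k, the correlation computed from (u, v) should match the one computed from (x, y).

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Hypothetical observations (not from the notes)
x = [10, 20, 30, 40, 50]
y = [12, 15, 21, 24, 33]

# Assumed origins (a, b) and positive scales (h, k)
a, h = 30, 10
b, k = 20, 3

u = [(xi - a) / h for xi in x]
v = [(yi - b) / k for yi in y]

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)
print(r_xy, r_uv)  # the two values agree up to rounding error
```

Coding the data this way is mainly a convenience for hand calculation: smaller numbers, same r.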
H0: ρ = 0
H1: ρ ≠ 0
The test statistic is
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
which is distributed as Student's t with (n-2) df.
Conclusion: If the absolute value of the computed t is greater than or equal to the tabulated t with the
same df at the 5% (or 1%) level of significance, H0 is rejected at that level; otherwise, H0 is not rejected.
Example: Say n = 7 and r = 0.91.
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
The test statistic is $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ with n - 2 degrees of freedom.
Computing t, we get
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.91\sqrt{7-2}}{\sqrt{1-(0.91)^2}} \approx 4.91$$
The computed t (4.91) exceeds the tabulated t for 5 df at the 5% level (2.571), so it is within the
rejection region; therefore, we reject H0. This means the correlation in the population is not zero.
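The calculation in this example can be reproduced in a few lines. The sketch below uses the values stated in the example (r = 0.91, n = 7); the function name `t_for_correlation` is an assumption, and the critical value 2.571 is the two-sided 5% value for 5 df taken from a standard t table rather than computed.

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.91, 7               # values from the worked example
t = t_for_correlation(r, n)
df = n - 2

t_crit = 2.571               # tabulated two-sided 5% point for 5 df (from a t table)
reject_h0 = abs(t) >= t_crit

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```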
Simple Linear Regression
Regression analysis, in a general sense, means the estimation or prediction of the unknown value of one
variable from the known value of the other variable. It is one of the most important statistical tools
and is extensively used in almost all sciences.
Prediction or estimation is one of the major problems in almost all spheres of human activity. The
estimation or prediction of future production, consumption, prices, investments, sales, profits, income,
etc. is of great importance to business professionals. Similarly, population estimates and
population projections, GNP, revenue and expenditure, etc. are essential for economists and for the
efficient planning of an economy.
Regression model: A mathematical (or theoretical) equation that shows the linear relation between
the independent or explanatory variable and the dependent or response variable. The simple linear
regression model is
$$Y = \alpha + \beta X + \varepsilon$$
where
Y is the response (dependent) variable,
X is the predictor (explanatory) variable,
α is the intercept of the true regression line,
β is the slope of the true regression line, or regression coefficient,
ε is the error term, an unmeasured variable with mean 0 and variance σ².
In simple linear regression a single independent variable (x) is used to predict the value of a dependent
variable (y). The equation that describes how y is related to x and an error term is called the regression
model. The variables in the model are Y, X and ε; the parameters are α (the Y-intercept) and β (the
regression coefficient).
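In practice, the parameter values are not known and must be estimated using sample data. The sample
statistics a and b are computed as estimates of the population parameters α and β. The estimated
regression equation for simple linear regression is
$$\hat{y} = a + bx$$
where
$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x}, \qquad \bar{x} = \frac{\sum x}{n} \ \text{and} \ \bar{y} = \frac{\sum y}{n}$$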
The regression line indicates the average value of the dependent variable, Y, associated with a
particular value of the independent variable, X.
The regression coefficient (b) of y on x is the average change in the dependent variable (Y) for a 1-unit
change in the independent variable (X). It is the slope of the regression line.
b = 0: y does not change as x changes.
b = 1: a one-unit change in x results in a one-unit change in the value of y.
b = 2: a one-unit change in x results in a two-unit change in the value of y.
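The meaning of the slope can be illustrated with a small least-squares fit. The sketch below assumes the usual estimates, b as the sum of products of deviations divided by the sum of squared deviations of x, and a as the mean of y minus b times the mean of x; the function name `fit_line` and the data are made up for illustration. The data lie exactly on y = 1 + 2x, so the fit should return b = 2, meaning a two-unit change in y per one-unit change in x.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sp_xy / ss_x          # slope: average change in y per unit change in x
    a = y_bar - b * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

a, b = fit_line(x, y)
predicted_at_6 = a + b * 6    # estimated average y when x = 6

print(a, b, predicted_at_6)
```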
1. Regression coefficients are independent of change of origin but not of scale of measurement.
2. The geometric mean of the two regression coefficients is the correlation coefficient (r takes the
common sign of the two coefficients).
3. The arithmetic mean of the two regression coefficients is greater than or equal to the correlation
coefficient (in absolute value).
4. If one of the regression coefficients is greater than one in absolute value, the other must be less
than one, because their product equals r², which cannot exceed one.
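The regression coefficient of y on x is
$$b_{yx} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \qquad (1)$$
Let $u_i = \dfrac{x_i - a}{h}$, where a is the origin and h is the scale of variable $x_i$. Then
$$x_i - \bar{x} = h(u_i - \bar{u}) \qquad (2)$$
Again, let $v_i = \dfrac{y_i - b}{k}$, where b is the origin and k is the scale of variable $y_i$. Then
$$y_i - \bar{y} = k(v_i - \bar{v}) \qquad (3)$$
Substituting (2) and (3) into (1) gives
$$b_{yx} = \frac{hk \sum (u_i - \bar{u})(v_i - \bar{v})}{h^2 \sum (u_i - \bar{u})^2} = \frac{k}{h}\, b_{vu}$$
which does not involve the origins a and b but does involve the scales h and k.
Regression coefficients are therefore independent of change of origin but not of scale of measurement.
Multiplying the two regression coefficients gives $b_{yx}\, b_{xy} = r_{xy}^2$, that is,
$$r_{xy} = \pm\sqrt{b_{yx}\, b_{xy}}$$
where r takes the sign of the regression coefficients.
So the geometric mean of the two regression coefficients is the correlation coefficient.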
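These properties of the regression coefficients can be verified numerically. The sketch below is only an illustration on made-up data; the helper name `coefficients` and the shift and scale factors are assumptions. It checks that the product of the two regression coefficients equals r², that shifting the origin of x leaves b_yx unchanged, and that rescaling x changes it.

```python
import math

def coefficients(xs, ys):
    """Return (b_yx, b_xy, r) for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    b_yx = sp_xy / ss_x                    # regression coefficient of y on x
    b_xy = sp_xy / ss_y                    # regression coefficient of x on y
    r = sp_xy / math.sqrt(ss_x * ss_y)     # correlation coefficient
    return b_yx, b_xy, r

# Hypothetical observations (not from the notes)
x = [10, 20, 30, 40, 50]
y = [12, 15, 21, 24, 33]

b_yx, b_xy, r = coefficients(x, y)
geometric_mean = math.sqrt(b_yx * b_xy)    # equals |r|
arithmetic_mean = (b_yx + b_xy) / 2        # at least |r|

# Change of origin only: subtract 30 from every x
b_shifted, _, _ = coefficients([xi - 30 for xi in x], y)
# Change of scale only: divide every x by 10
b_scaled, _, _ = coefficients([xi / 10 for xi in x], y)

print(b_yx, b_xy, r, geometric_mean, arithmetic_mean)
print(b_shifted, b_scaled)
```

In this data set b_xy is greater than one and b_yx is less than one, consistent with the product being r².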
The Coefficient of Determination
$$R^2 = \frac{\text{SS due to regression}}{\text{Total SS}}$$
(multiplied by 100 when expressed as a percentage),
where SS due to regression = b · sp(xy)
and Total SS = ss(y).
The coefficient of determination, R², measures the proportion of the total variation in the
dependent variable that is explained by the independent variable.
The coefficient of determination ranges from 0 to 1.
An R² of 0 means that the dependent variable cannot be predicted from the independent
variable.
An R² of 1 means the dependent variable can be predicted without error from the independent
variable, i.e., the regression line explains 100% of the variation in the dependent variable.
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An
R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20
means that 20 percent is predictable; and so on.
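The ratio of the regression sum of squares to the total sum of squares can be computed directly. The following is a minimal sketch on made-up data, with the hypothetical helper name `r_squared`; it uses SS due to regression = b · sp(xy) and Total SS = ss(y), and cross-checks the result against one minus the residual share of the total variation.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the simple linear regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)   # total SS
    b = sp_xy / ss_x
    ss_regression = b * sp_xy                  # SS due to regression
    return ss_regression / ss_y

# Hypothetical observations (not from the notes)
x = [10, 20, 30, 40, 50]
y = [12, 15, 21, 24, 33]
r2 = r_squared(x, y)

# Cross-check: 1 - (residual SS / total SS) for the fitted line
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_total = sum((yi - y_bar) ** 2 for yi in y)
r2_check = 1 - ss_residual / ss_total

# Points exactly on a line give R^2 = 1
r2_perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])

print(round(r2, 4), round(r2_check, 4), r2_perfect)
```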
There are numerous fields of study where regression analysis can be applied.
An economist may want to ascertain how the price of a commodity (P) relates to its supply (S)
and to obtain a measure of that relationship.
An educator may want to predict the performance of a student in a course, say mathematics, from
the student's scores in an intelligence test.
An agriculturist may want to forecast yield per hectare for a rice farm from the amount of
fertilizer used and the number of times the field has been irrigated.
A pathologist may want to predict the development of a certain fungal disease in rice plants on the
basis of seed treatments with different doses of a fungicide.
Regression analysis is used:
To determine whether one variable Y depends on another variable X and, if these two variables
are related, to get a measure of the relationship,
To know the shape of the curve representing the relationship or dependence of Y on X,
To predict the value of the dependent variable Y from known values of the independent
variable X,
If there is more than one independent variable, to determine the contribution of the independent
variables to the dependent variable, individually or collectively.
Types of Regression
$y = \alpha + \beta x$ (simple linear regression)
$y = \alpha + \beta_1 x_1 + \beta_2 x_2$ (multiple linear regression)
4. Non-linear regression: the relationship between the dependent and independent variable(s) is
non-linear in the independent variable(s), e.g.,
$y = \alpha x^{\beta}$
$y = \alpha \beta^{x}$
$y = \alpha e^{\beta x}$
$y = \alpha + \beta_1 x + \beta_2 x^2$
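H0: β = 0
H1: β ≠ 0
The test statistic is
$$t = \frac{b - \beta}{\sqrt{s^2 / ss(x)}}$$
which is distributed as Student's t with (n-2) df. Here β is the value specified under H0 (zero), s² is
the residual mean square, $s^2 = \dfrac{ss(y) - b \cdot sp(xy)}{n - 2}$, and
$$ss(x) = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad ss(y) = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad sp(xy) = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
Conclusion: If the absolute value of the computed t is greater than or equal to the tabulated t with the
same df at the 5% (or 1%) level of significance, H0 is rejected at that level; otherwise, H0 is not rejected.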
Example:
Wheat yield (ton/ha)    Fertilizer level (kg/ha)
4.0                     100
5.0                     200
5.0                     300
7.0                     400
6.5                     500
6.5                     600
8.0                     700
The estimated regression equation is ŷ = 3.64 + 0.0059x, so wheat yield (y) increases by 0.0059 t/ha for
each 1 kg/ha increase in fertilizer (x).
The slope of a regression model represents the average change in Y per unit change in X.
Hypothesis:
H0: β = 0
H1: β ≠ 0
The test statistic is
$$t = \frac{b - \beta}{\sqrt{s^2 / ss(x)}} = 5.23$$
with (n-2) = 5 df.
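The figures in this example can be reproduced from the seven data pairs in the table. The sketch below applies the computing formulas for ss(x), ss(y) and sp(xy); the variable names are illustrative, and the residual mean square is taken as (ss(y) - b · sp(xy)) / (n - 2), which is an assumption about how s² was obtained since the notes do not show that step.

```python
import math

# Data from the example table
fertilizer = [100, 200, 300, 400, 500, 600, 700]   # x, kg/ha
wheat_yield = [4.0, 5.0, 5.0, 7.0, 6.5, 6.5, 8.0]  # y, t/ha

n = len(fertilizer)
sum_x = sum(fertilizer)
sum_y = sum(wheat_yield)

# Corrected sums of squares and products
ss_x = sum(x * x for x in fertilizer) - sum_x ** 2 / n
ss_y = sum(y * y for y in wheat_yield) - sum_y ** 2 / n
sp_xy = sum(x * y for x, y in zip(fertilizer, wheat_yield)) - sum_x * sum_y / n

b = sp_xy / ss_x                        # regression coefficient (slope)
a = sum_y / n - b * sum_x / n           # intercept
r2 = b * sp_xy / ss_y                   # coefficient of determination
s2 = (ss_y - b * sp_xy) / (n - 2)       # residual mean square (assumed definition)
t = (b - 0) / math.sqrt(s2 / ss_x)      # test of H0: beta = 0
df = n - 2

print(f"a = {a:.2f}, b = {b:.4f}, R2 = {r2:.2f}, t = {t:.2f}, df = {df}")
```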
Fig. 1: Wheat yield (ton/ha) against fertilizer level (kg/ha), with fitted line y = 3.64** + 0.0059**x (R² = 0.85)
The scatter diagram of yield against fertilizer level indicated that the relationship between
fertilizer level and yield is linear, and hence a linear regression of yield on fertilizer level was
estimated; the result is presented in Fig. 1. The regression of yield of wheat on fertilizer level was
obtained as Y = 3.64** + 0.0059**x (n = 7, R² = 0.85). Both the regression coefficient and the intercept
were significant at the 1% probability level. The coefficient of determination was 0.85, meaning that 85%
of the total variation in yield of wheat is explained by the fertilizer level (x). For every 1 kg/ha
increase in fertilizer level, grain yield increased by 0.0059 t/ha. The intercept indicates that, under
similar climate and soils, farmers can expect a yield of about 3.64 t/ha of wheat without application of
any fertilizer.
Distinctions between correlation and regression:
1. Correlation quantifies the degree to which two variables are related. Correlation does
not fit a line through the data; regression does.
2. In correlation, we look only at the relationship between two or more variables without
being concerned about which variables are independent or dependent. In regression
analysis it is essential to know which is the dependent variable and which is the
independent variable in order to estimate the regression.
3. With correlation we don't have to think about cause and effect. We simply quantify
how well two variables relate to each other. With regression, we do have to think about
cause and effect, as the regression line is determined as the best way to predict Y from
X.
4. From correlation we can only get an index describing the linear relationship between
two variables. In regression we can model the relationship between more than two
variables and use it to identify which variables x predict the outcome variable
y.
Example:
Fig. 1: Relationship between age of plant (days) and height of plant (cm), with fitted line y = 9.44** + 1.88**x (n = 11, R² = 0.942)
The scatter diagram of height against age of plant indicated that the relationship between age
and height of plant is linear, and hence a linear regression of height on age was estimated; the result
is presented in Fig. 1. The regression of height of plant on age was obtained as Y = 9.44** + 1.88**x
(n = 11, R² = 0.942). Both the regression coefficient and the intercept were significant at the 1%
probability level. The coefficient of determination was 0.942, meaning that 94% of the total variation in
height of plant is explained by age (x). For every 1 day increase in age, height increased by 1.88 cm.
Figure: Wheat yield (qn./ha) against fertilizer level (kg/ha), with fitted line y = 36.42** + 0.06**x (n = 7, R² = 0.85)
For the regression of wheat yield on fertilizer level, the intercept indicates that, under similar
climate and soils, farmers can expect a yield of about 36.42 qn./ha of wheat without application of any
fertilizer.