Unit-4 IGNOU STATISTICS
REGRESSION
Structure
Introduction
Objectives
Tabular and Diagrammatic Representation of Bivariate Data
What Do We Mean by Regression Analysis?
Simple Regression Line
Correlation Coefficient
Relationship between Regression and Correlation Coefficients
Limitations of Correlation Coefficient
Summary
Solutions and Answers
4.1 INTRODUCTION
Suppose you are given the data on the scores obtained by 30 students in English.
From the three units that you have studied so far, you know how to form the
frequency distribution of the data. You have also seen how to compute the average
score and the standard deviation. Now suppose the scores obtained by these students
in Mathematics are also given to you. Apart from finding the average score in
Mathematics, you may also like to know whether there is any relationship between
a student's score in English and his/her score in Mathematics. It is quite reasonable
to think that a student good at English would also be good at Mathematics. But is
this fact borne out by the data? In this unit, we are going to discuss some methods
which will enable us to answer this question.
We'll start by discussing the tabular and diagrammatic representation of bivariate
data, i.e., of data pertaining to two variables. After this, we'll talk about the nature
and the degree of relationship between the observations on two variables. We would
also like to see if we can build up an equation on the basis of the given data, which
can help us in predicting one of the variables when the other is given.
Here are the objectives which you should achieve by the end of this unit.
Objectives
After reading this unit, you should be able to :
draw a scatter diagram corresponding to the given bivariate frequency distribution
explain the meaning of "correlation" and "regression"
fit a regression line to the given data
compute the correlation coefficient for grouped and ungrouped data
derive some relationship between the correlation and regression coefficient.
Table 1 : [The bivariate data table is garbled in extraction and is omitted here.]
Source : Fundamentals of Statistics by Goon, Gupta and Dasgupta.
However, as in the univariate case, when the number of individuals is large, the data
need to be grouped into a frequency table. In the present case, the frequency
table has to be a two-way table, with a suitable number of classes for x and a suitable
number of classes for y. (We decide upon these "suitable" numbers after bearing in
mind all the points listed earlier.) In Table 2, you can see that
we have divided the number of grains per earhead into 10 classes, 8-12, 13-17, ..., 53-57.
we have divided the length of earhead into 9 classes, 5.25-6.25, 6.25-7.25, ..., 13.25-14.25.
If there are l classes for x and k classes for y, the table will have kl cells in all. So in
Table 2 we have 90 cells.
After specifying the classes we determine the cell frequencies and write them as in
Table 2.
Table 2 : Bivariate frequency table for length of earhead and number of grains per
ear for 400 ears of a variety of wheat.
[The cell frequencies are garbled in extraction. The column totals are 4, 14, 19, 74, 123, 92, 55, 13, 6, and the grand total is 400.]
Source : Statistical Methods for Agricultural Research Workers, by Panse and Sukhatme.
and the cumulative frequency of the more-than type for the (i, j)th cell will be the sum of the frequencies fkl over all cells with k ≥ i and l ≥ j.
Let us now look at the ways in which a bivariate distribution can be represented
diagrammatically.
Diagrammatic Representation
Like a univariate frequency distribution, a bivariate frequency distribution too may be
represented diagrammatically.
If raw data are to be represented, then we make use of what is called a scatter diagram
(or dot diagram). Here, the n pairs of values, say (xi, yi) for i = 1, 2, ..., n, of the
variables x and y are plotted as points w.r.t. a rectangular system of coordinates. In
Fig. 1, we have the scatter diagram for the data in Table 1.
Fig. 1 : Scatter diagram for the data in Table 1 (height along the horizontal axis).
Fig. 2
In order to get the frequency polygon, on the other hand, we join the tops of adjacent
columns by straight-line segments. The three-dimensional analogues of a frequency
polygon as well as a histogram are called stereograms.
In this unit, we'll be using scatter diagrams to represent bivariate data. So, before
going any further, see if you can draw a scatter diagram on the basis of given data.
E1) Draw a scatter diagram on the basis of the following data on the size of crop
and percentage of wormy fruits during a season for 12 apple trees.
[The data table is garbled in extraction.]
Source : Statistical Methods, by Snedecor and Cochran
Now, on the basis of such bivariate data, let us see if we can establish a relationship
between the two variables under study.
Fig. 4 : Scatter diagrams (a)-(f).
In (a), (b) and (c), you can see that all the points are quite close to a straight line
path. In (a) and (b), this straight line has positive slope, while the one in (c) has
negative slope. We say that the data in (a) and (b) show a positive linear relationship.
Of course, the points in (b) are more scattered than those in (a). The scatter diagram
in (c) exhibits a negative linear relationship. From (d) and (e), we can say that the
data indicate curvilinear relationships. But, from the scatter diagram in (f), we cannot
think of any relationship existing between the variables in the data.
For bivariate data, we know that the conditional distribution of y given x indicates
how the frequency for a given x is distributed over different classes of y. Now, if the
conditional distribution of y given x does not change significantly for different values
of x, then we can say that there is no relationship between y and x. And, in this case,
the x-values are useless for predicting y. One way to see if there is some kind of
relationship between y and x is to consider the average value of y given the values of
x. In other words, we consider the conditional average of y given x. If we plot the
conditional average of y given x, against x, we get a curve which is called the
regression curve of y on x. In general, this curve could be quite complicated. So, our
first reaction would be to explore the possibility of approximating it by a simple curve
like a straight line.
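The construction just described, averaging the y-values within each value of x, can be sketched in a few lines of code. The data below are made up purely for illustration:

```python
# Sketch: the regression curve of y on x is the set of points
# (x, conditional average of y given x). Illustrative data only.
from collections import defaultdict

x = [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
y = [2.0, 2.4, 3.1, 2.9, 3.0, 4.2, 3.8, 5.1, 4.9, 5.0]

groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)

# Conditional average of y for each distinct x.
curve = {xi: sum(ys) / len(ys) for xi, ys in sorted(groups.items())}
for xi, ybar in curve.items():
    print(xi, round(ybar, 2))
```

Plotting these conditional averages against x gives the regression curve; if they fall nearly on a straight line, a linear fit is reasonable.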
There is yet another motivation for using the regression curve as defined above.
Recall that in Unit 2, we have seen that

Σ (from i = 1 to n) (xi − A)² is least when A = x̄.   ... (1)

Now, let us again refer to the data on length of earhead (x) and number of grains
per ear (y) given in Table 2. (Read this paragraph slowly and carefully to
understand the argument.) Suppose we are interested in predicting the number of
grains per ear on the basis of ear length. Suppose further, that the loss we suffer due
to wrong prediction is proportional to the squared error, where error is the difference
between observed and predicted values. Then our average loss will be minimum if
the average squared error is minimum. Using (1), we can say that the average squared
error will be least if we take the mean number of grains per ear for any specific length
of the ear as our predicted value. We can read off this predicted value from the
regression curve.
The term 'regression' was first used by Sir Francis Galton in 1877, in connection with
his study of human height. He found that the height of the children of tall parents
tended to regress (or move back) towards the average height of the population.
Galton called the line describing this relationship, a 'line of regression', and the
terminology has since been accepted universally.
In the next section, we'll see how to fit a regression line to given data.
To see that the values of a and b given by (4) and (6) indeed minimise Σ ei², we proceed as below:
We have
ei = yi − a − bxi
   = (yi − ȳ) − b(xi − x̄) + (ȳ − a − bx̄).
Therefore,
ei² = (yi − ȳ)² + b²(xi − x̄)² + (ȳ − a − bx̄)² − 2b(yi − ȳ)(xi − x̄)
      + 2(yi − ȳ)(ȳ − a − bx̄) − 2b(xi − x̄)(ȳ − a − bx̄)   ... (7)
Now, when we sum over i = 1, 2, ..., n, the last two terms on the right hand side in
(7) vanish, since Σ(yi − ȳ) = 0 and Σ(xi − x̄) = 0.
Hence we get
Σ ei² = Σ(yi − ȳ)² + b² Σ(xi − x̄)² + n(ȳ − a − bx̄)² − 2b Σ(yi − ȳ)(xi − x̄).
Let us write Σ(yi − ȳ)² as Sy², Σ(xi − x̄)² as Sx², and Σ(xi − x̄)(yi − ȳ) as Sxy. Then
Σ ei² = n(ȳ − a − bx̄)² + Sy² + b²Sx² − 2b Sxy.   ... (8)
(Have you noted that the expressions Sy²/n and Sx²/n are the variances of y and x?)
Note that
Sy² = Σ yi² − (Σ yi)²/n , Sx² = Σ xi² − (Σ xi)²/n and Sxy = Σ xiyi − (Σ xi)(Σ yi)/n.
Putting these expressions for Sx² and Sxy in the formula for b, we get
b_yx = [Σ xiyi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n].   ... (9)
Suppose we look upon x as the dependent variable. Then to predict the values of x
from the values of y, we use the regression equation
x = a' + b'y ,
where a' = x̄ − b'ȳ.
b', the regression coefficient of x on y, is denoted by b_xy, and we have
b_xy = Sxy / Sy².   ... (10)
Using Formulas (9) and (10), you can calculate b_yx and b_xy from raw data. But
sometimes you may need to calculate b_xy or b_yx from grouped data. We now give the
formulas which you can use in such cases. For grouped data, with xi and yj the
class-marks and fij the cell frequencies, the regression coefficients are
b_yx = [Σi Σj fij xiyj − (Σi fi0 xi)(Σj f0j yj)/n] / [Σi fi0 xi² − (Σi fi0 xi)²/n] ,
b_xy = [Σi Σj fij xiyj − (Σi fi0 xi)(Σj f0j yj)/n] / [Σj f0j yj² − (Σj f0j yj)²/n].
Now let us use Formula (9) to fit a regression line to the raw data in the next example.
Example 1 : Look at the data given in the following table, where x is the independent
variable. [The data and computation tables are garbled in extraction.] To find the
values of a and b in the equation of the line of regression, y = a + bx, we form the
usual table of xi, yi, xiyi and xi², and get
b = 0.387.
Further, x̄ = 39/10 = 3.9 and ȳ = 3.51.
Therefore, a = ȳ − bx̄ = 3.51 − (0.387)(3.9) = 2.00.
Hence, the regression line is y = 2 + 0.387x.
In Fig. 5, you can see the scatter diagram and the regression line for these data.
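The arithmetic of Example 1 follows a recipe that works for any raw data set: take b = Sxy/Sx² and a = ȳ − bx̄. A minimal sketch, with made-up data (the table of Example 1 is not reproduced here):

```python
# Least-squares fit of y = a + b*x from raw data (illustrative values).
def fit_line(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar  # Sxy
    sxx = sum(xi * xi for xi in x) - n * xbar * xbar              # Sx^2
    b = sxy / sxx            # regression coefficient of y on x
    a = ybar - b * xbar      # intercept
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.0]
a, b = fit_line(x, y)
```

The same two lines `b = sxy / sxx` and `a = ybar - b * xbar` are exactly Formulas (9) and (4) of the text.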
E3) A preparation of insulin was being studied to determine its effect on reducing
the blood-sugar level in rats. Seven rats were injected with different dosages
(.20, .25, .25, .30, .40, .50, .50 units). [The table of reductions in their blood-sugar
levels is garbled in extraction.]
As usual, you can expect a suitable change of origin and scale to simplify your
calculations. In fact, if
u = (x − A)/c and v = (y − B)/d ,
then b_yx = (d/c) b_vu and b_xy = (c/d) b_uv.
E4) Find the regression line, y = a + bx, corresponding to the following data :
So far you have seen how to find the line of regression to fit the given data. Using
this line, we can then predict the value of the dependent variable for given values of
the independent variable. For example, the regression line for the data in Example 1
is given by y = 2 + 0.387x. We could use this to predict the value of y when x is,
say, 3.2. Thus, we can expect that y = 2 + (0.387)(3.2) = 3.24, approximately.
But we should also know how good our estimate is. A measure of 'goodness' of our
prediction is given by the correlation coefficient, and this is what we are going to study
in the next section.
We had also seen that with these values of â and b̂, we get the least squared error.
Let us define
êi = yi − â − b̂ xi for i = 1, 2, ..., n.
Now, if you go back to (8), you will see that
Σ êi² = Sy² − Sxy²/Sx² ,
since the first two terms on the right hand side of (8) vanish for â and b̂.
Let us write r = Sxy / (Sx Sy). Then Sxy/Sx = r Sy, and we have
Σ êi² = Sy² (1 − r²).   ... (13)
The left hand side of (13) is non-negative. This implies that 1 − r² should also be
non-negative. Thus,
0 ≤ r² ≤ 1.
Now, if r² = 1, Σ êi² will be zero. This means the prediction error is zero if r² = 1.
On the other hand, if r² is close to zero, Σ êi² will be close to Sy², i.e., the fitted line
hardly improves on predicting y by ȳ alone. We can also write, using b̂ = Sxy/Sx²,
Σ êi² = Sy² − b̂ Sxy .
Therefore, Sy² = b̂ Sxy + Sy²(1 − r²)
             = r² Sy² + Sy²(1 − r²) , since b̂ Sxy = Sxy²/Sx² = r² Sy².
In terms of raw data,
r = [Σ xiyi − (Σ xi)(Σ yi)/n] / √{[Σ xi² − (Σ xi)²/n][Σ yi² − (Σ yi)²/n]}.   ... (15)
Formula (15) is very useful if we want to compute the correlation coefficient from
raw data. Now let us derive a formula for the correlation coefficient for grouped data.
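The raw-data computation of r can be sketched directly; the function below implements r = Sxy/(Sx Sy) with the usual shortcut sums (the data values are illustrative):

```python
import math

def corr(x, y):
    # r = (Σxy − n·x̄·ȳ) / sqrt((Σx² − n·x̄²)(Σy² − n·ȳ²))
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    sxx = sum(a * a for a in x) - n * xbar * xbar
    syy = sum(b * b for b in y) - n * ybar * ybar
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear, increasing data gives r = 1.
r = corr([1, 2, 3, 4], [3, 5, 7, 9])
```

A perfectly linear decreasing data set, e.g. corr([1, 2, 3], [3, 2, 1]), gives r = −1.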
If xi is the class-mark of the ith class of x, and yj is that of the jth class of y, then we have
fi0 = Σj fij and f0j = Σi fij ,
x̄ = (1/n) Σi fi0 xi , ȳ = (1/n) Σj f0j yj ,
and Cov(x, y) = (1/n) Σi Σj fij (xi − x̄)(yj − ȳ)
            = (1/n) { Σi Σj fij xiyj − (1/n)(Σi fi0 xi)(Σj f0j yj) }.
Now we are going to introduce some notation which will help us in the calculations
later. Let
Xj = Σi fij xi and Yi = Σj fij yj .
Then
Σj Xj yj = Σi xi Yi = Σi Σj fij xiyj .
[An example computing the correlation coefficient between height and yield of dry
bark per plant followed here; its computations are garbled in extraction.]
Now we take an example of grouped data. To simplify the calculations here, we shall
make use of our usual technique: change of origin and scale. So let's first outline the
technique here.
If u = a + bx (b ≠ 0), v = c + dy (d ≠ 0), and the correlation coefficient of x and y is
defined and is r_xy, then the correlation coefficient of u and v is also defined, and
r_uv = ± r_xy ,   ... (16)
according as b and d have the same signs or have opposite signs. We are leaving the
proof of this statement to you as an exercise. See E5). To solve it, you will have to
show that
su² = b² sx² , sv² = d² sy² and Cov(u, v) = bd Cov(x, y).
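The claim that E5 asks you to prove can also be checked numerically. Here u = 10 − 2x (so b < 0) flips the sign of the correlation, while v = 3 + 4y (d > 0) leaves it alone; the data are illustrative:

```python
import math

def corr(x, y):
    # r = Sxy / (Sx Sy), computed from raw data.
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(p * q for p, q in zip(x, y)) - n * xbar * ybar
    sxx = sum(p * p for p in x) - n * xbar * xbar
    syy = sum(q * q for q in y) - n * ybar * ybar
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.5, 3.5, 3.0, 4.0]

u = [10 - 2 * xi for xi in x]   # b = -2 < 0
v = [3 + 4 * yi for yi in y]    # d = 4 > 0

r_xy = corr(x, y)
r_uv = corr(u, v)   # b and d have opposite signs, so r_uv = -r_xy
```

Changing the signs of b and d lets you check every case of (16).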
Let's look at an example now.
Example 3 : Consider the grouped data of Table 2. Let us find r_xy, where x is the
length (in cm) and y is the number of grains per earhead of wheat. To compute this
coefficient, let us make a change of origin and scale for each variable. Precisely, let
u = (x − x0)/b , v = (y − y0)/d ,   ... (17)
where x0 = 9.75, b = 1 (class-width for any x class), y0 = 30 and d = 5 (class-width
for any y class). The computations needed to obtain r_uv are shown below, taking xi
to be the class-mark of the ith x-class and yj to be that of the jth y-class, ui and vj
being defined by (17). [The computation table is garbled in extraction; its final steps
give a numerator of]
270000 − 4180 = 265820.
Hence,
r_uv = 265820 / √(336975 × 326864) = 0.801.
Since b and d are both of the same sign, the correlation coefficient between x and y,
r_xy, also has the value 0.801.
Try to do the following exercises now. By doing them, you will gain a better
understanding of the concepts covered in this section.
In the next section, we'll talk about some algebraic relationships involving r, b_yx
and b_xy.
Therefore, b_xy · b_yx = Sxy² / (Sx² Sy²) = r².
So, we get |r| = √(b_xy · b_yx).   ... (18)
Note that b_xy and b_yx have the same sign. The correlation coefficient has the same
sign as that of b_xy and b_yx.
Now, suppose the regression line of y on x is
y = 10 − 2x
and that of x on y is [equation garbled in extraction].
In general, written as lines in the (x, y)-plane, the slopes of the two regression lines are
m1 = r σy/σx and m2 = σy/(r σx).
You can check that these lines intersect in (x̄, ȳ). If we denote by θ the smaller
angle between these lines, then
tan θ = [(1 − r²)/|r|] · [σx σy / (σx² + σy²)].   ... (19)
From (19), we can see that the two lines are identical (i.e., θ = 0) iff r² = 1, or
r = ±1. We can also infer that the only way the correlation coefficient can be zero
is that the two regression lines are perpendicular to each other; one being parallel to
the x-axis and the other parallel to the y-axis.
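The standard formula tan θ = ((1 − r²)/|r|)·(σx σy/(σx² + σy²)) for the angle between the two regression lines can be checked against the slopes directly. A sketch with illustrative values of r, σx, σy (assuming 0 < |r| < 1):

```python
# Check the angle formula between the two regression lines.
r, sx, sy = 0.6, 2.0, 3.0   # illustrative values only

m1 = r * sy / sx          # slope of the regression line of y on x
m2 = sy / (r * sx)        # slope of the x-on-y line, rewritten as y in terms of x

# Angle between two lines with slopes m1, m2.
tan_theta_direct = abs((m2 - m1) / (1 + m1 * m2))
tan_theta_formula = ((1 - r**2) / abs(r)) * (sx * sy / (sx**2 + sy**2))
```

As r approaches ±1 both expressions go to 0 (the lines coincide), and as r approaches 0, tan θ grows without bound (the lines become perpendicular), matching the remarks above.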
If r² = 1, then (13) shows that Σ êi² = 0. This, in turn, implies that each êi = 0. So, if r² = 1, then all the pairs of
observations (xi, yi) lie on the regression line and, in such a case, y is determined by
x through a mathematical relation of a straight line. Further, the value of r is +1
if y is an increasing function of x, and it is −1 if y is a decreasing function of x.
We shall now introduce a new concept, that of product moments.
Product Moments
Recall that for univariate distributions, we have defined the rth moment about a to be
m'r(a) = (1/n) Σ (from i = 1 to n) (xi − a)^r .
Extending this definition, we say that the product moment of order (r, s) of bivariate
data about (a, b) is
m'r,s = (1/n) Σi (xi − a)^r (yi − b)^s for ungrouped data,
     = (1/n) Σi Σj fij (xi − a)^r (yj − b)^s for grouped data.
If a = x̄ and b = ȳ, we get the central product moment of order (r, s), which we
denote by m_r,s.
Note that
m'r,0 = (1/n) Σi (xi − a)^r for ungrouped data,
     = (1/n) Σi fi0 (xi − a)^r for grouped data,
so the product moments of order (r, 0) are simply the moments of x. In particular,
m1,1 = (1/n) Σi (xi − x̄)(yi − ȳ) for ungrouped data,
    = (1/n) Σi Σj fij (xi − x̄)(yj − ȳ) for grouped data.
You must have found this expression familiar. We have been calling it the covariance
of x and y, Cov(x, y).
Now we can write
r = m1,1 / √(m2,0 · m0,2) .
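Product moments are straightforward to compute directly from ungrouped data. A minimal sketch (illustrative values; m_{1,1} is the covariance, while m_{2,0} and m_{0,2} are the variances of x and y):

```python
# Central product moment of order (r, s) for ungrouped bivariate data.
def m_rs(x, y, r, s):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) ** r * (yi - ybar) ** s
               for xi, yi in zip(x, y)) / n

x = [1, 2, 3, 4]
y = [2, 4, 5, 9]

m11 = m_rs(x, y, 1, 1)   # Cov(x, y)
m20 = m_rs(x, y, 2, 0)   # variance of x
m02 = m_rs(x, y, 0, 2)   # variance of y
```

With these, r = m11 / (m20 * m02) ** 0.5 reproduces the correlation coefficient.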
So far, we have seen that if the scatter diagram indicates a linear relationship, then
we can find the equation of the regression line and use it to predict the values of one
variable when those of the other are given. The correlation coefficient tells us how
precise our estimate is. But if the data do not have a linear relationship, then the
correlation coefficient does not serve any purpose. We shall talk about the limitations
of the correlation coefficient in the next section.
4.7 LIMITATIONS OF THE CORRELATION
COEFFICIENT
The correlation coefficient is a measure of the relationship between two variables,
say, x and y. While it generally serves as a useful statistical tool, we should also be
aware of its limitations.
Firstly, the coefficient is a measure of statistical relationship, and not of causal
relationship, between the two variables. This means that the value of r tells us
whether, and with what regularity, y increases or decreases as x increases. But it
cannot tell us whether that increase or decrease is due to any causal (or cause-effect)
relationship between the two variables. Age of husband and age of wife for a group
of couples are found to be highly correlated. But to say that one of the variables is
the cause of the other would be preposterous. Indeed, quite often, the high
correlation between two variables is because of a third variable, a 'lurking factor',
simultaneously influencing the two.
A strong correlation may often be found between two time series (i.e., series of
values of two variables corresponding to, say, n points or periods of time) with no
conceivable causal nexus. You will find that the correlation coefficient between the
population of India and production of coal in India for the census years is very high.
You will also find a high correlation coefficient between school enrolment and
number of cars on the roads in the country. But this high correlation is entirely
fortuitous and is due to the fact that in each case the two series are showing an
increasing or decreasing trend. This type of correlation is often referred to as
nonsense (or spurious) correlation.
Secondly, the correlation coefficient is a measure of linear statistical relationship
only, and may fail to be a proper index of statistical relationship in case it is
non-linear. Consider, for example, the following data:
x : −2  −1  0  1  2
y :  4   1  0  1  4
You can check that here the correlation coefficient between x and y is zero. Yet, you
can see that yi = xi² for all i = 1, 2, ..., 5. Thus, even though y and x are related
by a mathematical relationship, y = x², the correlation coefficient is zero. This is
because the relationship between y and x is not linear.
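You can verify this with a short computation on the data above (y = x²): the cross-product sum Sxy comes out exactly zero, so r = 0.

```python
# For y = x² on symmetric x-values, Sxy (and hence r) is exactly zero.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]   # a perfect, but non-linear, relationship

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
# Σxy = (-8) + (-1) + 0 + 1 + 8 = 0 and x̄ = 0, so Sxy = 0 and r = 0.
```

Zero correlation therefore does not mean "no relationship", only "no linear relationship".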
You should also note that a spurious correlation may be generated through the
combining of non-homogeneous sets of data, i.e., sets having different means for each
of the variables.
You may think of three groups of higher secondary science students: the first group
is from colleges that only admit highly meritorious students, the second from colleges
of the middle order where admission is less restrictive, and the third from colleges
where admission restrictions are virtually absent. When taken separately, each group
will show a zero or near-zero correlation coefficient between score in the science
subjects (x) and score in the languages (y). But in case the groups are combined, you
will find a rather high (positive) correlation coefficient between x and y. This arises
simply because the means of x for the three groups, say x̄1, x̄2 and x̄3, are unequal,
and so also are the means of y, say ȳ1, ȳ2 and ȳ3. The situation is indicated in a
somewhat extreme form in Fig. 6.
E10) Let x̄i and ȳi (i = 1, 2, ..., k) be the means of x and y for k groups of
individuals, the ith group having ni individuals. Suppose r_xy, and hence
Cov(x, y), is 0 for each group. Show that for the composite group, Cov(x, y),
and hence r_xy, will be non-zero, unless x̄1 = x̄2 = ... = x̄k and
ȳ1 = ȳ2 = ... = ȳk.
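The effect E10 describes is easy to see numerically: two groups, each with zero within-group covariance but different means, combine into a composite with non-zero covariance. The data below are made up for illustration:

```python
# Spurious correlation from pooling non-homogeneous groups.
def cov(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n

# Two groups; within each, y is constant, so Cov(x, y) = 0.
g1x, g1y = [1, 2, 3], [10, 10, 10]
g2x, g2y = [11, 12, 13], [20, 20, 20]

# Pooling them gives a non-zero covariance, because the group means differ.
combined_cov = cov(g1x + g2x, g1y + g2y)
```

The pooled covariance here is driven entirely by the separation of the group means, not by any within-group relationship between x and y.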
We have seen that the correlation coefficient can be effective only when there is a
linear relationship between the variables. In case of a non-linear relationship, we
have to think of some other measure. In such cases, we try to fit a polynomial of
degree p > 1 to the data. The measure of the "goodness of fit" is then provided by
the correlation ratio. But we are not going to discuss non-linear regression here, as
it is outside the purview of this course.
Now let us summarise what we have done in this unit.
4.8 SUMMARY
In this unit we have seen
1) how to organise bivariate data.
We have seen how to tabulate and diagrammatically represent such data. In
particular, we have seen that scatter diagrams are useful to judge what kind of
relationship (if any) exists between the two variables.
2) that, in regression analysis, we try to find a mathematical model to describe the
relationship between two variables.
3) how the method of least squares is used to find the equation of the regression line
y − ȳ = b_yx (x − x̄),
where b_yx is the regression coefficient of y on x, defined by b_yx = Sxy/Sx².
4) how to fit a regression line to given data and use it for prediction.
5) that the correlation coefficient
r = Sxy / (Sx Sy)
measures the strength of the linear relationship between x and y. We have also
seen how to interpret the correlation coefficient geometrically.
6) how the regression and correlation coefficients are related:
|r| = √(b_yx · b_xy).
We have also defined product moments; in particular, the central product moment
m1,1 = (1/n)(Σ xiyi − n x̄ȳ) is the covariance of x and y.
E3) a) The dosage : independent
Reduction in blood sugar : dependent.
b) a = ȳ − b x̄ = −6.13
∴ The regression line is : y = −6.13 + 110.54x.
E4) Let A = 150, B = 600. Then
b_yx = 200788 / 42729 = 4.7
ȳ = 600 + v̄ = 608.8
x̄ = 150 + ū = 150.9
a = ȳ − b_yx x̄ = 608.8 − (4.7)(150.9) = −100.43.
∴ The regression line is : y = −100.43 + 4.7x.
[The intermediate steps of this solution are garbled in extraction; the argument
concludes that the sign of the expression agrees with that of Cov(w, x), since
wi ≥ 0 for all i.]
E10) Let sxy be the covariance in the composite group and sxy(i) the covariance in the
ith group. We have