
05 Regression and Correlation

We are mainly concerned with the use of associations among variables. These associations may be useful in many ways, and one of the most important and most common is prediction: estimating the value of one quantity by using values of related quantities. Such methods may also lead to methods for controlling the value of one variable by adjusting the values of related variables. Regression analysis offers us a sensible and sound approach for examining associations among variables and for obtaining good rules for prediction.
Regression Analysis
 we make decisions based on predictions of future events
 we develop a relationship between what is already known and what is to be estimated
There always exists a tendency towards the average. For example, the heights of children born to tall parents tend back towards the average height. So we make use of regression, which is a process of predicting one variable (the height of the children) from another (the height of the parents). To do so we develop an estimating equation, that is, a mathematical formula that relates the known variable to the unknown variable.

A scatter diagram can give us two types of information. Visually, we can look for patterns that indicate that the variables are related. Then, if the variables are related, we can see what kind of line, or estimating equation, describes this relationship.
Later on we will study correlation analysis to determine the degree to which the variables are related. It tells us how well the estimating equation actually describes the relationship.
We look for a CAUSAL relationship between variables, i.e. how the independent variable causes the dependent variable to change.
Deterministic and Probabilistic Relations or Models
A formula that relates quantities in the real world is called a model. Recall from physics that if a body is moving with an initial velocity 'u' and uniform acceleration 'a', the velocity after time 't' is given by:
v = u + at
This is a model for uniformly accelerated motion. This model has the property that when a value of 't' is substituted in the above equation, the value of v is determined without any error. Such models are called deterministic models. An important example of a deterministic model is the relationship between the Celsius and Fahrenheit scales, F = 32 + (9/5)C. Other examples of such models are Boyle's law, Newton's law of gravitation, Ohm's law, etc.
Examples (1)
Consider another example, investigating the relationship between density and compressive strength at 28 days, from examination of 40 concrete cube test records during the period 8 July 1991 to 21 September 1992, arranged in reverse chronological order. Each pair of columns below gives the density (kg/m³) and the compressive strength (N/mm²):
Density  Strength  Density  Strength  Density  Strength  Density  Strength
2437 60.5 2428 56.9 2435 57.8 2444 64.9
2437 60.9 2448 67.3 2446 60.9 2447 63.4
2425 59.8 2456 68.9 2441 61.9 2433 60.5
2427 53.4 2436 49.9 2456 67.2 2429 68.1
2435 68.3 2454 59.8 2458 61.1 2455 56.3
2471 65.7 2449 56.7 2414 50.7 2473 64.9
2472 61.5 2441 57.9 2448 59.0 2488 69.5
2445 60.0 2457 60.2 2445 63.3 2454 58.9
2436 59.6 2447 55.8 2436 52.5 2427 54.4
2450 60.5 2436 53.2 2469 54.6 2411 58.8
(Ref. Applied Statistics for Civil and Environmental Engineers, 2nd Edition by Kottegoda and Rosso, Example 6.1)
Let us denote concrete density by x and strength by y. Suppose the investigator believes that the relation between y and x is exactly given by:

Y = −274.4 + 0.1368x.
If this were true, we would obtain the exact value of the strength y for a given value of x. Thus when x = 2445, the strength would be:
Y = −274.4 + 0.1368(2445) = 60.076
But the observed value is 60.0. There is an error of 60.076 − 60.0 = 0.076. Hence no deterministic model can be constructed to represent this experiment. A model that allows for such random error is called a probabilistic model. The deterministic relation in such cases is modified to include both a deterministic component and a random error component:
Yi = a + bXi + εi , where the εi's are the unknown random errors.
Regression Model
There are many statistical investigations in which the main objective is to determine whether a relationship exists between two or more
variables. If such a relationship can be expressed by a mathematical formula, we will then be able to use it for the purpose of making predictions.
The reliability of any prediction will, of course, depend on the strength of the relationship between the variables included in the formula.
A mathematical equation that allows us to predict values of one dependent variable from known values of one or more independent
variables is called a regression equation. Today the term regression is applied to all types of prediction problems and does not necessarily imply a
regression towards the population mean.
Linear Regression
We consider here the problem of estimating or predicting the value of a dependent variable Y on the basis of a known measurement of an
independent and frequently controlled variable X. The variable intended to be estimated or predicted is termed as dependent variable or
Regressand or response variable and the variable on the basis of which the dependent variable is to be estimated is called the independent
variable, the regressor or the predictor.
e.g. if we want to estimate the heights of children on the basis of their ages, the heights would be the dependent variable and the ages would be the independent variable. In estimating the yield of a crop on the basis of the amount of fertilizer used, the yield would be the dependent variable and the amount of fertilizer would be the independent variable.
Scatter Diagram
Let us consider the data given in Example 1. The data table has been plotted in the figure to give a scatter diagram.

In the scatter diagram, the points follow a straight line fairly closely, indicating that the two variables are to some extent linearly related. Once a reasonable linear relationship is observed, we usually try to express it mathematically by a straight-line equation Y = a + bX, called the linear regression line, where the constants a and b represent the y-intercept and slope respectively. Such a regression line has been drawn in the following figure. This linear regression line can be used to predict the value of Y corresponding to any given value of X.

Many possible regression lines could be fitted to the sample data, but we choose the particular line which best fits the data. The best regression line is obtained by estimating the regression parameters by the most commonly used method, the method of least squares.
Estimation of a Straight Line using the Method of Least Squares
The basic linear relationship between the dependent variable Yi and the value Xi is
Yi = a + bXi + εi
where a and b are the unknown population parameters (b is also called the coefficient of regression), the Yi are the observed values and the εi are the error components. We write the equation for the estimating line as
Ŷi = a + bXi
The line will have a good fit if it minimizes the error between the estimated points on the line and the actual observed points that were
used to draw it.
One way to measure the error of the estimating line is to sum all the individual differences or errors, between the estimated points and
the observed points. In the figure we have two estimated lines that have been fitted to the same set of three data points. Two very different lines
have been drawn to describe the relationship between the two variables.


We have calculated the individual differences between the corresponding Y and Ŷ, and then we have found the sum of these differences. The total error in both cases is zero. This would suggest that both lines describe the data equally well. Thus we must conclude that summing the individual differences is not a reliable way to judge the goodness of fit of an estimating line: the sum of the individual errors hides a cancelling effect of the positive and negative values.
The sum of the absolute values of the errors seems to be a better criterion for a good fit, but it does not stress the magnitude of the errors. If instead we consider the sum of the squares of the errors, it magnifies the larger errors and it removes the cancelling effect of the positive and negative values. So finally we look for an estimating line that minimizes the sum of the squares of the errors. This is called the method of least squares.
Method of Least Squares
The method of least squares determines the values of the unknown parameters that minimize the sum of squares of the errors where
errors are defined as the difference between observed values and the corresponding predicted or estimated values.

It is denoted by
S(a,b) = Σ ei² = Σ (Yi − Ŷi)² = Σ (Yi − a − bXi)²
where each sum runs over i = 1 to n.
To minimize S(a,b), we set the first partial derivatives with respect to a and b equal to zero. Therefore
∂S/∂a = −2 Σ (Yi − a − bXi) = 0
∂S/∂b = −2 Σ (Yi − a − bXi) Xi = 0
by simplifying, we have the normal equations
ΣYi = na + bΣXi
ΣXiYi = aΣXi + bΣXi²
by solving, we have
a = [(ΣY)(ΣX²) − (ΣX)(ΣXY)] / [nΣX² − (ΣX)²],  b = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²]
If the variable X is taken as the dependent variable, then the least squares line is given by
X = c + dY
and the normal equations are
ΣX = nc + dΣY
ΣXY = cΣY + dΣY²
By solving simultaneously, we have the values of c and d:
c = [(ΣX)(ΣY²) − (ΣY)(ΣXY)] / [nΣY² − (ΣY)²],  d = [nΣXY − (ΣX)(ΣY)] / [nΣY² − (ΣY)²]
Examples (2) (Weather and Traffic)
Weather and traffic are two everyday occurrences that have inherent randomness. For example, if you live in a cold climate you know that traffic tends to be more difficult when snow falls and covers the roads. We can create a simple mathematical model of traffic incidents as a function of snowy weather, based on known data.
In the following table, we have accumulated a record of the number of snow days occurring in a certain locality over the past 10 years,
along with the number of traffic incidents reported to police in the same year. A scatter plot of the data can be used to visualize the possible
correlation.
Snow Days 16 55 43 29 59 42 20 45 30 35
Incidents 5825 11427 9006 5963 11449 8380 5745 9104 6495 6938
The scatter plot is shown below:

We see that there is a general trend to the data, with traffic incidents increasing as the number of snow days increases. We have added a
linear trend line to the data to highlight this relationship. This linear trend is, in fact, a straight line probabilistic model of the data.
A straight-line probabilistic model is often referred to as a linear regression, Y = a + b X
The normal equations are:
ΣY = na + bΣX
ΣXY = aΣX + bΣX²
The table for constructing these normal equations is given below (ŷ is the estimated value and e = ŷ − y):
X Y X² XY ŷ e = ŷ − y
16 5825 256 93200 4818.2 -1006.8
55 11427 3025 628485 10676 -751
43 9006 1849 387258 8873.6 -132.4
29 5963 841 172927 6770.8 807.8
59 11449 3481 675491 11276.8 -172.2
42 8380 1764 351960 8723.4 343.4
20 5745 400 114900 5419 -326
45 9104 2025 409680 9174 70
30 6495 900 194850 6921 426
35 6938 1225 242830 7672 734
Totals: ΣX = 374, ΣY = 80332, ΣX² = 15766, ΣXY = 3271581
Now the normal equations:
80332 = 10a + 374b
3271581 = 374a + 15766b
By solving, we get a = 2415 and b = 150.2
The regression equation is
Y = 2415 + 150.2 X
Using this regression line Y = 2415 + 150.2 X, we can predict the number of incidents for a given number of snow days. For example, the predicted number of incidents during 40 snow days is 8423. The estimated number of incidents during 45 snow days is 9174, while the observed number of incidents during 45 snow days is 9104.
Such interpretation is valid only when x lies between 16 and 59. An extension of the model beyond these values may lead to unreasonable results. The value of b is called the coefficient of regression.
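As a check on Example 2, the sketch below (Python, added for illustration) recomputes the sums and coefficients from the raw data and reproduces the prediction for 40 snow days:

```python
# Recomputing Example 2 from the raw data: the sums match the table and the
# coefficients match the quoted a = 2415, b = 150.2 up to rounding.
snow = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
incidents = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]

n = len(snow)
sx, sy = sum(snow), sum(incidents)                  # 374, 80332
sxx = sum(x * x for x in snow)                      # 15766
sxy = sum(x * y for x, y in zip(snow, incidents))   # 3271581
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)       # ≈ 150.2
a = (sy - b * sx) / n                               # ≈ 2415
pred_40 = a + b * 40  # ≈ 8424; the text gets 8423 using the rounded a, b
```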
Accuracy of the Regression Curve
a) Using the scatter diagram, we can guess how accurate the regression curve is.
b) If the sum of the residuals is zero, then the curve is supposed to be a good one.
c) If the points in the plot of the residuals are close to the x-axis and scattered in a random way, the model appears to provide a good fit.
d) If the points in the plot of the residuals are distributed in a systematic manner, we should try some other model.
From Example 2:
X, Y (observed values), ŷ (estimated values), e = ŷ − y (errors or residuals)
16 5825 4818.2 -1006.8
55 11427 10676 -751
43 9006 8873.6 -132.4
29 5963 6770.8 807.8
59 11449 11276.8 -172.2
42 8380 8723.4 343.4
20 5745 5419 -326
45 9104 9174 70
30 6495 6921 426
35 6938 7672 734
Totals: ΣX = 374, ΣY = 80332
The scatter plot of the errors is shown below:

Hence the model is a good fit.


Exercise 1
Following are data for 10 randomly selected areas; for each area the number of oil stoves and the annual consumption of oil in barrels are given. Fit a regression equation of annual oil consumption on the number of stoves.
No. of stoves: =x 1 1.5 2 2.5 3 3.5 4 4.5
Annual Consumption of oil: =y 25 31 27 28 36 35 32 34

Examples (3)
An investigator divides a field into eight plots of equal size and equal fertility and applies varying amounts of fertilizer to each. The yield of potatoes (in kg) and the fertilizer application (in kg) were recorded for each plot. The data are given below:
Fertilizer Applied: =x 1 1.5 2 2.5 3 3.5 4 4.5
Yield of Potatoes: =y 25 31 27 28 36 35 32 34
Fit a line of regression to the yield of potatoes using the method of least squares.
Solution
The necessary calculations are given in the table below:
x y xy x² y² ŷ e = y − ŷ
1 25 25.0 1.00 625 26.83 -1.83
1.5 31 46.5 2.25 961 28.02 2.98
2 27 54.0 4.00 729 29.21 -2.21
2.5 28 70.0 6.25 784 30.40 -2.40
3 36 108.0 9.00 1296 31.59 4.41
3.5 35 122.5 12.25 1225 32.79 2.21
4 32 128.0 16.00 1024 33.98 -1.98
4.5 34 153.0 20.25 1156 35.17 -1.17
Total 22 248 707.0 71.00 7800
(Y)(X )  (X)(XY)
2
nXY  (X)(Y)
a= ,b=
n(X2)  (X)2 n(X2)  (X)2
we have a = 24.452 and b = 2.38095
The fitted line, which is the simple regression line, is ŷ = 24.452 + 2.38095x
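The quoted estimates can be verified from the tabulated sums. A short check (Python, added for illustration):

```python
# Verifying the Example 3 estimates a = 24.452, b = 2.38095.
x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [25, 31, 27, 28, 36, 35, 32, 34]

n = len(x)
sx, sy = sum(x), sum(y)                  # 22, 248
sxx = sum(v * v for v in x)              # 71
sxy = sum(u * v for u, v in zip(x, y))   # 707
denom = n * sxx - sx ** 2                # 84
b = (n * sxy - sx * sy) / denom          # 200/84 = 2.38095...
a = (sy * sxx - sx * sxy) / denom        # 24.452...
```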
Examples (4)
Fit a parabola y = ax² + bx + c in the least squares sense to the data

X= 10 12 15 23 20
Y= 14 17 23 25 21
Solution
We are given y = ax² + bx + c
The normal equations for the curve are
ΣY = aΣX² + bΣX + 5c
ΣXY = aΣX³ + bΣX² + cΣX
ΣX²Y = aΣX⁴ + bΣX³ + cΣX²
X Y X² X³ X⁴ XY X²Y
10 14 100 1000 10000 140 1400
12 17 144 1728 20736 204 2448
15 23 225 3375 50625 345 5175
23 25 529 12167 279841 575 13225
20 21 400 8000 160000 420 8400
ΣX = 80, ΣY = 100, ΣX² = 1398, ΣX³ = 26270, ΣX⁴ = 521202, ΣXY = 1684, ΣX²Y = 30648
Substituting the obtained values from the table in normal equations, we have
100 = 1398 a + 80 b + 5c
1684 = 26270 a + 1398 b + 80 c
30648 = 521202 a + 26270 b + 1398 c
on solving, a = −0.07, b = 3.03, c = −8.89
Hence the required equation is
Y = −0.07X² + 3.03X − 8.89
Examples (5)
The curve to be fitted is Y = a e^(bX). Taking logarithms to base 10,
log10 Y = log10 a + log10 e^(bX)
log10 Y = log10 a + bX log10 e
Put y = log10 Y, A = log10 a and B = b log10 e; we get the straight line
y = A + BX
∴ the normal equations are
Σy = nA + BΣX
ΣXy = AΣX + BΣX²
Table construction:
X Y y = log10 Y X² Xy
0 0.10 -1 0 0
0.5 0.45 -0.3468 0.25 -0.1734
1.0 2.15 0.3324 1.0 0.3324
1.5 9.15 0.9614 2.25 1.4421
2.0 40.35 1.6058 4.0 3.2116
2.5 180.75 2.2571 6.25 5.6428
ΣX = 7.5, Σy = 3.8099, ΣX² = 13.75, ΣXy = 10.4555

Substituting the values of the summations in the normal equations, we get 3.8099 = 6A + 7.5B
and 10.4555 = 7.5A + 13.75B
On solving, A = −0.9916, B = 1.3013
log10 a = −0.9916, b log10 e = 1.3013
a = antilog10(A) = 0.1019, b = B/log10 e = 2.9963, and the curve is Y = 0.1019 e^(2.9963X)
Note: try the above example by taking natural logs (i.e. ln).
Alternatively
Y = a e^(bX)
ln Y = ln a + ln e^(bX)
ln Y = ln a + bX ln e = ln a + bX
put ln Y = y, ln a = A (so that a = antiln(A)); we have
y = A + bX
now find the normal equations for this straight line:
Σy = nA + bΣX
ΣXy = AΣX + bΣX²
and construct the table with columns X, Y, y = ln Y, X², Xy as before.
Exercises
(1) Given below are data relating to the thermal energy generated in Pakistan, 1981-94. The energy generation is in billion kWh.
Year 1981 1982 1983 1984 1985 1986 1987
Energy Generated 4.2 5.2 5.1 5.2 6.5 7.3 8.4
Year 1988 1989 1990 1991 1992 1993 1994
Energy Generated 10.8 11.9 14.5 16.1 19.4 19.7 23.0
Fit a straight line to the data. Find the residuals. Plot the residuals and comment on your result.
(2) Following are the annual installations of computers in labs at UET. Fit a linear regression equation of the computers on years and give the annual rate of installation.
Year: 2001-2003 2003-2005 2005-2007 2007-2009 2009-2011
No of Computers installed: 139 144 150 154 158
Note
In each situation where the independent variable is a time factor, the values assigned to
2001-2003, … may be taken as 1, 2, 3, …
(3) A study of the department of transportation on the effect of bus ticket prices on the number of passengers produced the following results:

Ticket Prices (Cents): 25 30 35 40 45 50 55 60


Passengers per 100 miles: 800 780 660 640 600 600 620 620

(a) Plot these data.


(b) Develop an estimating line that best describes the data.
(c) Predict the number of passengers per 100 miles if the ticket price were 50 cents.
(Statistics for Management, 7th Ed, by Richard Levin and David Rubin Prob. 12.18 )
(4) A tire manufacturing company is interested in removing pollutants from the exhaust at the factory, and cost is a concern. The company has collected data from other companies concerning the amount of money spent on environmental measures and the resulting amount of dangerous pollutants released (as a percentage of total emissions).
Money Spent ($ thousands) 8.4 10.2 16.5 21.7 9.4 8.3 11.5
Percentage of Dangerous Pollutants 35.9 31.8 24.7 25.2 36.8 35.8 33.4
Money Spent ($ thousands) 18.4 16.7 19.3 28.4 4.7 12.3
Percentage of Dangerous Pollutants 25.4 31.4 27.4 15.8 31.5 28.9
(a) Compute the regression equation
(b) Predict the percentage of dangerous pollutants released when $20,000 is spent on the control measures.
(c) Calculate the standard error of estimate.
(Statistics for Management, 7th Ed, by Richard Levin and David Rubin Prob. 12.24)
Examples (6)
Use the method of least squares to determine the constants a and b such that Y = a e^(bX) fits the following data.
X= 0.0 0.5 1.0 1.5 2.0 2.5
Y= 0.10 0.45 2.15 9.15 40.35 180.75
Solution
The curve to be fitted is Y = a e^(bX), or y = A + BX, where y = log10 Y, A = log10 a and B = b log10 e.
∴ the normal equations are
Σy = 6A + BΣX
ΣXy = AΣX + BΣX²
Table construction:
X Y y = log10 Y X² Xy
0 0.10 -1 0 0
0.5 0.45 - 0.3468 0.25 -0.1734
1.0 2.15 0.3324 1.0 0.3324
1.5 9.15 0.9614 2.25 1.4421
2.0 40.35 1.6058 4.0 3.2116
2.5 180.75 2.2571 6.25 5.6428
ΣX = 7.5, Σy = 3.8099, ΣX² = 13.75, ΣXy = 10.4555

Substituting the values of the summations in the normal equations, we get
3.8099 = 6A + 7.5B and 10.4555 = 7.5A + 13.75B
On solving, A = −0.9916, B = 1.3013
a = antilog10(A) = 0.1019, b = B/log10 e = 2.9963
and the curve is Y = 0.1019 e^(2.9963X)
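The log-linearization in Example 6 can be reproduced directly. A sketch (Python, added for illustration) that fits y = A + BX to the base-10 logs and converts back:

```python
# Reproducing Example 6: linearize Y = a*e^(bX) with base-10 logs,
# fit the straight line y = A + B*X, then recover a and b.
import math

X = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
Y = [0.10, 0.45, 2.15, 9.15, 40.35, 180.75]

ylog = [math.log10(v) for v in Y]
n = len(X)
sx, sy = sum(X), sum(ylog)
sxx = sum(v * v for v in X)
sxy = sum(u * v for u, v in zip(X, ylog))

B = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # ≈ 1.3013
A = (sy - B * sx) / n                          # ≈ -0.9916
a = 10 ** A                                    # ≈ 0.1019
b = B / math.log10(math.e)                     # ≈ 2.9963
```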
Examples (7)
Obtain a relation of the form Y = a·b^X for the following data by the method of least squares.
X= 2 3 4 5 6
Y= 8.3 15.3 33.1 65.2 127.4
Solution
The curve to be fitted is Y = a·b^X, or y = A + BX, where A = log10 a, B = log10 b and y = log10 Y.
∴ the normal equations are: Σy = 5A + BΣX and ΣXy = AΣX + BΣX²

X Y y = log10 Y X² Xy
2 8.3 0.9191 4 1.8382
3 15.4 1.1872 9 3.5616
4 33.1 1.5198 16 6.0792
5 65.2 1.8142 25 9.0710
6 127.4 2.1052 36 12.6312
ΣX = 20, Σy = 7.5455, ΣX² = 90, ΣXy = 33.1812

Substituting the values in normal equations, we get


7.5455 = 5A + 20B and 33.1812 = 20A + 90 B
On solving, A = 0.31 and B = 0.3
 a = anti-log10 A = 2.04 and b = anti-log10 B = 1.995
Hence, the required curve is Y = 2.04 (1.995)^X
Exercise 2
Fit a least squares line for 20 pairs of observations having X̄ = 2, Ȳ = 8, ΣX² = 180 and ΣXY = 404.
Exercise 3
For 5 pairs of observations, it is given that the A.M. of X is 2 and the A.M. of Y is 15. It is also known that ΣX² = 30, ΣX³ = 100, ΣX⁴ = 354, ΣXY = 242, ΣX²Y = 850. Fit a second degree parabola taking X as the independent variable.
Exercise 4
Given the following sets of values:
X 6.5 5.3 8.6 1.2 4.2 2.9 1.1 3.9
Y 3.2 2.7 4.5 1.0 2.0 1.7 0.6 1.9
(a) Compute the least squares regression equation for Y values on X values.
(b) Compute the least squares regression equation for X values on Y values.
Exercise 5
For each of the following data, determine the estimated regression equation Y = a + bX:
(a) Ȳ = 20, X̄ = 10, ΣXY = 1000, ΣX² = 2000, n = 10.
(b) ΣX = 528, ΣY = 11720, ΣXY = 193640, ΣX² = 11440, n = 32
Exercise 6
For the following set of data: (AIOU)
X 13 16 14 11 17 9 13 17 18 12
Y 6.2 8.6 7.2 4.5 9.0 3.5 6.5 9.3 9.5 5.7
(a) Plot the scatter diagram.
(b) Develop the estimating equation that best describes the data.
(c) Predict Y for X = 10, 15, 20.
A Motivation: Standard Error of Estimate
For the data in Example (3) relating to potato yields, remove the first pair (x = 1, y = 25) and fit a line of regression to the remaining seven pairs. Is the line the same as the one already determined for the eight pairs? Do the same by removing the second pair (x = 1.5, y = 31) instead of the first. Are the three lines different?
You will observe that a change in the data leads to a different line. We say that a least squares line has a zero breakdown point. There are methods in which a change of as many as 50% of the data points does not cause any change in the equation of the fitted line. This factor will be discussed in the following section.
Standard Deviation of Regression or Standard Error of Estimate
Now we study how to measure the reliability of the estimating equation we have developed. From a scatter diagram, we realize that a line as an estimator is more accurate when the data points lie close to the line than when the points are farther away from it. To measure the reliability of the estimating equation, statisticians have developed the standard error of estimate, which is similar to the standard deviation.
The standard error of estimate measures the variability, or scatter, of the observed values around the regression line. The standard deviation of regression, or the standard error of estimate of Y on X, is denoted by sY.X and defined by

sY.X = sqrt[ Σ(Y − Ŷ)² / (n − 2) ],  where Ŷ = a + bX is the estimated regression line.

Alternative formula:
sY.X = sqrt[ (ΣYi² − aΣYi − bΣXiYi) / (n − 2) ],  where n is the number of pairs.

Dividing by 'n − 2'
Because the values a and b were obtained from a sample of data points, we lose 2 degrees of freedom when we use those points to estimate the regression line.
In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.
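The definition can be applied to the snow-day data of Example 2. A sketch (Python, added for illustration; the coefficients are recomputed exactly rather than taken from the rounded text values):

```python
# Standard error of estimate s_{Y.X} = sqrt(SSE / (n - 2)) for Example 2.
import math

X = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
Y = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]

n = len(X)
sx, sy = sum(X), sum(Y)
sxx = sum(v * v for v in X)
sxy = sum(u * v for u, v in zip(X, Y))
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = (sy - b * sx) / n

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))  # Σ(Y − Ŷ)²
s_yx = math.sqrt(sse / (n - 2))                          # ≈ 635 incidents
```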
Exercise 7
Cost accountants often estimate overhead based on the level of production. At the Standard Knitting Co. they have collected information on overhead expenses and units produced at different plants, and want to estimate a regression equation to predict future overhead. (AIOU)

Overhead 191 170 272 155 280 173 234 116 153 178
Units 40 42 53 35 56 39 48 30 37 40
(a) Develop the regression equation for the cost accountants.
(b) Predict overhead when 50 units are produced.
(c) Calculate the standard error of estimate.
Coefficient of Determination
The coefficient of determination is used to determine the goodness of fit of the estimated regression equation.
In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variance in the
dependent variable that is predictable from the independent variable(s).
It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of
hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on
the proportion of total variation of outcomes explained by the model.
For the ith observation, the residual is yi − ŷi. The sum of the squares of these residuals or errors is known as the sum of squares due to error, denoted by SSE:
SSE = Σ(yi − ŷi)²
also called the unexplained variation.
The difference yi − ȳ provides a measure of the error involved in using ȳ to estimate yi. The corresponding sum of squares, called the total sum of squares, is denoted by SST:
SST = Σ(yi − ȳ)²
also called the total variation.

To measure how much the ŷ values on the estimated regression line deviate from ȳ, another sum of squares is computed. This sum of squares, called the sum of squares due to regression, is denoted by SSR:
SSR = Σ(ŷi − ȳ)²
also called the explained variation.
The relationship between these three sums of squares provides one of the most important results in statistics.
i.e. SST = SSR + SSE
The estimated regression equation would provide a perfect fit if every value of the dependent variable yi happened to lie on the estimated regression line. In this case, yi − ŷi would be zero for each observation, resulting in SSE = 0. Then for a perfect fit SSR = SST and the ratio (SSR/SST) must equal one. Poorer fits result in larger values of SSE. Hence, the largest value of SSE (and hence the poorest fit) occurs when SSR = 0 and SSE = SST.
The ratio SSR/SST, which takes values between zero and one, is used to evaluate the goodness of fit of the estimated regression equation. This ratio is called the coefficient of determination and is denoted by r². See the diagram.
r² = SSR/SST = Explained Variation / Total Variation = Σ(Ŷ − Ȳ)² / Σ(Y − Ȳ)² = 1 − Σ(Y − Ŷ)² / Σ(Y − Ȳ)²

Alternative formula for the coefficient of determination:
r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
Exercise 8
Calculate the coefficient of determination using both formulas for the following data:
Year  R&D expenses (X)  Annual Profit (Y)
1st  5  31
2nd  11  40
3rd  4  30
4th  5  34
5th  3  25
6th  2  20
ΣX = 30, ΣY = 180
Exercise 9
The curb weight x in hundreds of pounds and braking distance y in feet, at 50 miles per hour on dry pavement, were measured for five
vehicles, with the results shown in the table.

X: 25 27.5 32.5 35 45
Y: 105 125 140 140 150

The fitted line for these data is Y = 66.34 + 1.990X and the fitted second degree parabola is
Y = −112.5 + 12.61X − 0.1510X², shown in the following figure. Compute the coefficient of determination and interpret its value in the context of vehicle weight and braking distance.

Examples (8)
An architect wants to determine the relationship between the height (in feet) of a building and the number of stories in the building. The data for a sample of 10 buildings in a city are shown below. Explain the relationship.

Stories: X 64 54 40 31 45 38 42 41 37 40
Height: Y 841 725 635 616 615 582 535 520 511 485
Correlation
Two variables are said to be correlated if they tend to vary simultaneously in some direction; if both variables tend to increase (or decrease) together, the correlation is said to be direct or positive, e.g. the length of an iron bar will increase as temperature increases. If one variable tends to increase as the other variable decreases, the correlation is said to be negative or inverse, e.g. the volume of a gas will decrease as the pressure increases.
(1) Correlation measures the STRENGTH of linear association between paired variables, say X and Y. On the other hand, regression tells us the FORM of linear association that best predicts Y from the values of X.
(2) Correlation is calculated whenever:
o both X and Y are measured in each subject and we wish to quantify how much they are linearly associated;
o in particular, Pearson's product moment correlation coefficient is used when the assumption that both X and Y are sampled from normally distributed populations is satisfied;
o or Spearman's rank order correlation coefficient is used if the assumption of normality is not satisfied;
o correlation is not used when the variables are manipulated, for example, in experiments.
The numerical measure of the strength of the linear relationship between any two variables is called the correlation coefficient, usually denoted by r, and is defined by
r = Σ(X − X̄)(Y − Ȳ) / sqrt[ Σ(X − X̄)² Σ(Y − Ȳ)² ]
called the Pearson product moment correlation coefficient.
Alternatively,
r = [ΣXY − (ΣX)(ΣY)/n] / sqrt[ (ΣX² − (ΣX)²/n)(ΣY² − (ΣY)²/n) ]
Its range is from −1 to +1.
If r = −1, there is a perfect negative correlation.
If r = +1, there is a perfect positive correlation.
If r is near −1, there is a strong negative correlation.
If r is near +1, there is a strong positive correlation.
If r is near 0 but negative, there is a weak negative correlation.
If r is near 0 but positive, there is a weak positive correlation.
It is important to note that r = 0 does not mean that there is no relationship at all. e.g. if all the observed values lie exactly on a circle,
there is a perfect non-linear relationship between the variables.
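The alternative (computational) formula can be applied to the building data of Example 8. A sketch (Python, added for illustration):

```python
# Pearson's product moment correlation r for the Example 8 building data,
# computed with the alternative (sums-based) formula.
import math

stories = [64, 54, 40, 31, 45, 38, 42, 41, 37, 40]
heights = [841, 725, 635, 616, 615, 582, 535, 520, 511, 485]

n = len(stories)
sx, sy = sum(stories), sum(heights)
sxy = sum(u * v for u, v in zip(stories, heights))
sxx = sum(v * v for v in stories)
syy = sum(v * v for v in heights)

r = (sxy - sx * sy / n) / math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
# r ≈ 0.80: a fairly strong positive correlation between stories and height
```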
Rank Correlation
Sometimes, actual measurements of individuals or objects are either not available or accurate assessment is not possible. They are then arranged in order according to some characteristic of interest. Such an ordered arrangement is called a ranking, and the order given to an individual or object is called its rank. The correlation between two such sets of rankings is called rank correlation.
we have
r = 1 − 6Σdi² / [n(n² − 1)]
where di is the difference between the two ranks of the ith individual. This also ranges from −1 to +1.
Note
If two objects or observations are tied (having the same value), say for fourth and fifth place, then they are both given the mean of ranks 4 and 5, i.e. 4.5.
This situation arises in the following example.
Examples (9)
The following table shows the number of hours studied (X) by a random sample of ten students and their grades in examination (Y):
X: 8 5 11 13 10 5 18 15 2 8
Y: 56 44 79 72 70 54 94 85 33 65
Calculate Spearman’s rank correlation coefficient.
Solution
We rank the X values by giving rank 1 to the highest value 18, rank 2 to 15, rank 3 to 13, rank 4 to 11, rank 5 to 10, rank 6.5 (mean of rank
6 and 7) to both 8, rank 8.5 (mean of rank 8 and 9) to both 5 and rank 10 to 2. Similarly we rank the values of Y by giving 1 to the highest value 94,
rank 2 to 85, rank 3 to 79, …, and rank 10 to 33 which is the smallest.
Table given below:
X Y Rank of X Rank of Y d d²
8 56 6.5 7 - 0.5 0.25
5 44 8.5 9 - 0.5 0.25
11 79 4 3 1.0 1
13 72 3 4 - 1.0 1
10 70 5 5 0.0 0
5 54 8.5 8 0.5 0.25
18 94 1 1 0.0 0
15 85 2 2 0.0 0
2 33 10 10 0.0 0
8 65 6.5 6 0.5 0.25

Σd² = 3
The value of n is 10.
Hence r = 1 − 6Σdi²/[n(n² − 1)] = 1 − 6(3)/[10(10² − 1)] = 1 − 18/990 = 0.98
Compare this value with the correlation coefficient for the original values.
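The ranking with ties and the final value can be reproduced programmatically. A sketch (Python, added for illustration; rank 1 goes to the largest value and tied values share the mean of the ranks they occupy):

```python
# Reproducing Example 9: Spearman's rank correlation with tied ranks.

def ranks(values):
    order = sorted(values, reverse=True)
    result = []
    for v in values:
        first = order.index(v) + 1           # first position of v (1-based)
        last = first + order.count(v) - 1    # last position of v
        result.append((first + last) / 2)    # mean rank handles ties
    return result

hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
grades = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]

rx, ry = ranks(hours), ranks(grades)
d2 = sum((u - v) ** 2 for u, v in zip(rx, ry))   # Σd² = 3
n = len(hours)
r = 1 - 6 * d2 / (n * (n * n - 1))               # 1 - 18/990 ≈ 0.98
```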
Exercise 10

Ten competitors in a beauty contest are ranked by three judges in the following order

1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7

Use the rank correlation coefficient to discuss which pair of judges has the nearest approach to common tastes in beauty.
(2) Multiple Regression and Correlation
Regression Model and Regression Equation
Multiple regression analysis is the study of how a dependent variable y is related to two or more independent variables. In the general
case, we will use ‘k’ to denote the number of independent variables.
The equation that describes how the dependent variable y is related to the independent variables X1, X2, . . ., Xk and an error term is called the multiple regression model, and the general form of the model is
Y = β0 + β1X1 + β2X2 + … + βkXk + ε
where ε is the random error, β0 is the intercept and β1, β2, … , βk are the regression coefficients.
One of the assumptions is that the mean or expected value of ε is zero. A consequence of this assumption is that the mean or expected value of Y, denoted E(Y), is equal to β0 + β1X1 + β2X2 + … + βkXk. The equation that describes how the mean value of Y is related to X1, X2, . . ., Xk is called the multiple regression equation.
The Multiple Regression Equation is
E(Y) = β0 + β1X1 + β2X2 + … + βkXk
Unfortunately, these parameter values β0, β1, β2, … , βk are unknown and must be estimated from sample data. A simple random sample
is used to compute sample statistics b0, b1, b2, …, bk that are used as the point estimators of the parameters β0, β1, β2, … , βk. These sample
statistics provide the following estimated multiple regression equation:
Ŷ = b0 + b1X1 + b2X2 + … + bkXk
The same method of least squares is used to develop the estimated multiple regression equation.
Multiple Linear Regression with two Regressors
Ŷ = b0 + b1X1 + b2X2
Example (10)
A statistician wants to predict the income of restaurants using two independent variables: the number of restaurant employees and
the restaurant floor area. He collected the following data.
Income (000)   Floor Area      Number of
     Y         (000 sq. ft)    Employees
                   X1              X2
    30             10              15
    22              5               8
    16             10              12
     7              3               7
    14              2              10
Calculate the estimated multiple linear regression equation Ŷ = b0 + b1X1 + b2X2.
Solution
Normal Equations are:
ΣY   = na   + b1ΣX1   + b2ΣX2
ΣX1Y = aΣX1 + b1ΣX1²  + b2ΣX1X2
ΣX2Y = aΣX2 + b1ΣX1X2 + b2ΣX2²
Construction of the table:

  Y    X1   X2   X1²   X2²   X1X2   X1Y    X2Y
 30    10   15   100   225   150    300    450
 22     5    8    25    64    40    110    176
 16    10   12   100   144   120    160    192
  7     3    7     9    49    21     21     49
 14     2   10     4   100    20     28    140
 89    30   52   238   582   351    619   1007
Substituting the sums in the normal equations
5a + 30 b1 + 52 b2 = 89
30 a + 238 b1 + 351 b2 = 619
52 a + 351 b1 + 582 b2 = 1007
By solving simultaneously we have
a = - 1.33, b1 = 0.38, b2 = 1.62
Hence the desired estimated multiple linear regression equation is Ŷ = - 1.33 + 0.38 X1 + 1.62 X2
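The three normal equations form a 3 × 3 linear system, so the hand solution can be checked numerically. Below is a pure-Python sketch (function names ours) that solves the system by Cramer's rule using the sums from the table.

```python
def det3(m):
    # determinant of a 3x3 matrix, expanded along the first row
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def replace_col(m, col, v):
    # copy of m with column `col` replaced by vector v (Cramer's rule)
    return [[v[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]

# coefficient matrix and right-hand side from the three normal equations
A = [[5, 30, 52], [30, 238, 351], [52, 351, 582]]
rhs = [89, 619, 1007]

D = det3(A)
a  = det3(replace_col(A, 0, rhs)) / D   # intercept
b1 = det3(replace_col(A, 1, rhs)) / D   # coefficient of X1, ≈ 0.38
b2 = det3(replace_col(A, 2, rhs)) / D   # coefficient of X2, ≈ 1.62
```

The exact solution gives b1 ≈ 0.377 and b2 ≈ 1.619, which round to the text's 0.38 and 1.62; back-substituting these rounded values into the first normal equation reproduces the text's intercept, a = (89 - 30(0.38) - 52(1.62))/5 ≈ -1.33.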
Standard Deviation of the Multiple Regression OR Standard Error of Estimate
SY.12 = √[Σ(Y - Ŷ)²/(n - 3)] = √[(ΣY² - aΣY - b1ΣX1Y - b2ΣX2Y)/(n - 3)]
Coefficient of Multiple Determination
R²Y.12 = Σ(Ŷ - Ȳ)²/Σ(Y - Ȳ)² = [aΣY + b1ΣX1Y + b2ΣX2Y - (ΣY)²/n]/[ΣY² - (ΣY)²/n]
Coefficient of Multiple Correlation
The coefficient of multiple correlation between Y and the variables X1 and X2 is given by
RY.12 = √[(r²YX1 + r²YX2 - 2 rYX1 rYX2 rX1X2)/(1 - r²X1X2)]
Example (11)
From the data given in Example (10), compute the standard error of estimate, the coefficient of multiple determination and the coefficient of
multiple correlation.
Solution
From the example, ΣY = 89, ΣY² = 1885, n = 5, a = -1.33, b1 = 0.38, b2 = 1.62, ΣX1Y = 619, ΣX2Y = 1007.
The standard error of estimate is

SY.12 = √[(ΣY² - aΣY - b1ΣX1Y - b2ΣX2Y)/(n - 3)]
      = √[(1885 - (-1.33)(89) - (0.38)(619) - (1.62)(1007))/(5 - 3)]
      = √(136.81/2) = √68.405 = 8.27
The coefficient of multiple determination is

R²Y.12 = [aΣY + b1ΣX1Y + b2ΣX2Y - (ΣY)²/n]/[ΣY² - (ΣY)²/n]
       = [(-1.33)(89) + (0.38)(619) + (1.62)(1007) - (89)²/5]/[1885 - (89)²/5]
       = 163.99/300.80 = 0.55
This means 55% of the variability in income is explained by its linear relationship with floor area and the number of employees.
The coefficient of multiple correlation is
RY.12 = √0.55 = 0.74
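The worked figures above can be verified by plugging the same sums into the formulas; a small sketch (variable names ours), using the text's rounded coefficients a = -1.33, b1 = 0.38, b2 = 1.62:

```python
import math

# sums and fitted coefficients from Example (10)
n, a, b1, b2 = 5, -1.33, 0.38, 1.62
sum_Y, sum_Y2, sum_X1Y, sum_X2Y = 89, 1885, 619, 1007

# unexplained variation: sum(Y^2) - a*sum(Y) - b1*sum(X1*Y) - b2*sum(X2*Y)
sse = sum_Y2 - a * sum_Y - b1 * sum_X1Y - b2 * sum_X2Y   # 136.81
se = math.sqrt(sse / (n - 3))                            # standard error ≈ 8.27

sst = sum_Y2 - sum_Y ** 2 / n                            # total variation = 300.80
ssr = a * sum_Y + b1 * sum_X1Y + b2 * sum_X2Y - sum_Y ** 2 / n  # ≈ 163.99
r2 = ssr / sst                                           # ≈ 0.55
R = math.sqrt(r2)                                        # ≈ 0.74
```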
Example (12)
(c) Standard Deviation of the Multiple Regression OR Standard Error of Estimate
(d) Coefficient of Multiple Determination
(e) Coefficient of Multiple Correlation
Exercise 11
Construction standards specify the strength of concrete 28 days after it is poured. For 30 samples of various types of concrete the
strength x after 3 days and the strength y after 28 days (both in hundreds of pounds per square inch) were measured. The sample data are
summarized by the following information.
n = 30, Σx = 501.6, Σy = 1338.8, Σxy = 23246.55, Σx² = 8724.74, Σy² = 61980.14, 11 < x < 22
Compute the linear correlation coefficient for these sample data and interpret its meaning in the context of the problem.
Exercise 12
The given data are based on a study of drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of
5 feet in rock increases with the depth at which the drilling begins. So the depth at which drilling begins is the predictor variable X, and the time (in
minutes) to drill five feet is the response variable Y. Source: Penner, R. and Watts, D. C., "Mining Information," The American Statistician, Vol. 45, No.
1, Feb. 1991, p. 6.
X:   35    50    75    95   120   130   145   155   160   175   185   190
Y: 5.88  5.99  6.74  6.10  7.47  6.93  6.42  7.97  7.92  7.62  6.89  7.90
Exercise 13
