Handout 3 Regression and Correlation
We are mainly concerned with the use of associations among variables. These associations are useful in many ways, and one of the most
important and most common is prediction: using the values of related quantities to predict the value of another quantity. Such methods
may also lead to ways of controlling the value of one variable by adjusting the values of related variables. Regression analysis offers us
a sensible and sound approach for examining associations among variables and for obtaining good rules for prediction.
Regression Analysis
We make decisions based on predictions of future events.
We develop a relationship between what is already known and what is to be estimated.
There is always a tendency towards the average. For example, the heights of children born to tall parents tend back towards the average
height. So we make use of regression, which is the process of predicting one variable (the height of the children) from another (the height of the parents). To do so
we develop an estimating equation, that is, a mathematical formula that relates the known variable to the unknown variable.
A scatter diagram can give us two types of information. Visually, we can look for patterns that indicate that the variables are related. Then, if the
variables are related, we can see what kind of line, or estimating equation, describes this relationship.
Later on we will study correlation analysis to determine the degree to which the variables are related. It tells us how well the estimating
equation actually describes the relationship.
We look for a CAUSAL relationship between variables, i.e. how the independent variable causes the dependent variable to change.
Deterministic and Probabilistic Relations or Models
A formula that relates quantities in the real world is called a model. Recall from physics that if a body is moving under
uniform motion with an initial velocity ‘u’ and uniform acceleration ‘a’, the velocity after time ‘t’ is given by:
v = u + at
This is a model for uniform motion. This model has the property that when a value of ‘t’ is substituted in the above equation, the value of
v is determined without any error. Such models are called deterministic models. An important example of the deterministic model is the
relationship between the Celsius and Fahrenheit scales, F = 32 + (9/5)C. Other examples of such models are Boyle’s law, Newton’s law of
gravitation, Ohm’s law, etc.
Examples (1)
Consider another example: to investigate the relationship between density and compressive strength at 28 days, 40
concrete cube test records from the period 8 July 1991 to 21 September 1992 were examined and arranged in reverse chronological order.
Density (kg/m3)  Strength (N/mm2)  |  Density  Strength  |  Density  Strength  |  Density  Strength
2437 60.5 2428 56.9 2435 57.8 2444 64.9
2437 60.9 2448 67.3 2446 60.9 2447 63.4
2425 59.8 2456 68.9 2441 61.9 2433 60.5
2427 53.4 2436 49.9 2456 67.2 2429 68.1
2435 68.3 2454 59.8 2458 61.1 2455 56.3
2471 65.7 2449 56.7 2414 50.7 2473 64.9
2472 61.5 2441 57.9 2448 59.0 2488 69.5
2445 60.0 2457 60.2 2445 63.3 2454 58.9
2436 59.6 2447 55.8 2436 52.5 2427 54.4
2450 60.5 2436 53.2 2469 54.6 2411 58.8
(Ref. Applied Statistics for Civil and Environmental Engineers, 2nd Edition, by Kottegoda and Rosso, Example 6.1)
Let x and y denote concrete density and strength, respectively. Suppose the investigator believes that the relation between y and x
is exactly given by:
Y = −274.4 + 0.1368x.
If this were true, we would obtain the exact value of the strength y for any given value of x. Thus when x = 2445, the strength must be:
Y = −274.4 + 0.1368(2445) = 60.076
But the observed value is 60.0, so there is an error of 60.076 − 60.0 = 0.076. Hence no deterministic model can be constructed to represent this experiment.
A model that allows for this type of error is known as a probabilistic model. The deterministic relation in such cases is modified to include both a deterministic
component and a random error component, given as
Yi = a + bXi + εi , where the εi are the unknown random errors.
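This error term can be seen concretely with the density–strength pair (2445, 60.0) from the concrete data; a minimal Python check of the deterministic model's miss:

```python
# Deterministic model proposed for the concrete data: Y = -274.4 + 0.1368*x
def predict(density):
    """Predicted 28-day compressive strength (N/mm^2) from density (kg/m^3)."""
    return -274.4 + 0.1368 * density

observed = 60.0              # strength actually recorded at density x = 2445
predicted = predict(2445)    # deterministic prediction, approx. 60.076
error = predicted - observed # the part the deterministic model cannot remove
```

No matter how the constants are tuned, some residual error of this kind remains for most observations, which is exactly why the random term εi is added.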
Regression Model
There are many statistical investigations in which the main objective is to determine whether a relationship exists between two or more
variables. If such a relationship can be expressed by a mathematical formula, we will then be able to use it for the purpose of making predictions.
The reliability of any prediction will, of course, depend on the strength of the relationship between the variables included in the formula.
A mathematical equation that allows us to predict values of one dependent variable from known values of one or more independent
variables is called a regression equation. Today the term regression is applied to all types of prediction problems and does not necessarily imply a
regression towards the population mean.
Linear Regression
We consider here the problem of estimating or predicting the value of a dependent variable Y on the basis of a known measurement of an
independent and frequently controlled variable X. The variable to be estimated or predicted is called the dependent variable,
regressand, or response variable, and the variable on the basis of which the dependent variable is estimated is called the independent
variable, the regressor, or the predictor.
e.g. If we want to estimate the heights of children on the basis of their ages, the heights would be the dependent variable and the
ages would be the independent variable. In estimating the yields of a crop, on the basis of the amount of the fertilizer used, the yield will be the
dependent variable and the amount of fertilizer would be the independent variable.
Scatter Diagram
Let us consider the data given in Example 1. The data have been plotted in the figure to give a scatter diagram.
In the scatter diagram the points follow a straight line fairly closely, indicating that the two variables are to some extent linearly related. Once
a reasonable linear relationship is observed, we usually try to express it mathematically by a straight-line equation Y = a + bX, called the linear
regression line, where the constants a and b represent the y-intercept and slope respectively. Such a regression line has been drawn in the following
figure. This linear regression line can be used to predict the value of Y corresponding to any given value of X.
Many possible regression lines could be fitted to the sample data, but we choose the particular line that best fits the data. The best
regression line is obtained by estimating the regression parameters by the method of least squares, the most commonly used method.
Estimation of a Straight Line using the Method of Least Squares
The basic linear relationship between the dependent variable Yi and the value Xi is
Yi = a + bXi + εi
where a and b are the unknown population parameters (b is also called the coefficient of regression), the Yi are the observed values and
the εi are the error components. We write the equation for the estimating line as
Ŷi = a + bXi
The line will have a good fit if it minimizes the error between the estimated points on the line and the actual observed points that were
used to draw it.
One way to measure the error of the estimating line is to sum all the individual differences or errors, between the estimated points and
the observed points. In the figure we have two estimated lines that have been fitted to the same set of three data points. Two very different lines
have been drawn to describe the relationship between the two variables.
We have calculated the individual differences between the corresponding Y and Ŷ, and then found the sum of these differences.
The total error in both cases is zero, which would suggest that both lines describe the data equally well. Thus we must conclude that
summing individual differences is not a reliable way to judge the goodness of fit of an estimating line: in the sum of
the individual errors, the positive and negative values simply cancel each other out.
The sum of the absolute values of the errors seems a better criterion for a good fit, but it does not stress
the magnitude of the errors. If we instead consider the sum of the squares of the errors, it magnifies the larger errors and removes the cancelling effect of the
positive and negative values. So finally we look for an estimating line that minimizes the sum of the squares of the errors. This is called the
method of least squares.
Method of Least Squares
The method of least squares determines the values of the unknown parameters that minimize the sum of squares of the errors where
errors are defined as the difference between observed values and the corresponding predicted or estimated values.
It is denoted by

S(a, b) = Σei² = Σ(Yi − Ŷi)² = Σ(Yi − a − bXi)²,  where the sums run over i = 1 to n.

To minimize S(a, b), we set its first partial derivatives with respect to a and b equal to zero:

∂S/∂a = −2 Σ(Yi − a − bXi) = 0
∂S/∂b = −2 Σ(Yi − a − bXi)Xi = 0

Simplifying, we have the normal equations

ΣY = na + bΣX
ΣXY = aΣX + bΣX²

Solving, we have

a = [ΣY·ΣX² − ΣX·ΣXY] / [nΣX² − (ΣX)²],   b = [nΣXY − ΣX·ΣY] / [nΣX² − (ΣX)²]
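These closed-form solutions translate directly into code; a minimal sketch (the helper name `fit_line` is ours, not the handout's):

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b from the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx ** 2          # n*Sum(X^2) - (Sum X)^2
    a = (sy * sxx - sx * sxy) / denom  # intercept
    b = (n * sxy - sx * sy) / denom    # slope
    return a, b

# On exactly linear data the fit recovers the line exactly: y = 1 + 2x
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```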
If the variable X is taken as the dependent variable, then the least squares line is given by

X = c + dY

and the normal equations are

ΣX = nc + dΣY
ΣXY = cΣY + dΣY²

Solving simultaneously, we have the values of c and d:

c = [ΣX·ΣY² − ΣY·ΣXY] / [nΣY² − (ΣY)²],   d = [nΣXY − ΣX·ΣY] / [nΣY² − (ΣY)²]
Examples (2) (Weather and Traffic)
Weather and traffic are two everyday occurrences that have inherent randomness. For example, if you live in a cold climate you know that
traffic tends to be more difficult when snow falls and covers the roads. We can create a simple mathematical model of traffic incidents as a function
of snowy weather, based on known data.
In the following table, we have accumulated a record of the number of snow days occurring in a certain locality over the past 10 years,
along with the number of traffic incidents reported to police in the same year. A scatter plot of the data can be used to visualize the possible
correlation.
Snow Days 16 55 43 29 59 42 20 45 30 35
Incidents 5825 11427 9006 5963 11449 8380 5745 9104 6495 6938
The scatter plot is shown below:
We see that there is a general trend to the data, with traffic incidents increasing as the number of snow days increases. We have added a
linear trend line to the data to highlight this relationship. This linear trend is, in fact, a straight line probabilistic model of the data.
A straight-line probabilistic model is often referred to as a linear regression, Y = a + b X
The normal equations are:
ΣY = na + bΣX
ΣXY = aΣX + bΣX²
The table of sums for these normal equations is:
X   Y   X²   XY   ŷ   e = ŷ − y
16 5825 256 93200 4818.2 -1006.8
55 11427 3025 628485 10676 -751
43 9006 1849 387258 8873.6 -132.4
29 5963 841 172927 6770.8 807.8
59 11449 3481 675491 11276.8 -172.2
42 8380 1764 351960 8723.4 343.4
20 5745 400 114900 5419 -326
45 9104 2025 409680 9174 70
30 6495 900 194850 6921 426
35 6938 1225 242830 7672 734
ΣX = 374   ΣY = 80332   ΣX² = 15766   ΣXY = 3271581
Now the normal equations:
80332 = 10a + 374b
3271581 = 374a + 15766b
By solving, we get a = 2415 and b = 150.2
The regression equation is
Y = 2415 + 150.2 X
Using this regression line Y = 2415 + 150.2X, we can predict the number of incidents for a given number of snow days. For example, the
predicted number of incidents for 40 snow days is 2415 + 150.2(40) = 8423. The estimated number of incidents for 45 snow days is 9174, whereas the observed
number of incidents for 45 snow days is 9104.
Such interpretation is valid only when x lies between 16 and 59. An extension of the model beyond these values may lead to unreasonable
results. The value of b is called coefficient of regression.
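The coefficient arithmetic can be checked numerically from the column totals of the table above; with full precision the prediction for 40 snow days comes to about 8424, and the handout's 8423 reflects the rounded coefficients:

```python
# Column totals from the table: n = 10 pairs of (snow days, incidents)
n, sx, sy = 10, 374, 80332
sxx, sxy = 15766, 3271581

denom = n * sxx - sx ** 2
b = (n * sxy - sx * sy) / denom    # slope, approx. 150.2
a = (sy * sxx - sx * sxy) / denom  # intercept, approx. 2415

pred_40 = a + b * 40               # predicted incidents for 40 snow days
```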
Accuracy of the Regression Curve
a) Using the scatter diagram, we can judge roughly how accurate the regression curve is.
b) If the sum of the residuals is zero, the curve is supposed to be a good one.
c) If the points in the plot of the residuals are close to the x-axis and scattered in a random way, the model appears to provide a
good fit.
d) If the points in the plot of the residuals are distributed in a systematic manner, we should try some other model.
From Example 2;
X   Y (observed values)   ŷ (estimated values)   e = ŷ − y (errors or residuals)
16 5825 4818.2 -1006.8
55 11427 10676 -751
43 9006 8873.6 -132.4
29 5963 6770.8 807.8
59 11449 11276.8 -172.2
42 8380 8723.4 343.4
20 5745 5419 -326
45 9104 9174 70
30 6495 6921 426
35 6938 7672 734
ΣX = 374   ΣY = 80332
The scatter plot of the errors is shown below.
Examples (3)
An investigator divides a field into eight plots of equal size and equal fertility and applies varying amounts of fertilizer to each. The yield
of potatoes (in kg) and the fertilizer applied (in kg) were recorded for each plot. The data are given below:
Fertilizer Applied (x): 1 1.5 2 2.5 3 3.5 4 4.5
Yield of Potatoes (y): 25 31 27 28 36 35 32 34
Fit a line of regression to the yield of potatoes using the method of least squares.
Solution
The necessary calculations are given in the table below:
x   y   xy   x²   y²   ŷ   e = y − ŷ
1 25 25.0 1.00 625 26.83 -1.83
1.5 31 46.5 2.25 961 28.02 2.98
2 27 54.0 4.00 729 29.21 -2.21
2.5 28 70.0 6.25 784 30.40 -2.40
3 36 108.0 9.00 1296 31.59 4.41
3.5 35 122.5 12.25 1225 32.79 2.21
4 32 128.0 16.00 1024 33.98 -1.98
4.5 34 153.0 20.25 1156 35.17 -1.17
Total 22 248 707.0 71.00 7800
a = [ΣY·ΣX² − ΣX·ΣXY] / [nΣX² − (ΣX)²],   b = [nΣXY − ΣX·ΣY] / [nΣX² − (ΣX)²]

Substituting, we have a = 24.452 and b = 2.38095.
The fitted line, which is the simple regression line, is ŷ = 24.452 + 2.38095x.
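As a quick check, the totals n = 8, ΣX = 22, ΣY = 248, ΣXY = 707 and ΣX² = 71 reproduce the fitted coefficients:

```python
# Totals from the calculation table for the fertilizer/yield data
n, sx, sy, sxy, sxx = 8, 22, 248, 707, 71

denom = n * sxx - sx ** 2          # 8*71 - 22^2 = 84
a = (sy * sxx - sx * sxy) / denom  # (248*71 - 22*707)/84, approx. 24.452
b = (n * sxy - sx * sy) / denom    # (8*707 - 22*248)/84, approx. 2.38095
```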
Examples (4)
Fit a parabola y = ax² + bx + c in the least squares sense to the data
X= 10 12 15 23 20
Y= 14 17 23 25 21
Solution
We are given y = ax² + bx + c
The normal equations for this curve are (with n = 5)
ΣY = aΣX² + bΣX + 5c
ΣXY = aΣX³ + bΣX² + cΣX
ΣX²Y = aΣX⁴ + bΣX³ + cΣX²
X   Y   X²   X³   X⁴   XY   X²Y
10 14 100 1000 10000 140 1400
12 17 144 1728 20736 204 2448
15 23 225 3375 50625 345 5175
23 25 529 12167 279841 575 13225
20 21 400 8000 160000 420 8400
ΣX = 80   ΣY = 100   ΣX² = 1398   ΣX³ = 26270   ΣX⁴ = 521202   ΣXY = 1684   ΣX²Y = 30648
Substituting the obtained values from the table in normal equations, we have
100 = 1398 a + 80 b + 5c
1684 = 26270 a + 1398 b + 80 c
30648 = 521202 a + 26270 b + 1398 c
on solving, a = −0.07, b = 3.03, c = −8.89
Hence the required equation is
Y = −0.07X² + 3.03X − 8.89
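The 3×3 system of normal equations can be solved exactly, e.g. by Cramer's rule; doing so gives a ≈ −0.0695, b ≈ 3.010, c ≈ −8.73, so the handout's −0.07, 3.03 and −8.89 carry some rounding of intermediate values:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(m, v):
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(m)
    out = []
    for j in range(3):
        mj = [row[:] for row in m]      # replace column j with v
        for i in range(3):
            mj[i][j] = v[i]
        out.append(det3(mj) / d)
    return out

# Normal equations from the table sums, unknowns ordered (a, b, c)
m = [[1398, 80, 5],
     [26270, 1398, 80],
     [521202, 26270, 1398]]
v = [100, 1684, 30648]
a, b, c = solve3(m, v)   # approx. -0.0695, 3.010, -8.73
```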
Examples (5)
The curve to be fitted is

Y = a e^(bX)

Taking logarithms to base 10,

log10 Y = log10 a + bX log10 e

Putting y = log10 Y, A = log10 a and B = b log10 e, we get the straight line

y = A + BX
Substituting the values of the summations in the normal equations, we get
3.8099 = 6A + 7.5B
10.4555 = 7.5A + 13.75B
On solving, A = −0.9916 and B = 1.3013, so that
log10 a = −0.9916 and b log10 e = 1.3013,
giving a = antilog10(A) = 0.1019, b = B / log10 e = 2.9963, and the curve is Y = 0.1019 e^(2.9963X)
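The whole transformation can be replayed from the raw data (the X–Y table for this example appears later in the handout):

```python
import math

X = [0, 0.5, 1.0, 1.5, 2.0, 2.5]
Y = [0.10, 0.45, 2.15, 9.15, 40.35, 180.75]

y = [math.log10(v) for v in Y]       # linearize: log10 Y = A + B*X
n, sx, sy = len(X), sum(X), sum(y)
sxx = sum(x * x for x in X)
sxy = sum(x * v for x, v in zip(X, y))

B = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
A = (sy - B * sx) / n

a = 10 ** A                          # approx. 0.1019
b = B / math.log10(math.e)           # approx. 2.9963
```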
Note: try this above example by taking natural log (i.e. ln )
Alternatively
Y = a e^(bX)
ln Y = ln a + bX ln e = ln a + bX
Put ln Y = y and ln a = A (so that a = antiln(A)); we have
y = A + bX
Now find the normal equations for this straight line:
Σy = nA + bΣX
ΣXy = AΣX + bΣX²
Table construction:
X   Y   y = ln Y   X²   Xy
Exercises
(1) Given below are data relating to the thermal energy generated in Pakistan, 1981–94. The energy generation is in billion kWh.
Year 1981 1982 1983 1984 1985 1986 1987
Energy Generated 4.2 5.2 5.1 5.2 6.5 7.3 8.4
Year 1988 1989 1990 1991 1992 1993 1994
Energy Generated 10.8 11.9 14.5 16.1 19.4 19.7 23.0
Fit a straight line to the data. Find the residuals. Plot the residuals and comment on your result.
(2) The following are the annual installations of computers in labs at UET. Fit a linear regression equation of the number of computers on years and give the
annual rate of installation.
Year: 2001-2003 2003-2005 2005-2007 2007-2009 2009-2011
No of Computers installed: 139 144 150 154 158
Note
For each situation where the independent variable is a time factor, the values assigned to
2001-2003,… may be taken as 1,2,3,…
(3) A study of the department of transportation on the effect of bus ticket prices on the number of passengers produced the following results:
(These calculations belong to Example 5.) The normal equations, with n = 6, are
Σy = 6A + BΣX
ΣXy = AΣX + BΣX²
Table construction:
X   Y   y = log10 Y   X²   Xy
0 0.10 -1 0 0
0.5 0.45 - 0.3468 0.25 -0.1734
1.0 2.15 0.3324 1.0 0.3324
1.5 9.15 0.9614 2.25 1.4421
2.0 40.35 1.6058 4.0 3.2116
2.5 180.75 2.2571 6.25 5.6428
ΣX = 7.5   Σy = 3.8099   ΣX² = 13.75   ΣXy = 10.4555
X   Y   y = log10 Y   X²   Xy
2 8.3 0.9191 4 1.8382
3 15.4 1.1872 9 3.5616
4 33.1 1.5198 16 6.0792
5 65.2 1.8142 25 9.0710
6 127.4 2.1052 36 12.6312
ΣX = 20   Σy = 7.5455   ΣX² = 90   ΣXy = 33.1812
Standard Error of Estimate

sY.X = √[ Σ(Y − Ŷ)² / (n − 2) ],  where Ŷ = a + bX is the estimated regression line.

Alternative formula:

sY.X = √[ (ΣY² − aΣY − bΣXY) / (n − 2) ],  where n is the number of pairs.
Why divide by ‘n − 2’? Because the values a and b were obtained from a sample of data points, we lose 2 degrees of freedom when we use those points to estimate the
regression line.
In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.
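As an illustration (our computation, not from the handout), the snow-day data of Example 2 give a standard error of estimate of roughly 635 incidents:

```python
import math

snow = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
incidents = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]

n = len(snow)
sx, sy = sum(snow), sum(incidents)
sxx = sum(x * x for x in snow)
sxy = sum(x * y for x, y in zip(snow, incidents))

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
a = (sy - b * sx) / n                          # intercept

# Sum of squared residuals, divided by n - 2 degrees of freedom
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(snow, incidents))
s_yx = math.sqrt(sse / (n - 2))                # roughly 635
```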
Exercise 7
Cost accountants often estimate overhead based on the level of production. At the Standard Knitting Co., information has been collected
on overhead expenses and units produced at different plants, in order to estimate a regression equation for predicting future overhead. (AIOU)
Overhead 191 170 272 155 280 173 234 116 153 178
Units 40 42 53 35 56 39 48 30 37 40
(a) Develop the regression equation for the cost accountants.
(b) Predict overhead when 50 units are produced.
(c) Calculate the standard error of estimate.
Coefficient of Determination
The coefficient of determination measures the goodness of fit of the estimated regression equation.
In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variance in the
dependent variable that is predictable from the independent variable(s).
It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of
hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on
the proportion of total variation of outcomes explained by the model.
For the ith observation, the residual is yi − ŷi. The sum of the squares of these residuals or errors is known as the sum of squares due to
error, denoted by SSE.
Exercise 9
The curb weight x in hundreds of pounds and braking distance y in feet, at 50 miles per hour on dry pavement, were measured for five
vehicles, with the results shown in the table.
X: 25 27.5 32.5 35 45
Y: 105 125 140 140 150
The fitted line for these data is Y = 66.34 + 1.990X, and the fitted second-degree parabola is
Y = −112.5 + 12.61X − 0.1510X², shown in the following figure. Compute the coefficient of determination and interpret its value in the context of
vehicle weight and braking distance.
Examples (8)
An architect wants to determine the relationship between the heights (in feet) of a building and the number of stories in the building. The
data for a sample of 10 buildings in a city are shown below. Explain the relationship.
Stories: X 64 54 40 31 45 38 42 41 37 40
Height: Y 841 725 635 616 615 582 535 520 511 485
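One way to quantify the relationship is the correlation coefficient introduced in the next section; for these ten buildings it comes out near 0.80 (our computation), a fairly strong positive linear association:

```python
import math

stories = [64, 54, 40, 31, 45, 38, 42, 41, 37, 40]
height = [841, 725, 635, 616, 615, 582, 535, 520, 511, 485]

n = len(stories)
sx, sy = sum(stories), sum(height)
sxy = sum(x * y for x, y in zip(stories, height))
sxx = sum(x * x for x in stories)
syy = sum(y * y for y in height)

# Pearson product moment correlation coefficient
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
# r is approx. 0.797: buildings with more stories tend to be taller
```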
Correlation
Two variables are said to be correlated if they tend to vary simultaneously in some direction; if both variables tend to increase (or
decrease) together, the correlation is said to be direct or positive, e.g. the length of an iron bar will increase as the temperature increases. If one
variable tends to increase as the other decreases, the correlation is said to be negative or inverse, e.g. the volume of a gas will decrease as the
pressure increases.
(1) The correlation answers the STRENGTH of linear association between paired variables, say X and Y. On the other hand, the regression tells
us the FORM of linear association that best predicts Y from the values of X.
(2) Correlation is calculated whenever:
o both X and Y are measured on each subject and we want to quantify how strongly they are linearly associated;
o in particular, Pearson's product moment correlation coefficient is used when the assumption that both X and Y are sampled
from normally distributed populations is satisfied;
o Spearman's rank order correlation coefficient is used if the assumption of normality is not satisfied;
o correlation is not used when the variables are manipulated, for example in experiments.
The numerical measure of the strength of the linear relationship between any two variables is called the correlation coefficient, usually
denoted by r, and is defined by

r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ],  called the Pearson product moment correlation coefficient.

Alternatively,

r = [ΣXY − (ΣX)(ΣY)/n] / √{ [ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n] }
Its range is from -1 to +1
If r = −1, there is a perfect negative correlation.
If r = +1, there is a perfect positive correlation.
It is important to note that r = 0 does not mean that there is no relationship at all; e.g. if all the observed values lie exactly on a circle,
there is a perfect non-linear relationship between the variables.
Rank Correlation
Sometimes, the actual measurements of individuals or objects are either not available or accurate assessment is not possible. They are
then arranged in order according to some characteristic of interest. Such an ordered arrangement is called a ranking, and the order given to an
individual or object is called its rank. The correlation between two such sets of rankings is called rank correlation.
We have

rs = 1 − 6Σdi² / [n(n² − 1)]
This is also ranging from – 1 to + 1
Note
If two objects or observations are tied (having the same value), say for the fourth and fifth places, then both are given the mean of ranks 4 and 5,
i.e. 4.5.
This situation is given in the following example.
Examples (9)
The following table shows the number of hours studied (X) by a random sample of ten students and their grades in examination (Y):
X: 8 5 11 13 10 5 18 15 2 8
Y: 56 44 79 72 70 54 94 85 33 65
Calculate Spearman’s rank correlation coefficient.
Solution
We rank the X values by giving rank 1 to the highest value 18, rank 2 to 15, rank 3 to 13, rank 4 to 11, rank 5 to 10, rank 6.5 (mean of rank
6 and 7) to both 8, rank 8.5 (mean of rank 8 and 9) to both 5 and rank 10 to 2. Similarly we rank the values of Y by giving 1 to the highest value 94,
rank 2 to 85, rank 3 to 79, …, and rank 10 to 33 which is the smallest.
Table given below:
X   Y   Rank of X   Rank of Y   di   di²
8 56 6.5 7 - 0.5 0.25
5 44 8.5 9 - 0.5 0.25
11 79 4 3 1.0 1
13 72 3 4 - 1.0 1
10 70 5 5 0.0 0
5 54 8.5 8 0.5 0.25
18 94 1 1 0.0 0
15 85 2 2 0.0 0
2 33 10 10 0.0 0
8 65 6.5 6 0.5 0.25
Σdi² = 3
The value of n is 10.
Hence rs = 1 − 6Σdi²/[n(n² − 1)] = 1 − 6(3)/[10(10² − 1)] = 1 − 18/990 = 0.98
Compare this value with the correlation coefficient for the original values.
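Both coefficients can be computed directly; the sketch below ranks the data (averaging ranks for ties, as in the example) and makes the comparison asked for above: Spearman's rs ≈ 0.98 against Pearson's r ≈ 0.96.

```python
import math

X = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
Y = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]

def ranks(vals):
    """Rank 1 for the largest value; tied values share the mean of their ranks."""
    order = sorted(vals, reverse=True)
    # first occupied rank is index+1, last is index+count; take their mean
    return [(2 * order.index(v) + 1 + order.count(v)) / 2 for v in vals]

rx, ry = ranks(X), ranks(Y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # Sum of di^2, equals 3 here
n = len(X)
rs = 1 - 6 * d2 / (n * (n * n - 1))             # Spearman, approx. 0.98

# Pearson r on the raw values, for comparison
sx, sy = sum(X), sum(Y)
sxy = sum(a * b for a, b in zip(X, Y))
sxx = sum(a * a for a in X)
syy = sum(b * b for b in Y)
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
```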
Exercise 10
Ten competitors in a beauty contest are ranked by three judges in the following order
1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to discuss which pair of judges have the nearest approach to common tastes in
beauty.
(2) Multiple Regression and Correlation
Regression Model and Regression Equation
Multiple regression analysis is the study of how a dependent variable y is related to two or more independent variables. In the general
case, we will use ‘k’ to denote the number of independent variables.
The equation that describes how the dependent variable y is related to the independent variables X1, X2, . . ., Xk and an error term is called
the multiple regression model and the general format of the model is
Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where ε is the random error, β0 is the intercept and β1, β2, …, βk are the regression coefficients.
One of the assumptions is that the mean or expected value of ε is zero. A consequence of this assumption is that the mean or expected
value of Y, denoted E(Y), is equal to β0 + β1X1 + β2X2 + … + βkXk. The equation that describes how the mean value of Y is related to X1, X2, …, Xk is
called the multiple regression equation:

E(Y) = β0 + β1X1 + β2X2 + … + βkXk
Unfortunately, the parameter values β0, β1, β2, …, βk are unknown and must be estimated from sample data. A simple random sample
is used to compute sample statistics b0, b1, b2, …, bk that are used as the point estimators of the parameters β0, β1, β2, …, βk. These sample
statistics provide the following estimated multiple regression equation:

Ŷ = b0 + b1X1 + b2X2 + … + bkXk
This same approach of the method of least squares is used to develop the estimated multiple regression equation.
Multiple Linear Regression with two Regressors
Ŷ = b0 + b1X1 + b2X2
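The two-regressor normal equations can be solved just as in the simple case. A minimal sketch on made-up data generated from a known plane (illustrative values, not the handout's restaurant example):

```python
def solve(m, v):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(m)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

# Data generated exactly from Y = 1 + 2*X1 + 3*X2, so least squares recovers the plane
X1 = [1, 2, 3, 4, 2]
X2 = [2, 1, 4, 3, 5]
Y = [1 + 2 * a + 3 * b for a, b in zip(X1, X2)]

n = len(Y)
s1, s2, sy = sum(X1), sum(X2), sum(Y)
s11 = sum(a * a for a in X1)
s22 = sum(b * b for b in X2)
s12 = sum(a * b for a, b in zip(X1, X2))
s1y = sum(a * y for a, y in zip(X1, Y))
s2y = sum(b * y for b, y in zip(X2, Y))

# Normal equations for Y-hat = b0 + b1*X1 + b2*X2
b0, b1, b2 = solve([[n, s1, s2], [s1, s11, s12], [s2, s12, s22]],
                   [sy, s1y, s2y])
```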
Examples (10)
A statistician wants to predict the incomes of restaurants using two independent variables: the number of restaurant employees and the
restaurant floor area. He collected the following data.
The standard error of estimate for this two-regressor model is

SY.12 = √[ (ΣY² − aΣY − b1ΣX1Y − b2ΣX2Y) / (n − 3) ]
      = √[ (1885 − (−1.33)(89) − 0.38(619) − 1.62(1007)) / (5 − 3) ]
      = √(136.81 / 2) = √68.405 = 8.27
Exercise 13