Regression Illustrations PDF

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

The following example will illustrate how regression lines are used

Example
An investment company advertised the sale of pieces of land at different prices. The following table shows the
pieces of land their acreage and costs

Piece of (x)Acreage (y) Cost £ 000 xy x2


land Hectares
A 2.3 230 529 5.29
B 1.7 150 255 2.89
C 4.2 450 1890 17.64
D 3.3 310 1023 10.89
E 5.2 550 2860 27.04
F 6.0 590 3540 36
G 7.3 740 5402 53.29
H 8.4 850 7140 70.56
J 5.6 530 2969 31.36
Σx =44.0 Σy = 4400 Σxy= 25607 Σx2 = 254.96

Required
Determine the regression equations of
i. y on x and hence estimate the cost of a piece of land with 4.5 hectares
ii. Estimate the expected average if the piece of land costs £ 900,000
Σy = an + bΣxy
Σxy = a∑x + bΣx2

By substituting of the appropriate values in the above equations we have


4400 = 9a + 44b …….. (i)
25607 = 44a + 254.96b ……..(ii)
By multiplying equation …. (i) by 44 and equation …… (ii) by 9 we have
193600 = 396a + 1936b …….. (iii)
230463 = 396a + 2294.64b ……..(iv)
By subtraction of equation …. (iii) from equation …… (iv) we have
36863 = 358.64b
102.78 = b
by substituting for b in …….. (i)
4400 = 9a + 44( 102.78)
4400 – 4522.32 = 9a
–122.32 = 9a
-13.59 = a
Therefore the equation of the regression line of y on x is
Y = 13.59 + 102.78x
When the acreage (hectares) is 4.5 then the cost
(y) = -13.59 + (102.78 x 4.5)
= 448.92
= £ 448, 920
Note that
Where the regression equation is given by
y= a + bx
Where a is the intercept on the y axis and
b is the slope of the line or regression coefficient
n is the sample size
then,

intercept a =
 y − b x
n
n xy −  x  y
Slope b =
n x 2 − (  x )
2

Example
The calculations for our sample size n = 10 are given below. The linear regression model is y = a + bx
Table

Distance x miles Time y mins xy x2 y2


3.5 16 56.0 12.25 256
2.4 13 31.0 5.76 169
4.9 19 93.1 24.01 361
4.2 18 75.6 17.64 324
3.0 12 36.0 9.0 144
1.3 11 14.3 1.69 121
1.0 8 8.0 1.0 64
3.0 14 42.0 9.0 196
1.5 9 13.5 2.25 81
4.1 16 65.6 16.81 256
Total 28.9 136 435.3 99.41 1972
Σx Σy Σxy Σx2 Σy2

10 × 435.3 − 28.9 × 136 422.6


The Slope b = =
10 × 99.41 − 28.9 2 158.9

= 2.66

136 − ( 2.66 × 28.9 )


and the intercept a =
10

= 5.91
We now insert these values in the linear model giving
y = 5.91 + 2.66x
or
Delivery time (mins) = 5.91 + 2.66 (delivery distance in miles)
The slope of the regression line is the estimated number of minutes per mile needed for a delivery. The intercept
is the estimated time to prepare for the journey and to deliver the goods, that is the time needed for each
journey other than the actual traveling time.

PREDICTION WITHIN THE RANGE OF SAMPLE DATA


We can use the linear regression model to predict the mean of dependant variable for any given value of
independent variable
For example if the sample model is given by
Time (min) = 5.91 + 2.66 (distance in miles)
Then the distance if 4.0 miles then our estimated mean time is
Ý = 5.91 + 2.66 x 4.0 = 16.6 minutes

Multiple Linear Regression Models


There are situations in which there is more than one factor which influence the dependent variable
Example
Cost of production per week in a large department
Factors
i. Total numbers of hours worked
ii. Raw material used during the week
iii. Total number of items produced during the week
iv. Number of hours spent on repair and maintenance
It is sensible to use all the identified factors to predict department costs
Scatter diagram will not give the relationship between the various factors and total costs
The linear model for multiple linear regression if of the type; (which is the line of best fit).
y = α + b1x1 +b2x2 +………… + bnxn
We assume that errors or residuals are negligible.
In order to choose between the models we examine the values of the multiple correlation coefficient r and the
standard deviation of the residuals α.
A model which describes well the relationship between y and x’s has multiple correlation coefficient r close to
± and the value of α which is small.

Example
Odino chemicals limited are aware that its power costs are semi variable cost and over the last six months these
costs have shown the following relationship with a standard measure of output.

Month Output (standard units) Total power costs £ 000


1 12 6.2
2 18 8.0
3 19 8.6
4 20 10.4
5 24 10.2
6 30 12.4

Required
i. Using the method of least squares, determine on appropriate linear relationship between total
power costs and output
ii. If total power costs are related to both output and time (as measured by the number of the month)
the following least squares regression equation is obtained
Power costs = 4.42 + (0.82) output + (0.10) month
Where the regression coefficients (i.e. 0.82 and 0.10) have t values 2.64 and 0.60 respectively and
coefficient of multiple correlation amounts to 0.976
Compare the relative merits of this fitted relationship with one you determine in (a). Explain
(without doing any further analysis) how you might use the data to forecast total power costs in
seven months.

Solution
a)
Output (x) Power costs (y) x2 y2 xy
12 6.2 144 38.44 74.40
18 8.0 324 64.00 144.00
19 8.6 361 73.96 163.40
20 10.4 400 108.16 208.00
24 10.2 576 104.04 244.80
30 12.4 900 153.76 372.00
Σx = 123 Σy = 55.8 Σx2 = 2705 Σy2 = 542.36 Σxy= 1,206.60

n xy −  x  y
b=
n x 2 − (  x )
2

6 ×1206.6 − 123 × 55.8


=
6 × 2705 − (123)
2

376.2
=
1101

= 0.342

1
a = (Σy – bΣy)
n

1
= x (55.8 – 0.342) x 123
6

= 2.29
∴ (Power costs) = 2.29 + 0.342 (output)
b. For linear regression calculated above, the coefficient of correlation r is

r=
( 6 ×1206.6 ) − (123 × 55.8)
6 × 2705 − 123 ×123 6 × 542.36 − 55.8 × 55.8

376.2
=
1101× 140.52

= 0.96

This show a strong correlation between power cost and output. The multiple correlation when both output and
time are considered at the same time is 0.976.
We observe that there has been very little increase in r which means that inclusion of time variable does not
improve the correlation significantly
The value for time variable is only 0.60 which is insignificant as compared with a t value of 2.64 for the output
variable
In fact, if we work out correlation between output and time, there will be a high correlation. Hence there is no
necessity of taking both the variables. Inclusion of time does improve the correlation coefficient but by a very
small amount.
If we use the linear regression analysis and attempt to find the linear relationship between output and time i.e.
Month Output
1 12
2 18
3 19
4 20
5 24
6 30

The value of b and a will turn out to be 3.11 and 9.6 i.e. relationship will be of the form
Output = 9.6 + 3.11 × month
For this equation forecast for 7th month will be
Output = 9.6 + 3.11 × 7
= 9.6 + 21.77
= 31.37 units
Using the equation , Power costs = 2.29 + 0.34 × output
= 2.29 + 0.34 × 31.37
= 2.29 + 10.67
= 12.96 i.e. £ 12,960

Non Linear Relationships


In the scatter diagram and the correlation coefficient do not indicate linear relationship, then the relationship
may be non – linear
Two such relationships are of peculiar interest
y = abx
Both of these can be reduced to linear model. Simple or multiple linear regression methods are then used to
determine the values of the coefficients
i. Exponential model
y = abx
Take log of both sides
Log y = log a + log bx
Log x = log a + xlog b
Let log y = Y and log a = A and log b = B
Then Y = A + Bx. This is a linear regression model
ii. Geometric model
y = axb
using the same technique as above
Log y = log a + blog x
Y = A + bX
Where Y = log
A = log a
X = log x
Using linear regression technique (the method of least squares), it is possible to calculate the value of a and b

You might also like