Linear Regression

LINEAR REGRESSION
A mathematical equation that allows us to predict the values of one dependent
variable from known values of one or more independent variables is called a regression
equation. The term regression is derived from the original heredity studies made
by Francis Galton. Comparing the heights of fathers and sons, he found that the heights of
the sons of tall fathers over successive generations regressed toward the mean height of the
population. In other words, sons of unusually tall fathers tend to be shorter than their
fathers and sons of unusually short fathers tend to be taller than their fathers. Today the
term regression is applied to all types of prediction problems and does not necessarily
imply a regression toward a population mean.
In the study of linear regression, we consider the problem of estimating or
predicting the value of a dependent variable Y on the basis of a known measurement of an
independent and frequently controlled variable X.
Using a scatter diagram, we can determine whether the two variables are linearly related
to some extent. Once a reasonable linear relationship has been ascertained, we usually
express this mathematically by a straight-line equation called the linear regression line.
The linear regression line is written using the slope-intercept form
$$\hat{y} = a + bx,$$
where the constants a and b represent the y-intercept and slope, respectively. The symbol
$\hat{y}$ is used here to distinguish between the value given by the regression line and an
actual observed value y for some value of x.
Once the point estimates a and b are determined from the sample data, the linear
regression line can be used to predict the value $\hat{y}$ corresponding to any given value x.
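Estimation of Parameters. Given the sample $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$, the
least-squares estimates a and b of the parameters in the regression line $\hat{y} = a + bx$
are obtained from the formulas
$$b = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}$$
and
$$a = \bar{y} - b\bar{x}.$$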
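The least-squares estimates can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual computational formulas b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and a = ȳ - b·x̄; the function names `least_squares_line` and `predict` are hypothetical and not part of the original text.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the fitted line y_hat = a + b*x (illustrative helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom  # slope
    a = sum_y / n - b * sum_x / n             # intercept: y_bar - b * x_bar
    return a, b


def predict(a, b, x):
    """Value on the regression line at x."""
    return a + b * x
```

For points that lie exactly on a line, such as (1, 2), (2, 4), (3, 6), the function recovers that line (a = 0, b = 2).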
Example 1. Consider the following data:

x 1 2 3 4 5 6
y 6 4 3 5 4 2
(a) Find the equation of the regression line.
(b) Graph the line on a scatter diagram.
(c) Find the point estimate of
Solution:
        x    y    xy   x^2   y^2
        1    6     6     1    36
        2    4     8     4    16
        3    3     9     9     9
        4    5    20    16    25
        5    4    20    25    16
        6    2    12    36     4
Total  21   24    75    91   106
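(a) From the table, n = 6, Σx = 21, Σy = 24, Σxy = 75, and Σx^2 = 91.
Substituting these values in the formula for b, we get
$$b = \frac{(6)(75) - (21)(24)}{(6)(91) - (21)^2} = \frac{-54}{105} = -0.5143,$$
and, with $\bar{x} = 21/6 = 3.5$ and $\bar{y} = 24/6 = 4$,
$$a = \bar{y} - b\bar{x} = 4 - (-0.5143)(3.5) = 5.8.$$
This gives the regression line
$$\hat{y} = 5.8 - 0.5143x.$$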

(b) [Figure: scatter diagram of the data with the regression line; X on the horizontal
axis, Y on the vertical axis.]
Since the slope of the regression line is negative, y tends to decrease as x increases.
(c)
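The arithmetic in Example 1 can be checked with a short script. This is only a sketch: it rebuilds the columns x, y, xy, x², y², sums them, and applies the usual least-squares formulas for b and a; the variable names are illustrative.

```python
# Data from Example 1.
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 4, 3, 5, 4, 2]
n = len(xs)

# One row per observation: x, y, xy, x^2, y^2.
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(xs, ys)]

# Column totals, in the same order as the row entries.
totals = [sum(col) for col in zip(*rows)]
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = totals

# Least-squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print("totals:", totals)
print(f"y_hat = {a:.4f} + ({b:.4f}) x")
```

The totals it produces can be compared directly with the Total row of the solution table.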
LINEAR CORRELATION
We shall consider here the problem of measuring the relationship between two
variables X and Y rather than predicting a value of Y from a knowledge of the independent
variable X. For example, if X represents the amount of money spent yearly on advertising
by a retail merchandising firm and Y represents their total yearly sales, we might ask
whether a decrease in advertising is likely to be accompanied by a decrease in the yearly
sales.
Correlation analysis attempts to measure the strength of such relationships between
two variables by means of a single number called a correlation coefficient.
A linear correlation coefficient is defined to be a measure of the linear relationship
between the two random variables X and Y, and is denoted by r. It measures
the extent to which the points cluster about a straight line. By constructing a scatter
diagram for the n pairs of measurements $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$ in our random
sample (as in the graph below), we are able to draw certain conclusions concerning r. If the
points follow closely a straight line of positive slope as in (a), we have a high positive
correlation between the two variables. On the other hand, if the points follow closely a
straight line of negative slope as in (b), we have a high negative correlation between the
two variables. The correlation between the two variables decreases numerically as the
scattering of points from a straight line increases. If the points follow a strictly random
pattern as in (c) below, we have zero correlation and conclude that no linear relationship
exists between X and Y.
[Figure: four scatter diagrams of Y against X. (a) points close to a straight line of
positive slope; (b) points close to a straight line of negative slope; (c) points scattered
at random; (d) points following a curved (quadratic) pattern.]
The correlation coefficient between two variables is a measure of their linear
relationship, and a value of r = 0 implies a lack of linearity and not a lack of association.
Hence, if a strong quadratic relationship exists between X and Y as indicated in (d), we still
obtain a zero correlation even though there is a strong nonlinear relationship.
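A small numerical check illustrates the point about quadratic relationships. The sketch below assumes the standard computational formula for the Pearson correlation coefficient; the helper name `pearson_r` and the data (x from -3 to 3, y = x²) are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient from raw sums (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den


# y is completely determined by x, but the relationship is not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(r)
```

Because the x values are symmetric about zero, the numerator of r vanishes, so r is zero even though y depends exactly on x.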
The most widely used measure of linear correlation between two variables is called the
PEARSON PRODUCT-MOMENT CORRELATION COEFFICIENT or simply the
SAMPLE CORRELATION COEFFICIENT and is denoted by r.
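The measure of linear relationship between two variables X and Y is estimated by
the sample correlation coefficient r, where
$$r = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}} = b\,\frac{s_x}{s_y}.$$
Here b is the slope of the regression line, and $s_x$ and $s_y$ are the sample standard
deviations of the x and y values. Since the error sum of squares
$SSE = \sum (y_i - \hat{y}_i)^2$ satisfies
$$SSE = (n-1)(s_y^2 - b^2 s_x^2) = (n-1)s_y^2(1 - r^2),$$
by dividing both sides of the equation by $(n-1)s_y^2$ we obtain the relation
$$r^2 = 1 - \frac{SSE}{(n-1)s_y^2}.$$
Since SSE and $(n-1)s_y^2$ are always nonnegative, $r^2$ must be between zero
and 1. Consequently r must range from -1 to +1. A value of r = -1 will occur when SSE
= 0 and all points lie exactly on a straight line having a negative slope. If all points lie
exactly on a straight line having a positive slope, once again SSE = 0 and we obtain a value
r = +1. Hence a perfect linear relationship exists between the values of X and Y in our
sample when r = ±1. If r is close to +1 or -1, the linear relationship between the two
variables is strong and we say that we have a high correlation. However, if r is close to
zero, the linear relationship between X and Y is weak or perhaps nonexistent.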
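The relation between r and SSE can be checked numerically. The sketch below assumes the identity r² = 1 - SSE / Σ(y - ȳ)², where Σ(y - ȳ)² equals (n - 1)s_y², and uses the data of the regression example (x = 1..6, y = 6, 4, 3, 5, 4, 2). It works with deviations from the means, which is algebraically equivalent to the computational formulas; the variable names are illustrative.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 4, 3, 5, 4, 2]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in xs)   # equals (n - 1) * s_x^2
s_yy = sum((y - y_bar) ** 2 for y in ys)   # equals (n - 1) * s_y^2
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Fitted line and its error sum of squares.
b = s_xy / s_xx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r = s_xy / sqrt(s_xx * s_yy)

print("r^2          =", r ** 2)
print("1 - SSE/S_yy =", 1 - sse / s_yy)
```

The two printed quantities agree up to floating-point rounding, and because SSE cannot be negative, r² cannot exceed 1.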
A number that expresses the proportion of the total variation in the values of the
variable Y that can be accounted for or explained by the linear relationship with the values
of the variable X is usually referred to as the sample coefficient of determination and is
denoted by $r^2$. Thus a correlation of r = 0.6 means that 0.36 or 36% of the total variation
of the values of Y in our sample is accounted for by the linear relationship with the values
of X.
The values of r and their interpretation
r                 Interpretation
1                 Perfect positive correlation
0.91 to 0.99      Very highly positively correlated
0.71 to 0.90      Highly positively correlated
0.41 to 0.70      Marked or moderately positively correlated
0.21 to 0.40      Low or slightly positively correlated
0 to 0.20         Negligible
-0.20 to 0        Negligible
-0.21 to -0.40    Low or slightly negatively correlated
-0.41 to -0.70    Marked or moderately negatively correlated
-0.71 to -0.90    Highly negatively correlated
-0.91 to -0.99    Very highly negatively correlated
-1                Perfect negative correlation
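The interpretation table can be turned into a small lookup. This is a sketch under two assumptions that are not in the original: the negative bands mirror the positive ones with "negatively" in place of "positively", and a value falling between two printed bands (for example 0.205) is assigned to the band above it. The function name `interpret_r` is hypothetical.

```python
def interpret_r(r):
    """Verbal label for a correlation coefficient, following the table's bands."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size == 1.0:
        sign = "positive" if r > 0 else "negative"
        return f"perfect {sign} correlation"
    if size <= 0.20:
        return "negligible"
    direction = "positively" if r > 0 else "negatively"
    # Assumption: upper limits of each band are inclusive.
    if size <= 0.40:
        strength = "low or slightly"
    elif size <= 0.70:
        strength = "marked or moderately"
    elif size <= 0.90:
        strength = "highly"
    else:
        strength = "very highly"
    return f"{strength} {direction} correlated"


print(interpret_r(0.6))
print(interpret_r(-0.95))
```

Such cut-offs are a convention for describing r in words rather than a statistical test.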
Example 1: Compute and interpret the correlation coefficient for the following data:
X 4 5 9 14 18 22 24
Y 16 22 11 16 7 3 17
Solution:
        x    y    x^2    y^2    xy
        4   16     16    256    64
        5   22     25    484   110
        9   11     81    121    99
       14   16    196    256   224
       18    7    324     49   126
       22    3    484      9    66
       24   17    576    289   408
Total  96   92   1702   1464  1097
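Substituting these values in the formula for r, with n = 7, we get
$$r = \frac{(7)(1097) - (96)(92)}{\sqrt{[(7)(1702) - (96)^2][(7)(1464) - (92)^2]}} = \frac{-1153}{\sqrt{(2698)(1784)}} = -0.53.$$
Since r = -0.53, the two variables X and Y are moderately negatively correlated.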
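The value of r in this example can be reproduced from the raw data. The sketch below assumes the computational formula r = (nΣxy - ΣxΣy) / sqrt([nΣx² - (Σx)²][nΣy² - (Σy)²]); the variable names are illustrative.

```python
from math import sqrt

# Data from the example.
xs = [4, 5, 9, 14, 18, 22, 24]
ys = [16, 22, 11, 16, 7, 3, 17]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

num = n * sum_xy - sum_x * sum_y
den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den

print("sums:", sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print("r =", round(r, 2))
```

The printed sums can be compared with the Total row of the solution table before the final value of r is read off.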
Example 2. Compute and interpret the correlation coefficient for the aptitude scores and
grade point averages below:
Grade-Point Average, y    Aptitude Score, x
1.93                      565
2.55                      525
1.72                      477
2.48                      555
2.87                      502
1.87                      469
1.34                      517
3.03                      555
2.54                      576
2.34                      559
1.40                      574
1.45                      578
1.72                      548
3.80                      656
2.13                      688
1.81                      465
2.33                      661
2.53                      477
2.04                      490
3.20                      524
Solution:
        GPA (y)   AS (x)        xy       x^2        y^2
          1.93      565    1090.45    319225    3.72490
          2.55      525    1338.75    275625    6.50250
          1.72      477     820.44    227529    2.95840
          2.48      555    1376.40    308025    6.15040
          2.87      502    1440.74    252004    8.23690
          1.87      469     877.03    219961    3.49690
          1.34      517     692.78    267289    1.79560
          3.03      555    1681.65    308025    9.18090
          2.54      576    1463.04    331776    6.45160
          2.34      559    1308.06    312481    5.47560
          1.40      574     803.60    329476    1.96000
          1.45      578     838.10    334084    2.10250
          1.72      548     942.56    300304    2.95840
          3.80      656    2492.80    430336   14.44000
          2.13      688    1465.44    473344    4.53690
          1.81      465     841.65    216225    3.27610
          2.33      661    1540.13    436921    5.42890
          2.53      477    1206.81    227529    6.40090
          2.04      490     999.60    240100    4.16160
          3.20      524    1676.80    274576   10.24000
TOTAL    45.08    10961   24896.83   6084835  109.47900
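Substituting the totals in the formula for r, with n = 20, we get
$$r = \frac{(20)(24896.83) - (10961)(45.08)}{\sqrt{[(20)(6084835) - (10961)^2][(20)(109.479) - (45.08)^2]}} = \frac{3814.72}{\sqrt{(1553179)(157.3736)}} = 0.24.$$
Since r = 0.24, the grade-point averages are only slightly positively correlated with the
aptitude scores.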
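When only the column totals are at hand, r can be evaluated from them directly. The sketch below assumes the totals in the TOTAL row of the solution table are correct and applies the computational formula for r; the function name `r_from_totals` is hypothetical.

```python
from math import sqrt


def r_from_totals(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Sample correlation coefficient computed from column totals."""
    num = n * sum_xy - sum_x * sum_y
    den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den


# Totals from the table: x = aptitude score, y = grade-point average.
r = r_from_totals(
    n=20,
    sum_x=10961,
    sum_y=45.08,
    sum_xy=24896.83,
    sum_x2=6084835,
    sum_y2=109.479,
)
print(round(r, 3))
```

The printed value can then be located in the interpretation table to check the verbal conclusion.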
The sample correlation coefficient r is a value computed from a random sample of
n pairs of measurements. Different random samples of size n from the same population
will generally produce different values of r.
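The sample-to-sample variation of r can be seen in a small simulation. Everything in this sketch is hypothetical: the population is generated artificially (y = 2 + 0.5x plus normal noise), the sample size of 20 and the seed are arbitrary, and `pearson_r` is an illustrative helper using the computational formula.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient from raw sums (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den


rng = random.Random(0)  # fixed seed so the run is repeatable

# Artificial population of (x, y) pairs with a noisy linear relationship.
population = []
for _ in range(10_000):
    x = rng.uniform(0, 10)
    y = 2.0 + 0.5 * x + rng.gauss(0, 2)
    population.append((x, y))

# Draw several random samples of size 20 and compute r for each one.
sample_rs = []
for _ in range(5):
    sample = rng.sample(population, 20)
    xs, ys = zip(*sample)
    sample_rs.append(pearson_r(xs, ys))

print([round(r, 3) for r in sample_rs])
```

Each sample comes from the same population, yet the five values of r differ, which is why r is treated as an estimate rather than a fixed property of the population.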
EXERCISES: Solve each of the following problems. Show all solutions.
1. The grades of a class of 9 students on a midterm report (x) and on the final
examination (y) are as follows:
x 77 50 71 72 81 94 96 99 67
y 82 66 78 34 47 85 99 99 67
(a) Find the equation of the regression line.
(b) Estimate the final examination grade of a student who received a grade of
85 on the midterm report but was ill at the time of the final examination.
2. A study was made on the amount of converted sugar in a certain process at
various temperatures. The data were coded and recorded as follows:
Temperature, x   Converted Sugar, y   Temperature, x   Converted Sugar, y
1.0              8.1                  1.6              8.6
1.1              7.8                  1.7              10.2
1.2              8.5                  1.8              9.3
1.3              9.8                  1.9              9.2
1.4              9.5                  2.0              10.5
1.5              8.9
(a) Estimate the linear regression line.
(b) Estimate the amount of converted sugar produced when the coded
temperature is 1.75.
3. A mathematics placement test is given to all entering freshmen at a small
college. A student who receives a grade below 35 is denied admission to the
regular mathematics course and placed in a remedial class. The placement test
scores and the final grades for 20 students who took the regular course were
recorded as follows:
Placement Test   Course Grade   Placement Test   Course Grade
50               53             90               54
35               41             80               91
35               61             60               48
40               56             60               71
55               68             60               71
65               36             40               47
35               11             55               53
60               70             50               68
90               79             65               57
35               59             50               79
(a) Plot a scatter diagram.
(b) Find the equation of the regression line to predict course grades from
placement test scores.
(c) Graph the line on the scatter diagram.
(d) If 60 is the minimum passing grade, below which placement test score
should students in the future be denied admission to this course?
4. Compute and interpret the correlation for the following grades of 6 students
selected at random.
Mathematics Grade 70 92 80 74 65 83
English Grade 74 84 63 87 78 90
5. The following data were obtained in a study of the relationship between the
weight and chest size of infants at birth:
Weight (kg) Chest Size (cm) Weight (kg) Chest Size (cm)
2.75 29.5 4.32 27.7
2.15 26.3 2.31 28.3
4.41 32.2 4.30 30.3
5.52 36.5 3.71 28.7
3.21 27.2
(a) Calculate r.
(b) Graph the line on a scatter diagram.
(c) Find the point estimate of .
