
REGRESSION AND CORRELATION ANALYSIS

Introduction
There are many situations where we are interested in the relationship between two or more
variables occurring together.
The analysis of bivariate numeric data is concerned with the statistical measures of regression and
correlation. Correlation is a statistical method used to determine whether a relationship between
variables exists. Regression is a statistical method used to describe the nature of the relationship
between variables, that is, positive or negative, linear or nonlinear. Regression and correlation can
be viewed as the bivariate equivalents of location and dispersion: regression locates bivariate data
in terms of a mathematical relationship, able to be graphed as a line or curve, while correlation
describes the nature of the spread of the items about the line or curve.

Regression
There are two types of regression: simple linear regression and multiple regression.
In simple regression there are only two variables under study, while in multiple regression more
than two variables are under study.

Simple linear regression model

The simple linear regression model: Regression analysis attempts to establish the nature of the
relationship between variables – that is, to study the functional relationship between the variables
and thereby provide a mechanism for prediction, or forecasting. The variable which is used to
predict the variable of interest is called the independent variable or explanatory variable. The
variable we are trying to predict is called the dependent variable or explained variable.

Bivariate data always involve two distinct variables, and in the majority of cases one variable will
depend naturally on the other. The independent variable is the one that is chosen freely or occurs
naturally; the dependent variable occurs as a consequence of the value of the independent variable.
Sometimes the relation between a dependent and an independent variable is called a causal
relationship, since it can be argued that the value of one variable has been caused by the value of
the other.

The analysis is called simple linear regression analysis: simple because there is only one
predictor or independent variable, and linear because of the assumed relationship between the
dependent and independent variables.
The independent and dependent variables can be plotted on a graph called a scatter plot. The
independent variable x is plotted on the horizontal axis, and the dependent variable y is plotted
on the vertical axis.
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent
variable x and the dependent variable y.

The scatter plot is a visual way to describe the nature of the relationship between the independent
and dependent variables.
A. Construct a scatter plot for the following data.

Observation:  1   2   3   4   5   6   7   8   9   10  11  12
Quantity Y:   69  76  52  56  57  77  58  55  67  53  72  64
Price X:      9   12  6   10  9   10  7   8   12  6   11  8

B: Construct a scatter plot for the data obtained in a study on the number of absences and the
final grades of seven randomly selected students from a statistics class. The data are shown here.

Student Number of absences X Final grade Y (%)


A 6 82
B 2 86
C 15 43
D 9 74
E 12 58
F 5 90
G 8 78
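
For reference, a scatter plot like the one asked for in part B can be produced with a few lines of Python. The sketch below assumes the matplotlib plotting library is available; the variable names are only illustrative.

# Sketch: scatter plot of absences (x) vs. final grade (y) using matplotlib.
import matplotlib.pyplot as plt

absences = [6, 2, 15, 9, 12, 5, 8]        # independent variable x
grades   = [82, 86, 43, 74, 58, 90, 78]   # dependent variable y

plt.scatter(absences, grades)
plt.xlabel("Number of absences (x)")
plt.ylabel("Final grade (y, %)")
plt.title("Scatter plot: absences vs. final grade")
plt.show()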

Linear Functions (Regression Line Equation) and Graphs

For any set of bivariate data, there are two regression line equations which can be obtained:
a) The y on x regression line is the name given to the regression line which is used for
estimating y given a value of x.
Mathematically: Y = a + bX, where Y is called the dependent variable, explained variable
or predictand, and X is called the independent variable, explanatory variable or predictor;
"a" and "b" can be any numerical values, positive or negative, "a" is the Y intercept and
"b" is the gradient or slope.

b) The x on y regression line is the name given to the regression line which is used for
estimating x given a value of y.
Mathematically: X = a + bY, where X is called the dependent variable, explained variable
or predictand, and Y is called the independent variable, explanatory variable or predictor;
"a" and "b" can be any numerical values, positive or negative, "a" is the X intercept and
"b" is the gradient or slope.

Standard methods of obtaining a regression line

There are many methods that can be used to obtain a regression line, but in this chapter we
will use: the method of least squares, the method of Mayer, and the method of three points.

a) Method of Least squares regression line


The method of least squares is the standard technique for obtaining a regression line.
Least squares regression formulae.
If the least squares regression line of y on x is given by Y = a + bX, then

a = [ΣY·ΣX² − ΣX·ΣXY] / [nΣX² − (ΣX)²]

b = [nΣXY − ΣX·ΣY] / [nΣX² − (ΣX)²]

where a is the Y intercept and b is the slope of the line.


Example 1: Find the equation of the regression line for the data in example B above (absences
and final grades), and graph the line on the scatter plot.
Solution
n = 7, Σx = 57, Σy = 511, Σxy = 3745, Σx² = 579

a = [ΣY·ΣX² − ΣX·ΣXY] / [nΣX² − (ΣX)²] = [511(579) − 57(3745)] / [7(579) − (57)²] = 82404 / 804 = 102.49

b = [nΣXY − ΣX·ΣY] / [nΣX² − (ΣX)²] = [7(3745) − 57(511)] / [7(579) − (57)²] = −2912 / 804 = −3.62

Hence, the equation of the regression line y = a + bx is y = 102.49 − 3.62x


The graph of the line is shown here.
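
The values of a and b above can be checked with a short script. This is a minimal sketch that applies the least squares formulae directly to the absences data; the variable names are purely illustrative.

# Sketch: least squares regression line y = a + b*x for the absences data.
x = [6, 2, 15, 9, 12, 5, 8]
y = [82, 86, 43, 74, 58, 90, 78]
n = len(x)

sum_x  = sum(x)                                 # 57
sum_y  = sum(y)                                 # 511
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 3745
sum_x2 = sum(xi ** 2 for xi in x)               # 579

a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

print(a, b)   # approximately 102.49 and -3.62, so y = 102.49 - 3.62x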

Example: A physician wishes to know whether there is a relationship between a father’s weight
(in kg) and his son’s weight (in kg).
The data are given here.
Father's weight x:  65  63  67  64  68  62  70  66  68  67  69  71
Son's weight y:     68  66  68  65  69  66  68  65  71  67  68  70

n = 12, Σx = 800, Σy = 811, Σxy = 54107, Σx² = 53418

b = [nΣxy − Σx·Σy] / [nΣx² − (Σx)²] = [12(54107) − 800(811)] / [12(53418) − (800)²] = 484 / 1016 = 0.476

a = [Σy·Σx² − Σx·Σxy] / [nΣx² − (Σx)²] = [811(53418) − 800(54107)] / [12(53418) − (800)²] = 36398 / 1016 = 35.82

Hence, the equation of the regression line y = a + bx is y = 35.82 + 0.476x

b) Method of Mayer
The data set is divided into two equal parts, and for each part the mean of the X values and the
mean of the Y values are calculated, giving two points (X̄₁, Ȳ₁) and (X̄₂, Ȳ₂).
These two points are used to form a system of two equations which is solved to find the straight
line of the distribution.

Example: For the data
X:  2   4   6   8   9   13
Y:  7   10  13  15  20  28

X̄₁ = (2 + 4 + 6)/3 = 4 and X̄₂ = (8 + 9 + 13)/3 = 10
Ȳ₁ = (7 + 10 + 13)/3 = 10 and Ȳ₂ = (15 + 20 + 28)/3 = 21

Then we have the points (4, 10) and (10, 21).
The equation of the straight line passing through the two points is Y = a + bX, so we have to solve:
10 = a + 4b
21 = a + 10b
Subtracting gives 6b = 11, so b = 1.83 and a = 10 − 4(1.83) = 2.67.

Hence, the equation of the regression line y = a + bx is y = 2.67 + 1.83x
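
A minimal sketch of Mayer's method in Python follows, assuming the observations are already ordered by x so that the data can be split into a lower and an upper half; the names are illustrative.

# Sketch: Mayer's method - split the data into two halves, average each half,
# then pass a straight line y = a + b*x through the two mean points.
x = [2, 4, 6, 8, 9, 13]
y = [7, 10, 13, 15, 20, 28]

half = len(x) // 2
x1, y1 = sum(x[:half]) / half, sum(y[:half]) / half                         # (4, 10)
x2, y2 = sum(x[half:]) / (len(x) - half), sum(y[half:]) / (len(y) - half)   # (10, 21)

b = (y2 - y1) / (x2 - x1)   # slope, about 1.83
a = y1 - b * x1             # intercept, about 2.67
print(a, b)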

Correlation

Correlation may be defined as the degree or strength of relationship existing between two or
more variables.
The variables are said to be correlated if a change in one variable corresponds to a change in the
other variable.

Types of Correlation
Correlation is classified in several ways. The following are the important types:
a. Positive correlation
b. Negative correlation
c. Simple, partial and multiple correlation
d. Linear and non - linear

a. Positive correlation: correlation is positive (direct), if the variables vary in the same
direction, that is, if they increase or decrease together.
Ex: Height ( x) and weight ( y) of persons are positively correlated.
b. Negative correlation: correlation is negative (inverse), if the variables vary in opposite
directions, that is, if one variable is increasing, the other is decreasing or vice versa.
Ex: Production ( x) and price ( y) of a commodity are negatively correlated.
c. Simple, partial and multiple correlations: The distinction between simple, partial and
multiple correlations is based on the number of variables involved. Simple correlations are
concerned with two variables only while partial and multiple correlations are concerned
with three or more related variables.
d. Linear and Nonlinear (Curvilinear) correlations: If the amount of change in one
variable tends to bear a constant ratio to the amount of change in the other variable, then the
correlation is said to be linear; otherwise it is nonlinear.
The coefficient of correlation is used to measure the strength and direction of a linear
relationship between two variables.

There are several types of correlation coefficients. The one explained in this section is Pearson's
correlation coefficient, named after statistician Karl Pearson, who pioneered the research in
this area. The symbol for the sample correlation coefficient is r. The symbol for the population
correlation coefficient is ρ (Greek letter rho).

Properties of the linear correlations coefficient


1. The linear correlation coefficient is always between −1 and +1, inclusive.
That is, −1 ≤ r ≤ 1.
2. If r = +1, there is a perfect positive linear relation between the two variables.
3. If r = −1, there is a perfect negative linear relation between the two variables.
4. The closer r is to +1, the stronger is the evidence of positive association between
the two variables, that is, increases in one variable are associated with increases
in the other.
5. The closer r is to −1, the stronger is the evidence of negative association between the two
variables.
6. If r is close to 0, there is little or no evidence of a linear relation between the two
variables. Because the linear correlation coefficient is a measure of the strength of the
linear relation, r close to 0 does not imply no relation, just no linear relation.
7. The linear correlation coefficient is a unitless measure of association, so the units of
measure of x and y play no role in the interpretation of r.
Coefficient of correlation
Correlation Coefficient Definition:
A measure of the strength of the linear association between two variables. The correlation
coefficient always lies between −1.0 and +1.0. If the correlation is positive, we have a positive
relationship; if it is negative, the relationship is negative.

Formula:

Correlation coefficient:
Correlation (r) = [NΣXY − (ΣX)(ΣY)] / Sqrt([NΣX² − (ΣX)²][NΣY² − (ΣY)²])
where
N = number of values or elements
X = first score
Y = second score
ΣXY = sum of the products of the first and second scores
ΣX = sum of first scores
ΣY = sum of second scores
ΣX² = sum of squared first scores
ΣY² = sum of squared second scores
Correlation Co-efficient Example: To find the Correlation of

X Values Y Values
60 3.1
61 3.6
62 3.8
63 4
65 4.1

Step 1: Count the number of values.


N=5

Step 2: Find XY, X², Y²


See the below table

X Value Y Value X*Y X*X Y*Y


60 3.1 60 * 3.1 = 186 60 * 60 = 3600 3.1 * 3.1 = 9.61
61 3.6 61 * 3.6 = 219.6 61 * 61 = 3721 3.6 * 3.6 = 12.96
62 3.8 62 * 3.8 = 235.6 62 * 62 = 3844 3.8 * 3.8 = 14.44
63 4 63 * 4 = 252 63 * 63 = 3969 4 * 4 = 16
65 4.1 65 * 4.1 = 266.5 65 * 65 = 4225 4.1 * 4.1 = 16.81

Step 3: Find ΣX, ΣY, ΣXY, ΣX², ΣY².


ΣX = 311
ΣY = 18.6
ΣXY = 1159.7
ΣX² = 19359
ΣY² = 69.82

Step 4: Now substitute into the formula given above.


Correlation (r) = [NΣXY − (ΣX)(ΣY)] / Sqrt([NΣX² − (ΣX)²][NΣY² − (ΣY)²])
= (5 × 1159.7 − 311 × 18.6) / Sqrt([5 × 19359 − (311)²][5 × 69.82 − (18.6)²])
= (5798.5 − 5784.6) / Sqrt([96795 − 96721][349.1 − 345.96])
= 13.9 / Sqrt(74 × 3.14)
= 13.9 / Sqrt(232.36)
= 13.9 / 15.24336
= 0.9119
These steps show how to find the strength of the relationship between two variables by
calculating the correlation coefficient.
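
The five steps above map directly onto a short script. The sketch below (plain Python, illustrative names) recomputes r for the same five (X, Y) pairs.

# Sketch: Pearson correlation coefficient computed from the raw sums.
from math import sqrt

X = [60, 61, 62, 63, 65]
Y = [3.1, 3.6, 3.8, 4, 4.1]
N = len(X)

sx, sy   = sum(X), sum(Y)                        # ΣX, ΣY
sxy      = sum(x * y for x, y in zip(X, Y))      # ΣXY
sx2, sy2 = sum(x * x for x in X), sum(y * y for y in Y)   # ΣX², ΣY²

r = (N * sxy - sx * sy) / sqrt((N * sx2 - sx ** 2) * (N * sy2 - sy ** 2))
print(round(r, 4))   # approximately 0.9119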

Example:
The following table shows the Height (x) vs. Femur Length (y) measurements (both in inches)
for 10 men:
X 70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6
Y 42.5 40.2 44.4 42.8 40 47.3 43.4 40.1 42.1 36

[Scatter diagram for Height vs. Femur Length: Length of Femur (y axis, 35 to 50 in) plotted against Height (x axis, 65 to 72 in).]

The diagram shows a positive linear correlation between the variables.

Example: The following table gives the weight (x) (in 1000 lbs.) and highway fuel efficiency (y)
(in miles/gallon) for a sample of 13 cars.

Vehicle X Y

Chevrolet Camaro 3.545 30


Dodge Neon 2.6 32
Honda Accord 3.245 30
Lincoln Continental 3.93 24
Oldsmobile Aurora 3.995 26
Pontiac Grand Am 3.115 30
Mitsubishi Eclipse 3.235 33
BMW 3-Series 3.225 27
Honda Civic 2.44 37
Toyota Camry 3.24 32
Hyundai Accent 2.29 37
Mazda Protégé 2.5 34
Cadillac DeVille 4.02 26
[Scatter diagram for Weight vs. Highway MPG: MPG Highway (y axis, 20 to 40 mpg) plotted against weight (x axis, 2 to 4.5 thousand lbs).]

The diagram indicates a negative linear correlation between the variables.

The correlation coefficient can be used to test for a linear relationship between two variables.
Formula for the correlation coefficient r:

r = [nΣxy − (Σx)(Σy)] / Sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])     (1)

Equivalently, in terms of the covariance and the variances,

r(x, y) = Cov(x, y) / Sqrt(Var(x) · Var(y))

where Cov(x, y) = (1/n) Σ(xi − x̄)(yi − ȳ), Var(x) = (1/n) Σ(xi − x̄)² and Var(y) = (1/n) Σ(yi − ȳ)².

Where n is the number of data pairs.

Example Compute the value of the correlation coefficient for the data obtained in the study of
the number of absences and the final grade of the seven students in the statistics class given in
example above.
Solution
Student Number of Final grade Y xy x2 y2
absences X (%)
A 6 82 492 36 6724
B 2 86 172 4 7396
C 15 43 645 225 1849
D 9 74 666 81 5476
E 12 58 696 144 3364
F 5 90 450 25 8100
G 8 78 624 64 6084
 x  57;  y  511;  xy  3745;  x ;  y 2 2

Substituting into the formula and solving for r:

r = [nΣxy − (Σx)(Σy)] / Sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])
  = [7(3745) − 57(511)] / Sqrt([7(579) − (57)²][7(38993) − (511)²])
  = −0.944
The value of r suggests a strong negative relationship between a student’s final grade and the
number of absences a student has. That is, the more absences a student has, the lower is his or
her grade.
Example.
Find the equation of the regression line and compute the value of the correlation coefficient for
the following data.
Income x 80 100 120 140 160 180
Consumption y 325 462 445 707 678 750

Example: From the following data:


Age of Husband: 35 34 40 43 56 20 38
Age of wife: 32 30 31 32 53 20 33
Calculate the coefficient of correlation between the age of husband and wife.
Solution
Let the age of husband be denoted by x and the age of wife by y.
Calculation of correlation coefficient
xi yi xy x2 y2
35 32 1120 1225 1024
34 30 1020 1156 900
40 31 1240 1600 961
43 32 1376 1849 1024
56 53 2968 3136 2809
20 20 400 400 400
38 33 1254 1444 1089

 x  266;  y  231;  xy  9378;  x 2


 10810; y 2
 8207

Substitute in the formula and solve for r


n   xy     x   y 
r
[n   x 2     x  ]  n   y 2     y  
2 2

 

r ( x, y ) 
 7  9378   266  231  0,937
 7 10810    266 2   7  8207    2312 
  
Example: From the following data of marks in Accountancy and Statistics obtained by 6
students (out of 50), calculate the correlation coefficient:

Marks in Accountancy: 35 30 28 29 13 45
Marks in Statistics : 40 27 35 26 24 40

Solution
Let marks in Accountancy be denoted by x and marks in Statistics by y
xi   yi   xy     x²     y²
35   40   1400   1225   1600
30   27   810    900    729
28   35   980    784    1225
29   26   754    841    676
13   24   312    169    576
45   40   1800   2025   1600

Σx = 180; Σy = 192; Σxy = 6056; Σx² = 5944; Σy² = 6406

r = [6(6056) − 180(192)] / Sqrt([6(5944) − (180)²][6(6406) − (192)²]) = 1776 / Sqrt(3264 × 1572) ≈ 0.78

NOTE: The sign of the correlation coefficient and the sign of the slope of the regression line will
always be the same.

Scatter plots of data with various correlation coefficients

The coefficient of determination


The coefficient of determination is the square of the coefficient of correlation (i.e. r²).
The coefficient of determination gives the proportion of the variation in Y which is
explained by the variation in X.
In other words, the coefficient of determination gives the proportion of all the variation (in the y
values) that is explained (by the variation in the x values). For this reason r² is called the
coefficient of determination. For example, if r² = 0.5671, this means that the regression line
explains 56.71% of the total variation of the Y values around their mean. The remaining 43.29%
of the total variation in Y is unaccounted for by the regression line; this value is called the
coefficient of nondetermination.
Notice that, since −1 ≤ r ≤ 1, it follows that 0 ≤ r² ≤ 1.
The coefficient of correlation between two variables is most easily calculated by constructing a
table (see example below) with columns that contain the x and y variable values for each
individual, the value of xy for each individual, and the values of x² and y² for each individual.

The sum of each column is found, and these sums can then be substituted into the formula above
to find r.
Example: Using our previous data set of height vs femur length for 10 men, we get the table:

X       Y       XY          X²          Y²
70.8 42.5 3009 5012.64 1806.25
66.2 40.2 2661.24 4382.44 1616.04
71.7 44.4 3183.48 5140.89 1971.36
68.7 42.8 2940.36 4719.69 1831.84
67.6 40 2704 4569.76 1600
69.2 47.3 3273.16 4788.64 2237.29
66.5 43.4 2886.1 4422.25 1883.56
67.2 40.1 2694.72 4515.84 1608.01
68.3 42.1 2875.43 4664.89 1772.41
65.6 36 2361.6 4303.36 1296

Sum 681.8 418.8 28589.09 46520.4 17622.76

The coefficient of correlation for the variables is thus:

r = [10(28589.09) − (681.8)(418.8)] / Sqrt([10(46520.4) − (681.8)²][10(17622.76) − (418.8)²])
  = 353.06 / [Sqrt(352.76) × Sqrt(834.16)]
  = 353.06 / 542.4558
  = 0.651
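
As a cross-check, a library routine can be used instead of working through the column sums by hand. The sketch below assumes NumPy is available and uses numpy.corrcoef, which should give roughly the same value of r; squaring it gives the coefficient of determination discussed above.

# Sketch: checking the height vs. femur-length correlation with NumPy.
import numpy as np

height = [70.8, 66.2, 71.7, 68.7, 67.6, 69.2, 66.5, 67.2, 68.3, 65.6]
femur  = [42.5, 40.2, 44.4, 42.8, 40, 47.3, 43.4, 40.1, 42.1, 36]

r = np.corrcoef(height, femur)[0, 1]
print(round(r, 3))        # about 0.651
print(round(r ** 2, 3))   # coefficient of determination, about 0.424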

Exercise: Calculate the coefficient of correlation for the vehicle weight and miles per gallon
data sets. The table of variables is given below:

Variable X y Xy x2 y2
3.545 30 106.35 12.567025 900
2.6 32 83.2 6.76 1024
3.245 30 97.35 10.530025 900
3.93 24 94.32 15.4449 576
3.995 26 103.87 15.960025 676
3.115 30 93.45 9.703225 900
3.235 33 106.755 10.465225 1089
3.225 27 87.075 10.400625 729
2.44 37 90.28 5.9536 1369
3.24 32 103.68 10.4976 1024
2.29 37 84.73 5.2441 1369
2.5 34 85 6.25 1156
4.02 26 104.52 16.1604 676

Sums 41.38 398 1240.58 135.93675 12388


Linear Regression

If a pair of variables has a significant linear correlation, then the relationship between the data
values can be roughly approximated by a linear equation. The process of finding the linear
equation which best fits the data values is known as linear regression and the line of best fit is
called the regression line.

It is a fact of linear algebra and analysis that the least squares line of best fit to a set of data
values has an equation of the form ŷ = mx + b where:

m = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]   and   b = (Σy − mΣx) / n = ȳ − m·x̄

Example: For the vehicle weight vs. highway mileage data set, we have:

m = [13(1240.58) − (41.38)(398)] / [13(135.937) − (41.38)²] = −341.7 / 54.877 = −6.23

and

b = [398 − (−6.23)(41.38)] / 13 = 655.797 / 13 = 50.45

so our regression line is given by the equation ŷ = −6.23x + 50.45. The graph of this line is
shown on the scatter diagram for the data set below.
[Scatter diagram: Vehicle Weight (x axis, 2 to 4.5 thousand lbs) vs. MPG Highway (y axis, 20 to 40 mpg), with the regression line drawn through the points.]
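
The same slope and intercept can also be obtained with a standard fitting routine. The sketch below assumes NumPy is available and uses numpy.polyfit on the raw weight and MPG values; it should reproduce roughly m ≈ −6.23 and b ≈ 50.4.

# Sketch: least squares fit of highway MPG on vehicle weight with numpy.polyfit.
import numpy as np

weight = [3.545, 2.6, 3.245, 3.93, 3.995, 3.115, 3.235,
          3.225, 2.44, 3.24, 2.29, 2.5, 4.02]
mpg = [30, 32, 30, 24, 26, 30, 33, 27, 37, 32, 37, 34, 26]

m, b = np.polyfit(weight, mpg, 1)   # degree-1 (straight line) fit
print(round(m, 2), round(b, 2))     # approximately -6.23 and 50.4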


Line of Best Fit (Least Square Method)


A line of best fit is a straight line that is the best approximation of the given set of data which is
used to study the nature of the relation between two variables.

A line of best fit can be roughly determined using an eyeball method by drawing a straight line
on a scatter plot so that the number of points above the line and below the line is about equal
(and the line passes through as many points as possible).

A more accurate way of finding the line of best fit is the least square method.

Use the following steps to find the equation of line of best fit for a set of ordered pairs.

Step 1: Calculate the mean of the x -values and the mean of the y -values.

Step 2: Compute the sum of the squares of the x -values.


Step 3: Compute the sum of each x -value multiplied by its corresponding y -value.

Step 4: Calculate the slope of the line using the formula:

m = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]

where n is the total number of data points.

Step 5: Compute the y-intercept of the line by using the formula:

b = ȳ − m·x̄

where x̄ and ȳ are the means of the x- and y-coordinates of the data points respectively.
Step 6: Use the slope and the y-intercept to form the equation of the line.
Example:

Use the least square method to determine the equation of line of best fit for the data. Then plot
the line.

Solution:

Plot the points on a coordinate plane.


Calculate the mean of the x-values and the mean of the y-values, the sum of the squares of the
x-values, and the sum of each x-value multiplied by its corresponding y-value.

Calculate the slope using the formula from Step 4.

Calculate the y-intercept using the means of the x-values and of the y-values.

Use the slope and y-intercept to form the equation of the line of best fit.

The slope of the line is –1.1 and the y -intercept is 14.0.

Therefore, the equation is y = –1.1 x + 14.0.

Draw the line on the scatter plot.

Using the Regression Line to Predict Data Values

The primary use for the regression equation is to predict values for one variable given a value for
the other variable.

Example: Using our regression equation for the car data, we could estimate that a car that
weighed 3000 lbs (x = 3) would have a highway mpg of ŷ = −6.23(3) + 50.45 = 31.76.

Likewise, if we knew a car's highway mpg was 36 mpg, then we would estimate its weight by
solving 36 = −6.23x + 50.45 to get x ≈ 2.319, or a car that weighs about 2319 lbs.
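
Both predictions can be reproduced in a couple of lines. This sketch simply plugs values into the fitted equation, using the rounded coefficients from above.

# Sketch: using the regression line y_hat = -6.23x + 50.45 for prediction.
m, b = -6.23, 50.45

# Predict highway MPG for a 3000 lb car (x is in thousands of lbs).
x = 3.0
print(m * x + b)        # about 31.76 mpg

# Invert the equation to estimate weight for a car rated at 36 mpg.
y = 36
print((y - b) / m)      # about 2.319, i.e. roughly 2319 lbs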
