REGRESSION

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 56

REGRESSION

Definition

• “Regression is the measure of the average relationship between two or


more variables in terms of the original units of the data.”
• It is clear from the definitions that regression analysis is a statistical device
with the help of which we are in a position to estimate (or predict) the
unknown values of one variable from known values of another variable.
• The variable which is used to predict the variable of interest is called the
independent variable or explanatory variable and the variable we are trying
to predict is called the dependent variable or explained variable.
• The independent variable is denoted by X and dependent variable by Y.
• The term “linear” means that an equation of a straight line of the form Y =
a + bX,
• Where a and b are constants, is used to describe the average relationship
that exists between the two variables.
Linear Regression
Definition

• A good measure of relationship between two variables is given by


coefficient of correlation which tells us about the strength of relationship
and the direction of relationship as well.

• We also discussed that correlation coefficient measures only linear


relationship between two variables.

• After determining the correlation between two variables, we wish to


determine a mathematical relationship between them so that we can

(i) Predict the value of a variable based on the value of the other variable,
and

(ii) Explain the impact of changes in the values of variable on the values of
the other variable.
Definition

• In this topic, we discuss how to fit and analyze a regression model.

• Out of two variables, one is consider as a depended variable and other is


independent variable .

• Using regression model, we study those changes in the values of


independent variable that are resulted by the changes in the value of
dependent variable.

• Or in other words, we regress the dependent variable on the independent


variable. Therefore, the dependent variable is also called as the regressed
variable or study variable, while the independent variable is called as
regressor variable or explanatory variable.
Definition

• Out of rainfall and yield of a certain crop, rainfall is independent variable


which affects the yield of a certain crop. So yield becomes dependent
variable.

• Demand and supply, demand as affected by supply, demand becomes


regressed and supply becomes regressor.
Regression Line
A regression line, also called a line of best fit, is the line for which the sum of the
squares of the residuals is a minimum.

The Equation of a Regression Line


The equation of a regression line for an independent variable x and a
dependent variable y is
ŷ = mx + b
where ŷ is the predicted y-value for a given x-value. The slope m and
y-intercept b are given by
n ∑ xy − (∑ x )(∑ y ) ∑y ∑x
m= and b = y − mx = − m
n ∑ x 2 − (∑ x ) n n
2

where y is the mean of the y - values and x is the mean of the


x - values. The regression line always passes through (x , y ).
Regression Line
Example:
Find the equation of the regression line.

x y xy x2 y2
1 –3 –3 1 9
2 –1 –2 4 1
3 0 0 9 0
4 1 4 16 1
5 2 10 25 4
∑ x = 15 ∑ y = −1 ∑ xy = 9 ∑ x 2 = 55 ∑ y 2 = 15

n ∑ xy − (∑ x )(∑ y ) 5(9) − (15)(−1) 60


m= = = = 1.2
n ∑ x − (∑ x ) 5(55) − (15) 50
2 2 2

Continued.
Regression Line
Example continued:

−1 15
b = y − mx = − (1.2) = −3.8
5 5
The equation of the regression line is
ŷ = 1.2x – 3.8.
y
2
1
x
1 2 3 4 5

( )
−1
−1
−2 (x , y ) = 3,
5
−3
Regression Line
Example:
The following data represents the number of hours 12 different students watched
television during the weekend and the scores of each student who took a test the
following Monday.

a.) Find the equation of the regression line.


b.) Use the equation to find the expected test score for a student who watches
9 hours of TV.

Hours, x 0 1 2 3 3 5 5 5 6 7 7 10
Test score, y 96 85 82 74 95 68 76 84 58 65 75 50
xy 0 85 164 222 285 340 380 420 348 455 525 500
x2 0 1 4 9 9 25 25 25 36 49 49 100
y2 9216 7225 6724 5476 9025 4624 5776 7056 3364 4225 5625 2500

∑ x = 54 ∑ y = 908 ∑ xy = 3724 ∑ x 2 = 332 ∑ y 2 = 70836


Regression Line
Example continued:

n ∑ xy − (∑ x )(∑ y ) 12(3724) − (54)(908)


m= = ≈ −4.067
n ∑ x − (∑ x ) 12(332) − (54)
2 2 2

y
b = y − mx
908 54
100 (x , y ) = (1254 , 908
12 )
≈ (4.5,75.7)

= − (−4.067) 80

Test score
12 12
60
≈ 93.97
40

ŷ = –4.07x + 93.97 20
x
2 4 6 8 10
Hours watching TV
Continued.
Regression Line
Example continued:
Using the equation ŷ = –4.07x + 93.97, we can predict the test score for a student who
watches 9 hours of TV.

ŷ = –4.07x + 93.97

= –4.07(9) + 93.97

= 57.34

A student who watches 9 hours of TV over the weekend can expect to receive about a
57.34 on Monday’s test.
Regression Line
Example:
The following data represents the number of hours 20 different students watched
television during the weekend and the scores of each student who took a test the
following Monday.

a.) Find the equation of the regression line.


b.) Use the equation to find the expected test score for a student who watches
7 hours of TV.

Hours, x 0 1 2 3 3 5 5 5 6 7 7 10
Test score, y 94 83 82 64 97 78 76 86 56 63 73 60
Measures of Regression
and Prediction Intervals
Variation About a Regression Line
To find the total variation, you must first calculate the total deviation, the explained
deviation, and the unexplained deviation.

Total deviation = y i − y
Explained deviation = yˆ i − y
Unexplained deviation = y i − yˆ i
y (xi, yi)
Unexplained
Total deviation
y i − yˆ i
deviation
yi − y
(xi, ŷi) Explained
y deviation
(xi, yi)
yˆ i − y
x
x
Variation About a Regression Line
The total variation about a regression line is the sum of the squares of
the differences between the y-value of each ordered pair and the mean
of y.
Total variation = ∑ (y i − y )
2

The explained variation is the sum of the squares of the differences


between each predicted y-value and the mean of y.
Explained variation = ∑ (yˆ i − y )
2

The unexplained variation is the sum of the squares of the differences


between the y-value of each ordered pair and each corresponding
predicted y-value.
Unexplained variation = ∑ (y i − yˆ i )
2

Total variation = Explained variation + Unexplained variation


Coefficient of Determination
The coefficient of determination r2 is the ratio of the explained variation to the total
variation. That is,

r 2 = Explained variation
Total variation
Example:
The correlation coefficient for the data that represents the number of hours students
watched television and the test scores of each student is r ≈ −0.831.
− Find the
coefficient of determination.

r 2 ≈ (−0.831)2 About 69.1% of the variation in the test scores


can be explained by the variation in the hours
≈ 0.691
of TV watched. About 30.9% of the variation is
unexplained.
Residuals
After verifying that the linear correlation between two variables is significant, next we
determine the equation of the line that can be used to predict the value of y for a given
value of x.

Observed y-
y
value

d2 For a given x-value,


d1
d = (observed y-value) – (predicted y-value)

Predicted y- d
3
value
x

Each data point di represents the difference between the observed y-value and the
predicted y-value for a given x-value on the line. These differences are called residuals.
Least Square Estimation

• To obtain some reasonably good estimate of a and b we use the method of


least squares. We wish to have such values of a and b which help us the
most in studying Y as affected by X.
The Standard Error of Estimate
When a ŷ-value is predicted from an x-value, the prediction is a point estimate.

An interval can also be constructed.

The standard error of estimate se is the standard deviation of the observed yi -values about
the predicted ŷ-value for a given xi -value. It is given by

where n is the number of ordered pairs in the data set.

∑( y i − yˆ i )2
se =
n −2
The closer the observed y-values are to the predicted y-values, the smaller the standard error
of estimate will be.
The Standard Error of Estimate
Finding the Standard Error of Estimate

In Words In Symbols

1. Make a table that includes the column x i , y i , yˆ i , ( y i − yˆ i ),


heading shown. ( y i − yˆ i )2
2. Use the regression equation to yˆ = mx i + b
calculate the predicted y-values.
3. Calculate the sum of the squares of ∑( y i − yˆ i )2
the differences between each
observed y-value and the
corresponding predicted y-value.
4. Find the standard error of estimate. ∑( y i − yˆ i )2
se =
n −2
The Standard Error of Estimate
Example:
The regression equation for the following data is
ŷ = 1.2x – 3.8.
Find the standard error of estimate.

xi yi ŷi (yi – ŷi )2
1 –3 – 2.6 0.16
2 –1 – 1.4 0.16
3 0 – 0.2 0.04
4 1 1 0
5 2 2.2 0.04 Unexplained
∑ = 0.4 variation
∑( y i − yˆ i )2 0.4
se = = ≈ 0.365
n −2 5−2
The standard deviation of the predicted y value for a given x value is
about 0.365.
The Standard Error of Estimate
Example:
The regression equation for the data that represents the number of hours 12 different
students watched television during the weekend and the scores of each student who
took a test the following Monday is
ŷ = –4.07x + 93.97.
Find the standard error of estimate.

Hours, xi 0 1 2 3 3 5
Test score, yi 96 85 82 74 95 68
ŷi 93.97 89.9 85.83 81.76 81.76 73.62

Hours, xi 5 5 6 7 7 10
Test score, yi 76 84 58 65 75 50
ŷi 73.62 73.62 69.55 65.48 65.48 53.27
Continued.
The Standard Error of Estimate
Example:
The regression equation for the data that represents the number of hours 12 different
students watched television during the weekend and the scores of each student who
took a test the following Monday is
ŷ = –4.07x + 93.97.
Find the standard error of estimate.

Hours, xi 0 1 2 3 3 5
Test score, yi 96 85 82 74 95 68
ŷi 93.97 89.9 85.83 81.76 81.76 73.62
(yi – ŷi)2 4.12 24.01 14.67 60.22 175.3 31.58

Hours, xi 5 5 6 7 7 10
Test score, yi 76 84 58 65 75 50
ŷi 73.62 73.62 69.55 65.48 65.48 53.27
(yi – ŷi)2 5.66 107.74 133.4 0.23 90.63 10.69 Continued.
The Standard Error of Estimate
Example continued:

∑( y i − yˆ i )2 = 658.25
Unexplained
variation

∑( y i − yˆ i )2 658.25 ≈ 8.11
se = =
n −2 12 − 2

The standard deviation of the student test scores for a specific


number of hours of TV watched is about 8.11.

You might also like