Regression Analysis: Li-Ann Lee C. Nalangan


Regression Analysis

LI-ANN LEE C. NALANGAN


OUTLINE
1. Simple Regression
 Scatter plot
 Correlation
 Models, coefficients and interpretation
 Outliers
 Evaluation of statistical models

2. Multiple Regression
Simple Regression

 Learn…
To use regression analysis to explore the association between two quantitative variables
EXAMPLES OF RESEARCH QUESTIONS
How does one’s level of education affect one’s earnings?
Is the attention span of students affected by giving quizzes?
What is the effect of unemployment benefits on the duration of unemployment?
EXAMPLES OF RESEARCH QUESTIONS, Cont’d
How is literacy affected by household income?
How is unemployment affected by inflation?
What kinds of individual characteristics help explain an individual’s decision to retire?
What is the effect of drug use on wages?
The Scatterplot
 The first step in answering the question of association is to look at the data
 A scatterplot is a graphical display of the relationship between two variables
Person  Neckline (cm)  Waistline (cm)
1       38             72
2       34             64
3       31             59
4       31             64
5       32             68
6       29             62
7       33             75
8       31             66
9       32             72
10      30             67
11      30             64
12      33             66
13      36             78
14      35             81
15      35             66
[Figure: scatter plot of waistline (cm, vertical axis, 55 to 90) against neckline (cm, horizontal axis, 28 to 40)]
What information does a scatter plot give?

 The form of relationship (linear or non-linear)
 The strength of relationship
Possible relationships between X and Y in Scatter Diagrams
[Figure: six scatter diagrams of Y against X]
(a) Direct linear
(b) Inverse linear
(c) Direct curvilinear
(d) Inverse curvilinear
(e) Inverse linear with more scattering
(f) No linear relationship
Types of Relationships
 Direct vs. Inverse
◦ Direct - X and Y increase together
◦ Inverse - X and Y move in opposite directions
 Linear vs. Curvilinear
◦ Linear - A straight line best describes the relationship between X and Y
◦ Curvilinear - A curved line best describes the relationship between X and Y
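Direct vs. Inverse Relationship
Direct Relationship: Advertising and Sales increase/decrease together
[Figure: Sales (vertical axis) plotted against Advertising (horizontal axis)]
Inverse Relationship: Price and demand of durian move in opposite directions
[Figure: Price of durian (vertical axis) plotted against Durian demand (horizontal axis)]
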


South Cotabato Total rice production data 2007

municipality  area planted (ha)  area harvested (ha)  prodn volume (mt/yr)  farmers served
banga         9784               9713                 44907                 6154
koronadal     11267              10526                46813                 6433
lake sebu     5681               5846                 24286                 4033
norala        14623              12157                59235                 7895
polomolok     778                748                  3258                  487
sto.nino      16151              15648                71998                 8594
surallah      9881               9598                 44965                 6012
tampakan      163                100                  435                   131
tantangan     13828              12229                51434                 8654
t'boli        670                663                  2470                  568
tupi          665                669                  3056                  442

 Observe the 3 scatter-plots that follow
 Comment on the form of relationship of each pair used in the graph
Scatter plot of area planted of rice vs. area harvested at South Cotabato, 2007
Area planted and area harvested increase together. (Direct relationship)

Scatter plot of area planted of rice vs. production volume
Area planted and production volume increase together. (Direct relationship)

Scatter plot of farmers served vs. area planted of rice
Area planted and farmers served increase together. (Direct relationship)
Exercise 1
1. Identify two variables and construct a scatter-plot using SPSS.
2. Based on the diagram, what is the possible relationship of the variables?
3. What is the probable strength of the relationship between the variables used?
Direct vs. Inverse Relationship
An upward straight line implies a perfect direct linear correlation, ρ = 1, while a downward straight line means ρ = -1
Advertising and Sales are perfectly correlated
[Figure: Sales plotted against Advertising, all points on an upward straight line]
Price and demand of durian are perfectly inversely correlated
[Figure: Price of durian plotted against Durian Demand, all points on a downward straight line]



Strength of relationships between X and Y in Scatter plots
[Figure: four scatter plots of Y against X]
(a) Very strong linear relationship
(b) Moderate linear relationship
(c) No linear relationship
(d) Non-linear relationship
Correlation coefficient

 The sign could be + or -.
 “+” sign indicates the two variables increase (or decrease) together.
 “-” sign indicates one increases while the other decreases.
 The magnitude (0 to 1) is a measure of the strength of the association
Values and interpretation of r
Absolute value of r   Strength of linear relationship of two variables
0.01 - 0.20           very weak
0.21 - 0.40           weak
0.41 - 0.60           moderate
0.61 - 0.80           strong
0.81 - 0.99           very strong
Interpret the sign and magnitude of correlation coefficient
Correlation (r) between
 Area planted & area harvested, r = 0.994
 Farmers served & production volume, r = 0.982

Both pairs have a very strong direct linear relationship
Strength of linear relationship of rice production data
(each cell: Pearson r, with Sig. in parentheses)

                           area planted (ha)  area harvested (ha)  production volume (mt/yr)  farmers served
area planted (ha)          1                  .994** (.000)        .993** (.000)              .992** (.000)
area harvested (ha)        .994** (.000)      1                    .997** (.000)              .989** (.000)
production volume (mt/yr)  .993** (.000)      .997** (.000)        1                          .982** (.000)
farmers served             .992** (.000)      .989** (.000)        .982** (.000)              1
How to estimate correlation coefficients, Pearson (r)

$r = \dfrac{SP_{XY}}{\sqrt{SS_X \, SS_Y}}$

$SP_{XY} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$

$SS_X = \sum x^2 - \dfrac{(\sum x)^2}{n}$

$SS_Y = \sum y^2 - \dfrac{(\sum y)^2}{n}$

a.planted(X) a.Harvested(Y) XY XX YY
9784 9713 95031992 95726656 94342369
11267 10526 1.19E+08 1.27E+08 1.11E+08
5681 5846 33211126 32273761 34175716
14623 12157 1.78E+08 2.14E+08 1.48E+08
778 748 581944 605284 559504
16151 15648 2.53E+08 2.61E+08 2.45E+08
9881 9598 94837838 97634161 92121604
163 100 16300 26569 10000
13828 12229 1.69E+08 1.91E+08 1.5E+08
670 663 444210 448900 439569
665 669 444885 442225 447561


Sums: ΣX = 83491, ΣY = 77897, ΣXY = 9.43E+08, ΣXX = 1.02E+09, ΣYY = 8.75E+08

SP_XY = 3.52E+08
SS_X = 3.86E+08
SS_Y = 3.23E+08
r = 0.994
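
The hand computation above can be reproduced with a short script. This is only a sketch in plain Python (the slides themselves use SPSS); the function name `pearson_r` is an illustrative choice, and the data are the area planted and area harvested columns of the rice table.

```python
import math

# Area planted (X) and area harvested (Y), in ha, for the 11 municipalities.
planted = [9784, 11267, 5681, 14623, 778, 16151, 9881, 163, 13828, 670, 665]
harvested = [9713, 10526, 5846, 12157, 748, 15648, 9598, 100, 12229, 663, 669]

def pearson_r(x, y):
    """Pearson r computed as SP_XY / sqrt(SS_X * SS_Y)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    return sp_xy / math.sqrt(ss_x * ss_y)

r = pearson_r(planted, harvested)
print(round(r, 3))  # 0.994, matching the slide
```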
Hypothesis Test
 H0: ρ = 0 (There is no significant correlation between X and Y)
Ha: ρ ≠ 0 (There is significant correlation between X and Y)
 Test statistic: $t_c = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$
 Decision Rule: Reject H0 if $t_c > t_{\alpha/2}(n-2)$ or if $t_c < -t_{\alpha/2}(n-2)$
 For $t_{\alpha/2}$, use the t table


Hypothesis Test
 H0: ρ = 0 (There is no significant correlation between a.planted and a.harvested)
Ha: ρ ≠ 0 (There is significant correlation between a.planted and a.harvested)
 Test statistic: $t_c = \dfrac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{0.994 - 0}{\sqrt{\dfrac{1 - 0.994^2}{11 - 2}}} = 27.26$
 Decision Rule: Reject H0 if |tc| > t.025(9) = 2.262
 Decision: Since 27.26 > 2.262, reject H0.
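
As a quick check of the arithmetic in this test, the statistic can be evaluated in a few lines of Python. This is a minimal sketch: r, n and the critical value 2.262 are taken from the slide, and the variable names are illustrative.

```python
import math

r = 0.994           # sample correlation between area planted and area harvested
n = 11              # number of municipalities
t_critical = 2.262  # t.025(9), as read from the t table on the slide

# t_c = r / sqrt((1 - r^2) / (n - 2))
t_c = r / math.sqrt((1 - r ** 2) / (n - 2))
reject_h0 = abs(t_c) > t_critical

print(round(t_c, 2), reject_h0)  # 27.26 True
```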
Exercise 2
1. Identify at least three variables.
2. Compute their correlation coefficients.
3. Which pair of variables has significant
correlation?
4. Interpret the results.
Regression Analysis
 The next step in a regression analysis is to identify the response and explanatory variables
◦ We use Y to denote the response variable: the variable we want to predict in terms of other variables (factors)
◦ We use X to denote the explanatory variable: the variable(s) or factor(s) used to predict the value of the response (dependent) variable
EXAMPLES OF RESEARCH QUESTIONS
How does one’s level of education affect one’s earnings?
Response variable: earnings
Explanatory variable: level of education

What is the effect of unemployment benefits on the duration of unemployment?
Response variable: duration of unemployment
Explanatory variable: unemployment benefits

EXAMPLES OF RESEARCH QUESTIONS, Cont’d
How is literacy affected by household income?
Response variable: literacy
Explanatory variable: household income

How is unemployment affected by inflation?
Response variable: unemployment
Explanatory variable: inflation
The Regression Line Equation
 When the scatterplot shows a linear trend, a straight line fitted through the data points describes that trend
[Figure: scatter plot with fitted regression line; vertical axis 55 to 90, horizontal axis 28 to 40]
 The regression line is: $\hat{y} = a + bx$
 ŷ is the predicted value of the response variable y
 a is the y-intercept and b is the slope
 Constant is another name for y-intercept

$b = \dfrac{SP_{XY}}{SS_X}, \qquad a = \bar{Y} - b\bar{X}$

$SS_X = \sum X^2 - \dfrac{(\sum X)^2}{n}, \qquad SP_{XY} = \sum XY - \dfrac{(\sum X)(\sum Y)}{n}$
The Regression Line of Wage
What factors (or variables) determine higher or lower wage?
Years of education
Gender
Work experience

Let’s consider one factor as explanatory variable: years of education
How is years of education related to wage?
Hourly wage rate explained by years of education
$\hat{y} = a + bx$
Let
Y = hourly wage rate ($)
X = education (years)
a = -2.67853
b = 0.90538
[Figure: Hourly Wage Rate (vertical axis, 0 to 50) plotted against Years of Education (horizontal axis, 0 to 20)]

Wage = -2.67853 + 0.90538 (educ)

 If b is positive, correlation is also positive
 Higher values of education are positively correlated with higher values of wages
Wage = -2.679 + 0.905(educ)
Wage = -2.679 + 0.905(0)
Wage = -2.679 + 0.905(1)
Wage = -2.679 + 0.905(2)

Educ (yr)  Wage ($)
0          -2.679
1          -1.773
2          -0.868
3          0.0376
6          2.7538
10         6.3753
14         9.9968

 A person who has not attended school is estimated to have an hourly wage of $-2.679. (Meaningful?)
 For every additional 1 year of schooling, the hourly wage of a person is estimated to increase by $0.905 (the difference between consecutive rows for 0, 1 and 2 years).
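
The table of predicted wages can be regenerated from the fitted equation. The sketch below assumes only the two coefficients reported on the slide; the function name `predicted_wage` is illustrative.

```python
# Coefficients of the fitted line Wage = a + b * educ, as reported on the slide.
A = -2.67853
B = 0.90538

def predicted_wage(educ_years):
    """Predicted hourly wage ($) for a given number of years of education."""
    return A + B * educ_years

for educ in (0, 1, 2, 3, 6, 10, 14):
    print(educ, round(predicted_wage(educ), 4))
```

Each extra year of education moves the prediction by exactly the slope b, which is why consecutive rows of the table differ by about 0.905.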
The Regression Line of Production Volume

What factors (or variables) determine higher or lower production volume?
Area planted
Area harvested
Farmers served

Let’s consider one factor as explanatory variable: area harvested
How is area harvested related to production volume?
Production volume explained by area harvested
$\hat{y} = a + bx$
Let
Y = production volume (mt/yr)
X = area harvested (ha)
a = -446.341, b = 4.593
prodxn volume = -446.341 + 4.593 (area harvested)
 A municipality with no harvested area is estimated to have -446.341 mt/yr volume of rice production. (Meaningful?)
 For every additional 1 ha of harvested area, production volume is estimated to increase by 4.593 mt/yr.
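
The coefficients a and b can be recovered from the rice data with the slope and intercept formulas b = SP_XY / SS_X and a = Ȳ - bX̄. The following is a sketch in plain Python rather than the SPSS procedure used in the slides; `fit_line` is an illustrative name.

```python
# Area harvested (ha) and production volume (mt/yr) for the 11 municipalities.
harvested = [9713, 10526, 5846, 12157, 748, 15648, 9598, 100, 12229, 663, 669]
production = [44907, 46813, 24286, 59235, 3258, 71998, 44965, 435, 51434, 2470, 3056]

def fit_line(x, y):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sp_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(xi * xi for xi in x) - sum_x ** 2 / n
    b = sp_xy / ss_x
    a = sum_y / n - b * sum_x / n
    return a, b

a, b = fit_line(harvested, production)
print(round(a, 3), round(b, 3))  # close to -446.341 and 4.593
```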
For the municipality data, the prediction equation relating y = production volume to x = area harvested is:

prodxn volume = -446.341 + 4.593 (area harvested)

Find the predicted production volume of sto.nino, which has the largest area harvested (= 15648 ha).

a. -446.341
b. 4.593
c. 331133
d. 71425
South Cotabato Total rice production data 2007

municipality  area harvested (ha)  prodn volume (mt/yr)  estimated prodn vol  residual (error)
banga         9713                 44907                 44165                742
koronadal     10526                46813                 47900                -1087
lake sebu     5846                 24286                 26404                -2118
norala        12157                59235                 55391                3844
polomolok     748                  3258                  2989                 269
sto.nino      15648                71998                 71425                573
surallah      9598                 44965                 43637                1328
tampakan      100                  435                   13                   422
tantangan     12229                51434                 55721                -4287
t'boli        663                  2470                  2599                 -129
tupi          669                  3056                  2626                 430
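
The estimated volumes and residuals in this table follow directly from the prediction equation. The sketch below uses the rounded coefficients from the slide (-446.341 and 4.593); the names `predict` and `rows` are illustrative.

```python
# Rounded coefficients of prodxn volume = a + b * (area harvested), from the slide.
A, B = -446.341, 4.593

# municipality: (area harvested in ha, actual production volume in mt/yr)
rows = {
    "banga": (9713, 44907), "koronadal": (10526, 46813),
    "lake sebu": (5846, 24286), "norala": (12157, 59235),
    "polomolok": (748, 3258), "sto.nino": (15648, 71998),
    "surallah": (9598, 44965), "tampakan": (100, 435),
    "tantangan": (12229, 51434), "t'boli": (663, 2470),
    "tupi": (669, 3056),
}

def predict(area_harvested):
    """Estimated production volume (mt/yr) from area harvested (ha)."""
    return A + B * area_harvested

for name, (area, actual) in rows.items():
    estimate = predict(area)
    residual = actual - estimate  # observed minus predicted
    print(f"{name:10s} {round(estimate):6d} {round(residual):6d}")
```

A positive residual means the municipality produced more than the line predicts, and a negative one means it produced less.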
Actual Y data and Predicted (Estimated) Y
[Figure: actual Y and predicted Y of Wage ($), vertical axis 0 to 50, plotted against Educ (years), horizontal axis 0 to 20]
Residuals are Prediction Errors
 The regression equation is often called a prediction equation: $\hat{y} = a + bx$
 The difference between an observed outcome and its predicted value is the prediction error, called a residual
Outliers
 Outliers are observations with large residuals
 Check for outliers by plotting the data
 The regression line can be pulled toward an
outlier and away from the general trend of
points
Influential Observation
 An observation can be influential in affecting the regression line when two things happen:
◦ Its x value is low or high compared to the rest of the data
◦ It does not fall in the straight-line pattern that the rest of the data have
 It is usually an extreme observation in the X-variable, lying away from the bulk of the X-data
A Statistical Model
 A statistical model never holds exactly in
practice.
 It is merely a simple approximation for
reality
 Even though it does not describe reality
exactly, a model is useful if the true
relationship is close to what the model
predicts
 Select the best among many different
models.
How to evaluate models to
determine the best
Evaluation of a Statistical Model

A statistical model can be evaluated using:
 R²: the coefficient of determination
 Standard error of the estimate
 Significance of parameters

Coefficient of Determination, R²
 The proportion of variation in Y which is explained by X
 Ranges from 0 to 100 percent (or 0 to 1 if in decimal)
 The nearer it is to 100, the better the model
$R^2 = \dfrac{b_1 \, SP_{XY}}{SS_Y} \times 100\%, \qquad SS_Y = \sum Y^2 - \dfrac{(\sum Y)^2}{n}$
(here $b_1$ is the slope b of the fitted line)

R² is the % variation in Y explained by X

R² = 38.3%
R² of regression of production volume
 R² = .994 or 99.4%
 99.4% of the variation in production volume is explained by area harvested.
 This indicates that only 0.6% is not explained by area harvested (this 0.6% can probably be explained by other factors like area planted).
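
The value R² = .994 can be checked against the data using the formula R² = b · SP_XY / SS_Y. This is a plain-Python sketch, not the SPSS output itself; `r_squared` is an illustrative name.

```python
# Area harvested (ha) and production volume (mt/yr) for the 11 municipalities.
harvested = [9713, 10526, 5846, 12157, 748, 15648, 9598, 100, 12229, 663, 669]
production = [44907, 46813, 24286, 59235, 3258, 71998, 44965, 435, 51434, 2470, 3056]

def r_squared(x, y):
    """Coefficient of determination for a simple regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sp_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(xi * xi for xi in x) - sum_x ** 2 / n
    ss_y = sum(yi * yi for yi in y) - sum_y ** 2 / n
    b = sp_xy / ss_x           # slope of the fitted line
    return b * sp_xy / ss_y    # explained variation / total variation

r2 = r_squared(harvested, production)
print(round(r2, 3))  # 0.994
```

In simple regression this equals the square of the correlation coefficient between X and Y.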
Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .997a  .994      .993               2158.217
a. Predictors: (Constant), ha
Standard error of the estimate

A measure of the average difference between the actual value and the estimated value (obtained from the statistical model).

The smaller the standard error, the better the model.

Standard error of the regression of production volume

For the regression of production volume, the standard error of estimate is 2158.217.

Interpretation: The actual production volume differs from its predicted value using area harvested by an average of 2158.217 mt/yr.

Note: A possibly better model has a std. error smaller than 2158.217 mt/yr.
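
The standard error of the estimate can be approximated from the residual column of the production table shown earlier, using the usual formula sqrt(SSE / (n - k - 1)) with k explanatory variables. The formula is not spelled out on the slides, so treat this as a sketch; because the tabulated residuals are rounded to whole numbers, the result agrees with the SPSS value only to about one unit.

```python
import math

# Residuals (actual minus estimated production volume, mt/yr) from the table.
residuals = [742, -1087, -2118, 3844, 269, 573, 1328, 422, -4287, -129, 430]

n = len(residuals)  # 11 municipalities
k = 1               # one explanatory variable: area harvested

sse = sum(e * e for e in residuals)          # sum of squared errors
std_error = math.sqrt(sse / (n - k - 1))     # standard error of the estimate

print(round(std_error, 1))  # about 2158, close to the reported 2158.217
```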
Significance of parameters
 An explanatory variable is said to be an important
predictor variable if it has significant effect on
response variable.
 An explanatory variable with significant effect has
“sig” value less than .05 or .01 (the usual level of
significance)
 A good model includes (only) significant
explanatory variables
Sample SPSS output for Significance of parameters

Coefficients (a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient)

Model        B         Std. Error  Beta   t       Sig.
(Constant)   -446.341  1070.321           -.417   .686
Area prodxn  4.593     .120        .997   38.273  .000
a. Dependent Variable: mt/yr

 Area prodxn is a significant factor or explanatory variable (Sig. = .000)
 The constant is not significant (Sig. = .686) and could be excluded in modeling
Application of correlation coefficient on regression
 A strong relationship of the explanatory variable (X) to the response variable (Y) may imply that X is a good variable for predicting Y.
 Usually, we prioritize using the X with a higher correlation coefficient.

Note: Correlation does not imply causation
Exercise 3
1. Identify 2 variables, one explanatory and the other response variable.
2. Compute the coefficients of a regression model for the response variable.
3. Evaluate the model using 3 indicators of a good statistical model.
Sometimes we need to transform the data

[Figure: (a) Y versus PORC3_NR (percentage of large farms in number); (b) log10 Y versus log10 (PORC3_NR)]

Sometimes we need to transform the data

[Figure: Predicted vs Observed Plots: (a) model with variables not transformed: R² = 0.61; (b) Model 7: R² = 0.85]
OUTLINE

1. Simple Regression

2. Multiple Regression
 Models, coefficients and interpretation
 Evaluation of statistical models
 Cautions in using multiple regression
EXAMPLES OF RESEARCH QUESTIONS

How does one’s level of education, years of experience and gender affect one’s earnings?
Response variable: earnings
Explanatory variables: level of education, years of experience and gender

EXAMPLES OF RESEARCH QUESTIONS

How is literacy affected by household income, unemployment, family tradition and political condition?
Response variable: literacy
Explanatory variables: household income, unemployment, family tradition and political condition
South Cotabato Total rice production data 2007

municipality, area planted (ha), area harvested (ha), prodn volume (mt/yr), farmers served
banga 9784 9713 44907 6154
koronadal 11267 10526 46813 6433
lake sebu 5681 5846 24286 4033
norala 14623 12157 59235 7895
polomolok 778 748 3258 487
sto.nino 16151 15648 71998 8594
surallah 9881 9598 44965 6012
tampakan 163 100 435 131
tantangan 13828 12229 51434 8654
t'boli 670 663 2470 568
tupi 665 669 3056 442
 With data from South Cotabato on rice
production, which could be response
variable and its possible explanatory
variables?
 Data: area planted
area harvested
production volume
farmers served
What factors (or variables) determine higher or lower production volume?
Area planted
Area harvested
Farmers served

Let’s consider the 3 factors as explanatory variables: area harvested, area planted and farmers served.

Compare the results of this model (having 3 factors) with those of
prodxn volume = -446.341 + 4.593 (area harvested)
Production volume explained by 3 factors

Let Y = production volume (mt/yr)
X1 = area harvested (ha)
X2 = area planted (ha)
X3 = farmers served

$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3$
Production volume explained by 3 factors
Coefficients (a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient)

Model 1              B       Std. Error  Beta   t       Sig.
(Constant)           346.059 956.121            .362    .728
area harvested (ha)  4.344   .981        .943   4.429   .003
area planted (ha)    1.920   1.030       .455   1.865   .104
Farmers served       -3.029  1.330       -.403  -2.277  .057
a. Dependent Variable: production volume (mt/yr)

$\hat{y} = 346.059 + 4.344 x_1 + 1.92 x_2 - 3.029 x_3$

Production volume explained by 3 factors

With x1 = area harvested, x2 = area planted and x3 = farmers served:

prodxn vol = 346.059 + 4.344*(a.harvested) + 1.92*(a.planted) - 3.029*(farmers)
For the municipality data, the prediction equation relating production volume to area harvested, area planted and farmers served is:
prodxn vol = 346.059 + 4.344*(a.harvested) + 1.92*(a.planted) - 3.029*(farmers)

Find the predicted production volume of sto.nino with area harvested = 15648 ha, area planted = 16151 ha and farmers = 8594.

prodxn vol = 346.059 + 4.344*(15648) + 1.92*(16151) - 3.029*(8594)
prodxn vol = 73300
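
The same substitution can be written as a small function, which makes it easy to repeat for other municipalities. This is a sketch using the rounded coefficients from the slide; `predict_volume` is an illustrative name.

```python
def predict_volume(area_harvested, area_planted, farmers):
    """Predicted production volume (mt/yr) from the 3-factor equation."""
    return (346.059
            + 4.344 * area_harvested
            + 1.92 * area_planted
            - 3.029 * farmers)

# sto.nino: 15648 ha harvested, 16151 ha planted, 8594 farmers served
sto_nino = predict_volume(15648, 16151, 8594)
print(round(sto_nino))  # 73300
```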
prodxn vol = 346.059 + 4.344*(a.harvested) + 1.92*(a.planted) - 3.029*(farmers)
Interpretation:
 A municipality with no harvest, no area planted and no farmers served has an estimated 346.059 mt/yr volume of rice production. (Meaningful?)
 For every additional 1 ha of harvested area, production volume of rice is estimated to increase by 4.344 mt/yr, holding other factors fixed.
 For every additional 1 ha of area planted of rice, production volume is estimated to increase by 1.92 mt/yr, holding other factors fixed.
 For every additional 1 farmer served, production volume of rice is estimated to decrease by 3.029 mt/yr, holding other factors fixed. (Meaningful?)
Recall…
Evaluation of a Statistical Model
A good statistical model has:
 High R²: the coefficient of determination
 Low standard error of the estimate
 Significant parameters ONLY
R² of regression of production volume
 R² = .997 or 99.7%
 99.7% of the variation in production volume is explained by area harvested, area planted and farmers served.
 This is an additional 0.3 percentage points compared to the first model, which has R² = 99.4%.
Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
2      .998a  .997      .995               1809.882
a. Predictors: (Constant), area harvested, area planted, farmers served
Standard error of the regression of production volume

Standard error of estimate is 1809.882.

Note: This model has a smaller std. error of estimate compared to the 2158.217 mt/yr of the first.

Significance of parameters
Coefficients (a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient)

Model 2              B       Std. Error  Beta   t       Sig.
(Constant)           346.059 956.121            .362    .728 ns
area harvested (ha)  4.344   .981        .943   4.429   .003 **
area planted (ha)    1.920   1.030       .455   1.865   .104 ns
Farmers served       -3.029  1.330       -.403  -2.277  .057 ns
a. Dependent Variable: production volume (mt/yr)

 Area harvested is the only significant factor or explanatory variable; the rest are not.
Cautions in using multiple regression analysis
Caution 1: Factors (explanatory variables) should not be correlated with one another.

Result if violated:
The effect of each factor cannot be singled out, since highly correlated factors act on the response simultaneously.
Symptoms:
 High R²
 Few significant factors
 Negative ‘b’ (coefficient) estimates for an explanatory variable that is positively correlated with the response
 Strong/very strong correlation among factors
Recall… OUTPUT

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
2      .998a  .997      .995               1809.882
a. Predictors: (Constant), area harvested, area planted, farmers served

Coefficients

Model 2              B       Std. Error  Beta   t       Sig.
(Constant)           346.059 956.121            .362    .728 ns
area harvested (ha)  4.344   .981        .943   4.429   .003 **
area planted (ha)    1.920   1.030       .455   1.865   .104 ns
Farmers served       -3.029  1.330       -.403  -2.277  .057 ns
a. Dependent Variable: production volume (mt/yr)
prodxn vol = 346.059 + 4.344*(a.harvested) + 1.92*(a.planted) - 3.029*(farmers)
Interpretation:
 A municipality with no harvest, no area planted and no farmers served has an estimated 346.059 mt/yr volume of rice production. (Meaningful?)
 For every additional 1 farmer served, production volume of rice is estimated to decrease by 3.029 mt/yr, holding other factors fixed. (Meaningful?)
 This negative coefficient is not consistent with the positive correlation r = .982 between production volume and farmers served
Strength of linear relationship of rice production data
(each cell: Pearson r, with Sig. in parentheses)

                           area planted (ha)  area harvested (ha)  production volume (mt/yr)  farmers served
area planted (ha)          1                  .994** (.000)        .993** (.000)              .992** (.000)
area harvested (ha)        .994** (.000)      1                    .997** (.000)              .989** (.000)
production volume (mt/yr)  .993** (.000)      .997** (.000)        1                          .982** (.000)
farmers served             .992** (.000)      .989** (.000)        .982** (.000)              1
 High R2: 99.7%
 Few significant factors: 1 out of 3
 Negative ‘b’ (coefficients) on estimates for
positively correlated response and explanatory
variable: farmers served
 Strong/very strong correlation among
factors: r=.99

Diagnosis: Multicollinearity
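
One of the symptoms listed, strong correlation among the factors, can be checked directly by computing Pearson r for every pair of explanatory variables. The sketch below does this in plain Python; the 0.8 cut-off follows the "very strong" band of the interpretation table earlier in these slides, and the helper names are illustrative.

```python
import math
from itertools import combinations

# Explanatory variables from the rice production table.
factors = {
    "area planted": [9784, 11267, 5681, 14623, 778, 16151, 9881, 163, 13828, 670, 665],
    "area harvested": [9713, 10526, 5846, 12157, 748, 15648, 9598, 100, 12229, 663, 669],
    "farmers served": [6154, 6433, 4033, 7895, 487, 8594, 6012, 131, 8654, 568, 442],
}

def pearson_r(x, y):
    """Pearson r computed as SP_XY / sqrt(SS_X * SS_Y)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    return sp_xy / math.sqrt(ss_x * ss_y)

# Correlation for every pair of factors.
pair_r = {
    (f1, f2): pearson_r(factors[f1], factors[f2])
    for f1, f2 in combinations(factors, 2)
}

for (f1, f2), r in pair_r.items():
    flag = "very strong" if abs(r) > 0.8 else ""
    print(f"{f1} vs {f2}: r = {r:.3f} {flag}")
```

When every pair of factors is this strongly correlated, the individual coefficients of the 3-factor model should be read with caution.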
Caution 2: No patterns should remain in the residuals. (Residuals should be random and not related with one another.)

Result if violated:
The model estimate is not the BEST estimate for the data.
Symptoms:
 Residuals that increase with the predicted value (or explanatory variable).
 Significant correlations among the residuals
South Cotabato Total rice production data 2007

municipality, prodn volume (mt/yr), estimated prodn vol, residual (error)
banga 44907 42684 2223
koronadal 46813 48218 -1405
lake sebu 24286 24433 -147
norala 59235 57318 1917
polomolok 3258 3614 -356
sto.nino 71998 73300 -1302
surallah 44965 42801 2164
tampakan 435 697 -262
tantangan 51434 53806 -2372
t'boli 2470 2792 -322
tupi 3056 3190 -134
Detecting patterns

As the predicted production volume increases, the points become more scattered

Diagnosis: Heteroscedasticity
[Figure: two residual plots. Good: no heteroscedasticity. Bad: heteroscedasticity]
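
An informal way to see this pattern without a plot is to compare the typical size of the residuals for municipalities with low and high predicted volumes. The sketch below uses the estimated volumes and residuals of the 3-factor model from the table above; splitting at the middle of the sorted predictions is an assumption made for illustration, not a formal test of heteroscedasticity.

```python
# Estimated production volume and residual for each municipality (3-factor model).
predicted = [42684, 48218, 24433, 57318, 3614, 73300, 42801, 697, 53806, 2792, 3190]
residuals = [2223, -1405, -147, 1917, -356, -1302, 2164, -262, -2372, -322, -134]

# Sort by predicted value and split into a lower and an upper group.
pairs = sorted(zip(predicted, residuals))
half = len(pairs) // 2
low_group, high_group = pairs[:half], pairs[half:]

def mean_abs_residual(group):
    """Average absolute residual within a group of (predicted, residual) pairs."""
    return sum(abs(e) for _, e in group) / len(group)

low_spread = mean_abs_residual(low_group)
high_spread = mean_abs_residual(high_group)

print(round(low_spread, 1), round(high_spread, 1))
```

A much larger spread in the upper group is in line with the slide's observation that the points become more scattered as the predicted volume increases.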
Caution 3: Be careful in assuming the model form (linear or non-linear)

Result if violated:
Forecasts (predicted values or estimates) may be incorrect.

Caution 4: Check data for outliers or typographical errors.

Result if violated:
Analyses and forecasts (predicted values or estimates) may be incorrect.
Exercise 4
1. Identify 1 response variable and at least 2 explanatory variables.
2. Estimate the regression model for the response variable.
3. Evaluate the model using 3 indicators of a good statistical model.
4. Be reminded of the CAUTIONS.