M2L2 CLRM & Simple Linear Regression Analysis


MODULE 2: DEPENDENCE TECHNIQUES

Module Overview

More often than not, what we do and how we do it is due to factors we take into
consideration. Right? This is exactly how many variables behave! In this module,
we try to measure the extent to which some (dependent) variables are influenced by
other (independent) variables. This is the foundation for many other statistical
techniques, so make sure that you understand the concepts herein.


Lesson 2: Classical Linear Regression Model & Simple Regression Analysis

Time Frame:

Learning Outcomes

✓ Discuss the assumptions of the CLRM
✓ Do simple and multiple regression analysis
✓ Interpret results of regression analysis

Introduction

Empirical analysis has a lot to do with explaining relationships between variables, and
it tries to do so using regression models. This technique attempts to quantify how much
one variable changes as another variable changes, and it also provides information that
allows prediction and hypothesis testing. The regression model, however, has
assumptions, and it is necessary that you know and understand these before you
move on to its many uses.

ABSTRACTION

The Classical Linear Regression Model (CLRM)

Danao (2002) simplifies the concepts in CLRM as follows: With one explanatory
variable, the CLRM is formally given by the equation

𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖

where 𝑌𝑖 is the dependent variable (also referred to as the explained variable or regressand),
𝑋𝑖 is the independent variable (also referred to as the explanatory variable or regressor),
the 𝛽s are the regression coefficients, and 𝜀𝑖 is the error term (also referred to as the
stochastic error term or stochastic disturbance term; stochastic meaning random in
nature). The subscript 𝑖 denotes the 𝑖th observation.

The linearity in “linear regression” refers to linearity in the coefficients. That is, the 𝛽s
enter the model linearly; they are not raised to powers or expressed in other nonlinear forms.

The error term 𝜀𝑖 is assumed to be a random variable that follows some probability
distribution. This term can arise for many reasons, including (1) non-inclusion of
essential variables in the model, (2) inclusion of variables that should not be in the
model, (3) errors in measurement, (4) randomness of events, and (5) model
misspecification. Since 𝜀𝑖 is a random variable, 𝑌𝑖 will also be a random variable based
on the equation above. Consequently, each value of 𝑋𝑖 is associated with a probability
distribution of 𝑌.
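
To make this concrete, here is a minimal Python sketch (Python is not part of the course software; the parameter values below are illustrative assumptions, not estimates from any data set) that generates data satisfying the CLRM equation:

    import numpy as np

    # Simulate y_i = b0 + b1*x_i + e_i with assumed (not estimated) parameters
    rng = np.random.default_rng(0)
    beta0, beta1, sigma = 2.0, 0.5, 1.0   # assumed "true" parameters
    x = rng.uniform(0, 10, size=100)      # explanatory variable
    e = rng.normal(0, sigma, size=100)    # stochastic error term, e|X ~ N(0, sigma^2)
    y = beta0 + beta1 * x + e             # y inherits randomness from e

Because e is random, re-running the last line with new draws of e produces different y values for the same x, which is exactly the sense in which each 𝑋𝑖 is associated with a distribution of 𝑌.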

As may now be clear, econometric analysis employing regression attempts
to establish a causal relationship between two or more economic variables (Hill, Griffiths,
& Lim, 2018). Having an estimate of the model allows one to make predictions and do
hypothesis testing about its parameters. We will expound on this further in a later
section.

Assumptions of the CLRM

The assumptions of the CLRM and the brief discussions below were culled from Hill,
Griffiths, & Lim (2018) and Danao (2002). They do not provide an exhaustive
exposition of the assumptions; detailed discussions are in the ebooks provided with this
course pack. The only intention here is to give direction as to what you may have to
read further.

1. Observed data (𝑦𝑖 , 𝑥𝑖 ) satisfy the relationship 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝑒𝑖 , 𝑖 = 1, … , 𝑁.

Such a regression equation suggests that 𝑋 causes 𝑌. This assertion, however, must
foremost be based on theory and should precede estimation. The estimated
regression, however significant, does not by itself prove causality. This also assumes
that the data were randomly chosen from a population; as such, the observations are
assumed to be statistically independent and identically distributed.

2. The expected value of the error term is zero given 𝑋 = (𝑥1 , 𝑥2 , … , 𝑥𝑛 ) such that
𝐸 (𝑒𝑖 |𝑋) = 0. If this is true then

𝐸 (𝑦𝑖 |𝑋) = 𝛽0 + 𝛽1 𝑥𝑖 , 𝑖 = 1, … , 𝑁 and 𝑦𝑖 = 𝐸 (𝑦𝑖 |𝑋) + 𝑒𝑖 , 𝑖 = 1, … , 𝑁

3. Homoskedasticity. The conditional variance of the error term is constant,

𝑣𝑎𝑟(𝑒𝑖 |𝑋) = 𝜎²

4. No serial correlation. The conditional covariance of the error terms 𝑒𝑖 and 𝑒𝑗 is
zero, 𝑐𝑜𝑣(𝑒𝑖 , 𝑒𝑗 |𝑋) = 0 for all 𝑖 ≠ 𝑗. Strictly, zero covariance alone does not imply
independence; under the normality assumption below, however, uncorrelated errors
are also independent.

5. The explanatory variable 𝑥𝑖 must take at least two values. If this were not true, then
𝑥1 = 𝑥2 = ⋯ = 𝑥𝑛 and no relationship between 𝑋 and 𝑌 could be estimated.

6. The error term given 𝑋 is normally distributed such that 𝑒𝑖 |𝑋 ~ 𝑁(0, 𝜎²).

7. There must be positive degrees of freedom. This means that the number of
observations must be greater than the number of parameters being estimated; with
two parameters here, there have to be more than two data points. If there is only one
data point (𝑥1 , 𝑦1 ), an infinite number of lines go through it. If, on the other hand,
there are only two data points, (𝑥1 , 𝑦1 ) and (𝑥2 , 𝑦2 ), there is a unique line connecting
them, but the problem is no longer statistical since no degrees of freedom remain.

69
Course Pack for DGM 324 Empirical Methods 1: Applications of Basic Statistics
Ordinary Least Squares (OLS) Estimation

OLS is commonly used in estimating the coefficients (𝛽), or parameters, of the
regression equation. The least squares rule asserts that the estimated regression line
that best fits the data points is the one that minimizes the sum of squared vertical
distances from each data point to the line. That is, 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 and 𝑒̂𝑖 = 𝑦𝑖 − 𝑦̂𝑖 = 𝑦𝑖 − 𝑏0 − 𝑏1 𝑥𝑖 .

[Figure: the fitted regression line and residuals. Source: Hill, Griffiths, & Lim (2018), p.62]

Derived OLS estimators are computed using the following equations (technically,
referred to as normal equations):

𝑏1 = ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) / ∑(𝑥𝑖 − 𝑥̅)²

𝑏0 = ∑𝑦𝑖 /𝑛 − 𝑏1 ∑𝑥𝑖 /𝑛 = 𝑦̅ − 𝑏1 𝑥̅
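
For readers who want to check these formulas numerically, here is a minimal Python sketch (an illustration under the assumption that x and y are numeric arrays; Python is not required by this course):

    import numpy as np

    def ols_simple(x, y):
        # OLS estimates for the simple regression y = b0 + b1*x + e
        x, y = np.asarray(x, float), np.asarray(y, float)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        b0 = y.mean() - b1 * x.mean()
        return b0, b1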

OLS is called “ordinary least squares” to distinguish it from other methods such as
generalized least squares, weighted least squares, and two-stage least squares.

Goodness-of-Fit (Hill, Griffiths, & Judge, 1997)

The coefficient of determination (𝑅2 ) is an overall measure of how well the estimated
regression line fits the data. Technically, econometricians define this as “how much
variation in the dependent variable is explained by the regression model”.

𝑅² = 𝑆𝑆𝑅/𝑆𝑆𝑇 = 1 − 𝑆𝑆𝐸/𝑆𝑆𝑇 = 1 − (𝑛 − 2)𝜎̂²/𝑆𝑆𝑇

where

𝑆𝑆𝐸 = ∑(𝑦𝑖 − 𝑦̂𝑖)² = ∑𝑒̂𝑖²   Error sum of squares; the part of the total variation in 𝑦
                            about its sample mean that is not explained by the regression
𝑆𝑆𝑅 = ∑(𝑦̂𝑖 − 𝑦̅)²           Explained sum of squares; the part of the total variation in 𝑦
                            about its sample mean that is explained by the regression
𝑆𝑆𝑇 = ∑(𝑦𝑖 − 𝑦̅)²           Total sum of squares; measures the total variation in 𝑦
                            about its sample mean

Since 𝑅² measures the proportion (or percentage) of the total variation in 𝑦 that is
explained by the regression model, the higher the 𝑅², the better. A higher 𝑅² also means
that the estimated regression has better predictive ability. There are some limitations
to this, though, as will be explained later. Note that 0 ≤ 𝑅² ≤ 1. If 𝑅² = 0, then 𝑆𝑆𝑅 = 0
and 𝑦 and 𝑥 are uncorrelated; graphically, there is no linear association and the
fitted line is horizontal and identical to 𝑦̅. If 𝑅² = 1, then all the sample data fall exactly
on the fitted line, so that 𝑆𝑆𝐸 = 0.
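
A short sketch of this decomposition (assuming arrays x and y, and fitted values from the ols_simple sketch above):

    import numpy as np

    def goodness_of_fit(x, y, b0, b1):
        # SST, SSR, SSE and R^2 for a fitted simple regression
        y = np.asarray(y, float)
        y_hat = b0 + b1 * np.asarray(x, float)
        sst = np.sum((y - y.mean()) ** 2)    # total variation in y
        sse = np.sum((y - y_hat) ** 2)       # unexplained variation
        ssr = sst - sse                      # explained variation (SST = SSR + SSE)
        return sst, ssr, sse, ssr / sst      # last value is R^2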

The decomposition for 𝑆𝑆𝑇 above is usually presented in an analysis of variance table
of the statistical software output as shown below:

Source of Variation   DF*     Sum of Squares   Mean Square
Explained             1       𝑆𝑆𝑅              𝑆𝑆𝑅⁄1
Unexplained           𝑛 − 𝑘   𝑆𝑆𝐸              𝑆𝑆𝐸⁄(𝑛 − 2) = 𝜎̂²
Total                 𝑛 − 1   𝑆𝑆𝑇
* DF refers to degrees of freedom, computed as defined above; 𝑘 refers to the number of parameters
being estimated in the model.
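
The 𝐹 statistic reported alongside such a table is the ratio of the two mean squares; a sketch of the computation (assuming scipy is available):

    from scipy import stats

    def anova_f(ssr, sse, n, k=2):
        # Overall F test for a regression with k estimated parameters
        msr = ssr / (k - 1)                 # mean square, explained (DF = 1 here)
        mse = sse / (n - k)                 # mean square, unexplained (= sigma-hat^2)
        f = msr / mse
        p = stats.f.sf(f, k - 1, n - k)     # upper-tail p-value ("Significance F")
        return f, p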

The Simple Linear Regression Model

The “simple” in simple linear regression model does not in any way mean that it is easy
to do. It refers to the regression model presented above in which there is only one
explanatory variable.

At this point, it might be good to review some of the statistical concepts you have
learned in a previous course. Some of the important ones in an introductory course in
econometrics are presented here.

• An acceptable 𝑅² is quite arbitrary and sometimes depends on the purpose of the
study. For some, 70%–95% would be acceptable; values lower than 70% can be
considered low, and values above 95% may be too high. While a high 𝑅² is desired,
values considered too high can be indicative of violations of the assumptions
of the CLRM (we will discuss this in Module 3).
• A 𝑝-value for the 𝐹 statistic (𝑆𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒 𝐹) of 0.05 or lower usually
indicates that the regression model is significant (review hypothesis testing in your
statistics course). As you may remember, this rule is usually referred to as setting your
alpha to 0.05 (or 𝛼 = 0.05). Technically, in regression analysis, we say the
data support the claim that at least one of the coefficients is significantly different
from 0. In the simple case, the coefficient of the lone explanatory variable is significantly
different from 0. This means that the whole equation, 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 , is significant
and may prove to have some utility.
• The 𝑝-values of the 𝑡 statistics are used to evaluate whether the individual
explanatory variables are significant. In a simple regression analysis, we are
only looking at one independent variable. A value of 0.05 or lower likewise usually
indicates that the variable's coefficient is significantly different from 0 and that the
variable is useful in the regression model.
• Confidence intervals (𝐶𝐼) are useful because they allow sensitivity analysis.
Statistical software usually outputs a 95% confidence interval by default.
Simply put, a 95% confidence interval tells you that, across different samples drawn
from the same population, the estimated coefficient will fall within this range 95%
of the time. (A sketch reproducing these statistics follows this list.)
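
If Python happens to be available to you, all of the statistics just listed (𝑅², Significance 𝐹, 𝑡 p-values, and 95% CIs) can be reproduced at once with the statsmodels package; a minimal sketch with placeholder data (the real data would come from your course files):

    import numpy as np
    import statsmodels.api as sm

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # placeholder data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    results = sm.OLS(y, sm.add_constant(x)).fit() # OLS with an intercept
    print(results.rsquared)                       # R^2
    print(results.f_pvalue)                       # Significance F
    print(results.pvalues)                        # t-statistic p-values
    print(results.conf_int(alpha=0.05))           # 95% confidence intervals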

M2L2 Example 1: Simple Linear Regression Analysis. Real Personal
Consumption Expenditure (PCER) and Real Gross National Product (GNPR)
for the period 1975–1996 (Danao, 2002). File: pcer-gnpr.sav

Let’s do a simple regression analysis using the consumption and income data found in
Danao (2002). Results presented in the book used EViews, but we will try to replicate
them using MS Excel, which should be readily available on your computer.
Philippine data for real personal consumption expenditure (𝑃𝐶𝐸𝑅) and real
gross national product (𝐺𝑁𝑃𝑅) for the period 1975–1996 will be used in this example.

Initially, you may want to do a scatter diagram of 𝑃𝐶𝐸𝑅 and 𝐺𝑁𝑃𝑅 just to give you
some indication that you are indeed working on a linear regression problem. To
do this, highlight the data, then click on “Insert”. Then, choose “Scatter” (the type of
plot) under “Charts”. As you may well observe, the data points appear to line up
increasingly from the lower left to the upper right of the graph. The regression line has
been superimposed there to further illustrate this point.

To replicate the graph above in SPSS: Graphs\Legacy Dialogs\Scatter/Dot…\Simple
Scatter\Define, move gnpr to X Axis: and pcer to Y Axis:\OK.

Let’s do it first by using the normal equations presented above (see MS Excel file:
M2L1 Simple Regression Example 1).

𝑏1 = ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) / ∑(𝑥𝑖 − 𝑥̅)²

𝑏0 = ∑𝑦𝑖 /𝑛 − 𝑏1 ∑𝑥𝑖 /𝑛

Try doing it yourself first. Just work out the formulas in MS Excel. 𝑥𝑖 and 𝑦𝑖 refer
to the 𝑖th observations of the independent and dependent variables, respectively, while
𝑥̅ = ∑𝑥𝑖 /𝑛 and 𝑦̅ = ∑𝑦𝑖 /𝑛 are their averages. Do your calculations in MS Excel, then
compare them with those in “Sheet1” of the same file.

If you are not yet familiar with formulas or equations in MS Excel, just click on a cell
and press F2 to see how that particular value is computed. An example below for
(𝑥𝑖 − 𝑥̅):

You should see that the value of −191.126 is equal to the value in cell B10 less
the average value in B33. The $ signs in $B$33 are just a way to keep that cell
reference fixed when copying the formula for the rest of the data points.

Let’s do this example in SPSS this time.

Analyze\Regression\Linear… move pcer to Dependent: and gnpr to Independent(s):.
Click on Statistics…, tick Estimates, Confidence intervals, and Model fit\Continue\OK.

Evaluating and Interpreting Results:

• The goodness-of-fit statistic, 𝑅², appears acceptable. It says that 93.5% of the
variation in 𝑃𝐶𝐸𝑅 is explained by the regression model with 𝐺𝑁𝑃𝑅 as the explanatory
variable. Another way of stating this is “93.5% of the variation in 𝑃𝐶𝐸𝑅 is
explained by variation in 𝐺𝑁𝑃𝑅.”

• A highly significant 𝐹 statistic, 𝑝-value = 0.000, is indicative of a highly
significant regression equation, or relationship between the dependent and
independent variables.

• When asked what the estimated equation is, simply pick up the 𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡𝑠
estimates. For this problem, it is 𝑃𝐶𝐸𝑅𝑖 = −96.675 + 0.869𝐺𝑁𝑃𝑅𝑖 . This
means that for every unit increase in 𝐺𝑁𝑃𝑅, there is an expected increase of about
0.869 in 𝑃𝐶𝐸𝑅. The positive slope of 0.869 confirms the sign put forward in
economic theory: an increase in 𝐺𝑁𝑃𝑅 results in an increase in 𝑃𝐶𝐸𝑅. Economists
refer to this slope as the marginal propensity to consume.

So, assuming that you can project 𝐺𝑁𝑃𝑅 for the years 1997–2001 as shown
below, you will also be able to predict values of 𝑃𝐶𝐸𝑅 over the same period.

Year   Projected 𝐺𝑁𝑃𝑅   Estimated 𝑃𝐶𝐸𝑅
1997   933.284          714.746
1998   979.948          755.317
1999   1,004.450        776.620
2000   1,019.510        789.713
2001   1,050.100        816.309

But just how did we arrive at these estimates for 𝑃𝐶𝐸𝑅? Simply substitute the values
of 𝐺𝑁𝑃𝑅 into the estimated equation. Hence, for 1997,

𝑃𝐶𝐸𝑅₁₉₉₇ = −96.675 + 0.869(933.284) = 714.746

You may see small differences between estimates computed this way and the
“Estimated 𝑃𝐶𝐸𝑅” values in the table, because the former are based on rounded-off
coefficients from the software output.
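
The same substitution can be scripted; a Python sketch using the rounded coefficients quoted above (so its outputs, like the hand computation, may differ slightly from the table):

    b0, b1 = -96.675, 0.869                    # rounded estimates from the output
    gnpr_proj = {1997: 933.284, 1998: 979.948, 1999: 1004.450,
                 2000: 1019.510, 2001: 1050.100}
    for year, gnpr in gnpr_proj.items():
        print(year, round(b0 + b1 * gnpr, 3))  # predicted PCER for that year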

• The 𝑝-value of the 𝑡 statistic, 2.40609E-13, is also highly significant. Don’t
worry if you see the same value as that of the 𝐹 statistic: in simple regression, the
𝑝-values of the 𝑡 statistic and the 𝐹 statistic are equal. Because 𝐺𝑁𝑃𝑅 is
significant, we now know that 𝐺𝑁𝑃𝑅 helps explain the variation in 𝑃𝐶𝐸𝑅.

• The 95% 𝐶𝐼 for 𝐺𝑁𝑃𝑅 tells us that, assuming repeated sampling (or repeated trials),
the values of the coefficient will fall between 0.763 and 0.976. This is further
evidence that 𝐺𝑁𝑃𝑅 is significant: since the range 0.763 to 0.976 does not
include 0, the coefficient cannot be zero and is therefore significant.

M2L2 Example 2: Simple Linear Regression Analysis. Food Expenditure
and Household Income (Hill, Griffiths, & Judge, 1997). The data show survey
results from 40 households regarding their weekly income and food
expenditure. We will try to find the relationship between the two variables.

1. Create a scatterplot that will allow you to see whether the data seem to be linearly
related.

1.1. Highlight the whole data set with your mouse.
1.2. Reproduce the scatter plot below in SPSS. Since the points appear to rise from
the lower left to the upper right, a linear regression would appear to be appropriate.

To draw the line through the data points, double click on any of the data points
and the Chart Editor will appear as shown below. Click on Add Fit Line at Total
and a Properties window is shown. Click on Linear\Close. Close the Chart
Editor window and the graph will now be updated with the regression line.
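
Outside SPSS, the same scatter-with-fitted-line picture can be sketched in Python with matplotlib (the arrays below are placeholders; in practice they would hold the 40 observations of income and expenditure):

    import numpy as np
    import matplotlib.pyplot as plt

    inc = np.array([210.0, 335.0, 550.0, 760.0, 980.0])   # placeholder income data
    food = np.array([65.0, 90.0, 110.0, 140.0, 165.0])    # placeholder expenditure
    b1, b0 = np.polyfit(inc, food, 1)                     # fitted slope, intercept
    xs = np.linspace(inc.min(), inc.max(), 100)
    plt.scatter(inc, food)                                # the data points
    plt.plot(xs, b0 + b1 * xs)                            # superimposed fit line
    plt.xlabel("Weekly income"); plt.ylabel("Food expenditure")
    plt.show()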

2. Do a regression analysis with exp as the dependent variable and inc the independent
variable using SPSS.

3. What is the regression equation? 𝑒𝑥𝑝 = 40.768 + 0.128𝑖𝑛𝑐. What does this mean?
It means that a unit increase in Weekly Household Income will result in an
increase in Food Expenditure of 0.12829. Hence, if Weekly Household Income
increases by P1,000.00, Food Expenditure will increase by P128.29.
4. Is the regression equation significant? How do you know? Explain. Yes, because of
the 𝑆𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒 𝐹 of 0.000.
5. Is Income a good explanatory variable for Expenditure? How do you know?
Explain. Yes, the 𝑝-value of the 𝑡 statistic for Income is also very low at 0.000.
6. How do you assess the goodness of fit? It would seem low considering an 𝑅² =
0.317. This means that variation in Income is only able to explain about 32% of
the variation in Expenditure.
7. What does the 95% CI tell us? It says that the increase in Expenditure can be as
low as P66.47 or as high as P190.01 if Income increases by P1,000.00 (see the
sketch after this list).
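
To see where the P66.47–P190.01 range comes from, here is the arithmetic as a tiny sketch (the CI endpoints are those implied by the answer above):

    ci_low, ci_high = 0.06647, 0.19001             # 95% CI for the income coefficient
    increase = 1000.00                             # a P1,000.00 rise in weekly income
    print(ci_low * increase, ci_high * increase)   # -> 66.47 and 190.01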

M2L2 Application 1: Simple Linear Regression Analysis (SAP #2). Use
the file ceo.sav. It shows the salaries of 209 CEOs for the year 1990 (source:
Business Week, May 1991). Assume that the salary of a CEO can be
predicted by the firm's return on equity (𝑅𝑂𝐸 = average net income as a
percentage of common equity). Given what you have learned above about
performing regression analysis in SPSS, do as instructed below:

1. Prepare a scatter plot for the variables.
2. Find the regression equation that may be used to express the relationship between
𝑆𝐴𝐿𝐴𝑅𝑌 and 𝑅𝑂𝐸.
3. Is the regression equation significant? Why do you say so?

4. Is the independent variable significant? Why do you say so?
5. What is the 95% 𝐶𝐼 for the independent variable?
6. Let’s convert our variable salary into its natural logarithmic form (ln).

Transform\Compute Variable
Type your variable name (example: lnsalary) on Target Variable:
Click on All in Function Group: and select Ln on Functions and Special Variables:
Click on your variable to be transformed (example: salary) and move it to Numeric
Expression using the move arrow, then click OK

You should see the newly created variable in your Data View

Do a scatter plot of roe and lnsalary. What can you observe? Do you see any
advantage in converting a variable into its ln form?

What is the estimated regression equation now?

7. Considering the estimated regression equation in #6, is the independent variable
significant?
8. Considering the estimated regression equation in #6, what would 𝑆𝐴𝐿𝐴𝑅𝑌 be if
𝑅𝑂𝐸 = 30.0? Note that the predicted 𝑆𝐴𝐿𝐴𝑅𝑌 will still be in natural logs.
Converting it back to original values requires taking the antilog, i.e., exponentiating
the predicted value. Use a calculator for this (or see the sketch below).
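
A Python sketch of the whole step (the salary and roe arrays are placeholders standing in for the ceo.sav columns):

    import numpy as np
    import statsmodels.api as sm

    roe = np.array([14.1, 10.9, 23.5, 5.9, 13.8])               # placeholder ROE data
    salary = np.array([1095.0, 1001.0, 1122.0, 578.0, 1368.0])  # placeholder salaries
    lnsalary = np.log(salary)                      # the Ln() transform done in SPSS
    res = sm.OLS(lnsalary, sm.add_constant(roe)).fit()
    b0, b1 = res.params                            # estimated intercept and slope
    ln_pred = b0 + b1 * 30.0                       # predicted ln(SALARY) at ROE = 30.0
    print(np.exp(ln_pred))                         # exp() converts back from logs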

CLOSURE

Congratulations! You have just completed Lesson 2 of this module. The next lesson
should be even more interesting as we include more variables in the models being
estimated.

