
Course: Basic Econometrics (807)

Semester: Autumn, 2021


ASSIGNMENT No. 1
Q.1 What does the Gauss Markov Theorem tell about the properties of OLS estimators. What are the
assumptions under which this theorem holds?
The Gauss-Markov theorem states that if your linear regression model satisfies the first six classical
assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that have the smallest
variance of all possible linear estimators.
The Gauss-Markov theorem famously states that OLS is BLUE: the Best Linear Unbiased Estimator.
In this context, “best” refers to the minimum variance or the narrowest sampling distribution.
More specifically, when your model satisfies the assumptions, OLS coefficient estimates follow the tightest
possible sampling distribution of unbiased estimates compared to other linear estimation methods.
Regression analysis is like any other inferential methodology. Our goal is to draw a random sample from
a population and use it to estimate the properties of that population. In regression analysis, the coefficients in
the equation are estimates of the actual population parameters.
The notation for the model of a population is the following:

y = β0 + β1x1 + β2x2 + … + βkxk + ε
The betas (β) represent the population parameter for each term in the model. Epsilon (ε) represents the random
error that the model doesn’t explain. Unfortunately, we’ll never know these population values because it is
generally impossible to measure the entire population. Instead, we’ll obtain estimates of them using our random
sample.
The notation for an estimated model from a random sample is the following:

ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂kxk + e
The hats over the betas indicate that these are parameter estimates while e represents the residuals, which are
estimates of the random error.
Typically, statisticians consider estimates to be useful when they are unbiased (correct on average) and precise
(minimum variance). To apply these concepts to parameter estimates and the Gauss-Markov theorem, we’ll
need to understand the sampling distribution of the parameter estimates.
Imagine that we repeat the same study many times. We collect random samples of the same size, from the same
population, and fit the same OLS regression model repeatedly. Each random sample produces different
estimates for the parameters in the regression equation. After this process, we can graph the distribution of
estimates for each parameter. Statisticians refer to this type of distribution as a sampling distribution, which is a
type of probability distribution.
Keep in mind that each curve represents the sampling distribution of the estimates for a single parameter. The
graphs below tell us which values of parameter estimates are more and less common. They also indicate how far
estimates are likely to fall from the correct value.
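As a hedged illustration of this thought experiment, the following sketch (an assumed simple model y = 2 + 3x + ε, not from the original text) repeatedly draws samples, refits OLS, and plots the resulting slope estimates:

# simulate the sampling distribution of an OLS slope estimate
set.seed(1)

n_sims <- 5000
slope_hat <- numeric(n_sims)

for (s in 1:n_sims) {
  x <- runif(100)                        # fresh sample of the predictor
  y <- 2 + 3 * x + rnorm(100, sd = 1)    # true slope is 3
  slope_hat[s] <- coef(lm(y ~ x))[2]     # store the estimated slope
}

# the histogram approximates the sampling distribution of the slope;
# unbiasedness means it centers on the true value 3
hist(slope_hat, breaks = 50, main = "Sampling distribution of the slope")
abline(v = 3, lty = 2)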

Unbiased Estimates: Sampling Distributions Centered on the True Population Parameter
In the graph below, beta represents the true population value. The curve on the right centers on a value that is
too high. This model tends to produce estimates that are too high, which is a positive bias. It is not correct on
average. However, the curve on the left centers on the actual value of beta. That model produces parameter
estimates that are correct on average. The expected value is the actual value of the population parameter. That’s
what we want and satisfying the OLS assumptions helps us!

Keep in mind that the curve on the left doesn’t indicate that an individual study necessarily produces an
estimate that is right on target. Instead, it means that OLS produces the correct estimate on average when the
assumptions hold true. Different studies will generate values that are sometimes higher and sometimes lower—
as opposed to having a tendency to be too high or too low.
Minimum Variance: Sampling Distributions are Tight Around the Population Parameter
In the graph below, both curves center on beta. However, one curve is wider than the other because the
variances are different. Broader curves indicate that there is a higher probability that the estimates will be
further away from the correct value. That’s not good. We want our estimates to be close to beta.


Both studies are correct on average. However, we want our estimates to follow the narrower curve because
they’re likely to be closer to the correct value than the wider curve. The Gauss-Markov theorem states that
satisfying the OLS assumptions keeps the sampling distribution as tight as possible for unbiased estimates.
Q.2 Explain confidence intervals for regression. Find out confidence intervals for regression coefficients
β1, β2 and σ².
A 95% confidence interval for βi has two equivalent definitions:
 The interval is the set of values for which a hypothesis test to the level of 5% cannot be rejected.
 The interval has a probability of 95% to contain the true value of βi. So in 95% of all samples that could
be drawn, the confidence interval will cover the true value of βi.
We also say that the interval has a confidence level of 95%.
To get a better understanding of confidence intervals we conduct another simulation study. For now, assume
that we have the following sample of n = 100 observations on a single variable Y where

Yi ~ i.i.d. N(5, 25),  i = 1, …, 100.
# set seed for reproducibility
set.seed(4)

# generate and plot the sample data
Y <- rnorm(n = 100,
           mean = 5,
           sd = 5)

plot(Y,
     pch = 19,
     col = "steelblue")

We assume that the data is generated by the model

Yi = μ + ϵi

where μ is an unknown constant and we know that ϵi ~ i.i.d. N(0, 25). In this model, the OLS estimator for μ is given by

μ̂ = Ȳ = (1/n) Σ Yi,

i.e., the sample average of the Yi. It further holds that

SE(μ̂) = σϵ/√n = 5/√100 = 0.5

(see Chapter 2). A large-sample 95% confidence interval for μ is then given by

CI_0.95 = [μ̂ − 1.96 × 5/√100 , μ̂ + 1.96 × 5/√100].   (5.1)
It is fairly easy to compute this interval in R by hand. The following code chunk generates a named vector
containing the interval bounds:
cbind(CIlower = mean(Y) - 1.96 * 5 / 10, CIupper = mean(Y) + 1.96 * 5 / 10)
#> CIlower CIupper
#> [1,] 4.502625 6.462625
Knowing that μ = 5, we see that, for our example data, the confidence interval covers the true value.
Unlike in real-world applications, we can use R to get a better understanding of confidence intervals by
repeatedly sampling data, estimating μ and computing the confidence interval for μ as in (5.1).

The procedure is as follows:
 We initialize the vectors lower and upper in which the simulated interval limits are to be saved. We want
to simulate 10000 intervals, so both vectors are set to have this length.
 We use a for() loop to sample 100 observations from the N(5, 25) distribution and compute μ̂ as well as
the boundaries of the confidence interval in every iteration of the loop.
 Finally, we join lower and upper in a matrix.
# set seed
set.seed(1)

# initialize vectors of lower and upper interval boundaries
lower <- numeric(10000)
upper <- numeric(10000)

# loop sampling / estimation / CI
for(i in 1:10000) {
  Y <- rnorm(100, mean = 5, sd = 5)
  lower[i] <- mean(Y) - 1.96 * 5 / 10
  upper[i] <- mean(Y) + 1.96 * 5 / 10
}

# join vectors of interval bounds in a matrix
CIs <- cbind(lower, upper)
According to Key Concept 5.3 we expect that the fraction of the 10000 simulated intervals saved in the
matrix CIs that contain the true value μ = 5 should be roughly 95%. We can easily check this using
logical operators:
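A minimal check (the comparisons return logical vectors, and mean() of a logical vector gives the fraction of TRUE values):

# fraction of the 10000 intervals that contain the true value mu = 5
mean(CIs[, 1] <= 5 & 5 <= CIs[, 2])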
The simulation shows that the fraction of intervals covering μ = 5, i.e., those intervals for
which H0: μ = 5 cannot be rejected, is close to the theoretical value of 95%.
Let us draw a plot of the first 100 simulated confidence intervals and indicate those which do not cover
the true value of μ. We do this via horizontal lines representing the confidence intervals on top of each
other.
# identify intervals not covering mu
# (4 intervals out of 100)
ID <- which(!(CIs[1:100, 1] <= 5 & 5 <= CIs[1:100, 2]))

# initialize the plot
plot(0,
     xlim = c(3, 7),
     ylim = c(1, 100),
     ylab = "Sample",
     xlab = expression(mu),
     main = "Confidence Intervals")

# set up color vector
colors <- rep(gray(0.6), 100)
colors[ID] <- "red"

# draw reference line at mu=5
abline(v = 5, lty = 2)

# add horizontal bars representing the CIs
for(j in 1:100) {
  lines(c(CIs[j, 1], CIs[j, 2]),
        c(j, j),
        col = colors[j],
        lwd = 2)
}
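The question also asks for confidence intervals for the regression coefficients β1, β2 and for σ². The following is a minimal sketch with simulated data (the model y = β1 + β2x and all variable names are illustrative assumptions): confint() gives intervals for the coefficients, and an interval for σ² follows from the chi-square distribution of the scaled residual sum of squares.

# simulated data for illustration
set.seed(2)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)

fit <- lm(y ~ x)

# 95% confidence intervals for beta1 (intercept) and beta2 (slope)
confint(fit, level = 0.95)

# 95% confidence interval for sigma^2:
# (n - k) s^2 / chisq(0.975) <= sigma^2 <= (n - k) s^2 / chisq(0.025)
df <- fit$df.residual              # n - k residual degrees of freedom
s2 <- sum(resid(fit)^2) / df       # unbiased estimate of sigma^2
c(lower = df * s2 / qchisq(0.975, df),
  upper = df * s2 / qchisq(0.025, df))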
Q.3 Consider the regression model Y = β1 + β2X2 + β3X3 + ε. A sample of size 25 was taken for
estimation of the model.
a) Explain the OLS method for estimation of the parameters.
b) How to test the joint hypothesis β2 = β3 = 0.
c) Explain how to test the significance of β2 and β3 separately.
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses
several explanatory variables to predict the outcome of a response variable. The goal of multiple linear
regression (MLR) is to model the linear relationship between the explanatory (independent) variables and
response (dependent) variable.
In essence, multiple regression is the extension of ordinary least-squares (OLS) regression because it involves
more than one explanatory variable.

 Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable.
 Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
 MLR is used extensively in econometrics and financial inference.
Formula and Calculation of Multiple Linear Regression
yi = β0 + β1xi1 + β2xi2 + … + βpxip + ϵi
where, for i = 1, …, n observations:
yi = dependent variable
xi = explanatory variables
β0 = y-intercept (constant term)
βp = slope coefficients for each explanatory variable
ϵi = the model’s error term (also known as the residuals)
Simple linear regression is a function that allows an analyst or statistician to make predictions about one
variable based on the information that is known about another variable. Linear regression can only be used
when one has two continuous variables—an independent variable and a dependent variable. The independent
variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression
model extends to several explanatory variables.
The multiple regression model is based on the following assumptions:
 There is a linear relationship between the dependent variables and the independent variables
 The independent variables are not too highly correlated with each other
 yi observations are selected independently and randomly from the population
 Residuals should be normally distributed with a mean of 0 and variance σ²
The coefficient of determination (R-squared) is a statistical metric that is used to measure how much of the
variation in outcome can be explained by the variation in the independent variables. R2 always increases as
more predictors are added to the MLR model, even though the predictors may not be related to the outcome
variable.
Thus, R2 by itself can't be used to identify which predictors should be included in a model and which should be
excluded. R2 can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by any of the
independent variables and 1 indicates that the outcome can be predicted without error from the independent
variables.
When interpreting the results of multiple regression, beta coefficients are valid while holding all other variables
constant ("all else equal"). The output from a multiple regression can be displayed horizontally as an equation,
or vertically in table form.
As an example, an analyst may want to know how the movement of the market affects the price of ExxonMobil

(XOM). In this case, their linear equation will have the value of the S&P 500 index as the independent variable,
or predictor, and the price of XOM as the dependent variable.
In reality, there are multiple factors that predict the outcome of an event. The price movement of ExxonMobil,
for example, depends on more than just the performance of the overall market. Other predictors such as the
price of oil, interest rates, and the price movement of oil futures can affect the price of XOM and stock prices of
other oil companies. To understand a relationship in which more than two variables are present, multiple linear
regression is used.
Multiple linear regression (MLR) is used to determine a mathematical relationship among a number of random
variables. In other terms, MLR examines how multiple independent variables are related to one dependent
variable. Once each of the independent factors has been determined to predict the dependent variable, the
information on the multiple variables can be used to create an accurate prediction on the level of effect they
have on the outcome variable. The model creates a relationship in the form of a straight line (linear) that best
approximates all the individual data points.
Referring to the MLR equation above, in our example:
 yi = dependent variable: the price of XOM
 xi1 = interest rates
 xi2 = oil price
 xi3 = value of S&P 500 index
 xi4 = price of oil futures
 B0 = y-intercept (constant term)
 B1 = regression coefficient that measures a unit change in the dependent variable when xi1 changes (the
change in XOM price when interest rates change)
 B2 = coefficient value that measures a unit change in the dependent variable when xi2 changes (the
change in XOM price when oil prices change)
The least-squares estimates B0, B1, B2 … Bp are usually computed by statistical software. Any number of variables
can be included in the regression model, with each independent variable indexed by a number: 1, 2, 3,
4 … p. The multiple regression model allows an analyst to predict an outcome based on information provided on
multiple explanatory variables.
Still, the model is not always perfectly accurate, as each data point can differ slightly from the outcome
predicted by the model. The residual value, e, which is the difference between the actual outcome and the
predicted outcome, is included in the model to account for such slight variations.
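To tie this back to parts (a), (b) and (c) of the question, here is a hedged sketch in R with simulated data of size n = 25, as in the question (the data and variable names are illustrative, not from the original text). coef() returns the OLS estimates; the coefficient table from summary() gives the separate t-tests for β2 and β3; and the overall F-statistic reported by summary() tests the joint hypothesis β2 = β3 = 0 with 2 and 22 degrees of freedom.

set.seed(3)

# simulated sample of size 25 for illustration
X2 <- rnorm(25)
X3 <- rnorm(25)
Y  <- 1 + 0.8 * X2 + 0.5 * X3 + rnorm(25)

fit <- lm(Y ~ X2 + X3)

# (a) OLS estimates of beta1 (intercept), beta2 and beta3
coef(fit)

# (b) the F-statistic at the bottom of the summary tests beta2 = beta3 = 0
# (c) the coefficient table gives a t-test for each coefficient separately
summary(fit)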
Regression analysis is like other inferential methodologies. Our goal is to draw a random sample from
a population and use it to estimate the properties of that population.
In regression analysis, the coefficients in the regression equation are estimates of the actual
population parameters. We want these coefficient estimates to be the best possible estimates!

Suppose you request an estimate—say for the cost of a service that you are considering. How would you define
a reasonable estimate?
1. The estimates should tend to be right on target. They should not be systematically too high or too low. In
other words, they should be unbiased or correct on average.
2. Recognizing that estimates are almost never exactly correct, you want to minimize the discrepancy
between the estimated value and actual value. Large differences are bad!
These two properties are exactly what we need for our coefficient estimates!
When your linear regression model satisfies the OLS assumptions, the procedure generates unbiased coefficient
estimates that tend to be relatively close to the true population values (minimum variance). In fact, the Gauss-
Markov theorem states that OLS produces better estimates than any other linear unbiased estimation
method when the assumptions hold true.
The Seven Classical OLS Assumptions
Like many statistical analyses, ordinary least squares (OLS) regression has underlying assumptions. When these
classical assumptions for linear regression are true, ordinary least squares produces the best estimates. However,
if some of these assumptions are not true, you might need to employ remedial measures or use other estimation
methods to improve the results.
Many of these assumptions describe properties of the error term. Unfortunately, the error term is a population
value that we’ll never know. Instead, we’ll use the next best thing that is available—the residuals. Residuals are
the sample estimate of the error for each observation.
Residuals = Observed value – the fitted value
When it comes to checking OLS assumptions, assessing the residuals is crucial!
There are seven classical OLS assumptions for linear regression. The first six are mandatory to produce the best
estimates. While the quality of the estimates does not depend on the seventh assumption, analysts often evaluate
it for other important reasons that I’ll cover.
OLS Assumption 1: The regression model is linear in the coefficients and the error term
This assumption addresses the functional form of the model. In statistics, a regression model is linear when all
terms in the model are either the constant or a parameter multiplied by an independent variable. You build the
model equation only by adding the terms together. These rules constrain the model to one type:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

In the equation, the betas (βs) are the parameters that OLS estimates. Epsilon (ε) is the random error.
In fact, the defining characteristic of linear regression is this functional form of the parameters rather than the
ability to model curvature. Linear models can model curvature by including nonlinear variables such as
polynomials and transforming exponential functions.
To satisfy this assumption, the correctly specified model must fit the linear pattern.
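For instance, a model with a squared term still satisfies this assumption because it remains linear in the parameters; a brief sketch with simulated data (illustrative, not from the original text):

set.seed(5)
x <- runif(100, 0, 10)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(100)

# I(x^2) adds curvature, but the model is still linear in the betas,
# so OLS applies
fit <- lm(y ~ x + I(x^2))
coef(fit)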

OLS Assumption 2: The error term has a population mean of zero
The error term accounts for the variation in the dependent variable that the independent variables do not
explain. Random chance should determine the values of the error term. For your model to be unbiased, the
average value of the error term must equal zero.
Suppose the average error is +7. This non-zero average error indicates that our model systematically
underpredicts the observed values. Statisticians refer to systematic error like this as bias, and it signifies that our
model is inadequate because it is not correct on average.
Stated another way, we want the expected value of the error to equal zero. If the expected value is +7 rather
than zero, part of the error term is predictable, and we should add that information to the regression model itself.
We want only random error left for the error term.
You don’t need to worry about this assumption when you include the constant in your regression model because
it forces the mean of the residuals to equal zero. For more information about this assumption, read my post
about the regression constant.
OLS Assumption 3: All independent variables are uncorrelated with the error term
If an independent variable is correlated with the error term, we can use the independent variable to predict the
error term, which violates the notion that the error term represents unpredictable random error. We need to find
a way to incorporate that information into the regression model itself.
This assumption is also referred to as exogeneity. When this type of correlation exists, there is endogeneity.
Violations of this assumption can occur because there is simultaneity between the independent and dependent
variables, omitted variable bias, or measurement error in the independent variables.
Violating this assumption biases the coefficient estimate. To understand why this bias occurs, keep in mind that
the error term always explains some of the variability in the dependent variable. However, when an independent
variable correlates with the error term, OLS incorrectly attributes some of the variance that the error term
actually explains to the independent variable instead. For more information about violating this assumption,
read my post about confounding variables and omitted variable bias.
OLS Assumption 4: Observations of the error term are uncorrelated with each other
One observation of the error term should not predict the next observation. For instance, if the error for one
observation is positive and that systematically increases the probability that the following error is positive, that
is a positive correlation. If the subsequent error is more likely to have the opposite sign, that is a negative
correlation. This problem is known both as serial correlation and autocorrelation.
Assess this assumption by graphing the residuals in the order that the data were collected. You want to see a
randomness in the plot. In the graph for a sales model, there appears to be a cyclical pattern with a positive
correlation.

10
Course: Basic Econometrics (807)
Semester: Autumn, 2021
OLS Assumption 5: The error term has a constant variance (no heteroscedasticity)
The variance of the errors should be consistent for all observations. In other words, the variance does not
change for each observation or for a range of observations. This preferred condition is known as
homoscedasticity (same scatter). If the variance changes, we refer to that as heteroscedasticity (different
scatter).
The easiest way to check this assumption is to create a residuals versus fitted value plot. On this type of graph,
heteroscedasticity appears as a cone shape where the spread of the residuals increases in one direction. In the
graph below, the spread of the residuals increases as the fitted value increases.
OLS Assumption 6: No independent variable is a perfect linear function of other explanatory variables
Perfect correlation occurs when two variables have a Pearson’s correlation coefficient of +1 or -1. When one of
the variables changes, the other variable also changes by a completely fixed proportion. The two variables move
in unison.
Perfect correlation suggests that two variables are different forms of the same variable. For example, games
won and games lost have a perfect negative correlation (-1). The temperature in Fahrenheit and Celsius have a
perfect positive correlation (+1).
Ordinary least squares cannot distinguish one variable from the other when they are perfectly correlated. If you
specify a model that contains independent variables with perfect correlation, your statistical software can’t fit
the model, and it will display an error message. You must remove one of the variables from the model to
proceed.
OLS Assumption 7: The error term is normally distributed (optional)
OLS does not require that the error term follows a normal distribution to produce unbiased estimates with the
minimum variance. However, satisfying this assumption allows you to perform statistical hypothesis testing and
generate reliable confidence intervals and prediction intervals.
The easiest way to determine whether the residuals follow a normal distribution is to assess a normal probability
plot. If the residuals follow the straight line on this type of graph, they are normally distributed.
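A brief sketch of these residual checks in R (the fitted model is simulated for illustration; with real data you would assess your own lm object the same way):

# simulated fit for illustration
set.seed(6)
x <- runif(100)
y <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
res <- resid(fit)

# Assumption 4: residuals in observation order (look for patterns)
plot(res, type = "b", ylab = "Residual")
abline(h = 0, lty = 2)

# Assumption 5: residuals versus fitted values (look for a cone shape)
plot(fitted(fit), res, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Assumption 7: normal probability plot of the residuals
qqnorm(res)
qqline(res)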
Q.4 What is simultaneous equation system? How would you identify it?
A Simultaneous Equation Model (SEM) is a model in the form of a set of linear simultaneous equations.
Where introductory regression analysis introduces models with a single equation (e.g. simple linear
regression), SEM models have two or more equations. In a single-equation model, changes in the response
variable (Y) happen because of changes in the explanatory variable (X); in an SEM model, other Y variables are
among the explanatory variables in each SEM equation. The system is jointly determined by the equations in
the system; In other words, the system exhibits some type of simultaneity or “back and forth” causation between
the X and Y variables.
For example, the market for graduate nurses is influenced by:
 Demand behavior,
 Supply behavior,

 Equilibrium levels for pay rate and employment.
Let’s say that the simultaneous equations model for this scenario is made up of the following two equations**:
1. Demand: n_t = β1 + β2 g_t + β3 p_t + ε1t
2. Supply: n_t = β11 + β12 m_t + β13 p_t + ε2t
Where:
 n = number of employed nurses,
 p = earnings rate,
 g = graduate nursing school enrollment,
 m = median income for employed nurses.
**These formulas are just regression equations tailored for this specific model; β is the regression
coefficient and ε is the error term — unexpected factors that can creep into the model.
Using the Model to Solve Problems
Remember those simultaneous equations from algebra? They can be solved together to find values for x and y.
In the same way, the equations in SEM can also be solved. Using the above example, let’s say you wanted to
find out the partial impact of median pay (m) on both the number of employed nurses (n) and the pay rate (p).
You can model this by solving the equations for n and p:
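A sketch of that algebra, assuming β3 ≠ β13: setting the demand and supply expressions for n_t equal,

β1 + β2 g_t + β3 p_t + ε1t = β11 + β12 m_t + β13 p_t + ε2t

and solving for p_t gives the reduced form

p_t = [(β11 − β1) + β12 m_t − β2 g_t + (ε2t − ε1t)] / (β3 − β13).

Substituting this back into either equation expresses n_t in terms of g_t and m_t alone; the coefficients on m_t in these reduced-form equations measure the partial impact of median pay on p and n.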

Complete Models and Structural Equation Models


When the total number of endogenous variables is equal to the number of equations, it is called a complete
SEM. Endogenous variables are similar to (but not exactly the same as) dependent variables; they have values
that are determined by other variables in the system (these “other” variables are called exogenous variables). If
earnings rate and number of employed nurses are the only two endogenous variables in the above example, then
this SEM is complete. A complete SEM is called a structural equations model.
Structural Equation Modeling and Relationship to Simultaneous Equations Models
The terms structural equation modeling and simultaneous equations modeling are similar — and often confused
— but they are not exactly the same thing. Underpinning any statistical modeling technique is a set of
simultaneous equations. Structural equation models use these equations and are complete Simultaneous
Equations Models. “Complete” means that the total number of endogenous variables is equal to the number of
equations in the model. In other words, if the number of endogenous variables in your model doesn’t equal the
number of equations, then it is not a structural equation model.

The primary variables used in Structural Equation Modeling are usually latent variables, which are compared
with observed variables in the model. A latent or “hidden” variable is not directly measurable or observable. For
example, a person’s level of neurosis, conscientiousness or openness are all latent variables. Latent variables are
ever-present in nearly all regression analysis, because all additive error terms are not measurable (and are
therefore latent).
In some specific cases structural equation modeling is used to create a model; one of the most common
modeling techniques, Regression Analysis, is a special case of Structural Equation Modeling.

Three Modeling Techniques

Factor analysis is a type of SEM.


Structural Equation Modeling is a general term for a set of three modeling techniques in statistics. It is
usually used to confirm that a chosen model is valid. In other words, it’s used to test if a model accurately
represents sample data. Unlike the bulk of statistical techniques, structural equation modeling can handle
complex theoretical relationships between multiple sets of variables. This technique also takes measurement
error into account, something which basic statistical techniques do not do.
Structural Equation Models look for relationships between sets of latent variables. First developed in the latter
half of the 20th century by Karl G. Jöreskog, it combines path analysis and confirmatory factor analysis. The
three techniques included in the umbrella term Structural Equation Modeling are:
 Regression analysis only deals with observed variables. In regression, one dependent variable is
predicted using a set of independent variables. For example, a patient’s weight is used to predict their risk
for diabetes. Regression is one of the earliest modeling techniques and was made possible after Karl
Pearson’s development of the correlation coefficient.
 Path analysis, developed by biologist Sewall Wright in the early 1900s, can use observed variables or a
combination of observed and latent variables. Very basically, a path model is regression analysis with
latent variables. For example, you might want to predict how interest rates and GNP influence consumer
spending and consumer trust.
 Factor Analysis looks for relationships between sets of latent variables (“factors”). It can answer
questions like “Does my ten question survey accurately measure one specific factor?”. Spearman (1904)
was the first person to use the term Factor Analysis; he used it to find a two-factor construct for

intelligence. Later, Confirmatory Factor Analysis was developed to test if a set of latent variables
accurately depicted a construct. Latent Class Analysis is very similar; the main difference is that LCA
includes categorical dependent variables and Factor Analysis does not.
Q.5 Write short notes on the following:
a) Endogenous and exogenous variables
An endogenous variable is a variable in a statistical model that's changed or determined by its relationship with
other variables within the model. In other words, an endogenous variable is synonymous with a dependent
variable, meaning it correlates with other factors within the system being studied. Therefore, its values may be
determined by other variables.
Endogenous variables are the opposite of exogenous variables, which are independent variables or outside
forces. Exogenous variables can have an impact on endogenous factors, however.
 Endogenous variables are variables in a statistical model that are changed or determined by their
relationship with other variables.
 Endogenous variables are dependent variables, meaning they correlate with other factors—although it
can be a positive or negative correlation.
 Endogenous variables are important in economic modeling because they show whether a variable
causes a particular effect.
Endogenous variables are important in econometrics and economic modeling because they show whether a
variable causes a particular effect. Economists employ causal modeling to explain outcomes by analyzing
dependent variables based on a variety of factors. For example, in a model studying supply and demand, the
price of a good is an endogenous factor because the price can be changed by the producer (supplier) in
response to consumer demand.
Economists also include independent variables to help determine to which extent a result can be attributed to
an exogenous or endogenous cause. Endogenous variables have values that shift as part of a functional
relationship between other variables within the model. The relationship is also referred to as dependent and is
seen as predictable in nature.
The variables typically correlate in such a way that a movement in one variable produces a move in the
other variable. In other words, the variables should correlate with each other. However, they don't necessarily
need to move in the same direction: a rise in one factor could cause a fall in another. As long as the
changes in the variables are correlated, the relationship is considered endogenous, regardless of whether the
correlation is positive or negative.
Endogenous vs. Exogenous Variables
In contrast to endogenous variables, exogenous variables are considered independent. In other words, one
variable within the formula doesn't dictate or directly correlate to a change in another. Exogenous variables

have no direct or formulaic relationship. For example, personal income and color preference, rainfall and  gas
prices, education obtained and favorite flower would all be considered exogenous factors.
Examples of Endogenous Variables
For example, assume a model is examining the relationship between employee commute times and fuel
consumption. As the commute time rises within the model, fuel consumption also increases. The relationship
makes sense since the longer a person’s commute, the more fuel it takes to reach the destination. For example,
a 30-mile commute requires more fuel than a 20-mile commute. Other relationships that may be endogenous
include:
 Personal income to personal consumption, since a higher income typically leads to increases in
consumer spending.
 Rainfall to plant growth is correlated and studied by economists since the amount of rainfall is
important to commodity crops such as corn and wheat.
 Education obtained to future income levels because there's a correlation between education and higher
salaries or wages.
b) Methodology of econometrics
Econometrics is the quantitative application of statistical and mathematical models using data to develop
theories or test existing hypotheses in economics and to forecast future trends from historical data. It subjects
real-world data to statistical trials and then compares and contrasts the results against the theory or theories
being tested.
Depending on whether you are interested in testing an existing theory or in using existing data to develop a
new hypothesis based on those observations, econometrics can be subdivided into two major categories:
theoretical and applied. Those who routinely engage in this practice are commonly known as econometricians.
Econometrics analyzes data using statistical methods in order to test or develop economic theory. These
methods rely on statistical inferences to quantify and analyze economic theories by leveraging tools such
as frequency distributions, probability, and probability distributions, statistical inference, correlation analysis,
simple and multiple regression analysis, simultaneous equations models, and time series methods.
The Methodology of Econometrics
The first step to econometric methodology is to obtain and analyze a set of data and define a specific
hypothesis that explains the nature and shape of the set. This data may be, for example, the historical prices for
a stock index, observations collected from a survey of consumer finances, or unemployment and inflation rates
in different countries.
If you are interested in the relationship between the annual price change of the S&P 500 and the
unemployment rate, you'd collect both sets of data. Here, you want to test the idea that higher unemployment
leads to lower stock market prices. Stock market price is thus your dependent variable and the unemployment
rate is the independent or explanatory variable.

The most common relationship is linear, meaning that a change in the explanatory variable is associated with a
proportional change in the dependent variable. In this case, a simple regression model is often used to
explore this relationship, which amounts to generating a best-fit line between the two sets of data and then
testing to see how far each data point is, on average, from that line.
Note that you can have several explanatory variables in your analysis—for example, changes to GDP and
inflation in addition to unemployment in explaining stock market prices. When more than one explanatory
variable is used, it is referred to as multiple linear regression, the model that is the most commonly used tool in
econometrics.
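A minimal sketch of such a multiple regression in R (the series are simulated placeholders; a real analysis would use actual historical data):

set.seed(7)

# simulated annual observations (placeholders for real data)
unemployment <- runif(30, 3, 10)
gdp_growth   <- rnorm(30, mean = 2, sd = 1)
inflation    <- rnorm(30, mean = 2.5, sd = 1)
sp500_change <- 8 - 0.9 * unemployment + 1.2 * gdp_growth -
  0.5 * inflation + rnorm(30, sd = 3)

# regress market returns on the three explanatory variables
fit <- lm(sp500_change ~ unemployment + gdp_growth + inflation)
summary(fit)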
Different Regression Models
Several different regression models exist that are optimized depending on the nature of the data being analyzed
and the type of question being asked. The most common example is the ordinary least-squares (OLS)
regression, which can be conducted on several types of cross-sectional or time-series data. If you're interested
in a binary (yes-no) outcome—for instance, how likely you are to be fired from a job based on your
productivity—you can use a logistic regression or a probit model. Today, there are hundreds of models that an
econometrician has at his disposal.
Econometrics is now conducted using statistical analysis software packages designed for these purposes, such
as STATA, SPSS, or R. These software packages can also easily test for statistical significance to provide
support that the empirical results produced by these models are not merely the result of chance. R-squared, t-
tests, p-values, and null-hypothesis testing are all methods used by econometricians to evaluate the validity of
their model results.
