The betas (β) represent the population parameter for each term in the model. Epsilon (ε) represents the random
error that the model doesn’t explain. Unfortunately, we’ll never know these population values because it is
generally impossible to measure the entire population. Instead, we’ll obtain estimates of them using our random
sample.
The notation for an estimated model from a random sample is the following:

ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂kxk + e

The hats over the betas indicate that these are parameter estimates, while e represents the residuals, which are estimates of the random error.
Typically, statisticians consider estimates to be useful when they are unbiased (correct on average) and precise
(minimum variance). To apply these concepts to parameter estimates and the Gauss-Markov theorem, we’ll
need to understand the sampling distribution of the parameter estimates.
Imagine that we repeat the same study many times. We collect random samples of the same size, from the same
population, and fit the same OLS regression model repeatedly. Each random sample produces different
estimates for the parameters in the regression equation. After this process, we can graph the distribution of
estimates for each parameter. Statisticians refer to this type of distribution as a sampling distribution, which is a
type of probability distribution.
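The repeated-sampling experiment described above can be sketched in Python with NumPy (the population parameters 2 and 3, the sample size, and the error variance are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 2.0, 3.0   # "true" population parameters, chosen for illustration

# repeat the same "study" 5000 times: new random sample, same OLS model
slopes = []
for _ in range(5000):
    x = rng.uniform(0, 10, size=50)
    y = beta0 + beta1 * x + rng.normal(0, 2, size=50)  # random error
    slope, intercept = np.polyfit(x, y, 1)             # OLS fit of a line
    slopes.append(slope)

slopes = np.array(slopes)
# the estimates vary from sample to sample, but their distribution
# (the sampling distribution) centers on the true beta1
print(round(slopes.mean(), 1))   # close to 3.0
```

A histogram of `slopes` would reproduce the kind of sampling-distribution curve discussed in the text.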
Keep in mind that each curve represents the sampling distribution of the estimates for a single parameter. The
graphs below tell us which values of parameter estimates are more and less common. They also indicate how far
estimates are likely to fall from the correct value.
Course: Basic Econometrics (807)
Semester: Autumn, 2021
Unbiased Estimates: Sampling Distributions Centered on the True Population Parameter
In the graph below, beta represents the true population value. The curve on the right centers on a value that is
too high. This model tends to produce estimates that are too high, which is a positive bias. It is not correct on
average. However, the curve on the left centers on the actual value of beta. That model produces parameter
estimates that are correct on average. The expected value is the actual value of the population parameter. That's
what we want, and satisfying the OLS assumptions helps us achieve it!
Keep in mind that the curve on the left doesn’t indicate that an individual study necessarily produces an
estimate that is right on target. Instead, it means that OLS produces the correct estimate on average when the
assumptions hold true. Different studies will generate values that are sometimes higher and sometimes lower—
as opposed to having a tendency to be too high or too low.
Minimum Variance: Sampling Distributions are Tight Around the Population Parameter
In the graph below, both curves center on beta. However, one curve is wider than the other because the
variances are different. Broader curves indicate that there is a higher probability that the estimates will be
further away from the correct value. That’s not good. We want our estimates to be close to beta.
Both studies are correct on average. However, we want our estimates to follow the narrower curve because
they're likely to be closer to the correct value than those from the wider curve. The Gauss-Markov theorem states that
when the OLS assumptions hold, no other linear unbiased estimator has a tighter sampling distribution: OLS is the
best linear unbiased estimator (BLUE).
Q.2 Explain confidence intervals for regression. Find out confidence intervals for regression coefficients
β1, β2 and σ2.
A 95% confidence interval for βi has two equivalent definitions:
The interval is the set of values for which a hypothesis test at the 5% significance level cannot be rejected.
The interval has a 95% probability of containing the true value of βi: in 95% of all samples that could
be drawn, the confidence interval will cover the true value of βi.
We also say that the interval has a confidence level of 95%.
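The construction of such an interval for a slope coefficient can be sketched in Python with NumPy (the model with true slope 0.5, the seed, and the sample size are invented for illustration; 1.96 is the large-sample normal critical value):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=n)  # true slope 0.5

# OLS estimates of slope and intercept
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# standard error of the slope from the residual variance
resid = y - (b0 + b1 * x)
s2 = resid @ resid / (n - 2)
se_b1 = np.sqrt(s2 / ((n - 1) * np.var(x, ddof=1)))

# approximate 95% confidence interval: estimate +/- 1.96 standard errors
lower, upper = b1 - 1.96 * se_b1, b1 + 1.96 * se_b1
print(lower < b1 < upper)   # True by construction
```

Across repeated samples, roughly 95% of intervals built this way would cover the true slope.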
To get a better understanding of confidence intervals we conduct another simulation study. For now, assume
that we have the following sample of n = 100 observations on a single variable Y, where

Yi ~ i.i.d. N(5, 25), i = 1, …, 100.
# set seed for reproducibility
set.seed(4)

# draw the sample: 100 observations from the N(5, 25) distribution (sd = 5)
Y <- rnorm(n = 100, mean = 5, sd = 5)

# plot the sample
plot(Y,
     pch = 19,
     col = "steelblue")
The procedure is as follows:
We initialize the vectors lower and upper in which the simulated interval limits are to be saved. We want
to simulate 10,000 intervals, so both vectors are set to have this length.
We use a for() loop to sample 100 observations from the N(5, 25) distribution and
compute μ̂ as well as the boundaries of the confidence interval in every iteration of the loop.
Finally, we join lower and upper in a matrix.
# set seed
set.seed(1)

# initialize vectors of lower and upper interval boundaries
lower <- numeric(10000)
upper <- numeric(10000)

# loop: sample, estimate the mean, compute the 95% interval boundaries
for (i in 1:10000) {
  Y <- rnorm(100, mean = 5, sd = 5)
  lower[i] <- mean(Y) - 1.96 * sd(Y) / sqrt(100)
  upper[i] <- mean(Y) + 1.96 * sd(Y) / sqrt(100)
}

# join lower and upper in a matrix
CIs <- cbind(lower, upper)

# initialize the plot for the first 100 intervals
plot(0,
     xlim = c(3, 7),
     ylim = c(1, 100),
     ylab = "Sample",
     xlab = expression(mu),
     main = "Confidence Intervals")
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable.
Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
MLR is used extensively in econometrics and financial inference.
Formula and Calculation of Multiple Linear Regression
yi = β0 + β1xi1 + β2xi2 + … + βpxip + ϵi
where, for i = 1, …, n observations:
yi = dependent variable
xi1, …, xip = explanatory variables
β0 = y-intercept (constant term)
β1, …, βp = slope coefficients for each explanatory variable
ϵi = the model's error term (estimated by the residuals)
Simple linear regression is a function that allows an analyst or statistician to make predictions about one
variable based on the information that is known about another variable. Linear regression can only be used
when one has two continuous variables—an independent variable and a dependent variable. The independent
variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression
model extends to several explanatory variables.
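A minimal sketch of fitting such a multiple regression in Python with NumPy (the model y = 1 + 2·x1 − 0.5·x2, the seed, and the sample size are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# illustrative population model: y = 1 + 2*x1 - 0.5*x2 + error
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

# design matrix: a column of ones for the intercept plus the regressors
X = np.column_stack([np.ones(n), x1, x2])

# least-squares estimates minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat.shape)   # (3,): intercept and two slope estimates
```

The recovered coefficients land close to the population values 1, 2, and −0.5, as the Gauss-Markov discussion above would lead us to expect.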
The multiple regression model is based on the following assumptions:
There is a linear relationship between the dependent variable and the independent variables
The independent variables are not too highly correlated with each other
yi observations are selected independently and randomly from the population
Residuals should be normally distributed with a mean of 0 and a constant variance σ2
The coefficient of determination (R-squared) is a statistical metric that is used to measure how much of the
variation in outcome can be explained by the variation in the independent variables. R2 always increases as
more predictors are added to the MLR model, even though the predictors may not be related to the outcome
variable.
Thus, R2 by itself can't be used to identify which predictors should be included in a model and which should be
excluded. R2 can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by any of the
independent variables and 1 indicates that the outcome can be predicted without error from the independent
variables.
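The fact that R2 never falls when a predictor is added, even an irrelevant one, can be demonstrated with a short Python sketch (the data-generating model and the `r_squared` helper are invented for illustration):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X plus an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
junk = rng.normal(size=n)          # a predictor unrelated to the outcome
y = 2 * x + rng.normal(size=n)

r2_one = r_squared(x.reshape(-1, 1), y)
r2_two = r_squared(np.column_stack([x, junk]), y)
print(r2_two >= r2_one)   # True: R^2 never falls when a predictor is added
```

This is why adjusted R2 or information criteria, rather than raw R2, are used for model selection.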
When interpreting the results of multiple regression, beta coefficients are valid while holding all other variables
constant ("all else equal"). The output from a multiple regression can be displayed horizontally as an equation,
or vertically in table form.
As an example, an analyst may want to know how the movement of the market affects the price of ExxonMobil
(XOM). In this case, their linear equation will have the value of the S&P 500 index as the independent variable,
or predictor, and the price of XOM as the dependent variable.
In reality, there are multiple factors that predict the outcome of an event. The price movement of ExxonMobil,
for example, depends on more than just the performance of the overall market. Other predictors such as the
price of oil, interest rates, and the price movement of oil futures can affect the price of XOM and stock prices of
other oil companies. To understand a relationship in which more than two variables are present, multiple linear
regression is used.
Multiple linear regression (MLR) is used to determine a mathematical relationship among a number of random
variables. In other terms, MLR examines how multiple independent variables are related to one dependent
variable. Once each of the independent factors has been determined to predict the dependent variable, the
information on the multiple variables can be used to create an accurate prediction on the level of effect they
have on the outcome variable. The model creates a relationship in the form of a straight line (linear) that best
approximates all the individual data points.
Referring to the MLR equation above, in our example:
yi = dependent variable—the price of XOM
xi1 = interest rates
xi2 = oil price
xi3 = value of S&P 500 index
xi4 = price of oil futures
β0 = y-intercept (constant term)
β1 = regression coefficient that measures the change in the dependent variable for a unit change in xi1—the
change in XOM price when interest rates change
β2 = coefficient that measures the change in the dependent variable for a unit change in xi2—the
change in XOM price when oil prices change
The least-squares estimates β0, β1, β2, …, βp are usually computed by statistical software. Many variables can
be included in the regression model, with each independent variable distinguished by a number: 1, 2, 3,
4, …, p. The multiple regression model allows an analyst to predict an outcome based on information provided on
multiple explanatory variables.
Still, the model is not always perfectly accurate as each data point can differ slightly from the outcome
predicted by the model. The residual value, E, which is the difference between the actual outcome and the
predicted outcome, is included in the model to account for such slight variations.
Regression analysis is like other inferential methodologies. Our goal is to draw a random sample from
a population and use it to estimate the properties of that population.
In regression analysis, the coefficients in the regression equation are estimates of the actual
population parameters. We want these coefficient estimates to be the best possible estimates!
Suppose you request an estimate—say for the cost of a service that you are considering. How would you define
a reasonable estimate?
1. The estimates should tend to be right on target. They should not be systematically too high or too low. In
other words, they should be unbiased or correct on average.
2. Recognizing that estimates are almost never exactly correct, you want to minimize the discrepancy
between the estimated value and actual value. Large differences are bad!
These two properties are exactly what we need for our coefficient estimates!
When your linear regression model satisfies the OLS assumptions, the procedure generates unbiased coefficient
estimates that tend to be relatively close to the true population values (minimum variance). In fact, the Gauss-
Markov theorem states that OLS produces better estimates than all other linear unbiased estimation methods
when the assumptions hold true.
The Seven Classical OLS Assumptions
Like many statistical analyses, ordinary least squares (OLS) regression has underlying assumptions. When these
classical assumptions for linear regression are true, ordinary least squares produces the best estimates. However,
if some of these assumptions are not true, you might need to employ remedial measures or use other estimation
methods to improve the results.
Many of these assumptions describe properties of the error term. Unfortunately, the error term is a population
value that we’ll never know. Instead, we’ll use the next best thing that is available—the residuals. Residuals are
the sample estimate of the error for each observation.
Residuals = Observed value – the fitted value
When it comes to checking OLS assumptions, assessing the residuals is crucial!
There are seven classical OLS assumptions for linear regression. The first six are mandatory to produce the best
estimates. While the quality of the estimates does not depend on the seventh assumption, analysts often evaluate
it for other important reasons that I’ll cover.
OLS Assumption 1: The regression model is linear in the coefficients and the error term
This assumption addresses the functional form of the model. In statistics, a regression model is linear when all
terms in the model are either the constant or a parameter multiplied by an independent variable. You build the
model equation only by adding the terms together. These rules constrain the model to one form:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

In the equation, the betas (βs) are the parameters that OLS estimates. Epsilon (ε) is the random error.
In fact, the defining characteristic of linear regression is this functional form of the parameters rather than the
ability to model curvature. Linear models can model curvature by including nonlinear variables, such as
polynomial terms, and by transforming variables (for example, taking the log of an exponential relationship).
To satisfy this assumption, the correctly specified model must fit the linear pattern.
OLS Assumption 2: The error term has a population mean of zero
The error term accounts for the variation in the dependent variable that the independent variables do not
explain. Random chance should determine the values of the error term. For your model to be unbiased, the
average value of the error term must equal zero.
Suppose the average error is +7. This non-zero average error indicates that our model systematically
underpredicts the observed values. Statisticians refer to systematic error like this as bias, and it signifies that our
model is inadequate because it is not correct on average.
Stated another way, we want the expected value of the error to equal zero. If the expected value is +7 rather
than zero, part of the error term is predictable, and we should add that information to the regression model itself.
We want only random error left for the error term.
You don’t need to worry about this assumption when you include the constant in your regression model because
it forces the mean of the residuals to equal zero. For more information about this assumption, read my post
about the regression constant.
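That mechanical property can be checked directly in a short Python sketch (the data-generating model is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=100)
y = 3 + 2 * x + rng.normal(size=100)

# OLS fit that includes the constant (intercept) term
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# including the constant forces the residuals to average out to zero
print(abs(resid.mean()) < 1e-10)   # True
```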
OLS Assumption 3: All independent variables are uncorrelated with the error term
If an independent variable is correlated with the error term, we can use the independent variable to predict the
error term, which violates the notion that the error term represents unpredictable random error. We need to find
a way to incorporate that information into the regression model itself.
This assumption is also referred to as exogeneity. When this type of correlation exists, there is endogeneity.
Violations of this assumption can occur because there is simultaneity between the independent and dependent
variables, omitted variable bias, or measurement error in the independent variables.
Violating this assumption biases the coefficient estimate. To understand why this bias occurs, keep in mind that
the error term always explains some of the variability in the dependent variable. However, when an independent
variable correlates with the error term, OLS incorrectly attributes some of the variance that the error term
actually explains to the independent variable instead. For more information about violating this assumption,
read my post about confounding variables and omitted variable bias.
OLS Assumption 4: Observations of the error term are uncorrelated with each other
One observation of the error term should not predict the next observation. For instance, if the error for one
observation is positive and that systematically increases the probability that the following error is positive, that
is a positive correlation. If the subsequent error is more likely to have the opposite sign, that is a negative
correlation. This problem is known both as serial correlation and autocorrelation.
Assess this assumption by graphing the residuals in the order that the data were collected. You want to see a
randomness in the plot. In the graph for a sales model, there appears to be a cyclical pattern with a positive
correlation.
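Alongside the plot, the assumption can be checked numerically, for example with the lag-1 correlation of the residuals or the Durbin-Watson statistic. A Python sketch (using artificial uncorrelated residuals for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
# residuals in the order the data were collected; here, pure noise
resid = rng.normal(size=500)

# lag-1 correlation: each residual against the following one
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# Durbin-Watson statistic; values near 2 indicate no serial correlation
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(round(dw))   # 2 for uncorrelated residuals
```

For residuals with positive serial correlation, `lag1` would be clearly positive and `dw` would fall well below 2.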
OLS Assumption 5: The error term has a constant variance (no heteroscedasticity)
The variance of the errors should be consistent for all observations. In other words, the variance does not
change for each observation or for a range of observations. This preferred condition is known as
homoscedasticity (same scatter). If the variance changes, we refer to that as heteroscedasticity (different
scatter).
The easiest way to check this assumption is to create a residuals versus fitted value plot. On this type of graph,
heteroscedasticity appears as a cone shape where the spread of the residuals increases in one direction. In the
graph below, the spread of the residuals increases as the fitted value increases.
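The cone shape can also be detected numerically by comparing residual spread across the range of the regressor. A Python sketch, with heteroscedasticity built in deliberately (all parameter choices are for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, size=1000)
# the error's standard deviation grows with x: heteroscedastic by design
y = 2 * x + rng.normal(0, x)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# residual spread is visibly wider where the fitted values are larger
low_spread = resid[x < 4].std()
high_spread = resid[x > 7].std()
print(high_spread > low_spread)   # True: the classic cone shape
```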
OLS Assumption 6: No independent variable is a perfect linear function of other explanatory variables
Perfect correlation occurs when two variables have a Pearson’s correlation coefficient of +1 or -1. When one of
the variables changes, the other variable also changes by a completely fixed proportion. The two variables move
in unison.
Perfect correlation suggests that two variables are different forms of the same variable. For example, games
won and games lost have a perfect negative correlation (-1). The temperature in Fahrenheit and Celsius have a
perfect positive correlation (+1).
Ordinary least squares cannot distinguish one variable from the other when they are perfectly correlated. If you
specify a model that contains independent variables with perfect correlation, your statistical software can’t fit
the model, and it will display an error message. You must remove one of the variables from the model to
proceed.
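The temperature example can be used to show why the software fails: with perfectly correlated columns, the design matrix is rank-deficient, so the least-squares problem has no unique solution. A Python sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
celsius = rng.uniform(-10, 35, size=50)
fahrenheit = celsius * 9 / 5 + 32   # a perfect linear function of celsius

# design matrix with an intercept and both temperature scales
X = np.column_stack([np.ones(50), celsius, fahrenheit])

# the matrix is rank-deficient: OLS cannot separate the two variables
print(np.linalg.matrix_rank(X))   # 2, not 3
```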
OLS Assumption 7: The error term is normally distributed (optional)
OLS does not require that the error term follows a normal distribution to produce unbiased estimates with the
minimum variance. However, satisfying this assumption allows you to perform statistical hypothesis testing and
generate reliable confidence intervals and prediction intervals.
The easiest way to determine whether the residuals follow a normal distribution is to assess a normal probability
plot. If the residuals follow the straight line on this type of graph, they are normally distributed.
Q.4 What is simultaneous equation system? How would you identify it?
A Simultaneous Equation Model (SEM) is a model in the form of a set of linear simultaneous equations.
Where introductory regression analysis introduces models with a single equation (e.g. simple linear
regression), SEM models have two or more equations. In a single-equation model, changes in the response
variable (Y) happen because of changes in the explanatory variable (X); in an SEM model, other Y variables are
among the explanatory variables in each SEM equation. The variables are jointly determined by the equations in
the system; in other words, the system exhibits some type of simultaneity or "back and forth" causation between
the X and Y variables. To identify such a system, each structural equation must satisfy the order condition: the
number of exogenous variables excluded from the equation must be at least as large as the number of
endogenous explanatory variables included in it.
A simple example is the labor market for nurses, modeled with three components:
Demand behavior,
Supply behavior,
Equilibrium levels for pay rate and employment.
Let’s say that the simultaneous equations model for this scenario is made up of the following two equations**:
1. Demand: nt = β1 + β2gt + β3pt + ε1t
2. Supply: nt = β11 + β12mt + β13pt + ε2t
Where:
n = number of employed nurses,
p = earnings rate,
g = graduate nursing school enrollment,
m = median income for employed nurses.
**These formulas are just regression equations tailored for this specific model; β is the regression
coefficient and ε is the error term — unexpected factors that can creep into the model.
Using the Model to Solve Problems
Remember those simultaneous equations from algebra? They can be solved together to find values for x and y.
In the same way, the equations in SEM can also be solved. Using the above example, let’s say you wanted to
find out the partial impact of median pay (m) on both the number of employed nurses (n) and the pay rate (p).
You can model this by solving the equations for n and p:
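The solved equations are not reproduced here, but the reduced form follows directly from the two structural equations above. Since both equal nt, setting demand equal to supply and solving for pt (assuming β3 ≠ β13) gives:

pt = [(β11 − β1) + β12mt − β2gt + (ε2t − ε1t)] / (β3 − β13)

Substituting pt back into either structural equation then expresses nt in terms of gt, mt, and the error terms alone. These reduced-form equations show how a change in median income m feeds through to both the number of employed nurses n and the pay rate p.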
The primary variables used in Structural Equation Modeling (a related family of techniques that shares the SEM
acronym) are usually latent variables, which are compared with observed variables in the model. A latent or
"hidden" variable is not directly measurable or observable. For example, a person's level of neurosis,
conscientiousness or openness are all latent variables. Latent variables are ever-present in nearly all regression
analysis, because all additive error terms are not measurable (and are therefore latent).
In some specific cases structural equation modeling is used to create a model; one of the most common
modeling techniques, regression analysis, is a special case of Structural Equation Modeling.
Three Modeling Techniques
Factor Analysis, for example, was originally developed to study latent constructs such as
intelligence. Later, Confirmatory Factor Analysis was developed to test if a set of latent variables
accurately depicted a construct. Latent Class Analysis is very similar; the main difference is that LCA
includes categorical dependent variables and Factor Analysis does not.
Q.5 Write short notes on the following:
a) Endogenous and Exogenous Variables
An endogenous variable is a variable in a statistical model that's changed or determined by its relationship with
other variables within the model. In other words, an endogenous variable is synonymous with a dependent
variable, meaning it correlates with other factors within the system being studied. Therefore, its values may be
determined by other variables.
Endogenous variables are the opposite of exogenous variables, which are independent variables or outside
forces. Exogenous variables can have an impact on endogenous factors, however.
Endogenous variables are variables in a statistical model that are changed or determined by their
relationship with other variables.
Endogenous variables are dependent variables, meaning they correlate with other factors—although it
can be a positive or negative correlation.
Endogenous variables are important in economic modeling because they show whether a variable
causes a particular effect.
Endogenous variables are important in econometrics and economic modeling because they show whether a
variable causes a particular effect. Economists employ causal modeling to explain outcomes by analyzing
dependent variables based on a variety of factors. For example, in a model studying supply and demand, the
price of a good is an endogenous factor because the price can be changed by the producer (supplier) in
response to consumer demand.
Economists also include independent variables to help determine to what extent a result can be attributed to
an exogenous or endogenous cause. Endogenous variables have values that shift as part of a functional
relationship with other variables within the model. The relationship is also referred to as dependent, and is
seen as predictable in nature.
The variables typically correlate in such a way that a movement in one variable should result in a movement in
the other. However, they don't necessarily need to move in the same direction: a rise in one factor could cause a
fall in another. As long as the changes in the variables are correlated, the relationship is considered
endogenous—regardless of whether the correlation is positive or negative.
Endogenous vs. Exogenous Variables
In contrast to endogenous variables, exogenous variables are considered independent. In other words, one
variable within the formula doesn't dictate or directly correlate to a change in another. Exogenous variables
have no direct or formulaic relationship. For example, personal income and color preference, rainfall and gas
prices, education obtained and favorite flower would all be considered exogenous factors.
Examples of Endogenous Variables
For example, assume a model is examining the relationship between employee commute times and fuel
consumption. As the commute time rises within the model, fuel consumption also increases. The relationship
makes sense since the longer a person’s commute, the more fuel it takes to reach the destination. For example,
a 30-mile commute requires more fuel than a 20-mile commute. Other relationships that may be endogenous
include:
Personal income to personal consumption, since a higher income typically leads to increases in
consumer spending.
Rainfall to plant growth is correlated and studied by economists since the amount of rainfall is
important to commodity crops such as corn and wheat.
Education obtained to future income levels because there's a correlation between education and higher
salaries or wages.
b) Methodology of Econometrics
Econometrics is the quantitative application of statistical and mathematical models using data to develop
theories or test existing hypotheses in economics and to forecast future trends from historical data. It subjects
real-world data to statistical trials and then compares and contrasts the results against the theory or theories
being tested.
Depending on whether you are interested in testing an existing theory or in using existing data to develop a
new hypothesis based on those observations, econometrics can be subdivided into two major categories:
theoretical and applied. Those who routinely engage in this practice are commonly known as econometricians.
Econometrics analyzes data using statistical methods in order to test or develop economic theory. These
methods rely on statistical inferences to quantify and analyze economic theories by leveraging tools such
as frequency distributions, probability, and probability distributions, statistical inference, correlation analysis,
simple and multiple regression analysis, simultaneous equations models, and time series methods.
The Methodology of Econometrics
The first step to econometric methodology is to obtain and analyze a set of data and define a specific
hypothesis that explains the nature and shape of the set. This data may be, for example, the historical prices for
a stock index, observations collected from a survey of consumer finances, or unemployment and inflation rates
in different countries.
If you are interested in the relationship between the annual price change of the S&P 500 and the
unemployment rate, you'd collect both sets of data. Here, you want to test the idea that higher unemployment
leads to lower stock market prices. Stock market price is thus your dependent variable and the unemployment
rate is the independent or explanatory variable.
The most common relationship assumed is linear, meaning that a change in the explanatory variable is
associated with a proportional change in the dependent variable. In that case a simple regression model is often
used to explore the relationship, which amounts to generating a best-fit line between the two sets of data and
then testing to see how far each data point is, on average, from that line.
Note that you can have several explanatory variables in your analysis—for example, changes to GDP and
inflation in addition to unemployment in explaining stock market prices. When more than one explanatory
variable is used, it is referred to as multiple linear regression, the model that is the most commonly used tool in
econometrics.
Different Regression Models
Several different regression models exist that are optimized depending on the nature of the data being analyzed
and the type of question being asked. The most common example is the ordinary least-squares (OLS)
regression, which can be conducted on several types of cross-sectional or time-series data. If you're interested
in a binary (yes-no) outcome—for instance, how likely you are to be fired from a job based on your
productivity—you can use a logistic regression or a probit model. Today, there are hundreds of models that an
econometrician has at their disposal.
Econometrics is now conducted using statistical analysis software packages designed for these purposes, such
as Stata, SPSS, or R. These software packages can also easily test for statistical significance to provide
support that the empirical results produced by these models are not merely the result of chance. R-squared, t-
tests, p-values, and null-hypothesis testing are all methods used by econometricians to evaluate the validity of
their model results.