
Chapter 7:

More on model specification and data issues


7.1. Testing functional form
7.2. Using proxy variables for unobserved explanatory variables
7.3. Measurement errors
7.4. Missing data and non-random samples
7.5. Outliers
7.1. Testing functional form
❑ Functional form misspecification occurs when a multiple regression model does not correctly account for the relationship between 𝑌 and 𝑋𝑗.
How to test?
1. 𝑋𝑗 may need to appear as squares or higher-order terms:
gen edu2 = edu^2
reg wage edu edu2 age gender
utest edu edu2

2. Interactions among 𝑋𝑗:
reg lnwage i.married##i.gender age
reg lnwage c.edu##i.gender age

3. Log or level: log(𝑌) vs. 𝑌
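A general specification check worth noting here (a sketch, not from the original slides): Ramsey's RESET test, run in Stata with estat ovtest after reg, tests whether powers of the fitted values help explain 𝑌, which signals functional form misspecification. The wage example below is assumed:
* sketch: Ramsey RESET test for omitted higher-order terms
reg wage edu age gender
estat ovtest
* H0: model has no omitted higher-order terms; p<0.05 suggests misspecification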


7.2. Using proxy variables for unobserved explanatory variables
• 𝑊𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑒𝑑𝑢 + 𝛽2 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + 𝑢
Data on ability are unavailable, so ability is proxied by the intelligence quotient (IQ).
• 𝑊𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑒𝑑𝑢 + 𝛽2 𝐼𝑄 + 𝑢
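A minimal sketch (variable names assumed) comparing the estimates with and without the proxy:
* omitting ability: the edu coefficient also picks up the ability effect
reg wage edu
* adding IQ as a proxy for ability should reduce the omitted-variable bias in edu
reg wage edu IQ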
❑ Using lagged dependent variables as proxy variables

• 𝐶𝑟𝑖𝑚𝑒𝑐𝑢𝑟𝑟𝑒𝑛𝑡 = 𝛽0 + 𝛽1 𝑢𝑛𝑒𝑚𝑝𝑙𝑜𝑦𝑚𝑒𝑛𝑡 + 𝛽2 𝑒𝑥𝑝𝑒𝑛𝑑 + 𝛽3 𝐶𝑟𝑖𝑚𝑒𝑝𝑎𝑠𝑡 + 𝑢

• 𝐶ℎ𝑜𝑖𝑐𝑒𝑐𝑢𝑟𝑟𝑒𝑛𝑡 = 𝛽0 + 𝛽1 𝑎𝑠𝑠𝑒𝑡𝑠 + 𝛽2 𝑙𝑎𝑛𝑑𝑙𝑜𝑠𝑠 + 𝛽3 𝑐ℎ𝑜𝑖𝑐𝑒𝑝𝑎𝑠𝑡 + 𝑢

• This is a simple technique to account for past factors affecting the dependent variable.
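A hedged sketch of the lagged-dependent-variable device, assuming a dataset with these hypothetical variable names:
* crime_past proxies for unobserved past factors that also affect current crime
reg crime unemployment expend crime_past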
7.3. Measurement errors
 Measurement error in the dependent variable (𝑌)
𝑌 = 𝑌∗ + 𝑒0
where 𝑌 is the mismeasured value, 𝑌∗ the true value, and 𝑒0 the measurement error.

Population regression: 𝑌∗ = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘 + 𝑢
Estimated regression: 𝑌 = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘 + (𝑢 + 𝑒0)

❑ It is assumed that 𝐸(𝑒0) = 0. If 𝐸(𝑒0) ≠ 0, we only get a biased estimator of 𝛽0, which is rarely a main concern.
❑ The composite error (𝑢 + 𝑒0) has a higher variance than 𝑢 alone, increasing Var(𝛽̂𝑗) and producing less precise estimates.
7.3. Measurement errors
Measurement error in the independent variable (𝑋𝑗)
𝑋 = 𝑋∗ + 𝑒1
where 𝑋 is the mismeasured value, 𝑋∗ the true value, and 𝑒1 the measurement error.

 OLS is biased and inconsistent because the mismeasured variable 𝑋𝑗 is endogenous.
 The magnitude of the effect of the mismeasured variable will be biased toward zero (called attenuation bias).
𝑝𝑙𝑖𝑚 𝛽̂𝑗 = 𝛽𝑗 (𝜎²𝑥∗ / (𝜎²𝑥∗ + 𝜎²𝑒1))
(plim: the value at which an estimator converges as the sample size increases to infinity)

 Also, the effects of the other independent variables will be biased.
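A small simulation sketch (all numbers assumed) makes the attenuation formula concrete: with Var(𝑥∗) = Var(𝑒1) = 1, the plim of the slope is halved.
* simulate y = 1 + 2*xstar + u, but observe x = xstar + e
clear
set obs 10000
set seed 123
gen xstar = rnormal(0, 1)
gen y = 1 + 2*xstar + rnormal(0, 1)
gen x = xstar + rnormal(0, 1)
reg y xstar    // slope close to the true value of 2
reg y x        // slope biased toward zero: plim = 2*(1/(1+1)) = 1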


7.4. Missing data and non-random samples
If any observation is missing data on one of the variables in the model, it
can’t be used.
EXOGENOUS SAMPLE SELECTION
❑ The sample is nonrandom in that certain age or education groups were over- or undersampled.

E.g., suppose we have a sample that includes only those older than 40; this causes no problem for the regression because the regression function E(saving|edu, age, hhsize) is the same for any subgroup of the population defined by edu, age, or hhsize.

𝑠𝑎𝑣𝑖𝑛𝑔 = 𝛽0 + 𝛽1 𝑒𝑑𝑢 + 𝛽2 𝑎𝑔𝑒 + 𝛽3 ℎℎ𝑠𝑖𝑧𝑒 + 𝑢

❑ There is no problem with sample selection based on the independent variables (exogenous sample selection).
❑ More generally, sample selection causes no problem as long as it is unrelated to the error term of the regression.
7.4. Missing data and non-random samples
ENDOGENOUS SAMPLE SELECTION
❑ The sample is nonrandom in that many respondents with particularly high or low incomes decline to participate in the survey.

❑ This causes bias and inconsistency, since these respondents may differ systematically from those who participate in the survey.

𝐼𝑛𝑐𝑜𝑚𝑒 = 𝛽0 + 𝛽1 𝑒𝑑𝑢 + 𝛽2 𝑎𝑔𝑒 + 𝛽3 ℎℎ𝑠𝑖𝑧𝑒 + 𝑢

❑ For instance, suppose only those with income above $1,000 are included in the sample. Then the population regression differs from the selected-sample one: 𝐸(𝑖𝑛𝑐𝑜𝑚𝑒|𝑒𝑑𝑢, 𝑎𝑔𝑒, ℎℎ𝑠𝑖𝑧𝑒) ≠ 𝐸(𝑖𝑛𝑐𝑜𝑚𝑒|𝑒𝑑𝑢, 𝑎𝑔𝑒, ℎℎ𝑠𝑖𝑧𝑒, 𝑖𝑛𝑐𝑜𝑚𝑒 > $1,000).

❑ Sample selection causes a problem if the sample is chosen on the basis of the dependent variable (endogenous sample selection).
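A simulation sketch (all numbers assumed) showing how selecting on the dependent variable biases OLS:
* true model: income = 200 + 100*edu + u
clear
set obs 10000
set seed 456
gen edu = rnormal(12, 3)
gen income = 200 + 100*edu + rnormal(0, 300)
reg income edu                     // full sample: slope near 100
reg income edu if income > 1000    // endogenous selection: slope biased toward zero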
7.4. Missing data and non-random samples
Stratified random sampling divides the population into homogeneous
groups, or strata. Then random samples are taken from each stratum.

Non-random samples and stratified sampling

❑ In a stratified sample, if males are oversampled and we are interested in the gender gap in wages, does this cause bias?

❑ In a stratified sample, if students with low English scores are undersampled and we are interested in the English score equation, does this cause bias?
7.5. Outliers
❑ OLS may be sensitive to outliers because the method is based on squaring residuals.
❑ Sometimes outliers are the result of errors in data entry. These are easy to fix.
❑ If outliers are truly different from the other observations, it is not easy to justify discarding them.
How to deal with outliers
❑ It is not easy to justify dropping outliers, so researchers may report estimates with and without them.
How to trim outliers in Stata:

ssc install winsor2
winsor2 wage, replace cuts(1 99) trim
❑ Log transformation can help de-emphasize outliers.
❑ Quantile regression is robust to outliers in the dependent variable.
Regression with and without outliers
Use “chapter5.dta”
Drop extreme values below the 1st and above the 99th percentiles:
reg wage edu exper gender married urban
gen out=wage
winsor2 out, replace cuts(1 99) trim
reg out edu exper gender married urban

(Output comparison: with outliers vs. without the 182 trimmed outliers.)
Log transformation can help reduce the influence of outliers: it narrows the range of the data by bringing extreme values closer to the mean.
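A minimal sketch (assuming the same chapter5.dta variables) of the log-wage regression:
gen lnwage = ln(wage)
reg lnwage edu exper gender married urban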
Outliers and quantile regression (QR)
❑ Quantile regression is a form of robust (outlier-resistant) regression:
𝑄𝑞(𝑊𝑎𝑔𝑒|𝐸𝑑𝑢, 𝐸𝑥𝑝𝑒𝑟, 𝐺𝑒𝑛𝑑𝑒𝑟) = 𝛽0 + 𝛽1𝐸𝑑𝑢 + 𝛽2𝐸𝑥𝑝𝑒𝑟 + 𝛽3𝐺𝑒𝑛𝑑𝑒𝑟 + 𝜀

❑ The model specifies the qth quantile (0 < q < 1) of the conditional distribution of the dependent variable.

❑ Quantile regression (QR) reveals the heterogeneous relationship between X and Y, while OLS only shows the homogeneous (average) relationship.
❑ In other words, QR estimates the slope coefficients at different points (different percentiles) of the outcome distribution.
❑ For instance, QR can quantify the effect of education on wages at the 15th percentile, the 50th percentile (median), or the 75th percentile of the wage distribution.
Quantile regression
Conditional quantile regression (CQR) with Stata’s built-in sqreg command
Use “chapter5.dta”
sqreg lnwage edu exper gender married urban mismatch, quantile(0.1 0.25 0.5 0.75 0.9) reps(100)
(Output: the estimated effect of edu varies across quantiles.)
Quantile regression
(Plot: the QR coefficients are higher at higher quantiles; the constant OLS estimate is the mean regression.)
Quantile regression
ssc install qregplot, replace
ssc install ftools
sqreg lnwage edu exper gender married urban mismatch, quantile(0.1 0.25 0.5 0.75 0.9) reps(100)
qregplot edu exper mismatch gender, /// variables to be plotted
    estore(e_qreg) /// store the estimates in memory
    q(5(5)95) // quantiles to plot
Questions & Answers for Chapter 7
OLS assumption and practical guides
1. Linear in parameters
We may include 𝑋, 𝑋² or 𝑋³ in the model.
log(𝑌) and 𝑌 specifications may both be tried and the results compared.
Use “curvefit” to check the relationship:
curvefit y x, function()
curvefit wage edu, func(4)
To test a quadratic relationship:
gen x2 = x^2
reg y x x2
utest x x2
OLS assumption and practical guides
2. Random sampling
- Endogenous sample selection: what are the consequences?
- Exogenous sample selection: what are the consequences?

See more in 7.4


OLS assumption and practical guides
3. Perfect multicollinearity
❑ Perfect multicollinearity:
Easy to fix: drop one of the explanatory variables
❑ Multicollinearity:
Easy to detect multicollinearity with the Stata command:
reg y x1 x2 x3
vif
As an (arbitrary) rule of thumb, 𝑉𝐼𝐹 should not be larger than 10.
Be careful: dropping collinear variables may cause omitted variable bias
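For reference, the VIF reported by vif for regressor 𝑗 is computed from the R-squared of regressing 𝑥𝑗 on the other regressors:
𝑉𝐼𝐹𝑗 = 1/(1 − 𝑅²𝑗)
so the rule-of-thumb cutoff 𝑉𝐼𝐹𝑗 > 10 corresponds to 𝑅²𝑗 > 0.9.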
OLS assumption and practical guides
4. Zero conditional mean assumption
❑ 𝐸(𝑢|𝑥) = 0: the expected value of the error is zero, given any value of the explanatory variables.
A strong, key assumption: it yields an unbiased and consistent estimator.
❑ 𝐶𝑜𝑣(𝑢, 𝑥) = 0: the zero-correlation assumption.
No linear relationship between the unobservable factors and the explanatory variables.
A weaker assumption: it yields a possibly biased but consistent estimator.
❑ Adding more important (relevant) variables makes it more likely that this assumption is satisfied.
4. Zero conditional mean assumption (cont.)
This assumption fails if:
1. The functional form is misspecified (see more in 5.2; 7.1)
A variable is included in an incorrect way.
𝑀𝑎𝑡ℎ𝑠𝑐𝑜𝑟𝑒𝑠 = 𝛽0 + 𝛽1𝑖𝑛𝑐𝑜𝑚𝑒 + 𝛽2𝒊𝒏𝒄𝒐𝒎𝒆² + 𝑢: the true relationship is quadratic.
𝑀𝑎𝑡ℎ𝑠𝑐𝑜𝑟𝑒𝑠 = 𝛽0 + 𝛽1𝑖𝑛𝑐𝑜𝑚𝑒 + 𝑢: the estimated model wrongly omits the quadratic term.
𝑳𝒐𝒈 𝒐𝒇 𝒘𝒂𝒈𝒆 = 𝛽0 + 𝛽1𝑒𝑑𝑢 + 𝛽2𝑒𝑥𝑝𝑒𝑟 + 𝑢: the true model is the log of wage.
𝑊𝑎𝑔𝑒 = 𝛽0 + 𝛽1𝑒𝑑𝑢 + 𝛽2𝑒𝑥𝑝𝑒𝑟 + 𝑢: the estimated model wrongly uses the level of wage.
2. One or more relevant variables (variables with a nonzero partial effect on the dependent variable) that correlate with an explanatory variable are omitted (see more in 3.4)
𝐿𝑜𝑔 𝑜𝑓 𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1𝑒𝑑𝑢 + 𝑢: this model omits “ability”.
𝐿𝑜𝑔 𝑜𝑓 𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1𝑒𝑑𝑢 + 𝛽2𝐚𝐛𝐢𝐥𝐢𝐭𝐲 + 𝑢
3. There are measurement errors (see more in 7.3)
4. An explanatory variable is jointly determined with the dependent variable.
4. Zero conditional mean assumption

How to address it?
❑ Specify the functional form correctly
❑ Collect data accurately
❑ Add more important (relevant) variables
5. Homoskedasticity: the error term has constant variance
Under assumptions 1–4: unbiased
Assumption 5: 𝑉𝑎𝑟(𝑢𝑖|𝑥𝑖1, 𝑥𝑖2, … , 𝑥𝑖𝑘) = 𝜎²: the constant-variance assumption
Under assumptions 1–5: BLUE (Best Linear Unbiased Estimator)

❑ How to check for non-constant variance, or heteroskedasticity (see more in 5.5):
reg wage edu exper female married
hettest
estat imtest, white
If p-value < 0.05: reject the null hypothesis and conclude that heteroskedasticity exists.

❑ What are the consequences of non-constant variance? (see more in 5.5)
1. The formulas for 𝑉𝑎𝑟(𝛽̂𝑗) are invalid.
2. F-tests and t-tests are invalid.
3. OLS is no longer BLUE, because it no longer has the smallest variance.
But heteroskedasticity does not affect unbiasedness, consistency, or the R-squared.
❑ How to deal with non-constant variance (see more in 5.5):
- Log transformation.
- Use the “robust” option provided by Stata:
reg wage edu exper female married, vce(robust)
6. Normality of the errors

❑ The population error 𝑢 is independent of the explanatory variables and is normally distributed with zero mean and constant variance:
𝑢𝑖 | 𝑥𝑖1, 𝑥𝑖2, … , 𝑥𝑖𝑘 ~ i.i.d. 𝑁(0, 𝜎²), 𝑖 = 1, … , 𝑛.
❑ This assumption can also be stated briefly as: conditional on X, Y is normally distributed with a mean linear in X and a constant variance.
❑ NOTE: for the goal of statistical inference, the normality assumption can be replaced by a large sample size.
❑ For small sample sizes, we can check this assumption by examining the distribution of the residuals:
reg y x1 x2
predict r, resid
sktest r
H0: the residuals are normally distributed vs. H1: the residuals are not normally distributed.
P-value = 0.2746 > 0.05: fail to reject H0. Conclusion: the normality assumption is satisfied.
Other issues with data structure

❑ It is better to account for regional fixed effects.
Use “chapter6_3.dta”
logit poverty hhsize edu gender age i.tinh
(i.tinh creates 7 dummy variables that account for regional fixed effects.)
❑ If data are collected from cluster sampling, always use the cluster option.
The standard errors then take into account that households within communes are not independent.
logit poverty hhsize edu gender age i.tinh, cluster(xa)
❑ If data are collected from disproportionate stratified sampling, always adjust for sampling weights.
logit poverty hhsize edu gender age i.tinh [pw=wt], cluster(xa)
Interpretation
Dummy variable: 𝑚𝑎𝑙𝑒 = 1; 𝑓𝑒𝑚𝑎𝑙𝑒 = 0

𝑊𝑎𝑔𝑒 = b0 + b1 𝑚𝑎𝑙𝑒 + b2 𝑒𝑑𝑢 + 𝑢

b0 + b1 = the intercept for men
b0 = the intercept for women
b1 shows the difference in the intercepts between men and women:
b1 is the wage gap between men and women (holding other things constant).
(Output: b1 is statistically significant at 1% => the intercept differs between the two groups.)
Interpretation: interactions between two dummy variables

Married: 1 = married; 0 = single. Gender: 1 = men; 0 = women.

Wage = b0 + d1 married + d2 gender + d3 married*gender + u

❑ If we set married = 0 and gender = 0, the base group is single women.
The intercept for the base group is b0.
❑ If we set married = 1 and gender = 1, we have the married-men group.
The intercept for the married-men group = b0 + d1 + d2 + d3.
The difference in intercepts between married men and single women = d1 + d2 + d3.
❑ If we set married = 0 and gender = 1, we have the single-men group.
The intercept for the single-men group = b0 + 0 + d2 + 0 = b0 + d2.
❑ If we set married = 1 and gender = 0, we have the married-women group.
The intercept for the married-women group = b0 + d1 + 0 + 0 = b0 + d1.
Interpretation: interactions between two dummy variables

Log_wage = 8.6 + 0.11 married + 0.14 men + 0.03 married*men
1. What is the intercept for single women?
2. What is the difference in intercepts between married
women and single women?
3. What is the difference in intercepts between married
men and single women?
4. What is the difference in intercepts between married
men and single men?
5. What is the difference in intercepts between married
men and married women?

Note: the difference in intercepts = the difference in mean wages between groups
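For self-checking, the answers follow directly from the estimated coefficients above:
1. 8.6
2. 0.11
3. 0.11 + 0.14 + 0.03 = 0.28
4. (0.11 + 0.14 + 0.03) − 0.14 = 0.14
5. (0.11 + 0.14 + 0.03) − 0.11 = 0.17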


Interactions between dummy and continuous variables
𝑬𝒅𝒖 is a continuous variable while 𝑮𝒆𝒏𝒅𝒆𝒓 is a dummy variable (1=Men; 0=Women)

W𝑎𝑔𝑒 = 𝛽0 + 𝛿0 𝐺𝑒𝑛𝑑𝑒𝑟 + 𝛽1 𝐸𝑑𝑢 + 𝛿1 𝐺𝑒𝑛𝑑𝑒𝑟 ∗ 𝐸𝑑𝑢 + 𝑢

 If Gender = 0 (women), then 𝑊𝑎𝑔𝑒 = 𝛽0 + 𝛽1𝐸𝑑𝑢 + 𝑢
𝛽0: intercept for women; 𝛽1: slope for women

 If Gender = 1 (men), then 𝑊𝑎𝑔𝑒 = (𝛽0 + 𝛿0) + (𝛽1 + 𝛿1)𝐸𝑑𝑢 + 𝑢
𝛽0 + 𝛿0: intercept for men; 𝛽1 + 𝛿1: slope for men

 𝛿1 shows the difference in slopes between the two groups:
If 𝛿1 is statistically significant => the effect of edu differs between men and women.
If 𝛿1 is statistically insignificant => the effect of edu is the same for men and women.
Interactions between dummy and continuous variables (cont.)

𝑤𝑎𝑔𝑒̂ = 2145.787 − 666.4691 𝐺𝑒𝑛𝑑𝑒𝑟 + 349.1673 𝐸𝑑𝑢 + 141.9522 𝑮𝒆𝒏𝒅𝒆𝒓 ∗ 𝑬𝒅𝒖
 For every one-year increase in education, a woman earns an additional 349.1673.
 For every one-year increase in education, a man earns an additional 491.1195 (= 349.1673 + 141.9522).
 The difference in slopes (the differential effect of edu on wage) is 141.9522.
 Note: this kind of interaction allows for different slopes.
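A minimal Stata sketch (variable names assumed) that fits the same interaction with factor-variable notation; the coefficient on the interaction term should match the slope difference reported above:
* i.gender##c.edu includes gender, edu, and gender*edu in one step
reg wage i.gender##c.edu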
Quadratic Functional Forms: interpretation
Quadratic functional form: 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝛽2𝑥² + 𝑢
❑ This functional form captures decreasing or increasing marginal effects.
❑ The effect depends on 𝛽1, 𝛽2 and the value of 𝑥:
∆𝑦̂ ≈ (𝛽̂1 + 2𝛽̂2𝑥)∆𝑥, so ∆𝑦̂/∆𝑥 ≈ 𝛽̂1 + 2𝛽̂2𝑥

𝑤𝑎𝑔𝑒̂ = 500 + 200 𝑒𝑥𝑝𝑒𝑟 − 5 𝑒𝑥𝑝𝑒𝑟²
∆𝑤𝑎𝑔𝑒̂ ≈ (200 − 2 ∗ 5 ∗ 𝑒𝑥𝑝𝑒𝑟)∆𝑒𝑥𝑝𝑒𝑟

An increase in work experience from 10 to 11 years increases wages by about 100 thousand VND/month:
∆𝑤𝑎𝑔𝑒̂ ≈ (200 − 2 ∗ 5 ∗ 10) ∗ 1 = 200 − 100 = 100
Note: this is an approximate calculation.

❑ The turning point is 𝑥∗ = |𝛽1/(2𝛽2)| = 200/(2 ∗ 5) = 20 years.
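A sketch of estimating the quadratic and locating the turning point with utest (an SSC package, assuming the wage/exper data used earlier):
ssc install utest
gen exper2 = exper^2
reg wage exper exper2
utest exper exper2    // reports the extremum point and tests the inverted-U shape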
Unbiasedness, consistency and efficiency

❑ Unbiasedness: 𝐸(𝛽̂𝑗) = 𝛽𝑗
The expected value of the OLS estimator equals the true population parameter, 𝛽𝑗.
Note: in a given sample, 𝛽̂𝑗 may be larger or smaller than 𝛽𝑗.
❑ Consistency: 𝛽̂𝑗 → 𝛽𝑗 as 𝑛 → ∞
The OLS estimator 𝛽̂𝑗 converges to the true population parameter 𝛽𝑗 as the sample size gets larger, approaching infinity.
❑ Efficiency: smaller variances, i.e., more precise estimates.
❑ Under assumptions 1–5, the OLS estimator is unbiased, consistent and efficient.
❑ BLUE: Best Linear Unbiased Estimator.
❑ Best: OLS has the smallest variance among all linear unbiased estimators.
Three Components of OLS Variances

𝑉𝑎𝑟(𝛽̂𝑗) = 𝝈𝟐 / (𝑺𝑺𝑻𝒋 (1 − 𝑹²𝒋)), 𝑠𝑒(𝛽̂𝑗) = √𝑉𝑎𝑟(𝛽̂𝑗); 𝑗 = 1, 2, … , 𝑘

𝑺𝑺𝑻𝒋 = Σᵢ₌₁ⁿ (𝑥𝑖𝑗 − 𝑥̄𝑗)² is the total sample variation in 𝑥𝑗; 𝑹²𝒋 is the R-squared from a regression of explanatory variable 𝒙𝒋 on all other explanatory variables (including an intercept).

1. Error variance: 𝝈𝟐
▪ Larger 𝝈𝟐 increases the sampling variances.
▪ Larger 𝝈𝟐 reduces the precision of estimates.
▪ 𝝈𝟐 does not decrease with sample size, because 𝝈𝟐 is a feature of the population.
▪ Adding more explanatory variables reduces 𝝈𝟐.
2. The total sample variation in explanatory variable 𝒋: 𝑺𝑺𝑻𝒋
• Higher sample variation increases the precision of estimates.
• Larger sample sizes automatically increase sample variation.
• Increasing the sample size is a way to obtain more precise estimates.
3. Linear relationships among the explanatory variables: 𝑹²𝒋
• As 𝑹²𝒋 increases to 1, 𝑉𝑎𝑟(𝛽̂𝑗) gets larger and larger: 𝑉𝑎𝑟(𝛽̂𝑗) → ∞ as 𝑹²𝒋 → 1.
• Multicollinearity leads to higher variances for the OLS slope estimators.
The precision of the slope coefficient will increase if

𝑉𝑎𝑟(𝛽̂𝑗) = 𝝈𝟐 / (𝑺𝑺𝑻𝒋 (1 − 𝑹²𝒋)); 𝑗 = 1, 2, … , 𝑘

1. The variation of the explanatory variable about its mean increases.
2. The variance of the error increases.
3. There is a high correlation between explanatory variables.
4. The sample size is larger.
5. More relevant regressors are added.

𝑊𝑎𝑔𝑒 = b0 + b1 𝑒𝑥𝑝𝑒𝑟 + b2 𝑒𝑑𝑢 + 𝑢

The precision of b2 will increase if

1. There is more variation about the mean of the education variable.
2. There is a high correlation between education and experience.
3. The sample size is larger.
4. All the observations on the edu variable have the same value.

A. True or false?
1. The error variance (𝝈𝟐 ) can be reduced if we add more explanatory
variables.
2. Reducing the error variance makes the estimates less precise.
3. Measurement errors in the dependent variable make the estimates less
precise.
4. The estimates are more precise if the variance of the dependent variable is
a function of the independent variable: Var(Y|x)=F(x).
5. The estimates are less precise if the variance of the errors is a function of
the independent variable: Var(U|x)=F(x).
B. Omitted variable bias occurs if
1. A model omitted a variable that has no partial effect on Y and is highly
correlated with other explanatory variables.
2. A model omitted a variable that has a non-zero partial effect on Y and is
uncorrelated with other explanatory variables.
3. A model omitted a variable that has a positive effect on Y and is uncorrelated
with other explanatory variables.
4. A model omitted a variable that has a positive effect on Y and is negatively
correlated with other explanatory variables.
5. A model omitted a variable that has a negative effect on Y and is uncorrelated
with other explanatory variables.
6. A model omitted a variable that has a negative effect on Y and is positively
correlated with other explanatory variables.
7. A model omitted a variable that has a positive effect on Y and is positively
correlated with other explanatory variables.
8. A model omitted the squared term of X, while both the X and X squared terms
were statistically significant.
C.
1. You are given this regression model: 𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝑢. Suppose
there is a positive and highly correlated relationship between X1 and X2,
and X1 has a positive effect on Y. What happens if we drop X1 from the
model?
2. You are given this regression model: 𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝑢. Suppose
there is a negative and highly correlated relationship between X1 and X2,
and X1 has a positive effect on Y. What happens if we drop X1 from the
model?
3. You are given this regression model: 𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝑢. Suppose
there is a positive and highly correlated relationship between X1 and X2,
and X2 has a negative effect on Y. What happens if we drop X2 from the
model?
4. You are given this regression model: 𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝑢. Suppose
there is a positive relationship between X1 and X2, and X2 has a zero
partial effect on Y . What happens if we include X2 in the model?
5. If X2 has a non-zero partial effect on Y but is omitted, does this
increase the bias?
6. You are given this regression model: 𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝑢. You find that Var(Y|x1) increases with the value of x1. Does this affect unbiasedness?
Midterm exam
8 Nov 2023
One-hour written open-book exam
The first group includes students 1 to 20; the test time is 7:30–8:30.
The second group includes students 21 to 43; the test time is 9:00–10:00.
Midterm exam
9 Nov 2023
One-hour written open-book exam
The first group includes students 1 to 20; the test time is 7:30–8:30.
The second group includes students 21 to 44; the test time is 9:00–10:00.
