Part III - Analysis With NonLinear Models
For example, the choice of which car a person buys is statistically related to the person’s income
and age, as well as to the price, fuel efficiency, size, and other attributes of each available car.
• Discrete choice models specify the probability that an individual chooses an option among a set of alternatives.
• The models estimate the probability that a person chooses a particular alternative.
• The probabilistic description of discrete choice behavior is used not because individual behavior is viewed
as intrinsically probabilistic.
• Rather, it is the lack of information that leads us to describe choice in a probabilistic fashion.
• In practice, we cannot know all factors affecting individual choice decisions, as their determinants are only
partially observed or imperfectly measured.
• Therefore, discrete choice models rely on stochastic assumptions and specifications to account for unobserved
factors related to
a) choice alternatives,
b) taste variation over people (inter-personal heterogeneity)
c) taste variation over time (intra-individual choice dynamics), and
d) heterogeneous choice sets
Binary Discrete Choice Models
• To analyze the factors that influence the probability of a choice outcome, economists use one of three binary
discrete choice models:
a) Linear probability model
b) Probit model
c) Logit model
• Each model specifies the choice probability as $P = F(X_1, X_2, X_3, \ldots, X_k)$.
• The major difference between these models is the specific functional form, F, which relates X1, X2, …, Xk to P.
• The linear probability model assumes F is a linear function.
• The probit model assumes F is a normal cumulative distribution function.
• The logit model assumes F is a logistic cumulative distribution function.
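• In compact notation, writing $z = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$ for the index (the notation $z$ is ours, but the definitions are standard), the three functional forms are:
$$F(z) = \begin{cases} z & \text{(linear probability model)} \\[4pt] \Phi(z) = \displaystyle\int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt & \text{(probit)} \\[4pt] \Lambda(z) = \dfrac{e^{z}}{1 + e^{z}} & \text{(logit)} \end{cases}$$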
(a) Linear Probability Model (LPM)
• The linear probability model is the classical linear regression model with a dummy variable as the dependent
variable.
• It is the easiest model to estimate and interpret.
• However, it also has the biggest shortcomings.
• It is defined as the classical linear regression model:
$$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + u$$
$$E(Y_i) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k, \qquad \text{since } E(u) = 0$$
$$E(Y_i) = \hat{Y}_i = p_i$$
• From probability theory, since the maximum value of Y is one and the minimum is zero, Y has only two
outcomes and follows the Bernoulli distribution, so the expected value of Y equals the probability that Y = 1.
Problems associated with the above LPM:
• Example: Suppose that a government plans to construct a water project to sustain the lives of a rural community. However,
the cost of the project is huge and above the capacity of the government, so it designs a strategy to collect some portion
of the cost from the beneficiaries. A research team is assigned to study households' willingness to pay for the share of the
construction cost allotted to each. The team estimates the following LPM:
WTP = 0.23 + 0.64·education + 1.08·age + 0.44·income − 0.78·distance + u
• Problem 1: It violates the principle of probability: the LPM can generate predicted probabilities that fall outside the
[0, 1] bounds required by probability theory.
• Problem 2: A probability should not be linearly related to X. Because the LPM is linear in parameters, it implies a
constant marginal probability, which is logically inconsistent: in reality, a unit change in X (say, income) should change
the probability by a different amount at a high income level than at the lowest income level.
• Problem 3: It violates the Gauss-Markov assumption of homoskedastic errors. The error term has a two-point structure:
if Y = 1, then ε = 1 − Xβ; if Y = 0, then ε = −Xβ. Hence Var(ε) = Xβ(1 − Xβ), which depends on X, so the error
variance is not constant.
• Problem 4: It violates the normality of the error terms; the errors follow a binomial (two-point) distribution instead.
• Graphically, the problems can be captured by a single graph. But we want the relationship to have the following nature:
[Figure: the LPM fits a straight line that escapes the (0, 1) interval, while the desired relationship is an S-shaped curve bounded between 0 and 1.]
We need a function that shows non-linear relationship between the predictors and the
probability of the event occurring, and it produces an output on a continuous scale that
ranges from 0 to 1. So, we need to impose Normal distribution (Probit) or Logistic
distribution (Logit Model).
How to get this? By assumption and imposition:
• We impose the assumption of a normal and/or logistic distribution to capture the behaviour of the above
non-linear relationship.
• Specifically, we need a cumulative distribution function (CDF) that fits with the reality.
• Both distributions behave in a similar fashion. Assuming that the probability distribution of εi follows the
normal distribution gives the probit model; assuming it follows the logistic distribution gives the logit model.
$$P_i = F(X_i \beta)$$
“Due to the shortcomings of the LPM, probit and logit models are recommended!”
Probability Density Function and Cumulative Distribution Function: Recall
• The distribution of a statistical data set (or a population) is a listing or function showing all the possible values
(or intervals) of the data and how often they occur.
• The probability density function (PDF), or density of a continuous random variable, is a function that describes the
relative likelihood that this random variable takes on a given value.
• The area under the PDF over an interval equals the probability that the continuous random variable falls in that interval.
• The cumulative distribution function (CDF) of a real-valued random variable X, or just the distribution function of X,
evaluated at x, is the probability that X will take a value less than or equal to x.
• If $X \sim N(\mu, \sigma^2)$, the probability density function of the normal distribution is
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
• The cumulative distribution function of the normal distribution is
$$F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(t-\mu)^2}{2\sigma^2}}\, dt$$
• Properties: the density is symmetric about μ, and the total area under the curve equals one.
Standard Normal Distribution (Z-score) and its properties
• If X is normally distributed with mean μ and variance σ², then the Z-score
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$
follows the standard normal distribution.
• The probability density of the standard normal distribution is
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$$
• The cumulative distribution function of the standard normal distribution is
$$F(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^2}{2}}\, dt$$
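In Stata, the standard normal PDF and CDF used throughout this section are available as the built-in functions normalden() and normal(); a quick interactive check of standard values:
. display normalden(0)
. display normal(0)
. display normal(1.96)
These display approximately .3989 (the density at zero), .5 (half the area lies to the left of zero), and .975, respectively.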
Logistic Distribution and its properties: Recall
• The logistic distribution is used for modeling growth, and also for logistic regression.
• Use the logistic distribution to model data distributions that have longer tails and higher kurtosis than the normal
distribution.
• In fact, the logistic and normal distributions are so close in shape (although the logistic tends to have slightly
fatter tails) that for most applications it is impossible to tell one from the other.
• The logistic distribution is mainly used because the curve has a relatively simple cumulative distribution formula
to work with.
• The PDF of the logistic distribution:
$$f(x; \mu, s) = \frac{e^{-(x-\mu)/s}}{s\left(1 + e^{-(x-\mu)/s}\right)^2}$$
• The CDF of the logistic distribution:
$$F(x; \mu, s) = \int_{-\infty}^{x} f(t)\, dt = \frac{1}{1 + e^{-(x-\mu)/s}}$$
Binary Probit Model
Probit Model
• The probit model depends on the standard normal distribution.
• The dependent variable Y is a discrete random variable since it takes only two values, either 0 or 1.
• The probit model assumes that the conditional mean function for Y is given by
$$P(Y \mid X) = F(X\beta) = F(I) = F(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)$$
• When the value of an explanatory variable changes, the value of the index function changes. When the value of the
index function changes, the probability that Y = 1 changes. Y = 1 if I ≥ 0 , and Y = 0 if I < 0
• The index, I, can be given any interpretation that is theoretically plausible. In economics, when specifying a model of
a choice process it is usually interpreted as a utility index.
• Note that the value of the index is unknown and unobservable because the values of coefficients are unknown and
unobservable.
• However, by obtaining estimates of coefficients you can obtain an estimate of I.
Interpretation of Probit Model and Computation of Marginal Effects:
• The slope coefficients of the index function are not marginal probabilities. To derive the marginal probability for
Xk you must use the chain rule from calculus.
• Recall that for the probit model the probability that Y = 1 is given by $P = F(I) = F(\beta_1 + \beta_2 X_2 + \cdots + \beta_k X_k)$,
where F is the standard normal cumulative distribution function.
• When $X_k$ changes, I changes; when I changes, P changes. Application of the chain rule therefore yields
$$\frac{\partial P}{\partial X_k} = \frac{\partial F}{\partial I} \cdot \frac{\partial I}{\partial X_k} = f(I)\,\beta_k$$
• where f(I) is the standard normal probability density function. To obtain an estimate of the marginal probability,
you use the estimate of $\beta_k$ and evaluate f(I) at specific values of the X's, using the estimates of $\beta_1, \beta_2, \ldots, \beta_k$.
• The magnitude of the marginal probability is not constant but varies with the values of the X's, because
f(I) varies with the values of the X's.
• The marginal probability is largest when P = 0.5 and smallest when P is close to 0 or 1. This implies that a
change in Xk has the biggest effect on decision-makers who are “sitting on the fence” when choosing Y =1 or Y
= 0, and the smallest effect on decision-makers who are “set in their ways.”
• The fitted value of Y is equal to a predicted probability for given values of the X’s.
• The marginal effect in this context is evaluated at the mean values of the X's.
Interpretation of Probit Model and Computation of Average Effects:
• Unlike the marginal effect at the mean (which computes the effect at the mean value of X), the average marginal
effect is calculated for each observation and then averaged across all observations:
$$\frac{\partial P}{\partial X_k} = \frac{1}{n}\sum_{i=1}^{n} f(I_i)\,\beta_k$$
• The marginal effect at the mean uses the means of the variables, but there may be no such average individual.
• The average marginal effect therefore often makes more sense.
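In modern Stata, both quantities come from the margins postestimation command; a minimal do-file-style sketch (assuming a binary model such as the probit fit in the next section has just been estimated):
* marginal effects evaluated at the means of the X's
margins, dydx(*) atmeans
* average marginal effects, computed per observation and then averaged
margins, dydx(*)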
STATA Software Application:
• The data in this example were gathered on undergraduates applying to graduate school and includes undergraduate GPAs, the
reputation of the school of the undergraduate (a topnotch indicator), the students' GRE score, and whether or not the student
was admitted to graduate school. Using this dataset, we can predict admission to graduate school using undergraduate GPA,
GRE scores, and the reputation of the school of the undergraduate. Our outcome variable is binary, and we will use a probit
model. Thus, our model will calculate a predicted probability of admission based on our predictors. The probit model does so
using the cumulative distribution function of the standard normal.
. use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       admit |       400       .3175    .4660867          0          1
         gre |       400       587.7    115.5165        220        800
    topnotch |       400       .1625    .3693709          0          1
         gpa |       400      3.3899    .3805668       2.26          4

(Source: https://stats.oarc.ucla.edu/stata/output/probit-regression/)

. tabulate admit

      admit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        273       68.25       68.25
          1 |        127       31.75      100.00
------------+-----------------------------------
      Total |        400      100.00
• Iteration History - This is a listing of the log likelihoods at each iteration for the probit model. Remember that probit
regression uses maximum likelihood estimation, which is an iterative procedure. The first iteration (called Iteration 0) is
the log likelihood of the "null" or "empty" model; that is, a model with no predictors. At the next iteration (called
Iteration 1), the specified predictors are included in the model. At each iteration, the log likelihood increases because the
goal is to maximize the log likelihood. When the difference between successive iterations is very small, the model is said
to have "converged" and the iterating stops.
• LR chi2(3) - This is the Likelihood Ratio (LR) Chi-Square test that at least one of the predictors' regression coefficients
is not equal to zero. The number in the parentheses indicates the degrees of freedom of the Chi-Square distribution used to
test the LR Chi-Square statistic and is defined by the number of predictors in the model.
• There are limited ways in which we can interpret the individual regression coefficients. A positive coefficient means
that an increase in the predictor leads to an increase in the predicted probability. A negative coefficient means that an
increase in the predictor leads to a decrease in the predicted probability. The effects of one-unit increases also differ
if we hold gre and topnotch constant at their respective means instead of zero.
• To generate values from F in Stata, use the normal function. For example,
. display normal(0)
• This will display .5, indicating that F(0) = .5 (i.e., half of the area under the standard normal distribution curve
falls to the left of zero). The first student in our dataset has a GRE score of 380, a GPA of 3.61, and a topnotch
indicator value of 0. We could multiply these values by their corresponding coefficients,
. display -2.797884 + (.0015244*380) + (.2730334*0) + (.4009853*3.61)
• to determine that the predicted probability of admittance is F(-0.77105507). To find this value, we type
. display normal(-0.77105507)
• and arrive at a predicted probability of 0.22033715.
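Rather than computing the index by hand for one student, predict computes the same fitted probabilities for every observation after the probit fit; a minimal sketch (phat is our chosen variable name):
* predicted probability that admit = 1, for each observation
predict phat, pr
list phat in 1
The first value should match the hand calculation above (about 0.2203).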
Robust standard errors
• If you specify the vce(robust) option, probit reports robust standard errors (obtaining robust variance estimates).
See https://www.stata.com/manuals13/rprobit.pdf.
• To obtain marginal effects (at mean values):
. mfx

• Marginal effects can be an informative means for summarizing how change in a response is related to change in a
covariate. For categorical variables, the effects of discrete changes are computed, i.e., the marginal effects for
categorical variables show how P(Y = 1) is predicted to change as Xk changes from 0 to 1 holding all other Xs equal.
This can be quite useful, informative, and easy to understand. For continuous independent variables, the marginal
effect measures the instantaneous rate of change. If the instantaneous rate of change is similar to the change in
P(Y = 1) as Xk increases by one, this too can be quite useful and intuitive. However, there is no guarantee that this
will be the case; it will depend, in part, on how Xk is scaled.

Expression   : Pr(admit), predict()
at           : gre        =    587.7 (mean)
               0.topnotch =    .8375 (mean)
               1.topnotch =    .1625 (mean)
               gpa        =   3.3899 (mean)

             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    topnotch |
           0 |   .2936703   .0253189    11.60   0.000     .2440462    .3432945
           1 |   .3937108   .0629817     6.25   0.000     .2702688    .5171527

Note: dy/dx for factor levels is the discrete change from the base level.

. display .3937108 - .2936703
.1000405
• To obtain marginal effects (at different values):
. sum admit gre topnotch gpa
(See also https://www.youtube.com/watch?v=bOILSmV8kF4)
PROBIT MODEL GOODNESS-OF-FIT TEST:
• The lfit command, typed without options, displays the Pearson goodness-of-fit test for the estimated model. With the group
option lfit produces Hosmer-Lemeshow's goodness-of-fit test.
. lfit
• The p-value for the goodness-of-fit test suggests that the model fits reasonably well. However, since the number of covariate
patterns is equal to the number of observations the Pearson test is not appropriate for these data. In this situation the Hosmer
and Lemeshow goodness-of-fit test for grouped data is preferred. The group option requested that the data be formed into 10
nearly equal-size groups for the Hosmer and Lemeshow test of goodness-of-fit.
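A minimal sketch of the two versions of the test described above, run after the probit fit:
* Pearson goodness-of-fit test
lfit
* Hosmer-Lemeshow test with 10 nearly equal-size groups
lfit, group(10)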
• We may also wish to see measures of how well our model fits. This can be particularly useful when comparing
competing models. The user-written command fitstat produces a collection of fit statistics:

Measures of Fit for probit:

Log-likelihood:   Model                   -238.943
                  Intercept-only          -249.988
Chi-square:       Deviance (df=396)        477.887
                  LR (df=3)                 22.090
                  p-value                    0.000
R2:               McFadden                   0.044
                  McFadden (adjusted)        0.028
                  McKelvey & Zavoina         0.091
                  Cox-Snell/ML               0.054
                  Cragg-Uhler/Nagelkerke     0.075
IC:               AIC                      485.887
                  AIC divided by N           1.215
                  BIC (df=4)               501.853
Variance of:      e                          1.000
                  y-star                     1.100
References
Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc.
Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Binary Logit Model
Logit Model
• It is based on the cumulative logistic distribution. Like the probit model, the logit model
assumes that the conditional mean function is given by
$$P(y \mid x) = G(x\beta) = G(I) = G(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)$$
• where $I = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ is an index function, with the restriction that $\partial P / \partial I > 0$.
However, the logit model assumes that G is a logistic cumulative distribution function.
• Thus, the conditional mean function for the logit model is given by
$$P = \frac{e^{I}}{1 + e^{I}}$$
• The rest of the model is analogous to the probit model.
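In Stata, the logistic CDF is available as the built-in invlogit() function, analogous to normal() for the probit:
. display invlogit(0)
. display invlogit(2)
These display .5 and approximately .8808 (= e²/(1 + e²)), respectively.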
Odds Ratio:
• Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable. The
procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial.
The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage
is to avoid confounding effects by analyzing the association of all variables together.
• From the conditional mean function,
$$P = \frac{e^{I}}{1 + e^{I}} \qquad \text{and} \qquad 1 - P = \frac{1}{1 + e^{I}}$$
$$\frac{P}{1 - P} = e^{I}$$
$$\ln\left(\frac{P}{1 - P}\right) = I = X\beta$$
• There exists a linear relationship between the log of the odds and the explanatory variables.
• In the logit model, the probability distribution of εi follows the logistic probability distribution.
• And it also assumes that the conditional mean function for Y is given as presented above.
• First, it is imperative to understand that odds and probabilities, although sometimes used as synonyms, are not
the same. Probability is the ratio between the number of events favorable to some outcome and the total number
of events.
• On the other hand, odds are the ratio between probabilities: the probability of an event favorable to an outcome
and the probability of an event against the same outcome. Probability is constrained between zero and one, and
odds are constrained between zero and infinity.
• And the odds ratio is the ratio between odds. The importance of this is that a large odds ratio (OR) can represent
a small probability and vice-versa.
• Stata Application (similar to probit):
. logit admit gre i.topnotch gpa
• Odds Ratio
. logistic admit gre i.topnotch gpa
Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not
being admitted) increase by a factor of 1.94
• Therefore, given this background, there is a relationship between the logistic coefficients and the odds ratios:
the odds ratio is equal to exp(coefficient). For instance, the logistic coefficient for gpa is 0.6675556, so its odds
ratio becomes exp(0.6675556) = 1.949466, which matches the odds ratio for gpa reported by the logistic command.
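This identity is easy to verify interactively from the logit output:
. display exp(.6675556)
which displays ≈ 1.949466, the odds ratio reported by logistic for gpa.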
2. Discrete Choice Models: Multinomial Models for
Polytomous Dependent Variable
Multinomial Models for Polytomous Data
• The i. before ses indicates that ses is an indicator variable (i.e., categorical variable) and that it should be
included in the model. We have also used the option "base" to indicate the category we would want to use for the
baseline comparison group. In the model below, we have chosen to use the academic program type as the baseline
category.
• In the output below, we first see the iteration log, indicating how quickly the model converged. The log likelihood
(-179.98173) can be used in comparisons of nested models, but we won't show an example of comparing models here.
• The likelihood ratio chi-square of 48.23 with a p-value < 0.0001 tells us that our model as a whole fits
significantly better than an empty model (i.e., a model with no predictors).

Multinomial logistic regression               Number of obs   =        200
                                              LR chi2(6)      =      48.23
                                              Prob > chi2     =     0.0000
Log likelihood = -179.98173                   Pseudo R2       =     0.1182

        prog |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
general      |
         ses |
      middle |   -.533291   .4437321    -1.20   0.229     -1.40299     .336408
        high |  -1.162832   .5142195    -2.26   0.024    -2.170684   -.1549804
       write |  -.0579284   .0214109    -2.71   0.007    -.0998931   -.0159637
       _cons |   2.852186   1.166439     2.45   0.014     .5660075    5.138365
-------------+----------------------------------------------------------------
academic     |  (base outcome)
-------------+----------------------------------------------------------------
vocation     |
         ses |
      middle |   .2913931   .4763737     0.61   0.541    -.6422822    1.225068
        high |  -.9826703   .5955669    -1.65   0.099     -2.14996    .1846195
       write |  -.1136026   .0222199    -5.11   0.000    -.1571528   -.0700524
       _cons |     5.2182   1.163549     4.48   0.000     2.937686    7.498714
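For reference, a command of the following form produces output like the above; this is a sketch assuming the UCLA hsbdemo example dataset, from which these estimates appear to be drawn:
. use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
. mlogit prog i.ses write, base(2)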
Multinomial Response Model: Probit
• The multinomial probit model is similar to the multinomial logit model, just like the binary probit model is
similar to the binary logit model.
• The difference is that it uses the standard normal cdf.
• It takes longer for a probit model to obtain results.
• The coefficients are different by a scale factor from the logit model. The marginal effects will be similar.
• It is also one way to avoid the IIA assumption. How?
• One problem with multinomial logit models was the IIA assumption imposed by them. This is due to the fact that
the e's were assumed to be independently distributed from each other, i.e., the covariance matrix E(ee') is restricted
to be a diagonal matrix.
• Although this independence has the advantage that the likelihood function is quite easy to compute, in most cases
the IIA assumption leads to unrealistic predictions (recall the famous "red bus/blue bus" example). One alternative
to break down the IIA assumption therefore consists in allowing the e's to be correlated with each other, and that is
exactly what the multinomial probit model does.
• In the multinomial probit model it is assumed that the e's follow a multivariate normal distribution with covariance
matrix Σ, where Σ now is not restricted to be a diagonal matrix.
Stata Application:
• As discussed in example 1 of [R] mlogit, we have data on
the type of health insurance available to 616
psychologically depressed subjects in the United States
(Tarlov et al. 1989; Wells et al. 1989). Patients may have
either an indemnity (fee-for-service) plan or a prepaid plan
such as an HMO, or the patient may be uninsured.
Demographic variables include age, gender, race, and site.
Indemnity insurance is the most popular alternative, so
mprobit will choose it as the base outcome by default.
use https://www.stata-press.com/data/r18/sysdsn1
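A minimal sketch of the corresponding fit (this is the model form used in the [R] mprobit manual example on these data):
. mprobit insure age male nonwhite i.site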
Ordered Response Model: Logit /Probit Model
• For any individual respondent, we hypothesize that there is a continuously varying strength of preferences that
underlies the rating they submit.
• For convenience and consistency with what follows, we will label that strength of preference "utility," U*. We
also describe utility as ranging over the entire real line:
$$-\infty < U^*_{im} < +\infty$$
• where i indicates the individual and m indicates, for example, the movie. Individuals are invited to "rate" the
movie on an integer scale from 1 to 5.
• Logically, then, the translation from underlying utility to a rating could be viewed as a censoring of the underlying
utility
• The crucial feature of the description thus far is that underlying the discrete response is a continuous range of
preferences.
• Therefore, the observed rating represents a censored version of the true underlying preferences. Providing a
rating of five could be an outcome ranging from general enjoyment to wild enthusiasm.
• Note that the thresholds, μj, number (J − 1), where J is the number of possible ratings (here, five): J − 1 values
are needed to divide the range of utility into J cells.
• The thresholds are an important element of the model; they divide the range of utility into cells that are then
identified with the observed outcomes.
• Importantly, the difference between two levels of a rating scale (for example, one compared to two, two
compared to three) is not the same as on a utility scale.
• Hence we have a strictly nonlinear transformation captured by the thresholds, which are estimable parameters in
an ordered choice model
• Let's focus on the ordered probit. The ordered probit model is built around a latent regression in the same manner
as the binomial probit model. We begin with
$$y^* = x'\beta + \varepsilon$$
with the observed rating determined by
$$y = j \quad \text{if} \quad \mu_{j-1} < y^* \leq \mu_j, \qquad j = 1, \ldots, J,$$
which is a form of censoring. The μ's are unknown parameters to be estimated with β.
• Stata Software Application for Ordered Logit
• Example: A study looks at factors that influence the decision of whether to apply to graduate school. College
juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our
outcome variable has three categories. Data on parental educational status, whether the undergraduate institution
is public or private, and current GPA is also collected. The researchers have reason to believe that the
“distances” between these three points are not equal. For example, the “distance” between “unlikely” and
“somewhat likely” may be shorter than the distance between “somewhat likely” and “very likely”.
• use https://stats.idre.ucla.edu/stat/data/ologit.dta, clear
• This hypothetical data set has a three-level variable called apply (coded 0, 1, 2), that we will use as our outcome
variable. We also have three variables that we will use as predictors: pared, which is a 0/1 variable indicating
whether at least one parent has a graduate degree; public, which is a 0/1 variable where 1 indicates that the
undergraduate institution is public and 0 private, and gpa, which is the student’s grade point average.
. ologit apply i.pared i.public gpa
Truncated Regression Model: Stata Application
. sum achiv langscore

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       achiv |       178    54.23596     8.96323         41         76
   langscore |       178    54.01124    8.944896         31         67

Mean of achiv by type of program:

    program |   Obs       Mean    Std. Dev.
------------+-------------------------------
    general |    40     51.575      7.97074
   academic |   101   56.89109     9.018759
   vocation |    37   49.86486     7.276912
------------+-------------------------------
      Total |   178   54.23596      8.96323

. tab prog

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
    general |         40       22.47       22.47
   academic |        101       56.74       79.21
   vocation |         37       20.79      100.00
------------+-----------------------------------
      Total |        178      100.00

[Histogram: frequency distribution of achiv, which ranges from about 40 to 80 because the sample is truncated from below at 40.]
• OLS regression – You could analyze these data using OLS regression. OLS regression will not adjust the
estimates of the coefficients to take into account the effect of truncating the sample at 40, and the coefficients
may be severely biased. This can be conceptualized as a model specification error (Heckman, 1979).
• Truncated regression – Truncated regression addresses the bias introduced when using OLS regression with
truncated data. Note that with truncated regression, the variance of the outcome variable is reduced compared to
the distribution that is not truncated. Also, if the lower part of the distribution is truncated, then the mean of the
truncated variable will be greater than the mean from the untruncated variable; if the truncation is from above,
the mean of the truncated variable will be less than the untruncated variable.
• These types of models can also be conceptualized as Heckman selection models, which are used to correct for
sampling selection bias.
• Censored regression – Sometimes the concepts of truncation and censoring are confused. With censored data we
have all of the observations, but we don’t know the “true” values of some of them. With truncation, some of the
observations are not included in the analysis because of the value of the outcome variable. It would be
inappropriate to analyze the data in our example using a censored regression model.
. truncreg achiv langscore i.prog, ll(40)
(note: 0 obs. truncated)

Iteration 0:   log likelihood = -598.11669
Iteration 1:   log likelihood = -591.68356
Iteration 2:   log likelihood = -591.31208
Iteration 3:   log likelihood = -591.30981
Iteration 4:   log likelihood = -591.30981

Truncated regression
Limit:   lower =         40                   Number of obs   =        178
         upper =       +inf                   Wald chi2(3)    =      54.76
Log likelihood = -591.30981                   Prob > chi2     =     0.0000

       achiv |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   langscore |   .7125775   .1144719     6.22   0.000     .4882168    .9369383
        prog |
    academic |   4.065219   2.054938     1.98   0.048     .0376131    8.092824
    vocation |  -1.135863   2.669961    -0.43   0.671    -6.368891    4.097165
       _cons |   11.30152   6.772731     1.67   0.095     -1.97279    24.57583

• The ancillary statistic /sigma is equivalent to the standard error of estimate in OLS regression. The value of
8.76 can be compared to the standard deviation of achievement, which was 8.96. This shows a modest reduction. The
output also contains an estimate of the standard error of /sigma, as well as a 95% confidence interval for this value.
• The truncated regression model predicting achievement from language scores and program type was statistically
significant (chi-square = 54.76, df = 3, p < .001). The variable langscore is statistically significant: a unit
increase in language score leads to a .71 increase in predicted achievement. One of the indicator variables for prog
is also statistically significant: compared to level 1 of prog, the predicted achievement for level 2 of prog
increases by about 4.07. To determine whether prog itself is statistically significant, we can use the test command
to obtain the two degree-of-freedom test of this variable.
Censoring / Tobit Model
• Censoring is when the limit observations are in the sample (only the value of the dependent variable is censored)
• Example: Censored sample: include consumers who consume zero quantities of a product, observe people that
do not work but their work hours are recorded as zero.
• The censored sample is representative of the population (only the mean for the dependent variable is not)
because all observations are included.
• Because of censoring, the dependent variable y is the incompletely observed value of the latent dependent
variable y*.
• An income of y* = 120,000 will be censored to y = 100,000 with top coding at 100,000.
• A very common problem in microeconomic data is censoring of the dependent variable. When the dependent
variable is censored, values in a certain range are all transformed to (or reported as) a single value.
• Conventional regression methods fail to account for the qualitative difference between limit (zero) observations
and nonlimit (continuous) observations.
Censoring from below
• The actual value for the dependent variable y is observed if the latent variable y* is above the limit and the limit
is observed for the censored observations.
• We observe the actual hours worked for people who work and zero for people who do not work
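In the standard latent-variable notation, censoring from below at zero (the classic tobit case) can be written as:
$$y_i^* = x_i'\beta + \varepsilon_i, \qquad y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}$$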
. sum read math prog apt

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200       52.23    10.25294         28         76
        math |       200      52.645    9.368448         33         75
        prog |       200       2.025    .6904772          1          3
         apt |       200     640.035    99.21903        352        800

[Density plot of apt over the range 300-800, showing the concentration of observations at the upper limit of 800.]
. tobit apt read math i.prog, ul(800)

Tobit regression                              Number of obs   =        200
                                              LR chi2(4)      =     188.97
                                              Prob > chi2     =     0.0000
Log likelihood = -1041.0629                   Pseudo R2       =     0.0832

         apt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   2.697939    .618798     4.36   0.000     1.477582    3.918296
        math |   5.914485   .7098063     8.33   0.000     4.514647    7.314323
        prog |
     general |  -12.71476   12.40629    -1.02   0.307    -37.18173     11.7522
  vocational |   -46.1439   13.72401    -3.36   0.001     -73.2096   -19.07821
       _cons |    209.566   32.77154     6.39   0.000     144.9359    274.1961
-------------+----------------------------------------------------------------
      /sigma |   65.67672   3.481272                      58.81116    72.54228

             0  left-censored observations
           183  uncensored observations
            17  right-censored observations at apt >= 800

• Tobit regression coefficients are interpreted in a similar manner to OLS regression coefficients; however, the
linear effect is on the uncensored latent variable, not the observed outcome. See McDonald and Moffitt (1980) for
more details.
• For a one unit increase in read, there is a 2.7 point increase in the predicted value of apt. A one unit increase
in math is associated with a 5.91 unit increase in the predicted value of apt.
• The terms for prog have a slightly different interpretation. The predicted value of apt is 46.14 points lower for
students in a vocational program (prog=3) than for students in an academic program (prog=1).
• The ancillary statistic /sigma is analogous to the square root of the residual variance in OLS regression. The
value of 65.67 can be compared to the standard deviation of academic aptitude, which was 99.21, a substantial
reduction. The output also contains an estimate of the standard error of /sigma as well as the 95% confidence
interval.

https://stats.oarc.ucla.edu/stata/dae/tobit-analysis/
Heckman model
• The Heckman model is a sample selection model.
• Sample selection usually occurs when people select themselves into a group.
We want to study the factors affecting income for working women. We have the selection decision: whether
women choose to work or not. We also have the income only for women who work.
Different factors may affect the two decisions: whether women work or not may be influenced by whether
or not they have kids, but their income should not be influenced by the presence of kids.
• Sample selection (incidental truncation) is different from truncation.
• We want to study income. Truncation is if the sample is based on high income. Incidental truncation is if the
sample is based on whether or not people have executive jobs (high correlation but not exactly the same).
• The dependent variable is not observed if the observation is not in the sample. We do not know the income for
people who are not in the high income sample (but the income is not zero).
• Sample selection assumes that the discrete decision z and the continuous decision y have a bivariate distribution
with correlation coefficient ρ.
Heckman model two-step estimation procedure
• 1. Estimate a probit model for the selection mechanism and use it to compute the inverse Mills ratio for each observation.
• 2. Estimate the regression equation by OLS, including the inverse Mills ratio as an additional regressor to correct for selection.
• The Heckman model may or may not have the same regressors for the selection equation and the regression.
• The Heckman model will report estimates of all coefficients.
Example 1
• In the syntax for heckman, depvar and indepvars are the dependent variable and regressors for the underlying
regression model to be fit (y = Xβ), and varlists are the variables (Z) thought to determine whether depvar is
observed or unobserved (selected or not selected). In our female wage example, the number of children at home
would be included in the second list. By default, heckman assumes that missing values (see [U] 12.2.1 Missing
values) of depvar imply that the dependent variable is unobserved (not selected). With some datasets, it is more
convenient to specify a binary variable (depvars) that identifies the observations for which the dependent is
observed/selected (depvars6= 0) or not observed (depvars= 0); heckman will accommodate either type of data.
Here we have a (fictional) dataset on 2,000 women, 1,343 of whom work:
• use https://www.stata-press.com/data/r18/womenwk
• We will assume that the hourly wage is a
function of education and age, whereas
the likelihood of working (the likelihood
of the wage being observed) is a function
of marital status, the number of children at
home, and (implicitly) the wage (via the
inclusion of age and education, which we
think determine the wage):
• heckman assumes that wage is the dependent variable and that the first variable list (educ and age) are the
determinants of wage. The variables specified in the select() option (married, children, educ, and age) are
assumed to determine whether the dependent variable is observed (the selection equation). Thus, we fit the
model
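A minimal sketch of the command this describes (the same specification as the [R] heckman manual example on the womenwk data):
. heckman wage educ age, select(married children educ age)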