
Nonlinear Models: Microeconometrics

/ When Y Is Not a Continuous Variable /


Zerayehu Sime Eshete (PhD)
Yom Postgraduate College
OUTLINE
1. Concept of Nonlinear Models
2. Discrete Choice Models (Binary and Multinomial Models)
3. Limited Dependent Models (Tobit Censored, Truncated, Heckman Model)
• Discrete choice models theoretically or empirically model choices made by people among a finite set of
alternatives.
• Daniel McFadden won the Nobel prize in 2000 for his pioneering work in developing the theoretical basis for
discrete choice.
• Discrete choice models statistically relate the choice made by each person to
 the attributes of the person and
 the attributes of the alternatives available to the person

For example, the choice of which car a person buys is statistically related to the person's income
and age as well as to the price, fuel efficiency, size, and other attributes of each available car.
• The models estimate the probability that a person chooses a particular alternative.
• Discrete choice models specify the probability that an individual chooses an option among a set of alternatives.
• The probabilistic description of discrete choice behavior is used not because individual behavior is viewed as intrinsically probabilistic.
• Rather, it is the lack of information that leads us to describe choice in a probabilistic fashion.
• In practice, we cannot know all factors affecting individual choice decisions as their determinants are partially
observed or imperfectly measured.
• Therefore, discrete choice models rely on stochastic assumptions and specifications to account for unobserved
factors related to
a) choice alternatives,
b) taste variation over people (inter-personal heterogeneity)
c) taste variation over time (intra-individual choice dynamics), and
d) heterogeneous choice sets
Binary Discrete Choice Models
• To analyze the factors that influence the probability of a choice outcome, economists use one of three binary discrete choice models:
a) Linear probability model
b) Probit model
c) Logit model

P = F(X1, X2, X3, ..., Xk)

• The major difference between these models is the specific functional form, F, which relates X1, X2, …, Xk to P.
• The linear probability model assumes F is a linear function.
• The probit model assumes F is the normal cumulative distribution function.
• The logit model assumes F is the logistic cumulative distribution function.
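As a compact summary of the three functional forms (standard notation, written out here rather than taken verbatim from the slides):

P(Y = 1 \mid X) = F(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k), \qquad
F(z) =
\begin{cases}
z & \text{(linear probability model)}\\
\Phi(z) = \displaystyle\int_{-\infty}^{z} \tfrac{1}{\sqrt{2\pi}} e^{-t^2/2}\,dt & \text{(probit)}\\
\Lambda(z) = \dfrac{e^{z}}{1+e^{z}} & \text{(logit)}
\end{cases}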
(a) Linear Probability Model (LPM)
• The linear probability model is the classical linear regression model with a dummy variable as the dependent
variable.
• It is the easiest model to estimate and interpret.
• However, it also has the biggest shortcomings.
• It is defined as the classical linear model:

Yi = β0 + β1X1 + β2X2 + ... + βkXk + ui

E(Yi) = β0 + β1X1 + β2X2 + ... + βkXk,   since E(ui) = 0

E(Yi) = Ŷi = pi

• From probability theory, since the maximum value of Y is one and the minimum is zero, Y has only two outcomes and follows the Bernoulli distribution, so the fitted value Ŷi is interpreted as the probability that Yi = 1.
Problems associated with the above LPM:
• Example: Suppose that a government plans to construct a water project to sustain the lives of a rural society. However, the cost of the project is huge and above the capacity of the government, so it designs a strategy to collect some portion of the cost from the beneficiaries. In doing so, a research team is assigned to study the willingness to pay the construction share allotted to each household. The team finds the following LPM result:

WTP = 0.23 + 0.64education + 1.08age + 0.44income - 0.78distance + u

• Problem 1: It violates the principle of probability: the LPM can generate predicted probabilities that fall outside the [0, 1] bound required by probability theory.
• Problem 2: A probability should not be linearly related to X. In reality, a change in X at a high income level does not shift the probability by the same amount as the same change at the lowest income level; because the LPM is linear in parameters, it imposes a logically inconsistent constant marginal probability.
• Problem 3: It violates the Gauss-Markov assumptions: the error terms are heteroscedastic. The error can take only two values, so it follows a binomial (Bernoulli) rather than a normal distribution, and its variance depends on X. If Y = 1, ε = 1 - Xβ; if Y = 0, ε = -Xβ.
• Problem 4: It violates the normality of the error terms; the errors follow a Bernoulli (two-point) distribution.
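A short derivation of the heteroscedasticity in Problem 3 (a sketch using the standard Bernoulli argument; the slides state the result but not the algebra). Since ε equals 1 − Xβ with probability Xβ and −Xβ with probability 1 − Xβ,

E(\varepsilon) = (1 - X\beta)\,X\beta + (-X\beta)(1 - X\beta) = 0,
\qquad
\operatorname{Var}(\varepsilon) = (1 - X\beta)^2\,X\beta + (X\beta)^2\,(1 - X\beta) = X\beta\,(1 - X\beta),

so the error variance changes with X, violating the constant-variance assumption.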
Graphically, these problems can be captured in a single graph. But we want the relationship to have the following nature:

We need a function that shows a non-linear relationship between the predictors and the probability of the event occurring, and that produces an output on a continuous scale ranging from 0 to 1. So, we impose the Normal distribution (probit) or the Logistic distribution (logit model).
How do we get this? By assumption and imposition:
• By imposing the assumption of a Normal distribution and/or a Logistic distribution to capture the behaviour of the above non-linear relationship.
• Specifically, we need the cumulative distribution function (CDF) to fit the reality.
• Mostly, the two behave in a similar fashion. The model assumes that the probability distribution of εi follows the Normal or the Logistic distribution, giving the probit or the logit model respectively.

Pi = F(Xi)

• Linear F: the LPM.
• Non-linear F:
  - Probit model: Normal cumulative distribution function.
  - Logit model: Logistic cumulative distribution function.

"Due to the shortcomings of the LPM, the probit and logit models are recommended!"
Probability Density Function and Cumulative Distribution Function: Recall
• The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur.
• The probability density function (PDF), or density, of a continuous random variable is a function that describes the relative likelihood of this random variable taking on a given value.
• The total area under the density over an interval equals the probability that the continuous random variable falls in that interval.
• The cumulative distribution function (CDF) of a real-valued random variable X, or just the distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x.

No.    Score   Frequency   Relative Frequency (PDF)   Cumulative Frequency (CDF)
1      5       2           0.2                        0.2
2      6       1           0.1                        0.3
3      7       3           0.3                        0.6
4      8       2           0.2                        0.8
5      9       1           0.1                        0.9
6      10      1           0.1                        1.0
Total          10          1.0
Normal Distribution and its properties
• If X is Normally distributed, X ~ N(μ, σ²).
• The probability density function of the Normal distribution is

f(x) = [1 / (σ√(2π))] · exp( −(x − μ)² / (2σ²) )

• The cumulative distribution function of the Normal distribution is

F(x) = ∫ from −∞ to x of [1 / (σ√(2π))] · exp( −(t − μ)² / (2σ²) ) dt

• Properties: the density is symmetric about μ, the total area under it equals 1, and F(x) increases from 0 to 1.
Standard Normal Distribution (Z-score) and its properties
• Any Normal variable can be standardized with the Z-score, Z = (xᵢ − μ) / σ, so that Z ~ N(0, 1).
• The probability density function of the standard Normal distribution is

f(z) = [1 / √(2π)] · exp( −z² / 2 )

• The cumulative distribution function of the standard Normal distribution is

F(z) = ∫ from −∞ to z of [1 / √(2π)] · exp( −t² / 2 ) dt

• Properties: mean 0, variance 1, symmetric about zero, with F(0) = 0.5.
Logistic Distribution and its properties: Recall
• The logistic distribution is used for modeling growth, and also for logistic regression.
• Use the logistic distribution to model data distributions that have longer tails and higher kurtosis than the normal
distribution.
• In fact, the logistic and normal distributions are so close in shape (although the logistic tends to have slightly
fatter tails) that for most applications it’s impossible to tell one from the other
• The logistic distribution is mainly used because the curve has a relatively simple cumulative distribution formula
to work with.
• The PDF of the logistic distribution is

f(x; μ, σ) = exp( −(x − μ)/σ ) / { σ · [1 + exp( −(x − μ)/σ )]² }

• The CDF of the logistic distribution is

F(x; μ, σ) = ∫ from −∞ to x of f(t) dt = 1 / [1 + exp( −(x − μ)/σ )]
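To see how close the two CDFs are in practice, here is a small illustrative check in Stata (my own sketch, not from the slides): normal() is the standard normal CDF, invlogit() is the logistic CDF, and 1.6 is the usual approximate rescaling factor between the two scales.

. display normal(1)        // standard normal CDF at 1  (approximately .8413)
. display invlogit(1.6*1)  // logistic CDF at the rescaled point  (approximately .8320, close to the probit value)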
Binary Probit Model
Probit Model
• The probit model depends on the standard normal distribution.
• The dependent variable Y is a discrete random variable since it takes only two values, either 0 or 1.
• The probit model assumes that the conditional mean function for Y is given by
P(Y = 1 | X) = F(Xβ) = F(I) = F(β0 + β1X1 + ... + βkXk)
• When the value of an explanatory variable changes, the value of the index function changes. When the value of the
index function changes, the probability that Y = 1 changes. Y = 1 if I ≥ 0 , and Y = 0 if I < 0
• The index, I, can be given any interpretation that is theoretically plausible. In economics, when specifying a model of
a choice process it is usually interpreted as a utility index.
• Note that the value of the index is unknown and unobservable because the values of coefficients are unknown and
unobservable.
• However, by obtaining estimates of coefficients you can obtain an estimate of I.
Interpretation of Probit Model and Computation of Marginal Effects:
• The slope coefficients of the index function are not marginal probabilities. To derive the marginal probability for
Xk you must use the chain rule from calculus.
• Recall that for the probit model the probability that Y = 1 is given by P = F(I) = F(β1 + β2X2 + … + βkXk), where F is the standard normal cumulative distribution function.
• When Xk changes, I changes. When I changes, P changes. Application of the chain rule therefore yields

∂P/∂Xk = (∂F/∂I) · (∂I/∂Xk) = f(I) · βk

• where f(I) is the standard normal probability density function. To obtain an estimate of the marginal probability, you use the estimate of βk and evaluate f(I) at specific values of the X’s, using the estimates of β1, β2, …, βk.
• The magnitude of the marginal probability is not constant but varies with the values of the X’s. This is because f(I) varies with the values of the X’s.
• The marginal probability is largest when P = 0.5 and smallest when P is close to 0 or 1. This implies that a
change in Xk has the biggest effect on decision-makers who are “sitting on the fence” when choosing Y =1 or Y
= 0, and the smallest effect on decision-makers who are “set in their ways.”
• The fitted value of Y is equal to a predicted probability for given values of the X’s.
• The marginal effect in this context computes the marginal effect at mean value of X.
Interpretation of Probit Model and Computation of Average Effects:

• Unlike the marginal effect ( which computes the effect at mean value of x), the average marginal effect is
calculated for each observation, and then averaged across all observations.

AMEk = (1/n) · Σᵢ f(Iᵢ) · βk

• The marginal effect at the mean uses the means of the variables, but there may be no such "average" individual in the data.
• The average marginal effect therefore often makes more sense.
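As an illustrative sketch (my own, not in the slides) of how the average marginal effect can be computed by hand after a probit fit: predict the linear index, evaluate the normal density at it, multiply by the coefficient, and average. The variable names follow the admissions example used below.

. probit admit gre topnotch gpa
. predict xbhat, xb                       // linear index I = Xb for each observation
. gen me_gre = normalden(xbhat)*_b[gre]   // observation-level marginal effect of gre
. summarize me_gre                        // its mean is the average marginal effect
. margins, dydx(gre)                      // built-in command reporting the same AME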
STATA Software Application:
• The data in this example were gathered on undergraduates applying to graduate school and includes undergraduate GPAs, the
reputation of the school of the undergraduate (a topnotch indicator), the students' GRE score, and whether or not the student
was admitted to graduate school. Using this dataset, we can predict admission to graduate school using undergraduate GPA,
GRE scores, and the reputation of the school of the undergraduate. Our outcome variable is binary, and we will use a probit
model. Thus, our model will calculate a predicted probability of admission based on our predictors. The probit model does so
using the cumulative distribution function of the standard normal.
• use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       admit |       400       .3175    .4660867          0          1
         gre |       400       587.7    115.5165        220        800
    topnotch |       400       .1625    .3693709          0          1
         gpa |       400      3.3899    .3805668       2.26          4

(Annotated output source: https://stats.oarc.ucla.edu/stata/output/probit-regression/)
tabulate admit
admit | Freq. Percent Cum.
------------+-----------------------------------
0 | 273 68.25 68.25
1 | 127 31.75 100.00
------------+-----------------------------------
Total | 400 100.00
Iteration History - This is a listing of the log likelihoods at each iteration for the probit model. Remember that probit regression uses maximum likelihood estimation, which is an iterative procedure. The first iteration (called Iteration 0) is the log likelihood of the "null" or "empty" model, that is, a model with no predictors. At the next iteration (called Iteration 1), the specified predictors are included in the model. At each iteration, the log likelihood increases because the goal is to maximize the log likelihood. When the difference between successive iterations is very small, the model is said to have "converged" and the iterating stops.

LR chi2(3) - This is the Likelihood Ratio (LR) Chi-Square test that at least one of the predictors' regression coefficients is not equal to zero. The number in the parentheses indicates the degrees of freedom of the Chi-Square distribution used to test the LR Chi-Square statistic and is defined by the number of predictors in the model.

probit admit gre topnotch gpa

Iteration 0:   log likelihood = -249.98826
Iteration 1:   log likelihood = -238.97735
Iteration 2:   log likelihood = -238.94339
Iteration 3:   log likelihood = -238.94339

Probit regression                               Number of obs   =        400
                                                LR chi2(3)      =      22.09
                                                Prob > chi2     =     0.0001
Log likelihood = -238.94339                     Pseudo R2       =     0.0442

------------------------------------------------------------------------------
       admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0015244   .0006382     2.39   0.017     .0002736    .0027752
    topnotch |   .2730334   .1795984     1.52   0.128     -.078973    .6250398
         gpa |   .4009853   .1931077     2.08   0.038     .0225012    .7794694
       _cons |  -2.797884   .6475363    -4.32   0.000    -4.067032   -1.528736
------------------------------------------------------------------------------

_cons - The constant term is -2.797884. This means that if all of the predictors (gre, topnotch and gpa) are evaluated at zero, the predicted probability of admission is F(-2.797884) = 0.002571929. So, as expected, a student with a GRE score of zero and a GPA of zero from a non-topnotch school has an extremely low predicted probability of admission.

Prob > chi2 - This is the probability of getting an LR test statistic as extreme as, or more so than, the observed statistic under the null hypothesis; the null hypothesis is that all of the regression coefficients are simultaneously equal to zero. In other words, this is the probability of obtaining this chi-square statistic (22.09) or one more extreme if there is in fact no effect of the predictor variables. This p-value is compared to a specified alpha level, our willingness to accept a type I error, which is typically set at 0.05 or 0.01. The small p-value from the LR test, 0.0001, would lead us to conclude that at least one of the regression coefficients in the model is not equal to zero. The parameter of the chi-square distribution used to test the null hypothesis is defined by the degrees of freedom in the prior line, chi2(3).

Pseudo R2 - This is McFadden's pseudo R-squared. Probit regression does not have an equivalent to the R-squared found in OLS regression; however, many people have tried to come up with one.

Coefficients - Except topnotch, all are statistically significant policy variables at 5%. There are only limited ways in which we can interpret the individual regression coefficients: a positive coefficient means that an increase in the predictor leads to an increase in the predicted probability, and a negative coefficient means that an increase in the predictor leads to a decrease in the predicted probability.
Coef. - These are the regression coefficients reported in the output above. The predicted probability of admission can be calculated using these coefficients: for a given record, the predicted probability of admission is P = F(-2.797884 + .0015244·gre + .2730334·topnotch + .4009853·gpa), where F is the cumulative distribution function of the standard normal. However, interpretation of the coefficients in probit regression is not as straightforward as the interpretation of coefficients in linear regression or logit regression. The increase in probability attributed to a one-unit increase in a given predictor depends both on the values of the other predictors and on the starting value of the given predictor. For example, if we hold gre and topnotch constant at zero, the one-unit increase in gpa from 2 to 3 has a different effect than the one-unit increase from 3 to 4 (the probabilities do not change by a common difference or a common factor), and the effects of these one-unit increases are different again if we hold gre and topnotch constant at their respective means instead of zero.

• To generate values from F in Stata, use the normal function. For example,
display normal(0)
• This will display .5, indicating that F(0) = .5 (i.e., half of the area under the standard normal distribution curve falls to the left of zero). The first student in our dataset has a GRE score of 380, a GPA of 3.61, and a topnotch indicator value of 0. We could multiply these values by their corresponding coefficients,
display -2.797884 + (.0015244*380) + (.2730334*0) + (.4009853*3.61)
• to determine that the predicted probability of admittance is F(-0.77105507). To find this value, we type
display normal(-0.77105507)
• and arrive at a predicted probability of 0.22033715.
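To make the non-constant effect concrete, a small sketch (my own, using the fitted coefficients above) comparing the change in predicted probability when gpa moves from 2 to 3 versus from 3 to 4, holding gre and topnotch at zero:

. display normal(-2.797884 + .4009853*3) - normal(-2.797884 + .4009853*2)   // increase in Pr(admit) for gpa 2 -> 3 (approximately .032)
. display normal(-2.797884 + .4009853*4) - normal(-2.797884 + .4009853*3)   // increase in Pr(admit) for gpa 3 -> 4 (approximately .061, a different amount)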
Robust standard errors
• If you specify the vce(robust) option, probit reports robust standard errors (robust variance estimates).

. probit admit gre topnotch gpa, vce(robust) nolog

Probit regression                               Number of obs   =        400
                                                Wald chi2(3)    =      22.49
                                                Prob > chi2     =     0.0001
Log pseudolikelihood = -238.94339               Pseudo R2       =     0.0442

------------------------------------------------------------------------------
             |               Robust
       admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0015244   .0006456     2.36   0.018      .000259    .0027898
    topnotch |   .2730334   .1828177     1.49   0.135    -.0852828    .6313496
         gpa |   .4009853     .19962     2.01   0.045     .0097373    .7922333
       _cons |  -2.797884   .6475403    -4.32   0.000     -4.06704   -1.528728
------------------------------------------------------------------------------

https://www.stata.com/manuals13/rprobit.pdf
• To obtain marginal effects (at mean values):

. mfx

Marginal effects after probit
      y  = Pr(admit) (predict)
         =  .30912719

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+-----------------------------------------------------------------
     gre |   .0005371      .00022    2.39   0.017   .000097  .000977   587.7
topnotch*|   .1000404      .06792    1.47   0.141  -.033089  .233169   .1625
     gpa |    .141291      .06794    2.08   0.038   .008122   .27446  3.3899
---------+-----------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

The predicted probability of admission is 0.30912719 when all explanatory variables are fixed at their mean values, and the marginal effects of gre, topnotch, and gpa on that probability are .0005371, .096206 (.1000404 for the discrete 0-to-1 change reported by mfx), and .141291 respectively. Note that the two commands report different values for the categorical variable, which makes it harder to interpret; of the two, prefer margins with the dydx() option.

. margins, dydx(*) atmeans

Conditional marginal effects                    Number of obs   =        400
Model VCE    : OIM

Expression   : Pr(admit), predict()
dy/dx w.r.t. : gre topnotch gpa
at           : gre            =       587.7 (mean)
               topnotch       =       .1625 (mean)
               gpa            =      3.3899 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0005371   .0002243     2.39   0.017     .0000975    .0009768
    topnotch |    .096206   .0633146     1.52   0.129    -.0278884    .2203003
         gpa |    .141291   .0679447     2.08   0.038     .0081218    .2744603
------------------------------------------------------------------------------

With binary independent variables, marginal effects measure discrete change, i.e. how do predicted probabilities change as the binary independent variable changes from 0 to 1? Marginal effects for continuous variables measure the instantaneous rate of change (defined shortly).
Discrete Change for Categorical Variables
The mfx and margins, dydx(*) atmeans output shown above illustrates the point. The MEM (marginal effect at the means) for a categorical variable shows how P(Y = 1) changes as the categorical variable changes from 0 to 1, holding all other variables at their means. That is, for a categorical variable Xk,

Marginal Effect of Xk = Pr(Y = 1 | X, Xk = 1) − Pr(Y = 1 | X, Xk = 0)

In the current case, the MEM for topnotch of about 0.096 tells us that, for two hypothetical individuals with average values on gre (587.7) and gpa (3.39), the predicted probability of admission is about 0.096 greater for the individual with topnotch = 1 than for the one with topnotch = 0.

To confirm, we can easily compute the predicted probabilities for those hypothetical individuals and then compute the difference between the two, as presented in the next slide.
Marginal effects can be an informative means for summarizing how change in a response is related to change in a covariate. For categorical variables, the effects of discrete changes are computed, i.e., the marginal effects for categorical variables show how P(Y = 1) is predicted to change as Xk changes from 0 to 1, holding all other Xs equal. This can be quite useful, informative, and easy to understand. For continuous independent variables, the marginal effect measures the instantaneous rate of change. If the instantaneous rate of change is similar to the change in P(Y = 1) as Xk increases by one, this too can be quite useful and intuitive. However, there is no guarantee that this will be the case; it will depend, in part, on how Xk is scaled.

The right approach is to treat topnotch as categorical, like this:

. probit admit gre i.topnotch gpa

Iteration 0:   log likelihood = -249.98826
Iteration 1:   log likelihood = -238.97735
Iteration 2:   log likelihood = -238.94339
Iteration 3:   log likelihood = -238.94339

Probit regression                               Number of obs   =        400
                                                LR chi2(3)      =      22.09
                                                Prob > chi2     =     0.0001
Log likelihood = -238.94339                     Pseudo R2       =     0.0442

------------------------------------------------------------------------------
       admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0015244   .0006382     2.39   0.017     .0002736    .0027752
  1.topnotch |   .2730334   .1795984     1.52   0.128     -.078973    .6250398
         gpa |   .4009853   .1931077     2.08   0.038     .0225012    .7794694
       _cons |  -2.797884   .6475363    -4.32   0.000    -4.067032   -1.528736
------------------------------------------------------------------------------

. margins, dydx(*) atmeans

Conditional marginal effects                    Number of obs   =        400
Model VCE    : OIM

Expression   : Pr(admit), predict()
dy/dx w.r.t. : gre 1.topnotch gpa
at           : gre            =       587.7 (mean)
               0.topnotch     =       .8375 (mean)
               1.topnotch     =       .1625 (mean)
               gpa            =      3.3899 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0005371   .0002243     2.39   0.017     .0000975    .0009768
  1.topnotch |   .1000404   .0679242     1.47   0.141    -.0330886    .2331695
         gpa |    .141291   .0679447     2.08   0.038     .0081218    .2744603
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

. margins topnotch, atmeans

Adjusted predictions                            Number of obs   =        400
Model VCE    : OIM

Expression   : Pr(admit), predict()
at           : gre            =       587.7 (mean)
               0.topnotch     =       .8375 (mean)
               1.topnotch     =       .1625 (mean)
               gpa            =      3.3899 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    topnotch |
          0  |   .2936703   .0253189    11.60   0.000     .2440462    .3432945
          1  |   .3937108   .0629817     6.25   0.000     .2702688    .5171527
------------------------------------------------------------------------------

. display .3937108 - .2936703
.1000405
• To obtain marginal effects (at different values):

. sum admit gre topnotch gpa

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       admit |       400       .3175    .4660867          0          1
         gre |       400       587.7    115.5165        220        800
    topnotch |       400       .1625    .3693709          0          1
         gpa |       400      3.3899    .3805668       2.26          4

. mfx compute, at(220,0,2.26)

Marginal effects after probit
      y  = Pr(admit) (predict)
         =  .05981969

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+-----------------------------------------------------------------
     gre |   .0001812      .00007    2.66   0.008   .000048  .000314     220
topnotch*|   .0398816      .03691    1.08   0.280  -.032469  .112232       0
     gpa |   .0476541      .02138    2.23   0.026    .00576  .089549    2.26
---------+-----------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

The predicted probability of admission is 0.0598 when all explanatory variables are fixed at their minimum values.

. mfx compute, at(800,1,4)

Marginal effects after probit
      y  = Pr(admit) (predict)
         =  .61738152

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+-----------------------------------------------------------------
     gre |   .0005816      .00023    2.50   0.012   .000126  .001037     800
topnotch*|   .1071787      .06975    1.54   0.124  -.029525  .243883       1
     gpa |   .1529945      .07213    2.12   0.034   .011614  .294375       4
---------+-----------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

At their maximum values, the predicted probability of admission becomes 0.6174. It is also possible to compute the effects at other combinations, e.g. mfx compute, at(topnotch=0, gpa=4). Because the effects differ across values, this motivates the average effect, discussed next.


To obtain the Average (marginal) effect:

. margins, dydx(*)

Average marginal effects                        Number of obs   =        400
Model VCE    : OIM

Expression   : Pr(admit), predict()
dy/dx w.r.t. : gre topnotch gpa

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0005173   .0002121     2.44   0.015     .0001017     .000933
    topnotch |   .0926585   .0604004     1.53   0.125    -.0257242    .2110412
         gpa |   .1360811   .0645638     2.11   0.035     .0095385    .2626237
------------------------------------------------------------------------------

• The average marginal effect gives you an effect on the probability, i.e. a number between 0 and 1. It is the average change in probability when x increases by one unit. Since the probit is a non-linear model, that effect differs from individual to individual. What the average marginal effect does is compute it for each individual and then compute the average. To express the effect in percentage points, multiply by 100.
• The average marginal effect of a variable is the average of the predicted changes in fitted values for a one-unit change in X (if it is continuous), computed at each observation's X values, i.e., for each observation.
https://www.youtube.com/watch?v=bOILSmV8kF4
PROBIT MODEL GOODNESS-OF-FIT TEST:
• The lfit command, typed without options, displays the Pearson goodness-of-fit test for the estimated model. With the group
option lfit produces Hosmer-Lemeshow's goodness-of-fit test.
. lfit

Probit model for admit, goodness-of-fit test

       number of observations       =       400
       number of covariate patterns =       374
       Pearson chi2(370)            =    368.41
       Prob > chi2                  =    0.5135

The test is carried out under the null hypothesis that the fitted model is the correct model.

• The p-value for the goodness-of-fit test suggests that the model fits reasonably well. However, since the number of covariate patterns is nearly equal to the number of observations, the Pearson test is not appropriate for these data. In this situation the Hosmer and Lemeshow goodness-of-fit test for grouped data is preferred. The group option requests that the data be formed into 10 nearly equal-size groups for the Hosmer and Lemeshow test of goodness-of-fit.

. lfit, group (10)

Probit model for admit, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)

number of observations = 400


number of groups = 10
Hosmer-Lemeshow chi2(8) = 10.29
Prob > chi2 = 0.2455
Multicollinearity Test

. collin topnotch gpa gre
(obs=400)

Collinearity Diagnostics

                        SQRT                   R-
  Variable      VIF     VIF    Tolerance    Squared
----------------------------------------------------
  topnotch     1.08    1.04     0.9228      0.0772
       gpa     1.21    1.10     0.8255      0.1745
       gre     1.20    1.09     0.8360      0.1640
----------------------------------------------------
  Mean VIF     1.16

                           Cond
        Eigenval          Index
---------------------------------
    1     3.2124         1.0000
    2     0.7606         2.0552
    3     0.0211        12.3276
    4     0.0059        23.3809
---------------------------------
 Condition Number         23.3809
 Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
 Det(correlation matrix)   0.7866

Model Specification Test

. linktest

Iteration 0:   log likelihood = -249.98826
Iteration 1:   log likelihood = -238.56194
Iteration 2:   log likelihood = -238.45861
Iteration 3:   log likelihood = -238.45861

Probit regression                               Number of obs   =        400
                                                LR chi2(2)      =      23.06
                                                Prob > chi2     =     0.0000
Log likelihood = -238.45861                     Pseudo R2       =     0.0461

------------------------------------------------------------------------------
       admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .5182407    .535289     0.97   0.333    -.5309065    1.567388
      _hatsq |  -.5473129   .5602018    -0.98   0.329    -1.645288    .5506624
       _cons |  -.0547582   .1324039    -0.41   0.679    -.3142651    .2047487
------------------------------------------------------------------------------
We may also wish to see measures of how well our model fits. This can be particularly useful when comparing competing models. The user-written command fitstat produces a variety of fit statistics.

. fitstat

Measures of Fit for probit

Log-likelihood
    Model                       -238.943
    Intercept-only              -249.988

Chi-square
    Deviance (df=396)            477.887
    LR (df=3)                     22.090
    p-value                        0.000

R2
    McFadden                       0.044
    McFadden (adjusted)            0.028
    McKelvey & Zavoina             0.091
    Cox-Snell/ML                   0.054
    Cragg-Uhler/Nagelkerke         0.075
    Efron                          0.052
    Tjur's D                       0.054
    Count                          0.680
    Count (adjusted)              -0.008

IC
    AIC                          485.887
    AIC divided by N               1.215
    BIC (df=4)                   501.853

Variance of
    e                              1.000
    y-star                         1.100
References
Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc.
Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Binary Logit Model
Logit Model
• It is based on the cumulative logistic distribution. Like the probit model, the logit model assumes that the conditional mean function is given by

P(y = 1 | x) = G(xβ) = G(I) = G(β0 + β1x1 + ... + βkxk)

• where I = β1 + β2X2 + … + βkXk is an index function, with the restriction that ∂P/∂I > 0. However, the logit model assumes that G is the logistic cumulative distribution function.
• Thus, the conditional mean function for the logit model is given by

P = e^I / (1 + e^I)

• The rest of the model is analogous to the probit model.
Odds Ratio:

Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is to avoid confounding effects by analyzing the association of all variables together.

P = e^I / (1 + e^I)   and   1 − P = 1 − e^I / (1 + e^I) = 1 / (1 + e^I)

P / (1 − P) = [ e^I / (1 + e^I) ] / [ 1 / (1 + e^I) ] = e^I

ln[ P / (1 − P) ] = I = Xβ

• There exists a linear relationship between the log of the odds and the explanatory variables.
• In the logit model, the probability distribution of the error term follows the logistic probability distribution.
• And it also assumes that the conditional mean function for Y is given as presented above.
• First, it is imperative to understand that odds and probabilities, although sometimes used as synonyms, are not the same. Probability is the ratio between the number of events favorable to some outcome and the total number of events.
• On the other hand, odds are the ratio between probabilities: the probability of an event favorable to an outcome and the probability of an event against the same outcome. Probability is constrained between zero and one, while odds are constrained between zero and infinity.
• And the odds ratio is the ratio between odds. The importance of this is that a large odds ratio (OR) can represent a small probability and vice-versa.
• Stata Application ( Similar to Probit)
. logit admit gre i.topnotch gpa

Iteration 0: log likelihood = -249.98826


Iteration 1: log likelihood = -239.17277
Iteration 2: log likelihood = -239.06484
Iteration 3: log likelihood = -239.06481
Iteration 4: log likelihood = -239.06481

Logistic regression Number of obs = 400


LR chi2(3) = 21.85
Prob > chi2 = 0.0001
Log likelihood = -239.06481 Pseudo R2 = 0.0437

admit Coef. Std. Err. z P>|z| [95% Conf. Interval]

gre .0024768 .0010702 2.31 0.021 .0003792 .0045744


1.topnotch .4372236 .2918532 1.50 0.134 -.1347983 1.009245
gpa .6675556 .3252593 2.05 0.040 .0300591 1.305052
_cons -4.600814 1.09638 -4.20 0.000 -6.749678 -2.451949

.
• Odds Ratio

. logistic admit gre i.topnotch gpa

Logistic regression                             Number of obs   =        400
                                                LR chi2(3)      =      21.85
                                                Prob > chi2     =     0.0001
Log likelihood = -239.06481                     Pseudo R2       =     0.0437

------------------------------------------------------------------------------
       admit | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |    1.00248   .0010729     2.31   0.021     1.000379    1.004585
  1.topnotch |   1.548402   .4519062     1.50   0.134     .8738922     2.74353
         gpa |   1.949466    .634082     2.05   0.040     1.030515    3.687881
       _cons |   .0100437   .0110117    -4.20   0.000     .0011713    .0861256
------------------------------------------------------------------------------

Interpreting an odds ratio:
• Greater than 1: as the variable increases, the event is more likely to occur.
• Less than 1: as the variable increases, the event is less likely to occur.
• Equals 1: as the variable increases, the likelihood of the event does not change.

Now we can say that for a one-unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 1.95.
• Therefore, given this background, there is a relationship between the logistic coefficients and the odds ratios: the odds ratio is equal to exp(coefficient). For instance, the logistic coefficient for gpa is 0.6675556, so its odds ratio becomes exp(0.6675556) = 1.949466, which matches the odds ratio reported for gpa in the output above.
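A quick check of this relationship in Stata (an illustrative one-liner using the gpa coefficient reported above):

. display exp(.6675556)    // approximately 1.949466, the odds ratio shown by the logistic command for gpa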
2. Discrete Choice Models: Multinomial Models for
Polytomous Dependent Variable
Multinomial Models for Polytomous Data

• This is an extension of binary response choice model.


• Here one is supposed to make a single decision among two or more alternatives.
• The choice may be unordered or ordered. This part consists of the following econometric models:
a) Multinomial response models (e.g. whether an individual is unemployed, wage-employed, or self-
employed)
b) Ordered response models (e.g. modelling the rating of the corporate payment default risk, which varies
from, say, A (best) to D (worst))
• Lets first go for Multinomial Response Model: Logit and Probit
Multinomial outcome examples
• The type of insurance contract that an individual selects.
• The product that an individual selects (say type of cereal).
• Occupational choice by an individual (business, academic, non-profit organization).
• The choice of fishing mode (beach, pier, private boat, charter boat).
Multinomial outcome dependent variable
• The dependent variable y is a categorical, unordered variable.
• An individual may select only one alternative.
• The choices/categories are called alternatives and are coded as j =1, 2, ..., m.
• The numbers are only codes and their magnitude cannot be interpreted (use frequencies for each category instead of means to summarize the dependent variable).
• The data are usually recorded in two formats: a wide format and a long format.
Stata Application for Multinomial Logit Model
• The data set contains variables on 200 students. The outcome variable is prog, program type. The predictor variables are social
economic status, ses, a three-level categorical variable and writing score, write, a continuous variable. Let’s start with getting
some descriptive statistics of the variables of interest.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
. tab prog

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
    general |         45       22.50       22.50
   academic |        105       52.50       75.00
   vocation |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00

. tab ses

        ses |      Freq.     Percent        Cum.
------------+-----------------------------------
        low |         47       23.50       23.50
     middle |         95       47.50       71.00
       high |         58       29.00      100.00
------------+-----------------------------------
      Total |        200      100.00

. tab prog ses, chi2

   type of |               ses
   program |       low     middle       high |     Total
-----------+---------------------------------+----------
   general |        16         20          9 |        45
  academic |        19         44         42 |       105
  vocation |        12         31          7 |        50
-----------+---------------------------------+----------
     Total |        47         95         58 |       200

          Pearson chi2(4) =  16.6044   Pr = 0.002

. table prog, con(mean write sd write)

   type of |
   program | mean(write)   sd(write)
-----------+------------------------
   general |    51.33333    9.397776
  academic |    56.25714    7.943343
  vocation |       46.76    9.318754
. mlogit prog i.ses write, base(2)

Iteration 0:   log likelihood = -204.09667
Iteration 1:   log likelihood = -180.80105
Iteration 2:   log likelihood = -179.98724
Iteration 3:   log likelihood = -179.98173
Iteration 4:   log likelihood = -179.98173

Multinomial logistic regression                 Number of obs   =        200
                                                LR chi2(6)      =      48.23
                                                Prob > chi2     =     0.0000
Log likelihood = -179.98173                     Pseudo R2       =     0.1182

------------------------------------------------------------------------------
        prog |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
general      |
         ses |
     middle  |   -.533291   .4437321    -1.20   0.229     -1.40299     .336408
       high  |  -1.162832   .5142195    -2.26   0.024    -2.170684   -.1549804
             |
       write |  -.0579284   .0214109    -2.71   0.007    -.0998931   -.0159637
       _cons |   2.852186   1.166439     2.45   0.014     .5660075    5.138365
-------------+----------------------------------------------------------------
academic     |  (base outcome)
-------------+----------------------------------------------------------------
vocation     |
         ses |
     middle  |   .2913931   .4763737     0.61   0.541    -.6422822    1.225068
       high  |  -.9826703   .5955669    -1.65   0.099     -2.14996    .1846195
             |
       write |  -.1136026   .0222199    -5.11   0.000    -.1571528   -.0700524
       _cons |     5.2182   1.163549     4.48   0.000     2.937686    7.498714
------------------------------------------------------------------------------

• The i. before ses indicates that ses is an indicator (i.e., categorical) variable and that it should be included in the model as such. We have also used the option base() to indicate the category we want to use as the baseline comparison group. In the model above, we have chosen to use the academic program type as the baseline category.
• In the output above, we first see the iteration log, indicating how quickly the model converged. The log likelihood (-179.98173) can be used in comparisons of nested models, but we won't show an example of comparing models here.
• The likelihood ratio chi-square of 48.23 with a p-value < 0.0001 tells us that our model as a whole fits significantly better than an empty model (i.e., a model with no predictors).
Multinomial Response Model: Probit
• The multinomial probit model is similar to the multinomial logit model, just as the binary probit model is similar to the binary logit model.
• The difference is that it uses the standard normal cdf.
• It takes longer for a probit model to obtain results.
• The coefficients differ by a scale factor from the logit model.
• The marginal effects will be similar.
• It is also one way to avoid the IIA assumption. How?

• One problem with multinomial logit models is the IIA assumption imposed by them. This is due to the fact that the e's are assumed to be independently distributed from each other, i.e. the covariance matrix E(ee') is restricted to be a diagonal matrix.
• Although this independence has the advantage that the likelihood function is quite easy to compute, in most cases the IIA assumption leads to unrealistic predictions (recall the famous "red bus and blue bus" example). One alternative way to break the IIA assumption therefore consists in allowing the e's to be correlated with each other, and that is exactly what the multinomial probit model does.
• In the multinomial probit model it is assumed that the e's follow a multivariate normal distribution with covariance matrix Σ, where Σ is not restricted to be a diagonal matrix.
Stata Application:
• As discussed in example 1 of [R] mlogit, we have data on
the type of health insurance available to 616
psychologically depressed subjects in the United States
(Tarlov et al. 1989; Wells et al. 1989). Patients may have
either an indemnity (fee-for-service) plan or a prepaid plan
such as an HMO, or the patient may be uninsured.
Demographic variables include age, gender, race, and site.
Indemnity insurance is the most popular alternative, so
mprobit will choose it as the base outcome by default.

use https://www.stata-press.com/data/r18/sysdsn1
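A minimal sketch of how this model could be fit in Stata (the variable names are those in the sysdsn1 dataset described above; the exact specification is my assumption, as the slides do not show the command):

. use https://www.stata-press.com/data/r18/sysdsn1
. mprobit insure age male nonwhite i.site    // indemnity is chosen as the base outcome by default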
Ordered Response Model: Logit /Probit Model

• For any individual respondent, we hypothesize that there is a continuously varying strength of preferences that
underlies the rating they submit.
• For convenience and consistency with what follows, we will label that strength of preference "utility," U*. We also describe utility as ranging over the entire real line:

U*im = β'xim + εim

• where i indicates the individual and m indicates, for example, the movie. Individuals are invited to "rate" the movie on an integer scale from 1 to 5.
• Logically, then, the translation from underlying utility to a rating could be viewed as a censoring of the underlying
utility
• The crucial feature of the description thus far is that underlying the discrete response is a continuous range of
preferences.
• Therefore, the observed rating represents a censored version of the true underlying preferences. Providing a
rating of five could be an outcome ranging from general enjoyment to wild enthusiasm.
• Note that the thresholds, μ j, number (J − 1) where J is the number of possible ratings (here, five) –J−1 values are
needed to divide the range of utility into J cells.
• The thresholds are an important element of the model; they divide the range of utility into cells that are then
identified with the observed outcomes.
• Importantly, the difference between two levels of a rating scale (for example, one compared to two, two
compared to three) is not the same as on a utility scale.
• Hence we have a strictly nonlinear transformation captured by the thresholds, which are estimable parameters in
an ordered choice model
• Let's focus on the ordered probit. The ordered probit model is built around a latent regression in the same manner as the binomial probit model. We begin with

y* = x'β + ε

• As usual, y* is unobserved. What we do observe is

y = 0 if y* ≤ 0
y = 1 if 0 < y* ≤ μ1
y = 2 if μ1 < y* ≤ μ2
...
y = J if y* > μJ−1

which is a form of censoring. The μ's are unknown parameters to be estimated together with β.
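The implied choice probabilities (a standard result, stated here for completeness rather than copied from the slides, using the thresholds μ1 < μ2 < … < μJ−1 above with the conventions μ−1 = −∞, μ0 = 0, μJ = +∞):

\Pr(y = j \mid x) \;=\; \Phi(\mu_j - x'\beta) \;-\; \Phi(\mu_{j-1} - x'\beta), \qquad j = 0, 1, \dots, J.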
• Stata Software Application for Ordered Logit
• Example: A study looks at factors that influence the decision of whether to apply to graduate school. College
juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our
outcome variable has three categories. Data on parental educational status, whether the undergraduate institution
is public or private, and current GPA is also collected. The researchers have reason to believe that the
“distances” between these three points are not equal. For example, the “distance” between “unlikely” and
“somewhat likely” may be shorter than the distance between “somewhat likely” and “very likely”.
• use https://stats.idre.ucla.edu/stat/data/ologit.dta, clear
• This hypothetical data set has a three-level variable called apply (coded 0, 1, 2), that we will use as our outcome
variable. We also have three variables that we will use as predictors: pared, which is a 0/1 variable indicating
whether at least one parent has a graduate degree; public, which is a 0/1 variable where 1 indicates that the
undergraduate institution is public and 0 private, and gpa, which is the student’s grade point average.
. ologit apply i.pared i.public gpa

Iteration 0:   log likelihood = -370.60264
Iteration 1:   log likelihood =   -358.605
Iteration 2:   log likelihood = -358.51248
Iteration 3:   log likelihood = -358.51244
Iteration 4:   log likelihood = -358.51244

Ordered logistic regression                     Number of obs   =        400
                                                LR chi2(3)      =      24.18
                                                Prob > chi2     =     0.0000
Log likelihood = -358.51244                     Pseudo R2       =     0.0326

------------------------------------------------------------------------------
       apply |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     1.pared |   1.047664   .2657891     3.94   0.000     .5267266    1.568601
    1.public |  -.0586828   .2978588    -0.20   0.844    -.6424754    .5251098
         gpa |   .6157458   .2606311     2.36   0.018     .1049183    1.126573
-------------+----------------------------------------------------------------
       /cut1 |   2.203323   .7795353                      .6754621    3.731184
       /cut2 |   4.298767   .8043147                       2.72234    5.875195
------------------------------------------------------------------------------

• In the table we see the coefficients, their standard errors, z-tests and their associated p-values, and the 95% confidence interval of the coefficients. Both pared and gpa are statistically significant; public is not. So for pared, we would say that for a one-unit increase in pared (i.e., going from 0 to 1), we expect a 1.05 increase in the log odds of being in a higher level of apply, given all of the other variables in the model are held constant. For a one-unit increase in gpa, we would expect a 0.62 increase in the log odds of being in a higher level of apply, given that all of the other variables in the model are held constant.
• The cutpoints shown at the bottom of the output indicate where the latent variable is cut to make the three groups that we observe in our data. Note that this latent variable is continuous. In general, these are not used in the interpretation of the results. The cutpoints are closely related to thresholds, which are reported by other statistical packages.
. ologit apply i.pared i.public gpa, or

Iteration 0:   log likelihood = -370.60264
Iteration 1:   log likelihood =   -358.605
Iteration 2:   log likelihood = -358.51248
Iteration 3:   log likelihood = -358.51244
Iteration 4:   log likelihood = -358.51244

Ordered logistic regression                     Number of obs   =        400
                                                LR chi2(3)      =      24.18
                                                Prob > chi2     =     0.0000
Log likelihood = -358.51244                     Pseudo R2       =     0.0326

------------------------------------------------------------------------------
       apply | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     1.pared |   2.850982   .7577601     3.94   0.000      1.69338    4.799927
    1.public |   .9430059   .2808826    -0.20   0.844     .5259888    1.690645
         gpa |   1.851037   .4824377     2.36   0.018      1.11062    3.085067
-------------+----------------------------------------------------------------
       /cut1 |   2.203323   .7795353                      .6754621    3.731184
       /cut2 |   4.298767   .8043147                       2.72234    5.875195
------------------------------------------------------------------------------

• In the output above the results are displayed as proportional odds ratios. We would interpret these pretty much as we would odds ratios from a binary logistic regression. For pared, we would say that for a one-unit increase in pared, i.e., going from 0 to 1, the odds of high apply versus the combined middle and low categories are 2.85 times greater, given that all of the other variables in the model are held constant. Likewise, the odds of the combined middle and high categories versus low apply are 2.85 times greater, given that all of the other variables in the model are held constant. For a one-unit increase in gpa, the odds of the high category of apply versus the low and middle categories of apply are 1.85 times greater, given that the other variables in the model are held constant. Because of the proportional odds assumption (see below for more explanation), the same increase, 1.85 times, is found between low apply and the combined categories of middle and high apply.
https://stats.oarc.ucla.edu/stata/dae/ordered-logistic-regression/
3. Limited Dependent Models
(Tobit Censored, Truncated, Cragg's Model, Heckman Model)
Introduction:
• A limited dependent variable is a variable whose range of possible values is "restricted in some important way."
• In econometrics, the term is often used when estimation of the relationship between the limited dependent variable of
interest and other variables requires methods that take this restriction into account.
• A limited dependent variable means that there is a limit or boundary on the dependent variable and some of the
observations “hit” this limit. A limited dependent variable is a continuous variable with a lot of repeated
observations at the lower or upper limit.
• For example, this may arise when the variable of interest is constrained to lie between zero and one, as in the case of
a probability, or is constrained to be positive, as in the case of wages or hours worked. For example, modeling
college GPA as a function of high school GPA (HSGPA) and SAT scores involves a sample that is truncated based on
the predictors, i.e., only students with higher HSGPA and SAT scores are admitted into the college.
• Censoring is when the limit observations are in the sample and truncation is when the observations are not in the
sample.
• Limited dependent variable models include:
 Censoring, where for some individuals in a data set, some data are missing but other data are present;
 Truncation, where some individuals are systematically excluded from observation (failure to take this
phenomenon into account can result in selection bias);
Truncation and its Regression Model
• Truncation is when the observations are not in the sample.
• Truncated sample: only include consumers who choose positive quantities of a product.
• The truncated sample is not representative of the population because some observations are not included.
• Truncated sample: do not observe anything about people who do not work
• A truncated sample will have fewer observations and a higher mean (when the limit is from below) than a censored sample.
• Truncation has greater loss of information than censoring (missing observations rather than values for the
dependent variable).
Stata Applications for Truncated Regression
• A study of students in a special GATE (gifted and talented education) program wishes to model achievement as a
function of language skills and the type of program in which the student is currently enrolled. A major concern
is that students are required to have a minimum achievement score of 40 to enter the special program. Thus, the
sample is truncated at an achievement score of 40.
• We have a hypothetical data file, truncreg.dta, with 178 observations. The outcome variable is called achiv, and
the language test score variable is called langscore. The variable prog is a categorical predictor variable with
three levels indicating the type of program in which the students were enrolled.
• use https://stats.idre.ucla.edu/stat/stata/dae/truncreg, clear
. summarize achiv langscore

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       achiv |       178    54.23596     8.96323         41         76
   langscore |       178    54.01124    8.944896         31         67

. tabstat achiv, by(prog) stats(n mean sd)

Summary for variables: achiv
     by categories of: prog (type of program)

      prog |         N       mean         sd
-----------+---------------------------------
   general |        40     51.575    7.97074
  academic |       101   56.89109   9.018759
  vocation |        37   49.86486   7.276912
-----------+---------------------------------
     Total |       178   54.23596    8.96323

. tab prog

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
    general |         40       22.47       22.47
   academic |        101       56.74       79.21
   vocation |         37       20.79      100.00
------------+-----------------------------------
      Total |        178      100.00

[Histogram of achiv (frequency, roughly 40 to 80 on the horizontal axis): no achievement scores fall below the truncation point of 40.]
• OLS regression – You could analyze these data using OLS regression. OLS regression will not adjust the
estimates of the coefficients to take into account the effect of truncating the sample at 40, and the coefficients
may be severely biased. This can be conceptualized as a model specification error (Heckman, 1979).
• Truncated regression – Truncated regression addresses the bias introduced when using OLS regression with
truncated data. Note that with truncated regression, the variance of the outcome variable is reduced compared to
the distribution that is not truncated. Also, if the lower part of the distribution is truncated, then the mean of the
truncated variable will be greater than the mean from the untruncated variable; if the truncation is from above,
the mean of the truncated variable will be less than the untruncated variable.
• These types of models can also be conceptualized as Heckman selection models, which are used to correct for
sampling selection bias.
• Censored regression – Sometimes the concepts of truncation and censoring are confused. With censored data we
have all of the observations, but we don’t know the “true” values of some of them. With truncation, some of the
observations are not included in the analysis because of the value of the outcome variable. It would be
inappropriate to analyze the data in our example using a censored regression model.
. truncreg achiv langscore i.prog, ll(40)
(note: 0 obs. truncated)

Fitting full model:

Iteration 0:   log likelihood = -598.11669
Iteration 1:   log likelihood = -591.68356
Iteration 2:   log likelihood = -591.31208
Iteration 3:   log likelihood = -591.30981
Iteration 4:   log likelihood = -591.30981

Truncated regression
Limit:   lower =         40                     Number of obs   =        178
         upper =       +inf                     Wald chi2(3)    =      54.76
Log likelihood = -591.30981                     Prob > chi2     =     0.0000

------------------------------------------------------------------------------
       achiv |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   langscore |   .7125775   .1144719     6.22   0.000     .4882168    .9369383
             |
        prog |
   academic  |   4.065219   2.054938     1.98   0.048     .0376131    8.092824
   vocation  |  -1.135863   2.669961    -0.43   0.671    -6.368891    4.097165
             |
       _cons |   11.30152   6.772731     1.67   0.095     -1.97279    24.57583
-------------+----------------------------------------------------------------
      /sigma |   8.755315    .666803    13.13   0.000     7.448405    10.06222
------------------------------------------------------------------------------

• The ancillary statistic /sigma is equivalent to the standard error of estimate in OLS regression. The value of 8.76 can be compared to the standard deviation of achievement, which was 8.96; this shows a modest reduction. The output also contains an estimate of the standard error of /sigma, as well as a 95% confidence interval for this value.
• The truncated regression model predicting achievement from language scores and program type was statistically significant (chi-square = 54.76, df = 3, p < .001). The variable langscore is statistically significant: a unit increase in language score leads to a .71 increase in predicted achievement. One of the indicator variables for prog is also statistically significant: compared to level 1 of prog (general), the predicted achievement for level 2 of prog (academic) is higher by about 4.07. To determine whether prog itself is statistically significant, we can use the test command to obtain the two degree-of-freedom test of this variable, as sketched below.
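A minimal sketch of that joint test (my own illustration; the factor-level names assume prog is coded 1 = general, 2 = academic, 3 = vocation, as described above):

. test 2.prog 3.prog    // joint (2 df) Wald test that both prog indicators are zero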
Censoring / Tobit Model
• Censoring is when the limit observations are in the sample (only the value of the dependent variable is censored)
• Example: Censored sample: include consumers who consume zero quantities of a product, observe people that
do not work but their work hours are recorded as zero.
• The censored sample is representative of the population (only the mean for the dependent variable is not)
because all observations are included.
• Because of censoring, the dependent variable y is the incompletely observed value of the latent dependent
variable y*.
• Income of y*=120,000 will be censored as y=100,000 with top coding of 100,000
• A very common problem in microeconomic data is censoring of the dependent variable. When the dependent
variable is censored, values in a certain range are all transformed to (or reported as) a single value.
• Conventional regression methods fail to account for the qualitative difference between limit (zero) observations
and nonlimit (continuous) observations.
Censoring from below
• The actual value for the dependent variable y is observed if the latent variable y* is above the limit and the limit
is observed for the censored observations.
• We observe the actual hours worked for people who work and zero for people who do not work

Censoring from above


• The actual value for the dependent variable y is observed if the latent variable y* is below the limit and the limit
is observed for the censored observations. If people make below $100,000, we observe their actual income and if
they make above
• $100,000, we record their income as 100,000 (censored values)
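In symbols (a standard formulation; the slides state this only in words), with lower limit L and upper limit U the observed variable is

y_i = \begin{cases} y_i^{*} & \text{if } y_i^{*} > L \\ L & \text{if } y_i^{*} \le L \end{cases}
\quad\text{(censoring from below)}
\qquad
y_i = \begin{cases} y_i^{*} & \text{if } y_i^{*} < U \\ U & \text{if } y_i^{*} \ge U \end{cases}
\quad\text{(censoring from above)}

In the income example, U = 100,000, so a latent income of y* = 120,000 is recorded as y = 100,000.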
Stata Software Application
• Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to
model using reading and math test scores, as well as, the type of program the student is enrolled in (academic,
general, or vocational). The problem here is that students who answer all questions on the academic aptitude test
correctly receive a score of 800, even though it is likely that these students are not “truly” equal in aptitude. The
same is true of students who answer all of the questions incorrectly. All such students would have a score of 200,
although they may not all be of equal aptitude.
• We have a hypothetical data file, tobit.dta with 200 observations. The academic aptitude variable is apt, the
reading and math test scores are read and math respectively. The variable prog is the type of program the student
is in, it is a categorical (nominal) variable that takes on three values, academic (prog = 1), general (prog = 2), and
vocational (prog = 3).
• use https://stats.idre.ucla.edu/stat/stata/dae/tobit, clear
. tabulate prog

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
   academic |         45       22.50       22.50
    general |        105       52.50       75.00
 vocational |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00

. describe id read math prog apt

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------
id              float   %9.0g
read            float   %9.0g                 reading score
math            float   %9.0g                 math score
prog            float   %10.0g     sel        type of program
apt             float   %9.0g

. sum read math prog apt

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200       52.23    10.25294         28         76
        math |       200      52.645    9.368448         33         75
        prog |       200       2.025    .6904772          1          3
         apt |       200     640.035    99.21903        352        800

. histogram apt, normal bin(10) xline(800)

[Histogram of apt (density, roughly 300 to 800 on the horizontal axis) with a normal curve overlaid and a vertical line at 800, showing the pile-up of censored scores at apt = 800.]
. tobit apt read math i.prog, ul(800)

Tobit regression                                Number of obs   =        200
                                                LR chi2(4)      =     188.97
                                                Prob > chi2     =     0.0000
Log likelihood = -1041.0629                     Pseudo R2       =     0.0832

------------------------------------------------------------------------------
         apt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   2.697939    .618798     4.36   0.000     1.477582    3.918296
        math |   5.914485   .7098063     8.33   0.000     4.514647    7.314323
             |
        prog |
    general  |  -12.71476   12.40629    -1.02   0.307    -37.18173     11.7522
 vocational  |   -46.1439   13.72401    -3.36   0.001     -73.2096   -19.07821
             |
       _cons |    209.566   32.77154     6.39   0.000     144.9359    274.1961
-------------+----------------------------------------------------------------
      /sigma |   65.67672   3.481272                      58.81116    72.54228
------------------------------------------------------------------------------
             0  left-censored observations
           183     uncensored observations
            17 right-censored observations at apt >= 800

• Tobit regression coefficients are interpreted in a similar manner to OLS regression coefficients; however, the linear effect is on the uncensored latent variable, not the observed outcome. See McDonald and Moffitt (1980) for more details.
• For a one-unit increase in read, there is a 2.7 point increase in the predicted value of apt.
• A one-unit increase in math is associated with a 5.91 unit increase in the predicted value of apt.
• The terms for prog have a slightly different interpretation. The predicted value of apt is 46.14 points lower for students in a vocational program (prog = 3) than for students in an academic program (prog = 1).
• The ancillary statistic /sigma is analogous to the square root of the residual variance in OLS regression. The value of 65.67 can be compared to the standard deviation of academic aptitude, which was 99.21, a substantial reduction. The output also contains an estimate of the standard error of /sigma as well as the 95% confidence interval.

https://stats.oarc.ucla.edu/stata/dae/tobit-analysis/
Heckman model
• The Heckman model is a sample selection model.
• Sample selection usually occurs when people select themselves into a group.
 We want to study the factors affecting income for working women. We have the selection decision,
whether women choose to work or not. We also have the income only for women that work.
 Different factors may affect the two decisions. Whether women work or not may be influenced by whether
or not they have kids, but their income should not be influenced by the presence of kids
• Sample selection (incidental truncation) is different from truncation.
• We want to study income. Truncation is if the sample is based on high income. Incidental truncation is if the
sample is based on whether or not people have executive jobs (high correlation but not exactly the same).
• The dependent variable is not observed if the observation is not in the sample. We do not know the income for
people who are not in the high income sample (but the income is not zero).
• Sample selection assumes that the discrete decision z and the continuous outcome y have a bivariate distribution with correlation coefficient ρ.
Heckman model two-step estimation procedure
• Step 1: Estimate a probit model for the selection mechanism.
• Compute the inverse Mills ratio from the probit estimates.
• Step 2: Estimate a regression model for the selected sample, including the inverse Mills ratio as an additional regressor (the equations are sketched after this list).
• The Heckman model may or may not have the same regressors in the selection equation and the regression.
• The Heckman model will report estimates of all coefficients.
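A compact statement of the two steps in standard notation (my own sketch; the slides show these equations only as images):

\text{Selection (probit):}\quad z_i^{*} = w_i'\gamma + u_i,\qquad z_i = 1[z_i^{*} > 0]

\text{Inverse Mills ratio:}\quad \hat{\lambda}_i = \frac{\phi(w_i'\hat{\gamma})}{\Phi(w_i'\hat{\gamma})}

\text{Outcome equation (selected sample, } z_i = 1\text{):}\quad y_i = x_i'\beta + \beta_\lambda \hat{\lambda}_i + e_i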
Example 1
• In the syntax for heckman, depvar and indepvars are the dependent variable and regressors for the underlying regression model to be fit (y = Xβ), and varlists are the variables (Z) thought to determine whether depvar is observed or unobserved (selected or not selected). In our female wage example, the number of children at home would be included in the second list. By default, heckman assumes that missing values (see [U] 12.2.1 Missing values) of depvar imply that the dependent variable is unobserved (not selected). With some datasets, it is more convenient to specify a binary variable (depvar_s) that identifies the observations for which the dependent variable is observed/selected (depvar_s ≠ 0) or not observed (depvar_s = 0); heckman will accommodate either type of data. Here we have a (fictional) dataset on 2,000 women, 1,343 of whom work:
• use https://www.stata-press.com/data/r18/womenwk
• We will assume that the hourly wage is a
function of education and age, whereas
the likelihood of working (the likelihood
of the wage being observed) is a function
of marital status, the number of children at
home, and (implicitly) the wage (via the
inclusion of age and education, which we
think determine the wage):
• heckman assumes that wage is the dependent variable and that the first variable list (educ and age) are the determinants of wage. The variables specified in the select() option (married, children, educ, and age) are assumed to determine whether the dependent variable is observed (the selection equation). Thus, we fit the model

wage = β0 + β1 educ + β2 age + u1

• and we assume that wage is observed if

γ0 + γ1 married + γ2 children + γ3 educ + γ4 age + u2 > 0

• where u1 and u2 have correlation ρ.
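A minimal sketch of the corresponding command (the syntax follows the standard heckman pattern; the exact option list is my reconstruction of the example described above, not shown in the slides):

. use https://www.stata-press.com/data/r18/womenwk
. heckman wage educ age, select(married children educ age)   // ML estimate of the wage equation with selection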


MANY THANKS
