Bacon: An Effective Way to Detect Outliers in Multivariate Data Using Stata (and Mata)
Associate Editors
Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, Tübingen University, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, Univ. of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
David Epstein, Columbia University
Allan Gregory, Queen's University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, University of Essex
Ulrich Kohler, WZB, Berlin
Frauke Kreuter, University of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, University of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt University, Edinburgh
Jeroen Weesie, Utrecht University
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University
http://www.stata-journal.com
Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and
help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and
help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,
as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.
This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites,
fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.
Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting
files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,
or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,
incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote
free communication among Stata users.
The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Mata, NetCourse,
and Stata Press are registered trademarks of StataCorp LP.
Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station,
Texas 77845, telephone 979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at
http://www.stata.com/bookstore/sj.html
Subscription rates
The listed subscription rates include both a printed and an electronic copy unless otherwise mentioned.
http://www.stata.com/bookstore/sjj.html
Individual articles three or more years old may be accessed online without charge. More
recent articles may be ordered online.
http://www.stata-journal.com/archives.html
The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.
Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive,
College Station, TX 77845, USA, or emailed to sj@stata.com.
Volume 10 Number 3 2010
Stata tip 89: Estimating means and percentiles following multiple imputation, P. A. Lachenbruch, 496
Stata tip 90: Displaying partial results, M. Weiss, 500
Stata tip 91: Putting unabbreviated varlists into local macros, N. J. Cox, 503
\[
E\{f(x)\} = \sum_{i=1}^{n} p_i f(x_i)
\]
For concreteness, we consider a die known to have E (x) = 3.5, where x = (1, 2, 3, 4, 5, 6),
and we want to determine the associated probabilities. Clearly, there are infinitely many
possible solutions, but the obvious one is p1 = p2 = · · · = p6 = 1/6. The obviousness is
based on Laplace’s principle of insufficient reason, which states that two events should
be assigned equal probability unless there is a reason to think otherwise (Jaynes 1957,
622). This negative reason is not much help if, instead, we know that E (x) = 4.
Jaynes’s solution was to tackle this from the point of view of Shannon’s information
theory. Jaynes wanted a criterion function H (p1 , p2 , . . . , pn ) that would summarize the
uncertainty about the distribution. This is given uniquely by the entropy measure
\[
H(p_1, p_2, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln(p_i)
\]
As Golan, Judge, and Miller (1996, 8–10) show, if our knowledge of E {f (x)} is
based on the outcome of N (very large) trials, then the distribution function p = (p1 , p2 ,
. . . , pn ) that maximizes the entropy measure is the distribution that can give rise to
the observed outcomes in the greatest number of ways, which is consistent with what
we know. Any other distribution requires more information to justify it. Degenerate
distributions, ones where pi = 1 and pj = 0 for j ≠ i, have entropy of zero. That is to
say, they correspond to zero uncertainty and therefore maximal information.
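The entropy measure is easy to evaluate directly, which makes the contrast between the uniform and degenerate cases concrete. A minimal sketch (in Python rather than Stata, with K = 1 and an invented helper name):

```python
import math

def entropy(p, K=1.0):
    """H(p1,...,pn) = -K * sum_i p_i ln(p_i), with 0*ln(0) taken as 0."""
    return K * sum(pi * math.log(1 / pi) for pi in p if pi > 0)

print(entropy([1 / 6] * 6))         # ln 6 ~ 1.79: maximal uncertainty
print(entropy([1, 0, 0, 0, 0, 0]))  # 0: zero uncertainty, maximal information
```

Any distribution between these two extremes has entropy strictly between 0 and ln 6.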
The J constraints given in (1) can be thought of as moment constraints, with yj being
the population mean of the Xj random variable. To solve this problem, we set up the
Lagrangian function
where
\[
\Omega(\lambda) = \sum_{i=1}^{n} \exp(-x_i'\lambda)
\]
\[
I(p, q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right) = p'\ln(p) - p'\ln(q)
\]
1. In the Golan, Judge, and Miller (1996) book, the constraint is written as y = Xp, where X is J ×n.
For the applications considered below, it is more natural to write the data matrix in the form shown
here.
318 Maximum entropy estimation
The solution of this problem is given by (Golan, Judge, and Miller 1996, 29)
\[
p_i = q_i \exp(x_i'\lambda)\big/\Omega(\lambda) \qquad (6)
\]
where
\[
\Omega(\lambda) = \sum_{i=1}^{n} q_i \exp(x_i'\lambda) \qquad (7)
\]
The most efficient way to calculate the estimates is, in fact, not by numerical solution
of the first-order conditions [along the lines of (3), (4), and (5)] but by the unconstrained
maximization of the dual problem as discussed further in section 3.5.
3.2 Description
maxentropy provides minimum cross-entropy or maximum entropy estimates of ill-posed
inverse problems, such as Jaynes's dice problem. The command can also be used to
calibrate survey datasets to external totals along the lines of the multiplicative
method implemented in the SAS CALMAR macro (Deville and Särndal 1992;
Deville, Särndal, and Sautory 1993). This is a generalization of iterative raking as
implemented, for instance, in Nick Winter's survwgt command, which is available from
the Statistical Software Components archive (type net search survwgt).
3.3 Options
generate(varname[, replace]) provides the variable name in which Stata will store
the probability estimates. This must be a new variable name, unless the replace
suboption is specified, in which case the existing variable is overwritten. generate()
is required.
M. Wittenberg 319
3.4 Output
maxentropy returns output in three forms. First, it returns estimates of the λ
coefficients. The absolute magnitude of the coefficient is an indication of how informative
the corresponding constraint is, that is, how far it moves the resulting p distribution
away from the prior q distribution in the cross-entropy case or away from the uniform
distribution in the maximum entropy case.
Second, the estimates of p are returned in the variable specified by the user. Third,
the vector of constraints y is returned in the matrix e(constraint), with the rows of
the matrix labeled according to the variable whose constraint that row represents.
Example
Consider the Jaynes’s die problem described earlier. Specifically, let us calculate the
probabilities if we know that the mean of the die is 4. We set the problem up by
creating the x variable, which contains the discrete distribution of outcomes, that is,
(1, 2, 3, 4, 5, 6). The y vector contains the mean 4.
. set obs 6
obs was 0, now 6
. generate x = _n
. matrix y = (4)
. maxentropy x, matrix(y) generate(p4)
Variable lambda
x .17462893
p values returned in p4
constraints given in matrix y
x p4
1 .1030653
2 .1227305
3 .146148
4 .1740337
5 .2072401
6 .2467824
The distribution is weighted toward the larger numbers. We can check that these
estimates obey the restrictions:
. generate xp4=x*p4
. quietly summarize xp4
. display r(sum)
4
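The same numbers can be reproduced outside Stata by attacking the dual problem directly: a one-parameter Newton iteration whose gradient and curvature are the constraint discrepancy and the variance, along the lines of (9) and (10). This is an illustrative Python sketch with an invented function name; maxentropy itself does the equivalent via Stata's maximum likelihood routines:

```python
import math

def maxent_die(target_mean, xs=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Newton iteration on the dual objective for a single mean constraint:
    find lambda such that p_i = exp(lambda*x_i)/Omega(lambda) matches the
    required mean.  Each step divides the constraint discrepancy (the
    gradient) by the variance (the negative Hessian)."""
    lam = 0.0
    for _ in range(200):
        w = [math.exp(lam * x) for x in xs]
        omega = sum(w)
        p = [wi / omega for wi in w]
        mean = sum(pi * x for pi, x in zip(p, xs))
        var = sum(pi * x * x for pi, x in zip(p, xs)) - mean ** 2
        step = (target_mean - mean) / var
        lam += step
        if abs(step) < tol:
            break
    return lam, p

lam, p = maxent_die(4)
print(round(lam, 8))               # ~0.17462893, matching the lambda shown above
print([round(pi, 7) for pi in p])  # ~.1030653 ... .2467824, as in the listing
```

Setting the target mean to 3.5 returns the uniform distribution, as expected from the principle of insufficient reason.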
Finally, we can retrieve a copy of the constraint matrix labeled with the corresponding variables.
\[
L(\lambda) = \sum_{j=1}^{J} \lambda_j y_j - \ln\{\Omega(\lambda)\} = M(\lambda) \qquad (8)
\]
where Ω (λ) is given by (7). Golan, Judge, and Miller show that this function behaves
like a maximum likelihood. In this case,
\[
\nabla_\lambda M(\lambda) = y - X'p \qquad (9)
\]
so that the constraint is met at the point where the gradient is zero. Furthermore,
\[
-\frac{\partial^2 M}{\partial \lambda_j^2} = \sum_{i=1}^{n} p_i x_{ji}^2 - \left(\sum_{i=1}^{n} p_i x_{ji}\right)^2 = \operatorname{var}(x_j) \qquad (10)
\]
\[
-\frac{\partial^2 M}{\partial \lambda_j \partial \lambda_k} = \sum_{i=1}^{n} p_i x_{ji} x_{ki} - \left(\sum_{i=1}^{n} p_i x_{ji}\right)\left(\sum_{i=1}^{n} p_i x_{ki}\right) = \operatorname{cov}(x_j, x_k) \qquad (11)
\]
where the variances and covariances are taken with respect to the distribution p. The
negative of the Hessian of M is therefore guaranteed to be positive definite, which
guarantees a unique solution provided that the constraints are not inconsistent.
Golan, Judge, and Miller (1996, 25) note that the function M can be thought of
as an expected log likelihood, given the exponential family p (λ) parameterized by λ.
Along these lines, we use Stata’s maximum likelihood routines to estimate λ, giving it
the dual objective function [(8)], gradient [(9)], and negative Hessian [(10) and (11)].
The routine that calculates these is contained in maxentlambda_d2.ado. Because of
the globally concave nature of the objective function, convergence should be relatively
quick, provided that there is a feasible solution in the interior of the parameter space.
The command checks for some obvious errors; for example, the population means (yj )
must be inside the range of the Xj variables. If any mean is on the boundary of the
range, then a degenerate solution is feasible, but the corresponding Lagrange multiplier
will be ±∞, so the algorithm will not converge.
Once the estimates of λ have been obtained, estimates of p are derived from (6).
4 Sample applications
4.1 Jaynes’s die problem
In section 3.4, I showed how to calculate the probability distribution given that y = 4.
The following code generates predictions given different values for y:
matrix y=(2)
maxentropy x, matrix(y) generate(p2)
matrix y=(3)
maxentropy x, matrix(y) generate(p3)
matrix y=(3.5)
maxentropy x, matrix(y) generate(p35)
matrix y=(5)
maxentropy x, matrix(y) generate(p5)
list p2 p3 p35 p4 p5, sep(10)
p2 p3 p35 p4 p5
Note in particular that when we set y = 3.5, the command returns the uniform
discrete distribution with pi = 1/6.
We can see the impact of adding in a second constraint by considering the same
problem given the population moments
\[
y = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix} = \begin{pmatrix} 3.5 \\ \sigma^2 \end{pmatrix}
\]
for different values of $\sigma^2$. By definition in this case, $\sigma^2 = \sum_{i=1}^{6} p_i (x_i - 3.5)^2$. We
can therefore create the values $(x_i - 3.5)^2$ and consider which probability distribution
$p = (p_1, p_2, \ldots, p_6)$ will generate both a mean of 3.5 and a given value of $\sigma^2$. The code
to run this is
generate dev2=(x-3.5)^2
matrix y=(3.5 \ (2.5^2/3+1.5^2/3+0.5^2/3))
maxentropy x dev2, matrix(y) generate(pv)
matrix y=(3.5 \ 1)
maxentropy x dev2, matrix(y) generate(pv1)
matrix y=(3.5 \ 2)
maxentropy x dev2, matrix(y) generate(pv2)
matrix y=(3.5 \ 3)
maxentropy x dev2, matrix(y) generate(pv3)
matrix y=(3.5 \ 4)
maxentropy x dev2, matrix(y) generate(pv4)
matrix y=(3.5 \ 5)
maxentropy x dev2, matrix(y) generate(pv5)
matrix y=(3.5 \ 6)
maxentropy x dev2, matrix(y) generate(pv6)
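As a language-independent cross-check of the two-constraint case, the same Newton logic extends to a 2 × 2 system whose negative Hessian is the covariance matrix of the constraint variables, as in (10) and (11). A Python sketch with an invented helper name and a uniform prior:

```python
import math

def maxent2(y1, y2, xs=(1, 2, 3, 4, 5, 6), iters=500):
    """Two-constraint maximum entropy on a die: find (l1, l2) such that
    p_i ~ exp(l1*x_i + l2*d_i), with d_i = (x_i - 3.5)^2, matches the mean
    y1 and the variance constraint y2.  The Newton step solves a 2x2
    system whose matrix is the covariance of the constraint variables."""
    ds = [(x - 3.5) ** 2 for x in xs]
    l1 = l2 = 0.0
    for _ in range(iters):
        w = [math.exp(l1 * x + l2 * d) for x, d in zip(xs, ds)]
        Z = sum(w)
        p = [wi / Z for wi in w]
        m1 = sum(pi * x for pi, x in zip(p, xs))
        m2 = sum(pi * d for pi, d in zip(p, ds))
        # negative Hessian: covariance matrix of (x, d) under p
        vxx = sum(pi * x * x for pi, x in zip(p, xs)) - m1 * m1
        vdd = sum(pi * d * d for pi, d in zip(p, ds)) - m2 * m2
        vxd = sum(pi * x * d for pi, x, d in zip(p, xs, ds)) - m1 * m2
        det = vxx * vdd - vxd * vxd
        g1, g2 = y1 - m1, y2 - m2          # gradient, as in (9)
        s1 = (vdd * g1 - vxd * g2) / det   # solve the 2x2 Newton system
        s2 = (vxx * g2 - vxd * g1) / det
        l1 += s1
        l2 += s2
        if abs(s1) + abs(s2) < 1e-12:
            break
    return (l1, l2), p

# sigma^2 = 35/12 reproduces the uniform distribution, and l1 stays ~0
(l1, l2), p = maxent2(3.5, 35 / 12)
print([round(pi, 6) for pi in p])
```

Smaller variance targets concentrate the distribution on 3 and 4; larger ones push probability toward 1 and 6, matching the discussion below.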
The probabilities behave as we would expect: in the case where σ 2 = 35/12, we get
the uniform distribution. With variances smaller than this, the probability distribution
puts more emphasis on the values 3 and 4, while with higher variances the distribution
becomes bimodal with greater probability being attached to extreme values. What this
output does not reveal is that in all cases the λ1 estimate is basically zero. The reason for this
is that with a symmetrical distribution of xi values around the population mean, the
mean is no longer informative and all the information about the distribution of p derives
from the second constraint. If we force p4 = p5 = 0 so that the distribution is no longer
symmetrical, the first constraint becomes informative, as shown in this output:
Variable lambda
x .0119916
dev2 .59568007
p values returned in p5
constraints given in matrix y
. list x p5 if e(sample), noobs
x p5
1 .4578909
2 .0427728
3 .0131515
6 .4861848
This example shows how to overwrite an existing variable and demonstrates that the
command allows if and in qualifiers. It also shows how to use the e(sample) function.
gender is given in table 1, where we have also indicated the population totals to which
the weights should gross.
Table 1. Sum of weights from example1.dta by stratum, gender, and gross weight for
population totals
                         gender
stratum              0        1    Margin   Required
0                  100      400       500       1600
1                  300      200       500        400
Margin             400      600      1000
Required          1200      800                  2000
The weights can be adjusted to these totals by using the downloadable survwgt
command. To use the maxentropy command, we need to convert the desired constraints
from population totals into population means. That is straightforward because
\[
N = \sum_{i=1}^{n} w_i \qquad (12)
\]
\[
N_{gender=0} = \sum_{i=1}^{n} w_i \, 1(gender_i = 0) \qquad (13)
\]
We could obviously add a condition for the proportion where gender = 1, but because
of the adding-up constraint, that would be redundant. If we have k categories for a
particular variable, we can only use k − 1 constraints in our estimation.
In this particular example, the constraint vector is contained in the constraint
variable. The syntax of the command in this case is
maxentropy constraint stratum gender, generate(wt3) prior(weight) total(2000)
We did not specify a matrix, so the first variable is interpreted as the constraint
vector. We did specify a prior weight and asked Stata to convert the calculated
probabilities to raising weights by multiplying them by 2,000. A comparison with the “raked
weights” confirms them to be identical in this case.
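The equivalence with raking can be checked numerically on the Table 1 figures: iterative proportional fitting of the cell weight sums to the required margins yields the calibrated totals that multiplicative calibration reproduces. A Python sketch for illustration only; the actual computations are done by survwgt or maxentropy in Stata:

```python
def rake(cells, row_targets, col_targets, iters=1000, tol=1e-10):
    """Iterative proportional fitting (raking): alternately scale rows and
    columns of the weight-sum table until both sets of margins match."""
    w = [row[:] for row in cells]
    for _ in range(iters):
        for r, target in enumerate(row_targets):        # scale rows
            s = sum(w[r])
            w[r] = [v * target / s for v in w[r]]
        for c, target in enumerate(col_targets):        # scale columns
            s = sum(w[r][c] for r in range(len(w)))
            for r in range(len(w)):
                w[r][c] *= target / s
        if all(abs(sum(w[r]) - t) < tol
               for r, t in enumerate(row_targets)):
            break
    return w

# stratum x gender sums of weights from Table 1 and the required margins
w = rake([[100.0, 400.0], [300.0, 200.0]], [1600.0, 400.0], [1200.0, 800.0])
print([[round(v, 2) for v in row] for row in w])
```

The fitted cells sum to 1,600 and 400 by stratum, 1,200 and 800 by gender, and 2,000 overall.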
We can check whether the constraints were correctly rendered by retrieving the
constraint matrix used in the estimation:
. matrix C=e(constraint)
. matrix list C
C[2,1]
c1
stratum .2
gender .40000001
We see that E(stratum) = 0.2 and E(gender) = 0.4. Means of dummy variables
are, of course, just population proportions; that is, the proportion in stratum = 1 is
0.2 and the proportion where gender = 1 is 0.4.
that is,
\[
N = \sum_h w_h^*
\]
where $w_{ih}$ is the weight of individual $i$ within household $h$, equal to the common weight
$w_h$. This constraint can again be written in the form of probabilities as
\[
1 = \sum_h \frac{w_h^*}{N}
\]
that is,
\[
1 = \sum_h p_h^*
\]
Consider now any other constraint involving individual aggregates [for example, (13)]:
\[
N_x = \sum_{i=1}^{n} w_i x_i = \sum_h \sum_i w_{ih} x_{ih} = \sum_h w_h \sum_i x_{ih}
\]
so that
\[
\frac{N_x}{N} = \sum_h \frac{w_h \,\mathrm{hhsize}_h}{N} \cdot \frac{\sum_i x_{ih}}{\mathrm{hhsize}_h}
\]
Consequently,
\[
E(x) = \sum_h p_h^* m_{xh} \qquad (14)
\]
The term $m_{xh}$ is just the mean of the $x$ variable within household $h$.
If the prior weight qh is similarly constant within households (as it should be if it is
a design weight), then we similarly create a new variable
\[
q_h^* = \mathrm{hhsize}_h \, q_h
\]
We can then write the cross-entropy objective function as
\[
\begin{aligned}
I(p, q) = \sum_{i=1}^{n} p_i \ln\frac{p_i}{q_i}
&= \sum_h \sum_i p_{ih} \ln\frac{p_{ih}}{q_{ih}} \\
&= \sum_h \sum_i p_h \ln\frac{p_h \,\mathrm{hhsize}_h}{q_h \,\mathrm{hhsize}_h}
 = \sum_h \sum_i p_h \ln\frac{p_h^*}{q_h^*} \\
&= \sum_h \mathrm{hhsize}_h \, p_h \ln\frac{p_h^*}{q_h^*}
 = \sum_h p_h^* \ln\frac{p_h^*}{q_h^*}
\end{aligned}
\]
In short, the objective function evaluated over all individuals and imposing the con-
straint pih = ph for all i is identical to the objective function evaluated over house-
holds where the probabilities have been adjusted to p∗h and qh∗ . We therefore run the
maxentropy command on a household-level file, with the population constraints given
by (14). Our cross-entropy estimates can then be retrieved as
\[
p_h = \frac{p_h^*}{\mathrm{hhsize}_h}
\]
We can check that the weights obtained in this way do, in fact, obey all the
restrictions—they are obviously constant within household, and when added up over
the individuals, they reproduce the required totals.
With 144 constraints and 7,305 observations, the command took 18 seconds to calculate the new weights on a standard desktop computer.
In this context, the λ estimates prove informative. The output of the command is
Variable lambda
P1 -.15945276
P2 .00735986
P3 .14000206
(output omitted )
IMa75 15.402056
IMa80 8.6501559
IFa_0 -7.0753612
IFa_5 2.3584972
(output omitted )
IFa75 -9.2778495
IFa80 14.142518
(output omitted )
WFa70 .05009103
WFa75 .90961156
WFa80 4.6868009
p values returned in hw
constraints given in variable constraints
The huge coefficients for old Indian males and old Indian females suggest that the
population constraints affected the weights for these categories substantially. Given the
large number of constraints, mistakes are possible. The easiest way to check that the
command has worked correctly is to add up the weights within categories and to check
that they add up to the intended totals. Listing the constraint matrix used by the
command is also a useful check. In this case, the labeling of the rows does help:
. matrix list e(constraint)
e(constraint)[144,1]
c1
P1 .10803039
P2 .13514069
P3 .02320805
P4 .05914972
P5 .20764017
P6 .07044568
P7 .21462312
P8 .07373177
AMa_0 .04486157
AMa_5 .04584822
(output omitted )
WFa75 .0012318
WFa80 .00147087
The first eight constraints are the province proportions followed by the proportions
in the age, sex, and race cells.
5 Conclusion
This article introduced the power of maximum entropy and minimum cross-entropy
estimation. The maxentropy command uses Stata's powerful maximum-likelihood estimation routines to provide fast estimates of even complicated problems. I have shown
how the command can be used to calibrate a survey to a set of known population totals
while imposing restrictions like constant weights within households.
6 References
Deming, W. E., and F. F. Stephan. 1940. On a least squares adjustment of a sample frequency table when the expected marginal totals are known. Annals of Mathematical Statistics 11: 427–444.
Deville, J.-C., and C.-E. Särndal. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376–382.
Deville, J.-C., C.-E. Särndal, and O. Sautory. 1993. Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88: 1013–1020.
Golan, A., G. G. Judge, and D. Miller. 1996. Maximum Entropy Econometrics: Robust Estimation with Limited Data. Chichester, UK: Wiley.
Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical Review 106: 620–630.
Merz, J., and H. Stolze. 2008. Representative time use data and new harmonised calibration of the American Heritage Time Use Data 1965–1999. electronic International Journal of Time Use Research 5: 90–126.
Sylvain Weber
University of Geneva
Department of Economics
Geneva, Switzerland
sylvain.weber@unige.ch
1 Introduction
The literature on outliers is abundant, as proved by Barnett and Lewis's (1994) bibliography of almost 1,000 articles. Despite this considerable research by the statistical
community, knowledge apparently fails to spill over, so proper methods for detecting
and handling outliers are seldom used by practitioners in other fields.
The reason is likely that algorithms implemented for the detection of outliers are
scarce. Moreover, the few algorithms available are so time-consuming that using them
may be discouraging. Until now, hadimvo was the only command available in Stata for
identifying outliers. Anyone who has tried to use hadimvo on large datasets, however,
knows it may take hours or even days to obtain a mere dummy variable indicating which
observations should be considered as outliers.
The new bacon command, presented in this article, provides a more efficient way
to detect outliers in multivariate data. It is named for the blocked adaptive
computationally efficient outlier nominators (BACON) algorithm proposed by Billor, Hadi, and
Velleman (2000). bacon is a simple modification of the methodology proposed by Hadi
(1992, 1994) and implemented in hadimvo, but bacon is much less computationally
intensive. As a result, bacon runs many times faster than hadimvo, even though both
commands end up with similar sets of outliers. Identifying multivariate outliers thus
becomes fast and easy in Stata, even with large datasets of tens of thousands of
observations.
© 2010 StataCorp LP st0197
332 Multivariate outliers detection
The initial basic subset is given by the m observations with the smallest Mahalanobis
distances from the whole sample. The subset size m is given by the product of the
number of variables p and a parameter chosen by the analyst.
Billor, Hadi, and Velleman (2000) also proposed using distances from the medians
for this first step. This second version of the algorithm is also implemented in bacon.
Distances from the medians are not scale-invariant, so they should be used carefully if
the variables analyzed are of different magnitudes.
In step 2, Mahalanobis distances from the basic subset are computed:
\[
d_i(\bar{x}_b, S_b) = \sqrt{(x_i - \bar{x}_b)^T S_b^{-1} (x_i - \bar{x}_b)}, \qquad i = 1, 2, \ldots, n \qquad (1)
\]
In step 3, all observations with a distance smaller than some threshold—a corrected
percentile of a χ2 distribution—are added to the basic subset.
Steps 2 and 3 are iterated until the basic subset no longer changes. Observations
excluded from the final basic subset are nominated as outliers, whereas those inside the
final basic subset are nonoutliers.
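The loop just described can be sketched in miniature. The following Python fragment is a univariate special case for brevity: the real bacon command works on p variables with Mahalanobis distances and a corrected χ² percentile as the threshold, whereas here the distance reduces to a standardized deviation and a fixed cutoff of 3 stands in for that percentile:

```python
import statistics

def bacon_1d(values, m=4, cutoff=3.0):
    """One-dimensional sketch of the BACON iteration: start from the m
    points closest to the median (in the spirit of version 2), then
    repeatedly recompute mean/SD on the basic subset and admit every
    point whose distance falls below the cutoff, until the subset no
    longer changes.  Returns True for nominated outliers."""
    med = statistics.median(values)
    order = sorted(range(len(values)), key=lambda i: abs(values[i] - med))
    subset = set(order[:m])                      # step 1: initial basic subset
    while True:
        xs = [values[i] for i in subset]
        mean = statistics.fmean(xs)
        sd = statistics.stdev(xs)
        new = {i for i, v in enumerate(values)   # steps 2-3: distances, admit
               if abs(v - mean) / sd < cutoff}
        if new == subset:                        # iterate until stable
            break
        subset = new
    return [i not in subset for i in range(len(values))]

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 100, 120]
print(bacon_1d(data))   # only the last two observations are nominated
```

Because whole blocks of observations are admitted at each pass, the subset stabilizes after a handful of iterations even when the sample is large.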
The difference from the algorithm proposed by Hadi (1992, 1994) is that observations
are added to the basic subset in blocks instead of observation by observation.
Thus some time is spared through a reduction of the number of iterations. Nevertheless,
it is important to note that the performance of the algorithm is not altered, as
Billor, Hadi, and Velleman (2000) and section 5 of this article show.
The reduction in the number of iterations is not the only source of efficiency gain.
Another major improvement lies in the way bacon is coded. When hadimvo was
implemented, Mata did not exist. Now, though, Mata provides significant speed
enhancements to many computationally intensive tasks, like the calculation of Mahalanobis
distances. I therefore coded bacon so that it benefits from Mata's power.
S. Weber 333
4.2 Options
additionally creates a variable dist containing the distances from the final basic
subset.
replace specifies that the variables newvar1 and newvar2 be replaced if they already
exist in the dataset. This option makes it easier to run bacon several times on
the same data. It should be used cautiously because it might permanently drop some
data.
percentile(#) determines the 1 − # percentile of the chi-squared distribution to be
used as a threshold to separate outliers from nonoutliers. A larger # identifies a
larger proportion of the sample as outliers. The default is percentile(0.15). If #
is specified greater than 1, it is interpreted as a percent; thus percentile(15) is
the same as percentile(0.15).
version(1 | 2) specifies which version of the BACON algorithm must be used to identify
the initial basic subset in multivariate data. version(1), the default, identifies the
initial subset selected based on Mahalanobis distances. version(2) identifies the ini-
tial subset selected based on distances from the medians. In the case of version(2),
varlist must not contain missing values, and you must install the moremata command
before running bacon.
c(#) is the parameter that determines the size of the initial basic subset, which is given
by the product of # and the number of variables in varlist. # must be an integer.
c(4) is used by default as proposed by Billor, Hadi, and Velleman (2000, 285).
. webuse auto
(1978 Automobile Data)
. hadimvo weight length, generate(outhadi) percentile(0.05)
(output omitted )
. bacon weight length, generate(outbacon) percentile(0.15)
(output omitted )
. tabulate outhadi outbacon
                          BACON outlier (p=.15)
Hadi outlier (p=.05)            0        1    Total
                   0           72        0       72
                   1            0        2        2
               Total           72        2       74
Both commands have identified the same two observations as outliers. The param-
eter (in the percentile() option) was set higher in bacon than in hadimvo. With a
parameter of 5%, bacon would not have identified any observation as an outlier. It is
the role of the researcher to choose the parameter that is best adapted for each dataset,
but the default percentile(0.15) appears to bring sensible outcomes in any case and
could always be used as a first benchmark.
With two-dimensional data, it is helpful to draw a scatterplot such as figure 1 that
allows us to see where outliers are located:
[Figure 1. Scatterplot of Weight (lbs.) against length for the auto data; nonoutliers are plotted as 0 and the two nominated outliers as 1.]
To compare the speeds of bacon and hadimvo, let us now use a larger dataset.
Containing about 28,000 observations, nlswork.dta is sufficiently large to illustrate
the point. Suppose we want to identify outliers with respect to the variables ln wage,
age, and tenure. If we did not have bacon, we would type
. webuse nlswork, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. hadimvo ln_wage age tenure, generate(outhadi) percentile(0.05)
Beginning number of observations: 28101
Initially accepted: 4
Expand to (n+k+1)/2:
At this point, your screen will remain idle. You might become worried and think
your computer crashed, but in fact hadimvo is simply going to take some long minutes
to run its many iterations. Remember, there are “only” 28,000 observations in this
dataset. If you are patient enough, Stata will at last show you the outcome:
. hadimvo ln_wage age tenure, generate(outhadi) percentile(0.05)
Beginning number of observations: 28101
Initially accepted: 4
Expand to (n+k+1)/2: 14052
Expand, p = .05: 28081
Outliers remaining: 20
With bacon, by contrast, the solution appears in only a few seconds! Again we can check that the set of identified
outliers is pretty much the same in the two cases:
. tabulate outhadi outbacon
                          BACON outlier (p=.15)
Hadi outlier (p=.05)            0        1     Total
                   0       28,072        9    28,081
                   1            0       20        20
Given the time hadimvo needs and the similarities between the outcomes, it seems
clear that bacon is preferable.
Because there is no rule for the choice of percentile(), the practitioner might
legitimately be willing to test several values and decide after several trials which set of
observations to nominate as outliers. With hadimvo, such an iterative process is almost
impracticable, unless you are particularly patient and have enough time in front of
you. With bacon, on the other hand, completing the iterative process becomes readily
feasible.
bacon has a replace option precisely to give the possibility of running the algorithm
several times without having to add a new variable at each iteration. For the user
wanting to try several percentile() values, replace will prove convenient.
6 Conclusion
“The two big questions about outliers are ‘how do you find them?’ and ‘what do you
do about them?’” (Ord 1996). The bacon command presented here provides an answer
to the first of these questions. The answer to the second is beyond the scope of this
article and is left to the consideration of the researcher.
No doubt, bacon renders the process of detecting outliers in multivariate data easier.
Compared with hadimvo, the only other command devoted to this task in Stata, bacon
appears to identify a similar set of observations as outliers. In terms of speed, bacon
proves to be far faster. Hence, there is no apparent reason to use hadimvo instead of
bacon.
Even though the bacon command provides a fast and easy way to identify potential
outliers, a certain amount of judgment is always needed when deciding which cases to
nominate as outliers and what to do with those observations. Most researchers simply
discard outliers, but before you do so, keep in mind that something new and useful can
often be learned by looking at the nominated cases.
7 References
Barnett, V., and T. Lewis. 1994. Outliers in Statistical Data. 3rd ed. Chichester, UK: Wiley.
Baum, C. F. 2008. Using Mata to work more effectively with Stata: A tutorial. UK Stata Users Group meeting proceedings. http://ideas.repec.org/p/boc/usug08/11.html.
Billor, N., A. S. Hadi, and P. F. Velleman. 2000. BACON: Blocked adaptive computationally efficient outlier nominators. Computational Statistics & Data Analysis 34: 279–298.
Hadi, A. S. 1992. Identifying multiple outliers in multivariate data. Journal of the Royal Statistical Society, Series B 54: 761–771.
Ord, K. 1996. Review of Outliers in Statistical Data, 3rd ed., by V. Barnett and T. Lewis. International Journal of Forecasting 12: 175–176.
Abstract. Medical researchers frequently make statements that one model pre-
dicts survival better than another, and they are frequently challenged to provide
rigorous statistical justification for those statements. Stata provides the estat
concordance command to calculate the rank parameters Harrell’s C and Somers’ D
as measures of the ordinal predictive power of a model. However, no confidence
limits or p-values are provided to compare the predictive power of distinct models.
The somersd package, downloadable from Statistical Software Components, can
provide such confidence intervals, but they should not be taken seriously if they are
calculated in the dataset in which the model was fit. Methods are demonstrated
for fitting alternative models to a training set of data, and then measuring and
comparing their predictive powers by using out-of-sample prediction and somersd
in a test set to produce statistically sensible confidence intervals and p-values for
the differences between the predictive powers of different models.
Keywords: st0198, somersd, stcox, estat concordance, streg, predict, survival,
model validation, prediction, concordance, rank methods, Harrell’s C, Somers’ D
1 Introduction
Harrell’s C and the equivalent parameter Somers’ D were proposed as measures of
the general predictive power of a general regression model by Harrell et al. (1982) and
Harrell, Lee, and Mark (1996), who focused attention on the case of a survival model
with a possibly right-censored outcome, which was interpreted as a lifetime. In the case
of a Cox proportional hazards regression model, both parameters are output by the
Stata postestimation command estat concordance (see [ST] stcox postestimation).1
However, because Harrell’s C and Somers’ D are rank parameters, they are equally valid
as measures of the predictive power of any model in which the scalar outcome Y is at
least ordinal (with or without censorship), and in which the conditional distribution
of the outcome, given the predictor variables, is governed by a scalar function of the
predictor variables and the parameters, such as the hazard ratio in a Cox regression or
the linear predictor in a generalized linear model. If the assumptions of the model are
true, then such a scalar predictive score plays the role of a balancing score as defined
by Rosenbaum and Rubin (1983).
1. As of Stata 11.1, estat concordance provides two concordance measures: Harrell’s C and Gönen
and Heller’s K. Harrell’s C is computed by default or if harrell is specified.
© 2010 StataCorp LP st0198
340 Comparing the predictive powers of survival models
Harrell’s C and Somers’ D are members of the Kendall family of rank parameters.
The family history can be summarized as follows: Kendall’s τa begat Somers’ D begat
Theil–Sen percentile slopes. This family is implemented in Stata by using the somersd
package, which can be downloaded from Statistical Software Components. An overview
of the parameter family is given in Newson (2002), and the methods and formulas are
given in detail in Newson (2006a,b,c).
Parameters in this family are defined by assuming the existence of a population of
bivariate data pairs of the form (Xi , Yi ) and a sampling scheme for sampling pairs of
pairs {(Xi , Yi ), (Xj , Yj )} from that population. A pair of pairs is said to be concordant
if the larger of the X values is paired with the larger of the Y values, and a pair is
said to be discordant if the larger of the X values is paired with the smaller of the
Y values. Kendall’s τa is the difference between the probability of concordance and
the probability of discordance. Somers’ D(X | Y) is the difference between the
corresponding conditional probabilities, assuming that the two Y values can be ordered.
Harrell’s C(X | Y) is defined as {D(X | Y) + 1}/2 and is equal to the conditional
probability of concordance plus half the conditional probability that the data pairs are
neither concordant nor discordant, assuming that the two Y values can be ordered. In the case
where Y is an outcome to be predicted by a multivariate model with a scalar predictive
score, there is an underlying population of multivariate data points (Yi , Vi1 , . . . , Vik )
where the Vih are predictive covariates and the role of the Xi is played by the scalar
predictive score η(Vi1 , . . . , Vik ). In this case, the Somers’ D and Harrell’s C parameters
can be denoted as D{η(V1 , . . . , Vk ) | Y } and C{η(V1 , . . . , Vk ) | Y }, respectively. If the
model is a survival model, then the Y values are lifetimes, and there is the possibility
that one or both of a pair of Y values may be censored, which sometimes implies that
they cannot be ordered.
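The prose definitions above can be restated compactly (a summary only, writing "orderable" for the condition that the two Y values can be ordered despite censoring):

```latex
\tau_a = P(\text{concordant}) - P(\text{discordant}), \\
D(X \mid Y) = P(\text{concordant} \mid \text{orderable})
            - P(\text{discordant} \mid \text{orderable}), \\
C(X \mid Y) = \frac{D(X \mid Y) + 1}{2}
            = P(\text{concordant} \mid \text{orderable})
            + \tfrac{1}{2}\,P(\text{neither} \mid \text{orderable}).
```

The last identity follows because the three conditional probabilities (concordant, discordant, neither) sum to 1.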
We often want to compare the predictive powers of alternative predictors of the same
outcome Y . Newson (2002, 2006b) argues that if there is an underlying population of
trivariate data points (Wi , Xi , Yi ) and if any positive association between the Yi and
the Xi is caused by a positive association of both of these variables with the Wi , then
we must have the inequality D(X | Y ) − D(W | Y ) ≤ 0 or, equivalently, C(X | Y ) −
C(W | Y ) = {D(X | Y ) − D(W | Y )}/2 ≤ 0. This inequality still holds if the Y variable
may be censored, but not if the W or X variable may be censored. This implies that if
we have multiple alternative positive predictors of the same outcome, such as alternative
predictive scores from alternative multivariate models, then it may be useful to calculate
confidence intervals for the differences between the Somers’ D or Harrell’s C parameters
of these predictors with respect to the outcome, and then to make statements regarding
which predictors may or may not be secondary to which other predictors. In Stata,
this can be done by using lincom after the somersd command, as demonstrated in
section 4.1 of Newson (2002).
Medical researchers frequently make statements that one model predicts survival
better than another. Statistical referees acting for medical journals frequently challenge
the researchers to provide rigorous statistical justification for these statements. The
Stata postestimation command estat concordance provides estimates of Harrell’s C
and Somers’ D but provides no confidence limits for these, nor any confidence limits or
R. B. Newson 341
p-values for the differences between the values of these rank parameters from different
models. This is the case for good reason: confidence-interval formulas do not protect the
user against the consequences of finding a model in the same data in which its parameters
are then estimated.
Used sequentially, the somersd and lincom commands provide confidence limits and p-
values for differences between the Somers’ D or Harrell’s C parameters between different
predictors. However, not all medical researchers know how to calculate a confidence
interval (CI) when the predictors are scalar predictive scores from models, and fewer
still know how to do so in such a way that the confidence limits can be taken seriously.
In this article, I aim to explain how medical researchers can calculate CIs and preempt
possible queries that may arise in the process.
The remainder of this article is divided into four sections. Section 2 addresses
the queries that commonly arise when users try to duplicate the results of estat
concordance using somersd. Section 3 describes the method of splitting the data into
a training set (to which models are fit) and a test set (in which their predictive powers
are measured). Section 4 describes the extension to non-Cox survival models, such as
those described in [ST] streg. Finally, section 5 briefly explains how the methods can
be extended even further.
Three sources of confusion commonly arise when users try to duplicate the results of
estat concordance by using somersd:
1. The predict command, used after stcox, by default produces a negative prediction
score, in contrast to the positive prediction score produced by using predict
after most estimation commands.
2. The default coding of a censorship status variable for stcox is different from the
coding of a censorship status variable for somersd.
3. The treatment of tied failure times by estat concordance is different from that
used by somersd.
There are solutions to all of these problems, and I will demonstrate them, enabling
users to use somersd and estat concordance as checks on one another.
Let’s start the demonstration by inputting the Stanford drug-trial data, fitting a
Cox model, and calling estat concordance:
. use http://www.stata-press.com/data/r11/drugtr
(Patient Survival in Drug Trial)
. stset
-> stset studytime, failure(died)
failure event: died != 0 & died < .
obs. time interval: (0, studytime]
exit on or before: failure
48 total obs.
0 exclusions
. estat concordance
Harrell’s C concordance statistic
failure _d: died
analysis time _t: studytime
Number of subjects (N) = 48
Number of comparison pairs (P) = 849
Number of orderings as expected (E) = 679
Number of tied predictions (T) = 15
Harrell’s C = (E + T/2) / P = .8086
Somers’ D = .6172
The stset command shows us that the input dataset has already been set up as
a survival-time dataset that includes one observation per drug-trial subject as well as
data on survival time and termination modes, among other things (see [ST] stset).
The Cox model contains two predictive covariates, age (subject age in years) and drug
(indicating treatment group, with a value of 0 for placebo and a value of 1 for the
drug being tested). We then see that, according to estat concordance, Harrell’s C is
0.8086 and Somers’ D is 0.6172. The Somers’ D implies that when one of two subjects is
observed to survive another, the model predicts that the survivor is 61.72% more likely
to have a lower hazard ratio than the nonsurvivor. The Harrell’s C is the probability
that the survivor has the lower hazard ratio plus half the (possibly negligible) probability
that the two subjects have equal hazard ratios, and this sum is 80.86% on a percentage
scale.
We will now see how to duplicate these estimates by using predict and somersd.
We start by defining a negative predictor of lifetime by using predict to calculate a
hazard ratio. We then derive an inverse hazard ratio, which we expect to be a positive
predictor of lifetime:
. predict hr
(option hr assumed; relative hazard)
. generate invhr=1/hr
This strategy addresses the first of the three sources of confusion mentioned before.
Addressing the second source of confusion, we need to define a censorship indicator
for input to the somersd command. The somersd command has a cenind() option
that requires a list of censorship indicators. These censorship indicators are allocated
one-to-one to the corresponding variables of the variable list input to somersd and must
be either variable names or zeros (implying a censorship indicator variable whose values
are all zero). Censorship indicator variables for somersd are positive in observations
where the corresponding input variable value is right-censored (or known to be equal to
or greater than its stated value), are negative in observations where the corresponding
input variable value is left-censored (or known to be equal to or less than its stated
value), and are zero in observations where the corresponding input variable value is
uncensored (or known to be equal to its stated value). If the list of censorship indicators
is shorter than the input variable list, then the list of censorship indicators is extended
on the right with zeros, implying that the variables without censorship indicators are
uncensored.
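To make the one-to-one allocation concrete, consider the following hypothetical call (a sketch, not one of the article's own listings) comparing two predictors of _t, where only one censorship indicator is supplied:

```stata
* Sketch: censind is allocated to _t, the first variable in the list;
* the cenind() list is then extended on the right with zeros, so the
* predictors invhr and age are both treated as uncensored.
somersd _t invhr age if _st==1, cenind(censind) tdist transf(c)
```

Here censind and invhr are the variables defined in the surrounding text, and age is a covariate from the example dataset.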
This coding scheme is not the same as that for the censorship indicator variable _d
that is created by the stset command, which is 1 in observations where the corresponding
lifetime is uncensored and is 0 in observations where the corresponding lifetime is
right-censored.
To convert an stset censorship indicator variable to a somersd censorship indicator
variable, we use the command
. generate censind=1-_d if _st==1
This command creates a new variable, censind, which assumes the following values:
missing in observations excluded from the survival sample, as indicated by the variable
_st created by stset; 1 in observations with right-censored lifetimes (where _d is 0);
and 0 in observations with uncensored lifetimes (where _d is 1).
We can now use somersd to calculate Harrell’s C and Somers’ D, using the transf(c)
option for Harrell’s C and the transf(z) option (indicating the normalizing and
variance-stabilizing Fisher’s z or hyperbolic arctangent transformation) for Somers’ D:
(output omitted: somersd jackknife confidence intervals for Harrell’s C and for the
z-transformed Somers’ D)
In both cases, we use the survival-time variable t, the survival sample indicator
st (created by stset), and the inverse hazard rate invhr (created using predict)
to estimate rank parameters of the inverse hazard ratio with respect to survival time
(censored by censorship status). In the case of Harrell’s C, the estimated parameter
is on a scale from 0 to 1 and is expected to be at least 0.5 for a positive predictor of
lifetime, such as an inverse hazard ratio. In the case of Somers’ D, the untransformed
parameter is on a scale from −1 to 1 and is expected to be at least 0 for a positive
predictor of lifetime.
However, we now encounter the third source of confusion mentioned before. If we
compare the estimates here to those produced earlier by estat concordance, we find
that the estimates for Harrell’s C and Somers’ D are similar but not exactly the same.
The estimates are 0.8106 and 0.6213, respectively, when computed by somersd, and
0.8086 and 0.6172, respectively, when computed by estat concordance. The reason
for this discrepancy is that somersd and estat concordance have different policies
for comparing two lifetimes that terminate simultaneously when one lifetime is right-
censored and the other is uncensored. The estat concordance policy assumes that
the owner of the right-censored lifetime survived the owner of the uncensored lifetime,
whereas the somersd policy assumes that neither of the two owners can be said to have
survived the other. In the case of a drug trial, one subject might be known to have
died in a certain month, whereas another might be known to have left the country in
the same month and has therefore become lost to follow-up. The estat concordance
policy assumes that the second subject must have survived the first, which might be
probable, given that this second subject seems to have been in a fit state to travel out
of the country. The somersd policy, more cautiously, allows the possibility that the
second subject may have left the country early in the month and died unexpectedly of
a venous thromboembolism on the outbound plane, whereas the first subject may have
died under observation of the trial organizers later in the same month.
Whatever the merits of the two policies, we might still like to show that somersd
and estat concordance can be made to duplicate one another’s estimates. This can
easily be done if lifetimes are expressed as whole numbers of time units, as they are
in the Stanford drug trial data, where lifetimes are expressed in months. In this case,
we can add half a unit to right-censored lifetimes only. As a result, right-censored
lifetimes become greater than uncensored lifetimes terminating within the same time
unit without affecting any other orderings of lifetimes.
In our example, we do this by generating a new lifetime variable, studytime2, that
is equal to the modified survival time. We then use stset to reset the various survival-
time variables and characteristics so that the modified survival time is now used. This
step is done after using the assert command to check that the old studytime variable
is indeed integer-valued; see [D] assert and [D] functions. We then proceed as in the
previous example:
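The data-preparation commands themselves are truncated from this listing; reconstructed from the description above, they would take roughly the following form (a sketch, not the article's verbatim code):

```stata
. assert studytime==int(studytime)                  // check lifetimes are whole months
. generate studytime2 = studytime + 0.5*(died==0)   // add half a month to right-censored lifetimes only
. stset studytime2, failure(died)                   // reset the survival-time characteristics
```

The indicator expression (died==0) evaluates to 1 for right-censored lifetimes and 0 otherwise, so uncensored lifetimes are left unchanged.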
48 total obs.
0 exclusions
. estat concordance
Harrell’s C concordance statistic
failure _d: died
analysis time _t: studytime2
Number of subjects (N) = 48
Number of comparison pairs (P) = 849
Number of orderings as expected (E) = 679
Number of tied predictions (T) = 15
Harrell’s C = (E + T/2) / P = .8086
Somers’ D = .6172
. predict hr
(option hr assumed; relative hazard)
. generate invhr=1/hr
. generate censind=1-_d if _st==1
. somersd _t invhr if _st==1, cenind(censind) tdist transf(c)
Somers’ D with variable: _t
Transformation: Harrell’s c
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for Harrell’s c
(output omitted: jackknife coefficient tables for Harrell’s c and for the z-transformed
Somers’ D)
This time, the model fit produces the same output as before, and the command
estat concordance produces the same estimates as it did before of 0.8086 and 0.6172
for Harrell’s C and Somers’ D, respectively. But now the same estimates of 0.8086 and
0.6172 are also produced by somersd, at least after rounding to four decimal places.
It should be stressed that Harrell’s C and Somers’ D, computed as above either
by somersd or by estat concordance, are valid measures of the predictive power of a
survival model only if there are no time-dependent covariates or lifetimes with delayed
entries. However, if somersd (instead of estat concordance) is used, then sensible
estimates can still be produced with weighted data, so long as those weights are explicitly
supplied to somersd.
stratified subsets, respectively. We will use the somersd policy, rather than the estat
concordance policy, regarding tied censored and noncensored lifetimes.
    testset |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         24       50.00       50.00
          1 |         24       50.00      100.00
------------+-----------------------------------
      Total |         48      100.00
We see that there are 24 patient lifetimes in the training set (where testset==0)
and 24 in the test set (where testset==1). We then fit the three Cox models to the
training set and create the inverse hazard-rate variables invhr1, invhr2, and invhr3
for models 1, 2 and 3, respectively:
. predict hr1
(option hr assumed; relative hazard)
. generate invhr1=1/hr1
. stcox drug if testset==0
failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -36.900079
Iteration 1: log likelihood = -32.692209
Iteration 2: log likelihood = -32.647379
Iteration 3: log likelihood = -32.647309
Refining estimates:
Iteration 0: log likelihood = -32.647309
Cox regression -- Breslow method for ties
No. of subjects = 24 Number of obs = 24
No. of failures = 14
Time at risk = 370
LR chi2(1) = 8.51
Log likelihood = -32.647309 Prob > chi2 = 0.0035
. predict hr2
(option hr assumed; relative hazard)
. generate invhr2=1/hr2
. predict hr3
(option hr assumed; relative hazard)
. generate invhr3=1/hr3
The variables invhr1, invhr2, and invhr3 are defined for all observations, both in
the training set and in the test set. We then define the censorship indicator, as before,
and estimate the Harrell’s C indexes in the test set for all three models fit to the training
set:
(output omitted: jackknife coefficient table for the Harrell’s C estimates of invhr1,
invhr2, and invhr3)
We see that Harrell’s C of inverse hazard ratio with respect to lifetime is 0.8819 for
model 1 (using both drug treatment and age), 0.7917 for model 2 (using drug treatment
only), and 0.6366 for model 3 (using age only). All of these estimates have confidence
limits, which are probably less unreliable than the ones we saw in the previous section.
However, the sample Harrell’s C is likely to have a skewed distribution in the
presence of such strong positive associations, for the same reasons as Kendall’s τa (see
Daniels and Kendall [1947]). Differences between Harrell’s C indexes are likely to have
a less-skewed sampling distribution and are also what we probably really wanted to
know. We estimate these differences with lincom, as follows:
. lincom invhr1-invhr2
( 1) invhr1 - invhr2 = 0
. lincom invhr1-invhr3
( 1) invhr1 - invhr3 = 0
. lincom invhr2-invhr3
( 1) invhr2 - invhr3 = 0
Model 1 seems to have a slightly higher predictive power than model 2 or (especially)
model 3, while the difference between model 2 and model 3 is slightly less convincing.
We can also do the same comparison using Somers’ D rather than Harrell’s C, by using
the normalizing and variance-stabilizing z transform, recommended by Edwardes (1995)
and implemented using the somersd option transf(z). In that case, the differences
between the predictive powers of the different models will be expressed in z units (not
shown).
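In outline, that z-based comparison would be run along the following lines (a sketch reusing the variables defined above; the article does not show the output):

```stata
. somersd _t invhr1 invhr2 invhr3 if testset==1, cenind(censind) tdist transf(z)
. lincom invhr1-invhr2    // difference in z-transformed Somers' D, model 1 vs. model 2
. lincom invhr1-invhr3
. lincom invhr2-invhr3
```

The lincom differences are then expressed in z units, as the text notes.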
sort it back to its original order. We sort as follows, using the xtile command to define
age groups (see [D] pctile):
0 11 9 20
1 16 12 28
Total 27 21 48
. sort drug agegp ranord, stable
. by drug agegp: generate testset=mod(_n,2)
. sort oldord
. table testset drug agegp, row col scol
0 5 8 13 4 6 10 9 14 23
1 6 8 14 5 6 11 11 14 25
Total 11 16 27 9 12 21 20 28 48
This time, the training set is slightly smaller than the test set because of odd total
numbers of subjects in sampling strata. We then carry out the model fitting in the
training set and the calculation of inverse hazard ratios in both sets using the same
command sequence as with the completely random training and test sets, producing
mostly similar results, which are not shown. Finally, we estimate the Harrell’s C indexes
in the test set:
(output omitted: jackknife coefficient table for the Harrell’s C estimates in the
stratified test set)
The C estimates for the three models are not dissimilar to the previous ones with
completely random training and test sets. Their pairwise differences are as follows:
. lincom invhr1-invhr2
( 1) invhr1 - invhr2 = 0
. lincom invhr1-invhr3
( 1) invhr1 - invhr3 = 0
. lincom invhr2-invhr3
( 1) invhr2 - invhr3 = 0
Model 1 (with drug treatment and age) still seems to predict better than model 3
(with age alone). This conclusion is similar if we compare the z-transformed Somers’ D
values, which are not shown.
predict (see [ST] streg postestimation). These output variables may predict survival
times positively or negatively on an ordinal scale and may include median survival times,
mean survival times, median log survival times, mean log survival times, hazards, hazard
ratios, or linear predictors.
We will briefly demonstrate the principles involved by fitting Gompertz models to
the survival dataset that we used in previous sections. The Gompertz model assumes an
exponentially increasing (or decreasing) hazard rate, and the linear predictor is the log
of the zero-time baseline hazard rate, whereas the rate of increase (or decrease) in hazard
rate, after time zero, is a nuisance parameter. Therefore, if the Gompertz model is true,
then so is the Cox model. However, the argument of Fisher (1935) presumably implies
that if the Gompertz model is true, then we can be no less efficient, asymptotically, by
fitting a Gompertz model instead of a Cox model. We will use the predicted median
lifetime as the positive predictor, whose predictive power will be assessed using somersd.
We start by inputting the cancer trial dataset and defining the stratified, semirandom
training and test sets, exactly as we did in section 3.2. We then fit to the training
set Gompertz models 1, 2, and 3, containing, respectively, both drug treatment and
age, drug treatment only, and age only. After fitting each of the three models, we
use predict to compute the predicted median survival time for the whole sample,
deriving the alternative positive lifetime predictors medsurv1, medsurv2, and medsurv3
for models 1, 2, and 3, respectively:
. streg drug age if testset==0, distribution(gompertz) nolog
failure _d: died
analysis time _t: studytime
Gompertz regression -- log relative-hazard form
No. of subjects = 23 Number of obs = 23
No. of failures = 15
Time at risk = 338
LR chi2(2) = 20.62
Log likelihood = -14.076214 Prob > chi2 = 0.0000
. predict medsurv1
(option median time assumed; predicted median time)
. predict medsurv2
(option median time assumed; predicted median time)
. streg age if testset==0, distribution(gompertz) nolog
failure _d: died
analysis time _t: studytime
Gompertz regression -- log relative-hazard form
No. of subjects = 23 Number of obs = 23
No. of failures = 15
Time at risk = 338
LR chi2(1) = 5.56
Log likelihood = -21.606438 Prob > chi2 = 0.0184
. predict medsurv3
(option median time assumed; predicted median time)
Unsurprisingly, the fitted parameters are not dissimilar to the corresponding param-
eters for the Cox regression. We then compute the censorship indicator censind, and
then the Harrell’s C indexes, for the test set:
(output omitted: jackknife coefficient table for the Harrell’s C estimates of medsurv1,
medsurv2, and medsurv3)
We then compare the Harrell’s C parameters for the alternative median survival
functions, using lincom, just as before:
. lincom medsurv1-medsurv2
( 1) medsurv1 - medsurv2 = 0
. lincom medsurv1-medsurv3
( 1) medsurv1 - medsurv3 = 0
. lincom medsurv2-medsurv3
( 1) medsurv2 - medsurv3 = 0
Unsurprisingly, the conclusions for the Gompertz model are essentially the same as
those for the Cox model.
5 Further extensions
The use of Harrell’s C and Somers’ D in test sets to compare the power of models
fit to training sets can be extended further to nonsurvival regression models. In this
case, life is even simpler because we do not have to define a censorship indicator such
as censind for input to somersd. The predictive score is still computed using out-of-
sample prediction and can be either the fitted regression value or the linear predictor
(if one exists in the model).
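For example, with an uncensored outcome and two nested linear models, the out-of-sample comparison might be sketched as follows (hypothetical variable names; note that no cenind() option is needed):

```stata
. regress y x1 x2 if testset==0     // model 1, fit to the training set
. predict score1                    // out-of-sample linear predictor for model 1
. regress y x1 if testset==0       // model 2
. predict score2
. somersd y score1 score2 if testset==1, tdist transf(c)
. lincom score1-score2             // difference between Harrell's C indexes
```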
The methods presented so far have the limitation that the Harrell’s C and Somers’ D
parameters that we calculated estimate only the ordinal predictive power (in the pop-
ulation from which the training and test sets were sampled) of the precise model that
we fit to the training set. We might prefer to estimate the mean predictive power that
we can expect (in the whole universe of possible training and test sets) using the same
set of alternative models. Bootstrap-like methods for doing this, involving repeated
splitting of the same sample into training and test sets, are described in Harrell et al.
(1982) and Harrell, Lee, and Mark (1996).
Another limitation of the methods presented here, as mentioned at the end of sec-
tion 2, is that they should not usually be used with models with time-dependent co-
variates. This is because the predicted variable input to somersd, which the alternative
predictive scores are competing to predict, is the length of a lifetime rather than an
event of survival or nonsurvival through a minimal time interval, such as a day. A
predictor variable for such a lifetime must therefore stay constant, at least through that
lifetime, which rules out functions of continuously varying time-dependent covariates.
In Stata, survival-time datasets may have multiple observations for each subject
with a lifetime, representing multiple sublifetimes. Discretely varying time-dependent
covariates, which remain constant through a sublifetime, can also be included in such
datasets. somersd can therefore be used when these conditions are met: the model
is a Cox regression, the time-dependent covariates vary only discretely, the multiple
sublifetimes are the times spent by a subject in an age group, and each subject becomes
at risk at the start of each age group to which she or he survives. If the subject
identifier variable is named subid, and the age group for each sublifetime is represented
by a discrete variable agegp, then the user may use somersd with cluster(subid)
funtype(bcluster) wstrata(agegp) to calculate Somers’ D or Harrell’s C estimates
restricted to comparisons between sublifetimes of different subjects in the same age
group. See Newson (2006b) for details of the options for somersd, and see [ST] stset
for details on survival-time datasets.
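Under those conditions, the somersd call might look like the following sketch (assuming censind and invhr are defined per sublifetime, as in section 2):

```stata
* Sketch: compare only sublifetimes of different subjects in the same age group
somersd _t invhr if _st==1, cenind(censind) cluster(subid) funtype(bcluster) wstrata(agegp) tdist transf(c)
```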
If the user has access to sufficient data-storage space, then the age groups can be
defined finely (as subject-years or even subject-days), and the discretely time dependent
covariates might therefore be very nearly continuously time-dependent. Any training
sets or test sets in this case should, of course, be sets of subjects rather than sets of
lifetimes.
6 Acknowledgments
I would like to thank Samia Mora, MD, of Partners HealthCare, for sending me the
query that prompted me to write this article. I also thank the many other Stata users
who have also contacted me over the past few years with essentially similar queries on
how to use somersd to compare the predictive powers of survival models.
7 References
Daniels, H. E., and M. G. Kendall. 1947. The significance of rank correlation where
parental correlation exists. Biometrika 34: 197–208.
Edwardes, M. D. 1995. A confidence interval for Pr(X < Y ) − Pr(X > Y ) estimated
from simple cluster samples. Biometrics 51: 571–578.
Fisher, R. A. 1935. The logic of inductive inference. Journal of the Royal Statistical
Society 98: 39–82.
Harrell, F. E., Jr., K. L. Lee, and D. B. Mark. 1996. Multivariable prognostic models:
Issues in developing models, evaluating assumptions and adequacy, and measuring
and reducing errors. Statistics in Medicine 15: 361–387.
Newson, R. B. 2006a. Confidence intervals for rank statistics: Percentile slopes, differences,
and ratios. Stata Journal 6: 497–520.
———. 2006b. Confidence intervals for rank statistics: Somers’ D and extensions. Stata
Journal 6: 309–334.
———. 2006c. Efficient calculation of jackknife confidence intervals for rank statistics.
Journal of Statistical Software 15: 1–10.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.
Abstract. Modern genetics studies require the use of many specialty software
programs for various aspects of the statistical analysis. PHASE is a program often
used to reconstruct haplotypes from genotype data, and Haploview is a program
often used to visualize and analyze single nucleotide polymorphism data. Three
new commands are described for performing these three steps: 1) exporting geno-
type data stored in Stata to PHASE, 2) importing the resulting inferred haplotypes
back into Stata, and 3) exporting the haplotype/single nucleotide polymorphism
data from Stata to Haploview.
Keywords: st0199, phaseout, phasein, haploviewout, genetics, haplotypes, SNPs,
PHASE, haploview
1 Introduction
For a variety of reasons, including favorable power for detecting small effects and
the low cost of genotyping, association studies based on single nucleotide polymor-
phism (SNP, pronounced “snip”) markers have become common in genetic epidemiology
(Cordell and Clayton 2005). SNP markers are positions along a chromosome that can
have four forms called alleles: adenine, cytosine, guanine, and thymine, which are de-
noted A, C, G, and T, respectively. Humans are diploid organisms, meaning that we
have two copies of each of our chromosomes; thus each SNP is composed of a pair of
alleles called a genotype.
For example, a SNP might have an adenine (A) molecule on one chromosome paired
with a cytosine (C) molecule on the other chromosome. This is often described as an
A/C genotype. When two SNP markers are physically close to one another, a pair of
alleles found on the same chromosome forms a haplotype. For example, a person might
have an A/C genotype for SNP1 and a G/T genotype for SNP2. If the A allele from SNP1
and the G allele from SNP2 are physically located on the same chromosome, they are
said to form an AG haplotype. Similarly, the C allele from SNP1 and the T allele from
SNP2 would form a CT haplotype.
© 2010 StataCorp LP st0199
360 Using Stata with PHASE and Haploview
It has been shown that association studies based on haplotypes are often more pow-
erful than similar studies based on individual SNPs (Akey, Jin, and Xiong 2001). Unfor-
tunately, haplotypes are not observed directly using typical low-cost, high-throughput
laboratory techniques. However, haplotypes can be inferred statistically based on the
observed genotypes.
David G. Clayton of the University of Cambridge has written a useful command for
Stata (snp2hap) that infers haplotypes for pairs of SNPs. In theory, this program could
be used iteratively to infer haplotypes across many SNPs. However, several sophisticated
algorithms have been developed for statistically inferring haplotypes from many SNP
genotypes simultaneously. These algorithms and the software that implement them have
been reviewed and compared elsewhere (Marchini et al. 2006; Stephens and Donnelly
2003; The International HapMap Consortium 2005). In most comparisons, the algo-
rithm used in the PHASE program (Stephens, Smith, and Donnelly 2001) was found to
be the most accurate and is arguably the most frequently used.
Rather than attempt the daunting task of creating a Stata command to imple-
ment the algorithm used in PHASE, a Stata command (phaseout) was developed for
exporting genotype data stored in Stata to an ASCII file formatted as a PHASE input
file. A second program (phasein) was developed to import the inferred haplotype data
back into Stata for subsequent association analyses with programs such as haplologit
(Marchenko et al. 2008). These commands use a group of Stata’s low-level file com-
mands including file open, file write, file read, and file close.
Once the haplotypes have been inferred for a set of genotypes, one would often like
to know certain attributes of the haplotypes. For example, the alleles of some pairs of
SNPs along a haplotype may tend to be transmitted together from parent to offspring
more frequently than alleles of other pairs of SNPs. This phenomenon, known as linkage
disequilibrium (Devlin and Risch 1995), is often quantified by the r² or D′ statistics.
Similarly, some contiguous groups of SNPs, often called haplotype blocks, may exhibit
high levels of pairwise linkage disequilibrium (Gabriel et al. 2002; Goldstein 2001). High
levels of linkage disequilibrium between two SNPs indicate that much of their statistical
information is redundant, so both SNPs are not necessary for association analyses. One
of the SNPs, called a tagSNP (Zhang et al. 2004), can be selected using one of several
algorithms. A tagSNP can be used in place of the group of redundant SNPs. Typically,
there are several tagSNPs in a group of contiguous SNPs found on a chromosome.
Haploview (Barrett et al. 2005) is a popular software package used for calculating
and visualizing the linkage disequilibrium statistics r2 and D′, as well as for identifying
haplotype blocks and tagSNPs. The new Stata haploviewout program exports haplo-
type data from Stata to a pair of ASCII files formatted as Haploview input files: a haps
format data file and a haps format locus information file.
The dataset used for the following examples was downloaded from the SeattleSNPs
website (SeattleSNPs 2009) and was modified to include missing data. Genotypes for
47 individuals of African and European descent include 22 SNPs from the vascular
endothelial growth factor (VEGF) gene located on chromosome six.
J. C. Huber Jr. 361
The input file for PHASE requires the data to be formatted in an ASCII file that
contains header information about the number of samples and the number and types of
markers (SNP or multiallelic), as well as the actual data:
The phaseout command calculates the header information, converts the ID and
genotype data to rows, and writes these data to the ASCII file. The types of markers—
SNPs or multiallelic markers—are determined automatically by tabulating the genotypes
and by examining the length of the genotype in the first record. If a marker has three
or fewer distinct genotypes (for example, C/C, C/T, T/T) and the genotype in the first
record is fewer than five characters long, the marker is treated as a SNP. All other
markers are treated as multiallelic.
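The classification rule just described can be sketched as follows. This is a hypothetical helper written for illustration, not the phaseout ado-code; the function name and the five-character reading of the length cutoff are my assumptions.

```python
# Sketch of the marker-type rule: a marker is treated as a SNP if it has at
# most three distinct genotypes AND the first record's genotype string is
# shorter than five characters; otherwise it is treated as multiallelic.
# (classify_marker is a hypothetical name, not part of phaseout.)
def classify_marker(genotypes):
    distinct = set(g for g in genotypes if g)          # tabulate the genotypes
    if len(distinct) <= 3 and len(genotypes[0]) < 5:   # e.g., "C/T" has length 3
        return "S"                                     # SNP
    return "M"                                         # multiallelic

print(classify_marker(["C/C", "C/T", "T/T", "C/T"]))
print(classify_marker(["101/105", "101/101", "103/105", "99/105"]))
```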
2.1 Syntax
phaseout SNPlist, idvariable(varname) filename(filename) [missing(string)
separator(string) positions(string)]
2.2 Options
idvariable(varname) is required to specify the variable that contains the individual
identifiers.
filename(filename) is required to name the ASCII file that will be created. It is con-
ventional, though not necessary, to name PHASE input files with the extension .inp.
missing(string) may be used to provide a list of genotypes that indicate missing data.
For example, missing data might be included in the dataset as X/X for SNPs and
as 999/999 for multiallelic markers. Multiple missing values may be specified by
placing a space between them (for example, missing("X/X 9/9 999/999")). PHASE
requires missing SNP alleles to be coded as “?” and missing multiallelic alleles to
be coded as “−1”. It is not necessary to preprocess your data because phaseout
will automatically convert each genotype contained in the missing() list to its
appropriate PHASE missing value.
separator(string) specifies the separator to use when storing genotype data. Genotype
data are often stored with a separator between the two alleles. For data stored in
the format C/G, the separator() option would look like separator("/"). If SNP
data are stored without a separator (for example, CG) then the separator() option
is unnecessary, and phaseout will assume that the left character is allele 1 and the
right character is allele 2.
positions(string) provides a list of the marker positions for use by PHASE when infer-
ring haplotypes from the genotype data. If the positions() option is not specified,
PHASE will assume that the markers are equally spaced.
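The missing() and separator() conventions described above can be sketched together. This is an illustrative fragment under assumed names, not phaseout's implementation: genotypes in the missing() list become PHASE's "?" (SNP) or "-1" (multiallelic) codes, and alleles are split on the separator when one is given, else by character position.

```python
# Hypothetical helpers (not phaseout's ado-code) illustrating the two rules.
def split_alleles(genotype, sep=None):
    if sep:
        return genotype.split(sep)
    return [genotype[0], genotype[1]]   # left char = allele 1, right = allele 2

def to_phase_alleles(genotype, missing_codes, is_snp, sep=None):
    if genotype in missing_codes:
        return ["?", "?"] if is_snp else ["-1", "-1"]
    return split_alleles(genotype, sep)

print(to_phase_alleles("C/G", {"X/X"}, is_snp=True, sep="/"))
print(to_phase_alleles("X/X", {"X/X"}, is_snp=True, sep="/"))
print(to_phase_alleles("CG", {"XX"}, is_snp=True))
```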
2.4 Examples
Markers and positions may be specified in the command itself:
BEGIN BESTPAIRS1
0 D001
C C T
C T T
......
......
0 E023
C C T
C C T
END BESTPAIRS1.
The data are imported into Stata in “long” format with one row per chromosome
(two rows per ID). The haplotypes are imported into a variable named haplotype, and
each of the markers that make up the haplotype is saved in an individual variable.
If the markers() option is specified, the marker variables will be renamed using their
original names.
. list id haplotype rs1413711 rs3024987 rs3024989 in 1/2
1. D001 CCT C C T
2. D001 CTT C T T
If the positions() option is used, the positions will be placed in the variable label
of each marker variable:
. describe

Contains data from VEGF_Haplotypes.dta
  obs:            94
 vars:             5                     8 Jul 2010 13:09
 size:         1,692 (99.9% of memory free)

              storage  display     value
variable name   type    format      label      variable label

id              str4    %9s
haplotype       str3    %9s
rs1413711       str1    %9s                    position=674
rs3024987       str1    %9s                    position=836
rs3024989       str1    %9s                    position=1955

Sorted by:
3.1 Syntax
PhaseOutputFile is the name of the PHASE output file that contains the inferred haplo-
types. It will have the file extension .out.
3.2 Options
markers(filename) allows the user to specify an ASCII file that contains the names of
the markers included in the haplotype. If the original genotype data were exported
to PHASE using the phaseout command, the marker names will be automatically
saved to a file named MarkerList.txt. If that is the case, then the option would
be markers("MarkerList.txt"). Alternatively, the user can save a space-delimited
list of marker names in an ASCII file and use the markers("filename.txt") option.
positions(filename) allows the user to specify an ASCII file that contains the posi-
tions of the markers. If the original genotype data were exported to PHASE us-
ing the phaseout command, the marker positions will be automatically saved to
a file named PositionList.txt. If that is the case, then the option would be
positions("PositionList.txt"). Alternatively, the user can save a space-delimit-
ed list of marker positions in an ASCII file and use the positions("filename.txt")
option.
3.3 Examples
Using the default files created by phaseout:
D001 D001 2 2 4
D001 D001 2 4 4
D002 D002 2 2 4
D002 D002 4 2 4
The file filename_MarkerInput.txt contains the marker names and positions in
two columns:
rs1413711 674
rs3024987 836
rs3024989 1955
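A two-column locus-information file of this kind is straightforward to parse; the following is a sketch for illustration only, not part of the commands described here.

```python
# Parse lines of "markername position" into (name, position) pairs.
lines = ["rs1413711 674", "rs3024987 836", "rs3024989 1955"]
markers = [(name, int(pos)) for name, pos in (line.split() for line in lines)]
print(markers)
```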
4.1 Syntax
SNPlist is a list of SNP variables in long format (that is, one row per chromosome).
If your data are in wide format, you can convert them to long format by using the
reshape command.
Haploview will not accept multiallelic markers.
4.2 Options
idvariable(varname) is required to specify the variable that contains the individual
identifiers.
filename(filename) is required to name the two ASCII files that will be created. Those
files will have the extensions _DataInput.txt and _MarkerInput.txt appended
to filename. For example, the filename("VEGF") option will create a file named
VEGF_DataInput.txt and a file named VEGF_MarkerInput.txt. To open the files in
Haploview, select File > Open new data and click on the tab labeled Haps For-
mat. Click on the Browse button next to the box labeled Data File and select the
file VEGF_DataInput.txt. Next click on the Browse button next to the box labeled
Locus Information File and select the file VEGF_MarkerInput.txt.
366 Using Stata with PHASE and Haploview
positions(string) allows the user to specify a space-delimited list of the marker posi-
tions.
familyid(variable) allows the user to specify the variable that contains family identi-
fiers if relatives are included in the dataset. If familyid() is omitted, the
idvariable() will be automatically substituted for the familyid().
poslabel will automatically extract the SNP positions from the variable label of each
SNP if the haplotype data were created using the commands phaseout and phasein.
The positions for each marker are stored in the variable label of each SNP.
4.3 Examples
Using the default files created by phaseout:
5 Discussion
Many young and rapidly evolving fields of inquiry, including genetic association studies,
use a variety of boutique software packages. While it would be very convenient to have
Stata commands that accomplish the same tasks, the time and programming expertise
required make this impractical. However, a suite of commands that
allows easy exporting and importing of data from Stata to other specialized software
seems to be an efficient way for Stata users to accomplish specialized analytical tasks.
6 Acknowledgments
This work was supported in part by grant 1 R01 DK073618-02 from the National In-
stitute of Diabetes and Digestive and Kidney Diseases and by grant 2006-35205-16715
from the United States Department of Agriculture. The author would like to thank
Drs. Loren Skow, Krista Fritz, and Candice Brinkmeyer-Langford of the Texas A&M
College of Veterinary Medicine and Roger Newson of the Imperial College London for
their very useful feedback.
7 References
Akey, J., L. Jin, and M. Xiong. 2001. Haplotypes vs single marker linkage disequilibrium
tests: What do we gain? European Journal of Human Genetics 9: 291–300.
Barrett, J. C., B. Fry, J. Maller, and M. J. Daly. 2005. Haploview: Analysis and
visualization of LD and haplotype maps. Bioinformatics 21: 263–265.
Cordell, H. J., and D. G. Clayton. 2005. Genetic association studies. Lancet 366:
1121–1131.
Devlin, B., and N. Risch. 1995. A comparison of linkage disequilibrium measures for
fine-scale mapping. Genomics 29: 311–322.
Stephens, M., and P. Donnelly. 2003. A comparison of Bayesian methods for haplotype
reconstruction from population genotype data. American Journal of Human Genetics
73: 1162–1169.
Stephens, M., N. J. Smith, and P. Donnelly. 2001. A new statistical method for haplo-
type reconstruction from population data. American Journal of Human Genetics 68:
978–989.
The International HapMap Consortium. 2005. A haplotype map of the human genome.
Nature 437: 1299–1320.
Zhang, K., Z. S. Qin, J. S. Liu, T. Chen, M. S. Waterman, and F. Sun. 2004. Haplotype
block partitioning and tag SNP selection using genotype data and their applications
to association studies. Genome Research 14: 908–916.
Abstract. A new Stata command, simsum, analyzes data from simulation studies.
The data may comprise point estimates and standard errors from several analysis
methods, possibly resulting from several different simulation settings. simsum can
report bias, coverage, power, empirical standard error, relative precision, average
model-based standard error, and the relative error of the standard error. Monte
Carlo errors are available for all of these estimated quantities.
Keywords: st0200, simsum, simulation, Monte Carlo error, normal approximation,
sandwich variance
1 Introduction
Simulation studies are an important tool for statistical research (Burton et al. 2006), but
they are often poorly reported. In particular, to understand the role of chance in results
of simulation studies, it is important to estimate the Monte Carlo (MC) error, defined
as the standard deviation of an estimated quantity over repeated simulation studies.
However, this error is often not reported: Koehler, Brown, and Haneuse (2009) found
that of 323 articles reporting the results of a simulation study in Biometrics, Biometrika,
and the Journal of the American Statistical Association in 2007, only 8 articles reported
the MC error.
This article describes a new Stata command, simsum, that facilitates analyses of
simulated data. simsum analyzes simulation studies in which each simulated dataset
yields point estimates by one or more analysis methods. Bias, empirical standard error
(SE), and precision relative to a reference method can be computed for each method. If,
in addition, model-based SEs are available, then simsum can compute the average model-
based SE, the relative error in the model-based SE, the coverage of nominal confidence
intervals, and the power to reject a null hypothesis. MC errors are available for all
estimated quantities.
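As a rough numerical illustration of the quantities listed above (a sketch with made-up numbers and assumed variable names, not the simsum implementation):

```python
# One method, true beta = 0.5, five simulated datasets.
import math

beta = 0.5
b  = [0.55, 0.42, 0.51, 0.47, 0.60]    # point estimates
se = [0.06, 0.05, 0.07, 0.06, 0.05]    # model-based SEs
n = len(b)

bmean = sum(b) / n
bias  = bmean - beta                                            # bias
empse = math.sqrt(sum((bi - bmean) ** 2 for bi in b) / (n - 1)) # empirical SE
modse = math.sqrt(sum(s ** 2 for s in se) / n)                  # RMS model-based SE
z = 1.96                                                        # approx. 95% level
cover = sum(abs(bi - beta) <= z * si for bi, si in zip(b, se)) / n
power = sum(abs(bi) >= z * si for bi, si in zip(b, se)) / n

print(round(bias, 3), round(empse, 3), round(modse, 3), cover, power)
```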
© 2010 StataCorp LP st0200
370 simsum: Analyses of simulation studies including Monte Carlo error
where estvarlist is a varlist containing point estimates from one or more analysis meth-
ods.
In long format, data contain one record per analysis method per simulated dataset,
and the appropriate syntax is
simsum estvarname [if] [in], true(expression) methodvar(varname)
id(varlist) [options]
2.2 Options
Main options
true(expression) gives the true value of the parameter. This option is required for
calculations of bias and coverage.
methodvar(varname) specifies that the data are in long format and that each record
represents one analysis of one simulated dataset using the method identified by
varname. The id() option is required with methodvar(). If methodvar() is not
specified, the data must be in wide format, and each record represents all analyses
of one simulated dataset.
id(varlist) uniquely identifies the dataset used for each record, within levels of any
by-variables. This is a required option in the long format. The methodvar() option
is required with id().
se(varlist) lists the names of the variables containing the SEs of the point estimates.
For data in long format, this is a single variable.
seprefix(string) specifies that the names of the variables containing the SEs of the
point estimates be formed by adding the given prefix to the names of the variables
containing the point estimates. seprefix() may be combined with sesuffix(string)
but not with se(varlist).
I. R. White 371
sesuffix(string) specifies that the names of the variables containing the SEs of the
point estimates be formed by adding the given suffix to the names of the variables
containing the point estimates. sesuffix() may be combined with seprefix(string)
but not with se(varlist).
Data-checking options
dropbig specifies that point estimates or SEs beyond the maximum acceptable values
be dropped; otherwise, the command halts with an error. Missing values are always
dropped.
nolistbig suppresses listing of point estimates and SEs that lie outside the acceptable
limits.
listmiss lists observations with missing point estimates or SEs.
Calculation options
level(#) specifies the confidence level for coverages and powers. The default is
level(95) or as set by set level; see [R] level.
by(varlist) summarizes the results by varlist.
mcse reports MC errors for all summaries.
robust requests robust MC errors (see section 4) for the statistics empse, relprec, and
relerror. The default is MC errors based on an assumption of normally distributed
point estimates. robust is only useful if mcse is also specified.
modelsemethod(rmse | mean) specifies whether the model SE should be summarized as
the root mean squared value (modelsemethod(rmse), the default) or as the arith-
metic mean (modelsemethod(mean)).
ref(string) specifies the reference method against which relative precisions will be cal-
culated. With data in wide format, string must be a variable name. With data in
long format, string must be a value of the method variable; if the value is labeled,
the label must be used.
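The difference between the two modelsemethod() summaries can be seen numerically; a sketch with made-up SEs, not simsum's code:

```python
# Root mean squared value (the default) versus arithmetic mean of the SEs.
import math

se = [0.06, 0.05, 0.07]
rmse = math.sqrt(sum(s ** 2 for s in se) / len(se))   # modelsemethod(rmse)
mean = sum(se) / len(se)                              # modelsemethod(mean)
print(round(rmse, 4), round(mean, 4))
```

The RMS value is never smaller than the arithmetic mean and matches the variance-scale averaging used elsewhere in the formulas.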
Statistic options
If none of the following options are specified, then all available statistics are computed.
bsims reports the number of simulations with nonmissing point estimates.
sesims reports the number of simulations with nonmissing SEs.
bias estimates the bias in the point estimates.
empse estimates the empirical SE, defined as the standard deviation of the point esti-
mates.
relprec estimates the relative precision, defined as the inverse squared ratio of the
empirical SE of this method to the empirical SE of the reference method. This
calculation is slow; omitting it can reduce run time by up to 90%.
modelse estimates the model-based SE. See modelsemethod() above.
relerror estimates the proportional error in the model-based SE, using the empirical
SE as the gold standard.
cover estimates the coverage of nominal confidence intervals at the specified level.
power estimates at the specified level the power to reject the null hypothesis that the
true parameter is zero.
Output options
nolist suppresses listing of the results and is allowed only when clear or saving() is
specified.
listsep lists results using one table per statistic, giving output that is narrower and
better formatted. The default is to list the results as a single table.
format(string) specifies the format for printing results and saving summary data. If
listsep is also specified, then up to three formats may be specified: 1) for results
on the scale of the original estimates (bias, empse, and modelse), 2) for percentages
(relprec, relerror, cover, and power), and 3) for integers (bsims and sesims).
The default is the existing format of the (first) estimate variable for 1) and 2), and
%7.0f for 3).
sepby(varlist) invokes this list option when printing results.
abbreviate(#) invokes this list option when printing results.
gen(string) specifies the prefix for new variables identifying the different statistics in
the output dataset. gen() is only useful with clear or saving(). The default is
gen(stat) so that the new identifiers are, for example, statnum and statcode.
3 Example
This example is based on, but distinct from, a simulation study comparing different
ways to handle missing covariates when fitting a Cox model (White and Royston 2009).
One thousand datasets were simulated, each containing normally distributed covariates
x and z and a time-to-event outcome. Both covariates had 20% of their values deleted
independently of all other variables, so the data became missing completely at random
(Little and Rubin 2002). Each simulated dataset was analyzed in three ways. A Cox
model was fit to the complete cases (CC). Then two methods of multiple imputation using
chained equations (van Buuren, Boshuizen, and Knook 1999), implemented in Stata as
ice (Royston 2004, 2009), were used. The MI LOGT method multiply imputes the missing
values of x and z with the outcome included as log(t) and d, where t is the survival time
and d is the event indicator. The MI T method is the same except that log(t) is replaced
by t in the imputation model. The results are stored in long format, with variable
dataset identifying the simulated dataset number, string variable method identifying
the method used, variable b holding the point estimate, and variable se holding the SE.
The data start like this:
dataset method b se
1. 1 CC .7067682 .14651
2. 1 MI_T .6841882 .1255043
3. 1 MI_LOGT .7124795 .1410814
4. 2 CC .3485008 .1599879
5. 2 MI_T .4060082 .1409831
6. 2 MI_LOGT .4287003 .1358589
7. 3 CC .6495075 .1521568
8. 3 MI_T .5028701 .130078
9. 3 MI_LOGT .5604051 .1168512
• Comparing tables 4 and 6 shows that model-based SEs are close to the empirical
values. This is shown more directly in table 7.
• Table 8: Coverage of nominal 95% confidence intervals also seems fine, which is
not surprising in view of the lack of bias and good model-based SEs.
• Table 9: CC lacks power compared with MI LOGT and MI T, which is not surprising
in view of its inefficiency.
If different formatting of the results is required, the results can be loaded into mem-
ory using the clear option and can then be manipulated.
4 Formulas
Assume that the true parameter is β and that the ith simulated dataset (i = 1, . . . , n)
yields a point estimate \hat{\beta}_i with SE s_i. Define

\[
\bar{\beta} = \frac{1}{n}\sum_i \hat{\beta}_i, \qquad
V_{\hat{\beta}} = \frac{1}{n-1}\sum_i \left(\hat{\beta}_i - \bar{\beta}\right)^2,
\]
\[
\overline{s^2} = \frac{1}{n}\sum_i s_i^2, \qquad
V_{s^2} = \frac{1}{n-1}\sum_i \left(s_i^2 - \overline{s^2}\right)^2
\]
\[
n\,\mathrm{cov}\left(V_{\hat{\beta}_1}, V_{\hat{\beta}_2}\right)
\approx \mathrm{cov}\left\{\left(\hat{\beta}_1 - \beta_1\right)^2,
\left(\hat{\beta}_2 - \beta_2\right)^2\right\}
\]
\[
= \mathrm{cov}\left\{\left(\hat{\beta}_1 - \beta_1\right)^2,
E\left[\left(\hat{\beta}_2 - \beta_2\right)^2 \mid \hat{\beta}_1\right]\right\}
\]
\[
= \mathrm{cov}\left\{\left(\hat{\beta}_1 - \beta_1\right)^2,
\rho_{12}^2 \left(V_{\hat{\beta}_2}/V_{\hat{\beta}_1}\right)
\left(\hat{\beta}_1 - \beta_1\right)^2\right\}
\]

assuming that s and V_{\hat{\beta}} are approximately uncorrelated and using a further Taylor
approximation.
However, if the modelsemethod(mean) option is used, the formulas are

\[
\text{average model-based SE} \quad \bar{s} = \frac{1}{n}\sum_i s_i, \qquad
\text{MC error} = \sqrt{\frac{1}{n(n-1)}\sum_i \left(s_i - \bar{s}\right)^2}
\]

The power of a significance test at the α level is

\[
\text{power} \quad P = \frac{1}{n}\sum_i
1\left\{\left|\hat{\beta}_i\right| \ge z_{\alpha/2}\, s_i\right\}
\]

where 1(\cdot) is the indicator function, with

\[
\text{MC error} = \sqrt{P(1-P)/n}
\]
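The binomial MC error for an estimated proportion such as coverage or power is easy to check numerically; a sketch with an assumed helper name, not simsum's code:

```python
# MC error of a proportion P estimated from n simulations: sqrt(P(1-P)/n).
import math

def mc_error_prop(P, n):
    return math.sqrt(P * (1 - P) / n)

print(round(mc_error_prop(0.95, 1000), 4))
```

For example, 95% coverage estimated from 1,000 simulations carries an MC error of roughly 0.7 percentage points.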
Robust MC errors
Several of the MC errors presented above require a normality assumption. Alternative
approximations can be derived using an estimating equations method. The empirical
standard deviation, \sqrt{V_{\hat{\beta}}}, can be written as the solution θ of the equation

\[
\sum_{i=1}^{n}\left\{\left(\hat{\beta}_i - \bar{\beta}\right)^2
- \frac{n-1}{n}\,\theta^2\right\} = 0
\]
The relative precision of \hat{\beta}_2 compared with \hat{\beta}_1 can be written as the solution θ of

\[
\sum_i\left\{\left(\hat{\beta}_{1i} - \bar{\beta}_1\right)^2
- (\theta + 1)\left(\hat{\beta}_{2i} - \bar{\beta}_2\right)^2\right\} = 0
\]
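The relative-precision estimating equation has a closed-form solution: θ is the ratio of the two corrected sums of squares minus 1. A sketch with made-up numbers, not simsum's code:

```python
# theta = sum((b1i - mean(b1))^2) / sum((b2i - mean(b2))^2) - 1,
# the precision of method 2 relative to reference method 1.
b1 = [0.1, 0.4, 0.7, 0.2]   # reference method estimates
b2 = [0.3, 0.4, 0.5, 0.4]   # comparison method estimates

def css(x):                 # corrected sum of squares
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x)

theta = css(b1) / css(b2) - 1
print(round(theta, 3))
```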
Finally, as an attempt to allow for uncertainty in the sample means, we multiply the
sandwich variance by n/(n − 1). A rationale is that this agrees exactly with (1) if the
method is applied to the MC error of the bias. However, most simulation studies are
large enough that this correction is unimportant.
5 Evaluations
Most of the formulas used by simsum to compute MC errors involve approximations, so
I evaluated them in two simulation studies.
The data are now held in memory, with one record for each statistic for each of the
250 simulation studies. The statistics are identified by the values of a newly created
numerical variable statnum, and the different simulation studies are still identified by
simno. The variables bCC, bMI_LOGT, and bMI_T contain the analysis results for the three
methods. MC errors are held in variables suffixed with _mcse. In the second run, these values
are treated as ordinary output from a simulation study, and the average calculated MC
error is compared with the empirical MC error.
The 250 observations with missing values refer to the relative precisions, which are
missing for the reference method (CC). Average calculated MC errors for each statistic are
compared in table 1 with empirical MC errors. The calculated MC errors are naturally
similar to those reported in the single simulation study above (some values have been
multiplied by 1,000 for convenience). Empirical MC errors are close to the model-based
values. The only exception is for coverage, where the model-based MC errors appear
rather small for methods CC and MI LOGT. This is likely to be a chance finding, because
there is no doubt about the accuracy of the model-based MC formula for this statistic.
Table 1. Simulation study comparing three ways to handle incomplete covariates in a Cox model: Comparison of average
calculated MC error (Calc) with empirical MC error (Emp) for various statistics
1. Statistics are abbreviated as follows: Bias, bias in point estimate; EmpSE, empirical SE; RelPrec, % gain in precision relative
to method CC; ModSE, RMS model-based SE; RelErr, relative % error in SE; Cover, coverage of nominal 95% confidence interval;
Power, power of 5% level test.
2. Relative % error in average calculated SE, with its MC error in parentheses.
The 100,000 datasets were divided into 100 simulation studies each of 1,000 simu-
lated datasets. The quantities described above and their SEs were calculated for each
simulation study, except that power for testing β = 0 was not computed because this
null hypothesis was true. Finally, the empirical MC error of each quantity across simula-
tion studies was compared with the average MC error estimated within each simulation
study.
Results are shown in table 2. The calculated MC error is adequate for all quantities
except for the relative precision of LDA compared with logistic regression, for which the
calculated SE is some three times too small. This appears to be due to the nonnormal
joint distribution of the parameter estimates shown in figure 1. The robust MC errors
perform well in all cases.
Figure 1. Scatterplot of the difference βLDA − βLR against the average (βLDA + βLR)/2
in 2,000 simulated datasets
6 Discussion
I hope that simsum will help statisticians improve the reporting of their simulation
studies. In particular, I hope simsum will help them think about and report MC errors.
If MC errors are too large to enable the desired conclusions to be drawn, then it is
usually straightforward to increase the sample size, a luxury rarely available in applied
research.
For three statistics (empirical SE, and relative precision and relative error in model-
based SE), I have proposed two approximate MC error methods, one based on a normality
assumption and one based on a sandwich estimator. The MC error should only be taken
as a guide, so errors of some 10–20% in calculating the MC error are of little importance.
In most cases, both MC error methods performed adequately. However, the normality-
based MC error was about three times too small when evaluating the relative precision of
two estimators with a highly nonnormal joint distribution (figure 1). It is good practice
to examine the marginal and joint distributions of parameter estimates in simulation
studies, and this practice should be used to guide the choice of MC error method.
Other methods are available for estimating MC errors. Koehler, Brown, and Haneuse
(2009) proposed more computationally intensive techniques that are available for im-
plementation in R. Other software (Doornik and Hendry 2009) is available with an
econometric focus.
7 Acknowledgment
This work was supported by MRC grant U.1052.00.006.
8 References
Burton, A., D. G. Altman, P. Royston, and R. L. Holder. 2006. The design of simulation
studies in medical statistics. Statistics in Medicine 25: 4279–4292.
Doornik, J. A., and D. F. Hendry. 2009. Interactive Monte Carlo Experimentation in
Econometrics Using PcNaive 5. London: Timberlake Consultants Press.
Koehler, E., E. Brown, and S. J.-P. A. Haneuse. 2009. On the assessment of Monte Carlo
error in simulation-based statistical analyses. American Statistician 63: 155–162.
Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd
ed. Hoboken, NJ: Wiley.
Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.
———. 2009. Multiple imputation of missing values: Further update of ice, with an
emphasis on categorical variables. Stata Journal 9: 466–477.
van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing
blood pressure covariates in survival analysis. Statistics in Medicine 18: 681–694.
White, I. R., and P. Royston. 2009. Imputing missing covariate values for the Cox
model. Statistics in Medicine 28: 1982–1998.
1 Introduction
Barthel, Royston, and Babiker (2005) presented a menu-driven Stata program under
the generic name of ART (assessment of resources for trials) to calculate sample size and
power for complex clinical trial designs with a time-to-event or binary outcome. Briefly,
the features of ART include multiarm trials, dose–response trends, arbitrary failure-
time distributions, nonproportional hazards, nonuniform rates of patient entry, loss to
follow-up, and possible changes from allocated treatment. A full report on the method-
ology and its performance—in particular, regarding loss to follow-up, nonproportional
hazards, and treatment crossover—is given by Barthel et al. (2006).
In this article, we concentrate on a new tool that addresses a practical issue in trials
with a time-to-event outcome. Because of staggered entry of patients and the gradual
maturing of the data, the accumulation of events from the date the trial opens is a
process that occurs over a relatively long period of time and with a variable course.
Trials are planned and their resources are assigned under certain critical assumptions.
© 2010 StataCorp LP st0013_2
P. Royston and F. M.-S. Barthel 387
If those assumptions are unrealistic, timely completion of the trial may be threatened.
Because the cumulative number of events is the key indicator of trial maturity and is
the parameter targeted in the sample-size calculation, it is of considerable interest and
relevance to monitor and project this number at particular points during the trial.
The new tool is called ARTPEP (ART projection of events and power). ARTPEP
comprises an ado-file (artpep) and an associated dialog box. It works in conjunction
with the ART system, of which the latest update is included with this article.
A sample size program by Abdel Babiker, Patrick Royston & Friederike Barthel,
MRC Clinical Trials Unit, London NW1 2DA, UK.
Apart from small, unimportant differences, the protocol power (0.82) and the number
of events (673) are consistent with ART’s results.
1. You must activate the ART and ARTPEP items on the User menu by typing the
command artmenu on.
2. You must compute the relevant sample size for the trial using either the ART
dialog box or the artsurv command. This automatically sets up a global macro
called $S_ARTPEP whose contents are used by the artpep command. (A slightly
more convenient alternative with the same result is to use the ART Settings...
button on the ARTPEP dialog box to set up the necessary quantities for ART
without having to run ART or artsurv separately.)
3. To set up additional parameters that ARTPEP needs, you must use the ARTPEP
dialog box, either by typing db artpep or by selecting User > ART > Artpep
from the menu.
As a worked example, we now imagine that the OE05 trial has been running for
1 year and has accrued 100 patients so far. Assuming the survival distribution to be
correct, when may we expect to complete the trial (that is, obtain the required number
of events)? To answer this question, we complete the three steps described above. The
resulting empty dialog box is shown in figure 1.
We now explain the various items that the dialog box needs. The name of the
corresponding option for the artpep command is given in square brackets:
• ART Settings...: As already mentioned, this button may be used to set up the
parameters of an ART run if that has not been done already. It accesses the ART
dialog box.
• Patients recruited in each period so far [pts]: A “period” here is 1 year, and we
have recruited 100 patients in the first period. We therefore enter 100 for this
item.
• Additional patients to be recruited [epts]: To get to the 842 patients (we will
use 850), we hope to recruit about 150 patients per year for the next 5 years,
making a total of 6 years’ planned recruitment. We enter 150. The program knows
the period in which recruitment is to cease and, by default, repeats the number
150 over the next 5 periods. If we had expected a differing recruitment rate (say,
accelerating toward the end of the trial), we could have entered a different number
of patients to be recruited in each period.
390 Projection of power and events in clinical trials
• Number of periods over which to project [eperiods]: Let us say we wish to project
events and power over the next 10 years. We enter 10.
• Period in which recruitment ceases [stoprecruit]: Enter the number of periods
after which recruitment is to cease. The number must be no smaller than the
number of periods implied by Patients recruited in each period so far [pts]. If
the option is left blank, it is assumed that recruitment continues indefinitely. As
already noted, we wish to stop recruitment at 850 patients, which we will achieve
by the end of period 6. We therefore enter 6 for this item.
• Save using filename [using]: The numerical results of the artpep run can be saved
to a .dta file for a permanent record or for plotting. We leave the item blank.
• Start date of trial (ddmmmyyyy) [datestart]: If we enter the start date, the output
from artpep is conveniently labeled with the calendar date of the end of each
period. We recommend using this option. We enter 01jan2009.
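The accrual arithmetic entered above (100 patients in period 1, then 150 per period) can be checked quickly; an illustrative sketch, not part of ARTPEP:

```python
# Cumulative accrual at the end of each period: 100, then 150 x 5,
# reaching the 850-patient target by the end of period 6.
pts = [100] + [150] * 5
cumulative = []
total = 0
for p in pts:
    total += p
    cumulative.append(total)
print(cumulative)
```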
After submitting the above setup to Stata (version 10 or later), we get the following
result:
The program reports the total number of events (#events) and the number of events in
the control arm (#C-events), which are often of interest. The required total number of
events (that is, both arms combined) of 673 is projected to be reached on 31 December
2016, the end of period 8. We expect 351 events in the control arm by that time. The
projection is not surprising because the accrual figures that have been entered more or
less agree with the trial plan. Nevertheless, the output shows us the expected progress
of the number of events and the power over time. The trial may be monitored (and the
ARTPEP analysis updated) to follow its progress.
The dialog box has, as usual, created and run the necessary artpep command line.
The second item in the command is $S_ARTPEP. As already mentioned, it contains
additional information needed by artpep. On displaying its contents, we find
The key pieces of information here are hratio(1, 0.80) and edf0(0.3, 3), which
specify the hazard ratios in groups 1 and 2, and the survival function in group 1,
respectively. All the other items are default values and could be omitted in the present
example. The present example could have been run directly from the command line as
follows:
The time to observe the required number of events has advanced by more than 1 year,
to period 9 (31dec2017).
3 Syntax
Once you have gained a little experience with using the ARTPEP dialog box, you will
find it more natural and efficient to use the command line. The syntax of artpep is as
follows:
artpep using filename , pts(numlist) edf0(slist0) epts(numlist)
eperiods(#) stoprecruit(#) startperiod(#) datestart(ddmmmyyyy)
4 Options
pts(numlist) is required. numlist specifies the number of patients recruited in each
period since the start of the trial, that is, since randomization. See help on artsurv
for the definition of a “period”. The number of items in numlist defines the number
of periods of recruitment so far. For example, pts(23 12 25) specifies three initial
periods of recruitment, with recruitment of 23 patients in period 1, 12 in period 2,
and 25 in period 3. The “current” period would be period 3 and would be demarcated
by parallel lines in the output.
edf0(slist0 ) is required and gives the survival function in the control group (group 1).
This need not be one of the survival distributions to be compared in the trial, unless
hratio() = 1 for at least one of the groups. The format of slist0 is p1 p2 . . . pr, t1 t2 . . . tr.
Thus edf0(p1 p2 . . . pr, t1 t2 . . . tr) gives the value pi of the survival
function at the end of time period ti, i = 1, . . . , r. Instantaneous
event rates (that is, hazards) are assumed constant within time periods.
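The piecewise-constant hazard assumption behind edf0() can be made concrete with a small sketch. This is illustrative Python, not part of ART; the function names (piecewise_hazards, survival) are ours, and the sketch assumes the last hazard is extrapolated beyond the final period.

```python
import math

def piecewise_hazards(probs, times):
    """Recover the constant hazard in each interval from survival
    probabilities at the interval endpoints (S(0) = 1 assumed)."""
    hazards, prev_p, prev_t = [], 1.0, 0.0
    for p, t in zip(probs, times):
        # S(t) = S(prev_t) * exp(-h * (t - prev_t))  =>  solve for h
        hazards.append(math.log(prev_p / p) / (t - prev_t))
        prev_p, prev_t = p, t
    return hazards

def survival(t, probs, times):
    """Evaluate S(t), extrapolating with the last hazard beyond times[-1]."""
    hazards = piecewise_hazards(probs, times)
    s, prev_t = 1.0, 0.0
    for h, (p, tt) in zip(hazards, zip(probs, times)):
        if t <= tt:
            return s * math.exp(-h * (t - prev_t))
        s, prev_t = p, tt
    return s * math.exp(-hazards[-1] * (t - prev_t))

# edf0(0.3, 3): survival probability 0.3 at the end of period 3;
# extrapolating three more periods gives 0.3 * 0.3 = 0.09
s3, s6 = survival(3, [0.3], [3]), survival(6, [0.3], [3])
```

With edf0(0.3, 3) as in the example, the sketch reproduces S(3) = 0.3 and extrapolates S(6) = 0.09 under a constant hazard.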
P. Royston and F. M.-S. Barthel 393
5 Final comments
We have illustrated ARTPEP with a basic example. However, ARTPEP understands
the more complex options of artsurv. Therefore, complex features, including loss to
follow-up, treatment crossover, and nonproportional hazards, can be allowed for in the
projection of power and events.
Sometimes it is desirable to make projections on a finer time scale than 1 year,
for example, in 3- or 6-month periods. This is easily done by adjusting the period
parameters used in ART and ARTPEP.
6 References
Barthel, F. M.-S., A. Babiker, P. Royston, and M. K. B. Parmar. 2006. Evaluation of
sample size and power for multi-arm survival trials allowing for non-uniform accrual,
non-proportional hazards, loss to follow-up and cross-over. Statistics in Medicine 25:
2521–2542.
Barthel, F. M.-S., P. Royston, and A. Babiker. 2005. A menu-driven facility for complex
sample size calculation in randomized controlled trials with a survival or a binary
outcome: Update. Stata Journal 5: 123–129.
Abstract. This article describes the new meta-analysis command metaan, which
can be used to perform fixed- or random-effects meta-analysis. Besides the stan-
dard DerSimonian and Laird approach, metaan offers a wide choice of available
models: maximum likelihood, profile likelihood, restricted maximum likelihood,
and a permutation model. The command reports a variety of heterogeneity measures,
including Cochran's Q, I², H²_M, and the between-studies variance estimate τ̂².
A forest plot and a graph of the maximum likelihood function can also be generated.
Keywords: st0201, metaan, meta-analysis, random effect, effect size, maximum
likelihood, profile likelihood, restricted maximum likelihood, REML, permutation
model, forest plot
1 Introduction
Meta-analysis is a statistical methodology that integrates the results of several inde-
pendent clinical trials that are considered by the analyst to be “combinable”
(Huque 1988). Usually, this is a two-stage process: in the first stage, the appropriate
summary statistic for each study is estimated; then in the second stage, these statis-
tics are combined into a weighted average. Individual patient data (IPD) methods
exist for combining and meta-analyzing data across studies at the individual patient
level. An IPD analysis provides advantages such as standardization (of marker values,
outcome definitions, etc.), follow-up information updating, detailed data-checking, sub-
group analyses, and the ability to include participant-level covariates (Stewart 1995;
Lambert et al. 2002). However, individual observations are rarely available; addition-
ally, if the main interest is in mean effects, then the two-stage and the IPD approaches
can provide equivalent results (Olkin and Sampson 1998).
This article concerns itself with the second stage of the two-stage approach to meta-
analysis. At this stage, researchers can select between two main approaches—the fixed-
effects (FE) or the random-effects model—in their efforts to combine the study-level
summary estimates and calculate an overall average effect. The FE model is simpler
and assumes the true effect to be the same (homogeneous) across studies. However, ho-
mogeneity has been found to be the exception rather than the rule, and some degree of
true effect variability between studies is to be expected (Thompson and Pocock 1991).
Two sorts of between-studies heterogeneity exist: clinical heterogeneity stems from differences
in interventions, populations, outcomes, or follow-up times, while methodological
heterogeneity stems from differences in trial design and quality.
© 2010 StataCorp LP st0201
396 metaan: Random-effects meta-analysis
where
varname1 is the study effect size.
varname2 is the study effect variation, with standard error used as the default.
2.2 Options
fe fits an FE model that assumes there is no heterogeneity between the studies. The
model assumes that within-study variances may differ, but that there is homogeneity
of effect size across studies. Often the homogeneity assumption is unlikely, and
variation in the true effect across studies is to be expected. Therefore, caution is
required when using this model. Reported heterogeneity measures are estimated
using the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
dl fits a DL random-effects model, which is the most commonly used model. The model
assumes heterogeneity between the studies; that is, it assumes that the true effect
can be different for each study. The model assumes that the individual-study true
effects are distributed with a variance τ 2 around an overall true effect, but the model
makes no assumptions about the form of the distribution of either the within-study
or the between-studies effects. Reported heterogeneity measures are estimated using
the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
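metaan implements this model in Stata/Mata. Purely as an illustration of the DL method-of-moments computation, here is a minimal Python sketch (the function name dersimonian_laird is ours):

```python
def dersimonian_laird(y, v):
    """DL method-of-moments estimate of tau^2 and the RE pooled effect.
    y: study effect sizes; v: within-study variances."""
    k = len(y)
    w = [1.0 / vi for vi in v]                        # FE (inverse-variance) weights
    mu_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # truncated at zero
    w_re = [1.0 / (vi + tau2) for vi in v]            # RE weights
    mu_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_re = (1.0 / sum(w_re)) ** 0.5
    return mu_re, se_re, tau2, q

# Toy data: two studies with effects 0 and 1, equal within-study variance
mu, se, tau2, q = dersimonian_laird([0.0, 1.0], [0.1, 0.1])
```

Note how τ² is truncated at zero when Q falls below its degrees of freedom, in which case the model collapses to the FE weights.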
ml fits an ML random-effects model. This model makes the additional assumption
(necessary to derive the log-likelihood function, and also true for reml and pl, below)
that both the within-study and the between-studies effects have normal distributions.
It solves the log-likelihood function iteratively to produce an estimate of the between-
studies variance. However, the model does not always converge; in some cases, the
between-studies variance estimate is negative and set to zero, in which case the
model is reduced to an fe specification. Estimates are reported as missing in the
event of nonconvergence. Reported heterogeneity measures are estimated using the
ml option. You must specify one of fe, dl, ml, reml, pl, or pe.
reml fits an REML random-effects model. This model is similar to ml and uses the same
assumptions. The log-likelihood function is maximized iteratively to provide esti-
mates, as in ml. However, under reml, only the part of the likelihood function that
is location invariant is maximized (that is, maximizing the portion of the likelihood
that does not involve μ if estimating τ 2 , and vice versa). The model does not always
converge; in some cases, the between-studies variance estimate is negative and set
to zero, in which case the model is reduced to an fe specification. Estimates are re-
ported as missing in the event of nonconvergence. Reported heterogeneity measures
are estimated using the reml option. You must specify one of fe, dl, ml, reml, pl,
or pe.
pl fits a PL random-effects model. This model uses the same likelihood function as ml
but takes into account the uncertainty associated with the between-studies variance
estimate when calculating an overall effect, which is done by using nested iterations
to converge to a maximum. The confidence intervals (CIs) provided by the model
are asymmetric, and hence so is the diamond in the forest plot. However, the model
does not always converge. Values that were not computed are reported as missing.
Reported heterogeneity measures are estimated using the ml option because μ̂ and
τ̂², the effect and between-studies variance estimates, are the same. Only their
CIs are reestimated. The model also provides a CI for the between-studies variance
estimate. You must specify one of fe, dl, ml, reml, pl, or pe.
pe fits a PE random-effects model. This model can be described in three steps. First, in
line with a null hypothesis that all true study effects are zero and observed effects
are due to random variation, a dataset of all possible combinations of observed
study outcomes is created by permuting the sign of each observed effect. Then, the
dl model is used to compute an overall effect for each combination. Finally, the
resulting distribution of overall effect sizes is used to derive a CI for the observed
overall effect. The CI provided by the model is asymmetric, and hence so is the
diamond in the forest plot. Reported heterogeneity measures are estimated using
the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
varc specifies that the study-effect variation variable, varname2, holds variance values.
If this option is omitted, metaan assumes that the variable contains standard-error
values (the default).
label(varname) selects labels for the studies. One or two variables can be selected
and converted to strings. If two variables are selected, they will be separated by a
comma. Usually, the author names and the year of study are selected as labels. The
final string is truncated to 20 characters.
forest requests a forest plot. The weights from the specified analysis are used for
plotting symbol sizes (pe uses dl weights). Only one graph output is allowed in each
execution.
forestw(#) requests a forest plot with adjusted weight ratios for better display. The
value can be in the [1, 50] range. For example, if the largest to smallest weight ratio
is 60 and the graph looks awkward, the user can use this command to improve the
appearance by requesting that the weight be rescaled to a largest/smallest weight
ratio of 30. Only the weight squares in the plot are affected, not the model. The CIs
in the plot are unaffected. Only one graph output is allowed in each execution.
plplot(string) requests a plot of the likelihood function for the average effect or
between-studies variance estimate of the ml, pl, or reml model. The plplot(mu) op-
tion fixes the average effect parameter to its model estimate in the likelihood function
and creates a two-way plot of τ 2 versus the likelihood function. The plplot(tsq)
option fixes the between-studies variance to its model estimate in the likelihood
function and creates a two-way plot of μ versus the likelihood function. Only one
graph output is allowed in each execution.
E. Kontopantelis and D. Reeves 399
In addition to the standard results, metaan, fe and metaan, dl save the following in
r():
Scalars
r(tausq_dl)	τ̂², from the DL model
In addition to the standard results, metaan, reml saves the following in r():
Scalars
r(tausq_dl)	τ̂², from the DL model
r(conv_reml)	REML convergence information
r(tausq_reml)	τ̂², from the REML model
In each case, heterogeneity measures H²_M and I² are computed using the returned
between-studies variance estimate τ̂². Convergence and PE execution information is
returned as well.
3 Methods
The metaan command offers six meta-analysis models for calculating a mean effect esti-
mate and its CIs: FE model, random-effects DL method, ML random-effects model, REML
random-effects model, PL random-effects model, and PE method using a DL random-
effects model. Models of the random-effects family take into account the identified
between-studies variation, estimate it, and usually produce wider CIs for the overall
effect than would an FE analysis. Brief descriptions of the models have been provided
in section 2.2. In this section, we will provide a few more details and practical advice in
selecting among the models. Their complexity prohibits complete descriptions in this
article, and users wishing to look into model details are encouraged to refer to the orig-
inal articles that described them (DerSimonian and Laird 1986; Hardy and Thompson
1996; Follmann and Proschan 1999; Brockwell and Gordon 2001).
The three ML models are iterative and usually computationally expensive. ML and PL
derive the μ (overall effect) and τ 2 estimates by maximizing the log-likelihood function
in (1) under different conditions. REML estimates τ 2 and μ by maximizing the restricted
log-likelihood function in (2).
\[
\log L(\mu,\tau^2) = -\frac{1}{2}\sum_{i=1}^{k}\log\left\{2\pi\left(\hat{\sigma}_i^2+\tau^2\right)\right\}
- \frac{1}{2}\sum_{i=1}^{k}\frac{(y_i-\mu)^2}{\hat{\sigma}_i^2+\tau^2},
\qquad \mu\in\mathbb{R},\ \tau^2\ge 0 \tag{1}
\]

\[
\log L'(\hat{\mu},\tau^2) = -\frac{1}{2}\sum_{i=1}^{k}\log\left\{2\pi\left(\hat{\sigma}_i^2+\tau^2\right)\right\}
- \frac{1}{2}\sum_{i=1}^{k}\frac{(y_i-\hat{\mu})^2}{\hat{\sigma}_i^2+\tau^2}
- \frac{1}{2}\log\left\{\sum_{i=1}^{k}\frac{1}{\hat{\sigma}_i^2+\tau^2}\right\},
\qquad \hat{\mu}\in\mathbb{R},\ \tau^2\ge 0 \tag{2}
\]

where k is the number of studies to be meta-analyzed, y_i and σ̂²_i are the effect and
variance estimates for study i, and μ̂ is the overall effect estimate.
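For readers who want to experiment with (1) and (2) outside Stata, here is a rough Python sketch that profiles μ out at its weighted mean and maximizes each log likelihood over τ² by grid search. metaan instead iterates to convergence, so this is an illustration only, and all function names are ours.

```python
import math

def profile_mu(tau2, y, v):
    """Weighted mean of the effects with weights 1/(v_i + tau2)."""
    w = [1.0 / (vi + tau2) for vi in v]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def ml_loglik(tau2, y, v):
    """Log likelihood (1), with mu profiled out at its weighted mean."""
    mu = profile_mu(tau2, y, v)
    return -0.5 * sum(math.log(2 * math.pi * (vi + tau2))
                      + (yi - mu) ** 2 / (vi + tau2)
                      for yi, vi in zip(y, v))

def reml_loglik(tau2, y, v):
    """Restricted log likelihood (2): (1) plus the -0.5*log(sum of weights) term."""
    w = [1.0 / (vi + tau2) for vi in v]
    return ml_loglik(tau2, y, v) - 0.5 * math.log(sum(w))

def argmax_tau2(f, y, v, upper=2.0, grid=4000):
    """Crude grid search for the maximizing tau2 (metaan iterates instead)."""
    return max((upper * j / grid for j in range(grid + 1)),
               key=lambda t2: f(t2, y, v))

# Toy data: three study effects with equal within-study variances
y, v = [0.0, 0.4, 1.0], [0.05, 0.05, 0.05]
t2_ml = argmax_tau2(ml_loglik, y, v)
t2_reml = argmax_tau2(reml_loglik, y, v)
# REML corrects the downward bias of ML, so t2_reml exceeds t2_ml here
```

With equal within-study variances, the ML estimate divides the squared deviations by k while REML effectively divides by k − 1, which is the downward bias the extra REML term removes.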
ML follows the simplest approach, maximizing (1) in a single iteration loop. A criti-
cism of ML is that it takes no account of the loss in degrees of freedom that results from
estimating the overall effect. REML derives the likelihood function in a way that adjusts
for this and removes downward bias in the between-studies variance estimator. A use-
ful description for REML, in the meta-analysis context, has been provided by Normand
(1999). PL uses the same likelihood function as ML, but uses nested iterations to take
into account the uncertainty associated with the between-studies variance estimate when
calculating an overall effect. By incorporating this extra factor of uncertainty, PL yields
CIs that are usually wider than for DL and also are asymmetric. PL has been shown to
outperform DL in various scenarios (Brockwell and Gordon 2001).
The PE model (Follmann and Proschan 1999) can be described as follows: First, in
line with a null hypothesis that all true study effects are zero and observed effects are due
to random variation, a dataset of all possible combinations of observed study outcomes
is created by permuting the sign of each observed effect. Next the dl model is used to
compute an overall effect for each combination. Finally, the resulting distribution of
overall effect sizes is used to derive a CI for the observed overall effect.
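The three PE steps can be sketched as follows. This toy Python version (function names ours) uses an unweighted mean as a stand-in for the DL pooled estimate, and a deliberately tiny k: with k = 3 there are only 2³ = 8 sign combinations, far too few for a meaningful 95% interval, so the result is purely illustrative.

```python
from itertools import product

def permutation_ci(effects, pool, alpha=0.05):
    """Flip the sign of each observed effect in all 2^k combinations,
    pool each combination, and read the CI off the quantiles of the
    resulting distribution of pooled effects."""
    dist = sorted(pool([e * s for e, s in zip(effects, signs)])
                  for signs in product([1, -1], repeat=len(effects)))
    lo = dist[int(len(dist) * alpha / 2)]
    hi = dist[int(len(dist) * (1 - alpha / 2)) - 1]
    return lo, hi

def unweighted_mean(xs):
    # stand-in pooling function; metaan pools each combination with DL
    return sum(xs) / len(xs)

lo, hi = permutation_ci([1.0, 2.0, 3.0], unweighted_mean)
```

The quantile indexing, not the permutation distribution itself, is what produces the asymmetry here; in the real PE method the asymmetry comes from the observed effects and DL weights.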
Method performance is known to be affected by three factors: the number of studies
in the meta-analysis, the degree of heterogeneity in true effects, and—provided there is
heterogeneity present—the distribution of the true effects (Brockwell and Gordon 2001).
Heterogeneity, which is attributed to clinical or methodological diversity (Higgins and
Green 2006), is a major problem researchers have to face when combining study results
in a meta-analysis. The variability that arises from different interventions, populations,
outcomes, or follow-up times is described by clinical heterogeneity, while differences in
trial design and quality are accounted for by methodological heterogeneity (Thompson
1994). Traditionally, heterogeneity is tested with Cochran’s Q, which provides a p-value
for the test of homogeneity, when compared with a χ2k−1 distribution where k is the
number of studies (Brockwell and Gordon 2001). However, the test is known to be poor
at detecting heterogeneity because its power is low when the number of studies is small
(Hardy and Thompson 1998). An alternative measure is I 2 , which is thought to be more
informative in assessing inconsistency between studies. I 2 values of 25%, 50%, and 75%
correspond to low, moderate, and high heterogeneity, respectively (Higgins et al. 2003).
Another measure is H²_M, the measure least affected by the value of k. It takes values in
the [0, +∞) range, with 0 indicating perfect homogeneity (Mittlböck and Heinzl 2006).
Obviously, the between-studies variance estimate τ̂² can also be informative about the
presence or absence of heterogeneity.
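As a quick numerical illustration of these measures, here is a Python sketch (function name ours), assuming H²_M = (Q − (k − 1))/(k − 1) truncated at zero, one form consistent with the [0, +∞) range described above:

```python
def heterogeneity(y, v):
    """Cochran's Q, I^2 (as a percentage), and H^2_M = (Q-(k-1))/(k-1),
    truncated at zero, from effects y and within-study variances v."""
    k = len(y)
    w = [1.0 / vi for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)   # FE pooled effect
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    h2m = max(0.0, (q - (k - 1)) / (k - 1))
    return q, i2, h2m

# Two studies with effects 0 and 1 and equal variance 0.1:
q, i2, h2m = heterogeneity([0.0, 1.0], [0.1, 0.1])
```

Under homogeneity Q has expectation k − 1, so both I² and H²_M measure the excess of Q over its degrees of freedom, on relative and absolute scales respectively.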
The test for heterogeneity is often used as the basis for applying an FE or a random-
effects model. However, the often low power of the Q test makes it unwise to base a
decision on the result of the test alone. Research studies, even on the same topic, can
vary on a large number of factors; hence, homogeneity is often an unlikely assumption
and some degree of variability between studies is to be expected (Thompson and Pocock
1991). Some authors recommend the adoption of a random-effects model unless there
are compelling reasons for doing otherwise, irrespective of the outcome of the test for
heterogeneity (Brockwell and Gordon 2001).
However, even though random-effects methods model heterogeneity, the performance
of the ML models (ML, REML, and PL) in situations where the true effects violate the
assumptions of a normal distribution may not be optimal (Brockwell and Gordon 2001;
Hardy and Thompson 1998; Böhning et al. 2002; Sidik and Jonkman 2007). The num-
ber of studies in the analysis is also an issue, because most meta-analysis models (includ-
ing DL, ML, REML, and PL —but not PE) are only asymptotically correct; that is, they
provide the theoretical 95% coverage only as the number of studies increases (approaches
infinity). Method performance is therefore affected when the number of studies is small,
but the extent depends on the model (some are more susceptible), along with the degree
of heterogeneity and the distribution of the true effects (Brockwell and Gordon 2001).
4 Example
As an example, we apply the metaan command to health-risk outcome data from seven
studies. The information was collected for an unpublished meta-analysis, and the data
are available from the authors. Using the describe and list commands, we provide
details of the dataset and proceed to perform a univariate meta-analysis with metaan.
. use metaan_example
. describe
Contains data from metaan_example.dta
obs: 7
vars: 4 19 Apr 2010 12:19
size: 560 (99.9% of memory free)
The PL model used in the example converged successfully, as did ML, whose convergence
is a prerequisite. The overall effect is not found to be significant at the 95% level,
and there is considerable heterogeneity across studies, according to the measures. The
model also displays a 95% CI for the between-studies variance estimate τ2 (provided
that convergence is achieved, as is the case in this example). The forest plot created by
the command is displayed in figure 1.
Figure 1. Forest plot of effect sizes and CIs for the seven studies (Bakx A, 1985;
Campbell A, 1998; Cupples, 1994; Eckerlund; Moher, 2001; Woolard A, 1995;
Woolard B, 1995). Original weights (squares) displayed; largest to smallest ratio: 1.30.
When we reexecute the analysis with the plplot(mu) and plplot(tsq) options, we
obtain the log-likelihood function plots shown in figures 2 and 3.
Figure 2. Likelihood plot for μ fixed to the ML/PL estimate: log likelihood against τ² values.

Figure 3. Likelihood plot for τ² fixed to the ML/PL estimate: log likelihood against μ values.
5 Discussion
The metaan command can be a useful meta-analysis tool that includes newer and, in
certain circumstances, better-performing models than the standard DL random-effects
model. Unpublished results exploring model performance in various scenarios are avail-
able from the authors. Future work will involve implementing more models in the
metaan command and embellishing the forest plot.
6 Acknowledgments
We would like to thank the authors of meta and metan for all their work and the
anonymous reviewer whose useful comments improved the article considerably.
7 References
Böhning, D., U. Malzahn, E. Dietz, P. Schlattmann, C. Viwatwongkasem, and A. Big-
geri. 2002. Some general points in estimating heterogeneity variance with the
DerSimonian–Laird estimator. Biostatistics 3: 445–457.
DerSimonian, R., and N. Laird. 1986. Meta-analysis in clinical trials. Controlled Clinical
Trials 7: 177–188.
Follmann, D. A., and M. A. Proschan. 1999. Valid inference in random effects meta-
analysis. Biometrics 55: 732–737.
Higgins, J. P. T., and S. Green. 2006. Cochrane Handbook for Systematic Reviews of
Interventions Version 4.2.6.
http://www2.cochrane.org/resources/handbook/Handbook4.2.6Sep2006.pdf.
Kontopantelis, E., and D. Reeves. 2009. MetaEasy: A meta-analysis add-in for Microsoft
Excel. Journal of Statistical Software 30: 1–25.
Mittlböck, M., and H. Heinzl. 2006. A simulation study comparing properties of het-
erogeneity measures in meta-analyses. Statistics in Medicine 25: 4321–4333.
Olkin, I., and A. Sampson. 1998. Comparison of meta-analysis versus analysis of variance
of individual patient data. Biometrics 54: 317–322.
Thompson, S. G., and S. J. Pocock. 1991. Can meta-analyses be trusted? Lancet 338:
1127–1130.
1 Introduction
Statistical methods in survival analysis need to deal with data that are incomplete
because of right-censoring; a host of such methods are available, including the Kaplan–
Meier estimator, the log-rank test, and the Cox regression model. If one had complete
data, standard methods for quantitative data could be applied directly for the observed
survival time X, or methods for binary outcomes could be applied by dichotomizing
X as I(X > τ ) for a suitably chosen τ . With complete data, one could furthermore
set up regression models for any function f (X) and check such models using standard
graphical methods such as scatterplots or residuals for quantitative or binary outcomes.
One way of achieving these goals with censored survival data and with more-general
event history data (for example, competing-risks data) is to use a technique based on
pseudo-observations, as recently described in a series of articles. Thus the technique
has been studied in modeling of the survival function (Klein et al. 2007), the restricted
mean (Andersen, Hansen, and Klein 2004), and the cumulative incidence function in
competing risks (Andersen, Klein, and Rosthøj 2003; Klein and Andersen 2005; Klein
2006; Andersen and Klein 2007).
The basic idea is simple. Suppose a well-behaved estimator θ̂ for the expectation θ =
E{f(X)} is available—for example, the Kaplan–Meier estimator for S(t) = E{I(X >
t)}—based on a sample of size n. The ith pseudo-observation (i = 1, . . . , n) for f(X) is
then defined as θ̂_i = n × θ̂ − (n − 1) × θ̂_{−i}, where θ̂_{−i} is the estimator applied to the sample
of size n − 1, which is obtained by eliminating the ith observation from the dataset. The
pseudovalues are generated once, and the idea is to replace the incompletely observed
© 2010 StataCorp LP st0202
E. T. Parner and P. K. Andersen 409
f(X_i) by θ̂_i. That is, θ̂_i may be used as an outcome variable in a regression model,
or it may be used to compute residuals. θ̂_i also may be used in a scatterplot when
assessing model assumptions (Perme and Andersen 2008; Andersen and Perme 2010).
The intuition is that, in the absence of censoring, θ = E{f(X)} could, obviously, be
estimated as (1/n) Σ_i f(X_i), in which case the ith pseudo-observation is simply the
observed value f(X_i). The pseudovalues are related to the jackknife residuals used in
regression diagnostics.
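The leave-one-out construction is easy to sketch. The following illustrative Python (function names ours, not the stpsurv implementation) computes pseudo-observations for θ = S(t) from a Kaplan–Meier estimator and exhibits the no-censoring property just mentioned:

```python
def km_surv(t, times, events):
    """Kaplan-Meier estimate of S(t); events[i] = 1 for an event, 0 censored."""
    s = 1.0
    for tj in sorted(set(tt for tt, e in zip(times, events) if e and tt <= t)):
        n_risk = sum(1 for tt in times if tt >= tj)
        d = sum(1 for tt, e in zip(times, events) if e and tt == tj)
        s *= 1.0 - d / n_risk
    return s

def pseudo_obs(t, times, events):
    """theta_i = n*theta_hat - (n-1)*theta_hat_{-i} for theta = S(t)."""
    n = len(times)
    full = km_surv(t, times, events)
    return [n * full - (n - 1) * km_surv(t, times[:i] + times[i + 1:],
                                         events[:i] + events[i + 1:])
            for i in range(n)]

# With no censoring, the i-th pseudo-observation reduces to I(X_i > t):
# event times 1, 2, 3, 4 at t = 2.5 give pseudovalues 0, 0, 1, 1
ps = pseudo_obs(2.5, [1, 2, 3, 4], [1, 1, 1, 1])
```

With censoring, the pseudovalues need no longer lie in {0, 1}, which is exactly what makes them usable as a continuous outcome in a generalized linear model.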
We present three new Stata commands—stpsurv, stpci, and stpmean—that open a
new possibility in Stata for analyzing regression models by generating pseudovalues,
respectively, for the survival function (or the cumulative distribution function,
“the cumulative incidence”) under right-censoring, for the cumulative incidence
in competing risks, and for the restricted mean under right-censoring. Cox regression
models can be fit using the pseudovalue function for survival probabilities at several
time points. Thereby, the pseudovalue method provides an alternative to Cox regres-
sion, for example, in situations where rates are not proportional. As discussed by
Perme and Andersen (2008), residuals for model checking may also be obtained from
the pseudovalues. An example based on bone marrow transplantation data is presented
to illustrate the methodology.
In section 2, we briefly present the general pseudovalue approach to censored data
regression. In section 3, we present the new Stata commands; and in section 4, we show
examples of the use of the commands. Section 5 concludes with some remarks.
where g(·) is the link function. If right-censoring prevents us from observing all the
X_i's, then it is not simple to analyze this regression model. However, suppose θ̂ is
an approximately unbiased estimator of the marginal mean θ = E{f(X)} that may
be computed from the sample of right-censored observations. If f(X) = I(X > τ),
then θ = S(τ) may be estimated using the Kaplan–Meier estimator. The ith pseudo-
observation is now defined, as suggested in section 1, as
θ̂_i = n × θ̂ − (n − 1) × θ̂_{−i}
Here θ̂_{−i} is the “leave-one-out” estimator for θ based on all observations but the ith:
X_j, j ≠ i. The idea is to replace the possibly incompletely observed f(X_i) by θ̂_i and to
obtain estimates of the βs based on the estimating equation:
\[
U(\beta) = \sum_i U_i(\beta) = \sum_i \left\{\frac{\partial}{\partial\beta} g^{-1}(\beta^T Z_i)\right\}^T V_i^{-1} \left\{\hat{\theta}_i - g^{-1}(\beta^T Z_i)\right\} = 0 \tag{1}
\]
In (1), V_i is a working covariance matrix. Graw, Gerds, and Schumacher (2009) showed
that for the examples studied in this article, E{f(X_i) | Z_i} = E(θ̂_i | Z_i), and thereby
(1) is unbiased, provided that censoring is independent of covariates; see also Andersen
and Perme (2010). A sandwich estimator is used to estimate the variance of β̂. Let
\[
I(\beta) = \sum_i \left\{\frac{\partial}{\partial\beta} g^{-1}(\beta^T Z_i)\right\}^T V_i^{-1} \left\{\frac{\partial}{\partial\beta} g^{-1}(\beta^T Z_i)\right\}
\]
and
\[
\widehat{\mathrm{Var}}\left\{U(\hat{\beta})\right\} = \sum_i U_i(\hat{\beta})\, U_i(\hat{\beta})^T
\]
where t_1 < · · · < t_D are the distinct event times, Y_j is the number at risk, and d_j is the
number of events at time t_j. The cumulative distribution function is then estimated by
F̂(t) = 1 − Ŝ(t).
In this case, the link function of interest could be the cloglog function
cloglog{F(τ)} = log[−log{1 − F(τ)}]
which is equivalent to a Cox regression model for the survival function evaluated at τ.
For right-censored data, the estimated survival function (the Kaplan–Meier estimator)
does not always converge down to zero. Then the mean cannot be estimated reliably
by plugging the Kaplan–Meier estimator into (2). An alternative to the mean is the
restricted mean, defined as the area under the survival curve up to a time τ < ∞
(Klein and Moeschberger 2003), which is equal to θ = μ_τ = E{min(X, τ)}. The
restricted mean survival time is estimated by the area under the Kaplan–Meier curve up
to time τ; that is,
\[
\hat{\mu}_\tau = \int_0^\tau \hat{S}(u)\,du
\]
An alternative mean is the conditional mean given that the event time is smaller than
τ, μ_{cτ} = E(X | X ≤ τ), which is similarly estimated by
\[
\hat{\mu}_{c\tau} = \int_0^\tau \frac{\hat{S}(u) - \hat{S}(\tau)}{1 - \hat{S}(\tau)}\,du
\]
For the restricted and conditional mean, a link function of interest could be the log or
the identity.
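Because the Kaplan–Meier estimator is a step function, both integrals reduce to finite sums over the event times. A small illustrative Python sketch (function names ours, not the stpmean implementation):

```python
def km_curve(times, events):
    """Kaplan-Meier step function as a list of (event_time, S(t)) pairs."""
    pts, s = [], 1.0
    for tj in sorted(set(t for t, e in zip(times, events) if e)):
        n_risk = sum(1 for t in times if t >= tj)
        d = sum(1 for t, e in zip(times, events) if e and t == tj)
        s *= 1.0 - d / n_risk
        pts.append((tj, s))
    return pts

def restricted_mean(tau, times, events):
    """mu_tau = integral of S(u) from 0 to tau over the KM step function."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for tj, s in km_curve(times, events):
        if tj >= tau:
            break
        area += prev_s * (tj - prev_t)   # rectangle up to the next jump
        prev_t, prev_s = tj, s
    return area + prev_s * (tau - prev_t)

def conditional_mean(tau, times, events):
    """mu_ctau = {mu_tau - tau * S(tau)} / {1 - S(tau)}, an algebraic
    rearrangement of the conditional-mean integral in the text."""
    s_tau = 1.0
    for tj, s in km_curve(times, events):
        if tj <= tau:
            s_tau = s
    return (restricted_mean(tau, times, events) - tau * s_tau) / (1.0 - s_tau)
```

For complete data with event times 1, 2, 3, 4 and τ = 2.5, the sketch gives μ̂_τ = 2.0 and μ̂_{cτ} = 1.5, the mean of the event times not exceeding τ, as expected.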
Let t_1 < · · · < t_D be the distinct times of the primary event and the competing risk
combined, let Y_j be the number at risk, and let d_{1j} and d_{2j} be the numbers of primary
events and competing-risk events at time t_j. Then the cumulative incidence function of
the primary event is estimated by
\[
\hat{F}_1(t) = \sum_{t_j \le t} \frac{d_{1j}}{Y_j} \prod_{t_i < t_j} \frac{Y_i - (d_{1i} + d_{2i})}{Y_i}
\]
Again the link function of interest could be cloglog corresponding to the regression
model for the competing risks cumulative incidence studied by Fine and Gray (1999).
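The estimator F̂₁(t) can be sketched directly from the formula above. This illustrative Python (not the stpci implementation) codes causes as 0 = censored, 1 = primary event, 2 = competing risk:

```python
def cuminc(t, times, causes):
    """Cumulative incidence of cause 1 at time t (Aalen-Johansen form)."""
    f1, s = 0.0, 1.0
    for tj in sorted(set(tt for tt, c in zip(times, causes) if c and tt <= t)):
        n = sum(1 for tt in times if tt >= tj)
        d1 = sum(1 for tt, c in zip(times, causes) if tt == tj and c == 1)
        d2 = sum(1 for tt, c in zip(times, causes) if tt == tj and c == 2)
        f1 += s * d1 / n           # event-free survival just before tj, times
                                   # the cause-1 hazard d1/n
        s *= 1.0 - (d1 + d2) / n   # update overall event-free survival
    return f1

# Three subjects: primary event at t=1, competing risk at t=2, primary at t=3
f1_at_3 = cuminc(3, [1, 2, 3], [1, 2, 1])
```

Note that s is used before it is updated, so it equals the product over t_i < t_j in the formula; summing F̂₁ and F̂₂ recovers 1 − Ŝ(t), the probability of any event by time t.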
stpsurv, stpmean, and stpci are for use with st data. You must, therefore, stset
your data before issuing these commands. Frequency weights are allowed in the stset
command. In the stpci command for the cumulative incidence function in competing
risks, an indicator variable for the competing risks should always be specified. The
pseudovalues are by default stored in the pseudo variable when one time point is spec-
ified and are stored in variables pseudo1, pseudo2, . . . when several time points are
specified. The names of the pseudovariables are changed by the generate() option.
3.2 Options
at(numlist) specifies the time points, in ascending order, at which pseudovalues should
be computed. at() is required.
generate(string) specifies a variable name for the pseudo-observations. The default is
generate(pseudo).
failure generates pseudovalues for the cumulative incidence proportion, which is one
minus the survival function.
conditional specifies that pseudovalues for the conditional mean should be computed
instead of those for the restricted mean.
4 Example data
To illustrate the pseudovalue approach, we use data on sibling-donor bone marrow
transplants matched on human leukocyte antigen (Copelan et al. 1991). The data
are available in Klein and Moeschberger (2003). The data include information on
137 transplant patients on time to death, relapse, or loss to follow-up (tdfs); the
indicators of relapse and death (relapse, trm); the indicator of treatment failure
(dfs = relapse | trm); and three factors that may be related to outcome: disease
[acute lymphocytic leukemia (ALL), low-risk acute myeloid leukemia (AML), and high-
risk AML], the French–American–British (FAB) disease grade for AML (fab = 1 if AML
and grade 4 or 5; 0 otherwise), and recipient age at transplant (age).
Figure. Kaplan–Meier estimates of the disease free survival probability by FAB grade
(Fab=1 and Fab=0), with time in days on the horizontal axis.
Based on the sts list output below, the risk difference (RD) for FAB is computed
as RD = 0.333 − 0.541 = −0.207 [95% confidence interval: −0.379, −0.036], and the
relative risk (RR) for FAB is RR = 0.333/0.541 = 0.616, where FAB = 0 is chosen as the
reference group. The confidence interval of the RD is based on computing the standard
error of the RD as (0.0522² + 0.0703²)^{1/2}. The confidence interval for the RR is not
easily estimated using the information from the sts list command.
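The arithmetic behind the RD and its interval can be checked directly. This is a sketch assuming independent groups and a normal approximation, with the usual 1.96 multiplier for a 95% interval:

```python
# KM estimates of disease free survival at 530 days, from the sts list output
s0, se0 = 0.5408, 0.0522   # fab = 0 (reference group)
s1, se1 = 0.3333, 0.0703   # fab = 1

rd = s1 - s0                           # risk difference
se_rd = (se0 ** 2 + se1 ** 2) ** 0.5   # SE of a difference of independent estimates
rd_ci = (rd - 1.96 * se_rd, rd + 1.96 * se_rd)
rr = s1 / s0                           # relative risk (no easy CI from sts list)
```

An analogous normal-approximation CI for the RR would require the SE of log(RR), which is what the pseudovalue regression with a log link supplies below.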
414 Pseudo-observations
. use bmt
. stset tdfs, failure(dfs==1)
failure event: dfs == 1
obs. time interval: (0, tdfs]
exit on or before: failure
fab=0
0 0 0 1.0000 . . .
530 49 42 0.5408 0.0522 0.4334 0.6364
fab=1
0 0 0 1.0000 . . .
530 16 30 0.3333 0.0703 0.2018 0.4704
. stpsurv, at(530)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variable: pseudo
The pseudovalues are analyzed in generalized linear models with an identity link
function and a log link function, respectively.
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
The generalized linear models with an identity link function and a log link function
fit the relations
pi = E (Xi ) = β0 + β1 × FABi
log(pi ) = log{E (Xi )} = β0 + β1 × FABi
respectively, where p_i = S_i(530) is the disease free survival probability at 530 days for
individual i. Hence, based on the pseudovalues approach, we estimate the RD for FAB
by RD = −0.208 [95% confidence interval: −0.381, −0.035] and the RR for FAB by
RR = 0.615 [95% confidence interval: 0.389, 0.974]. The results are very similar to the
direct computation from the Kaplan–Meier using the sts list command. We now
obtain the confidence interval for the RR.
Suppose we wish to compute the RR for FAB, adjusting for disease as a categorical
variable and age as a continuous variable. Using the same pseudovalues, we fit the
generalized linear model.
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
disease
2 1.951343 .412121 3.17 0.002 1.289914 2.951931
3 1.005533 .3586364 0.02 0.988 .4998088 2.022965
Patients with AML and grade 4 or 5 (FAB = 1) have a 27% reduced disease free
survival probability at 530 days, when adjusting for disease and age.
. glm pseudo i.times i.fab i.disease age, link(cloglog) vce(cluster id) noheader
Iteration 0: log pseudolikelihood = -468.74476
Iteration 1: log pseudolikelihood = -457.41878 (not concave)
Iteration 2: log pseudolikelihood = -406.98781
Iteration 3: log pseudolikelihood = -365.23278
Iteration 4: log pseudolikelihood = -350.7435
Iteration 5: log pseudolikelihood = -349.97156
Iteration 6: log pseudolikelihood = -349.96409
Iteration 7: log pseudolikelihood = -349.96409
(Std. Err. adjusted for 137 clusters in id)
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
times
2 1.114256 .3269323 3.41 0.001 .4734805 1.755032
3 1.626173 .3567925 4.56 0.000 .9268721 2.325473
4 2.004267 .3707305 5.41 0.000 1.277649 2.730885
5 2.495327 .3824645 6.52 0.000 1.745711 3.244944
disease
2 -1.195542 .4601852 -2.60 0.009 -2.097489 -.2935959
3 .0036343 .3791488 0.01 0.992 -.7394838 .7467524
The estimated survival function in this model for a patient at time t with a set of
covariates Z is S(t) = exp{−Λ_0(t)e^{βZ}}, where Λ_0(t) is the baseline cumulative hazard.
The model shows that patients with AML who are at low risk have better disease
free survival than ALL patients [RR = exp(−1.1955) = 0.30] and that AML patients with
grade 4 or 5 FAB have a lower disease free survival [RR = exp(0.7620) = 2.14].
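The RRs quoted in this paragraph are simply the exponentiated coefficients from the cloglog fit; the arithmetic can be checked in a line or two (Python, using the coefficients quoted above):

```python
import math

# coefficients from the cloglog model quoted in the text
rr_disease2 = math.exp(-1.1955)  # low-risk AML vs. ALL
rr_fab = math.exp(0.7620)        # AML with grade 4 or 5 FAB
```

This gives 0.30 and 2.14, matching the values reported in the text.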
Without recomputing the pseudovalues, we can examine the effect of FAB over time.
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
times
2 3.99608 2.023867 2.74 0.006 1.480921 10.78292
3 8.225489 4.601898 3.77 0.000 2.747526 24.62531
4 11.89654 6.835021 4.31 0.000 3.858093 36.68333
5 19.20116 11.25862 5.04 0.000 6.084498 60.59409
disease
2 .3024683 .1368087 -2.64 0.008 .1246451 .7339808
3 .9993425 .3815547 -0.00 0.999 .4728471 2.112069
. test fab50=fab105=fab170=fab280=fab530
( 1) [pseudo]fab50 - [pseudo]fab105 = 0
( 2) [pseudo]fab50 - [pseudo]fab170 = 0
( 3) [pseudo]fab50 - [pseudo]fab280 = 0
( 4) [pseudo]fab50 - [pseudo]fab530 = 0
chi2( 4) = 1.73
Prob > chi2 = 0.7855
The model shows that there is no statistically significant difference in the FAB effect
over time (p = 0.79); that is, proportional hazards are not contraindicated for FAB.
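For 4 degrees of freedom, the chi-squared upper-tail probability has the closed form exp(−x/2)(1 + x/2), so the quoted p-value can be reproduced without any statistical software (a quick Python check):

```python
import math

chi2_stat = 1.73
# survival function of a chi-square with 4 df: P(X > x) = exp(-x/2) * (1 + x/2)
p_value = math.exp(-chi2_stat / 2) * (1 + chi2_stat / 2)
```

This yields p ≈ 0.785, agreeing with the Stata output above.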
. stpmean, at(1500)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variable: pseudo
. glm pseudo i.fab i.disease age, link(id) vce(robust) noheader
Iteration 0: log pseudolikelihood = -1065.6767
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
disease
2 461.1214 134.0932 3.44 0.001 198.3036 723.9391
3 78.00616 158.8357 0.49 0.623 -233.3061 389.3184
Here we see that low-risk AML patients have the longest restricted mean life, namely,
461.1 days longer than ALL patients within 1,500 days.
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
times
1 .0286012 .0292766 -3.47 0.001 .0038467 .21266
2 .0791623 .0547411 -3.67 0.000 .0204131 .306993
3 .1261608 .0823572 -3.17 0.002 .0350965 .4535083
4 .1781601 .1117597 -2.75 0.006 .0521017 .6092124
5 .2383869 .1488814 -2.30 0.022 .0700932 .8107537
disease
2 .1708985 .1154623 -2.61 0.009 .0454622 .6424309
3 .7829133 .466016 -0.41 0.681 .2438093 2.514068
5 Conclusion
The pseudovalue method is a versatile tool for regression analysis of censored time-to-
event data. We have implemented the method for regression analysis of the survival
function under right-censoring, for the cumulative incidence function under possible
competing risks, and for the restricted and conditional mean waiting time. Similar SAS
macros and R functions were presented by Klein et al. (2008).
6 References
Andersen, P. K., M. G. Hansen, and J. P. Klein. 2004. Regression analysis of restricted
mean survival time based on pseudo-observations. Lifetime Data Analysis 10: 335–
350.
Andersen, P. K., and J. P. Klein. 2007. Regression analysis for multistate models
based on a pseudo-value approach, with applications to bone marrow transplantation
studies. Scandinavian Journal of Statistics 34: 3–16.
Andersen, P. K., J. P. Klein, and S. Rosthøj. 2003. Generalised linear models for
correlated pseudo-observations, with applications to multi-state models. Biometrika
90: 15–27.
Andersen, P. K., and M. P. Perme. 2010. Pseudo-observations in survival analysis.
Statistical Methods in Medical Research 19: 71–99.
Copelan, E. A., J. C. Biggs, J. M. Thompson, P. Crilley, J. Szer, J. P. Klein, N. Kapoor,
B. R. Avalos, I. Cunningham, K. Atkinson, K. Downs, G. S. Harmon, M. B. Daly,
I. Brodsky, S. I. Bulova, and P. J. Tutschka. 1991. Treatment for acute myelocytic
leukemia with allogeneic bone marrow transplantation following preparation with
BuCy2. Blood 78: 838–843.
Fine, J. P., and R. J. Gray. 1999. A proportional hazards model for the subdistribution
of a competing risk. Journal of the American Statistical Association 94: 496–509.
Graw, F., T. A. Gerds, and M. Schumacher. 2009. On pseudo-values for regression
analysis in competing risks models. Lifetime Data Analysis 15: 241–255.
Kaplan, E. L., and P. Meier. 1958. Nonparametric estimation from incomplete obser-
vations. Journal of the American Statistical Association 53: 457–481.
Klein, J. P. 2006. Modeling competing risks in cancer studies. Statistics in Medicine
25: 1015–1034.
Klein, J. P., and P. K. Andersen. 2005. Regression modeling of competing risks data
based on pseudovalues of the cumulative incidence function. Biometrics 61: 223–229.
Klein, J. P., M. Gerster, P. K. Andersen, S. Tarima, and M. P. Perme. 2008. SAS and R
functions to compute pseudo-values for censored data regression. Computer Methods
and Programs in Biomedicine 89: 289–300.
Klein, J. P., B. Logan, M. Harhoff, and P. K. Andersen. 2007. Analyzing survival curves
at a fixed point in time. Statistics in Medicine 26: 4505–4519.
Klein, J. P., and M. L. Moeschberger. 2003. Survival Analysis: Techniques for Censored
and Truncated Data. 2nd ed. New York: Springer.
Liang, K.-Y., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear
models. Biometrika 73: 13–22.
Perme, M. P., and P. K. Andersen. 2008. Checking hazard regression models using
pseudo-observations. Statistics in Medicine 27: 5309–5328.
1 Introduction
Ninety-five percent of applied econometrics is concerned with mean effects, yet distri-
butional effects are no less important. The distribution of the dependent variable may
change in many ways that are not revealed or are only incompletely revealed by an exam-
ination of averages. For example, the wage distribution can become more compressed or
the upper-tail inequality may increase while the lower-tail inequality decreases. There-
fore, applied economists and policy makers are increasingly interested in distributional
effects. The estimation of quantile treatment effects (QTEs) is a powerful and intuitive
tool that allows us to discover the effects on the entire distribution. As an alternative
motivation, median regression is often preferred to mean regression to reduce suscep-
tibility to outliers. Hence, the estimators presented below may be particularly
appealing with noisy data such as wages or earnings. In this article, we provide a brief
survey of recent developments in this literature and a description of the new ivqte
command, which implements these estimators.
© 2010 StataCorp LP st0203
424 Estimation of quantile treatment effects with Stata
Depending on the type of endogeneity of the treatment and the definition of the
estimand, we can define four different cases. We distinguish between conditional and
unconditional effects and whether selection is on observables or on unobservables. Con-
ditional QTEs are defined conditionally on the value of the regressors, whereas uncon-
ditional effects summarize the causal effect of a treatment for the entire population.
Selection on observables is often referred to as a matching assumption or as exogenous
treatment choice (that is, exogenous conditional on X). In contrast, we refer to selection
on unobservables as endogenous treatment choice.
First, if we are interested in conditional QTEs and we assume that the treatment
is exogenous (conditional on X), we can use the quantile regression estimators pro-
posed by Koenker and Bassett (1978). Second, if we are interested in conditional
QTEs but the treatment is endogenous, the instrumental-variable (IV) estimator of
Abadie, Angrist, and Imbens (2002) may be applied. Third, for estimating uncondi-
tional QTEs with exogenous treatment, various approaches have been suggested, for
example, Firpo (2007), Frölich (2007a), and Melly (2006). Currently, the weighting
estimator of Firpo (2007) is implemented. Finally, unconditional QTE in the presence
of an endogenous treatment can be estimated with the technique of Frölich and Melly
(2008). The estimators for the unconditional treatment effects do not rely on any (para-
metric) √functional forms assumptions. On the other hand, for the conditional treatment
effects, n convergence rate can only be obtained with a parametric restriction. Be-
cause estimators affected by the curse of dimensionality are of less interest to the applied
economist, we will discuss only parametric (linear) estimators for estimating conditional
QTEs.
This article only discusses the implementation of the proposed estimators and the
syntax of the commands. It draws heavily on the more technical discussion in the
original articles, and the reader is referred to those articles for more background on,
and formal derivations of, some of the properties of the estimators described here.
The contributions of this article and the related commands are manifold. We provide
new standardized commands for the estimators proposed in Abadie, Angrist, and Imbens
(2002);1 Firpo (2007); and Frölich and Melly (2008); and estimators of their analytical
standard errors. For the conditional exogenous case, we provide heteroskedasticity
consistent standard errors. The estimator of Koenker and Bassett (1978) has already
been implemented in Stata with the qreg command, but its estimated standard errors
1. Joshua Angrist provides codes in Matlab to replicate the empirical results of Abadie, Angrist, and
Imbens (2002). Our codes for this estimator partially build on his codes.
M. Frölich and B. Melly 425
are not consistent in the presence of heteroskedasticity. The ivqte command thus
extends upon qreg in providing analytical standard errors for heteroskedastic errors.
At a higher level, locreg implements nonparametric estimation with both cate-
gorical and continuous regressors as suggested by Racine and Li (2004). Finally, we
incorporate cross-validation procedures to choose the smoothing parameters.
The next section outlines the definition of the estimands, the possible identifica-
tion approaches, and the estimators. Section 3 describes the ivqte command and
its various options, and contains simple applications to illustrate how ivqte can be
used. Appendix A describes somewhat more technical aspects for the estimation of the
asymptotic variance matrices. Appendix B describes the nonparametric estimators used
internally by ivqte and the additional locreg command.
Clearly, this linearity assumption is not sufficient for identification of QTEs because
the observed Di may be correlated with the error term εi . We assume that both D and
X are exogenous.
Assumption 2. Selection on observables with exogenous X
ε⊥⊥(D, X)
Assumptions 1 and 2 together imply that Q_{Y|X,D}^τ = Xβ^τ + Dδ^τ, such that we can
recover the unknown parameters of the potential outcomes from the joint distribution
of the observed variables Y , X, and D. The unknown coefficients can thus be estimated
by the classical quantile regression estimator suggested by Koenker and Bassett (1978).
This estimator is defined by

(β^τ, δ^τ) = arg min_{β,δ} Σ_i ρ_τ(Y_i − X_i β − D_i δ)    (1)
4. If the instrument is nonbinary, it must be transformed into a binary variable. See Frölich and Melly
(2008).
5. An alternative approach is given in Chernozhukov and Hansen (2005), who rely on a monotonic-
ity/rank invariance assumption in the outcome equation.
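The objective in (1) uses the check function ρ_τ(u) = u{τ − 1(u < 0)}. A minimal numerical illustration (Python with made-up data, rather than the Stata implementation) shows that minimizing the check loss over a constant recovers the sample τ-quantile, which is the robustness-to-outliers motivation mentioned in the introduction:

```python
import numpy as np

def rho(u, tau):
    # check function: rho_tau(u) = u * (tau - 1{u < 0})
    return u * (tau - (u < 0))

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up outcomes with one outlier
tau = 0.5

# grid search over candidate constants q; the minimizer is the tau-quantile
grid = np.linspace(0.0, 100.0, 100001)
loss = np.array([rho(y - q, tau).sum() for q in grid])
q_hat = grid[loss.argmin()]  # the median; unaffected by the outlier at 100
```

With τ = 0.5 the minimizer is the sample median, 3.0; moving the outlier anywhere above the median leaves the estimate unchanged.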
Assumption 3. IV
For almost all values of X
(Y^0, Y^1, D^0, D^1) ⊥⊥ Z | X
0 < Pr(Z = 1 | X) < 1
E(D^1 | X) ≠ E(D^0 | X)
Pr(D^1 ≥ D^0 | X) = 1
This assumption is well known and requires monotonicity (that is, the nonexistence of
defiers) in addition to a conditional independence assumption on the IV. Individuals
with D1 > D0 are referred to as compliers, and treatment effects can be identified only
for this group because the always- and never-participants cannot be induced to change
treatment status by hypothetical movements of the instrument.
Abadie, Angrist, and Imbens (2002) (AAI) impose assumption 3. Furthermore, they
require assumption 1 to hold for the compliers (that is, those observations with D1 >
D0 ). They show that the conditional QTE, δ τ , for the compliers can be estimated
consistently by the weighted quantile regression:
(β_IV^τ, δ_IV^τ) = arg min_{β,δ} Σ_i W_i^AAI × ρ_τ(Y_i − X_i β − D_i δ)    (2)

where

W_i^AAI = 1 − D_i(1 − Z_i)/{1 − Pr(Z = 1|X_i)} − (1 − D_i)Z_i/Pr(Z = 1|X_i)
The intuition for these weights can be given in two steps. First, by assumption 3,⁶

(Y^0, Y^1, D^0, D^1) ⊥⊥ Z | X
⟹ (Y^0, Y^1) ⊥⊥ Z | X, D^1 > D^0
⟹ (Y^0, Y^1) ⊥⊥ D | X, D^1 > D^0

This means that any observed relationship between D and Y has a causal interpretation
for compliers. To use this result, we have to find compliers in the population. This is
done in the following average sense by the weights W_i^AAI:⁷
E{W_i^AAI ρ_τ(Y_i − X_i β − D_i δ)} = Pr(D^1 > D^0) × E{ρ_τ(Y_i − X_i β − D_i δ) | D^1 > D^0}

Intuitively, this result holds because W_i^AAI = 1 for the compliers and because
E(W_i^AAI | D_i^1 = D_i^0 = 0) = E(W_i^AAI | D_i^1 = D_i^0 = 1) = 0.
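The two properties of the weights used in this argument — weight exactly 1 for compliers, mean zero for always- and never-takers — can be checked in a small simulation (a Python sketch with a constant instrument propensity, i.e., no X; all data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p = 0.4                                  # Pr(Z = 1); constant, so no X needed
z = (rng.random(n) < p).astype(int)
# latent types: compliers (D = Z), never-takers (D = 0), always-takers (D = 1)
types = rng.choice(["complier", "never", "always"], size=n, p=[0.5, 0.3, 0.2])
d = np.where(types == "complier", z, (types == "always").astype(int))

# the AAI weights with a constant propensity score
w = 1 - d * (1 - z) / (1 - p) - (1 - d) * z / p

w_compliers = w[types == "complier"]     # exactly 1 for every complier
mean_never = w[types == "never"].mean()  # approximately 0
mean_always = w[types == "always"].mean()
```

For compliers D = Z, so both correction terms vanish and the weight is exactly 1; for the other two types the weight varies with Z but averages to zero.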
A preliminary estimator for Pr (Z = 1 |Xi ) is needed to implement this estima-
tor. ivqte uses the local logit estimator described in appendix B.8 A problem with
6. This is the result of lemma 2.1 in Abadie, Angrist, and Imbens (2002).
7. This is a special case of theorem 3.1.a in Abadie (2003).
8. In their original article, Abadie, Angrist, and Imbens (2002) use a series estimator instead of a
local estimator as in ivqte. Nevertheless, one can also use series estimation or, in fact, any other
method to estimate the propensity score by first generating a variable containing the estimated
propensity score and informing ivqte via the phat() option that the propensity-score estimate is
supplied by the user.
estimator (2) is that the optimization problem is not convex because some of the
weights are negative while others are positive. Therefore, this estimator has not been
implemented. Instead, ivqte implements the AAI estimator with positive weights.
Abadie, Angrist, and Imbens (2002) have shown that as an alternative to W_i^AAI, one
can use the weights

W_i^AAI+ = E(W^AAI | Y_i, D_i, X_i)    (3)

instead, which are always positive. Because these weights are unknown, ivqte uses local
linear regression to estimate W_i^AAI+; see appendix B. Some of these estimated weights
might be negative in finite samples; those are then set to zero.⁹
with a parametric restriction. The following estimators of the unconditional QTE are
entirely nonparametric, and we will no longer invoke assumption 1. This is an important
advantage because parametric restrictions are often difficult to justify from a theoretical
point of view. In addition, assumption 1 restricts the QTE to be the same independently
from the value of X. Obviously, interaction terms may be included, but the effects in
the entire population are often more interesting than many effects for different covariate
combinations.
The interpretation of the unconditional effects is slightly different from the interpre-
tation of the conditional effects, even if the conditional QTE is independent from the
value of X. This is because of the definition of the quantile. For instance, if we are in-
terested in a low quantile, the conditional QTE will summarize the effect for individuals
with relatively low Y even if their absolute level of Y is high. The unconditional QTE,
on the other hand, will summarize the effect for individuals with a relatively low absolute
level of Y.
Finally, the conditional and unconditional QTEs are trivially the same in the absence
of covariates. They are also the same if the effect is the same independent of the value
of the covariates and of the value of the quantile τ . This is often called the location
shift model because the treatment affects only the location of the distribution of the
potential outcomes.
One might think about running a weighted quantile regression of Y on a constant and D by using the
weights W_i^AAI. For that purpose, however, the weights of Abadie, Angrist, and Imbens
(2002) are not correct, as shown in Frölich and Melly (2008). This estimator would es-
timate the difference between the τ quantile of Y^1 for the treated compliers and the τ
quantile of Y^0 for the nontreated compliers, which is not meaningful in general. How-
ever, the weights W_i^AAI could be used to estimate unconditional effects in the special case
when the IV is independent of X such that Pr(Z = 1|X) is not a function of X.
On the other hand, if one is interested in estimating conditional QTEs using a para-
metric specification, the weights W_i^FM could be used as well. Hence, although not
developed for this case, the weights W_i^FM can be used to identify conditional QTEs. It
is not clear whether W_i^FM or W_i^AAI will be more efficient. For estimating conditional
effects, both are inefficient anyway because they do not incorporate the conditional
density function of the error term at the quantile.
Intuitively, the difference between the weights W_i^AAI and W_i^FM can be explained as
follows: They both find the compliers in the average sense discussed above. However,
only W_i^FM simultaneously balances the distribution of the covariates between treated
and nontreated compliers. Therefore, W_i^AAI can be used only in combination with a
conditional model because there is no need to balance covariates in such a case. It can
also be used without a conditional model when the treated and nontreated compliers
have the same covariate distribution. W_i^FM, on the other hand, can be used with or
without a conditional model.
A preliminary estimator for Pr (Z = 1 |Xi ) is needed to implement this estimator.
ivqte uses the local logit estimator described in appendix B. The optimization problem
(4) is neither convex nor smooth. However, only two parameters have to be estimated.
In fact, one can show that the estimator can be written as two univariate quantile
regressions, which are easily solved despite the nonsmoothness; see the previous
footnotes. This is the way ivqte proceeds when the positive option is not activated.¹²
An alternative to solving this nonconvex problem consists in using the weights

W_i^FM+ = E(W^FM | Y_i, D_i)    (5)

which are always positive. ivqte estimates these weights by local linear regression if
the positive option has been activated. Again, estimated negative weights will be set
to zero.¹³
12. More precisely, ivqte solves the convex problem for the distribution function, and then mono-
tonizes the estimated distribution function using the method of Chernozhukov, Fernández-Val, and
Galichon (2010), and finally inverts it to obtain the quantiles. The parameters chosen in this way
solve the first-order conditions of the optimization problem, and therefore, the asymptotic results
apply to them.
13. If one is interested in average treatment effects, Frölich (2007b) has proposed an estimator for aver-
age treatment effects based on the same set of assumptions. This estimator has been implemented
in Stata in the command nplate, which can be downloaded from the websites of the authors of
this article.
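The projection in (5) can be sketched numerically. The following is only an illustration in Python: it uses a local-constant (Nadaraya–Watson) smoother rather than the local linear regression that ivqte actually employs, and the data, function name, and bandwidth are made up:

```python
import numpy as np

def project_positive(y, w, h=0.5):
    """Project signed weights w on the outcome y within one treatment arm,
    then set negative projected weights to zero (illustration only)."""
    kern = lambda u: np.exp(-0.5 * u**2)          # Gaussian kernel
    dist = (y[:, None] - y[None, :]) / h
    kmat = kern(dist)
    w_proj = (kmat * w[None, :]).sum(axis=1) / kmat.sum(axis=1)
    return np.clip(w_proj, 0.0, None)             # truncate negatives at zero

# toy check: projecting a constant weight returns that constant,
# and an all-negative weight vector is truncated to zero
y = np.linspace(0.0, 1.0, 11)
w_pos = project_positive(y, np.full(11, 2.0))
w_neg = project_positive(y, np.full(11, -1.0))
```

In practice the projection is done separately for the D = 1 and D = 0 samples, as described for the positive option below.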
(Y^0, Y^1) ⊥⊥ D | X
0 < Pr(D = 1 | X) < 1
3.2 Description
ivqte computes the QTEs of a binary treatment variable using a weighting strategy.
The command can estimate both conditional and unconditional QTEs under either
exogeneity or endogeneity. The estimator used by ivqte is determined as follows:

- conditional QTEs under exogeneity: Koenker and Bassett (1978)
- conditional QTEs under endogeneity: Abadie, Angrist, and Imbens (2002)
- unconditional QTEs under exogeneity: Firpo (2007)
- unconditional QTEs under endogeneity: Frölich and Melly (2008)
indepvars contains the list of X variables for the Koenker and Bassett (1978) estima-
tor for the estimation of exogenous conditional QTEs.14 For all other estimators, indep-
vars must remain empty, and the control variables X are to be given in continuous(),
unordered(), and dummy(). The instrument or the treatment variable is assumed to
satisfy the exclusion restriction conditionally on these variables.
The IV has to be provided as a binary variable, taking only the values 0 and 1. If the
original IV takes different values, it first has to be transformed to a binary variable. If
the original IV is one-dimensional, one may use the endpoints of its support and discard
the other observations. If one has several discrete IVs, one would use only those two
combinations that maximize and minimize the treatment probability Pr(D = 1|Z = z)
and code these two values as 0 and 1. For more details on how to transform several
nonbinary IVs to this binary case, see Frölich and Melly (2008).
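The recipe for several discrete IVs — pick the two instrument cells that minimize and maximize Pr(D = 1|Z = z) and recode them as 0 and 1, discarding the rest — can be sketched as follows (Python with simulated data; all variable names and parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z1 = rng.integers(0, 2, n)           # two hypothetical discrete instruments
z2 = rng.integers(0, 3, n)
d = (rng.random(n) < 0.2 + 0.2 * z1 + 0.15 * z2).astype(int)

# treatment probability within each instrument cell
p_cell = {}
for a in range(2):
    for b in range(3):
        m = (z1 == a) & (z2 == b)
        if m.any():
            p_cell[(a, b)] = d[m].mean()

lo = min(p_cell, key=p_cell.get)     # cell recoded as Z = 0
hi = max(p_cell, key=p_cell.get)     # cell recoded as Z = 1
keep = ((z1 == lo[0]) & (z2 == lo[1])) | ((z1 == hi[0]) & (z2 == hi[1]))
z_binary = ((z1 == hi[0]) & (z2 == hi[1]))[keep].astype(int)
```

The binary instrument `z_binary` and the subsample `keep` would then be passed to ivqte; see Frölich and Melly (2008) for the formal treatment.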
The estimation of all nonparametric functions is described in detail in appendix B.
A mixed kernel as suggested by Racine and Li (2004) is used to smooth over the con-
tinuous and categorical data. The more conventional approach of estimating the regres-
sion plane inside each cell defined by the discrete variables can be followed by setting
lambda() to 0. The propensity score is estimated by default by a local logit estima-
tor. A local linear estimator is used if the linear option is selected. Two algorithms
14. For the Koenker and Bassett (1978) estimator, the options continuous(), unordered(),
and dummy() are not permitted.
are available to maximize the local logistic likelihood function. The default is a simple
Gauss–Newton algorithm written for this purpose. If you select the mata_opt option,
the official Stata 10 optimizer is used. We expect the official optimizer to be more stable
in difficult situations. However, it can be used only if you have Stata 10 or a more
recent version.
The ivqte command also requires the packages moremata (Jann 2005b) and kdens
(Jann 2005a).
3.3 Options
Model
quantiles(numlist) specifies the quantiles at which the effects are estimated and should
contain numbers between 0 and 1. The computational time needed to estimate an
additional quantile is very short compared with the time needed to estimate the
preliminary nonparametric regressions. When conditional QTEs are estimated, only
one quantile may be specified. If one is interested in several QTEs, then one can save
the estimated weights for later use by using the generate_w() option. By default,
quantiles() is set to 0.5 when conditional QTEs are estimated, and quantiles()
contains the nine deciles from 0.1 to 0.9 when unconditional QTEs are estimated.
continuous(varlist), dummy(varlist), and unordered(varlist) specify the names of the
covariates depending on their type. Ordered discrete variables should be treated as
continuous. For all estimators except Koenker and Bassett (1978), the X variables
should be given here and not in indepvars. For the Koenker and Bassett (1978)
estimator, on the other hand, these options are not permitted and the X variables
must be given in indepvars.
aai selects the Abadie, Angrist, and Imbens (2002) estimator.
With the exception of the Koenker and Bassett (1978) estimator, several further
options are needed to control the estimation of the nonparametric components. First,
we need to estimate some kind of propensity score. For the Firpo (2007) estimator,
we need to estimate Pr(D = 1|X). For the Abadie, Angrist, and Imbens (2002) and
Frölich and Melly (2008) estimators, we need to estimate Pr(Z = 1|X), which we
also call a propensity score in the following discussion. These propensity scores are
then used to calculate the weights W_i^F, W_i^AAI, and W_i^FM, respectively, as defined
in section 2.¹⁵ The QTEs are estimated using these weights after applying some
trimming to eliminate observations with very large weights. The amount of trimming
is controlled by trim(), as explained below. This is the way the Firpo (2007)
estimator is implemented.
For the Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estima-
tors, more comments are required. First, the Abadie, Angrist, and Imbens (2002)
estimator is only implemented with the positive weights W_i^AAI+. Hence, when the
Abadie, Angrist, and Imbens (2002) estimator is activated via the aai option, first
15. The weights W_i^KB used in the Koenker and Bassett (1978) estimator are always equal to one.
the propensity score is estimated to calculate the weights W_i^AAI, which are then
automatically projected via nonparametric regression to obtain W_i^AAI+. This last
nonparametric regression to obtain the positive weights is controlled by the options
pkernel(), pbandwidth(), and plambda(), which are explained below. The letter
p in front of these options stresses that they are used to obtain the positive weights.
Finally, the Frölich and Melly (2008) estimator is implemented in two ways. We can
either use the weights W_i^FM after having trimmed very large weights, or alternatively,
we could also project these weights and then use W_i^FM+ to estimate the QTEs. If
one wants to pursue this second implementation, one has to activate the positive
option and specify pkernel() and pbandwidth() to control the projection of W_i^FM
to obtain the positive weights W_i^FM+.
Estimation of the propensity score
linear selects the method used to estimate the instrument propensity score. If this
option is not activated, the local logit estimator is used. If linear is activated, the
local linear estimator is used.
mata_opt selects the official optimizer introduced in Stata 10, Mata's optimize(), to
estimate the local logit. The default is a simple Gauss–Newton algorithm written for
this purpose. This option is relevant only when the linear option has not been
selected.
kernel(kernel) specifies the kernel function used to estimate the propensity score. ker-
nel may be any of the following second-order kernels: epan2 (Epanechnikov ker-
nel function, the default), biweight (biweight kernel function), triweight (tri-
weight kernel function), cosine (cosine trace), gaussian (Gaussian kernel func-
tion), parzen (Parzen kernel function), rectangle (rectangle kernel function), or
triangle (triangle kernel function). In addition to these second-order kernels, there
are also several higher-order kernels: epanechnikov o4 (Epanechnikov order 4),
epanechnikov o6 (order 6), gaussian o4 (Gaussian order 4), gaussian o6 (order
6), gaussian o8 (order 8). By default, epan2, which specifies the Epanechnikov
kernel, is used.16
16. Here are the formulas for the Epanechnikov kernel functions of order 4 and 6, respectively:

K(z) = (15/8 − (35/8)z²) (3/4)(1 − z²) 1(|z| < 1)

K(z) = (175/64 − (525/32)z² + (5775/320)z⁴) (3/4)(1 − z²) 1(|z| < 1)

And here are the formulas for the Gaussian kernel functions of order 4, 6, and 8, respectively:

K(z) = (1/2)(3 − z²) φ(z)

K(z) = (1/8)(15 − 10z² + z⁴) φ(z)

K(z) = (1/48)(105 − 105z² + 21z⁴ − z⁶) φ(z)
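The defining property of these higher-order kernels — unit mass with vanishing lower-order moments — can be verified numerically for two of the formulas above (a Python check; the grid and truncation range are our own choices):

```python
import numpy as np

def epan_o4(z):
    # Epanechnikov kernel of order 4, as in the footnote
    return (15/8 - 35/8 * z**2) * (3/4) * (1 - z**2) * (np.abs(z) < 1)

def gauss_o4(z):
    # Gaussian kernel of order 4
    phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
    return 0.5 * (3 - z**2) * phi

z = np.linspace(-8.0, 8.0, 400001)
dz = z[1] - z[0]

mass_e = epan_o4(z).sum() * dz           # total mass: should be 1
m2_e = (z**2 * epan_o4(z)).sum() * dz    # second moment: should vanish
mass_g = gauss_o4(z).sum() * dz
m2_g = (z**2 * gauss_o4(z)).sum() * dz
```

Both kernels integrate to 1 and have zero second moment, which is what makes them order-4 kernels with faster bias reduction than the second-order kernels listed first.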
bandwidth(#) sets the bandwidth h used to smooth over the continuous variables in the
estimation of the propensity score. The continuous regressors are first orthogonalized
such that their covariance matrix is the identity matrix. The bandwidth must be
strictly positive. If no bandwidth is specified, an infinite bandwidth (h = ∞) is used;
this is the default. If the bandwidth h is infinity and the parameter
λ is one, a global model (linear or logit) is estimated without any local smoothing.
The cross-validation procedure implemented in locreg can be used to guide the
choice of the bandwidth. Because the optimal bandwidth converges at a faster rate
than the cross-validated bandwidth, the robustness of the results with respect to a
smaller bandwidth should be examined.
lambda(#) sets the λ used to smooth over the dummy and unordered discrete variables
in the estimation of the propensity score. It must be between 0 and 1. A value
of 0 implies that only observations within the cell defined by all discrete regressors
are used. The default is lambda(1), which corresponds to global smoothing. If
the bandwidth h is infinity and λ = 1, a global model (linear or logit) is estimated
without any local smoothing. The cross-validation procedure implemented in locreg
can be used to guide the choice of lambda. Again, the robustness of the results with
respect to a smaller value of lambda() should be examined.
trim(#) controls the amount of trimming. All observations with an estimated propen-
sity score less than trim() or greater than 1 − trim() are trimmed and not used
further by the estimation procedure. This prevents giving very high weights to single
observations. The default is trim(0.001). This option is not useful for the Koenker
and Bassett (1978) estimator, where no propensity score is estimated.
positive is used only with the Frölich and Melly (2008) estimator. If it is activated, the
positive weights W_i^FM+ defined in (5) are estimated by the projection of the weights
W^FM on the dependent and the treatment variable. Weights W^FM+ are estimated
by nonparametric regression on Y, separately for the D = 1 and the D = 0 samples.
After the estimation, negative estimated weights in W_i^FM+ are set to zero.
Inference
variance activates the estimation of the variance. By default, no standard errors are
estimated because the estimation of the variance can be computationally demanding.
Except for the classical linear quantile regression estimator, it requires the estima-
tion of many nonparametric functions. This option should not be activated if you
bootstrap the results unless you bootstrap t-values to exploit possible asymptotic
refinements.
vbandwidth(#), vlambda(#), and vkernel(kernel) are used to calculate the vari-
ance if the variance option has been selected. They are defined similarly to
bandwidth(), lambda(), and kernel(). They are used only to estimate the vari-
ance. A “quick and dirty” estimate of the variance can be obtained by setting
vbandwidth() to infinity and vlambda() to 1, which is much faster than any other
choice. When vkernel(), vbandwidth(), or vlambda() is not specified, the values
given in kernel(), bandwidth(), and lambda() are taken as default.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.
what(varname) gives the name of an existing variable containing the estimated weights.
The weights may have been estimated using ivqte or with any other command such
as a series estimator.
. use card
. qreg lwage college exper black motheduc reg662 reg663 reg664 reg665 reg666
> reg667 reg668 reg669, quantile(0.1)
(output omitted )
The syntax of the ivqte command is similar with the exception that the treatment
variable has to be included in parentheses after all other regressors:18
. ivqte lwage exper black motheduc reg662 reg663 reg664 reg665 reg666 reg667
> reg668 reg669 (college), quantiles(0.1) variance
(output omitted )
The point estimates are identical because ivqte calls qreg, but the standard
errors differ. We recommend using the standard errors of ivqte because they are robust
against heteroskedasticity and other forms of dependence between the residuals and the
regressors.
We may be concerned that having a college degree might be endogenous and consider
using the “proximity of a four-year college” as an instrument. The proximity of a
four-year college is a binary variable, taking the value 1 if a college was close by. If
we are interested in the conditional QTE, we can apply the estimator suggested by
Abadie, Angrist, and Imbens (2002), as follows:
There are three differences compared with the previous syntax: First, the instrument
has to be given in parentheses after the treatment variable and the equal sign, that is,
(college=nearc4). Second, the control variables X are to be given in the corresponding
options—dummy(), continuous(), and unordered()—because they are used not only
to define the conditional QTE but also in the nonparametric estimation of the weights.
Third, the aai option must be activated. region enters here as a single unordered
17. This dataset is available for download from Stata together with the programs described in the
present article. The description of the variables can be found in Card (1995).
18. For this case of exogenous conditional QTEs, it is in principle arbitrary which variable is defined
as the treatment variable because the coefficients are estimated for all regressors. In addition,
nonbinary treatments are permitted here.
M. Frölich and B. Melly 439
discrete variable, which is expanded by ivqte to eight regional dummy variables in the
parametric model.
The two examples discussed so far refer to the conditional treatment effect of college
degree. We might be more interested in the unconditional QTE, which we examine in the
following example. Consider first the case where college degree is exogenous conditional
on X. The weighting estimator of Firpo (2007) is implemented by ivqte. We are
interested in the nine decile treatment effects with this estimator:
Only the treatment is given in parentheses, and the aai option is no longer activated.
Finally, to estimate unconditional QTEs with an endogenous treatment, the estimator
of Frölich and Melly (2008) is implemented in ivqte. The only difference from the
previous syntax is that the instrument (nearc4) now has to be given after the treatment
variable:
By default, the weights defined in (4) are used. If the positive option is activated,
the positive weights (5) are estimated and used:
If no control variables are included, then ivqte lwage (college = nearc4), aai
and ivqte lwage (college = nearc4), positive produce the same results.
We use card.dta and keep only 500 randomly sampled observations to reduce
computation time. Because of missing values on covariates, eventually only
394 observations are retained in the estimation. The aim is to estimate the effect
of having a college degree (college) on log wages (lwage). A potential instrument is
living near a four-year college (nearc4). For ease of presentation, we use only experience
(exper) as a continuous control variable here. The other control variables are region
(unordered()) and black (dummy()).
Depending on the estimator, up to three functions have to be estimated nonpara-
metrically. Three sets of options correspond to these three functions. The options
kernel(), bandwidth(), and lambda() determine the kernel and the parameters h and
λ used for the estimation of the propensity score. It corresponds to Pr(Z = 1|X) for
the Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estimators and
to Pr(D = 1|X) for the Firpo (2007) estimator.
The options pkernel(), pbandwidth(), and plambda() determine the kernel and
smoothing parameters used for the estimation of the positive weights defined in (3) for
the estimator of Abadie, Angrist, and Imbens (2002). If the Frölich and Melly (2008)
estimator is to be used and the positive option has been activated, pkernel() and
pbandwidth() are used to estimate the positive weights (5).
Finally, the options vkernel(), vbandwidth(), and vlambda() are used for the
estimation of the variances of the estimators of Abadie, Angrist, and Imbens (2002),
Firpo (2007), and Frölich and Melly (2008).
A general finding in the literature is that the choice of the kernel functions—
kernel(), pkernel(), and vkernel()—is rarely crucial. The options vbandwidth()
and vlambda() are used only for the estimation of the variance. Therefore, during the
exploratory analysis, it may make sense to reduce the computational time by setting
vbandwidth() to infinity and vlambda() to one, that is, a parametric model. For the
final set of estimates, it often makes sense to set vbandwidth() equal to bandwidth()
and vlambda() equal to lambda(). This is done by default. In the following illustra-
tion, we show how the cross-validation procedure implemented in locreg can be used
to guide the choice of the important smoothing parameters.
We start with the estimator proposed by Firpo (2007). We do not need to use
the options pkernel(), pbandwidth(), and plambda() because the weights are always
positive by definition. We choose h = 2 and λ = 0.8 for the estimation of the propensity
score. In addition, we use h = ∞ and λ = 1 for the estimation of the variance. We use
the default Epanechnikov kernel in all cases.
. use card, clear
. set seed 123
. sample 500, count
(2510 observations deleted)
. ivqte lwage (college), quantiles(0.5) dummy(black) continuous(exper)
> unordered(region) bandwidth(2) lambda(0.8) variance vbandwidth(.) vlambda(1)
(output omitted )
Of course, these choices of the smoothing parameters are arbitrary. One can use
the cross-validation option of locreg to choose the smoothing parameters. When we
use the Firpo (2007) estimator, we know that the options bandwidth() and lambda()
are used to estimate Pr (D = 1 |X ). Therefore, we can select the smoothing parameters
from a grid of, say, four possible values, as follows. We use the logit option because
D is a binary variable.
. locreg college, dummy(black) continuous(exper) unordered(region)
> bandwidth(1 2) lambda(0.8 1) logit
(output omitted )
The scalars r(optb) and r(optl) indicate that the choices of h = 1 and λ = 1
minimize the cross-validation criterion. We use the 2 × 2 search grid only for ease of
exposition. Usually, one would search within a much larger grid. We can now obtain
the point estimate using this choice of the smoothing parameters:
The optimal smoothing parameters are h = 0.8 and λ = 0.8. The generate(ps)
option implies that the fitted values of Pr(Z = 1|X) are saved in the variable ps. These
fitted values are generated using the optimal bandwidth; that is, they are generated
after the cross-validation has selected the optimal bandwidth.
442 Estimation of quantile treatment effects with Stata
In the next step, we need to select bandwidths for pbandwidth() and plambda().
We know by equations (2) and (3) that pbandwidth() and plambda() are used to
estimate

  E(W_i^{AAI} | Y_i, D_i, X_i) = E[ 1 − D_i(1 − Z_i)/{1 − Pr(Z = 1|X_i)} − (1 − D_i)Z_i/Pr(Z = 1|X_i) | Y_i, D_i, X_i ]
Then we can use locreg to find the optimal parameters. The positive weights
are obtained by a nonparametric regression of W^{AAI} on X, Y, and D. This is
implemented in ivqte via two separate regressions: one nonparametric regression of
W^{AAI} on X and Y for the D = 1 subsample and another for the D = 0 subsample.
We proceed in the same way here by adding Y, which in our example above is lwage,
as a continuous regressor and running separate regressions for the college==1 and the
college==0 subsamples:
. locreg waai if college==1, dummy(black) continuous(exper lwage)
> unordered(region) bandwidth(0.5 0.8) lambda(0.8 1)
(output omitted )
. locreg waai if college==0, dummy(black) continuous(exper lwage)
> unordered(region) bandwidth(0.5 0.8) lambda(0.8 1)
(output omitted )
In the first case (that is, for the college==1 subsample), the optimal smoothing pa-
rameters are h = 0.8 and λ = 1. For the college==0 subsample, the optimal smoothing
parameters are h = 0.8 and λ = 0.8. The current implementation of ivqte permits only
one value of h and one value of λ in the options pbandwidth() and plambda(), so as not
to overburden the user with choosing nuisance parameters. If the suggested values for
h and λ differ between the college==1 and the college==0 subsamples, we recommend
choosing the smaller of these values but also examining the robustness of the results to
the other values. We suggest using the smaller bandwidth because the inference provided
by ivqte is based on the asymptotic formula given in (7) (see appendix A.2), which
contains only a variance but no bias term. To increase the accuracy of the inference
based on (7), one would prefer bandwidth choices that lead to smaller biases.
In our example, we choose pbandwidth(0.8), which was suggested by cross-valida-
tion in the college==1 and the college==0 subsamples, and plambda(0.8), which is
the smaller value of λ. With these bandwidth choices, we obtain the final estimates:
. ivqte lwage (college=nearc4), aai quantiles(0.5) dummy(black)
> continuous(age fatheduc motheduc) unordered(region) bandwidth(0.8) lambda(0.8)
> pbandwidth(0.8) plambda(0.8) variance
(output omitted )
When the variance option is used without specifying vbandwidth() or vlambda(), the
values given for bandwidth() and lambda() are used as defaults for vbandwidth() and
vlambda(). Alternatively, we could have specified different values for vbandwidth() and
vlambda().
The standard errors are still different from the standard errors in the published arti-
cle because Abadie, Angrist, and Imbens (2002) used a somewhat unconventional
bandwidth. We can replicate the standard errors in their table II-A by activating the aai
option.19 The following command calculates the results for the median.
19. In this command, we use the fact that the estimator of Abadie, Angrist, and Imbens (2002) simpli-
fies to the standard quantile regression estimator when the treatment is used as its own instrument.
Similar relationships exist for the estimator proposed by Frölich and Melly (2008) and are discussed
in their article.
gives results that are slightly different from their table III-A. We cannot exactly repli-
cate their results because they have used series estimators to estimate the nonparametric
components of their estimator and because they have exploited the fact that the instru-
ment was completely randomized.20 In the following commands, we show how ivqte
with the options phat(varname) and what(varname) can be used to replicate their
results. The parameters and standard errors are then almost identical to the original
results.21
Abadie, Angrist, and Imbens (2002) first note that the instrument assignment has
been fully randomized. Therefore, they estimate Pr (Z = 1 |X ) by the sample mean of
Z. Then the negative and positive weights WiAAI can be generated:
. summarize assignment
(output omitted )
. generate pi=r(mean)
. generate kappa=1-treatment*(1-assignment)/(1-pi)-(1-treatment)*assignment/pi
In a second step, the positive weights WiAAI+ are estimated by a linear regression of
WiAAI on a polynomial of order 5 in Y and D:
. forvalues i=1/5 {
  2. generate e`i'=earnings^`i'
  3. generate de`i'=e`i'*treatment
  4. }
. regress kappa earnings treatment e2 e3 e4 e5 de1 de2 de3 de4 de5
(output omitted )
. predict kappa_pos
(option xb assumed; fitted values)
. ivqte earnings (treatment=assignment) if kappa_pos>0, dummy($reg)
> quantiles(0.5) variance aai what(kappa_pos) phat(pi)
(output omitted )
which gives almost the same estimates and standard errors as their table III-A.
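The same construction can be mimicked outside Stata. Below is a hedged Python sketch with simulated data: the design, coefficients, and the 70% complier share are invented for illustration, and only the kappa formula and the order-5 polynomial projection mirror the steps above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
z = rng.integers(0, 2, n)                       # randomized instrument
d = (z & (rng.random(n) < 0.7)).astype(int)     # treatment, driven by z
y = 1.0 + d + rng.normal(size=n)                # outcome (earnings)

pi = z.mean()                                    # Pr(Z=1) = sample mean of Z
kappa = 1 - d * (1 - z) / (1 - pi) - (1 - d) * z / pi   # W^AAI

# positive weights: linear projection of kappa on a 5th-order polynomial
# in y and its interactions with d, as in the regress step above
cols = [np.ones(n), d.astype(float)]
for i in range(1, 6):
    cols += [y ** i, d * y ** i]
X = np.column_stack(cols)
beta, *_ = np.linalg.lstsq(X, kappa, rcond=None)
kappa_pos = X @ beta        # fitted values; keep observations with kappa_pos > 0
print(kappa.mean())         # close to 0.7, the complier share in this DGP
```
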
4 Acknowledgments
We would like to thank Ben Jann, Andreas Landmann, Robert Poppe, and an anony-
mous referee for helpful comments.
20. There are no strong reasons to prefer series estimators to local nonparametric estimators. We
will use a series estimator here only to show that we can replicate their results. Actually,
Frölich and Melly (2008) require, in some sense, weaker regularity assumptions for the local es-
timator than what was required for the existing series estimators.
21. A small difference still remains, which is due to differences in the implementation of the estimator
of H(X). With slight adaptations, which would restrict the generality of the estimator, we can
replicate their results exactly.
5 References
Abadie, A. 2003. Semiparametric instrumental variable estimation of treatment response
models. Journal of Econometrics 113: 231–263.
Abadie, A., J. Angrist, and G. Imbens. 2002. Instrumental variables estimates of the
effect of subsidized training on the quantiles of trainee earnings. Econometrica 70:
91–117.
Card, D. E. 1995. Using geographic variation in college proximity to estimate the return
to schooling. In Aspects of Labour Economics: Essays in Honour of John Vanderkamp,
ed. L. Christofides, E. K. Grant, and R. Swindinsky. Toronto, Canada: University of
Toronto Press.
Frölich, M., and B. Melly. 2008. Unconditional quantile treatment effects under en-
dogeneity. Discussion Paper No. 3288, Institute for the Study of Labor (IZA).
http://ideas.repec.org/p/iza/izadps/dp3288.html.
Hall, P., and S. J. Sheather. 1988. On the distribution of a studentized quantile. Journal
of the Royal Statistical Society, Series B 50: 381–391.
Jann, B. 2005a. kdens: Stata module for univariate kernel density estimation. Sta-
tistical Software Components S456410, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s456410.html.
———. 2005b. moremata: Stata module (Mata) to provide various Mata functions.
Statistical Software Components S455001, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s455001.html.
Koenker, R., and G. Bassett Jr. 1978. Regression quantiles. Econometrica 46: 33–50.
Racine, J., and Q. Li. 2004. Nonparametric estimation of regression functions with both
categorical and continuous data. Journal of Econometrics 119: 99–130.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Boca
Raton, FL: Chapman & Hall/CRC.
A Variance estimation
In this section, we describe the analytical variance estimators implemented in ivqte.
The bootstrap represents an alternative and can be implemented in Stata using the
bootstrap prefix. The validity of the bootstrap has been proven for standard quantile
regression; it has not yet been proven for the other estimators, although it seems likely to be valid.
where level is the level for the intended confidence interval, and φ and Φ are the normal
density and distribution functions, respectively. This estimator of the asymptotic vari-
ance is consistent under heteroskedasticity, which is in contrast to the official Stata com-
mand for quantile regression, qreg. This is important because quantile regression only
becomes interesting when the errors are not independent and identically distributed.
where Ŵ_i^{AAI+} are estimates of the projected weights. For the kernel function in the
previous regression, we use an Epanechnikov kernel and h_n = n^{−0.2} Var(Y_i − X_i′γ̂_τ^{IV}),23
as proposed by Abadie, Angrist, and Imbens (2002).
Furthermore, Ĥ(X_i) is estimated by the local linear regression of

  {τ − 1(Y_i − X_i′γ̂_τ^{IV} < 0)} X_i [ −D_i(1 − Z_i)/{1 − P̂r(Z = 1|X_i)}² + (1 − D_i)Z_i/P̂r(Z = 1|X_i)² ]

on X_i.
23. In principle, the same kernel and bandwidth as those for quantile regression can be used. These
choices were made to replicate the results of Abadie, Angrist, and Imbens (2002).
  ψ̂_i = Ŵ_i^{AAI} {τ − 1(Y_i − X_i′γ̂_τ^{IV} < 0)} X_i + Ĥ(X_i) × {Z_i − P̂r(Z = 1|X_i)}

  Ω̂_τ = (1/n) Σ_i ψ̂_i ψ̂_i′

where Ŵ_i^{AAI} are estimates of the weights.
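In matrix form, Ω̂_τ is just the average outer product of the influence functions. A minimal numpy sketch with placeholder ψ̂_i draws (the standard-error line assumes ψ̂_i is the full influence function, so that the estimator's variance is approximately Ω̂/n):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 3
psi = rng.normal(size=(n, k))      # stand-in for the influence functions psi_i

omega = psi.T @ psi / n            # Omega_hat = (1/n) * sum_i psi_i psi_i'
se = np.sqrt(np.diag(omega) / n)   # asymptotic standard errors (sketch)
print(omega.shape, se.shape)       # (3, 3) (3,)
```
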
with

  V = [1/f²_{Y1}(Q^τ_{Y1})] E[ F_{Y|D=1,X}(Q^τ_{Y1}){1 − F_{Y|D=1,X}(Q^τ_{Y1})} / Pr(D = 1|X) ]
    + [1/f²_{Y0}(Q^τ_{Y0})] E[ F_{Y|D=0,X}(Q^τ_{Y0}){1 − F_{Y|D=0,X}(Q^τ_{Y0})} / {1 − Pr(D = 1|X)} ]
    + E[ {ϑ_1(X) − ϑ_0(X)}² ]

where ϑ_d(x) = {τ − F_{Y|D=d,X}(Q^τ_{Yd})}/f_{Yd}(Q^τ_{Yd}). Q^τ_{Y0} and Q^τ_{Y1} have already been esti-
mated by α̂ and α̂ + Δ̂^τ, respectively. The densities f_{Yd}(Q^τ_{Yd}) are estimated by weighted
kernel estimators

  f̂_{Yd}(Q̂^τ_{Yd}) = (1/(n h_n)) Σ_{Di=d} Ŵ_i^F × k( (Y_i − Q̂^τ_{Yd}) / h_n )

with Epanechnikov kernel function and Silverman (1986) bandwidth choice, and where
F_{Y|D=d,X}(Q^τ_{Yd}) is estimated by the local logit estimator described in appendix B.
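As a sanity check on this formula, here is a minimal Python sketch of the weighted kernel density estimator, with uniform weights so that it reduces to an ordinary Epanechnikov KDE; the simulated data and the bandwidth are arbitrary choices, not ivqte defaults.

```python
import numpy as np

def epan(u):
    """Epanechnikov kernel."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def weighted_density(y, w, q, h):
    """Weighted kernel density estimate of f_Y at the point q:
    f_hat(q) = (1/(n h)) * sum_i w_i * k((y_i - q)/h)."""
    return np.sum(w * epan((y - q) / h)) / (len(y) * h)

rng = np.random.default_rng(3)
y = rng.normal(size=5000)
w = np.ones_like(y)                 # uniform weights: the usual KDE
q = np.quantile(y, 0.5)
val = weighted_density(y, w, q, h=0.4)
print(val)                          # approximately 0.40, the N(0,1) density at its median
```
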
  V = [1/{P_c² f²_{Y1|c}(Q^τ_{Y1|c})}] E[ π(X,1)/p(X) × F_{Y|D=1,Z=1,X}(Q^τ_{Y1|c}){1 − F_{Y|D=1,Z=1,X}(Q^τ_{Y1|c})} ]
    + [1/{P_c² f²_{Y1|c}(Q^τ_{Y1|c})}] E[ π(X,0)/{1 − p(X)} × F_{Y|D=1,Z=0,X}(Q^τ_{Y1|c}){1 − F_{Y|D=1,Z=0,X}(Q^τ_{Y1|c})} ]
    + [1/{P_c² f²_{Y0|c}(Q^τ_{Y0|c})}] E[ {1 − π(X,1)}/p(X) × F_{Y|D=0,Z=1,X}(Q^τ_{Y0|c}){1 − F_{Y|D=0,Z=1,X}(Q^τ_{Y0|c})} ]
    + [1/{P_c² f²_{Y0|c}(Q^τ_{Y0|c})}] E[ {1 − π(X,0)}/{1 − p(X)} × F_{Y|D=0,Z=0,X}(Q^τ_{Y0|c}){1 − F_{Y|D=0,Z=0,X}(Q^τ_{Y0|c})} ]
    + E[ {π(X,1)ϑ²_{11}(X) + {1 − π(X,1)}ϑ²_{01}(X)}/p(X) + {π(X,0)ϑ²_{10}(X) + {1 − π(X,0)}ϑ²_{00}(X)}/{1 − p(X)} ]
    − E( p(X){1 − p(X)} [ {π(X,1)ϑ_{11}(X) + {1 − π(X,1)}ϑ_{01}(X)}/p(X) + {π(X,0)ϑ_{10}(X) + {1 − π(X,0)}ϑ_{00}(X)}/{1 − p(X)} ]² )

where ϑ_{dz}(x) = {τ − F_{Y|D=d,Z=z,X}(Q^τ_{Yd|c})}/{P_c × f_{Yd|c}(Q^τ_{Yd|c})}; p(x) = Pr(Z = 1|X = x);
π(x, z) = Pr(D = 1|X = x, Z = z); and P_c is the fraction of compliers. Q^τ_{Y0|c} and
Q^τ_{Y1|c} have already been estimated by α̂_{IV} and α̂_{IV} + Δ̂^τ_{IV}, respectively. The terms
F_{Y|D=d,Z=z,X}(Q^τ_{Yd|c}), p(X), and π(x, z) are estimated by the local logit estimator de-
scribed in appendix B. Finally, P_c is estimated by the sample average of π̂(X_i, 1) − π̂(X_i, 0).
  f_{Yd|c}(Q^τ_{Yd|c}) = lim_{h→0} (1/h) ∫₀¹ k( (Q^θ_{Yd|c} − Q^τ_{Yd|c}) / h ) dθ

and the densities are estimated by

  f̂_{Yd|c}(Q̂^τ_{Yd|c}) = (1/h) ∫₀¹ k( (Q̂^θ_{Yd|c} − Q̂^τ_{Yd|c}) / h ) dθ

where we used a change of variables uh = Q^θ_{Yd|c} − Q^τ_{Yd|c}, which implies that θ = F_{Yd|c}(uh +
Q^τ_{Yd|c}) and dθ = f_{Yd|c}(uh + Q^τ_{Yd|c}) × h du. By the mean value theorem, f_{Yd|c}(uh + Q^τ_{Yd|c}) =
f_{Yd|c}(Q^τ_{Yd|c}) + uh × f′_{Yd|c}(Q̄), where Q̄ is on the line between Q^τ_{Yd|c} and uh + Q^τ_{Yd|c}.
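The identity behind this estimator is easy to verify numerically. Here is a sketch for the Exp(1) distribution, whose quantile function is known in closed form; the Riemann-sum discretization and the bandwidth are our choices, not part of ivqte.

```python
import numpy as np

# Exp(1): quantile Q(theta) = -log(1 - theta), density f(x) = exp(-x).
def Q(theta):
    return -np.log(1 - theta)

def epan(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def density_from_quantiles(tau, h, m=200_000):
    """Approximate f(Q(tau)) = lim_{h->0} (1/h) int_0^1 k((Q(theta)-Q(tau))/h) dtheta
    by a Riemann sum over theta."""
    theta = (np.arange(m) + 0.5) / m
    return np.mean(epan((Q(theta) - Q(tau)) / h)) / h

print(density_from_quantiles(0.5, h=0.05))  # approximately 0.5 = f(Q(0.5)) for Exp(1)
```
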
and of smoothing only with respect to the continuous covariates. When the number
of cells in a dataset is large, each cell may not have enough observations to nonpara-
metrically estimate the relationship among the remaining continuous variables. For
this reason, many applied researchers have treated discrete variables in a parametric
way. We take an intermediate approach and use the hybrid product kernel developed by
Racine and Li (2004). This estimator covers all cases from the fully parametric model
up to the traditional nonparametric estimator.
Overall, we can distinguish four different types of regressors: continuous (for exam-
ple, age), ordered discrete (for example, family size), unordered discrete (for example,
regions), and binary variables (for example, gender). We will treat ordered discrete and
continuous variables in the same way and will refer to them as continuous variables in
the following discussion.25
The unordered discrete and the binary variables are handled differently in the kernel
function and in the local parametric model. The binary variables enter into both as
single regressors. The unordered discrete variables, however, enter as a single regressor
in the kernel function and as a vector of dummy variables in the local model. Consider,
for example, a variable called region that takes four different values: north, south, east,
and west. This variable enters as a single variable in the kernel function but is included
in the local model in the form of three dummies: “south”, “east”, and “west”.
The kernel function is defined in the following paragraph. Suppose that the variables
in X are arranged such that the first q1 regressors are continuous (including the ordered
discrete variables) and the remaining Q − q1 regressors are discrete without natural
ordering (including binary variables). The kernel weights K(Xi − x) are computed as
  K_{h,λ}(X_i − x) = ∏_{q=1}^{q₁} κ( (X_{q,i} − x_q) / h ) × ∏_{q=q₁+1}^{Q} λ^{1(X_{q,i} ≠ x_q)}
where Xq,i and xq denote the qth element of Xi and x, respectively; 1(·) is the indicator
function; κ is a symmetric univariate weighting function; and h and λ are positive
bandwidth parameters with 0 ≤ λ ≤ 1. This kernel function measures the distance
between Xi and x through two components: The first term is the standard product
kernel for continuous regressors with h defining the size of the local neighborhood. The
second term measures the mismatch between the unordered discrete (including binary)
regressors. λ defines the penalty for the unordered discrete regressors. For example,
the multiplicative weight contribution of the Qth regressor is 1 if the Qth element of
Xi and x are identical, and it is λ if they are different. If h = ∞ and λ = 1, then
the nonparametric estimator corresponds to the global parametric estimator and no
interaction term between the covariates is allowed. On the other hand, if λ is zero
and h is small, then smoothing proceeds only within each of the cells defined by the
25. Racine and Li (2004) suggest using a geometrically declining kernel function for the ordered discrete
regressors. There are no reasons, however, against using quadratically declining kernel weights. In
other words, we can use the same (for example, Epanechnikov) kernel for ordered discrete as for
continuous regressors. We therefore treat ordered discrete regressors in the same way as continuous
regressors in the following discussion.
discrete regressors and only observations with similar continuous covariates will be used.
Finally, if λ and h are in the intermediate range, observations with similar discrete and
continuous covariates will be weighted more but further observations will also be used.
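A compact Python sketch of this mixed product kernel follows. It is illustrative only: the coordinate layout and kernel choice are assumptions, not ivqte's internal representation.

```python
import numpy as np

def epan(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def mixed_kernel(Xi, x, n_cont, h, lam):
    """Racine-Li style product kernel: the first n_cont coordinates are
    continuous and use kappa((X_{q,i} - x_q)/h); each unordered discrete
    coordinate contributes 1 if it matches x and lam otherwise."""
    Xi, x = np.asarray(Xi, float), np.asarray(x, float)
    cont = np.prod(epan((Xi[:n_cont] - x[:n_cont]) / h))
    disc = lam ** np.sum(Xi[n_cont:] != x[n_cont:])
    return cont * disc

# two continuous regressors, one unordered discrete regressor
xi = [0.1, -0.2, 3]          # observation
x0 = [0.0, 0.0, 3]           # evaluation point (discrete part matches)
x1 = [0.0, 0.0, 2]           # discrete part mismatches -> extra factor lam
w_match = mixed_kernel(xi, x0, n_cont=2, h=1.0, lam=0.5)
w_mismatch = mixed_kernel(xi, x1, n_cont=2, h=1.0, lam=0.5)
print(w_mismatch / w_match)   # 0.5, i.e., the lambda penalty for one mismatch
```

With h = ∞ and λ = 1, every observation gets the same weight, which reproduces the global parametric model described above.
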
In principle, instead of using only two bandwidth values h and λ for all regressors, a dif-
ferent bandwidth could be employed for each regressor, but doing so would substantially
increase the computational burden for bandwidth selection. It might also lead
to additional noise due to estimating these bandwidth parameters. Therefore, we prefer
to use only two smoothing parameters. ivqte automatically orthogonalizes the data
matrix of all continuous regressors to create an identity covariance matrix. This greatly
diminishes the appeal of having multiple bandwidths.
This kernel function, combined with a local model, is used to estimate E (Y |X ).
If Y is a continuous variable, then ivqte uses by default a local linear estimator to
estimate E(Y|X = x) as α̂ in

  (α̂, β̂) = arg min_{a,b} Σ_{j=1}^{n} {Y_j − a − b′(X_j − x)}² × K_{h,λ}(X_j − x)
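The minimization above is just kernel-weighted least squares, solved pointwise. A Python sketch under an assumed design (product Epanechnikov weights, intercept returned as the estimate of E(Y|X = x); none of this mirrors ivqte's internals):

```python
import numpy as np

def epan(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def local_linear(y, X, x0, h):
    """Local linear estimate of E(Y|X=x0): weighted least squares of y on
    (1, X - x0) with kernel weights; the intercept is alpha_hat."""
    Xc = X - x0                                   # center at evaluation point
    w = np.prod(epan(Xc / h), axis=1)             # product kernel weights
    Z = np.column_stack([np.ones(len(y)), Xc])
    WZ = Z * w[:, None]
    coef = np.linalg.solve(Z.T @ WZ, WZ.T @ y)    # weighted normal equations
    return coef[0]                                # alpha_hat = E(Y|X=x0)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(2000, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=2000)
est = local_linear(y, X, x0=np.array([0.3]), h=0.2)
print(est)   # approximately sin(0.6) ≈ 0.56
```
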
If Y is bound from above and below, a local logistic model is usually preferred. We
suppose in the following discussion that Y is bound within [0, 1].26 This includes the
special case where Y is binary. The local logit estimator guarantees that the fitted
values are always between 0 and 1. The local logit estimator can be used by selecting
the logit option. In this case, E(Y|X = x) is estimated by Λ(α̂), where

  (α̂, β̂) = arg max_{a,b} Σ_{j=1}^{n} [ Y_j ln Λ{a + b′(X_j − x)} + (1 − Y_j) ln[1 − Λ{a + b′(X_j − x)}] ] × K_{h,λ}(X_j − x)
If the estimation fails locally because of collinearity or perfect prediction, the bandwidths are increased
locally. This is repeated until convergence is achieved.
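Analogously, the local logit objective can be maximized by Newton's method at each evaluation point. An illustrative sketch (the Newton iteration count and the simulated design are arbitrary choices, and no local failure handling is shown):

```python
import numpy as np

def epan(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def local_logit(y, X, x0, h, steps=25):
    """Local logit estimate of E(Y|X=x0) for y in {0,1}: maximize the
    kernel-weighted log likelihood by Newton's method and return
    Lambda(alpha_hat), which always lies in (0, 1)."""
    Xc = X - x0
    w = np.prod(epan(Xc / h), axis=1)
    Z = np.column_stack([np.ones(len(y)), Xc])
    beta = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))        # Lambda(a + b'(X - x0))
        grad = Z.T @ (w * (y - p))
        hess = (Z * (w * p * (1 - p))[:, None]).T @ Z
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-beta[0]))          # Lambda(alpha_hat)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(4000, 1))
p_true = 1.0 / (1.0 + np.exp(-2 * X[:, 0]))
y = (rng.random(4000) < p_true).astype(float)
est = local_logit(y, X, x0=np.array([0.5]), h=0.3)
print(0.0 < est < 1.0, est)   # fitted value bounded in (0, 1) by construction
```
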
The locreg command also contains a leave-one-out cross-validation procedure to
choose the smoothing parameters.28 The user provides a grid of values for h and λ, and
the cross-validation criterion is computed for all possible combinations of these values.
The values of the cross-validation criterion are returned in r(cross_valid), and the
combination that minimizes this criterion is chosen. If only one value is given for h and
λ, no grid search is performed.
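The leave-one-out grid search can be sketched as follows. This mirrors the logic, not locreg's code: the Nadaraya–Watson smoother, the data-generating process, and the grids are invented for illustration.

```python
import numpy as np
from itertools import product

def epan(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def loo_cv(y, xc, xd, h, lam):
    """Leave-one-out CV criterion for a Nadaraya-Watson fit with one
    continuous (xc) and one unordered discrete (xd) regressor, using the
    mixed kernel epan((xc_i - xc_j)/h) * lam**1(xd_i != xd_j)."""
    K = epan((xc[:, None] - xc[None, :]) / h) * lam ** (xd[:, None] != xd[None, :])
    np.fill_diagonal(K, 0.0)                  # leave observation i out
    yhat = K @ y / K.sum(axis=1)
    return np.mean((y - yhat) ** 2)

rng = np.random.default_rng(6)
n = 400
xc = rng.uniform(-1, 1, n)
xd = rng.integers(0, 3, n)
y = np.sin(2 * xc) + 0.5 * (xd == 1) + rng.normal(scale=0.2, size=n)

# compute the criterion for every (h, lambda) combination and keep the minimizer
grid_h, grid_lam = [0.1, 0.3, 1.0], [0.2, 0.6, 1.0]
scores = {(h, l): loo_cv(y, xc, xd, h, l) for h, l in product(grid_h, grid_lam)}
h_opt, lam_opt = min(scores, key=scores.get)
print(h_opt, lam_opt)
```
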
sample(varname[, replace])
aweights and pweights are allowed. See [U] 11.1.6 weight for more information
on weights.
B.3 Description
locreg computes the nonparametric estimation of the mean of depvar conditionally on
the regressors given in continuous(), dummy(), and unordered(). A mixed kernel is
used to smooth over the continuous and discrete regressors. The fitted values are saved
in the variable newvarname. If a list of values is given in bandwidth() or lambda(), the
smoothing parameters h and λ are estimated via leave-one-out cross-validation. The
values of h and λ minimizing the cross-validation criterion are selected. These values are
then used to predict depvar, and the fitted values are saved in the variable newvarname.
locreg can be used in three different ways. First, if only one value is given in
bandwidth() and one in lambda(), locreg estimates the nonparametric regression
using these values and saves the fitted values in generate(newvarname). Alternatively,
28. The cross-validated parameters are optimal to estimate the weights but are not optimal to estimate
the unconditional QTE. In the absence of a better method, we offer cross-validation, but the user
should keep in mind that the optimal bandwidths for the unconditional QTE converge to zero at a
faster rate than the bandwidths delivered by cross-validation. The user is therefore encouraged to
also examine the estimated QTE when using some undersmoothing relative to the cross-validation
bandwidths.
locreg can also be used to estimate the smoothing parameters via leave-one-out cross-
validation. If we do not specify the generate() option but supply a list of values in
the bandwidth() or lambda() option, only the cross-validation is performed. Finally, if
several values are specified in bandwidth() or lambda() when the generate() option is
also specified, locreg estimates the optimal smoothing parameters via cross-validation.
Thereafter, it estimates the conditional means with these smoothing parameters and
returns the fitted values in the variable generate(newvarname).
For the nonparametric regression, locreg offers two local models: linear and logistic.
The logistic model is usually preferred if depvar is bound within [0, 1]. This includes
the case where depvar is binary but also incorporates cases where depvar is nonbinary
but bound from above and below. If the lower and upper bounds of depvar are different
from 0 and 1, the variable depvar should be rescaled to the interval [0, 1] before using
this command. If depvar is not bound from above and below, the linear model should
be used.29
B.4 Options
generate(newvarname[, replace]) specifies the name of the variable that will contain
the fitted values. If this option is not used, only the leave-one-out cross-
validation estimation of the smoothing parameters h and λ will be performed. The
replace option allows locreg to overwrite an existing variable or to create a new
one where none exists.
continuous(varlist), dummy(varlist), and unordered(varlist) specify the names of the
covariates depending on their type. Ordered discrete variables should be treated as
continuous.
kernel(kernel) specifies the kernel function. kernel may be epan2 (Epanechnikov kernel
function; the default), biweight (biweight kernel function), triweight (triweight
kernel function), cosine (cosine trace), gaussian (Gaussian kernel function),
parzen (Parzen kernel function), rectangle (rectangle kernel function), or
triangle (triangle kernel function). In addition to these second-order kernels, there
are also several higher-order kernels: epanechnikov_o4 (Epanechnikov kernel of order 4),
epanechnikov_o6 (order 6), gaussian_o4 (Gaussian kernel of order 4), gaussian_o6 (order
6), and gaussian_o8 (order 8).30
29. In the current implementation, there is not yet a local model specifically designed for depvar that
is bound only from above or only from below. A local tobit or local exponential model may be
added in future versions.
30. The formulas for the higher-order kernel functions are given in footnote 16.
sample(varname[, replace]) specifies the name of the variable that marks the estimation
sample. This is similar to the function e(sample) for e-class commands.
31. In case of multicollinearity, h is increased repeatedly until the problem disappears. If multicollinear-
ity is still present at h = 100, then λ is increased repeatedly. Note that locreg first examined
whether multicollinearity is a problem in the global model (h = ∞, λ = 1 ) before attempting to
estimate locally.
32. This is different from ivqte, where local logit is the default for binary dependent variables.
B.6 Examples
We briefly illustrate the use of locreg with a few examples. We use card.dta and
keep only 200 observations to keep the computational time reasonable for this illus-
tration. (Because of missing values on covariates, eventually only 184 observations are
retained in the estimation.) The aim is to estimate the probability of living near a
four-year college (nearc4) as a function of experience, exper (continuous() variable);
mother’s education, motheduc (ordered discrete); region (unordered()); and black
(dummy()). locreg can be used in three different ways. First, if only one value is
given in bandwidth(#) and one in lambda(#), locreg estimates the nonparametric
regression using these values h and λ and saves the fitted values in newvarname:
. use card, clear
. set seed 123
. sample 200, count
(2810 observations deleted)
. locreg nearc4, generate(fitted1) bandwidth(0.5) lambda(0.5)
> continuous(exper motheduc) dummy(black) unordered(region)
(output omitted )
The fitted1 variable contains the estimated probabilities. Because some of them
turn out to be negative and others to be larger than one, we may prefer to fit a local
logit regression and add the logit option:
locreg can also be used to estimate the smoothing parameters via leave-one-out
cross-validation. If we do not specify the generate() option but instead supply a list of
values in the bandwidth() or the lambda() option (or both), only the cross-validation
is performed:
In this example, the cross-validation criterion is calculated for each of the four cases:
(h, λ) = (0.2, 0.5), (0.2, 0.8), (0.5, 0.5), and (0.5, 0.8). The scalars r(optb) and r(optl)
indicate the values that minimized the cross-validation criterion. In our example, we
obtain h = 0.2 and λ = 0.5. The cross-validation results are saved in the matrix
r(cross_valid) for every h and λ combination of the search grid.
If we would like to include the value infinity in our cross-validation search, that is,
the global parametric model, we would supply a missing value for h and a value of 1 for
λ. For example, specifying bandwidth(0.2 .) lambda(0.5 1) implies that the cross-
validation criterion is calculated for each of the four cases: (h, λ) = (0.2, 0.5), (0.2, 1),
(∞, 0.5), and (∞, 1). Similarly, specifying bandwidth(0.2 0.5 .) lambda(0.5 0.8
1) implies a search grid with nine values: (h, λ) = (0.2, 0.5), (0.2, 0.8), (0.2, 1), (0.5, 0.5),
(0.5, 0.8), (0.5, 1), (∞, 0.5), (∞, 0.8), and (∞, 1).
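The search grids described here are plain Cartesian products, with a missing bandwidth standing for h = ∞. A trivial Python sketch of the nine-point grid from the last example:

```python
from itertools import product
import numpy as np

bandwidths = [0.2, 0.5, np.inf]   # Stata's "." maps to h = infinity
lambdas = [0.5, 0.8, 1.0]
grid = list(product(bandwidths, lambdas))
print(len(grid))                  # 9 combinations, as in the example
```
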
Finally, if several values are specified for the smoothing parameters and the
generate() option is also activated, then locreg first estimates h and λ via cross-
validation and thereafter returns the fitted values obtained with the selected h and λ
in the fitted3 variable.
Abstract. In this article, we describe screening, a new Stata command for data
management that can be used to examine the content of complex narrative-text
variables to identify one or more user-defined keywords. The command is useful
when dealing with string data contaminated with abbreviations, typos, or mistakes.
A rich set of options allows a direct translation from the original narrative string
to a user-defined standard coding scheme. Moreover, screening is flexible enough
to facilitate the merging of information from different sources and to extract or
reorganize the content of string variables.
Editors’ note. This article refers to undocumented functions of Mata, meaning that
there are no corresponding manual entries. Documentation for these functions is
available only as help files; see help regex.
Keywords: dm0050, screening, keyword matching, narrative-text variables, stan-
dard coding schemes
1 Introduction
Many researchers in varied fields frequently deal with data collected as narrative text,
which are almost useless unless treated. For example,
• Electronic patient records (EPRs) are useful for decision making and clinical re-
search only if patient data that are currently documented as narrative text are
coded in standard form (Moorman et al. 1994).
• When different sources of data use different spellings to identify the same unit of in-
terest, the information can be exploited only if codes are made uniform (Raciborski
2008).
• Because of verbatim responses to open-ended questions, survey data items must
be converted into nominal categories with a fixed coding frame to be useful for
applied research.
These are only three of the many critical examples that motivate an ad hoc command.
Recoding a narrative-text variable into a user-defined standard coding scheme is cur-
rently possible in Stata by combining standard data-management commands (for exam-
ple, generate and replace) with regular expression functions (for example, regexm()).
© 2010 StataCorp LP dm0050
F. Belotti and D. Depalo 459
However, many problems do not yield easily to this approach, especially problems con-
taining complex narrative-text data. Consider, for example, the case when many source
variables can be used to identify a set of keywords; or the case when, looking at different
keywords, one is within a given source variable but not necessarily at the beginning of
that variable, whereas the others are at the beginning, the end, or within that or other
source variables. Because no command jointly handles all possible cases, these cases
can be treated with existing Stata commands only after long and tedious programming,
increasing the possibility of introducing errors. We developed the screening command
to fill this gap, simplifying data-cleaning operations while being flexible enough to cover
a wide range of situations.
In particular, screening checks the content of one or more string variables (sources)
to identify one or more user-defined regular expressions (keywords). Because string vari-
ables are not flexible, to make the command easier and more useful, a set of options
reduces your preparatory burden. You can make the matching task wholly case in-
sensitive or set matching rules aimed at matching keywords at the beginning, the end,
or within one or more sources. If source variables contain periods, commas, dashes,
double blanks, ampersands, parentheses, etc., it is possible to perform the matching by
removing such undesirable content. Moreover, if the matching task becomes more dif-
ficult because of abbreviations or even pure mistakes, screening allows you to specify
the number of letters to screen in a keyword. Finally, the command allows a direct
translation of the original string variables into a user-defined standard coding scheme.
All these features make the command simple, extremely flexible, and fast, minimizing
the possibility of introducing errors. It is worth emphasizing that we find Mata more
convenient to use than Stata, with advantages in terms of execution time.
The article is organized as follows. In section 2, we describe the new screening
command, and we provide some useful tips in section 3. Section 4 illustrates the main
features of the command using EPR data, while section 5 details some critical cases in
which the use of screening may aid your decision to merge data from different sources
or to extract and reorder messy data. In the last section, section 6, we offer a short
summary.
2.1 Syntax
2.2 Options
sourcesopts description
lower perform a case-insensitive match (lowercase)
upper perform a case-insensitive match (uppercase)
trim match keywords by removing leading and trailing blanks
from sources
itrim match keywords by collapsing sources with consecutive
internal blanks to one blank
removeblank match keywords by removing from sources all blanks
removesign match keywords by removing from sources the following
signs: * + ? / \ % ( ) [ ] { } | . ^ - _ # $
keys( matching rule "string" . . . ) specifies one or more regular expressions (key-
words) to be matched with source variables. keys() is required.
type description
tab tabulate all matched cases for each keyword within each source variable
count display a table of frequency counts of all matched cases for each
keyword within each source variable
cases(newvar) generates a set of categorical variables (as many as the number of key-
words) showing the number of occurrences of each keyword within all specified source
variables.
newcodeopts description
replace replace newvar if it already exists
add obtain newvar as a concatenation of subexpressions returned by
regexs(n), which must be specified as a
user-defined code in the recode() option
label attach keywords as value labels to newvar
numeric convert newvar from string to numeric; it can be specified only if
the recode() option is specified
recode(recoding rule "user defined code" recoding rule "user defined code" ... )
recodes the newcode() newvar according to a user-defined coding scheme. recode()
must contain at least one recoding rule followed by one user defined code. When you
specify recode(1 "user defined code"), the "user defined code" will be used to re-
code all matched cases from the first keyword within the list specified via the keys()
option. If recode(2,3 "user defined code") is specified, the "user defined code" will
be used to recode all cases for which second and third keywords are simultaneously
matched, and so on. This option can only be specified if the newcode() option is
specified.
checksources checks whether source variables contain special characters. If a match-
ing rule is specified (begin or end via keys()), checksources checks the sources’
boundaries accordingly.
tabcheck tabulates all cases from checksources. If there are too many cases, the
option does not produce a table.
memcheck performs a “preventive” memory check. When memcheck is specified, the
command will exit promptly if the allocated memory is insufficient to run screening.
When memory is insufficient and screening is run without memcheck, the command
could run for several minutes or even hours before producing the message no room
to add more variables.
3 Tips
The low flexibility of string variables is a reason for concern. In this section, we provide
some tips to enhance the usefulness of screening. Some tips are useful to execute the
command, while other tips are useful to check the results.
Most importantly, capitalization matters: this means that screening for KEYWORD is
different from screening for keyword. If source variables contain HEMINGWAY and you are
searching for Hemingway, screening will not identify the keyword. If suboption upper
(lower) is specified in sources(), keywords will be automatically matched in uppercase
(lowercase).
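The effect of case folding can be mimicked in any language; this minimal Python sketch (an illustration, not the command's actual implementation) mirrors the default behavior versus the lower/upper suboptions:

```python
sources = ["HEMINGWAY", "Hemingway", "hemingway"]
keyword = "Hemingway"

# Default behavior: capitalization matters.
exact = [keyword in s for s in sources]
# With lower (or upper), both sides are compared in a single case.
folded = [keyword.lower() in s.lower() for s in sources]

print(exact)   # [False, True, False]
print(folded)  # [True, True, True]
```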
Choose an appropriate matching rule. The screening default is to match keywords
over the entire content of source variables. By specifying the matching rule begin or
end within the keys() option, you may restrict matching to the corresponding string
boundaries. For example, if sources contain HEMINGWAY ERNEST and ERNEST HEMINGWAY
and you are searching begin HEMINGWAY, the screening command will identify the
keyword only in the former case. Whether the two cases are equivalent must be evaluated
case by case.
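The begin rule corresponds to anchoring the keyword at the start of the source, like the regular-expression operator ^; the HEMINGWAY example can be cross-checked with any regex engine, here Python:

```python
import re

sources = ["HEMINGWAY ERNEST", "ERNEST HEMINGWAY"]

# Default: match the keyword anywhere in the source.
anywhere = [bool(re.search(r"HEMINGWAY", s)) for s in sources]
# begin: match only at the start of the source.
at_begin = [bool(re.search(r"^HEMINGWAY", s)) for s in sources]

print(anywhere)  # [True, True]
print(at_begin)  # [True, False]
```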
Another issue is how to choose the optimal number of letters to be screened. For
example, with EPR data, different physicians might use different abbreviations for the
same pathologies, so there is no single “right” number of letters. As
a rule of thumb, the number of letters should be specified as the minimum number
that uniquely identifies the case of interest. Using many letters can be too exclusive,
while using few letters can be too inclusive. In all cases, but in particular when the
appropriate number of letters is unknown, we find it useful to tabulate all matched cases
through the explore(tab) option. Because it tabulates all possible matches between
all keywords and all source variables, it is the fastest way to explore the data and choose
the best matching strategy (in terms of keywords, matching rule, and letters).
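The trade-off can be illustrated with a toy example (COLONSCOPIA is a hypothetical unwanted match, not taken from the article's data); matching only the first three letters of the 11-letter keyword colesterolo behaves roughly like this Python sketch:

```python
import re

tests = ["COLESTEROLO TOTALE", "COL TOT", "COLONSCOPIA", "TRIGLICERIDI"]

# Full keyword: abbreviations such as COL TOT are missed (too exclusive).
full = [bool(re.search(r"COLESTEROLO", t)) for t in tests]
# First three letters only: unrelated tests may match too (too inclusive).
short = [bool(re.search(r"COL", t)) for t in tests]

print(full)   # [True, False, False, False]
print(short)  # [True, True, True, False]
```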
Advanced users can maximize the potential of screening by mixing keywords
with Stata regular-expression operators. Mixing in operators allows you to match more-
complex patterns, as we show later in the article.1 For more details on regular-expression
syntaxes and operators, see the official documentation at
http://www.stata.com/support/faqs/data/regex.html.
1. The letters() option does not work if a keyword contains regular-expression operators.
screening displays several messages to inform you about the effects of the specified
options. For example, consider the case in which you are searching some keywords con-
taining regular-expression operators. screening will display a message with the correct
syntax to search a keyword containing regular-expression operators. The nowarnings
option allows you to suppress all warning messages.
screening generates several temporary variables (proportional to the number of
keywords you are looking for and to the number of sources you are looking from). So
when you are working with a big dataset and your computer is limited in terms of
RAM, it might be a good idea to perform a “preventive” memory check. When the
memcheck option is specified and the allocated memory is insufficient, screening will
exit promptly rather than running for several minutes or even hours before producing
the message no room to add more variables.
We conclude this section with an evaluation of the command's execution time
using different Stata flavors and different operating systems. In particular, we
compare the latest version of screening written using Mata regular-expression func-
tions with its beta version written entirely using the Stata counterpart. We built three
datasets of 500,000 (A), 5 million (B), and 50 million (C) observations with an ad hoc
source variable containing 10 different words: HEMINGWAY, FITZGERALD, DOSTOEVSKIJ,
TOLSTOJ, SAINT-EXUPERY, HUGO, CERVANTES, BUKOWSKI, DUMAS, and DESSI. Screening
for HEMINGWAY (50% of total cases) gives the following results (in seconds):
4 Example
To illustrate the command, we use anonymized patient-level data from the Health Search
database, a nationally representative panel of patients run by the Italian College of
General Practitioners (Italian Society of General Medicine). Our sample contains freely
inputted EPRs concerning the prescription of diagnostic tests.2 A list of 15 observations
2. The original data are in Italian. Where necessary for comprehension, we translate to English.
from the “uppercase” source variable diagn test description provides an overview of
cases at hand:
diagn_test_descr
TRIGLICERIDI
EMOCROMO FORMULA
COLESTEROLO TOTALE
ALTEZZA
PT TEMPO PROTROMBINA
VISITA CARDIOLOGICA CONTROLLO
HCV AB EPATITE C
COMPONENTE MONOCLONALE
ATTIVITA´ FISICA
PSA ANTIGENE PROSTATICO SPECIFICO
RX CAVIGLIA SN
FAMILIARITA´ K UTERO
TRIGLICERIDI
URINE ESAME COMPLETO
URINE PESO SPECIFICO
As you can see, this is a rich EPR dataset that is totally useless unless treated. If
data were collected for research purposes, physicians would be given a finite number of
possible options. There is much agreement in the scientific community that the cost
of leaving the burden of inputting standard codes to physicians at the time of contact
with the patient is higher than the relative benefit: the task is extremely onerous, it is
unrelated to the physician’s primary job, and most importantly, it requires extra effort.
Therefore, the common view supports the implementation of data-entry methods that
do not disturb the physician’s workflow (Yamazaki and Satomura 2000).
From the above list of observations, it is also clear that free-text data entry provides
physicians with the freedom to determine the order and level of detail at which they
want to input data. Even if the original free-text data were complete, it would still be
difficult to extract standardized and structured data from this kind of record because
of abbreviations, typos, or mistakes (Moorman et al. 1994). Extracting data in the
presence of abbreviations and typos is exactly what screening allows you to do.
As a practical example, we focus on the identification of different types of cholesterol
tests. In particular, our aim is to create a new variable (diagn test code) containing
cholesterol test codes according to the Italian National Health System coding scheme.
Because at least three types of cholesterol test exist, namely, hdl, ldl, and total, our
matching strategy must take into account that a physician can input 1) only the types
of the test, 2) only its broad definition (cholesterol), or 3) both, without considering
abbreviations, typos, mistakes, and further details.
Thus we first explore the data by running screening with the explore(tab) option:
Here the lower suboption makes the matching task case insensitive. Apart from the
explore(tab) option, the syntax above is compulsory and performs what we call a
default matching, that is, an exact match of the keyword colesterolo over the entire
content of the source variable diagn test descr. The tabulation above (notice the
lowercase) informs you that the keyword colesterolo is encountered in 5,696 cases.
What do these cases contain? Because you did not instruct the command to match a
shorter length of the keyword, the only possible case is the keyword itself; all the cases
contain the keyword colesterolo.
Given the nature of the data, it might be convenient to run screening with a
shorter length of the keyword so as to find possible partial matching in the presence
of abbreviations or mistakes. The letters(#) option instructs screening to perform
the match on a shorter length:
3. Because of space restrictions, we deliberately omit the complete tabulation obtainable with the
explore(tab) option. It is available upon request.
Again screening detects new cases: 2,034 cases characterized by the abbreviation
col tot (that is, total cholesterol) that are impossible to identify without further re-
ducing the number of letters. The problem is that, among all matched cases (12,595),
there are also a number of unwanted cases, that is, cases containing the same spelling
of the keyword but related to another type of diagnostic test. Despite this incorrect
identification, we will show later in the section how to obtain a new “recoded variable”
by specifying the appropriate recoding rule as an argument of the recode() option.
The number of letters you match plays a critical role: specifying a high number
of letters may cause the number of matched observations to be artificially low due to
mistakes or abbreviations in the source variables; on the other hand, matching a small
number of letters may cause the number of matched observations to be artificially high
due to the inclusion of uninteresting cases containing the “too short” keyword.
For this reason, the best recoding strategy is to first specify keywords that uniquely
identify the cases of interest. Because keywords hdl and ldl each uniquely identify a
cholesterol test, they must have priority in the recoding process over totale, which is
an extension common to other pathologies.
Indeed, when we reverse the order of the keywords and specify the replace suboption
in the newcode() option, screening produces
where the newcode() variable now identifies all hdl and ldl cases. Notice that here
we followed the correct approach, from specific to general. Moreover, as shown by the
following code, when we specify the newcode() suboption label, screening attaches
the specified keywords as value labels to the newcode() variable.
The last step toward recoding is achieved by using the recode() option. This option
allows you to recode the newcode() variable according to a user-defined coding scheme.
When you specify this option, the coding process is completely under your control.
The recode() option requires a recoding rule followed by a "user defined code" (the
"user defined code" must be enclosed within double quotes).
When we specify recode(1 "90.14.1" ...), the standard code "90.14.1" will
be used to recode all matched cases from the first keyword (hdl); when we specify
recode(... 2 "90.14.2" ...), the standard code "90.14.2" will be used to recode
all matched cases from the second keyword (ldl); and so on. The third and forth
keywords deserve special attention. totale (which was specified as the forth keyword,
hence position 4) is a common extension that we want to identify only when it is
matched simultaneously with colesterolo (which was specified as the third keyword,
hence position 3). Thus the appropriate syntax in this case will be recode(... 3,4
"90.14.3" ...). Finally, when we specify recode(... 3 "not class. tests"), the
code "not class. tests" will be used to recode all matched cases from the third
keyword (colesterolo) that are not classified because they do not contain any further
specification.
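The priority logic behind this recoding scheme can be sketched abstractly (a hypothetical re-implementation for illustration, not the command's code): rules are tried in order, and a rule such as 3,4 fires only when both keywords matched.

```python
# Each rule maps a set of matched-keyword positions to a standard code.
# Keywords: 1 = hdl, 2 = ldl, 3 = colesterolo, 4 = totale.
rules = [
    ({1}, "90.14.1"),
    ({2}, "90.14.2"),
    ({3, 4}, "90.14.3"),          # colesterolo AND totale together
    ({3}, "not class. tests"),    # colesterolo without further detail
]

def recode(matched):
    """Return the code of the first rule whose keywords all matched."""
    for keys, code in rules:
        if keys <= matched:  # subset test: every required keyword matched
            return code
    return ""

print(recode({3, 4}))  # 90.14.3
print(recode({3}))     # not class. tests
```

Because {3} is tested after {3, 4}, the specific-to-general ordering discussed above is preserved.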
The final syntax of our example is
. screening, sources(diagn_test_descr, lower)
> keys("hdl" "ldl" "colesterolo" "totale") letters(3 3 3 3)
> newcode(diagn_test_code)
> recode(1 "90.14.1" 2 "90.14.2" 3,4 "90.14.3" 3 "not class. tests")
. tabulate diagn_test_code
diagn_test_code        Freq.
                       2,244
90.14.1                2,872
90.14.2                2,015
90.14.3                5,055
not class. tests       2,676
This example shows that screening is a simple tool to manage complex string vari-
ables. Once you have obtained structured data (in our example, a categorical variable
indicating cholesterol tests), you can finally start your statistical analysis.
4. Because of space restrictions, we deliberately omit the tabulation of such cases. It is available upon
request.
5 Extensions
Although the main utility of screening is the direct translation of complex narrative-
text variables into a user-defined coding scheme, the command is flexible enough to cover
a wide range of situations. In section 5.1, we present an example of how to use the
command to facilitate the merging of information from different sources, while in sec-
tion 5.2, we show how to use screening to extract or rearrange a portion of a string
variable.
As you can see, there are 288 inconsistencies.6 When we tabulate the unmatched
cases, we realize that unconventional expressions, such as apostrophes, accents, dou-
ble names, etc., are responsible for this imperfect result:
. preserve
. sort comune
. drop if _merge==3
(7812 observations deleted)
. list comune _merge in 1/20, separator(20) noobs
comune _merge
AGLIE´ 2
AGLI 1
ALA´ DEI SARDI 2
ALBISOLA MARINA 2
ALBISOLA SUPERIORE 2
ALBISSOLA MARINA 1
ALBISSOLA SUPERIORE 1
ALI´ 2
ALI´ TERME 2
ALLUVIONI CAMBIO´ 2
ALLUVIONI CAMBI 1
ALME´ 2
ALM 1
AL DEI SARDI 1
AL 1
AL TERME 1
ANTEY-SAINT-ANDRE´ 2
ANTEY-SAINT-ANDR 1
APPIANO SULLA STRADA DEL 2
APPIANO SULLA STRADA DEL VINO 1
. restore
If you wish to recover all 288 unmatched municipalities, the proposed command is
a simple and fast solution. Indeed, when you take advantage of the available options,
you can (almost) completely recover unmatched cases with only one command. As an
example, we recover nine cases (it is possible to recover all cases with this procedure),
with a loop running on values of _merge equal to 1 or 2, that is, running only on
unmatched cases:
6. The number of unmatched cases is different between the master (288) and the using (290) datasets
because of aggregation and separation of municipalities. Solving this kind of problem is beyond
the illustrative scope of this example.
. forvalues i=1/2 {
2. preserve
3. keep if _merge==`i´
4.
. screening, sources(comune) keys("ALBISSOLA" "AQUILA D´ARROSCIA" "BAJARDO"
> "BARCELLONA" "BARZAN" "BRIGNANO" "CADERZONE" "CAVAGLI" "MARINA" "SUPERIORE")
> cases(cases) newcode(comune, replace)
> recode(1,9 "ALBISOLA MARINA" 1,10 "ALBISOLA SUPERIORE" 2 "AQUILA DI ARROSCIA"
> 3 "BAIARDO" 4 "BARCELLONA POZZO DI GOTTO" 5 "BARZANO´" 6 "BRIGNANO FRASCATA"
> 7 "CAVAGLIA" 8 "CADERZONE TERME")
5. if `i´==1 drop codice_ente
6. if `i´==2 drop codice
7. keep comune codice
8. sort comune
9. save new_`i´,replace
10. restore
11. }
(8102 observations deleted)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
(note: file new_1.dta not found)
file new_1.dta saved
(8100 observations deleted)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
(note: file new_2.dta not found)
file new_2.dta saved
. keep if _merge==3
(578 observations deleted)
. save perfect_match, replace
(note: file perfect_match.dta not found)
file perfect_match.dta saved
. use new_1, clear
. merge 1:1 comune using new_2
(output omitted )
. tabulate _merge
(output omitted )
Because we deliberately recovered only nine cases, the number of exact matches
improves by nine relative to the situation before the execution of screening, from
7,812 to 7,821.
Example 1
Imagine you have the string variable address, and you want to create a new variable
that contains just the zip codes. Here is what the source variable address may look
like:
address
To find the zip code, you have to use screening with specific regular expressions,
allowing it to exactly match all cases in the source variable address. Some examples
of specific regular expressions are the following:
• [0-9]* to match zero or more numbers, that is, the zip code plus any other
numbers
• ([0-9][0-9][0-9][0-9][0-9]) to capture a subexpression of exactly five digits,
that is, the zip code
Once the correct regular expression(s) is found, to use screening to create a new
variable containing the zip codes, you have to do the following:
7. The following examples have been taken from the UCLA website resources to help you learn and
use Stata. See http://www.ats.ucla.edu/stat/stata/faq/regex.htm.
2. Combine the above regular expressions and use them as a unique keyword.
3. Use the regexs(n) function as a "user defined code" in the recode() option.
regexs(n) returns the subexpression n from the respective keyword match, where
0 ≤ n ≤ 10. Stata regular-expression syntaxes use parentheses, (), to denote
a subexpression group. In particular, n = 0 is reserved for the entire string
that satisfied the regular expression (keyword); n = 1 is reserved for the first
subexpression that satisfied the regular expression (keyword); and so on.
1. 1 is the recoding rule; that is, the coding process is related to the first (and unique)
keyword match.
2. regexs(1) is used to recode. Indeed, it returns the string related to the first (and
unique) subexpression match.8
As a result, the new variable zipcode is created by using only one line of code.
Notice that screening warns you that you are matching a keyword containing one or
more regular-expression operators.
8. Remember that subexpressions are denoted by using (). In the considered syntax, the only subex-
pression is represented by ([0-9][0-9][0-9][0-9][0-9]). This means that, in this case, you cannot
specify n > 1.
Example 2
Suppose you have a variable containing a person’s full name. Here is what the variable
fullname looks like:
fullname
John Adams
Adam Smiths
Mary Smiths
Charlie Wade
Our goal is to swap first name with last name, separating them by a comma. The
regular expression to reach the target is (([a-zA-Z]+)[ ]*([a-zA-Z]+)). It is com-
posed of three parts:
2. [ ]* to match with a space(s), that is, the blank between first and last name
3. ([a-zA-Z]+) again to capture a string consisting of letters, this time the last
name
fullname
Adams, John
Smiths, Adam
Smiths, Mary
Wade, Charlie
Notice the newcode() suboption add. It can be specified only when a regexs(n)
function is specified as a "user defined code" in the recode() option. The add suboption
allows for the creation of the newcode() variable as a concatenation of subexpressions
returned by regexs(n). In the example above,
1. recode(1 "regexs(2)," ... returns the second subexpression from the first
keyword match (the last name) plus a comma.
2. ...2 "regexs(0)" ... returns the blank matched by the second keyword;
3. ...3 "regexs(1)") returns the first subexpression from the third keyword match
(the first name).
As a result, the variable fullname is replaced (note the suboption replace) sequen-
tially by the concatenation of subexpressions returned by 1, 2, and 3 above.
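As a cross-check outside Stata, the same swap can be done with a single substitution; note that in this combined pattern Python's group numbering assigns 2 to the first name and 3 to the last name (the article instead splits the pattern across three keywords, so its subexpression numbers differ):

```python
import re

names = ["John Adams", "Adam Smiths", "Mary Smiths", "Charlie Wade"]

# (([a-zA-Z]+)[ ]*([a-zA-Z]+)): group 2 = first name, group 3 = last name.
swapped = [re.sub(r"(([a-zA-Z]+)[ ]*([a-zA-Z]+))", r"\3, \2", n)
           for n in names]

print(swapped)  # ['Adams, John', 'Smiths, Adam', 'Smiths, Mary', 'Wade, Charlie']
```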
Example 3
Imagine that you have the string variable date containing dates:
. list date, noobs sep(20)
date
20jan2007
16June06
06sept1985
21june04
4july90
9jan1999
6aug99
19august2003
The goal is to produce a string variable with the appropriate four-digit year for each
case, which Stata can easily convert into a date. You can achieve the target by coding
something like the following:
. generate day = regexs(0) if regexm(date, "^[0-9]+")
. generate month = regexs(0) if regexm(date, "[a-zA-Z]+")
. generate year = regexs(0) if regexm(date, "[0-9]*$")
. replace year = "20"+regexs(0) if regexm(year, "^[0][0-9]$")
(2 real changes made)
. replace year = "19"+regexs(0) if regexm(year, "^[1-9][0-9]$")
(2 real changes made)
. generate date1 = day+month+year
date date1
20jan2007 20jan2007
16June06 16June2006
06sept1985 06sept1985
21june04 21june2004
4july90 4july1990
9jan1999 9jan1999
6aug99 6aug1999
19august2003 19august2003
Also in this case, as in the previous example, we specify the newcode() suboption add
because we need to create the newcode() variable as a concatenation of subexpressions
from keyword matching. The same result can be obtained using the following syntax:
. screening, sources(date)
> keys(begin "[0-9]+" "[a-zA-Z]+" end "[0][0-9]" end "[1-9][0-9]")
> newcode(date1, add)
> recode(1 "regexs(0)" 2 "regexs(0)" 3 "20+regexs(0)" 4 "19+regexs(0)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesn´t work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. list date date1, noobs sep(10)
date date1
20jan2007 20jan2007
16June06 16June2006
06sept1985 06sept1985
21june04 21june2004
4july90 4july1990
9jan1999 9jan1999
6aug99 6aug1999
19august2003 19august2003
where the only difference is represented by the way in which the matching rule is spec-
ified: begin instead of ^ and end instead of $.
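The same three-step extraction and two-digit-year expansion can be reproduced in Python (an illustrative re-implementation using the article's regular expressions):

```python
import re

dates = ["20jan2007", "16June06", "06sept1985", "21june04",
         "4july90", "9jan1999", "6aug99", "19august2003"]

def fix_date(d):
    day = re.match(r"^[0-9]+", d).group(0)       # leading digits
    month = re.search(r"[a-zA-Z]+", d).group(0)  # letters in the middle
    year = re.search(r"[0-9]*$", d).group(0)     # trailing digits
    if re.match(r"^[0][0-9]$", year):            # 00-09 -> 20xx
        year = "20" + year
    elif re.match(r"^[1-9][0-9]$", year):        # 10-99 -> 19xx
        year = "19" + year
    return day + month + year

print([fix_date(d) for d in dates])
```

Running the sketch reproduces the date1 column of the listing above.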
6 Summary
In this article, we introduced the new screening command, a data-management tool
that helps you examine and treat the content of string variables containing free, possibly
complex, narrative text. screening allows you to build new variables, to recode new
or existing variables, and to build a set of categorical variables indicating keyword
occurrences (a first step toward textual analysis). Considerable efforts were devoted
to making the command as flexible as possible; thus screening contains a rich set
of options that is intended to cover the most frequently encountered problems and
necessities. Because of this flexibility, the command can be used in many different
fields, like EPR data, data from different sources, or survey data. The execution of
screening is fast, thanks to Mata programming; its syntax is simple and common to
many other Stata commands, thus it is useful for all users regardless of their levels of
experience in Stata. We especially recommend that you use the explore() option; it
makes the command a useful data-mining tool. Nevertheless, expert users can exploit
a more complicated syntax that substantially eases the preparatory burden for data
cleaning.
Acknowledgments
We would like to thank Alice Cortignani, Rossana D’Amico, Andrea Piano Mortari,
and Riccardo Zecchinelli who tested the command, Vincenzo Atella who read an earlier
version of the article, Iacopo Cricelli who provided us with EPR data, and Rafal Raci-
borski for useful discussions. We are also grateful to David Drukker and all participants
at the 2009 Italian Stata Users Group meeting. Finally, the suggestions made by the
referee and the editor were useful to improve the command. We are responsible for any
remaining errors.
7 References
Moorman, P. W., A. M. van Ginneken, J. van der Lei, and J. H. van Bemmel. 1994. A
model for structured data entry based on explicit descriptional knowledge. Methods
of Information in Medicine 33: 454–463.
Raciborski, R. 2008. kountry: A Stata utility for merging cross-country data from
multiple sources. Stata Journal 8: 390–400.
Yamazaki, S., and Y. Satomura. 2000. Standard method for describing an electronic
patient record template: Application of XML to share domain knowledge. Methods
of Information in Medicine 39: 50–55.
Abstract. Sample skewness and kurtosis are limited by functions of sample size.
The limits, or approximations to them, have repeatedly been rediscovered over
the last several decades, but nevertheless seem to remain only poorly known. The
limits impart bias to estimation and, in extreme cases, imply that no sample could
bear exact witness to its parent distribution. The main results are explained in a
tutorial review, and it is shown how Stata and Mata may be used to confirm and
explore their consequences.
Keywords: st0204, descriptive statistics, distribution shape, moments, sample size,
skewness, kurtosis, lognormal distribution
1 Introduction
The use of moment-based measures for summarizing univariate distributions is long
established. Although there are yet longer roots, Thorvald Nicolai Thiele (1889) used
mean, standard deviation, variance, skewness, and kurtosis in recognizably modern
form. Appreciation of his work on moments remains limited, for all too understandable
reasons. Thiele wrote mostly in Danish, he did not much repeat himself, and he tended
to assume that his readers were just about as smart as he was. None of these habits
could possibly ensure rapid worldwide dissemination of his ideas. Indeed, it was not
until the 1980s that much of Thiele’s work was reviewed in or translated into English
(Hald 1981; Lauritzen 2002).
Thiele did not use all the now-standard terminology. The names standard deviation,
skewness, and kurtosis we owe to Karl Pearson, and the name variance we owe to Ronald
Aylmer Fisher (David 2001). Much of the impact of moments can be traced to these
two statisticians. Pearson was a vigorous proponent of using moments in distribution
curve fitting. His own system of probability distributions pivots on varying skewness,
measured relative to the mode. Fisher’s advocacy of maximum likelihood as a superior
estimation method was combined with his exposition of variance as central to statistical
thinking. The many editions of Fisher’s 1925 text Statistical Methods for Research
Workers, and of texts that in turn drew upon its approach, have introduced several
generations to the ideas of skewness and kurtosis. Much more detail on this history is
given by Walker (1929), Hald (1998, 2007), and Fiori and Zenga (2009).
© 2010 StataCorp LP st0204
N. J. Cox 483
Whatever the history and the terminology, a simple but fundamental point deserves
emphasis. A name like skewness has a very broad interpretation as a vague concept of
distribution symmetry or asymmetry, which can be made precise in a variety of ways
(compare with Mosteller and Tukey [1977]). Kurtosis is even more enigmatic: some
authors write of kurtosis as peakedness and some write of it as tail weight, but the
skeptical interpretation that kurtosis is whatever kurtosis measures is the only totally
safe story. Numerical examples given by Irving Kaplansky (1945) alone suffice to show
that kurtosis bears neither interpretation unequivocally.1
To the present, moments have been much disapproved, and even disproved, by math-
ematical statisticians who show that in principle moments may not even exist, and by
data analysts who know that in practice moments may not be robust. Nevertheless in
many quarters, they survive, and they even thrive. One of several lively fields making
much use of skewness and kurtosis measures is the analysis of financial time series (for
example, Taylor [2005]).
In this column, I will publicize one limitation of certain moment-based measures, in a
double sense. Sample skewness and sample kurtosis are necessarily bounded by functions
of sample size, imparting bias to the extent that small samples from skewed distributions
may even deny their own parentage. This limitation has been well established and
discussed in several papers and a few texts, but it still appears less widely known than
it should be. Presumably, it presents a complication too far for most textbook accounts.
The presentation here will include only minor novelties but will bring the key details
together in a coherent story and give examples of the use of Stata and Mata to confirm
and explore for oneself the consequences of a statistical artifact.
2 Deductions
2.1 Limits on skewness and kurtosis
Given a sample of n values y1, . . . , yn and sample mean ȳ = Σ_{i=1}^n yi/n, sample moments
measured about the mean are at their simplest defined as averages of powered deviations

    m_r = (1/n) Σ_{i=1}^n (yi − ȳ)^r

so that m2 and s = √m2 are versions of, respectively, the sample variance and sample
standard deviation.

Here sample skewness is defined as

    m3 / m2^(3/2) = m3 / s³ = √b1 = g1

and sample kurtosis as

    m4 / m2² = m4 / s⁴ = b2 = g2 + 3
1. Kaplansky’s paper is one of a few that he wrote in the mid-1940s on probability and statistics. He
is much better known as a distinguished algebraist (Bass and Lam 2007; Kadison 2008).
484 Speaking Stata: The limits of sample skewness and kurtosis
Hence, both of the last two measures are scaled or dimensionless: Whatever units
of measurement were used appear raised to the same powers in both numerator and
denominator, and so cancel out. The commonly used m, s, b, and g notation corresponds
to a longstanding μ, σ, β, and γ notation for the corresponding theoretical or population
quantities. If 3 appears to be an arbitrary constant in the last equation, one explanation
starts with the fact that normal or Gaussian distributions have β1 = 0 and β2 = 3; hence,
γ2 = 0.
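The definitions above translate directly into a few lines of code. As a cross-check outside Stata, here is a minimal Python sketch (the function names g1 and b2 simply echo the notation; they are mine, not part of the column):

```python
def moment(y, r):
    """Sample moment m_r: the average of r-th powered deviations about the mean."""
    n = len(y)
    ybar = sum(y) / n
    return sum((yi - ybar) ** r for yi in y) / n

def g1(y):
    """Sample skewness g1 = m3 / m2^(3/2)."""
    return moment(y, 3) / moment(y, 2) ** 1.5

def b2(y):
    """Sample kurtosis b2 = m4 / m2^2 (the population value is 3 for a normal)."""
    return moment(y, 4) / moment(y, 2) ** 2
```

Any symmetric sample, such as (1, 2, 3), has skewness exactly zero, which makes a quick sanity check on the implementation.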
Naturally, if y is constant, then m2 is zero; thus skewness and kurtosis are not
defined. This includes the case of n = 1. The stipulations that y is genuinely variable
and that n ≥ 2 underlie what follows.
Newcomers to this territory are warned that usages in the statistical literature vary
considerably, even among entirely competent authors. This variation means that differ-
ent formulas may be found for the same terms—skewness and kurtosis—and different
terms for the same formulas. To start at the beginning: Although Karl Pearson in-
troduced the term skewness, and also made much use of β1 , he used skewness to refer
to (mean − mode) / standard deviation, a quantity that is well defined in his system
of distributions. In more recent literature, some differences reflect the use of divisors
other than n, usually with the intention of reducing bias, and so resembling in spirit
the common use of n − 1 as an alternative divisor for sample variance. Some authors
call γ2 (or g2 ) the kurtosis, while yet other variations may be found.
The key results for this column were extensively discussed by Wilkins (1944) and
Dalén (1987). Clearly, g1 may be positive, zero, or negative, reflecting the sign of m3 .
Wilkins (1944) showed that there is an upper limit to its absolute value,
n−2
|g1 | ≤ √ (1)
n−1
as was also independently shown by Kirby (1974). In contrast, b2 must be positive and
indeed (as may be shown, for example, using the Cauchy–Schwarz inequality) must be
at least 1. More pointedly, Dalén (1987) showed that there is also an upper limit to its
value:
    b2 ≤ (n² − 3n + 3)/(n − 1)    (2)
The proofs of these inequalities are a little too long, and not quite interesting enough,
to reproduce here.
Both of these inequalities are sharp, meaning attainable. Test cases to explore the
precise limits have all values equal to some constant, except for one value that is equal
to another constant: n = 2, y1 = 0, y2 = 1 will do fine as a concrete example, for which
skewness is 0/1 = 0 and kurtosis is (1 − 3 + 3)/1 = 1.
For n = 2, we can rise above a mere example to show quickly that these results are
indeed general. The mean of two distinct values is halfway between them so that the
two deviations yi − y have equal magnitude and opposite sign. Thus their cubes have
sum 0, and m3 and b1 are both identically equal to 0. Alternatively, such values are
geometrically two points on the real line, a configuration that is evidently symmetric
around the mean in the middle, so skewness can be seen to be zero without any calculations. The squared deviations have an average equal to {(y1 − y2)/2}², and their fourth
powers have an average equal to {(y1 − y2)/2}⁴, so b2 is identically equal to 1.
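The extreme configuration used above — all values equal except one — attains the bounds for any n, not just n = 2. A quick numerical sketch in Python (my own check, not from the column) confirms this for n = 10:

```python
import math

def sample_skew_kurt(y):
    """Return (g1, b2) computed from the raw moment definitions."""
    n = len(y)
    ybar = sum(y) / n
    m = lambda r: sum((v - ybar) ** r for v in y) / n
    return m(3) / m(2) ** 1.5, m(4) / m(2) ** 2

n = 10
# n - 1 zeros and a single one: the configuration that attains the limits.
g1_val, b2_val = sample_skew_kurt([0.0] * (n - 1) + [1.0])
skew_bound = (n - 2) / math.sqrt(n - 1)      # inequality (1)
kurt_bound = (n * n - 3 * n + 3) / (n - 1)   # inequality (2)
# g1_val matches skew_bound and b2_val matches kurt_bound to rounding error.
```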
To see how the upper limit behaves numerically, we can rewrite (1) as

    |g1| ≤ √(n − 1) − 1/√(n − 1)

so that as sample size n increases, first √(n − 1) and then √n become acceptable approximations. Similarly, we can rewrite (2) as

    b2 ≤ n − 2 + 1/(n − 1)

from which, in large samples, first n − 2 and then n become acceptable approximations.
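These approximations are easy to check numerically; the following Python sketch (names are mine) compares the exact limits with their large-sample approximations:

```python
import math

def skew_limit(n):
    # Exact limit (1): (n - 2) / sqrt(n - 1), identically
    # equal to sqrt(n - 1) - 1 / sqrt(n - 1).
    return (n - 2) / math.sqrt(n - 1)

def kurt_limit(n):
    # Exact limit (2): (n^2 - 3n + 3) / (n - 1), identically
    # equal to n - 2 + 1 / (n - 1).
    return (n * n - 3 * n + 3) / (n - 1)

for n in (10, 100, 1000):
    print(n,
          round(skew_limit(n), 3), round(math.sqrt(n - 1), 3), round(math.sqrt(n), 3),
          round(kurt_limit(n), 3), n - 2, n)
```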
As it happens, these limits established by Wilkins and Dalén sharpen up on the
results of other workers. Limits of √n and n (the latter when n is greater than 3)
were established by Cramér (1946, 357). Limits of √(n − 1) and n were independently
established by Johnson and Lowe (1979); Kirby (1981) advertised work earlier than
theirs (although not earlier than that of Wilkins or Cramér). Similarly, Stuart and Ord
(1994, 121–122) refer to the work of Johnson and Lowe (1979), but overlook the sharper
limits.2
There is yet another twist in the tale. Pearson (1916, 440) refers to the limit (2),
which he attributes to George Neville Watson, himself later a distinguished contributor
to analysis (but not to be confused with the statistician Geoffrey Stuart Watson), and
to a limit of n − 1 on b1, equivalent to a limit of √(n − 1) on g1. Although Pearson
was the author of the first word on this subject, his contribution appears to have been
uniformly overlooked by later authors. However, he dismissed these limits as without
practical importance, which may have led others to downplay the whole issue.
In practice, we are, at least at first sight, less likely to care much about these limits
for large samples. It is the field of small samples in which limits are more likely to cause
problems, and sometimes without data analysts even noticing.
2. The treatise of Stuart and Ord is in line of succession, with one offset, from Yule (1911). Despite
that distinguished ancestry, it contains some surprising errors as well as the compendious collection
of results that makes it so useful. To the statement that mean, median, and mode differ in a skewed
distribution (p. 48), counterexamples are 0, 0, 1, 1, 1, 1, 3, and the binomial (10 choose k) 0.1^k 0.9^(10−k), k =
0, . . . , 10. For both of these skewed counterexamples, mean, median, and mode coincide at 1. To
the statement that they coincide in a symmetric distribution (p. 108), counterexamples are any
symmetric distribution with an even number of modes.
3 Confirmations
[R] summarize confirms that skewness g1 and kurtosis b2 are calculated in Stata precisely as above. There are no corresponding Mata functions at the time of this writing,
but readers interested in these questions will want to start Mata to check their own
understanding. One example to check is
. sysuse auto, clear
(1978 Automobile Data)
. summarize mpg, detail
                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of Wgt.          74
50%           20                      Mean            21.2973
                         Largest      Std. Dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005
The detail option is needed to get skewness and kurtosis results from summarize.
We will not try to write a bulletproof skewness or kurtosis function in Mata, but we
will illustrate its use calculator-style. After entering Mata, a variable can be read into
a vector. It is helpful to have a vector of deviations from the mean to work on.
. mata :
mata (type end to exit)
: y = st_data(., "mpg")
: dev = y :- mean(y)
: mean(dev:^3) / (mean(dev:^2)):^(3/2)
.9487175965
: mean(dev:^4) / (mean(dev:^2)):^2
3.975004596
So those examples at least check out. Those unfamiliar with Mata might note that
the colon prefix, as in :- or :^, merely flags an elementwise operation. Thus for example,
mean(y) returns a constant, which we wish to subtract from every element of a data
vector.
Mata may be used to check simple limiting cases. The minimal dataset (0, 1) may
be entered in deviation form. After doing so, we can just repeat earlier lines to calculate
g1 and b2. (output omitted)
Mata may also be used to see how the limits of skewness and kurtosis vary with
sample size. We start out with a vector containing some sample sizes. We then calculate
the corresponding upper limits for skewness and kurtosis and tabulate the results. The
results are mapped to strings for tabulation with reasonable numbers of decimal places.
: n = (2::20\50\100\500\1000)
: skew = sqrt(n:-1) :- (1:/sqrt(n:-1))
: kurt = n :- 2 + (1:/(n:-1))
: strofreal(n), strofreal((skew, kurt), "%4.3f")
             1        2        3
    1        2    0.000    1.000
    2        3    0.707    1.500
    3        4    1.155    2.333
    4        5    1.500    3.250
    5        6    1.789    4.200
    6        7    2.041    5.167
    7        8    2.268    6.143
    8        9    2.475    7.125
    9       10    2.667    8.111
   10       11    2.846    9.100
   11       12    3.015   10.091
   12       13    3.175   11.083
   13       14    3.328   12.077
   14       15    3.474   13.071
   15       16    3.615   14.067
   16       17    3.750   15.062
   17       18    3.881   16.059
   18       19    4.007   17.056
   19       20    4.129   18.053
   20       50    6.857   48.020
   21      100    9.849   98.010
   22      500   22.294  498.002
   23     1000   31.575  998.001
The second and smaller term in the rewritten expression for (1) is 1/√(n − 1), and in
that for (2) it is 1/(n − 1). Although the calculation is, or should be, almost mental
arithmetic, we can tabulate 1/(n − 1) to see how quickly this term shrinks so much that
it can be neglected:
1 2 1.000
2 3 0.500
3 4 0.333
4 5 0.250
5 6 0.200
6 7 0.167
7 8 0.143
8 9 0.125
9 10 0.111
10 11 0.100
11 12 0.091
12 13 0.083
13 14 0.077
14 15 0.071
15 16 0.067
16 17 0.062
17 18 0.059
18 19 0.056
19 20 0.053
20 50 0.020
21 100 0.010
22 500 0.002
23 1000 0.001
: end
These calculations are equally easy in Stata when you start with a variable containing
sample sizes.
4 Explorations
In statistical science, we use an increasing variety of distributions. Even when closed-
form expressions exist for their moments, which is far from being universal, the need
to estimate parameters from sample data often arises. Thus the behavior of sample
moments and derived measures remains of key interest. Even if you do not customarily
use, for example, summarize, detail to get skewness and kurtosis, these measures may
well underlie your favorite test for normality.
The limits on sample skewness and kurtosis impart the possibility of bias whenever
the upper part of their sampling distributions is cut off by algebraic constraints. In
extreme cases, a sample may even deny the distribution that underlies it, because it is
impossible for any sample to reproduce the skewness and kurtosis of its parent.
These questions may be explored by simulation. Lognormal distributions offer simple
but striking examples. We call a distribution for y lognormal if ln y is normally dis-
tributed. Those who prefer to call normal distributions by some other name (Gaussian,
notably) have not noticeably affected this terminology. Similarly, for some people the
terminology is backward, because a lognormal distribution is an exponentiated normal
distribution. Protest is futile while the term lognormal remains entrenched.
If ln y has mean μ and standard deviation σ, its skewness and kurtosis may be
defined in terms of exp(σ²) = ω (Johnson, Kotz, and Balakrishnan 1994, 212):

    γ1 = √(ω − 1) (ω + 2);    β2 = ω⁴ + 2ω³ + 3ω² − 3

Differently put, skewness and kurtosis depend on σ alone; μ is a location parameter for
the lognormal as well as the normal.
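These formulas are a one-liner in any language; as a check, here is a hedged Python version (the function name is mine) that reproduces the theoretical values used later in the column:

```python
import math

def lognormal_skew_kurt(sigma):
    """Theoretical skewness gamma1 and kurtosis beta2 of a lognormal
    distribution, expressed through omega = exp(sigma^2)."""
    omega = math.exp(sigma ** 2)
    gamma1 = math.sqrt(omega - 1) * (omega + 2)
    beta2 = omega ** 4 + 2 * omega ** 3 + 3 * omega ** 2 - 3
    return gamma1, beta2

# For sigma = 1, gamma1 is about 6.18 and beta2 about 113.9;
# for sigma = 7, gamma1 is about 8.3e31 and beta2 about 1.3e85.
```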
[R] simulate already has a worked example of the simulation of lognormals, which
we can adapt slightly for the present purpose. The program there called lnsim merely
needs to be modified by adding results for skewness and kurtosis. As before, summarize,
detail is now the appropriate call. Before simulation, we (randomly, capriciously, or
otherwise) choose a seed for random-number generation:
. clear all
. program define lnsim, rclass
  1.    version 11.1
  2.    syntax [, obs(integer 1) mu(real 0) sigma(real 1)]
  3.    drop _all
  4.    set obs `obs'
  5.    tempvar z
  6.    gen `z' = exp(rnormal(`mu',`sigma'))
  7.    summarize `z', detail
  8.    return scalar mean = r(mean)
  9.    return scalar var = r(Var)
 10.    return scalar skew = r(skewness)
 11.    return scalar kurt = r(kurtosis)
 12. end
. set seed 2803
. simulate mean=r(mean) var=r(var) skew=r(skew) kurt=r(kurt), nodots
> reps(10000): lnsim, obs(50) mu(-3) sigma(7)
command: lnsim, obs(50) mu(-3) sigma(7)
mean: r(mean)
var: r(var)
skew: r(skew)
kurt: r(kurt)
We are copying here the last example from help simulate, a lognormal for which
μ = −3, σ = 7. While a lognormal may seem a fairly well-behaved distribution, a quick
calculation shows that with these parameter choices, the skewness is about 8 × 10³¹ and
the kurtosis about 10⁸⁵, which no sample result can possibly come near! The previously
discussed limits are roughly 7 for skewness and 48 for kurtosis for this sample size. Here
are the Mata results:
. mata
mata (type end to exit)
: omega = exp(49)
: sqrt(omega - 1) * (omega + 2)
8.32999e+31
: omega^4 + 2 * omega^3 + 3*omega^2 - 3
1.32348e+85
: n = 50
: sqrt(n-1) - 1/sqrt(n-1), n - 2 + 1/(n-1)
  1   6.857142857   48.02040816
: end
Sure enough, calculations and a graph (shown as figure 1) show the limits of 7 and
48 are biting hard. Although many graph forms would work well, I here choose qplot
(Cox 2005) for quantile plots.
. summarize
(output omitted)
Figure 1. Sampling distributions of skewness and kurtosis for samples of size 50 from a
lognormal with μ = −3, σ = 7. Upper limits are shown by horizontal lines.
The natural comment is that the parameter choices in this example are a little
extreme, but the same phenomenon occurs to some extent even with milder choices.
With the default μ = 0, σ = 1, the skewness and kurtosis are less explosively high—but
still very high by many standards. We clear the data and repeat the simulation, but
this time we use the default values.
. clear
. simulate mean=r(mean) var=r(var) skew=r(skew) kurt=r(kurt), nodots
> reps(10000): lnsim, obs(50)
command: lnsim, obs(50)
mean: r(mean)
var: r(var)
skew: r(skew)
kurt: r(kurt)
Within Mata, we can recalculate the theoretical skewness and kurtosis. The limits
to sample skewness and kurtosis remain the same, given the same sample size n = 50.
. mata
mata (type end to exit)
: omega = exp(1)
: sqrt(omega - 1) * (omega + 2)
6.184877139
: omega^4 + 2 * omega^3 + 3*omega^2 - 3
113.9363922
: end
The problem is more insidious with these parameter values. The sampling distri-
butions look distinctly skewed (shown in figure 2) but are not so obviously truncated.
Only when the theoretical values for skewness and kurtosis are considered is it obvious
that the estimations are seriously biased.
. summarize
(output omitted)
Figure 2. Sampling distributions of skewness and kurtosis for samples of size 50 from a
lognormal with μ = 0, σ = 1. Upper limits are shown by horizontal lines.
Naturally, these are just token simulations, but a way ahead should be clear. If
you are using skewness or kurtosis with small (or even large) samples, simulation with
some parent distributions pertinent to your work is a good idea. The simulations of
Wallis, Matalas, and Slack (1974) in particular pointed to empirical limits to skewness,
which Kirby (1974) then established independently of previous work.3
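Readers without Stata at hand can run a token version of the same experiment in Python (the seed and replication count are arbitrary choices of mine): every simulated sample skewness respects the algebraic bound, however extreme the parent distribution's skewness.

```python
import math, random

def g1(y):
    """Sample skewness m3 / m2^(3/2)."""
    n = len(y)
    ybar = sum(y) / n
    m2 = sum((v - ybar) ** 2 for v in y) / n
    m3 = sum((v - ybar) ** 3 for v in y) / n
    return m3 / m2 ** 1.5

random.seed(2803)
n, reps = 50, 2000
bound = (n - 2) / math.sqrt(n - 1)   # about 6.86 for n = 50
# Lognormal samples with mu = 0, sigma = 1 (theoretical skewness about 6.18).
skews = [g1([math.exp(random.gauss(0.0, 1.0)) for _ in range(n)])
         for _ in range(reps)]
# No sample can exceed the bound, and the average falls well short of the
# theoretical value -- the bias discussed in the text.
assert max(skews) <= bound
```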
5 Conclusions
This story, like any other, lies at the intersection of many larger stories. Many statisti-
cally minded people make little or no use of skewness or kurtosis, and this paper may
have confirmed them in their prejudices. Some readers may prefer to see this as an-
other argument for using quantiles or order statistics for summarization (Gilchrist 2000;
David and Nagaraja 2003). Yet others may know that L-moments offer an alternative
approach (Hosking 1990; Hosking and Wallis 1997).
Arguably, the art of statistical analysis lies in choosing a model successful enough
to ensure that the exact form of the distribution of some response variable, conditional
on the predictors, is a matter of secondary importance. For example, in the simplest
regression situations, an error term for any really good model is likely to be fairly near
normally distributed, and thus not a source of worry. But authorities and critics differ
over how far that is a deductive consequence of some flavor of central limit theorem or
a naïve article of faith that cries out for critical evaluation.
3. Connoisseurs of offbeat or irreverent titles might like to note some other papers by the same team:
Mandelbrot and Wallis (1968), Matalas and Wallis (1973), and Slack (1973).
6 Acknowledgments
This column benefits from interactions over moments shared with Ian S. Evans and over
L-moments shared with Patrick Royston.
7 References
Bass, H., and T. Y. Lam. 2007. Irving Kaplansky 1917–2006. Notices of the American
Mathematical Society 54: 1477–1493.
Cox, N. J. 2005. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460.
Dalén, J. 1987. Algebraic bounds on standardized sample moments. Statistics & Prob-
ability Letters 5: 329–331.
David, H. A. 2001. First (?) occurrence of common terms in statistics and probability.
In Annotated Readings in the History of Statistics, ed. H. A. David and A. W. F.
Edwards, 209–246. New York: Springer.
David, H. A., and H. N. Nagaraja. 2003. Order Statistics. Hoboken, NJ: Wiley.
Fiori, A. M., and M. Zenga. 2009. Karl Pearson and the origin of kurtosis. International
Statistical Review 77: 40–50.
Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver &
Boyd.
Gilchrist, W. G. 2000. Statistical Modelling with Quantile Functions. Boca Raton, FL:
Chapman & Hall/CRC.
Hald, A. 1998. A History of Mathematical Statistics from 1750 to 1930. New York:
Wiley.
Johnson, M. E., and V. W. Lowe. 1979. Bounds on the sample skewness and kurtosis.
Technometrics 21: 377–378.
Katsnelson, J., and S. Kotz. 1957. On the upper limits of some measures of variability.
Archiv für Meteorologie, Geophysik und Bioklimatologie, Series B 8: 103–107.
Mandelbrot, B. B., and J. R. Wallis. 1968. Noah, Joseph, and operational hydrology.
Water Resources Research 4: 909–918.
Matalas, N. C., and J. R. Wallis. 1973. Eureka! It fits a Pearson type 3 distribution.
Water Resources Research 9: 281–289.
Mosteller, F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course
in Statistics. Reading, MA: Addison–Wesley.
Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics. Volume 1:
Distribution Theory. 6th ed. London: Arnold.
Taylor, S. J. 2005. Asset Price Dynamics, Volatility, and Prediction. Princeton, NJ:
Princeton University Press.
Walker, H. M. 1929. Studies in the History of Statistical Method: With Special Refer-
ence to Certain Educational Problems. Baltimore: Williams & Wilkins.
Wallis, J. R., N. C. Matalas, and J. R. Slack. 1974. Just a moment! Water Resources
Research 10: 211–219.
1 Introduction
In a statistical analysis, I usually want some basic descriptive statistics such as the mean,
standard deviation, extremes, and percentiles. See, for example, Pagano and Gauvreau
(2000). Stata conveniently provides these descriptive statistics with the summarize
command’s detail option. Alternatively, I can obtain percentiles with the centile
command. For example, with auto.dta, we have
. sysuse auto
(1978 Automobile Data)
. summarize price, detail
                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74
50%       5006.5                      Mean           6165.257
                         Largest      Std. Dev.     2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188
© 2010 StataCorp LP st0205
P. A. Lachenbruch 497
The following commands were generated from the multiple-imputation dialog box. I
used 20 imputations. Before Stata 11, this could also be done with the user-written com-
mands ice and mim (Royston 2004, 2005a,b, 2007; Royston, Carlin, and White 2009).
. mi set mlong
. mi register imputed newprice
(32 m=0 obs. now marked as incomplete)
. mi register regular mpg trunk weight length
. mi impute regress newprice, add(20) rseed(3252010)
Univariate imputation Imputations = 20
Linear regression added = 20
Imputed: m=1 through m=20 updated = 0
                        Observations per m
            ----------------------------------------------
  Variable    complete  incomplete  imputed      total
  newprice          42          32       32         74
From this output, we see that the estimated mean is 5,693 with a standard error
of 455 (rounded up) compared with the complete data value of 6,165 with a standard
error of 343 (also rounded up). However, we do not have estimates of quantiles. This
could also have been done using mi estimate: mean newprice (the mean command is
near the bottom of the estimation command list for mi estimate).
498 Stata tip 89
We can apply the same principle using qreg. For the 10th percentile, type the corresponding mi estimate: qreg command. (output omitted)
Compare the value of 3,496 with the value of 3,895 from the full data. We can use
the simultaneous estimates command for the full set:
                     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
q10
       _cons      3495.635    533.5129    6.55   0.000     2408.434    4582.836
q25
       _cons      4130.037    237.1932   17.41   0.000     3642.459    4617.614
q50
       _cons      5200.238     441.294   11.78   0.000     4292.719    6107.757
q75
       _cons      6620.232    778.8488    8.50   0.000      5025.49    8214.974
q90
       _cons      8901.985    1417.022    6.28   0.000     5971.962    11832.01
small. A second caution is that comparing two medians can be tricky: the difference
of two medians is not the median difference of the distributions. I have found it useful
to use percentiles because there is a one-to-one relationship between percentiles if data
are transformed. In our case, there is plentiful evidence that price is not normally
distributed, so it would be good to look for a transformation and impute those values.
This method of using regression commands without an independent variable can
provide estimates of quantities that otherwise would be difficult to obtain. For example,
it is much faster than finding 20 imputed percentiles and then combining them with
Rubin's rules, and it is much less onerous and less prone to error.
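The reason a constant-only qreg estimates a percentile is that quantile regression minimizes the pinball (check) loss, and with only a constant the minimizer is simply a sample quantile. A small Python sketch of that principle (toy data loosely echoing the price percentiles above; not from the tip itself):

```python
def pinball_loss(c, y, q):
    """Average check-function loss of constant c at quantile q."""
    return sum(q * (v - c) if v >= c else (q - 1) * (v - c) for v in y) / len(y)

# Toy values in the spirit of the price data.
y = [3291, 3299, 3667, 3748, 3895, 4195, 5006, 6342, 11385, 15906]

# Minimizing the q = 0.5 loss over candidate constants recovers a median:
best = min(y, key=lambda c: pinball_loss(c, y, 0.5))
```

With covariates added, the same loss yields conditional quantiles, which is exactly what qreg fits.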
4 Acknowledgment
This work was supported in part by a grant from the Cure JM Foundation.
References
Pagano, M., and K. Gauvreau. 2000. Principles of Biostatistics. 2nd ed. Belmont, CA:
Duxbury.
Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.
———. 2005a. Multiple imputation of missing values: Update. Stata Journal 5: 188–
201.
———. 2005b. Multiple imputation of missing values: Update of ice. Stata Journal 5:
527–536.
———. 2007. Multiple imputation of missing values: Further update of ice, with an
emphasis on interval censoring. Stata Journal 7: 445–464.
Royston, P., J. B. Carlin, and I. R. White. 2009. Multiple imputation of missing values:
New features for mim. Stata Journal 9: 252–264.
Stata provides several features that allow users to display only part of their results.
If, for instance, you merely wanted to inspect the analysis of variance table returned by
anova or the coefficients returned by regress, you could instruct Stata to omit other
results:
. sysuse auto
(1978 Automobile Data)
. regress weight length price, notable
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =  385.80
       Model |  40378658.3     2  20189329.2           Prob > F      =  0.0000
    Residual |  3715520.06    71  52331.2685           R-squared     =  0.9157
-------------+------------------------------           Adj R-squared =  0.9134
       Total |  44094178.4    73  604029.841           Root MSE      =  228.76
. regress weight length price, noheader
Other examples of this type can be found in the help files for xtivreg for its first-
stage results and for xtmixed for its random-effects and fixed-effects table. Generally,
to check whether Stata does provide such options, you would look for them under the
heading Reporting in the respective help files.
If you want to further customize output to your own needs, you could use the
estimates table command; see [R] estimates table. It is part of the comprehensive
estimates suite of commands that save and manipulate estimation results in Stata. See
[R] estimates or Baum (2006, sec. 4.4), where user-written alternatives are introduced
as well.
estimates table can provide several benefits to the user. For one, you can restrict
output to selected coefficients or equations with its keep() and drop() options.
© 2010 StataCorp LP st0206
M. Weiss 501
. sysuse auto
(1978 Automobile Data)
. quietly regress weight length price trunk turn
. estimates table, keep(turn price)
Variable active
turn 35.214901
price .04624804
The original output of the estimation command itself is suppressed with quietly;
see [P] quietly. The keep() option also changes the order of the coefficients according
to your wishes. Additionally, you can elect to have Stata display results in a specific
format, for example, with fewer or more decimal places. The format can differ between
the elements that you choose to put into the table. In the case shown below, the
coefficients have three decimal places, while the standard error and the p-value have
two decimal places:
. sysuse auto
(1978 Automobile Data)
. quietly regress weight length price trunk turn
. estimates table, keep(turn price) b(%9.3fc) se(%9.2fc) p(%9.2fc)
Variable active
turn 35.215
11.65
0.00
price 0.046
0.01
0.00
legend: b/se/p
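The mechanism is easy to mimic in other languages: keep a subset of coefficients, in a chosen order, and apply one display format per statistic. A hypothetical Python sketch (values copied from the output above; the helper function is mine, not part of Stata):

```python
# b, se, and p for the two kept regressors, as displayed above.
stats = {
    "turn":  {"b": 35.214901, "se": 11.65, "p": 0.00},
    "price": {"b": 0.04624804, "se": 0.01, "p": 0.00},
}
# One numeric format per element, like b(%9.3fc) se(%9.2fc) p(%9.2fc).
formats = {"b": "{:9.3f}", "se": "{:9.2f}", "p": "{:9.2f}"}

def table(stats, formats, keep):
    """Render b/se/p rows for the kept variables, in the requested order."""
    lines = []
    for var in keep:
        for stat in ("b", "se", "p"):
            label = var if stat == "b" else ""
            lines.append(f"{label:>10} {formats[stat].format(stats[var][stat])}")
    return "\n".join(lines)

print(table(stats, formats, ["turn", "price"]))
```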
estimates table can also deal with models featuring multiple equations. If you
want to omit the coefficients for weight and the constant from every equation of your
sureg model, you could type
. sysuse auto
(1978 Automobile Data)
. qui sureg (price foreign weight length turn) (mpg foreign weight turn)
. estimates table, drop(weight _cons)
Variable active
price
foreign 3320.6181
length -78.75447
turn -144.37952
mpg
foreign -2.0756325
turn -.23516574
502 Stata tip 90
If your interest rests in the entire first equation and the constant from the second
equation, you would prepend coefficients with the equation names and separate the two
with a colon. The names of equations and coefficients are more accessible in Stata 11
with the coeflegend option, which is accepted by most estimation commands.
Coef. Legend
price
foreign 3320.618 _b[price:foreign]
weight 6.04491 _b[price:weight]
length -78.75447 _b[price:length]
turn -144.3795 _b[price:turn]
_cons 7450.657 _b[price:_cons]
mpg
foreign -2.075632 _b[mpg:foreign]
weight -.0055959 _b[mpg:weight]
turn -.2351657 _b[mpg:turn]
_cons 48.13492 _b[mpg:_cons]
Variable active
price
foreign 3320.6181
weight 6.0449101
length -78.75447
turn -144.37952
_cons 7450.657
mpg
weight -.00559588
Reference
Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College
Station, TX: Stata Press.
The Stata Journal (2010)
10, Number 3, pp. 503–504
Within interactive sessions, do-files, or programs, Stata users often want to work
with varlists, lists of variable names. For convenience, such lists may be stored in local
macros. Local macros can be directly defined for later use, with each name typed out explicitly.
However, users frequently want to put longer lists of names into local macros, spelled
out one by one so that some later command can loop through the list defined by the
macro. Such varlists might be indirectly defined in abbreviations using the wildcard
characters * or ?. These characters can be used alone or can be combined to express
ranges. For example, specifying * catches all variables, *TX* might define all variables
for Texas, and *200? catches the years 2000–2009 used as suffixes.
In such cases, direct definition may not appeal for all the obvious reasons: it is
tedious, time-consuming, and error-prone. It is also natural to wonder if there is a
better method. You may already know that foreach (see [P] foreach) will take such
wildcarded varlists as arguments, which solves many problems.
Many users know that pushing an abbreviated varlist through describe or ds is
one way to produce an unabbreviated varlist. Thus
. describe, varlist
is useful principally for its side effect of leaving all the variable names in r(varlist).
ds is typically used in a similar way, as is the user-written findname command (Cox
2010).
However, if the purpose is just to produce a local macro, the method of using
describe or ds has some small but definite disadvantages. First, the output of each
may not be desired, although it is easily suppressed with a quietly prefix. Second,
the modus operandi of both describe and ds is to leave saved results as r-class results.
Every now and again, users will be frustrated by this when they unwittingly overwrite
r-class results that they wished to use again. Third, there is some inefficiency in using
either command for this purpose, although you would usually have to work hard to
measure it.
© 2010 StataCorp LP dm0051
504 Stata tip 91
The solution here is to use the unab command; see [P] unab. unab has just one
restricted role in life, but that role is exactly the one needed. unab is billed as a programming
command, but nothing stops it from being used interactively as a simple tool in data
management. The simple examples
. unab myvars : *
. unab TX : *TX*
. unab twenty : *200?
show how a local macro, named at birth (here as myvars, TX, and twenty), is defined
as the unabbreviated equivalent of each argument that follows a colon. Note that using
wildcard characters, although common, is certainly not compulsory.
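For readers more familiar with other environments, what unab does with * and ? is ordinary fnmatch-style pattern matching. A rough Python analogue (the variable names here are illustrative, not from the tip):

```python
from fnmatch import fnmatchcase

def unab(pattern, varnames):
    """Expand a wildcarded pattern (* and ?) against a list of variable
    names, preserving their original order -- roughly what unab does."""
    return [v for v in varnames if fnmatchcase(v, pattern)]

# Illustrative variable names.
varnames = ["popTX2000", "popTX2001", "popCA2000", "gdp1999"]
```

Here unab("*TX*", varnames) picks out the Texas variables and unab("*200?", varnames) the year-2000s suffixes, mirroring the Stata examples.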
The word “unabbreviate” is undoubtedly ugly. The help and manual entry do also
use the much simpler and more attractive word “expand”, but the word “expand” was
clearly not available as a command name, given its use for another purpose.
This tip skates over all the fine details of unab, and only now does it mention the
sibling commands tsunab and fvunab, for use when you are using time-series operators
and factor variables. For more information, see [P] unab.
Reference
Cox, N. J. 2010. Speaking Stata: Finding variables. Stata Journal 10: 281–296.
The Stata Journal (2010)
10, Number 3, p. 505
Software Updates
© 2010 StataCorp LP up0029