
The Stata Journal
Volume 10 Number 3 2010

A Stata Press publication


StataCorp LP
College Station, Texas
The Stata Journal

Editors
H. Joseph Newton, Department of Statistics, Texas A&M University, College Station, Texas 77843; 979-845-8817; fax 979-845-6077; jnewton@stata-journal.com
Nicholas J. Cox, Department of Geography, Durham University, South Road, Durham DH1 3LE, UK; n.j.cox@stata-journal.com

Associate Editors
Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, Tübingen University, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, Univ. of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
David Epstein, Columbia University
Allan Gregory, Queen's University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, University of Essex
Ulrich Kohler, WZB, Berlin
Frauke Kreuter, University of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, University of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt University, Edinburgh
Jeroen Weesie, Utrecht University
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager: Lisa Gilmore
Stata Press Copy Editors: Deirdre Patterson and Erin Roberson
The Stata Journal publishes reviewed papers together with shorter notes or comments,
regular columns, book reviews, and other material of interest to Stata users. Examples
of the types of papers include 1) expository papers that link the use of Stata commands
or programs to associated principles, such as those that will serve as tutorials for users
first encountering a new field of statistics or a major new technique; 2) papers that go
“beyond the Stata manual” in explaining key features or uses of Stata that are of interest
to intermediate or advanced users of Stata; 3) papers that discuss new commands or
Stata programs of interest either to a wide spectrum of users (e.g., in data management
or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival
analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing
the statistical properties of new or existing estimators and tests in Stata; 5) papers
that could be of interest or usefulness to researchers, especially in fields that are of
practical importance but are not often included in texts or other journals, such as the
use of Stata in managing datasets, especially large datasets, with advice from hard-won
experience; and 6) papers of interest to those who teach, including Stata with topics
such as extended examples of techniques and interpretation of results, simulations of
statistical concepts, and overviews of subject areas.
For more information on the Stata Journal, including information for authors, see the
webpage

http://www.stata-journal.com

The Stata Journal is indexed and abstracted in the following:

• CompuMath Citation Index®

• Current Contents/Social and Behavioral Sciences®

• RePEc: Research Papers in Economics

• Science Citation Index Expanded (also known as SciSearch®)

• Scopus™

• Social Sciences Citation Index®

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and
help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and
help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,
as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.
This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites,
fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.
Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting
files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,
or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,
incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote
free communication among Stata users.
The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Mata, NetCourse,
and Stata Press are registered trademarks of StataCorp LP.
Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station,
Texas 77845, telephone 979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at

http://www.stata.com/bookstore/sj.html

Subscription rates
The listed subscription rates include both a printed and an electronic copy unless oth-
erwise mentioned.

Subscriptions mailed to U.S. and Canadian addresses:


3-year subscription $195
2-year subscription $135
1-year subscription $ 69
1-year student subscription $ 42
1-year university library subscription $ 89
1-year institutional subscription $195

Subscriptions mailed to other countries:


3-year subscription $285
2-year subscription $195
1-year subscription $ 99
3-year subscription (electronic only) $185
1-year student subscription $ 69
1-year university library subscription $119
1-year institutional subscription $225

Back issues of the Stata Journal may be ordered online at

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More
recent articles may be ordered online.

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.
Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive,
College Station, TX 77845, USA, or emailed to sj@stata.com.
Volume 10 Number 3 2010

The Stata Journal


Articles and Columns                                                                          315

An introduction to maximum entropy and minimum cross-entropy estimation using Stata . . . . . . . . . . M. Wittenberg   315
bacon: An effective way to detect outliers in multivariate data using Stata (and Mata) . . . . . . . . . . S. Weber   331
Comparing the predictive powers of survival models using Harrell's C or Somers' D . . . . . . . . . . R. B. Newson   339
Using Stata with PHASE and Haploview: Commands for importing and exporting data . . . . . . . . . . J. C. Huber Jr.   359
simsum: Analyses of simulation studies including Monte Carlo error . . . . . . . . . . I. R. White   369
Projection of power and events in clinical trials with a time-to-event outcome . . . . . . . . . . P. Royston and F. M.-S. Barthel   386
metaan: Random-effects meta-analysis . . . . . . . . . . E. Kontopantelis and D. Reeves   395
Regression analysis of censored data using pseudo-observations . . . . . . . . . . E. T. Parner and P. K. Andersen   408
Estimation of quantile treatment effects with Stata . . . . . . . . . . M. Frölich and B. Melly   423
Translation from narrative text to standard codes variables with Stata . . . . . . . . . . F. Belotti and D. Depalo   458
Speaking Stata: The limits of sample skewness and kurtosis . . . . . . . . . . N. J. Cox   482

Notes and Comments                                                                            496

Stata tip 89: Estimating means and percentiles following multiple imputation . . . . . . . . . . P. A. Lachenbruch   496
Stata tip 90: Displaying partial results . . . . . . . . . . M. Weiss   500
Stata tip 91: Putting unabbreviated varlists into local macros . . . . . . . . . . N. J. Cox   503

Software Updates                                                                              505


The Stata Journal (2010)
10, Number 3, pp. 315–330

An introduction to maximum entropy and


minimum cross-entropy estimation using Stata
Martin Wittenberg
University of Cape Town
School of Economics
Cape Town, South Africa
Martin.Wittenberg@uct.ac.za

Abstract. Maximum entropy and minimum cross-entropy estimation are applica-


ble when faced with ill-posed estimation problems. I introduce a Stata command
that estimates a probability distribution using a maximum entropy or minimum
cross-entropy criterion. I show how this command can be used to calibrate survey
data to various population totals.
Keywords: st0196, maxentropy, maximum entropy, minimum cross-entropy, survey
calibration, sample weights

1 Ill-posed problems and the maximum entropy criterion


All too many situations involve more unknowns than data points. Standard forms
of estimation are impossible when faced with such ill-posed problems (Mittelhammer,
Judge, and Miller 2000). One approach that is applicable in these cases is estimation
by maximizing an entropy measure (Golan, Judge, and Miller 1996). The purpose of
this article is to introduce the concept and to show how to apply it using the new
Stata command maxentropy. My discussion of the technique follows the treatment in
Golan, Judge, and Miller (1996). Furthermore, I show how a maximum entropy ap-
proach can be used to calibrate survey data to various population totals. This approach
is equivalent to the iterative raking procedure of Deming and Stephan (1940) or the
multiplicative method implemented in the calibration on margins (CALMAR) algorithm
(Deville and Särndal 1992; Deville, Särndal, and Sautory 1993).
The idea of maximum entropy estimation was motivated by Jaynes (1957, 621ff) in
terms of the problem of finding the probability distribution (p1 , p2 , . . . , pn ) for the set
of values (x1 , x2 , . . . , xn ), given only their expectation,


    E\{f(x)\} = \sum_{i=1}^{n} p_i f(x_i)

For concreteness, we consider a die known to have E (x) = 3.5, where x = (1, 2, 3, 4, 5, 6),
and we want to determine the associated probabilities. Clearly, there are infinitely many
possible solutions, but the obvious one is p1 = p2 = · · · = p6 = 1/6. The obviousness is
based on Laplace’s principle of insufficient reason, which states that two events should
be assigned equal probability unless there is a reason to think otherwise (Jaynes 1957,


622). This negative reason is not much help if, instead, we know that E (x) = 4.
Jaynes’s solution was to tackle this from the point of view of Shannon’s information
theory. Jaynes wanted a criterion function H (p1 , p2 , . . . , pn ) that would summarize the
uncertainty about the distribution. This is given uniquely by the entropy measure

    H(p_1, p_2, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln(p_i)

where $p_i \ln(p_i)$ is defined to be zero if $p_i = 0$, and $K$ is some positive constant. The


solution to Jaynes’s problem is to pick the distribution (p1 , p2 , . . . , pn ) that maximizes
the entropy, subject only to the constraints

    E\{f(x)\} = \sum_i p_i f(x_i), \qquad \sum_i p_i = 1

As Golan, Judge, and Miller (1996, 8–10) show, if our knowledge of E {f (x)} is
based on the outcome of N (very large) trials, then the distribution function p = (p1 , p2 ,
. . . , pn ) that maximizes the entropy measure is the distribution that can give rise to
the observed outcomes in the greatest number of ways, which is consistent with what
we know. Any other distribution requires more information to justify it. Degenerate
distributions, ones where $p_i = 1$ and $p_j = 0$ for $j \neq i$, have entropy of zero. That is to
say, they correspond to zero uncertainty and therefore maximal information.
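
As a quick numerical illustration of this last point, take the die with $K = 1$: the uniform distribution attains the maximum

    H\left(\tfrac{1}{6}, \ldots, \tfrac{1}{6}\right) = -\sum_{i=1}^{6} \tfrac{1}{6}\ln\tfrac{1}{6} = \ln 6 \approx 1.79

whereas a degenerate distribution such as $(1, 0, \ldots, 0)$ has $H = 0$.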

2 Maximum entropy and minimum cross-entropy estima-


tion
More formally, the maximum entropy problem can be represented as

    \max_{p} H(p) = -\sum_{i=1}^{n} p_i \ln(p_i)

such that

    y_j = \sum_{i=1}^{n} X_{ji} p_i, \qquad j = 1, \ldots, J                              (1)

    \sum_{i=1}^{n} p_i = 1                                                                (2)

The J constraints given in (1) can be thought of as moment constraints, with yj being
the population mean of the Xj random variable. To solve this problem, we set up the
Lagrangian function

    \mathcal{L} = -p'\ln(p) - \lambda'(X'p - y) - \mu(p'\mathbf{1} - 1)


where $X$ is the $n \times J$ data matrix,¹ $\lambda$ is a vector of Lagrange multipliers, and $\mathbf{1}$ is a column vector of ones.
The first-order conditions for an interior solution—that is, one in which the vector
p is strictly positive—are given by
    \frac{\partial \mathcal{L}}{\partial p} = -\ln(\hat{p}) - \mathbf{1} - X\hat{\lambda} - \hat{\mu}\mathbf{1} = 0                       (3)

    \frac{\partial \mathcal{L}}{\partial \lambda} = y - X'\hat{p} = 0                                                                    (4)

    \frac{\partial \mathcal{L}}{\partial \mu} = 1 - \hat{p}'\mathbf{1} = 0                                                               (5)

These equations can be solved for $\hat{\lambda}$, and the solution for $\hat{p}$ is given by

    \hat{p} = \exp(-X\hat{\lambda}) / \Omega(\hat{\lambda})

where

    \Omega(\hat{\lambda}) = \sum_{i=1}^{n} \exp(-x_i \hat{\lambda})

and $x_i$ is the $i$th row vector of the matrix $X$.


The maximum entropy framework can be extended to incorporate prior information
about p. Assuming that we have the prior probability distribution q = (q1 , q2 , . . . , qn ),
then the cross-entropy is defined as (Golan, Judge, and Miller 1996, 11)


    I(p, q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right) = p'\ln(p) - p'\ln(q)

The cross-entropy can be thought of as a measure of the additional information required


to go from the distribution q to the distribution p. The principle of minimum cross-
entropy asserts that we should pick the distribution p that meets the moment constraints
(1) and the normalization restriction (2) while requiring the least additional information;
that is, we should pick the one that is in some sense closest to q. Formally, we minimize
I (p, q), subject to the restrictions. Maximum entropy estimation is merely a variant
of minimum cross-entropy estimation where the prior q is the uniform distribution
(1/n, 1/n, . . . , 1/n).

1. In the Golan, Judge, and Miller (1996) book, the constraint is written as y = Xp, where X is J ×n.
For the applications considered below, it is more natural to write the data matrix in the form shown
here.

The solution of this problem is given by (Golan, Judge, and Miller 1996, 29)

    \hat{p}_i = q_i \exp(x_i \hat{\lambda}) / \Omega(\hat{\lambda})                       (6)

where

    \Omega(\hat{\lambda}) = \sum_{i=1}^{n} q_i \exp(x_i \hat{\lambda})                    (7)

The most efficient way to calculate the estimates is, in fact, not by numerical solution
of the first-order conditions [along the lines of (3), (4), and (5)] but by the unconstrained
maximization of the dual problem as discussed further in section 3.5.

3 The maxentropy command


3.1 Syntax
The syntax of the maxentropy command is

    maxentropy [constraint] varlist [if] [in], generate(varname[, replace])
        [prior(varname) log total(#) matrix(matrix)]

The maxentropy command must identify the set of population constraints contained in the y vector. These population constraints can be specified either as constraint or as matrix in the matrix() option. If neither of these optional arguments is specified, it is assumed that varlist is y and then X.
The command requires that a varname be specified in the generate() option, in
which the estimated p vector will be returned.

3.2 Description
maxentropy provides minimum cross-entropy or maximum entropy estimates of ill-
posed inverse problems, such as the Jaynes’s dice problem. The command can also
be used to calibrate survey datasets to external totals along the lines of the multi-
plicative method implemented in the SAS CALMAR macro (Deville and Särndal 1992;
Deville, Särndal, and Sautory 1993). This is a generalization of iterative raking as im-
plemented, for instance, in Nick Winter’s survwgt command, which is available from
the Statistical Software Components archive (type net search survwgt).

3.3 Options

generate(varname[, replace]) provides the variable name in which Stata will store
the probability estimates. This must be a new variable name, unless the replace
suboption is specified, in which case the existing variable is overwritten. generate()
is required.

prior(varname) requests minimum cross-entropy estimation with the vector of prior


probabilities q given in the variable varname. If prior() is not specified, then
maximum entropy estimates are returned.
log is necessary only if the command is failing to converge. It displays the output
from the maximum likelihood subroutine that is used to calculate
the vector λ. The iteration log might provide some diagnostics on what is going
wrong.
total(#) is required if “raising weights” rather than probabilities are desired. The
number must be the population total to which the weights are supposed to be
summed.
matrix(matrix) passes the constraint vector contained in matrix. This must be a col-
umn vector that must have as many elements as are given in varlist. The order of
the constraints in the vector must correspond to the order of the variables given in
varlist. If no matrix is specified, then maxentropy will look for the constraints in
the first variable after the command. This variable must have the constraints listed
in the first J positions corresponding to the J variables listed in varlist.

3.4 Output
maxentropy returns output in three forms. First, it returns estimates of the λ coeffi-
cients. The absolute magnitude of the coefficient is an indication of how informative
the corresponding constraint is, that is, how far it moves the resulting p distribution
away from the prior q distribution in the cross-entropy case or away from the uniform
distribution in the maximum entropy case.
Second, the estimates of p are returned in the variable specified by the user. Third,
the vector of constraints y is returned in the matrix e(constraint), with the rows of
the matrix labeled according to the variable whose constraint that row represents.

Example

Consider Jaynes's die problem described earlier. Specifically, let us calculate the
probabilities if we know that the mean of the die is 4. We set the problem up by
creating the x variable, which contains the discrete distribution of outcomes, that is,
(1, 2, 3, 4, 5, 6). The y vector contains the mean 4.

. set obs 6
obs was 0, now 6
. generate x = _n
. matrix y = (4)


. maxentropy x, matrix(y) generate(p4)


Cross entropy estimates

Variable lambda

x .17462893

p values returned in p4
constraints given in matrix y

The λ value corresponding to the constraint E(x) = 4 is 0.1746289, so the constraint


is informative, that is, the resulting distribution is no longer the uniform one. The
message at the end reminds us where the rest of the output is to be obtained (that is,
in the p4 variable) and that the constraints were passed by means of a Stata matrix.
To see the p estimate itself, we can just list the variable:

. list x p4, noobs sep(10)

x p4

1 .1030653
2 .1227305
3 .146148
4 .1740337
5 .2072401
6 .2467824

The distribution is weighted toward the larger numbers. We can check that these
estimates obey the restrictions:

. generate xp4=x*p4
. quietly summarize xp4
. display r(sum)
4

Finally, we can retrieve a copy of the constraint matrix labeled with the correspond-
ing variables.

. matrix list e(constraint)


symmetric e(constraint)[1,1]
c1
x 4

3.5 Methods and formulas


Instead of solving the constrained optimization problem given by the first-order condi-
tions [(3) to (5)] or their cross-entropy analogues, Golan, Judge, and Miller (1996, 30)
show that the solution can be found by maximizing the unconstrained dual cross-entropy
objective function
    L(\lambda) = \sum_{j=1}^{J} \lambda_j y_j - \ln\{\Omega(\lambda)\} = M(\lambda)       (8)

where Ω (λ) is given by (7). Golan, Judge, and Miller show that this function behaves
like a maximum likelihood. In this case,

    \nabla_\lambda M(\lambda) = y - X'p                                                   (9)

so that the constraint is met at the point where the gradient is zero. Furthermore,

    -\frac{\partial^2 M}{\partial \lambda_j^2} = \sum_{i=1}^{n} p_i x_{ji}^2 - \left(\sum_{i=1}^{n} p_i x_{ji}\right)^2 = \operatorname{var}(x_j)                  (10)

    -\frac{\partial^2 M}{\partial \lambda_j \partial \lambda_k} = \sum_{i=1}^{n} p_i x_{ji} x_{ki} - \left(\sum_{i=1}^{n} p_i x_{ji}\right)\left(\sum_{i=1}^{n} p_i x_{ki}\right) = \operatorname{cov}(x_j, x_k)                  (11)

where the variances and covariances are taken with respect to the distribution p. The
negative of the Hessian of M is therefore guaranteed to be positive definite, which
guarantees a unique solution provided that the constraints are not inconsistent.
Golan, Judge, and Miller (1996, 25) note that the function M can be thought of
as an expected log likelihood, given the exponential family p (λ) parameterized by λ.
Along these lines, we use Stata’s maximum likelihood routines to estimate λ, giving it
the dual objective function [(8)], gradient [(9)], and negative Hessian [(10) and (11)].
The routine that calculates these is contained in maxentlambda_d2.ado. Because of
the globally concave nature of the objective function, convergence should be relatively
quick, provided that there is a feasible solution in the interior of the parameter space.
The command checks for some obvious errors; for example, the population means (yj )
must be inside the range of the Xj variables. If any mean is on the boundary of the
range, then a degenerate solution is feasible, but the corresponding Lagrange multiplier
will be ±∞, so the algorithm will not converge.
Once the estimates of λ have been obtained, estimates of p are derived from (6).
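
As a concrete check of these formulas (a small sketch, not part of maxentropy itself), the dual objective (8), its gradient (9), and the solution (6) can be evaluated directly in Mata for the die problem of section 3.4, plugging in the λ reported there:

mata:
x = (1::6)                          // die outcomes; X is n x 1 here (J = 1)
y = 4                               // constraint: E(x) = 4
q = J(6, 1, 1/6)                    // uniform prior, so this is the maximum entropy case
lambda = .17462893                  // value reported by maxentropy in section 3.4
Omega = sum(q :* exp(x*lambda))     // (7)
M = lambda*y - ln(Omega)            // dual objective (8)
p = q :* exp(x*lambda) / Omega      // solution (6); matches the p4 variable listed above
grad = y - x'p                      // gradient (9); numerically zero at the optimum
p', grad
end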

3.6 Saved results


maxentropy saves the following in e():
Macros
    e(cmd)          maxentropy
    e(properties)   b V
Matrices
    e(b)            λ coefficient estimates
    e(V)            inverse of negative Hessian
    e(constraint)   constraint vector
Functions
    e(sample)       marks estimation sample

3.7 A cautionary note


The estimation routine treats λ as though it were estimated by maximum likelihood.
This is true only if we can write p as
p ∝ exp (−Xλ)
Given that assumption, we could test hypotheses on the λ parameters. Because the esti-
mation routine calculates the inverse of the negative of the Hessian (that is, the asymp-
totic covariance matrix of λ under this parametric assumption), it would be possible to
implement such tests. For most practical applications, this parametric interpretation of
the procedure is likely to be dubious.
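
A sketch of such a test, under the parametric assumption above: because maxentropy posts e(b) and e(V) with properties b V (section 3.6), Stata's usual Wald-test machinery should apply, assuming the coefficients in e(b) are labeled by the constraint variables, as the output in section 3.4 suggests.

. maxentropy x, matrix(y) generate(ptest)
. test x                            // Wald test of H0: lambda_x = 0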

4 Sample applications
4.1 Jaynes’s die problem
In section 3.4, I showed how to calculate the probability distribution given that y = 4.
The following code generates predictions given different values for y:
matrix y=(2)
maxentropy x, matrix(y) generate(p2)
matrix y=(3)
maxentropy x, matrix(y) generate(p3)
matrix y=(3.5)
maxentropy x, matrix(y) generate(p35)
matrix y=(5)
maxentropy x, matrix(y) generate(p5)
list p2 p3 p35 p4 p5, sep(10)

The impact of different prior information on the estimated probabilities is shown in


the following table:
. list p2 p3 p35 p4 p5, sep(10)

p2 p3 p35 p4 p5

1. .4781198 .2467824 .1666667 .1030653 .0205324


2. .254752 .2072401 .1666667 .1227305 .0385354
3. .135737 .1740337 .1666667 .146148 .0723234
4. .0723234 .146148 .1666667 .1740337 .135737
5. .0385354 .1227305 .1666667 .2072401 .2547519
6. .0205324 .1030652 .1666667 .2467824 .4781198

Note in particular that when we set y = 3.5, the command returns the uniform
discrete distribution with pi = 1/6.
We can see the impact of adding in a second constraint by considering the same
problem given the population moments
   
    y = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix} = \begin{pmatrix} 3.5 \\ \sigma^2 \end{pmatrix}

for different values of $\sigma^2$. By definition in this case, $\sigma^2 = \sum_{i=1}^{6} p_i (x_i - 3.5)^2$. We can therefore create the values $(x_i - 3.5)^2$ and consider which probability distribution $p = (p_1, p_2, \ldots, p_6)$ will generate both a mean of 3.5 and a given value of $\sigma^2$. The code to run this is

generate dev2=(x-3.5)^2
matrix y=(3.5 \ (2.5^2/3+1.5^2/3+0.5^2/3))
maxentropy x dev2, matrix(y) generate(pv)
matrix y=(3.5 \ 1)
maxentropy x dev2, matrix(y) generate(pv1)
matrix y=(3.5 \ 2)
maxentropy x dev2, matrix(y) generate(pv2)
matrix y=(3.5 \ 3)
maxentropy x dev2, matrix(y) generate(pv3)
matrix y=(3.5 \ 4)
maxentropy x dev2, matrix(y) generate(pv4)
matrix y=(3.5 \ 5)
maxentropy x dev2, matrix(y) generate(pv5)
matrix y=(3.5 \ 6)
maxentropy x dev2, matrix(y) generate(pv6)

with the following final result:

. list pv1 pv2 pv pv3 pv4 pv5 pv6, sep(10) noobs

pv1 pv2 pv pv3 pv4 pv5 pv6

.018632 .0885296 .1666667 .1741325 .2672036 .3659436 .4713601


.1316041 .1719114 .1666667 .1651027 .1358892 .0896692 .0234196
.3497639 .2395591 .1666667 .1607649 .0969072 .0443872 .0052203
.3497639 .2395591 .1666667 .1607649 .0969072 .0443872 .0052203
.1316041 .1719113 .1666667 .1651026 .1358892 .0896692 .0234196
.018632 .0885296 .1666667 .1741325 .2672036 .3659436 .4713601

The probabilities behave as we would expect: in the case where σ 2 = 35/12, we get
the uniform distribution. With variances smaller than this, the probability distribution
puts more emphasis on the values 3 and 4, while with higher variances the distribution
becomes bimodal with greater probability being attached to extreme values. This output
does not reveal that in all cases the λ1 estimate is basically zero. The reason for this
is that with a symmetrical distribution of xi values around the population mean, the
mean is no longer informative and all the information about the distribution of p derives
from the second constraint. If we force p4 = p5 = 0 so that the distribution is no longer
symmetrical, the first constraint becomes informative, as shown in this output:


. maxentropy x dev2 if x!=5&x!=4, matrix(y) generate(p5, replace)


Cross entropy estimates

Variable lambda

x .0119916
dev2 .59568007

p values returned in p5
constraints given in matrix y
. list x p5 if e(sample), noobs

x p5

1 .4578909
2 .0427728
3 .0131515
6 .4861848

This example shows how to overwrite an existing variable and demonstrates that the
command allows if and in qualifiers. It also shows how to use the e(sample) function.

4.2 Calibrating a survey


The basic point of calibration is to adjust the sampling weights so that the marginal
totals in different categories correspond to the population totals. Typically, the ad-
justments are made on demographic (for example, age and gender) and spatial vari-
ables. Early approaches included iterative raking procedures (Deming and Stephan
1940). These were generalized in the CALMAR routines described in Deville and Särndal
(1992). The idea of using a minimum information loss criterion for this purpose is not
original (see, for instance, Merz and Stolze [2008]), although it does not seem to have
been appreciated that the procedure leads to the same estimates as iterative raking-ratio
adjustments, if those adjustments are iterated to convergence.
The major advantage of using the cross-entropy approach rather than raking is that it
becomes straightforward to incorporate constraints that do not include marginal totals.
In many household surveys, for instance, it is plausible that mismatches between the
sample and the population arise due to differential success in sampling household types
rather than in enumerating individuals within households. Under these conditions, it
makes sense to require that all raising weights within a household be identical. I give
an example below that shows how cross-entropy estimation with such a constraint can
be feasibly implemented.
These capacities also exist within other calibration macros and commands. The
advantage of the maxentropy command is that it can do so within Stata—and it is
fairly easy and quick to use.
To demonstrate these possibilities, we load example1.dta, which contains a hypo-
thetical survey with a set of prior weights. The sum of these weights by stratum and
gender is given in table 1, where we have also indicated the population totals to which
the weights should gross.

Table 1. Sum of weights from example1.dta by stratum, gender, and gross weight for
population totals

                        gender
    stratum            0       1    Margin    Required
    0                100     400       500        1600
    1                300     200       500         400
    Margin           400     600      1000
    Required        1200     800                  2000

The weights can be adjusted to these totals by using the downloadable survwgt
command. To use the maxentropy command, we need to convert the desired constraints
from population totals into population means. That is straightforward because


    N = \sum_{i=1}^{n} w_i                                                                (12)

    N_{\text{gender}=0} = \sum_{i=1}^{n} w_i \, 1(\text{gender} = 0)                      (13)

where $1(\text{gender} = 0)$ is the indicator function. So dividing everything by $N$, the population total, we get a set of constraints that look identical to those used earlier:

    1 = \sum_{i=1}^{n} \frac{w_i}{N} = \sum_{i=1}^{n} p_i

    \Pr(\text{gender} = 0) = \sum_{i=1}^{n} \frac{w_i}{N} \, 1(\text{gender} = 0) = \sum_{i=1}^{n} p_i \, 1(\text{gender} = 0)

We could obviously add a condition for the proportion where gender = 1, but because
of the adding-up constraint, that would be redundant. If we have k categories for a
particular variable, we can only use k − 1 constraints in our estimation.
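
For instance, converting the required totals in table 1 into the population means that serve as constraints gives

    \Pr(\text{stratum} = 1) = \frac{400}{2000} = 0.2, \qquad \Pr(\text{gender} = 1) = \frac{800}{2000} = 0.4

which are exactly the values recovered from e(constraint) below.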
In this particular example, the constraint vector is contained in the constraint
variable. The syntax of the command in this case is
maxentropy constraint stratum gender, generate(wt3) prior(weight) total(2000)

We did not specify a matrix, so the first variable is interpreted as the constraint
vector. We did specify a prior weight and asked Stata to convert the calculated proba-
bilities to raising weights by multiplying them by 2,000. A comparison with the “raked
weights” confirms them to be identical in this case.
We can check whether the constraints were correctly rendered by retrieving the
constraint matrix used in the estimation:

. matrix C=e(constraint)
. matrix list C
C[2,1]
c1
stratum .2
gender .40000001

We see that E(stratum) = 0.2 and E(gender) = 0.4. Means of dummy variables
are, of course, just population proportions; that is, the proportion in stratum = 1 is
0.2 and the proportion where gender = 1 is 0.4.
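
A quick way to verify the calibration itself (a hypothetical check, not part of the original session) is to add up the new weights within the categories of table 1:

. tabstat wt3, by(stratum) statistics(sum)
. tabstat wt3, by(gender) statistics(sum)

The sums should reproduce the required margins: 1,600 and 400 by stratum, and 1,200 and 800 by gender.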

4.3 Imposing constant weights within households


In most household surveys, the household is the unit that is sampled and the individuals
are enumerated within it. Consequently, the probability of including an individual
conditional on the household being selected is 1. This suggests that the weight attached
to every individual within a household should be equal. We can impose this restriction
with a fairly simple ploy. We rewrite constraint (12) by first summing over individuals
within the household (hhsize) and then summing over households as

    N = \sum_h \sum_i w_{ih} = \sum_h \text{hhsize}_h \, w_h

that is,

    N = \sum_h w_h^*

where $w_{ih}$ is the weight of individual $i$ within household $h$, equal to the common weight $w_h$. This constraint can again be written in the form of probabilities as

    1 = \sum_h \frac{w_h^*}{N}

that is,

    1 = \sum_h p_h^*

Consider now any other constraint involving individual aggregates [for example, (13)]:

    N_x = \sum_{i=1}^{n} w_i x_i = \sum_h \sum_i w_{ih} x_{ih} = \sum_h w_h \sum_i x_{ih}

    \frac{N_x}{N} = \sum_h \frac{w_h \, \text{hhsize}_h}{N} \, \frac{\sum_i x_{ih}}{\text{hhsize}_h}

Consequently,

    E(x) = \sum_h p_h^* m_{xh}                                                            (14)

The term $m_{xh}$ is just the mean of the $x$ variable within household $h$.
If the prior weight $q_h$ is similarly constant within households (as it should be if it is a design weight), then we similarly create a new variable

    q_h^* = \text{hhsize}_h \, q_h

We can then write the cross-entropy objective function as

    I(p, q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right) = \sum_h \sum_i p_{ih} \ln\left(\frac{p_{ih}}{q_{ih}}\right)
            = \sum_h \sum_i p_h \ln\left(\frac{p_h \, \text{hhsize}_h}{q_h \, \text{hhsize}_h}\right)
            = \sum_h \sum_i p_h \ln\left(\frac{p_h^*}{q_h^*}\right)
            = \sum_h \text{hhsize}_h \, p_h \ln\left(\frac{p_h^*}{q_h^*}\right)
            = \sum_h p_h^* \ln\left(\frac{p_h^*}{q_h^*}\right)

In short, the objective function evaluated over all individuals and imposing the constraint $p_{ih} = p_h$ for all $i$ is identical to the objective function evaluated over households where the probabilities have been adjusted to $p_h^*$ and $q_h^*$. We therefore run the maxentropy command on a household-level file, with the population constraints given by (14). Our cross-entropy estimates can then be retrieved as

    p_h = \frac{p_h^*}{\text{hhsize}_h}

We can check that the weights obtained in this way do, in fact, obey all the
restrictions—they are obviously constant within household, and when added up over
the individuals, they reproduce the required totals.
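
The workflow just described can be sketched as follows for a generic individual-level file. The variable names (hhid, x1, x2, desweight) and the population figures are hypothetical; hhcollapsed.dta in section 4.4 was presumably built along these lines.

use individuals, clear
bysort hhid: generate hhsize = _N
* household means m_xh of the calibration variables, and the (constant)
* design weight within each household
collapse (mean) x1 x2 (first) desweight hhsize, by(hhid)
generate qstar = hhsize*desweight                  // q*_h = hhsize_h x q_h
matrix y = (0.40 \ 0.25)                           // hypothetical population means of x1, x2
maxentropy x1 x2, matrix(y) prior(qstar) generate(wstar) total(48687000)
replace wstar = wstar/hhsize                       // w_h = w*_h / hhsize_h
* merge wstar back onto the individual-level file so that every member of a
* household carries the same raising weight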

4.4 Calibrating the South African National Income Dynamics Survey


To assess the performance of the maxentropy command on a more realistic problem, we
consider the problem of calibrating South Africa’s National Income Dynamics Survey.
This was a nationally representative sample of around 7,300 households and around
30,000 individuals. From the sampling design, a set of “design weights” were calculated,
but application of these weights to the realized sample led to a severe undercount when
compared with the official population estimates.
The calibration was to be done to reproduce the nine provincial population counts
and 136 age × sex × race cell totals. One practical difficulty that was immediately
encountered was how to treat individuals where age, sex, or race information was miss-
ing, because this category does not exist in the national estimates. It was decided to
keep the relative weights of the missing observations constant through the calibration,
creating a 137th age, sex, and race category. From each group of dummy variables, one
category had to be omitted, creating altogether 144 (or 8 + 136) constraints.
hhcollapsed.dta contains household-level means of all these variables plus the
household design weights. The code to create cross-entropy weights that are constant
within households is given by the following:
use hhcollapsed
maxentropy constraint P1-WFa80, prior(q) generate(hw) total(48687000)
replace hw=hw/hhsize
matrix list e(constraint)

With 144 constraints and 7,305 observations, the command took 18 seconds to cal-
culate the new weights on a standard desktop computer.

In this context, the λ estimates prove informative. The output of the command is

. maxentropy constraint P1-WFa80, prior(q) generate(hw) total(48687000)


Cross entropy estimates

Variable lambda

P1 -.15945276
P2 .00735986
P3 .14000206
(output omitted )
IMa75 15.402056
IMa80 8.6501559
IFa_0 -7.0753612
IFa_5 2.3584972
(output omitted )
IFa75 -9.2778495
IFa80 14.142518
(output omitted )
WFa70 .05009103
WFa75 .90961156
WFa80 4.6868009

p values returned in hw
constraints given in variable constraints

The huge coefficients for old Indian males and old Indian females suggest that the
population constraints affected the weights for these categories substantially. Given the
large number of constraints, mistakes are possible. The easiest way to check that the
command has worked correctly is to add up the weights within categories and to check
that they add up to the intended totals. Listing the constraint matrix used by the
command is also a useful check. In this case, the labeling of the rows does help:
. matrix list e(constraint)
e(constraint)[144,1]
c1
P1 .10803039
P2 .13514069
P3 .02320805
P4 .05914972
P5 .20764017
P6 .07044568
P7 .21462312
P8 .07373177
AMa_0 .04486157
AMa_5 .04584822
(output omitted )
WFa75 .0012318
WFa80 .00147087

The first eight constraints are the province proportions followed by the proportions
in the age, sex, and race cells.

5 Conclusion
This article introduced the power of maximum entropy and minimum cross-entropy
estimation. The maxentropy command uses Stata’s powerful maximum-likelihood esti-
mation routines to provide fast estimates of even complicated problems. I have shown
how the command can be used to calibrate a survey to a set of known population totals
while imposing restrictions like constant weights within households.

6 References
Deming, W. E., and F. F. Stephan. 1940. On a least squares adjustment of a sample fre-
quency table when the expected marginal totals are known. Annals of Mathematical
Statistics 11: 427–444.

Deville, J.-C., and C.-E. Särndal. 1992. Calibration estimators in survey sampling.
Journal of the American Statistical Association 87: 376–382.

Deville, J.-C., C.-E. Särndal, and O. Sautory. 1993. Generalized raking procedures in
survey sampling. Journal of the American Statistical Association 88: 1013–1020.

Golan, A., G. G. Judge, and D. Miller. 1996. Maximum Entropy Econometrics: Robust
Estimation with Limited Data. Chichester, UK: Wiley.
Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical Review 106:
620–630.

Merz, J., and H. Stolze. 2008. Representative time use data and new harmonised cali-
bration of the American Heritage Time Use Data 1965–1999. electronic International
Journal of Time Use Research 5: 90–126.

Mittelhammer, R. C., G. G. Judge, and D. J. Miller. 2000. Econometric Foundations.


Cambridge: Cambridge University Press.

About the author


Martin Wittenberg teaches core econometrics and microeconometrics to graduate students in
the Economics Department at the University of Cape Town.
The Stata Journal (2010)
10, Number 3, pp. 331–338

bacon: An effective way to detect outliers in


multivariate data using Stata (and Mata)

Sylvain Weber
University of Geneva
Department of Economics
Geneva, Switzerland
sylvain.weber@unige.ch

Abstract. Identifying outliers in multivariate data is computationally intensive.


The bacon command, presented in this article, allows one to quickly identify out-
liers, even on large datasets of tens of thousands of observations. bacon constitutes
an attractive alternative to hadimvo, the only other command available in Stata
for the detection of outliers.
Keywords: st0197, bacon, hadimvo, outliers detection, multivariate outliers

1 Introduction
The literature on outliers is abundant, as proved by Barnett and Lewis’s (1994) bibli-
ography of almost 1,000 articles. Despite this considerable research by the statistical
community, knowledge apparently fails to spill over, so proper methods for detecting
and handling outliers are seldom used by practitioners in other fields.
The reason is likely that implemented algorithms for the detection of outliers are
scarce. Moreover, the few algorithms available are so time-consuming that using them
may be discouraging. Until now, hadimvo was the only command in Stata available for
identifying outliers. Anyone who has tried to use hadimvo on large datasets, however,
knows it may take hours or even days to obtain a mere dummy variable indicating which
observations should be considered as outliers.
The new bacon command, presented in this article, provides a more efficient way
to detect outliers in multivariate data. It is named for the blocked adaptive computa-
tionally efficient outlier nominators (BACON) algorithm proposed by Billor, Hadi, and
Velleman (2000). bacon is a simple modification of the methodology proposed by Hadi
(1992, 1994) and implemented in hadimvo, but bacon is much less computationally in-
tensive. As a result, bacon runs many times faster than hadimvo, even though both
commands end up with similar sets of outliers. Identifying multivariate outliers thus
becomes fast and easy in Stata, even with large datasets of tens of thousands of obser-
vations.



2 The BACON algorithm


The BACON algorithm was proposed by Billor, Hadi, and Velleman (2000). The reader
who is interested in details is referred to that original article, because only a brief
presentation is provided here.
In step 1, an initial subset of m outlier-free observations has to be identified out of
a sample of n observations and over p variables. Any of several distance measures could
be used as a criterion, and the Mahalanobis distance seems especially well adapted.
It possesses the desirable property of being scale-invariant—a great advantage when
dealing with variables of different magnitudes or with different units. The Mahalanobis
distance of a $p$-dimensional vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ from a group of values with mean $\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)^T$ and covariance matrix $S$ is defined as

    d_i(\bar{x}, S) = \sqrt{(x_i - \bar{x})^T S^{-1} (x_i - \bar{x})}, \qquad i = 1, 2, \ldots, n

The initial basic subset is given by the m observations with the smallest Mahalanobis
distances from the whole sample. The subset size m is given by the product of the
number of variables p and a parameter chosen by the analyst.
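
For readers who want to see the computation concretely, the distance above can be obtained in a few lines of Mata (a minimal sketch operating on an n x p data matrix X already held in memory; it is not bacon's actual code):

mata:
real colvector mahal(real matrix X)
{
    real rowvector xbar
    real matrix    S, Z

    xbar = mean(X)                             // 1 x p vector of column means
    S    = quadvariance(X)                     // p x p covariance matrix
    Z    = X :- xbar                           // centered observations
    // d_i = sqrt( z_i S^-1 z_i' ), computed row by row
    return(sqrt(rowsum((Z*invsym(S)) :* Z)))
}
end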
Billor, Hadi, and Velleman (2000) also proposed using distances from the medians
for this first step. This second version of the algorithm is also implemented in bacon.
Distances from the medians are not scale-invariant, so they should be used carefully if
the variables analyzed are of different magnitudes.
In step 2, Mahalanobis distances from the basic subset are computed:

    d_i(\bar{x}_b, S_b) = \sqrt{(x_i - \bar{x}_b)^T S_b^{-1} (x_i - \bar{x}_b)}, \qquad i = 1, 2, \ldots, n                     (1)

In step 3, all observations with a distance smaller than some threshold—a corrected
percentile of a χ2 distribution—are added to the basic subset.
Steps 2 and 3 are iterated until the basic subset no longer changes. Observations
excluded from the final basic subset are nominated as outliers, whereas those inside the
final basic subset are nonoutliers.
The difference from the algorithm proposed by Hadi (1992, 1994) is that observa-
tions are added to the basic subset in blocks instead of one by one.
Thus some time is spared through a reduction of the number of iterations. Neverthe-
less, it is important to note that the performance of the algorithm is not altered, as
Billor, Hadi, and Velleman (2000) and section 5 of this article show.
The reduction in the number of iterations is not the only source of efficiency gain.
Another major improvement lies in the way bacon is coded. When hadimvo was im-
plemented, Mata did not exist. Now, though, Mata provides significant speed enhance-
ments to many computationally intensive tasks, like the calculation of Mahalanobis
distances. I therefore coded bacon so that it benefits from Mata’s power.

3 Why Mata matters for bacon


The bacon command uses Mata, the matrix programming language available in Stata
since version 9. I explain here how Mata allows bacon to run very fast. This section
draws heavily on Baum (2008), who offers a general overview of Mata’s capabilities.
The BACON algorithm requires creating matrices from data, computing the distances
using (1), and converting the new matrix-containing distances back into the data. Op-
erations that convert Stata variables into matrices (or vice versa) require at least twice
the memory needed for that set of variables, so it stands to reason that using Stata’s
matrices would consume a lot of memory. On the other hand, Mata’s matrices are only
views of, not copies of, data. Hence, using Mata’s virtual matrices instead of Stata’s
matrices in bacon spares memory that can be used to run the computations faster.
Moreover, Stata’s matrices are unsuited for holding large amounts of data, their
maximal size being 11,000 × 11,000. Using Stata, it would not be possible to create
a matrix X = (x1 , x2 , . . . , xi , . . . , xn )T containing all observations of the database if
the n were larger than 11,000. One would thus have to cut the X matrix into pieces
to compute the distances in (1), which is obviously inconvenient. Mata circumvents
the limitations of Stata’s traditional matrix commands, thus allowing the creation of
virtually infinite matrices (over 2 billion rows and columns). Thanks to Mata, I am
thus able to create a single matrix X containing all observations, whatever the value of n. I then
use the powerful element-by-element operations available to compute the distances.
Mata is indeed efficient for handling element-by-element operations, whereas Stata
ado-file code written in the matrix language with explicit subscript references is slow.
Because the distances in (1) have to be computed for each individual at each iteration
of the algorithm, this feature of Mata provides another important efficiency gain.
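
The view/copy distinction mentioned above can be illustrated with two lines of Mata (using auto.dta, which contains weight and length; this is only an illustration, not bacon's code):

mata:
st_view(X=., ., ("weight", "length"))   // X is a view onto the Stata variables: no copy
Y = st_data(., ("weight", "length"))    // Y is a copy: roughly doubles the memory used
end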

4 The bacon command


4.1 Syntax
The syntax of bacon is as follows:




    bacon varlist [if] [in], generate(newvar1 [newvar2]) [replace
        percentile(#) version(1 | 2) c(#)]

4.2 Options

generate(newvar1 [newvar2]) is required; it identifies the new variable(s) that will be


created. Specifying the second variable, however, is optional. newvar2, if
specified, will contain the distances from the final basic subset. That is, specifying
generate(out) creates a dummy variable out containing 1 if the observation is
an outlier in the BACON sense and 0 otherwise. Specifying generate(out dist)
334 Multivariate outliers detection

additionally creates a variable dist containing the distances from the final basic
subset.
replace specifies that the variables newvar1 and newvar2 be replaced if they already
exist in the database. This option makes it easier to run bacon several times on
the same data. It should be used cautiously because it might permanently drop some
data.
percentile(#) determines the 1 − # percentile of the chi-squared distribution to be
used as a threshold to separate outliers from nonoutliers. A larger # identifies a
larger proportion of the sample as outliers. The default is percentile(0.15). If #
is specified greater than 1, it is interpreted as a percent; thus percentile(15) is
the same as percentile(0.15).
version(1 | 2) specifies which version of the BACON algorithm must be used to identify
the initial basic subset in multivariate data. version(1), the default, identifies the
initial subset selected based on Mahalanobis distances. version(2) identifies the ini-
tial subset selected based on distances from the medians. In the case of version(2),
varlist must not contain missing values, and you must install the moremata command
before running bacon.
c(#) is the parameter that determines the size of the initial basic subset, which is given
by the product of # and the number of variables in varlist. # must be an integer.
c(4) is used by default as proposed by Billor, Hadi, and Velleman (2000, 285).

4.3 Saved results


bacon saves the following results in r():
Scalars
    r(outlier)   number of outliers
    r(corr)      correction factor
    r(iter)      number of iterations
    r(chi2)      percentile of the χ² distribution
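
These can be used directly after a run; for example (a hypothetical one-liner, not part of bacon's displayed output):

. display "bacon nominated " r(outlier) " outliers in " r(iter) " iterations"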

5 bacon versus hadimvo


Let us now compare bacon and hadimvo considering two criteria: i) the set of observa-
tions identified as outliers and ii) the speed. We will see that both commands lead to
similar outcomes (provided some tuning of the cutoff parameters) but that hadimvo is
far slower. bacon thus outperforms hadimvo and should be preferred in any case.
First, let us use auto.dta to illustrate the similarity of the results obtained through
both commands:

. webuse auto
(1978 Automobile Data)
. hadimvo weight length, generate(outhadi) percentile(0.05)
(output omitted )
. bacon weight length, generate(outbacon) percentile(0.15)
(output omitted )
. tabulate outhadi outbacon
Hadi
outlier BACON outlier (p=.15)
(p=.05) 0 1 Total

0 72 0 72
1 0 2 2

Total 72 2 74

Both commands have identified the same two observations as outliers. The param-
eter (in the percentile() option) was set higher in bacon than in hadimvo. With a
parameter of 5%, bacon would not have identified any observation as an outlier. It is
the role of the researcher to choose the parameter that is best adapted for each dataset,
but the default percentile(0.15) appears to bring sensible outcomes in any case and
could always be used as a first benchmark.
With two-dimensional data, it is helpful to draw a scatterplot such as figure 1 that
allows us to see where outliers are located:

. scatter weight length, ml(outbacon) ms(i) note("0 = nonoutlier, 1 = outlier")


[Figure 1 omitted from this extraction: a scatterplot of Weight (lbs.) against Length (in.), with each observation marked 0 (nonoutlier) or 1 (outlier)]

Figure 1. Scatterplot locating the observations identified as outliers

To compare the speeds of bacon and hadimvo, let us now use a larger dataset.
Containing about 28,000 observations, nlswork.dta is sufficiently large to illustrate
the point. Suppose we want to identify outliers with respect to the variables ln wage,
age, and tenure. If we did not have bacon, we would type
. webuse nlswork, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. hadimvo ln_wage age tenure, generate(outhadi) percentile(0.05)
Beginning number of observations: 28101
Initially accepted: 4
Expand to (n+k+1)/2:

At this point, your screen will remain idle. You might become worried and think
your computer crashed, but in fact hadimvo is simply going to take some long minutes
to run its many iterations. Remember, there are “only” 28,000 observations in this
dataset. If you are patient enough, Stata will at last show you the outcome:
. hadimvo ln_wage age tenure, generate(outhadi) percentile(0.05)
Beginning number of observations: 28101
Initially accepted: 4
Expand to (n+k+1)/2: 14052
Expand, p = .05: 28081
Outliers remaining: 20

Thanks to bacon, you now have a faster alternative. If you type


. bacon ln_wage age tenure, generate(outbacon) percentile(0.15)
Total number of observations: 28101
BACON outliers (p = 0.15): 29
Non-outliers remaining: 28072

the solution appears in only a few seconds! Again we can check that the set of identified
outliers is pretty much the same in the two cases:
. tabulate outhadi outbacon
Hadi
outlier BACON outlier (p=.15)
(p=.05) 0 1 Total

0 28,072 9 28,081
1 0 20 20

Total 28,072 29 28,101

Given the time hadimvo needs and the similarities between the outcomes, it seems
clear that bacon is preferable.
Because there is no rule for the choice of percentile(), the practitioner might
legitimately be willing to test several values and decide after several trials which set of
observations to nominate as outliers. With hadimvo, such an iterative process is almost
impracticable, unless you are particularly patient and have enough time in front of
you. With bacon, on the other hand, completing the iterative process becomes readily
feasible.

bacon has a replace option precisely to give the possibility of running the algorithm
several times without having to add a new variable at each iteration. For the user
wanting to try several percentile() values, replace will prove convenient:

. bacon ln_wage age tenure, generate(outbacon) percentile(0.1)


outbacon already defined
r(110);
. bacon ln_wage age tenure, generate(outbacon) percentile(0.1) replace
Total number of observations: 28101
BACON outliers (p = 0.10): 6
Non-outliers remaining: 28095
. bacon ln_wage age tenure, generate(outbacon) percentile(0.2) replace
Total number of observations: 28101
BACON outliers (p = 0.20): 160
Non-outliers remaining: 27941

6 Conclusion
“The two big questions about outliers are ‘how do you find them?’ and ‘what do you
do about them?’” (Ord 1996). The bacon command presented here provides an answer
to the first of these questions. The answer to the second is beyond the scope of this
article and is left to the consideration of the researcher.
No doubt, bacon renders the process of detecting outliers in multivariate data easier.
Compared with hadimvo, the only other command devoted to this task in Stata, bacon
appears to identify a similar set of observations as outliers. In terms of speed, bacon
proves to be far faster. Hence, there is no apparent reason to use hadimvo instead of
bacon.
Even though the bacon command provides a fast and easy way to identify potential
outliers, a certain amount of judgment is always needed when deciding which cases to
nominate as outliers and what to do with those observations. Most researchers simply
discard outliers, but before you do so, keep in mind that something new and useful can
often be learned by looking at the nominated cases.

7 References
Barnett, V., and T. Lewis. 1994. Outliers in Statistical Data. 3rd ed. Chichester, UK:
Wiley.

Baum, C. F. 2008. Using Mata to work more effectively with Stata: A tutorial. UK
Stata Users Group meeting proceedings.
http://ideas.repec.org/p/boc/usug08/11.html.

Billor, N., A. S. Hadi, and P. F. Velleman. 2000. BACON: Blocked adaptive computa-
tionally efficient outlier nominators. Computational Statistics & Data Analysis 34:
279–298.

Hadi, A. S. 1992. Identifying multiple outliers in multivariate data. Journal of the Royal
Statistical Society, Series B 54: 761–771.

———. 1994. A modification of a method for the detection of outliers in multivariate


samples. Journal of the Royal Statistical Society, Series B 56: 393–396.

Ord, K. 1996. Review of Outliers in Statistical Data, 3rd ed., by V. Barnett and T.
Lewis. International Journal of Forecasting 12: 175–176.

About the author


Sylvain Weber is working as a teaching assistant in the Department of Economics at the
University of Geneva in Switzerland. He is pursuing a PhD in the field of human capital
depreciation, wage growth over the career, and job stability.
The Stata Journal (2010)
10, Number 3, pp. 339–358

Comparing the predictive powers of survival


models using Harrell’s C or Somers’ D
Roger B. Newson
National Heart and Lung Institute
Imperial College London
London, UK
r.newson@imperial.ac.uk

Abstract. Medical researchers frequently make statements that one model pre-
dicts survival better than another, and they are frequently challenged to provide
rigorous statistical justification for those statements. Stata provides the estat
concordance command to calculate the rank parameters Harrell’s C and Somers’ D
as measures of the ordinal predictive power of a model. However, no confidence
limits or p-values are provided to compare the predictive power of distinct models.
The somersd package, downloadable from Statistical Software Components, can
provide such confidence intervals, but they should not be taken seriously if they are
calculated in the dataset in which the model was fit. Methods are demonstrated
for fitting alternative models to a training set of data, and then measuring and
comparing their predictive powers by using out-of-sample prediction and somersd
in a test set to produce statistically sensible confidence intervals and p-values for
the differences between the predictive powers of different models.
Keywords: st0198, somersd, stcox, estat concordance, streg, predict, survival,
model validation, prediction, concordance, rank methods, Harrell’s C, Somers’ D

1 Introduction
Harrell’s C and the equivalent parameter Somers’ D were proposed as measures of
the general predictive power of a general regression model by Harrell et al. (1982) and
Harrell, Lee, and Mark (1996), who focused attention on the case of a survival model
with a possibly right-censored outcome, which was interpreted as a lifetime. In the case
of a Cox proportional hazards regression model, both parameters are output by the
Stata postestimation command estat concordance (see [ST] stcox postestimation).1
However, because Harrell’s C and Somers’ D are rank parameters, they are equally valid
as measures of the predictive power of any model in which the scalar outcome Y is at
least ordinal (with or without censorship), and in which the conditional distribution
of the outcome, given the predictor variables, is governed by a scalar function of the
predictor variables and the parameters, such as the hazard ratio in a Cox regression or
the linear predictor in a generalized linear model. If the assumptions of the model are
true, then such a scalar predictive score plays the role of a balancing score as defined
by Rosenbaum and Rubin (1983).
1. As of Stata 11.1, estat concordance provides two concordance measures: Harrell’s C and Gönen
and Heller’s K. Harrell’s C is computed by default or if harrell is specified.


© 2010 StataCorp LP   st0198

Harrell’s C and Somers’ D are members of the Kendall family of rank parameters.
The family history can be summarized as follows: Kendall’s τa begat Somers’ D begat
Theil–Sen percentile slopes. This family is implemented in Stata by using the somersd
package, which can be downloaded from Statistical Software Components. An overview
of the parameter family is given in Newson (2002), and the methods and formulas are
given in detail in Newson (2006a,b,c).
Parameters in this family are defined by assuming the existence of a population of
bivariate data pairs of the form (Xi , Yi ) and a sampling scheme for sampling pairs of
pairs {(Xi , Yi ), (Xj , Yj )} from that population. A pair of pairs is said to be concordant
if the larger of the X values is paired with the larger of the Y values, and a pair is
said to be discordant if the larger of the X values is paired with the smaller of the
Y values. Kendall’s τa is the difference between the probability of concordance and
the probability of discordance. Somers’ D(X | Y ) is the difference between the cor-
responding conditional probabilities, assuming that the two Y values can be ordered.
Harrell’s C(X | Y ) is defined as {D(X | Y ) + 1}/2 and is equal to the conditional proba-
bility of concordance plus half the conditional probability that the data pairs are neither
concordant nor discordant, assuming that the two Y values can be ordered. In the case
where Y is an outcome to be predicted by a multivariate model with a scalar predictive
score, there is an underlying population of multivariate data points (Yi , Vi1 , . . . , Vik )
where the Vih are predictive covariates and the role of the Xi is played by the scalar
predictive score η(Vi1 , . . . , Vik ). In this case, the Somers’ D and Harrell’s C parameters
can be denoted as D{η(V1 , . . . , Vk ) | Y } and C{η(V1 , . . . , Vk ) | Y }, respectively. If the
model is a survival model, then the Y values are lifetimes, and there is the possibility
that one or both of a pair of Y values may be censored, which sometimes implies that
they cannot be ordered.
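
In symbols, writing “orderable” for the event that the two Y values can be ordered, these definitions amount to the following summary of the relationships just stated:

\begin{align*}
\tau_a &= \Pr(\text{concordant}) - \Pr(\text{discordant}),\\
D(X \mid Y) &= \Pr(\text{concordant} \mid \text{orderable}) - \Pr(\text{discordant} \mid \text{orderable}),\\
C(X \mid Y) &= \frac{D(X \mid Y) + 1}{2}
            = \Pr(\text{concordant} \mid \text{orderable}) + \tfrac{1}{2}\Pr(\text{neither} \mid \text{orderable}).
\end{align*}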
We often want to compare the predictive powers of alternative predictors of the same
outcome Y . Newson (2002, 2006b) argues that if there is an underlying population of
trivariate data points (Wi , Xi , Yi ) and if any positive association between the Yi and
the Xi is caused by a positive association of both of these variables with the Wi , then
we must have the inequality D(X | Y ) − D(W | Y ) ≤ 0 or, equivalently, C(X | Y ) −
C(W | Y ) = {D(X | Y ) − D(W | Y )}/2 ≤ 0. This inequality still holds if the Y variable
may be censored, but not if the W or X variable may be censored. This implies that if
we have multiple alternative positive predictors of the same outcome, such as alternative
predictive scores from alternative multivariate models, then it may be useful to calculate
confidence intervals for the differences between the Somers’ D or Harrell’s C parameters
of these predictors with respect to the outcome, and then to make statements regarding
which predictors may or may not be secondary to which other predictors. In Stata,
this can be done by using lincom after the somersd command, as demonstrated in
section 4.1 of Newson (2002).
Medical researchers frequently make statements that one model predicts survival
better than another. Statistical referees acting for medical journals frequently challenge
the researchers to provide rigorous statistical justification for these statements. The
Stata postestimation command estat concordance provides estimates of Harrell’s C
and Somers’ D but provides no confidence limits for these, nor any confidence limits or
p-values for the differences between the values of these rank parameters from different
models. This is the case for good reason: confidence-interval formulas do not protect the user against the consequences of finding a model in the same dataset in which its parameters are then estimated. Used sequentially, the somersd and lincom commands provide confidence limits and p-values for differences between the Somers’ D or Harrell’s C parameters of different predictors. However, not all medical researchers know how to calculate a confidence
interval (CI) when the predictors are scalar predictive scores from models, and fewer
still know how to do so in such a way that the confidence limits can be taken seriously.
In this article, I aim to explain how medical researchers can calculate CIs and preempt
possible queries that may arise in the process.
The remainder of this article is divided into four sections. Section 2 addresses
the queries that commonly arise when users try to duplicate the results of estat
concordance using somersd. Section 3 describes the method of splitting the data into
a training set (to which models are fit) and a test set (in which their predictive powers
are measured). Section 4 describes the extension to non-Cox survival models, such as
those described in [ST] streg. Finally, section 5 briefly explains how the methods can
be extended even further.

2 The Cox model: somersd versus estat concordance


I will demonstrate the principles using the Cox proportional hazards model, which is
implemented in Stata using the stcox command (see [ST] stcox). I also use the Stanford
drug-trial dataset, which is used for the examples in [ST] stcox postestimation.
Before I raise the issue of confidence limits, we need to see how somersd can pro-
duce the same estimates as estat concordance. This is done using predict after the
survival estimation command to define the predictive score, and then using somersd to
measure the association of the predictive score with the lifetime. Users who attempt
to use somersd to duplicate the estimates of estat concordance may face confusion
caused by these three issues:

1. The predict command, used after stcox, by default produces a negative predic-
tion score, in contrast to the positive prediction score produced by using predict
after most estimation commands.
2. The default coding of a censorship status variable for stcox is different from the
coding of a censorship status variable for somersd.
3. The treatment of tied failure times by estat concordance is different from that
used by somersd.

There are solutions to all of these problems, and I will demonstrate them, enabling
users to use somersd and estat concordance as checks on one another.
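
In outline, the three fixes fit together as in the following do-file sketch, which anticipates the worked example in the rest of this section (the dataset and variable names are those of the Stanford drug-trial data, and somersd must already be installed from Statistical Software Components):

use http://www.stata-press.com/data/r11/drugtr, clear

* Issue 3: add half a time unit to right-censored lifetimes, so that somersd
*   reproduces the estat concordance treatment of censored/uncensored ties.
generate studytime2 = studytime + 0.5*(died==0)
stset studytime2, failure(died)
stcox drug age

* Issue 1: after stcox, predict produces a hazard ratio, which is a negative
*   predictor of lifetime; invert it to obtain a positive predictor.
predict hr
generate invhr = 1/hr

* Issue 2: recode the stset failure indicator _d as a somersd censorship indicator.
generate censind = 1 - _d if _st==1

somersd _t invhr if _st==1, cenind(censind) tdist transf(c)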
Let’s start the demonstration by inputting the Stanford drug-trial data, fitting a
Cox model, and calling estat concordance:

. use http://www.stata-press.com/data/r11/drugtr
(Patient Survival in Drug Trial)
. stset
-> stset studytime, failure(died)
failure event: died != 0 & died < .
obs. time interval: (0, studytime]
exit on or before: failure

48 total obs.
0 exclusions

48 obs. remaining, representing


31 failures in single record/single failure data
744 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 39
. stcox drug age
failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -99.911448
Iteration 1: log likelihood = -83.551879
Iteration 2: log likelihood = -83.324009
Iteration 3: log likelihood = -83.323546
Refining estimates:
Iteration 0: log likelihood = -83.323546
Cox regression -- Breslow method for ties
No. of subjects = 48 Number of obs = 48
No. of failures = 31
Time at risk = 744
LR chi2(2) = 33.18
Log likelihood = -83.323546 Prob > chi2 = 0.0000

_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

drug .1048772 .0477017 -4.96 0.000 .0430057 .2557622


age 1.120325 .0417711 3.05 0.002 1.041375 1.20526

. estat concordance
Harrell´s C concordance statistic
failure _d: died
analysis time _t: studytime
Number of subjects (N) = 48
Number of comparison pairs (P) = 849
Number of orderings as expected (E) = 679
Number of tied predictions (T) = 15
Harrell´s C = (E + T/2) / P = .8086
Somers´ D = .6172

The stset command shows us that the input dataset has already been set up as
a survival-time dataset that includes one observation per drug-trial subject as well as
data on survival time and termination modes, among other things (see [ST] stset).
The Cox model contains two predictive covariates, age (subject age in years) and drug
(indicating treatment group, with a value of 0 for placebo and a value of 1 for the
drug being tested). We then see that, according to estat concordance, Harrell’s C is
0.8086 and Somers’ D is 0.6172. The Somers’ D implies that when one of two subjects is
observed to survive another, the model predicts that the survivor is 61.72% more likely
to have a lower hazard ratio than the nonsurvivor. The Harrell’s C is the probability
that the survivor has the lower hazard ratio plus half the (possibly negligible) probability
that the two subjects have equal hazard ratios, and this sum is 80.86% on a percentage
scale.
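
As a check on the arithmetic, both figures follow from the counts P, E, and T reported by estat concordance above, together with the relation C = {D + 1}/2 from section 1:

\[
C = \frac{E + T/2}{P} = \frac{679 + 15/2}{849} \approx 0.8086,
\qquad
D = 2C - 1 \approx 0.6172 .
\]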
We will now see how to duplicate these estimates by using predict and somersd.
We start by defining a negative predictor of lifetime by using predict to calculate a
hazard ratio. We then derive an inverse hazard ratio, which we expect to be a positive
predictor of lifetime:
. predict hr
(option hr assumed; relative hazard)
. generate invhr=1/hr

This strategy addresses the first of the three sources of confusion mentioned before.
Addressing the second source of confusion, we need to define a censorship indicator
for input to the somersd command. The somersd command has a cenind() option
that requires a list of censorship indicators. These censorship indicators are allocated
one-to-one to the corresponding variables of the variable list input to somersd and must
be either variable names or zeros (implying a censorship indicator variable whose values
are all zero). Censorship indicator variables for somersd are positive in observations
where the corresponding input variable value is right-censored (or known to be equal to
or greater than its stated value), are negative in observations where the corresponding
input variable value is left-censored (or known to be equal to or less than its stated
value), and are zero in observations where the corresponding input variable value is
uncensored (or known to be equal to its stated value). If the list of censorship indicators
is shorter than the input variable list, then the list of censorship indicators is extended
on the right with zeros, implying that the variables without censorship indicators are
uncensored.
This coding scheme is not the same as that for the censorship indicator variable _d that is created by the stset command, which is 1 in observations where the corresponding lifetime is uncensored and is 0 in observations where the corresponding lifetime is right-censored.
To convert an stset censorship indicator variable to a somersd censorship indicator
variable, we use the command

. generate censind=1-_d if _st==1

This command creates a new variable, censind, which assumes the following values:
missing in observations excluded from the survival sample, as indicated by the variable _st created by stset; 1 in observations with right-censored lifetimes (where _d is 0); and 0 in observations with uncensored lifetimes (where _d is 1).

We can now use somersd to calculate Harrell’s C and Somers’ D, using the transf(c)
option for Harrell’s C and the transf(z) option (indicating the normalizing and vari-
ance-stabilizing Fisher’s z or hyperbolic arctangent transformation) for Somers’ D:

. somersd _t invhr if _st==1, cenind(censind) tdist transf(c)


Somers´ D with variable: _t
Transformation: Harrell´s c
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for Harrell´s c

Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

invhr .8106332 .0423076 19.16 0.000 .7255213 .8957451

. somersd _t invhr if _st==1, cenind(censind) tdist transf(z)


Somers´ D with variable: _t
Transformation: Fisher´s z
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for transformed Somers´ D

Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

invhr .7270649 .1378034 5.28 0.000 .4498402 1.00429

Asymmetric 95% CI for untransformed Somers´ D


Somers_D Minimum Maximum
invhr .62126643 .42176765 .76338983

In both cases, we use the survival-time variable _t, the survival sample indicator _st (created by stset), and the inverse hazard ratio invhr (created using predict) to estimate rank parameters of the inverse hazard ratio with respect to survival time, with censorship indicated by censind. In the case of Harrell’s C, the estimated parameter
is on a scale from 0 to 1 and is expected to be at least 0.5 for a positive predictor of
lifetime, such as an inverse hazard ratio. In the case of Somers’ D, the untransformed
parameter is on a scale from −1 to 1 and is expected to be at least 0 for a positive
predictor of lifetime.
However, we now encounter the third source of confusion mentioned before. If we
compare the estimates here to those produced earlier by estat concordance, we find
that the estimates for Harrell’s C and Somers’ D are similar but not exactly the same.
The estimates are 0.8106 and 0.6213, respectively, when computed by somersd, and
0.8086 and 0.6172, respectively, when computed by estat concordance. The reason
for this discrepancy is that somersd and estat concordance have different policies
for comparing two lifetimes that terminate simultaneously when one lifetime is right-
censored and the other is uncensored. The estat concordance policy assumes that
the owner of the right-censored lifetime survived the owner of the uncensored lifetime,
whereas the somersd policy assumes that neither of the two owners can be said to have
survived the other. In the case of a drug trial, one subject might be known to have
died in a certain month, whereas another might be known to have left the country in
the same month and has therefore become lost to follow-up. The estat concordance
policy assumes that the second subject must have survived the first, which might be
probable, given that this second subject seems to have been in a fit state to travel out
of the country. The somersd policy, more cautiously, allows the possibility that the
second subject may have left the country early in the month and died unexpectedly of
a venous thromboembolism on the outbound plane, whereas the first subject may have
died under observation of the trial organizers later in the same month.
Whatever the merits of the two policies, we might still like to show that somersd
and estat concordance can be made to duplicate one another’s estimates. This can
easily be done if lifetimes are expressed as whole numbers of time units, as they are
in the Stanford drug trial data, where lifetimes are expressed in months. In this case,
we can add half a unit to right-censored lifetimes only. As a result, right-censored
lifetimes become greater than uncensored lifetimes terminating within the same time
unit without affecting any other orderings of lifetimes.
In our example, we do this by generating a new lifetime variable, studytime2, that
is equal to the modified survival time. We then use stset to reset the various survival-
time variables and characteristics so that the modified survival time is now used. This
step is done after using the assert command to check that the old studytime variable
is indeed integer-valued; see [D] assert and [D] functions. We then proceed as in the
previous example:

. use http://www.stata-press.com/data/r11/drugtr, clear


(Patient Survival in Drug Trial)
. assert studytime==int(studytime)
. generate studytime2=studytime+0.5*(died==0)
. stset studytime2, failure(died)
failure event: died != 0 & died < .
obs. time interval: (0, studytime2]
exit on or before: failure

48 total obs.
0 exclusions

48 obs. remaining, representing


31 failures in single record/single failure data
752.5 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 39.5




. stcox drug age


failure _d: died
analysis time _t: studytime2
Iteration 0: log likelihood = -99.911448
Iteration 1: log likelihood = -83.551879
Iteration 2: log likelihood = -83.324009
Iteration 3: log likelihood = -83.323546
Refining estimates:
Iteration 0: log likelihood = -83.323546
Cox regression -- Breslow method for ties
No. of subjects = 48 Number of obs = 48
No. of failures = 31
Time at risk = 752.5
LR chi2(2) = 33.18
Log likelihood = -83.323546 Prob > chi2 = 0.0000

_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

drug .1048772 .0477017 -4.96 0.000 .0430057 .2557622


age 1.120325 .0417711 3.05 0.002 1.041375 1.20526

. estat concordance
Harrell´s C concordance statistic
failure _d: died
analysis time _t: studytime2
Number of subjects (N) = 48
Number of comparison pairs (P) = 849
Number of orderings as expected (E) = 679
Number of tied predictions (T) = 15
Harrell´s C = (E + T/2) / P = .8086
Somers´ D = .6172
. predict hr
(option hr assumed; relative hazard)
. generate invhr=1/hr
. generate censind=1-_d if _st==1
. somersd _t invhr if _st==1, cenind(censind) tdist transf(c)
Somers´ D with variable: _t
Transformation: Harrell´s c
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for Harrell´s c

Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

invhr .8085984 .0425074 19.02 0.000 .7230845 .8941122



. somersd _t invhr if _st==1, cenind(censind) tdist transf(z)


Somers´ D with variable: _t
Transformation: Fisher´s z
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for transformed Somers´ D

Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

invhr .7204641 .1373271 5.25 0.000 .4441976 .9967306

Asymmetric 95% CI for untransformed Somers´ D


Somers_D Minimum Maximum
invhr .6171967 .41711782 .76021766

This time, the model fit produces the same output as before, and the command
estat concordance produces the same estimates as it did before of 0.8086 and 0.6172
for Harrell’s C and Somers’ D, respectively. But now the same estimates of 0.8086 and
0.6172 are also produced by somersd, at least after rounding to four decimal places.
It should be stressed that Harrell’s C and Somers’ D, computed as above either
by somersd or by estat concordance, are valid measures of the predictive power of a
survival model only if there are no time-dependent covariates or lifetimes with delayed
entries. However, if somersd (instead of estat concordance) is used, then sensible
estimates can still be produced with weighted data, so long as those weights are explicitly
supplied to somersd.
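
For example, if the observations carried probability weights in a variable named swgt (a hypothetical variable name; the drug-trial data used here are unweighted), the weights could be supplied in the usual Stata weight syntax, along the lines of the following sketch:

. somersd _t invhr [pweight=swgt] if _st==1, cenind(censind) tdist transf(c)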

3 Comparing predictive powers with training and test sets
Another caution about the results of the previous section is that the confidence intervals
generated by somersd should not really be taken seriously. This is because, in general,
confidence intervals do not protect the user against the consequences of finding a model
in a dataset and then estimating its parameters in the same dataset. In the case of
Harrell’s C and Somers’ D of inverse hazard ratios with respect to lifetime, we would
expect this incorrect practice to lead to overly optimistic estimates of predictive power
because we are measuring the predictive power of a model that is optimized for the
dataset in which the predictive power is measured.
We really should be finding models in a training set of data and testing the models’
predictive powers, both absolute and relative to each other, in a test set of data that
is independent of the training set. If we have only one set of data, we might divide
its primary sampling units (randomly or semirandomly) into two subsets, and use the
first subset as the training set and the second subset as the test set. Sections 3.1
and 3.2 below demonstrate this practice by splitting the Stanford drug-trial data into
a training set and a test set of similar sizes, using random subsets and semirandom
stratified subsets, respectively. We will use the somersd policy, rather than the estat
concordance policy, regarding tied censored and noncensored lifetimes.

3.1 Completely random training and test sets


We will first demonstrate the relatively simple practice of splitting the sampling units,
completely at random, into a training set and a test set. We will fit three models to the
training set: model 1, containing the variables drug and age; model 2, containing drug
only; and model 3, containing age only. Next we will use out-of-sample prediction and
somersd to estimate the predictive powers of these three models in the test set. We
will then use lincom to compare their predictive powers, in the manner of section 5.2
of Newson (2006b).
We start by inputting the data and then splitting them, completely at random, into
a training set and a test set. We use the runiform() function to create a uniformly
distributed pseudorandom variable, sort to sort the dataset by this variable, and the
mod() function to allocate alternate observations to the training and test sets (see
[D] sort and [D] functions). We then re-sort the data back to their old order using the
generated variable oldord.

. use http://www.stata-press.com/data/r11/drugtr, clear


(Patient Survival in Drug Trial)
. set seed 987654321
. generate ranord=runiform()
. generate long oldord=_n
. sort ranord, stable
. generate testset=mod(_n,2)
. sort oldord
. tabulate testset, m
testset Freq. Percent Cum.

0 24 50.00 50.00
1 24 50.00 100.00

Total 48 100.00

We see that there are 24 patient lifetimes in the training set (where testset==0)
and 24 in the test set (where testset==1). We then fit the three Cox models to the
training set and create the inverse hazard-rate variables invhr1, invhr2, and invhr3
for models 1, 2 and 3, respectively:

. stcox drug age if testset==0


failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -36.900079
Iteration 1: log likelihood = -30.207704
Iteration 2: log likelihood = -30.075862
Iteration 3: log likelihood = -30.075741
Refining estimates:
Iteration 0: log likelihood = -30.075741
Cox regression -- Breslow method for ties
No. of subjects = 24 Number of obs = 24
No. of failures = 14
Time at risk = 370
LR chi2(2) = 13.65
Log likelihood = -30.075741 Prob > chi2 = 0.0011

_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

drug .1302894 .085747 -3.10 0.002 .0358683 .473269


age 1.139011 .0678588 2.18 0.029 1.013482 1.280089

. predict hr1
(option hr assumed; relative hazard)
. generate invhr1=1/hr1
. stcox drug if testset==0
failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -36.900079
Iteration 1: log likelihood = -32.692209
Iteration 2: log likelihood = -32.647379
Iteration 3: log likelihood = -32.647309
Refining estimates:
Iteration 0: log likelihood = -32.647309
Cox regression -- Breslow method for ties
No. of subjects = 24 Number of obs = 24
No. of failures = 14
Time at risk = 370
LR chi2(1) = 8.51
Log likelihood = -32.647309 Prob > chi2 = 0.0035

_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

drug .1843768 .112761 -2.76 0.006 .0556069 .611341

. predict hr2
(option hr assumed; relative hazard)
. generate invhr2=1/hr2




. stcox age if testset==0


failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -36.900079
Iteration 1: log likelihood = -35.587135
Iteration 2: log likelihood = -35.58462
Refining estimates:
Iteration 0: log likelihood = -35.58462
Cox regression -- Breslow method for ties
No. of subjects = 24 Number of obs = 24
No. of failures = 14
Time at risk = 370
LR chi2(1) = 2.63
Log likelihood = -35.58462 Prob > chi2 = 0.1048

_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

age 1.082178 .0526849 1.62 0.105 .9836912 1.190526

. predict hr3
(option hr assumed; relative hazard)
. generate invhr3=1/hr3

The variables invhr1, invhr2, and invhr3 are defined for all observations, both in
the training set and in the test set. We then define the censorship indicator, as before,
and estimate the Harrell’s C indexes in the test set for all three models fit to the training
set:

. generate censind=1-_d if _st==1


. somersd _t invhr1 invhr2 invhr3 if _st==1 & testset==1, cenind(censind) tdist
> transf(c)
Somers´ D with variable: _t
Transformation: Harrell´s c
Valid observations: 24
Degrees of freedom: 23
Symmetric 95% CI for Harrell´s c

Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

invhr1 .8819444 .0490633 17.98 0.000 .7804493 .9834396


invhr2 .7916667 .0330999 23.92 0.000 .7231944 .860139
invhr3 .6365741 .0831046 7.66 0.000 .4646592 .808489

We see that Harrell’s C of inverse hazard ratio with respect to lifetime is 0.8819 for
model 1 (using both drug treatment and age), 0.7917 for model 2 (using drug treatment
only), and 0.6366 for model 3 (using age only). All of these estimates have confidence
limits, which are probably less unreliable than the ones we saw in the previous sec-
tion. However, the sample Harrell’s C is likely to have a skewed distribution in the
presence of such strong positive associations, for the same reasons as Kendall’s τa (see
Daniels and Kendall [1947]). Differences between Harrell’s C indexes are likely to have
a less-skewed sampling distribution and are also what we probably really wanted to
know. We estimate these differences with lincom, as follows:

. lincom invhr1-invhr2
( 1) invhr1 - invhr2 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .0902778 .0350965 2.57 0.017 .0176751 .1628804

. lincom invhr1-invhr3
( 1) invhr1 - invhr3 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .2453704 .0736766 3.33 0.003 .0929586 .3977821

. lincom invhr2-invhr3
( 1) invhr2 - invhr3 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .1550926 .0823647 1.88 0.072 -.0152917 .3254769

Model 1 seems to have a slightly higher predictive power than model 2 or (especially)
model 3, while the difference between model 2 and model 3 is slightly less convincing.
We can also do the same comparison using Somers’ D rather than Harrell’s C, by using
the normalizing and variance-stabilizing z transform, recommended by Edwardes (1995)
and implemented using the somersd option transf(z). In that case, the differences
between the predictive powers of the different models will be expressed in z units (not
shown).
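
The commands for that comparison are the same as above apart from the transformation option, and are listed here as a sketch, without their output:

. somersd _t invhr1 invhr2 invhr3 if _st==1 & testset==1, cenind(censind) tdist transf(z)
. lincom invhr1-invhr2
. lincom invhr1-invhr3
. lincom invhr2-invhr3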

3.2 Stratified semirandom training and test sets


Completely random training and test sets may have the disadvantage that, by chance,
important predictor variables may have different sample distributions in the training
and test sets, making both the training set and the test set less representative of the
sample as a whole and of the total population from which the training and test sets were
sampled. We might feel safer if we chose the training and test sets semirandomly, with
the constraint that the two sets have similar distributions of key predictor variables in
the various models.
In our case, we might want to ensure that both the training set and the test set
contain their “fair share” of drug-treated older subjects, drug-treated younger subjects,
placebo-treated older subjects, and placebo-treated younger subjects. To ensure this,
we might start by defining sampling strata that are combinations of treatment status
and age group, and split each of these strata as evenly as possible between the training
set and the test set. Again, this requires the dataset to be sorted, and we will afterward
sort it back to its original order. We sort as follows, using the xtile command to define
age groups (see [D] pctile):

. use http://www.stata-press.com/data/r11/drugtr, clear


(Patient Survival in Drug Trial)
. set seed 987654321
. generate ranord=runiform()
. generate long oldord=_n
. xtile agegp=age, nquantiles(2)
. tabulate drug agegp, m
Drug type
(0=placebo 2 quantiles of age
) 1 2 Total

0 11 9 20
1 16 12 28

Total 27 21 48
. sort drug agegp ranord, stable
. by drug agegp: generate testset=mod(_n,2)
. sort oldord
. table testset drug agegp, row col scol

2 quantiles of age and Drug type (0=placebo)


1 2 Total
testset 0 1 Total 0 1 Total 0 1 Total

0 5 8 13 4 6 10 9 14 23
1 6 8 14 5 6 11 11 14 25

Total 11 16 27 9 12 21 20 28 48

This time, the training set is slightly smaller than the test set because of odd total
numbers of subjects in sampling strata. We then carry out the model fitting in the
training set and the calculation of inverse hazard ratios in both sets using the same
command sequence as with the completely random training and test sets, producing
mostly similar results, which are not shown. Finally, we estimate the Harrell’s C indexes
in the test set:

. generate censind=1-_d if _st==1


. somersd _t invhr1 invhr2 invhr3 if _st==1 & testset==1, cenind(censind) tdist
> transf(c)
Somers´ D with variable: _t
Transformation: Harrell´s c
Valid observations: 25
Degrees of freedom: 24
Symmetric 95% CI for Harrell´s c

Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

invhr1 .7911392 .0674598 11.73 0.000 .6519091 .9303694


invhr2 .7257384 .049801 14.57 0.000 .6229542 .8285226
invhr3 .5780591 .0972101 5.95 0.000 .3774274 .7786908

The C estimates for the three models are not dissimilar to the previous ones with
completely random training and test sets. Their pairwise differences are as follows:

. lincom invhr1-invhr2
( 1) invhr1 - invhr2 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .0654008 .0491405 1.33 0.196 -.0360202 .1668219

. lincom invhr1-invhr3
( 1) invhr1 - invhr3 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .2130802 .0763467 2.79 0.010 .0555084 .3706519

. lincom invhr2-invhr3
( 1) invhr2 - invhr3 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .1476793 .1080388 1.37 0.184 -.0753017 .3706603

Model 1 (with drug treatment and age) still seems to predict better than model 3
(with age alone). This conclusion is similar if we compare the z-transformed Somers’ D
values, which are not shown.

4 Extensions to non-Cox survival models


Measuring predictive power using Harrell’s C and Somers’ D is not restricted to Cox
models, but can be applied to any model with a positive or negative ordinal predictor.
The streg command (see [ST] streg) fits a wide range of survival models, each of
which has a wide choice of predictive output variables, which can be computed using
predict (see [ST] streg postestimation). These output variables may predict survival
times positively or negatively on an ordinal scale and may include median survival times,
mean survival times, median log survival times, mean log survival times, hazards, hazard
ratios, or linear predictors.
We will briefly demonstrate the principles involved by fitting Gompertz models to
the survival dataset that we used in previous sections. The Gompertz model assumes an
exponentially increasing (or decreasing) hazard rate, and the linear predictor is the log
of the zero-time baseline hazard rate, whereas the rate of increase (or decrease) in hazard
rate, after time zero, is a nuisance parameter. Therefore, if the Gompertz model is true,
then so is the Cox model. However, the argument of Fisher (1935) presumably implies
that if the Gompertz model is true, then we can be no less efficient, asymptotically, by
fitting a Gompertz model instead of a Cox model. We will use the predicted median
lifetime as the positive predictor, whose predictive power will be assessed using somersd.
We start by inputting the cancer trial dataset and defining the stratified, semirandom
training and test sets, exactly as we did in section 3.2. We then fit to the training
set Gompertz models 1, 2, and 3, containing, respectively, both drug treatment and
age, drug treatment only, and age only. After fitting each of the three models, we
use predict to compute the predicted median survival time for the whole sample,
deriving the alternative positive lifetime predictors medsurv1, medsurv2, and medsurv3
for models 1, 2, and 3, respectively:
. streg drug age if testset==0, distribution(gompertz) nolog
failure _d: died
analysis time _t: studytime
Gompertz regression -- log relative-hazard form
No. of subjects = 23 Number of obs = 23
No. of failures = 15
Time at risk = 338
LR chi2(2) = 20.62
Log likelihood = -14.076214 Prob > chi2 = 0.0000

_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

drug .0948331 .0594575 -3.76 0.000 .0277512 .3240694


age 1.172588 .0616365 3.03 0.002 1.057798 1.299836

/gamma .1553139 .0430892 3.60 0.000 .0708605 .2397672

. predict medsurv1
(option median time assumed; predicted median time)

. streg drug if testset==0, distribution(gompertz) nolog


failure _d: died
analysis time _t: studytime
Gompertz regression -- log relative-hazard form
No. of subjects = 23 Number of obs = 23
No. of failures = 15
Time at risk = 338
LR chi2(1) = 11.02
Log likelihood = -18.873214 Prob > chi2 = 0.0009

_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

drug .153411 .0877048 -3.28 0.001 .0500295 .4704213

/gamma .1063648 .0361612 2.94 0.003 .0354901 .1772394

. predict medsurv2
(option median time assumed; predicted median time)
. streg age if testset==0, distribution(gompertz) nolog
failure _d: died
analysis time _t: studytime
Gompertz regression -- log relative-hazard form
No. of subjects = 23 Number of obs = 23
No. of failures = 15
Time at risk = 338
LR chi2(1) = 5.56
Log likelihood = -21.606438 Prob > chi2 = 0.0184

_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

age 1.117255 .0516156 2.40 0.016 1.020536 1.223142

/gamma .088458 .0341184 2.59 0.010 .0215871 .1553288

. predict medsurv3
(option median time assumed; predicted median time)

Unsurprisingly, the fitted parameters are not dissimilar to the corresponding param-
eters for the Cox regression. We then compute the censorship indicator censind, and
then the Harrell’s C indexes, for the test set:




. generate censind=1-_d if _st==1


. somersd _t medsurv1 medsurv2 medsurv3 if _st==1 & testset==1, cenind(censind)
> tdist transf(c)
Somers´ D with variable: _t
Transformation: Harrell´s c
Valid observations: 25
Degrees of freedom: 24
Symmetric 95% CI for Harrell´s c

Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

medsurv1 .7911392 .0674598 11.73 0.000 .6519091 .9303694


medsurv2 .7257384 .049801 14.57 0.000 .6229542 .8285226
medsurv3 .5780591 .0972101 5.95 0.000 .3774274 .7786908

We then compare the Harrell’s C parameters for the alternative median survival
functions, using lincom, just as before:

. lincom medsurv1-medsurv2
( 1) medsurv1 - medsurv2 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .0654008 .0491405 1.33 0.196 -.0360202 .1668219

. lincom medsurv1-medsurv3
( 1) medsurv1 - medsurv3 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .2130802 .0763467 2.79 0.010 .0555084 .3706519

. lincom medsurv2-medsurv3
( 1) medsurv2 - medsurv3 = 0

_t Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .1476793 .1080388 1.37 0.184 -.0753017 .3706603

Unsurprisingly, the conclusions for the Gompertz model are essentially the same as
those for the Cox model.

5 Further extensions
The use of Harrell’s C and Somers’ D in test sets to compare the power of models
fit to training sets can be extended further to nonsurvival regression models. In this
case, life is even simpler because we do not have to define a censorship indicator such
as censind for input to somersd. The predictive score is still computed using out-of-
sample prediction and can be either the fitted regression value or the linear predictor
(if one exists in the model).
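
For instance, with an uncensored outcome y and predictors x1 and x2 (hypothetical variable names), and with a testset indicator defined as in section 3, the pattern might look like the following sketch, in which the fitted values of two competing linear regressions are compared in the test set:

. regress y x1 x2 if testset==0
. predict yhat1
. regress y x1 if testset==0
. predict yhat2
. somersd y yhat1 yhat2 if testset==1, tdist transf(c)
. lincom yhat1-yhat2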
The methods presented so far have the limitation that the Harrell’s C and Somers’ D
parameters that we calculated estimate only the ordinal predictive power (in the pop-
ulation from which the training and test sets were sampled) of the precise model that
we fit to the training set. We might prefer to estimate the mean predictive power that
we can expect (in the whole universe of possible training and test sets) using the same
set of alternative models. Bootstrap-like methods for doing this, involving repeated
splitting of the same sample into training and test sets, are described in Harrell et al.
(1982) and Harrell, Lee, and Mark (1996).
Another limitation of the methods presented here, as mentioned at the end of sec-
tion 2, is that they should not usually be used with models with time-dependent co-
variates. This is because the predicted variable input to somersd, which the alternative
predictive scores are competing to predict, is the length of a lifetime rather than an
event of survival or nonsurvival through a minimal time interval, such as a day. A
predictor variable for such a lifetime must therefore stay constant, at least through that
lifetime, which rules out functions of continuously varying time-dependent covariates.
In Stata, survival-time datasets may have multiple observations for each subject
with a lifetime, representing multiple sublifetimes. Discretely varying time-dependent
covariates, which remain constant through a sublifetime, can also be included in such
datasets. somersd can therefore be used when these conditions are met: the model
is a Cox regression, the time-dependent covariates vary only discretely, the multiple
sublifetimes are the times spent by a subject in an age group, and each subject becomes
at risk at the start of each age group to which she or he survives. If the subject
identifier variable is named subid, and the age group for each sublifetime is represented
by a discrete variable agegp, then the user may use somersd with cluster(subid)
funtype(bcluster) wstrata(agegp) to calculate Somers’ D or Harrell’s C estimates
restricted to comparisons between sublifetimes of different subjects in the same age
group. See Newson (2006b) for details of the options for somersd, and see [ST] stset
for details on survival-time datasets.
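
Under those conditions, and assuming a multiple-record survival-time dataset in which invhr and censind have been defined as before, such a call might look like the following sketch (subid and agegp are the variable names assumed in the previous paragraph):

. somersd _t invhr if _st==1, cenind(censind) cluster(subid) funtype(bcluster) wstrata(agegp) tdist transf(c)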
If the user has access to sufficient data-storage space, then the age groups can be
defined finely (as subject-years or even subject-days), and the discretely time-dependent
covariates might therefore be very nearly continuously time-dependent. Any training
sets or test sets in this case should, of course, be sets of subjects rather than sets of
lifetimes.

6 Acknowledgments
I would like to thank Samia Mora, MD, of Partners HealthCare, for sending me the
query that prompted me to write this article. I also thank the many other Stata users
who have contacted me over the past few years with essentially similar queries on
how to use somersd to compare the predictive powers of survival models.

7 References
Daniels, H. E., and M. G. Kendall. 1947. The significance of rank correlation where
parental correlation exists. Biometrika 34: 197–208.

Edwardes, M. D. 1995. A confidence interval for Pr(X < Y ) − Pr(X > Y ) estimated
from simple cluster samples. Biometrics 51: 571–578.

Fisher, R. A. 1935. The logic of inductive inference. Journal of the Royal Statistical
Society 98: 39–82.

Harrell, F. E., Jr., R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. 1982. Evaluating the yield of medical tests. Journal of the American Medical Association 247: 2543–2546.

Harrell, F. E., Jr., K. L. Lee, and D. B. Mark. 1996. Multivariable prognostic models:
Issues in developing models, evaluating assumptions and adequacy, and measuring
and reducing errors. Statistics in Medicine 15: 361–387.

Newson, R. 2002. Parameters behind “nonparametric” statistics: Kendall’s tau, Somers’ D and median differences. Stata Journal 2: 45–64.

———. 2006a. Confidence intervals for rank statistics: Percentile slopes, differences,
and ratios. Stata Journal 6: 497–520.

———. 2006b. Confidence intervals for rank statistics: Somers’ D and extensions. Stata Journal 6: 309–334.

———. 2006c. Efficient calculation of jackknife confidence intervals for rank statistics. Journal of Statistical Software 15: 1–10.

Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.

About the author


Roger Newson is a lecturer in medical statistics at Imperial College London, London, UK,
working principally in asthma research. He wrote the somersd and parmest Stata packages.
The Stata Journal (2010)
10, Number 3, pp. 359–368

Using Stata with PHASE and Haploview: Commands for importing and exporting data
J. Charles Huber Jr.
Department of Epidemiology and Biostatistics
Texas A&M Health Science Center School of Rural Public Health
College Station, TX
jchuber@srph.tamhsc.edu

Abstract. Modern genetics studies require the use of many specialty software
programs for various aspects of the statistical analysis. PHASE is a program often
used to reconstruct haplotypes from genotype data, and Haploview is a program
often used to visualize and analyze single nucleotide polymorphism data. Three
new commands are described that perform three data-interchange steps: 1) exporting genotype data stored in Stata to PHASE, 2) importing the resulting inferred haplotypes back into Stata, and 3) exporting the haplotype/single nucleotide polymorphism data from Stata to Haploview.
Keywords: st0199, phaseout, phasein, haploviewout, genetics, haplotypes, SNPs,
PHASE, haploview

1 Introduction
For a variety of reasons, including favorable power for detecting small effects and
the low cost of genotyping, association studies based on single nucleotide polymor-
phism (SNP, pronounced “snip”) markers have become common in genetic epidemiology
(Cordell and Clayton 2005). SNP markers are positions along a chromosome that can
have four forms called alleles: adenine, cytosine, guanine, and thymine, which are de-
noted A, C, G, and T, respectively. Humans are diploid organisms, meaning that we
have two copies of each of our chromosomes; thus each SNP is composed of a pair of
alleles called a genotype.
For example, a SNP might have an adenine (A) molecule on one chromosome paired
with a cytosine (C) molecule on the other chromosome. This is often described as an
A/C genotype. When two SNP markers are physically close to one another, a pair of
alleles found on the same chromosome forms a haplotype. For example, a person might
have an A/C genotype for SNP1 and a G/T genotype for SNP2. If the A allele from SNP1
and the G allele from SNP2 are physically located on the same chromosome, they are
said to form an AG haplotype. Similarly, the C allele from SNP1 and the T allele from
SNP2 would form a CT haplotype.


© 2010 StataCorp LP   st0199

It has been shown that association studies based on haplotypes are often more pow-
erful than similar studies based on individual SNPs (Akey, Jin, and Xiong 2001). Unfor-
tunately, haplotypes are not observed directly using typical low-cost, high-throughput
laboratory techniques. However, haplotypes can be inferred statistically based on the
observed genotypes.
David G. Clayton of the University of Cambridge has written a useful command for
Stata (snp2hap) that infers haplotypes for pairs of SNPs. In theory, this program could
be used iteratively to infer haplotypes across many SNPs. However, several sophisticated
algorithms have been developed for statistically inferring haplotypes from many SNP
genotypes simultaneously. These algorithms and the software that implement them have
been reviewed and compared elsewhere (Marchini et al. 2006; Stephens and Donnelly
2003; The International HapMap Consortium 2005). In most comparisons, the algo-
rithm used in the PHASE program (Stephens, Smith, and Donnelly 2001) was found to
be the most accurate and is arguably the most frequently used.
Creating a Stata command to implement the algorithm used in PHASE would be a daunting task. Instead, a Stata command (phaseout) was developed for
exporting genotype data stored in Stata to an ASCII file formatted as a PHASE input
file. A second program (phasein) was developed to import the inferred haplotype data
back into Stata for subsequent association analyses with programs such as haplologit
(Marchenko et al. 2008). These commands use a group of Stata’s low-level file com-
mands including file open, file write, file read, and file close.
Once the haplotypes have been inferred for a set of genotypes, one would often like
to know certain attributes of the haplotypes. For example, the alleles of some pairs of
SNPs along a haplotype may tend to be transmitted together from parent to offspring
more frequently than alleles of other pairs of SNPs. This phenomenon, known as linkage
disequilibrium (Devlin and Risch 1995), is often quantified by the r² or D′ statistics. Similarly, some contiguous groups of SNPs, often called haplotype blocks, may exhibit high levels of pairwise linkage disequilibrium (Gabriel et al. 2002; Goldstein 2001). High levels of linkage disequilibrium between two SNPs indicate that much of their statistical information is redundant, so it is not necessary to include both SNPs in association analyses. One
of the SNPs, called a tagSNP (Zhang et al. 2004), can be selected using one of several
algorithms. A tagSNP can be used in place of the group of redundant SNPs. Typically,
there are several tagSNPs in a group of contiguous SNPs found on a chromosome.
Haploview (Barrett et al. 2005) is a popular software package used for calculating
and visualizing the linkage disequilibrium statistics r² and D′, as well as for identifying
haplotype blocks and tagSNPs. The new Stata haploviewout program exports haplo-
type data from Stata to a pair of ASCII files formatted as Haploview input files: a haps
format data file and a haps format locus information file.
The dataset used for the following examples was downloaded from the SeattleSNPs
website (SeattleSNPs 2009) and was modified to include missing data. Genotypes for
47 individuals of African and European descent include 22 SNPs from the vascular
endothelial growth factor (VEGF) gene located on chromosome six.

2 The phaseout command


Genotype data stored in Stata are often formatted in a way that is similar to the
following example. In this example, the variable id contains individual identification
numbers, and the variables rs1413711, rs3024987, and rs3024989 contain data on
three SNPs. The genotype X/X indicates that the genotype is missing. The following
example uses fictitious data:

. list id rs1413711 rs3024987 rs3024989 in 1/2

id rs1413711 rs3024987 rs3024989

1. D001 C/C C/T T/T


2. D002 C/T X/X T/T

The input file for PHASE requires the data to be formatted in an ASCII file that
contains header information about the number of samples and the number and types of
markers (SNP or multiallelic), as well as the actual data:

47 (There are 47 samples in the entire file.)


3 (There are three markers in the file.)
P 674 836 1955 (Positions are listed.)
SSS (All three markers are biallelic SNPs.)
D001 (The data begin with the first ID.)
C C T (The genotype data are stored in two rows.)
C T T (These are not haplotypes yet.)
D002 (The data begin with the second ID.)
C ? T (The missing SNP data are
T ? T stored as question marks.)

The phaseout command calculates the header information, converts the ID and
genotype data to rows, and writes this data to the ASCII file. The types of markers—
SNPs or multiallelic markers—are automatically determined by tabulating the genotypes
and by examining the length of the genotype in the first record. If a marker has three
or fewer genotypes (for example, C/C, C/T, T/T) and the length of the genotype in the
first record is fewer than five alleles, the marker is treated as a SNP. All other markers
are treated as multiallelic.

2.1 Syntax

phaseout SNPlist, idvariable(varname) filename(filename) [missing(string) separator(string) positions(string)]

SNPlist is a list of variables containing SNP genotypes.



2.2 Options
idvariable(varname) is required to specify the variable that contains the individual
identifiers.
filename(filename) is required to name the ASCII file that will be created. It is con-
ventional, though not necessary, to name PHASE input files with the extension .inp.
missing(string) may be used to provide a list of genotypes that indicate missing data.
For example, missing data might be included in the dataset as X/X for SNPs and
as 999/999 for multiallelic markers. Multiple missing values may be specified by
placing a space between them (for example, missing("X/X 9/9 999/999")). PHASE
requires missing SNP alleles to be coded as “?” and missing multiallelic alleles to
be coded as “−1”. It is not necessary to preprocess your data because phaseout
will automatically convert each genotype contained in the missing() list to its
appropriate PHASE missing value.
separator(string) specifies the separator to use when storing genotype data. Genotype
data are often stored with a separator between the two alleles. For data stored in
the format C/G, the separator() option would look like separator("/"). If SNP
data are stored without a separator (for example, CG) then the separator() option
is unnecessary, and phaseout will assume that the left character is allele 1 and the
right character is allele 2.
positions(string) provides a list of the marker positions for use by PHASE when infer-
ring haplotypes from the genotype data. If the positions() option is not specified,
PHASE will assume that the markers are equally spaced.

2.3 Output files


phaseout saves two ASCII files for subsequent use by the commands phasein and
haploviewout:

• MarkerList.txt contains a space-delimited list of marker names.

• PositionList.txt contains a space-delimited list of marker positions.

2.4 Examples
Markers and positions may be specified in the command itself:

. phaseout rs1413711 rs3024987 rs3024989, idvariable("id") filename("VEGF.inp")


> missing("X/X 9/9") positions("674 836 1955") separator("/")

phaseout may use markers and positions saved in local macros:

. local SNPList "rs1413711 rs3024987 rs3024989 rs833068 rs3024990"


. local PositionsList "674 836 1955 2523 3031"
. phaseout `SNPList´, idvariable("id") filename("VEGF.inp") missing("X/X 9/9")
> positions(`PositionsList´) separator("/")

3 The phasein command


PHASE saves the inferred haplotypes for each pair of chromosomes in a file with the
extension .out, and because there is a great deal of other information saved in the file,
the phasein command uses the keywords BEGIN BESTPAIRS1 and END BESTPAIRS1 to
identify the part of the file that contains the haplotypes:

BEGIN BESTPAIRS1
0 D001
C C T
C T T
......
......
0 E023
C C T
C C T
END BESTPAIRS1.

The data are imported into Stata in “long” format with one row per chromosome
(two rows per ID). The haplotypes are imported into a variable named haplotype, and
each of the markers that make up the haplotype are saved in an individual variable.
If the markers() option is specified, the marker variables will be renamed using their
original names.
. list id haplotype rs1413711 rs3024987 rs3024989 in 1/2

id haplotype rs1413711 rs3024987 rs3024989

1. D001 CCT C C T
2. D001 CTT C T T

If the positions() option is used, the positions will be placed in the variable label
of each marker variable:




. describe
Contains data from VEGF_Haplotypes.dta
obs: 94
vars: 5 8 Jul 2010 13:09
size: 1,692 (99.9% of memory free)

storage display value


variable name type format label variable label

id str4 %9s
haplotype str3 %9s
rs1413711 str1 %9s position=674
rs3024987 str1 %9s position=836
rs3024989 str1 %9s position=1955

Sorted by:

3.1 Syntax

phasein PhaseOutputFile [, markers(filename) positions(filename)]

PhaseOutputFile is the name of the PHASE output file that contains the inferred haplo-
types. It will have the file extension .out.

3.2 Options
markers(filename) allows the user to specify an ASCII file that contains the names of
the markers included in the haplotype. If the original genotype data were exported
to PHASE using the phaseout command, the marker names will be automatically
saved to a file named MarkerList.txt. If that is the case, then the option would
be markers("MarkerList.txt"). Alternatively, the user can save a space-delimited
list of marker names in an ASCII file and use the markers("filename.txt") option.
positions(filename) allows the user to specify an ASCII file that contains the posi-
tions of the markers. If the original genotype data were exported to PHASE us-
ing the phaseout command, the marker positions will be automatically saved to
a file named PositionList.txt. If that is the case, then the option would be
positions("PositionList.txt"). Alternatively, the user can save a space-delimit-
ed list of marker positions in an ASCII file and use the positions("filename.txt")
option.

3.3 Examples
Using the default files created by phaseout:

. phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")



Using the files created by the user:

. phasein VEGF.out, markers("UserMarkerList.txt") positions("UserPositionList.txt")

4 The haploviewout command


The haploviewout command exports haplotype data from Stata to a pair of files. The
file filename_DataInput.txt contains the marker data for each individual, with the
alleles recoded as follows: missing = 0, A = 1, C = 2, G = 3, and T = 4.

D001 D001 2 2 4
D001 D001 2 4 4
D002 D002 2 2 4
D002 D002 4 2 4

The file filename_MarkerInput.txt contains the marker names and positions in
two columns:

rs1413711 674
rs3024987 836
rs3024989 1955

4.1 Syntax

haploviewout SNPlist, idvariable(varname) filename(filename) [positions(string) familyid(variable) poslabel]

SNPlist is a list of SNP variables in long format (that is, one row per chromosome).
If your data are in wide format, you can convert them to long format by using the
reshape command.
Haploview will not accept multiallelic markers.
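
Regarding the wide-to-long conversion mentioned above: if, for example, the two alleles of each SNP were held in wide format in variables such as rs1413711_1 and rs1413711_2 (hypothetical variable names, one row per individual, with the suffix indicating the chromosome), a reshape along the following lines would produce the required one row per chromosome:

. reshape long rs1413711_ rs3024987_ rs3024989_, i(id) j(chromosome)
. rename rs1413711_ rs1413711
. rename rs3024987_ rs3024987
. rename rs3024989_ rs3024989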

4.2 Options
idvariable(varname) is required to specify the variable that contains the individual
identifiers.
filename(filename) is required to name the two ASCII files that will be created. Those
files will have the extensions _DataInput.txt and _MarkerInput.txt appended to filename. For example, the filename("VEGF") option will create a file named VEGF_DataInput.txt and a file named VEGF_MarkerInput.txt. To open the files in Haploview, select File > Open new data and click on the tab labeled Haps Format. Click on the Browse button next to the box labeled Data File and select the file VEGF_DataInput.txt. Next click on the Browse button next to the box labeled Locus Information File and select the file VEGF_MarkerInput.txt.

positions(string) allows the user to specify a space-delimited list of the marker posi-
tions.
familyid(variable) allows the user to specify the variable that contains family identi-
fiers if relatives are included in the dataset. If familyid() is omitted, the
idvariable() will be automatically substituted for the familyid().
poslabel will automatically extract the SNP positions from the variable label of each
SNP if the haplotype data were created using the commands phaseout and phasein.
The positions for each marker are stored in the variable label of each SNP.

4.3 Examples
Using the default files created by phaseout:

. phaseout rs1413711 rs3024987 rs3024989, idvariable("id") filename("VEGF.inp")


> missing("X/X 9/9") positions("674 836 1955") separator("/")
. phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")
. haploviewout rs1413711 rs3024987 rs3024989, idvariable(id) filename("VEGF")
> poslabel

Using the files created by the user:

. haploviewout rs1413711 rs3024987 rs3024989, idvariable(id) filename("VEGF")


> positions("674 836 1955")

5 Discussion
Many young and rapidly evolving fields of inquiry, including genetic association studies,
use a variety of boutique software packages. While it would be very convenient to have Stata commands that accomplish the same tasks, the time and programming expertise required make this impractical. However, a suite of commands that allows easy exporting and importing of data between Stata and other specialized software seems to be an efficient way for Stata users to accomplish specialized analytical tasks.

6 Acknowledgments
This work was supported in part by grant 1 R01 DK073618-02 from the National In-
stitute of Diabetes and Digestive and Kidney Diseases and by grant 2006-35205-16715
from the United States Department of Agriculture. The author would like to thank
Drs. Loren Skow, Krista Fritz, and Candice Brinkmeyer-Langford of the Texas A&M
College of Veterinary Medicine and Roger Newson of Imperial College London for
their very useful feedback.


About the author


Chuck Huber is an assistant professor of biostatistics at the Texas A&M Health Science Center
School of Rural Public Health in the Department of Epidemiology and Biostatistics. He works
on projects in a variety of topical areas, but his primary area of interest is statistical genetics.
The Stata Journal (2010)
10, Number 3, pp. 369–385

simsum: Analyses of simulation studies including Monte Carlo error
Ian R. White
MRC Biostatistics Unit
Institute of Public Health
Cambridge, UK
ian.white@mrc-bsu.cam.ac.uk

Abstract. A new Stata command, simsum, analyzes data from simulation studies.
The data may comprise point estimates and standard errors from several analysis
methods, possibly resulting from several different simulation settings. simsum can
report bias, coverage, power, empirical standard error, relative precision, average
model-based standard error, and the relative error of the standard error. Monte
Carlo errors are available for all of these estimated quantities.
Keywords: st0200, simsum, simulation, Monte Carlo error, normal approximation,
sandwich variance

1 Introduction
Simulation studies are an important tool for statistical research (Burton et al. 2006), but
they are often poorly reported. In particular, to understand the role of chance in results
of simulation studies, it is important to estimate the Monte Carlo (MC) error, defined
as the standard deviation of an estimated quantity over repeated simulation studies.
However, this error is often not reported: Koehler, Brown, and Haneuse (2009) found
that of 323 articles reporting the results of a simulation study in Biometrics, Biometrika,
and the Journal of the American Statistical Association in 2007, only 8 articles reported
the MC error.
This article describes a new Stata command, simsum, that facilitates analyses of
simulated data. simsum analyzes simulation studies in which each simulated dataset
yields point estimates by one or more analysis methods. Bias, empirical standard error
(SE), and precision relative to a reference method can be computed for each method. If,
in addition, model-based SEs are available, then simsum can compute the average model-
based SE, the relative error in the model-based SE, the coverage of nominal confidence
intervals, and the power to reject a null hypothesis. MC errors are available for all
estimated quantities.



2 The simsum command


2.1 Syntax
simsum accepts data in wide or long format.
In wide format, data contain one record per simulated dataset, with results from
multiple analysis methods stored as different variables. The appropriate syntax is


simsum estvarlist [if] [in] , true(expression) options

where estvarlist is a varlist containing point estimates from one or more analysis meth-
ods.
In long format, data contain one record per analysis method per simulated dataset,
and the appropriate syntax is



simsum estvarname [if] [in] , true(expression) methodvar(varname)
     id(varlist) options

where estvarname is a variable containing the point estimates, methodvar(varname)


identifies the method, and id(varlist) identifies the simulated dataset.
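
For concreteness, two minimal call sketches are given below; the variable names (b1, b2,
se1, se2 in wide format; b, se, method, dataset in long format) are hypothetical, and the
true parameter value is taken to be 0 purely for illustration.

. * wide format: one record per simulated dataset, two analysis methods
. simsum b1 b2, true(0) se(se1 se2)
. * long format: one record per analysis method per simulated dataset
. simsum b, true(0) se(se) methodvar(method) id(dataset)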

2.2 Options
Main options

true(expression) gives the true value of the parameter. This option is required for
calculations of bias and coverage.
methodvar(varname) specifies that the data are in long format and that each record
represents one analysis of one simulated dataset using the method identified by
varname. The id() option is required with methodvar(). If methodvar() is not
specified, the data must be in wide format, and each record represents all analyses
of one simulated dataset.
id(varlist) uniquely identifies the dataset used for each record, within levels of any
by-variables. This is a required option in the long format. The methodvar() option
is required with id().
se(varlist) lists the names of the variables containing the SEs of the point estimates.
For data in long format, this is a single variable.
seprefix(string) specifies that the names of the variables containing the SEs of the
point estimates be formed by adding the given prefix to the names of the variables
containing the point estimates. seprefix() may be combined with sesuffix(string)
but not with se(varlist).

sesuffix(string) specifies that the names of the variables containing the SEs of the
point estimates be formed by adding the given suffix to the names of the variables
containing the point estimates. sesuffix() may be combined with seprefix(string)
but not with se(varlist).
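
For example, if the point estimates are held in b1 and b2 and their SEs in b1_se and
b2_se (hypothetical names), the following two calls are equivalent sketches:

. simsum b1 b2, true(0) se(b1_se b2_se)
. simsum b1 b2, true(0) sesuffix(_se)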

Data-checking options

graph requests a descriptive graph of SEs against point estimates.


nomemcheck turns off checking that adequate memory is free. This check aims to avoid
spending calculation time when simsum is likely to fail because of lack of memory.
max(#) specifies the maximum acceptable absolute value of the point estimates, stan-
dardized to mean 0 and standard deviation 1. The default is max(10).
semax(#) specifies the maximum acceptable value of the SE as a multiple of the mean
SE. The default is semax(100).

dropbig specifies that point estimates or SEs beyond the maximum acceptable values
be dropped; otherwise, the command halts with an error. Missing values are always
dropped.
nolistbig suppresses listing of point estimates and SEs that lie outside the acceptable
limits.
listmiss lists observations with missing point estimates or SEs.

Calculation options

level(#) specifies the confidence level for coverages and powers. The default is
level(95) or as set by set level; see [R] level.
by(varlist) summarizes the results by varlist.
mcse reports MC errors for all summaries.
robust requests robust MC errors (see section 4) for the statistics empse, relprec, and
relerror. The default is MC errors based on an assumption of normally distributed
point estimates. robust is only useful if mcse is also specified.
modelsemethod(rmse | mean) specifies whether the model SE should be summarized as
the root mean squared value (modelsemethod(rmse), the default) or as the arith-
metic mean (modelsemethod(mean)).
ref(string) specifies the reference method against which relative precisions will be cal-
culated. With data in wide format, string must be a variable name. With data in
long format, string must be a value of the method variable; if the value is labeled,
the label must be used.

Options specifying degrees of freedom

The number of degrees of freedom is used in calculating coverages and powers.


df(string) specifies the degrees of freedom. It may contain a number (to apply to all
methods), a variable name, or a list of variables containing the degrees of freedom
for each method.
dfprefix(string) specifies that the names of the variables containing the degrees of
freedom be formed by adding the given prefix to the names of the variables containing
the point estimates. dfprefix() may be combined with dfsuffix(string) but not
with df(string).
dfsuffix(string) specifies that the names of the variables containing the degrees of
freedom be formed by adding the given suffix to the names of the variables containing
the point estimates. dfsuffix() may be combined with dfprefix(string) but not
with df(string).

Statistic options

If none of the following options are specified, then all available statistics are computed.
bsims reports the number of simulations with nonmissing point estimates.
sesims reports the number of simulations with nonmissing SEs.
bias estimates the bias in the point estimates.
empse estimates the empirical SE, defined as the standard deviation of the point esti-
mates.
relprec estimates the relative precision, defined as the inverse squared ratio of the
empirical SE of this method to the empirical SE of the reference method. This
calculation is slow; omitting it can reduce run time by up to 90%.
modelse estimates the model-based SE. See modelsemethod() above.
relerror estimates the proportional error in the model-based SE, using the empirical
SE as the gold standard.

cover estimates the coverage of nominal confidence intervals at the specified level.
power estimates at the specified level the power to reject the null hypothesis that the
true parameter is zero.

Output options

clear loads the summary data into memory.


saving(filename) saves the summary data into filename.

nolist suppresses listing of the results and is allowed only when clear or saving() is
specified.
listsep lists results using one table per statistic, giving output that is narrower and
better formatted. The default is to list the results as a single table.
format(string) specifies the format for printing results and saving summary data. If
listsep is also specified, then up to three formats may be specified: 1) for results
on the scale of the original estimates (bias, empse, and modelse), 2) for percentages
(relprec, relerror, cover, and power), and 3) for integers (bsims and sesims).
The default is the existing format of the (first) estimate variable for 1 and 2 and
%7.0f for 3.
sepby(varlist) invokes this list option when printing results.
abbreviate(#) invokes this list option when printing results.
gen(string) specifies the prefix for new variables identifying the different statistics in
the output dataset. gen() is only useful with clear or saving(). The default is
gen(stat) so that the new identifiers are, for example, statnum and statcode.

3 Example
This example is based on, but distinct from, a simulation study comparing different
ways to handle missing covariates when fitting a Cox model (White and Royston 2009).
One thousand datasets were simulated, each containing normally distributed covariates
x and z and a time-to-event outcome. Both covariates had 20% of their values deleted
independently of all other variables so the data became missing completely at random
(Little and Rubin 2002). Each simulated dataset was analyzed in three ways. A Cox
model was fit to the complete cases (CC). Then two methods of multiple imputation using
chained equations (van Buuren, Boshuizen, and Knook 1999), implemented in Stata as
ice (Royston 2004, 2009), were used. The MI LOGT method multiply imputes the missing
values of x and z with the outcome included as log(t) and d, where t is the survival time
and d is the event indicator. The MI T method is the same except that log(t) is replaced
by t in the imputation model. The results are stored in long format, with variable
dataset identifying the simulated dataset number, string variable method identifying
the method used, variable b holding the point estimate, and variable se holding the SE.
The data start like this:


dataset method b se

1. 1 CC .7067682 .14651
2. 1 MI_T .6841882 .1255043
3. 1 MI_LOGT .7124795 .1410814

4. 2 CC .3485008 .1599879
5. 2 MI_T .4060082 .1409831
6. 2 MI_LOGT .4287003 .1358589

7. 3 CC .6495075 .1521568
8. 3 MI_T .5028701 .130078
9. 3 MI_LOGT .5604051 .1168512

They are then summarized thus:


. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     dataset |      3000       500.5    288.7231          1       1000
      method |         0
           b |      3000    .5054995    .1396257  -.1483829   1.004529
          se |      3000    .1375334    .0183683   .0907097   .2281933

simsum produces the following output:


. simsum b, se(se) methodvar(method) id(dataset) true(0.5) mcse
> format(%6.3f %6.1f %6.0f) listsep
Reshaping data to wide format ...
Starting to process results ...
Non-missing point estimates

CC MI_LOGT MI_T

1000 1000 1000

Non-missing standard errors

CC MI_LOGT MI_T

1000 1000 1000

Bias in point estimate

CC (MCse) MI_LOGT (MCse) MI_T (MCse)

0.017 0.005 0.001 0.004 -0.001 0.004

Empirical standard error

CC (MCse) MI_LOGT (MCse) MI_T (MCse)

0.151 0.003 0.132 0.003 0.134 0.003



% gain in precision relative to method CC

CC (MCse) MI_LOGT (MCse) MI_T (MCse)

. . 31.0 3.9 26.4 3.8

RMS model-based standard error

CC (MCse) MI_LOGT (MCse) MI_T (MCse)

0.147 0.001 0.135 0.001 0.134 0.001

Relative % error in standard error

CC (MCse) MI_LOGT (MCse) MI_T (MCse)

-2.7 2.2 2.2 2.3 -0.4 2.3

Coverage of nominal 95% confidence interval

CC (MCse) MI_LOGT (MCse) MI_T (MCse)

94.3 0.7 94.9 0.7 94.3 0.7

Power of 5% level test

CC (MCse) MI_LOGT (MCse) MI_T (MCse)

94.6 0.7 96.9 0.5 96.3 0.6

Some points of interest include the following:

• Table 3: CC has small-sample bias away from the null.

• Tables 4 and 5: CC is inefficient compared with MI LOGT and MI T.

• Comparing tables 4 and 6 shows that model-based SEs are close to the empirical
values. This is shown more directly in table 7.

• Table 8: Coverage of nominal 95% confidence intervals also seems fine, which is
not surprising in view of the lack of bias and good model-based SEs.

• Table 9: CC lacks power compared with MI LOGT and MI T, which is not surprising
in view of its inefficiency.

If different formatting of the results is required, the results can be loaded into mem-
ory using the clear option and can then be manipulated.
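
For instance, a hedged sketch of reloading and relisting the summary data, using only
options documented in section 2.2:

. simsum b, se(se) methodvar(method) id(dataset) true(0.5) nolist clear
. * the summary statistics are now an ordinary dataset in memory
. list, noobs clean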

4 Formulas
Assume that the true parameter is β and that the ith simulated dataset (i = 1, . . . , n)
yields a point estimate βi with SE si. Define
$$\bar\beta = \frac{1}{n}\sum_i \hat\beta_i \qquad\qquad V_{\hat\beta} = \frac{1}{n-1}\sum_i \big(\hat\beta_i - \bar\beta\big)^2$$
$$\overline{s^2} = \frac{1}{n}\sum_i s_i^2 \qquad\qquad V_{s^2} = \frac{1}{n-1}\sum_i \big(s_i^2 - \overline{s^2}\big)^2$$

Performance of β: bias and empse

Bias is defined as E(βi) − β and is estimated by
$$\text{estimated bias} = \bar\beta - \beta$$
$$\text{MC error} = \sqrt{V_{\hat\beta}/n} \qquad (1)$$
Precision is measured by the empirical standard deviation SD(βi) and is estimated by
$$\text{empirical standard deviation} = \sqrt{V_{\hat\beta}}$$
$$\text{MC error} = \sqrt{V_{\hat\beta}\big/\big\{2(n-1)\big\}}$$
assuming $\hat\beta$ is normally distributed, as then $(n-1)V_{\hat\beta}/\mathrm{var}(\hat\beta) \sim \chi^2_{n-1}$.
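
As a check, the bias and its MC error (1) can be reproduced by hand from the example
data of section 3 (a sketch; variable and method names as listed there, true value 0.5):

. quietly summarize b if method=="CC"
. * r(sd) estimates the empirical SE, so r(sd)/sqrt(r(N)) is the MC error of the bias
. display "bias = " r(mean)-0.5 "   MC error = " r(sd)/sqrt(r(N))

which should reproduce the 0.017 and 0.005 reported for CC in the output above.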

Estimation method comparison: relprec

In a small change of notation, consider two estimators β1 and β2 with values β1i and
β2i in the ith simulated dataset. The relative gain in precision for β2 compared with β1
is
$$\text{relative gain in precision} = V_{\hat\beta_1}\big/V_{\hat\beta_2}$$
$$\text{MC error} \approx \frac{2V_{\hat\beta_1}}{V_{\hat\beta_2}}\sqrt{\frac{1-\rho_{12}^2}{n-1}}$$
where ρ12 is the correlation of β1 with β2.

The MC error expression can be proved by observing the following:
1) $\mathrm{var}\{\log V_{\hat\beta_1}\} = \mathrm{var}\{\log V_{\hat\beta_2}\} = 2/(n-1)$;
2) $\mathrm{var}\{\log(V_{\hat\beta_1}/V_{\hat\beta_2})\} = 4(1-\rho_V)/(n-1)$, where $\rho_V = \mathrm{corr}(V_{\hat\beta_1}, V_{\hat\beta_2})$; and
3) $\rho_V = \rho_{12}^2$. Result 3 may be derived by observing that $V_{\hat\beta} \approx (1/n)\sum_i(\hat\beta_i - \bar\beta)^2$,
so that under a bivariate normal assumption for $(\hat\beta_1, \hat\beta_2)$,
$$\begin{aligned}
n\,\mathrm{cov}\big(V_{\hat\beta_1}, V_{\hat\beta_2}\big) &\approx \mathrm{cov}\Big\{\big(\hat\beta_1-\bar\beta_1\big)^2,\ \big(\hat\beta_2-\bar\beta_2\big)^2\Big\} \\
&= \mathrm{cov}\Big[\big(\hat\beta_1-\bar\beta_1\big)^2,\ E\Big\{\big(\hat\beta_2-\bar\beta_2\big)^2 \,\Big|\, \hat\beta_1\Big\}\Big] \\
&= \mathrm{cov}\Big\{\big(\hat\beta_1-\bar\beta_1\big)^2,\ \rho_{12}^2\big(V_{\hat\beta_2}/V_{\hat\beta_1}\big)\big(\hat\beta_1-\bar\beta_1\big)^2\Big\} \\
&= 2\rho_{12}^2 V_{\hat\beta_1} V_{\hat\beta_2}
\end{aligned}$$
where the third step follows because $(\hat\beta_2 - \bar\beta_2)\mid\hat\beta_1$ is normal with mean
$\rho_{12}\sqrt{V_{\hat\beta_2}/V_{\hat\beta_1}}\,(\hat\beta_1 - \bar\beta_1)$ and constant variance.

Performance of model-based SE si: modelse and relerror

The average model-based SE is (by default) computed on the variance scale, because
standard theory yields unbiased estimates of the variance, not of the standard deviation.
$$\text{average model-based SE} \quad \bar{s} = \sqrt{\overline{s^2}}$$
$$\text{MC error} \approx \sqrt{V_{s^2}\big/\big(4n\,\overline{s^2}\big)}$$
using the Taylor series approximation $\mathrm{var}(X) \approx \mathrm{var}(X^2)/\{4E(X)^2\}$.

We can now compute the relative error in the model-based SE as
$$\text{relative error} = \bar{s}\big/\sqrt{V_{\hat\beta}} - 1 \qquad (2)$$
$$\text{MC error} \approx \Big(\bar{s}\big/\sqrt{V_{\hat\beta}}\Big)\sqrt{\frac{V_{s^2}}{4n\bar{s}^4} + \frac{1}{2(n-1)}} \qquad (3)$$
assuming that $\bar{s}$ and $V_{\hat\beta}$ are approximately uncorrelated and using a further Taylor
approximation.
However, if the modelsemethod(mean) option is used, the formulas are
$$\text{average model-based SE} \quad \bar{s} = \frac{1}{n}\sum_i s_i$$
$$\text{MC error} = \sqrt{\frac{1}{n(n-1)}\sum_i \big(s_i - \bar{s}\big)^2}$$
with consequent adjustments to equations (2) and (3).
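
As a rough numerical check (not part of the original output; a sketch using the rounded
values reported in section 3), for method CC the RMS model-based SE is about 0.147 and
the empirical SE about 0.151, so the relative error from (2) is approximately
0.147/0.151 − 1 ≈ −0.026, in line with the −2.7% shown in the output.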

Joint performance of β and si: cover and power

Let zα/2 be the critical value from the normal distribution, or (if the number of degrees
of freedom has been specified) the critical value from the appropriate t distribution.
The coverage of a nominal 100(1 − α)% confidence interval is
$$\text{coverage} \quad C = \frac{1}{n}\sum_i 1\big(|\hat\beta_i - \beta| < z_{\alpha/2}\,s_i\big)$$
$$\text{MC error} = \sqrt{C(1-C)/n}$$
where 1(·) is the indicator function. The power of a significance test at the α level is
$$\text{power} \quad P = \frac{1}{n}\sum_i 1\big(|\hat\beta_i| \ge z_{\alpha/2}\,s_i\big)$$
$$\text{MC error} = \sqrt{P(1-P)/n}$$
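
The coverage formula can likewise be checked directly against the section 3 example
(a sketch; covered is a hypothetical working variable, and the true value is 0.5):

. generate byte covered = abs(b-0.5) < invnormal(0.975)*se if method=="CC"
. quietly summarize covered
. display "coverage = " 100*r(mean) "%   MC error = " 100*sqrt(r(mean)*(1-r(mean))/r(N)) "%"

which should be close to the 94.3 and 0.7 reported for CC above.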

Robust MC errors

Several of the MC errors presented above require a normality assumption. Alternative
approximations can be derived using an estimating-equations method. The empirical
standard deviation, $\sqrt{V_{\hat\beta}}$, can be written as the solution θ of the equation
$$\sum_i \left\{\frac{n\big(\hat\beta_i - \bar\beta\big)^2}{n-1} - \theta^2\right\} = 0$$

The relative precision of β2 compared with β1 can be written as the solution θ of
$$\sum_i \Big[\big(\hat\beta_{1i} - \bar\beta_1\big)^2 - (\theta+1)\big(\hat\beta_{2i} - \bar\beta_2\big)^2\Big] = 0$$
The relative error in the model-based SE can be written as the solution θ of
$$\sum_i \Big\{s_i^2 - (\theta+1)^2\big(\hat\beta_i - \bar\beta\big)^2\Big\} = 0$$
provided that the modelsemethod(rmse) method is used. (If modelsemethod(mean) is
specified, it is ignored in computing robust MC errors.) Ignoring the uncertainty in the
sample means $\bar\beta$, $\bar\beta_1$, and $\bar\beta_2$, each estimating equation is of the form
$$\sum_i \big\{T_i - f(\theta)B_i\big\} = 0$$
so the sandwich variance (White 1982) is given by
$$\widehat{\mathrm{var}}\big\{f(\hat\theta)\big\} \approx \left[\sum_i \big\{T_i - f(\hat\theta)B_i\big\}^2\right]\left(\sum_i B_i\right)^{-2}$$
and using the delta method,
$$\widehat{\mathrm{var}}(\hat\theta) \approx \frac{\widehat{\mathrm{var}}\big\{f(\hat\theta)\big\}}{\big\{f'(\hat\theta)\big\}^2}$$

Finally, as an attempt to allow for uncertainty in the sample means, we multiply the
sandwich variance by n/(n − 1). A rationale is that this agrees exactly with (1) if the
method is applied to the MC error of the bias. However, most simulation studies are
large enough that this correction is unimportant.
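
To make the recipe concrete, a minimal sketch computes the robust MC error for the
empirical SE of method CC in the section 3 example, following the estimating-equation
steps above (T and usq are hypothetical working variables; here Bi = 1 and f(θ) = θ²):

. quietly summarize b if method=="CC"
. local n = r(N)
. local mu = r(mean)
. local theta = r(sd)                      // empirical SE, the solution for theta
. generate double T = `n'*(b-`mu')^2/(`n'-1) if method=="CC"
. generate double usq = (T-`theta'^2)^2 if method=="CC"
. quietly summarize usq
. * sandwich variance of f(theta), delta method with f'(theta) = 2*theta,
. * and the n/(n-1) correction described above
. display "robust MC error = " sqrt((`n'/(`n'-1))*r(sum)/(`n'^2*(2*`theta')^2))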

5 Evaluations
Most of the formulas used by simsum to compute MC errors involve approximations, so
I evaluated them in two simulation studies.

5.1 Multiple imputation, revisited


First, I repeated 250 times the simulation study described in section 3. The data have
the same format as before, with a new variable, simno, identifying the 250 different
simulation studies. I ran simsum twice. In the first run, each quantity and its MC error
was computed in each simulation study:
. simsum b, true(0.5) methodvar(method) id(dataset) se(se) mcse by(simno)
> bias empse relprec modelse relerror cover power nolist clear
Reshaping data to wide format ...
Starting to process results ...
Results are now in memory.

The data are now held in memory, with one record for each statistic for each of the
250 simulation studies. The statistics are identified by the values of a newly created
numerical variable statnum, and the different simulation studies are still identified by
simno. The variables bCC, bMI_LOGT, and bMI_T contain the analysis results for the three
methods. MC errors are held in variables suffixed with _mcse. In the second run, these values
are treated as ordinary output from a simulation study, and the average calculated MC
error is compared with the empirical MC error.

. simsum bCC bMI_LOGT bMI_T, sesuffix(_mcse) by(statnum) mcse gen(newstat)
> empse modelse relerror nolist clear
Warning: found 250 observations with missing values
Starting to process results ...
Results are now in memory.

The 250 observations with missing values refer to the relative precisions, which are
missing for the reference method (CC). Average calculated MC errors for each statistic are
compared in table 1 with empirical MC errors. The calculated MC errors are naturally
similar to those reported in the single simulation study above (some values have been
multiplied by 1,000 for convenience). Empirical MC errors are close to the model-based
values. The only exception is for coverage, where the model-based MC errors appear
rather small for methods CC and MI LOGT. This is likely to be a chance finding, because
there is no doubt about the accuracy of the model-based MC formula for this statistic.
Table 1. Simulation study comparing three ways to handle incomplete covariates in a
Cox model: Comparison of average calculated MC error (Calc) with empirical MC error
(Emp) for various statistics

                       CC method               MI_LOGT method          MI_T method
 Statistic(1)     Emp   Calc  % error(2)   Emp   Calc  % error(2)   Emp   Calc  % error(2)

 Bias × 1000      4.79  4.74  −1.1 (4.4)   4.23  4.17  −1.3 (4.4)   4.22  4.18  −1.0 (4.4)
 EmpSE × 1000     3.37  3.36  −0.3 (4.5)   3.11  2.95  −5.2 (4.3)   3.11  2.96  −4.9 (4.3)
 RelPrec          .     .     .            4.21  3.97  −5.7 (4.2)   4.13  3.97  −4.1 (4.3)
 ModSE × 1000     0.52  0.50  −3.1 (4.3)   0.59  0.59   0.3 (4.5)   0.60  0.59  −3.1 (4.4)
 RelErr           2.16  2.22   2.9 (4.6)   2.40  2.34  −2.6 (4.4)   2.43  2.33  −4.2 (4.3)
 Cover            0.62  0.70  13.5 (5.1)   0.61  0.68  11.3 (5.0)   0.67  0.68   1.8 (4.6)
 Power            0.74  0.73  −1.4 (4.4)   0.59  0.59  −0.4 (4.5)   0.60  0.59  −2.4 (4.4)

(1) Statistics are abbreviated as follows: Bias, bias in point estimate; EmpSE, empirical SE;
RelPrec, % gain in precision relative to method CC; ModSE, RMS model-based SE; RelErr,
relative % error in SE; Cover, coverage of nominal 95% confidence interval; Power, power of
5% level test.
(2) Relative % error in average calculated SE, with its MC error in parentheses.

5.2 Nonnormal joint distributions


In a second evaluation, I simulated 100,000 datasets of size n = 100 from the model
X ∼ N (0, 1), Y ∼ Bern(0.5). I then estimated the parameter β in the logistic regression
model
$$\mathrm{logit}\,P(Y = 1 \mid X) = \alpha + \beta X \qquad (4)$$
in two ways: 1) βLR was the maximum likelihood estimate from fitting the logistic
regression model (4), and 2) βLDA was the estimate from linear discriminant analysis
(LDA), fitting the linear regression model
$$X \mid Y \sim N\big(\gamma + \delta Y,\ \sigma^2\big)$$
and taking $\hat\beta_{\text{LDA}} = \hat\delta/\hat\sigma^2$.

The 100,000 datasets were divided into 100 simulation studies each of 1,000 simu-
lated datasets. The quantities described above and their SEs were calculated for each
simulation study, except that power for testing β = 0 was not computed because this
null hypothesis was true. Finally, the empirical MC error of each quantity across simula-
tion studies was compared with the average MC error estimated within each simulation
study.
Results are shown in table 2. The calculated MC error is adequate for all quantities
except for the relative precision of LDA compared with logistic regression, for which the
calculated SE is some three times too small. This appears to be due to the nonnormal
joint distribution of the parameter estimates shown in figure 1. The robust MC errors
perform well in all cases.

Table 2. Simulation study comparing LDA with logistic regression: Comparison of
empirical with average calculated MC errors for various statistics

                                                      MC error
                                           Empirical    Average calculated
 Quantity              Method      Mean                 Normal     Robust

 Bias × 1000           Logistic     0.41      6.79        6.71       .
                       LDA          0.41      6.66        6.57       .
 Empirical SE × 1000   Logistic   212.00      4.78        4.74       5.07
                       LDA        207.86      4.69        4.65       4.97
 % gain in precision   Logistic        .         .           .       .
                       LDA         4.027     0.124       0.048       0.131
 Model SE × 1000       Logistic   207.32      0.51        0.51       .
                       LDA        203.12      0.48        0.47       .
 % error in model SE   Logistic    −2.16      2.13        2.20       2.26
                       LDA         −2.23      2.18        2.20       2.30
 % coverage            Logistic    95.36      0.60        0.66       .
                       LDA         94.70      0.64        0.71       .
[Figure 1 here: scatterplot with y axis "Difference, LDA − LR" and x axis "Average of LDA and LR"]

Figure 1. Scatterplot of the difference βLDA − βLR against the average (βLDA + βLR)/2
in 2,000 simulated datasets

6 Discussion
I hope that simsum will help statisticians improve the reporting of their simulation
studies. In particular, I hope simsum will help them think about and report MC errors.
If MC errors are too large to enable the desired conclusions to be drawn, then it is
usually straightforward to increase the sample size, a luxury rarely available in applied
research.
For three statistics (empirical SE, and relative precision and relative error in model-
based SE), I have proposed two approximate MC error methods, one based on a normality
assumption and one based on a sandwich estimator. The MC error should only be taken
as a guide, so errors of some 10–20% in calculating the MC error are of little importance.
In most cases, both MC error methods performed adequately. However, the normality-
based MC error was about three times too small when evaluating the relative precision of
two estimators with a highly nonnormal joint distribution (figure 1). It is good practice
to examine the marginal and joint distributions of parameter estimates in simulation
studies, and this practice should be used to guide the choice of MC error method.
Other methods are available for estimating MC errors. Koehler, Brown, and Haneuse
(2009) proposed more computationally intensive techniques that are available for im-
plementation in R. Other software (Doornik and Hendry 2009) is available with an
econometric focus.

7 Acknowledgment
This work was supported by MRC grant U.1052.00.006.

8 References
Burton, A., D. G. Altman, P. Royston, and R. L. Holder. 2006. The design of simulation
studies in medical statistics. Statistics in Medicine 25: 4279–4292.
Doornik, J. A., and D. F. Hendry. 2009. Interactive Monte Carlo Experimentation in
Econometrics Using PcNaive 5. London: Timberlake Consultants Press.
Koehler, E., E. Brown, and S. J.-P. A. Haneuse. 2009. On the assessment of Monte Carlo
error in simulation-based statistical analyses. American Statistician 63: 155–162.
Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd
ed. Hoboken, NJ: Wiley.
Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.
———. 2009. Multiple imputation of missing values: Further update of ice, with an
emphasis on categorical variables. Stata Journal 9: 466–477.
van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing
blood pressure covariates in survival analysis. Statistics in Medicine 18: 681–694.

White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica
50: 1–25.

White, I. R., and P. Royston. 2009. Imputing missing covariate values for the Cox
model. Statistics in Medicine 28: 1982–1998.

About the author


Ian R. White is a program leader at the MRC Biostatistics Unit in Cambridge, United Kingdom.
His research interests focus on handling missing data, noncompliance, and measurement error
in the analysis of clinical trials, observational studies, and meta-analysis. He frequently uses
simulation studies.
The Stata Journal (2010)
10, Number 3, pp. 386–394

Projection of power and events in clinical trials with a time-to-event outcome
Patrick Royston
Hub for Trials Methodology Research
MRC Clinical Trials Unit and University College London
London, UK
pr@ctu.mrc.ac.uk

Friederike M.-S. Barthel


Oncology Research & Development
GlaxoSmithKline
Uxbridge, UK
FriederikeB@ctu.mrc.ac.uk

Abstract. In 2005, Barthel, Royston, and Babiker presented a menu-driven Stata


program under the generic name of ART (assessment of resources for trials) to
calculate sample size and power for complex clinical trial designs with a time-to-
event or binary outcome. In this article, we describe a Stata tool called ARTPEP,
which is intended to project the power and events of a trial with a time-to-event
outcome into the future given patient accrual figures so far and assumptions about
event rates and other defining parameters. ARTPEP has been designed to work
closely with the ART program and has an associated dialog box. We illustrate the
use of ARTPEP with data from a phase III trial in esophageal cancer.
Keywords: st0013_2, artpep, artbin, artsurv, artmenu, randomized controlled trial,
time-to-event outcome, power, number of events, projection, ARTPEP, ART

1 Introduction
Barthel, Royston, and Babiker (2005) presented a menu-driven Stata program under
the generic name of ART (assessment of resources for trials) to calculate sample size and
power for complex clinical trial designs with a time-to-event or binary outcome. Briefly,
the features of ART include multiarm trials, dose–response trends, arbitrary failure-
time distributions, nonproportional hazards, nonuniform rates of patient entry, loss to
follow-up, and possible changes from allocated treatment. A full report on the method-
ology and its performance—in particular, regarding loss to follow-up, nonproportional
hazards, and treatment crossover—is given by Barthel et al. (2006).
In this article, we concentrate on a new tool that addresses a practical issue in trials
with a time-to-event outcome. Because of staggered entry of patients and the gradual
maturing of the data, the accumulation of events from the date the trial opens is a
process that occurs over a relatively long period of time and with a variable course.
Trials are planned and their resources are assigned under certain critical assumptions.



If those assumptions are unrealistic, timely completion of the trial may be threatened.
Because the cumulative number of events is the key indicator of trial maturity and is
the parameter targeted in the sample-size calculation, it is of considerable interest and
relevance to monitor and project this number at particular points during the trial.
The new tool is called ARTPEP (ART projection of events and power). ARTPEP
comprises an ado-file (artpep) and an associated dialog box. It works in conjunction
with the ART system, of which the latest update is included with this article.

2 Example: A trial in advanced esophageal cancer


2.1 Sample-size calculation using ART
As an example, we describe sample-size calculation and ARTPEP analysis of a “typical”
cancer trial. The OE05 trial in advanced esophageal carcinoma is coordinated by the
MRC Clinical Trials Unit. The protocol is available online at http://www.ctu.mrc.ac.uk/
plugins/StudyDisplay/protocols/OE05%20Protocol%20Version%205%2031st%20July
%202008.pdf. The design, which comprises two randomized groups of patients with
equal allocation, aims to test the hypothesis that a new chemotherapy regimen, in
conjunction with surgery, improves overall survival at 3 years.
According to the protocol, the probability of 3-year survival in this patient group
is 30%, and the trial has 82% power at the 5% two-sided significance level to detect
an improvement in overall survival to 38%. The overall sample size is stated to be 842
patients, and the required number of events is 673. The plan is to recruit patients over
6 years and to follow up with them for a further 2 years before performing the definitive
analysis of the outcome (overall survival).
The description in the protocol provides nearly all the ingredients for an ART sample-
size and power calculation. The only missing item is the target hazard ratio, which is
ln (0.38) / ln (0.30) = 0.80 under proportional hazards of the treatment effect (a standard
assumption). We first use the artsurv command (Barthel, Royston, and Babiker 2005)
to verify the sample-size calculation and to set up some of the parameter values needed
by ARTPEP. We supply the other design features, and then we run the artsurv command
to compute the power and events:


. artsurv, method(l) nperiod(8) ngroups(2) edf0(0.3, 3) hratio(1, 0.80) n(842)
> alpha(0.05) recrt(6)
ART - ANALYSIS OF RESOURCES FOR TRIALS (version 1.0.7, 19 October 2009)

A sample size program by Abdel Babiker, Patrick Royston & Friederike Barthel,
MRC Clinical Trials Unit, London NW1 2DA, UK.

Type of trial Superiority - time-to-event outcome


Statistical test assumed Unweighted logrank test (local)
Number of groups 2
Allocation ratio Equal group sizes
Total number of periods 8
Length of each period One year
Survival probs per period (group 1) 0.669 0.448 0.300 0.201 0.134 0.090
0.060 0.040
Survival probs per period (group 2) 0.725 0.526 0.382 0.277 0.201 0.146
0.106 0.077
Number of recruitment periods 6
Number of follow-up periods 2
Method of accrual Uniform
Recruitment period-weights 1 1 1 1 1 1 0 0
Hazard ratios as entered (groups 1,2) 1, 0.80
Alpha 0.050 (two-sided)
Power (calculated) 0.824
Total sample size (designed) 842
Expected total number of events 673

Apart from small, unimportant differences, the protocol power (0.82) and the number
of events (673) are consistent with ART’s results.

2.2 Analysis with ARTPEP


To run ARTPEP successfully, three preliminary steps are required:

1. You must activate the ART and ARTPEP items on the User menu by typing the
command artmenu on.
2. You must compute the relevant sample size for the trial using either the ART
dialog box or the artsurv command. This automatically sets up a global macro
called $S_ARTPEP whose contents are used by the artpep command. (A slightly
more convenient alternative with the same result is to use the ART Settings...
button on the ARTPEP dialog box to set up the necessary quantities for ART
without having to run ART or artsurv separately.)
3. To set up additional parameters that ARTPEP needs, you must use the ARTPEP
dialog box, either by typing db artpep or by selecting User > ART > Artpep
from the menu.
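
Equivalently, the three steps above can be issued from the command line (a sketch
reusing the artsurv call from section 2.1):

. * step 1: activate the ART and ARTPEP menu items
. artmenu on
. * step 2: run ART, which stores the design settings in $S_ARTPEP
. artsurv, method(l) nperiod(8) ngroups(2) edf0(0.3, 3) hratio(1, 0.80) n(842)
>     alpha(0.05) recrt(6)
. * step 3: open the ARTPEP dialog box
. db artpep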

As a worked example, we now imagine that the OE05 trial has been running for
1 year and has accrued 100 patients so far. Assuming the survival distribution to be

correct, when may we expect to complete the trial (that is, obtain the required number
of events)? To answer this question, we complete the three steps described above. The
resulting empty dialog box is shown in figure 1.

Figure 1. Incomplete ARTPEP dialog box

We now explain the various items that the dialog box needs. The name of the
corresponding option for the artpep command is given in square brackets:

• ART Settings...: As already mentioned, this button may be used to set up the
parameters of an ART run if that has not been done already. It accesses the ART
dialog box.

• Patients recruited in each period so far [pts]: A “period” here is 1 year, and we
have recruited 100 patients in the first period. We therefore enter 100 for this
item.

• Additional patients to be recruited [epts]: To get to the 842 patients (we will
use 850), we hope to recruit about 150 patients per year for the next 5 years,
making a total of 6 years’ planned recruitment. We enter 150. The program knows
the period in which recruitment is to cease and, by default, repeats the number
150 over the next 5 periods. If we had expected a differing recruitment rate (say,
accelerating toward the end of the trial), we could have entered a different number
of patients to be recruited in each period.

• Number of periods over which to project [eperiods]: Let us say we wish to project
events and power over the next 10 years. We enter 10.

• Period in which recruitment ceases [stoprecruit]: Here we enter the number of periods
after which recruitment is to cease. The number must be no smaller than the
number of periods implied by Patients recruited in each period so far [pts]. If
the option is left blank, it is assumed that recruitment continues indefinitely. As
already noted, we wish to stop recruitment at 850 patients, which we will achieve
by the end of period 6. We therefore enter 6 for this item.

• Period to start reporting projections [startperiod]: Usually, we want to enter 1


here, signifying the start of the trial. By default, if the item is left blank, the
program assumes that the current period is intended. We enter 1.

• Save using filename [using]: The numerical results of the artpep run can be saved
to a .dta file for a permanent record or for plotting. We leave the item blank.

• Start date of trial (ddmmmyyyy) [datestart]: If we enter the start date, the output
from artpep is conveniently labeled with the calendar date of the end of each
period. We recommend using this option. We enter 01jan2009.

The completed ARTPEP dialog box is shown in figure 2.

Figure 2. Completed ARTPEP dialog box for the OE05 trial



After submitting the above setup to Stata (version 10 or later), we get the following
result:

. artpep, pts(100) $S_ARTPEP epts(150) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009)
Date year #pats #C-events #events Power

31dec2009 1 100 9 17 0.06498

31dec2010 2 250 36 66 0.14480


31dec2011 3 400 79 146 0.26850
31dec2012 4 550 132 247 0.41622
31dec2013 5 700 193 362 0.56360
31dec2014 6 850 258 488 0.69209
31dec2015 7 850 314 597 0.77737
31dec2016 8 850 351 673 0.82423
31dec2017 9 850 375 726 0.85155
31dec2018 10 850 392 763 0.86825
31dec2019 11 850 403 789 0.87882

The program reports the total number of events (#events) and the number of events in
the control arm (#C-events), which are often of interest. The required total number of
events (that is, both arms combined) of 673 is projected to be reached on 31 December
2016, the end of period 8. We expect 351 events in the control arm by that time. The
projection is not surprising because the accrual figures that have been entered more or
less agree with the trial plan. Nevertheless, the output shows us the expected progress
of the number of events and the power over time. The trial may be monitored (and the
ARTPEP analysis updated) to follow its progress.

The dialog box has, as usual, created and run the necessary artpep command line.
The second item in the command is $S_ARTPEP. As already mentioned, it contains
additional information needed by artpep. On displaying its contents, we find

. display "$S_ARTPEP"
alpha(.05) aratios() hratio(1, 0.80) ngroups(2) ni(0) onesided(0) trend(0)
> tunit(1) edf0(0.3, 3) median(0) method(l)

The key pieces of information here are hratio(1, 0.80) and edf0(0.3, 3), which
specify the hazard ratios in groups 1 and 2, and the survival function in group 1,
respectively. All the other items are default values and could be omitted in the present
example. The present example could have been run directly from the command line as
follows:

. artpep, pts(100) edf0(0.3, 3) epts(150) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009) hratio(1, 0.8)

2.3 Sensitivity analysis of the event rate


We have assumed a 30% survival probability 3 years after recruitment. Suppose, in fact,
that the patients do better than that—their 3-year survival is 40% instead. What effect
would that have on the power and events timeline?

We need only change the edf0() option to edf0(0.4, 3):


. artpep, pts(100) epts(150) edf0(0.4, 3) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009) hratio(1, 0.8)
Date year #pats #C-events #events Power

31dec2009 1 100 7 13 0.05869

31dec2010 2 250 29 53 0.12410


31dec2011 3 400 65 119 0.22732
31dec2012 4 550 111 205 0.35714
31dec2013 5 700 165 306 0.49586
31dec2014 6 850 224 419 0.62612
31dec2015 7 850 277 522 0.72135
31dec2016 8 850 316 600 0.77974
31dec2017 9 850 345 660 0.81685
31dec2018 10 850 366 705 0.84128
31dec2019 11 850 382 739 0.85784

The time to observe the required number of events has been pushed back by more than
1 year, beyond period 9 (31dec2017).

3 Syntax
Once you have gained a little experience with using the ARTPEP dialog box, you will
find it more natural and efficient to use the command line. The syntax of artpep is as
follows:


artpep [using filename] , pts(numlist) edf0(slist0) [epts(numlist)
    eperiods(#) stoprecruit(#) startperiod(#) datestart(ddmmmyyyy)
    replace artsurv options]

4 Options
pts(numlist) is required. numlist specifies the number of patients recruited in each
period since the start of the trial, that is, since randomization. See help on artsurv
for the definition of a “period”. The number of items in numlist defines the number
of periods of recruitment so far. For example, pts(23 12 25) specifies three initial
periods of recruitment, with recruitment of 23 patients in period 1, 12 in period 2,
and 25 in period 3. The “current” period would be period 3 and would be demarcated
by parallel lines in the output.
edf0(slist0 ) is required and gives the survival function in the control group (group 1).
This need not be one of the survival distributions to be compared in the trial, unless
hratio() = 1 for at least one of the groups. The format of slist0 is #1 [#2 . . . #r,
#1 #2 . . . #r]. Thus edf0(p1 p2 . . . pr , t1 t2 . . . tr ) gives the value pi for the survival
function for the event time at the end of time period ti , i = 1, . . . , r. Instantaneous
event rates (that is, hazards) are assumed constant within time periods; that is, the

distribution of time-to-event is assumed to be piecewise exponential. When used in


a given calculation up to period T , tr may validly be less than, equal to, or greater
than T . If tr ≤ T , the rules described in the edf0() option of artsurv are applied
to compute the survival function at all periods ≤ T . If tr > T , the same calculation
is used but estimated survival probabilities for periods > T are not used in the
calculation at T , although they may of course be used in calculations (for example,
projections of sample size and events) for periods later than T . Be aware that use
of the median() option (an alternative to edf0()) and the fp() option of artsurv
may modify the effects and interpretation of edf0().
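As an illustration of this format, the following sketch specifies the control-group
survival function at the end of periods 1, 2, and 3 directly, using the per-period values
shown in the artsurv output of section 2.1; under the piecewise exponential assumption
this should give essentially the same projection as edf0(0.3, 3):

. artpep, pts(100) epts(150) edf0(0.669 0.448 0.300, 1 2 3) eperiods(10)
>     startperiod(1) stoprecruit(6) datestart(01jan2009) hratio(1, 0.8)
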
epts(numlist) specifies in numlist the number of additional patients to be recruited in
each period following the recruitment phase defined by the pts() option. For exam-
ple, pts(23 12 25) epts(30 30) would specify three initial periods of recruitment
followed by two further periods. A projection of events and power is required over
the two further periods. The initial recruitment is of 23 patients in period 1, 12 in
period 2, and 25 in period 3; in each of periods 4 and 5, we expect to recruit an
additional 30 patients. If the number of items in (or implied by expanding) numlist
is less than that specified by pts(), the final value in numlist is replicated as neces-
sary to all subsequent periods. If epts() is not given, the default is that the mean
of the numbers of patients specified in pts() is used for all projections.
eperiods(#) specifies the number of future periods over which projection of power and
number of events is to be calculated. The default is eperiods(1).
stoprecruit(#) specifies the number of periods after which recruitment is to cease. #
must be no smaller than the number of periods of recruitment implied by pts(). The
default is stoprecruit(0), meaning to continue recruiting indefinitely (no follow-up
phase).
startperiod(#) specifies # as the period in which to start reporting the projec-
tions of events and power. To report from the beginning of the trial, specify
startperiod(1). Note that startperiod() does not affect the period at which
the calculations are started, only how the results are reported. The default # is the
last period defined by pts().
datestart(ddmmmyyyy) signifies the opening date of the trial (that is, when recruit-
ment started), for example, datestart(14oct2009). The date of the end of each
period is used to label the output and is stored in filename if using is specified.
replace allows filename to be replaced if it already exists.
artsurv options are any of the options of artsurv except recrt(), nperiod(), power(),
and n().


5 Final comments
We have illustrated ARTPEP with a basic example. However, ARTPEP understands
the more complex options of artsurv. Therefore, complex features, including loss to
follow-up, treatment crossover, and nonproportional hazards, can be allowed for in the
projection of power and events.
Sometimes it is desirable to make projections on a finer time scale than 1 year,
for example, in 3- or 6-month periods. This is easily done by adjusting the period
parameters used in ART and ARTPEP.

6 References
Barthel, F. M.-S., A. Babiker, P. Royston, and M. K. B. Parmar. 2006. Evaluation of
sample size and power for multi-arm survival trials allowing for non-uniform accrual,
non-proportional hazards, loss to follow-up and cross-over. Statistics in Medicine 25:
2521–2542.

Barthel, F. M.-S., P. Royston, and A. Babiker. 2005. A menu-driven facility for complex
sample size calculation in randomized controlled trials with a survival or a binary
outcome: Update. Stata Journal 5: 123–129.

About the authors


Patrick Royston is a medical statistician with 30 years of experience, with a strong interest in
biostatistical methods and in statistical computing and algorithms. He now works in cancer
clinical trials and related research issues. Currently, he is focusing on problems of model
building and validation with survival data, including prognostic factor studies; on parametric
modeling of survival data; on multiple imputation of missing values; and on novel clinical trial
designs.
Friederike Barthel is a senior statistician in Oncology Research & Development at Glaxo-
SmithKline. Previously, she worked at the MRC Clinical Trials Unit and the Institute of
Psychiatry. Her current research interests include sample-size issues, particularly concerning
multistage, multiarm trials, microarray study analyses, and competing risks. Friederike has
taught undergraduate courses in statistics at the University of Westminster and at Kingston
University.
The Stata Journal (2010)
10, Number 3, pp. 395–407

metaan: Random-effects meta-analysis


Evangelos Kontopantelis
National Primary Care Research & Development Centre
University of Manchester
Manchester, UK
e.kontopantelis@manchester.ac.uk

David Reeves
Health Sciences Primary Care Research Group
University of Manchester
Manchester, UK
david.reeves@manchester.ac.uk

Abstract. This article describes the new meta-analysis command metaan, which
can be used to perform fixed- or random-effects meta-analysis. Besides the stan-
dard DerSimonian and Laird approach, metaan offers a wide choice of available
models: maximum likelihood, profile likelihood, restricted maximum likelihood,
and a permutation model. The command reports a variety of heterogeneity mea-
sures, including Cochran’s Q, I 2 , HM
2
, and the between-studies variance estimate
τb2 . A forest plot and a graph of the maximum likelihood function can also be
generated.
Keywords: st0201, metaan, meta-analysis, random effect, effect size, maximum
likelihood, profile likelihood, restricted maximum likelihood, REML, permutation
model, forest plot

1 Introduction
Meta-analysis is a statistical methodology that integrates the results of several inde-
pendent clinical trials that are, in general, considered by the analyst to be “combinable”
(Huque 1988).
summary statistic for each study is estimated; then in the second stage, these statis-
tics are combined into a weighted average. Individual patient data (IPD) methods
exist for combining and meta-analyzing data across studies at the individual patient
level. An IPD analysis provides advantages such as standardization (of marker values,
outcome definitions, etc.), follow-up information updating, detailed data-checking, sub-
group analyses, and the ability to include participant-level covariates (Stewart 1995;
Lambert et al. 2002). However, individual observations are rarely available; addition-
ally, if the main interest is in mean effects, then the two-stage and the IPD approaches
can provide equivalent results (Olkin and Sampson 1998).
This article concerns itself with the second stage of the two-stage approach to meta-
analysis. At this stage, researchers can select between two main approaches—the fixed-
effects (FE) or the random-effects model—in their efforts to combine the study-level
summary estimates and calculate an overall average effect. The FE model is simpler
and assumes the true effect to be the same (homogeneous) across studies. However, ho-
mogeneity has been found to be the exception rather than the rule, and some degree of
true effect variability between studies is to be expected (Thompson and Pocock 1991).
Two sorts of between-studies heterogeneity exist: clinical heterogeneity stems from dif-



ferences in populations, interventions, outcomes, or follow-up times, and methodological


heterogeneity stems from differences in trial design and quality (Higgins and Green 2009;
Thompson 1994). The most common approach to modeling the between-studies variance
is the model proposed by DerSimonian and Laird (1986), which is widely used in generic
and specialist meta-analysis statistical packages alike. In Stata, the DerSimonian–Laird
(DL) model is used in the most popular meta-analysis commands—the recently up-
dated metan and the older but still useful meta (Harris et al. 2008). However, the
between-studies variance component can be estimated using more-advanced (and com-
putationally expensive) iterative techniques: maximum likelihood (ML), profile likeli-
hood (PL), and restricted maximum likelihood (REML) (Hardy and Thompson 1996;
Thompson and Sharp 1999). Alternatively, the estimate can be obtained using non-
parametric approaches, such as the permutations (PE) model proposed by Follmann
and Proschan (1999).
We have implemented these models in metaan, which performs the second stage
of a two-stage meta-analysis and offers alternatives to the DL random-effects model.
The command requires the studies’ effect estimates and standard errors as input. We
have also created metaeff, a command that provides support in the first stage of the
two-stage process and complements metaan. The metaeff command calculates for each
study the effect size (standardized mean difference) and its standard error from the
input parameters supplied by the user, using one of the models described in the Cochrane
Handbook for Systematic Reviews of Interventions (Higgins and Green 2006). For more
details, type ssc describe metaeff in Stata or see Kontopantelis and Reeves (2009).
The metaan command does not offer the plethora of options metan does for in-
putting various types of binary or continuous data. Other useful features in metan
(unavailable in metaan) include stratified meta-analysis, user-input study weights, vac-
cine efficacy calculations, the Mantel–Haenszel FE method, L’Abbe plots, and funnel
plots. The REML model, assumed to be the best model for fitting a random-effects
meta-analysis model even though this assumption has not been thoroughly investi-
gated (Thompson and Sharp 1999), has recently been coded in the updated meta-
regression command metareg (Harbord and Higgins 2008) and the new multivariate
random-effects meta-analysis command mvmeta (White 2009). However, the output
and options provided by metaan can be more useful in the univariate meta-analysis
context.

2 The metaan command


2.1 Syntax



metaan varname1 varname2 [if] [in] , {fe | dl | ml | reml | pl | pe} [varc
     label(varname) forest forestw(#) plplot(string)]



where
varname1 is the study effect size.
varname2 is the study effect variation, with standard error used as the default.
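
For example, a minimal call sketch, assuming hypothetical variables es and se holding the
study effect sizes and standard errors, and author and year holding study labels:

. * DerSimonian-Laird random-effects meta-analysis with a forest plot
. metaan es se, dl label(author year) forest
. * the same studies fit by restricted maximum likelihood
. metaan es se, reml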

2.2 Options
fe fits an FE model that assumes there is no heterogeneity between the studies. The
model assumes that within-study variances may differ, but that there is homogeneity
of effect size across studies. Often the homogeneity assumption is unlikely, and
variation in the true effect across studies is to be expected. Therefore, caution is
required when using this model. Reported heterogeneity measures are estimated
using the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
dl fits a DL random-effects model, which is the most commonly used model. The model
assumes heterogeneity between the studies; that is, it assumes that the true effect
can be different for each study. The model assumes that the individual-study true
effects are distributed with a variance τ² around an overall true effect, but the model
makes no assumptions about the form of the distribution of either the within-study
or the between-studies effects. Reported heterogeneity measures are estimated using
the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
ml fits an ML random-effects model. This model makes the additional assumption
(necessary to derive the log-likelihood function, and also true for reml and pl, below)
that both the within-study and the between-studies effects have normal distributions.
It solves the log-likelihood function iteratively to produce an estimate of the between-
studies variance. However, the model does not always converge; in some cases, the
between-studies variance estimate is negative and set to zero, in which case the
model is reduced to an fe specification. Estimates are reported as missing in the
event of nonconvergence. Reported heterogeneity measures are estimated using the
ml option. You must specify one of fe, dl, ml, reml, pl, or pe.
reml fits an REML random-effects model. This model is similar to ml and uses the same
assumptions. The log-likelihood function is maximized iteratively to provide esti-
mates, as in ml. However, under reml, only the part of the likelihood function that
is location invariant is maximized (that is, maximizing the portion of the likelihood
that does not involve μ if estimating τ 2 , and vice versa). The model does not always
converge; in some cases, the between-studies variance estimate is negative and set
to zero, in which case the model is reduced to an fe specification. Estimates are re-
ported as missing in the event of nonconvergence. Reported heterogeneity measures
are estimated using the reml option. You must specify one of fe, dl, ml, reml, pl,
or pe.
pl fits a PL random-effects model. This model uses the same likelihood function as ml
but takes into account the uncertainty associated with the between-studies variance
estimate when calculating an overall effect, which is done by using nested iterations
to converge to a maximum. The confidence intervals (CIs) provided by the model
are asymmetric, and hence so is the diamond in the forest plot. However, the model
does not always converge. Values that were not computed are reported as missing.
Reported heterogeneity measures are estimated using the ml option because μ  and
τ2 , the effect and between-studies variance estimates, are the same. Only their
CIs are reestimated. The model also provides a CI for the between-studies variance
estimate. You must specify one of fe, dl, ml, reml, pl, or pe.
pe fits a PE random-effects model. This model can be described in three steps. First, in
line with a null hypothesis that all true study effects are zero and observed effects
are due to random variation, a dataset of all possible combinations of observed
study outcomes is created by permuting the sign of each observed effect. Then, the
dl model is used to compute an overall effect for each combination. Finally, the
resulting distribution of overall effect sizes is used to derive a CI for the observed
overall effect. The CI provided by the model is asymmetric, and hence so is the
diamond in the forest plot. Reported heterogeneity measures are estimated using
the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
varc specifies that the study-effect variation variable, varname2, holds variance values.
If this option is omitted, metaan assumes that the variable contains standard-error
values (the default).
label(varname) selects labels for the studies. One or two variables can be selected
and converted to strings. If two variables are selected, they will be separated by a
comma. Usually, the author names and the year of study are selected as labels. The
final string is truncated to 20 characters.
forest requests a forest plot. The weights from the specified analysis are used for
plotting symbol sizes (pe uses dl weights). Only one graph output is allowed in each
execution.
forestw(#) requests a forest plot with adjusted weight ratios for better display. The
value can be in the [1, 50] range. For example, if the largest to smallest weight ratio
is 60 and the graph looks awkward, the user can use this command to improve the
appearance by requesting that the weight be rescaled to a largest/smallest weight
ratio of 30. Only the weight squares in the plot are affected, not the model. The CIs
in the plot are unaffected. Only one graph output is allowed in each execution.
plplot(string) requests a plot of the likelihood function for the average effect or
between-studies variance estimate of the ml, pl, or reml model. The plplot(mu) op-
tion fixes the average effect parameter to its model estimate in the likelihood function
and creates a two-way plot of τ 2 versus the likelihood function. The plplot(tsq)
option fixes the between-studies variance to its model estimate in the likelihood
function and creates a two-way plot of μ versus the likelihood function. Only one
graph output is allowed in each execution.
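
To make the option syntax concrete, the calls below sketch a few typical invocations; the variable and label names (es, variance, author, year) are placeholders rather than data supplied with this article.

    metaan es se, reml label(author year) forest       // REML model, standard errors supplied
    metaan es variance, dl varc forestw(30)            // DL model, variances supplied
    metaan es se, pl plplot(tsq)                       // PL model, likelihood plot for tau^2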

2.3 Saved results


metaan saves the following in r() (some varying by selected model):
Scalars
r(Hsq)    heterogeneity measure H_M²
r(Q) Cochran’s Q value
r(df) degrees of freedom
r(effvar) effect variance
r(efflo) effect size, lower 95% CI
r(Isq)    heterogeneity measure I²
r(Qpval) p-value for Cochran’s Q
r(eff) effect size
r(effup) effect size, upper 95% CI

In addition to the standard results, metaan, fe and metaan, dl save the following in
r():
Scalars
r(tausq_dl)    τ̂², from the DL model

In addition to the standard results, metaan, ml saves the following in r():


Scalars
r(tausq_dl)    τ̂², from the DL model
r(conv_ml)     ML convergence information
r(tausq_ml)    τ̂², from the ML model

In addition to the standard results, metaan, reml saves the following in r():
Scalars
r(tausq_dl)      τ̂², from the DL model
r(conv_reml)     REML convergence information
r(tausq_reml)    τ̂², from the REML model

In addition to the standard results, metaan, pl saves the following in r():


Scalars
r(tausq_dl)       τ̂², from the DL model
r(tausqlo_pl)     τ̂² (PL), lower 95% CI
r(cloeff_pl)      convergence information, PL effect size (lower CI)
r(ctausqlo_pl)    convergence information, PL τ̂² (lower CI)
r(conv_ml)        ML convergence information
r(tausq_pl)       τ̂², from the PL model
r(tausqup_pl)     τ̂² (PL), upper 95% CI
r(cupeff_pl)      convergence information, PL effect size (upper CI)
r(ctausqup_pl)    convergence information, PL τ̂² (upper CI)

In addition to the standard results, metaan, pe saves the following in r():


Scalars
r(tausq_dl)    τ̂², from the DL model
r(exec_pe)     information on PE execution

In each case, heterogeneity measures H_M² and I² are computed using the returned
between-studies variance estimate τ̂². Convergence and PE execution information is returned
as 1 if successful and as 0 otherwise. r(effvar) cannot be computed for PE. r(effvar)
is the same for ML and PL, but for PL the CIs are “amended” to take into account the
τ² uncertainty.
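
As a brief sketch of how these saved results might be used after estimation (an illustration assuming the underscore spellings listed above and the effsize and se variables from the example in section 4):

    metaan effsize se, dl
    display "overall effect = " r(eff) "   95% CI: [" r(efflo) ", " r(effup) "]"
    display "I^2 = " r(Isq) "   tau^2 (DL) = " r(tausq_dl)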

3 Methods
The metaan command offers six meta-analysis models for calculating a mean effect esti-
mate and its CIs: FE model, random-effects DL method, ML random-effects model, REML
random-effects model, PL random-effects model, and PE method using a DL random-
effects model. Models of the random-effects family take into account the identified
between-studies variation, estimate it, and usually produce wider CIs for the overall
effect than would an FE analysis. Brief descriptions of the models have been provided
in section 2.2. In this section, we will provide a few more details and practical advice in
selecting among the models. Their complexity prohibits complete descriptions in this
article, and users wishing to look into model details are encouraged to refer to the orig-
inal articles that described them (DerSimonian and Laird 1986; Hardy and Thompson
1996; Follmann and Proschan 1999; Brockwell and Gordon 2001).
The three ML models are iterative and usually computationally expensive. ML and PL
derive the μ (overall effect) and τ 2 estimates by maximizing the log-likelihood function
in (1) under different conditions. REML estimates τ 2 and μ by maximizing the restricted
log-likelihood function in (2).

\[
\log L(\mu,\tau^2) = -\frac{1}{2}\left[\sum_{i=1}^{k}\log\left\{2\pi\left(\hat\sigma_i^2+\tau^2\right)\right\}
+ \sum_{i=1}^{k}\frac{(y_i-\mu)^2}{\hat\sigma_i^2+\tau^2}\right],
\qquad \mu\in\mathbb{R}\ \ \&\ \ \tau^2\ge 0 \tag{1}
\]

\[
\log L'(\hat\mu,\tau^2) = -\frac{1}{2}\left[\sum_{i=1}^{k}\log\left\{2\pi\left(\hat\sigma_i^2+\tau^2\right)\right\}
+ \sum_{i=1}^{k}\frac{(y_i-\hat\mu)^2}{\hat\sigma_i^2+\tau^2}\right]
- \frac{1}{2}\log\left\{\sum_{i=1}^{k}\frac{1}{\hat\sigma_i^2+\tau^2}\right\},
\qquad \hat\mu\in\mathbb{R}\ \ \&\ \ \tau^2\ge 0 \tag{2}
\]

where k is the number of studies to be meta-analyzed, yᵢ and σ̂ᵢ² are the effect and
variance estimates for study i, and μ̂ is the overall effect estimate.
ML follows the simplest approach, maximizing (1) in a single iteration loop. A criti-
cism of ML is that it takes no account of the loss in degrees of freedom that results from
estimating the overall effect. REML derives the likelihood function in a way that adjusts
for this and removes downward bias in the between-studies variance estimator. A use-
ful description for REML, in the meta-analysis context, has been provided by Normand
(1999). PL uses the same likelihood function as ML, but uses nested iterations to take
into account the uncertainty associated with the between-studies variance estimate when
calculating an overall effect. By incorporating this extra factor of uncertainty, PL yields
CIs that are usually wider than for DL and also are asymmetric. PL has been shown to
outperform DL in various scenarios (Brockwell and Gordon 2001).
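
For readers who want to trace the computations, the following Mata function is a minimal sketch of the log-likelihood in (1); it is purely illustrative and is not the code used by metaan (the function name and arguments are ours).

    mata:
    // log L(mu, tau^2) of (1): y = study effects, s2 = within-study variance estimates
    real scalar ll_meta(real colvector y, real colvector s2,
                        real scalar mu, real scalar tausq)
    {
        real colvector w
        w = s2 :+ tausq
        return(-0.5*sum(ln(2*pi()*w)) - 0.5*sum((y :- mu):^2 :/ w))
    }
    end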
The PE model (Follmann and Proschan 1999) can be described as follows: First, in
line with a null hypothesis that all true study effects are zero and observed effects are due
to random variation, a dataset of all possible combinations of observed study outcomes
is created by permuting the sign of each observed effect. Next the dl model is used to
compute an overall effect for each combination. Finally, the resulting distribution of
overall effect sizes is used to derive a CI for the observed overall effect.
Method performance is known to be affected by three factors: the number of studies
in the meta-analysis, the degree of heterogeneity in true effects, and—provided there is
heterogeneity present—the distribution of the true effects (Brockwell and Gordon 2001).
Heterogeneity, which is attributed to clinical or methodological diversity (Higgins and
Green 2006), is a major problem researchers have to face when combining study results
in a meta-analysis. The variability that arises from different interventions, populations,
outcomes, or follow-up times is described by clinical heterogeneity, while differences in
trial design and quality are accounted for by methodological heterogeneity (Thompson
1994). Traditionally, heterogeneity is tested with Cochran’s Q, which provides a p-value
for the test of homogeneity, when compared with a χ²(k−1) distribution where k is the
number of studies (Brockwell and Gordon 2001). However, the test is known to be poor
at detecting heterogeneity because its power is low when the number of studies is small
(Hardy and Thompson 1998). An alternative measure is I², which is thought to be more
informative in assessing inconsistency between studies. I² values of 25%, 50%, and 75%
correspond to low, moderate, and high heterogeneity, respectively (Higgins et al. 2003).
Another measure is H_M², the measure least affected by the value of k. It takes values in
the [0, +∞) range, with 0 indicating perfect homogeneity (Mittlböck and Heinzl 2006).
Obviously, the between-studies variance estimate τ̂² can also be informative about the
presence or absence of heterogeneity.
The test for heterogeneity is often used as the basis for applying an FE or a random-
effects model. However, the often low power of the Q test makes it unwise to base a
decision on the result of the test alone. Research studies, even on the same topic, can
vary on a large number of factors; hence, homogeneity is often an unlikely assumption
and some degree of variability between studies is to be expected (Thompson and Pocock
1991). Some authors recommend the adoption of a random-effects model unless there
are compelling reasons for doing otherwise, irrespective of the outcome of the test for
heterogeneity (Brockwell and Gordon 2001).
However, even though random-effects methods model heterogeneity, the performance
of the ML models (ML, REML, and PL) in situations where the true effects violate the
assumptions of a normal distribution may not be optimal (Brockwell and Gordon 2001;
Hardy and Thompson 1998; Böhning et al. 2002; Sidik and Jonkman 2007). The num-
ber of studies in the analysis is also an issue, because most meta-analysis models (includ-
ing DL, ML, REML, and PL —but not PE) are only asymptotically correct; that is, they
provide the theoretical 95% coverage only as the number of studies increases (approaches
infinity). Method performance is therefore affected when the number of studies is small,
but the extent depends on the model (some are more susceptible), along with the degree
of heterogeneity and the distribution of the true effects (Brockwell and Gordon 2001).

4 Example
As an example, we apply the metaan command to health-risk outcome data from seven
studies. The information was collected for an unpublished meta-analysis, and the data
are available from the authors. Using the describe and list commands, we provide
details of the dataset and proceed to perform a univariate meta-analysis with metaan.

. use metaan_example
. describe
Contains data from metaan_example.dta
obs: 7
vars: 4 19 Apr 2010 12:19
size: 560 (99.9% of memory free)

storage display value


variable name type format label variable label

study str16 %16s First author and year


outcome str48 %35s Outcome description
effsize float %9.0g effect sizes
se float %9.0g SE of the effect sizes

Sorted by: study outcome


. list study outcome effsize se, noobs clean
study outcome effsize se
Bakx A, 1985 Serum cholesterol (mmol/L) -.3041526 .0958199
Campbell A, 1998 Diet .2124063 .0812414
Cupples, 1994 BMI .0444239 .090661
Eckerlund SBP -.3991309 .12079
Moher, 2001 Cholesterol (mmol/l) -.9374746 .0691572
Woolard A, 1995 Alcohol intake (g/week) -.3098185 .206331
Woolard B, 1995 Alcohol intake (g/week) -.4898825 .2001602

. metaan effsize se, pl label(study) forest


Profile Likelihood method selected

Study Effect [95% Conf. Interval] % Weight

Bakx A, 1985 -0.304 -0.492 -0.116 15.09


Campbell A, 1998 0.212 0.053 0.372 15.40
Cupples, 1994 0.044 -0.133 0.222 15.20
Eckerlund -0.399 -0.636 -0.162 14.49
Moher, 2001 -0.937 -1.073 -0.802 15.62
Woolard A, 1995 -0.310 -0.714 0.095 12.01
Woolard B, 1995 -0.490 -0.882 -0.098 12.19

Overall effect (pl) -0.308 -0.622 0.004 100.00

ML method succesfully converged


PL method succesfully converged for both upper and lower CI limits
Heterogeneity Measures

value df p-value

Cochrane Q 139.81 6 0.000


I^2 (%) 91.96
H^2 11.44

value [95% Conf. Interval]

tau^2 est 0.121 0.000 0.449

Estimate obtained with Maximum likelihood - Profile likelihood provides the CI


PL method succesfully converged for both upper and lower CI limits of the tau^2
> estimate

The PL model used in the example converged successfully, as did ML, whose convergence
is a prerequisite. The overall effect is not found to be significant at the 95% level,
and there is considerable heterogeneity across studies, according to the measures. The
model also displays a 95% CI for the between-studies variance estimate τ2 (provided
that convergence is achieved, as is the case in this example). The forest plot created by
the command is displayed in figure 1.


[Forest plot omitted: per-study effect sizes and 95% CIs (Bakx A, 1985; Campbell A, 1998;
Cupples, 1994; Eckerlund; Moher, 2001; Woolard A, 1995; Woolard B, 1995) and the overall
effect (pl); original weights (squares) displayed, largest to smallest ratio 1.30.]

Figure 1. Forest plot displaying PL meta-analysis

When we reexecute the analysis with the plplot(mu) and plplot(tsq) options, we
obtain the log-likelihood function plots shown in figures 2 and 3.
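
The calls are simply the earlier command with the plot option added; because only one graph output is allowed per execution, the two plots require two runs:

    metaan effsize se, pl label(study) plplot(mu)
    metaan effsize se, pl label(study) plplot(tsq)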

[Plot omitted: likelihood plot for mu fixed to the ML/PL estimate; log-likelihood plotted
against tau² values.]

Figure 2. Log-likelihood function plot for μ fixed to the model estimate



[Plot omitted: likelihood plot for tau² fixed to the ML/PL estimate; log-likelihood plotted
against mu values.]

Figure 3. Log-likelihood function plot for τ² fixed to the model estimate

5 Discussion
The metaan command can be a useful meta-analysis tool that includes newer and, in
certain circumstances, better-performing models than the standard DL random-effects
model. Unpublished results exploring model performance in various scenarios are avail-
able from the authors. Future work will involve implementing more models in the
metaan command and embellishing the forest plot.

6 Acknowledgments
We would like to thank the authors of meta and metan for all their work and the
anonymous reviewer whose useful comments improved the article considerably.

7 References
Böhning, D., U. Malzahn, E. Dietz, P. Schlattmann, C. Viwatwongkasem, and A. Big-
geri. 2002. Some general points in estimating heterogeneity variance with the
DerSimonian–Laird estimator. Biostatistics 3: 445–457.

Brockwell, S. E., and I. R. Gordon. 2001. A comparison of statistical methods for


meta-analysis. Statistics in Medicine 20: 825–840.

DerSimonian, R., and N. Laird. 1986. Meta-analysis in clinical trials. Controlled Clinical
Trials 7: 177–188.

Follmann, D. A., and M. A. Proschan. 1999. Valid inference in random effects meta-
analysis. Biometrics 55: 732–737.

Harbord, R. M., and J. P. T. Higgins. 2008. Meta-regression in Stata. Stata Journal 8:


493–519.

Hardy, R. J., and S. G. Thompson. 1996. A likelihood approach to meta-analysis with


random effects. Statistics in Medicine 15: 619–629.

———. 1998. Detecting and describing heterogeneity in meta-analysis. Statistics in


Medicine 17: 841–856.

Harris, R. J., M. J. Bradburn, J. J. Deeks, R. M. Harbord, D. G. Altman, and J. A. C.


Sterne. 2008. metan: Fixed- and random-effects meta-analysis. Stata Journal 8: 3–28.

Higgins, J. P. T., and S. Green. 2006. Cochrane Handbook for Systematic Reviews of
Interventions Version 4.2.6.
http://www2.cochrane.org/resources/handbook/Handbook4.2.6Sep2006.pdf.

———. 2009. Cochrane Handbook for Systematic Reviews of Interventions Version


5.0.2. http://www.cochrane-handbook.org/.

Higgins, J. P. T., S. G. Thompson, J. J. Deeks, and D. G. Altman. 2003. Measuring


inconsistency in meta-analyses. British Medical Journal 327: 557–560.

Huque, M. F. 1988. Experiences with meta-analysis in NDA submissions. Proceedings


of the Biopharmaceutical Section of the American Statistical Association 2: 28–33.

Kontopantelis, E., and D. Reeves. 2009. MetaEasy: A meta-analysis add-in for Microsoft
Excel. Journal of Statistical Software 30: 1–25.

Lambert, P. C., A. J. Sutton, K. R. Abrams, and D. R. Jones. 2002. A comparison


of summary patient-level covariates in meta-regression with individual patient data
meta-analysis. Journal of Clinical Epidemiology 55: 86–94.

Mittlböck, M., and H. Heinzl. 2006. A simulation study comparing properties of het-
erogeneity measures in meta-analyses. Statistics in Medicine 25: 4321–4333.

Normand, S.-L. T. 1999. Meta-analysis: Formulating, evaluating, combining, and re-


porting. Statistics in Medicine 18: 321–359.

Olkin, I., and A. Sampson. 1998. Comparison of meta-analysis versus analysis of variance
of individual patient data. Biometrics 54: 317–322.

Sidik, K., and J. N. Jonkman. 2007. A comparison of heterogeneity variance estimators


in combining results of studies. Statistics in Medicine 26: 1964–1981.

Stewart, L. A. 1995. Practical methodology of meta-analyses (overviews) using updated


individual patient data. Statistics in Medicine 14: 2057–2079.

Thompson, S. G. 1994. Systematic review: Why sources of heterogeneity in meta-


analysis should be investigated. British Medical Journal 309: 1351–1355.

Thompson, S. G., and S. J. Pocock. 1991. Can meta-analyses be trusted? Lancet 338:
1127–1130.

Thompson, S. G., and S. J. Sharp. 1999. Explaining heterogeneity in meta-analysis: A


comparison of methods. Statistics in Medicine 18: 2693–2708.

White, I. R. 2009. Multivariate random-effects meta-analysis. Stata Journal 9: 40–56.

About the authors


Evangelos (Evan) Kontopantelis is a research fellow in statistics at the National Primary Care
Research and Development Centre, University of Manchester, England. His research interests
include statistical methods in health sciences with a focus on meta-analysis, longitudinal data
modeling, and large clinical database management.
David Reeves is a senior research fellow in statistics at the Health Sciences Primary Care
Research Group, University of Manchester, England. David has worked as a statistician in
health services research for nearly three decades, mainly in the fields of learning disability
and primary care. His methodological research interests include the robustness of statistical
methods, the analysis of observational studies, and applications of social network analysis
methods to health systems.
The Stata Journal (2010)
10, Number 3, pp. 408–422

Regression analysis of censored data using


pseudo-observations
Erik T. Parner Per K. Andersen
University of Aarhus University of Copenhagen
Aarhus, Denmark Copenhagen, Denmark
parner@biostat.au.dk P.K.Andersen@biostat.ku.dk

Abstract. We draw upon a series of articles in which a method based on pseu-


dovalues is proposed for direct regression modeling of the survival function, the
restricted mean, and the cumulative incidence function in competing risks with
right-censored data. The models, once the pseudovalues have been computed, can
be fit using standard generalized estimating equation software. Here we present
Stata procedures for computing these pseudo-observations. An example from a
bone marrow transplantation study is used to illustrate the method.
Keywords: st0202, stpsurv, stpci, stpmean, pseudovalues, time-to-event, survival
analysis

1 Introduction
Statistical methods in survival analysis need to deal with data that are incomplete
because of right-censoring; a host of such methods are available, including the Kaplan–
Meier estimator, the log-rank test, and the Cox regression model. If one had complete
data, standard methods for quantitative data could be applied directly for the observed
survival time X, or methods for binary outcomes could be applied by dichotomizing
X as I(X > τ ) for a suitably chosen τ . With complete data, one could furthermore
set up regression models for any function f (X) and check such models using standard
graphical methods such as scatterplots or residuals for quantitative or binary outcomes.
One way of achieving these goals with censored survival data and with more-general
event history data (for example, competing-risks data) is to use a technique based on
pseudo-observations, as recently described in a series of articles. Thus the technique
has been studied in modeling of the survival function (Klein et al. 2007), the restricted
mean (Andersen, Hansen, and Klein 2004), and the cumulative incidence function in
competing risks (Andersen, Klein, and Rosthøj 2003; Klein and Andersen 2005; Klein
2006; Andersen and Klein 2007).
The basic idea is simple. Suppose a well-behaved estimator θ̂, for the expectation θ =
E{f(X)}, is available—for example, the Kaplan–Meier estimator for S(t) = E{I(X >
t)}—based on a sample of size n. The ith pseudo-observation (i = 1, . . . , n) for f(X) is
then defined as θ̂i = n × θ̂ − (n − 1) × θ̂−i, where θ̂−i is the estimator applied to the sample
of size n − 1, which is obtained by eliminating the ith observation from the dataset. The
pseudovalues are generated once, and the idea is to replace the incompletely observed


f(Xi) by θ̂i. That is, θ̂i may be used as an outcome variable in a regression model
or it may be used to compute residuals. θ̂i also may be used in a scatterplot when
assessing model assumptions (Perme and Andersen 2008; Andersen and Perme 2010).
The intuition is that, in the absence of censoring, θ = E{f(X)} could, obviously, be
estimated as (1/n) Σi f(Xi), in which case the ith pseudo-observation is simply the
observed value f(Xi). The pseudovalues are related to the jackknife residuals used in
regression diagnostics.
We present three new Stata commands—stpsurv, stpci, and stpmean—that pro-
vide a new possibility in Stata for analyzing regression models and that generate pseu-
dovalues (respectively) for the survival function (or the cumulative distribution func-
tion, “the cumulative incidence”) under right-censoring, for the cumulative incidence
in competing risks, and for the restricted mean under right-censoring. Cox regression
models can be fit using the pseudovalue function for survival probabilities in several
time points. Thereby, the pseudovalue method provides an alternative to Cox regres-
sion, for example, in situations where rates are not proportional. As discussed by
Perme and Andersen (2008), residuals for model checking may also be obtained from
the pseudovalues. An example based on bone marrow transplantation data is presented
to illustrate the methodology.
In section 2, we briefly present the general pseudovalue approach to censored data
regression. In section 3, we present the new Stata commands; and in section 4, we show
examples of the use of the commands. Section 5 concludes with some remarks.

2 Some methodological details


2.1 The general approach
In this section, we briefly introduce censored data regression based on pseudo-obser-
vations; see, for example, Andersen, Klein, and Rosthøj (2003) or Andersen and Perme
(2010) for more details. Let X1 , . . . , Xn be independent and identically distributed
survival times, and suppose we are interested in a parameter of the form
θ = E{f (X)}
for some function f (·). This function could be multivariate, for example,
f (X) = {f1 (X), . . . , fM (X)} = {I(X > τ1 ), . . . , I(X > τM )}
for a series of time points τ1 , . . . , τM , in which case,
θ = (θ1 , . . . , θM ) = {S(τ1 ), . . . , S(τM )}
where S(·) is the survival function for X. More examples are provided below. Fur-
thermore, let Z1 , . . . , Zn be independent and identically distributed covariates. Also
suppose we are interested in a regression model of θ = E{f (Xi )} on Zi —for example,
a generalized linear model of the form
g[E{f (Xi ) | Zi }] = β T Zi

where g(·) is the link function. If right-censoring prevents us from observing all the
Xi s, then it is not simple to analyze this regression model. However, suppose θ is
an approximately unbiased estimator of the marginal mean θ = E{f (X)} that may
be computed from the sample of right-censored observations. If f (X) = I(X > τ ),
then θ = S(τ) may be estimated using the Kaplan–Meier estimator. The ith pseudo-
observation is now defined, as suggested in section 1, as

\[
\hat\theta_i = n \times \hat\theta - (n-1) \times \hat\theta_{-i}
\]

Here θ̂−i is the “leave-one-out” estimator for θ based on all observations but the ith:
Xj, j ≠ i. The idea is to replace the possibly incompletely observed f(Xi) by θ̂i and to
obtain estimates of the βs based on the estimating equation:

\[
\sum_i \left\{ \frac{\partial}{\partial\beta} g^{-1}(\beta^T Z_i) \right\}^T V_i^{-1}(\beta)
\left\{ \hat\theta_i - g^{-1}(\beta^T Z_i) \right\}
= \sum_i U_i(\beta) = U(\beta) = 0 \tag{1}
\]

In (1), Vi is a working covariance matrix. Graw, Gerds, and Schumacher (2009) showed
that for the examples studied in this article, E{f(Xi) | Zi} = E(θ̂i | Zi), and thereby
(1) is unbiased, provided that censoring is independent of covariates; see also Andersen
and Perme (2010). A sandwich estimator is used to estimate the variance of β̂. Let

\[
I(\beta) = \sum_i \left\{ \frac{\partial}{\partial\beta} g^{-1}(\beta^T Z_i) \right\}^T
V_i^{-1}(\beta) \left\{ \frac{\partial g^{-1}(\beta^T Z_i)}{\partial\beta} \right\}
\]

and

\[
\widehat{\operatorname{Var}}\bigl\{ U(\hat\beta) \bigr\} = \sum_i U_i(\hat\beta)\, U_i(\hat\beta)^T
\]

then

\[
\widehat{\operatorname{Var}}(\hat\beta)
= I(\hat\beta)^{-1}\, \widehat{\operatorname{Var}}\bigl\{ U(\hat\beta) \bigr\}\, I(\hat\beta)^{-1}
\]

The estimator of β can be shown to be asymptotically normal (Graw, Gerds, and


Schumacher 2009; Liang and Zeger 1986), and the sandwich estimator converges in
probability to the true variance. Once the pseudo-observations have been computed,
the estimators of β can be obtained by using standard software for generalized estimating
equations.
The pseudo-observations may also be used to define residuals after fitting some
standard model (for example, a Cox regression model) for survival data; see Perme and
Andersen (2008) or Andersen and Perme (2010).

2.2 The survival function


Suppose we are interested in the survival function S(τj ) = Pr(X > τj ) at a grid of
time points τ1 < · · · < τM , for a survival time X. Hence, θ = (θ1 , . . . , θM ) where
θj = S(τj ). When M = 1, we consider the survival function at a single point in time.


Under right-censoring, the survival function is estimated by the Kaplan–Meier estimator
(Kaplan and Meier 1958),

\[
\hat S(t) = \prod_{t_j \le t} \frac{Y_j - d_j}{Y_j}
\]

where t1 < · · · < tD are the distinct event times, Yj is the number at risk, and dj is the
number of events at time tj. The cumulative distribution function is then estimated by
F̂(t) = 1 − Ŝ(t). In this case, the link function of interest could be the cloglog function

\[
\operatorname{cloglog}\{F(\tau)\} = \log[-\log\{1 - F(\tau)\}]
\]
which is equivalent to a Cox regression model for the survival function evaluated in τ .
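
To make the jackknife construction of section 2.1 concrete for this case, the fragment below is a minimal, purely illustrative sketch of the leave-one-out definition of the pseudovalues for S(530) on already stset single-record data; the variables obsno and pseudo_manual are introduced here only for illustration, and stpsurv (section 3) computes the same quantities far more efficiently.

    * illustrative sketch only; assumes the data are stset and every observation is used
    generate long obsno = _n
    quietly count
    local n = r(N)
    tempvar s
    quietly sts generate `s' = s
    quietly summarize `s' if _t <= 530
    local S_full = cond(r(N) == 0, 1, r(min))      // full-sample Kaplan-Meier S(530)
    generate double pseudo_manual = .
    forvalues i = 1/`n' {
        tempvar si
        quietly sts generate `si' = s if obsno != `i'
        quietly summarize `si' if _t <= 530 & obsno != `i'
        local S_i = cond(r(N) == 0, 1, r(min))     // leave-one-out S(530)
        quietly replace pseudo_manual = `n'*`S_full' - (`n'-1)*`S_i' if obsno == `i'
        drop `si'
    }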

2.3 The mean survival time


The mean time-to-event is the area under the survival curve:
\[
\mu = \int_0^\infty S(u)\,du \tag{2}
\]

For right-censored data, the estimated survival function (the Kaplan–Meier estimator)
does not always converge down to zero. Then the mean cannot be estimated reliably
by plugging the Kaplan–Meier estimator into (2). An alternative to the mean is the
restricted mean, defined as the area under the survival curve up to a time τ < ∞
(Klein and Moeschberger 2003), which is equal to θ = μτ = E{min(X, τ )}. The re-
stricted mean survival time is estimated by the area under the Kaplan–Meier curve up
to time τ. That is,

\[
\hat\mu_\tau = \int_0^\tau \hat S(u)\,du
\]
An alternative mean is the conditional mean given that the event time is smaller than
τ , μcτ = E(X | X ≤ τ ), which is similarly estimated by
\[
\hat\mu_{c\tau} = \int_0^\tau \frac{\hat S(u) - \hat S(\tau)}{1 - \hat S(\tau)}\,du
\]
For the restricted and conditional mean, a link function of interest could be the log or
the identity.

2.4 The cumulative incidence


Under competing risks, the cumulative incidence function is estimated in a different
way. Suppose the event of interest has hazard function h1 (t) and the competing risk
has hazard function h2 (t). The cumulative incidence function for the event of interest
is then given as
\[
F_1(t) = \int_0^t h_1(u)\, \exp\left[ -\int_0^u \{h_1(v) + h_2(v)\}\,dv \right] du
\]

If t1 < · · · < tD are the distinct times of the primary event and the competing risk
combined, Yj is the number at risk, d1j is the number of the primary events at time
tj , and d2j is the number of competing risks at time tj . Then the cumulative incidence
function of the primary event is estimated by

\[
\hat F_1(t) = \sum_{t_j \le t} \frac{d_{1j}}{Y_j} \prod_{t_i < t_j} \frac{Y_i - (d_{1i} + d_{2i})}{Y_i}
\]

Again the link function of interest could be cloglog corresponding to the regression
model for the competing risks cumulative incidence studied by Fine and Gray (1999).

3 The stpsurv, stpmean, and stpci commands


3.1 Syntax
Pseudovalues for the survival function, the mean survival time, and the cumulative
incidence function for competing risks are generated using the following syntaxes:


stpsurv [if] [in] , at(numlist) [generate(string) failure]

stpmean [if] [in] , at(numlist) [generate(string) conditional]

stpci varname [if] [in] , at(numlist) [generate(string)]

stpsurv, stpmean, and stpci are for use with st data. You must, therefore, stset
your data before issuing these commands. Frequency weights are allowed in the stset
command. In the stpci command for the cumulative incidence function in competing
risks, an indicator variable for the competing risks should always be specified. The
pseudovalues are by default stored in the pseudo variable when one time point is spec-
ified and are stored in variables pseudo1, pseudo2, . . . when several time points are
specified. The names of the pseudovariables are changed by the generate() option.

3.2 Options
at(numlist) specifies the time points in ascending order of which pseudovalues should
be computed. at() is required.
generate(string) specifies a variable name for the pseudo-observations. The default is
generate(pseudo).
failure generates pseudovalues for the cumulative incidence proportion, which is one
minus the survival function.
conditional specifies that pseudovalues for the conditional mean should be computed
instead of those for the restricted mean.
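
As a quick illustration of the syntax, using the variable names of the example in section 4 and arbitrary time points:

    stset tdfs, failure(dfs==1)              // the commands require stset data
    stpsurv, at(100 200 500) generate(ps)    // pseudovalues for S(t) at three time points
    stpsurv, at(500) failure                 // pseudovalues for 1 - S(500)
    stpmean, at(1000) conditional            // pseudovalues for E(X | X <= 1,000)
    stpci compet, at(500)                    // cumulative incidence; compet flags the competing risk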

4 Example data
To illustrate the pseudovalue approach, we use data on sibling-donor bone marrow
transplants matched on human leukocyte antigen (Copelan et al. 1991). The data
are available in Klein and Moeschberger (2003). The data include information on
137 transplant patients on time to death, relapse, or lost to follow-up (tdfs); the
indicators of relapse and death (relapse, trm); the indicator of treatment failure
(dfs = relapse | trm); and three factors that may be related to outcome: disease
[acute lymphocytic leukemia (ALL), low-risk acute myeloid leukemia (AML), and high-
risk AML], the French–American–British (FAB) disease grade for AML (fab = 1 if AML
and grade 4 or 5; 0 otherwise), and recipient age at transplant (age).

4.1 The survival function at a single time point


We will first examine regression models for disease free survival at 530 days based on
the Kaplan–Meier estimator. Disease free survival probabilities for the single prognostic
factor FAB at 530 days (figure 1) can be compared using information obtained using the
Stata sts list command, which evaluates the Kaplan–Meier estimator.

[Plot omitted: Kaplan–Meier disease free survival probability against time (days),
by FAB group (Fab=1 versus Fab=0).]

Figure 1. Disease free survival

Based on the sts list output below, the risk difference (RD) for FAB is computed
as RD = 0.333 − 0.541 = −0.207 [95% confidence interval: −0.379, −0.039] and the
relative risk (RR) for FAB is RR = 0.333/0.541 = 0.616, where FAB = 0 is chosen as the
reference group. The confidence interval of the RD is based on computing the standard
error of the RD as (0.05222 + 0.07032 )1/2 . The confidence interval for the RR is not
easily estimated using the information from the sts list command.

. use bmt
. stset tdfs, failure(dfs==1)
failure event: dfs == 1
obs. time interval: (0, tdfs]
exit on or before: failure

137 total obs.


0 exclusions

137 obs. remaining, representing


83 failures in single record/single failure data
107138 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 2640
. sts list, at(0 530) by(fab)
failure _d: dfs == 1
analysis time _t: tdfs
Beg. Survivor Std.
Time Total Fail Function Error [95% Conf. Int.]

fab=0
0 0 0 1.0000 . . .
530 49 42 0.5408 0.0522 0.4334 0.6364
fab=1
0 0 0 1.0000 . . .
530 16 30 0.3333 0.0703 0.2018 0.4704

Note: survivor function is calculated over full data and evaluated at


indicated times; it is not calculated from aggregates shown at left.
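
The hand computation reported above can be reproduced from the listed survivor estimates and standard errors; the display commands below simply restate that arithmetic.

    display "RD     = " .3333 - .5408
    display "SE(RD) = " sqrt(.0522^2 + .0703^2)
    display "lower  = " (.3333 - .5408) - 1.96*sqrt(.0522^2 + .0703^2)
    display "upper  = " (.3333 - .5408) + 1.96*sqrt(.0522^2 + .0703^2)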

Now we turn to the pseudovalues approach. We start by computing the pseudovalues


at 530 days using the stpsurv command. The pseudovalues are stored in the pseudo
variable.

. stpsurv, at(530)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variable: pseudo

The pseudovalues are analyzed in generalized linear models with an identity link
function and a log link function, respectively.

. glm pseudo i.fab, link(id) vce(robust) noheader


Iteration 0: log pseudolikelihood = -96.989802

Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.fab -.2080377 .0881073 -2.36 0.018 -.3807248 -.0353506


_cons .5406774 .0522411 10.35 0.000 .4382867 .6430681

. glm pseudo i.fab, link(log) vce(robust) eform noheader


Iteration 0: log pseudolikelihood = -123.14846
Iteration 1: log pseudolikelihood = -101.53512
Iteration 2: log pseudolikelihood = -96.991808
Iteration 3: log pseudolikelihood = -96.989802
Iteration 4: log pseudolikelihood = -96.989802

Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]

1.fab .6152278 .1440588 -2.07 0.038 .3887968 .9735298

The generalized linear models with an identity link function and a log link function
fit the relations

pi = E (Xi ) = β0 + β1 × FABi
log(pi ) = log{E (Xi )} = β0 + β1 × FABi

respectively, where pi = Si (530) is disease free survival probability at 530 days for
individual i. Hence, based on the pseudovalues approach, we estimate the RD for FAB
by RD = −0.208 [95% confidence interval: −0.381, −0.035] and the RR for FAB by
RR = 0.615 [95% confidence interval: 0.389, 0.974]. The results are very similar to the
direct computation from the Kaplan–Meier using the sts list command. We now
obtain the confidence interval for the RR.
Suppose we wish to compute the RR for FAB, adjusting for disease as a categorical
variable and age as a continuous variable. Using the same pseudovalues, we fit the
generalized linear model.


. glm pseudo i.fab i.disease age, link(log) vce(robust) eform noheader


Iteration 0: log pseudolikelihood = -114.83229
Iteration 1: log pseudolikelihood = -93.440112
Iteration 2: log pseudolikelihood = -88.620704
Iteration 3: log pseudolikelihood = -88.601028
Iteration 4: log pseudolikelihood = -88.601013
Iteration 5: log pseudolikelihood = -88.601013

Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]

1.fab .6322634 .1665066 -1.74 0.082 .3773412 1.059405

disease
2 1.951343 .412121 3.17 0.002 1.289914 2.951931
3 1.005533 .3586364 0.02 0.988 .4998088 2.022965

age .9856265 .0080274 -1.78 0.075 .970018 1.001486

Patients with AML and grade 4 or 5 (FAB = 1) have a 37% reduced disease free
survival probability at 530 days, when adjusting for disease and age.

4.2 The survival function at several time points


In this example, we compute pseudovalues at five data points roughly equally spaced on
the event scale: 50, 105, 170, 280, and 530 days. To fit the model log[− log{S(t | Z)}] =
log{Λ0 (t)}+βZ, we can use the cloglog link on the pseudovalues on failure probabilities;
that is, we fit a Cox regression model for the five time points simultaneously.

. stpsurv, at(50 105 170 280 530) failure


Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variables: pseudo1-pseudo5
. generate id=_n
. reshape long pseudo, i(id) j(times)
(note: j = 1 2 3 4 5)
Data wide -> long

Number of obs. 137 -> 685


Number of variables 32 -> 29
j variable (5 values) -> times
xij variables:
pseudo1 pseudo2 ... pseudo5 -> pseudo

. glm pseudo i.times i.fab i.disease age, link(cloglog) vce(cluster id) noheader
Iteration 0: log pseudolikelihood = -468.74476
Iteration 1: log pseudolikelihood = -457.41878 (not concave)
Iteration 2: log pseudolikelihood = -406.98781
Iteration 3: log pseudolikelihood = -365.23278
Iteration 4: log pseudolikelihood = -350.7435
Iteration 5: log pseudolikelihood = -349.97156
Iteration 6: log pseudolikelihood = -349.96409
Iteration 7: log pseudolikelihood = -349.96409
(Std. Err. adjusted for 137 clusters in id)

Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]

times
2 1.114256 .3269323 3.41 0.001 .4734805 1.755032
3 1.626173 .3567925 4.56 0.000 .9268721 2.325473
4 2.004267 .3707305 5.41 0.000 1.277649 2.730885
5 2.495327 .3824645 6.52 0.000 1.745711 3.244944

1.fab .7619547 .354821 2.15 0.032 .0665183 1.457391

disease
2 -1.195542 .4601852 -2.60 0.009 -2.097489 -.2935959
3 .0036343 .3791488 0.01 0.992 -.7394838 .7467524

age .0130686 .0146629 0.89 0.373 -.0156702 .0418074


_cons -2.981582 .6066311 -4.91 0.000 -4.170557 -1.792607

The estimated survival function in this model for a patient at time t with a set of
covariates Z is S(t) = exp{−Λ0 (t)eβZ }, where

Λ0 (50) = exp(−2.9816) = 0.051


Λ0 (105) = exp(−2.9816 + 1.1143) = 0.155
Λ0 (170) = exp(−2.9816 + 1.6262) = 0.258
Λ0 (280) = exp(−2.9816 + 2.0043) = 0.376
Λ0 (530) = exp(−2.9816 + 2.4953) = 0.615

The model shows that patients with AML who are at low risk have better disease
free survival than ALL patients [RR = exp(−1.1955) = 0.30] and that AML patients with
grade 4 or 5 FAB have a lower disease free survival [RR = exp(0.7620) = 2.14].
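
Model-based predictions are also readily available: after the glm fit above, predict with the mu option returns the fitted failure probability for each observation and time point, so the corresponding survival prediction is one minus that value (the variable names below are ours).

    predict double Fhat, mu          // fitted F(t | Z) for each observation and time point
    generate double Shat = 1 - Fhat  // fitted survival probabilities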
Without recomputing the pseudovalues, we can examine the effect of FAB over time.


. generate fab50=(fab==1 & times==1)


. generate fab105=(fab==1 & times==2)
. generate fab170=(fab==1 & times==3)
. generate fab280=(fab==1 & times==4)
. generate fab530=(fab==1 & times==5)
. glm pseudo i.times fab50-fab530 i.disease age, link(cloglog) vce(cluster id)
> noheader eform
Iteration 0: log pseudolikelihood = -471.86839
Iteration 1: log pseudolikelihood = -464.24832 (not concave)
Iteration 2: log pseudolikelihood = -406.31257
Iteration 3: log pseudolikelihood = -361.28364
Iteration 4: log pseudolikelihood = -349.90468
Iteration 5: log pseudolikelihood = -349.44613
Iteration 6: log pseudolikelihood = -349.43492
Iteration 7: log pseudolikelihood = -349.43485
Iteration 8: log pseudolikelihood = -349.43485
(Std. Err. adjusted for 137 clusters in id)

Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]

times
2 3.99608 2.023867 2.74 0.006 1.480921 10.78292
3 8.225489 4.601898 3.77 0.000 2.747526 24.62531
4 11.89654 6.835021 4.31 0.000 3.858093 36.68333
5 19.20116 11.25862 5.04 0.000 6.084498 60.59409

fab50 4.047315 3.227324 1.75 0.080 .8480474 19.31586


fab105 2.866106 1.433666 2.11 0.035 1.07525 7.639677
fab170 2.008426 .795497 1.76 0.078 .9240856 4.365155
fab280 2.022028 .7258472 1.96 0.050 1.000533 4.086419
fab530 2.048864 .7838364 1.87 0.061 .9679838 4.33669

disease
2 .3024683 .1368087 -2.64 0.008 .1246451 .7339808
3 .9993425 .3815547 -0.00 0.999 .4728471 2.112069

age 1.012745 .0148835 0.86 0.389 .9839899 1.04234

. test fab50=fab105=fab170=fab280=fab530
( 1) [pseudo]fab50 - [pseudo]fab105 = 0
( 2) [pseudo]fab50 - [pseudo]fab170 = 0
( 3) [pseudo]fab50 - [pseudo]fab280 = 0
( 4) [pseudo]fab50 - [pseudo]fab530 = 0
chi2( 4) = 1.73
Prob > chi2 = 0.7855

The model shows that there is no statistically significant difference in the FAB effect
over time (p = 0.79); that is, proportional hazards are not contraindicated for FAB.

4.3 The restricted mean


For the restricted mean time to treatment failure, we use the stpmean command. To
illustrate, we look at a regression model for the mean time to treatment failure restricted
to 1,500 days. Here we use the identity link function.

. stpmean, at(1500)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variable: pseudo
. glm pseudo i.fab i.disease age, link(id) vce(robust) noheader
Iteration 0: log pseudolikelihood = -1065.6767

Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.fab -352.0442 123.311 -2.85 0.004 -593.7293 -110.359

disease
2 461.1214 134.0932 3.44 0.001 198.3036 723.9391
3 78.00616 158.8357 0.49 0.623 -233.3061 389.3184

age -8.169236 5.060915 -1.61 0.106 -18.08845 1.749976


_cons 895.118 159.1586 5.62 0.000 583.173 1207.063

Here we see that low-risk AML patients have the longest restricted mean life, namely,
461.1 days longer than ALL patients within 1,500 days.
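
As an aside (not part of the pseudovalue analysis), the corresponding full-sample quantity, the area under the Kaplan–Meier curve up to 1,500 days, can also be obtained from official Stata by truncating follow-up at τ:

    preserve
    use bmt, clear
    stset tdfs, failure(dfs==1) exit(time 1500)   // truncate follow-up at tau = 1,500 days
    stci, rmean                                   // area under the Kaplan-Meier curve up to tau
    restore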

4.4 Competing risks


For the cumulative incidence function, we use the stpci command to compute the
pseudovalues. To illustrate, we use the complementary log–log model to the relapse
cumulative incidence evaluated at 50, 105, 170, 280, and 530 days. The event of interest
is death in remission. Here relapse is a competing event.

. stset tdfs, failure(trm==1)


failure event: trm == 1
obs. time interval: (0, tdfs]
exit on or before: failure

137 total obs.


0 exclusions

137 obs. remaining, representing


42 failures in single record/single failure data
107138 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 2640
. generate compet=(trm==0 & relapse==1)

. stpci compet, at(50 105 170 280 530)


Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variables: pseudo1-pseudo5
. generate id=_n
. reshape long pseudo, i(id) j(times)
(note: j = 1 2 3 4 5)
Data wide -> long

Number of obs. 137 -> 685


Number of variables 33 -> 30
j variable (5 values) -> times
xij variables:
pseudo1 pseudo2 ... pseudo5 -> pseudo

. fvset base none times


. glm pseudo i.times i.fab i.disease age, link(cloglog) vce(cluster id)
> noheader noconst eform
Iteration 0: log pseudolikelihood = -462.96735 (not concave)
Iteration 1: log pseudolikelihood = -348.27329
Iteration 2: log pseudolikelihood = -221.69131
Iteration 3: log pseudolikelihood = -198.31467
Iteration 4: log pseudolikelihood = -197.38196
Iteration 5: log pseudolikelihood = -197.37526
Iteration 6: log pseudolikelihood = -197.37524
(Std. Err. adjusted for 137 clusters in id)

Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]

times
1 .0286012 .0292766 -3.47 0.001 .0038467 .21266
2 .0791623 .0547411 -3.67 0.000 .0204131 .306993
3 .1261608 .0823572 -3.17 0.002 .0350965 .4535083
4 .1781601 .1117597 -2.75 0.006 .0521017 .6092124
5 .2383869 .1488814 -2.30 0.022 .0700932 .8107537

1.fab 3.104153 1.52811 2.30 0.021 1.182808 8.146518

disease
2 .1708985 .1154623 -2.61 0.009 .0454622 .6424309
3 .7829133 .466016 -0.41 0.681 .2438093 2.514068

age 1.014382 .0258272 0.56 0.575 .9650037 1.066286

Here we are modeling C(t | Z) = 1 − exp{−Λ0 (t)eβZ }. Positive values of β for a


covariate suggest a larger cumulative incidence for patients with Z = 1. The model
suggests that the low-risk AML patients have the smallest risk of death in remission and
the AML FAB 4/5 patients have the highest risk of death in remission.
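
For comparison only (an aside that assumes Stata 11 or later), the Fine and Gray (1999) subdistribution hazard model referenced in section 2.4 can be fit directly with official Stata on the original, unreshaped data:

    use bmt, clear
    stset tdfs, failure(trm==1)
    generate compet = (trm==0 & relapse==1)
    stcrreg i.fab i.disease age, compete(compet == 1)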

5 Conclusion
The pseudovalue method is a versatile tool for regression analysis of censored time-to-
event data. We have implemented the method for regression analysis of the survival
under right-censoring, for the cumulative incidence function under possible competing
risks, and for the restricted and conditional mean waiting time. Similar SAS macros and
R functions were presented by Klein et al. (2008).

6 References
Andersen, P. K., M. G. Hansen, and J. P. Klein. 2004. Regression analysis of restricted
mean survival time based on pseudo-observations. Lifetime Data Analysis 10: 335–
350.
Andersen, P. K., and J. P. Klein. 2007. Regression analysis for multistate models
based on a pseudo-value approach, with applications to bone marrow transplantation
studies. Scandinavian Journal of Statistics 34: 3–16.
Andersen, P. K., J. P. Klein, and S. Rosthøj. 2003. Generalised linear models for
correlated pseudo-observations, with applications to multi-state models. Biometrika
90: 15–27.
Andersen, P. K., and M. P. Perme. 2010. Pseudo-observations in survival analysis.
Statistical Methods in Medical Research 19: 71–99.
Copelan, E. A., J. C. Biggs, J. M. Thompson, P. Crilley, J. Szer, J. P. Klein, N. Kapoor,
B. R. Avalos, I. Cunningham, K. Atkinson, K. Downs, G. S. Harmon, M. B. Daly,
I. Brodsky, S. I. Bulova, and P. J. Tutschka. 1991. Treatment for acute myelocytic
leukemia with allogeneic bone marrow transplantation following preparation with
BuCy2. Blood 78: 838–843.
Fine, J. P., and R. J. Gray. 1999. A proportional hazards model for the subdistribution
of a competing risk. Journal of the American Statistical Association 94: 496–509.
Graw, F., T. A. Gerds, and M. Schumacher. 2009. On pseudo-values for regression
analysis in competing risks models. Lifetime Data Analysis 15: 241–255.
Kaplan, E. L., and P. Meier. 1958. Nonparametric estimation from incomplete obser-
vations. Journal of the American Statistical Association 53: 457–481.
Klein, J. P. 2006. Modeling competing risks in cancer studies. Statistics in Medicine
25: 1015–1034.
Klein, J. P., and P. K. Andersen. 2005. Regression modeling of competing risks data
based on pseudovalues of the cumulative incidence function. Biometrics 61: 223–229.
Klein, J. P., M. Gerster, P. K. Andersen, S. Tarima, and M. P. Perme. 2008. SAS and R
functions to compute pseudo-values for censored data regression. Computer Methods
and Programs in Biomedicine 89: 289–300.

Klein, J. P., B. Logan, M. Harhoff, and P. K. Andersen. 2007. Analyzing survival curves
at a fixed point in time. Statistics in Medicine 26: 4505–4519.

Klein, J. P., and M. L. Moeschberger. 2003. Survival Analysis: Techniques for Censored
and Truncated Data. 2nd ed. New York: Springer.

Liang, K.-Y., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear
models. Biometrika 73: 13–22.

Perme, M. P., and P. K. Andersen. 2008. Checking hazard regression models using
pseudo-observations. Statistics in Medicine 27: 5309–5328.

About the authors


Erik T. Parner has a PhD in statistics from the University of Aarhus. He is an associate profes-
sor of biostatistics at the University of Aarhus. His research fields are time-to-event analysis,
statistical methods in epidemiology and genetics, and the etiology and changing prevalence of
autism.
Per K. Andersen has a PhD in statistics and a DrMedSci degree in biostatistics, both from the
University of Copenhagen. He is a professor of biostatistics at the University of Copenhagen.
His main research fields are time-to-event analysis and statistical methods in epidemiology.
The Stata Journal (2010)
10, Number 3, pp. 423–457

Estimation of quantile treatment effects with


Stata
Markus Frölich Blaise Melly
Universität Mannheim and Department of Economics
Institute for the Study of Labor Brown University
Bonn, Germany Providence, RI
froelich@uni-mannheim.de blaise melly@brown.edu

Abstract. In this article, we discuss the implementation of various estimators


proposed to estimate quantile treatment effects. We distinguish four cases involv-
ing conditional and unconditional quantile treatment effects with either exogenous
or endogenous treatment variables. The introduced ivqte command covers four
different estimators: the classical quantile regression estimator of Koenker and
Bassett (1978, Econometrica 46: 33–50) extended to heteroskedasticity consis-
tent standard errors; the instrumental-variable quantile regression estimator of
Abadie, Angrist, and Imbens (2002, Econometrica 70: 91–117); the estimator for
unconditional quantile treatment effects proposed by Firpo (2007, Econometrica
75: 259–276); and the instrumental-variable estimator for unconditional quantile
treatment effects proposed by Frölich and Melly (2008, IZA discussion paper 3288).
The implemented instrumental-variable procedures estimate the causal effects for
the subpopulation of compliers and are only well suited for binary instruments.
ivqte also provides analytical standard errors and various options for nonpara-
metric estimation. As a by-product, the locreg command implements local linear
and local logit estimators for mixed data (continuous, ordered discrete, unordered
discrete, and binary regressors).
Keywords: st0203, ivqte, locreg, quantile treatment effects, nonparametric regres-
sion, instrumental variables

1 Introduction
Ninety-five percent of applied econometrics is concerned with mean effects, yet distri-
butional effects are no less important. The distribution of the dependent variable may
change in many ways that are not revealed or are only incompletely revealed by an exam-
ination of averages. For example, the wage distribution can become more compressed or
the upper-tail inequality may increase while the lower-tail inequality decreases. There-
fore, applied economists and policy makers are increasingly interested in distributional
effects. The estimation of quantile treatment effects (QTEs) is a powerful and intuitive
tool that allows us to discover the effects on the entire distribution. As an alternative
motivation, median regression is often preferred to mean regression to reduce suscep-
tibility to outliers. Hence, the estimators presented below may thus be particularly
appealing with noisy data such as wages or earnings. In this article, we provide a brief
survey over recent developments in this literature and a description of the new ivqte
command, which implements these estimators.



Depending on the type of endogeneity of the treatment and the definition of the
estimand, we can define four different cases. We distinguish between conditional and
unconditional effects and whether selection is on observables or on unobservables. Con-
ditional QTEs are defined conditionally on the value of the regressors, whereas uncon-
ditional effects summarize the causal effect of a treatment for the entire population.
Selection on observables is often referred to as a matching assumption or as exogenous
treatment choice (that is, exogenous conditional on X). In contrast, we refer to selection
on unobservables as endogenous treatment choice.

First, if we are interested in conditional QTEs and we assume that the treatment
is exogenous (conditional on X), we can use the quantile regression estimators pro-
posed by Koenker and Bassett (1978). Second, if we are interested in conditional
QTEs but the treatment is endogenous, the instrumental-variable (IV) estimator of
Abadie, Angrist, and Imbens (2002) may be applied. Third, for estimating uncondi-
tional QTEs with exogenous treatment, various approaches have been suggested, for
example, Firpo (2007), Frölich (2007a), and Melly (2006). Currently, the weighting
estimator of Firpo (2007) is implemented. Finally, unconditional QTE in the presence
of an endogenous treatment can be estimated with the technique of Frölich and Melly
(2008). The estimators for the unconditional treatment effects do not rely on any (para-
metric) √functional forms assumptions. On the other hand, for the conditional treatment
effects, n convergence rate can only be obtained with a parametric restriction. Be-
cause estimators affected by the curse of dimensionality are of less interest to the applied
economist, we will discuss only parametric (linear) estimators for estimating conditional
QTEs.

The implementation of most of these estimators requires the preliminary nonpara-


metric estimation of some kind of (instrument) propensity scores. We use nonparametric
linear and logistic regressions to estimate these propensity scores. As a by-product, we
also offer the locreg command for researchers interested only in these nonparametric
regression estimators. We allow for different types of regressors, including continuous,
ordered discrete, unordered discrete, and binary variables. A cross-validation routine is
implemented for choosing the smoothing parameters.

This article only discusses the implementation of the proposed estimators and the
syntax of the commands. It draws heavily on the more technical discussion in the
original articles, and the reader is referred to those articles for more background on,
and formal derivations of, some of the properties of the estimators described here.

The contributions to this article and the related commands are manyfold. We provide
new standardized commands for the estimators proposed in Abadie, Angrist, and Imbens
(2002);1 Firpo (2007); and Frölich and Melly (2008); and estimators of their analytical
standard errors. For the conditional exogenous case, we provide heteroskedasticity
consistent standard errors. The estimator of Koenker and Bassett (1978) has already
been implemented in Stata with the qreg command, but its estimated standard errors
are not consistent in the presence of heteroskedasticity. The ivqte command thus
extends upon qreg in providing analytical standard errors for heteroskedastic errors.
1. Joshua Angrist provides codes in Matlab to replicate the empirical results of Abadie, Angrist, and
Imbens (2002). Our codes for this estimator partially build on his codes.
At a higher level, locreg implements nonparametric estimation with both cate-
gorical and continuous regressors as suggested by Racine and Li (2004). Finally, we
incorporate cross-validation procedures to choose the smoothing parameters.
The next section outlines the definition of the estimands, the possible identifica-
tion approaches, and the estimators. Section 3 describes the ivqte command and
its various options, and contains simple applications to illustrate how ivqte can be
used. Appendix A describes somewhat more technical aspects for the estimation of the
asymptotic variance matrices. Appendix B describes the nonparametric estimators used
internally by ivqte and the additional locreg command.

2 Framework, assumptions, and estimators


We consider the effect of a binary treatment variable D on a continuous outcome variable
Y . Let Yi1 and Yi0 be the potential outcomes of individual i. Hence, Yi1 would be realized
if individual i were to receive treatment 1, and Yi0 would be realized otherwise. Yi is
the observed outcome, which is Yi ≡ Yi1 Di + Yi0 (1 − Di ).
In this article, we identify and estimate the entire distribution functions of Y¹ and
Y⁰.² Because QTEs are an intuitive way to summarize the distributional impact of a
treatment, we focus our attention especially on them.
We often observe not only the outcome and the treatment variables but also some
characteristics X.3 We can therefore either define the QTEs conditionally on the co-
variates or unconditionally. In addition, we have to deal with endogenous treatment
choice. We distinguish between the case where selection is only on observables and the
case where selection is also on unobservables.

2.1 Conditional exogenous QTEs


We start with the standard model for linear quantile regression, which is a model for
conditional effects and where one assumes selection on observables. We assume that Y
is a linear function in X and D.
Assumption 1. Linear model for potential outcomes
Yid = Xi β τ + dδ τ + εi and Qτεi = 0
for i = 1, . . . , n and d ∈ {0, 1}. Qτεi refers to the τ th quantile of the unobserved random
variable εi . β τ and δ τ are the unknown parameters of the model. Here δ τ represents
the conditional QTEs at quantile τ .
2. In the case with endogenous treatment, we identify the potential outcomes only for compliers, as
defined later.
3. If we do not observe covariates, then conditional and unconditional QTEs are identical and the
estimators simplify accordingly.

Clearly, this linearity assumption is not sufficient for identification of QTEs because
the observed Di may be correlated with the error term εi . We assume that both D and
X are exogenous.
Assumption 2. Selection on observables with exogenous X

ε⊥⊥(D, X)

Assumptions 1 and 2 together imply that QτY |X,D = Xβ τ + Dδ τ , such that we can
recover the unknown parameters of the potential outcomes from the joint distribution
of the observed variables Y , X, and D. The unknown coefficients can thus be estimated
by the classical quantile regression estimator suggested by Koenker and Bassett (1978).
This estimator is defined by

$$(\hat\beta^\tau, \hat\delta^\tau) = \arg\min_{\beta,\delta} \sum_i \rho_\tau(Y_i - X_i\beta - D_i\delta) \qquad (1)$$

where $\rho_\tau(u) = u \times \{\tau - 1(u < 0)\}$. This is a convex linear programming problem
and is solved rather efficiently by the built-in qreg command in Stata. The ivqte
command produces exactly the same point estimates as does qreg. In contrast to qreg,
however, ivqte produces analytical standard errors that are consistent also in the case
of heteroskedasticity.
To illustrate the similarity to all the following estimators, we could also write the
previous expression as

$$(\hat\beta^\tau, \hat\delta^\tau) = \arg\min_{\beta,\delta} \sum_i W_i^{KB} \times \rho_\tau(Y_i - X_i\beta - D_i\delta)$$

where the weights $W_i^{KB}$ are all equal to one.

2.2 Conditional endogenous QTEs


In many applications, the treatment D is self-selected and potentially endogenous. We
may not be able to observe all covariates to make assumption 2 valid. In this case,
the traditional quantile regression estimator will be biased, and we need to use an IV
identification strategy to recover the true effects. We assume that we observe a binary
instrument Z and can therefore define two potential treatments denoted by Dz .4 We
use the following IV assumption as in Abadie, Angrist, and Imbens (2002).5

4. If the instrument is nonbinary, it must be transformed into a binary variable. See Frölich and Melly
(2008).
5. An alternative approach is given in Chernozhukov and Hansen (2005), who rely on a monotonic-
ity/rank invariance assumption in the outcome equation.

Assumption 3. IV
For almost all values of X

(Y 0 , Y 1 , D0 , D1 ) ⊥⊥ Z |X
0 < Pr (Z = 1 |X ) < 1
E (D1 |X ) ≠ E (D0 |X )
Pr (D1 ≥ D0 |X ) = 1

This assumption is well known and requires monotonicity (that is, the nonexistence of
defiers) in addition to a conditional independence assumption on the IV. Individuals
with D1 > D0 are referred to as compliers, and treatment effects can be identified only
for this group because the always- and never-participants cannot be induced to change
treatment status by hypothetical movements of the instrument.
Abadie, Angrist, and Imbens (2002) (AAI) impose assumption 3. Furthermore, they
require assumption 1 to hold for the compliers (that is, those observations with D1 >
D0 ). They show that the conditional QTE, δ τ , for the compliers can be estimated
consistently by the weighted quantile regression:

$$(\hat\beta^\tau_{IV}, \hat\delta^\tau_{IV}) = \arg\min_{\beta,\delta} \sum_i W_i^{AAI} \times \rho_\tau(Y_i - X_i\beta - D_i\delta) \qquad (2)$$
$$W_i^{AAI} = 1 - \frac{D_i(1 - Z_i)}{1 - \Pr(Z = 1 \mid X_i)} - \frac{(1 - D_i)Z_i}{\Pr(Z = 1 \mid X_i)}$$

The intuition for these weights can be given in two steps. First, by assumption 3,6

(Y 0 , Y 1 , D0 , D1 ) ⊥⊥ Z |X
=⇒ (Y 0 , Y 1 ) ⊥⊥ Z |X, D1 > D0
=⇒ (Y 0 , Y 1 ) ⊥⊥ D |X, D1 > D0

This means that any observed relationship between D and Y has a causal interpretation
for compliers. To use this result, we have to find compliers in the population. This is
done in the following average sense by the weights WiAAI :7
 
E{WiAAI ρτ (Yi − Xi β − Di δ)} = Pr (D1 > D0 ) E {ρτ (Yi − Xi β − Di δ) |D1 > D0 }

Intuitively, this result holds because WiAAI = 1 for the compliers and because
E(WiAAI | Di,1 = Di,0 = 0) = E(WiAAI | Di,1 = Di,0 = 1) = 0.
A preliminary estimator for Pr (Z = 1 |Xi ) is needed to implement this estima-
tor. ivqte uses the local logit estimator described in appendix B.8 A problem with
6. This is the result of lemma 2.1 in Abadie, Angrist, and Imbens (2002).
7. This is a special case of theorem 3.1.a in Abadie (2003).
8. In their original article, Abadie, Angrist, and Imbens (2002) use a series estimator instead of a
local estimator as in ivqte. Nevertheless, one can also use series estimation or, in fact, any other
method to estimate the propensity score by first generating a variable containing the estimated
propensity score and informing ivqte via the phat() option that the propensity-score estimate is
supplied by the user.

estimator (2) is that the optimization problem is not convex because some of the
weights are negative while others are positive. Therefore, this estimator has not been
implemented. Instead, ivqte implements the AAI estimator with positive weights.
Abadie, Angrist, and Imbens (2002) have shown that as an alternative to WiAAI , one
can use the weights
$$W_i^{AAI+} = E(W^{AAI} \mid Y_i, D_i, X_i) \qquad (3)$$
instead, which are always positive. Because these weights are unknown, ivqte uses local
linear regression to estimate WiAAI+ ; see appendix B. Some of these estimated weights
might be negative in finite samples, which are then set to zero.9

2.3 Unconditional QTEs


The two estimators presented above focused on conditional treatment effects, that is,
conditional on a set of variables X. We will now consider unconditional QTEs, which
have some advantages over the conditional effects. The unconditional QTE (for quantile
τ ) is given by
Δτ = QτY 1 − QτY 0
First, the definition of the unconditional QTE does not change when we change the
set of covariates X. Although we aim to estimate the unconditional effect, we still
use the covariates X for two reasons. On the one hand, we often need covariates to
make the identification assumptions more plausible. On the other hand, covariates can
increase efficiency. Therefore, covariates X are included in the first-step regression and
then integrated out. However, the definition of the effects is not a function of the
covariates. This is an advantage over the conditional QTE, which changes with the set
of conditioning variables even if the covariates are not needed to satisfy the selection on
observables or the IV assumptions.
A very simple example illustrates this advantage. Assume that the treatment D
has been completely randomized and is therefore independent both from the potential
outcomes as well as from the covariates. A simple comparison of the distribution of Y in
the treated and nontreated populations has a causal interpretation in such a situation.
For efficiency reasons, however, we may wish to include covariates in the estimation. If
we are interested in mean effects, it is well known that including in a linear regression
covariates that are independent from the treatment leaves the estimated treatment effect
asymptotically unchanged. This property is lost for QTEs! Including covariates that
are independent from the treatment can change the limit of the estimated conditional
QTEs. On the other hand, it does not change the unconditional treatment effects if
the assumptions of the model are satisfied for both sets of covariates, which is trivially
satisfied in our randomized example.
A second advantage of unconditional effects is that they can be estimated consistently
at the √n rate without any parametric restrictions, which is not possible for
conditional effects. For the conditional QTE, we therefore only implemented estimators
conditional effects. For the conditional QTE, we therefore only implemented estimators
9. Again, other estimators may be used with ivqte. The weights are first estimated by the user and
then supplied via the what() option.

with a parametric restriction. The following estimators of the unconditional QTE are
entirely nonparametric, and we will no longer invoke assumption 1. This is an important
advantage because parametric restrictions are often difficult to justify from a theoretical
point of view. In addition, assumption 1 restricts the QTE to be the same independently
from the value of X. Obviously, interaction terms may be included, but the effects in
the entire population are often more interesting than many effects for different covariate
combinations.
The interpretation of the unconditional effects is slightly different from the interpre-
tation of the conditional effects, even if the conditional QTE is independent from the
value of X. This is because of the definition of the quantile. For instance, if we are in-
terested in a low quantile, the conditional QTE will summarize the effect for individuals
with relatively low Y even if their absolute level of Y is high. The unconditional QTE,
on the other hand, will summarize the effect for individuals with a relatively low absolute level of Y .
Finally, the conditional and unconditional QTEs are trivially the same in the absence
of covariates. They are also the same if the effect is the same independent of the value
of the covariates and of the value of the quantile τ . This is often called the location
shift model because the treatment affects only the location of the distribution of the
potential outcomes.

2.4 Unconditional endogenous QTEs


We consider first the case of an endogenous treatment with a binary IV Z. This includes
the situation with exogenous treatment as a special case when we use Z = D.
Frölich and Melly (2008) showed that Δτ for the compliers is identified under a
somewhat weaker version of assumption 3, and they proposed the following estimator:

$$(\hat\alpha_{IV}, \hat\Delta^\tau_{IV}) = \arg\min_{\alpha,\Delta} \sum_i W_i^{FM} \times \rho_\tau(Y_i - \alpha - D_i\Delta) \qquad (4)$$
$$W_i^{FM} = (2D_i - 1)\,\frac{Z_i - \Pr(Z = 1 \mid X_i)}{\Pr(Z = 1 \mid X_i)\{1 - \Pr(Z = 1 \mid X_i)\}}$$
This is a bivariate quantile regression estimator with weights. One can easily see that
αIV + ΔτIV is identified only from the D = 1 observations and that αIV is identified
only from the D = 0 observations. Therefore, this estimator is equivalent to using
two univariate weighted quantile regressions separately for the D = 1 and the D = 0
observations.10
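As a concrete illustration, once an estimate of the instrument propensity score is available, the weights WiFM are a one-line transformation of the data. The following minimal sketch uses hypothetical variable names: d for the treatment, z for the instrument, and pz for a previously estimated Pr (Z = 1 |Xi ) (for instance, one saved with the generate_p() option described in section 3):

. * W^FM from an estimated instrument propensity score (hypothetical variable names)
. generate double wfm = (2*d - 1)*(z - pz)/(pz*(1 - pz))

Observations with pz very close to 0 or 1 receive extreme weights, which is why ivqte trims such observations (see the trim() option in section 3.3).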
There are two differences between (4) and (2): The covariates are not included in the
weighted quantile regression in (4), and the weights are different.11
10. The previous expression is numerically identical to $\hat\alpha_{IV} = \arg\min_{q_0} \sum_{i: D_i=0} W_i^{FM}\rho_\tau(Y_i - q_0)$ and
$\hat\alpha_{IV} + \hat\Delta^\tau_{IV} = \arg\min_{q_1} \sum_{i: D_i=1} W_i^{FM}\rho_\tau(Y_i - q_1)$, from which we thus obtain $\hat\Delta^\tau_{IV}$ via two univariate
quantile regressions.
11. The weights WiFM were suggested in theorem 3.1.b and 3.1.c of Abadie (2003) for a general purpose.
Frölich and Melly (2008) used these weights to estimate unconditional QTEs.

One might be thinking about running a weighted quantile regression of Y on a constant and D by using the
weights WiAAI . For that purpose, however, the weights of Abadie, Angrist, and Imbens
(2002) are not correct as shown in Frölich and Melly (2008). This estimator would es-
timate the difference between the τ quantile of Y 1 for the treated compliers and the τ
quantile of Y 0 for the nontreated compliers, which is not meaningful in general. How-
ever, weights WiAAI could be used to estimate unconditional effects in the special case
when the IV is independent of X such that Pr (Z = 1 |X ) is not a function of X.
On the other hand, if one is interested in estimating conditional QTE using a para-
metric specification, the weights WiFM could be used, as well. Hence, although not
developed for this case, the weights WiFM can be used to identify conditional QTEs. It
is not clear whether WiFM or WiAAI will be more efficient. For estimating conditional
effects, both are inefficient anyway because they do not incorporate the conditional
density function of the error term at the quantile.
Intuitively, the difference between the weights WiAAI and WiFM can be explained as
follows: They both find the compliers in the average sense discussed above. However,
only WiFM simultaneously balances the distribution of the covariates between treated
and nontreated compliers. Therefore, WiAAI can be used only in combination with a
conditional model because there is no need to balance covariates in such a case. It can
also be used without a conditional model when the treated and nontreated compliers
have the same covariate distribution. WiFM , on the other hand, can be used with or
without a conditional model.
A preliminary estimator for Pr (Z = 1 |Xi ) is needed to implement this estimator.
ivqte uses the local logit estimator described in appendix B. The optimization problem
(4) is neither convex nor smooth. However, only two parameters have to be estimated.
In fact, one can easily show that the estimator can be written as two univariate quantile
regressions, which can easily be solved despite the nonsmoothness; see the previous
footnotes. This is the way ivqte proceeds when the positive option is not activated.12
An alternative to solving this nonconvex problem consists in using the weights
 
$$W_i^{FM+} = E(W^{FM} \mid Y_i, D_i) \qquad (5)$$

which are always positive. ivqte estimates these weights by local linear regression if
the positive option has been activated. Again, estimated negative weights will be set
to zero.13

12. More precisely, ivqte solves the convex problem for the distribution function, and then mono-
tonizes the estimated distribution function using the method of Chernozhukov, Fernández-Val, and
Galichon (2010), and finally inverts it to obtain the quantiles. The parameters chosen in this way
solve the first-order conditions of the optimization problem, and therefore, the asymptotic results
apply to them.
13. If one is interested in average treatment effects, Frölich (2007b) has proposed an estimator for aver-
age treatment effects based on the same set of assumptions. This estimator has been implemented
in Stata in the command nplate, which can be downloaded from the websites of the authors of
this article.

2.5 Unconditional exogenous QTEs


Finally, we consider the case where the treatment is exogenous, conditional on X. We
assume that X contains all confounding variables, which we denote as the selection on
observables assumption. We also have to assume that the support of the covariates is
the same independent of the treatment, because in a nonparametric model, we cannot
extrapolate the conditional distribution outside the support of the covariates.
Assumption 4. Selection on observables and common support

(Y 0 , Y 1 )⊥⊥D|X

0 < Pr (D = 1 |X ) < 1

Assumption 4 identifies the unconditional QTE, as shown in Firpo (2007), Frölich
(2007a), and Melly (2006). The estimator of Firpo (2007) is a special case of (4), when
D is used as its own instrument. The weighting estimator for Δτ therefore is

$$(\hat\alpha, \hat\Delta^\tau) = \arg\min_{\alpha,\Delta} \sum_i W_i^{F} \times \rho_\tau(Y_i - \alpha - D_i\Delta) \qquad (6)$$
$$W_i^{F} = \frac{D_i}{\Pr(D = 1 \mid X_i)} + \frac{1 - D_i}{1 - \Pr(D = 1 \mid X_i)}$$

This is a traditional propensity-score weighting estimator, also known as inverse
probability weighting. A preliminary estimator for Pr (D = 1 |Xi ) is needed to implement
this estimator. ivqte uses the local logit estimator described in appendix B.
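For intuition, the weights WiF can be formed from any estimate of Pr (D = 1 |Xi ). The following minimal sketch uses a global logit fit purely for illustration (ivqte itself estimates the propensity score by the local logit estimator of appendix B) and hypothetical variable names d, x1, and x2:

. * illustration only: global logit propensity score and the resulting W^F weights
. logit d x1 x2
. predict double pd, pr
. generate double wf = d/pd + (1 - d)/(1 - pd)

Plugging these weights into the weighted quantile regression (6) yields the inverse-probability-weighting estimator of the unconditional QTEs.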

3 The ivqte command


3.1 Syntax
The syntax of ivqte is as follows:





ivqte depvar [indepvars] (treatment [= instrument]) [if] [in] [,
    quantiles(numlist) continuous(varlist) dummy(varlist) unordered(varlist)
    aai linear mata_opt kernel(kernel) bandwidth(#) lambda(#) trim(#)
    positive pbandwidth(#) plambda(#) pkernel(kernel) variance
    vbandwidth(#) vlambda(#) vkernel(kernel) level(#)
    generate_p(newvarname[, replace]) generate_w(newvarname[, replace])
    phat(varname) what(varname)]



3.2 Description
ivqte computes the QTEs of a binary variable using a weighting strategy. This com-
mand can estimate both conditional and unconditional QTEs under either exogene-
ity or endogeneity. The estimator proposed by Frölich and Melly (2008) is used if
unconditional QTEs under endogeneity are estimated. The estimator proposed by
Abadie, Angrist, and Imbens (2002) is used if conditional QTEs under endogeneity are
estimated. The estimator proposed by Firpo (2007) is used if unconditional QTEs under
exogeneity are estimated. The estimator proposed by Koenker and Bassett (1978) is
used if conditional QTEs under exogeneity are estimated.
The estimator used by ivqte is determined as follows:

• If an instrument is provided and aai is not activated, the estimator proposed by
Frölich and Melly (2008) is used.

• If an instrument is provided and aai is activated, the estimator proposed by
Abadie, Angrist, and Imbens (2002) is used.

• If there is no instrument and indepvars is empty, the estimator proposed by Firpo
(2007) is used.

• If there is no instrument and indepvars contains variables, the estimator proposed
by Koenker and Bassett (1978) is used.

indepvars contains the list of X variables for the Koenker and Bassett (1978) estima-
tor for the estimation of exogenous conditional QTEs.14 For all other estimators, indep-
vars must remain empty, and the control variables X are to be given in continuous(),
unordered(), and dummy(). The instrument or the treatment variable is assumed to
satisfy the exclusion restriction conditionally on these variables.
The IV has to be provided as a binary variable, taking only the values 0 and 1. If the
original IV takes different values, it first has to be transformed to a binary variable. If
the original IV is one-dimensional, one may use the endpoints of its support and discard
the other observations. If one has several discrete IVs, one would use only those two
combinations that maximize and minimize the treatment probability Pr(D = 1|Z = z)
and code these two values as 0 and 1. For more details on how to transform several
nonbinary IVs to this binary case, see Frölich and Melly (2008).
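For example, with a single multivalued instrument, one can keep only the observations at the endpoints of its support and recode them to 0 and 1. A minimal sketch, with a hypothetical multivalued instrument z0:

. * keep the endpoints of the support of z0 and recode them to a binary instrument
. summarize z0
. local zmin = r(min)
. local zmax = r(max)
. keep if inlist(z0, `zmin', `zmax')
. generate byte z = (z0 == `zmax')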
The estimation of all nonparametric functions is described in detail in appendix B.
A mixed kernel as suggested by Racine and Li (2004) is used to smooth over the con-
tinuous and categorical data. The more conventional approach of estimating the regres-
sion plane inside each cell defined by the discrete variables can be followed by setting
lambda() to 0. The propensity score is estimated by default by a local logit estima-
tor. A local linear estimator is used if the linear option is selected. Two algorithms
14. For the Koenker and Bassett (1978) estimator, the options continuous(), unordered(),
and dummy() are not permitted.

are available to maximize the local logistic likelihood function. The default is a simple
Gauss–Newton algorithm written for this purpose. If you select the mata_opt option,
the official Stata 10 optimizer is used. We expect the official optimizer to be more stable
in difficult situations. However, it can be used only if you have Stata 10 or more recent
versions.
The ivqte command also requires the packages moremata (Jann 2005b) and kdens
(Jann 2005a).
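If these packages are not yet installed, they can be obtained from the Statistical Software Components (SSC) archive, for example:

. ssc install moremata
. ssc install kdens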

3.3 Options
Model

quantiles(numlist) specifies the quantiles at which the effects are estimated and should
contain numbers between 0 and 1. The computational time needed to estimate an
additional quantile is very short compared with the time needed to estimate the
preliminary nonparametric regressions. When conditional QTEs are estimated, only
one quantile may be specified. If one is interested in several QTEs, then one can save
the estimated weights for later use by using the generate_w() option. By default,
quantiles() is set to 0.5 when conditional QTEs are estimated, and quantiles()
contains the nine deciles from 0.1 to 0.9 when unconditional QTEs are estimated.
continuous(varlist), dummy(varlist), and unordered(varlist) specify the names of the
covariates depending on their type. Ordered discrete variables should be treated as
continuous. For all estimators except Koenker and Bassett (1978), the X variables
should be given here and not in indepvars. For the Koenker and Bassett (1978)
estimator, on the other hand, these options are not permitted and the X variables
must be given in indepvars.
aai selects the Abadie, Angrist, and Imbens (2002) estimator.
With the exception of the Koenker and Bassett (1978) estimator, several further
options are needed to control the estimation of the nonparametric components. First,
we need to estimate some kind of propensity score. For the Firpo (2007) estimator,
we need to estimate Pr(D = 1|X). For the Abadie, Angrist, and Imbens (2002) and
Frölich and Melly (2008) estimators, we need to estimate Pr(Z = 1|X), which we
also call a propensity score in the following discussion. These propensity scores are
then used to calculate the weights WiF , WiAAI , and WiFM , respectively, as defined
in section 2.15 The QTEs are estimated using these weights after applying some
trimming to eliminate observations with very large weights. The amount of trimming
is controlled by trim(), as explained below. This is the way the Firpo (2007)
estimator is implemented.
For the Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estima-
tors, more comments are required. First, the Abadie, Angrist, and Imbens (2002)
estimator is only implemented with the positive weights WiAAI+ . Hence, when the
Abadie, Angrist, and Imbens (2002) estimator is activated via the aai option, first
15. The weights WiKB used in the Koenker and Bassett (1978) estimator are always equal to one.

the propensity score is estimated to calculate the weights WiAAI , which are then
automatically projected via nonparametric regression to obtain WiAAI+ . This last
nonparametric regression to obtain the positive weights is controlled by the options
pkernel(), pbandwidth(), and plambda(), which are explained below. The letter
p in front of these options stresses that these are used to obtain the positive weights.
Finally, the Frölich and Melly (2008) estimator is implemented in two ways. We can
either use the weights WiFM after having trimmed very large weights, or alternatively,
we could also project these weights and then use WiFM+ to estimate the QTEs. If
one wants to pursue this second implementation, one has to activate the positive
option and specify pkernel() and pbandwidth() to control the projection of WiFM
to obtain the positive weights WiFM+ .
Estimation of the propensity score
linear selects the method used to estimate the instrument propensity score. If this
option is not activated, the local logit estimator is used. If linear is activated, the
local linear estimator is used.
mata_opt selects the official optimizer introduced in Stata 10 to estimate the local logit,
Mata’s optimize(). The default is a simple Gauss–Newton algorithm written for
this purpose. This option is only relevant when the linear option has not been
selected.
kernel(kernel) specifies the kernel function used to estimate the propensity score. ker-
nel may be any of the following second-order kernels: epan2 (Epanechnikov ker-
nel function, the default), biweight (biweight kernel function), triweight (tri-
weight kernel function), cosine (cosine trace), gaussian (Gaussian kernel func-
tion), parzen (Parzen kernel function), rectangle (rectangle kernel function), or
triangle (triangle kernel function). In addition to these second-order kernels, there
are also several higher-order kernels: epanechnikov o4 (Epanechnikov order 4),
epanechnikov o6 (order 6), gaussian o4 (Gaussian order 4), gaussian o6 (order
6), gaussian o8 (order 8). By default, epan2, which specifies the Epanechnikov
kernel, is used.16

16. Here are the formulas for these kernel functions for Epanechnikov of order 4 and 6, respectively:
$$K(z) = \left(\frac{15}{8} - \frac{35}{8}z^2\right)\frac{3}{4}\left(1 - z^2\right)1(|z| < 1)$$
$$K(z) = \left(\frac{175}{64} - \frac{525}{32}z^2 + \frac{5775}{320}z^4\right)\frac{3}{4}\left(1 - z^2\right)1(|z| < 1)$$
And here are the formulas for Gaussian of order 4, 6, and 8, respectively:
$$K(z) = \frac{1}{2}\left(3 - z^2\right)\phi(z)$$
$$K(z) = \frac{1}{8}\left(15 - 10z^2 + z^4\right)\phi(z)$$
$$K(z) = \frac{1}{48}\left(105 - 105z^2 + 21z^4 - z^6\right)\phi(z)$$

bandwidth(#) sets the bandwidth h used to smooth over the continuous variables in the
estimation of the propensity score. The continuous regressors are first orthogonalized
such that their covariance matrix is the identity matrix. The bandwidth must be
strictly positive. If the bandwidth h is missing, an infinite bandwidth h = ∞ is
used. The default value is infinity. If the bandwidth h is infinity and the parameter
λ is one, a global model (linear or logit) is estimated without any local smoothing.
The cross-validation procedure implemented in locreg can be used to guide the
choice of the bandwidth. Because the optimal bandwidth converges at a faster rate
than the cross-validated bandwidth, the robustness of the results with respect to a
smaller bandwidth should be examined.
lambda(#) sets the λ used to smooth over the dummy and unordered discrete variables
in the estimation of the propensity score. It must be between 0 and 1. A value
of 0 implies that only observations within the cell defined by all discrete regressors
are used. The default is lambda(1), which corresponds to global smoothing. If
the bandwidth h is infinity and λ = 1, a global model (linear or logit) is estimated
without any local smoothing. The cross-validation procedure implemented in locreg
can be used to guide the choice of lambda. Again, the robustness of the results with
respect to a smaller value of lambda() should be examined.

Estimation of the weights

trim(#) controls the amount of trimming. All observations with an estimated propen-
sity score less than trim() or greater than 1 − trim() are trimmed and not used
further by the estimation procedure. This prevents giving very high weights to single
observations. The default is trim(0.001). This option is not useful for the Koenker
and Bassett (1978) estimator, where no propensity score is estimated.
positive is used only with the Frölich and Melly (2008) estimator. If it is activated, the
positive weights WiFM+ defined in (5) are estimated by the projection of the weights
W FM on the dependent and the treatment variable. Weights W FM+ are estimated
by nonparametric regression on Y , separately for the D = 1 and the D = 0 samples.
After the estimation, negative estimated weights in WiFM+ are set to zero.

pbandwidth(#), plambda(#), and pkernel(kernel) are used to calculate the positive
weights. These options are useful only for the Abadie, Angrist, and Imbens (2002)
estimator, which can be activated via the aai option, and for the Frölich and Melly
(2008) estimator, but only when the positive option has been activated to estimate
W FM+ . pkernel() and pbandwidth() are used to calculate the positive weights if
the positive option has been selected. They are defined similarly to kernel(),
bandwidth(), and lambda(). When pkernel(), pbandwidth(), and plambda() are
not specified, the values given in kernel(), bandwidth(), and lambda() are taken
as default.
The positive weights are always estimated by local linear regression. After esti-
mation, negative estimated weights are set to zero. The smoothing parameters

pbandwidth() and plambda() are in principle as important as the other smoothing
parameters bandwidth() and lambda(), and it is worth inspecting the robustness
of the results with respect to these parameters. Cross-validation can also be used to
guide these choices.

Inference

variance activates the estimation of the variance. By default, no standard errors are
estimated because the estimation of the variance can be computationally demanding.
Except for the classical linear quantile regression estimator, it requires the estima-
tion of many nonparametric functions. This option should not be activated if you
bootstrap the results unless you bootstrap t-values to exploit possible asymptotic
refinements.
vbandwidth(#), vlambda(#), and vkernel(kernel) are used to calculate the vari-
ance if the variance option has been selected. They are defined similarly to
bandwidth(), lambda(), and kernel(). They are used only to estimate the vari-
ance. A “quick and dirty” estimate of the variance can be obtained by setting
vbandwidth() to infinity and vlambda() to 1, which is much faster than any other
choice. When vkernel(), vbandwidth(), or vlambda() is not specified, the values
given in kernel(), bandwidth(), and lambda() are taken as default.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.

Saved propensity scores and weights


generate_p(newvarname[, replace]) generates newvarname containing the estimated
propensity score. Remember that the propensity score is Pr(Z = 1|X) for the
Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estimators and is
Pr(D = 1|X) for the Firpo (2007) estimator. This may be useful if one wants to
compare the results with and without the projection of the weights or to compare the
conditional and unconditional QTEs under endogeneity. One can first estimate the
QTEs using one method and save the propensity score in the variable newvarname.
In the second step, one can use the already estimated propensity score as an input
in the phat() option. The replace option allows ivqte to overwrite an existing
variable or to create a new one where none exists.

generate_w(newvarname[, replace]) generates newvarname containing the estimated
weights. This may be useful if you want to estimate several conditional
QTEs. The weights must be estimated only once and then be given as an input
in the what() option. The replace option allows ivqte to overwrite an existing
variable or to create a new one where none exists.
phat(varname) gives the name of an existing variable containing the estimated instru-
ment propensity score. The propensity score may have been estimated using ivqte
or with any other command such as a series estimator.

what(varname) gives the name of an existing variable containing the estimated weights.
The weights may have been estimated using ivqte or with any other command such
as a series estimator.
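For example, the weights and the propensity score can be estimated once and then reused when estimating conditional QTEs at other quantiles. The following minimal sketch uses hypothetical variables y, d, z, and x; the option names are those documented above:

. * first call: estimate and save the weights and the propensity score (hypothetical data)
. ivqte y (d = z), aai quantiles(0.25) continuous(x) generate_w(w_aai) generate_p(p_z)
. * later calls: reuse them for other quantiles
. ivqte y (d = z), aai quantiles(0.5) continuous(x) what(w_aai) phat(p_z)
. ivqte y (d = z), aai quantiles(0.75) continuous(x) what(w_aai) phat(p_z)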

3.4 Saved results


ivqte saves the following in e():
Scalars
e(N) number of observations
e(bandwidth) bandwidth
e(lambda) lambda
e(pbandwidth) pbandwidth
e(plambda) plambda
e(vbandwidth) vbandwidth
e(vlambda) vlambda
e(pseudo_r2) pseudo-R2 of the quantile regression
e(compliers) proportion of compliers
e(trimmed) number of observations trimmed
Macros
e(command) ivqte
e(depvar) name of dependent variable
e(treatment) name of treatment variable
e(instrument) name of IV
e(continuous) name of continuous covariates
e(dummy) name of binary covariates
e(regressors) name of regressors (conditional QTEs)
e(unordered) name of unordered covariates
e(estimator) name of estimator
e(ps method) linear or logistic model
e(optimization) algorithm used
e(kernel) kernel function
e(pkernel) kernel function for positive weights
e(vkernel) kernel function for variance estimation
Matrices
e(b) row vector containing the QTEs
e(quantiles) row vector containing the quantiles at which the QTEs have been estimated
e(V) matrix containing the variances of the estimated QTEs in the diagonal
and 0 otherwise
Functions
e(sample) marks the estimation sample
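These saved results can be inspected with the usual postestimation tools; for example, after any ivqte call:

. ereturn list
. matrix list e(b)
. matrix list e(quantiles)
. matrix list e(V)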

3.5 Simple examples (without local smoothing)


Having given the syntax for ivqte in a previous subsection, we now illustrate how the
command can be used with some very simple examples. In particular, we defer the use
of smoothing parameters (h, λ) to the next subsection to keep things simple here. This
means that all regressions will be estimated parametrically because the default values
are h = ∞ and λ = 1.

We use the “distance to college” dataset of card.dta.17 The aim is to estimate
the effect of having a college degree (college) on log wages (lwage), controlling for
parental education, experience, race, and region. A potential instrument is living near a
four-year college (nearc4). The control variables are experience, exper (continuous vari-
able); mother’s education, motheduc (ordered discrete); region (unordered discrete);
and black (dummy).
We first consider the quantile regression estimator with exogenous regressors for the
first decile. As mentioned, this estimator is already implemented in Stata with the qreg
command:

. use card
. qreg lwage college exper black motheduc reg662 reg663 reg664 reg665 reg666
> reg667 reg668 reg669, quantile(0.1)
(output omitted )

The syntax of the ivqte command is similar with the exception that the treatment
variable has to be included in parentheses after all other regressors:18

. ivqte lwage exper black motheduc reg662 reg663 reg664 reg665 reg666 reg667
> reg668 reg669 (college), quantiles(0.1) variance
(output omitted )

The point estimates are exactly identical because ivqte calls qreg, but the standard
errors differ. We recommend using the standard errors of ivqte because they are robust
against heteroskedasticity and other forms of dependence between the residuals and the
regressors.
We may be concerned that having a college degree might be endogenous and consider
using the “proximity of a four-year college” as an instrument. The proximity of a
four-year college is a binary variable, taking the value 1 if a college was close by. If
we are interested in the conditional QTE, we can apply the estimator suggested by
Abadie, Angrist, and Imbens (2002), as follows:

. ivqte lwage (college=nearc4), quantiles(0.1) variance dummy(black)
> continuous(exper motheduc) unordered(region) aai
(output omitted )

There are three differences compared with the previous syntax: First, the instrument
has to be given in parentheses after the treatment variable and the equal sign, that is,
(college=nearc4). Second, the control variables X are to be given in the corresponding
options—dummy(), continuous(), and unordered()—because they are used not only
to define the conditional QTE but also in the nonparametric estimation of the weights.
Third, the aai option must be activated. region enters here as a single unordered

17. This dataset is available for download from Stata together with the programs described in the
present article. The description of the variables can be found in Card (1995).
18. For this case of exogenous conditional QTEs, it is in principle arbitrary which variable is defined
as the treatment variable because the coefficients are estimated for all regressors. In addition,
nonbinary treatments are permitted here.

discrete variable, which is expanded by ivqte to eight regional dummy variables in the
parametric model.
The two examples discussed so far refer to the conditional treatment effect of college
degree. We might be more interested in the unconditional QTE, which we examine in the
following example. Consider first the case where college degree is exogenous conditional
on X. The weighting estimator of Firpo (2007) is implemented by ivqte. We are
interested in the nine decile treatment effects with this estimator:

. ivqte lwage (college), variance dummy(black) continuous(exper motheduc)
> unordered(region)
(output omitted )

Only the treatment is given in parentheses, and the aai option is no longer activated.
Finally, to estimate unconditional QTEs with an endogenous treatment, the estimator
of Frölich and Melly (2008) is implemented in ivqte. The only difference from the
previous syntax is that the instrument (nearc4) now has to be given after the treatment
variable:

. ivqte lwage (college = nearc4), variance dummy(black)
> continuous(exper motheduc) unordered(region)
(output omitted )

By default, the weights defined in (4) are used. If the positive option is activated,
the positive weights (5) are estimated and used:

. ivqte lwage (college = nearc4), variance dummy(black)
> continuous(exper motheduc) unordered(region) positive
(output omitted )

If no control variables are included, then ivqte lwage (college = nearc4), aai
and ivqte lwage (college = nearc4), positive produce the same results.

3.6 Advanced examples (with local smoothing)


In the examples given above, we have not used the smoothing options. Therefore,
by default, parametric regressions have been used to estimate all functions. In an
application, we should use smoothing parameters converging to 0, unless we have strong
reasons to believe that we do know the true functional forms. Appendix B contains many
details about the nonparametric estimation of functions. We illustrate here the use of
these techniques for ivqte.


We use card.dta again but keep only 500 randomly sampled observations
to reduce computation time. Because of missing values on covariates, eventually only
394 observations are retained in the estimation. The aim is to estimate the effect
of having a college degree (college) on log wages (lwage). A potential instrument is
living near a four-year college (nearc4). For ease of presentation, we use only experience
(exper) as a continuous control variable here. The other control variables are region
(unordered()) and black (dummy()).
Depending on the estimator, up to three functions have to be estimated nonpara-
metrically. Three sets of options correspond to these three functions. The options
kernel(), bandwidth(), and lambda() determine the kernel and the parameters h and
λ used for the estimation of the propensity score. It corresponds to Pr(Z = 1|X) for
the Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estimators and
to Pr(D = 1|X) for the Firpo (2007) estimator.
The options pkernel(), pbandwidth(), and plambda() determine the kernel and
smoothing parameters used for the estimation of the positive weights defined in (3) for
the estimator of Abadie, Angrist, and Imbens (2002). If the Frölich and Melly (2008)
estimator is to be used and the positive option has been activated, pkernel() and
pbandwidth() are used to estimate the positive weights (5).
Finally, the options vkernel(), vbandwidth(), and vlambda() are used for the
estimation of the variances of the estimators of Abadie, Angrist, and Imbens (2002),
Firpo (2007), and Frölich and Melly (2008).
A general finding in the literature is that the choice of the kernel functions—
kernel(), pkernel(), and vkernel()—is rarely crucial. The options vbandwidth()
and vlambda() are used only for the estimation of the variance. Therefore, during the
exploratory analysis, it may make sense to reduce the computational time by setting
vbandwidth() to infinity and vlambda() to one, that is, a parametric model. For the
final set of estimates, it often makes sense to set vbandwidth() equal to bandwidth()
and vlambda() equal to lambda(). This is done by default. In the following illustra-
tion, we show how the cross-validation procedure implemented in locreg can be used
to guide the choice of the important smoothing parameters.
We start with the estimator proposed by Firpo (2007). We do not need to use
the options pkernel(), pbandwidth(), and plambda() because the weights are always
positive by definition. We choose h = 2 and λ = 0.8 for the estimation of the propensity
score. In addition, we use h = ∞ and λ = 1 for the estimation of the variance. We use
the default Epanechnikov kernel in all cases.
. use card, clear
. set seed 123
. sample 500, count
(2510 observations deleted)
. ivqte lwage (college), quantiles(0.5) dummy(black) continuous(exper)
> unordered(region) bandwidth(2) lambda(0.8) variance vbandwidth(.) vlambda(1)
(output omitted )

Of course, these choices of the smoothing parameters are arbitrary. One can use
the cross-validation option of locreg to choose the smoothing parameters. When we
use the Firpo (2007) estimator, we know that the options bandwidth() and lambda()
are used to estimate Pr (D = 1 |X ). Therefore, we can select the smoothing parameters
from a grid of, say, four possible values, as follows. We use the logit option because
D is a binary variable.
. locreg college, dummy(black) continuous(exper) unordered(region)
> bandwidth(1 2) lambda(0.8 1) logit
(output omitted )

The scalars r(optb) and r(optl) indicate that the choices of h = 1 and λ = 1
minimize the cross-validation criterion. We use the 2 × 2 search grid only for ease of
exposition. Usually, one would search within a much larger grid. We can now obtain
the point estimate using this choice of the smoothing parameters:

. ivqte lwage (college), quantiles(0.5) dummy(black) continuous(exper)
> unordered(region) bandwidth(1) lambda(1)
(output omitted )

In addition to the values suggested by cross-validation, the user is encouraged to
also try other—especially smaller—smoothing parameters and examine the robustness
of the final results. For instance, we examine the results with h = 0.5 and λ = 0.5.

. ivqte lwage (college), quantiles(0.5) dummy(black) continuous(exper)
> unordered(region) bandwidth(0.5) lambda(0.5)
(output omitted )

In this case, the results are relatively stable.


When we use the estimator of Abadie, Angrist, and Imbens (2002), we have to addi-
tionally specify pbandwidth() and plambda() for estimating the positive weights. We
proceed by first choosing values for bandwidth() and lambda() and thereafter choosing
values for pbandwidth() and plambda(). We know that the options bandwidth() and
lambda() are used to estimate Pr (Z = 1 |X ). Therefore, we can select the smoothing
parameters from a grid of four possible values, as follows. Again we use the logit
option because Z is a binary variable.

. locreg nearc4, dummy(black) continuous(exper) unordered(region)
> bandwidth(0.5 0.8) lambda(0.8 1) generate(ps) logit
(output omitted )

The optimal smoothing parameters are h = 0.8 and λ = 0.8. The generate(ps)
option implies that the fitted values of Pr (Z = 1 |X ) are saved in the variable ps. These
fitted values are generated using the optimal bandwidth. That is, they are generated
after the cross-validation has selected the optimal bandwidth.

In the next step, we need to select bandwidths for pbandwidth() and plambda().
We know by equations (2) and (3) that pbandwidth() and plambda() are used to
estimate
$$E(W_i^{AAI} \mid Y_i, D_i, X_i) = E\left(1 - \frac{D_i(1-Z_i)}{1 - \Pr(Z = 1 \mid X_i)} - \frac{(1-D_i)Z_i}{\Pr(Z = 1 \mid X_i)} \Bigm| Y_i, D_i, X_i\right)$$

We first generate WiAAI :


. generate waai=1-college*(1-nearc4)/(1-ps)-(1-college)*nearc4/ps

Then we can use locreg to find the optimal parameters. The positive weights
are obtained by a nonparametric regression of W AAI on X and Y and D. This is
implemented in ivqte via two separate regressions: one nonparametric regression of
W AAI on X and Y for the D = 1 subsample and one separate nonparametric regression
of W AAI on X and Y for the D = 0 subsample. We proceed in the same way here by
adding Y , which in our example above is lwage, as a continuous regressor and running
separate regressions for the college==1 and the college==0 subsamples:
. locreg waai if college==1, dummy(black) continuous(exper lwage)
> unordered(region) bandwidth(0.5 0.8) lambda(0.8 1)
(output omitted )
. locreg waai if college==0, dummy(black) continuous(exper lwage)
> unordered(region) bandwidth(0.5 0.8) lambda(0.8 1)
(output omitted )

In the first case (that is, for the college==1 subsample), the optimal smoothing pa-
rameters are h = 0.8 and λ = 1. For the college==0 subsample, the optimal smoothing
parameters are h = 0.8 and λ = 0.8. The current implementation of ivqte permits only
one value of h and λ in the options pbandwidth() and plambda() to not overburden
the user with choosing nuisance parameters. If the suggested values for h and λ are
different for the college==1 and the college==0 subsamples, we recommend choos-
ing the smaller of these values but also examining the robustness of the results to the
other values. We suggest using the smaller bandwidth because the inference provided
by ivqte is based on the asymptotic formula given in (7) (see Appendix A.2), which
only contains variance but no bias term. To increase the accuracy of the inference
based on (7), one would prefer bandwidth choices that lead to smaller biases.
In our example, we choose pbandwidth(0.8), which was suggested by cross-valida-
tion in the college==1 and the college==0 subsamples, and plambda(0.8), which is
the smaller value of λ. With these bandwidth choices, we obtain the final estimates:
. ivqte lwage (college=nearc4), aai quantiles(0.5) dummy(black)
> continuous(age fatheduc motheduc) unordered(region) bandwidth(0.8) lambda(0.8)
> pbandwidth(0.8) plambda(0.8) variance
(output omitted )

By using the variance option without specifying vbandwidth() or vlambda(), the
values given for bandwidth() and lambda() are used as defaults for vbandwidth() and
vlambda(). Alternatively, we could have specified different values for vbandwidth() and

vlambda(). In an exploratory analysis, one could use, for example, h = ∞ and λ = 1,
which are certainly nonoptimal choices but reduce computation time considerably.

. ivqte lwage (college=nearc4), aai quantiles(0.5) dummy(black)
> continuous(age fatheduc motheduc) unordered(region) bandwidth(0.8) lambda(0.8)
> pbandwidth(0.8) plambda(0.8) variance vbandwidth(.) vlambda(1)
(output omitted )

3.7 Replication of results of AAI


In the last illustration of the ivqte command, we replicate tables II-A and III-A of
Abadie, Angrist, and Imbens (2002). jtpa.dta contains their dataset for males. We
can replicate the point estimates of table II-A with the official Stata qreg command:

. use jtpa, clear
. global reg "highschool black hispanic married part time classroom
> OJT JSA age5 age4 age3 age2 age1 second follow"
. qreg earnings treatment $reg, quantile(0.5)
(output omitted )

The same point estimates can also be obtained using ivqte:

. ivqte earnings $reg (treatment), quantiles(0.5) variance
(output omitted )

The standard errors are still different from the standard errors in the published arti-
cle because Abadie, Angrist, and Imbens (2002) have used a somewhat unconventional
bandwidth. We can replicate their standard errors of table II-A by activating the aai
option.19 The following command calculates the results for the median.

. ivqte earnings (treatment=treatment), quantiles(0.5) dummy($reg) variance
> aai
(output omitted )

Now we attempt to replicate the results of table III-A. Using a bandwidth of 2,

. ivqte earnings (treatment=assignment), quantiles(0.5) dummy($reg) variance
> aai pbandwidth(2)
(output omitted )

19. In this command, we use the fact that the estimator of Abadie, Angrist, and Imbens (2002) simpli-
fies to the standard quantile regression estimator when the treatment is used as its own instrument.
Similar relationships exist for the estimator proposed by Frölich and Melly (2008) and are discussed
in their article.

gives results that are slightly different from their table III-A. We cannot exactly repli-
cate their results because they have used series estimators to estimate the nonparametric
components of their estimator and because they have exploited the fact that the instru-
ment was completely randomized.20 In the following commands, we show how ivqte
with the options phat(varname) and what(varname) can be used to replicate their
results. The parameters and standard errors are then almost identical to the original
results.21
Abadie, Angrist, and Imbens (2002) first note that the instrument assignment has
been fully randomized. Therefore, they estimate Pr (Z = 1 |X ) by the sample mean of
Z. Then the negative and positive weights WiAAI can be generated:

. summarize assignment
(output omitted )
. generate pi=r(mean)
. generate kappa=1-treatment*(1-assignment)/(1-pi)-(1-treatment)*assignment/pi

In a second step, the positive weights WiAAI+ are estimated by a linear regression of
WiAAI on a polynomial of order 5 in Y and D:

. forvalues i=1/5 {
  2. generate e`i'=earnings^`i'
  3. generate de`i'=e`i'*treatment
  4. }
. regress kappa earnings treatment e2 e3 e4 e5 de1 de2 de3 de4 de5
(output omitted )
. predict kappa_pos
(option xb assumed; fitted values)
. ivqte earnings (treatment=assignment) if kappa_pos>0, dummy($reg)
> quantiles(0.5) variance aai what(kappa_pos) phat(pi)
(output omitted )

which gives almost the same estimates and standard errors as their table III-A.

4 Acknowledgments
We would like to thank Ben Jann, Andreas Landmann, Robert Poppe, and an anony-
mous referee for helpful comments.

20. There are no strong reasons to prefer series estimators to local nonparametric estimators. We
will use a series estimator here only to show that we can replicate their results. Actually,
Frölich and Melly (2008) require, in some sense, weaker regularity assumptions for the local es-
timator than what was required for the existing series estimators.
21. A small difference still remains, which is due to differences in the implementation of the estimator
of H(X). With slight adaptations, which would restrict the generality of the estimator, we can
replicate their results exactly.

5 References
Abadie, A. 2003. Semiparametric instrumental variable estimation of treatment response
models. Journal of Econometrics 113: 231–263.

Abadie, A., J. Angrist, and G. Imbens. 2002. Instrumental variables estimates of the
effect of subsidized training on the quantiles of trainee earnings. Econometrica 70:
91–117.

Card, D. E. 1995. Using geographic variation in college proximity to estimate the return
to schooling. In Aspects of Labour Economics: Essays in Honour of John Vanderkamp,
ed. L. Christofides, E. K. Grant, and R. Swindinsky. Toronto, Canada: University of
Toronto Press.

Chernozhukov, V., I. Fernández-Val, and A. Galichon. 2010. Quantile and probability


curves without crossing. Econometrica 78: 1093–1125.

Chernozhukov, V., and C. Hansen. 2005. An IV model of quantile treatment effects.


Econometrica 73: 245–261.

Firpo, S. 2007. Efficient semiparametric estimation of quantile treatment effects. Econo-


metrica 75: 259–276.

Frölich, M. 2007a. Propensity score matching without conditional independence


assumption—with an application to the gender wage gap in the United Kingdom.
Econometrics Journal 10: 359–407.

———. 2007b. Nonparametric IV estimation of local average treatment effects with


covariates. Journal of Econometrics 139: 35–75.

Frölich, M., and B. Melly. 2008. Unconditional quantile treatment effects under en-
dogeneity. Discussion Paper No. 3288, Institute for the Study of Labor (IZA).
http://ideas.repec.org/p/iza/izadps/dp3288.html.

Hall, P., and S. J. Sheather. 1988. On the distribution of a studentized quantile. Journal
of the Royal Statistical Society, Series B 50: 381–391.

Jann, B. 2005a. kdens: Stata module for univariate kernel density estimation. Sta-
tistical Software Components S456410, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s456410.html.

———. 2005b. moremata: Stata module (Mata) to provide various Mata functions.
Statistical Software Components S455001, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s455001.html.

Koenker, R. 2005. Quantile Regression. New York: Cambridge University Press.

Koenker, R., and G. Bassett Jr. 1978. Regression quantiles. Econometrica 46: 33–50.

Melly, B. 2006. Estimation of counterfactual distributions using quantile regression.


Discussion paper, Universität St. Gallen.
http://www.alexandria.unisg.ch/Publikationen/22644.

Powell, J. L. 1986. Censored regression quantiles. Journal of Econometrics 32: 143–155.

Racine, J., and Q. Li. 2004. Nonparametric estimation of regression functions with both
categorical and continuous data. Journal of Econometrics 119: 99–130.

Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Boca
Raton, FL: Chapman & Hall/CRC.

About the authors


Markus Frölich is a full professor of econometrics at the University of Mannheim and is Pro-
gram Director for Employment and Development at the Institute for the Study of Labor. His
research interests include policy evaluation, microeconometrics, labor economics, and develop-
ment economics.
Blaise Melly is an assistant professor of economics at Brown University. He specializes in
microeconometrics and applied labor economics and has special interests in the effects of policies
on the distribution of outcome variables.

A Variance estimation
In this section, we describe the analytical variance estimators implemented in ivqte.
The bootstrap represents an alternative and can be implemented in Stata using the
bootstrap prefix. The validity of the bootstrap has been proven for standard quantile
regression but not yet for the other estimators, although it seems likely that it is valid.
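As an illustration of the bootstrap alternative, the prefix can be combined with ivqte in the usual way (hypothetical variables y, d, z, and x; the variance option is left off, as recommended in section 3.3):

. bootstrap _b, reps(100): ivqte y (d = z), quantiles(0.5) continuous(x)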

A.1 Conditional exogenous QTEs


 
Let $\widetilde X = (D, X')'$ and $\gamma^\tau = (\delta^\tau, \beta^{\tau\prime})'$. The asymptotic distribution of the quantile regression estimator defined in (1) is given by22
$$\sqrt{n}\,(\hat\gamma^\tau - \gamma^\tau) \rightarrow N\left(0,\; J_\tau^{-1}\Sigma_\tau J_\tau^{-1}\right)$$
where $J_\tau = E\{f_{Y|\widetilde X}(\widetilde X'\gamma^\tau) \times \widetilde X\widetilde X'\}$ and $\Sigma_\tau = \tau(1-\tau)\,E(\widetilde X\widetilde X')$. The term $\Sigma_\tau$ is straightforward to estimate by $\tau(1-\tau)\,n^{-1}\sum_i \widetilde X_i\widetilde X_i'$. We estimate $J_\tau$ by the kernel method of Powell (1986),
$$\hat J_\tau = \frac{1}{n h_n}\sum_i k\left(\frac{Y_i - \widetilde X_i'\hat\gamma^\tau}{h_n}\right)\widetilde X_i\widetilde X_i'$$
22. See, for example, Koenker (2005).



where k is a univariate kernel function and $h_n$ is a bandwidth sequence. In the actual
implementation, we use a normal kernel and the bandwidth suggested by Hall and
Sheather (1988),
$$h_n = n^{-1/3}\,\Phi^{-1}(1 - \text{level}/2)^{2/3}\left[\frac{1.5\,\phi\{\Phi^{-1}(\tau)\}^2}{2\{\Phi^{-1}(\tau)\}^2 + 1}\right]^{1/3}$$
where level is the level for the intended confidence interval, and φ and Φ are the normal
density and distribution functions, respectively. This estimator of the asymptotic vari-
ance is consistent under heteroskedasticity, which is in contrast to the official Stata com-
mand for quantile regression, qreg. This is important because quantile regression only
becomes interesting when the errors are not independent and identically distributed.
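For concreteness, the Hall and Sheather (1988) bandwidth above can be computed directly in Stata; a minimal sketch for τ = 0.5, n = 1,000, and a 95% confidence interval (α = 0.05):

. * Hall-Sheather bandwidth (illustration; formula as given above)
. local tau = 0.5
. local n = 1000
. local q = invnormal(`tau')
. display `n'^(-1/3)*invnormal(1-0.05/2)^(2/3)*(1.5*normalden(`q')^2/(2*`q'^2+1))^(1/3)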

A.2 Conditional endogenous QTEs


The asymptotic distribution of the IV quantile regression estimator defined in (2) is given by
$$\sqrt{n}\,(\hat\gamma^\tau_{IV} - \gamma^\tau) \rightarrow N\left(0,\; I_\tau^{-1}\Omega_\tau I_\tau^{-1}\right) \qquad (7)$$
where $I_\tau = E\{f_{Y|\widetilde X, D^1>D^0}(\widetilde X'\gamma^\tau) \times \widetilde X\widetilde X' \mid D^1 > D^0\}\Pr(D^1 > D^0)$ and $\Omega_\tau = E(\psi\psi')$ with $\psi = W^{AAI} m_\tau(\widetilde X, Y) + H(X)\{Z - \Pr(Z = 1 \mid X)\}$ and
$$m_\tau(\widetilde X, Y) = \{\tau - 1(Y - \widetilde X'\gamma^\tau < 0)\}\,\widetilde X$$
$$H(X) = E\left[m_\tau(\widetilde X, Y)\left(-\frac{D(1-Z)}{\{1 - \Pr(Z = 1 \mid X)\}^2} + \frac{(1-D)Z}{\Pr(Z = 1 \mid X)^2}\right)\Bigm| X\right]$$

We estimate these elements as
$$\hat I_\tau = \frac{1}{n h_n}\sum_i \hat W_i^{AAI+} \times k\left(\frac{Y_i - \widetilde X_i'\hat\gamma^\tau_{IV}}{h_n}\right)\widetilde X_i\widetilde X_i'$$
where $\hat W_i^{AAI+}$ are estimates of the projected weights. For the kernel function in the previous regression, we use an Epanechnikov kernel and $h_n = n^{-0.2}\sqrt{\operatorname{Var}(Y_i - \widetilde X_i'\hat\gamma^\tau_{IV})}$,23 as proposed by Abadie, Angrist, and Imbens (2002).

Furthermore, $\hat H(X_i)$ is estimated by the local linear regression of
$$\{\tau - 1(Y_i - \widetilde X_i'\hat\gamma^\tau_{IV} < 0)\}\,\widetilde X_i\left[-\frac{D_i(1-Z_i)}{\{1 - \widehat{\Pr}(Z = 1 \mid X_i)\}^2} + \frac{(1-D_i)Z_i}{\widehat{\Pr}(Z = 1 \mid X_i)^2}\right] \quad \text{on } X_i
$$

This nonparametric regression is controlled by the options vkernel(), vbandwidth(),
and vlambda() in ivqte. With these ingredients, we calculate

23. In principle, the same kernel and bandwidth as those for quantile regression can be used. These
choices were made to replicate the results of Abadie, Angrist, and Imbens (2002).
$$\hat\psi_i = \hat W_i^{AAI}\{\tau - 1(Y_i - \widetilde X_i'\hat\gamma^\tau_{IV} < 0)\}\,\widetilde X_i + \hat H(X_i) \times \{Z_i - \widehat{\Pr}(Z = 1 \mid X_i)\}$$
$$\hat\Omega_\tau = \frac{1}{n}\sum_i \hat\psi_i\hat\psi_i'$$
where $\hat W_i^{AAI}$ are estimates of the weights.

A.3 Unconditional exogenous QTEs


The asymptotic distribution of the estimator defined in (6) is given by

$$\sqrt{n}\,(\hat{\Delta}^\tau - \Delta^\tau) \to N(0, V)$$

with

$$V = \frac{1}{f_{Y^1}^2(Q_{Y^1}^\tau)}\,E\left[\frac{F_{Y|D=1,X}(Q_{Y^1}^\tau)\{1 - F_{Y|D=1,X}(Q_{Y^1}^\tau)\}}{\Pr(D = 1|X)}\right]
 + \frac{1}{f_{Y^0}^2(Q_{Y^0}^\tau)}\,E\left[\frac{F_{Y|D=0,X}(Q_{Y^0}^\tau)\{1 - F_{Y|D=0,X}(Q_{Y^0}^\tau)\}}{1 - \Pr(D = 1|X)}\right]
 + E\left[\{\vartheta_1(X) - \vartheta_0(X)\}^2\right]$$

where $\vartheta_d(x) = \{\tau - F_{Y|D=d,X}(Q_{Y^d}^\tau)\}/f_{Y^d}(Q_{Y^d}^\tau)$. $Q_{Y^0}^\tau$ and $Q_{Y^1}^\tau$ have already been estimated
by $\hat{\alpha}$ and $\hat{\alpha} + \hat{\Delta}^\tau$, respectively. The densities $f_{Y^d}(Q_{Y^d}^\tau)$ are estimated by weighted
kernel estimators

$$\hat{f}_{Y^d}(\hat{Q}_{Y^d}^\tau) = \frac{1}{n h_n}\sum_{i:\,D_i = d} \hat{W}_i^F \times k\!\left(\frac{Y_i - \hat{Q}_{Y^d}^\tau}{h_n}\right)$$

with Epanechnikov kernel function and Silverman (1986) bandwidth choice, and where
$F_{Y|D=d,X}(Q_{Y^d}^\tau)$ is estimated by the local logit estimator described in appendix B.
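Such a weighted Epanechnikov kernel density estimate can be coded compactly in Mata;
the function below is a sketch, and its scaling by the number of rows of y presumes that
the weights are normalized accordingly:

    mata:
    real scalar wkdens(real colvector y, real colvector w,
                       real scalar q,  real scalar h)
    {
        real colvector u, k
        u = (y :- q) :/ h
        k = 0.75 :* (1 :- u:^2) :* (abs(u) :< 1)   // Epanechnikov kernel
        return(sum(w :* k) / (rows(y) * h))        // weighted density estimate at q
    }
    end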

A.4 Unconditional endogenous QTEs


Finally, the asymptotic variance of the estimator defined in (4) is the most tedious and
is given by

$$
\begin{aligned}
V ={}& \frac{1}{P_c^2\, f_{Y^1|c}^2(Q_{Y^1|c}^\tau)}\,E\left[\frac{\pi(X,1)}{p(X)}\,F_{Y|D=1,Z=1,X}(Q_{Y^1|c}^\tau)\{1 - F_{Y|D=1,Z=1,X}(Q_{Y^1|c}^\tau)\}\right] \\
&+ \frac{1}{P_c^2\, f_{Y^1|c}^2(Q_{Y^1|c}^\tau)}\,E\left[\frac{\pi(X,0)}{1-p(X)}\,F_{Y|D=1,Z=0,X}(Q_{Y^1|c}^\tau)\{1 - F_{Y|D=1,Z=0,X}(Q_{Y^1|c}^\tau)\}\right] \\
&+ \frac{1}{P_c^2\, f_{Y^0|c}^2(Q_{Y^0|c}^\tau)}\,E\left[\frac{1-\pi(X,1)}{p(X)}\,F_{Y|D=0,Z=1,X}(Q_{Y^0|c}^\tau)\{1 - F_{Y|D=0,Z=1,X}(Q_{Y^0|c}^\tau)\}\right] \\
&+ \frac{1}{P_c^2\, f_{Y^0|c}^2(Q_{Y^0|c}^\tau)}\,E\left[\frac{1-\pi(X,0)}{1-p(X)}\,F_{Y|D=0,Z=0,X}(Q_{Y^0|c}^\tau)\{1 - F_{Y|D=0,Z=0,X}(Q_{Y^0|c}^\tau)\}\right] \\
&+ E\left[\frac{\pi(X,1)\vartheta_{11}^2(X) + \{1-\pi(X,1)\}\vartheta_{01}^2(X)}{p(X)} + \frac{\pi(X,0)\vartheta_{10}^2(X) + \{1-\pi(X,0)\}\vartheta_{00}^2(X)}{1-p(X)}\right] \\
&- E\left(p(X)\{1-p(X)\}\left[\frac{\pi(X,1)\vartheta_{11}(X) + \{1-\pi(X,1)\}\vartheta_{01}(X)}{p(X)} + \frac{\pi(X,0)\vartheta_{10}(X) + \{1-\pi(X,0)\}\vartheta_{00}(X)}{1-p(X)}\right]^2\right)
\end{aligned}
$$

where $\vartheta_{dz}(x) = \{\tau - F_{Y|D=d,Z=z,X}(Q_{Y^d|c}^\tau)\}/\{P_c \times f_{Y^d|c}(Q_{Y^d|c}^\tau)\}$; $p(x) = \Pr(Z = 1|X = x)$;
$\pi(x, z) = \Pr(D = 1|X = x, Z = z)$; and $P_c$ is the fraction of compliers. $Q_{Y^0|c}^\tau$ and
$Q_{Y^1|c}^\tau$ have already been estimated by $\hat{\alpha}_{IV}$ and $\hat{\alpha}_{IV} + \hat{\Delta}_{IV}^\tau$, respectively. The terms
$F_{Y|D=d,Z=z,X}(Q_{Y^d|c}^\tau)$, $p(x)$, and $\pi(x, z)$ are estimated by the local logit estimator
described in appendix B. Finally, $P_c$ is estimated by the sample average of $\hat{\pi}(X_i, 1) - \hat{\pi}(X_i, 0)$.




To estimate the densities $f_{Y^d|c}(Q_{Y^d|c}^\tau)$, we note that24

$$f_{Y^d|c}(Q_{Y^d|c}^\tau) = \lim_{h\to 0}\int_0^1 \frac{1}{h}\,k\!\left(\frac{Q_{Y^d|c}^\theta - Q_{Y^d|c}^\tau}{h}\right)d\theta$$

where $k$ is the Epanechnikov kernel function with Silverman (1986) bandwidth. We
therefore estimate $f_{Y^d|c}$ as

$$\hat{f}_{Y^d|c}(\hat{Q}_{Y^d|c}^\tau) = \int_0^1 \frac{1}{h}\,k\!\left(\frac{\hat{Q}_{Y^d|c}^\theta - \hat{Q}_{Y^d|c}^\tau}{h}\right)d\theta$$

where we replace the integral by a sum over $n$ uniformly spaced values of $\theta$ between 0
and 1.
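In code, replacing the integral by a grid average is immediate. The Mata sketch below uses
standard normal quantiles as a placeholder for the estimated quantile process, so the result
should approximate the standard normal density at the median (about 0.399):

    mata:
        n      = 200
        tau    = 0.5
        h      = 0.3                                 // placeholder bandwidth
        theta  = (1::n) :/ (n + 1)                   // uniform grid on (0,1)
        Qtheta = invnormal(theta)                    // placeholder quantile process
        Qtau   = invnormal(tau)
        u = (Qtheta :- Qtau) :/ h
        k = 0.75 :* (1 :- u:^2) :* (abs(u) :< 1)     // Epanechnikov kernel
        sum(k) / (n * h)                             // grid approximation of the density
    end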

B Nonparametric regression with mixed data


B.1 Local parametric regression
A key ingredient for the previously introduced estimators (except for the exogenous con-
ditional quantile regression estimator) is the nonparametric estimation of some weights.
Local linear and local logit estimators have been implemented for this purpose. This
is fully automated in the ivqte command. Nevertheless, some understanding of the
nonparametric estimators facilitates the use of the ivqte command.
In many instances, we need to estimate conditional expected values like
E (Y |X = Xi ). We use a local parametric approach throughout; that is, we estimate a
locally weighted version of the parametric model. A complication is that many econo-
metric applications contain continuous as well as discrete regressors X. Both types of
regressors need to be accommodated in the local parametric model and in the kernel
function defining the local neighborhood. The traditional nonparametric approach con-
sists of estimating the model within each of the cells defined by the discrete regressors
24. To see this, note that
    $$\int_0^1 \frac{1}{h}\,k\!\left(\frac{Q_{Y^d|c}^\theta - Q_{Y^d|c}^\tau}{h}\right)d\theta = \int_{-\infty}^{\infty} k(u) \times f_{Y^d|c}(uh + Q_{Y^d|c}^\tau)\,du$$
    where we used the change of variables $uh = Q_{Y^d|c}^\theta - Q_{Y^d|c}^\tau$, which implies that $\theta = F_{Y^d|c}(uh + Q_{Y^d|c}^\tau)$
    and $d\theta = f_{Y^d|c}(uh + Q_{Y^d|c}^\tau)\times h\,du$. By the mean value theorem, $f_{Y^d|c}(uh + Q_{Y^d|c}^\tau) =
    f_{Y^d|c}(Q_{Y^d|c}^\tau) + uh\times f_{Y^d|c}'(\bar{Q})$, where $\bar{Q}$ lies between $Q_{Y^d|c}^\tau$ and $uh + Q_{Y^d|c}^\tau$. Hence, the
    left-hand side equals
    $$f_{Y^d|c}(Q_{Y^d|c}^\tau)\int_{-\infty}^{\infty} k(u)\,du + O(h) = f_{Y^d|c}(Q_{Y^d|c}^\tau) + O(h)$$

and of smoothing only with respect to the continuous covariates. When the number
of cells in a dataset is large, each cell may not have enough observations to nonpara-
metrically estimate the relationship among the remaining continuous variables. For
this reason, many applied researchers have treated discrete variables in a parametric
way. We follow an intermediate way and use the hybrid product kernel developed by
Racine and Li (2004). This estimator covers all cases from the fully parametric model
up to the traditional nonparametric estimator.
Overall, we can distinguish four different types of regressors: continuous (for exam-
ple, age), ordered discrete (for example, family size), unordered discrete (for example,
regions), and binary variables (for example, gender). We will treat ordered discrete and
continuous variables in the same way and will refer to them as continuous variables in
the following discussion.25
The unordered discrete and the binary variables are handled differently in the kernel
function and in the local parametric model. The binary variables enter into both as
single regressors. The unordered discrete variables, however, enter as a single regressor
in the kernel function and as a vector of dummy variables in the local model. Consider,
for example, a variable called region that takes four different values: north, south, east,
and west. This variable enters as a single variable in the kernel function but is included
in the local model in the form of three dummies: “south”, “east”, and “west”.
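This expansion can be reproduced manually in Stata; a sketch, assuming a variable named
region and the hypothetical stub name regdum:

    tabulate region, generate(regdum)     // creates regdum1, regdum2, ...
    drop regdum1                          // drop an arbitrary base category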
The kernel function is defined in the following paragraph. Suppose that the variables
in X are arranged such that the first q1 regressors are continuous (including the ordered
discrete variables) and the remaining Q − q1 regressors are discrete without natural
ordering (including binary variables). The kernel weights K(Xi − x) are computed as
$$K_{h,\lambda}(X_i - x) = \prod_{q=1}^{q_1}\kappa\!\left(\frac{X_{q,i} - x_q}{h}\right) \times \prod_{q=q_1+1}^{Q}\lambda^{1(X_{q,i} \neq x_q)}$$

where Xq,i and xq denote the qth element of Xi and x, respectively; 1(·) is the indicator
function; κ is a symmetric univariate weighting function; and h and λ are positive
bandwidth parameters with 0 ≤ λ ≤ 1. This kernel function measures the distance
between Xi and x through two components: The first term is the standard product
kernel for continuous regressors with h defining the size of the local neighborhood. The
second term measures the mismatch between the unordered discrete (including binary)
regressors. λ defines the penalty for the unordered discrete regressors. For example,
the multiplicative weight contribution of the Qth regressor is 1 if the Qth element of
Xi and x are identical, and it is λ if they are different. If h = ∞ and λ = 1, then
the nonparametric estimator corresponds to the global parametric estimator and no
interaction term between the covariates is allowed. On the other hand, if λ is zero
and h is small, then smoothing proceeds only within each of the cells defined by the
25. Racine and Li (2004) suggest using a geometrically declining kernel function for the ordered discrete
regressors. There are no reasons, however, against using quadratically declining kernel weights. In
other words, we can use the same (for example, Epanechnikov) kernel for ordered discrete as for
continuous regressors. We therefore treat ordered discrete regressors in the same way as continuous
regressors in the following discussion.

discrete regressors and only observations with similar continuous covariates will be used.
Finally, if λ and h are in the intermediate range, observations with similar discrete and
continuous covariates will be weighted more but further observations will also be used.
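The Mata function below sketches how such a mixed product kernel weight could be computed
for one pair of observations, using an Epanechnikov kernel for the first q1 (continuous)
regressors; it illustrates the formula above rather than the internal implementation of ivqte:

    mata:
    real scalar kweight(real rowvector Xi, real rowvector x,
                        real scalar q1,   real scalar h, real scalar lambda)
    {
        real scalar K, q, u
        K = 1
        for (q = 1; q <= cols(Xi); q++) {
            if (q <= q1) {                            // continuous regressors
                u = (Xi[q] - x[q]) / h
                K = K * 0.75 * (1 - u^2) * (abs(u) < 1)
            }
            else {                                    // unordered and binary regressors
                K = K * (Xi[q] == x[q] ? 1 : lambda)
            }
        }
        return(K)
    }
    end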
In principle, instead of using only two bandwidth values h, λ for all regressors, a different
bandwidth could be employed for each regressor, but doing so would substantially increase
the computational burden of bandwidth selection and might add noise from estimating these
additional parameters. Therefore, we prefer to use only two smoothing parameters. ivqte
automatically orthogonalizes the data matrix of all continuous regressors to create an
identity covariance matrix, which greatly diminishes the appeal of having multiple bandwidths.
This kernel function, combined with a local model, is used to estimate $E(Y|X)$.
If Y is a continuous variable, then ivqte uses by default a local linear estimator to
estimate $E(Y|X = x)$ as $\hat{\alpha}$ in

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{a,b}\sum_{j=1}^{n}\{Y_j - a - b'(X_j - x)\}^2 \times K_{h,\lambda}(X_j - x)$$
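For a single continuous regressor, the same local linear fit at one evaluation point can be
mimicked with a weighted regression; the sketch below assumes variables y and x and arbitrary
values for the evaluation point and bandwidth:

    * Sketch: local linear estimate of E(Y|X = x0) with Epanechnikov weights.
    local x0 = 2
    local h  = 0.5
    tempvar u w xc
    generate double `u'  = (x - `x0')/`h'
    generate double `w'  = 0.75*(1 - `u'^2)*(abs(`u') < 1)
    generate double `xc' = x - `x0'
    regress y `xc' [aweight = `w'] if `w' > 0
    display "local linear estimate at x0: " _b[_cons]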

If Y is bound from above and below, a local logistic model is usually preferred. We
suppose in the following discussion that Y is bound within [0, 1].26 This includes the
special case where Y is binary. The local logit estimator guarantees that the fitted
values are always between 0 and 1; it can be requested by specifying the logit option.
In this case, $E(Y|X = x)$ is estimated by $\Lambda(\hat{\alpha})$, where

$$(\hat{\alpha}, \hat{\beta}) = \arg\max_{a,b}\sum_{j=1}^{n}\Big(Y_j \ln\Lambda\{a + b'(X_j - x)\} + (1 - Y_j)\ln\big[1 - \Lambda\{a + b'(X_j - x)\}\big]\Big) \times K_{h,\lambda}(X_j - x)$$

and $\Lambda(x) = 1/(1 + e^{-x})$.


As mentioned before, each of the unordered discrete variables enters in the form of
a dummy variable for each of its support points except for an arbitrary base category;
for example, if the region variable takes four different values, then three dummies are
included.
The ivqte command requires that the values of the smoothing parameters h and
λ are supplied by the user. Before estimating local linear or local logit with these
smoothing parameters, ivqte (as well as locreg) first attempts to estimate the global
model (that is, with h = ∞ and λ = 1). If estimation fails due to collinearity or perfect
prediction, the regressors which caused these problems are eliminated.27 Thereafter, the
model is estimated locally with the user-supplied smoothing parameters. If estimation
26. If the lower and upper bounds of Y are different from 0 and 1, Y should be rescaled to the interval
[0, 1].
27. This is done using rmcollright, where ivqte first searches for collinearity among the continuous
regressors and thereafter among all other regressors.

fails locally because of collinearity or perfect prediction, the bandwidths are increased
locally. This is repeated until convergence is achieved.
The locreg command also contains a leave-one-out cross-validation procedure to
choose the smoothing parameters.28 The user provides a grid of values for h and λ, and
the cross-validation criterion is computed for all possible combinations of these values.
The values of the cross-validation criterion are returned in r(cross valid) and the
combination that minimizes this criterion is chosen. If only one value is given for h and
λ, no grid search is performed.

B.2 The locreg command


Because the codes implementing the nonparametric regressions are likely to be of inde-
pendent interest in other contexts, we offer a separate command for the local parametric
regressions. This locreg command implements local linear and local logit regression
and chooses the smoothing parameters by leave-one-out cross-validation. The formal
syntax of locreg is as follows:



    locreg depvar [if] [in] [weight] [, generate(newvarname[, replace])
        continuous(varlist) dummy(varlist) unordered(varlist) kernel(kernel)
        bandwidth(# [# # ...]) lambda(# [# # ...]) logit mata_opt
        sample(varname[, replace])]

aweights and pweights are allowed; see [U] 11.1.6 weight for more information
on weights.

B.3 Description
locreg computes the nonparametric estimation of the mean of depvar conditionally on
the regressors given in continuous(), dummy(), and unordered(). A mixed kernel is
used to smooth over the continuous and discrete regressors. The fitted values are saved
in the variable newvarname. If a list of values is given in bandwidth() or lambda(), the
smoothing parameters h and λ are estimated via leave-one-out cross-validation. The
values of h and λ minimizing the cross-validation criterion are selected. These values are
then used to predict depvar, and the fitted values are saved in the variable newvarname.
locreg can be used in three different ways. First, if only one value is given in
bandwidth() and one in lambda(), locreg estimates the nonparametric regression
using these values and saves the fitted values in generate(newvarname). Alternatively,

28. The cross-validated parameters are optimal to estimate the weights but are not optimal to estimate
the unconditional QTE. In the absence of a better method, we offer cross-validation, but the user
should keep in mind that the optimal bandwidths for the unconditional QTE converge to zero at a
faster rate than the bandwidths delivered by cross-validation. The user is therefore encouraged to
also examine the estimated QTE when using some undersmoothing relative to the cross-validation
bandwidths.

locreg can also be used to estimate the smoothing parameters via leave-one-out cross-
validation. If we do not specify the generate() option but supply a list of values in
the bandwidth() or lambda() option, only the cross-validation is performed. Finally, if
several values are specified in bandwidth() or lambda() when the generate() option is
also specified, locreg estimates the optimal smoothing parameters via cross-validation.
Thereafter, it estimates the conditional means with these smoothing parameters and
returns the fitted values in the variable generate(newvarname).
For the nonparametric regression, locreg offers two local models: linear and logistic.
The logistic model is usually preferred if depvar is bound within [0, 1]. This includes
the case where depvar is binary but also incorporates cases where depvar is nonbinary
but bound from above and below. If the lower and upper bounds of depvar are different
from 0 and 1, the variable depvar should be rescaled to the interval [0, 1] before using
this command. If depvar is not bound from above and below, the linear model should
be used.29

B.4 Options

generate(newvarname[, replace]) specifies the name of the variable that will con-
tain the fitted values. If this option is not used, only the leave-one-out cross-
validation estimation of the smoothing parameters h and λ will be performed. The
replace option allows locreg to overwrite an existing variable or to create a new
one where none exists.
continuous(varlist), dummy(varlist), and unordered(varlist) specify the names of the
covariates depending on their type. Ordered discrete variables should be treated as
continuous.
kernel(kernel) specifies the kernel function. kernel may be epan2 (Epanechnikov ker-
nel function; the default), biweight (biweight kernel function), triweight (tri-
weight kernel function), cosine (cosine trace), gaussian (Gaussian kernel func-
tion), parzen (Parzen kernel function), rectangle (rectangle kernel function), or
triangle (triangle kernel function). In addition to these second-order kernels, there
are also several higher-order kernels: epanechnikov_o4 (Epanechnikov order 4),
epanechnikov_o6 (order 6), gaussian_o4 (Gaussian order 4), gaussian_o6 (order
6), gaussian_o8 (order 8).30

bandwidth(# [# # ...]) specifies the bandwidth h used to smooth over the continuous
    variables. The continuous regressors are first orthogonalized such that their
    covariance matrix is the identity matrix. The bandwidth must be strictly positive;
    if it is specified as missing (.), an infinite bandwidth h = ∞ is used. The default
    is h = ∞.

29. In the current implementation, there is not yet a local model specifically designed for depvar that
is bound only from above or only from below. A local tobit or local exponential model may be
added in future versions.
30. The formulas for the higher-order kernel functions are given in footnote 16.

If a list of values is supplied for bandwidth(), cross-validation is used with respect


to each value in this list to estimate the bandwidth among the proposed values. If a
list of values is supplied for bandwidth() and for lambda(), cross-validation consid-
ers all pairwise combinations from these two lists. In case of local multicollinearity,
the bandwidth is progressively increased until the multicollinearity problem disap-
pears.31

lambda(# [# # ...]) is used to smooth over the dummy and unordered discrete


variables. It must be between 0 and 1. A value of 0 implies that only observations
within the cell defined by all discrete regressors are used to estimate the conditional
mean. The default is lambda(1), which corresponds to global smoothing. If a list of
values is supplied for lambda(), cross-validation is used with respect to each value
in this list to estimate the lambda among the proposed values. If a list of values is
supplied for bandwidth() and for lambda(), cross-validation considers all pairwise
combinations from these two lists.
logit activates the local logit estimator. If it is not activated, the local linear estimator
is used as the default.32
mata_opt selects the official optimizer introduced in Stata 10, Mata's optimize(), to
obtain the local logit. The default is a simple Gauss–Newton algorithm written for
this purpose. This option is only relevant when the logit option has been specified.

sample(varname[, replace]) specifies the name of the variable that marks the esti-
mation sample. This is similar to the function e(sample) for e-class commands.

31. In case of multicollinearity, h is increased repeatedly until the problem disappears. If multicollinear-
ity is still present at h = 100, then λ is increased repeatedly. Note that locreg first examined
whether multicollinearity is a problem in the global model (h = ∞, λ = 1 ) before attempting to
estimate locally.
32. This is different from ivqte, where local logit is the default for binary dependent variables.

B.5 Saved results


locreg saves the following in r():
Scalars
r(N) number of observations
r(optb) optimal bandwidth
r(optl) optimal lambda
r(best_mse) smallest value of the cross-validation criterion
Macros
r(command) locreg
r(depvar) name of the dependent variable
r(continuous) name of the continuous covariates
r(dummy) name of the binary covariates
r(unordered) name of the unordered covariates
r(kernel) kernel function
r(model) linear or logistic model used
r(optimization) algorithm used
Matrices
r(cross_valid) bandwidths, lambda, and resulting values of the cross-validation criterion

B.6 Examples
We briefly illustrate the use of locreg with a few examples. We use card.dta and
keep only 200 observations to keep the computational time reasonable for this illus-
tration. (Because of missing values on covariates, eventually only 184 observations are
retained in the estimation.) The aim is to estimate the probability of living near a
four-year college (nearc4) as a function of experience, exper (continuous() variable);
mother’s education, motheduc (ordered discrete); region (unordered()); and black
(dummy()). locreg can be used in three different ways. First, if only one value is
given in bandwidth(#) and one in lambda(#), locreg estimates the nonparametric
regression using these values h and λ and saves the fitted values in newvarname:
. use card, clear
. set seed 123
. sample 200, count
(2810 observations deleted)
. locreg nearc4, generate(fitted1) bandwidth(0.5) lambda(0.5)
> continuous(exper motheduc) dummy(black) unordered(region)
(output omitted )

The fitted1 variable contains the estimated probabilities. Because some of them
turn out to be negative and others to be larger than one, we may prefer to fit a local
logit regression and add the logit option:

. locreg nearc4, generate(fitted2) bandwidth(0.5) lambda(0.5)


> continuous(exper motheduc) dummy(black) unordered(region) logit
(output omitted )

locreg can also be used to estimate the smoothing parameters via leave-one-out
cross-validation. If we do not specify the generate() option but instead supply a list of
values in the bandwidth() or the lambda() option (or both), only the cross-validation
is performed:

. locreg nearc4, bandwidth(0.2 0.5) lambda(0.5 0.8) continuous(exper motheduc)


> dummy(black) unordered(region)
(output omitted )

In this example, the cross-validation criterion is calculated for each of the four cases:
(h, λ) = (0.2, 0.5), (0.2, 0.8), (0.5, 0.5), and (0.5, 0.8). The scalars r(optb) and r(optl)
indicate the values that minimized the cross-validation criterion; in our example, the
selected values are h = 0.2 and λ = 0.5. The cross-validation results for every (h, λ)
combination of the search grid are saved in the matrix r(cross_valid).
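For instance, after the previous call the saved results can be inspected as follows (a
sketch, using the result names listed in section B.5):

    matrix list r(cross_valid)
    display "selected h = " r(optb) "   selected lambda = " r(optl)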
If we would like to include the global parametric model (h = ∞, λ = 1) in our cross-validation
search, we would supply a missing value for h and a value of 1 for λ. For example, specifying
bandwidth(0.2 .) lambda(0.5 1) implies that the cross-validation criterion is calculated for
each of the four cases: (h, λ) = (0.2, 0.5), (0.2, 1), (∞, 0.5), and (∞, 1). Similarly,
specifying bandwidth(0.2 0.5 .) lambda(0.5 0.8 1) implies a search grid with nine values:
(h, λ) = (0.2, 0.5), (0.2, 0.8), (0.2, 1), (0.5, 0.5), (0.5, 0.8), (0.5, 1), (∞, 0.5),
(∞, 0.8), and (∞, 1).
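For instance, the nine-point grid just described would be requested as

    locreg nearc4, bandwidth(0.2 0.5 .) lambda(0.5 0.8 1) ///
        continuous(exper motheduc) dummy(black) unordered(region)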
Finally, if several values are specified for the smoothing parameters and the
generate() option is also activated, then locreg first estimates the optimal h and λ via
cross-validation and thereafter returns the fitted values obtained with these selected
values in the fitted3 variable.

. locreg nearc4, generate(fitted3) bandwidth(0.2 0.5) lambda(0.5 0.8)


> continuous(exper motheduc) dummy(black) unordered(region)
(output omitted )
The Stata Journal (2010) 10, Number 3, pp. 458–481

Translation from narrative text to standard codes variables with Stata

Federico Belotti
University of Rome “Tor Vergata”
Rome, Italy
federico.belotti@uniroma2.it

Domenico Depalo
Bank of Italy
Rome, Italy
domenico.depalo@bancaditalia.it

Abstract. In this article, we describe screening, a new Stata command for data
management that can be used to examine the content of complex narrative-text
variables to identify one or more user-defined keywords. The command is useful
when dealing with string data contaminated with abbreviations, typos, or mistakes.
A rich set of options allows a direct translation from the original narrative string
to a user-defined standard coding scheme. Moreover, screening is flexible enough
to facilitate the merging of information from different sources and to extract or
reorganize the content of string variables.
Editors’ note. This article refers to undocumented functions of Mata, meaning that
there are no corresponding manual entries. Documentation for these functions is
available only as help files; see help regex.
Keywords: dm0050, screening, keyword matching, narrative-text variables, stan-
dard coding schemes

1 Introduction
Many researchers in varied fields frequently deal with data collected as narrative text,
which are almost useless unless treated. For example,

• Electronic patient records (EPRs) are useful for decision making and clinical re-
search only if patient data that are currently documented as narrative text are
coded in standard form (Moorman et al. 1994).
• When different sources of data use different spellings to identify the same unit of in-
terest, the information can be exploited only if codes are made uniform (Raciborski
2008).
• Because of verbatim responses to open-ended questions, survey data items must
be converted into nominal categories with a fixed coding frame to be useful for
applied research.

These are only three of the many critical examples that motivate an ad hoc command.
Recoding a narrative-text variable into a user-defined standard coding scheme is cur-
rently possible in Stata by combining standard data-management commands (for exam-
ple, generate and replace) with regular expression functions (for example, regexm()).



However, many problems do not yield easily to this approach, especially problems con-
taining complex narrative-text data. Consider, for example, the case when many source
variables must be searched for a set of keywords; or the case when, among different
keywords, one lies somewhere within a given source variable but not necessarily at its
beginning, whereas the others must be matched at the beginning, at the end, or within
that or other source variables. Because no command jointly handles all possible cases, these cases
can be treated with existing Stata commands only after long and tedious programming,
increasing the possibility of introducing errors. We developed the screening command
to fill this gap, simplifying data-cleaning operations while being flexible enough to cover
a wide range of situations.
In particular, screening checks the content of one or more string variables (sources)
to identify one or more user-defined regular expressions (keywords). Because string vari-
ables are not flexible, to make the command easier and more useful, a set of options
reduces your preparatory burden. You can make the matching task wholly case in-
sensitive or set matching rules aimed at matching keywords at the beginning, the end,
or within one or more sources. If source variables contain periods, commas, dashes,
double blanks, ampersands, parentheses, etc., it is possible to perform the matching by
removing such undesirable content. Moreover, if the matching task becomes more dif-
ficult because of abbreviations or even pure mistakes, screening allows you to specify
the number of letters to screen in a keyword. Finally, the command allows a direct
translation of the original string variables in a user-defined standard coding scheme.
All these features make the command simple, extremely flexible, and fast, minimizing
the possibility of introducing errors. It is worth emphasizing that we find Mata more
convenient to use than Stata, with advantages in terms of execution time.
The article is organized as follows. In section 2, we describe the new screening
command, and we provide some useful tips in section 3. Section 4 illustrates the main
features of the command using EPR data, while section 5 details some critical cases in
which the use of screening may aid your decision to merge data from different sources
or to extract and reorder messy data. In the last section, section 6, we offer a short
summary.

2 The screening command


String variables are useful in many practical circumstances. A drawback is that they
are not so flexible: for example, in EPR data, coding CHOLESTEROL is different from
coding CHOLESTEROL LDL, although the broad pathology is the same. Stata and Mata
offer many built-in functions to handle strings. In particular, screening extensively
uses the Mata regular-expression functions regexm(), regexr(), and regexs().




2.1 Syntax



    screening [if] [in], sources(varlist[, sourcesopts])
        keys([matching rule] "string" [[matching rule] "string" ...])
        [letters(#) explore(type) cases(newvar) newcode(newvar[, newcodeopts])
        recode(recoding rule "user defined code"
            [recoding rule "user defined code" ...])
        checksources tabcheck memcheck nowarnings save time]

2.2 Options

sources(varlist[, sourcesopts]) specifies one or more string source variables to be


screened. sources() is required.

sourcesopts description
lower perform a case-insensitive match (lowercase)
upper perform a case-insensitive match (uppercase)
trim match keywords by removing leading and trailing blanks
from sources
itrim match keywords by collapsing sources with consecutive
internal blanks to one blank
removeblank match keywords by removing from sources all blanks
removesign match keywords by removing from sources the following
signs: * + ? / \ % ( ) [ ] { } | . ^ - _ # $

keys([matching rule] "string" [...]) specifies one or more regular expressions (key-
words) to be matched with source variables. keys() is required.

matching rule description


begin match keywords at beginning of string
end match keywords at end of string

letters(#) specifies the number of letters to be matched in a keyword. The number


of letters can play a critical role: specifying a high number of letters may cause
the number of matched observations to be artificially low because of mistakes or
abbreviations in the source variables; on the other hand, matching a small number
of letters may cause the number of matched observations to be artificially high
because of the inclusion of uninteresting cases containing the “too short” keyword.
The default is to match keywords as a whole.

explore(type) allows you to explore screening results.

type description
tab tabulate all matched cases for each keyword within each source variable
count display a table of frequency counts of all matched cases for each
keyword within each source variable

cases(newvar) generates a set of categorical variables (as many as the number of key-
words) showing the number of occurrences of each keyword within all specified source
variables.

newcode(newvar[, newcodeopts]) generates a new (numeric) variable that contains the


position of the keywords or the regular expressions in keys(). The coding process
is driven by the order of keywords or regular expressions.

newcodeopts description
replace replace newvar if it already exists
add obtain newvar as a concatenation of subexpressions returned by
regexs(n), which must be specified as a
user defined code in recode
label attach keywords as value labels to newvar
numeric convert newvar from string to numeric; it can be specified only if
the recode() option is specified

recode(recoding rule "user defined code" [recoding rule "user defined code" ...])
recodes the newcode() newvar according to a user-defined coding scheme. recode()
must contain at least one recoding rule followed by one user defined code. When you
specify recode(1 "user defined code"), the "user defined code" will be used to re-
code all matched cases from the first keyword within the list specified via the keys()
option. If recode(2,3 "user defined code") is specified, the "user defined code" will
be used to recode all cases for which second and third keywords are simultaneously
matched, and so on. This option can only be specified if the newcode() option is
specified.
checksources checks whether source variables contain special characters. If a match-
ing rule is specified (begin or end via keys()), checksources checks the sources’
boundaries accordingly.
tabcheck tabulates all cases from checksources. If there are too many cases, the
option does not produce a table.
memcheck performs a “preventive” memory check. When memcheck is specified, the
command will exit promptly if the allocated memory is insufficient to run screening.
When memory is insufficient and screening is run without memcheck, the command
could run for several minutes or even hours before producing the message no room
to add more variables.

nowarnings suppresses all warning messages.


save saves in r( ) the number of cases detected, matching each source with each key-
word.
time reports elapsed time for execution (seconds).

3 Tips
The low flexibility of string variables is a reason for concern. In this section, we provide
some tips to enhance the usefulness of screening. Some tips are useful to execute the
command, while other tips are useful to check the results.
Most importantly, capitalization matters: this means that screening for KEYWORD is
different from screening for keyword. If source variables contain HEMINGWAY and you are
searching for Hemingway, screening will not identify the keyword. If suboption upper
(lower) is specified in sources(), keywords will be automatically matched in uppercase
(lowercase).
Choose an appropriate matching rule. The screening default is to match keywords
over the entire content of source variables. By specifying the matching rule begin or
end within the keys() option, you may switch accordingly the matching rule on string
boundaries. For example, if sources contain HEMINGWAY ERNEST and ERNEST HEMINGWAY
and you are searching begin HEMINGWAY, the screening command will identify the
keyword only in the former case. Whether the two cases are equivalent must be evaluated
case by case.
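As a sketch (author_name is a hypothetical source variable), the begin rule is requested as
part of the keys() option:

    screening, sources(author_name) keys(begin "HEMINGWAY") explore(tab)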
Another issue is how to choose the optimal number of letters to be screened. For
example, with EPR data, different physicians might use different abbreviations for the
same pathologies. And so talking about a “right” number of letters is nonsense. As
a rule of thumb, the number of letters should be specified as the minimum number
that uniquely identifies the case of interest. Using many letters can be too exclusive,
while using few letters can be too inclusive. In all cases, but in particular when the
appropriate number of letters is unknown, we find it useful to tabulate all matched cases
through the explore(tab) option. Because it tabulates all possible matches between
all keywords and all source variables, it is the fastest way to explore the data and choose
the best matching strategy (in terms of keywords, matching rule, and letters).
Advanced users can maximize the potential of screening by mixing keywords
with Stata regular-expression operators. Mixing in operators allows you to match more-
complex patterns, as we show later in the article.1 For more details on regular-expression
syntaxes and operators, see the official documentation at
http://www.stata.com/support/faqs/data/regex.html.

1. The letters() option does not work if a keyword contains regular-expression operators.
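As a sketch of such a mixed pattern, the call below searches the EPR source variable used in
section 4 for "colest" followed by zero or more letters, periods, or blanks and then "tot",
so it would catch, for example, colest.tot. and colesterolo totale; recall from the footnote
that letters() cannot be combined with keywords containing regular-expression operators:

    screening, sources(diagn_test_descr, lower) keys("colest[a-z. ]*tot") explore(tab)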

screening displays several messages to inform you about the effects of the specified
options. For example, consider the case in which you are searching some keywords con-
taining regular-expression operators. screening will display a message with the correct
syntax to search a keyword containing regular-expression operators. The nowarnings
option allows you to suppress all warning messages.
screening generates several temporary variables (proportional to the number of
keywords you are looking for and to the number of sources you are looking from). So
when you are working with a big dataset and your computer is limited in terms of
RAM, it might be a good idea to perform a “preventive” memory check. When the
memcheck option is specified and the allocated memory is insufficient, screening will
exit promptly rather than running for several minutes or even hours before producing
the message no room to add more variables.
We conclude this section with an evaluation of the command in terms of execution
time using different Stata flavors and different operating systems. In particular, we
compare the latest version of screening written using Mata regular-expression func-
tions with its beta version written entirely using the Stata counterpart. We built three
datasets of 500,000 (A), 5 million (B), and 50 million (C) observations with an ad hoc
source variable containing 10 different words: HEMINGWAY, FITZGERALD, DOSTOEVSKIJ,
TOLSTOJ, SAINT-EXUPERY, HUGO, CERVANTES, BUKOWSKI, DUMAS, and DESSI. Screening
for HEMINGWAY (50% of total cases) gives the following results (in seconds):

Stata flavor and                           Mata                    Stata
operating system                      A      B      C         A      B      C

Stata/SE 10 (32-bit) and
Mac OS X 10.5.8 (64-bit)*           0.66   6.67    na       0.93   9.24    na
Stata/MP 11 (64-bit) and
Mac OS X 10.5.8 (64-bit)*           0.60   5.66    na       0.85   7.73    na
Stata/MP 11 (64-bit) and
Windows Server 2003 (64-bit)+       0.37   3.70  37.22      0.70   7.06  70.59

* Intel Core 2 Duo 2.2 GHz (dual core) with 4 GB RAM
+ AMD Opteron 2.2 GHz (quad core) with 20 GB RAM

The table speaks for itself!

4 Example
To illustrate the command, we use anonymized patient-level data from the Health Search
database, a nationally representative panel of patients run by the Italian College of
General Practitioners (Italian Society of General Medicine). Our sample contains freely
inputted EPRs concerning the prescription of diagnostic tests.2 A list of 15 observations

2. The original data are in Italian. Where necessary for comprehension, we translate to English.

from the “uppercase” source variable diagn_test_descr provides an overview of
cases at hand:

. list diagn_test_descr in 1/15, noobs separator(20)

diagn_test_descr

TRIGLICERIDI
EMOCROMO FORMULA
COLESTEROLO TOTALE
ALTEZZA
PT TEMPO PROTROMBINA
VISITA CARDIOLOGICA CONTROLLO
HCV AB EPATITE C
COMPONENTE MONOCLONALE
ATTIVITA´ FISICA
PSA ANTIGENE PROSTATICO SPECIFICO
RX CAVIGLIA SN
FAMILIARITA´ K UTERO
TRIGLICERIDI
URINE ESAME COMPLETO
URINE PESO SPECIFICO

As you can see, this is a rich EPR dataset that is totally useless unless treated. If
data were collected for research purposes, physicians would be given a finite number of
possible options. There is much agreement in the scientific community that the cost of
leaving the burden of inputting standard codes directly to physicians at the time of contact
with the patient outweighs the relative benefit: the task is extremely onerous, it is
unrelated to the physician’s primary job, and most importantly, it requires extra effort.
Therefore, the common view supports the implementation of data-entry methods that
do not disturb the physician’s workflow (Yamazaki and Satomura 2000).
From the above list of observations, it is also clear that free-text data entry provides
physicians with the freedom to determine the order and detail at which they want
to input data. Even if the original free-text data were complete, it would still be
difficult to extract standardized and structured data from this kind of record because
of abbreviations, typos, or mistakes (Moorman et al. 1994). Extracting data in the
presence of abbreviations and typos is exactly what screening allows you to do.
As a practical example, we focus on the identification of different types of cholesterol
tests. In particular, our aim is to create a new variable (diagn test code) containing
cholesterol test codes according to the Italian National Health System coding scheme.
Because at least three types of cholesterol test exist, namely, hdl, ldl, and total, our
matching strategy must take into account that a physician can input 1) only the types
of the test, 2) only its broad definition (cholesterol), or 3) both, without considering
abbreviations, typos, mistakes, and further details.

Thus we first explore the data by running screening with the explore(tab) option:

. screening, sources(diagn_test_descr, lower) keys(colesterolo) explore(tab)


Cases of colesterolo found in diagn_test_descr
colesterolo Freq. Percent Cum.

colesterolo totale 2,954 51.86 51.86


hdl colesterolo 1,854 32.55 84.41
ldl colesterolo 617 10.83 95.24
colesterolo hdl 117 2.05 97.30
colesterolo ldl 37 0.65 97.95
colesterolo tot 28 0.49 98.44
colesterolo 24 0.42 98.86
colesterolo hdl sangue 16 0.28 99.14
colesterolo totale sangue 16 0.28 99.42
colesterolo esterificato 4 0.07 99.49
colesterolo tot. 4 0.07 99.56
colesterolo hdl 90.14.1 3 0.05 99.61
colesterolo totale 90143 3 0.05 99.67
colesterolo libero 2 0.04 99.70
colesterolo stick 2 0.04 99.74
colesterolo tot hdl 2 0.04 99.77
colesterolo totale 90.143 2 0.04 99.81
ultima misurazione colesterolo 2 0.04 99.84
colesterolo hdl 1 0.02 99.86
colesterolo ldl 90.14.2 1 0.02 99.88
colesterolo non ldl 1 0.02 99.89
colesterolo t. mg/dl 1 0.02 99.91
colesterolo tot. c 1 0.02 99.93
colesterolo tot. hdl 1 0.02 99.95
colesterolo tot., 1 0.02 99.96
colesterolo totale h 1 0.02 99.98
rich,specialistica colesterolo trigl 1 0.02 100.00

Total 5,696 100.00

Here the lower suboption makes the matching task case insensitive. Apart from the
explore(tab) option, the syntax above is compulsory and performs what we call a
default matching, that is, an exact match of the keyword colesterolo over the entire
content of the source variable diagn test descr. The tabulation above (notice the
lowercase) informs you that the keyword colesterolo is encountered in 5,696 cases.
What do these cases contain? Because you did not instruct the command to match a
shorter length of the keyword, the only possible case is the keyword itself; all the cases
contain the keyword colesterolo.
Given the nature of the data, it might be convenient to run screening with a
shorter length of the keyword so as to find possible partial matching in the presence
of abbreviations or mistakes. The letters(#) option instructs screening to perform
the match on a shorter length:




. screening, sources(diagn_test_descr, lower) keys(colesterolo) letters(5)


> explore(tab)
Cases of coles found in diagn_test_descr
coles Freq. Percent Cum.

colesterolo totale 2,954 37.25 37.25


hdl colesterolo 1,854 23.38 60.62
coles ldl 1,343 16.93 77.56
hdl colest 853 10.76 88.31
ldl colesterolo 617 7.78 96.09
colesterolo hdl 117 1.48 97.57
colesterolo ldl 37 0.47 98.03
colesterolo tot 28 0.35 98.39
colesterolo 24 0.30 98.69
colesterolo hdl sangue 16 0.20 98.89
colesterolo totale sangue 16 0.20 99.09
colesterolemia 14 0.18 99.27
hdl colest. 5 0.06 99.33
colest.tot. 4 0.05 99.38
colesterolo esterificato 4 0.05 99.43
colesterolo tot. 4 0.05 99.48
azotemia glicemia colest 3 0.04 99.52
colest. hdl 3 0.04 99.56
colesterolo hdl 90.14.1 3 0.04 99.60
colesterolo totale 90143 3 0.04 99.63
colesterolo libero 2 0.03 99.66
colesterolo stick 2 0.03 99.68
colesterolo tot hdl 2 0.03 99.71
colesterolo totale 90.143 2 0.03 99.74
ldl colest. 2 0.03 99.76
ultima misurazione colesterolo 2 0.03 99.79
colest. ldl 1 0.01 99.80
colest. tot. 1 0.01 99.81
colest.tot 1 0.01 99.82
colester.tot.hdl, 1 0.01 99.84
colesterolo hdl 1 0.01 99.85
colesterolo ldl 90.14.2 1 0.01 99.86
colesterolo non ldl 1 0.01 99.87
colesterolo t. mg/dl 1 0.01 99.89
colesterolo tot. c 1 0.01 99.90
colesterolo tot. hdl 1 0.01 99.91
colesterolo tot., 1 0.01 99.92
colesterolo totale h 1 0.01 99.94
emocromo c. colester 1 0.01 99.95
glicemia colesterolemia- 1 0.01 99.96
got gpt colest / trigli/creat/emocromo 1 0.01 99.97
rich,specialistica colesterolo trigl 1 0.01 99.99
uricemia uricuria colest 1 0.01 100.00

Total 7,931 100.00

By specifying a five-letter partial match, screening detects 2,235 new cases of


cholesterol tests. By further reducing the number of letters, we get the following result:3

3. Because of space restrictions, we deliberately omit the complete tabulation obtainable with the
explore(tab) option. It is available upon request.

. screening, sources(diagn_test_descr, lower) keys("colesterolo") letters(3)


> explore(tab)
Cases of col found in diagn_test_descr
col Freq. Percent Cum.

colesterolo totale 2,954 23.45 23.45


col tot 2,034 16.15 39.60
hdl colesterolo 1,854 14.72 54.32
coles ldl 1,343 10.66 64.99
hdl colest 853 6.77 71.76
ldl colesterolo 617 4.90 76.66
urinocoltura coltura urina 326 2.59 79.25
v.ginecologica 161 1.28 80.52
eco tiroide eco capo e collo 150 1.19 81.71
colesterolo hdl 117 0.93 82.64
(output omitted )
colesterolo ldl 37 0.29 90.77
calcolo rischio cardiovascolare (iss) 35 0.28 91.04
coprocoltura coltura feci 33 0.26 91.31
colore 32 0.25 91.56
ecocolordoppler arti inf. art. 32 0.25 91.81
urinocoltura 32 0.25 92.07
colposcopia 31 0.25 92.31
colesterolo tot 28 0.22 92.54
reticolociti 28 0.22 92.76
ecodoppler a.inferiori ecocolor venosa 27 0.21 92.97
eco ginecologica 25 0.20 93.17
colesterolo 24 0.19 93.36
rischio cardio vascolare nota 13 23 0.18 93.55
rischio cardiovascolare % a 10 anni 22 0.17 93.72
ecodoppler a.inferiori ecocolor arter. 19 0.15 93.87
(output omitted )
col hdl 3 0.02 97.13
colest. hdl 3 0.02 97.16
colesterolo hdl 90.14.1 3 0.02 97.18
colesterolo totale 90143 3 0.02 97.21
conta batt.,urinocoltura, antibiogramma 3 0.02 97.23
eco cardiaca con doppler e colordoppler 3 0.02 97.25
eco color/doppl.car. ver 3 0.02 97.28
eco(color)dopplergrafia 3 0.02 97.30
ecocardiografia colordoppler 3 0.02 97.32
ecocolordoppler art.aa.inf. 3 0.02 97.35
ecocolordoppler arterioso arti inferior 3 0.02 97.37
ecocolordoppler tronchi sovraortici 3 0.02 97.40
ecocolordopplergrafia cardiaca 3 0.02 97.42
ecografia muscolotendinea 3 0.02 97.44
ecografia tiroide eco capo e collo 3 0.02 97.47
familiarita´ ev.cerebrovascol.( 72m 74f 3 0.02 97.49
immunocomplessi circolanti 3 0.02 97.51
rx digerente (tenue e colon) 3 0.02 97.54
test broncodilatazione farmacologica 3 0.02 97.56
test cardiovascolare da sforzo con cicl 3 0.02 97.59
test sforzo cardiovascol. pedana mobile 3 0.02 97.61
urinocoltura atb+mic 3 0.02 97.63
urinocoltura con antibiogramma 3 0.02 97.66
urinocoltura identificazione batt.+ ab 3 0.02 97.68
che colinesterasi 2 0.02 97.70
col 2 0.02 97.71

colangio rm 2 0.02 97.73


colesterolo libero 2 0.02 97.75
colesterolo stick 2 0.02 97.76
colesterolo tot hdl 2 0.02 97.78
colesterolo totale 90.143 2 0.02 97.79
(output omitted )
ldl colest. 2 0.02 98.11
(output omitted )
col tot 216 hdl 58 fibri 1 0.01 98.48
col=245ldl=193tr=91 1 0.01 98.48
colangiografia intravenosa 1 0.01 98.49
colecistografia 1 0.01 98.50
colecistografia per os c 1 0.01 98.51
colest. ldl 1 0.01 98.52
colest. tot. 1 0.01 98.52
colest.tot 1 0.01 98.53
colester.tot.hdl, 1 0.01 98.54
colesterolo hdl 1 0.01 98.55
colesterolo ldl 90.14.2 1 0.01 98.55
colesterolo non ldl 1 0.01 98.56
colesterolo t. mg/dl 1 0.01 98.57
colesterolo tot. c 1 0.01 98.58
colesterolo tot. hdl 1 0.01 98.59
colesterolo tot., 1 0.01 98.59
colesterolo totale h 1 0.01 98.60
colloquio psicologico 1 0.01 98.61
(output omitted )
hdl col 1 0.01 99.22
(output omitted )
visita specialistica colonscopia con bi 1 0.01 99.99
yersinia coltura feci 1 0.01 100.00

Total 12,595 100.00

Again screening detects new cases: 2,034 cases characterized by the abbreviation
col tot (that is, total cholesterol) that are impossible to identify without further re-
ducing the number of letters. The problem is that, among all matched cases (12,595),
there are also a number of unwanted cases, that is, cases containing the same spelling
of the keyword but related to another type of diagnostic test. Despite this incorrect
identification, we will show later in the section how to obtain a new “recoded variable”
by specifying the appropriate recoding rule as an argument of the recode() option.
The number of letters you match plays a critical role: specifying a high number
of letters may cause the number of matched observations to be artificially low due to
mistakes or abbreviations in the source variables; on the other hand, matching a small
number of letters may cause the number of matched observations to be artificially high
due to the inclusion of uninteresting cases containing the “too short” keyword.

As mentioned above, we are interested in the identification of three types of choles-


terol tests. To achieve this objective, in what follows we focus on a set of four keywords
(totale, colesterolo, ldl, hdl) with three identifying letters. We also specify the
newcode() option to generate a new variable recoding the observations that match the
specified keywords.
At this point, we describe the recoding mechanism of screening in more detail:

• If newcode() is specified, a new variable is generated, taking as values the position


of the keywords or regular expressions specified through the keys() option. The
coding process is driven by the order of keywords or regular expressions.

• If recode() is specified, the newcode() newvar suboption is recoded according to


the user-defined coding scheme.

Thus a first recoding of the source variable can be obtained as follows:

. screening, sources(diagn_test_descr, lower)


> keys("totale" "colesterolo" "ldl" "hdl") letters(3 3 3 3) explore(count)
> newcode(tmp_diagn_test_code)
Source Key Freq. Percent

diagn_test_descr tot 7304 29.47


col 12595 50.81
ldl 2015 8.13
hdl 2872 11.59

Total 24786 100.00


. tabulate tmp_diagn_test_code
tmp_diagn_t
est_code Freq. Percent Cum.

1 7,304 49.15 49.15


2 7,535 50.70 99.85
3 12 0.08 99.93
4 11 0.07 100.00

Total 14,862 100.00

The explore(count) option instructs screening to display a table of frequency


counts of all matched cases. The newcode() option creates tmp diagn test code, which
is a new variable that takes as values the position of the keywords or regular expressions
specified through the keys() option. The coding process is driven by the order of
keywords or regular expressions: the number 1 is associated with the 7,304 observations
matching the first keyword, tot; the number 2 is associated with the 7,535 observations
matching the second keyword, col; and so on. Hence, by specifying keys("totale"
"colesterolo" "ldl" "hdl") together with letters(3 3 3 3), tot takes precedence
over col in the recoding process. This means that if some observations are recoded
according to the first keyword match, they will not be recoded according to the following
keywords in the keys() list, even if they match.

For this reason, the best recoding strategy is to first specify keywords that uniquely
identify the cases of interest. Because keywords hdl and ldl each uniquely identify a
cholesterol test, they must have priority in the recoding process over totale, which is
an extension common to other pathologies.
Indeed, when we reverse the order of the keywords and specify the replace suboption
in the newcode() option, screening produces

. screening, sources(diagn_test_descr, lower)


> keys("hdl" "ldl" "colesterolo" "totale") letters(3 3 3 3)
> newcode(tmp_diagn_test_code, replace)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
. tabulate tmp_diagn_test_code
tmp_diagn_t
est_code Freq. Percent Cum.

1 2,872 19.32 19.32


2 2,015 13.56 32.88
3 7,731 52.02 84.90
4 2,244 15.10 100.00

Total 14,862 100.00

where the newcode() variable now identifies all hdl and ldl cases. Notice that here
we followed the correct approach, from specific to general. Moreover, as shown by the
following code, when we specify the newcode() suboption label, screening attaches
the specified keywords as value labels to the newcode() variable.

. screening, sources(diagn_test_descr, lower)


> keys("hdl" "ldl" "colesterolo" "totale") letters(3 3 3 3)
> newcode(tmp_diagn_test_code, replace label)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
. tabulate tmp_diagn_test_code
tmp_diagn_t
est_code Freq. Percent Cum.

hdl 2,872 19.32 19.32


ldl 2,015 13.56 32.88
colesterolo 7,731 52.02 84.90
totale 2,244 15.10 100.00

Total 14,862 100.00

The last step toward recoding is achieved by using the recode() option. This option
allows you to recode the newcode() variable according to a user-defined coding scheme.
When you specify this option, the coding process is completely under your control.
The recode() option requires a recoding rule followed by a "user defined code" (the
"user defined code" must be enclosed within double quotes).
When we specify recode(1 "90.14.1" ...), the standard code "90.14.1" will
be used to recode all matched cases from the first keyword (hdl); when we specify

recode(... 2 "90.14.2" ...), the standard code "90.14.2" will be used to recode
all matched cases from the second keyword (ldl); and so on. The third and fourth
keywords deserve special attention. totale (which was specified as the fourth keyword,
hence position 4) is a common extension that we want to identify only when it is
matched simultaneously with colesterolo (which was specified as the third keyword,
hence position 3). Thus the appropriate syntax in this case will be recode(... 3,4
"90.14.3" ...). Finally, when we specify recode(... 3 "not class. tests"), the
code "not class. tests" will be used to recode all matched cases from the third
keyword (colesterolo) that are not classified because they do not contain any further
specification.
The final syntax of our example is
. screening, sources(diagn_test_descr, lower)
> keys("hdl" "ldl" "colesterolo" "totale") letters(3 3 3 3)
> newcode(diagn_test_code)
> recode(1 "90.14.1" 2 "90.14.2" 3,4 "90.14.3" 3 "not class. tests")
. tabulate diagn_test_code
diagn_test_code Freq. Percent Cum.

90.14.1 2,872 22.76 22.76


90.14.2 2,015 15.97 38.73
90.14.3 5,055 40.06 78.79
not class. tests 2,676 21.21 100.00

Total 12,618 100.00

As the tabulate command shows, the new variable diagn_test_code is created


according to the user-defined codes. Notice that only 5,055 cases are coded as “total
cholesterol” (90.14.3). A two-way tabulate command (below) helps to highlight that
2,244 cases have to be considered incorrect identifications—that is, cases containing the
same spelling of the keywords (totale) but related to other types of diagnostic tests4
—whereas 2,676 are incomplete because they contain only colesterolo without further
specification.
. tabulate diagn_test_code tmp_diagn_test_code if tmp_diagn_test_code !=., m
tmp_diagn_test_code
diagn_test_code hdl ldl colestero totale Total

0 0 0 2,244 2,244
90.14.1 2,872 0 0 0 2,872
90.14.2 0 2,015 0 0 2,015
90.14.3 0 0 5,055 0 5,055
not class. tests 0 0 2,676 0 2,676

Total 2,872 2,015 7,731 2,244 14,862

This example shows that screening is a simple tool to manage complex string vari-
ables. Once you have obtained structured data (in our example, a categorical variable
indicating cholesterol tests), you can finally start your statistical analysis.
4. Because of space restrictions, we deliberately omit the tabulation of such cases. It is available upon
request.

5 Extensions
Although the main utility of screening is the direct translation of complex narrative-
text variables in a user-defined coding scheme, the command is flexible enough to cover
a wide range of situations. In section 5.1, we present an example of how to use the
command to facilitate the merging of information from different sources, while in sec-
tion 5.2, we show how to use screening to extract or rearrange a portion of a string
variable.

5.1 Merging from different sources


In applied studies, a classic problem comes from trying to merge information from dif-
ferent sources that use different codes for the same units. A recently released command,
kountry (Raciborski 2008), is an important step toward a solution.
The kountry command can be used to facilitate the merging of information from
different sources by recoding a string variable into a standardized form. This recoding is
possible using a custom dictionary created through a helper command.5 In this section,
we show an alternative way to merge information from different sources by using the
screening command.
As an example, we try to merge two Italian datasets, one provided by the National
Statistical Office (National Institute of Statistics in Italy) and the other provided by the
Italian Ministry of the Interior. The two datasets contain, for each Italian municipality,
the complete name and an alphanumeric code, the latter being different across sources.
In theory, with the (uniquely identified) name of each municipality, it should be easy to
merge the two datasets.
We first proceed by matching the two original datasets:

. use istat, clear


. sort comune
. merge m:m comune using ministero
(output omitted )
. tabulate _merge
_merge Freq. Percent Cum.

master only (1) 288 3.43 3.43


using only (2) 290 3.46 6.89
matched (3) 7,812 93.11 100.00

Total 8,390 100.00

5. See help kountryadd (if kountry is installed).



As you can see, there are 288 inconsistencies.6 When we tabulate the unmatched
cases, we realize that unconventional expressions, such as apostrophes, accents, double
names, etc., are responsible for this imperfect result:

. preserve
. sort comune
. drop if _merge==3
(7812 observations deleted)
. list comune _merge in 1/20, separator(20) noobs

comune _merge

AGLIE´ 2
AGLI 1
ALA´ DEI SARDI 2
ALBISOLA MARINA 2
ALBISOLA SUPERIORE 2
ALBISSOLA MARINA 1
ALBISSOLA SUPERIORE 1
ALI´ 2
ALI´ TERME 2
ALLUVIONI CAMBIO´ 2
ALLUVIONI CAMBI 1
ALME´ 2
ALM 1
AL DEI SARDI 1
AL 1
AL TERME 1
ANTEY-SAINT-ANDRE´ 2
ANTEY-SAINT-ANDR 1
APPIANO SULLA STRADA DEL 2
APPIANO SULLA STRADA DEL VINO 1

. restore

If you wish to recover all 288 unmatched municipalities, the proposed command is
a simple and fast solution. Indeed, when you take advantage of the available options,
you can (almost) completely recover unmatched cases with only one command. As an
example, we recover nine cases (it is possible to recover all cases with this procedure),
with a loop running over values of _merge equal to 1 or 2, that is, running only on
unmatched cases:

6. The number of unmatched cases is different between the master (288) and the using (290) datasets
because of aggregation and separation of municipalities. Solving this kind of problem is beyond
the illustrative scope of this example.

. forvalues i=1/2 {
2. preserve
3. keep if _merge==`i'
4.
. screening, sources(comune) keys("ALBISSOLA" "AQUILA D´ARROSCIA" "BAJARDO"
> "BARCELLONA" "BARZAN" "BRIGNANO" "CADERZONE" "CAVAGLI" "MARINA" "SUPERIORE")
> cases(cases) newcode(comune, replace)
> recode(1,9 "ALBISOLA MARINA" 1,10 "ALBISOLA SUPERIORE" 2 "AQUILA DI ARROSCIA"
> 3 "BAIARDO" 4 "BARCELLONA POZZO DI GOTTO" 5 "BARZANO´" 6 "BRIGNANO FRASCATA"
> 7 "CAVAGLIA" 8 "CADERZONE TERME")
5. if `i´==1 drop codice_ente
6. if `i´==2 drop codice
7. keep comune codice
8. sort comune
9. save new_`i´,replace
10. restore
11. }
(8102 observations deleted)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
(note: file new_1.dta not found)
file new_1.dta saved
(8100 observations deleted)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
(note: file new_2.dta not found)
file new_2.dta saved
. keep if _merge==3
(578 observations deleted)
. save perfect_match, replace
(note: file perfect_match.dta not found)
file perfect_match.dta saved
. use new_1, clear
. merge 1:1 comune using new_2
(output omitted )
. tabulate _merge
_merge Freq. Percent Cum.

master only (1) 279 49.03 49.03


using only (2) 281 49.38 98.42
matched (3) 9 1.58 100.00

Total 569 100.00


. append using perfect_match
. tabulate _merge
_merge Freq. Percent Cum.

master only (1) 279 3.33 3.33


using only (2) 281 3.35 6.68
matched (3) 7,821 93.32 100.00

Total 8,381 100.00



Because we deliberately recovered only nine cases, the number of exact matches improves by exactly nine relative to the original merge, from 7,812 to 7,821.

5.2 Extracting a piece of a string variable


In this section, we show through three examples how screening can be used to extract
or rearrange a portion of a string variable.7

Example 1

Imagine you have the string variable address, and you want to create a new variable
that contains just the zip codes. Here is what the source variable address may look
like:

. list, noobs sep(10)

address

4905 Lakeway Drive, College Station, Texas 77845 USA


673 Jasmine Street, Los Angeles, CA 90024
2376 First street, San Diego, CA 90126
66666 West Central St, Tempe AZ 80068
12345 Main St. Cambridge, MA 01238-1234
12345 Main St Sommerville MA 01239-2345
12345 Main St Watertwon MA 01239 USA

To find the zip code, you have to use screening with specific regular expressions that allow it to match all cases in the source variable address exactly. Some examples of such regular expressions are the following (a short interactive check of their behavior follows the list):

• ([0-9][0-9][0-9][0-9][0-9]) to find a five-digit number, the zip code

• [\-]* to match zero or more dashes, - or - -

• [0-9]* to match zero or more numbers, that is, the zip code plus any other
numbers

• [ a-zA-Z]* to match zero or more blank spaces and (lowercase or uppercase) letters
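To see how these pieces behave before wiring them into screening, you can test them directly with Stata's regexm() and regexs() functions. The following lines are our own minimal sketch, not part of the original example; the comments record what each call should display for the College Station address listed above.

* sketch: test the combined pattern against one address
display regexm("4905 Lakeway Drive, College Station, Texas 77845 USA", ///
    "([0-9][0-9][0-9][0-9][0-9])[\-]*[0-9]*[ a-zA-Z]*$")
* displays 1, because the pattern matches
display regexs(0)
* displays 77845 USA, the entire matched string (subexpression 0)
display regexs(1)
* displays 77845, the first parenthesized subexpression: the zip code itself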

Once the correct regular expression(s) is found, to use screening to create a new
variable containing the zip codes, you have to do the following:

7. The following examples have been taken from the UCLA website resources to help you learn and
use Stata. See http://www.ats.ucla.edu/stat/stata/faq/regex.htm.

1. Use the newcode() option to create the new variable zipcode.

2. Combine the above regular expressions and use them as a unique keyword.

3. Use the regexs(n) function as a "user defined code" in the recode() option.
regexs(n) returns the subexpression n from the respective keyword match, where
0 ≤ n ≤ 10. Stata regular-expression syntaxes use parentheses, (), to denote
a subexpression group. In particular, n = 0 is reserved for the entire string
that satisfied the regular expression (keyword); n = 1 is reserved for the first
subexpression that satisfied the regular expression (keyword); and so on.

Hence, you may code


. screening, sources(address)
> keys("([0-9][0-9][0-9][0-9][0-9])[\-]*[0-9]*[ a-zA-Z]*$")
> cases(c) newcode(zipcode) recode(1 "regexs(1)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesn´t work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. tabulate zipcode
zipcode Freq. Percent Cum.

01238 1 14.29 14.29


01239 2 28.57 42.86
77845 1 14.29 57.14
80068 1 14.29 71.43
90024 1 14.29 85.71
90126 1 14.29 100.00

Total 7 100.00

where recode(1 "regexs(1)") indicates that

1. 1 is the recoding rule; that is, the coding process is related to the first (and unique)
keyword match.

2. regexs(1) is used to recode. Indeed, it returns the string related to the first (and
unique) subexpression match.8

As a result, the new variable zipcode is created by using only one line of code.
Notice that screening warns you that you are matching a keyword containing one or
more regular-expression operators.

8. Remember that subexpressions are denoted by using (). In the considered syntax, the only subex-
pression is represented by ([0-9][0-9][0-9][0-9][0-9]). This means that, in this case, you cannot
specify n > 1.

Example 2

Suppose you have a variable containing a person’s full name. Here is what the variable
fullname looks like:

. list, noobs sep(10)

fullname

John Adams
Adam Smiths
Mary Smiths
Charlie Wade

Our goal is to swap first name with last name, separating them by a comma. The
regular expression to reach the target is ([a-zA-Z]+)[ ]*([a-zA-Z]+). It is composed of three parts:

1. ([a-zA-Z]+) to capture a string consisting of letters (lowercase and uppercase), that is, the first name

2. [ ]* to match zero or more spaces, that is, the blank between the first and last name

3. ([a-zA-Z]+) again to capture a string consisting of letters, this time the last
name

The following is a way to proceed using screening:


. screening, sources(fullname)
> keys("([a-zA-Z]+)[ ]*([a-zA-Z]+)" "[ ]" "([a-zA-Z]+)[ ]*([a-zA-Z]+)")
> newcode(fullname, add replace) recode(1 "regexs(2)," 2 "regexs(0)"
> 3 "regexs(1)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesn´t work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. list fullname, noobs sep(10)

fullname

Adams, John
Smiths, Adam
Smiths, Mary
Wade, Charlie

Notice the newcode() suboption add. It can be specified only when a regexs(n)
function is specified as a "user defined code" in the recode() option. The add suboption
allows for the creation of the newcode() variable as a concatenation of subexpressions
returned by regexs(n). In the example above,

1. recode(1 "regexs(2)," ... returns the second subexpression from the first
keyword match (the last name) plus a comma.

2. ...2 "regexs(0)" ... returns the blank matched by the second keyword;

3. ...3 "regexs(1)") returns the first subexpression from the third keyword match
(the first name).

As a result, the variable fullname is replaced (note the suboption replace) sequen-
tially by the concatenation of subexpressions returned by 1, 2, and 3 above.

Example 3

Imagine that you have the string variable date containing dates:
. list date, noobs sep(20)

date

20jan2007
16June06
06sept1985
21june04
4july90
9jan1999
6aug99
19august2003

The goal is to produce a string variable with the appropriate four-digit year for each
case, which Stata can easily convert into a date. You can achieve the target by coding
something like the following:
. generate day = regexs(0) if regexm(date, "^[0-9]+")
. generate month = regexs(0) if regexm(date, "[a-zA-Z]+")
. generate year = regexs(0) if regexm(date, "[0-9]*$")
. replace year = "20"+regexs(0) if regexm(year, "^[0][0-9]$")
(2 real changes made)
. replace year = "19"+regexs(0) if regexm(year, "^[1-9][0-9]$")
(2 real changes made)
. generate date1 = day+month+year

. list, noobs sep(10)

date day month year date1

20jan2007 20 jan 2007 20jan2007


16June06 16 June 2006 16June2006
06sept1985 06 sept 1985 06sept1985
21june04 21 june 2004 21june2004
4july90 4 july 1990 4july1990
9jan1999 9 jan 1999 9jan1999
6aug99 6 aug 1999 6aug1999
19august2003 19 august 2003 19august2003

Alternately, you can obtain the same result by using screening:

. screening, sources(date) keys("^[0-9]+" "[a-zA-Z]+" "[0][0-9]$" "[1-9][0-9]$")


> newcode(date1, add)
> recode(1 "regexs(0)" 2 "regexs(0)" 3 "20+regexs(0)" 4 "19+regexs(0)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesn´t work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. list date date1, noobs sep(10)

date date1

20jan2007 20jan2007
16June06 16June2006
06sept1985 06sept1985
21june04 21june2004
4july90 4july1990
9jan1999 9jan1999
6aug99 6aug1999
19august2003 19august2003

In this case, as in the previous example, we specify the newcode() suboption add because we need to create the newcode() variable as a concatenation of subexpressions from keyword matching. The same result can be obtained using the following syntax:




. screening, sources(date)
> keys(begin "[0-9]+" "[a-zA-Z]+" end "[0][0-9]" end "[1-9][0-9]")
> newcode(date1, add)
> recode(1 "regexs(0)" 2 "regexs(0)" 3 "20+regexs(0)" 4 "19+regexs(0)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesn´t work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. list date date1, noobs sep(10)

date date1

20jan2007 20jan2007
16June06 16June2006
06sept1985 06sept1985
21june04 21june2004
4july90 4july1990
9jan1999 9jan1999
6aug99 6aug1999
19august2003 19august2003

where the only difference is the way in which the matching rule is specified: begin instead of ^ and end instead of $.

6 Summary
In this article, we introduced the new screening command, a data-management tool
that helps you examine and treat the content of string variables containing free, possibly
complex, narrative text. screening allows you to build new variables, to recode new
or existing variables, and to build a set of categorical variables indicating keyword
occurrences (a first step toward textual analysis). Considerable effort was devoted to making the command as flexible as possible; thus screening contains a rich set of options intended to cover the most frequently encountered problems and needs. Because of this flexibility, the command can be used in many different settings, such as EPR data, data merged from different sources, or survey data. The execution of screening is fast, thanks to Mata programming; its syntax is simple and shared with many other Stata commands, so it is useful to all users regardless of their level of experience with Stata. We especially recommend the explore() option, which makes the command a useful data-mining tool. Expert users can also exploit a more elaborate syntax that substantially eases the preparatory burden of data cleaning.

Acknowledgments
We would like to thank Alice Cortignani, Rossana D’Amico, Andrea Piano Mortari,
and Riccardo Zecchinelli who tested the command, Vincenzo Atella who read an earlier
version of the article, Iacopo Cricelli who provided us with EPR data, and Rafal Raci-
borski for useful discussions. We are also grateful to David Drukker and all participants
at the 2009 Italian Stata Users Group meeting. Finally, the suggestions made by the referee and the editor helped to improve the command. We are responsible for any remaining errors.

7 References
Moorman, P. W., A. M. van Ginneken, J. van der Lei, and J. H. van Bemmel. 1994. A
model for structured data entry based on explicit descriptional knowledge. Methods
of Information in Medicine 33: 454–463.

Raciborski, R. 2008. kountry: A Stata utility for merging cross-country data from
multiple sources. Stata Journal 8: 390–400.

Yamazaki, S., and Y. Satomura. 2000. Standard method for describing an electronic
patient record template: Application of XML to share domain knowledge. Methods
of Information in Medicine 39: 50–55.

About the authors


Federico Belotti is a PhD student in econometrics and empirical economics at the University
of Rome Tor Vergata.
Domenico Depalo is a researcher in the Economic Research Department of the Bank of Italy
in Rome. He received his PhD in econometrics and empirical economics from the University
of Rome Tor Vergata and was enrolled in a Post Doc program at the University of Rome La
Sapienza.
The Stata Journal (2010)
10, Number 3, pp. 482–495

Speaking Stata: The limits of sample skewness and kurtosis
Nicholas J. Cox
Department of Geography
Durham University
Durham, UK
n.j.cox@durham.ac.uk

Abstract. Sample skewness and kurtosis are limited by functions of sample size.
The limits, or approximations to them, have repeatedly been rediscovered over
the last several decades, but nevertheless seem to remain only poorly known. The
limits impart bias to estimation and, in extreme cases, imply that no sample could
bear exact witness to its parent distribution. The main results are explained in a
tutorial review, and it is shown how Stata and Mata may be used to confirm and
explore their consequences.
Keywords: st0204, descriptive statistics, distribution shape, moments, sample size,
skewness, kurtosis, lognormal distribution

1 Introduction
The use of moment-based measures for summarizing univariate distributions is long
established. Although there are yet longer roots, Thorvald Nicolai Thiele (1889) used
mean, standard deviation, variance, skewness, and kurtosis in recognizably modern
form. Appreciation of his work on moments remains limited, for all too understandable
reasons. Thiele wrote mostly in Danish, he did not much repeat himself, and he tended
to assume that his readers were just about as smart as he was. None of these habits
could possibly ensure rapid worldwide dissemination of his ideas. Indeed, it was not
until the 1980s that much of Thiele’s work was reviewed in or translated into English
(Hald 1981; Lauritzen 2002).
Thiele did not use all the now-standard terminology. The names standard deviation,
skewness, and kurtosis we owe to Karl Pearson, and the name variance we owe to Ronald
Aylmer Fisher (David 2001). Much of the impact of moments can be traced to these
two statisticians. Pearson was a vigorous proponent of using moments in distribution
curve fitting. His own system of probability distributions pivots on varying skewness,
measured relative to the mode. Fisher’s advocacy of maximum likelihood as a superior
estimation method was combined with his exposition of variance as central to statistical
thinking. The many editions of Fisher’s 1925 text Statistical Methods for Research
Workers, and of texts that in turn drew upon its approach, have introduced several
generations to the ideas of skewness and kurtosis. Much more detail on this history is
given by Walker (1929), Hald (1998, 2007), and Fiori and Zenga (2009).



Whatever the history and the terminology, a simple but fundamental point deserves
emphasis. A name like skewness has a very broad interpretation as a vague concept of
distribution symmetry or asymmetry, which can be made precise in a variety of ways
(compare with Mosteller and Tukey [1977]). Kurtosis is even more enigmatic: some
authors write of kurtosis as peakedness and some write of it as tail weight, but the
skeptical interpretation that kurtosis is whatever kurtosis measures is the only totally
safe story. Numerical examples given by Irving Kaplansky (1945) alone suffice to show
that kurtosis bears neither interpretation unequivocally.1
To the present, moments have been much disapproved, and even disproved, by math-
ematical statisticians who show that in principle moments may not even exist, and by
data analysts who know that in practice moments may not be robust. Nevertheless in
many quarters, they survive, and they even thrive. One of several lively fields making
much use of skewness and kurtosis measures is the analysis of financial time series (for
example, Taylor [2005]).
In this column, I will publicize one limitation of certain moment-based measures, in a
double sense. Sample skewness and sample kurtosis are necessarily bounded by functions
of sample size, imparting bias to the extent that small samples from skewed distributions
may even deny their own parentage. This limitation has been well established and
discussed in several papers and a few texts, but it still appears less widely known than
it should be. Presumably, it presents a complication too far for most textbook accounts.
The presentation here will include only minor novelties but will bring the key details
together in a coherent story and give examples of the use of Stata and Mata to confirm
and explore for oneself the consequences of a statistical artifact.

2 Deductions
2.1 Limits on skewness and kurtosis
Given a sample of $n$ values $y_1, \ldots, y_n$ and sample mean $\bar{y} = \sum_{i=1}^{n} y_i/n$, sample moments measured about the mean are at their simplest defined as averages of powered deviations

$$m_r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^r}{n}$$

so that $m_2$ and $s = \sqrt{m_2}$ are versions of, respectively, the sample variance and sample standard deviation.

Here sample skewness is defined as

$$\frac{m_3}{m_2^{3/2}} = \frac{m_3}{s^3} = \sqrt{b_1} = g_1$$

1. Kaplansky’s paper is one of a few that he wrote in the mid-1940s on probability and statistics. He
is much better known as a distinguished algebraist (Bass and Lam 2007; Kadison 2008).

while sample kurtosis is defined as

$$\frac{m_4}{m_2^2} = \frac{m_4}{s^4} = b_2 = g_2 + 3$$

Hence, both of the last two measures are scaled or dimensionless: Whatever units
of measurement were used appear raised to the same powers in both numerator and
denominator, and so cancel out. The commonly used m, s, b, and g notation corresponds
to a longstanding μ, σ, β, and γ notation for the corresponding theoretical or population
quantities. If 3 appears to be an arbitrary constant in the last equation, one explanation
starts with the fact that normal or Gaussian distributions have β1 = 0 and β2 = 3; hence,
γ2 = 0.
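As a quick numerical sanity check (our own aside, not part of the original column), a large pseudorandom sample from a normal distribution should give a sample g1 near 0 and a sample b2 near 3. The Mata calculator style used here anticipates the session in section 3:

: y = rnormal(100000, 1, 0, 1)
: dev = y :- mean(y)
: mean(dev:^3) / (mean(dev:^2)):^(3/2)    // sample g1, close to 0
: mean(dev:^4) / (mean(dev:^2)):^2        // sample b2, close to 3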
Naturally, if y is constant, then m2 is zero; thus skewness and kurtosis are not
defined. This includes the case of n = 1. The stipulations that y is genuinely variable
and that n ≥ 2 underlie what follows.
Newcomers to this territory are warned that usages in the statistical literature vary
considerably, even among entirely competent authors. This variation means that differ-
ent formulas may be found for the same terms—skewness and kurtosis—and different
terms for the same formulas. To start at the beginning: Although Karl Pearson in-
troduced the term skewness, and also made much use of β1 , he used skewness to refer
to (mean − mode) / standard deviation, a quantity that is well defined in his system
of distributions. In more recent literature, some differences reflect the use of divisors
other than n, usually with the intention of reducing bias, and so resembling in spirit
the common use of n − 1 as an alternative divisor for sample variance. Some authors
call γ2 (or g2 ) the kurtosis, while yet other variations may be found.
The key results for this column were extensively discussed by Wilkins (1944) and
Dalén (1987). Clearly, g1 may be positive, zero, or negative, reflecting the sign of m3 .
Wilkins (1944) showed that there is an upper limit to its absolute value,

$$|g_1| \le \frac{n-2}{\sqrt{n-1}} \qquad (1)$$

as was also independently shown by Kirby (1974). In contrast, b2 must be positive and
indeed (as may be shown, for example, using the Cauchy–Schwarz inequality) must be
at least 1. More pointedly, Dalén (1987) showed that there is also an upper limit to its
value:
$$b_2 \le \frac{n^2 - 3n + 3}{n-1} \qquad (2)$$
The proofs of these inequalities are a little too long, and not quite interesting enough,
to reproduce here.
Both of these inequalities are sharp, meaning attainable. Test cases to explore the
precise limits have all values equal to some constant, except for one value that is equal
to another constant: n = 2, y1 = 0, y2 = 1 will do fine as a concrete example, for which
skewness is 0/1 = 0 and kurtosis is (1 − 3 + 3)/1 = 1.

For n = 2, we can rise above a mere example to show quickly that these results are
indeed general. The mean of two distinct values is halfway between them so that the
two deviations yi − y have equal magnitude and opposite sign. Thus their cubes have
sum 0, and m3 and b1 are both identically equal to 0. Alternatively, such values are
geometrically two points on the real line, a configuration that is evidently symmetric
around the mean in the middle, so skewness can be seen to be zero without any calcula-
tions. The squared deviations have an average equal to $\{(y_1 - y_2)/2\}^2$, and their fourth powers have an average equal to $\{(y_1 - y_2)/2\}^4$, so $b_2$ is identically equal to 1.
To see how the upper limit behaves numerically, we can rewrite (1) as

$$|g_1| \le \sqrt{n-1} - \frac{1}{\sqrt{n-1}}$$

so that as sample size $n$ increases, first $\sqrt{n-1}$ and then $\sqrt{n}$ become acceptable approximations. Similarly, we can rewrite (2) as

$$b_2 \le n - 2 + \frac{1}{n-1}$$

from which, in large samples, first $n-2$ and then $n$ become acceptable approximations.
As it happens, these limits established by Wilkins and Dalén sharpen up on the results of other workers. Limits of $\sqrt{n}$ and $n$ (the latter when $n$ is greater than 3) were established by Cramér (1946, 357). Limits of $\sqrt{n-1}$ and $n$ were independently established by Johnson and Lowe (1979); Kirby (1981) advertised work earlier than theirs (although not earlier than that of Wilkins or Cramér). Similarly, Stuart and Ord (1994, 121–122) refer to the work of Johnson and Lowe (1979), but overlook the sharper limits.2
There is yet another twist in the tale. Pearson (1916, 440) refers to the limit (2),
which he attributes to George Neville Watson, himself later a distinguished contributor
to analysis (but not to be confused with the statistician Geoffrey Stuart Watson), and to a limit of $n-1$ on $b_1$, equivalent to a limit of $\sqrt{n-1}$ on $g_1$. Although Pearson
was the author of the first word on this subject, his contribution appears to have been
uniformly overlooked by later authors. However, he dismissed these limits as without
practical importance, which may have led others to downplay the whole issue.
In practice, we are, at least at first sight, less likely to care much about these limits
for large samples. It is the field of small samples in which limits are more likely to cause
problems, and sometimes without data analysts even noticing.

2. The treatise of Stuart and Ord is in line of succession, with one offset, from Yule (1911). Despite
that distinguished ancestry, it contains some surprising errors as well as the compendious collection
of results that makes it so useful. To the statement that mean, median, and mode differ in a skewed distribution (p. 48), counterexamples are 0, 0, 1, 1, 1, 1, 3, and the binomial $\binom{10}{k}\, 0.1^k\, 0.9^{10-k}$, $k = 0, \ldots, 10$. For both of these skewed counterexamples, mean, median, and mode coincide at 1. To
the statement that they coincide in a symmetric distribution (p. 108), counterexamples are any
symmetric distribution with an even number of modes.

2.2 An aside on coefficient of variation


The literature contains similar limits related to sample size on other sample statistics.
For example, the coefficient of variation is the ratio of standard deviation to mean, or
$s/\bar{y}$. Katsnelson and Kotz (1957) proved that so long as all $y_i \ge 0$, the coefficient of variation cannot exceed $\sqrt{n-1}$, a result mentioned earlier by Longley (1952). Cramér
(1946, 357) proved a less sharp result, and Kirby (1974) proved a less general result.
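A quick Mata check of the sharp case (our own sketch, using the divisor-$n$ moments defined in section 2.1): with one positive value and all other values zero, the coefficient of variation attains the bound exactly.

: n = 10
: y = 1 \ J(n-1, 1, 0)                        // one positive value, the rest zero
: sqrt(mean((y :- mean(y)):^2)) / mean(y)     // equals sqrt(n-1) = 3 here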

3 Confirmations
[R] summarize confirms that skewness g1 and kurtosis b2 are calculated in Stata pre-
cisely as above. There are no corresponding Mata functions at the time of this writing,
but readers interested in these questions will want to start Mata to check their own
understanding. One example to check is
. sysuse auto, clear
(1978 Automobile Data)
. summarize mpg, detail
Mileage (mpg)

Percentiles Smallest
1% 12 12
5% 14 12
10% 14 14 Obs 74
25% 18 14 Sum of Wgt. 74
50% 20 Mean 21.2973
Largest Std. Dev. 5.785503
75% 25 34
90% 29 35 Variance 33.47205
95% 34 35 Skewness .9487176
99% 41 41 Kurtosis 3.975005

The detail option is needed to get skewness and kurtosis results from summarize.
We will not try to write a bulletproof skewness or kurtosis function in Mata, but we
will illustrate its use calculator-style. After entering Mata, a variable can be read into
a vector. It is helpful to have a vector of deviations from the mean to work on.
. mata :
mata (type end to exit)
: y = st_data(., "mpg")
: dev = y :- mean(y)
: mean(dev:^3) / (mean(dev:^2)):^(3/2)
.9487175965
: mean(dev:^4) / (mean(dev:^2)):^2
3.975004596

So those examples at least check out. Those unfamiliar with Mata might note that
the colon prefix, as in :- or :^, merely flags an elementwise operation. Thus for example,
mean(y) returns a constant, which we wish to subtract from every element of a data
vector.

Mata may be used to check simple limiting cases. The minimal dataset (0, 1) may
be entered in deviation form. After doing so, we can just repeat earlier lines to calculate g1 and b2:

: dev = (.5 \ -.5)


: mean(dev:^3) / (mean(dev:^2)):^(3/2)
0
: mean(dev:^4) / (mean(dev:^2)):^2
1
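The sharp configurations for general n, with all values equal except one, can be checked in the same calculator style. This is our own extension of the check above, for n = 10; the comments give the exact limits implied by (1) and (2).

: n = 10
: dev = 1 \ J(n-1, 1, 0)
: dev = dev :- mean(dev)
: mean(dev:^3) / (mean(dev:^2)):^(3/2)    // (n-2)/sqrt(n-1) = 8/3 = 2.667
: mean(dev:^4) / (mean(dev:^2)):^2        // (n^2-3n+3)/(n-1) = 73/9 = 8.111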

Mata may also be used to see how the limits of skewness and kurtosis vary with
sample size. We start out with a vector containing some sample sizes. We then calculate
the corresponding upper limits for skewness and kurtosis and tabulate the results. The
results are mapped to strings for tabulation with reasonable numbers of decimal places.

: n = (2::20\50\100\500\1000)
: skew = sqrt(n:-1) :- (1:/(n:-1))
: kurt = n :- 2 + (1:/(n:-1))
: strofreal(n), strofreal((skew, kurt), "%4.3f")
1 2 3

1 2 0.000 1.000
2 3 0.914 1.500
3 4 1.399 2.333
4 5 1.750 3.250
5 6 2.036 4.200
6 7 2.283 5.167
7 8 2.503 6.143
8 9 2.703 7.125
9 10 2.889 8.111
10 11 3.062 9.100
11 12 3.226 10.091
12 13 3.381 11.083
13 14 3.529 12.077
14 15 3.670 13.071
15 16 3.806 14.067
16 17 3.938 15.062
17 18 4.064 16.059
18 19 4.187 17.056
19 20 4.306 18.053
20 50 6.980 48.020
21 100 9.940 98.010
22 500 22.336 498.002
23 1000 31.606 998.001

The second and smaller term in these rewritten expressions is $1/\sqrt{n-1}$ for (1) and $1/(n-1)$ for (2). Although the calculation is, or should be, almost mental arithmetic, we can see how quickly the latter term shrinks so much that it can be neglected:

: strofreal(n), strofreal(1 :/ (n :- 1), "%4.3f")


1 2

1 2 1.000
2 3 0.500
3 4 0.333
4 5 0.250
5 6 0.200
6 7 0.167
7 8 0.143
8 9 0.125
9 10 0.111
10 11 0.100
11 12 0.091
12 13 0.083
13 14 0.077
14 15 0.071
15 16 0.067
16 17 0.062
17 18 0.059
18 19 0.056
19 20 0.053
20 50 0.020
21 100 0.010
22 500 0.002
23 1000 0.001

: end

These calculations are equally easy in Stata when you start with a variable containing
sample sizes.

4 Explorations
In statistical science, we use an increasing variety of distributions. Even when closed-
form expressions exist for their moments, which is far from being universal, the need
to estimate parameters from sample data often arises. Thus the behavior of sample
moments and derived measures remains of key interest. Even if you do not customarily
use, for example, summarize, detail to get skewness and kurtosis, these measures may
well underlie your favorite test for normality.
The limits on sample skewness and kurtosis impart the possibility of bias whenever
the upper part of their sampling distributions is cut off by algebraic constraints. In
extreme cases, a sample may even deny the distribution that underlies it, because it is
impossible for any sample to reproduce the skewness and kurtosis of its parent.
These questions may be explored by simulation. Lognormal distributions offer simple
but striking examples. We call a distribution for y lognormal if ln y is normally dis-
tributed. Those who prefer to call normal distributions by some other name (Gaussian,
notably) have not noticeably affected this terminology. Similarly, for some people the
terminology is backward, because a lognormal distribution is an exponentiated normal
distribution. Protest is futile while the term lognormal remains entrenched.

If $\ln y$ has mean $\mu$ and standard deviation $\sigma$, its skewness and kurtosis may be defined in terms of $\exp(\sigma^2) = \omega$ (Johnson, Kotz, and Balakrishnan 1994, 212):

$$\gamma_1 = \sqrt{\omega - 1}\,(\omega + 2); \qquad \beta_2 = \omega^4 + 2\omega^3 + 3\omega^2 - 3$$

Differently put, skewness and kurtosis depend on σ alone; μ is a location parameter for
the lognormal as well as the normal.
[R] simulate already has a worked example of the simulation of lognormals, which
we can adapt slightly for the present purpose. The program there called lnsim merely
needs to be modified by adding results for skewness and kurtosis. As before, summarize,
detail is now the appropriate call. Before simulation, we (randomly, capriciously, or
otherwise) choose a seed for random-number generation:
. clear all
. program define lnsim, rclass
1. version 11.1
2. syntax [, obs(integer 1) mu(real 0) sigma(real 1)]
3. drop _all
4. set obs `obs´
5. tempvar z
6. gen `z´ = exp(rnormal(`mu´,`sigma´))
7. summarize `z´, detail
8. return scalar mean = r(mean)
9. return scalar var = r(Var)
10. return scalar skew = r(skewness)
11. return scalar kurt = r(kurtosis)
12. end
. set seed 2803
. simulate mean=r(mean) var=r(var) skew=r(skew) kurt=r(kurt), nodots
> reps(10000): lnsim, obs(50) mu(-3) sigma(7)
command: lnsim, obs(50) mu(-3) sigma(7)
mean: r(mean)
var: r(var)
skew: r(skew)
kurt: r(kurt)

We are copying here the last example from help simulate, a lognormal for which
μ = −3, σ = 7. While a lognormal may seem a fairly well-behaved distribution, a quick
calculation shows that with these parameter choices, the skewness is about $8 \times 10^{31}$ and the kurtosis about $10^{85}$, which no sample result can possibly come near! The previously
discussed limits are roughly 7 for skewness and 48 for kurtosis for this sample size. Here
are the Mata results:
. mata
mata (type end to exit)
: omega = exp(49)
: sqrt(omega - 1) * (omega + 2)
8.32999e+31
: omega^4 + 2 * omega^3 + 3*omega^2 - 3
1.32348e+85
: n = 50

: sqrt(n:-1) :- (1:/(n:-1)), n :- 2 + (1:/(n:-1))


1 2

1 6.979591837 48.02040816

: end

Sure enough, calculations and a graph (shown as figure 1) show the limits of 7 and
48 are biting hard. Although many graph forms would work well, I here choose qplot
(Cox 2005) for quantile plots.
. summarize
Variable Obs Mean Std. Dev. Min Max

mean 10000 1.13e+09 1.11e+11 1.888205 1.11e+13


var 10000 6.20e+23 6.20e+25 42.43399 6.20e+27
skew 10000 6.118604 .9498364 2.382902 6.857143
kurt 10000 40.23354 10.06829 7.123528 48.02041
. qplot skew, yla(, ang(h)) name(g1, replace) ytitle(skewness) yli(6.98)
. qplot kurt, yla(, ang(h)) name(g2, replace) ytitle(kurtosis) yli(48.02)
. graph combine g1 g2

[Figure 1 here: quantile plots of sample skewness (left) and sample kurtosis (right) against fraction of the data, with the upper limits marked.]

Figure 1. Sampling distributions of skewness and kurtosis for samples of size 50 from a
lognormal with μ = −3, σ = 7. Upper limits are shown by horizontal lines.

The natural comment is that the parameter choices in this example are a little
extreme, but the same phenomenon occurs to some extent even with milder choices.
With the default μ = 0, σ = 1, the skewness and kurtosis are less explosively high—but
still very high by many standards. We clear the data and repeat the simulation, but
this time we use the default values.

. clear
. simulate mean=r(mean) var=r(var) skew=r(skew) kurt=r(kurt), nodots
> reps(10000): lnsim, obs(50)
command: lnsim, obs(50)
mean: r(mean)
var: r(var)
skew: r(skew)
kurt: r(kurt)

Within Mata, we can recalculate the theoretical skewness and kurtosis. The limits
to sample skewness and kurtosis remain the same, given the same sample size n = 50.

. mata
mata (type end to exit)
: omega = exp(1)
: sqrt(omega - 1) * (omega + 2)
6.184877139
: omega^4 + 2 * omega^3 + 3*omega^2 - 3
113.9363922
: end

The problem is more insidious with these parameter values. The sampling distri-
butions look distinctly skewed (shown in figure 2) but are not so obviously truncated.
Only when the theoretical values for skewness and kurtosis are considered is it obvious
that the estimations are seriously biased.

. summarize
Variable Obs Mean Std. Dev. Min Max

mean 10000 1.657829 .3106537 .7871802 4.979507


var 10000 4.755659 7.43333 .3971136 457.0726
skew 10000 2.617803 1.092607 .467871 6.733598
kurt 10000 11.81865 7.996084 1.952879 46.89128
. qplot skew, yla(, ang(h)) name(g1, replace) ytitle(skewness) yli(6.98)
. qplot kurt, yla(, ang(h)) name(g2, replace) ytitle(kurtosis) yli(48.02)
. graph combine g1 g2




[Figure 2 here: quantile plots of sample skewness (left) and sample kurtosis (right) against fraction of the data, with the upper limits marked.]

Figure 2. Sampling distributions of skewness and kurtosis for samples of size 50 from a
lognormal with μ = 0, σ = 1. Upper limits are shown by horizontal lines.

Naturally, these are just token simulations, but a way ahead should be clear. If
you are using skewness or kurtosis with small (or even large) samples, simulation with
some parent distributions pertinent to your work is a good idea. The simulations of
Wallis, Matalas, and Slack (1974) in particular pointed to empirical limits to skewness,
which Kirby (1974) then established independently of previous work.3

5 Conclusions
This story, like any other, lies at the intersection of many larger stories. Many statisti-
cally minded people make little or no use of skewness or kurtosis, and this paper may
have confirmed them in their prejudices. Some readers may prefer to see this as an-
other argument for using quantiles or order statistics for summarization (Gilchrist 2000;
David and Nagaraja 2003). Yet others may know that L-moments offer an alternative
approach (Hosking 1990; Hosking and Wallis 1997).
Arguably, the art of statistical analysis lies in choosing a model successful enough
to ensure that the exact form of the distribution of some response variable, conditional
on the predictors, is a matter of secondary importance. For example, in the simplest
regression situations, an error term for any really good model is likely to be fairly near
normally distributed, and thus not a source of worry. But authorities and critics differ
over how far that is a deductive consequence of some flavor of central limit theorem or
a naïve article of faith that cries out for critical evaluation.
3. Connoisseurs of offbeat or irreverent titles might like to note some other papers by the same team:
Mandelbrot and Wallis (1968), Matalas and Wallis (1973), and Slack (1973).

More prosaically, it is a truism—but one worthy of assent—that researchers using


statistical methods should know the strengths and weaknesses of the various items in
the toolbox. Skewness and kurtosis, over a century old, may yet offer surprises, which
a wide range of Stata and Mata commands may help investigate.

6 Acknowledgments
This column benefits from interactions over moments shared with Ian S. Evans and over
L-moments shared with Patrick Royston.

7 References
Bass, H., and T. Y. Lam. 2007. Irving Kaplansky 1917–2006. Notices of the American
Mathematical Society 54: 1477–1493.

Cox, N. J. 2005. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460.

Cramér, H. 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton Uni-


versity Press.

Dalén, J. 1987. Algebraic bounds on standardized sample moments. Statistics & Prob-
ability Letters 5: 329–331.

David, H. A. 2001. First (?) occurrence of common terms in statistics and probability.
In Annotated Readings in the History of Statistics, ed. H. A. David and A. W. F.
Edwards, 209–246. New York: Springer.

David, H. A., and H. N. Nagaraja. 2003. Order Statistics. Hoboken, NJ: Wiley.

Fiori, A. M., and M. Zenga. 2009. Karl Pearson and the origin of kurtosis. International
Statistical Review 77: 40–50.

Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver &
Boyd.

Gilchrist, W. G. 2000. Statistical Modelling with Quantile Functions. Boca Raton, FL:
Chapman & Hall/CRC.

Hald, A. 1981. T. N. Thiele’s contribution to statistics. International Statistical Review


49: 1–20.

———. 1998. A History of Mathematical Statistics from 1750 to 1930. New York:
Wiley.

———. 2007. A History of Parametric Statistical Inference from Bernoulli to Fisher,


1713–1935. New York: Springer.

Hosking, J. R. M. 1990. L-moments: Analysis and estimation of distributions using lin-


ear combinations of order statistics. Journal of the Royal Statistical Society, Series B
52: 105–124.

Hosking, J. R. M., and J. R. Wallis. 1997. Regional Frequency Analysis: An Approach


Based on L-Moments. Cambridge: Cambridge University Press.

Johnson, M. E., and V. W. Lowe. 1979. Bounds on the sample skewness and kurtosis.
Technometrics 21: 377–378.

Johnson, N. L., S. Kotz, and N. Balakrishnan. 1994. Continuous Univariate Distribu-


tions, Vol. 1. New York: Wiley.

Kadison, R. V. 2008. Irving Kaplansky’s role in mid-twentieth century functional anal-


ysis. Notices of the American Mathematical Society 55: 216–225.

Kaplansky, I. 1945. A common error concerning kurtosis. Journal of the American


Statistical Association 40: 259.

Katsnelson, J., and S. Kotz. 1957. On the upper limits of some measures of variability.
Archiv für Meteorologie, Geophysik und Bioklimatologie, Series B 8: 103–107.

Kirby, W. 1974. Algebraic boundedness of sample statistics. Water Resources Research


10: 220–222.

———. 1981. Letter to the editor. Technometrics 23: 215.

Lauritzen, S. L. 2002. Thiele: Pioneer in Statistics. Oxford: Oxford University Press.

Longley, R. W. 1952. Measures of the variability of precipitation. Monthly Weather


Review 80: 111–117.

Mandelbrot, B. B., and J. R. Wallis. 1968. Noah, Joseph, and operational hydrology.
Water Resources Research 4: 909–918.

Matalas, N. C., and J. R. Wallis. 1973. Eureka! It fits a Pearson type 3 distribution.
Water Resources Research 9: 281–289.

Mosteller, F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course
in Statistics. Reading, MA: Addison–Wesley.

Pearson, K. 1916. Mathematical contributions to the theory of evolution. XIX: Second


supplement to a memoir on skew variation. Philosophical Transactions of the Royal
Society of London, Series A 216: 429–457.

Slack, J. R. 1973. I would if I could (self-denial by conditional models). Water Resources


Research 9: 247–249.

Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics. Volume 1:
Distribution Theory. 6th ed. London: Arnold.

Taylor, S. J. 2005. Asset Price Dynamics, Volatility, and Prediction. Princeton, NJ:
Princeton University Press.

Thiele, T. N. 1889. Forelæsninger over Almindelig Iagttagelseslære: Sandsynlighedsregn-


ing og Mindste Kvadraters Methode. Copenhagen: C. A. Reitzel. English translation
included in Lauritzen 2002.

Walker, H. M. 1929. Studies in the History of Statistical Method: With Special Refer-
ence to Certain Educational Problems. Baltimore: Williams & Wilkins.
Wallis, J. R., N. C. Matalas, and J. R. Slack. 1974. Just a moment! Water Resources
Research 10: 211–219.

Wilkins, J. E. 1944. A note on skewness and kurtosis. Annals of Mathematical Statistics


15: 333–335.

Yule, G. U. 1911. An Introduction to the Theory of Statistics. London: Griffin.

About the author


Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks,
postings, FAQs, and programs to the Stata user community. He has also coauthored 15 com-
mands in official Stata. He wrote several inserts in the Stata Technical Bulletin and is an editor
of the Stata Journal.
The Stata Journal (2010)
10, Number 3, pp. 496–499

Stata tip 89: Estimating means and percentiles following multiple imputation
Peter A. Lachenbruch
Oregon State University
Corvallis, OR
peter.lachenbruch@oregonstate.edu

1 Introduction
In a statistical analysis, I usually want some basic descriptive statistics such as the mean,
standard deviation, extremes, and percentiles. See, for example, Pagano and Gauvreau
(2000). Stata conveniently provides these descriptive statistics with the summarize
command’s detail option. Alternatively, I can obtain percentiles with the centile
command. For example, with auto.dta, we have

. sysuse auto
(1978 Automobile Data)
. summarize price, detail
Price

Percentiles Smallest
1% 3291 3291
5% 3748 3299
10% 3895 3667 Obs 74
25% 4195 3748 Sum of Wgt. 74
50% 5006.5 Mean 6165.257
Largest Std. Dev. 2949.496
75% 6342 13466
90% 11385 13594 Variance 8699526
95% 13466 14500 Skewness 1.653434
99% 15906 15906 Kurtosis 4.819188

However, if I have missing values, the summarize command is not supported by mi


estimate or by the user-written mim command (Royston 2004, 2005a,b, 2007; Royston,
Carlin, and White 2009).

2 Finding means and percentiles when missing values are present
For a general multiple-imputation reference, see Stata 11 Multiple-Imputation Reference
Manual (2009). By recognizing that a regression with no independent variables estimates
the mean, I can use mi estimate: regress to get multiply imputed means. If I wish to
get multiply imputed quantiles, I can use mi estimate: qreg or mi estimate: sqreg
for this purpose.



I now create a dataset with missing values of price:

. clonevar newprice = price


. set seed 19670221
. replace newprice = . if runiform() < .4
(32 real changes made, 32 to missing)

The following commands were generated from the multiple-imputation dialog box. I
used 20 imputations. Before Stata 11, this could also be done with the user-written com-
mands ice and mim (Royston 2004, 2005a,b, 2007; Royston, Carlin, and White 2009).

. mi set mlong
. mi register imputed newprice
(32 m=0 obs. now marked as incomplete)
. mi register regular mpg trunk weight length
. mi impute regress newprice, add(20) rseed(3252010)
Univariate imputation Imputations = 20
Linear regression added = 20
Imputed: m=1 through m=20 updated = 0
Observations per m

Variable complete incomplete imputed total

newprice 42 32 32 74

(complete + incomplete = total; imputed is the minimum across m


of the number of filled in observations.)
. mi estimate: regress newprice
Multiple-imputation estimates Imputations = 20
Linear regression Number of obs = 74
Average RVI = 1.3880
Complete DF = 73
DF: min = 19.46
avg = 19.46
DF adjustment: Small sample max = 19.46
F( 0, .) = .
Within VCE type: OLS Prob > F = .

newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]

_cons 5693.489 454.9877 12.51 0.000 4742.721 6644.258

From this output, we see that the estimated mean is 5,693 with a standard error
of 455 (rounded up) compared with the complete data value of 6,165 with a standard
error of 343 (also rounded up). However, we do not have estimates of quantiles. This
could also have been done using mi estimate: mean newprice (the mean command is
near the bottom of the estimation command list for mi estimate).

We can apply the same principle using qreg. For the 10th percentile, type

. mi estimate: qreg newprice, quantile(10)


Multiple-imputation estimates Imputations = 20
.1 Quantile regression Number of obs = 74
Average RVI = 0.2901
Complete DF = 73
DF: min = 48.05
avg = 48.05
DF adjustment: Small sample max = 48.05
F( 0, .) = .
Prob > F = .

newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]

_cons 3495.635 708.54 4.93 0.000 2071.058 4920.212

Compare the value of 3,496 with the value of 3,895 from the full data. We can use
the simultaneous estimates command for the full set:

. mi estimate: sqreg newprice, quantiles(10 25 50 75 90) reps(20)


Multiple-imputation estimates Imputations = 20
Simultaneous quantile regression Number of obs = 74
Average RVI = 0.6085
Complete DF = 73
DF adjustment: Small sample DF: min = 23.19
avg = 26.97
max = 31.65

newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]

q10
_cons 3495.635 533.5129 6.55 0.000 2408.434 4582.836

q25
_cons 4130.037 237.1932 17.41 0.000 3642.459 4617.614

q50
_cons 5200.238 441.294 11.78 0.000 4292.719 6107.757

q75
_cons 6620.232 778.8488 8.50 0.000 5025.49 8214.974

q90
_cons 8901.985 1417.022 6.28 0.000 5971.962 11832.01

3 Comments and cautions


The qreg command does not give the same result as the centile command when
you have complete data. This is because the centile command uses one observation,
while the qreg command uses a weighted combination of the observations. It will have
somewhat shorter confidence intervals, but with large datasets, the difference will be

small. A second caution is that comparing two medians can be tricky: the difference
of two medians is not the median difference of the distributions. I have found it useful to work with percentiles because they map one to one under a monotone transformation of the data. In our case, there is plentiful evidence that price is not normally distributed, so it would be good to look for a transformation and impute on the transformed scale.
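To make the caution about medians concrete, here is a toy illustration (our own sketch, not part of the tip): both medians equal 2, so their difference is 0, yet the median of the pairwise differences is 1.

clear
input x y
1 2
2 1
9 3
end
generate d = x - y
tabstat x y d, statistics(p50)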
This method of using regression commands without an independent variable can
provide estimates of quantities that otherwise would be difficult to obtain. For example,
it is much faster than finding 20 imputed percentiles and then combining them with
Rubin’s rules, and it is less onerous and less prone to error.

4 Acknowledgment
This work was supported in part by a grant from the Cure JM Foundation.

References
Pagano, M., and K. Gauvreau. 2000. Principles of Biostatistics. 2nd ed. Belmont, CA:
Duxbury.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.

———. 2005a. Multiple imputation of missing values: Update. Stata Journal 5: 188–
201.

———. 2005b. Multiple imputation of missing values: Update of ice. Stata Journal 5:
527–536.

———. 2007. Multiple imputation of missing values: Further update of ice, with an
emphasis on interval censoring. Stata Journal 7: 445–464.

Royston, P., J. B. Carlin, and I. R. White. 2009. Multiple imputation of missing values:
New features for mim. Stata Journal 9: 252–264.

StataCorp. 2009. Stata 11 Multiple-Imputation Reference Manual. College Station, TX:


Stata Press.
The Stata Journal (2010)
10, Number 3, pp. 500–502

Stata tip 90: Displaying partial results


Martin Weiss
Department of Economics
Tübingen University
Tübingen, Germany
martin.weiss@uni-tuebingen.de

Stata provides several features that allow users to display only part of their results.
If, for instance, you merely wanted to inspect the analysis of variance table returned by
anova or the coefficients returned by regress, you could instruct Stata to omit other
results:

. sysuse auto
(1978 Automobile Data)
. regress weight length price, notable
Source SS df MS Number of obs = 74
F( 2, 71) = 385.80
Model 40378658.3 2 20189329.2 Prob > F = 0.0000
Residual 3715520.06 71 52331.2685 R-squared = 0.9157
Adj R-squared = 0.9134
Total 44094178.4 73 604029.841 Root MSE = 228.76
. regress weight length price, noheader

weight Coef. Std. Err. t P>|t| [95% Conf. Interval]

length 30.60949 1.333171 22.96 0.000 27.95122 33.26776


price .042138 .0100644 4.19 0.000 .0220702 .0622058
_cons -2992.848 232.1722 -12.89 0.000 -3455.786 -2529.91

Other examples of this type can be found in the help files for xtivreg for its first-
stage results and for xtmixed for its random-effects and fixed-effects table. Generally,
to check whether Stata does provide such options, you would look for them under the
heading Reporting in the respective help files.
If you want to further customize output to your own needs, you could use the
estimates table command; see [R] estimates table. It is part of the comprehensive
estimates suite of commands that save and manipulate estimation results in Stata. See
[R] estimates or Baum (2006, sec. 4.4), where user-written alternatives are introduced
as well.
estimates table can provide several benefits to the user. For one, you can restrict
output to selected coefficients or equations with its keep() and drop() options.



. sysuse auto
(1978 Automobile Data)
. quietly regress weight length price trunk turn
. estimates table, keep(turn price)

Variable active

turn 35.214901
price .04624804

The original output of the estimation command itself is suppressed with quietly;
see [P] quietly. The keep() option also changes the order of the coefficients according
to your wishes. Additionally, you can elect to have Stata display results in a specific
format, for example, with fewer or more decimal places. The format can differ between
the elements that you choose to put into the table. In the case shown below, the
coefficients have three decimal places, while the standard error and the p-value have
two decimal places:
. sysuse auto
(1978 Automobile Data)
. quietly regress weight length price trunk turn
. estimates table, keep(turn price) b(%9.3fc) se(%9.2fc) p(%9.2fc)

Variable active

turn 35.215
11.65
0.00
price 0.046
0.01
0.00

legend: b/se/p

estimates table can also deal with models featuring multiple equations. If you
want to omit the coefficients for weight and the constant from every equation of your
sureg model, you could type
. sysuse auto
(1978 Automobile Data)
. qui sureg (price foreign weight length turn) (mpg foreign weight turn)
. estimates table, drop(weight _cons)

Variable active

price
foreign 3320.6181
length -78.75447
turn -144.37952

mpg
foreign -2.0756325
turn -.23516574

If your interest rests in the entire first equation and only the weight coefficient from the second equation, you would prepend coefficients with the equation names and separate the two
with a colon. The names of equations and coefficients are more accessible in Stata 11
with the coeflegend option, which is accepted by most estimation commands.

. sureg, coeflegend noheader

Coef. Legend

price
foreign 3320.618 _b[price:foreign]
weight 6.04491 _b[price:weight]
length -78.75447 _b[price:length]
turn -144.3795 _b[price:turn]
_cons 7450.657 _b[price:_cons]

mpg
foreign -2.075632 _b[mpg:foreign]
weight -.0055959 _b[mpg:weight]
turn -.2351657 _b[mpg:turn]
_cons 48.13492 _b[mpg:_cons]

. estimates table, keep(price: mpg:weight)

Variable active

price
foreign 3320.6181
weight 6.0449101
length -78.75447
turn -144.37952
_cons 7450.657

mpg
weight -.00559588

See help estimates table to learn more about the syntax.

Reference
Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College
Station, TX: Stata Press.
The Stata Journal (2010)
10, Number 3, pp. 503–504

Stata tip 91: Putting unabbreviated varlists into local macros
Nicholas J. Cox
Department of Geography
Durham University
Durham, UK
n.j.cox@durham.ac.uk

Within interactive sessions, do-files, or programs, Stata users often want to work
with varlists, lists of variable names. For convenience, such lists may be stored in local
macros. Local macros can be directly defined for later use, as in

. local myx "weight length displacement"


. regress mpg `myx´

However, users frequently want to put longer lists of names into local macros, spelled
out one by one so that some later command can loop through the list defined by the
macro. Such varlists might be indirectly defined in abbreviations using the wildcard
characters * or ?. These characters can be used alone or can be combined to express
ranges. For example, specifying * catches all variables, *TX* might define all variables
for Texas, and *200? catches the years 2000–2009 used as suffixes.
In such cases, direct definition may not appeal for all the obvious reasons: it is
tedious, time-consuming, and error-prone. It is also natural to wonder if there is a
better method. You may already know that foreach (see [P] foreach) will take such
wildcarded varlists as arguments, which solves many problems.
Many users know that pushing an abbreviated varlist through describe or ds is
one way to produce an unabbreviated varlist. Thus

. describe, varlist

is useful principally for its side effect of leaving all the variable names in r(varlist).
ds is typically used in a similar way, as is the user-written findname command (Cox
2010).
However, if the purpose is just to produce a local macro, the method of using
describe or ds has some small but definite disadvantages. First, the output of each
may not be desired, although it is easily suppressed with a quietly prefix. Second,
the modus operandi of both describe and ds is to leave saved results as r-class results.
Every now and again, users will be frustrated by this when they unwittingly overwrite
r-class results that they wished to use again. Third, there is some inefficiency in using
either command for this purpose, although you would usually have to work hard to
measure it.



The solution here is to use the unab command; see [P] unab. unab has just one
restricted role in life, but that role is the solution here. unab is billed as a programming
command, but nothing stops it from being used interactively as a simple tool in data
management. The simple examples

. unab myvars : *
. unab TX : *TX*
. unab twenty : *200?

show how a local macro, named at birth (here as myvars, TX, and twenty), is defined
as the unabbreviated equivalent of each argument that follows a colon. Note that using
wildcard characters, although common, is certainly not compulsory.
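As a minimal follow-up sketch of the payoff (ours, not part of the tip), the macro can then drive a loop in a do-file:

sysuse auto, clear
unab myvars : *
foreach v of local myvars {
    display "`v'"    // or apply any other command to each variable in turn
}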
The word “unabbreviate” is undoubtedly ugly. The help and manual entry do also
use the much simpler and more attractive word “expand”, but the word “expand” was
clearly not available as a command name, given its use for another purpose.
This tip skates over all the fine details of unab, and only now does it mention the
sibling commands tsunab and fvunab, for use when you are using time-series operators
and factor variables. For more information, see [P] unab.

Reference
Cox, N. J. 2010. Speaking Stata: Finding variables. Stata Journal 10: 281–296.
The Stata Journal (2010)
10, Number 3, p. 505

Software Updates

st0140_2: fuzzy: A program for performing qualitative comparative analyses (QCA) in Stata. K. C. Longest and S. Vaisey. Stata Journal 8: 452; 79–104.
A typo in the setgen subcommand has been fixed. Because of the typo, the drect extension was not calculating values below the middle anchor correctly; drect now operates correctly. Note that no other aspects of the setgen command have been altered.


c 2010 StataCorp LP up0029

