(Royston P.) A Parametric Model For Ordinal Respon
263–277
Printed in Great Britain
SUMMARY
A model for ordinal response data based on an underlying (but unobserved) Normal distribution is
proposed. The model is particularly useful for highly discrete data with a large proportion of zero values.
It is applied to the estimation of age-specific reference intervals in two substantive example datasets.
1. INTRODUCTION
This paper was motivated by a request to produce smooth age-specific reference centile curves for a
response variable with a discrete, positively skewed distribution. Standard methods had failed to produce
satisfactory results. Subsequently I received a second, similar request from a different source relating to a
different response variable. Analyses of these datasets are presented in Sections 3 and 4.
Several authors have proposed parametric and non-parametric methods for producing age-specific ref-
erence intervals (estimated quantile curves) for continuous response variables, including Bonellie and
Raab (1996), Cole (1988), Cole and Green (1992), He (1997), Heagerty and Pepe (1999), Healy et al.
(1988), Moulton et al. (1996), Pan et al. (1990), Rossiter (1991), Royston (1991), Royston and Wright
(1998), Yu and Jones (1998). Despite all these efforts, with the exception of Wade et al. (1995) the case
of ordinal responses appears to have been neglected. Wade et al. (1995) used a proportional odds model
(McCullagh, 1980)
\[ \ln \frac{p_j(x)}{1 - p_j(x)} = f_j(x; \beta) \]
to model the probability p j that a child of age x has a visual acuity score of level j ∈ {1, 2, 3, 4} or better.
Various nonlinear functions f j with an asymptote were used to represent the fact that visual acuity levels
off to adult values. Functions parallel between levels on the logistic scale were used. The parameters were
estimated by maximum likelihood. The methodology was framed only in terms of logistic regression, with
no explicit discussion of the idea of modelling the probability distribution of the ordinal response variable.
Ordinal variables may have dozens of levels. The idea of a parametric family of distributions for
such variables makes sense and leads to the possibility of parsimonious models, as opposed to needing
to parameterize all the levels indexed by j. In terms of the p j , this amounts to defining a suitable model
which smooths the p j with respect to j. A possible characteristic of such response variables is a proportion
of zero values, which may even approach 100% at some ages. Such distributions are difficult to model
accurately and parsimoniously. However, the essence of successful age-specific reference interval models
is adequately to represent the whole distribution of y|x (that is, the mechanism which generates the p j ),
not just, say, the age-specific mean or median.
© Oxford University Press (2000)
264 P. ROYSTON
The method proposed here is an extension and a generalization of the method of Wade et al. (1995). It
is introduced by way of a reinterpretation of the Normal errors regression model in Section 2.1. Subsequent
subsections deal with the identification of suitable functional forms for the model and parameter
estimation. Sections 3 and 4 present the analysis of the two datasets. Section 5 is a discussion.
2. METHODS
2.1. Centile-based interpretation of Normal-errors regression
Suppose we write a Normal-errors linear regression model for a random variable Y based on a single
covariate x as
\[ Y = \sigma\beta_0 + \sigma\beta_1 x + \sigma Z \]
where $\sigma > 0$ and the standardized residual $Z \sim N(0, 1)$. The model can be re-expressed on the
standard Normal deviate scale as
\[ Z = \frac{Y}{\sigma} - \beta_0 - \beta_1 x. \]
At a fixed value of $Y$, a unit increase in $x$ results in a change of $-\beta_1$ in $Z$, which may be
interpreted as a change of centile position in the distribution of $Z$. Suppose that $\beta_0 = 0$,
$\beta_1 = 1$, $Y = 0$. A change from $x = 0$ to $x = 1$ corresponds to a change from the mean or median
($Z = 0$) to approximately the 16th centile ($Z = -1$, $\Phi(-1) = 0.159$, where $\Phi(\cdot)$ is the
standard Normal cdf). The model may equivalently be written in a way suggestive of probit regression.
In terms of the cdf $F(y|x)$ of $Y$, we have
\[ F(y|x) = \Pr(Y \le y \mid x) = \Phi(Z) = \Phi\left(\frac{y}{\sigma} - \beta_0 - \beta_1 x\right). \]
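The centile interpretation can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `centile_position` is illustrative and not part of the paper.

```python
from statistics import NormalDist


def centile_position(y, x, beta0, beta1, sigma):
    """Return F(y|x) = Phi(y/sigma - beta0 - beta1*x) as a proportion."""
    z = y / sigma - beta0 - beta1 * x
    return NormalDist().cdf(z)


# Worked values from the text: beta0 = 0, beta1 = 1, Y = 0.
# sigma is set to 1 here; it has no effect when Y = 0.
at_x0 = centile_position(y=0.0, x=0.0, beta0=0.0, beta1=1.0, sigma=1.0)
at_x1 = centile_position(y=0.0, x=1.0, beta0=0.0, beta1=1.0, sigma=1.0)
print(round(at_x0, 3), round(at_x1, 3))  # prints 0.5 0.159
```

Moving from x = 0 to x = 1 at fixed Y = 0 therefore shifts the observation from the 50th to roughly the 16th centile, as stated.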
Suppose now we have a random sample of observations $\{x_i, y_i\}_{i=1,\ldots,n}$. Define binary indicator
variables $u_{ij}$ ($i = 1, \ldots, n$; $j = 1, \ldots, n-1$) by $u_{ij} = 1$ if $y_i \le y_j$, 0 otherwise.
For a given $j \in [1, n-1]$, the parameters $y_j/\sigma - \beta_0$ and $-\beta_1$ could in principle be
estimated by probit regression of the $u_{ij}$ ($i = 1, \ldots, n$) on the $x_i$. When $y_1, \ldots, y_n$ are
ordered categorical rather than continuous outcomes, this model is essentially the ordered probit approach
proposed by Aitchison and Silvey (1957). It is similar to the more popular proportional odds model of
McCullagh (1980) in which $\Phi^{-1}\{F(y|x)\}$ is replaced by the logit function applied to $F(y|x)$. In the
ordered probit or proportional odds approach, the parameters $y_j/\sigma - \beta_0$, which are sometimes
known as 'cutpoints', are regarded as nuisance parameters and are estimated simultaneously with the
regression coefficient $-\beta_1$ by maximum likelihood. Interest centres around $\beta_1$ and no
attempt is made to estimate $\beta_0$ and $\sigma$ separately.
With the proportional odds model, the supposition that the effect of x on the logistic scale is the same at
all values of Y is known as the proportional odds assumption. The analogous assumption for the ordered
probit model is that the effect of x on the Normal deviate scale is the same at all Y .
Z(y|x)                          Mean of Y|x                    SD of Y|x        Description
γ0 + γ1 y + δ0 x                −(γ0 + δ0 x)/γ1                1/γ1             Homoscedastic linear regression
γ0 + γ1 y + δ00 x + δ01 x²      −(γ0 + δ00 x + δ01 x²)/γ1      1/γ1             Homoscedastic quadratic regression
γ0 + γ1 y + x(δ0 + δ1 y)        −(γ0 + δ0 x)/(γ1 + δ1 x)       1/(γ1 + δ1 x)    Heteroscedastic nonlinear regression;
                                                                                approx. linear if |δ1 x| ≪ |γ1| ∀x
where β0 (y; θ 0 ), . . . , βk (y; θ k ) are possibly nonlinear functions of y and θ 0 , . . . , θ k . The variable Y is
considered to be ‘latent’ or unobserved. Actual observations are in categories which may be notionally
regarded as ‘bins’ of Y . The function β0 (y; θ 0 ) is a transformation towards (underlying) Normality of
β0 (Y ; θ 0 ). Covariate effects have two components: effects of x operate on the standard Normal deviate
scale Z (.), but if a function β j (y; θ j ) is nonconstant with respect to y, then x j also modifies the scale
and/or shape of the distribution of Y .
Now consider a simple case with k = 1 in which there is a single covariate x (e.g. age) related to the
distribution of Y as follows:
Z (y|x) = γ0 + γ1 y + x (δ0 + δ1 y) , (2)
so that θ 0 = (γ0 , γ1 ) , θ 1 = (δ0 , δ1 ) , β0 (y; θ 0 ) and β1 (y; θ 1 ) are linear in y, and Y is Normally dis-
tributed. The terms δ0 x and δ1 x y allow x to influence respectively the location and scale of Y . If δ1 = 0
then (2) reduces to a model closely related to the ordered probit.
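To see how the terms δ0 x and δ1 x y act on location and scale, model (2) can be rearranged as Z = (γ0 + δ0 x) + (γ1 + δ1 x) y, so that Y given x is Normal with mean −(γ0 + δ0 x)/(γ1 + δ1 x) and SD 1/(γ1 + δ1 x). The sketch below evaluates these two quantities; the function name and the parameter values in the usage lines are hypothetical, chosen only for illustration.

```python
def location_scale(x, gamma0, gamma1, delta0, delta1):
    """Mean and SD of Y|x implied by Z(y|x) = gamma0 + gamma1*y + x*(delta0 + delta1*y)."""
    slope = gamma1 + delta1 * x  # coefficient of y on the Normal deviate scale
    if slope <= 0:
        # Z must increase with y for Phi(Z) to be a valid cdf in y.
        raise ValueError("gamma1 + delta1*x must be positive")
    mean = -(gamma0 + delta0 * x) / slope
    sd = 1.0 / slope
    return mean, sd


# Hypothetical parameter values, not estimates from the paper.
print(location_scale(x=0.0, gamma0=-2.0, gamma1=0.5, delta0=0.1, delta1=0.05))
print(location_scale(x=10.0, gamma0=-2.0, gamma1=0.5, delta0=0.1, delta1=0.05))
```

With δ1 = 0 the SD is constant in x and only the mean moves, which corresponds to the ordered-probit-like special case mentioned above.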
Model (2) is too restrictive to be generally useful for age-specific reference interval estimation and
may require extension in two respects. First, there is no a priori reason why Y should be Normal, and
therefore a nonlinear transformation β0 (y; θ 0 ) may be needed. Second, linearity in x may fail and a
more complex function, such as a polynomial or fractional polynomial (FP) (Royston and Altman, 1994)
may be required. For example, a possible model which involves a power transformation λ1 of y towards
Normality, a second-degree FP in x with powers ( p1 , p2 ) and further power transformations λ2 , λ3 of y
is
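\[ Z(y|x) = \gamma_0 + \gamma_1 y^{\lambda_1} + x^{p_1}\left(\delta_0 + \delta_1 y^{\lambda_2}\right) + x^{p_2}\left(\varepsilon_0 + \varepsilon_1 y^{\lambda_3}\right). \]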
Considerable flexibility is available with such models.
2.4. Estimation
Suppose that observations of Y in a given dataset are made in c > 1 classes y (1) , . . . , y (c) . It is not
necessary to assume that all possible classes are present in a given dataset. Suppose there is a single
covariate $x$ with $m$ distinct values $x^{(1)}, \ldots, x^{(m)}$ and that the observed frequency of
$(x^{(l)}, y^{(j)})$ is $r_{lj}$.
Some (possibly many) $r_{lj}$ may be zero. According to model (1) the log likelihood of the observations is
\[ \sum_{l=1}^{m} \sum_{j=1}^{c} r_{lj} \ln f_{lj} \]
Fig. 1. Schematic depiction of bias due to lack of a continuity correction. Circles: ‘observations’ y1 , . . . , y100 . Solid
line: median of Y |x. Dashed line: estimated median if ‘observations’ are uncorrected. See the text for further details.
where ε is 0.5 for rounding and 1 for truncation. Except for the highest class y (c) where the previous
interval is used, each interval between y ( j) and y ( j+1) is used. Simulations from Normal distributions
with rounding of responses as exemplified above (not reported) showed that this approach provided a
satisfactory continuity correction. It was used in all the analyses reported in Sections 3 and 4.
2.5. Preliminaries
Before θ 0 , . . . , θ k can be estimated by maximum likelihood as just described, the functional form of
the model must be determined. This is a nontrivial task. I suggest the following approach. As a first
approximation, the ordered probit type of model is assumed, so that terms involving products of y and x
are ignored. The functional form for x is determined by ordered probit regression of y on x. The number c
of classes may be too large for software to accommodate and initial grouping of y may be necessary. Once
the functional form for x has been decided, binary probit regression of the jth ( j = 1, . . . , c − 1) vector
of indicator variables u i j = I (yi ≤ y ( j) ) (i = 1, . . . , n) on covariates representing the chosen function
of x, such as a polynomial or FP, is performed. It is important to centre covariates on a suitable value such
as the observed mean because the appropriateness of the function β0 (y; θ 0 ) depends on sensible centring.
The analysis generates c − 1 estimated regression coefficients for the constant term and c − 1 further terms
for each of the covariates representing x.
The c−1 coefficients for the constant term constitute a nonparametric estimate of the function β0 (y; θ 0 )
of y, namely the cdf of Y on a Normal deviate scale. Possible functions β0 (y; θ 0 ) may be determined by
inspecting plots of the coefficients against y and by fitting suitable functions of y, such as the power trans-
formation model already mentioned. Since the c − 1 coefficients are strongly positively correlated, care
must be taken to avoid the overcomplex models which may seem to be necessary because of the consid-
erably underestimated standard errors of parameters in any models that are tried. It is essential to choose
only monotonic functions of y since nonmonotonic functions cannot validly represent a cdf. Power,
logarithmic and exponential transformations and monotonic subfamilies of FPs are good candidates.
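As a rough illustration of the nonparametric estimate of the cdf on a Normal deviate scale, the sketch below ignores covariates altogether. In that intercept-only case the probit regression of each indicator vector reduces to the inverse Normal cdf of the cumulative proportion. This simplification, the function name and the toy data are assumptions made for illustration; the paper's procedure regresses the indicators on functions of x.

```python
from statistics import NormalDist


def cumulative_probits(y):
    """For each class except the highest, return (class value, probit of Pr(Y <= class)).

    Intercept-only simplification: with no covariates, the probit regression
    constant for the indicator u_ij = I(y_i <= y^(j)) equals the inverse
    Normal cdf of the observed proportion of ones.
    """
    classes = sorted(set(y))
    n = len(y)
    result = []
    for cut in classes[:-1]:  # the top class always has cumulative proportion 1
        indicators = [1 if yi <= cut else 0 for yi in y]
        proportion = sum(indicators) / n
        result.append((cut, NormalDist().inv_cdf(proportion)))
    return result


# Toy responses with many zeros (hypothetical data).
toy_y = [0, 0, 0, 0, 1, 1, 2, 3, 5, 8]
for cut, z in cumulative_probits(toy_y):
    print(cut, round(z, 3))
```

The returned values necessarily increase with the class value, which is the monotonicity that any fitted function of y must preserve.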
of distinct values of y and x found in the data (see Section 2.4). The way this is achieved will depend
on the statistical package used to program the model; I used Stata 6.0 (StataCorp, 1999). In practice
it is helpful to create a 'lagged' copy of the response variable, i.e. of $y^{(j-1)}$ for $j = 2, \ldots, c$. This
enables fitted values $F(y^{(j)}|x^{(l)}) = \Phi(Z(y^{(j)}|x^{(l)}))$ of the cdf of $Y$ to be differenced
easily, facilitating
calculation of the probability density elements $f_{lj}$ which form the log likelihood. The maximization of
the likelihood again depends on the specifics of the package used. I used Stata 6.0’s generic maximum
likelihood ‘engine’ ml, which is robust and provides considerable flexibility. For example, it allows
multiple ‘equations’. In the present application, a separate equation is required to define the model for
y and for each predictor. For example, for the Ca score data (see Section 3) the model comprises three
equations: γ0 + γ1 y, δ0 x1 and ε0 x2 . In this case, as far as ml is concerned there is no response variable.
Starting values for the parameters are not compulsory and in practice do not seem to be critical, though
of course a judicious choice of them will reduce the number of iterations needed for convergence to the
MLE.
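The differencing of fitted cdf values can be sketched outside Stata as follows. This is not the author's program: it assumes the simple model Z(y|x) = γ0 + γ1 y + δ0 x, and it assumes one reading of the continuity correction in Section 2.4, namely that the upper boundary of class j is y(j) + ε(y(j+1) − y(j)), with the highest class reusing the previous interval width. All function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def class_boundaries(classes, eps=0.5):
    """Upper boundary of each class (assumed form of the continuity correction).

    classes must be sorted and contain at least two values.
    eps = 0.5 for rounding, 1 for truncation.
    """
    bounds = []
    for j, y in enumerate(classes):
        if j + 1 < len(classes):
            width = classes[j + 1] - y
        else:
            width = y - classes[j - 1]  # highest class: reuse the previous interval
        bounds.append(y + eps * width)
    return bounds


def log_likelihood(params, freq, classes, eps=0.5):
    """Sum of r_lj * ln f_lj, where f_lj is a difference of fitted cdf values.

    params: (gamma0, gamma1, delta0) with gamma1 > 0.
    freq:   dict mapping (x value, zero-based class index j) to frequency r_lj.
    """
    gamma0, gamma1, delta0 = params
    bounds = class_boundaries(classes, eps)
    total = 0.0
    for (x, j), r in freq.items():
        if r == 0:
            continue  # empty cells contribute nothing
        upper = PHI(gamma0 + gamma1 * bounds[j] + delta0 * x)
        # 'Lagged' boundary; the lowest class has no lower neighbour.
        lower = PHI(gamma0 + gamma1 * bounds[j - 1] + delta0 * x) if j > 0 else 0.0
        total += r * math.log(upper - lower)
    return total


# Hypothetical frequencies for three classes and two covariate values.
toy_freq = {(0.0, 0): 3, (0.0, 1): 2, (1.0, 1): 1, (1.0, 2): 4}
print(log_likelihood((-1.0, 1.0, -0.5), toy_freq, classes=[0, 1, 2]))
```

A general-purpose optimizer could then be applied to the negative of this function; under the boundary convention assumed here the class probabilities at a given x need not sum exactly to one.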
Fig. 2. Ca score plotted against age for 3136 individuals. Values have been ‘jittered’ (see text for details). The vertical
axis is scaled as log(y + 1). (a) Males (n = 799), (b) females (n = 2337).
mean (50 years) so that $x_1 = \text{age} - 50$. Sex ($x_2$) was coded 0 for males, 1 for females. Figure 3 shows the
relation between the estimated coefficients $\hat\beta_0(y^{(j)})$, $\hat\beta_1(y^{(j)})$, $\hat\beta_2(y^{(j)})$
and $y^{(j)}$, plotted on a horizontal scale of $\log(y + 1)$.
The relation between $\hat\beta_0(y)$ and $\log(y + 1)$ is seen to be approximately linear. It was actually modelled
using a linear function of $(y + 1)^{\lambda}$, 1 being added to avoid zeros. The estimate (SE) of $\lambda$ was 0.147
(0.006). Due to the high serial correlation in the values $\hat\beta_0(y^{(j)})$ these estimates are not to be taken
at face value. $\hat\beta_1(y)$ appears to be positively and curvilinearly associated with $\log(y + 1)$ whereas
$\hat\beta_2(y)$ seems only weakly related.
The next stage was to fit the following version of model (1):
\[ Z(y|x_1, x_2) = \gamma_0 + \gamma_1 (y + 1)^{\lambda} + \delta_0 x_1 + \varepsilon_0 x_2, \]
i.e. omitting product terms between $y$ and $x_1$ or $x_2$ initially. The parameter $\lambda$ and its 95% confidence
interval were estimated by use of the profile log likelihood function, which was quadratic in shape. The
MLE (95% CI) of $\lambda$ was 0.226 (0.187, 0.265), substantially different from the coefficient-based estimate
of 0.147. Addition of the term $\delta_1 x_1 (y + 1)^{\lambda}$ to the model did not improve the fit significantly
(P > 0.9, likelihood ratio test), the conclusion being the same for other powers of $y + 1$ that were tried. Similar
results were obtained for $x_2$. The conclusion is that neither $\delta_1$ nor $\varepsilon_1$ is required in the
model. The results suggest a need to interpret coefficient plots such as Figure 3 extremely conservatively.
Fig. 3. Estimated coefficients (a) $\hat\beta_0(y)$, (b) $\hat\beta_1(y)$ and (c) $\hat\beta_2(y)$ plotted against
$y$ (total Ca score). The horizontal axis is scaled as log(y + 1).
Figure 4 shows the 50, 75, 90 and 95th centiles of total Ca score estimated from the model.
Also shown in Figure 4 are empirical centile estimates (dashed curves) calculated in five-year age
groups up to age 70, then age 70 to 89. The empirical estimates are plotted against the mean age in each
age group. The model-based estimates agree very well with the empirical ones, the largest discrepancy
being for the 75th centile curve for females which appears to be slightly underestimated.
Model-based centiles for $(Y + 1)^{\lambda}$ are given by the formula
\[ (Y_q + 1)^{\lambda} = \frac{\Phi^{-1}(q) - \gamma_0 - x_1\delta_0 - x_2\varepsilon_0}{\gamma_1} \]
(see Section 2.4). The variance of $(Y_q + 1)^{\lambda}$ was approximated by using the rule for the variance of a
ratio of correlated random variables, $A$ and $B$:
\[ \mathrm{var}(A/B) \simeq \frac{\mathrm{var}(A)}{E^2(B)} + \mathrm{var}(B)\,\frac{E^2(A)}{E^4(B)} - 2\,\mathrm{cov}(A, B)\,\frac{E(A)}{E^3(B)}, \]
with $A = \Phi^{-1}(q) - \gamma_0 - x_1\delta_0 - x_2\varepsilon_0$, $B = \gamma_1$. Confidence intervals were
obtained on the power-transformed scale assuming Normality of the estimates and back-transformed to the
original scale. Figure 5 shows the centiles of Figure 4 together with estimated 95% confidence intervals.
Figure 6 shows the estimated mean of total Ca score for each sex according to the model, together with
a nonparametric estimate of the mean calculated using running-line smoothing (Sasieni and Royston,
1998).
While both mean curves rise steeply with age, that for males rises at an earlier age than that for females.
This pattern is compatible with that of the age-related incidence rates for heart disease.
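Returning to the centile formula and the ratio-variance approximation given earlier in this section, the calculation can be sketched as below. The sketch treats λ as fixed and takes the covariance matrix of (γ0, γ1, δ0, ε0) as given; both are assumptions, as are the function name and the numerical values in the usage lines, which are not estimates from the paper.

```python
from statistics import NormalDist


def centile_with_ci(q, x1, x2, gamma0, gamma1, delta0, epsilon0, cov, lam, level=0.95):
    """Centile Y_q and an approximate confidence interval on the original scale.

    cov is the 4x4 covariance matrix of (gamma0, gamma1, delta0, epsilon0).
    lam is treated as known (an assumption of this sketch).
    """
    nd = NormalDist()
    a = nd.inv_cdf(q) - gamma0 - x1 * delta0 - x2 * epsilon0  # A
    b = gamma1                                                # B
    grad_a = [-1.0, 0.0, -x1, -x2]  # derivative of A with respect to each parameter
    var_a = sum(grad_a[i] * cov[i][j] * grad_a[j] for i in range(4) for j in range(4))
    var_b = cov[1][1]
    cov_ab = sum(grad_a[i] * cov[i][1] for i in range(4))
    # Variance of a ratio of correlated random variables, with estimates in place of expectations.
    var_ratio = var_a / b**2 + var_b * a**2 / b**4 - 2.0 * cov_ab * a / b**3
    half_width = nd.inv_cdf(0.5 + level / 2.0) * var_ratio ** 0.5

    def back(t):
        # Invert t = (y + 1)**lam; values of t at or below 0 are mapped to y = -1.
        return max(t, 0.0) ** (1.0 / lam) - 1.0

    ratio = a / b
    return back(ratio), back(ratio - half_width), back(ratio + half_width)


# Hypothetical estimates and covariance matrix; lam = 0.226 is the MLE quoted in the text.
toy_cov = [[0.010, 0.001, 0.0, 0.0],
           [0.001, 0.002, 0.0, 0.0],
           [0.0, 0.0, 0.0001, 0.0],
           [0.0, 0.0, 0.0, 0.004]]
print(centile_with_ci(0.9, x1=10.0, x2=1.0, gamma0=-1.5, gamma1=1.2,
                      delta0=-0.03, epsilon0=0.4, cov=toy_cov, lam=0.226))
```

A back-transformed value at or below zero indicates that the centile falls in the zero class at that covariate combination.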
A parametric model for ordinal response data 271
Fig. 4. Empirical (dashed curves) and model-based (solid curves) centile curves for total Ca score. The vertical axis
is scaled as log(y + 1). (a) Males, (b) females.
Goodness of fit was assessed graphically (see Figure 4) and by comparing the observed and expected
numbers of observations lying below estimated centile curves. Since the centile curves only apply mean-
ingfully for positive values of total Ca score, observations for which the estimated centile was 0 or less
were excluded. The results are given in Table 2.
With the possible exception of the 75th centile for females (as already noted), the results indicate that
the model fits well.
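The observed-versus-expected comparison can be sketched as a simple count. The function below assumes that "below" means strictly less than the estimated centile and that observations whose estimated centile is 0 or less are dropped, as described above; the function name and toy data are hypothetical.

```python
def below_centile_summary(y, centile_values):
    """Count observations lying below their estimated centile value.

    y:              observed responses.
    centile_values: estimated centile (e.g. the 90th) at each observation's covariates.
    Observations whose estimated centile is 0 or less are excluded.
    Returns (r, n, proportion).
    """
    kept = [(yi, ci) for yi, ci in zip(y, centile_values) if ci > 0]
    r = sum(1 for yi, ci in kept if yi < ci)
    n = len(kept)
    proportion = r / n if n else float("nan")
    return r, n, proportion


# Toy data: the first observation is excluded because its estimated centile is 0.
print(below_centile_summary([0, 1, 5, 10], [0.0, 2.0, 4.0, 12.0]))
```

The proportion returned for a well-fitting model should be close to the nominal centile level, for example about 0.90 for the 90th centile.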
Fig. 5. Centile curves with 95% confidence intervals for total Ca score. The vertical axis is scaled as log(y + 1).
(a) Males, (b) females.
Fig. 6. Running-line (dashed curves) and model-based (solid curves) mean curves for total Ca score. The vertical axis
is scaled as log(y + 1). Long dashes, males; short dashes, females.
Fig. 7. Comparison between centile curves for total Ca score from proposed model (solid curves) and Normal model
based on the method of Royston and Wright (1998) (dashed curves). The vertical axis is scaled as log(y + 1).
(a) Males, (b) females.
Quite clearly, the functions representing the 50, 75, 90 and 95th centile curves are completely inappro-
priate. Ignoring the structure of the data in this way is an infeasible approach to analysis and underlines
the need for a better method.
(Figure 8).
As a result of the preliminary investigation, the following version of model (1) was fitted to the original
data:
The coefficient ε1 differed significantly from zero (P < 0.01, likelihood ratio test). A selection of
empirical and fitted centile curves is shown in Figure 9.
Judging by the empirical centiles, the fitted curves for males seem less accurate than for females, but
the data are much sparser. Goodness of fit was also assessed as for total Ca scores in the previous example,
with the results shown in Table 3. The fit is generally excellent.
Fig. 9. Empirical (dashed curves) and model-based (solid curves) centile curves for HAQ score. The vertical axis is
scaled as log(y + 1). (a) Males (n = 933), (b) females (n = 2860).
           Females (n = 2860)       Males (n = 933)
Centile    r       Proportion       r       Proportion
50         1426    0.499            472     0.506
75         2087    0.730            699     0.749
90         2556    0.894            831     0.891
95         2716    0.950            884     0.947
99         2851    0.997            927     0.994
Since HAQ scores are bounded on [0, 3] it may be argued that an underlying Normal model, which is
doubly unbounded, is inappropriate. The relation between $\hat\beta_0(y^{(j)})$ and $y^{(j)}$ appears somewhat
sigmoid. A possible alternative model, which specifies an upper bound $\kappa_1$ and a lower bound
$\kappa_1 - \kappa_2$ for $Y$, is as follows:
\[ Z(y|x) = \gamma_0 + \gamma_1 \ln\left\{-\ln\left(\frac{\kappa_1 - y}{\kappa_2}\right)\right\} + \delta_{01} x_{11} + \delta_{02} x_{12} + (\varepsilon_0 + \varepsilon_1 y)\,x_2. \]
ML estimates of $\kappa_1$ and $\kappa_2$ were found by grid search of the profile likelihood function and were 3.07
and 4.48 respectively. The interval $[\kappa_1 - \kappa_2, \kappa_1]$ was $[-1.41, 3.07]$. The fit of model (1) was
improved by using the transformation $\ln\{-\ln((\kappa_1 - y)/\kappa_2)\}$, the log likelihood being 44.1 higher.
However, the
improvement in fit, in terms of proportions of observations lying below estimated centile curves, was
hardly discernible. Due to the incorporation of an upper bound, the new model might be expected to
provide more appropriate estimates of extreme upper centiles. In practice, centiles above, say, the 95th
may be of little clinical importance and the more complex model is probably unnecessary in this case.
Since more parameters and a more complex functional form are estimated, it is also likely to be less
robust.
5. DISCUSSION
The class (1) of models proposed here is motivated by earlier work of Aitchison and Silvey (1957)
and Wade et al. (1995), but with increased emphasis on estimating the age-related distribution of a latent
variable, Y . It provides an economical yet flexible parametric framework for ordinal responses. The
model may include products of y with covariates, which effectively allows the location and scale of
the distribution of Y to depend on covariates. Non-Normal distributions of Y are available by applying
transformations to y.
In contrast to methods such as quantile regression and various nonparametric approaches, the esti-
mated centile curves from the model are guaranteed not to cross each other. The centile position of an
individual with given values (x, y) is easily computed from the model parameters. However, the esti-
mation procedure is somewhat complex and requires several steps: determination of the functional form
for x, identification of a suitable transformation for y, preliminary choice of the form of the rest of the
model based on the coefficients from the probit regressions, and finally fitting possible models by maxi-
mum likelihood. I feel that the two examples illustrated here, together with the illustration in Section 3.1,
of what may happen if a method which works well with continuous responses is applied, show that the
results justify the effort. The final models are parsimonious and produce smooth centile curves. Attempts
to produce convincing curves for the example datasets by other methods were not successful since the
models were unable satisfactorily to handle the highly discrete nature of the data, in particular the large
proportion of zeros.
A useful feature of the method is the ability to estimate centile curves and moment curves such as the
mean and standard deviation as functions of x. This may be useful in contexts wider than age-specific
reference intervals. For example, the total cost of medical care is increasingly important as an outcome
variable in clinical trials and other studies. It is important to model the effects of covariates on the mean,
since this permits a calculation of total costs. The distribution of costs is often extremely skewed and not
satisfactorily approximated by, for example, the standard distributions used in generalized linear models.
There may be many zero or near-zero values. Another possible application area is in regression models for
scores, such as quality of life indicators, which take few or many possible values and have distributions
not unlike that of the HAQ score analysed here.
Software to fit the models described here has been written for the Stata 6.0 package (StataCorp, 1999)
and is available from the author on request. It will be published in the Stata Technical Bulletin in due
course.
ACKNOWLEDGMENTS
I am most grateful to Ammarin Thakkinstian and Fred Wolfe for making respectively the total Ca
and HAQ score datasets available to me, and to Vern Farewell for a helpful discussion of the estimation
problem.
REFERENCES
Aitchison, J. and Silvey, S. D. (1957). The generalization of probit analysis to the case of multiple responses. Biometrika 44, 131–140.
Bonellie, S. R. and Raab, G. M. (1996). A comparison of different approaches for fitting centile curves to birthweight data. Statistics in Medicine 15, 2657–2667.
Cole, T. J. (1988). Fitting smoothed centile curves to reference data (with discussion). Journal of the Royal Statistical Society, Series A 151, 385–418.
Cole, T. J. and Green, P. J. (1992). Smoothing reference centile curves: the LMS method and penalised likelihood. Statistics in Medicine 11, 1305–1319.
Fries, J. F., Spitz, P. W., Kraines, R. G. and Holman, H. R. (1980). Measurement of patient outcome in arthritis. Arthritis and Rheumatism 23, 137–145.
He, X. (1997). Quantile curves without crossing. The American Statistician 51, 186–192.
Heagerty, P. J. and Pepe, M. S. (1999). Semiparametric estimation of regression quantiles with application to standardizing weight for height and age in US children. Applied Statistics 48, 533–551.
Healy, M. J. R., Rasbash, J. and Yang, M. (1988). Distribution-free estimation of age-related centiles. Annals of Human Biology 15, 17–22.
McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, Series B 42, 109–142.
Moulton, L. H., Holt, E. A., Job, J. S. and Halsey, N. A. (1996). Percentile regression analysis of correlated antibody responses. Statistics in Medicine 15, 2657–2667.
Pan, H. Q., Goldstein, H. and Yang, Q. (1990). Nonparametric estimation of age-related centiles over wide age ranges. Annals of Human Biology 17, 475–481.
Rossiter, J. E. (1991). Calculating centile curves using kernel density estimation methods with application to infant kidney lengths. Statistics in Medicine 10, 1693–1701.
Royston, P. (1991). Constructing time-specific reference ranges. Statistics in Medicine 10, 675–690.
Royston, P. and Altman, D. G. (1994). Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Applied Statistics 43, 429–467.
Royston, P. and Wright, E. M. (1998). A method for estimating age-specific reference intervals based on fractional polynomials and exponential transformation. Journal of the Royal Statistical Society, Series A 161, 79–101.
Sasieni, P. D. and Royston, P. (1998). Pointwise confidence intervals for running. Stata Technical Bulletin 41, 17–23.
StataCorp (1999). Stata Reference Manual, Version 6.0. College Station, Texas: Stata Press.
Wade, A. M., Ades, A. E., Salt, A. T., Jayatunga, R. and Sonksen, P. M. (1995). Age-related standards for ordinal data: modelling the changes in visual acuity from 2 to 9 years of age. Statistics in Medicine 14, 257–266.
Yu, K. and Jones, M. C. (1998). Local linear quantile regression. Journal of the American Statistical Association 93, 228–237.
[Received June 23, 1999; revised January 28, 2000; accepted for publication February 9, 2000]