
Biostatistics (2000), 1, 3, pp. 263–277
Printed in Great Britain

A parametric model for ordinal response data, with application to estimating age-specific reference intervals
PATRICK ROYSTON
MRC Clinical Trials Unit, 222 Euston Road, London NW1 2DA, UK
patrick.royston@ctu.mrc.ac.uk

SUMMARY
A model for ordinal response data based on an underlying (but unobserved) Normal distribution is
proposed. The model is particularly useful for highly discrete data with a large proportion of zero values.
It is applied to the estimation of age-specific reference intervals in two substantive example datasets.

1. INTRODUCTION
This paper was motivated by a request to produce smooth age-specific reference centile curves for a
response variable with a discrete, positively skewed distribution. Standard methods had failed to produce
satisfactory results. Subsequently I received a second, similar request from a different source relating to a
different response variable. Analyses of these datasets are presented in Sections 3 and 4.
Several authors have proposed parametric and non-parametric methods for producing age-specific ref-
erence intervals (estimated quantile curves) for continuous response variables, including Bonellie and
Raab (1996), Cole (1988), Cole and Green (1992), He (1997), Heagerty and Pepe (1999), Healy et al.
(1988), Moulton et al. (1996), Pan et al. (1990), Rossiter (1991), Royston (1991), Royston and Wright
(1998), Yu and Jones (1998). Despite all these efforts, with the exception of Wade et al. (1995) the case
of ordinal responses appears to have been neglected. Wade et al. (1995) used a proportional odds model
(McCullagh, 1980)

$$\ln\left\{\frac{p_j(x)}{1 - p_j(x)}\right\} = f_j(x; \beta)$$

to model the probability p_j that a child of age x has a visual acuity score of level j ∈ {1, 2, 3, 4} or better.
Various nonlinear functions f_j with an asymptote were used to represent the fact that visual acuity levels
off to adult values. Functions parallel between levels on the logistic scale were used. The parameters were
estimated by maximum likelihood. The methodology was framed only in terms of logistic regression, with
no explicit discussion of the idea of modelling the probability distribution of the ordinal response variable.
Ordinal variables may have dozens of levels. The idea of a parametric family of distributions for
such variables makes sense and leads to the possibility of parsimonious models, as opposed to needing
to parameterize all the levels indexed by j. In terms of the p_j, this amounts to defining a suitable model
which smooths the p_j with respect to j. A possible characteristic of such response variables is a proportion
of zero values, which may even approach 100% at some ages. Such distributions are difficult to model
accurately and parsimoniously. However, the essence of successful age-specific reference interval models
is adequately to represent the whole distribution of y|x (that is, the mechanism which generates the p_j),
not just, say, the age-specific mean or median.


© Oxford University Press (2000)

The method proposed here is an extension and a generalization of the method of Wade and Ades. It
is introduced by way of a reinterpretation of the Normal errors regression model in Section 2.1. Subse-
quent subsections deal with the identification of suitable functional forms for the model and parameter
estimation. Sections 3 and 4 present the analysis of the two datasets. Section 5 is a discussion.

2. METHODS
2.1. Centile-based interpretation of Normal-errors regression
Suppose we write a Normal-errors linear regression model for a random variable Y based on a single
covariate x as

$$Y = \sigma\beta_0 + \sigma\beta_1 x + \sigma Z$$

where σ > 0 and the standardized residual Z ∼ N (0, 1). The model can be re-expressed on the standard
Normal deviate scale as
$$Z = \frac{Y}{\sigma} - \beta_0 - \beta_1 x.$$
At a fixed value of Y , a unit increase in x results in a change of −β1 in Z , which may be interpreted as a
change of centile position in the distribution of Z . Suppose that β0 = 0, β1 = 1, Y = 0. A change from
x = 0 to x = 1 corresponds to a change from the mean or median (Z = 0) to approximately the 16th
centile (Z = −1, Φ(−1) = 0.159, where Φ(·) is the standard Normal cdf). The model may equivalently
be written in a way suggestive of probit regression. In terms of the cdf F(y|x) of Y, we have

$$F(y|x) = \Pr(Y \le y \mid x) = \Phi(Z) = \Phi\left(\frac{y}{\sigma} - \beta_0 - \beta_1 x\right).$$

Suppose now we have a random sample of observations {x_i, y_i}, i = 1, ..., n. Define binary indicator
variables {u_ij}, i = 1, ..., n; j = 1, ..., n − 1, by u_ij = 1 if y_i ≤ y_j, 0 otherwise. For a given j ∈ [1, n − 1], the
parameters y_j/σ − β0 and −β1 could in principle be estimated by probit regression of the u_ij (i = 1, ..., n) on
the x_i. When y_1, ..., y_n are ordered categorical rather than continuous outcomes, this model is essentially
the ordered probit approach proposed by Aitchison and Silvey (1957). It is similar to the more popular
proportional odds model of McCullagh (1980) in which Φ^{-1}{F(y|x)} is replaced by the logit function
applied to F(y|x). In the ordered probit or proportional odds approach, the parameters y_j/σ − β0,
which are sometimes known as ‘cutpoints’, are regarded as nuisance parameters and are estimated simul-
taneously with the regression coefficient −β1 by maximum likelihood. Interest centres around β1 and no
attempt is made to estimate β0 and σ separately.
With the proportional odds model, the supposition that the effect of x on the logistic scale is the same at
all values of Y is known as the proportional odds assumption. The analogous assumption for the ordered
probit model is that the effect of x on the Normal deviate scale is the same at all Y .
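
The centile interpretation above can be checked numerically. The following is a minimal Python sketch, not part of the original analysis; the function name `centile_position` is hypothetical, and σ = 1 is an assumption (it is immaterial here because Y = 0).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal distribution, mean 0 and SD 1


def centile_position(y, x, sigma, beta0, beta1):
    """Return F(y|x) = Phi(y/sigma - beta0 - beta1*x), expressed as a percentage."""
    z = y / sigma - beta0 - beta1 * x
    return 100.0 * STD_NORMAL.cdf(z)


# Values from the text: beta0 = 0, beta1 = 1, Y = 0 (sigma = 1 is assumed).
at_x0 = centile_position(y=0.0, x=0.0, sigma=1.0, beta0=0.0, beta1=1.0)
at_x1 = centile_position(y=0.0, x=1.0, sigma=1.0, beta0=0.0, beta1=1.0)
print(round(at_x0, 1), round(at_x1, 1))
```

Moving from x = 0 to x = 1 at fixed Y = 0 shifts the centile position from the median to roughly the 16th centile, as stated in the text.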

2.2. Proposed model


I shall write the proposed model in general terms and then simplify it appropriately for age-specific reference
interval estimation. Suppose Y is a continuous random variable which depends on a k-dimensional
covariate vector x and parameter vectors θ0, ..., θ_k in the following way:

$$Z(y|x) = \Phi^{-1}\{F(y|x)\} = \beta_0(y; \theta_0) + \beta_1(y; \theta_1)x_1 + \ldots + \beta_k(y; \theta_k)x_k, \qquad (1)$$



Table 1. Some simple examples of the proposed class of models

Model for Z(y|x)               E(Y)                           var^{1/2}(Y)     Comment
γ0 + γ1 y                      −γ0/γ1                         1/γ1             Y Normal, no covariates
γ0 + γ1 y + δ0 x               −(γ0 + δ0 x)/γ1                1/γ1             Homoscedastic linear regression
γ0 + γ1 y + δ00 x + δ01 x²     −(γ0 + δ00 x + δ01 x²)/γ1      1/γ1             Homoscedastic quadratic regression
γ0 + γ1 y + x(δ0 + δ1 y)       −(γ0 + δ0 x)/(γ1 + δ1 x)       1/(γ1 + δ1 x)    Heteroscedastic nonlinear regression. Approx. linear if |δ1 x| ≪ |γ1| ∀x

where β0(y; θ0), ..., β_k(y; θ_k) are possibly nonlinear functions of y and θ0, ..., θ_k. The variable Y is
considered to be ‘latent’ or unobserved. Actual observations are in categories which may be notionally
regarded as ‘bins’ of Y. The function β0(y; θ0) is a transformation towards (underlying) Normality of
β0(Y; θ0). Covariate effects have two components: effects of x operate on the standard Normal deviate
scale Z(.), but if a function β_j(y; θ_j) is nonconstant with respect to y, then x_j also modifies the scale
and/or shape of the distribution of Y.
Now consider a simple case with k = 1 in which there is a single covariate x (e.g. age) related to the
distribution of Y as follows:
$$Z(y|x) = \gamma_0 + \gamma_1 y + x(\delta_0 + \delta_1 y), \qquad (2)$$

so that θ0 = (γ0, γ1)′, θ1 = (δ0, δ1)′, β0(y; θ0) and β1(y; θ1) are linear in y, and Y is Normally distributed.
The terms δ0 x and δ1 x y allow x to influence respectively the location and scale of Y. If δ1 = 0
then (2) reduces to a model closely related to the ordered probit.
Model (2) is too restrictive to be generally useful for age-specific reference interval estimation and
may require extension in two respects. First, there is no a priori reason why Y should be Normal, and
therefore a nonlinear transformation β0 (y; θ 0 ) may be needed. Second, linearity in x may fail and a
more complex function, such as a polynomial or fractional polynomial (FP) (Royston and Altman, 1994)
may be required. For example, a possible model which involves a power transformation λ1 of y towards
Normality, a second-degree FP in x with powers (p1, p2) and further power transformations λ2, λ3 of y
is

$$Z(y|x) = \gamma_0 + \gamma_1 y^{\lambda_1} + x^{p_1}\left(\delta_0 + \delta_1 y^{\lambda_2}\right) + x^{p_2}\left(\varepsilon_0 + \varepsilon_1 y^{\lambda_3}\right).$$
Considerable flexibility is available with such models.

2.3. Examples of simple models


To aid interpretation, Table 1 illustrates four examples of the class (1) of models.
We see that addition of the term x δ1 y to the model γ0 + γ1 y + δ0 x changes it from homoscedastic linear
regression in Y to heteroscedastic nonlinear regression in which E(Y) is a ratio of linear functions of x.
Heteroscedastic linear regression does not belong to the class (1). However, if |δ1 x| is much smaller than
|γ1| then var^{1/2}(Y) is approximately linear in x, and depending on the magnitude of δ0, E(Y) is quadratic
or approximately linear in x.
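
To make the last row of Table 1 concrete, here is a minimal Python sketch of the mean and standard deviation implied by Z(y|x) = γ0 + γ1 y + x(δ0 + δ1 y). The function name `mean_and_sd` and the parameter values in the usage lines are hypothetical, chosen only for illustration.

```python
def mean_and_sd(x, gamma0, gamma1, delta0, delta1):
    """E(Y|x) and SD(Y|x) implied by Z(y|x) = gamma0 + gamma1*y + x*(delta0 + delta1*y)."""
    slope = gamma1 + delta1 * x  # coefficient of y in Z(y|x)
    if slope <= 0:
        # Z must increase with y for F(y|x) to be a valid cdf.
        raise ValueError("gamma1 + delta1*x must be positive")
    mean = -(gamma0 + delta0 * x) / slope  # the y at which Z(y|x) = 0
    sd = 1.0 / slope                       # the change in y per unit change in Z
    return mean, sd


# Hypothetical parameter values, not estimates from the paper.
homoscedastic = mean_and_sd(x=0.0, gamma0=-2.0, gamma1=0.5, delta0=-0.1, delta1=0.0)
heteroscedastic = mean_and_sd(x=10.0, gamma0=-2.0, gamma1=0.5, delta0=-0.1, delta1=0.01)
print(homoscedastic, heteroscedastic)
```

With δ1 = 0 the standard deviation is constant in x. A nonzero δ1 changes both the mean and the standard deviation with x.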

2.4. Estimation
Suppose that observations of Y in a given dataset are made in c > 1 classes y^(1), ..., y^(c). It is not
necessary to assume that all possible classes are present in a given dataset. Suppose there is a single
covariate x with m distinct values x^(1), ..., x^(m) and that the observed frequency of (x^(l), y^(j)) is r_lj.
Some (possibly many) r_lj may be zero. According to model (1) the log likelihood of the observations is

$$\sum_{l=1}^{m} \sum_{j=1}^{c} r_{lj} \ln f_{lj}$$

where the probability density elements f_lj are given by

$$f_{lj} = \begin{cases} F(y^{(j)}|x^{(l)}) & \text{if } j = 1, \\ F(y^{(j)}|x^{(l)}) - F(y^{(j-1)}|x^{(l)}) & \text{otherwise.} \end{cases}$$

In practice, estimation involves reorganizing the data to comprise the required frequencies r_lj and a ‘predecessor’
variable y^(j−1) for each class y^(j) except the first. A flexible maximum likelihood routine is
needed to estimate the parameters θ0, ..., θ_k. Extension to multiple covariates is straightforward.
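
The grouped-data log likelihood can be sketched in a few lines of Python. This is an illustrative sketch for the four-parameter model (2) only, not the Stata implementation used for the analyses; the function name `log_likelihood` and the toy data are hypothetical. It assumes the parameter values give strictly positive f_lj wherever r_lj > 0.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard Normal cdf


def log_likelihood(params, x_values, y_classes, freq):
    """Grouped-data log likelihood for model (2).

    params    -- (gamma0, gamma1, delta0, delta1)
    x_values  -- the m distinct covariate values
    y_classes -- the c class values, in increasing order
    freq      -- freq[l][j] is the observed frequency r_lj
    """
    g0, g1, d0, d1 = params
    total = 0.0
    for l, x in enumerate(x_values):
        prev_cdf = 0.0  # the first class has no predecessor
        for j, y in enumerate(y_classes):
            cdf = PHI(g0 + g1 * y + x * (d0 + d1 * y))
            f_lj = cdf - prev_cdf          # difference of successive cdf values
            if freq[l][j] > 0:             # empty cells contribute nothing
                total += freq[l][j] * math.log(f_lj)
            prev_cdf = cdf
    return total


# Hypothetical toy data: two covariate values and three classes.
x_values = [30.0, 60.0]
y_classes = [0.5, 1.5, 2.5]
freq = [[8, 2, 0], [3, 4, 3]]
print(log_likelihood((-0.5, 1.0, -0.02, 0.0), x_values, y_classes, freq))
```

In practice this function would be passed, with its sign reversed, to a general-purpose optimizer to obtain the maximum likelihood estimates.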
An age-specific reference curve Y_q(x) for a given quantile q is obtained by simple algebra. For example,
for model (2) the required curve is

$$Y_q(x) = \frac{\Phi^{-1}(q) - \gamma_0 - x\delta_0}{\gamma_1 + \delta_1 x},$$

which is nonlinear in x despite (2) being linear in x. Other quantities of interest include moments of
the distribution of Y. The pth moment E(Y^p|x), such as the expected value (p = 1), may be found by
integration over a suitable grid of Y-values of the product of the estimated pdf and Y^p.
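
Both calculations can be sketched in Python for model (2). The function names, the grid limits and the parameter values are hypothetical. The moment is approximated here by summing y^p times increments of the fitted cdf over a fine grid, which is one simple way of carrying out the integration described above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_model2(y, x, g0, g1, d0, d1):
    """Z(y|x) for model (2)."""
    return g0 + g1 * y + x * (d0 + d1 * y)


def quantile_curve(q, x, g0, g1, d0, d1):
    """Y_q(x) = (Phi^{-1}(q) - gamma0 - x*delta0) / (gamma1 + delta1*x)."""
    return (STD_NORMAL.inv_cdf(q) - g0 - x * d0) / (g1 + d1 * x)


def moment(p, x, params, lo, hi, steps=20000):
    """Approximate E(Y^p | x) on a grid from lo to hi (midpoint rule on cdf increments)."""
    width = (hi - lo) / steps
    total = 0.0
    prev = STD_NORMAL.cdf(z_model2(lo, x, *params))
    for k in range(1, steps + 1):
        upper = lo + k * width
        cur = STD_NORMAL.cdf(z_model2(upper, x, *params))
        midpoint = upper - 0.5 * width
        total += midpoint ** p * (cur - prev)  # y^p times probability of the grid cell
        prev = cur
    return total


# Hypothetical parameters giving mean 6 and SD 2 at x = 10.
params = (-2.0, 0.5, -0.1, 0.0)
median_at_10 = quantile_curve(0.5, 10.0, *params)
mean_at_10 = moment(1, 10.0, params, lo=-14.0, hi=26.0)
print(median_at_10, round(mean_at_10, 3))
```

The grid limits must cover essentially all of the probability mass of Y|x; here they span ten standard deviations either side of the mean.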

2.4.1. Continuity correction


Consider a hypothetical dataset with 100 observations in which the ith observation is generated by a
rounding process as follows:

$$x_i = i, \qquad Y_i = x_i, \qquad y_i = 10\left[\frac{Y_i}{10} + \frac{1}{2}\right],$$

where [x] represents truncation to the nearest integer below x. There are c = 11 classes y^(1), ..., y^(11)
with respective values 0, 10, 20, ..., 100. A plot of y versus x is a ‘staircase’ (see Figure 1).
The median of Y|x of course equals x. However, the values y^(j), on which estimation of the parameters
of the model

$$Z(y|x) = \gamma_0 + \gamma_1 y + \delta_0 x$$

is based, are the ‘bottom corners’ of the steps of the staircase rather than the middles of the steps through
which the median line passes. As a result, the fitted median is biased downwards by one-half the rounding
interval (the distance between successive y values), that is, by 5 in this example. For a truncation process, for example
y_i = 10[Y_i/10], the bias equals the rounding interval.
To avoid such biases in estimating centile curves from model (1), it is necessary to modify the values
of y^(j) by adding ε times the rounding interval, before model fitting:

$$\tilde{y}^{(j)} = \begin{cases} y^{(j)} + \varepsilon\left(y^{(j+1)} - y^{(j)}\right) & \text{if } j < c, \\ y^{(j)} + \varepsilon\left(y^{(j)} - y^{(j-1)}\right) & \text{if } j = c, \end{cases}$$

Fig. 1. Schematic depiction of bias due to lack of a continuity correction. Circles: ‘observations’ y1 , . . . , y100 . Solid
line: median of Y |x. Dashed line: estimated median if ‘observations’ are uncorrected. See the text for further details.

where ε is 0.5 for rounding and 1 for truncation. Except for the highest class y^(c) where the previous
interval is used, each interval between y^(j) and y^(j+1) is used. Simulations from Normal distributions
with rounding of responses as exemplified above (not reported) showed that this approach provided a
satisfactory continuity correction. It was used in all the analyses reported in Sections 3 and 4.
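
The correction is easy to apply to the hypothetical ‘staircase’ dataset of this section. The following Python sketch is illustrative only; the function name `continuity_correct` is hypothetical.

```python
import math


def continuity_correct(classes, eps):
    """Shift each class value by eps times the width of its rounding interval.

    classes -- distinct observed class values, in increasing order
    eps     -- 0.5 for rounding, 1.0 for truncation
    """
    if len(classes) < 2:
        raise ValueError("at least two classes are needed")
    corrected = []
    for j, y in enumerate(classes):
        if j < len(classes) - 1:
            interval = classes[j + 1] - y   # interval up to the next class
        else:
            interval = y - classes[j - 1]   # highest class: use the previous interval
        corrected.append(y + eps * interval)
    return corrected


# The staircase data: x_i = Y_i = i for i = 1, ..., 100, rounded to the nearest 10.
y_rounded = [10 * math.floor(i / 10 + 0.5) for i in range(1, 101)]
classes = sorted(set(y_rounded))
corrected = continuity_correct(classes, eps=0.5)
print(classes)
print(corrected)
```

The 11 classes 0, 10, ..., 100 are moved to 5, 15, ..., 105, that is, from the bottom corners of the steps to the points at which each class actually ends on the Y scale.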

2.5. Preliminaries
Before θ0, ..., θ_k can be estimated by maximum likelihood as just described, the functional form of
the model must be determined. This is a nontrivial task. I suggest the following approach. As a first
approximation, the ordered probit type of model is assumed, so that terms involving products of y and x
are ignored. The functional form for x is determined by ordered probit regression of y on x. The number c
of classes may be too large for software to accommodate and initial grouping of y may be necessary. Once
the functional form for x has been decided, binary probit regression of the jth (j = 1, ..., c − 1) vector
of indicator variables u_ij = I(y_i ≤ y^(j)) (i = 1, ..., n) on covariates representing the chosen function
of x, such as a polynomial or FP, is performed. It is important to centre covariates on a suitable value such
as the observed mean because the appropriateness of the function β0(y; θ0) depends on sensible centring.
The analysis generates c − 1 estimated regression coefficients for the constant term and c − 1 further terms
for each of the covariates representing x.
The c − 1 coefficients for the constant term constitute a nonparametric estimate of the function β0(y; θ0)
of y, namely the cdf of Y on a Normal deviate scale. Possible functions β0(y; θ0) may be determined by
inspecting plots of the coefficients against y and by fitting suitable functions of y, such as the power transformation
model already mentioned. Since the c − 1 coefficients are strongly positively correlated, care
must be taken to avoid the overcomplex models which may seem to be necessary because of the considerably
underestimated standard errors of parameters in any models that are tried. It is essential to choose
only monotonic functions of y since nonmonotonic functions cannot validly represent a cdf. Power,
logarithmic and exponential transformations and monotonic subfamilies of FPs are good candidates.
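
The indicator variables used in the preliminary binary probit regressions can be constructed as in the following Python sketch. The function name `indicator_columns` and the toy responses are hypothetical. The probit fits themselves are not shown and would be carried out with whatever regression software is available.

```python
def indicator_columns(y_obs):
    """Return the cutpoints y^(1), ..., y^(c-1) and the vectors u_ij = I(y_i <= y^(j))."""
    classes = sorted(set(y_obs))
    cutpoints = classes[:-1]  # the highest class would give u = 1 for every observation
    columns = {cut: [1 if y <= cut else 0 for y in y_obs] for cut in cutpoints}
    return cutpoints, columns


# Hypothetical responses with c = 3 distinct classes.
y_obs = [0, 0, 1, 3, 1]
cutpoints, columns = indicator_columns(y_obs)
for cut in cutpoints:
    print(cut, columns[cut])
```

Each column is then the binary response in one probit regression on the chosen functions of x, giving one set of coefficients per cutpoint.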

2.6. Fitting algorithm


The main difficulty with fitting the model is not so much the process of estimation as of restructuring
the data into a rectangular array {x^(l), y^(j), r_lj}, l = 1, ..., m; j = 1, ..., c, which embodies all possible combinations
of distinct values of y and x found in the data (see Section 2.4). The way this is achieved will depend
on the statistical package used to program the model; I used Stata 6.0 (StataCorp, 1999). In practice
it is helpful to create a ‘lagged’ copy of the response variable, i.e. of y^(j−1) for j = 2, ..., c. This
enables fitted values F(y^(j)|x^(l)) = Φ{Z(y^(j)|x^(l))} of the cdf of Y to be differenced easily, facilitating
calculation of the probability density elements f_lj which form the log likelihood. The maximization of
the likelihood again depends on the specifics of the package used. I used Stata 6.0’s generic maximum
likelihood ‘engine’ ml, which is robust and provides considerable flexibility. For example, it allows
multiple ‘equations’. In the present application, a separate equation is required to define the model for
y and for each predictor. For example, for the Ca score data (see Section 3) the model comprises three
equations: γ0 + γ1 y, δ0 x1 and ε0 x2 . In this case, as far as ml is concerned there is no response variable.
Starting values for the parameters are not compulsory and in practice do not seem to be critical, though
of course a judicious choice of them will reduce the number of iterations needed for convergence to the
MLE.
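
The restructuring step can be sketched in Python as follows. This is an illustration of the idea rather than the Stata code used for the analyses; the function name `restructure` and the toy data are hypothetical.

```python
from collections import Counter


def restructure(x_obs, y_obs):
    """Build rows (x, y, lagged y, frequency) for every combination of distinct x and y."""
    counts = Counter(zip(x_obs, y_obs))
    x_distinct = sorted(set(x_obs))
    y_distinct = sorted(set(y_obs))
    rows = []
    for x in x_distinct:
        lagged = None  # the first class has no predecessor
        for y in y_distinct:
            rows.append((x, y, lagged, counts.get((x, y), 0)))
            lagged = y
    return rows


# Hypothetical raw data: three individuals, two ages, two classes.
rows = restructure(x_obs=[30, 30, 40], y_obs=[0, 1, 0])
for row in rows:
    print(row)
```

Combinations that do not occur in the data appear with frequency zero, and the lagged column lets the cdf at the predecessor class be evaluated on the same row as the cdf at the class itself.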

2.7. Interval estimation


It is straightforward to obtain confidence intervals for estimates of Z (y|x) and hence of the centile
position of a given individual, by using the variance–covariance matrix of all the estimated parameters.
If a transformation of y has been used in β0(y; θ0), one can allow for the uncertainty associated with
estimation of any nonlinear parameter(s) by the usual approach of Taylor expansion around the MLE. For
example, if β0(y; θ0) = γ0 + γ1 y^λ then fitting the expanded model γ0 + γ1 y^λ̂ + γ2 y^λ̂ ln y (for which
γ2 = 0) would give asymptotically correct standard errors. Approximate confidence intervals around
estimated centile curves for Y itself can be obtained provided the centile curves are expressible
as functions of the parameters in closed form. Such expression is not always possible. The effect of
estimating λ is then ignored, resulting in an underestimation of the width of the confidence intervals; this
may be regarded as a limitation of the approach. In other cases where centiles are not expressible in closed
form, an alternative possibility is the bootstrap.

3. EXAMPLE 1: TOTAL CALCIFICATION SCORE


As a person ages, increasing calcification of the coronary arteries occurs. Such deposits are believed
to increase the chance of coronary artery disease and of heart attack. In a recent study in Thailand,
5382 unselected people were enrolled into a screening study based around a technique of measuring
calcification (the Ultrafast CT scanner). The scan generates an index of calcification known as the total
Ca score, which equals the sum of the scores from the four coronary arteries. The aim was to construct
reference centile curves for a ‘relatively normal sub-population’. To that end, individuals positive for any
of several risk factors for coronary heart disease were excluded from analysis. There remained a dataset
comprising 799 males and 2337 females. Figure 2 shows the distribution of total Ca score (y) by age.
Scores of 0 and 1 have been ‘jittered’ by the addition of noise uniform on [−0.25, 0.25] to improve
legibility. The distribution of y is highly positively skewed. The proportion of zeros is nearly 100% at age
30 but falls steadily to near-zero at age 80.
For investigation of the functional form for age I used the ordered probit procedure oprobit in Stata
6.0. The total Ca score was truncated to 49 since the software only permitted 50 categories for the outcome
variable. FP analysis showed that a linear function for age, adjusted for sex, fitted adequately. There was
no evidence of a sex–age interaction.
Probit regression models β0(y^(j)) + β1(y^(j))x1 + β2(y^(j))x2 for j = 1, ..., c − 1 = 211 were fitted as
described in Section 2.5 for each distinct value y^(j) of y. Age was centred on approximately the sample

Fig. 2. Ca score plotted against age for 3136 individuals. Values have been ‘jittered’ (see text for details). The vertical
axis is scaled as log(y + 1). (a) Males (n = 799), (b) females (n = 2337).

mean (50 years) so that x1 = age − 50. Sex (x2) was coded 0 for males, 1 for females. Figure 3 shows the
relation between the estimated coefficients β̂0(y^(j)), β̂1(y^(j)), β̂2(y^(j)) and y^(j), plotted on a horizontal
scale of log(y + 1).
The relation between β̂0(y) and log(y + 1) is seen to be approximately linear. It was actually modelled
using a linear function of (y + 1)^λ, 1 being added to avoid zeros. The estimate (SE) of λ was 0.147
(0.006). Due to the high serial correlation in the values β̂0(y^(j)) these estimates are not to be taken at face
value. β̂1(y) appears to be positively and curvilinearly associated with log(y + 1) whereas β̂2(y) seems
only weakly related.
The next stage was to fit the following version of model (1):

$$Z(y|x_1, x_2) = \gamma_0 + \gamma_1 (y + 1)^{\lambda} + \delta_0 x_1 + \varepsilon_0 x_2$$

i.e. omitting product terms between y and x1 or x2 initially. The parameter λ and its 95% confidence
interval were estimated by use of the profile log likelihood function, which was quadratic in shape. The
MLE (95% CI) of λ was 0.226 (0.187, 0.265), substantially different from the coefficient-based estimate
of 0.147. Addition of the term δ1 x1 (y + 1)^λ to the model did not improve the fit significantly (P > 0.9,
likelihood ratio test), the conclusion being the same for other powers of y + 1 that were tried. Similar
results were obtained for x2. The conclusion is that neither δ1 nor ε1 is required in the model. The results
suggest a need to interpret coefficient plots such as Figure 3 extremely conservatively.

Fig. 3. Estimated coefficients (a) β̂0(y), (b) β̂1(y) and (c) β̂2(y) plotted against y (total Ca score). The horizontal
axis is scaled as log(y + 1).

Figure 4 shows the 50, 75, 90 and 95th centiles of total Ca score estimated from the model.
Also shown in Figure 4 are empirical centile estimates (dashed curves) calculated in five-year age
groups up to age 70, then age 70 to 89. The empirical estimates are plotted against the mean age in each
age group. The model-based estimates agree very well with the empirical ones, the largest discrepancy
being for the 75th centile curve for females which appears to be slightly underestimated.

Model-based centiles for (Y + 1)^λ are given by the formula

$$(Y_q + 1)^{\hat\lambda} = \frac{\Phi^{-1}(q) - \hat\gamma_0 - x_1\hat\delta_0 - x_2\hat\varepsilon_0}{\hat\gamma_1}$$

(see Section 2.4). The variance of (Y_q + 1)^λ̂ was approximated by using the rule for the variance of a ratio
of correlated random variables, A and B:

$$\mathrm{var}(A/B) \simeq \frac{\mathrm{var}(A)}{E^2(B)} + \mathrm{var}(B)\frac{E^2(A)}{E^4(B)} - 2\,\mathrm{cov}(A, B)\frac{E(A)}{E^3(B)},$$

with A = Φ^{-1}(q) − γ̂0 − x1 δ̂0 − x2 ε̂0, B = γ̂1. Confidence intervals were obtained on the power-transformed
scale assuming Normality of the estimates and back-transformed to the original scale. Figure 5
shows the centiles of Figure 4 together with estimated 95% confidence intervals.
Figure 6 shows the estimated mean of total Ca score for each sex according to the model, together with
a nonparametric estimate of the mean calculated using running-line smoothing (Sasieni and Royston,
1998).
While both mean curves rise steeply with age, that for males rises at an earlier age than that for females.
This pattern is compatible with that of the age-related incidence rates for heart disease.
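
The ratio-variance rule and the back-transformation used for the centile confidence intervals earlier in this section can be sketched in Python. The function names are hypothetical and the numerical inputs are invented for illustration; only λ = 0.226 is taken from the text. The sketch assumes that the lower limit on the power-transformed scale is positive.

```python
import math
from statistics import NormalDist


def ratio_variance(mean_a, mean_b, var_a, var_b, cov_ab):
    """First-order approximation to var(A/B) for correlated A and B."""
    return (var_a / mean_b ** 2
            + var_b * mean_a ** 2 / mean_b ** 4
            - 2.0 * cov_ab * mean_a / mean_b ** 3)


def centile_interval(a, b, var_a, var_b, cov_ab, lam, level=0.95):
    """Centile of Y with an approximate CI, back-transformed from the (Y + 1)^lam scale."""
    t = a / b                                   # estimate of (Y_q + 1)^lam
    se = math.sqrt(ratio_variance(a, b, var_a, var_b, cov_ab))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower_t, upper_t = t - z * se, t + z * se
    if lower_t <= 0:
        raise ValueError("lower limit is not positive on the transformed scale")

    def back(v):
        return v ** (1.0 / lam) - 1.0           # inverse of y -> (y + 1)^lam

    return back(t), back(lower_t), back(upper_t)


# Invented values of A, B and their (co)variances; lam = 0.226 is the MLE reported in the text.
estimate, lower, upper = centile_interval(a=1.8, b=0.9, var_a=0.01, var_b=0.0025,
                                          cov_ab=0.001, lam=0.226)
print(estimate, lower, upper)
```

Because the interval is symmetric on the transformed scale, it is asymmetric about the estimate after back-transformation.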

Fig. 4. Empirical (dashed curves) and model-based (solid curves) centile curves for total Ca score. The vertical axis
is scaled as log(y + 1). (a) Males, (b) females.

Goodness of fit was assessed graphically (see Figure 4) and by comparing the observed and expected
numbers of observations lying below estimated centile curves. Since the centile curves only apply mean-
ingfully for positive values of total Ca score, observations for which the estimated centile was 0 or less
were excluded. The results are given in Table 2.
With the possible exception of the 75th centile for females (as already noted), the results indicate that
the model fits well.
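
The goodness-of-fit count described above can be sketched in Python. The function name `proportion_below` and the toy values are hypothetical; each observation is paired with the estimated centile at its own covariate values.

```python
def proportion_below(y_obs, centile_values):
    """Count observations lying below their estimated centile, skipping centiles <= 0."""
    r = n = 0
    for y, centile in zip(y_obs, centile_values):
        if centile <= 0:   # the centile is not meaningful for this observation
            continue
        n += 1
        if y < centile:
            r += 1
    proportion = r / n if n else float("nan")
    return r, n, proportion


# Invented data: four observations with their estimated centile values.
r, n, proportion = proportion_below([0, 2, 5, 1], [-1.0, 3.0, 4.0, 1.5])
print(r, n, round(proportion, 3))
```

For a well-fitting model the proportion for the qth centile should be close to q, as in Table 2.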

3.1. Failure of an existing method


To show that special methods really are needed for dealing properly with response variables such as
total Ca score, I apply naively the method of analysis described by Royston and Wright (1998) to log(y + 1)
for each sex separately. The fractional polynomial model required to represent E(log(y + 1)|x) was
found to have power 3 for each sex. An attempt to determine the relation between the standard deviation
of log(y + 1) and age failed due to nonconvergence of the algorithm; perforce, a constant variance was
assumed. An attempt to fit an exponential transformation model, as described by Royston and Wright
(1998), failed because an inappropriate value of the skewness parameter was estimated, resulting in centile
curves which bore no sensible relation to the data. A Normal model for log(y + 1) was therefore assumed.
The results of this exercise compared with those from the model derived in Section 3 are illustrated in
Figure 7.

Fig. 5. Centile curves with 95% confidence intervals for total Ca score. The vertical axis is scaled as log(y + 1).
(a) Males, (b) females.

Fig. 6. Running-line (dashed curves) and model-based (solid curves) mean curves for total Ca score. The vertical axis
is scaled as log(y + 1). Long dashes, males; short dashes, females.

Table 2. Goodness of fit of the model for centiles of total Ca score. Values in table are numbers and proportions of observations lying below particular estimated centile curves in relation to given denominators

            Females                    Males
Centile     r/n          Proportion    r/n        Proportion
50          64/131       0.489         102/198    0.515
75          334/475      0.703         360/470    0.766
90          1100/1211    0.908         619/685    0.904
95          1651/1735    0.952         734/772    0.951
99          2318/2337    0.992         790/799    0.989

Fig. 7. Comparison between centile curves for total Ca score from proposed model (solid curves) and Normal model
based on the method of Royston and Wright (1998) (dashed curves). The vertical axis is scaled as log(y + 1).
(a) Males, (b) females.

Quite clearly, the functions representing the 50, 75, 90 and 95th centile curves are completely inappro-
priate. Ignoring the structure of the data in this way is an infeasible approach to analysis and underlines
the need for a better method.

Fig. 8. Estimated coefficient for sex, β̂2(y), plotted against y (HAQ score).

4. EXAMPLE 2: HAQ SCORES IN ARTHRITIS


Rheumatoid arthritis is a debilitating joint disease that progresses gradually over a period of many
years. Functional ability of patients may be assessed by a multifactorial index known as the HAQ score
(Fries et al., 1980), which is the average of scores from eight categories including degree of joint pain
and swelling. By construction, HAQ scores are bound to the interval [0, 3]. A unique dataset comprising
observations from 2860 female and 933 male patients of >300 specialty physicians in the USA was made
available. Disease durations ranged from 0 to >60 years. There were 42 distinct scores and 10.6% of the
observations were zero. Reference intervals for HAQ score as a function of disease duration were required.
The relation between HAQ score and duration is weak. A scatter plot does not reveal the functional
form. Initial assessment of the functional form using oprobit suggested that an FP with powers (0, 0),
i.e. a quadratic in log(duration +1), adjusted for sex, could represent the relation with duration. The
fitted function dropped to a minimum at about 1.5 years and subsequently rose steadily. The behaviour is
clinically appropriate in that patients are given initial treatment which subdues the acute symptoms quite
effectively, but after that the disease slowly takes its course.
Preliminary probit regression models β0(y^(j)) + β11(y^(j))x11 + β12(y^(j))x12 + β2(y^(j))x2 for j =
1, ..., c − 1 = 41 were fitted, where x11 = ln(d + 1/12) − ln(12 + 1/12), x12 = ln²(d + 1/12) − ln²(12 + 1/12),
d = disease duration (years), x2 = sex coded 0 for females, 1 for males. One-twelfth of a year (i.e.
1 month) was added to duration to avoid zeros and the functions were centred on 12 years. The estimated
coefficients β̂0(y^(j)) were approximately linearly related to y^(j), suggesting that an underlying
Normal model for Y was adequate. The coefficients β̂11(y^(j)) and β̂12(y^(j)) appeared to be independent
of y^(j). However, the coefficients β̂2(y^(j)) for x2 showed a fairly strong linear relation with y^(j)
(Figure 8).
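
The two centred duration covariates defined above can be computed as in this short Python sketch. The function and constant names are hypothetical.

```python
import math

OFFSET = 1.0 / 12.0   # one month, added to duration to avoid log(0)
CENTRE = 12.0         # centring value in years


def duration_covariates(d):
    """Return (x11, x12): the centred log and squared-log terms for disease duration d."""
    log_d = math.log(d + OFFSET)
    log_c = math.log(CENTRE + OFFSET)
    return log_d - log_c, log_d ** 2 - log_c ** 2


for years in (0.0, 1.5, 12.0, 40.0):
    print(years, duration_covariates(years))
```

Both covariates are zero at a duration of 12 years, so the constant term of the model refers to a patient with that duration.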
As a result of the preliminary investigation, the following version of model (1) was fitted to the original
data:

$$Z(y|x_{11}, x_{12}, x_2) = \gamma_0 + \gamma_1 y + \delta_{01} x_{11} + \delta_{02} x_{12} + (\varepsilon_0 + \varepsilon_1 y)x_2.$$

The coefficient ε1 differed significantly from zero (P < 0.01, likelihood ratio test). A selection of
empirical and fitted centile curves is shown in Figure 9.
Judging by the empirical centiles, the fitted curves for males seem less accurate than for females, but
the data are much sparser. Goodness of fit was also assessed as for total Ca scores in the previous example,
with the results shown in Table 3. The fit is generally excellent.

Fig. 9. Empirical (dashed curves) and model-based (solid curves) centile curves for HAQ score. The vertical axis is
scaled as log(y + 1). (a) Males (n = 933), (b) females (n = 2860).

Table 3. Goodness of fit of the model for centiles of HAQ score. Values in table are numbers (r) and proportions of observations not exceeding particular estimated centile curves. The denominators are 2860 for females and 933 for males

            Females               Males
Centile     r       Proportion    r      Proportion
50          1426    0.499         472    0.506
75          2087    0.730         699    0.749
90          2556    0.894         747    0.891
95          2716    0.950         884    0.947
99          2851    0.997         927    0.994

Since HAQ scores are bounded on [0, 3] it may be argued that an underlying Normal model, which is
doubly unbounded, is inappropriate. The relation between β̂0(y^(j)) and y^(j) appears somewhat sigmoid.
A possible alternative model, which specifies an upper bound κ1 and a lower bound κ1 − κ2 for Y, is as
follows:

$$Z(y|x) = \gamma_0 + \gamma_1 \ln\left\{-\ln\left(\frac{\kappa_1 - y}{\kappa_2}\right)\right\} + \delta_{01} x_{11} + \delta_{02} x_{12} + (\varepsilon_0 + \varepsilon_1 y)x_2.$$

ML estimates of κ1 and κ2 were found by grid search of the profile likelihood function and were 3.07
and 4.48 respectively. The interval [κ1 − κ2, κ1] was [−1.41, 3.07]. The fit of model (1) was improved
by using the transformation ln{−ln((κ1 − y)/κ2)}, the log likelihood being 44.1 higher. However, the
improvement in fit, in terms of proportions of observations lying below estimated centile curves, was
hardly discernible. Due to the incorporation of an upper bound, the new model might be expected to
provide more appropriate estimates of extreme upper centiles. In practice, centiles above, say, the 95th
may be of little clinical importance and the more complex model is probably unnecessary in this case.
Since more parameters and a more complex functional form are estimated, it is also likely to be less
robust.
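
The bounded transformation can be sketched in Python as follows. The function name `bounded_transform` is hypothetical; the values of κ1 and κ2 are the ML estimates reported above.

```python
import math


def bounded_transform(y, kappa1, kappa2):
    """ln(-ln((kappa1 - y) / kappa2)), defined for kappa1 - kappa2 < y < kappa1."""
    if not (kappa1 - kappa2 < y < kappa1):
        raise ValueError("y must lie strictly between kappa1 - kappa2 and kappa1")
    return math.log(-math.log((kappa1 - y) / kappa2))


KAPPA1, KAPPA2 = 3.07, 4.48            # ML estimates reported in the text
haq_grid = [0.0, 1.0, 2.0, 3.0]        # HAQ scores spanning the observable range
transformed = [bounded_transform(y, KAPPA1, KAPPA2) for y in haq_grid]
print(transformed)
```

The transformation increases monotonically in y over the interval, as required of any function used to represent a cdf on the Normal deviate scale, and it tends to infinity as y approaches the upper bound κ1.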

5. DISCUSSION
The class (1) of models proposed here is motivated by earlier work of Aitchison and Silvey (1957)
and Wade et al. (1995), but with increased emphasis on estimating the age-related distribution of a latent
variable, Y . It provides an economical yet flexible parametric framework for ordinal responses. The
model may include products of y with covariates, which effectively allows the location and scale of
the distribution of Y to depend on covariates. Non-Normal distributions of Y are available by applying
transformations to y.
In contrast to methods such as quantile regression and various nonparametric approaches, the esti-
mated centile curves from the model are guaranteed not to cross each other. The centile position of an
individual with given values (x, y) is easily computed from the model parameters. However, the esti-
mation procedure is somewhat complex and requires several steps: determination of the functional form
for x, identification of a suitable transformation for y, preliminary choice of the form of the rest of the
model based on the coefficients from the probit regressions, and finally fitting possible models by maxi-
mum likelihood. I feel that the two examples illustrated here, together with the illustration in Section 3.1,
of what may happen if a method which works well with continuous responses is applied, show that the
results justify the effort. The final models are parsimonious and produce smooth centile curves. Attempts
to produce convincing curves for the example datasets by other methods were not successful since the
models were unable satisfactorily to handle the highly discrete nature of the data, in particular the large
proportion of zeros.
A useful feature of the method is the ability to estimate centile curves and moment curves such as the
mean and standard deviation as functions of x. This may be useful in contexts wider than age-specific
reference intervals. For example, the total cost of medical care is increasingly important as an outcome
variable in clinical trials and other studies. It is important to model the effects of covariates on the mean,
since this permits a calculation of total costs. The distribution of costs is often extremely skewed and not
satisfactorily approximated by, for example, the standard distributions used in generalized linear models.
There may be many zero or near-zero values. Another possible application area is in regression models for
scores, such as quality of life indicators, which take few or many possible values and have distributions
not unlike that of the HAQ score analysed here.
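As a rough sketch of how moment curves could be obtained for a discrete response, suppose that at a given x a fitted model supplies the cumulative probabilities P(Y <= v_j | x) at each ordered response value v_j. The mean, standard deviation and a discrete centile then follow by differencing those probabilities. The response values and probabilities below are made up for illustration and do not come from either example dataset.

```python
from math import sqrt


def moments_from_cdf(values, cum_probs):
    """Mean and SD of a discrete response from its cumulative probabilities."""
    # Difference the cumulative probabilities to get category probabilities.
    probs = [cum_probs[0]] + [b - a for a, b in zip(cum_probs, cum_probs[1:])]
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
    return mean, sqrt(var)


def discrete_centile(values, cum_probs, p):
    """Smallest response value whose cumulative probability reaches p (0 < p <= 1)."""
    for v, c in zip(values, cum_probs):
        if c >= p:
            return v
    return values[-1]


# Hypothetical distribution at one value of x, with a large proportion of zeros.
values = [0, 1, 2, 3]
cum_probs = [0.6, 0.8, 0.95, 1.0]

mean, sd = moments_from_cdf(values, cum_probs)
median = discrete_centile(values, cum_probs, 0.5)
```

Repeating the calculation over a grid of x values would trace out mean and standard deviation curves. With 60% of the probability at zero, as in this made-up example, the median is zero while the mean is not, which is one reason the mean is the relevant summary for totals such as costs.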
Software to fit the models described here has been written for the Stata 6.0 package (StataCorp, 1999)
and is available from the author on request. It will be published in the Stata Technical Bulletin in due
course.

ACKNOWLEDGMENTS
I am most grateful to Ammarin Thakkinstian and Fred Wolfe for making respectively the total Ca
and HAQ score datasets available to me, and to Vern Farewell for a helpful discussion of the estimation
problem.

REFERENCES
AITCHISON, J. AND SILVEY, S. D. (1957). The generalization of probit analysis to the case of multiple responses.
Biometrika 44, 131–140.
BONELLIE, S. R. AND RAAB, G. M. (1996). A comparison of different approaches for fitting centile curves to
birthweight data. Statistics in Medicine 15, 2657–2667.
COLE, T. J. (1988). Fitting smoothed centile curves to reference data (with discussion). Journal of the Royal
Statistical Society, Series A 151, 385–418.
COLE, T. J. AND GREEN, P. J. (1992). Smoothing reference centile curves: the LMS method and penalised
likelihood. Statistics in Medicine 11, 1305–1319.
FRIES, J. F., SPITZ, P. W., KRAINES, R. G. AND HOLMAN, H. R. (1980). Measurement of patient outcome in
arthritis. Arthritis Rheumatology 23, 137–145.
HE, X. (1997). Quantile curves without crossing. The American Statistician 51, 186–192.
HEAGERTY, P. J. AND PEPE, M. S. (1999). Semiparametric estimation of regression quantiles with application to
standardizing weight for height and age in US children. Applied Statistics 48, 533–551.
HEALY, M. J. R., RASBASH, J. AND YANG, M. (1988). Distribution-free estimation of age-related centiles. Annals
of Human Biology 15, 17–22.
MCCULLAGH, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical
Society, Series B 42, 109–142.
MOULTON, L. H., HOLT, E. A., JOB, J. S. AND HALSEY, N. A. (1996). Percentile regression analysis of correlated
antibody responses. Statistics in Medicine 15, 2657–2667.
PAN, H. Q., GOLDSTEIN, H. AND YANG, Q. (1990). Nonparametric estimation of age-related centiles over wide
age ranges. Annals of Human Biology 17, 475–481.
ROSSITER, J. E. (1991). Calculating centile curves using kernel density estimation methods with application to infant
kidney lengths. Statistics in Medicine 10, 1693–1701.
ROYSTON, P. (1991). Constructing time-specific reference ranges. Statistics in Medicine 10, 675–690.
ROYSTON, P. AND ALTMAN, D. G. (1994). Regression using fractional polynomials of continuous covariates:
parsimonious parametric modelling. Applied Statistics 43, 429–467.
ROYSTON, P. AND WRIGHT, E. M. (1998). A method for estimating age-specific reference intervals based on
fractional polynomials and exponential transformation. Journal of the Royal Statistical Society, Series A 161, 79–101.
SASIENI, P. D. AND ROYSTON, P. (1998). Pointwise confidence intervals for running. Stata Technical Bulletin
41, 17–23.
STATACORP (1999). Stata Reference Manual, Version 6.0. College Station, Texas: Stata Press.
WADE, A. M., ADES, A. E., SALT, A. T., JAYATUNGA, R. AND SONKSEN, P. M. (1995). Age-related standards for
ordinal data: modelling the changes in visual acuity from 2 to 9 years of age. Statistics in Medicine 14, 257–266.
YU, K. AND JONES, M. C. (1998). Local linear quantile regression. Journal of the American Statistical Association
93, 228–237.

[Received June 23, 1999; revised January 28, 2000; accepted for publication February 9, 2000]
