Small Area Estimates of Labour Force Par
A (2007)
170, Part 4, pp. 975–1000
Isabel Molina,
Universidad Carlos III de Madrid, Madrid, Spain
Ayoub Saei
University of Southampton, UK
1. Introduction
Unemployment is an indicator of socio-economic situation and is thus an issue of primary
interest for society in general, and in particular for local, regional and central governments
that need to allocate effectively the funds which are needed for conducting employment plans
or policies. The European Union provides structural funds to cofinance specific employment
programmes with the purpose of achieving a better equilibrium in the levels of development
of the different European regions. Of course, the effectiveness of these programmes depends
on rigorous knowledge of regional socio-economic activity via adequate and reliable statistical
information. Thus, regional studies and investigations are currently of great interest.
In particular, the European statistical office Eurostat demands from the statistical offices
of the members increasingly detailed statistical information on smaller geographical regions.
However, the national statistical offices face the problem that the sample sizes of current
national surveys are not planned to provide reliable direct estimates for such small areas, and
the increase in size that is necessary to cover all these areas adequately is not affordable.
For instance, concerning labour force statistics, the Office for National Statistics of the UK
Address for correspondence: Isabel Molina, Departamento de Estadística, Universidad Carlos III de Madrid,
28903 Getafe, Madrid, Spain.
E-mail: isabel.molina@uc3m.es
The inclusion of area random effects in the model is a common practice in the current literature
on small area estimation. These effects model the variations over areas that are not explained
by auxiliary variables and additionally allow for correlations between the units within an area.
Such correlations are often observed in practice when the areas are geographical regions or
homogeneous domains.
The auxiliary information is typically taken from census or other administrative sources. If
there is relevant auxiliary information for each unit in the population, then the models are
usually formulated at the unit level. However, sometimes the information at the unit level is
not updated and other times there may be confidentiality reasons that prevent its use. In such
situations, it is usually possible to obtain data that are aggregated by areas, and the model
is then stated at the area level. The model that is assumed in the application of Section 2 is
in between the two approaches, since the available data are aggregated by sex–age categories
within areas. Then sex–age categories can be regarded as individual units within areas, but the
statistical advantages of aggregated data remain.
Linear mixed models are a common tool for small area estimation. Totals of unemployed
and employed individuals could be estimated via two separate models of this kind, relating the
direct estimates of the proportions of unemployed and employed to some area level auxiliary
variables. However, the estimated proportions derived might be inconsistent in the sense that
they might not be within the [0, 1] interval, and also the sum of both proportions might exceed
Estimates of Labour Force Participation 977
1, which in terms of totals means that the estimated number of unemployed plus employed
individuals could exceed the corresponding population total. Another disadvantage of these
models is that they do not take into account the typical strong dependence between the pro-
portions of unemployed, employed and inactive people. A bivariate linear mixed model could
provide estimates for two of these quantities, allowing them to be correlated, and the third
quantity could be calculated by subtraction from the population total. However, the previously
mentioned inconsistency problems remain.
The estimated proportions can be brought to the [0, 1] interval by using logistic models, which
relate the logit transformation of the proportions to the auxiliary variables. A univariate logistic
model with random area effects was proposed by the EURAREA Consortium (2004) (see pro-
ject reference volume D7.1.4, part 1, pages C5.6–C5.8) to model the proportion of unemployed
population. Moreover, the UK Office for National Statistics has recently released small area
estimates of unemployment rates that were obtained from a model of this type. The model
provides estimated totals of unemployed individuals, which are then combined with the direct
estimates of the totals of employed individuals to derive the rates of unemployment (Hastings
et al., 2003).
In this paper we propose to estimate certain unemployment or employment measures of inter-
est, namely totals, proportions and rates of unemployment, assuming a joint multinomial logit
model with random area effects for the proportions of unemployed and employed individuals.
This model adapts naturally to the characteristics of the problem, solving the inconveniences
of previous approaches, and allowing simultaneous model-based estimation of unemployment,
employment and inactivity totals. The model coefficients are interpretable as relative incre-
ments of ratios of unemployed or employed over inactive totals. Rates of unemployment or
other quantities of interest such as rates of inactivity are easily derived.
In Section 2 we illustrate the proposed methodology with a data set from the Great Britain
Labour Force Survey from the year 2000 (see Office for National Statistics (2004), volume 6,
for details on the Labour Force Survey for local area data). Section 2.1 specifies the model and
Section 2.2 describes the results of the model fit. The estimated totals of unemployed and
employed are compared with the direct estimators in Section 2.4. We observe an increase
in accuracy for the new model-based estimators for all areas. This increase is remarkable
particularly for unemployment because of the small number of sampled unemployed individuals
within the areas.
The accuracy of small area estimates is indeed crucial, because the loss of unbiasedness will
be accepted only if there is a clear gain in accuracy. Thus, in Section 2.5 we describe two different
approaches for approximating the mean-squared error of the new small area estimators. The
first is an analytical approximation based on Taylor linearizations. The second is a bootstrap
estimator that is obtained by a parametric bootstrap procedure which was specially designed
for the data structure at hand. It avoids linearizations, is of simple practical application and
easily extends to other types of parameter and model. In the simulation study that is described
in Section 3 we show the good performance of the bootstrap estimator. Furthermore, in that
section we use an approach which is similar to that of Hastings et al. (2003) referred to above
for the simulated data, and we compare the results with those obtained from the multinomial
logit mixed model.
Models must be constructed ad hoc for each data set at hand, and this means that each data
set must be studied until an adequate model is found for these data. Although the main objective
of the application in Section 2 is illustrative, the results for the available real data show that the
model that is fitted in Section 2 provides reliable estimates of unemployment or employment
characteristics.
978 I. Molina, A. Saei and M. J. Lombardía
2. Illustration with labour force data of Great Britain
2.1. Model specification
The available data set (source: Office for National Statistics) contains labour force data for
small areas (unitary authorities and local authority districts) in Great Britain from the year
2000 aggregated by sex–age categories. There are 406 × 6 records corresponding to the 406
small areas and six sex–age groups for each area, and nine columns with the variables that are
described in Table 1. The variable CLUSTER is a socio-economic classification of areas that
was developed by the Office for National Statistics (Bailey et al., 2000). The variables GOR,
CLUSTER and REG.UNEMPLOYED are obtained from an administrative source, and the
rest of the variables come from the Labour Force Survey.
Consider the multinomial vector that counts the number of sampled unemployed, employed
and inactive individuals within each AREA–SEXAGE group. The aim of this work is to obtain
small area estimates of some usual labour force participation characteristics through a model for
the multinomial probabilities of unemployed and employed individuals. Thus, first a preliminary
analysis was performed to assess the potential predictive power of each auxiliary variable in the
data set.
Fig. 1 plots the mean proportions of employed and unemployed people over the GOR, SEX-
AGE and CLUSTER categories. Observe that both mean proportions vary across the differ-
ent categories of each variable, but this variation is different for the two proportions since
the lines are not parallel. Indeed, analysis of variance confirmed that there are statistically
significant differences in each mean proportion between the different GOR, CLUSTER and
SEXAGE categories. These results suggest that the indicators of the categories of the three
variables are potentially helpful in predicting the probabilities of unemployed and employed
individuals.
Modelling of probabilities by real-valued explanatory variables requires transformation of
these probabilities into quantities that vary over the whole real line. The logit transformation
is commonly used for multinomial models owing to its simplicity. For a multinomial variable
with three categories and probabilities $p_1$, $p_2$ and $p_3$, considering the last category as base
reference, the logit of $p_j$ is defined as $\log(p_j/p_3)$, $j = 1, 2$. In our case, regarding the inactive
as the reference category, the sample logit for the proportion of unemployed is equal to the
Table 1. Description of the variables in the labour force 2000 data file
Fig. 1. Mean proportion of employed and unemployed over (a) the GOR, (b) SEXAGE and (c) CLUSTER categories
Fig. 2. Logit of (a) unemployed and (b) employed against log(REG.UNEMPLOYED)
probabilities. If the logits of the proportions are plotted against the untransformed proportion
of REG.UNEMPLOYED, hardly any pattern can be distinguished owing to the high dispersion
of the points. In any case, both models (with and without logarithmic scale) were fitted
and the differences in the predicted values were negligible.
Fig. 3 plots SAMP.EMPLOYED against SAMP.UNEMPLOYED. The whole scatterplot
is depicted in Fig. 3(a), and in Fig. 3(b) we have enlarged the scale to show the main cloud
of points more clearly. The integer nature and the frequent small values of SAMP.UNEMPLOYED
in the AREA–SEXAGE combinations produce the vertical lines that are observed in
the plot. Observe that the points are distributed along a band with positive slope, where large
numbers of sampled unemployed are mostly associated with large numbers of employed individuals.
Thus, this plot suggests that the numbers of unemployed and employed individuals are
linearly dependent. Furthermore, the concentration of points in the bottom left-hand corner
indicates that the joint distribution of these two variables is highly skewed. In view of this
plot, it seems convenient to consider a bivariate model representing the observed dependence.
Fig. 3. SAMP.EMPLOYED against SAMP.UNEMPLOYED
Fig. 4 plots the two rates, unemployed over employed (at the bottom) and inactive over active
(at the top), for each small area. Observe that the variation across areas of the rate unem-
ployed/employed is small compared with the variation of the inactivity/activity rate. Thus,
a large part of the variation across areas of the distribution of unemployed, employed and in-
active is due to the variation in activity/inactivity, and a smaller part due to the variation of
unemployed over employed. In accordance with this we assume that the small variations of the
rate unemployed/employed across areas can be explained sufficiently by the auxiliary variables.
Thus, we have included in the model random area effects that represent the variation of activ-
ity/inactivity, which is the largest part of the across-area variation that is observed in the data.
However, these random effects are constant for the categories unemployed and employed; see
model (2). This assumption considerably simplifies the model and the fitting method, and makes
the subsequent estimation of mean-squared errors easier and more understandable. In Section
2.3 we propose a diagnostic method based on the residuals for assessing whether a model with
specific random area effects for unemployed and employed is worthwhile for the data at hand.
The results indicate that not much can be gained in our case, and consequently the simpler
model with common area effects is preferred here.
Thus we considered as explanatory variables the log-proportion of REG.UNEMPLOYED
and 22 dummy indicators for categories of GOR, CLUSTER and SEXAGE, taking the last
category of each as base reference. With an intercept the constructed incidence matrix X has
24 columns. We use index $i$ ($i = 1, \ldots, 6$) for the SEXAGE category and $d$ ($d = 1, \ldots, 406$)
for AREA. Thus, the rows of $X$ are denoted by $x_{di}$; $y_{di1}$, $y_{di2}$ and $y_{di3}$ denote the numbers of
sampled unemployed, employed and inactive individuals respectively, $m_{di} = y_{di1} + y_{di2} + y_{di3}$ the sample
size, and $p_{di1}$, $p_{di2}$ and $p_{di3} = 1 - p_{di1} - p_{di2}$ the respective probabilities of unemployed, employed
and inactive individuals. Finally, $u_d$ denotes the random effect of area $d$. We assume that the
vectors $(y_{di1}, y_{di2}, y_{di3})$ given $u_d$ and $m_{di}$ are independent across $d$ and $i$ with multinomial
distribution, i.e. with probability mass function
$$f(y_{di1}, y_{di2} \mid u_d) = \frac{m_{di}!}{y_{di1}!\, y_{di2}!\, y_{di3}!}\; p_{di1}^{y_{di1}}\, p_{di2}^{y_{di2}}\, p_{di3}^{y_{di3}}. \tag{1}$$
Moreover, we assume that the probabilities $(p_{di1}, p_{di2})$ are related to the auxiliary variables and
the random area effects through the logit link as follows:
$$\log(p_{dij}/p_{di3}) = x_{di}\beta_j + u_d, \quad j = 1, 2,\; i = 1, \ldots, 6,\; d = 1, \ldots, 406, \qquad u_d \overset{\mathrm{IID}}{\sim} N(0, \varphi), \tag{2}$$
where $\beta_j = (\beta_{1j}, \ldots, \beta_{24j})^T$ contains the coefficients of the explanatory variables for the
multinomial category $j$, $j = 1, 2$. This model introduces a natural correlation structure among the
unemployed, employed and inactive, and among units within the same small area. The model fit
provides estimated probabilities of unemployed and employed contained in the [0, 1] interval and
that add up to 1. Estimates of totals or proportions can be obtained even for areas without unem-
ployed people in the sample, although for a price in terms of sampling error. For areas with at
least a few sampled unemployed individuals, estimates with acceptable accuracy can be obtained.
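As a minimal sketch of how the logit link in model (2) maps linear predictors back to probabilities, the following Python fragment inverts the two logits with the inactive category as reference. The function name and the numerical values of the linear predictors are hypothetical and chosen only for illustration; they are not taken from the fitted model.

```python
import numpy as np

def multinomial_logit_probs(eta_unemp, eta_emp):
    """Return (p1, p2, p3) given eta_j = log(p_j / p_3), j = 1, 2."""
    e1, e2 = np.exp(eta_unemp), np.exp(eta_emp)
    p3 = 1.0 / (1.0 + e1 + e2)      # inactive: the reference category
    return p3 * e1, p3 * e2, p3     # unemployed, employed, inactive

# Hypothetical values of x_di beta_j + u_d for one AREA-SEXAGE cell
p1, p2, p3 = multinomial_logit_probs(-2.0, 1.0)
```

By construction each probability lies in (0, 1) and the three probabilities sum to 1, which is the consistency property discussed above.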
Estimation of small area totals of unemployed, employed and inactive people requires the
prediction of the corresponding unsampled numbers of unemployed, employed and inactive
people in each AREA–SEXAGE group. Let us denote these quantities by $y^r_{di1}$, $y^r_{di2}$ and $y^r_{di3}$
respectively, and let $m^r_{di} = y^r_{di1} + y^r_{di2} + y^r_{di3}$ be the number of unsampled units. We assume that
model (1)–(2) holds also for $(y^r_{di1}, y^r_{di2})$, with $m_{di}$ replaced by $m^r_{di}$. Furthermore, we denote by
$M_{di} = m_{di} + m^r_{di}$ the number of population units in the $i$th SEXAGE group within AREA $d$. We
assume that the population size $M_{di}$ is known for each AREA and SEXAGE group, and that
there are some observations in each small area.
†ϕ: estimate 0.026; likelihood ratio test statistic, 591.45; critical value, 7.68.
‡Significant at level 0.001.
§Significant at level 0.05.
§§Significant at level 0.01.
Taking the exponential in equation (2), we obtain that the marginal effect of an increment $\Delta X_k$
of an explanatory variable $X_k$ on the ratio $p_{dij}/p_{di3}$ is a multiplicative effect of $\exp(\beta_{kj}\,\Delta X_k)$,
$j = 1, 2$. In particular, when $X_k$ is a dummy indicator, the ratio of unemployed over inactive
for the category that is represented by $X_k$ is $\exp(\beta_{k1})$ times the value of the ratio for the base
category. In this way, the coefficients of the SEXAGE indicators that are displayed in Table 2
can be interpreted as follows. The ratio unemployed over inactive for SEXAGE = 1 (men aged
between 16 and 25 years) is about $10 \approx \exp(2.3)$ times the ratio for SEXAGE = 6 (women over 40
years). A similar effect is observed in the ratio of employed over inactive, although the increase is
somewhat smaller (about $7.4 \approx \exp(2)$ times). From this we conclude that there is a large increase
in the activity when we move from the group of women over 40 years old to men aged between
16 and 25 years, and this increase is bigger in the unemployed. Similar conclusions are obtained
for the group of men between 25 and 40 years (SEXAGE = 2) in comparison with women over
40 years old; only the increase in activity is slightly smaller and there is a bigger gap between the
unemployed and the employed. Comparing the group of men over 40 years old (SEXAGE = 3)
with the reference group we see that the activity grows with respect to the previous cases, the
increase being much higher for employed people (about 18 times for unemployed and 40 times
for employed). In the group of women aged between 16 and 25 years (SEXAGE = 4) there is also
more activity than in the reference group, and the increase is greater in the number of employed
people. Finally, for women aged between 25 and 40 years (SEXAGE = 5) there is a considerable
decrease in activity, but the number of unemployed women reduces more. The remaining model
coefficients can be interpreted similarly.
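The multiplicative reading of the coefficients can be checked numerically. The short Python sketch below exponentiates the two rounded coefficient values quoted in the text (2.3 and 2); the helper name `ratio_multiplier` is a hypothetical label introduced here for illustration.

```python
import math

def ratio_multiplier(beta_kj, delta_x=1.0):
    """Multiplicative change in p_dij / p_di3 for an increment delta_x in X_k."""
    return math.exp(beta_kj * delta_x)

# Rounded SEXAGE = 1 coefficients as quoted in the text
unemp_multiplier = ratio_multiplier(2.3)   # unemployed over inactive
emp_multiplier = ratio_multiplier(2.0)     # employed over inactive
```

For a dummy indicator the increment is 1, so the multiplier reduces to the exponential of the coefficient.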
Fig. 5. Residuals against predicted values for the employed
Fig. 6. Residuals against predicted values for the unemployed
observations in the plot. For this reason, we have reduced the scale of the x-axis to 0–100 to
show 93% of the observations more clearly. In the remaining 7% there are no large residuals.
In fact, as we can see in the plot, there are no high residuals in absolute value or any visible
pattern; only a slight decrease of the variability when the predicted values increase. This could
be an effect of the skewness of the predicted values in the graph because, in the absence of
overdispersion, in regions with fewer observations we should see less variability.
For the category unemployed, the analogous plot appears in Fig. 6. The x-range has also
been reduced to show clearly over 99% of the observations. In the x-range from 6 to 20,
there is no obvious pattern. However, between 0 and 6 we can see more variability and a
strange pattern in the form of decreasing parallel curves. The higher variability that is
observed could again be an effect of the skewness of the predicted values. Observe that the
quantities to predict (the number of unemployed individuals) are integer. Thus, each decreas-
ing curve is naturally formed by the residuals corresponding to the same integer value. In
any case the plot indicates some underprediction when the number of unemployed is very
small.
Further validation of the model includes checking whether a model with specific random
effects for the categories unemployed and employed would substantially improve the predic-
tion. A specific diagnostic method has been developed for this, based on the idea that, if the
true model has additional across-area variability in any of the categories that is not explained
by the fitted model, then this extra variability should be found in the residuals. Consider that
the true data-generating model has different random effects for the categories unemployed and
employed. Then, without loss of generality, the true model satisfies
$$\log(p_{dij}/p_{di3}) = \eta^0_{dij}, \quad j = 1, 2,$$
Additionally, we define the vector $\boldsymbol{\varepsilon}_{di} = (\varepsilon_{di1}, \varepsilon_{di2})^T$ of random errors $\varepsilon_{dij} = y_{dij} - m_{di} p_{dij}$, and
the vector $e_{di} = (e_{di1}, e_{di2})^T$ of residuals $e_{dij} = y_{dij} - m_{di} \hat{p}_{dij}$, where the $\hat{p}_{dij}$ are the estimated
probabilities that are obtained by fitting model (2). It is easy to see that
$$e_{di} = m_{di}(p_{di} - \hat{p}_{di}) + \boldsymbol{\varepsilon}_{di}. \tag{3}$$
By a first-order Taylor series expansion of $p_{di}(\eta)$ about $\eta = \eta^0_{di}$, evaluated at $\eta = \hat{\eta}_{di}$, we obtain
$$p_{di} - \hat{p}_{di} = m_{di}^{-1} \Sigma_{di} (\eta^0_{di} - \hat{\eta}_{di}), \tag{4}$$
where $\Sigma_{di}$ is the variance–covariance matrix of $(y_{di1}, y_{di2})^T$. Observe that the derivatives of
$p_{di} = (p_{di1}, p_{di2})^T$ with respect to $\eta_{di} = (\eta_{di1}, \eta_{di2})^T$ are the elements of $m_{di}^{-1} \Sigma_{di}$. Substituting
equation (4) in equation (3), multiplying the resulting equation on the left by $\Sigma_{di}^{-1}$ and subtracting
the second component of the obtained equation from the first component, we obtain the
following univariate mixed linear model for the difference of scaled residuals:
$$(m_{di} p_{di1})^{-1} e_{di1} - (m_{di} p_{di2})^{-1} e_{di2} = x_{di}\alpha + v_d + \varepsilon_{di}, \tag{5}$$
where the obtained errors $\varepsilon_{di}$ are heteroscedastic, with variances
$$\mathrm{var}(\varepsilon_{di}) = m_{di}^{-1} (p_{di1}^{-1} + p_{di2}^{-1}).$$
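The variance expression for the errors of model (5) can be verified numerically. The Python sketch below assumes that the error in model (5) is the difference of the multinomial errors scaled by $m_{di} p_{dij}$, and compares the variance computed from the multinomial covariance matrix with the closed form above. The cell size and probabilities are hypothetical.

```python
import numpy as np

def scaled_error_difference_variance(m, p1, p2):
    """Variance of eps1 / (m * p1) - eps2 / (m * p2) under a multinomial."""
    # Covariance matrix of (y1, y2) for a multinomial with size m
    sigma = m * np.array([[p1 * (1.0 - p1), -p1 * p2],
                          [-p1 * p2, p2 * (1.0 - p2)]])
    # Contrast vector that forms the difference of scaled errors
    c = np.array([1.0 / (m * p1), -1.0 / (m * p2)])
    return c @ sigma @ c

m, p1, p2 = 50, 0.05, 0.60                 # hypothetical cell values
from_covariance = scaled_error_difference_variance(m, p1, p2)
closed_form = (1.0 / p1 + 1.0 / p2) / m    # m^{-1} (p1^{-1} + p2^{-1})
```

The two quantities agree because the cross term contributed by the negative multinomial covariance cancels the $-2$ that arises from the two diagonal terms.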
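Then estimates of the total number of unemployed and employed individuals, and of the rates
of unemployment in each area, are calculated as
$$\hat{\delta}_{dj} = \sum_{i=1}^{6} (y_{dij} + \hat{y}^r_{dij}), \quad j = 1, 2, \qquad \widehat{ur}_d = 100\, \frac{\hat{\delta}_{d1}}{\hat{\delta}_{d1} + \hat{\delta}_{d2}}, \quad d = 1, \ldots, 406. \tag{7}$$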
Similarly, other usual labour statistics such as rates of employment, activity or inactivity can be
easily derived from the fit of model (1)–(2).
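Equation (7) translates directly into array operations. The following Python sketch assumes that sampled counts and predicted unsampled counts are stored in arrays of shape (areas, SEXAGE groups, 2), with unemployed first and employed second; this layout and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def area_totals_and_rate(y_sampled, y_pred_unsampled):
    """Totals per area and category, and unemployment rates in per cent."""
    # Sum sampled plus predicted unsampled counts over the six groups
    delta = (y_sampled + y_pred_unsampled).sum(axis=1)       # shape (D, 2)
    rate = 100.0 * delta[:, 0] / (delta[:, 0] + delta[:, 1])
    return delta, rate

# Toy data: 2 areas, 6 groups, 1 unemployed and 9 employed sampled per group
y_s = np.zeros((2, 6, 2))
y_s[..., 0] = 1.0
y_s[..., 1] = 9.0
y_r_hat = 10.0 * y_s                                         # hypothetical predictions
delta, rate = area_totals_and_rate(y_s, y_r_hat)
```

In this toy case every area has 66 unemployed and 594 employed in total, so the unemployment rate is 10 in each area.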
Direct estimates of small area characteristics are design based and are usually calculated by
using only the sample data belonging to the target area. Direct estimates of the totals of unem-
ployed and employed for each small area were provided by the Office for National Statistics
for the same data. In Fig. 7 we plotted the estimates that were derived from model (1)–(2) and
equation (7) against these direct estimates. We observe that the estimated totals of employed
people are almost equal for both methods. Direct estimates of employment totals are based on
sufficient observations to achieve an acceptable sampling error. Thus, the strong similarity with
[Figures: axis labels 'model estimate' and 'ratio of coef. of variation']
An estimator of $\mathrm{MSE}(\widehat{ur}_d)$ is obtained by replacing the unknown parameters that appear in the
formulae of $\mathrm{MSE}(\hat{\delta}_{dj})$, $j = 1, 2$, and $\mathrm{MCPE}(\hat{\delta}_{d1}, \hat{\delta}_{d2})$ by their estimated values. We denote this
estimator by $\mathrm{mse}^A(\widehat{ur}_d)$, where A stands for ‘analytical’.
When explicit exact formulae of mean-squared errors cannot be calculated, an alternative
approach that avoids Taylor linearizations and further approximations is resampling. Several
resampling methods have been suggested in small area estimation. Jiang et al. (2002) proposed
a jackknife methodology for estimation under generalized linear mixed models. Pfeffermann
and Tiller (2005) proposed a parametric and a non-parametric bootstrap estimator of mean
prediction errors under state space models. Butar and Lahiri (2003) used a bootstrap for esti-
mation under linear mixed models. Hall and Maiti (2006) proposed a double-bootstrap approach
for bias correction, which is applicable for constructing bias-corrected estimators of the mean-
squared error and for computing prediction regions under general settings. Under logistic mixed
models, González-Manteiga et al. (2007) proposed a bootstrap for mean-squared error estima-
tion on finite populations. This method works by generating bootstrap populations from a
model with probabilistic properties that is similar to the original model but conditional on the
initial sample, and then extracting samples from these populations.
Here we generalize the proposal of González-Manteiga et al. (2007) to the multinomial model
and adapt it to the data structure at hand. The simulation study that is described in Section 3
shows its good performance with artificial data similar to those of the main
application of this paper. The proposed bootstrap works as follows.
(a) Model fitting: fit model (1)–(2) to the original data, obtaining parameter estimates
$\hat{\beta}_j = (\hat{\beta}_{1j}, \ldots, \hat{\beta}_{24j})^T$, $j = 1, 2$, and $\hat{\varphi}$.
(b) Generation of random effects: generate a vector $\mathbf{w}$ containing $D$ independent copies of a
standard normal variable $w$. Construct the vector $u^* = \hat{\varphi}^{1/2} \mathbf{w} = (u^*_1, \ldots, u^*_D)^T$ such that
$E(u^*) = 0_D$ and $\mathrm{var}(u^*) = \hat{\varphi} I_D$.
(c) Generation of a bootstrap population (sample and non-sample): for $d = 1, \ldots, D$, calculate
the probabilities
$$p^*_{di3} = \Bigl\{1 + \sum_{j=1}^{2} \exp(x_{di}\hat{\beta}_j + u^*_d)\Bigr\}^{-1}, \qquad p^*_{dij} = p^*_{di3} \exp(x_{di}\hat{\beta}_j + u^*_d), \quad j = 1, 2.$$
(d) Model fitting to the bootstrap sample and parameter estimation: fit model (1)–(2) to the
bootstrap sample data $(y^*_{di1}, y^*_{di2})$, $i = 1, \ldots, 6$, $d = 1, \ldots, D$, obtaining estimates $\hat{\beta}^*_j$ and
predicted values $\hat{u}^*_d$. From these, calculate individual predicted values
$$\hat{y}^{r*}_{dij} = m^r_{di}\, \frac{\exp(x_{di}\hat{\beta}^*_j + \hat{u}^*_d)}{1 + \sum_{k=1}^{2} \exp(x_{di}\hat{\beta}^*_k + \hat{u}^*_d)}, \quad j = 1, 2.$$
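Steps (b) and (c) of the procedure can be sketched in Python as follows. This is an illustrative sketch, not the authors' program: the function name, the array layout and the toy covariates are hypothetical, and the drawing of bootstrap counts from the multinomial distribution (1) is an assumption, since the text of step (c) stops after the probabilities. The value 0.026 is the estimate of $\varphi$ reported in the table footnote.

```python
import numpy as np

def bootstrap_population(X, beta1, beta2, phi_hat, m, m_r, rng):
    """X: (D, I, p) covariates; beta1, beta2: (p,) fitted coefficients;
    m, m_r: (D, I) sampled and unsampled sizes; rng: numpy Generator."""
    D, I, _ = X.shape
    # Step (b): u* = phi_hat^{1/2} w, with w a vector of standard normals
    u_star = np.sqrt(phi_hat) * rng.standard_normal(D)
    # Step (c): bootstrap probabilities from the fitted logit model
    eta1 = X @ beta1 + u_star[:, None]
    eta2 = X @ beta2 + u_star[:, None]
    p3 = 1.0 / (1.0 + np.exp(eta1) + np.exp(eta2))
    probs = np.stack([p3 * np.exp(eta1), p3 * np.exp(eta2), p3], axis=-1)
    # ASSUMPTION: sampled and unsampled counts follow the multinomial (1)
    y_star = np.zeros((D, I, 3), dtype=int)
    y_r_star = np.zeros((D, I, 3), dtype=int)
    for d in range(D):
        for i in range(I):
            y_star[d, i] = rng.multinomial(int(m[d, i]), probs[d, i])
            y_r_star[d, i] = rng.multinomial(int(m_r[d, i]), probs[d, i])
    return u_star, probs, y_star, y_r_star

# Toy inputs (hypothetical): 3 areas, 6 groups, intercept plus one covariate
rng = np.random.default_rng(0)
X = np.concatenate([np.ones((3, 6, 1)), rng.normal(size=(3, 6, 1))], axis=-1)
m = np.full((3, 6), 20)
m_r = np.full((3, 6), 200)
u_star, probs, y_star, y_r_star = bootstrap_population(
    X, np.array([-2.0, 0.3]), np.array([0.5, -0.1]), 0.026, m, m_r, rng)
```

Step (d) would then refit the model to `y_star`; that fitting routine is outside the scope of this sketch.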
Fig. 9 depicts the MSE estimates based on the analytical approximation $\mathrm{mse}^A(\widehat{ur}_d)$ and the
estimates based on bootstrap $\mathrm{mse}^B(\widehat{ur}_d)$ for the first 200 small areas of the Labour Force Survey
data file. We observe that the estimates behave similarly along small areas without big differ-
ences, with the analytical approximation often being somewhat below the bootstrap values. In
the simulation study of Section 3, where the true values of the MSEs are available, the analytical
approximation turns out to be clearly downward biased (see Fig. 11 in Section 3). However,
here the two types of MSE estimates are more similar than in the simulation experiment. Since
the parametric bootstrap relies on full knowledge of the data-generating process, we conjecture
that, when the model is correct as in the simulation study, the performance of the bootstrap-
based estimator is very good. However, in practice the correct model is rarely known. In the
application to the Labour Force Survey data the bootstrap works nicely because the model
fits the data reasonably well, although the differences from the analytical approximation are
smaller.
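The mean-squared error of the estimators $\widehat{ur}^M_d$ and $\widehat{ur}^L_d$ that were obtained from the two models
was approximated empirically as
$$\mathrm{MSE}(\widehat{ur}^{\,l}_d) = K^{-1} \sum_{k=1}^{K} \bigl(\widehat{ur}^{\,l(k)}_d - ur^{(k)}_d\bigr)^2, \qquad l \in \{M, L\}.$$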
The resulting empirical MSEs of the estimates derived from the two models are plotted on a
logarithmic scale in Fig. 10. We can observe that the empirical mean-squared errors of the esti-
mates that are derived from the multinomial logit mixed model are much smaller. This happens
because the univariate logistic model does not take into account the dependence between the
number of unemployed, employed and inactive people in the estimation process.
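The empirical mean-squared error used in this comparison is an average of squared differences over the simulated populations. A minimal Python sketch follows, assuming the estimated and true unemployment rates are stored as arrays with one row per replicate and one column per area; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def empirical_mse(estimates, true_values):
    """estimates, true_values: arrays of shape (K, D); returns (D,) MSEs."""
    return np.mean((estimates - true_values) ** 2, axis=0)

# Toy example with K = 2 replicates and D = 2 areas
est = np.array([[10.0, 5.0], [12.0, 5.0]])
true = np.array([[11.0, 5.0], [11.0, 4.0]])
mse = empirical_mse(est, true)
```

The same function would be applied once per model, giving one vector of empirical MSEs for each of the two estimators being compared.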
Regarding the second purpose of the simulation study, for the comparison of the two MSE
estimates that were developed in Section 2.5 with the true values to be fair, these true
values were first calculated empirically with greater precision ($K = 5000$). After this preliminary
simulation for obtaining the empirical MSEs, the same simulation scheme was followed, i.e. $K = 600$
populations were generated with sample and non-sample sizes as before. From each sample $k$,
estimates of unemployment rates $\widehat{ur}^{(k)}_d$ were derived from the multinomial logit mixed model,
and analytical and bootstrap estimates of the mean-squared errors $\mathrm{mse}^A(\widehat{ur}^{(k)}_d)$ and $\mathrm{mse}^B(\widehat{ur}^{(k)}_d)$
were computed. The latter were obtained with $B = 600$ replications of the bootstrap procedure
that was described in Section 2.5. As a result, the following quantities were computed:
$$\mathrm{mse}^A(\widehat{ur}_d) = K^{-1} \sum_{k=1}^{K} \mathrm{mse}^A(\widehat{ur}^{(k)}_d), \qquad E^A_d = K^{-1} \sum_{k=1}^{K} \bigl\{\mathrm{mse}^A(\widehat{ur}^{(k)}_d) - \mathrm{MSE}(\widehat{ur}_d)\bigr\}^2,$$
$$\mathrm{mse}^B(\widehat{ur}_d) = K^{-1} \sum_{k=1}^{K} \mathrm{mse}^B(\widehat{ur}^{(k)}_d), \qquad E^B_d = K^{-1} \sum_{k=1}^{K} \bigl\{\mathrm{mse}^B(\widehat{ur}^{(k)}_d) - \mathrm{MSE}(\widehat{ur}_d)\bigr\}^2.$$
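The two summary quantities per area, the average MSE estimate over replicates and the mean squared deviation of the MSE estimates from the true value, can be sketched in Python as below. The array layout (replicates by areas), the function name and the toy numbers are assumptions for illustration; the same function would serve for either the analytical or the bootstrap estimates.

```python
import numpy as np

def summarize_mse_estimator(mse_estimates, true_mse):
    """mse_estimates: (K, D) MSE estimates per replicate; true_mse: (D,)."""
    mean_estimate = mse_estimates.mean(axis=0)                  # average over k
    e_d = ((mse_estimates - true_mse) ** 2).mean(axis=0)        # E_d per area
    return mean_estimate, e_d

# Toy example with K = 2 replicates and D = 2 areas
mse_est = np.array([[0.10, 0.20], [0.14, 0.20]])
true_mse = np.array([0.12, 0.25])
mean_estimate, e_d = summarize_mse_estimator(mse_est, true_mse)
```

In the toy example the first area's estimates average to the true value but still deviate in each replicate, so its $E_d$ is positive, whereas the second area shows a systematic underestimate.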
In Fig. 11 the true values $\mathrm{MSE}(\widehat{ur}_d)$ that were obtained in the preliminary simulation, the
analytical estimates $\mathrm{mse}^A(\widehat{ur}_d)$ and the bootstrap estimates $\mathrm{mse}^B(\widehat{ur}_d)$ are plotted for each area.
Observe that the bootstrap estimates are very close to the true values; in fact they are super-
posed for most of the areas. However, the analytical approximations underestimate the true
values for all areas. This bias seriously affects the overall accuracy of MSE estimates. Thus,
although both MSE estimates rely on the model, when small area rates of unemployment are
derived from a reliable model, we recommend estimating MSE by using the bootstrap proposed.
4. Conclusions
A multinomial logit model with random area effects has been proposed for modelling employ-
ment or unemployment data, and small area estimators have been derived from it. The estimates
obtained are consistent in the sense that they lie in the desired space, i.e. the estimated
totals of unemployed, employed and inactive people add up to the population total. In comparison
with direct estimators, they have reduced variance without a significant bias.
Furthermore, two different ways of estimating the mean-squared error of the small area esti-
mators proposed are given: an analytical expression and a bootstrap estimator. The analytical
approximation is based on Taylor linearizations that are specific for the model and the parameter
at hand, whereas the bootstrap procedure is designed for the multinomial logit model avoiding
any linearization and can be easily adapted to some variations in the model and to different tar-
get parameters. Furthermore, the bootstrap estimator has performed better than the analytical
estimator in the simulations, although the differences are smaller in the application with UK
unemployment data.
There are various straightforward extensions of the multinomial logit mixed model that was
proposed in this work. If auxiliary information is available for all units of the population, a
unit level model can be used, whereas, if there is only area level information, the model should
be stated at the area level. Moreover, the sampling design can be introduced in the estimation
procedure by taking as response variables the direct estimates of the totals of unemployed and
employed individuals, and assuming that these totals follow a multinomial model.
Acknowledgements
This work started during a research stay of the first author in the Department of Social Sta-
tistics of the University of Southampton in the summer of 2003 by invitation of Professor
Raymond L. Chambers. We thank him and Professor Domingo Morales for their continu-
ous support and advice during this work, Miguel Molina for his help in the enhancement of
the program code, Zsolt Sándor and Roland Fried for their help in the last stage of the work
and finally the referees for their careful reading and helpful comments. It has been supported
by grants MTM 2006-05693, SEJ2004-03303, MTM2005-00820, PGIDT03PXIC20702PN and
PGIDIT06PXIB207009PR.
where $P_{di} = \mathrm{diag}(p_{di1}, p_{di2})$. The natural parameter is $\theta_{di} = (\theta_{di1}, \theta_{di2})^T$, where $\theta_{dij} = \log(p_{dij}/p_{di3})$, $j = 1, 2$,
and where $p_{di3} = 1 - p_{di1} - p_{di2}$ is the multinomial probability for the third category. Let $u = (u_1, \ldots, u_D)^T$
be the vector of random effects that are associated with $D$ small areas. With this notation, the proposed
multinomial logit mixed model (2) can be written as
$$\theta_{di} = X_{di}\beta + Z_{di}u, \qquad u \sim N_D(0_D, \varphi I_D), \qquad i = 1, \ldots, 6, \quad d = 1, \ldots, D.$$
Here,
$$X_{di} = \begin{pmatrix} x_{di} & 0_{1\times 24} \\ 0_{1\times 24} & x_{di} \end{pmatrix}, \qquad Z_{di} = \left(0_{2\times(d-1)} \;\; 1_2 \;\; 0_{2\times(D-d)}\right)$$
are the $2 \times p$ and $2 \times D$ incidence matrices for observation $i$ within area $d$ with $p = 48$. We denote by $x_{dij}$
the $j$th row of matrix $X_{di}$, $j = 1, 2$. Additionally, let us denote by $y$, $X$ and $Z$ the matrices with the sample
elements $y_{di}$, $X_{di}$ and $Z_{di}$ stacked in columns. The conditional density of $y$ given $u$ is
$$f_1(y|u) = \prod_{d=1}^{D}\prod_{i=1}^{6} f(y_{di1}, y_{di2}|u_d),$$
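The notation above can be made concrete with a minimal sketch. It assumes a toy covariate row of length 3 (the paper uses 24, giving p = 48), a multinomial size `m` for each cell and 0-based area indices. The function names `design_matrices`, `probabilities` and `log_density` are illustrative only and do not come from the authors' program code.

```python
import numpy as np
from math import lgamma

def design_matrices(x_di, d, D):
    """Build X_di (2 x 2q) and Z_di (2 x D) for area d (0-based) from a covariate row x_di."""
    x_di = np.asarray(x_di, dtype=float)
    q = x_di.size
    X = np.zeros((2, 2 * q))
    X[0, :q] = x_di      # category 1 uses the first q coefficients of beta
    X[1, q:] = x_di      # category 2 uses the last q coefficients of beta
    Z = np.zeros((2, D))
    Z[:, d] = 1.0        # both categories share the area effect u_d
    return X, Z

def probabilities(theta):
    """Map theta = (theta_1, theta_2) to (p_1, p_2, p_3) with category 3 as baseline."""
    e = np.exp(theta)
    return np.append(e, 1.0) / (1.0 + e.sum())

def log_density(y, m, theta):
    """Multinomial log-density of y = (y_1, y_2) out of m trials, given theta."""
    counts = np.array([y[0], y[1], m - y[0] - y[1]], dtype=float)
    p = probabilities(theta)
    log_coef = lgamma(m + 1) - sum(lgamma(c + 1) for c in counts)
    return log_coef + float(counts @ np.log(p))

# Toy values (hypothetical, for illustration only).
beta = np.array([0.2, -0.1, 0.3, -0.4, 0.1, 0.0])
u = np.array([0.05, -0.02, 0.1])
X, Z = design_matrices([1.0, 0.5, 2.0], d=1, D=3)
theta = X @ beta + Z @ u          # natural parameter theta_di
p = probabilities(theta)
```

Summing `log_density` over all cells (d, i) gives the logarithm of the conditional density of y given u, since the cells are conditionally independent.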
Let us denote $\xi_{di} = (g_1(y_{di}), g_2(y_{di}))^T$ and $e_{di} = \Sigma_{di}^{-1}(y_{di} - \mu_{di})$. Calculating the expressions of the derivatives
involved and using matrix notation, the above Taylor series expansion becomes
$$\xi_{di} = X_{di}\beta + Z_{di}u + e_{di}, \qquad (10)$$
where $\mathrm{var}(e_{di}) = \Sigma_{di}^{-1}$. Let $\xi$ denote the vector that is constructed by stacking the vectors $\xi_{di}$ in one column
and $V = \mathrm{var}(\xi)$. Then $V = \varphi ZZ^T + \Sigma^{-1}$, where $\Sigma = \mathrm{diag}(\Sigma_{di}, i = 1, \ldots, 6, d = 1, \ldots, D)$. Assuming that the
marginal distribution of $\xi$ is approximately normal, and maximizing the log-likelihood of $\xi$ with respect
to ϕ, we obtain the approximate likelihood equation
$$\varphi = (n - r_1)^{-1}\sum_{d=1}^{D} u_d^2, \qquad r_1 = \frac{1}{\varphi}\sum_{d=1}^{D} v_d^{-1}, \qquad (11)$$
where
$$v_d = \sum_{i=1}^{6} m_{di}\,p_{di3}(1 - p_{di3}) - \varphi^{-1}.$$
Thus, if β and u are known, plugging an initial value of ϕ into r1 and iterating the formula for ϕ in
equation (11) yields an approximate ML estimator of ϕ.
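The fixed-point iteration just described can be sketched as follows, for known β and u. The sketch makes several assumptions: the quantities ω_d = Σ_i m_di p_di3 (1 − p_di3) are precomputed and passed in as `omega`; the constant n of equation (11) is passed in as `n`; and v_d is taken as ω_d + 1/ϕ, the sign that keeps every v_d positive. That sign is a choice of this sketch and should be checked against the definition of v_d. The function name `update_phi_ml` is hypothetical.

```python
import numpy as np

def update_phi_ml(u, omega, phi0, n, tol=1e-10, max_iter=500):
    """Iterate phi <- sum(u_d^2) / (n - r1(phi)) until the relative change is below tol."""
    u = np.asarray(u, dtype=float)
    omega = np.asarray(omega, dtype=float)
    phi = float(phi0)
    for _ in range(max_iter):
        # Assumption of this sketch: v_d = omega_d + 1/phi (see the lead-in).
        v = omega + 1.0 / phi
        r1 = np.sum(1.0 / v) / phi
        phi_new = np.sum(u ** 2) / (n - r1)
        if abs(phi_new - phi) <= tol * abs(phi):
            return phi_new
        phi = phi_new
    return phi

# Toy inputs (hypothetical); n is set to the number of areas for this example only.
u = np.array([0.5, -0.3, 0.2, -0.4])
omega = np.array([5.0, 8.0, 6.0, 7.0])
phi_hat = update_phi_ml(u, omega, phi0=1.0, n=4)
```

In the full fitting procedure this update would be called once per outer iteration, with β and u replaced by their latest PQL estimates.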
Following Harville (1977), the approximated RML estimator of ϕ is obtained by maximizing the restricted
likelihood
$$f(\varphi; \xi) = (2\pi\varphi)^{-(n-p)/2}\,|X^TX|^{1/2}\,|V|^{-1/2}\,|X^TV^{-1}X|^{-1/2}\exp\left(-\frac{1}{2\varphi}\,\xi'\Pi\xi\right), \qquad (12)$$
where
Thus, starting with some initial values, estimates of β, u and ϕ can be obtained through a double-iteration
scheme. First update β and u by the Newton–Raphson equation to obtain PQL estimators, with ϕ
known, and then take the updated values of β and u as entries for one of the updating equations for ϕ,
either equation (11) or equation (13). The detailed PQL–ML fitting algorithm is described below.
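$$p^k_{dij} = p^k_{di3}\exp(\theta^k_{dij}), \qquad j = 1, 2,$$
$$\mu^k_{di} = m_{di}\begin{pmatrix} p^k_{di1} \\ p^k_{di2} \end{pmatrix},$$
$$\Sigma^k_{di} = m_{di}\begin{pmatrix} p^k_{di1}(1 - p^k_{di1}) & -p^k_{di1}p^k_{di2} \\ -p^k_{di1}p^k_{di2} & p^k_{di2}(1 - p^k_{di2}) \end{pmatrix}.$$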
Compute
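$$A^k = \sum_{d=1}^{D}\sum_{i=1}^{6} X_{di}^T\Sigma^k_{di}X_{di},$$
$$B^k = \sum_{d=1}^{D}\sum_{i=1}^{6} X_{di}^T\Sigma^k_{di}Z_{di},$$
$$v^k_d = \sum_{i=1}^{6} m_{di}\,p^k_{di3}(1 - p^k_{di3}) - \varphi^{-1}_{(l)}, \qquad d = 1, \ldots, D,$$
and
$$T^k = \mathrm{diag}\{(v^k_1)^{-1}, \ldots, (v^k_D)^{-1}\}.$$
From this, compute $W^k = \{A^k - B^kT^k(B^k)^T\}^{-1}$. The updating equation is
$$\begin{pmatrix} \beta^{k+1} \\ u^{k+1} \end{pmatrix} = \begin{pmatrix} \beta^{k} \\ u^{k} \end{pmatrix} + \begin{pmatrix} W^k & -W^kB^kT^k \\ -T^k(B^k)^TW^k & T^k + T^k(B^k)^TW^kB^kT^k \end{pmatrix}\begin{pmatrix} S^k_\beta \\ S^k_u \end{pmatrix},$$
where
$$S^k_\beta = \sum_{d=1}^{D}\sum_{i=1}^{6} X_{di}^T(y_{di} - \mu^k_{di}),$$
$$S^k_u = \sum_{d=1}^{D}\sum_{i=1}^{6} Z_{di}^T(y_{di} - \mu^k_{di}) - \varphi^{-1}_{(l)}u^k.$$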
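(iii) If the condition below holds, denote the last estimates by $\beta_{(l)}$ and $u_{(l)}$. Otherwise increase $k$ by 1
unit and return to step (ii).
$$\max\left\{\left|\frac{\beta^{k+1}_j - \beta^k_j}{\beta^k_j}\right|, \; j = 1, \ldots, p, \;\; \left|\frac{u^{k+1}_d - u^k_d}{u^k_d}\right|, \; d = 1, \ldots, D\right\} < \varepsilon.$$
(d) If the condition below holds, stop. Otherwise increase $l$ by 1 unit and return to step (b).
$$\max\left\{\left|\frac{\beta_{j(l+1)} - \beta_{j(l)}}{\beta_{j(l)}}\right|, \; j = 1, \ldots, p, \;\; \left|\frac{u_{d(l+1)} - u_{d(l)}}{u_{d(l)}}\right|, \; d = 1, \ldots, D, \;\; \left|\frac{\varphi_{(l+1)} - \varphi_{(l)}}{\varphi_{(l)}}\right|\right\} < \varepsilon.$$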
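The inner Newton–Raphson loop, steps (ii) and (iii) with ϕ held fixed, can be sketched on simulated data as below. This is a sketch under stated assumptions, not the authors' program. The data layout (cells stacked along the first axis), the toy dimensions and all function names are hypothetical. The sketch solves the joint linear system directly instead of forming the partitioned inverse with W and T. The random-effect block of the curvature is taken as Σ Z'ΣZ + I/ϕ, where I/ϕ is minus the derivative of the penalty term −u/ϕ that appears in the score for u.

```python
import numpy as np

def cell_moments(theta, m):
    """Return mu (N x 2) and Sigma (N x 2 x 2) for logits theta (N x 2) and sizes m (N,)."""
    e = np.exp(theta)
    p = e / (1.0 + e.sum(axis=1, keepdims=True))          # p_dij = p_di3 * exp(theta_dij)
    mu = m[:, None] * p
    diag_p = p[:, :, None] * np.eye(2)                    # diag(p) for every cell
    Sigma = m[:, None, None] * (diag_p - p[:, :, None] * p[:, None, :])
    return mu, Sigma

def newton_step(beta, u, phi, y, m, X, Z):
    """One update of (beta, u) with phi fixed."""
    mu, Sigma = cell_moments(X @ beta + Z @ u, m)
    resid = y - mu
    A = np.einsum("nja,njk,nkb->ab", X, Sigma, X)
    B = np.einsum("nja,njk,nkb->ab", X, Sigma, Z)
    # Assumption: penalty curvature enters the u-block as +I/phi.
    C = np.einsum("nja,njk,nkb->ab", Z, Sigma, Z) + np.eye(u.size) / phi
    S_beta = np.einsum("nja,nj->a", X, resid)
    S_u = np.einsum("nja,nj->a", Z, resid) - u / phi
    step = np.linalg.solve(np.block([[A, B], [B.T, C]]), np.concatenate([S_beta, S_u]))
    return beta + step[:beta.size], u + step[beta.size:]

def pql_fit(beta, u, phi, y, m, X, Z, eps=1e-8, max_iter=100):
    """Repeat Newton steps until the largest relative change falls below eps."""
    for _ in range(max_iter):
        beta_new, u_new = newton_step(beta, u, phi, y, m, X, Z)
        old = np.concatenate([beta, u])
        new = np.concatenate([beta_new, u_new])
        rel = np.max(np.abs(new - old) / np.maximum(np.abs(old), 1e-12))
        beta, u = beta_new, u_new
        if rel < eps:
            break
    return beta, u

# Simulated toy data: D areas, 3 cells per area, 2 covariates (so p = 4).
rng = np.random.default_rng(0)
D, cells, q, size = 4, 3, 2, 200
N = D * cells
area = np.repeat(np.arange(D), cells)
x = np.column_stack([np.ones(N), rng.normal(size=N)])
X = np.zeros((N, 2, 2 * q)); X[:, 0, :q] = x; X[:, 1, q:] = x
Z = np.zeros((N, 2, D)); Z[np.arange(N), :, area] = 1.0
m = np.full(N, float(size))
beta_true = np.array([-1.0, 0.3, 0.5, -0.2])
u_true = rng.normal(scale=0.3, size=D)
mu_true, _ = cell_moments(X @ beta_true + Z @ u_true, m)
y = np.array([rng.multinomial(size, [a / size, b / size, 1.0 - (a + b) / size])[:2]
              for a, b in mu_true], dtype=float)

phi_fixed = 0.1
beta_hat, u_hat = pql_fit(np.zeros(2 * q), np.zeros(D), phi_fixed, y, m, X, Z)
```

The outer loop of step (d) would alternate this fit with an update of ϕ and stop when the relative changes in β, u and ϕ all fall below ε.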
$$\mu^r_{dij} = m^r_{di}\,\frac{\exp(\theta_{dij})}{1 + \sum_{k=1}^{2}\exp(\theta_{dik})} = \mu^r_{dij}(\theta_{di}), \qquad j = 1, 2.$$
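The estimator of $\delta_d$ is
$$\hat{\delta}_d = \sum_{i=1}^{6} y_{di} + \sum_{i=1}^{6} \hat{\mu}^r_{di},$$
where $\hat{\mu}^r_{dij} = \mu^r_{dij}(\hat{\theta}_{di})$, $j = 1, 2$, and $\hat{\theta}_{di} = X_{di}\hat{\beta} + Z_{di}\hat{u}$. Let us consider the working parameter $\tau_d = \sum_{i=1}^{6}\mu^r_{di}$
and its estimator $\hat{\tau}_d = \sum_{i=1}^{6}\hat{\mu}^r_{di}$, and let us denote the unpredictable part of equation (14) by $\varepsilon^r_d = \sum_{i=1}^{6}(y^r_{di} - \mu^r_{di})$.
Then, the mean-squared error of $\hat{\delta}_d$ can be written in terms of the mean-squared error of $\hat{\tau}_d$ plus
additional terms as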
The second term on the right-hand side of equation (15) can be approximated by the conditional expectation
$$E\{\varepsilon^r_d(\varepsilon^r_d)^T|u_d\} = \sum_{i=1}^{6}\sum_{k=1}^{6} E\{(y^r_{di} - \mu^r_{di})(y^r_{dk} - \mu^r_{dk})^T|u_d\} = \sum_{i=1}^{6}\Sigma^r_{di},$$
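$$\hat{\mu}^r_{dij} \cong \mu^r_{dij} + \sum_{k=1}^{2}\frac{\partial\mu^r_{dij}}{\partial\theta_{dik}}(\hat{\theta}_{dik} - \theta_{dik}), \qquad j = 1, 2.$$
Calculating the expressions of the derivatives, the Taylor series expansion written in matrix notation is
$$\hat{\mu}^r_{di} - \mu^r_{di} \cong \Sigma^r_{di}(\hat{\theta}_{di} - \theta_{di}).$$
$$\hat{\tau}'_d = \sum_{i=1}^{6}\Sigma^r_{di}\hat{\theta}_{di} = M_d\hat{\beta} + K_d\hat{u},$$
where $M_d = \sum_{i=1}^{6}\Sigma^r_{di}X_{di}$ and $K_d = \sum_{i=1}^{6}\Sigma^r_{di}Z_{di}$. Then it holds that $\mathrm{MSE}(\hat{\tau}_d) = \mathrm{MSE}(\hat{\tau}'_d)$, where now $\hat{\tau}'_d$ is
linear in $\hat{\beta}$ and $\hat{u}$.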
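The matrix form of the Taylor expansion rests on the fact that the Jacobian of the mean with respect to θ equals the multinomial covariance matrix. A small numerical check, with hypothetical toy values for θ and the non-sampled size, illustrates this. The function names are illustrative only.

```python
import numpy as np

def mu_r(theta, m_r):
    """Expected counts (2-vector) for the non-sampled part of a cell."""
    e = np.exp(theta)
    return m_r * e / (1.0 + e.sum())

def sigma_r(theta, m_r):
    """Multinomial covariance m_r * (diag(p) - p p^T) for the first two categories."""
    p = mu_r(theta, 1.0)
    return m_r * (np.diag(p) - np.outer(p, p))

def numerical_jacobian(theta, m_r, h=1e-6):
    """Central finite differences; column k holds d mu_r / d theta_k."""
    cols = [(mu_r(theta + h * e_k, m_r) - mu_r(theta - h * e_k, m_r)) / (2.0 * h)
            for e_k in np.eye(theta.size)]
    return np.column_stack(cols)

theta = np.array([-0.8, 0.4])   # toy natural parameter
m_r = 120.0                     # toy non-sampled size
jac = numerical_jacobian(theta, m_r)
S = sigma_r(theta, m_r)

# First-order prediction of the change in mu_r for a small shift in theta.
shift = np.array([0.01, -0.02])
exact_change = mu_r(theta + shift, m_r) - mu_r(theta, m_r)
linear_change = S @ shift
```

Here `jac` and `S` agree up to finite-difference error, and `linear_change` approximates `exact_change` to first order in the shift.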
Under linear mixed models, Prasad and Rao (1990) obtained an analytical approximation of the MSE
of an estimator of the type λT β̂ + mT û, where β̂ and û are respectively the best linear unbiased estimator
of β and the best linear unbiased predictor of u. The multinomial mixed model can be approximated by
the linear mixed model (10) for the transformed data vector ξ. Moreover, PQL equations for β and u are
(see Breslow and Clayton (1993))
If V were known, these formulae would be the best linear unbiased estimator of β and the best linear
unbiased predictor of u under the linear model (10). Thus, this fact justifies the use of Prasad and Rao’s
formula for approximating MSE.τ̂ ′d /. This formula was adapted to a multivariate mixed linear model and
a multidimensional parameter by Baíllo and Molina (2005). Let us denote
$$\Lambda_d = K_d - M_dTZ^T\Sigma X,$$
$$\Gamma_d = \varphi V^{-1}ZM_d^T.$$
where
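$$G_1(\varphi) = M_dTM_d^T,$$
$$G_2(\varphi) = \Lambda_dP\Lambda_d^T,$$
$$G_3(\varphi) = (\partial\Gamma_d/\partial\varphi)^T\,V\,(\partial\Gamma_d/\partial\varphi)\,I^{-1}.$$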
Here, I denotes the Fisher information of the parameter ϕ obtained from the likelihood of ξ. If the ML
method is used for estimating ϕ, then the Fisher information is obtained from the (normal) likelihood of
ξ and is equal to
$$I_1 = \frac{1}{2}\sum_{d=1}^{D}\frac{\omega_d^2}{(1 + \varphi\omega_d)^2}, \qquad \omega_d = \sum_{i=1}^{6} m_{di}\,p_{di3}(1 - p_{di3}), \qquad d = 1, \ldots, D.$$
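The closed form for I1 can be checked numerically against the general expression for the Fisher information of a variance parameter in a normal model, ½ tr{(V⁻¹ ∂V/∂ϕ)²} with ∂V/∂ϕ = ZZᵀ. The sketch below builds V = ϕZZᵀ + Σ⁻¹ for a toy configuration (3 areas with 2 cells each, random probabilities and sizes, all hypothetical) and compares both values. The function name `fisher_info_ml` is illustrative.

```python
import numpy as np

def fisher_info_ml(omega, phi):
    """Closed form: I_1 = 0.5 * sum_d omega_d^2 / (1 + phi * omega_d)^2."""
    omega = np.asarray(omega, dtype=float)
    return 0.5 * np.sum(omega ** 2 / (1.0 + phi * omega) ** 2)

rng = np.random.default_rng(1)
D, cells, phi = 3, 2, 0.4
blocks, z_rows, omega = [], [], np.zeros(D)
for d in range(D):
    for _ in range(cells):
        p = rng.dirichlet([2.0, 2.0, 2.0])       # (p_1, p_2, p_3) for this cell
        m = rng.integers(20, 60)                 # multinomial size of this cell
        S = m * (np.diag(p[:2]) - np.outer(p[:2], p[:2]))
        blocks.append(np.linalg.inv(S))          # block of Sigma^{-1}
        z = np.zeros((2, D)); z[:, d] = 1.0
        z_rows.append(z)
        omega[d] += m * p[2] * (1.0 - p[2])

n_rows = 2 * D * cells
Sigma_inv = np.zeros((n_rows, n_rows))
for k, blk in enumerate(blocks):
    Sigma_inv[2 * k:2 * k + 2, 2 * k:2 * k + 2] = blk
Z = np.vstack(z_rows)
V = phi * Z @ Z.T + Sigma_inv

# General formula: 0.5 * trace((V^{-1} Z Z^T)^2).
VinvZZt = np.linalg.solve(V, Z @ Z.T)
brute_force = 0.5 * np.trace(VinvZZt @ VinvZZt)
closed_form = fisher_info_ml(omega, phi)
```

The agreement follows because ZᵀV⁻¹Z is diagonal with entries ω_d/(1 + ϕω_d), so the closed form needs only the D values ω_d rather than the full matrix V.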
If the method that is used for estimating ϕ is the RML, then the Fisher information that is obtained from
the restricted likelihood (12) becomes
$$I_2 = \frac{1}{2\varphi^2}\left\{n - \frac{2}{\varphi}\,\mathrm{tr}(R) + \frac{1}{\varphi^2}\,\mathrm{tr}(R^2)\right\}.$$
Let us denote $G_4(\varphi) = \sum_{i=1}^{6}\Sigma^r_{di}$. Then, an approximation to the mean-squared error of the estimator $\hat{\delta}_d$ of the original target
parameter is
$$\mathrm{MSE}_A(\hat{\delta}_d) = \sum_{k=1}^{4} G_k(\varphi).$$
An estimator of $\mathrm{MSE}_A(\hat{\delta}_d)$ could be obtained by replacing $\varphi$ in each $G_k(\varphi)$ by its estimator, either the
ML or the RML estimator. However, it is known (see for example Prasad and Rao (1990)) that $G_1(\hat{\varphi})$
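is asymptotically biased for $G_1(\varphi)$, with negative bias equal to $G_3(\varphi)$. Thus, an asymptotically unbiased
estimator of $G_1(\varphi)$ is $G_1(\hat{\varphi}) + G_3(\hat{\varphi})$. Therefore, we take the following estimator of $\mathrm{MSE}_A(\hat{\delta}_d)$:
$$\mathrm{mse}_A(\hat{\delta}_d) = G_1(\hat{\varphi}) + G_2(\hat{\varphi}) + 2\,G_3(\hat{\varphi}) + G_4(\hat{\varphi}).$$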
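The factor 2 on the third term can be read off by regrouping the estimator, as in the short worked step below. It assumes, as is usual in this type of argument, that replacing $\varphi$ by $\hat{\varphi}$ in $G_2$, $G_3$ and $G_4$ introduces only lower-order bias.

```latex
\begin{align*}
E\{G_1(\hat{\varphi})\} &\approx G_1(\varphi) - G_3(\varphi)
  \quad\Longrightarrow\quad
  E\{G_1(\hat{\varphi}) + G_3(\hat{\varphi})\} \approx G_1(\varphi), \\
\mathrm{mse}_A(\hat{\delta}_d)
  &= \underbrace{G_1(\hat{\varphi}) + G_3(\hat{\varphi})}_{\text{estimates } G_1(\varphi)}
   + G_2(\hat{\varphi})
   + \underbrace{G_3(\hat{\varphi})}_{\text{estimates } G_3(\varphi)}
   + G_4(\hat{\varphi}).
\end{align*}
```

One copy of $G_3(\hat{\varphi})$ corrects the bias of $G_1(\hat{\varphi})$ and the other estimates the $G_3(\varphi)$ term of $\mathrm{MSE}_A(\hat{\delta}_d)$ itself.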
References
Bailey, S., Charlton, J., Dollamore, G. and Fitzpatrick, J. (2000) Families, Groups and Clusters of local and health
authorities: revised for authorities in 1999. Popln Trends, 99, 37–52.
Baíllo, A. and Molina, I. (2005) Mean squared errors of small area estimators under a unit-level multivariate
model. Working Paper 05-40 (07). Universidad Carlos III de Madrid, Madrid.
Breslow, N. E. and Clayton, D. G. (1993) Approximate inference in generalized linear mixed models. J. Am.
Statist. Ass., 88, 9–25.
Butar, F. B. and Lahiri, P. (2003) On measures of uncertainty of empirical Bayes small-area estimators. J. Statist.
Planng Inf., 112, 63–76.
Claeskens, G. (2004) Restricted likelihood ratio lack-of-fit tests using mixed spline models. J. R. Statist. Soc. B,
66, 909–926.
Estevao, V. M. and Särndal, C. E. (1999) The use of auxiliary information in design-based estimation for domains.
Surv. Methodol., 25, 213–221.
Estevao, V. M. and Särndal, C. E. (2005) Borrowing strength is not the best technique within a wide class of
design-consistent domain estimators. J. Off. Statist., 20, 1–25.
EURAREA Consortium (2004) EURAREA Project IST-2000-26290. (Available from http://www.statistics.gov.uk/eurarea.)
Fay, R. E. and Herriot, R. A. (1979) Estimation of income from small places: an application of James-Stein
procedures to census data. J. Am. Statist. Ass., 74, 269–277.
González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D. and Santamaría, L. (2007) Estimation of
the mean squared error of predictors of small area linear parameters under a logistic mixed model. Computnl
Statist. Data Anal., 51, 2720–2733.
Hall, P. and Maiti, T. (2006) On parametric bootstrap methods for small area prediction. J. R. Statist. Soc. B, 68,
221–238.
Harville, D. A. (1977) Maximum likelihood approaches to variance component estimation and related problems.
J. Am. Statist. Ass., 72, 322–340.
Hastings, D., Maine, N., Brown, G. and Crudas, M. (2003) Development of improved estimation methods for
local area unemployment levels and rates. In Technical Report, Labour Market Trends, pp. 37–43. London:
Office for National Statistics.
Jiang, J. and Lahiri, P. (2006) Mixed model prediction and small area estimation. Test, 15, 1–96.
Jiang, J., Lahiri, P. and Wan, S. (2002) A unified jackknife theory for empirical best prediction with M-estimation.
Ann. Statist., 30, 1782–1810.
Lehtonen, R. and Veijanen, A. (1998) Logistic generalized regression estimators. Surv. Methodol., 24, 51–55.
Office for National Statistics (2004) Labour Force Survey User Guide. London: Office for National Statistics.
(Available from http://www.statistics.gov.uk/downloads/theme-labour/Vol6.pdf.)
Pfeffermann, D. and Tiller, R. (2005) Bootstrap approximation to prediction MSE for state-space models with
estimated parameters. J. Time Ser. Anal., 26, 893–916.
Prasad, N. G. N. and Rao, J. N. K. (1990) The estimation of the mean squared error of small-area estimators.
J. Am. Statist. Ass., 85, 163–171.
Rao, J. N. K. (2003) Small Area Estimation. New York: Wiley.
Saei, A. and Chambers, R. (2003) Small area estimation under linear and generalized linear mixed models with
time and area effects. Working Paper M03/15. Southampton Statistical Sciences Research Institute, University
of Southampton, Southampton.
Schall, R. (1991) Estimation in generalized linear models with random effects. Biometrika, 78, 719–727.
Self, S. G. and Liang, K.-Y. (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio
tests under nonstandard conditions. J. Am. Statist. Ass., 82, 605–610.