Journal Soche Vol 3.1
Corresponding author. Francesca Ieva. Dipartimento di Matematica Politecnico di Milano, Piazza Leonardo da
Vinci 32 I-20133, Milano, Italy. Email: francesca.ieva@mail.polimi.it
ISSN: 0718-7912 (print)/ISSN: 0718-7920 (online)
© Chilean Statistical Society – Sociedad Chilena de Estadística
http://www.soche.cl/chjs
16 A. Guglielmi, F. Ieva, A.M. Paganoni and F. Ruggeri
survival concerning a specific disease. As a worthy contribution of this work, both a clinical registry and an administrative database were used to model the in-hospital survival of acute myocardial infarction patients, in order to point out benchmarks to be used in the provider profiling process.
The disease we are interested in is ST-segment elevation acute myocardial infarction (STEMI): it consists of a stenotic plaque detachment, which causes a coronary thrombosis and a sudden critical reduction of blood flow in the coronary vessels. STEMI is characterized by a high incidence (650-700 events per month have been estimated in the Lombardia region alone, whose inhabitants number approximately ten million) and serious mortality (about 8% in Italy); in fact, it is one of the main causes of death worldwide. A case of STEMI can be diagnosed through the electrocardiogram (ECG), by observing the elevation of the ST segment, and treated by thrombolytic therapy and/or percutaneous transluminal coronary angioplasty (PTCA), which are up to now the most common procedures. The patients in our data set always underwent a PTCA procedure directly, avoiding thrombolysis, even though the two treatments are not mutually exclusive. In any case, good results for either treatment can be evaluated by observing first the in-hospital survival of inpatients, and then quantifying the reduction of the ST-segment elevation one hour after the intervention.
Concerning heart attacks, both survival and the quantity of myocardial tissue saved from damage strongly depend on the time saved during the process; in this work, we focus on the survival outcome. Time nevertheless plays a fundamental role in the overall STEMI health care process. By Symptom Onset to Door time we mean the time from symptom onset to arrival at the Emergency Room (ER), and Door to Balloon time (DB time) is the time from arrival at the ER to the surgical practice of PTCA. The clinical literature strongly stresses the connection between in-hospital survival and procedure times, as attested, e.g., in Cannon et al. (2000), Jneid et al. (2008) and MacNamara et al. (2006).
The presence of differences in the outcomes of health care has been documented extensively in recent years. In order to design regulatory interventions by institutions, for instance, it is interesting to study the effects of variations in health care utilization on patient outcomes, in particular by examining the relationship between process indicators, which define regional or hospital practice patterns, and outcome measures, such as patient survival or treatment efficacy. If the analysis of variations concerns in particular the comparison of the performance of health care providers, it is commonly referred to as provider profiling; see Normand et al. (1997) and Racz and Sedransk (2010). The results of profiling analyses often have far-reaching implications. They are used to generate feedback for health care providers, to design educational and regulatory interventions by institutions and government agencies, to design marketing campaigns by hospitals and managed care organizations, and, ultimately, to select health care providers by individuals and managed care groups.
The major aim of this work is to measure the magnitude of the variation among health care providers and to assess the role of contributing factors, including patients' and providers' characteristics, on the survival outcome of STEMI patients. Data on health care utilization have a natural multilevel structure, usually with patients at the lower level and hospitals forming the upper-level clusters. Within this formulation, two main goals are taken into account: one is to provide cluster-specific estimates of a particular response, adjusted for patients' characteristics, while the other is to derive estimates of covariate effects, such as differences between patients of different gender or between hospitals. Hierarchical regression modelling from a Bayesian perspective provides a framework that can accomplish both these goals. In particular, this article considers a Bayesian generalized linear mixed model (see Zeger and Karim, 1991) to predict the binary survival outcome by means of relevant covariates, taking into account the overdispersion induced by the grouping factor.
Chilean Journal of Statistics 17
We illustrate the analysis on a subset of data collected in the MOMI² survey on patients admitted with a STEMI diagnosis to one of the structures belonging to the Milano Cardiological Network, using a logit model for the survival probability. For this analysis, patients are grouped by the hospital they were admitted to for their infarction. Assuming a Bayesian hierarchical approach for the hospital factors allows not only modelling dependence among the random-effects parameters, but also using the data set to make inferences on hospitals which do not have patients in the study, borrowing strength across patients, as well as clustering the hospitals. A Markov chain Monte Carlo (MCMC) algorithm is necessary to compute the posterior distributions of parameters and predictive distributions of outcomes, as well as to use other diagnostic tools, such as Bayesian residuals, for goodness-of-fit analysis. The choice of covariates and link functions was suggested first in Ieva and Paganoni (2011), according to frequentist selection procedures and clinical know-how; however, it was confirmed here using Bayesian tools. We found out that killip first, which is an index of the severity of the infarction, and then age, have a sharp negative effect on the survival probability, while the Symptom Onset to Balloon time has a lighter influence on it. An interesting, novel finding is that the resulting variability among hospitals seems not too large, even if we found that four hospitals have a more extreme effect on survival (one has a positive effect, while the remaining three have a negative effect) than the others. Such a finding can be explained by the relative homogeneity among the hospitals, all located in Milano, the region capital. Larger heterogeneity is expected in the future when extending the analysis to all the hospitals in the region. The advantages of a Bayesian approach to this problem are several: provider profiling and patient classification can be guided not only by statistical but also by clinical knowledge, hospitals with low exposure can be automatically included in the analysis, and provider profiling can be simply achieved through the posterior distribution of the hospital-effects parameters.
To the best of our knowledge, this study is the first example of the use of Bayesian methods in provider profiling using data which arise from the linkage between Italian administrative databanks and clinical registries. This paper shares the same framework of hierarchical generalized linear mixed models as Daniels and Gatsonis (1999), who examined differences in the utilization of coronary artery bypass graft surgery for elderly heart attack patients treated in hospitals.
The paper is organized as follows. Section 2 illustrates the data set on STEMI in the Milano Cardiological Network, while Section 3 describes the main features of the proposed model, with a short discussion on covariate selection. Sections 4 and 5 discuss prior elicitation and Bayesian inference, respectively. Finally, Section 6 presents results of the inference on quantities of interest with a discussion. Some final remarks are reported in Section 7. All the analyses have been performed with WinBUGS (see Lunn et al., 2000, and also http://www.mrc-bsu.cam.ac.uk/bugs) and R (2009) (version 2.10.1) programs.
2. The STEMI Data Set
A network connecting the territory to hospitals, through a centralized coordination of the emergency resources, has been active in the Milano urban area since 2001. The aim of a monitoring project on it is the activation of a registry on STEMI to collect process indicators (Symptom Onset to Door time, first ECG time, Door to Balloon time and so on), in order to identify and develop new diagnostic, therapeutic and organizational strategies to be applied to patients affected by STEMI by the Lombardia region, hospitals and the 118 organization (118 is the national toll-free number for medical emergencies). To reach this goal, it is necessary to understand which organizational aspects can be considered predictive of a reduction in time to treatment. In fact, organizational policies in the STEMI health care process concern both the 118 organization and hospitals, since a subject affected by an infarction can reach the hospital on his own or can be taken to the hospital by 118 rescue units.
So, in order to monitor the Milano Cardiological Network activity, times to treatment and clinical outcomes, the MOMI² data collection was planned and carried out on STEMI patients, during six periods corresponding to monthly/bimonthly collections. For these units, information concerning mode of admission (on his/her own or by three different types of 118 rescue units), demographic features (sex, age), clinical appearance (presenting symptoms and Killip class at admittance), received therapy (thrombolysis, PTCA), Symptom Onset to Door time, in-hospital times (first ECG time, DB time), hospital organization (for example, admission during on/off hours) and clinical outcome (in-hospital survival) have been listed and studied. The Killip classification is a system used with acute myocardial infarction patients to stratify them into four risk-severity classes. Individuals with a low Killip class are less likely to die within the first 30 days after their myocardial infarction than individuals with a high Killip class. The whole MOMI² survey consists of 840 statistical units, but in this work we focus only on patients who underwent primary PTCA and belong to the third and fourth collections, since these are of better quality. Among the resulting PTCA patients, we selected those whose hospital admission was also registered in the Public Health Database of the Lombardia region, in order to confirm the reliability of the information collected in the MOMI² registry. Finally, the considered data set consists of 240 patients.
Previous frequentist analyses of the MOMI² survey (see Grieco et al., 2008; Ieva, 2008; Ieva and Paganoni, 2010) pointed out that age, total ischemic time (Symptom Onset to Balloon time, denoted by OB) on the logarithmic scale, and the killip of the patient are the most significant factors for explaining the survival probability from a statistical and clinical point of view. Here killip is a binary variable, equal to 0 for less severe (Killip class 1 or 2) and 1 for more severe (Killip class 3 or 4) infarction. This choice of covariates was confirmed using a Bayesian variable selection procedure; see the next section for more details.
The main goal of our study is to explain and predict, by means of a Bayesian random-effects model, the in-hospital survival (i.e., the proportion of patients discharged alive from the hospital). The data set consists of n = 240 patients who were admitted to J = 17 hospitals after a STEMI event. The number of STEMI patients per hospital ranges from 1 to 32, with a mean of 14.12. Each observation y_i records whether a patient survived after STEMI, i.e., y_i = 1 if the ith patient survived and y_i = 0 otherwise. In the rest of the paper, y denotes the vector of all responses (y_1, . . . , y_n). The data set is strongly unbalanced, since 95% of the patients were discharged alive. The observed hospital survival rates range from 75% to 100%. These high values are explained by the fact that they are in-hospital survival probabilities, follow-up data being not yet available. The data set contained some missing covariates, with proportions of 7%, 24% and 2% for age, OB and killip, respectively. The missing data for age and OB were imputed as the empirical means (64 years for age, 553 minutes for OB), while we sampled the missing 0-1 killip class covariates from the Bernoulli distribution with probability of success estimated from the non-missing data. After having imputed all the covariates, the mean values of age and OB did not change, while the proportion of patients with less severe infarction (killip = 0) was 94%. Finally, we had no missing data concerning hospital of admission and outcome.
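The imputation step just described can be sketched in a few lines of Python (an illustration, not the code used for the paper; the record layout and field names are hypothetical):

```python
import random

def impute(patients, age_mean=64.0, ob_mean=553.0, seed=0):
    """Impute missing covariates as described above: age and OB get the
    empirical means, while a missing killip indicator is drawn from a
    Bernoulli with the success proportion observed in the complete cases."""
    rng = random.Random(seed)
    observed = [p["killip"] for p in patients if p["killip"] is not None]
    p_severe = sum(observed) / len(observed)  # proportion with killip == 1
    out = []
    for p in patients:
        q = dict(p)
        if q["age"] is None:
            q["age"] = age_mean
        if q["ob"] is None:
            q["ob"] = ob_mean
        if q["killip"] is None:
            q["killip"] = 1 if rng.random() < p_severe else 0
        out.append(q)
    return out
```

Imputing age and OB with their empirical means leaves those means unchanged, which is exactly the behaviour reported above.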
3. A Bayesian Generalized Mixed-Effects Model
We considered a generalized mixed-effects model for binary data from a Bayesian viewpoint. For a recent review on this topic, see Chapters 1-3 of Dey et al. (2000). For each patient (i = 1, . . . , n), let Y_i be a Bernoulli random variable with mean p_i, which represents the probability that the ith patient survived after STEMI. The p_i's are modelled through a logit regression with covariates x := {x_i}, x_i := (1, x_i1, x_i2, x_i3), which represent the age, the Symptom Onset to Balloon time on the log scale (log-OB) and the killip, respectively, of the ith patient in the data set. Moreover, age and log-OB have been centered. Since the patients come from J different hospitals, we assume the following multilevel model, with the hospital as a random effect:
Y_i | p_i ~ Be(p_i) independently, i = 1, . . . , n,   (1)

and

logit(p_i) = log( p_i / (1 − p_i) ) = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + b_k[i],   (2)
where b_k[i] represents the hospital effect for the ith patient, admitted to hospital k[i]. We denote by β the vector of regression parameters (β_0, β_1, β_2, β_3). It is well known that Equations (1) and (2) have a latent variable representation (see Albert and Chib, 1993), which can be very useful in performing Bayesian inference, as well as in providing medical significance: conditioning on the latent variables Z_1, . . . , Z_n, the Y_1, . . . , Y_n are independent, and, for i = 1, . . . , n,
Y_i = 1, if Z_i ≥ 0;  Y_i = 0, if Z_i < 0;   (3)

where

Z_i = x_i'β + b_k[i] + ε_i,  ε_i i.i.d. ~ f_ε,   (4)

f_ε(t) = e^(−t) / (1 + e^(−t))² being the standard logistic density function. The same class of models, however without random effects, was applied in Souza and Migon (2004) to a similar data set of patients after acute myocardial infarction.
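The equivalence between the direct logit formulation of Equations (1)-(2) and the latent-variable representation of Equations (3)-(4) can be checked numerically. The Python sketch below (an illustration, not the WinBUGS code used later in the paper) thresholds a logistic latent variable at zero and compares the hit rate with the inverse logit of the linear predictor η = x_i'β + b_k[i]:

```python
import math
import random

def p_direct(eta):
    """P(Y = 1) from the logit model of Eq. (2): the inverse logit of eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def p_latent(eta, n_draws=200_000, seed=1):
    """Monte Carlo estimate of P(Z >= 0) under the latent representation of
    Eqs. (3)-(4), where Z = eta + eps and eps follows the standard logistic
    density f(t) = e^{-t} / (1 + e^{-t})^2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        u = rng.random()
        eps = math.log(u / (1.0 - u))  # inverse CDF of the standard logistic
        if eta + eps >= 0:
            hits += 1
    return hits / n_draws
```

For any fixed η, the two routes agree up to Monte Carlo error, which is what makes the latent variables usable for inference and for the residuals of Section 5.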
As mentioned in Section 2, the choice of covariates was first suggested in Ieva and Paganoni (2011), using frequentist model choice tools. However, we have considered it also in a Bayesian framework, using the Gibbs variable selection method of Dellaportas et al. (2002). But first, as a default analysis, we considered covariate selection via the R package BMA; see Raftery et al. (2009). A subgroup of 197 patients with 11 non-missing covariates was processed by the function bic.glm, and 7 covariates were selected (age, OB time, killip, sex, admission during on/off hours, ECG time, number of previous hospitalizations). For this choice of covariates, the non-missing data extracted from the 240-patient data set consist of 217 units, which were again analyzed via bic.glm. The posterior probability that each variable is non-zero was very high (about 40%) for age and killip, while it was smaller than 7% for the others. Moreover, the smallest BICs, denoting the best models, resulted for those including age, killip and sex. Since sex is strongly correlated with age in our data set (only elderly women are in it), in the end we agreed with the choice of covariates in Ieva and Paganoni (2011), considering only age and killip, while the OB time was strongly recommended by clinical and health care utilization know-how, since it was the main process indicator of the MOMI² clinical survey.
As a second analysis, we consider only covariates which have non-missing values for all patients (age, OB time, killip, sex, admission during on/off hours, number of previous hospitalizations), to be analyzed using the Gibbs variable selection method. The linear predictor assumed on the right-hand side of Equation (2) to select covariates can be represented as

η_i = β_0 + Σ_{j=1}^{6} γ_j β_j x_ij,  i = 1, . . . , n,   (5)
where γ = (γ_1, . . . , γ_6) is a vector of parameters in {0, 1}. Of course, a prior for both the regression parameter β and the model index parameter γ must be elicited, so that the marginal posterior probabilities of γ suggest which variables must be included in the model. We assumed different noninformative priors for the logit model with the linear predictor given in Equation (5), as suggested in Ntzoufras (2002), implementing a simple BUGS code to compute the marginal posterior distribution of each γ_j, for j = 1, . . . , 6, and the posterior inclusion probabilities. The analysis confirmed the previously selected model.
The selection of so few covariates (out of 13, the total number) is not surprising, since previous analyses (see Ieva, 2008; Ieva and Paganoni, 2010) pointed out that the covariates are highly correlated. For instance, there is dependency between age on one hand and sex, or symptoms, or mode of admission, on the other, between symptoms and killip, or symptoms and mode of admission, and between sex and symptoms. These relationships can be explained because acute coronary syndromes, such as STEMI, affect mainly male patients rather than females, and are more frequent as patient age increases. Moreover, it is well known that STEMI symptoms depend on the severity of the infarction itself, and elderly patients usually have more atypical symptoms. Furthermore, the symptoms may influence the choice of the type of ambulance sent to rescue the patient; ambulances which allow ECG teletransmission are usually sent to patients presenting more typical infarction symptoms, in order to allow them to skip the waiting time due to ER procedures, and accordingly to reduce the Door to Balloon time.
4. The Prior Distribution
As mentioned in the previous sections, one of the aims of this paper is to compare the survival probabilities of patients treated in different hospitals of the Milano Cardiological Network. Such an aim can be accomplished if, for instance, we treat the hospital each patient was admitted to as a random factor. We make the usual (from a Bayesian viewpoint) random-effects assumption for the hospitals, that is, that the hospital-effect parameters b_j are drawn from a common distribution; moreover, since no information is available at the moment to distinguish among the hospitals, we assume symmetry among the hospital parameters themselves, i.e., b_1, . . . , b_J can be considered as (the first part of an infinite sequence of) exchangeable random variables. Via Bayesian hierarchical models, not only do we model dependence among the random-effects parameters b := (b_1, . . . , b_J), but it is also possible to use the data set to make inferences on hospitals which have few or no patients in the study, borrowing strength across hospitals. As usual in the hierarchical Bayesian approach, the regression parameter β and the hospital parameter b are assumed a priori independent, β is given a (multivariate) Gaussian distribution and b is given a scale mixture of (multivariate) Gaussian distributions; more specifically:

β ⊥ b,  β ~ N(μ_β, V_β),
b_1, . . . , b_J | σ i.i.d. ~ N(μ_b, σ²),  and  σ ~ U(0, σ_0).   (6)
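For illustration, a single draw from the prior structure in Equation (6) can be sketched as follows (plain Python, with a diagonal V_β; the numerical hyperparameter values passed in are placeholders, not the elicited values discussed in Section 6.2):

```python
import random

def draw_from_prior(mu_beta, v_beta_diag, sigma_0=10.0, J=17, mu_b=0.0, seed=3):
    """One realization of (beta, b, sigma) under Equation (6):
    beta ~ N(mu_beta, diag(v_beta_diag)), sigma ~ U(0, sigma_0),
    and b_1, ..., b_J | sigma i.i.d. ~ N(mu_b, sigma^2)."""
    rng = random.Random(seed)
    beta = [rng.gauss(m, v ** 0.5) for m, v in zip(mu_beta, v_beta_diag)]
    sigma = rng.uniform(0.0, sigma_0)
    b = [rng.gauss(mu_b, sigma) for _ in range(J)]
    return beta, b, sigma
```

Drawing the b_j's conditionally on σ is what makes them exchangeable rather than independent a priori: marginally they are a scale mixture of Gaussians.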
Observe that the prior assumption on b is that, conditionally on the parameter σ, each hospital-effect parameter has a Gaussian distribution with variance σ²; here the uniform prior on σ is set as an assumption of ignorance/symmetry on the standard deviation of each hospital effect. The Gaussian prior for β is standard, but its hyperparameters, as well as the hyperparameter of the prior distribution for σ, are given informatively, using available information from the other MOMI² collections; for more details, see Section 6.2. On the other hand, a more standard prior for the b_j's would be a scale mixture of normals, mixed by an inverse-gamma distribution for σ², with parameters (ε, ε) for small ε. However, this prior has often been criticized (see Gelman, 2006), mainly because the inferences are not robust with respect to the choice of ε, and the prior density (for all small ε), as well as the resulting posterior, are too peculiar. In what follows, the parameter vector (β, b, σ) is denoted by θ.
5. Bayesian Inference
Based on the given priors and likelihood, the posterior distribution of θ is expressed by

π(θ | y, x) ∝ π(θ) L(y | θ, z, x) f(z)
            = π(β) π(b | σ) π(σ) ∏_{i=1}^{n} [I_(0,+∞)(z_i)]^{y_i} [I_(−∞,0](z_i)]^{1−y_i} ∏_{i=1}^{n} f_ε(z_i − x_i'β − b_k[i]).   (7)
We are interested in predictions too. This means considering (i) the posterior predictive survival probability of a new patient coming from a hospital already included in the study, or (ii) the posterior predictive survival probability of a new patient coming from a new (J + 1)th hospital. We have

P(Y_{n+1} = 1 | y, x, b_j) = ∫_{R^4} P(Y_{n+1} = 1 | β, b_j, x) π(β | b_j, y) dβ,  j = 1, . . . , J,   (8)

for a new patient with covariate vector x coming from the jth hospital in the study, and

P(Y_{n+1} = 1 | y, x, b_{J+1}) = ∫_{R^4} P(Y_{n+1} = 1 | β, b_{J+1}, x) π(β | b_{J+1}, y) dβ,   (9)

where π(β | b_{J+1}, y) is computed from

π(β, b_{J+1} | y) = ∫_{R_+} π(b_{J+1} | σ) π(β, σ | y) dσ,

π(b_{J+1} | σ) being the prior population conditional distribution given in Equation (6).
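Given MCMC output, the integral in Equation (8) is typically approximated by averaging over the posterior draws. A minimal sketch (assuming the draws are available as Python lists, which is not how the paper's WinBUGS/R pipeline stores them):

```python
import math

def predictive_survival(x, beta_draws, b_draws):
    """Monte Carlo estimate of the posterior predictive survival probability
    in Equation (8): average the inverse logit of x'beta + b_j over paired
    MCMC draws of (beta, b_j) for the hospital of interest."""
    total = 0.0
    for beta, bj in zip(beta_draws, b_draws):
        eta = sum(bk * xk for bk, xk in zip(beta, x)) + bj
        total += 1.0 / (1.0 + math.exp(-eta))
    return total / len(beta_draws)
```

For Equation (9), one would instead draw a fresh b_{J+1} ~ N(μ_b, σ²) from each retained draw of σ and average in the same way.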
As far as model checking is concerned, we consider predictive distributions for patients already enrolled in the study, in the spirit of the replicated data of Gelman et al. (2004). More specifically, we compute

P(Y_i^new = 1 | y, x_i, b_k[i]),  for all i = 1, . . . , n.   (10)

Here, Y_i^new denotes the ith replicated data point that could have been observed, or, to think predictively, the data that we would see tomorrow if the experiment that produced y_i today were replicated with the same model and the same value of the parameters that produced the observed data; see Gelman et al. (2004, Section 6.3). Since we have a very unbalanced data set, the following Bayesian rule is adopted: a patient is classified as alive if P(Y_i^new = 1 | y, x_i, b_k[i]) = E[Y_i^new | y, x_i, b_k[i]] is greater than the empirical mean ȳ_n. This
rule is equivalent to minimizing the expected value of the following loss function:

L(P(Y_i = 1 | y, x_i, b_k[i]), a_1) = max{0, ȳ_n − P(Y_i = 1 | y, x_i, b_k[i])},
L(P(Y_i = 1 | y, x_i, b_k[i]), a_0) = max{0, P(Y_i = 1 | y, x_i, b_k[i]) − ȳ_n},

where the action a_1 is to classify the patient as alive and the action a_0 corresponds to classifying the patient as dead. Then the coherence between the Bayesian rule and the data set is checked.
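The classification rule can be sketched as follows (illustrative Python, not the paper's code): a patient is marked as alive whenever the predictive survival probability exceeds the empirical survival rate ȳ_n, which is the Bayes action under the piecewise-linear losses above.

```python
def classify(pred_probs, y):
    """Classify each patient as alive (1) if the posterior predictive
    survival probability exceeds the empirical mean of y, else dead (0)."""
    y_bar = sum(y) / len(y)
    return [1 if p > y_bar else 0 for p in pred_probs]

def error_rate(pred, y):
    """Fraction of patients whose classification disagrees with the data."""
    return sum(1 for a, b in zip(pred, y) if a != b) / len(y)
```

With a strongly unbalanced data set such as this one, thresholding at ȳ_n rather than at 1/2 avoids trivially classifying every patient as alive.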
Finally, we computed the latent Bayesian residuals for binary data as suggested in Albert and Chib (1995). Thanks to the latent variable representation of the model in Equations (3) and (4), we can consider the realized errors

e_i = Z_i − (x_i'β + b_k[i]),  i = 1, . . . , n,   (11)

obtained by solving Equation (4) with respect to ε_i. Each e_i is a function of the unknown parameters, so that its posterior distribution can be computed through the MCMC simulated values, and later examined for indications of possible departures from the assumed model and the presence of outliers; see also Chaloner and Brant (1988). Therefore, it is sensible to plot credibility intervals for the marginal posterior of each e_i, comparing them to the marginal prior credibility intervals (of the same level).
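Given the MCMC output, posterior draws of each realized error in Equation (11) are obtained by plugging the simulated (Z_i, β, b_k[i]) into the identity, one draw at a time. A minimal sketch (the data layout is hypothetical):

```python
def residual_draws(z_draws, beta_draws, b_draws, x):
    """Posterior draws of the realized error e_i = Z_i - (x_i'beta + b_k[i])
    of Equation (11) for one patient with covariate vector x, evaluated at
    each MCMC iteration (z, beta, b) for that patient's hospital."""
    out = []
    for z, beta, b in zip(z_draws, beta_draws, b_draws):
        eta = sum(bk * xk for bk, xk in zip(beta, x)) + b
        out.append(z - eta)
    return out
```

Summarizing these draws with credibility intervals and comparing them to the prior logistic intervals is what the residual plots of Section 6 display.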
6. Data Analysis
In this section we illustrate the Bayesian analysis of the data set described in Section 2,
giving some details on computations and prior elicitation.
6.1 Bayesian computations
As we mentioned in Section 1, all estimates were derived using WinBUGS. The full conditionals needed to directly implement a Gibbs sampler algorithm can be computed starting from Equation (7); however, they are not standard distributions, i.e., closed-form expressions do not exist for all of them, given the priors in Equation (6). Some details on the full conditionals for general-design GLMMs required by WinBUGS are in Zhao et al. (2006).
The first 100,000 iterations of the chain were discarded, retaining parameter values every 80 iterations to decrease autocorrelations, for a final sample size of 5,000; we ran the chains much longer (for a final sample size of 10,000 iterations), but the gain in the MC errors was relatively small. Some convergence diagnostics (Geweke's and the two Heidelberger-Welch ones) were checked; see, e.g., the reference manual of the CODA package (Plummer et al., 2006) for more details. Moreover, we monitored traceplots, autocorrelations and MC error/posterior standard deviation ratios for all the parameters, indicating that the MCMC algorithm converged. Code is available from the authors upon request.
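The burn-in and thinning scheme just described amounts to simple slicing of the stored trace (a sketch; with 500,000 total iterations it yields exactly the 5,000 retained draws mentioned above):

```python
def burn_and_thin(chain, burn_in=100_000, thin=80):
    """Post-process an MCMC trace as in Section 6.1: discard the first
    `burn_in` iterations, then keep every `thin`-th draw to reduce
    autocorrelation in the retained sample."""
    return chain[burn_in::thin]
```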
6.2 Informative prior hyperparameters
Concerning information about hyperprior parameters, we fixed μ_b = 0 regardless of any information, since, by the exchangeability assumption, the different hospitals have the same prior mean (fixed equal to 0 to avoid confounding with β_0). As far as β is concerned, we have enough past data to be relatively informative in eliciting prior hyperparameters; they were fixed after having fitted the model given in Equations (1) and (2), under non-informative priors for β, to similar data, i.e., 359 patients who underwent primary PTCA and whose data were collected during the other four MOMI² collections. Therefore, for the present analysis, we fixed V_β = diag(2, 0.04, 0.5882, 3.3333); except for the second value, these are about 10 times the posterior variances of the regression parameters under the preliminary analysis (0.04 is 100 times the posterior variance, in order to consider a vaguer prior for β_1). The prior hyperparameter σ_0 was fixed equal to 10, a value compatible with the support of the posterior distribution for σ in the preliminary analysis. Posterior estimates of β, b and σ proved to be robust with respect to σ_0.
Figure 1. Marginal posterior density of the regression coefficients (panels for β_1, β_2 and β_3).
Figure 2. Marginal posterior density of σ.
In Table 2, we report

p_j = min{P(b_j > 0 | y), P(b_j < 0 | y)},  j = 1, . . . , J,

together with the sign of the posterior median of the b_j's. Low values of p_j indicate that the posterior distribution of b_j is far from 0, so that the jth hospital significantly contributes to the (estimated) regression intercept β_0 + b_j. In Figure 3, the credible intervals corresponding to p_j less than 0.18 are depicted in yellow; it is clear that hospital 9 has a positive effect, while hospitals 10, 11 and 15 have a negative effect on the survival probability.
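From MCMC output, each p_j is estimated by the proportions of positive and negative draws of b_j (illustrative Python, not the paper's code):

```python
def pj(bj_draws):
    """Estimate p_j = min{P(b_j > 0 | y), P(b_j < 0 | y)} from MCMC draws;
    small values flag hospitals whose posterior mass is far from zero."""
    n = len(bj_draws)
    pos = sum(1 for b in bj_draws if b > 0) / n
    neg = sum(1 for b in bj_draws if b < 0) / n
    return min(pos, neg)
```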
Figure 3. Posterior median (bullet), mean (square) and 95% credibility intervals of all random-effect parameters b_j, plotted by hospital. The credible intervals for hospitals such that min(P(b_j > 0|y), P(b_j < 0|y)) < 0.18 are dashed.
Table 2. Values of p_j and the sign of the posterior median of each hospital parameter.

         b_1   b_2   b_3   b_4   b_5   b_6   b_7   b_8   b_9
  p_j   0.27  0.40  0.32  0.25  0.44  0.41  0.49  0.49  0.18
  sign    +     +     +     +     −     +     +     +     +

         b_10  b_11  b_12  b_13  b_14  b_15  b_16  b_17
  p_j   0.17  0.12  0.28  0.28  0.44  0.17  0.26  0.29
  sign    −     +     −     −     +     −     −     +
Observe that all the credible intervals of the random-effect parameters in Figure 3 include 0, so we might wonder whether the random intercept should be discarded from the model. However, Mauri (2011) presents a Bayesian selection analysis of the same data set considered here, concluding that the posterior inclusion probability of the random effect is significantly larger than 0 (between 0.2 and 0.6 under different reasonable priors). Similar findings were drawn in Ieva and Paganoni (2010) from a frequentist perspective.
Figure 4 displays medians and 95% credibility intervals for the posterior predictive survival probabilities given in Equation (8) for four benchmark patients:

(a) x_1 = 0, x_2 = 0, x_3 = 0, i.e., a patient with average age (64 years), average OB (553 min.) and less severe infarction (Killip class 1 or 2);
(b) x_1 = 0, x_2 = 0, x_3 = 1, i.e., a patient with the same age and OB as (a), but with severe infarction (Killip class 3 or 4);
(c) x_1 = 16, x_2 = 0, x_3 = 0, i.e., an elderly patient (80 years), with average OB (553 min.) and less severe infarction;
(d) x_1 = 16, x_2 = 0, x_3 = 1, i.e., an elderly patient with average OB and severe infarction,

each coming from a hospital already in the study. The last credibility interval (in red in each panel) corresponds to the posterior predictive survival probability given in Equation (9) for a benchmark patient coming from a new random (J + 1)th hospital. Moreover, from the figure it is clear that killip has a stronger (on average) influence on survival than age since, moving from the left to the right panels (same age, killip increased), the credibility intervals get much wider than when moving from the top to the bottom panels (same killip, age increased).
Finally, as far as predictive model checking is concerned, we computed the predictive probabilities in Equation (10); the classification rule described in Section 5 gives an error rate equal to 27% (64 patients were erroneously classified as dead and only 1 patient was
Figure 4. Posterior median (bullet), mean (square) and 95% credible intervals of the posterior predictive survival probabilities for the 4 benchmark patients, (a)-(d), from each hospital in the study and from a new random hospital (the 18th, dashed, credible interval).
erroneously classified as alive). As a measure of goodness of fit we also computed the Brier score, the average squared deviation between predicted probabilities and outcomes, which is equal to 0.04, showing a fairly good predictive fit of our model.
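The Brier score is straightforward to compute from the predictive probabilities and the observed outcomes (illustrative Python):

```python
def brier_score(pred_probs, y):
    """Brier score: the mean squared deviation between predicted survival
    probabilities and the observed 0/1 outcomes (lower is better)."""
    return sum((p - yi) ** 2 for p, yi in zip(pred_probs, y)) / len(y)
```

A perfect predictor scores 0, while always predicting 0.5 scores 0.25, so 0.04 indicates predictions close to the observed outcomes.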
The left panel of Figure 5 displays the posterior distributions of the Bayesian residuals, as in Equation (11), for each observation, where the red line in the plot denotes the prior marginal distribution (logistic). The right panel shows the same posterior distributions in a 3-dimensional perspective, each residual posterior referring to the posterior survival probability of the corresponding patient.
The picture shows that there are no outliers among the patients who survived, since their posterior residual densities and the prior residual density share the same cluster. More variability appears among the dead patients as far as posterior location and dispersion are concerned. This feature could be brought about by the disparity in the number of dead and alive cases in our data set. Moreover, most deaths occur in the class of more severe infarction and concern elderly people. This also explains the larger credibility intervals in Figure 4(d) (bottom right panel), which in fact refers to elderly patients with severe infarction.
[Figure 5: left panel shows the density (vertical axis, 0.0 to 0.3) of the residuals (horizontal axis, -10 to 5).]
Figure 5. Left panel: posterior distributions of the latent Bayesian residuals. The dashed and solid lines correspond to observations y_i = 0 (dead) and y_i = 1 (alive), respectively. The solid gray line is the marginal prior distribution (logistic). Right panel: posterior distributions of the latent Bayesian residuals against the expected posterior survival probabilities.
7. Conclusions
In this work we have considered a Bayesian hierarchical generalized linear model with random effects for the analysis of clinical and administrative data with a multilevel structure. These data arise from the MOMI² clinical registry, based on a survey on patients admitted with ST-elevation myocardial infarction diagnosis, integrated with administrative databanks. The analysis carried out on them could provide a decisional support to the cardiovascular health care governance. We adopted a Bayesian point of view to tackle the problem of modelling survival outcomes by means of relevant covariates, taking into account overdispersion induced by the grouping factor, i.e., the hospital each patient has been admitted to. To the best of our knowledge, this study is the first example of a Bayesian analysis of data arising from the linkage between Italian administrative databanks and clinical registries. The main aim of this paper was to study the effects of variations in health care utilization on patient outcomes, since the adopted model points out relationships between process and outcome measures. We also provided cluster-specific estimates of survival probabilities, adjusted for patients' characteristics, and derived estimates of covariates' effects, using MCMC simulation of posterior distributions of the parameters; moreover we discussed model selection and goodness of fit. We found out that Killip class first, and age, have a sharp negative effect on the survival probability, while the OB (onset to balloon) time has a lighter influence on it. The resulting variability among hospitals seems not too large, even if we underlined that 4 hospitals have a more extreme effect on the survival: in particular hospital 9 had a positive effect, while hospitals 10, 11 and 15 had a negative effect. As far as negative features of the MCMC outputs are concerned, we found that the marginal posterior distributions of (β_0, b_j), for each j, are concentrated on lines of the whole parameter space, due to the confounding between the intercept parameter and the random effects parameters. However the mixing and the convergence of the chain, under a suitable thinning, were completely satisfactory. Finally, as a further step in the analysis, we are considering Bayesian nonparametrics to model the hospital effects, in order to take advantage of the in-built clustering they provide.
Chilean Journal of Statistics
Vol. 3, No. 1, April 2012, 31-42
Nonparametric Statistics
Research Paper

On the wavelet estimation of a function in a density model with non-identically distributed observations

Christophe Chesneau¹ and Nargess Hosseinioun²
¹ LMNO, Université de Caen Basse-Normandie, Caen, France
² Department of Statistics, Payame Noor University, Iran
(Received: 10 March 2011 · Accepted in final form: 30 May 2011)

Abstract
A density model with possibly non-identically distributed random variables is considered. We aim to estimate a common function appearing in the densities. We construct a new linear wavelet estimator and study its performance for independent and dependent data (the ρ-mixing case is explored). Then, in the independent case, we develop a new adaptive hard thresholding wavelet estimator and prove that it attains a sharp rate of convergence.
Keywords: Biased observations · Dependent data · Rates of convergence · Wavelet basis.
Mathematics Subject Classification: Primary 62G07 · Secondary 62G20.
1. Introduction
We consider the following density model. Let (X
i
)
iZ
be a random process such that, for
any i Z, the density of X
i
is
g
i
(x) = w
i
(x)f(x), x R, (1)
where (w
i
(x))
iZ
is a known sequence of positive functions and f is an unknown positive
function. Let L > 0 and X
i
() = {x R; g
i
(x) = 0}. We suppose that X
i
() does not
depend on i, X
1
() [L, L], there exists a constant C
, (2)
Corresponding author. Christophe Chesneau. Department of Mathematics, LMNO, University of Caen, UFR de
Sciences, F-14032, Caen, France. Email: christophe.chesneau@gmail.com
ISSN: 0718-7912 (print)/ISSN: 0718-7920 (online)
c Chilean Statistical Society Sociedad Chilena de Estadstica
http://www.soche.cl/chjs
32 C. Chesneau and N. Hosseinioun
and there exists a sequence of real positive numbers (v
i
)
iZ
(which can depend on n) such
that
inf
xX1()
w
i
(x) v
i
. (3)
The goal is to estimate f globally when only n random variables X_1, . . . , X_n of (X_i)_{i∈Z} are observed. Such an estimation problem has been recently investigated by Aubin and Leoni-Aubin (2008a,b). It can be viewed as a generalization of the standard biased density model; see, e.g., Patil and Rao (1977), El Barmi and Simonoff (2000), Brunel et al. (2009) and Ramirez and Vidakovic (2010).
In this article, we investigate the estimation of f via the powerful tool of wavelet analysis. Wavelets are attractive for nonparametric density estimation because of their spatial adaptivity, computational efficiency and asymptotic optimality properties. They enjoy excellent mean integrated squared error (MISE) properties and can achieve fast rates of convergence over a wide range of function classes (including spatially inhomogeneous functions). Details on wavelet analysis in nonparametric function estimation can be found in Antoniadis (1997) and Härdle et al. (1998).
In the first part of this study, we develop a new linear wavelet estimator. We determine a sharp upper bound for the associated MISE for independent (X_i)_{i∈Z}. Then, we extend this result to possibly dependent (X_i)_{i∈Z} following the ρ-mixing case. In particular, we prove that the upper bound obtained in the independent case is not deteriorated by our dependence condition as soon as the ρ-mixing coefficients (ρ_m)_{m∈N} of (X_i)_{i∈Z} (defined in Section 3) satisfy Σ_{m=1}^{n} ρ_m ≤ C, where C > 0 denotes a constant independent of n. The second part of the study is devoted to the adaptive estimation of f for independent (X_i)_{i∈Z}. We construct a new hard thresholding wavelet estimator and prove that it attains a sharp upper bound, close to the one attained by the corresponding linear wavelet estimator. Let us mention that our results are proved under very mild assumptions on w_1(x), . . . , w_n(x).
Section 2 presents wavelets and the Besov balls. The linear wavelet estimation is developed in Section 3. Section 4 is devoted to our hard thresholding wavelet estimator. The proofs are postponed to Section 5.
2. Wavelets and Besov Balls
Let L > 0, N be a positive integer, and and be the Daubechies wavelets db2N (which
satisfy supp() = supp() = [1 N, N]). Set
j,k
(x) = 2
j/2
(2
j
x k),
j,k
(x) = 2
j/2
(2
j
x k),
and
j
= {k Z; 1N 2
j
xk N, x [L, L]} = {k Z; L2
j
+N1 k L2
j
N}.
Then, there exists an integer such that, for any integer , the collection
B = {
,k
(.), k
;
j,k
(.); j N {0, . . . , 1}, k
j
}
is an orthonormal basis of L
2
([L, L]) = {h : [L, L] R;
_
L
L
h
2
(x)dx < }. For more
details about wavelet basis, see Meyer (1992) and Cohen et al. (1993).
For any integer ℓ ≥ τ, any h ∈ L²([−L, L]) can be expanded on B as

  h(x) = Σ_{k∈Λ_ℓ} α_{ℓ,k} φ_{ℓ,k}(x) + Σ_{j=ℓ}^{∞} Σ_{k∈Λ_j} β_{j,k} ψ_{j,k}(x),

where α_{j,k} and β_{j,k} are the wavelet coefficients of h defined by

  α_{j,k} = ∫_{−L}^{L} h(x) φ_{j,k}(x) dx,  β_{j,k} = ∫_{−L}^{L} h(x) ψ_{j,k}(x) dx.  (4)
Let M > 0, s > 0, p ≥ 1, r ≥ 1 and L^p([−L, L]) = {h : [−L, L] → R; ∫_{−L}^{L} |h(x)|^p dx < ∞}. Set, for every measurable function h on [−L, L] and ε ≥ 0, Δ_ε(h)(x) = h(x + ε) − h(x), and let Δ_ε^N denote the N-fold iterate of Δ_ε. Let

  ω_N(t, h, p) = sup_{ε∈[−t,t]} ( ∫_{−L}^{L} |Δ_ε^N(h)(u)|^p du )^{1/p}.

Then, for s ∈ (0, N), we define the Besov ball B^s_{p,r}(M) by

  B^s_{p,r}(M) = { h ∈ L^p([−L, L]); ( ∫_0^∞ ( ω_N(t, h, p) / t^s )^r dt/t )^{1/r} ≤ M }.

We have the following equivalence: h ∈ B^s_{p,r}(M) if and only if there exists a constant M_* > 0 (depending on M) such that the associated wavelet coefficients given in Equation (4) satisfy

  2^{τ(1/2 − 1/p)} ( Σ_{k∈Λ_τ} |α_{τ,k}|^p )^{1/p} + ( Σ_{j=τ}^{∞} [ 2^{j(s + 1/2 − 1/p)} ( Σ_{k∈Λ_j} |β_{j,k}|^p )^{1/p} ]^r )^{1/r} ≤ M_*.  (5)

In Equation (5), s is a smoothness parameter and p and r are norm parameters. The Besov balls capture a wide variety of smoothness features in a function; see, e.g., Meyer (1992).
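To make the orthonormality underlying the expansion in Equation (4) concrete, the sketch below numerically checks that the dilated and translated family ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k) is orthonormal. It uses the Haar wavelet (db1, which has a closed form) instead of the db2N wavelets assumed above, and the interval [0, 1] instead of [−L, L]; both substitutions are for simplicity of illustration only.

```python
import numpy as np

def psi(t):
    """Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def psi_jk(t, j, k):
    """Dilated/translated wavelet psi_{j,k}(t) = 2^{j/2} psi(2^j t - k)."""
    return 2.0 ** (j / 2.0) * psi(2.0 ** j * np.asarray(t, dtype=float) - k)

# Riemann-sum inner products on a fine dyadic grid of [0, 1).
t = np.linspace(0.0, 1.0, 2 ** 16, endpoint=False)
dt = 1.0 / 2 ** 16

def inner(j1, k1, j2, k2):
    return float(np.sum(psi_jk(t, j1, k1) * psi_jk(t, j2, k2)) * dt)

print(round(inner(2, 1, 2, 1), 6))   # unit norm, ~1
print(round(inner(2, 1, 2, 2), 6))   # orthogonal translates, ~0
print(round(inner(2, 1, 3, 2), 6))   # orthogonality across scales, ~0
```

The same check applies verbatim to any compactly supported Daubechies wavelet once ψ is evaluated on a grid.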
3. Linear Wavelet Estimation
For any integer j ≥ τ and k ∈ Λ_j, we can estimate the unknown wavelet coefficient β_{j,k} = ∫_{−L}^{L} f(x) ψ_{j,k}(x) dx by a standard empirical one given by

  β̃_{j,k} = (1/n) Σ_{i=1}^{n} ψ_{j,k}(X_i) / w_i(X_i).  (6)

However, in this study, we consider

  β̂_{j,k} = (1/z_n) Σ_{i=1}^{n} v_i ψ_{j,k}(X_i) / w_i(X_i),  z_n = Σ_{i=1}^{n} v_i.  (7)

Our choice is motivated by the following upper bound results.
Proposition 3.1 Suppose that (X_i)_{i∈Z} are independent. For any integer j ≥ τ and k ∈ Λ_j, let β_{j,k} = ∫_{−L}^{L} f(x) ψ_{j,k}(x) dx, β̃_{j,k} be as in Equation (6) and β̂_{j,k} be as in Equation (7). Then, β̃_{j,k} and β̂_{j,k} are unbiased estimators of β_{j,k} and there exists a constant C > 0 such that

  E[(β̃_{j,k} − β_{j,k})²] ≤ C (1/n²) Σ_{i=1}^{n} 1/v_i,  E[(β̂_{j,k} − β_{j,k})²] ≤ C (1/z_n).

These bounds are as sharp as possible and we have 1/z_n ≤ (1/n²) Σ_{i=1}^{n} 1/v_i.
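The last inequality of Proposition 3.1, 1/z_n ≤ (1/n²) Σ_i 1/v_i, follows from the Cauchy-Schwarz (Hölder) inequality and holds for any positive weights; a quick numerical verification, with randomly generated v_i that are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# The claim 1/z_n <= (1/n^2) sum_i 1/v_i should hold for any positive
# sequence v_i, with equality exactly when all v_i are equal.
for trial in range(100):
    n = int(rng.integers(2, 50))
    v = rng.uniform(0.1, 10.0, size=n)
    z_n = v.sum()
    lhs = 1.0 / z_n
    rhs = (1.0 / n ** 2) * np.sum(1.0 / v)
    assert lhs <= rhs + 1e-12

# Equality when the weights are constant:
v = np.full(10, 3.0)
print(np.isclose(1.0 / v.sum(), (1.0 / 100) * np.sum(1.0 / v)))
```

This is exactly the comparison between the variance bounds of the two estimators: the weighted estimator of Equation (7) is never worse than the unweighted one of Equation (6).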
We define the linear wavelet estimator f̂_lin by

  f̂_lin(x) = Σ_{k∈Λ_{j_0}} α̂_{j_0,k} φ_{j_0,k}(x),  (8)

where α̂_{j_0,k} is defined by Equation (7) (with φ_{j_0,k} in place of ψ_{j,k}) and j_0 is an integer which is chosen later.
Naturally, taking w_1(x) = · · · = w_n(x) = 1, Equation (1) becomes the standard density model and f̂_lin the standard linear wavelet estimator for this problem; see Härdle et al. (1998, Subsection 10.2). For a survey on wavelet linear estimators for various density models, we refer to Chaubey et al. (2011).
Theorem 3.2 Suppose that (X_i)_{i∈Z} are independent and lim_{n→∞} z_n = ∞. Suppose that f ∈ B^s_{p,r}(M), with s ∈ (0, N), p ≥ 2 and r ≥ 1. Let f̂_lin be as in Equation (8) with the integer j_0 satisfying (1/2) z_n^{1/(2s+1)} ≤ 2^{j_0} ≤ z_n^{1/(2s+1)}. Then, there exists a constant C > 0 such that

  E[ ∫_{−L}^{L} ( f̂_lin(x) − f(x) )² dx ] ≤ C z_n^{−2s/(2s+1)}.

Note that, when w_1(x) = · · · = w_n(x) = 1, we have z_n = n, and n^{−2s/(2s+1)} is the optimal rate of convergence (in the minimax sense) for the standard density estimation problem; see Härdle et al. (1998, Theorem 10.1).
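In the standard case w_1(x) = · · · = w_n(x) = 1, the linear estimator of Equation (8) is simple to sketch. The code below uses the Haar scaling function (db1) rather than the db2N wavelets of the paper so that φ_{j,k} has a closed form; with Haar, the estimator reduces to a histogram with bins of width 2^{−j_0}. The uniform sample is illustrative.

```python
import numpy as np

def phi_jk(t, j, k):
    """Haar scaling function phi = 1 on [0,1): phi_{j,k}(t) = 2^{j/2} phi(2^j t - k)."""
    u = 2.0 ** j * np.asarray(t, dtype=float) - k
    return 2.0 ** (j / 2.0) * ((u >= 0) & (u < 1)).astype(float)

def f_lin(t, X, j0):
    """Linear wavelet density estimator: sum_k alpha_hat_{j0,k} phi_{j0,k}(t),
    with empirical coefficients alpha_hat_{j0,k} = (1/n) sum_i phi_{j0,k}(X_i)."""
    ks = np.arange(int(np.floor(2.0 ** j0 * X.min())),
                   int(np.floor(2.0 ** j0 * X.max())) + 1)
    est = np.zeros(len(t))
    for k in ks:
        alpha_hat = phi_jk(X, j0, k).mean()
        est += alpha_hat * phi_jk(t, j0, k)
    return est

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=5000)    # true density f = 1/2 on [-1, 1]
t = np.linspace(-0.99, 0.99, 400)
fhat = f_lin(t, X, j0=3)
print(round(float(fhat.mean()), 2))      # should be close to 0.5
```

With a smoother wavelet (db2N) the same empirical-coefficient construction applies, only the evaluation of φ_{j,k} changes.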
Let us now explore the performance of f̂_lin for a class of dependent (X_i)_{i∈Z}.
Definition 3.3 Let (X_i)_{i∈Z} be a random process. For any u ∈ Z, let F^X_{−∞,u} be the σ-algebra generated by . . . , X_{u−1}, X_u, and F^X_{u,∞} be the σ-algebra generated by X_u, X_{u+1}, . . . For any m ∈ Z, we define the m-th maximal correlation coefficient of (X_i)_{i∈Z} by

  ρ_m = sup_{ℓ∈Z} sup_{(U,V)∈L²(F^X_{−∞,ℓ})×L²(F^X_{m+ℓ,∞})} |C(U, V)| / √(V[U] V[V]),

where, for any A ∈ {F^X_{−∞,ℓ}, F^X_{m+ℓ,∞}}, L²(A) = {U A-measurable; E[U²] < ∞} and C(·, ·) denotes the covariance function. Then, we say that (X_i)_{i∈Z} is ρ-mixing if and only if lim_{m→∞} ρ_m = 0.
Further details on ρ-mixing dependence can be found in, e.g., Kolmogorov and Rozanov (1960), Shao (1995) and Zhengyan and Lu (1996).
Results on wavelet estimation of a density in the ρ-mixing case can be found in Leblanc (1996) and Hosseinioun et al. (2012).
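As a concrete example, for a stationary Gaussian AR(1) process X_i = a X_{i−1} + ε_i with |a| < 1, the ρ-mixing coefficients are known to decay geometrically in m (a consequence of the Gaussian theory of Kolmogorov and Rozanov, 1960). Taking ρ_m = a^m, the partial sums Σ_{m=1}^{n} ρ_m stay bounded by the geometric-series constant a/(1 − a), which is exactly the standard summability condition Σ_m ρ_m ≤ C used below. A quick numerical check:

```python
import numpy as np

a = 0.7                                    # AR(1) coefficient, |a| < 1
C = a / (1.0 - a)                          # geometric-series bound on sum of rho_m
for n in (10, 100, 10_000):
    rho = a ** np.arange(1, n + 1)         # rho_m = a^m, m = 1, ..., n
    s = float(rho.sum())
    assert s <= C + 1e-12                  # sum_{m=1}^n rho_m <= C for every n
    print(n, round(s, 4))
```

Processes whose mixing coefficients decay more slowly can still satisfy the weaker growth condition of the next theorem, with the partial sums allowed to grow like n^θ (log n)^λ.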
Theorem 3.4 Suppose that (X_i)_{i∈Z} is ρ-mixing and there exist three constants υ > 0, θ ∈ [0, 1) and λ ≥ 0 such that

  lim_{n→∞} (1 / (n^θ [log(n)]^λ)) Σ_{m=1}^{n} ρ_m = υ,  lim_{n→∞} z_n / (n^θ [log(n)]^λ) = ∞.  (9)

Suppose that f ∈ B^s_{p,r}(M), with s ∈ (0, N), p ≥ 2 and r ≥ 1. Let f̂_lin be as in Equation (8), with the integer j_0 satisfying

  (1/2) ( z_n / (n^θ [log(n)]^λ) )^{1/(2s+1)} ≤ 2^{j_0} ≤ ( z_n / (n^θ [log(n)]^λ) )^{1/(2s+1)}.

Then, there exists a constant C > 0 such that

  E[ ∫_{−L}^{L} ( f̂_lin(x) − f(x) )² dx ] ≤ C ( z_n / (n^θ [log(n)]^λ) )^{−2s/(2s+1)}.

The main role of the parameters θ and λ in Equation (9) is to measure the influence of the ρ-mixing dependence of (X_i)_{i∈Z}, when lim_{n→∞} Σ_{m=1}^{n} ρ_m = ∞, on the performance of f̂_lin. The first assumption in Equation (9) can be viewed as a generalization of the standard one, i.e., Σ_{m=1}^{∞} ρ_m ≤ C, which corresponds to θ = λ = 0; see, e.g., Leblanc (1996, Assumption M1). Observe that, if θ = λ = 0, Theorem 3.4 extends the result of Theorem 3.2; the ρ-mixing dependence of (X_i)_{i∈Z} does not deteriorate the rate of convergence z_n^{−2s/(2s+1)}.
The main drawback of f̂_lin is that it is not adaptive: it depends on the smoothness parameter s in its construction. The adaptive estimation of f for independent (X_i)_{i∈Z} is explored in the next section.
4. On the Adaptive Estimation of f in the Independent Case
Suppose that (X_i)_{i∈Z} are independent. We define the hard thresholding estimator f̂_hard by

  f̂_hard(x) = Σ_{k∈Λ_τ} α̂_{τ,k} φ_{τ,k}(x) + Σ_{j=τ}^{j_1} Σ_{k∈Λ_j} β̂_{j,k} 1I{ |β̂_{j,k}| ≥ κ √(log(z_n)/z_n) } ψ_{j,k}(x),  (10)

where α̂_{τ,k} is defined by Equation (7),

  β̂_{j,k} = (1/z_n) Σ_{i=1}^{n} v_i (ψ_{j,k}(X_i)/w_i(X_i)) 1I{ |v_i ψ_{j,k}(X_i)/w_i(X_i)| ≤ √(z_n/log(z_n)) },  (11)

for any random event A, 1I_A is the indicator function of A, j_1 is the integer satisfying (1/2) z_n < 2^{j_1} ≤ z_n, and κ = 8/3 + 2 + 2√(16/9 + 4).
The originality of f̂_hard lies in the definition given in Equation (11). We do not estimate the unknown mother wavelet coefficient by the standard empirical estimator; we consider a thresholded version of it. This thresholding, combined with a suitable calibration of the parameters, allows us to obtain powerful MISE properties under very mild assumptions on w_1(x), . . . , w_n(x). Such a technique was first introduced in a hard thresholding wavelet procedure by Delyon and Juditsky (1996) for nonparametric regression. Another application of this technique can be found in Chesneau (2011).
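The effect of hard thresholding is easy to visualize outside the density model. The self-contained sketch below denoises a piecewise-constant signal with a hand-rolled orthonormal discrete Haar transform, keeping a detail coefficient only when it exceeds the universal threshold σ√(2 log n). This is the classical white-noise regression analogue of the keep-or-kill rule in Equation (10), not the density estimator itself, and all numbers are illustrative.

```python
import numpy as np

def haar(x):
    """Full orthonormal discrete Haar transform of x (length a power of 2)."""
    x = np.asarray(x, dtype=float)
    out = []
    while len(x) > 1:
        s = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # smooth part
        d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail coefficients
        out.append(d)
        x = s
    out.append(x)                                 # coarsest scaling coefficient
    return out

def ihaar(coeffs):
    """Inverse of haar()."""
    x = np.asarray(coeffs[-1], dtype=float)
    for d in reversed(coeffs[:-1]):
        y = np.empty(2 * len(x))
        y[0::2] = (x + d) / np.sqrt(2.0)
        y[1::2] = (x - d) / np.sqrt(2.0)
        x = y
    return x

rng = np.random.default_rng(2)
n = 1024
t = np.arange(n) / n
signal = np.where(t < 0.5, 1.0, -0.5)             # piecewise-constant signal
noisy = signal + 0.3 * rng.standard_normal(n)

coeffs = haar(noisy)
thr = 0.3 * np.sqrt(2.0 * np.log(n))              # universal threshold, sigma = 0.3
den = [np.where(np.abs(c) >= thr, c, 0.0) for c in coeffs[:-1]] + [coeffs[-1]]
denoised = ihaar(den)

print(round(float(np.mean((noisy - signal) ** 2)), 4),
      round(float(np.mean((denoised - signal) ** 2)), 4))
```

Because the signal is sparse in the Haar basis, killing the small (mostly pure-noise) coefficients reduces the mean squared error substantially.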
Theorem 4.1 Suppose that (X_i)_{i∈Z} are independent and lim_{n→∞} z_n = ∞. Let f̂_hard be as in Equation (10). Suppose that f ∈ B^s_{p,r}(M) with r ≥ 1, {p ≥ 2 and s ∈ (0, N)} or {p ∈ [1, 2) and s ∈ (1/p, N)}. Then, for a large enough n, there exists a constant C > 0 such that

  E[ ∫_{−L}^{L} ( f̂_hard(x) − f(x) )² dx ] ≤ C ( log(z_n) / z_n )^{2s/(2s+1)}.

Theorem 4.1 shows that f̂_hard attains a rate of convergence close to the one attained by f̂_lin. The only difference is the negligible logarithmic factor (log(z_n))^{2s/(2s+1)}. Let us mention that the proof of Theorem 4.1 is based on Chesneau (2011, Theorem 2).
5. Proofs
In this section, C denotes any constant that does not depend on j, k and n. Its value may change from one term to another and may depend on φ or ψ.
Proof [Proposition 3.1] We have

  E[β̃_{j,k}] = (1/n) Σ_{i=1}^{n} ∫_{−L}^{L} (ψ_{j,k}(x)/w_i(x)) g_i(x) dx = ∫_{−L}^{L} ψ_{j,k}(x) f(x) dx = β_{j,k}.  (12)
Using Equation (12), the independence of X_1, . . . , X_n, Equations (2) and (3), and ∫_{−L}^{L} (ψ_{j,k}(x))² dx = 1, we obtain

  E[(β̃_{j,k} − β_{j,k})²] = V[β̃_{j,k}] = (1/n²) Σ_{i=1}^{n} V[ψ_{j,k}(X_i)/w_i(X_i)]
    ≤ (1/n²) Σ_{i=1}^{n} E[(ψ_{j,k}(X_i)/w_i(X_i))²] = (1/n²) Σ_{i=1}^{n} ∫_{−L}^{L} (ψ_{j,k}(x)/w_i(x))² g_i(x) dx
    = (1/n²) Σ_{i=1}^{n} ∫_{−L}^{L} (ψ_{j,k}(x))² (f(x)/w_i(x)) dx ≤ C (1/n²) Σ_{i=1}^{n} 1/v_i.
We have

  E[β̂_{j,k}] = (1/z_n) Σ_{i=1}^{n} v_i ∫_{−L}^{L} (ψ_{j,k}(x)/w_i(x)) g_i(x) dx = (1/z_n) z_n ∫_{−L}^{L} ψ_{j,k}(x) f(x) dx = β_{j,k}.  (13)
Using Equation (13), again the independence of X_1, . . . , X_n, Equations (2) and (3), and ∫_{−L}^{L} (ψ_{j,k}(x))² dx = 1, we obtain

  E[(β̂_{j,k} − β_{j,k})²] = V[β̂_{j,k}] = (1/z_n²) Σ_{i=1}^{n} v_i² V[ψ_{j,k}(X_i)/w_i(X_i)]
    ≤ (1/z_n²) Σ_{i=1}^{n} v_i² E[(ψ_{j,k}(X_i)/w_i(X_i))²] = (1/z_n²) Σ_{i=1}^{n} v_i² ∫_{−L}^{L} (ψ_{j,k}(x)/w_i(x))² g_i(x) dx
    = (1/z_n²) Σ_{i=1}^{n} v_i² ∫_{−L}^{L} (ψ_{j,k}(x))² (f(x)/w_i(x)) dx ≤ C (1/z_n²) z_n = C (1/z_n).  (14)
The Hölder inequality yields

  n = Σ_{i=1}^{n} √v_i (1/√v_i) ≤ z_n^{1/2} ( Σ_{i=1}^{n} 1/v_i )^{1/2}.

Therefore

  1/z_n ≤ (1/n²) Σ_{i=1}^{n} 1/v_i.

The proof of Proposition 3.1 is complete.
Proof [Theorem 3.2] We expand the function f on B as

  f(x) = Σ_{k∈Λ_{j_0}} α_{j_0,k} φ_{j_0,k}(x) + Σ_{j=j_0}^{∞} Σ_{k∈Λ_j} β_{j,k} ψ_{j,k}(x),

where α_{j_0,k} = ∫_{−L}^{L} f(x) φ_{j_0,k}(x) dx and β_{j,k} = ∫_{−L}^{L} f(x) ψ_{j,k}(x) dx.
Using the fact that B is an orthonormal basis of L²([−L, L]), Proposition 3.1 and, since p ≥ 2, B^s_{p,r}(M) ⊆ B^s_{2,∞}(M), we have

  E[ ∫_{−L}^{L} ( f̂_lin(x) − f(x) )² dx ] = Σ_{k∈Λ_{j_0}} E[(α̂_{j_0,k} − α_{j_0,k})²] + Σ_{j=j_0}^{∞} Σ_{k∈Λ_j} β_{j,k}²
    ≤ C ( 2^{j_0} (1/z_n) + 2^{−2 j_0 s} ) ≤ C z_n^{−2s/(2s+1)}.

Theorem 3.2 is proved.
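The choice of j_0 in Theorem 3.2 balances the variance term 2^{j_0}/z_n against the squared-bias term 2^{−2 j_0 s}; a quick numerical confirmation (with an illustrative smoothness value s) that the resulting bound indeed behaves like z_n^{−2s/(2s+1)}:

```python
import numpy as np

s = 1.5                                   # smoothness parameter (illustrative)
for z_n in (1e3, 1e6, 1e9):
    j0 = int(np.floor(np.log2(z_n) / (2 * s + 1)))     # 2^j0 ~ z_n^{1/(2s+1)}
    bound = 2.0 ** j0 / z_n + 2.0 ** (-2 * j0 * s)     # variance + squared bias
    rate = z_n ** (-2 * s / (2 * s + 1))
    print(f"{bound / rate:.2f}")          # the ratio stays bounded as z_n grows
```

The ratio bound/rate fluctuates (because j_0 is an integer) but stays bounded, which is all the theorem claims up to the constant C.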
Proof [Theorem 3.4] First of all, let us prove the existence of a constant C > 0 such that

  E[(α̂_{j_0,k} − α_{j_0,k})²] ≤ C n^θ [log(n)]^λ (1/z_n).
Since α̂_{j_0,k} is an unbiased estimator of α_{j_0,k}, we have

  E[(α̂_{j_0,k} − α_{j_0,k})²] = V[α̂_{j_0,k}]
    = (1/z_n²) Σ_{i=1}^{n} Σ_{ℓ=1}^{n} v_i v_ℓ C( φ_{j_0,k}(X_i)/w_i(X_i), φ_{j_0,k}(X_ℓ)/w_ℓ(X_ℓ) )
    = (1/z_n²) Σ_{i=1}^{n} v_i² V[ φ_{j_0,k}(X_i)/w_i(X_i) ] + (1/z_n²) Σ_{i=1}^{n} Σ_{ℓ=1, ℓ≠i}^{n} v_i v_ℓ C( φ_{j_0,k}(X_i)/w_i(X_i), φ_{j_0,k}(X_ℓ)/w_ℓ(X_ℓ) ).  (15)
It follows from Equation (14) that

  (1/z_n²) Σ_{i=1}^{n} v_i² V[ φ_{j_0,k}(X_i)/w_i(X_i) ] ≤ C (1/z_n).

In order to bound the second term in Equation (15), we use the following result on ρ-mixing. The proof can be found in Doukhan (1994, Section 1.2.2).
Lemma 5.1 Let (X_i)_{i∈Z} be a ρ-mixing sequence. Then, for any (i, ℓ) ∈ Z² such that i ≠ ℓ and any functions g and h, we have

  |C(h(X_i), g(X_ℓ))| ≤ ρ_{|i−ℓ|} √( E[(h(X_i))²] E[(g(X_ℓ))²] ),

whenever these quantities exist.
Using Lemma 5.1, we obtain

  Σ_{i=1}^{n} Σ_{ℓ=1, ℓ≠i}^{n} v_i v_ℓ C( φ_{j_0,k}(X_i)/w_i(X_i), φ_{j_0,k}(X_ℓ)/w_ℓ(X_ℓ) )
    ≤ Σ_{i=1}^{n} Σ_{ℓ=1, ℓ≠i}^{n} v_i v_ℓ ρ_{|i−ℓ|} √( E[(φ_{j_0,k}(X_i)/w_i(X_i))²] E[(φ_{j_0,k}(X_ℓ)/w_ℓ(X_ℓ))²] ).  (16)

By Equations (2), (3) and ∫_{−L}^{L} (φ_{j_0,k}(x))² dx = 1, we have

  E[(φ_{j_0,k}(X_i)/w_i(X_i))²] = ∫_{−L}^{L} (φ_{j_0,k}(x)/w_i(x))² g_i(x) dx = ∫_{−L}^{L} (φ_{j_0,k}(x))² (f(x)/w_i(x)) dx ≤ C (1/v_i).
Therefore, since √(v_i v_ℓ) ≤ (v_i + v_ℓ)/2,

  Σ_{i=1}^{n} Σ_{ℓ=1, ℓ≠i}^{n} v_i v_ℓ C( φ_{j_0,k}(X_i)/w_i(X_i), φ_{j_0,k}(X_ℓ)/w_ℓ(X_ℓ) )
    ≤ C Σ_{i=1}^{n} Σ_{ℓ=1, ℓ≠i}^{n} √(v_i v_ℓ) ρ_{|i−ℓ|} ≤ C Σ_{i=2}^{n} Σ_{ℓ=1}^{i−1} (v_i + v_ℓ) ρ_{i−ℓ}
    = C Σ_{i=2}^{n} Σ_{u=1}^{i−1} (v_i + v_{i−u}) ρ_u = C ( Σ_{i=2}^{n} v_i Σ_{u=1}^{i−1} ρ_u + Σ_{i=2}^{n} Σ_{u=1}^{i−1} v_{i−u} ρ_u ).
Using Equation (9), we obtain

  Σ_{i=2}^{n} v_i Σ_{u=1}^{i−1} ρ_u ≤ z_n Σ_{u=1}^{n} ρ_u ≤ C n^θ [log(n)]^λ z_n,

and

  Σ_{i=2}^{n} Σ_{u=1}^{i−1} v_{i−u} ρ_u = Σ_{u=1}^{n−1} ρ_u Σ_{i=u+1}^{n} v_{i−u} ≤ z_n Σ_{u=1}^{n} ρ_u ≤ C n^θ [log(n)]^λ z_n.

Hence

  Σ_{i=1}^{n} Σ_{ℓ=1, ℓ≠i}^{n} v_i v_ℓ C( φ_{j_0,k}(X_i)/w_i(X_i), φ_{j_0,k}(X_ℓ)/w_ℓ(X_ℓ) ) ≤ C n^θ [log(n)]^λ z_n.  (17)
Putting Equations (15), (16) and (17) together, we obtain

  E[(α̂_{j_0,k} − α_{j_0,k})²] ≤ C ( 1/z_n + n^θ [log(n)]^λ / z_n ) ≤ C n^θ [log(n)]^λ / z_n.
Then we proceed as in Theorem 3.2. We expand the function f on B as

  f(x) = Σ_{k∈Λ_{j_0}} α_{j_0,k} φ_{j_0,k}(x) + Σ_{j=j_0}^{∞} Σ_{k∈Λ_j} β_{j,k} ψ_{j,k}(x),

where α_{j_0,k} = ∫_{−L}^{L} f(x) φ_{j_0,k}(x) dx and β_{j,k} = ∫_{−L}^{L} f(x) ψ_{j,k}(x) dx. Using the fact that B is an orthonormal basis of L²([−L, L]) and, since p ≥ 2, B^s_{p,r}(M) ⊆ B^s_{2,∞}(M), we obtain

  E[ ∫_{−L}^{L} ( f̂_lin(x) − f(x) )² dx ] = Σ_{k∈Λ_{j_0}} E[(α̂_{j_0,k} − α_{j_0,k})²] + Σ_{j=j_0}^{∞} Σ_{k∈Λ_j} β_{j,k}²
    ≤ C ( 2^{j_0} n^θ [log(n)]^λ / z_n + 2^{−2 j_0 s} ) ≤ C ( z_n / (n^θ [log(n)]^λ) )^{−2s/(2s+1)}.

The proof of Theorem 3.4 is complete.
Proof [Theorem 4.1] The result is proven using the following general result. It is a reformulation of the result given in Chesneau (2011, Theorem 2).
Theorem 5.2 (Chesneau, 2011). Let L > 0. We want to estimate an unknown function f with support in [−L, L] from n independent random variables U_1, . . . , U_n. We consider the wavelet basis B and the notations of Section 3. Suppose that there exist n functions h_1, . . . , h_n such that, for any γ ∈ {φ, ψ},
(A1) for any integer j ≥ τ and any k ∈ Λ_j,

  E[ (1/n) Σ_{i=1}^{n} h_i(γ_{j,k}, U_i) ] = ∫_{−L}^{L} f(x) γ_{j,k}(x) dx.

(A2) There exist a sequence of positive real numbers (μ_n)_{n∈N} satisfying lim_{n→∞} μ_n = ∞ and two constants, θ_γ > 0 and δ ≥ 0, such that, for any integer j ≥ τ and any k ∈ Λ_j,

  (1/n²) Σ_{i=1}^{n} E[ (h_i(γ_{j,k}, U_i))² ] ≤ θ_γ² 2^{2δj} (1/μ_n).

We define the hard thresholding estimator f̂_H by

  f̂_H(x) = Σ_{k∈Λ_τ} α̂_{τ,k} φ_{τ,k}(x) + Σ_{j=τ}^{j_1} Σ_{k∈Λ_j} β̂_{j,k} 1I{ |β̂_{j,k}| ≥ κ λ_{j,n} } ψ_{j,k}(x),

where

  α̂_{j,k} = (1/n) Σ_{i=1}^{n} h_i(φ_{j,k}, U_i),  β̂_{j,k} = (1/n) Σ_{i=1}^{n} h_i(ψ_{j,k}, U_i) 1I{ |h_i(ψ_{j,k}, U_i)| ≤ ς_{j,n} },

for any random event A, 1I_A is the indicator function of A,

  ς_{j,n} = 2^{δj} √( μ_n / log(μ_n) ),  λ_{j,n} = 2^{δj} √( log(μ_n) / μ_n ),

κ = 8/3 + 2 + 2√(16/9 + 4), and j_1 is the integer satisfying (1/2) μ_n^{1/(2δ+1)} < 2^{j_1} ≤ μ_n^{1/(2δ+1)}.
Let r ≥ 1, {p ≥ 2 and s ∈ (0, N)} or {p ∈ [1, 2) and s ∈ ((2δ+1)/p, N)}. Suppose that f ∈ B^s_{p,r}(M). Then, there exists a constant C > 0 such that

  E[ ∫_{−L}^{L} ( f̂_H(x) − f(x) )² dx ] ≤ C ( log(μ_n) / μ_n )^{2s/(2s+2δ+1)}.
Let us now investigate the assumptions (A1) and (A2) of Theorem 5.2 with, for any i ∈ {1, . . . , n}, U_i = X_i, θ_γ = √C, δ = 0, μ_n = z_n and

  h_i(γ_{j,k}, y) = (n/z_n) v_i γ_{j,k}(y) / w_i(y).

On (A1). By Proposition 3.1, for any γ ∈ {φ, ψ}, we have

  E[ (1/n) Σ_{i=1}^{n} h_i(γ_{j,k}, X_i) ] = ∫_{−L}^{L} f(x) γ_{j,k}(x) dx.

On (A2). Using Equation (14), we have

  (1/n²) Σ_{i=1}^{n} E[ (h_i(γ_{j,k}, X_i))² ] ≤ C (1/z_n).

Let f ∈ B^s_{p,r}(M). It follows from Theorem 5.2 that the hard thresholding estimator given in Equation (10) satisfies, for any r ≥ 1, {p ≥ 2 and s ∈ (0, N)} or {p ∈ [1, 2) and s ∈ (1/p, N)},

  E[ ∫_{−L}^{L} ( f̂_hard(x) − f(x) )² dx ] ≤ C ( log(z_n) / z_n )^{2s/(2s+1)}.

The proof of Theorem 4.1 is complete.
Acknowledgments
The authors would like to thank three anonymous referees whose constructive comments
and remarks have considerably improved the paper.
References
Antoniadis, A., 1997. Wavelets in statistics: a review (with discussion). Journal of the Italian Statistical Society Series B, 6, 97-144.
Aubin, J.-B., Leoni-Aubin, S., 2008a. Adaptive projection density estimation under a m-sample semiparametric model. Annales de l'I.S.U.P., Numéro spécial, Volume 52, Fasc. 1-2, pp. 139-156.
Aubin, J.-B., Leoni-Aubin, S., 2008b. Projection density estimation under a m-sample semiparametric model. Computational Statistics and Data Analysis, 5, 2451-2468.
Brunel, E., Comte, F., Guilloux, A., 2009. Nonparametric density estimation in presence of bias and censoring. TEST, 1, 166-194.
Chaubey, Y.P., Chesneau, C., Doosti, H., 2011. Wavelet linear density estimation: a review. Journal of the Indian Society of Agricultural Statistics, 65, 169-179.
Chesneau, C., 2011. Adaptive wavelet estimator for a function and its derivatives in an indirect convolution model. Journal of Statistical Theory and Practice, 2, 303-326.
Cohen, A., Daubechies, I., Jawerth, B., Vial, P., 1993. Wavelets on the interval and fast wavelet transforms. Applied and Computational Harmonic Analysis, 1, 54-81.
Delyon, B., Juditsky, A., 1996. On minimax wavelet estimators. Applied and Computational Harmonic Analysis, 3, 215-228.
Doukhan, P., 1994. Mixing: Properties and Examples. Lecture Notes in Statistics, Volume 85. Springer, New York.
El Barmi, H., Simonoff, J.S., 2000. Transformation-based estimation for weighted distributions. Journal of Nonparametric Statistics, 12, 861-878.
Härdle, W., Kerkyacharian, G., Picard, D., Tsybakov, A., 1998. Wavelets, Approximation and Statistical Applications. Lecture Notes in Statistics. Springer-Verlag, New York.
Hosseinioun, N., Doosti, H., Nirumand, H.A., 2012. Nonparametric estimation of the derivatives of a density by the method of wavelet for mixing sequences. Statistical Papers, 53, 195-203.
Kolmogorov, A.N., Rozanov, Yu.A., 1960. On strong mixing conditions for stationary Gaussian processes. Theory of Probability and its Applications, 5, 204-208.
Leblanc, F., 1996. Wavelet linear density estimator for a discrete time stochastic process: L_p-losses. Statistics and Probability Letters, 27, 71-84.
Meyer, Y., 1992. Wavelets and Operators. Cambridge University Press, Cambridge.
Patil, G.P., Rao, C.R., 1977. The weighted distributions: a survey of their applications. In Krishnaiah, P.R., (ed.). Applications of Statistics. North-Holland, Amsterdam, pp. 383-405.
Ramirez, P., Vidakovic, B., 2010. Wavelet density estimation for stratified size-biased sample. Journal of Statistical Planning and Inference, 2, 419-432.
Shao, Q.-M., 1995. Maximal inequality for partial sums of ρ-mixing sequences. Annals of Probability, 23, 948-965.
Zhengyan, L., Lu, C., 1996. Limit Theory for Mixing Dependent Random Variables. Kluwer, Dordrecht.
Chilean Journal of Statistics
Vol. 3, No. 1, April 2012, 43-56
Time Series
Research Paper

On the singular values of the Hankel matrix with application in singular spectrum analysis

Rahim Mahmoudvand¹ and Mohammad Zokaei¹
¹ Department of Statistics, Shahid Beheshti University, Tehran, Iran
(Received: 21 June 2011 · Accepted in final form: 24 August 2011)

Abstract
Hankel matrices are an important family of matrices that play a fundamental role in diverse fields of study, such as computer science, engineering, mathematics and statistics. In this paper, we study the behavior of the singular values of the Hankel matrix by changing its dimension. In addition, as an application, we use the obtained results for choosing the optimal values of the parameters of singular spectrum analysis, which is a powerful technique in time series analysis based on the Hankel matrix.
Keywords: Eigenvalues · Singular spectrum analysis.
Mathematics Subject Classification: Primary 15A18 · Secondary 37M10.
1. Introduction
A Hankel matrix can be finite or infinite and its (i, j) entry is a function of i + j; see Widom (1966). In other words, a matrix whose entries are the same along the anti-diagonals is called a Hankel matrix. Specifically, an L×K Hankel matrix H is a rectangular matrix of the form

  H = [ h_1    h_2      . . .  h_K
        h_2    h_3      . . .  h_{K+1}
        ...    ...      ...    ...
        h_L    h_{L+1}  . . .  h_N ],  (1)

where K = N − L + 1.
Corresponding author: Rahim Mahmoudvand. Department of Statistics, Shahid Beheshti University, P.O. Box 1983963113, Evin, Tehran, Iran. Email: r.mahmodvand@gmail.com
Hankel matrices play many roles in diverse areas of mathematics, such as approximation and interpolation theory, stability theory, system theory, theory of moments and theory of orthogonal polynomials, as well as in communication and control engineering, including filter design, identification, model reduction and broadband matching; for more details, see Peller (2003). Thus, this type of matrix has been subjected to intensive study with respect to its spectrum (collection of eigenvalues), and many interesting results have been derived. However, a closed form computation of the eigenvalues is not known and, consequently, the effect of changing the dimension of the matrix on its eigenvalues has not been investigated in detail.
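In code, the L×K Hankel matrix of Equation (1) is a two-liner (scipy.linalg.hankel offers the same construction); the sketch below also checks the defining anti-diagonal property, H[i, j] = h_{i+j+1} in 0-based indexing:

```python
import numpy as np

def hankel_matrix(h, L):
    """L x K Hankel matrix of the series h, where K = N - L + 1."""
    h = np.asarray(h, dtype=float)
    K = len(h) - L + 1
    return np.array([h[i:i + K] for i in range(L)])

h = np.arange(1.0, 11.0)        # h_1, ..., h_10, so N = 10
H = hankel_matrix(h, L=4)       # 4 x 7 matrix, K = 10 - 4 + 1 = 7
print(H.shape)                  # (4, 7)
print(H[0])                     # first row: h_1, ..., h_7

# Entries are constant along anti-diagonals: H[i, j] depends only on i + j.
assert all(H[i, j] == h[i + j] for i in range(4) for j in range(7))
```

The series values h are illustrative; any numeric sequence of length N works.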
In recent years, singular spectrum analysis (SSA), a relatively novel but powerful technique in time series analysis, has been developed and applied to many practical problems; see, e.g., Golyandina et al. (2001), Hassani et al. (2009), Hassani and Thomakos (2010) and references therein. SSA decomposes the original time series into a sum of a small number of interpretable components, such as a slowly varying trend, oscillatory components and noise. The basic SSA method consists of two complementary stages: decomposition and reconstruction; each stage includes two separate steps. At the first stage, we decompose the series; at the second stage, we reconstruct the noise-free series, which can then be used for forecasting new data points.
A short description of the SSA technique is given in the next section. For more explanations and a comparison with other time series analysis techniques, refer to Hassani (2007).
The whole procedure of the SSA technique depends upon two parameters:
(i) The window length, which is usually denoted by L.
(ii) The number of singular values needed for reconstruction, denoted by r.
An improper choice of the parameters L or r may yield incomplete reconstruction and misleading results in forecasting.
Considering a series of length N, Elsner and Tsonis (1996) provided some discussion and remarked that choosing L = N/4 is a common practice. Golyandina et al. (2001) recommended that L should be large enough, but not larger than N/2. Large values of L allow longer-period oscillations to be resolved, but choosing L too large leaves too few observations from which to estimate the covariance matrix of the L variables. It should be noted that variations in L may influence the separability feature of the SSA technique: the orthogonality and closeness of the singular values. There are some methods for selecting L. For example, the weighted correlation between the signal and noise components has been proposed in Golyandina et al. (2001) to determine a suitable value of L in terms of separability.
Although considerable effort and various techniques have been devoted to selecting a proper value of L, there is not enough algebraic and theoretical material for choosing L and r. The aim of this paper is to obtain some theoretical properties of the singular values of the Hankel matrix that can be used directly for choosing proper values of the two parameters of the SSA.
The outline of this paper is as follows. Section 2 describes the SSA technique and also shows the importance of the Hankel matrix for this technique. Section 3 provides the main results of the paper. Section 4 discusses some examples and an application of the obtained results. Section 5 sketches some conclusions of this work.
2. Singular Spectrum Analysis
In this section, we briefly introduce the stages of the SSA method and discuss the importance of the Hankel matrix in the development of this technique.
2.1 Stage I: decomposition
1st step: embedding. Embedding is a mapping that transfers a one-dimensional time series Y_N = (y_1, ..., y_N) into the multi-dimensional series X_1, ..., X_K with vectors X_i = (y_i, ..., y_{i+L-1})' ∈ R^L, where L (2 ≤ L ≤ N − 1) is the window length and K = N − L + 1. The result of this step is the trajectory matrix

X = (X_1, ..., X_K) = (x_{ij})_{i,j=1}^{L,K}.   (2)

Note that the matrix given in Equation (2) is a Hankel matrix as defined in Equation (1).
2nd step: singular value decomposition (SVD). In this step, we perform the SVD of X. Denote by λ_1 ≥ ... ≥ λ_L the eigenvalues of XX', taken in decreasing order of magnitude, and by U_1, ..., U_L the orthonormal system of the corresponding eigenvectors. The SVD of the trajectory matrix can then be written as X = X_1 + ... + X_d, where d = rank(X), the elementary matrices are X_i = √λ_i U_i V_i' and V_i = X'U_i/√λ_i (if λ_i = 0, we set X_i = 0).
2.2 Stage II: reconstruction
1st step: grouping. The grouping step corresponds to splitting the elementary matrices into several groups and summing the matrices within each group. Let I = {i_1, ..., i_p}, for p < L, be a group of indices i_1, ..., i_p. Then, the matrix X_I corresponding to the group I is defined as X_I = X_{i_1} + ... + X_{i_p}. The split of the set of indices {1, ..., L} into disjoint subsets I_1, ..., I_m corresponds to the representation X = X_{I_1} + ... + X_{I_m}. The procedure of choosing the sets I_1, ..., I_m is called the grouping. For a given group I, the contribution of the component X_I is measured by the share of the corresponding eigenvalues, Σ_{i∈I} λ_i / Σ_{i=1}^{d} λ_i, where d is the rank of X.
2nd step: diagonal averaging. The purpose of diagonal averaging is to transform a matrix Z to the form of a Hankel matrix HZ, which can subsequently be converted to a time series. If z_{ij} stands for an element of a matrix Z, then the kth term of the resulting series is obtained by averaging z_{ij} over all i, j such that i + j = k + 1. Hankelization is an optimal procedure in the sense that HZ is the Hankel matrix nearest to Z with respect to the matrix norm.
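The four steps above can be sketched in a few lines (assuming NumPy; the function name and the toy series are ours, not the paper's):

```python
import numpy as np

def ssa_reconstruct(y, L, r):
    """Basic SSA: embed, SVD, keep the r leading components, diagonally average."""
    N = len(y)
    K = N - L + 1
    # Stage I, step 1: trajectory (Hankel) matrix, Equation (2).
    X = np.column_stack([y[i:i + L] for i in range(K)])   # L x K
    # Stage I, step 2: SVD of X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Stage II, step 1: group the r leading elementary matrices.
    Xr = (U[:, :r] * s[:r]) @ Vt[:r, :]
    # Stage II, step 2: diagonal averaging back to a series.
    out = np.zeros(N)
    cnt = np.zeros(N)
    for i in range(L):
        for j in range(K):
            out[i + j] += Xr[i, j]
            cnt[i + j] += 1
    return out / cnt

# A noisy exponential trend: the r = 1 reconstruction should recover the trend.
t = np.arange(1, 101)
rng = np.random.default_rng(0)
y = np.exp(0.02 * t) + 0.01 * rng.standard_normal(100)
trend = ssa_reconstruct(y, L=50, r=1)
```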
3. Theoretical Results
Throughout the paper, the matrices considered are over the field of the real numbers. In addition, we consider different values of L, whereas N is supposed to be fixed. Recall that, for any operator A, the eigenvalues of AA' are real and non-negative. Denoting by λ_j^{L,N} the jth largest eigenvalue of HH', define

T_H^{L,N} = tr(HH') = Σ_{j=1}^{L} λ_j^{L,N}.   (3)
The behavior of T_H^{L,N}, given in Equation (3), with respect to different values of L is considered in the following theorem.
Theorem 3.1 Consider the Hankel matrix H as defined in Equation (1). Then,

T_H^{L,N} = Σ_{j=1}^{N} w_j^{L,N} h_j^2,

where w_j^{L,N} = min{min{L, K}, j, L + K − j} = w_j^{K,N}.
Proof Applying the definition of H as given in Equation (1), we have

T_H^{L,N} = Σ_{i=1}^{L} Σ_{j=i}^{N−L+i} h_j^2.   (4)

Changing the order of the summations in Equation (4), we get

T_H^{L,N} = Σ_{j=1}^{N} C_{j,L,N} h_j^2,

where C_{j,L,N} = min{j, L} − max{1, j − N + L} + 1. Therefore, we only need to show that C_{j,L,N} = w_j^{L,N}, for all j and L. We consider two cases: L ≤ K and L > K. For the first case, we have

C_{j,L,N} = { j,          1 ≤ j ≤ L;
            { L,          L + 1 ≤ j ≤ K;
            { N − j + 1,  K + 1 ≤ j ≤ N,

which is exactly equal to w_j^{L,N}. Similarly, for the second case, we get

C_{j,L,N} = { j,          1 ≤ j ≤ K;
            { K,          K + 1 ≤ j ≤ L;
            { N − j + 1,  L + 1 ≤ j ≤ N,

which again is equal to w_j^{L,N}, for L > K.
The weight w_j^{L,N} defined in Theorem 3.1 can be written in the functional form

w_j^{L,N} = (N + 1)/2 − ( |(N + 1)/2 − L| + |(N + 1)/2 − j| + | |(N + 1)/2 − L| − |(N + 1)/2 − j| | ) / 2.   (5)

Equation (5) shows that:
(i) w_j^{L,N} is a concave function of L for all j, where j ∈ {1, ..., N};
(ii) w_j^{L,N} is a concave function of j for all L, where L ∈ {2, ..., N − 1};
(iii) w_j^{L,N} is a symmetric function around the line (N + 1)/2 with respect to both j and L.
The above mentioned results imply that the behavior of the quantity T_H^{L,N} is similar on the two intervals 2 ≤ L ≤ [(N + 1)/2] and [(N + 1)/2] + 1 ≤ L ≤ N − 1, where, as usual, [x] denotes the integer part of the number x. Therefore, we only need to consider one of these intervals.
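Theorem 3.1 and the functional form in Equation (5) are easy to spot-check numerically (a sketch assuming NumPy; the function names and the random series are ours):

```python
import numpy as np

def weight(j, L, N):
    # w_j^{L,N} = min{min{L, K}, j, L + K - j}, with K = N - L + 1 (Theorem 3.1).
    K = N - L + 1
    return min(min(L, K), j, L + K - j)

def weight_closed_form(j, L, N):
    # Functional form of Equation (5).
    m = (N + 1) / 2
    a, b = abs(m - L), abs(m - j)
    return m - (a + b + abs(a - b)) / 2

N = 12
h = np.random.default_rng(1).standard_normal(N)
for L in range(2, N):
    K = N - L + 1
    H = np.array([[h[i + j] for j in range(K)] for i in range(L)])
    # Theorem 3.1: tr(HH') equals the weighted sum of squares of the series.
    T = np.trace(H @ H.T)
    assert np.isclose(T, sum(weight(j, L, N) * h[j - 1] ** 2 for j in range(1, N + 1)))
    # Equation (5) agrees with the min-form of the weights.
    assert all(weight(j, L, N) == weight_closed_form(j, L, N) for j in range(1, N + 1))
```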
Theorem 3.2 Let T_H^{L,N} be defined as in Equation (3). Then, T_H^{L,N} is an increasing function of L on {2, ..., [(N + 1)/2]}, a decreasing function on {[(N + 1)/2] + 1, ..., N − 1}, and

max_L T_H^{L,N} = T_H^{[(N+1)/2],N}.
Proof First, we show that w_j^{L,N} is an increasing function of L on {2, ..., [(N + 1)/2]}. Let L_1 and L_2 be two arbitrary values, where L_1 < L_2 ≤ [(N + 1)/2]. From the definition of w_j^{L,N}, we have

w_j^{L_2,N} − w_j^{L_1,N} = { 0,                1 ≤ j ≤ L_1;
                            { j − L_1,          L_1 + 1 ≤ j ≤ L_2;
                            { L_2 − L_1,        L_2 + 1 ≤ j ≤ N − L_2 + 1;
                            { N − j + 1 − L_1,  N − L_2 + 2 ≤ j ≤ N − L_1 + 1;
                            { 0,                N − L_1 + 2 ≤ j ≤ N.

Therefore, w_j^{L_2,N} − w_j^{L_1,N} ≥ 0, for all j, and the inequality is strict for some j. Thus,

T_H^{L_2,N} − T_H^{L_1,N} = Σ_{j=1}^{N} ( w_j^{L_2,N} − w_j^{L_1,N} ) h_j^2 > 0.   (6)
This confirms that T_H^{L,N} is an increasing function of L on {2, ..., [(N + 1)/2]}. A similar approach for the set {[(N + 1)/2] + 1, ..., N − 1} implies that T_H^{L,N} is a decreasing function of L on this interval. Note also that T_H^{L_2,N} − T_H^{L_1,N} in Equation (6) increases as the value of L_2 increases, again showing that T_H^{L,N} is an increasing function on {2, ..., [(N + 1)/2]}. Therefore, the maximum value of T_H^{L,N} is attained at the maximum value of L, which is [(N + 1)/2].
Corollary 3.3 Let L_max denote the value of L such that T_H^{L,N} ≤ T_H^{L_max,N}, for all L, with the inequality strict for some values of L. Then,

L_max = { (N + 1)/2,        if N is odd;
        { N/2 and N/2 + 1,  if N is even.
Corollary 3.3 shows that L = median{1, ..., N} maximizes the sum of squares of the Hankel matrix singular values for fixed values of N. Applying Corollary 3.3 and Equation (5), we can show that

w_j^{L_max,N} = (N + 1)/2 − |(N + 1)/2 − j|.   (7)

Equation (7) shows that h_{[(N+1)/2]} has maximum weight in T_H^{L,N}.
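Corollary 3.3 can likewise be verified numerically for an even series length (a sketch assuming NumPy; the random series is illustrative):

```python
import numpy as np

# Numerical check of Corollary 3.3: T_H^{L,N} = tr(HH') is maximized when
# L is the median of {1, ..., N} (two maximizers when N is even).
def trace_T(h, L):
    N = len(h)
    K = N - L + 1
    H = np.array([[h[i + j] for j in range(K)] for i in range(L)])
    return np.trace(H @ H.T)

h = np.random.default_rng(2).standard_normal(20)        # N = 20 (even)
T = {L: trace_T(h, L) for L in range(2, 20)}
best = max(T, key=T.get)
assert best in (10, 11)                                 # N/2 and N/2 + 1
assert np.isclose(T[10], T[11])
```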
3.2 Eigenvalues of HH' and rank of H
Here, some inequalities between the ordered eigenvalues for different values of L are derived. Based on Cauchy's interlacing theorem, the following result can be given; see Bhatia (1997).
Theorem 3.4 Let H be an L × K Hankel matrix as defined in Equation (1). Then, we have

λ_j^{L,N} ≥ λ_j^{L−m,N−m} ≥ λ_{j+m}^{L,N},   j = 1, ..., L − m,

where m is a number belonging to the set {1, ..., L − 1}.
Proof Consider the partition

HH' = [ (HH')_(1)   (HH')_(3)
        (HH')_(2)   (HH')_(4) ],

where

(HH')_(1) = [ Σ_{j=1}^{K} h_j^2              Σ_{j=1}^{K} h_j h_{j+1}           ...   Σ_{j=1}^{K} h_j h_{j+L−m−1}
              Σ_{j=1}^{K} h_{j+1} h_j        Σ_{j=1}^{K} h_{j+1}^2             ...   Σ_{j=1}^{K} h_{j+1} h_{j+L−m−1}
              ...                            ...                               ...   ...
              Σ_{j=1}^{K} h_{j+L−m−1} h_j    Σ_{j=1}^{K} h_{j+L−m−1} h_{j+1}   ...   Σ_{j=1}^{K} h_{j+L−m−1}^2 ].

Using this partitioned form, we can say that the sub-matrix (HH')_(1) is obtained from the Hankel matrix corresponding to the sub-series H_{N−m} = (h_1, ..., h_{N−m}), and its eigenvalues are λ_1^{L−m,N−m} ≥ λ_2^{L−m,N−m} ≥ ... ≥ λ_{L−m}^{L−m,N−m} ≥ 0. Therefore, the proof is completed using Cauchy's interlacing theorem.
Now, we would like to find a relationship between λ_j^{L−m,N} and λ_j^{L,N}, so Theorem 3.4 cannot be applied directly. Next, we consider four cases and show that we can find general relationships for some classes of Hankel matrices.
3.2.1 Case 1: L ≥ 1, rank of H = 1
In this case, it is obvious that we have one positive eigenvalue. Therefore, we can write

λ_1^{L,N} = Σ_{j=1}^{L} λ_j^{L,N} = tr(HH') = Σ_{l=1}^{L} Σ_{j=l}^{K+l−1} h_j^2.

According to Theorem 3.2, the eigenvalue λ_1^{L,N} increases with L up to [(N + 1)/2] and then decreases for L ≥ [(N + 1)/2] + 1. Therefore, we have λ_1^{L−m,N} ≤ λ_1^{L,N} for L ≤ [(N + 1)/2], provided that the conditions of Case 1 are satisfied.
3.2.2 Case 2: L = 2, rank of H = 2
In this case, HH' has at most two eigenvalues, which are the solutions of the quadratic equation

λ^2 − ( Σ_{j=1}^{N−1} h_j^2 + Σ_{j=2}^{N} h_j^2 ) λ + Σ_{j=1}^{N−1} h_j^2 Σ_{j=2}^{N} h_j^2 − ( Σ_{j=1}^{N−1} h_j h_{j+1} )^2 = 0.   (8)
Equation (8) has two real solutions, so that we have two real eigenvalues. The first eigenvalue (the larger one) is given by

λ_1^{2,N} = [ Σ_{j=1}^{N−1} h_j^2 + Σ_{j=2}^{N} h_j^2 + √( (h_1^2 − h_N^2)^2 + 4 ( Σ_{j=1}^{N−1} h_j h_{j+1} )^2 ) ] / 2.   (9)
Equation (9) shows that

λ_1^{2,N} { ≥ λ_1^{1,N},  if ( Σ_{j=1}^{N−1} h_j h_{j+1} )^2 ≥ h_1^2 h_N^2;
          { ≤ λ_1^{1,N},  if ( Σ_{j=1}^{N−1} h_j h_{j+1} )^2 ≤ h_1^2 h_N^2;   (10)

where λ_1^{1,N} = Σ_{j=1}^{N} h_j^2 is the single eigenvalue when L = 1. Practically, it seems that the first condition of Equation (10) is satisfied for a wide class of models. For example, it can be seen that the condition holds under monotonicity of the sequence {h_j, j = 1, ..., N}: for a non-negative (or non-positive) monotone sequence, we have Σ_{j=1}^{N−1} h_j h_{j+1} ≥ h_1 h_N. Applying Equation (9), it follows that λ_1^{2,N} ≥ Σ_{j=1}^{N−1} h_j^2 = λ_1^{1,N−1}. A larger class is obtained if we consider positive data where all observations are bigger than the first one and h_1 ≥ h_N/(N − 1). Under this condition, it is easy to show that Σ_{j=1}^{N−1} h_j h_{j+1} ≥ h_1 h_N and therefore λ_1^{2,N} ≥ λ_1^{1,N}. In the next section, we see some examples of models that do not satisfy these conditions but for which λ_1^{2,N} ≥ λ_1^{1,N}.
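Equations (9) and (10) can be checked against a direct eigendecomposition (a sketch assuming NumPy; the random series is illustrative):

```python
import numpy as np

# Check of Equation (9): the largest eigenvalue of HH' for L = 2,
# against a direct numerical eigendecomposition.
rng = np.random.default_rng(3)
h = rng.standard_normal(15)                  # N = 15
a = np.sum(h[:-1] ** 2)                      # sum_{j=1}^{N-1} h_j^2
b = np.sum(h[1:] ** 2)                       # sum_{j=2}^{N} h_j^2
c = np.sum(h[:-1] * h[1:])                   # sum_{j=1}^{N-1} h_j h_{j+1}
lam1 = (a + b + np.sqrt((h[0] ** 2 - h[-1] ** 2) ** 2 + 4 * c ** 2)) / 2

H = np.array([h[:-1], h[1:]])                # 2 x (N - 1) Hankel matrix
lam_num = np.max(np.linalg.eigvalsh(H @ H.T))
assert np.isclose(lam1, lam_num)

# The sign of c^2 - h_1^2 h_N^2 decides the comparison in Equation (10)
# with lambda_1^{1,N} = sum of h_j^2.
S = np.sum(h ** 2)
assert (lam1 >= S) == (c ** 2 >= h[0] ** 2 * h[-1] ** 2)
```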
It is worth mentioning that we can give a geometrical display of Equation (8) as

λ^2 − ( ||h_{1:N−1}||^2 + ||h_{2:N}||^2 ) λ + ||h_{1:N−1}||^2 ||h_{2:N}||^2 (sin(θ_{1,2}))^2 = 0,   (11)

where h_{1:N−1} and h_{2:N} denote the first and second rows of H, ||·|| the Euclidean norm and θ_{1,2} the angle between the two rows of H. Notice that the last expression in Equation (11) is the square of the magnitude of the cross product between the first two rows of H. Since (sin(θ_{1,2}))^2 ≤ 1, it is easy to obtain the inequality λ_1^{2,N} ≥ λ_1^{1,N−1}, which is a direct result of Theorem 3.4, from the characteristics given in Equation (11).
3.2.3 Case 3: L > 2, rank of H = 2
In this case, HH' has two positive eigenvalues. To obtain the eigenvalues, first of all note that

det( λI − HH' ) = λ^L + c_1 λ^{L−1} + ... + c_{L−1} λ + c_L,   (12)

where the coefficients c_j can be obtained from the following lemma.
Lemma 3.5 (Horn and Johnson, 1985, Theorem 1.2.12) Let A be an n × n real or complex matrix with eigenvalues λ_1, ..., λ_n. Then, for 1 ≤ k ≤ n,
(i) s_k(λ) = (−1)^k c_k, and
(ii) s_k(λ) is the sum of all the k × k principal minors of A.
Equation (12) shows that the eigenvalues of HH' in this case are the solutions of the quadratic equation

λ^2 − ( Σ_{l=1}^{L} Σ_{j=l}^{K+l−1} h_j^2 ) λ + Σ_{l=1}^{L−1} Σ_{i=1}^{L−l} [ ( Σ_{j=l}^{K+l−1} h_j^2 ) ( Σ_{j=l}^{K+l−1} h_{j+i}^2 ) − ( Σ_{j=l}^{K+l−1} h_j h_{j+i} )^2 ] = 0.   (13)
The first eigenvalue (the larger one) is given by

λ_1^{L,N} = (1/2) [ Σ_{l=1}^{L} Σ_{j=l}^{K+l−1} h_j^2 + √(Δ_L) ],   (14)

where Δ_L is the discriminant of the quadratic expression given in Equation (13). According to Equation (14), it is easy to see that, for L ≤ [(N + 1)/2],

λ_j^{L,N} − λ_j^{L−1,N} ≤ Σ_{l=L}^{N−L+1} h_l^2,   j = 1, 2.
Similar to the previous case, Equation (13) may be reformulated in the language of multivariate geometry for the L-lagged vectors as

λ^2 − ( Σ_{j=1}^{L} ||h_{j:K+j−1}||^2 ) λ + Σ_{i=1}^{L−1} Σ_{j=i+1}^{L} ||h_{i:K+i−1}||^2 ||h_{j:K+j−1}||^2 (sin(θ_{i,j}))^2 = 0,

where the notations are defined similarly to those in Case 2.
3.2.4 Case 4: L > 2, rank of H > 2
Applying Equation (12), one can obtain the characteristic equation whose solution gives the eigenvalues of HH'. The ratio Σ_{j=1}^{r} λ_j^{L,N} / Σ_{j=1}^{L} λ_j^{L,N} is the characteristic of the best r-dimensional approximation of the lagged vectors in the SSA technique. Furthermore, this ratio is an obvious criterion for choosing the proper values of the parameters r and L in the SSA. Therefore, studying how this ratio changes with respect to L and r is important for the SSA technique. First of all, note that, if we let C_j^{L,N} = λ_j^{L,N} / Σ_{j=1}^{L} λ_j^{L,N}, then it is easy to see that, for some j ∈ {1, ..., L − m} and all values of m belonging to {1, ..., L − 1}, we have

C_j^{L,N} ≤ C_j^{L−m,N}.   (15)

Since the inequality given in Equation (15) is satisfied for all values of m belonging to {1, ..., L − 1}, it appears that C_1^{L,N} is decreasing on L ∈ {2, ..., [(N + 1)/2]}. In the next section, we see examples that show whether such behavior holds for polynomial models or not.
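The claimed decreasing behavior of C_1^{L,N} can be spot-checked for a polynomial series of the kind used in the examples of Section 4 (a sketch assuming NumPy; the helper C1 is ours):

```python
import numpy as np

# Share of the leading eigenvalue, C_1^{L,N} = lambda_1 / sum_j lambda_j,
# for a quadratic polynomial series (rank-3 Hankel matrix).
N = 20
t = np.arange(1, N + 1)
h = (1 + 2 * t + 3 * t ** 2).astype(float)

def C1(L):
    K = N - L + 1
    H = np.array([[h[i + j] for j in range(K)] for i in range(L)])
    lam = np.linalg.eigvalsh(H @ H.T)
    lam = np.clip(lam, 0, None)        # guard against tiny negative round-off
    return lam.max() / lam.sum()

# The leading share shrinks as L grows towards the median of {1, ..., N}.
assert C1(2) > C1(10)
```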
4. Examples and Application
In this section, we discuss some examples related to the theoretical results obtained in
Section 3. Also, we provide an application of these results.
4.1 Examples
Example 4.1 Let h_t = exp(β_0 + β_1 t), for t = 1, ..., N. It is easy to see that the corresponding Hankel matrix H has rank one. Figure 1 shows the first singular value of H for this model with β_0 = 0.1, β_1 = 0.2 and N = 20; it is concave with respect to L and attains its maximum value at L = 10, 11, i.e., the median of {1, ..., 20}.
Figure 1. Plot of the first singular value of H for different values of L: Example 4.1.
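Example 4.1 can be reproduced numerically (a sketch assuming NumPy; the rank-one property is checked through the second singular value):

```python
import numpy as np

# Reproduce Example 4.1: h_t = exp(0.1 + 0.2 t), N = 20. The Hankel matrix
# has rank one, and its single singular value is maximal at L = 10, 11.
N = 20
t = np.arange(1, N + 1)
h = np.exp(0.1 + 0.2 * t)

sv1 = {}
for L in range(2, N):
    K = N - L + 1
    H = np.array([[h[i + j] for j in range(K)] for i in range(L)])
    s = np.linalg.svd(H, compute_uv=False)
    assert s[1] < 1e-8 * s[0]          # rank one: remaining singular values vanish
    sv1[L] = s[0]

assert max(sv1, key=sv1.get) in (10, 11)   # maximum at the median of {1, ..., 20}
```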
Now, we consider two different examples where the corresponding Hankel matrices have rank two. The first one is a simple linear model and the second is a cosine model. As we see, for both models, roughly speaking, the results are somewhat similar to Example 4.1.
Example 4.2 Let h_t = β_0 + β_1 t, for t = 1, ..., N. It is easy to show that the rank of the corresponding Hankel matrix H is two. Figure 2 shows the first and second singular values of H for β_0 = 1, β_1 = 2 and N = 20. From this figure, we can say that both the first and second singular values of H increase for L ≤ [(N + 1)/2] and then decrease.
Figure 2. Plots of the first (left) and second (right) singular values of H for different values of L: Example 4.2.
Example 4.3 Let h_t = cos(t/12), for t = 1, ..., N. The first and second singular values of H are depicted in Figure 3 for a series of length 100. Disregarding some small fluctuations in the plots, the behavior of the singular values of H is similar to that in Example 4.2.
Figure 3. Plots of the first (left) and second (right) singular values of H for different values of L: Example 4.3.
Example 4.4 Let h_t = β_0 + β_1 t + β_2 t^2, for t = 1, ..., N. It is easy to show that the rank of the corresponding Hankel matrix H is 3. Figure 4 shows the singular values of H for β_0 = 1, β_1 = 2, β_2 = 3 and N = 20. From this figure, we note that all the singular values of H increase for L ≤ [(N + 1)/2] and then decrease, which coincides with Theorem 3.2.
Figure 4. Plots of the three largest singular values of H for different values of L: Example 4.4.
Example 4.5 Let h_t = log(t), for t = 1, ..., N. Then, it can be seen that the rank of the corresponding Hankel matrix H is four. The singular values of H are shown in Figure 5 for N = 20. The results of this example are in concordance with Example 4.4.
Figure 6 shows the first singular value for the models h_t = cos(t/12) (left) and h_t = log(t) (right), for N = 5, ..., 100. The solid and dashed lines in Figure 6 denote the singular values for L = 2 and L = 1, respectively. Both plots confirm our expectation about the discrepancy between the two singular values. Notice that the cosine model is not monotone, but λ_1^{2,N} ≥ λ_1^{1,N}.
Figure 5. Plots of the four largest singular values of H with respect to different values of L: Example 4.5.
Figure 6. Plots of the first singular value for different values of L and N in the cosine (left) and logarithm (right) models.
Example 4.6 Let h_t = β_0 + β_1 t + β_2 t^2, for t = 1, ..., N (a polynomial model). Figure 7 shows the ratio C_j^{L,N} for β_0 = 1, β_1 = 2, β_2 = 3, N = 20 and j = 1, 2, 3. From this figure, we note that C_1^{L,N} decreases for values of L less than [(N + 1)/2] and then increases on the set L ∈ {[(N + 1)/2] + 1, ..., N − 1}, whereas C_2^{L,N} and C_3^{L,N} increase on the set {1, ..., [(N + 1)/2]} and decrease on {[(N + 1)/2] + 1, ..., N}.
Figure 7. Plots of C_j^{L,N} with respect to L for N = 20 and j = 1 (left), j = 2 (center), j = 3 (right): Example 4.6.
Next, we examine cases where the degree of the polynomial is greater than two. Furthermore, different coefficients are considered. The results are similar to Example 4.6 and thus we do not report them here. As a general result, we can say that the inequality given in Equation (15) is satisfied for j = 1 in the polynomial models. Now, we consider the ratio C_{1:r}^{L,N} = Σ_{j=1}^{r} C_j^{L,N}. Since C_1^{L,N} is bigger than C_j^{L,N}, for j > 1, and the discrepancy between them is usually large (see the polynomial model of Example 4.6), we expect the ratio C_{1:r}^{L,N} to behave like C_1^{L,N}. In the following example, the behavior of this ratio is depicted for a polynomial model of degree four.
Example 4.7 Let h_t = β_0 + β_1 t + β_2 t^2 + β_3 t^3 + β_4 t^4, for t = 1, ..., N. Figure 8 shows the ratio C_{1:r}^{L,N} for β_0 = 1, β_1 = 2, β_2 = 3, β_3 = 4, β_4 = 5 and N = 20. From this figure, we note that C_{1:r}^{L,N} decreases on L ∈ {2, ..., [(N + 1)/2]}, for r ≥ 1, and then increases on L > [(N + 1)/2], as expected.
Figure 8. Plot of C_{1:r}^{L,N} with respect to L for N = 20 and r = 1 (left), r = 2 (center), r = 3 (right): Example 4.7.
4.2 Choosing the SSA parameters
Several rules have been proposed in the literature for choosing the SSA parameters; see, e.g., Golyandina et al. (2001) and Hassani et al. (2011). However, the list is by no means exhaustive. Certainly, the choice of parameters depends on the data collected and on the analysis to be performed. In any case, one important point is that the singular values give the most effective information for choosing the parameters in the SSA. In the previous subsections, several criteria and theorems were considered to investigate the behavior of the singular values of the Hankel matrix. The theoretical results about the structure of the Hankel and trajectory matrices and the relationship with their dimensions enable us to state that a choice of L close to one-half of the time series length is suitable for the decomposition stage in most cases. Previous empirical and theoretical results also confirm the results obtained here. Moreover, using the definition of the criterion T_H^{L,N}, it can be seen that

T_H^{L,N} − T_H^{L−1,N} = Σ_{j=L}^{K} h_j^2.   (16)
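Equation (16) can be verified directly (a sketch assuming NumPy; the helper trace_T and the random series are ours):

```python
import numpy as np

# Check of Equation (16): the increment of T_H^{L,N} = tr(HH') when L grows
# by one equals the sum of h_j^2 over j = L, ..., K (1-based indices).
def trace_T(h, L):
    N = len(h)
    K = N - L + 1
    H = np.array([[h[i + j] for j in range(K)] for i in range(L)])
    return np.trace(H @ H.T)

h = np.random.default_rng(4).standard_normal(20)   # N = 20
for L in range(3, 11):                             # L <= [(N + 1)/2]
    K = len(h) - L + 1
    lhs = trace_T(h, L) - trace_T(h, L - 1)
    rhs = np.sum(h[L - 1:K] ** 2)                  # h_L, ..., h_K
    assert np.isclose(lhs, rhs)
```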
Equation (16) gives the rate of change of tr(HH') = Σ_j λ_j^{L,N}, which increases with L on {1, ..., [(N + 1)/2]} and decreases on {[(N + 1)/2] + 1, ..., N}. In addition, we have investigated the behavior of the sum of squares and the contribution of each singular value. The results based on these criteria have shown that a choice of L close to one-half of the time series length is suitable for the decomposition stage of singular spectrum analysis in most cases.
Acknowledgements
The authors would like to thank the anonymous referees for their valuable comments, which improved the exposition of the paper.
References
Bhatia, R., 1997. Matrix Analysis. Springer, New York.
Elsner, J.B., Tsonis, A.A., 1996. Singular Spectrum Analysis: A New Tool in Time Series Analysis. Plenum Press, New York.
Golyandina, N., Nekrutkin, V., Zhigljavsky, A., 2001. Analysis of Time Series Structure: SSA and Related Techniques. Chapman & Hall/CRC, New York.
Hassani, H., 2007. Singular spectrum analysis: methodology and comparison. Journal of Data Science, 5, 239-257.
Hassani, H., Heravi, S., Zhigljavsky, A., 2009. Forecasting European industrial production with singular spectrum analysis. International Journal of Forecasting, 25, 103-118.
Hassani, H., Mahmoudvand, R., Zokaei, M., 2011. Separability and window length in singular spectrum analysis. Comptes Rendus Mathematique (in press).
Hassani, H., Thomakos, D., 2010. A review on singular spectrum analysis for economic and financial time series. Statistics and its Interface, 3, 377-397.
Horn, R.A., Johnson, C.R., 1985. Matrix Analysis. Cambridge University Press, Cambridge.
Jolliffe, I.T., 2002. Principal Component Analysis. Springer, New York.
Peller, V., 2003. Hankel Operators and Their Applications. Springer, New York.
Widom, H., 1966. Hankel matrices. Transactions of the American Mathematical Society, 121, 1-35.
Chilean Journal of Statistics
Vol. 3, No. 1, April 2012, 57-73
Statistical Modeling
Research Paper
On linear mixed models and their influence diagnostics applied to an actuarial problem
Luis Gustavo Bastos Pinho, Juvêncio Santos Nobre
Corresponding author: Juvêncio Santos Nobre. Departamento de Estatística e Matemática Aplicada, Universidade Federal do Ceará, Fortaleza/CE, Brazil. CEP: 60.440-900. Email: juvencio@ufc.br
58 L.G. Bastos Pinho, J.S. Nobre and S.M. Freitas
same day for three consecutive years. In Figure 1(a) we do not consider the within-region (within-subject) correlation. The dashed line is a simple linear regression fit and suggests that the more policy holders, the fewer claims occur. In Figure 1(b) we join the observations for each region by a solid line. It is now clear that the number of claims increases with the number of policy holders.
Figure 1. (a) Not considering the within-subject correlation; (b) considering the within-subject correlation. (Both panels plot the number of claims against the number of policy holders, in thousands.)
It is necessary to take into consideration that each region may have a particular behavior which should be modeled, but this alone is usually not enough. Techniques summarized under the name of diagnostic procedures may help to identify issues of concern, such as highly influential observations, which may distort the analysis. For linear homoskedastic models, a well known diagnostic procedure is the residual plot. For linear mixed models, better types of residuals are defined. Besides residual techniques, which are useful, there is a less used class of diagnostic procedures, which includes case deletion and measuring changes in the likelihood of the fitted model under minor perturbations. Several important issues may go unnoticed without the aid of these last diagnostic methods.
For introductory information regarding regression models and the respective diagnostic analysis, see Cook and Weisberg (1982) or Draper and Smith (1998). For a comprehensive introduction to linear mixed models, see Verbeke and Molenberghs (2000), McCulloch and Searle (2001) and Demidenko (2004). Diagnostic analysis of linear mixed models was presented and discussed in Beckman et al. (1987), Christensen and Pearson (1992), Hilden-Minton (1995), Lesaffre and Verbeke (1998), Banerjee and Frees (1997), Tan et al. (2001), Fung et al. (2002), Demidenko (2004), Demidenko and Stukel (2005), Zewotir and Galpin (2005), Gumedze et al. (2010) and Nobre and Singer (2007, 2011).
The seminal work of Frees et al. (1999) showed some similarities and equivalences between mixed models and some well known credibility models. Applications to data sets in an actuarial context may be seen in Antonio and Beirlant (2006). Our contribution is to show how to use diagnostic methods for linear mixed models applied to actuarial science. We illustrate how to identify outliers and influential observations and subjects. We also show how to use diagnostics as a tool for model selection. These methods are very important and usually overlooked by most actuaries.
This paper is divided as follows. In Section 2 we present a motivational example using a well known data set. In Section 3 we briefly present linear mixed models. Section 4 contains a short introduction to the diagnostic methods used in the example. In Section 5 we present an application based on the motivational example. Section 6 shows some conclusions. Finally, in an Appendix, we present mathematical details of some formulas and expressions used in the text.
2. Motivational Example
For a practical example, consider the Hachemeister (1975) data on private passenger bodily injury insurance. The data were collected from five states (subjects) in the US, through twelve trimesters between July 1970 and June 1973, and show the mean claim amount and the total number of claims in each trimester. The data may be found in the actuar package (see Dutang et al., 2008) for R (R Development Core Team, 2009) and are partially shown in Table 1.
Table 1. Hachemeister's data.
Trimester   State   Mean claim amount   Number of claims
    1         1           1738                7861
    1         2           1364                1622
    1         3           1759                1147
    1         4           1223                 407
    1         5           1456                2902
    2         1           1642                9251
   ...       ...           ...                 ...
   12         1           2517                9077
   12         2           1471                1861
   12         3           2059                1121
   12         4           1306                 342
   12         5           1690                3425
In Figure 2 we plot the individual profiles for each state and the mean profile. The figure suggests that the claims behave differently along the trimesters for each state. One may notice that the claims from state 1 are greater than those from the other states for almost every observation, and the claims from states 2 and 3 seem to grow more slowly than those from state 1. If the insurer wants to accurately predict the severity, the subjects' individual behavior must also be modeled. Traditionally this is possible with the aid of credibility models; see, e.g., Bühlmann (1967), Hachemeister (1975) and Dannenburg et al. (1996). These models assign weights, known as credibility factors, to a pair of different estimates of the severity.
Figure 2. Individual profiles and mean profile for the Hachemeister (1975) data.
Credibility models may be functionally defined as

A = ZB + (1 − Z)C,

where A represents the severity in a given state, Z is a credibility factor restricted to [0, 1], B is an a priori estimate of the expected severity for the same state and C is an a posteriori estimate, also of the expected severity. Considering a particular state, B may be equal to the sample mean of the severity of its observations and C equal to the overall sample mean of the data in the same period.
Frees et al. (1999) showed that it is possible to find linear mixed models equivalent to some known credibility models, such as the Bühlmann (1967) and Hachemeister (1975) models. Information about linear mixed models is provided in the next section.
3. Linear Mixed Models
Linear mixed models are a popular alternative for the analysis of repeated measures. Such models may be functionally expressed as

y_i = X_i β + Z_i b_i + e_i,   i = 1, ..., k,   (1)

where y_i = (y_1, y_2, ..., y_{n_i})' is the n_i × 1 vector of the observed values of the response variable for the ith subject, X_i is an n_i × p known full rank matrix, β is a p × 1 vector of unknown parameters, also known as fixed effects, which are used to model E[y_i], Z_i is an n_i × q known full rank matrix, b_i is a q × 1 vector of latent variables, also known as random effects, used to model the within-subject correlation structure, and e_i = (e_{i1}, e_{i2}, ..., e_{in_i})' is the n_i × 1 random vector of (within-subject) measurement errors. It is usually also assumed that e_i ~ind N_{n_i}(0, σ^2 I_{n_i}), where I_{n_i} denotes the identity matrix of order n_i for i = 1, ..., k, that b_i ~iid N_q(0, σ^2 G) for i = 1, ..., k, in which G is a q × q positive definite matrix, and that e_i and b_j are independent for all i, j. Under these assumptions, this is called a homoskedastic conditional independence model. It is possible to rewrite the model given in Equation (1) in a more concise way as

y = Xβ + Zb + e,   (2)

where y = (y_1', ..., y_k')', X = (X_1', ..., X_k')', Z = ⊕_{i=1}^{k} Z_i, b = (b_1', ..., b_k')' and e = (e_1', ..., e_k')', with ⊕ representing the direct sum.
It can be shown that, conditionally on the covariance parameters of the model being known, that is, conditionally on the elements of G and σ^2, the best linear unbiased estimator (BLUE) for β and the best linear unbiased predictor (BLUP) for b are given by

β̂ = (X'V^{−1}X)^{−1} X'V^{−1} y,   (3)

and

b̂ = DZ'V^{−1}(y − Xβ̂),

respectively, where D = σ^2 G, V = σ^2 (I_n + ZGZ'), with n = Σ_{i=1}^{k} n_i; see Hachemeister (1975).
Maximum likelihood (ML) and restricted maximum likelihood (RML) methods can be used to estimate the variance components of the model. The latter, proposed in Patterson and Thompson (1971), is usually chosen since it often generates less biased estimators of the variance structure. When estimates of V are used in Equation (3) to obtain β̂ and b̂, they are called the empirical BLUE (EBLUE) and empirical BLUP (EBLUP), respectively. Usually the estimation of the parameters involves the use of iterative methods for maximizing the likelihood function.
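With known variance components, Equation (3) and the BLUP can be computed directly; the following is a sketch for a toy random-intercept version of model (2) (assuming NumPy/SciPy; all dimensions, names and parameter values are illustrative and not from the paper):

```python
import numpy as np
from scipy.linalg import block_diag

# Toy data: k subjects, n_i observations each, random intercept per subject.
rng = np.random.default_rng(5)
k, ni, p, q = 5, 12, 2, 1
Xi = [np.column_stack([np.ones(ni), np.arange(1, ni + 1)]) for _ in range(k)]
Zi = [np.ones((ni, 1)) for _ in range(k)]
X = np.vstack(Xi)
Z = block_diag(*Zi)                        # direct sum of the Z_i
beta_true = np.array([1000.0, 50.0])
sigma2, G = 100.0, np.array([[4.0]])       # G is q x q, positive definite
G_blk = np.kron(np.eye(k), G)              # direct sum of k copies of G
b = rng.multivariate_normal(np.zeros(q * k), sigma2 * G_blk)
y = X @ beta_true + Z @ b + rng.normal(0, np.sqrt(sigma2), k * ni)

# V = sigma^2 (I_n + Z G Z'), with G taken block-diagonally over subjects.
V = sigma2 * (np.eye(k * ni) + Z @ G_blk @ Z.T)
Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # BLUE, Equation (3)
D = sigma2 * G_blk
b_hat = D @ Z.T @ Vinv @ (y - X @ beta_hat)                  # BLUP
```

The BLUE satisfies the generalized least squares normal equations, which gives a simple internal consistency check: X'V^{−1}(y − Xβ̂) should be numerically zero.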
Linear mixed models are not the only way to deal with repeated measures studies. Other popular alternatives are the generalized estimating equations (see Liang and Zeger, 1986; Diggle et al., 2002) and multivariate models, as seen in Johnson and Wichern (1982) and Vonesh and Chinchilli (1997). But these alternatives are usually more restrictive than linear mixed models, and they only model the marginal expected value of the response variable.
4. Diagnostic Methods
Diagnostic methods comprise techniques whose purpose is to investigate the plausibility and robustness of the assumptions made when choosing a model. The techniques shown here may be divided into two classes: residual analysis, which investigates the assumptions on the distribution of the errors and the presence of outliers; and sensitivity analysis, which analyzes the sensitivity of a statistical model when subjected to minor perturbations. It would usually be far more difficult, or even impossible, to examine these aspects in a traditional credibility model.
In the context of traditional linear models (homoskedastic and independent), examples of diagnostic methods may be seen in Hoaglin and Welsch (1978), Belsley et al. (1980) and Cook and Weisberg (1982). Diagnostics for linear mixed models, their extensions and generalizations are briefly discussed here and may be seen in Beckman et al. (1987), Christensen and Pearson (1992), Hilden-Minton (1995), Lesaffre and Verbeke (1998), Banerjee and Frees (1997), Tan et al. (2001), Fung et al. (2002), Demidenko (2004), Demidenko and Stukel (2005), Zewotir and Galpin (2005), Nobre and Singer (2007, 2011) and Gumedze et al. (2010).
4.1 Residual analysis
4.1.1 Standardized residuals

In the linear mixed models class, three different kinds of residuals may be considered: the marginal residuals, $\hat\xi = y - X\hat\beta$; the conditional residuals, $\hat e = y - X\hat\beta - Z\hat b$; and the EBLUP, $Z\hat b$. Nobre and Singer (2007) suggest working with the standardized conditional residual
$$e_i^* = \frac{\hat e_i}{\sqrt{\hat q_{ii}}},$$
where $q_{ii}$ represents the $i$th element of the main diagonal of $Q$, defined as
$$Q = \sigma^2\left(V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}\right).$$
Under normality assumptions on $e$, this standardization allows the identification of outlying observations and subjects; see Nobre and Singer (2007). To do so, the same authors consider the quadratic form $M_I = y'QU_I(U_I'QU_I)^{-1}U_I'Qy$, where $U_I = (u_{ij})_{(n \times k)} = (U_{i_1}, \ldots, U_{i_k})$, with $U_i$ representing the $i$th column of the identity matrix of order $n$. To identify an outlying subject, let $I$ be the index set of that subject's observations and evaluate $M_I$ for this subset.
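A sketch (Python/NumPy, simulated data with hypothetical dimensions and variance components treated as known) of the matrix $Q$, the standardized conditional residuals, and the subject-level statistic $M_I$. A useful identity to check the implementation is $\hat e = Qy$, which follows from $\hat e = \sigma^2 V^{-1}(y - X\hat\beta)$.

```python
import numpy as np

rng = np.random.default_rng(2)
k, ni = 4, 6
n = k * ni
sigma2, g = 1.0, 1.5                      # treated as known for illustration
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(k), np.ones((ni, 1)))
y = X @ np.array([2.0, 1.0]) + Z @ rng.normal(scale=np.sqrt(sigma2 * g), size=k) \
    + rng.normal(scale=np.sqrt(sigma2), size=n)

D = sigma2 * g * np.eye(k)
V = sigma2 * np.eye(n) + Z @ D @ Z.T
Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
b_hat = D @ Z.T @ Vinv @ (y - X @ beta_hat)

# Q = sigma^2 (V^{-1} - V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1})
P = Vinv - Vinv @ X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)
Q = sigma2 * P

e_hat = y - X @ beta_hat - Z @ b_hat              # conditional residuals
e_std = e_hat / np.sqrt(np.diag(Q))               # standardized conditional residuals

# M_I for each subject: I indexes that subject's observations
M = []
for i in range(k):
    U = np.eye(n)[:, i * ni:(i + 1) * ni]          # U_I: columns of I_n
    QU = Q @ U
    M.append(y @ QU @ np.linalg.solve(U.T @ QU, QU.T @ y))
```

Since $Q$ is positive semidefinite, every $M_I$ is nonnegative; large values point to outlying subjects.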
62 L.G. Bastos Pinho, J.S. Nobre and S.M. Freitas
Table 2. Diagnostic techniques involving residuals.

Diagnostic                                   Graph
Linearity of fixed effects                   $\hat\xi$ vs. explanatory variables (fitted values)
Presence of outliers                         $\hat e$ vs. observation index
Homoskedasticity of the conditional errors   $\hat e$ vs. fitted values
Normality of the conditional errors          QQ plot for the least confounded residuals
Presence of outlier subjects                 Mahalanobis distance vs. observation index
Normality of the random effects              weighted QQ plot for $\hat b_i$
4.1.2 Confounded residuals

It can be shown that, under the assumptions of the model given in Equation (1),
$$\hat e = RQe + RQZb \qquad \text{and} \qquad Z\hat b = ZGZ'QZb + ZGZ'Qe,$$
where $R = \sigma^2 I_n$. These identities tell us that $\hat e$ and $Z\hat b$ are confounded: each of them depends on both $e$ and $b$. For this reason it is preferable to use a linear transformation of $\hat e$ that minimizes this confounding, the so-called least confounded residuals, instead of $\hat e$ itself, as suggested by Hilden-Minton (1995) and verified by simulation in Nobre and Singer (2007).
4.1.3 EBLUP

The EBLUP is useful to identify outlying subjects, given that it represents the distance between the population mean value and the value predicted for the $i$th subject. A way of using the EBLUP to search for outlying subjects is the Mahalanobis distance (see Waternaux et al., 1989),
$$\zeta_i = \hat b_i'\left(\widehat{\operatorname{Var}}[\hat b_i - b_i]\right)^{-1}\hat b_i.$$
It is also possible to use the EBLUP to verify the normality assumption on the random effects; for more information, see Nobre and Singer (2007). Table 2 summarizes the diagnostic techniques involving residuals discussed in Nobre and Singer (2007).
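Continuing with simulated data, a minimal sketch of the EBLUP-based Mahalanobis distance. The prediction-error covariance $\operatorname{Var}[\hat b - b] = D - DZ'PZD$, with $P = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}$ (i.e. $Q/\sigma^2$), is the standard expression under the model assumptions; dimensions and parameter values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
k, ni = 4, 6
n = k * ni
sigma2, g = 1.0, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(k), np.ones((ni, 1)))
y = X @ np.array([2.0, 1.0]) + Z @ rng.normal(scale=np.sqrt(sigma2 * g), size=k) \
    + rng.normal(scale=np.sqrt(sigma2), size=n)

D = sigma2 * g * np.eye(k)
V = sigma2 * np.eye(n) + Z @ D @ Z.T
Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
b_hat = D @ Z.T @ Vinv @ (y - X @ beta_hat)

P = Vinv - Vinv @ X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)
var_pred = D - D @ Z.T @ P @ Z @ D        # Var[b_hat - b]

# With q = 1 each subject's distance reduces to a scalar ratio.
zeta = b_hat**2 / np.diag(var_pred)
```

Subjects with unusually large `zeta` relative to the others are candidate outliers.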
4.2 Sensitivity analysis
Influence diagnostic techniques are used to detect observations that may exert excessive influence on the parameter estimates. There are two main approaches: global influence, which is usually based on case deletion; and local influence, which introduces small perturbations in different components of the model.
In normal homoskedastic linear regression, examples of sensitivity measures are the Cook distance, the DFFITS and the COVRATIO; see Cook (1977), Belsley et al. (1980) and Chatterjee and Hadi (1986, 1988).
4.2.1 Global influence
A simple way to verify the influence of a group of observations on the parameter estimates is to remove the group and observe the changes in the estimates: the group of observations is influential if the changes are considerably large. However, in LMM it may not be practical to reestimate the parameters every time a set of observations is removed. To avoid doing so, Hilden-Minton (1995) presented an update formula for the BLUE and the BLUP. Let $I = \{i_1, \ldots, i_k\}$ be the index set of the removed observations and $U_I = (U_{i_1}, \ldots, U_{i_k})$. Hilden-Minton (1995) showed that
$$\hat\beta - \hat\beta_{(I)} = (X'MX)^{-1}X'MU_I\hat\lambda_{(I)} \qquad \text{and} \qquad \hat b - \hat b_{(I)} = DZ'QU_I\hat\lambda_{(I)},$$
where the subscript $(I)$ indicates that the estimates were obtained without the observations indexed by $I$ and $\hat\lambda_{(I)} = (U_I'QU_I)^{-1}U_I'Qy$.
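The update formula can be checked numerically. Taking $M = V^{-1}$ (an assumption consistent with the mean-shift derivation underlying the formula; the notation in the printed text is partly lost), the updated $\hat\beta_{(I)}$ coincides with a direct generalized least squares fit on the remaining observations, using the corresponding submatrix of $V$. A Python/NumPy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
k, ni = 4, 6
n = k * ni
sigma2, g = 1.0, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(k), np.ones((ni, 1)))
y = X @ np.array([2.0, 1.0]) + Z @ rng.normal(scale=np.sqrt(sigma2 * g), size=k) \
    + rng.normal(scale=np.sqrt(sigma2), size=n)

V = sigma2 * np.eye(n) + sigma2 * g * Z @ Z.T
Vinv = np.linalg.inv(V)
XtVinv = X.T @ Vinv
beta_hat = np.linalg.solve(XtVinv @ X, XtVinv @ y)

# Delete the observations indexed by I via the update formula (M = V^{-1}).
I = [2, 7, 13]
U = np.eye(n)[:, I]
Q = Vinv - Vinv @ X @ np.linalg.solve(XtVinv @ X, XtVinv)  # sigma^2 factor cancels in lambda
lam = np.linalg.solve(U.T @ Q @ U, U.T @ Q @ y)
beta_del = beta_hat - np.linalg.solve(XtVinv @ X, XtVinv @ U @ lam)

# Direct refit on the kept rows, using the corresponding submatrix of V.
keep = [i for i in range(n) if i not in I]
Xk, yk = X[keep], y[keep]
Vk_inv = np.linalg.inv(V[np.ix_(keep, keep)])
beta_direct = np.linalg.solve(Xk.T @ Vk_inv @ Xk, Xk.T @ Vk_inv @ yk)

print(np.allclose(beta_del, beta_direct))
```

This is exactly the computational saving the update formula provides: no refit of the model is required for each candidate deletion set.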
A suggestion to measure the influence on the parameter estimates in linear mixed models is the Cook distance (see Cook, 1977), given by
$$D_I = \frac{(\hat\beta - \hat\beta_{(I)})'(X'V^{-1}X)(\hat\beta - \hat\beta_{(I)})}{c} = \frac{(\hat y - \hat y_{(I)})'V^{-1}(\hat y - \hat y_{(I)})}{c},$$
as seen in Christensen and Pearson (1992) and Banerjee and Frees (1997), where $c$ is a scale factor. However, Tan et al. (2001) pointed out that $D_I$ is not always able to properly measure the influence on the estimation in the mixed models class. The same authors suggest a measure similar to the Cook distance, but conditional on the BLUP ($\hat b$). The conditional Cook distance is defined for the $i$th observation as
$$D_i^{\mathrm{cond}} = \frac{\sum_{j=1}^{k} P_{j(i)}'\operatorname{Var}[y \mid b]^{-1}P_{j(i)}}{(n-1)k + p},$$
where $P_{j(i)} = \hat y_j - \hat y_{j(i)} = (X_j\hat\beta + Z_j\hat b_j) - (X_j\hat\beta_{(i)} + Z_j\hat b_{j(i)})$. The same authors decomposed $D_i^{\mathrm{cond}} = D_{i1}^{\mathrm{cond}} + D_{i2}^{\mathrm{cond}} + D_{i3}^{\mathrm{cond}}$ and commented on the interpretation of each part of the decomposition: $D_{i1}^{\mathrm{cond}}$ is related to the influence on the fixed effects, $D_{i2}^{\mathrm{cond}}$ to the influence on the predicted values, and $D_{i3}^{\mathrm{cond}}$ to the covariance between the BLUE and the BLUP, which should be close to zero if the model is valid.
When all the observations from a subject are deleted, it is not possible to obtain the BLUP for the random effects of that subject, making it impossible to obtain $D_I^{\mathrm{cond}}$ as stated above. For this case, Nobre (2004) suggested using
$$D_I^{\mathrm{cond}} = \frac{1}{n_i}\sum_{j \in I} D_j^{\mathrm{cond}},$$
where $I$ indexes the observations from the subject, as a way to measure the influence of a subject on the parameter estimates when its observations are deleted.
There are natural extensions of leverage measures to linear mixed models; see Banerjee and Frees (1997), Fung et al. (2002), Demidenko (2004) and Nobre (2004). However, they only provide information about leverage with respect to the fitted marginal values. This has two main limitations, as noted by Nobre and Singer (2011). First, we may be interested in detecting high-leverage within-subject observations. Second, in some cases the presence of high-leverage within-subject observations does not imply that the subject itself is detected as a high-leverage subject. Suggestions on how to evaluate within-subject leverage may be seen in Demidenko and Stukel (2005) and Nobre and Singer (2011).
4.2.2 Local influence
The concept of local influence was proposed by Cook (1986) and consists in analyzing the sensitivity of a statistical model when subjected to small perturbations. Cook (1986) suggested the use of an influence measure called the likelihood displacement. Considering the model described in Equation (2), up to a constant, the log-likelihood function may be written as
$$L(\theta) = \sum_{i=1}^{k} L_i(\theta) = -\frac{1}{2}\sum_{i=1}^{k}\left\{\ln|V_i| + (y_i - X_i\beta)'V_i^{-1}(y_i - X_i\beta)\right\}.$$
The likelihood displacement is defined as $LD(\omega) = 2\{L(\hat\theta) - L(\hat\theta_\omega)\}$, where $\omega$ is an $l \times 1$ perturbation vector in an open set $\Omega \subset \mathbb{R}^l$; $\theta$ is the parameter vector of the model, including the covariance parameters; $\hat\theta$ is the ML estimate of $\theta$; and $\hat\theta_\omega$ is the ML estimate of $\theta$ in the perturbed model. It is assumed that there exists $\omega_0 \in \Omega$ such that $L(\theta|\omega_0) = L(\theta)$ for all $\theta$, and that $LD$ has first and second derivatives in a neighborhood of $(\omega_0', \hat\theta')'$. Cook (1986) considered the surface in $\mathbb{R}^{l+1}$ formed by the influence function $\alpha(\omega) = (\omega', LD(\omega))'$ and proposed assessing its normal curvature in a direction $d$,
$$C_d = 2\,|d'H'\ddot L^{-1}Hd|,$$
where $\ddot L = \partial^2 L(\theta)/\partial\theta\,\partial\theta'$ and $H = \partial^2 L(\theta|\omega)/\partial\theta\,\partial\omega'$, both evaluated at $\theta = \hat\theta$ and $\omega = \omega_0$; see Cook (1986). It can be shown that $C_d$ always lies between the minimum and maximum eigenvalues of the matrix $\ddot F = H'\ddot L^{-1}H$, so $d_{\max}$, the eigenvector associated with its largest eigenvalue, indicates the direction in which $LD(\omega)$ is most sensitive in a neighborhood of $\omega_0$. Beckman et al. (1987) made some comments on the effectiveness of the local influence approach. Lesaffre and Verbeke (1998) and Nobre (2004) showed some examples of perturbation schemes in the linear mixed models context.
Perturbation scheme for the covariance matrix of the conditional errors. To verify the sensitivity of the model to the conditional homoskedasticity assumption, perturbations are inserted in the covariance matrix of the conditional errors. This can be done by considering $\operatorname{Var}[e] = \sigma^2\Lambda(\omega)$, where $\Lambda(\omega) = \operatorname{diag}(\omega)$, with $\omega = (\omega_1, \ldots, \omega_n)'$ the perturbation vector. For this case we have $\omega_0 = 1_n$. The log-likelihood function in this case is given by
$$L(\theta|\omega) = -\frac{1}{2}\left\{\ln|V(\omega)| + (y - X\beta)'V(\omega)^{-1}(y - X\beta)\right\},$$
where $V(\omega) = ZDZ' + \sigma^2\Lambda(\omega)$.
Perturbation scheme for the response. For the local influence approach, Beckman et al. (1987) proposed the perturbation scheme
$$y(\omega) = y + s\omega,$$
where $s$ represents a scale factor and $\omega$ is an $n \times 1$ perturbation vector. For this scheme we have $\omega_0 = 0$, with $0$ representing the $n \times 1$ null vector. In this case, the perturbed log-likelihood function is proportional to
$$L(\theta|\omega) = -\frac{1}{2}(y + s\omega - X\beta)'V^{-1}(y + s\omega - X\beta).$$
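A quick numerical sketch (simulated data, hypothetical dimensions) confirming two properties of this scheme: at $\omega_0 = 0$ the perturbed log-likelihood reduces to the unperturbed one, and the $\beta$-block of the cross-derivative matrix equals $sV^{-1}X$ (cf. Appendix B). The cross-derivative is obtained by central finite differences, which are exact for this quadratic function up to rounding.

```python
import numpy as np

rng = np.random.default_rng(5)
n, s = 10, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(2), np.ones((5, 1)))
V = np.eye(n) + 0.8 * Z @ Z.T            # a valid covariance for illustration
Vinv = np.linalg.inv(V)
y = rng.normal(size=n)
beta = np.array([0.3, -0.2])

def loglik(beta, omega):
    # perturbed log-likelihood, up to an additive constant
    r = y + s * omega - X @ beta
    return -0.5 * r @ Vinv @ r

# At omega_0 = 0 the perturbation vanishes.
assert np.isclose(loglik(beta, np.zeros(n)),
                  -0.5 * (y - X @ beta) @ Vinv @ (y - X @ beta))

# Cross-derivative d^2 L / d omega d beta' by central finite differences.
h = 1e-4
H_beta = np.zeros((n, 2))
for j in range(n):
    for m in range(2):
        ej, em = np.eye(n)[j], np.eye(2)[m]
        H_beta[j, m] = (loglik(beta + h * em, h * ej) - loglik(beta - h * em, h * ej)
                        - loglik(beta + h * em, -h * ej) + loglik(beta - h * em, -h * ej)) / (4 * h * h)

print(np.allclose(H_beta, s * Vinv @ X, atol=1e-3))
```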
Perturbation scheme for the random effects covariance matrix. It is possible to assess the sensitivity of the model with respect to the random effects homoskedasticity assumption by perturbing the matrix $G$. Nobre (2004) suggested using $\operatorname{Var}[b_i] = \omega_i G$ as a perturbation scheme. In this case $\omega$ is a $q \times 1$ vector and $\omega_0 = 1_q$. The perturbed log-likelihood function is proportional to
$$L(\theta|\omega) = -\frac{1}{2}\sum_{i=1}^{k}\left\{\ln|V_i(\omega)| + (y_i - X_i\beta)'V_i(\omega)^{-1}(y_i - X_i\beta)\right\}.$$
Perturbation scheme for the weighted case. Verbeke (1995) and Lesaffre and Verbeke (1998) suggested perturbing the log-likelihood function as
$$L(\theta|\omega) = \sum_{i=1}^{k}\omega_i L_i(\theta).$$
Such a perturbation scheme is appropriate for measuring the influence of the $i$th subject through the normal curvature in its direction, given by
$$C_i = 2\,|d_i'H'\ddot L^{-1}Hd_i|,$$
where $d_i$ is a vector whose $i$th entry is 1 and whose remaining entries are zero. Verbeke (1995) showed that if $C_i$ has a high value, then the $i$th subject has great influence on the value of $\hat\theta$. A threshold of twice the mean value of all the $C_j$'s helps to decide whether or not the subject is influential.
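If attention is restricted to the fixed effects (variance components held fixed — a simplifying assumption for illustration), the curvature $C_i$ under the case-weight scheme takes a closed form: the $i$th column of $H$ is the subject-level score $g_i = X_i'V_i^{-1}(y_i - X_i\hat\beta)$ and $\ddot L = -X'V^{-1}X$, so $C_i = 2\,|g_i'(X'V^{-1}X)^{-1}g_i|$. A sketch on simulated data, including the twice-the-mean threshold:

```python
import numpy as np

rng = np.random.default_rng(6)
k, ni = 6, 5
sigma2, g = 1.0, 1.0
Xi = [np.column_stack([np.ones(ni), rng.normal(size=ni)]) for _ in range(k)]
Vi = sigma2 * np.eye(ni) + sigma2 * g * np.ones((ni, ni))  # per-subject covariance
Vi_inv = np.linalg.inv(Vi)
yi = [Xi[i] @ np.array([1.0, 0.5])
      + rng.normal(scale=np.sqrt(sigma2 * g)) * np.ones(ni)
      + rng.normal(scale=np.sqrt(sigma2), size=ni) for i in range(k)]

XtVX = sum(Xi[i].T @ Vi_inv @ Xi[i] for i in range(k))     # = -Lddot (fixed effects)
beta_hat = np.linalg.solve(XtVX, sum(Xi[i].T @ Vi_inv @ yi[i] for i in range(k)))

# Subject-level scores: the columns of H under the case-weight scheme.
scores = [Xi[i].T @ Vi_inv @ (yi[i] - Xi[i] @ beta_hat) for i in range(k)]
C = np.array([2 * abs(s @ np.linalg.solve(XtVX, s)) for s in scores])

flagged = np.where(C > 2 * C.mean())[0]   # threshold: twice the mean of the C_j
```

The scores sum to zero at $\hat\beta$ (the normal equations), which serves as a built-in sanity check.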
Lesaffre and Verbeke (1998) extracted some interpretable measures from $C_i$. In particular, they propose using $\|\tilde X_i'\tilde X_i\|^2$, $\|\tilde R_i\|^2$, $\|\tilde Z_i'\tilde Z_i\|^2$, $\|I_{n_i} - \tilde R_i\tilde R_i'\|^2$ and $\|V_i^{-1}\|^2$, where $\tilde X_i = V_i^{-1/2}X_i$, $\tilde Z_i = V_i^{-1/2}Z_i$ and $\tilde R_i = V_i^{-1/2}\hat e_i$, to evaluate the influence of the $i$th subject on the model parameter estimates. The actual interpretation of each of these terms can be seen in the original paper.
4.2.3 Conformal local influence

The $C_d$ measure proposed by Cook (1986) is not invariant to scale re-parametrization. To obtain a standardized and more comparable measure, Poon and Poon (1999) used, instead of the normal curvature, the conformal normal curvature, given by
$$B_d(\theta) = \frac{2\,|d'H'\ddot L^{-1}Hd|}{2\,\|H'\ddot L^{-1}H\|}.$$
It can be shown that $0 \le B_d(\theta) \le 1$ for every direction $d$ and that $B_d$ is invariant under conformal scale re-parametrization. A re-parametrization is said to be conformal if its Jacobian $J$ is such that $J'J = tI_s$, for some real $t$ and integer $s$. Poon and Poon (1999) showed that if $\lambda_1, \ldots, \lambda_l$ are the eigenvalues of the matrix $\ddot F$, with $v_1, \ldots, v_l$ the respective normalized eigenvectors, then the conformal normal curvature in the direction $v_i$ equals $\lambda_i/\sqrt{\sum_{j=1}^{l}\lambda_j^2}$, and $\sum_{i=1}^{l}B_{v_i}^2(\theta) = 1$. If every eigenvector had the same conformal normal curvature, its value would be $1/\sqrt{l}$; Poon and Poon (1999) proposed using this value as a reference for gauging the intensity of the local influence along an eigenvector. It can also be shown that the conformal normal curvature attains its maximum when $d$ has the direction of $d_{\max}$; in this sense, the normal curvature and the conformal normal curvature are equivalent.
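The eigenvalue characterization is easy to verify numerically: for any symmetric matrix standing in for $\ddot F$ (a random one is used below, purely for illustration), the conformal curvatures along the normalized eigenvectors are $\lambda_i/\sqrt{\sum_j\lambda_j^2}$ and their squares sum to one.

```python
import numpy as np

rng = np.random.default_rng(7)
l = 5
A = rng.normal(size=(l, l))
F = A + A.T                          # a symmetric stand-in for F-double-dot
lam, vecs = np.linalg.eigh(F)

# Conformal normal curvature along each normalized eigenvector.
B = lam / np.sqrt(np.sum(lam**2))
print(np.sum(B**2))

# If all curvatures were equal, each would be 1/sqrt(l).
equal_value = 1 / np.sqrt(l)
```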
5. Application
According to Frees et al. (1999), random coefficient models are equivalent to the Hachemeister linear regression model, which is used for the example data in Hachemeister (1975). The random coefficient model for the data in Table 1 may be described as
$$y_{ij} = \alpha_i + j\beta_i + e_{ij}, \qquad i = 1, \ldots, 5, \quad j = 1, \ldots, 12,$$
where $y_{ij}$ represents the average claim amount for state $i$ in the $j$th trimester, $\alpha_i = \alpha + a_i$ and $\beta_i = \beta + b_i$, with $\alpha$ and $\beta$ fixed, and $(a_i, b_i)' \sim N_2(0, D)$, in which $D$ is a $2 \times 2$ covariance matrix. In order to find a possibly simpler model, we used R to apply the asymptotic likelihood ratio test described in Giampaoli and Singer (2009) to compare the suggested random coefficients model with a random intercept model. The p-value obtained from the test was 0.0514, indicating that it may be enough to consider a random effect for the intercept only. This decision is also supported by the Bayesian information criterion (BIC), which equals 808.3 for the single random effect model and 811.6 for the model with two random effects. We could also use another set of tests, involving bootstrap, Monte Carlo and permutational methods, to investigate whether the random intercept model should be preferred; these tests may be seen in Crainiceanu and Ruppert (2004), Greven et al. (2008) and Fitzmaurice et al. (2007). However, this is far from our goals and is not discussed here. For the sake of simplicity, and based on the reasons presented, we shall use the random intercept model, which differs slightly from the model proposed by Frees et al. (1999). Thus, the model to be adjusted for the data in this example is
$$y_{ij} = \alpha_i + \beta j + \varepsilon_{ij}, \qquad i = 1, \ldots, 5, \quad j = 1, \ldots, 12, \qquad (4)$$
where $\alpha_i = \alpha + a_i$, with $\alpha$ and $\beta$ as defined before. Assume also that $\operatorname{Var}[\varepsilon_{ij}] = \sigma^2$ and $\operatorname{Var}[a_i] = \sigma_a^2$.
The model parameter estimates were obtained by the RML method using the lmer() function from the lme4 package in R. The standard errors were obtained from SAS (SAS Institute Inc., 2004) using proc MIXED. The estimates are shown in Table 3.

Table 3. Model parameter estimates.

Parameter   $\alpha$   $\beta$   $\sigma^2$   $\sigma_a^2$
Estimate    1460.32    32.41     32981.53     73398.25
SE           131.07     6.79      6347.17     24088.00
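From the estimates in Table 3 one can also compute the implied intraclass correlation, $\hat\sigma_a^2/(\hat\sigma_a^2 + \hat\sigma^2)$, i.e. the share of total variability attributable to between-state heterogeneity. The state-level EBLUPs $\hat a_i$ themselves are not reported in the text, so only the marginal regression line is reproduced here.

```python
# Estimates taken from Table 3 of the paper.
alpha_hat, beta_hat = 1460.32, 32.41
sigma2_hat, sigma2_a_hat = 32981.53, 73398.25

# Intraclass correlation: share of variance due to the state random effect.
icc = sigma2_a_hat / (sigma2_a_hat + sigma2_hat)

# Marginal regression line over the 12 trimesters.
fitted = [alpha_hat + beta_hat * j for j in range(1, 13)]
print(round(icc, 2))
```

A large intraclass correlation supports the graphical impression of strong between-state differences discussed next.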
Figure 3 shows the five conditional regression lines obtained from the linear mixed model given in Equation (4). The adjusted model clearly suggests that the claim amount is higher in state 1. It also suggests a similarity between the claim amounts of states 2 and 4. Besides that, we can expect a smaller risk from policies in state 5, since they are much closer to the respective adjusted conditional line. Further information is explored in the diagnostic analysis commented on next.
[Figure 3. Conditional regression lines: aggregate claim amount vs. observation, one adjusted line per state (states 1-5).]
5.1 Diagnostic analysis
The standardized residuals proposed by Nobre and Singer (2007) suggest that observation 4.7 (obtained from state 4 in the seventh trimester) may be considered an outlier, as shown in Figure 4(a). According to the QQ plot in Figure 4(b), it is reasonable to assume that the conditional errors are normally distributed. The Mahalanobis distance in Figure 4(c) was normalized to fit the interval [0, 1] and suggests that the first state may be an outlier. The measure $M_I$ proposed by Nobre and Singer (2007), shown in Figure 4(d) and also normalized, suggests that none of the states have outlying observations. The Mahalanobis distance should not be confused with $M_I$: the first is based on the EBLUP and the second on the conditional errors, and thus they have different meanings. For both analyses, an observation is highlighted if it is greater than twice the mean of the measures.
[Figure 4. Residual analysis: (a) standardized conditional residuals by state, flagging observation 4.7; (b) QQ plot of the standardized least confounded residuals against N(0,1) quantiles; (c) Mahalanobis distance by state; (d) values for $M_I$ by state.]
The conditional Cook distance is shown in Figure 5. The distances were normalized for comparison. Figure 5(a) suggests that observation 4.7 is influential in the model estimates. The first term of the distance decomposition suggests that no observation is influential in the estimation of the fixed effects, as shown in Figure 5(b). The second term of the decomposition suggests that observation 4.7 is potentially influential in the prediction of $b$, as seen in Figure 5(c). The last term, $D_{i3}$, is as close to zero as expected and is omitted.
[Figure 5. (a) Conditional Cook distance by state, flagging observation 4.7; (b) $D_{i1}$; (c) $D_{i2}$, flagging observation 4.7.]
Figure 6 shows the local influence analysis using three different perturbation schemes. The first, in Figure 6(a), is related to the conditional errors covariance matrix, as suggested in Beckman et al. (1987), and indicates that the observations from the fourth state, especially 4.7, are possibly influential with respect to the homoskedasticity and independence assumption for the conditional errors. Notice that the influence of observation 4.7 can be explained by analyzing Figure 2: this observation has a value considerably higher than the others from the same state. Figure 6(b) shows the perturbation scheme for the covariance matrix associated with the random effects, as in Nobre (2004); alternative perturbation schemes for this case can be seen in Beckman et al. (1987). This scheme suggests that all states are equally influential in the random effects covariance matrix estimate. Finally, there is evidence that the observations in the fourth state may not be well predicted by the model; see Figure 6(c).
After the diagnostics we proceed to a confirmatory analysis by removing the observations from states 1 and 4, one at a time and then both at the same time. The new estimates are shown in Table 4. For each parameter $\theta$, we calculate the relative change in the estimated values, defined as
$$RC(\theta) = \left|\frac{\hat\theta - \hat\theta_{(I)}}{\hat\theta}\right| \times 100\%,$$
where $\hat\theta_{(I)}$ denotes the estimate obtained without the removed observations.
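The relative changes reported in Table 4 follow directly from this definition; for instance, for $\hat\beta$ after removing state 1 (values taken from Tables 3 and 4):

```python
def relative_change(full, deleted):
    """RC = |(theta_hat - theta_hat_(I)) / theta_hat| * 100, in percent."""
    return abs((full - deleted) / full) * 100

# beta_hat: 32.41 on the complete data, 25.26 without state 1 (Table 4).
rc_beta = relative_change(32.41, 25.26)
print(round(rc_beta, 2))
```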
[Figure 6. Perturbation schemes: (a) absolute values of the components of $d_{\max}$ for the conditional covariance matrix, flagging observations 4.7, 4.1, 4.11, 4.12 and 1.2; (b) absolute values of the components of $d_{\max}$ for the random effects covariance matrix; (c) values for $\|I_{n_i} - \tilde R_i\tilde R_i'\|^2$.]
Table 4. Estimates and relative changes for the parameter estimates of the model given in Equation (4), with and without states 1 and 4.

Situation                $\hat\alpha$        $\hat\beta$       $\hat\sigma^2$       $\hat\sigma_a^2$
Complete data            1460.32             32.41             32981.53             73398.25
Without state 1          1408.63 (3.67%)     25.26 (22.06%)    34666.31 (5.11%)     34335.64 (53.22%)
Without state 4          1530.94 (4.61%)     33.50 (3.36%)     24940.12 (24.38%)    59214.50 (19.32%)
Without states 1 and 4   1485.56 (1.70%)     24.32 (24.96%)    24497.48 (25.72%)    23707.07 (67.70%)
If all five states were equally influential, we would expect the value of RC to lie around $1/5 = 20\%$ after removing a state. If $RC(\theta)$ exceeds twice this value, that is, 40%, for some parameter, we consider the state potentially influential. It is possible to conclude that the observations from state 1 were influential in the random effect variance estimate. From Figure 2, one can explain this influence by noticing that all the observations from state 1 had higher values than those of the other states. Notice that such influence was not detected in Figure 5(b), but was pointed out by the Mahalanobis distance in Figure 4(c). Removing state 1 from the analysis and running every diagnostic procedure again, we detect no excessive influence; the only remaining issue is observation 4.7, which is still an outlier. From this result the model is validated, and it is assumed to be robust and ready for use.
6. Conclusions
The use of linear mixed models in actuarial science should be encouraged, given their capability to model the within-subject correlation, their flexibility, and the availability of diagnostic tools. Insurers should not use a model without first validating it. For the specific example seen here, the decision makers may consider a different approach for state 1: after removing the observations from state 1, there was a relative change of more than 50% in the random effect variance estimate, which reflects significantly on the premium estimate. Such an analysis would not be possible in the traditional credibility models approach. This illustrates how the model can be used to identify different sources of risk and can be employed in portfolio management. Linear mixed models are also usually easier to understand and to present than standard actuarial methods, such as credibility models and the Bayesian approach to determining the fair premium. The natural extension of this work is to repeat the estimation and diagnostic procedures, adapting what is necessary, for generalized linear mixed models, which are also useful in actuarial science; some work has already been done in this area, see, e.g., Antonio and Beirlant (2006). It would also be interesting to continue the analysis of the example in Hachemeister (1975), applying the diagnostic procedures again when weights are introduced into the covariance matrix of the conditional residuals in the random coefficient models, and to evaluate the robustness of the linear mixed models equivalent to the other classic credibility models. Again, this care is justified because the fairest premium is the most competitive in the market.
Appendix
We present here expressions for the matrix $H$ and for the derivatives appearing in the different perturbation schemes presented in Section 4.2.2. These calculations are taken from Nobre (2004) and are reproduced here to make the text more self-contained.
Appendix A. Perturbation Scheme for the Covariance Matrix of the
Conditional Errors
Let $H_{(k)}$ be the $k$th column of $H$ and let $f$ be the number of distinct components of the matrix $D$. Then
$$H_{(k)} = \left(\frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\beta'},\ \frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\sigma^2},\ \frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\theta_1},\ \ldots,\ \frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\theta_f}\right)',$$
where
$$\frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\beta}\bigg|_{\omega=\omega_0} = X'\mathcal{D}_k\hat e,$$
$$\frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\sigma^2}\bigg|_{\omega=\omega_0} = -\frac{1}{2}\left\{\sigma^{-2}\operatorname{tr}\left(\mathcal{D}_k Z D Z'\right) - 2\,\hat e'\mathcal{D}_k V^{-1}\hat e + \sigma^{-2}\hat e'\mathcal{D}_k\hat e\right\},$$
$$\frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\theta_i}\bigg|_{\omega=\omega_0} = -\frac{1}{2}\left\{\operatorname{tr}\left(\mathcal{D}_k Z\dot D_i Z'\right) - 2\,\hat e'\mathcal{D}_k Z\dot D_i Z'\hat e\right\},$$
with
$$\mathcal{D}_k = \frac{\partial V^{-1}(\omega)}{\partial\omega_k}\bigg|_{\omega=\omega_0} = -\sigma^2 V^{(k)}(V^{(k)})', \qquad \dot D_i = \frac{\partial D}{\partial\theta_i}\bigg|_{\omega=\omega_0},$$
and $V^{(k)}$ representing the $k$th column of $V^{-1}$.
Appendix B. Perturbation Scheme for the Response
It can be shown that
$$\frac{\partial^2 L(\theta|\omega)}{\partial\omega\,\partial\beta'}\bigg|_{\omega=\omega_0} = sV^{-1}X,$$
$$\frac{\partial^2 L(\theta|\omega)}{\partial\omega\,\partial\sigma^2}\bigg|_{\omega=\omega_0} = sV^{-1}V^{-1}\hat e,$$
$$\frac{\partial^2 L(\theta|\omega)}{\partial\omega\,\partial\theta_i}\bigg|_{\omega=\omega_0} = sV^{-1}Z\dot D_i Z'V^{-1}\hat e,$$
implying
$$H = sV^{-1}\left(X,\ V^{-1}\hat e,\ Z\dot D_1 Z'V^{-1}\hat e,\ \ldots,\ Z\dot D_f Z'V^{-1}\hat e\right).$$
Appendix C. Perturbation Scheme for the Random Effects Covariance
Matrix
For this scheme we have
$$H_{(k)} = \left(\frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\beta'},\ \frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\sigma^2},\ \frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\theta_1},\ \ldots,\ \frac{\partial^2 L(\theta|\omega)}{\partial\omega_k\,\partial\theta_f}\right)'.$$
It can be shown that
$$\frac{\partial^2 L(\theta|\omega_k)}{\partial\omega_k\,\partial\beta}\bigg|_{\omega=\omega_0} = X_k'V_k^{-1}Z_kGZ_k'V_k^{-1}\hat e_k,$$
$$\frac{\partial^2 L(\theta|\omega_k)}{\partial\omega_k\,\partial\sigma^2}\bigg|_{\omega=\omega_0} = \operatorname{tr}\left(V_k^{-1}Z_kGZ_k'\right) - 2\,\hat e_k'V_k^{-1}Z_kGZ_k'V_k^{-1}V_k^{-1}\hat e_k,$$
$$\frac{\partial^2 L(\theta|\omega_k)}{\partial\omega_k\,\partial\theta_i}\bigg|_{\omega=\omega_0} = \operatorname{tr}\left(V_k^{-1}Z_kGZ_k'V_k^{-1}Z_k\dot G_i Z_k'\right) - \hat e_k'V_k^{-1}Z_kGZ_k'V_k^{-1}Z_k\dot G_i Z_k'V_k^{-1}\hat e_k,$$
where $\dot G_i = \partial G/\partial\theta_i$.
Acknowledgements
We are grateful to Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq project # 564334/2008-1) and Fundação Cearense de Apoio ao Desenvolvimento Científico e Tecnológico (FUNCAP), Brazil, for partial financial support. We also thank an anonymous referee and the executive editor for their careful and constructive review.
References
Antonio, K., Beirlant, J., 2006. Actuarial statistics with generalized linear mixed models. Insurance: Mathematics and Economics, 75, 643-676.
Banerjee, M., Frees, E.W., 1997. Influence diagnostics for linear longitudinal models. Journal of the American Statistical Association, 92, 999-1005.
Beckman, R.J., Nachtsheim, C.J., Cook, R.D., 1987. Diagnostics for mixed-model analysis of variance. Technometrics, 29, 413-426.
Belsley, D.A., Kuh, E., Welsch, R.E., 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York.
Bühlmann, H., 1967. Experience rating and credibility. ASTIN Bulletin, 4, 199-207.
Chatterjee, S., Hadi, A.S., 1986. Influential observations, high leverage points, and outliers in linear regression (with discussion). Statistical Science, 1, 379-393.
Chatterjee, S., Hadi, A.S., 1988. Sensitivity Analysis in Linear Regression. John Wiley & Sons, New York.
Christensen, R., Pearson, L.M., 1992. Case-deletion diagnostics for mixed models. Technometrics, 34, 38-45.
Cook, R.D., 1977. Detection of influential observation in linear regression. Technometrics, 19, 15-28.
Cook, R.D., 1986. Assessment of local influence (with discussion). Journal of The Royal Statistical Society Series B - Statistical Methodology, 48, 117-131.
Cook, R.D., Weisberg, S., 1982. Residuals and Influence in Regression. Chapman and Hall, London.
Crainiceanu, C.M., Ruppert, D., 2004. Likelihood ratio tests in linear mixed models with one variance component. Journal of The Royal Statistical Society Series B - Statistical Methodology, 66, 165-185.
Dannenburg, D.R., Kaas, R., Goovaerts, M.J., 1996. Practical actuarial credibility models. Institute of Actuarial Science and Economics, University of Amsterdam, Amsterdam.
Demidenko, E., 2004. Mixed Models - Theory and Applications. Wiley, New York.
Demidenko, E., Stukel, T.A., 2005. Influence analysis for linear mixed-effects models. Statistics in Medicine, 24, 893-909.
Diggle, P.J., Heagerty, P., Liang, K.Y., Zeger, S.L., 2002. Analysis of Longitudinal Data. Oxford Statistical Science Series.
Draper, N.R., Smith, H., 1998. Applied Regression Analysis. Wiley, New York.
Dutang, C., Goulet, V., Pigeon, M., 2008. actuar: an R package for actuarial science. Journal of Statistical Software, 25, 1-37.
Fitzmaurice, G.M., Lipsitz, S.R., Ibrahim, J.G., 2007. A note on permutation tests for variance components in multilevel generalized linear mixed models. Biometrics, 63, 942-946.
Frees, E.W., Young, V.R., Luo, Y., 1999. A longitudinal data analysis interpretation of credibility models. Insurance: Mathematics and Economics, 24, 229-247.
Fung, W.K., Zhu, Z.Y., Wei, B.C., He, X., 2002. Influence diagnostics and outliers tests for semiparametric mixed models. Journal of The Royal Statistical Society Series B - Statistical Methodology, 64, 565-579.
Giampaoli, V., Singer, J., 2009. Restricted likelihood ratio testing for zero variance components in linear mixed models. Journal of Statistical Planning and Inference, 139, 1435-1448.
Greven, S., Crainiceanu, C.M., Küchenhoff, H., Peters, A., 2008. Likelihood ratio tests for variance components in linear mixed models. Journal of Computational and Graphical Statistics, 17, 870-891.
Gumedze, F.N., Welham, S.J., Gogel, B.J., Thompson, R., 2010. A variance shift model for detection of outliers in the linear mixed model. Computational Statistics and Data Analysis, 54, 2128-2144.
Hachemeister, C.A., 1975. Credibility for regression models with application to trend. Proceedings of the Berkeley Actuarial Research Conference on Credibility, pp. 129-163.
Hilden-Minton, J.A., 1995. Multilevel diagnostics for mixed and hierarchical linear models. Ph.D. Thesis. University of California, Los Angeles.
Hoaglin, D.C., Welsch, R.E., 1978. The hat matrix in regression and ANOVA. The American Statistician, 32, 17-22.
Johnson, R.A., Whichern, D.W., 1982. Applied Multivariate Statistical Analysis. Sixth edition. Prentice Hall. pp. 273-332.
Lesaffre, E., Verbeke, G., 1998. Local influence in linear mixed models. Biometrics, 54, 570-582.
Liang, K.Y., Zeger, S.L., 1986. Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.
McCulloch, C.E., Searle, S.R., 2001. Generalized, Linear, and Mixed Models. Wiley, New York.
Nobre, J.S., 2004. Métodos de diagnóstico para modelos lineares mistos. Unpublished Master's Thesis (in Portuguese). IME/USP, São Paulo.
Nobre, J.S., Singer, J.M., 2007. Residual analysis for linear mixed models. Biometrical Journal, 49, 863-875.
Nobre, J.S., Singer, J.M., 2011. Leverage analysis for linear mixed models. Journal of Applied Statistics, 38, 1063-1072.
Patterson, H.D., Thompson, R., 1971. Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545-554.
Poon, W.Y., Poon, Y.S., 1999. Conformal normal curvature and assessment of local influence. Journal of The Royal Statistical Society Series B - Statistical Methodology, 61, 51-61.
R Development Core Team, 2009. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
SAS Institute Inc., 2004. SAS 9.1.3 Help and Documentation. SAS Institute Inc., Cary, North Carolina.
Tan, F.E.S., Ouwens, M.J.N., Berger, M.P.F., 2001. Detection of influential observations in longitudinal mixed effects regression models. The Statistician, 50, 271-284.
Verbeke, G., 1995. The linear mixed model. A critical investigation in the context of longitudinal data analysis. Ph.D. Thesis. Catholic University of Leuven, Faculty of Science, Department of Mathematics, Leuven, Belgium.
Verbeke, G., Molenberghs, G., 2000. Linear Mixed Models for Longitudinal Data. Springer.
Vonesh, E.F., Chinchilli, V.M., 1997. Linear and Nonlinear Models for the Analysis of Repeated Measurements. Marcel Dekker, New York.
Waternaux, C., Laird, N.M., Ware, J.H., 1989. Methods for analysis of longitudinal data: blood-lead concentrations and cognitive development. Journal of the American Statistical Association, 84, 33-41.
Zewotir, T., Galpin, J.S., 2005. Influence diagnostics for linear mixed models. Journal of Data Science, 3, 153-177.
Chilean Journal of Statistics
Vol. 3, No. 1, April 2012, 75-91

Statistical Modeling
Research Paper

Real estate appraisal of land lots using GAMLSS models

Lutemberg Florencio¹, Francisco Cribari-Neto² and Raydonal Ospina²

¹ Banco do Nordeste do Brasil S.A., Boa Vista, Recife/PE, 50060-004, Brazil
² Departamento de Estatística, Universidade Federal de Pernambuco, Recife/PE, Brazil

(Received: 17 May 2011 - Accepted in final form: 24 September 2011)
Abstract

The valuation of real estates is of extreme importance for decision making. Their singular characteristics make valuation through hedonic pricing methods difficult, since the theory specifies neither the correct regression functional form nor which explanatory variables should be included in the hedonic equation. In this article, we perform real estate appraisal using a class of regression models proposed by Rigby and Stasinopoulos (2005) called generalized additive models for location, scale and shape (GAMLSS). Our empirical analysis shows that these models seem to be more appropriate for estimation of the hedonic prices function than the regression models currently used to that end.

Keywords: Cubic splines - GAMLSS - Regression models - Hedonic prices function - Nonparametric smoothing - Semiparametric models.

Mathematics Subject Classification: Primary 62P25 - Secondary 62P20, 62P05.
1. Introduction

The real estate, apart from being a consumer good that provides comfort and social status, is one of the economic pillars of all modern societies. It has become a form of stock capital, given the expectations of increasing prices, and a means of obtaining financial gains through rental revenues and sale profits. As a consequence, the real estate market value has become a parameter of extreme importance.
The estimation of a real estate value is usually done through a hedonic pricing equation, according to the methodology proposed by Rosen (1974). The real estate is seen as a heterogeneous good comprised of a set of characteristics, and it is then important to estimate an explicit function, called the hedonic price function, that determines which are the most influential attributes, or attribute package, when it comes to determining its price. However, the estimation of a hedonic equation is not a trivial task, since the theory determines neither the exact functional form nor the relevant conditioning variables.
A GAMLSS assumes that the observations y_i, i = 1, ..., n, are independent with density f(y_i | \theta_i), where \theta_i = (\theta_{i1}, ..., \theta_{ip}) is a vector of p distribution parameters. For k = 1, ..., p, let g_k(\cdot) denote a known monotonic link function relating the parameter vector \theta_k to the predictor \eta_k:

g_k(\theta_k) = \eta_k = X_k \beta_k + \sum_{j=1}^{J_k} Z_{jk} \gamma_{jk},   (1)

where \theta_k and \eta_k are n x 1 vectors, \beta_k = (\beta_{1k}, ..., \beta_{J_k k})^T is a parameter vector, and X_k and Z_{jk} are fixed (covariate) design matrices of orders n x J_k and n x q_{jk}, respectively. Finally, \gamma_{jk} is a q_{jk}-dimensional random variable. The model given in Equation (1) is called GAMLSS; see Rigby and Stasinopoulos (2005).
In many practical situations it suffices to model four parameters (p = 4), usually location (\theta_1 = \mu), scale (\theta_2 = \sigma), skewness (\theta_3 = \nu) and kurtosis (\theta_4 = \tau); the latter two are said to be shape parameters. Thus, we have the models

Location and scale parameters:
g_1(\mu) = \eta_1 = X_1 \beta_1 + \sum_{j=1}^{J_1} Z_{j1} \gamma_{j1};
g_2(\sigma) = \eta_2 = X_2 \beta_2 + \sum_{j=1}^{J_2} Z_{j2} \gamma_{j2};

Shape parameters:
g_3(\nu) = \eta_3 = X_3 \beta_3 + \sum_{j=1}^{J_3} Z_{j3} \gamma_{j3};
g_4(\tau) = \eta_4 = X_4 \beta_4 + \sum_{j=1}^{J_4} Z_{j4} \gamma_{j4}.
It is also possible to add to the predictors functions h_{jk} that involve smoothers like cubic splines, penalized splines, fractional polynomials, loess curves, variable coefficient terms, and others. Any combination of these functions can be included in the submodels for \mu, \sigma, \nu and \tau. As Akantziliotou et al. (2002) pointed out, the GAMLSS framework can be applied to the parameters of any population distribution and generalized to allow the modeling of more than four parameters.
GAMLSS models can be estimated using the gamlss package for R (see Ihaka and Gentleman, 1996; Cribari-Neto and Zarkos, 1999), which is free software; see http://www.R-project.org. Practitioners can then choose from more than 50 response distributions.
2.1 Estimation

Two aspects are central to the fitting of the GAMLSS additive components, namely: the backfitting algorithm and the fact that quadratic penalties in the likelihood function follow from the assumption that all random effects in the linear predictor are normally distributed. Suppose that the random effects \gamma_{jk} in the model given in Equation (1) are independent and normally distributed with \gamma_{jk} ~ N_{q_{jk}}(0, G_{jk}^{-1}), where G_{jk}^{-1} is the q_{jk} x q_{jk} (generalized) inverse of the symmetric matrix G_{jk} = G_{jk}(\lambda_{jk}). Rigby and Stasinopoulos (2005) noted that for fixed values of \lambda_{jk}, one can estimate \beta_k and \gamma_{jk} by maximizing the penalized log-likelihood function

\ell_p = \ell - (1/2) \sum_{k=1}^{p} \sum_{j=1}^{J_k} \gamma_{jk}^T G_{jk} \gamma_{jk},

where \ell = \sum_{i=1}^{n} \log f(y_i | \theta_i) is the log-likelihood function of the data given \theta_i, for i = 1, ..., n. This can be accomplished by using a backfitting algorithm; for details, see Rigby and Stasinopoulos (2005, 2007), Hastie and Tibshirani (1990) and Härdle et al. (2004).
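The penalized objective above is easy to illustrate numerically. The gamlss software itself is an R package, so the Python fragment below is only a toy sketch: it evaluates \ell_p for a normal response with one random-effect term, with made-up data and an identity penalty matrix (all inputs are hypothetical, not the paper's implementation).

```python
import numpy as np

def penalized_loglik(y, mu, sigma, gammas, Gs):
    """Penalized log-likelihood l_p = l - (1/2) * sum_j gamma_j' G_j gamma_j
    for a normal response; gammas/Gs are lists of random-effect vectors and
    their (symmetric) penalty matrices."""
    # Gaussian log-likelihood of the data given the fitted means
    ll = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((y - mu) / sigma) ** 2)
    # quadratic penalty contributed by each random-effect term
    penalty = 0.5 * sum(g @ G @ g for g, G in zip(gammas, Gs))
    return ll - penalty

# toy example: one smooth term with a 3-dimensional random effect
y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.9, 3.2])
gam = np.array([0.5, -0.2, 0.1])
G = np.eye(3)  # identity penalty matrix
lp = penalized_loglik(y, mu, 1.0, [gam], [G])
```

Maximizing this quantity over \beta_k and \gamma_{jk} (backfitting) is what the fitting algorithm automates.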
2.2 Model selection and diagnostics

GAMLSS model selection is performed by comparing various competing models in which different combinations of the components M = {D, G, T, \lambda} are used, where D specifies the distribution of the response variable, G is the set of link functions (g_1, ..., g_p) for the parameters (\theta_1, ..., \theta_p), T defines the set of predictor terms (t_1, ..., t_p) for the predictors (\eta_1, ..., \eta_p) and \lambda specifies the set of hyperparameters.
In the parametric GAMLSS regression setting, each nested model M can be assessed from its fitted global deviance (GD), given by GD = -2 \ell(\hat\theta), where \ell(\hat\theta) = \sum_{i=1}^{n} \ell(\hat\theta_i). Two nested competing GAMLSS models M_0 and M_1, with fitted global deviances GD_0 and GD_1 and error degrees of freedom (df), namely df_{e0} and df_{e1}, respectively, can be compared using the generalized likelihood ratio test statistic \Lambda = GD_0 - GD_1, which is asymptotically distributed as \chi^2_d with d = df_{e0} - df_{e1} under M_0. For each model M, the number of error df, namely df_e, is df_e = n - \sum_{k=1}^{p} df_k, where df_k are the df used in the predictor of the model for the parameter \theta_k, for k = 1, ..., p.
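As a numerical illustration of this test, the sketch below (a hedged Python example; the deviances and df counts are invented, not values from the paper) refers the statistic \Lambda = GD_0 - GD_1 to a chi-square distribution with d degrees of freedom.

```python
from scipy.stats import chi2

def gamlss_lrt(gd0, gd1, dfe0, dfe1):
    """Generalized LR test for nested GAMLSS models:
    statistic = GD0 - GD1, asymptotically chi-square with d = dfe0 - dfe1 df
    under the smaller model M0."""
    stat = gd0 - gd1
    d = dfe0 - dfe1
    pvalue = chi2.sf(stat, d)  # upper-tail probability
    return stat, d, pvalue

# hypothetical fits: the larger model lowers the deviance by 12.3 at a cost of 3 df
stat, d, p = gamlss_lrt(19100.0, 19087.7, 2090, 2087)
```

Here the drop in deviance is large relative to the chi-square(3) reference, so the richer model would be preferred at the 1% level.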
When comparing non-nested GAMLSS models (including models with smoothing terms), the generalized Akaike information criterion (GAIC) (see Akaike, 1983) can be used to penalize overfitting. This is achieved by adding to the fitted global deviance a fixed penalty # for each effective df used in the model, that is, GAIC(#) = GD + # df, where GD is the fitted global deviance. One then selects the model with the smallest GAIC(#) value.
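The selection rule is a one-liner; the following Python sketch (the deviance and effective-df figures are hypothetical, merely chosen to be of the magnitude seen in this kind of fit) shows how the special cases # = 2 (AIC) and # = log(n) (BIC) arise.

```python
import math

def gaic(gd, df, penalty):
    """Generalized AIC: fitted global deviance plus a fixed penalty per
    effective degree of freedom. penalty=2 gives the AIC; penalty=log(n)
    gives the BIC."""
    return gd + penalty * df

# hypothetical fits: model m1 fits better (lower GD) but spends more effective df
n = 2109
fits = {"m1": (18684.0, 69), "m2": (19083.0, 36)}
aic = {m: gaic(gd, df, 2) for m, (gd, df) in fits.items()}
bic = {m: gaic(gd, df, math.log(n)) for m, (gd, df) in fits.items()}
best = min(aic, key=aic.get)  # model with the smallest GAIC(2)
```

With these numbers m1 wins under both penalties; a harsher penalty # can reverse such rankings when the deviance gap is small.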
To assess the overall adequacy of the fitted model, we propose the randomized quantile residual; see Dunn (1996). This is a randomized version of the residual proposed by Cox and Snell (1989), defined as

r_i^q = \Phi^{-1}(u_i),  i = 1, ..., n,

where \Phi(\cdot) denotes the standard normal distribution function and u_i is a uniform random variable on the interval (a_i, b_i], with a_i = \lim_{y \uparrow y_i} F(y | \hat\theta_i) and b_i = F(y_i | \hat\theta_i). A plot of these residuals against the index of the observations (i) should show no detectable pattern. A detectable trend in the plot of some residual against the predictors may be suggestive of link function misspecification.
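For a continuous response the interval (a_i, b_i] collapses to the single point F(y_i | \hat\theta_i), so no randomization is actually needed. The self-contained Python sketch below (simulated gamma data, not the paper's land-lot sample) illustrates the construction: if the model is well specified, the residuals should look standard normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.gamma(shape=2.0, scale=3.0, size=500)  # simulated positive response

# fit a gamma distribution (location fixed at 0) by maximum likelihood
shape, _, scale = stats.gamma.fit(y, floc=0)

u = stats.gamma.cdf(y, shape, scale=scale)  # u_i = F(y_i | theta_hat)
r = stats.norm.ppf(u)                       # r_i^q = Phi^{-1}(u_i)

# under a correct specification r is approximately N(0, 1)
ks = stats.kstest(r, "norm")
```

In practice one would plot r against the observation index and against each predictor, as described above.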
Also, normal probability plots with simulated envelopes (see Atkinson, 1985) or worm plots (see Buuren and Fredriks, 2001) are a helpful diagnostic tool. The worm plots are useful for analyzing the residuals in different regions (intervals) of the explanatory variable. If no explanatory variable is specified, the worm plot becomes a detrended normal QQ plot of the (normalized quantile) residuals. When all points lie inside the (dotted) confidence bands (the two elliptical curves) there is no evidence of model misspecification.
In the context of a fully parametric GAMLSS model we can use pseudo-R^2 measures. For example, R_p^2 = 1 - \log\hat{L} / \log\hat{L}_0 (see McFadden, 1974) and R_{LR}^2 = 1 - (\hat{L}_0 / \hat{L})^{2/n} (see Cox and Snell, 1989, pp. 208-209), where \hat{L}_0 and \hat{L} are the maximized likelihood functions of the null (intercept only) and fitted (unrestricted) models, respectively. The ratio of the likelihoods or log-likelihoods may be regarded as a measure of the improvement, over the intercept-only model, achieved by the model under investigation.
Our proposal, however, is to compare the different models using the pseudo-R^2 given by the square of the sample correlation coefficient between the response and the fitted values. Notice that by doing so we can consider both fully parametric models and models that include nonparametric components. We can also compare the explanatory power of a GAMLSS model to those of GLM and CNLRM models. This is the pseudo-R^2 we use; it was introduced by Ferrari and Cribari-Neto (2004) in the context of beta regressions and is a straightforward generalization of the R^2 measure used in linear regression analysis.
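This measure is straightforward to compute. A minimal Python sketch (the vectors below are toy values, purely illustrative):

```python
import numpy as np

def pseudo_r2(y, yhat):
    """Squared sample correlation between observed and fitted values,
    in the spirit of Ferrari and Cribari-Neto (2004)."""
    return np.corrcoef(y, yhat)[0, 1] ** 2

# perfect linear agreement gives pseudo-R2 = 1 (correlation is scale-invariant)
y = np.array([2.0, 4.0, 6.0, 8.0])
assert abs(pseudo_r2(y, 2 * y + 1) - 1.0) < 1e-12

# noisy fitted values give a value below 1
yhat = y + np.array([0.5, -0.4, 0.3, -0.2])
r2 = pseudo_r2(y, yhat)
```

Because it only requires observed and fitted responses, the same number can be computed for CNLRM, GLM and semiparametric GAMLSS fits alike, which is what makes the comparison in the sequel possible.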
3. Data Description
The data contain 2,109 observations on empty urban land lots located in the city of Aracaju, capital of the state of Sergipe (SE), Brazil, and come from two sources: (i) data collected by the authors from real estate agencies, advertisements in newspapers and research on location (land lots for sale or already sold); (ii) data obtained from the Departamento de Cadastro Imobiliário da Prefeitura de Aracaju. Observations cover the years 2005, 2006 and 2007. Each land lot price was recorded only once during that period. It is also noteworthy that the land lots in the sample are geographically referenced relative to the South American Datum (1) and have their geographical positions (latitude, longitude) projected onto the Universal Transverse Mercator (UTM) coordinate system (2).
The sample used to estimate the hedonic prices equation (i.e., the equation of hedonic prices of urban land lots in Aracaju-SE) contains, besides the year of reference, information on the physical (area, front, topography, pavement and block position), location (neighborhood, geographical coordinates, utilization coefficient and type of street in which the land lot is located) and economic (nature of the information that generated the observation, average income of the head of household of the censitary sector where the land is located, and the land lot price) characteristics of the land lots. In particular, we use the variables

YEAR (YR): qualitative ordinal variable that identifies the year in which the information was obtained. It assumes the values 2005, 2006 (YR06) and 2007 (YR07). It enters the model through dummy variables;

AREA (AR): continuous quantitative variable, measured in m^2 (square meters), relative to the projection on a horizontal plane of the land surface;

FRONT (FR): continuous quantitative variable, measured in m (meters), concerning the projection of the land lot front over a line which is perpendicular to one of the lot boundaries, when both are oblique in the same sense, or to the chord, in the case of curved fronts;

TOPOGRAPHY (TO): nominal qualitative variable that relates to the topographical conformations of the land lot. It is classified as plain when the land acclivity is smaller than 10% or its declivity is smaller than 5%, and as rough otherwise. It is a dummy variable that equals 1 for plain and 0 for rough;
(1) The South American Datum (SAD) is the regional geodesic system for South America and refers to the mathematical representation of the Earth surface at sea level.
(2) Cylindrical cartographic projection of the terrestrial spheroid in 60 secant cylinders at Earth level alongside the meridians, in multiple zones of 6 degrees longitude, stretching from 80 degrees South latitude to 84 degrees North latitude.
PAVEMENT (PA): nominal qualitative variable that indicates the presence or absence of pavement (concrete, asphalt, among others) on the street in which the main land lot front is located. It enters the model as a dummy variable that equals 1 when the land lot is located on a paved street and 0 otherwise;

SITUATION (SI): nominal qualitative variable used to differentiate the disposition of the land lot on the block. It is classified as corner lot or middle lot. It is a dummy variable that assumes value 1 for corner lots and 0 for all other land lots;

NEIGHBORHOOD (NB): nominal qualitative variable referring to the name of the neighborhood where the land lot is located. It was categorized as valuable (highly priced) neighborhoods and other neighborhoods, with the variable shown as VN and regarded as a dummy (1 for valuable neighborhoods). The neighborhoods were also grouped as belonging or not belonging to the city South Zone, dummy denoted by SZ (1 for South Zone);

LATITUDE (LAT) and LONGITUDE (LON): continuous quantitative variables corresponding to the geographical position of the land lot at the point z = (LAT, LON), where LAT and LON are the coordinates measured in UTM;

UTILIZATION COEFFICIENT (UC): discrete variable given by a number that, when multiplied by the area of the land lot, yields the maximal area (in square meters) available for construction. UC is defined in an official urban development document. It assumes the values 3.0, 3.5, . . . , 5.5, 6.0;

STREET (STR): ordinal qualitative variable used to differentiate the land lot location relative to streets and avenues. It is classified as minor arterial (STR1), collector street (STR2) and local street, according to the importance of the street where the land lot is located. It enters the model as dummy variables;

NATURE OF THE INFORMATION (NI): nominal qualitative variable that indicates whether the observation is derived from offer, transaction or from the Aracaju register office (real estate sale taxes). It enters the model through dummy variables;

SECTOR (ST): discrete quantitative proxy variable of macrolocation used to socioeconomically distinguish the various neighborhoods, represented by the average income of the head of household, in minimum wages, according to the IBGE census (2000). The neighborhood average income functions as a proxy for other characteristics, such as urban amenities. It assumes the values 1, . . . , 18;

FRONT IN HIGHLY VALUED NEIGHBORHOODS (FRVN): continuous quantitative variable that assumes strictly positive values and corresponds to the interaction between the FR and VN variables. It is included in the model to capture the influence of land lot front dimensions in valuable neighborhoods;

UNIT PRICE (UP): continuous quantitative variable that assumes strictly positive values and corresponds to the land lot price divided by its area, measured in R$/m^2 (reais per square meter).
In real estate appraisals (and specifically in land lot valuations), the interest typically lies in modeling the unit price as a function of the underlying structural, location and economic characteristics of the real estate. We then use UP as the dependent variable (response). The independent variables relate to the location (NB, VN, SZ, LAT, LON, ST, UC and STR), physical (AR, FR, TO, SI and FRVN) and economic (NI) land lot characteristics; we also account for the year of data collection.

Figure 1 presents box-plots of UP, AR and FR, and Table 1 displays summary statistics on those variables. The box-plot of UP shows that its distribution is skewed and that there are several extreme observations. Notice from Table 1 that the sample values of UP range from R$ 2.36/m^2 to R$ 800.00/m^2 and that 75% of the land lots have unit prices smaller than R$ 82.82/m^2.
We note that 263 extreme observations have been identified from the box-plot of AR (see Figure 1). These observations are not in error; they appear as outlying data points in the plot because the variable assumes a quite wide range of values, from 41 m^2 to 91,780 m^2, that is, the largest land lot is nearly two thousand times larger than the smallest one.
Figure 1. Box-plots of UP, AR and FR.
Table 1. Descriptive statistics.
Variable Mean Median Standard error Minimum Maximum Range
UP 72.82 55.56 70.28 2.36 800.00 797.64
LAT 710100.00 710300.00 2722.34 701500.00 714600.00 13100.00
LON 8787000.00 8786000.00 6638.77 8769000.00 8798000.00 29000.00
AR 1355.00 300.00 6063.53 48.00 91780.00 91732.00
FR 18.13 10.00 30.54 2.60 516.00 513.40
In order to investigate how UP relates to some explanatory variables, we produced dispersion plots. Figure 2 contains the pairwise plots: (i) UP x LAT; (ii) UP x LON; (iii) log(UP) x log(AR); (iv) log(UP) x log(FR); (v) UP x ST and (vi) UP x UC. It shows that there is a direct relationship between UP and the corresponding regressor in (i), (ii), (v) and (vi), whereas in (iii) and (iv) the relationship is inverse. Thus, there is a tendency for the land lot unit price to increase with latitude, longitude, sector and also with the utilization coefficient, and to decrease as the area and the front size increase. We note that the inverse relationship between unit price and front size was not expected. It motivated the inclusion of the covariate FRVN in our analysis.
It is not clear from Figure 2 whether the usual assumptions of normality and homoskedasticity are reasonable. As noted by Rigby and Stasinopoulos (2007), transformations of the response variable and/or of the explanatory variables are usually made in order to minimize deviations from the underlying assumptions. However, this practice may not deliver the expected results. Additionally, the resulting model parameters are typically not easily interpretable in terms of the untransformed variables. A more general modeling strategy is thus called for.
4. Empirical Modeling
In what follows, we estimate the hedonic price function of land lots located in Aracaju using the highly flexible class of GAMLSS models. At the outset, however, we estimate standard linear regression and generalized linear models. We use these fits as benchmarks for our estimated GAMLSS hedonic price function.
Figure 2. Dispersion plots.
4.1 Data modeling based on the CNLRM

Table 2 lists the classical normal linear regressions that were estimated. The transformation parameter of the Box-Cox model was estimated by maximizing the profile log-likelihood function: \hat\lambda = 0.1010. All four models are heteroskedastic and there is strong evidence of nonnormality for the first two models. The coefficients of determination range from 0.54 to 0.66. Since the error variances are not constant, we present in Table 3 the estimated parameters of Model E, which yields the best fit, along with heteroskedasticity-robust HC3 standard errors; see Davidson and MacKinnon (1993). Notice that all covariates are statistically significant at the 5% nominal level, except for LAT (p-value = 0.1263), which suggests that pricing differentiation mostly takes place as we move in the North-South direction.
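The profile-likelihood estimation of the Box-Cox parameter can be reproduced generically. The sketch below uses scipy.stats.boxcox, which maximizes the Box-Cox profile log-likelihood over lambda; the data are simulated right-skewed values, not the paper's sample (whose estimate was \hat\lambda = 0.1010).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical right-skewed "unit price" data; for lognormal data the
# profile-likelihood maximum should sit near lambda = 0 (the log transform)
up = rng.lognormal(mean=4.0, sigma=0.9, size=1000)

# boxcox with lmbda=None returns the transformed data and the lambda that
# maximizes the profile log-likelihood
transformed, lam_hat = stats.boxcox(up)
```

An estimate close to zero, as in the paper, indicates that the log transformation is nearly optimal within the Box-Cox family.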
4.2 Hedonic GLM function

Table 4 displays the maximum likelihood fit of the generalized linear model that we call Model A, given by

g(UP_i) = \beta_0 + \beta_2 LON + \beta_3 log(AR) + \beta_4 UC + \beta_5 log(ST) + \beta_6 STR1 + \beta_7 STR2 + \beta_8 SI + \beta_9 PA + \beta_{10} TO + \beta_{11} NIO + \beta_{12} NIT + \beta_{13} YR06 + \beta_{14} YR07 + \beta_{15} SZ + \beta_{16} log(FRVN),   (2)

where UP_i denotes the unit price of the i-th land lot and g(\cdot) is the link function.
Table 2. Estimated classical normal linear regression models.

B: UP = \beta_0 + \beta_1 LAT + \beta_2 LON + \beta_3 AR + \beta_4 UC + \beta_5 ST + \beta_6 STR1 + \beta_7 STR2 + \beta_8 SI + \beta_9 PA + \beta_{10} TO + \beta_{11} NIO + \beta_{12} NIT + \beta_{13} YR06 + \beta_{14} YR07 + \beta_{15} SZ + \beta_{16} FRVN + \varepsilon. The null hypotheses that the errors are homoskedastic and normal are rejected at the 1% nominal level by the Breusch-Pagan and Jarque-Bera tests, respectively. The explanatory variables proved to be statistically significant at the 1% nominal level (z-tests). Also, R^2 = 0.539, AIC = 22304 and BIC = 22406.
C: log(UP) = \beta_0 + \beta_1 LAT + \beta_2 LON + \beta_3 AR + \beta_4 UC + \beta_5 ST + \beta_6 STR1 + \beta_7 STR2 + \beta_8 SI + \beta_9 PA + \beta_{10} TO + \beta_{11} NIO + \beta_{12} NIT + \beta_{13} YR06 + \beta_{14} YR07 + \beta_{15} SZ + \beta_{16} FRVN + \varepsilon. The null hypotheses that the errors are homoskedastic and normal are rejected at the 1% nominal level by the Breusch-Pagan and Jarque-Bera tests, respectively. All explanatory variables proved to be statistically significant at the 1% nominal level (z-tests). Also, R^2 = 0.599, AIC = 2912 and BIC = 3014.
D: log(UP) = \beta_0 + \beta_1 LAT + \beta_2 LON + \beta_3 log(AR) + \beta_4 UC + \beta_5 log(ST) + \beta_6 STR1 + \beta_7 STR2 + \beta_8 SI + \beta_9 PA + \beta_{10} TO + \beta_{11} NIO + \beta_{12} NIT + \beta_{13} YR06 + \beta_{14} YR07 + \beta_{15} SZ + \beta_{16} log(FRVN) + \varepsilon. The Jarque-Bera test does not reject the null hypothesis of normality at the usual nominal levels, but the Breusch-Pagan test rejects the null hypothesis of homoskedasticity at the 1% nominal level. All explanatory variables are statistically significant at the 1% nominal level, except for the LAT variable (p-value = 0.0190). Also, R^2 = 0.651, AIC = 2619 and BIC = 2721.
E: UP^{(\lambda)} = \beta_0 + \beta_1 LAT + \beta_2 LON + \beta_3 log(AR) + \beta_4 UC + \beta_5 log(ST) + \beta_6 STR1 + \beta_7 STR2 + \beta_8 SI + \beta_9 PA + \beta_{10} TO + \beta_{11} NIO + \beta_{12} NIT + \beta_{13} YR06 + \beta_{14} YR07 + \beta_{15} SZ + \beta_{16} log(FRVN) + \varepsilon. Normality is not rejected by the Jarque-Bera test, but the Breusch-Pagan test rejects the null hypothesis of homoskedasticity at the 1% nominal level. All covariates proved to be statistically significant at the 1% nominal level, except for the LAT variable (p-value = 0.0881). Also, R^2 = 0.657, AIC = 4290 and BIC = 4392.
Table 3. Hedonic price function estimated via CNLRM Model E.
Estimate Standard error z-statistic p-value
(Intercept) 162.6307 34.1920 4.756 0.0000
LAT 1.85e-05 1.21e-05 1.529 0.1263
LON 1.74e-05 4.60e-06 3.798 0.0001
log(AR) 0.3507 0.0192 18.236 0.0000
log(ST) 0.4423 0.0332 13.297 0.0000
UC 0.2651 0.0412 6.429 0.0000
STR1 0.4874 0.0717 6.789 0.0000
STR2 0.1678 0.0675 2.485 0.0130
SI 0.1119 0.0405 2.757 0.0058
PA 0.3853 0.0302 12.767 0.0000
TO 0.4905 0.0798 6.145 0.0000
NIO 0.5994 0.0592 10.131 0.0000
NIT 0.5111 0.0131 3.886 0.0000
YR06 0.2560 0.0351 7.289 0.0000
YR07 0.6450 0.0345 18.645 0.0000
SZ 0.7221 0.0474 15.239 0.0000
log(FRVN) 1.2041 0.0137 8.797 0.0000
4.3 GAMLSS hedonic fit

4.3.1 Location parameter modeling (\mu)

Since UP (the response) only assumes positive values, we consider the distributions log-normal (LOGNO), inverse Gaussian (IG), Weibull (WEI) and gamma (GA). As noted earlier, we use the pseudo-R^2 given by

pseudo-R^2 = [correlation(observed values of UP, predicted values of UP)]^2,   (3)

to measure the overall goodness-of-fit.
Table 4. Hedonic price function estimated via GLM Model A given in Equation (2).
Estimate Standard error z-statistic p-value
(Intercept) 151.8019 15.7792 9.620 0.0000
LON 1.77e-05 1.80e-06 9.851 0.0000
log(AR) 0.2276 0.0108 21.120 0.0000
UC 0.1272 0.0231 5.515 0.0000
log(ST) 0.2880 0.0193 14.954 0.0000
STR1 0.3562 0.0395 9.021 0.0000
STR2 0.1419 0.0408 3.482 0.0005
SI 0.0945 0.0255 3.707 0.0002
PA 0.2324 0.0220 10.556 0.0000
TO 0.3139 0.0503 6.236 0.0000
NIO 0.4208 0.0348 12.087 0.0000
NIT 0.3779 0.0642 5.884 0.0000
YR06 0.1947 0.0242 8.035 0.0000
YR07 0.4551 0.0242 18.780 0.0000
SZ 0.4716 0.0310 15.220 0.0000
log(FRVN) 0.7467 0.0622 11.997 0.0000
Table 5. Fitted models via GAMLSS.

Model F. D: LOGNO; G: logarithmic. Equation: log(\mu) = \beta_0 + cs(LAT) + cs(LON) + cs(log(AR)) + cs(UC) + cs(ST) + \beta_1 STR1 + \beta_2 STR2 + \beta_3 SI + \beta_4 PA + \beta_5 TO + \beta_6 NIO + \beta_7 NIT + \beta_8 YR06 + \beta_9 YR07 + \beta_{10} SZ + cs(log(FRVN)). All regressors are significant at the 1% significance level (z-tests). Also, AIC = 19155, BIC = 19359 and GD = 19083. Pseudo-R^2 = 0.739.

Model G. D: IG; G: logarithmic. Equation: same predictor as Model F. All regressors are significant at the 1% significance level (z-tests). Also, AIC = 19845, BIC = 20048 and GD = 19773. Pseudo-R^2 = 0.678.

Model H. D: WEI; G: logarithmic. Equation: same predictor as Model F. All regressors proved to be significant at the 1% significance level (z-tests). Also, AIC = 19260, BIC = 19463 and GD = 19188. Pseudo-R^2 = 0.748.

Model I. D: GA; G: logarithmic. Equation: same predictor as Model F. All regressors are significant at the 1% significance level (z-tests). Also, AIC = 19062, BIC = 19337 and GD = 19134. Pseudo-R^2 = 0.746.
Table 6. Hedonic price function estimated via GAMLSS Model I.
Estimate Standard error z-statistic p-value
(Intercept) 165.4000 16.1300 10.251 0.0000
cs(LAT) 5.17e-05 6.22e-06 8.307 0.0000
cs(LON) 1.51e-05 2.13e-06 7.071 0.0000
cs(log(AR)) 0.2317 0.0096 24.074 0.0000
cs(ST) 0.0465 0.0037 12.416 0.0000
cs(UC) 0.1223 0.0206 5.947 0.0000
STR1 0.3133 0.0349 8.963 0.0000
STR2 0.0926 0.0364 2.545 0.0100
SI 0.0920 0.0227 4.054 0.0000
PA 0.1891 0.0195 9.670 0.0000
TO 0.2662 0.0474 5.951 0.0000
NIO 0.4135 0.0395 13.362 0.0000
NIT 0.3485 0.0571 6.102 0.0000
YR06 0.1645 0.0215 7.632 0.0000
YR07 0.4358 0.0215 20.235 0.0000
cs(log(FRVN)) 0.6513 0.0569 11.443 0.0000
SZ 0.3875 0.0299 12.935 0.0000
The models listed in Table 5 include smoothing cubic splines (cs) with 3 effective df for the covariates LAT, LON, log(AR), UC, ST and log(FRVN). Other smoothers (such as loess and penalized splines), as well as different combinations of D (see Rigby and Stasinopoulos, 2007), such as BCPE, BCCG, LNO, BCT, exGAUSS, among others, and G, such as identity, inverse, reciprocal, among others, were considered. However, they did not yield superior fits. We also note that Model I yields the smallest values of the three model selection criteria. Table 6 contains a summary of the model fit.
Chilean Journal of Statistics 85
The use of three effective df in the smoothing functions delivered a good model fit. However, in order to determine whether a different number of effective df delivers a superior fit, we used two criteria, namely: the AIC (objective) and visual inspection of the smoothed curves (subjective); visual inspection aimed at avoiding overfitting. We then arrived at Model J. It also uses cubic spline smoothing (cs), but with a different number of effective df in the smoothing functions; see Table 7. Notice that there was a considerable reduction relative to Model I in the AIC, BIC and GD values (18822, 19212 and 18684, respectively) and that there is a better agreement between observed and predicted response values.
Table 7. Hedonic price function estimated via GAMLSS Model J.
Estimate Standard error z-statistic p-value
(Intercept) 130.1000 14.8100 8.787 0.0000
cs(LAT, df = 10) 5.92e-05 5.71e-06 10.354 0.0000
cs(LON, df = 10) 1.05e-05 1.96e-06 5.352 0.0000
cs(log(AR), df = 10) 0.2559 8.83e-03 28.963 0.0000
cs(ST, df = 8) 0.0373 3.44e-03 10.831 0.0000
cs(UC, df = 3) 0.1769 0.0188 9.370 0.0000
STR1 0.2571 0.0320 8.012 0.0000
STR2 0.0728 0.0334 2.180 0.0293
SI 0.1029 0.0208 4.940 0.0000
PA 0.1436 0.0179 7.999 0.0000
TO 0.1822 0.0410 4.436 0.0000
NIO 0.4173 0.0284 14.690 0.0000
NIT 0.3388 0.0524 6.462 0.0000
YR06 0.1373 0.0198 6.941 0.0000
YR07 0.4190 0.0197 21.190 0.0000
cs(log(FRVN), df = 10) 0.6599 0.0522 12.630 0.0000
SZ 0.5119 0.0275 18.613 0.0000
Figure 3 contains plots of the smoothed curves from Model J. The dashed lines are confidence bands based on pointwise standard errors. Panels (I), (II), (III), (IV), (V) and (VI) reveal that the effects/impacts of LAT, LON, log(AR), ST, UC and log(FRVN) are typically increasing, increasing/decreasing, decreasing, increasing, increasing and increasing, respectively, with increases in latitude, longitude, log area, socioeconomic indicator, utilization coefficient and log land front in highly priced neighborhoods. (Panel (II) alternately shows local increasing and decreasing trends.) Some of these effects were also suggested by the estimated coefficients of the CNLRM and GLM models. Here, however, one obtains a somewhat more flexible global picture, as we shall see.

In Panel (I), one notices that as the latitude increases the contribution of the LAT covariate between the 702000 and 709000 latitudes (approximately), neighborhoods that belong to the expansion zone of the city, is negative, whereas starting from position 709000 (approximately), South Zone and downtown area, the price effect is positive. Additionally, we note that, in certain ranges, increases in latitude lead to drastic changes in the slope of the smoothed curve, e.g., between the 708000 and 710000 positions, whereas in other areas, for instance between the 706000 and 708000 latitudes (the Mosqueiro neighborhood), an increase in latitude leads to a uniform negative effect.

Panel (II) shows that as longitude increases to position 8780000 the contribution of the LON covariate is positive and nearly uniform; this range almost exclusively covers observations relative to the Mosqueiro neighborhood. Starting at the 8785000 position there is a remarkable change in the slope of the fitted curve, which is triggered by the location of the most upper-class neighborhoods: from 8785000 to 8794000. After the 8794000 position, the effect remains positive, but is decreasing and, eventually, it becomes negative.

We see in Panel (III) that as the area (in logs) increases the contribution of the log(AR) covariate, for land lots with log areas between 4 and 5 (approximately), is clearly positive.
Figure 3. Smoothed additive terms Model J.
The effect is negative for land lots with log areas in excess of 5.
In Panel (IV), it is possible to notice that as we move up in the socioeconomic scale the contribution of the ST covariate, in the range from 1 to 4 minimum wages, is negative, even though there is an increasing trend. For land lots located in neighborhoods that correspond to more than 4 minimum wages, the effect is always positive; from 10 to 15 minimum wages the effect is uniform.

We note from Panel (V) that, contrary to what one would expect, the contribution of the UC covariate is not always positive. In the range from 3.0 to 5.0, the fitted curve displays small oscillations, alternating between the positive and negative regions. The positive effect only holds for utilization coefficients greater than 5.0.

Notice from Panel (VI) that as the land lot front (in logs) increases in highly priced neighborhoods the contribution of the log(FRVN) covariate is mostly increasing and positive. However, in the 1.5 to 2.0 interval the positive effect is approximately uniform.
4.4 Comparing models

In order to compare the best estimated models via CNLRM (Model E), GLM (Model A given in Equation (2)) and GAMLSS (Model J), we use the AIC and BIC. Note that these criteria can only be used to compare models that use the response (UP) in the same measurement scale, i.e., Models A and J. We also compare the different models using the pseudo-R^2 given in Equation (3).

We present in Table 8 a comparative summary of the three models. We note that Model J is superior to the two competing models. Not only does it have the smallest AIC and BIC values (in comparison to Model A), but it also has a much larger pseudo-R^2. The GAMLSS pseudo-R^2 exceeds 0.80, which is notable.
Table 8. Comparative summary of the CNLRM, GLM and GAMLSS estimated models.
Model Class AIC BIC Pseudo-R^2
E (CNLRM) 4290 4392 0.667
A (GLM) 19486 19581 0.672
J (GAMLSS) 18822 19212 0.811
4.5 Dispersion parameter modeling (\sigma)

After a suitable model for the prediction of \mu was selected, we carried out a likelihood ratio test to determine whether the GAMLSS scale parameter \sigma is constant for all observations. The null hypothesis that \sigma is constant was rejected at the usual nominal levels. We then built a regression model for such a parameter. To that end, we used stepwise covariate selection, considered different link functions (such as identity, inverse, reciprocal, etc.) and included smoothing functions (such as cubic splines, loess and penalized splines) in the linear predictor, just as we had done for the location parameter. We used the AIC for selecting the smoothers and for choosing the number of df of the smoothing functions, together with visual inspection of the smoothed curves.
We present in Table 9 the GAMLSS hedonic price function parameter estimates obtained by jointly modeling the location (\mu) and dispersion (\sigma) effects; Model K. The model uses the gamma distribution for the response and the log link function for both \mu and \sigma. We note that Model K contains parametric and nonparametric terms, and for that reason it is said to be a linear additive semiparametric GAMLSS.

We note from Table 9 that the parameter estimates of the location submodel in Model K are similar to the corresponding estimates from Model J, in which \sigma was taken to be constant; see Table 7. It is noteworthy, nonetheless, that there was a sizeable reduction in the AIC, BIC and GD values (18607, 19065 and 18445, respectively) and also an improvement in the residuals, as evidenced by the worm plot; see Figures 4 and 5.
Only two covariates were selected for the \sigma regression submodel in Model K, namely ST and log(AR). The former (ST) entered the model in the usual parametric fashion whereas the latter (log(AR)) entered the model nonparametrically through a cubic spline smoothing function with ten effective df. We note that the positive sign of the log(AR) coefficient indicates that the UP dispersion is larger for land lots with larger areas, whereas the negative sign of the ST coefficient indicates that the dispersion is inversely related to the socioeconomic neighborhood indicator.
It is noteworthy that the pseudo-R² of Model K is quite high (0.817) and that all of the
explanatory variables are statistically significant at the 1% nominal level, which is not all
that common in large-sample cross-sectional analyses, especially in real estate appraisals.
Overall, the variable dispersion GAMLSS model is clearly superior to the alternative models.
The good fit of Model K can be seen in Figure 6, where we plot the observed response
values against the predicted values from the estimated model. Note that the 45° line in
this plot indicates perfect agreement between predicted and observed values.
88 L. Florencio, F. Cribari-Neto and R. Ospina
Table 9. Hedonic price function estimated via GAMLSS (Model K).
                        Estimate    Standard error  z-statistic  p-value
μ coefficients
(Intercept)             95.1300     14.2700          6.665       0.0000
cs(LAT, df = 10)        5.94e-05    5.37e-06        11.053       0.0000
cs(LON, df = 10)        6.45e-06    1.86e-06         3.460       0.0000
cs(log(AR), df = 10)    0.2087      0.0104          20.138       0.0000
cs(ST, df = 8)          0.0321      0.0030          10.666       0.0000
cs(UC, df = 3)          0.2095      0.0161          13.006       0.0000
STR1                    0.2039      0.0298           6.838       0.0000
STR2                    0.0729      0.0276           2.635       0.0084
SI                      0.7136      0.0192           3.705       0.0000
PA                      0.1653      0.0157          10.465       0.0000
TO                      0.1778      0.0370           4.799       0.0000
NIO                     0.3722      0.0251          14.799       0.0000
NIT                     0.2790      0.0468           5.957       0.0000
YR06                    0.1255      0.0175           7.144       0.0000
YR07                    0.4195      0.0177          23.622       0.0000
cs(log(FRVN), df = 10)  0.6809      0.0403          16.880       0.0000
SZ                      0.4824      0.0241          20.001       0.0000
σ coefficients
(Intercept)             1.6838      0.0839          20.072       0.0000
cs(log(AR), df = 10)    0.1370      0.0143           9.593       0.0000
ST                      −0.0391     0.0040          −9.632       0.0000
Figure 4. Worm plot of Model J (deviation against unit normal quantile).
Model K is given by

log(μ) = β0 + cs(LAT, df = 10) + cs(LON, df = 10) + cs(log(AR), df = 10)
       + cs(UC, df = 3) + cs(ST, df = 8) + β1 STR1 + β2 STR2 + β3 SI + β4 PA
       + β5 TO + β6 NIO + β7 NIT + β8 YR06 + β9 YR07 + β10 SZ
       + cs(log(FRVN), df = 10),
log(σ) = γ0 + γ1 ST + cs(log(AR), df = 10),

in which the response (UP) follows a gamma (GA) distribution with location and scale
parameters μ and σ, respectively. This model proved to be the best model for hedonic price
equation estimation of urban land lots in Aracaju.
Chilean Journal of Statistics 89
Figure 5. Worm plot of Model K (deviation against unit normal quantile).
Figure 6. Observed values versus predicted values of UP (Model K).
5. Concluding Remarks
Real estate appraisal is usually performed using the standard linear regression model or the
class of generalized linear models. In this paper, we introduced real estate appraisal based
on the class of generalized additive models for location, scale and shape (GAMLSS). Such
a class of regression models provides a flexible framework for the estimation of hedonic
price functions. It even allows some conditioning variables to enter the model in a
nonparametric fashion. The model also accommodates variable dispersion and can be based
on a wide range of response distributions. Our empirical analysis was carried out using a
large sample of land lots located in the city of Aracaju (Brazil). The selected GAMLSS
model displayed a very high pseudo-R² (approximately 0.82) and yielded an excellent
fit. Moreover, the inclusion of nonparametric additive terms in the model allowed for the
estimation of the hedonic price function in a very flexible way. We showed that the GAMLSS
fit was clearly superior to those based on the standard linear regression and on a generalized
linear model. We strongly recommend the use of GAMLSS models for real estate appraisal.
Acknowledgements
L. Florencio acknowledges funding from Coordenação de Aperfeiçoamento de Pessoal de
Nível Superior (CAPES); F. Cribari-Neto and R. Ospina acknowledge funding from Conselho
Nacional de Desenvolvimento Científico e Tecnológico (CNPq). We thank three anonymous
referees for their comments and suggestions.
References
Akaike, H., 1983. Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.
Akantziliotou, C., Rigby, R.A., Stasinopoulos, D.M., 2002. The R implementation of generalized additive models for location, scale and shape. In Stasinopoulos, M., Touloumi, G., (eds.). Statistical Modelling in Society: Proceedings of the 17th International Workshop on Statistical Modelling. Chania, Greece, pp. 75–83.
Anglin, P., Gencay, R., 1996. Semiparametric estimation of hedonic price functions. Journal of Applied Econometrics, 11, 633–648.
Atkinson, A.C., 1985. Plots, Transformations and Regression. Oxford University Press, New York.
Buuren, S., Fredriks, A.M., 2001. Worm plot: a simple diagnostic device for modeling growth reference curves. Statistics in Medicine, 20, 1259–1277.
Clapp, J.M., Kim, H.J., Gelfand, A., 2002. Predicting spatial patterns of house prices using LPR and Bayesian smoothing. Real Estate Economics, 30, 505–532.
Cox, D.R., Snell, E.J., 1989. Analysis of Binary Data. Chapman and Hall, London.
Cribari-Neto, F., Zarkos, S.G., 1999. R: yet another econometric programming environment. Journal of Applied Econometrics, 14, 319–329.
Davidson, R., MacKinnon, J.G., 1993. Estimation and Inference in Econometrics. Oxford University Press, New York.
Dunn, P.K., Smyth, G.K., 1996. Randomised quantile residuals. Journal of Computational and Graphical Statistics, 5, 236–244.
Eubank, R., 1999. Nonparametric Regression and Spline Smoothing. Second edition. Marcel Dekker, New York.
Ferrari, S.L.P., Cribari-Neto, F., 2004. Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31, 799–815.
Gencay, R., Yang, X., 1996. A forecast comparison of residential housing prices by parametric and semiparametric conditional mean estimators. Economics Letters, 52, 129–135.
Härdle, W., Müller, M., Sperlich, S., Werwatz, A., 2004. Nonparametric and Semiparametric Models. Springer-Verlag, Berlin.
Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Chapman and Hall, London.
Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5, 299–314.
Iwata, S., Murao, H., Wang, Q., 2000. Nonparametric assessment of the effects of neighborhood land uses on the residential house values. In Fomby, T., Carter, H.R., (eds.). Advances in Econometrics: Applying Kernel and Nonparametric Estimation to Economic Topics. JAI Press, New York.
Martins-Filho, C., Bin, O., 2005. Estimation of hedonic price functions via additive nonparametric regression. Empirical Economics, 30, 93–114.
McFadden, D., 1974. Conditional logit analysis of qualitative choice behavior. In Zarembka, P., (ed.). Frontiers in Econometrics. Academic Press, New York, pp. 105–142.
Pace, R.K., 1993. Non-parametric methods with applications to hedonic models. Journal of Real Estate Finance and Economics, 7, 185–204.
Rigby, R.A., Stasinopoulos, D.M., 2005. Generalized additive models for location, scale and shape (with discussion). Applied Statistics, 54, 507–554.
Rigby, R.A., Stasinopoulos, D.M., 2007. Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23, 1–46.
Rosen, S., 1974. Hedonic prices and implicit markets: product differentiation in pure competition. Journal of Political Economy, 82, 34–55.
Silverman, B.W., 1984. Spline smoothing: the equivalent kernel method. The Annals of Statistics, 12, 896–916.
Thorsnes, P., McMillen, D.P., 1998. Land value and parcel size: a semiparametric analysis. Journal of Real Estate Finance and Economics, 17, 233–244.
Chilean Journal of Statistics
Vol. 3, No. 1, April 2012, 93–110
Statistical Distributions
Research Paper
Discriminating between the bivariate generalized exponential and bivariate Weibull distributions
Arabin Kumar Dey¹ and Debasis Kundu²
¹ Department of Mathematics, IIT Gauhati, Gauhati, India
² Department of Mathematics and Statistics, IIT Kanpur, Kanpur, India
(Received: 24 July 2011 · Accepted in final form: 03 January 2012)
Abstract
Recently Kundu and Gupta (2009) introduced a bivariate generalized exponential distribution,
whose marginals are generalized exponential distributions. The bivariate generalized
exponential distribution is a singular distribution, similarly as the well known
bivariate Weibull distribution. The corresponding two singular bivariate distribution
functions have very similar joint probability density functions. In this paper, we consider
the discrimination between these two bivariate distribution functions. The difference of
the maximized log-likelihood functions is used in discriminating between the two
distribution functions. The asymptotic distribution of the test statistic has been obtained
and it can be used to compute the asymptotic probability of correct selection. Monte
Carlo simulations are performed to study the effectiveness of the proposed method. One
data set has been analyzed for illustrative purposes.
Keywords: Asymptotic distribution; EM algorithm; Likelihood ratio test; Maximum likelihood; Monte Carlo simulations; Probability of correct selection.
Mathematics Subject Classification: Primary 62H30; Secondary 62E20.
1. Introduction
Recently, the two-parameter generalized exponential (GE) distribution proposed by Gupta
and Kundu (1999) has received some attention. The two-parameter GE model, which has
one shape parameter and one scale parameter, is a positively skewed distribution. This
model has several desirable properties and many of them are very similar to the corresponding
properties of the well known Weibull distribution. For example, the probability
density functions (PDFs) and the hazard functions (HFs) of the GE and Weibull distributions
are very similar. In addition, both distributions have compact cumulative distribution
functions (CDFs). These distributions contain the exponential distribution as a special
case. Therefore, they are extensions of the exponential distribution, but in different manners.
It is further observed that the GE distribution can also be used quite successfully
in analyzing positively skewed data sets in place of the Weibull distribution. Moreover,
often it is very difficult to distinguish between these two distributions. For some recent
developments on the GE distribution, and for its different applications, the readers are
referred to the review article by Gupta and Kundu (2007).
Corresponding author. Debasis Kundu. Department of Mathematics, Indian Institute of Technology Kanpur, Kanpur 208016, India. Email: kundu@iitk.ac.in
ISSN: 0718-7912 (print)/ISSN: 0718-7920 (online)
© Chilean Statistical Society – Sociedad Chilena de Estadística
http://www.soche.cl/chjs
94 A.K. Dey and D. Kundu
The problem of testing whether some given observations follow one of two (or more)
distributions is quite an old statistical problem. Cox (1961) (see also Cox, 1962) was
the pioneer in considering this problem. He also discussed the effect of choosing a wrong
model. Since then, extensive work has been done in discriminating between two or more
distributions; see, e.g., Atkinson (1969, 1970), Bain and Englehardt (1980), Marshall et al.
(2001), Dey and Kundu (2009, 2010) and the references cited therein.
In recent times, it has been observed (see Gupta and Kundu, 2003, 2006) that, due
to the closeness between the Weibull and GE distributions, it is extremely difficult to
discriminate between their two corresponding CDFs. Note that if the shape parameter
is one, the two CDFs are not distinguishable. For small sample sizes, the probability of
correct selection (PCS) can be quite small, even if the shape parameter is not very close to
one. Interestingly, although extensive work has been done in discriminating between two
or more univariate distributions, no work has been found in discriminating between
two bivariate distributions.
Recently, Kundu and Gupta (2009) introduced a singular bivariate distribution whose
marginals follow GE distributions, which is named the bivariate generalized exponential
(BGE) distribution. The four-parameter BGE distribution has several desirable properties
and it can be used quite effectively to analyze bivariate data when there are ties. Another
well known four-parameter bivariate singular distribution is the bivariate Marshall-Olkin
Weibull (BMOW) distribution, which has been used quite effectively to analyze bivariate
data when there are ties; see, e.g., Kotz et al. (2000). The BMOW distribution has Weibull
marginals. Therefore, it is clear that, for a certain range of parameter values, the marginals
of the BGE and BMOW distributions are very similar. In fact, it is observed that the
shapes of the joint PDFs of the BGE and BMOW distributions can also be very similar
in nature.
In this paper, we consider discriminating between the BGE and BMOW distributions. We
use the difference of the values of the maximized log-likelihood functions in discriminating
between the two CDFs. The exact distribution of the proposed test statistic is difficult to
obtain, and hence we obtain its asymptotic distribution. It is observed that the asymptotic
distribution of the test statistic is normal and it is used to compute the PCS.
In computing the PCS, one needs to compute the misspecified parameters. Computation
of the misspecified parameters involves solving a four-dimensional optimization problem.
We suggest an approximation that involves solving only a one-dimensional optimization
problem, which makes the computation very efficient. Monte Carlo simulations
are performed to study the effectiveness of the proposed method, and it is observed that,
even for moderate sample sizes, the asymptotic results match the simulated results very well.
The rest of the paper is organized as follows. In Section 2, we briefly discuss the
BGE and BMOW distributions. In Section 3, we present the discrimination procedure.
In Section 4, we provide the asymptotic distribution of the test statistics for both cases.
In Section 5, we discuss the calculation of the misspecified parameters. In Section 6, we
conduct a Monte Carlo simulation study. In Section 7, we analyze a data set for illustrative
purposes. Finally, in Section 8, we conclude the paper.
2. BMOW and BGE Distributions
In this section, we briefly discuss the BMOW and BGE distributions. We use the
following notations throughout the paper. It is assumed that the univariate Weibull
distribution with shape parameter α > 0 and scale parameter λ > 0 has PDF, CDF
and survival function (SF) given by

f_WE(x; α, λ) = αλ x^(α−1) e^(−λx^α),  F_WE(x; α, λ) = 1 − e^(−λx^α),  S_WE(x; α, λ) = e^(−λx^α),  x > 0,  (1)

respectively. From now on, a Weibull distribution with the PDF given in Equation (1)
is denoted by WE(α, λ). The GE distribution, with shape parameter α > 0 and scale
parameter λ > 0, has PDF given by

f_GE(x; α, λ) = αλ e^(−λx) (1 − e^(−λx))^(α−1),  x > 0.  (2)

The corresponding CDF and SF are

F_GE(x; α, λ) = (1 − e^(−λx))^α  and  S_GE(x; α, λ) = 1 − (1 − e^(−λx))^α,

respectively. A GE distribution with the PDF given in Equation (2) is denoted by GE(α, λ).
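As a quick numerical sanity check on Equations (1) and (2) (an illustrative sketch, not code from the paper), the densities and distribution functions can be coded directly; with shape α = 1 both families collapse to the exponential distribution:

```python
import math

def we_pdf(x, a, lam):
    # Weibull PDF from Equation (1): a*lam*x^(a-1)*exp(-lam*x^a).
    return a * lam * x ** (a - 1) * math.exp(-lam * x ** a)

def we_sf(x, a, lam):
    # Weibull survival function: exp(-lam*x^a).
    return math.exp(-lam * x ** a)

def ge_pdf(x, a, lam):
    # GE PDF from Equation (2): a*lam*exp(-lam*x)*(1-exp(-lam*x))^(a-1).
    return a * lam * math.exp(-lam * x) * (1.0 - math.exp(-lam * x)) ** (a - 1)

def ge_cdf(x, a, lam):
    # GE CDF: (1-exp(-lam*x))^a.
    return (1.0 - math.exp(-lam * x)) ** a

# The GE PDF should match the numerical derivative of the GE CDF.
x, a, lam, h = 1.3, 2.5, 0.8, 1e-6
deriv = (ge_cdf(x + h, a, lam) - ge_cdf(x - h, a, lam)) / (2 * h)
assert abs(deriv - ge_pdf(x, a, lam)) < 1e-6
# With shape a = 1, both families reduce to the exponential distribution.
assert abs(we_sf(x, 1.0, lam) - (1.0 - ge_cdf(x, 1.0, lam))) < 1e-12
assert abs(we_pdf(x, 1.0, lam) - lam * math.exp(-lam * x)) < 1e-12
```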
2.1 The BMOW distribution
Suppose U0 ∼ WE(α, λ0), U1 ∼ WE(α, λ1) and U2 ∼ WE(α, λ2), and that they are
independently distributed. Define X1 = min{U0, U1} and X2 = min{U0, U2}. Then, the
bivariate vector (X1, X2) has the BMOW distribution with parameters α, λ0, λ1, λ2, and it is
denoted by BMOW(Θ), where Θ = (α, λ0, λ1, λ2). Then, (X1, X2) has joint SF of the form

S_BMOW(x1, x2; Θ) = P(X1 > x1, X2 > x2) = P(U1 > x1, U2 > x2, U0 > z)
                  = S_WE(x1; α, λ1) S_WE(x2; α, λ2) S_WE(z; α, λ0),

where z = max{x1, x2}. The joint PDF of (X1, X2) can be written as

f_BMOW(x1, x2; Θ) = f_1W(x1, x2; Θ),  if 0 < x1 < x2;
                    f_2W(x1, x2; Θ),  if 0 < x2 < x1;
                    f_0W(x; Θ),       if 0 < x1 = x2 = x;

where

f_1W(x1, x2; Θ) = f_WE(x1; α, λ1) f_WE(x2; α, λ0 + λ2),
f_2W(x1, x2; Θ) = f_WE(x1; α, λ0 + λ1) f_WE(x2; α, λ2),
f_0W(x; Θ) = [λ0/(λ0 + λ1 + λ2)] f_WE(x; α, λ0 + λ1 + λ2).

Note that the function f_BMOW(·) may be considered to be a PDF for the BMOW distribution
if it is understood that the first two terms are PDFs with respect to a two-dimensional
Lebesgue measure, and the third term is a PDF with respect to a one-dimensional Lebesgue
measure; see, e.g., Bemis et al. (1972). It is clear that the BMOW distribution has an
absolutely continuous part on {(x1, x2): 0 < x1 < ∞, 0 < x2 < ∞, x1 ≠ x2}, and a singular part
on {(x1, x2): 0 < x1 < ∞, 0 < x2 < ∞, x1 = x2}. The surface plot of the absolutely continuous
part of the joint PDF has been provided in Figure 1 for different parameter values.
It is immediate that the joint BMOW PDF can take a variety of shapes and, therefore, it
can be used quite effectively in analyzing singular bivariate data.
Figure 1. Surface plots of the absolutely continuous part of the joint BMOW PDF for (α, λ0, λ1, λ2): (a) (2.0, 1.0, 1.0, 1.0); (b) (5.0, 1.0, 1.0, 1.0); (c) (2.0, 2.0, 2.0, 2.0); (d) (1.0, 1.0, 1.0, 1.0).
The following probabilities are used later in deriving the asymptotic PCS. If (X1, X2) ∼ BMOW(Θ), then

p_1W = P(X1 < X2) = ∫_0^∞ ∫_0^y f_WE(x; α, λ1) f_WE(y; α, λ0 + λ2) dx dy = λ1/(λ0 + λ1 + λ2),

p_2W = P(X1 > X2) = ∫_0^∞ ∫_y^∞ f_WE(x; α, λ0 + λ1) f_WE(y; α, λ2) dx dy = λ2/(λ0 + λ1 + λ2),

p_0W = P(X1 = X2) = [λ0/(λ0 + λ1 + λ2)] ∫_0^∞ f_WE(z; α, λ0 + λ1 + λ2) dz = λ0/(λ0 + λ1 + λ2).
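These three probabilities can be checked by simulation through the minimum construction of the BMOW distribution; the sketch below (illustrative code, not from the paper) uses inverse-transform Weibull sampling. The tie event X1 = X2 occurs exactly when the common component U0 is the smallest of the three:

```python
import math
import random

def rweibull(a, lam, rng):
    # Inverse transform: S(x) = exp(-lam*x^a)  =>  x = (-log(1-U)/lam)^(1/a).
    return (-math.log(1.0 - rng.random()) / lam) ** (1.0 / a)

def bmow_order_probs(a, l0, l1, l2, n, seed=1):
    # Simulate (X1, X2) = (min(U0, U1), min(U0, U2)) and count the three events.
    rng = random.Random(seed)
    ties = less = 0
    for _ in range(n):
        u0 = rweibull(a, l0, rng)
        u1 = rweibull(a, l1, rng)
        u2 = rweibull(a, l2, rng)
        x1, x2 = min(u0, u1), min(u0, u2)
        if x1 == x2:        # both minima are the common component U0
            ties += 1
        elif x1 < x2:
            less += 1
    return ties / n, less / n, 1.0 - (ties + less) / n

a, l0, l1, l2 = 2.0, 1.0, 2.0, 3.0
p0, p1, p2 = bmow_order_probs(a, l0, l1, l2, 200_000)
s = l0 + l1 + l2
assert abs(p0 - l0 / s) < 0.01  # P(X1 = X2) ≈ λ0/(λ0+λ1+λ2)
assert abs(p1 - l1 / s) < 0.01  # P(X1 < X2) ≈ λ1/(λ0+λ1+λ2)
assert abs(p2 - l2 / s) < 0.01  # P(X1 > X2) ≈ λ2/(λ0+λ1+λ2)
```

Note that the probabilities do not depend on the common shape α, which the simulation also reflects.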
2.2 The BGE distribution
Suppose V0 ∼ GE(α0, λ), V1 ∼ GE(α1, λ) and V2 ∼ GE(α2, λ). Define Y1 = max{V0, V1}
and Y2 = max{V0, V2}. Then the bivariate random vector (Y1, Y2) is said to have the
BGE distribution with parameters α0, α1, α2, λ, and it is denoted by BGE(ϑ), where
ϑ = (α0, α1, α2, λ). It is immediate that Y1 ∼ GE(α0 + α1, λ) and Y2 ∼ GE(α0 + α2, λ).
Figure 2. Surface plots of the absolutely continuous part of the joint BGE PDF for (α0, α1, α2, λ): (a) (1.0, 1.0, 2.0, 1.0); (b) (1.0, 1.0, 1.0, 4.0); (c) (5.0, 5.0, 5.0, 1.0); (d) (0.5, 0.5, 0.5, 1.0).
The joint CDF of (Y1, Y2) can be expressed as

F_BGE(y1, y2; ϑ) = P(Y1 ≤ y1, Y2 ≤ y2) = P(V1 ≤ y1, V2 ≤ y2, V0 ≤ v)
                 = (1 − e^(−λy1))^(α1) (1 − e^(−λy2))^(α2) (1 − e^(−λv))^(α0),

where v = min{y1, y2}. In this case, the joint PDF of Y1 and Y2 can be written as

f_BGE(y1, y2; ϑ) = f_1G(y1, y2; ϑ),  if 0 < y1 < y2;
                   f_2G(y1, y2; ϑ),  if 0 < y2 < y1;
                   f_0G(y; ϑ),       if 0 < y1 = y2 = y,

where

f_1G(y1, y2; ϑ) = f_GE(y1; α0 + α1, λ) f_GE(y2; α2, λ),
f_2G(y1, y2; ϑ) = f_GE(y1; α1, λ) f_GE(y2; α0 + α2, λ),
f_0G(y; ϑ) = [α0/(α0 + α1 + α2)] f_GE(y; α0 + α1 + α2, λ).
It is clear that the BGE distribution also has a singular part and an absolutely continuous
part, similarly to the BMOW distribution. The surface plot of the joint BGE PDF is
provided in Figure 2 for different parameter values. It is clear that the shapes of the joint
BGE and BMOW PDFs are very similar.
The following probabilities are needed later. If (Y1, Y2) ∼ BGE(ϑ), then

p_1G = P(Y1 < Y2) = ∫_0^∞ ∫_0^y f_GE(x; α0 + α1, λ) f_GE(y; α2, λ) dx dy = α2/(α0 + α1 + α2),

p_2G = P(Y1 > Y2) = ∫_0^∞ ∫_y^∞ f_GE(x; α1, λ) f_GE(y; α0 + α2, λ) dx dy = α1/(α0 + α1 + α2),

p_0G = P(Y1 = Y2) = [α0/(α0 + α1 + α2)] ∫_0^∞ f_GE(z; α0 + α1 + α2, λ) dz = α0/(α0 + α1 + α2).
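The inner integral of p_1G is simply the GE CDF, so p_1G = ∫ F_GE(y; α0 + α1, λ) f_GE(y; α2, λ) dy, and the closed form can be verified by one-dimensional numerical quadrature. This is an illustrative sketch (parameter values are made up) using a plain midpoint rule:

```python
import math

def ge_pdf(y, a, lam):
    # GE density from Equation (2).
    return a * lam * math.exp(-lam * y) * (1.0 - math.exp(-lam * y)) ** (a - 1)

def ge_cdf(y, a, lam):
    return (1.0 - math.exp(-lam * y)) ** a

def p1g_numeric(a0, a1, a2, lam, upper=60.0, steps=100_000):
    # Midpoint rule for p_1G = ∫ F_GE(y; a0+a1, lam) f_GE(y; a2, lam) dy on (0, upper);
    # the integrand is negligible beyond `upper` for these parameter values.
    h = upper / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        total += ge_cdf(y, a0 + a1, lam) * ge_pdf(y, a2, lam)
    return total * h

a0, a1, a2, lam = 1.5, 2.0, 1.0, 0.7
assert abs(p1g_numeric(a0, a1, a2, lam) - a2 / (a0 + a1 + a2)) < 1e-4
```

As with the BMOW case, the answer α2/(α0 + α1 + α2) does not involve the common scale λ.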
3. Discrimination Procedure
In this section, we present the discrimination procedure between the distributions. Specifically,
suppose {(X11, X21), . . . , (X1n, X2n)} is a random bivariate sample of size n generated
either from a BGE(ϑ) distribution or from a BMOW(Θ) distribution. Based on the above
sample, we want to decide from which distribution the data set has been obtained. We use
the following notations and sets for the rest of the paper: I0 = {(x1i, x2i): x1i = x2i = xi, i =
1, . . . , n}, I1 = {(x1i, x2i): x1i < x2i, i = 1, . . . , n}, I2 = {(x1i, x2i): x1i > x2i, i = 1, . . . , n},
I = I0 ∪ I1 ∪ I2, n0 = |I0|, n1 = |I1|, n2 = |I2|, and n0 + n1 + n2 = n. It is assumed
that n0 ≠ 0, n1 ≠ 0 and n2 ≠ 0. Let ϑ̂ = (α̂0, α̂1, α̂2, λ̂) be the ML estimator of ϑ based
on the assumption that the data have been obtained from the BGE(ϑ) distribution, and let
Θ̂ = (α̂, λ̂0, λ̂1, λ̂2) be the ML estimator of Θ based on the assumption that the data have
been obtained from the BMOW(Θ) distribution. Note that (α̂0, α̂1, α̂2, λ̂) and (α̂, λ̂0, λ̂1, λ̂2)
are obtained by maximizing the corresponding log-likelihood functions, say L1(α0, α1, α2, λ)
and L2(α, λ0, λ1, λ2), respectively. Note that here the log-likelihood function of the BGE
distribution can be written as
L1(ϑ) = (n0 + 2n1 + 2n2) log(λ) + n1 log(α0 + α1) + n1 log(α2) + n2 log(α1) + n2 log(α0 + α2)
      + (α0 + α1 − 1) Σ_{i∈I1} log(1 − e^(−λx1i)) + (α2 − 1) Σ_{i∈I1} log(1 − e^(−λx2i))
      + (α1 − 1) Σ_{i∈I2} log(1 − e^(−λx1i)) + (α0 + α2 − 1) Σ_{i∈I2} log(1 − e^(−λx2i))
      + n0 log(α0) + (α0 + α1 + α2 − 1) Σ_{i∈I0} log(1 − e^(−λxi))
      − λ [ Σ_{i∈I0} xi + Σ_{i∈I1∪I2} x1i + Σ_{i∈I1∪I2} x2i ],    (3)
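Equation (3) can be double-checked numerically: evaluating it in its grouped form must agree, to floating-point precision, with directly summing log f_1G, f_2G, f_0G over the sample. This is an illustrative sketch with made-up data and parameter values, not code from the paper:

```python
import math

def log_ge_pdf(x, a, lam):
    # log of the GE(a, lam) density in Equation (2).
    return math.log(a * lam) - lam * x + (a - 1) * math.log(1.0 - math.exp(-lam * x))

def loglik_bge_direct(data, a0, a1, a2, lam):
    # Sum of log f_1G, f_2G, f_0G over the sample, case by case.
    total = 0.0
    for x1, x2 in data:
        if x1 < x2:
            total += log_ge_pdf(x1, a0 + a1, lam) + log_ge_pdf(x2, a2, lam)
        elif x2 < x1:
            total += log_ge_pdf(x1, a1, lam) + log_ge_pdf(x2, a0 + a2, lam)
        else:  # tie: singular part f_0G
            total += math.log(a0 / (a0 + a1 + a2)) + log_ge_pdf(x1, a0 + a1 + a2, lam)
    return total

def loglik_bge_eq3(data, a0, a1, a2, lam):
    # Grouped form of Equation (3), using the index sets I0, I1, I2.
    I0 = [x1 for x1, x2 in data if x1 == x2]
    I1 = [(x1, x2) for x1, x2 in data if x1 < x2]
    I2 = [(x1, x2) for x1, x2 in data if x1 > x2]
    n0, n1, n2 = len(I0), len(I1), len(I2)
    lg = lambda t: math.log(1.0 - math.exp(-lam * t))
    return ((n0 + 2 * n1 + 2 * n2) * math.log(lam)
            + n1 * math.log(a0 + a1) + n1 * math.log(a2)
            + n2 * math.log(a1) + n2 * math.log(a0 + a2)
            + (a0 + a1 - 1) * sum(lg(x1) for x1, _ in I1)
            + (a2 - 1) * sum(lg(x2) for _, x2 in I1)
            + (a1 - 1) * sum(lg(x1) for x1, _ in I2)
            + (a0 + a2 - 1) * sum(lg(x2) for _, x2 in I2)
            + n0 * math.log(a0)
            + (a0 + a1 + a2 - 1) * sum(lg(x) for x in I0)
            - lam * (sum(I0) + sum(x1 + x2 for x1, x2 in I1 + I2)))

data = [(0.4, 0.9), (1.2, 0.7), (0.5, 0.5), (0.3, 1.1)]
args = (0.8, 1.5, 1.2, 0.6)
assert abs(loglik_bge_direct(data, *args) - loglik_bge_eq3(data, *args)) < 1e-9
```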
and the BMOW log-likelihood function can be written as

L2(Θ) = (n0 + 2n1 + 2n2) log(α) + n1 log(λ1) + n2 log(λ2) + n0 log(λ0) + n1 log(λ0 + λ2) + n2 log(λ0 + λ1)
      + (α − 1) [ Σ_{i∈I1∪I2} log(x1i) + Σ_{i∈I1∪I2} log(x2i) + Σ_{i∈I0} log(xi) ]
      − λ1 [ Σ_{i∈I1∪I2} x1i^α + Σ_{i∈I0} xi^α ] − λ2 [ Σ_{i∈I1∪I2} x2i^α + Σ_{i∈I0} xi^α ]
      − λ0 [ Σ_{i∈I2} x1i^α + Σ_{i∈I1} x2i^α + Σ_{i∈I0} xi^α ].
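The same cross-check works for the BMOW log-likelihood: the grouped expression above must coincide with the direct sum of log f_1W, f_2W, f_0W over the sample. Again an illustrative sketch with made-up data, not code from the paper:

```python
import math

def log_we_pdf(x, a, lam):
    # log of the WE(a, lam) density in Equation (1).
    return math.log(a * lam) + (a - 1) * math.log(x) - lam * x ** a

def loglik_bmow_direct(data, a, l0, l1, l2):
    # Sum of log f_1W, f_2W, f_0W over the sample, case by case.
    s = l0 + l1 + l2
    total = 0.0
    for x1, x2 in data:
        if x1 < x2:
            total += log_we_pdf(x1, a, l1) + log_we_pdf(x2, a, l0 + l2)
        elif x2 < x1:
            total += log_we_pdf(x1, a, l0 + l1) + log_we_pdf(x2, a, l2)
        else:  # tie: singular part f_0W
            total += math.log(l0 / s) + log_we_pdf(x1, a, s)
    return total

def loglik_bmow_grouped(data, a, l0, l1, l2):
    # Grouped form of the BMOW log-likelihood L_2.
    I0 = [x1 for x1, x2 in data if x1 == x2]
    I12 = [(x1, x2) for x1, x2 in data if x1 != x2]
    n0 = len(I0)
    n1 = sum(1 for x1, x2 in I12 if x1 < x2)
    n2 = len(I12) - n1
    return ((n0 + 2 * n1 + 2 * n2) * math.log(a)
            + n1 * math.log(l1) + n2 * math.log(l2) + n0 * math.log(l0)
            + n1 * math.log(l0 + l2) + n2 * math.log(l0 + l1)
            + (a - 1) * (sum(math.log(x1) + math.log(x2) for x1, x2 in I12)
                         + sum(math.log(x) for x in I0))
            - l1 * (sum(x1 ** a for x1, _ in I12) + sum(x ** a for x in I0))
            - l2 * (sum(x2 ** a for _, x2 in I12) + sum(x ** a for x in I0))
            - l0 * (sum(x1 ** a for x1, x2 in I12 if x2 < x1)
                    + sum(x2 ** a for x1, x2 in I12 if x1 < x2)
                    + sum(x ** a for x in I0)))

data = [(0.4, 0.9), (1.2, 0.7), (0.5, 0.5), (0.3, 1.1)]
args = (1.8, 0.6, 1.0, 1.4)
assert abs(loglik_bmow_direct(data, *args) - loglik_bmow_grouped(data, *args)) < 1e-9
```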
We use the following discrimination procedure. Consider the statistic

T = L2(α̂, λ̂0, λ̂1, λ̂2) − L1(α̂0, α̂1, α̂2, λ̂).    (4)

If T > 0, we choose the BMOW distribution; otherwise, we prefer the BGE distribution.
It may be mentioned that (α̂0, α̂1, α̂2, λ̂) and (α̂, λ̂0, λ̂1, λ̂2) are obtained by maximizing
the log-likelihood functions L1 and L2 given above, respectively. Computationally, both
are quite challenging problems: to maximize directly, one needs to solve a four-dimensional
optimization problem in each case. In both cases, the EM algorithm can be used quite
effectively to compute the ML estimators of the unknown parameters; see, e.g., Kundu and
Gupta (2009) and Kundu and Dey (2009) for the BGE and BMOW distributions, respectively.
In each case, it involves solving just a one-dimensional optimization problem at each E
step, and both methods work quite well. In the next section, we provide the asymptotic
distribution of T, which helps to compute the asymptotic PCS.
4. Asymptotic Distributions

In this section, we provide the asymptotic distributions of the test statistic for both cases, and use the following notation. For any functions f_1(U) and f_2(U), E_{BGE}[f_1(U)], V_{BGE}[f_1(U)] and Cov_{BGE}(f_1(U), f_2(U)) denote the mean of f_1(U), the variance of f_1(U), and the covariance of f_1(U) and f_2(U), respectively, under the assumption that U ~ BGE(\theta). Similarly, we define E_{BMOW}[f_1(U)], V_{BMOW}[f_1(U)] and Cov_{BMOW}(f_1(U), f_2(U)) as the mean of f_1(U), the variance of f_1(U), and the covariance of f_1(U) and f_2(U), respectively, under the assumption that U ~ BMOW(\theta) (the bivariate Marshall-Olkin Weibull). We have the following two main results.
Theorem 4.1 Under the assumption that the data come from the BMOW(\alpha, \lambda_0, \lambda_1, \lambda_2) distribution, the statistic T defined in Equation (4) is approximately normally distributed with mean E_{BMOW}[T] and variance V_{BMOW}[T]. The expressions of E_{BMOW}[T] and V_{BMOW}[T] are provided below.

Proof It is provided in the Appendix.
Now we provide the expressions for E_{BMOW}[T] and V_{BMOW}[T]. We denote

lim_{n \to \infty} E_{BMOW}[T]/n = AM_{BMOW} and lim_{n \to \infty} V_{BMOW}[T]/n = AV_{BMOW}.
100 A.K. Dey and D. Kundu
Therefore,

AM_{BMOW} = lim_{n \to \infty} (1/n) E_{BMOW}[T] = E_{BMOW}[\log f_{BMOW}(X_1, X_2; \theta) - \log f_{BGE}(X_1, X_2; \tilde\theta)],

AV_{BMOW} = lim_{n \to \infty} (1/n) V_{BMOW}[T] = V_{BMOW}[\log f_{BMOW}(X_1, X_2; \theta) - \log f_{BGE}(X_1, X_2; \tilde\theta)].
Note that both AM_{BMOW} and AV_{BMOW} cannot be obtained in explicit form. They have to be obtained numerically, and they are functions of p_{0W}, p_{1W}, p_{2W} and \theta. Moreover, it should be mentioned that the misspecified parameter \tilde\theta, as defined in Lemma 8.1 (see Appendix), also needs to be computed numerically.
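In practice AM_{BMOW}, AV_{BMOW} and \tilde\theta are all approximated by simulation, which requires repeated sampling from the BMOW distribution. A minimal Python sketch of a sampler is given below (our own illustration; the paper's computations were done in R). It uses the representation X_1 = min(U_0, U_1), X_2 = min(U_0, U_2) with independent U_j ~ WE(\alpha, \lambda_j); a tie X_1 = X_2 occurs exactly when the common component U_0 is the smallest, which happens with probability \lambda_0/(\lambda_0 + \lambda_1 + \lambda_2).

```python
import math
import random

def rweibull(alpha, lam, rng):
    # WE(alpha, lam) with survival function exp(-lam * x**alpha), by inversion
    return (-math.log(1.0 - rng.random()) / lam) ** (1.0 / alpha)

def rbmow(alpha, lam0, lam1, lam2, n, seed=0):
    """Generate n pairs (X1, X2) from BMOW(alpha, lam0, lam1, lam2)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u0 = rweibull(alpha, lam0, rng)
        u1 = rweibull(alpha, lam1, rng)
        u2 = rweibull(alpha, lam2, rng)
        out.append((min(u0, u1), min(u0, u2)))
    return out
```

With \lambda_0 = \lambda_1 = \lambda_2 the tie proportion should be close to 1/3, which gives a quick sanity check on the sampler.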
Theorem 4.2 Under the assumption that the data come from the BGE(\theta) distribution, the statistic T defined in Equation (4) is approximately normally distributed with mean E_{BGE}[T] and variance V_{BGE}[T]. The expressions of E_{BGE}[T] and V_{BGE}[T] are provided below.

Proof It is provided in the Appendix.
Now we provide the expressions for E_{BGE}[T] and V_{BGE}[T]. In this case, we denote

lim_{n \to \infty} E_{BGE}[T]/n = AM_{BGE} and lim_{n \to \infty} V_{BGE}[T]/n = AV_{BGE}.
Therefore,

AM_{BGE} = lim_{n \to \infty} (1/n) E_{BGE}[T] = E_{BGE}[\log f_{BMOW}(X_1, X_2; \tilde\theta) - \log f_{BGE}(X_1, X_2; \theta)],

AV_{BGE} = lim_{n \to \infty} (1/n) V_{BGE}[T] = V_{BGE}[\log f_{BMOW}(X_1, X_2; \tilde\theta) - \log f_{BGE}(X_1, X_2; \theta)].
As mentioned before, both AM_{BGE} and AV_{BGE} cannot be obtained in explicit form. They have to be obtained numerically, and they are also functions of p_{0G}, p_{1G}, p_{2G} and \theta. The misspecified parameter \tilde\theta, as defined in Lemma 8.2 (see Appendix), also needs to be computed numerically. Then, based on the corresponding asymptotic distributions, it is possible to compute the PCS for both cases.
5. Misspecified Parameter Estimates

In this section, we discuss the estimation of the misspecified parameters.

5.1 Estimation of \tilde\theta under the BMOW parent

In this case, it is assumed that the data have been obtained from the BMOW(\theta) distribution, and we would like to compute \tilde\theta, the misspecified BGE parameter vector, as defined in Lemma 8.1. Suppose (X_1, X_2) ~ BMOW(\theta). Consider the following events:
A_1 = {X_1 < X_2}, A_2 = {X_1 > X_2} and A_0 = {X_1 = X_2}. Moreover, 1_A is the indicator function taking value 1 on the set A and 0 otherwise. Therefore, \tilde\theta can be obtained as the argument maximum of E_{BMOW}[\log f_{BGE}(X_1, X_2; \bar\theta)] = g_1(\bar\theta) (say), where \bar\theta = (\alpha_0, \alpha_1, \alpha_2, \lambda) and

g_1(\bar\theta) = (p_{0W} + 2p_{1W} + 2p_{2W})\log\lambda + p_{1W}\log(\alpha_0 + \alpha_1) + p_{1W}\log\alpha_2
  + (\alpha_0 + \alpha_1 - 1) E_{BMOW}[\log(1 - e^{-\lambda X_1}) 1_{A_1}]
  + (\alpha_2 - 1) E_{BMOW}[\log(1 - e^{-\lambda X_2}) 1_{A_1}] - \lambda E_{BMOW}[(X_1 + X_2) 1_{A_1}]
  + p_{2W}\log\alpha_1 + (\alpha_1 - 1) E_{BMOW}[\log(1 - e^{-\lambda X_1}) 1_{A_2}] + p_{2W}\log(\alpha_0 + \alpha_2)
  + (\alpha_0 + \alpha_2 - 1) E_{BMOW}[\log(1 - e^{-\lambda X_2}) 1_{A_2}] - \lambda E_{BMOW}[(X_1 + X_2) 1_{A_2}]
  + (\alpha_0 + \alpha_1 + \alpha_2 - 1) E_{BMOW}[\log(1 - e^{-\lambda X}) 1_{A_0}] - \lambda E_{BMOW}[X 1_{A_0}] + p_{0W}\log\alpha_0.
We need to maximize g_1(\bar\theta) with respect to \bar\theta, for fixed \theta, to compute \tilde\theta numerically. Clearly, \tilde\theta is a function of \theta, but we do not make this explicit for brevity. Since maximizing g_1(\bar\theta) involves a four-dimensional optimization, we suggest using an approximate version of it, which can be performed very easily and works quite well in practice. The idea comes from the missing value principle, and it has been used by Kundu and Gupta (2009) in developing the EM algorithm. We suggest using the following pseudo version \tilde g_1(\bar\theta) of g_1(\bar\theta):
\tilde g_1(\bar\theta) = (p_{0W} + u_2 p_{1W} + w_2 p_{2W})\log\alpha_0 + (p_{0W} + 2p_{1W} + 2p_{2W})\log\lambda
  + (\alpha_0 + \alpha_1 + \alpha_2 - 1) E[\log(1 - e^{-\lambda X_1}) 1_{A_0}]
  - \lambda (E[X_1 1_{A_0}] + E[(X_1 + X_2) 1_{A_1 \cup A_2}]) + (u_1 p_{1W} + p_{2W})\log\alpha_1
  + (w_1 p_{2W} + p_{1W})\log\alpha_2 + (\alpha_0 + \alpha_1 - 1) E[\log(1 - e^{-\lambda X_1}) 1_{A_1}]
  + (\alpha_0 + \alpha_2 - 1) E[\log(1 - e^{-\lambda X_2}) 1_{A_2}] + (\alpha_2 - 1) E[\log(1 - e^{-\lambda X_2}) 1_{A_1}]
  + (\alpha_1 - 1) E[\log(1 - e^{-\lambda X_1}) 1_{A_2}].
Here,

u_1 = \lambda_0/(\lambda_0 + \lambda_2),  u_2 = \lambda_2/(\lambda_0 + \lambda_2),  w_1 = \lambda_0/(\lambda_0 + \lambda_1),  w_2 = \lambda_1/(\lambda_0 + \lambda_1), (5)

and p_{0W}, p_{1W}, p_{2W} are the same as defined before. The explicit expressions of the expected values are provided in the Appendix. Note that \tilde g_1(\bar\theta) is actually

\tilde g_1(\bar\theta) = lim_{n \to \infty} (1/n) E[l_{pseudo}(\alpha_0, \alpha_1, \alpha_2, \lambda \mid (X_{1i}, X_{2i}); i = 1, \ldots, n)].
Here l_{pseudo}(\cdot) is the pseudo log-likelihood function of the complete data set, as described in Kundu and Gupta (2009). Moreover, it has the same form as in Kundu and Gupta (2009), but since here it is assumed that (X_{1i}, X_{2i}) ~ BMOW(\alpha, \lambda_0, \lambda_1, \lambda_2), the expressions of u_1, u_2, w_1, w_2 are as in Equation (5), and they are different from those in Kundu and Gupta (2009).
Now the maximization of \tilde g_1(\bar\theta) can be performed as follows. Note that, for a given \lambda, the maximum of \tilde g_1(\bar\theta) with respect to \alpha_0, \alpha_1 and \alpha_2 occurs at

\hat\alpha_0(\lambda) = -(p_{0W} + u_2 p_{1W} + w_2 p_{2W}) / (E[\log(1 - e^{-\lambda X_1}) 1_{A_0}] + E[\log(1 - e^{-\lambda X_1}) 1_{A_1}] + E[\log(1 - e^{-\lambda X_2}) 1_{A_2}]),

\hat\alpha_1(\lambda) = -(u_1 p_{1W} + p_{2W}) / (E[\log(1 - e^{-\lambda X_1}) 1_{A_0}] + E[\log(1 - e^{-\lambda X_1}) 1_{A_1}] + E[\log(1 - e^{-\lambda X_1}) 1_{A_2}]),

\hat\alpha_2(\lambda) = -(p_{1W} + w_1 p_{2W}) / (E[\log(1 - e^{-\lambda X_2}) 1_{A_0}] + E[\log(1 - e^{-\lambda X_2}) 1_{A_2}] + E[\log(1 - e^{-\lambda X_2}) 1_{A_1}]),

respectively, and finally the maximization of \tilde g_1(\bar\theta) can be obtained by maximizing the profile function \tilde g_1(\hat\alpha_0(\lambda), \hat\alpha_1(\lambda), \hat\alpha_2(\lambda), \lambda) with respect to \lambda only. Therefore, it involves solving a one-dimensional optimization problem only.
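The profile idea is easiest to see in the univariate GE(\alpha, \lambda) case, where for fixed \lambda the maximizer in \alpha is available in closed form, \hat\alpha(\lambda) = -n / \sum_i \log(1 - e^{-\lambda x_i}), so the whole fit reduces to a one-dimensional search over \lambda. The Python sketch below (a simplified univariate stand-in for the four-parameter problem above; all function names are ours) implements this pattern with a golden-section search:

```python
import math
import random

def alpha_hat(lam, xs):
    # closed-form maximizer of the GE log-likelihood in alpha, for fixed lam
    return -len(xs) / sum(math.log(1.0 - math.exp(-lam * x)) for x in xs)

def profile_loglik(lam, xs):
    # GE log-likelihood evaluated at (alpha_hat(lam), lam)
    a = alpha_hat(lam, xs)
    n = len(xs)
    return (n * math.log(a) + n * math.log(lam) - lam * sum(xs)
            + (a - 1.0) * sum(math.log(1.0 - math.exp(-lam * x)) for x in xs))

def fit_ge(xs, lo=1e-3, hi=20.0, iters=80):
    """Golden-section search on the one-dimensional profile in lam."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if profile_loglik(c, xs) >= profile_loglik(d, xs):
            b = d
        else:
            a = c
    lam = 0.5 * (a + b)
    return alpha_hat(lam, xs), lam

def rge(alpha, lam, n, seed=1):
    # GE(alpha, lam) sampler via the inverse cdf F(x) = (1 - exp(-lam x))**alpha
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random() ** (1.0 / alpha)) / lam for _ in range(n)]
```

The same structure carries over to \tilde g_1: plug the closed-form \hat\alpha_0(\lambda), \hat\alpha_1(\lambda), \hat\alpha_2(\lambda) into the objective and search over \lambda alone.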
5.2 Estimation of \tilde\theta under the BGE parent

In this case, it is assumed that the data have been obtained from the BGE(\theta) distribution, and we compute \tilde\theta, the misspecified BMOW parameter vector, as defined in Lemma 8.2. In this case, \tilde\theta can be obtained as the argument maximum of E_{BGE}[\log f_{BMOW}(X_1, X_2; \bar\theta)] = g_2(\bar\theta) (say), where \bar\theta = (\alpha, \lambda_0, \lambda_1, \lambda_2) and
g_2(\bar\theta) = (p_{0G} + 2p_{1G} + 2p_{2G})\log\alpha
  + p_{1G}\log\lambda_1 + p_{2G}\log\lambda_2 + p_{0G}\log\lambda_0 + p_{1G}\log(\lambda_0 + \lambda_2) + p_{2G}\log(\lambda_0 + \lambda_1)
  + (\alpha - 1)(E_{BGE}[\log X_1 1_{A_1}] + E_{BGE}[\log X_1 1_{A_2}])
  + (\alpha - 1)(E_{BGE}[\log X_2 1_{A_1}] + E_{BGE}[\log X_2 1_{A_2}] + E_{BGE}[\log X_1 1_{A_0}])
  - \lambda_1 (E_{BGE}[X_1^{\alpha} 1_{A_1}] + E_{BGE}[X_1^{\alpha} 1_{A_2}] + E_{BGE}[X_1^{\alpha} 1_{A_0}])
  - \lambda_2 (E_{BGE}[X_2^{\alpha} 1_{A_1}] + E_{BGE}[X_2^{\alpha} 1_{A_2}] + E_{BGE}[X_1^{\alpha} 1_{A_0}])
  - \lambda_0 (E_{BGE}[X_1^{\alpha} 1_{A_2}] + E_{BGE}[X_2^{\alpha} 1_{A_1}] + E_{BGE}[X_1^{\alpha} 1_{A_0}]).
In this case, we need to maximize g_2(\bar\theta) with respect to \bar\theta numerically to obtain \tilde\theta, for a fixed \theta. Clearly, \tilde\theta depends on \theta, but we do not make this explicit for brevity. Similarly as before, since the maximization of g_2(\bar\theta) involves a four-dimensional optimization problem, we suggest using the following approximation \tilde g_2(\bar\theta) of g_2(\bar\theta):
\tilde g_2(\bar\theta) = (p_{0G} + 2p_{1G} + 2p_{2G})\log\alpha + (\alpha - 1) E[\log X_1 1_{A_0} + (\log X_1 + \log X_2) 1_{A_1 \cup A_2}]
  - \lambda_0 E[X_1^{\alpha} 1_{A_0} + X_1^{\alpha} 1_{A_2} + X_2^{\alpha} 1_{A_1}] + (p_{0G} + a_1 p_{1G} + b_1 p_{2G})\log\lambda_0
  - \lambda_1 E[X_1^{\alpha}] + (p_{1G} + b_2 p_{2G})\log\lambda_1
  - \lambda_2 E[X_2^{\alpha}] + (p_{2G} + a_2 p_{1G})\log\lambda_2.
Here,

a_1 = \alpha_1/(\alpha_0 + \alpha_1),  a_2 = \alpha_0/(\alpha_0 + \alpha_2),  b_1 = \alpha_2/(\alpha_0 + \alpha_2),  b_2 = \alpha_0/(\alpha_0 + \alpha_1),

and p_{0G}, p_{1G}, p_{2G} are the same as defined before. The expressions of the different expectations are provided in the Appendix.
It may be observed, similarly as before, that

\tilde g_2(\bar\theta) = lim_{n \to \infty} (1/n) E[l_{pseudo}(\alpha, \lambda_0, \lambda_1, \lambda_2 \mid (X_{1i}, X_{2i}); i = 1, \ldots, n)],

where (X_{1i}, X_{2i}) ~ BGE(\alpha_0, \alpha_1, \alpha_2, \lambda). The explicit expression of l_{pseudo}(\cdot) is available in Kundu and Dey (2009).
The maximization of \tilde g_2(\bar\theta) can be performed quite easily. For fixed \alpha, the maximum of \tilde g_2(\bar\theta) with respect to \lambda_1, \lambda_2 and \lambda_0 is attained at

\hat\lambda_1(\alpha) = (p_{1G} + b_2 p_{2G}) / E[X_1^{\alpha}],

\hat\lambda_2(\alpha) = (p_{2G} + a_2 p_{1G}) / E[X_2^{\alpha}],

\hat\lambda_0(\alpha) = (p_{0G} + a_1 p_{1G} + b_1 p_{2G}) / (E[X_1^{\alpha} 1_{A_0}] + E[X_1^{\alpha} 1_{A_2}] + E[X_2^{\alpha} 1_{A_1}]),

respectively, and finally the maximization of \tilde g_2(\bar\theta) can be performed by maximizing the profile function \tilde g_2(\alpha, \hat\lambda_0(\alpha), \hat\lambda_1(\alpha), \hat\lambda_2(\alpha)) with respect to \alpha only.
6. Numerical Results

In this section, we perform some numerical experiments to observe how these asymptotic results work for different sample sizes and for different parameter values. All the computations were performed at the Indian Institute of Technology Kanpur, using Intel(R) Core(TM)2 Quad CPU Q9550 2.83GHz, 3.23 GB RAM machines. The programs are written in R (version 2.8.1) and can be obtained from the authors on request. We compute the PCS based on Monte Carlo (MC) simulation, and also based on the asymptotic results. We replicate the process 1000 times and compute the proportion of correct selections. For computing the PCS based on the asymptotic results, we first compute the misspecified parameters and, based on those misspecified parameters, we compute the PCS.
6.1 Case 1: parent distribution is BMOW

In this case, we consider the following parameter sets:

Set 1: \alpha = 2.0, \lambda_0 = 1.0, \lambda_1 = 1.0, \lambda_2 = 1.0;  Set 2: \alpha = 1.5, \lambda_0 = 1.0, \lambda_1 = 1.0, \lambda_2 = 1.0;
Set 3: \alpha = 1.5, \lambda_0 = 0.5, \lambda_1 = 0.5, \lambda_2 = 0.5;  Set 4: \alpha = 1.5, \lambda_0 = 2.0, \lambda_1 = 1.0, \lambda_2 = 1.5,

and different sample sizes, namely n = 20, 40, 60, 80, 100. For each parameter set and each sample size, we generated a sample from the BMOW distribution. Then we computed the ML estimates of the unknown parameters and the values of the corresponding maximized log-likelihood functions, assuming that the data come from the BMOW or the BGE distribution. In computing the ML estimates of the unknown parameters, we used the EM algorithms suggested in Kundu and Dey (2009) and Kundu and Gupta (2009), respectively. Finally, based on the values of the corresponding maximized log-likelihood functions, we decide whether we have made the correct decision or not. We replicate the process 1000 times and compute the proportion of correct selections. The results are reported in the first rows of Tables 1 to 4.

Table 1. PCS based on MC simulations and based on the asymptotic distribution (AD) for parameter Set 1.
n    20      40      60      80      100
MC   0.9255  0.9808  0.9953  0.9987  0.9997
AD   0.9346  0.9837  0.9956  0.9987  0.9996

Table 2. PCS based on MC simulations and based on AD for parameter Set 2.
n    20      40      60      80      100
MC   0.9255  0.9808  0.9953  0.9987  0.9997
AD   0.9212  0.9772  0.9928  0.9976  0.9992

Table 3. PCS based on MC simulations and based on AD for parameter Set 3.
n    20      40      60      80      100
MC   0.9073  0.9749  0.9914  0.9979  0.9989
AD   0.9204  0.9767  0.9926  0.9975  0.9992

Table 4. PCS based on MC simulations and based on AD for parameter Set 4.
n    20      40      60      80      100
MC   0.8834  0.9587  0.9843  0.9952  0.9973
AD   0.8996  0.9648  0.9866  0.9947  0.9979

Now, to compare these results with the corresponding asymptotic results, we first compute the misspecified parameters for each parameter set; they are presented in Table 5. In each case, we need to compute AM_{BMOW} and AV_{BMOW}, as defined in Theorem 4.1. Since the exact expressions of AM_{BMOW} and AV_{BMOW} are quite complicated, we have used simulation-consistent estimates of them, which can be obtained very easily. The simulation-consistent estimators of AM_{BMOW} and AV_{BMOW} are obtained using 10,000 replications, and they are reported in Table 6.
Table 5. Misspecified parameter values \tilde\theta for the different parameter sets.
Set 1: (1.6199, 0.1732, 0.1137, 0.1992)
Set 2: (1.4199, 0.2575, 0.2418, 0.2418)
Set 3: (1.8200, 0.1123, 0.1050, 0.1050)
Set 4: (1.6199, 0.1665, 0.1553, 0.1553)
Table 6. AM_{BMOW} and AV_{BMOW} for the different parameter sets.
Set   AM_{BMOW}   AV_{BMOW}
1     0.2224      0.4406
2     0.1967      0.4095
3     0.2316      0.4692
4     0.2128      0.4157
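Under Theorem 4.1, E_{BMOW}[T] \approx n AM_{BMOW} and V_{BMOW}[T] \approx n AV_{BMOW}, so the asymptotic PCS P(T > 0) can be approximated by \Phi(\sqrt{n} AM_{BMOW} / \sqrt{AV_{BMOW}}), where \Phi is the standard normal cdf. A quick Python check with the Set 1 values from Table 6 (our own sketch; the paper's computation may differ slightly) reproduces the AD row of Table 1 to roughly two decimal places:

```python
import math

def normal_cdf(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def asymptotic_pcs(n, am, av):
    """P(T > 0) when T is approximately N(n*am, n*av)."""
    return normal_cdf(math.sqrt(n) * am / math.sqrt(av))

# Set 1 values from Table 6: AM_BMOW = 0.2224, AV_BMOW = 0.4406
for n in (20, 40, 60, 80, 100):
    print(n, round(asymptotic_pcs(n, 0.2224, 0.4406), 4))
```

The agreement improves with n, as expected for a normal approximation.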
Now, similarly as before, based on the asymptotic distribution of T as provided in Theorem 4.2, we compute the PCS in this case, i.e., P(T < 0), for different sample sizes. We report the results in the second rows of Tables 7 to 10 for all the parameter sets. In this case, it is observed that the asymptotic results match extremely well with the simulated results.
7. Data Analysis

In this section, we present the analysis of a real data set for illustrative purposes. The data are from National Football League (NFL) matches played on three consecutive weekends in 1986, and were originally published in the Washington Post. In this bivariate data set, the variables are the game time to the first points scored by kicking the ball between the goal posts (X_1) and the game time to the first points scored by moving the ball into the end zone (X_2). These times are of interest to a casual spectator who wants to know how long one has to wait to watch a touchdown, or to a spectator who is interested only in the beginning stages of a game. The data (scoring times in minutes and seconds) are presented in Table 13. We have analyzed the data by converting the seconds to decimal minutes; e.g., 2:03 has been converted to 2.05.
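The minutes-and-seconds conversion described above is a one-line transformation; a small Python sketch (function name ours):

```python
def scoring_time_to_minutes(mmss):
    """Convert a 'mm:ss' game time, e.g. '2:03', to decimal minutes (2.05)."""
    minutes, seconds = mmss.split(":")
    return int(minutes) + int(seconds) / 60.0
```

For example, 10:34 becomes 10 + 34/60, i.e. about 10.5667 minutes.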
Table 13. National Football League (NFL) data.
X_1    X_2        X_1    X_2        X_1    X_2
2:03 3:59 5:47 25:59 10:24 14:15
9:03 9:03 13:48 49:45 2:59 2:59
0:51 0:51 7:15 7:15 3:53 6:26
3:26 3:26 4:15 4:15 0:45 0:45
7:47 7:47 1:39 1:39 11:38 17:22
10:34 14:17 6:25 15:05 1:23 1:23
7:03 7:03 4:13 9:29 10:21 10:21
2:35 2:35 15:32 15:32 12:08 12:08
7:14 9:41 2:54 2:54 14:35 14:35
6:51 34:35 7:01 7:01 11:49 11:49
32:27 42:21 6:25 6:25 5:31 11:16
8:32 14:34 8:59 8:59 19:39 10:42
31:08 49:53 10:09 10:09 17:50 17:50
14:35 20:34 8:52 8:52 10:51 38:04
The variables X_1 and X_2 have the following structure: (i) X_1 < X_2 means that the first score is a field goal; (ii) X_1 = X_2 means that the first score is a converted touchdown; (iii) X_1 > X_2 means that the first score is an unconverted touchdown or safety. In this case, the ties are exact because no game time elapses between a touchdown and a point-after conversion attempt. Therefore, it is clear that X_1 = X_2 occurs with positive probability, and some singular distribution should be used to analyze this data set.
Define the random variables

U_1 = time to the first field goal,
U_2 = time to the first safety or unconverted touchdown,
U_0 = time to the first converted touchdown.

Then X_1 = min{U_0, U_1} and X_2 = min{U_0, U_2}. Therefore, (X_1, X_2) has a similar structure to the bivariate Marshall-Olkin exponential model. Csorgo and Welsh (1989) analyzed the data using the bivariate Marshall-Olkin exponential model, but concluded that it does not work well because X_2 may be exponential but X_1 is not. In fact, it is observed that the empirical hazard functions (HFs) of both X_1 and X_2 are increasing.
Since both the BMOW and BGE distributions can have increasing marginal HFs, we fit both models to the data set. For the BMOW distribution, using the EM algorithm as suggested
in Kundu and Dey (2009), we compute the ML estimates of the unknown parameters as \hat\alpha = 1.2889, \hat\lambda_0 = 11.2073, \hat\lambda_1 = 8.3572, \hat\lambda_2 = 0.4720, and the associated 95% confidence intervals are (1.0372, 1.5406), (5.7213, 16.6932), (2.5312, 14.1831) and (-0.4872, 1.4314), respectively.

Figure 3. Histogram of the bootstrap sample of the discrimination statistic.
The value of the corresponding maximized log-likelihood function is 47.8041. In case of the BGE distribution, using the EM algorithm as suggested in Kundu and Gupta (2009), we obtain the ML estimates of the unknown parameters as \hat\alpha_0 = 1.1628, \hat\alpha_1 = 0.0558, \hat\alpha_2 = 0.5961, \hat\lambda = 9.5634, and the associated 95% confidence intervals are (0.6991, 1.6266), (-0.0205, 0.1322), (0.2751, 0.9171) and (6.5298, 12.5970), respectively. The value of the corresponding maximized log-likelihood function is 38.0042. Therefore, based on the values of the corresponding maximized log-likelihood functions, we prefer the BMOW model over the BGE model for this data set.
Now, to compute the PCS in this case, we perform a non-parametric bootstrap. The histogram of the bootstrap sample of the discrimination statistic is provided in Figure 3. Based on one thousand bootstrap replications, it is observed that the PCS is 0.98.
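The non-parametric bootstrap PCS has a simple structure: resample the n pairs with replacement, recompute the discrimination statistic on each resample, and report the proportion of resamples on which the preferred model is selected. A hedged Python skeleton follows (our own; the model-fitting step that produces T via the two EM fits is abstracted into a user-supplied function, and the stand-in statistic below is purely illustrative):

```python
import random

def bootstrap_pcs(pairs, stat, b=1000, seed=0):
    """Fraction of bootstrap resamples on which stat(resample) > 0.

    stat should be the discrimination statistic T computed on a data set;
    here any real-valued function of the resampled pairs works.
    """
    rng = random.Random(seed)
    n = len(pairs)
    wins = 0
    for _ in range(b):
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        if stat(resample) > 0:
            wins += 1
    return wins / b

# trivial stand-in statistic for illustration only: mean of x2 - x1
def demo_stat(sample):
    return sum(x2 - x1 for x1, x2 in sample) / len(sample)
```

In the application above, stat would refit both the BMOW and BGE models on the resample and return the difference of the maximized log-likelihoods.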
8. Conclusion

In this paper, we have considered discriminating between two singular bivariate models, namely the BMOW and BGE distributions. Both distributions have a singular part and an absolutely continuous part. The difference of the values of the corresponding maximized log-likelihood functions has been used as the discrimination statistic. We have obtained the asymptotic distribution of the discrimination statistic, which can be used to compute the asymptotic PCS. MC simulations are performed to observe the behavior of the proposed method. It is known that discriminating between the Weibull and generalized exponential distributions is quite difficult (see Gupta and Kundu, 2003), but in this paper it is observed that discriminating between the BMOW and BGE distributions is relatively easier: even with small sample sizes the PCS is quite high. Moreover, the asymptotic PCS matches the simulated PCS very well even for moderate sample sizes. We have also analyzed a data set and computed the PCS using the non-parametric bootstrap method. Although we do not have any theoretical results, it seems the non-parametric bootstrap method can also be used quite effectively for computing the PCS in this case. More work is needed in this direction.
Appendix

To prove Theorem 4.1, we need Lemma 8.1. Here \to_{a.s.} means "converges almost surely".
Lemma 8.1 Under the assumption that the data come from the BMOW(\alpha, \lambda_0, \lambda_1, \lambda_2) distribution, as n \to \infty, we have:

(i) \hat\alpha \to_{a.s.} \alpha, \hat\lambda_0 \to_{a.s.} \lambda_0, \hat\lambda_1 \to_{a.s.} \lambda_1 and \hat\lambda_2 \to_{a.s.} \lambda_2, where, for \theta = (\alpha, \lambda_0, \lambda_1, \lambda_2),

E_{BMOW}[\log f_{BMOW}(X_1, X_2; \theta)] = \max_{\bar\theta} E_{BMOW}[\log f_{BMOW}(X_1, X_2; \bar\theta)];

(ii) \hat\alpha_0 \to_{a.s.} \tilde\alpha_0, \hat\alpha_1 \to_{a.s.} \tilde\alpha_1, \hat\alpha_2 \to_{a.s.} \tilde\alpha_2 and \hat\lambda \to_{a.s.} \tilde\lambda, where, for \tilde\theta = (\tilde\alpha_0, \tilde\alpha_1, \tilde\alpha_2, \tilde\lambda),

E_{BMOW}[\log f_{BGE}(X_1, X_2; \tilde\theta)] = \max_{\bar\theta} E_{BMOW}[\log f_{BGE}(X_1, X_2; \bar\theta)].

It may be noted that \tilde\theta may depend on \theta, but we do not make this explicit for brevity;

(iii) if we denote

T^* = L_2(\alpha, \lambda_0, \lambda_1, \lambda_2) - L_1(\tilde\alpha_0, \tilde\alpha_1, \tilde\alpha_2, \tilde\lambda),

then n^{-1/2}(T - E_{BMOW}[T]) is asymptotically equivalent to n^{-1/2}(T^* - E_{BMOW}[T^*]).

Proof of Lemma 8.1 It is quite standard and follows along the same lines as the proof of Lemma 2.2 of White (1982); it is therefore omitted.
Proof of Theorem 4.1 Using the central limit theorem and part (ii) of Lemma 8.1, it follows that n^{-1/2}(T^* - E_{BMOW}[T^*]) is asymptotically normally distributed with mean zero and variance AV_{BMOW}. The result then follows from part (iii) of Lemma 8.1.

To prove Theorem 4.2, we need Lemma 8.2.

Lemma 8.2 Under the assumption that the data come from the BGE(\alpha_0, \alpha_1, \alpha_2, \lambda) distribution, as n \to \infty, we have:

(i) \hat\alpha_0 \to_{a.s.} \alpha_0, \hat\alpha_1 \to_{a.s.} \alpha_1, \hat\alpha_2 \to_{a.s.} \alpha_2 and \hat\lambda \to_{a.s.} \lambda, where, for \theta = (\alpha_0, \alpha_1, \alpha_2, \lambda),

E_{BGE}[\log f_{BGE}(X_1, X_2; \theta)] = \max_{\bar\theta} E_{BGE}[\log f_{BGE}(X_1, X_2; \bar\theta)];

(ii) \hat\alpha \to_{a.s.} \tilde\alpha, \hat\lambda_0 \to_{a.s.} \tilde\lambda_0, \hat\lambda_1 \to_{a.s.} \tilde\lambda_1 and \hat\lambda_2 \to_{a.s.} \tilde\lambda_2, where, for \tilde\theta = (\tilde\alpha, \tilde\lambda_0, \tilde\lambda_1, \tilde\lambda_2),

E_{BGE}[\log f_{BMOW}(X_1, X_2; \tilde\theta)] = \max_{\bar\theta} E_{BGE}[\log f_{BMOW}(X_1, X_2; \bar\theta)];

here also \tilde\theta may depend on \theta, but we do not make this explicit for brevity;

(iii) if we denote

T^* = L_2(\tilde\alpha, \tilde\lambda_0, \tilde\lambda_1, \tilde\lambda_2) - L_1(\alpha_0, \alpha_1, \alpha_2, \lambda),

then n^{-1/2}(T - E_{BGE}[T]) is asymptotically equivalent to n^{-1/2}(T^* - E_{BGE}[T^*]).
Proof of Theorem 4.2 It follows along the same lines as the proof of Theorem 4.1, using Lemma 8.2 instead of Lemma 8.1.

The following lemmas are useful for computing the different expected values needed in \tilde g_1(\cdot) and \tilde g_2(\cdot). Here 1_{A_0}, 1_{A_1} and 1_{A_2} are the same as defined before.
Lemma A.1 Let W_0 ~ GE(\alpha_0 + \alpha_1 + \alpha_2, \lambda), W_1 ~ GE(\alpha_0 + \alpha_1, \lambda), W_2 ~ GE(\alpha_0 + \alpha_2, \lambda) and (X_1, X_2) ~ BGE(\alpha_0, \alpha_1, \alpha_2, \lambda). If g(\cdot) is any Borel measurable function, then

E[g(X_1) 1_{A_1}] = E[g(W_1)] - ((\alpha_0 + \alpha_1)/(\alpha_0 + \alpha_1 + \alpha_2)) E[g(W_0)],

E[g(X_1) 1_{A_2}] = (\alpha_1/(\alpha_0 + \alpha_1 + \alpha_2)) E[g(W_0)],

E[g(X_1) 1_{A_0}] = E[g(X_2) 1_{A_0}] = (\alpha_0/(\alpha_0 + \alpha_1 + \alpha_2)) E[g(W_0)],

E[g(X_2) 1_{A_1}] = (\alpha_2/(\alpha_0 + \alpha_1 + \alpha_2)) E[g(W_0)],

E[g(X_2) 1_{A_2}] = E[g(W_2)] - ((\alpha_0 + \alpha_2)/(\alpha_0 + \alpha_1 + \alpha_2)) E[g(W_0)].

Proof of Lemma A.1 See Kundu and Gupta (2009).
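The identities of Lemma A.1 are easy to check by simulation through the maximum representation X_1 = max(U_0, U_1), X_2 = max(U_0, U_2) with independent U_j ~ GE(\alpha_j, \lambda). The hedged Python sketch below (our own check, not part of the paper) verifies the second identity with g(x) = x and \alpha_0 = \alpha_1 = \alpha_2 = \lambda = 1, for which E[g(W_0)] = 1 + 1/2 + 1/3, the mean of the maximum of three standard exponential variables:

```python
import math
import random

def rge(alpha, lam, rng):
    # GE(alpha, lam): F(x) = (1 - exp(-lam x))**alpha, by inversion
    return -math.log(1.0 - rng.random() ** (1.0 / alpha)) / lam

def mc_lhs(n=200000, seed=11):
    """Monte Carlo estimate of E[g(X1) 1_{A2}] with g(x) = x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u0, u1, u2 = rge(1, 1, rng), rge(1, 1, rng), rge(1, 1, rng)
        x1, x2 = max(u0, u1), max(u0, u2)
        if x1 > x2:          # the event A2
            total += x1
    return total / n

# Lemma A.1 predicts alpha1/(alpha0+alpha1+alpha2) * E[W0] = (1/3)*(1 + 1/2 + 1/3)
```

The same construction, with min in place of max and Weibull components, checks Lemma A.2.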
Lemma A.2 Let Z_0 ~ WE(\alpha, \lambda_0 + \lambda_1 + \lambda_2), Z_1 ~ WE(\alpha, \lambda_0 + \lambda_1), Z_2 ~ WE(\alpha, \lambda_0 + \lambda_2) and (X_1, X_2) ~ BMOW(\alpha, \lambda_0, \lambda_1, \lambda_2). If g(\cdot) is any Borel measurable function, then

E[g(X_1) 1_{A_1}] = (\lambda_1/(\lambda_0 + \lambda_1 + \lambda_2)) E[g(Z_0)],

E[g(X_1) 1_{A_2}] = E[g(Z_1)] - ((\lambda_0 + \lambda_1)/(\lambda_0 + \lambda_1 + \lambda_2)) E[g(Z_0)],

E[g(X_1) 1_{A_0}] = E[g(X_2) 1_{A_0}] = (\lambda_0/(\lambda_0 + \lambda_1 + \lambda_2)) E[g(Z_0)],

E[g(X_2) 1_{A_1}] = E[g(Z_2)] - ((\lambda_0 + \lambda_2)/(\lambda_0 + \lambda_1 + \lambda_2)) E[g(Z_0)],

E[g(X_2) 1_{A_2}] = (\lambda_2/(\lambda_0 + \lambda_1 + \lambda_2)) E[g(Z_0)].

Proof of Lemma A.2 These can be obtained along the same lines as in Lemma A.1.
References

Atkinson, A., 1969. A test for discriminating between models. Biometrika, 56, 337-341.
Atkinson, A., 1970. A method for discriminating between models (with discussions). Journal of The Royal Statistical Society Series B - Statistical Methodology, 32, 323-353.
Bain, L.J., Englehardt, M., 1980. Probability of correct selection of Weibull versus gamma based on likelihood ratio test. Communications in Statistics - Theory and Methods, 9, 375-381.
Bemis, B., Bain, L.J., Higgins, J.J., 1972. Estimation and hypothesis testing for the parameters of a bivariate exponential distribution. Journal of the American Statistical Association, 67, 927-929.
Cox, D.R., 1961. Tests of separate families of hypotheses. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, pp. 105-123.
Cox, D.R., 1962. Further results on tests of separate families of hypotheses. Journal of The Royal Statistical Society Series B - Statistical Methodology, 24, 406-424.
Csorgo, S., Welsh, A.H., 1989. Testing for exponential and Marshall-Olkin distribution. Journal of Statistical Planning and Inference, 23, 287-300.
Dey, A.K., Kundu, D., 2009. Discriminating among the log-normal, Weibull and generalized exponential distributions. IEEE Transactions on Reliability, 58, 416-424.
Dey, A.K., Kundu, D., 2010. Discriminating between the log-normal and log-logistic distributions. Communications in Statistics - Theory and Methods, 39, 280-292.
Gupta, R.D., Kundu, D., 1999. Generalized exponential distributions. Australian and New Zealand Journal of Statistics, 41, 173-188.
Gupta, R.D., Kundu, D., 2003. Discriminating between Weibull and generalized exponential distributions. Computational Statistics and Data Analysis, 43, 179-196.
Gupta, R.D., Kundu, D., 2006. On the comparison of Fisher information of the Weibull and GE distributions. Journal of Statistical Planning and Inference, 136, 3130-3144.
Gupta, R.D., Kundu, D., 2007. Generalized exponential distribution: existing methods and recent developments. Journal of Statistical Planning and Inference, 137, 3537-3547.
Kotz, S., Balakrishnan, N., Johnson, N., 2000. Continuous Multivariate Distributions: Models and Applications. Wiley and Sons, New York.
Kundu, D., Dey, A.K., 2009. Estimating the parameters of the Marshall-Olkin bivariate Weibull distribution by EM algorithm. Computational Statistics and Data Analysis, 53, 956-965.
Kundu, D., Gupta, R.D., 2009. Bivariate generalized exponential distribution. Journal of Multivariate Analysis, 100, 581-593.
Marshall, A.W., Meza, J.C., Olkin, I., 2001. Can data recognize its parent distribution? Journal of Computational and Graphical Statistics, 10, 555-580.
White, H., 1982. Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-25.
Call for Papers

The editorial board of the Chilean Journal of Statistics (ChJS) is seeking papers, which will be refereed. We encourage the authors to submit a PDF electronic version of the manuscript to Victor Leiva, Executive Editor of the ChJS, to victor.leiva@uv.cl and chjs.editor@uv.cl.

Manuscript Preparation

Submission

Manuscripts to be submitted to the ChJS must be written in English and contain the name and affiliation of each author and a leading abstract followed by keywords and mathematics subject classification (primary and secondary). AMS classification is available from the ChJS website. Sections must be numbered 1, 2, etc., where Section 1 is the introduction part. References should be collected at the end of the paper in alphabetical order as in the following examples:

Rukhin, A.L., 2009. Identities for negative moments of quadratic forms in normal variables. Statistics and Probability Letters, 79, 1004-1007.
Stein, M.L., 1999. Statistical Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York.
Tsay, R.S., Peña, D., Pankratz, A.E., 2000. Outliers in multivariate time series. Biometrics, 87, 789-804.

References in the text should be given by the authors' name and year of publication, e.g., Gelfand and Smith (1990). In the case of more than two authors, the citation must be written as Tsay et al. (2000).

Acceptance

Once the manuscript has been accepted for publication in the ChJS, the authors must prepare the final version following the above indications and using the Latex format. Latex template and chjs class files for manuscript preparation are available from the ChJS website.

Copyright

Authors who publish their articles in the Chilean Journal of Statistics automatically transfer their copyright to the Chilean Statistical Society. This enables full copyright protection and wide dissemination of the articles and the journal in any format.

The Chilean Journal of Statistics grants permission to use figures, tables and brief extracts from its collection of articles in scientific and educational works, in which case the source that provides these issues (Chilean Journal of Statistics) must be clearly acknowledged.