Statistica Neerlandica - 2021 - Bayer - Inflated Kumaraswamy Regressions With Application To Water Supply and Sanitation in

Received: 2 September 2020 Revised: 12 March 2021 Accepted: 29 March 2021
DOI: 10.1111/stan.12242
ORIGINAL ARTICLE
Inflated Kumaraswamy regressions with application to water supply and sanitation in Brazil
Fábio M. Bayer¹, Francisco Cribari-Neto², Jéssica Santos³

¹ Departamento de Estatística and LACESM, Universidade Federal de Santa Maria, Santa Maria, Brazil
² Departamento de Estatística, Universidade Federal de Pernambuco, Recife, Brazil
³ Instituto Federal de Educação, Ciência e Tecnologia de Pernambuco - Campus Paulista, Paulista, Brazil

Correspondence
Fábio M. Bayer, Departamento de Estatística, Universidade Federal de Santa Maria, Santa Maria, RS, Brazil.
Email: bayer@ufsm.br

Funding information
CAPES and CNPq, Grant/Award Numbers: 301651/2017-5, 305350/2017-0

Models based on the Kumaraswamy law are used with variables that assume values in (0, 1). In some cases, however, the data contain zeros and/or ones, that is, there is data inflation. We introduce a class of regression models that can be used with such inflated data, namely: the class of inflated Kumaraswamy regression models. We consider inflation at zero, at one, and at both zero and one. We introduce the model and provide closed-form expressions for its score vector and Fisher’s information matrix. The proposed model is used to evaluate the impacts of different conditioning variables on the proportion of people who live in households with inadequate water supply and sewage in Brazilian municipalities. Our results reveal that policies directed to increasing the population share with college education in places where it is low are particularly effective in reducing the prevalence of people who live under inadequate sanitation conditions.
KEYWORDS
double bounded data, inflated distribution, Kumaraswamy
distribution, likelihood inference, regression model
© 2021 Netherlands Society for Statistics and Operations Research
Statistica Neerlandica. 2021;75:453–481. wileyonlinelibrary.com/journal/stan 453

1 INTRODUCTION
Regression modeling is routinely used by practitioners to model the behavior of a given variable
of interest (dependent variable, response) when it is impacted by other variables (independent
variables, covariates, regressors). The most commonly used model is the linear regression model
which is appropriate for responses that assume values in the real line. The gamma, inverse Gaus-
sian, Weibull and Birnbaum–Saunders regression models are commonly used with responses that
are positively valued. None of those models, however, is appropriate for responses whose support
is the standard unit interval, (0, 1). The most commonly used model with such responses is the
beta regression model introduced by Ferrari and Cribari-Neto (2004). Its underlying assumption
is that the response follows the beta law with mean and precision parameters that vary across
observations according to the values of independent variables. An extension of the model that can
be used when some data points equal one of the endpoints of the standard unit interval (single
inflation) was proposed by Ospina and Ferrari (2012).
In this article we develop an alternative class of models for use with double bounded random
variables subject to single (inflation at zero or at one) or double (inflation at both endpoints of
the standard unit interval simultaneously) inflation, namely: the class of inflated Kumaraswamy
regression models. The proposed model combines continuous and discrete components, the for-
mer evolving according to the Kumaraswamy law. Kumaraswamy (1976) noted that the beta
law may fail to fit some double bounded variables, in particular it may fail to deliver good fits
when used with hydrological data. He then introduced a new distribution, which is commonly
referred to as the Kumaraswamy distribution. It has an advantage relative to the beta law, namely:
its distribution and quantile functions can be expressed in closed form. One can then easily
generate sequences of pseudo-random numbers from the Kumaraswamy distribution using the
inversion method. It suffices to generate sequences of pseudo-random standard uniform num-
bers and then evaluate the Kumaraswamy quantile function at each value. It is noteworthy that
the Kumaraswamy law has already proved to be very useful for modeling hydrological and related
phenomena; see, for example, Fletcher and Ponnambalam (1996) and Sundar and Subbiah (1989).
For details on the Kumaraswamy law, we refer readers to Jones (2009). Inflated Kumaraswamy
distributions were recently introduced by Cribari-Neto and Santos (2019).
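The inversion method mentioned above can be sketched in a few lines. The snippet below is only an illustration in Python (it is not the authors' R code); the function names `kum_cdf`, `kum_quantile`, and `rkum` are hypothetical, and the standard two-shape-parameter form of the law is assumed, with distribution function 1 − (1 − y^𝜙)^𝛽.

```python
import random


def kum_cdf(y, phi, beta):
    """Kumaraswamy distribution function, F(y) = 1 - (1 - y^phi)^beta."""
    return 1.0 - (1.0 - y ** phi) ** beta


def kum_quantile(u, phi, beta):
    """Closed-form inverse of kum_cdf, for u in (0, 1)."""
    return (1.0 - (1.0 - u) ** (1.0 / beta)) ** (1.0 / phi)


def rkum(n, phi, beta, rng):
    """Inversion method: evaluate the quantile function at uniform draws."""
    return [kum_quantile(rng.random(), phi, beta) for _ in range(n)]


rng = random.Random(2021)  # fixed seed, used only to make the sketch reproducible
sample = rkum(20000, phi=2.0, beta=3.0, rng=rng)
sample_median = sorted(sample)[len(sample) // 2]
```

With a large number of draws, `sample_median` should be close to `kum_quantile(0.5, 2.0, 3.0)`, which is the theoretical median.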
It is noteworthy that Kumaraswamy moments cannot be expressed in closed form. How-
ever, the Kumaraswamy quantile function can be expressed explicitly and hence it is possible
to model median effects. Mitnik and Baek (2013) considered two different Kumaraswamy repa-
rameterizations that can be used to define regression structures for the location (median) and
dispersion parameters. In what follows, we shall use one of such parameterizations to introduce
a Kumaraswamy regression model that allows for data inflation, that is, that can be used when
the data contain zeros and/or ones. Our model allows for both single and double inflation.
The main contribution of our article to the literature is that we provide applied statisticians
with an additional model that can be used to handle inflated double bounded random variables.
Our model is especially appealing when the interest lies in modeling median effects and also
when it is convenient to have simple closed-form expressions for the underlying law’s distribution
and quantile functions, which is particularly useful when resampling-based inferential procedures
are used. Another advantage of our model is that it can be used with doubly inflated dependent
variables, that is, with responses that assume values in [0, 1], in addition to single inflation in
which the response values are in [0, 1) or in (0, 1], that is, inflation at zero or at one. By con-
trast, Ospina and Ferrari (2012) only considered single inflation in their inflated beta regression
model. A regression model based on the generalized Johnson SB distribution that incorporates
single inflation was introduced by Queiroz and Lemonte (2021). The authors present the model
observed information matrix, but not Fisher’s information matrix, which cannot be expressed
in closed form. Bayesian quantile regression modeling for proportion data under single inflation
was developed by Santos and Bolfarine (2015). A unit-Weibull parametric quantile regression for
single inflated data was proposed by Menezes, Mazucheli, and Bourguignon (2021). The authors
do not present the expected information matrix. Double data inflation for the beta model was
considered by Galvis, Bandyopadhyay, and Lachos (2014), Liu and Kong (2015), Mohsenkhani,
Mohhamadzadeh, and Baghfalaki (2019), and Nogarotto, Azevedo, and Bazán (2020), but in a
Bayesian framework. For similar results using mixture of beta distributions, see Di Brisco and
Migliorati (2020). Again, Bayesian statistical inference is used. Our analysis, like that of Ospina
and Ferrari (2012), is frequentist. Liu and Eugenio (2020) compare frequentist and Bayesian infer-
ences in beta regressions subject to data inflation. They consider double inflation, but they do
not present closed-form expressions for important model quantities such as Fisher’s information
matrix. An inflated simplex regression that allows for double inflation was introduced by Liu,
Kam Yuen, Wu, Tian, and Li (2020). The authors present the Hessian matrix (and, hence, the
observed information matrix), but not Fisher’s information matrix. Additionally, the model does
not allow for varying dispersion and statistical inference is only developed for double inflation,
that is, single inflation is not covered by their inferential results.
The derivation of Fisher’s (expected) information for the model we propose is lengthy and
somewhat challenging. The technical details involved in such a derivation are presented in the
Appendix. Fisher’s information matrix is useful for obtaining standard errors for the maximum
likelihood point estimates, for performing interval estimation, and for use in some commonly
used test statistics, for example, Rao’s score and Wald test statistics. A comparison between the
regression model we introduce and every single alternative model is beyond the scope of our
article. In what follows, however, we shall draw some comparisons between our model and the
inflated beta and simplex regression models; the former is arguably the most well-known model
and the most commonly used choice for use with inflated proportion data. We note that the model
we propose is the only model, in the realm of frequentist inference, that, simultaneously: (i) allows
for single and double inflation, (ii) the location parameter is the conditional median, (iii) can be
extended to a more general class of parametric quantile regressions, and (iv) has a closed-form
expression for the expected information matrix. As for the latter, it is noteworthy that the use of
the observed information matrix has several disadvantages, for example, the score test statistic
may assume negative values (Morgan, Palmer, and Ridout 2007).
We use the proposed inflated Kumaraswamy model to address an important empirical
issue: the relationship between education and sanitation. For instance, Adukia (2017) evalu-
ates the impact of absence of school sanitation infrastructure on educational attainment; see
also Dreibelbis et al. (2013) and Jasper, Le, and Bartram (2012). We analyze a different aspect
of the relationship between education and sanitation, namely: how the socioeconomic land-
scape impacts the proportion of people who live in households with inadequate water supply and
sewage in Brazilian municipalities. We consider different conditioning variables, for instance,
variables related to schooling and poverty. The data contain zeros since no one lives in households
with inadequate water supply and sewage in nearly 17% of the Brazilian municipalities. Given
that some response values equal zero, the model used in the analysis must combine continuous
and discrete components. Our results reveal that the median estimated impact of higher education
net attendance rate on the prevalence of inadequate water supply and sewage is uniformly less
intense than the corresponding mean impact. This differs from what happens with the impact of
the proportion of the adult population with a college degree, which may display a crossing pattern.
Additionally, only the former positively impacts the probability that no one lives with inadequate
water supply and sewage. That is, higher college net attendance rates are associated with higher
probabilities of no one living in households with inadequate water supply and sewage. Such
probabilities are considerably reduced, however, when the prevalence of extremely poor children and
of people living in households without electricity increases. A noteworthy aspect of our results is
that they clearly show that schooling is an important predictor of inadequate water supply and
sewage: Higher schooling is associated with lower prevalence of people living under such con-
ditions. A particular policymaking implication of our empirical results is that policies directed
to increasing the share of the population with complete tertiary education in places where such
a share is low are particularly effective in reducing the median share of people who live under
inadequate sanitation conditions.
As noted above, we compute the median and mean impacts of the different conditioning vari-
ables, that is, we investigate how such variables impact the response conditional distribution
median and mean. In order to compute mean impacts, we used the inflated beta and simplex
regression models. The inflated beta, simplex, and Kumaraswamy models were then compared
using several criteria (AIC, BIC, mean predicted values, number of extreme residuals). All crite-
ria favor our model. In particular, the mean predicted value achieved by using the Kumaraswamy
model is nearly 37% and 48% smaller than those obtained from the fitted beta and simplex models,
respectively.
In summary, there are three main motivations for the regression model we introduce in this
article. First, it is the only model for proportion data that, simultaneously, allows for single and
double inflation, allows practitioners to model median effects, can be naturally extended to a class
of parametric quantile regression, and has a closed-form expression for the expected information
matrix. Second, as shown in our empirical analysis, median covariate impacts carry informa-
tion that can be used together with mean impacts to gain further insight on the phenomenon
under study. In that way, our model can be viewed as complementary to those in which the
location parameter is the conditional distribution mean. Third, several criteria indicate that our
model achieves a better data fit than two alternative, mean-indexed models when modeling the
proportion of people who live under inadequate sanitary conditions in Brazil.
The article is structured as follows. In Section 2 we briefly review the Kumaraswamy regres-
sion model and in Section 3 we introduce the class of inflated Kumaraswamy regression models
in which inflation takes place at zero and/or one. Likelihood-based inference for the proposed
model is developed in Section 4. Residual analysis is considered in Section 5 and Monte Carlo
simulation evidence on point and interval estimation is presented in Section 6. In Section 7 we
model the proportion of people who live in households with inadequate water supply and sewage
in Brazil. Finally, some concluding remarks are offered in Section 8 together with directions for
future research.
2 THE KUMARASWAMY REGRESSION MODEL
The beta law is the most commonly used model with random variables that assume values in
the standard unit interval. It was noted, however, by Kumaraswamy (1976) that it may fail to fit
well with hydrological data, especially when they consist of hydrological observations of small
frequency. He then proposed a new distribution, which can be considered as an alternative to
the well-known beta model. The Kumaraswamy distribution is indexed by two parameters and
its support is the standard unit interval, that is, (0, 1). Like the beta distribution, it can be used
to model the behavior of rates, proportions, income concentration indices, and other random
variables that assume values in (0, 1). Let Y be Kumaraswamy distributed. Its density function is
given by $k(y) = \phi\beta y^{\phi-1}(1 - y^{\phi})^{\beta-1}$, $0 < y < 1$, $\phi > 0$ and $\beta > 0$ being shape parameters. We write
$Y \sim \mathrm{Kum}(\phi, \beta)$. The two parameters that index the law can be estimated by maximum likelihood or
by alternative methods; see Dey, Mazucheli, and Nadarajah (2018) and Lemonte (2011).
The mean and the variance of Y are given, respectively, by (Mitnik and Baek, 2013)
$$E(Y) = \beta B\left(1 + \frac{1}{\phi}, \beta\right) \quad \text{and} \quad \mathrm{Var}(Y) = \beta B\left(1 + \frac{2}{\phi}, \beta\right) - \left[\beta B\left(1 + \frac{1}{\phi}, \beta\right)\right]^2,$$
where B(⋅ , ⋅) is the beta function. The distribution median is given by $\omega = (1 - 0.5^{1/\beta})^{1/\phi}$. It
provides the basis for the reparameterizations used by Mitnik and Baek (2013). The authors
expressed $\phi$ and $\beta$ as $\phi = \ln(1 - 0.5^{1/\beta})/\ln(\omega)$ and $\beta = \ln(0.5)/\ln(1 - \omega^{\phi})$. By plugging the above
expressions for 𝜙 and 𝛽 into the Kumaraswamy density function, the authors arrived at two repa-
rameterizations which can be used to define regression models for double bounded data. Such
models can be viewed as alternatives to the beta regression model.
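As a numerical illustration of the moment, median, and reparameterization formulas above, the following Python sketch evaluates them with the standard library only. All function names are hypothetical, and the beta function is computed through log-gamma values, which is an implementation choice made here for convenience.

```python
import math


def beta_fn(a, b):
    """Beta function B(a, b), computed via log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def kum_mean(phi, beta):
    return beta * beta_fn(1.0 + 1.0 / phi, beta)


def kum_var(phi, beta):
    return beta * beta_fn(1.0 + 2.0 / phi, beta) - kum_mean(phi, beta) ** 2


def kum_median(phi, beta):
    return (1.0 - 0.5 ** (1.0 / beta)) ** (1.0 / phi)


def beta_from_median(omega, phi):
    """Shape parameter beta implied by the median omega and phi."""
    return math.log(0.5) / math.log(1.0 - omega ** phi)


def phi_from_median(omega, beta):
    """Shape parameter phi implied by the median omega and beta."""
    return math.log(1.0 - 0.5 ** (1.0 / beta)) / math.log(omega)


omega = kum_median(phi=2.0, beta=3.0)
recovered_beta = beta_from_median(omega, phi=2.0)   # should recover 3.0
recovered_phi = phi_from_median(omega, beta=3.0)    # should recover 2.0
```

A quick sanity check is the case 𝜙 = 𝛽 = 1, in which the law reduces to the standard uniform distribution with mean 1/2, variance 1/12, and median 1/2.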
We note that the aforementioned parameterizations have been considered elsewhere. For
instance, Bayer, Bayer, and Pumi (2017) introduced the Kumaraswamy autoregressive moving
average time series model and Pumi, Rauber, and Bayer (2020) proposed the Kumaraswamy
regression model with Aranda–Ordaz link function. Such authors used one of the parame-
terizations considered by Mitnik and Baek (2013) according to which the density, cumulative
probability, and quantile functions are given, respectively, by
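$$f(y; \omega, \phi) = \frac{\phi \ln(0.5)}{\ln(1 - \omega^{\phi})}\, y^{\phi-1} \left(1 - y^{\phi}\right)^{\frac{\ln(0.5)}{\ln(1 - \omega^{\phi})} - 1},$$
$$F(y; \omega, \phi) = 1 - \left(1 - y^{\phi}\right)^{\frac{\ln(0.5)}{\ln(1 - \omega^{\phi})}}, \qquad F^{-1}(u; \omega, \phi) = \left[1 - (1 - u)^{\frac{\ln(1 - \omega^{\phi})}{\ln(0.5)}}\right]^{1/\phi}, \qquad (1)$$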
where u ∈ (0, 1) and 𝜙 is a precision parameter.
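The three functions in (1) are simple enough to code directly. The sketch below is an illustrative Python transcription (hypothetical function names, not the authors' implementation) that writes the exponent ln(0.5)/ln(1 − 𝜔^𝜙) as a helper `delta`.

```python
import math

LOG_HALF = math.log(0.5)


def delta(omega, phi):
    """Exponent ln(0.5) / ln(1 - omega^phi) shared by the three functions in (1)."""
    return LOG_HALF / math.log(1.0 - omega ** phi)


def kum_pdf(y, omega, phi):
    d = delta(omega, phi)
    return phi * d * y ** (phi - 1.0) * (1.0 - y ** phi) ** (d - 1.0)


def kum_cdf(y, omega, phi):
    return 1.0 - (1.0 - y ** phi) ** delta(omega, phi)


def kum_quantile(u, omega, phi):
    return (1.0 - (1.0 - u) ** (1.0 / delta(omega, phi))) ** (1.0 / phi)


# Midpoint-rule check that the density integrates to (approximately) one.
n_grid = 20000
area = sum(kum_pdf((k + 0.5) / n_grid, 0.3, 2.0) for k in range(n_grid)) / n_grid
```

By construction, `kum_cdf(omega, omega, phi)` equals 0.5 and `kum_quantile(0.5, omega, phi)` returns `omega`, which is what makes 𝜔 interpretable as the median.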
3 THE INFLATED KUMARASWAMY REGRESSION MODEL
The Kumaraswamy and beta distributions cannot be used when the data contain zeros and/or
ones. It is necessary to define new laws that incorporate a discrete component that assigns
positive probability to such point(s). Inflated beta distributions were proposed by Ospina and
Ferrari (2010) and inflated Kumaraswamy distributions were introduced by Cribari-Neto and
Santos (2019). The latter considered the standard Kumaraswamy parametrization which is
indexed by two shape parameters. By contrast, we shall work with an inflated version of
the Kumaraswamy distribution that employs the median-based parametrization from Mitnik
and Baek (2013). The proposed inflated at zero and/or one Kumaraswamy density can be
expressed as
$$ki(y; \lambda, p, \omega, \phi) = \begin{cases} \lambda(1 - p), & \text{if } y = 0, \\ \lambda p, & \text{if } y = 1, \\ (1 - \lambda) f(y; \omega, \phi), & \text{if } y \in (0, 1), \end{cases}$$
where 0 < 𝜆 < 1 is the mixture parameter, p is the probability that a Bernoulli-distributed random
variable equals one, and f (y; 𝜔, 𝜙) is the Kumaraswamy density function given in (1) which is
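indexed by $0 < \omega < 1$ (median) and $\phi > 0$ (precision parameter). It can be written as
$$ki(y; \lambda, p, \omega, \phi) = \left[\lambda(1 - p)\right]^{I_{\{0\}}(y)} \times (\lambda p)^{I_{\{1\}}(y)} \times \left[(1 - \lambda) f(y; \omega, \phi)\right]^{I_{(0,1)}(y)}, \quad 0 \le y \le 1, \qquad (2)$$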
where IA (y) is the indicator function that equals one if y ∈ A and zero otherwise. Let Y be a random
variable with density (2). We say that Y follows the Kumaraswamy law inflated at both zero and
one and write Y ∼ KI(𝜆, p, 𝜔, 𝜙). Notice that Pr(Y = 1) = 𝜆p and Pr(Y = 0) = 𝜆(1 − p). It then fol-
lows that when p = 0, we obtain the Kumaraswamy distribution inflated at zero, Y ∼ KI0 (𝜆, 𝜔, 𝜙),
and when p = 1 we obtain the Kumaraswamy distribution inflated at one, Y ∼ KI1 (𝜆, 𝜔, 𝜙), as par-
ticular cases. Additionally, when p = 𝜆 = 0, Y follows the Kumaraswamy law according to the
parameterization introduced by Mitnik and Baek (2013).
The cumulative and quantile functions of Y are given, respectively, by
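$$KI(y; \lambda, p, \omega, \phi) = \lambda(1 - p) + (\lambda p) I_{\{1\}}(y) + (1 - \lambda) F(y; \omega, \phi), \quad 0 \le y \le 1,$$
$$KI^{-1}(u; \lambda, p, \omega, \phi) = \begin{cases} 0, & \text{if } u \le \lambda(1 - p), \\ F^{-1}\left(\dfrac{u - \lambda(1 - p)}{1 - \lambda}; \omega, \phi\right), & \text{otherwise}, \\ 1, & \text{if } u \ge 1 - \lambda p, \end{cases}$$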
where 0 < u < 1.
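Because the inflated quantile function is available in closed form, inflated Kumaraswamy variates can be generated by inversion. The following Python sketch illustrates this under the median-based parameterization; the names `ki_cdf`, `ki_quantile`, and `rki` are hypothetical and the parameter values are arbitrary.

```python
import math
import random

LOG_HALF = math.log(0.5)


def kum_cdf(y, omega, phi):
    d = LOG_HALF / math.log(1.0 - omega ** phi)
    return 1.0 - (1.0 - y ** phi) ** d


def kum_quantile(u, omega, phi):
    d = LOG_HALF / math.log(1.0 - omega ** phi)
    return (1.0 - (1.0 - u) ** (1.0 / d)) ** (1.0 / phi)


def ki_cdf(y, lam, p, omega, phi):
    """Inflated distribution function on [0, 1]."""
    if y < 0.0:
        return 0.0
    if y >= 1.0:
        return 1.0
    # Point mass at zero plus the rescaled continuous component.
    return lam * (1.0 - p) + (1.0 - lam) * kum_cdf(y, omega, phi)


def ki_quantile(u, lam, p, omega, phi):
    """Inflated quantile function for u in (0, 1)."""
    p_zero, p_one = lam * (1.0 - p), lam * p
    if u <= p_zero:
        return 0.0
    if u >= 1.0 - p_one:
        return 1.0
    return kum_quantile((u - p_zero) / (1.0 - lam), omega, phi)


def rki(n, lam, p, omega, phi, rng):
    """Inversion sampling; a uniform draw of exactly 0 has negligible probability."""
    return [ki_quantile(rng.random(), lam, p, omega, phi) for _ in range(n)]


draws = rki(20000, lam=0.3, p=0.4, omega=0.5, phi=2.0, rng=random.Random(7))
share_zero = draws.count(0.0) / len(draws)  # should be near lam * (1 - p) = 0.18
share_one = draws.count(1.0) / len(draws)   # should be near lam * p = 0.12
```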

Let $y = (y_1, \ldots, y_n)^\top$ be a vector of n inflated Kumaraswamy-distributed random variables
such that $y_i$ has density $ki(y_i; \lambda_i, p_i, \omega_i, \phi_i)$ given in (2), i = 1, … , n. The inflated Kumaraswamy
regression model is defined as
$$g_1(\lambda_i) = \sum_{j=1}^{m} z_{ij}\gamma_j = \eta_{1i},$$
$$g_2(p_i) = \sum_{k=1}^{u} w_{ik}\pi_k = \eta_{2i},$$
$$g_3(\omega_i) = \sum_{t=1}^{r} x_{it}\beta_t = \eta_{3i},$$
$$g_4(\phi_i) = \sum_{b=1}^{s} q_{ib}\varsigma_b = \eta_{4i},$$
where g1 ∶ (0, 1) → R, g2 ∶ (0, 1) → R, g3 ∶ (0, 1) → R and g4 ∶ (0, ∞) → R are strictly increas-

ing and twice-differentiable link functions, zi = (zi1 , … , zim )⊤ , wi = (wi1 , … , wiu )⊤ , xi =
(xi1 , … , xir )⊤ , and qi = (qi1 , … , qis )⊤ , m + u + r + s < n, are known regressors, which coin-
cide totally or partially, and 𝜸 = (𝛾1 , … , 𝛾m )⊤ , 𝝅 = (𝜋1 , … , 𝜋u )⊤ , 𝜷 = (𝛽1 , … , 𝛽r )⊤ , and 𝝇 =
(𝜍1 , … , 𝜍s )⊤ are unknown parameters. Usually, zi1 = wi1 = xi1 = qi1 = 1 ∀i so that 𝛾1 , 𝜋1 , 𝛽1 , and 𝜍1
are intercept coefficients. Different link functions can be used. For instance, g1 , g2 , and g3 can be
taken to be the logit, probit, log-log, complementary log-log, or Cauchy link function. Also, g4 can
be the logarithm or square root function.
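To make the role of the link functions concrete, the sketch below maps a linear predictor to a parameter value through an inverse link. It is a hedged Python illustration: the dictionary `LINKS` and the helper names are hypothetical, the log-log link is written in its increasing form −ln(−ln 𝜇) (conventions differ across software), and the square root link is only invertible for positive linear predictors.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()

# name: (link g, inverse link g^{-1}); all links below are strictly increasing
LINKS = {
    "logit": (lambda m: math.log(m / (1.0 - m)),
              lambda e: 1.0 / (1.0 + math.exp(-e))),
    "probit": (_NORMAL.inv_cdf, _NORMAL.cdf),
    "cloglog": (lambda m: math.log(-math.log(1.0 - m)),
                lambda e: 1.0 - math.exp(-math.exp(e))),
    "loglog": (lambda m: -math.log(-math.log(m)),
               lambda e: math.exp(-math.exp(-e))),
    "cauchit": (lambda m: math.tan(math.pi * (m - 0.5)),
                lambda e: 0.5 + math.atan(e) / math.pi),
    "log": (math.log, math.exp),
    "sqrt": (math.sqrt, lambda e: e * e),  # assumes a positive linear predictor
}


def linear_predictor(row, coefs):
    """eta_i = sum of regressor values times coefficients."""
    return sum(x * b for x, b in zip(row, coefs))


def fitted_parameter(row, coefs, link):
    """Apply the inverse link to the linear predictor."""
    _, inverse = LINKS[link]
    return inverse(linear_predictor(row, coefs))


# Intercept plus one regressor (illustrative values only).
omega_i = fitted_parameter([1.0, 0.5], [-1.0, 2.0], "logit")  # eta = 0
phi_i = fitted_parameter([1.0, 0.5], [1.0, 1.0], "log")       # eta = 1.5
```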
Under double inflation, the regression model comprises four submodels, as indicated above.
Under single inflation, however, it only contains three submodels since p is constant and known:
p = 0 for inflation at zero and p = 1 for inflation at one. In both cases, the inflated Kumaraswamy
regression model becomes
$$g_1(\lambda_i) = \sum_{j=1}^{m} z_{ij}\gamma_j = \eta_{1i},$$
$$g_3(\omega_i) = \sum_{t=1}^{r} x_{it}\beta_t = \eta_{3i},$$
$$g_4(\phi_i) = \sum_{b=1}^{s} q_{ib}\varsigma_b = \eta_{4i}.$$
Under no inflation, p and 𝜆 are known (they both equal zero) and the above model reduces to
the standard Kumaraswamy regression model (Mitnik & Baek, 2013).
Remark 1. The location parameter 𝜔i above is the conditional median. The model can be easily
extended, however, so that such a parameter becomes, more generally, the 𝜏th conditional quan-
tile, where 𝜏 ∈ (0, 1). That is, the class of regression models presented above can be easily extended
to a class of inflated Kumaraswamy parametric quantile regression models which generalizes the
Kumaraswamy parametric quantile regression model of Bayes, Bazán, and De Castro (2017).
To that end, it suffices to set $\omega \equiv \omega(\tau) = \left(1 - (1 - \tau)^{1/\beta}\right)^{1/\phi}$ instead of $\omega = \left(1 - 0.5^{1/\beta}\right)^{1/\phi}$ in the
Kumaraswamy reparameterization. By doing so, one can evaluate the impacts of the location
regressors at different quantiles of the response conditional distribution.
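The quantile extension in Remark 1 can be checked numerically: replacing ln(0.5) with ln(1 − 𝜏) in the exponent makes 𝜔 the 𝜏th quantile. The Python sketch below (hypothetical function names, arbitrary parameter values) illustrates this.

```python
import math


def delta_tau(omega, phi, tau):
    """Exponent ln(1 - tau) / ln(1 - omega^phi); tau = 0.5 gives the median case."""
    return math.log(1.0 - tau) / math.log(1.0 - omega ** phi)


def kum_cdf_tau(y, omega, phi, tau):
    return 1.0 - (1.0 - y ** phi) ** delta_tau(omega, phi, tau)


def kum_quantile_tau(u, omega, phi, tau):
    return (1.0 - (1.0 - u) ** (1.0 / delta_tau(omega, phi, tau))) ** (1.0 / phi)


# For each tau, the distribution function evaluated at omega should equal tau.
checks = {tau: kum_cdf_tau(0.3, 0.3, 2.0, tau) for tau in (0.1, 0.5, 0.9)}
```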
4 LIKELIHOOD-BASED INFERENCE AND GOODNESS-OF-FIT
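Let $y = (y_1, \ldots, y_n)^\top$ be a sample of independent inflated Kumaraswamy random variables. The
likelihood function for $\boldsymbol{\theta} = (\boldsymbol{\gamma}^\top, \boldsymbol{\pi}^\top, \boldsymbol{\beta}^\top, \boldsymbol{\varsigma}^\top)^\top$ can be expressed as
$$L(\boldsymbol{\theta}; y) = \prod_{i=1}^{n} \lambda_i^{I_{\{0,1\}}(y_i)} (1 - \lambda_i)^{1 - I_{\{0,1\}}(y_i)} \times \prod_{i=1}^{n} p_i^{I_{\{1\}}(y_i)} (1 - p_i)^{I_{\{0\}}(y_i)} \times \prod_{\substack{i=1 \\ y_i \in (0,1)}}^{n} f(y_i; \omega_i, \phi_i),$$
where $\lambda_i = g_1^{-1}(\eta_{1i})$, $p_i = g_2^{-1}(\eta_{2i})$, $\omega_i = g_3^{-1}(\eta_{3i})$, and $\phi_i = g_4^{-1}(\eta_{4i})$. It is noteworthy that the above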
likelihood function is the product of three terms: one that only depends on 𝜆i , one that only
depends on pi , and one that involves 𝜔i and 𝜙i . Recall that under inflation at zero (at one) pi = 0
(pi = 1) ∀i.
It follows that the log-likelihood function for 𝜽 = (𝜸 ⊤ , 𝝅 ⊤ , 𝜷 ⊤ , 𝝇 ⊤ )⊤ is
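$$\ell(\boldsymbol{\theta}; y) = \ell_1(\boldsymbol{\gamma}) + \ell_2(\boldsymbol{\pi}) + \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma}), \qquad (3)$$
where
$$\ell_1(\boldsymbol{\gamma}) = \sum_{i=1}^{n} \left[ I_{\{0,1\}}(y_i) \ln(\lambda_i) + (1 - I_{\{0,1\}}(y_i)) \ln(1 - \lambda_i) \right],$$
$$\ell_2(\boldsymbol{\pi}) = \sum_{i=1}^{n} \left[ I_{\{1\}}(y_i) \ln(p_i) + I_{\{0\}}(y_i) \ln(1 - p_i) \right],$$
$$\ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma}) = \sum_{\substack{i=1 \\ y_i \in (0,1)}}^{n} \left\{ \ln(\phi_i) + \ln\left[\frac{\ln(0.5)}{\ln(1 - \omega_i^{\phi_i})}\right] + (\phi_i - 1) \ln(y_i) + \left[\frac{\ln(0.5)}{\ln(1 - \omega_i^{\phi_i})} - 1\right] \ln\left(1 - y_i^{\phi_i}\right) \right\}.$$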
When the location parameter is the 𝜏th conditional quantile (instead of the median), the terms
ln(0.5) above and also in what follows must be replaced with ln(1 − 𝜏).
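The three-term decomposition in (3) translates directly into code. The Python sketch below is illustrative only: it takes the per-observation parameter values as given (that is, after the inverse links have been applied), assumes double inflation with 0 < p < 1, and uses the hypothetical function name `loglik_parts`.

```python
import math

LOG_HALF = math.log(0.5)


def loglik_parts(y, lam, p, omega, phi):
    """Return (l1, l2, l3) for observations y with per-observation parameters."""
    l1 = l2 = l3 = 0.0
    for yi, li, pi, wi, fi in zip(y, lam, p, omega, phi):
        at_boundary = yi in (0.0, 1.0)
        # l1: Bernoulli-type term for "boundary versus interior".
        l1 += math.log(li) if at_boundary else math.log(1.0 - li)
        if yi == 1.0:
            l2 += math.log(pi)          # l2: one versus zero, given boundary
        elif yi == 0.0:
            l2 += math.log(1.0 - pi)
        else:
            # l3: Kumaraswamy log-density for interior observations.
            d = LOG_HALF / math.log(1.0 - wi ** fi)
            l3 += (math.log(fi) + math.log(d) + (fi - 1.0) * math.log(yi)
                   + (d - 1.0) * math.log(1.0 - yi ** fi))
    return l1, l2, l3


# Toy data: with omega = 0.5 and phi = 1 the continuous part is uniform, so l3 = 0.
y = [0.0, 1.0, 0.4]
l1, l2, l3 = loglik_parts(y, [0.3] * 3, [0.4] * 3, [0.5] * 3, [1.0] * 3)
total_loglik = l1 + l2 + l3
```

Since each term depends on a different block of parameters, the three pieces can in principle be maximized separately.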
The score function is given by the first derivative of the log-likelihood function (3) with respect
to 𝜽. The components of the score vector are
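$$U_{\gamma_j}(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta}; y)}{\partial \gamma_j} = \sum_{i=1}^{n} \frac{\partial \ell_1(\boldsymbol{\gamma})}{\partial \lambda_i} \frac{\partial \lambda_i}{\partial \eta_{1i}} \frac{\partial \eta_{1i}}{\partial \gamma_j}, \quad j = 1, \ldots, m,$$
$$U_{\pi_k}(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta}; y)}{\partial \pi_k} = \sum_{i=1}^{n} \frac{\partial \ell_2(\boldsymbol{\pi})}{\partial p_i} \frac{\partial p_i}{\partial \eta_{2i}} \frac{\partial \eta_{2i}}{\partial \pi_k}, \quad k = 1, \ldots, u,$$
$$U_{\beta_t}(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta}; y)}{\partial \beta_t} = \sum_{i=1}^{n} \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i} \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_t}, \quad t = 1, \ldots, r,$$
$$U_{\varsigma_b}(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta}; y)}{\partial \varsigma_b} = \sum_{i=1}^{n} \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i} \frac{\partial \phi_i}{\partial \eta_{4i}} \frac{\partial \eta_{4i}}{\partial \varsigma_b}, \quad b = 1, \ldots, s.$$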
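Here, $\partial\eta_{1i}/\partial\gamma_j = z_{ij}$, $\partial\eta_{2i}/\partial\pi_k = w_{ik}$, $\partial\eta_{3i}/\partial\beta_t = x_{it}$, $\partial\eta_{4i}/\partial\varsigma_b = q_{ib}$, $\partial\lambda_i/\partial\eta_{1i} = 1/g_1'(\lambda_i)$,
$\partial p_i/\partial\eta_{2i} = 1/g_2'(p_i)$, $\partial\omega_i/\partial\eta_{3i} = 1/g_3'(\omega_i)$, and $\partial\phi_i/\partial\eta_{4i} = 1/g_4'(\phi_i)$. Additionally,
$$\frac{\partial \ell_1(\boldsymbol{\gamma})}{\partial \lambda_i} = \frac{I_{\{0,1\}}(y_i) - \lambda_i}{\lambda_i (1 - \lambda_i)} = a_i, \qquad \frac{\partial \ell_2(\boldsymbol{\pi})}{\partial p_i} = \frac{I_{\{1\}}(y_i)}{p_i} - \frac{I_{\{0\}}(y_i)}{1 - p_i} = \rho_i,$$
$$\frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i} = \left[1 - I_{\{0,1\}}(y_i)\right] \phi_i c_i,$$
$$\frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i} = \left[1 - I_{\{0,1\}}(y_i)\right] \left[ \frac{1}{\phi_i} + \ln(y_i) + c_i \omega_i \ln(\omega_i) - (\delta_i - 1) \frac{y_i^{\phi_i} \ln(y_i)}{1 - y_i^{\phi_i}} \right] = v_i,$$
where
$$c_i = \frac{\omega_i^{\phi_i - 1}}{(1 - \omega_i^{\phi_i}) \ln(1 - \omega_i^{\phi_i})} \left( \delta_i \ln\left(1 - y_i^{\phi_i}\right) + 1 \right) \quad \text{and} \quad \delta_i = \frac{\ln(0.5)}{\ln(1 - \omega_i^{\phi_i})}.$$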
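A simple way to gain confidence in the analytical derivatives of 𝓁3 with respect to 𝜔i and 𝜙i is to compare them with central finite differences for a single interior observation. The Python sketch below does this; the function names are hypothetical and the evaluation point is arbitrary.

```python
import math

LOG_HALF = math.log(0.5)


def delta_i(omega, phi):
    return LOG_HALF / math.log(1.0 - omega ** phi)


def ell3_i(y, omega, phi):
    """Contribution of one interior observation to l3."""
    d = delta_i(omega, phi)
    return (math.log(phi) + math.log(d) + (phi - 1.0) * math.log(y)
            + (d - 1.0) * math.log(1.0 - y ** phi))


def c_i(y, omega, phi):
    wp = omega ** phi
    return (omega ** (phi - 1.0) / ((1.0 - wp) * math.log(1.0 - wp))
            * (delta_i(omega, phi) * math.log(1.0 - y ** phi) + 1.0))


def dell3_domega(y, omega, phi):
    return phi * c_i(y, omega, phi)


def dell3_dphi(y, omega, phi):
    yp = y ** phi
    return (1.0 / phi + math.log(y) + c_i(y, omega, phi) * omega * math.log(omega)
            - (delta_i(omega, phi) - 1.0) * yp * math.log(y) / (1.0 - yp))


def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2.0 * h)


y0, w0, f0 = 0.35, 0.4, 1.7
num_domega = central_diff(lambda w: ell3_i(y0, w, f0), w0)
num_dphi = central_diff(lambda f: ell3_i(y0, w0, f), f0)
```

The numerical values `num_domega` and `num_dphi` should agree with `dell3_domega(y0, w0, f0)` and `dell3_dphi(y0, w0, f0)` up to finite-difference error.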
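The score vector $U(\boldsymbol{\theta}) = \left(U_{\boldsymbol{\gamma}}(\boldsymbol{\theta})^\top, U_{\boldsymbol{\pi}}(\boldsymbol{\theta})^\top, U_{\boldsymbol{\beta}}(\boldsymbol{\theta})^\top, U_{\boldsymbol{\varsigma}}(\boldsymbol{\theta})^\top\right)^\top$ can be expressed in matrix
notation as
$$U_{\boldsymbol{\gamma}}(\boldsymbol{\theta}) = Z^\top T_1 a, \quad U_{\boldsymbol{\pi}}(\boldsymbol{\theta}) = W^\top T_2 \boldsymbol{\rho}, \quad U_{\boldsymbol{\beta}}(\boldsymbol{\theta}) = X^\top T_3 \mathbf{c} \quad \text{and} \quad U_{\boldsymbol{\varsigma}}(\boldsymbol{\theta}) = Q^\top T_4 v,$$
where Z is an n × m matrix with ith row given by $z_i^\top$, W is an n × u matrix with
ith row given by $w_i^\top$, X is an n × r matrix with ith row given by $x_i^\top$, and Q is
an n × s matrix with ith row given by $q_i^\top$. Additionally, the following are diagonal
matrices: $T_1 = \mathrm{diag}\left(1/g_1'(\lambda_1), \ldots, 1/g_1'(\lambda_n)\right)$, $T_2 = \mathrm{diag}\left(1/g_2'(p_1), \ldots, 1/g_2'(p_n)\right)$,
$T_3 = \mathrm{diag}\left(1/g_3'(\omega_1), \ldots, 1/g_3'(\omega_n)\right)$, and $T_4 = \mathrm{diag}\left(1/g_4'(\phi_1), \ldots, 1/g_4'(\phi_n)\right)$. Also, $a = (a_1, \ldots, a_n)^\top$,
$\boldsymbol{\rho} = (\rho_1, \ldots, \rho_n)^\top$, $\mathbf{c} = \left(\left[1 - I_{\{0,1\}}(y_1)\right] \phi_1 c_1, \ldots, \left[1 - I_{\{0,1\}}(y_n)\right] \phi_n c_n\right)^\top$, and $v = (v_1, \ldots, v_n)^\top$.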
The maximum likelihood estimator $\hat{\boldsymbol{\theta}}$ solves $U(\boldsymbol{\theta}) = \mathbf{0}$, where $\mathbf{0}$ is the null vector in $\mathbb{R}^{m+u+r+s}$.
Such a system does not have a closed-form solution. In what follows we shall numerically max-
imize the model log-likelihood function using the Broyden–Fletcher–Goldfarb–Shanno (BFGS)
nonlinear optimization algorithm with analytical first derivatives (Nocedal & Wright, 2006; Press,
Teukolsky, Vetterling, & Flannery, 1988).
Next, we shall express Fisher’s information matrix in closed form. Such a matrix can
be used to obtain standard errors and approximate confidence intervals. It is also used in
some test statistics. It is necessary to obtain the second-order log-likelihood derivatives
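and then compute their expected values. Technical details are in the Appendix. Let
$L = \mathrm{diag}\left(-1/[\lambda_1(1 - \lambda_1)], \ldots, -1/[\lambda_n(1 - \lambda_n)]\right)$, $P = \mathrm{diag}\left(-\lambda_1/[p_1(1 - p_1)], \ldots, -\lambda_n/[p_n(1 - p_n)]\right)$,
$V = \mathrm{diag}\left((\lambda_1 - 1)\phi_1^2 \nu_1^{(2)}, \ldots, (\lambda_n - 1)\phi_n^2 \nu_n^{(2)}\right)$, $M = \mathrm{diag}(m_1, \ldots, m_n)$, and $S = \mathrm{diag}(s_1, \ldots, s_n)$.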
The expressions for 𝜈i(2) , si , and mi , i = 1, … , n, are given in the Appendix. Fisher’s information
matrix for 𝜽 can then be expressed as
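$$K = K(\boldsymbol{\theta}) = \begin{pmatrix} K_{(\boldsymbol{\gamma},\boldsymbol{\gamma})} & 0 & 0 & 0 \\ 0 & K_{(\boldsymbol{\pi},\boldsymbol{\pi})} & 0 & 0 \\ 0 & 0 & K_{(\boldsymbol{\beta},\boldsymbol{\beta})} & K_{(\boldsymbol{\beta},\boldsymbol{\varsigma})} \\ 0 & 0 & K_{(\boldsymbol{\varsigma},\boldsymbol{\beta})} & K_{(\boldsymbol{\varsigma},\boldsymbol{\varsigma})} \end{pmatrix},$$
where $K_{(\boldsymbol{\gamma},\boldsymbol{\gamma})} = -Z^\top L T_1^2 Z$, $K_{(\boldsymbol{\pi},\boldsymbol{\pi})} = -W^\top P T_2^2 W$, $K_{(\boldsymbol{\beta},\boldsymbol{\beta})} = -X^\top V T_3^2 X$, $K_{(\boldsymbol{\beta},\boldsymbol{\varsigma})} = K_{(\boldsymbol{\varsigma},\boldsymbol{\beta})}^\top = -X^\top T_3 M T_4 Q$,
and $K_{(\boldsymbol{\varsigma},\boldsymbol{\varsigma})} = -Q^\top S T_4^2 Q$, zeros denoting null matrices of conforming dimensions. When n is large,
$\hat{\boldsymbol{\theta}}$ is approximately normally distributed with mean $\boldsymbol{\theta}$ and covariance matrix $K^{-1}(\boldsymbol{\theta})$. The consistency
and asymptotic normality of $\hat{\boldsymbol{\theta}}$ can be established under some regularity assumptions, as in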
Fahrmeir and Kaufmann (1985) for generalized linear models and in Pumi et al. (2020) for the
Kumaraswamy regression model with parametric link function.
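An asymptotic confidence interval of level $(1 - \delta) \times 100\%$ for $\theta_j$, where $\delta \in (0, 0.5)$ and $j = 1, \ldots, \dim(\boldsymbol{\theta})$,
is $\hat{\theta}_j \pm z_{1-\delta/2}\, \mathrm{se}(\hat{\theta}_j)$, where $z_{1-\delta/2}$ is the $1 - \delta/2$ standard normal quantile, $\hat{\theta}_j$ is the
maximum likelihood estimator of $\theta_j$ and $\mathrm{se}(\hat{\theta}_j)$ is its asymptotic standard error.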
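The interval above is straightforward to compute once a point estimate and its asymptotic standard error are available. The Python sketch below uses the standard library's `statistics.NormalDist` for the normal quantile; the function name `wald_ci` and the numerical inputs are hypothetical.

```python
from statistics import NormalDist


def wald_ci(estimate, se, delta=0.05):
    """Asymptotic (1 - delta) x 100% confidence interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return estimate - z * se, estimate + z * se


# Illustrative values only (not taken from the article).
lower, upper = wald_ci(estimate=1.0, se=0.5, delta=0.05)
```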
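Hypothesis testing inferences can also be easily performed. Suppose the interest lies in testing
restrictions on a subset of the parameter vector $\boldsymbol{\theta}$. More specifically, suppose we wish to test
$\mathcal{H}_0: \boldsymbol{\theta}_1 = \boldsymbol{\theta}_1^{(0)}$ against $\mathcal{H}_1: \boldsymbol{\theta}_1 \neq \boldsymbol{\theta}_1^{(0)}$, where $\boldsymbol{\theta}_1$ is the q × 1 vector of parameters of interest. The
likelihood ratio test statistic is $S_{LR} = 2[\ell(\hat{\boldsymbol{\theta}}) - \ell(\tilde{\boldsymbol{\theta}})]$, where $\hat{\boldsymbol{\theta}} = (\hat{\boldsymbol{\theta}}_1^\top, \hat{\boldsymbol{\theta}}_2^\top)^\top$ and $\tilde{\boldsymbol{\theta}} = (\boldsymbol{\theta}_1^{(0)\top}, \tilde{\boldsymbol{\theta}}_2^\top)^\top$
are, respectively, the unrestricted and restricted maximum likelihood estimators of $\boldsymbol{\theta}$. Under the null
hypothesis and when n is large, $S_{LR}$ is approximately distributed as $\chi_q^2$. The null hypothesis is
rejected at significance level $\delta \in (0, 1)$ if $S_{LR} > \chi_{q;1-\delta}^2$, that is, if the test statistic exceeds the $1 - \delta$
quantile of the $\chi_q^2$ distribution.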
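The likelihood ratio decision rule can be illustrated with a short Python sketch. The helper `chi2_sf` below is an assumption of this example: it evaluates the chi-squared upper tail for integer degrees of freedom through the classical finite-series expressions, so that only the standard library is needed. Rejecting when the p-value is below 𝛿 is equivalent to comparing the statistic with the 1 − 𝛿 quantile. All names and inputs are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variate with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2.0 * r)
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail term plus a finite series in sqrt(x).
    chi = math.sqrt(x)
    normal_pdf = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = chi if r == 1 else term * x / (2.0 * r - 1.0)
        total += term
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * normal_pdf * total


def lr_test(loglik_unrestricted, loglik_restricted, q, delta=0.05):
    """Return (statistic, p-value, reject?) for q restrictions."""
    stat = 2.0 * (loglik_unrestricted - loglik_restricted)
    p_value = chi2_sf(stat, q)
    return stat, p_value, p_value < delta


# Illustrative log-likelihood values only (not taken from the article).
stat, p_value, reject = lr_test(-100.0, -103.0, q=2)
```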
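Hypothesis test inferences can also be carried out using the Rao score and Wald tests, whose
test statistics are
$$S_R = U_1(\tilde{\boldsymbol{\theta}})^\top K^{11}(\tilde{\boldsymbol{\theta}})\, U_1(\tilde{\boldsymbol{\theta}}) \quad \text{and} \quad S_W = (\hat{\boldsymbol{\theta}}_1 - \boldsymbol{\theta}_1^{(0)})^\top [K^{11}(\hat{\boldsymbol{\theta}})]^{-1} (\hat{\boldsymbol{\theta}}_1 - \boldsymbol{\theta}_1^{(0)}),$$
respectively, where $U_1(\boldsymbol{\theta})$ is a q × 1 vector that contains the log-likelihood derivatives with
respect to the parameters of interest and $K^{11}(\boldsymbol{\theta})$ is the q × q matrix formed using the rows and
columns of $K^{-1}(\boldsymbol{\theta})$ relative to $\boldsymbol{\theta}_1$. The two test statistics are, under $\mathcal{H}_0$, asymptotically distributed
as $\chi_q^2$ and the null hypothesis is rejected at significance level $\delta$ if the test statistic exceeds
$\chi_{q;1-\delta}^2$.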
Oftentimes the parameter of interest is scalar. When that happens, testing inferences can be
carried out using the z test. The z test statistic is $z = (\hat{\theta}_1 - \theta_1^{(0)})/\mathrm{se}(\hat{\theta}_1)$, where $\hat{\theta}_1$ is the maximum
likelihood estimator of the parameter of interest and $\mathrm{se}(\hat{\theta}_1)$ is the asymptotic standard error of
$\hat{\theta}_1$. Under the null hypothesis and when n is large, z is approximately distributed as $\mathcal{N}(0, 1)$. The
test is performed using asymptotic (standard normal) critical values and the null hypothesis is rejected at
significance level $\delta \in (0, 1)$ if $|z| > z_{1-\delta/2}$.
It is possible to quantify the model’s overall goodness-of-fit by using a pseudo-$R^2$ measure.
Such measures assume values in [0, 1], and the higher their values the better the overall
goodness-of-fit. In what follows, we shall use the following pseudo-$R^2$ (Nagelkerke, 1991):
$1 - (L_0/L_{\mathrm{full}})^{2/n}$, where $L_{\mathrm{full}}$ and $L_0$ are the maximized likelihood functions of the model with all
covariates and of the model that only contains the intercepts, respectively.
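In practice the pseudo-R2 is more safely computed from maximized log-likelihoods than from the likelihoods themselves, which can underflow. A minimal Python sketch follows, with a hypothetical function name and illustrative inputs; it uses the identity (L0/Lfull)^(2/n) = exp{2(𝓁0 − 𝓁full)/n}.

```python
import math


def pseudo_r2(loglik_null, loglik_full, n):
    """1 - (L0 / Lfull)^(2/n), evaluated on the log scale."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_full) / n)


# Illustrative log-likelihood values only (not taken from the article).
r2 = pseudo_r2(loglik_null=-120.0, loglik_full=-100.0, n=50)
```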
Computer code for estimation of inflated Kumaraswamy regressions in the R statistical com-
puting environment (R Core Team, 2021) is made available by the authors at https://github.com/
fabiobayer/ikumareg.
5 RESIDUALS
The agreement between response values and predicted values obtained from the fitted model
can be assessed using the randomized quantile residual (Dunn & Smyth, 1996). We shall now
investigate its behavior under both correct and incorrect model specification. We shall restrict
attention to inflation at zero. Figure 1 contains residual normal probability plots with simulated
envelopes which were constructed under the following scenarios:
• Scenario 1—correct (Figure 1a): The fitted model is correctly specified. Here, g1 (⋅) and g3 (⋅)
are the logit link function and g4 (⋅) is the log link. (The model does not include g2 (⋅) because
there is only inflation at zero.) The parameter values are 𝛾1 = −1, 𝛾2 = −1, 𝛽1 = −1, 𝛽2 = −2,
𝛽3 = −2, 𝜍1 = 1, 𝜍2 = 1. This was the baseline scenario for the five others that follow.
• Scenario 2—covariate (Figure 1b): The estimated model does not include covariate x3 in the
median submodel, that is, we fail to include into the model a relevant covariate.
• Scenario 3—nonlinear (Figure 1c): The model described in the first scenario was estimated, but
the data were generated using the nonlinear median predictor 𝜂3i = 𝛽1 + 𝛽2 xi2 − log(−𝛽3 xi3 ),
that is, there is neglected nonlinearity.
• Scenario 4—link (Figure 1d): The model described in the first scenario was estimated, but the
data were generated using the cloglog link for g1 (⋅) and g3 (⋅) and the square root link for g4 (⋅),
that is, there is incorrect link function specification.
• Scenario 5—outlier (Figure 1e): Five outliers were introduced into the data (1% of the sam-
ple). To that end, five values of 𝜂3i chosen at random were replaced by 2 × max(𝜂3i ) prior to
generating the response values.
FIGURE 1 Residual normal probability plots with simulated envelopes (vertical axis: randomized quantile residuals; horizontal axis: normal quantiles). Panels, with the number of residuals outside the envelope in parentheses: (a) Correct (26 points); (b) Without a covariate (267 points); (c) Nonlinear (311 points); (d) Incorrect link (86 points); (e) 1% of outliers (225 points); (f) Generated from beta (200 points).
• Scenario 6—beta (Figure 1f): The response values were generated from the zero-inflated beta
distribution with the same regression structure as in the first scenario. Data generation was
carried out using the gamlss package for R; see https://cran.r-project.org/package=gamlss.
The sample size is n = 500, the simulated envelope bands were constructed using 50 repli-
cations and all computations were carried out using the R statistical computing environment
(R Core Team, 2021). The number of residuals that fell outside the envelope bands is indi-
cated in each panel caption (in parentheses). It is clear from panels (b) to (f) that a sizeable
portion of the residuals fall outside the bands when the model is incorrectly specified or the
data contain outliers. We then conclude that the randomized quantile residual can be success-
fully used with inflated Kumaraswamy regressions to identify model misspecification or data
anomalies.
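The residual used in these plots can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the median parameterization F(y) = 1 − (1 − y^𝜙)^b with b = log(0.5)/log(1 − 𝜔^𝜙), takes 𝜆 to be the probability mass at zero, and uses hypothetical fitted values. The function names `kuma_cdf` and `rq_residual` are ours.

```python
import math
import random
from statistics import NormalDist


def kuma_cdf(y, omega, phi):
    """Kumaraswamy CDF under an ASSUMED median parameterization:
    F(y) = 1 - (1 - y**phi)**b with b = log(0.5) / log(1 - omega**phi)."""
    b = math.log(0.5) / math.log(1.0 - omega ** phi)
    return 1.0 - (1.0 - y ** phi) ** b


def rq_residual(y, lam, omega, phi, rng):
    """Randomized quantile residual for a zero-inflated response in [0, 1)."""
    if y == 0.0:
        # Point mass at zero: the CDF jumps from 0 to lam, so u is drawn
        # uniformly between 0 and lam.
        u = rng.uniform(0.0, lam)
    else:
        # Continuous part: mixture CDF evaluated at y.
        u = lam + (1.0 - lam) * kuma_cdf(y, omega, phi)
    # Keep u strictly inside (0, 1) before inverting the normal CDF.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return NormalDist().inv_cdf(u)


rng = random.Random(2021)
# Hypothetical fitted values for three observations: (y, lambda, omega, phi).
fitted = [(0.0, 0.2, 0.30, 1.5), (0.30, 0.2, 0.30, 1.5), (0.75, 0.1, 0.40, 2.0)]
residuals = [rq_residual(y, lam, om, ph, rng) for (y, lam, om, ph) in fitted]
print([round(r, 3) for r in residuals])
```

Under a correctly specified model these residuals are approximately standard normal, which is what the envelope bands in Figure 1 assess.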
6 SIMULATION EVIDENCE
In what follows we shall report the results of Monte Carlo simulations that were performed to
evaluate the finite sample performances of the maximum likelihood estimators of the parameters
that index the inflated Kumaraswamy model. We also compute the coverage rates of 1 − 𝛿 = 0.95
asymptotic confidence intervals.
We consider three different scenarios, namely: (S1) inflation at both 0 and 1, (S2) inflation
at one, and (S3) inflation at zero. The model uses the logit link for g1 (⋅), g2 (⋅), and g3 (⋅), and the
log link for g4 (⋅). Each submodel has an intercept and a single regressor. The sample sizes are
n = 50, 100, 200, 500, 1000. All results are based on 10,000 replications and were obtained using the
R statistical computing environment (R Core Team, 2021). Log-likelihood maximization was per-
formed using the BFGS nonlinear optimization method with analytical first derivatives (Nocedal
& Wright, 2006; Press et al., 1988). The starting values for the median submodel coefficients
were obtained by running an ordinary least squares regression of g3 (y) on the covariate used in that submodel.
The starting values of all other parameters were set equal to zero. The values of z1 , w1 , x1 ,
and q1 are set equal to one so that all submodels contain an intercept. The values of the covari-
ates z2 , w2 , x2 , and q2 were obtained as random standard uniform draws. There were very few
optimization failures (0.24%) when the sample size was quite small (n = 50) and none for larger
samples.
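The data-generating step of such a simulation can be sketched as follows for the zero-inflated case (S3). This is an illustrative sketch under stated assumptions rather than the authors' R code: it assumes the median parameterization with 𝜙 as the first Kumaraswamy shape parameter, uses the parameter values listed in Table 3, and the helper names (`inv_logit`, `draw_zero_inflated_kuma`, `simulate_sample`) are hypothetical.

```python
import math
import random


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def draw_zero_inflated_kuma(lam, omega, phi, rng):
    """One draw from a zero-inflated Kumaraswamy law (ASSUMED median parameterization)."""
    if rng.random() < lam:
        return 0.0  # point mass at zero with probability lam
    b = math.log(0.5) / math.log1p(-(omega ** phi))
    u = rng.random()
    # Inverse of F(y) = 1 - (1 - y**phi)**b.
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / phi)


def simulate_sample(n, gamma, beta, varsigma, rng):
    """Generate n responses; each submodel has an intercept and one uniform regressor."""
    sample = []
    for _ in range(n):
        z2, x2, q2 = rng.random(), rng.random(), rng.random()
        lam = inv_logit(gamma[0] + gamma[1] * z2)        # logit link, g1
        omega = inv_logit(beta[0] + beta[1] * x2)        # logit link, g3
        phi = math.exp(varsigma[0] + varsigma[1] * q2)   # log link, g4
        sample.append(draw_zero_inflated_kuma(lam, omega, phi, rng))
    return sample


rng = random.Random(123)
y = simulate_sample(500, (-0.5, -1.0), (1.0, -2.0), (1.0, 1.5), rng)
share_zero = sum(v == 0.0 for v in y) / len(y)
print(f"share of zeros: {share_zero:.3f}")
```

In a full Monte Carlo experiment this generator would be called once per replication, followed by likelihood maximization on each simulated sample.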
The parameter values and the simulation results are presented in Tables 1 (double infla-
tion), 2 (inflation at one), and 3 (inflation at zero). We report results on point (means, SDs,
biases, and relative biases) and interval (coverage rates of 95% confidence intervals) estima-
tion. Coverage rates are expressed as proportions and relative biases are expressed as percent-
ages. We define relative bias as the estimated bias divided by the true parameter value times
100.
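The reported summaries can be computed as in the sketch below. It follows the definitions given above (relative bias as bias divided by the true value times 100, coverage as the proportion of intervals containing the true value); the Wald form of the interval, the function name `summarize` and the input values are illustrative assumptions.

```python
from statistics import NormalDist, mean, stdev


def summarize(estimates, std_errors, true_value, level=0.95):
    """Monte Carlo summaries for one parameter across replications."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)  # about 1.96 for level = 0.95
    avg = mean(estimates)
    bias = avg - true_value
    # An interval covers when estimate - z*se <= true value <= estimate + z*se.
    covered = [
        est - z * se <= true_value <= est + z * se
        for est, se in zip(estimates, std_errors)
    ]
    return {
        "mean": avg,
        "sd": stdev(estimates),
        "bias": bias,
        "relative_bias_pct": 100.0 * bias / true_value,
        "coverage": sum(covered) / len(covered),
    }


# Hypothetical replications for a parameter whose true value is 1.5.
est = [1.48, 1.55, 1.61, 1.39, 1.52]
se = [0.10, 0.10, 0.12, 0.04, 0.10]
out = summarize(est, se, true_value=1.5)
print(out)
```

Note that with this definition a negative bias on a negative parameter yields a positive relative bias, which matches the sign pattern in the tables.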
The simulation evidence we report shows that the estimators’ biases, relative biases, and SDs
decay toward zero as the sample size grows, as expected. Also, the empirical coverages of all con-
fidence intervals approach 95% (the nominal coverage) as the number of data points increases.
The largest small sample relative biases correspond to the estimators of the parameters that index
the submodel for pi under double inflation. When n = 200, however, such biases are reduced to
approximately 6%. The confidence intervals’ coverage rates are close to the nominal level (95%)
when n ≥ 200. For example, in scenario 1 (double inflation), the coverage rates range from 93.7% to
95.9% when n = 200 and from 94.3% to 95.2% when n = 1000.
TABLE 1 Means, SDs, biases, relative biases (%), and coverage rates (proportions): Scenario 1
Parameter 𝜸1 𝜸2 𝝅1 𝝅2 𝜷1 𝜷2 𝝇1 𝝇2
Parameter value −0.500 −1.000 1.000 −1.000 1.000 −2.000 1.000 1.500
n = 1000
Mean −0.500 −1.003 1.011 −1.012 1.001 −2.001 1.001 1.505
SD 0.135 0.258 0.268 0.455 0.030 0.041 0.058 0.098
Bias 0.000 −0.003 0.011 −0.012 0.001 −0.001 0.001 0.005
Relative bias −0.002 0.290 1.085 1.216 0.053 0.036 0.083 0.312
Coverage rate 0.949 0.948 0.950 0.952 0.945 0.943 0.952 0.949
n = 500
Mean −0.502 −1.007 1.027 −1.036 1.001 −2.001 1.002 1.508
SD 0.191 0.348 0.410 0.668 0.041 0.058 0.082 0.137
Bias −0.002 −0.007 0.027 −0.036 0.001 −0.001 0.002 0.008
Relative bias 0.498 0.679 2.712 3.565 0.076 0.063 0.196 0.523
Coverage rate 0.953 0.954 0.950 0.947 0.946 0.945 0.949 0.948
n = 200
Mean −0.503 −1.017 1.052 −1.061 1.003 −2.004 1.001 1.529
SD 0.304 0.572 0.606 1.088 0.068 0.093 0.141 0.245
Bias −0.003 −0.017 0.052 −0.061 0.003 −0.004 0.001 0.029
Relative bias 0.505 1.662 5.213 6.092 0.299 0.225 0.081 1.912
Coverage rate 0.950 0.950 0.959 0.951 0.937 0.938 0.939 0.938
n = 100
Mean −0.505 −1.032 1.136 −1.163 1.003 −2.004 1.005 1.546
SD 0.455 0.864 1.050 1.697 0.095 0.132 0.206 0.327
Bias −0.005 −0.032 0.136 −0.163 0.003 −0.004 0.005 0.046
Relative bias 1.002 3.206 13.555 16.276 0.290 0.186 0.549 3.088
Coverage rate 0.953 0.951 0.970 0.963 0.930 0.921 0.936 0.932
n = 50
Mean −0.500 −1.121 1.264 −1.269 1.009 −2.001 0.992 1.616
SD 0.625 1.259 2.406 4.425 0.130 0.294 0.335 0.551
Bias 0.000 −0.121 0.264 −0.269 0.009 −0.001 −0.008 0.116
Relative bias −0.059 12.126 26.406 26.947 0.900 0.042 −0.779 7.760
Coverage rate 0.960 0.959 0.977 0.985 0.894 0.874 0.918 0.909
TABLE 2 Means, SDs, biases, relative biases (%), and coverage rates (proportions): Scenario 2 (p = 1)
Parameter 𝜸1 𝜸2 𝜷1 𝜷2 𝝇1 𝝇2
Parameter value −0.500 −1.000 1.000 −2.000 1.000 1.500
n = 1000
Mean −0.501 −1.002 1.001 −2.001 1.000 1.505
SD 0.136 0.252 0.028 0.040 0.060 0.097
Bias −0.001 −0.002 0.001 −0.001 0.000 0.005
Relative bias 0.289 0.208 0.050 0.043 0.014 0.323
Coverage rate 0.948 0.949 0.946 0.946 0.953 0.951
n = 500
Mean −0.501 −1.006 1.001 −2.002 1.000 1.510
SD 0.202 0.364 0.041 0.057 0.081 0.136
Bias −0.001 −0.006 0.001 −0.002 0.000 0.010
Relative bias 0.150 0.578 0.113 0.098 0.045 0.689
Coverage rate 0.952 0.957 0.946 0.943 0.949 0.947
n = 200
Mean −0.499 −1.024 1.003 −2.005 1.005 1.522
SD 0.293 0.576 0.066 0.091 0.141 0.232
Bias 0.001 −0.024 0.003 −0.005 0.005 0.022
Relative bias −0.250 2.448 0.286 0.227 0.490 1.465
Coverage rate 0.951 0.953 0.939 0.938 0.946 0.941
n = 100
Mean −0.504 −1.049 1.003 −2.007 1.008 1.545
SD 0.446 0.831 0.095 0.138 0.190 0.321
Bias −0.004 −0.049 0.003 −0.007 0.008 0.045
Relative bias 0.723 4.908 0.288 0.327 0.807 2.997
Coverage rate 0.954 0.954 0.926 0.918 0.934 0.930
n = 50
Mean −0.526 −1.070 1.011 −2.020 1.003 1.621
SD 0.577 1.137 0.134 0.188 0.295 0.500
Bias −0.026 −0.070 0.011 −0.020 0.003 0.121
Relative bias 5.123 7.008 1.120 1.009 0.317 8.074
Coverage rate 0.957 0.960 0.887 0.871 0.921 0.904

TABLE 3 Means, SDs, biases, relative biases (%), and coverage rates (proportions): Scenario 3 (p = 0)
Parameter 𝜸1 𝜸2 𝜷1 𝜷2 𝝇1 𝝇2
Parameter value −0.500 −1.000 1.000 −2.000 1.000 1.500
n = 1000
Mean −0.499 −1.006 1.000 −2.000 1.001 1.505
SD 0.139 0.256 0.029 0.040 0.058 0.097
Bias 0.001 −0.006 0.000 0.000 0.001 0.005
Relative bias −0.158 0.558 −0.008 0.007 0.062 0.355
Coverage rate 0.947 0.951 0.949 0.946 0.951 0.951
n = 500
Mean −0.503 −1.005 1.000 −2.001 1.001 1.510
SD 0.192 0.357 0.041 0.059 0.085 0.142
Bias −0.003 −0.005 0.000 −0.001 0.001 0.010
Relative bias 0.662 0.521 0.016 0.029 0.102 0.636
Coverage rate 0.950 0.953 0.946 0.943 0.951 0.950
n = 200
Mean −0.503 −1.019 1.001 −2.002 1.005 1.520
SD 0.319 0.594 0.067 0.096 0.133 0.227
Bias −0.003 −0.019 0.001 −0.002 0.005 0.020
Relative bias 0.629 1.948 0.060 0.093 0.543 1.336
Coverage rate 0.946 0.946 0.937 0.934 0.941 0.940
n = 100
Mean −0.509 −1.045 1.004 −2.008 1.001 1.558
SD 0.427 0.810 0.102 0.148 0.213 0.362
Bias −0.009 −0.045 0.004 −0.008 0.001 0.058
Relative bias 1.792 4.515 0.425 0.382 0.092 3.896
Coverage rate 0.954 0.953 0.929 0.926 0.938 0.933
n = 50
Mean −0.511 −1.101 1.008 −2.017 1.035 1.563
SD 0.693 1.396 0.112 0.185 0.285 0.486
Bias −0.011 −0.101 0.008 −0.017 0.035 0.063
Relative bias 2.209 10.148 0.790 0.847 3.509 4.215
Coverage rate 0.961 0.961 0.908 0.900 0.916 0.915

TABLE 4 Descriptive statistics, proportion of people who live in households with inadequate water supply and sewage
Min First quartile Median Mean Third quartile Max
0.0000 0.0053 0.0326 0.0920 0.1302 0.8536

FIGURE 2 Histogram (left panel; frequency against y) and adjusted boxplot (right panel), proportion of people who live in households with inadequate water supply and sewage
7 AN ANALYSIS OF WATER SUPPLY AND SANITATION IN BRAZIL
In what follows we shall use data from the 2010 Brazilian Atlas of Human Development (“Atlas
do Desenvolvimento Humano no Brasil”). The variable of interest (response) is the proportion
of people who live in households with inadequate water supply and sewage in 5565 Brazilian
municipalities (n = 5565). Descriptive statistics are presented in Table 4. The median and mean
values are 3.26% and 9.20%, respectively. The minimal value is zero, which indicates that there is
inflation at zero. Indeed, there are 434 observations for which the response value equals zero. The
data histogram and the adjusted boxplot (Hubert & Vandervieren, 2008) are displayed in Figure 2;
adjusted boxplots are recommended for skewed distributions. It is clear from such plots that there
is data inflation and distributional asymmetry.
We preselected as candidate regressors 10 variables from the complete data set using a Spear-
man correlation analysis. They are listed in Table 5. Figure 3 contains a scatterplot involving
the response and the covariates. In each panel, we report the value of the correlation coefficient
between the dependent variable and the relevant regressor. The maximal absolute correlation
between all pairs of regressors is 0.83. We then selected the following zero inflated Kumaraswamy
regression model by sequentially removing covariates that were not statistically significant at the
5% significance level:
$$g_1(\lambda_i) = \gamma_1 + z_{i2}\gamma_2 + z_{i5}\gamma_5 + z_{i6}\gamma_6 + z_{i7}\gamma_7 + z_{i9}\gamma_9 + z_{i10}\gamma_{10},$$

$$g_3(\omega_i) = \beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + x_{i4}\beta_4 + x_{i5}\beta_5 + x_{i6}\beta_6 + x_{i7}\beta_7 + x_{i8}\beta_8 + x_{i9}\beta_9,$$

$$g_4(\phi_i) = \varsigma_1 + q_{i3}\varsigma_3 + q_{i4}\varsigma_4 + q_{i7}\varsigma_7 + q_{i9}\varsigma_9 + q_{i10}\varsigma_{10},$$

i = 1, … , 5565, where g1 (⋅) and g3 (⋅) are the logit link, and g4 (⋅) is the log link. The parameter
estimates, asymptotic standard errors (se), z test statistics for testing the null hypothesis that each
parameter equals zero, and corresponding p-values are presented in Table 6.

TABLE 5 Covariates used in the regression analysis
Covariate          Description
z2 , x2 , q2       Illiteracy rate of the population between 18 and 24 years of age
z3 , x3 , q3       Net attendance rate in higher education
z4 , x4 , q4       Proportion of the population aged 18–24 years attending primary school
z5 , x5 , q5       Proportion of the population aged 18–24 years with complete secondary education
z6 , x6 , q6       Proportion of population aged >25 years with completed tertiary education
z7 , x7 , q7       Proportion of extremely poor children
z8 , x8 , q8       Proportion of children living in households where none of the residents has completed elementary school
z9 , x9 , q9       Proportion of female heads of household without complete elementary school and with at least one child <15 years
z10 , x10 , q10    Proportion of people living in households without electricity

We note from the figures in Table 6 that all coefficients are statistically different from zero at
the 5% significance level, and most of them are also significant at the 1% level.
The model pseudo-R2 equals 0.6558, that is, the fitted model explains approximately 2/3 of the
variability in the response. We note that the covariate related to electricity supply (x10 ) does not
impact the response median but impacts the two other distribution parameters. The covariates
related to youth and adult schooling negatively impact the median prevalence of people who live
in households with inadequate water supply and sewage. By contrast, the impact is positive for
the proportion of extremely poor children.
Figure 4 contains two plots: (i) index plot of randomized quantile residuals (Dunn &
Smyth, 1996) and (ii) quantile-quantile (QQ) plot of such residuals with simulated envelopes.
They indicate that the model appears to be correctly specified. There is only one data point that is
clearly atypical and whose residual falls below −6. Such an observation corresponds to Aroeiras
do Itaim, a municipality located in the state of Piauí, which only has 2511 inhabitants. Over 10%
of the population of that county live in households with no electricity, only 15% of the population
aged 18–24 years completed secondary education, and only 0.6% of the population has a college
degree. Nonetheless, all households in Aroeiras do Itaim have adequate water supply and sewage,
that is, y = 0. No inferential conclusion is reversed when that municipality is removed from the
data.
We also fitted inflated beta and inflated simplex regression models to the data. The latter was,
as noted in Section 1, introduced by Liu et al. (2020). As proposed by the authors, however, the
model does not allow for varying dispersion and the statistical inference in their paper is only
developed for doubly inflated data. We thus adapted their model to our setting. The parameter
FIGURE 3 Scatterplot of the response and covariates
estimates for the three models are given in Table 7. Some conclusions can be drawn from the
reported estimates. First, some estimated regression coefficients are similar. For instance, the
point estimates of 𝛽3 , the coefficient associated with the net attendance rate in higher education
covariate, are −4.2699, −4.6691, and −5.0508 for the Kumaraswamy, beta, and simplex models,
respectively. Interestingly, even though the point estimates are close, the three covariate impacts
on the location parameter can be quite different since 𝜕𝜔i ∕𝜕xi3 is a function of the other estimated
location regression coefficients as well. Second, some point estimates are considerably different
in some models. This is the case, for example, of the estimates of 𝛽6 , the regression coefficient
associated with the proportion of population aged >25 years with completed tertiary education:
−7.2605 (Kumaraswamy), −3.8929 (beta), and −6.9264 (simplex); notice that the point estimate
is considerably smaller in absolute value for the beta model.
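To see why similar coefficient estimates can translate into different impacts, consider the logit link used for g3 (⋅) in the Kumaraswamy fit. The following short derivation is offered as an illustration; 𝜂3i denotes the linear predictor of the location submodel.

$$
\omega_i = \frac{\exp(\eta_{3i})}{1 + \exp(\eta_{3i})}
\quad\Longrightarrow\quad
\frac{\partial \omega_i}{\partial x_{i3}} = \beta_3\,\omega_i\,(1 - \omega_i).
$$

The factor 𝜔i (1 − 𝜔i ) depends on the whole linear predictor, and hence on every location coefficient and covariate value. It is at most 1/4, attained at 𝜔i = 0.5, and shrinks toward zero as 𝜔i approaches either boundary.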
In order to gain insight on the relative merits of each model, we computed some additional
measures for the Kumaraswamy, beta, and simplex models. First, we computed the values of the
TABLE 6 Point estimates, standard errors, z test statistics, and p-values
Parameter Estimate se z p-value
𝛾1 −1.9361 0.4281 −4.5223 <.0001
𝛾2 15.7385 6.6937 2.3513 .0187
𝛾5 3.4949 0.6261 5.5818 <.0001
𝛾6 −8.3964 1.7491 −4.8005 <.0001
𝛾7 −17.1408 2.1954 −7.8075 <.0001
𝛾9 −2.4371 1.0330 −2.3592 .0183
𝛾10 −154.0586 24.5603 −6.2727 <.0001
𝛽1 −2.0959 0.1749 −11.9821 <.0001
𝛽2 3.5447 0.6590 5.3786 <.0001
𝛽3 −4.2699 0.4809 −8.8792 <.0001
𝛽4 3.5886 0.6488 5.5313 <.0001
𝛽5 −2.5087 0.2554 −9.8239 <.0001
𝛽6 −7.2605 0.7795 −9.3138 <.0001
𝛽7 4.4590 0.1774 25.1305 <.0001
𝛽8 −1.8347 0.2252 −8.1465 <.0001
𝛽9 2.0442 0.1981 10.3166 <.0001
𝜍1 −0.1395 0.0494 −2.8232 .0048
𝜍3 −1.0452 0.2326 −4.4941 <.0001
𝜍4 1.7354 0.5210 3.3306 .0009
𝜍7 0.7938 0.1302 6.0983 <.0001
𝜍9 0.4926 0.1544 3.1896 .0014
𝜍10 −1.1645 0.2406 −4.8405 <.0001
FIGURE 4 Residuals (left panel; residuals against observation index) and QQ plot with simulated envelopes (right panel; residuals against normal quantiles)
TABLE 7 Parameter estimates of the three models
Parameter   Inflated Kumaraswamy   Inflated beta   Inflated simplex
𝛾1 −1.9361 −1.9360 −1.9088
𝛾2 15.7385 15.7360 14.9747
𝛾5 3.4949 3.4948 3.4747
𝛾6 −8.3964 −8.3964 −8.4664
𝛾7 −17.1408 −17.1401 −17.1108
𝛾9 −2.4371 −2.4371 −2.4386
𝛾10 −154.0586 −154.0856 −154.6976
𝛽1 −2.0959 −2.4370 −1.3449
𝛽2 3.5447 2.5390 2.7758
𝛽3 −4.2699 −4.6691 −5.0508
𝛽4 3.5886 3.6356 0.7588
𝛽5 −2.5087 −1.4186 −2.2591
𝛽6 −7.2605 −3.8929 −6.9264
𝛽7 4.4590 4.1610 2.3664
𝛽8 −1.8347 −1.1105 −1.4807
𝛽9 2.0442 1.7707 1.6333
𝜍1 −0.1395 3.09660 2.5686
𝜍3 −1.0452 5.76313 3.1986
𝜍4 1.7354 −0.84792 −3.8609
𝜍7 0.7938 −2.61235 −2.7829
𝜍9 0.4926 −0.79029 −1.0001
𝜍10 −1.1645 −1.63429 1.8410
two most commonly used model selection criteria, AIC and BIC. The smaller these values, the
better. Second, we computed the mean predicted values corresponding to all responses whose val-
ues equal zero (MPV0 ), that is, we computed the average of all 𝜔̂ i such that yi = 0. The smaller
the MPV0 , the better, since small values are indicative of predicted response values that are close
to zero. Finally, we counted the number of randomized quantile residuals that lie outside (−3, 3).
Such data points are not well fitted by the model. The smaller the count, the better. The four com-
parative measures are presented in Table 8 for the three models. It is noteworthy that they favor
the inflated Kumaraswamy model: (i) its AIC and BIC values are smaller, (ii) its MPV0 value is
nearly 37% (48%) smaller than that of the beta (simplex) model, and (iii) its count of residuals
that exceed 3 in absolute value is much smaller: 11 vs. 20 (beta) and 31 (simplex). We then con-
clude that there is some indication that a better fit is achieved by using the inflated Kumaraswamy
model.
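The percentages quoted above follow directly from the entries of Table 8. The arithmetic can be reproduced with the short sketch below; the dictionary layout and the helper name `pct_reduction` are ours, while the numbers are copied from the table.

```python
# Values copied from Table 8.
mpv0 = {"kumaraswamy": 0.0124, "beta": 0.0197, "simplex": 0.0238}
aic = {"kumaraswamy": -17897.72, "beta": -17766.70, "simplex": -16595.14}
bic = {"kumaraswamy": -17751.98, "beta": -17620.90, "simplex": -16449.40}


def pct_reduction(reference, value):
    """Percentage by which `value` falls below `reference`."""
    return 100.0 * (reference - value) / reference


red_beta = pct_reduction(mpv0["beta"], mpv0["kumaraswamy"])
red_simplex = pct_reduction(mpv0["simplex"], mpv0["kumaraswamy"])
best_aic = min(aic, key=aic.get)  # model with the smallest AIC
best_bic = min(bic, key=bic.get)  # model with the smallest BIC

print(f"MPV0 reduction vs beta: {red_beta:.1f}%, vs simplex: {red_simplex:.1f}%")
print(f"smallest AIC: {best_aic}, smallest BIC: {best_bic}")
```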
We shall now consider the impacts of the covariates related to college education on the loca-
tion parameter for the three fitted models. In Figure 5 we plot estimates of 𝜕𝜔i ∕𝜕xi3 (left panel)
and 𝜕𝜔i ∕𝜕xi6 (right panel) against values of the corresponding covariates using point estimates
obtained from the Kumaraswamy, beta, and simplex models. All other covariates are fixed at their
TABLE 8 Some comparative measures for the three fitted models
Model                   AIC         BIC         MPV0     Residuals outside (−3, 3)
Inflated Kumaraswamy    −17897.72   −17751.98   0.0124   11
Inflated beta           −17766.70   −17620.90   0.0197   20
Inflated simplex        −16595.14   −16449.40   0.0238   31
FIGURE 5 Estimated impacts (vertical axes) against x3 (left panels) and x6 (right panels); top panels are: impacts of (i) net attendance rate in higher education (top left
panel) and (ii) proportion of population aged >25 years with completed tertiary education (top right panel) on the
median (Kumaraswamy) and mean (beta and simplex) response; bottom panels are impacts of (iii) net attendance
rate in higher education (bottom left panel) and (iv) proportion of population aged >25 years with completed
tertiary education (bottom right panel) on five response quantiles (Kumaraswamy): 𝜏 = 0.10, 0.25, 0.50, 0.75, 0.90
median values. At the outset, we focus on the top two panels in which 𝜔i is the ith median in
the Kumaraswamy model and the ith mean in the two alternative models (beta and simplex).
The impacts of the regressors are negative and decrease in strength according to all three models, that
is, the mean and median shares of people who live with inadequate water supply and sewage
decrease as net attendance in higher education and the proportion of adults with college educa-
tion increase, and the impacts of both covariates weaken as their values increase. Interestingly,
both mean impacts of the former are uniformly more intense than the median impact, the simplex
impact being more intense than the beta impact, whereas there is a crossing pattern in the impacts
of the latter for the beta but not for the simplex model: the median impact of x6 is stronger than the
beta mean impact up to approximately 0.12 and less intense after that point; the simplex impact
is uniformly more intense than the Kumaraswamy impact, but displays a crossing with the beta
impact. Overall, it seems that, at least for the most part, the mean impacts of x3 and x6 are stronger
than the corresponding median impacts, especially at low covariate levels.
It is also interesting to notice that even though the point estimates of 𝛽3 are somewhat similar
in the three fitted models, the Kumaraswamy, beta, and simplex impact curves of x3 in Figure 5
are quite far apart at low levels of the covariate. When the net attendance rate in higher education
is high (say, in excess of 1/3), the median and mean impacts of such a regressor are similar, the
mean impact being slightly stronger (for both beta and simplex models). However, the impact
strengths are quite different when net attendance in higher education is low. In that case, the
mean impact is considerably stronger than the median impact (again, for both beta and simplex
models).
As noted earlier, a novel aspect of the inflated Kumaraswamy regression model introduced in
this article is that it can be extended to a class of parametric regression quantile models. Under
the more general formulation, 𝜔i is the 𝜏th quantile of yi ; when 𝜏 = 0.5, we return to the standard
formulation of the model. In order to exemplify the use of our class of models as a class of paramet-
ric regression quantile models, in the bottom panels of Figure 5 we present the impact curves of
x3 (left panel) and x6 (right panel) on the 𝜏th response quantile for 𝜏 = 0.10, 0.25, 0.50, 0.75, 0.90.
Interestingly, in both cases the impacts are negative even in the lower tail of the response condi-
tional distribution (i.e., 𝜏 = 0.10). Additionally, unlike the impacts of net attendance rate in higher
education (x3 ), those of the proportion of population aged >25 years with completed tertiary edu-
cation (x6 ) on the different response conditional distribution quantiles display convergence as the
covariate value approaches 0.5.
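The quantile formulation can be illustrated numerically. The sketch below assumes the Kumaraswamy CDF F(y) = 1 − (1 − y^𝜙)^b with 𝜙 playing the role of the first shape parameter; solving F(𝜔) = 𝜏 for b then gives b = log(1 − 𝜏)/log(1 − 𝜔^𝜙), which reduces to the median case when 𝜏 = 0.5. The function names and the numerical values are hypothetical.

```python
import math


def kuma_shape_b(omega, phi, tau=0.5):
    """Second shape parameter that makes omega the tau-th quantile,
    ASSUMING F(y) = 1 - (1 - y**phi)**b."""
    return math.log(1.0 - tau) / math.log(1.0 - omega ** phi)


def kuma_cdf(y, phi, b):
    """Kumaraswamy CDF with shape parameters phi and b."""
    return 1.0 - (1.0 - y ** phi) ** b


omega, phi = 0.05, 1.2  # hypothetical quantile and precision values
for tau in (0.10, 0.25, 0.50, 0.75, 0.90):
    b = kuma_shape_b(omega, phi, tau)
    # The CDF evaluated at omega should return tau for every choice of tau.
    print(tau, round(kuma_cdf(omega, phi, b), 6))
```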
As we have seen above, both higher education net attendance and the share of the adult
population with college degree negatively impact the median prevalence of people who live in
households with inadequate water supply and sewage. Interestingly, however, only the former
impacts the probability that no one lives under such conditions, the impact being positive. That
is, all else being equal, higher college net attendance rates are associated with higher probabili-
ties of no one living in households with inadequate water supply and sewage. By contrast, such
probabilities are considerably reduced as the prevalence of extremely poor children and of people
living in households without electricity increase.
8 CONCLUDING REMARKS AND DIRECTIONS FOR FUTURE RESEARCH
Oftentimes it is necessary to model the behavior of certain variables that assume values in the
standard unit interval, (0, 1), which can be accomplished by using models that make use of the
Kumaraswamy law. In some cases, however, the data contain zeros and/or ones. When that hap-
pens, the interest lies in modeling variables that assume values in [0, 1), (0, 1], or [0, 1]. We say that
there is inflation at zero, at one, and at zero and one, respectively. The former two characterize
single data inflation, whereas the latter corresponds to double inflation. The underlying probabil-
ity law must combine continuous and discrete components. In this article, we introduced a class
of regression models that can be used with such data, namely: the class of inflated Kumaraswamy
regression models. It is based on the Kumaraswamy law which is an appealing alternative to the
commonly used beta law. The proposed regression model comprises four submodels whose
parameters can be estimated by maximum likelihood. The model structure is reduced to three
submodels under single inflation. Two novel features of our model are (i) it allows for single and
double inflation, and (ii) the model expected information matrix is available in closed form. Addi-
tionally, it can be used to model median effects or, more generally, to evaluate the impacts of the
conditioning variables on different quantiles of the response distribution.
The proposed model was used to analyze the impacts of several conditioning variables on
the proportion of people who live in households with inadequate water supply and sewage in
Brazil. Since in nearly 17% of the Brazilian municipalities no one lives in households with inad-
equate water supply and sewage, the data display inflation at zero. We estimated and plotted the
impacts of (i) higher education net attendance rate and (ii) the adult population share with college
degree on the prevalence of inadequate water supply and sewage. The results showed that median
and mean impacts display different patterns in the two cases. In particular, our empirical results
revealed that policies directed to increasing the share of the population with complete tertiary
education in places where it is low are particularly effective in lowering the median prevalence of
people who live with inadequate water supply and sewage. It is noteworthy that the mean predic-
tion error from our model is nearly 37 and 48% smaller than those obtained with the competing
inflated beta and simplex models, respectively; other criteria also favor our model. By considering
an extended version of our model, we evaluated the impacts of the net attendance rate in higher
education and of the proportion of the adult population with college degree on five quantiles of
the response conditional distribution.
In future research, we plan to develop diagnostic analysis for the inflated Kumaraswamy regres-
sion model. In particular, we plan to develop local influence analysis and obtain expressions for
Cook’s distance and for the generalized leverage.
ACKNOWLEDGMENTS
The authors gratefully acknowledge partial financial support from CAPES and CNPq (Grant num-
bers: 305350/2017-0 and 301651/2017-5), Brazil. The authors also thank two anonymous referees
whose comments and suggestions led to a much improved manuscript.
ORCID
Fábio M. Bayer https://orcid.org/0000-0002-1464-0805
REFERENCES
Adukia, A. (2017). Sanitation and education. American Economic Journal: Applied Economics, 9, 23–59.
Bayer, F. M., Bayer, D. M., & Pumi, G. (2017). Kumaraswamy autoregressive moving average models for double
bounded environmental data. Journal of Hydrology, 555, 385–396.
Bayes, C. L., Bazán, J. L., & De Castro, M. (2017). A quantile parametric mixed regression model for bounded
response variables. Statistics and Its Interface, 10, 483–493.
Cribari-Neto, F., & Santos, J. (2019). Inflated Kumaraswamy distributions. Anais da Academia Brasileira de
Ciências, 92, e20180955.
Dey, S., Mazucheli, J., & Nadarajah, S. (2018). Kumaraswamy distribution: Different methods of estimation.
Computational and Applied Mathematics, 37, 2094–2211.
Di Brisco, A. M., & Migliorati, S. (2020). A new mixed-effects mixture model for constrained longitudinal data.
Statistics in Medicine, 39, 129–145.
Dreibelbis, R., Greene, L. E., Freeman, M. C., Saboori, S., Chase, R. P., & Rheingans, R. (2013). Water, sanitation,
and primary school attendance: A multi-level assessment of determinants of household-reported absence in
Kenya. International Journal of Educational Development, 33, 457–465.
Dunn, P. K., & Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical
Statistics, 5, 236–244.
Fahrmeir, L., & Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator
in generalized linear models. The Annals of Statistics, 1, 342–368.
Ferrari, S., & Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied
Statistics, 31, 799–815.
Fletcher, S., & Ponnambalam, K. (1996). Estimation of reservoir yield and storage distribution using moments.
Journal of Hydrology, 182, 259–275.
Galvis, D. M., Bandyopadhyay, D., & Lachos, V. H. (2014). Augmented mixed beta regression models for periodontal
proportion data. Statistics in Medicine, 33, 3759–3771.
Gradshteyn, I. S., & Ryzhik, I. M. (2007). Table of integrals, series, and products (7th ed.). London, UK: Academic
Press.
Hubert, M., & Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics &
Data Analysis, 52, 5186–5201.
Jasper, C., Le, T.-T., & Bartram, J. (2012). Water and sanitation in schools: A systematic review of the health and
educational outcomes. International Journal of Environmental Research and Public Health, 9, 2772–2787.
Jones, M. (2009). Kumaraswamy distribution: A beta-type distribution with some tractability advantages. Statistical
Methodology, 6, 70–81.
Kumaraswamy, P. (1976). Sinepower probability density function. Journal of Hydrology, 31, 181–184.
Lemonte, A. (2011). Improved point estimation for the Kumaraswamy distribution. Journal of Statistical Compu-
tation and Simulation, 81, 1971–1982.
Liu, F., & Eugenio, E. C. (2020). A review and comparison of Bayesian and likelihood-based inferences in beta
regression and zero-or-one-inflated beta regression. Statistical Methods in Medical Research, 27, 1024–1044.
Liu, F., & Kong, Y. (2015). Zoib: An R package for Bayesian inference for beta regression and zero/one inflated beta
regression. R Journal, 7, 34–51.
Liu, P., Kam Yuen, K., Wu, L., Tian, G., & Li, T. (2020). Zero-one-inflated simplex regression models for the analysis
of continuous proportion data. Statistics and Its Interface, 13, 193–208.
Menezes, A. F. B., Mazucheli, J., & Bourguignon, M. (2021). A parametric quantile regression approach for
modelling zero-or-one inflated double bounded data. Biometrical Journal, 63, 841–858.
Mitnik, P. A., & Baek, S. (2013). The Kumaraswamy distribution: Median-dispersion re-parameterizations for
regression modeling and simulation-based estimation. Statistical Papers, 54, 177–192.
Mohsenkhani, Z. F., Mohhamadzadeh, M., & Baghfalaki, T. (2019). Augmented mixed beta regression models with
skew-normal independent distributions: Bayesian analysis of labor force data. Communications in Statistics -
Simulation and Computation, 48, 2147–2164.
Morgan, B. J. T., Palmer, K. J., & Ridout, M. S. (2007). Negative score test statistic. The American Statistician, 61,
285–288.
Nagelkerke, N. J. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78,
691–692.
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (2nd ed.). New York, NY: Springer.
Nogarotto, D. C., Azevedo, C. L. N., & Bazán, J. L. (2020). Bayesian modeling and prior sensitivity analysis for
zero-one augmented beta regression models with an application to psychometric data. Brazilian Journal of
Probability and Statistics, 34, 304–322.
Ospina, R., & Ferrari, S. L. P. (2010). Inflated beta distributions. Statistical Papers, 51, 111–126.
Ospina, R., & Ferrari, S. L. P. (2012). A general class of zero-or-one inflated beta regression models. Computational
Statistics & Data Analysis, 56, 1609–1623.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1988). Numerical recipes in C (2nd ed.). Cambridge,
MA: Cambridge University Press.
Pumi, G., Rauber, C., & Bayer, F. M. (2020). Kumaraswamy regression model with Aranda-Ordaz link function.
Test, 29, 1051–1071.
Queiroz, F. F., & Lemonte, A. J. (2021). A broad class of zero one inflated regression models for rates and
proportions. Canadian Journal of Statistics. https://doi.org/10.1002/cjs.11576.
R Core Team. (2021). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for
Statistical Computing. Retrieved from https://www.R-project.org/
Santos, B., & Bolfarine, H. (2015). Bayesian analysis for zero-or-one inflated proportion data using quantile
regression. Journal of Statistical Computation and Simulation, 85, 3579–3593.
Sundar, V., & Subbiah, K. (1989). Application of double bounded probability density function for analysis of ocean
waves. Ocean Engineering, 16, 193–200.
How to cite this article: Bayer FM, Cribari-Neto F, Santos J. Inflated Kumaraswamy
regressions with application to water supply and sanitation in Brazil. Statistica
Neerlandica. 2021;75:453–481. https://doi.org/10.1111/stan.12242
APPENDIX
In what follows we shall derive the quantities required for obtaining a closed-form expression for
Fisher’s information matrix in inflated Kumaraswamy regressions and develop some results that
are useful to that end.
Fisher’s information matrix

In order to derive Fisher’s information matrix, we need to compute the expected values of all
second-order derivatives. It can be shown that
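$$
\begin{aligned}
\frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \gamma_j \partial \gamma_t}
&= \sum_{i=1}^{n} \frac{\partial}{\partial \lambda_i}\left( \frac{\partial \ell_1(\boldsymbol{\gamma})}{\partial \lambda_i} \frac{\partial \lambda_i}{\partial \eta_{1i}} \frac{\partial \eta_{1i}}{\partial \gamma_j} \right) \frac{\partial \lambda_i}{\partial \eta_{1i}} \frac{\partial \eta_{1i}}{\partial \gamma_t} \\
&= \sum_{i=1}^{n} \left[ \frac{\partial^2 \ell_1(\boldsymbol{\gamma})}{\partial \lambda_i^2} \frac{\partial \lambda_i}{\partial \eta_{1i}} \frac{\partial \eta_{1i}}{\partial \gamma_j} + \frac{\partial \ell_1(\boldsymbol{\gamma})}{\partial \lambda_i} \frac{\partial}{\partial \lambda_i}\left( \frac{\partial \lambda_i}{\partial \eta_{1i}} \frac{\partial \eta_{1i}}{\partial \gamma_j} \right) \right] \frac{\partial \lambda_i}{\partial \eta_{1i}} \frac{\partial \eta_{1i}}{\partial \gamma_t},
\end{aligned}
$$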
j, t ∈ {1, … , m}. Since E (𝜕𝓁1 (𝜸)∕𝜕𝜆i ) = 0, 𝜕𝜂1i ∕𝜕𝛾j = zij , and 𝜕𝜂1i ∕𝜕𝛾t = zit , it follows that
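$$
E\left( \frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \gamma_j \partial \gamma_t} \right)
= \sum_{i=1}^{n} E\left( \frac{\partial^2 \ell_1(\boldsymbol{\gamma})}{\partial \lambda_i^2} \right) \left( \frac{\partial \lambda_i}{\partial \eta_{1i}} \right)^{2} z_{ij} z_{it}.
$$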
The second derivative of 𝓁1 (𝜸) with respect to 𝜆i is
$$
\frac{\partial^2 \ell_1(\boldsymbol{\gamma})}{\partial \lambda_i^2}
= -\frac{I_{\{0,1\}}(y_i)}{\lambda_i^2} - \frac{1 - I_{\{0,1\}}(y_i)}{(1 - \lambda_i)^2}.
$$
Since $E\left(I_{\{0,1\}}(y_i)\right) = \lambda_i$, we have $E\left(\partial^2 \ell_1(\boldsymbol{\gamma})/\partial \lambda_i^2\right) = -1/[\lambda_i(1 - \lambda_i)]$.
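To illustrate how this expectation enters the γ-block of Fisher's information, the sketch below assumes a logit link for g1, so that ∂λ_i/∂η_1i = λ_i(1 − λ_i). The appendix leaves g1 generic, so the link, the function name `gamma_information_block`, and the toy design matrix are all assumptions made for illustration only.

```python
import math

def gamma_information_block(Z, gamma):
    """Return -E[d2 l / d gamma_j d gamma_t] as an m x m nested list.

    Assumes a logit link for lambda_i (illustrative choice), so that
    d lambda_i / d eta_1i = lambda_i * (1 - lambda_i).
    """
    m = len(gamma)
    info = [[0.0] * m for _ in range(m)]
    for z in Z:
        eta = sum(zj * gj for zj, gj in zip(z, gamma))
        lam = 1.0 / (1.0 + math.exp(-eta))           # inverse logit
        expected_d2 = -1.0 / (lam * (1.0 - lam))     # E[d2 l1 / d lambda_i^2]
        dlam_deta = lam * (1.0 - lam)                # logit-link derivative
        weight = -expected_d2 * dlam_deta ** 2       # reduces to lam * (1 - lam)
        for j in range(m):
            for t in range(m):
                info[j][t] += weight * z[j] * z[t]
    return info

# Toy design matrix with an intercept and one covariate (made-up values).
Z = [[1.0, 0.2], [1.0, -0.5], [1.0, 1.3]]
gamma = [-0.5, 1.0]
K_gamma = gamma_information_block(Z, gamma)
```

Under the logit assumption each observation contributes λ_i(1 − λ_i) z_ij z_it, the familiar logistic-regression weight; a different link would only change `dlam_deta`.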
Additionally, for j, t ∈ {1, … , u},
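$$
\begin{aligned}
\frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \pi_j \partial \pi_t}
&= \sum_{i=1}^{n} \frac{\partial}{\partial p_i}\left( \frac{\partial \ell_2(\boldsymbol{\pi})}{\partial p_i} \frac{\partial p_i}{\partial \eta_{2i}} \frac{\partial \eta_{2i}}{\partial \pi_j} \right) \frac{\partial p_i}{\partial \eta_{2i}} \frac{\partial \eta_{2i}}{\partial \pi_t} \\
&= \sum_{i=1}^{n} \left[ \frac{\partial^2 \ell_2(\boldsymbol{\pi})}{\partial p_i^2} \frac{\partial p_i}{\partial \eta_{2i}} \frac{\partial \eta_{2i}}{\partial \pi_j} + \frac{\partial \ell_2(\boldsymbol{\pi})}{\partial p_i} \frac{\partial}{\partial p_i}\left( \frac{\partial p_i}{\partial \eta_{2i}} \frac{\partial \eta_{2i}}{\partial \pi_j} \right) \right] \frac{\partial p_i}{\partial \eta_{2i}} \frac{\partial \eta_{2i}}{\partial \pi_t}.
\end{aligned}
$$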
Since E (𝜕𝓁2 (𝝅)∕𝜕pi ) = 0, 𝜕𝜂2i ∕𝜕𝜋j = wij , and 𝜕𝜂2i ∕𝜕𝜋t = wit , it follows that
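$$
E\left( \frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \pi_j \partial \pi_t} \right)
= \sum_{i=1}^{n} E\left( \frac{\partial^2 \ell_2(\boldsymbol{\pi})}{\partial p_i^2} \right) \left( \frac{\partial p_i}{\partial \eta_{2i}} \right)^{2} w_{ij} w_{it}.
$$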
The second derivative of 𝓁2 (𝝅) with respect to pi is
$$
\frac{\partial^2 \ell_2(\boldsymbol{\pi})}{\partial p_i^2}
= -\frac{I_{\{1\}}(y_i)}{p_i^2} - \frac{I_{\{0\}}(y_i)}{(1 - p_i)^2}.
$$
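Since $E\left(I_{\{1\}}(y_i)\right) = \lambda_i p_i$ and $E\left(I_{\{0\}}(y_i)\right) = \lambda_i(1 - p_i)$, we have
$$
E\left( \frac{\partial^2 \ell_2(\boldsymbol{\pi})}{\partial p_i^2} \right) = -\frac{\lambda_i}{p_i(1 - p_i)}.
$$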
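The substitution behind this expectation can be written out explicitly. The display below only combines the two indicator means stated above and introduces no new assumptions.

```latex
\begin{align*}
E\left(\frac{\partial^2 \ell_2(\boldsymbol{\pi})}{\partial p_i^2}\right)
&= -\frac{\lambda_i p_i}{p_i^2} - \frac{\lambda_i (1 - p_i)}{(1 - p_i)^2} \\
&= -\lambda_i\left[\frac{1}{p_i} + \frac{1}{1 - p_i}\right]
 = -\frac{\lambda_i}{p_i (1 - p_i)}.
\end{align*}
```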
For j, t ∈ {1, … , r}, we obtain

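$$
\begin{aligned}
\frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \beta_j \partial \beta_t}
&= \sum_{i=1}^{n} \frac{\partial}{\partial \omega_i}\left( \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i} \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_j} \right) \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_t} \\
&= \sum_{i=1}^{n} \left[ \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i^2} \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_j} + \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i} \frac{\partial}{\partial \omega_i}\left( \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_j} \right) \right] \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_t}.
\end{aligned}
$$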
It is possible to show that E (𝜕𝓁3 (𝜷, 𝝇)∕𝜕𝜔i ) = 0. Hence,

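$$
E\left( \frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \beta_j \partial \beta_t} \right)
= \sum_{i=1}^{n} E\left( \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i^2} \right) \left( \frac{\partial \omega_i}{\partial \eta_{3i}} \right)^{2} x_{ij} x_{it}.
$$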
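Let, for k = 1, 2,
$$
\nu_i^{(k)} = \frac{\omega_i^{k\phi_i - 2}}{\left(1 - \omega_i^{\phi_i}\right)^{k} \left[\ln\left(1 - \omega_i^{\phi_i}\right)\right]^{k}}
$$
and $A_i = \phi_i \nu_i^{(2)}\left[1 + \ln\left(1 - \omega_i^{\phi_i}\right)\right] + (\phi_i - 1)\nu_i^{(1)}$. Using some of the results in Bayer et al. (2017),
we obtain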
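$$
\frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i^2}
= \left[1 - I_{\{0,1\}}(y_i)\right] \left\{ \phi_i A_i + \phi_i \delta_i \left[ A_i + \phi_i \nu_i^{(2)} \right] \ln\left(1 - y_i^{\phi_i}\right) \right\}.
$$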
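It follows from lemma 1 in Bayer et al. (2017) that $E\left[\ln\left(1 - y_i^{\phi_i}\right)\right] = -1/\delta_i$, and thus
$$
E\left( \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i^2} \right) = (\lambda_i - 1)\phi_i^2 \nu_i^{(2)}.
$$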
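The closed form for the second derivative in ω_i can be checked numerically. The sketch below assumes the median parameterization δ_i = ln(0.5)/ln(1 − ω_i^{φ_i}) of the Kumaraswamy density from the paper's Section 2, considers a single observation with y in (0, 1), and compares the expression built from A_i and ν_i^{(2)} with a central finite difference of the log-density. All function names and the numerical values are hypothetical choices for illustration.

```python
import math

def kuma_terms(omega, phi):
    """Return delta, nu1, nu2 and A for the median-parameterized Kumaraswamy law."""
    log_term = math.log(1.0 - omega ** phi)           # ln(1 - omega^phi) < 0
    delta = math.log(0.5) / log_term
    nu1 = omega ** (phi - 2.0) / ((1.0 - omega ** phi) * log_term)
    nu2 = omega ** (2.0 * phi - 2.0) / ((1.0 - omega ** phi) * log_term) ** 2
    A = phi * nu2 * (1.0 + log_term) + (phi - 1.0) * nu1
    return delta, nu1, nu2, A

def log_density(y, omega, phi):
    """Kumaraswamy log-density with delta = ln(0.5) / ln(1 - omega^phi)."""
    delta = math.log(0.5) / math.log(1.0 - omega ** phi)
    return (math.log(phi) + math.log(delta) + (phi - 1.0) * math.log(y)
            + (delta - 1.0) * math.log(1.0 - y ** phi))

def d2_omega_closed_form(y, omega, phi):
    """Second derivative in omega for an observation with y in (0, 1)."""
    delta, _, nu2, A = kuma_terms(omega, phi)
    return phi * A + phi * delta * (A + phi * nu2) * math.log(1.0 - y ** phi)

def d2_omega_finite_difference(y, omega, phi, h=1e-4):
    """Central second difference of the log-density in omega."""
    return (log_density(y, omega + h, phi) - 2.0 * log_density(y, omega, phi)
            + log_density(y, omega - h, phi)) / h ** 2

y, omega, phi = 0.3, 0.4, 2.0                         # arbitrary test point
closed = d2_omega_closed_form(y, omega, phi)
numeric = d2_omega_finite_difference(y, omega, phi)

# Replacing ln(1 - y^phi) by its mean -1/delta leaves -phi^2 * nu2, which is
# the stated expectation up to the factor (1 - lambda_i).
delta, _, nu2, A = kuma_terms(omega, phi)
expected = phi * A + phi * delta * (A + phi * nu2) * (-1.0 / delta)
```

The last lines show why A_i drops out of the expectation: the two terms involving A_i cancel once ln(1 − y_i^{φ_i}) is replaced by −1/δ_i.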
For j, t ∈ {1, … , s}, we obtain

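$$
\begin{aligned}
\frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \varsigma_j \partial \varsigma_t}
&= \sum_{i=1}^{n} \frac{\partial}{\partial \phi_i}\left( \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i} \frac{\partial \phi_i}{\partial \eta_{4i}} \frac{\partial \eta_{4i}}{\partial \varsigma_j} \right) \frac{\partial \phi_i}{\partial \eta_{4i}} \frac{\partial \eta_{4i}}{\partial \varsigma_t} \\
&= \sum_{i=1}^{n} \left[ \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i^2} \frac{\partial \phi_i}{\partial \eta_{4i}} \frac{\partial \eta_{4i}}{\partial \varsigma_j} + \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i} \frac{\partial}{\partial \phi_i}\left( \frac{\partial \phi_i}{\partial \eta_{4i}} \frac{\partial \eta_{4i}}{\partial \varsigma_j} \right) \right] \frac{\partial \phi_i}{\partial \eta_{4i}} \frac{\partial \eta_{4i}}{\partial \varsigma_t}.
\end{aligned}
$$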
Given that 𝜕𝜂4i ∕𝜕𝜍j = qij , 𝜕𝜂4i ∕𝜕𝜍t = qit , and E (𝜕𝓁3 (𝜷, 𝝇)∕𝜕𝜙i ) = 0 (see Lemma 1, which is
stated and proved below), we have
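$$
E\left( \frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \varsigma_j \partial \varsigma_t} \right)
= \sum_{i=1}^{n} E\left( \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i^2} \right) \left( \frac{\partial \phi_i}{\partial \eta_{4i}} \right)^{2} q_{ij} q_{it}.
$$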
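Using some results from Bayer et al. (2017), it is possible to establish that¹
$$
\begin{aligned}
E\left( \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i^2} \right)
= (\lambda_i - 1) \Bigg( & \frac{1}{\phi_i^2} + \omega_i^2 \nu_i^{(2)} \left[\ln(\omega_i)\right]^2 + 2\delta_i \omega_i^2 \ln(\omega_i) \nu_i^{(1)} \left( \frac{1 - \psi(\delta_i + 1) - \kappa}{(\delta_i - 1)\phi_i} \right) \\
& + \frac{\delta_i \left\{ \psi(\delta_i)\left[\psi(\delta_i) + 2(\kappa - 1)\right] - \psi'(\delta_i) + k_0 \right\}}{(\delta_i - 2)\phi_i^2} \Bigg) = s_i,
\end{aligned}
$$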
where 𝜓 ∶ R+ → R is the digamma function defined as 𝜓(z) = d ln (Γ(z)) ∕dz, 𝜓 ′ (z) = d𝜓(z)∕dz
is the trigamma function, 𝜅 ≈ 0.5772156649 is the Euler–Mascheroni constant and k0 = 𝛾 2 ∕6 +
𝜅 2 − 2𝜅.
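It can also be shown that
$$
\begin{aligned}
\frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \beta_j \partial \varsigma_t}
&= \sum_{i=1}^{n} \frac{\partial}{\partial \phi_i}\left( \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i} \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_j} \right) \frac{\partial \phi_i}{\partial \eta_{4i}} \frac{\partial \eta_{4i}}{\partial \varsigma_t} \\
&= \sum_{i=1}^{n} \left[ \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i \partial \phi_i} \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_j} + \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i} \frac{\partial}{\partial \phi_i}\left( \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \eta_{3i}}{\partial \beta_j} \right) \right] \frac{\partial \phi_i}{\partial \eta_{4i}} \frac{\partial \eta_{4i}}{\partial \varsigma_t},
\end{aligned}
$$
j ∈ {1, … , r} and t ∈ {1, … , s}. We note that $E\left(\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})/\partial \omega_i\right) = 0$, $\partial \eta_{3i}/\partial \beta_j = x_{ij}$ and $\partial \eta_{4i}/\partial \varsigma_t = q_{it}$. Thus,
$$
E\left( \frac{\partial^2 \ell(\boldsymbol{\theta}; y)}{\partial \beta_j \partial \varsigma_t} \right)
= \sum_{i=1}^{n} E\left( \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i \partial \phi_i} \right) \frac{\partial \omega_i}{\partial \eta_{3i}} \frac{\partial \phi_i}{\partial \eta_{4i}} x_{ij} q_{it}.
$$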
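We also obtain
$$
E\left( \frac{\partial^2 \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \omega_i \partial \phi_i} \right)
= (\lambda_i - 1)\, \phi_i \omega_i \left[ \ln(\omega_i) \nu_i^{(2)} + \delta_i \nu_i^{(1)} \left( \frac{1 - \psi(\delta_i + 1) - \kappa}{(\delta_i - 1)\phi_i} \right) \right] = m_i.
$$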
¹ There is a typographical error in the proof of the second expectation in lemma 2 of Bayer et al. (2017). We used
the correct result.
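Lemma 1. Let $y_1, \ldots, y_n$ be independent random variables such that $y_i \sim \mathrm{KI}(\lambda_i, p_i, \omega_i, \phi_i)$,
$i = 1, \ldots, n$. Then,
$$
E\left( \partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})/\partial \phi_i \right) = 0.
$$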
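Proof. We have
$$
E\left( \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i} \right)
= (1 - \lambda_i) \left\{ \frac{1}{\phi_i} + E[\ln(y_i)] + E(c_i)\, \omega_i \ln(\omega_i) - (\delta_i - 1) E\left[ \frac{y_i^{\phi_i} \ln(y_i)}{1 - y_i^{\phi_i}} \right] \right\}.
$$
From lemma 1 in Bayer et al. (2017), we know that $E\left[\ln\left(1 - y_i^{\phi_i}\right)\right] = -1/\delta_i$, and it is then easy
to note that $E(c_i) = 0$. Additionally, lemma 2 in Bayer et al. (2017) states that
$$
E\left( \frac{y_i^{\phi_i} \ln(y_i)}{1 - y_i^{\phi_i}} \right) = \frac{1 - \psi(\delta_i + 1) - \kappa}{(\delta_i - 1)\phi_i}.
$$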
Using Lemma 2, which is stated and proved below, it follows that

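$$
\begin{aligned}
E\left( \frac{\partial \ell_3(\boldsymbol{\beta}, \boldsymbol{\varsigma})}{\partial \phi_i} \right)
&= (1 - \lambda_i) \left[ \frac{1}{\phi_i} - \frac{\psi(\delta_i + 1) + \kappa}{\phi_i} - (\delta_i - 1) \frac{1 - \psi(\delta_i + 1) - \kappa}{(\delta_i - 1)\phi_i} \right] \\
&= (1 - \lambda_i) \left[ \frac{1}{\phi_i} - \frac{\psi(\delta_i + 1) + \kappa}{\phi_i} - \frac{1 - \psi(\delta_i + 1) - \kappa}{\phi_i} \right] \\
&= 0.
\end{aligned}
$$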
▪
Lemma 2. Let Y be Kumaraswamy-distributed with parameters 𝜔 and 𝜙. Then,
$$
E(\ln(Y)) = -\frac{\psi(\delta + 1) + \kappa}{\phi}.
$$
Proof. We have
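$$
\begin{aligned}
E(\ln(Y)) &= \int_0^1 \ln(y)\, \phi \delta\, y^{\phi - 1} (1 - y^{\phi})^{\delta - 1}\, dy \\
&= \phi \delta \int_0^1 \ln(y)\, y^{\phi - 1} (1 - y^{\phi})^{\delta - 1}\, dy.
\end{aligned}
$$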
By expanding $(1 - y^{\phi})^{\delta - 1}$ into its binomial series, we have
$$
(1 - y^{\phi})^{\delta - 1} = \sum_{k=0}^{\infty} (-y^{\phi})^{k} \binom{\delta - 1}{k}.
$$
Thus,
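$$
\begin{aligned}
E(\ln(Y)) &= \phi \delta \int_0^1 \ln(y)\, y^{\phi - 1} \left[ \sum_{k=0}^{\infty} (-1)^k (y^{\phi})^k \binom{\delta - 1}{k} \right] dy \\
&= \phi \delta \sum_{k=0}^{\infty} (-1)^k \binom{\delta - 1}{k} \int_0^1 y^{\phi(k+1) - 1} \ln(y)\, dy \\
&= \phi \delta \sum_{k=0}^{\infty} (-1)^k \binom{\delta - 1}{k} \left[ \frac{-1}{\phi^2 (k+1)^2} \right] \\
&= -\frac{\delta}{\phi} \sum_{k=0}^{\infty} \frac{(-1)^k}{(k+1)^2} \binom{\delta - 1}{k}.
\end{aligned}
$$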
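The inner integral used in the third equality follows from a single integration by parts. Writing a = φ(k + 1) > 0 (a shorthand introduced here only for this step), the boundary term vanishes at both endpoints:

```latex
\begin{align*}
\int_0^1 y^{a-1} \ln(y)\, dy
&= \left[ \frac{y^{a}}{a} \ln(y) \right]_0^1 - \frac{1}{a} \int_0^1 y^{a-1}\, dy \\
&= 0 - \frac{1}{a^{2}}
 = -\frac{1}{\phi^{2}(k+1)^{2}}.
\end{align*}
```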
By letting k = i − 1, we obtain
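$$
\sum_{k=0}^{\infty} \frac{(-1)^k}{(k+1)^2} \binom{\delta - 1}{k}
= \sum_{i=1}^{\infty} \frac{(-1)^{i-1}}{i^2} \binom{\delta - 1}{i - 1}
= \sum_{i=1}^{\infty} \frac{(-1)^{i-1}}{i} \left[ \frac{1}{i} \frac{i}{\delta} \binom{\delta}{i} \right]
= -\frac{1}{\delta} \sum_{i=1}^{\infty} \frac{(-1)^{i}}{i} \binom{\delta}{i}.
$$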
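The third equality rests on a standard identity for generalized binomial coefficients, spelled out here for a real δ and an integer i ≥ 1:

```latex
\begin{align*}
\binom{\delta - 1}{i - 1}
&= \frac{(\delta - 1)(\delta - 2) \cdots (\delta - i + 1)}{(i - 1)!} \\
&= \frac{i}{\delta} \cdot \frac{\delta (\delta - 1) \cdots (\delta - i + 1)}{i!}
 = \frac{i}{\delta} \binom{\delta}{i}.
\end{align*}
```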
Using Newton’s expansion for the digamma function (formula 8.363.8 in Gradshteyn &
Ryzhik, 2007, with n = 0), that is,
$$
\psi(s + 1) + \kappa = -\sum_{k=1}^{\infty} \frac{(-1)^k}{k} \binom{s}{k},
$$
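For a positive integer s the series terminates at k = s, and ψ(s + 1) + κ equals the harmonic number H_s, so the expansion can be checked exactly in rational arithmetic. The sketch below covers only that integer special case; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def newton_sum(s):
    """-sum_{k=1}^{s} (-1)^k / k * C(s, k); terms with k > s vanish for integer s."""
    return -sum(Fraction((-1) ** k * comb(s, k), k) for k in range(1, s + 1))

def harmonic(s):
    """H_s = 1 + 1/2 + ... + 1/s, which equals psi(s + 1) + kappa for integer s >= 1."""
    return sum(Fraction(1, k) for k in range(1, s + 1))

# Exact comparison for the first few integers.
checks = {s: newton_sum(s) == harmonic(s) for s in range(1, 11)}
```

For non-integer s the series is infinite and converges slowly, so a numerical check would need a floating-point digamma routine and a truncation rule.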
we arrive at
$$
E(\ln(Y)) = -\frac{\delta}{\phi} \left[ \frac{\psi(\delta + 1) + \kappa}{\delta} \right] = -\frac{1}{\phi} \left[ \psi(\delta + 1) + \kappa \right].
$$
▪
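As a numerical sanity check of Lemma 2, the sketch below compares the closed form with a midpoint-rule approximation of E(ln Y). It assumes the Kumaraswamy distribution function F(y) = 1 − (1 − y^φ)^δ, which is implied by the density used in the proof, and it uses a small pure-Python digamma routine (upward recurrence plus an asymptotic series) written for this illustration. All function names and parameter values are hypothetical.

```python
import math

EULER_MASCHERONI = 0.5772156649015329

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))

def mean_log_closed_form(delta, phi):
    """Lemma 2: E[ln Y] = -(psi(delta + 1) + kappa) / phi."""
    return -(digamma(delta + 1.0) + EULER_MASCHERONI) / phi

def mean_log_quadrature(delta, phi, n=200_000):
    """Midpoint rule over u in (0, 1), with Y = (1 - (1 - u)^(1/delta))^(1/phi)."""
    total = 0.0
    for i in range(n):
        u = (i + 0.5) / n
        y_phi = -math.expm1(math.log1p(-u) / delta)   # 1 - (1 - u)^(1/delta)
        total += math.log(y_phi) / phi
    return total / n

delta, phi = 3.5, 2.0                                 # arbitrary test values
closed = mean_log_closed_form(delta, phi)
numeric = mean_log_quadrature(delta, phi)
```

Integrating over the quantile scale avoids sampling noise; `expm1` and `log1p` keep the computation stable near u = 0, where ln Y has an integrable logarithmic singularity.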

