

Non-parametric estimation of Gini index with censored observations

Article  in  Statistics & Probability Letters · April 2021


DOI: 10.1016/j.spl.2021.109113



NON-PARAMETRIC ESTIMATION OF GINI INDEX WITH
RIGHT CENSORED OBSERVATIONS

Sudheesh K Kattumannil∗†, Isha Dewan∗∗ and Sreelakshmi N∗∗∗

∗ Indian Statistical Institute, Chennai, India,
∗∗ Indian Statistical Institute, New Delhi, India,
∗∗∗ Indian Institute of Technology, Chennai, India.

Abstract. We obtain a simple non-parametric estimator of the Gini index when the sample contains right censored observations. Asymptotic properties of the proposed estimator are studied. Finite sample performance of the estimator is evaluated through a Monte Carlo simulation study.
Keywords: Gini mean difference; Inequality measure; Lorenz curve; Right censoring; U-statistics.

1. Introduction

Income inequality measures focus on economic inequality among the individuals in an economy, and the most celebrated measures are the Lorenz curve and the Gini index. In 1912, Corrado Gini introduced Gini's mean difference (GMD) as a measure of variability. For more than a century, the GMD and the measures derived from it (such as the Gini index) have played a prominent role in studying inequality of income among individuals. Interested readers may refer to the works by Davidson (2009), Peng (2011), Ceriani and Verme (2012), Langel and Tille (2013), Yitzhaki and Schechtman (2013), Wang et al. (2016) and Lv et al. (2017). Ceriani and Verme (2012) have given a comprehensive overview of the origin and development of the Gini index and provided different expressions for it. Langel and Tille (2013) carried out an exhaustive literature survey of the topic and showed that the same results, as well as the same errors, have been republished several times, often with a clear lack of reference to previous works. Yitzhaki and Schechtman (2013) discussed the developments of the GMD, the concentration ratio, and Gini parameters of correlation and regression coefficients.
The Gini index finds applications not only in economics, but also in survival analysis. Tse (2006) developed non-parametric estimators of the Lorenz curve and the Gini index when observations are left truncated and right censored. Bonetti, Gigliarano and Muliere (2009) used a restricted estimator (under right censoring) of a version of the Gini index to compare the distributions of survival times across groups of patients in clinical studies. They pointed out that their test is a competitor to the classical log-rank test, the Wilcoxon test and the Gray and Tsiatis test [see Harrington and Fleming (1982) and Gray and Tsiatis (1989)] for testing the equality of the distributions of survival times between groups of patients under the assumption of a positive cure rate. Jenkins, Burkhauser, Feng and Larrimore (2011) used a multiple imputation method for the estimation of income inequality measures in the right censored setup. Recently, Lv, Zhang and Ren (2017) proposed an estimator of the Gini index when right censored observations are present in the sample. The estimator obtained by Lv et al. (2017) is computationally complex and has a complicated variance expression. Motivated by this, we obtain a simple estimator of the Gini index when the sample contains right censored observations. The variance of the proposed estimator has a simple expression which can be estimated consistently.
The rest of the article is organised as follows. In Section 2, we review estimators of the Gini index for complete data. In Section 3, we obtain a simple non-parametric estimator of the Gini index when right censored observations are present in the sample. We prove that the estimator is consistent and asymptotically normal. In Section 4, we report the results of a Monte Carlo simulation study conducted to evaluate the finite sample performance of the proposed estimator. Coverage probability and average width of the confidence intervals for the Gini index are also given. Concluding remarks are given in Section 5.

† Corresponding E-mail: skkattu@isichennai.res.in

2. The Gini index for complete data

In the past two decades, several estimators have been proposed for the Gini index. This was achieved by expressing the Gini index in different forms and then proposing a natural plug-in estimator or an estimator based on U-statistics. The main objective of these studies was to obtain a simple, reliable estimator with small bias and standard error. Davidson (2009) discussed several aspects of this problem rigorously. We use the bias corrected version of Davidson's estimator to construct the estimator of the Gini index when the sample contains right censored observations. Hence, we first discuss Davidson's estimator of the Gini index for complete data.
Let $X$ be a non-negative continuous random variable with distribution function $F$. Assume that $X$ has finite mean $\mu$. Let $X_1$ and $X_2$ be two independent random variables with the same distribution function $F$. The GMD is defined as
$$\mathrm{GMD} = E|X_1 - X_2|.$$

The Gini index can be written in terms of GMD as

$$G = \frac{\mathrm{GMD}}{2\mu}. \tag{1}$$

Davidson (2009) expressed $G$ as

$$G = \frac{\int_0^\infty 2xF(x)\,dF(x)}{\mu} - 1, \tag{2}$$

and obtained an estimator of the Gini index given by

$$\widetilde{G} = \frac{2}{n^2\bar{X}} \sum_{i=1}^{n} X_{(i)}\left(i - \frac{1}{2}\right) - 1,$$

where $X_{(i)}$, $i = 1, \ldots, n$, are the order statistics based on a random sample $X_1, \ldots, X_n$ from $F$ and $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. By simple algebraic manipulation, the above estimator can be rewritten as

$$\widetilde{G} = \frac{\sum_{i=1}^{n} (2i - n - 1)X_{(i)}}{n\sum_{i=1}^{n} X_i}. \tag{3}$$

As pointed out earlier, the main objective of Davidson's work was to obtain a simple, reliable estimator with smaller bias and standard error than its competitors. Accordingly, Davidson (2009) proposed a bias corrected version of $\widetilde{G}$, which is given by

$$\widehat{G} = \frac{n\widetilde{G}}{n-1} = \frac{\sum_{i=1}^{n}(2i - n - 1)X_{(i)}}{(n-1)\sum_{i=1}^{n} X_i}. \tag{4}$$

Next we show that the bias corrected estimator $\widehat{G}$ given in Eq. (4) is the ratio of two U-statistics. Consider

$$
\begin{aligned}
\widehat{G} &= \frac{\sum_{i=1}^{n}(2i-n-1)X_{(i)}}{(n-1)\sum_{i=1}^{n}X_i} \\
&= \frac{2\sum_{i=1}^{n}\big(2(i-1)-(n-1)\big)X_{(i)}}{2n(n-1)\,\frac{1}{n}\sum_{i=1}^{n}X_i} \\
&= \frac{2\sum_{i=1}^{n}\big(2(i-1)-(n-1)\big)X_{(i)}}{n(n-1)\,2\bar{X}} \\
&= \frac{\frac{2}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n}\big(2\max(X_i,X_j)-X_i-X_j\big)}{2\bar{X}} \\
&= \frac{\widehat{\Delta}}{2\bar{X}},
\end{aligned} \tag{5}
$$

where $\widehat{\Delta}$ is a U-statistic with kernel $h(X_i, X_j) = 2\max(X_i, X_j) - X_i - X_j$. Hence, from Eq. (5), $\widehat{G}$ is a ratio of two U-statistics.
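The equivalence of Eqs. (3), (4) and (5) can be checked numerically. The following is a minimal Python sketch, assuming NumPy is available; the function names `gini_plugin`, `gini_bias_corrected` and `gini_ustat` are illustrative choices and do not come from the paper.

```python
import numpy as np

def gini_plugin(x):
    """Eq. (3): sum((2i - n - 1) * X_(i)) / (n * sum(X))."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    i = np.arange(1, n + 1)
    return np.sum((2 * i - n - 1) * xs) / (n * xs.sum())

def gini_bias_corrected(x):
    """Eq. (4): n / (n - 1) times the plug-in value."""
    n = len(x)
    return n / (n - 1) * gini_plugin(x)

def gini_ustat(x):
    """Eq. (5): U-statistic with kernel 2*max(a, b) - a - b, divided by 2 * mean."""
    x = np.asarray(x, dtype=float)
    n = x.size
    iu, ju = np.triu_indices(n, k=1)               # every unordered pair once
    h = 2 * np.maximum(x[iu], x[ju]) - x[iu] - x[ju]
    delta_hat = 2.0 / (n * (n - 1)) * h.sum()      # average of the kernel over pairs
    return delta_hat / (2 * x.mean())

rng = np.random.default_rng(0)
sample = rng.exponential(size=500)                 # standard exponential, true Gini index 0.5
g4 = gini_bias_corrected(sample)
g5 = gini_ustat(sample)
print(g4, g5)                                      # the two forms agree up to rounding
```

The order-statistic form costs a sort, whereas the pairwise form is quadratic in $n$; the latter is shown only because it is the form that carries over to the censored case.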
In view of expression (5), we can study the asymptotic properties of $\widehat{G}$ using the limit theorems for U-statistics (Lee, 1990) and Slutsky's theorem. The results are stated for completeness.

Theorem 1. As $n \to \infty$, $\widehat{G}$ converges in probability to $G$.

Theorem 2. As $n \to \infty$, the distribution of $\sqrt{n}(\widehat{G} - G)$ is Gaussian with mean zero and variance $\sigma_1^2/\mu^2$, where $\bar{F} = 1 - F$ and

$$\sigma_1^2 = V\left(X\big(2\bar{F}(X) - 1\big) + 2\int_0^X y\,dF(y)\right). \tag{6}$$
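As a brief check on Eq. (6), the first-order projection of the kernel can be worked out directly. The sketch below reads the integral in Eq. (6) as $\int_0^X y\,dF(y)$ and writes $h_1(x) = E\,h(x, X_2)$, a notation the paper introduces only in Section 3; both are assumptions made for this illustration.

$$
\begin{aligned}
h_1(x) = E|x - X_2| &= \int_0^x (x-y)\,dF(y) + \int_x^\infty (y-x)\,dF(y) \\
&= x\big(2F(x)-1\big) + \mu - 2\int_0^x y\,dF(y).
\end{aligned}
$$

Since $2\bar{F} - 1 = -(2F - 1)$, the quantity inside the variance in Eq. (6) equals $\mu - h_1(X)$, so $\sigma_1^2 = V(h_1(X))$, which is the variance of the first-order projection that appears in the central limit theorem for U-statistics.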

3. Gini index for right censored data

In this section, we propose a simple estimator for the Gini index in the presence of right censored observations. Suppose we have randomly right-censored observations where the censoring times are independent of the lifetimes. The observed data consist of $n$ independent and identical copies of $(Y, \delta)$, with $Y = \min(X, C)$, where $C$ is the censoring random variable and $\delta = I(X \le C)$ is the censoring indicator. Observe that $\delta_i = 1$ means that the $i$-th object is not censored, whereas $\delta_i = 0$ indicates that the $i$-th object is censored by $C_i$ on the right. We are interested in finding an estimator of the Gini index based on the $n$ independent and identically distributed observations $\{(Y_i, \delta_i), 1 \le i \le n\}$. We obtain an estimator of the Gini index in the right censored case by modifying the estimator given in expression (5). Let $\bar{K}$ denote the survival function of $C$. We use the inverse probability of censoring weighted (IPCW) approach to find an estimator of $G$. Following Datta et al. (2010), a version of $\widehat{\Delta}$ appropriate for censored data is given by

$$\widehat{\Delta}_c = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1,\,j<i}^{n} \frac{h(Y_i, Y_j)\,\delta_i\delta_j}{\widehat{K}(Y_i-)\,\widehat{K}(Y_j-)}, \tag{7}$$

where $h(Y_1, Y_2) = 2\max(Y_1, Y_2) - Y_1 - Y_2$ and $\widehat{K}(t)$ is the Kaplan-Meier estimator of $\bar{K}$. Similarly, an estimator of $\mu$ is given by
$$\bar{X}_c = \frac{1}{n} \sum_{i=1}^{n} \frac{Y_i\delta_i}{\widehat{K}(Y_i-)}. \tag{8}$$

Hence an estimator of the Gini index with right censored observations is given by

$$\widehat{G}_c = \frac{\widehat{\Delta}_c}{2\bar{X}_c}. \tag{9}$$
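Eqs. (7)-(9) translate into a short computation. The sketch below is one possible Python rendering, assuming NumPy, no tied observation times, and a hand-written Kaplan-Meier step for the censoring distribution; the function names are hypothetical, and this is not the authors' R program.

```python
import numpy as np

def censoring_km_left_limit(y, delta):
    """Kaplan-Meier estimate of K(t) = P(C > t) at t = Y_i- (assumes no ties)."""
    y = np.asarray(y, dtype=float)
    cens = 1 - np.asarray(delta, dtype=int)        # censoring (delta = 0) is the event here
    order = np.argsort(y, kind="stable")
    cs = cens[order]
    n = y.size
    at_risk = n - np.arange(n)                     # number of Y_j >= k-th smallest Y
    surv = np.cumprod(1.0 - cs / at_risk)          # K_hat at each ordered time
    surv_left = np.concatenate(([1.0], surv[:-1])) # left limit K_hat(t-)
    out = np.empty(n)
    out[order] = surv_left                         # back to the original ordering
    return out

def gini_censored(y, delta):
    """IPCW estimator: Delta_c (Eq. 7) over twice the weighted mean (Eq. 8), as in Eq. (9)."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = y.size
    w = delta / censoring_km_left_limit(y, delta)  # weights delta_i / K_hat(Y_i-)
    iu, ju = np.triu_indices(n, k=1)
    h = 2 * np.maximum(y[iu], y[ju]) - y[iu] - y[ju]
    delta_c = 2.0 / (n * (n - 1)) * np.sum(h * w[iu] * w[ju])
    mean_c = np.mean(y * w)
    return delta_c / (2 * mean_c)

rng = np.random.default_rng(2)
n = 1000
x = rng.exponential(size=n)                        # lifetimes, true Gini index 0.5
c = rng.exponential(scale=4.0, size=n)             # censoring times with rate 0.25
y = np.minimum(x, c)
d = (x <= c).astype(int)
g_c = gini_censored(y, d)
print(g_c, 1 - d.mean())                           # estimate and observed censoring fraction
```

With no censoring every weight equals one and the function reduces to the complete-data estimator of Eq. (5). The pairwise sum is quadratic in $n$, which is acceptable for the sample sizes considered in the paper.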
Now we study the asymptotic properties of $\widehat{G}_c$. In the next theorem, we prove the consistency of $\widehat{G}_c$.

Theorem 3. As $n \to \infty$, $\widehat{G}_c$ converges in probability to $G$.

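Proof: Consider

$$
\begin{aligned}
\widehat{\Delta}_c &= \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n} \frac{h(Y_i,Y_j)\,\delta_i\delta_j}{\widehat{K}(Y_i-)\,\widehat{K}(Y_j-)} \\
&= \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n} \frac{h(Y_i,Y_j)\,\delta_i\delta_j\,\big(\widehat{K}(Y_i-) - \bar{K}(Y_i)\big)\big(\widehat{K}(Y_j-) - \bar{K}(Y_j)\big)}{\widehat{K}(Y_i-)\,\widehat{K}(Y_j-)\,\bar{K}(Y_i)\,\bar{K}(Y_j)} \\
&\quad + \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n} \frac{h(Y_i,Y_j)\,\delta_i\delta_j}{\widehat{K}(Y_i-)\,\bar{K}(Y_j)}
 + \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n} \frac{h(Y_i,Y_j)\,\delta_i\delta_j}{\bar{K}(Y_i)\,\widehat{K}(Y_j-)} \\
&\quad - \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n} \frac{h(Y_i,Y_j)\,\delta_i\delta_j}{\bar{K}(Y_i)\,\bar{K}(Y_j)} \\
&= \widehat{\Delta}_{1c} + \widehat{\Delta}_{2c} + \widehat{\Delta}_{3c} - \widehat{\Delta}_{4c}.
\end{aligned} \tag{10}
$$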
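The split in Eq. (10) can be checked from an elementary identity. The shorthand $a = \widehat{K}(Y_i-)$, $b = \widehat{K}(Y_j-)$, $A = \bar{K}(Y_i)$ and $B = \bar{K}(Y_j)$ is introduced here only for illustration and is not the paper's notation.

$$
\frac{(a-A)(b-B)}{abAB} = \frac{1}{AB} - \frac{1}{Ab} - \frac{1}{aB} + \frac{1}{ab}
\quad\Longrightarrow\quad
\frac{1}{ab} = \frac{(a-A)(b-B)}{abAB} + \frac{1}{aB} + \frac{1}{Ab} - \frac{1}{AB}.
$$

Multiplying through by $h(Y_i, Y_j)\,\delta_i\delta_j$ and averaging over pairs gives the four terms $\widehat{\Delta}_{1c}$, $\widehat{\Delta}_{2c}$, $\widehat{\Delta}_{3c}$ and $\widehat{\Delta}_{4c}$ in that order, with the last one entering with a negative sign.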

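By Corollary 1.2 of Stute and Wang (1993) we have

$$\sup_t \left|\widehat{K}(t-) - \bar{K}(t)\right| = o_p(1). \tag{11}$$

Note that $\widehat{K}(t-)$ is a consistent estimator of the true probability $\bar{K}(t)$. Hence, as $n \to \infty$,

$$
\begin{aligned}
|\widehat{\Delta}_{1c}| &\le \sup_{Y_i}\left|\widehat{K}(Y_i-) - \bar{K}(Y_i)\right| \, \sup_{Y_j}\left|\widehat{K}(Y_j-) - \bar{K}(Y_j)\right| \, \left| \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n} \frac{h(Y_i,Y_j)\,\delta_i\delta_j}{\widehat{K}(Y_i-)\,\widehat{K}(Y_j-)\,\bar{K}(Y_i)\,\bar{K}(Y_j)} \right| \\
&= o_p(1)\,o_p(1)\,O_p(1) = o_p(1).
\end{aligned} \tag{12}
$$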

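Observe that the third term in Eq. (12) is a U-statistic with kernel $\frac{h(Y_i,Y_j)\,\delta_i\delta_j}{\bar{K}^2(Y_i)\,\bar{K}^2(Y_j)}$, hence it is $O_p(1)$ for large $n$. Along similar lines, we can show that

$$
\begin{aligned}
\widehat{\Delta}_{2c} &= \widehat{\Delta}_{4c} + \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n} \frac{h(Y_i,Y_j)\,\delta_i\delta_j\,\big(\bar{K}(Y_i) - \widehat{K}(Y_i-)\big)}{\bar{K}(Y_i)\,\widehat{K}(Y_i-)\,\bar{K}(Y_j)} \\
&= \widehat{\Delta}_{4c} + o_p(1).
\end{aligned} \tag{13}
$$

Also

$$
\begin{aligned}
\widehat{\Delta}_{3c} &= \widehat{\Delta}_{4c} + \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,\,j<i}^{n} \frac{h(Y_i,Y_j)\,\delta_i\delta_j\,\big(\bar{K}(Y_j) - \widehat{K}(Y_j-)\big)}{\bar{K}(Y_i)\,\widehat{K}(Y_j-)\,\bar{K}(Y_j)} \\
&= \widehat{\Delta}_{4c} + o_p(1).
\end{aligned} \tag{14}
$$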

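Substituting equations (12), (13) and (14) in (10), we have

$$\widehat{\Delta}_c = \widehat{\Delta}_{4c} + o_p(1).$$

Note that $\widehat{\Delta}_{4c}$ is a U-statistic with kernel $\frac{h(Y_i,Y_j)\,\delta_i\delta_j}{\bar{K}(Y_i)\,\bar{K}(Y_j)}$. Hence $\widehat{\Delta}_{4c}$ converges in probability to $\Delta = E\,h(X_1, X_2)$ (Lehmann, 1951). Accordingly, $\widehat{\Delta}_c$ converges in probability to $\Delta$. Using (11), it is easy to show that $\bar{X}_c$ converges in probability to $\mu$. The proof of the theorem follows in view of the representation given below:

$$\widehat{G}_c = \frac{\widehat{\Delta}_c}{\Delta} \cdot \frac{2\mu}{2\bar{X}_c} \cdot \frac{\Delta}{2\mu}.$$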

To derive the asymptotic distribution of $\widehat{\Delta}_c$, define $N_i^c(t) = I(Y_i \le t, \delta_i = 0)$ as the counting process corresponding to censoring for the $i$-th individual and $R_i(u) = I(Y_i \ge u)$. Let $\lambda_c(t)$ be the hazard rate of the censoring variable $C$. The martingale associated with the counting process $N_i^c(t)$ is given by

$$M_i^c(t) = N_i^c(t) - \int_0^t R_i(u)\,\lambda_c(u)\,du. \tag{15}$$
Let $H_c(x) = P(Y_1 \le x, \delta = 1)$, $y(t) = P(Y_1 > t)$ and

$$w(t) = \frac{1}{y(t)} \int \frac{h_1(x)}{\widehat{K}(x-)}\, I(x > t)\,dH_c(x), \tag{16}$$

where $h_1(y) = E(h(Y_1, Y_2) \mid Y_1 = y)$. The proof of the next theorem follows from Datta et al. (2010) with the choice of kernel

$$h(Y_1, Y_2) = 2\max(Y_1, Y_2) - Y_1 - Y_2.$$

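Theorem 4. Assume $E(h^2(Y_1, Y_2)) < \infty$, $\int \frac{h_1^2(x)}{\widehat{K}^2(x-)}\,dH_c(x) < \infty$ and $\int_0^\infty w^2(t)\,\lambda_c(t)\,dt < \infty$. As $n \to \infty$, the distribution of $\sqrt{n}(\widehat{\Delta}_c - \Delta)$ is Gaussian with mean zero and variance $4\sigma_{1c}^2$, where $\sigma_{1c}^2$ is given by

$$\sigma_{1c}^2 = \mathrm{Var}\left(\frac{h_1(X)\,\delta_1}{\widehat{K}(Y-)} + \int w(t)\,dM_1^c(t)\right). \tag{17}$$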
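Next, we find a consistent estimator of $\sigma_{1c}^2$. We can estimate $\sigma_{1c}^2$ by

$$\widehat{\sigma}_{1c}^2 = \frac{4}{n-1} \sum_{i=1}^{n} (V_i - \bar{V})^2, \tag{18}$$

where

$$V_i = \frac{\widehat{h}_1(X_i)\,\delta_i}{\widehat{K}(X_i)} + \widehat{w}(X_i)(1 - \delta_i) - \sum_{j=1}^{n} \frac{\widehat{w}(X_i)\, I(X_i > X_j)(1 - \delta_i)}{\sum_{i=1}^{n} I(X_i > X_j)},$$

$$\bar{V} = \frac{1}{n}\sum_{i=1}^{n} V_i, \qquad \widehat{h}_1(X) = \frac{1}{n}\sum_{i=1}^{n} \frac{h(X, Y_i)\,\delta_i}{\widehat{K}(Y_i-)}, \qquad Y(t) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i > t)$$

and

$$\widehat{w}(t) = \frac{1}{Y(t)} \sum_{i=1}^{n} \frac{\widehat{h}_1(X_i)\,\delta_i}{\widehat{K}(X_i)}\, I(X_i > t).$$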

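Corollary 1. Under the assumptions of Theorem 4, as $n \to \infty$, the asymptotic distribution of $\sqrt{n}(\widehat{G}_c - G)$ is Gaussian with mean zero and variance $\sigma_c^2 = \sigma_{1c}^2/\mu^2$. The estimated variance is $\widehat{\sigma}_c^2 = \widehat{\sigma}_{1c}^2/\bar{X}_c^2$, where $\bar{X}_c = \frac{1}{n}\sum_{i=1}^{n} \frac{Y_i\delta_i}{\widehat{K}(Y_i-)}$ as in Eq. (8) and $\widehat{\sigma}_{1c}^2$ is specified in Eq. (18).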

Remark 1. We can construct a confidence interval for the unknown Gini index $G$ by using the estimator $\widehat{G}_c$ and the estimate of its variance given by $\widehat{\sigma}_c^2$. When the sample contains right censored observations, an asymptotic $100(1 - \alpha)\%$ confidence interval for $G$ is given by

$$\left(\widehat{G}_c - Z_{\alpha/2}\,\widehat{\sigma}_c,\; \widehat{G}_c + Z_{\alpha/2}\,\widehat{\sigma}_c\right),$$

where $Z_\alpha$ is the upper $\alpha$ percentile point of the standard normal distribution.
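The interval in Remark 1 is a standard normal-theory interval. The following minimal Python sketch uses only the standard library; the function name and the argument `se` are illustrative. Here `se` stands for the standard deviation that multiplies $Z_{\alpha/2}$ in the interval. If $\widehat{\sigma}_c^2$ is computed as an estimate of the variance of $\sqrt{n}(\widehat{G}_c - G)$, passing $\widehat{\sigma}_c/\sqrt{n}$ would be the usual convention; that scaling is an assumption of this sketch rather than something stated in the text.

```python
from statistics import NormalDist

def gini_confidence_interval(g_hat, se, alpha=0.05):
    """Symmetric normal-theory interval g_hat +/- z_{alpha/2} * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    return g_hat - z * se, g_hat + z * se

# Hypothetical inputs, chosen only to show the call.
lower, upper = gini_confidence_interval(g_hat=0.5, se=0.05)
print(lower, upper)
```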

We evaluate the performance of the proposed confidence interval in terms of its coverage probability and average width. The results of the Monte Carlo simulation study are reported in Section 4.

4. Simulation study

In this section, we carry out a Monte Carlo simulation study to evaluate the finite sample performance of $\widehat{G}_c$ given in Eq. (9). The simulation is carried out in R and is repeated ten thousand times. We consider the situation in which approximately 20% of the observations are censored. We compare the bias and MSE of $\widehat{G}_c$ with those of the estimators proposed by Bonetti et al. (2009) and Lv et al. (2017). Here we use the unrestricted version of the estimator of the Gini index given by Bonetti et al. (2009). We also evaluate the coverage probability and the average width of the proposed confidence intervals for $G$.
We first generate random samples from the exponential distribution with cumulative distribution function $F(x) = 1 - \exp(-x)$, $x \ge 0$. Censoring times are generated from the exponential distribution with parameter $\gamma = 0.25$ to ensure that the sample contains approximately 20% censored observations. Note that the value of the Gini index for the standard exponential distribution is 0.5. We find the bias and the MSE of $\widehat{G}_c$ for the sample sizes $n = 50, 75, 100, 150, 200$. The results of the simulation study are reported in Table 1. In all the tables, we report MSE×10 as the MSE is very low. From Table 1, we observe that the bias and MSE of $\widehat{G}_c$ are very small in comparison to those of the estimators proposed by Bonetti et al. (2009) and Lv et al. (2017).
The Pareto distribution is often used to model income data. We generate observations from the Pareto distribution with cumulative distribution function $F(x) = 1 - x^{-\lambda}$, $x \ge 1$, $\lambda > 1$. The value of the Gini index in this case is $1/(2\lambda - 1)$. Davidson (2009) pointed out that valid inference for the Gini index is possible for $\lambda > 2$. Here we take $\lambda = 3$. We simulate censoring times from the exponential distribution with parameter 0.15 so that the sample contains approximately 20% censored observations. The MSE and bias of the estimators obtained in this case are given in Table 2. We observe that the bias and MSE of $\widehat{G}_c$ are smaller than those of the other two estimators for all sample sizes except $n = 200$, where the estimator of Bonetti et al. (2009) has marginally smaller bias and MSE.

Table 1. Bias and MSE: Exponential distribution

               $\widehat{G}_c$      Lv et al. (2017)     Bonetti et al. (2009)
Sample size    Bias     MSE       Bias     MSE         Bias     MSE
50             0.0173   0.0030    0.0697   0.0486      0.0286   0.0081
75             0.0146   0.0021    0.0500   0.0250      0.0254   0.0064
100            0.0117   0.0013    0.0394   0.0155      0.0213   0.0045
150            0.0078   0.0006    0.0262   0.0068      0.0156   0.0024
200            0.0073   0.0005    0.0216   0.0046      0.0133   0.0017

Finally, we generate data from the log normal distribution with log-scale mean zero and standard deviation 0.5. The true value of the Gini index in this case is 0.2763. Censoring times are simulated from the exponential distribution with parameter 0.20 to obtain approximately 20% censored observations. The results of the simulation study are presented in Table 3. In this case also, our estimator performs well compared to those of Bonetti et al. (2009) and Lv et al. (2017). From Tables 1-3, we observe that the bias and MSE of the estimator proposed by Bonetti et al. (2009) are smaller than those of the estimator obtained by Lv et al. (2017).
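The three simulation designs can be sanity-checked numerically. The sketch below, assuming NumPy, evaluates the closed-form Gini index of each lifetime distribution and estimates by Monte Carlo the fraction of observations that would be censored. It treats the exponential censoring parameters as rates and takes 0.5 as the log-scale standard deviation of the log normal, since $\mathrm{erf}(\sigma/2) = 2\Phi(\sigma/\sqrt{2}) - 1$ gives 0.2763 at $\sigma = 0.5$; both readings are assumptions of this illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
m = 200_000  # Monte Carlo size for this check only, not the paper's replication count

# name -> (lifetime sample, censoring rate, closed-form Gini index)
designs = {
    "exponential": (rng.exponential(size=m), 0.25, 0.5),
    # Pareto with F(x) = 1 - x^(-3) via the inverse CDF; 1 - U lies in (0, 1]
    "pareto": ((1.0 - rng.random(m)) ** (-1.0 / 3.0), 0.15, 1.0 / (2 * 3 - 1)),
    "lognormal": (rng.lognormal(mean=0.0, sigma=0.5, size=m), 0.20, math.erf(0.5 / 2)),
}

true_gini = {}
censored_fraction = {}
for name, (x, rate, gini) in designs.items():
    c = rng.exponential(scale=1.0 / rate, size=m)    # censoring times with the given rate
    true_gini[name] = gini
    censored_fraction[name] = float(np.mean(c < x))  # share of lifetimes that get censored
    print(f"{name:12s} Gini = {gini:.4f}  censored fraction = {censored_fraction[name]:.3f}")
```

For the exponential design the censoring probability is available in closed form, $\gamma/(1+\gamma) = 0.25/1.25 = 0.2$, which the simulated fraction should reproduce up to Monte Carlo error.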

Table 2. Bias and MSE: Pareto distribution

               $\widehat{G}_c$      Lv et al. (2017)     Bonetti et al. (2009)
Sample size    Bias     MSE       Bias     MSE         Bias     MSE
50             0.0675   0.0456    0.1138   0.1296      0.0793   0.0629
75             0.0641   0.0411    0.0970   0.0941      0.0706   0.0498
100            0.0590   0.0348    0.0844   0.0713      0.0637   0.0405
150            0.0519   0.0270    0.0707   0.0500      0.0534   0.0285
200            0.0477   0.0228    0.0631   0.0399      0.0473   0.0223

Table 3. Bias and MSE: Lognormal distribution

               $\widehat{G}_c$      Lv et al. (2017)     Bonetti et al. (2009)
Sample size    Bias     MSE       Bias     MSE         Bias     MSE
50             0.0062   0.0004    0.0430   0.0185      0.0186   0.0034
75             0.0047   0.0002    0.0281   0.0079      0.0167   0.0028
100            0.0043   0.0002    0.0230   0.0053      0.0146   0.0021
150            0.0029   0.0001    0.0153   0.0023      0.0104   0.0010
200            0.0026   0.0001    0.0122   0.0015      0.0095   0.0009

Table 4. The 95% confidence intervals for G for different distributions (CP: coverage probability in %, AW: average width)

                                Exponential      Pareto           Log normal
n     CI                        CP     AW        CP     AW        CP     AW
50    Lv et al. (2017)          90.7   0.194     88.8   0.186     88.2   0.143
      Bonetti et al. (2009)     91.2   0.188     93.1   0.166     92.1   0.124
      Proposed method           93.2   0.180     95.3   0.176     94.4   0.130
75    Lv et al. (2017)          90.8   0.186     89.4   0.163     92.0   0.122
      Bonetti et al. (2009)     92.6   0.173     92.3   0.154     93.2   0.094
      Proposed method           93.8   0.162     95.2   0.168     94.7   0.107
100   Lv et al. (2017)          90.9   0.178     89.8   0.142     92.1   0.099
      Bonetti et al. (2009)     92.8   0.147     92.8   0.101     93.9   0.083
      Proposed method           94.6   0.136     95.0   0.107     95.0   0.095
150   Lv et al. (2017)          91.2   0.132     90.1   0.118     93.2   0.083
      Bonetti et al. (2009)     93.1   0.122     93.2   0.073     94.1   0.075
      Proposed method           94.6   0.115     94.8   0.079     95.3   0.079
200   Lv et al. (2017)          91.8   0.110     90.1   0.104     93.4   0.077
      Bonetti et al. (2009)     93.4   0.101     93.4   0.092     94.2   0.063
      Proposed method           94.8   0.098     94.9   0.067     95.2   0.067

We evaluate the performance of the confidence intervals for $G$ for the three scenarios described above. The coverage probability (CP) and average width (AW) of the confidence intervals are reported in Table 4. From Table 4, we observe that the confidence interval for $G$ based on the proposed estimator has coverage probability closer to the nominal 95% level than those proposed by Bonetti et al. (2009) and Lv et al. (2017) for the choice of parameters that we have considered. Apart from this empirical evidence, we were not able to identify any reason for the difference in performance between the proposed estimator and the estimators of Bonetti et al. (2009) and Lv et al. (2017). The program for the simulation study is available online as supplementary material.

5. Concluding remarks

We have obtained a simple non-parametric estimator of the Gini index with right censored observations by modifying the bias corrected estimator of the Gini index proposed by Davidson (2009). We have proved that the proposed estimator is consistent and asymptotically normal. We have also obtained a consistent estimator of the asymptotic variance, which can be used to construct a confidence interval for the Gini index when the sample contains right censored observations. Using a Monte Carlo simulation study, we show that the proposed estimator produces relatively smaller bias and MSE in several cases when compared with the estimators proposed by Bonetti et al. (2009) and Lv et al. (2017). The simulation study also shows that the confidence interval for the Gini index based on the proposed estimator has good coverage probability and can be implemented very easily.
The estimator proposed by Lv et al. (2017) can be used under some forms of dependent censoring. The proposed estimator $\widehat{G}_c$ can be modified to deal with dependent censoring. This can probably be done by considering a U-statistic for dependent censoring suggested by Datta et al. (2010).

Acknowledgement
The authors thank the referees for their constructive suggestions which have
resulted in an improved version of the paper.

References
[1] Bonetti, M., Gigliarano, C. and Muliere, P. (2009). The Gini concentration test for survival data. Lifetime Data Analysis, 15, 493–518.
[2] Ceriani, L. and Verme, P. (2012). The origins of the Gini index: extracts from Variabilità e Mutabilità (1912) by Corrado Gini. Journal of Economic Inequality, 10, 421–443.
[3] Datta, S., Bandyopadhyay, D. and Satten, G. A. (2010). Inverse probability of cen-
soring weighted U-statistics for right-censored data with an application to testing
hypotheses. Scandinavian Journal of Statistics, 37, 680–700.
[4] Davidson, R. (2009). Reliable inference for the Gini index. Journal of Econometrics,
150, 30–40.
[5] Gray, R. J. and Tsiatis, A. A. (1989). A linear rank test for use when the main
interest is in differences in cure rates. Biometrics, 45, 899–904.
[6] Harrington, D. P., and Fleming, T. R. (1982). A class of rank test procedures for
censored survival data. Biometrika, 69, 553–566.
[7] Jenkins, S. P., Burkhauser, R. V., Feng, S. and Larrimore, J. (2011). Measuring
inequality using censored data: a multiple-imputation approach to estimation and
inference. Journal of the Royal Statistical Society: Series A, 174, 63–81.
[8] Langel, M. and Tille, Y. (2013). Variance estimation of the Gini index: revisiting a result several times published. Journal of the Royal Statistical Society: Series A, 176, 521–540.
[9] Lee, A. J. (1990). U-statistics: Theory and Practice, CRC press, Boca Raton.
[10] Lehmann, E. L. (1951). Consistency and unbiasedness of certain non-parametric tests.
Annals of Mathematical Statistics, 22, 165-179.
[11] Lv, X., Zhang, G. and Ren, G. (2017). Gini index estimation for lifetime data. Lifetime Data Analysis, 23, 275–304.
[12] Peng, L. (2011). Empirical likelihood methods for the Gini index. Australian and New Zealand Journal of Statistics, 53, 131–139.
[13] Stute, W. and Wang, J. L. (1993). The strong law under random censorship. Annals
of Statistics, 21, 1591–1607.
[14] Tse, S. M. (2006). Lorenz curve for truncated and censored data. Annals of the
Institute of Statistical Mathematics, 58, 675–686.
[15] Wang, D., Zhao, Y. and Gilmore, D. W. (2016). Jackknife empirical likelihood con-
fidence interval for the Gini index. Statistics & Probability Letters, 110, 289–295.
[16] Yitzhaki, S. and Schechtman, E. (2013). The Gini Methodology, Springer, New York.
