
Communications in Statistics - Theory and Methods

ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20

A Modified Hosmer–Lemeshow Test for Large Data Sets

Wei Yu, Wangli Xu & Lixing Zhu

To cite this article: Wei Yu, Wangli Xu & Lixing Zhu (2017): A Modified Hosmer–Lemeshow Test for Large Data Sets, Communications in Statistics - Theory and Methods, DOI: 10.1080/03610926.2017.1285922

To link to this article: http://dx.doi.org/10.1080/03610926.2017.1285922

Accepted author version posted online: 27 Jan 2017.


ACCEPTED MANUSCRIPT

A Modified Hosmer–Lemeshow Test for Large Data Sets

Wei Yu1, Wangli Xu1 and Lixing Zhu2∗

1 Center for Applied Statistics, School of Statistics, Renmin University of China
2 Department of Mathematics, Hong Kong Baptist University, Hong Kong

Abstract: The Hosmer–Lemeshow test is a widely used method for evaluating the goodness of fit of logistic regression models. Like other chi-square tests, however, its power is strongly influenced by the sample size. Paul et al. (2013) considered using a large number of groups for large data sets to standardize the power, but simulations show that their method performs poorly for some models and does not work at all when the sample size exceeds 25,000. In the present paper, we propose a modified Hosmer–Lemeshow test based on estimating and standardizing the distribution parameter of the Hosmer–Lemeshow statistic. We provide a mathematical derivation for obtaining the critical value and power of our test. Simulations show that our method satisfactorily standardizes the power of the Hosmer–Lemeshow test; it is especially recommended for sufficiently large data sets, for which the power is very stable. A bank marketing data set is also analysed for comparison with existing methods.

Keywords: Logistic regression; Hosmer–Lemeshow test; Test power; Large data sets.


Corresponding author, lzhu@math.hkbu.edu.hk. The research described herein was supported by a grant from the University Grants Council of Hong Kong and a grant from the National Natural Science Foundation of China (No. 11471335). The authors thank the editor, the associate editor and two referees for their constructive comments, which led to an improvement of an earlier version of the manuscript.


1 Introduction

Logistic regression is a widely used generalized linear model for fitting data sets with binary outcomes. An important topic in the modeling exercise is the goodness-of-fit test: testing the null hypothesis that the model fits the data well against the alternative that it does not. The Hosmer–Lemeshow test is a
widely used goodness-of-fit test for evaluating logistic models. It is implemented by sorting the n instances in the data set according to the estimated success probability and splitting the sorted data set into m groups of equal size n̄ = n/m (when n is divisible by m). The test statistic is then calculated as

    T_m = Σ_{j=1}^{m} (e_j − o_j)² / (n̄ ē_j (1 − ē_j)),    (1.1)

where e_j is the sum of the estimated success probabilities in the jth group, o_j is the number of observed successes in the jth group, and ē_j = e_j/n̄ is the mean estimated success probability in the jth group. According to Theorem 5.1 in Moore and Spruill (1975), under the null hypothesis T_m in (1.1) follows a chi-square distribution χ²_{m−2}, while under the alternative hypothesis it follows a non-central chi-square distribution χ²_{m−2,λ} with λ a non-centrality parameter.
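As an illustration, the statistic (1.1) can be computed directly from the estimated probabilities and the outcomes. The sketch below is ours, not the authors' code; `hosmer_lemeshow` is an illustrative name, and `np.array_split` reproduces equal group sizes whenever n is divisible by m.

```python
import numpy as np

def hosmer_lemeshow(p_hat, y, m=10):
    """Hosmer-Lemeshow statistic T_m of equation (1.1).

    p_hat : estimated success probabilities; y : binary outcomes.
    Instances are sorted by p_hat and split into m groups."""
    order = np.argsort(p_hat)
    groups_p = np.array_split(p_hat[order], m)
    groups_y = np.array_split(y[order], m)
    t = 0.0
    for pj, yj in zip(groups_p, groups_y):
        nbar = len(pj)          # group size (n/m when divisible)
        e_j = pj.sum()          # sum of estimated probabilities in group j
        o_j = yj.sum()          # observed successes in group j
        ebar_j = e_j / nbar     # mean estimated probability in group j
        t += (e_j - o_j) ** 2 / (nbar * ebar_j * (1 - ebar_j))
    return t
```

Under the null distribution result quoted above, this statistic would then be compared against a χ²_{m−2} critical value.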
Because it is easy to calculate and to interpret, the Hosmer–Lemeshow test is widely used for data sets of all sizes. For example, Cole et al. (2010) constructed models for predicting the survival of preterm infants with samples ranging in size from 1434 to 4748 and used the
Hosmer–Lemeshow test to compare three scoring models. Newman et al. (1999) used a sample
of 51,837 infants to test a model for high neonatal total serum bilirubin level by the Hosmer–
Lemeshow method. Krag et al. (1998) constructed a prediction model for metastases with a
sample of 443 breast cancer patients and used the Hosmer–Lemeshow test to evaluate the good-
ness of fit. Another example is the scoring system for the prediction of early mortality in cardiac
surgical patients by Nashef et al. (1999). They used the Hosmer–Lemeshow method to test their


model with a sample size of 13,302. Hukkelhoven et al. (2006) constructed prognostic models to
analyse traumatic brain injury using samples with sizes from 124 to 2269. In addition, they used
data sets with sizes from 409 to 2269 to validate the models.
However, a problem in applying the Hosmer–Lemeshow test is that its power increases with the sample size n, a drawback shared by all chi-square tests. Dahiya and Gurland (1973) showed that under the alternative hypothesis, the non-centrality parameter λ in the distribution of T_m is proportional to the sample size for fixed m, i.e.,

    λ ∝ n,    (1.2)

which causes the power to increase with n. Hence the null hypothesis may be accepted under a small sample size but rejected under a large one. In short, the sample size may affect our assessment of the goodness of fit of the model. We would like to obtain the same conclusion for the same model under any sample size, so the Hosmer–Lemeshow test needs some improvement.
To standardize the power of the Hosmer–Lemeshow test, Paul et al. (2013) suggested using
different numbers of groups m for different sample sizes. Since it had been found that under a fixed
n, increasing m leads to a decrease in power, they used large m for large n to offset the increase of
power caused by the sample size. Specifically, they recommended using m = 10 for n ≤ 1000 and
 nk n − k n 2 o
m = max 10, min , ,2 + 8 (1.3)
2 2 1000

for 1000 < n ≤ 25000, where k is the total number of successes. But the simulations in their
paper indicate that only increasing m does not work in some cases. Take model 1 in Paul et al.
(2013) as an example. The success probability of model 1 is 0.256, so for n = 25, 000, m in (1.3)
will be approximately 3200. The test power under m = 3200 is definitely 1 because we can see
that even when m = 2 + 8(n/1000)2 = 5002 is used, the power is still 1 from Table III in Paul et

al. (2013). That is to say, the increasing speed of m in (1.3) cannot completely offset the power
increase caused by the sample size. However, on the other hand, the number of groups cannot be


increased without limit, because n̄ = n/m ≥ 5 is required to ensure the validity of the chi-square test (Hosmer and Lemeshow, 2000). These two requirements conflict and make the method ineffective in some scenarios. In the above example, m = 5002 already reaches the largest possible number of groups for n = 25,000. For the case n > 25000: if n > 31250, min{k/2, (n − k)/2} is necessarily smaller than 2 + 8(n/1000)²; if 25000 < n < 31250, min{k/2, (n − k)/2} is also smaller than 2 + 8(n/1000)² in many cases, as long as k is not very close to n/2. So m in (1.3) will be min{k/2, (n − k)/2} in most cases, which results in an over-powered test. If m = 2 + 8(n/1000)² is used forcibly, then n̄ = n/m will be too small. That is why the method breaks down for n > 25,000. Paul et al. (2013) did not propose a solution to this problem and simply recommended not using the Hosmer–Lemeshow test. In addition, if the success probability of the true model is near 0 or 1, min{k/2, (n − k)/2} will also be much smaller than 2 + 8(n/1000)² even for n < 25,000, and then m is not large enough to offset the influence of the sample size.
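For concreteness, the group-number rule (1.3) can be written as a small function. This is our sketch, with `paul_et_al_m` an illustrative name; it reproduces the m ≈ 3200 figure quoted above for n = 25,000 with success probability 0.256 (k = 6400).

```python
def paul_et_al_m(n, k):
    """Number of groups recommended by Paul et al. (2013).

    n : sample size; k : total number of successes.
    m = 10 for n <= 1000; otherwise equation (1.3)."""
    if n <= 1000:
        return 10
    # m = max{10, min{k/2, (n-k)/2, 2 + 8(n/1000)^2}}
    return max(10, min(k / 2, (n - k) / 2, 2 + 8 * (n / 1000) ** 2))
```

As discussed above, for n > 25,000 (and for very unbalanced k) the `min` is dominated by k/2 or (n − k)/2, which is the regime where the rule stops compensating for the sample size.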
In this paper, we propose a modified Hosmer–Lemeshow test that can handle data sets of any size; the larger the sample size, the more stable the power of our test becomes. The modification is based on the following idea. Since the non-centrality parameter λ in the distribution of T_m under the alternative hypothesis is strongly affected by the sample size, we stabilize λ by multiplying it by a constant, which naturally depends on n. We first estimate the non-centrality parameter λ using T_m for a fixed m and view the non-central chi-square distribution with the estimated λ as the predictive distribution of T_m. We then modify the estimated λ and generate a random number from the modified distribution; the final decision is based on this number. Through this procedure, the non-centrality parameter is standardized to a large extent, and thus the power becomes stable.
This paper is organized as follows. Because Paul et al. (2013) evaluated their recommended number of groups only for model 6 below, we conduct a further simulation to investigate the performance of their method in Section 2. In Section 3, we present the main methodology of
the proposed modified test procedure. Next we use simulations to illustrate the performance of our


method and compare it with existing methods in Section 4. In Section 5 a bank marketing data set
is analysed by re-sampling to compare existing methods with our proposed method. Finally, we
give some concluding remarks in Section 6.

2 A further simulation to investigate the method in Paul et al. (2013)

In order to illustrate the motivation for our modification, we first conduct a further simulation to
investigate the method in Paul et al. (2013) in this section. For simplicity, we also use the models
constructed in their paper, except that model 3 is changed. This is to avoid repetition because the
test powers of models 2 and 3 in their paper are rather similar under the same sample size. In
addition, our new model 3 is constructed to illustrate a case where the method of Paul et al. (2013)
performs poorly. The true models we use are listed as follows:

Model 1: logit(π) = −2 + X1 + 0.2X1² + Z − 2(Z × X1)
Model 2: logit(π) = 2 + X1 + Z + 0.5(Z × X1)
Model 3: logit(π) = 6X1 + Z × X1²
Model 4: logit(π) = X1 + 0.2X1²                                          (2.1)
Model 5: logit(π) = X1 + 0.2X1² + X2
Model 6: logit(π) = −3 − X1 − 0.2X1²,

where X1 ∼ N(0, 1), X2 ∼ N(0, 1), Z ∼ Bino(1, 0.5) and π(X1, X2, Z) = E(Y|X1, X2, Z) with the conditional distribution Y|X1, X2, Z ∼ Bino(1, π). The data sets are formed as (X1i, X2i, Zi, Yi), i = 1, 2, ∙∙∙, n, generated from the above 6 models respectively, and are used to fit the following logistic model

logit(π) = β0 + β1 X1 . (2.2)


We employ the maximum likelihood method to estimate β0 and β1 and then conduct the Hosmer–
Lemeshow test with significance level α = 0.05 using the number of groups proposed by Paul et
al. (2013). Repeating this process K = 5000 times, we can get the power of the test. We present
the results of their test for the 6 models under n = 500, 1000, 2000, 4000, and 25,000 in Table 1.
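One replication of this simulation — generating data from model 1 and fitting the working model (2.2) by maximum likelihood — can be sketched as follows. This is our illustrative code using only numpy/scipy (a packaged logistic-regression routine would serve equally well); `fit_logistic` is an assumed helper name, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_logistic(X, y):
    """Maximum-likelihood fit of logit(pi) = X @ beta by minimizing
    the Bernoulli negative log-likelihood (written stably)."""
    def nll(beta):
        eta = X @ beta
        # -loglik = sum log(1 + exp(eta)) - y * eta
        return np.sum(np.logaddexp(0.0, eta) - y * eta)
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
# True model 1: logit(pi) = -2 + X1 + 0.2 X1^2 + Z - 2 (Z x X1)
pi = expit(-2 + x1 + 0.2 * x1**2 + z - 2 * z * x1)
y = rng.binomial(1, pi).astype(float)
# Fitted (misspecified) model (2.2): logit(pi) = b0 + b1 X1
X = np.column_stack([np.ones(n), x1])
beta_hat = fit_logistic(X, y)
p_hat = expit(X @ beta_hat)
```

Repeating such a replication K = 5000 times and recording the rejection frequency of the goodness-of-fit test gives the empirical power reported in Table 1.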
We can make the following observations from Table 1. For convenience, when saying “in the
case when model 1 is the true model”, we just say “in the case of model 1” or “model 1” for short.
First, if the test power is very small (close to the significance level) under a small sample size, as for models 2, 3 and 6, the performance of the method of Paul et al. (2013) differs across models; this follows from the structure of each model. The method standardizes the power for model 6 well. For model 2, as the power of the Hosmer–Lemeshow method is already stable, the effect of their improvement is not evident. For model 3, the power under small sample sizes is also very small, similar to those of models 2 and 6. When model 3 is the true model, the fitted model in (2.2) is not a bad choice from the viewpoint of prediction: the average proportion of correctly predicted records is 0.910, much larger than the success probability of model 3 (0.492), so a low power such as 0.072 is reasonable. But with n = 25,000, the power grows to 0.765. By contrast, the power of the Hosmer–Lemeshow method with fixed m = 10 only increases to 0.615 at n = 25,000. Their method thus cannot stabilize the power for model 3.
Second, for intermediate cases like models 4 and 5, the powers under small sample sizes are moderate (0.199 and 0.176 respectively under n = 500), but can also become large for very large data sets under the method of Paul et al. (2013). From the powers of models 4 and 5 under small sample sizes, we can say that models 4 and 5 do not deviate greatly from the fitted model (2.2). Yet the power of Paul et al. (2013)'s method still increases quickly and approaches 1 as the sample size grows. That is to say, their method does not overcome much of the shortcoming of the Hosmer–Lemeshow test in this case.
Third, if the test power is already very large under a small sample size, as in the case of model 1,


the power under large sample sizes will certainly become very large and quickly reach 1. Although
the model indeed fits very poorly, we still hope to standardize its power and find a universal method
for all models.
In the simulation section, we will show that our method can standardize the power for any model, regardless of whether the power under small sample sizes is small or large; even in a case like model 1, our proposed method does a good job. Our method can handle any sample size, and the largest sample size in our simulations is 1,000,000.

3 The new test procedure

In the following discussion, f_{m−2}(x) = χ²_{m−2}(x) denotes the pdf of a central chi-square distribution with m − 2 degrees of freedom and F_{m−2}(x) = ∫₀ˣ χ²_{m−2}(y) dy denotes its cdf; g_{m−2,λ}(x) = χ²_{m−2,λ}(x) denotes the pdf of a non-central chi-square distribution with m − 2 degrees of freedom and non-centrality parameter λ, and G_{m−2,λ}(x) = ∫₀ˣ χ²_{m−2,λ}(y) dy denotes its cdf.
First we consider the case of a fixed number of groups m = 10. For the Hosmer–Lemeshow test, under the alternative hypothesis the statistic T_m follows a χ²_{m−2,λ} distribution, as mentioned above. So the probability of rejecting the null hypothesis is

    P = Pr(T_m > z) = ∫_z^∞ χ²_{m−2,λ}(y) dy = 1 − G_{m−2,λ}(z),    (3.1)

where z = F⁻¹_{m−2}(1 − α) is the critical value, which can be solved from

    ∫_z^∞ χ²_{m−2}(y) dy = α.

As mentioned in Section 1, under the alternative hypothesis, the non-centrality parameter λ ∝ n.


Define the sample size n0 = 500 as the standard case. Consider the case when the sample size is
twice n0 , i.e., n = 1000. Then λ also becomes approximately twice its value under the standard

case. Therefore the power increases sharply. A natural idea is that we multiply the non-centrality


parameter λ by 1/2 for n = 1000, so that it becomes the same as under the standard case.
To implement this approach, we first need to estimate λ. Paul et al. (2013) used λ̂ = T̄_m − (m − 2) as an estimator, where T̄_m is T_m averaged over K identically generated samples. This can be viewed as a first-order moment estimator of λ, because the mean of the χ²_{ν,λ} distribution is μ = ν + λ. In practice, however, we obtain only a single value of T_m from the single available sample, so for simplicity we use

    λ̂ = T_m − (m − 2).    (3.2)

Simulations show that (3.2) is adequate for our test procedure. The steps of our test are as follows.

Step 1. Given a data set of size n, calculate the Hosmer–Lemeshow statistic T_m with m = 10, then calculate λ̂ by (3.2).

Step 2. Define a modified λ_c as

    λ_c = cλ̂ = c(T_m − (m − 2)),    (3.3)

where

    c = l(m, m0, n, n0)    (3.4)

is a function of m, m0, n and n0 that will be specified later.

Step 3. If λ_c < 0, accept the null hypothesis. If λ_c ≥ 0, generate a random number r from the χ²_{m−2,λ_c} distribution. Denote the critical value by z_c, which is not the same as that of the Hosmer–Lemeshow test and will be specified in Theorem 1. Reject the null hypothesis if r > z_c and accept it otherwise.

Under the proposed test procedure, the critical value z_c ≠ F⁻¹_{m−2}(1 − α). Instead, we have the following theorem.


Theorem 1. Under the proposed test procedure, the critical value z_c is the solution of the equation

    (1/c) ∫₀^∞ [1 − G_{m−2,x}(z_c)] f_{m−2}(x/c + (m − 2)) dx = α.    (3.5)

In equation (3.5), 1 − G_{m−2,x}(z_c) is the probability of rejecting H0 under a fixed non-centrality parameter λ_c = x, and (1/c) f_{m−2}(x/c + (m − 2)) is the corresponding probability density of λ_c under H0. The left-hand side of equation (3.5) is therefore the probability of rejecting the null hypothesis when H0 is true.

Next we analyze the form of the transformation coefficient c in (3.4). First consider the influence of the sample size: as mentioned above, c = 1/2 for n = 1000 makes the non-centrality parameter the same as under the standard case n0 = 500. Inspired by this, we can use

    c = n0/n    (3.6)

under a standard number of groups m, say m = 10. Under a fixed sample size, using a larger number of groups m decreases the power of the Hosmer–Lemeshow test, as illustrated in Table III of Paul et al. (2013). If we use the c in (3.6) with a larger m, the power will decrease further, because we actually shrink the non-centrality parameter for sample sizes larger than n0. We therefore extend the coefficient c in (3.6) to different m.

From property III of Paul et al. (2013), we know that λ ∝ m − 2 is needed in order to hold the power 1 − β constant. However, Table II of Paul et al. (2013) shows that λ̂ grows much more slowly than m − 2 under a fixed sample size in most cases; when m is large enough, λ̂ increases only slightly with m, or even decreases. For example, for model 1 and n = 500, when m changes from 6 to 66, m − 2 increases sixteen-fold, so to hold the power stable, λ̂ would also have to increase sixteen-fold. But Table II of Paul et al. (2013) shows that λ̂ only changes from 9.22 to 12.34, far less than that, and when m rises further to 130, λ̂ decreases to 10.88 instead. That is why the power decreases substantially when m becomes larger. On the whole, compared with the range over which m − 2 changes, we assume that λ̂ is approximately stable


when m changes, and define m0 = 10 as a standard case. So, to construct a modified non-centrality parameter proportional to m − 2, we should multiply λ̂ by √((m − 2)/(m0 − 2)) under a fixed n. Jointly considering the variability of n, a comprehensive adjustment coefficient in (3.3) should be

    c = (n0/n) √((m − 2)/(m0 − 2)).    (3.7)

Then, for any n and m, the complete test procedure consists of steps 1, 2 and 3 above, with the adjustment coefficient c in (3.7) used in step 2.
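Steps 1–3 with the coefficient (3.7) can be summarized in code. This is a minimal sketch of the procedure under our reading of the paper, not the authors' implementation; the critical value `z_c` is assumed to be supplied externally (from the paper's Table 2 or a numerical solver for (3.5)), and `adjustment_c` and `modified_hl_test` are illustrative names.

```python
import numpy as np
from scipy.stats import ncx2

def adjustment_c(n, m, n0=500, m0=10):
    """Adjustment coefficient (3.7): c = (n0/n) * sqrt((m-2)/(m0-2))."""
    return (n0 / n) * np.sqrt((m - 2) / (m0 - 2))

def modified_hl_test(t_m, n, m=10, z_c=None, rng=None):
    """Steps 1-3 of the proposed test.

    t_m : Hosmer-Lemeshow statistic computed on the data (step 1);
    z_c : critical value solving (3.5), supplied by the caller.
    Returns True if H0 is rejected."""
    rng = rng if rng is not None else np.random.default_rng()
    lam_hat = t_m - (m - 2)                # moment estimator (3.2)
    lam_c = adjustment_c(n, m) * lam_hat   # modified parameter (3.3)
    if lam_c < 0:
        return False                       # step 3: accept H0
    r = ncx2.rvs(m - 2, lam_c, random_state=rng)  # draw from chi2_{m-2, lam_c}
    return bool(r > z_c)                   # reject iff r > z_c
```

Note that the decision is randomized: for λ_c ≥ 0 the verdict depends on the drawn r, which is exactly what stabilizes the power across sample sizes.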

Remark 1. In this paper, we use n0 = 500 and m0 = 10 as the standard case for the proposed method. The number of groups m = 10 is commonly used for the Hosmer–Lemeshow test and has been adopted as the default by most statistical packages. Under m = 10, we find by simulation that n = 500 is a sample size that makes a moderate distinction between poor models and good models. For example, the power of the Hosmer–Lemeshow test for model 5 is 0.060, 0.073, 0.152, 0.342 and 0.630 at n = 100, 300, 500, 1000, 2000 respectively. For n < 500 the power increases slowly compared with the growth of the sample size, but for n > 500 it increases rapidly. The power 0.152 under n = 500 is moderate and reasonable, whereas the power 0.630 under n = 2000 is too large for a model with only moderate lack of fit. In addition, n = 500 is large enough to handle small event rates.

Numerical algorithms can be used to solve (3.5) for z_c. Critical values for some combinations of n and m are listed in Table 2; in the calculation, n0 = 500 and m0 = 10 are used in (3.4).

Similarly to the proof of Theorem 1, when H0 is not true and T_m ∼ χ²_{m−2,λ}, the probability density function of λ_c is

    (1/c) g_{m−2,λ}(x/c + (m − 2))    (3.8)

and the probability of rejecting H0 for the proposed test is

    P_c = (1/c) ∫₀^∞ [1 − G_{m−2,x}(z_c)] g_{m−2,λ}(x/c + (m − 2)) dx.    (3.9)


This is the theoretical power of the proposed test.


Here we present an experiment comparing the theoretical powers of the proposed method (3.9) and the Hosmer–Lemeshow test (3.1). We fix the number of groups at m = 10 and the significance level at α = 0.05. Suppose a model for which, at the sample size n0 = 500, the Hosmer–Lemeshow statistic under m = 10 follows χ²_{m−2,λ0}, where λ0 is known (set to 0, 0.5, 1). Then for any sample size n, the non-centrality parameter is λ = λ0 n/n0, and the power of the Hosmer–Lemeshow test is 1 − G_{m−2,λ}(z), where z is the (1 − α)th quantile of χ²_{m−2}. For our proposed method, we can obtain the critical value z_c under n and m = 10 by Theorem 1; then with c = n0/n and λ = λ0 n/n0, we can evaluate the integral in (3.9) numerically, which gives the theoretical power of the proposed method. Figure 1 shows the power curves versus n for n ∈ [500, 10000] for the proposed method and the Hosmer–Lemeshow method.
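The two theoretical power formulas can be sketched as follows (our code, with assumed names `power_hl` and `power_proposed`; z_c is supplied by the caller, e.g. from a solver for (3.5)). Evaluating them with λ = λ0 n/n0 and c = n0/n over a grid of n reproduces curves of the kind shown in Figure 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import chi2, ncx2

def power_hl(lam, m=10, alpha=0.05):
    """Power (3.1) of the ordinary Hosmer-Lemeshow test."""
    df = m - 2
    z = chi2.ppf(1 - alpha, df)     # z = F^{-1}_{m-2}(1 - alpha)
    return ncx2.sf(z, df, lam)      # 1 - G_{m-2,lam}(z)

def power_proposed(lam, z_c, c, m=10):
    """Theoretical power (3.9) of the proposed test."""
    df = m - 2
    def integrand(x):
        # [1 - G_{m-2,x}(z_c)] * (1/c) g_{m-2,lam}(x/c + (m-2))
        tail = chi2.sf(z_c, df) if x <= 0 else ncx2.sf(z_c, df, x)
        return tail * ncx2.pdf(x / c + df, df, lam) / c
    val, _ = quad(integrand, 0.0, np.inf, limit=200)
    return val
```

Qualitatively, `power_hl` rises quickly as λ grows with n, while `power_proposed` is damped by the shrinking c, matching the behaviour reported for Figure 1.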
Figure 1(a) (λ0 = 0) actually shows the sizes of the two tests, both approaching the significance level. From Figures 1(b) (λ0 = 0.5) and 1(c) (λ0 = 1), we can see that the power of the Hosmer–Lemeshow test increases rapidly with n, and the larger λ0 is, the more quickly the power reaches 1. In comparison, the power of the proposed test is very stable.

4 Simulation Studies

In this section, we present the results of some simulations conducted to investigate the perfor-
mance of the proposed method. The model settings are the same as in Section 2, that is, the true
models are the 6 models in (2.1) and the null hypothesis is the model in (2.2).
Study 1.
First we compare the test power of some methods under the common number of groups m = 10
and sample sizes n = 500, 600, 700, ∙ ∙ ∙ , 10000. We repeat this K = 5000 times to get the test
power of the three methods: the Hosmer–Lemeshow method, the method of Paul et al. (2013), and
our proposed method, under each sample size. The results are plotted in Figure 2.


From Figure 2 we can see that the power of our proposed method is very stable in general, except that for model 1 it is a little low under small sample sizes. For the Hosmer–Lemeshow method, the power quickly reaches 1 for models 1, 4 and 5, and it also increases at a certain rate for models 2, 3 and 6. The method of Paul et al. (2013) has some effect in standardizing the power for models 4, 5 and 6, but for models 4 and 5 its power stability is not as good as that of the proposed method. It is true that the powers of Paul et al. (2013)'s test and the proposed test are both stable for simulation model 6 when the sample size is not large. When the sample size grows, the modest deviation from the true model becomes more critical, yet the power of Paul et al. (2013)'s method for model 6 remains around 0.04, below the level 0.05, even when n = 25000, which is too low; in this case, our test has power 0.15. In other words, the proposed test is much more sensitive. For model 1, the method of Paul et al. (2013) shows hardly any improvement over the Hosmer–Lemeshow method. In addition, for model 2, the differences among the three methods are small and their powers are all near the significance level, because model 2 is already a good model for fitting the data. For model 3, the power stability of the method of Paul et al. (2013) is even worse than that of the Hosmer–Lemeshow test.
Study 2.
In this study we work out the power of the proposed test for the 6 models in (2.1) under
different combinations of n and m by varying n = 500, 1000, 2000, 4000, 10000, 50000 and m =
6, 10, 18, 34, 66, 130, 802. The results are in Table 3.
We summarize the results in Table 3. First, under a fixed m, the power still increases with n, but by a much smaller amount than when the Hosmer–Lemeshow test is used directly, as shown in Table III of Paul et al. (2013). As n becomes large enough, the power grows extremely slowly, barely changing at all. For example, when m = 6, the power under model 1 only increases from 0.645 to 0.683 as n goes from 10,000 to 50,000. This is a merit worth mentioning. A slight drawback is that under large m, the power may be unstable for some models, such as model 1, when n changes. But for models 2 through 6, the power is still stable under large m. Under small m, the


power of our test for all models is rather stable. Second, under a fixed n, the power of our test for different models may follow different patterns as m increases: some mainly increase, like model 3; some mainly decrease, like model 6; and others first increase and then decrease, like model 1. Third, for models 1, 4 and 5, the power of the proposed method is a little low under n = 500 compared with that of the Hosmer–Lemeshow test.
Combining these three points, we suggest still using the Hosmer–Lemeshow test for n < 1000; our method plays a good role for n ≥ 1000, especially for sufficiently large n. In addition, when using the proposed method it is better to set a small m; the commonly used m = 10 is already a good choice. In this way we can ignore the changing patterns of the power with varying m and avoid the inferior performance under large m.
Study 3.
In this study we investigate whether the proposed test can stabilize the power for very large data sets. The models considered here are models 2 and 6 in (2.1), which have small powers (close to the significance level) under small sample sizes, as seen in Figure 2. The fitted model is again (2.2). The only difference is that we focus on very large sample sizes, n = {0.5, 1, 2, 3, 4, 5, 10} × 10⁵. For these sample sizes the method of Paul et al. (2013) does not work, so only the results of the Hosmer–Lemeshow method and our proposed method are reported. For both methods, the number of groups m = 10 is used. The results are shown in Table 4.
From Table 4 we can see that for models which are nearly correct (as nearly as is possible in a real setting), the Hosmer–Lemeshow test will always reject the null hypothesis given a large enough sample size. The proposed method, by contrast, stabilizes the power well even for sample sizes up to one million. Combined with the results in Figure 2, for models 2 and 6 the powers of the proposed method change little as the sample size increases from 500 to 10⁶.


5 An analysis of real data

In this section, we analyse a bank marketing data set related to the direct marketing campaigns of a Portuguese banking institution. Our goal is to predict the success of telephone marketing for selling long-term bank deposits. The target variable y = 1 means that the client will subscribe to a term deposit, and y = 0 indicates the opposite. The data set contains 20 predictor variables, mainly describing three aspects: bank client, product, and socio-economic context attributes. There are 41,189 instances, 4640 of which have y = 1, constituting 11.3% of the whole sample. Several missing values exist in some categorical attributes, all coded with the "unknown" label; in our analysis, these missing values are treated as a possible class label.
The data were collected by Moro et al. (2014) from a Portuguese retail bank in 2012 and 2013. In their paper, they tried four data-mining methods: logistic regression, decision trees, neural networks, and support vector machines. They used the AUC (area under the receiver operating characteristic curve) and ALIFT (area under the LIFT cumulative curve) to compare the four models and concluded that the neural network performs best (AUC = 0.8 and ALIFT = 0.7). Finally, they extracted some knowledge valuable to telephone marketing campaign managers.
In our paper, we use a logistic regression model to fit the data. First, by forward selection, the 5 most significant variables were chosen: "month" — the last contact month of the year (categorical: 'jan', 'feb', ∙∙∙, 'nov', 'dec'); "duration" — the last contact duration in seconds (numeric); "poutcome" — outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success'); "emp.var.rate" — quarterly employment variation rate (numeric); and "nr.employed" — quarterly number of employees (numeric). Categorical variables with r levels are transformed into r − 1 indicator variables. We view the data set with the 5 chosen predictor variables and the target variable y as a population and draw a sample of size n from it. We use logistic regression to fit the sub-sample and apply the Hosmer–Lemeshow test, the method of Paul et al. (2013), and our proposed method, with

α = 0.05. For the Hosmer–Lemeshow method and our proposed method, the number of groups
m = 10 is used. The above re-sampling is repeated K = 5000 times to get the power of the three
methods. We present the powers of the three methods under n = 500, 1000, 2000, 4000, 10000 in
Table 5.
Table 5 shows that under n = 500, the powers of the three methods are similar and close to
the significance level α = 0.05. Then the power of the Hosmer–Lemeshow test grows rapidly as n
increases and reaches 1 when n = 4000. The power of the test of Paul et al. (2013) grows a little
more slowly, but also reaches 0.981 for n = 4000 and 1 for n = 10, 000. Compared with those two
methods, the power of our proposed method is rather stable.

6 Conclusion and Discussion

In this paper, we proposed an improved Hosmer–Lemeshow test based on a modification of the non-centrality parameter under the alternative hypothesis. The greatest advantage of our method is that when the sample size is large enough, its power is very stable and barely changes, unlike that of the Hosmer–Lemeshow test, which keeps increasing until it reaches 1. Unfortunately, the power of our proposed method is a little low under small sample sizes. Our recommendation is therefore to use the Hosmer–Lemeshow test for n < 1000 and our modified test for n ≥ 1000, preferably with a small m; the commonly used m = 10 is already a good choice. Of course, if one wants to use the proposed method for small data sets, increasing the value of n0 in (3.7) is a handy way to obtain good power under small sample sizes, but the detailed behaviour when changing the values of n0 and m0 needs further research.


Appendix: Proof of Theorem 1


Proof. The critical value should be such that the probability of rejecting the null hypothesis equals α when the null hypothesis is true, i.e.,

    Pr(reject H0 | H0) = Pr(r > z_c | H0) = α,    (A.1)

where r is the random number in Step 3 above. Hence we only need to derive the expression for Pr(r > z_c | H0).
First, as mentioned above, under the null hypothesis, T_m ∼ χ²_{m−2} and its probability density
function is f_{m−2}(T_m). The quantity λ_c in (3.3) can be viewed as a function of T_m, i.e.,
λ_c(T_m) = c(T_m − (m − 2)). Its inverse is T_m(λ_c) = λ_c/c + (m − 2) and the derivative is
T_m′(λ_c) = 1/c. So the pdf of λ_c can be obtained directly from the pdf of T_m, f_{m−2}(T_m),
by the change-of-variables formula. Denoting the probability density function of λ_c by q(x),
we have

    q(x) = f_{m−2}(T_m(x)) · |T_m′(x)| = (1/c) f_{m−2}(x/c + (m − 2)).    (A.2)
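The change of variables in (A.2) can be sanity-checked numerically: the CDF it implies is Pr(λ_c ≤ t) = F_{m−2}(t/c + (m − 2)), where F_{m−2} is the χ²_{m−2} CDF, and this should match the empirical CDF of simulated values of λ_c. A minimal sketch in Python; the values of m and c below are arbitrary illustrative choices, since c depends on quantities defined earlier in the paper:

```python
import numpy as np
from scipy.stats import chi2

# Illustrative values: m is the number of groups; c is the scaling
# constant from (3.3), treated here as a given number.
m, c = 10, 0.8
rng = np.random.default_rng(0)

# Draw T_m ~ chi^2_{m-2} under the null and transform to lambda_c.
T = chi2.rvs(df=m - 2, size=200_000, random_state=rng)
lam = c * (T - (m - 2))

# CDF implied by the density q(x) in (A.2):
# Pr(lambda_c <= t) = F_{m-2}(t/c + (m - 2)).
for t in [-2.0, 0.0, 2.0, 5.0]:
    empirical = (lam <= t).mean()
    theoretical = chi2.cdf(t / c + (m - 2), df=m - 2)
    assert abs(empirical - theoretical) < 0.01, (t, empirical, theoretical)
```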
On the other hand, from Step 3 of our proposed test procedure, we know that the probability of
rejecting the null hypothesis under a given λ_c = x is

    Pr(r > z_c | λ_c = x) = { 0,                    x < 0,
                              1 − G_{m−2,x}(z_c),   x ≥ 0.    (A.3)
In the discrete case, the law of total probability reads Pr(r > z_c) = Σ_x Pr(r > z_c | λ_c = x) Pr(λ_c = x);
here we use its continuous analogue. So, based on Equations (A.2) and (A.3), we have

    Pr(r > z_c | H_0) = ∫_{−∞}^{∞} Pr(r > z_c | λ_c = x) q(x) dx
                      = ∫_{0}^{∞} (1 − G_{m−2,x}(z_c)) (1/c) f_{m−2}(x/c + (m − 2)) dx.    (A.4)

Substituting (A.4) into (A.1) gives the equation for the critical value, which is exactly (3.5). □
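Equation (A.4) can be solved numerically for the critical value. The sketch below, a non-authoritative illustration that treats the constant c from (3.3) as known (it actually depends on n, n0 and m0 through (3.7)), evaluates the integral with scipy, finds the root of Pr(r > z_c | H_0) − α = 0, and then checks the attained level by simulating the randomized procedure of Step 3 under the null:

```python
import numpy as np
from scipy.stats import chi2, ncx2
from scipy.integrate import quad
from scipy.optimize import brentq

def rejection_prob(zc, m, c):
    """Pr(r > zc | H0) as in (A.4): only lambda_c = x >= 0 contributes."""
    integrand = lambda x: ((1.0 - ncx2.cdf(zc, df=m - 2, nc=x))
                           * chi2.pdf(x / c + (m - 2), df=m - 2) / c)
    return quad(integrand, 0.0, np.inf)[0]

def critical_value(m, c, alpha=0.05):
    """Solve Pr(r > zc | H0) = alpha for zc, i.e. equation (3.5)."""
    return brentq(lambda z: rejection_prob(z, m, c) - alpha, 0.1, 500.0)

# Illustrative values; c is simply taken as 1 here.
m, c, alpha = 10, 1.0, 0.05
zc = critical_value(m, c, alpha)

# Monte Carlo check of the attained level under H0, following Step 3:
# draw T, form lambda_c, and reject only when lambda_c >= 0 and r > zc,
# matching the two cases of (A.3).
rng = np.random.default_rng(1)
T = chi2.rvs(df=m - 2, size=100_000, random_state=rng)
lam = c * (T - (m - 2))
r = ncx2.rvs(df=m - 2, nc=np.maximum(lam, 0.0), random_state=rng)
level = np.mean((lam >= 0) & (r > zc))
assert abs(level - alpha) < 0.01
```

Because the randomization draws r from a noncentral χ² whose noncentrality is nonnegative, the resulting critical value exceeds the usual χ²_{m−2} quantile at level α.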


References

[1] Chen A, Pennell ML, Klebanoff MA, Rogan WJ, Longnecker MP (2006). Maternal smoking
during pregnancy in relation to child overweight: follow-up to age 8 years. International
Journal of Epidemiology, 35(1):121–130.

[2] Cole TJ, Hey E, Richmond S (2010). The PREM score: a graphical tool for predicting sur-
vival in very preterm births. Archives of Disease in Childhood. Fetal and Neonatal Edition,
95(1):F14–F19.

[3] Dahiya RC, Gurland J (1973). How many classes in the Pearson chi-square test? Journal of
the American Statistical Association, 68(343):707–712.

[4] Hosmer DW, Lemeshow S (1980). Goodness of fit tests for the multiple logistic regression
model. Communications in Statistics — Theory and Methods, 9(10):1043–1069.

[5] Hosmer DW, Lemeshow SL (2000). Applied Logistic Regression. John Wiley & Sons, Inc.,
New York.

[6] Hukkelhoven C, Rampen A, Maas A, Farace E, Habbema J, Marmarou A, Marshall L, Murray


G, Steyerberg E (2006). Some prognostic models for traumatic brain injury were not valid.
Journal of Clinical Epidemiology, 59(2):132–143.

[7] Krag D, Weaver D, Ashikaga T, Moffat F, Klimberg VS, Shriver C, Feldman S, Kusminsky
R, Gadd M, Kuhn J, Harlow S, Beitsch P, Whitworth P, Foster R, Dowlatshahi K (1998).
The sentinel node in breast cancer — a multicenter validation study. New England Journal of
Medicine, 339(14):941–946.

[8] Kramer AA, Zimmerman JE (2007). Assessing the calibration of mortality benchmarks in
critical care: the Hosmer–Lemeshow test revisited. Critical Care Medicine, 35(9):2052–
2056.


[9] Moore DS, Spruill MC (1975). Unified large-sample theory of general chi-squared statistics
for tests of fit. The Annals of Statistics, 3(3):599–616.

[10] Moro S, Cortez P, Rita P (2014). A data-driven approach to predict the success of bank
telemarketing. Decision Support Systems, 62:22–31.

[11] Newman TB, Escobar GJ, Gonzales VM, Armstrong MA, Gardner MN, Folck BF (1999).
Frequency of neonatal bilirubin testing and hyperbilirubinemia in a large health maintenance
organization. Pediatrics, 104(5):1198–1203.

[12] Nashef SAM, Roques F, Michel P, Gauducheau E, Lemeshow S, Salamon R (1999). European
system for cardiac operative risk evaluation (EuroSCORE). European Journal of Cardio-
Thoracic Surgery, 16(1):9–13.

[13] Paul P, Pennell ML, Lemeshow S (2013). Standardizing the power of the Hosmer–Lemeshow
goodness of fit test in large data sets. Statistics in Medicine, 32:67–80.

[14] Schein A, Ungar L (2004). A-optimality for active learning of logistic regression classifiers.
Tech. Rep., MS-CIS-04-07, University of Pennsylvania Department of Computer and Infor-
mation Science.

[15] Tang EK, Suganthan PN, Yao X, Qin AK (2005). Linear dimensionality reduction using
relevance weighted LDA. Pattern Recognition, 38:485–493.


Table 1: Power of the method in Paul et al. (2013) for the 6 models in (2.1) under different n.

                 Model
Sample size n    1        2        3        4        5        6
500 0.641 0.060 0.072 0.199 0.176 0.052
1000 0.951 0.055 0.115 0.385 0.349 0.083
2000 0.993 0.061 0.171 0.529 0.431 0.065
4000 0.999 0.073 0.294 0.621 0.477 0.046
25000 1.000 0.119 0.765 0.988 0.916 0.036


Table 2: Critical values of the proposed method computed by Theorem 1 under different n and m.

                 Number of groups m
Sample size n    6        10       18       34       66       130      802
500 10.779 19.280 35.115 65.571 125.306 243.616 *
1000 8.990 15.970 28.740 52.870 99.540 191.292 *
2000 8.114 14.398 25.808 47.172 88.051 167.525 *
4000 7.681 13.638 24.430 44.576 82.967 157.208 *
10000 7.424 13.192 23.638 43.122 80.202 151.775 871.030
50000 7.287 12.958 23.228 42.383 78.831 149.162 854.895

*m = 802 is not applicable for n = 500, 1000, 2000, 4000 as n/m is too small, so the results under these combinations
are empty.


Table 3: Power of the proposed test for the 6 models in (2.1) under different n and m.
                         Number of groups m
Model   Sample size n    6       10      18      34      66      130     802
1 500 0.424 0.437 0.388 0.277 0.145 0.058 *
1000 0.542 0.602 0.614 0.536 0.478 0.226 *
2000 0.600 0.726 0.747 0.736 0.708 0.589 *
4000 0.656 0.752 0.818 0.853 0.831 0.785 *
10000 0.645 0.786 0.872 0.904 0.929 0.905 0.784
50000 0.683 0.793 0.886 0.923 0.938 0.935 0.972
2 500 0.050 0.057 0.048 0.062 0.066 0.056 *
1000 0.047 0.036 0.045 0.066 0.073 0.064 *
2000 0.053 0.049 0.057 0.041 0.063 0.063 *
4000 0.049 0.064 0.059 0.068 0.046 0.058 *
10000 0.052 0.065 0.049 0.071 0.057 0.050 0.063
50000 0.097 0.066 0.068 0.074 0.081 0.079 0.056
3 500 0.071 0.080 0.095 0.096 0.128 0.156 *
1000 0.081 0.080 0.082 0.130 0.139 0.193 *
2000 0.073 0.106 0.104 0.142 0.180 0.218 *
4000 0.076 0.091 0.131 0.133 0.172 0.190 *
10000 0.095 0.088 0.138 0.129 0.166 0.201 0.367
50000 0.135 0.117 0.105 0.153 0.196 0.210 0.369
4 500 0.135 0.122 0.128 0.126 0.096 0.115 *
1000 0.181 0.191 0.194 0.173 0.149 0.159 *
2000 0.240 0.260 0.265 0.233 0.200 0.197 *
4000 0.292 0.283 0.314 0.334 0.297 0.303 *
10000 0.304 0.309 0.327 0.371 0.359 0.361 0.326
50000 0.270 0.336 0.366 0.378 0.420 0.411 0.425
5 500 0.120 0.120 0.115 0.085 0.089 0.101 *
1000 0.153 0.165 0.150 0.144 0.103 0.102 *
2000 0.222 0.240 0.243 0.212 0.182 0.162 *
4000 0.251 0.271 0.280 0.288 0.233 0.222 *
10000 0.275 0.281 0.295 0.332 0.323 0.325 0.209
50000 0.264 0.321 0.328 0.353 0.360 0.369 0.333
6 500 0.070 0.054 0.040 0.040 0.031 0.024 *
1000 0.075 0.072 0.056 0.040 0.035 0.022 *
2000 0.089 0.078 0.074 0.069 0.036 0.020 *
4000 0.133 0.123 0.109 0.094 0.056 0.047 *
10000 0.158 0.140 0.153 0.120 0.082 0.083 0.013
50000 0.165 0.131 0.155 0.160 0.140 0.148 0.082
*m = 802 is not applicable for n = 500, 1000, 2000, 4000 as n/m is too small, so the results under these combinations
are empty.


Table 4: Power of the proposed test and the Hosmer–Lemeshow test under very large sample sizes.

                      Sample size n (×10^5)
Model     Method      0.5     1     2     3     4     5     10
model 2 HL 0.198 0.352 0.758 0.924 0.982 0.992 1.000
proposed 0.082 0.104 0.138 0.120 0.138 0.082 0.114
model 6 HL 1.000 1.000 1.000 1.000 1.000 1.000 1.000
proposed 0.168 0.158 0.162 0.148 0.148 0.172 0.142


Table 5: Power of the three methods for the bank marketing data set under re-sampling sizes n =
500, 1000, 2000, 4000, 10000.

n Hosmer–Lemeshow Paul et al. (2013) Proposed method


500 0.064 0.064 0.055
1000 0.357 0.357 0.193
2000 0.914 0.751 0.322
4000 1.000 0.981 0.403
10000 1.000 1.000 0.496



Figure 1: Theoretical power of the Hosmer–Lemeshow test and the proposed test versus the sample size n
for n ∈ [500, 10000] under different λ0 and fixed m = 10 and α = 0.05. The sub-figures correspond to
(a) λ0 = 0; (b) λ0 = 0.5; and (c) λ0 = 1. The solid line denotes the Hosmer–Lemeshow method and the
dashed line denotes our proposed method.


[Figure 2: six panels of power-versus-n curves, (a) model 1; (b) model 2; (c) model 3; (d) model 4;
(e) model 5; (f) model 6.]

Figure 2: Power of the three methods for n = 500, 600, …, 10000 and m = 10, where the solid line
represents the Hosmer–Lemeshow method, the dotted line represents the method of Paul et al. (2013) and
the dashed line represents our proposed method. The x-axis stands for n and the y-axis for power.
