On Relationship Between Multicollinearity and Separation in Logistic Regression
To cite this article: Guoping Zeng & Emily Zeng (2019): On the relationship between multicollinearity and separation in logistic regression, Communications in Statistics - Simulation and Computation. https://doi.org/10.1080/03610918.2019.1589511
1. Introduction
Logistic regression studies the relationship between a binary dependent variable and a
set of independent (also called explanatory) variables. It has recently been widely used
in credit scoring (Siddiqi 2006; Refaat 2011; Zeng 2013, 2014a, 2014b, 2015, 2017a,
2017b, 2017c, 2019; Zeng and Zeng 2018).
Multicollinearity and separation are two major issues in logistic regression.
Multicollinearity is a phenomenon in which two or more independent variables are
linearly related. Collinearity is the special case of multicollinearity in which there is a
linear association between exactly two independent variables. Separation means that a
hyperplane in the space of the independent variables separates the observations with
the two values, 0 and 1, of the dependent variable.
Multicollinearity is much easier to detect and to handle than separation.
Detecting multicollinearity reduces to a linear algebra problem: checking whether the
design matrix has full column rank. For instance, Procedure Logistic in SAS first checks
for multicollinearity. If multicollinearity is detected, independent variables are removed
until no linear relationship remains among the remaining independent variables.
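This removal step can be sketched as a greedy rank check: keep a column only if it is not a linear combination of the columns already kept. This is a minimal sketch of the idea, not SAS's actual implementation:

```python
import numpy as np

def drop_dependent_columns(X, tol=1e-10):
    """Keep columns of X only while they remain linearly independent,
    mimicking the pre-screening step described above.
    Returns the indices of the columns that are kept."""
    kept = []
    for j in range(X.shape[1]):
        candidate = X[:, kept + [j]]
        # Keep column j only if adding it increases the column rank.
        if np.linalg.matrix_rank(candidate, tol=tol) == len(kept) + 1:
            kept.append(j)
    return kept

# Illustrative data: column 2 equals column 0 plus column 1.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 3.0],
              [3.0, 1.0, 4.0],
              [4.0, 5.0, 9.0]])
print(drop_dependent_columns(X))  # [0, 1]: the dependent column is dropped
```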
However, the relationship between multicollinearity and separation has never been
studied in the literature. Rather, the two have been studied separately. Albert and
Anderson (1984) introduced the concept of separation in order to study the existence
and uniqueness of solutions in logistic regression. As a starting point, they assumed that
the underlying data had no multicollinearity. Demidenko (2001) studied computational
aspects of logistic regression under the same assumption of no multicollinearity. Sarlija
et al. (2017) studied procedures and pitfalls in developing and interpreting logistic
regression prediction models.
2. Preliminaries
2.1. Logistic regression
To start with, assume that x = (x_1, x_2, ..., x_p) is the vector of p independent
variables and y is the binary dependent variable (also called the response or target)
with values 0 and 1. Assume we have a sample of N independent observations
(x_{i1}, x_{i2}, ..., x_{ip}, y_i), i = 1, 2, ..., N, where y_i denotes the value of y and
x_{i1}, x_{i2}, ..., x_{ip} are the values of x_1, x_2, ..., x_p for the i-th observation. We
also assume that N > p.

Let X_i be the row vector (1, x_{i1}, x_{i2}, ..., x_{ip}) for i = 1, 2, ..., N, and denote by
X the N × (p+1) matrix with the X_i as its rows. Here, we use (p+1)-dimensional rather
than p-dimensional row vectors, setting the first component to 1 in order to absorb the
intercept term into the regression parameters.
To adopt the standard notation of logistic regression in Hosmer et al. (2013), we use the
quantity p(x) = P(y = 1 | x) to represent the conditional probability that y is equal to 1
given x. It follows that 1 − p(x) is the conditional probability that y is equal to zero
given x. The logistic regression model is given by the equation

\[
p(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}. \tag{2.1}
\]

The logit transformation of p(x) is

\[
g(x) = \ln \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p. \tag{2.2}
\]
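Equations (2.1) and (2.2) can be checked numerically: applying the logit to p(x) recovers the linear predictor. A minimal sketch with illustrative coefficients (not taken from the paper):

```python
import math

def p_of_x(beta, x):
    """Model (2.1): p(x) = exp(b0 + b1*x1 + ... + bp*xp) / (1 + exp(...))."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """The logit transformation (2.2): g = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

beta = [0.5, 1.0, -2.0]    # illustrative coefficients (b0, b1, b2)
x = [1.0, 0.25]
p = p_of_x(beta, x)
# logit(p) recovers the linear predictor b0 + b1*x1 + b2*x2 = 1.0
```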
The method of maximum likelihood aims at yielding values for the unknown parameters
β_0, β_1, ..., β_p which maximize the probability of obtaining the observed data. We
first construct a so-called likelihood function, which expresses the probability of the
observed data as a function of the unknown parameters. The contribution of an observation
to the likelihood function is p(x_{i1}, x_{i2}, ..., x_{ip}) when y_i = 1, and
1 − p(x_{i1}, x_{i2}, ..., x_{ip}) when y_i = 0. In other words, we attempt to maximize
p(x_{i1}, x_{i2}, ..., x_{ip}) when y_i = 1 and 1 − p(x_{i1}, x_{i2}, ..., x_{ip}) when
y_i = 0. A unified way to express the contribution of an observation to the likelihood
function is

\[
p(x_{i1}, x_{i2}, \dots, x_{ip})^{y_i} \left[ 1 - p(x_{i1}, x_{i2}, \dots, x_{ip}) \right]^{1 - y_i}. \tag{2.3}
\]
The value of β given by the solution to (2.5) is called the MLE. To maximize the function
l(X, β), we differentiate it with respect to β, set the derivatives equal to 0, and then solve
the resulting set of equations. We obtain

\[
\sum_{i=1}^{N} y_i = \sum_{i=1}^{N} p(x_{i1}, x_{i2}, \dots, x_{ip}),
\qquad
\sum_{i=1}^{N} x_{ij} y_i = \sum_{i=1}^{N} x_{ij}\, p(x_{i1}, x_{i2}, \dots, x_{ip}), \quad j = 1, 2, \dots, p, \tag{2.6}
\]

as in Hosmer et al. (2013). Since β is a vector of (p+1) elements, (2.6) is a set of (p+1)
equations. If the equations in (2.6) cannot be solved explicitly, they can be solved
approximately by the Newton-Raphson algorithm as follows (see Refaat 2011):

\[
\beta^{(i+1)} = \beta^{(i)} - \left[\ddot{l}\left(X, \beta^{(i)}\right)\right]^{-1} \dot{l}\left(X, \beta^{(i)}\right), \tag{2.7}
\]

where \(\dot{l}\) is the column vector of dimension (p+1) of the first derivatives of the
log-likelihood l(X, β), whose j-th element is

\[
\frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{N} \left[ y_i x_{ij} - p(x_{i1}, x_{i2}, \dots, x_{ip})\, x_{ij} \right], \quad j = 0, 1, \dots, p,
\]

and \(\ddot{l}\) is the (p+1) × (p+1) matrix of second derivatives, whose (j, k)-th element is

\[
\frac{\partial^2 l}{\partial \beta_j \partial \beta_k} = -\sum_{i=1}^{N} x_{ij} x_{ik}\, p(x_{i1}, x_{i2}, \dots, x_{ip}) \left[ 1 - p(x_{i1}, x_{i2}, \dots, x_{ip}) \right].
\]

Note that \(\ddot{l}(X, \beta^{(i)}) = -X^T V^{(i)} X\), where \(V^{(i)}\) is the N × N diagonal
matrix with diagonal entries \(p_1^{(i)}(1 - p_1^{(i)}), p_2^{(i)}(1 - p_2^{(i)}), \dots, p_N^{(i)}(1 - p_N^{(i)})\).
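The iteration (2.7) can be sketched in a few lines. This is a minimal sketch assuming the data have no multicollinearity and have overlap, so that each Newton step is well defined and the iterates converge:

```python
import numpy as np

def newton_raphson_logistic(X, y, max_iter=25, tol=1e-8):
    """Fit logistic regression by iterating (2.7).
    X is the N x (p+1) design matrix (first column all ones); y is 0/1.
    Assumes X has full column rank and the data overlap, so that
    X^T V X is invertible and the iterates converge."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p_i at the current beta
        grad = X.T @ (y - p)                   # l_dot: first derivatives
        V = np.diag(p * (1.0 - p))
        hess = -X.T @ V @ X                    # l_ddot = -X^T V X
        step = np.linalg.solve(hess, grad)
        beta = beta - step                     # the update (2.7)
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Illustrative overlapping data: both response groups mix along x.
X = np.column_stack([np.ones(6), np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
beta_hat = newton_raphson_logistic(X, y)  # solves the score equations (2.6)
```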
2.2. Multicollinearity
Multicollinearity is a phenomenon in which independent variables are linearly related.
Mathematically, we can define multicollinearity as follows.
Definition 2.1. A logistic regression model with independent variables x1 ; x2 ; :::; xp is
said to have multicollinearity if there exist constant a0 ; a1 ; :::; ap such that
X
p
a0 x0 þ ai xi ¼ 0 (2.8)
i¼1
where at least two of a1 ; :::; ap are nonzero, and x0 is a N-dimensional vector with all
elements 1.
Remark 2.2. (2.8) is equivalent to Xi a ¼ 0 for i ¼ 1; 2; :::; N; where a is a
ðp þ 1Þ-dimensional vector ða0 ; a1 ; :::; ap Þ T :
Remark 2.3. Definition 2.1 covers all the subsets of fx1 ; x2 ; :::; xp g with two or more
variables. If i variables with i < p are linearly related, then aj ¼ 0 if j 6¼ i:
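By Remark 2.2, checking Definition 2.1 amounts to asking whether Xa = 0 has a nonzero solution, i.e. whether X lacks full column rank. A sketch via the singular value decomposition, which also recovers a witnessing vector a (the variable names are illustrative):

```python
import numpy as np

def multicollinearity_vector(X, tol=1e-10):
    """Return a nonzero a with X a = 0 if the model has multicollinearity
    (Definition 2.1, via Remark 2.2), or None when the columns of the
    N x (p+1) design matrix X are linearly independent."""
    u, s, vt = np.linalg.svd(X)
    if s[-1] > tol * s[0]:
        return None          # full column rank: no multicollinearity
    return vt[-1]            # right singular vector with X a ~ 0

# Illustrative data: x2 = 2*x1 + 1, so (1, x1, x2) are linearly related.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones(4), x1, 2 * x1 + 1])
a = multicollinearity_vector(X)
# a is proportional to (1, 2, -1): 1*1 + 2*x1 - (2*x1 + 1) = 0
```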
2.3. Separation
Separation includes complete separation and quasi-complete separation (Albert and
Anderson 1984).
Definition 2.5. There is a complete separation of the data points if there exists a vector
b = (b_0, b_1, ..., b_p)^T that correctly allocates all observations to their response
groups; that is,

\[
\begin{cases}
\displaystyle\sum_{j=0}^{p} b_j x_{ij} = X_i b = b^T X_i^T > 0, & y_i = 1, \\[6pt]
\displaystyle\sum_{j=0}^{p} b_j x_{ij} = X_i b = b^T X_i^T < 0, & y_i = 0.
\end{cases} \tag{2.9}
\]
Definition 2.6. There is quasi-complete separation if the data are not completely
separable, but there exists a vector b = (b_0, b_1, ..., b_p)^T such that

\[
\begin{cases}
\displaystyle\sum_{j=0}^{p} b_j x_{ij} = X_i b = b^T X_i^T \ge 0, & y_i = 1, \\[6pt]
\displaystyle\sum_{j=0}^{p} b_j x_{ij} = X_i b = b^T X_i^T \le 0, & y_i = 0,
\end{cases} \tag{2.10}
\]

and equality holds for at least one subject in each response group.
As pointed out by Zeng (2017b), each of (2.9) and (2.10) has a variant by exchanging
yi ¼ 1 and yi ¼ 0:
Definition 2.7. If neither complete nor quasi-complete separation exists, then the data
are said to have overlap.
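Condition (2.9) can be tested computationally: since b may be rescaled, the strict inequalities hold for some b if and only if the system (2y_i − 1) X_i b ≥ 1 is feasible, which is a linear program. The sketch below (not from the paper; quasi-complete separation (2.10) would require a more careful test) assumes scipy is available:

```python
import numpy as np
from scipy.optimize import linprog

def has_complete_separation(X, y):
    """Test Definition 2.5: does some b give X_i b > 0 whenever y_i = 1
    and X_i b < 0 whenever y_i = 0?  By rescaling b, this is equivalent
    to feasibility of (2*y_i - 1) * X_i b >= 1 for all i.
    X is the N x (p+1) design matrix with a leading column of ones."""
    signs = 2.0 * np.asarray(y, dtype=float) - 1.0  # +1 if y=1, -1 if y=0
    A_ub = -signs[:, None] * X                      # -(2y-1) X_i b <= -1
    b_ub = -np.ones(len(y))
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * X.shape[1], method="highs")
    return res.status == 0                          # feasible => separation

x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones(4), x])
print(has_complete_separation(X, [0, 0, 1, 1]))   # True: x = 0 separates
print(has_complete_separation(X, [0, 1, 0, 1]))   # False: the classes overlap
```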
Proof. It follows from multicollinearity that X is not of full column rank. Hence, the
square matrix \(X^T V^{(i)} X\) is not of full rank, i.e., \(\ddot{l}(X, \beta^{(i)}) = -X^T V^{(i)} X\)
is a singular matrix and cannot be inverted. Moreover, by Theorem 4.1 there is no finite
solution to the MLE. ∎
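The rank deficiency in the proof is easy to exhibit numerically. A small illustration with a duplicated independent variable (illustrative data, evaluated at β = 0 so that every p_i = 0.5):

```python
import numpy as np

# Under multicollinearity X lacks full column rank, so X^T V X is singular
# and the Newton step (2.7) cannot be computed.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones(4), x1, x1])   # x2 = x1: multicollinear
p = np.full(4, 0.5)                          # p_i at beta = 0
V = np.diag(p * (1.0 - p))
H = X.T @ V @ X                              # X^T V X, a 3 x 3 matrix
print(np.linalg.matrix_rank(H))              # 2 < 3: singular, not invertible
```

Since V is positive definite, X^T V X has the same rank as X, so any multicollinearity in X makes it singular.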
5. Conclusions
In this paper, we have studied the relationship between multicollinearity and separation.
We conclude that
ORCID
Guoping Zeng http://orcid.org/0000-0002-3403-5418
References
Albert, A., and J. A. Anderson. 1984. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71 (1):1–10.
Chatelain, J. B., and K. Ralf. 2014. Spurious regressions and near multicollinearity, with an application to aid, policies and growth. Journal of Macroeconomics 39 (PA):85–96.
Demidenko, E. 2001. Computational aspects of probit model. Mathematical Communications 6:233–47.
Hosmer, D. W., S. Lemeshow, and R. X. Sturdivant. 2013. Applied logistic regression. 3rd ed. New York: John Wiley & Sons, Inc.
Midi, H., S. K. Sarkar, and S. Rana. 2013. Collinearity diagnostics of binary logistic regression model. Journal of Interdisciplinary Mathematics 13 (3):253–67. doi:10.1080/09720502.2010.10700699.
Murray, L., H. Nguyen, Y. Lee, M. D. Remmenga, and D. Smith. 2012. Variance inflation factors in regression models with dummy variables. In Proceedings of the Annual Conference on Applied Statistics in Agriculture, 160–77. Kansas State University Libraries, New Prairie Press.
Refaat, M. 2011. Credit risk scorecards: Development and implementation using SAS. Raleigh, NC: Lulu.com.
Sarlija, N., A. Bilandzic, and M. Stanic. 2017. Logistic regression modelling: Procedures and pitfalls in developing and interpreting prediction models. Croatian Operational Research Review 8:631–52.
Shen, J., and S. Gao. 2008. A solution to separation and multicollinearity in multiple logistic regression. Journal of Data Science 6 (4):515–31.
Siddiqi, N. 2006. Credit risk scorecards: Developing and implementing intelligent credit scoring. Hoboken, NJ: John Wiley & Sons, Inc.
Yoo, W., R. Mayberry, S. Bae, K. Singh, Q. Peter He, and J. W. Lillard Jr. 2014. A study of effects of multicollinearity in the multivariable analysis. International Journal of Applied Science and Technology 4 (5):9–19.
Zeng, G. 2013. Metric divergence measures and information value in credit scoring. Journal of Mathematics 2013: Article ID 848271. doi:10.1155/2013/848271.
Zeng, G. 2014a. A rule of thumb for reject inference in credit scoring. Mathematical Finance Letters 2014:2, 1–13.
Zeng, G. 2014b. A necessary condition for a good binning algorithm in credit scoring. Applied Mathematical Sciences 8 (65):3229–43. doi:10.12988/2014.44300.
Zeng, G. 2015. A unified definition of mutual information with applications in machine learning. Mathematical Problems in Engineering 2015: Article ID 201874. doi:10.1155/2015/201874.
Zeng, G. 2017a. A comparison study of computational methods of Kolmogorov–Smirnov statistic in credit scoring. Communications in Statistics - Simulation and Computation 46 (10):7744–60. doi:10.1080/03610918.2016.1249883.
Zeng, G. 2017b. Invariant properties of logistic regression model in credit scoring under monotonic transformations. Communications in Statistics - Theory and Methods 46 (17):8791–807. doi:10.1080/03610926.2016.1193200.
Zeng, G. 2017c. On the existence of maximum likelihood estimates for weighted logistic regression. Communications in Statistics - Theory and Methods 46 (22):11194–203. doi:10.1080/03610926.2016.1260743.
Zeng, G. 2019. On the confusion matrix in credit scoring and its analytical properties. Communications in Statistics - Theory and Methods. doi:10.1080/03610926.2019.1568485.
Zeng, G., and E. Zeng. 2018. On the three-way equivalence of AUC in credit scoring with tied scores. Communications in Statistics - Theory and Methods. doi:10.1080/03610926.2018.1435814.