
Communications in Statistics - Simulation and Computation

ISSN: 0361-0918 (Print) 1532-4141 (Online) Journal homepage: https://www.tandfonline.com/loi/lssp20

On the relationship between multicollinearity and separation in logistic regression

Guoping Zeng & Emily Zeng

To cite this article: Guoping Zeng & Emily Zeng (2019): On the relationship between multicollinearity and separation in logistic regression, Communications in Statistics - Simulation and Computation, DOI: 10.1080/03610918.2019.1589511

To link to this article: https://doi.org/10.1080/03610918.2019.1589511

Published online: 28 Mar 2019.


On the relationship between multicollinearity and separation in logistic regression

Guoping Zeng^a and Emily Zeng^b

^a 4522 Oak Shores Dr, Plano, TX, USA; ^b College of Letters & Science, University of California at Berkeley, Berkeley, CA, USA

ABSTRACT

Multicollinearity and separation are two major issues in logistic regression. In this paper, for the first time we study the relationship between multicollinearity and separation. We analytically prove that multicollinearity implies quasi-complete separation. Through counter examples, we show that multicollinearity does not always imply complete separation and that separation does not always imply multicollinearity. We also present the consequences of multicollinearity and separation. We analytically prove that multicollinearity means no finite solution to maximum likelihood estimate and that separation means no finite solution to maximum likelihood estimate.

ARTICLE HISTORY
Received 22 October 2018; Accepted 26 February 2019

KEYWORDS
Multicollinearity; Separation; Complete separation; Quasi-complete separation; Logistic regression; Maximum likelihood estimate

1. Introduction
Logistic regression studies the relationship between a binary dependent variable and a set of independent (also called explanatory) variables. It has recently been widely used in credit scoring (Siddiqi 2006; Refaat 2011; Zeng 2013; Zeng 2014a; Zeng 2014b; Zeng 2015; Zeng 2017a; Zeng 2017b; Zeng 2017c; Zeng and Zeng 2018; Zeng 2019).

Multicollinearity and separation are two major issues in logistic regression. Multicollinearity is a phenomenon in which two or more independent variables are linearly related. Collinearity is a special case of multicollinearity in which there is a linear association between exactly two independent variables. Separation means that a straight line (more generally, a hyperplane) separates the two values, 0 and 1, of the dependent variable.
Multicollinearity is much easier to detect and to handle than separation is.
Multicollinearity reduces to a linear algebra problem of non-full rank. For instance,
Procedure Logistic in SAS first checks multicollinearity. If multicollinearity is detected,
then the independent variables are removed until there is no linear relationship among
the remaining independent variables.
However, the relationship between multicollinearity and separation has never been studied in the literature. Rather, the two have been studied separately. Albert and Anderson (1984) introduced the concept of separation in order to study the existence and uniqueness of solutions in logistic regression. As a starting point, they assumed that the underlying data had no multicollinearity. Demidenko (2001) studied computational aspects of logistic regression, also assuming that the underlying data had no multicollinearity. Sarlija et al. (2017) treated non-multicollinearity as one assumption of logistic regression.


Midi, Sarkar, and Rana (2013) researched how to detect multicollinearity. Yoo et al. (2014) performed a simulation study to investigate the effects of multicollinearity under various correlation structures among the explanatory variables. Shen and Gao (2008) used a double penalization of the log likelihood to handle multicollinearity and separation simultaneously.
Moreover, the literature about the consequences of multicollinearity and separation is
quite limited. The consequences of separation were studied by Albert and Anderson
(1984) under the assumption of no multicollinearity.
In this paper, for the first time we study the relationship between multicollinearity and separation. We analytically prove that multicollinearity implies quasi-complete separation. We show that multicollinearity does not always imply complete separation. We also show that separation does not always imply multicollinearity. We then present the consequences of multicollinearity and separation. We analytically prove that multicollinearity means no finite solution to the maximum likelihood estimate (MLE) and that separation means no finite solution to the MLE. We also study the consequences of non-multicollinearity and of non-multicollinearity combined with non-separation.
The remainder of the paper is organized as follows. In Sec. 2, we briefly introduce
logistic regression, multicollinearity and separation. In Sec. 3, we study the relationship
between multicollinearity and separation. In Sec. 4, we present the consequences of
multicollinearity and separation. Finally, the paper is concluded in Sec. 5.
Throughout the paper, all independent variables are assumed to be continuous. A
solution to MLE means a finite solution.

2. Preliminaries
2.1. Logistic regression
To start with, let us assume that $x = (x_1, x_2, \ldots, x_p)$ is the vector of $p$ independent variables and $y$ is the binary dependent variable (also called response or target) with values of 0 and 1. Assume we have a sample of $N$ independent observations $(x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)$, $i = 1, 2, \ldots, N$, where $y_i$ denotes the value of $y$ and $x_{i1}, x_{i2}, \ldots, x_{ip}$ are the values of $x_1, x_2, \ldots, x_p$ for the $i$-th observation. We also assume that $N > p$.

Let $X_i$ be the row vector $(1, x_{i1}, x_{i2}, \ldots, x_{ip})$ for $i = 1, 2, \ldots, N$, and denote by $X$ the $N \times (p+1)$ matrix with the $X_i$ as rows. Here, we use $(p+1)$-dimensional row vectors rather than $p$-dimensional row vectors, setting the first component to 1 in order to absorb the intercept term into the regression parameters.

To adopt the standard notation of logistic regression in Hosmer et al. (2013), we use the quantity $\pi(x) = P(y = 1 \mid x)$ to represent the conditional probability that $y$ is equal to 1 given $x$. It follows that $1 - \pi(x)$ is the conditional probability that $y$ is equal to zero given $x$. The logistic regression model is given by the equation

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}. \tag{2.1}$$
The logit transformation of $\pi(x)$ is

$$g(x) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. \tag{2.2}$$

The method of maximum likelihood aims at yielding values for the unknown parameters $\beta_0, \beta_1, \ldots, \beta_p$ which maximize the probability of obtaining the observed data. We first construct a so-called likelihood function, which expresses the probability of the observed data as a function of the unknown parameters. The contribution of an observation to the likelihood function is $\pi(x_{i1}, x_{i2}, \ldots, x_{ip})$ when $y_i = 1$, and $1 - \pi(x_{i1}, x_{i2}, \ldots, x_{ip})$ when $y_i = 0$. In other words, we attempt to maximize $\pi(x_{i1}, x_{i2}, \ldots, x_{ip})$ when $y_i = 1$ and $1 - \pi(x_{i1}, x_{i2}, \ldots, x_{ip})$ when $y_i = 0$. A unified way to express the contribution of an observation to the likelihood function is

$$\pi(x_{i1}, x_{i2}, \ldots, x_{ip})^{y_i} \left[1 - \pi(x_{i1}, x_{i2}, \ldots, x_{ip})\right]^{1 - y_i}. \tag{2.3}$$

Since the observations are assumed to be independent, the likelihood function is obtained as the product of all the contributions:

$$h(X, \beta) = \prod_{i=1}^{N} \pi(x_{i1}, x_{i2}, \ldots, x_{ip})^{y_i} \left[1 - \pi(x_{i1}, x_{i2}, \ldots, x_{ip})\right]^{1 - y_i} = \prod_{i=1}^{N} \left(\frac{e^{X_i \beta}}{1 + e^{X_i \beta}}\right)^{y_i} \left(\frac{1}{1 + e^{X_i \beta}}\right)^{1 - y_i}, \tag{2.4}$$

where $\beta$ is the column vector $(\beta_0, \beta_1, \ldots, \beta_p)^T$, the superscript $T$ denotes matrix transpose, and $X_i \beta$ represents matrix multiplication.
Since it is easier to work with the logarithm of (2.4), the log likelihood is used instead:

$$l(X, \beta) = \ln(h(X, \beta)) = \sum_{i=1}^{N} y_i \ln\left(\frac{e^{X_i \beta}}{1 + e^{X_i \beta}}\right) + \sum_{i=1}^{N} (1 - y_i) \ln\left(\frac{1}{1 + e^{X_i \beta}}\right). \tag{2.5}$$

The value of $\beta$ that maximizes (2.5) is called the MLE. To maximize the function $l(X, \beta)$, we differentiate it with respect to $\beta$, set the derivatives equal to 0, and then solve the resulting set of equations. We obtain

$$\sum_{i=1}^{N} y_i = \sum_{i=1}^{N} \pi(x_{i1}, x_{i2}, \ldots, x_{ip}), \qquad \sum_{i=1}^{N} x_{ij} y_i = \sum_{i=1}^{N} x_{ij}\, \pi(x_{i1}, x_{i2}, \ldots, x_{ip}), \quad j = 1, 2, \ldots, p, \tag{2.6}$$

as in Hosmer et al. (2013). Since $\beta$ is a vector of $(p+1)$ elements, (2.6) is a set of $(p+1)$ equations. If the equations in (2.6) cannot be solved explicitly, they can be solved approximately and numerically by the Newton-Raphson algorithm as follows (see Refaat 2011):

$$\beta^{(i+1)} = \beta^{(i)} - \left[\ddot{l}\left(X, \beta^{(i)}\right)\right]^{-1} \dot{l}\left(X, \beta^{(i)}\right), \tag{2.7}$$

where $\dot{l}$ is the $(p+1)$-dimensional column vector of first derivatives of the log likelihood $l(X, \beta)$, whose $j$-th element is

$$\frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{N} \left[y_i x_{ij} - \pi(x_{i1}, x_{i2}, \ldots, x_{ip})\, x_{ij}\right], \quad j = 0, 1, \ldots, p,$$

and $\ddot{l}$ is the $(p+1) \times (p+1)$ matrix of second derivatives of the log likelihood $l(X, \beta)$, whose $(j, k)$ entry, for $j, k = 0, 1, \ldots, p$, is

$$\frac{\partial^2 l}{\partial \beta_j \partial \beta_k} = -\sum_{i=1}^{N} x_{ij} x_{ik}\, \pi(x_{i1}, x_{i2}, \ldots, x_{ip}) \left[1 - \pi(x_{i1}, x_{i2}, \ldots, x_{ip})\right].$$

Note that $\ddot{l}\left(X, \beta^{(i)}\right) = -X^T V^{(i)} X$, where $V^{(i)}$ is the $N \times N$ diagonal matrix with diagonal entries $\pi_1^{(i)}\left(1 - \pi_1^{(i)}\right), \pi_2^{(i)}\left(1 - \pi_2^{(i)}\right), \ldots, \pi_N^{(i)}\left(1 - \pi_N^{(i)}\right)$.
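To make the iteration (2.7) concrete, the following is a minimal sketch in Python with NumPy. It is our own illustration, not code from the paper; the function name logistic_mle_newton, the zero starting value, the tolerance, and the simulated data are all our choices.

```python
import numpy as np

def logistic_mle_newton(X, y, max_iter=25, tol=1e-8):
    """Newton-Raphson iteration (2.7) for the logistic log likelihood (2.5).

    X is the N x (p+1) design matrix whose first column is all ones, and
    y is the N-vector of 0/1 responses. Returns the estimated beta, or
    raises if the iteration fails (e.g. singular Hessian or separation).
    """
    beta = np.zeros(X.shape[1])                  # starting value beta^(0)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))     # pi(x_i) as in (2.1)
        score = X.T @ (y - pi)                   # first derivative l_dot
        V = np.diag(pi * (1.0 - pi))             # diagonal matrix V^(i)
        hessian = -X.T @ V @ X                   # second derivative l_ddot
        step = np.linalg.solve(hessian, score)   # [l_ddot]^(-1) l_dot
        beta = beta - step                       # update (2.7)
        if np.linalg.norm(step) < tol:
            return beta
    raise RuntimeError("Newton-Raphson did not converge (possible separation)")

# Simulated overlapping data: the iteration converges to a finite MLE.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1])
y = (x1 + rng.normal(size=200) > 0).astype(float)
print(logistic_mle_newton(X, y))
```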

2.2. Multicollinearity
Multicollinearity is a phenomenon in which the independent variables are linearly related. Mathematically, we can define multicollinearity as follows.

Definition 2.1. A logistic regression model with independent variables $x_1, x_2, \ldots, x_p$ is said to have multicollinearity if there exist constants $a_0, a_1, \ldots, a_p$ such that

$$a_0 x_0 + \sum_{i=1}^{p} a_i x_i = 0, \tag{2.8}$$

where at least two of $a_1, \ldots, a_p$ are nonzero, and $x_0$ is the $N$-dimensional vector with all elements equal to 1.

Remark 2.2. (2.8) is equivalent to $X_i a = 0$ for $i = 1, 2, \ldots, N$, where $a$ is the $(p+1)$-dimensional vector $(a_0, a_1, \ldots, a_p)^T$.

Remark 2.3. Definition 2.1 covers all the subsets of $\{x_1, x_2, \ldots, x_p\}$ with two or more variables. If only a subset of $i < p$ variables is linearly related, then $a_j = 0$ for every variable $x_j$ outside that subset.

Remark 2.4. By multicollinearity we mean perfect multicollinearity or pure multicollinearity, as in Midi, Sarkar, and Rana (2013). If there are strong but not exact linear dependencies among the independent variables, this may be called near-multicollinearity to distinguish it from pure multicollinearity, as in Chatelain and Ralf (2014). Near-multicollinearity is a subjective concept and can be detected by the VIF (variance inflation factor) (see Murray et al. 2012).
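As a rough illustration of Definition 2.1 and Remark 2.4 (ours, not part of the paper), the sketch below checks perfect multicollinearity through the column rank of the design matrix and computes crude VIFs for near-multicollinearity; the helper names are hypothetical.

```python
import numpy as np

def has_perfect_multicollinearity(X):
    """True if the N x (p+1) design matrix X (first column all ones) is not of
    full column rank, i.e. X a = 0 for some nonzero vector a, as in (2.8)."""
    return np.linalg.matrix_rank(X) < X.shape[1]

def variance_inflation_factors(X):
    """Crude VIFs for the non-intercept columns: VIF_j = 1 / (1 - R_j^2), where
    R_j^2 comes from regressing column j on all the remaining columns."""
    vifs = []
    for j in range(1, X.shape[1]):               # skip the intercept column
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / max(1.0 - r2, 1e-12))  # guard against exact collinearity
    return vifs

# Example: x2 = 2 * x1 gives perfect multicollinearity in the sense of Definition 2.1.
x1 = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
X = np.column_stack([np.ones_like(x1), x1, 2.0 * x1])
print(has_perfect_multicollinearity(X))          # True
print(variance_inflation_factors(X))             # huge VIFs for x1 and x2
```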

2.3. Separation
Separation includes complete separation and quasi-complete separation (Albert and
Anderson 1984).
Definition 2.5. There is a complete separation of the data points if there exists a vector $b = (b_0, b_1, \ldots, b_p)^T$ that correctly allocates all observations to their response groups; that is,

$$\sum_{j=0}^{p} b_j x_{ij} = X_i b = b^T X_i^T > 0 \ \text{ if } y_i = 1, \qquad \sum_{j=0}^{p} b_j x_{ij} = X_i b = b^T X_i^T < 0 \ \text{ if } y_i = 0. \tag{2.9}$$

Definition 2.6. There is quasi-complete separation if the data are not completely separable, but there exists a vector $b = (b_0, b_1, \ldots, b_p)^T$ such that

$$\sum_{j=0}^{p} b_j x_{ij} = X_i b = b^T X_i^T \ge 0 \ \text{ if } y_i = 1, \qquad \sum_{j=0}^{p} b_j x_{ij} = X_i b = b^T X_i^T \le 0 \ \text{ if } y_i = 0, \tag{2.10}$$

and equality holds for at least one subject in each response group.
As pointed out by Zeng (2017b), each of (2.9) and (2.10) has a variant obtained by exchanging $y_i = 1$ and $y_i = 0$.

Definition 2.7. If neither complete nor quasi-complete separation exists, then the data are said to have overlap.
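In practice, complete separation (Definition 2.5) can be screened for with a standard linear-programming feasibility check (our own sketch, not a method from this paper, assuming SciPy is available): a strictly separating $b$ exists exactly when $s_i X_i b \ge 1$ is feasible for all $i$ after rescaling $b$, where $s_i = +1$ if $y_i = 1$ and $s_i = -1$ if $y_i = 0$. Detecting quasi-complete separation (Definition 2.6) requires a finer analysis and is not attempted here.

```python
import numpy as np
from scipy.optimize import linprog

def has_complete_separation(X, y):
    """Check Definition 2.5 via a linear-programming feasibility problem."""
    s = np.where(y == 1, 1.0, -1.0)
    A_ub = -(s[:, None] * X)                     # encodes -s_i * (X_i b) <= -1
    b_ub = -np.ones(X.shape[0])
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * X.shape[1], method="highs")
    return res.success                           # feasible <=> completely separable

# Toy data where the line x1 - 0.5 = 0 separates the two response groups.
y = np.array([0, 0, 0, 1, 1, 1])
x1 = np.array([0.0, 0.1, 0.2, 0.8, 0.9, 1.0])
X = np.column_stack([np.ones(6), x1])
print(has_complete_separation(X, y))             # True
```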

3. Relationship between multicollinearity and separation


Multicollinearity is a concept involving only the independent variables, whereas separation is a concept involving both the independent variables and the dependent variable. Multicollinearity is also a concept for more than one independent variable, although the design matrix can be rank deficient even with a single variable: if the model has only one independent variable $x$ and $x$ takes only one value, then $X$ is not of full rank, and Procedure Logistic in SAS will report a linear relationship between the independent variable and the intercept.

3.1. Multicollinearity implies quasi-complete separation

Theorem 3.1. Multicollinearity implies quasi-complete separation.

Proof. If there is multicollinearity, then there exists a $(p+1)$-dimensional vector $a = (a_0, a_1, \ldots, a_p)^T$ such that $X_i a = 0$ for $i = 1, 2, \ldots, N$. By Definition 2.6, this is a quasi-complete separation, with equality holding for all $i = 1, 2, \ldots, N$ no matter whether $y_i = 1$ or $y_i = 0$. ∎

3.2. Multicollinearity does not always imply complete separation


Assume there are only two independent variables $x_1$ and $x_2$ such that $x_2 = 2 x_1$, as shown in Table 1, where the first column $n$ is the observation number. Then $x_1$ and $x_2$ are collinear. Yet, there is no complete separation. Indeed, if there were complete separation in the data, then there would exist $b_0, b_1, b_2$ such that

$$b_0 + b_1 x_1 + b_2 x_2 > 0 \ \text{ when } y = 1, \qquad b_0 + b_1 x_1 + b_2 x_2 < 0 \ \text{ when } y = 0. \tag{3.1}$$

Therefore, when $n = 3, 4$, we have $y = 1$ and $x_1 = 1$, $x_2 = 2$, so

$$b_0 + b_1 x_1 + b_2 x_2 = b_0 + b_1 + 2 b_2 > 0. \tag{3.2}$$

Table 1. Counter Example 1.


n y x1 x2
1 1 0 0
2 1 0 0
3 1 1 2
4 1 1 2
5 0 1 2
6 0 1 2
7 0 1 2
8 0 0 0
9 0 0 0
10 0 0 0

On the other hand, when $n = 5, 6, 7$, we have $y = 0$ and $x_1 = 1$, $x_2 = 2$, so

$$b_0 + b_1 x_1 + b_2 x_2 = b_0 + b_1 + 2 b_2 < 0. \tag{3.3}$$

We obtain the two contradicting inequalities (3.2) and (3.3). Hence, there is no complete separation in the data.
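The same contradiction can be checked mechanically: whenever one covariate pattern occurs in both response groups, no vector $b$ can satisfy (2.9). A small check of this kind on the Table 1 data (our own sketch, not part of the paper) is given below.

```python
import numpy as np

# Table 1 data: x2 = 2 * x1, so x1 and x2 are collinear.
y  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
x1 = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])
x2 = 2 * x1

# A covariate pattern that occurs in both response groups rules out complete
# separation, since X_i b would have to be simultaneously > 0 and < 0.
patterns = set(map(tuple, np.column_stack([x1, x2]).tolist()))
shared = [p for p in patterns
          if np.any((x1 == p[0]) & (x2 == p[1]) & (y == 1))
          and np.any((x1 == p[0]) & (x2 == p[1]) & (y == 0))]
print(sorted(shared))   # [(0, 0), (1, 2)]: both patterns occur with y = 1 and y = 0
```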

3.3. Separation does not always imply multicollinearity


We first show that complete separation does not always imply multicollinearity. Assume a logistic model has only two independent variables $x_1$ and $x_2$ such that $x_1 = y$; then the line $x_1 - 0.5 = 0$ completely separates the two values, 0 and 1, of the dependent variable. Yet, $x_1$ and $x_2$ are not collinear if $x_2$ has more than two values.

Next we show that quasi-complete separation does not always imply multicollinearity. Table 2 shows a quasi-complete separation with the line $x_1 - 3 = 0$. Yet, $x_1$ and $x_2$ are not collinear, since $x_1$ has 3 values but $x_2$ has 6 values.
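As a numerical companion to this example (ours, not the paper's), the sketch below encodes the Table 2 data, confirms that the design matrix has full column rank (no multicollinearity), and verifies that $b = (-3, 1, 0)$, i.e. the line $x_1 - 3 = 0$, satisfies the inequalities in (2.10).

```python
import numpy as np

# Table 2 data.
y  = np.array([0, 0, 0, 0, 1, 1])
x1 = np.array([1, 1, 2, 2, 3, 3])
x2 = np.array([1, 2, 3, 4, 5, 6])
X  = np.column_stack([np.ones(6), x1, x2])

# No multicollinearity: the design matrix has full column rank 3.
print(np.linalg.matrix_rank(X))                  # 3

# The line x1 - 3 = 0, i.e. b = (-3, 1, 0), gives X_i b >= 0 when y_i = 1 and
# X_i b <= 0 when y_i = 0, with equality for the y_i = 1 rows.
b = np.array([-3.0, 1.0, 0.0])
scores = X @ b
print(scores)                                    # [-2. -2. -1. -1.  0.  0.]
print(bool(np.all(scores[y == 1] >= 0) and np.all(scores[y == 0] <= 0)))   # True
```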

4. Consequences of multicollinearity and separation


4.1. Consequences of multicollinearity

Theorem 4.1. Multicollinearity means no finite solution to MLE.


Proof. By Definition 2.1, there is a $(p+1)$-dimensional vector $a = (a_0, a_1, \ldots, a_p)^T$ such that $X_i a = 0$ for $i = 1, 2, \ldots, N$. Let $e$ be the $(p+1)$-dimensional unit vector $e = (1, 0, \ldots, 0)^T$. Then $X_i (a + k e) = X_i a + k X_i e = k$ for any positive integer $k$, so

$$l(X, a + k e) = \sum_{i=1}^{N} y_i \ln\left(\frac{e^{X_i (a + k e)}}{1 + e^{X_i (a + k e)}}\right) + \sum_{i=1}^{N} (1 - y_i) \ln\left(\frac{1}{1 + e^{X_i (a + k e)}}\right) = \sum_{i=1}^{N} y_i \ln\left(\frac{e^{k}}{1 + e^{k}}\right) + \sum_{i=1}^{N} (1 - y_i) \ln\left(\frac{1}{1 + e^{k}}\right) \to -\infty$$

as $k \to \infty$. Hence, there is no finite solution to the MLE. ∎

Remark 4.2. Although multicollinearity implies quasi-complete separation by Theorem 3.1, and quasi-complete separation implies no finite solution to the MLE (Albert and Anderson 1984), we cannot conclude from these two facts that multicollinearity means no finite solution to the MLE, simply because Albert and Anderson (1984) assume no multicollinearity.

Table 2. Counter Example 2.


n y x1 x2
1 0 1 1
2 0 1 2
3 0 2 3
4 0 2 4
5 1 3 5
6 1 3 6

Theorem 4.3. Multicollinearity means that the Newton-Raphson iteration is not applicable.

Proof. It follows from multicollinearity that $X$ is not of full column rank. Hence, the square matrix $\ddot{l}\left(X, \beta^{(i)}\right) = -X^T V^{(i)} X$ is not of full rank; that is, $\ddot{l}\left(X, \beta^{(i)}\right)$ is a singular matrix and cannot be inverted. Indeed, there is no finite solution to the MLE by Theorem 4.1. ∎
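A minimal numerical illustration of Theorem 4.3 (our own, assuming NumPy): with $x_2 = 2 x_1$, the matrix $X^T V^{(i)} X$ is singular, so the Newton-Raphson step in (2.7) cannot be computed.

```python
import numpy as np

# Collinear design as in Counter Example 1: x2 = 2 * x1.
x1 = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
X = np.column_stack([np.ones_like(x1), x1, 2.0 * x1])

beta = np.zeros(3)                               # any current iterate beta^(i)
pi = 1.0 / (1.0 + np.exp(-X @ beta))
V = np.diag(pi * (1.0 - pi))
hessian = X.T @ V @ X                            # the matrix X^T V^(i) X of Theorem 4.3

print(np.linalg.matrix_rank(hessian))            # 2 < 3, so the matrix is singular
try:
    np.linalg.inv(hessian)
except np.linalg.LinAlgError as err:
    print("Newton-Raphson step cannot be computed:", err)
```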

Remark 4.4. Multicollinearity does not always imply non-uniqueness of the solution to the MLE. Assume there are only two independent variables $x_1$ and $x_2$ such that $x_2 = 2 x_1$. If $(b_0, b_1, b_2)$ were a solution to the MLE, then $(b_0, b_1 + 2 b_2, 0)$ would also be a solution, since $b_0 + b_1 x_1 + b_2 x_2 = b_0 + (b_1 + 2 b_2) x_1$.

Indeed, the likelihood function is concave, which guarantees that any maximum is also a global maximum. If there is no multicollinearity, that is, if $X$ is of full column rank, then the likelihood function is strictly concave, which guarantees the uniqueness of the solution to the MLE (Demidenko 2001) if a solution exists.

4.2. Consequence of separation


For both complete separation and quasi-complete separation, there is no finite solution
to MLE.
Theorem 4.5. Separation means no finite solution to MLE.
Proof. We consider two cases:

i. If there is multicollinearity, then there is no finite solution to the MLE by Theorem 4.1.

ii. If there is no multicollinearity, then $X$ is of full column rank $p + 1$. By Albert and Anderson (1984), there is no finite solution to the MLE. ∎
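Case (ii) can be illustrated numerically (our own sketch, not from the paper): for completely separated data whose design matrix has full column rank, the log likelihood (2.5) keeps increasing along the separating direction, so its supremum is not attained at any finite $\beta$.

```python
import numpy as np

def loglik(X, y, beta):
    """Log likelihood (2.5), rewritten as sum_i [y_i * eta_i - ln(1 + e^(eta_i))]."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Completely separated data with full column rank: x1 - 0.5 = 0 separates y.
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
x1 = np.array([0.0, 0.1, 0.2, 0.8, 0.9, 1.0])
X = np.column_stack([np.ones(6), x1])

# Along beta = t * (-0.5, 1), the log likelihood increases toward 0 as t grows,
# so no finite beta attains the supremum.
for t in [1, 10, 100, 1000]:
    print(t, loglik(X, y, t * np.array([-0.5, 1.0])))
```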

4.3. Consequence of non-multicollinearity


If there is no multicollinearity, then $X$ is of full column rank. Hence, the likelihood function is strictly concave and has at most one maximizer (Demidenko 2001). Yet, this does not guarantee that the MLE has a solution.

4.4. Consequence of non-multicollinearity and non-separation

Theorem 4.6. If there is neither multicollinearity nor separation, then there is a unique solution to the MLE.

Proof. By Albert and Anderson (1984), there is a unique solution to the MLE. ∎

5. Conclusions
In this paper, we have studied the relationship between multicollinearity and separation.
We conclude that

• Multicollinearity implies quasi-complete separation,
• Multicollinearity does not always imply complete separation,
• Separation does not always imply multicollinearity.

We have also presented the consequences of multicollinearity and separation. We conclude that

• Multicollinearity means no finite solution to the MLE,
• Multicollinearity means the Newton-Raphson iteration is not applicable,
• Separation means no finite solution to the MLE,
• Non-multicollinearity means at most one solution to the MLE,
• Non-multicollinearity and non-separation mean a unique solution to the MLE.

ORCID
Guoping Zeng http://orcid.org/0000-0002-3403-5418

References
Albert, A., and J. A. Anderson. 1984. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71 (1):1–10.
Chatelain, J. B., and K. Ralf. 2014. Spurious regressions and near multicollinearity, with an application to aid, policies and growth. Journal of Macroeconomics 39 (PA):85–96.
Demidenko, E. 2001. Computational aspects of probit model. Mathematical Communications 6:233–47.
Hosmer, D. W., S. Lemeshow, and R. X. Sturdivant. 2013. Applied logistic regression. 3rd ed. New York: John Wiley & Sons, Inc.
Midi, H., S. K. Sarkar, and S. Rana. 2013. Collinearity diagnostics of binary logistic regression model. Journal of Interdisciplinary Mathematics 13 (3):253–67. doi:10.1080/09720502.2010.10700699.
Murray, L., H. Nguyen, Y. Lee, M. D. Remmenga, and D. Smith. 2012. Variance inflation factors in regression models with dummy variables. In Proceedings of the Annual Conference on Applied Statistics in Agriculture, 160–77. Kansas State University Libraries, New Prairie Press.
Refaat, M. 2011. Credit risk scorecards: Development and implementation using SAS. Raleigh, North Carolina: Lulu.com.
Sarlija, N., A. Bilandzic, and M. Stanic. 2017. Logistic regression modelling: Procedures and pitfalls in developing and interpreting prediction models. Croatian Operational Research Review 8:631–52.
Shen, J., and S. Gao. 2008. A solution to separation and multicollinearity in multiple logistic regression. Journal of Data Science 6 (4):515–31.
Siddiqi, N. 2006. Credit risk scorecards: Developing and implementing intelligent credit scoring. Hoboken, New Jersey: John Wiley & Sons, Inc.
Yoo, W., R. Mayberry, S. Bae, K. Singh, Q. Peter He, and J. W. Lillard Jr. 2014. A study of effects of multicollinearity in the multivariable analysis. International Journal of Applied Science and Technology 4 (5):9–19.

Zeng, G. 2013. Metric divergence measures and information value in credit scoring. Journal of Mathematics 2013: Article ID 848271. doi:10.1155/2013/848271.
Zeng, G. 2014a. A rule of thumb for reject inference in credit scoring. Mathematical Finance Letters 2014: Article 2, 1–13.
Zeng, G. 2014b. A necessary condition for a good binning algorithm in credit scoring. Applied Mathematical Sciences 8 (65):3229–43. doi:10.12988/2014.44300.
Zeng, G. 2015. A unified definition of mutual information with applications in machine learning. Mathematical Problems in Engineering 2015: Article ID 201874. doi:10.1155/2015/201874.
Zeng, G. 2017a. A comparison study of computational methods of Kolmogorov–Smirnov statistic in credit scoring. Communications in Statistics: Simulation and Computation 46 (10):7744–60. doi:10.1080/03610918.2016.1249883.
Zeng, G. 2017b. Invariant properties of logistic regression model in credit scoring under monotonic transformations. Communications in Statistics: Theory and Methods 46 (17):8791–807. doi:10.1080/03610926.2016.1193200.
Zeng, G. 2017c. On the existence of maximum likelihood estimates for weighted logistic regression. Communications in Statistics: Theory and Methods 46 (22):11194–203. doi:10.1080/03610926.2016.1260743.
Zeng, G., and E. Zeng. 2018. On the three-way equivalence of AUC in credit scoring with tied scores. Communications in Statistics: Theory and Methods. doi:10.1080/03610926.2018.1435814.
Zeng, G. 2019. On the confusion matrix in credit scoring and its analytical properties. Submitted to Communications in Statistics: Theory and Methods. doi:10.1080/03610926.2019.1568485.
