Computational Statistics and Data Analysis 158 (2021) 107180

Contents lists available at ScienceDirect

Computational Statistics and Data Analysis


journal homepage: www.elsevier.com/locate/csda

Variable selection in finite mixture of regression models with an unknown number of components

Kuo-Jung Lee^a, Martin Feldkircher^{b,*}, Yi-Chi Chen^c

^a Department of Statistics and Institute of Data Science, National Cheng Kung University, Tainan, Taiwan
^b Vienna School of International Studies (DA), Vienna, Austria
^c Department of Economics, National Cheng Kung University, Tainan, Taiwan

Article history: Received 9 July 2020; Received in revised form 12 January 2021; Accepted 12 January 2021; Available online 26 January 2021.

Keywords: Finite mixture of regression models; Bayesian variable selection; Unknown number of components; High-dimensional data; Financial crisis

Abstract: A Bayesian framework for finite mixture models to deal with model selection and the selection of the number of mixture components simultaneously is presented. For that purpose, a feasible reversible jump Markov Chain Monte Carlo algorithm is proposed to model each component as a sparse regression model. This approach is made robust to outliers by using a prior that induces heavy tails and works well under multicollinearity and with high-dimensional data. Finally, the framework is applied to cross-sectional data investigating early warning indicators. The results reveal two distinct country groups for which estimated effects of vulnerability indicators vary considerably.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

Many empirical problems in economics involve high-dimensional data where the number of predictors/covariates (p)
is large relative to the number of observations (n). For example, Sala-i-Martin (1997) examined determinants of economic
growth using close to 60 potential explanatory variables. He found that only a few of them were relevant for explaining
economic growth. As another example, the US National Longitudinal Survey of Youth (NLSY) covers hundreds or thousands
of variables potentially affecting productivity and, consequently, wages. However, a wage equation is likely to be sparse
in the sense that only a few covariates, such as education and years of labor force experience, have important effects on
wages.
When the number of covariates is large and the covariates are correlated – potentially due to similarity of theoretical
or empirical properties – it is well known that maximum likelihood estimation can result in unstable estimates and
inference (Dunson et al., 2008). Furthermore, these large-scale data are often characterized by a significant degree of
heterogeneity as they may arise from different sources.
Recently, considerable interest has focused on shrinkage and variable selection methods. These methods allow us to
significantly decrease the number of covariates; see, e.g., Tibshirani (1996), Fan and Li (2001). Nevertheless, it has been
shown that variable selection algorithms may inevitably encounter difficulties or have undesirable features when applied
to high-dimensional data. This becomes even more severe when the population under study is made up of different
sub-populations and the relationship between the response and the covariates varies across the sub-populations. Finite
mixture of regression (FMR) is a very popular statistical modeling technique to address these issues; see, e.g., Frühwirth-
Schnatter (2006), McLachlan and Peel (2000). Relatedly, different subsets of covariates may only be relevant for different

∗ Correspondence to: Favoritenstraße 15a, 1040 Vienna, Austria.


E-mail address: martin.feldkircher@da-vienna.ac.at (M. Feldkircher).

https://doi.org/10.1016/j.csda.2021.107180
0167-9473/© 2021 Elsevier B.V. All rights reserved.

sub-populations. McLachlan and Peel (2000) pointed out three main challenges in estimating FMR models with high-
dimensional data: heterogeneity, a complex likelihood surface, and an unknown number of mixture components of the
model. In this paper, we additionally address the effect of strong collinearity — a previously overlooked feature common
to economic applications, especially in high-dimensional settings. As FMR models commonly rely on likelihood-based inference, optimizing complicated and potentially multi-modal likelihood surfaces is notoriously difficult. This is particularly so when using high-dimensional or highly collinear covariates. In these cases, convergence of established algorithms may not be guaranteed.
In this study, we make important contributions to Bayesian FMR models in several ways: First, we propose a
feasible, yet more robust, reversible jump Markov chain Monte Carlo (RJMCMC) algorithm (Green, 1995; Richardson and
Green, 1997). Our algorithm deals with an unknown number of components and applies variable selection within each
component of the mixture model. Overall, our approach marks a significant improvement over existing alternatives such
as the reversible jump algorithm of Liu et al. (2015), which is not well defined when p ≥ n. By contrast, our proposed
algorithm involves simultaneous selection of mixture components and relevant variables when strong collinearity and
potential deviations from normality are pervasive — situations that are typical in high-dimensional settings. Second, we
deliberately address the computational complexity arising from highly correlated covariates in a large p, small n problem,
and consider an alternative to the commonly used g-prior that avoids the potential posterior multimodality caused by strong collinearity (Ghosh and Ghattas, 2015). Third, it has been found that misspecification in Gaussian FMR
models may lead to spurious groupings when mixture components slightly deviate from a normal distribution (Zhang,
2017). Thus, our FMR model is robustified by the use of Student t-distributions, following Geweke (1993). Finally, we
apply the proposed framework to identify macrofinancial vulnerabilities to explain the severity of the global financial
crisis 2008/09 in terms of output loss. Here, we find evidence of two clusters of countries with both determinants and
size of effects varying between the two groups.
The paper is organized as follows: Section 2 briefly reviews the statistical treatments of variable selection in FMR
models, with a particular focus on identifying the number of mixture components. Sections 3 and 4 describe our proposed
Bayesian sparse variable selection approach for high-dimensional FMR and the required RJMCMC algorithm. Section 5
conducts simulation studies to evaluate the performance of the proposed RJMCMC in a number of different scenarios. This
section also provides a sensitivity analysis with respect to alternative choices of prior distribution. Section 6 describes the
empirical application of the early warning systems and Section 7 concludes.

2. Literature review

2.1. Regression regularization methods

Regression regularization methods have been frequently used in the literature on variable selection for linear regression
models. In a pioneering study, Tibshirani (1996) proposes the least absolute shrinkage and selection operator estimator
(LASSO), which penalizes the ℓ1-norm of the regression coefficients. Other commonly used penalty functions include
bridge regressions (Frank and Friedman, 1993), smoothly clipped absolute deviations (SCAD, Fan and Li, 2001), adaptive
LASSO (Zou, 2006), and the elastic net (Zou and Hastie, 2005). In particular, Khalili and Chen (2007), Khalili and Lin
(2013), and Städler et al. (2010) have investigated the variable selection problem for FMR models with versions of the
above penalty functions. The resulting estimators have been shown to have a number of useful properties, including variable selection consistency and the oracle property as defined in Fan and Li (2001). However, these penalized likelihood-based
approaches are sensitive to data contamination, and their efficiency may be significantly reduced when the model is
slightly misspecified (Tang and Karunamuni, 2017).

2.2. Large p, small n problems

Despite recent advances in variable selection in linear and generalized linear models, research on FMR models is
still scarce. Khalili and Chen (2007) studied the problem of variable selection in a general family of FMR models under
the standard assumption that the dimension p of the covariate space is fixed with respect to the sample size. They
proposed a new class of weighted penalty functions for FMR models to conduct LASSO-type variable selection. However, against the backdrop of more easily accessible data, there has recently been growing interest in applying these models to high-dimensional data. Notably, a theoretical study by Städler et al. (2010) developed an ℓ1-penalized estimator for
high dimensional Gaussian FMR models, where each component of the mixture is a sparse Gaussian linear regression
model. Under certain regularity conditions including the boundedness of both parameter space and the component-wise
Gaussian mean functions, the authors showed consistency (in probability) of LASSO estimators under a Kullback–Leibler
loss function. Also, by assuming a restricted eigenvalue condition on the regression design matrix, they proved oracle
inequalities for LASSO estimators. A further extension was made recently by Khalili and Lin (2013), considering a similarly
high dimensional problem with a general family of FMR models. Khalili and Lin (2013) also provide asymptotic properties
of their estimator allowing for the dimension p to grow with the sample size n polynomially. In contrast to Städler et al.
(2010), Khalili and Lin (2013) demonstrated that their penalized likelihood approach leads to consistent variable selection
in the sense of Fan and Li (2001). The aforementioned studies have, nevertheless, the usual drawbacks of penalized
likelihood approaches: Inference of variable selection and consistency of the estimator rely on the stark assumption of a

fixed (known) number (or order) of mixture components in FMR models, which limits the applicability of these methods for variable selection.

2.3. Determining the number of mixture components

‘‘Testing for the number of components [G] in a mixture is an important but very difficult problem which has not
been completely resolved’’ (McLachlan and Peel, 2000, p.175). The order estimation problem in mixture models is a
long-standing problem in statistics (e.g., Xu and Chen, 2015; Li and Chen, 2010; Chen, 2012). The information-theoretic
approaches such as Akaike’s or the Bayesian information criterion (AIC and BIC, respectively) are often used to determine
the number of components in mixture models; see, e.g., Chen (2012) for a detailed discussion on information criteria
and sequential hypothesis testing in this regard. It is well-known that under general conditions, BIC provides a consistent
estimator of G (Leroux, 1992; Keribin, 2000). The applicability of the theoretical results to high-dimensional FMR models
has, however, not been established. In particular, when the number of covariates p is large, fitting full FMR models is no
easy task. Gupta and Ibrahim (2007) introduced a hierarchical regression mixture model, in which a Monte Carlo method
was used for simultaneous variable selection and clustering. They conducted an MCMC procedure repeatedly for different
values of G – the best value being determined by comparing the different models using Bayes factors. However, as pointed
out by Monni and Tadesse (2009), Gibbs sampling in a high-dimensional setting becomes computationally burdensome and even more so in case G is large. Estimating an FMR model for different values of G can be time-consuming and does
not tackle the uncertainty in G, as an optimal choice of G is generally not known a priori. Despite the above progress,
there is no satisfactory non-Bayesian solution available to the problem of simultaneous order and variable selection in
FMR models, especially when p is large.
Apart from the frequentist approach, a number of strategies have been proposed to determine the number of
components from a Bayesian point of view (Nobile and Fearnside, 2007; Richardson and Green, 1997; Malsiner-Walli et al.,
2016). For example, Liu et al. (2015) proposed a fully Bayesian approach relying on the RJMCMC algorithm of Richardson
and Green (1997) and Tadesse et al. (2005), which can be used to infer the number of components while selecting the
variables for each component. This sampling procedure may increase the computational burden considerably. It requires
the Metropolis–Hastings algorithm within Gibbs sampling to sample the posterior distribution when the dimensionality
of the parameter space alters during the sampling procedure. Despite the significant progress made in variable selection, it
remains unclear whether their RJMCMC method is useful for high-dimensional applications and under strong collinearity.
Most importantly, robustness issues with regard to model misspecifications for FMR are not explicitly addressed in existing
work.

3. Variable selection in finite mixture of regression models

To begin with, let (yi , xi ), i = 1, . . . , n, be a dataset of n observations that come from a heterogeneous population, where
yi is the response variable of the ith observation and xi = (xi1 , . . . , xip )′ collects the p covariates of the ith observation.
We assume that the heterogeneous population consists of G sub-populations or mixture components, and within each
sub-population, (yi , xi ) follows a separate linear regression model. Specifically,
$$y_i \mid (\rho_g, \beta_g, \sigma_g^2,\ g = 1,\dots,G;\ \omega_i) \;\sim\; \sum_{g=1}^{G} \rho_g \cdot N\!\left(x_i'\beta_g,\ \omega_i\sigma_g^2\right), \qquad (1)$$

where $\rho = (\rho_1,\dots,\rho_G)$ are the proportions of the $G$ sub-populations, with $\rho_g \ge 0$ and $\sum_{g=1}^{G}\rho_g = 1$. In the $g$th group, $\beta_g = (\beta_{g1},\dots,\beta_{gp})'$ is the coefficient vector of the linear regression and the variance is $\omega_i\sigma_g^2$, where $\omega_i$ is a variance-inflation factor. Let $y_g$ be the column vector of the $n_g$ observations in the $g$th group and $X_g$ the corresponding design matrix; the regression model in the $g$th group can then be written as

$$y_g = X_g\beta_g + \epsilon_g, \qquad \epsilon_g \sim N(0,\ \sigma_g^2 W_g),$$

where $W_g = \mathrm{diag}(\omega_i;\ i \in g)$ is a diagonal matrix whose $i$th diagonal element equals $\omega_i$.
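To make the data-generating process concrete, the following R sketch simulates a small dataset from model (1). The dimensions, weights, and coefficient values are illustrative assumptions, not the paper's settings.

```r
## Simulate from the FMR model (1); all values below are illustrative.
set.seed(1)
n <- 100; p <- 10; G <- 2
rho    <- c(0.4, 0.6)                        # mixture proportions
beta   <- rbind(c(-3, 0, 3, rep(0, p - 3)),  # sparse coefficients, group 1
                c( 2, 2, 0, rep(0, p - 3)))  # sparse coefficients, group 2
sigma2 <- c(0.5, 1)
X <- matrix(rnorm(n * p), n, p)
z <- sample(1:G, n, replace = TRUE, prob = rho)  # latent memberships z_i
w <- 1 / rgamma(n, shape = 2, rate = 2)          # variance-inflation factors
y <- sapply(1:n, function(i)
  rnorm(1, mean = sum(X[i, ] * beta[z[i], ]),
           sd   = sqrt(w[i] * sigma2[z[i]])))
```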

3.1. Two sets of latent variables

As is standard for the mixture model, we introduce a latent variable zi for each observation i, so that zi = g indicates
that the ith observation comes from the gth sub-population with the probability P(zi = g) = ρg . That is,

$$z_i \sim \mathrm{Multinomial}(\rho_1,\dots,\rho_G) \quad\text{and}\quad [y_i \mid z_i = g] \sim N\!\left(x_i'\beta_g,\ \omega_i\sigma_g^2\right).$$

In the presence of sparsity, only a subset of the elements of the $\beta$'s is non-zero. Put differently, not all of the covariates are relevant for the response. To facilitate variable selection, we introduce another latent vector $\gamma_g = (\gamma_{g1},\dots,\gamma_{gp})'$ for each sub-population $g$ such that $\beta_{gj}$ arises from either a point mass at zero or a normal distribution, depending on a latent variable $\gamma_{gj}$ (George and McCulloch, 1993, 1997). That is, when $\gamma_{gj} = 0$, $\beta_{gj}$ is zero; when $\gamma_{gj} = 1$, $\beta_{gj}$ is allowed a reasonably large deviation from zero.
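As a minimal illustration of this spike-and-slab construction, the R sketch below draws one component's indicators and coefficients; the values of d, c2, sigma2, and p are illustrative, not taken from the paper.

```r
## Spike-and-slab draw for one component (illustrative hyperparameters).
d <- 0.1; c2 <- 10; sigma2 <- 1; p <- 20
gamma <- rbinom(p, size = 1, prob = d)             # inclusion indicators gamma_gj
beta  <- ifelse(gamma == 1,
                rnorm(p, 0, sqrt(c2 * sigma2)),    # slab: N(0, c^2 sigma^2)
                0)                                 # spike: point mass at zero
```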

With the two sets of latent variables {zi , i = 1, . . . , n} and {γg = (γgj , j = 1, . . . , p), g = 1, . . . , G} defined, a
Bayesian analysis can be carried out in a straightforward manner. In the following subsections, we shall introduce the
prior assumptions first and then present the details of the Bayesian analysis.

3.2. Prior distributions

First, consider the probability vector ρ of mixture proportions. Similar to Viele and Tong (2002), given G, we assume
a conjugate Dirichlet prior distribution on ρ , ρ ∼ Dirichlet(α1 , . . . , αG ). In each mixture component of the regression
model, the prior distributions of the indicator variables γgj are assumed to be independent Bernoulli(dgj ) for j = 1, . . . , p.
As a result, the joint distribution of γg = (γg1 , . . . , γgp )′ is
$$\pi(\gamma_g) = \prod_{j=1}^{p} d_{gj}^{\gamma_{gj}}\,(1-d_{gj})^{1-\gamma_{gj}}.$$

Conditional on γgj , the prior of the regression coefficient, βgj , is given by

$$\beta_{gj} \mid \gamma_{gj} \;\sim\; \gamma_{gj}\, N(0,\ c_j^2\sigma_g^2) + (1-\gamma_{gj})\,\delta_0,$$

where the $c_j^2$'s are pre-specified positive constants that allow nonzero regression coefficients to be selected, and $\delta_0$ is a point mass at 0. In contrast to the well-known g-prior (Zellner, 1986), which uses the scaled variance of the OLS estimator as the prior variance, the proposed prior circumvents the computational difficulty arising from the singularity caused by highly correlated variables. For the prior of each $\sigma_g^2$, we independently assume $\sigma_g^2 \sim IG\!\left(\frac{a_{g0}}{2}, \frac{b_{g0}}{2}\right)$, where $IG\!\left(\frac{a_{g0}}{2}, \frac{b_{g0}}{2}\right)$ denotes an inverse Gamma distribution with mean $\frac{b_{g0}}{a_{g0}-2}$. Finally, we assume $G$ has a discrete uniform distribution on $\{1,\dots,G_{\max}\}$, or a truncated Poisson distribution

$$P(G = m) = \frac{e^{-\lambda}\lambda^m/m!}{1 - e^{-\lambda} - \sum_{t=G_{\max}+1}^{\infty} e^{-\lambda}\lambda^t/t!}, \qquad m = 1,\dots,G_{\max}, \qquad (2)$$

where Gmax is the maximum number of components in the FMR, and λ is a pre-specified constant. In a simulation study,
we demonstrated that the number of components does not depend qualitatively on the specification of λ. We did the
same for the empirical application with λ = 1, . . . , 5 and obtained the same result. In what follows, we set λ = 2.
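The prior in (2) is easy to evaluate numerically; a short R sketch computing the truncated-Poisson mass function follows (the values of lambda and Gmax are illustrative).

```r
## Truncated Poisson prior (2) on G, supported on {1, ..., Gmax}.
dtrunc_pois <- function(m, lambda, Gmax) {
  Z <- sum(dpois(1:Gmax, lambda))         # mass retained on {1, ..., Gmax}
  ifelse(m >= 1 & m <= Gmax, dpois(m, lambda) / Z, 0)
}
round(dtrunc_pois(1:6, lambda = 2, Gmax = 6), 3)
```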
For the sake of robustness and to deal with the general misspecifications emphasized in Zhang (2017), we place on each $\omega_i$ an $IG(\nu/2, \nu/2)$ prior. Under this setting, the linear model is equivalent to one whose errors have independent and identical Student-t distributions with $\nu$ degrees of freedom (Geweke, 1993; Cozzini et al., 2014; Lin et al., 2004). To avoid bias from fixing $\nu$, we furthermore place an exponential prior on $\nu$ with mean $\kappa$. Note that lower values of $\nu$ correspond to a heavier-tailed distribution, which can accommodate outliers, and also imply relatively larger variances in the inverse Gamma distribution.
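This scale-mixture equivalence can be checked numerically: integrating $N(0, \omega\sigma^2)$ over $\omega \sim IG(\nu/2, \nu/2)$ yields a scaled $t_\nu$ distribution. A quick R sketch with illustrative values:

```r
## Check: N(0, omega * sigma^2) with omega ~ IG(nu/2, nu/2) is marginally
## sigma times a t_nu draw (Geweke, 1993). Values are illustrative.
set.seed(2)
nu <- 4; sigma <- 1.5; N <- 1e5
omega <- 1 / rgamma(N, shape = nu / 2, rate = nu / 2)  # omega ~ IG(nu/2, nu/2)
eps   <- sigma * rnorm(N, 0, sqrt(omega))              # N(0, omega * sigma^2)
cbind(empirical = quantile(eps / sigma, c(0.90, 0.95, 0.99)),
      t_nu      = qt(c(0.90, 0.95, 0.99), df = nu))
```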
We assume that given G, z is independent of (γg , βg , σg , ν ), and (γg , βg , σg , ν ) are also assumed to be independent of
each other. Likewise, for each g, γgj ’s are assumed to be independent.

3.3. Gibbs sampler for variable selection

With the aforementioned prior specification, the posterior distribution is obtained by combining a complete-data
likelihood with the prior distributions. We use Gibbs sampling to estimate the model. To speed up convergence of the
algorithm and to ease the complexity of the sampler, the β ’s and σ ’s are integrated out from the full posterior distribution.
For ease of notation, we denote by $X_g = \{x_{ij},\ i \in g,\ \gamma_{gj} \neq 0\}$ the collection of important covariates in the $g$th group. The marginal posterior distribution of $(\gamma, z, \rho, \omega, G, \nu)$ is then obtained as follows:

$$\begin{aligned}
p(\gamma, z, \rho, \omega, G, \nu \mid y) = {} & \prod_{g=1}^{G}\Bigg[\left(\prod_{i=1}^{n_g}\left(\frac{1}{\omega_i}\right)^{(\nu+1)/2+1}\exp\left\{-\frac{\nu}{2\omega_i}\right\}\right)\frac{(b/2)^{a/2}}{\Gamma(a/2)}\left(\frac{\rho_g^2}{2\pi}\right)^{n_g/2}|\Omega_g|^{-1/2}|\Lambda_g|^{-1/2} \\
& \times \Gamma(h_g)\left(\frac{b+\mathrm{SSE}_g}{2}\right)^{-h_g}\prod_{j=1}^{p} d_{gj}^{\gamma_{gj}}(1-d_{gj})^{1-\gamma_{gj}}\Bigg]\,\pi_G(G)\,\Gamma\!\left(\sum_{g=1}^{G}\alpha_g\right)\prod_{g=1}^{G}\frac{\rho_g^{\alpha_g-1}}{\Gamma(\alpha_g)}\, e^{-\nu/\kappa}, \qquad (3)
\end{aligned}$$

where $h_g = \frac{n_g+a}{2}$, $\Lambda_g = X_g' W_g^{-1} X_g + \Omega_g^{-1}$, $\Omega_g = \mathrm{diag}(c_j^2;\ \{j:\gamma_{gj}\neq 0\})$, $W_g = \mathrm{diag}(\omega_i;\ i \in g)$, $\mathrm{SSE}_g = y_g'\left(W_g^{-1} - W_g^{-1}X_g\Lambda_g^{-1}X_g'W_g^{-1}\right)y_g$, and $q_g = \sum_{j=1}^{p}\gamma_{gj}$. To implement the Gibbs sampler and to generate the posterior samples from (3),
it is necessary to derive the full conditional distributions of all parameters. The details for the derivation of the conditional
distributions are provided in the supplementary material to this article. The full conditionals are given as follows:

1. The full conditional distribution of $z_i$, $i = 1,\dots,n$, is
$$p(z_i \mid \gamma, \rho, \omega, G, \nu, y) \;\propto\; \left(\prod_{i=1}^{n_g}\left(\frac{1}{\omega_i}\right)^{\frac{\nu+1}{2}+1} e^{-\frac{\nu}{2\omega_i}}\right)\left(\frac{\rho_g^2}{2\pi}\right)^{n_g/2}|\Omega_g|^{-1/2}|\Lambda_g|^{-1/2}\,\Gamma(h_g)\left(\frac{b+\mathrm{SSE}_g}{2}\right)^{-h_g}\prod_{j=1}^{p} d_{gj}^{\gamma_{gj}}(1-d_{gj})^{1-\gamma_{gj}} \times \frac{\rho_g^{\alpha_g-1}}{\Gamma(\alpha_g)}.$$

2. The full conditional distribution of $\rho$ is $\mathrm{Dirichlet}(\alpha_1 + n_1,\dots,\alpha_G + n_G)$.

3. The full conditional distribution of $\gamma_{gj}$ is
$$p(\gamma_{gj} \mid z, \rho, \omega, G, \nu, y) \;\propto\; c_j^{-2\gamma_{gj}}\,|\Lambda_g|^{-1/2}\left(\frac{b+\mathrm{SSE}_g}{2}\right)^{-h_g} d_{gj}^{\gamma_{gj}}(1-d_{gj})^{1-\gamma_{gj}}.$$

4. The full conditional distribution of $\omega_i$ given $z_i = g$ is
$$p(\omega_i \mid z, \gamma, \rho, G, \nu, y) \;\propto\; \left(\frac{1}{\omega_i}\right)^{(\nu+1)/2+1}\exp\left\{-\frac{\nu}{2\omega_i}\right\}.$$

5. The full conditional distribution of $\nu$ given the $\omega$'s is
$$p(\nu \mid z, \gamma, \rho, G, \omega, y) \;\propto\; \left[\prod_{g=1}^{G}\prod_{i=1}^{n_g}\left(\frac{1}{\omega_i}\right)^{(\nu+1)/2+1}\exp\left\{-\frac{\nu}{2\omega_i}\right\}\right] e^{-\nu/\kappa}.$$

6. An RJMCMC scheme is provided in the next section to update G.
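As an illustration of the closed-form updates above, the following R sketch implements items 1 and 2 in simplified form. It assumes the unnormalized log-weights for each observation-component pair have already been computed from the expression in item 1; the helper names are hypothetical.

```r
## Dirichlet draw without extra packages.
rdirichlet1 <- function(alpha) { x <- rgamma(length(alpha), alpha); x / sum(x) }

## Item 2: rho | z ~ Dirichlet(alpha_1 + n_1, ..., alpha_G + n_G).
update_rho <- function(z, alpha, G) {
  ng <- tabulate(z, nbins = G)            # component counts n_g
  rdirichlet1(alpha + ng)
}

## Item 1 (schematic): draw each z_i from its categorical full conditional,
## given an n x G matrix lw of unnormalized log-weights.
update_z <- function(lw) {
  apply(lw, 1, function(r) {
    pr <- exp(r - max(r))                 # stabilize before normalizing
    sample.int(length(pr), 1, prob = pr)
  })
}
```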

4. Reversible jump MCMC for updating G

A standard MCMC algorithm to approximate the posterior distribution of an FMR model is often not applicable. The
difficulty arises from accommodating the dimensional change when transitioning between different models. A variety of
sophisticated methods have been proposed. Richardson and Green (1997) proposed a Bayesian approach with a RJMCMC
algorithm by considering birth–death and split–merge moves to estimate the number of components and the component
parameters of mixture models. In this study, we apply an RJMCMC to the variable selection problem when the number of
components in FMR models is unknown. Next, we discuss how to sample the joint posterior distribution of G, z, ρ , and γ
given in (3) with the changes in the number of components, G. The sampler consists of six types of moves

1. Updating the memberships z.


2. Updating the proportions ρ .
3. Updating the indicator variable γ .
4. Updating the variance-inflation factor ω.
5. Updating the degree of freedom ν .
6. Updating the number of components, G.

(a) Birth–death: Generating or eliminating an empty component.


(b) Split–Merge: Splitting one component into two or merging two components into one.
Next, we introduce the steps (6a) in Section 4.1 and (6b) in Section 4.2, respectively.

4.1. Generating and eliminating

In the move of (6a), we randomly generate an empty component with a probability bG and eliminate an empty
component with a probability dG = 1 − bG , where bGmax = 0 and b1 = 1. Suppose that the number of components is G,
with G1 non-empty components and G0 empty components. In this step, the generated or eliminated component does
not affect the allocation of observations. This allows all across-model moves to be accepted while maintaining detailed balance throughout the MCMC algorithm. The details of the sampling procedure are given below.

4.1.1. Generating
Now we consider generating a new empty component. Suppose that a new empty component is generated and denoted
by g ∗ where the corresponding ρ ∗ and γ ∗ are proposed to be updated as follows

• $\rho_{g^*} \sim \mathrm{Beta}(1, G)$;
• $\gamma_{g^*j} \sim \mathrm{Bernoulli}(\phi_{gj})$, for $j = 1,\dots,p$;
• $\rho$ is re-weighted as $\rho^*$ with $\rho_k^* = \rho_k(1-\rho_{g^*})$, $k = 1,\dots,G$, and $\rho_{G+1}^* = \rho_{g^*}$.

Let $G^* = G+1$ denote the number of components in the proposal step, and let $\pi_\rho(\cdot)$ and $\pi_\gamma(\cdot)$ denote the Beta and Bernoulli density functions. Then the proposal candidate $(G^*, \rho^*, \gamma^*, z, \omega)$ is accepted with probability $\min(1, \eta_g)$, where
$$\eta_g = \frac{p(G^*, \rho^*, \gamma^*, z, \omega, \nu \mid y)}{p(G, \rho, \gamma, z, \omega, \nu \mid y)} \times \frac{d_{G+1}\,(G_0+1)^{-1}\,(1-\rho_{g^*})^{G}}{b_G\,\pi_\rho(\rho_{g^*})\,\prod_{j=1}^{p}\pi_\gamma(\gamma_{g^*j})}.$$
Otherwise, we set the parameters in the next step to equal those in the current step.
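A minimal R sketch of the birth proposal just described; phi stands in for the Bernoulli proposal probabilities $\phi_{gj}$ (using a single common value is an assumption here, not the paper's specification).

```r
## Birth move: propose an empty component g* with its weight and indicators.
propose_birth <- function(rho, p, phi = 0.5) {
  G          <- length(rho)
  rho_star   <- rbeta(1, 1, G)                     # rho_{g*} ~ Beta(1, G)
  gamma_star <- rbinom(p, 1, phi)                  # gamma_{g*j} ~ Bernoulli(phi)
  list(rho   = c(rho * (1 - rho_star), rho_star),  # re-weighted proportions
       gamma = gamma_star)
}
```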

4.1.2. Eliminating
In this step, we delete an empty component. If there are no empty components, this step is complete. Otherwise, we
randomly choose an empty component among G0 empty components with probability 1/G0 . Suppose that the proposal
candidate is defined by (G∗ , ρ ∗ , γ ∗ ), where G∗ = G − 1. Assume that the kth component is selected for elimination;
then the corresponding proposal candidates are $\rho^*$ and $\gamma^*$, where $\rho^*$ is $(\rho_1,\dots,\rho_{k-1},\rho_{k+1},\dots,\rho_G)$ normalized to sum to one and $\gamma^* = (\gamma_1,\dots,\gamma_{k-1},\gamma_{k+1},\dots,\gamma_G)$. Then the proposal candidate $(G^*, \rho^*, \gamma^*, z)$ is accepted with a probability of $\min(1, \eta_e)$, where
$$\eta_e = \frac{p(G^*, \rho^*, \gamma^*, z, \omega, \nu \mid y)}{p(G, \rho, \gamma, z, \omega, \nu \mid y)} \times \frac{b_{G-1}\,\pi_\rho(\rho_k)\,\prod_{j=1}^{p}\pi_\gamma(\gamma_{kj})}{d_G\,G_0^{-1}\,(1-\rho_k)^{G-1}}.$$

Otherwise, we set parameters in the next step equal to those in the current step.

4.2. Splitting and merging

In the move of (6b), we either randomly choose a non-empty component to be split with a probability bG , or two
components to be merged with a probability dG = 1 − bG , where bGmax = 0 and b1 = 1. One of the most challenging
aspects of the proposal generation is to ensure reversibility. One needs to propose one-to-one moves so that the split
move can be recovered from the merge, and vice versa. Suppose that the number of components is G, with G1 non-empty
components and G0 empty components. Let ρ , γ , z, ω, and ν be the sample in the current step.

4.2.1. Splitting
In this step, we split a non-empty component into two. We randomly select a non-empty component g from G1 non-
empty components, and then split it into two components denoted by g1 and g2 , respectively. Next, we sequentially
describe how to propose the candidate $(G^*, \rho^*, \gamma^*, z^*)$ for dimension matching. Each member of component $g$ is assigned to the new component $g_1$ with probability $u$, and otherwise to the other new component $g_2$, where $u$ is randomly generated from a Beta distribution, $\mathrm{Beta}(2, 2)$. The corresponding weights of the two new components are proposed to be
updated as ρg1 = uρg and ρg2 = (1 − u)ρg . In each new component, the indicator variables are proposed to be updated
as follows, for j = 1, . . . , p,
$$\begin{aligned}
\gamma_{gj} = 0 &\;\Rightarrow\; \gamma_{g_1 j} = \gamma_{g_2 j} = 0;\\
\gamma_{gj} = 1,\ \psi_j = 0 &\;\Rightarrow\; \gamma_{g_1 j} = \gamma_{g_2 j} = 1;\\
\gamma_{gj} = 1,\ \psi_j = 1 &\;\Rightarrow\; \gamma_{g_1 j} = 1,\ \gamma_{g_2 j} = 0;\\
\gamma_{gj} = 1,\ \psi_j = 2 &\;\Rightarrow\; \gamma_{g_1 j} = 0,\ \gamma_{g_2 j} = 1,
\end{aligned}$$
where $\psi_j$ is randomly and independently drawn from $p(\psi_j = a) = \frac{1}{3}$ for $a = 0, 1, 2$, as in Liu et al. (2015). This ensures
the dimensional match between moves.
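The split rule for the indicators translates directly into code; the R sketch below draws the $\psi_j$ and returns the proposed indicator vectors for $g_1$ and $g_2$.

```r
## Split the indicator vector of component g into proposals for g1 and g2.
split_gamma <- function(gamma_g) {
  p   <- length(gamma_g)
  psi <- sample(0:2, p, replace = TRUE)    # psi_j uniform on {0, 1, 2}
  g1  <- as.integer(gamma_g == 1 & psi %in% c(0, 1))
  g2  <- as.integer(gamma_g == 1 & psi %in% c(0, 2))
  rbind(g1 = g1, g2 = g2)                  # gamma_gj = 0 stays 0 in both
}
```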
Before calculating the acceptance probability, it is necessary to decide whether the candidate should be retained or discarded, in terms of the similarity between the two components $g_1$ and $g_2$. Several metrics can be used to measure component similarity. We consider the distance between the regression coefficients, measured by
$$D(g_1, g_2) = \|\hat\beta_{g_1} - \hat\beta_{g_2}\|,$$
where $\|\cdot\|$ denotes the Euclidean distance and the $\hat\beta$'s are the estimates of the $\beta$'s. Conditional on $\gamma$, $z$, and $\omega$, the coefficients are estimated by weighted least squares,
$$\hat\beta_k = \Lambda_k^{-1} X_k' W_k^{-1} y_k, \qquad k = g_1 \text{ or } g_2.$$
Let
$$s_1 = \operatorname*{arg\,min}_{k \in C(-g_1)} D(g_1, k), \qquad s_2 = \operatorname*{arg\,min}_{k \in C(-g_2)} D(g_2, k),$$

where C (−k) is the collection of components excluding the component k. Then there are four possible scenarios to consider
in this move given below.

1. If both new components are non-empty and $s_1 \neq g_2$ and $s_2 \neq g_1$, then we discard the candidate and let the sample in the next step be the same as the current one.

2. Otherwise, let $G^*$ be the total number of components, and let $G_1^*$ and $G_0^*$ be the numbers of non-empty and empty components after splitting, respectively. The following cases can occur:

(a) If both new components are non-empty, and either $s_1 = g_2$, $s_2 \neq g_1$, or $s_1 \neq g_2$, $s_2 = g_1$, then $P_{\mathrm{chos}} = \frac{G_1^*}{G^*} \times \frac{1}{G_1^*} = \frac{1}{G^*}$.

(b) If both new components are non-empty, and both $s_1 = g_2$ and $s_2 = g_1$, then $P_{\mathrm{chos}} = \frac{G_1^*}{G^*} \times \frac{2}{G_1^*} = \frac{2}{G^*}$.

(c) If one of them is empty, then $P_{\mathrm{chos}} = \frac{G_0^*}{G^*} \times \frac{1}{G_0^*} \times \frac{1}{G_1^*} = \frac{1}{G^* G_1^*}$.

The acceptance probability is then $\min(1, \eta_s)$, where
$$\eta_s = \frac{p(G^*, \rho^*, \gamma^*, z^*, \omega, \nu \mid y)}{p(G, \rho, \gamma, z, \omega, \nu \mid y)} \times \frac{d_{G+1}\,\rho_g}{b_G\, G_1^{-1}\, u^{n_{g_1}}(1-u)^{n_{g_2}}\,\pi_u(u)} \times \frac{P_{\mathrm{chos}}}{P_{\mathrm{alloc}}},$$
where $n_{g_1}$ and $n_{g_2}$ are the numbers of observations in components $g_1$ and $g_2$, respectively, and $\pi_u(\cdot)$ denotes the Beta density function. In addition, $P_{\mathrm{alloc}} = \left(\frac{1}{3}\right)^{\sum_{j=1}^{p}\gamma_{kj}}$.

4.2.2. Merging
In this step, we consider merging two components. First, g1 is randomly selected from G components — we then
consider three scenarios:

• If g1 is empty, then we randomly select a component from non-empty components, denoted by g2 .


• If $g_1$ is non-empty, then $g_2$ is selected such that $g_2 = \operatorname*{arg\,min}_{k \in C(-g_1)} D(g_1, k)$.

• Let $k$ denote the component merged from $g_1$ and $g_2$; the corresponding proportion is $\rho_k^* = \rho_{g_1} + \rho_{g_2}$, with $\{z_i^* = k;\ i \in g_1 \text{ or } g_2\}$ and $\gamma_{kj}^* = \max\{\gamma_{g_1 j}, \gamma_{g_2 j}\}$.
Let $G^*$ be the total number of components and $G_1^*$ and $G_0^*$ the numbers of non-empty and empty components after merging, respectively. Then the acceptance probability is $\min(1, \eta_m)$, where
$$\eta_m = \frac{p(G^*, \rho^*, \gamma^*, z^*, \omega, \nu \mid y)}{p(G, \rho, \gamma, z, \omega, \nu \mid y)} \times \frac{b_{G-1}\, G_1^{*-1}\, u^{n_{g_1}}(1-u)^{n_{g_2}}\,\pi_u(u)}{d_G\,\rho_k^*} \times \frac{P_{\mathrm{alloc}}}{P_{\mathrm{chos}}},$$
where $\rho_k^* = \rho_{g_1} + \rho_{g_2}$ and $P_{\mathrm{alloc}} = \left(\frac{1}{3}\right)^{p - \sum_{j=1}^{p} I\{\gamma_{g_1 j} = \gamma_{g_2 j} = 0\}}$.

4.3. Identifiability and label switching

The lack of identifiability in mixture models and label switching in the MCMC output are two major statistical issues.
These undesirable effects can be serious and, if not properly dealt with, may negate some or all of the inference made
in each group. One possible solution to achieve identifiability is to impose constraints on the model parameters. Other
alternatives are described in Sperrin et al. (2010) and Spezia (2009). In this study, we implement the relabeling procedure
proposed by Stephens (2000). For the G-component cluster analysis in a mixture model, a natural way to consider
cluster assignment uncertainty is to define a matrix Q = (qig ) where qig represents the probability that observation i
is assigned to group g. Therefore, Q can be interpreted as a distribution on G-component clustering of the data. Note
that only MCMC samples with the same number of components are collected for the relabeling process. The relabeling
approach consists of minimizing the posterior expectation of a loss function, here the Kullback–Leibler distance. Specifically,
let $\theta^{[k]} = (G^{[k]}, \rho^{[k]}, \gamma^{[k]}, z^{[k]}, \nu^{[k]})$ be the $k$th of $K$ posterior samples; the probability of classifying the $i$th observation into the $g$th component is
$$d_{ig}(\theta^{[k]}) = D_{ig}(z, \theta^{[k]}) = \frac{p(y_i \mid \theta_g^{[k]})\,\rho_g^{[k]}}{\sum_{g=1}^{G} p(y_i \mid \theta_g^{[k]})\,\rho_g^{[k]}},$$
and the corresponding loss is defined as
$$L(z; D) = \sum_{i=1}^{n}\sum_{g=1}^{G} d_{ig}(\theta)\log\frac{d_{ig}(\theta)}{q_{ig}}.$$

Stephens’ relabeling algorithm for G-component mixture models works as follows: Set t1 , . . . , tK to be the identity
permutation and iterate the following two steps until the tolerance criterion is reached.

Step 1: Choose $Q = (q_{ig})$ to minimize
$$L_0(Q; \theta) = \sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{g=1}^{G} d_{ig}\!\left(t_k(\theta^{[k]})\right)\log\frac{d_{ig}\!\left(t_k(\theta^{[k]})\right)}{q_{ig}}. \qquad (4)$$
The minimization is achieved by
$$\hat q_{ig} = \frac{1}{K}\sum_{k=1}^{K} d_{ig}\!\left(t_k(\theta^{[k]})\right).$$

Step 2: For $k = 1,\dots,K$, update $t_k$ to minimize
$$\sum_{i=1}^{n}\sum_{g=1}^{G} d_{ig}\!\left(t_k(\theta^{[k]})\right)\log\frac{d_{ig}\!\left(t_k(\theta^{[k]})\right)}{\hat q_{ig}}.$$

Step 2 involves $K$ minimizations over all $G!$ permutations; Stephens (2000) showed that efficient numerical algorithms exist for minimizing the loss function in (4).
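For small $G$, the two steps can be implemented by exhaustive search over the $G!$ permutations. The R sketch below is a simplified version of this scheme under stated assumptions: a fixed number of iterations replaces the tolerance criterion, and d is a $K \times n \times G$ array of the classification probabilities $d_{ig}(\theta^{[k]})$.

```r
perms <- function(G) {                    # all G! permutations of 1:G
  if (G == 1) return(matrix(1, 1, 1))
  do.call(rbind, lapply(1:G, function(j) {
    sub <- perms(G - 1)
    cbind(j, matrix((1:G)[-j][sub], nrow(sub)))
  }))
}

stephens_relabel <- function(d, n_iter = 20) {
  K <- dim(d)[1]; n <- dim(d)[2]; G <- dim(d)[3]
  pm  <- perms(G)
  lab <- matrix(rep(1:G, each = K), K, G)       # start: identity permutations
  for (it in 1:n_iter) {
    ## Step 1: q_hat = average of the permuted classification probabilities
    q <- matrix(0, n, G)
    for (k in 1:K) q <- q + d[k, , lab[k, ]]
    q <- q / K
    ## Step 2: best permutation for each posterior sample
    for (k in 1:K) {
      loss <- apply(pm, 1, function(s) {
        dk <- d[k, , s]
        sum(dk * log(pmax(dk, 1e-12) / pmax(q, 1e-12)))
      })
      lab[k, ] <- pm[which.min(loss), ]
    }
  }
  lab                                           # lab[k, ] is the permutation t_k
}
```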

4.4. Posterior inference in mixture models

With a sufficiently large number of posterior samples generated by the MCMC algorithm, we estimate the number of
components $G$ in the mixture model as the most frequently observed value in the posterior samples. Then we extract the posterior samples corresponding to the chosen number of mixture components and apply the relabeling algorithm to this sub-sample of size, say, $K$. We estimate the $z$'s, $\beta$'s, $\sigma$'s, and $\rho$'s based on these samples.
The membership assignment of each observation $i$ is based on the posterior samples of the $z_i$'s. Specifically, observation $y_i$ is assigned to the $g$th component if
$$g = \operatorname*{arg\,max}_{j\in\{1,\dots,G\}} \hat p(z_i = j \mid y) = \operatorname*{arg\,max}_{j\in\{1,\dots,G\}} \frac{1}{K}\sum_{k=1}^{K} I\{z_i^{(k)} = j\},$$
where $z_i^{(k)}$ is the membership of the $i$th observation in the $k$th posterior sample. The estimate of the mixture proportion $\rho_g$ is
$$\hat\rho_g = \frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n} I\{z_i^{(k)} = g\}.$$

The estimate of $\sigma_g^2$ is given by
$$\hat\sigma_g^2 = \frac{1}{K}\sum_{k=1}^{K} \mathrm{SSE}_g^{[k]}/(n_g - q_g),$$
where
$$\mathrm{SSE}_g^{[k]} = y_g^{[k]\prime}\left(W_g^{-1[k]} - W_g^{-1[k]} X_g \Lambda_g^{-1[k]} X_g' W_g^{-1[k]}\right) y_g^{[k]},$$
and $W_g^{-1[k]}$ and $\Lambda_g^{-1[k]}$ are evaluated at the $k$th posterior sample, and $y_g^{[k]}$ are the corresponding observed values in the $g$th component.
The posterior inclusion probability of $\gamma_{gj} = 1$ is estimated by
$$\hat p(\gamma_{gj} = 1 \mid y) \approx \frac{1}{K}\sum_{k=1}^{K} I(\gamma_{gj}^{[k]} = 1).$$

In order to determine the important covariates of the mixture regression model of each sub-population, we adopt the so-
called median probability criterion (Barbieri and Berger, 2004) for model selection. That is, the variable xgj is important
if p̂(γgj = 1|y) > 0.5; otherwise not.
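In code, the PIP estimate and the median probability criterion amount to a column mean and a threshold. In the R sketch below, gamma_draws is a simulated placeholder for the $K \times p$ matrix of stored $\gamma_{gj}$ draws of one component; the storage layout is an assumption.

```r
## Placeholder draws: the first three variables are "active" by construction.
K <- 1000; p <- 20
prob_j <- rep(c(0.9, 0.1), c(3, p - 3))
gamma_draws <- matrix(rbinom(K * p, 1, prob_j), K, p, byrow = TRUE)

pip      <- colMeans(gamma_draws)   # estimated p(gamma_gj = 1 | y)
selected <- which(pip > 0.5)        # median probability model
```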
If $\hat\gamma_{gj} = 0$, the estimate $\hat\beta_{gj}$ of $\beta_{gj}$ is 0. If $\hat\gamma_{gj} = 1$, then, in a manner similar to the estimation of the $\sigma^2$'s, we calculate
$$\hat\beta_g^{[k]}(\gamma_{gj}^{[k]}) = \Lambda_g^{-1[k]} X_g' W_g^{-1[k]} y_g^{[k]}.$$
Then we use the Rao-Blackwellization method (Gelfand and Smith, 1990) to estimate the regression coefficient $\beta_{gj}$ by
$$E(\beta_{gj} \mid y) = \sum_{r_{gj}} E\!\left[\beta_{gj} \mid r_{gj}, y\right] p\!\left(r_{gj} \mid y\right) \approx \frac{1}{K_{gj}}\sum_{k=1}^{K}\hat\beta_{gj}^{[k]}(\gamma_{gj}^{[k]}),$$
where $K_{gj} = \sum_{k=1}^{K} r_{gj}^{[k]}$.

Table 1
Estimated number of mixture components with correlated covariates.

             π = 0.25            π = 0.50            π = 0.75            π = 0.90
  p       25    50   100      25    50   100      25    50   100      25    50   100
G = 1   0.01  0.01  0.01    0.01  0.01  0.01    0.01  0.01  0.03    0.02  0.02  0.03
G = 2   0.98  0.98  0.90    0.98  0.98  0.90    0.98  0.88  0.75    0.85  0.75  0.51
G = 3   0.01  0.01  0.06    0.01  0.01  0.05    0.01  0.02  0.12    0.10  0.18  0.29
G = 4   0     0     0.03    0     0     0.04    0     0.05  0.10    0.08  0.10  0.17

Notes: The numbers indicate the posterior probability for the choice of G mixture components for different numbers of covariates p and degrees of pairwise correlation π.

5. Simulation study

In this section, we conduct a simulation study to investigate the performance of the proposed approach in terms of the accuracy of identifying the number of components and the important variables in FMR models. In addition, the simulation results
demonstrate the effectiveness of the relabeling scheme against the potential pitfalls of possible erroneous interpretations
in the MCMC output. We also investigate to what extent strong collinearity affects the accuracy of the number of mixture
components and variable selection.

5.1. Performance under collinearity

Given $\omega_i$ generated from $IG(2, 2)$, a sample of 70 observations is generated from (1) with two components of unequal weights, where one of the groups contains 30 observations. This setting is equivalent to generating the errors from a t-distribution with 4 degrees of freedom. Each component is represented by a distinct set of active variables. The covariates are generated from a multivariate normal distribution with mean 0 and covariance matrix $\Sigma$ with pairwise correlations $\Sigma(i, j) = \pi^{|i-j|}$, where $\pi = 0.25, 0.5, 0.75$, and $0.9$. The variances are set to $\sigma_1^2 = 0.5$ and $\sigma_2^2 = 1$, respectively, in the FMR model. The regression coefficients are set as follows:
$$\beta_1 = (-3, 0, -3, 0, -3, 3, 0, 0, \underbrace{0, \dots, 0}_{p-8});$$
$$\beta_2 = (3, 2, 0, 0, -2, 2, 0, 4, \underbrace{0, \dots, 0}_{p-8}).$$
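This covariate design can be reproduced in a few lines of R; the sketch below uses MASS::mvrnorm and illustrative values of n, p, and the correlation parameter.

```r
## Covariates with AR(1)-type correlation Sigma(i, j) = pi^|i - j|.
library(MASS)
n <- 70; p <- 100; corr <- 0.75
Sigma <- corr ^ abs(outer(1:p, 1:p, "-"))
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
```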
In all models, we used 100,000 iterations of the MCMC sampler to ensure convergence of the algorithm. The program
was written in C++ and called from within the statistical software R. On a 4.2 GHz Intel Core i7 iMac desktop computer
(with 64GB memory), it takes about 1 h to estimate the model with up to 100 potential variables and 70 observations.
Specifically, for the setting π = 0.25, n = 70, p = 25, it takes 625 s per 10,000 iterations; for p = 50 it takes 775 s; for
p = 100, it takes 4,015 s. Table 1 reports the average posterior probability of G with different numbers of covariates and
cross-correlations across 100 replications. The simulation results indicate that the proposed model can correctly identify
the number of mixture components, even in the presence of high collinearity between covariates and with p > n. It is also
interesting to note that the correct number of components can be quite accurately determined for moderate numbers
of covariates and pairwise correlations, as well as for p larger than n. By contrast, order accuracy can be affected when
covariates are strongly collinear and the number of covariates is large.
We also show how well our proposed method recovers the correct number of components compared to other
component selection criteria such as the AIC, BIC and ICL (Biernacki et al., 2000). Using 100 replications, we calculate
the accuracy of correctly estimating the number of FMR components. It is clear from Fig. 1 that conventional selection
criteria predominantly favor a constant parameter model, even in the scenario of low collinearity between regressors.
Since information criteria have the general form $n\log(\mathrm{SSE}) + \text{penalty}$,
highly correlated variables can lead to significant changes in efficiency estimates, resulting in larger values of SSE. Thus,
strong collinearity can not only adversely affect variable selection but also lead to an underestimation of the underlying number of
components (Becker et al., 2015). From a Bayesian perspective, the incorporation of prior information for regression
coefficients reduces the ambiguity produced by collinearity, improving inference about the number of components. To
investigate the robustness of our approach, selection accuracy is also demonstrated based on a normal and a t-error
specification for the simulated data. It can be seen from Fig. 2 that the RJMCMC algorithm performs better in terms of
revealing the correct number of components, regardless of the degree of collinearity among covariates.
To illustrate the usefulness of the relabeling algorithm, we consider a simulation scenario where π = 0.75 and
p = 100. For simplicity, a relabeling result is randomly chosen from the simulations of the two components. Fig. 3(a) shows the raw MCMC output for the β1's at every 10th iteration for the first 1000 samples. After applying
the relabeling procedure, the relabeled trace plot is shown in Fig. 3(b). The proposed relabeling procedure appears to

Fig. 1. The distribution of the number of components over 100 replications under different degrees of pairwise correlation π = 0.25, 0.5, 0.75, 0.9
when the true model is a two-component mixture model. The numbers associated with the pie chart represent the percentage of the number of
components selected.

work well. After relabeling, we calculate the accuracy on both the classification of observations and the identification of
important variables.
The performance is then evaluated by the four commonly-used measures, the True Classification Rate on variables
(TCR), the True Positive Rate (TPR), the False Positive Rate (FPR), and the rate of the True Classification of Observations
(TCO), defined as
$$\mathrm{TCR} = \frac{\text{number of truly classified variables}}{\text{number of variables}};\qquad
\mathrm{TPR} = \frac{\text{number of truly selected variables}}{\text{number of active variables}};$$
$$\mathrm{FPR} = \frac{\text{number of falsely selected variables}}{\text{number of inactive variables}};\qquad
\mathrm{TCO} = \frac{\text{number of correctly classified observations}}{\text{number of observations}}.$$
TCR is an overall evaluation of accuracy in the classification of active and inactive variables implemented by our
proposed approach. TPR is the average rate of active variables correctly identified, and is used to measure the power of the
method. FPR is the average rate of inactive variables included in the selection model, and can be considered as the type
I error rate for variable selection. TCO is the rate at which the responses were assigned into the correct sub-populations.
Across the four measures, larger values of TCR, TPR, and TCO and a smaller value of FPR are preferred.
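Given the true and estimated indicator vectors and memberships (after relabeling), the four measures are one-liners in R; the function below is a direct transcription of the definitions, with hypothetical argument names.

```r
## gamma_true / gamma_hat: 0-1 vectors of true and selected variables;
## z_true / z_hat: true and estimated memberships after relabeling.
selection_metrics <- function(gamma_true, gamma_hat, z_true, z_hat) {
  c(TCR = mean(gamma_hat == gamma_true),
    TPR = sum(gamma_hat == 1 & gamma_true == 1) / sum(gamma_true == 1),
    FPR = sum(gamma_hat == 1 & gamma_true == 0) / sum(gamma_true == 0),
    TCO = mean(z_hat == z_true))
}
```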
Table 3 shows the selection accuracy on variables and classification of observations. It can be seen that once the correct
number of components is determined, the accuracy of variable selection of our proposed method is usually warranted,

Fig. 2. Selection accuracy on the number of components when deviations from the normal distribution are present in the simulated data with the
assumed correlations π = 0.25, 0.5, 0.75, 0.9. Normal: Normal error. Robust: t-error.

Table 2
Mean and standard error of the MSEs.
Model Mean s.d.
t-error 0.015 0.0011
Normal error 0.022 0.0015

regardless of the size of covariates. One particular exception is, however, that our proposed method may over-identify
variables when the correlation between covariates reaches 0.90.
To demonstrate the robustness of the proposed model, we fit the normal mixture of linear models to simulated data with p = 100, n = 70, and π = 0.75 and then compare the mean squared errors (MSEs) of the estimates of the β's from the two models over 100 replications. As is evident from Table 2, the t-mixture of linear models accommodates outliers, producing robust parameter estimates compared to its normal counterpart (mean MSEs of 0.015 for t-errors versus 0.022 for normal errors).

5.2. Sensitivity analysis

In this section, we provide a sensitivity analysis with respect to our prior choices. For that purpose, we design different
experiments with a variety of prior inputs to explore which dimensions of the prior are most critical. More precisely, we
follow a similar design as in our previous simulation study: we draw a random sample from (1) with two components of
unequal weights; we use a t-error and apply the same covariance structure for 100 covariates with π = 0.5.
Instead of a uniform prior on the number of components, we consider the truncated Poisson distribution in (2) with mean
$$E(G) = \frac{\lambda - \sum_{k=G_{\max}+1}^{\infty} k\, e^{-\lambda}\lambda^k/k!}{1 - e^{-\lambda} - \sum_{k=G_{\max}+1}^{\infty} e^{-\lambda}\lambda^k/k!}.$$
The truncated Poisson prior allows us to meaningfully incorporate prior information about


Fig. 3. An illustration of the relabeling procedure by comparing the trace plots of β1 ’s in every 10th iteration for the first 1000 samples before and
after applying the algorithm in the simulation study.

Table 3
Selection accuracy of the proposed method given the correct number of components.

             π = 0.25               π = 0.5                π = 0.75               π = 0.90
  p       10     25     50       10     25     50       10     25     50       10     25     50
TCR     0.99   0.99   0.99     0.99   0.99   0.98     0.98   0.97   0.97     0.83   0.84   0.86
TPR     1      1      1        1      1      0.98     0.98   0.97   0.97     0.67   0.58   0.55
FPR     0.008  0.010  0.011    0.009  0.010  0.010    0.011  0.012  0.012    0.12   0.13   0.14
TCO     0.92   0.91   0.91     0.92   0.91   0.90     0.92   0.91   0.90     0.92   0.88   0.83

Notes: Performance is measured by the True Classification Rate on variables (TCR), the True Positive Rate (TPR), the False Positive Rate (FPR), and the rate of the True Classification of Observations (TCO) for different combinations of covariates p and pairwise correlation π.

the number of components available, which can complement the analysis based on a non-informative prior. In each
simulation, we use different values λ = {1, 2, 4, Gmax } to examine if this would affect the estimated number of
components.
Given the expected model size, we can specify the corresponding prior inclusion probability for each variable. We
also try the extreme cases where the prior inclusion probabilities are set to 0.01 or 0.6, reflecting that all variables are of little or of modest importance. Due to the assumption of sparsity, we do not consider prior inclusion probabilities higher than 0.6. Moreover, we study the influence of cj on the performance of variable selection. Several values corresponding to prior
information as recommended in Liu et al. (2015) are evaluated. Lastly, we consider a nearly non-informative prior by

Table 4
Simulation study with 3 and 4 true components.

        3-component    4-component
TNC     98%            96%
TCR     (93%, 99%)     (91%, 99%)
TPR     (95%, 100%)    (94%, 99%)
FPR     (2%, 10%)      (3%, 12%)
TCO     (90%, 95%)     (86%, 92%)

Notes: The numbers in parentheses represent the 5th and 95th quantiles, respectively.

setting ag0 = bg0 = 0.01 and an informative prior such that the given values of ag0 and bg0 correspond to the sample
variance. An investigation of the sensitivity to the hyperparameters is carried out by comparing TCR, TPR, FPR, and TCO.
In this sensitivity analysis, we investigate the impact of different hyperparameters on the performance of the proposed
approach. First, we conduct a simulation study to examine if the specified values of λ will have an influence on determining
the correct number of components. Under the 2-component simulation design, we perform 100 simulations for each λ = 1, 2, 4, and Gmax. The numbers of times the 2-component model was selected were 96, 98, 97, 96, and 97, respectively. As a result, it appears that λ has little effect on the estimated number of components. We thus set
λ = 2 in the following sensitivity analysis. Second, we study how the assignment of prior probabilities affects the accuracy
of variable selection. Due to the sparsity where only a few variables are important, we let dgj = 0.01, 0.6 for sensitivity
analysis. Overall, 98 out of 100 replications correctly identify the number of components, which correspond to both TCR
and TPR greater than 0.98, TCO greater than 0.9 and FPR less than 0.01. Therefore, we conclude that the prior inclusion
probability has little effect on the accuracy of our proposed method. Next, we turn our attention to the choice of cj2 such
that βgj ∼ N(0, cj2 σg2 ). Under the non-informative prior inclusion probability, dgj = 0.5, the given values of cj2 correspond
to the selection of the best model using information criteria. It is worth noting that specifying cj2 ≈ 3.92, cj2 = n or
cj2 = p2 in Bayesian variable selection corresponds approximately to selecting the best model based on AIC/Cp , BIC, or RIC,
respectively (Chipman et al., 2001). In our application, we set cj2 = n which corresponds to the unit information prior.
To illustrate that the proposed approach is applicable for the mixture model with more than two components, we
further evaluate the performance of models with 3 and 4 components. We investigate the rate of true number of
components (TNC) and then, under the condition of the true number of components, we calculate TCR, TPR, FPR, and TCO.
Let n = 100; given G, the mixture proportions $(\rho_1,\dots,\rho_G)$ are generated from $\mathrm{Dirichlet}(G,\dots,G)$, and the $g$th component contains $n\rho_g$ observations. Given p = 100 and under the sparsity assumption, let $d_{gj}$, the probability that a variable is important, be 0.05, and generate $\gamma_{gj} \sim \mathrm{Ber}(d_{gj})$. When $\gamma_{gj} = 1$, $\beta_{gj} \sim N(0, 4)$; otherwise $\beta_{gj} = 0$. Given the required parameters and $x_i \sim N(0, \Sigma)$, we simulate $y_i \sim N(x_i'\beta_{z_i}, \omega_i\sigma^2)$, where $\omega_i \sim \mathrm{Exp}(1)$. We then evaluate our proposed approach based
on the TNC, TCR, TPR, FPR, and TCO, and the results are shown in Table 4 with 100 replications. Overall, the proposed
method can correctly determine the number of components.

6. Empirical application: Determinants of the global financial crisis

We use the cross-sectional dataset from Feldkircher (2014), a comparative study of the early warning indicators
for the 2008 financial crisis. The raw data is available at http://feldkircher.gzpace.net/pages/work.html. Following Feld-
kircher (2014), our default measure of crisis intensity is the cumulative loss in real GDP during the global financial crisis
2008/09. This dataset is of specific interest to us, due to the presence of high correlation among the variables and the
high-dimensional data (p > n).
The histogram of cumulative loss in real output during the crisis is shown in Fig. 4. It appears that output losses are
skewed to the left with a clear pattern of bimodality and possibly a few outliers located in the left tail.
To accommodate the findings in the observed data structure and to assess pre-crisis vulnerabilities that led to strong
output losses in a systematic way, we estimate the following model:
$$y_i \mid (\rho_g, \beta_g, \sigma_g^2,\ g = 1,\dots,G;\ \omega_i) \;\sim\; \sum_{g=1}^{G} \rho_g \cdot N\!\left(x_i'\beta_g,\ \omega_i\sigma_g^2\right),$$

where yi is output loss over the crisis period for the ith country and xi is the corresponding covariate vector to measure
the pre-crisis state of the economy (2006 or earlier). In total, we use 96 crisis indicators for a sample of 63 emerging
and advanced countries. In our analysis, all regressors other than the dummy variables are normalized to avoid possible
scaling effects. The data and variable description are provided in the supplementary material.
We assume that yi in the gth group is independently normally distributed with a random shock σg2 multiplied by
a scale parameter ωi to account for heteroskedasticity, βg is the regression coefficient, and ρg denotes the proportion
of data belonging to the gth group. We then fit the data using our proposed algorithm with the following choice of
hyperparameters: λ = 2, κ = 4, dgj = 0.5, ag0 = bg0 = 1, cj2 = 63, αg = 1. In addition, we set bG = dG = 0.5.

Fig. 4. Histogram of the cumulative loss across countries.

We show the degree of collinearity in Fig. 5, with the absolute values of correlations between variables greater than
0.85 in bold. We note that several pairs of variables are highly correlated, e.g., openness.0206 and merchTrade.0006, or freedom.from.corr.06 and cpi.corruption.06. The selection results, however, remained qualitatively similar regardless of whether these variables were included, underpinning the robustness of our proposed approach.
It is of great interest to investigate whether the countries form several distinct clusters or groups that exhibit different
behaviors with respect to the cumulative loss. With 100,000 iterations, the trace plot of the number of components for every 100th iteration is displayed in Fig. 6, which indicates that a mixture model with two components is preferable. This result suggests that the assumption of parameter homogeneity for the cross-country impact of the crisis may be inappropriate. It should be noted that Feldkircher (2014), among others, addresses heterogeneity by including regional dummy variables, which allow for different means but not different slope parameters across country groups.
the homogeneous parameter model (G = 1).
The classification of countries is shown in Fig. 7. The figure shows that about three quarters of the countries belong
to group 2 and one-quarter to group 1. It is worth noting that the cluster memberships do not coincide with pre-defined
regional segmentation or country-specific characteristics. For example, group 1 consists of countries spread across Europe,
including Western Europe (Denmark, Finland, France, Greece, Iceland, Ireland, Italy, Portugal, and United Kingdom) and
Emerging Europe (Estonia, Croatia, Hungary, Latvia, Lithuania, Slovenia, and Ukraine). Most of these countries, such as the
Baltics, Ukraine and CESEE were severely affected in terms of output loss during the crisis (Berglöf et al., 2009; Berkmen
et al., 2012; Cuaresma and Feldkircher, 2012; Feldkircher et al., 2014).
In the following analysis, we consider a FMR model with two components. For the sake of illustration, we only present
the 20 leading risk factors, and the full results can be found in the supplementary material. For each variable we report
its associated posterior inclusion probability (PIP) and the posterior mean (PM) of regression coefficients in Table 5. The
variables are ordered based on the PIPs. In addition to pre-crisis credit growth, which was identified by Feldkircher (2014)
as the main robust determinant of crisis severity, Table 5 reveals additional important crisis determinants with PIPs above
the threshold value of 0.5. The change in domestic credit provided by the banking sector over the period from 2000 to
2006 (chg.dom.cred.bank.0006) receives the highest posterior probability for explaining the severity of the crisis, followed
by pre-crisis population growth rate (pop.gr.0006), the change in the value of total stocks traded (chg.stocks.gdp.0006),
pre-crisis growth of real GDP (rgdpgr.06), the floating monetary regime (floater), changes in real per capita income over
the pre-crisis period (chg.rgdpcap.0006), and so on. The result of variable selection not only confirms the robustness
of pre-crisis credit growth, which corroborates previous findings in the crisis literature, but also supports the evidence
for the important role of frequently cited vulnerabilities such as financial openness, external debt, solid fiscal position,
and sufficient foreign reserve buffers. For example, the robustness of chg.rgdpcap.0006 echoes the finding of Feldkircher
(2014) that, given the inevitable boom-and-bust cycles of the economy, countries with overheated growth prior to

Fig. 5. Pairwise correlations. The figure reports a snapshot of pairwise correlations between covariates in the unit of percentage.

the crisis tend to experience severe downturns following the financial crisis. Looking at the two groups, the majority of crisis determinants is equally important for both sets of countries. If we use the threshold value of 0.5 for the posterior probability, a few variables appear marginally robust in group 1 while receiving almost no support in group 2. For the remaining variables, the differences in PIPs are modest.
In addition, as shown in Table 5, there are stark differences in the magnitude of the marginal effects across the two
groups. For example, coefficients on population growth (pop.gr.0006), the growth in stocks traded (chg.stocks.gdp.0006),
real GDP growth (rgdpgr.06) or domestic credit growth (chg.dom.cred.bank.0006) are almost or more than two times
higher in group 1 as compared to group 2. This finding corroborates our first impression from the country classification,
namely that group 1 countries are those that grew strongly prior to the crisis, with economic growth often fueled by strong
credit expansions and financial deepening. In group 2, effects of variables related to the exchange rate regime (floater and
infl.targeter), labor market freedom (labor.freedom), property rights (prop.rights.06), and the net portfolio debt inflows
(net.pf.debt.infl.0006) are more sizable. For brevity, the results of infl.targeter, labor.freedom, and prop.rights.06 are not
shown in Table 5, but are provided in the supplementary material. Again, these effects can be attributed more to advanced
economies as these display in general higher levels of institutional quality and government debt.
To assess the relative fit of each mixture model, we compare the AIC, BIC, and ICL, with better fitting models having
lower values of these measurements. In addition, since predictive performance is an important model selection criterion,
we examine the fit of the median model (a simple OLS regression including those variables with PIP greater than 0.5) under different numbers of mixture components. A better fit for the two-component model would support the usefulness of the approach and the necessity of running such a model.
Results for alternative mixture models are presented in Tables 6 and 7, respectively. The goodness of fit of the model
shown in Table 6 suggests a better fit of the two-component mixture model, although the AIC appears to favor a more
complex model. Furthermore, we examine the results from the one- and two-component models to assess predictive
accuracy using leave-one-out cross-validation. Table 7 gives the mean squared prediction error (MSPE) and the mean
absolute prediction error (MAPE); the two-component mixture model provides substantially better predictions than a
single component model. Fig. 8 illustrates the predictive values for each country. Overall, it seems that a two-component
mixture model is more suitable for analyzing the cross-sectional data with heterogeneity than a single component model.

Fig. 6. Trace plot of the number of components for every 100th iteration.

Fig. 7. The posterior group memberships in the 2-component mixture model.

7. Concluding remarks

In this paper, we provide a flexible framework for modeling population heterogeneity that allows for variable selection and the determination of the number of mixture components simultaneously. A fully Bayesian approach is developed by proposing a feasible reversible-jump Markov chain Monte Carlo sampler that obviates the statistical pitfalls of requiring the number of components constituting the data to be known a priori. More importantly, the proposed mixture model is made robust to potential model misspecification by allowing t-errors in the component distributions. The proposed framework is applied to simulated data and shows excellent statistical properties in uncovering the correct number of components as well as the variables of the underlying model. The results are robust to a prior sensitivity analysis.
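
To illustrate the robustness device, recall that t-distributed errors admit a scale-mixture-of-normals representation (Geweke, 1993): eps_i | lam_i ~ N(0, sigma^2 / lam_i) with lam_i ~ Gamma(nu/2, nu/2). The fragment below shows the corresponding data-augmentation draw that a Gibbs or reversible-jump sweep would use; it is a stylized sketch under these assumptions, not our full sampler.

import numpy as np

def draw_lambda(resid, sigma2, nu, rng):
    # Full conditional of the latent scales lam_i behind t-errors:
    # lam_i ~ Gamma((nu + 1) / 2, rate = (nu + resid_i**2 / sigma2) / 2).
    shape = (nu + 1.0) / 2.0
    rate = (nu + resid ** 2 / sigma2) / 2.0
    return rng.gamma(shape, 1.0 / rate)  # numpy parameterizes by scale

# Illustrative draw with nu = 4, the degrees of freedom assumed in the
# empirical application; resid is a synthetic stand-in for residuals.
rng = np.random.default_rng(2)
resid = rng.standard_t(df=4, size=10)
print(draw_lambda(resid, sigma2=1.0, nu=4.0, rng=rng))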

Table 5
Robust posterior estimates for the two-component mixture model.
Variable                    PIP (Group 1)   PIP (Group 2)   PM (Group 1)   PM (Group 2)
chg.dom.cred.bank.0006 0.8271 0.8149 −22.9649 −13.8002
pop.gr.0006 0.6071 0.5886 7.9815 3.4963
chg.stocks.gdp.0006 0.5907 0.5798 9.3973 3.5484
rgdpgr.06 0.5840 0.5856 −7.2186 −3.6113
floater 0.5707 0.5778 −2.7955 −3.8727
chg.rgdpcap.0006 0.5693 0.5745 −8.1162 −5.7108
trade.freedom.06 0.5626 0.5574 −7.0100 −4.5338
net.fdi.infl.gdp.0006 0.5550 0.5465 5.4519 2.1156
gr.savings.gdp.06 0.5403 0.5403 −3.6760 −3.5081
net.pf.debt.infl.0006 0.5393 0.5387 −1.9107 −2.8141
money.gdp.06 0.5390 0.5364 3.8549 3.5480
monInd.06 0.5357 0.5369 −4.6254 −3.2245
finOpenn.06 0.5335 0.5252 −4.8611 −4.8279
infl.0006 0.5310 0.5382 3.4452 2.0376
ext.debt.gdp.06 0.5286 0.5207 −3.3116 −2.5642
infl.06 0.5285 0.5227 −3.1405 −3.0351
chg.money.gdp.0006 0.5279 0.5428 −4.5038 −2.5827
unempl.06 0.5261 0.5196 4.3043 2.9123
gen.govDebt.06 0.5232 0.5297 −2.9865 −2.8384
rgdpgr.0006 0.5226 0.5242 −4.5998 −3.0716
··· ··· ··· ··· ···
Notes: The table presents a snapshot of the full results: the posterior inclusion probability (PIP) and the posterior mean (PM) of the regression coefficient for the 20 highest-ranked variables in each of the two groups. The intercept is not shown. Variables are ordered by their PIP in group 1. Degrees of freedom for the t-errors, ν = 4, are assumed in our robust approach. For the full results, see the supplementary material.

Table 6
Information criteria for alternative mixture models.
1-component 2-component 3-component
AIC 436.23 141.97 103.29
BIC 743.61 731.74 1182.94
ICL 743.61 723.64 1042.94

Notes: Lower values indicate a better fit; BIC and ICL favor the two-component model, while AIC favors the three-component model.

Table 7
Forecast errors for one- and two-component mixture
models.
1-component 2-component
MSPE 3328.00 2253.79
MAPE 48.43 36.99

Fig. 8. Leave-one-out cross-validation for mixture models.

Finally, we apply the model to a large dataset with highly collinear variables to uncover early warning indicators for a
relatively small set of 63 countries. Our method successfully identifies two groups of countries that are not connected to
each other in a spatial sense. The two-component mixture model is chosen over the simpler one-component model by means of information criteria as well as (in-sample) cross-validation. Our results indicate that both the set
of variables and the size of estimated coefficients can vary across the two groups. For one country group, coefficients
attached to indicators of excessive and unsustainable growth (e.g., real GDP growth, credit growth, stock market growth)
appear sizable. Moreover, external debt appears to be an important regressor. These variables are typically observed
when assessing vulnerabilities for emerging economies. For the other country group, effects of the exchange rate regime,
institutional quality, and the fiscal stance (hence characteristics of more advanced economies) are particularly sizable.
Taken at face value, our results point to the limitations of a ‘foolproof’ or ‘one-size-fits-all’ list of early warning indicators
in explaining variations in the severity of the output impact across countries.

Acknowledgments

The authors wish to thank Yen-Chi Chen, Adrian Dobra, Bettina Grün, Darryl Holman, Chang-Jin Kim, Gian-Luca
Melchiorre, Roberto León-González, Eric Zivot, and participants at the University of Washington Center for Statistics and
the Social Sciences (CSSS) seminar, the 2019 North American Meeting of the Econometric Society, and the annual meeting of the Taiwan Econometric Society for helpful comments. The authors are also very grateful to the Editor, the anonymous Associate
Editor and referees for their helpful comments that have substantially improved this article. This research was supported
by the Ministry of Science and Technology of the Republic of China (MOST 108-2118-M-006-002 for Lee and MOST
107-2410-H-006-014 for Chen).

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.csda.2021.107180.

References

Barbieri, M., Berger, J.O., 2004. Optimal predictive model selection. Ann. Statist. 32, 870–897.
Becker, J.-M., Ringle, C.M., Sarstedt, M., Völckner, F., 2015. How collinearity affects mixture regression results. Mark. Lett. 26, 643–659.
Berglöf, E., Korniyenko, Y., Plekhanov, A., Zettelmeyer, J., 2009. Understanding the crisis in emerging Europe. In: EBRD Working Paper No. 109.
Berkmen, S.P., Gelos, G., Rennhack, R., Walsh, J.P., 2012. The global financial crisis: Explaining cross-country differences in the output impact. J. Int.
Money Finance 31, 42–59.
Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern
Anal. Mach. Intell. 22, 719–725.
Chen, B., 2012. Bayesian Model Selection in Finite Mixture Regression (Ph.D. thesis). University of Texas at San Antonio.
Chipman, H., George, E.I., McCulloch, R.E., 2001. The practical implementation of Bayesian model selection. In: Model Selection, Vol. 38. Institute of
Mathematical Statistics, Beachwood, OH, pp. 65–116.
Cozzini, A., Jasra, A., Montana, G., Persing, A., 2014. A Bayesian mixture of lasso regressions with t-errors. Comput. Statist. Data Anal. 77, 84–97.
Cuaresma, J.C., Feldkircher, M., 2012. Drivers of output loss during the 2008-09 crisis: a focus on emerging Europe. In: Focus on European Economic
Integration, Vol. Q2/12. pp. 46–64.
Dunson, D., Herring, A.H., Engel, S.M., 2008. Bayesian selection and clustering of polymorphisms in functionally related genes. J. Amer. Statist. Assoc.
103, 534–546.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Feldkircher, M., 2014. The determinants of vulnerability to the global financial crisis 2008 to 2009: Credit growth and other sources of risk. J. Int.
Money Finance 43, 19–49.
Feldkircher, M., Gruber, T., Moder, I., 2014. Using a threshold approach to flag vulnerabilities in CESEE economies. In: Focus on European Economic
Integration, Vol. Q3/14. pp. 8–30.
Frank, I.E., Friedman, J.H., 1993. A statistical view of some chemometrics regression tools. Technometrics 35, 109–135.
Frühwirth-Schnatter, S., 2006. Finite Mixture and Markov Switching Models. Springer, New York.
Gelfand, A.E., Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398–409.
George, E., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88, 881–889.
George, E., McCulloch, R.E., 1997. Approaches for Bayesian variable selection. Statist. Sinica 7, 339–374.
Geweke, J., 1993. Bayesian treatment of the independent student-t linear model. J. Appl. Econometrics 8, S19–S40.
Ghosh, J., Ghattas, A., 2015. Bayesian variable selection under collinearity. Amer. Statist. 69, 165–173.
Green, P., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
Gupta, M., Ibrahim, J., 2007. Variable selection in regression mixture modeling for the discovery of gene regulatory networks. J. Amer. Statist. Assoc. 102, 867–880.
Keribin, C., 2000. Consistent estimation of the order of mixture models. Sankhyā 62, 49–66.
Khalili, A., Chen, J., 2007. Variable selection in finite mixture of regression models. J. Amer. Statist. Assoc. 102, 1025–1038.
Khalili, A., Lin, S., 2013. Regularization in finite mixture of regression models with diverging number of parameters. Biometrics 69, 436–446.
Leroux, B.G., 1992. Consistent estimation of a mixing distribution. Ann. Statist. 20, 1350–1360.
Li, P., Chen, J., 2010. Testing the order of a finite mixture. J. Amer. Statist. Assoc. 105, 1084–1092.
Lin, T.I., Lee, J.C., Ni, H.F., 2004. Bayesian analysis of mixture modelling using the multivariate t distribution. Stat. Comput. 14, 119–130.
Liu, W., Zhang, B., Zhang, Z., Tao, J., Branscum, A.J., 2015. Model selection in finite mixture of regression models: a Bayesian approach with innovative
weighted g priors and reversible jump Markov chain Monte Carlo implementation. J. Stat. Comput. Simul. 85, 2456–2478.
Malsiner-Walli, G., Frühwirth-Schnatter, S., Grün, B., 2016. Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26, 303–324.
McLachlan, G.J., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
Monni, S., Tadesse, M.G., 2009. A stochastic partitioning method to associate high-dimensional responses and covariates. Bayesian Anal. 4, 413–436.
Nobile, A., Fearnside, A.T., 2007. Bayesian finite mixtures with an unknown number of components: the allocation sampler. Stat. Comput. 17, 147–162.
Richardson, S., Green, P.J., 1997. On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. Ser. B Stat. Methodol.
59, 731–792.
Sala-i-Martin, X., 1997. I just ran two million regressions. Amer. Econ. Rev. 87, 178–183.


Sperrin, M., Jaki, T., Wit, E., 2010. Probabilistic relabelling strategies for the label switching problem in Bayesian mixture models. Stat. Comput. 20,
357–366.
Spezia, L., 2009. Reversible jump and the label switching problem in hidden Markov models. J. Statist. Plann. Inference 139, 2305–2315.
Städler, N., Bühlmann, P., van de Geer, S., 2010. ℓ1-penalization for mixture regression models. Test 19, 209–256.
Stephens, M., 2000. Dealing with label switching in mixture models. J. R. Stat. Soc. Ser. B Stat. Methodol. 62, 795–809.
Tadesse, M.G., Sha, N., Vannucci, M., 2005. Bayesian variable selection in clustering high-dimensional data. J. Amer. Statist. Assoc. 100, 602–617.
Tang, Q., Karunamuni, R.J., 2017. Robust variable selection for finite mixture regression models. Ann. Inst. Stat. Math. (forthcoming).
Tibshirani, R., 1996. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288.
Viele, K., Tong, B., 2002. Modeling with mixtures of linear regressions. Stat. Comput. 12, 315–330.
Xu, C., Chen, J., 2015. A thresholding algorithm for order selection in finite mixture models. Commun. Stat. Simul. Comput. 44, 433–453.
Zellner, A., 1986. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel, P., Zellner, A. (Eds.), Bayesian
Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. North-Holland, Amsterdam, pp. 233–243.
Zhang, J., 2017. Screening and clustering of sparse regressions with finite non-Gaussian mixtures. Biometrics 73, 540–550.
Zou, H., 2006. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101, 1418–1429.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320.
