
Knowledge-Based Systems 146 (2018) 73–90

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

An integrated machine learning framework for hospital readmission prediction

Shancheng Jiang (a), Kwai-Sang Chin (a,*), Gang Qu (b), Kwok L. Tsui (a)

(a) Department of Systems Engineering and Engineering Management, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong SAR, China
(b) Dalian Dermatosis Hospital, 788 Changjiang Road, Shahekou District, Dalian City, Liaoning Province, China

(*) Corresponding author. E-mail addresses: scjiang2-c@my.cityu.edu.hk (S. Jiang), mekschin@cityu.edu.hk (K.-S. Chin), 1638654878@qq.com (G. Qu), kltsui@cityu.edu.hk (K.L. Tsui).

https://doi.org/10.1016/j.knosys.2018.01.027
0950-7051/© 2018 Elsevier B.V. All rights reserved.

ARTICLE INFO

Article history:
Received 9 June 2017
Revised 5 January 2018
Accepted 26 January 2018
Available online 1 February 2018

Keywords:
Hospital readmission
Mutual information
Multi-objective optimization
Bare-bones particle swarm optimization
Feature selection
Greedy local search

ABSTRACT

Unplanned readmission (re-hospitalization) is the main source of cost for healthcare systems and is normally considered an indicator of healthcare quality and hospital performance. Poor understanding of the relative importance of predictors and the limited capacity of traditional statistical models challenge the development of accurate predictive models for readmission. This study aims to develop a robust and accurate risk prediction framework for hospital readmission by combining feature selection algorithms and machine learning models. With regard to feature selection, an enhanced version of multi-objective bare-bones particle swarm optimization (EMOBPSO) is developed as the principal search strategy, and a new mutual information-based criterion is proposed to efficiently estimate feature relevancy and redundancy. A greedy local search strategy (GLS) is developed and merged into EMOBPSO to control the final feature subset size as desired. For the modeling process, manifold machine learning models, such as support vector machine, random forest, and deep neural network, are trained with preprocessed datasets and the corresponding feature subsets. In the case study, the proposed methodology is applied to an actual hospital located in Northeast China, with various levels of data collected from the hospital information system. Results obtained from comparative experiments demonstrate the effectiveness of the EMOBPSO and EMOBPSO-GLS feature selection algorithms. The combination of EMOBPSO (EMOBPSO-GLS) and deep neural network possesses robust predictive power among different datasets. Furthermore, insightful implications are abstracted from the obtained elite features and can be used by practitioners to determine the patients vulnerable to readmission and to target the delivery of early resource-intensive interventions.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Exploring unplanned hospital readmission risk is an issue of concern for healthcare systems, particularly in developed regions. Unplanned re-hospitalization disrupts the normal delivery of medical services and the lives of patients, and it places a critical financial burden on the healthcare system [1]. Unplanned readmission in the United States has been estimated at approximately 20% of hospital-discharged patients, which accounted for $17.4 billion of payments by Medicare. In the United Kingdom, estimated figures suggest around 35% of unplanned readmissions, costing £11 billion per year. Although a few studies have conducted related investigations, the situation in developing countries, with their limited medical resources, is similar or even worse [2]. Apart from the financial burden, unplanned re-hospitalization usually exposes patients to an increased risk of death or a long length of stay [3]. Therefore, unplanned readmission rates are considered a significant indicator for hospital performance comparison, public reporting, and reimbursement determinations [4]. In 2012, the Centers for Medicare & Medicaid Services, a federal agency within the United States Department of Health and Human Services, finalized a series of policies under the Hospital Readmissions Reduction Program, which was aimed at providing financial incentives to reduce unnecessary hospital readmissions [5]. Recently, a few tertiary referral hospitals in China have introduced the readmission rate as a quality-of-service indicator.

From the viewpoint of hospital managers, one strategy for controlling the readmission rate is to implement knowledge discovery systems that identify patients at high risk of readmission. Ideally, the predicted outcomes obtained from such systems would be used to deliver early resource-intensive interventions to the patients at greatest risk, such as intensive post-discharge care, management of their conditions at home, and other kinds of discharge planning. Potential readmissions could then be reduced by better coordinating transitions of care.
However, according to the results of a systematic literature review [4], previous readmission risk prediction models present poor predictive ability. Among the 21 reviewed studies, only 6 reported a c-statistic, that is, an area under the ROC curve (AUC), above 0.70. Most approaches are limited to patients with a specific disease, and the original features are limited to the medical level. Moreover, few previous studies attempted to design and implement a feature selection process. Generally, involving a broad variety of factors benefits model training by reducing bias, and the relative importance of data collected at the system, medical, and patient levels may differ significantly according to the population studied [4,6]. Motivated by the foregoing analysis and the current demand from our cooperating hospital, we aim to develop in this study a new framework to identify patients with high readmission risk by improving three key operations: data collection and integration, feature selection, and model training.

In the field of machine learning, feature selection is an essential pre-processing operation that selects a compact feature subset holding maximal discriminative capability. Feature selection has attracted significant attention in the past decades because it benefits the training algorithm by avoiding overfitting, controlling noise, and improving prediction performance [7,8]. Proposed feature selection algorithms focus on two points: the criterion and the search strategy. A criterion typically covers discriminability and compactness; that is, the selected feature or feature subset should have high relevance to the class labels and low redundancy within itself. This condition indicates that feature selection belongs to the class of multi-objective optimization problems. As feature selection has been proven to be an NP-hard problem [9], manifold heuristic search algorithms are employed as search strategies to reduce the search space, such as hill-climbing, simulated annealing (SA) [10], ant colony optimization (ACO) [11], particle swarm optimization (PSO) [12–14], and the genetic algorithm (GA) [15,16].

However, the proposed heuristic search strategy-based feature selection algorithms share some common limitations. One is that the criterion normally relies on a certain machine learning model, in which the model is trained as a black-box function to calculate the criterion. This kind of criterion yields a more reliable feature subset at a high cost of computational time and can hardly be generalized to high-dimensional data. Another is that most approaches directly applied popular evolutionary computation techniques as search strategies and did not fine-tune them to the feature selection task [7,17]. Few research works attempt to customize the key operators of evolutionary computation techniques for feature selection.

To compensate for the above drawbacks, we propose a novel feature selection algorithm and apply it to abstract the elite features that influence patient readmission risk. Mutual information (MI), which originated from Shannon's information theory [18,19], is used to calculate the criterion [20,21]. As a nonparametric metric for quantifying the uncertainty of a feature or feature subset, MI makes no assumption regarding the data distribution and has the advantage of measuring the joint effect of features on the target class. For the search strategy, the bare-bones multi-objective particle swarm optimization (BMOPSO) algorithm [22] is employed and tuned for the feature selection task. A greedy local search (GLS) strategy is developed and then merged into the search strategy to control the number of selected features. Compared with previous approaches to patient readmission risk prediction, our research work has added the following improvements:

(a) A wide range of input factors, such as administration data, diagnosis-related information, pharmacy and laboratory data, conditional monitoring data of inpatients, and identity information of patients, are collected from different databases and integrated as the original feature set to generalize the classifier and the final results. Instead of merely concerning the readmission risk of patients with a specific disease, all registered inpatients are considered and dealt with by one classifier. A more flexible readmission interval ranging from 30 to 180 days is also considered in our study.
(b) A new hybrid feature selection method is designed and implemented as a pre-processing step to find the key factors that influence readmission. For the criterion, we design a novel MI-based criterion with a more accurate measurement of feature redundancy and less computational cost. For the search strategy, the EMOBPSO algorithm is designed, in which the initialization strategy, the Gbest selection strategy, and the update rule for the entire swarm are improved. A GLS strategy is proposed and merged into EMOBPSO to control the number of selected features. Apart from benefiting model learning, the results can be used as a guide for hospital managers to determine potential reasons for unplanned readmissions.
(c) Instead of limiting the application to a single classification model [23,24], our research introduces three modern learning models to solve this machine learning task: support vector machine (SVM), random forest (RF), and deep neural network (DNN). After examination with cross-validation tests, the model presenting the best performance is put into practice.

The rest of this paper is organized as follows. Section 2 reviews the related work on hospital readmission prediction, MI-based feature selection, and the evolutionary search strategies proposed in this field. Section 3 describes the details of our new methodology. Section 4 presents experimental results on real datasets to evaluate the efficiency and effectiveness of our methodology; in addition, insightful implications for practitioners are explored. Finally, Section 5 concludes the paper.

2. Related works

2.1. Risk prediction model for hospital readmission

By comparing the studies retrieved in two systematic reviews on this problem [4,25], we have found an increasing number of studies devoted to utilizing statistical models for predicting hospital readmission risks. The outcome of 30-day readmission was most commonly reported [26–29], even though a few approaches selected other intervals ranging from 14 days [30] to 4 years [31]. Almost all approaches employed retrospective data to learn the model and conduct an analysis, and these studies vary in the target population and outcome variables. The majority pre-selected a target population with the same disease or under similar conditions, such as approaches focusing on cardiovascular disease-related readmissions [32,33], surgical condition readmissions [34,35], and mental health condition readmissions [36]. Although a recent study supported that elderly people are not the only group that should be scrutinized in predicting readmissions, readmissions of elderly people attracted more attention than those of younger people, based on the average age of the target population reported in previous studies [6]. Among the input features identified as important contributors to readmission, "medical comorbidity," "length of stay," and "number of previous admissions" were used in nearly all studies [25]. The variable "laboratory test" or "medication" was more frequently included in the classification models for cardiovascular disease-related readmissions. Basic sociodemographic variables, such as age and gender, were also considered by many studies [29,37,38]. For model comparison and performance evaluation, the most frequently reported metric is the AUC, which is a more comprehensive metric than accuracy for datasets with class imbalance.

The results reported by published studies indicate that unplanned readmission risk prediction is a poorly understood and complex endeavor.

One significant reason is that the frequently used patient-level factors, such as "medical comorbidity" and other clinical variables, are more appropriate for predicting mortality than readmission risk [4]. The condition of patients after discharge, such as the timeliness of post-discharge nursing care and the quality of medication reconciliation, may affect readmission risk; however, the related data are difficult to collect or monitor. Broad social and environmental factors, such as "access to care" and "social support," are suggested to contribute indirectly to readmission risk; however, the utility of such factors is limited [39]. Another reason is that the traditional logistic regression method, with its relatively weak predictive power, is applied in most approaches. Recently, Golmohammadi and Radnia [6] introduced three advanced machine learning algorithms in their study, namely, neural network, decision tree-based classification, and Chi-squared automatic interaction detection, to reach a model with a high level of accuracy. Duggla et al. [40] applied a naive Bayes classifier and a decision tree to enhance the overall model performance obtained by logistic regression. In that study, the beneficial effects of conducting a feature selection process are explored as well. The result indicates that this type of pre-processing technique is indeed useful in their scenario, whereas the operation is ignored by most previous studies.

2.2. Feature selection

As a measure of the amount of shared information between two random variables, MI has been frequently used as the criterion in existing filter feature selection algorithms. Early on, an algorithm named MIFS ("mutual information based feature selection") was proposed by Battiti [41], in which the MI between feature vectors is approximated by calculating the MI between the individual components of the vectors. In this algorithm, a greedy sequential forward selection strategy is applied to select the best feature subset. Kwok and Chong-Ho [42] proposed an algorithm called MIFS-U by developing a more accurate estimation of the conditional MI term involved in MIFS, under the assumption that the information is uniformly distributed. Peng et al. [21] proposed the minimal redundancy maximal relevance (mRMR) algorithm, which achieved good performance by wrapping a learning algorithm. Recently, more attention has been paid to the estimation of MI (or entropy) for multiple features. Estevez et al. [43] proposed an enhancement of MIFS called normalized MIFS and demonstrated that MI should be normalized by its corresponding entropy. Wang et al. [44] defined a new feature redundancy measurement that is capable of accurately estimating the MI between features and the target variable. Sun et al. [45] introduced a new scheme to estimate the MI and conditional MI by dynamically re-weighting the candidate features. Most of these algorithms adopt greedy search strategies to incrementally select features, which are more likely to generate locally optimal solutions. The proposed MI-based feature selection algorithms are extensively used to conduct knowledge discovery in different domains, such as load forecasting [46,47], medical diagnosis [48,49], and financial analysis [50].

Evolutionary computation techniques are extensively applied as search strategies in wrapper feature selection algorithms. Compared with GA and ACO, PSO is normally easier to implement, with less computational cost and a faster convergence rate. With these advantages, PSO stands out as a promising search strategy for feature selection. Wang and Yan [51] developed a binary PSO-based feature selection to reduce network complexity and improve generalization ability. In the study by Zhang et al. [52], binary PSO with a mutation operator was used as the feature subset search strategy for spam detection. An improved PSO with two filter techniques was developed by Chhikara et al. [13] for image steganalysis. Among the proposed PSO-based feature selection algorithms, most are wrapper approaches that incur high computational cost and risk failing to maintain high performance under other learning algorithms. Moreover, very few studies apply or design multi-objective PSO as a search strategy for feature selection, even though PSO, multi-objective optimization, and feature selection have each been investigated and improved frequently [14]. Therefore, the potential of PSO and multi-objective PSO for feature selection has not been fully explored.

3. Methodology

3.1. Preliminary

3.1.1. Entropy and mutual information
In Shannon's information theory, information is defined as something that removes or reduces uncertainty, while entropy is the expected value of the information contained in random variables. For a random variable X with discrete values, the entropy of X is defined as

H(X) = -\sum_{x \in X} p(x) \log p(x),   (1)

where p(x) = Pr(X = x) is the probability density function of X. Eq. (1) indicates that entropy depends on the probability distribution of the random variable rather than on its actual value. For two discrete random variables X and Y with joint probability density p(x, y), the joint entropy of X and Y is defined as

H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y).   (2)

If a certain variable is known and the others are not, then the remaining uncertainty is measured by conditional entropy. Let variable Y be the given one; then the conditional entropy H(X|Y) of X with respect to Y is defined as

H(X|Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x|y),   (3)

where p(x|y) is the posterior probability of X given Y. According to this definition, if X completely depends on Y, then H(X|Y) is zero, which indicates that no other information is required to describe X when Y is known. Otherwise, H(X|Y) = H(X) indicates that Y offers no meaningful information regarding X [53]. The relationship between joint entropy and conditional entropy is denoted as follows:

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).   (4)

The common information shared between two random variables is defined as mutual information (Eq. (5)), which is a promising indicator of the relevance between two random variables:

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)}.   (5)

From Eq. (5), a large I(X; Y) indicates that the two variables are closely related; if I(X; Y) = 0, then X and Y are completely unrelated. As defined in Eq. (5), the relationship between MI and entropy is illustrated in Fig. 1. With regard to continuous random variables, the differential entropy, conditional differential entropy, and MI are respectively defined as:

H(X) = -\int p(x) \log p(x) \, dx,   (6)

H(Y|X) = -\int \int p(x, y) \log p(y|x) \, dx \, dy,   (7)

I(X; Y) = \int \int p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)} \, dx \, dy.   (8)
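To make these definitions concrete, the sketch below estimates entropy and MI for discrete variables directly from samples. This is a minimal plug-in estimator for illustration only; it is not the kernel-smoothing "mpmi" estimator used later in the case study, and the sample arrays are hypothetical.

```python
import numpy as np
from collections import Counter

def entropy(xs):
    """Empirical entropy H(X) of Eq. (1), in nats."""
    n = len(xs)
    return -sum((c / n) * np.log(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), equivalent to Eq. (5) via Eq. (4)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

x = [0, 0, 1, 1, 1, 0, 1, 0]
y = [0, 0, 1, 1, 0, 0, 1, 1]      # mostly follows x
print(mutual_information(x, y))   # > 0: the variables share information
```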

Fig. 1. Relationship between MI and entropy.
3.1.2. PSO and bare-bones PSO
PSO solves a problem by maintaining a population of candidate solutions (i.e., particles), and each particle is iteratively improved by moving around the search space according to functions over its position and velocity. Considering the ith particle in the swarm, its position and velocity at the tth generation in a traditional PSO are denoted as X_i(t) = (x_{i,1}(t), x_{i,2}(t), \ldots, x_{i,N}(t)) and V_i(t) = (v_{i,1}(t), v_{i,2}(t), \ldots, v_{i,N}(t)), respectively. The update rule of this particle is denoted as follows:

v_{i,j}(t+1) = w \cdot v_{i,j}(t) + r_1 c_1 (pb_{i,j}(t) - x_{i,j}(t)) + r_2 c_2 (gb_j(t) - x_{i,j}(t)),
x_{i,j}(t+1) = x_{i,j}(t) + v_{i,j}(t+1), \quad j = 1, 2, \ldots, N,   (9)

where Pb_i(t) = (pb_{i,1}(t), pb_{i,2}(t), \ldots, pb_{i,N}(t)) represents the best position found by particle i so far, called the personal best solution (Pbest), and Gb(t) = (gb_1(t), gb_2(t), \ldots, gb_N(t)) is the best-known position of the entire swarm (Gbest). c_1 and c_2 are pre-defined nonnegative constants that represent the acceleration coefficients; w is also a pre-defined hyper-parameter that controls the exploration of a particle in the search space. r_1 and r_2 are two random values within [0, 1] that control the diversity of the swarm.

Bare-bones PSO (BPSO) can be regarded as a simplified version of PSO that eliminates the need for pre-defining and tuning hyper-parameters such as the inertia weight w and the acceleration coefficients c_1 and c_2. Unlike traditional PSO, the position of particle i in the swarm is updated as follows:

x_{i,j} = N\big((pb_{i,j} + gb_j)/2, \; (pb_{i,j} - gb_j)^2\big),   (10)

where N(a, b) denotes the Gaussian distribution with mean a and variance b [54]. Owing to its hyper-parameter-free advantage, BPSO has recently been applied to many practical problems.
3.1.3. Multi-objective optimization problem and multi-objective PSO
In many real-world situations, a problem that requires optimizing multiple objectives simultaneously is encountered; such problems are called multi-objective optimization problems (MOPs) [55]. Multi-objective optimization involves minimizing or maximizing multiple conflicting objective functions. In mathematical terms, a minimization problem with multiple objective functions can be expressed as follows [14]:

minimize F(x) = [f_1(x), f_2(x), \ldots, f_k(x)]
subject to g_i(x) \le 0, \; i = 1, 2, \ldots, m; \quad h_i(x) = 0, \; i = 1, 2, \ldots, l,   (11)

where x is the vector of decision variables, f_i(x) is the ith objective function of x, k is the number of objective functions, and g_i(x) and h_i(x) are the constraints of the problem. In multi-objective optimization, the quality of a solution is explained in terms of trade-offs between conflicting objectives. Let y and z be two solutions of the above-mentioned k-objective minimization problem. If the conditions shown in Eq. (12) are met, one says that y dominates z. When a solution is not dominated by any other solution, it is called a Pareto-optimal solution. The set of all Pareto-optimal solutions forms the Pareto front, and a multi-objective optimization algorithm is designed to search for the Pareto front.

\forall i: f_i(y) \le f_i(z) \; \text{and} \; \exists j: f_j(y) < f_j(z), \quad \text{where } i, j \in \{1, 2, 3, \ldots, k\}.   (12)

The promising results provided by PSO for solving single-objective optimization problems (SOPs) validate its effectiveness and efficiency. This motivates researchers to extend PSO to MOPs, and many multi-objective PSO (MOPSO) algorithms have been proposed accordingly [56,57]. Most proposed MOPSO algorithms can be classified into two categories: the first category embeds the Pareto dominance relationship into PSO, which is used to determine the personal best and global best particles in the iterative process; the second category adopts a decomposition approach to transform MOPs into a set of single-objective optimization problems [58]. BMOPSO [22], the raw algorithm underlying EMOBPSO, belongs to the first category.

3.2. A new MI-based criterion for feature selection

Ideally, the MI-based feature selection problem is summarized as follows. Given an initial feature set F with n features, find the subset S ⊂ F with k features that maximizes I(c; S), where c is the target class variable. However, directly calculating the criterion I(c; S) entails extremely high computational complexity, which involves the inverse of a high-dimensional covariance matrix. In the present study, the calculation of the high-dimensional MI is simplified by evaluating all pairwise MI terms.

Given any two features f_i and f_j belonging to S, that is, f_i, f_j ∈ S, i ≠ j, the mutual information between {f_i, f_j} and c is denoted as

I(f_i, f_j; c) = I(f_i; c|f_j) + I(f_j; c|f_i) + I(f_i; f_j; c) = I(f_i; c) + I(f_j; c|f_i) = I(f_j; c) + I(f_i; c|f_j).   (13)

The mechanism behind Eq. (13) is illustrated in Fig. 2, in which the shaded part (i.e., areas 1, 2, and 4) is I(f_i, f_j; c). In Eq. (13), either I(f_i; c|f_j) or I(f_j; c|f_i) represents a conditional MI term, which involves the joint and conditional entropy of all three variables and still maintains high computational complexity. To simplify this conditional MI term, Eq. (13) is transformed as

I(f_i, f_j; c) = I(f_i; c) + I(f_j; c) - I(f_i; f_j; c) = D - R.   (14)

As shown in Eq. (14), D = I(f_i; c) + I(f_j; c) represents the relevance of features i and j to the target class labels. Meanwhile, R = I(f_i; f_j; c) measures the redundancy relating features i and j to the target class c, which is the information shared between f_i and f_j with respect to c. Considering every pair of features in S, the criterion I(c; S) is approximated by generalizing Eq. (14) to

I(c; S) \approx \sum D - \sum R = \sum_{f_i \in S} I(f_i; c) - \sum_{f_i, f_j \in S, i \neq j} I(f_i; f_j; c).   (15)

Cervante et al. [53] indicate that the redundancy contained in the selected feature subset can be approximated as the mutual information shared by each pair of selected features, thereby avoiding the conditional MI terms involved in \sum R. In this approach, I(c; S) is evaluated as

I(c; S) \approx \sum D - \sum R \approx \sum_{f_i \in S} I(f_i; c) - \sum_{f_i, f_j \in S, i \neq j} I(f_i; f_j).   (16)

According to Fig. 2, the \sum_{f_i, f_j \in S, i \neq j} I(f_i; f_j) term can be interpreted as areas 3 and 4, whereas the objective term \sum_{f_i, f_j \in S, i \neq j} I(f_i; f_j; c) is the single area 4. Therefore, the true redundancy is overrated in Eq. (16), as the effect of c is not considered. A shrink coefficient is necessary for the \sum I(f_i; f_j) term to offset this bias.

Fig. 2. Venn diagram of information theoretic measures for three variables: f_i, f_j, and c.

Fig. 3 illustrates two extreme situations of the relationship among f_i, f_j, and c, in which (a) indicates that the target class label c barely relies on f_i and f_j, while (b) indicates a high relevance among them. By tracing the variation of areas 1, 2, and 4 in Fig. 3(a) and (b), the size of area 4 is notably positively correlated with the sizes of areas 1 and 2; that is, I(f_i; f_j; c) is positively correlated with I(f_i; c) and I(f_j; c) simultaneously. Based on this regularity, the shrink coefficient should be set as the relative value of I(f_i; c) and I(f_j; c), and the criterion I(c; S) is approximated as

I(c; S) \approx \sum \bar{D} - \sum \bar{R} \approx \sum_{f_i \in S} I(f_i; c) - \sum_{f_i, f_j \in S, i \neq j} \frac{I(f_i; c) + I(f_j; c)}{2 \max_{f_i \in S} I(f_i; c)} \, I(f_i; f_j).   (17)

Fig. 3. Two extreme situations of the relationship among f_i, f_j, and c.

The mechanism of the shrink coefficient in Eq. (17) is interpreted as follows. As I(f_i; c) and I(f_j; c) approach the maximal value of pairwise MI between the selected features and the target variable, the shrink coefficient (I(f_i; c) + I(f_j; c)) / (2 max_{f_i ∈ S} I(f_i; c)) approaches one, which corresponds to the situation shown in Fig. 3(b). By contrast, if I(f_i; c) and I(f_j; c) approach zero, the shrink coefficient also approaches zero, thereby producing a notable shrinkage in the corresponding I(f_i; f_j); this corresponds to the situation shown in Fig. 3(a). Compared with previous studies on estimating the redundancy between features, that is, \sum R, one significant advantage of the present approximation method is its low computational cost, achieved by avoiding the introduced conditional MI terms. All terms involved in Eq. (17) are formulated as pairwise MI, which can be retrieved from a pre-computed pairwise MI matrix. In addition, the entire approximation process is derived in a general way, without any restrictions on the information distribution.
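Because Eq. (17) uses only pairwise MI, both objectives can be evaluated cheaply once the pairwise MI matrix is pre-computed, as noted above. The following is a minimal sketch, assuming mi_fc holds the values I(f_i; c), mi_ff holds I(f_i; f_j) as a NumPy array, and each unordered feature pair is counted once; the function name is illustrative.

```python
def emobpso_objectives(subset, mi_fc, mi_ff):
    """Relevance and shrunk redundancy of Eq. (17) for one decoded particle.
    mi_fc[i] = I(f_i; c); mi_ff[i, j] = I(f_i; f_j) (pre-computed matrix)."""
    idx = list(subset)
    relevance = sum(mi_fc[i] for i in idx)          # Objective 1: maximize
    max_rel = max(mi_fc[i] for i in idx)
    redundancy = 0.0                                # Objective 2: minimize
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):            # unordered pairs
            i, j = idx[a], idx[b]
            shrink = (mi_fc[i] + mi_fc[j]) / (2.0 * max_rel)
            redundancy += shrink * mi_ff[i, j]
    return relevance, redundancy
```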
3.3. Enhanced multi-objective bare-bones PSO algorithm

According to the criterion (Eq. (17)), the present feature selection is a two-objective optimization problem: Objective 1 is to maximize the relevance of the selected features to the target class, while Objective 2 is to minimize the redundancy between them. This section describes in detail the EMOBPSO search strategy for feature selection, in which the initialization strategy, the Gbest selection strategy, and the update rule of BMOPSO are improved and tuned for feature selection.

3.3.1. Initialization strategy
Unlike the binary representation adopted in most previous approaches, our initialization strategy is based on an alternative encoding method, in which the probability that a feature is included in the feature subset is treated as an element of a particle. For a dataset with N features, the ith particle in the swarm is encoded as

X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,N}), \quad x_{i,j} \in [0, 1], \; j = 1, 2, \ldots, N, \; i = 1, 2, \ldots, M,   (18)

where M is the swarm size and x_{i,j} represents the probability that the corresponding jth feature is included in the feature subset. For a given X_i, its decoded solution Z_i is set as

z_{i,j} = \begin{cases} 1, & \text{if } x_{i,j} > 0.5 \\ 0, & \text{otherwise} \end{cases}   (19)

where z_{i,j} indicates whether the jth feature is selected. The quality of the initial population influences the convergence rate and the possibility of falling into local minima. Based on the result of a filter method, the present initialization strategy attempts to generate a few "promising particles" by combining individually good features and then placing them into an initial swarm of randomly distributed particles. The pseudocode for generating a "promising particle" is given in Algorithm 1.

In statistics, the odds ratio is a significant way to quantify how strongly the presence or absence of a given property is associated with the presence or absence of another property in a dataset. Therefore, OR_j can be regarded as a criterion to quantify the relative importance of feature j. According to the definition of the odds ratio, if OR_j is greater than one, then the target label c is associated with feature j in the sense that a positive value of feature j raises the odds of a positive c. Based on this mechanism, the original feature set is divided into high-OR and low-OR groups, and the corresponding odds ratios are normalized to different intervals. The probability value is formulated by adding uniformly distributed noise to the normalized odds ratio so as to maintain the diversity of the initial swarm. The ratio of "promising particles" to normal particles is set as 1:3.

Algorithm 1. Generate a "promising particle" X based on the odds ratios of the original features.

Require: Original feature set F; feature size N
Begin
  Fit a logistic regression model with the training set and obtain the parameter set β̂ = {β̂_0, β̂_1, β̂_2, ..., β̂_N}
  Compute the odds ratio for the entire F: OR_j = e^{β̂_j}, j ∈ {1, 2, ..., N}
  Categorize F into two groups, a high-OR group and a low-OR group, by the threshold OR_threshold = 1
  Normalize OR_j within the two groups:
    for h_j ∈ {j | feature j in high-OR group} do (normalize the corresponding OR to the scale [0.5, 1]):
      NOR_{h_j} = 0.5 + 0.5 (OR_{h_j} - OR_{h_j}^{min}) / (OR_{h_j}^{max} - OR_{h_j}^{min})
    end
    for l_j ∈ {j | feature j in low-OR group} do (normalize the corresponding OR to the scale [0, 0.5]):
      NOR_{l_j} = 0.5 (OR_{l_j} - OR_{l_j}^{min}) / (OR_{l_j}^{max} - OR_{l_j}^{min})
    end
  Convert the normalized OR to a probability by adding noise ε:
    for j = 1 to N do
      randomly generate a noise ε ~ U(−0.25, 0.25)
      x_j = NOR_j + ε
      if x_j > 1 then x_j = 1
      else if x_j < 0 then x_j = 0
      end if
    end
  return X = (x_1, x_2, ..., x_N)
End
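A rough Python counterpart of Algorithm 1 is sketched below. The paper fits the logistic regression in R, so the scikit-learn call, the helper name, and the synthetic inputs here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def promising_particle(X_train, y_train, rng):
    """Sketch of Algorithm 1: convert per-feature odds ratios from a
    logistic regression into inclusion probabilities in [0, 1]."""
    beta = LogisticRegression(max_iter=1000).fit(X_train, y_train).coef_[0]
    odds = np.exp(beta)                       # OR_j = exp(beta_j)
    prob = np.empty_like(odds)
    # High-OR group maps to [0.5, 1], low-OR group to [0, 0.5].
    for mask, lo in ((odds >= 1.0, 0.5), (odds < 1.0, 0.0)):
        group = odds[mask]
        if group.size:
            span = group.max() - group.min() or 1.0
            prob[mask] = lo + 0.5 * (group - group.min()) / span
    prob += rng.uniform(-0.25, 0.25, size=prob.shape)   # diversity noise
    return np.clip(prob, 0.0, 1.0)
```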
3.3.2. Update rule
In EMOBPSO, a dynamic update rule based on the iteration count is proposed. During the iterative process of EMOBPSO, the ith particle in the swarm is updated as

x_{i,j}(t) = N\Big( \frac{(2 - 1/\sqrt{t}) \, pb_{i,j}(t-1) + (1/\sqrt{t}) \, gb_{i,j}(t-1)}{2}, \; [pb_{i,j}(t-1) - gb_{i,j}(t-1)]^2 \Big),   (20)

where t is the current iteration. Compared with the normal update rule of BPSO (Eq. (10)), the swarm of EMOBPSO is dynamically updated as the iteration count increases. At the beginning, t equals one, Eq. (20) simplifies to Eq. (10), and the expected value of the new x_{i,j}(2) is the average of the old pb_{i,j}(1) and gb_{i,j}(1). As t increases, the expected position of a new particle moves toward its personal best solution (relative to the global best-known position). As such, the effect of Gbest is weakened, and a new particle is more likely to fall into the neighborhood of its personal best position as t increases. Therefore, this new rule avoids trapping the swarm in a local minimum around Gbest while maintaining the convergence rate by taking advantage of the neighborhoods of all available personal best positions.

Similar to most previous multi-objective evolutionary algorithms, EMOBPSO adopts an external archive (EA) to store the non-dominated solutions found. In the iterative process, the capacity of the EA (CEA) is kept constant over generations by employing a proposed Gbest update mechanism [14], as shown in Algorithm 2.

Algorithm 2. External archive (EA) update mechanism with constant size CEA in each generation.

Require: the current EA obtained in the last generation; the current swarm; CEA
Begin
  Merge EA into the current swarm and then empty EA. |EA| is set as zero.
  while |EA| < CEA do
    Identify the non-dominated solutions (NDS) in the current swarm;
    if |EA| + |NDS| ≤ CEA then
      send NDS to the current EA;
      |EA| = |EA| + |NDS|;
      remove NDS from the swarm;
    else
      calculate the crowding distance for each particle in NDS;
      sort the particles in NDS in descending order;
      send the top (CEA − |EA|) particles to EA;
      remove the top (CEA − |EA|) particles from the swarm;
      |EA| = CEA;
    end if
  end while
  Return EA
End

3.3.3. Gbest selection strategy
EMOBPSO selects a Gbest position for each particle from the external archive to maintain the diversity of the swarm. Instead of randomly assigning a Gbest to each particle, a roulette wheel selection-based strategy is designed at this stage, as shown in Algorithm 3.

Algorithm 3. Gbest selection strategy for the swarm.

Require: EA
Begin
  Identify the different levels of non-dominated fronts in EA and divide EA into nl subsets:
  EA = {EA_1, EA_2, ..., EA_nl}, in which nl is the number of total levels;
  for each particle i in the swarm do
    (i) Select a level_i for particle i based on the roulette wheel, in which the probability of selecting level k in {1, 2, ..., nl} is p_k = (nl + 1 − k) / (nl(nl + 1)/2)
    (ii) Randomly select a gb_i for particle i from EA_{level_i}
  end
End

In Algorithm 3, the levels of non-dominated fronts are interpreted as follows. First, the non-dominated solutions in EA are named the first level of the non-dominated front and then excluded from EA. Then, the non-dominated solutions in the remaining EA are named the second level of the non-dominated front. By repeating this procedure, the subsequent levels are individually identified. For a given particle in the swarm, a non-dominated front with a high level in EA is more likely to be selected under the roulette wheel mechanism. Therefore, this Gbest selection strategy maintains the diversity and quality of the swarm during the entire iterative process.
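The level-selection step of Algorithm 3 can be sketched as follows. The probabilities p_k sum to one and favor the first (best) front levels without excluding the later ones; the function name and example values are illustrative.

```python
import numpy as np

def select_gbest_level(nl, rng):
    """Roulette-wheel choice of a non-dominated front level (Algorithm 3):
    level k is drawn with probability (nl + 1 - k) / (nl * (nl + 1) / 2)."""
    levels = np.arange(1, nl + 1)
    probs = (nl + 1 - levels) / (nl * (nl + 1) / 2)
    return rng.choice(levels, p=probs)

rng = np.random.default_rng(42)
picks = [select_gbest_level(4, rng) for _ in range(10)]
print(picks)   # mostly the best front levels, occasionally later ones
```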

Algorithm 4. EMOBPSO algorithm for feature selection.

Begin
  Obtain the initial swarm by generating M_a particles based on Algorithm 1 and M_b entirely random particles. Set M_a : M_b = 1 : 3; the size of the swarm is M = M_a + M_b;
  Initialize the two objective values (∑D and ∑R) of each particle according to Eq. (17). Set the EA size CEA, and initialize EA based on Algorithm 2;
  Initialize the global best position (gb_i) for each particle based on Algorithm 3;
  Initialize the personal best position (pb_i) of each particle as the particle itself;
  while the maximum number of iterations is not reached do
    for each particle i in the swarm do
      Update particle i as in Eq. (20), where t is the current iteration;
      Update the two objectives of the particle;
    end
    for each particle i ← 1 to M do
      Update pb_i as follows:
        pb_i(t) = 0.5[X_i(t) + Z(X_i(t))], if pb_i(t−1) ≺ X_i(t);
        pb_i(t) = 0.5[X_i(t) + Z(X_i(t))], if X_i(t) performs better than pb_i(t−1) in Objective 1 and the Objective 2 value of X_i(t) is less than the maximal value of pb_i(t−1);
        pb_i(t) = pb_i(t−1), otherwise;
      in which Z(X_i(t)) represents the decoded solution of X_i(t), and pb_i(t−1) ≺ X_i(t) denotes that X_i(t) dominates pb_i(t−1);
    end
    Update EA based on Algorithm 2;
    for each particle i in the swarm do
      Update gb_i for particle i based on Algorithm 3;
    end
  end while
End
3.3.4. Implementation of EMOBPSO
Details of the new EMOBPSO are presented in Algorithm 4. Because EMOBPSO, like any other multi-objective evolutionary algorithm, provides a set of solutions on the final Pareto front, the decoded solutions in the final swarm are used to train SVMs, and the solution with the lowest training error is selected as the resulting feature subset.
3.4. Greedy local search

In certain cases, practitioners want to acquire concise and informative features that reflect the core reasons for unplanned readmission. To meet this demand, an operator should be added to EMOBPSO to control the number of selected features as pre-determined. Based on the principle of maximum relevance and minimum redundancy, a greedy local search operator (GLS) is designed and merged into EMOBPSO. The details of GLS are shown in Algorithm 5.

Algorithm 5. GLS for the current swarm.

Require: DFS: pre-determined desired feature size; S: currently selected features with size L; F: original feature set with size N
Begin
  for particle i = 1 to M do
    if L_i < DFS then add to S_i the feature that satisfies:
      argmax_{f_j ∈ F − S_i} [ I(f_j; c) − (1/L_i) ∑_{f_k ∈ S_i} I(f_j; f_k) ]
    else delete from S_i the feature that satisfies:
      argmax_{f_j ∈ S_i} [ (1/(L − 1)) ∑_{f_k ∈ S_i, k ≠ j} I(f_k; f_j) − I(f_j; c) ]
    end if
  end
End
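A compact sketch of one GLS move under the same max-relevance/min-redundancy score, reusing the pairwise MI structures from Section 3.2; the function name is illustrative and a non-empty current subset is assumed.

```python
import numpy as np

def gls_step(S, mi_fc, mi_ff, dfs):
    """One greedy move of Algorithm 5: grow or shrink the selected subset S
    toward the desired feature size dfs. Assumes mi_fc[j] = I(f_j; c) and
    mi_ff is the pre-computed pairwise MI matrix (NumPy array)."""
    S = list(S)
    if len(S) < dfs:       # add the outside feature with the best score
        outside = [j for j in range(len(mi_fc)) if j not in S]
        best = max(outside, key=lambda j: mi_fc[j] - mi_ff[j, S].mean())
        S.append(best)
    elif len(S) > dfs:     # drop the most redundant, least relevant feature
        def drop_score(j):
            rest = [k for k in S if k != j]
            return mi_ff[rest, j].mean() - mi_fc[j]
        S.remove(max(S, key=drop_score))
    return S
```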
As proven by Peng et al. [21], a combination of the max-relevance and min-redundancy criteria is equivalent to the max-dependency criterion if one feature is selected at a time. In GLS, if the number of selected features is smaller than DFS, then the feature achieving the maximal increment of dependency, that is, I(c; S), is added from (F − S). Similarly, if the number of selected features is larger than DFS, then the feature yielding the minimal decrement of I(c; S) is removed from the current S. Thus, the quality of the feature subset is maximally maintained while the number of selected features is kept under control during the iterative process. Details of implementing EMOBPSO with GLS (EMOBPSO-GLS) are described in Algorithm 6.

Algorithm 6. EMOBPSO-GLS.

Require: DFS: pre-determined desired feature size
Begin
  Generate the initial swarm, Pbest, Gbest, etc. (identical to Algorithm 4); the two objectives are defined as ∑D̄ and ∑R̄ (Eq. (17)) in order to offset the side effect on GLS.
  while the maximum number of iterations is not reached do
    for each particle i in the swarm do
      Update particle i as in Eq. (20), where t is the current iteration;
      Update the two objectives (∑D̄ and ∑R̄ in Eq. (17)) of the particle;
    end
    Apply GLS based on Algorithm 5;
    for each particle i ← 1 to M do
      Update pb_i as shown in Algorithm 4;
    end
    Update EA based on Algorithm 2;
    for each particle i in the swarm do
      Update gb_i for particle i based on Algorithm 3;
    end
  end while
End

3.5. Classification models

Besides identifying the patients with high readmission risk, the classification models are used to quantify the strength of the current feature selection methods and to show how they influence classification performance. Thus, we introduce three machine learning techniques, namely, support vector machine (SVM), random forest (RF), and deep neural network (DNN).

3.5.1. Support vector machine
SVM is one of the most influential approaches to supervised learning tasks [59]. Considering a two-class, linearly separable classification task with the dataset {x_1, x_2, \ldots, x_m} and x_i's target class label y_i ∈ {1, −1}, we can summarize the original SVM as the following constrained optimization problem:

minimize \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \; \forall i.   (21)

Driven by a linear function (w^T x + b), SVM aims to find a decision boundary that classifies all points correctly, a process similar to logistic regression. Among studies focusing on SVM, one key innovation is the kernel trick, which is derived by observing that the linear function of SVM can be rewritten as

w^T x + b = b + \sum_{i=1}^{m} \alpha_i x^T x_i,   (22)

where α is a vector of coefficients. Rewriting the key operators in this form allows us to replace x with the output of a given function φ(x), and the dot product with a kernel function k(x, x_i) = φ(x) · φ(x_i). After this replacement, predictions can be implemented using the function

f(x) = b + \sum_{i=1}^{m} \alpha_i k(x, x_i).   (23)

From Eq. (23), f(x) is nonlinear with respect to x, but the relationship between f(x) and φ(x) is linear. By using this kernel trick to map the inputs into high-dimensional feature spaces, SVM can efficiently perform nonlinear classification, and nonlinear models can be learned through convex optimization techniques. In this study, the radial basis function (RBF) kernel is selected for SVM.
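A minimal scikit-learn sketch of the classifier choice described above; the data here are synthetic stand-ins, not the hospital records.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical design matrix: rows = admissions, columns = selected
# features; y = 1 if readmitted within the interval, else 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# RBF-kernel SVM as in Eq. (23); probability=True yields scores for AUC.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X, y)
print(clf.predict_proba(X[:5])[:, 1])
```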
3.5.2. Random forest
RF is an ensemble method for classification, regression, and other tasks, which operates by constructing a multitude of decision trees at training time and producing its prediction by averaging over all individual trees [60]. RF is easy to apply and makes no distributional assumptions. In particular, deep decision trees are capable of learning highly irregular patterns, but they normally have very high variance and are more likely to overfit their training sets. Based on bootstrap aggregation (i.e., bagging), RF provides a way of averaging multiple deep decision trees with the goal of reducing this variance. The basic idea of RF can be summarized in four points: (1) training many i.i.d. trees on bootstrap samples of the training data, (2) reducing bias by growing trees sufficiently deep, (3) reducing the variance of noise by averaging the output of all trees, and (4) maximizing the variance reduction by minimizing the correlation between trees, through bootstrapping data for each tree and sampling the available variable set at each node. Generally, RF achieves robust performance even if several outliers are present in the predictive variables.
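The four points above map directly onto the standard random forest hyper-parameters, as in this sketch; the settings and synthetic data are illustrative, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # synthetic stand-in data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Bootstrap samples per tree, deep trees (no depth cap), averaged votes,
# and a random feature subset at each split to decorrelate the trees.
rf = RandomForestClassifier(
    n_estimators=500, max_depth=None, max_features="sqrt",
    bootstrap=True, random_state=0,
).fit(X, y)
print(rf.feature_importances_[:3])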
3.5.3. Deep neural network
In the era of big data, DNN has achieved outstanding results in manifold knowledge discovery and data mining applications [61]. This study employs a feedforward DNN model, in which the information flows from an input layer holding the input features, through multiple layers of nonlinearity, and finally to the output layer that produces the target class label (Fig. 4a). In a DNN, each neuron is connected to other neurons by communication links, with each link carrying an associated weight parameter. The information transmission among neurons in different layers is illustrated in Fig. 4b, where the function f(x) is called the "activation function".

Fig. 4. General structure and information transmission of DNN.

As reported by previous studies, one significant optimization challenge in training DNNs is overfitting. In this study, a novel regularization strategy called "dropout" is introduced to avoid overfitting. To train a DNN with dropout, some neurons are randomly omitted with a predefined probability in each generation, and finally different sets of multiple DNNs are shared and combined. Thus, the dropout method can be considered an extreme form of bootstrap aggregation, which enables the assembly of many large DNNs.
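A small Keras sketch of a feedforward DNN with dropout between the hidden layers; the layer sizes, dropout rate, and synthetic data are illustrative assumptions, not the paper's architecture.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)).astype("float32")   # synthetic stand-in data
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = keras.Sequential([
    layers.Input(shape=(10,)),                # one unit per selected feature
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                      # neurons randomly omitted in training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),    # readmission probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```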
4. Case study

4.1. Data collection and pre-processing

To conduct the present methodology in a hospital while validating the effectiveness of EMOBPSO and EMOBPSO-GLS for feature selection, retrospective data are collected from a tertiary referral hospital located in Northeast China, covering November 1, 2013 to December 31, 2015. The hospitalization records of 45,407 individuals are retrieved from the hospital information system (HIS) and assembled to determine readmissions. In this study, readmission (i.e., the target class label) is defined as re-hospitalization for any cause within a given time window (i.e., the readmission interval) after discharge from the last hospital admission. After removing the patients who died during their hospital stay, we converted the raw data into four datasets, namely, R30, R60, R120, and R180, by using four different readmission intervals (i.e., 30, 60, 120, and 180 days) to define the target class label of each record. The subject of this study is the individual patient; if a patient has been admitted several times, each admission is evaluated individually and included in the dataset.

Based on results acquired from relevant studies and the help of domain experts from the hospital, the input features of the hospital readmission problem mainly consist of "demographic," "social and economic status," "treatment and clinical," and "healthcare utilization" factors [62]. To collect as many informative features as possible, relevant data are collected from different databases in the HIS and assigned to individual hospitalization records by utilizing the uniqueness of the patient ID number and SQL codes. The details of these databases and the related fields are listed in Table 1.

Table 1
Relevant fields of three databases in HIS.

Database                   Fields                                        Notes
Outpatient information     Patient ID                                    Only used for data combination
                           Age at current claim                          Continuous variable
                           Mode of payment                               7 categories
                           Whether first visit
                           Home address                                  13 categories
                           Marital status                                5 categories
                           Occupation and type of insurance              9 categories
Electronic medical record  Registration time at outpatient department
                           Drug count                                    Continuous variable
                           Laboratory count                              Continuous variable
                           Severity level at outpatient department       3 categories: normal, severe, high-risk
                           Disease category                              17 categories (a)
Inpatient information      Registration time at in-patient department
                           Severity level at in-patient department       3 categories: normal, severe, high-risk
                           Department of registration                    38 departments coded with unique ID
                           Whether accompanied
                           Whether diagnosis is confirmed
                           Follow-up flag
                           Department of discharge                       39 departments coded with unique ID
                           Length of stay                                Continuous variable
                           Discharge time                                Used to define whether readmission

(a) The classification principle of diseases is based on the diagnosis information provided by physicians, the International Classification of Diseases, and the set-up of the department.

From Table 1, the representative fields collected from these three databases almost cover the relevant factors that directly or indirectly influence readmission risk. For demographic factors, the patient's age, home address, and marital status are involved; for factors related to social and economic status, the mode of payment, occupation, and corresponding insurance type are included as original features; for factors related to treatment and clinical results, whether the diagnosis is confirmed, the disease category, the number of medical treatments, and the severity level at different stages are involved; for healthcare utilization-related factors, whether first visit, department of registration and discharge, whether accompanied, follow-up flag, and length of stay are considered. The registration times at the outpatient and inpatient departments are used to judge whether a long waiting time occurred between these two operations. The season of registration is also included in the original features. All these features are combined with the corresponding records in the four datasets, and categorical variables are converted to binary dummy features. Then, the features with fewer than 40 positive values are excluded from the datasets. The total number of samples and the number of readmitted patients for each dataset are shown in Fig. 5.

Fig. 5. Number of examples in four datasets.

From Fig. 5, all four datasets are unbalanced with respect to the ratio of readmitted patients to non-readmitted patients. To overcome this problem, a synthetic minority oversampling technique called SMOTE is used in this study, implemented before feature selection and model training. In SMOTE, the minority class is oversampled by generating "synthetic" examples rather than by oversampling with replacement [63]. Two hyper-parameters of SMOTE are set in Table 2 based on several initial trials. Prec.over denotes how many extra examples from the minority class are generated, and Prec.under denotes how many extra examples from the majority classes are randomly selected for each example generated from the minority class.

Table 2
Configuration of two hyper-parameters for SMOTE.

              R30     R60     R120    R180
Prec.over     300%    200%    200%    200%
Prec.under    300%    300%    200%    200%
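For reference, the sketch below applies the imbalanced-learn implementation of SMOTE. Note that imblearn parameterizes the technique by a target class ratio rather than by the Prec.over/Prec.under pair used here, so the correspondence is approximate, and the dataset is a synthetic stand-in.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced stand-in for one of the readmission datasets.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# sampling_strategy=0.5 asks for minority:majority = 1:2 after resampling.
X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```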

Notably, the adjustment of these two hyper-parameters determines the trade-off between type I misclassification (i.e., a non-readmitted patient is misclassified as readmitted) and type II misclassification (i.e., a readmitted patient is misclassified as non-readmitted). For practitioners in a given hospital, if the goal is to spot patients with a high readmission risk and take measures to prevent readmission, a higher Prec.over and a lower Prec.under should be configured to assign a higher cost to type II misclassification. In another, less usual case, where the goal is to estimate the number of patients with high readmission risk and use this estimate to provision the necessary resources, a lower Prec.over and a higher Prec.under should be configured to assign a higher cost to type I misclassification. To provide a general research approach, this study balances the two types of misclassification (maximum AUC) with the configurations shown in Table 2.

4.2. Experimental set-up

The main steps in implementing the present methodology to solve the unplanned readmission prediction problem are illustrated in Fig. 6. A series of comparative experiments is conducted, and the relevant results are reported in the remainder of Section 4. In Section 4.3, we validate the efficiency of EMOBPSO by monitoring its iterative process and comparing its solution distribution with those of various raw BMOPSO variants. In Section 4.4, the effectiveness of the EMOBPSO-based feature selection algorithm is tested, and its execution time and model performance are compared with those of the GA- and SA-based feature selection algorithms. In Section 4.5, the effectiveness of EMOBPSO-GLS is tested and compared with feature selection algorithms that are capable of controlling the feature subset size: mRMR, correlation-based feature ranking (CFR), and the variable importance-based feature ranking algorithm (VIFR).
82 S. Jiang et al. / Knowledge-Based Systems 146 (2018) 73–90

Fig. 6. Main steps in implementing proposed methodology.

Table 3 4.3. Convergence efficiency of EMOBPSO


Evaluation metrics for model performance.

Metric Description For EMOBPSO, the number of generations is fixed to 100 for the
AUC Reflects trade-off between the rate of patients that are correctly involved datasets, and the swarm size is set as 80. Two objectives
predicted as readmission and the rate of patients incorrectly of each particle in the initial and final swarms are monitored and
predicted as readmission. plotted in Fig. 7. To facilitate the identification of different non-
Accuracy Rate of correctly classified patients. dominated front levels, the sign of “Objective 1 is reversed to
Precision Rate of patients correctly predicted as readmission to total number negative and then both objectives can be regarded as a minimiza-
of patients predicted as readmission. Relevant to type Ⅰ tion problem. Fig. 7 indicates that EMOBPSO reaches convergence
misclassification:P = T P/(T P + F P ) within 100 generations for all involved datasets based on intensive
Recall Rate of patients correctly predicted as readmission to total number non-dominated front lines shown in the plots of final swarms.
of patients that are actually readmitted. Relevant to type Ⅱ To test the global search ability of EMOBPSO, we calculate the
misclassification:R = T P/(T P + F N )
distribution of particles in the two-objective function space, be-
fore and after applying EMOBPSO. Meanwhile, the distribution of
final solution is also compared with 3 kinds of raw algorithms, re-
feature selection algorithms that are capable of controlling fea- spectively, to validate the impact of each improved operator. The
ture subset size: mRMR, correlation-based feature ranking (CFR), 3 kinds of raw algorithms are EMOBPSO without the present ini-
and variable importance-based feature ranking algorithm (VIFR). In tialization strategy (EMOBPSO/IS), EMOBPSO without the present
these two sections, the effectiveness of different feature selection Gbest selection strategy (EMOBPSO/GS), and EMOBPSO without the
algorithms is quantified in two aspects: the feature reduction rate dynamic update role (EMOBPSO/UR). The Tan’s spacing metric (TS)
and the impact of selected features on model performance. The im- is used to evaluate the solution distribution:
pact on model performance is tested by using selected features to
|E |
train classification models (SVM, RF, and DNN), and four evaluation 1  2
TS = (Di − D̄ ) /D̄, (24)
metrics are considered to assess the quality of predicted outcomes |E |i=1
(Table 3). In Table 3, TP, FP, and FN denote “true positive,” “false
|E |

positive” (i.e., type I error), and “false negative” (i.e., type II error),
where D̄ = Di /|E |, Di is the Euclidean distance between solu-
respectively. The training set is set as the first 75% of each dataset, i=1
and is used to conduct all involved feature selection algorithms. As tion xi in E and the nearest solution in the candidate space [65].
the samples of readmitted patients are sparse in each dataset, the A smaller of TS indicates superior of the solution distribution and
testing set is defined as the entire dataset, and it is used to quan- better global search ability. Table 4 shows experimental results
tify the impact of selected features on model performance. All cri- of tackling four datasets, in which TS values are the average of
teria listed in Table 3 are calculated via a five-fold cross validation five repeated experiments. The significant difference between ini-
process on the testing set. tial set and after applying EMOBPSO demonstrates powerful global
Considering the results obtained from these comparative exper- search ability of EMOBPSO. Compared with EMOBPSO/GS, the per-
iments, we explore the best combination between all involved fea- formance of EMOBPSO has achieved a significant improvement on
ture selection algorithms and classification models (Section 4.6), solution distribution, with the help of roulette wheel selection-
and the most important features obtained by EMOBPSO-GLS is in- based Gbest selection strategy. Compared with EMOBPSO/IS and
terpreted as management implications, which is helpful for prac- EMOBPSO/UR, EMOBPSO has still achieved slight improvements.
titioners to prevent readmission in advance. In this study, all in- Therefore, all three improved operators prompt the swarm to
volved experiments are run on Rstudio version 0.99.893 (contain- reach convergence more efficiently and ensure the powerful global
ing R version 3.1.2). Calculation of pairwise MI between all types of search ability of EMOBPSO.
The particles of the final swarm converge to the non-dominated front lines shown in the plots of the final swarms (Fig. 7). To test the global search ability of EMOBPSO, we calculate the distribution of particles in the two-objective function space before and after applying EMOBPSO. Meanwhile, the distribution of the final solutions is also compared with that of three raw variants of the algorithm, to validate the impact of each improved operator. The three raw variants are EMOBPSO without the present initialization strategy (EMOBPSO/IS), EMOBPSO without the present Gbest selection strategy (EMOBPSO/GS), and EMOBPSO without the dynamic update rule (EMOBPSO/UR). Tan's spacing metric (TS) is used to evaluate the solution distribution:

$$\mathrm{TS} = \frac{1}{\bar{D}} \sqrt{\frac{1}{|E|} \sum_{i=1}^{|E|} \left(D_i - \bar{D}\right)^2}, \qquad (24)$$

where $\bar{D} = \sum_{i=1}^{|E|} D_i / |E|$ and $D_i$ is the Euclidean distance between solution $x_i$ in $E$ and the nearest solution in the candidate space [65]. A smaller TS indicates a more uniform solution distribution and better global search ability. Table 4 shows the experimental results on the four datasets, in which the TS values are averages over five repeated experiments. The significant difference between the initial swarm and the swarm obtained after applying EMOBPSO demonstrates the powerful global search ability of EMOBPSO. Compared with EMOBPSO/GS, EMOBPSO achieves a significant improvement in solution distribution, with the help of the roulette wheel selection-based Gbest selection strategy. Compared with EMOBPSO/IS and EMOBPSO/UR, EMOBPSO still achieves slight improvements. Therefore, all three improved operators prompt the swarm to reach convergence more efficiently and ensure the powerful global search ability of EMOBPSO.
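Eq. (24) is straightforward to compute once the nearest-neighbour distances within the archive are available. The following base-R sketch illustrates it; `front` stands for a hypothetical |E| x 2 matrix holding the two objective values of the solutions in E.

```r
# Tan's spacing metric (Eq. (24)) over an archive of objective vectors (rows)
tan_spacing <- function(front) {
  n <- nrow(front)
  # D_i: Euclidean distance from solution i to its nearest neighbour in E
  D <- sapply(1:n, function(i) {
    diffs <- sweep(front[-i, , drop = FALSE], 2, front[i, ])
    min(sqrt(rowSums(diffs^2)))
  })
  Dbar <- mean(D)
  sqrt(mean((D - Dbar)^2)) / Dbar  # smaller TS = more uniform distribution
}
```

Applying this function to the initial swarm and to each algorithm's final swarm reproduces the type of comparison summarized in Table 4.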
Fig. 7. Plots of MI-based feature relevancy and feature redundancy for each particle in the initial and final swarm.
Table 4
TS values obtained by different search strategies.

Dataset   EMOBPSO   EMOBPSO/IS   EMOBPSO/GS   EMOBPSO/UR   Initial set
R30       0.140     0.141        0.217        0.142        0.223
R60       0.122     0.178        0.202        0.145        0.224
R120      0.111     0.158        0.212        0.119        0.248
R180      0.117     0.129        0.192        0.141        0.209
Table 5
Execution time of EMOBPSO, GA, and SA.

Dataset   EMOBPSO   GA         SA
R30       1.275 h   7.785 h    5.350 h
R60       1.191 h   27.109 h   5.198 h
R120      1.060 h   28.263 h   4.830 h
R180      1.237 h   7.534 h    4.213 h

Table 6
Feature size reduction of EMOBPSO, GA, and SA.

          Number of selected features    Feature reduction rate
Dataset   EMOBPSO   GA    SA             EMOBPSO   GA       SA
R30       82        23    41             37.88%    82.58%   68.94%
R60       77        94    71             41.22%    28.24%   45.80%
R120      87        97    55             33.08%    25.38%   57.69%
R180      63        28    52             50.78%    78.13%   59.38%

Table 7
Impact of feature selection algorithms on SVM.

Dataset   Criterion   EMOBPSO   GA       SA
R30       AUC         90.38%    52.49%   83.74%
          Accuracy    91.37%    92.01%   90.32%
          Precision   43.43%    4.49%    39.24%
          Recall      70.08%    0.44%    66.43%
R60       AUC         89.61%    89.57%   87.04%
          Accuracy    89.75%    89.58%   87.94%
          Precision   49.51%    48.95%   43.38%
          Recall      67.42%    67.81%   63.49%
R120      AUC         88.36%    87.85%   77.50%
          Accuracy    82.55%    81.95%   79.42%
          Precision   44.85%    43.77%   37.98%
          Recall      75.50%    75.35%   60.71%
R180      AUC         87.73%    65.32%   76.67%
          Accuracy    80.92%    68.69%   75.93%
          Precision   50.63%    31.71%   41.64%
          Recall      77.25%    53.42%   58.79%

Table 8
Impact of feature selection algorithms on RF.

Dataset   Criterion   EMOBPSO   GA       SA
R30       AUC         90.32%    49.38%   84.33%
          Accuracy    91.41%    91.47%   90.22%
          Precision   43.66%    8.07%    38.98%
          Recall      71.53%    0.74%    66.89%
R60       AUC         90.17%    90.18%   85.76%
          Accuracy    89.96%    90.01%   88.74%
          Precision   50.28%    50.51%   45.96%
          Recall      68.06%    67.99%   65.04%
R120      AUC         89.70%    88.75%   78.56%
          Accuracy    83.06%    82.30%   78.79%
          Precision   45.88%    44.44%   37.20%
          Recall      77.34%    75.90%   62.01%
R180      AUC         89.30%    68.12%   77.03%
          Accuracy    81.42%    68.06%   75.93%
          Precision   51.46%    31.42%   41.77%
          Recall      79.22%    54.77%   59.00%

Table 9
Impact of feature selection algorithms on DNN.

Dataset   Criterion   EMOBPSO   GA       SA
R30       AUC         90.20%    66.81%   85.43%
          Accuracy    88.22%    60.08%   89.13%
          Precision   35.15%    11.66%   36.10%
          Recall      74.83%    70.33%   68.08%
R60       AUC         89.69%    89.54%   86.18%
          Accuracy    86.18%    84.71%   81.20%
          Precision   40.43%    37.36%   32.05%
          Recall      74.37%    75.67%   73.15%
R120      AUC         89.21%    87.30%   79.30%
          Accuracy    76.59%    73.63%   62.76%
          Precision   37.56%    34.62%   25.73%
          Recall      86.54%    85.28%   79.58%
R180      AUC         88.83%    68.14%   76.70%
          Accuracy    77.81%    49.04%   60.84%
          Precision   46.25%    26.05%   30.78%
          Recall      84.86%    85.45%   80.96%
4.4. Effectiveness of the EMOBPSO feature selection algorithm

To evaluate the effectiveness of the EMOBPSO-based feature selection algorithm, an SA-based and a GA-based wrapper feature selection algorithm are implemented as baselines on the same task. For SA, the population size and number of generations are set equal to those of EMOBPSO; for GA, the population size is reduced to 40 and the number of generations is set to 50, because the identical configuration as used in EMOBPSO and SA would result in an extremely long execution time. The crossover rate is fixed at 0.8 and the mutation rate at 0.1 for both SA and GA. The linear discriminant analysis (LDA) model, a simple linear classifier with low computational cost, is used to calculate the fitness value for SA and GA. The execution time for the entire iterative process of the three algorithms is listed in Table 5: EMOBPSO demonstrates a notably lower computational cost than the other two wrapper feature selection algorithms. The results indicate that the proposed MI-based criterion (Eq. (17)) has a much lower and more stable computational cost, even compared with a criterion based on a simple linear model.

Table 6 shows the number of selected features and the feature reduction rate of the different feature selection algorithms. The impact of these feature selection algorithms on training SVM, RF, and DNN is quantified in Tables 7, 8, and 9, respectively.

For SVM, Table 7 shows that the EMOBPSO feature selection algorithm achieves the highest AUC on all four datasets (90.38%, 89.61%, 88.36%, and 87.73%), compared with GA and SA. Although the execution time of GA and SA is much longer than that of EMOBPSO, the quality of the feature subsets obtained from these two wrapper-based feature selection algorithms is significantly worse in some cases: the AUC of GA is only 52.49% in R30, and the AUC of SA decreases to 77.50% and 76.67% in R120 and R180. With regard to precision and recall, EMOBPSO outperforms GA and SA in most pairwise comparisons. With regard to accuracy, the differences among the feature selection algorithms are not obvious in R30, R60, and R120, and even GA is able to achieve 92.01% accuracy in R30. This result indicates that accuracy is not an appropriate criterion under the class imbalance present in these datasets.

For RF and DNN, the differences in model performance among the three feature selection algorithms present patterns similar to those discussed for SVM. EMOBPSO still achieves high AUC with RF (90.32%, 90.17%, 89.70%, and 89.30%) and DNN (90.20%, 89.69%, 89.21%, and 88.83%), whereas GA or SA performs worst in most cases. These results further validate the effectiveness of the EMOBPSO-based feature selection algorithm and the universal property of the obtained feature subsets. In summary, the results shown in Tables 6–9 demonstrate that the feature subsets obtained by EMOBPSO maximally maintain the information carried by the original feature sets while removing as many features as possible. Because the original feature spaces of the four datasets are high-dimensional, applying wrapper-based feature selection algorithms to the present problem poses risks.
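For concreteness, the sketch below hand-rolls a GA wrapper with an LDA fitness using the configuration described above (population 40, 50 generations, crossover rate 0.8, mutation rate 0.1). It is an illustrative stand-in rather than the caret-based implementation used in the study; `X` and `y` denote a hypothetical design matrix and binary label vector.

```r
library(MASS)  # for lda()

# Fitness of a binary feature mask: LDA accuracy on a holdout split
fitness <- function(mask, X, y, train_idx) {
  if (sum(mask) == 0) return(0)
  fit  <- lda(X[train_idx, mask == 1, drop = FALSE], grouping = y[train_idx])
  pred <- predict(fit, X[-train_idx, mask == 1, drop = FALSE])$class
  mean(pred == y[-train_idx])
}

ga_wrapper <- function(X, y, pop_size = 40, generations = 50,
                       p_cross = 0.8, p_mut = 0.1) {
  p <- ncol(X)
  train_idx <- sample(nrow(X), floor(0.75 * nrow(X)))
  pop <- matrix(rbinom(pop_size * p, 1, 0.5), pop_size, p)
  for (g in 1:generations) {
    fit <- apply(pop, 1, fitness, X = X, y = y, train_idx = train_idx)
    # Fitness-proportionate (roulette wheel) parent selection
    pop <- pop[sample(pop_size, pop_size, replace = TRUE, prob = fit + 1e-9), ]
    # One-point crossover on consecutive pairs
    for (i in seq(1, pop_size - 1, by = 2)) {
      if (runif(1) < p_cross) {
        cut  <- sample(p - 1, 1)
        swap <- pop[i, (cut + 1):p]
        pop[i, (cut + 1):p] <- pop[i + 1, (cut + 1):p]
        pop[i + 1, (cut + 1):p] <- swap
      }
    }
    # Mutation: with the given rate, flip one random bit of an individual
    for (i in 1:pop_size) {
      if (runif(1) < p_mut) {
        j <- sample(p, 1)
        pop[i, j] <- 1 - pop[i, j]
      }
    }
  }
  fit <- apply(pop, 1, fitness, X = X, y = y, train_idx = train_idx)
  which(pop[which.max(fit), ] == 1)  # indices of the selected features
}
```

Because every fitness evaluation refits a classifier, wrapper costs grow quickly with the population size, which is consistent with the execution times reported in Table 5.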
Fig. 8. Boxplots of AUC, accuracy, precision, and recall achieved by EMOBPSO and EMOBPSO-GLS under different settings of DFS.

Table 10
Actual number of selected features acquired by EMOBPSO-GLS under different settings of DFS.

Dataset   DFS = 10   DFS = 20   DFS = 30   DFS = 40   DFS = 50   DFS = 60
R30       17         21         35         34         49         59
R60       10         26         33         40         44         49
R120      10         23         28         42         53         60
R180      13         19         33         40         50         45
4.5. Effectiveness of the EMOBPSO-GLS feature selection algorithm

Table 10 shows the number of selected features acquired by EMOBPSO-GLS under different values of DFS. The impact of EMOBPSO-GLS, CFR, mRMR, and VIFR on training SVM, RF, and DNN is quantified in Tables 11, 12, and 13, respectively.

Table 10 shows that by introducing GLS into EMOBPSO, the feature subset sizes are brought under control, and the actual number of selected features is close to the predefined values of DFS. Table 11 shows that the different feature selection algorithms obtain similar AUC, accuracy, precision, and recall with regard to SVM when DFS (or FS for CFR, mRMR, and VIFR) is equal to or greater than 30. When DFS is less than 30, the feature subset obtained by EMOBPSO-GLS can still bring the desired impact on SVM, whereas the other feature selection algorithms cannot achieve comparable results under the same circumstances. For example, when DFS is equal to 10, EMOBPSO-GLS reaches 86.1%, 81.6%, 86.3%, and 85.5% AUC for R30, R60, R120, and R180, respectively. However, AUC decreases to 78.8% for CFR (dataset R120), 79.7% for mRMR (dataset R60), and 84.6% for VIFR (dataset R30) when FS is equal to 10. Similarly, for RF, EMOBPSO-GLS maintains high AUC as DFS decreases, whereas several abnormal results generated by CFR, mRMR, and VIFR still exist, such as the 77.7% AUC generated by CFR (in R60, FS = 10; AUC = 86.7% for EMOBPSO-GLS) and the 84.8% AUC generated by VIFR (in R60, FS = 20; AUC = 88.2% for EMOBPSO-GLS). From Table 13, DNN reduces the differences in AUC among the feature selection algorithms and among the different settings of DFS (FS), and EMOBPSO-GLS still outperforms the other algorithms to a certain degree. Overall, besides regularizing the feature subset size, GLS helps EMOBPSO maximally maintain the key information by retaining elite feature combinations during the iterative process, compared with other widely used filter methods.

4.6. Model selection and management implications for practitioners

Based on the results discussed in Sections 4.4 and 4.5, EMOBPSO and EMOBPSO-GLS are capable of attaining elite feature subsets with minimum information loss. To analyze which model performs best under EMOBPSO and EMOBPSO-GLS, the distribution of the criterion values achieved by EMOBPSO and EMOBPSO-GLS is shown with a group of boxplots, under different settings of DFS, for SVM, RF, and DNN. In Fig. 8(a), DNN achieves relatively higher AUC than SVM and RF, with lower variance. Fig. 8(c) and (d) indicate that DNN generates significantly higher recall than SVM and RF, at the expense of precision, to obtain high AUC. This adaptive outcome is appropriate for general usage in a healthcare system. Therefore, the combination of EMOBPSO (EMOBPSO-GLS) and DNN possesses the strongest predictive power and presents high robustness under different settings of DFS and different datasets, and the high recall demonstrates that DNN is a good choice for preventing unplanned readmissions in advance.

Table 14 lists the feature subsets selected by EMOBPSO-GLS with DFS equal to 10. Based on the relevant model performance shown in Tables 11–13, these features retain the most predictive power for the four datasets and thus the most discriminative value. Some similarities and differences among individual features and feature combinations in the four datasets are analyzed below to provide practical insights for practitioners in the hospital. For demographic factors, the patient's age is included in R30, R120, and R180, which shows a notable impact on readmission. The marital status indicator (i.e., whether the patient is married or not) is selected only in R180, and no address-related factor is included. For social and economic status-related factors, the payment covered by basic social insurance is a significant indicator of readmission and is included in all four datasets. More early resource-intensive interventions should be delivered to the patients in this category.
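As an illustration of this model-selection step, Fig. 8-style boxplots can be drawn from a long-format results table; `results`, with columns `model`, `AUC`, and `Recall` collected over the DFS settings and datasets, is an assumed layout rather than the authors' data structure.

```r
# results: one row per (model, DFS setting, dataset), e.g.
# results <- data.frame(model = ..., AUC = ..., Recall = ...)
boxplot(AUC ~ model, data = results, ylab = "AUC (%)",
        main = "AUC by classifier across DFS settings")      # cf. Fig. 8(a)
boxplot(Recall ~ model, data = results, ylab = "Recall (%)") # cf. Fig. 8(d)
```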
Table 11
Impact of EMOBPSO-GLS, CFR, mRMR, and VIFR on SVM performance.

Algorithm     DFS/FS   R30 (AU, AC, P, R)   R60 (AU, AC, P, R)   R120 (AU, AC, P, R)   R180 (AU, AC, P, R)
EMOBPSO-GLS 10 86.1 91.4 43.0 64.4 81.6 90.7 53.7 57.1 86.3 79.7 40.6 79.1 85.5 78.6 46.9 81.7

20 88.3 91.2 42.4 66.5 87.0 87.4 42.4 67.0 86.7 78.7 39.5 81.1 86.2 79.6 48.5 80.0
30 89.5 91.3 42.7 68.1 88.9 89.8 49.8 64.7 87.4 80.4 41.7 79.4 87.2 79.2 47.8 82.1
40 89.1 91.3 42.7 67.2 88.7 89.8 49.6 64.7 87.9 81.9 43.8 77.6 87.5 80.2 49.4 78.2
50 89.8 91.5 43.6 68.4 88.9 89.9 50.2 65.0 87.9 82.0 43.9 75.5 87.5 80.1 49.2 78.3
60 89.9 91.4 43.5 68.3 88.9 89.9 50.0 65.4 87.9 81.5 43.2 77.6 87.7 80.9 50.5 78.2
CFR FS
10 81.0 90.2 39.7 66.5 80.3 89.7 50.0 58.9 78.8 88.2 63.0 50.1 85.3 78.5 46.6 79.0
20 87.4 91.2 42.4 67.9 86.1 90.5 52.7 60.7 86.9 78.0 38.8 82.9 86.4 79.2 47.8 81.6
30 89.0 91.4 43.2 67.2 87.7 89.8 50.6 62.5 87.2 79.9 41.0 79.8 86.8 79.6 48.3 80.8
40 89.6 91.4 43.4 67.4 88.4 88.5 45.5 67.7 87.6 80.3 41.6 79.7 86.8 79.9 48.8 80.6
50 89.8 91.4 43.3 67.3 88.5 88.7 46.1 67.4 87.9 81.6 43.5 78.2 87.2 80.2 49.3 79.8
60 90.0 91.5 43.5 67.7 88.9 89.1 47.2 67.7 88.0 81.9 43.8 78.3 87.5 80.4 49.7 79.5
mRMR FS
10 82.7 91.0 41.6 64.7 79.7 90.6 53.4 57.8 82.2 79.8 40.7 77.9 85.2 78.1 46.3 83.0
20 85.9 90.8 40.7 65.3 82.8 90.3 52.0 59.2 86.6 79.5 40.5 79.7 85.8 77.8 45.9 83.2
30 89.1 91.0 41.8 68.8 88.5 90.2 51.5 62.7 86.8 79.7 40.6 78.8 86.6 79.8 48.7 79.4
40 89.3 91.1 42.1 68.4 88.7 90.1 51.1 62.6 86.8 80.3 41.4 78.2 86.8 79.8 48.7 80.1
50 89.6 91.5 43.7 69.1 88.7 90.2 51.4 62.7 87.6 80.6 41.8 78.4 87.2 79.8 48.6 80.3
60 89.8 91.6 44.3 69.2 89.2 89.9 49.9 65.9 87.8 80.8 42.0 77.7 87.1 79.9 48.8 80.1
VIFR FS
10 84.6 90.9 41.1 66.7 83.0 90.5 52.8 59.2 85.4 79.3 40.2 80.5 85.7 79.3 47.9 80.3
20 88.6 91.1 42.2 68.9 85.0 90.6 52.9 60.4 86.7 79.5 40.4 80.3 86.0 79.8 48.7 79.8
30 89.7 91.2 42.6 68.1 86.7 90.2 51.3 63.1 87.1 80.5 41.7 78.1 86.6 79.9 48.9 78.7
40 89.5 91.2 42.6 68.4 87.9 90.2 51.2 64.1 87.6 81.1 42.5 76.7 86.9 80.2 49.3 77.9
50 89.7 91.3 43.1 68.6 88.1 90.0 50.4 65.1 88.0 82.0 44.0 76.0 86.9 80.1 49.2 77.9
60 89.8 91.3 43.0 69.0 88.5 89.8 49.8 65.0 88.3 82.6 45.0 75.6 87.2 80.1 49.1 77.0

AU = AUC, AC = Accuracy, P = Precision, and R = Recall.


Table 12
Impact of EMOBPSO-GLS, CFR, mRMR, and VIFR on RF performance.

Algorithm     DFS/FS   R30 (AU, AC, P, R)   R60 (AU, AC, P, R)   R120 (AU, AC, P, R)   R180 (AU, AC, P, R)
(AU = AUC, AC = Accuracy, P = Precision, R = Recall; all values in %.)
EMOBPSO-GLS 10 86.3 91.4 43.2 64.9 86.7 90.8 54.3 56.9 88.4 78.1 38.9 82.8 87.8 79.6 48.3 80.5
20 87.5 91.2 42.6 68.5 88.2 88.8 46.2 64.6 88.5 79.1 39.9 80.8 88.1 80.0 48.9 80.4
30 88.5 91.2 42.8 70.6 87.3 89.9 49.9 65.6 88.8 80.5 41.9 80.3 87.9 80.3 49.5 80.6
40 88.3 91.3 42.8 68.9 88.3 90.0 50.3 65.9 89.2 82.4 44.8 78.0 88.6 81.0 50.6 79.7
50 88.7 91.3 43.3 71.0 87.6 89.9 50.1 66.2 89.1 82.7 45.3 77.3 88.5 80.7 50.3 79.3
60 89.6 91.5 43.7 70.5 88.1 90.1 50.7 66.4 88.9 82.4 44.7 78.4 88.9 81.2 51.0 79.3
CFR FS
10 81.5 90.2 39.8 66.5 77.7 89.7 50.1 59.0 83.6 88.1 62.2 50.5 87.1 77.0 44.9 83.1
20 85.2 91.4 43.2 67.7 83.1 90.7 53.7 60.0 88.5 78.1 38.9 83.1 87.9 80.0 49.0 80.0
30 86.3 91.5 43.4 67.7 87.8 90.5 52.9 60.7 88.4 79.4 40.4 81.6 87.9 80.4 49.5 79.7
40 87.7 91.4 43.3 67.9 88.8 88.8 46.3 67.4 88.7 80.3 41.7 80.9 88.0 80.5 49.8 79.8
50 88.1 91.1 42.4 69.1 88.7 89.0 47.0 67.4 89.3 82.6 45.1 77.8 88.1 80.7 50.1 79.8
60 88.7 91.4 43.3 70.1 88.5 89.2 47.5 68.0 89.4 82.9 45.7 77.5 88.6 81.0 50.7 79.1
mRMR FS
10 85.3 91.2 42.2 64.3 86.6 90.3 52.1 58.6 88.0 79.5 40.3 78.9 87.9 79.1 47.7 81.6
20 85.3 91.2 42.2 65.6 84.2 90.1 51.2 59.2 88.4 79.2 40.2 81.7 87.7 79.5 48.3 81.2
30 88.3 91.3 42.9 70.5 87.8 90.4 51.9 62.9 88.4 79.2 40.2 81.4 87.8 80.0 48.9 80.5
40 88.7 91.2 42.6 70.6 88.6 90.2 51.4 64.0 88.4 79.6 40.6 81.1 88.0 80.3 49.4 79.8
50 89.3 91.4 43.6 70.6 88.8 90.2 51.3 64.6 88.8 80.6 42.0 79.8 88.1 80.8 50.2 80.2
60 89.3 91.4 43.5 70.7 89.3 89.8 49.6 67.3 88.8 80.8 42.3 79.5 88.0 80.8 50.2 80.1
VIFR FS
10 83.6 91.0 41.8 66.7 86.4 90.6 53.3 59.0 88.2 78.9 39.8 82.2 87.7 79.6 48.3 80.6
20 86.7 91.4 43.4 68.6 84.8 90.8 53.9 59.8 88.7 79.7 40.8 81.2 88.0 80.0 48.9 80.5
30 88.7 91.5 43.9 69.2 85.8 90.4 52.2 63.0 88.7 81.2 42.8 79.1 87.9 80.3 49.4 80.0
40 88.9 91.5 43.8 70.1 87.1 90.2 51.1 64.9 88.7 81.9 43.9 78.1 88.0 80.7 50.2 79.4
50 89.0 91.5 43.8 69.7 87.3 90.1 50.7 65.4 88.7 82.6 45.0 77.7 88.0 80.8 50.4 78.9
60 88.8 91.4 43.6 70.5 87.7 89.9 50.1 65.6 88.9 82.6 45.1 77.9 88.1 80.9 50.6 78.9

Table 13
Impact of EMOBPSO-GLS, CFR, mRMR, and VIFR on DNN performance.

Algorithm     DFS/FS   R30 (AU, AC, P, R)   R60 (AU, AC, P, R)   R120 (AU, AC, P, R)   R180 (AU, AC, P, R)
(AU = AUC, AC = Accuracy, P = Precision, R = Recall; all values in %.)
EMOBPSO-GLS 10 88.9 89.2 37.0 69.9 86.9 74.0 29.2 82.7 88.3 75.8 36.6 85.8 88.1 77.0 45.1 85.3
20 89.5 89.3 36.9 71.2 87.9 84.2 36.1 72.7 88.5 74.6 35.7 87.2 88.5 77.2 45.3 85.0
30 89.9 87.5 33.8 74.3 88.7 84.8 38.5 72.6 89.0 76.2 37.1 85.9 88.6 77.2 45.2 86.0
40 89.4 89.5 37.7 69.8 89.0 85.0 38.3 74.2 89.0 77.5 38.3 84.2 88.7 77.2 45.5 85.1
50 89.5 89.5 37.4 70.9 88.5 86.4 40.9 70.5 89.0 76.9 37.7 85.4 88.6 76.8 45.0 86.3
60 89.8 88.7 35.8 73.0 89.5 84.5 37.2 75.8 89.3 77.6 38.6 84.9 88.7 77.6 45.9 85.0
CFR FS
10 84.0 81.6 29.7 71.3 84.9 88.3 45.8 60.5 83.4 62.1 32.1 88.7 87.2 74.9 42.6 85.3
20 87.7 89.1 36.4 70.4 86.1 81.7 34.5 71.9 88.4 74.8 35.9 87.4 87.5 75.6 43.6 87.5
30 88.1 89.4 37.8 69.2 87.5 85.3 39.5 70.6 88.5 75.2 36.1 86.0 88.4 76.4 44.4 87.2
40 87.6 87.8 33.5 70.9 88.7 84.8 37.7 73.0 89.0 76.9 37.8 84.7 87.4 75.6 43.9 86.7
50 89.0 88.3 34.7 72.8 88.3 82.9 35.0 76.1 89.1 76.8 37.8 85.8 88.6 77.3 45.7 86.5
60 89.4 88.7 35.9 73.3 88.9 85.0 38.6 73.4 89.0 77.3 38.3 85.4 86.5 74.5 42.9 86.1
mRMR FS
10 87.5 88.9 35.8 68.8 87.0 82.1 35.1 73.0 87.6 75.4 36.1 85.1 87.9 75.8 43.7 87.6
20 87.1 88.5 34.8 70.4 86.8 78.9 30.0 77.4 88.3 75.7 36.5 85.7 88.1 75.7 43.6 87.8
30 88.7 88.6 35.0 70.6 87.9 81.8 33.8 75.1 88.1 75.6 36.4 85.6 87.6 76.2 44.2 85.7
40 88.8 89.4 37.2 70.5 88.2 85.8 39.9 71.6 88.6 76.5 37.4 85.4 87.0 75.6 43.4 85.7
50 88.3 89.1 36.3 71.8 88.9 83.9 36.5 73.8 88.7 75.2 36.1 86.7 88.1 76.9 45.1 85.5
60 88.2 88.1 34.0 71.2 89.6 83.6 36.2 78.6 88.9 75.6 36.6 86.7 86.1 73.9 42.1 87.1
VIFR FS
10 87.4 87.8 33.4 71.2 86.9 78.1 30.1 79.1 87.4 74.2 35.2 87.0 87.8 76.2 44.0 85.9
20 88.5 89.3 37.0 71.2 85.9 81.0 34.1 72.5 88.1 74.3 35.4 86.6 88.4 76.8 44.8 85.6
30 89.0 89.7 38.2 69.5 87.0 84.1 36.6 71.2 87.3 73.2 34.3 86.9 87.9 75.8 43.8 86.6
40 89.2 89.6 38.1 70.7 87.3 84.4 36.4 70.1 87.5 73.1 34.2 86.9 88.1 76.2 44.2 85.6
50 89.1 88.6 35.8 72.4 86.9 83.9 36.3 72.7 87.5 73.4 34.4 86.2 87.1 75.9 44.1 84.8
60 89.5 89.9 38.9 71.0 88.0 84.3 36.8 71.1 87.9 75.1 35.9 84.7 86.6 75.2 42.8 84.0

Table 14
Most predictive features selected in four datasets using EMOBPSO-GLS (DFS = 10).

Dataset   Most predictive features
R30       age; mode of payment (government-funded, basic social insurance); occupation and type of insurance (normal medical insurance); whether the diagnosis is confirmed; disease category (cancer, cardiology, hemorrhoids); department of registration (area ID: a1189, a254, a260, a893, a905); department of discharge (area ID: a240, a246, a255, a256)
R60       mode of payment (basic social insurance); occupation and type of insurance (migrant worker); disease category (cancer); whether first visit; department of registration (area ID: a255, a890); department of discharge (area ID: a256, a910, a911); long waiting time after registration
R120      age; mode of payment (basic social insurance); disease category (cancer, hemorrhoids); whether first visit; department of registration (area ID: a247, a255); department of discharge (area ID: a254, a893, a911)
R180      age; marital status (married); mode of payment (basic social insurance); occupation and type of insurance (delegated medical insurance); disease category (cancer, hemorrhoids); whether first visit; department of registration (area ID: a1270, a254, a888, a894, a912); department of discharge (area ID: a247)

For treatment and clinical-related factors, patients with cancer or hemorrhoids should be given more attention than those with other kinds of diseases. "Whether the diagnosis is confirmed" is a significant indicator for 30-day readmission. Notably, the number of medical treatments and the patient's severity level are not included in any dataset. For healthcare utilization-related factors, various combinations of the selected registration and discharge areas can be used as a reference in medical resource allocation, although few similarities can be summarized across the four datasets. Furthermore, "whether first visit" and "long waiting time after registration" are two significant indicators of 60-day readmission. None of the seasonal factors is selected, indicating that unplanned readmission does not present a notable seasonal pattern.
5. Conclusion

In this study, a new machine learning framework is proposed to predict unplanned hospital readmission by combining feature selection algorithms and classification models. To increase the data quality and generality of the present study, a broad variety of factors is collected from different databases in the HIS and then integrated as the original feature set before the proposed methodology is implemented. In the feature selection process, EMOBPSO is developed as the principal search strategy, and a new MI-based criterion is proposed as a more accurate estimation of feature relevancy and feature redundancy, with less computational cost. A greedy local search strategy is developed and merged into EMOBPSO to control the final feature subset size as desired. In the modeling process, manifold machine learning models, such as SVM, RF, and DNN, are trained with the preprocessed datasets and the corresponding feature subsets obtained in the feature selection process. The model performance is compared to explore the best combination of feature selection algorithms and prediction models. Meanwhile, the effectiveness of the proposed feature selection algorithm is also validated.

Based on the results obtained from a series of comparative experiments, the contribution of this study is twofold. Theoretically, compared with other wrapper-based feature selection methods, the EMOBPSO feature selection algorithm maximally maintains the information carried by the original feature sets of readmission, with much less computational cost. Three new operators of EMOBPSO enhance the convergence efficiency. Compared with widely used filter methods, EMOBPSO-GLS is capable of retaining elite feature combinations while regularizing the feature subset size. In practice, the combination of EMOBPSO (EMOBPSO-GLS) and DNN presents robust performance under different situations and can be deployed as a risk analysis tool for hospital readmission prevention in healthcare systems. Insightful implications abstracted from the feature subsets can be used by practitioners to identify the patients vulnerable to readmission and to target the delivery of early resource-intensive interventions.

One promising direction for our future research is exploring how to generalize the proposed feature selection methods and modern machine learning techniques to other knowledge discovery applications in the field of health science. In this way, patient care can be improved by converting complex medical data (e.g., medical imagery, data from wearable health monitors, reviews of physicians, and so on) into accessible knowledge.

Acknowledgments

This work is partly supported by the Hong Kong RGC grant no. T32-102/14N and CityU SRG grant no. 7004698.

References

[1] J.A. Bosco III, A.J. Karkenny, L.H. Hutzler, J.D. Slover, R. Iorio, Cost burden of 30-day readmissions following Medicare total hip and knee arthroplasty, J. Arthroplast. 29 (5) (2014) 903–905.
[2] Y.L. Zhou, Y.P. Ning, N. Fan, S. Mohamed, R.A. Rosenheck, H.B. He, Correlates of readmission risk and readmission days in a large psychiatric hospital in Guangzhou, China, Asia-Pac. Psychiatry 6 (3) (2014) 342–349.
[3] A.S. Fialho, F. Cismondi, S.M. Vieira, S.R. Reti, J.M.C. Sousa, S.N. Finkelstein, Data mining using clinical physiology at discharge to predict ICU readmissions, Expert Syst. Appl. 39 (18) (2012) 13158–13165.
[4] D. Kansagara, H. Englander, A. Salanitro, et al., Risk prediction models for hospital readmission: a systematic review, JAMA 306 (15) (2011) 1688–1698.
[5] Centers for Medicare & Medicaid Services, The Hospital Readmissions Reduction (HRR) Program. https://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/Value-Based-Programs/HRRP/Hospital-Readmission-Reduction-Program.html, 2012 (accessed 21 February 2017).
[6] D. Golmohammadi, N. Radnia, Prediction modeling and pattern recognition for patient readmission, Int. J. Prod. Econ. 171 (Part 1) (2016) 151–161.
[7] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Comput. Electr. Eng. 40 (1) (2014) 16–28.
[8] B. Xue, M. Zhang, W.N. Browne, X. Yao, A survey on evolutionary computation approaches to feature selection, IEEE Trans. Evol. Comput. 20 (4) (2016) 606–626.
[9] E. Amaldi, V. Kann, On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems, Theor. Comput. Sci. 209 (1) (1998) 237–260.
[10] S. Saha, R. Spandana, A. Ekbal, S. Bandyopadhyay, Simultaneous feature selection and symmetry based clustering using multiobjective framework, Appl. Soft Comput. 29 (2015) 479–486.
[11] P. Moradi, M. Rostami, Integration of graph clustering with ant colony optimization for feature selection, Knowledge-Based Syst. 84 (2015) 144–161.
[12] Y.D. Zhang, S.H. Wang, P. Phillips, G.L. Ji, Binary PSO with mutation operator for feature selection using decision tree applied to spam detection, Knowledge-Based Syst. 64 (2014) 22–31.
[13] R.R. Chhikara, P. Sharma, L. Singh, A hybrid feature selection approach based on improved PSO and filter approaches for image steganalysis, Int. J. Mach. Learn. Cybern. 7 (6) (2016) 1195–1206.
[14] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimization for feature selection in classification: a multi-objective approach, IEEE Trans. Cybern. 43 (6) (2013) 1656–1671.
[15] A.K. Das, S. Das, A. Ghosh, Ensemble feature selection using bi-objective genetic algorithm, Knowledge-Based Syst. 123 (2017) 116–127.
[16] L.G. Zhou, D. Lu, H. Fujita, The performance of corporate financial distress prediction models with features selection guided by domain knowledge and data mining approaches, Knowledge-Based Syst. 85 (2015) 52–61.
[17] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimisation for feature selection in classification: novel initialisation and updating mechanisms, Appl. Soft Comput. 18 (Suppl. C) (2014) 261–276.
[18] C.E. Shannon, W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, 1949.
[19] T.M. Cover, J.A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc., 1991.
[20] J.R. Vergara, P.A. Estévez, A review of feature selection methods based on mutual information, Neural Comput. Appl. 24 (1) (2014) 175–186.

[21] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 1226–1238.
[22] Z. Yong, G. Dun-wei, Z. Wan-qiu, Feature selection of unreliable data using an improved multi-objective PSO algorithm, Neurocomputing 171 (2016) 1281–1290.
[23] M. Taha, A. Pal, J.D. Mahnken, S.K. Rigler, Derivation and validation of a formula to estimate risk for 30-day readmission in medical patients, Int. J. Qual. Health Care 26 (3) (2014) 271–277.
[24] P.E. Cotter, V.K. Bhalla, S.J. Wallis, R.W. Biram, Predicting readmissions: poor performance of the LACE index in an older UK population, Age Ageing 41 (6) (2012) 784–789.
[25] H. Zhou, P.R. Della, P. Roberts, L. Goh, S.S. Dhaliwal, Utility of models to predict 28-day or 30-day unplanned hospital readmissions: an updated systematic review, BMJ Open 6 (6) (2016).
[26] B.G. Hammill, et al., Incremental value of clinical data beyond claims data in predicting 30-day outcomes after heart failure hospitalization, Circ.-Cardiovasc. Qual. Outcomes 4 (1) (2011) 60–67.
[27] I. Shams, S. Ajorlou, K. Yang, A predictive analytics approach to reducing 30-day avoidable readmissions among patients with heart failure, acute myocardial infarction, pneumonia, or COPD, Health Care Manag. Sci. 18 (1) (2015) 19–34.
[28] S. Keyhani, L.J. Myers, E. Cheng, P. Hebert, L.S. Williams, D.M. Bravata, Effect of clinical and social risk factors on hospital profiling for stroke readmission: a cohort study, Ann. Intern. Med. 161 (11) (2014) 775–784.
[29] C. Walraven, J. Wong, A.J. Forster, S. Hawken, Predicting post-discharge death or readmission: deterioration of model performance in population having multiple admissions per patient, J. Eval. Clin. Pract. 19 (6) (2013) 1012–1018.
[30] J.W. Thomas, Does risk-adjusted readmission rate provide valid information on hospital quality? Inquiry 33 (3) (1996) 258–270.
[31] C. Boult, B. Dowd, D. McCaffrey, L. Boult, R. Hernandez, H. Krulewitch, Screening elders for risk of hospital admission, J. Am. Geriatr. Soc. 41 (8) (1993) 811–817.
[32] Q.L. Huynh, et al., Roles of nonclinical and clinical data in prediction of 30-day rehospitalization or death among heart failure patients, J. Card. Fail. 21 (5) (2015) 374–381.
[33] S. Sudhakar, W. Zhang, Y.-F. Kuo, M. Alghrouz, A. Barbajelata, G. Sharma, Validation of the readmission risk score in heart failure patients at a tertiary hospital, J. Card. Fail. 21 (11) (2015) 885–891.
[34] D.J. Taber, et al., Inclusion of dynamic clinical data improves the predictive performance of a 30-day readmission risk model in kidney transplantation, Transplantation 99 (2) (2015) 324–330.
[35] J.C. Iannuzzi, F.J. Fleming, K.N. Kelly, D.T. Ruan, J.R. Monson, J. Moalem, Risk scoring can predict readmission after endocrine surgery, Surgery 156 (6) (2014) 1432–1440.
[36] S.N. Vigod, et al., READMIT: a clinical risk index to predict 30-day readmission after discharge from acute psychiatric units, J. Psychiatr. Res. 61 (2015) 205–213.
[37] S. Yu, F. Farooq, A. van Esbroeck, G. Fung, V. Anand, B. Krishnapuram, Predicting readmission risk with institution-specific prediction models, Artif. Intell. Med. 65 (2) (2015) 89–96.
[38] R. Gildersleeve, P. Cooper, Development of an automated, real time surveillance tool for predicting readmissions at a community hospital, Appl. Clin. Inf. 4 (2) (2013) 153–169.
[39] R. Amarasingham, et al., An automated model to identify heart failure patients at risk for 30-day readmission or death using electronic medical record data, Med. Care 48 (11) (2010) 981–988.
[40] R. Duggal, S. Shukla, S. Chandra, B. Shukla, S.K. Khatri, Impact of selected pre-processing techniques on prediction of risk of early readmission for diabetic patients in India, Int. Diabetes Dev. Countries 36 (4) (2016) 469–476.
[41] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Networks 5 (4) (1994) 537–550.
[42] N. Kwak, C. Chong-Ho, Input feature selection for classification problems, IEEE Trans. Neural Networks 13 (1) (2002) 143–159.
[43] P.A. Estevez, M. Tesmer, C.A. Perez, J.M. Zurada, Normalized mutual information feature selection, IEEE Trans. Neural Networks 20 (2) (2009) 189–201.
[44] Z. Wang, M. Li, J. Li, A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure, Inf. Sci. 307 (2015) 73–88.
[45] X. Sun, Y. Liu, M. Xu, H. Chen, J. Han, K. Wang, Feature selection using dynamic weights for classification, Knowledge-Based Syst. 37 (2013) 541–549.
[46] I. Koprinska, M. Rana, V.G. Agelidis, Correlation and instance based feature selection for electricity load forecasting, Knowledge-Based Syst. 82 (2015) 29–40.
[47] N. Huang, Z. Hu, G. Cai, D. Yang, Short term electrical load forecasting using mutual information based feature selection with generalized minimum-redundancy and maximum-relevance criteria, Entropy 18 (9) (2016) 330.
[48] A.E.-A. Shereen, A.R. Rabie, I.G. Neveen, Classification of EEG signals for motor imagery based on mutual information and adaptive neuro fuzzy inference system, Int. J. Syst. Dyn. Appl. (IJSDA) 5 (4) (2016) 64–82.
[49] D. He, I. Rish, D. Haws, L. Parida, MINT: mutual information based transductive feature selection for genetic trait prediction, IEEE/ACM Trans. Comput. Biol. Bioinf. 13 (3) (2016) 578–583.
[50] H. Gunduz, Z. Cataltepe, Borsa Istanbul (BIST) daily prediction using financial news and balanced feature selection, Expert Syst. Appl. 42 (22) (2015) 9001–9011.
[51] H. Wang, X. Yan, Optimizing the echo state network with a binary particle swarm optimization algorithm, Knowledge-Based Syst. 86 (2015) 182–193.
[52] Y. Zhang, S. Wang, P. Phillips, G. Ji, Binary PSO with mutation operator for feature selection using decision tree applied to spam detection, Knowledge-Based Syst. 64 (2014) 22–31.
[53] L. Cervante, B. Xue, M. Zhang, L. Shang, Binary particle swarm optimisation for feature selection: a filter based approach, in: 2012 IEEE Congress on Evolutionary Computation, 2012.
[54] J. Kennedy, Bare bones particle swarms, in: Proceedings of the 2003 IEEE Swarm Intelligence Symposium, SIS'03, 2003.
[55] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.
[56] N.A. Moubayed, A. Petrovski, J. McCall, D2MOPSO: MOPSO based on decomposition and dominance with archiving using crowding distance in objective and solution spaces, Evol. Comput. 22 (1) (2014) 47–77.
[57] Z.H. Zhan, J. Li, J. Cao, J. Zhang, H.S.H. Chung, Y.H. Shi, Multiple populations for multiple objectives: a coevolutionary technique for solving multiobjective optimization problems, IEEE Trans. Cybern. 43 (2) (2013) 445–463.
[58] Q. Lin, J. Li, Z. Du, J. Chen, Z. Ming, A novel multi-objective particle swarm optimization with multiple search strategies, Eur. J. Oper. Res. 247 (3) (2015) 732–744.
[59] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, USA, ACM, 1992, pp. 144–152.
[60] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[61] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[62] E.W. Lee, Selecting the best prediction model for readmission, J. Prev. Med. Public Health 45 (4) (2012) 259–266.
[63] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[64] C. Pardy, Mutual Information as an Exploratory Measure for Genomic Data with Discrete and Continuous Variables, Faculty of Medicine, The University of New South Wales, 2013.
[65] F. Zhao, J. Tang, J. Wang, Jonrinaldi, An improved particle swarm optimization with decline disturbance index (DDPSO) for multi-objective job-shop scheduling problem, Comput. Oper. Res. 45 (Suppl. C) (2014) 38–50.
