An Integrated Machine Learning Framework for Hospital Readmission Prediction (Knowledge-Based Systems, 2018)
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
Article info

Article history:
Received 9 June 2017
Revised 5 January 2018
Accepted 26 January 2018
Available online 1 February 2018

Keywords:
Hospital readmission
Mutual information
Multi-objective optimization
Bare-bones particle swarm optimization
Feature selection
Greedy local search

Abstract

Unplanned readmission (re-hospitalization) is a main source of cost for healthcare systems and is normally considered an indicator of healthcare quality and hospital performance. Poor understanding of the relative importance of predictors and the limited capacity of traditional statistical models challenge the development of accurate predictive models for readmission. This study aims to develop a robust and accurate risk prediction framework for hospital readmission by combining feature selection algorithms and machine learning models. With regard to feature selection, an enhanced version of multi-objective bare-bones particle swarm optimization (EMOBPSO) is developed as the principal search strategy, and a new mutual information-based criterion is proposed to efficiently estimate feature relevancy and redundancy. A greedy local search strategy (GLS) is developed and merged into EMOBPSO to control the final feature subset size as desired. For the modeling process, manifold machine learning models, such as support vector machine, random forest, and deep neural network, are trained with preprocessed datasets and the corresponding feature subsets. In the case study, the proposed methodology is applied to an actual hospital located in Northeast China, with various levels of data collected from the hospital information system. Results obtained from comparative experiments demonstrate the effectiveness of the EMOBPSO and EMOBPSO-GLS feature selection algorithms. The combination of EMOBPSO (EMOBPSO-GLS) and deep neural network possesses robust predictive power across different datasets. Furthermore, insightful implications are abstracted from the obtained elite features and can be used by practitioners to determine the patients vulnerable to readmission and to target the delivery of early resource-intensive interventions.

© 2018 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.knosys.2018.01.027
0950-7051/© 2018 Elsevier B.V. All rights reserved.
74 S. Jiang et al. / Knowledge-Based Systems 146 (2018) 73–90
results of a systematic literature review [4]. Among 21 studies, only 6 reported a c-statistic, that is, the area under the ROC curve (AUC), above 0.70. Most approaches are limited to patients with a specific disease, and the original features are limited to the medical level. Moreover, few previous studies attempted to design and implement a feature selection process. Generally, involving a broad variety of factors benefits the model training by reducing bias, and the relative importance of data collected at the system, medical, and patient levels may differ significantly according to the population studied [4,6]. Motivated by the foregoing analysis and the current demand from our cooperative hospital, we aim to develop in this study a new framework to identify patients with high readmission risk, by improving three key operations: data collection and integration, feature selection, and model training.

In the field of machine learning, feature selection is essential as a pre-processing operation that selects a compact feature subset holding maximal discriminative capability. Feature selection has obtained significant attention in the past decades because it benefits the training algorithm by avoiding overfitting, controlling noise, and improving prediction performance [7,8]. Proposed feature selection algorithms focus on two points: the criterion and the search strategy. A criterion typically includes discriminability and compactness, that is, the selected features should have high relevance to the class labels and low redundancy among themselves. This condition indicates that feature selection is a multi-objective optimization problem. As feature selection has been proven to be NP-hard [9], manifold heuristic search algorithms are employed as search strategies to reduce the search space, such as hill-climbing, simulated annealing (SA) [10], ant colony optimization (ACO) [11], particle swarm optimization (PSO) [12–14], and genetic algorithms (GA) [15,16].

However, the proposed heuristic search strategy-based feature selection algorithms share some common limitations. One is that the criterion normally relies on a certain machine learning model, in which the model is trained as a black-box function to calculate the criterion. This kind of criterion results in a more reliable feature subset at a high cost in computational time and can hardly be generalized to high-dimensional data. Another is that most approaches directly applied popular evolutionary computation techniques as search strategies and did not fine-tune them to the feature selection task [7,17]. Few research works attempt to customize the key operators of evolutionary computation techniques for feature selection.

To compensate for the above drawbacks, we propose a novel feature selection algorithm and apply it to abstract the elite features that influence patient readmission risk. Mutual information (MI), which originated from Shannon's information theory [18,19], is used to calculate the criterion [20,21]. As a nonparametric metric for quantifying the uncertainty of a feature or feature subset, MI makes no assumption regarding the data distribution and has the advantage of measuring the joint effect of features on the target class. For the search strategy, the bare-bones multi-objective particle swarm optimization (BMOPSO) algorithm [22] is employed and tuned for the feature selection task. A greedy local search (GLS) strategy is developed and then merged into the search strategy to control the number of selected features. Compared with previous approaches to patient readmission risk prediction, our research work adds the following improvements:

(a) A wide range of input factors, such as administration data, diagnosis-related information, pharmacy and laboratory data, conditional monitoring data of inpatients, and identity information of patients, are collected from different databases and integrated as the original feature set to generalize the classifier and final results. Instead of merely concerning the readmission risk of patients with a specific disease, all registered inpatients are considered and dealt with by one classifier. A more flexible readmission interval ranging from 30 to 180 days is also considered in our study.

(b) A new hybrid feature selection method is designed and implemented as a pre-processing step to find key factors that influence readmission. For the criterion, we design a novel MI-based criterion with a more accurate measurement of feature redundancy and less computational cost. For the search strategy, the EMOBPSO algorithm is designed, in which the initialization strategy, the Gbest selection strategy, and the update rule for the entire swarm are improved. A GLS strategy is proposed and merged into EMOBPSO to control the number of selected features. Apart from benefiting the model learning, the results can be used as a guide for hospital managers to determine potential reasons for unplanned readmissions.

(c) Instead of limiting the application to a single classification model [23,24], our research introduces three modern learning models to solve this machine learning task: support vector machine (SVM), random forest (RF), and deep neural network (DNN). After examination with cross-validation tests, the model presenting the best performance is put into practice.

The rest of this paper is organized as follows. Section 2 reviews the related work on hospital readmission prediction, MI-based feature selection, and the evolutionary search strategies proposed in this field. Section 3 describes the details of our new methodology. Section 4 presents experimental results on real datasets to evaluate the efficiency and effectiveness of our methodology. In addition, insightful implications for practitioners are explored. Finally, Section 5 concludes the paper.

2. Related works

2.1. Risk prediction model for hospital readmission

By comparing the studies retrieved in two systematic reviews on this problem [4,25], we have found an increasing number of studies devoted to utilizing statistical models for predicting hospital readmission risk. The outcome of 30-day readmission was the most commonly reported [26–29], even though a few approaches selected other intervals ranging from 14 days [30] to 4 years [31]. Almost all approaches employed retrospective data to learn the model and conduct an analysis, and these studies vary in the target population and outcome variables. The majority pre-selected a target population with the same disease or under similar conditions, such as approaches focusing on cardiovascular disease-related readmissions [32,33], surgical condition readmissions [34,35], and mental health condition readmissions [36]. Although a recent study argued that elderly people are not the only group that should be scrutinized in predicting readmissions, readmissions of elderly people have attracted more attention than those of younger people, based on the average age of the target populations reported in previous studies [6]. Among the input features identified as important contributors to readmission, "medical comorbidity," "length of stay," and "number of previous admissions" were used in nearly all studies [25]. The variables "laboratory test" and "medication" were more frequently included in classification models for cardiovascular disease-related readmissions. Basic sociodemographic variables, such as age and gender, were also considered by many studies [29,37,38]. For model comparison and performance evaluation, the most frequently reported metric is the AUC, which is a more comprehensive metric than accuracy for datasets with class imbalance.

The results reported by published studies indicate that unplanned readmission risk prediction remains a poorly understood and complex endeavor. One significant reason is that
the frequently used patient-level factors, such as "medical comorbidity" and other clinical variables, are more appropriate for predicting mortality than readmission risk [4]. The condition of patients after discharge, such as the timeliness of post-discharge nursing care and the quality of medication reconciliation, may affect readmission risk; however, the related data are difficult to collect or monitor. Broad social and environmental factors, such as "access to care" and "social support," are suggested to contribute indirectly to readmission risk; however, the utility of such factors is limited [39]. Another reason is that the traditional logistic regression method is applied in most approaches, with relatively weak predictive power. Recently, Golmohammadi and Radnia [6] introduced three advanced machine learning algorithms in their study: neural network, decision tree-based classification, and Chi-squared automatic interaction detection, to reach a model with a high level of accuracy. Duggla et al. [40] applied a naive Bayes classifier and a decision tree to enhance the overall model performance obtained by logistic regression. In that study, the beneficial effects of conducting a feature selection process were explored as well. The result indicates that this type of pre-processing technique is indeed useful in their scenario, whereas the operation is ignored by most previous studies.

2.2. Feature selection

As a measure of the amount of shared information between two random variables, MI has been frequently used as the criterion in existing filter feature selection algorithms. Early on, an algorithm named MIFS ("mutual information based feature selection") was proposed by Battiti [41], in which the MI between feature vectors is approximated by calculating the MI between the individual components of the vectors. In this algorithm, a greedy sequential forward selection strategy is applied to select the best feature subset. Kwok and Chong-Ho [42] proposed an algorithm called MIFS-U by developing a more accurate estimation of the conditional MI term involved in MIFS, under the assumption that the information is uniformly distributed. Peng et al. [21] proposed the minimal redundancy maximal relevance (mRMR) algorithm, which achieved good performance by wrapping a learning algorithm. Recently, more attention has been paid to the estimation of MI (or entropy) for multiple features. Estevez et al. [43] proposed an enhancement of MIFS called normalized MIFS and demonstrated that MI should be normalized with its corresponding entropy. Wang et al. [44] defined a new feature redundancy measurement that is capable of accurately estimating the MI between features and the target variable. Sun et al. [45] introduced a new scheme to estimate the MI and conditional MI by dynamically re-weighting the candidate features. Most of these algorithms adopt greedy search strategies to incrementally select features, which are more likely to generate locally optimal solutions. The proposed MI-based feature selection algorithms are extensively used to conduct knowledge discovery in different domains, such as load forecasting [46,47], medical diagnosis [48,49], and financial analysis [50].

Evolutionary computation techniques are extensively applied as search strategies in wrapper feature selection algorithms. Compared with GA and ACO, PSO is normally easier to implement, with less computational cost and a faster convergence rate. With these advantages, PSO stands out as a promising search strategy for feature selection. Wang and Yan [51] developed a binary PSO-based feature selection to reduce network complexity and improve generalization ability. In the study by Zhang et al. [52], binary PSO with a mutation operator was used as the feature subset search strategy for spam detection. An improved PSO with two filter techniques was developed by Chhikara et al. [13] for image steganalysis. Among the proposed PSO-based feature selection algorithms, most are wrapper approaches that incur high computational cost and risk failing to maintain high performance with other learning algorithms. Moreover, very few studies have applied or designed multi-objective PSO as a search strategy for feature selection, even though PSO, multi-objective optimization, and feature selection have each been investigated and improved frequently [14]. Therefore, the potential of PSO and multi-objective PSO for feature selection has not been fully explored.

3. Methodology

3.1. Preliminary

3.1.1. Entropy and mutual information

In Shannon's information theory, information is defined as something that removes or reduces uncertainty, while entropy is the expected value of the information contained in random variables. For a random variable X with discrete values, the entropy of X is defined as

  H(X) = −Σ_{x∈X} p(x) log p(x),   (1)

where p(x) = Pr(X = x) is the probability mass function of X. Eq. (1) indicates that entropy depends on the probability distribution of the random variable instead of its actual value. For two discrete random variables X and Y with joint probability density p(x, y), the joint entropy of X and Y is defined as

  H(X, Y) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y).   (2)

If a certain variable is known and others are not, then the remaining uncertainty is measured by conditional entropy. Let variable Y be given; then the conditional entropy H(X|Y) of X with respect to Y is defined as

  H(X|Y) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x|y).   (3)

In Eq. (3), p(x|y) is the posterior probability of X given Y. According to this definition, if X completely depends on Y, then H(X|Y) is zero, which indicates that no other information is required to describe X when Y is known. Conversely, H(X|Y) = H(X) indicates that Y offers no meaningful information regarding X [53]. The relationship between joint entropy and conditional entropy is as follows:

  H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).   (4)

The common information shared between two random variables is defined as mutual information (Eq. (5)), which is a promising indicator of the relevance between two random variables:

  I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x, y) / (p(x)·p(y))].   (5)

From Eq. (5), a large I(X; Y) indicates that the two variables are closely related; if I(X; Y) = 0, then X and Y are completely unrelated. As defined in Eq. (5), the relationship between MI and entropy is illustrated in Fig. 1. With regard to continuous random variables, the differential entropy, conditional differential entropy, and MI are respectively defined as:

  H(X) = −∫ p(x) log p(x) dx,   (6)

  H(Y|X) = −∫∫ p(x, y) log p(y|x) dx dy,   (7)

  I(X; Y) = ∫∫ p(x, y) log [p(x, y) / (p(x) p(y))] dx dy.   (8)
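The discrete quantities in Eqs. (1)–(5) are straightforward to estimate from data. The following sketch (the helper names are ours, not from the paper) computes empirical entropy and MI with base-2 logarithms:

```python
import numpy as np
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy H(X) of a discrete sample, as in Eq. (1)."""
    n = len(xs)
    return -sum((c / n) * np.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), which follows from Eqs. (4) and (5)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# A feature that determines the class label shares all of the label's
# information, so I(X; Y) = H(Y); an unrelated feature gives I(X; Y) = 0.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (bit)
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```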
  minimize F(x) = [f1(x), f2(x), ..., fk(x)]
  subject to g_i(x) ≤ 0, i = 1, 2, ..., m,
             h_i(x) = 0, i = 1, 2, ..., l,   (11)

where x is the vector of decision variables, f_i(x) is the ith objective function of x, k is the number of objective functions, and g_i(x) and h_i(x) are the constraints of the problem. In multi-objective optimization, the quality of a solution is explained in terms of trade-offs between conflicting objectives. Let y and z be two solutions of the above-mentioned k-objective minimization problem. If the conditions shown in Eq. (12) are met, one can say that y dominates z. When a solution is not dominated by any other solution, it is named a Pareto-optimal solution. The set of all Pareto-optimal solutions forms the Pareto front. A multi-objective optimization algorithm is designed to search for the Pareto front.
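The dominance test and Pareto-front extraction described above can be sketched directly for a minimization problem (function names are ours):

```python
def dominates(y, z):
    """y dominates z: y is no worse in every objective and strictly
    better in at least one (minimization, per the conditions of Eq. (12))."""
    return all(a <= b for a, b in zip(y, z)) and any(a < b for a, b in zip(y, z))

def pareto_front(solutions):
    """Keep the solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

points = [(1, 5), (2, 2), (4, 1), (3, 3)]   # (3, 3) is dominated by (2, 2)
print(pareto_front(points))  # [(1, 5), (2, 2), (4, 1)]
```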
  [I(f_i; c) + I(f_j; c)] / [2 · max_{f_i∈S} I(f_i; c)] − I(f_i; f_j),  f_i, f_j ∈ S, i ≠ j.   (17)
Algorithm 1
Generate a "promising particle" X based on the odds ratios (OR) of the original features.

  Categorize F into two groups, a high-OR group and a low-OR group, by the threshold OR_threshold = 1.
  Normalize OR_j within each group:
    for each h_j ∈ {j | feature j is in the high-OR group}, normalize the corresponding OR to the scale [0.5, 1]:
      NOR_hj = 0.5 + 0.5 · (OR_hj − OR_hj^min) / (OR_hj^max − OR_hj^min)
    end
    for each l_j ∈ {j | feature j is in the low-OR group}, normalize the corresponding OR to the scale [0, 0.5]:
      NOR_lj = 0.5 · (OR_lj − OR_lj^min) / (OR_lj^max − OR_lj^min)
    end
  Convert the normalized OR to a probability by adding noise ε:
    for j = 1 to N do
      randomly generate a noise ε ~ U(−0.25, 0.25)
      x_j = NOR_j + ε
      if x_j > 1 then x_j = 1
      else if x_j < 0 then x_j = 0
      end if
    end
  return X = (x_1, x_2, ..., x_N)
End
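Algorithm 1 can be rendered compactly in NumPy; the sketch below follows our reading of the pseudocode (the function name, the array-based interface, and the handling of a degenerate one-member group are our assumptions):

```python
import numpy as np

def promising_particle(odds_ratios, rng=None):
    """Sketch of Algorithm 1: map each feature's odds ratio (OR) to a
    selection probability in [0, 1], then perturb it with uniform noise."""
    rng = rng or np.random.default_rng()
    ors = np.asarray(odds_ratios, dtype=float)
    x = np.empty_like(ors)

    high = ors >= 1.0          # split at the threshold OR_threshold = 1
    low = ~high
    # Normalize each group to its own scale: high-OR -> [0.5, 1], low-OR -> [0, 0.5].
    if high.any():
        lo, hi = ors[high].min(), ors[high].max()
        x[high] = 0.5 + 0.5 * (ors[high] - lo) / (hi - lo) if hi > lo else 0.75
    if low.any():
        lo, hi = ors[low].min(), ors[low].max()
        x[low] = 0.5 * (ors[low] - lo) / (hi - lo) if hi > lo else 0.25
    # Add noise eps ~ U(-0.25, 0.25), then clip each x_j back to [0, 1].
    return np.clip(x + rng.uniform(-0.25, 0.25, size=x.shape), 0.0, 1.0)
```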
Algorithm 4. EMOBPSO algorithm for feature selection.
Algorithm 5. GLS for the current swarm.
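Based on the textual description of GLS (grow S with the feature giving the maximal increment of I(c; S); shrink it by the feature whose removal costs the least), a minimal sketch is as follows. The `dependency` callback, standing in for an I(c; S) estimator, and the function name are our assumptions:

```python
def gls_adjust(S, F, dfs, dependency):
    """Greedy local search sketch: resize feature subset S to exactly `dfs`
    features while greedily preserving the dependency measure I(c; S).
    `dependency(subset)` is an assumed caller-supplied estimator of I(c; S)."""
    S = set(S)
    while len(S) < dfs:   # too small: add the most valuable remaining feature
        S.add(max(set(F) - S, key=lambda f: dependency(S | {f})))
    while len(S) > dfs:   # too large: drop the feature whose loss hurts least
        S.remove(max(S, key=lambda f: dependency(S - {f})))
    return S
```

With `dependency = sum` as a toy stand-in, `gls_adjust({1, 2}, [1, 2, 3, 4], 3, sum)` grows the subset to `{1, 2, 4}`.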
In certain cases, practitioners want to acquire concise and informative features that reflect the core reasons for unplanned readmission. To meet this demand, an operator should be added to EMOBPSO to control the number of selected features as predetermined. Based on the principle of maximum relevance and minimum redundancy, a greedy local search operator (GLS) is designed and merged into EMOBPSO. The details of GLS are shown in Algorithm 5.

As proven by Peng et al. [21], a combination of the max-relevance and min-redundancy criteria is equivalent to the max-dependency criterion if one feature is selected at a time. In GLS, if the number of selected features is smaller than DFS, then the one feature from (F − S) that achieves the maximal increment of the dependency, that is, I(c; S), is added. Similarly, if the number of selected features is larger than DFS, then the one feature whose removal from the current S causes the minimal decrement of I(c; S) is removed. Thus, the quality of the feature subset is maximally maintained while the number of selected features is controlled as desired.

3.5.1. Support vector machine

SVM is one of the most influential approaches to supervised learning tasks [59]. Considering a two-class, linearly separable classification task with the dataset {x1, x2, ..., xm} and xi's target class label yi ∈ {1, −1}, we can summarize the original SVM as the following constrained optimization problem:

  minimize (1/2)‖w‖²  subject to  yi(wᵀxi + b) ≥ 1, ∀i.   (21)

Driven by a linear function (wᵀx + b), SVM aims to find a decision boundary that classifies all points correctly. This process is similar to logistic regression. Among studies focusing on SVM, one key innovation is the kernel trick, which is derived by observing that the linear function of SVM can be rewritten as

  wᵀx + b = b + Σ_{i=1}^{m} αi xᵀxi,   (22)
where α is a vector of coefficients. Rewriting the key operators in this form allows us to replace x by the output of a given function φ(x) and the dot product with a kernel function k(x, xi) = φ(x)·φ(xi). After conducting this replacement, predictions can be implemented using the function

  f(x) = b + Σ_{i=1}^{m} αi k(x, xi).   (23)

From Eq. (23), f(x) is nonlinear with respect to x, but the relationship between f(x) and φ(x) is linear. By using this kernel trick to map the inputs into high-dimensional feature spaces, SVM is able to perform nonlinear classification efficiently, and nonlinear models can be learned through convex optimization techniques. In this study, the radial basis function (RBF) kernel is selected for SVM.

3.5.2. Random forest

RF is an ensemble method for classification, regression, and other tasks, which operates by constructing a multitude of decision trees at training time and outputting the prediction by taking the average of all individual trees [60]. RF is easy to handle, with no distribution assumptions. In particular, deep decision trees are capable of learning highly irregular patterns, but they normally have very high variance and are more likely to overfit their training sets. Based on bootstrap aggregation (i.e., bagging), RF provides a way of averaging multiple deep decision trees with the goal of reducing the variance. The basic idea of RF can be summarized in four points: (1) training many i.i.d. trees on bootstrap samples of the training data, (2) reducing bias by growing the trees sufficiently deep, (3) reducing the variance of noise by averaging the output of all trees, and (4) maximizing the variance reduction by minimizing the correlation between trees through bootstrapping the data for each tree and sampling the available variable set at each node. Generally, RF achieves robust performance even if several outliers are present in the predictive variables.

3.5.3. Deep neural network

In the era of big data, DNN has achieved outstanding results in manifold knowledge discovery and data mining applications [61]. This study employs a feedforward DNN model, in which the information flows from an input layer with input features, through multiple layers of nonlinearity, and finally to the output layer to produce the target class label (Fig. 4a). In DNN, each neuron is connected to other neurons by different communication links, with each link carrying an associated weight parameter. The information transmission among neurons in different layers is illustrated in Fig. 4b, where the function f(x) is called the "activation function".

As reported by previous studies, one significant optimization challenge in training DNNs is overfitting. In this study, the regularization strategy called "dropout" is introduced to avoid overfitting. To train a DNN with dropout, some neurons are randomly omitted with a predefined probability in each generation, and finally multiple different DNNs are shared and combined. Thus, the dropout method can be considered an extreme form of bootstrap aggregation, which enables the assembly of many large DNNs.

4. Case study

4.1. Data collection and pre-processing

To conduct the present methodology in a hospital while validating the effectiveness of EMOBPSO and EMOBPSO-GLS on feature selection, retrospective data are collected from a tertiary referral hospital located in Northeast China, from November 1, 2013 to December 31, 2015. The hospitalization records of 45,407 individuals are retrieved from the hospital information system (HIS) and assembled to determine readmissions. In this study, readmission (i.e., the target class label) is defined as re-hospitalization for any cause within a given time window (i.e., readmission interval) of discharge from the last hospital admission. After removing the patients who died during their hospital stay, we converted the raw data into four datasets, namely, R30, R60, R120, and R180, by using four different readmission intervals (i.e., 30-, 60-, 120-, and 180-day) to define the target class label for each record, respectively. The subject of this study is individual patients, and if a patient has been admitted several times, each admission is evaluated individually and included in the dataset.

Based on results acquired from relevant studies and the help of domain experts from the hospital, the input features of the hospital readmission problem mainly consist of "demographic," "social and economic status," "treatment and clinical," and "healthcare utilization" factors [62]. To collect as many informative features as possible, relevant data are collected from different databases in the HIS and assigned to individual hospitalization records by utilizing the uniqueness of the patient ID number and SQL queries. The details of these databases and related fields are listed in Table 1.

From Table 1, the representative fields collected from these three databases almost cover the relevant factors that directly or indirectly influence readmission risk. For demographic factors, the patient's age, home address, and marital status are involved; for fac-
tors related to social and economic status, the mode of payment, occupation, and the corresponding insurance type are included as original features; for factors related to treatment and clinical results, whether the diagnosis is confirmed, the disease category, the number of medical treatments, and the severity level at different stages are involved; for healthcare utilization-related factors, whether it is a first visit, the departments of registration and discharge, whether the patient is accompanied, the follow-up flag, and the length of stay are considered. The registration times at the outpatient and inpatient departments are used to judge whether a long waiting time occurred between these two operations. The season of registration is also included in the original features. All these features are combined with the corresponding records in the four datasets, and categorical variables are converted to binary dummy features. Then, the features with fewer than 40 positive values are excluded from the datasets. The total number of samples and the number of readmitted patients for each dataset are shown in Fig. 5.

Table 1. Relevant fields of the three databases in HIS.
Table 2. Configuration of two hyper-parameters for SMOTE.
Fig. 5. Number of examples in the four datasets.

From Fig. 5, all four datasets are unbalanced with respect to the ratio of readmitted to non-readmitted patients. To overcome this problem, a synthetic minority oversampling technique called SMOTE is used in this study, implemented before feature selection and model training. In SMOTE, the minority class is oversampled by generating "synthetic" examples rather than by oversampling with replacement [63]. Two hyper-parameters of SMOTE are set in Table 2 based on several initial trials. Prec.over denotes how many extra examples from the minority class are generated, and Prec.under denotes how many extra examples from the majority classes are randomly selected for each example generated from the minority class. Notably, the adjustment of these two hyper-parameters determines the tradeoff between type I misclassification (i.e., a non-readmitted patient is misclassified as readmitted) and type II misclassification (i.e., a readmitted patient is misclassified as non-readmitted). For practitioners in a certain hospital, if the goal is to spot patients with a high readmission risk and take measures to prevent readmission, a higher Prec.over and a lower Prec.under should be configured to assign a higher cost to type II misclassification. In the less common case where the goal is to estimate the number of patients with high readmission risk and use this estimation to provide the necessary resources, a lower Prec.over and a higher Prec.under should be configured to assign a higher cost to type I misclassification. To provide a general research approach, this study balances the two types of misclassification (maximum AUC) with the configurations shown in Table 2.

4.2. Experimental set-up

The main steps in implementing the present methodology to solve the unplanned readmission prediction problem are illustrated in Fig. 6. A series of comparative experiments is conducted, and the relevant results are listed in the remainder of Section 4. In Section 4.3, we validate the efficiency of EMOBPSO by monitoring its iterative process and comparing its solution distribution with various kinds of raw BMOPSOs. In Section 4.4, the effectiveness of the EMOBPSO-based feature selection algorithm is tested, in which the execution time and model performance are compared with those of GA- and SA-based feature selection algorithms. In Section 4.5, the effectiveness of EMOBPSO-GLS is tested and compared with
feature selection algorithms that are capable of controlling the feature subset size: mRMR, correlation-based feature ranking (CFR), and the variable importance-based feature ranking algorithm (VIFR). In these two sections, the effectiveness of the different feature selection algorithms is quantified in two aspects: the feature reduction rate and the impact of the selected features on model performance. The impact on model performance is tested by using the selected features to train the classification models (SVM, RF, and DNN), and four evaluation metrics are considered to assess the quality of the predicted outcomes (Table 3). In Table 3, TP, FP, and FN denote "true positive," "false positive" (i.e., type I error), and "false negative" (i.e., type II error), respectively. The training set is set as the first 75% of each dataset and is used to conduct all involved feature selection algorithms. As the samples of readmitted patients are sparse in each dataset, the testing set is defined as the entire dataset, and it is used to quantify the impact of the selected features on model performance. All criteria listed in Table 3 are calculated via a five-fold cross-validation process on the testing set.

Table 3. Evaluation metrics for the predicted outcomes.
  AUC: Reflects the trade-off between the rate of patients correctly predicted as readmissions and the rate of patients incorrectly predicted as readmissions.
  Accuracy: Rate of correctly classified patients.
  Precision: Rate of patients correctly predicted as readmissions to the total number of patients predicted as readmissions. Relevant to type I misclassification: P = TP/(TP + FP).
  Recall: Rate of patients correctly predicted as readmissions to the total number of patients that are actually readmitted. Relevant to type II misclassification: R = TP/(TP + FN).

Considering the results obtained from these comparative experiments, we explore the best combination of all involved feature selection algorithms and classification models (Section 4.6).

For EMOBPSO, the number of generations is fixed to 100 for the involved datasets, and the swarm size is set to 80. The two objectives of each particle in the initial and final swarms are monitored and plotted in Fig. 7. To facilitate the identification of the different non-dominated front levels, the sign of Objective 1 is reversed to negative, so that both objectives can be regarded as a minimization problem. Fig. 7 indicates that EMOBPSO reaches convergence within 100 generations for all involved datasets, based on the dense non-dominated front lines shown in the plots of the final swarms.

To test the global search ability of EMOBPSO, we calculate the distribution of particles in the two-objective function space before and after applying EMOBPSO. Meanwhile, the distribution of the final solution is also compared with three kinds of raw algorithms, respectively, to validate the impact of each improved operator. The three kinds of raw algorithms are EMOBPSO without the present initialization strategy (EMOBPSO/IS), EMOBPSO without the present Gbest selection strategy (EMOBPSO/GS), and EMOBPSO without the dynamic update rule (EMOBPSO/UR). Tan's spacing metric (TS) is used to evaluate the solution distribution:

  TS = sqrt( (1/|E|) Σ_{i=1}^{|E|} (D_i − D̄)² ) / D̄,   (24)

where D̄ = Σ_{i=1}^{|E|} D_i / |E|, and D_i is the Euclidean distance between solution x_i in E and the nearest solution in the candidate space [65]. A smaller TS indicates a superior solution distribution and better global search ability. Table 4 shows the experimental results for the four datasets, in which the TS values are the average of five repeated experiments. The significant difference between the initial set and the set obtained after applying EMOBPSO demonstrates the powerful global search ability of EMOBPSO. Compared with EMOBPSO/GS, the performance of EMOBPSO achieves a significant improvement in solution distribution, with the help of the roulette wheel selection-
and the most important features obtained by EMOBPSO-GLS is in- based Gbest selection strategy. Compared with EMOBPSO/IS and
terpreted as management implications, which is helpful for prac- EMOBPSO/UR, EMOBPSO has still achieved slight improvements.
titioners to prevent readmission in advance. In this study, all in- Therefore, all three improved operators prompt the swarm to
volved experiments are run on Rstudio version 0.99.893 (contain- reach convergence more efficiently and ensure the powerful global
ing R version 3.1.2). Calculation of pairwise MI between all types of search ability of EMOBPSO.
variables (including continuous vs continuous, continuous vs dis-
crete, and discrete vs discrete) is implemented through an R pack- 4.4. Effectiveness of EMOBPSO feature selection algorithm
age “mpmi,” which is based on a kernel smoothing approach [64].
GA and SA-based wrapper methods are implemented through the To evaluate the effectiveness of EMOBPSO-based feature selec-
“caret” package. The computer used is Intel(R) Core(TM) i7-4600 tion algorithm, an SA-based and a GA-based wrapper feature se-
CPU, 8 GB RAM, operated by Windows 64-bit Operating System. lection algorithms are implemented to conduct the same task. For
Fig. 7. Plots of MI-based feature relevancy and feature redundancy for each particle in the initial and final swarm.
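The swarms plotted in Fig. 7 are evolved with the bare-bones sampling rule of Kennedy [54], on which EMOBPSO builds: each particle's new position is drawn from a Gaussian centred on the midpoint of its personal best and the global best, with standard deviation equal to their absolute difference. A minimal single-objective sketch for orientation (illustrative only; the authors' EMOBPSO additionally handles two objectives and adds the initialization, Gbest selection, and dynamic update operators discussed above):

```python
import numpy as np

def bbpso(objective, dim=5, swarm_size=20, iters=200, seed=0):
    """Minimize `objective` with canonical bare-bones PSO: resample each
    position from N((pbest + gbest)/2, |pbest - gbest|)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, size=(swarm_size, dim))
    pbest = x.copy()
    pfit = np.array([objective(p) for p in x])
    gbest = pbest[pfit.argmin()].copy()
    for _ in range(iters):
        mu = (pbest + gbest) / 2.0
        sigma = np.abs(pbest - gbest)   # zero std simply collapses onto the mean
        x = rng.normal(mu, sigma)       # bare-bones position update
        fit = np.array([objective(p) for p in x])
        better = fit < pfit
        pbest[better] = x[better]
        pfit[better] = fit[better]
        gbest = pbest[pfit.argmin()].copy()
    return gbest, float(pfit.min())

# Toy run on the 5-D sphere function; the swarm contracts around the optimum.
best, best_fit = bbpso(lambda v: float(np.sum(v ** 2)))
```

Because the sampling variance shrinks as personal bests agree with the global best, the swarm converges without any velocity or inertia parameters to tune.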
Table 4
TS values obtained by different search strategies.
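Eq. (24) can be computed directly from the objective vectors of a solution set. A small sketch, assuming D_i is measured against the nearest other member of E in objective space (the paper measures against the nearest solution in the candidate space, so this is an approximation):

```python
import numpy as np

def tan_spacing(objectives):
    """Tan's spacing metric TS (Eq. (24)) for a solution set E given as an
    (|E|, m) array of objective vectors. D_i is the Euclidean distance from
    solution i to its nearest neighbour; TS is the coefficient of variation
    of the D_i, so an even spread gives TS = 0 and smaller is better."""
    E = np.asarray(objectives, dtype=float)
    diff = E[:, None, :] - E[None, :, :]        # pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)              # ignore self-distances
    D = dist.min(axis=1)                        # nearest-neighbour distances D_i
    D_bar = D.mean()
    return float(np.sqrt(np.mean((D - D_bar) ** 2)) / D_bar)

# An evenly spaced front has TS = 0; bunching solutions together raises TS.
even_front = [[0, 3], [1, 2], [2, 1], [3, 0]]
uneven_front = [[0, 3], [0.1, 2.9], [2, 1], [3, 0]]
```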
Table 5
Execution time of EMOBPSO, GA, and SA.

Table 7
Impact of feature selection algorithms on SVM.

Dataset  Criterion   EMOBPSO   GA       SA
R30      AUC         90.38%    52.49%   83.74%
         Accuracy    91.37%    92.01%   90.32%
         Precision   43.43%    4.49%    39.24%
         Recall      70.08%    0.44%    66.43%
R60      AUC         89.61%    89.57%   87.04%
         Accuracy    89.75%    89.58%   87.94%
         Precision   49.51%    48.95%   43.38%
         Recall      67.42%    67.81%   63.49%
R120     AUC         88.36%    87.85%   77.50%
         Accuracy    82.55%    81.95%   79.42%
         Precision   44.85%    43.77%   37.98%
         Recall      75.50%    75.35%   60.71%
R180     AUC         87.73%    65.32%   76.67%
         Accuracy    80.92%    68.69%   75.93%
         Precision   50.63%    31.71%   41.64%
         Recall      77.25%    53.42%   58.79%

Table 8
Impact of feature selection algorithms on RF.

Table 9
Impact of feature selection algorithms on DNN.

Dataset  Criterion   EMOBPSO   GA       SA
R30      AUC         90.20%    66.81%   85.43%
         Accuracy    88.22%    60.08%   89.13%
         Precision   35.15%    11.66%   36.10%
         Recall      74.83%    70.33%   68.08%
R60      AUC         89.69%    89.54%   86.18%
         Accuracy    86.18%    84.71%   81.20%
         Precision   40.43%    37.36%   32.05%
         Recall      74.37%    75.67%   73.15%
R120     AUC         89.21%    87.30%   79.30%
         Accuracy    76.59%    73.63%   62.76%
         Precision   37.56%    34.62%   25.73%
         Recall      86.54%    85.28%   79.58%
R180     AUC         88.83%    68.14%   76.70%
         Accuracy    77.81%    49.04%   60.84%
         Precision   46.25%    26.05%   30.78%
         Recall      84.86%    85.45%   80.96%
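The AUC, accuracy, precision, and recall values reported in these tables follow the Table 3 definitions. Accuracy, precision, and recall reduce to confusion-matrix arithmetic; AUC additionally requires ranked prediction scores, so it is omitted from this minimal sketch. The counts below are hypothetical, chosen only to illustrate the formulas:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall as defined in Table 3.
    Precision penalizes type I errors (FP); recall penalizes type II errors (FN)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # P = TP / (TP + FP)
    recall = tp / (tp + fn)      # R = TP / (TP + FN)
    return accuracy, precision, recall

# Hypothetical fold: 70 readmissions caught, 30 missed, 90 false alarms.
acc, prec, rec = classification_metrics(tp=70, fp=90, fn=30, tn=810)
# acc = 0.88, prec = 0.4375, rec = 0.70
```

Note how the imbalance shows up: accuracy is high even though fewer than half the predicted readmissions are real, which mirrors the text's warning that accuracy is a poor criterion for class-imbalanced data.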
For SA, the population size and number of generations are set equal to those of EMOBPSO; for GA, the population size is reduced to 40 and the number of generations is set to 50, because the configuration used for EMOBPSO and SA would result in an extremely long execution time. The crossover rate is fixed at 0.8 and the mutation rate at 0.1 for both SA and GA. A linear discriminant analysis (LDA) model, a simple linear classifier with low computational cost, is used to calculate the fitness value for SA and GA. The execution time of the entire iterative process of the three algorithms is listed in Table 5; EMOBPSO demonstrates a notably lower computational cost than the other two wrapper feature selection algorithms. These results indicate that the proposed MI-based criterion (Eq. (17)) incurs a much lower and more stable computational cost, even compared with a criterion based on a simple linear model.

Table 6 shows the number of selected features and the feature reduction rate of the different feature selection algorithms. The impact of these feature selection algorithms on training SVM, RF, and DNN is quantified in Tables 7, 8, and 9, respectively.

For SVM, Table 7 shows that the EMOBPSO feature selection algorithm achieves the highest AUC on all four datasets (90.38%, 89.61%, 88.36%, and 87.73%), compared with GA and SA. Although the execution time of GA and SA is much longer than that of EMOBPSO, the quality of the feature subsets obtained by these two wrapper-based feature selection algorithms is significantly worse in some cases: the AUC of GA is only 52.49% in R30, and the AUC of SA decreases to 77.50% and 76.67% in R120 and R180. With regard to precision, EMOBPSO outperforms GA and SA in most pairwise comparisons; with regard to recall, EMOBPSO likewise outperforms GA and SA in most cases. With regard to accuracy, the differences among the feature selection algorithms are less pronounced in R30, R60, and R120, and even GA is able to achieve 92.01% accuracy in R30. This result indicates that accuracy is not an appropriate criterion for the class imbalance problem.

For RF and DNN, the differences in model performance among the three feature selection algorithms show patterns similar to those discussed for SVM. EMOBPSO still achieves high AUC in RF (90.32%, 90.17%, 89.70%, and 89.30%) and DNN (90.20%, 89.69%, 89.21%, and 88.83%), whereas GA or SA performs worst in most cases. These results further validate the effectiveness of the EMOBPSO-based feature selection algorithm, as well as the generality of the obtained feature subsets. In summary, the results shown in Tables 6–9 demonstrate that the feature subsets obtained by EMOBPSO maximally maintain the information carried by the original feature sets while reducing the number of features as much as possible. Because the original feature spaces of the four datasets are high-dimensional, applying wrapper-based feature selection algorithms to the present problem poses risks.
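The efficiency of the MI-based criterion is unsurprising: mutual information needs only probability estimates, not repeated model fits. For illustration, here is a plug-in estimator for two discrete variables; the paper's "mpmi" R package instead uses kernel smoothing [64], which also covers continuous variables, so this is only a sketch of the quantity being computed:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate (in nats) of I(X; Y) = sum_{x,y} p(x,y) *
    log[p(x,y) / (p(x) p(y))] for two discrete variables."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0.0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

# Identical variables: MI equals the variable's entropy, here log(2).
mi_dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# Independent variables: MI is zero.
mi_independent = mutual_information([0, 1, 0, 1], [0, 0, 1, 1])
```

A full pass over all feature pairs therefore costs one counting sweep per pair, versus one model training run per candidate subset for a wrapper.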
Fig. 8. Boxplots of AUC, accuracy, precision, and recall achieved by EMOBPSO and EMOBPSO-GLS under different settings of DFS.
Table 10
Actual number of selected features acquired by EMOBPSO-GLS under different settings of DFS.

Dataset  DFS = 10  DFS = 20  DFS = 30  DFS = 40  DFS = 50  DFS = 60
R30      17        21        35        34        49        59
R60      10        26        33        40        44        49
R120     10        23        28        42        53        60
R180     13        19        33        40        50        45

4.5. Effectiveness of EMOBPSO-GLS feature selection algorithm

Table 10 shows the number of selected features acquired by EMOBPSO-GLS under different values of DFS. The impact of EMOBPSO-GLS, CFR, mRMR, and VIFR on training SVM, RF, and DNN is quantified in Tables 11, 12, and 13, respectively.

Table 10 shows that, by introducing GLS into EMOBPSO, the feature subset sizes are brought under control, and the actual number of selected features is close to the predefined values of DFS. Table 11 shows that the different feature selection algorithms obtain similar AUC, accuracy, precision, and recall with regard to SVM when DFS (or FS for CFR, mRMR, and VIFR) is equal to or greater than 30. When DFS is less than 30, the feature subsets obtained by EMOBPSO-GLS still deliver the desired impact on SVM, whereas the other feature selection algorithms cannot achieve comparable results under the same circumstances. For example, when DFS is equal to 10, EMOBPSO-GLS reaches 86.1%, 81.6%, 86.3%, and 85.5% AUC for R30, R60, R120, and R180, respectively, whereas AUC decreases to 78.8% for CFR (dataset R120), 79.7% for mRMR (dataset R60), and 84.6% for VIFR (dataset R30) when FS is equal to 10. Similarly, for RF, EMOBPSO-GLS maintains a high AUC as DFS decreases, while several abnormal results generated by CFR, mRMR, and VIFR still exist, such as the 77.7% AUC generated by CFR (in R60, FS = 0; AUC = 86.7% for EMOBPSO-GLS) and the 84.8% AUC generated by VIFR (in R60, FS = 20; AUC = 88.2% for EMOBPSO-GLS). Table 13 shows that DNN reduces the differences in AUC among the feature selection algorithms and among different settings of DFS (FS), and EMOBPSO-GLS still outperforms the other algorithms to a certain degree. Overall, besides regularizing the feature subset size, GLS helps EMOBPSO maximally maintain the key information by retaining elite feature combinations during the iterative process, compared with other widely used filter methods.

4.6. Model selection and management implications for practitioners

Based on the results discussed in Sections 4.4 and 4.5, EMOBPSO and EMOBPSO-GLS are capable of attaining elite feature subsets with minimum information loss. To analyze which model performs best under EMOBPSO and EMOBPSO-GLS, the distribution of criterion values achieved by EMOBPSO and EMOBPSO-GLS is shown as a group of boxplots under different settings of DFS for SVM, RF, and DNN. In Fig. 8(a), DNN achieves a relatively higher AUC than SVM and RF, with lower variance. Fig. 8(c) and (d) indicate that DNN generates significantly higher recall than SVM and RF, at the expense of precision, to obtain a high AUC. This outcome is appropriate for general usage in a healthcare system. Therefore, the combination of EMOBPSO (or EMOBPSO-GLS) and DNN possesses the strongest predictive power and exhibits high robustness under different settings of DFS and across datasets, and the high recall demonstrates that DNN is a good choice for preventing unplanned readmissions in advance.

Table 14 lists the feature subsets selected by EMOBPSO-GLS with DFS equal to 10. Based on the relevant model performance shown in Tables 11–13, these features retain the most predictive power for the four datasets and their respective discrimination values. Some similarities and differences among individual features and feature combinations in the four datasets are analyzed to provide practical insights for practitioners in the hospital. Among demographic factors, the patient's age is included in R30, R120, and R180, which shows a notable impact on readmission. The marital status indicator (i.e., whether the patient is married or not) is selected only in R180, and no address-related factor is included. Among social and economic status-related factors, the payment covered by basic social insurance is a significant indicator of readmission and is included in all four datasets. More early resource-intensive interventions should be arranged and delivered to the patients in this
Table 11
Impact of EMOBPSO-GLS, CFR, mRMR, and VIFR on SVM performance (AU = AUC, AC = accuracy, P = precision, R = recall; one column group per dataset).

                   R30                      R60                      R120                     R180
Algorithm     DFS  AU(%) AC(%) P(%)  R(%)   AU(%) AC(%) P(%)  R(%)   AU(%) AC(%) P(%)  R(%)   AU(%) AC(%) P(%)  R(%)
EMOBPSO-GLS   10   86.1  91.4  43.0  64.4   81.6  90.7  53.7  57.1   86.3  79.7  40.6  79.1   85.5  78.6  46.9  81.7
Table 12
Impact of EMOBPSO-GLS, CFR, mRMR, and VIFR on RF performance.

Table 13
Impact of EMOBPSO-GLS, CFR, mRMR, and VIFR on DNN performance.
[21] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 1226–1238.
[22] Z. Yong, G. Dun-wei, Z. Wan-qiu, Feature selection of unreliable data using an improved multi-objective PSO algorithm, Neurocomputing 171 (2016) 1281–1290.
[23] M. Taha, A. Pal, J.D. Mahnken, S.K. Rigler, Derivation and validation of a formula to estimate risk for 30-day readmission in medical patients, Int. J. Qual. Health Care 26 (3) (2014) 271–277.
[24] P.E. Cotter, V.K. Bhalla, S.J. Wallis, R.W. Biram, Predicting readmissions: poor performance of the LACE index in an older UK population, Age Ageing 41 (6) (2012) 784–789.
[25] H. Zhou, P.R. Della, P. Roberts, L. Goh, S.S. Dhaliwal, Utility of models to predict 28-day or 30-day unplanned hospital readmissions: an updated systematic review, BMJ Open 6 (6) (2016).
[26] B.G. Hammill, et al., Incremental value of clinical data beyond claims data in predicting 30-day outcomes after heart failure hospitalization, Circ.-Cardiovasc. Qual. Outcomes 4 (1) (2011) 60–67.
[27] I. Shams, S. Ajorlou, K. Yang, A predictive analytics approach to reducing 30-day avoidable readmissions among patients with heart failure, acute myocardial infarction, pneumonia, or COPD, Health Care Manag. Sci. 18 (1) (2015) 19–34.
[28] S. Keyhani, L.J. Myers, E. Cheng, P. Hebert, L.S. Williams, D.M. Bravata, Effect of clinical and social risk factors on hospital profiling for stroke readmission: a cohort study, Ann. Intern. Med. 161 (11) (2014) 775–784.
[29] C. Walraven, J. Wong, A.J. Forster, S. Hawken, Predicting post-discharge death or readmission: deterioration of model performance in population having multiple admissions per patient, J. Eval. Clin. Pract. 19 (6) (2013) 1012–1018.
[30] J.W. Thomas, Does risk-adjusted readmission rate provide valid information on hospital quality? Inquiry 33 (3) (1996) 258–270.
[31] C. Boult, B. Dowd, D. McCaffrey, L. Boult, R. Hernandez, H. Krulewitch, Screening elders for risk of hospital admission, J. Am. Geriatr. Soc. 41 (8) (1993) 811–817.
[32] Q.L. Huynh, et al., Roles of nonclinical and clinical data in prediction of 30-day rehospitalization or death among heart failure patients, J. Card. Fail. 21 (5) (2015) 374–381.
[33] S. Sudhakar, W. Zhang, Y.-F. Kuo, M. Alghrouz, A. Barbajelata, G. Sharma, Validation of the readmission risk score in heart failure patients at a tertiary hospital, J. Card. Fail. 21 (11) (2015) 885–891.
[34] D.J. Taber, et al., Inclusion of dynamic clinical data improves the predictive performance of a 30-day readmission risk model in kidney transplantation, Transplantation 99 (2) (2015) 324–330.
[35] J.C. Iannuzzi, F.J. Fleming, K.N. Kelly, D.T. Ruan, J.R. Monson, J. Moalem, Risk scoring can predict readmission after endocrine surgery, Surgery 156 (6) (2014) 1432–1440.
[36] S.N. Vigod, et al., READMIT: a clinical risk index to predict 30-day readmission after discharge from acute psychiatric units, J. Psychiatr. Res. 61 (2015) 205–213.
[37] S. Yu, F. Farooq, A. van Esbroeck, G. Fung, V. Anand, B. Krishnapuram, Predicting readmission risk with institution-specific prediction models, Artif. Intell. Med. 65 (2) (2015) 89–96.
[38] R. Gildersleeve, P. Cooper, Development of an automated, real time surveillance tool for predicting readmissions at a community hospital, Appl. Clin. Inf. 4 (2) (2013) 153–169.
[39] R. Amarasingham, et al., An automated model to identify heart failure patients at risk for 30-day readmission or death using electronic medical record data, Med. Care 48 (11) (2010) 981–988.
[40] R. Duggal, S. Shukla, S. Chandra, B. Shukla, S.K. Khatri, Impact of selected pre-processing techniques on prediction of risk of early readmission for diabetic patients in India, Int. J. Diabetes Dev. Countries 36 (4) (2016) 469–476.
[41] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Networks 5 (4) (1994) 537–550.
[42] N. Kwak, C. Chong-Ho, Input feature selection for classification problems, IEEE Trans. Neural Networks 13 (1) (2002) 143–159.
[43] P.A. Estevez, M. Tesmer, C.A. Perez, J.M. Zurada, Normalized mutual information feature selection, IEEE Trans. Neural Networks 20 (2) (2009) 189–201.
[44] Z. Wang, M. Li, J. Li, A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure, Inf. Sci. 307 (2015) 73–88.
[45] X. Sun, Y. Liu, M. Xu, H. Chen, J. Han, K. Wang, Feature selection using dynamic weights for classification, Knowledge-Based Syst. 37 (2013) 541–549.
[46] I. Koprinska, M. Rana, V.G. Agelidis, Correlation and instance based feature selection for electricity load forecasting, Knowledge-Based Syst. 82 (2015) 29–40.
[47] N. Huang, Z. Hu, G. Cai, D. Yang, Short term electrical load forecasting using mutual information based feature selection with generalized minimum-redundancy and maximum-relevance criteria, Entropy 18 (9) (2016) 330.
[48] A.E.-A. Shereen, A.R. Rabie, I.G. Neveen, Classification of EEG signals for motor imagery based on mutual information and adaptive neuro fuzzy inference system, Int. J. Syst. Dyn. Appl. (IJSDA) 5 (4) (2016) 64–82.
[49] D. He, I. Rish, D. Haws, L. Parida, MINT: mutual information based transductive feature selection for genetic trait prediction, IEEE/ACM Trans. Comput. Biol. Bioinf. 13 (3) (2016) 578–583.
[50] H. Gunduz, Z. Cataltepe, Borsa Istanbul (BIST) daily prediction using financial news and balanced feature selection, Expert Syst. Appl. 42 (22) (2015) 9001–9011.
[51] H. Wang, X. Yan, Optimizing the echo state network with a binary particle swarm optimization algorithm, Knowledge-Based Syst. 86 (2015) 182–193.
[52] Y. Zhang, S. Wang, P. Phillips, G. Ji, Binary PSO with mutation operator for feature selection using decision tree applied to spam detection, Knowledge-Based Syst. 64 (2014) 22–31.
[53] L. Cervante, B. Xue, M. Zhang, L. Shang, Binary particle swarm optimisation for feature selection: a filter based approach, in: 2012 IEEE Congress on Evolutionary Computation, 2012.
[54] J. Kennedy, Bare bones particle swarms, in: Proceedings of the 2003 IEEE Swarm Intelligence Symposium (SIS'03), 2003.
[55] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.
[56] N.A. Moubayed, A. Petrovski, J. McCall, D2MOPSO: MOPSO based on decomposition and dominance with archiving using crowding distance in objective and solution spaces, Evol. Comput. 22 (1) (2014) 47–77.
[57] Z.H. Zhan, J. Li, J. Cao, J. Zhang, H.S.H. Chung, Y.H. Shi, Multiple populations for multiple objectives: a coevolutionary technique for solving multiobjective optimization problems, IEEE Trans. Cybern. 43 (2) (2013) 445–463.
[58] Q. Lin, J. Li, Z. Du, J. Chen, Z. Ming, A novel multi-objective particle swarm optimization with multiple search strategies, Eur. J. Oper. Res. 247 (3) (2015) 732–744.
[59] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, USA, ACM, 1992, pp. 144–152.
[60] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[61] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[62] E.W. Lee, Selecting the best prediction model for readmission, J. Prev. Med. Public Health 45 (4) (2012) 259–266.
[63] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[64] C. Pardy, Mutual Information as an Exploratory Measure for Genomic Data with Discrete and Continuous Variables, Faculty of Medicine, The University of New South Wales, 2013.
[65] F. Zhao, J. Tang, J. Wang, Jonrinaldi, An improved particle swarm optimization with decline disturbance index (DDPSO) for multi-objective job-shop scheduling problem, Comput. Oper. Res. 45 (Suppl. C) (2014) 38–50.