Computational Intelligence - 2019 - Tripathi - A Novel Hybrid Credit Scoring Model Based On Ensemble Feature Selection and
DOI: 10.1111/coin.12200
ORIGINAL ARTICLE
KEYWORDS
classification, credit scoring, ensemble learning, feature ranking
1 INTRODUCTION
Credit scoring is a procedure to calculate the risk associated with credit products using an applicant's
credentials (such as annual income, job status, and residential status); statistical or machine
learning techniques are applied to the applicants' historical data.1,2 Credit scoring models are
formulated as a binary classification problem that distinguishes whether an applicant is a defaulter
(with suspicious/bad credit) or a nondefaulter (with legitimate/good credit).2 These models try to
isolate the impact of different characteristics of the applicant on delinquent behavior and defaults.
The main focus of a credit scoring model is to determine whether or not a borrower will behave in
an undesirable way in the future.3 The most important and challenging issue for financial insti-
tutions and credit industries is to assess the performance of applicants prior to an actual business
failure.4,5 Even a slight improvement in recognizing applicants with suspicious credit will result
in a huge gain for the credit industries.6
Credit scoring is not a single-step process; it is carried out periodically by financial institutions
and credit industries in various steps such as application scoring, behavioral scoring, and collec-
tion scoring.7 Various benefits of credit scoring for financial institutions include calculating and
reducing credit risk, making managerial decisions, and improving cash flow. The performance
of a credit scoring model directly affects the profitability of financial institutions. Usually,
credit scoring datasets are high dimensional and heterogeneous in nature; as a result,
credit scoring models suffer from high computational complexity and poor performance.8 Feature
selection is a way to reduce the computational complexity and to improve the performance
of credit scoring models.8,9
Various articles10-13 are based on feature selection approaches that use bio-inspired algorithms
to improve the performance of the classifiers. Chi and Hsu10 proposed a dual credit scoring
model. In this model, the authors used genetic algorithm (GA) to select the important features
to combine the bank's internal behavioral rating with the external credit bureau rating for credit
scoring. Huang and Dun11 proposed a hybrid model for credit scoring that tried to consolidate
feature selection and classification. In this work, the authors applied feature selection based on
binary particle swarm optimization (BPSO). Furthermore, the dataset with selected features was
applied to support vector machine (SVM) for classification. Oreski and Oreski12 proposed a hybrid
model for credit scoring by combining GA-based feature selection with neural networks (NNs) to
increase the classification accuracy. Wang et al14 proposed a feature selection approach based on
rough set (RS) and tabu search (TS). In this work, the authors used conditional entropy as the
criterion function to search for an optimal set of features. Liang et al15 used three filter-based feature selec-
tion methods, namely, linear discriminant analysis (LDA), T-test, and logistic regression, and two
wrapper-based feature selection methods, namely, GA and particle swarm optimization (PSO),
on six different prediction models for credit scoring datasets.
In the literature, many studies have revealed that individual classifiers show only moder-
ately good performance when compared to ensemble classifiers. Other works16-19 presented a
comparative study on various ensemble methods, such as bagging, boosting, random subspace,
decorate, and rotation forest, for credit scoring. The conventional credit scoring models are based
on individual classifiers or on a combination of these classifiers and tend to show moderate
performance. Kim and Upneja20 proposed AdaBoosted decision tree (DT) for predicting the com-
plex dynamics of restaurant financial distress. Many classifiers, such as DT, SVM, Naïve Bayes
(NB), and NN-based classifiers, have been proposed for such learning problems. However, no single
classifier performs well on all datasets; usually, a classifier performs well only on specific datasets.
As a consequence, using an ensemble classifier is a strong approach to get close to the optimal
classifier for any dataset.21 This approach strengthens the classifiers in error-prone subspaces and,
consequently, leads to higher classification performance. Generally, a combination
of diverse classifiers performs better than the individual classifiers.22-24 Basically, there are two
types of ensemble frameworks, ie, homogeneous and heterogeneous ensemble frameworks.24 The
most popular ways to combine the base classifiers are majority voting and weighted voting.22,23,25
A multilayer ensemble framework, formed by combining heterogeneous classifiers in a layered
approach, overcomes conventional performance bottlenecks.26 However, when a multilayer ensemble classifier is used, the placement of each
classifier may affect the overall performance.
Although bio-inspired approaches to problem solving seem almost ideal because they
have properties like self-optimization, flexibility, and a simple set of ground rules, they have a few
disadvantages, such as sensitivity to the initial thrust or starting condition of the algorithm, computational
overhead, and the need to check environment variables.27,28 In order to overcome the limitations of
feature selection approaches based on bio-inspired algorithms, filter-based feature ranking techniques can
be utilized to rank features independently, without involving any learning algorithm.
In order to address the limitations of feature selection based on bio-inspired algorithms and a
multilayer ensemble classifier, in this paper, a hybrid model is proposed, which combines feature
selection based on feature ranking with a multilayer ensemble method to improve the classifi-
cation performance for credit scoring. For the feature selection approach, an ensemble feature
ranking approach is proposed and, in order to place the classifiers in the multilayer ensemble
framework, a classifier ranking approach is proposed. The rank of each classifier is estimated
based on Choquet integral value (CIV) of the respective classifiers. Some preprocessing steps, such
as data-cleaning, data-transformation, and data-discretization, are carried out before the feature
selection process.
The rest of this paper is organized as follows. Section 2 covers the proposed framework for
credit scoring and covers preprocessing, feature selection, and multilayer ensemble classification.
Section 3 presents the experimental results and analysis on four credit scoring datasets. Finally,
Section 4 concludes the contributions of this paper based on experimental observations.
2 PROPOSED FRAMEWORK
This section discusses the proposed hybrid credit scoring model, which combines ensemble fea-
ture selection and a multilayer ensemble classifier. Feature selection eliminates the irrelevant
and noisy features and this helps to reduce the complexity. In order to improve the performance
of the model, a multilayer ensemble framework is used. These are discussed in the following
sections and the proposed framework is shown in Figure 1. The model has three phases, namely,
preprocessing, feature selection, and ensemble framework.
2.1 Phase-1
This phase consists of three steps: preprocessing, classifier ranking, and weight assignment.
2.1.1 Preprocessing
Data preprocessing is an important step in the modeling process. The aim of this step is to increase
the effectiveness of the classification process by using a representative and consistent dataset.
FIGURE 1 Proposed hybrid framework for credit scoring [Color figure can be viewed at
wileyonlinelibrary.com]
Preprocessing includes data cleaning, data transformation, and data discretization, which are
explained as follows.
• Data cleaning: In this step, the whole dataset is considered and checked for missing values, and
samples with missing values are eliminated.
• Data transformation: Credit scoring datasets have both numerical and categorical attributes.
Some classification algorithms, such as SVM and NNs, do not work well with heterogeneous
attributes, so data transformation is required. To convert categorical values into numerical
values, a unique integer number is assigned to each unique categorical value in each feature set.
• Data discretization: In the previous step, categorical data values are replaced by numerical val-
ues. The categorical attributes have a small range of values, but the numerical attributes have
a wide range of values. To make these ranges balanced, discretization is used. Data discretiza-
tion is the process of converting a dataset with continuous valued attributes into a dataset with
discrete valued attributes.
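A minimal sketch of the three preprocessing steps in Python (the function names and the equal-width binning scheme are illustrative; the paper does not specify the exact discretization method):

```python
def clean(rows):
    """Data cleaning: drop samples that contain missing values (None)."""
    return [r for r in rows if None not in r]

def transform(column):
    """Data transformation: assign a unique integer to each unique
    categorical value in a feature, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in column]

def discretize(column, n_bins=5):
    """Data discretization: equal-width binning of a numeric feature."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in column]
```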
Next, the rank of each classifier is computed. The traditional ranking approach is based on a
weighted sum of the classifiers' accuracies, but accuracy is not a sufficient measure if the dataset
is imbalanced toward one class label. Despite its simplicity, this approach has a further shortcoming:
it implicitly assumes that the criteria are independent, whereas, in practice, the criteria often
interact, which makes it difficult to rank a classifier by considering multiple criteria. The Choquet integral is a
measure that considers interaction of criteria and it is used for decision making. To find the ranks
of the classifiers, the Choquet integral values based on multiple criteria are utilized.39-41 Basic
preliminaries of Choquet integral are as follows.
Let us denote by C = {c1, … , cn} the set of criteria and by P(C) the power set of C.
Definition 1. A fuzzy measure on the set C of criteria is a set function 𝜇 : P(C) → [0, 1]
satisfying the following axioms42:
(1) 𝜇(∅) = 0, 𝜇(C) = 1;
(2) A ⊆ B ⊆ C implies that 𝜇(A) ⩽ 𝜇(B) ⩽ 𝜇(C);
where 𝜇(A), 𝜇(B), and 𝜇(C) denote the weights for the importance of the sets of criteria A, B, and
C, respectively. Subsequently, in addition to the standard weights on the criteria taken independently,
the weights on any combination of criteria are also defined.
Let C be a finite set, ie, C = {c1, c2, … , cn}. Rearrange f(c1), f(c2), … , f(cn) into increasing
order, ie, f(c*(1)) ⩽ f(c*(2)) ⩽ … ⩽ f(c*(n)), where (c*(1), c*(2), … , c*(n)) is a permutation of (c1, c2, … , cn). The
Choquet integral based on fuzzy measures 𝜇 can be computed as follows.
Definition 2. Let 𝜇 be a fuzzy measure on C. The Choquet integral of a function f : C →
[0, 1] with respect to 𝜇 is defined as in Equation (1), ie,

$$I_{\mu}\left(f\left(c^{*}_{1}\right), \ldots, f\left(c^{*}_{n}\right)\right) = \sum_{i=1}^{n}\left[f\left(c^{*}_{(i)}\right) - f\left(c^{*}_{(i-1)}\right)\right]\mu\left(A_{(i)}\right), \qquad (1)$$

where $f(c^{*}_{(0)}) = 0$ and $A_{(i)} = \{c^{*}_{(i)}, \ldots, c^{*}_{(n)}\}$.
(4) Calculate the rank of each classifier on the basis of its respective CIV (the best rank
corresponds to the maximum CIV).
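The Choquet integral of Definition 2 can be sketched as follows. The fuzzy measure is supplied as a dictionary over subsets of criteria; the two-criterion example at the end (accuracy and G-measure with a super-additive measure) is hypothetical:

```python
def choquet_integral(scores, mu):
    """Choquet integral of criterion scores f(c_i) in [0, 1] with respect
    to a fuzzy measure mu, given as a dict mapping frozensets of criteria
    to weights in [0, 1] (mu[frozenset()] = 0, mu[all criteria] = 1)."""
    # Sort criteria by increasing score; f(c*_(0)) is taken as 0.
    order = sorted(scores, key=scores.get)
    total, prev = 0.0, 0.0
    for i, c in enumerate(order):
        a_i = frozenset(order[i:])  # criteria at or above this level
        total += (scores[c] - prev) * mu[a_i]
        prev = scores[c]
    return total

# Hypothetical example: two interacting criteria, accuracy and G-measure.
mu = {
    frozenset(): 0.0,
    frozenset({"acc"}): 0.4,
    frozenset({"gm"}): 0.5,
    frozenset({"acc", "gm"}): 1.0,  # super-additive: criteria reinforce
}
civ = choquet_integral({"acc": 0.8, "gm": 0.6}, mu)
```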
$$W_{ij} = \frac{1 - Er_{ij}}{\sum_{j=1}^{P}\left(1 - Er_{ij}\right)}, \qquad (2)$$
where W_{ij} represents the weight of the jth classifier in the ith iteration, Er_{ij} represents the error rate of the jth
classifier in the ith iteration, and P denotes the number of classifiers.
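Equation (2) can be sketched for a single iteration as follows (the function name is illustrative):

```python
def assign_weights(error_rates):
    """Weight of each classifier per Equation (2), for one iteration:
    W_j = (1 - Er_j) / sum_k (1 - Er_k)."""
    complements = [1.0 - e for e in error_rates]
    total = sum(complements)
    return [c / total for c in complements]
```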
The resulting p-values are used as a measure of feature importance. Thus, a smaller p-value
indicates a greater importance.
Correlation: Pearson product-moment correlation,45,46 as in Equation (4), is used to select fea-
tures that are highly correlated with the class label and have low correlation with other features.
A threshold of pv = 0.7 on the correlation between the predictor variables is used, ie,
$$pv = \frac{\sum_{i=1}^{n}\left(X_{i} - \bar{X}\right)\left(Y_{i} - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{i} - \bar{X}\right)^{2} \sum_{i=1}^{n}\left(Y_{i} - \bar{Y}\right)^{2}}}, \qquad (4)$$
where X and Y represent the features, n is the number of samples in a feature, and X̄ and Ȳ are the means
of features X and Y.
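A sketch of this correlation-based ranking, assuming features are scored by the absolute correlation with the class label and that a candidate feature is discarded when its correlation with an already selected feature reaches the pv = 0.7 threshold (the selection policy is an interpretation of the criterion described above):

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of Equation (4)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den

def correlation_ranks(features, label, threshold=0.7):
    """Score each feature by |correlation with the class label|; keep a
    feature only if its correlation with every already selected feature
    stays below the threshold."""
    ranks = {name: abs(pearson(col, label)) for name, col in features.items()}
    selected = []
    for name in sorted(ranks, key=ranks.get, reverse=True):
        if all(abs(pearson(features[name], features[s])) < threshold
               for s in selected):
            selected.append(name)
    return ranks, selected
```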
Logistic regression (LogReg): This is another feature selection method. In this approach, the coefficients
(𝛽-values) of the logistic regression are utilized as a weighting system, but the range of features
may differ. The 𝛽-coefficients of features are not comparable, so, to make them comparable, a
Z-transformation is performed, as shown in Equation (5),47 ie,
$$Z_{X} = \frac{X - \bar{X}}{S_{X}}, \qquad (5)$$
where X̄ and SX represent the mean and standard deviation of the feature X, respectively.
Through standardization by Z-transformation, each feature has a mean of zero
and a standard deviation of 1, thus ensuring that the 𝛽-coefficients of all features are on the same scale.
Subsequently, the values are ordered according to their absolute values in decreasing order.
Finally, to calculate the final ranking of all the features, the values obtained by median, cor-
relation, and LogReg are aggregated (summation) for each feature. These summed values are
then normalized in the range [0, 1] and are considered as the final measure to calculate the final
ranking of the respective features. The complete process is as depicted in Figure 2.
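The aggregation step can be sketched as follows, assuming the three per-feature scores have already been oriented so that larger values mean greater importance (for the median measure, this would require inverting the p-values first):

```python
def ensemble_feature_ranking(median_scores, corr_scores, logreg_scores):
    """Aggregate the three per-feature measures by summation, normalize
    the sums to [0, 1], and rank features by the normalized value."""
    summed = {f: median_scores[f] + corr_scores[f] + logreg_scores[f]
              for f in median_scores}
    lo, hi = min(summed.values()), max(summed.values())
    span = (hi - lo) or 1.0
    normalized = {f: (v - lo) / span for f, v in summed.items()}
    # Higher normalized value -> better (lower-numbered) rank.
    ranking = sorted(normalized, key=normalized.get, reverse=True)
    return ranking, normalized
```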
Two approaches are commonly used to aggregate the outputs predicted by the classifiers: majority voting
and weighted voting. In the majority voting approach, the same weight is assigned to each classifier and the
output is the class that receives the highest number of votes.43 In the weighted voting approach, the highest
weight is assigned to the classifier with the highest accuracy, and vice versa. The final output is the weighted
sum of the outputs predicted by the base classifiers.43
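Both aggregation rules can be sketched as follows for class-label votes:

```python
from collections import Counter

def majority_vote(predictions):
    """Each classifier gets the same weight; the class with the most
    votes wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Each classifier's vote counts with its weight; the class with the
    largest total weight wins."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```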
A multilayer ensemble classifier framework permits adaptation from multiple points, unlike
single-layer classifiers.24 The primary reason for preferring this perspective is that diverse mod-
els can be used to separate the granularity of issues. If different classifiers at different layers are
utilized, different features for each layer and the classification tasks can be more refined. Further-
more, a multilayer ensemble classification scheme can be used to enhance the prediction.24 The
computational complexity of the multilayer ensemble framework is reduced by dividing it into a
two-layer approach. The main motive behind using a multilayer ensemble classifier framework is
that, when the classifier makes a decision, it is not dependent on just a single classifier's decision,
but, rather, requires all classifiers to participate in the decision-making process by aggregating
their individual predictions. Hence, this method outperforms the base classifiers.
In order to improve the predictive performance of the proposed framework, five heteroge-
neous classifiers are aggregated into a multilayer ensemble classifier framework, as shown in
phase-III of Figure 1. The classifiers C-1, C-2, C-3, C-4, and C-5 are chosen as the best classi-
fiers out of the eight heterogeneous classifiers in phase-1. Data with the selected features are fed
with the weights assigned to the respective classifiers for evaluation of the final results against
the input samples. Furthermore, the five classifiers with the best ranking are arranged, as shown
in phase-III of Figure 1. In this framework, C-1 and C-2 have the highest ranks and are placed in
the second layer. The other three classifiers are placed in the first layer. The first combiner
aggregates the results obtained by the three layer-1 classifiers, and the second combiner
aggregates the results obtained by the two layer-2 classifiers and the output of the first combiner.
In this framework, the same training dataset is used to train the different base classifiers, and then these
classifiers' outputs are aggregated to make the final predicted output of the framework against
each sample.
Each combiner aggregates the output predicted by the associated classifiers using
Equation (6),49 ie,
$$O = \sum_{i=1}^{P} W_{i} X_{i}, \qquad (6)$$
where Wi and Xi are the weight and predicted output of the ith classifier, respectively, and
P denotes the number of base classifiers.
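Equation (6) and the two-layer arrangement can be sketched as follows, assuming classifier outputs in [0, 1] and a 0.5 decision threshold (the threshold is an assumption; the paper does not state it):

```python
def combine(outputs, weights):
    """Combiner of Equation (6): weighted sum of predicted outputs."""
    return sum(w * x for w, x in zip(weights, outputs))

def two_layer_predict(layer1_outputs, layer1_weights,
                      layer2_outputs, layer2_weights, combiner_weight):
    """Sketch of the two-layer framework: the first combiner aggregates
    the three lower-ranked classifiers (C-3, C-4, C-5); the second
    combiner aggregates the two top-ranked classifiers (C-1, C-2) and
    the first combiner's output, then thresholds for a binary decision."""
    o1 = combine(layer1_outputs, layer1_weights)
    o2 = combine(layer2_outputs + [o1], layer2_weights + [combiner_weight])
    return 1 if o2 >= 0.5 else 0
```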
3 EXPERIMENTAL RESULTS
This section describes the datasets and performance measures used in this work, the ensemble
feature selection results obtained by the various feature selection approaches, and the results
obtained by the various ensemble frameworks.
Four benchmark credit scoring datasets are used. The Australian and Japanese datasets are related to credit approval. The German-categorical and
German-numerical datasets are related to loan application. To protect the confidentiality of the
data, the values of some attributes are replaced by random meaningless symbols. Detailed descrip-
tion of the datasets is given in Table 1. All the aforementioned datasets as described in Table 1 are
binary class datasets, and the class label represents whether the sample was accepted (class-1) or
rejected (class-2). In the Australian dataset, the categorical attributes are 1, 4, 5, 6, 8, 9, 11, and
12, with 2, 3, 14, 9, 2, 2, 2, and 3 values, respectively, whereas the others are numerical attributes.
In the Japanese dataset, the categorical attributes are 1, 4, 5, 6, 7, 9, 10, 12, and 13 and have 2, 4,
3, 14, 9, 2, 2, 2, and 3 values, respectively, whereas the others are numerical attributes. In case of
German-categorical dataset, attributes 1, 3, 4, 6, 7, 9, 10, 12, 14, 15, 17, 19, and 20 are categorical
with 4, 5, 11, 5, 5, 5, 3, 4, 3, 3, 4, 2, and 2 values, respectively, whereas the rest of the attributes
are numerical attributes. Similarly in the case of German-numerical dataset, all attributes are
numerical with varying ranges of integer values.
As discussed earlier, all the datasets have binary class and the proposed model focuses on the
classification problem. To analyze the performance of the proposed model, confusion matrix, as
shown in Table 2, is used. This is the most popular way to evaluate classification problems. From
this matrix, various classification measures that are commonly used in the literature, such as
accuracy, sensitivity, specificity, and G-measure, are derived. Accuracy
(Equation (7)) shows the predictive performance of the classifier, which is not sufficient as a per-
formance measure, if there is a significant class imbalance toward one class in the dataset. The
datasets used in this experimental work are binary class datasets with accepted (credit approved)
and rejected (credit not approved) classes. Sensitivity (Equation (8)) represents the accuracy of
only the prediction of accepted samples and specificity (Equation (9)) measures the prediction
accuracy for rejected samples. G-measure (Equation (10)) is a measure that considers both the
accepted and the rejected accuracies to compute the score. It can be interpreted as geometric mean
of sensitivity and specificity. All performance measures mentioned here would give 1 or 0 for the
best or the worst cases, respectively, ie,
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (8)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (9)$$
$$\text{G-measure} = \sqrt{\text{Sensitivity} \times \text{Specificity}}. \qquad (10)$$
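The four measures of Equations (7) to (10) can be sketched as follows, encoding accepted samples as 1 and rejected samples as 0:

```python
import math

def confusion(y_true, y_pred):
    """Counts for the binary confusion matrix (1 = accepted, 0 = rejected)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and G-measure (Eqs 7-10)."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    g_measure = math.sqrt(sensitivity * specificity)
    return accuracy, sensitivity, specificity, g_measure
```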
TABLE 1 Description of the datasets used
Dataset Samples Features Class-1/Class-2 Categorical/Numerical
Australian 690 14 383/317 6/8
Japanese 690 15 383/317 7/8
German-categorical 1000 20 700/300 13/7
German-numerical 1000 24 700/300 0/24
The preprocessed dataset is used to find the best features. For feature selection, an ensemble
feature selection approach is proposed in which the median, Pearson-correlation (P-cor), and
LogReg value for each feature are utilized to rank the features. The ranking of each feature is
depicted in Figures 3 to 6 by each parameter median, P-cor, LogReg, and ensemble approach with
each dataset.
The graph for the Australian dataset is shown in Figure 3. It is clear from the results of Figure 3
that, when the median measure is used, many features (eg, V5, V8, V10, …, V9) have the same
p-value, which shows that these features have the same rank. The P-cor measure assigns the best
rank for V8 and the least rank for V1, which shows that the features V8 and V1 are highly and
poorly correlated with the target or class label, respectively. V5 and V1 are ranked as the best and
FIGURE 3 Feature ranking on Australian dataset [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 4 Feature ranking on Japanese dataset [Color figure can be viewed at wileyonlinelibrary.com]
the least features by the LogReg measure, respectively. Finally, the proposed ensemble feature
ranking algorithm aggregates the values calculated by all the aforementioned measures to give a
final predictive measure to rank the features. Using the proposed feature ranking approach, V5
and V1 are assigned the best and least feature ranks, respectively.
The graph for the Japanese dataset is shown in Figure 4. It is clear from the results shown in
Figure 4 that, for the median measure, many features, such as V8, V9, and V5, have the same p-value,
which shows that these features have the same rank. The P-cor assigns the best rank to V9 and
the least rank to V1, which shows that the features V9 and V1 are highly and poorly correlated
with the target or class label, respectively. V8 and V12 are ranked as the best and the least feature
by LogReg measure, respectively. Finally, using the proposed feature ranking approach, V8 and
V1 are assigned as the best and the least feature ranks, respectively.
Next, the graph for the German-categorical dataset is shown in Figure 5. This shows that, using
median measure, the same ranking is given to more than one feature (eg, V1, V2, V12, … ,V9) and
that V10 has the lowest rank. The P-cor and LogReg assign the highest rank to V1 and V2, respec-
tively, and both give the lowest rank to V18. The proposed ensemble feature ranking algorithm
assigns the best feature rank to V1 and the least to V18.
Finally, the graph for the German-numerical dataset is shown in Figure 6. It is clear from the
results shown in Figure 6 that, by the median measure, many features, such as V1, V2, V3, V5,
and V9, have the same p-value, which shows that these features have the same rank, and V18 has
the lowest rank. The P-cor and LogReg measures assign V1 and V8 as the best and the least ranks.
Finally, by the proposed feature ranking approach, V1 and V8 are assigned as the best and the
least feature ranks, respectively.
For comparative analysis, the dataset is partitioned as per the 10-fold cross validation, and
eight heterogeneous classifiers, namely, QDA, NB, MLFN, DTNN, TDNN, DT, SVM, and KNN,
are chosen. The experiment is conducted with every classifier using the K-best features (the best
50% of the original features of the particular dataset) ranked by median, P-cor, LogReg, and the
ensemble of all three approaches (EFS), trained on the training dataset and tested on the testing dataset.
The mean of the 10-fold cross validation results in terms of accuracy for the Australian, Japanese,
German-categorical, and German-numerical datasets are shown in Figures 7 to 10, respectively.
From Figures 7 to 10, it can be observed that the results obtained using the ensemble feature
ranking approach have the best predictive accuracy for most of the classifiers. This approach also
gives significant improvements in predictive accuracy, compared to using all the features, on all the
aforementioned credit scoring datasets.
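The experimental setup can be sketched as follows; the contiguous fold construction is illustrative (the paper does not specify how samples are assigned to folds):

```python
def k_best(ranked_features, fraction=0.5):
    """Keep the best 50% of features according to the final ranking."""
    k = max(1, round(len(ranked_features) * fraction))
    return ranked_features[:k]

def ten_fold_indices(n, k=10):
    """Contiguous index folds for k-fold cross validation: each fold in
    turn serves as the test set while the rest form the training set."""
    folds, start = [], 0
    base, extra = divmod(n, k)
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```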
FIGURE 7 Comparative graph on Australian dataset. DT, decision tree; DTNN, distributed time delay neural
network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve Bayes; QDA,
quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network [Color figure
can be viewed at wileyonlinelibrary.com]
FIGURE 8 Comparative graph on Japanese dataset. DT, decision tree; DTNN, distributed time delay neural
network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve Bayes; QDA,
quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network [Color figure
can be viewed at wileyonlinelibrary.com]
FIGURE 9 Comparative graph on German-categorical dataset. DT, decision tree; DTNN, distributed time
delay neural network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve
Bayes; QDA, quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network
[Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 10 Comparative graph on German-numerical dataset. DT, decision tree; DTNN, distributed time
delay neural network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve
Bayes; QDA, quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network
[Color figure can be viewed at wileyonlinelibrary.com]
The layered ensemble approach, as in the proposed model, has two layers. In the first layer, the out-
puts of three classifiers (C-3, C-4, and C-5), with ranks of 3, 4, and 5, are aggregated and forwarded
to the next layer. In the second layer, the two classifiers with the best rankings (C-1 and C-2) and
the predicted output of the previous layer are aggregated by the combiner, which gives the final
predicted output against a specified sample. The ranks of classifiers are shown in Table 8 over the
four datasets.
The experimental results for the Australian, Japanese, German-categorical, and German-
numerical datasets are depicted in terms of accuracy, sensitivity, specificity, and G-measure with
the respective aggregation method in Tables 9 to 12, respectively. Moreover, RF represents the
results obtained by the state-of-the-art method random forest approach53 with 100 trees, with
features selected by the proposed approach. Furthermore, MV and LMV represent the results
obtained by the majority voting approach using nonlayered and layered approaches, respectively.
In addition, WV and LWV represent the results obtained using weighted voting aggregation
approach with nonlayered and layered approaches, respectively. LWV(R) represents the results
obtained by the layered weighted voting approach when the classifiers are arranged in reverse
order (C-4 and C-5 in the last layer and C-1, C-2, and C-3 in layer-1).
From Tables 9 to 12, it can be observed that, with all the features, the multilayer ensemble
approach with weighted voting, when the appropriate classifiers are placed in the layers, shows a great
improvement over the same approach with random placement of the classifiers, and it achieves
the best accuracy, sensitivity, and G-measure. Furthermore, it can also be seen that, for all four
credit scoring datasets, the layered weighted voting approach achieves the best accuracy, sensitiv-
ity, and G-measure and competitive specificity when compared to other ensemble methods RF,
MV, WV, LMV, and LWV(R). Similarly, with ensemble feature selection, it can be observed that
LWV gives better classification performances than other ensemble frameworks. As the results are
tabulated in Tables 9 to 12, it can be seen that the proposed ensemble feature selection approach
improves the classification performances with RF, MV, WV, LMV, and LWV(R). Overall, the pro-
posed ensemble feature selection approach with a multilayer ensemble framework outperforms
the state-of-the-art method RF and ensemble frameworks such as MV, LMV, WV, and LWV(R)
in terms of classification accuracy and G-measure. A right-tailed T-test is applied to
compare the average accuracy and G-measure of 10-fold cross validation (10-FCV) with 100 iterations for each dataset.
The p-value against classification accuracy and G-measure for LWV against the other ensemble
frameworks (RF, MV, WV, LMV, and LWV(R)) is depicted in the columns of Table 14
at a significance level of 𝛼 = 0.05. Each column presents the p-values for accuracy
and G-measure and indicates whether or not the null hypothesis is rejected in favor of the alternative
hypothesis based on the p-value and the chosen 𝛼. In all the test cases, the null hypothesis is
rejected, so it is concluded that LWV performs significantly better. The p-values shown in Table 14
TABLE 14 Statistical significance of proposed approach vs. other ensemble approaches in terms of p-values

With all features:
Dataset  Measure    RF vs LWV    MV vs LWV    WV vs LWV    LMV vs LWV   LWV(R) vs LWV
AUS      Accuracy   0.00982797   0.00071830   0.00085368   0.00956326   0.00017383
AUS      G-measure  0.00046783   0.04287237   0.03963242   0.02603634   0.00925835
JPD      Accuracy   0.00988396   0.00081386   0.00093653   0.00893698   0.00023698
JPD      G-measure  0.00069832   0.00982563   0.00369856   0.01369875   0.00036987
GCD      Accuracy   0.00056893   0.00025896   0.00288361   0.00291593   0.00011258
GCD      G-measure  0.00036825   0.00369236   0.00258451   0.00853632   0.00222139
GND      Accuracy   0.00000789   0.00314214   0.00369819   0.00219959   0.00005880
GND      G-measure  0.00030889   0.01180889   0.00245218   0.00320036   0.00218766

With feature selection:
Dataset  Measure    RF vs LWV    MV vs LWV    WV vs LWV    LMV vs LWV   LWV(R) vs LWV
AUS      Accuracy   0.00006361   0.00183982   0.00111509   0.02858956   0.00437582
AUS      G-measure  0.00000045   0.00002641   0.00014612   0.00867170   0.00013236
JPD      Accuracy   0.00249519   0.03828963   0.02301973   0.03293815   0.00504604
JPD      G-measure  0.00027914   0.01100812   0.01848148   0.01192802   0.00138615
GCD      Accuracy   0.00009954   0.00011153   0.01047690   0.00663793   0.00050401
GCD      G-measure  0.00210698   0.00810601   0.01526207   0.01949494   0.00019816
GND      Accuracy   0.00004118   0.00736635   0.00217363   0.01837393   0.00309696
GND      G-measure  0.00000514   0.04511904   0.00123686   0.03727870   0.00462196

Abbreviations: AUS, Australian; GCD, German-categorical; GND, German-numerical; JPD, Japanese.
show the statistical significance of LWV according to the increase in classification accuracy and
G-measure.
datasets with features selected by the proposed approach and with all features. Overall, across
all four credit scoring datasets, the proposed approach achieves the highest rank.
4 CONCLUSION
In this paper, a hybrid credit scoring model is proposed, which combines an ensemble feature
selection approach with a multilayer ensemble framework. The feature selection technique is
based on the ensemble ranking of the existing features of the respective dataset, which is estimated
using the median, Pearson correlation, and logistic regression measures. The multilayer ensemble classifier
is modeled by aggregating heterogeneous classifiers in a layered manner. Moreover, a novel clas-
sifier ranking algorithm is proposed using CIV for placement of the classifiers in the multilayer
ensemble framework.
The proposed framework is tested on Australian, Japanese, German-categorical, and
German-numerical datasets. The experimental results indicated that the features selected using
the proposed approach are more representative and improved the performance of QDA, NB,
MLFN, DTNN, TDNN, DT, and SVM classifiers in terms of classification accuracy. Overall, for all
the aforementioned credit scoring datasets, the proposed ensemble model outperformed the tradi-
tional ensemble models such as RF, MV, LMV, WV, and LWV(R) in terms of accuracy, sensitivity,
and G-measure. Hence, it can be concluded that the proposed ensemble framework based on
an ensemble feature selection with appropriate placement of classifiers in a multilayer ensemble
classifier is an efficient approach for credit scoring.
REFERENCES
1. Mester LJ. What's the point of credit scoring? Bus Rev. 1997;3:3-16.
2. García V, Marqués AI, Sánchez JS. An insight into the experimental design for credit risk and corporate
bankruptcy prediction systems. J Intell Inf Syst. 2015;44(1):159-189.
3. Lessmann S, Baesens B, Seow H-V, Thomas LC. Benchmarking state-of-the-art classification algorithms for
credit scoring: an update of research. Eur J Oper Res. 2015;247(1):124-136.
4. Chen N, Ribeiro B, Chen A. Comparative study of classifier ensembles for cost-sensitive credit risk assessment.
Intell Data Anal. 2015;19(1):127-144.
5. Chen N, Ribeiro B, Chen A. Financial credit risk assessment: a recent review. Artif Intell Rev. 2016;45(1):1-23.
6. Wang G, Ma J, Huang L, Xu K. Two credit scoring models based on dual strategy ensemble trees. Knowl Based
Syst. 2012;26:61-68.
7. Paleologo G, Elisseeff A, Antonini G. Subagging for credit scoring models. Eur J Oper Res. 2010;201(2):490-499.
8. Wang J, Hedar A-R, Wang S, Ma J. Rough set and scatter search metaheuristic based feature selection for
credit scoring. Expert Syst Appl. 2012;39(6):6123-6128.
9. Maldonado S, Weber R, Basak J. Simultaneous feature selection and classification using kernel-penalized
support vector machines. Inf Sci. 2011;181(1):115-128.
10. Chi B-W, Hsu C-C. A hybrid approach to integrate genetic algorithm into dual scoring model in enhancing
the performance of credit scoring model. Expert Syst Appl. 2012;39(3):2650-2661.
11. Huang C-L, Dun J-F. A distributed PSO–SVM hybrid system with feature selection and parameter optimiza-
tion. Appl Soft Comput. 2008;8(4):1381-1391.
12. Oreski S, Oreski G. Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert
Syst Appl. 2014;41(4):2052-2064.
13. Edla DR, Tripathi D, Cheruku R, Kuppili V. An efficient multi-layer ensemble framework with
BPSOGSA-based feature selection for credit scoring data analysis. Arab J Sci Eng. 2018;43(12):6909-6928.
14. Wang J, Guo K, Wang S. Rough set and tabu search based feature selection for credit scoring. Procedia Comput
Sci. 2010;1(1):2425-2432.
15. Liang D, Tsai C-F, Wu H-T. The effect of feature selection on financial distress prediction. Knowl Based Syst.
2015;73:289-297.
16. Sun J, Li H. Financial distress prediction using support vector machines: ensemble vs. individual. Appl Soft
Comput. 2012;12(8):2254-2265.
17. Marqués AI, García V, Sánchez JS. Two-level classifier ensembles for credit risk assessment. Expert Syst Appl.
2012;39(12):10916-10922.
18. Tripathi D, Edla DR, Cheruku R. Hybrid credit scoring model using neighborhood rough set and multi-layer
ensemble classification. J Intell Fuzzy Syst. 2018;34(3):1543-1549.
19. Abellán J, Castellano JG. A comparative study on base classifiers in ensemble methods for credit scoring.
Expert Syst Appl. 2017;73:1-10.
20. Kim SY, Upneja A. Predicting restaurant financial distress using decision tree and AdaBoosted decision tree
models. Econ Model. 2014;36:354-362.
21. Parvin H, MirnabiBaboli M, Alinejad-Rokny H. Proposing a classifier ensemble framework based on classifier
selection and decision tree. Eng Appl Artif Intell. 2015;37:34-42.
22. Ala'raj M, Abbod MF. A new hybrid ensemble credit scoring model based on classifiers consensus system
approach. Expert Syst Appl. 2016;64:36-55.
23. Ala'raj M, Abbod MF. Classifiers consensus system approach for credit scoring. Knowl Based Syst.
2016;104:89-105.
24. Bashir S, Qamar U, Khan FH. IntelliHealth: a medical decision support application using a novel weighted
multi-layer classifier ensemble framework. J Biomed Inform. 2016;59:185-200.
25. Verikas A, Kalsyte Z, Bacauskiene M, Gelzinis A. Hybrid and ensemble-based soft computing techniques in
bankruptcy prediction: a survey. Soft Comput. 2010;14(9):995-1010.
26. Bashir S, Qamar U, Khan FH, Naseem L. HMV: a medical decision support framework using multi-layer
classifiers for disease prediction. J Comput Sci. 2016;13:10-25.
27. Neumann F, Witt C. Bioinspired Computation in Combinatorial Optimization: Algorithms and Their Compu-
tational Complexity. Berlin, Germany: Springer-Verlag Berlin Heidelberg; 2010.
28. Chakravarthy H, Bachan P, Roshini P, Rajan KCh. Bio inspired approach as a problem solving technique. Netw
Complex Syst. 2012;2(2):14-22.
29. Duda RO, Hart PE, Stork DG. Pattern Classification. New York, NY: John Wiley & Sons; 2012.
30. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Hum Genet. 1936;7(2):179-188.
31. Mitchell TM. Machine Learning. Boston, MA: McGraw-Hill; 1997.
32. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural
Netw. 1989;2(5):359-366.
33. Svozil D, Kvasnicka V, Pospichal J. Introduction to multi-layer feed-forward neural networks. Chemom Intell
Lab Syst. 1997;39(1):43-62.
34. Waibel A, Hanazawa T, Hinton G, Shikano K, Lang KJ. Phoneme recognition using time-delay neural
networks. IEEE Trans Acoust Speech Signal Process. 1989;37(3):328-339.
35. MathWorks. MATLAB neural network toolbox. 2017. https://in.mathworks.com/help/nnet/ref/distdelaynet.html
36. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81-106.
37. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273-297.
38. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21-27.
39. Grabisch M, Labreuche C. A decade of application of the Choquet and Sugeno integrals in multi-criteria
decision aid. Ann Oper Res. 2010;175(1):247-286.
40. Wang G, Zhao B, Li Y. Quantitative Logic and Soft Computing. Singapore: World Scientific Publishing Co Pte
Ltd; 2012.
41. Wang Z, Yan J-A. Choquet integral and its applications: a survey. Beijing, China: Academy of Mathematics and Systems Science, CAS; 2006.
42. Grabisch M. The application of fuzzy integrals in multicriteria decision making. Eur J Oper Res.
1996;89(3):445-456.
43. Rokach L. Ensemble-based classifiers. Artif Intell Rev. 2010;33(1-2):1-39.
44. Bauer DF. Constructing confidence sets using rank statistics. J Amer Stat Assoc. 1972;67(339):687-690.
45. Cohen J, Cohen P, West SG, Aiken LS. Applied Multiple Regression/Correlation Analysis for the Behavioral
Sciences. Abingdon, UK: Routledge; 2013.
46. Bluman AG. Elementary Statistics. Melbourne: Brown; 1995.
47. Neumann U, Riemenschneider M, Sowa J-P, Baars T, Kälsch J, Canbay A, Heider D. Compensation of feature
selection biases accompanied with improved predictive performance for binary classification by using a novel
ensemble feature selection approach. BioData Min. 2016;9(1):36.
48. Tsai C-F, Lin Y-C, Yen DC, Chen Y-M. Predicting stock returns by classifier ensembles. Appl Soft Comput.
2011;11(2):2452-2459.
49. Triantaphyllou E. Multi-criteria decision making methods. In: Multi-Criteria Decision Making Methods: A
Comparative Study. Boston, MA: Springer-Verlag; 2000:5-21.
50. Asuncion A, Newman DJ. UCI machine learning repository. 2007. https://archive.ics.uci.edu/ml/index.php
51. Nguyen HS. Discretization of Real Value Attributes, Boolean Reasoning Approach [PhD thesis]. Warsaw, Poland:
Warsaw University; 1997.
52. Ong C-S, Huang J-J, Tzeng G-H. Building credit scoring models using genetic programming. Expert Syst Appl.
2005;29(1):41-47.
53. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32.
54. Hogg RV, Ledolter J. Engineering Statistics. New York, NY: Macmillan Pub Co; 1987.
55. Hollander M, Wolfe DA. Nonparametric Statistical Methods. New York, NY: John Wiley & Sons; 1999.
56. Zhao Z, Xu S, Kang BH, Kabir MMJ, Liu Y, Wasinger R. Investigation and improvement of multi-layer
perceptron neural networks for credit scoring. Expert Syst Appl. 2015;42(7):3508-3516.
57. West D. Neural network credit scoring models. Comput Oper Res. 2000;27(11-12):1131-1152.
58. Bequé A, Lessmann S. Extreme learning machines for credit scoring: an empirical evaluation. Expert Syst
Appl. 2017;86:42-53.
59. Wang G, Hao J, Ma J, Jiang H. A comparative assessment of ensemble learning for credit scoring. Expert Syst
Appl. 2011;38(1):223-230.
60. Zhang D, Zhou X, Leung SCH, Zheng J. Vertical bagging decision trees model for credit scoring. Expert Syst
Appl. 2010;37(12):7838-7843.
61. Nanni L, Lumini A. An experimental comparison of ensemble of classifiers for bankruptcy prediction and
credit scoring. Expert Syst Appl. 2009;36(2):3028-3033.
62. Tsai C-F, Wu J-W. Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Syst
Appl. 2008;34(4):2639-2649.
63. Wongchinsri P, Kuratach W. SR-based binary classification in credit scoring. Paper presented at: 2017
14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and
Information Technology (ECTI-CON); 2017; Phuket, Thailand.
64. Hens AB, Tiwari MK. Computational time reduction for credit scoring: an integrated approach based on
support vector machine and stratified sampling method. Expert Syst Appl. 2012;39(8):6774-6781.
65. Huang C-L, Wang C-J. A GA-based feature selection and parameters optimization for support vector
machines. Expert Syst Appl. 2006;31(2):231-240.
66. Hu Q, Yu D, Liu J, Wu C. Neighborhood rough set based heterogeneous feature subset selection. Inf Sci.
2008;178(18):3577-3594.
67. Liu Y, Wang G, Chen H, Dong H, Zhu X, Wang S. An improved particle swarm optimization for feature
selection. J Bionic Eng. 2011;8(2):191-200.
68. Huang C-L, Chen M-C, Wang C-J. Credit scoring with a data mining approach based on support vector
machines. Expert Syst Appl. 2007;33(4):847-856.
69. Ping Y, Yongheng L. Neighborhood rough set and SVM based hybrid credit scoring classifier. Expert Syst Appl.
2011;38(9):11300-11304.
70. Xia Y, Liu C, Da B, Xie F. A novel heterogeneous ensemble credit scoring model based on bstacking approach.
Expert Syst Appl. 2018;93:182-199.
71. Sun J, Lee Y-C, Li H, Huang Q-H. Combining B&B-based hybrid feature selection and the
imbalance-oriented multiple-classifier ensemble for imbalanced credit risk assessment. Technol Econ Dev
Econ. 2015;21(3):351-378.
How to cite this article: Tripathi D, Edla DR, Cheruku R, Kuppili V. A novel hybrid
credit scoring model based on ensemble feature selection and multilayer ensemble classi-
fication. Computational Intelligence. 2019;35:371–394. https://doi.org/10.1111/coin.12200