Received: 11 October 2017 Revised: 23 January 2019 Accepted: 13 February 2019

DOI: 10.1111/coin.12200

ORIGINAL ARTICLE

A novel hybrid credit scoring model based on ensemble feature selection and multilayer ensemble classification

Diwakar Tripathi1, Damodar Reddy Edla1, Ramalingaswamy Cheruku2, Venkatanareshbabu Kuppili1

1 National Institute of Technology Goa, Ponda, India
2 Mahindra École Centrale, Hyderabad, India

Correspondence: Diwakar Tripathi, National Institute of Technology Goa, Ponda-403 401, India. Email: diwakarnitgoa@gmail.com

Abstract
Credit scoring focuses on the development of empirical models to support the financial decision-making processes of financial institutions and credit industries. It makes use of applicants' historical data and statistical or machine learning techniques to assess the risk associated with an applicant. However, the historical data may consist of redundant and noisy features that affect the performance of credit scoring models. The main focus of this paper is to develop a hybrid model, combining feature selection and a multilayer ensemble classifier framework, to improve the predictive performance of credit scoring. The proposed hybrid credit scoring model is modeled in three phases. The initial phase constitutes preprocessing and assigns ranks and weights to classifiers. In the next phase, the ensemble feature selection approach is applied to the preprocessed dataset. Finally, in the last phase, the dataset with the selected features is used in a multilayer ensemble classifier framework. In addition, a classifier placement algorithm based on the Choquet integral value is designed, as the classifier placement affects the predictive performance of the ensemble framework. The proposed hybrid credit scoring model is validated on real-world credit scoring datasets, namely, the Australian, Japanese, German-categorical, and German-numerical datasets.

KEYWORDS
classification, credit scoring, ensemble learning, feature ranking




1 INTRODUCTION

Credit scoring is a procedure to calculate the risk associated with credit products using applicants'
credentials (such as annual income, job status, and residential status); statistical or machine
learning techniques are applied to the applicants' historical data.1,2 Credit scoring models are con-
sidered as a binary class classification problem to distinguish whether an applicant is a defaulter
(with suspicious/bad credit) or a nondefaulter (with legitimate/good credit).2 These models try to
isolate the impact of different characteristics of the applicant on criminal behavior and defaults.
The main focus of a credit scoring model is to determine whether or not a borrower will behave in
an undesirable way in the future.3 The most important and challenging issue for financial insti-
tutions and credit industries is to assess the performance of applicants prior to an actual business
failure.4,5 Even a slight improvement in recognizing applicants with suspicious credit will result
in a huge gain for the credit industries.6
Credit scoring is not a single-step process; it is carried out periodically by financial institutions
and credit industries in various steps such as application scoring, behavioral scoring, and collec-
tion scoring.7 Various benefits of credit scoring for financial institutions include calculating and
reducing credit risk, making managerial decisions, and improving cash flow. The performance of a credit scoring model directly affects the profitability of financial institutions. Usually,
credit scoring datasets are of high dimension and are heterogeneous in nature. As a result of this,
credit scoring models suffer from high computational complexity and poor performance.8 Fea-
ture selection is a way to reduce the computational complexity and to improve the performance
of the credit scoring models.8,9
Various articles10-13 are based on feature selection approaches that use bio-inspired algorithms
to improve the performance of the classifiers. Chi and Hsu10 proposed a dual credit scoring
model. In this model, the authors used genetic algorithm (GA) to select the important features
to combine the bank's internal behavioral rating with the external credit bureau rating for credit
scoring. Huang and Dun11 proposed a hybrid model for credit scoring that tried to consolidate
feature selection and classification. In this work, the authors applied feature selection based on
binary particle swarm optimization (BPSO). Furthermore, the dataset with selected features was
applied to support vector machine (SVM) for classification. Oreski and Oreski12 proposed a hybrid
model for credit scoring by combining GA-based feature selection with neural networks (NNs) to
increase the classification accuracy. Wang et al14 proposed a feature selection approach based on
rough set (RS) and tabu search (TS). In this work, the authors used conditional entropy as criteria
function to search for an optimal set of features. Liang et al15 used three filter-based feature selec-
tion methods, namely, linear discriminant analysis (LDA), T-test, and logistic regression, and two
wrapper-based feature selection methods, namely, GA and particle swarm optimization (PSO),
on six different prediction models for credit scoring datasets.
In the literature, many studies have revealed that individual classifiers show only moder-
ately good performance when compared to ensemble classifiers. Other works16-19 presented a
comparative study on various ensemble methods, such as bagging, boosting, random subspace,
decorate, and rotation forest, for credit scoring. The conventional credit scoring models are based
on individual classifiers or on a combination of these classifiers and tend to show moderate
performance. Kim and Upneja20 proposed AdaBoosted decision tree (DT) for predicting the com-
plex dynamics of restaurant financial distress. Many classifiers, such as DT, SVM, Naïve Bayes (NB), and NN-based classifiers, have thus far been proposed for learning problems. However, no classifier can perform well for all datasets; usually, a classifier performs well only for a specific dataset. As a consequence, using an ensemble classifier is a strong approach to get near to the optimal classifier for any dataset.21 This approach strengthens the classifiers in error-prone subspaces and,
consequently, leads to higher performance for the classification. Generally, the result of combi-
nation of diverse classifiers is better than the individual classifiers.22-24 Basically, there are two
types of ensemble frameworks, ie, homogeneous and heterogeneous ensemble frameworks.24 The
most popular ways to combine the base classifiers are majority voting and weighted voting.22,23,25
A multilayer ensemble framework that is formed by combining the classifiers using a layered
approach with heterogeneous classifiers overcomes the limitations of conventional performance
bottlenecks.26 However, when a multilayer ensemble classifier is used, the placement of each
classifier may affect the overall performance.
Although bio-inspired approaches toward problem solving seem almost ideal because they
have properties like self-optimization, flexibility, and a simple set of ground rules, they have a few
disadvantages, such as the initial thrust or starting condition for the algorithm, overhead, and
the checking of the environment variables.27,28 In order to overcome the limitations of a feature
selection approach based on bio-inspired algorithm, filter-based feature ranking techniques can
be utilized to rank features independently without involving any learning algorithm.
In order to address the limitations of feature selection based on bio-inspired algorithms and a
multilayer ensemble classifier, in this paper, a hybrid model is proposed, which combines feature
selection based on feature ranking with a multilayer ensemble method to improve the classifi-
cation performance for credit scoring. For the feature selection approach, an ensemble feature
ranking approach is proposed and, in order to place the classifiers in the multilayer ensemble
framework, an approach for classifiers ranking is proposed. The rank of the classifier is estimated
based on Choquet integral value (CIV) of the respective classifiers. Some preprocessing steps, such
as data-cleaning, data-transformation, and data-discretization, are carried out before the feature
selection process.
The rest of this paper is organized as follows. Section 2 covers the proposed framework for
credit scoring and covers preprocessing, feature selection, and multilayer ensemble classification.
Section 3 presents the experimental results and analysis on four credit scoring datasets. Finally,
Section 4 concludes the contributions of this paper based on experimental observations.

2 PROPOSED FRAMEWORK

This section discusses the proposed hybrid credit scoring model, which combines ensemble fea-
ture selection and a multilayer ensemble classifier. Feature selection eliminates the irrelevant
and noisy features and this helps to reduce the complexity. In order to improve the performance
of the model, a multilayer ensemble framework is used. These are discussed in the following
sections and the proposed framework is shown in Figure 1. The model has three phases, namely,
preprocessing, feature selection, and ensemble framework.

2.1 Phase-1
This phase consists of three steps: preprocessing, classifier ranking, and weight assignment.

2.1.1 Preprocessing
Data preprocessing is an important step in the modeling process. The aim of this step is to increase the effectiveness of the classification process by using a representative and consistent dataset.

FIGURE 1 Proposed hybrid framework for credit scoring

Preprocessing includes data cleaning, data transformation, and data discretization, which are
explained as follows.
• Data cleaning: In this step, the whole dataset is considered and checked for missing values, and
samples with missing values are eliminated.
• Data transformation: Credit scoring datasets have both numerical and categorical attributes.
Some classification algorithms, such as SVM and NNs, do not work well with heterogeneous
attributes, so data transformation is required. To convert categorical values into numerical
values, a unique integer number is assigned to each unique categorical value in each feature set.
• Data discretization: In the previous step, categorical data values are replaced by numerical val-
ues. The categorical attributes have a small range of values, but the numerical attributes have
a wide range of values. To make these ranges balanced, discretization is used. Data discretiza-
tion is the process of converting a dataset with continuous valued attributes into a dataset with
discrete valued attributes.
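To make these three steps concrete, the following is a minimal Python sketch, assuming a pandas DataFrame in which missing entries are already encoded as NaN and assuming the discretization cut points are supplied externally (the experiments in Section 3.2 derive them with a Boolean reasoning algorithm); all names here are illustrative:

```python
import numpy as np
import pandas as pd

def preprocess(df, categorical_cols, numerical_cuts):
    """Data cleaning, transformation, and discretization (sketch)."""
    # Data cleaning: eliminate samples (rows) that contain missing values.
    df = df.dropna(axis=0)

    # Data transformation: assign a unique integer to each unique
    # categorical value in each categorical feature.
    for col in categorical_cols:
        df[col] = pd.factorize(df[col])[0]

    # Data discretization: replace each continuous value by the index of
    # the interval it falls into, given per-feature cut points.
    for col, cuts in numerical_cuts.items():
        df[col] = np.digitize(df[col], bins=cuts)
    return df
```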

2.1.2 Classifiers ranking


There is no specific way to recognize which classifier is the best for a specific dataset, and ensemble learning is an effective approach for producing a near-optimal classifier for any dataset. In the third phase, a multilayer ensemble classifier framework is used and the predicted outputs of the classifiers are combined using a weighted voting approach. In this framework, the placement of the classifiers has a great impact on the framework's predictive accuracy: placing classifiers with better predictive accuracy in the last layer yields better overall predictive accuracy than placing classifiers with lesser predictive accuracy there. The preprocessed dataset is utilized to find the rank and weight of the classifiers.
In this phase, eight classifiers, namely, quadratic discriminant analysis (QDA),29,30 NB,31 multilayer feed forward neural network (MLFN),32,33 time delay NN (TDNN),34 distributed time delay NN (DTNN),35 DT,36 SVM,37 and K-nearest neighborhood (KNN),38 are initially utilized to find the rank of each classifier. The traditional ranking approach is based on the classifier's accuracy, computed as a weighted sum, but accuracy is not a sufficient measure if the dataset is imbalanced toward a class label. Despite its simplicity, this approach has a shortcoming because it implicitly assumes the independence of the criteria. It is difficult to assign a rank to a classifier by considering multiple criteria; moreover, the criteria often interact. The Choquet integral is a measure that considers the interaction of criteria and is used for decision making. To find the ranks of the classifiers, Choquet integral values based on multiple criteria are utilized.39-41 Basic preliminaries of the Choquet integral are as follows.
Let us denote by C = {c_1, … , c_n} the set of criteria, and the power set of C by P(C).

Definition 1. A fuzzy measure on the set C of criteria is a set function $\mu : P(C) \to [0, 1]$ satisfying the following axioms42:
(1) $\mu(\emptyset) = 0$, $\mu(C) = 1$;
(2) $A \subset B \subset C$ implies $\mu(A) \le \mu(B) \le \mu(C)$;
where $\mu(A)$, $\mu(B)$, and $\mu(C)$ symbolize the weights for the importance of the sets of criteria A, B, and C, respectively. Subsequently, in addition to the standard weights on criteria taken independently, the weights on any combination of criteria are also defined.

Let C be a finite set, ie, $C = \{c_1, c_2, \ldots, c_n\}$. Rearrange $f(c_1), f(c_2), \ldots, f(c_n)$ into increasing order, ie, $f(c^*_{(1)}) \le f(c^*_{(2)}) \le \cdots \le f(c^*_{(n)})$, where $(c^*_{(1)}, c^*_{(2)}, \ldots, c^*_{(n)})$ is a permutation of $(c_1, c_2, \ldots, c_n)$. The Choquet integral based on a fuzzy measure $\mu$ can be computed as follows.

Definition 2. Let $\mu$ be a fuzzy measure on C. The Choquet integral of a function $f : C \to [0, 1]$ with respect to $\mu$ is defined as in Equation (1), ie,

$$I_\mu\big(f(c^*_{(1)}, \ldots, c^*_{(n)})\big) = \sum_{i=1}^{n} \big[f(c^*_{(i)}) - f(c^*_{(i-1)})\big]\,\mu(A_{(i)}), \qquad (1)$$

where $f(c^*_{(0)}) = 0$ and $A_{(i)} = \{c^*_{(i)}, \ldots, c^*_{(n)}\}$.42


Furthermore, the rank assigned to each classifier is based on its Choquet integral value; ranks are assigned using the max operator (the highest value indicates the highest priority). The complete procedure is given in Algorithm 1.
Algorithm 1. Classifiers ranking algorithm
Input:
• Number of evaluation items (number of classifiers).
• Number of criteria (namely, sensitivity and specificity).
• Fuzzy measure values $\mu(A_i)$.
Output:
• Choquet integral value for each classifier.
Procedure:
(1) Choose the weights for each measure.
(2) Compute the fuzzy measure value for each subset as follows. Let C = {c_1, … , c_n} denote the set of criteria, and P(C) the power set of C. A fuzzy measure on the set C of criteria is a set function $\mu : P(C) \to [0, 1]$ satisfying the following axioms:
(a) $\mu(\emptyset) = 0$, $\mu(C) = 1$;
(b) $A \subset B \subset C$ implies $\mu(A) \le \mu(B) \le \mu(C)$.
(3) Obtain the Choquet integral value of each classifier using $I_\mu(f(c^*_{(1)}, \ldots, c^*_{(n)})) = \sum_{i=1}^{n} [f(c^*_{(i)}) - f(c^*_{(i-1)})]\,\mu(A_{(i)})$.
(4) Assign the rank of each classifier on the basis of its CIV (the highest CIV receives the highest rank).
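As a worked illustration of Algorithm 1, the Python sketch below computes the Choquet integral for the two criteria used here (sensitivity and specificity) with the fuzzy measure values later listed in Table 7; the function and dictionary names are illustrative, not from the paper:

```python
def choquet_integral(scores, mu):
    """Discrete Choquet integral of criterion scores w.r.t. fuzzy measure mu.

    scores: dict mapping criterion -> value in [0, 1]
    mu: dict mapping frozenset of criteria -> measure value in [0, 1]
    """
    # Sort criteria into increasing order: f(c*_(1)) <= ... <= f(c*_(n)).
    order = sorted(scores, key=scores.get)
    civ, prev = 0.0, 0.0
    for i, c in enumerate(order):
        # A_(i) = {c*_(i), ..., c*_(n)}: criteria whose score is >= f(c*_(i)).
        subset = frozenset(order[i:])
        civ += (scores[c] - prev) * mu[subset]
        prev = scores[c]
    return civ

# Fuzzy measure from Table 7.
mu = {frozenset(): 0.0,
      frozenset({"sensitivity"}): 0.8,
      frozenset({"specificity"}): 0.8,
      frozenset({"sensitivity", "specificity"}): 1.0}

# Example: TDNN on the Australian dataset (Table 8), scores scaled to [0, 1].
civ = choquet_integral({"sensitivity": 0.9343, "specificity": 0.8513}, mu)
# civ is about 0.9177, matching the CIV of 91.77 reported in Table 8.
```

Ranking then amounts to sorting the classifiers by their CIVs in descending order.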

2.1.3 Weight assignment


The five classifiers with the best ranking are arranged, as shown in phase-III of Figure 1. In case
of a multilayer ensemble framework, the classifiers C-1 and C-2 with the highest rankings are in
the second layer and the other three classifiers are in the first layer.
In this work, weighted voting approach with a heterogeneous multilayer ensemble frame-
work is used. For weighted voting, the weights are required to aggregate the outputs predicted
by the base classifiers to calculate the final output of the ensemble framework. For the assign-
ment of weights to the base classifiers, classification error rate is used as a parameter and the
weights are inversely proportional to the classifier's error rate. These weights are calculated from
Equation (2).43 Initially, equal weights are assigned to each base classifier, then the dataset is
applied for classification and the weights are calculated. This procedure is repeated for n itera-
tions and the mean of the weights (after n iterations) is assigned to the respective classifiers, ie,

$$W_{ij} = \frac{1 - Er_{ij}}{\sum_{j=1}^{P}(1 - Er_{ij})}, \qquad (2)$$

where $W_{ij}$ represents the weight of the jth classifier in the ith iteration, $Er_{ij}$ represents the error rate of the jth classifier in the ith iteration, and P denotes the number of classifiers.
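A short sketch of the weight computation in Equation (2) for a single iteration follows (averaging over the n iterations is left to the caller); the error rates shown are placeholders:

```python
import numpy as np

def classifier_weights(error_rates):
    """Weights per Equation (2): proportional to (1 - error rate),
    normalized so the weights of the P classifiers sum to 1."""
    acc = 1.0 - np.asarray(error_rates, dtype=float)  # (1 - Er_ij) per classifier
    return acc / acc.sum()

# Example: error rates of five base classifiers in one iteration.
w = classifier_weights([0.08, 0.12, 0.15, 0.18, 0.22])
```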

2.2 Phase-2: ensemble feature selection


Feature selection (FS) is one of the most fundamental issues in the field of machine learning.
It provides several benefits, such as reduced computational costs (eg, training and testing time
and storage requirements), and it also improves prediction performance.24 The main aim of fea-
ture selection is to determine, from the original set of features in a given dataset, a subset of
features that is ideally necessary and sufficient to describe the target concept. In former stud-
ies, it has been demonstrated that one single FS method cannot perform well with all classifiers
and datasets.26 The ensemble FS (EFS) method is therefore an effective approach to provide the
optimal FS method. The EFS method is an approach that leads to higher stability and a better pre-
dictive model by aggregating many weaker prediction models. In this article, a stable FS method
based on an ensemble learning approach is proposed. Three different feature ranking methods,
namely, median, Pearson-correlation coefficient, and logistic regression coefficient methods, are
utilized as base learners for the EFS to calculate the final ranking of the features.
Median: This approach compares samples of the positive class with samples of the negative class using a Mann-Whitney U test.44 The test evaluates the hypothesis set out in Equation (3), ie,

$$H_0 : m_0 = m_1, \qquad (3)$$

where $m_0$ and $m_1$ are the medians of a predictor variable in the positive and negative class, respectively, and $H_0$ is the null hypothesis for each predictor variable.
TRIPATHI ET AL. 377

FIGURE 2 Framework for ensemble feature ranking

The resulting p-values are used as a measure of feature importance. Thus, a smaller p-value indicates greater importance.
Correlation: The Pearson product-moment correlation,45,46 as in Equation (4), is used to select features that are highly correlated with the class label and have low correlation with other features. A threshold of $p_v = 0.7$ for the correlation between predictor variables is used, ie,

$$p_v = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}, \qquad (4)$$

where X and Y represent features, n is the number of samples per feature, and $\bar{X}$ and $\bar{Y}$ are the means of features X and Y.
Logistic regression (LogReg): This is another FS method. In this approach, the coefficients ($\beta$-values) of the logistic regression are utilized as a weighting system, but the ranges of the features may differ. The $\beta$-coefficients of the features are therefore not directly comparable; to make them comparable, a Z-transformation is performed, as shown in Equation (5),47 ie,

$$Z_x = \frac{X - \bar{X}}{S_X}, \qquad (5)$$

where $\bar{X}$ and $S_X$ represent the mean and standard deviation of the feature X, respectively.
Through standardization by the Z-transformation, each feature has a mean of zero and a standard deviation of 1, ensuring that all features have the same domain and that the $\beta$-coefficients are comparable. Subsequently, the coefficients are ordered according to their absolute values in decreasing order.
Finally, to calculate the final ranking of all the features, the values obtained by median, cor-
relation, and LogReg are aggregated (summation) for each feature. These summed values are
then normalized in the range [0, 1] and are considered as the final measure to calculate the final
ranking of the respective features. The complete process is as depicted in Figure 2.
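A sketch of this ensemble ranking is given below, assuming NumPy arrays and binary labels in {0, 1}. Turning the median-test p-value into a score via 1 − p and taking absolute values of the correlation and β-coefficients are assumptions about how the three measures are made commensurable; the text above specifies the aggregation only as a summation followed by normalization:

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr
from sklearn.linear_model import LogisticRegression

def ensemble_feature_scores(X, y):
    """Aggregate median-test, correlation, and LogReg scores per feature,
    then normalize the sums to [0, 1] (sketch)."""
    n_features = X.shape[1]
    pos, neg = X[y == 1], X[y == 0]

    # Median: Mann-Whitney U test per feature; a smaller p-value indicates
    # greater importance, so 1 - p is used as the score here.
    med = np.array([1.0 - mannwhitneyu(pos[:, j], neg[:, j]).pvalue
                    for j in range(n_features)])

    # Correlation: absolute Pearson correlation of each feature with the label.
    cor = np.array([abs(pearsonr(X[:, j], y)[0]) for j in range(n_features)])

    # LogReg: absolute beta-coefficients fitted on Z-transformed features
    # (assumes no constant feature, so the standard deviations are nonzero).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    beta = np.abs(LogisticRegression(max_iter=1000).fit(Z, y).coef_[0])

    # Aggregate (summation) and normalize to [0, 1] as the final measure.
    s = med + cor + beta
    return (s - s.min()) / (s.max() - s.min())
```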

2.3 Multilayer ensemble classifier


Various classification algorithms have been proposed by various researchers, but there is no specific way to predict which classifier will produce the best results on a specific dataset. An ensemble of classifiers has the capability to produce near-optimal results on every dataset. An ensemble framework can be assembled in two ways: homogeneously or heterogeneously. In a homogeneous framework, base classifiers of the same type are combined, whereas, in a heterogeneous framework, base classifiers of different types are combined. As shown in the work of Tsai et al,48 there is no significant difference between homogeneous and heterogeneous ensemble frameworks with majority voting, although a homogeneous classifier ensemble with majority voting predicts the appropriate class label well. There are two common approaches to aggregate the outputs predicted by the classifiers: majority voting and weighted voting. In the majority voting approach, the same weight is assigned to each classifier and the output is the class that receives the most votes.43 In the weighted voting approach, the highest weight is assigned to the classifier with the highest accuracy, and vice versa. The final output is the weighted sum of the outputs predicted by the base classifiers.43
A multilayer ensemble classifier framework permits adaptation at multiple points, unlike single-layer classifiers.24 The primary reason for preferring this perspective is that diverse models can be used to separate the granularity of issues. If different classifiers are utilized at different layers, different features can be used at each layer and the classification tasks can be more refined. Furthermore, a multilayer ensemble classification scheme can be used to enhance the prediction.24 The computational complexity of the multilayer ensemble framework is reduced by dividing it into a two-layer approach. The main motive behind using a multilayer ensemble classifier framework is that the final decision does not depend on just a single classifier's decision; rather, all classifiers participate in the decision-making process by aggregating their individual predictions. Hence, this method outperforms the base classifiers.
In order to improve the predictive performance of the proposed framework, five heteroge-
neous classifiers are aggregated into a multilayer ensemble classifier framework, as shown in
phase-III of Figure 1. The classifiers C-1, C-2, C-3, C-4, and C-5 are chosen as the best classi-
fiers out of the eight heterogeneous classifiers in phase-1. Data with the selected features are fed
with the weights assigned to the respective classifiers for evaluation of the final results against
the input samples. Furthermore, the five classifiers with the best ranking are arranged, as shown
in phase-III of Figure 1. In this framework, C-1 and C-2 have the highest ranks and are placed in
the second layer. The other three classifiers are placed in the first layer. The combiner in the sec-
ond layer aggregates the results obtained by the three layer-1 classifiers, and the combiner in the
next layer aggregates the results obtained by the two layer-2 classifiers and the output of the previous combiner.
In this framework, the same training dataset is used to train the different base classifiers, and then these
classifiers' outputs are aggregated to make the final predicted output of the framework against
each sample.
Each combiner aggregates the outputs predicted by the associated classifiers using Equation (6),49 ie,

$$O = \sum_{i=1}^{P} W_i \, X_i, \qquad (6)$$

where $W_i$ and $X_i$ are the weight and predicted output of the ith classifier, respectively, and P denotes the number of base classifiers.
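A sketch of the two-layer aggregation described above follows. Thresholding the weighted sum of Equation (6) at half the total weight to obtain a class label, and giving the layer-1 combiner output the mean weight of the classifiers it aggregates, are assumptions that the text does not fix:

```python
import numpy as np

def combiner(outputs, weights):
    """Equation (6): weighted sum of predicted labels, thresholded to a class."""
    o = float(np.dot(weights, outputs))
    return 1 if o >= 0.5 * float(np.sum(weights)) else 0

def layered_weighted_vote(preds, weights):
    """Two-layer LWV sketch; preds/weights are ordered C-1..C-5, with C-1
    and C-2 (the two best-ranked classifiers) sitting in the second layer."""
    # Layer 1: combine the three lower-ranked classifiers C-3, C-4, C-5.
    layer1 = combiner(preds[2:], weights[2:])
    # Layer 2: combine C-1, C-2, and the layer-1 output (the layer-1 result
    # is assigned the mean weight of the classifiers it aggregates -- an
    # assumption, since the paper does not state its weight).
    w2 = list(weights[:2]) + [float(np.mean(weights[2:]))]
    return combiner(list(preds[:2]) + [layer1], w2)

# Example: predicted labels of C-1..C-5 for one sample with their weights.
final = layered_weighted_vote([1, 0, 1, 1, 0], [0.25, 0.22, 0.20, 0.18, 0.15])
```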

3 EXPERIMENTAL RESULTS

This section describes the datasets and performance measures used in this work, the ensemble feature selection results obtained by the various feature selection approaches, and the results obtained by the various ensemble frameworks.

3.1 Datasets and performance measures


Australian (AUS), Japanese (JPD), German-categorical (GCD), and German-numerical (GND)
datasets are used in this work. These datasets are acquired from the UCI Machine Learn-
ing Repository.50 All the datasets have a combination of continuous and nominal attribute types. Australian and Japanese datasets are related to credit approval. German-categorical and
German-numerical datasets are related to loan application. To protect the confidentiality of the
data, the values of some attributes are replaced by random meaningless symbols. Detailed descrip-
tion of the datasets is given in Table 1. All the aforementioned datasets as described in Table 1 are
binary class datasets, and the class label represents whether the sample was accepted (class-1) or
rejected (class-2). In the Australian dataset, the categorical attributes are 1, 4, 5, 6, 8, 9, 11, and
12, with 2, 3, 14, 9, 2, 2, 2, and 3 values, respectively, whereas the others are numerical attributes.
In the Japanese dataset, the categorical attributes are 1, 4, 5, 6, 7, 9, 10, 12, and 13 and have 2, 4,
3, 14, 9, 2, 2, 2, and 3 values, respectively, whereas the others are numerical attributes. In case of
German-categorical dataset, attributes 1, 3, 4, 6, 7, 9, 10, 12, 14, 15, 17, 19, and 20 are categorical
with 4, 5, 11, 5, 5, 5, 3, 4, 3, 3, 4, 2, and 2 values, respectively, whereas the rest of the attributes
are numerical attributes. Similarly in the case of German-numerical dataset, all attributes are
numerical with varying ranges of integer values.
As discussed earlier, all the datasets are binary class and the proposed model focuses on the classification problem. To analyze the performance of the proposed model, the confusion matrix, as shown in Table 2, is used; this is the most popular way to evaluate classification problems. From this table, various classification measures commonly available in the literature, such as accuracy, sensitivity, specificity, and G-measure, are computed. Accuracy (Equation (7)) shows the predictive performance of the classifier, which is not sufficient as a performance measure if there is a significant class imbalance toward one class in the dataset. The
datasets used in this experimental work are binary class datasets with accepted (credit approved)
and rejected (credit not approved) classes. Sensitivity (Equation (8)) represents the accuracy of
only the prediction of accepted samples and specificity (Equation (9)) measures the prediction
accuracy for rejected samples. G-measure (Equation (10)) is a measure that considers both the
accepted and the rejected accuracies to compute the score. It can be interpreted as geometric mean
of sensitivity and specificity. All performance measures mentioned here give 1 for the best case and 0 for the worst case, ie,

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$

$$Sensitivity = \frac{TP}{TP + FN} \qquad (8)$$

$$Specificity = \frac{TN}{TN + FP} \qquad (9)$$

$$G\text{-}measure = \sqrt{Sensitivity \times Specificity}. \qquad (10)$$
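For reference, a direct transcription of Equations (7) to (10) from the four counts in Table 2 (a minimal sketch):

```python
import math

def classification_scores(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and G-measure (Equations (7)-(10))."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # prediction accuracy on accepted samples
    specificity = tn / (tn + fp)   # prediction accuracy on rejected samples
    g_measure = math.sqrt(sensitivity * specificity)  # geometric mean of the two
    return accuracy, sensitivity, specificity, g_measure
```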
TABLE 1 Descriptions about datasets used
Dataset Samples Features Class-1/Class-2 Categorical/Numerical
Australian 690 14 383/317 6/8
Japanese 690 15 383/317 7/8
German-categorical 1000 20 700/300 13/7
German-numerical 1000 24 700/300 0/24

TABLE 2 Confusion matrix


Actual Accepted Actual Rejected
Observed as accepted True positive (TP) False positive (FP)
Observed as rejected False negative (FN) True negative (TN)

3.2 Feature ranking and comparative analysis


Preprocessing is the first phase in the proposed model and includes data cleaning, data trans-
formation, and data discretization. Credit scoring datasets are used for validating the proposed
model. These datasets have continuous and categorical features, along with some missing values.
Samples with missing values are eliminated, and individual categorical values are replaced by
unique integer values. The categorical attributes have a small range of values, but the numerical
attributes have a wide range of values. The method of discretization is used to bring the numerical
attributes within a balanced range. To discretize the numerical features, the Boolean reasoning algorithm,51,52 which separates the feature values with an optimal set of hyperplanes, is used. The discretized values of the corresponding features are depicted in Tables 3 to 6 for the respective datasets.
The preprocessed dataset is used to rank the classifiers. The data are randomly partitioned into training and testing datasets in a ratio of 9:1 to calculate the ranks of the classifiers. All eight classifiers are applied to the preprocessed dataset, and their specificity and sensitivity are evaluated. These measures are used to calculate the CIV; this procedure is repeated for n iterations, and the mean over all iterations is used as the final value to rank each classifier. Table 7 presents the fuzzy measure for each criterion, which is used to calculate the CIV. Furthermore, the CIVs are arranged in descending order and the corresponding ranks are assigned in ascending order. Table 8 depicts the rank of each classifier on each dataset with the corresponding CIV, sensitivity, and specificity.

TABLE 3 Discretization of features with continuous values in Australian dataset


Feature 1 2 3 4 5 6 7 8 9 10 11
2 0-17.75 21.21 22.38 23.04 23.34 24.38 27.92 32.38 37.38 48.96 -
3 0-0.48 1.793 2.793 4.020 6.103 -
7 0-0.14 1.020 2.45 -
13 0-23 93 171 262 -
14 0-13 13-

TABLE 4 Discretization of features with continuous values in Japanese dataset


Feature 1 2 3 4 5 6 7 8 9 10 11
2 0-17.75 21.21 22.38 23.04 23.34 24.38 27.92 32.38 37.38 48.96 -
3 0-0.48 1.793 2.793 4.020 6.103 -
8 0-0.14 1.020 2.45 -
14 0-23 93 171 262 -
15 0-13 13-

TABLE 5 Discretization of features with


continuous values in German-categorical dataset
Feature 1 2 3 4 5
2 0-12 23 32 38 -
4 0-7.14 13.87 20.45 39.14 -
10 0-27 33 -

TABLE 6 Discretization of features with


continuous values in German-numerical
dataset
Feature 1 2 3 4 5
2 0-12 23 32 38 -
5 0-714 1387 2045 3914 -
8 0-4 -
13 0-27 33 -
17 0-17 -

TABLE 7 Criterion's fuzzy measure


Set Fuzzy measure
{} 0
{Sensitivity} 0.8
{Specificity} 0.8
{Sensitivity, Specificity} 1

TABLE 8 Classifiers ranking using CIV on four datasets


        Australian dataset                  Japanese dataset
CL      Sen     Spe     CIV     R       Sen     Spe     CIV     R
QDA     82.01   90.85   89.08   5       90.26   80.97   88.40   2
NB      86.24   85.74   86.14   6       88.04   82.38   86.90   4
MLFN    82.47   86.13   85.39   7       85.22   83.42   84.86   5
DTNN    92.22   89.86   91.74   2       81.58   79.98   81.26   7
TDNN    93.43   85.13   91.77   1       89.47   78.96   87.37   3
DT      84.64   78.09   83.33   8       83.35   81.47   82.97   6
SVM     79.94   92.49   89.98   3       93.68   80.97   91.14   1
KNN     89.47   51.19   89.81   4       83.13   61.73   78.85   8
        German-categorical dataset          German-numerical dataset
CL      Sen     Spe     CIV     R       Sen     Spe     CIV     R
QDA     88.00   47.33   79.87   3       88.42   49.66   80.67   4
NB      81.84   55.30   76.53   5       82.05   51.90   76.02   5
MLFN    75.74   60.60   72.71   7       75.69   66.56   73.86   7
DTNN    87.50   60.77   82.15   2       88.53   63.84   83.59   2
TDNN    89.28   61.42   83.70   1       90.82   67.77   86.21   1
DT      83.42   49.66   76.67   6       82.57   47.33   75.52   6
SVM     73.14   69.66   72.44   8       72.14   72.66   72.56   8
KNN     95.28   16.00   79.42   4       95.57   23.66   81.18   3

CL: Classifier, Sen: Sensitivity, Spe: Specificity, R: Rank. Abbreviations: CIV, Choquet integral value; DT, decision tree; DTNN, distributed time delay neural network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve Bayes; QDA, quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network.

The preprocessed dataset is used to find the best features. For feature selection, an ensemble feature selection approach is proposed in which the median, Pearson-correlation (P-cor), and LogReg values for each feature are utilized to rank the features. The ranking of each feature by each measure (median, P-cor, LogReg, and the ensemble approach) is depicted in Figures 3 to 6 for each dataset.
The graph for the Australian dataset is shown in Figure 3. It is clear from the results of Figure 3 that, when the median measure is used, many features (eg, V5, V8, V10, … , V9) have the same p-value, which shows that these features have the same rank. The P-cor measure assigns the best rank to V8 and the least rank to V1, which shows that the features V8 and V1 are highly and poorly correlated with the target or class label, respectively. V5 and V1 are ranked as the best and

FIGURE 3 Feature ranking on Australian dataset

FIGURE 4 Feature ranking on Japanese dataset

FIGURE 5 Feature ranking on German-categorical dataset

FIGURE 6 Feature ranking on German-numerical dataset

the least features by the LogReg measure, respectively. Finally, the proposed ensemble feature
ranking algorithm aggregates the values calculated by all the aforementioned measures to give a
final predictive measure to rank the features. Using the proposed feature ranking approach, V5
and V1 are assigned the best and least feature ranks, respectively.
The graph for the Japanese dataset is shown in Figure 4. It is clear from the results shown in Figure 4 that, for the median measure, many features, such as V8, V9, and V5, have the same p-value, which shows that these features have the same rank. The P-cor assigns the best rank to V9 and the least rank to V1, which shows that the features V9 and V1 are highly and poorly correlated with the target or class label, respectively. V8 and V12 are ranked as the best and the least features by the LogReg measure, respectively. Finally, using the proposed feature ranking approach, V8 and V1 are assigned the best and the least feature ranks, respectively.
Next, the graph for the German-categorical dataset is shown in Figure 5. This shows that, using
median measure, the same ranking is given to more than one feature (eg, V1, V2, V12, … ,V9) and
that V10 has the lowest rank. The P-cor and LogReg assign the highest rank to V1 and V2, respec-
tively, and both give the lowest rank to V18. The proposed ensemble feature ranking algorithm
assigns the best rank to V1 and the least rank to V18.
Finally, the graph for the German-numerical dataset is shown in Figure 6. It is clear from the
results shown in Figure 6 that, by the median measure, many features, such as V1, V2, V3, V5,
and V9, have the same p-value, which shows that these features have the same rank, and V18 has
the lowest rank. The P-cor and LogReg measures assign V1 and V8 as the best and the least ranks.
Finally, by the proposed feature ranking approach, V1 and V8 are assigned as the best and the
least feature ranks, respectively.
For comparative analysis, the dataset is partitioned as per the 10-fold cross validation, and
eight heterogeneous classifiers, namely, QDA, NB, MLFN, DTNN, TDNN, DT, SVM, and KNN,
are chosen. The experiment is conducted with every classifier with the K-best (best 50% of actual
features for the particular dataset) features ranked by median, P-cor, LogReg, and the ensemble
of all three approaches (EFS) applied on the training dataset and tested on the testing dataset.
The mean of the 10-fold cross validation results in terms of accuracy for the Australian, Japanese,
German-categorical, and German-numerical datasets are shown in Figures 7 to 10, respectively.
From Figures 7 to 10, it can be observed that the results obtained using the ensemble feature
ranking approach have the best predictive accuracy for most of the classifiers. This approach also
gives significant improvements in predictive accuracy for all the features on all aforementioned
credit scoring datasets.
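A sketch of this evaluation protocol, assuming a scikit-learn-style classifier, NumPy arrays, and the per-feature scores produced by the ensemble ranking above; the helper name is illustrative:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def kbest_cv_accuracy(clf, X, y, feature_scores, k):
    """Mean 10-fold cross-validation accuracy using the k best-ranked
    features (k = 50% of the original feature count in these experiments)."""
    top = np.argsort(feature_scores)[::-1][:k]   # indices of the k best features
    accuracies = []
    for train, test in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
        clf.fit(X[np.ix_(train, top)], y[train])
        accuracies.append(clf.score(X[np.ix_(test, top)], y[test]))
    return float(np.mean(accuracies))
```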

FIGURE 7 Comparative graph on Australian dataset. DT, decision tree; DTNN, distributed time delay neural network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve Bayes; QDA, quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network

FIGURE 8 Comparative graph on Japanese dataset. DT, decision tree; DTNN, distributed time delay neural network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve Bayes; QDA, quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network

FIGURE 9 Comparative graph on German-categorical dataset. DT, decision tree; DTNN, distributed time delay neural network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve Bayes; QDA, quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network

3.3 Ensemble classification results


In the proposed approach, after data preprocessing, the next task is to assign ranks and weights to
the classifiers. The preprocessed dataset is used to find the rank of the classifiers, and then, accord-
ing to their ranks, the classifiers are placed in a multilayer ensemble framework and the weight
of the classifier is calculated. The dataset with the features selected by the proposed ensemble
feature ranking-based feature selection is used in the ensemble framework. In case of nonlay-
ered ensemble approach, the best five classifiers are used in a single layer. The final output of the
ensemble approach with majority voting is the class that has the highest votes. In case of weighted
voting, the final output is the weighted sum of the outputs predicted by the base classifiers.

FIGURE 10 Comparative graph on German-numerical dataset. DT, decision tree; DTNN, distributed time delay neural network; KNN, K-nearest neighborhood; MLFN, multilayer feed forward neural network; NB, Naïve Bayes; QDA, quadratic discriminant analysis; SVM, support vector machine; TDNN, time delay neural network

TABLE 9 Performance comparison of proposed model with other ensemble


approach and random forest (RF) on Australian dataset
Ensemble Technique Accuracy Sensitivity Specificity G-measure
RF 81.37 79.86 80.89 80.37
MV 88.35 88.32 88.26 88.29
WV 88.26 87.83 89.91 88.86
LMV 91.12 94.79 87.18 90.88
LWV(R) 89.15 93.37 86.53 89.91
LWV 91.55 97.68 86.58 91.96
With Feature Selection
RF 83.85 84.84 81.28 83.04
MV 89.18 89.20 89.13 89.01
WV 89.61 89.91 89.49 89.69
LMV 92.62 94.19 89.62 91.88
LWV(R) 90.31 91.81 89.38 90.59
LWV 93.85 98.46 88.46 93.33

The layered ensemble approach, as in the proposed model, has two layers. In the first layer, the out-
puts of three classifiers (C-3, C-4, and C-5), with ranks of 3, 4, and 5, are aggregated and forwarded
to the next layer. In the second layer, the two classifiers with the best rankings (C-1 and C-2) and
the predicted output of the previous layer are aggregated by the combiner, which gives the final
predicted output against a specified sample. The ranks of classifiers are shown in Table 8 over the
four datasets.
The experimental results for the Australian, Japanese, German-categorical, and German-
numerical datasets are depicted in terms of accuracy, sensitivity, specificity, and G-measure with
the respective aggregation methods in Tables 9 to 12, respectively. Moreover, RF represents the results obtained by the state-of-the-art random forest approach53 with 100 trees, with features selected by the proposed approach.

TABLE 10 Performance comparison of proposed model with other ensemble


approach and random forest (RF) on Japanese dataset
Ensemble Technique Accuracy Sensitivity Specificity G-measure
RF 82.37 83.86 87.89 85.85
MV 84.35 89.32 80.53 84.81
WV 84.26 88.36 83.91 86.10
LMV 85.92 89.79 82.18 85.90
LWV(R) 84.39 87.37 83.58 85.45
LWV 86.15 89.83 83.08 86.39
With Feature Selection
RF 83.48 83.42 89.68 86.49
MV 86.61 90.40 81.59 85.88
WV 86.98 90.51 82.23 86.27
LMV 87.56 94.40 81.59 87.76
LWV(R) 86.43 93.14 81.28 87.00
LWV 88.42 95.88 82.97 89.19

TABLE 11 Performance comparison of proposed model with other ensemble


approach and random forest (RF) on German-categorical dataset
Ensemble Technique Accuracy Sensitivity Specificity G-measure
RF 77.93 89.83 48.83 66.22
MV 78.51 91.61 48.63 66.74
WV 78.46 91.36 48.58 66.62
LMV 81.12 90.79 54.67 70.45
LWV(R) 80.35 89.37 56.58 71.11
LWV 82.68 92.03 59.36 73.91
With Feature Selection
RF 78.82 91.28 49.77 67.40
MV 79.97 93.34 49.33 67.85
WV 80.85 92.30 58.94 73.75
LMV 83.51 92.34 47.33 65.54
LWV(R) 81.12 91.40 62.66 76.68
LWV 84.51 94.06 61.00 75.74

Furthermore, MV and LMV represent the results obtained by the majority voting approach using nonlayered and layered approaches, respectively. In addition, WV and LWV represent the results obtained using the weighted voting aggregation approach with nonlayered and layered approaches, respectively. LWV(R) represents the results obtained by the layered weighted voting approach when the classifiers are arranged in reverse order (C-4 and C-5 in the last layer and C-1, C-2, and C-3 in layer-1).
From Tables 9 to 12, it can be observed that, with all the features, the multilayer ensemble approach with weighted voting and appropriate placement of the classifiers in layers greatly improves on the same approach with random placement of the classifiers, and it achieves the best accuracy, sensitivity, and G-measure. Furthermore, it can also be seen that, for all four credit scoring datasets, the layered weighted voting approach achieves the best accuracy, sensitivity, and G-measure and competitive specificity when compared to the other ensemble methods RF, MV, WV, LMV, and LWV(R).

TABLE 12 Performance comparison of proposed model with other ensemble


approach and random forest (RF) on German-numerical dataset
Ensemble Technique Accuracy Sensitivity Specificity G-measure
RF 74.69 86.86 46.67 63.67
MV 78.03 92.32 47.52 66.23
WV 79.26 92.36 49.91 67.89
LMV 83.31 93.79 59.18 74.50
LWV(R) 80.15 91.37 62.58 75.61
LWV 83.95 94.13 61.65 76.18
With Feature Selection
RF 75.25 88.38 46.86 64.35
MV 79.36 93.90 48.33 67.37
WV 80.81 94.13 49.62 68.35
LMV 84.74 95.83 58.49 74.87
LWV(R) 81.14 93.12 56.33 72.42
LWV 85.83 95.93 64.00 78.36

TABLE 13 Overall performance of the proposed approach in


credit scoring datasets
Datasets Accuracy Sensitivity Specificity G-measure
AUS 92.69 97.16 88.46 92.68
JPD 89.06 96.16 83.81 89.74
GCD 85.18 93.29 65.06 77.89
GND 84.69 94.52 65.83 78.86
Abbreviations: AUS, Australian; GCD, German-categorical; GND, German-numerical; JPD, Japanese.

Similarly, with ensemble feature selection, it can be observed that LWV gives better classification performance than the other ensemble frameworks. As the results tabulated in Tables 9 to 12 show, the proposed ensemble feature selection approach also improves the classification performance of RF, MV, WV, LMV, and LWV(R). Overall, the proposed ensemble feature selection approach with a multilayer ensemble framework outperforms the state-of-the-art method RF and ensemble frameworks such as MV, LMV, WV, and LWV(R).

3.4 Statistical significance


This section presents the statistical analysis showing the significance of the proposed approach against other ensemble approaches such as RF, MV, WV, LMV, and LWV(R). Various authors, such as Abellán and Castellano19 and Ala'raj and Abbod,23 have applied 10-FCV with 100 iterations and 5-FCV with 50 iterations to show the stability of their approaches. Therefore, in this study, 10-FCV with 100 iterations is used, and the mean over the 100 iterations of 10-FCV is tabulated in Table 13 for each dataset.

3.4.1 T-test analysis


The objective is to test the performance of the proposed classifier ranking algorithm in a multilayer ensemble framework against other ensemble approaches, like RF, MV, WV, LMV, and LWV(R), in terms of classification accuracy and G-measure. A right-tailed T-test is applied to compare the average accuracy and G-measure of 10-FCV with 100 iterations for each dataset. The p-values of LWV against the other ensemble frameworks (RF, MV, WV, LMV, and LWV(R)), for both classification accuracy and G-measure, are depicted in the columns of Table 14 at a significance level of 𝛼 = 0.05. Each column indicates whether or not the null hypothesis is rejected in favor of the alternative hypothesis based on the p-value and the chosen 𝛼. In all the test cases, the null hypothesis is rejected, so it is concluded that LWV performs significantly better. The p-values shown in Table 14 show the statistical significance of LWV according to the increase in classification accuracy and G-measure.
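A sketch of the test in Python, assuming the per-run accuracies of LWV and a competitor are paired across the repeated 10-FCV runs (the text does not state whether a paired or a two-sample test was used); alternative='greater' requires SciPy 1.6 or later:

```python
from scipy.stats import ttest_rel

def right_tailed_ttest(acc_lwv, acc_other, alpha=0.05):
    """Right-tailed paired T-test: is LWV's mean accuracy significantly
    higher than the competitor's over the repeated 10-FCV runs?"""
    t, p = ttest_rel(acc_lwv, acc_other, alternative="greater")
    return p, p < alpha   # reject H0 (no improvement) when p < alpha
```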

TABLE 14 Statistical significance of proposed approach vs. other ensemble approaches in terms of p-values
Dataset Measure RF Vs LWV MV Vs LWV WV Vs LWV LMV Vs LWV LWV(R) Vs LWV
AUS  Accuracy   0.00982797 0.00071830 0.00085368 0.00956326 0.00017383
AUS  G-measure  0.00046783 0.04287237 0.03963242 0.02603634 0.00925835
JPD  Accuracy   0.00988396 0.00081386 0.00093653 0.00893698 0.00023698
JPD  G-measure  0.00069832 0.00982563 0.00369856 0.01369875 0.00036987
GCD  Accuracy   0.00056893 0.00025896 0.00288361 0.00291593 0.00011258
GCD  G-measure  0.00036825 0.00369236 0.00258451 0.00853632 0.00222139
GND  Accuracy   0.00000789 0.00314214 0.00369819 0.00219959 0.00005880
GND  G-measure  0.00030889 0.01180889 0.00245218 0.00320036 0.00218766
With Feature Selection
AUS  Accuracy   0.00006361 0.00183982 0.00111509 0.02858956 0.00437582
AUS  G-measure  0.00000045 0.00002641 0.00014612 0.00867170 0.00013236
JPD  Accuracy   0.00249519 0.03828963 0.02301973 0.03293815 0.00504604
JPD  G-measure  0.00027914 0.01100812 0.01848148 0.01192802 0.00138615
GCD  Accuracy   0.00009954 0.00011153 0.01047690 0.00663793 0.00050401
GCD  G-measure  0.00210698 0.00810601 0.01526207 0.01949494 0.00019816
GND  Accuracy   0.00004118 0.00736635 0.00217363 0.01837393 0.00309696
GND  G-measure  0.00000514 0.04511904 0.00123686 0.03727870 0.00462196
Abbreviations: AUS, Australian; GCD, German-categorical; GND, German-numerical; JPD, Japanese.

TABLE 15 Average ranking of ensemble frameworks by Friedman's test with the respective credit scoring datasets

Dataset   RF     MV     WV     LMV    LWV(R)  LWV
AUS       1.28   2.81   3.31   4.75   3.73    5.54
JPD       1.52   3.25   3.25   4.45   3.05    5.53
GCD       2.43   2.82   2.48   4.66   3.24    5.69
GND       2.15   2.75   3.20   4.71   3.55    4.75
After Feature Selection
AUS       1.49   2.87   3.29   4.65   3.78    5.65
JPD       1.62   3.20   3.65   4.4    2.93    5.25
GCD       1.95   2.63   3.05   4.55   3.21    5.61
GND       1.55   3.05   3.92   4.45   3.15    4.93
Overall Ranking
All-Features  1      2.5    2.5    5      4       6
ENS-FS        1      2.25   3.25   5      3.5     6

Abbreviations: AUS, Australian; FS, feature selection; GCD, German-categorical; GND, German-numerical; JPD, Japanese.


3.4.2 Friedman's test


This section describes a nonparametric statistical analysis using Friedman's test.54,55 A nonparametric statistical test is a test in which the model does not specify conditions for the parameters of the population from which the sample was drawn. Friedman's test is performed on the average ranks of the various ensemble frameworks over the four credit scoring datasets. For the rank analysis, the means of the 10-FCV results with 100 iterations are considered, both with all features and with the features selected by the proposed feature selection approach; the average rank of each approach is depicted in Table 15. The results in Table 15 show that the proposed LWV approach achieves the highest rank on all the aforementioned credit scoring datasets, with the features selected by the proposed approach and with all features. Overall, comparing across all four credit scoring datasets, it achieves the highest rank.
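A sketch of the rank analysis with placeholder accuracies (in the study, the inputs are the 10-FCV means over 100 iterations for the six frameworks):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Placeholder per-run mean accuracies; columns: RF, MV, WV, LMV, LWV(R), LWV.
rng = np.random.default_rng(0)
results = rng.uniform(0.70, 0.95, size=(100, 6))

# Friedman's test: H0 is that all six frameworks perform alike.
stat, p = friedmanchisquare(*results.T)

# Average ranks as in Table 15: higher accuracy receives a higher rank.
avg_ranks = rankdata(results, axis=1).mean(axis=0)
```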

TABLE 16 Comprehensive comparisons of the results obtained by the proposed approach and from prior work

Method                            AUS    JPD    GCD    GND    References
RBFN 87.14 - 74.60 - West57
SVM-L 87.40 - 74.80 - Bequé and Lessmann58
SVM-R 86.10 - 75.90 - Bequé and Lessmann58
Boosting (LRA) 86.56 - 76.14 - Wang et al59
RS-bagging DT 88.17 - 78.52 - Wang et al6
Consensus hybrid ensemble 88.10 88.70 79.00 - Ala'raj and Abbod22
Consensus system approach 88.98 87.88 77.72 - Ala'raj and Abbod23
RF+CTD 86.10 86.40 75.20 - Abellán and Castellano19
VBDTM 91.97 - 81.64 - Zhang et al60
Random subspace+ LMNC 87.05 87.34 73.93 - Nanni and Lumini61
NN Classifiers 87.25 85.91 76.60 - Tsai and Wu62
SR+ANN 84.09 - - - Wongchinsri and Kuratach63
Sampling+F-score+SVM 86.76 - 76.84 - Hens and Tiwari64
GA+SVM 90.19 - 84.24 - Huang and Wang65
NRS-based FS - 85.48 74.50 - Hu et al66
IFS 90.90 - - 80.20 Liu et al67
HGA-NN - - 78.90 - Oreski and Oreski12
SVM + GA 86.90 - 77.92 - Huang et al68
NRS+SVM+ Grid search 87.52 - 76.60 - Ping and Yongheng69
GA+NB 85.56 - - 74.03 Liang et al15
LDA+MLP 86.00 - - 73.44 Liang et al15
Fused with MV 86.29 - 78.32 - Xia et al70
B&B+SVM - - 78.43 - Sun et al71
Random choosing+MLP - - 87.10 - Zhao et al56
Proposed approach 92.69 89.06 85.18 84.69 This study
Abbreviations: AUS, Australian; DT, decision tree; FS, feature selection; GA, genetic algorithm; GCD,
German-categorical; GND, German-numerical; JPD, Japanese; LDA, linear discriminant analysis; NN, neural
network; RF, random forest; RS, rough set; SVM, support vector machine.


3.5 Comprehensive comparative analysis


This section presents a comprehensive comparison of the results obtained from prior works on the credit scoring datasets in terms of classification accuracy. Table 16 contains the results obtained from prior works, along with the datasets used and the methods employed. Most of the researchers have considered credit risk evaluation as a binary class
classification problem and have found the evaluation to be reliable for exploring hidden patterns
in the credit scoring data. These systems aid professionals by enhancing their knowledge for credit
risk evaluation. In this context, a variety of soft computing and machine learning techniques has
been used to model the risk evaluation systems. These techniques are broadly categorized into
classifiers, ensemble classifiers, and hybrid classifiers.
It is clear from the results in Table 16 that the proposed model obtained the highest accuracy
in case of Australian, Japanese, and German-numerical datasets. In case of German-categorical
dataset, the proposed approach achieved the second highest accuracy. The proposed approach
is based on feature selection and a hybrid model based on an ensemble framework. The results
obtained using the proposed approach show that its performance is the best when compared with
approaches based on feature selection and classification (individual classifier or ensemble frame-
work). Zhao et al56 used random selection of training samples (ensemble learning approach) and
this showed the best performance with the German-categorical dataset. Applying the ensem-
ble learning approach together with the proposed approach may improve the classification
performance of the proposed framework.

4 CONCLUSION

In this paper, a hybrid credit scoring model is proposed, which combines an ensemble feature
selection approach with a multilayer ensemble framework. The feature selection technique is
based on the ensemble ranking of the existing features of the respective dataset, which is estimated
using the median, Pearson-correlation, and logistic regression measures. The multilayer ensemble classifier
is modeled by aggregating heterogeneous classifiers in a layered manner. Moreover, a novel clas-
sifier ranking algorithm is proposed using CIV for placement of the classifiers in the multilayer
ensemble framework.
The proposed framework is tested on Australian, Japanese, German-categorical, and
German-numerical datasets. The experimental results indicated that the features selected using
the proposed approach are more representative and improved the performance of QDA, NB,
MLFN, DTNN, TDNN, DT, and SVM classifiers in terms of classification accuracy. Overall, for all
the aforementioned credit scoring datasets, the proposed ensemble model outperformed the tradi-
tional ensemble models such as RF, MV, LMV, WV, and LWV(R) in terms of accuracy, sensitivity,
and G-measure. Hence, it can be concluded that the proposed ensemble framework based on
an ensemble feature selection with appropriate placement of classifiers in a multilayer ensemble
classifier is an efficient approach for credit scoring.

ORCID

Diwakar Tripathi https://orcid.org/0000-0001-8593-108X


Damodar Reddy Edla https://orcid.org/0000-0002-5040-0745
Ramalingaswamy Cheruku https://orcid.org/0000-0003-1677-5321

REFERENCES
1. Mester LJ. What's the point of credit scoring? Bus Rev. 1997;3:3-16.
2. García V, Marqués AI, Sánchez JS. An insight into the experimental design for credit risk and corporate
bankruptcy prediction systems. J Intell Inf Syst. 2015;44(1):159-189.
3. Lessmann S, Baesens B, Seow H-V, Thomas LC. Benchmarking state-of-the-art classification algorithms for
credit scoring: an update of research. Eur J Oper Res. 2015;247(1):124-136.
4. Chen N, Ribeiro B, Chen A. Comparative study of classifier ensembles for cost-sensitive credit risk assessment.
Intell Data Anal. 2015;19(1):127-144.
5. Chen N, Ribeiro B, Chen A. Financial credit risk assessment: a recent review. Artif Intell Rev. 2016;45(1):1-23.
6. Wang G, Ma J, Huang L, Xu K. Two credit scoring models based on dual strategy ensemble trees. Knowl Based
Syst. 2012;26:61-68.
7. Paleologo G, Elisseeff A, Antonini G. Subagging for credit scoring models. Eur J Oper Res. 2010;201(2):490-499.
8. Wang J, Hedar A-R, Wang S, Ma J. Rough set and scatter search metaheuristic based feature selection for
credit scoring. Expert Syst Appl. 2012;39(6):6123-6128.
9. Maldonado S, Weber R, Basak J. Simultaneous feature selection and classification using kernel-penalized
support vector machines. Inf Sci. 2011;181(1):115-128.
10. Chi B-W, Hsu C-C. A hybrid approach to integrate genetic algorithm into dual scoring model in enhancing
the performance of credit scoring model. Expert Syst Appl. 2012;39(3):2650-2661.
11. Huang C-L, Dun J-F. A distributed PSO–SVM hybrid system with feature selection and parameter optimiza-
tion. Appl Soft Comput. 2008;8(4):1381-1391.
12. Oreski S, Oreski G. Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert
Syst Appl. 2014;41(4):2052-2064.
13. Edla DR, Tripathi D, Cheruku R, Kuppili V. An efficient multi-layer ensemble framework with
BPSOGSA-based feature selection for credit scoring data analysis. Arab J Sci Eng. 2018;43(12):6909-6928.
14. Wang J, Guo K, Wang S. Rough set and tabu search based feature selection for credit scoring. Procedia Comput
Sci. 2010;1(1):2425-2432.
15. Liang D, Tsai C-F, Wu H-T. The effect of feature selection on financial distress prediction. Knowl Based Syst.
2015;73:289-297.
16. Sun J, Li H. Financial distress prediction using support vector machines: ensemble vs. individual. Appl Soft
Comput. 2012;12(8):2254-2265.
17. Marqués AI, García V, Sánchez JS. Two-level classifier ensembles for credit risk assessment. Expert Syst Appl.
2012;39(12):10916-10922.
18. Tripathi D, Edla DR, Cheruku R. Hybrid credit scoring model using neighborhood rough set and multi-layer
ensemble classification. J Intell Fuzzy Syst. 2018;34(3):1543-1549.
19. Abellán J, Castellano JG. A comparative study on base classifiers in ensemble methods for credit scoring.
Expert Syst Appl. 2017;73:1-10.
20. Kim SY, Upneja A. Predicting restaurant financial distress using decision tree and AdaBoosted decision tree
models. Econ Model. 2014;36:354-362.
21. Parvin H, MirnabiBaboli M, Alinejad-Rokny H. Proposing a classifier ensemble framework based on classifier
selection and decision tree. Eng Appl Artif Intell. 2015;37:34-42.
22. Ala'raj M, Abbod MF. A new hybrid ensemble credit scoring model based on classifiers consensus system
approach. Expert Syst Appl. 2016;64:36-55.
23. Ala'raj M, Abbod MF. Classifiers consensus system approach for credit scoring. Knowl Based Syst.
2016;104:89-105.

24. Bashir S, Qamar U, Khan FH. IntelliHealth: a medical decision support application using a novel weighted
multi-layer classifier ensemble framework. J Biomed Inform. 2016;59:185-200.
25. Verikas A, Kalsyte Z, Bacauskiene M, Gelzinis A. Hybrid and ensemble-based soft computing techniques in
bankruptcy prediction: a survey. Soft Comput. 2010;14(9):995-1010.
26. Bashir S, Qamar U, Khan FH, Naseem L. HMV: a medical decision support framework using multi-layer
classifiers for disease prediction. J Comput Sci. 2016;13:10-25.
27. Neumann F, Witt C. Bioinspired Computation in Combinatorial Optimization: Algorithms and Their Compu-
tational Complexity. Berlin, Germany: Springer-Verlag Berlin Heidelberg; 2010.
28. Chakravarthy H, Bachan P, Roshini P, Rajan KCh. Bio inspired approach as a problem solving technique. Netw
Complex Syst. 2012;2(2):14-22.
29. Duda RO, Hart PE, Stork DG. Pattern Classification. New York, NY: John Wiley & Sons; 2012.
30. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Hum Genet. 1936;7(2):179-188.
31. Mitchell TM. Machine Learning. Singapore: McGraw-Hill Boston, MA; 1997.
32. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural
Netw. 1989;2(5):359-366.
33. Svozil D, Kvasnicka V, Pospichal J. Introduction to multi-layer feed-forward neural networks. Chemom Intell
Lab Syst. 1997;39(1):43-62.
34. Waibel A, Hanazawa T, Hinton G, Shikano K, Lang KJ. Phoneme recognition using time-delay neural
networks. IEEE Trans Acoust Speech Signal Process. 1989;37(3):328-339.
35. MathWorks. MATLAB neural network toolbox. 2017. https://in.mathworks.com/help/nnet/ref/distdelaynet.
html
36. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81-106.
37. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273-297.
38. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21-27.
39. Grabisch M, Labreuche C. A decade of application of the Choquet and Sugeno integrals in multi-criteria
decision aid. Ann Oper Res. 2010;175(1):247-286.
40. Wang G, Zhao B, Li Y. Quantitative Logic and Soft Computing. Singapore: World Scientific Publishing Co Pte
Ltd; 2012.
41. Wang Z, Yan J-A. Choquet integral and its applications: a survey. Beijing, China: Academy of Mathematics
and Systems Science: CAS; 2006.
42. Grabisch M. The application of fuzzy integrals in multicriteria decision making. Eur J Oper Res.
1996;89(3):445-456.
43. Rokach L. Ensemble-based classifiers. Artif Intell Rev. 2010;33(1-2):1-39.
44. Bauer DF. Constructing confidence sets using rank statistics. J Amer Stat Assoc. 1972;67(339):687-690.
45. Cohen J, Cohen P, West SG, Aiken LS. Applied Multiple Regression/Correlation Analysis for the Behavioral
Sciences. Abingdon, UK: Routledge; 2013.
46. Bluman AG. Elementary Statistics: Brown Melbourne; 1995.
47. Neumann U, Riemenschneider M, Sowa J-P, Baars T, Kälsch J, Canbay A, Heider D. Compensation of feature
selection biases accompanied with improved predictive performance for binary classification by using a novel
ensemble feature selection approach. BioData Min. 2016;9(1):36.
48. Tsai C-F, Lin Y-C, Yen DC, Chen Y-M. Predicting stock returns by classifier ensembles. Appl Soft Comput.
2011;11(2):2452-2459.
49. Triantaphyllou E. Multi-criteria decision making methods. In: Multi-Criteria Decision Making Methods: A
Comparative Study. Boston, MA: Springer-Verlag; 2000:5-21.
50. Asuncion A, Newman DJ. UCI machine learning repository. 2007. https://archive.ics.uci.edu/ml/index.php
51. Nguyen HS. Discretization of Real Value Attributes, Boolean Reasoning Approach [PhD thesis]. Warsaw, Poland:
Warsaw University; 1997.
52. Ong C-S, Huang J-J, Tzeng G-H. Building credit scoring models using genetic programming. Expert Syst Appl.
2005;29(1):41-47.
53. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32.
54. Hogg RV, Ledolter J. Engineering Statistics. New York, NY: Macmillan Pub Co; 1987.

55. Hollander M, Wolfe DA. Nonparametric Statistical Methods. New York, NY: John Wiley and Sons; 1999.
56. Zhao Z, Xu S, Kang BH, Kabir MMJ, Liu Y, Wasinger R. Investigation and improvement of multi-layer
perceptron neural networks for credit scoring. Expert Syst Appl. 2015;42(7):3508-3516.
57. West D. Neural network credit scoring models. Comput Oper Res. 2000;27(11-12):1131-1152.
58. Bequé A, Lessmann S. Extreme learning machines for credit scoring: an empirical evaluation. Expert Syst
Appl. 2017;86:42-53.
59. Wang G, Hao J, Ma J, Jiang H. A comparative assessment of ensemble learning for credit scoring. Expert Syst
Appl. 2011;38(1):223-230.
60. Zhang D, Zhou X, Leung SCH, Zheng J. Vertical bagging decision trees model for credit scoring. Expert Syst
Appl. 2010;37(12):7838-7843.
61. Nanni L, Lumini A. An experimental comparison of ensemble of classifiers for bankruptcy prediction and
credit scoring. Expert Syst Appl. 2009;36(2):3028-3033.
62. Tsai C-F, Wu J-W. Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Syst
Appl. 2008;34(4):2639-2649.
63. Wongchinsri P, Kuratach W. SR-based binary classification in credit scoring. Paper presented at: 2017
14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and
Information Technology (ECTI-CON); 2017; Phuket, Thailand.
64. Hens AB, Tiwari MK. Computational time reduction for credit scoring: an integrated approach based on
support vector machine and stratified sampling method. Expert Syst Appl. 2012;39(8):6774-6781.
65. Huang C-L, Wang C-J. A GA-based feature selection and parameters optimization for support vector
machines. Expert Syst Appl. 2006;31(2):231-240.
66. Hu Q, Yu D, Liu J, Wu C. Neighborhood rough set based heterogeneous feature subset selection. Inf Sci.
2008;178(18):3577-3594.
67. Liu Y, Wang G, Chen H, Dong H, Zhu X, Wang S. An improved particle swarm optimization for feature
selection. J Bionic Eng. 2011;8(2):191-200.
68. Huang C-L, Chen M-C, Wang C-J. Credit scoring with a data mining approach based on support vector
machines. Expert Syst Appl. 2007;33(4):847-856.
69. Ping Y, Yongheng L. Neighborhood rough set and SVM based hybrid credit scoring classifier. Expert Syst Appl.
2011;38(9):11300-11304.
70. Xia Y, Liu C, Da B, Xie F. A novel heterogeneous ensemble credit scoring model based on bstacking approach.
Expert Syst Appl. 2018;93:182-199.
71. Sun J, Lee Y-C, Li H, Huang Q-H. Combining B&B-based hybrid feature selection and the
imbalance-oriented multiple-classifier ensemble for imbalanced credit risk assessment. Technol Econ Dev
Econ. 2015;21(3):351-378.

How to cite this article: Tripathi D, Edla DR, Cheruku R, Kuppili V. A novel hybrid
credit scoring model based on ensemble feature selection and multilayer ensemble classi-
fication. Computational Intelligence. 2019;35:371–394. https://doi.org/10.1111/coin.12200
