
Information Sciences 525 (2020) 182–204


Multi-view ensemble learning based on distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P lending
Yu Song a,1, Yuyan Wang a,1, Xin Ye a,∗, Dujuan Wang b, Yunqiang Yin c, Yanzhang Wang a

a School of Economics and Management, Dalian University of Technology, Dalian 116024, China
b Business School, Sichuan University, Chengdu 610064, China
c School of Management and Economics, University of Electronic Science and Technology of China, Chengdu 611731, China

Article history:
Received 8 March 2019
Revised 25 February 2020
Accepted 11 March 2020
Available online 15 March 2020

Keywords: Credit risk assessment; Peer-to-peer lending; Multi-view ensemble learning; Adaptive clustering; Distance-to-model

Abstract: Credit risk assessment is a crucial task in the peer-to-peer (P2P) lending industry. In recent years, ensemble learning methods have been verified to perform better in default prediction than individual classifiers and statistical techniques. Real-world loan datasets are imbalanced; however, most studies focus on enhancing overall prediction accuracy rather than improving the identification of real default loans. Moreover, some features that are significantly correlated with default rates were not given due importance in the model construction of previous studies. To fill these gaps, we propose a distance-to-model and adaptive clustering-based multi-view ensemble (DM–ACME) learning method for predicting default risk in P2P lending. In this method, multi-view learning and an adaptive clustering method are explored to produce an ensemble of diverse ensembles constituted by gradient boosting decision trees. A novel combination strategy called distance-to-model and a soft-probability output form are embedded for model integration. To verify the effectiveness of the proposed ensemble approach, a comprehensive analysis of DM–ACME, comparative experiments with several state-of-the-art methods, and a feature importance evaluation are conducted on data provided by Lending Club. Experimental results demonstrate the superiority of the proposed method and indicate the importance of certain features in loan default prediction.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

In recent years, the peer-to-peer (P2P) lending industry has been showing a vigorous global development trend, which
provides a convenient way for individuals to borrow and invest money online directly without complicated procedures. How-
ever, though P2P lending simplifies the credit process, it causes higher credit risk relative to lending transactions through
traditional financial institutions, which is commonly presented as a higher default rate. Thus, constructing effective credit
risk assessment models to predict loan default probability has become a crucial challenge and a significant task for theoretical research and practical application in the P2P lending field. Particularly, in real-world application, even a 1% improvement in the capability of recognizing applicants with bad credit could greatly enhance capital security for both P2P lending platforms and investors [18].

∗ Corresponding author. E-mail address: yexin@dlut.edu.cn (X. Ye).
1 The first two authors contributed equally to this study and share first authorship.

https://doi.org/10.1016/j.ins.2020.03.027
0020-0255/© 2020 Elsevier Inc. All rights reserved.
Generally, credit risk in the financial industry reflects the risk that lenders may not get their principal and interest back,
usually resulting from the borrower’s failure to pay back the loans. Credit risk assessment, also known as "credit scoring"
or "credit ranking", is an essential process of credit risk calculation. In most data-driven studies, credit risk assessment
is regarded as a binary classification problem, wherein the loan payment status is dichotomous, and fully paid loans are
denoted as "0" while default loans are denoted as "1". The main purpose of credit risk assessment is to identify the re-
lationship between the final payment status and all sorts of loan information. In the early stage, credit risk was assessed
based on practitioners’ professional knowledge and experience. Subsequently, statistical technique-based methods became a
preferable choice for loan risk evaluation due to their better default prediction performance and interpretability compared
to that of expertise-based subjective judgment methods [18]. Nevertheless, they fail to uncover the possible sophisticated
non-linear effects among different variables in credit risk assessment, especially amid today’s booming loan market.
Over the past two decades, machine learning methods have been explored and exploited to predict loan default probability. From single models to ensemble learning models, machine learning methods for credit risk assessment have gradually matured. In recent years, ensemble classification methods have become a preferred choice for constructing loan default prediction models. In an ensemble, multiple base learners are combined to generate an improved one, by
which decisions can be made accurately [44]. Regarding credit risk classification, real default identification is of paramount
importance for financial institutes who provide loan services, especially in the P2P industry, because default loans greatly
influence the operation and reputation of P2P platforms and can cause a capital loss for individual investors. However, a
great percentage of the existing studies put their efforts on improving the overall prediction accuracy rather than enhancing
the capability of real default identification (i.e., models’ prediction sensitivity). This gap concerning real default identification
should gain more attention especially when considering that most real-world credit loan datasets, which mostly comprise
good loan records, are characterized by significant class imbalance.
Many studies have explored ensemble learning methods to solve imbalanced classification problems [16,42,43,46], among which increasing the diversity of ensembles has been demonstrated to be an effective approach to significantly enhance the prediction capability of ensemble methods on imbalanced datasets. Multi-view learning is dedicated to creating multiple views (i.e., feature subspaces) that are as independent as possible [17]. For multi-view supervised learning methods, distinct individual models can be generated from different views [49], which means multi-view learning can be regarded as an efficient way to produce ensemble diversity. With this in mind, multi-view learning is introduced to design an ensemble learning method for solving imbalanced credit risk classification problems.
To be concrete, in order to improve the ability of default loan discrimination under imbalanced data distribution, this
paper proposes a distance-to-model and adaptive clustering-based multi-view ensemble (DM–ACME) classification method.
During the process of base learner construction, multi-view learning and adaptive clustering are synthesized to enhance the
diversity of base learners. Based on the main idea of multi-view learning, loan attributes are first partitioned into several
complementary groups as multiple views, each of which can describe loan records from a different perspective. For each
view, a novel accuracy and diversity-based adaptive clustering method is investigated to generate data subsets competent
for training accurate and diverse ensemble members. Thus, the base learners of each view are trained with the samples in different clusters, where one cluster corresponds to one base learner. Regarding base learner construction in the credit risk assessment field, previous studies often treated the attributes as a single whole to train base learners, without analyzing and utilizing loan attributes from different views. With multi-view learning [49], we aim to learn a function for each view and
then combine all the functions to enhance the generalization performance. The experimental results demonstrate that the
incorporation of multi-view learning and the proposed adaptive clustering can evidently contribute to the enhancement of
the prediction sensitivity, i.e., the precision of real default identification. In the model combination stage, different from
the existing studies which use majority voting or entire performance-based weighted voting, we employ a soft probability
and distance-to-model-based fusion strategy to integrate the prediction results. This combination strategy can guarantee the
precision of each classifier’s output and provide a dynamic weight assignment mechanism considering the prediction ability
of each predictor for a certain sample.
Besides the exploration of the improved loan default prediction method, we take another issue into account when constructing the prediction model. To the best of our knowledge, most existing data-driven credit risk assessment studies ignore the importance of utilizing business-oriented experience during the model construction process. What draws our attention most is the potential impact of insufficient occupational information on the default rate, a common phenomenon in loan datasets. Some researchers have claimed that the more comprehensive the information borrowers can provide, the more credible the assessment results will be [12]. Specifically, according to the professional analysis
of several proficient credit risk assessment practitioners and other research findings [12,24], the occupational information
of the loan applicants is of great account when judging whether loans are credible or not. According to [24], the average
default rate of borrowers with unstable jobs is five percent higher than that of borrowers with stable jobs.
However, the instances involving insufficient information such as job title (i.e., the job a person pursues) and employment
length (i.e., the number of years a person has worked for a certain employer or company) have not been paid enough
attention in many previous studies. In some credit risk assessment studies [4,33,46] using the data from Solidarity Tunisian
Bank [4], Lending Club [33], and a Chinese bank [46], the records without job title or employment length information were
removed. As a consequence, the constructed models cannot handle any new loan application without complete occupational
information. Another common operation [1] is to discard the features describing occupational information, leaving high-default-rate instances indistinguishable from the others. This study constructs a functional credit risk assessment model in which loans with complete and with only partial occupational information are differentiated by constructing new features. Note that the proposed model can assess loans of any kind, regardless of how much occupational information is available.
For clarity, the contributions of this study are summarized as follows.
First, a novel ensemble learning method is proposed to address the imbalanced classification of P2P credit risk, in which
multi-view learning is introduced and further synthesized with a newly developed adaptive clustering technique to create
diversity for the ensemble construction.
Second, for each view, an adaptive clustering method is designed to generate diverse and accurate base learners by
finding the best-fit number of clusters through an accuracy and diversity-oriented optimization procedure.
Third, in the combination stage, besides the utilization of a novel dynamic weighted combination strategy named distance-to-model, we exploit soft probability as the output form of each intermediate classifier, which prevents implicit information in the decision-making process from being lost. To the best of our knowledge, this is the first time the two have been combined as an ensemble fusion strategy.
Finally, extensive experiments are carried out to demonstrate the proposed method's generalization ability in identifying default loans and its superiority over other popular methods. The proposed ensemble model for credit risk assessment is verified to be able to handle default prediction for the significant number of existing loan applicants without complete occupational information.
The rest of the paper is structured as follows. Section 2 reviews credit risk assessment methods and related work on
ensemble learning methods. Section 3 presents the details of the proposed DM–ACME method. Section 4 describes data
preparation, performance metrics, experiment design and parameter settings, and displays experimental results together
with analysis in detail. Finally, Section 5 concludes the paper and suggests future work.

2. Related work

In this section, credit risk assessment methods and related work on improved ensemble learning algorithms are reviewed.

2.1. Credit risk assessment methods

The realm of credit risk assessment involves a great many credit risk prediction methods, which can be grouped
into three stages of development, namely expertise-based assessment, statistical technique-based evaluation, and machine
learning-based estimation. At the very beginning, the credit risk was mainly evaluated by expertise-based techniques. The
evaluation process is guided by subjective judgment, making the assessment a product of the estimators’ professional knowl-
edge and experience. Baklouti and Baccar [4] ascertained that it is challenging even for experienced loan officers to make a
sensible decision in most circumstances. With the advancement of data science, more efficient and effective statistical model-based techniques for loan credit risk appraisal have been developed. Some studies proved that statistical techniques, such as
logistic regression (LR) [13,18], outperform expertise-based approaches in credit risk evaluation. Emekter et al. [13] utilized
LR to analyze the relationship between attributes and default probability, and employed the actual default risk to measure
the reliability of the analysis results. Although these statistical methods are conducive to analyzing the linear relations between potential factors and default probability, they are inadequate for uncovering the possible sophisticated non-linear effects among different variables in credit risk assessment, especially given the current booming loan market. Over the past two
decades, machine learning-based models have been explored and exploited for loan default prediction. Previous studies have
verified that machine learning methods manifest better prediction performance than conventional statistical methods.
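As an illustration of the statistical baseline discussed above, the sketch below fits a logistic-regression scorecard to synthetic data; the features and labels are hypothetical stand-ins (e.g. for income or debt-to-income), not taken from any dataset cited here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic loan data: three numeric features, binary label (1 = default).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 1] - X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Fit the scorecard and read off per-applicant default probabilities.
model = LogisticRegression().fit(X, y)
default_prob = model.predict_proba(X)[:, 1]
```

The linear coefficients make such a model easy to interpret, which is the main reason statistical techniques remained popular before machine learning methods took over.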
A significant number of machine learning algorithms, including Artificial Neural Network (ANN) [3], Decision Tree (DT)
[35], and Support Vector Machine (SVM) [47] have been used as alternative approaches for credit risk estimation. For ex-
ample, Serrano-Cinca et al. [35] employed DT to figure out the non-linear relationship between the explanatory variables
and the target variable in the credit risk assessment of P2P lending. However, single classification models cannot satisfy the requirement for accurate prediction models with good generalization ability when handling the tremendous volumes of instances and patterns of the big data era. There is a mounting tendency for ensemble learning methods, which combine
multiple base learners to form an improved model, to be applied in credit risk evaluation model construction because of
their well-recognized prediction capability. Twala [38] designed and carried out a series of comparison experiments based
on static parallel, multi-stage, or dynamic classifier selection ensemble learning architectures, and several commonly used individual learning techniques, such as DT and ANN. The experimental evaluation showed that ensemble techniques had the
capability to improve prediction precision. Wang et al. [44] utilized random forest (RF) to predict default probability and
the specific default time of loan applicants in P2P lending, which showed a better AUC than compared models, including
standard mixture cure model, Cox proportional hazards model, and logistic regression.
As imbalanced data distribution commonly exists in real-world credit loan datasets [10], many efforts have also been put
into applying imbalanced classification techniques to the P2P credit risk assessment problem. Prediction errors for minority class samples tend to increase because classifiers generally favor the accurate classification of majority class instances, i.e., the trustworthy loan records, neglecting the main purpose of identifying default loans. The investigation of imbalanced classification models is mainly characterized by three types of approach: data-level manipulations, algorithm-level techniques, and ensemble learning strategies [16].
Data-level approaches focus on adjusting or modifying the original distribution of the training datasets by adding an extra
data preprocessing step. Oversampling and undersampling are the most common strategies which are widely used in many
studies [48]. For example, the synthetic minority oversampling technique (SMOTE) [9] is a classical oversampling method
which aims to create new minority samples. Despite the wide utilization of SMOTE, it still has some drawbacks due to its sampling principles, especially the blind oversampling process and its dependence on the original data distribution. Therefore, a series of improved methods have been proposed, such as SMOTE–ENN [32], in which an edited nearest neighbor (ENN) process is added to remove unwanted instances after the SMOTE sampling step. On the basis of SMOTE–ENN, Pang et al. [30] further proposed a novel imbalanced learning method named AWGSENN, in which each minority sample is assigned an oversampling weight and the quality of the rebalanced dataset is effectively ameliorated.
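The core interpolation step of SMOTE can be sketched in a few lines: each synthetic sample is placed on the segment between a random minority instance and one of its k nearest minority neighbours. This is a simplified illustration of the idea, not the reference implementation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    minority instances and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class; ignore self-distances.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    idx = rng.integers(0, len(X_min), n_new)   # base minority samples
    nbr = nn[idx, rng.integers(0, k, n_new)]   # a random neighbour of each base
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[idx] + gap * (X_min[nbr] - X_min[idx])

X_min = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote_sketch(X_min, n_new=30)
```

Because every synthetic point lies between two existing minority points, the method inherits the "blind oversampling" weakness noted above: it can amplify noise near the class boundary, which is what the ENN cleaning step in SMOTE–ENN addresses.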
Studies at the algorithm level mainly focus on magnifying the role of the minority instances in the classification process.
Among these methods, cost-sensitive learning is one of the most typical methods. Different from many conventional al-
gorithms, which assume misclassification errors of the majority and minority classes have equal costs, the cost-sensitive
learning algorithms are designed to reassign more reasonable misclassification costs for each class. Studies have shown that
even the simplest combination of cost-sensitive learning and certain classification method, such as the support vector ma-
chine cost-sensitive (SVMCS) [40] and C4.5 decision trees cost-sensitive (C4.5CS) [37] algorithms, can improve the models’
predictive ability when handling imbalanced classification. Also, improved cost-sensitive based algorithms have been pro-
posed. For example, Peng et al. [32] proposed an interesting approach named Imbalanced DGC (IDGC) for imbalanced data
classification based on the data gravitation-based classification (DGC) [31] method. They introduced a parameter called the amplified gravitation coefficient (AGC) to strengthen or weaken the gravitation between the minority or majority classes and a certain unlabeled instance. However, manipulations only of the input data distribution, or algorithms that merely adjust the weights of hard-to-classify samples, produce limited improvement in the generalization ability of credit risk prediction models.
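A minimal illustration of cost-sensitive learning with a decision tree follows; the 1:9 cost ratio and the synthetic data are assumptions for demonstration only, and in practice the ratio would be tuned or derived from actual misclassification losses.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data: roughly 12% of labels are defaults (class 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=800) > 1.2).astype(int)

# Cost-insensitive baseline vs. a tree that charges 9x for missing a default.
plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
costly = DecisionTreeClassifier(max_depth=3, random_state=0,
                                class_weight={0: 1, 1: 9}).fit(X, y)
```

The weighted tree flags more applicants as defaulters, trading some overall accuracy for higher sensitivity on the minority class, which is exactly the behaviour cost-sensitive credit models aim for.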
In recent years, ensemble learning methods have shown an obvious advantage in imbalanced data classification tasks, by
which misclassification error can be reduced by combining the decisions of several base learners. Moreover, data-level resampling strategies or cost-sensitive methods can be embedded into ensemble learning methods. In EasyEnsemble [26], a typical hybrid method, an undersampling process and AdaBoost classifiers are combined to train the ensemble. Ensemble learning has also been utilized to handle imbalanced problems in the credit risk management realm. For example, Yu
et al. [48] proposed a resampling SVM ensemble learning method based on a deep belief network for credit risk evalua-
tion, in which 20 diverse SVM individual learners were generated by randomly resampled training sets, and accuracy was
employed to evaluate the advantage of the proposed method.
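The EasyEnsemble idea described above can be sketched as follows; this is a simplified reading of [26] on synthetic data, not the authors' implementation: each balanced subset keeps every minority sample plus an equal-sized random draw of majority samples, and one AdaBoost model is trained per subset.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_sketch(X, y, n_subsets=5, seed=0):
    """Train one AdaBoost model per balanced, undersampled subset."""
    rng = np.random.default_rng(seed)
    minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    models = []
    for _ in range(n_subsets):
        drawn = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, drawn])
        models.append(AdaBoostClassifier(n_estimators=20,
                                         random_state=0).fit(X[idx], y[idx]))
    return models

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 1.0).astype(int)  # imbalanced

models = easy_ensemble_sketch(X, y)
# Fuse the ensembles by averaging their soft default probabilities.
avg_default_prob = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Averaging over several independently undersampled subsets lets the method use most of the majority class overall while each base learner still trains on balanced data.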
Despite the plethora of application research of machine learning methods in credit risk assessment, most of the work
overlooked the investigation on some crucial factors from the business-oriented perspective, where the representative ones
may be the effects of borrowers’ job title and employment length as mentioned in Section 1. In contrast, this study discerns
and analyzes how the presence and absence of loan applicants’ occupational information affects the default rates. Unlike
the previous studies [4,33,46], we analyze and retain the abundant loan records without complete occupational information, which have high default rates, and distinguish them by creating new features when constructing prediction models. On
the other hand, many relevant studies [38,48] emphasized the improvement of prediction accuracy, while ignoring the im-
portance of models’ sensitivity in credit risk assessment tasks, which reflects the ability of real default identification. This
study attaches importance to the real default identification ability of the prediction model as well as the overall general-
ization ability. Moreover, most of the real-world credit loan datasets are highly class imbalanced [10], which demand more
generalized methods to construct prediction models, arousing further research on the improvements in ensemble learning
methods.

2.2. Improved ensemble learning methods

Enhancing model performance is a crucial issue to be considered in ensemble learning. It is universally acknowledged
that the accuracy of individual learners and the diversity among ensemble members are the cornerstones of ensemble learn-
ing. The main incentive for devising ensemble learning methods is to improve generalizability and robustness by creating
diversity between different ensemble members. There are two primary stages for producing diversity in the course of ensemble model construction, namely the base learner generation and integration stages, where researchers can exploit their creativity to make improvements.
The primary objective of the base learner generation stage is to produce individual learners as diverse as possible. Data
manipulation is one of the common and helpful techniques. Bagging and Boosting are two of the most representative meth-
ods. To be specific, in bagging, numerous different subsets are randomly drawn with replacement from the entire input
space; while in boosting, samples in the original training data are selected according to their weights learned from the
previous iterations. As an alternative technique, choosing different feature subsets from the complete data is an extensively
used approach for training base classifiers [28]. For example, random forest (RF) [11] takes the random feature subsets as the
optional splitting nodes of decision trees; the random subspace (RS) method constructs distinct feature subsets for training
different base learners. Wang et al. [41] further proposed two ensemble strategies, RS–Bagging DT and Bagging–RS DT, for
credit risk evaluation to increase the diversity of base learners. These two strategies achieved better prediction performance, especially in accuracy, than five single classifiers and four popular ensemble classifiers. However, these methods are inadequate for handling imbalanced classification problems, as they usually perform well in majority-class identification but poorly in minority-class identification.
Another sort of technique to achieve the desired diversity between base learners relies on efforts made at the algorithm level. As various algorithms are available for training base classifiers, a common way to create diversity is to construct heterogeneous ensembles with two or more different algorithms. Xia et al. [45] constructed a heterogeneous ensemble model for credit risk assessment, in which SVM, RF, a Gaussian process classifier, and XGBoost are employed to train diverse base classifiers. However, it is undeniable that selecting the appropriate algorithms for establishing heterogeneous ensembles is a mammoth task. Tuning the parameters of the base algorithm is another common approach to creating different ensemble members, for example, tuning the number of hidden neurons or training iterations of neural network algorithms. However, this strategy is applicable only to parameter-sensitive algorithms. In recent years, a series of ensemble
methods which combine clustering approaches and classification techniques have emerged as a way to deal with imbalanced
classification problems and achieved desirable performance [27,46]. Clustering can be employed to provide diverse training
sets for creating different base learners. Luo [27] proposed a clustering-launched ensemble classification method, in which
a clustering process realized by K-means was employed to provide efficient classifier training. Xiao et al. [46] proposed an
ensemble classification method for credit risk assessment based on supervised clustering, in which K-means was utilized
to partition the samples of the same category (fully paid or default), and various training subsets were then obtained by
randomly selecting one cluster from each category and combining them. In this study, to create diversity, we introduce an
unsupervised clustering-based base learner generation approach. Furthermore, multi-view learning is proposed to cooperate
with clustering-based model creation, whereby each feature view can produce a diverse ensemble. As unsupervised clustering is always accompanied by strong randomness, especially in the number of clusters, an accuracy and diversity-based adaptive clustering method is developed.
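The clustering-driven base-learner generation idea can be sketched as follows; this simplified illustration fixes the number of clusters in advance and omits the adaptive, accuracy and diversity-driven search for the best cluster count described later in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier

def clustered_base_learners(X, y, n_clusters=3, seed=0):
    """K-means partitions the training data; one gradient-boosting base
    learner is fitted per cluster (one cluster -> one base learner)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    learners = {}
    for c in range(n_clusters):
        mask = km.labels_ == c
        if np.unique(y[mask]).size < 2:   # guard against single-class clusters
            continue
        learners[c] = GradientBoostingClassifier(
            n_estimators=30, random_state=seed).fit(X[mask], y[mask])
    return km, learners

rng = np.random.default_rng(0)
X = rng.normal(size=(450, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=450) > 0).astype(int)
km, learners = clustered_base_learners(X, y)
```

Each learner specializes in one region of the input space, so the resulting base models are diverse by construction, which is precisely the property the adaptive clustering step then optimizes.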
There also exist abundant studies attempting to heighten the capability of ensemble models by taking full advantage of the generated base learners through effective combination strategies. According to [34], the predominant combination methods
for base learner integration involve meta-learning-based methods and weighting-based methods. Meta-learning [34] refers
to learning from new data generated by base-learners.
Meta-learning-based combination strategies, such as stacking, are best suited for situations in which certain instances
are consistently correctly classified or misclassified by certain classifiers. In weighting methods, each component is assigned a fixed or dynamic weight; majority voting and weighted voting are the most common weighting methods applied
in ensemble classification research [1,34]. For example, Zhou et al. [50] employed majority voting to make final decisions,
where each classifier had the same weight. While in weighted voting [14,46,48], the prediction reliability of base learners
is considered when assigning weights. Feng et al. [14] proposed a dynamic ensemble method for credit risk assessment, in
which the weights of base classifiers were allocated according to their performance on the validation set.
However, conventional weighted voting cannot display the relative prediction performance of each classifier to a cer-
tain sample. In recent years, an innovative weight assignment fashion, distance-to-model-based method [22], has emerged
in some clustering-based classification research. In contrast to the reliability-based assignment principle, the distance-to-
model-based assignment strategy employed in this study is a dynamic distribution manner, and the weight of each classifier
for a certain observation is calculated by relative distance. In addition, as far as we know, in many ensemble learning studies
[12,17,35,46,47,49], concrete labels are taken as the output form of intermediate classifiers. This may neglect some valuable information, and the loss is magnified at each level of ensembling. To avoid this issue, we employ soft probability [14] as the intermediate output form of classifiers, which intuitively presents the probability of belonging to each category.
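One plausible reading of distance-to-model weighting combined with soft-probability outputs can be sketched as follows; the inverse-distance formula is our illustrative assumption, not necessarily the exact definition used in [22] or in this paper.

```python
import numpy as np

def distance_to_model_fuse(probas, dists, eps=1e-8):
    """Fuse soft probabilities with per-sample, distance-based weights.

    probas: (n_models, n_samples) soft default probabilities from each model
    dists:  (n_models, n_samples) distance of each sample to each model
            (e.g. to the centroid of the model's training cluster)
    """
    w = 1.0 / (dists + eps)               # closer model -> larger weight
    w /= w.sum(axis=0, keepdims=True)     # weights sum to 1 per sample
    return (w * probas).sum(axis=0)       # weighted soft-probability fusion

# Two models, two samples: each sample leans toward its nearer model.
probas = np.array([[0.9, 0.2],
                   [0.1, 0.8]])
dists = np.array([[1.0, 3.0],
                  [9.0, 1.0]])
fused = distance_to_model_fuse(probas, dists)
```

Unlike a fixed performance-based weight, the weight matrix here changes per sample, so a model only dominates the fusion for observations that lie close to it.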

3. Methodology

Considering the significance of accuracy and diversity in ensemble learning, this study devises a novel classification
method DM–ACME to enhance the overall precision of credit risk assessment in P2P lending. We employ multi-view learn-
ing to produce several distinct ensembles based on different aspects of loan applicants’ features. For each ensemble, un-
supervised clustering is investigated to generate multiple data clusters for obtaining diverse base classifiers. The ensemble
decisions are then aggregated based on the distance of the samples to the base models, namely, the base classifiers in each
view and the assembled classifier of each view for the outermost ensemble.
The framework of the proposed method is illustrated in Fig. 1, where V stands for the number of views obtained by multi-view segmentation, K_V stands for the number of clusters in the V-th view, C^V_{K_V} denotes the K_V-th cluster of the V-th view, and M^V_{K_V} denotes the base learner constructed based on the samples in cluster C^V_{K_V}. In Fig. 1, the model training process is shown by red arrows, and the model testing process is marked by blue arrows.
As the figure shows, the proposed method involves the following main tasks. The first is multi-view segmentation, the
objective of which is to train diverse models to make more reasonable decisions by making full use of data features from
different perspectives. The second is the iterative process of adaptive clustering-based model construction, which aims to
obtain accurate and diverse ensemble members for each view. A new accuracy and diversity-based adaptive learning ap-
proach is thus explored, which consists of three main parts: the K-means and KNN-based clustering for distinct training subset generation; the base learner training procedure; and an accuracy and diversity-based adaptive learning process for determining the optimal number of clusters. The third is the dynamic weight distribution strategy based on distance-to-model, which is designed for the integration of the base learners in each view and the ensemble models of the multiple views. Here, soft probability is taken as the output form of the intermediate classifiers, which can provide more information for decision-making. The mechanisms of the proposed method are elaborated in the subsections below.

Fig. 1. Framework of methodology.

3.1. Multi-view learning

Many real-world problems are represented by multi-dimensional datasets reflecting various or complementary aspects of
objects, referred to as "multi-view data". A naive operation concatenates the data of different views, which is usually col-
lected from different sources and reflects different kinds of information, into a single view, and trains models with machine
learning algorithms. One evident drawback of this process is its neglect of the statistical properties of the data from each
view. Even when a dataset shows no obvious multi-view attributes, artificially partitioning multiple views could achieve
better performance than what is possible using the original single view [49].
In recent years, the multi-view learning concept has appeared in an assortment of ensemble learning studies and has
been verified as an effective technique for boosting the generalizability of predictive models [41,49]. Feature sets are nor-
mally divided into multiple views according to the apparent categorical characteristics of features. For example, image data,
which includes two different feature sets, i.e., those describing color and those depicting texture, could be treated as two-
view data [39]. The other techniques involve feature manipulation methods such as principal component analysis (PCA)
[17] and canonical correlation analysis (CCA) [19]. No matter how we create the multiple views, the core objective of multi-
view learning is to separate features into several subspaces that are as complementary as possible wherein each feature
subspace corresponds to a view. Multi-view learning seeks to find multiple independent feature subspaces that can offer
compatible and complementary information to portray instances from different perspectives, making it an effective way to
promote diversity when constructing ensemble models. Meanwhile, enhancing diversity has been demonstrated to improve
the capability of ensemble learning models to handle imbalanced classification problems. Thus, multi-view learning is clearly
a good choice for constructing imbalanced ensemble classification methods. Based on this consideration, this paper
introduces multi-view learning to generate diverse ensemble members, which are also ensemble models. As the proposed
ensemble method tends to construct an ensemble of ensembles, the role of multi-view learning is to create diversity in the
outermost ensemble with the support of the distinction between different views, while the following introduced adaptive
clustering serves for producing diversity in the ensemble of each view.
Credit loan datasets are typical multi-view datasets, in which features fall into several groups [20]. Each feature category
represents a detailed description of loan records or loan applicants from a specific perspective. This characteristic of credit
loan data provides a firm foundation for the application of multi-view learning in credit risk assessment tasks. For credit
loan data D, suppose the features can be divided into V views. Denote the data matrix of D as X = {X^1, X^2, ..., X^V} ∈ R^(n×d),
where n is the sample size of the data and d stands for the data dimension. Similarly, let X^v ∈ R^(n×d(v)) be the data matrix
of the v-th view with d(v) dimensions, where Σ_{v=1}^{V} d(v) = d. Note that credit loan datasets from different financial
institutions are not exactly alike, so the views also vary in name. The view separation results for the dataset utilized in this
study are shown in Section 4.1.
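As a concrete illustration, the view partition can be sketched in a few lines. The view names and column assignments below are hypothetical, chosen only for illustration; the actual partition used in this study is given in Section 4.1.

```python
import numpy as np

def split_views(X, view_cols):
    """Partition a data matrix X (n x d) into V view matrices X^v (n x d(v)).

    view_cols maps a view name to the column indices belonging to that view;
    together the index lists must cover every one of the d columns exactly once.
    """
    covered = sorted(i for cols in view_cols.values() for i in cols)
    assert covered == list(range(X.shape[1])), "views must partition all d features"
    return {name: X[:, cols] for name, cols in view_cols.items()}

# Hypothetical 5-feature loan matrix split into two made-up views.
X = np.arange(20, dtype=float).reshape(4, 5)
views = split_views(X, {"loan_info": [0, 1, 2], "credit_history": [3, 4]})
```

Each returned matrix plays the role of one X^v, and the assertion enforces Σ d(v) = d.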

3.2. Adaptive clustering-based model construction

A basic principle behind ensemble learning is that the classifiers that participate jointly in the decision-making process
should have the capacity to provide different individual opinions [21]. Consequently, in addition to multi-view learning, we
also construct a group of diverse base classifiers for each view through a novel adaptive clustering-based model generation
process. Specifically, an unsupervised clustering technique is employed to partition the training set into multiple data clusters
for each view. On this basis, each data cluster can be utilized to generate a classifier, and thus each view can attain an
ensemble of the constructed classifiers.
It is acknowledged that two crucial problems need to be solved during a clustering process—how to obtain an opti-
mal number of clusters and how to determine the cluster center [6], which can decide the clustering results directly. Once
clusters are formed, base classifiers can be constructed separately based on the instances of each cluster. Researchers have
claimed that classification performance is highly reliant on the related training dataset [36]. For an imbalanced dataset,
good clustering results help ensure that the class distribution of the samples within each cluster is reasonably balanced
rather than extremely skewed. In order to improve the prediction performance of the base classifiers, it is necessary
to ensure that the instances belonging to the same cluster are compact whereas instances from different groups are well
separated [46]. This can easily be realized by increasing the number of clusters; however, there is no evidence that a greater
number of clusters produces better prediction performance. Moreover, trying all possible numbers of clusters manually
would require considerable effort. Therefore, an adaptive clustering-based model construction method is proposed to handle
the ensemble learning process for each view. Specifically, we embed K-means and KNN in the adaptive clustering to produce
data clusters and take the gradient boosting decision tree (GBDT) as the base learning algorithm to train base classifiers.
Furthermore, an adaptive learning mechanism is developed to complete the whole learning procedure. The following
subsections explain the critical components of the adaptive learning process.

3.2.1. K-means and KNN-based clustering


To start with, a suitable clustering algorithm needs to be chosen to obtain data clusters for each view. As one of the most
popular clustering techniques, K-means has been used to resolve many real-world clustering problems due to its simplicity

and efficiency [27]. Also, it is particularly suitable for partitioning large datasets into distinct clusters, which is imperative
for training diverse classifiers. Therefore, we adopt K-means for the unsupervised clustering process.
Via K-means, samples are divided into several disjoint clusters, and each cluster is described by a mean vector u_k
(where k indexes the k-th cluster), commonly called the cluster centroid. The K-means algorithm chooses the centroids
that minimize its objective function, namely the within-cluster sum of squares criterion, which is defined as

min Σ_{k=1}^{K} Σ_{x_o ∈ c_k} || x_o − u_k ||²    (1)

where K is the prescribed number of clusters, which needs to be specified in advance, x_o is a sample belonging to cluster
c_k, and u_k is the centroid of cluster k. Taking Eq. (1) as the objective, K-means proceeds in the following steps. First, K
samples are randomly selected as the initial centroids. Subsequently, each sample is assigned to its closest centroid. Then,
the previous centroids are replaced by the means of the samples assigned to each cluster. The last two steps are repeated
until the cluster centroids no longer change. After the unsupervised clustering process, we obtain multiple training subsets for
each view. However, the base classifier’s predictive power will be weakened when the training set size is relatively small.
According to [5,23], the misclassification rate of classifiers begins to increase when the training size is less than 100. So, in
order to ensure the performance of each classifier, instead of directly using the training samples of each cluster to build a
base classifier, a KNN-based redistribution process is introduced to reassign the samples of small-sample-size clusters into
their corresponding nearest larger clusters when the sample size of a cluster is less than 100. K-Nearest Neighbor (KNN)
is an effective pattern recognition method, especially for classification or regression problems, wherein an observation's
class membership is determined according to information from its neighbors. Here, the main task of the redistribution
process is to find the K_knn nearest neighbors of each sample to be redistributed; the cluster to which the sample should
be reassigned is then obtained by a majority vote of its neighbors, where the integer parameter K_knn needs to be specified
in advance. The process of the K-means and KNN-based clustering in this study is shown in Fig. 2.
Suppose we have training data D_tr = {x_i, y_i}_{i=1,...,N_tr} = [X_tr Y_tr], where X_tr is an N_tr × d matrix containing N_tr
training samples and d attributes, and Y_tr is a column vector containing the class labels. According to the description of
multi-view learning in Section 3.1, the multi-view partition results of training data D_tr can be denoted as data matrices
X_tr^v ∈ R^(N_tr×d(v)), in which d(v) is the feature dimension of the v-th view, and Σ_{v=1}^{V} d(v) = d. Let the data
belonging to matrix X_tr^v be partitioned into Γ clusters {C_v1, ..., C_vγ, ..., C_vΓ} by K-means. The K_v clusters
{C_1^v, ..., C_ζ^v, ..., C_{K_v}^v} can be obtained after the KNN-based sample reassignment process. Briefly, we use a set of
matrices {X_ζ^v ∈ R^(N_ζ^v×d(v)) | ζ = 1, 2, ..., K_v} to represent the clustering results of the v-th view, where N_ζ^v is the
sample size of cluster C_ζ^v, and Σ_{ζ=1}^{K_v} N_ζ^v = N_tr.

3.2.2. Base classifier construction


In this subsection, we aim to obtain a group of base classifiers for each view based on the corresponding sub-datasets.
To ensure the accuracy of the classifiers, we exploit the complete feature space to construct base models instead of the
data with a feature subspace. To this end, the training subsets for a specific view are built by data mapping, which maps
the low-dimensional data into the high-dimensional space. In fact, the original indexes of the instances in the entire dataset
do not change even though the view partition changes the data dimensions. Therefore, the required training data with the
complete feature space can be procured by fetching records according to the instance indexes belonging to each view.
Corresponding to the formal description of the clustering results {X_ζ^v ∈ R^(N_ζ^v×d(v)) | ζ = 1, 2, ..., K_v}, the training
subsets can be represented as a set of matrices {(X_tr, Y_tr)_ζ ∈ R^(N_ζ^v×(d+1)) | ζ = 1, 2, ..., K_v}, where
(X_tr, Y_tr)_ζ represents the training instances of cluster C_ζ^v assigned by data mapping.
In this study, to deal with the imbalanced loan data, the gradient boosting decision tree (GBDT) [15] and the simplest
oversampling technique (randomly resampling the minority class until the numbers of both classes are equal) are combined
the base algorithm to train base learners for each view. As GBDT takes regression trees as individual learners, we treat re-
gression trees as sub-base learners for the proposed ensemble model. As an effective off-the-shelf ensemble method, GBDT
is competent for various classification and regression tasks. Unlike common ensemble techniques such as AdaBoost and RF,
the final estimation is calculated in a typical forward stage-wise fashion in GBDT, where the newly generated base learner
should be maximally correlated with the negative gradient of the loss function. In other words, in each iteration, GBDT
intends to build a new regression tree that can decrease the error produced by its previous round. In this study, the loss
function is calculated as
Loss(y, F(x)) = ½ (F(x) − y)²    (2)
where F signifies the constructed model. On this basis, we can compute the negative gradient of the i-th iteration by
−∂Loss(y, F_{i−1}(x)) / ∂F_{i−1}(x) = y − F_{i−1}(x)    (3)

Fig. 2. Process of K-means and KNN-based clustering for view v.

where F_{i−1} denotes the model obtained in the (i−1)-th iteration. Then a new weak learner is trained to fit the difference between
the real values and the predictions derived from the last iteration. When the iterations end, the weak learners are aggregated
to be the final ensemble model.
Real-world loan datasets are usually imbalanced in the distribution of target variable values, in which good loans consti-
tute the majority and default loans the minority class. It has been claimed that, in such cases, ordinary classifiers mainly pay
attention to the majority class while overlooking the minority class. However, default identification is the primary concern
of this study, so the base classifiers are trained with the help of an oversampling technique: we make the number of
minority-class samples equal to that of the majority class by randomly duplicating minority-class samples.
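The oversample-then-boost procedure for one training cluster can be sketched as below. This is a hedged illustration, not the paper's exact pipeline: it assumes scikit-learn's `GradientBoostingClassifier` as the GBDT implementation, and the data, class ratio, and `n_estimators` are arbitrary toy choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class samples until both classes are
    equal in size (the simple oversampling scheme described above)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Hypothetical imbalanced training cluster: 90 good loans, 10 defaults.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = oversample_minority(X, y)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
proba = clf.predict_proba(X)   # soft-probability outputs, as used in Section 3.3
```

Fitting on the balanced copy keeps the boosting stages from focusing almost entirely on the majority class, while `predict_proba` supplies the soft probabilities that the later integration stage consumes.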

3.2.3. Accuracy and diversity-based adaptive learning


In this part, a novel accuracy and diversity-based adaptive learning process is introduced to find the optimal number of
clusters for K-means. In terms of the proposed ensemble member generation strategy, before the model training procedure,
a necessary K-means and KNN-based clustering process is conducted to split training subsets. However, it is a difficult task
to determine a suitable number of clusters, which influences the performance of base learners, and therefore affects the

prediction capability of the ensemble model. Besides, it would take a long time if we enumerate every possible number
of clusters for each view to make a reasonable choice by calculating the prediction performance after finishing the entire
ensemble learning process. Further, considering the importance of accuracy and diversity of base classifiers, a feasible self-
adaptive clustering approach is proposed here to compute an optimal number of clusters for each view. In addition, if the
number of clusters is large enough, there will be many small-sample-size clusters that need to be redistributed by KNN. To
avoid a repetitive KNN-based reassignment process and to improve algorithm efficiency, we introduce a parameter K_Max,
which serves as the maximum threshold for the number of clusters. Specifically, K_Max is set to the number of clusters at
which small-sample-size clusters first appear in the clustering results of all views.
Ensemble performance is utilized as the bridge to confirm the optimal number of clusters in the proposed self-adaptive
approach. In order to guarantee the performance of the ensemble, the classifiers should be complementary to each
other to some degree [8]. But it has already been clarified that increasing the diversity of base classifiers can decrease their
accuracy [2]. Moreover, an ensemble cannot show its advantage over its base classifiers unless accuracy and
diversity are both considered during the generation of ensemble members [7]. As good ensembles are character-
ized by high accuracy and wide diversity, an evaluation function is defined to assist us in determining the optimal number
of clusters:

Evaluation_η = (1/CN) Σ_{h=1}^{CN} (Acc_ηh + Div_ηh)    (4)

where Evaluation_η is the overall indicator computed to assess the performance of the classifiers generated in the η-th iteration,
CN represents the number of clusters, corresponding to the number of base classifiers in a specific view, and Acc_ηh and Div_ηh
denote the normalized values of accuracy and diversity of the h-th classifier of the η-th iteration, respectively.
The computation process of our proposed evaluation function is similar to that of fitness functions [20] in Genetic Al-
gorithm (GA)-based methods, which are used to determine the best generation. However, they also have several essential
differences. For instance, in GA, the parent generation and its subsequent offspring generation are highly relevant, which
is determined by the evolutionary mechanism. Hence, the fitness functions as well as their normalization processes merely
need to consider the relative performance of individuals in the current generation. On the contrary, iterations in our pro-
posed adaptive approach are independent. For each η, a process of clustering and base classifier construction is carried out;
thus, it would be reasonable to compare the value of Evaluationη and the current optimal evaluation value E valuationCbest .
To this end, we obtain the minimum and maximum values of accuracy and diversity over the foregoing and the current
iterations. Then, the normalized value of accuracy and diversity of classifier ψ ηh can be calculated by
Acc_ηh = (acc_ηh − acc_min) / (acc_max − acc_min)    (5)

Div_ηh = (div_ηh − div_min) / (div_max − div_min)    (6)
where accηh and divηh are the original accuracy and diversity of classifier ψ ηh , accmin and accmax represent the minimum
and maximum accuracy value of all iterations respectively, and divmin and divmax represent the minimum and maximum
diversity value of all iterations respectively.
The accuracy of classifier ψ ηh can be calculated by

acc_ηh = (TP_ηh + TN_ηh) / N_ηh    (7)

where TP_ηh and TN_ηh denote the numbers of true positives and true negatives predicted by classifier ψ_ηh, respectively.
Though many diversity measurements have been proposed, no consensus has been reached on which is the best. In this
study, based on the definition given by [29], we define the diversity as the mean squared error (MSE) between the outputs
of the base classifier ψ ηh and the ensemble, as shown by Eq. (8):

div_ηh = (1/2) Σ_{ξ=0}^{1} [ (1/N_ηh) Σ_{j=1}^{N_ηh} (pre_{ηh_ξ,j} − enPre_{η_ξ,j})² ]    (8)

where ξ stands for the class label; for a binary classification problem, ξ ∈ {0, 1} (i.e., 0 for negative samples and 1 for positive
ones). Nηh is the training sample size of classifier ψ ηh . In this study, we utilize the probability belonging to each class as
the output form of base learners. To be specific, pr eηh_ξ , j denotes the probability that training sample j is regarded as ξ
class predicted by classifier ψ ηh . Besides, enP r eη_ξ , j represents the ensemble probability that sample j is regarded as ξ class,
which is integrated by the CN classifiers generated by the η-th iteration and defined as

enPre_{η_ξ,j} = (1/CN) Σ_{h=1}^{CN} pre_{ηh_ξ,j}    (9)
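The computations in Eqs. (7)–(9) can be sketched as follows. As a simplifying assumption (the paper evaluates each classifier on its own training samples), this sketch scores all classifiers on one shared sample set; the toy probability matrices are made up.

```python
import numpy as np

def diversity(pre, en_pre):
    """Eq. (8): half the per-class MSE between one classifier's soft
    probabilities `pre` (N x 2, columns = classes 0/1) and the
    equal-weight ensemble probabilities `en_pre`."""
    return 0.5 * np.mean((pre - en_pre) ** 2, axis=0).sum()

def evaluation_terms(probas, y):
    """Raw accuracy (Eq. (7)) and diversity (Eq. (8)) of each of the CN
    classifiers in one iteration; `probas` is a list of N x 2 matrices."""
    en_pre = np.mean(probas, axis=0)                           # Eq. (9)
    accs = [float(np.mean(p.argmax(axis=1) == y)) for p in probas]
    divs = [diversity(p, en_pre) for p in probas]
    return accs, divs

# Two toy classifiers that agree on the labels but differ in confidence.
p1 = np.array([[0.9, 0.1], [0.2, 0.8]])
p2 = np.array([[0.7, 0.3], [0.4, 0.6]])
accs, divs = evaluation_terms([p1, p2], y=np.array([0, 1]))
```

Both classifiers are perfectly accurate here, yet their nonzero diversity values show that Eq. (8) still registers their disagreement in confidence.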

Algorithm 1 Adaptive algorithm for determining the optimal number of clusters for view v.

Initialization:
    Cbest ← 0; flag ← 0
    opti_acc ← [ ]; opti_div ← [ ]; All_ACC_v ← [ ]; All_DIV_v ← [ ]
    FLAG: the threshold for ending the iteration
Input: K_Max; training data D_tr; low-dimensional data of view v: X_tr^v
Process:
1:  repeat until flag ≥ FLAG
2:    for η in range (2, K_Max) do
3:      CN ← 0; ACC_η ← [ ]; DIV_η ← [ ]; flag ← flag + 1
4:      Cluster the training data D_tr based on K-means
5:      Use KNN to redistribute the small-sample-size clusters
6:      Obtain CN training subsets {(X_tr, Y_tr)_ζ ∈ R^(N_ζ^v×(d+1)) | ζ = 1, 2, ..., K_v} by data mapping
7:      Train base classifiers with the derived training subsets
8:      for h in range (1, CN) do
9:        calculate acc_ηh and div_ηh for each classifier by Eq. (7) and Eq. (8)
10:       ACC_η ← ACC_η ∪ acc_ηh; DIV_η ← DIV_η ∪ div_ηh
11:       All_ACC_v ← All_ACC_v ∪ acc_ηh; All_DIV_v ← All_DIV_v ∪ div_ηh
12:     if (η == 2)
13:       opti_acc ← ACC_2; opti_div ← DIV_2; Cbest ← 2; flag ← 0
14:     if (η > 2)
15:       find the minimum and maximum values from All_ACC_v and All_DIV_v
16:       normalize Acc_ηh, Acc_Cbest_h and Div_ηh, Div_Cbest_h by Eq. (5) and Eq. (6)
17:       calculate Evaluation_η and Evaluation_Cbest by Eq. (4)
18:       if (Evaluation_η > Evaluation_Cbest)
19:         opti_acc ← ACC_η; opti_div ← DIV_η; Cbest ← η; flag ← 0
20: end of repeat
Output: CN, Cbest

Eq. (9) gives the integration process of these CN classifiers; in order to effectively assess whether the ensemble
members produce complementary results, equal weights are assigned to each member during this ensemble procedure.
Based on the predefined evaluation function, the adaptive learning algorithm for determining the optimal number of clusters
is proposed. Taking view v as an example, the adaptive process is shown as Algorithm 1.
We first initialize six variables. To be precise, Cbest is the best number of clusters, with an initial value of 0. FLAG represents
the threshold for ending the iteration, and flag is an accumulated value (when flag ≥ FLAG, we end the adaptive search
process). The empty lists opti_acc and opti_div are used to store the accuracy and diversity of each classifier produced by
the current best clustering number. The empty lists All_ACC_v and All_DIV_v store the accuracy and diversity of all
classifiers generated by all iterations. Then, with K_Max, the training data D_tr, and the low-dimensional data X_tr^v of view v, the
adaptive algorithm automatically calculates Evaluation_η and Evaluation_Cbest, and further decides whether to increment
flag or to reinitialize it to zero according to the comparison of Evaluation_η and Evaluation_Cbest. This loop repeats
until flag ≥ FLAG. Cbest denotes the optimal clustering number for view v, and CN represents the number of clusters after the
KNN-based redistribution process.
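The adaptive search loop can be sketched compactly as below. As a simplification, this sketch takes the per-η accuracy and diversity lists as precomputed inputs (steps 4–11 of Algorithm 1 are assumed done elsewhere); the `flag_limit` default and the toy score values are ours.

```python
def evaluation(accs, divs, lo_a, hi_a, lo_d, hi_d):
    # Eq. (4) over Eqs. (5)-(6)-normalised scores; flat ranges map to 0.5.
    norm = lambda x, lo, hi: 0.5 if hi == lo else (x - lo) / (hi - lo)
    return sum(norm(a, lo_a, hi_a) + norm(d, lo_d, hi_d)
               for a, d in zip(accs, divs)) / len(accs)

def adaptive_search(scores, flag_limit=2):
    """Pick the cluster count whose classifiers maximise Eq. (4).

    `scores` maps a candidate number of clusters eta to the (accuracy list,
    diversity list) of its classifiers; normalisation uses the min/max over
    all iterations seen so far, as in Algorithm 1."""
    all_a, all_d = [], []
    best, flag = None, 0
    for eta in sorted(scores):
        accs, divs = scores[eta]
        all_a += accs
        all_d += divs
        if best is None:               # first candidate initialises the best
            best = eta
            continue
        lo_a, hi_a, lo_d, hi_d = min(all_a), max(all_a), min(all_d), max(all_d)
        e_cur = evaluation(accs, divs, lo_a, hi_a, lo_d, hi_d)
        e_best = evaluation(*scores[best], lo_a, hi_a, lo_d, hi_d)
        if e_cur > e_best:
            best, flag = eta, 0        # new optimum found: reset the counter
        else:
            flag += 1
        if flag >= flag_limit:         # early stop, as with FLAG in Algorithm 1
            break
    return best
```

Note that both the current candidate and the stored best are re-evaluated under the updated min/max ranges before comparison, mirroring lines 15–18 of Algorithm 1.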

3.3. Soft probability and distance-to-model-based model integration

An effective integration mechanism is indispensable for a successful ensemble model. In general, two questions are of most
concern in model combination: "Which type of output from the base learners is preferable?" and "How are the outputs
integrated?". In this study, we adopt a strategy combining a useful output form named soft probability with a dynamic
weight distribution fashion named distance-to-model, which are demonstrated to be appropriate for the proposed adaptive
clustering-based multi-view ensemble model.

3.3.1. Soft probability-based outputs


Soft probability is one of the two fundamental output forms of classification (the other is hard probability). Unlike in hard
probability [25], where each observation is assigned to a single class, in soft probability [14], we obtain the probabilities of
an unlabeled sample belonging to each category. Suppose there are φ+1 possible labels {O_0, ..., O_ϕ, ..., O_φ} for an object,
and we use a classifier set {H_1, ..., H_μ, ..., H_T} to predict the label of a certain sample τ. The prediction result for sample τ
by classifier H_μ (μ ∈ {1, 2, ..., T}) can be presented as a (φ+1)-dimensional vector [H_μ^0(τ), ..., H_μ^ϕ(τ), ..., H_μ^φ(τ)]^T
with Σ_{ϕ=0}^{φ} H_μ^ϕ(τ) = 1, where H_μ^ϕ(τ) represents the probability of sample τ belonging to the ϕ-th category
predicted by classifier H_μ. Note that H_μ^ϕ(τ) ∈ {0, 1} for hard probability, while H_μ^ϕ(τ) ∈ [0, 1] for soft probability.
Moreover, in an ensemble classification method, the ensemble decisions are made based on the outputs given by its base
classifiers. According to some existing studies [14,20], if we take hard probability as each classifier's output form, some hidden

information may be lost during the model integration process. Therefore, we adopt soft probability as the output form of
the intermediate classifiers and the hard probability as the output of the final ensemble. Besides, in this study, the credit
risk assessment problem is treated as a binary classification task, where the label "1" represents "default" and label "0"
denotes "fully paid"; thus, φ = 1 and the label set is {0, 1}.
In a binary classification problem, after obtaining the probability of belonging to each category, the final classes of instances
need to be decided. As far as we know, various loan default judgment thresholds have been used in the related literature,
where 0.5 is the most commonly used threshold value. But when dealing with an imbalanced dataset, such as ours, it is necessary
to choose an optimal threshold based on the best precision on the training set [1]. It would be a great potential risk for both
P2P lending institutions and individual lenders if bad loans were not distinguished from good ones. Taking this into account,
default identification is also a crucial factor that needs to be considered in threshold determination. However, threshold
adjustment is a tough process for making a good tradeoff between accuracy and sensitivity (an index measuring the
proportion of actual positives that are correctly identified), and it requires a tremendous amount of experimentation.
To tackle this issue, this study obtains the final classification results by comparing the probabilities of belonging to the
different categories: the final label is the one with the highest predicted probability.
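The value of keeping soft outputs through the intermediate stages can be seen in a two-classifier toy example: averaging hard (one-hot) votes can mask a confident disagreement that soft probabilities preserve. The numbers below are illustrative only.

```python
import numpy as np

# Two base classifiers' outputs for one loan, first as hard probabilities
# (one-hot votes) and then as soft probabilities for the same two votes.
hard = np.array([[1.0, 0.0],     # classifier 1 votes "fully paid"
                 [0.0, 1.0]])    # classifier 2 votes "default"
soft = np.array([[0.55, 0.45],   # classifier 1 is barely confident
                 [0.10, 0.90]])  # classifier 2 is very confident

hard_avg = hard.mean(axis=0)     # [0.5, 0.5]: an uninformative tie
soft_avg = soft.mean(axis=0)     # the ensemble now leans toward default
label = int(soft_avg.argmax())   # final hard label from the soft ensemble
```

With hard outputs the ensemble sees only a tie, whereas the soft average resolves it toward the more confident classifier, which is the "hidden information" the text refers to.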

3.3.2. Distance-to-model-based integration


To make good use of the created diversity of base learners, we employ a novel weight generation method, distance-to-
model (DM), to combine the base classifiers’ outputs for each view and calculate the final results of all views. DM involves
a dynamic weight generation process. For each instance to be classified, the weight associated with a base classifier is
determined by the distance between the instance and the training set of the used classifier. There are two stages of ensemble
member combination in the proposed method. The first one is to integrate the predicted results of the constructed base
classifiers for each view. The second one is to integrate the ensemble outputs of each view for a final prediction.
It has been shown in [22] that the DM criterion can reflect the prediction accuracy of a classifier for a specific sample.
One of the most commonly adopted DMs is the distance to the average of the dataset.

(1) Dynamic weights-based classifier integration for each view

Suppose we have testing data D_te = {x_i, y_i}_{i=1,...,N_te} = [X_te Y_te], where X_te is an N_te × d matrix containing N_te
testing samples and d attributes, and Y_te is a column vector containing the real class labels. Similarly, D_te can be divided
into V views. We use another data matrix X_te^v ∈ R^(N_te×d(v)) to denote the low-dimensional data of the v-th view
obtained from D_te by "data mapping", with Σ_{v=1}^{V} d(v) = d. Let x_j be a sample to be classified, and x_j^v represent
the low-dimensional vector of x_j in view v. As introduced in Section 3.2.2, the training data D_tr is partitioned into K_v
clusters based on the unsupervised clustering results of view v. For view v, the average vector of the data in cluster ζ is
denoted as x̄_ζ^v = {x̄_ζ1^v, ..., x̄_ζθ^v, ..., x̄_ζd(v)^v}, where x̄_ζθ^v = (1/N_ζ^v) Σ_{i=1}^{N_ζ^v} x_ζθ^{v,i}, N_ζ^v denotes
the number of instances in cluster ζ, and x_ζθ^{v,i} represents the θ-th dimensional value of sample x_ζ^{v,i}, which
belongs to cluster ζ. The distance between sample x_j and the base classifier H_vζ can then be defined as follows:

DM_{j_vζ} = || x_j^v − x̄_ζ^v ||    (10)
Thus, we can obtain the distances between sample x_j and the different classifiers of view v: DM_{j_v} =
[DM_{j_v1}, ..., DM_{j_vζ}, ..., DM_{j_vK_v}]. Distances of the same group can reflect the estimation capabilities of the base
classifiers toward sample x_j. A previous study [22] has shown that the larger the value of DM, the lower the prediction
accuracy of the model. On this basis, we define the weight of classifier H_vζ when predicting sample x_j as

w_{j_vζ} = (1/lg DM_{j_vζ}) / Σ_{ζ=1}^{K_v} (1/lg DM_{j_vζ})    (11)

with Σ_{ζ=1}^{K_v} w_{j_vζ} = 1. Thereby, we can obtain the weights of all base classifiers in view v for sample x_j, denoted by
W_jv = [w_{j_v1}, ..., w_{j_vζ}, ..., w_{j_vK_v}].
Suppose vector P_jvl = [p_{jvl_1}, ..., p_{jvl_ζ}, ..., p_{jvl_K_v}] is the predicted output of the individual members in view v,
where each element indicates the probability of x_j belonging to class l (in this study, l ∈ {0, 1}). Then, the outputs of the
base learners for label l and the corresponding weights are combined via dot product:

R_jv0 = P_jv0 · W_jv    (12)

R_jv1 = P_jv1 · W_jv    (13)

where R_jv0 and R_jv1 denote the combination results for labels "0" and "1" in view v, respectively. Thus, the outputs of all
views for label "0" and label "1" are R_j0 = [R_j10, ..., R_jv0, ..., R_jV0] and R_j1 = [R_j11, ..., R_jv1, ..., R_jV1], respectively.

(2) Dynamic weights-based multi-view integration

Multi-view integration aims at leveraging the information provided by different views to improve predictive performance.
After single-view fusion, the last task is to integrate all these results to generate a final output. In this study, we employ the
average distance between sample x_j and the classifiers of view v to measure the distance between x_j and the constituted
ensemble of view v. Let Ave_DM_jv = (1/K_v) Σ_{ζ=1}^{K_v} DM_{j_vζ} denote the distance between sample x_j and the
ensemble of view v. The weight of the ensemble of view v can be calculated by the following equation:

w_{j_v} = (1/lg Ave_DM_jv) / Σ_{v=1}^{V} (1/lg Ave_DM_jv)    (14)

with Σ_{v=1}^{V} w_{j_v} = 1. Thereby, we can obtain the weights of all view ensembles for sample x_j, denoted by W_j =
{w_{j_1}, ..., w_{j_v}, ..., w_{j_V}}. According to the single-view integration, the predictions of the various views for label "0"
and label "1" are R_j0 = [R_j10, ..., R_jv0, ..., R_jV0] and R_j1 = [R_j11, ..., R_jv1, ..., R_jV1], respectively. Therefore, the final
result can be obtained by comparing the following two dot products: P_j0 = R_j0 · W_j and P_j1 = R_j1 · W_j. If P_j1 ≥ P_j0,
x_j is classified as a positive instance, i.e., a default loan; conversely, if P_j1 < P_j0, x_j is classified as a negative sample,
which has a higher probability of being creditable.
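A numerical sketch of the two-stage DM weighting (Eqs. (11)–(14)) follows. All distances and probabilities are toy values, and we assume every distance is greater than 1 so that lg(DM) stays positive, as the weighting formula implicitly requires.

```python
import numpy as np

def dm_weights(distances):
    """Eqs. (11)/(14): normalised weights proportional to 1 / lg(DM).
    Assumes every distance is > 1 so that lg(DM) > 0."""
    inv = 1.0 / np.log10(distances)
    return inv / inv.sum()

# Stage 1: one test sample, one view with three base classifiers.
dm = np.array([10.0, 100.0, 1000.0])   # distances to the three models
w = dm_weights(dm)                     # closer model -> larger weight
p1 = np.array([0.2, 0.7, 0.9])         # each classifier's soft P(default)
p0 = 1.0 - p1
R_v1 = p1 @ w                          # Eq. (13): view-level fusion, label "1"
R_v0 = p0 @ w                          # Eq. (12): view-level fusion, label "0"

# Stage 2: combine this view with a hypothetical second view.
ave_dm = np.array([dm.mean(), 50.0])   # average DM per view (Eq. (14) input)
W = dm_weights(ave_dm)
P1 = np.array([R_v1, 0.60]) @ W        # final score for label "1"
P0 = np.array([R_v0, 0.40]) @ W        # final score for label "0"
pred = 1 if P1 >= P0 else 0            # default if P_j1 >= P_j0
```

Because the weights in each stage sum to one, P0 + P1 = 1 here, so comparing the two scores is equivalent to checking whether the weighted default probability exceeds one half.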

4. Experimental study

4.1. Data preparation

This study carries out numerical experiments with the data provided by Lending Club1 , the world’s largest P2P lending
platform. Loans with a duration of 36 months, which were issued from January 2014 to August 2014, are mainly analyzed,
because these data already had their maturity at the time when we downloaded them in September 2017. A total of 162,570
loan records and 143 attributes are contained in the raw dataset. Data preparation occurs in the following steps.

(1) Redundant feature removal

Not all attributes are necessary for model construction. In this study, routine procedures and domain knowledge col-
lected from various sources are used to identify redundant features. The following attributes are excluded in this
study: (a) attributes like "inq_fi" and "inq_last_12m", which have more than 50% missing values, were removed directly;
(b) post-loan attributes, such as "total_rec_int" and "last_pymnt_amnt", which are updated monthly after the loan is is-
sued, are beyond the range of our research; (c) according to the Data Dictionary provided by Lending Club, some features
contain the same or similar information (for example, "funded_amnt_inv", "Loan_amnt", and "funded_amnt" all record the
loan amount); in such situations, only one attribute was retained; (d) we also removed variables lacking useful infor-
mation (e.g., "ID" and "url") and variables having long text values (e.g., "address"). Ultimately, 78 of the 143 attributes were
removed, and 65 attributes, including the label, remain. Seven of these are categorical, and the rest are numerical.

(2) Data cleaning

There are 162,570 loan records with a duration of 36 months in the raw data. These instances have seven final statuses:
"Fully paid", "Current", "Default", "Charged off", "Late (16–30 days)", "Late (31–120 days)" and "In grace period". We removed
430 samples with the "Current" loan status because we cannot infer their final results. Instead of replacing missing values
with mean or random values, we discarded all the samples containing missing values; it is worth noting that occupational
information-related features were not involved in this step. We also eliminated the records with obvious mistakes. After
these manipulations, the data comprise 70,860 records, of which about 14.56% are default loan records.

(3) Data transformation

Some attributes need further transformation to make the data suitable for algorithms coping with numerical variables.
"Loan Status" represents the final status of issued loans, and default prediction is treated as a binary classification task,
we transformed the status "Fully paid" into 0, which denotes good loans, and set the other five types as 1, signifying bad
ones. Besides, the values of attributes including "grade", "sub_grade", "purpose", "home_ownship", and "verification_status"
are non-numerical, which were transformed into integers respectively.
To make better use of two important variables, "emp_length" and "emp_title", the values of "emp_length" were coded from
0 to 11, where 0 means the borrower had been working for an employer for less than 1 year when the loan was obtained,
10 means the borrower had not changed jobs for 10 years or more, and 11 indicates missing employment length
information. Similarly, the values of "emp_title" were replaced by 1 and 2, with 1 indicating the records with the value none

1
https://www.lendingclub.com/info/download-data.action.
Y. Song, Y. Wang and X. Ye et al. / Information Sciences 525 (2020) 182–204 195

Table 1
Description of feature transformation.

Attribute Attribute description Original value Code value

loan_status Current status of the loan Fully paid; Default; In grace period; Charged off; Late {0, 1}
(16-30 days); Late (31-120 days)
grade LC assigned loan grade A to G {1, 2, …,7}
sub_grade LC assigned loan subgrade A1 to G5 {1, 2, …, 35}
purpose Categories of the loan request renewable_energy; small_business; vacation; moving; {1, 2, …, 13}
debt_consolidation; medical; home_improvement;
car; house; wedding; credit_card; major_purchase;
other
home_ownship Home ownership status mortgage; own; rent {1, 2, 3}
verification_status Status of whether the source verified; not verified; verified {1, 2, 3}
borrower’s income was
verified
emp_length Employment length in years <1; 1 to 9; 10+; none {0, 1, …, 11}
emp_title Job title of borrowers none; job title {1, 2}
applicant type Describe the status of —— {1, 2, 3, 4}
occupational information

and 2 indicating the records with the job title filled. In addition, we introduced a new attribute, "applicant type", which has four
integer values: 1 indicates that both employment length and job title information are available; 2 indicates that only job title
information is provided; 3 indicates that only employment length is available; and 4 denotes that neither kind of occupational
information is available. Details on feature transformation are provided in Table 1.
As shown in Table 1, eight attributes besides the target variable "loan_status" need further processing. Among them,
the values of "grade" and "sub_grade" reflect how creditworthy the borrowers are, while the values of the other attributes have
no ordinal meaning. Further, to ensure the reliability of the experiments, the attributes "purpose", "home_ownship",
"verification_status", "emp_length", "emp_title", and "applicant type" were recoded using one-hot encoding. After
the recoding process, these discrete attributes were replaced by groups of dummy variables, which take 0 or 1 to indicate
the absence or presence, respectively, of the categorical attributes.
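The dummy-variable recoding can be sketched as below. This is a minimal illustration (not the authors' code); the "home_ownship" codes follow Table 1.

```python
# Minimal one-hot encoding of an integer-coded categorical column.

def one_hot(values, categories):
    """Replace each value by a 0/1 indicator vector over the known categories."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return encoded

home_ownship = [1, 3, 2, 1]           # 1 = mortgage, 2 = own, 3 = rent (Table 1)
print(one_hot(home_ownship, [1, 2, 3]))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```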

(4) Data partition

After the above-mentioned manipulations, the data we use to build the credit risk assessment model contains 70,860
records, 93 independent variables and a target variable. The class distribution is approximately 6:1, with 60,538 (85.44%)
good loans and 10,322 (14.56%) default loans. Among the 70,860 samples, 66,490 records have complete employment length
and job title information, and 4,370 records provide no employment length or job title information. Regarding data partition,
the preprocessed data was divided into a training set and a testing set randomly based on the 80–20% principle, i.e., 56,694
samples for training and 14,166 for testing. The training set includes 48,434 (85.44%) good loans and 8,260 (14.56%) default
samples.
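The 80-20 partition preserves the 85.44%/14.56% class ratio in both subsets, i.e., it is stratified. A sketch of such a split (pure Python for illustration; in practice scikit-learn's `train_test_split` with its `stratify` parameter does the same):

```python
# Stratified 80-20 split: sample the test fraction from each class separately
# so that the class ratio is preserved in both partitions.
import random

def stratified_split(samples, labels, test_ratio=0.2, seed=42):
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(len(items) * test_ratio))
        test.extend((x, y) for x in items[:cut])
        train.extend((x, y) for x in items[cut:])
    return train, test

labels = [0] * 90 + [1] * 10            # toy 9:1 imbalance
train, test = stratified_split(list(range(100)), labels)
print(len(train), len(test))            # 80 20
```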

(5) Multi-view partitioning

According to the official data dictionary and the professional analysis of practitioners, features of the Lending Club data
can be manually partitioned into four categories: bad credit history, loan description, extra credit information and repaying
capacity. Each category is regarded as an individual view, and each feature is assigned to exactly one view. The numbers of
features in the four views are 15, 32, 17, and 29, respectively. Specifically, features in "bad credit history" convey
information on the borrower's bad credit records in past years. "Loan description" contains attributes describing
the loan itself, such as loan purpose and loan amount. "Extra credit information" refers to information
about the loan applicants collected from external channels, such as FICO. Features such as "annual income" and
"debt to income", which reflect the repayment ability of borrowers, are assigned to the "repaying capacity" view.
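A view partition is simply a disjoint assignment of feature names, under which each record can be projected onto one view. The member lists below are illustrative placeholders, not the paper's full 15/32/17/29 assignment:

```python
# Sketch of the four-view feature partition; member lists are illustrative.
VIEWS = {
    "bad_credit_history": ["delinq_2yrs", "pub_rec"],
    "loan_description": ["loan_amnt", "purpose", "term"],
    "extra_credit_information": ["fico_range_low", "inq_last_6mths"],
    "repaying_capacity": ["annual_inc", "dti"],
}

def project(record, view):
    """Keep only the features belonging to one view."""
    return {f: record[f] for f in VIEWS[view]}

record = {"delinq_2yrs": 0, "pub_rec": 1, "loan_amnt": 5000, "purpose": 3,
          "term": 36, "fico_range_low": 690, "inq_last_6mths": 2,
          "annual_inc": 48000, "dti": 14.2}
print(project(record, "repaying_capacity"))  # {'annual_inc': 48000, 'dti': 14.2}
```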

4.2. Performance metrics

To gain comprehensive and reliable comparative results between the proposed ensemble method and the alternative
methods, six popular performance metrics are used: accuracy, sensitivity, specificity, G-mean, receiver operating character-
istic (ROC) curve and area under the ROC curve (AUC).
As a binary classification problem, these evaluation metrics are calculated based on the confusion matrix, as shown in
Table 2. Accuracy, sensitivity, specificity and G-mean of an algorithm are computed according to Eqs. (15)–(18) respectively.
To be specific, accuracy is a basic evaluation indicator for classification tasks, but it fails to reflect a model's overall
prediction ability on class-imbalanced problems [21,40]. As previously analyzed, with a skewed data distribution, the
ability to identify real default instances is important, but it is not reasonable to sacrifice the identification ability for
majority class instances. Thus, sensitivity and specificity, which represent the minority and majority class accuracies

Table 2
Confusion matrix for a binary classification problem.

                     Correctly classified instances    Wrongly classified instances

Positive instances   True positive (TP)                False negative (FN)
Negative instances   True negative (TN)                False positive (FP)

respectively, are both employed.

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$   (15)

$\mathrm{Sensitivity} = \dfrac{TP}{TP + FN}$   (16)

$\mathrm{Specificity} = \dfrac{TN}{TN + FP}$   (17)

$\text{G-mean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}}$   (18)


To balance sensitivity and specificity and to show a model's comprehensive performance, G-mean, AUC
and the ROC curve are further introduced. A ROC curve plots the false positive rate (FPR, which equals 1 − specificity)
on the x axis against the true positive rate (TPR, namely sensitivity) on the y axis, revealing the relative tradeoff between
FPR and TPR. The ROC curve of a better prediction method is closer to the upper left corner. The AUC value is the
area under the ROC curve; for a given prediction approach, the bigger, the better.
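Eqs. (15)-(18) can be computed directly from a confusion matrix. In the sketch below, the counts are taken from the proposed method's confusion matrix in Table 6; note that G-mean computed from the aggregate matrix can differ slightly from the paper's reported value, which averages 25 independent runs.

```python
# Threshold metrics of Eqs. (15)-(18), computed from confusion-matrix counts.
import math

def metrics(tp, fn, tn, fp):
    accuracy = (tp + tn) / (tp + tn + fp + fn)        # Eq. (15)
    sensitivity = tp / (tp + fn)                      # Eq. (16)
    specificity = tn / (tn + fp)                      # Eq. (17)
    g_mean = math.sqrt(sensitivity * specificity)     # Eq. (18)
    return accuracy, sensitivity, specificity, g_mean

# counts from the proposed method's confusion matrix in Table 6
acc, sen, spe, gm = metrics(tp=950, fn=1112, tn=9293, fp=2811)
print(round(acc, 4), round(sen, 4), round(spe, 4))  # 0.7231 0.4607 0.7678
```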

4.3. Experiment design

First, we present a comprehensive analysis of the proposed method, DM–ACME. Utilizing GBDT as the base
algorithm, this method synthesizes multi-view learning and adaptive clustering-based individual classifier construction, and
takes both soft probability and distance-to-model into account in the combination strategy. To verify the effect
of the multi-view ensemble, DM–ACME was compared with its single-view counterpart, the distance-to-model and
adaptive clustering-based single-view ensemble (DM–ACSE), in which no view partitioning is conducted when generating
base classifiers. To investigate the benefit of the clustering-based individual classifier construction process, we compared
the performance of DM–ACSE with that of its base algorithm. Then, the prediction results of DM–ACME and the base
algorithm were compared to reveal the advantages of combining the multi-view ensemble with adaptive clustering. To
verify the effect of soft probability and distance-to-model-based weight assignment on the final results, experiments were
conducted on the suggested method and a variant, called MV–ACME, whose ensemble strategy uses hard probability and
majority voting.
Then, to validate the effectiveness of the proposed DM–ACME method in default forecasting, we included six data-driven
techniques as performance benchmarks. Three of them are ensemble classification methods: gradient boosting decision tree
(GBDT) [15], random forest (RF) [11], and AdaBoost. The other three are single classification methods: decision tree (DT)
[35], logistic regression (LR) [13], and multi-layer perceptron (MLP, an artificial neural network algorithm). To further assess
the proposed method's capacity for dealing with imbalanced data, we compared it with these benchmark methods combined
with two classic data manipulation techniques for handling imbalanced classification problems, oversampling and
undersampling (i.e., randomly oversampling the minority class or randomly undersampling the majority class to rebalance
the training set). We also conducted a group of experiments comparing the proposed method with several popular
imbalanced classification methods, including SMOTE–ENN [32], Easyensemble [26], C4.5CS [37] and SVMCS [40]. The
decision tree algorithm was used to train the data generated by SMOTE–ENN.
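The two rebalancing baselines mentioned above are simple to state precisely. A sketch (not the exact experimental code): random oversampling duplicates minority samples until the classes are balanced; random undersampling discards majority samples. Both act only on the training set.

```python
# Random over-/undersampling of a labeled training set; label 1 is the
# minority (default) class and label 0 the majority (good-loan) class.
import random

def rebalance(samples, mode, seed=0):
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == 1]
    majority = [s for s in samples if s[1] == 0]
    if mode == "oversample":
        # duplicate random minority samples until both classes have equal size
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        return majority + minority + extra
    if mode == "undersample":
        # keep only as many majority samples as there are minority samples
        return rng.sample(majority, len(minority)) + minority
    raise ValueError(mode)

train = [(i, 0) for i in range(60)] + [(i, 1) for i in range(10)]
print(len(rebalance(train, "oversample")), len(rebalance(train, "undersample")))
# 120 20
```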
Finally, to examine the roles of the occupational information-related features and other crucial features in the prediction
model construction process, we computed feature importance via Pearson correlation analysis and GBDT respectively.
The experiments were conducted on a PC with a 2.7 GHz Intel Core i7 CPU and 8 GB RAM, running the Windows 10
operating system. Python (version 2.7.13) was used for modeling. Each experiment was repeated 25 times independently,
and the average results are used for performance comparison among all the classification methods involved in our experiments.

4.4. Parameter settings

In the proposed ensemble learning method, an adaptive clustering-based model construction approach is utilized to
generate individual classifiers. Three parameters need to be tuned in advance: KMax (the upper bound on the number
of clusters shared by the four views), Kknn (the number of neighbors referenced when redistributing the samples of
small-sample-size clusters using KNN) and FLAG (a parameter used to terminate the adaptive learning process). After plenty

Table 3
Results of the number of clusters setting.

Number of sub-base learners m = 10 m = 30 m = 50 m = 70 m = 100

Adaptive learning results (Cbest ) [6, 4, 7, 5] [14, 4, 19, 5] [14, 7, 19, 4] [14, 7, 19, 4] [7, 4, 9, 6]
Number of clusters after redistribution (CN ) [5, 4, 7, 5] [12, 4, 17, 5] [12, 7, 17, 4] [12, 7, 17, 4] [6, 4, 9, 6]

Table 4
Results of DM–ACME, DM–ACSE and base algorithm.

Performance Comparison Model Number of sub-base learners


metric item
m = 10 m = 30 m = 50 m = 70 m = 100

Accuracy Mean DM–ACME 0.7231 0.8030 0.8342 0.8293 0.8352


(Standard (0.0081) (0.0042) (0.0020) (0.0029) (0.0016)
Deviation) DM–ACSE 0.6114 0.6854 0.7112 0.7345 0.7531
(0.0207) (0.0092) (0.0120) (0.0135) (0.0066)
Base Algorithm 0.6077 0.6180 0.6235 0.6270 0.6292
(0.0021) (0.0028) (0.0033) (0.0029) (0.0027)
Improvement DM–ACME vs. 18.27% 17.16% 17.29% 12.91% 10.90%
Rate DM–ACSE
DM–ACSE vs. 0.61% 10.03% 14.07% 16.64% 19.69%
Base Algorithm
DM–ACME vs. 18.99% 28.91% 33.79% 31.70% 32.74%
Base Algorithm
AUC Mean DM–ACME 0.6697 0.6675 0.6636 0.6631 0.6607
(Standard (0.0016) (0.0030) (0.0032) (0.0037) (0.0038)
Deviation) DM–ACSE 0.6576 0.6491 0.6362 0.6331 0.6237
(0.0054) (0.0055) (0.0053) (0.0038) (0.0042)
Base algorithm 0.6188 0.6236 0.6207 0.6196 0.6148
(0.0033) (0.0035) (0.0042) (0.0046) (0.0033)
Improvement DM–ACME vs. 1.84% 2.83% 4.31% 4.74% 5.93%
Rate DM–ACSE
DM–ACSE vs. 6.27% 4.19% 2.50% 1.75% 1.45%
Base Algorithm
DM–ACME vs. 8.23% 7.14% 6.91% 6.57% 7.47%
Base Algorithm

of experiments, KMax was set to 30. When KMax = 30, the number of small-sample-size clusters contained in the clustering
results of each view is 8, 1, 3 and 3, respectively. Hence, the optimal number of clusters for the mentioned unsupervised
learning process is selected within a range of 2–30. Kknn was set to 10 to reassign the samples in small-sample-size clusters
into the nearest large clusters. The optimal value of the parameter FLAG was selected from several discrete values: 5, 10,
15, 20 and 25. We finally set FLAG to 5 for the following reasons. First, we observed in the experiments that the final
ensemble prediction model showed higher accuracy but lower sensitivity as the FLAG value increased, which is undesirable
given the importance of default loan identification. Second, the ensemble process is very time-consuming when FLAG is set
to a high value.
According to the results of the adaptive learning process and the KNN-based adjustment for the proposed ensemble method,
DM–ACME, the numbers of clusters for the four views under different numbers of sub-base learners are shown in Table 3. The
values in the sets denote the numbers of clusters for the "bad credit history", "loan description", "extra credit information"
and "repaying capacity" views, in that order. The numbers in bold indicate that small-sample-size clusters exist in the
corresponding clustering results; in those cases, we employed the redistribution process to obtain the final number of clusters
for model construction. Similarly, the number of clusters for the compared method DM–ACSE was determined to be 4
according to the adaptive learning and KNN-based adjustment process.
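The KNN-based redistribution of small-sample-size clusters can be sketched as follows. This is a simplified reading of the paper's adjustment step, not the authors' implementation: each point in an undersized cluster is reassigned to the majority cluster among its k nearest neighbors drawn from the large clusters (the paper uses Kknn = 10; the threshold and k here are toy values).

```python
# Reassign members of clusters smaller than min_size to the majority cluster
# among their k nearest neighbours from the large clusters (brute force).
from collections import Counter

def reassign_small_clusters(points, labels, min_size=2, k=3):
    counts = Counter(labels)
    large = {c for c, n in counts.items() if n >= min_size}
    keep = [(p, c) for p, c in zip(points, labels) if c in large]
    new_labels = []
    for p, c in zip(points, labels):
        if c in large:
            new_labels.append(c)
            continue
        nearest = sorted(keep, key=lambda q: sum((a - b) ** 2
                                                 for a, b in zip(p, q[0])))[:k]
        new_labels.append(Counter(c2 for _, c2 in nearest).most_common(1)[0][0])
    return new_labels

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
labels = [0, 0, 0, 1, 1, 2]          # cluster 2 has a single member
print(reassign_small_clusters(points, labels))  # [0, 0, 0, 1, 1, 1]
```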

4.5. Results and discussion

4.5.1. Comprehensive analysis of the proposed method


(1) Analysis of the advantages of multi-view ensemble and adaptive clustering

Many comparative experiments under different numbers of sub-base learners were carried out to verify the effectiveness
of multi-view ensemble and adaptive clustering via the proposed method. The numbers of sub-base learners were set to
10, 30, 50, 70, and 100 to observe how different classifier numbers affect prediction performance. Table 4 shows the mean
accuracy and AUC values over 25 experiments conducted using the proposed multi-view model (DM–ACME), its single-view
counterpart (DM–ACSE), and the base algorithm, where the base algorithm corresponds to GBDT combined with
oversampling, as mentioned in Section 3.2.2. The improvement rate is calculated as: improvement rate = (max − min)/min,
where max denotes the higher of the two values being compared and min the lower. From Table 4, the relationship between

Table 5
Results of DM–ACME and MV–ACME.

Performance Comparison item Model Number of sub-base learners


metric
m = 10 m = 30 m = 50 m = 70 m = 100

Accuracy Mean (Standard DM–ACME 0.7231 0.8030 0.8342 0.8293 0.8352


Deviation) (0.0081) (0.0042) (0.0020) (0.0029) (0.0016)
MV–ACME 0.6274 0.6617 0.6450 0.7236 0.6910
(0.0202) (0.0230) (0.0197) (0.0206) (0.0077)
Improvement DM–ACME vs. 15.25% 21.35% 29.33% 14.61% 20.87%
Rate MV–ACME
AUC Mean (Standard DM–ACME 0.6697 0.6675 0.6636 0.6631 0.6607
Deviation) (0.0016) (0.0030) (0.0032) (0.0037) (0.0038)
MV–ACME 0.5987 0.5861 0.5578 0.5477 0.5776
(0.0011) (0.0017) (0.0100) (0.0106) (0.0037)
Improvement DM–ACME vs. 11.86% 13.89% 18.97% 21.07% 14.39%
Rate MV–ACME

prediction performance and the number of sub-base learners, as well as the performance gaps between models, can be
discerned clearly; these are analyzed according to the performance metrics as follows.
In terms of accuracy, the mean values of all three models grow as the number of base learners increases. Comparing the
proposed DM–ACME with DM–ACSE, the improvement rates exceed 10.90%. Besides, the standard deviations over multiple
experiments of the proposed method under different numbers of base learners are significantly lower than those of
DM–ACSE. These results demonstrate that the multi-view learning-based ensemble makes better decisions by training base
classifiers from different feature perspectives, in terms of both prediction precision and generalization ability. Comparing
DM–ACSE with the base algorithm, Table 4 shows improvement rates ranging from 0.61% to 19.69% as the number of
sub-base learners increases, indicating an obvious accuracy gain from the clustering-based ensemble strategy and
showing the effectiveness of the proposed adaptive clustering technique during ensemble construction. Comparing
the proposed DM–ACME with the base algorithm, the improvement rates of DM–ACME vs. the
base algorithm vary from 18.99% to 33.79%, which reveals the positive role of multi-view learning together with adaptive
clustering in enhancing the accuracy of ensemble classification models.
Regarding AUC, which measures a model's holistic classification ability, the proposed DM–ACME maintains steady
performance under different numbers of base learners, while the performance of DM–ACSE and the base algorithm fluctuates
as the number of sub-base learners increases. Comparing the proposed DM–ACME with DM–ACSE, the improvement
rates on AUC range from 1.84% to 5.93% as the number of sub-base learners increases. Additionally, the standard deviations
for AUC over multiple experiments on the proposed method with various numbers of sub-base learners are lower than
those of DM–ACSE. These results again show that the multi-view learning-based ensemble makes better decisions by training
base classifiers from different perspectives, in terms of both AUC and generalization ability. Comparing DM–ACSE with the
base algorithm on AUC, the improvement rates vary from 1.45% to 6.27%, demonstrating that the proposed adaptive
clustering-based ensemble strategy improves classification ability. Comparing the proposed DM–ACME
with the base algorithm on AUC, the improvement rates exceed 6.57%, indicating that multi-view learning together with
adaptive clustering increases the AUC of ensemble classification models as well as their accuracy.

(2) Effect of soft probability and DM-based integration

To verify the effect of soft probability and DM-based integration, a group of comparative experiments with hard
probability and MV-based integration under different numbers of sub-base learners was carried out. The numbers of sub-base
learners were set to 10, 30, 50, 70 and 100 to observe the influence of the number of classifiers on prediction
performance. Table 5 summarizes the mean values of 25 experiments on the proposed multi-view model (DM–ACME) and its
comparison method (MV–ACME) in terms of accuracy and AUC.
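The two combination styles compared in Table 5 differ in when thresholding happens. A sketch (with uniform weights for simplicity, rather than the distance-to-model weights of DM-ACME): soft combination averages the base learners' default probabilities and thresholds once, while hard combination thresholds each learner first and then takes a majority vote.

```python
# Soft-probability averaging (DM-ACME style, uniform weights assumed here)
# versus hard-probability majority voting (MV-ACME style).

def soft_combine(probs, threshold=0.5):
    # average the default probabilities, then threshold once
    return 1 if sum(probs) / len(probs) >= threshold else 0

def hard_combine(probs, threshold=0.5):
    # threshold each base learner first, then take the majority vote
    votes = [1 if p >= threshold else 0 for p in probs]
    return 1 if sum(votes) > len(votes) / 2 else 0

base_probs = [0.9, 0.45, 0.4]   # one confident and two borderline learners
print(soft_combine(base_probs), hard_combine(base_probs))  # 1 0
```

The example shows why soft combination can help: a single confident base learner can outweigh two borderline ones, whereas hard voting discards that confidence information.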
From Table 5, the relationship between prediction performance and the number of sub-base learners, as well as the
performance gaps between models, can be discerned clearly; these are analyzed according to the performance metrics as
follows. In terms of accuracy, the mean values of both models rise overall as the number of sub-base learners increases,
apart from a few fluctuations. Comparing the two methods, DM–ACME achieves improvement rates of more than 14.61%.
Its advantages are also reflected in AUC, where the improvement rates of DM–ACME over MV–ACME vary from 11.86% to
21.07%. On both performance indicators, the proposed DM–ACME is superior to MV–ACME, revealing the advantages of the
soft probability and DM-based integration strategy over hard probability and majority voting. Besides, most of the standard
deviations over multiple experiments on the proposed method under different numbers of sub-base learners are significantly
lower than those for

Table 6
Results of DM–ACME and benchmark techniques.

Confusion
Method matrix Performance metric

TP FN
TN FP Accuracy Sensitivity Specificity G-mean AUC

Proposed 950 1112 0.7231 ± 0.0081 0.4607 ± 0.0191 0.7678 ± 0.0127 0.6009 ± 0.0241 0.6697 ± 0.0016
method 9293 2811
GBDT 35 2027 0.8517 ± 0.0000 0.0170 ± 0.0000 0.9939 ± 0.0000 0.1299 ± 0.0000 0.5054 ± 0.0000
12030 74 (-) (+) (-) (+) (+)
RF 0 2062 0.8544 ± 0.0000 0.0000 ± 0.0000 1.0000 ± 0.0000 0.0000 ± 0.0000 0.5000 ± 0.0000
12104 0 (-) (+) (-) (+) (+)
AdaBoost 460 1602 0.7521 ± 0.0014 0.2229 ± 0.0027 0.8423 ± 0.0014 0.4333 ± 0.0028 0.5326 ± 0.0017
10195 1909 (-) (+) (-) (+) (+)
DT 473 1589 0.7518 ± 0.0000 0.2294 ± 0.0000 0.8408 ± 0.0000 0.4392 ± 0.0000 0.5351 ± 0.0000
10177 1927 (-) (+) (-) (+) (+)
LR 0 2062 0.8544 ± 0.0000 0.0000 ± 0.0000 1.0000 ± 0.0000 0.0000 ± 0.0000 0.5000 ± 0.0000
12104 0 (-) (+) (-) (+) (+)
MLP 324 1738 0.7245 ± 0.0000 0.1572 ± 0.0000 0.8211 ± 0.0000 0.3593 ± 0.0000 0.4892 ± 0.0000
9939 2165 (=) (+) (-) (+) (+)

MV–ACME, which implies that the novel soft probability and DM-based integration strategy improve the stability of the
ensemble performance.

4.5.2. Comparison with different classification models


In this subsection, we summarize and compare the results of the proposed method, the benchmark methods using
original, oversampled and undersampled data, respectively, and several popular imbalanced classification algorithms. We
use independent-samples t-tests to analyze the significance of the differences among these methods, using the
commercial software SPSS (version 19.0). In the classification results of this subsection, the symbol "+" signifies that the
proposed method significantly outperforms the particular model, "-" signifies that the particular model is significantly better
than the proposed method, and "=" signifies that there is no statistically significant difference between the results of the
proposed method and the compared model.
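The statistic behind those "+"/"-"/"=" symbols can be sketched as below (Welch's form of the independent-samples t statistic; the paper computed the tests in SPSS, and the p-value lookup against the t distribution is omitted here). The per-run AUC values are illustrative, not taken from the paper.

```python
# Welch's two-sample t statistic: (mean difference) / (standard error),
# with per-sample variances allowed to differ.
import math

def welch_t(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

auc_runs_a = [0.6697, 0.6675, 0.6713]   # illustrative per-run AUC values
auc_runs_b = [0.6188, 0.6236, 0.6207]
print(round(welch_t(auc_runs_a, auc_runs_b), 2))
```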
To demonstrate the validity and superiority of the proposed method, we first compare it with several popular benchmark
classification models, including three prevailing ensemble learning methods (GBDT, RF, and AdaBoost) and three classic
single algorithms (DT, LR, and MLP, a kind of ANN algorithm). The average experimental results of the comparative methods
are summarized in Table 6. The prediction results were obtained with the number of sub-base learners set to 10 for the
proposed method and the number of base learners set to 50 for GBDT, RF, and AdaBoost.
From Table 6, we can see that the proposed method reaches the highest statistically significant sensitivity, which rep-
resents the real default identification rate in credit risk assessment. Also, it has the best statistically significant G-mean,
which measures the comprehensive performance of sensitivity and specificity in the imbalanced loan classification. Besides,
the suggested method displays the best classification ability as it achieves the highest statistically significant AUC value.
In terms of accuracy, RF and LR achieve the best value, 0.8544. However, the sensitivity values of these two methods are
0.0000, which indicates that the best accuracy is obtained at the cost of wrongly classifying all the positive samples. This
also explains why the specificity values of RF and LR are 1.0000. These results indicate that RF and LR cannot handle the
hard-to-classify positive samples, lacking the ability to identify real default loans.
It is noteworthy that improving sensitivity inevitably reduces specificity. Moreover, for extremely imbalanced data where
positive samples are the minority, increasing sensitivity may also decrease accuracy significantly, because correctly
classifying some positive samples may come at the cost of misclassifying many negative samples. When dealing with highly
imbalanced classification problems, absolute high accuracy is not desirable; what we pursue is good comprehensive
prediction performance with satisfactory generalization ability. In credit risk prediction, the default loan identification rate
indicated by sensitivity is of great importance for investors seeking to avoid losing principal. The proposed method was
developed with this motivation and, as the experimental results show, it achieves this purpose by delivering relatively good
sensitivity and accuracy and the best AUC, i.e., the optimal classification ability.
Considering the significant imbalance of the experimental data, and to obtain a reliable comparative result, the examined
methods were further tested using both oversampled and undersampled data, since oversampling and undersampling are
often employed to handle imbalanced classification problems (we use "OS" to denote oversampling and "US" to denote
undersampling). The results shown in Table 7 lead to the following observations.
First, compared with the results of the benchmarks with the original data distribution in Table 6, these methods combined
with oversampling or undersampling gain distinctly better sensitivity. The performance of AdaBoost and DT with
original and oversampled data is relatively close in terms of accuracy, sensitivity and specificity, while significant

Table 7
Results of DM-ACME and benchmark techniques combined with oversampling/undersampling.

Method Mean ± Standard deviation

Accuracy Sensitivity Specificity G-mean AUC

Proposed method 0.7231 ± 0.0081 0.4607 ± 0.0191 0.7678 ± 0.0127 0.6009 ± 0.0241 0.6697 ± 0.0016
GBDT+OS 0.6235 ± 0.0033 (+) 0.6168 ± 0.0090 (-) 0.6246 ± 0.0041 (+) 0.6207 ± 0.0042 (-) 0.6207 ± 0.0042 (+)
RF+OS 0.7701 ± 0.0021 (-) 0.3107 ± 0.0072 (+) 0.8483 ± 0.0022 (-) 0.5134 ± 0.0055 (+) 0.5795 ± 0.0035 (+)
AdaBoost+OS 0.7562 ± 0.0028 (-) 0.1925 ± 0.0092 (+) 0.8523 ± 0.0027 (-) 0.4050 ± 0.0098 (+) 0.5224 ± 0.0050 (+)
DT+OS 0.7568 ± 0.0037 (-) 0.1934 ± 0.0070 (+) 0.8527 ± 0.0048 (-) 0.4060 ± 0.0069 (+) 0.5231 ± 0.0031 (+)
LR+OS 0.5630 ± 0.0130 (+) 0.5558 ± 0.0166 (+) 0.5642 ± 0.0179 (+) 0.5597 ± 0.0024 (+) 0.5600 ± 0.0024 (+)
MLP+OS 0.7245 ± 0.0000 (=) 0.1572 ± 0.0000 (+) 0.8211 ± 0.0000 (-) 0.3593 ± 0.0000 (+) 0.4892 ± 0.0000 (+)
GBDT+US 0.6033 ± 0.0034 (+) 0.6292 ± 0.0077 (-) 0.5989 ± 0.0048 (+) 0.6138 ± 0.0027 (=) 0.6140 ± 0.0027 (+)
RF+US 0.5912 ± 0.0047 (+) 0.6623 ± 0.0096 (-) 0.5791 ± 0.0066 (+) 0.6193 ± 0.0029 (-) 0.6207 ± 0.0032 (+)
AdaBoost+US 0.5288 ± 0.0045 (+) 0.5577 ± 0.0130 (-) 0.5238 ± 0.0066 (+) 0.5404 ± 0.0044 (+) 0.5408 ± 0.0046 (+)
DT+US 0.5323 ± 0.0058 (+) 0.5558 ± 0.0120 (-) 0.5283 ± 0.0074 (+) 0.5418 ± 0.0054 (+) 0.5421 ± 0.0055 (+)
LR+US 0.5742 ± 0.0148 (+) 0.5437 ± 0.0213 (-) 0.5794 ± 0.0209 (+) 0.5609 ± 0.0019 (+) 0.5615 ± 0.0016 (+)
MLP+US 0.7245 ± 0.0000 (=) 0.1572 ± 0.0000(+) 0.8211 ± 0.0000 (-) 0.3593 ± 0.0000 (+) 0.4892 ± 0.0000 (+)

Table 8
Results of DM-ACME and imbalanced classification algorithms.

Method Mean ± Standard deviation

Accuracy Sensitivity Specificity G-mean AUC

Proposed method 0.7231 ± 0.0081 0.4607 ± 0.0191 0.7678 ± 0.0127 0.6009 ± 0.0241 0.6697 ± 0.0016
SMOTE–ENN+DT 0.6839 ± 0.0041 (+) 0.3359 ± 0.0114 (+) 0.7932 ± 0.0043 (-) 0.4996 ± 0.0086 (+) 0.5396 ± 0.0061 (+)
Easyensemble 0.7701 ± 0.0021 (-) 0.3107 ± 0.0072 (+) 0.8483 ± 0.0022 (-) 0.5134 ± 0.0055 (+) 0.6163 ± 0.0035 (+)
C4.5CS 0.7565 ± 0.0000 (-) 0.1935 ± 0.0000 (+) 0.8524 ± 0.0000 (-) 0.4061 ± 0.0000 (+) 0.5230 ± 0.0000 (+)
SVMCS 0.8544 ± 0.0000 (-) 0.0000 ± 0.0000 (+) 1.0000 ± 0.0000 (-) 0.0000 ± 0.0000 (+) 0.5000 ± 0.0000 (+)

changes occur when the undersampling technique is used. Unlike the foregoing five benchmark classification methods,
the results of the MLP are not affected by these different data sampling techniques.
Second, from Table 7, RF+OS shows the highest accuracy and a significant improvement in sensitivity over its performance
with the original distribution; however, its G-mean and AUC, which reflect overall prediction capability, present no advantage
over the proposed ensemble learning method. Similarly, AdaBoost+OS and DT+OS show better accuracy than
the proposed method but lower sensitivity, G-mean, and AUC. Besides, MLP+OS and MLP+US show accuracy approximating
that of the proposed ensemble method but lower sensitivity, G-mean, and AUC. Among the ten comparative methods,
six have better sensitivity than the proposed method, but they have obviously lower values on the other
four indicators, especially accuracy. In terms of specificity, DT+OS performs the best but has
low sensitivity, G-mean, and AUC. In terms of G-mean, GBDT+OS performs the best; however, its accuracy is relatively low,
and its overall classification ability, indicated by AUC, is much lower than that of the proposed ensemble learning method.
In addition to the above-mentioned methods, to further verify the effectiveness of our method, we also compare it with
several imbalanced classification algorithms. In SMOTE–ENN+DT, the number of nearest neighbors was set to 10. The
number of base classifiers generated by random undersampling in Easyensemble was set to 199, and the number
of base learners in AdaBoost was set to 100. In C4.5CS, the cost was set to 2 for misclassifying a minority class instance
into the majority class, and 1 for misclassifying a majority class sample into the minority class. The results are shown in Table 8.
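The core of the SMOTE step in SMOTE-ENN can be sketched as below: each synthetic minority sample is an interpolation between a minority instance and one of its minority-class nearest neighbors. This is a simplified sketch only; the full SMOTE-ENN variant used in the comparison additionally cleans the result with Edited Nearest Neighbours, which is omitted here (imbalanced-learn's `SMOTEENN` implements the complete procedure).

```python
# Simplified SMOTE: new minority samples are interpolated between a random
# minority point and one of its k nearest minority-class neighbours.
import random

def smote(minority, n_new, k=3, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (brute force, squared Euclidean)
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: sum((a - b) ** 2
                                              for a, b in zip(x, p)))[:k]
        z = rng.choice(neighbours)
        gap = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, z)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(len(smote(minority, n_new=2)))  # 2
```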
From Table 8, we can see that the proposed method has the highest sensitivity, G-mean and AUC, which verifies its
advantage in handling imbalanced classification. SVMCS shows the highest accuracy and specificity but the lowest AUC; its
sensitivity and G-mean are both 0.0000, significantly lower than those of the other four compared methods. Similarly,
Easyensemble and C4.5CS have higher accuracy and specificity than the proposed method, but their minority class
identification ability (sensitivity) and overall performance (G-mean and AUC) are not as good when dealing with imbalanced
credit risk loan data. SMOTE–ENN, which uses a decision tree as the classification algorithm, has the lowest accuracy, and
its sensitivity, G-mean and AUC are also lower than those of the proposed method, meaning that SMOTE–ENN does not
perform as well as the proposed method on the imbalanced credit risk classification problem.
To compare these methods more intuitively, we illustrate the ROC curves of the average results for all comparative
methods in Fig. 3. The false positive rate indicates the probability of wrongly predicting positive samples and should be as
low as possible, while the true positive rate represents the probability of correctly predicting positive samples, so a higher
value is pursued. Therefore, the ROC curve should lie as far above the diagonal line as possible. According to Fig. 3(a),
the proposed model performs significantly better than the other benchmarks. From Fig. 3(b) and (c), the ROC curve of the
proposed method lies clearly above those of the benchmark methods combined with oversampling and undersampling.
Fig. 3(d) illustrates that the overall performance of the proposed method is better than that of the compared imbalanced
classification algorithms. These results indicate that the proposed ensemble learning method has the

Fig. 3. ROC curves of the proposed method and compared techniques.

optimal classification ability, especially the best capacity in correctly classifying real default loans in credit risk assessment
of P2P lending.
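The AUC values summarized by the ROC curves above can also be computed directly from classifier scores via the Mann-Whitney statistic: the probability that a randomly chosen positive (default) sample is scored above a randomly chosen negative one, with ties counted as half. A sketch with illustrative scores (not the paper's data):

```python
# AUC as the Mann-Whitney statistic: fraction of (positive, negative) pairs
# where the positive sample receives the higher score (ties count 0.5).

def auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

defaults = [0.9, 0.8, 0.55]      # scores given to true default loans
good = [0.6, 0.4, 0.3, 0.1]      # scores given to fully paid loans
print(auc(defaults, good))       # 11 of 12 pairs correctly ordered
```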
According to the above analyses, in the imbalanced classification task of loan default prediction in P2P lending,
our proposed ensemble learning method, DM–ACME, outperforms the six compared benchmark methods and four imbalanced
classification algorithms, especially in terms of the real default identification rate (sensitivity), the aggregative indicator
(G-mean) and the overall classification ability measure (AUC). Meanwhile, the proposed method shows statistically significant
superiority in comprehensive classification capacity (AUC) over the compared benchmark methods combined with
oversampling or undersampling.

4.5.3. Feature importance analysis


As mentioned, loan applicants with partial occupational information (job title and employment length) are more likely to
default than those with complete occupational information. Many existing studies [4,33,46] neglected to analyze the default
possibility of applicants with partial occupational information, who are ubiquitous in real-world loan data. Additionally,
the impact of occupational information-related features in loan data on the prediction results is not revealed in relevant
studies such as [1]. In this study, to further investigate the importance of these features and to uncover other crucial
factors, we carried out a feature importance analysis using two different approaches, i.e., Pearson correlation analysis and
GBDT. Pearson correlation analysis is a statistical method that quantifies the linear relationship between the input
variables and the target variable. The Pearson correlation coefficients, estimated by the least squares method, together
with their significance values, indicate the importance of each feature for explaining the target variable. GBDT
is a state-of-the-art tree ensemble method, in which each node of each tree is split based on one of the features selected

Table 9
Feature importance evaluation results obtained by Pearson correlation analysis.

Importance interval Significance Number of features Accumulated importance Number of occupational information-related features

[0.1000,0.2000) 0.0000 3 0.5071 0


[0.0800,0.1000) 0.0000 3 0.2538 0
[0.0600,0.0800) 0.0000 8 0.5402 0
[0.0400,0.0600) 0.0000 19 0.8714 5
[0.0200,0.0400) 0.0000 12 0.3276 1
[0.0000,0.0200) [0.0000,0.0500) 23 0.2738 2

Table 10
Details of occupational information-related features obtained by Pearson correlation analysis.

Feature Importance IR Feature description

emp_title = none 0.0431 27 Job title information is not available


emp_title = job title 0.0431 28 Job title information is available
applicant type = 0 0.0426 29 Job title and employment length information are available
applicant type = 3 0.0425 30 Job title and employment length information are not available
emp_length=none 0.0420 31 Employment length information is not available
emp_length = 10+ 0.0280 39 Employment length is equal to or more than 10 years
emp_length = 1 0.0172 50 Employment length of a borrower for the current employer is 1 year
applicant type = 2 0.0085 62 Job title information is not available while employment length is available

Table 11
Feature importance evaluation results obtained by GBDT.

Importance interval Number of features Accumulated importance Number of occupational information-related features

[0.0800,0.1100) 1 0.1087 0
[0.0600,0.0800) 2 0.1242 0
[0.0400,0.0600) 2 0.0950 0
[0.0200,0.0400) 11 0.2882 0
[0.0100,0.0200) 17 0.2571 1
[0.0000,0.0100) 34 0.1628 5

Table 12
Details of occupational information-related features obtained by GBDT.

Feature Importance IR Feature description

emp_title = none 0.0179 21 Job title information is not available


emp_length < 1 0.0077 40 Employment length of a borrower for the current employer is less than 1 year
emp_title = job title 0.0045 44 Job title information is available
emp_length = 2 0.0037 45 Employment length of a borrower for the current employer is 2 years
emp_length = 10+ 0.0018 56 Employment length is equal to or more than 10 years
applicant type = 3 0.0010 66 Job title and employment length information are not available

using measures such as Gini impurity or mean squared error. In this study, feature importance is estimated as the normalized
total reduction of mean squared error brought by each feature.
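As an illustration of the first approach, the Pearson coefficient for a single feature can be computed as below. This is a simplified pure-Python sketch with made-up values; the accompanying significance test, omitted here, uses a t-statistic with n − 2 degrees of freedom.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between one feature column x
    and the binary default label y; |r| serves as the importance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up example: larger dti values loosely co-occur with default
dti = [5.0, 12.0, 18.0, 25.0, 30.0, 35.0]
default = [0, 0, 0, 1, 1, 1]
r = pearson_r(dti, default)  # positive r: higher dti, higher default rate
```

In practice a library routine such as `scipy.stats.pearsonr` returns both the coefficient and its p-value in one call; the hand-rolled version above only shows where the coefficient comes from.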
The results of the Pearson correlation analysis are illustrated in Table 9, where the 68 features with a significance of no
more than 0.050 are included; these features are ranked according to the absolute values of their correlation coefficients,
which are taken to reflect the feature importance. From Table 9, there are 8 features in total that are relevant to oc-
cupational information. Their corresponding original features, importance, importance rankings (IR) and feature descrip-
tions are shown in Table 10. From Table 10, the importance rankings of the newly generated features in this study, in-
cluding "emp_title = none", "emp_title = job title", "applicant type = 0", "applicant type = 3", "emp_length = none",
"emp_length = 10+", "emp_length = 1", and "applicant type = 2", which are obtained by one-hot encoding in data pre-
processing, are 27, 28, 29, 30, 31, 39, 50 and 62, respectively. The results in Table 10 indicate that the job title and
employment length information play relatively important roles in loan default prediction.
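The indicator features above come from the one-hot encoding of the categorical occupational fields. A minimal sketch of that preprocessing step (the records and category values here are hypothetical):

```python
def one_hot(records, feature, categories):
    """Expand a categorical column into binary indicator columns,
    e.g. emp_title -> 'emp_title = none', 'emp_title = job title'."""
    encoded = []
    for rec in records:
        row = {k: v for k, v in rec.items() if k != feature}
        for cat in categories:
            row[f"{feature} = {cat}"] = int(rec[feature] == cat)
        encoded.append(row)
    return encoded

# Hypothetical applicants; a missing job title is mapped to 'none'
records = [{"emp_title": "job title"}, {"emp_title": "none"}]
rows = one_hot(records, "emp_title", ["none", "job title"])
# rows[1]["emp_title = none"] == 1 flags the applicant whose job
# title information is not available
```

Encoding the "none" category as its own indicator, rather than discarding records with missing values, is what lets the model learn from applicants with only partial occupational information.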
The results of the feature importance analysis by GBDT are illustrated in Table 11, where the 67 features with an impor-
tance of more than 0.0000 are included. From Table 11, there are 6 features in total that are relevant to occupational
information. Their corresponding original features, importance values, rankings (IR), and descriptions are shown in
Table 12. From Table 12, the importance rankings of the newly generated features in this study, including "emp_title = none",
"emp_length < 1", "emp_title = job title", "emp_length = 2", "emp_length = 10+", and "applicant type = 3", are 21, 40, 44,
45, 56 and 66, respectively. These results confirm the significance of the job title information and employment length-related
features in loan default prediction.
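The GBDT importance scores in Tables 11 and 12 are normalized so that they sum to one across all features. That normalization and ranking step can be sketched as follows (the raw reduction values are invented for illustration):

```python
def normalized_importance(raw_reductions):
    """Scale each feature's accumulated MSE reduction so the
    importances across all features sum to 1, then rank them."""
    total = sum(raw_reductions.values())
    imp = {f: v / total for f, v in raw_reductions.items()}
    ranking = sorted(imp, key=imp.get, reverse=True)
    return imp, ranking

# Invented accumulated MSE reductions over all trees in the ensemble
raw = {"sub_grade": 4.2, "dti": 1.9, "emp_title = none": 0.6}
imp, ranking = normalized_importance(raw)
# ranking is a list of feature names ordered by normalized importance
```

This is the same convention used by tree-ensemble libraries such as scikit-learn, whose `feature_importances_` attribute also reports impurity reductions normalized to sum to one.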

Thus, we can conclude that it is necessary and helpful to consider occupational information during the modeling process.
Moreover, according to these feature importance analysis results, applicants with partial occupational information are
more likely to default. As default loans may cause tremendous principal losses, it is wise to avoid investing in borrowers
who do not have complete occupational information.
Considering the crucial roles that important features play in practical application and prediction model construction,
in addition to the foregoing analysis of occupational information-related features, we further discuss the shared features
ranked among the top 10 by both feature importance analysis methods, namely, "sub_grade", "dti" and "acc_open_past_24mths". Of
these features, "sub_grade" is ranked first by both methods. This feature stands for the sub credit grade of loan applicants
assigned by Lending Club, which strongly affects the likelihood of repayment. According to the evaluation mechanism for
"sub_grade" used by Lending Club, the sub credit grade is an aggregative indicator obtained by trading off the borrower's
credit grade against the loan amount of the application. That is to say, the more a borrower borrows, the farther below the
initial credit grade the sub credit grade will be, and the higher the default risk of the loan. The feature "dti" signifies
the ratio between the applicant's monthly debt payment and income. The results reveal that the lower the "dti" value
is, the less risk the investors will take: when a borrower earns more than he or she needs to repay, the borrower is
under less pressure and the loan is more likely to be paid back on time. The third feature,
"acc_open_past_24mths", indicates the number of accounts opened by the borrower in the past 24 months. Obviously, when making invest-
ment decisions, it would be better to avoid lending to those who have high "acc_open_past_24mths" records. From the
above analyses, we may conclude that the better we understand the essence of loan data features, the more intelligent our
prediction model, and the better our investment decisions will become.

5. Conclusion

This study develops a novel distance-to-model and adaptive clustering-based multi-view ensemble (DM–ACME) classi-
fication method to improve the precision of loan default prediction in P2P lending. The proposed method comprises
diversity creation techniques grounded in multi-view learning and a newly proposed adaptive clustering approach, and the
ensemble members are integrated with a soft probability technique and a distance-to-model-based dynamic weight genera-
tion strategy.
Experimental studies based on real-world P2P loan data from Lending Club and five performance metrics, taking several
state-of-the-art classification techniques as comparative methods, demonstrate the capability and effectiveness of
the proposed ensemble method, DM–ACME, in real default recognition. Our experimental results show that the proposed
ensemble model not only achieves good accuracy but also effectively improves the generalization ability of real default loan
identification, namely sensitivity. Multi-view learning, adaptive clustering-based model generation, and soft probability and
DM-based output integration are verified to encourage good ensemble model construction. Moreover, our trained
ensemble model can address the loan default prediction of a larger group of loan applicants, including borrowers lacking
occupational information, who are neglected by previous studies. The proposed DM–ACME provides an effective credit risk
assessment approach by which P2P platforms and individual investors can make wiser decisions.
In future work, we will evaluate the proposed ensemble method on more loan data and investigate other tactics for
ensemble diversity creation, as well as techniques to optimize the adaptive clustering mechanism for constructing base learners
that achieve optimal tradeoffs between accuracy and diversity. We will continue to improve the proposed ensemble learning
method to make it more robust and efficient, while seeking to achieve more detailed default prediction by exploring
multi-class classification approaches.

CRediT author statement

Yu Song, Yuyan Wang, Xin Ye, and Yanzhang Wang contributed to the conception of the study;
Yu Song, Yuyan Wang, and Xin Ye designed the proposed method;
Yu Song and Yuyan Wang analyzed the data and performed experiments;
Yu Song and Yuyan Wang wrote the manuscript;
Xin Ye, Dujuan Wang, and Yunqiang Yin helped perform the result analysis with constructive discussions;
Xin Ye, Dujuan Wang, Yunqiang Yin, and Yanzhang Wang helped improve the writing of the manuscript.

Declaration of Competing Interest

None

Acknowledgement

This research is supported in part by the National Natural Science Foundation of China (No. 71533001).

References

[1] M. Ala’raj, M.F. Abbod, A new hybrid ensemble credit scoring model based on classifiers consensus system approach, Expert Syst. Appl. 64 (2016)
36–55.
[2] F.M. Amasyali, Improved space forest: a meta ensemble method, IEEE Trans. Cybern. (2018) 1–11.
[3] E. Angelini, G.D. Tollo, A. Roli, A neural network approach for credit risk evaluation, Q. Rev. Econ. Financ. 48 (4) (2008) 733–755.
[4] I. Baklouti, A. Baccar, Evaluating the predictive accuracy of microloan officers’ subjective judgment, Int. J. Res. Stu. Manag. 2 (2) (2013) 21–34.
[5] C. Beleites, U. Neugebauer, T. Bocklitz, C. Krafft, J. Popp, Sample size planning for classification models, Anal. Chim. Acta 760 (2013) 25–33.
[6] J.C. Bezdek, Cluster validity with fuzzy sets, J. Cybern. 3 (3) (1973) 58–73.
[7] U. Bhowan, M. Johnston, M. Zhang, X. Yao, Evolving diverse ensembles using genetic programming for classification with unbalanced data, IEEE Trans.
Evol. Comput. 17 (3) (2013) 368–386.
[8] R. Campos, C. Sérgio, T. Salles, M.A. Gonçalves, Stacking bagged and boosted forests for effective automated classification, International ACM SIGIR
Conference on Research and Development in Information Retrieval, 2017.
[9] Z. Chen, H. Han, Q. Yan, B. Yang, L. Peng, L. Zhang, J. Li, A first look at android malware traffic in first few minutes, 2015 IEEE Trustcom/BigDataSE/ISPA,
2015.
[10] S.F. Crone, S. Finlay, Instance sampling in credit scoring: an empirical study of sample size and balancing, Int. J. Forecast. 28 (1) (2012) 224–238.
[11] A. Cutler, D.R. Cutler, J.R. Stevens, Random forests, Mach. Learn. 45 (1) (2004) 157–176.
[12] J. Duarte, S. Siegel, L. Young, Trust and credit: the role of appearance in peer-to-peer lending, Rev. Financ. Stud. 25 (8) (2012) 2455–2484.
[13] R. Emekter, Y. Tu, B. Jirasakuldech, M. Lu, Evaluating credit risk and loan performance in online peer-to-peer (P2P) lending, Appl. Econ. 47 (1) (2015)
54–70.
[14] X. Feng, Z. Xiao, B. Zhong, J. Qiu, Y. Dong, Dynamic ensemble classification for credit scoring using soft probability, Appl. Soft.Comput. 65 (2018)
139–151.
[15] J.H. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Stat. 29 (5) (2001) 1189–1232.
[16] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting- and
hybrid-based approaches, IEEE Trans. Syst. Man. Cybern. C Appl. Rev. 42 (4) (2012) 463–484.
[17] S. Gu, Y. Jin, Multi-train: a semi-supervised heterogeneous ensemble classifiers, Neurocomputing 249 (2017) 202–211.
[18] D.J. Hand, W.E. Henley, Statistical classification methods in consumer credit scoring: a review, J. R. Stat. Soc. Ser. A-Stat. Soc. 160 (3) (1997) 523–541.
[19] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (2004)
2639–2664.
[20] S. Jadhav, H. He, K. Jenkins, Information gain directed genetic algorithm wrapper feature selection for credit rating, Appl. Soft. Comput. 69 (2018)
541–553.
[21] J.F. Díez-Pastor, J.J. Rodríguez, C.I. García-Osorio, L.I. Kuncheva, Diversity techniques improve the performance of the best imbalance learning ensembles, Inf. Sci. 325 (2015)
98–117.
[22] H. Kaneko, M. Arakawa, K. Funatsu, Applicability domains and accuracy of prediction of soft sensor models, Aiche J. 57 (6) (2011) 1506–1513.
[23] M.Y Kiang, A comparative assessment of classification methods, Decis. Support. Syst. 35 (4) (2003) 441–454.
[24] A. Kizilaslan, A. Lookman, Can economically intuitive factors improve ability of proprietary algorithms to predict defaults of peer-to-peer loans?
Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2987613, 2017.
[25] L.I. Kuncheva, J.C. Bezdek, R.P.W Duin, Decision templates for multiple classifier fusion: an experimental comparison, Pattern Recognit. 34 (2) (2001)
299–314.
[26] X. Liu, J. Wu, Z. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Trans. Syst. Man Cybern. Part B Cybern. 39 (2009) 539–550.
[27] S. Luo, B. Cheng, C.H. Hsieh, Prediction model building with clustering-launched classification and support vector machines in credit scoring, Expert
Syst. Appl. 36 (4) (2009) 7562–7566.
[28] O. Maimon, L. Rokach, Improving supervised learning by feature decomposition, Foundations of Information and Knowledge Systems, Second Interna-
tional Symposium, 2002.
[29] D.W. Opitz, Feature selection for ensembles, in: 16th National Conference on Artificial Intelligence, Orlando, America, 1999, pp. 379–384.
[30] Y. Pang, L. Peng, Z. Chen, B. Yang, H. Zhang, Imbalanced learning based on adaptive weighting and gaussian function synthesizing with an application
on android malware detection, Inf. Sci. 484 (2019) 95–112.
[31] L. Peng, B. Yang, Y. Chen, A. Abraham, Data gravitation based classification, Inf. Sci. 179 (2009) 809–819.
[32] L. Peng, H. Zhang, B. Yang, Y. Chen, A new approach for imbalanced data classification based on data gravitation, Inf. Sci. 288 (2014) 347–373.
[33] M. Polena, T. Regner, Determinants of borrowers’ default in P2P lending under consideration of the loan risk class, Games, MDPI, Open Access J. 9 (4)
(2018) 1–17.
[34] L. Rokach, Ensemble-based classifiers, Artif. Intell. Rev. 33 (2010) 1–39.
[35] C. Serrano-Cinca, B. Gutiérrez-Nieto, The use of profit scoring as an alternative to credit scoring systems in peer-to-peer (P2P) lending, Decis. Support.
Syst. 89 (2016) 113–122.
[36] B.K. Singh, K. Verma, A.S. Thoke, Fuzzy cluster based neural network classifier for classifying breast tumors in ultrasound images, Expert Syst. Appl.
66 (2016) 114–123.
[37] K.M. Ting, An instance-weighting method to induce cost-sensitive trees, IEEE Trans. Knowl. Data Eng. 14 (3) (2002) 659–665.
[38] B. Twala, Multiple classifier application to credit risk assessment, Expert Syst. Appl. 37 (4) (2010) 3326–3336.
[39] G. Tzortzis, A. Likas, Kernel-based weighted multiview clustering, the 12th IEEE International Conference on Data Mining, 2012.
[40] K. Veropoulos, C. Campbell, N. Cristianini, Controlling the sensitivity of support vector machines, the International Joint Conference on AI, 1999.
[41] G. Wang, J. Ma, L. Huang, K. Xu, Two credit scoring models based on dual strategy ensemble trees, Knowl.-Based Syst. 26 (2012) 61–68.
[42] Y. Wang, D. Wang, N. Geng, Y. Wang, Y. Yin, Y. Jin, Stacking-based ensemble learning of decision trees for interpretable prostate cancer detection, Appl.
Soft Comput. 77 (2019) 188–204.
[43] Y. Wang, D. Wang, X. Ye, Y. Wang, Y. Yin, Y. Jin, A tree ensemble-based two-stage model for advanced-stage colorectal cancer survival prediction, Inf.
Sci. 474 (2019) 106–124.
[44] Z. Wang, C. Jiang, Y. Ding, X. Lv, Y. Liu, A novel behavioral scoring model for estimating probability of default over time in peer-to-peer lending,
Electron. Commer. Res. Appl. 27 (2017) 74–82.
[45] Y. Xia, C. Liu, B. Da, F. Xie, A novel heterogeneous ensemble credit scoring model based on bstacking approach, Expert Syst. Appl. 93 (2018) 182–199.
[46] H. Xiao, Z. Xiao, Y. Wang, Ensemble classification based on supervised clustering for credit scoring, Appl. Soft. Comput. 43 (2016) 73–86.
[47] X. Yao, J. Crook, G. Andreeva, Support vector regression for loss given default modelling, Eur. J. Oper. Res. 240 (2) (2015) 528–538.
[48] L. Yu, R. Zhou, L. Tang, R. Chen, Dbn-based resampling svm ensemble learning paradigm for credit classification with imbalanced data, Appl. Soft.
Comput. 69 (2018) 192–202.
[49] J. Zhao, X. Xie, X. Xu, S. Sun, Multi-view learning overview: recent progress and new challenges, Inf. Fusion. 38 (2017) 43–54.
[50] L. Zhou, K.P. Tam, H. Fujita, Predicting the listing status of Chinese listed companies with multi-class classification models, Inf. Sci. 328 (2016) 222–236.
