
Pattern Recognition 48 (2015) 1925–1935


META-DES: A dynamic ensemble selection framework using meta-learning

Rafael M.O. Cruz a,*, Robert Sabourin a, George D.C. Cavalcanti b, Tsang Ing Ren b

a LIVIA, École de Technologie Supérieure, University of Quebec, Montreal, Quebec, Canada (http://www.livia.etsmtl.ca)
b Centro de Informática, Universidade Federal de Pernambuco, Recife, PE, Brazil (http://www.cin.ufpe.br/~viisar)

* Corresponding author. E-mail address: rafaelmenelau@gmail.com (R.M.O. Cruz).

ARTICLE INFO

Article history:
Received 9 April 2014
Received in revised form 24 October 2014
Accepted 2 December 2014
Available online 12 December 2014

Keywords:
Ensemble of classifiers
Dynamic ensemble selection
Meta-learning
Classifier competence

ABSTRACT

Dynamic ensemble selection systems work by estimating the level of competence of each classifier from a pool of classifiers. Only the most competent ones are selected to classify a given test sample. This is achieved by defining a criterion to measure the level of competence of a base classifier, such as its accuracy in local regions of the feature space around the query instance. However, using only one criterion about the behavior of a base classifier is not sufficient to accurately estimate its level of competence. In this paper, we present a novel dynamic ensemble selection framework using meta-learning. We propose five distinct sets of meta-features, each one corresponding to a different criterion to measure the level of competence of a classifier for the classification of input samples. The meta-features are extracted from the training data and used to train a meta-classifier to predict whether or not a base classifier is competent enough to classify an input instance. During the generalization phase, the meta-features are extracted from the query instance and passed down as input to the meta-classifier. The meta-classifier estimates whether a base classifier is competent enough to be added to the ensemble. Experiments are conducted over several small sample size classification problems, i.e., problems with a high degree of uncertainty due to the lack of training data. Experimental results show that the proposed meta-learning framework greatly improves classification accuracy when compared against current state-of-the-art dynamic ensemble selection techniques.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Multiple Classifier Systems (MCS) aim to combine classifiers to increase the recognition accuracy in pattern recognition systems [1,2]. MCS are composed of three phases [3]: (1) Generation, (2) Selection and (3) Integration. In the first phase, a pool of classifiers is generated. In the second phase, a single classifier or a subset having the best classifiers of the pool is(are) selected. We refer to the subset of classifiers as an ensemble of classifiers (EoC). The last phase is the integration, and the predictions of the selected classifiers are combined to obtain the final decision [1].

For the second phase, there are two types of selection approaches: static and dynamic. In static approaches, the selection is performed during the training stage of the system. Then, the selected classifier or EoC is used for the classification of all unseen test samples. In contrast, dynamic ensemble selection approaches (DES) [4–13] select a different classifier or a different EoC for each new test sample. DES techniques rely on the assumption that each base classifier is an expert in a different local region of the feature space [14]. So, given a new test sample, DES techniques aim to select the most competent classifiers for the local region of the feature space where the test sample is located. Only the classifiers that attain a certain competence level, according to a selection criterion, are selected. Recent work in the dynamic selection literature demonstrates that dynamic selection techniques are an effective tool for classification problems that are ill-defined, i.e., for problems where the size of the training data is small and there are not enough data available to model the classifiers [6,7].

The key issue in DES is to define a criterion to measure the level of competence of a base classifier. Most DES techniques [4,12,11,10,15–18] use estimates of the classifiers' local accuracy in small regions of the feature space surrounding the query instance as a search criterion to perform the ensemble selection. However, in our previous work [10], we demonstrated that the use of local accuracy estimates alone is insufficient to achieve results close to the Oracle performance. The Oracle is an abstract model defined in [19] which always selects the classifier that predicted the correct label, for the given query sample, if such a classifier exists. In other words, it represents the ideal classifier selection scheme. In addition, as reported by Ko et al. [4], addressing the behavior of the Oracle is much more complex than applying a simple neighborhood approach.
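Although the Oracle is only an abstract reference model, its performance bound is straightforward to estimate on labelled data. The following is a minimal sketch of that estimate, assuming scikit-learn-style base classifiers; the function and variable names are ours, not part of the original study.

```python
import numpy as np

def oracle_accuracy(pool, X_test, y_test):
    """Upper bound: fraction of test samples for which at least one
    classifier in the pool predicts the correct label."""
    # One row of predictions per base classifier, one column per test sample.
    predictions = np.array([clf.predict(X_test) for clf in pool])
    correct_somewhere = (predictions == y_test).any(axis=0)
    return correct_somewhere.mean()
```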

On the other hand, DES techniques based on other criteria, such as the degree of consensus of the ensemble classifiers [5,6], encounter some problems when the search cannot find a consensus among the ensembles. In addition, they neglect the local performance of the base classifiers. As stated by the No Free Lunch theorem [20], no algorithm is better than any other over all possible classes of problems. Using a single criterion to measure the level of competence of a base classifier is very error-prone. Thus, we believe that multiple criteria to measure the competence of a base classifier should be taken into account in order to achieve a more robust dynamic ensemble selection technique.

In this paper, we propose a novel dynamic ensemble selection framework using meta-learning. From the meta-learning perspective, the dynamic ensemble selection problem is considered as another classification problem, called the meta-problem. The meta-features of the meta-problem are the different criteria used to measure the level of competence of the base classifier. We propose five sets of meta-features in this paper. Each set captures a different property about the behavior of the base classifier, and can be seen as a different dynamic selection criterion, such as the classification performance in a local region of the feature space and the classifier confidence for the classification of the input sample. Using five distinct sets of meta-features, even though one criterion might fail due to problems in the local regions of the feature space [10] or due to low confidence results [21], the system can still achieve a good performance as other meta-features are also considered by the selection scheme. Furthermore, in a recent analysis [22] we compared the criteria used to measure the competence of base classifiers embedded in different DES techniques. The results demonstrate that, given the same query sample, distinct DES criteria select a different base classifier as the most competent one. Thus, they are not fully correlated. Hence, we believe that a more robust dynamic ensemble selection technique is achieved using five sets of meta-features rather than only one.

The meta-features are used as input to a meta-classifier that decides whether or not a base classifier is competent enough for the classification of an input sample. The use of meta-learning has recently been proposed in [23] as an alternative for performing classifier selection in static scenarios. We believe that we can carry this further, and extend the use of meta-learning to dynamically estimate the level of competence of a base classifier.

The proposed framework is divided into three phases: overproduction, meta-training and generalization. In the overproduction stage, a pool of classifiers is generated using the training data. In the meta-training stage, the five sets of meta-features are extracted from the training data, and are used to train the meta-classifier that works as the classifier selector. During the generalization phase, the meta-features are extracted from the query instance and passed down as inputs to the meta-classifier. The meta-classifier estimates whether a base classifier is competent enough to classify the given test instance. Thus, the proposed system differs from the current state-of-the-art dynamic selection techniques not only because it uses multiple criteria to perform the classifier selection, but also because the classifier selection rule is learned by the meta-classifier using the training data.

The generalization performance of the system is evaluated over 30 classification problems. We compare the proposed framework against eight state-of-the-art dynamic selection techniques as well as static combination methods. The evaluation is focused on small size datasets, since DES techniques have been shown to be an effective tool for problems where the level of uncertainty for recognition is high due to few training samples [6]. However, a few larger datasets were also considered in order to evaluate the performance of the proposed framework under different conditions. The goal of the experiments is to answer the following research questions: (1) Can the use of multiple DES criteria, as meta-features, lead to a more robust dynamic selection technique? (2) Does the proposed framework outperform current DES techniques for ill-defined problems?

This paper is organized as follows: Section 2 introduces the notion of classifier competence, and the state-of-the-art techniques for dynamically measuring the classifiers' competence are presented. The proposed framework is presented in Section 3. The experimental study is conducted in Section 4. Finally, our conclusion is presented in the last section.

2. Classifier competence for dynamic selection

Classifier competence defines how much we trust an expert, given a classification task. The notion of competence is used extensively in the field of machine learning as a way of selecting, from the plethora of different classification models, the one that best fits the given problem. Let C = {c1, ..., cM} (M is the size of the pool of classifiers) be the pool of classifiers and ci a base classifier belonging to the pool C. The goal of dynamic selection is to find an ensemble of classifiers C' ⊂ C that has the best classifiers to classify a given test sample xj. This is different from static selection, where the ensemble of classifiers C' is selected during the training phase, considering the global performance of the base classifiers over a validation dataset [24–27].

Nevertheless, the key issue in dynamic selection is how to measure the competence of a base classifier ci for the classification of a given query sample xj. In the literature, we can observe three categories: the classifier accuracy over a local region, i.e., in a region of the feature space surrounding the query instance xj; decision templates [28], which are techniques that work in the decision space (i.e., a space defined by the outputs of the base classifiers); and the extent of consensus or confidence. The three categories are described in the following subsections.

2.1. Classifier accuracy over a local region

Classifier accuracy is the most commonly used criterion for dynamic classifier and ensemble selection techniques [12,4,10,17,13,15,29,16,9]. Techniques that are based on local accuracy first define a small region in the feature space surrounding a given test instance xj, called the region of competence. This region is computed using either the K-NN algorithm [4,12,10] or clustering techniques [17,13], and can be defined either in the training set [12] or in the validation set, such as in the KNORA techniques [4].

Based on the samples belonging to the region of competence, a criterion is applied in order to measure the level of competence of a base classifier. For example, the Overall Local Accuracy (OLA) [12] technique uses the accuracy of the base classifier in the whole region of competence as a criterion to measure its level of competence. The classifier that obtains the highest accuracy rate is considered the most competent one. The Local Classifier Accuracy (LCA) [12] computes the performance of the base classifier in relation to a specific class label using a posteriori information [30]. The Modified Local Accuracy (MLA) [16] works similarly to the LCA technique, with the only difference being that each sample belonging to the region of competence is weighted by its Euclidean distance to the query instance. That way, instances from the region of competence that are closer to the test sample have a higher influence when computing the performance of the base classifier.
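As an illustration of the local-accuracy criteria discussed above, the sketch below computes a k-NN region of competence and the OLA and LCA competence estimates for one base classifier. It assumes scikit-learn-style classifiers and a separate validation set; the LCA variant shown (restricting the neighbours to the class predicted for the query) is one common formulation, and all names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def region_of_competence(x_query, X_val, k=7):
    """Indices of the K nearest validation samples around the query."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_val)
    _, idx = nn.kneighbors(x_query.reshape(1, -1))
    return idx[0]

def ola(clf, X_val, y_val, idx):
    """Overall Local Accuracy: accuracy over the whole region of competence."""
    return np.mean(clf.predict(X_val[idx]) == y_val[idx])

def lca(clf, x_query, X_val, y_val, idx):
    """Local Classifier Accuracy: accuracy over the neighbours whose true
    label equals the class the classifier assigns to the query."""
    predicted = clf.predict(x_query.reshape(1, -1))[0]
    mask = y_val[idx] == predicted
    if not mask.any():
        return 0.0
    return np.mean(clf.predict(X_val[idx][mask]) == predicted)
```

An MLA-style variant would additionally weight each neighbour according to its distance to the query, so that closer samples count more in the average.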

The classifier rank method [29] uses the number of consecutive correctly classified samples as a criterion to measure the level of competence. The classifier that correctly classifies the most consecutive samples coming from the region of competence is considered to have the highest competence level or rank.

Ko et al. [4] proposed the K-Nearest Oracles (KNORA) family of techniques, inspired by the Oracle concept. Four techniques are proposed: the KNORA-Eliminate (KNORA-E), which considers that a base classifier ci is competent for the classification of the query instance xj if ci achieves a perfect accuracy for the whole region of competence. Only the base classifiers with a perfect accuracy are used during the voting scheme. In the KNORA-Union (KNORA-U) technique, the level of competence of a base classifier ci is measured by the number of correctly classified samples in the defined region of competence. In this case, every classifier that correctly classified at least one sample can submit a vote. In addition, two weighted versions, KNORA-E-W and KNORA-U-W, were also proposed, in which the influence of each sample belonging to the region of competence is weighted based on its Euclidean distance to the query sample xj. Lastly, Xiao et al. [9] proposed the Dynamic Classifier Ensemble for Imbalanced Data (DCEID), which is based on the same principles as the LCA technique. However, this technique also takes into account each class prior probability when computing the performance of the base classifier for the defined region of competence, in order to deal with imbalanced distributions.

The difference between these techniques lies in how they utilize the local accuracy information in order to measure the level of competence of a base classifier. The main issue with these techniques arises from the fact that they depend on the performance of the techniques that define the region of competence, such as K-NN or clustering techniques. In our previous work [10], we demonstrated that the effectiveness of dynamic selection techniques is limited by the performance of the algorithm that defines the region of competence. The dynamic selection technique is likely to commit errors when outlier instances (i.e., mislabeled samples) exist around the query sample in the feature space [10]. Using the local accuracy information alone is not sufficient to achieve results close to the Oracle. Moreover, any difference between the distribution of the validation and test datasets may negatively affect the system performance. Consequently, we believe that additional information should also be considered.

2.2. Decision templates

In this class of methods, the goal is also to select samples that are close to the query instance xj. However, the similarity is computed over the decision space through the concept of decision templates [28]. This is performed by transforming both the test instance xj and the validation data into output profiles. The output profile of an instance xj is denoted by x̃j = {x̃j,1, x̃j,2, ..., x̃j,M}, where each x̃j,i is the decision yielded by the base classifier ci for the sample xj.

Based on the information extracted from the decision space, the K-Nearest Output Profiles (KNOP) technique [7] is similar to the KNORA technique, with the difference being that KNORA works in the feature space, while KNOP works in the decision space. The KNOP technique first defines a set with the samples that are most similar to the output profile of the input sample, x̃j, in the decision space, called the output profiles set. The validation set is used for this purpose. Then, similar to the KNORA-E technique, only the base classifiers that achieve a perfect recognition accuracy for the samples belonging to the output profiles set are used during the voting scheme. The Multiple Classifier Behavior (MCB) technique [11] also defines a set with the output profiles that are most similar to the input sample using the decision space. Here, the selection criterion is based on a threshold. The base classifiers that achieve a performance higher than the predefined threshold are considered competent and are selected to form the ensemble.

The advantage of this class of methods is that they are not limited by the quality of the region of competence defined in the feature space, with the similarity computed based on the decision space rather than the feature space. However, the disadvantage comes from the fact that only global information is considered, while the local expertise of each base classifier is neglected.

2.3. Extent of consensus or confidence

Different from other methods, techniques that are based on the extent of consensus work by considering a pool of ensembles of classifiers (EoC) rather than a pool of classifiers. Hence, the first step is to generate a population of EoC, C* = {C'1, C'2, ..., C'M'} (M' is the number of EoC generated), using an optimization algorithm such as genetic algorithms or greedy search [26,5,27]. Then, for each new query instance xj, the level of competence of an ensemble of classifiers C'i is equal to the extent of consensus among its base classifiers.

Several criteria based on this paradigm have been proposed: the Margin-based Dynamic Selection (MDS) [5], where the criterion is the margin between the most voted class and the second most voted class. The margin is computed simply by considering the difference between the number of votes received by the most voted class and those received by the second most voted class. Two variations of the MDS were proposed in [5]: the Class-Strength Dynamic Selection (CSDS), which includes the ensemble decision in the computation of the MDS, and the GSDS, where the global performance of each EoC is also taken into account [6]. Another technique from this paradigm is the Ambiguity-guided Dynamic Selection (ADS) [5], which uses the ambiguity among the base classifiers of an EoC as the criterion for measuring the competence level of an EoC. The ambiguity is calculated by the number of base classifiers of an ensemble that disagree with the ensemble decision. The lower the number of classifiers that disagree with the ensemble decision, the higher the level of competence of the EoC.

The greatest advantage of this class of methods stems from the fact that it does not require information from the region of competence. Thus, it does not suffer from the limitations of the algorithm that defines the region of competence. However, these techniques present the following disadvantages: in many cases, the search cannot find an EoC with an acceptable confidence level, there is a tie between different members of the pool, and the systems end up performing a random decision [6]. In addition, some classifiers are more overtrained than others; in this case, they end up dominating the outcome even though they do not present better recognition performance [31]. The pre-computation of ensembles also greatly increases the overall system complexity, as we are dealing with a pool of EoC rather than a pool of classifiers.
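A minimal sketch of the margin criterion used by the consensus-based methods above (the MDS rule): the competence of an EoC is the normalized gap between its most voted and second most voted class. Scikit-learn-style classifiers are assumed and all names are ours.

```python
from collections import Counter

def consensus_margin(eoc, x_query):
    """Margin between the two most voted classes, normalized by the EoC size."""
    votes = Counter(clf.predict(x_query.reshape(1, -1))[0] for clf in eoc)
    counts = sorted(votes.values(), reverse=True)
    second = counts[1] if len(counts) > 1 else 0
    return (counts[0] - second) / len(eoc)

def select_eoc(population, x_query):
    """Dynamically pick the ensemble of classifiers with the largest margin."""
    return max(population, key=lambda eoc: consensus_margin(eoc, x_query))
```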

3. The proposed framework: META-DES

3.1. Problem definition

From the meta-learning perspective, the dynamic selection problem can be seen as another classification problem, called the meta-problem. This meta-problem uses different criteria regarding the behavior of a base classifier in order to decide whether it is competent enough to classify a given sample xj. Thus, a dynamic selection system can be defined based on two environments: a classification environment, in which the input features are mapped into a set of class labels Ω = {w1, w2, ..., wL}, and a meta-classification environment, in which information about the behavior of the base classifier is extracted from the classification environment and used to decide whether a base classifier ci is competent enough to classify xj.

To keep with the conventions of the meta-learning literature, we define the proposed dynamic ensemble selection in a meta-learning framework as follows:

• The meta-problem consists in defining whether a base classifier ci is competent enough to classify xj.
• The meta-classes of this meta-problem are either "competent" or "incompetent" to classify xj.
• Each meta-feature fi corresponds to a different criterion to measure the level of competence of a base classifier.
• The meta-features are encoded into a meta-features vector vi,j which contains the information about the behavior of a base classifier ci in relation to the input instance xj.
• A meta-classifier λ is trained based on the meta-features vi,j to predict whether or not ci will achieve the correct prediction for xj.

In other words, a meta-classifier λ is trained, based on vi,j, to predict whether a base classifier ci is competent enough to classify a given test sample xj. Thus, the proposed system differs from the current state-of-the-art dynamic selection techniques not only because it uses multiple criteria, but also because the selection rule is learned by the meta-classifier λ using the training data.

3.2. The proposed META-DES

The META-DES framework is divided into three phases (Fig. 1):

1. The overproduction phase, where the pool of classifiers C = {c1, ..., cM}, composed of M classifiers, is generated using the training instances xj,train from the dataset T.
2. The meta-training stage, in which samples xj,train from the meta-training dataset Tλ are used to extract the meta-features. A different dataset Tλ is used in this phase in order to prevent overfitting. The meta-feature vectors vi,j are stored in the set T* that is later used to train the meta-classifier λ.
3. The generalization phase: given a test sample xj,test from the generalization data G, its region of competence is extracted using the samples from the dynamic selection dataset DSEL in order to compute the meta-features. The meta-feature vector vi,j is then passed to the selector λ, which decides whether ci is competent enough to classify xj,test and should be added to the ensemble C'. The majority vote rule is applied over the ensemble C', giving the classification wl of xj,test.

3.2.1. Overproduction

In this work, the overproduction phase is performed using the Bagging technique [32,33]. Bagging is an acronym for Bootstrap AGGregatING. The idea behind this technique is to build a diverse ensemble of classifiers by randomly selecting different subsets of the training data. Each subset is used to train one individual classifier ci. As the focus of the paper is on classifier selection, and not on classifier generation methods, only the bagging technique is considered. A minimal sketch of this phase is given below.

3.2.2. Meta-training

As shown in Fig. 1, the meta-training stage consists of three steps: the sample selection process, the meta-features extraction process, and the training of the meta-classifier λ. For every sample xj,train ∈ Tλ, the first step is to apply the sample selection mechanism in order to know whether or not xj,train should be used for the training of the meta-classifier λ. The whole meta-training phase is formalized in Algorithm 1.

Fig. 1. Overview of the proposed META-DES framework. It is divided into three steps: (1) overproduction, where the pool of classifiers C = {c1, ..., cM} is generated, (2) the training of the meta-classifier λ, and (3) the generalization phase, where an ensemble C' is dynamically defined based on the meta-information extracted from xj,test and the pool C = {c1, ..., cM}. The generalization phase returns the label wl of xj,test. hC, K and Kp are the hyper-parameters required by the proposed system.
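A minimal sketch of the overproduction phase referred to in Section 3.2.1, under these assumptions: Perceptron base classifiers (as used later in the experiments) trained on bootstrap samples, using NumPy and scikit-learn; function and variable names are ours.

```python
import numpy as np
from sklearn.linear_model import Perceptron

def generate_pool(X_train, y_train, M=100, seed=0):
    """Bagging: train M base classifiers, each on a bootstrap sample of T."""
    rng = np.random.RandomState(seed)
    n = len(X_train)
    pool = []
    for _ in range(M):
        idx = rng.randint(0, n, size=n)  # sample n indices with replacement
        pool.append(Perceptron().fit(X_train[idx], y_train[idx]))
    return pool
```

scikit-learn's BaggingClassifier with a Perceptron base estimator would be an off-the-shelf equivalent; the explicit loop is kept here only so that the individual base classifiers remain directly accessible to the selection phases that follow.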

Algorithm 1. The Meta-Training Phase.

Input: Training data Tλ
Input: Pool of classifiers C = {c1, ..., cM}
1: T* = ∅
2: for all xj,train ∈ Tλ do
3:   Compute the consensus of the pool H(xj,train, C)
4:   if H(xj,train, C) < hC then
5:     Find the region of competence θj of xj,train using Tλ.
6:     Compute the output profile x̃j,train of xj,train.
7:     Find the Kp similar output profiles φj of x̃j,train using T̃λ.
8:     for all ci ∈ C do
9:       vi,j = MetaFeatureExtraction(θj, φj, ci, xj,train)
10:      if ci correctly classifies xj,train then
11:        αi,j = 1 {ci is competent for xj,train}
12:      else
13:        αi,j = 0 {ci is incompetent for xj,train}
14:      end if
15:      T* = T* ∪ {vi,j}
16:    end for
17:  end if
18: end for
19: Divide T* into 25% for validation and 75% for training.
20: Train λ using the Levenberg-Marquardt algorithm.
21: return The meta-classifier λ.

3.2.2.1. Sample selection. As demonstrated by Dos Santos et al. [5] and Cavalin et al. [6], one of the main issues in dynamic ensemble selection arises when classifying test instances for which the degree of consensus among the pool of classifiers is low, i.e., when the number of votes from the winning class is close or even equal to the number of votes from the second class. To tackle this issue, we decided to focus the training of the meta-classifier λ to specifically deal with cases where the extent of consensus among the pool is low. This step is conducted using a threshold hC, called the consensus threshold. Each instance xj,train is first evaluated by the whole pool of classifiers in order to compute the degree of consensus among the pool, denoted by H(xj,train, C). If the consensus H(xj,train, C) falls below the consensus threshold hC, the instance xj,train is used to compute the meta-features.

Before extracting the meta-features, the region of competence of the instance xj,train, denoted by θj = {x1, ..., xK}, must first be computed. The region of competence θj is defined in the Tλ set, using the K-Nearest Neighbor algorithm (line 5). Then, xj,train is transformed into an output profile. The output profile of the instance xj,train is denoted by x̃j,train = {x̃j,train,1, x̃j,train,2, ..., x̃j,train,M}, where each x̃j,train,i is the decision yielded by the base classifier ci for the sample xj,train [6,7].

Next, with the region of competence θj and the set with the most similar output profiles φj computed, for each base classifier ci belonging to the pool of classifiers C, one meta-feature vector vi,j is extracted (lines 8–14). Each vi,j contains five sets of meta-features:

3.2.2.2. Meta-feature extraction process. Five different sets of meta-features are proposed in this work. Each feature set fi corresponds to a different criterion for measuring the level of competence of a base classifier. Each set captures a different property about the behavior of the base classifier, and can be seen as a different criterion to dynamically estimate the level of competence of a base classifier, such as the classification performance estimated in a local region of the feature space and the classifier confidence for the classification of the input sample. Using five distinct sets of meta-features, even though one criterion might fail due to imprecisions in the local regions of the feature space or due to low confidence results, the system can still achieve a good performance as other meta-features are considered by the selection scheme. Table 1 shows the criterion used by each fi and its relationship with one dynamic ensemble selection paradigm presented in Section 2.

Three meta-features, f1, f2 and f3, are computed using information extracted from the region of competence θj. f4 uses information extracted from the set of output profiles φj. f5 is calculated directly from the input sample xj,train, and corresponds to the level of confidence of ci for the classification of xj,train.

f1 - Neighbors' hard classification: First, a vector with K elements is created. For each instance xk belonging to the region of competence θj, if ci correctly classifies xk, the kth position of the vector is set to 1, otherwise it is 0. Thus, K meta-features are computed.
f2 - Posterior probability: First, a vector with K elements is created. Then, for each instance xk belonging to the region of competence θj, the posterior probability of ci, P(wl | xk), is computed and inserted into the kth position of the vector. Consequently, K meta-features are computed.
f3 - Overall local accuracy: The accuracy of ci over the whole region of competence θj is computed and encoded as f3.
f4 - Output profiles classification: First, a vector with Kp elements is generated. Then, for each member x̃k belonging to the set of output profiles φj, if the label produced by ci for xk is equal to the label wl,k of x̃k, the kth position of the vector is set to 1, otherwise it is 0. A total of Kp meta-features are extracted using output profiles.
f5 - Classifier's confidence: The perpendicular distance between the input sample xj,train and the decision boundary of the base classifier ci is calculated and encoded as f5. f5 is normalized to a [0, 1] range using Min-max normalization.

A vector vi,j = f1 ∪ f2 ∪ f3 ∪ f4 ∪ f5 is obtained at the end of the process (Fig. 2). If ci correctly classifies xj,train, the class attribute of vi,j is αi,j = 1 (i.e., vi,j corresponds to the behavior of a competent classifier), otherwise αi,j = 0. vi,j is stored in the meta-features dataset T* (lines 10–16).

For each sample xj,train used in the meta-training stage, a total of M (M is the size of the pool of classifiers C) meta-feature vectors vi,j are extracted, each one corresponding to one classifier from the pool C. In this way, the size of the meta-training dataset T* is the pool size M × the number of training samples N. For instance, consider that 200 training samples are available for the meta-training stage (N = 200); if the pool C is composed of 100 weak classifiers (M = 100), the size of the meta-training dataset is the number of training samples N × the number of classifiers in the pool M, i.e., N × M = 20,000. Hence, even though the classification problem may be ill-defined due to the size of the training set, we can overcome this limitation in the meta-problem by increasing the size of the pool of classifiers.

3.2.2.3. Training. The last step of the meta-training phase is the training of the meta-classifier λ. The dataset T* is divided on the basis of 75% for training and 25% for validation. A Multi-Layer Perceptron (MLP) neural network is considered as the selector λ. The validation data was used to select the number of nodes in the hidden layer. We use a configuration of 10 neurons in the hidden layer since there was no improvement in results with more than 10 neurons. The training process for λ is performed using the Levenberg-Marquardt algorithm. In addition, the training process is stopped if its performance on the validation set decreases or fails to improve for five consecutive epochs.
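A rough sketch of this training step, assuming the meta-features dataset T* has already been assembled as a matrix V of vectors vi,j with class attributes alpha. scikit-learn's MLPClassifier is used as a stand-in for the selector λ; it does not provide the Levenberg-Marquardt optimizer used in the paper, so only the architecture and the early-stopping rule are reproduced here.

```python
from sklearn.neural_network import MLPClassifier

def train_meta_classifier(V, alpha):
    """V: one meta-feature vector per (base classifier, training sample) pair;
    alpha: class attribute (1 = competent, 0 = incompetent)."""
    selector = MLPClassifier(
        hidden_layer_sizes=(10,),   # 10 neurons in the hidden layer
        early_stopping=True,        # hold out part of T* as a validation set
        validation_fraction=0.25,   # 25% validation / 75% training split of T*
        n_iter_no_change=5,         # stop after 5 epochs without improvement
        max_iter=1000,
    )
    return selector.fit(V, alpha)
```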

Table 1
Relationship between each meta-feature and the different paradigms used to compute the level of competence of a base classifier.

Meta-feature   Criterion                                           Paradigm

f1             Local accuracy in the region of competence          Classifier accuracy over a local region
f2             Extent of consensus in the region of competence     Classifier consensus
f3             Overall accuracy in the region of competence        Accuracy over a local region
f4             Accuracy in the decision space                      Decision templates
f5             Degree of confidence for the input sample           Classifier confidence

Fig. 2. Feature vector containing the meta-information about the behavior of a base classifier. A total of 5 different meta-features are considered. The size of the feature vector is 2 × K + Kp + 2. The class attribute indicates whether or not ci correctly classified the input sample.
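A rough sketch of the meta-feature extraction for a single pair (ci, xj), following the five sets defined in Section 3.2.2.2. It assumes scikit-learn-style base classifiers exposing predict, predict_proba and, for f5, a binary linear decision_function; integer class labels aligned with the classifier's classes_ attribute are assumed for f2, and all helper names are ours.

```python
import numpy as np

def meta_features(ci, x_query, roc_X, roc_y, op_X, op_y):
    """roc_X, roc_y: the K samples of the region of competence (theta_j);
    op_X, op_y:     the Kp samples with the most similar output profiles (phi_j);
    returns v_{i,j} = f1 U f2 U f3 U f4 U f5 (length 2K + Kp + 2)."""
    f1 = (ci.predict(roc_X) == roc_y).astype(float)        # hit/miss per neighbour
    proba = ci.predict_proba(roc_X)
    f2 = proba[np.arange(len(roc_y)), roc_y]               # posterior of the correct class
    f3 = [f1.mean()]                                       # overall local accuracy
    f4 = (ci.predict(op_X) == op_y).astype(float)          # hit/miss on output-profile neighbours
    # f5: distance of x_query to the decision boundary of a binary linear classifier;
    # the min-max normalization over the training data would be applied afterwards.
    f5 = [abs(ci.decision_function(x_query.reshape(1, -1))[0]) / np.linalg.norm(ci.coef_)]
    return np.concatenate([f1, f2, f3, f4, f5])
```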

3.2.3. Generalization phase

The generalization procedure is formalized by Algorithm 2. Given the query sample xj,test, in this phase, the region of competence θj is computed using the samples from the dynamic selection dataset DSEL (line 2). Following that, the output profile x̃j,test of the test sample xj,test is calculated. The set with the Kp similar output profiles φj of the query sample xj,test is obtained through the Euclidean distance applied over the output profiles of the dynamic selection dataset, D̃SEL.

Algorithm 2. Classification steps using the selector λ.

Input: Query sample xj,test
Input: Pool of classifiers C = {c1, ..., cM}
Input: Dynamic selection dataset DSEL
1: C' = ∅
2: Find the region of competence θj of xj,test using DSEL.
3: Compute the output profile x̃j,test of xj,test.
4: Find the Kp similar output profiles φj of x̃j,test using D̃SEL.
5: for all ci ∈ C do
6:   vi,j = FeatureExtraction(θj, φj, ci, xj,test)
7:   Input vi,j to λ
8:   if αi,j = 1 {ci is competent for xj,test} then
9:     C' = C' ∪ {ci}
10:  end if
11: end for
12: wl = MajorityVote(xj,test, C')
13: return wl

Next, for each classifier ci belonging to the pool of classifiers C, the meta-feature extraction process is called (Section 3.2.2.2), returning the meta-features vector vi,j (lines 5 and 6). Then, vi,j is used as input to the meta-classifier λ. If the output of λ is 1 (i.e., competent), ci is included in the ensemble C' (lines 8–10). After every base classifier ci is evaluated, the ensemble C' is obtained. The base classifiers in C' are combined through the Majority Vote rule [1], giving the label wl of xj,test (lines 12 and 13). The majority vote rule is used to combine the selected classifiers since it has been successfully used by other DES techniques [3]. Tie-breaking is handled by choosing the class with the highest a posteriori probability.
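A rough sketch of the generalization phase (Algorithm 2), reusing the meta_features helper sketched earlier and a trained selector. The index sets for the region of competence and the output-profile neighbours are assumed to be precomputed from DSEL, and the fallback used when no classifier is flagged as competent is our addition, not part of the algorithm above.

```python
from collections import Counter

def meta_des_predict(x_query, pool, selector, dsel_X, dsel_y, roc_idx, op_idx):
    """roc_idx / op_idx: indices in DSEL of the region of competence (theta_j)
    and of the Kp most similar output profiles (phi_j) for x_query."""
    ensemble = []
    for ci in pool:
        v = meta_features(ci, x_query, dsel_X[roc_idx], dsel_y[roc_idx],
                          dsel_X[op_idx], dsel_y[op_idx])
        if selector.predict(v.reshape(1, -1))[0] == 1:   # competent -> add to C'
            ensemble.append(ci)
    if not ensemble:                                     # our fallback: keep the whole pool
        ensemble = list(pool)
    votes = Counter(ci.predict(x_query.reshape(1, -1))[0] for ci in ensemble)
    return votes.most_common(1)[0][0]                    # majority vote label
```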

4. Experiments

4.1. Datasets

A total of 30 datasets are used in the comparative experiments. Sixteen come from the UCI machine learning repository [34], four from the STATLOG project [35], four from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository [36], four from the Ludmila Kuncheva Collection of real medical data [37], and two are artificial datasets generated with the Matlab PRTOOLS toolbox [38]. We consider both ill-defined problems, such as Heart and Liver Disorders, as well as larger databases, such as Adult, MAGIC Gamma Telescope, Phoneme and WDG V1. The key features of each dataset are shown in Table 2.

Table 2
Key features of the datasets used in the experiments.

Database                   No. of instances   Dimensionality   No. of classes   Source

Pima                       768                8                2                UCI
Liver Disorders            345                6                2                UCI
Breast (WDBC)              568                30               2                UCI
Blood transfusion          748                4                2                UCI
Banana                     1000               2                2                PRTOOLS
Vehicle                    846                18               4                STATLOG
Lithuanian                 1000               2                2                PRTOOLS
Sonar                      208                60               2                UCI
Ionosphere                 315                34               2                UCI
Wine                       178                13               3                UCI
Haberman's Survival        306                3                2                UCI
Cardiotocography (CTG)     2126               21               3                UCI
Vertebral Column           310                6                2                UCI
Steel Plate Faults         1941               27               7                UCI
WDG V1                     50000              21               3                UCI
Ecoli                      336                7                8                UCI
Glass                      214                9                6                UCI
ILPD                       214                9                6                UCI
Adult                      48842              14               2                UCI
Weaning                    302                17               2                LKC
Laryngeal1                 213                16               2                LKC
Laryngeal3                 353                16               3                LKC
Thyroid                    215                5                3                LKC
German credit              1000               20               2                STATLOG
Heart                      270                13               2                STATLOG
Satimage                   6435               19               7                STATLOG
Phoneme                    5404               6                2                ELENA
Monk2                      4322               6                2                KEEL
Mammographic               961                5                2                KEEL
MAGIC Gamma Telescope      19020              10               2                KEEL

4.2. Experimental protocol

The experiments were conducted using 20 replications. For each replication, the datasets were randomly divided on the basis of 50% for training, 25% for the dynamic selection dataset (DSEL), and 25% for the test set (G). The divisions were performed maintaining the prior probabilities of each class. For the proposed META-DES, 50% of the training data was used in the meta-training process (Tλ) and 50% for the generation of the pool of classifiers (T).

For the two-class classification problems, the pool of classifiers was composed of 100 Perceptrons generated using the bagging technique [32]. For the multi-class problems, the pool of classifiers was composed of 100 multi-class Perceptron classifiers. The use of the Perceptron as base classifier comes from the following observations based on past works in the literature:

• The use of weak classifiers can show more differences between the DES schemes [4], making it a better option for comparing different DES techniques.
• Past works in the DES literature demonstrate that the use of weak models as base classifiers achieves better results [5,6,39,40,10], where the use of decision trees or Perceptrons outperforms strong classification models such as KNN classifiers.
• As reported by Leo Breiman [32,33], the bagging technique achieves better results when weak and unstable base classifiers are used.

4.3. Parameters setting

The performance of the proposed selection scheme depends on three parameters: the neighborhood size K, the number of similar patterns using output profiles Kp, and the consensus threshold hC. The dynamic selection dataset DSEL was used for the analysis. The following methodology is used:

• For the sake of simplicity, we selected the parameters that performed best.
• The value of the parameter K was selected based on the results of our previous paper [10]. In this case, K = 7 showed the best overall results, considering several dynamic selection techniques.
• The Kruskal-Wallis statistical test with a 95% confidence interval was used to determine whether the difference in results was statistically significant. If two configurations yielded similar results, we selected the one with the smaller parameter value, as it leads to a smaller meta-features vector.
• The parameter hC was evaluated with Kp initially set at 1. The best value of hC was used in the evaluation of the best value for Kp.
• Only a subset with eleven of the thirty datasets is used for the parameters setting procedure: Pima, Liver, Breast, Blood Transfusion, Banana, Vehicle, Lithuanian, Sonar, Ionosphere, Wine, and Haberman's Survival.

4.3.1. The effect of the parameter hC

We varied the parameter hC from 50% to 100% at 10 percentile point intervals. Fig. 3 shows the mean performance and standard deviation for each hC value. We compared each pair of results using the Kruskal-Wallis non-parametric statistical test with a 95% confidence interval. For 6 out of 11 datasets (Vehicle, Lithuanian, Banana, Blood transfusion, Ionosphere and Sonar), hC = 70% presented a value that was statistically superior to the others. Hence, hC = 70% was selected.

4.3.2. The effect of the parameter Kp

Fig. 4 shows the impact of the value of the parameter Kp in a 1–10 range. Once again, we compared each pair of results using the Kruskal-Wallis non-parametric statistical test, with a 95% confidence interval. The results were statistically different only for the Sonar, Ionosphere and Liver Disorders datasets, where the value of Kp = 5 showed the best results. Hence, Kp was set at 5.

4.4. Comparison with the state-of-the-art dynamic selection techniques

In this section we compare the recognition rates obtained by the proposed META-DES against eight dynamic selection techniques found in the literature [3]. The objective of this comparative study is to answer the following research questions: (1) Can the use of multiple DES criteria as meta-features lead to a more robust dynamic selection technique? (2) Does the proposed framework outperform current DES techniques for ill-defined problems?

The eight state-of-the-art DES techniques used in this study are: the KNORA-ELIMINATE [4], KNORA-UNION [4], DES-FA [10], Local Classifier Accuracy (LCA) [12], Overall Local Accuracy (OLA) [12], Modified Local Accuracy (MLA) [16], Multiple Classifier Behavior (MCB) [11] and K-Nearest Output Profiles (KNOP) [7,6]. These techniques were selected because they presented the very best results in the dynamic selection literature according to a recent survey on this topic [3]. In addition, we also compare the performance of the proposed META-DES with static combination methods (AdaBoost and Bagging), the classifier with the highest accuracy in the validation data (Single Best), static ensemble selection based on the majority voting error [41] and the abstract model (Oracle) [19]. The Oracle represents the ideal classifier selection scheme. It always selects the classifier that predicted the correct label, for any given query sample, if such a classifier exists. For the static ensemble selection method, 50% of the classifiers of the pool are selected. The comparison against static methods is used since it is suggested in the DES literature that the minimum requirement for a DES method is to surpass the performance of static selection and combination methods over the same pool [3].

For all techniques, the pool of classifiers C is composed of 100 Perceptrons as base classifiers (M = 100). For the state-of-the-art DES techniques (KNORA-E, KNORA-U, DES-FA, LCA, OLA, MLA, MCB and KNOP), the size of the region of competence (neighborhood size) K is set to 7, since it achieved the best results in previous publications [3,10]. The size of the region of competence K is the only hyper-parameter required for the eight DES techniques. For the AdaBoost and Bagging techniques, 100 iterations are used (i.e., 100 base classifiers are generated).

We split the results into two tables: Table 3 shows a comparison of the proposed META-DES against the eight state-of-the-art dynamic selection techniques considered. A comparison of the META-DES against static combination rules is shown in Table 4. Each pair of results is compared using the Kruskal-Wallis non-parametric statistical test, with a 95% confidence interval.
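For reference, a minimal example of the pairwise Kruskal-Wallis comparison used throughout the experiments, assuming SciPy; the accuracy values below are placeholders, not results from the paper.

```python
from scipy.stats import kruskal

# Accuracies over the replications for two techniques on one dataset (placeholders).
meta_des_acc = [79.0, 78.5, 80.1, 79.3, 78.8]
knora_e_acc = [73.8, 74.2, 73.5, 74.0, 73.9]

stat, p_value = kruskal(meta_des_acc, knora_e_acc)
print(f"H = {stat:.3f}, p = {p_value:.4f}, significant at 95%: {p_value < 0.05}")
```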

Fig. 3. Performance of the proposed system based on the parameter hC on the dynamic selection dataset, DSEL. K = 7 and Kp = 1.

Fig. 4. The performance of the system varying the parameter Kp from 1 to 10 on the dynamic selection dataset, DSEL. hC = 70% and K = 7.

The best results are in bold. Results that are significantly better (p < 0.05) are marked with a •.

We can see in Table 3 that the proposed META-DES achieves results that are either superior or equivalent to the state-of-the-art DES techniques in 25 datasets (84% of the datasets). In addition, the META-DES achieved the highest recognition performance for 18 datasets, which corresponds to 60% of the datasets considered. Only for the Ecoli, Heart, Vehicle, Banana and Lithuanian datasets (16% of the datasets) is the recognition rate of the proposed META-DES framework statistically inferior to the best result achieved by the state-of-the-art DES techniques.

For the 12 datasets where the proposed META-DES did not achieve the highest recognition rate (WDBC, Banana, Vehicle, Lithuanian, Cardiotocography, Vertebral column, Steel plate faults, Ecoli, Glass, ILPD, Laryngeal3 and Heart), we can see that each DES technique presented the best accuracy for different datasets (as shown in Fig. 5). The KNOP achieves the best results for three datasets (Ecoli, Steel plate faults and Laryngeal3), the MCB for two datasets (Vehicle and Glass), the DES-FA for three datasets (Banana, Breast cancer and Cardiotocography), and so forth. This can be explained by the No Free Lunch theorem: there is no criterion to estimate the competence of base classifiers that dominates all others when compared over several classification problems. Since the proposed META-DES uses a combination of five different criteria as meta-features, even though one criterion might fail, the system can still achieve a good performance as other meta-features are also considered by the selection scheme. In this way, a more robust DES technique is achieved.

Moreover, another advantage of the proposed META-DES framework comes from the fact that several meta-feature vectors are generated for each training sample in the meta-training phase (Section 3.2.2). For instance, consider that 200 training samples are available for the meta-training stage (N = 200); if the pool C is composed of 100 weak classifiers (M = 100), the meta-training dataset is the number of training samples N × the number of classifiers in the pool M, N × M = 20,000. Hence, there is more data to train the meta-classifier than for the generation of the pool of classifiers C itself. Even though the classification problem may be ill-defined, due to the size of the training set, using the proposed framework we can overcome this limitation since the size of the meta-problem is up to 100 times bigger than the classification problem.

Table 3
Mean and standard deviation results of the accuracy obtained for the proposed META-DES and the DES systems in the literature. A pool of 100 Perceptrons as base classifiers is used for all techniques. The best results are in bold. Results that are significantly better (p < 0.05) are marked with a •.

Database META-DES KNORA-E [4] KNORA-U [4] DES-FA [10] LCA [12] OLA [12] MLA [16] MCB [11] KNOP [6]

Pima 79.03 (2.24)  73.79 (1.86) 76.60 (2.18) 73.95 (1.61) 73.95 (2.98) 73.95 (2.56) 77.08 (4.56) 76.56 (3.71) 73.42 (2.11)
Liver Disorders 70.08 (3.49)  56.65 (3.28) 56.97 (3.76) 61.62 (3.81) 58.13 (4.01) 58.13 (3.27) 58.00 (4.25) 58.00 (4.25) 65.23 (2.29)
Breast (WDBC) 97.41 (1.07) 97.59 (1.10) 97.18 (1.02) 97.88 (0.78) 97.88 (1.58) 97.88 (1.58) 95.77 (2.38) 97.18 (1.38) 95.42 (0.89)
Blood transfusion 79.14 (1.03)  77.65 (3.62) 77.12 (3.36) 73.40 (1.16) 75.00 (2.87) 75.00 (2.36) 76.06 (2.68) 73.40 (4.19) 77.54 (2.03)
Banana 91.78 (2.68) 93.08 (1.67) 92.28 (2.87) 95.21 (3.18) 95.21 (2.15) 95.21 (2.15) 80.31 (7.20) 88.29 (3.38) 90.73 (3.45)
Vehicle 82.75 (1.70) 83.01 (1.54) 82.54 (1.70) 82.54 (4.05) 80.33 (1.84) 81.50 (3.24) 74.05 (6.65) 84.90 (2.01) 80.09 (1.47)
Lithuanian classes 93.18 (1.32) 93.33 (2.50) 95.33 (2.64) 98.00 (2.46) 85.71 (2.20) 98.66 (3.85) 88.33 (3.89) 86.00 (3.33) 89.33 (2.29)
Sonar 80.55 (5.39) 74.95 (2.79) 76.69 (1.94) 78.52 (3.86) 76.51 (2.06) 74.52 (1.54) 76.91 (3.20) 76.56 (2.58) 75.72 (2.82)
Ionosphere 89.94 (1.96) 89.77 (3.07) 87.50 (1.67) 88.63 (2.12) 88.00 (1.98) 88.63 (1.98) 81.81 (2.52) 87.50 (2.15) 85.71 (5.52)
Wine 99.25 (1.11)  97.77 (1.53) 97.77 (1.62) 95.55 (1.77) 85.71 (2.25) 88.88 (3.02) 88.88 (3.02) 97.77 (1.62) 95.50 (4.14)
Haberman 76.71 (1.86) 71.23 (4.16) 73.68 (2.27) 72.36 (2.41) 70.16 (3.56) 69.73 (4.17) 73.68 (3.61) 67.10 (7.65) 75.00 (3.40)
Cardiotocography (CTG) 84.62 (1.08) 86.27 (1.57) 85.71 (2.20) 86.27 (1.57) 86.65 (2.35) 86.65 (2.35) 86.27 (1.78) 85.71 (2.21) 86.02 (3.04)
Vertebral column 86.89 (2.46) 85.89 (2.27) 87.17 (2.24) 82.05 (3.20) 85.00 (3.25) 85.89 (3.74) 77.94 (5.80) 84.61 (3.95) 86.98 (3.21)
Steel plate faults 67.21 (1.20) 67.35 (2.01) 67.96 (1.98) 68.17 (1.59) 66.00 (1.69) 66.52 (1.65) 67.76 (1.54) 68.17 (1.59) 68.57 (1.85)
WDG V1 84.56 (0.36) 84.01 (1.10) 84.01 (1.10) 84.01 (1.10) 80.50 (0.56) 80.50 (0.56) 79.95 (0.85) 78.75 (1.35) 84.21 (0.45)
Ecoli 77.25 (3.52) 76.47 (2.76) 75.29 (3.41) 75.29 (3.41) 75.29 (3.41) 75.29 (3.41) 76.47 (3.06) 76.47 (3.06) 80.00 (4.25) 
Glass 66.87 (2.99) 57.65 (5.85) 61.00 (2.88) 55.32 (4.98) 59.45 (2.65) 57.60 (3.65) 57.60 (3.65) 67.92 (3.24) 62.45 (3.65)
ILPD 69.40 (1.64) 67.12 (2.35) 69.17 (1.58) 67.12 (2.35) 69.86 (2.20) 69.86 (2.20) 69.86 (2.20) 68.49 (3.27) 68.49 (3.27)
Adult 87.15 (2.43)  80.34 (1.57) 79.76 (2.26) 80.34 (1.57) 83.58 (2.32) 82.08 (2.42) 80.34 (1.32) 78.61 (3.32) 79.76 (2.26)
Weaning 87.15 (2.43)  78.94 (1.25) 81.57 (3.65) 82.89 (3.52) 77.63 (2.35) 77.63 (2.35) 80.26 (1.52) 81.57 (2.86) 82.57 (3.33)
Laryngeal1 79.67 (3.78)  77.35 (4.45) 77.35 (4.45) 77.35 (4.45) 77.35 (4.45) 77.35 (4.45) 75.47 (5.55) 77.35 (4.45) 77.35 (4.45)
Laryngeal3 72.65 (2.17) 70.78 (3.68) 72.03 (1.89) 72.03 (1.89) 72.90 (2.30) 71.91 (1.01) 61.79 (7.80) 71.91 (1.01) 73.03 (1.89)
Thyroid 96.78 (0.87) 95.95 (1.25) 95.95 (1.25) 95.37 (2.02) 95.95 (1.25) 95.95 (1.25) 94.79 (2.30) 95.95 (1.25) 95.95 (1.25)
German credit 75.55 (1.31)  72.80 (1.95) 72.40 (1.80) 74.00 (3.30) 73.33 (2.85) 71.20 (2.52) 71.20 (2.52) 73.60 (3.30) 73.60 (3.30)
Heart 84.80 (3.36) 83.82 (4.05) 83.82 (4.05) 83.82 (4.05) 85.29 (3.69) 85.29 (3.69) 86.76 (5.50) 83.82 (4.05) 83.82 (4.05)
Satimage 96.21 (0.87) 95.35 (1.23) 95.86 (1.07) 93.00 (2.90) 95.00 (1.40) 94.14 (1.07) 93.28 (2.10) 95.86 (1.07) 95.86 (1.07)
Phoneme 80.35 (2.58) 79.06 (2.50) 78.92 (3.33) 79.06 (2.50) 78.84 (2.53) 78.84 (2.53) 64.94 (7.75) 73.37 (5.55) 78.92 (3.33)
Monk2 83.24 (2.19)  80.55 (3.32) 77.77 (4.25) 75.92 (4.25) 74.07 (6.60) 74.07 (6.60) 75.92 (5.65) 74.07 (6.60) 80.55 (3.32)
Mammographic 84.82 (1.55)  82.21 (2.27) 82.21 (2.27) 80.28 (3.02) 82.21 (2.27) 82.21 (2.27) 75.55 (5.50) 81.25 (2.07) 82.21 (2.27)
MAGIC Gamma Telescope 84.35 (3.27)  80,03 (3.25) 79,99 (3.55) 81.73 (3.27) 81,53 (3.35) 81,16 (3.00) 73,13 (6.35) 75,91 (5.35) 80,03 (3.25)

Table 4
Mean and standard deviation results of the accuracy obtained for the proposed META-DES and static ensemble combination. A pool of 100 Perceptrons as base classifiers is used for all techniques. The best results are in bold. Results that are significantly better (p < 0.05) are marked with a •.

Database META-DES Single Best [3] Bagging [32] AdaBoost [42] Static Selection [41] Oracle [19]

Pima 79.03 (2.24)  73.57 (1.49) 73.28 (2.08) 72.52 (2.48) 72.86 (4.78) 95.10 (1.19)
Liver Disorders 70.08 (3.49)  65.38 (3.47) 62.76 (4.81) 64.65 (3.26) 59.18 (7.02) 93.07 (2.41)
Breast (WDBC) 97.41 (1.07) 97.04 (0.74) 96.35 (1.14) 98.24 (0.89) 96.83 (1.00) 99.13 (0.52)
Blood transfusion 79.14 (1.03)  75.07 (1.83) 75.24 (1.67) 75.18 (2.08) 75.74 (2.23) 94.20 (2.08)
Banana 91.78 (2.68) 84.07 (2.22) 81.43 (3.92) 81.61 (2.42) 81.35 (4.28) 94.75 (2.09)
Vehicle 82.75 (1.70) 81.87 (1.47) 82.18 (1.31) 80.56 (4.51) 81.65 (1.48) 96.80 (0.94)
Lithuanian classes 93.18 (1.32)  84.35 (2.04) 82.33 (4.81) 82.70 (4.55) 82.66 (2.45) 98.35 (0.57)
Sonar 80.55 (5.39) 78.21 (2.36) 76.66 (2.36) 74.95 (5.21) 79.03 (6.50) 94.46 (1.63)
Ionosphere 89.94 (1.96) 87.29 (2.28) 86.75 (2.75) 86.75 (2.34) 87.50 (2.23) 96.20 (1.72)
Wine 99.25 (1.11) 96.70 (1.46) 95.56 (1.96) 99.20 (0.76) 96.88 (1.80) 100.00 (0.01)
Haberman 76.71 (1.86) 75.65 (2.68) 72.63 (3.45) 75.26 (3.38) 73.15 (3.68) 97.36 (3.34)
Cardiotocography (CTG) 84.62 (1.08) 84.21 (1.10) 84.54 (1.46) 83.06 (1.23) 84.04 (2.02) 93.08 (1.46)
Vertebral column 86.89 (2.46) 82.04 (2.17) 85.89 (3.47) 83.22 (3.59) 84.27 (3.24) 97.40 (0.54)
Steel plate faults 67.21 (1.20) 66.05 (1.98) 67.02 (1.98) 66.57 (1.06) 67.22 (1.64) 88.72 (1.89)
WDG V1 84.56 (0.36) 83.17 (0.76) 84.36 (0.56) 84.04 (0.37) 84.23 (0.53) 97.82 (0.54)
Ecoli 77.25 (3.52)  69.35 (2.68) 72.22 (3.65) 70.32 (3.65) 67.80 (4.60) 91.54 (1.55)
Glass 66.87 (2.99)  52.92 (4.53) 62.64 (5.61) 55.89 (3.25) 57.16 (4.17) 90.65 (0.00)
ILPD 69.40 (1.64) 67.53 (2.83) 67.20 (2.35) 69.38 (4.28) 67.26 (1.04) 99.10 (0.72)
Adult 87.15 (2.43)  83.64 (3.34) 85.60 (2.27) 83.58 (2.91) 84.37 (2.79) 95.59 (0.39)
Weaning 79.67 (3.78)  74.86 (4.78) 76.31 (4.06) 74.47 (3.68) 76.89 (3.15) 92.10 (0.92)
Laryngeal1 83.43 (4.50) 80.18 (5.51) 81.32 (3.82) 79.81 (3.88) 80.75 (4.93) 98.86 (0.98)
Laryngeal3 72.65 (2.17) 68.42 (3.24) 67.13 (2.47) 62.32 (2.57) 71.23 (3.18) 100.00 (0.00)
Thyroid 96.78 (0.87) 95.15 (1.74) 95.25 (1.11) 96.01 (0.74) 96.24 (1.25) 99.88 (0.36)
German credit 75.55 (2.31) 71.16 (2.39) 74.76 (2.73) 72.96 (1.25) 73.60 (2.69) 99.12 (0.70)
Heart 84.80 (3.36) 80.26 (3.58) 82.50 (4.60) 81.61 (5.01) 82.05 (3.72) 95.90 (1.02)
Satimage 96.21 (0.87) 94.52 (0.96) 95.23 (0.87) 95.43 (0.92) 95.31 (0.92) 98.69 (0.87)
Phoneme 80.35 (2.58)  75.87 (1.33) 72.60 (2.33) 75.90 (1.06) 72.70 (2.32) 99.34 (0.24)
Monk2 83.24 (2.19) 79.25 (3.78) 79.18 (2.57) 80.27 (2.76) 80.55 (3.59) 98.98 (1.19)
Mammographic 84.82 (1.55) 83.60 (1.85) 85.27 (1.85) 83.07 (3.03) 84.23 (2.14) 99.59 (0.15)
MAGIC Gamma Telescope 84.35 (3.27) 80.27 (3.50) 81.24 (2.22) 87.35 (1.45)  85.25 (3.25) 95.35 (0.68)

Fig. 5. Bar plot showing the number of datasets for which each DES technique presented the highest recognition accuracy.

So, our proposed framework has more data to estimate the level of competence of base classifiers than the other DES methods, where only the training or validation data is available. This fact can be observed in the results obtained for datasets with less than 500 samples for training, such as Liver Disorders, Sonar, Weaning and Ionosphere, where the recognition accuracy of the META-DES is statistically superior for those small size problems.

When compared against static ensemble techniques (Table 4), the proposed META-DES achieves the highest recognition accuracy for 24 out of 30 datasets. This can be explained by the fact that the majority of datasets considered are ill-defined. Hence, the results found in this paper also support the claim made by Cavalin et al. [6] that DES techniques outperform static methods for ill-defined problems.

We can thus answer the research question posed in this paper: Can the use of meta-features lead to a more robust dynamic selection technique? As the proposed system achieved better recognition rates in the majority of datasets, the use of multiple properties from the classification environment as meta-features indeed leads to a more robust dynamic ensemble selection technique.

5. Conclusion

In this paper, we presented a novel DES technique in a meta-learning framework. The framework is based on two environments: the classification environment, in which the input features are mapped into a set of class labels, and the meta-classification environment, in which different properties from the classification environment, such as the classifier accuracy in the feature space or the consensus in the decision space, are extracted from the training data and encoded as meta-features. Five sets of meta-features are proposed, each set corresponding to a different dynamic selection criterion. These meta-features are used to train a meta-classifier which can estimate whether a base classifier is competent enough to classify a given input sample. With the arrival of new test data, the meta-features are extracted using the test data as reference, and used as input to the meta-classifier. The meta-classifier decides whether the base classifier is competent enough to classify the test sample.

Experiments were conducted using 30 classification datasets coming from five different data repositories (UCI, KEEL, STATLOG, LKC and ELENA) and compared against eight state-of-the-art dynamic selection techniques (each technique based on a single criterion to measure the level of competence of a base classifier), as well as five classical static combination methods. Experimental results show that the proposed META-DES achieved the highest classification accuracy in the majority of datasets, which can be explained by the fact that the proposed META-DES framework is based on five different DES criteria. Even though one criterion might fail, the system can still achieve a good performance as other criteria are also considered in order to perform the ensemble selection. In this way, a more robust DES technique is achieved.

In addition, we observed a significant improvement in performance for datasets with critical training set sizes. This gain in accuracy can be explained by the fact that, during the meta-training phase of the framework, each training sample generates several meta-feature vectors for the training of the meta-classifier. Hence, the proposed framework has more data to train the meta-classifier, and consequently to estimate the level of competence of base classifiers, than the current state-of-the-art DES methods, where only the training or validation data is available.

Future works on this topic will involve:

1. The definition of new sets of meta-features to better estimate the level of competence of the base classifiers.
2. The selection of meta-features based on optimization algorithms in order to improve the performance of the meta-classifier, and consequently, the accuracy of the DES system.
3. The evaluation of different training scenarios for the meta-classifier.

Conflict of interest

None.

Acknowledgments

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) (grant OGP0106456 to Robert Sabourin), the École de technologie supérieure (ÉTS Montréal) and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico).

References

[1] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 226–239.
[2] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, New Jersey, 2004.

[3] A.S. Britto Jr., R. Sabourin, L.E.S. de Oliveira, Dynamic selection of classifiers – a comprehensive review, Pattern Recognit. 47 (11) (2014) 3665–3680.
[4] A.H.R. Ko, R. Sabourin, A.S. Britto Jr., From dynamic classifier selection to dynamic ensemble selection, Pattern Recognit. 41 (2008) 1735–1748.
[5] E.M. Dos Santos, R. Sabourin, P. Maupin, A dynamic overproduce-and-choose strategy for the selection of classifier ensembles, Pattern Recognit. 41 (2008) 2993–3009.
[6] P.R. Cavalin, R. Sabourin, C.Y. Suen, Dynamic selection approaches for multiple classifier systems, Neural Comput. Appl. 22 (3–4) (2013) 673–688.
[7] P.R. Cavalin, R. Sabourin, C.Y. Suen, LoGID: an adaptive framework combining local and global incremental learning for dynamic selection of ensembles of HMMs, Pattern Recognit. 45 (9) (2012) 3544–3556.
[8] T. Woloszynski, M. Kurzynski, A probabilistic model of classifier competence for dynamic ensemble selection, Pattern Recognit. 44 (2011) 2656–2668.
[9] J. Xiao, L. Xie, C. He, X. Jiang, Dynamic classifier ensemble model for customer classification with imbalanced class distribution, Expert Syst. Appl. 39 (2012) 3668–3675.
[10] R.M.O. Cruz, G.D.C. Cavalcanti, T.I. Ren, A method for dynamic ensemble selection based on a filter and an adaptive distance to improve the quality of the regions of competence, in: Proceedings of the International Joint Conference on Neural Networks, 2011, pp. 1126–1133.
[11] G. Giacinto, F. Roli, Dynamic classifier selection based on multiple classifier behaviour, Pattern Recognit. 34 (2001) 1879–1881.
[12] K. Woods, W.P. Kegelmeyer Jr., K. Bowyer, Combination of multiple classifiers using local accuracy estimates, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 405–410.
[13] L. Kuncheva, Switching between selection and fusion in combining classifiers: an experiment, IEEE Trans. Syst. Man Cybern. 32 (2) (2002) 146–156.
[14] X. Zhu, X. Wu, Y. Yang, Dynamic classifier selection for effective mining from noisy data streams, in: Proceedings of the 4th IEEE International Conference on Data Mining, 2004, pp. 305–312.
[15] S. Singh, M. Singh, A dynamic classifier selection and combination approach to image region labelling, Signal Process. Image Commun. 20 (3) (2005) 219–231.
[16] P.C. Smits, Multiple classifier systems for supervised remote sensing image classification based on dynamic classifier selection, IEEE Trans. Geosci. Remote Sens. 40 (4) (2002) 801–813.
[17] L.I. Kuncheva, Clustering-and-selection model for classifier combination, in: Fourth International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, 2000, pp. 185–188.
[18] R.G.F. Soares, A. Santana, A.M.P. Canuto, M.C.P. de Souto, Using accuracy and diversity to select classifiers to build ensembles, in: Proceedings of the International Joint Conference on Neural Networks, 2006, pp. 1310–1316.
[19] L.I. Kuncheva, A theoretical study on six classifier fusion strategies, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2) (2002) 281–286.
[20] D.W. Corne, J.D. Knowles, No free lunch and free leftovers theorems for multiobjective optimisation problems, in: Evolutionary Multi-Criterion Optimization (EMO 2003) Second International Conference, 2003, pp. 327–341.
[21] E.M. dos Santos, R. Sabourin, P. Maupin, A dynamic overproduce-and-choose strategy for the selection of classifier ensembles, Pattern Recognit. 41 (10) (2008) 2993–3009.
[22] R.M.O. Cruz, R. Sabourin, G.D.C. Cavalcanti, Analyzing dynamic ensemble selection techniques using dissimilarity analysis, in: Artificial Neural Networks in Pattern Recognition (ANNPR), 2014, pp. 59–70.
[23] J.H. Krijthe, T.K. Ho, M. Loog, Improving cross-validation based classifier selection using meta-learning, in: International Conference on Pattern Recognition, 2012, pp. 2873–2876.
[24] G. Giacinto, F. Roli, Design of effective neural network ensembles for image classification purposes, Image Vis. Comput. 19 (9–10) (2001) 699–707.
[25] R.M.O. Cruz, G.D. Cavalcanti, I.R. Tsang, R. Sabourin, Feature representation selection based on classifier projection space and oracle analysis, Expert Syst. Appl. 40 (9) (2013) 3813–3827.
[26] E.M. dos Santos, R. Sabourin, P. Maupin, Single and multi-objective genetic algorithms for the selection of ensemble of classifiers, in: Proceedings of the International Joint Conference on Neural Networks, 2006, pp. 3070–3077.
[27] I. Partalas, G. Tsoumakas, I. Vlahavas, Focused ensemble selection: a diversity-based method for greedy ensemble selection, in: Proceedings of the 18th European Conference on Artificial Intelligence, 2008, pp. 117–121.
[28] L.I. Kuncheva, J.C. Bezdek, R.P.W. Duin, Decision templates for multiple classifier fusion: an experimental comparison, Pattern Recognit. 34 (2001) 299–314.
[29] M. Sabourin, A. Mitiche, D. Thomas, G. Nagy, Classifier combination for handprinted digit recognition, in: Proceedings of the Second International Conference on Document Analysis and Recognition, 1993, pp. 163–166.
[30] L. Didaci, G. Giacinto, F. Roli, G.L. Marcialis, A study on the performances of dynamic classifier selection based on local accuracy estimation, Pattern Recognit. 38 (11) (2005) 2188–2191.
[31] R.P.W. Duin, The combining classifier: to train or not to train?, in: Proceedings of the 16th International Conference on Pattern Recognition, vol. 2, 2002, pp. 765–770.
[32] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[33] M. Skurichina, R.P.W. Duin, Bagging for linear classifiers, Pattern Recognit. 31 (1998) 909–930.
[34] K. Bache, M. Lichman, UCI Machine Learning Repository (2013). URL http://archive.ics.uci.edu/ml.
[35] R.D. King, C. Feng, A. Sutherland, Statlog: comparison of classification algorithms on large real-world problems, 1995.
[36] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Multiple-Valued Logic Soft Comput. 17 (2–3) (2011) 255–287.
[37] L. Kuncheva, Ludmila Kuncheva Collection (2004). URL http://pages.bangor.ac.uk/~mas00a/activities/real_data.htm.
[38] R.P.W. Duin, P. Juszczak, D. de Ridder, P. Paclik, E. Pekalska, D.M. Tax, PRTools, a Matlab toolbox for pattern recognition, 2004. URL http://www.prtools.org.
[39] P.R. Cavalin, R. Sabourin, C.Y. Suen, Dynamic selection of ensembles of classifiers using contextual information, Multiple Classif. Syst. (2010) 145–154.
[40] E.M. dos Santos, R. Sabourin, Classifier ensembles optimization guided by population oracle, 2011, pp. 693–698.
[41] D. Ruta, B. Gabrys, Classifier selection for majority voting, Inf. Fusion 6 (1) (2005) 63–81.
[42] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: Proceedings of the Second European Conference on Computational Learning Theory, 1995, pp. 23–37.
Rafael M.O. Cruz is a Ph.D. student in the Laboratoire d'imagerie, de vision et d'intelligence artificielle (LIVIA) at the École de technologie supérieure (ÉTS). His main research interests are ensembles of classifiers, adaptive classification systems, concept drift, meta-learning and handwriting recognition.
R. Sabourin joined the physics department of the University of Montreal in 1977, where he was responsible for the design, experimentation and development of scientific instrumentation for the Mont Mégantic Astronomical Observatory. His main contribution was the design and the implementation of a microprocessor-based fine tracking system combined with a low-light-level CCD detector. In 1983, he joined the staff of the École de Technologie Supérieure, Université du Québec, in Montréal, where he co-founded the Department of Automated Manufacturing Engineering, where he is currently Full Professor and teaches Pattern Recognition, Evolutionary Algorithms, Neural Networks and Fuzzy Systems. In 1992, he also joined the Computer Science Department of the Pontifícia Universidade Católica do Paraná (Curitiba, Brazil), where he was co-responsible for the implementation in 1995 of a master program and in 1998 of a Ph.D. program in applied computer science. Since 1996, he has been a senior member of the Centre for Pattern Recognition and Machine Intelligence (CENPARMI, Concordia University). Since 2012, he has held the Research Chair specializing in Adaptive Surveillance Systems in Dynamic Environments. He is the author (and co-author) of more than 350 scientific publications, including journals and conference proceedings. He was co-chair of the program committee of CIFED'98 (Conférence Internationale Francophone sur l'Écrit et le Document, Québec, Canada) and IWFHR'04 (9th International Workshop on Frontiers in Handwriting Recognition, Tokyo, Japan). He was nominated as Conference co-chair of ICDAR'07 (9th International Conference on Document Analysis and Recognition), which was held in Curitiba, Brazil, in 2007. His research interests are in the areas of adaptive biometric systems, adaptive surveillance systems in dynamic environments, intelligent watermarking systems, evolutionary computation and bio-cryptography.
George D.C. Cavalcanti received the D.Sc. degree in Computer Science from the Center for Informatics, Federal University of Pernambuco, Brazil. He is currently an Associate Professor with the Center for Informatics, Federal University of Pernambuco, Brazil. His research interests include machine learning, pattern recognition, computer vision, and biometrics.
Tsang Ing Ren received the B.Sc. degree in electronic engineering from the Universidade Federal de Pernambuco, Recife, Brazil, and the Ph.D. degree in physics from the University of Antwerp, Antwerp, Belgium. He is currently an Associate Professor with the Centro de Informática, Universidade Federal de Pernambuco. His research interests include machine learning, computer vision, image processing, and biometrics.