
A wrapper approach for feature selection based on Bat Algorithm and Optimum-Path Forest


Douglas Rodrigues (a), Luís A. M. Pereira (a), Rodrigo Y. M. Nakamura (a), Kelton A. P. Costa (a), Xin-She Yang (b), André N. Souza (c), João Paulo Papa (a)

(a) Department of Computing, Universidade Estadual Paulista, Bauru, Brazil
(b) School of Science and Technology, Middlesex University, London, United Kingdom
(c) Department of Electrical Engineering, Universidade Estadual Paulista, Bauru, Brazil
Article info
Keywords: Dimensionality reduction; Swarm intelligence; Bat Algorithm; Optimum-Path Forest

Abstract
Besides optimizing classifier predictive performance and addressing the curse of dimensionality, feature selection techniques help keep the classification model as simple as possible. In this paper, we present a wrapper feature selection approach based on the Bat Algorithm (BA) and the Optimum-Path Forest (OPF) classifier, in which feature selection is modeled as a binary optimization problem guided by BA, using the OPF accuracy over a validating set as the fitness function to be maximized. Moreover, we present a methodology to better estimate the quality of the reduced feature set. Experiments conducted over six public datasets demonstrate that the proposed approach provides statistically significantly more compact feature sets and, in some cases, can indeed improve classification effectiveness.
© 2013 Elsevier Ltd. All rights reserved.
1. Introduction
Given the current availability of large amounts of data and computational resources, machine learning techniques have received significant attention from the scientific community in the search for classification models that better describe a given dataset. Nonetheless, besides optimizing predictive performance, we should keep in mind the objective of understanding the underlying process of data generation. To this end, feature selection tries to discover a minimal model capable of explaining the data distribution. These methods offer a fruitful feature analysis, and can also improve classifier performance and reduce the complexity and induction time of the classification model.
Broadly speaking, feature selection methods can be categorized into wrapper or filter-based approaches. While the wrapper model consists of a heuristic search in a subspace of all possible feature combinations, with the classifier's performance as the fitness function, filter-based approaches apply statistical analysis to rank individual features according to a utility criterion. Although a natural idea is to relate an abundance of features to a better data representation, weakly informative features may degrade classifier effectiveness by acting as artificial noise. In light of the importance of these methods, the literature covers excellent collections (Alonso-Atienza et al., 2012; Guyon, Gunn, Nikravesh, & Zadeh, 2006) and several comparative studies (Guyon et al., 2007; Jain & Zongker, 1997; Kudo & Sklansky, 2000).
Recently, meta-heuristic search algorithms derived from the behaviour of biological and/or physical systems in nature have been proposed as powerful methods for global optimization (Geem, 2009; Kennedy & Eberhart, 2001; Kirkpatrick, Gelatt, & Vecchi, 1983; Koza, 1992; Rashedi, Nezamabadi-pour, & Saryazdi, 2009). The main reason to employ stochastic optimization for feature selection is the exponential number of possible solutions in the search space. For instance, several works have advocated the use of Genetic Algorithms (GA) (Huang & Wang, 2006; Kuncheva & Jain, 1999; Oh, Lee, & Moon, 2004) and Particle Swarm Optimization (PSO) (Kennedy & Eberhart, 1997; Wang, Yang, Teng, Xia, & Jensen, 2007) in this context. Additionally, Rashedi, Nezamabadi-pour, and Saryazdi (2010) introduced a Gravitational Search Algorithm (GSA) for feature selection purposes, and Ramos, Souza, Chiachia, Falcão, and Papa (2011) presented a Harmony Search (HS) approach to select the best subset of variables related to thefts in power distribution systems.
More recently, in our previous work (Nakamura et al., 2012) we proposed a feature selection method using the Bat Algorithm (BA) (Yang, 2011), comparing its effectiveness with the Firefly Algorithm (FA) (Yang, 2010), GSA, HS, and PSO. The idea was to find a near-optimal solution through the combination of a fast classifier, namely the Optimum-Path Forest (OPF) (Papa, Falcão, & Suzuki, 2009, 2012), with the optimization strategy of BA,
which combines local search through random walks and global
exploration.
In this paper, we perform a deeper study using different datasets, evaluate several other swarm-based techniques proposed in the literature for feature selection, and also analyze the influence of distinct transfer functions (continuous to binary) on the agents' positions in the search space. Another question we would like to shed light on concerns the methodology used to evaluate feature selection methods. One of the most popular strategies is to adopt a hold-out methodology, which randomly generates training, evaluating, and test sets for further accuracy computation. Since distinct sampling strategies can produce different estimates of performance, we present here a methodology based on k-fold cross-validation. More precisely, for each fold, we perform a feature selection process, build a classifier instance using only the features encoded by the solution found on that fold, and evaluate its performance on the remaining folds.
The remainder of this paper is organized as follows: the Bat Algorithm (BA) and its version for feature selection proposed by Nakamura et al. (2012) are introduced in Section 2. We revisit the OPF theory in Section 3, a brief background on swarm intelligence is presented in Section 4, and the proposed methodology is introduced in Section 5. The experimental results and conclusions are presented in Sections 6 and 7, respectively.
2. Feature selection using Bat Algorithm
In this section, we describe the BA and the Binary BA (BBA) proposed by Nakamura et al. (2012).
2.1. Bat Algorithm
Bats are fascinating animals, and their advanced capability of echolocation has attracted the attention of researchers from different fields. Echolocation works as a type of sonar: bats, mainly micro-bats, emit a loud and short pulse of sound, wait until it hits an object and, after a fraction of time, the echo returns to their ears (Griffin, Webster, & Michael, 1960). Thus, bats can compute how far they are from an object (Metzner, 1991). In addition, this amazing orientation mechanism makes bats able to distinguish between an obstacle and a prey, allowing them to hunt even in complete darkness (Schnitzler & Kalko, 2001).
Based on this behaviour of bats, Yang (2011) developed a new and interesting meta-heuristic optimization technique called the Bat Algorithm. The technique was developed to behave as a band of bats tracking prey/food using their capability of echolocation. In order to model this algorithm, Yang (2011) idealized the following rules:
1. All bats use echolocation to sense distance, and they also know the difference between food/prey and background barriers in some magical way;
2. A bat $b_i$ flies randomly with velocity $v_i$ at position $x_i$ with a fixed frequency $f_{min}$, varying wavelength $\lambda$ and loudness $A_0$ to search for prey. They can automatically adjust the wavelength (or frequency) of their emitted pulses and adjust the rate of pulse emission $r \in [0,1]$, depending on the proximity of their target;
3. Although the loudness can vary in many ways, Yang (2011) assumes that the loudness varies from a large (positive) $A_0$ to a minimum constant value $A_{min}$.
Algorithm 1 presents the Bat Algorithm (adapted from Yang (2011)), in which $r \sim U(0,1)$.
Algorithm 1. Bat Algorithm
Firstly, the initial position $x_i$, velocity $v_i$ and frequency $f_i$ are initialized for each bat $b_i$. For each time step $t$, with $T$ being the maximum number of iterations, the movement of the virtual bats is given by updating their velocity and position using Eqs. (1)-(3), as follows:
$f_i = f_{min} + (f_{max} - f_{min})\,\beta, \quad (1)$

$v_i^j(t) = v_i^j(t-1) + \left[\hat{x}^j - x_i^j(t-1)\right] f_i, \quad (2)$

$x_i^j(t) = x_i^j(t-1) + v_i^j(t), \quad (3)$
where $\beta$ denotes a randomly generated number within the interval $[0,1]$. Recall that $x_i^j(t)$ denotes the value of decision variable $j$ for bat $i$ at time step $t$. The result of $f_i$ (Eq. (1)) is used to control the pace and range of the movement of the bats. The variable $\hat{x}^j$ represents the current global best location (solution) for decision variable $j$, which is obtained by comparing all the solutions provided by the $m$ bats.
In order to improve the variability of the possible solutions, Yang (2011) proposed to employ random walks. Primarily, one solution is selected among the current best solutions, and then a random walk is applied in order to generate a new solution for each bat that accepts the condition in Line 5 of Algorithm 1:

$x_{new} = x_{old} + \epsilon \bar{A}(t), \quad (4)$

in which $\bar{A}(t)$ stands for the average loudness of all the bats at time $t$, and $\epsilon \in [-1,1]$ controls the direction and strength of the random walk. At each iteration of the algorithm, the loudness $A_i$ and the pulse emission rate $r_i$ are updated as follows:

$A_i(t+1) = \alpha A_i(t) \quad (5)$

and

$r_i(t+1) = r_i(0)\left[1 - \exp(-\gamma t)\right], \quad (6)$

where $\alpha$ and $\gamma$ are algorithm parameters. At the first step of the algorithm, the emission rate $r_i(0)$ and the loudness $A_i(0)$ are often chosen randomly. Generally, $A_i(0) \in [1,2]$ and $r_i(0) \in [0,1]$ (Yang, 2011). The loudness and the emission rate are updated only if the new solutions are improved, which means that these bats are moving towards the optimal solution.
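To make these update rules concrete, the following minimal sketch (Python; our own illustrative code, not part of the original paper) performs one iteration of Eqs. (1)-(6) for a population of bats, assuming a fitness function that is being maximized. The outer loop over the $T$ iterations and the global-best bookkeeping are omitted, and all helper and parameter names are ours.

```python
import numpy as np

def bat_step(X, V, A, r, r0, best, fitness, t,
             f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9):
    """One iteration of the continuous Bat Algorithm (Eqs. (1)-(6)).

    X, V : (m, n) arrays with the position and velocity of each bat
    A, r : (m,) arrays with loudness and pulse emission rate
    r0   : (m,) initial emission rates
    best : (n,) current global best position
    """
    m, n = X.shape
    beta = np.random.rand(m, 1)
    f = f_min + (f_max - f_min) * beta               # Eq. (1)
    V = V + (best - X) * f                           # Eq. (2)
    X_new = X + V                                    # Eq. (3)

    for i in range(m):
        if np.random.rand() > r[i]:                  # condition of Line 5 in Algorithm 1
            eps = np.random.uniform(-1.0, 1.0, n)
            X_new[i] = best + eps * A.mean()         # Eq. (4): random walk around a best solution
        if np.random.rand() < A[i] and fitness(X_new[i]) > fitness(X[i]):
            X[i] = X_new[i]                          # accept the improved solution
            A[i] = alpha * A[i]                      # Eq. (5): loudness decreases
            r[i] = r0[i] * (1.0 - np.exp(-gamma * t))   # Eq. (6): emission rate increases
    return X, V, A, r
```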
2.2. BBA: Binary Bat Algorithm
As the reader can observe, in the standard BA each bat moves in the search space towards continuous-valued positions. However, in the case of feature selection, the search space is modelled as an $n$-dimensional boolean lattice, in which bats move across the corners of a hypercube. Since the problem is to select or not a given feature, a bat's position is then represented by a binary vector.
Nakamura et al. (2012) proposed a binary version of the Bat Algorithm, restricting the new bat's position to binary values only, using a sigmoid function:
$S\left(v_i^j\right) = \dfrac{1}{1 + e^{-v_i^j}}. \quad (7)$
Therefore, Eq. (3) can be replaced by:
$x_i^j = \begin{cases} 1 & \text{if } S\left(v_i^j\right) > \sigma, \\ 0 & \text{otherwise}, \end{cases} \quad (8)$
in which $\sigma \sim U(0,1)$. Therefore, Eq. (8) provides only binary values for each bat's coordinates in the boolean lattice, which stand for the presence or absence of the features.
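A minimal sketch of this continuous-to-binary mapping is given below (Python; our own illustrative code, not from the paper): each coordinate of a bat's velocity is squashed by the sigmoid of Eq. (7) and thresholded against a uniform random number as in Eq. (8).

```python
import numpy as np

def binarize_position(V):
    """Map continuous velocities V (m bats x n features) to binary
    positions using the sigmoid transfer function of Eqs. (7)-(8)."""
    S = 1.0 / (1.0 + np.exp(-V))          # Eq. (7)
    sigma = np.random.rand(*V.shape)      # sigma ~ U(0,1), one draw per coordinate
    return (S > sigma).astype(int)        # Eq. (8): 1 selects the feature, 0 drops it

# example: 3 bats exploring a 5-feature problem
V = np.random.randn(3, 5)
print(binarize_position(V))
```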
3. Optimum-Path Forest
The Optimum-Path Forest (OPF) classifier (Papa et al., 2009; Papa, Falcão, Albuquerque, & Tavares, 2012) works by modeling the samples as graph nodes, whose arcs are defined by an adjacency relation and weighted by some distance function. Further, a competition process between some key nodes (prototypes) is carried out in order to partition the graph into optimum-path trees (OPTs) according to some path-cost function. Therefore, to design an Optimum-Path Forest-based classifier, one needs to define: (i) an adjacency relation, (ii) a path-cost function, and (iii) a methodology to estimate prototypes. The next section describes the Optimum-Path Forest classifier employed in this work.
3.1. Background theory
Suppose we have a fully labeled dataset $Z = Z_1 \cup Z_2$, in which $Z_1$ and $Z_2$ stand for the training and test sets, respectively. Let $S \subset Z_1$ be a set of prototypes of all classes (i.e., key samples that best represent the classes). Let $(Z_1, A)$ be a complete graph whose nodes are the samples in $Z_1$ and in which any pair of samples defines an arc in $A = Z_1 \times Z_1$. Let $\pi_s$ be a path in the graph that ends in sample $s \in Z_1$, and $\pi_s \cdot \langle s,t \rangle$ the concatenation between $\pi_s$ and the arc $(s,t)$, $t \in Z_1$. In this paper, we employ a path-cost function that returns the maximum arc-weight along a path, in order to avoid chains and to convey the idea of connectivity between samples. This path-cost function is denoted here as $W$, and it can be computed as follows:

$W(\langle s \rangle) = \begin{cases} 0 & \text{if } s \in S, \\ +\infty & \text{otherwise}, \end{cases}$

$W(\pi_s \cdot \langle s,t \rangle) = \max\{W(\pi_s), d(s,t)\}. \quad (9)$

Thus, the objective of the Optimum-Path Forest algorithm (supervised version) is to minimize $W(\pi_t)$, $\forall t \in Z_1$.
An optimal set of prototypes $S^*$ can be found by exploiting the theoretical relation between the minimum spanning tree (Cormen, Leiserson, Rivest, & Stein, 2001) and the optimum-path tree for $W$ (Allène, Audibert, Couprie, Cousty, & Keriven, 2007). By computing a minimum spanning tree in the complete graph $(Z_1, A)$, we obtain a connected acyclic graph whose nodes are all the samples of $Z_1$ and whose arcs are undirected and weighted by the distances $d$ between adjacent samples. The spanning tree is optimum in the sense that the sum of its arc weights is minimum as compared to any other spanning tree in the complete graph. In the minimum spanning tree, every pair of samples is connected by a single path, which is optimum according to $W$. Thus, the minimum spanning tree contains one optimum-path tree for any selected root node. The optimum prototypes are the closest elements of the minimum spanning tree with different labels in $Z_1$ (i.e., elements that fall on the frontier between the classes).
The Optimum-Path Forest training phase consists, essentially, in starting the competition process between prototypes in order to minimize the cost of each training sample. At the end of this procedure, we obtain an Optimum-Path Forest, which is a collection of optimum-path trees rooted at each prototype. A sample connected to an OPT is more strongly connected to the root of that tree than to any other root in the forest.
Further, in the classification phase, for any sample $t \in Z_2$, we consider all arcs connecting $t$ with samples $s \in Z_1$, as though $t$ were part of the training graph. Considering all possible paths from $S^*$ to $t$, we find the optimum path $P^*(t)$ from $S^*$ and label $t$ with the class $\lambda(R(t))$ of its most strongly connected prototype $R(t) \in S^*$. This path can be identified incrementally, by evaluating the optimum cost $C(t)$ as:

$C(t) = \min\left\{\max\left\{C(s), d(s,t)\right\}\right\}, \quad \forall s \in Z_1. \quad (10)$
Let the node $s^* \in Z_1$ be the one that satisfies Eq. (10) (i.e., the predecessor $P(t)$ in the optimum path $P^*(t)$). Given that $L(s^*) = \lambda(R(t))$, the classification simply assigns $L(s^*)$ as the class of $t$. An error occurs when $L(s^*) \neq \lambda(t)$.
4. Swarm-based Optimization
In this section, we briefly describe the swarm-based optimization techniques employed in this work.
4.1. Firefly Algorithm
The Firefly Algorithm was proposed by Yang (2010) and is derived from the flash attractiveness of fireflies for mating partners (communication) and for attracting potential prey. The brightness of a firefly is determined by some objective function, and the perceived light intensity $I$ depends on the distance $d$ from its source, as follows:

$I = I_0 e^{-\gamma d}, \quad (11)$

where $I_0$ is the original light intensity and $\gamma$ stands for the light absorption coefficient.
As a firefly's attractiveness is proportional to the light intensity seen by adjacent fireflies, we can now define the attractiveness $\beta$ of a firefly by

$\beta = \beta_0 e^{-\gamma d^2}, \quad (12)$

where $\beta_0$ is the attractiveness at $d = 0$.
A firefly $i$ is attracted to another firefly $k$ with a better fitness value, and moves according to:

$x_i^j(t+1) = x_i^j(t) + \beta_0 e^{-\gamma d_{i,k}^2}\left(x_k^j - x_i^j\right) + \phi\left(r_i - \tfrac{1}{2}\right), \quad (13)$

where the second term states the attraction between both fireflies, $d_{i,k}$ stands for the distance between fireflies $i$ and $k$, $\phi$ is a randomization factor, and $r_i \sim U(0,1)$.
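A minimal sketch of the move in Eq. (13) is given below (Python; our own illustrative code, using the Table 1 values as default parameters).

```python
import numpy as np

def firefly_move(x_i, x_k, beta0=0.1, gamma=0.8, phi=0.1):
    """Move firefly i towards the brighter firefly k (Eq. (13))."""
    d2 = np.sum((x_i - x_k) ** 2)              # squared distance between the two fireflies
    attraction = beta0 * np.exp(-gamma * d2)   # Eq. (12): attractiveness decays with distance
    r = np.random.rand(*x_i.shape)             # r_i ~ U(0,1), drawn per coordinate
    return x_i + attraction * (x_k - x_i) + phi * (r - 0.5)
```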
4.2. Gravitational Search Algorithm
Rashedi et al. (2009) proposed an optimization algorithm based on gravity, which is one of the fundamental interactions of nature. Their approach, called the Gravitational Search Algorithm, models each possible solution as a particle in the universe, which interacts with the other ones according to Newton's law of universal gravitation (Halliday, Resnick, & Walker, 2000).
Let $p_i$ be a particle in a universe, and $x_i \in \mathbb{R}^n$ and $v_i \in \mathbb{R}^n$ its position and velocity, respectively. One can define, at a specific time $t$, the force acting on particle $i$ from particle $k$ in the $j$th dimension as follows:

$F_{ik}^j(t) = G(t)\,\dfrac{M_i(t)\,M_k(t)}{R_{ik}(t) + \varepsilon}\left(x_k^j(t) - x_i^j(t)\right), \quad (14)$
where $R_{ik}(t)$ is the Euclidean distance between particles $i$ and $k$, $M_i$ stands for the mass of particle $i$, and $\varepsilon$ is a small constant to avoid division by zero. $G$ is a gravitational potential, which is given by

$G(t) = G(t_0)\left(\dfrac{t_0}{t}\right)^{f}, \quad f < 1, \quad (15)$

in which $f$ is a control parameter (Mansouri, Nasseri, & Khorrami, 1999), $G(t)$ is the value of the gravitational potential at time $t$, and $G(t_0)$ is its value at the time of the creation of the universe under consideration (Mansouri et al., 1999).
To give a stochastic behaviour to the Gravitational Search Algorithm, Rashedi et al. (2009) take the total force that acts on particle $i$ in dimension $j$ to be a randomly weighted sum of the forces exerted by the other agents:

$F_i^j(t) = \sum_{k=1,\,k \neq i}^{m} r_k F_{ik}^j(t), \quad (16)$

in which $r_k \sim U(0,1)$ and $m$ denotes the number of particles (the size of the universe).
The acceleration of particle $i$ at time $t$ in dimension $j$ is given by

$a_i^j(t) = \dfrac{F_i^j(t)}{M_i(t)}, \quad (17)$

in which the mass $M_i$ is calculated as follows:

$M_i(t) = \dfrac{q_i(t)}{\sum_{k=1}^{m} q_k(t)}, \quad (18)$

with

$q_i(t) = \dfrac{f_i(t) - w(t)}{b(t) - w(t)}. \quad (19)$

The terms $w(t)$ and $b(t)$ denote, respectively, the worst and the best fitness values among all particles, and the term $f_i(t)$ stands for the fitness value of particle $i$.
Finally, to avoid local optimal solutions, only the best $b$ masses, i.e., the ones with the highest fitness values, will attract the others. Let $B$ be the set of these masses. The value of $b$ is set to $b_0$ at the beginning of the algorithm and decreases with time. Hence, Eq. (16) is rewritten as:

$F_i^j(t) = \sum_{k \in B,\,k \neq i} r_k F_{ik}^j(t). \quad (20)$
The velocity and position updating equations are given by:

$v_i^j(t+1) = r_i v_i^j(t) + a_i^j(t) \quad (21)$

and

$x_i^j(t+1) = x_i^j(t) + v_i^j(t+1), \quad (22)$

in which $r_i \sim U(0,1)$.
4.3. Harmony Search
Harmony Search (HS) is a meta-heuristic algorithm inspired by the improvisation process of music players. Musicians often improvise the pitches of their instruments searching for a perfect state of harmony (Geem, 2009). The main idea is to use the same process adopted by musicians to create new songs in order to obtain a near-optimal solution according to some fitness function. Each possible solution is modelled as a harmony, and each musical note corresponds to one decision variable.
The algorithm, which has a theoretical stochastic derivative background, generates after each iteration a new harmony vector $\vec{x}_{new} = (x_{new}^1, x_{new}^2, \ldots, x_{new}^n)$ based on memory considerations, pitch adjustments, and randomization (music improvisation). The variable $n$ stands for the number of decision variables, as stated for the aforementioned nature-inspired optimization techniques.
With regard to the memory consideration step, the idea is to model the process of creating songs, in which the musician can use her memories of good musical notes to create a new song. This process is modeled by the Harmony Memory Considering Rate (HMCR) parameter. Suppose that HMCR = 0.75: in this case, 75% of the new harmony will be composed of musical notes that come from the harmony memory, and the remaining 25% are chosen randomly, which simulates the process of music improvisation. Mathematically speaking:
$x_{new}^j \leftarrow \begin{cases} x_{new}^j \in \{x_1^j, \ldots, x_m^j\} & \text{with probability HMCR}, \\ x_{new}^j \in A^j & \text{with probability } (1 - \text{HMCR}), \end{cases} \quad (23)$

where $m$ and $A^j$ stand for the number of harmonies and the set of feasible ranges for decision variable $j$, respectively. Therefore, $\text{HMCR} \in [0,1]$ is the probability of choosing one value from the historical values stored in the harmony memory, and $(1 - \text{HMCR})$ is the probability of randomly choosing one feasible value.
Further, every component $j$ of the new harmony vector $\vec{x}_{new}$ is examined to determine whether it should be pitch-adjusted, which is controlled by the Pitch Adjusting Rate (PAR) variable:

$\text{Pitch adjusting decision for } x_{new}^j \leftarrow \begin{cases} \text{Yes} & \text{with probability PAR}, \\ \text{No} & \text{with probability } (1 - \text{PAR}). \end{cases} \quad (24)$

The pitch adjustment for each instrument is often used to improve solutions and to escape from local optima. This mechanism consists of shifting to neighbouring values of some decision variable in the harmony.
In such a way, if the pitch adjustment decision for the decision variable $x_{new}^j$ is Yes, $x_{new}^j$ is replaced as follows:

$x_{new}^j \leftarrow x_{new}^j + r^j h, \quad (25)$

where $h$ is an arbitrary distance bandwidth for the continuous design variable, and $r^j \sim U(0,1)$.
4.4. Particle Swarm Optimization
Particle Swarm Optimization (PSO) is an algorithm modeled on swarm intelligence that finds a solution in a search space based on social behaviour dynamics (Kennedy & Eberhart, 2001). Each possible solution of the problem is modeled as a particle in the swarm that imitates its neighbourhood based on an objective function.
Some definitions consider Particle Swarm Optimization a stochastic and population-based search algorithm, in which social behaviour learning allows each possible solution to move in the search space by combining aspects of the history of its own current and best locations with those of one or more members of the swarm, with some random perturbations. This process simulates the social interaction between humans looking for the same objective, or a flock of birds looking for food, for instance.
The entire swarm is modeled in a multidimensional space $\mathbb{R}^n$, in which each particle $p_i = (\vec{x}_i, \vec{v}_i)$ has two main features: (i) position ($\vec{x}_i$) and (ii) velocity ($\vec{v}_i$). The local best position ($\hat{x}_i$) and the global best solution $\vec{g}$ are also known for each particle. After defining the swarm size $m$, i.e., the number of particles, each one is initialized with random values for both velocity and position. Each individual is then evaluated with respect to some fitness function and its local best is updated. At the end, the global best is updated with the particle that achieved the best position in the swarm. This process is repeated until some convergence criterion is reached. The velocity and position update equations of particle $p_i$, in the simplest form that governs Particle Swarm Optimization at time step $t$, are respectively given by
$v_i^j(t+1) = w v_i^j(t) + c_1 r_1\left(\hat{x}_i^j(t) - x_i^j(t)\right) + c_2 r_2\left(\vec{g}^j - x_i^j(t)\right) \quad (26)$

and

$x_i^j(t+1) = x_i^j(t) + v_i^j(t+1), \quad (27)$
where $w$ is the inertia weight, which controls the contribution of the previous velocity to the update, and $r_1, r_2 \in [0,1]$ are random variables that give Particle Swarm Optimization its stochastic character. The variables $c_1$ and $c_2$ are used to guide the particles towards good directions.
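A compact sketch of Eqs. (26)-(27) is given below (Python; our own illustrative code, using the Table 1 values as default parameters).

```python
import numpy as np

def pso_step(X, V, P_best, g_best, w=0.9, c1=2.0, c2=2.0):
    """One PSO velocity/position update (Eqs. (26)-(27)).

    X, V   : (m, n) current positions and velocities
    P_best : (m, n) best position found so far by each particle
    g_best : (n,)   best position found so far by the whole swarm
    """
    m, n = X.shape
    r1, r2 = np.random.rand(m, n), np.random.rand(m, n)
    V = w * V + c1 * r1 * (P_best - X) + c2 * r2 * (g_best - X)   # Eq. (26)
    X = X + V                                                     # Eq. (27)
    return X, V
```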
5. Methodology
A data instance is typically described as a pair $(\vec{x}, y)$, in which $\vec{x} \in \mathbb{R}^n$ and $y$ stand for the feature vector and its label, respectively. Let $Z(\mathcal{A}, \mathcal{Y})$ be the dataset of our classification problem, in which $\mathcal{A}$ represents the set of feature vectors and $\mathcal{Y}$ the set of outputs related to each instance. A classifier is then defined as a function $f: \mathcal{A} \rightarrow \mathcal{Y}$, which predicts $y$ for a given $\vec{x}$ based on a model learned from a set of labeled data (supervised learning). In order to provide a better understanding of the problem, feature selection techniques aim to discover a minimal subspace which better describes the distribution of $\mathcal{A}$. More precisely, our goal is to select a value $m \leq n$ and project each data instance $\vec{x} \in \mathbb{R}^n$ onto a new one $\vec{x}' \in \mathbb{R}^m$. Furthermore, classification algorithms may suffer from the Hughes phenomenon (Hughes, 1968) in high-dimensional spaces, and thus require much more computational load for numerical solutions of dynamic programming problems (Bellman, 2010).
We now describe the methodology employed to evaluate the performance of the feature selection techniques discussed in the previous sections (Fig. 1 depicts a pipeline to clarify this procedure). Firstly, we randomly partition the dataset into $N$ folds, i.e., $Z = T_1 \cup \ldots \cup T_i \cup \ldots \cup T_N$. Note that each fold should be large enough to contain representative samples of the problem. Further, for each fold, we train a given instance of the OPF classifier over a subset of this fold, $Z_i^1 \subset T_i$, and an evaluation set $Z_i^2 = T_i \setminus Z_i^1$ is then classified in order to compute the fitness function which guides the stochastic optimization algorithm to select the most representative set of features. Each member of the population in the meta-heuristic algorithm is associated with a string of bits denoting the presence or absence of each feature. Thus, for each member, we construct a classifier from the training set with only the selected features and compute the fitness function by classifying $Z_i^2$. Once the procedure converges, i.e., all generations of the population have been computed, the agent (bat, firefly, mass, harmony, particle) with the highest fitness value encodes the solution with the most compact set of features. Further, we build a classification model using the training set and the selected features, and we also evaluate the quality of the solution by computing its effectiveness over the remaining folds, $T_j \in Z \setminus T_i$. Algorithm 2 details the methodology for comparing feature selection techniques.
Algorithm 2. Feature Selection Evaluation
Fig. 1 displays the above procedure. As aforementioned, feature selection is carried out over fold $i$, which is partitioned into a training set $Z_i^1$ and an evaluating set $Z_i^2$. The idea is to represent a possible subset of features as a string of bits, which encodes each agent's position in the search space. Thus, for each agent, we model the dataset using its string of bits, an OPF classifier is trained over the new $Z_i^1$, and its effectiveness using this subset of features is assessed over $Z_i^2$. This recognition rate is then used as the fitness function to guide each agent to new positions until we reach the convergence criterion. The agent with the best fitness value is then employed to build the final $Z_i^1$, which is used for OPF training. The final accuracy using the selected subset of features is computed over the remaining folds (red rectangle in Fig. 1). This procedure is repeated over all folds for mean accuracy computation.
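To make this protocol concrete, the sketch below (Python) outlines the evaluation loop of Algorithm 2 under simplifying assumptions: `train_opf`, `accuracy`, and `optimize` are hypothetical helpers standing in for the OPF classifier, the accuracy measure of Papa et al. (2009), and any of the swarm-based optimizers described above, and each fold is simply split in half into $Z_i^1$ and $Z_i^2$.

```python
import numpy as np

def evaluate_feature_selection(folds, optimize, train_opf, accuracy):
    """Wrapper evaluation of a feature selection method over N folds.

    folds    : list of (X, y) tuples, one per fold T_i
    optimize : swarm-based optimizer; receives a fitness function defined over
               bit strings and returns the best binary mask found
    """
    scores = []
    for i, (X_i, y_i) in enumerate(folds):
        # split fold T_i into a training part (Z1) and an evaluating part (Z2)
        half = len(X_i) // 2
        X1, y1, X2, y2 = X_i[:half], y_i[:half], X_i[half:], y_i[half:]

        # fitness of a bit mask = OPF accuracy on Z2 using only the selected features
        def fitness(mask):
            clf = train_opf(X1[:, mask == 1], y1)
            return accuracy(clf, X2[:, mask == 1], y2)

        best_mask = optimize(fitness, n_features=X_i.shape[1])

        # final model built on fold T_i with the selected features,
        # evaluated over the remaining folds T_j
        clf = train_opf(X_i[:, best_mask == 1], y_i)
        rest_X = np.vstack([X for j, (X, _) in enumerate(folds) if j != i])
        rest_y = np.hstack([y for j, (_, y) in enumerate(folds) if j != i])
        scores.append(accuracy(clf, rest_X[:, best_mask == 1], rest_y))
    return np.mean(scores)
```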
In regard to datasets, we have employed the following:
• Wisconsin Breast Cancer: 683 samples, 2 classes, and 10 features (Mangasarian, Wolberg, & Setiono, 1989).
• DNA: 2,000 samples, 3 classes, and 180 features (King, Feng, & Sutherland, 1995).
• USPS: 7,291 samples, 10 classes, and 256 features (Hull, 1994).
• Splice: 1,000 samples, 2 classes, and 60 features (Frank & Asuncion, 2010).
• Ionosphere: 351 samples, 2 classes, and 34 features (Frank & Asuncion, 2010).
• SVM Guide 1: 3,089 samples, 2 classes, and 4 features (Hsu, Chang, & Lin, 2003).
6. Experimental results
In this section, we evaluate the effectiveness of BBA in finding compact sets of features with predictive performance as high as possible, and compare it against other swarm-based feature selection methods. We applied the methodology presented in Section 5 to obtain a better quality estimate of each solution. More precisely, we defined $k = 5$ for the cross-validation scheme, which implied five rounds of feature selection for each method, with the quality of each solution evaluated on the remaining four folds. All features in each dataset discussed in this section were normalized within the range $[0,1]$, to avoid attributes with greater numeric ranges dominating those with smaller numeric ranges. It is worth noting that the Euclidean metric was employed for OPF distance computation. In addition, regarding the fitness function and the final classification performance, we used the accuracy measure proposed by Papa et al. (2009), which considers the fact that classes may have different concentrations in the dataset. This information avoids a strong estimation bias towards the majority class in highly imbalanced datasets. Additionally, we employed Principal Component Analysis (PCA) for comparison purposes with two distinct configurations: using 70% of the dimensions with the largest eigenvalues (called PCA), and using the same number of features as the best technique on that dataset (called PCA2).

Fig. 1. Pipeline of the proposed methodology.
Fig. 2. Experimental results using different transfer functions for each swarm-based optimization technique: (a) Wisconsin Breast Cancer dataset; (b) Ionosphere dataset; (c) DNA dataset.
Figs. 2 and 3 display the results obtained over the datasets. Recall that, in the legend of each figure, Baseline means OPF with the entire set of features, to give us a reference point; Binary stands for the methods as proposed in the literature (Falcón, Almeida, & Nayak, 2011; Firpi & Goodman, 2004; Nakamura et al., 2012; Ramos et al., 2011; Rashedi et al., 2010) for feature selection purposes (i.e., the transfer function used is the one displayed in Eq. (8)); in such methods, after each particle's position is changed to binary values using Eq. (8), its coordinates keep moving in the search space as binary coordinates. In the following two methods, the particles assume binary coordinates only for OPF computation purposes, i.e., their original values in the search space are continuous-valued. Sigmoid (Eq. (28)) denotes continuous versions of the techniques with a sigmoid function as the continuous-to-binary mapping:

$f(x) = \dfrac{1}{1 + \exp(-x)}, \quad (28)$

and Hyperbolic Tangent (Eq. (29)) means the continuous approaches with the hyperbolic tangent as transfer function:

$f(x) = |\tanh(x)|. \quad (29)$
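For illustration, the two transfer functions can be written as follows (Python; our own sketch, not from the paper):

```python
import numpy as np

def sigmoid_transfer(x):
    """Eq. (28): squash a continuous coordinate into (0,1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh_transfer(x):
    """Eq. (29): absolute hyperbolic tangent, also in [0,1)."""
    return np.abs(np.tanh(x))

# either output is compared against a uniform random threshold,
# as in Eq. (8), to decide whether a feature is selected
x = np.linspace(-3.0, 3.0, 7)
print(sigmoid_transfer(x))
print(tanh_transfer(x))
```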
Fig. 3. Experimental results using different transfer functions for each swarm-based optimization technique: (a) Splice dataset; (b) USPS dataset; (c) SVM Guide 1 dataset.
Table 1
Parameters used for each optimization technique. The parameter values were empirically chosen based on results reported in previous studies in the literature.

Technique | Parameters
Bat Algorithm | $\alpha = \gamma = 0.9$
Firefly Algorithm | $\beta_0 = 0.1$, $\gamma = 0.8$, $\phi = 0.1$
Gravitational Search Algorithm | $G_0 = 100$, $f = 0.8$
Harmony Search | HMCR = 0.7
Particle Swarm Optimization | $c_1 = c_2 = 2$, $w = 0.9$
Table 1 presents the parameters used for each evolutionary-based technique. It is important to clarify that, for all techniques, we assumed a population size of 30 agents and 100 generations to reach a solution.
Loosely speaking, meta-heuristic algorithms differ mainly with respect to the balance between generating diverse solutions, so as to explore the search space on a global scale, and searching in a local region by exploiting the neighbourhood of a good solution. Figs. 2 and 3 compare each evaluated algorithm against BBA on the aforementioned six public datasets. As we can see, with respect to the Wisconsin Breast Cancer dataset, all feature selection approaches reduced the set of features considerably and, indeed, improved the predictive performance of the OPF classifier. In regard to the remaining datasets, the overall classification performance remained quite similar when considering the original and reduced datasets. It is also possible to highlight the number of features selected by BBA with the Sigmoid transfer function on the Ionosphere dataset, which was considerably smaller than for the other approaches.
The reader may observe that the Bat Algorithm performed at least similarly to the traditional algorithms, outperforming the others on the Splice dataset. Indeed, BA works similarly to traditional Particle Swarm Optimization, as the frequency essentially controls the pace and range of the movement of the bats. However, this difference is crucial to intensification, which means a local diversification, via randomization, that avoids the solutions being trapped at local optima. In addition, BA and Harmony Search do not make an explicit distinction between global and local search, which may become an advantage for the user when defining parameters.
Additionally, in regard to the experiment with different transfer functions, for almost all datasets and swarm-based optimization techniques, the Hyperbolic Tangent selected more features than the Binary and Sigmoid functions. This last study is interesting for pointing out reasonable transfer functions to be employed in binary-based optimization problems. Finally, we defined the classification accuracy over a test set and the final size of the reduced set as measures for comparing the presented feature selection approaches. Despite the importance of these measures for feature selection purposes, since we are investigating the sensitivity to over-fitting and the curse of dimensionality, they may not be a proper evaluation of the meta-heuristic algorithms' performance: an optimal solution over the evaluation set may not be a good solution with respect to the test set. Despite these considerations, we provided an analysis over five meta-heuristic algorithms with a fairer performance estimation through a cross-validation scheme. In general, the results showed that swarm-inspired algorithms are suitable choices for optimization problems in which there is some discontinuity or complexity in the objective functions.
7. Conclusion
In this paper, we have presented a wrapper feature selection approach based on the Bat Algorithm and the Optimum-Path Forest classifier, which combines an exploration of the search space with an intense local analysis, exploiting the neighbourhood of a good solution to reduce the feature space dimensionality. We have also proposed a methodology to evaluate feature selection methods by performing a k-fold cross-validation.
The proposed approach was compared with traditional meta-heuristic algorithms on six public datasets. As we have a binary-based feature selection process, we also evaluated two different transfer functions, a Hyperbolic Tangent and a Sigmoid function, which map continuous-valued positions into binary ones. The idea is to analyze continuous optimizations with different transfer functions, besides the binary optimization approaches from the literature.
The results showed that BA is as effective as some state-of-the-art swarm-based optimization techniques, and it can also drastically compact the feature set in all evaluated datasets, while in some cases it can indeed improve the predictive performance. Additionally, the Hyperbolic Tangent transfer function appears to select more features than the Sigmoid function for almost all datasets and swarm-based optimization techniques.
References
Allène, C., Audibert, J. Y., Couprie, M., Cousty, J., & Keriven, R. (2007). Some links between min-cuts, optimal spanning forests and watersheds. In Proceedings of the international symposium on mathematical morphology, MCT/INPE (pp. 253-264).
Alonso-Atienza, F., Rojo-Álvarez, J. L., Rosado-Muñoz, A., Vinagre, J. J., García-Alberola, A., & Camps-Valls, G. (2012). Feature selection using support vector machines and bootstrap methods for ventricular fibrillation detection. Expert Systems with Applications, 39, 1956-1967.
Bellman, R. (2010). Dynamic programming. Princeton, NJ, USA: Princeton University Press.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms (2nd ed.). The MIT Press.
Falcón, R., Almeida, M., & Nayak, A. (2011). Fault identification with binary adaptive fireflies in parallel and distributed systems. In Proceedings of the IEEE congress on evolutionary computation (pp. 1359-1366). IEEE.
Firpi, H. A., & Goodman, E. (2004). Swarmed feature selection. In Proceedings of the 33rd applied imagery pattern recognition workshop (pp. 112-118). Washington, DC, USA: IEEE Computer Society.
Frank, A., & Asuncion, A. (2010). UCI machine learning repository.
Geem, Z. W. (2009). Music-inspired harmony search algorithm: Theory and applications (1st ed.). Springer Publishing Company, Incorporated.
Griffin, D. R., Webster, F. A., & Michael, C. R. (1960). The echolocation of flying insects by bats. Animal Behaviour, 8, 141-154.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2006). Feature extraction: Foundations and applications (Studies in fuzziness and soft computing). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Guyon, I., Li, J., Mader, T., Pletscher, P. A., Schneider, G., & Uhr, M. (2007). Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters, 28, 1438-1444.
Halliday, D., Resnick, R., & Walker, J. (2000). Extended fundamentals of physics. Wiley.
Hsu, C., Chang, C., & Lin, C. (2003). A practical guide to support vector classification. Technical Report, National Taiwan University.
Huang, C.-L., & Wang, C.-J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31, 231-240.
Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14, 55-63.
Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 550-554.
Jain, A., & Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 153-158.
Kennedy, J., & Eberhart, R. C. (1997). A discrete binary version of the particle swarm algorithm. In IEEE international conference on systems, man and cybernetics (Vol. 5, pp. 4104-4108).
Kennedy, J., & Eberhart, R. (2001). Swarm intelligence. M. Kaufman.
King, R. D., Feng, C., & Sutherland, A. (1995). Statlog: Comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence, 9, 289-333.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671-680.
Koza, J. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: The MIT Press.
Kudo, M., & Sklansky, J. (2000). Comparison of algorithms that select features for pattern classifiers. Pattern Recognition, 33, 25-41.
Kuncheva, L. I., & Jain, L. C. (1999). Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognition Letters, 20, 1149-1156.
Mangasarian, O. L., Wolberg, W., & Setiono, R. (1989). Pattern recognition via linear programming: Theory and application to medical diagnosis. Technical Report TR 0878, University of Wisconsin, Madison, WI, USA.
Mansouri, R., Nasseri, F., & Khorrami, M. (1999). Effective time variation of G in a model universe with variable space dimension. Physics Letters, 259, 194-200.
Metzner, W. (1991). Echolocation behaviour in bats. Science Progress Edinburgh, 75, 453-465.
Nakamura, R. Y. M., Pereira, L. A. M., Costa, K. A., Rodrigues, D., Papa, J. P., & Yang, X.-S. (2012). BBA: A binary bat algorithm for feature selection. In Proceedings of the XXV SIBGRAPI conference on graphics, patterns and images (pp. 291-297).
Oh, I.-S., Lee, J.-S., & Moon, B.-R. (2004). Hybrid genetic algorithms for feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 1424-1437.
Papa, J. P., Falcão, A. X., Albuquerque, V. H. C., & Tavares, J. M. R. S. (2012). Efficient supervised optimum-path forest classification for large datasets. Pattern Recognition, 45, 512-520.
Papa, J. P., Falcão, A. X., & Suzuki, C. T. N. (2009). Supervised pattern classification based on optimum-path forest. International Journal of Imaging Systems and Technology, 19, 120-131.
Ramos, C., Souza, A., Chiachia, G., Falcão, A., & Papa, J. (2011). A novel algorithm for feature selection using harmony search and its application for non-technical losses detection. Computers & Electrical Engineering, 37, 886-894.
Rashedi, E., Nezamabadi-pour, H., & Saryazdi, S. (2009). GSA: A gravitational search algorithm. Information Sciences, 179, 2232-2248.
Rashedi, E., Nezamabadi-pour, H., & Saryazdi, S. (2010). BGSA: Binary gravitational search algorithm. Natural Computing, 9, 727-745.
Schnitzler, H.-U., & Kalko, E. K. V. (2001). Echolocation by insect-eating bats. BioScience, 51, 557-569.
Wang, X., Yang, J., Teng, X., Xia, W., & Jensen, R. (2007). Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 28, 459-471.
Yang, X.-S. (2010). Firefly algorithm, stochastic test functions and design optimisation. International Journal of Bio-Inspired Computation, 2, 78-84.
Yang, X.-S. (2011). Bat algorithm for multi-objective optimisation. International Journal of Bio-Inspired Computation, 3, 267-274.
