Multilabel Feature Selection: A Comprehensive Review and Guiding Experiments
DOI: 10.1002/widm.1240
ADVANCED REVIEW
1 Intelligent Data Processing Laboratory (IDPL), Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
2 Mahani Mathematical Research Center, Shahid Bahonar University of Kerman, Kerman, Iran

Correspondence
Hossein Nezamabadi-pour, Intelligent Data Processing Laboratory (IDPL), Department of Electrical Engineering, Shahid Bahonar University of Kerman, P.O. Box 76619-133, Kerman, Iran.
Email: nezam@uk.ac.ir

ABSTRACT
Feature selection has been an important issue in machine learning and data mining, and is unavoidable when confronting high-dimensional data. With the advent of multilabel (ML) datasets and their vast applications, feature selection methods have been developed for dimensionality reduction and improvement of the classification performance. In this work, we provide a comprehensive review of the existing multilabel feature selection (ML-FS) methods, and categorize these methods based on different perspectives. As feature selection and data classification are closely related to each other, we provide a review of ML learning algorithms as well. Also, to facilitate research in this field, a section is provided for setup and benchmarking that presents evaluation measures, standard datasets, and existing software for ML data. At the end of this survey, we discuss some challenges and open problems in this field that can be pursued by researchers in the future.
This article is categorized under:
Technologies > Data Preprocessing
1 | INTRODUCTION
Nowadays, the world has encountered the problem of high-dimensional data generated from different sources in various fields
such as health care, social media, transportation, bioinformatics, microarray data, e-commerce, multimedia data, and so on. Fast
growth of data imposes big challenges on effective and efficient data management in the field of pattern recognition and machine
learning, that is, on applying machine learning algorithms to discover knowledge from high-dimensional data. Recently, it has
been realized that preprocessing methods play an essential role in working effectively with datasets. Such methods process the data before it is presented to a learning algorithm, with the goal of improving performance. In the fields of pattern recognition, machine learning, and statistics, feature selection, also known as feature subset selection, attribute selection, or variable selection, is a preprocessing technique that selects a subset of relevant features (variables) from the existing features to be used in model construction. There is wide-ranging interest in feature selection as a data preprocessing technique among experts in pattern recognition and machine learning. Fundamental motivations for feature selection are as follows:
• Avoiding curse of dimensionality as the main reason (Friedman, Hastie, & Tibshirani, 2001)
• Reducing the training time of the algorithms
• Simplifying the models to make them easier to interpret (James, Witten, Hastie, & Tibshirani, 2013)
• Enhancing the generalization ability of the systems by reducing overfitting (Bermingham et al., 2015).
Among the above-mentioned reasons, the most important is the "curse of dimensionality," which confuses the model (classifier) and reduces classification accuracy. Using a suitable feature selection scheme can play
WIREs Data Mining Knowl Discov. 2018;8:e1240. wires.wiley.com/dmkd © 2018 Wiley Periodicals, Inc. 1 of 29
https://doi.org/10.1002/widm.1240
2 of 29 KASHEF ET AL.
a very important role in addressing this problem by eliminating the irrelevant and redundant data and reducing the dimensionality. To clarify the feature selection procedure, suppose X is the original set of features with cardinality |X| = M, and E : 2^X → ℝ is the evaluation function to be optimized; the feature selection process is defined as finding X′ ⊆ X such that |X′| = m < M and E(X′) is optimized (Barani, Mirhosseini, & Nezamabadi-pour, 2017).
Feature selection methods are generally categorized as three main groups including filters, wrappers, and embedded
methods. The methods of filter category are independent of learning algorithms and make use of general characteristics of the
training data for selecting the best features. Such methods rank the features using some criteria and eliminate the features with
insufficient scores. The main advantage of methods of this group is their low computational complexity, which makes them
suitable to be used in high-dimensional data. Wrapper methods, on the other hand, take advantage of a specific learning algo-
rithm as a part of the feature selection process; hence, these approaches usually gain better results, but are computationally
expensive and sometimes impractical. Filter and wrapper methods can be considered complementary to each other, and embedded methods exploit the strengths of these two categories simultaneously. In other words, feature selection is
performed as part of the model constructing process in embedded approaches (Kashef & Nezamabadi-pour, 2015).
In addition to the previously mentioned categorization, existing feature selection methods fall under three main categories
from the label perspective: supervised, unsupervised, and semisupervised methods. In supervised feature selection tech-
niques, sufficient labeled training data samples are available and feature relevance is determined by evaluating feature’s cor-
relation with the class. On the other hand, unsupervised methods do not need any labeled training data samples. The simplest
unsupervised method may be maximum variance, in which features are evaluated by their variance and those with maximum variance are selected (Ren, Zhang, Yu, & Li, 2012). Although supervised methods achieve higher accuracy because they rely on more information, obtaining data labels is usually expensive, especially in high-dimensional datasets, so supervised methods are not always applicable. Moreover, as label information is not available in unsupervised scenarios, it is hard to choose discriminative features. Semisupervised feature selection techniques are suitable for cases in which only a few of the training data samples are labeled. Such datasets have recently become common in real applications. In this situation, the training procedure of supervised feature selection results in overfitting.
In classical supervised learning problems, each instance in the dataset belongs to only one label yj from a set of labels L .
This type of datasets is called single-label (SL) data. However, in some real-world problems, each instance is usually associ-
ated with a set of labels, Yi L, simultaneously. Such prediction tasks are usually denoted as multilabel (ML) classification
problems, and are widely used in applications such as semantic image and video annotation (Boutell, Luo, Shen, & Brown,
2004; Yang, Jiang, Hauptmann, & Ngo, 2007), classification of protein functions and genes (Diplaris, Tsoumakas, Mitkas, &
Vlahavas, 2005; Zhang & Zhou, 2006), categorization of text (Luo, Chen, & Xiong, 2011) and emotions evoked by music
(Trohidis, Tsoumakas, Kalliris, & Vlahavas, 2008), and so on. As an example of the image annotation task, an image can simultaneously depict "sky," "trees," and "sunrise," so it should be included in all three categories. Due to the fast emergence and spread of ML datasets, many studies have been carried out in this field during the past decade. As in the SL case, feature selection is an essential task in ML data classification due to the large number of features, and many works have addressed it.
Beside the feature selection methods, there exist other preprocessing approaches that focus on features including feature
extraction, feature construction, and feature ranking with the goal of improving the performance of the machine learning
algorithms. Feature extraction methods aim to reduce the dimensionality of data by creating new features using a linear com-
bination of all features. Such methods are supposed to build informative and nonredundant features from an initial set of
measured data to facilitate learning. There are some ML feature extraction methods some of which can be found in the
papers of Carmona-Cejudo, Baena-García, del Campo-Avila, and Morales-Bueno (2011); Naula, Airola, Salakoski, and
Pahikkala (2014); Xu, Liu, Yin, and Sun (2016); Yu, Yu, and Tresp (2005); and Zhang and Zhou (2010). Methods of feature
construction category attempt to combine original features into new high-level features with the purpose of providing better discriminative ability. There are few works on feature construction in the ML domain, such as Duivesteijn, Mencía, Fürnkranz, and Knobbe (2012) and Prati and de França (2013). In ML tasks, in addition to feature construction, label construction can be used to generate new labels employing the information obtained by considering the relations between the
original labels (Spolaôr, Monard, Tsoumakas, & Lee, 2014). Feature ranking techniques are mainly employed to evaluate the
individual relevance of features and order features based on their relevance in prediction of labels. Such methods can be
helpful for feature selection by eliminating the features, which are the least significant. In Kocev, Slavkov, and Dzeroski
(2013); Lee and Kim (2015a); Reyes, Morell, and Ventura (2014); and Teisseyre (2016), some feature ranking methods are
presented. Our study focuses mainly on reviewing multilabel feature selection (ML-FS) methods for the classification task, so the other ML feature engineering approaches are not explained in detail.
Due to large amount of research in the field of single-label feature selection (SL-FS), there are many review
papers, which categorize and analyze these methods from different perspectives. Recent reviews on SL-FS methods can be
found in the papers of Bolón-Canedo, Sánchez-Maroño, and Alonso-Betanzos (2013); Bolón-Canedo, Sánchez-Marono,
Alonso-Betanzos, Benítez, and Herrera (2014); Chandrashekar and Sahin (2014); Choudhary and Saraswat (2014); Li
et al. (2016); Vergara and Estévez (2014); Xue, Zhang, Browne, and Yao (2016); and Yu and Liu (2004). In Xue
et al. (2016) and De La Iglesia (2013), evolutionary-based SL-FS methods are surveyed in detail. Moreover, a comprehensive overview of recent advances in feature selection is presented in Li et al. (2016). In recent years, with the increasing spread of ML classification, there has been much research on ML learning and feature selection. There exist some papers that review ML learning algorithms, such as Tsoumakas, Katakis, and Vlahavas (2009) and Zhang and Zhou (2014). For ML-FS, Spolaôr, Monard, Tsoumakas, and Lee (2016) provide a systematic review that investigates the existing ML-FS publications. They illustrate the necessity of providing a complete taxonomy for ML-FS in this work. Another paper, by Pereira, Plastino, Zadrozny, and Merschmann (2016), reviews ML-FS methods and provides a good categorization of them. To the best of our knowledge, there are no other review papers on the ML-FS task. Despite the
existence of the mentioned ML review papers, there is still a need to provide a review for analyzing the ML-FS methods
from different views, investigating more recent works and discussing the challenges for future work. Therefore, we intended
to provide a comprehensive study to review and categorize the existing methods to help future researchers in their work.
The paper is organized as follows: Section 2 gives fundamental concepts including formal definition of ML data, and
suggests a novel taxonomy of ML classification approaches. ML-FS task is described in Section 3, and the existing methods
are reviewed and categorized based on different perspectives. Section 4 discusses setup and benchmarking by investigating
standard ML databases along with performance evaluation criteria, nonparametric tests, and developed software for ML task.
Finally, in Section 5, we discuss some challenges and open problems that are ignored and need more attention in future
works, and draw the conclusions.
2 | FUNDAMENTAL CONCEPTS
In this section, the formal definition of ML data is firstly presented. Then, a comprehensive taxonomy of ML learning algo-
rithms is introduced, and the most important methods are briefly described.
2.1 | ML data
Suppose X = ℝ^M or ℤ^M is an M-dimensional instance space, and L = {y1, y2, …, yq} denotes the label space with q possible class labels. The task of ML learning is to learn a function h : X → 2^L from the ML training set D = {(xi, Yi), i = 1, …, N} with N samples, where each sample is associated with a feature vector xi = (xi1, xi2, …, xiM) described by M features, and a binary label vector Yi = (yi1, yi2, …, yiq) described by q labels. Figure 1 shows this representation.
For each unseen sample x ∈ X, the ML classifier h(·) predicts h(x) ⊆ L as the set of proper labels for x.
To describe the properties of ML datasets, several practical ML measures can be employed. Label cardinality
(LC) measures the degree of multilabeledness, which is defined by Equation (1). In other words, it measures the average
number of labels associated with each instance. The other ML measure is label density (LD), which is the cardinality normal-
ized by jLj defined by Equation (2) (Zhang & Zhou, 2014).
LC(D) = (1/|D|) Σ_{i=1}^{|D|} |Yi|, (1)

LD(D) = (1/|D|) Σ_{i=1}^{|D|} |Yi| / |L|. (2)
Unlike LD, which takes the number of labels q into account, LC does not depend on the number of labels and quantifies the average number of labels that describe the examples of an ML training dataset. Two datasets with equal LC but different LD might behave differently under an ML learning method (Tsoumakas et al., 2009).
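Given a binary label matrix as in Figure 1, Equations (1) and (2) reduce to simple row sums; a minimal sketch (the function names and toy matrix are ours):

```python
import numpy as np

def label_cardinality(Y):
    """LC(D): average number of relevant labels per instance, Equation (1)."""
    return Y.sum(axis=1).mean()

def label_density(Y):
    """LD(D): label cardinality normalized by the number of labels |L|, Equation (2)."""
    return label_cardinality(Y) / Y.shape[1]

# Binary label matrix: 4 instances, 5 labels.
Y = np.array([[1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [0, 1, 1, 0, 1]])
print(label_cardinality(Y))  # (2 + 3 + 1 + 3) / 4 = 2.25
print(label_density(Y))      # 2.25 / 5 = 0.45
```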
FIGURE 1 Representation of an ML dataset: a feature matrix X (columns x1, …, xM) paired with a binary label matrix Y (columns y1, …, yq)
Another popular criterion is label diversity that represents the number of distinct label sets, and is helpful for many prob-
lem transformation methods that work on subsets of labels.
[FIGURE (fragment): taxonomy of algorithm adaptation methods, comprising tree-based boosting (ADABOOST.MH, ADABOOST.MR), decision trees (ML-DT, ML-C4.5), and evolutionary-based methods (ML-KMPSO, MuLAM, hmANT-Miner, hmAntMiner-C, G3P-ML)]
appeared in the training set, and is unable to be generalized to those outside (Lee, Kim, Kim, & Lee, 2016). Various varia-
tions of the LP method can be found in the papers of Lo, Lin, and Wang (2014); Read (2008); and Read, Puurula, and Bifet
(2014) that try to improve the performance of this method and reduce its disadvantages.
Pruned problem transformation (PPT) is one of the extensions of the LP method proposed to overcome its disadvantage
regarding the created classes associated with rare instances. The idea is to prune label sets (new classes) that appear less
times than a user-defined threshold (usually 2 or 3). Instances belonging to these classes can either be eliminated or be
assigned to another class.
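The pruning step of PPT can be sketched directly from this description. A minimal illustration of the "eliminate" option only (the function name and toy data are ours; reassignment strategies are not shown):

```python
from collections import Counter

def prune_label_sets(label_sets, threshold=2):
    """Sketch of the pruning step in pruned problem transformation (PPT).

    Each distinct label set is an LP class; classes occurring fewer than
    `threshold` times are pruned and their instances eliminated. A fuller
    implementation could instead reassign those instances to frequent classes.
    """
    counts = Counter(label_sets)
    return [s for s in label_sets if counts[s] >= threshold]

data = [frozenset({"a"}), frozenset({"a"}), frozenset({"a", "b"}),
        frozenset({"b"}), frozenset({"b"})]
print(prune_label_sets(data))  # the rare class {a, b} is pruned
```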
In order to reduce the computational cost of evaluating all possible pairwise binary classifiers, Park and Fürnkranz (2007) extended a recently proposed method called QWeighted, which was designed for multiclass problems, and chose the base classifiers that are essential for predicting the top class.
The idea of pairwise classification can be extended to the ML classification problem. Here, a problem with q labels is transformed into q(q − 1)/2 subproblems, that is, a binary classifier for every pair of labels. The instances of the first label are used as positive instances and the instances of the second label as negative instances to train each classifier. To
classify a test sample, each classifier votes for one of the two labels. All labels are then sorted according to their sum of
votes, and the relevant labels for each sample are predicted by a label ranking algorithm (El Kafrawy, Mausad, & Esmail,
2015). In recent years, the idea of pairwise classification has been adapted to ML learning algorithms such as calibrated label ranking (CLR) (Fürnkranz et al., 2008), ranking by pairwise comparison (RPC) (Hüllermeier, Fürnkranz, Cheng, & Brinker, 2008), and the multilabel pairwise perceptron (MLPP) (Mencía, Park, & Fürnkranz, 2010). For example, Mencía et al. (2010) generalized the QWeighted approach to ML data.
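The pairwise (one-vs-one) transformation just described can be sketched as follows; this is an illustrative decomposition only (function name and toy data are ours), and instances relevant to both or neither label of a pair are ambiguous for that pair and skipped:

```python
from itertools import combinations

def pairwise_datasets(X, Y):
    """Build the q(q-1)/2 binary subproblems of the pairwise transformation.

    For each label pair (a, b): instances relevant to a but not b become
    positives, instances relevant to b but not a become negatives.
    """
    q = len(Y[0])
    problems = {}
    for a, b in combinations(range(q), 2):
        pos = [x for x, y in zip(X, Y) if y[a] and not y[b]]
        neg = [x for x, y in zip(X, Y) if y[b] and not y[a]]
        problems[(a, b)] = (pos, neg)
    return problems

X = [[0.1], [0.2], [0.3]]
Y = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
subs = pairwise_datasets(X, Y)
print(len(subs))  # q(q-1)/2 = 3 binary subproblems
```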
Now, the problem is finding the values of the prior probability P(Hj) (respectively P(¬Hj)) and the likelihood P(Cj | Hj) (respectively P(Cj | ¬Hj)), which are approximated from the training data via the frequency counting strategy (Zhang & Zhou, 2014). This process is repeated for each label yj ∈ L. Therefore, ML-kNN is a binary relevance learner, and as it must consider the information of all k nearest neighbors, its computational complexity is rather high. Another disadvantage of this classifier is
that it does not take into consideration the dependency between labels. However, it is still one of the most popular ML classi-
fiers that is frequently used in many ML-FS papers (Doquire & Verleysen, 2013a; Jungjit & Freitas, 2015a; Lee & Kim,
2015a; Li, Li, Zhai, Wang, & Zhang, 2016; Lin, Hu, Liu, Chen, & Duan, 2016).
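The frequency counting step for the priors is the simplest part of ML-kNN and can be sketched as follows. This shows only the prior estimation with Laplace smoothing (smoothing parameter s = 1 is a common default); the likelihoods P(Cj | Hj) are estimated analogously from neighbor label counts, which we do not reproduce here:

```python
import numpy as np

def mlknn_priors(Y, s=1.0):
    """Prior estimation in ML-kNN via smoothed frequency counting.

    P(H_j) is the smoothed fraction of training instances carrying label j:
    (s + sum_i y_ij) / (2s + N); P(-H_j) = 1 - P(H_j).
    """
    n = Y.shape[0]
    prior = (s + Y.sum(axis=0)) / (2.0 * s + n)
    return prior, 1.0 - prior

Y = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])  # 4 samples, 2 labels
prior, prior_neg = mlknn_priors(Y)
print(prior)  # [(1+3)/6, (1+2)/6] = [0.666..., 0.5]
```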
3 | MULTILABEL FEATURE SELECTION
As discussed earlier, feature selection is the process of reducing the number of features by eliminating irrelevant and redundant features. Given input data with M features, X = {x1, x2, …, xM}, and the label set L = {y1, y2, …, yq}, an ML-FS algorithm should discover the smallest subset of nonredundant features S ⊆ X, with m ≪ M features, having the greatest relevance to the labels. In other words, feature–label dependency should be maximized whereas feature–feature dependency should be
minimized in the selected feature subset. Some feature selection methods only consider feature-label correlations and try to
select the most relevant features (Spolaôr, Cherman, Monard, & Lee, 2013a). As Makrehchi and Kamel (2005) showed, the
effect of redundant features is almost similar to that of noise and degrades the classifier performance. A further factor assumed to play an important role in improving the performance of ML-FS methods is label–label dependency. Many papers suggest considering feature–feature, label–feature, and label–label dependencies in ML-FS methods (Lin, Hu, Liu, & Duan, 2015).
FIGURE 3 Categorization of multilabel feature selection methods from different perspectives: label perspective (supervised, semi-supervised, unsupervised); search strategy perspective (exhaustive, sequential, randomized); interaction with the learning algorithm (filter, wrapper, embedded); data format perspective (problem transformation, algorithm adaptation)
the ML learning to multiple independent SL problems, which fails to take into consideration correlations between different
labels (Ma, Nie, Yang, Uijlings, & Sebe, 2012).
There are few algorithms designed specifically for semisupervised ML-FS. A convex semisupervised ML-FS (CSFS) algorithm for large-scale multimedia analysis is presented in Chang, Nie, Yang, and Huang (2014), in which both labeled and unlabeled data are utilized to select features while correlations among different features are simultaneously taken into consideration. In this method, the training and test data are first represented using different types of features, and the labels of the unlabeled data are set to zero. Afterward, a least-squares loss function is minimized, and sparse feature selection and label prediction are applied. Feature selection is performed using the obtained sparse coefficients. This method is computationally efficient, so it is appropriate for large-scale datasets.
In Qian and Davidson (2010), a framework for ML feature reduction is presented called Semi-Supervised Dimension
Reduction for Multi-Label Classification (SSDR-MC) that exploits semisupervised learning. The authors utilized reconstruc-
tion error to determine how well an example is represented by its nearest neighbors. In this way, the intrinsic geometric rela-
tions between examples are specified which can be helpful in both label inference and feature selection. The connection
between dimension reduction and ML learning is established by an alternating optimization process: (1) a weight matrix is learned from both the available labels and the feature description, and (2) the missing labels are derived according to the weight matrix; steps (1) and (2) are repeated until the predictions converge.
Another feature selection algorithm for semisupervised learning is proposed in Li, You, Ge, Yang, and Yang (2010), which is applied to the analysis of gene function. In this method, a cotraining algorithm, FESCOT, is incorporated with
ML-kNN leading to a new algorithm called Cotraining ML-kNN (COMN); then, prediction risk-based embedded feature
selection for COMN (PRECOMN) is proposed that uses the sequential backward search algorithm to search for feature sub-
sets. To evaluate feature subsets, the prediction risk criterion is employed.
Here, feature Xi, i = 1, …, M, can take v distinct values, and each subset Dv ⊆ D consists of the set of examples in which Xi takes the value v.
FCBF is a multivariate filter approach presented in Yu and Liu (2003), which is especially designed for high-dimensional data. It considers feature–class correlations as well as feature–feature correlations to find a subset of features that
are highly correlated with the class but not highly correlated with each other. It introduces a measure called symmetrical uncertainty (SU), defined as the IG between two variables normalized by the sum of their entropies. First, it calculates the SU value for each feature and selects those features whose SU values exceed a user-defined threshold. Then, redundant features are
removed from this subset and a subset of relevant informative features remains.
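The SU measure at the core of FCBF can be sketched as follows; this is an illustrative implementation of the standard definition SU(X, Y) = 2·IG(X; Y) / (H(X) + H(Y)) (function names and toy data are ours):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy of a discrete variable, in bits."""
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), the measure FCBF ranks by.

    IG(X; Y) = H(X) + H(Y) - H(X, Y). SU is 0 for independent variables and
    1 when either variable fully determines the other.
    """
    h_x, h_y = entropy(x), entropy(y)
    ig = h_x + h_y - entropy(list(zip(x, y)))  # mutual information
    return 0.0 if h_x + h_y == 0 else 2.0 * ig / (h_x + h_y)

x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, [0, 0, 1, 1]))  # 1.0: perfectly correlated
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # 0.0: independent
```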
F-score is a criterion to evaluate the discriminative ability of features. Equation (6) shows how to calculate the F-score of
the ith feature. The numerator specifies the discrimination among the categories of the target variable, and the denominator
indicates the discrimination within each category. A larger F-score implies a greater likelihood that this feature is discrimina-
tive (Kashef & Nezamabadi-pour, 2015).
F_score_i = Σ_{k=1}^{c} (x̄_i^k − x̄_i)² / Σ_{k=1}^{c} [ (1/(N_k^i − 1)) Σ_{j=1}^{N_k^i} (x_ij^k − x̄_i^k)² ],  i = 1, 2, …, n, (6)
where c is the number of classes and n is the number of features; N_k^i is the number of samples of feature i in class k (k = 1, 2, …, c; i = 1, 2, …, n); x_ij^k is the jth training sample of feature i in class k (j = 1, 2, …, N_k^i); x̄_i is the mean value of feature i over all classes; and x̄_i^k is the mean of feature i over the samples in class k.
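Equation (6) for a single feature can be computed directly; a minimal sketch (function name and toy data are ours):

```python
import numpy as np

def f_score(x, y):
    """F-score of one feature, following Equation (6).

    x: feature values over all samples; y: class labels. Numerator:
    between-class scatter of the class means around the global mean;
    denominator: pooled within-class scatter, each class normalized
    by N_k - 1 (classes with a single sample are not handled here).
    """
    x, y = np.asarray(x, float), np.asarray(y)
    global_mean = x.mean()
    num, den = 0.0, 0.0
    for k in np.unique(y):
        xk = x[y == k]
        num += (xk.mean() - global_mean) ** 2
        den += ((xk - xk.mean()) ** 2).sum() / (len(xk) - 1)
    return num / den

# Well-separated classes give a large F-score.
x = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
y = [0, 0, 0, 1, 1, 1]
print(f_score(x, y))  # approximately 400
```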
Chi-square is another common univariate feature selection method. The χ2 test is used in statistics to assess the independence of two events. More specifically, it is used in feature selection to test whether the occurrence of a specific feature Xi and the
occurrence of a specific label yj are independent. It assigns a weight to each feature, where higher weights indicate more
dependency between Xi and yj (Spolaôr & Tsoumakas, 2013).
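For a binary feature and a binary label, the χ2 statistic compares the observed contingency table with the counts expected under independence; a minimal sketch (function name and toy data are ours):

```python
import numpy as np

def chi2_feature_label(x, y):
    """Chi-square statistic for independence of a binary feature and label.

    Builds the 2x2 contingency table of feature occurrence vs. label
    occurrence; a larger value indicates stronger feature-label dependency.
    """
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    chi2 = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            observed = np.sum((x == xv) & (y == yv))
            expected = np.sum(x == xv) * np.sum(y == yv) / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

x = [1, 1, 0, 0]
print(chi2_feature_label(x, [1, 1, 0, 0]))  # fully dependent: chi2 = n = 4
print(chi2_feature_label(x, [1, 0, 1, 0]))  # independent: chi2 = 0
```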
CFS (Hall, 1999) is a multivariate filter algorithm that evaluates different subsets of features based on a correlation-based
heuristic function. CFS tries to find a small subset of relevant and nonredundant features by the following equation:
CFS-score_F = k · r̄_cf / √( k + k(k − 1) · r̄_ff ), (7)
where CFS-score_F is the heuristic score of subset F with k features, and r̄_cf and r̄_ff are the average feature–class and feature–feature correlations, respectively. The numerator reflects the predictive ability of the features, while the denominator indicates the redundancy among the features of F. To obtain the values of r̄_cf and r̄_ff, CFS uses the SU measure. To find a proper subset of features, CFS starts with an empty set and iteratively adds the feature giving the greatest CFS-score, until a stopping criterion is met.
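The merit function of Equation (7) is easy to evaluate in isolation; a minimal sketch showing how redundancy penalizes a subset (function name and the example correlation values are ours):

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """CFS heuristic score of a k-feature subset, Equation (7).

    r_cf: average feature-class correlation; r_ff: average feature-feature
    correlation (both measured with SU in CFS). High relevance and low
    redundancy maximize the merit.
    """
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Equal relevance, different redundancy: the redundant subset scores lower.
print(cfs_merit(2, 0.5, 0.1))  # low redundancy
print(cfs_merit(2, 0.5, 0.9))  # high redundancy
```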
two ways: (a) defining the maximum number of features to be selected (e.g., r features), and selecting the top r features,
(b) specifying a threshold value and selecting those features whose average or maximum score exceeds the threshold. The specified features are then retained in the ML dataset, and an ML classifier is employed to evaluate the
performance of the selected feature subset. This process is called the external approach to employing the BR transformation strategy for the ML-FS problem in Pereira et al. (2016) and is shown in Figure 5. In contrast, the internal approach uses a SL classifier after the feature selection process for each transformed dataset; that is, both the feature selection and classification methods are SL ones. The results of the SL classifiers are then combined, similar to a BR method for ML classification. Most papers that use the BR approach as the transformation strategy follow the external strategy, and there are few papers that use the internal strategy (Dendamrongvit, Vateekul, & Kubat, 2011). Therefore, we mean the external strategy when referring to BR transformation from now on. The internal process is shown in Figure 6.
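The external BR pipeline (transform per label, score with a SL criterion, aggregate, select in the original ML data) can be sketched as follows. All names are ours, and absolute Pearson correlation stands in for any SL scoring criterion:

```python
import numpy as np

def abs_corr(x, y):
    """Hypothetical SL scorer: absolute Pearson correlation (0 for constants)."""
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return abs(np.corrcoef(x, y)[0, 1])

def br_external_selection(X, Y, score_fn, r):
    """Sketch of the external BR strategy for ML-FS.

    Score each feature against each of the q labels with a SL criterion,
    aggregate the per-label scores by averaging, and keep the top r
    features of the original multilabel dataset.
    """
    scores = np.array([[score_fn(X[:, f], Y[:, j]) for j in range(Y.shape[1])]
                       for f in range(X.shape[1])])
    selected = np.sort(np.argsort(scores.mean(axis=1))[::-1][:r])
    return X[:, selected], selected

X = np.array([[1., 0., 5.], [0., 1., 5.], [1., 1., 5.], [0., 0., 5.]])
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
_, sel = br_external_selection(X, Y, abs_corr, 1)
print(sel)  # feature 0 tracks the labels; feature 2 is constant
```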
Chen et al. (2007) proposed the ELA transformation strategy, which assigns weights to features for different labels according to label entropy, as described before. The authors transform the ML data into SL data using the ALA, LLA, SLA, NLA, and
FIGURE 5 Multilabel feature selection using binary relevance transformation method in external form
FIGURE 6 Multilabel feature selection using binary relevance transformation method in internal form
ELA transformation strategies, and apply three SL filter methods including IG, CHI, and Optimal Orthogonal Centroid Fea-
ture Selection (OCFS) (Yan et al., 2005). Trohidis et al. (2008) applied LP to transform an ML music dataset to a SL one for
the task of music retrieval by emotion. The most salient features are then detected by χ 2 statistics. They found that LP + χ 2
is the most effective strategy among several ML classification methods. Doquire and Verleysen (2011) employ the PPT method introduced in Read (2008) to transform ML datasets into SL ones. Then, a greedy feature selection procedure based on multidimensional MI is executed. The work by Doquire and Verleysen (2013a) extends preliminary results presented in Doquire and Verleysen (2011) and proposes a way to automatically select the pruning parameter for PPT. A similar method is proposed in Reyes et al. (2015), which converts the ML problem to a SL
problem using PPT, and then utilizes the ReliefF algorithm for assigning weight to each feature. Spolaôr et al. (2013a) use
LP and BR to transform the problem, and then employ IG and ReliefF for feature selection. Finally, the performance of these
four ML-FS methods is compared. Tsoumakas and Vlahavas (2007), who proposed the RAKEL ML classifier, reduced the number of features before employing their classifier in order to lessen the computational cost of the training process. They used the BR transformation in conjunction with the χ2 statistic to get a ranking of all features for each
label. Finally, the top 500 features with the greatest score over all labels were selected.
Text categorization falls within the domain of ML problems; for example, an article about Persepolis can be categorized under different topics such as Iran, Culture, and History. The keyword "multilabel" may not be found in these studies, but they actually involve ML data. Research on text categorization has a long history (Olsson & Oard, 2006; Rogati & Yang, 2002). For example, Yang and Pedersen (1997) applied and compared several filter methods, including document frequency, MI, IG, term strength, and the χ2 statistic, in a text categorization problem. They evaluated each label
individually, similar to the BR transformation strategy. Finally, the performance of each feature selection method is evaluated
by kNN and LLSF.
Gharroudi et al. (2014) discuss three wrapper ML-FS algorithms based on the Random Forest (RF) model, called BRRF, RFLP (Random Forest Label Powerset), and RFPCT (Random Forest Predictive Clustering Tree). The first two methods transform the ML data into SL data using BR and LP, respectively. Then, an RF is applied to the created SL dataset(s). The experimental results show that BRRF performs better than RFLP, and the authors conclude that considering label dependency is not significantly effective in ML-FS.
is utilized as the second objective. However, overall, there was no statistically significant difference between the results of
the proposed LexGA-ML-CFS method and other methods.
Lin et al. (2015) proposed an incremental ML-FS method based on max-dependency and min-redundancy (MDMR) cri-
terion inspired by the well-known SL filter method, minimum Redundancy Maximum Relevancy (mRMR). More precisely,
features are selected one by one based on their calculated scores. Assuming a selected feature subset, the score of remaining
features is calculated according to θ = D − R, where D is related to the relevance of a feature with labels, and R defines the
redundancy of that feature with the selected feature subset. At each step, θ is calculated for all remaining features, and the
feature with the highest score is selected. This process is repeated until the desired number of features is selected.
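This incremental θ = D − R selection loop can be sketched as follows. The score matrices are assumed precomputed (e.g., via MI); averaging the redundancy over the selected subset is a common mRMR-style choice, and all names and toy values are ours:

```python
import numpy as np

def greedy_mdmr(relevance, redundancy, m):
    """Greedy incremental selection with a max-dependency/min-redundancy score.

    relevance[f]: precomputed dependency of feature f with the label set (D);
    redundancy[f][g]: dependency between features f and g. At each step the
    remaining feature maximizing theta = D - R is added, where R is the
    average redundancy with the already selected subset.
    """
    remaining = list(range(len(relevance)))
    selected = []
    while len(selected) < m:
        def theta(f):
            r = np.mean([redundancy[f][g] for g in selected]) if selected else 0.0
            return relevance[f] - r
        best = max(remaining, key=theta)
        selected.append(best)
        remaining.remove(best)
    return selected

rel = [0.9, 0.85, 0.2]
red = [[0.0, 0.8, 0.1],
       [0.8, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
print(greedy_mdmr(rel, red, 2))  # feature 1 is relevant but redundant with 0
```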
An ML-FS approach that selects salient features based on ML neighborhood MI is proposed in Lin et al.’s (2016) study.
At first, all instances are granulated under different labels using the margin of instances, and three different neighborhood
MIs for ML learning are defined. Then, an optimization objective function is introduced to measure the quality of candidate features.
A granular ML-FS method with a maximal relevance and minimal redundancy measure is proposed in Li et al.’s (2017)
study that considers the local label correlations. At first, the labels are abstracted into some information granules based on
their correlations. Then, it selects the feature that has the greatest correlation with each label in the information granule while
having minimal correlation with already chosen features.
4 | SETUP AND BENCHMARKING
In this section, issues related to comparing the performance of algorithms are discussed. To this end, the standard ML data-
sets and evaluation measures are reviewed. Next, the most frequently used nonparametric tests are investigated. Finally, various software tools developed for the ML task are introduced.
• Example-based measures
FIGURE 7 Diagram of Nemenyi’s post-hoc test in terms of (a) accuracy and (b) hamming loss criteria. Methods which are not significantly different based
on the critical difference (CD) (at α = .05) are connected
1. Hamming loss
Hamming loss calculates the percentage of misclassified labels, that is, cases where an irrelevant label is predicted or a relevant label is missed (Cherman, Spolaôr, Valverde-Rebaza, & Monard, 2015).
1 X j Yi ΔZi j
p
Hamming lossðh, T Þ = , ð8Þ
p i=1 j L j
where Δ is the symmetric difference between two sets. Hamming loss calculates the percentage of labels whose relevance is
not predicted correctly.
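For concreteness, Eq. (8) can be computed directly from label sets. Representing each sample's relevant labels as a Python set is an implementation choice of this sketch, not something the definition prescribes:

```python
def hamming_loss(Y_true, Y_pred, labels):
    """Eq. (8): average fraction of labels whose relevance is mispredicted.
    Y_true, Y_pred are lists of sets of labels; `labels` is the full label set L.
    The ^ operator on sets is the symmetric difference (the Delta in Eq. (8))."""
    p = len(Y_true)
    return sum(len(Yi ^ Zi) / len(labels) for Yi, Zi in zip(Y_true, Y_pred)) / p
```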
2. Subset accuracy
Subset accuracy calculates the ratio of correctly classified test samples to the cardinality of the test set. The predicted label set must be identical to the true label set for a sample to be considered correctly classified, which makes this measure strict. Subset accuracy is the ML counterpart of the SL accuracy measure:

$$\text{Subset accuracy}(h) = \frac{1}{p}\sum_{i=1}^{p}\big[Z_i = Y_i\big], \qquad (9)$$

where [π] equals 1 if predicate π holds and 0 otherwise.
3. Accuracy
This measure calculates the proportion of correctly predicted labels among all true and predicted labels:

$$\text{Accuracy}(h, T) = \frac{1}{p}\sum_{i=1}^{p}\frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}. \qquad (10)$$

Accuracy is a more balanced criterion and a better representative of an algorithm's actual predictive performance than Hamming loss for most classification problems (El Kafrawy et al., 2015).
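Eqs. (9) and (10) can be sketched the same way (again with one label set per sample; the function names are ours):

```python
def subset_accuracy(Y_true, Y_pred):
    """Eq. (9): fraction of samples whose predicted label set matches exactly."""
    return sum(Zi == Yi for Yi, Zi in zip(Y_true, Y_pred)) / len(Y_true)

def accuracy(Y_true, Y_pred):
    """Eq. (10): mean Jaccard overlap |Yi & Zi| / |Yi | Zi| over samples."""
    return sum(len(Yi & Zi) / len(Yi | Zi) for Yi, Zi in zip(Y_true, Y_pred)) / len(Y_true)
```

On the same prediction, accuracy gives partial credit for overlapping label sets while subset accuracy does not, which is why subset accuracy is described as strict.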
4. One error
This measure counts how often the top-ranked label is not relevant:

$$\text{one-error}(f) = \frac{1}{p}\sum_{i=1}^{p}\Big[\arg\max_{y\in Y} f(x_i, y) \notin Y_i\Big]. \qquad (11)$$
5. Coverage
Coverage evaluates the average number of steps needed to move down the ranked label list to cover all the relevant labels of a sample:

$$\text{Coverage}(f) = \frac{1}{p}\sum_{i=1}^{p}\max_{y\in Y_i} \text{rank}_f(x_i, y) - 1, \qquad (12)$$

where rank_f(x_i, y) denotes the rank of y in Y according to the descending order induced by f.
6. Ranking loss
Ranking loss counts the average fraction of reversely ordered label pairs, that is, cases where an irrelevant label is ranked higher than a relevant label:

$$\text{Ranking loss}(f) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i||\bar{Y}_i|}\left|\left\{(y', y'') \mid f(x_i, y') \le f(x_i, y''),\ (y', y'') \in Y_i \times \bar{Y}_i\right\}\right|, \qquad (13)$$

where $\bar{Y}_i$ denotes the complement of $Y_i$ in the label set.
7. Average precision
The average precision determines the average fraction of relevant labels that are ranked above a particular relevant label y ∈ Y_i:

$$\text{Avg-pre}(f) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i|}\sum_{y\in Y_i}\frac{\left|\left\{y' \mid \text{rank}_f(x_i, y') \le \text{rank}_f(x_i, y),\ y' \in Y_i\right\}\right|}{\text{rank}_f(x_i, y)}. \qquad (14)$$
• Label-based metrics
For the label y_j, the number of true positive (TP_j), true negative (TN_j), false positive (FP_j), and false negative (FN_j) test samples can be calculated as:

$$TP_j = \left|\left\{x_i \mid y_j \in Y_i \wedge y_j \in h(x_i),\ 1 \le i \le p\right\}\right|, \qquad (15)$$
$$TN_j = \left|\left\{x_i \mid y_j \notin Y_i \wedge y_j \notin h(x_i),\ 1 \le i \le p\right\}\right|, \qquad (16)$$
$$FP_j = \left|\left\{x_i \mid y_j \notin Y_i \wedge y_j \in h(x_i),\ 1 \le i \le p\right\}\right|, \qquad (17)$$
$$FN_j = \left|\left\{x_i \mid y_j \in Y_i \wedge y_j \notin h(x_i),\ 1 \le i \le p\right\}\right|. \qquad (18)$$

Based on these definitions, TP_j + TN_j + FP_j + FN_j = p. Accuracy, Precision, Recall, and F_β are four classification metrics defined using TP_j, TN_j, FP_j, and FN_j as:
$$\text{Accuracy}\left(TP_j, TN_j, FP_j, FN_j\right) = \frac{TP_j + TN_j}{TP_j + TN_j + FP_j + FN_j}, \qquad (19)$$
$$\text{Precision}\left(TP_j, TN_j, FP_j, FN_j\right) = \frac{TP_j}{TP_j + FP_j}, \qquad (20)$$
$$\text{Recall}\left(TP_j, TN_j, FP_j, FN_j\right) = \frac{TP_j}{TP_j + FN_j}, \qquad (21)$$
$$F^{\beta}\left(TP_j, TN_j, FP_j, FN_j\right) = \frac{\left(1 + \beta^2\right) TP_j}{\left(1 + \beta^2\right) TP_j + \beta^2\, FN_j + FP_j}. \qquad (22)$$
If B(TP_j, TN_j, FP_j, FN_j) denotes one of these four functions, the label-based classification measures, including macro-averaging and micro-averaging, are defined as follows (q is the number of labels):

1. Macro-averaging

$$B_{macro}(h) = \frac{1}{q}\sum_{j=1}^{q} B\left(TP_j, TN_j, FP_j, FN_j\right). \qquad (23)$$
2. Micro-averaging

$$B_{micro}(h) = B\left(\sum_{j=1}^{q} TP_j,\ \sum_{j=1}^{q} TN_j,\ \sum_{j=1}^{q} FP_j,\ \sum_{j=1}^{q} FN_j\right). \qquad (24)$$
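Eqs. (15)–(24) translate directly into code: the per-label counts feed either the macro average (average of per-label scores) or the micro average (score of the summed counts). The set representation and function names below are illustrative; precision stands in for any of Eqs. (19)–(22):

```python
def counts(Y_true, Y_pred, label):
    # Eqs. (15)-(18): per-label confusion counts over the p test samples
    tp = sum(label in Yi and label in Zi for Yi, Zi in zip(Y_true, Y_pred))
    tn = sum(label not in Yi and label not in Zi for Yi, Zi in zip(Y_true, Y_pred))
    fp = sum(label not in Yi and label in Zi for Yi, Zi in zip(Y_true, Y_pred))
    fn = sum(label in Yi and label not in Zi for Yi, Zi in zip(Y_true, Y_pred))
    return tp, tn, fp, fn

def precision(tp, tn, fp, fn):
    # Eq. (20), used here as the plug-in function B
    return tp / (tp + fp)

def macro(B, Y_true, Y_pred, labels):
    # Eq. (23): average B over the per-label counts
    return sum(B(*counts(Y_true, Y_pred, l)) for l in labels) / len(labels)

def micro(B, Y_true, Y_pred, labels):
    # Eq. (24): B applied once to the counts summed over all labels
    sums = [counts(Y_true, Y_pred, l) for l in labels]
    return B(*(sum(col) for col in zip(*sums)))
```

Macro-averaging weights every label equally, while micro-averaging weights every sample-label decision equally, so frequent labels dominate the micro scores.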
Also, two label-based ranking metrics, AUCmacro and AUCmicro, are defined as:
3. AUCmacro

$$AUC_{macro} = \frac{1}{q}\sum_{j=1}^{q}\frac{\left|\left\{(x', x'') \mid f(x', y_j) \ge f(x'', y_j),\ (x', x'') \in Z_j \times \bar{Z}_j\right\}\right|}{|Z_j||\bar{Z}_j|}, \qquad (25)$$

where Z_j = {x_i | y_j ∈ Y_i, 1 ≤ i ≤ p} is the set of test samples with label y_j and $\bar{Z}_j$ is its complement, that is, the set of test samples without label y_j.
4. AUCmicro

$$AUC_{micro} = \frac{\left|\left\{\left((x', y'), (x'', y'')\right) \mid f(x', y') \ge f(x'', y''),\ (x', y') \in S^{+},\ (x'', y'') \in S^{-}\right\}\right|}{|S^{+}||S^{-}|}, \qquad (26)$$

where S⁺ = {(x_i, y) | y ∈ Y_i, 1 ≤ i ≤ p} is the set of relevant sample-label pairs and S⁻ = {(x_i, y) | y ∉ Y_i, 1 ≤ i ≤ p} is the set of irrelevant sample-label pairs.
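Eqs. (25) and (26) both estimate the probability that a positive instance scores at least as high as a negative one, per label for the macro variant and over all sample-label pairs for the micro variant. A minimal sketch, again using per-sample score dictionaries as an illustrative representation:

```python
def auc_macro(score_list, Y_true, labels):
    # Eq. (25): per-label fraction of correctly ordered (positive, negative)
    # sample pairs, averaged over the q labels
    total = 0.0
    for l in labels:
        pos = [s[l] for s, Yi in zip(score_list, Y_true) if l in Yi]
        neg = [s[l] for s, Yi in zip(score_list, Y_true) if l not in Yi]
        total += sum(a >= b for a in pos for b in neg) / (len(pos) * len(neg))
    return total / len(labels)

def auc_micro(score_list, Y_true, labels):
    # Eq. (26): the same comparison pooled over all relevant (S+) versus
    # irrelevant (S-) sample-label pairs
    S_pos = [s[y] for s, Yi in zip(score_list, Y_true) for y in Yi]
    S_neg = [s[y] for s, Yi in zip(score_list, Y_true) for y in labels - Yi]
    return sum(a >= b for a in S_pos for b in S_neg) / (len(S_pos) * len(S_neg))
```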
Among the mentioned example-based measures, smaller values indicate better performance for all criteria except average precision and accuracy. Also, all measures are normalized to a number between 0 and 1 except coverage. Moreover, for all the mentioned label-based metrics, a larger value indicates better performance, with an optimal value of 1.
Besides the given metrics, another parameter, called average feature reduction, F_r, can be defined to evaluate and compare feature selection methods by their rate of feature reduction:

$$F_r = \frac{M - r}{M}, \qquad (27)$$

where M is the total number of features and r is the number of features selected by the FS algorithm. The closer F_r is to 1, the more features are reduced and the lower the classifier's complexity (Kashef & Nezamabadi-pour, 2015).
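Eq. (27) is straightforward; for example, keeping r = 10 of M = 100 features gives F_r = 0.9, that is, a 90% reduction:

```python
def feature_reduction(M, r):
    # Eq. (27): fraction of the original M features removed when r are kept
    return (M - r) / M
```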
Several R packages are also developed for different purposes: the utiml1 package for ML learning, the MLPUGS2 package for ML prediction using Gibbs sampling and classifier chains, the mldr3 package (Charte & Charte, 2015) for exploratory data analysis and manipulation of ML data, and the mldr.datasets4 package (Charte, Charte, Rivera, del Jesus, & Herrera, 2016), the R ultimate ML dataset repository.
Also, two exclusive ML libraries, MEKA5 (Read, Reutemann, Pfahringer, & Holmes, 2016) and MULAN6 (Tsoumakas, Spyromitros-Xioufis, Vilcek, & Vlahavas, 2011), both based on WEKA7 (Hall et al., 2009), have been introduced.
MEKA is an open-source Java framework based on the famous WEKA library. It contains all the basic problem transformation methods, such as different varieties of classifier chains (Read et al., 2011), and many of the advanced methods investigated by Madjarov, Kocev, Gjorgjevikj, and Džeroski (2012), as well as some algorithm adaptation methods such as ML neural networks. It also provides various evaluation criteria and tools for ML experiments and development. Note that MEKA offers both a command line interface (CLI) and a graphical user interface (GUI). Its associated repository8 contains more than 20 ML datasets in ARFF format.
MULAN is an open-source Java package for ML learning based on WEKA. It offers only a programmatic API to the library users; there is no GUI. MULAN contains many state-of-the-art ML learning, label ranking, and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. It also offers an evaluation framework that calculates ML evaluation measures through cross-validation and hold-out evaluation. Its repository contains over 25 ML datasets in ARFF format.
As a simple way to tackle the ML-FS problem is to use SL methods via data transformation approaches such as LP and BR, implementations of practical SL filter methods are helpful. To this end, fspackage9 (Liu, 2010), a WEKA-based package containing most filter methods such as IG, FCBF, CFS, and chi-square, is a proper choice. This package calls WEKA from Matlab, that is, WEKA must be installed before using the functions. Complete explanations of how to call each function are given in comments at its beginning.
There also exists some general-purpose software that manages ML data as part of its functionality. Clus10 is a decision tree system that implements the predictive clustering framework and can be used for hierarchical ML classification. LibSVM11 (Chang & Lin, 2011) is integrated software for SVMs that can be applied to ML data using the binary relevance transformation.
KEEL (Alcalá-Fdez et al., 2011) is an open-source data mining software tool that contains many algorithms for different knowledge discovery tasks. Moreover, there is a dataset repository associated with this software that contains some ML datasets. The file format of these datasets is ARFF-based, with specific header fields indicating whether each attribute is a label or not. Besides, this software provides a complete set of statistical procedures, both parametric and nonparametric, for pairwise comparison of algorithms.
5 | GUIDING EXPERIMENTS
In this section, some experimental results are presented. First, the results of different evaluation measures for the most popular datasets with all features are shown in Table 2, which can be used as a baseline to evaluate the goodness of ML-FS methods. Afterward, two series of ML-FS methods are tested and compared to draw useful conclusions. The first group of methods uses the problem transformation strategy, and the second group consists of algorithm adaptation-based methods. Methods of the first group are based on SL filter methods, using the two standard problem transformation approaches LP and BR (in the external form, according to Figure 5).
The filter methods include ReliefF, IG, F-score, chi-square, FCBF, and CFS. Four of these methods, BR-IG, BR-RF, LP-IG, and LP-RF, which are based on the ReliefF and IG methods, were initially presented in Spolaôr et al.'s (2013a) study. In this experiment, the algorithms that use the LP transformation consider label dependency, while those that use the BR transformation do not. We aim to explore the role of considering label correlations in ML-FS.
As mentioned in the introduction, some methods provide an ordered ranking of features, so a threshold is needed to select the features whose scores exceed it, while other methods produce a candidate subset of features directly. All the above filter methods are ranking methods except FCBF and CFS. Because defining a proper threshold for each method is not easy, and for the sake of fairness, we decide the number of features for ranking-based methods according to the following rules, inspired by the paper of Bolón-Canedo, Sánchez-Maroño, and Alonso-Betanzos (2016) and our previous experience.
We selected PMU and MDMR as the algorithms of the second group. Both of these algorithm adaptation filter methods use a greedy search to find the optimal feature subset, that is, features are selected one by one according to their calculated scores at each step. Unlike the MDMR algorithm, PMU takes the correlation among labels into account. Here again, the above rules of thumb are used to define the number of features to be selected.
For all of the compared algorithms, the results are averaged over 20 independent runs on each dataset, and ML-kNN is used as the classifier (k = 10). In each experiment, 60% of the samples are chosen randomly for training and the remaining 40% are used for evaluating the performance of the methods. In PMU, numeric features are discretized into two bins using an equal-width strategy, as suggested by Lee and Kim (2013), while nominal features are left untouched.
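The two-bin equal-width discretization used for PMU can be sketched as below; this is one plausible reading of the strategy (split each numeric column at the midpoint of its range), not necessarily the exact procedure of Lee and Kim (2013):

```python
import numpy as np

def equal_width_two_bins(X):
    """Discretize each numeric column into two equal-width bins:
    values below the column midpoint (min + max) / 2 map to 0, others to 1."""
    X = np.asarray(X, dtype=float)
    mid = (X.min(axis=0) + X.max(axis=0)) / 2
    return (X >= mid).astype(int)
```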
TABLE 2 Baseline results for the most frequently used multilabel datasets using the ML-kNN classifier

Table 3 reports the results of the 12 problem transformation algorithms and the two algorithm adaptation methods in terms of accuracy, Hamming loss, and feature reduction criteria. The best results are highlighted in bold. For easier comparison, we repeat the baseline results in the last column of that table. Comparing the results shows an improvement in most of the datasets after applying feature selection. For instance, the accuracy on the Medical dataset rises from 0.528 to 0.635 with LP-F-score, an improvement of more than 10%.
Also, about 90% of the features are eliminated by this method. Although some FS methods degrade classifier performance compared to the baseline, the feature reduction ratio should not be neglected. For example, on the Image dataset the accuracy measure decreases by 0.048 with the MDMR method, while almost 70% of the features are eliminated, which reduces the computational complexity significantly.
In order to assess the statistical significance of the results, we employed Friedman's test and Nemenyi's post-hoc procedure at a confidence level of α = .05. Based on the results of Table 3, the average ranks obtained by comparing the methods through the Friedman 1×N statistical test are presented in Table 4. These results were obtained using the KEEL software described previously. Numbers in brackets are the ranks obtained by each algorithm among the others; a lower rank indicates superiority. As the number of features selected by the ranking-based methods is determined by the user, their feature-reduction ranks are identical.
According to this table, BR-CFS obtains the first rank on both the accuracy and Hamming loss criteria. On the accuracy measure, BR-FCBF and LP-RF take second and third place, whereas their order is reversed for Hamming loss. The p values obtained from the Friedman test for accuracy and Hamming loss are 0.014525 and 0.001209, respectively, showing a significant difference among the compared methods.
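The average ranks in Table 4 and the Friedman statistic behind these p values can in principle be reproduced as follows; this minimal sketch ignores tied values (which KEEL's implementation handles by average ranks) and omits the p-value computation:

```python
import numpy as np

def average_ranks(results, higher_is_better=True):
    """Rank the methods within each dataset (row), then average over datasets.
    `results` is an (n_datasets, n_methods) array; rank 1 = best."""
    r = np.asarray(results, dtype=float)
    if higher_is_better:
        r = -r
    # argsort twice yields 0-based ranks; +1 makes them 1-based (ties ignored)
    ranks = r.argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)

def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic from the average ranks (Demšar, 2006):
    chi2_F = 12p / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4)."""
    k = len(avg_ranks)
    R = np.asarray(avg_ranks, dtype=float)
    return 12 * n_datasets / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
```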
As the null hypothesis is rejected for both measures, Nemenyi's post-hoc test is employed to discover significant differences among the methods. Figure 7 shows the corresponding diagrams for the accuracy and Hamming loss measures. The methods' ranks are displayed on the axis, with the top-ranked methods at the rightmost side of the diagram. When the difference between two methods' average ranks exceeds the critical distance (CD), the methods are significantly different; methods that are not significantly different are connected by a line. According to these diagrams, the ordering of the methods matches what Table 4 presents. Also, BR-CFS is significantly better than PMU on both criteria. Moreover, although no significant difference can be established among the remaining methods, a lower rank still indicates a better method.
A challenging observation from these diagrams is that methods that consider label dependency, such as PMU and the LP-based methods, do not perform better than the others. The diagrams also show that, on average, methods using the problem transformation strategy outperform algorithm adaptation methods for ML-FS. However, these conclusions rest only on the compared algorithms, and more experiments are needed to confirm or reject these inferences.
6 | DISCUSSION AND FUTURE DIRECTION
In this section, we provide an analysis of the reviewed papers and a discussion of the overlooked issues and challenges in the field of ML-FS.
TABLE 3 Experimental results of comparing methods averaged on 20 independent runs on 10 multilabel datasets (Acc = accuracy; HL = Hamming loss; Fr = feature reduction; baseline in the last column)

Dataset   Meas.  BR-RF  BR-IG  BR-FCBF BR-FSCORE BR-CHI BR-CFS LP-RF  LP-IG  LP-FCBF LP-FSCORE LP-CHI LP-CFS MDMR   PMU    Baseline
Flag      Acc    0.512  0.499  0.494   0.503     0.508  0.512  0.520  0.496  0.486   0.505     0.476  0.483  0.508  0.511  0.503
          HL     0.319  0.32   0.329   0.326     0.318  0.325  0.306  0.336  0.327   0.325     0.344  0.339  0.326  0.327  0.328
          Fr     0.600  0.600  0.365   0.600     0.600  0.235  0.600  0.600  0.900   0.600     0.600  0.900  0.600  0.600  0
cal500    Acc    0.196  0.195  0.195   0.192     0.196  0.192  0.196  0.192  0.196   0.195     0.191  0.195  0.196  0.191  0.197
          HL     0.14   0.14   0.139   0.139     0.139  0.14   0.139  0.139  0.138   0.139     0.139  0.139  0.139  0.140  0.139
          Fr     0.602  0.602  0.432   0.602     0.602  0.324  0.602  0.602  0       0.602     0.602  0      0.602  0.602  0
Emotions  Acc    0.391  0.392  0.381   0.4       0.387  0.402  0.389  0.409  0.358   0.377     0.408  0.381  0.318  0.297  0.311
          HL     0.254  0.259  0.258   0.259     0.259  0.254  0.257  0.25   0.272   0.26      0.247  0.261  0.270  0.281  0.27
          Fr     0.597  0.597  0.631   0.597     0.597  0.329  0.597  0.597  0.956   0.597     0.597  0.808  0.597  0.597  0
Yeast     Acc    0.479  0.488  0.495   0.486     0.489  0.505  0.486  0.419  0.348   0.484     0.417  0.347  0.491  0.477  0.501
          HL     0.202  0.199  0.2     0.201     0.2    0.195  0.201  0.217  0.233   0.201     0.218  0.233  0.199  0.204  0.198
          Fr     0.699  0.699  0.656   0.699     0.699  0.278  0.699  0.699  0.99    0.699     0.699  0.99   0.699  0.699  0
Scene     Acc    0.548  0.536  0.613   0.524     0.539  0.647  0.556  0.535  0.53    0.528     0.528  0.623  0.4485 0.493  0.658
          HL     0.113  0.117  0.098   0.118     0.116  0.093  0.114  0.116  0.116   0.118     0.117  0.098  0.129  0.121  0.09
          Fr     0.7    0.7    0.774   0.7       0.7    0.185  0.7    0.7    0.936   0.7       0.7    0.579  0.7    0.7    0
Image     Acc    0.405  0.414  0.425   0.398     0.419  0.489  0.43   0.429  0.316   0.402     0.428  0.441  0.412  0.390  0.46
          HL     0.903  0.9    0.897   0.905     0.9    0.883  0.897  0.897  0.924   0.904     0.897  0.894  0.901  0.905  0.889
          Fr     0.700  0.700  0.868   0.700     0.700  0.368  0.700  0.700  0.968   0.700     0.700  0.742  0.700  0.700  0
Slashdot  Acc    0.799  0.794  0.795   0.796     0.794  0.794  0.795  0.797  0.807   0.8       0.795  0.809  0.792  0.655  0.798
          HL     0.017  0.017  0.017   0.018     0.018  0.017  0.017  0.017  0.018   0.018     0.018  0.018  0.018  0.018  0.017
          Fr     0.899  0.899  0.873   0.899     0.899  0.841  0.899  0.899  0.999   0.899     0.899  0.999  0.899  0.899  0
Genbase   Acc    0.931  0.925  0.94    0.937     0.935  0.933  0.931  0.943  0.909   0.931     0.935  0.927  0.925  0.889  0.928
          HL     0.005  0.006  0.005   0.005     0.005  0.005  0.005  0.004  0.008   0.005     0.005  0.006  0.006  0.009  0.005
          Fr     0.899  0.899  0.97    0.899     0.899  0.943  0.899  0.899  0.991   0.899     0.899  0.99   0.899  0.899  0
Medical   Acc    0.551  0.587  0.59    0.604     0.611  0.578  0.567  0.568  0.552   0.635     0.568  0.595  0.573  0.463  0.528
          HL     0.015  0.014  0.014   0.014     0.014  0.015  0.015  0.015  0.016   0.013     0.015  0.014  0.016  0.017  0.016
          Fr     0.899  0.899  0.853   0.899     0.899  0.811  0.899  0.899  0.993   0.899     0.899  0.99   0.899  0.899  0
Enron     Acc    0.361  0.202  0.369   0.224     0.217  0.331  0.351  0.265  0.161   0.215     0.267  0.172  0.365  0.346  0.293
          HL     0.05   0.057  0.051   0.058     0.058  0.052  0.051  0.054  0.061   0.058     0.054  0.061  0.051  0.052  0.053
          Fr     0.900  0.900  0.886   0.900     0.900  0.674  0.900  0.900  0.999   0.900     0.900  0.999  0.900  0.900  0
TABLE 4 Average rankings of the algorithms obtained for each evaluation measure by performing the Friedman test

Method      Accuracy    Hamming loss  Feature reduction
BR-RF       6.35 [5]    6.10 [4]      7.70 [3]
BR-IG       8.20 [9]    7.15 [7]      7.70 [3]
BR-FCBF     5.50 [2]    5.05 [3]      8.60 [4]
BR-FSCORE   7.45 [7]    8.35 [10]     7.70 [3]
BR-CHI      6.10 [4]    6.60 [6]      7.70 [3]
BR-CFS      4.95 [1]    4.80 [1]      12.80 [5]
LP-RF       5.55 [3]    4.95 [2]      7.70 [3]
LP-IG       6.85 [6]    6.35 [5]      7.70 [3]
LP-FCBF     10.40 [11]  10.75 [13]    2.45 [1]
LP-FSCORE   7.80 [8]    7.90 [8]      7.70 [3]
LP-CHI      8.60 [10]   7.95 [9]      7.70 [3]
LP-CFS      7.60 [7]    8.70 [11]     4.15 [2]
MDMR        8.20 [9]    8.80 [12]     7.70 [3]
PMU         11.45 [12]  11.55 [14]    7.70 [3]
A challenging concern in ML classification and feature selection is taking the correlation among labels into consideration. Undoubtedly, label dependency is a critical issue in ML classification, as the unknown labels of an instance may be derived from its known labels through label correlation. However, there is controversy over this issue in ML-FS methods. Although many papers confirm its positive impact on the feature selection process (Lee & Kim, 2013; Lee & Kim, 2015a; Qiao et al., 2017), several papers claim that considering label dependency has no effect on the results (Gharroudi et al., 2014). We also observed in our experiments that methods considering label correlation do not perform better than the others; however, since our experiments are limited to a handful of methods, we cannot claim this with certainty, and wider experiments are needed. Furthermore, it is noteworthy that in some ML-FS methods that claim to consider label dependency, it is the employed learning algorithm that considers the correlation among labels, not the feature selection method itself (Doquire & Verleysen, 2013a).
Another controversial issue is whether to choose the problem transformation or the algorithm adaptation strategy. There is still no definite answer to this question, but it seems that algorithm adaptation methods perform better for ML learning algorithms. An analogy may clarify the issue: in a hydroelectric power plant, not all potential energy is transformed into useful energy (electricity); likewise, transforming an ML problem can lose some of the information in the original data. However, based on the obtained results, we cannot generalize this assumption to ML-FS. Obviously, more experiments are also necessary in this regard.
There are also other important issues in both SL- and ML-FS methods, including stability and scalability, that are usually neglected. Most of the time, supervised feature selection methods are evaluated only by their classification accuracy. However, a good FS method is one that is stable in addition to having high accuracy (He & Yu, 2010). Stability of a feature selection method is the consistency with which it finds the same subset of features when some training instances are removed or new training instances are added (Chandrashekar & Sahin, 2014). A method is unreliable for feature selection if it finds a different feature subset for any perturbation of the training data. The next issue is the scalability of FS methods, which may be jeopardized in the case of large-scale data. Scalability is the effect that an increase in the size of the training set has on the computational performance of a method: training time, accuracy, and allocated memory (Bolón-Canedo et al., 2016). A good FS algorithm is one that can find a balance among them. Normally, large-scale datasets cannot be loaded directly into memory, which confines the usability of most feature selection methods (Li et al., 2016). Recently, distributed computing protocols and frameworks such as MPI and MapReduce have been proposed to perform parallel feature selection for very large-scale datasets.
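Stability, as defined above, can be quantified in several ways; one common illustrative choice is the average pairwise Jaccard similarity between the feature subsets selected under different perturbations of the training data:

```python
def stability(subsets):
    """Average pairwise Jaccard similarity between feature subsets selected
    on perturbed versions of the training data; 1.0 means fully stable."""
    sets = [set(s) for s in subsets]
    n = len(sets)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(len(sets[i] & sets[j]) / len(sets[i] | sets[j]) for i, j in pairs) / len(pairs)
```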
Another open and challenging problem in both SL- and ML-FS algorithms is how to determine the optimal number of selected features. For most FS methods, the number of features to be selected must be specified by the user. However, the optimal number of features is usually unknown and differs across datasets. On the one hand, selecting too many features increases the risk of including irrelevant, noisy, and redundant features. On the other hand, selecting too few may eliminate relevant features that should be part of the final subset. Therefore, it is more desirable for an FS algorithm to decide the size of the final feature subset automatically.
It was also observed that filter methods are employed more often than wrapper and embedded methods in the publications. Although filter methods are the most appropriate when dealing with a very large number of features, the higher accuracy of wrapper and embedded methods should not be ignored. A good compromise may be to employ a filter method to eliminate irrelevant features and then a wrapper or embedded method to select the most salient ones (Kashef & Nezamabadi-pour, 2017). Also, some SL filter approaches have still not been utilized for the ML task, in either problem transformation or algorithm adaptation form.
As discussed in Section 3.2, feature selection methods are divided into three categories from the label perspective: supervised, semisupervised, and unsupervised methods. All of these strategies have been widely discussed in SL feature selection, but the last two are poorly investigated in the ML task. Especially, to the best of our knowledge, there is no publication on unsupervised ML-FS methods.
Apart from the mentioned issues, the results of several evaluation measures on 12 popular datasets with all features, using the ML-kNN classifier, are reported in Table 2 as a baseline. These results can serve as a basis for comparing feature selection methods. It is desirable that an ML-FS method obtain better results using fewer features (Kashef & Nezamabadi-pour, 2013). However, sometimes obtaining the values of a specific feature is so costly that eliminating it is preferable even at the price of a higher classification error. Therefore, the main objective of feature selection is to simplify a dataset by reducing its dimensionality and identifying the relevant underlying features without degrading predictive accuracy (Kashef & Nezamabadi-pour, 2013). Of course, the performance of the classifier is usually also improved after feature selection.
7 | CONCLUSION
Feature selection has been a hot topic and an active field of research in machine learning. In this paper, we have reviewed previous work on the ML-FS problem. We categorize ML-FS methods from different perspectives: (a) the label perspective, containing supervised, semisupervised, and unsupervised methods; (b) the search strategy perspective, containing exhaustive, sequential, and randomized methods; (c) interaction with the learning algorithm, containing filter, wrapper, and embedded methods; and (d) the data model perspective, containing problem transformation and algorithm adaptation methods. Each strategy is described, and its methods in the ML (and, to some extent, SL) task are introduced.
Also, by analyzing the existing papers, several challenges and open issues are discussed that can be pursued in future work.
ACKNOWLEDGMENT
The authors would like to thank Professor Min-Ling Zhang and Dr. Newton Spolaôr, and also the anonymous reviewers and the editor for their helpful and constructive comments.
CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.
NOTES
1 https://CRAN.R-project.org/package=utiml.
2 https://CRAN.R-project.org/package=MLPUGS.
3 https://CRAN.R-project.org/package=mldr.
4 https://CRAN.R-project.org/package=mldr.datasets.
5 http://meka.sourceforge.net/.
6 http://mulan.sourceforge.net/.
7 http://www.cs.waikato.ac.nz/ml/weka/.
8 http://mulan.sourceforge.net/datasets-mlc.html.
9 http://featureselection.asu.edu/old/software.php.
10 http://clus.sourceforge.net.
11 https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
REFERENCES
Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sanchez, L., & Herrera, F. (2011). KEEL data-mining software tool: data set repository, integration
of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing, 17, 255–287.
Ang, J. C., Haron, H., & Hamed, H. N. A., (Eds). (2015). Semi-supervised SVM-based feature selection for cancer classification using microarray gene expression
data. Paper presented at the meeting of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Cham:
Springer.
Banerjee, M., & Pal, N. R. (2014). Feature selection with SVD entropy: Some modification and extension. Information Sciences, 264, 118–134.
Barani, F., Mirhosseini, M., & Nezamabadi-pour, H. (2017). Application of binary quantum-inspired gravitational search algorithm in feature subset selection. Applied
Intelligence, 40, 1–15.
Barkia, H., Elghazel, H., & Aussem, A., (Eds). (2011). Semi-supervised feature importance evaluation with ensemble learning. Paper presented at the meeting of the
Data Mining (ICDM), 2011 I.E. 11th International Conference; IEEE.
Bellal, F., Elghazel, H., & Aussem, A. (2012). A semi-supervised feature ranking method with ensemble learning. Pattern Recognition Letters, 33(10), 1426–1433.
Bermingham, M. L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., … Navarro, P. (2015). Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Scientific Reports, 5, 1–12.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2013). A review of feature selection methods on synthetic data. Knowledge and Information Systems, 34(3), 483–519.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2016). Feature selection for high-dimensional data. Progress in Artificial Intelligence, 5(2), 65–75.
Bolón-Canedo, V., Sánchez-Marono, N., Alonso-Betanzos, A., Benítez, J. M., & Herrera, F. (2014). A review of microarray datasets and applied feature selection
methods. Information Sciences, 282, 111–135.
Boutell, M. R., Luo, J., Shen, X., & Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757–1771.
Brassard, G., & Bratley, P. (1996). Fundamentals of algorithms. New Jersey: Prentice Hall.
Cai, D., Zhang, C., & He, X., (Eds.) (2010). Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining; ACM.
Carmona-Cejudo, J. M., Baena-García, M., del Campo-Avila, J., & Morales-Bueno, R., (Eds). (2011). Feature extraction for multi-label learning in the domain of
email classification. Paper presented at the meeting of the Computational Intelligence and Data Mining (CIDM), 2011 I.E. Symposium; IEEE.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16–28.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
Chang, X., Nie, F., Yang, Y., & Huang, H. (2014). A convex formulation for semi-supervised multi-label feature selection. Paper presented at the AAAI, Québec,
Canada.
Charte, F., & Charte, D. (2015). Working with multilabel datasets in R: The mldr package. R J., 7(2), 149–162.
Charte, F., Charte, D., Rivera, A., del Jesus, M. J., & Herrera, F., (Eds). (2016). R ultimate multilabel dataset repository. Paper presented at the meeting of the Interna-
tional Conference on Hybrid Artificial Intelligence Systems; Springer.
Chen, W., Yan, J., Zhang, B., Chen, Z., & Yang, Q., (Eds). (2007). Document transformation for multi-label feature selection in text categorization. Paper presented
at the meeting of the Data Mining, 2007 ICDM 2007 Seventh IEEE International Conference; IEEE.
Cheng, H., Deng, W., Fu, C., Wang, Y., & Qin, Z. (2011). Graph-based semi-supervised feature selection with application to automatic spam image identification.
Computer Science for Environmental Engineering and EcoInformatics, 159, 259–264.
Cherman, E. A., Metz, J., & Monard, M. C., (Eds). (2010). A simple approach to incorporate label dependency in multi-label classification. Paper presented at the
meeting of the Mexican International Conference on Artificial Intelligence; Springer.
Cherman, E. A., Monard, M. C., & Metz, J. (2011). Multi-label problem transformation methods: A case study. CLEI Electronic Journal, 14(1), 4.
Cherman, E. A., Spolaôr, N., Valverde-Rebaza, J., & Monard, M. C. (2015). Lazy multi-label learning algorithms based on mutuality strategies. Journal of Intelligent & Robotic Systems, 80(1), 261–276.
Chiang, T.-H., Lo, H.-Y., & Lin, S.-D. (2012). A ranking-based KNN approach for multi-label classification. Paper presented at the meeting of the ACML; Vol. 25:
81–96.
Chou, S., & Hsu, C.-L. (2005). MMDT: A multi-valued and multi-labeled decision tree classifier for data mining. Expert Systems with Applications, 28(4), 799–812.
Choudhary, A., & Saraswat, J. K. (2014). Survey on hybrid approach for feature selection. International Journal of Science and Research, 3(4), 438–439.
Clare, A., & King, R. D. (2001). Knowledge discovery in multi-label phenotype data. In: L. De Raedt & A. Siebes (Eds.), Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science (vol 2168). Berlin: Springer.
Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(1–4), 131–156.
De Comité, F., Gilleron, R., & Tommasi, M., (Eds). (2003). Learning multi-label alternating decision trees from texts and data. Paper presented at the meeting of the
International Workshop on Machine Learning and Data Mining in Pattern Recognition; Springer.
De La Iglesia, B. (2013). Evolutionary computation for feature selection in classification problems. WIREs: Data Mining and Knowledge Discovery, 3(6), 381–407.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan), 1–30.
Dendamrongvit, S., Vateekul, P., & Kubat, M. (2011). Irrelevant attributes and imbalanced classes in multi-label text-categorization domains. Intelligent Data Analysis, 15(6), 843–859.
Ding, S., Ed. (2009). Feature selection based F-score and ACO algorithm in support vector machine. Paper presented at the meeting of the Knowledge Acquisition
and Modeling, 2009 KAM'09 Second International Symposium; IEEE.
Diplaris, S., Tsoumakas, G., Mitkas, P. A., & Vlahavas, I., (2005). Protein classification with multiple algorithms. Paper presented at the meeting of the Panhellenic
Conference on Informatics, Berlin, Heidelberg.
Doak, J. (1992). CSE-92-18: An evaluation of feature selection methods and their application to computer security. UC Davis Department of Computer Science tech reports.
Doquire, G., & Verleysen, M., (Eds). (2011). Feature selection for multi-label classification problems. Paper presented at the meeting of the International
Work-Conference on Artificial Neural Networks; Springer.
Doquire, G., & Verleysen, M. (2013a). Mutual information-based feature selection for multilabel classification. Neurocomputing, 122, 148–155.
Doquire, G., & Verleysen, M. (2013b). A graph Laplacian based approach to semi-supervised feature selection for regression problems. Neurocomputing, 121, 5–13.
Duivesteijn, W., Mencía, E. L., Fürnkranz, J., & Knobbe, A., (Eds). (2012). Multi-label LeGo—Enhancing multi-label classifiers with local patterns. Paper presented
at the meeting of the International Symposium on Intelligent Data Analysis; Springer.
Ebrahimpour, M. K., & Eftekhari, M. (2017). Ensemble of feature selection methods: A hesitant fuzzy sets approach. Applied Soft Computing, 50, 300–312.
El Kafrawy, P., Mausad, A., & Esmail, H. (2015). Experimental comparison of methods for multi-label classification in different application domains. International
Journal of Computer Applications, 114(19), 406–417.
Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. Paper presented at the NIPS, Vancouver, British Columbia, Canada.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning: Springer series in statistics. Berlin: Springer.
Fürnkranz, J., Hüllermeier, E., Mencía, E. L., & Brinker, K. (2008). Multilabel classification via calibrated label ranking. Machine Learning, 73(2), 133–153.
García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational
intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10), 2044–2064.
KASHEF ET AL. 27 of 29
Gharroudi, O., Elghazel, H., & Aussem, A., (Eds). (2014). A comparison of multi-label feature selection methods using the random forest paradigm. Paper presented
at the meeting of the Canadian Conference on Artificial Intelligence; Springer.
Gu, Q., Li, Z., & Han, J. (2011). Correlated multi-label feature selection. Paper presented at the Proceedings of the 20th ACM International Conference on Information and Knowledge Management.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1), 389–422.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations
Newsletter, 11(1), 10–18.
Hall, M. A. (1999). Correlation-based feature selection for machine learning. New Zealand: The University of Waikato.
He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. Paper presented at the NIPS, Vancouver, British Columbia, Canada.
He, Z., & Yu, W. (2010). Stable feature selection for biomarker discovery. Computational Biology and Chemistry, 34(4), 215–225.
Huang, J., Li, G., Huang, Q., & Wu, X., (Eds). (2015). Learning label specific features for multi-label classification. Paper presented at the meeting of the Data Mining (ICDM), 2015 IEEE International Conference; IEEE.
Huang, J., Li, G., Huang, Q., & Wu, X. (2016). Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 28(12), 3309–3323.
Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16), 1897–1916.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). Vancouver, British Columbia, Canada: Springer.
Jing, S.-Y. (2014). A hybrid genetic algorithm for feature subset selection in rough set theory. Soft Computing, 18(7), 1373–1382.
Jungjit, S., & Freitas, A. A. (2015a). A new genetic algorithm for multi-label correlation-based feature selection. Paper presented at the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium.
Jungjit, S., & Freitas, A., (Eds). (2015b). A lexicographic multi-objective genetic algorithm for multi-label correlation based feature selection. In: Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation; ACM.
Jungjit, S., Freitas, A. A., Michaelis, M., & Cinatl, J., (Eds). (2013). Two extensions to multi-label correlation-based feature selection: A case study in bioinformatics. Paper presented at the meeting of the Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference; IEEE.
Kashef, S., & Nezamabadi-pour, H., (Eds). (2013). A new feature selection algorithm based on binary ant colony optimization. Paper presented at the meeting of the
Information and Knowledge Technology (IKT), 2013 5th Conference; IEEE.
Kashef, S., & Nezamabadi-pour, H. (2015). An advanced ACO algorithm for feature subset selection. Neurocomputing, 147, 271–279.
Kashef, S., & Nezamabadi-pour, H. (2017). An effective method of multi-label feature selection employing evolutionary algorithms. Paper presented at the meeting of the Swarm Intelligence and Evolutionary Computation (CSIEC), 2017 2nd Conference; IEEE.
Kira, K., & Rendell, L. A., (Eds). (1992). A practical approach to feature selection. In: Proceedings of the Ninth International Workshop on Machine Learning.
Kocev, D., Slavkov, I., & Dzeroski, S., (Eds). (2013). Feature ranking for multi-label classification using predictive clustering trees. International Workshop on Solving Complex Machine Learning Problems with Ensemble Methods, in Conjunction with ECML/PKDD.
Kong, D., Ding, C., Huang, H., & Zhao, H., (Eds). (2012). Multi-label ReliefF and F-statistic feature selections for image annotation. Paper presented at the meeting of the Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference; IEEE.
Kong, X., & Philip, S. Y. (2012). gMLC: A multi-label feature selection framework for graph classification. Knowledge and Information Systems, 31(2), 281–305.
Kononenko, I., Ed. (1994). Estimating attributes: Analysis and extensions of RELIEF. Paper presented at the meeting of the European Conference on Machine Learning; Springer.
Lastra, G., Luaces, O., Quevedo, J. R., & Bahamonde, A., (Eds). (2011). Graphical feature selection for multilabel classification tasks. Paper presented at the meeting of the International Symposium on Intelligent Data Analysis; Springer.
Lee, J., & Kim, D.-W. (2013). Feature selection for multi-label classification using multivariate mutual information. Pattern Recognition Letters, 34(3), 349–357.
Lee, J., & Kim, D.-W. (2015a). Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognition, 48(9), 2761–2771.
Lee, J., & Kim, D.-W. (2015b). Memetic feature selection algorithm for multi-label classification. Information Sciences, 293, 80–96.
Lee, J., Kim, H., Kim, N.-r., & Lee, J.-H. (2016). An approach for multi-label classification by directed acyclic graph with label correlation maximization. Information
Sciences, 351, 101–114.
Lee, S., Park, Y.-T., & d’Auriol, B. J. (2012). A novel feature selection method based on normalized mutual information. Applied Intelligence, 37(1), 100–120.
Li, F., Miao, D., & Pedrycz, W. (2017). Granular multi-label feature selection based on mutual information. Pattern Recognition, 67, 410–423.
Li, G.-Z., You, M., Ge, L., Yang, J. Y., & Yang, M. Q., (Eds). (2010). Feature selection for semi-supervised multi-label learning with application to gene function
analysis. In: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology; ACM.
Li, H., Li, D., Zhai, Y., Wang, S., & Zhang, J. (2016). A novel attribute reduction approach for multi-label data based on rough set theory. Information Sciences, 367,
827–847.
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., et al. (2016). Feature selection: A data perspective. arXiv preprint arXiv:1601.07996.
Li, L., Liu, H., Ma, Z., Mo, Y., Duan, Z., Zhou, J., et al., (Eds). (2014). Multi-label feature selection via information gain. Paper presented at the meeting of the International Conference on Advanced Data Mining and Applications; Springer.
Li, L., & Wang, H. (2016). Towards label imbalance in multi-label classification with many labels. arXiv preprint arXiv:1604.01304.
Lin, Y., Hu, Q., Liu, J., Chen, J., & Duan, J. (2016). Multi-label feature selection based on neighborhood mutual information. Applied Soft Computing, 38, 244–256.
Lin, Y., Hu, Q., Liu, J., & Duan, J. (2015). Multi-label feature selection based on max-dependency and min-redundancy. Neurocomputing, 168, 92–103.
Liu, H. (2010). Feature Selection at Arizona State University, Data Mining and Machine Learning Laboratory. Last accessed: October 2010.
Liu, H., Zhang, S., & Wu, X. (2014). MLSLR: Multilabel learning via sparse logistic regression. Information Sciences, 281, 310–320.
Lo, H.-Y., Lin, S.-D., & Wang, H.-M. (2014). Generalized k-labelsets ensemble for multi-label and cost-sensitive classification. IEEE Transactions on Knowledge and
Data Engineering, 26(7), 1679–1691.
Luo, Q., Chen, E., & Xiong, H. (2011). A semantic term weighting scheme for text categorization. Expert Systems with Applications, 38(10), 12708–12716.
Ma, Z., Nie, F., Yang, Y., Uijlings, J. R., & Sebe, N. (2012). Web image annotation via subspace-sparsity collaborated feature selection. IEEE Transactions on Multimedia, 14(4), 1021–1030.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition,
45(9), 3084–3104.
Makrehchi, M., & Kamel, M. S., (Eds). (2005). Text classification using small number of features. Paper presented at the meeting of the International Workshop on
Machine Learning and Data Mining in Pattern Recognition; Springer.
Mencía, E. L., Park, S.-H., & Fürnkranz, J. (2010). Efficient voting prediction for pairwise multilabel classification. Neurocomputing, 73(7), 1164–1176.
Mitra, P., Murthy, C., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(3), 301–312.
Naula, P., Airola, A., Salakoski, T., & Pahikkala, T. (2014). Multi-label learning under feature extraction budgets. Pattern Recognition Letters, 40, 56–65.
Noh, H. G., Song, M. S., & Park, S. H. (2004). An unbiased method for constructing multilabel classification trees. Computational Statistics & Data Analysis, 47(1),
149–164.
Olsson, J., & Oard, D. W., (Eds). (2006). Combining feature selectors for text classification. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management; ACM.
Park, S.-H., & Fürnkranz, J., (Eds). (2007). Efficient pairwise classification. Paper presented at the meeting of the European Conference on Machine Learning;
Springer.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Pereira, R. B., Plastino, A., Zadrozny, B., & Merschmann, L. H. (2015). Information gain feature selection for multi-label classification. Journal of Information and
Data Management, 6(1), 48.
Pereira, R. B., Plastino, A., Zadrozny, B., & Merschmann, L. H. (2016). Categorizing feature selection methods for multi-label classification. Artificial Intelligence
Review, 1–22.
Prati, R. C., & de França, F. O., (Eds). (2013). Extending features for multilabel classification with swarm biclustering. Paper presented at the meeting of the Evolutionary Computation (CEC), 2013 IEEE Congress; IEEE.
Pupo, O. G. R., Morell, C., & Soto, S. V., (Eds). (2013). ReliefF-ML: An extension of ReliefF algorithm to multi-label learning. Paper presented at the meeting of the
Iberoamerican Congress on Pattern Recognition; Springer.
Qian, B., & Davidson, I. (2010). Semi-supervised dimension reduction for multi-label classification. Paper presented at the AAAI, Atlanta, Georgia, USA.
Qiao, L., Zhang, L., Sun, Z., & Liu, X. (2017). Selecting label-dependent features for multi-label classification. Neurocomputing, 259, 112–118.
Rashedi, E., & Nezamabadi-pour, H. (2014). Feature subset selection using improved binary gravitational search algorithm. Journal of Intelligent & Fuzzy Systems,
26(3), 1211–1221.
Read, J., Ed. (2008). A pruned problem transformation method for multi-label classification. In: Proceedings of 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008).
Read, J., Bifet, A., Holmes, G., & Pfahringer, B. (2012). Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1–2),
243–272.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333–359.
Read, J., Puurula, A., & Bifet, A., (Eds). (2014). Multi-label classification with meta-labels. Paper presented at the meeting of the Data Mining (ICDM), 2014 IEEE International Conference; IEEE.
Read, J., Reutemann, P., Pfahringer, B., & Holmes, G. (2016). MEKA: A multi-label/multi-target extension to WEKA. Journal of Machine Learning Research,
17(21), 1–5.
Ren, Y., Zhang, G., Yu, G., & Li, X. (2012). Local and global structure preserving based feature selection. Neurocomputing, 89, 147–157.
Reyes, O., Morell, C., & Ventura, S. (2014). Evolutionary feature weighting to improve the performance of multi-label lazy algorithms. Integrated Computer-Aided
Engineering, 21(4), 339–354.
Reyes, O., Morell, C., & Ventura, S. (2015). Scalable extensions of the ReliefF algorithm for weighting and selecting features on the multi-label learning context. Neurocomputing, 161, 168–182.
Reyes, O., Morell, C., & Ventura, S. (2016). Effective lazy learning algorithm based on a data gravitation model for multi-label learning. Information Sciences, 340,
159–174.
Rogati, M., & Yang, Y., (Eds). (2002). High-performing feature selection for text classification. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management; ACM.
Rouhi, A., & Nezamabadi-pour, H., (Eds). (2017). A hybrid feature selection approach based on ensemble method for high-dimensional data. Paper presented at the
meeting of the Swarm Intelligence and Evolutionary Computation (CSIEC), 2017 2nd Conference; IEEE.
Shao, H., Li, G., Liu, G., & Wang, Y. (2013). Symptom selection for multi-label data of inquiry diagnosis in traditional Chinese medicine. Science China Information
Sciences, 56(5), 1–13.
Sheikhpour, R., Sarram, M. A., Gharaghani, S., & Chahooki, M. A. Z. (2017). A survey on semi-supervised feature selection methods. Pattern Recognition, 64,
141–158.
Sikdar, U. K., Ekbal, A., Saha, S., Uryupina, O., & Poesio, M. (2015). Differential evolution-based feature selection technique for anaphora resolution. Soft Computing, 19(8), 2149–2161.
Song, G., & Ye, Y., (Eds). (2014). A new ensemble method for multi-label data stream classification in non-stationary environment. Paper presented at the meeting of the Neural Networks (IJCNN), 2014 International Joint Conference; IEEE.
Spolaôr, N., Cherman, E. A., Monard, M. C., & Lee, H. D. (2013a). A comparison of multi-label feature selection methods using the problem transformation
approach. Electronic Notes in Theoretical Computer Science, 292, 135–151.
Spolaôr, N., Cherman, E. A., Monard, M. C., & Lee, H. D., (Eds). (2013b). ReliefF for multi-label feature selection. Paper presented at the meeting of the Intelligent Systems (BRACIS), 2013 Brazilian Conference; IEEE.
Spolaôr, N., Monard, M. C., Tsoumakas, G., & Lee, H., (Eds). (2014). Label construction for multi-label feature selection. Paper presented at the meeting of the Intelligent Systems (BRACIS), 2014 Brazilian Conference; IEEE.
Spolaôr, N., Monard, M. C., Tsoumakas, G., & Lee, H. D. (2016). A systematic review of multi-label feature selection and a new method based on label construction.
Neurocomputing, 180, 3–15.
Spolaôr, N., & Tsoumakas, G. (2013). Evaluating feature selection methods for multi-label text classification. Vancouver, Canada: BioASQ workshop.
Spyromitros, E., Tsoumakas, G., & Vlahavas, I., (Eds). (2008). An empirical study of lazy multilabel classification algorithms. Paper presented at the meeting of the Hellenic Conference on Artificial Intelligence; Springer.
Spyromitros-Xioufis, E. (2011). Dealing with concept drift and class imbalance in multi-label stream classification. Barcelona: Department of Computer Science, Aristotle University of Thessaloniki.
Teisseyre, P. (2016). Feature ranking for multi-label classification using Markov networks. Neurocomputing, 205, 439–454.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. P. (2008). Multi-label classification of music into emotions. Paper presented at the ISMIR, Philadelphia,
PA USA.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2009). Mining multi-label data. Data mining and knowledge discovery handbook (pp. 667–685). Boston: Springer.
Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., & Vlahavas, I. (2011). Mulan: A java library for multi-label learning. Journal of Machine Learning Research,
12(Jul), 2411–2414.
Tsoumakas, G., & Vlahavas, I., (Eds). (2007). Random k-labelsets: An ensemble method for multilabel classification. Paper presented at the meeting of the European Conference on Machine Learning; Springer.
Vergara, J. R., & Estévez, P. A. (2014). A review of feature selection methods based on mutual information. Neural Computing and Applications, 24(1), 175–186.
Xu, J., Liu, J., Yin, J., & Sun, C. (2016). A multi-label feature extraction algorithm via maximizing feature variance and feature-label dependence simultaneously.
Knowledge-Based Systems, 98, 172–184.
Xu, S., Yang, X., Yu, H., Yu, D.-J., Yang, J., & Tsang, E. C. (2016). Multi-label learning with label-specific feature reduction. Knowledge-Based Systems, 104, 52–61.
Xue, B., Zhang, M., Browne, W. N., & Yao, X. (2016). A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary
Computation, 20(4), 606–626.
Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., et al., (Eds). (2005). OCFS: Optimal orthogonal centroid feature selection for text categorization. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM.
Yang, J., Jiang, Y.-G., Hauptmann, A. G., & Ngo, C.-W. (2007). Evaluating bag-of-visual-words representations in scene classification. Paper presented at the Proceedings of the international workshop on Workshop on multimedia information retrieval, Augsburg, Bavaria, Germany.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Paper presented at the ICML, Nashville, TN, USA.
You, M., Liu, J., Li, G.-Z., & Chen, Y. (2012). Embedded feature selection for multi-label classification of music emotions. International Journal of Computational
Intelligence Systems, 5(4), 668–678.
Younes, Z., Abdallah, F., Denoeux, T., & Snoussi, H. (2011). A dependent multilabel classification method derived from the k-nearest neighbor rule. EURASIP Journal on Advances in Signal Processing, Article ID 645964, 14 pages.
Yu, K., Yu, S., & Tresp, V., (Eds). (2005). Multi-label informed latent semantic indexing. In: Proceedings of the 28th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval; ACM.
Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. Paper presented at the ICML, Washington D.C.
Yu, L., & Liu, H., (Eds). (2004). Redundancy based feature selection for microarray data. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM.
Yu, Y., & Wang, Y., (Eds). (2014). Feature selection for multi-label learning using mutual information and GA. Paper presented at the meeting of the International
Conference on Rough Sets and Knowledge Technology; Springer.
Zhang, M.-L., Li, Y.-K., & Liu, X.-Y. (2015). Towards class-imbalance aware multi-label learning. Paper presented at the IJCAI, Buenos Aires, Argentina.
Zhang, M.-L., & Wu, L. (2015). LIFT: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1), 107–120.
Zhang, M.-L., Peña, J. M., & Robles, V. (2009). Feature selection for multi-label naive Bayes classification. Information Sciences, 179(19), 3218–3229.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge
and Data Engineering, 18(10), 1338–1351.
Zhang, M.-L., & Zhou, Z.-H. (2007). ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2038–2048.
Zhang, M.-L., & Zhou, Z.-H. (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819–1837.
Zhang, Y., Gong, D.-W., & Rong, M., (Eds). (2015). Multi-objective differential evolution algorithm for multi-label feature selection in classification. International
Conference in Swarm Intelligence; Springer.
Zhang, Y., Gong, D.-W., Sun, X.-Y., & Guo, Y.-N. (2017). A PSO-based multi-objective multi-label feature selection method in classification. Scientific Reports, 7(1), 376.
Zhang, Y., Wang, S., Phillips, P., & Ji, G. (2014). Binary PSO with mutation operator for feature selection using decision tree applied to spam detection.
Knowledge-Based Systems, 64, 22–31.
Zhang, Y., & Zhou, Z.-H. (2010). Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data
(TKDD), 4(3), 14.
Zhao, Z., & Liu, H. (2007a). Searching for interacting features. Paper presented at the IJCAI, Hyderabad, India.
Zhao, Z., & Liu, H., (Eds). (2007b). Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on
Machine learning; ACM.
Zhao, Z., & Liu, H., (Eds). (2007c). Semi-supervised feature selection via spectral analysis. In: Proceedings of the 2007 SIAM International Conference on Data Mining; SIAM.
How to cite this article: Kashef S, Nezamabadi-pour H, Nikpour B. Multilabel feature selection: A comprehensive review and guiding experiments. WIREs Data Mining Knowl Discov. 2018;8:e1240. https://doi.org/10.1002/widm.1240