
Received: 14 May 2017 Revised: 1 November 2017 Accepted: 28 November 2017

DOI: 10.1002/widm.1240

ADVANCED REVIEW

Multilabel feature selection: A comprehensive review
and guiding experiments

Shima Kashef1,2 | Hossein Nezamabadi-pour1,2 | Bahareh Nikpour1,2

1 Intelligent Data Processing Laboratory (IDPL), Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
2 Mahani Mathematical Research Center, Shahid Bahonar University of Kerman, Kerman, Iran

Correspondence
Hossein Nezamabadi-pour, Intelligent Data Processing Laboratory (IDPL), Department of Electrical Engineering, Shahid Bahonar University of Kerman, P.O. Box 76619-133, Kerman, Iran.
Email: nezam@uk.ac.ir

Feature selection has been an important issue in machine learning and data mining, and is unavoidable when confronting high-dimensional data. With the advent of multilabel (ML) datasets and their vast applications, feature selection methods have been developed for dimensionality reduction and improvement of the classification performance. In this work, we provide a comprehensive review of the existing multilabel feature selection (ML-FS) methods, and categorize these methods based on different perspectives. As feature selection and data classification are closely related to each other, we provide a review of ML learning algorithms as well. Also, to facilitate research in this field, a section is provided for setup and benchmarking that presents evaluation measures, standard datasets, and existing software for ML data. At the end of this survey, we discuss some challenges and open problems in this field that can be pursued by researchers in future.

This article is categorized under:
Technologies > Data Preprocessing

KEYWORDS
feature selection, multi-label data, classification, data mining

1 | INTRODUCTION

Nowadays, the world has encountered the problem of high-dimensional data generated from different sources in various fields
such as health care, social media, transportation, bioinformatics, microarray data, e-commerce, multimedia data, and so on. Fast
growth of data imposes big challenges on effective and efficient data management in the field of pattern recognition and machine
learning, that is, on applying machine learning algorithms to discover knowledge from high-dimensional data. Recently, it has
been realized that preprocessing methods play an essential role in working more effectively with datasets. Such methods
process the data before it is presented to a learning algorithm, with the goal of improving performance. In the fields of pattern
recognition, machine learning and statistics, feature selection, also recognized as feature subset selection, attribute selection, or
variable selection, is a preprocessing technique that selects a subset of relevant features (variables) from the existing features to
be used in model construction. There is wide-ranging interest in feature selection as data preprocessing techniques among experts
from pattern recognition and machine learning fields. Fundamental motivations for feature selection are as follows:

• Avoiding the curse of dimensionality as the main reason (Friedman, Hastie, & Tibshirani, 2001)
• Reducing the training time of the algorithms
• Simplifying the models to make them easier to interpret (James, Witten, Hastie, & Tibshirani, 2013)
• Enhancing the generalization ability of the systems by reducing overfitting (Bermingham et al., 2015).

Among the above-mentioned reasons, the most important one is the “curse of dimensionality,” an issue that leads to
confusion of the model (classifier) and a reduction in classification accuracy. Using a suitable feature selection scheme can play

a very important role in addressing this problem by eliminating the irrelevant and redundant data and reducing the dimen-
sionality. To clarify the feature selection procedure, suppose X is the original set of features with cardinality |X| = M and
E : X′ ⊆ X → ℝ is the evaluation function to be optimized; the feature selection process is defined as finding X′ ⊆ X such
that |X′| = m < M and E(X′) is optimized (Barani, Mirhosseini, & Nezamabadi-pour, 2017).
Feature selection methods are generally categorized as three main groups including filters, wrappers, and embedded
methods. The methods of filter category are independent of learning algorithms and make use of general characteristics of the
training data for selecting the best features. Such methods rank the features using some criteria and eliminate the features with
insufficient scores. The main advantage of methods of this group is their low computational complexity, which makes them
suitable to be used in high-dimensional data. Wrapper methods, on the other hand, take advantage of a specific learning algo-
rithm as a part of the feature selection process; hence, these approaches usually gain better results, but are computationally
expensive and sometimes cannot be utilized. Filter and wrapper methods can be considered complementary to each other;
embedded methods exploit the strengths of these two categories simultaneously. In other words, feature selection is
performed as part of the model constructing process in embedded approaches (Kashef & Nezamabadi-pour, 2015).
In addition to the previously mentioned categorization, existing feature selection methods fall under three main categories
from the label perspective: supervised, unsupervised, and semisupervised methods. In supervised feature selection tech-
niques, sufficient labeled training data samples are available and feature relevance is determined by evaluating each feature’s cor-
relation with the class. On the other hand, unsupervised methods do not need any labeled training data samples. The simplest
unsupervised method may be maximum variance in which features are evaluated by data variance and the features with max-
imum variance are selected (Ren, Zhang, Yu, & Li, 2012). Although the supervised methods achieve higher accuracy
because they rely on more information, obtaining data labels is usually expensive, especially in high-dimensional datasets,
so supervised methods are not suitable in these cases. Moreover, as label information is not available in unsupervised
scenarios, it is hard to choose discriminative features. Semisupervised feature selection techniques are suitable for cases in
which only a few of the training data samples are labeled. Such datasets have recently become common in
real applications. In this situation, the training procedure of supervised feature selection results in overfitting.
In classical supervised learning problems, each instance in the dataset belongs to only one label yj from a set of labels L.
This type of datasets is called single-label (SL) data. However, in some real-world problems, each instance is usually associ-
ated with a set of labels, Yi ⊆ L, simultaneously. Such prediction tasks are usually denoted as multilabel (ML) classification
problems, and are widely used in applications such as semantic image and video annotation (Boutell, Luo, Shen, & Brown,
2004; Yang, Jiang, Hauptmann, & Ngo, 2007), classification of protein functions and genes (Diplaris, Tsoumakas, Mitkas, &
Vlahavas, 2005; Zhang & Zhou, 2006), categorization of text (Luo, Chen, & Xiong, 2011) and emotions evoked by music
(Trohidis, Tsoumakas, Kalliris, & Vlahavas, 2008), and so on. As an example of image annotation task, an image can simul-
taneously depict “sky,” “trees,” and “sunrise,” so it should be included in all three categories. Due to the fast emergence
and spread of ML datasets, many studies have been conducted in this field during the past decade. As in the SL case, feature
selection is an essential task in ML data classification due to the large number of features, and much work has been devoted
to it.
Beside the feature selection methods, there exist other preprocessing approaches that focus on features including feature
extraction, feature construction, and feature ranking with the goal of improving the performance of the machine learning
algorithms. Feature extraction methods aim to reduce the dimensionality of data by creating new features using a linear com-
bination of all features. Such methods are supposed to build informative and nonredundant features from an initial set of
measured data to facilitate learning. There are some ML feature extraction methods some of which can be found in the
papers of Carmona-Cejudo, Baena-García, del Campo-Avila, and Morales-Bueno (2011); Naula, Airola, Salakoski, and
Pahikkala (2014); Xu, Liu, Yin, and Sun (2016); Yu, Yu, and Tresp (2005); and Zhang and Zhou (2010). Methods of feature
construction category attempt to incorporate original features for achieving new high-level features with the purpose of pro-
viding better discriminative ability. There are few works on feature construction in the ML domain, such as (Duivesteijn,
Mencía, Fürnkranz, & Knobbe, 2012; Prati & de França, 2013). In ML tasks, in addition to feature construction, label con-
struction can be used to generate new labels employing the information obtained by considering the relations between the
original labels (Spolaôr, Monard, Tsoumakas, & Lee, 2014). Feature ranking techniques are mainly employed to evaluate the
individual relevance of features and order features based on their relevance in prediction of labels. Such methods can be
helpful for feature selection by eliminating the features, which are the least significant. In Kocev, Slavkov, and Dzeroski
(2013); Lee and Kim (2015a); Reyes, Morell, and Ventura (2014); and Teisseyre (2016), some feature ranking methods are
presented. Our study mainly focuses on reviewing multilabel feature selection (ML-FS) methods in the classification task,
so the other ML feature engineering approaches are not explained in detail.
Due to large amount of research in the field of single-label feature selection (SL-FS), there are many review
papers, which categorize and analyze these methods from different perspectives. Recent reviews on SL-FS methods can be
found in the papers of Bolón-Canedo, Sánchez-Maroño, and Alonso-Betanzos (2013); Bolón-Canedo, Sánchez-Marono,
Alonso-Betanzos, Benítez, and Herrera (2014); Chandrashekar and Sahin (2014); Choudhary and Saraswat (2014); Li
et al. (2016); Vergara and Estévez (2014); Xue, Zhang, Browne, and Yao (2016); and Yu and Liu (2004). In Xue
et al. (2016) and De La Iglesia (2013), evolutionary-based SL-FS methods are surveyed precisely. Moreover, a comprehen-
sive overview of recent advances in feature selection is presented in Li et al. (2016). In recent years, with the increasing
spread of ML classification, there has been much research on ML learning and feature selection. There exist
some papers that review ML learning algorithms, such as Tsoumakas, Katakis, and Vlahavas (2009) and Zhang and Zhou
(2014). For ML-FS, Spolaôr, Monard, Tsoumakas, and Lee (2016) provide a systematic review paper, which investigates the
existing ML-FS publications. They illustrate the necessity of providing a complete taxonomy for ML-FS in this work.
Another paper, by Pereira, Plastino, Zadrozny, and Merschmann (2016), reviews ML-FS methods and provides a good cat-
egorization of them. To the best of our knowledge, there is no other review paper on the ML-FS task. Despite the
existence of the mentioned ML review papers, there is still a need to provide a review for analyzing the ML-FS methods
from different views, investigating more recent works and discussing the challenges for future work. Therefore, we intended
to provide a comprehensive study to review and categorize the existing methods to help future researchers in their work.
The paper is organized as follows: Section 2 gives fundamental concepts including formal definition of ML data, and
suggests a novel taxonomy of ML classification approaches. ML-FS task is described in Section 3, and the existing methods
are reviewed and categorized based on different perspectives. Section 4 discusses setup and benchmarking by investigating
standard ML databases along with performance evaluation criteria, nonparametric tests, and developed software for ML task.
Finally, in Section 5, we discuss some challenges and open problems that are ignored and need more attention in future
works, and draw the conclusions.

2 | FUNDAMENTAL CONCEPTS

In this section, the formal definition of ML data is firstly presented. Then, a comprehensive taxonomy of ML learning algo-
rithms is introduced, and the most important methods are briefly described.

2.1 | ML data

Suppose X = ℝ^M (or ℤ^M) is an M-dimensional instance space, and L = {y1, y2, …, yq} denotes the label space with q possible class labels. The task of ML learning is to learn a function h : X → 2^L from the ML training set D = {(xi, Yi), i = 1, …, N} with N samples, where each sample is associated with a feature vector xi = (xi1, xi2, …, xiM) described by M features, and a binary label vector Yi = (yi1, yi2, …, yiq) described by q labels. Figure 1 shows this representation.
For each unseen sample x ∈ X, the ML classifier h(·) predicts h(x) ⊆ L as the set of proper labels for x.
To describe the properties of ML datasets, several practical ML measures can be employed. Label cardinality
(LC) measures the degree of multilabeledness, which is defined by Equation (1). In other words, it measures the average
number of labels associated with each instance. The other ML measure is label density (LD), which is the cardinality normalized by |L|, as defined by Equation (2) (Zhang & Zhou, 2014).

LC(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} |Y_i|,   (1)

LD(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i|}{|L|}.   (2)

Unlike LD that considers the number of labels q, LC is not dependent on the number of labels and is utilized to quantify
the number of available labels that describe the examples of an ML training dataset. Two datasets with equal LC but with
different LD might have different behaviors in ML learning methods (Tsoumakas et al., 2009).
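As a concrete illustration, the following minimal Python sketch (NumPy is assumed; the function names are ours) computes LC and LD from a binary label matrix like the one in Figure 1:

```python
import numpy as np

def label_cardinality(Y):
    """Average number of relevant labels per instance, Equation (1)."""
    return Y.sum(axis=1).mean()

def label_density(Y):
    """Label cardinality normalized by the number of labels |L|, Equation (2)."""
    return label_cardinality(Y) / Y.shape[1]

# Toy label matrix: 4 instances, q = 3 labels
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])
print(label_cardinality(Y))  # 1.75
print(label_density(Y))      # ~0.583
```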

FIGURE 1 Multilabel data (an N × M feature matrix X paired with an N × q binary label matrix Y)

Another popular criterion is label diversity that represents the number of distinct label sets, and is helpful for many prob-
lem transformation methods that work on subsets of labels.

2.2 | ML classification approaches


An important difference between ML and SL classification is that the multiple labels are usually correlated, meaning they are
not independent of each other, whereas in SL classification the class assignments are mutually exclusive (Kong, Ding,
Huang, & Zhao, 2012; Kong & Philip, 2012). For example, in image annotation, an image containing “sea” is more probable
to contain “ship” than to contain “car.” In this example, “ship” and “sea” are highly correlated labels while “car” and “ship” or
“car” and “sea” have low correlation. This knowledge can help the classifier to have a better performance. ML learning algo-
rithms are divided into three categories based on considering the order of label correlation. First-order strategies do not con-
sider the coexistence of other labels and perform the learning task by considering each label individually. Second-order
strategies consider pairwise relation between labels. The correlation among labels can be exploited by second-order strategy to
some extent. Finally, in high-order strategies, relation among more labels is considered. In addition to label correlation, there
are several recently proposed topics in ML learning algorithms, including class-imbalance (Li & Wang, n.d.; Spyromitros-
Xioufis, 2011; Zhang, Li, & Liu, 2015), label-specific features (Huang, Li, Huang, & Wu, 2015, 2016; Qiao, Zhang, Sun, &
Liu, 2017; Xu et al., 2016; Zhang & Wu, 2015), and data streams (Read, Bifet, Holmes, & Pfahringer, 2012; Song & Ye,
2014). Class-imbalance problem happens when instances belonging to a certain label outnumber the instances that do not
belong to it in the training set. Label-specific features mean that each class label is supposed to have its own characteristics
and is determined by some specific features that are the most discriminative features for that label. For example, the feature
blood sugar is informative to distinguish diabetic and nondiabetic people, but it is useless to determine whether the person is a
student or not. In data streaming scenarios, the classifier encounters different challenges such as dealing with large number of
instances, limited memory and time, single-pass, and real-time prediction. ML learning algorithms are mainly classified into
two groups: problem transformation methods and algorithm adaptation methods. These categories can be also divided into
some other subgroups. In this paper, we propose a hierarchical taxonomy of ML classification methods illustrated in Figure 2.

2.3 | Problem transformation


Methods of this category map the problem of ML classification into one or more SL classification problems. Afterwards, any
SL classifier can be applied to the constructed SL dataset(s), and the results are then transformed back into ML representation.
Such methods can be classified into five groups including binary relevance (BR) (Boutell et al., 2004), methods that combine
labels such as label powerset (LP) (Cherman, Monard, & Metz, 2011), pairwise methods such as calibrated label ranking
(Fürnkranz, Hüllermeier, Mencía, & Brinker, 2008), select family (Chen, Yan, Zhang, Chen, & Yang, 2007), and ensemble
methods such as random-k-label-sets (RAKEL) (Tsoumakas & Vlahavas, 2007) and classifier chain (CC) (Read, Pfahringer,
Holmes, & Frank, 2011). Commonly encountered problem transformation methods are explained in the following:

2.3.1 | Binary relevance


Binary relevance (BR) is the most frequent problem transformation approach for ML classification. This method transforms
the ML learning task into q (= |L|) independent SL binary classification problems, one for each label in L. In other words, the
original dataset is decomposed into q datasets which contain all instances of the original dataset, labeled as y if the labels of
the original instance contained y (positive instances), and as –y otherwise (negative instances) (Spyromitros, Tsoumakas, &
Vlahavas, 2008). Finally, to classify an unseen ML example, BR predicts its associated label set Y by querying positive
labels on each individual binary classifier and then combining these labels. The advantage of BR is its simplicity but it suf-
fers from a big drawback, which is the lack of ability to detect the correlation among labels (Zhang & Zhou, 2014). Different
variations of this method have been proposed to overcome this drawback such as (Cherman et al., 2011; Cherman, Metz, &
Monard, 2010; Read et al., 2011).
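As a sketch of the BR decomposition (scikit-learn's logistic regression serves only as an example base learner, and the helper names are ours; every label column is assumed to contain both positive and negative instances):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def br_fit(X, Y):
    """Train one independent binary classifier per label column of Y."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
            for j in range(Y.shape[1])]

def br_predict(models, X):
    """Query each binary classifier and stack the votes into a label matrix."""
    return np.column_stack([m.predict(X) for m in models])
```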

2.3.2 | Label powerset


This approach transforms the ML learning problem into a SL multiclass classification problem. It considers each unique sub-
set of L as a single label and trains one SL classifier h : X → P(L), where P(L) is the power set of L containing all possible
label subsets (Tsoumakas & Vlahavas, 2007). To classify an unseen ML example, LP predicts its associated label set Y by
firstly querying the prediction of ML classifier and then mapping it back to the power set of L. Unlike BR, LP considers the
correlation among labels, but as the number of new classes grows dramatically with the number of labels, it easily leads to
higher complexity in the training phase. Also, some classes are associated with very few instances, which makes the learning
process difficult as well. Another disadvantage of LP is that it can only predict label sets that have previously appeared in the
training set, and it is unable to generalize to those outside (Lee, Kim, Kim, & Lee, 2016).

FIGURE 2 Taxonomy of multilabel classification methods:
• Problem transformation methods: binary relevance (BR); label powerset (LP), including pruned problem transformation (PPT) and HOMER; pairwise methods, including calibrated label ranking (CLR), ranking by pairwise comparison, and multi-label pairwise perceptron (MLPP); the select family, including ALA (copy), NLA, LLA, SLA, and ELA (copy-weight); and ensemble methods, including ensembles of classifier chains (ECC), random k-label sets (RAkEL), and ensembles of pruned sets (EPS).
• Algorithm adaptation methods: naïve Bayes (ML-NB); lazy learning (ML-kNN, ranking-based kNN, DMLkNN, BRkNN, MLCWkNN, IBLR-ML); neural network based (BP-MLL); tree-based boosting (AdaBoost.MH, AdaBoost.MR); decision trees (ML-DT, ML-C4.5); evolutionary based (ML-KMPSO, MuLAM, hmANT-Miner, hmAntMiner-C, G3P-ML); support vector machine (Rank-SVM); and information theoretic (collective multi-label classifier, CML).

Various varia-
tions of the LP method can be found in the papers of Lo, Lin, and Wang (2014); Read (2008); and Read, Puurula, and Bifet
(2014) that try to improve the performance of this method and reduce its disadvantages.
Pruned problem transformation (PPT) is one of the extensions of the LP method proposed to overcome its disadvantage
regarding the created classes associated with rare instances. The idea is to prune label sets (new classes) that appear fewer
times than a user-defined threshold (usually 2 or 3). Instances belonging to these classes can either be eliminated or be
assigned to another class.
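A minimal sketch of the LP mapping and the PPT pruning step (our own helper names; a NumPy binary label matrix is assumed):

```python
import numpy as np
from collections import Counter

def lp_transform(Y):
    """Map each distinct label set (row of Y) to a single multiclass id."""
    keys = [tuple(row) for row in Y]
    class_of = {k: i for i, k in enumerate(sorted(set(keys)))}
    return np.array([class_of[k] for k in keys]), class_of

def ppt_keep_indices(Y, threshold=2):
    """PPT: keep instances whose exact label set occurs at least `threshold` times."""
    counts = Counter(tuple(row) for row in Y)
    return [i for i, row in enumerate(Y) if counts[tuple(row)] >= threshold]
```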

2.3.3 | Pairwise methods


The idea of pairwise classification was first introduced for employing binary classifiers in SL multiclass classifica-
tion problems. The basic idea is to decompose the c-class problem into c(c − 1)/2 binary problems, one for
each pair of classes. Pairwise or round robin classification methods demonstrate better performance than the conventional
one-against-all technique for different learning algorithms such as support vector machine (SVM) (Park & Fürnkranz, 2007).
6 of 29 KASHEF ET AL.

In order to reduce the computational cost of evaluating all possible pairwise binary classifiers, Park and Fürnkranz (2007)
extend a recently proposed method called QWeighted, which is designed for multiclass problems, and choose the base classi-
fiers that are essential for predicting the top class.
The idea of pairwise classification can be extended to ML classification problem. Here, a problem with q labels is trans-
formed into q(q − 1)/2 subproblems, that is, a binary classifier for every pair of labels. The instances of the first label are
used as positive instances and the instances of the second label are used as the negative instances to train each classifier. To
classify a test sample, each classifier votes for one of the two labels. All labels are then sorted according to their sum of
votes, and the relevant labels for each sample are predicted by a label ranking algorithm (El Kafrawy, Mausad, & Esmail,
2015). In recent years, the idea of pairwise classification has been adapted to ML learning algorithms such as calibrated label rank-
ing (CLR) (Fürnkranz et al., 2008), ranking by pairwise comparison (RPC) (Hüllermeier, Fürnkranz, Cheng, & Brinker,
2008), and multilabel pairwise perceptron (MLPP) (Mencía, Park, & Fürnkranz, 2010). For example, Mencia et al. (2010)
generalized the QWeighted approach for ML data.
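The q(q − 1)/2 decomposition described above can be sketched as follows (our own helper; only instances carrying exactly one of the two labels enter each binary problem):

```python
import numpy as np
from itertools import combinations

def pairwise_datasets(X, Y):
    """Yield one binary problem per label pair (a, b): instances with label a but
    not b become positives, and those with b but not a become negatives."""
    for a, b in combinations(range(Y.shape[1]), 2):
        mask = Y[:, a] != Y[:, b]          # exactly one of the two labels holds
        yield (a, b), X[mask], Y[mask, a]  # target is 1 if label a is present
```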

2.3.4 | Select family of transformation


These methods transform the ML data into SL multiclass data by replacing the label set of the ith instance Yi, with one of its
members, yj ∈ Yi. The label set can be replaced with the most (largest label assignment, LLA) or the least (smallest label
assignment, SLA) frequent label among all instances. The other methods are to eliminate all ML instances and only keep SL
instances (no label assignment, NLA), select one label randomly (random label assignment, RLA), or copy each ML instance
as many times as the number of labels which are assigned to that instance (all label assignment, ALA). Copy-weight
(entropy-based label assignment, ELA) is a variation of this transformation that assigns a weight of 1/|Yi| to each of the con-
structed instances (Chen et al., 2007; Tsoumakas et al., 2009). Similar to BR transformation, select family of transformation
does not consider the possible correlation among labels.
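For instance, LLA and the ELA copy-weight can be sketched as follows (our own naming; a NumPy binary label matrix is assumed, with every instance carrying at least one label):

```python
import numpy as np

def lla_transform(Y):
    """LLA: replace each label set with its globally most frequent member."""
    freq = Y.sum(axis=0)                          # overall frequency of each label
    return np.where(Y == 1, freq, -1).argmax(axis=1)

def ela_weights(Y):
    """ELA (copy-weight): each copied instance receives weight 1/|Y_i|."""
    return 1.0 / Y.sum(axis=1)
```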

2.4 | Algorithm adaptation


The second category of methods extends some popular learning algorithms to deal with ML data directly. Several methods
of this category are multilabel naïve Bayes (MLNB) (Zhang, Peña, & Robles, 2009), kNN-based learning algorithms such as
ML-kNN (Zhang & Zhou, 2007), BRkNN (Spyromitros et al., 2008), ranking-based kNN (Chiang, Lo, & Lin, 2012), and
DMLkNN (Younes, Abdallah, Denoeux, & Snoussi, 2011), back-propagation multilabel learning (BPMLL) (Zhang & Zhou,
2006), rank-SVM (Elisseeff & Weston, 2001), collective multilabel classifier (CML) (Ghamrawi & McCallum, 2005), and multi-
label decision tree (ML-DT) (De Comité, Gilleron, & Tommasi, 2003). As ML-kNN is one of the most frequently used
learning algorithms in ML-FS papers, we give a brief explanation of this method in the following. Detailed description of
the other methods is out of the scope of this paper. We have tried to choose a proper name for each category that appropriately
reflects the origin of its methods. Also, a comprehensive review of ML learning algorithms can be found in Zhang and
Zhou’s (2014) study.

2.4.1 | kNN-based ML classifiers


The kNN is a classification algorithm in which samples are classified by majority vote of their neighbors in the training data. In
other words, for each test sample, the k nearest neighbors are first detected in the training data and then the test sample is
assigned to the class that is most common amongst its neighbors. This classifier does not extract any information from the
training data during the learning stage, and learning is only maintaining the positions of data samples. That is why this algo-
rithm is categorized in lazy learners group.
ML-kNN is the first ML lazy learning algorithm proposed by Zhang and Zhou (2007), which is based on the maximum-
a-posteriori principle. For an unseen instance x with unknown label set Y ⊆ L, it first identifies the k nearest neighbors of x
in the training data, counts the number of occurrences of yj, 1 ≤ j ≤ q, among these neighbors, and records it in a variable
named Cj. Also, assume that Hj (¬Hj) denotes the event that x has (does not have) label yj. Now, to determine whether label yj
belongs to x or not, the two posterior probabilities P(Hj | Cj) and P(¬Hj | Cj) should be calculated and compared with
each other (Zhang & Zhou, 2014):

Y = \{ y_j \mid P(H_j \mid C_j) > P(\neg H_j \mid C_j), \; 1 \le j \le q \}.   (3)

According to Bayes' rule, the ratio of the two posteriors is given by:

\frac{P(H_j \mid C_j)}{P(\neg H_j \mid C_j)} = \frac{P(H_j) \, P(C_j \mid H_j)}{P(\neg H_j) \, P(C_j \mid \neg H_j)}.   (4)

Now, the problem is finding the values of the prior probabilities P(Hj) and P(¬Hj) and the likelihoods P(Cj | Hj) and P(Cj | ¬Hj),
which are approximated from the training data via the frequency counting strategy (Zhang & Zhou, 2014). This process is
repeated for each label yj ∈ L. Therefore, ML-kNN is a binary relevance learner, and as it must consider the information of
all k nearest neighbors, the computational complexity of this method is rather high. Another disadvantage of this classifier is
that it does not take into consideration the dependency between labels. However, it is still one of the most popular ML classi-
fiers that is frequently used in many ML-FS papers (Doquire & Verleysen, 2013a; Jungjit & Freitas, 2015a; Lee & Kim,
2015a; Li, Li, Zhai, Wang, & Zhang, 2016; Lin, Hu, Liu, Chen, & Duan, 2016).
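To make the procedure concrete, the following compact, non-optimized Python sketch follows Equations (3) and (4) under our own naming (NumPy and scikit-learn's NearestNeighbors are assumed; s is the Laplace smoothing parameter, set to 1 in Zhang and Zhou (2007)):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mlknn_fit_predict(X_tr, Y_tr, X_te, k=10, s=1.0):
    """Minimal ML-kNN sketch: per-label MAP decision from neighbor label counts."""
    N, q = Y_tr.shape
    nn = NearestNeighbors(n_neighbors=k).fit(X_tr)

    # Priors P(H_j), estimated by frequency counting with smoothing s
    prior = (s + Y_tr.sum(axis=0)) / (2 * s + N)

    # For each training instance, count how many of its k neighbors carry each
    # label (k + 1 neighbors are queried so the instance itself can be dropped)
    neigh = nn.kneighbors(X_tr, n_neighbors=k + 1, return_distance=False)[:, 1:]
    counts = np.stack([Y_tr[idx].sum(axis=0) for idx in neigh])          # (N, q)

    # Likelihoods P(C_j = c | H_j) and P(C_j = c | not H_j) for c = 0, ..., k
    kh = np.stack([((counts == c) & (Y_tr == 1)).sum(axis=0) for c in range(k + 1)])
    knh = np.stack([((counts == c) & (Y_tr == 0)).sum(axis=0) for c in range(k + 1)])
    like_h = (s + kh) / (s * (k + 1) + kh.sum(axis=0))
    like_nh = (s + knh) / (s * (k + 1) + knh.sum(axis=0))

    # Posterior comparison of Equations (3) and (4) for each test instance
    C = np.stack([Y_tr[idx].sum(axis=0)
                  for idx in nn.kneighbors(X_te, return_distance=False)])
    cols = np.arange(q)
    return (prior * like_h[C, cols] > (1 - prior) * like_nh[C, cols]).astype(int)
```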

3 | MULTILABEL FEATURE SELECTION

As discussed earlier, feature selection is the process of reducing the number of features by eliminating irrelevant and redun-
dant features. Given input data with M features, X = {x1, x2, …, xM} and the label set L = {y1, y2, …, yq}, an ML-FS algo-
rithm should discover the smallest subset of nonredundant features S ⊆ X with m ≪ M features and the greatest relevance to
the labels. In other words, feature-label dependency should be maximized whereas feature–feature dependency should be
minimized in the selected feature subset. Some feature selection methods only consider feature-label correlations and try to
select the most relevant features (Spolaôr, Cherman, Monard, & Lee, 2013a). As Makrehchi and Kamel (2005) showed, the
effect of redundant features is almost similar to that of noise and causes degradation of the classifier performance. Another
factor is also assumed to have an important role in improving the performance of ML-FS method, which is considering
label-label dependency. Many papers suggest considering feature-feature, label-feature, and label-label dependencies in ML-
FS methods (Lin, Hu, Liu, & Duan, 2015).

3.1 | Categorization of ML-FS methods


ML-FS methods can be categorized according to different perspectives: label perspective, search strategy perspective, interac-
tion with learning algorithm, and data format perspective. The first three categorizations are similar to SL-FS methods, but
the fourth one only belongs to ML-FS methods. Figure 3 represents our proposed categorization of ML-FS methods that are
explained in detail in this section.

FIGURE 3 Categorization of multilabel feature selection methods from different perspectives: label perspective (supervised, semisupervised, unsupervised); search strategy perspective (exhaustive, sequential, randomized); interaction with the learning algorithm (filter, wrapper, embedded); and data format perspective (problem transformation, algorithm adaptation)

3.2 | Label perspective


Training data samples can be either labeled or unlabeled, resulting in different types of feature selection techniques,
including unsupervised, semisupervised, and supervised, which are described as follows.

3.2.1 | Supervised feature selection


In supervised feature selection approaches, training samples are labeled, so it is expected to achieve high accuracies. Most of
the existing feature selection methods are included in this category such as correlation-based feature selection (CFS) (Hall,
1999), INTERACT (Zhao & Liu, 2007a), mRMR (Peng, Long, & Ding, 2005), and fast correlation-based filter (FCBF)
(Yu & Liu, 2003) for SL data; and ReliefF-ML (Reyes, Morell, & Ventura, 2015), Binary Relevance Random Forest
(BRRF) (Gharroudi, Elghazel, & Aussem, 2014), multilabel feature ranker (MLFR) (Lastra, Luaces, Quevedo, & Baha-
monde, 2011), multilabel feature selection technique based on IG (IGMF) (Li et al., 2014) and multilabel correlation-based
feature selection (ML-CFS) (Jungjit, Freitas, Michaelis, & Cinatl, 2013), for ML datasets.

3.2.2 | Unsupervised feature selection


Generally, in supervised feature selection techniques, a large amount of training data with their corresponding labels is
needed. Having sufficient labeled data samples, a discriminating feature space can be provided and high accuracy can be
achieved. However, obtaining labeled training data is both expensive and time consuming in real world. Unsupervised fea-
ture selection techniques, which make no use of the training samples’ labels, are introduced to deal with this problem. In
unsupervised feature selection techniques, the capacity of features for maintaining the intrinsic properties of data is measured.
The simplest unsupervised method for SL data may be maximum variance in which features are evaluated by data variance
and the features with maximum variance are selected (Ren et al., 2012). Some other popular methods in unsupervised
domain are feature selection using feature similarity measure (Mitra, Murthy, & Pal, 2002), using Laplacian score for feature
selection (He, Cai, & Niyogi, 2005), spectral feature selection (SFS) (Zhao & Liu, 2007b), Multicluster feature selection
(MCFS) (Cai, Zhang, & He, 2010), and Singular Value Decomposition (SVD)-based unsupervised feature selection method
(Banerjee & Pal, 2014). Although there exist many SL unsupervised feature selection methods, to the best of our knowledge,
there is no proposed unsupervised feature selection method for the ML case.

3.2.3 | Semi-supervised feature selection


In many real-world applications, there exist a huge amount of unlabeled and only a few labeled training samples. A user can
label all the data samples, but this is a time consuming and complex procedure. In this situation, supervised learning cannot
be adopted because of the small size of labeled training samples set. On the other hand, labeled data samples can be ignored
in order to utilize unsupervised learning, but neglecting the label information makes it difficult to identify the discrimina-
tive features. For these reasons, developing new methods that consider both labeled and unlabeled data is of impor-
tance. With this motivation, semisupervised learning is introduced. For feature selection, there are many works exploiting
semisupervised learning in SL and some in ML datasets. Several examples regarding SL semisupervised feature selection
methods are semisupervised feature selection based on Laplacian score (Doquire & Verleysen, 2013b), spectral semisuper-
vised feature selection method presented in Zhao and Liu (2007c), and a semisupervised feature selection algorithm based on mining label
correlation.
SL semisupervised feature selection methods can be categorized into five classes: graph-based semisupervised feature
selection, an example of which can be found in Cheng, Deng, Fu, Wang, and Qin’s (2011) study, self-training-based semisu-
pervised feature selection (Bellal, Elghazel, & Aussem, 2012), cotraining-based semisupervised feature selection (Barkia,
Elghazel, & Aussem, 2011), SVM-based semisupervised feature selection (Ang, Haron, & Hamed, 2015), and other semisu-
pervised feature selection methods. The majority of semisupervised feature selection methods construct a graph using the training
samples that correspond to graph-based semisupervised learning methods. Some semisupervised feature selection methods
use a single learner or an ensemble learning model to predict the labels of unlabeled data. They select a subset of unlabeled
data along with the predicted labels and extend the initial labeled training set. The idea of these methods corresponds to that
of semisupervised learning methods based on self-training or cotraining. Some semisupervised feature selection methods per-
form feature selection based on semisupervised SVMs. The procedure used in these methods corresponds to semisupervised
SVM learning methods and other semisupervised feature selection methods (Sheikhpour, Sarram, Gharaghani, &
Chahooki, 2017).
The mentioned classical algorithms are only designed for SL datasets. In the ML context, the number of possible combi-
nations of the label attributes increases considerably and the need of a large number of training samples is a critical problem,
so proposing semisupervised approaches is particularly important. To address ML problem, classical algorithms decompose

the ML learning to multiple independent SL problems, which fails to take into consideration correlations between different
labels (Ma, Nie, Yang, Uijlings, & Sebe, 2012).
There are few algorithms designed specifically for ML-FS with semisupervised approach. A convex semisupervised ML-
FS (CSFS) algorithm for large-scale multimedia analysis is presented in Chang, Nie, Yang, and Huang (2014), in which both
labeled and unlabeled data are utilized to select features while correlations among different features are simultaneously taken
into consideration. In this method, the training and test data are first represented using different types of features, and the labels of
the unlabeled data are set to zero. Afterward, the least-squares loss function is minimized, and sparse feature selection and label
prediction are performed. Feature selection is done using the obtained sparse coefficients. This method is computationally effec-
tive, so it is appropriate for large-scale datasets.
In Qian and Davidson (2010), a framework for ML feature reduction is presented called Semi-Supervised Dimension
Reduction for Multi-Label Classification (SSDR-MC) that exploits semisupervised learning. The authors utilized reconstruc-
tion error to determine how well an example is represented by its nearest neighbors. In this way, the intrinsic geometric rela-
tions between examples are specified which can be helpful in both label inference and feature selection. The connection
between dimension reduction and ML learning by an alternating optimization process is as follows: (1) A weight matrix is
learnt from both the available labels and the feature description, and (2) the missing labels are derived according to the weight
matrix; steps (1) and (2) are repeated until the predictions are confirmed.
Another method of feature selection algorithm for semisupervised learning is proposed in Li, You, Ge, Yang, and Yang
(2010), which is applied for analysis of gene function. In this method, a Cotraining algorithm, FESCOT, is incorporated with
ML-kNN leading to a new algorithm called Cotraining ML-kNN (COMN); then, prediction risk-based embedded feature
selection for COMN (PRECOMN) is proposed that uses the sequential backward search algorithm to search for feature sub-
sets. To evaluate feature subsets, the prediction risk criterion is employed.

3.3 | Search strategy perspective


There are 2^M candidate subsets for a dataset D with M features. Searching for the optimum subset of features among all exist-
ing subsets is infeasible even for a moderate number of features. Therefore, a search algorithm is needed to lead the feature
selection procedure to explore among possible feature subsets.
Search algorithms can be classified into three main groups: exhaustive methods, sequential methods, and randomized
methods (Doak, 1992).

3.3.1 | Exhaustive methods


Exhaustive search explores all states in the search space. These methods guarantee to find the best feature subset. Several
methods are proposed that ensure finding the best subset without searching all possible subsets such as branch and bound,
beam search, and best first search (Dash & Liu, 1997).

3.3.2 | Sequential methods


Sequential search is done in an iterative manner such that in each iteration a feature is added to the feature subset or removed
from it. Another variation of this strategy is to add or remove k features instead of one. The greedy algorithms including
sequential backward elimination (SBE), sequential forward selection (SFS), and bidirectional selection utilize this search
strategy. Examples of ML-FS methods that use greedy search to select the optimal feature subset can be found in the papers
of Jungjit et al. (2013); Li, Miao, and Pedrycz (2017); and Lin et al. (2015).

3.3.3 | Randomized methods


The common aspect in methods of this group is their use of randomness to escape local optima in the search space. Las
Vegas (Brassard & Bratley, 1996) is a search algorithm of this group that selects completely random subsets in each
iteration.
Evolutionary algorithms have been widely used for the task of SL-FS achieving promising results some of which are
genetic algorithm (GA) (Jing, 2014), ant colony optimization (ACO) algorithm (Kashef & Nezamabadi-pour, 2015), differen-
tial evolution (DE) algorithm (Sikdar, Ekbal, Saha, Uryupina, & Poesio, 2015), gravitational search algorithm (GSA)
(Rashedi & Nezamabadi-pour, 2014), and particle swarm optimization (PSO) algorithm (Zhang, Wang, Phillips, & Ji, 2014).
However, in the case of ML-FS, since the number of features is large, there are few works such as Jungjit and Freitas
(2015a); Jungjit and Freitas (2015b); Lee and Kim (2015b); Shao, Li, Liu, and Wang (2013); Yu and Wang (2014); Zhang,
Gong, and Rong (2015); and Zhang, Gong, Sun, and Guo (2017), which are explained in detail in Methodologies Based on
the Filter Group.

3.4 | Interaction with learning algorithm


Similar to SL data, ML-FS methods are generally categorized into three groups according to different selection strategies: fil-
ter, wrapper, and embedded methods. As most ML-FS methods are inspired by SL-FS methods, a brief description of the
most practical SL methods of these groups is given in the following.

3.4.1 | Filter methods


Filter methods evaluate the quality of features based on the characteristics of features and independently from the learning
algorithm. These methods rank the features based on some feature evaluation criteria and select the top high-ranked features.
The advantages of filter methods are their high generality and low computational cost. Therefore, these methods are the best
candidate for high-dimensional datasets. Filter methods are divided into two categories: univariate and multivariate methods.
Methods of the first group consider each feature individually, and therefore, feature dependencies are ignored, which may
lead to weaker classification performance compared to multivariate feature selection techniques. Information gain
(IG) (Li et al., 2014), chi-square (Olsson & Oard, 2006) and F-score (Ding, 2009) are categorized in this group. Multivariate
approaches consider the dependencies between features, but at the cost of reducing their scalability. Mutual information
(MI) (Lee, Park, & d’Auriol, 2012), ReliefF (Kira & Rendell, 1992), CFS (Hall, 1999), INTERACT (Zhao & Liu, 2007a),
mRMR (Peng et al., 2005), and FCBF (Yu & Liu, 2003) are some examples of this category. The above-mentioned methods
are some popular SL-FS methods that can potentially be employed as ML-FS methods by using data transformation strate-
gies. Some of these methods are frequently employed in ML-FS such as chi-square, IG, MI, and ReliefF, but other SL
methods are either not used or rarely used. Some of the most well-known SL filter methods are explained in the following:
Relief (Kira & Rendell, 1992) algorithm is a multivariate filter method based on random search, and is designed for
binary class problems without missing values (Reyes et al., 2015). The basic idea of Relief is to estimate features according
to their ability to distinguish between samples of the same class and samples of different classes. Relief assigns a weight to each
feature in the interval [−1,1], where large positive weights are assigned to better features. The most important strength of
Relief is that it considers the effect of interacting features; however, it does not discriminate between redundant features, and
a low number of training instances can mislead the algorithm. In order to handle multiclass problems and incomplete data, ReliefF
was proposed by Kononenko (1994).
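A sketch of the basic binary Relief update (a simplified version of our own: the nearest hit and miss are found with an L1 distance, and features are assumed scaled to [0, 1] so that weights stay in [−1, 1]):

```python
import numpy as np

def relief(X, y, n_iter=100, seed=0):
    """Binary Relief sketch: reward features that differ on the nearest miss
    and penalize features that differ on the nearest hit."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        i = rng.integers(N)
        same, diff = (y == y[i]), (y != y[i])
        same[i] = False                      # exclude the sampled instance itself
        hit = X[same][np.abs(X[same] - X[i]).sum(axis=1).argmin()]
        miss = X[diff][np.abs(X[diff] - X[i]).sum(axis=1).argmin()]
        w += np.abs(X[i] - miss) - np.abs(X[i] - hit)
    return w / n_iter
```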
Information gain is a univariate filter method based on the concept of entropy in information theory. It measures the
dependency between each feature of dataset D and the class label, as defined by Equation (5). It ranks features based on
their amount of information, in such a way that a higher value of IG for feature Xi indicates a stronger relationship between that
feature and the class label (Spolaôr et al., 2016).

IG(D, X_i) = \mathrm{entropy}(D) - \sum_{v \in X_i} \frac{|D_v|}{|D|} \, \mathrm{entropy}(D_v).   (5)

Here, feature Xi, i = 1, …, M, can take distinct values v, and each subset Dv ⊆ D consists of the set of examples where Xi
has the value v.
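A direct sketch of Equation (5) for discrete features (the helper names are ours):

```python
import numpy as np

def entropy(values):
    """Shannon entropy of a discrete vector."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y):
    """IG of discrete feature x with respect to class y, Equation (5)."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * entropy(y[mask])   # (|D_v| / |D|) * entropy(D_v)
    return gain
```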
FCBF is a multivariate filter approach presented in Yu and Liu’s (2003) paper, which is especially designed for high-
dimensional data. It considers feature-class correlations as well as feature-feature correlations to find a subset of features that
are highly correlated to the class but not highly correlated to the other features. It introduces a measure called symmetrical
uncertainty (SU) as the ratio between the IG and the entropy of two variables. First, it calculates the SU value for each fea-
ture and selects those features associated with SU values higher than a user-defined threshold. Then, redundant features are
removed from this subset and a subset of relevant informative features remains.
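Reusing the entropy and information_gain helpers sketched above, the SU measure at the core of FCBF can be written as (a sketch of the standard normalized form):

```python
def symmetrical_uncertainty(x, y):
    """SU = 2 * IG(x; y) / (H(x) + H(y)), normalized to [0, 1]."""
    return 2.0 * information_gain(x, y) / (entropy(x) + entropy(y))
```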
F-score is a criterion to evaluate the discriminative ability of features. Equation (6) shows how to calculate the F-score of
the ith feature. The numerator specifies the discrimination among the categories of the target variable, and the denominator
indicates the discrimination within each category. A larger F-score implies a greater likelihood that this feature is discrimina-
tive (Kashef & Nezamabadi-pour, 2015).
F\_score_i = \frac{\sum_{k=1}^{c} \left( \bar{x}_i^k - \bar{x}_i \right)^2}{\sum_{k=1}^{c} \frac{1}{N_i^k - 1} \sum_{j=1}^{N_i^k} \left( x_{ij}^k - \bar{x}_i^k \right)^2}, \quad i = 1, 2, \ldots, n,   (6)

where c is the number of classes and n is the number of features; N_i^k is the number of samples of feature i in class k (k = 1, 2, …, c; i = 1, 2, …, n); x_{ij}^k is the jth training sample for feature i in class k (j = 1, 2, …, N_i^k); \bar{x}_i is the mean value of feature i over all classes; and \bar{x}_i^k is the mean of the ith feature of the samples in class k.
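A direct sketch of Equation (6) for a single feature (at least two samples per class are required for the within-class term):

```python
import numpy as np

def f_score(x, y):
    """F-score of one feature vector x given class labels y, Equation (6)."""
    classes = np.unique(y)
    x_bar = x.mean()
    between = sum((x[y == k].mean() - x_bar) ** 2 for k in classes)
    within = sum(((x[y == k] - x[y == k].mean()) ** 2).sum() / (len(x[y == k]) - 1)
                 for k in classes)
    return between / within
```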

Chi-square is another common univariate feature selection method. The χ2 statistic is used in statistics to test the independence
of two events. More specifically, it is used in feature selection to test whether the occurrence of a specific feature Xi and the
occurrence of a specific label yj are independent. It assigns a weight to each feature, where higher weights indicate more
dependency between Xi and yj (Spolaôr & Tsoumakas, 2013).
CFS (Hall, 1999) is a multivariate filter algorithm that evaluates different subsets of features based on a correlation-based
heuristic function. CFS tries to find a small subset of relevant and nonredundant features by the following equation:
\mathrm{CFS\_score}_F = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1)\, \overline{r_{ff}}}},   (7)

where CFS_score_F is the heuristic score of subset F with k features, and \overline{r_{cf}} and \overline{r_{ff}} are the average feature-class and feature-feature correlations, respectively. The numerator reflects the predictive ability of the features, while the denominator indicates the redundancy among the features of F. To obtain the values of \overline{r_{cf}} and \overline{r_{ff}}, CFS uses the SU measure. To find a proper subset of features, CFS starts with an empty set and iteratively adds the feature with the greatest CFS score until a stopping criterion
is met.
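The merit function itself is simple to evaluate once the average correlations are known; a sketch of Equation (7) with our own naming:

```python
import numpy as np

def cfs_merit(r_cf, r_ff, k):
    """CFS heuristic score of a k-feature subset, Equation (7).
    r_cf: average feature-class correlation; r_ff: average feature-feature correlation."""
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# A low-redundancy subset scores higher than a redundant one
print(cfs_merit(r_cf=0.4, r_ff=0.1, k=5))   # ~0.756
print(cfs_merit(r_cf=0.4, r_ff=0.9, k=5))   # ~0.417
```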

3.4.2 | Wrapper methods


Wrapper methods select features with high prediction performance as estimated by a predetermined learning algorithm. In other
words, wrapper methods search for an optimal feature subset by performing the following two steps, iteratively: (a) search
for a feature subset and (b) evaluate the selected features. These steps are repeated until a stopping criterion is met (Li et al.,
2016). Based on the search strategy, wrapper methods are classified into sequential selection algorithms and heuristic search
algorithms. SFS method and SBE method belong to the first group. SFS (SBE) starts with an empty (a full) set, and adds
(eliminates) a feature that gives the greatest increase (lowest decrease) in predictor performance. This process is repeated
until the required number of features exists in the subset. Heuristic search algorithms try to find the optimal feature subset by
the help of evolutionary algorithms such as GA, PSO, GSA, and ACO. Here again, the fitness function is the predictor per-
formance (Chandrashekar & Sahin, 2014). Wrapper methods achieve higher accuracy than filter methods, but
their computational complexity for large datasets is very high (Kashef & Nezamabadi-pour, 2013).
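A sketch of an SFS wrapper around an arbitrary scikit-learn estimator, with cross-validated accuracy as the subset evaluation (our own naming):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def sfs_wrapper(X, y, estimator, n_features, cv=3):
    """Greedy SFS: repeatedly add the feature whose inclusion yields the
    largest cross-validated score of the wrapped estimator."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```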

3.4.3 | Embedded methods


Finally, embedded methods try to take advantage of both filter and wrapper approaches by exploiting their complementary
strengths. These methods combine feature selection as part of the training process to define the feature that has the best abil-
ity to differentiate among classes in each stage. Thus, embedded methods are more effective than filter approaches since they
involve interaction with the learning algorithm, and are superior to wrapper methods since they do not need to evaluate the
feature subsets iteratively (Li et al., 2016). Some examples of embedded methods can be found in the papers of Guyon, Wes-
ton, Barnhill, and Vapnik (2002) and Kashef and Nezamabadi-pour (2015), which are explained briefly in the next section.
Other types of FS techniques have been recently proposed such as clustering-based FS and ensemble-based FS methods
(Ebrahimpour & Eftekhari, 2017; Rouhi & Nezamabadi-pour, 2017).

3.5 | Data format perspective


Similar to ML classification, ML-FS methods are also separated into two categories: problem transformation methods and
algorithm adaptation methods.

3.5.1 | Problem transformation feature selection methods


The problem transformation methods transform the ML data into SL data, and employ any state-of-the-art SL-FS approach
to solve the problem. In fact, ML classification methods that are based on problem transformation strategy decompose the
ML data into one or several SL data through a specific process and employ SL classifiers for classification. Next, the reverse
of the process is done to transform the classified SL data into ML data. In ML-FS methods that use problem transformation
strategy, the decomposition step is similar to problem transformation-based classification. Here, the proper SL-FS method is
firstly applied on the SL data to determine the salient features to be selected. These features are then selected in the original
ML data and other features are removed. In this step, an ML classifier is employed to evaluate the performance of the
selected feature subset. This process is illustrated in Figure 4.
The most frequently used transformation strategies in ML-FS research are BR, LP, PPT, and ELA. The structure of
Figure 4 is appropriate for the ELA, LP, and PPT methods (PPT being an improved version of LP), which transform the ML
data into a single multiclass SL dataset. For the BR approach, which transforms the ML data into q SL datasets, an aggregation
strategy is needed to select the final feature subset from the lists of features selected by each SL-FS algorithm.

FIGURE 4 Multilabel feature selection using a transformation strategy (the ML data is converted into SL data by the transformation strategy; a SL feature selection method determines the selected features; the nonselected features are then eliminated from the original ML data, yielding the reduced ML data)

The aggregation process can be done in two ways: (a) defining the maximum number of features to be selected (e.g., r features) and selecting the top r features, or
(b) specifying a threshold value and selecting those features whose average or maximum score exceeds the
threshold. The specified features are then selected in the ML dataset and an ML classifier is employed to evaluate the
performance of the selected feature subset. This process is called the external approach to employing the BR transformation strat-
egy for the ML-FS problem in Pereira et al. (2016), and is shown in Figure 5. The internal approach, in contrast,
uses a SL classifier after the feature selection process for each transformed dataset; that is, both the feature selection and clas-
sification methods are SL ones. The results of the SL classifiers are then combined, similar to a BR method for ML classi-
fication; this internal process is shown in Figure 6. Most papers that use the BR approach as the transformation strategy follow the external strategy, and there are
few papers that use the internal strategy (Dendamrongvit, Vateekul, & Kubat, 2011). Therefore, from now on, we mean the external strategy when referring to
BR transformation.
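Conceptually, the external strategy reduces to building a feature-label score matrix and aggregating it; a minimal sketch with average-score aggregation (our own naming; score_fn can be any SL criterion, for example an information gain function):

```python
import numpy as np

def br_external_selection(X, Y, score_fn, r):
    """Score each feature against every label column with a SL criterion,
    aggregate by averaging, and keep the indices of the top r features."""
    M, q = X.shape[1], Y.shape[1]
    scores = np.array([[score_fn(X[:, f], Y[:, j]) for j in range(q)]
                       for f in range(M)])      # (M, q) feature-label score matrix
    return np.argsort(scores.mean(axis=1))[::-1][:r]
```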
Chen et al. (2007) proposed the ELA transformation strategy, which assigns entropy-based weights to the constructed instances,
as was described before. The authors transform the ML data into SL data using the ALA, LLA, SLA, NLA, and ELA
transformation strategies, and apply three SL filter methods: IG, CHI, and Optimal Orthogonal Centroid Feature Selection
(OCFS) (Yan et al., 2005).

FIGURE 5 Multilabel feature selection using the binary relevance transformation method in external form (the ML data is decomposed into q SL datasets; a SL feature selection method is run on each; an aggregation methodology determines the final selected features; the nonselected features are eliminated from the ML data; and an ML classifier is applied to the reduced ML data)

FIGURE 6 Multilabel feature selection using the binary relevance transformation method in internal form (a SL feature selection method and a SL classifier are applied to each of the q transformed SL datasets, and the SL outputs are combined through the inverse of the BR transformation to obtain the classified ML dataset)

Trohidis et al. (2008) applied LP to transform an ML music dataset into a SL one for
the task of music retrieval by emotion. The most salient features are then detected by the χ2 statistic. They found that LP + χ2
is the most effective strategy among several ML classification methods. Doquire and Verleysen (2011) employ a PPT
method that was introduced in Read’s (2008) study to transform ML datasets into SL one. Then, a greedy feature selection
procedure based on multidimensional MI is executed. The work by Doquire and Verleysen (2013a) extends preliminary
results that were presented in Doquire and Verleysen (2011) study and proposes a way to automatically select the pruning
parameter for PPT. A similar method is proposed in Reyes et al.’s (2015) study, which converts the ML problem to a SL
problem using PPT, and then utilizes the ReliefF algorithm for assigning weight to each feature. Spolaôr et al. (2013a) use
LP and BR to transform the problem, and then employ IG and ReliefF for feature selection. Finally, the performance of these
four ML-FS methods is compared to each other. Tsoumakas and Vlahavas (2007), that proposed RAKEL ML classifier,
reduced the number of features before employing their proposed classifier in order to lessen the computational cost of train-
ing process. They used the BR transformation in conjunction with the χ2 statistic to get a ranking of all features for each
label. Finally, the top 500 features with the greatest score over all labels were selected.
Text categorization falls within the domain of ML problems; for example, an article about Persepolis can be categorized
under different topics such as Iran, Culture, and History. The keyword “multilabel” may not be found in these studies, but they
actually deal with ML data. Research on the text categorization problem has a long history (Olsson & Oard, 2006;
Rogati & Yang, 2002). For example, Yang and Pedersen (1997) applied and compared several filter methods, including docu-
ment frequency, MI, IG, term strength, and the χ2 statistic, in a text categorization problem. They evaluated each label

individually, similar to the BR transformation strategy. Finally, the performance of each feature selection method is evaluated
by kNN and LLSF.
Gharroudi et al. (2014) discuss three wrapper ML-FS algorithms based on the Random Forest (RF) model, called BRRF,
RFLP (Random Forest Label Powerset), and RFPCT (Random Forest Predictive Clustering Tree). The first two methods
transform the ML data into SL data using BR and LP, respectively. Then, a RF is applied on the created SL dataset(s). The
experimental results show that BRRF acts better than RFLP, and the authors conclude that considering label dependency is
not significantly effective in ML-FS.

3.5.2 | Algorithm adaptation feature selection methods


Algorithm adaptation-based ML-FS methods generalize existing feature selection methods to handle ML data directly.

Methodologies based on the filter group


Three extensions of the well-known ReliefF method are proposed in Reyes et al.’s (2015) study including PPT-ReliefF,
ReliefF-ML, and RReliefF-ML. PPT-ReliefF, which uses the PPT transformation strategy, was introduced before. ReliefF-ML
which is initially presented in the other work of the authors (Pupo, Morell, & Soto, 2013) and RReliefF-ML which is based
on RReliefF (adaptation of ReliefF to regression problems) are two filter methods that manage ML data, directly.
Kong et al. (2012) present the MF-statistic and MReliefF algorithms for ML-FS. The strategy behind MReliefF is to first
transform the ML problem into a set of pairwise ML 2-class problems and to eliminate the cases where the “miss” and “hit”
sets contain both classes. For the ML F-statistic, label information is used for calculating the between-class and within-class
scatter matrices.
Spolaôr, Cherman, Monard, and Lee (2013b) adapted the ReliefF measure to ML data; their method uses the hamming distance
between label sets as the dissimilarity function for detecting the nearest samples and as a partial weight when computing feature importance.
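A compact sketch in the spirit of this adaptation is shown below; it is a hypothetical simplification, not Spolaôr et al.'s exact procedure. The hamming distance between label sets replaces the crisp hit/miss decision of standard ReliefF, so each neighbor contributes to a feature's weight in proportion to how strongly the two instances' label sets disagree.

```python
# Hypothetical ML-ReliefF-style weighting: neighbours whose label sets are
# more distant (Hamming) than average act like "misses", closer ones like "hits".
import numpy as np

def ml_relieff(X, Y, n_neighbors=10, n_probes=200, seed=0):
    rng = np.random.default_rng(seed)
    n, M = X.shape
    W = np.zeros(M)
    for i in rng.choice(n, size=min(n_probes, n), replace=False):
        d = np.linalg.norm(X - X[i], axis=1)           # distances in feature space
        nn = np.argsort(d)[1:n_neighbors + 1]          # nearest neighbours (skip self)
        label_dist = np.mean(Y[nn] != Y[i], axis=1)    # Hamming distance of label sets
        diff = np.abs(X[nn] - X[i])                    # per-feature differences
        # features differing on label-distant neighbours gain weight,
        # features differing on label-close neighbours lose weight
        W += diff.T @ (label_dist - label_dist.mean())
    return W                                           # higher weight = more salient
```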
An ML-FS method called PMU, based on multivariate MI, is introduced in Lee and Kim’s (2013) study. In this method,
the best features are specified by an incremental selection strategy that maximizes the multivariate MI between the selected
features and the labels. PMU is the first ML filter feature selection method that considers label interactions in measuring the
dependency of the given features. A fast ML-FS method based on information-theoretic feature ranking is presented in Lee and
Kim’s (2015a) study. This method speeds up the search process by scrapping dispensable calculations and identifying important
label combinations.
An extension of the well-known SL filter method FCBF, described earlier, is proposed in the paper by Lastra
et al. (2011). This method, called MLFR (ML feature ranker), uses a graphical scheme to demonstrate the relationship
between features and labels. It is designed for discrete datasets due to the use of the SU criterion to evaluate the
features.
The well-known filter feature selection measure, IG, is adapted to the ML model in the papers of Li et al. (2014)
and Pereira, Plastino, Zadrozny, and Merschmann (2015). Pereira et al. (2015) used the adapted entropy calculation proposed
by Clare and King (2001) to calculate a multilabel IG for each feature. Features are then sorted according to their scores,
and a specified number of top features is selected. In Li et al.’s (2014) study, an ML-FS technique based on IG (IGMF) is
presented. At first, the IG between every feature and the label set is computed to discover the importance of each feature.
The optimal features are then obtained using a threshold value.
Kong and Philip (2012) present an ML-FS method for the graph classification problem, named gMLC, where samples are
graphs. It searches for an optimal set of subgraph features utilizing the multiple labels of the graphs. At first, the correlation between
subgraph features and the multiple labels of the graphs is calculated. Then, a branch-and-bound algorithm is proposed to prune the
subgraph search space and find the optimal subgraph features.
In Jungjit et al.’s (2013) study, the ML-CFS method is improved in two ways. In the first, the correlation between features and
labels is computed, and the absolute values of the correlation coefficients are used in the merit function of ML-CFS,
which measures the quality of a candidate feature subset. In the second, MI measures the
correlation between each pair of labels and modifies the merit function to account for label dependencies. A hill-climbing
algorithm is utilized to maximize the merit function; this is a filter method that uses a greedy search strategy.
A filter multilabel correlation-based feature selection based on GA (GA-ML-CFS) is proposed in Jungjit and Freitas’s
(2015a) study, which uses the merit function of ML-CFS with absolute values of the correlation coefficients as the fitness function
for the GA. An improvement of GA-ML-CFS, called LexGA-ML-CFS, is presented in Jungjit and Freitas’s (2015b) study, which
employs lexicographic multiobjective GAs to search for the optimal feature subset. Lexicographic multiobjective algorithms
are used where different priorities should be assigned to different objectives. In this study, improving accuracy and
decreasing the number of features are the objectives with the first and second priority, respectively. To do so, the
fitness function presented in Jungjit and Freitas’s (2015a) study is employed as the first objective, and the number of features
is utilized as the second objective. Overall, however, there was no statistically significant difference between the results of
the proposed LexGA-ML-CFS method and the other methods.
Lin et al. (2015) proposed an incremental ML-FS method based on the max-dependency and min-redundancy (MDMR) criterion,
inspired by the well-known SL filter method, minimum redundancy maximum relevance (mRMR). More precisely,
features are selected one by one based on their calculated scores. Assuming a selected feature subset, the score of remaining
features is calculated according to θ = D − R, where D is related to the relevance of a feature with labels, and R defines the
redundancy of that feature with the selected feature subset. At each step, θ is calculated for all remaining features, and the
feature with the highest score is selected. This process is repeated until the desired number of features is selected.
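The selection loop can be sketched as follows (a hedged reconstruction assuming discrete features; the exact MI estimators used in MDMR differ). Here the relevance term D is approximated by the summed MI between a feature and each label, and R is the mean MI between the candidate and the already selected features.

```python
# Sketch of incremental max-dependency/min-redundancy selection (theta = D - R).
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_mdmr(X, Y, n_select):
    """X: discrete features, shape (n, M); Y: binary labels, shape (n, q)."""
    M, q = X.shape[1], Y.shape[1]
    # D: relevance of each feature to the label set (approximated by a sum)
    D = np.array([sum(mutual_info_score(X[:, f], Y[:, j]) for j in range(q))
                  for f in range(M)])
    selected, remaining = [], set(range(M))
    while len(selected) < n_select and remaining:
        theta = {}
        for f in remaining:
            R = (np.mean([mutual_info_score(X[:, f], X[:, s]) for s in selected])
                 if selected else 0.0)             # redundancy with the chosen subset
            theta[f] = D[f] - R                    # theta = D - R, as described above
        best = max(theta, key=theta.get)           # the highest-scoring feature wins
        selected.append(best)
        remaining.remove(best)
    return selected
```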
An ML-FS approach that selects salient features based on ML neighborhood MI is proposed in Lin et al.’s (2016) study.
At first, all instances are granulated under different labels using the margin of the instances, and three different neighborhood
MIs for ML learning are defined. Then, an optimization objective function is introduced to measure the quality of candidate
features.
A granular ML-FS method with a maximal-relevance and minimal-redundancy measure is proposed in Li et al.’s (2017)
study, which considers local label correlations. At first, the labels are abstracted into information granules based on
their correlations. Then, the method selects the feature that has the greatest correlation with each label in an information granule
while having minimal correlation with the already chosen features.

Methodologies based on the wrapper group


As mentioned earlier, Gharroudi et al. (2014) proposed three wrapper feature selection methods based on the RF model. Two of
these methods utilize a transformation strategy before employing RF. The third method, RFPCT, handles the ML data
directly.
Zhang et al. (2009) propose an adaptation of the traditional naïve Bayes classifier to ML datasets, called MLNB. Next,
a feature extraction method based on principal component analysis (PCA) and a wrapper feature selection method based on
GA are incorporated into the method to overcome its defects and improve its performance. The fitness
function of the GA includes both hamming loss and ranking loss.
In Yu and Wang’s (2014) study, a supervised ML-FS algorithm based on MI and GA is introduced. In the first
step, MI is utilized to select features locally. Then, based on the results of this step, the GA selects the globally optimal feature
subset.
The mentioned GA-based algorithms consume much time to reach the optimum and may result in premature convergence.
To overcome these problems, Lee and Kim (2015b) propose a memetic algorithm based on GA for improving the quality of
the obtained solutions and increasing the search speed. To do so, a local refinement method, designed specifically for the
ML problem, is applied to the solutions obtained by the GA. Local refinement is exerted on the feature subset with the
best fitness value, in such a way that the dependency between features and labels is maximized as the search progresses.
As the single-objective fitness function, three evaluation metrics are used independently: hamming loss, ML
accuracy, and subset accuracy. Also, MLNB is used for ML classification. As declared in the paper, this is the first work
employing a memetic algorithm for ML-FS.
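Common to these GA-based wrappers is the way a candidate solution is evaluated: a binary feature mask is scored by training an ML classifier on the masked features and measuring an ML loss. A hedged sketch of such a fitness function is given below, using ML-kNN from the third-party scikit-multilearn package in place of the MLNB classifier used in the cited studies.

```python
# Sketch of a wrapper fitness evaluation for a binary feature mask.
import numpy as np
from sklearn.metrics import hamming_loss
from skmultilearn.adapt import MLkNN

def fitness(mask, X_train, Y_train, X_val, Y_val):
    cols = np.flatnonzero(mask)                    # indices of selected features
    clf = MLkNN(k=10)
    clf.fit(X_train[:, cols], Y_train)
    pred = clf.predict(X_val[:, cols]).toarray()   # predictions come back sparse
    return hamming_loss(Y_val, pred)               # lower fitness is better
```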
A hybrid optimization method for the ML-FS problem called HOML is presented in Shao et al.’s (2013) study, which employs
a combination of simulated annealing, GA, and a hill-climbing greedy algorithm with the aim of finding the optimal feature
subset. These methods are applied hierarchically to the features, such that the output of one method is the input of the next,
and at each stage a number of features are eliminated. In this wrapper method, simulated annealing is used in the
first step to guide the global search. In the second step, GA is adopted to avoid being trapped in local optima, and finally
hill climbing is applied for its ability in local search. Average precision is utilized as the fitness function, coupled with ML-kNN,
Rank-SVM, BP-MLL, and MLNB-BASIC. The results demonstrate better performance, but as a wrapper approach,
the computational complexity is high.
A wrapper ML multiobjective feature selection algorithm based on the PSO algorithm is introduced in Zhang et al.’s (2017) study.
In this method, a probability-based encoding strategy is utilized to represent each particle, adapting the problem to the
PSO algorithm. To improve the performance of PSO, adaptive uniform mutation and a local learning strategy are also
added. Moreover, hamming loss, as the ML classification error, and the number of features are considered as the objectives of the
fitness function, and an archive is introduced for saving the optimal solutions achieved by the swarm during the iterations.
ML-kNN is used as the classifier for evaluation.
A DE-based ML multiobjective feature selection algorithm is presented in Zhang et al.’s (2015) study. The proposed
method uses both the classification performance and the number of features as fitness functions and applies the ideas of effi-
cient nondominated sort, the crowding distance, and the Pareto dominance relationship to DE for finding a Pareto
solution set.
16 of 29 KASHEF ET AL.

Methodologies based on the embedded group


As discussed before, in embedded methods, the procedure of searching for a good feature subset is embedded in the classifier
construction process. Decision tree-based ML classifiers are examples of this category. For instance, in the ML C4.5 classifier
proposed in Clare and King’s (2001) study, the feature that best classifies the training examples is selected for each node.
After constructing the tree, the salient features are placed at the top of the tree while worthless features are at the bottom
and can be eliminated from the feature set. Other examples of decision tree-based ML classifiers that embed the feature
selection process can be found in the papers of Chou and Hsu (2005) and Noh, Song, and Park (2004).
Gu et al. (2011) proposed an embedded ML-FS method based on the multilabel learning algorithm label rank support
vector machine (LaRank SVM) (Elisseeff & Weston, 2001). This method, called CMLFS (correlated multilabel
feature selection), models the label relationships for FS according to LaRank SVM. The core idea is to
discover a feature subset such that the label-correlation-regularized loss of label ranking is minimized. CMLFS considers
label correlation, but its computational cost is high due to optimizing a set of parameters of the feature selection process to
adjust the kernel function of the ML learning algorithm.
You, Liu, Li, and Chen (2012) presented a feature selection approach embedded into ML classification, called multilabel
embedded feature selection (MEFS). MEFS adopts an evaluation criterion named the prediction risk criterion for evaluating
features and uses a backward search strategy for exploring feature subsets. Also, label correlation is taken into consideration
in MEFS.
Li et al. (2010) introduced a semisupervised ML learning algorithm, named COMN, by incorporating ML-kNN with
cotraining-style algorithms, which are powerful semisupervised learning techniques. Then, an embedded feature selection
algorithm, called PRECOMN, is proposed that performs feature selection for COMN. It employs a sequential backward search
for exploring feature subsets and uses the prediction risk measure to evaluate them.
Liu, Zhang, and Wu (2014) proposed a framework for ML learning that can manage classification learning and feature
selection simultaneously. In their method, named MLSLR (ML learning via sparse logistic regression), logistic regression
is exploited to train models for classification on ML data. To perform feature selection, an elastic net penalty is
imposed on the logistic regression model, where the ℓ1-norm penalty yields a solution that detects and eliminates irrelevant
features and the ℓ2-norm penalty ensures that highly correlated features have similar regression coefficients. MLSLR can
only capture linear relations between features; detecting nonlinear dependencies of features is mentioned by the authors as
future work.
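A hedged approximation of the MLSLR idea can be written with off-the-shelf tools: one elastic-net-penalized logistic regression per label (binary relevance style), discarding features whose coefficients are near zero across all label models. This is not the authors' joint formulation, only an illustration of how the combined ℓ1/ℓ2 penalty induces feature selection.

```python
# Elastic-net logistic regression per label; features with (near-)zero
# coefficients in every label model are eliminated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def elastic_net_ml_selection(X, Y, l1_ratio=0.5, C=1.0, tol=1e-6):
    model = OneVsRestClassifier(LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=l1_ratio,
        C=C, max_iter=5000))
    model.fit(X, Y)                                    # one model per label (BR style)
    coef = np.vstack([e.coef_.ravel() for e in model.estimators_])
    return np.flatnonzero(np.abs(coef).max(axis=0) > tol)  # surviving feature indices
```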

4 | SETUP AND BENCHMARKING

In this section, issues related to comparing the performance of algorithms are discussed. To this end, the standard ML datasets
and evaluation measures are reviewed. Next, the most frequently used nonparametric tests are investigated. Finally, various
software tools developed for the ML task are introduced.

4.1 | Performance evaluation criteria


In SL supervised learning algorithms, traditional measures such as accuracy, recall, precision, and F-measure are used to
evaluate the performance. However, such metrics cannot be used for ML cases because each ML sample can be associated
with more than one label simultaneously. Consequently, some ML-specific evaluation measures have been proposed. Zhang
and Zhou (2014) give a comprehensive taxonomy of ML evaluation measures, which is presented here. According to this taxonomy,
the evaluation metrics fall under two categories: example-based metrics and label-based metrics. In example-based
metrics, the performance of the learning algorithm is first evaluated on each test sample separately, and then the mean
value is calculated over all the test samples. On the other hand, label-based metrics evaluate the performance of the learning
algorithm on each class label separately and then calculate the macro/micro average value. Both example-based and
label-based methods are further divided into ranking-based and classification-based approaches. Ranking
approaches are described based on a real-valued function f(·, ·) giving the quality of the ranking for each label, while
classification-based approaches use the ML classifier h(·). Figure 6 shows this categorization of the evaluation
metrics. In the following, the measures of each group are explained in detail. Let T = {(x_i, Y_i), i = 1, …, p} be a given test set,
where Y_i ⊆ L is the correct label subset and Z_i ⊆ L is the predicted label set corresponding to x_i. Also, let f(x, y) denote
the score assigned to label y for sample x.

• Example-based measures
KASHEF ET AL. 17 of 29

FIGURE 7 Diagram of Nemenyi’s post-hoc test in terms of (a) accuracy and (b) hamming loss criteria. Methods that are not significantly different based on the critical difference (CD) (at α = .05) are connected

1. Hamming loss
Hamming loss calculates the percentage of misclassified labels, that is, cases where a sample is associated with a wrong label or a label
belonging to the true label set is not predicted (Cherman, Spolaôr, Valverde-Rebaza, & Monard, 2015).

$$\mathrm{Hamming\ loss}(h,T)=\frac{1}{p}\sum_{i=1}^{p}\frac{|Y_i\,\Delta\,Z_i|}{|L|},\tag{8}$$

where Δ is the symmetric difference between two sets. Hamming loss calculates the percentage of labels whose relevance is
not predicted correctly.

2. Subset accuracy
Subset accuracy calculates the ratio of correctly classified test samples to the cardinality of the test set. The predicted label set
must be identical to the original label set for a sample to be considered correctly classified, which makes this measure
strict. Subset accuracy is the ML counterpart of the SL accuracy measure.

$$\mathrm{Subset\ accuracy}(h)=\frac{1}{p}\sum_{i=1}^{p}[\![\,Z_i=Y_i\,]\!].\tag{9}$$

3. Accuracy
This measure, which calculates the correctly predicted labels among all true and predicted labels, is determined as follows:

$$\mathrm{Accuracy}(h,T)=\frac{1}{p}\sum_{i=1}^{p}\frac{|Y_i\cap Z_i|}{|Y_i\cup Z_i|}.\tag{10}$$

Accuracy seems to be a more balanced criterion and a better representative of an algorithm’s actual predictive performance
for most classification problems compared to hamming loss (El Kafrawy et al., 2015).

4. One error
This measure counts the times the top-ranked label is not relevant:

$$\mathrm{one\text{-}error}(f)=\frac{1}{p}\sum_{i=1}^{p}\Big[\!\!\Big[\,\arg\max_{y\in Y}f(x_i,y)\notin Y_i\,\Big]\!\!\Big].\tag{11}$$

5. Coverage
Coverage evaluates the average number of steps to move down in the list of ranked labels to cover all the relevant labels of a sample.

$$\mathrm{Coverage}(f)=\frac{1}{p}\sum_{i=1}^{p}\max_{y\in Y_i}\,\mathrm{rank}_f(x_i,y)-1,\tag{12}$$

where $\mathrm{rank}_f(x_i, y)$ denotes the rank of $y$ in $Y$ based on the descending order induced by $f$.
6. Ranking loss
Ranking loss counts the average fraction of reversely ordered pairs; that is, an irrelevant label is ranked higher than a relevant
label.

$$\mathrm{Ranking\ loss}(f)=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i|\,|\bar{Y}_i|}\Big|\big\{(y',y'')\;\big|\;f(x_i,y')\le f(x_i,y''),\,(y',y'')\in Y_i\times\bar{Y}_i\big\}\Big|,\tag{13}$$

where $\bar{Y}_i$ denotes the complementary set of $Y_i$ in $L$.

7. Average precision
The average precision determines the average percentage of relevant labels that are ranked above a particular relevant
label $y \in Y_i$.

$$\mathrm{Avg\text{-}prec}(f)=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i|}\sum_{y\in Y_i}\frac{\big|\{y'\mid\mathrm{rank}_f(x_i,y')\le\mathrm{rank}_f(x_i,y),\,y'\in Y_i\}\big|}{\mathrm{rank}_f(x_i,y)}.\tag{14}$$
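As a concrete illustration, the classification-based example-based measures above translate directly into code; the sketch below assumes 0/1 indicator matrices Y_true and Y_pred of shape (p, |L|) (the ranking-based measures additionally require the score function f).

```python
# Direct translations of Equations (8)-(10) for binary indicator matrices.
import numpy as np

def hamming_loss_ml(Y_true, Y_pred):                  # Eq. (8)
    return np.mean(Y_true != Y_pred)

def subset_accuracy(Y_true, Y_pred):                  # Eq. (9)
    return np.mean(np.all(Y_true == Y_pred, axis=1))

def example_accuracy(Y_true, Y_pred):                 # Eq. (10)
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    # convention assumed here: a sample with empty true and predicted sets scores 1
    return np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1)))
```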

• Label-based metrics

For the label $y_j$, the numbers of true positive ($TP_j$), true negative ($TN_j$), false positive ($FP_j$), and false negative ($FN_j$) test samples can be calculated as:

$$TP_j=\big|\{x_i\mid y_j\in Y_i\wedge y_j\in h(x_i),\,1\le i\le p\}\big|,\tag{15}$$

$$TN_j=\big|\{x_i\mid y_j\notin Y_i\wedge y_j\notin h(x_i),\,1\le i\le p\}\big|,\tag{16}$$

$$FP_j=\big|\{x_i\mid y_j\notin Y_i\wedge y_j\in h(x_i),\,1\le i\le p\}\big|,\tag{17}$$

$$FN_j=\big|\{x_i\mid y_j\in Y_i\wedge y_j\notin h(x_i),\,1\le i\le p\}\big|.\tag{18}$$

Based on the above definitions, $TP_j+TN_j+FP_j+FN_j=p$. Accuracy, Precision, Recall, and $F^{\beta}$ are four classification
metrics that are defined using $TP_j$, $TN_j$, $FP_j$, and $FN_j$ as:

$$\mathrm{Accuracy}(TP_j,TN_j,FP_j,FN_j)=\frac{TP_j+TN_j}{TP_j+TN_j+FP_j+FN_j},\tag{19}$$

$$\mathrm{Precision}(TP_j,TN_j,FP_j,FN_j)=\frac{TP_j}{TP_j+FP_j},\tag{20}$$

$$\mathrm{Recall}(TP_j,TN_j,FP_j,FN_j)=\frac{TP_j}{TP_j+FN_j},\tag{21}$$

$$F^{\beta}(TP_j,TN_j,FP_j,FN_j)=\frac{(1+\beta^{2})\,TP_j}{(1+\beta^{2})\,TP_j+\beta^{2}\,FN_j+FP_j}.\tag{22}$$

If $B(TP_j,TN_j,FP_j,FN_j)$ denotes one of these four functions, the label-based classification measures, including macro-averaging and micro-averaging, are defined as follows ($q$ is the number of all labels):

1. Macro-averaging
$$B_{\mathrm{macro}}(h)=\frac{1}{q}\sum_{j=1}^{q}B(TP_j,TN_j,FP_j,FN_j).\tag{23}$$

2. Micro-averaging
$$B_{\mathrm{micro}}(h)=B\left(\sum_{j=1}^{q}TP_j,\;\sum_{j=1}^{q}TN_j,\;\sum_{j=1}^{q}FP_j,\;\sum_{j=1}^{q}FN_j\right).\tag{24}$$

Also, two label-based ranking metrics, AUCmacro and AUCmicro, are defined as:
3. AUCmacro

$$AUC_{\mathrm{macro}}=\frac{1}{q}\sum_{j=1}^{q}\frac{\big|\{(x',x'')\mid f(x',y_j)\ge f(x'',y_j),\,(x',x'')\in Z_j\times\bar{Z}_j\}\big|}{|Z_j|\,|\bar{Z}_j|},\tag{25}$$

where $Z_j=\{x_i\mid y_j\in Y_i,\,1\le i\le p\}$ is the set of test samples with label $y_j$ and $\bar{Z}_j$ is its complement, that is, the set of test samples without label $y_j$.

4. AUCmicro

$$AUC_{\mathrm{micro}}=\frac{\big|\{(x',x'',y',y'')\mid f(x',y')\ge f(x'',y''),\,(x',y')\in S^{+},\,(x'',y'')\in S^{-}\}\big|}{|S^{+}|\,|S^{-}|},\tag{26}$$

where $S^{+}=\{(x_i,y)\mid y\in Y_i,\,1\le i\le p\}$ is the set of relevant sample-label pairs and $S^{-}=\{(x_i,y)\mid y\notin Y_i,\,1\le i\le p\}$ is the set of irrelevant sample-label pairs.
Among the mentioned example-based measures, smaller values indicate better performance for all criteria except average
precision and accuracy. Also, all measures are normalized to a number between 0 and 1, except for coverage. Moreover, for all
the mentioned label-based metrics, a larger metric value indicates better performance, with an optimal value of 1.
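To make the macro/micro distinction concrete, the sketch below computes macro- and micro-averaged F1 (B = F with β = 1) from the per-label counts of Equations (15)–(18); scikit-learn's f1_score with average="macro" or average="micro" yields the same values on indicator matrices and can serve as a cross-check.

```python
# Macro- vs micro-averaged F1 from per-label TP/FP/FN counts
# (Y_true and Y_pred are 0/1 indicator matrices of shape (p, q)).
import numpy as np

def macro_micro_f1(Y_true, Y_pred):
    tp = np.logical_and(Y_true == 1, Y_pred == 1).sum(axis=0).astype(float)
    fp = np.logical_and(Y_true == 0, Y_pred == 1).sum(axis=0).astype(float)
    fn = np.logical_and(Y_true == 1, Y_pred == 0).sum(axis=0).astype(float)
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1)   # F1 of each label
    macro = per_label.mean()                               # Eq. (23): average B over labels
    micro = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)  # Eq. (24)
    return macro, micro
```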
Besides the given metrics, another parameter, called average feature reduction, $F_r$, can be defined for the evaluation and
comparison of feature selection methods; it measures the rate of feature reduction and is defined as:

$$F_r=\frac{M-r}{M},\tag{27}$$

where $M$ is the total number of features and $r$ is the number of features selected by the FS algorithm. The closer $F_r$ is to
1, the more features are reduced and the lower the classifier’s complexity (Kashef & Nezamabadi-pour, 2015).

4.2 | Standard datasets


The standard ML datasets can mainly be found at http://mulan.sourceforge.net/datasets.html (URL 1), http://meka.
sourceforge.net/#datasets (URL 2), http://cse.seu.edu.cn/people/zhangml/Resources.htm#data (URL 3), and http://www.keel.
es/ (URL 4). The datasets in these repositories are in .arff format, except those at URL 4, which are in .dat format. In addition, URL
4 provides text categorization datasets along with 5-fcv and 10-fcv (fivefold and tenfold cross-validation) partitionings of all of
them. As many researchers work with Matlab, we converted most of these datasets into .mat format; they are available to the
community at https://www.researchgate.net/project/Multi-label-Datasets-mat-format. Also, there are various datasets in the text
categorization field which can be used in ML tasks; a collection of these datasets is given in the study of Reyes, Morell, and Ventura (2016).
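For readers working in Python, the third-party scikit-multilearn package mirrors the MULAN repository (URL 1) and can fetch several of these benchmarks directly; the snippet below is a hedged example (dataset and split names follow that library, and the files are downloaded on first use).

```python
# Loading the "scene" benchmark through scikit-multilearn.
from skmultilearn.dataset import load_dataset

X_train, Y_train, feature_names, label_names = load_dataset("scene", "train")
X_test, Y_test, _, _ = load_dataset("scene", "test")
print(X_train.shape, Y_train.shape)   # expected: (1211, 294) and (1211, 6)
```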
Table 1 summarizes the characteristics of some of the most frequently used ML datasets in research, sorted by the number of
features: dataset name, number of features, number of samples, number of labels (|L| = q), label cardinality (LC), label
density, label diversity, feature type, domain, and source URL; LC and LD are explained in Section 2.1.

4.3 | Nonparametric statistical tests


Nonparametric tests are tools that can determine the differences in treatments of multiple algorithms; therefore, they are
useful for the evaluation of ML-FS methods. In most of the published papers, the Friedman test is employed to compare the
algorithms, although the Wilcoxon test is also used in some other references (Lee & Kim, 2015b). The Friedman test, at its first step,
assumes that the algorithms are equivalent in their performance (the null hypothesis). If this hypothesis is rejected, it means that
a difference exists among the algorithms. In other words, the Friedman test determines the average ranks for displaying significant
differences (García, Fernández, Luengo, & Herrera, 2010). To analyze whether the differences in the performance of the
compared algorithms are statistically significant, Nemenyi’s post-hoc test is used in many papers. This test declares a
significant difference between two algorithms whenever their average ranks differ by more than a critical difference
(CD). Moreover, the results of this test can be visualized using a simple diagram (Demšar, 2006).
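This protocol can be sketched as follows, using SciPy for the Friedman test and the third-party scikit-posthocs package for Nemenyi's test (both assumed installed); the score matrix below is a random placeholder standing in for real per-dataset results.

```python
# Friedman test followed by Nemenyi's post-hoc test on a (datasets x methods)
# matrix of scores (placeholder data).
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

scores = np.random.rand(10, 4)            # 10 datasets, 4 compared methods
stat, p = friedmanchisquare(*scores.T)    # one argument per method
print(f"Friedman statistic = {stat:.3f}, p = {p:.4f}")
if p < 0.05:                              # null hypothesis rejected
    print(sp.posthoc_nemenyi_friedman(scores))   # pairwise p-values
```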

4.4 | Software for ML classification and feature selection


As discussed before, due to the rapid growth of ML data and applications, a considerable variety of ML learning and feature
selection algorithms have been proposed, and a number of implementations of these algorithms exist in Matlab and R.
For example, the Matlab implementations of many learning algorithms, such as ML-kNN and MLNB, are available to the
community at http://lamda.nju.edu.cn/Data.ashx and http://cse.seu.edu.cn/people/zhangml/Resources.htm.
TABLE 1 Characteristics of the most frequently used multilabel datasets

| Dataset | No. of features | No. of samples | No. of labels | Label cardinality | Label density | Label diversity | Type | Domain | URL |
|---|---|---|---|---|---|---|---|---|---|
| Flags | 19 | 194 | 7 | 3.392 | 0.485 | 54 | Nominal + numeric | Images (toy) | URL 1 |
| CAL500 | 68 | 502 | 174 | 26.044 | 0.15 | 502 | Numeric | Music | URL 1 |
| Emotions | 72 | 593 | 6 | 1.869 | 0.311 | 27 | Numeric | Music | URL 1, URL 4 |
| Yeast | 103 | 2,417 | 14 | 4.237 | 0.303 | 198 | Numeric | Biology | URL 1, URL 4 |
| Mediamill | 120 | 43,907 | 101 | 4.376 | 0.043 | 6,555 | Numeric | Video | URL 1, URL 4 |
| Birds | 260 | 645 | 19 | 1.014 | 0.053 | 133 | Nominal + numeric | Audio | URL 1 |
| Scene | 294 | 2,407 | 6 | 1.074 | 0.179 | 15 | Numeric | Image | URL 1, URL 4 |
| Image | 294 | 2,000 | 5 | 1.236 | 0.247 | 20 | Numeric | Images | URL 3 |
| corel5k | 499 | 5,000 | 374 | 3.522 | 0.009 | 3,175 | Nominal | Images | URL 1, URL 4 |
| corel16k | 500 | 13,811 ± 87 | 161 ± 9 | 2.867 ± 0.033 | 0.018 ± 0.001 | 4,937 ± 158 | Nominal | Images | URL 1 |
| Delicious | 500 | 16,105 | 983 | 19.02 | 0.019 | 15,806 | Nominal | Text (web) | URL 1, URL 4 |
| Ohsumed | 1,002 | 13,929 | 23 | 1.663 | 0.072 | 1,147 | Nominal | Text | URL 2, URL 4 |
| Language log | 1,004 | 1,460 | 75 | 1.18 | 0.016 | 286 | Nominal | Music | URL 2 |
| Slashdot | 1,079 | 3,782 | 22 | 1.181 | 0.054 | 156 | Nominal | Text | URL 2 |
| Genbase | 1,186 | 662 | 27 | 1.252 | 0.046 | 32 | Nominal | Biology | URL 1, URL 4 |
| Medical | 1,449 | 978 | 45 | 1.245 | 0.028 | 94 | Nominal | Text | URL 1, URL 4 |
| Bibtex | 1,836 | 7,395 | 159 | 2.402 | 0.015 | 2,856 | Nominal | Text | URL 1, URL 4 |
| Bookmarks | 2,150 | 87,856 | 208 | 2.028 | 0.01 | 18,716 | Nominal | Text | URL 1, URL 4 |
| EUR-Lex (directory codes) | 5,000 | 19,348 | 412 | 1.292 | 0.003 | 1,615 | Numeric | Text | URL 1 |
| EUR-Lex (subject matters) | 5,000 | 19,348 | 201 | 2.213 | 0.011 | 2,504 | Numeric | Text | URL 1 |
| EUR-Lex (eurovoc descriptors) | 5,000 | 19,348 | 3,993 | 5.31 | 0.001 | 16,467 | Numeric | Text | URL 1 |
| Enron | 10,010 | 1,702 | 53 | 3.378 | 0.064 | 753 | Nominal | Text | URL 1, URL 2, URL 4 |
| rcv1v2 (subset4) | 47,229 | 6,000 | 101 | 2.484 | 0.025 | 816 | Numeric | Text | URL 1 |
| rcv1v2 (subset5) | 47,235 | 6,000 | 101 | 2.642 | 0.026 | 946 | Numeric | Text | URL 1 |
| rcv1v2 (subset1) | 47,236 | 6,000 | 101 | 2.88 | 0.029 | 1,028 | Numeric | Text | URL 1 |
| rcv1v2 (subset2) | 47,236 | 6,000 | 101 | 2.634 | 0.026 | 954 | Numeric | Text | URL 1 |
| rcv1v2 (subset3) | 47,236 | 6,000 | 101 | 2.614 | 0.026 | 939 | Numeric | Text | URL 1 |
| tmc2007 | 49,060 | 28,596 | 22 | 2.158 | 0.098 | 1,341 | Nominal | Text | URL 1 |
| NUS-WIDE | 269,648 | 269,648 | 81 | 1.869 | 0.023 | 18,430 | Numeric | Images | URL 1 |
| Yahoo | 32,786 ± 7,990 | 5,423 ± 1,259 | 31 ± 6 | 1.481 ± 0.154 | 0.051 ± 0.012 | 321 ± 139 | Numeric | Text | URL 1 |

Several R packages are also developed for different purposes: the utiml1 package for ML learning, the MLPUGS2 package for ML prediction using
Gibbs sampling and classifier chains, the mldr3 package (Charte & Charte, 2015), which contains exploratory data analysis and
manipulation tools for ML data, and the mldr.datasets4 package (Charte, Charte, Rivera, del Jesus, & Herrera, 2016), which contains the R
ultimate ML dataset repository.
Also, two dedicated ML libraries, MEKA5 (Read, Reutemann, Pfahringer, & Holmes, 2016) and MULAN6
(Tsoumakas, Spyromitros-Xioufis, Vilcek, & Vlahavas, 2011), both based on WEKA7 (Hall et al., 2009), have been introduced.
MEKA is an open-source Java framework based on the famous WEKA library. It contains all the basic problem transformation
methods, such as different varieties of classifier chains (Read et al., 2011), many of the advanced methods investigated
by Madjarov, Kocev, Gjorgjevikj, and Džeroski (2012), as well as some algorithm adaptation methods such as ML
neural networks. Also, different evaluation criteria and tools for ML experiments and development are available. Note that
MEKA offers both a command line interface (CLI) and a graphical user interface (GUI). Its associated repository8 contains
more than 20 ML datasets in ARFF format.
MULAN is an open-source Java package for learning from ML data, based on WEKA. It only offers a programmatic
API to library users; there is no GUI. MULAN contains many state-of-the-art ML learning, label ranking,
and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. It also offers
an evaluation framework that calculates ML evaluation measures through cross-validation and hold-out evaluation. Its repository
contains over 25 ML datasets in ARFF format.
As a simple approach to the ML-FS problem is to use SL methods via data transformation methods such as LP and BR, having
implementations of practical SL filter methods is helpful. To this end, the fspackage9 (Liu, 2010), a WEKA-based package
containing most filter methods such as IG, FCBF, CFS, and chi-square, is a proper choice. This package calls
WEKA from Matlab, that is, WEKA should be installed before using the functions. Complete explanations of how to
call the functions are given in comments at the beginning of each of them.
There is also some general-purpose software that manages ML data as part of its functionality. Clus10 is a decision
tree system that implements the predictive clustering framework and can be used for hierarchical ML classification.
LibSVM11 (Chang & Lin, 2011) is an integrated software package for SVMs that can be applied to ML data using the binary
relevance transformation.
KEEL (Alcalá-Fdez et al., 2011) is an open-source data mining software tool that contains many algorithms for different
knowledge discovery tasks. Moreover, there is a dataset repository associated with this software that
contains some ML datasets. The file format of these datasets is ARFF-based, with specific header fields indicating whether
each attribute is a label or not. Besides, this software provides a complete set of statistical procedures, both parametric and
nonparametric, for pairwise comparisons of algorithms.

5 | GUIDING EXPERIMENTS

In this section, some experimental results are presented. First, the results of different evaluation measures for the most popular
datasets with all features are shown in Table 2; these can be used as a baseline to evaluate the goodness of ML-FS
methods. Afterward, two series of ML-FS methods are tested and compared to draw useful conclusions. The first group of
methods uses the problem transformation strategy, and the second group consists of algorithm adaptation-based methods. The
methods of the first group are based on SL filter methods, using the two standard problem transformation approaches LP and BR (in the
external form, according to Figure 5).
The filter methods include ReliefF, IG, F-score, chi-square, FCBF, and CFS. Four of these methods, BR-IG,
BR-RF, LP-IG, and LP-RF, which are based on the IG and ReliefF measures, were initially presented in Spolaôr et al.’s (2013a)
study. In this experiment, the algorithms that use the LP method for transformation consider the label dependency, while the
algorithms that use the BR method for transformation do not. We aim to explore the role of considering label correlations
in ML-FS.
As was mentioned in the introduction, some methods provide an ordered ranking of features, and a threshold is needed to
select the features whose scores exceed it, while the other methods produce a candidate subset of features.
All the above filter methods are ranking methods except FCBF and CFS. Because defining a proper threshold for each
method is not an easy task, and for the sake of fairness, we decide the number of features for the ranking-based methods according
to the following rules (a small helper implementing them is sketched after the list), which are inspired by the paper of Bolón-Canedo,
Sánchez-Maroño, and Alonso-Betanzos (2016) and our previous experience:

• If M < 100, select 40% of features


• If 100 < M < 500, select 30% of features
• If 500 < M < 1000, select 20% of features
• If M > 1000, select 10% of features
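The helper below implements these rules (the treatment of the boundary values 100, 500, and 1000, which the rules leave open, is our own assumption):

```python
# Number of features retained by the ranking-based methods, per the rules above.
def n_features_to_select(M):
    if M < 100:
        ratio = 0.40
    elif M < 500:       # boundary M == 100 mapped to this bracket (assumption)
        ratio = 0.30
    elif M < 1000:
        ratio = 0.20
    else:
        ratio = 0.10
    return max(1, round(M * ratio))
```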

We selected PMU and MDMR as the algorithms of the second group. Both of these algorithm adaptation filter methods
use a greedy search to find the optimal feature subset, that is, the features are selected one by one according to their calculated
scores in each step. PMU takes the correlation among labels into account, in contrast to the MDMR algorithm. Here again,
the above rules of thumb are used to define the number of features to be selected.
For all of the compared algorithms, the results are averaged over 20 independent runs on each dataset, and ML-kNN is
used as the classifier (k = 10). In each experiment, 60% of the samples are chosen randomly for the training process and the
remaining 40% are used for evaluating the performance of the methods. In PMU, numeric features are discretized into two
bins using an equal-width strategy, as suggested by Lee and Kim (2013), while nominal features are left untouched.
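One run of this protocol can be sketched as follows (a Python approximation of the setup, using scikit-learn and scikit-multilearn rather than the Matlab pipeline actually used):

```python
# A single 60/40 evaluation run with ML-kNN (k = 10); the reported numbers
# are the average over 20 such runs with different seeds.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import hamming_loss
from skmultilearn.adapt import MLkNN

def one_run(X, Y, seed):
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=0.6,
                                              random_state=seed)
    clf = MLkNN(k=10)
    clf.fit(X_tr, Y_tr)
    return hamming_loss(Y_te, clf.predict(X_te).toarray())

# mean_hl = np.mean([one_run(X, Y, s) for s in range(20)])
```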
Table 3 reports the results of the 12 problem transformation algorithms and the two algorithm adaptation methods in
terms of the accuracy, hamming loss, and feature reduction criteria; the best results are highlighted in bold in the original table.
For easier comparison, we repeat the baseline results in the last column of that table.
TABLE 2 Baseline results for the most frequently used multilabel datasets using the ML-kNN classifier

| Dataset | Accuracy | Precision | Recall | F-measure | Hamming loss | One error | Coverage | Average precision | Ranking loss |
|---|---|---|---|---|---|---|---|---|---|
| Flag | 0.503 | 0.678 | 0.639 | 0.634 | 0.328 | 0.246 | 3.925 | 0.791 | 0.244 |
| CAL500 | 0.197 | 0.591 | 0.228 | 0.324 | 0.139 | 0.126 | 130.343 | 0.487 | 0.184 |
| Emotions | 0.311 | 0.497 | 0.344 | 0.384 | 0.27 | 0.398 | 2.323 | 0.701 | 0.27 |
| Yeast | 0.501 | 0.715 | 0.572 | 0.607 | 0.198 | 0.239 | 6.383 | 0.756 | 0.173 |
| Mediamill | 0.463 | 0.775 | 0.514 | 0.579 | 0.029 | 0.129 | 15.602 | 0.739 | 0.042 |
| Birds | 0.018 | 0.031 | 0.018 | 0.022 | 0.098 | 0.732 | 7.785 | 0.383 | 0.306 |
| Scene | 0.658 | 0.685 | 0.676 | 0.673 | 0.09 | 0.236 | 0.495 | 0.859 | 0.081 |
| Image | 0.460 | 0.527 | 0.472 | 0.486 | 0.889 | 0.336 | 1.008 | 0.78 | 0.185 |
| corel5k | 0.010 | 0.025 | 0.011 | 0.014 | 0.009 | 0.729 | 115.738 | 0.246 | 0.135 |
| corel16k_sample1 | 0.005 | 0.011 | 0.005 | 0.006 | 0.018 | 0.744 | 51.908 | 0.28 | 0.175 |
| corel16k_sample2 | 0.001 | 0.002 | 0.001 | 0.001 | 0.017 | 0.897 | 68.455 | 0.138 | 0.287 |
| corel16k_sample3 | 0.004 | 0.012 | 0.005 | 0.006 | 0.018 | 0.74 | 51.743 | 0.275 | 0.17 |
| Language log | 0.369 | 0.665 | 0.408 | 0.479 | 0.162 | 0.18 | 50.071 | 0.614 | 0.2 |
| Slashdot | 0.798 | 0.892 | 0.799 | 0.829 | 0.017 | 0.092 | 1.634 | 0.885 | 0.046 |
| Genbase | 0.928 | 0.969 | 0.929 | 0.942 | 0.005 | 0.01 | 0.586 | 0.985 | 0.007 |
| Medical | 0.528 | 0.58 | 0.553 | 0.554 | 0.016 | 0.273 | 3.005 | 0.788 | 0.048 |
| Bibtex | 0.123 | 0.249 | 0.127 | 0.156 | 0.013 | 0.601 | 56.626 | 0.335 | 0.218 |
| Enron | 0.293 | 0.526 | 0.34 | 0.388 | 0.053 | 0.324 | 13.598 | 0.617 | 0.095 |

By comparing the results, it can be seen that there is an improvement in most of the datasets after applying feature selection.
For instance, the accuracy on the medical dataset rises from 0.528 to 0.635 with LP-F-score, an improvement of more than 10%;
moreover, about 90% of the features are eliminated by this method. Although some FS methods degrade the performance of the
classifier compared to the baseline results, the ratio of feature reduction should not be neglected. For example, on the Image dataset
the accuracy measure decreases by 0.048 using the MDMR method, while almost 70% of the features are eliminated, which
reduces the computational complexity significantly.
In order to assess the statistical significance of the results, we employed Friedman’s test and Nemenyi’s post-hoc
procedure at a confidence level of α = .05. Based on the results of Table 3, the average ranks obtained by comparing the
methods through the Friedman 1×N statistical test are presented in Table 4. These results are obtained using the KEEL software
described previously. Numbers written in brackets are the ranks obtained by each algorithm among the others; a lower rank
indicates superiority over the others. As the number of features to be selected in the ranking-based methods
is determined by the user, their feature-reduction ranks are the same.
According to this table, BR-CFS obtains the first rank in both the accuracy and hamming loss criteria. Based on the accuracy
measure, BR-FCBF and LP-RF stand in second and third place, whereas their ordering is reversed for the hamming loss
measure. The p values obtained after applying the Friedman test for the accuracy and hamming loss criteria are, respectively,
.014525 and .001209, which indicates a significant difference among the compared methods.
As the null hypothesis is rejected for both measures, Nemenyi’s post-hoc test is employed to discover significant differences
among the methods. Figure 7 shows the corresponding diagrams for the accuracy and hamming loss measures. The ranks
of the methods are displayed on the axis in such a way that the top-ranked methods are at the rightmost side of the diagram. When
the difference between the average ranks of two methods is more than the critical difference (CD), these methods are significantly
different. Methods that are not significantly different are connected to each other with a line. According to these diagrams, the
sequence of the methods’ ranks is similar to what Table 4 presents. Also, BR-CFS is significantly better than PMU in both criteria.
Moreover, although it is not possible to find significant differences among the other methods, the lower the rank, the better the method.
A challenging conclusion that can be drawn from these diagrams is that methods that consider label dependency, such as
the PMU- and LP-based methods, do not perform better than the other methods. Also, the diagrams show that, on average,
methods that use the problem transformation strategy perform better than algorithm adaptation methods for ML-FS. However,
these conclusions rest only on the compared algorithms, and more experiments are needed to confirm or reject these inferences.

6 | DISCUSSION AND FUTURE DIRECTION

In this section, we provide an analysis of the reviewed papers and a discussion of the ignored issues and challenges in the
field of ML-FS.
TABLE 3 Experimental results of the compared methods, averaged over 20 independent runs on 10 multilabel datasets (BR-* and LP-* are problem transformation methods; MDMR and PMU are algorithm adaptation methods)

| Dataset | Measure | BR-RF | BR-IG | BR-FCBF | BR-FSCORE | BR-CHI | BR-CFS | LP-RF | LP-IG | LP-FCBF | LP-FSCORE | LP-CHI | LP-CFS | MDMR | PMU | Baseline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flag | Acc | 0.512 | 0.499 | 0.494 | 0.503 | 0.508 | 0.512 | 0.520 | 0.496 | 0.486 | 0.505 | 0.476 | 0.483 | 0.508 | 0.511 | 0.503 |
| | HL | 0.319 | 0.32 | 0.329 | 0.326 | 0.318 | 0.325 | 0.306 | 0.336 | 0.327 | 0.325 | 0.344 | 0.339 | 0.326 | 0.327 | 0.328 |
| | Fr | 0.600 | 0.600 | 0.365 | 0.600 | 0.600 | 0.235 | 0.600 | 0.600 | 0.900 | 0.600 | 0.600 | 0.900 | 0.600 | 0.600 | 0 |
| cal500 | Acc | 0.196 | 0.195 | 0.195 | 0.192 | 0.196 | 0.192 | 0.196 | 0.192 | 0.196 | 0.195 | 0.191 | 0.195 | 0.196 | 0.191 | 0.197 |
| | HL | 0.14 | 0.14 | 0.139 | 0.139 | 0.139 | 0.14 | 0.139 | 0.139 | 0.138 | 0.139 | 0.139 | 0.139 | 0.139 | 0.140 | 0.139 |
| | Fr | 0.602 | 0.602 | 0.432 | 0.602 | 0.602 | 0.324 | 0.602 | 0.602 | 0 | 0.602 | 0.602 | 0 | 0.602 | 0.602 | 0 |
| Emotions | Acc | 0.391 | 0.392 | 0.381 | 0.4 | 0.387 | 0.402 | 0.389 | 0.409 | 0.358 | 0.377 | 0.408 | 0.381 | 0.318 | 0.297 | 0.311 |
| | HL | 0.254 | 0.259 | 0.258 | 0.259 | 0.259 | 0.254 | 0.257 | 0.25 | 0.272 | 0.26 | 0.247 | 0.261 | 0.270 | 0.281 | 0.27 |
| | Fr | 0.597 | 0.597 | 0.631 | 0.597 | 0.597 | 0.329 | 0.597 | 0.597 | 0.956 | 0.597 | 0.597 | 0.808 | 0.597 | 0.597 | 0 |
| Yeast | Acc | 0.479 | 0.488 | 0.495 | 0.486 | 0.489 | 0.505 | 0.486 | 0.419 | 0.348 | 0.484 | 0.417 | 0.347 | 0.491 | 0.477 | 0.501 |
| | HL | 0.202 | 0.199 | 0.2 | 0.201 | 0.2 | 0.195 | 0.201 | 0.217 | 0.233 | 0.201 | 0.218 | 0.233 | 0.199 | 0.204 | 0.198 |
| | Fr | 0.699 | 0.699 | 0.656 | 0.699 | 0.699 | 0.278 | 0.699 | 0.699 | 0.99 | 0.699 | 0.699 | 0.99 | 0.699 | 0.699 | 0 |
| Scene | Acc | 0.548 | 0.536 | 0.613 | 0.524 | 0.539 | 0.647 | 0.556 | 0.535 | 0.53 | 0.528 | 0.528 | 0.623 | 0.4485 | 0.493 | 0.658 |
| | HL | 0.113 | 0.117 | 0.098 | 0.118 | 0.116 | 0.093 | 0.114 | 0.116 | 0.116 | 0.118 | 0.117 | 0.098 | 0.129 | 0.121 | 0.09 |
| | Fr | 0.7 | 0.7 | 0.774 | 0.7 | 0.7 | 0.185 | 0.7 | 0.7 | 0.936 | 0.7 | 0.7 | 0.579 | 0.7 | 0.7 | 0 |
| Image | Acc | 0.405 | 0.414 | 0.425 | 0.398 | 0.419 | 0.489 | 0.43 | 0.429 | 0.316 | 0.402 | 0.428 | 0.441 | 0.412 | 0.390 | 0.46 |
| | HL | 0.903 | 0.9 | 0.897 | 0.905 | 0.9 | 0.883 | 0.897 | 0.897 | 0.924 | 0.904 | 0.897 | 0.894 | 0.901 | 0.905 | 0.889 |
| | Fr | 0.700 | 0.700 | 0.868 | 0.700 | 0.700 | 0.368 | 0.700 | 0.700 | 0.968 | 0.700 | 0.700 | 0.742 | 0.700 | 0.700 | 0 |
| Slashdot | Acc | 0.799 | 0.794 | 0.795 | 0.796 | 0.794 | 0.794 | 0.795 | 0.797 | 0.807 | 0.8 | 0.795 | 0.809 | 0.792 | 0.655 | 0.798 |
| | HL | 0.017 | 0.017 | 0.017 | 0.018 | 0.018 | 0.017 | 0.017 | 0.017 | 0.018 | 0.018 | 0.018 | 0.018 | 0.018 | 0.018 | 0.017 |
| | Fr | 0.899 | 0.899 | 0.873 | 0.899 | 0.899 | 0.841 | 0.899 | 0.899 | 0.999 | 0.899 | 0.899 | 0.999 | 0.899 | 0.899 | 0 |
| Genbase | Acc | 0.931 | 0.925 | 0.94 | 0.937 | 0.935 | 0.933 | 0.931 | 0.943 | 0.909 | 0.931 | 0.935 | 0.927 | 0.925 | 0.889 | 0.928 |
| | HL | 0.005 | 0.006 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.004 | 0.008 | 0.005 | 0.005 | 0.006 | 0.006 | 0.009 | 0.005 |
| | Fr | 0.899 | 0.899 | 0.97 | 0.899 | 0.899 | 0.943 | 0.899 | 0.899 | 0.991 | 0.899 | 0.899 | 0.99 | 0.899 | 0.899 | 0 |
| Medical | Acc | 0.551 | 0.587 | 0.59 | 0.604 | 0.611 | 0.578 | 0.567 | 0.568 | 0.552 | 0.635 | 0.568 | 0.595 | 0.573 | 0.463 | 0.528 |
| | HL | 0.015 | 0.014 | 0.014 | 0.014 | 0.014 | 0.015 | 0.015 | 0.015 | 0.016 | 0.013 | 0.015 | 0.014 | 0.016 | 0.017 | 0.016 |
| | Fr | 0.899 | 0.899 | 0.853 | 0.899 | 0.899 | 0.811 | 0.899 | 0.899 | 0.993 | 0.899 | 0.899 | 0.99 | 0.899 | 0.899 | 0 |
| Enron | Acc | 0.361 | 0.202 | 0.369 | 0.224 | 0.217 | 0.331 | 0.351 | 0.265 | 0.161 | 0.215 | 0.267 | 0.172 | 0.365 | 0.346 | 0.293 |
| | HL | 0.05 | 0.057 | 0.051 | 0.058 | 0.058 | 0.052 | 0.051 | 0.054 | 0.061 | 0.058 | 0.054 | 0.061 | 0.051 | 0.052 | 0.053 |
| | Fr | 0.900 | 0.900 | 0.886 | 0.900 | 0.900 | 0.674 | 0.900 | 0.900 | 0.999 | 0.900 | 0.900 | 0.999 | 0.900 | 0.900 | 0 |

Note. The best result in each row is highlighted in bold in the original table.


TABLE 4 Average rankings of the algorithms obtained for each evaluation measure by performing the Friedman test

| Method | Accuracy | Hamming loss | Feature reduction |
|---|---|---|---|
| BR-RF | 6.35 [5] | 6.10 [4] | 7.70 [3] |
| BR-IG | 8.20 [9] | 7.15 [7] | 7.70 [3] |
| BR-FCBF | 5.50 [2] | 5.05 [3] | 8.60 [4] |
| BR-FSCORE | 7.45 [7] | 8.35 [10] | 7.70 [3] |
| BR-CHI | 6.10 [4] | 6.60 [6] | 7.70 [3] |
| BR-CFS | 4.95 [1] | 4.80 [1] | 12.80 [5] |
| LP-RF | 5.55 [3] | 4.95 [2] | 7.70 [3] |
| LP-IG | 6.85 [6] | 6.35 [5] | 7.70 [3] |
| LP-FCBF | 10.40 [11] | 10.75 [13] | 2.45 [1] |
| LP-FSCORE | 7.80 [8] | 7.90 [8] | 7.70 [3] |
| LP-CHI | 8.60 [10] | 7.95 [9] | 7.70 [3] |
| LP-CFS | 7.60 [7] | 8.70 [11] | 4.15 [2] |
| MDMR | 8.20 [9] | 8.80 [12] | 7.70 [3] |
| PMU | 11.45 [12] | 11.55 [14] | 7.70 [3] |

A challenging concern in ML classification and feature selection is taking the correlation among labels into consideration.
Undoubtedly, label dependency is a critical issue in ML classification, as it makes it possible to derive the unknown labels of an
instance from the known labels according to the label correlation. However, there is a controversy over this issue in ML-FS
methods. Although many papers confirm its positive impact on improving the feature selection process (Lee & Kim, 2013;
Lee & Kim, 2015a; Qiao et al., 2017), several papers claim that considering label dependency has no effect on the results
(Gharroudi et al., 2014). We also observed in our experiments that methods considering the label correlation do not
perform better than the others; however, since our experiments are limited to a few methods, we cannot claim this with certainty,
and wider experiments are needed. Furthermore, it is noteworthy that in some ML-FS methods that claim to consider
label dependency, it is the employed learning algorithm that considers the correlation among labels, not the feature
selection method (Doquire & Verleysen, 2013a).
Another controversial issue is whether to choose the problem transformation or the algorithm adaptation strategy. There is still
no certain answer to this question, but it seems that methods based on the algorithm adaptation strategy perform better for ML
learning algorithms. An analogy may clarify the issue: in a hydroelectric power plant, not all potential energy is transformed
into useful energy (electricity); likewise, some information may be lost when transforming an ML problem into SL problems.
However, based on the obtained results, we cannot generalize this assumption to ML-FS, and more experiments are clearly
necessary in this regard.
There are also other important issues in both SL- and ML-FS methods, including stability and scalability, that are usually
neglected. Most of the time, supervised feature selection methods are evaluated only by their classification accuracy. However,
a good FS method is one that is stable in addition to having high accuracy (He & Yu, 2010). The stability of a feature
selection method is the consistency with which it finds the same subset of features when some training
instances are removed or new training instances are added (Chandrashekar & Sahin, 2014). A method is unreliable for feature
selection if it finds a different feature subset for any perturbation of the training data. The next issue is the scalability of
FS methods, which may be jeopardized in the case of large-scale data. Scalability describes the effect that an increase in
the size of the training set has on the computational performance of a method: training time, accuracy, and allocated memory
(Bolón-Canedo et al., 2016). A good FS algorithm is one that can find a balance among them. Normally, large-scale datasets
cannot be loaded directly into memory, which confines the usability of most feature selection methods
(Li et al., 2016). Recently, distributed computing protocols and frameworks such as MPI and MapReduce have been employed
to perform parallel feature selection on very large-scale datasets.
Another open and challenging problem in both SL- and ML-FS algorithms is how to determine the optimal number of
selected features. For most FS methods, the number of features to be selected must be specified by the user. However,
the optimal number of features is usually unknown and differs across datasets. On the one hand, selecting too many
features increases the risk of including irrelevant, noisy, and redundant features. On the other hand, with too
few selected features, some relevant features that should be included in the final subset may be eliminated.
Therefore, it is more desirable for an FS algorithm to decide the size of the final feature subset automatically.
It was also observed that filter methods are employed more often than wrapper and embedded methods in the publications.
Although filter methods are the most appropriate approaches when dealing with a very large number of features, the higher
accuracy of wrapper and embedded methods should not be ignored. Perhaps employing a filter method to eliminate irrelevant
features and then a wrapper or embedded method to select the most salient ones is a good compromise (Kashef & Nezamabadi-pour,
2017). Also, there are still some SL filter approaches that have not been utilized for the ML task, in either problem transformation
or algorithm adaptation form.
As discussed in Section 3.2, feature selection methods are divided into three categories according to the label perspective:
supervised, semisupervised, and unsupervised methods. All of these strategies have been widely discussed in SL feature
selection, but the last two are poorly investigated in the ML task. In particular, to the best of our knowledge, there is no
publication on unsupervised ML-FS methods.
Apart from the mentioned issues, the results of several evaluation measures for 18 popular datasets with all features, using
the ML-kNN classifier, are reported in Table 2 as a baseline. These results can serve as the basis for comparing feature selection
methods. It is desirable for an ML-FS method to obtain better results using fewer features (Kashef & Nezamabadi-pour,
2013). However, sometimes obtaining the values of a specific feature is so costly that it is preferable to eliminate it in
return for accepting a higher classification error. Therefore, the main objective of feature selection is to simplify a dataset by
reducing its dimensionality and identifying the relevant underlying features without degrading predictive accuracy (Kashef &
Nezamabadi-pour, 2013). Of course, it is usually observed that the performance of the classifier is also improved after feature
selection.

7 | CONCLUSION

Feature selection has been a hot topic and an active field of research in machine learning. In this paper, we have investigated
the previous works on the ML-FS problem. We categorized ML-FS methods from different perspectives: (a) the label perspective,
containing supervised, semisupervised, and unsupervised methods; (b) the search strategy perspective, containing exhaustive,
sequential, and randomized methods; (c) the interaction with the learning algorithm, containing filter, wrapper, and embedded
methods; and (d) the data model perspective, containing problem transformation and algorithm adaptation methods. Each strategy
is described, and its methods in the ML (and, to some extent, SL) task are introduced.
Also, by analyzing the existing papers, several challenges and open issues are discussed that can be pursued in the
future.

ACKNOWLEDGMENT
The authors would like to thank Professor Min-Ling Zhang and Dr. Newton Spolaôr, and also the anonymous reviewers and
the editor, for their helpful and constructive comments.

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.

NOTES

1. https://CRAN.R-project.org/package=utiml
2. https://CRAN.R-project.org/package=MLPUGS
3. https://CRAN.R-project.org/package=mldr
4. https://CRAN.R-project.org/package=mldr.datasets
5. http://meka.sourceforge.net/
6. http://mulan.sourceforge.net/
7. http://www.cs.waikato.ac.nz/ml/weka/
8. http://mulan.sourceforge.net/datasets-mlc.html
9. http://featureselection.asu.edu/old/software.php
10. http://clus.sourceforge.net
11. https://www.csie.ntu.edu.tw/~cjlin/libsvm/

REFERENCES
Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sanchez, L., & Herrera, F. (2011). KEEL data-mining software tool: data set repository, integration
of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing, 17, 255–287.
Ang, J. C., Haron, H., & Hamed, H. N. A., (Eds). (2015). Semi-supervised SVM-based feature selection for cancer classification using microarray gene expression
data. Paper presented at the meeting of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Cham:
Springer.
Banerjee, M., & Pal, N. R. (2014). Feature selection with SVD entropy: Some modification and extension. Information Sciences, 264, 118–134.
Barani, F., Mirhosseini, M., & Nezamabadi-pour, H. (2017). Application of binary quantum-inspired gravitational search algorithm in feature subset selection. Applied
Intelligence, 40, 1–15.
Barkia, H., Elghazel, H., & Aussem, A., (Eds). (2011). Semi-supervised feature importance evaluation with ensemble learning. Paper presented at the meeting of the
Data Mining (ICDM), 2011 I.E. 11th International Conference; IEEE.
Bellal, F., Elghazel, H., & Aussem, A. (2012). A semi-supervised feature ranking method with ensemble learning. Pattern Recognition Letters, 33(10), 1426–1433.
Bermingham, M. L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., … Navarro, P. (2015). Application of high-dimensional feature selec-
tion: Evaluation for genomic prediction in man. Scientific Reports, 5, 1–12.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2013). A review of feature selection methods on synthetic data. Knowledge and Information Sys-
tems, 34(3), 483–519.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2016). Feature selection for high-dimensional data. Progress in Artificial Intelligence, 5(2), 65–75.
Bolón-Canedo, V., Sánchez-Marono, N., Alonso-Betanzos, A., Benítez, J. M., & Herrera, F. (2014). A review of microarray datasets and applied feature selection
methods. Information Sciences, 282, 111–135.
Boutell, M. R., Luo, J., Shen, X., & Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757–1771.
Brassard, G., & Bratley, P. (1996). Fundamentals of algorithms. New Jersey: Prentice Hall.
Cai, D., Zhang, C., & He, X., (Eds.) (2010). Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining; ACM.
Carmona-Cejudo, J. M., Baena-García, M., del Campo-Avila, J., & Morales-Bueno, R., (Eds). (2011). Feature extraction for multi-label learning in the domain of
email classification. Paper presented at the meeting of the Computational Intelligence and Data Mining (CIDM), 2011 I.E. Symposium; IEEE.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16–28.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
Chang, X., Nie, F., Yang, Y., & Huang, H. (2014). A convex formulation for semi-supervised multi-label feature selection. Paper presented at the AAAI, Québec,
Canada.
Charte, F., & Charte, D. (2015). Working with multilabel datasets in R: The mldr package. R J., 7(2), 149–162.
Charte, F., Charte, D., Rivera, A., del Jesus, M. J., & Herrera, F., (Eds). (2016). R ultimate multilabel dataset repository. Paper presented at the meeting of the Interna-
tional Conference on Hybrid Artificial Intelligence Systems; Springer.
Chen, W., Yan, J., Zhang, B., Chen, Z., & Yang, Q., (Eds). (2007). Document transformation for multi-label feature selection in text categorization. Paper presented
at the meeting of the Data Mining, 2007 ICDM 2007 Seventh IEEE International Conference; IEEE.
Cheng, H., Deng, W., Fu, C., Wang, Y., & Qin, Z. (2011). Graph-based semi-supervised feature selection with application to automatic spam image identification.
Computer Science for Environmental Engineering and EcoInformatics, 159, 259–264.
Cherman, E. A., Metz, J., & Monard, M. C., (Eds). (2010). A simple approach to incorporate label dependency in multi-label classification. Paper presented at the
meeting of the Mexican International Conference on Artificial Intelligence; Springer.
Cherman, E. A., Monard, M. C., & Metz, J. (2011). Multi-label problem transformation methods: A case study. CLEI Electronic Journal, 14(1), 4.
Cherman, E. A., Spolaôr, N., Valverde-Rebaza, J., & Monard, M. C. (2015). Lazy multi-label learning algorithms based on mutuality strategies. Journal of Intelli-
gent & Robotic Systems, 80(1), 261–276.
Chiang, T.-H., Lo, H.-Y., & Lin, S.-D. (2012). A ranking-based KNN approach for multi-label classification. Paper presented at the meeting of the ACML; Vol. 25:
81–96.
Chou, S., & Hsu, C.-L. (2005). MMDT: A multi-valued and multi-labeled decision tree classifier for data mining. Expert Systems with Applications, 28(4), 799–812.
Choudhary, A., & Saraswat, J. K. (2014). Survey on hybrid approach for feature selection. International Journal of Science and Research, 3(4), 438–439.
Clare, A., & King R.D., (2001). Knowledge discovery in multi-label phenotype data. In: L. De Raedt & A. Siebes (Eds.), Principles of Data Mining and Knowledge
Discovery. PKDD 2001. Lecture Notes in Computer Science (vol 2168). Berlin: Springer.
Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(1–4), 131–156.
De Comité, F., Gilleron, R., & Tommasi, M., (Eds). (2003). Learning multi-label alternating decision trees from texts and data. Paper presented at the meeting of the
International Workshop on Machine Learning and Data Mining in Pattern Recognition; Springer.
De La Iglesia, B. (2013). Evolutionary computation for feature selection in classification problems. WIREs: Data Mining and Knowledge Discovery, 3(6), 381–407.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan), 1–30.
Dendamrongvit, S., Vateekul, P., & Kubat, M. (2011). Irrelevant attributes and imbalanced classes in multi-label text-categorization domains. Intelligent Data Analy-
sis, 15(6), 843–859.
Ding, S., Ed. (2009). Feature selection based F-score and ACO algorithm in support vector machine. Paper presented at the meeting of the Knowledge Acquisition
and Modeling, 2009 KAM'09 Second International Symposium; IEEE.
Diplaris, S., Tsoumakas, G., Mitkas, P. A., & Vlahavas, I., (2005). Protein classification with multiple algorithms. Paper presented at the meeting of the Panhellenic
Conference on Informatics, Berlin, Heidelberg.
Doak, J. (1992). CSE-92-18—An evaluation of feature selection methods and their application to computer security. UC Davis Dept. of Computer Science tech reports.
Doquire, G., & Verleysen, M., (Eds). (2011). Feature selection for multi-label classification problems. Paper presented at the meeting of the International
Work-Conference on Artificial Neural Networks; Springer.
Doquire, G., & Verleysen, M. (2013a). Mutual information-based feature selection for multilabel classification. Neurocomputing, 122, 148–155.
Doquire, G., & Verleysen, M. (2013b). A graph Laplacian based approach to semi-supervised feature selection for regression problems. Neurocomputing, 121, 5–13.
Duivesteijn, W., Mencía, E. L., Fürnkranz, J., & Knobbe, A., (Eds). (2012). Multi-label LeGo—Enhancing multi-label classifiers with local patterns. Paper presented
at the meeting of the International Symposium on Intelligent Data Analysis; Springer.
Ebrahimpour, M. K., & Eftekhari, M. (2017). Ensemble of feature selection methods: A hesitant fuzzy sets approach. Applied Soft Computing, 50, 300–312.
El Kafrawy, P., Mausad, A., & Esmail, H. (2015). Experimental comparison of methods for multi-label classification in different application domains. International
Journal of Computer Applications, 114(19), 406–417.
Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. Paper presented at the NIPS, Vancouver, British Columbia, Canada.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning: Springer series in statistics. Berlin: Springer.
Fürnkranz, J., Hüllermeier, E., Mencía, E. L., & Brinker, K. (2008). Multilabel classification via calibrated label ranking. Machine Learning, 73(2), 133–153.
García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational
intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10), 2044–2064.
Gharroudi, O., Elghazel, H., & Aussem, A., (Eds). (2014). A comparison of multi-label feature selection methods using the random forest paradigm. Paper presented
at the meeting of the Canadian Conference on Artificial Intelligence; Springer.
Gu, Q., Li, Z., & Han, J. (2011). Correlated multi-label feature selection. Paper presented at the Proceedings of the 20th ACM international conference on Information
and knowledge management.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1), 389–422.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations
Newsletter, 11(1), 10–18.
Hall, M. A. (1999). Correlation-based feature selection for machine learning. New Zealand: The University of Waikato.
He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. Paper presented at the NIPS, Vancouver, British Columbia, Canada.
He, Z., & Yu, W. (2010). Stable feature selection for biomarker discovery. Computational Biology and Chemistry, 34(4), 215–225.
Huang, J., Li, G., Huang, Q., & Wu, X., (Eds). (2015). Learning label specific features for multi-label classification. Paper presented at the meeting of the Data Min-
ing (ICDM), 2015 IEEE International Conference; IEEE.
Huang, J., Li, G., Huang, Q., & Wu, X. (2016). Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowl-
edge and Data Engineering, 28(12), 3309–3323.
Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16), 1897–1916.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). Vancouver, British Columbia, Canada: Springer.
Jing, S.-Y. (2014). A hybrid genetic algorithm for feature subset selection in rough set theory. Soft Computing, 18(7), 1373–1382.
Jungjit, S., & Freitas, A. A. (2015a). A new genetic algorithm for multi-label correlation-based feature selection. Paper presented at the 23rd European Sympo-
sium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium.
Jungjit, S., & Freitas, A., (Eds). (2015b). A lexicographic multi-objective genetic algorithm for multi-label correlation based feature selection. In: Proceedings of the
Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation; ACM.
Jungjit, S., Freitas, A. A., Michaelis, M., & Cinatl, J., (Eds). (2013). Two extensions to multi-label correlation-based feature selection: A case study in bioinformatics.
Paper presented at the meeting of the Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference; IEEE.
Kashef, S., & Nezamabadi-pour, H., (Eds). (2013). A new feature selection algorithm based on binary ant colony optimization. Paper presented at the meeting of the
Information and Knowledge Technology (IKT), 2013 5th Conference; IEEE.
Kashef, S., & Nezamabadi-pour, H. (2015). An advanced ACO algorithm for feature subset selection. Neurocomputing, 147, 271–279.
Kashef, S., & Nezamabadi-pour, H. (2017). An effective method of multi-label feature selection employing evolutionary algorithms. Paper presented at the meeting of
the Swarm Intelligence and Evolutionary Computation (CSIEC), 2017 2nd Conference; IEEE.
Kira, K., & Rendell, L. A., (Eds). (1992). A practical approach to feature selection. In: Proceedings of the Ninth International Workshop on Machine Learning.
Kocev, D., Slavkov, I., & Dzeroski, S., (Eds). (2013). Feature ranking for multi-label classification using predictive clustering trees. International Workshop on Solv-
ing Complex Machine Learning Problems with Ensemble Methods, in Conjunction with ECML/PKDD.
Kong, D., Ding, C., Huang, H., & Zhao, H., (Eds). (2012). Multi-label ReliefF and F-statistic feature selections for image annotation. Paper presented at the meeting of
the Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference; IEEE.
Kong, X., & Philip, S. Y. (2012). gMLC: A multi-label feature selection framework for graph classification. Knowledge and Information Systems, 31(2), 281–305.
Kononenko, I., (Ed). (1994). Estimating attributes: Analysis and extensions of RELIEF. Paper presented at the meeting of the European Conference on Machine Learn-
ing; Springer.
Lastra, G., Luaces, O., Quevedo, J. R., & Bahamonde, A., (Eds). (2011). Graphical feature selection for multilabel classification tasks. Paper presented at the meeting of
the International Symposium on Intelligent Data Analysis; Springer.
Lee, J., & Kim, D.-W. (2013). Feature selection for multi-label classification using multivariate mutual information. Pattern Recognition Letters, 34(3), 349–357.
Lee, J., & Kim, D.-W. (2015a). Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognition, 48(9), 2761–2771.
Lee, J., & Kim, D.-W. (2015b). Memetic feature selection algorithm for multi-label classification. Information Sciences, 293, 80–96.
Lee, J., Kim, H., Kim, N.-R., & Lee, J.-H. (2016). An approach for multi-label classification by directed acyclic graph with label correlation maximization. Information
Sciences, 351, 101–114.
Lee, S., Park, Y.-T., & d’Auriol, B. J. (2012). A novel feature selection method based on normalized mutual information. Applied Intelligence, 37(1), 100–120.
Li, F., Miao, D., & Pedrycz, W. (2017). Granular multi-label feature selection based on mutual information. Pattern Recognition, 67, 410–423.
Li, G.-Z., You, M., Ge, L., Yang, J. Y., & Yang, M. Q., (Eds). (2010). Feature selection for semi-supervised multi-label learning with application to gene function
analysis. In: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology; ACM.
Li, H., Li, D., Zhai, Y., Wang, S., & Zhang, J. (2016). A novel attribute reduction approach for multi-label data based on rough set theory. Information Sciences, 367,
827–847.
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., et al. (2016). Feature selection: A data perspective. arXiv preprint arXiv:1601.07996.
Li, L., Liu, H., Ma, Z., Mo, Y., Duan, Z., Zhou, J., et al., (Eds). (2014). Multi-label feature selection via information gain. Paper presented at the meeting of the Inter-
national Conference on Advanced Data Mining and Applications; Springer.
Li, L., & Wang, H. (2016). Towards label imbalance in multi-label classification with many labels. arXiv preprint arXiv:1604.01304.
Lin, Y., Hu, Q., Liu, J., Chen, J., & Duan, J. (2016). Multi-label feature selection based on neighborhood mutual information. Applied Soft Computing, 38, 244–256.
Lin, Y., Hu, Q., Liu, J., & Duan, J. (2015). Multi-label feature selection based on max-dependency and min-redundancy. Neurocomputing, 168, 92–103.
Liu, H. (2010). Feature selection at Arizona State University. Data Mining and Machine Learning Laboratory. Last accessed: October 2010.
Liu, H., Zhang, S., & Wu, X. (2014). MLSLR: Multilabel learning via sparse logistic regression. Information Sciences, 281, 310–320.
Lo, H.-Y., Lin, S.-D., & Wang, H.-M. (2014). Generalized k-labelsets ensemble for multi-label and cost-sensitive classification. IEEE Transactions on Knowledge and
Data Engineering, 26(7), 1679–1691.
Luo, Q., Chen, E., & Xiong, H. (2011). A semantic term weighting scheme for text categorization. Expert Systems with Applications, 38(10), 12708–12716.
Ma, Z., Nie, F., Yang, Y., Uijlings, J. R., & Sebe, N. (2012). Web image annotation via subspace-sparsity collaborated feature selection. IEEE Transactions on Multi-
media, 14(4), 1021–1030.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition,
45(9), 3084–3104.
Makrehchi, M., & Kamel, M. S., (Eds). (2005). Text classification using small number of features. Paper presented at the meeting of the International Workshop on
Machine Learning and Data Mining in Pattern Recognition; Springer.
Mencía, E. L., Park, S.-H., & Fürnkranz, J. (2010). Efficient voting prediction for pairwise multilabel classification. Neurocomputing, 73(7), 1164–1176.
Mitra, P., Murthy, C., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(3), 301–312.
Naula, P., Airola, A., Salakoski, T., & Pahikkala, T. (2014). Multi-label learning under feature extraction budgets. Pattern Recognition Letters, 40, 56–65.
Noh, H. G., Song, M. S., & Park, S. H. (2004). An unbiased method for constructing multilabel classification trees. Computational Statistics & Data Analysis, 47(1),
149–164.
Olsson, J., & Oard, D. W., (Eds). (2006). Combining feature selectors for text classification. In: Proceedings of the 15th ACM International Conference on Information
and Knowledge Management; ACM.
Park, S.-H., & Fürnkranz, J., (Eds). (2007). Efficient pairwise classification. Paper presented at the meeting of the European Conference on Machine Learning;
Springer.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Pereira, R. B., Plastino, A., Zadrozny, B., & Merschmann, L. H. (2015). Information gain feature selection for multi-label classification. Journal of Information and
Data Management, 6(1), 48.
Pereira, R. B., Plastino, A., Zadrozny, B., & Merschmann, L. H. (2016). Categorizing feature selection methods for multi-label classification. Artificial Intelligence
Review, 1–22.
Prati, R. C., & de França, F. O., (Eds). (2013). Extending features for multilabel classification with swarm biclustering. Paper presented at the meeting of the Evolution-
ary Computation (CEC), 2013 IEEE Congress; IEEE.
Pupo, O. G. R., Morell, C., & Soto, S. V., (Eds). (2013). ReliefF-ML: An extension of ReliefF algorithm to multi-label learning. Paper presented at the meeting of the
Iberoamerican Congress on Pattern Recognition; Springer.
Qian, B., & Davidson, I. (2010). Semi-supervised dimension reduction for multi-label classification. Paper presented at the AAAI, Atlanta, Georgia, USA.
Qiao, L., Zhang, L., Sun, Z., & Liu, X. (2017). Selecting label-dependent features for multi-label classification. Neurocomputing, 259, 112–118.
Rashedi, E., & Nezamabadi-pour, H. (2014). Feature subset selection using improved binary gravitational search algorithm. Journal of Intelligent & Fuzzy Systems,
26(3), 1211–1221.
Read, J., (Ed). (2008). A pruned problem transformation method for multi-label classification. In: Proceedings of 2008 New Zealand Computer Science Research Stu-
dent Conference (NZCSRS 2008).
Read, J., Bifet, A., Holmes, G., & Pfahringer, B. (2012). Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1–2),
243–272.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333–359.
Read, J., Puurula, A., & Bifet, A., (Eds). (2014). Multi-label classification with meta-labels. Paper presented at the meeting of the Data Mining (ICDM), 2014 IEEE
International Conference; IEEE.
Read, J., Reutemann, P., Pfahringer, B., & Holmes, G. (2016). MEKA: A multi-label/multi-target extension to WEKA. Journal of Machine Learning Research,
17(21), 1–5.
Ren, Y., Zhang, G., Yu, G., & Li, X. (2012). Local and global structure preserving based feature selection. Neurocomputing, 89, 147–157.
Reyes, O., Morell, C., & Ventura, S. (2014). Evolutionary feature weighting to improve the performance of multi-label lazy algorithms. Integrated Computer-Aided
Engineering, 21(4), 339–354.
Reyes, O., Morell, C., & Ventura, S. (2015). Scalable extensions of the ReliefF algorithm for weighting and selecting features on the multi-label learning context. Neu-
rocomputing, 161, 168–182.
Reyes, O., Morell, C., & Ventura, S. (2016). Effective lazy learning algorithm based on a data gravitation model for multi-label learning. Information Sciences, 340,
159–174.
Rogati, M., & Yang, Y., (Eds). (2002). High-performing feature selection for text classification. In: Proceedings of the Eleventh International Conference on Informa-
tion and Knowledge Management; ACM.
Rouhi, A., & Nezamabadi-pour, H., (Eds). (2017). A hybrid feature selection approach based on ensemble method for high-dimensional data. Paper presented at the
meeting of the Swarm Intelligence and Evolutionary Computation (CSIEC), 2017 2nd Conference; IEEE.
Shao, H., Li, G., Liu, G., & Wang, Y. (2013). Symptom selection for multi-label data of inquiry diagnosis in traditional Chinese medicine. Science China Information
Sciences, 56(5), 1–13.
Sheikhpour, R., Sarram, M. A., Gharaghani, S., & Chahooki, M. A. Z. (2017). A survey on semi-supervised feature selection methods. Pattern Recognition, 64,
141–158.
Sikdar, U. K., Ekbal, A., Saha, S., Uryupina, O., & Poesio, M. (2015). Differential evolution-based feature selection technique for anaphora resolution. Soft Comput-
ing, 19(8), 2149–2161.
Song, G., & Ye, Y., (Eds). (2014). A new ensemble method for multi-label data stream classification in non-stationary environment. Paper presented at the meeting of
the Neural Networks (IJCNN), 2014 International Joint Conference; IEEE.
Spolaôr, N., Cherman, E. A., Monard, M. C., & Lee, H. D. (2013a). A comparison of multi-label feature selection methods using the problem transformation
approach. Electronic Notes in Theoretical Computer Science, 292, 135–151.
Spolaôr, N., Cherman, E. A., Monard, M. C., & Lee, H. D., (Eds). (2013b). ReliefF for multi-label feature selection. Paper presented at the meeting of the Intelligent
Systems (BRACIS), 2013 Brazilian Conference; IEEE.
Spolaôr, N., Monard, M. C., Tsoumakas, G., & Lee, H., (Eds). (2014). Label construction for multi-label feature selection. Paper presented at the meeting of the Intel-
ligent Systems (BRACIS), 2014 Brazilian Conference; IEEE.
Spolaôr, N., Monard, M. C., Tsoumakas, G., & Lee, H. D. (2016). A systematic review of multi-label feature selection and a new method based on label construction.
Neurocomputing, 180, 3–15.
Spolaôr, N., & Tsoumakas, G. (2013). Evaluating feature selection methods for multi-label text classification. Vancouver, Canada: BioASQ workshop.
Spyromitros, E., Tsoumakas, G., & Vlahavas, I., (Eds). (2008). An empirical study of lazy multilabel classification algorithms. Paper presented at the meeting of the
Hellenic conference on Artificial Intelligence; Springer.
Spyromitros-Xioufis, E. (2011). Dealing with concept drift and class imbalance in multi-label stream classification. Barcelona: Department of Computer Science, Aris-
totle University of Thessaloniki.
Teisseyre, P. (2016). Feature ranking for multi-label classification using Markov networks. Neurocomputing, 205, 439–454.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. P. (2008). Multi-label classification of music into emotions. Paper presented at the ISMIR, Philadelphia,
PA, USA.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2009). Mining multi-label data. Data mining and knowledge discovery handbook (pp. 667–685). Boston: Springer.
Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., & Vlahavas, I. (2011). Mulan: A java library for multi-label learning. Journal of Machine Learning Research,
12(Jul), 2411–2414.
Tsoumakas, G., & Vlahavas, I., (Eds). (2007). Random k-labelsets: An ensemble method for multilabel classification. Paper presented at the meeting of the European
Conference on Machine Learning; Springer.
Vergara, J. R., & Estévez, P. A. (2014). A review of feature selection methods based on mutual information. Neural Computing and Applications, 24(1), 175–186.
Xu, J., Liu, J., Yin, J., & Sun, C. (2016). A multi-label feature extraction algorithm via maximizing feature variance and feature-label dependence simultaneously.
Knowledge-Based Systems, 98, 172–184.
Xu, S., Yang, X., Yu, H., Yu, D.-J., Yang, J., & Tsang, E. C. (2016). Multi-label learning with label-specific feature reduction. Knowledge-Based Systems, 104, 52–61.
Xue, B., Zhang, M., Browne, W. N., & Yao, X. (2016). A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary
Computation, 20(4), 606–626.
Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., et al., (Eds). (2005). OCFS: Optimal orthogonal centroid feature selection for text categorization. In: Pro-
ceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM.
Yang, J., Jiang, Y.-G., Hauptmann, A. G., & Ngo, C.-W. (2007). Evaluating bag-of-visual-words representations in scene classification. Paper presented at the Proceedings of the International Workshop on Multimedia Information Retrieval, Augsburg, Bavaria, Germany.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Paper presented at the ICML, Nashville, TN, USA.
You, M., Liu, J., Li, G.-Z., & Chen, Y. (2012). Embedded feature selection for multi-label classification of music emotions. International Journal of Computational
Intelligence Systems, 5(4), 668–678.
Younes, Z., Abdallah, F., Denoeux, T., & Snoussi, H. (2011). A dependent multilabel classification method derived from the k-nearest neighbor rule.
EURASIP Journal on Advances in Signal Processing, 2011, Article ID 645964, 14 pages.
Yu, K., Yu, S., & Tresp, V., (Eds). (2005). Multi-label informed latent semantic indexing. In: Proceedings of the 28th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval; ACM.
Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. Paper presented at the ICML, Washington D.C.
Yu, L., & Liu, H., (Eds). (2004). Redundancy based feature selection for microarray data. In: Proceedings of the tenth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining; ACM.
Yu, Y., & Wang, Y., (Eds). (2014). Feature selection for multi-label learning using mutual information and GA. Paper presented at the meeting of the International
Conference on Rough Sets and Knowledge Technology; Springer.
Zhang, M.-L., Li, Y.-K., & Liu, X.-Y. (2015). Towards class-imbalance aware multi-label learning. Paper presented at the IJCAI, Buenos Aires, Argentina.
Zhang, M.-L., & Wu, L. (2015). LIFT: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1),
107–120.
Zhang, M.-L., Peña, J. M., & Robles, V. (2009). Feature selection for multi-label naive Bayes classification. Information Sciences, 179(19), 3218–3229.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge
and Data Engineering, 18(10), 1338–1351.
Zhang, M.-L., & Zhou, Z.-H. (2007). ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2038–2048.
Zhang, M.-L., & Zhou, Z.-H. (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819–1837.
Zhang, Y., Gong, D.-W., & Rong, M., (Eds). (2015). Multi-objective differential evolution algorithm for multi-label feature selection in classification. Paper presented
at the meeting of the International Conference in Swarm Intelligence; Springer.
Zhang, Y., Gong, D.-W., Sun, X.-Y., & Guo, Y.-N. (2017). A PSO-based multi-objective multi-label feature selection method in classification. Scientific Reports,
7(1), 376.
Zhang, Y., Wang, S., Phillips, P., & Ji, G. (2014). Binary PSO with mutation operator for feature selection using decision tree applied to spam detection.
Knowledge-Based Systems, 64, 22–31.
Zhang, Y., & Zhou, Z.-H. (2010). Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data
(TKDD), 4(3), 14.
Zhao, Z., & Liu, H. (2007a). Searching for interacting features. Paper presented at the IJCAI, Hyderabad, India.
Zhao, Z., & Liu, H., (Eds). (2007b). Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on
Machine learning; ACM.
Zhao, Z., & Liu, H., (Eds). (2007c). Semi-supervised feature selection via spectral analysis. In: Proceedings of the 2007 SIAM International Conference on Data Min-
ing; SIAM.

How to cite this article: Kashef S, Nezamabadi-pour H, Nikpour B. Multilabel feature selection: A comprehensive
review and guiding experiments. WIREs Data Mining Knowl Discov. 2018;8:e1240. https://doi.org/10.1002/widm.1240
