Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Robust Feature Selection on Incomplete Data


Wei Zheng¹, Xiaofeng Zhu¹*, Yonghua Zhu², Shichao Zhang¹
¹ Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004, China
² Guangxi University, Nanning, 530004, China
zwgxnu@163.com, seanzhuxf@gmail.com, yhzhu66@qq.com, zhangsc@gxnu.edu.cn

* Xiaofeng Zhu and Wei Zheng contributed equally to this work. Corresponding author: Xiaofeng Zhu.

Abstract

Feature selection is an indispensable preprocessing procedure for high-dimensional data analysis, but previous feature selection methods usually ignore sample diversity (i.e., every sample makes an individual contribution to model construction) and have limited ability to deal with incomplete data sets, in which some training samples contain unobserved data. To address these issues, in this paper we first propose a robust feature selection framework that relieves the influence of outliers, and then introduce an indicator matrix that keeps unobserved data out of the numerical computation of feature selection, so that both our proposed feature selection framework and existing feature selection frameworks can conduct feature selection on incomplete data sets. We further propose a new optimization algorithm for the resulting objective function and prove that it converges fast. Experimental results on both real and artificial incomplete data sets demonstrate that our proposed method outperforms the feature selection methods under comparison in terms of clustering performance.

1 Introduction

Feature selection is one of the most popular dimensionality reduction techniques for dealing with high-dimensional data due to its interpretability. Based on the way the model is constructed, previous feature selection methods are widely partitioned into three categories [Chandrashekar and Sahin, 2014], i.e., the filter model [He et al., 2005], the wrapper model [Unler et al., 2011], and the embedded model [Peng and Fan, 2017; Zhu et al., 2015; 2017b]. The filter model selects important features from all the features without a training model, and thus is simpler and more efficient than both the wrapper model and the embedded model. The embedded model integrates feature selection and model training into a unified framework, i.e., it automatically selects features during the training process, and has been shown to be more effective than both the filter model and the wrapper model.

Given a feature matrix, each row and column of which represents a sample and a feature, respectively, feature selection can be regarded as removing redundancy or noise along the vertical direction of the feature matrix, i.e., removing unimportant features, under the assumption that different features contribute differently to data analysis, i.e., feature diversity. In fact, different samples also have different influence on data analysis, i.e., sample diversity [Huber, 2011; Nie et al., 2010]. However, few feature selection methods consider removing the influence of redundancy or noise along the horizontal direction of the feature matrix, i.e., removing or relieving the influence of outliers [Zhu et al., 2014a; Huang et al., 2016].

In practical applications, incomplete data sets, in which some elements of a part of the samples are unobserved (i.e., missing), are very common. In particular, the ratio of incomplete samples may approach 90% in industrial data [Little and Rubin, 2014]. The popular strategies for dealing with incomplete data sets are the discard strategy and the imputation strategy [Zhang et al., 2017]. The discard strategy discards incomplete samples (i.e., the samples containing unobserved elements) and uses only the observed samples (i.e., the samples whose elements are all observed) for data analysis. The discard strategy thus ignores the observed information in incomplete samples and may lack enough samples, so it outputs unreliable models [Van Hulse and Khoshgoftaar, 2014; Zhang et al., 2017]. The imputation strategy first imputes the unobserved elements with guessed values and then uses all the samples in the data set for data analysis. Compared with the discard strategy, the imputation strategy utilizes the observed information in incomplete samples, but the imputed values may be noisy or useless since the unobserved information is never known [Zhang et al., 2006; Van Hulse and Khoshgoftaar, 2014; Doquire and Verleysen, 2012]. Data analysis on incomplete data has been widely studied in all kinds of real applications, but few studies have focused on conducting feature selection on incomplete data sets.

To address the above issues, in this paper we propose a new method to conduct robust unsupervised feature selection on incomplete data sets. Specifically, our proposed method
conducts data analysis by 1) using all the observed information (including the observed information in incomplete samples, whose unobserved elements are never imputed), and 2) relieving or even removing the influence of redundancy and noise along both the horizontal and vertical directions of the data set. To this end, we first propose a robust feature selection framework that selects a subset of samples (Most Important (MI) samples for short) from the data set to construct an initial feature selection model, and then automatically selects further MI samples from the remaining samples, which join the former MI samples to revise this initial feature selection model. The robustness and generalization ability of the initial feature selection model are gradually improved as MI samples are added, until all the samples are used up or the model reaches stability. As a result, outliers are selected later than MI samples, or are never selected to take part in the construction of the feature selection model. Second, we introduce an indicator matrix that keeps unobserved values out of the numerical computation of feature selection, so that both our proposed robust feature selection framework and existing feature selection frameworks can be directly applied to incomplete data sets. Furthermore, we propose a new optimization algorithm for the resulting objective function and prove its convergence.

To our knowledge, our proposed method is the first work to integrate feature selection, the handling of unobserved data, and robust statistics [Huber, 2011; Huang et al., 2016] in a unified framework, since previous works were only designed to address a part of these three tasks [Nie et al., 2010; Doquire and Verleysen, 2012; Huang et al., 2016]. Moreover, it is easy to extend our proposed framework to related tasks on incomplete data sets, such as unsupervised feature selection with other embedded models, supervised feature selection [Song et al., 2007; Zhu et al., 2018], and semi-supervised feature selection [Zhao and Liu, 2007].

2 Approach

We summarize the notations used in this paper in Table 1.

    X            the feature matrix of the training data
    x            a vector of X
    x_i          the i-th row of X
    x^j          the j-th column of X
    x_{i,j}      the element in the i-th row and the j-th column of X
    ||X||_F      the Frobenius norm of X, i.e., ||X||_F = sqrt(sum_{i,j} x_{i,j}^2)
    ||X||_{2,1}  the l_{2,1}-norm of X, i.e., ||X||_{2,1} = sum_i sqrt(sum_j x_{i,j}^2)
    X^T          the transpose of X
    tr(X)        the trace of X
    X^{-1}       the inverse of X

Table 1: The notations used in this paper.
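As a concrete illustration (not code from the paper), the following NumPy sketch evaluates the two norms defined in Table 1 and shows the common convention of scoring the j-th feature by the l2-norm of the j-th row of a weight matrix W, which is how the sparse models discussed below turn a row-sparse W into a feature ranking; the random matrices are placeholders.

```python
import numpy as np

X = np.random.randn(8, 5)     # n = 8 samples, d = 5 features
W = np.random.randn(5, 5)     # weight coefficient matrix, W in R^{d x d}

frob = np.sqrt((X ** 2).sum())                  # ||X||_F = sqrt(sum_{i,j} x_{i,j}^2)
l21 = np.sqrt((X ** 2).sum(axis=1)).sum()       # ||X||_{2,1} = sum_i ||x_i||_2

# Row j of W weights the contribution of feature j, so its l2-norm serves as a
# feature score; the l2,1 regularizer pushes unimportant rows towards zero.
feature_scores = np.linalg.norm(W, axis=1)
ranked_features = np.argsort(-feature_scores)   # most important features first
```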
2.1 Robust Feature Selection

Considering efficiency and effectiveness, the filter model and the embedded model are the two most common feature selection models. In particular, the sparsity-based embedded model (sparse feature selection for short) has been widely used in real applications, since such methods select features by pushing the weight coefficients of unimportant features towards small values or even zeros and the weight coefficients of important features towards large values. Specifically, given a feature matrix X = [x_1; ...; x_n] = [x^1, ..., x^d] ∈ R^{n×d}, traditional unsupervised sparse feature selection methods [Bengio et al., 2009; Huang et al., 2011; Zhu et al., 2014b; 2015] can be formulated as:

    \min_{W} \|X - XW\|_F^2 + \lambda \|W\|_{2,1}    (1)

where λ is a non-negative tuning parameter and W ∈ R^{d×d} is the weight coefficient matrix.

Eq. (1) treats every sample equally, so the resulting model is easily influenced by outliers [Nie et al., 2010; Huber, 2011]. To address this issue, Nie et al. proposed to replace the Frobenius norm in Eq. (1) with an l_{2,1}-norm loss function [Nie et al., 2010; Zhu et al., 2016; 2017a]:

    \min_{W} \|X - XW\|_{2,1} + \lambda \|W\|_{2,1}    (2)

In Eq. (2), the residual X − XW (i.e., ||X − XW||_{2,1}) is not squared, so outliers have less influence than under the Frobenius norm [Nie et al., 2010]. Compared with Eq. (1), Eq. (2) takes sample diversity into account, but it lacks the flexibility to avoid the influence of outliers. Moreover, neither Eq. (1) nor Eq. (2) can deal with incomplete data sets.

In this paper, we employ the theory of robust statistics [Huber, 2011], such as self-paced learning [Kumar et al., 2010; Lee and Grauman, 2011], half-quadratic minimization [Nikolova and Ng, 2005], and M-estimation (i.e., maximum likelihood type estimation) [Negahban et al., 2009], to replace the l_{2,1}-norm loss function in Eq. (2) with predefined robust loss functions, aiming at flexible and robust feature selection. In robust statistics, loss functions can be constructed so that they first select MI samples from all the samples to construct an initial model; further MI samples are then automatically selected from the remaining samples and join the former MI samples to gradually improve the robustness and generalization ability of this initial model, until all the samples are used up and the model can no longer be improved [Kumar et al., 2010; Lee and Grauman, 2011]. By replacing the l_{2,1}-norm loss function in Eq. (2) with a general robust loss function φ(·), the robust feature selection framework can be formulated as:

    \min_{W} \phi(\|X - XW\|_F) + \lambda \|W\|_{2,1}    (3)

A number of robust loss functions have been designed in the literature [Fan et al., 2017; Nikolova and Chan, 2007; Nikolova and Ng, 2005], such as the Cauchy function, the l_1-l_2 function, the Welsch M-estimator, and the Geman–McClure estimator. Every robust loss function φ(·) has its own characteristics and can be flexibly chosen in real applications, e.g., by tuning the parameters of φ(·). In this sense, Eq. (2) can be regarded as a special case of Eq. (3).
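For intuition, the sketch below lists common parameterizations of the robust losses named above (the scaling constants vary across references, so treat these as illustrative rather than as the paper's exact definitions) and shows how slowly they grow on large residuals compared with the squared loss, which is why large-residual samples, i.e., potential outliers, contribute little to Eq. (3).

```python
import numpy as np

def squared_loss(z):
    return z ** 2

def l1_l2_loss(z):
    # l1-l2 function: quadratic near zero, roughly linear for large |z|.
    return 2.0 * (np.sqrt(1.0 + z ** 2 / 2.0) - 1.0)

def cauchy_loss(z, c=1.0):
    return (c ** 2 / 2.0) * np.log1p((z / c) ** 2)

def welsch_loss(z, c=1.0):
    return (c ** 2 / 2.0) * (1.0 - np.exp(-((z / c) ** 2)))

def geman_mcclure_loss(z, mu=1.0):
    # The estimator adopted later in Eq. (6): phi(z) = mu * z^2 / (mu + z^2).
    return mu * z ** 2 / (mu + z ** 2)

residuals = np.array([0.1, 0.5, 1.0, 5.0, 50.0])   # from a clean sample to an outlier
for phi in (squared_loss, l1_l2_loss, cauchy_loss, welsch_loss, geman_mcclure_loss):
    print(phi.__name__, np.round(phi(residuals), 3))
```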

2.2 Robust Feature Selection on Incomplete Data

A few feature selection methods have been designed to deal with incomplete data sets. One possible way to conduct feature selection on an incomplete data set is to build the feature selection model from the observed samples only, discarding the incomplete samples. Obviously, the higher the unobserved ratio of the data set, the lower the robustness and generalization ability of the resulting feature selection model. An alternative is to first impute the unobserved values with imputation methods (such as the mean-value method [Little and Rubin, 2014] or the k Nearest Neighbors (kNN) imputation method [Zhang et al., 2017]) and then conduct feature selection using all the samples, including the imputed incomplete samples. However, the imputation strategy is usually impractical for feature selection. Specifically, if the imputation model is estimated perfectly, the constructed feature selection model may over-fit; otherwise, the model will under-fit because of a bad imputation. Moreover, the ground truth of the unobserved data is unknown, so it is difficult to evaluate imputation models at all.

Based on the above observations, it may be unnecessary to impute unobserved values for feature selection. To make use of the observed information in incomplete samples, we propose to use an indicator matrix D ∈ R^{n×d} (of the same size as the feature matrix X) to keep unobserved values out of the numerical computation of feature selection, and thus propose the following objective function for robust feature selection on incomplete data sets:

    \min_{W} \phi(\|D \circ (X - XW)\|_F) + \lambda \|W\|_{2,1}    (4)

where D = [d_1; ...; d_n] is an indicator matrix: if the element x_{i,j} in the i-th row and the j-th column is unobserved, then d_{i,j} = 0, and otherwise d_{i,j} = 1. The symbol ◦ denotes the Hadamard product, i.e., the element-wise multiplication of two matrices of the same size.

According to Eq. (4), our proposed method conducts feature selection using all the available information (without imputing unobserved values) while relieving the influence of outliers. Moreover, the indicator matrix D can be plugged into any sparse feature selection model to conduct robust unsupervised feature selection with other embedded models, robust supervised feature selection, or robust semi-supervised feature selection on incomplete data sets.
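For illustration, here is a minimal sketch of the masking in Eq. (4), assuming unobserved entries are stored as NaN (a storage convention of this sketch, not something the paper specifies):

```python
import numpy as np

X_raw = np.array([[1.0, 2.0, np.nan],
                  [4.0, np.nan, 6.0],
                  [7.0, 8.0, 9.0]])

D = (~np.isnan(X_raw)).astype(float)   # d_{i,j} = 0 if x_{i,j} is unobserved, else 1
X = np.nan_to_num(X_raw, nan=0.0)      # zero-fill so the product XW is computable;
                                       # D then keeps unobserved entries out of the residual

def masked_residual_fro(X, D, W):
    """||D o (X - XW)||_F : only observed entries contribute to the residual."""
    R = D * (X - X @ W)                # o is the Hadamard (element-wise) product
    return np.sqrt((R ** 2).sum())

W = np.zeros((3, 3))
print(masked_residual_fro(X, D, W))    # with W = 0 this is simply ||D o X||_F
```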
2.3 Proposed Objective Function

Although a number of robust loss functions have been reported in the literature, the resulting objective functions may be difficult or inefficient to optimize, or even non-convex. The half-quadratic technique was designed to address this issue by introducing an auxiliary variable:

Lemma 1. Given a fixed scalar z, if a differentiable function φ(z) satisfies the four conditions listed in [Nikolova and Chan, 2007], then

    \phi(z) = \inf_{v \in \mathbb{R}} \{ v z^2 + \psi(v) \}    (5)

where v is determined by the minimization function of the dual potential function ψ(v) of φ(z).

In this paper, for a fixed scalar z, we select a concrete robust loss function, the Geman–McClure estimator [Geman and Mcclure, 1987],

    \phi(z) = \frac{\mu z^2}{\mu + z^2}    (6)

to replace φ(·) in Eq. (4), which yields:

    \min_{W} \frac{\mu \|D \circ (X - XW)\|_F^2}{\mu + \|D \circ (X - XW)\|_F^2} + \lambda \|W\|_{2,1}    (7)

where μ is used to tune the number of MI samples in each iteration [Geman and Mcclure, 1987]. Eq. (7) can be optimized by any gradient method. For efficient and scalable optimization, we apply Lemma 1 to Eq. (7) and obtain our final objective function:

    \min_{W, v} \sum_{i=1}^{n} \left( v_i \|d_i \circ (x_i - x_i W)\|_2^2 + \mu (\sqrt{v_i} - 1)^2 \right) + \lambda \|W\|_{2,1}    (8)

where ψ(v) = μ(√v − 1)^2, v = [v_1, ..., v_n] ∈ R^n is an auxiliary variable, and μ and λ are two non-negative tuning parameters. According to Lemma 1, Eq. (7) and Eq. (8) are equivalent with respect to the optimization of W, and they are more flexible than either Eq. (1) or Eq. (2), since v controls the selection of MI samples and also allows additional constraints to be imposed to make the model (e.g., Eq. (8)) even more flexible. Specifically, considering the optimal value of v_i, i.e., v_i = (μ / (μ + ||d_i ◦ (x_i − x_i W)||_2^2))^2, if the residual ||d_i ◦ (x_i − x_i W)||_2^2 is small, the i-th sample x_i can be regarded as an MI sample and its weight v_i will be large. By contrast, the weight of an outlier is small due to its large residual. Moreover, during the alternating iterations that optimize Eq. (8), the value of every v_i varies with the iteratively updated W, so the initial model constructed in the early iterations is gradually updated with different weights for each sample until the model reaches stability. Furthermore, the number of MI samples in every iteration can be flexibly controlled by the tuning parameter μ according to practical demand [Geman and Mcclure, 1987], e.g., by choosing μ relative to the average of {f_i = ||d_i ◦ (x_i − x_i W)||_2^2, i = 1, ..., n}.

The optimization of Eq. (8) is challenging because of the auxiliary variable v, the convex but non-smooth regularizer on W (i.e., ||W||_{2,1}), and the indicator matrix D. In this paper, we adopt an alternating optimization strategy (i.e., the framework of Iteratively Reweighted Least Squares (IRLS) [Daubechies et al., 2008]) to optimize Eq. (8): we update v with W fixed, and update W with v fixed.
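The paper does not spell out the concrete update rules, so the following is only a sketch of one possible alternating scheme for Eq. (8): the v-update uses the closed form for v_i quoted above, and the W-update handles ||W||_{2,1} with the standard IRLS reweighting in the spirit of [Daubechies et al., 2008; Peng and Fan, 2017]. Function and variable names are mine, not the authors'.

```python
import numpy as np

def robust_fs_incomplete(X, D, lam=1.0, mu=1.0, n_iter=20, eps=1e-8):
    """Alternating optimization sketch for Eq. (8).

    X : (n, d) feature matrix with unobserved entries zero-filled.
    D : (n, d) indicator matrix (1 = observed, 0 = unobserved).
    Returns the final W and the recorded (W, v) iterates.
    """
    n, d = X.shape
    W = np.zeros((d, d))
    Q = np.eye(d)                      # IRLS weights for ||W||_{2,1}; identity to start
    history = []
    for _ in range(n_iter):
        # v-update (W fixed): v_i = (mu / (mu + ||d_i o (x_i - x_i W)||_2^2))^2,
        # so small-residual (MI) samples receive large weights, outliers small ones.
        res = D * (X - X @ W)
        r2 = (res ** 2).sum(axis=1)
        v = (mu / (mu + r2)) ** 2

        # W-update (v fixed): the masked, sample-weighted least squares decouples
        # over the columns of W; each column solves a weighted ridge-type system
        # in which Q approximates the l2,1 penalty on the rows of W.
        for j in range(d):
            a = v * D[:, j]            # per-sample weights for the j-th column
            XtA = X.T * a              # equals X^T diag(a)
            W[:, j] = np.linalg.solve(XtA @ X + lam * Q, XtA @ X[:, j])

        # Refresh the IRLS weights from the new W: Q = diag(1 / (2 ||w_i||_2)),
        # where w_i is the i-th row of W (eps guards against zero rows).
        row_norms = np.sqrt((W ** 2).sum(axis=1)) + eps
        Q = np.diag(1.0 / (2.0 * row_norms))

        history.append((W.copy(), v.copy()))
    return W, history
```

Features can then be ranked by the row l2-norms of the returned W, as in the earlier sketch.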
2.4 Convergence Analysis

Denote by v^(t) and W^(t) the values of v and W at the t-th iteration, respectively, and rewrite Eq. (8) as:

    J(W^{(t)}, v^{(t)}) = \sum_{i=1}^{n} \left( v_i^{(t)} \|d_i \circ (x_i - x_i W^{(t)})\|_2^2 + \mu (\sqrt{v_i^{(t)}} - 1)^2 \right) + \lambda \|W^{(t)}\|_{2,1}    (9)
With W^(t) fixed, according to half-quadratic theory [Nikolova and Ng, 2005], we have:

    J(W^{(t)}, v^{(t+1)}) \le J(W^{(t)}, v^{(t)})    (10)

With v^(t+1) fixed, the IRLS framework gives the following inequality:

    J(W^{(t+1)}, v^{(t+1)}) \le J(W^{(t)}, v^{(t+1)})    (11)

Combining Eq. (10) with Eq. (11), we have:

    J(W^{(t+1)}, v^{(t+1)}) \le J(W^{(t)}, v^{(t)})    (12)

According to Eq. (12), the objective in Eq. (9) is non-increasing at each iteration of the optimization. Hence, our proposed optimization algorithm converges.
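As a small empirical companion to Eq. (12) (my addition, assuming the (W, v) iterates were recorded, e.g. the `history` returned by the sketch in Section 2.3), the following evaluates J of Eq. (9) and checks that it never increases across iterations:

```python
import numpy as np

def objective_J(X, D, W, v, mu, lam):
    """The objective of Eq. (9) for a given pair (W, v)."""
    res = D * (X - X @ W)
    r2 = (res ** 2).sum(axis=1)
    l21 = np.sqrt((W ** 2).sum(axis=1)).sum()
    return (v * r2 + mu * (np.sqrt(v) - 1.0) ** 2).sum() + lam * l21

def objective_is_non_increasing(X, D, history, mu, lam, tol=1e-8):
    """Check Eq. (12) on recorded iterates [(W_1, v_1), (W_2, v_2), ...]."""
    vals = [objective_J(X, D, W, v, mu, lam) for W, v in history]
    return all(b <= a + tol for a, b in zip(vals, vals[1:]))
```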
samples in the total samples, i.e.,
3 Experimental Analysis

3.1 Experimental Settings

In our experiments, we separated each incomplete data set into two subsets, i.e., the Incomplete Set (IS) containing all incomplete samples and the Observed Set (OS) containing all fully observed samples. First, we denote by Baseline the method that conducts k-means clustering with the original features of OS. We also employed a filter feature selection method and two embedded feature selection methods. The three comparison algorithms are summarized as follows:

• Laplacian Score (LapScore) [He et al., 2005] is an unsupervised filter feature selection model that evaluates the importance of each feature based on the Laplacian scores of the data.

• Regularized Self-Representation (RSR) [Zhu et al., 2015] first uses feature-level self-representation to reconstruct each feature as a linear combination of all features, and then uses an l_{2,1}-norm regularizer to sparsely penalize the regression matrix for feature selection. RSR can be considered as the basic algorithm of our proposed method.

• General Framework for Sparsity Regularized feature selection (GSR) [Peng and Fan, 2017] is a general sparse feature selection model that can combine different loss functions and sparse regularizers in the same framework by tuning parameters. Compared with other sparse feature selection methods, GSR is more flexible and can be robust to outliers.

Moreover, we use our proposed Robust Feature Selection (RFS) framework in Eq. (3) as another comparison algorithm.

Second, we used OS to impute the unobserved values in IS with two imputation methods, i.e., mean-value imputation (Mean) and kNN imputation (kNN) [Zhang et al., 2017], and then conducted feature selection on OS ∪ IS with GSR, denoted GSR_mean and GSR_kNN, followed by k-means clustering on OS with the selected features.

Third, we used our proposed Eq. (8) to conduct feature selection on OS ∪ IS, and then conducted k-means clustering on OS with the selected features. A sketch of these pipelines is given below.
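A hedged sketch of these pipelines using scikit-learn's imputers and k-means; `select_features` is a hypothetical placeholder standing in for any of the feature selection methods above (e.g., GSR), since their implementations are not part of this paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer, SimpleImputer

def run_pipeline(X_raw, n_clusters, keep_ratio, select_features, imputer=None):
    """X_raw: the whole data set with NaNs marking unobserved entries."""
    observed_rows = ~np.isnan(X_raw).any(axis=1)
    OS, IS = X_raw[observed_rows], X_raw[~observed_rows]   # Observed Set / Incomplete Set

    if imputer is None:                  # LapScore/RSR/GSR/RFS style: use OS only
        X_train = OS
    else:                                # GSR_mean / GSR_kNN style: impute IS from OS
        imputer.fit(OS)
        X_train = np.vstack([OS, imputer.transform(IS)])

    k = max(1, int(keep_ratio * X_raw.shape[1]))
    selected = select_features(X_train, k)   # indices of the k selected features

    # clustering is always evaluated on OS with the selected features
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(OS[:, selected])

# e.g. the two imputation-based variants:
# run_pipeline(X_raw, c, 0.5, select_features, SimpleImputer(strategy="mean"))
# run_pipeline(X_raw, c, 0.5, select_features, KNNImputer(n_neighbors=5))
```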
We employed a 10-fold cross-validation scheme to repeat every method on each data set ten times. We report the average over these ten repetitions, each of which is itself the average of 20 clustering runs. We set the parameter ranges of the comparison methods according to the corresponding literature, and set the range of the parameter λ of our method in Eq. (8) to {10^{-3}, 10^{-2}, ..., 10^{3}}. We further set the number of clusters in k-means clustering to the number of real classes of each data set.

We used two evaluation metrics, i.e., Accuracy (ACC) and Normalized Mutual Information (NMI), to evaluate the performance of all the methods on four real incomplete data sets. Their definitions are listed as follows; a computational sketch is given after the list.

• ACC is the percentage of correctly classified samples among all samples, i.e.,

    ACC = \frac{N_c}{N}    (13)

where N denotes the number of samples and N_c is the number of correctly classified samples.

• NMI measures the correlation between the obtained labels and the real labels, i.e.,

    NMI = \frac{2 I(X, Y)}{H(X) + H(Y)}    (14)

where I(X, Y) denotes the mutual information between the obtained labels and the real labels and H(·) is the corresponding entropy.
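A sketch of the two metrics follows. Eq. (13) only defines ACC as N_c / N; for clustering outputs we assume the usual convention of first matching cluster indices to class labels (e.g., with the Hungarian algorithm), which the paper does not state explicitly:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """ACC of Eq. (13) after mapping cluster ids to class labels (Hungarian matching)."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == k))
    rows, cols = linear_sum_assignment(cost)   # best cluster-to-class assignment
    n_correct = -cost[rows, cols].sum()        # N_c
    return n_correct / y_true.size             # N_c / N

def clustering_nmi(y_true, y_pred):
    # Eq. (14): 2 I(X, Y) / (H(X) + H(Y)), i.e. the 'arithmetic' averaging scheme.
    return normalized_mutual_info_score(y_true, y_pred, average_method="arithmetic")
```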
3.2 Real Incomplete Data Sets

We downloaded four real incomplete data sets from the UCI repository, i.e., Advertisement, Arrhythmia, Cvpu, and Mice, whose ratios of incomplete samples are 28.60%, 84.96%, 94.99%, and 48.89%, respectively. We report the clustering performance of all the feature selection methods with different ratios of selected features, i.e., {10%, 20%, ..., 90%} of all the features, in Figures 3 and 4.

From Figures 3 and 4, first, our method achieved the best clustering performance, followed by GSR_kNN, GSR_mean, RFS, GSR, RSR, LapScore, and Baseline. For example, in terms of ACC, our proposed method improved on average by 2.15% and 18.70% over the best comparison method (i.e., GSR_kNN) and the worst comparison method (i.e., Baseline), respectively. The reason may be that our method uses all the available information and relieves the influence of outliers, whereas the imputation-based methods (i.e., GSR_mean and GSR_kNN) use all the available information but do not take the influence of outliers into account, and the other feature selection methods (i.e., LapScore, RSR, GSR, and RFS) only use the observed samples and ignore the observed information in incomplete samples. Second, our method achieved its largest improvement over these four feature selection methods on the Cvpu data set, which has only 5.01% observed samples. This indicates that reliable feature selection models need enough training samples.

[Figure 1: The ACC (upper row) and NMI (bottom row) results of our method under varied settings of the parameter λ on the four real incomplete data sets, (a)/(e) Advertisement, (b)/(f) Arrhythmia, (c)/(g) Cvpu, (d)/(h) Mice, where 50% of the features were kept.]

[Figure 2: The convergence of our proposed method (objective function value versus iterations) on the four real incomplete data sets, (a) Advertisement, (b) Arrhythmia, (c) Cvpu, (d) Mice, where 50% of the features were kept.]

3.3 Parameter Sensitivity and Convergence

We list both the ACC and the NMI results of our proposed method under different values of the parameter λ in Figure 1. Our proposed method is only slightly sensitive to the parameter setting and achieves stable, good clustering performance within certain ranges. For example, on the Mice data set the ACC results of our method varied between 37.33% and 40.65%, and the best results were achieved for λ ∈ [10^{-3}, 10^{-2}]. This enables our method to achieve reliable clustering performance with easy parameter tuning. Figure 2 reports the objective function value of Eq. (8) over the iterations and clearly shows that our proposed method (i.e., Eq. (8)) can be optimized efficiently, converging within about 15 iterations.

4 Conclusion

This paper proposed a novel robust feature selection framework to conduct unsupervised feature selection on incomplete data sets. Specifically, we employed robust statistics to take sample diversity into account during feature selection, and used an indicator matrix to keep unobserved values out of the feature selection process while making use of all the available information. Experimental results showed that our method achieved the best clustering results compared with the methods under comparison. In future work, we will extend our framework to conduct supervised and semi-supervised feature selection on incomplete data sets.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grants No. 61573270 and 61672177), the Innovation Project of Guangxi Graduate Education (Grant No. YCSW2018093), the Guangxi High Institutions Program of Introducing 100 High-Level Overseas Talents, the Guangxi Collaborative Innovation Center of Multi-Source Information Integration and Intelligent Processing, and the Research Fund of Guangxi Key Lab of Multi-source Information Mining & Security (18-A-01-01).

[Figure 3: The ACC results of all the methods (Baseline, LapScore, RSR, GSR, RFS, GSR_mean, GSR_knn, Proposed) on the four data sets, (a) Advertisement, (b) Arrhythmia, (c) Cvpu, (d) Mice, as the ratio of selected features varies from 10% to 90%.]

[Figure 4: The NMI results of all the methods (Baseline, LapScore, RSR, GSR, RFS, GSR_mean, GSR_knn, Proposed) on the four data sets, (a) Advertisement, (b) Arrhythmia, (c) Cvpu, (d) Mice, as the ratio of selected features varies from 10% to 90%.]

References

[Bengio et al., 2009] Samy Bengio, Fernando Pereira, Yoram Singer, and Dennis Strelow. Group sparse coding. In NIPS, pages 82–89, 2009.

[Chandrashekar and Sahin, 2014] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.

[Daubechies et al., 2008] Ingrid Daubechies, Ronald Devore, Massimo Fornasier, and Sinan Gunturk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure & Applied Mathematics, 63(1):1–38, 2008.

[Doquire and Verleysen, 2012] Gauthier Doquire and Michel Verleysen. Feature selection with missing data using mutual information estimators. Neurocomputing, 90:3–11, 2012.

[Fan et al., 2017] Yanbo Fan, Ran He, Jian Liang, and Bao-Gang Hu. Self-paced learning: An implicit regularization perspective. In AAAI, pages 1877–1883, 2017.

[Geman and Mcclure, 1987] S. Geman and D. E. Mcclure. Statistical methods for tomographic image reconstruction. Proc. Session of the ISI, Bulletin of the ISI, LII-4, 1987.

[He et al., 2005] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In NIPS, pages 507–514, 2005.

[Huang et al., 2011] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. Journal of Machine Learning Research, 12(Nov):3371–3412, 2011.

[Huang et al., 2016] Dong Huang, Ricardo Cabral, and Fernando De la Torre. Robust regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):363–375, 2016.

[Huber, 2011] Peter J. Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. 2011.

[Kumar et al., 2010] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.

[Lee and Grauman, 2011] Yong Jae Lee and Kristen Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, pages 1721–1728, 2011.

[Little and Rubin, 2014] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2014.

[Negahban et al., 2009] Sahand Negahban, Bin Yu, Martin J. Wainwright, and Pradeep K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS, pages 1348–1356, 2009.

[Nie et al., 2010] Feiping Nie, Heng Huang, Xiao Cai, and Chris Ding. Efficient and robust feature selection via joint l2,1-norms minimization. In NIPS, pages 1813–1821, 2010.

[Nikolova and Chan, 2007] Mila Nikolova and R. H. Chan. The equivalence of half-quadratic minimization and the gradient linearization iteration. IEEE Transactions on Image Processing, 16(6):1623–1627, 2007.

[Nikolova and Ng, 2005] Mila Nikolova and Michael K. Ng. Analysis of half-quadratic minimization methods for signal and image recovery. SIAM Journal on Scientific Computing, 27(3):937–966, 2005.

[Peng and Fan, 2017] Hanyang Peng and Yong Fan. A general framework for sparsity regularized feature selection via iteratively reweighted least square minimization. In AAAI, pages 2471–2477, 2017.

[Song et al., 2007] Le Song, Alex Smola, Arthur Gretton, Karsten M. Borgwardt, and Justin Bedo. Supervised feature selection via dependence estimation. In ICML, pages 823–830, 2007.

[Unler et al., 2011] Alper Unler, Alper Murat, and Ratna Babu Chinnam. mr2PSO: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Information Sciences, 181(20):4625–4641, 2011.

[Van Hulse and Khoshgoftaar, 2014] Jason Van Hulse and Taghi M. Khoshgoftaar. Incomplete-case nearest neighbor imputation in software measurement data. Information Sciences, 259(3):596–610, 2014.

[Zhang et al., 2006] Shichao Zhang, Yongsong Qin, Xiaofeng Zhu, Jilian Zhang, and Chengqi Zhang. Optimized parameters for missing data imputation. In PRICAI, pages 1010–1016, 2006.

[Zhang et al., 2017] Shichao Zhang, Xuelong Li, Ming Zong, Xiaofeng Zhu, and Ruili Wang. Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2017.2673241, 2017.

[Zhao and Liu, 2007] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis. In SDM, pages 641–646, 2007.

[Zhu et al., 2014a] Xiaofeng Zhu, Heung-Il Suk, and Dinggang Shen. Multi-modality canonical feature selection for Alzheimer's disease diagnosis. In MICCAI, pages 162–169, 2014.

[Zhu et al., 2014b] Yingying Zhu, Dong Huang, Fernando De La Torre, and Simon Lucey. Complex non-rigid motion 3D reconstruction by union of subspaces. In CVPR, pages 1542–1549, 2014.

[Zhu et al., 2015] Pengfei Zhu, Wangmeng Zuo, Lei Zhang, Qinghua Hu, and Simon C. K. Shiu. Unsupervised feature selection by regularized self-representation. Pattern Recognition, 48(2):438–446, 2015.

[Zhu et al., 2016] Yingying Zhu, Xiaofeng Zhu, Minjeong Kim, Dinggang Shen, and Guorong Wu. Early diagnosis of Alzheimer's disease by joint feature selection and classification on temporally structured support vector machine. In MICCAI, pages 264–272, 2016.

[Zhu et al., 2017a] Pengfei Zhu, Wencheng Zhu, Qinghua Hu, Changqing Zhang, and Wangmeng Zuo. Subspace clustering guided unsupervised feature selection. Pattern Recognition, 66:364–374, 2017.

[Zhu et al., 2017b] Yingying Zhu, Xiaofeng Zhu, Minjeong Kim, Daniel Kaufer, and Guorong Wu. A novel dynamic hyper-graph inference framework for computer assisted diagnosis of neuro-diseases. In IPMI, pages 158–169. Springer, 2017.

[Zhu et al., 2018] Pengfei Zhu, Qian Xu, Qinghua Hu, Changqing Zhang, and Hong Zhao. Multi-label feature selection with missing labels. Pattern Recognition, 74:488–502, 2018.