
Sequential forward selection (SFS):

Sequential forward selection (SFS) is a bottom-up search procedure that first includes the single best feature and then searches for the next best feature by attempting to add each remaining feature to the previously selected subset [1]. SFS starts with an empty set and, at each step, adds the unused feature chosen by some evaluation function. The procedure continues in this way until no improvement is obtained by adding a new feature or until some stopping criterion is met [2]. In each iteration, the performance of each candidate attribute is estimated using cross-validation, and the feature that gives the highest value of the evaluation function is added to the selected set. SFS is widely used in wrapper-based feature selection because of its good balance between accuracy and speed.
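As an illustration, the sketch below runs SFS with scikit-learn's SequentialFeatureSelector, using cross-validated accuracy as the evaluation function; the dataset, the logistic-regression estimator, and the choice of five selected features are illustrative assumptions rather than part of the description above.

# Sketch of sequential forward selection with a wrapper evaluation
# (cross-validated accuracy of a logistic regression); the breast-cancer
# dataset and the target of five features are purely illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# direction="forward": start from the empty set and add one feature per step,
# keeping the feature whose addition gives the best cross-validated score.
sfs = SequentialFeatureSelector(
    estimator,
    n_features_to_select=5,   # illustrative stopping criterion
    direction="forward",
    scoring="accuracy",
    cv=5,
)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))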

Sequential backward selection (SBS):

Sequential backward selection (SBS) is a top-down search algorithm that starts with the full set of features and eliminates one of the remaining features in each cycle. It is the reverse of SFS and works in the opposite direction. SBS sequentially eliminates features from the full feature space until the new feature subspace contains the required number of features [3]. In each iteration, the performance is evaluated using the inner operators, and only the feature whose removal causes the smallest decrease in predictive performance is eliminated from the selected feature set [4]. The goal of backward elimination is to remove the most irrelevant features from the feature set and to obtain a more useful and predictive subset. SBS is more computationally expensive than SFS because feature elimination starts from the full feature set [5]. Moreover, SFS allows backtracking, which improves the predictive capabilities of a model and reduces the model execution time.

The pseudo-code of the SBS algorithm can be defined as:

1: Start with k = d, where d is the dimensionality of the full feature space X_d.

2: Determine the feature x⁻ that maximizes the criterion: x⁻ = arg max J(X_k − x), where x ∈ X_k.

3: Eliminate the feature x⁻ from the feature space: X_{k−1} := X_k − x⁻; k := k − 1.

4: Finish if k has reached the required number of features; otherwise repeat from step 2 [3].

Equivalently, the technique can be described as follows: starting from the full set of features, we sequentially remove the feature x⁻ whose removal least reduces the value of the objective function J(Y_k − x⁻) [8]. Removing a feature may even increase the objective function, i.e. J(Y_k − x⁻) > J(Y_k):

1. Start with the full set Y_0 = X.

2. Remove the worst feature x⁻ = arg max_{x ∈ Y_k} J(Y_k − x).

3. Update Y_{k+1} = Y_k − x⁻; k = k + 1.

4. Go to step 2. [Feature Selection Using Sequential Backward Selection Method in Melanoma Recognition]
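The steps above can be sketched in Python as follows; cross-validated accuracy is assumed as the criterion J, and the dataset, estimator, and target subset size of five features are illustrative choices, not part of the original pseudo-code.

# Sketch of sequential backward selection following the pseudo-code above.
# The criterion J is taken to be cross-validated accuracy; dataset,
# estimator and target subset size are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

def J(feature_idx):
    """Criterion J: cross-validated score of the candidate feature subset."""
    return cross_val_score(estimator, X[:, feature_idx], y, cv=5).mean()

k_target = 5                            # illustrative required number of features
selected = list(range(X.shape[1]))      # step 1: start with the full feature set X_d

while len(selected) > k_target:         # step 4: stop once k features remain
    # step 2: find the feature x- whose removal maximizes J(X_k - x)
    scores = {x: J([f for f in selected if f != x]) for x in selected}
    worst = max(scores, key=scores.get)
    # step 3: eliminate x- from the current feature space
    selected.remove(worst)

print("Remaining feature indices:", selected)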
Genetic algorithm (GA): The genetic algorithm is one of the most widely used wrapper-based feature selection methods. It is a heuristic search algorithm that simulates the process of natural evolution [6, 4]. The parameters and operators of the genetic algorithm can be modified to obtain the best search results within the general framework of an evolutionary algorithm. Heuristic search algorithms evaluate different feature subsets to optimize the objective function [7]. Candidate subsets can be generated in two ways: by searching around the search space or by evolving solutions to the optimization problem. In such a heuristic search, the genetic algorithm uses these candidate subsets to generate useful solutions to optimization and search problems [8].
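A minimal sketch of a GA wrapper is given below, assuming a binary-mask encoding of feature subsets and cross-validated accuracy as the fitness function; the population size, number of generations, mutation rate, and dataset are illustrative choices rather than a reference implementation.

# Sketch of a genetic-algorithm wrapper for feature selection.
# Each chromosome is a binary mask over the features; fitness is the
# cross-validated accuracy of a classifier trained on the masked features.
# Population size, generations and mutation rate are illustrative choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]
estimator = LogisticRegression(max_iter=5000)

def fitness(mask):
    if mask.sum() == 0:                       # empty subsets are invalid
        return 0.0
    return cross_val_score(estimator, X[:, mask.astype(bool)], y, cv=3).mean()

# initial population of random feature masks
pop = rng.integers(0, 2, size=(20, n_features))

for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    order = np.argsort(scores)[::-1]
    parents = pop[order[:10]]                 # truncation selection: keep the best half
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_features)     # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.05  # bit-flip mutation
        child = np.where(flip, 1 - child, child)
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("Selected features:", np.flatnonzero(best))

Truncation selection and single-point crossover are used here only to keep the sketch short; any standard selection, crossover, or mutation operator could be substituted.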

Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a well-known feature selection method for classification problems in which only a small subset of features is relevant [9]. RFE improves generalization performance by eliminating the least important features, i.e. those features whose elimination does not affect the training error. At each iteration, RFE removes the least useful features and re-ranks the remaining features, repeating the process until all features have been explored. If a weak feature exists, RFE will simply remove it while constructing the desired model [10]. However, a weak feature may become important when it is used together with other features [11]; removing such redundant or weak features can therefore harm classification performance.
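A brief sketch using scikit-learn's RFE with a linear SVM as the ranking estimator (the setting of [9]) is shown below; the dataset and the number of retained features are illustrative assumptions.

# Sketch of recursive feature elimination: a linear SVM is fitted repeatedly,
# and the feature with the smallest weight magnitude is dropped at each step.
# Dataset and target subset size are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

rfe = RFE(
    estimator=SVC(kernel="linear"),  # provides coef_ used for ranking
    n_features_to_select=10,         # illustrative stopping point
    step=1,                          # remove one feature per iteration
)
rfe.fit(X, y)

print("Kept features:", rfe.get_support(indices=True))
print("Elimination ranking (1 = kept):", rfe.ranking_)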

Embedded methods:

Lasso regularization: The most effective and widely used technique for regularization and feature selection is the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) [12, 13]. LASSO adds an ℓ1 regularization term to the loss function. The loss function of Lasso regression is defined as:

L = \sum_{i} \Big( y_i - \sum_{p} x_{ip} \beta_p \Big)^2 + \lambda \sum_{p} |\beta_p|

where x_{ip} indicates the pth feature of the ith observation, y_i represents the value of the response for this observation, and β_p denotes the regression coefficient of the pth feature [14]. LASSO reduces the absolute sum of the coefficients (ℓ1 regularization).

The LASSO is a particular case of penalized least squares regression with an ℓ1 penalty function. The LASSO estimate can be defined by

\hat{\beta}^{lasso} = \arg\min_{\beta} \Big\{ \tfrac{1}{2} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}

[15]

where λ is a tuning parameter that controls the amount of shrinkage and the effect of regularization. The value of λ is estimated by a cross-validation procedure. LASSO uses the ℓ1 penalty (∑_j |β_j|), and since the penalty term depends on the tuning parameter, λ is an important part of the model fitting [16]. When λ is large, the regression coefficients of the most irrelevant or redundant features are shrunk to zero. Because Lasso's ℓ1 penalty leads to sparse solutions in the feature space, it is widely used to reduce the number of features in high-dimensional data [17]. Lasso is especially effective when there is a large number of irrelevant features and only a small number of training observations.
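The sparsity-inducing behaviour described above can be illustrated with the following scikit-learn sketch; the synthetic data are an assumption for illustration, and note that scikit-learn names the tuning parameter alpha rather than λ.

# Sketch of LASSO feature selection: the tuning parameter (alpha in
# scikit-learn, lambda in the text) is chosen by cross-validation, and
# features whose coefficients are shrunk exactly to zero are discarded.
# The synthetic regression problem is purely illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# 100 samples, 50 features, only 10 of which are actually informative
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)

selected = np.flatnonzero(lasso.coef_)
print("Chosen lambda (alpha):", lasso.alpha_)
print("Non-zero coefficients:", selected.size, "of", X.shape[1])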

Ridge regression: Ridge regression was proposed by Hoerl and Kennard [18] in 1970 and provides a method to reduce the least squares error in the presence of collinear features. The ridge regression method is highly effective for minimizing variability by shrinking the coefficients and avoiding over-fitting, which yields better prediction accuracy [19]. Ridge regression minimizes the following loss function:

f_{loss} = \| X_t W_t - Y_t \|^2 + \lambda \| W_t \|^2 , (1)

where X_t denotes the input data from the training set, Y_t the desired output, W_t the output weights, and λ the regularization parameter that adds an extra cost to the squared norm of the output weights [20]. The regularization parameter λ controls the bias-variance trade-off, and cross-validation is applied to select its value [21]. Ridge regression differs from least squares only in that the ridge coefficients are estimated by minimizing a slightly different quantity; in particular, when λ = 0, ridge regression is equivalent to least squares.

The ridge coefficients minimize a penalized residual sum of squares; ridge regression uses the ℓ2 penalty (∑_j β_j²):

\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}

Using the estimated coefficient equation β̂ = (XᵀX + hI)⁻¹ Xᵀy, ridge regression solves the least squares problem, where h is the ridge parameter and I is the identity matrix [15]. Ridge regression penalizes the magnitude of the parameters through ℓ2 regularization of the parameter vector β and shrinks the estimated coefficients towards, but not exactly to, zero [22].
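For comparison with the LASSO sketch, the following illustrative snippet selects λ by cross-validation for ridge regression on a synthetic, partly collinear data set; as before, scikit-learn names the tuning parameter alpha.

# Sketch of ridge regression: the regularization parameter lambda (alpha in
# scikit-learn) is selected by cross-validation over a candidate grid, and the
# coefficients are shrunk towards zero without becoming exactly zero.
# The synthetic, partly collinear data set is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=20, n_informative=10,
                       effective_rank=5, noise=5.0, random_state=0)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)

print("Chosen lambda (alpha):", ridge.alpha_)
print("Coefficients shrunk to exactly zero:", int(np.sum(ridge.coef_ == 0.0)))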

Elastic net: The elastic net is a regularization method that was introduced by Zou and Hastie [24]. It is a linear combination of the ℓ1 and ℓ2 penalties and was designed to deal with the limitations of LASSO. Moreover, a regularization method that selects the correct subset of features with probability tending to one is desirable [23]. The elastic net penalty is defined by
\hat{\beta}^{elastic} = \arg\min_{\beta} \Big\{ \tfrac{1}{2} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \big[ (1-\alpha) |\beta_j| + \alpha \beta_j^2 \big] \Big\}, \qquad \alpha \in [0, 1]

The elastic net prediction depends on the tuning parameters λ1 and λ2, which control the regularized regression coefficients. The elastic net uses a combination of both penalties according to α ∈ (0, 1). The function (1 − α)|β_j| + α β_j² is called the elastic net penalty; it is a convex combination of the lasso and ridge penalties. When α = 1, the elastic net is identical to ridge regression. The elastic net simultaneously performs automatic variable selection and continuous shrinkage, and it can select groups of correlated variables [24].
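A minimal scikit-learn sketch of the elastic net is given below; the synthetic data and parameter grids are illustrative assumptions, and note that scikit-learn's l1_ratio weights the ℓ1 term (l1_ratio = 1 gives the lasso), which is the reverse of the α convention used above.

# Sketch of the elastic net: both the overall penalty strength and the mix
# between the l1 and l2 terms are tuned by cross-validation. Note that
# scikit-learn's l1_ratio weights the l1 part (l1_ratio=1 is pure lasso),
# the reverse of the alpha convention used in the text above.
# The synthetic data with correlated features are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=100, n_features=40, n_informative=8,
                       effective_rank=10, noise=5.0, random_state=0)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
                    cv=5, random_state=0).fit(X, y)

print("Chosen penalty strength:", enet.alpha_)
print("Chosen l1/l2 mix (l1_ratio):", enet.l1_ratio_)
print("Non-zero coefficients:", int(np.sum(enet.coef_ != 0)))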

References
[1] P. Bermejo, J. A. Gamez and J. M. Puerta, "Incremental Wrapper-based subset Selection with
replacement: An advantageous alternative to sequential forward selection," 2009 IEEE Symposium on
Computational Intelligence and Data Mining, Nashville, TN, 2009, pp. 367-374, doi:
10.1109/CIDM.2009.4938673.

[2] Borboudakis, Giorgos and Ioannis Tsamardinos. “Forward-Backward Selection with Early Dropping.”
J. Mach. Learn. Res. 20 (2019): 8:1-8:39.

[3] A. U. Haq, J. Li, M. H. Memon, M. Hunain Memon, J. Khan and S. M. Marium, "Heart Disease
Prediction System Using Model Of Machine Learning and Sequential Backward Selection Algorithm for
Features Selection," 2019 IEEE 5th International Conference for Convergence in Technology (I2CT),
Bombay, India, 2019, pp. 1-4, doi: 10.1109/I2CT45611.2019.9033683.

[4] Panthong, Rattanawadee and Anongnart Srivihok. “Wrapper Feature Subset Selection for Dimension
Reduction Based on Ensemble Learning Algorithm.” Procedia Computer Science 72 (2015): 162-169.

[5] Dunne, K., P. Cunningham and F. Azuaje. “Solutions to Instability Problems with Sequential
Wrapper-based Approaches to Feature Selection.” (2002).

[6] R. Gutierrez-Osuna, "Pattern analysis for machine olfaction: a review," in IEEE Sensors Journal, vol.
2, no. 3, pp. 189-202, June 2002, doi: 10.1109/JSEN.2002.800688.

[7] Chandrashekar, G. and F. Sahin. “A survey on feature selection methods.” Comput. Electr. Eng. 40
(2014): 16-28.

[8] Li Zhuo, Jing Zheng, Xia Li, Fang Wang, Bin Ai, and Junping Qian "A genetic algorithm based
wrapper feature selection method for classification of hyperspectral images using support vector
machine", Proc. SPIE 7147, Geoinformatics 2008 and Joint Conference on GIS and Built Environment:
Classification of Remote Sensing Images, 71471J (7 November 2008); https://doi.org/10.1117/12.813256

[9] Guyon, I., Weston, J., Barnhill, S. et al. “Gene Selection for Cancer Classification using Support
Vector Machines." Machine Learning 46, 389–422 (2002). https://doi.org/10.1023/A:1012487302797

[10] X. Chen and J. C. Jeong, "Enhanced recursive feature elimination," Sixth International Conference
on Machine Learning and Applications (ICMLA 2007), Cincinnati, OH, 2007, pp. 429-435, doi:
10.1109/ICMLA.2007.35.

[11] Guyon, I. and A. Elisseeff. “An Introduction to Variable and Feature Selection.” J. Mach. Learn.
Res. 3 (2003): 1157-1182.
[12] Tibshirani, R. "Regression shrinkage and selection via the lasso." J. R. Statist. Soc. B 58, no. 1 (1996): 267-288.

[13] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B 58 (1996)
267–288.

[14] Li, Fan, Yiming Yang and E. Xing. “From Lasso regression to Feature vector machine.” NIPS
(2005).

[15] R. Muthukrishnan and R. Rohini, "LASSO: A feature selection technique in predictive modeling for
machine learning," 2016 IEEE International Conference on Advances in Computer Applications
(ICACA), Coimbatore, 2016, pp. 18-20, doi: 10.1109/ICACA.2016.7887916.

[16] Algamal, Z. and M. H. Lee. “Regularized logistic regression with adjusted adaptive elastic net for
gene selection in high dimensional cancer classification.” Computers in biology and medicine 67 (2015):
136-45 .

[17] Zhao, Peng, and Bin Yu. "On model selection consistency of Lasso." Journal of Machine learning
research 7, no. Nov (2006): 2541-2563.

[18] A.E. Hoerl, and R.W. Kennard, “Ridge regression: Applications to nonorthogonal problems,”
Technometrics, vol. 12, 1970a, pp. 69-82.

[19] Zhang, S., Cheng, D., Hu, R., “Supervised feature selection algorithm via discriminative ridge
regression.” World Wide Web 21,1545–1562 (2018). https://doi.org/10.1007/s11280-017-0502-9

[20] Buteneers, P., Caluwaerts, K., Dambre, J. “Optimized Parameter Search for Large Datasets of the
Regularization Parameter and Feature Selection for Ridge Regression”. Neural Process Lett 38, 403–416
(2013). https://doi.org/10.1007/s11063-013-9279-8

[21] Cawley, Gavin C. "Causal & non-causal feature selection for ridge regression." In Causation and
Prediction Challenge, pp. 107-128. 2008.

[22] Paul, Saurabh, and Petros Drineas. "Feature selection for ridge regression with provable guarantees."
Neural computation 28, no. 4 (2016): 716-742.

[23] Algamal, Z. and M. H. Lee. “Regularized logistic regression with adjusted adaptive elastic net for
gene selection in high dimensional cancer classification.” Computers in biology and medicine 67 (2015):
136-45 .

[24] Zou, Hui, and Trevor Hastie. "Regularization and variable selection via the elastic net." Journal of
the royal statistical society: series B (statistical methodology) 67, no. 2 (2005): 301-320
