
Chi-Square:

The Chi-Square algorithm is a univariate feature selection algorithm that evaluates the correlation between a feature and the class label. It tests the independence of the feature and the class variable by computing the χ² statistic. In feature selection, χ² measures the independence of each feature with respect to the class [1]. During the evaluation, all features are categorized and the method is applied to a labeled dataset to test whether the class label is independent of a given feature. The Chi-Square score χ² for a particular feature with k distinct values and c classes is calculated as follows:
\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}

Here, o_{ij} is the observed number of samples that take the i-th value of the feature and belong to the j-th class, and the expected frequency e_{ij} is computed as:

e_{ij} = \frac{o_{i*} \, o_{*j}}{o}

Here, o_{i*} is the number of samples that take the i-th value of the given feature, o_{*j} is the number of samples in the j-th class, and o is the total number of samples [2]. A higher value of χ² indicates a stronger dependency between the corresponding feature and the class variable.
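
As an illustration, the following is a minimal Python/NumPy sketch of this score; the array contents and the helper name chi_square_score are made up for the example and are not taken from the cited works.

import numpy as np

def chi_square_score(feature, labels):
    """Chi-square score of one categorical feature with respect to the class labels."""
    values, classes = np.unique(feature), np.unique(labels)
    o = len(feature)                                       # total number of samples
    score = 0.0
    for v in values:
        for c in classes:
            o_ij = np.sum((feature == v) & (labels == c))  # observed count o_ij
            o_i = np.sum(feature == v)                     # samples taking value v (o_i*)
            o_j = np.sum(labels == c)                      # samples in class c (o_*j)
            e_ij = o_i * o_j / o                           # expected count under independence
            score += (o_ij - e_ij) ** 2 / e_ij
    return score

# Toy usage: a larger score suggests a stronger feature/class dependency.
feature = np.array(["a", "a", "b", "b", "b", "a"])
labels = np.array([0, 0, 1, 1, 0, 0])
print(chi_square_score(feature, labels))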

Symmetric Uncertainty (SU):

Symmetrical Uncertainty (SU) is a statistical measure derived from information gain (IG). It is designed to overcome IG's bias towards features with more values. It calculates an SU score for every feature and ranks all features according to their individual evaluation: the feature with the maximum SU score is placed in the top position and the one with the minimum score in the last position [3]. The score is calculated using the following formula:

SU = 2 \times \frac{IG(X, Y)}{H(X) + H(Y)}

where IG is the information gain, H(X) is the entropy of X, and H(Y) is the entropy of Y. Because of this normalization, the SU score lies in the range [0, 1]. If X and Y do not depend on each other, then SU = 0; on the contrary, SU = 1 when X and Y are completely correlated, since knowledge of one fully predicts the other. SU cannot identify correlations involving more than two features, as it is defined only for pairs of variables [4].
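
A minimal sketch of this score, assuming discrete-valued variables and entropies estimated from empirical frequencies (the helper names are illustrative):

import numpy as np

def entropy(x):
    """Empirical Shannon entropy H(X) in bits."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """H(X|Y) = sum_j P(y_j) H(X | Y = y_j)."""
    return sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))

def symmetric_uncertainty(x, y):
    """SU = 2 * IG(X, Y) / (H(X) + H(Y)); lies in [0, 1]."""
    ig = entropy(x) - conditional_entropy(x, y)            # information gain
    denom = entropy(x) + entropy(y)
    return 2.0 * ig / denom if denom > 0 else 0.0

x = np.array([1, 1, 0, 0, 1, 0])
y = np.array(["p", "p", "n", "n", "p", "n"])
print(symmetric_uncertainty(x, y))                         # 1.0: x fully determines y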

Principal Component Analysis (PCA):

Principal Component Analysis (PCA) is a statistical procedure that reduces dimensionality in the analysis of real data. It transforms a set of possibly correlated attributes into a set of linearly uncorrelated attributes known as principal components. The transformation is obtained by estimating the covariance matrix (n × n) of the original attributes and computing its eigenvectors and eigenvalues [5]. It then retains the q eigenvectors with the largest eigenvalues. Assuming that an n-dimensional vector y is projected onto the q-dimensional subspace, the projected vector p is computed by [6]:

p = X^{T} (y - \mu)

where X is the n × q matrix whose columns are the selected eigenvectors and μ is the mean vector of the data.

PCA uses a ranker search strategy that ranks the components according to how much of the variance of the original dataset each one captures. It keeps the most variant components and discards the remaining ones. Moreover, it is particularly useful for unsupervised data because it does not need class labels to identify which features are relevant for the model [7].
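
The transformation can be sketched as follows (a minimal example assuming the data matrix has samples in rows; function and variable names are illustrative):

import numpy as np

def pca_project(data, y, q):
    """Project an n-dimensional vector y onto the q principal components
    estimated from `data` (rows are samples, columns are attributes)."""
    mu = data.mean(axis=0)                         # mean vector of the attributes
    cov = np.cov(data, rowvar=False)               # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues and eigenvectors
    top_q = np.argsort(eigvals)[::-1][:q]          # indices of the q largest eigenvalues
    X = eigvecs[:, top_q]                          # n x q eigenvector matrix
    return X.T @ (y - mu)                          # p = X^T (y - mu)

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))                   # 100 samples, 5 attributes
print(pca_project(data, data[0], q=2).shape)       # (2,)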

Information gain (IG):

Information gain, also known as Kullback-Leibler divergence, is a supervised, univariate, entropy-based feature selection algorithm of the filter model; it is a measure of the difference between two probability distributions [7]. The information gain of a feature X with respect to the class labels Y is defined by:

IG(X, Y) = H(X) – H (X|Y)

Entropy (H) is a measure of the uncertainty associated with a random variable. H(X) and H(X|Y) are the entropy of X and the entropy of X after observing Y, respectively.

The entropy of X is estimated as follows:

H(X) = -\sum_{i} P(x_i) \log_2 P(x_i)

The entropy of X after observing Y is defined as follows:

H(X|Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i|y_j) \log_2 P(x_i|y_j)

The features with the maximum values are selected as the relevant features. Information gain is estimated separately for each feature. This feature selection algorithm does not eliminate redundant features, since the features are evaluated in a univariate way [8].
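
For illustration, the sketch below ranks two discrete features by their information gain with the class; the feature arrays and helper names are made up for the example.

import numpy as np

def entropy(x):
    """H(X) = -sum_i P(x_i) log2 P(x_i), estimated from value frequencies."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """IG(X, Y) = H(X) - H(X|Y) for a discrete feature x and class labels y."""
    h_cond = sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))
    return entropy(x) - h_cond

y = np.array([0, 0, 1, 1, 1, 0])
features = {"f1": np.array([1, 1, 0, 0, 0, 1]),    # perfectly aligned with y
            "f2": np.array([1, 0, 1, 0, 1, 0])}    # nearly independent of y
ranking = sorted(features, key=lambda k: information_gain(features[k], y), reverse=True)
print(ranking)                                      # ['f1', 'f2']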

Relief:

Relief is an instance-based feature selection method that was proposed by Kira and Rendell [9] in 1992. As a filter method, it estimates the quality of features individually. It works by randomly selecting a target instance R_i from the data and finding its nearest neighbors from the same and the opposite class [10]. Relief identifies two nearest neighbors: one from the same class, called the nearest hit H, and the other from a different class, called the nearest miss M. The feature score vector W measures the feature-value differences between the target instance R_i and these nearest neighbors. W is updated whenever a feature value differs between the target instance R_i and a neighboring instance, either the hit H or the miss M: the feature score is decreased when the value differs between the target instance R_i and the nearest hit H, and, on the contrary, it is increased when the value differs between the target instance R_i and the nearest miss M [11].
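
A minimal sketch of this update rule for a two-class dataset, assuming feature values scaled to [0, 1] and at least two instances per class (names are illustrative):

import numpy as np

def relief(X, y, n_iter=20, rng=None):
    """Relief feature weights W: higher weights indicate more relevant features."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)                                # random target instance R_i
        r = X[i]
        same = np.setdiff1d(np.where(y == y[i])[0], [i])   # candidates for the nearest hit
        diff = np.where(y != y[i])[0]                      # candidates for the nearest miss
        hit = X[same[np.argmin(np.linalg.norm(X[same] - r, axis=1))]]
        miss = X[diff[np.argmin(np.linalg.norm(X[diff] - r, axis=1))]]
        W -= np.abs(r - hit) / n_iter                      # differing from the hit lowers the score
        W += np.abs(r - miss) / n_iter                     # differing from the miss raises the score
    return W

rng = np.random.default_rng(1)
X = rng.random((40, 3))
y = (X[:, 0] > 0.5).astype(int)                            # only feature 0 is informative
print(relief(X, y, rng=rng))                               # feature 0 should get the largest weight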

ReliefF:

ReliefF is a supervised multivariate feature selection algorithm that was introduced by Kononenko [12] in 1994. It is an extension of the Relief algorithm that can handle multiclass problems and is capable of dealing with incomplete and noisy data [13]. Assume that p instances are randomly selected from the original data. The evaluation criterion of ReliefF is estimated by the following formula:
RF = \frac{1}{2} \sum_{t=1}^{p} \left[ d\left(f_{m,t} - f_{PQ(x_m),t}\right) - d\left(f_{m,t} - f_{PR(x_m),t}\right) \right]

where f_{m,t} represents the value of instance x_m on the t-th feature, f_{PQ(x_m),t} and f_{PR(x_m),t} represent the values on the t-th feature of the nearest points to x_m, and d(·) represents the distance measure [6].

Extending the above criterion, ReliefF can handle the multiclass problem by using the following equation:

RF = \frac{1}{p} \sum_{t=1}^{p} \left[ -\frac{1}{m_{x_t}} \sum_{x_i \in NH(x_t)} d(x_t, x_i) + \sum_{y \neq y_{x_t}} \frac{1}{m_{x_t,y}} \frac{P(y)}{1 - P(y_{x_t})} \sum_{x_i \in NM(x_t, y)} d(x_t, x_i) \right]

Here, y_{x_t} is the class label of the instance x_t, P(y) is the probability of an instance belonging to class y, NH(x) and NM(x, y) denote the sets of nearest points to x with the same class as x and with a different class y, respectively, and m_{x_t} and m_{x_t,y} are the sizes of the sets NH(x_t) and NM(x_t, y), respectively [8].
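
A simplified sketch of this multiclass criterion, assuming features scaled to [0, 1] and at least k+1 instances per class (the function name relieff and the toy data are illustrative):

import numpy as np

def relieff(X, y, p=30, k=3, rng=None):
    """Simplified ReliefF scores: averages over k nearest hits and k nearest misses per class."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))                 # P(y)
    W = np.zeros(d)
    for _ in range(p):
        t = rng.integers(n)
        xt, yt = X[t], y[t]
        dist = np.linalg.norm(X - xt, axis=1)
        dist[t] = np.inf                                   # never pick the target itself
        hits = np.argsort(np.where(y == yt, dist, np.inf))[:k]   # NH(x_t)
        W -= np.abs(X[hits] - xt).mean(axis=0) / p
        for c in classes:                                  # NM(x_t, y) for each other class y
            if c == yt:
                continue
            misses = np.argsort(np.where(y == c, dist, np.inf))[:k]
            weight = prior[c] / (1.0 - prior[yt])          # P(y) / (1 - P(y_{x_t}))
            W += weight * np.abs(X[misses] - xt).mean(axis=0) / p
    return W

rng = np.random.default_rng(2)
X = rng.random((60, 4))
y = np.digitize(X[:, 1], [0.33, 0.66])                     # 3 classes driven by feature 1
print(relieff(X, y, rng=rng))                              # feature 1 should score highest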

One-R:

One-R is a simple instance-based feature selection algorithm proposed by Holte [14]. It creates one rule for each feature in the training data set to predict the class and selects the rule with the smallest error. It treats all numerically valued features as continuous and splits their range of values into several disjoint intervals [7]. To create a rule for a feature, the most frequent class for each feature value must be determined. The score of a feature X_k is calculated as follows: let V(k) denote the set of possible values of feature X_k and, for each value v ∈ V(k), let n_{vi}^{(k)} denote the number of instances with X_k = v and class i, i ∈ {1, ..., l}. A simple classification rule for instances with X_k = v predicts the class i with the highest count n_{vi}^{(k)}. The One-R filter score is:

J_{oneR}(X_k) = \frac{1}{n} \sum_{v \in V(k)} \max_{i \in \{1, \ldots, l\}} n_{vi}^{(k)}

where n is the total number of training instances. The feature X_k is considered suitable for univariate class prediction when its rule gives high accuracy [15].
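
As a small worked example, the score below is the fraction of instances correctly classified by the best single rule on a (categorical or pre-discretized) feature; the data and helper name are illustrative.

import numpy as np

def one_r_score(feature, labels):
    """One-R filter score J_oneR(X_k) for a categorical/discretized feature."""
    n = len(feature)
    correct = 0
    for v in np.unique(feature):
        _, counts = np.unique(labels[feature == v], return_counts=True)
        correct += counts.max()          # the rule predicts the majority class for value v
    return correct / n

feature = np.array(["low", "low", "high", "high", "high"])
labels = np.array([0, 0, 1, 1, 0])
print(one_r_score(feature, labels))      # 0.8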

Correlation-based Feature Selection (CFS):

Correlation-based Feature Selection (CFS) is a simple multivariate filter algorithm that uses a heuristic function to evaluate and rank subsets of attributes by their correlation. The heuristic maintains a trade-off between the predictive power of a group of features and the redundancy among them [5]. The motivation of this algorithm is to find subsets consisting of features that are highly correlated with the class and uncorrelated with one another. Irrelevant features should be removed because of their low correlation with the class, and redundant features should be excluded because they are highly correlated with one or more of the remaining features. Whether a feature is accepted depends on the extent to which it predicts classes in areas of the instance space not already predicted by other features [16]. The following equation represents the CFS feature subset evaluation function:

r_{zc} = \frac{k \, \bar{r}_{zi}}{\sqrt{k + k(k-1)\, \bar{r}_{ii}}}
where r_{zc} represents the correlation between the summed features of the subset and the class variable, k is the number of features, \bar{r}_{zi} represents the average value of the feature-class (f ∈ S) correlations, and \bar{r}_{ii} represents the average value of the feature-feature inter-correlations [17].
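
A minimal sketch of this merit function, assuming the feature-class and feature-feature correlations have already been computed (the example correlation values are hypothetical):

import numpy as np

def cfs_merit(feature_class_corr, feature_feature_corr):
    """CFS merit of a subset: r_zc = k * mean(r_zi) / sqrt(k + k(k-1) * mean(r_ii))."""
    k = len(feature_class_corr)
    r_zi = np.mean(np.abs(feature_class_corr))                 # average feature-class correlation
    off_diag = feature_feature_corr[~np.eye(k, dtype=bool)]    # pairwise inter-correlations
    r_ii = np.mean(np.abs(off_diag)) if k > 1 else 0.0
    return k * r_zi / np.sqrt(k + k * (k - 1) * r_ii)

# Hypothetical 3-feature subset: strongly class-correlated, weakly inter-correlated.
r_zi = np.array([0.6, 0.5, 0.4])                   # feature-class correlations
r_ii = np.array([[1.0, 0.2, 0.1],
                 [0.2, 1.0, 0.3],
                 [0.1, 0.3, 1.0]])                 # feature-feature correlation matrix
print(cfs_merit(r_zi, r_ii))                       # ~0.73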

References

[1] Osanaiye, Opeyemi, Haibin Cai, Kim-Kwang Raymond Choo, Ali Dehghantanha, Zheng Xu, and
Mqhele Dlodlo. "Ensemble-based multi-filter feature selection method for DDoS detection in cloud
computing." EURASIP Journal on Wireless Communications and Networking 2016, no. 1 (2016): 130.

[2] Vora, Suchi, and Hui Yang. "A comprehensive study of eleven feature selection algorithms and their
impact on text classification." In 2017 Computing Conference, pp. 440-449. IEEE, 2017.
[3] Potharaju, Sai Prasad, M. Sreedevi, and Shanmuk Srinivas Amiripalli. "An Ensemble Feature
Selection Framework of Sonar Targets Using Symmetrical Uncertainty and Multi-Layer Perceptron (SU-
MLP)." In Cognitive Informatics and Soft Computing, pp. 247-256. Springer, Singapore, 2019.

[4] Sosa-Cabrera, Gustavo, Miguel García-Torres, Santiago Gómez-Guerrero, Christian E. Schaerer, and
Federico Divina. "A multivariate approach to the symmetrical uncertainty measure: Application to feature
selection problem." Information Sciences 494 (2019): 1-20.

[5] Afzal, Wasif, and Richard Torkar. "Towards benchmarking feature subset selection methods for
software fault prediction." In Computational intelligence and quantitative software engineering, pp. 33-
58. Springer, Cham, 2016.

[6] Kumar, V. Arul, and L. Arockiam. "MFSPFA: an enhanced filter based feature selection algorithm."
International Journal of Computer Applications 51, no. 12 (2012).

[7] Yildirim, Pinar. "Filter based feature selection methods for prediction of risks in hepatitis disease."
International Journal of Machine Learning and Computing 5, no. 4 (2015): 258.

[8] Porkodi, R. "Comparison of filter based feature selection algorithms: An overview." International Journal of Innovative Research in Technology & Science 2, no. 2 (2014): 108-113.

[9] Kira, K., and Rendell, L. A. "A practical approach to feature selection." In Proceedings of the Ninth International Workshop on Machine Learning (1992), pp. 249-256.

[10] Bolón-Canedo, Verónica, Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos. "A review of
feature selection methods on synthetic data." Knowledge and information systems 34, no. 3 (2013): 483-
519.
[11] Urbanowicz, Ryan J., Melissa Meeker, William La Cava, Randal S. Olson, and Jason H. Moore.
"Relief-based feature selection: Introduction and review." Journal of biomedical informatics 85 (2018):
189-203.

[12] I. Kononenko, “Estimating attributes: Analysis and extensions of RELIEF,” Mach. Learn. ECML-94,
vol. 784, (1994) pp. 171–182.

[13] Gao, Kehan, Taghi M. Khoshgoftaar, and Jason Van Hulse. "An evaluation of sampling on filter-
based feature selection methods." In Twenty-Third International FLAIRS Conference. 2010.

[14] R. C. Holte, "Very simple classification rules perform well on most commonly used datasets," Machine Learning 11 (1993), pp. 63-91.

[15] Bommert, Andrea, Xudong Sun, Bernd Bischl, Jörg Rahnenführer, and Michel Lang. "Benchmark
for filter methods for feature selection in high-dimensional classification data." Computational Statistics
& Data Analysis 143 (2020): 106839.

[16] Bolón-Canedo, Verónica, Noelia Sánchez-Marono, Amparo Alonso-Betanzos, José Manuel Benítez,
and Francisco Herrera. "A review of microarray datasets and applied feature selection methods."
Information Sciences 282 (2014): 111-135.
[17] Elgin Christo, V. R., H. Khanna Nehemiah, B. Minu, and Arputharaj Kannan. "Correlation-Based
Ensemble Feature Selection Using Bioinspired Algorithms and Classification Using Backpropagation
Neural Network." Computational and mathematical methods in medicine 2019 (2019).

----------------------------------------------------------------End---------------------------------------------------------------------
