
International Journal of Innovations & Advancement in Computer Science

IJIACS
ISSN 2347 – 8616
Volume 4, Issue 3
March 2015

Fast Clustering Based Feature Selection


Ubed S. Attar 1, Ajinkya N. Bapat 2, Nilesh S. Bhagure 3, Popat A. Bhesar 4

1, 2, 3, 4 Computer Engineering, Sanjivani College of Engineering, Kopargaon (Pune University)

Abstract: Feature selection involves identifying a subset of the most useful features that produces results compatible with those of the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness is related to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree (MST) clustering method using Kruskal's algorithm. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study.

Index Terms - Feature subset selection, filter method, feature clustering, graph-based clustering.

I. INTRODUCTION

It is widely recognized that a large number of features can adversely affect the performance of inductive learning algorithms, and clustering is not an exception. However, while there exists a large body of literature devoted to this problem for the supervised learning task, feature selection for clustering has rarely been addressed. The problem appears to be a difficult one given that it inherits all the uncertainties that surround this type of inductive learning; in particular, there is no single performance measure widely accepted for this task, and no supervision is available.

1.1 Feature Selection:

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction. The central assumption when using a feature selection technique is that the data contains many redundant or irrelevant features. Redundant features are those which provide no more information than the currently selected
features, and irrelevant features provide no useful information in any context. Feature selection techniques are a subset of the more general field of feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). The archetypal case is the use of feature selection in analysing DNA microarrays, where there are many thousands of features and a few tens to hundreds of samples. Feature selection techniques provide three main benefits when constructing predictive models:

• Improved model interpretability,
• Shorter training times,
• Enhanced generalization by reducing overfitting.

Feature selection is also useful as part of the data analysis process, as it shows which features are important for prediction and how these features are related.

With such an aim of choosing a subset of good features with respect to the target concepts, feature subset selection is an effective way of reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility [1]. Irrelevant features, along with redundant features, severely affect the accuracy of the learning machines. Thus, feature subset selection should be able to identify and remove as much of the irrelevant and redundant information as possible. Moreover, "good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other."

Many feature subset selection methods have been proposed and studied for machine learning applications. They can be divided into four broad categories: the Embedded, Wrapper, Filter, and Hybrid approaches [2].

1.2 Wrapper and Filter Methods:

Wrapper methods are widely recognized as a superior alternative in supervised learning problems, since by employing the inductive algorithm to evaluate alternatives they take into account the particular biases of the algorithm. However, even for algorithms that exhibit moderate complexity, the number of executions that the search process requires results in a high computational cost, especially as we shift to more exhaustive search strategies. Wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets, so the accuracy of the learning algorithms is usually high. However, the generality of the selected features is limited and the computational complexity is large. Filter methods are independent of learning algorithms and have good generality. Their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed [2], [3], [4].
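To make the contrast concrete, the short sketch below (our own illustration on synthetic data, not an experiment from this paper) scores features in the filter style with a learner-independent statistic and then evaluates one candidate subset in the wrapper style with cross-validation of a predetermined classifier; the use of scikit-learn and the choice of estimator are our assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=20,
                               n_informative=5, random_state=0)

    # Filter style: rank features by a learner-independent statistic (mutual information).
    filter_scores = mutual_info_classif(X, y, random_state=0)
    top_filter = np.argsort(filter_scores)[::-1][:5]

    # Wrapper style: score a candidate subset by the cross-validated accuracy of a
    # predetermined learning algorithm; far costlier, and repeated for every subset tried.
    def wrapper_score(subset):
        return cross_val_score(DecisionTreeClassifier(random_state=0),
                               X[:, subset], y, cv=5).mean()

    print("filter picks:", sorted(top_filter.tolist()))
    print("wrapper accuracy of that subset:", round(wrapper_score(top_filter), 3))

The filter ranking is computed once, while a wrapper search would repeat the cross-validation for every candidate subset, which is where its cost comes from.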
1.3 Hybrid Approach:

The hybrid methods are a combination of filter and wrapper methods: a filter method is used to reduce the search space that will be considered by the subsequent wrapper. They mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular learning algorithm, with a time complexity similar to that of the filter methods.

In cluster analysis, graph-theoretic methods have been well studied and used in many applications. Their results have, sometimes, the best agreement with human performance. The general graph-theoretic clustering is simple: compute a neighborhood graph of instances, then delete any edge in the graph that is much longer/shorter (according to some criterion) than its neighbors. The result is a forest, and each tree
in the forest represents a cluster. In our study, we apply graph-theoretic clustering methods to features. In particular, we adopt minimum spanning tree (MST)-based clustering algorithms, because they do not assume that data points are grouped around centers or separated by a regular geometric curve, and they have been widely used in practice.
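As a rough illustration of this idea (our own toy example, not from the paper), the sketch below clusters a handful of points by building a complete weighted graph, taking its minimum spanning tree with networkx, and cutting the MST edges that are much longer than the average edge; the surviving connected components are the clusters. The 2x cutoff is an arbitrary stand-in for the "much longer than its neighbors" criterion described above.

    import math
    from itertools import combinations
    import networkx as nx

    # Toy 2-D points forming two loose groups (our own data, for illustration).
    points = {0: (0.0, 0.0), 1: (0.2, 0.1), 2: (0.1, 0.3),
              3: (5.0, 5.0), 4: (5.2, 4.9), 5: (4.9, 5.3)}

    G = nx.Graph()
    for a, b in combinations(points, 2):
        G.add_edge(a, b, weight=math.dist(points[a], points[b]))

    mst = nx.minimum_spanning_tree(G, weight="weight")

    # Remove MST edges that are much longer than the average MST edge.
    weights = [d["weight"] for _, _, d in mst.edges(data=True)]
    cutoff = 2.0 * sum(weights) / len(weights)
    mst.remove_edges_from([(u, v) for u, v, d in mst.edges(data=True)
                           if d["weight"] > cutoff])

    clusters = list(nx.connected_components(mst))
    print(clusters)   # two components: {0, 1, 2} and {3, 4, 5}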
Based on the MST method, we propose a Fast clustering-based feature Selection algorithm (FAST). The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form the final subset of features.

Features in different clusters are relatively independent; the clustering-based strategy of FAST therefore has a high probability of producing a subset of useful and independent features. The proposed feature subset selection algorithm FAST was tested on various numerical data sets. The experimental results show that, compared with five other different types of feature subset selection algorithms, the proposed algorithm not only reduces the number of features but also improves the classification accuracy.

II. LITERATURE SURVEY

2.1 Existing System:

Feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because 1) irrelevant features do not contribute to the predictive accuracy, and 2) redundant features do not redound to getting a better predictor, for they provide mostly information which is already present in other feature(s). Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant features [4], [5], yet some others can eliminate the irrelevant features while taking care of the redundant ones [5]. Our proposed FAST algorithm falls into the second group. Traditionally, feature subset selection research has focused on searching for relevant features. A well-known example is Relief [7], which weighs each feature according to its ability to discriminate instances under different targets based on a distance-based criterion function. However, Relief is ineffective at removing redundant features, as two predictive but highly correlated features are likely both to be highly weighted. Relief-F extends Relief, enabling this method to work with noisy and incomplete data sets and to deal with multiclass problems, but it still cannot identify redundant features.

2.2 Drawbacks:

1) Some works eliminate irrelevant features alone.
2) Some existing systems remove redundant features alone.
3) This reduces the speed and accuracy of learning algorithms.

III. FEATURE SUBSET SELECTION ALGORITHM

3.1 Proposed System:

Quite different from these hierarchical clustering-based algorithms, our proposed FAST algorithm uses a minimum spanning tree-based method to cluster features. Meanwhile, it does not assume that data points are grouped around centers or separated by a regular geometric curve. Moreover, our proposed FAST is not limited to specific types of data.

3.2 System Architecture and Definitions:

Irrelevant features, along with redundant features, severely affect the accuracy of the learning machines [6]. Thus, feature subset selection should be able to identify and remove as much of the irrelevant and redundant information as possible. Moreover, "good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other." Keeping these in mind, we develop a novel algorithm which can efficiently and effectively deal with
both irrelevant and redundant features, and obtain a good feature subset. We achieve this through a new feature selection architecture (shown in Fig. 3.1), which is composed of two connected components: irrelevant feature removal and redundant feature elimination. The former obtains features relevant to the target concept by eliminating irrelevant ones, and the latter removes redundant features from the relevant ones by choosing representatives from different feature clusters, and thus produces the final subset.

Fig. 3.1 System Architecture

Fig. 3.2 System Architecture

The irrelevant feature removal is straightforward once the right relevance measure is defined or selected, while the redundant feature elimination is somewhat more sophisticated. In our proposed FAST algorithm, it involves 1) the construction of the minimum spanning tree from a weighted complete graph; 2) the partitioning of the MST into a forest with each tree representing a cluster; and 3) the selection of representative features from the clusters.

3.3 Advantages:

1) Removes both irrelevant and redundant attributes.
2) Uses the minimum spanning tree concept for fast elimination.
3) Improves the speed and accuracy of learning algorithms.

3.4 Algorithm

1. for i = 1 to m do
2.   T-Relevance = SU(Fi, C)
3.   if T-Relevance > θ then
4.     S = S ∪ Fi;
5. G = NULL; // G is a complete graph
6. for each pair of features {F′i, F′j} ⊂ S do
7.   F-Correlation = SU(F′i, F′j)
8.   Add F′i and/or F′j to G with F-Correlation as the weight of the corresponding edge;
9. minSpanTree = Kruskal(G); // Using Kruskal's algorithm to generate the minimum spanning tree
10. Forest = minSpanTree
11. for each edge Eij ∈ Forest do
12.   if SU(F′i, F′j) < SU(F′i, C) ∧ SU(F′i, F′j) < SU(F′j, C) then
13.     Forest = Forest − Eij
14. S = φ
15. for each tree Ti ∈ Forest do
16.   FjR = argmax F′k∈Ti SU(F′k, C)
17.   S = S ∪ FjR;
18. return S
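A minimal Python rendering of this pseudocode might look as follows. It is a sketch under our own assumptions: feature values are discrete (or pre-discretized), symmetric uncertainty is estimated from value counts, and networkx computes the minimum spanning tree in place of the Prim/Kruskal step (Section 4.3 sketches Kruskal explicitly). The function and variable names are ours, not from the paper's implementation.

    import math
    from collections import Counter
    from itertools import combinations
    import networkx as nx

    def entropy(values):
        """H(X) = -sum_x p(x) log2 p(x), estimated from discrete value counts."""
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

    def symmetric_uncertainty(x, y):
        """SU(X, Y) = 2 * Gain(X|Y) / (H(X) + H(Y)) for two discrete sequences."""
        hx, hy = entropy(x), entropy(y)
        if hx + hy == 0:
            return 0.0
        gain = hx + hy - entropy(list(zip(x, y)))   # H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y)
        return 2.0 * gain / (hx + hy)

    def fast_select(features, target, theta):
        """features: dict of name -> list of discrete values; target: class labels."""
        # Lines 1-4: keep features whose T-Relevance SU(F, C) exceeds the threshold.
        t_rel = {f: symmetric_uncertainty(v, target) for f, v in features.items()}
        relevant = [f for f, su in t_rel.items() if su > theta]

        # Lines 5-9: complete graph weighted by F-Correlation, then its MST
        # (networkx stands in for the MST-generation step; see Section 4.3).
        G = nx.Graph()
        G.add_nodes_from(relevant)
        for a, b in combinations(relevant, 2):
            G.add_edge(a, b, weight=symmetric_uncertainty(features[a], features[b]))
        forest = nx.minimum_spanning_tree(G, weight="weight")

        # Lines 11-13: drop edges whose F-Correlation is below both T-Relevances.
        for a, b, d in list(forest.edges(data=True)):
            if d["weight"] < t_rel[a] and d["weight"] < t_rel[b]:
                forest.remove_edge(a, b)

        # Lines 14-18: keep the feature with the highest T-Relevance in each tree.
        return [max(tree, key=t_rel.get) for tree in nx.connected_components(forest)]

    # Toy usage (our data): f1 and f2 are duplicates, f3 is less informative.
    target = [0, 0, 1, 1, 0, 1]
    features = {"f1": [1, 1, 2, 2, 1, 2],
                "f2": [1, 1, 2, 2, 1, 2],
                "f3": [3, 1, 2, 3, 1, 2]}
    print(fast_select(features, target, theta=0.6))   # e.g. ['f1']: f3 is filtered, one duplicate survives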
IV. MODULE DESCRIPTION

4.1 Load Data and Classify:

Load the data into the process. The data has to be preprocessed to remove missing values, noise and outliers. The given dataset must then be converted into the ARFF format, which is the standard format for the WEKA toolkit. From the ARFF format, only the attributes and the values are extracted and stored into the database. The last column of the dataset is taken as the class attribute, the distinct class labels are selected from it, and the entire dataset is classified with respect to these class labels.
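A minimal sketch of this loading step, assuming a CSV input whose last column is the class label; the pandas-based conversion, the NUMERIC attribute types, and the file names are our own illustration rather than the exact tool chain of the paper (which stores the extracted attributes in a database for use with WEKA).

    import pandas as pd

    def csv_to_arff(csv_path, arff_path, relation="dataset"):
        """Write a minimal ARFF file from a CSV whose last column is the class attribute.

        All non-class attributes are declared NUMERIC here for brevity; a real
        converter would inspect the column types.
        """
        df = pd.read_csv(csv_path)
        attrs, cls = df.columns[:-1], df.columns[-1]
        with open(arff_path, "w") as out:
            out.write(f"@RELATION {relation}\n\n")
            for col in attrs:
                out.write(f"@ATTRIBUTE {col} NUMERIC\n")
            labels = ",".join(str(v) for v in sorted(df[cls].unique()))
            out.write(f"@ATTRIBUTE {cls} {{{labels}}}\n\n@DATA\n")
            df.to_csv(out, header=False, index=False)
        return df

    # Example usage (assumes a local input.csv exists):
    # df = csv_to_arff("input.csv", "input.arff")
    # by_class = {label: rows for label, rows in df.groupby(df.columns[-1])}  # classify by label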
4.2 Information Gain Computation:

Relevant features have a strong correlation with the target concept and so are always necessary for a best subset, while redundant features are not, because their values are completely correlated with each other. Thus, the notions of feature redundancy and feature relevance are normally stated in terms of feature correlation and feature-target concept correlation. To find the relevance of each attribute to the class label, information gain is computed in this module. This is also called the mutual information measure. Mutual information measures how much the distribution of the feature values and target classes differs from statistical independence. It is a nonlinear estimation of the correlation between feature values, or between feature values and target classes. The symmetric uncertainty (SU) is derived from the mutual information by normalizing it to the entropies of the feature values, or of the feature values and target classes, and it has been used to evaluate the goodness of features for classification. The symmetric uncertainty is defined in terms of the information gain

Gain(X|Y) = H(X) - H(X|Y)
          = H(Y) - H(Y|X)

normalized by the entropies:

SU(X, Y) = 2 × Gain(X|Y) / (H(X) + H(Y))

To calculate the gain, we need the entropy and conditional entropy values, given by

H(X) = - Σx∈X p(x) log2 p(x)
H(X|Y) = - Σy∈Y p(y) Σx∈X p(x|y) log2 p(x|y)

where p(x) is the probability of value x and p(x|y) is the conditional probability of x given y.
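As a small numeric check of these formulas (toy values of ours, not data from the paper), consider a feature X = [a, a, b, b] and a class C = [1, 0, 1, 1]:

    import math

    # H(C): p(1) = 3/4, p(0) = 1/4.
    H_C = -(0.75 * math.log2(0.75) + 0.25 * math.log2(0.25))    # ~0.811 bits

    # H(C|X): within X=a the classes are [1, 0] (entropy 1 bit),
    # within X=b they are [1, 1] (entropy 0 bits); each group has weight 1/2.
    H_C_given_X = 0.5 * 1.0 + 0.5 * 0.0                          # 0.5 bits

    # H(X): the two values a and b are equally likely.
    H_X = 1.0

    gain = H_C - H_C_given_X                                     # ~0.311
    su = 2 * gain / (H_X + H_C)                                  # ~0.344
    print(round(gain, 3), round(su, 3))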
4.3 MST Construction:

With the F-Correlation values computed above, the minimum spanning tree is constructed. For that, we use Kruskal's algorithm, which forms the MST efficiently. Kruskal's algorithm is a greedy algorithm in graph theory that finds a minimum spanning tree for a connected weighted graph. This means it finds a subset of the edges that forms a tree including every vertex, where the total weight of all the edges in the tree is minimized. If the graph is not connected, then it finds a minimum spanning forest (a minimum spanning tree for each connected component).

Description:
(a) Create a forest F (a set of trees), where each vertex in the graph is a separate tree.
(b) Create a set S containing all the edges in the graph.
(c) While S is nonempty and F is not yet spanning:
• Remove an edge with minimum weight from S.
• If that edge connects two
different trees, then add it to the forest, combining the two trees into a single tree.
• Otherwise, discard that edge.
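Steps (a)-(c) above are commonly implemented with a disjoint-set (union-find) structure; the following compact sketch (ours, not code from the paper) runs Kruskal's algorithm over an explicit edge list.

    def kruskal(nodes, edges):
        """edges: iterable of (weight, u, v); returns the MST/forest as a list of edges."""
        parent = {n: n for n in nodes}            # (a) every vertex starts as its own tree

        def find(u):                              # follow parents to the tree's root
            while parent[u] != u:
                parent[u] = parent[parent[u]]     # path compression
                u = parent[u]
            return u

        tree = []
        for w, u, v in sorted(edges):             # (b)+(c) take edges in order of weight
            ru, rv = find(u), find(v)
            if ru != rv:                          # connects two different trees: keep it
                parent[ru] = rv                   # merge the two trees
                tree.append((u, v, w))
            # otherwise the edge would close a cycle and is discarded
        return tree

    # Tiny usage example (our toy weights):
    edges = [(1.0, "A", "B"), (2.0, "B", "C"), (2.5, "A", "C"), (4.0, "C", "D")]
    print(kruskal("ABCD", edges))   # [('A', 'B', 1.0), ('B', 'C', 2.0), ('C', 'D', 4.0)]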
At the termination of the algorithm, the forest forms a minimum spanning forest of the graph. If the graph is connected, the forest has a single component and forms a minimum spanning tree. A sample tree is shown below.

Fig. 4.1 Sample Tree

In this tree, the vertices represent the relevance values and the edges represent the F-Correlation values. The complete graph G reflects the correlations among all the target-relevant features. Unfortunately, graph G has k vertices and k(k-1)/2 edges. For high-dimensional data, it is heavily dense and the edges with different weights are strongly interwoven. Moreover, the decomposition of a complete graph is NP-hard. Thus for graph G we build an MST, which connects all vertices such that the sum of the weights of the edges is the minimum, using the well-known Kruskal algorithm. The weight of edge (F′i, F′j) is the F-Correlation SU(F′i, F′j).

4.4 T-Relevance Calculations:

The relevance between the feature Fi ∈ F and the target concept C is referred to as the T-Relevance of Fi and C, and is denoted by SU(Fi, C). If SU(Fi, C) is greater than a predetermined threshold θ, we say that Fi is a strong T-Relevance feature. After finding the relevance values, the redundant attributes will be removed with respect to the threshold value.

4.5 F-Correlations:

The correlation between any pair of features Fi and Fj (Fi, Fj ∈ F ∧ i ≠ j) is called the F-Correlation of Fi and Fj, and is denoted by SU(Fi, Fj). The symmetric uncertainty equation used for finding the relevance between an attribute and the class is applied again to find the similarity between two attributes with respect to each label.

4.6 Cluster Formation:

After building the MST, in the third step we first remove from the MST the edges whose weights are smaller than both of the T-Relevances SU(F′i, C) and SU(F′j, C). After removing all the unnecessary edges, a forest Forest is obtained. Each tree Tj ∈ Forest represents a cluster that is denoted as V(Tj), which is the vertex set of Tj as well. As illustrated above, the features in each cluster are redundant, so for each cluster V(Tj) we choose a representative feature FjR whose T-Relevance SU(FjR, C) is the greatest.
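A toy walk-through of this pruning and representative-selection step (hypothetical SU values of ours, using networkx) shows which edge is cut and which features survive:

    import networkx as nx

    # Hypothetical T-Relevance values SU(F, C) and an already-built MST whose
    # edge weights are the F-Correlations SU(Fi, Fj) (toy numbers, ours).
    t_rel = {"F1": 0.60, "F2": 0.55, "F3": 0.20}
    mst = nx.Graph()
    mst.add_edge("F1", "F2", weight=0.70)   # not below SU(F1, C) = 0.60, so it survives
    mst.add_edge("F2", "F3", weight=0.10)   # below both SU(F2, C) and SU(F3, C): pruned

    for u, v, d in list(mst.edges(data=True)):
        if d["weight"] < t_rel[u] and d["weight"] < t_rel[v]:
            mst.remove_edge(u, v)

    clusters = list(nx.connected_components(mst))     # [{'F1', 'F2'}, {'F3'}]
    subset = [max(c, key=t_rel.get) for c in clusters]
    print(subset)                                      # ['F1', 'F3']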
V. CONCLUSION

In this paper, we have presented a novel clustering-based feature subset selection algorithm for high-dimensional data. The algorithm involves 1) removing irrelevant features, 2) constructing a minimum spanning tree from the relevant ones, and 3) partitioning the MST and selecting representative features. In the proposed algorithm, a cluster consists of features. Each cluster is treated as a single feature and thus dimensionality is drastically reduced.

VI. ACKNOWLEDGMENT

It gives us great pleasure to present the paper on "Fast Clustering Based Feature Selection". We would like to express our gratitude to Prof. D. B. Kshirsagar, Head of the Computer Engineering Department, Prof. S. Srinivas A.,
and Prof. Vinay Kumar A. for their kind support and valuable suggestions.

REFERENCES

[1] Qinbao Song, Jingjie Ni, and Guangtao Wang, “A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 1, January 2013.
[2] H. Almuallim and T.G. Dietterich, “Algorithms for Identifying Relevant Features”, Proc. Ninth Canadian Conf. Artificial Intelligence, pp. 38-45, 1992.
[3] P. Abinaya and J. Sutha, “Effective Feature Selection for High Dimensional Data using Fast Algorithm”, International Journal of Advanced Research in Computer Science & Technology (IJARCST), Vol. 2, Issue Special 1, Jan-March 2014.
[4] L.C. Molina, L. Belanche, and A. Nebot, “Feature Selection Algorithms: A Survey and Experimental Evaluation”, Proc. IEEE Int'l Conf. Data Mining, pp. 306-313, 2002.
[5] Karthikeyan P., Saravanan P., and Vanitha E., “High Dimensional Data Clustering Using Fast Cluster Based Feature Selection”, Int. Journal of Engineering Research and Applications, Vol. 4, Issue 3 (Version 1), March 2014.
[6] L. Yu and H. Liu, “Efficient Feature Selection via Analysis of Relevance and Redundancy”, J. Machine Learning Research, vol. 10, no. 5, pp. 1205-1224, 2004.
[7] M. Dash and H. Liu, “Feature Selection for Classification”, Intelligent Data Analysis, vol. 1, no. 3, pp. 131-156, 1997.
[8] M. Dash, H. Liu, and H. Motoda, “Consistency Based Feature Selection”, Proc. Fourth Pacific Asia Conf. Knowledge Discovery and Data Mining, pp. 98-109, 2000.

AUTHORS

First Author – Ubed S. Attar:
BE Computer Engineering, SRES's College of Engineering, Kopargaon.
University of Pune
E-mail address: ubedattar45@gmail.com

Second Author – Ajinkya N. Bapat:
BE Computer Engineering, SRES's College of Engineering, Kopargaon.
University of Pune
E-mail address: ajinkyabapat12@gmail.com

Third Author – Nilesh S. Bhagure:
BE Computer Engineering, SRES's College of Engineering, Kopargaon.
University of Pune
E-mail address: nileshbhagure77@gmail.com

Fourth Author – Popat A. Bhesar:
BE Computer Engineering, SRES's College of Engineering, Kopargaon.
University of Pune
E-mail address: bhesarpopat@gmail.com