Chapter 3: Feature Selection
3.1 Introduction
Removing redundant features increases the speed of the mining algorithm; it also improves the accuracy and creates a better classifier model. Dimensionality is a major problem in machine learning and data mining applications. Dimensionality reduction has always been an active research area in the fields of pattern recognition, machine learning and data mining, for genomic analysis Inza et al. (2004), text mining Forman (2003) and multi-source data Zhao et al. (2008, 2010). Earlier researchers realized that feature selection is an essential component of successful data mining Liu and Motoda (1998), Guyon and Elisseeff (2003), Liu and Motoda (2007). The target of dimensionality reduction is to improve the classification performance. It can be achieved in two different ways, i.e. through feature selection and feature transformation.
Starting Point:
First, the starting point in the feature space is determined from the full set of features, which influences the direction of the search. The search for feature subsets starts either with no features or with the full set of features. The first approach is forward selection, whereas the second approach is backward elimination Blum and Langley (1997).
Fig. 3.1: Feature Subset Selection Process – Training and Test Model
Search Strategy:
The best subset of features could be found by evaluating all the possible subsets, which is known as exhaustive search. An exhaustive search needs to examine all 2^n possible subsets of n features, so it becomes impractical with a huge number of features. Therefore, a more realistic and practical search strategy needs to be considered.
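Instead of the 2^n exhaustive search, a greedy strategy such as sequential forward selection evaluates only on the order of k x n candidate subsets when selecting k features. The following is a minimal MATLAB sketch, assuming a user-supplied scoring function evalSubset(X, y, subset) (for example, cross-validated accuracy); the function names here are illustrative, not from this work.

function selected = forwardSelect(X, y, k, evalSubset)
    % Greedy sequential forward selection: grow the subset one feature
    % at a time, always adding the candidate with the best score.
    nFeatures = size(X, 2);
    selected  = [];
    remaining = 1:nFeatures;
    for step = 1:k
        bestScore = -Inf;  bestIdx = 0;
        for j = remaining
            score = evalSubset(X, y, [selected j]);
            if score > bestScore
                bestScore = score;  bestIdx = j;
            end
        end
        selected  = [selected bestIdx];
        remaining = remaining(remaining ~= bestIdx);
    end
end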
Subset Evaluation:
Each candidate subset generated by the search is assessed by an evaluation criterion, which measures the goodness of the subset; a newly generated subset replaces the previous best subset whenever it scores better under the criterion.
Stopping Criteria:
A stopping criterion controls and determines when the feature selection process must stop. It is necessary to halt the search Blum and Langley (1997), for instance once the number of selected features reaches a predetermined threshold value Liu and Yu (2005). After this iterative process, the best subset among the generated candidates can be selected. Some frequently used stopping criteria are:
(i) the search completes;
(ii) a given bound (such as a maximum number of iterations or a minimum number of features) is reached;
(iii) the addition (or deletion) of any feature does not produce a better subset;
(iv) a sufficiently good subset has been selected.
Result Validation:
A typical feature selection process has two phases: (i) feature selection and (ii) model fitting and performance evaluation. The feature selection phase further contains three steps:
(i) Generating a candidate subset from the original features via certain search strategies.
(ii) Evaluating the candidate subset by estimating the utility of its features. Based on the evaluation, some features in the candidate subset might be discarded, or new ones might be added to the selected feature set according to their relevance.
(iii) Determining whether the current set of selected features is good enough using a certain stopping criterion; if so, the feature selection method returns the selected set of features, else it iterates until the stopping criterion is met.
Irrelevant: These features do not have any influence on the output, and their values are random for each instance.
Fig. 3.2 represents the filter model and the wrapper model. The embedded model is nothing but utilizing the filter model inside the wrapper model.
Fig. 3.2: Filter Model and Wrapper Model – feature subset generation and evaluation, learning algorithm, optimal feature subset, and performance
Another advantage of the filter model is that it allows the algorithms to have a very simple structure, which usually employs a straightforward search strategy such as backward elimination or forward selection. The filter model is easy to design and easy to understand. This explains why most feature selection algorithms use the filter model.
In this research work, the fundamentals and formulas of the six standard existing feature selection methods are as follows:
Gini Index
Chi Square
MRMR
T-Test
Relief-F
Information Gain
3.6.1 Gini Index
The Gini coefficient (also known as the Gini index or Gini ratio) is a measure of statistical dispersion intended to represent the income distribution Breiman et al. (1984). It is the most commonly used measure of inequality, developed by the Italian statistician and sociologist Corrado Gini and published in his 1912 paper "Variability and Mutability" (Italian: Variabilità e mutabilità) Gini (1912). It measures the inequality among values of a frequency distribution. The Gini coefficient can theoretically range from 0 (complete equality) to 1 (complete inequality). Normally the mean (or total) is assumed positive, which rules out a Gini coefficient of less than zero.
The Gini coefficient is often calculated with the more practical Brown formula:

$$G = 1 - \sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1}) \qquad \text{(Equation 1)}$$

Where,
G : Gini coefficient
X_k : cumulated proportion of the population variable, with X_0 = 0 and X_n = 1
Y_k : cumulated proportion of the income variable, with Y_0 = 0 and Y_n = 1
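As a minimal MATLAB sketch of Equation 1, the Brown formula can be computed from the sorted values of one variable via the empirical Lorenz curve; the function name is illustrative:

function G = giniBrown(values)
    % Brown formula (Equation 1) from the empirical Lorenz curve.
    v = sort(values(:));                     % ascending order
    n = numel(v);
    X = (0:n)' / n;                          % cumulated population share X_k
    Y = [0; cumsum(v)] / sum(v);             % cumulated value share Y_k
    G = 1 - sum(diff(X) .* (Y(1:end-1) + Y(2:end)));
end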
3.6.2 Chi Square
The chi-square statistic measures the lack of independence between a feature f and the class:

$$\chi^2(f) = \sum_{v \in V} \sum_{i=1}^{m} \frac{\left(A_i(f=v) - E_i(f=v)\right)^2}{E_i(f=v)} \qquad \text{(Equation 2)}$$

where V is the set of possible values of f, m is the number of classes, A_i(f=v) is the observed number of samples of class c_i for which f = v, and E_i(f=v) is the corresponding expected number under independence.
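A minimal MATLAB sketch of Equation 2 for a single discretized feature follows; it assumes the feature values f and class labels y are column vectors of integer codes, and the names are illustrative:

function chi2 = chiSquareScore(f, y)
    % Chi-square score of Equation 2 for one discrete feature f vs. class y.
    N = numel(y);
    chi2 = 0;
    for v = reshape(unique(f), 1, [])           % feature values v in V
        for c = reshape(unique(y), 1, [])       % classes c_i
            A = sum(f == v & y == c);           % observed count A_i(f = v)
            E = sum(f == v) * sum(y == c) / N;  % expected count E_i(f = v)
            chi2 = chi2 + (A - E)^2 / E;
        end
    end
end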
3.6.3 MRMR
The Maximal Relevance condition selects the feature set S with the highest relevance to the target class c:

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) \qquad \text{(Equation 3)}$$

where I(x_i; c) indicates the mutual information between feature x_i and class c. MRMR also uses the mutual information between features as the redundancy of each feature. The following condition finds the Minimal Redundancy feature set:

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) \qquad \text{(Equation 4)}$$

where I(x_i; x_j) indicates the mutual information between features x_i and x_j.
The criterion combining the above two conditions is called Minimal-Redundancy-Maximal-Relevance (MRMR). The MRMR measure optimizes D and R simultaneously in the following form:

$$\max \Phi(D, R), \quad \Phi = D - R \qquad \text{(Equation 5)}$$
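A minimal MATLAB sketch of greedy MRMR selection under the D - R criterion of Equation 5 is given below. It assumes the expression data have already been discretized into positive integer codes and uses a simple plug-in (histogram) estimate of mutual information; all names are illustrative:

function S = mrmrSelect(X, y, k)
    % Greedy MRMR: seed with the most relevant feature, then repeatedly
    % add the feature maximizing relevance minus mean redundancy (D - R).
    n = size(X, 2);
    rel = arrayfun(@(j) mi(X(:, j), y), 1:n);   % relevance I(x_j; c)
    [~, first] = max(rel);
    S = first;
    while numel(S) < k
        cand = setdiff(1:n, S);
        score = zeros(size(cand));
        for idx = 1:numel(cand)
            j = cand(idx);
            red = mean(arrayfun(@(s) mi(X(:, j), X(:, s)), S));
            score(idx) = rel(j) - red;
        end
        [~, best] = max(score);
        S = [S cand(best)];
    end
end

function v = mi(a, b)
    % Plug-in mutual information of two integer-coded vectors (codes 1, 2, ...).
    joint = accumarray([a(:) b(:)], 1);
    p  = joint / sum(joint(:));
    px = sum(p, 2);  py = sum(p, 1);
    nz = p > 0;
    r  = p ./ (px * py);                        % ratio p(a, b) / (p(a) p(b))
    v  = sum(p(nz) .* log(r(nz)));
end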
3.6.4 T-Test
The T-test is another common statistical test that is used for comparing a sample mean, or a difference in means, to the true mean or expected difference in means. The T-test is a simple method widely applied in gene expression analysis. It operates on the logarithm of the expression levels and compares the means and variances of the treatment and control groups. The basic problem with the T-test is that it requires repeated treatment-control experiments, which are both tedious and costly, and a small number of experimental repetitions can affect the reliability of the mean-based approach.
$$t = \frac{\bar{E} - \bar{C}}{\sqrt{\dfrac{var_E}{N_E} + \dfrac{var_C}{N_C}}} \qquad \text{(Equation 6)}$$

$$df = \frac{\left(\dfrac{var_E}{N_E} + \dfrac{var_C}{N_C}\right)^2}{\dfrac{\left(var_E / N_E\right)^2}{N_E - 1} + \dfrac{\left(var_C / N_C\right)^2}{N_C - 1}} \qquad \text{(Equation 7)}$$

where $\bar{E}$ and $\bar{C}$ are the means, $var_E$ and $var_C$ the variances, and $N_E$ and $N_C$ the sample sizes of the experimental and control groups respectively.
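A minimal MATLAB sketch of Equation 6 applied per gene is shown below, assuming a samples-by-genes expression matrix (log scale) with labels 1 = experimental (tumor) and 2 = control (normal), and at least two samples per group; names are illustrative:

function t = welchT(X, y)
    % Welch t statistic of Equation 6 for every gene (column) at once.
    E = X(y == 1, :);  C = X(y == 2, :);     % split the two groups
    NE = size(E, 1);   NC = size(C, 1);
    t = (mean(E) - mean(C)) ./ sqrt(var(E) / NE + var(C) / NC);
end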
3.6.5 Relief-F
Relief-F Kononenko (1994) estimates the quality of a feature according to how well its values distinguish between instances that are near each other. For each sampled instance it finds the nearest neighbours of the same class (nearest hits) and of the other classes (nearest misses), increasing the weight of a feature when it separates the instance from its misses and decreasing the weight when it separates the instance from its hits.
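A minimal MATLAB sketch of the basic binary-class Relief weight update (one nearest hit and one nearest miss per sampled instance) is shown below; full Relief-F extends this to k neighbours and multiple classes. It assumes each class has at least two samples, and the names are illustrative:

function w = reliefWeights(X, y, nIter)
    % Basic Relief: reward features that differ on the nearest miss and
    % penalize features that differ on the nearest hit.
    [n, d] = size(X);
    w = zeros(1, d);
    for t = 1:nIter
        i = randi(n);                               % random target instance
        hitPool  = find(y == y(i));  hitPool(hitPool == i) = [];
        missPool = find(y ~= y(i));
        hit  = nearestTo(X, X(i, :), hitPool);      % nearest hit
        miss = nearestTo(X, X(i, :), missPool);     % nearest miss
        w = w - abs(X(i, :) - X(hit, :)) / nIter ...
              + abs(X(i, :) - X(miss, :)) / nIter;
    end
end

function idx = nearestTo(X, xi, pool)
    [~, k] = min(sum((X(pool, :) - xi).^2, 2));     % squared Euclidean distance
    idx = pool(k);
end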
3.6.6 Information Gain
Information Gain (IG) is another method for attribute selection Hunt et al. (1966). This measure is based on the pioneering work of Claude Shannon on information theory, which studied the value or "information content" of messages. IG measures the number of bits of information obtained for class prediction by knowing the value of a feature.
$$G(f) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + \sum_{v \in V} P(f=v) \sum_{i=1}^{m} P(c_i \mid f=v)\log P(c_i \mid f=v) \qquad \text{(Equation 8)}$$

where $\{c_i\}_{i=1}^{m}$ denotes the set of classes and $V$ is the set of possible values of feature $f$.
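A minimal MATLAB sketch of Equation 8 for a single discretized feature, computed as H(C) - H(C|f), is given below; the names are illustrative:

function ig = infoGain(f, y)
    % Information gain G(f) = H(C) - H(C | f) for integer-coded f and y.
    N  = numel(y);
    ig = entropyOf(y);                        % class entropy H(C)
    for v = reshape(unique(f), 1, [])         % subtract conditional entropy
        mask = (f == v);
        ig = ig - (sum(mask) / N) * entropyOf(y(mask));
    end
end

function h = entropyOf(y)
    [~, ~, idx] = unique(y);
    p = accumarray(idx, 1) / numel(y);
    h = -sum(p .* log2(p));                   % Shannon entropy in bits
end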
3.7 Classifiers Used for Performance Evaluation
The Bayes classifier and the J48 (C4.5) classifier Quinlan (1993) are the two classifiers used in this research work to evaluate the performance and verify the accuracy.
                              Predicted Class
                        Class = Yes      Class = No
Actual   Class = Yes      a (TP)           b (FN)
Class    Class = No       c (FP)           d (TN)

Accuracy = (a + d) / (a + b + c + d)
i.e. Accuracy = (No. of True Positives + No. of True Negatives) / (Sum of all)
Error rate = (b + c) / (a + b + c + d)
i.e. Error rate = (No. of False Positives + No. of False Negatives) / (Sum of all)
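These formulas translate directly into MATLAB; the following minimal sketch assumes the labels are coded as 1 = Yes and 2 = No, and the function name is illustrative:

function [accuracy, errorRate] = confusionRates(actual, predicted)
    a = sum(actual == 1 & predicted == 1);   % true positives  (TP)
    b = sum(actual == 1 & predicted == 2);   % false negatives (FN)
    c = sum(actual == 2 & predicted == 1);   % false positives (FP)
    d = sum(actual == 2 & predicted == 2);   % true negatives  (TN)
    accuracy  = (a + d) / (a + b + c + d);
    errorRate = (b + c) / (a + b + c + d);
end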
3.8 Validation Methods
The Leave One Out Cross Validation (LOOCV) method is the most commonly used method for error estimation. One reasonable approach to measuring the performance of a feature selection method is to define a classification task and measure the classification accuracy of that task. Given a colon tumor dataset with n = 62 class-labeled samples, the LOOCV method constructed 62 classifiers, where each classifier was trained on n-1 samples in the dataset and tested on the remaining sample.
This execution was repeated n times, once for each of the n observations, and then the mean squared error was computed. More generally, in k-fold cross validation the dataset is partitioned into k disjoint subsets; the classifier is trained on k-1 partitions and tested on the remaining one.
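A minimal MATLAB sketch of LOOCV is given below; a simple nearest-centroid classifier stands in for the actual learner, which is an illustrative choice rather than the classifier used in this work:

function acc = loocvAccuracy(X, y)
    % LOOCV: train on n-1 samples, test on the held-out one, n times over.
    n = size(X, 1);
    correct = 0;
    for i = 1:n
        train = setdiff(1:n, i);                 % leave sample i out
        Xtr = X(train, :);  ytr = y(train);
        mu1 = mean(Xtr(ytr == 1, :));            % class-1 centroid
        mu2 = mean(Xtr(ytr == 2, :));            % class-2 centroid
        pred = 1 + (norm(X(i, :) - mu2) < norm(X(i, :) - mu1));
        correct = correct + (pred == y(i));
    end
    acc = correct / n;                           % LOOCV accuracy estimate
end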
3.9 Matlab Tool
Matlab is one of the most important tools required for data analysis. It has numerous toolboxes which assist in complex calculations, and graphs can be plotted directly using built-in functions. It offers a wide range of options necessary for microarray data analysis with numerical program runs. Matlab is a popular and well-suited software package. Matlab performs well at:
Prototyping
Engineering simulation
Hence, Matlab was employed for the microarray data analysis of the colon tumor dataset in this research. In many circumstances, Matlab was used to optimize the algorithms among the many algorithms tried out. Many present-day researchers adopt this tool as the primary tool for their research work.
Most people would not reveal that they have cancer due to fear, shyness and the social stigma associated with it, so a real survey by anyone would not have been truthfully successful. Also, some of the previous researchers used the colon tumor dataset and highlighted the complexity of that dataset. Hence, it was decided to select and use the available standard colon tumor dataset for this research work.
The colon tumor dataset was used as the primary data source for this research work. This is a standard patient database which covers most of the colon tumor facts that are globally accepted by the research community. This research work selected a 'colon tumor data set' from this standard patient database using a general selection approach. This colon tumor dataset was utilized throughout this research work at various levels. It was also used to evaluate the performance of the proposed algorithms. The detailed information about this dataset is given in Table 3.2.
Table 3.2: Colon Tumor Dataset Details
Number of Genes     : 2000
Number of Samples   : 62
Cancer (Negative)   : 40
Normal (Positive)   : 22
Reference           : http://www.molbio.princeton.edu/colondata
This colon tumor dataset contains 62 samples collected from colon tumor patients. Among them, 40 tumor biopsies were from tumors (labeled as "negative") and 22 normal biopsies (labeled as "positive") were from healthy parts of the colons of the same patients. Each sample is described by 2000 genes, so the dataset contains a 62 x 2000 matrix of continuous variables and 62 class labels (the negatives were represented as 1 and the positives as 2 for ease of handling inside the MATLAB code).
In the dataset, each sample records the expression of 2000 gene or Expressed Sequence Tag (EST) probes on a DNA chip, i.e. a gene expression profile. The source of the dataset was http://www.molbio.princeton.edu/colondata. The final statuses of the biopsy samples were labeled by pathological examination. Gene expression levels in the 62 samples were measured using high-density oligonucleotide arrays. Of the genes detected in these microarrays, 2000 genes were selected based on the confidence in the measured expression levels.
Later, the selected colon tumor dataset underwent data preprocessing, feature selection and classification through the Weka package for this research work.
Fig. 3.4: Workflow – colon tumor dataset, pre-processing (missing value replacement and normalization), existing six feature selection methods, classification and evaluation
The pre-processing steps improve the quality of data. Fig. 3.4 shows the detailed workflow that was used in this research work.
It is very important to scale down the data into a small range for all experimental situations. The colon tumor data set was normalized to values between 0 and 1 in order to prevent features with large numeric ranges from dominating those with small numeric ranges. Data instances were rescaled between 0 and 1 using the min-max normalization procedure, which performs a linear transformation of the original input range into a new specified range. The old minimum min_old is mapped to the new minimum min_new (i.e. 0) and the old maximum max_old is mapped to the new maximum max_new (i.e. 1), as shown in Equation 9.
$$New_{value} = \frac{Original_{value} - Min_{old}}{Max_{old} - Min_{old}} \times \left(Max_{new} - Min_{new}\right) + Min_{new} \qquad \text{(Equation 9)}$$
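A minimal MATLAB sketch of Equation 9 applied column-wise to the expression matrix (with the new range fixed to [0, 1]) follows; the function name is illustrative:

function Xn = minMaxNormalize(X)
    % Equation 9 with [min_new, max_new] = [0, 1], applied per column.
    minOld = min(X);
    range  = max(X) - minOld;
    range(range == 0) = 1;                   % guard against constant features
    Xn = (X - minOld) ./ range;              % implicit expansion (R2016b+)
end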
(iv) Feature Space Reduction – Using the Existing Six and the Proposed Two Feature Selection Methods
This step aims to select the most relevant features and reduce the size of the colon dataset using the six existing feature selection methods in this research. The advantages of combining the six methods were used for identifying informative genes from the colon tumor dataset, in order to facilitate fast processing and to increase efficiency, flexibility and adaptability. Feature space reduction is applied with the six existing feature selection methods as a feature subset optimization process. The methods are shown in Fig. 3.5.
Existing methods: Chi-Square, T-Test, Information Gain, MRMR, Gini-Index, Relief-F. Proposed methods: 1. DWFS-AED, 2. DWFS_CKN.
Fig. 3.5: Existing Six Feature Selection Methods and Proposed Two New Methods
This step reduced the features, selecting subsets of 10, 20, 30, 40 and 50 features from the reduced colon tumor dataset. The purpose of this step is to optimize the performance of the feature subset. Here, the six existing feature selection methods were compared with the two proposed new methods. After applying 10-fold cross validation, the results obtained from the six existing methods were compared with those of the two proposed methods.
The obtained results were consolidated, compared and analyzed, and conclusions were drawn. The proposed methods showed better classification accuracy than the other methods, which is elaborated in Chapter 7.
3.11 Summary
The feature selection methods reveal future directions for developing challenging new algorithms in research, particularly for handling the large dimensionality of informative gene selection with new data types. Accuracy is of the utmost importance in the field of medical diagnosis for analyzing a patient's disease. The experimental results showed that the preprocessing technique greatly enhanced the classification accuracy in the feature selection. In this research work, two new feature selection algorithms were developed for a particular dataset, the colon tumor dataset, and applied to enhance the classification accuracy.