
CHAPTER 3

FEATURE SELECTION METHODS AND METRICS USED FOR PERFORMANCE EVALUATION

3.1 Introduction

Many real-world databases have grown at a rate that is highly unpredictable. Every piece of information attracts attention, creates a need, and is collected and stored in the database in its original form. No one can escape this tremendous growth of recorded facts, each often referred to as a 'feature'. Feature selection therefore becomes a major process in data classification after data pre-processing. The main challenge here is to remove from the dataset the unwanted features that are irrelevant to the task performed, which is extremely useful in reducing the dimensionality of the data.

This chapter details the basic concepts of a feature, feature selection, feature subset selection, the objectives and characteristics of feature selection, and its related areas. In this research work, the feature selection methods are applied to microarray data, particularly the colon tumor dataset. Microarray gene expression data are characterized by high feature dimensionality and complex gene behavior, which makes it challenging to design algorithms that interpret the gene patterns causing the disease Han and Kamber (2001). It is essential to single out the feature (gene) subset that contributes to classification accuracy by eliminating the least interesting genes, i.e. selecting the most interesting ones. Distinctive features from a set of candidates are selected and used for analysis.

3.2 Feature Selection

Commonly, the terms feature, attribute and variable all point to an aspect of the data, i.e. 'a fact'. Features can be of any kind, such as discrete, continuous or nominal. Feature selection is the process of selecting a subset of the original features according to certain criteria Jain and Zongker (1997). It is frequently used in data mining for dimension reduction Blum and Langley (1997). A large reduction in the number of features, obtained by removing irrelevant and redundant ones, increases the speed of the mining algorithm. It also improves accuracy and produces a better classifier model. Dimensionality is a major problem in machine learning and data mining applications. Dimensionality reduction has always been an active research area in pattern recognition, machine learning and data mining, for genomic analysis Inza et al. (2004), text mining Forman (2003) and multi-source data Zhao et al. (2008, 2010). Earlier researchers realized that feature selection is an essential component of successful data mining Liu and Motoda (1998), Guyon and Elisseeff (2003), Liu and Motoda (2007). The target of dimensionality reduction is to improve classification performance, and it can be reached in two different ways, i.e. through feature selection and feature transformation.

3.3 Objectives of the Feature Selection

The objective of feature selection is threefold:

• Avoid overfitting and improve the prediction performance of the predictors
• Provide faster and more cost-effective predictors/models
• Gain deeper insight into the underlying processes that generated the data.

3.4 Feature Subset Selection Process

The feature subset selection process typically involves generation of the subset, criteria for stopping the search, and validation procedures Dash (1997). The various steps of this process are shown in Fig. 3.1.

Starting Point:

First, the starting point in the feature space is determined from the full set of features, which influences the direction of the search. The search for feature subsets starts either with no features or with the full set of features. The first approach is forward selection, whereas the second approach is backward selection Blum and Langley (1997).

[Fig. 3.1 shows a two-phase flow. Phase I: the training data passes through subset generation and an evaluation criterion, looping until the stopping condition is met and the best subset is returned. Phase II: a learning model is trained on the training data with the selected subset and its accuracy (ACC) is measured on the test data.]

Fig. 3.1: Feature Subset Selection Process – Training and Test Model

Search Strategy:

The best subset of features could be found by evaluating all possible subsets, which is known as exhaustive search. However, an exhaustive search must examine all 2^n possible subsets of n features, so it becomes impractical for a huge number of features. Therefore, a more realistic and practical approach is needed here.

Subset Evaluation:

After generating the candidate feature subsets, they need to be evaluated. The filter approach and the wrapper approach are followed in this evaluation process. Details of these approaches are explained in the following sections of this chapter. Based on these two approaches, the goodness (i.e. classification accuracy) of the executing algorithm on the preprocessed dataset is determined Blum and Langley (1997), Draper et al. (2003).

Stopping Criteria:

It is necessary to halt the search Blum and Langley (1997) once the number of selected features reaches a predetermined threshold value Liu and Yu (2005). After this iterative process, the best subset among the generated candidates can be selected. A stopping criterion controls and determines when the feature selection process must stop. Some frequently used stopping criteria are:

• The search reaches the end and is completed.
• The search reaches an externally specified bound (a minimum number of features or a maximum number of iterations).
• A reasonably good subset of features is found.

Result Validation:

The simple and straightforward method of validating the result is to measure it directly using prior knowledge about the data. In real-world situations, such adequate prior knowledge is not available in all circumstances. So, indirect methods are used that monitor the change in mining performance as the features change. For example, if the classification error rate is used as the performance indicator for a learning task, a "before-after" experiment can be performed for a selected feature subset: compare the error rate of the classifier learned on the full set of features with the error rate of the classifier learned on the selected subset Liu and Motoda (1998), Witten et al. (2000). The feature selection process is diagrammatically represented in Fig. 3.1.

A typical feature selection process has two phases: (i) feature selection and (ii) model fitting and performance evaluation. The feature selection phase further contains three steps (a minimal sketch of this loop is given after the next paragraph):

(i) Generating a candidate subset from the original features via a certain search strategy.
(ii) Evaluating the candidate set by estimating the utility of the features it contains. Based on the evaluation, some features in the candidate set might be discarded or new ones might be added to the selected feature set according to their relevance.
(iii) Determining whether the current set of selected features is good enough using a certain stopping criterion; if so, the feature selection method returns the selected set of features, else it iterates until the stopping criterion is met.

A test set is commonly used to determine the accuracy of the model. Usually, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to validate it.
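The iterative procedure described above can be summarized, under stated assumptions, in the following MATLAB sketch. The handles generate, evaluate and shouldStop are hypothetical placeholders for a concrete search strategy, evaluation measure and stopping criterion; none of this is code from the thesis.

function best = fs_loop_sketch(X, y, generate, evaluate, shouldStop)
% Minimal sketch (assumed): the generic feature selection loop -- generate a
% candidate subset, evaluate it, keep the best one so far, and stop when the
% stopping criterion is met.  generate must return a non-empty index vector.
best = [];  bestScore = -inf;  iter = 0;
while true
    iter   = iter + 1;
    subset = generate(best);                % step (i): candidate subset (placeholder handle)
    score  = evaluate(X(:, subset), y);     % step (ii): utility of the candidate
    if score > bestScore
        best = subset;  bestScore = score;  % keep the better subset
    end
    if shouldStop(iter, bestScore)          % step (iii): stopping criterion
        break;
    end
end
end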

3.5 Characteristics of Feature Subset Selection

The quality of the feature subset is essential to determine the performance of the algorithm. Also, the amount of data that needs to be processed depends on the dimensionality of the features. The quality of the representation of the instance data is very important Mark et al. (2003). If much irrelevant, redundant or noisy information is present along with unreliable data, then knowledge discovery during the training phase is merely a waste of effort. In real-world data, the representation often consists of too many features, but only a few of them may be related to the target concept. Redundancy is always expected, and it should be carefully handled and removed. Generally, features are characterized as follows Veeraswamy et al. (2011):

• Relevant: These features have an influence on the output and their role cannot be assumed by the rest.

• Irrelevant: These features do not have any influence on the output, and their values are random for each instance.

• Redundant: Redundancy exists whenever a feature can take over the role of another feature.

Depending on how the selected features are evaluated, different strategies can be adopted, which broadly fall into three categories: the filter model, the wrapper model and the embedded model. Fig. 3.2 and Fig. 3.3 represent the filter model and the wrapper model respectively; the embedded model essentially utilizes the filter model inside the wrapper model.

[Fig. 3.2 (filter model): all features → feature subset selection → optimal feature subset → learning algorithm → performance. Fig. 3.3 (wrapper model): all features → feature subset generation → feature evaluation in a loop with the learning algorithm → optimal feature subset → learning algorithm → performance.]

Fig. 3.2 Filter Model and Fig. 3.3 Wrapper Model

In the filter model, the evaluation step of the feature selection algorithm relies on analyzing the general characteristics of the data, evaluating features without involving any learning algorithm. On the other hand, feature selection algorithms of the wrapper model require a predetermined learning algorithm. Algorithms of the embedded model, e.g. C4.5 Quinlan (1993), incorporate feature selection as part of the model fitting / training process; the utility of the features is obtained by analyzing their usefulness for optimizing the objective function of the learning model.

Compared to the wrapper and embedded models, algorithms of the filter model are independent of any learning model and are free from the bias associated with particular learning models. This is the main advantage of the filter model. Another advantage is that the filter model allows the algorithms to have a very simple structure, which usually employs a straightforward search strategy such as backward elimination or forward selection. The filter model is easy to design and easy to understand, which explains why most feature selection algorithms use it.

In real-world applications as well, the most frequently used feature selection algorithms are filters. Since the structure of these algorithms is simple, they normally run very fast. On the other hand, researchers also recognize that, compared to the filter model, feature selection algorithms that embed the learning model can usually select features that result in higher learning performance for the particular learning model used in the feature selection process Kira and Rendell (1992).

3.6 Feature Selection Methods Under This Research Work

In the fields of machine learning and pattern recognition, feature subset selection is a prime area in which many approaches have been proposed. In this research work, six standard existing feature selection methods are chosen and their performance is analyzed on the colon tumor dataset from the public domain. Feature subsets of 10, 20, 30, 40 and 50 features are selected from this public data for analysis, and the classification accuracy obtained with these selected features is calculated. More details of this execution and the results are given in Chapter 4.

The fundamentals and formulas of the six standard existing feature selection methods used in this research work are as follows:

• Gini Index
• Chi Square
• MRMR
• T-Test
• Relief-F
• Information Gain

3.6.1 Gini Index

The Gini coefficient (also known as the Gini index or Gini ratio) is a measure of statistical dispersion intended to represent income distribution Breiman et al. (1984). It is the most commonly used measure of inequality, developed by the Italian statistician and sociologist Corrado Gini and published in his paper "Variability and Mutability" (Italian: Variabilità e mutabilità) in 1912 Gini, C. (1912). It measures the inequality among the values of a frequency distribution. The Gini coefficient can theoretically range from 0 (complete equality) to 1 (complete inequality). Normally the mean (or total) is assumed to be positive, which rules out a Gini coefficient less than zero.

The Gini coefficient was often calculated with the more practical Brown formula:

G = 1 - \sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1})   (Equation 1)

Where,

G : Gini coefficient
X_k : cumulated proportion of the one variable, for k = 0, ..., n, with X_0 = 0, X_n = 1
Y_k : cumulated proportion of the target variable, for k = 0, ..., n, with Y_0 = 0, Y_n = 1
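As an illustration only (not code from this thesis), the Brown formula can be evaluated for a single feature as in the MATLAB sketch below; the vector expr is hypothetical example data.

% Minimal sketch (assumed): Gini coefficient of one feature via the Brown
% formula of Equation 1.
expr = [0.2 1.5 0.9 3.1 0.4 2.2];        % hypothetical non-negative feature values
x    = sort(expr);                       % order the samples by value
n    = numel(x);
X    = (0:n) / n;                        % cumulated proportion of the samples
Y    = [0 cumsum(x)] / sum(x);           % cumulated proportion of the variable
G    = 1 - sum(diff(X) .* (Y(1:end-1) + Y(2:end)));   % Equation 1
fprintf('Gini coefficient = %.4f\n', G);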

3.6.2 Chi Square

Chi-square is a common statistical test that measures the divergence from the distribution expected if the feature occurrence were actually independent of the class value. When the chi-square statistic is larger than the critical value determined by the degrees of freedom, the feature and the class are considered dependent, and such features are selected. The chi-square method evaluates features individually by measuring their chi-square statistic with respect to the classes Liu, Li and Wong (2002). The formula for chi-square is:

\chi^2(f) = \sum_{v \in V} \sum_{i=1}^{m} \frac{(A_i(f = v) - E_i(f = v))^2}{E_i(f = v)}   (Equation 2)

In Equation 2, A_i(f = v) is the number of instances in class c_i with f = v, and E_i(f = v) is the expected value of A_i(f = v), calculated as P(f = v) P(c_i) N, where N is the total number of instances.

This method measures the lack of independence between a term and a category. In statistics, the χ² test is applied to test the independence of two events. Two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, if P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are the occurrence of the term and the occurrence of the class.
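Assuming a discretised feature, Equation 2 can be computed directly, as in the following MATLAB sketch; the vectors f and c are hypothetical example data, not taken from the colon tumor dataset.

% Minimal sketch (assumed, not the thesis code): chi-square score of one
% discretised feature f against the class labels c, following Equation 2.
f = [1 1 2 2 1 2 2 1];                 % hypothetical discrete feature values
c = [1 1 1 2 2 2 2 1];                 % hypothetical class labels
N = numel(c);
chi2 = 0;
for v = unique(f(:)).'                 % loop over the feature values in V
    for ci = unique(c(:)).'            % loop over the m classes
        A = sum(f == v & c == ci);             % observed count A_i(f = v)
        E = sum(f == v) * sum(c == ci) / N;    % expected count P(f = v) P(c_i) N
        chi2 = chi2 + (A - E)^2 / E;
    end
end
fprintf('chi-square score = %.4f\n', chi2);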

3.6.3 MRMR

Maximum Relevance-Minimum Redundancy (MRMR) is a feature selection scheme that adopts the features having the strongest correlation with the classification variable while being mutually as different from each other as possible. MRMR uses the mutual information between a feature and a class as the relevance of the feature for that class. According to Peng and Long (2005), maximal relevance searches for a feature set S satisfying

\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i ; c)   (Equation 3)

where I(x_i ; c) indicates the mutual information between feature x_i and class c. MRMR also uses the mutual information between features as the redundancy of each feature. The following condition finds the minimal-redundancy feature set:

\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i , x_j)   (Equation 4)

where I(x_i , x_j) indicates the mutual information between features x_i and x_j.

The criterion combining the above two conditions is called "minimal-redundancy maximal-relevance" (MRMR). The MRMR measure takes the following form, optimizing D and R simultaneously:

\max \Phi(D, R), \quad \Phi = D - R   (Equation 5)

where D - R expresses the trade-off between the relevance and the redundancy of each feature.
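A greedy selection consistent with Equations 3-5 might look like the MATLAB sketch below. It is an assumed illustration rather than the implementation used in this work: it operates on discretised feature values and estimates mutual information with a simple plug-in estimator. A call such as S = mrmr_sketch(Xdisc, labels, 10) (with hypothetical inputs) would return the indices of ten selected features.

function S = mrmr_sketch(X, c, k)
% Minimal sketch (assumed, not the thesis implementation) of greedy MRMR
% selection following Equations 3-5.  X is an n-by-p matrix of discretised
% feature values, c an n-by-1 vector of class labels, k the number of
% features to select (k <= p).
p = size(X, 2);
S = [];                                        % indices of the selected features
rel = arrayfun(@(j) mi(X(:, j), c), 1:p);      % relevance I(x_j ; c) of every feature
for t = 1:k
    best = -inf;  bestj = 0;
    for j = setdiff(1:p, S)
        red = 0;                               % average redundancy against the selected set
        if ~isempty(S)
            red = mean(arrayfun(@(s) mi(X(:, j), X(:, s)), S));
        end
        score = rel(j) - red;                  % Equation 5: maximise D - R
        if score > best, best = score; bestj = j; end
    end
    S(end + 1) = bestj;                        %#ok<AGROW>
end
end

function v = mi(a, b)
% Plug-in estimate of the mutual information I(a ; b) for discrete vectors.
[~, ~, ia] = unique(a);
[~, ~, ib] = unique(b);
joint = accumarray([ia ib], 1) / numel(a);     % joint distribution P(a, b)
pa = sum(joint, 2);  pb = sum(joint, 1);
pab = pa * pb;                                 % product of the marginals P(a)P(b)
nz  = joint > 0;
v   = sum(joint(nz) .* log(joint(nz) ./ pab(nz)));
end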

3.6.4 T-Test

The T-test is another common statistical test used for comparing a sample mean, or a difference in means, with the true mean or the expected difference in means. The T-test is a simple method frequently applied in gene expression analysis. It operates on the logarithm of the expression levels and compares the means and variances of the treatment and control groups. The basic problem with the T-test is that it requires repeated treatment and control experiments, which are both tedious and costly, and a small number of experimental repetitions can affect the reliability of this mean-based approach.

The formula for the t-test is given below:

t = \frac{\mu_E - \mu_C}{\sqrt{\frac{var_E}{N_E} + \frac{var_C}{N_C}}}   (Equation 6)

The corresponding degrees of freedom are computed as:

\nu = \frac{\left(\frac{var_E}{N_E} + \frac{var_C}{N_C}\right)^2}{\frac{(var_E / N_E)^2}{N_E - 1} + \frac{(var_C / N_C)^2}{N_C - 1}}   (Equation 7)

where ν is the degrees of freedom, μ_E and μ_C are the group means, var_E and var_C the group variances, and N_E and N_C the sizes of the experiment (treatment) and control groups respectively.
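Assuming E and C hold the log expression values of the treatment and control groups, Equations 6 and 7 can be evaluated as in the MATLAB sketch below with hypothetical data.

% Minimal sketch (assumed): Welch-style t statistic and degrees of freedom
% from Equations 6 and 7 for one gene.
E  = log2([120  95 180 160 140]);    % hypothetical treatment-group expression values
C  = log2([ 60  70  55  80  65]);    % hypothetical control-group expression values
NE = numel(E);  NC = numel(C);
vE = var(E);    vC = var(C);
t  = (mean(E) - mean(C)) / sqrt(vE/NE + vC/NC);                    % Equation 6
nu = (vE/NE + vC/NC)^2 / ((vE/NE)^2/(NE-1) + (vC/NC)^2/(NC-1));    % Equation 7
fprintf('t = %.3f, degrees of freedom = %.2f\n', t, nu);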

3.6.5 Relief-F

Relief-F is a feature selection strategy that chooses instances randomly and changes the feature-relevance weights based on the nearest neighbors. On its merits, Relief-F is one of the most successful strategies in feature selection. The main idea of Relief-F is to compute, for each feature, a score measuring how well the feature separates neighboring samples in the original space. The nearest-neighbor version seeks, for each example, its nearest example from the same class (the nearest hit) and its nearest example from the opposite class (the nearest miss), in the original feature space. The score of a feature is then the difference (or the ratio) between the average distance to the nearest miss and the average distance to the nearest hit, taken over all samples in projection on that feature. The ratio is used in this research work because of the self-normalization of the scores Marko Robnik et al. (2003).
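For illustration, a simplified two-class Relief computation (using the difference form rather than the ratio adopted in this work, and assuming data already scaled to [0, 1]) could be sketched in MATLAB as follows; it is not the Relief-F implementation used in the thesis.

function w = relief_sketch(X, y)
% Minimal sketch (assumed): simplified two-class Relief score per feature,
% difference form.  X is an n-by-p data matrix scaled to [0, 1], y an n-by-1
% class vector; every class is assumed to have at least two samples.
[n, p] = size(X);
w = zeros(1, p);
for i = 1:n
    d = sum((X - X(i, :)).^2, 2);        % squared distances to sample i
    d(i) = inf;                          % ignore the sample itself
    hit  = nearest_in(d, y == y(i));     % nearest sample of the same class
    miss = nearest_in(d, y ~= y(i));     % nearest sample of the other class
    % the weight grows when the feature keeps hits close and misses far
    w = w + abs(X(i, :) - X(miss, :)) - abs(X(i, :) - X(hit, :));
end
w = w / n;
end

function idx = nearest_in(d, mask)
d(~mask) = inf;                          % restrict the search to the given group
[~, idx] = min(d);
end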

3.6.6 Information Gain

Information Gain (IG) is another method for attribute selection Hunt et al. (1966). This measure is based on the pioneering work of Claude Shannon on information theory, which studied the value, or "information content", of messages. IG measures the number of bits of information obtained for class prediction by knowing the value of a feature.

The information gain of a feature f is defined as Wang et al. (2005)

G(f) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + \sum_{v \in V} P(f = v) \sum_{i=1}^{m} P(c_i \mid f = v) \log P(c_i \mid f = v)   (Equation 8)

where \{c_i\}_{i=1}^{m} denotes the set of classes and V is the set of possible values of feature f Yang and Honavar (1997).

George Forman (2003) states that this measure generalizes to any number of classes.
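Assuming a discretised feature, Equation 8 can be computed as in the MATLAB sketch below (base-2 logarithms, hypothetical example vectors f and c).

% Minimal sketch (assumed): information gain of one discretised feature,
% following Equation 8 with base-2 logarithms.
f = [1 1 2 2 1 2 2 1];                       % hypothetical discrete feature values
c = [1 1 1 2 2 2 2 1];                       % hypothetical class labels
N = numel(c);
classes = unique(c(:)).';
Pc = arrayfun(@(ci) mean(c == ci), classes); % class priors P(c_i)
G  = -sum(Pc .* log2(Pc));                   % start from the class entropy
for v = unique(f(:)).'
    idx = (f == v);
    Pcv = arrayfun(@(ci) mean(c(idx) == ci), classes);   % P(c_i | f = v)
    Pcv = Pcv(Pcv > 0);                      % drop empty cells before the log
    G = G + (sum(idx) / N) * sum(Pcv .* log2(Pcv));      % + P(f=v) sum P(c|v) log P(c|v)
end
fprintf('information gain G(f) = %.4f bits\n', G);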

3.7 Classifiers Used for Performance Evaluation

Understanding the outcome of a performance evaluation is very important in order to identify whether performance has improved or not. The required metrics should be finalized prior to the execution of any algorithm.

The Bayes classifier and the J48 (C4.5) classifier are the two classifiers used in this research work to evaluate performance and to verify the accuracy Quinlan (1993). C4.5 is quite popular for its efficient decision tree-based algorithm. The accuracy and error rate are calculated from the quantities in Table 3.1 using the formulas given below.

Table 3.1 Classification Constraints

                               Predicted Class
                         Class = Yes     Class = No
Actual     Class = Yes     a (TP)          b (FN)
Class      Class = No      c (FP)          d (TN)

Here, 'a' denotes the number of True Positives (TP),
'b' denotes the number of False Negatives (FN),
'c' denotes the number of False Positives (FP), and
'd' denotes the number of True Negatives (TN).

The accuracy and error rate are calculated as follows:

Accuracy = (a + d) / (a + b + c + d)
i.e. Accuracy = (No. of True Positives + No. of True Negatives) / (Sum of all)

Error rate = (b + c) / (a + b + c + d)
i.e. Error rate = (No. of False Negatives + No. of False Positives) / (Sum of all)
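The quantities of Table 3.1 and the two formulas above can be computed directly from predicted and actual labels, as in this MATLAB sketch with hypothetical labels.

% Minimal sketch (assumed): accuracy and error rate from predicted and actual
% labels, matching the quantities a, b, c, d of Table 3.1.
actual    = [1 1 1 0 0 1 0 0];            % hypothetical labels (1 = Yes, 0 = No)
predicted = [1 0 1 0 1 1 0 0];
a = sum(predicted == 1 & actual == 1);    % true positives  (TP)
b = sum(predicted == 0 & actual == 1);    % false negatives (FN)
c = sum(predicted == 1 & actual == 0);    % false positives (FP)
d = sum(predicted == 0 & actual == 0);    % true negatives  (TN)
accuracy  = (a + d) / (a + b + c + d);
errorRate = (b + c) / (a + b + c + d);
fprintf('accuracy = %.3f, error rate = %.3f\n', accuracy, errorRate);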

3.8 Validation Methods

Leave-One-Out Cross Validation (LOOCV) and k-fold cross validation are the two validation methods used in this research work to validate the performance evaluated in the previous step.

3.8.1 Leave One Out Cross Validation (LOOCV)

This is a cross-validation method that performs testing in rounds. In each round, the classifier is trained on a set of samples and then tested on the rest. Finally, the classification accuracy values of the rounds are averaged. Many validation methods introduced later also build on this cross-validation scheme.

Leave One Out Cross Validation (LOOCV) is the most commonly used method for error estimation. One reasonable approach to measuring the performance of a feature selection method is to define a classification task and measure the classification accuracy on that task. Given the colon tumor dataset with 62 class-labeled samples, the LOOCV method constructs n classifiers, where each classifier is trained on n-1 samples of the dataset and tested on the remaining sample.

This execution is repeated n times, once for each of the n observations, and the mean squared error is then computed. More generally, in this style of cross validation the dataset is partitioned into k disjoint subsets; the classifier is trained on k-1 partitions and tested on the remaining one.

3.8.2 k-Fold Validation

The k-fold validation method divides the data randomly into k complementary sets of samples and then iterates for k rounds, in each round using one of these sets for testing and the rest for training. LOOCV is the special case of k-fold validation where k equals the number of available samples, meaning that in each round one sample is held out for validation while the rest are used for training.
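A generic sketch of this validation scheme, under stated assumptions, is given below in MATLAB; trainAndPredict is a hypothetical handle wrapping any classifier (for instance a Bayes or J48/C4.5 wrapper), and calling cv_sketch(X, y, numel(y), trainAndPredict) yields LOOCV.

function acc = cv_sketch(X, y, k, trainAndPredict)
% Minimal sketch (assumed): generic k-fold cross validation; LOOCV is the
% special case k = numel(y).  trainAndPredict is a placeholder handle of the
% form yhat = trainAndPredict(Xtrain, ytrain, Xtest).
n = numel(y);
fold = mod(randperm(n), k) + 1;          % random assignment of the samples to k folds
correct = 0;
for f = 1:k
    test  = (fold == f);
    train = ~test;
    yhat  = trainAndPredict(X(train, :), y(train), X(test, :));
    yt    = y(test);
    correct = correct + sum(yhat(:) == yt(:));   % count correct test predictions
end
acc = correct / n;                       % averaged classification accuracy
end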

3.9 Matlab Tool

Matlab is a numerical computing environment with a fourth-generation programming language, developed by MathWorks. Matlab allows matrix manipulation, plotting of functions, data analysis, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages.

Matlab is one of the most important tools for data analysis. It has numerous toolboxes which assist in complex calculations, graphs can be plotted directly using built-in functions, and it offers a wide range of options needed for microarray data analysis with numerical program runs. Matlab is a popular and well-suited software package that performs well at:

• Prototyping
• Engineering simulation
• Fast visualization of data

Hence, Matlab was employed for the microarray data analysis of the colon tumor dataset in this research. In many circumstances, Matlab was used to optimize the algorithms among the many algorithms tried out. Many present-day researchers adopt it as the primary tool for their research work.

3.10 Methodology Adopted for Colon Tumor Dataset

Most people would not reveal that they have cancer due to the fear, shyness and social stigma associated with it, so a real survey conducted by any individual would not have been truthfully successful. In addition, some previous researchers used the colon tumor dataset and highlighted its complexity. Hence, it was decided to select and use the available standard colon tumor dataset for this research work.

The colon tumor dataset was used as the primary data source for this research work. It is a standard patient database that covers most of the colon tumor facts globally accepted by the research community. This research work selected a 'colon tumor data set' from this standard patient database using a general selection approach. This colon tumor dataset was utilized throughout the research work at various levels, and it was also used to evaluate the performance of the proposed algorithms. Detailed information about this dataset is given in Table 3.2.

Table 3.2: The Colon Tumor Microarray Dataset

Data set         : Colon Tumor
Number of Genes  : 2000
Classes          : Normal (Positive), Cancer (Negative)
Training Data    : Normal - 22, Cancer - 40
Reference        : http://www.molbio.princeton.edu/colondata

Colon Tumor Microarray Dataset:

This colon tumor dataset contains 62 samples collected from colon tumor patients. Among them, 40 biopsies were taken from tumors (labeled as "negative") and 22 normal biopsies (labeled as "positive") were taken from healthy parts of the colons of the same patients. Each sample is described by 2000 genes, so the data set contains 62 x 2000 continuous variables and 62 class labels (the negatives were represented as 1 and the positives as 2 for ease of handling inside the MATLAB code).

In the dataset, each sample records the corresponding genes or Expressed Sequence Tags (EST) of the 2000 gene probes on the DNA chip, namely its gene expression profile. The source of the dataset is http://www.molbio.princeton.edu/colondata. The final status of each biopsy sample was labeled by pathological examination. Gene expression levels in the 62 samples were measured using high-density oligonucleotide arrays, and of the genes detected in these microarrays, 2000 were selected based on the confidence in the measured expression levels.

The selected colon tumor dataset then underwent data preprocessing, feature selection and classification through the Weka package for this research work.

[Fig. 3.4 shows the workflow: the colon tumor dataset is checked for missing values (which are replaced if present), normalized, passed to the existing six feature selection methods, and then to classification and evaluation.]

Fig. 3.4: Methodology Adopted for Colon Tumor Dataset

(i) Preprocessing the Dataset

Data pre-processing is the first step in the data mining process and is mandatory before applying data mining to the dataset Lin and Haug (2006). Without knowing the data carefully in advance, the classification task could be misleading and end with inaccurate outcomes. First, the whole dataset is passed through a missing value replacement scheme; then the data are normalized to a small scale before being fed into the feature selection process. Data preprocessing should be applied before the data mining process to improve the quality of the data. Fig. 3.4 shows the detailed workflow used in this research work.

(ii) For the Missing Values Problem

In any research practice, it is very common to face missing value issues. Many datasets contain missing values, and many learning schemes lack the ability to handle them. In this research work, the missing values are handled using a K-Nearest Neighbor (K-NN) approach.
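As an illustration of such an approach (assumed, and not necessarily the exact replacement scheme used here), each missing entry can be filled with the average over the K most similar samples, as in the MATLAB sketch below.

function X = knn_impute_sketch(X, K)
% Minimal sketch (assumed): fill each missing entry (NaN) with the mean of
% that column over the K rows that are closest on the columns both rows have
% observed.
n = size(X, 1);
miss = isnan(X);
for i = find(any(miss, 2)).'                     % rows that contain at least one gap
    d = nan(n, 1);
    for j = 1:n
        cols = ~miss(i, :) & ~miss(j, :);        % columns observed in both rows
        if j ~= i && any(cols)
            d(j) = mean((X(i, cols) - X(j, cols)).^2);   % mean squared distance
        end
    end
    [~, order] = sort(d);                        % NaN distances sort to the end
    nbrs = order(1:min(K, sum(~isnan(d))));      % the K closest comparable rows
    for col = find(miss(i, :))
        vals = X(nbrs, col);
        vals = vals(~isnan(vals));               % neighbours may miss this column too
        if ~isempty(vals)
            X(i, col) = mean(vals);              % neighbour average fills the gap
        end
    end
end
end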

(iii) Normalization Process

It is very important to scale the data down into a small range in all experimental situations. The colon tumor data set was normalized to values between 0 and 1 in order to prevent features with large numeric ranges from dominating those with small numeric ranges. Data instances were rescaled between 0 and 1 using the min-max normalization procedure, which performs a linear transformation of the original input range into a new specified range: the old minimum min_old is mapped to the new minimum min_new (i.e. 0) and max_old is mapped to max_new (i.e. 1), as shown in Equation 9.

New_value = \frac{Original_value - Min_old}{Max_old - Min_old} \times (Max_new - Min_new) + Min_new   (Equation 9)
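Equation 9 with min_new = 0 and max_new = 1 reduces to the following column-wise MATLAB sketch (hypothetical data); constant columns, where max_old equals min_old, would need separate handling.

% Minimal sketch (assumed): min-max rescaling of every feature (column) to [0, 1].
X = rand(5, 4) * 100;                        % hypothetical raw expression matrix
minOld = min(X, [], 1);                      % per-feature minimum
maxOld = max(X, [], 1);                      % per-feature maximum
Xnorm  = (X - minOld) ./ (maxOld - minOld);  % Equation 9 applied to each column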

(iv) Feature Space Reduction Using the Existing Six and the Proposed Two Feature Selection Methods

This step aims to select the most relevant features and reduce the size of the colon dataset using the existing six feature selection methods of this research. The combined advantages of the six methods were used to identify informative genes from the colon tumor dataset in order to speed up processing and increase efficiency, flexibility and adaptability. Feature space reduction is applied with the existing six feature selection methods as a feature subset optimization process. The methods are shown below in Fig. 3.5.

[Fig. 3.5 groups the methods used to reduce the feature space: the existing feature selection methods (Chi-Square, T-Test, Information Gain, MRMR, Gini-Index, Relief-F) and the proposed methods (1. DWFS-AED, 2. DWFS_CKN).]

Fig. 3.5 Existing Six Feature Selection Methods and Proposed Two New Methods

This step helped to reduce the large set of features by selecting 10, 20, 30, 40 and 50 features from the reduced colon tumor dataset. The purpose of this step is to optimize the performance of the feature subset.

(v) Accuracy Calculation

The reduced feature subset from the previous step is used to calculate the accuracy with the two classifiers, Bayes and J48, in this research work. The error rate provides a measure of how well the selected features can discriminate the data: the lower the error rate, the better the selected set of features.

(vi) Validation Using LOOCV and 10-fold CV

Here, the existing six feature selection methods were compared with the two proposed new methods. After applying 10-fold cross validation, the results obtained from the existing six methods were compared with those of the two proposed methods.

The obtained results were consolidated, compared and analyzed, and conclusions were drawn. The proposed methods showed better classification accuracy than the other methods, which is elaborated in Chapter 7.

3.11 Summary

This chapter provides an overview of feature selection, including its basic concepts, objectives, general procedure and characteristics. It also details the feature selection methods taken for analysis in this research work, i.e. the six existing feature selection methods and the formulas used for each method. Additionally, it presents the research methodology, the steps involved in the research work, the dataset used, the tools used for data analysis, the metrics used for performance evaluation and the validation methods used in the research work.

Feature selection methods reveal future directions for developing challenging new algorithms, particularly for handling the large dimensionality of informative gene selection with new data types. Accuracy is of the utmost importance in the field of medical diagnosis when analyzing a patient's disease. The experimental results showed that the preprocessing technique greatly enhanced the classification accuracy obtained with feature selection. In this research work, two new feature selection algorithms were developed for a particular dataset, the colon tumor dataset, and applied to enhance the classification accuracy.
