
Expert Systems with Applications 38 (2011) 2727–2732

Contents lists available at ScienceDirect: Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

Feature selection using genetic algorithm and cluster validation


Yi-Leh Wu (a), Cheng-Yuan Tang (b,*), Maw-Kae Hor (c), Pei-Fen Wu (b)

(a) Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
(b) Dept. of Information Management, Huafan University, Taipei, Taiwan
(c) Dept. of Computer Science, National Chengchi University, Taipei, Taiwan

Keywords:
Feature selection
Image retrieval
Genetic algorithms
Taguchi method
Hubert's Γ statistic

Abstract

Feature selection plays an important role in image retrieval systems. A better selection of features usually results in higher retrieval accuracy. This work tries to select the best feature set from a total of 78 low-level image features, including regional, color, and texture features, using the genetic algorithm (GA). However, the GA is known to be slow to converge. In this work we propose two directions for improving the convergence time of the GA. First, we employ the Taguchi method to reduce the number of offspring that must be tested in every generation of the GA. Second, we propose to use an alternative measure, the Hubert's Γ statistic, to evaluate the fitness of each offspring instead of evaluating the retrieval accuracy directly. The experiment results show that the proposed techniques improve the feature selection results of the GA in both time and accuracy.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Today's image retrieval systems usually employ low-level color, shape, and texture features to represent the contents of a given image. However, these low-level image features are usually not consistent with human perception of semantics, which results in less satisfactory image retrieval performance. In recent years, many researchers have proposed image retrieval systems that are more aware of the human visual system, such as the Region-Based Image Retrieval (RBIR) systems (Carson, Belongie, Greenspan, & Malik, 2002; Li, Dai, Xu, & Er, 2008; Ma & Manjunath, 1997; Wang, 2001; Wang, Li, & Wiederhold, 2001), which employ objects or similar image regions as the basis for retrieval. When the whole image is the retrieval target, if there are many objects in the images or the image backgrounds are not related to the foreground objects, the retrieval results will be unsatisfactory. Image retrieval systems based on RBIR include Berkeley Blobworld (Carson et al., 2002), UCSB Netra (Ma & Manjunath, 1997), SIMPLIcity (Wang, 2001), etc.

RBIR systems first segment images into many regions and then extract image features from each segmented region. Each region is represented by a feature vector. Feature vectors may have different dimensionality depending on the number of image features used to represent the given region. In this work, we employ the Blobworld (Carson et al., 2002) method to segment images. For each segmented region we then extract the low-level image features. Our initial image feature set includes three categories of features. The first category represents the regional information, which includes the region position, the circumference, the area, etc. The second category represents the color information, which includes the Lab color values, the invariant moments, the color moments, the color coherence vectors, etc. The third category represents the texture information, which includes the edge orientation histogram, the edge density, the anisotropy, the contrast, etc. A total of 78 image features are included in the initial image feature set. The similarity of two given image regions is computed by the Euclidean distance of their corresponding feature vectors.

To improve the accuracy of image retrieval systems, it is important to have a proper image feature set that describes the contents of an image. A more suitable image feature set results in higher retrieval accuracy. The main contributions of this work are summarized as follows:

• We propose to employ the Hybrid Taguchi-Genetic Algorithm (HTGA) to perform feature selection for RBIR systems.
• Instead of using the direct retrieval accuracy, which is expensive to compute, to select better offspring in every generation of the HTGA, we propose to use the Hubert's Γ statistic, which estimates the cluster validity, as the fitness measure.
• We propose to use the Halton quasi-random sampling method, which greatly reduces the computation time of the Hubert's Γ statistic.

Our experiment results support that the proposed improvements over the original GA can perform feature set selection efficiently from a large image feature set.

* Corresponding author. E-mail addresses: ywu@csie.ntust.edu.tw (Y.-L. Wu), cytang@cc.hfu.edu.tw (C.-Y. Tang), hor@cs.nccu.edu.tw (M.-K. Hor).

0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.08.062
2728 Y.-L. Wu et al. / Expert Systems with Applications 38 (2011) 2727–2732

2. The Hybrid Taguchi-Genetic Algorithm (HTGA)

The Taguchi method (Tsai, Liu, & Chou, 2004) is commonly applied in quality management to improve the quality and the stability of production. The Taguchi method can reduce environmental influence in production. From the manufacturing cost point of view, the Taguchi method allows the use of low-grade materials and inexpensive equipment while maintaining a certain quality level. From the development cost point of view, the Taguchi method aims to shorten the construction period and to reduce the required resources. The Taguchi method is a robust design approach with the characteristics of high quality, short development time, and low cost. The two major tools employed by the Taguchi method are the orthogonal array (OA) and the signal-noise ratio (SNR). We briefly discuss them as follows.

2.1. The orthogonal array (OA)

In factor design, when the number of factors increases, the number of experiments required increases. The Taguchi method utilizes the OA to collect the experimental data directly, and the result is a more robust factor estimator with fewer experiments required. The OA is an important tool for conducting a robust experiment design. A general orthogonal array is denoted as

L_a(b^c),

where a is the number of experimental runs, b is the number of levels for each factor, and c is the number of columns in the orthogonal array.

The quality of the OA greatly influences the accuracy and the objectivity of the experiments. To construct a qualified OA we use the following general principles:

1. All factors are assumed to be independent. During the numerical calculation, we do not take the interrelation among factors into account. If there are dependent factors, we create a polynomial term to represent them; e.g., if factor A and factor B are dependent, we create a new factor A * B separately as an independent factor.
2. The number of appearances of each level must be equal. To maintain the objectivity of the experiments, the occurrences of the levels must be equal; e.g., in a level-2 orthogonal array, if factor 1 has four 0's, then it must also have four 1's to preserve objectivity.
3. The stronger the orthogonal array, the more reliable the experiment results; however, stronger orthogonal arrays are harder to construct and require more experiments. The strength of an orthogonal array is defined as follows: an OA of level 2 (only 1 and 2 appear) and strength 3 has the characteristic that, selecting any three columns, every one of the eight level combinations (111, 112, 121, 122, 211, 212, 221, 222) appears equally often. A sample OA of level 2 and strength 3 is shown in Table 1.

Table 1
L8(2^4) orthogonal array.

Run | A | B | C | D
 1  | 1 | 1 | 1 | 1
 2  | 1 | 1 | 2 | 2
 3  | 1 | 2 | 1 | 2
 4  | 1 | 2 | 2 | 1
 5  | 2 | 1 | 1 | 2
 6  | 2 | 1 | 2 | 1
 7  | 2 | 2 | 1 | 1
 8  | 2 | 2 | 2 | 2

2.2. The signal-noise ratio (SNR)

The Taguchi method employs the SNR to estimate the contribution degree of each factor at each level to the objective function. The formulation of the SNR is derived from unbiasedness in statistics; it is an estimate of how samples deviate from the center of the population. The general formulation of the SNR is as follows:

Fig. 1. Architecture of the Hybrid Taguchi-Genetic Algorithm (HTGA): fitness evaluation, mating selection, crossover, the Taguchi method, mutation, and replacement.
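The balance and strength principles of Section 2.1 can be checked mechanically. Below is a minimal Python sketch over the L8(2^4) array of Table 1; the helper names `balanced` and `strength` are our own, not from the paper:

```python
from itertools import combinations, product

# L8(2^4) orthogonal array from Table 1 (levels coded 1/2)
OA = [
    (1, 1, 1, 1), (1, 1, 2, 2), (1, 2, 1, 2), (1, 2, 2, 1),
    (2, 1, 1, 2), (2, 1, 2, 1), (2, 2, 1, 1), (2, 2, 2, 2),
]

def balanced(oa):
    """Principle 2: each level appears equally often in every column."""
    return all(col.count(1) == col.count(2) for col in zip(*oa))

def strength(oa, t):
    """True if every projection onto t columns contains every level
    combination equally often (the definition of strength t)."""
    n_cols = len(oa[0])
    for cols in combinations(range(n_cols), t):
        proj = [tuple(row[c] for c in cols) for row in oa]
        counts = [proj.count(p) for p in product((1, 2), repeat=t)]
        if len(set(counts)) != 1:
            return False
    return True

print(balanced(OA), strength(OA, 3))  # both hold for Table 1's array
```

Any three columns of the eight runs above cover all eight combinations exactly once, which is what makes this array a strength-3 design.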



SNR = -10 log[(ȳ - m)² + S²] = -10 log[(1/n) Σ_{i=1}^{n} (y_i - m)²],

where ȳ is the sample mean, m is the objective (target) value, S is the sample standard deviation, and n is the number of samples in the population.

In 2004, Tsai et al. proposed the Hybrid Taguchi-Genetic Algorithm (HTGA) (Tsai et al., 2004), which combines the Taguchi method and the GA and results in faster convergence. The main difference between the HTGA and the original GA is that the offspring of the crossover operation must pass an additional Taguchi method test, which yields the best offspring in each generation. A diagram of the HTGA is shown in Fig. 1. Through this optimization process, the GA converges earlier and with improved precision. The HTGA is detailed as follows:

1. Initialization (parameter setting): the population size is M chromosomes, the crossover rate is PC, the mutation rate is PM, and the number of generations is N.
2. Fitness: calculate the objective value of each individual and the fitness value of each population.
3. Selection: use the roulette wheel approach or a similar method to select the individuals with higher fitness to perform crossover.
4. Crossover: determined by the probability PC, select the set of individuals that should cross over. From the set we select two individuals at random and then apply the one-cut-point method to generate two offspring.
5. Taguchi test: with a 2-level orthogonal array appropriate for our experiment, we take the offspring of step 4 and calculate their fitness and SNR. We then calculate the effective degree of each factor in the objective function to generate the best offspring.
6. Repeat steps 3 and 4 until the number of better offspring reaches (1/2) * M * PC.
7. Mutation: the probability of mutation is determined by the mutation rate PM.
8. Replacement: sort the parents and offspring by their fitness measures, then select the best M chromosomes as the parents of the next generation.
9. Repeat steps 2–8 until one of the following two stopping conditions is met:
   • the HTGA converges to the optimal solution, or
   • the number of execution generations exceeds the pre-defined threshold.

3. Cluster validity

Cluster validity measures the adequacy of a structure recovered by cluster analysis in a way that can be interpreted objectively. The adequacy of a clustering structure refers to the degree to which the structure reflects the intrinsic character of the data (Bel Mufti, Bertrand, & El Moubarki, 2005; Dubes, 1993; Halkidi, Batistakis, & Vazirgiannis, 2002; Jain & Dubes, 1988; Liu, Jiang, & Kot, 2009; Santos, Marques de Sa, & Alexandre, 2008). In general, there are three criteria for investigating cluster validity, namely external, internal, and relative. Hypothesis tests are used to determine whether a recovered structure is appropriate for the data. When the external and internal criteria are used, the hypothesis tests check whether the value of the index is either very large or very small. Many statistical tools can be employed for cluster validity, e.g., Monte Carlo methods, Hubert's Γ, and the Goodman–Kruskal γ (Jain & Dubes, 1988). The steps for testing the validity of a clustering structure are as follows:

Step 1: Select the clustering structure, the validation criteria, and the index.
Step 2: Obtain the distribution of the index under the no-structure hypothesis.
Step 3: Compute the index for the clustering structure.
Step 4: Statistically test the no-structure hypothesis by determining whether the index from Step 3 is unusually large or unusually small.

3.1. The Hubert's Γ statistic

To validate a computed clustering structure one can compare it to an a priori structure. The Hubert's Γ statistic was designed to measure the fit between the data and an a priori structure. Let X = [X(i, j)] and Y = [Y(i, j)] be two n × n proximity matrices on n objects. X(i, j) is the observed proximity between objects i and j. Y(i, j) is defined as:

Y(i, j) = 0 if objects i and j belong to the same category; 1 if not.

The Hubert's Γ statistic is the point-by-point correlation between the two matrices X and Y. When X and Y are symmetric, we have

Γ = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} X(i, j) Y(i, j).

However, the Γ computed from the above equation is in its raw form. To normalize the Γ statistic, we have

Γ = { (1/M) Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} [X(i, j) - m_x][Y(i, j) - m_y] } / (S_x S_y),

where M = n(n - 1)/2, m_x and m_y denote the sample means of the entries in X and Y, and S_x and S_y denote the sample standard deviations of the entries in X and Y. The normalized Γ statistic has a range between -1 and 1. If the two matrices agree with each other in structure, then the absolute value of the Γ statistic will be unusually large. One of the most common applications of the Γ statistic is to test the random label hypothesis; i.e., could the values in one of the two matrices X and Y have been inserted at random? To test the random label hypothesis, the distribution of Γ under that hypothesis must be known in advance. This distribution is the accumulated histogram of Γ over all n! permutations of the row and column numbers of Y.

3.2. The Halton quasi-random numbers

When testing whether the value of Γ is unusually large, the distribution must be found by evaluating Γ for all n! permutations in advance. However, with six objects, 6! = 720 values of Γ must be computed, and with nine objects, 9! = 362,880 values must be found. This leads to a computationally expensive procedure. We propose to employ the Halton quasi-random numbers technique (Press, Teukolsky, Vetterling, & Flannery, 2002) as a solution to this high computational cost. The random samples generated by the Halton quasi-random numbers technique are distributed uniformly in n-dimensional space.

We employ the Halton quasi-random numbers technique to reduce computation by generating sample distributions for the Hubert's Γ statistic. The distributions of the Halton quasi-random numbers in a two-dimensional space are shown in Fig. 2.

Fig. 2. Random samples using Halton quasi-random numbers: (a) points 1–128, (b) points 129–512, (c) points 513–1024, and (d) points 1–1024.

4. Measure of retrieval efficiency

4.1. The precision and recall

Recall and precision are measurements of search efficiency in information retrieval. Recall refers to the ratio of the number of relevant images retrieved to the total number of relevant images in the collection. Precision refers to the ratio of the number of relevant images retrieved to the total number of images retrieved. Higher recall and precision mean better search efficiency. The definitions of recall and precision are as depicted in Fig. 3:

Recall = (Number of relevant images retrieved) / (Total number of relevant images in collection),
Precision = (Number of relevant images retrieved) / (Total number of images retrieved).

Fig. 3. Recall and precision: a + b + c + d is all the images in the database, a + c is the relevant images, and a + b is the retrieved images.

4.2. The F-measure

The F-measure (Hanza, 2003) combines precision and recall into one single measure. Let D indicate all the data in the database, C = {C_1, ..., C_k} indicate the k clusters discovered by the clustering algorithm, and C* = {C*_1, ..., C*_l} indicate the target clusters from the same data, where l is the number of target clusters. The F-measure is defined as:

F = Σ_{i=1}^{l} (|C*_i| / |D|) · max_{j=1,...,k} F_{i,j},

F_{i,j} = 2 / (1/prec(i, j) + 1/rec(i, j)),

where prec(i, j) denotes |C_j ∩ C*_i| / |C_j| and rec(i, j) denotes |C_j ∩ C*_i| / |C*_i|. A larger F-measure indicates that the cluster structure produced by the clustering algorithm is more similar to the target cluster structure.

4.3. The CS measure

Suppose a clustering algorithm generates the clustering structure X = {x_j; j = 1, 2, ..., N}, where N is the number of data points grouped into the resulting clusters. The CS measure (Chou, Su, & Lai, 2003) is defined as:

CS(c) = [ (1/c) Σ_{i=1}^{c} (1/|A_i|) Σ_{x_j ∈ A_i} max_{x_k ∈ A_i} d(x_j, x_k) ] / [ (1/c) Σ_{i=1}^{c} min_{j ≠ i} d(v_i, v_j) ]
      = [ Σ_{i=1}^{c} (1/|A_i|) Σ_{x_j ∈ A_i} max_{x_k ∈ A_i} d(x_j, x_k) ] / [ Σ_{i=1}^{c} min_{j ≠ i} d(v_i, v_j) ],

where v_i = (1/|A_i|) Σ_{x_j ∈ A_i} x_j, c is the number of clusters, A_i is the data grouped into the ith cluster, |A_i| is the cardinality of the ith cluster, and d is the distance function. A smaller CS measure indicates a better cluster structure produced by the clustering algorithm.

Fig. 4. Images from the leopard category.

Fig. 5. Images from the bird category.
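The CS measure of Section 4.3 admits a similarly compact sketch (NumPy; Euclidean distance stands in for the distance function d, and `cs_measure` is our illustrative name):

```python
import numpy as np

def cs_measure(points, labels):
    """CS measure (Chou, Su, & Lai, 2003): mean intra-cluster diameter
    term divided by the summed nearest-centroid separations; the 1/c
    factors in the definition cancel. Smaller is better."""
    ids = sorted(set(labels))
    centroids = np.array([points[labels == i].mean(axis=0) for i in ids])
    # numerator: for each cluster, average over its points of the
    # distance to the farthest point in the same cluster
    num = 0.0
    for i in ids:
        A = points[labels == i]
        d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
        num += d.max(axis=1).mean()
    # denominator: each centroid's distance to its nearest other centroid
    dc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dc, np.inf)
    den = dc.min(axis=1).sum()
    return num / den

pts = np.array([[0.0, 0.0], [0.2, 0.0], [8.0, 0.0], [8.2, 0.0]])
good = cs_measure(pts, np.array([0, 0, 1, 1]))   # tight, well separated
bad = cs_measure(pts, np.array([0, 1, 0, 1]))    # clusters mixed together
```

With the correct grouping the intra-cluster diameters are small and the centroids far apart, so `good` is much smaller than `bad`, consistent with "smaller CS indicates a better cluster structure."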



Table 2
Initial image features set.

Table 5
Details of the 15 feature selection experiments.

5. Experiments

5.1. Feature selection

In this section we conduct experiments that combine the HTGA and the Hubert's Γ statistic to select a better image feature set. We use the Corel image dataset and select images from two categories, leopards and birds, for these experiments. We randomly take five images from each category, as shown in Figs. 4 and 5, for the feature selection experiments. We conduct a total of 15 experiments. We employ the Blobworld method to segment the images. The total number of initial image features is 78. The detailed numbers of image features in each category are shown in Table 2.

We then employ the HTGA to automatically select features, with the Hubert's Γ statistic used to evaluate the fitness. The fitness function is defined as Γ(sample) minus the 95% critical value; larger values are better.

Table 3
Feature selection results.

Table 4
Feature selection results (sorted).

In Table 3, the indices are highlighted according to the color, texture, and region feature categories of Table 2. Table 3 shows the number of experiments, out of the total of 15, in which each feature was selected. Table 4 is Table 3 sorted by this count. Table 5 shows the final fitness values and the number of features selected for each experiment.
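For reference, the evolutionary loop behind these selection experiments can be sketched as a plain GA over 78-bit feature masks. This is a simplified illustration only: the Taguchi test of step 5 in Section 2 is omitted, and a synthetic fitness (agreement with a hidden target mask) stands in for the Γ-based fitness, since the image data are not reproduced here.

```python
import random

N_FEATURES = 78
random.seed(1)
IDEAL = [random.randint(0, 1) for _ in range(N_FEATURES)]  # stand-in target

def fitness(mask):
    # stand-in for "Gamma(sample) minus the 95% critical value": here,
    # the number of positions agreeing with the hidden ideal mask
    return sum(m == t for m, t in zip(mask, IDEAL))

def roulette(pop, fits):
    """Roulette wheel selection (step 3)."""
    r = random.uniform(0, sum(fits))
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def evolve(pop_size=40, generations=60, pc=0.8, pm=0.02):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        children = []
        while len(children) < pop_size:
            a, b = roulette(pop, fits), roulette(pop, fits)
            if random.random() < pc:            # one-cut-point crossover
                cut = random.randrange(1, N_FEATURES)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [[1 - g if random.random() < pm else g for g in ind]
                         for ind in (a, b)]     # bit-flip mutation
        # elitist replacement: keep the best pop_size of parents + children
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

best = evolve()
```

Each chromosome is a 0/1 mask over the 78 candidate features, so the best individual directly encodes a selected feature subset.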

From Table 5, we observe that the first experiment produces the largest fitness value and that the average number of features selected is 32.

5.2. Indexing accuracy

After selecting the features, the next experiment evaluates the indexing accuracy of the selected features. In this experiment, we continue to use the same two categories of images (leopards and birds), with 100 images from each category, so the total number of testing images is 200.

Table 6
Comparison of indexing accuracy.

We employ the k-means algorithm for clustering. We compare the results of using: (1) all 78 features, (2) the 27 features selected by experiment 1 in Table 5, and (3) the 32 (the average number of features selected) highest-ranking features in Fig. 5. The experiment results are shown in Table 6.

From Table 6, we conclude that the features selected by the proposed HTGA method produce higher indexing accuracy than using all features without any selection process. The experiment results suggest that the proposed method can produce a better image feature set and thus higher retrieval accuracy.

6. Conclusion

This work presents a feature selection scheme based on the GA for Region-Based Image Retrieval systems. We show that the feature set selected by the proposed feature selection scheme produces higher retrieval accuracy than the one without feature selection. We also show that the Hubert's Γ statistic can be used as the fitness measure in the evolution of the HTGA. The experiment results also suggest that the proposed method can select a smaller image feature set and produce higher retrieval accuracy.

Acknowledgements

This work was partially supported by the National Science Council, Taiwan, under Grants No. NSC99-2221-E-011-124, NSC98-2631-H-211-001, and NSC99-2631-H-211-001.

References

Bel Mufti, G., Bertrand, P., & El Moubarki, L. (2005). Determining the number of groups from cluster stability. Proceedings of ASMDA, 404–414.
Carson, C., Belongie, S., Greenspan, H., & Malik, J. (2002). Blobworld: Image segmentation using expectation–maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8), 1026–1038.
Chou, C. H., Su, M. C., & Lai, E. (2003). A new cluster validity measure for clusters with different densities. In 2003 IASTED international conference on intelligent systems and control, Salzburg (pp. 276–281).
Dubes, R. C. (1993). Clustering analysis and related issues. Handbook of pattern recognition and computer vision (2nd ed.). World Scientific, pp. 3–32.
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2002). Cluster validity methods: Part I. SIGMOD Record, 31(2).
Hanza, M. H. (2003). On cluster validity and the information need of users. In The 3rd IASTED international conference on artificial intelligence and applications (AIA03), Spain (pp. 216–221).
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.
Li, F., Dai, Q., Xu, W., & Er, G. (2008). Multilabel neighborhood propagation for region-based image retrieval. IEEE Transactions on Multimedia, 10(8), 1592–1604.
Liu, M., Jiang, X., & Kot, A. C. (2009). A multi-prototype clustering algorithm. Pattern Recognition, 42(5), 689–698.
Ma, W. Y., & Manjunath, B. (1997). Netra: A toolbox for navigating large image databases. In Proceedings of the IEEE international conference on image processing (pp. 568–571).
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2002). Numerical recipes in C++. Cambridge.
Santos, J. M., Marques de Sa, J., & Alexandre, L. A. (2008). LEGClust—A clustering algorithm based on layered entropic subgraphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 62–75.
Tsai, J. T., Liu, T. K., & Chou, J. H. (2004). Hybrid Taguchi-genetic algorithm for global numerical optimization. IEEE Transactions on Evolutionary Computation, 8(4), 365–377.
Wang, J. Z. (2001). Integrated region-based image retrieval. Kluwer Academic Publishers.
Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 947–963.
