
Feature Extraction, Feature Selection and Machine

Learning for Image Classification: A Case Study


Mădălina Cosmina Popescu 1, Lucian Mircea Sasu 2,3
1 Faculty of Mathematics and Computer Science, Transilvania University of Braşov, Romania, cosmina.popescu@xu.unitbv.ro
2 Mathematics and Computer Science Department, Transilvania University of Braşov, Romania, lmsasu@unitbv.ro
3 Siemens Corporate Technology RTC, Braşov, Romania

Abstract—This paper presents feature extraction, feature selection and machine learning–based classification techniques for pollen recognition from images. The number of images is small compared both to the number of derived quantitative features and to the number of classes. The main subject is an investigation of the effectiveness of 11 feature extraction/feature selection algorithms and of 12 machine learning–based classifiers. It is found that some of the feature extraction/selection algorithms and some of the classifiers exhibit consistent behavior on this dataset.

I. INTRODUCTION

In palynology, a specific task is recognizing plants from their pollen grains. Although domain expertise, as provided by highly qualified palynologists, is for now the standard approach, there is increasing interest in applying automatic learning approaches to plant recognition.

There are at least two approaches that can be used for automatic image classification. The first is the classical machine learning workflow for image classification: a) use some chain of image-processing techniques to extract quantitative features from each image; b) apply automatic feature extraction/selection techniques; c) apply classifiers to the resulting data subsets. The mix between the considered numerical features and classifiers dictates the success of the recognition task.

During feature extraction, a human expert might decide to follow or to avoid some of the processing techniques. For classification models, one can find scarce prior information on which of them might be successful based solely on the task's particularities. We recall Tom Mitchell's statements that any useful machine learning algorithm must have its own inductive bias and that "progress towards understanding learning mechanisms depends upon understanding the sources of, and justification for, various biases", beyond the particularities of the training sets [1]. Despite many theoretical results and experimental analyses of machine learning models, machine learning largely remains an experimental science.

The second approach to automatic image classification is more recent, and allows the machine to extract features on its own, without human intervention. We refer here to the so-called "deep learning" branch of machine learning [2]–[5]. Computer vision is the most popular testbed for deep learning architectures: the first success story is the use of convolutional neural networks for optical character recognition with minimal preprocessing [6]; Krizhevsky et al. [7] used a neural network to classify 1.2 million high-resolution images, reaching a 37.5% top-1 error rate, much better than previous results; in [8], Kavukcuoglu et al. proposed an unsupervised method for learning multi-stage hierarchies of sparse convolutional features, improving performance on some visual recognition and detection tasks. A machine learning classification system that outperforms humans on traffic sign recognition is described by Ciresan et al. in [9], and on handwritten digit recognition in [10].

The present paper follows the classical machine learning workflow. Related work was performed by Langford, Taylor and Flenley [11], who obtained 94.3% correct classification on a 6-class pollen database with 192 samples. The images were obtained through a scanning electron microscope, an expensive and specialized solution. For a 3-class dataset – also acquired by scanning electron microscope – an artificial neural network achieved 100% accuracy [12]. Acquiring the images through electron scanning is a rather exclusive method, and later researchers considered regular images, i.e. images captured by normal light-sensitive cameras through an optical microscope.

In [13], 90% classification accuracy was obtained for a 3-class problem. Zhang et al. [14] reported 91% classification success using wavelet decomposition and reconstruction in combination with co-occurrence matrices, followed by learning vector quantization and backpropagation networks. Dell'Anna et al. [15] used a k-nearest neighbors classifier starting from data provided by Fourier transform infrared microspectroscopy; the discrimination between the 11 taxonomic groups was accomplished with 84% accuracy.

For some of these reports we remark the small number of classes and the explicit mention of small datasets, together with large percentages of correct classification. There seems to be an inverse correlation between the number of classes and the classification performance: for example, Boucher and Thonnat [16] obtain 77% and 97% correct classification for 30 and 4 classes, respectively.

In our study we started from a public pollen image dataset¹.

¹The set of images can be downloaded from http://www.geo.arizona.edu/palynology/sem/nucastl.html.

The challenge of this problem is given by the small number of available images – between 4 and 7 for each of the 38

978-1-4799-5183-3/14/$31.00 © 2014 IEEE


species considered, with an average of about 4.76 images per species. The scarcity of the data makes it prone to overfitting during classification model development. Additionally, the number of numerical attributes (48) derived through image processing techniques is large compared to the number of available records. Considerable effort was put into feature extraction and selection, and multiple classification models were considered. An assessment process based on stratified k-fold cross validation, data shuffling and t-tests was finally used to rank the classifiers. It turns out that some feature extraction/selection algorithms and some of the classifiers exhibit consistent behavior, and this finding is a key contribution of the paper. Besides this, the reported classification scores could serve as a benchmark for further attempts on this dataset.

The outline of the paper is as follows: Section II contains a description of the dataset used in our study and the extraction of quantitative features from its images. Section III describes the feature extraction and feature selection preprocessing methods, the classification models and the performance assessment procedure. Section IV contains the experimental results and comments on the feature extraction/selection effectiveness and the classification performance of the machine learning models considered. Section V concludes the paper.

II. DATASET DESCRIPTION AND QUANTITATIVE FEATURE EXTRACTION

A. Dataset Description

In our study, we used a publicly available dataset from the Pollen Laboratory of The University of Newcastle, Australia. From this dataset, we selected only those species for which at least 4 images were available at the time we started this research. The reason for this selection is that, in classification, we cannot obtain relevant results using 1 to 3 instances per class. We used 38 species of plants, depicted in a total of 181 images. There are 19, 10, 8, and 1 species depicted by 4, 5, 6, and 7 images, respectively. This gives an average of 4.76 images per species.

In order to derive image classifiers, we extracted 48 numerical attributes from each pollen grain image. This was accomplished in two steps: image enhancement and numerical attribute extraction.

B. Image Enhancement for Foreground Extraction

The computation of attributes, as presented in the following section, requires first computing the pollen grain mask, i.e. an image where the foreground is represented by white pixels and the background by black pixels. The original image is then masked, in order to separate the foreground from the background in the pollen grain image. The preprocessing steps are described below.

First, the image is stretched, to reduce autocorrelation. The image is split into its R, G, B channels, which are stretched separately; the channels are then merged back together. The resulting image is transformed from the RGB into the HSV color space, in order to use the saturation channel; this step helps us separate foreground from background, since in the pollen grain image the saturation of the foreground is higher than that of the background.

Further, the contrast is increased by equalizing the histogram of the saturation channel and applying gamma correction with γ = 3 [17]. We chose this value for gamma firstly because gamma correction with γ > 1 increases the relative intensity of high-intensity pixels and decreases that of low-intensity ones. The exact value was chosen experimentally, to fit our set of images. Next, an adaptive threshold is applied, to obtain a binary image. The threshold is computed for each image using data from the image histogram [17]. Our set of images required applying a Gaussian filter before thresholding, to make this operation less sensitive to noise.

At this point, the binary image we obtained contains black holes. To eliminate them, we applied the following sequence of morphological operations: dilation, hole filling, erosion, and extraction of the central connected component [17].

C. Numerical Attribute Extraction

This step starts from the processed images obtained in Section II-B and produces numerical descriptors, which can be further used as input values in machine learning based classifiers.

We use the attributes presented in [18]: area of the pollen grain; width and height of the bounding box; x and y coordinates of the grain centroid; major axis length and minor axis length of the ellipse which has the same second moment as the pollen shape; area of the convex hull enclosing the shape; diameter of the circle with the same area as the object; solidity; perimeter of the mask image; extent; eccentricity; x and y coordinates of the weighted centroid; shape; thickness; box (x and y coordinates of both the upper-left and bottom-right corners of an inner rectangle of the bounding box); height; width; correlation, homogeneity, contrast and entropy (computed for 6 images: the blue channel of the RGB color space and the saturation and value channels of the HSV color space, each extracted from the bounding box and from the box); Fourier descriptors (the first 5 Fourier coefficients, as described below in this section); relative areas (5 values, computed from the binary image of the pollen grain, as presented in [18]); and relative objects (5 values based on the images obtained in the relative areas computation, as explained at the end of this section).

Some of the mentioned attributes are clearly explained in the cited work and we will not go into details for them. To compute the values of the remaining attributes (the ones that are only briefly described in [18]) additional research was required. Below we provide some explanation of the algorithms behind their computation.

For the computation of the major axis length and minor axis length of the ellipse with the same second moment as the pollen shape, we use the covariance matrix of the white pixels' coordinates. Calculating the eigenvectors of this matrix, we obtain the pixel distribution orientation.

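As an illustration of this step, the following numpy sketch derives the orientation and axis lengths from the covariance matrix of the foreground pixel coordinates. The function name is ours, and the 4·sqrt(eigenvalue) scaling into axis lengths follows the common regionprops convention rather than the paper's exact code:

```python
import numpy as np

def ellipse_axes(mask):
    """Axis lengths and eccentricity of the ellipse with the same
    second moments as the white (foreground) pixels of a binary mask."""
    ys, xs = np.nonzero(mask)               # coordinates of white pixels
    coords = np.stack([xs, ys])             # 2 x N coordinate matrix
    cov = np.cov(coords)                    # 2 x 2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    # The eigenvectors give the pixel distribution orientation; one common
    # convention (e.g. MATLAB regionprops) converts the second moments
    # into axis lengths via 4 * sqrt(eigenvalue).
    minor, major = 4.0 * np.sqrt(eigvals)
    eccentricity = np.sqrt(1.0 - (minor / major) ** 2)
    return major, minor, eccentricity

# A horizontal bar: the major axis should clearly exceed the minor axis.
mask = np.zeros((50, 50), dtype=np.uint8)
mask[20:30, 5:45] = 1
major, minor, ecc = ellipse_axes(mask)
```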
The minor axis length and major axis length are the eigenvalues corresponding to the above-mentioned eigenvectors. Eccentricity quantifies how much the shape deviates from being circular.

Height is the length of the largest line enclosed in the pollen. We computed it as the maximum length of a line that passes through 2 points on the contour and the centroid. Width is the length of the largest line enclosed in the pollen and perpendicular to Height.

By using Fourier descriptors, we translated the points of the shape contour from the spatial domain to the frequency domain and extracted only the terms corresponding to low frequencies. Unlike the terms corresponding to high frequencies, they give information about the shape while ignoring details such as small oscillations on the contour. This aspect is of real importance for generalization.

We had to represent the contour by applying the discrete Fourier transform to the complex numbers given by the pixel coordinates. From the resulting coefficients we used only the first 5 values, as suggested in [18]. In our feature vector, we use the magnitudes of the 5 selected descriptors.

Relative objects are represented by the numbers of connected components in inverted and masked versions of the binarized images computed for the relative areas. To obtain the connected components from the binarized image we used the two-pass connected components labeling algorithm [19].

As a result of these steps, we obtain 48 numerical descriptors for each image; we call this dataset "pollen".

Fig. 1. Image enhancement steps for foreground extraction, applied to a Calytrix tetragona pollen grain (in common language, fringe-myrtle): (a) original image; (b) stretched; (c) saturation channel; (d) histogram equalization; (e) gamma correction; (f) Gaussian filter; (g) adaptive threshold; (h) dilation; (i) hole filling; (j) erosion; (k) central connected component; (l) foreground separation.

III. EXPERIMENTAL SETUP

A. Feature extraction and selection

One legitimate question is whether all these 48 features are suitable for the classification task, or whether a lower cardinality set of features would improve the classification. The existing literature describes an impressive number of approaches devoted to feature selection and feature extraction – see for example [20]–[25]. The implementations used in this paper are provided by Weka [26].

One popular feature extraction technique is based on principal component analysis; it is the only feature extraction method we consider here. For our dataset we chose enough eigenvectors to account for 95% of the variance in the original data, which produced 17 derived features; we hence get the "pca" dataset.

Another candidate preprocessing step is lowering the number of linearly correlated features. It is known that highly correlated features raise an unnecessary computational burden and worsen the classification accuracy for most linear and for a few nonlinear techniques [25]. The resulting dataset is denoted "uncorr" and contains 34 features whose pairwise absolute correlation is at most 95%.

The next nine feature selection methods share the following strategy: the value of a subset of attributes is evaluated by considering the predictive ability of each feature along with the degree of redundancy between them.

The considered heuristics are: best-first search, which either starts with an empty set of candidate features which is successively grown – the so-called forward search – or with the set of all available features from which it performs eliminations – the backward search; the resulting datasets are called "bff" and "bfb", respectively. The third heuristic is based on evolutionary algorithms; the resulting dataset is named "ea". The fourth strategy performs the search using a genetic algorithm [27] and produces the dataset "ga". The fifth strategy performs a greedy search in the space of attributes, augmenting the initially empty set of features until the performance starts to decrease; the corresponding dataset is named "gs". Linear forward selection is an extension of the best-first approach [28]; we denote the corresponding dataset "lfs". The seventh feature selection strategy employs particle swarm optimization [29] as an exploration method and produces the dataset "pso". The eighth selection strategy – reranking search – is an information-

theoretic meta-search algorithm proposed in [30]; the resulting dataset is denoted "rrs". The last heuristic is tabu search [31], which produces the dataset "ts".

After reducing the feature count according to the procedures described above, the number of input features ranged between 15 and 34. We also consider the set of all 48 features, to check whether the feature selection/extraction procedures impact the classification accuracy. In a nutshell, we have 12 datasets which are used in turn for the classification step².

²These arff datasets are available for download from http://cs.unitbv.ro/~lmsasu/data/pollen.arff.

B. Classification models

The machine learning literature contains a large number of classification approaches. Unfortunately, it is also scarce when looking for advice on how to properly choose among them. Although there are some attempts to formulate strategies from experimental results (e.g. [32]), machine learning research is nowadays tributary to a trial-and-error approach.

The following 12 classification methods were investigated in turn: naive Bayes (NB), multinomial logistic regression (MLR), multilayer perceptron (MLP), radial basis function network (RBF), k-nearest neighbors (KNN), the rule-based classifier PART [33], functional trees (FT) [34], [35], best-first decision tree (BFT) [36], Hoeffding tree (HT) [37], logistic model trees (LMT) [35], C4.5 [38], and random forest (RF) [39].

Some of the hyperparameters of the models were tuned based on trial and error. For example, in the case of KNN, we used a leave-one-out strategy to decide which value k ∈ {1, ..., 20} gives the best result; for the multilayer perceptron, through cross validation we searched for the number of hidden neurons and the learning rate, while keeping a constant momentum coefficient.

C. Performance assessment

The performance assessments were accomplished using the Weka Experiment Environment [26]. Each of the 12 datasets was randomly shuffled 10 times. For each shuffle and each of the 12 classification models, the percent of correct classification (PCC) was averaged using 4-fold stratified cross validation. The number of folds is dictated by the class distribution: in the considered dataset, there are 19 species for which only 4 pollen grain images were available. The differences between the results were quantified using statistical methods on the 12 datasets · 10 shuffles · 4 folds · 12 classifiers = 5760 measurements.

IV. EXPERIMENTAL RESULTS

A. Feature extraction/selection influence

We start by discussing the influence of the feature extraction/selection strategies on the classification results. One remarks a consistent behavior across the various classifiers for some of these strategies. In 8 cases out of 12, greedy search leads to the worst classification results. The second and the third lowest PCCs are produced by best-first with backward search and by principal component analysis (8 cases each), respectively. A noteworthy remark is that the three mentioned strategies produce all of the lowest three PCC results for all classifiers. For the numerical features we obtained from image processing, this clearly signals that greedy search, best-first with backward search and principal component analysis must be avoided as feature extraction/selection strategies.

For the best influence on the classification step, the results are not so well grouped: the largest PCC values were obtained by evolutionary algorithms (5 cases out of 12), and by reranking search, linear forward selection, and tabu search (each with 3 cases out of 12). Note that there are some ties: for the multilayer perceptron, best-first with forward search ties with linear forward selection and tabu search; for functional trees, the best score is shared by best-first with forward search, evolutionary algorithms, linear forward selection and tabu search.

The second largest PCCs are obtained for evolutionary and genetic algorithms (4 cases each), and for best-first with forward search, linear forward selection, and tabu search (2 cases each). There are ties between best-first with forward search, linear forward selection and tabu search for Hoeffding tree and logistic model trees.

The third largest PCC is obtained by reranking search (5 cases), and by best-first with forward search, genetic algorithm, linear forward selection, and tabu search (3 cases each). There are ties between best-first with forward search, linear forward selection and tabu search for multilayer perceptron, k-nearest neighbors and random forest.

In short, when we refer to the highest 3 PCCs (ties included), evolutionary algorithms and reranking search are encountered 9 times, best-first with forward search and linear forward selection appear 8 times, and genetic algorithm appears 7 times.

As a conclusion on the effect of feature extraction/selection, the worst behavior is given exclusively by greedy search, best-first with backward search and principal component analysis; the highest PCCs are found for evolutionary and genetic algorithms, reranking search, best-first with forward search, and linear forward selection.

B. Classifiers' performances

For classifier comparisons we took naive Bayes as a baseline model, as its PCC is close to the median PCC over all datasets. Starting from the 5760 measurements, we used the corrected resampled t-test [40] as implemented in the Weka Experiment Environment. Table I marks the relation between naive Bayes and the other classifiers as follows: a "(v)" annotation shows that a result is statistically better than the baseline model; "(*)" marks results that are statistically worse than the baseline model; all unannotated results are statistically the same as the one obtained by NB. The two summary rows contain x/y/z triples showing how many times each classifier was (x) statistically better than, (y) the same as, and (z) worse than the baseline model.
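The corrected resampled t-test of Nadeau and Bengio [40] adjusts the ordinary paired t statistic for the dependence between training sets across resampling runs. A minimal numpy sketch of the statistic (the function name and the illustrative per-run differences are ours; the experiments themselves rely on Weka's implementation):

```python
import numpy as np

def corrected_resampled_t(diffs, test_train_ratio):
    """Corrected resampled t statistic for paired per-run score
    differences between two classifiers.

    diffs            -- one PCC difference per (shuffle, fold) run
    test_train_ratio -- n_test / n_train; 1/3 for 4-fold cross validation
    """
    k = len(diffs)                  # e.g. 10 shuffles x 4 folds = 40 runs
    mean = diffs.mean()
    var = diffs.var(ddof=1)
    # The (1/k + n_test/n_train) factor replaces the ordinary 1/k,
    # compensating for the overlap between resampled training sets.
    return mean / np.sqrt((1.0 / k + test_train_ratio) * var)

rng = np.random.default_rng(0)
# Hypothetical per-run PCC differences (some classifier minus NB):
diffs = rng.normal(loc=5.0, scale=3.0, size=40)
t_corr = corrected_resampled_t(diffs, test_train_ratio=1.0 / 3.0)
t_naive = diffs.mean() / np.sqrt(diffs.var(ddof=1) / len(diffs))
```

The correction always shrinks the statistic relative to the naive paired t-test, making significance claims more conservative.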

971
TABLE I
EXPERIMENTAL RESULTS OF PCC FOR THE 12 DATASETS AND 12 CLASSIFIERS, USING STRATIFIED 4-FOLD CROSS VALIDATION AND 10 RANDOM SHUFFLES. (v): STATISTICALLY BETTER THAN NB; (*): STATISTICALLY WORSE THAN NB.

Dataset    NB      MLR         MLP         RBF         KNN         PART
pollen     63.99   66.73       77.35 (v)   74.98 (v)   76.09 (v)   44.91 (*)
pca        52.42   54.44       65.83 (v)   56.78       58.00       39.28 (*)
uncorr     64.71   68.34       77.46 (v)   74.43 (v)   73.83 (v)   43.76 (*)
bff        67.25   77.91 (v)   80.94 (v)   77.03 (v)   82.66 (v)   42.28 (*)
bfb        48.12   53.43       56.25       58.23 (v)   62.76 (v)   39.46
ea         68.30   76.29 (v)   76.84 (v)   78.74 (v)   83.21 (v)   46.07 (*)
ga         68.23   78.83 (v)   80.78 (v)   77.46 (v)   81.16 (v)   46.53 (*)
gs         46.46   50.34       54.53       54.96 (v)   60.62 (v)   40.12
lfs        67.25   77.91 (v)   80.94 (v)   77.03 (v)   82.66 (v)   42.28 (*)
pso        64.93   73.64 (v)   74.37 (v)   79.12 (v)   79.73 (v)   44.81 (*)
rrs        67.26   79.77 (v)   80.27 (v)   76.86 (v)   83.43 (v)   45.53 (*)
ts         67.25   77.91 (v)   80.94 (v)   77.03 (v)   82.66 (v)   42.28 (*)
Summary    –       7/5/0       10/2/0      11/1/0      11/1/0      0/2/10

Dataset    FT          BFT         HT       LMT         C4.5        RF
pollen     75.26 (v)   45.81 (*)   59.85    74.64       49.84 (*)   62.14
pca        63.41 (v)   40.51       52.53    61.37       40.50 (*)   50.77
uncorr     76.51 (v)   47.68 (*)   59.45    76.45 (v)   50.23 (*)   63.58
bff        82.05 (v)   46.56 (*)   66.42    81.50 (v)   47.15 (*)   64.08
bfb        56.14       39.50       46.19    53.59       44.26       48.35
ea         82.05 (v)   49.65 (*)   67.08    81.88 (v)   48.90 (*)   64.36
ga         81.43 (v)   49.61 (*)   64.87    80.77 (v)   50.61 (*)   63.75
gs         54.87       39.50       45.58    53.98       43.86       46.52
lfs        82.05 (v)   46.56 (*)   66.42    81.50 (v)   47.15 (*)   64.08
pso        78.56 (v)   47.71 (*)   62.00    78.94 (v)   47.75 (*)   62.16
rrs        81.83 (v)   44.25 (*)   66.41    80.89 (v)   48.48 (*)   65.70
ts         82.05 (v)   46.56 (*)   66.42    81.50 (v)   47.15 (*)   64.08
Summary    10/2/0      0/3/9       0/12/0   8/4/0       0/2/10      0/12/0
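The evaluation protocol behind Table I, stratified 4-fold cross validation repeated over random shuffles, can be sketched in pure numpy. The `stratified_folds` helper below is ours (the experiments used Weka), and the labels are an idealized version of the dataset in which every species has exactly 4 images:

```python
import numpy as np

def stratified_folds(labels, n_folds=4, seed=0):
    """Assign each sample to a fold so that every class is spread as
    evenly as possible across the folds (stratification)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)                      # the random shuffle step
        # Deal this class's samples round-robin across the folds.
        fold_of[idx] = np.arange(len(idx)) % n_folds
    return fold_of

# 19 species have only 4 images each, hence at most 4 folds make sense.
# Idealized labels: 38 classes with 4 images per class.
labels = np.repeat(np.arange(38), 4)
folds = stratified_folds(labels, n_folds=4)
```

With 4 images per class and 4 folds, each fold holds exactly one image of every species, which is why the class distribution forces the fold count.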

The performance, as quantified by the x/y/z triples, naturally clusters into four groups. The top performance was given by RBF and KNN (each having (x/y/z) = (11/1/0)), and by MLP and FT (both with (x/y/z) = (10/2/0)). RBF and KNN – the two with the highest x values – are both local models. The only dataset for which RBF and KNN did not provide statistically better PCCs than NB is the one obtained through principal component analysis. MLP and FT underperformed on the "pca" and "gs" datasets. It is interesting to note that these four classifiers were the only ones for which the original dataset "pollen" provided statistically better PCCs than NB.

The next performance group contains LMT and MLR, providing statistically better (respectively, the same) behavior as NB in 8 (4) and 7 (5) cases. Hence, one can consider them still better than the baseline classifier, although by a lower margin than the previous cluster. There is an almost perfect overlap of the datasets for which the LMT and MLR classifiers outperform NB. For the original "pollen" dataset, the LMT, MLR and NB models lead to statistically identical scores.

HT and RF were statistically equivalent to naive Bayes on all datasets for this problem. They give the third cluster of performance.

Finally, the weakest performance was obtained by PART, BFT and C4.5; on at least 9 of the data subsets their performance was statistically inferior to NB, and they never performed better than it.

A remark on the efficiency of the original "pollen" dataset: as seen in Table I, in 4 cases it produces statistically higher PCCs than the baseline model; this happens for the 4 classifiers which compose the highest performance cluster, so the results might be due to the robustness of these models. This performance is counterbalanced by 3 cases in which "pollen" produces statistically lower PCCs than NB (for PART, C4.5 and BFT). The latter result might be explained by the overall poor performance of these classifiers.

The PCC scores might appear modest at first sight: only a few of the results are beyond 80% PCC. However, the best results mentioned in the introduction were obtained for quite small numbers of classes, while in our case the classifiers deal with 38 species (classes). The scarcity of the dataset – an average of 4.76 images/species – is another potential cause of these modest PCC values.

V. CONCLUSIONS

This paper started from a commonly encountered task in palynology: the classification of plants using pollen grain images. The aim is to automate this task, from image processing to machine learning based classification. Various candidate feature extraction/selection algorithms were discussed, in order to reduce the predictive variables from the 48 extracted through image processing techniques. The resulting datasets were fed into 12 classifiers.

Another remark is that it is worth investigating the potential benefits of feature selection. The original 48 numeric attributes allowed statistically better performance in 4 cases, but also performed worse in 3 cases. Some feature selection techniques – evolutionary and genetic algorithms, reranking search, best-first with forward search and linear forward selection – repeatedly allowed the classifiers to outperform the baseline model. Again, it is hard to extrapolate from a single case study, but it is reasonable to expect better results when one filters out some of the predictive variables.

Finally, we remarked that the two classifiers with the largest number of good results rely on local models, building predictions based on data that cluster together. This behavior deserves further investigation, as the literature offers plenty of such cluster-based associators. It might also be worth comparing these results with the ones produced by deep learning architectures: they have the potential to automatically build a set of features on top of which a classifier might lead to better performance. To the best of our knowledge, such a study for small image datasets has not yet been performed.

REFERENCES

[1] T. M. Mitchell, "The need for biases in learning generalizations," Rutgers University, New Brunswick, NJ, Tech. Rep., 1980.
[2] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009; also published as a book, Now Publishers, 2009.
[3] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[4] I. Arel, D. C. Rose, and T. P. Karnowski, "Research frontier: Deep machine learning – a new frontier in artificial intelligence research," IEEE Comp. Intell. Mag., vol. 5, no. 4, pp. 13–18, Nov. 2010.
[5] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, ser. Studies in Computational Intelligence, vol. 385. Springer, 2012.
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Intelligent Signal Processing. IEEE Press, 2001, pp. 306–351.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 1106–1114.
[8] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning convolutional feature hierarchies for visual recognition," in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp. 1090–1098.
[9] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Convolutional neural network committees for handwritten character classification," in ICDAR, 2011, pp. 1135–1139.
[10] ——, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.
[11] M. Langford, G. Taylor, and J. Flenley, "Computerized identification of pollen grains by texture analysis," Review of Palaeobotany and Palynology, vol. 1, no. 4, pp. 197–203, Nov. 2010.
[12] W. Treloar, Digital Image Processing Techniques and Their Application to the Automation of Palynology, Ph.D. thesis, University of Hull, 1992.
[13] M. Rodríguez-Damián, E. Cernadas, A. Formella, and P. de Sá-Otero, "Pollen classification using brightness-based and shape-based descriptors," in ICPR (2), 2004, pp. 212–215.
[14] Y. Zhang, R. Wang, and P. Hunter, "Airborne pollen texture discrimination using wavelet transforms in combination with cooccurrence matrices," Int. J. of Intelligent Systems Technologies and Applications, vol. 1, no. 1/2, pp. 143–156, 2005.
[15] R. Dell'Anna, P. Lazzeri, M. Frisanco, F. Monti, C. F. Malvezzi, E. Gottardini, and M. Bersani, "Pollen discrimination and classification by Fourier transform infrared (FT-IR) microspectroscopy and machine learning," Anal Bioanal Chem, vol. 394, no. 5, pp. 1443–1452, 2009.
[16] A. Boucher and M. Thonnat, "Object recognition from 3D blurred images," in ICPR (1), 2002, pp. 800–803.
[17] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed. Pearson Education International, 2008.
[18] M. del Pozo-Baños, J. R. Ticay-Rivas, J. Cabrera-Falcón, J. Arroyo, C. M. Travieso-González, L. Sánchez-Chavez, S. T. Pérez, J. B. Alonso, and M. Ramírez-Bogantes, "Image processing for pollen classification," InTech, 2012.
[19] F. S. Perkins, A. Walker, and E. Wolfart, Hypermedia Image Processing Reference. Wiley, 1997.
[20] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[21] I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh, Feature Extraction: Foundations and Applications, ser. Studies in Fuzziness and Soft Computing. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[22] H. Liu and H. Motoda, Computational Methods of Feature Selection. Chapman & Hall/CRC, 2007.
[23] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, Sep. 2007.
[24] J. Fan and J. Lv, "A selective overview of variable selection in high dimensional feature space," Statistica Sinica, vol. 20, no. 1, pp. 101–148, 2010.
[25] T. J. Hastie, R. J. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ser. Springer Series in Statistics. New York: Springer, 2009.
[26] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[27] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[28] M. Gütlein, E. Frank, M. Hall, and A. Karwath, "Large-scale attribute selection using wrappers," in Proc. IEEE Symposium on Computational Intelligence and Data Mining, 2009, pp. 332–339.
[29] A. Moraglio, C. Di Chio, and R. Poli, "Geometric particle swarm optimisation," in EuroGP, ser. Lecture Notes in Computer Science, M. Ebner, M. O'Neill, A. Ekárt, L. Vanneschi, and A. Esparcia-Alcázar, Eds., vol. 4445. Springer, 2007, pp. 125–136.
[30] P. Bermejo, L. de la Ossa, J. A. Gámez, and J. M. Puerta, "Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking," Know.-Based Syst., vol. 25, no. 1, pp. 35–44, Feb. 2012.
[31] A.-R. Hedar, J. Wang, and M. Fukushima, "Tabu search for attribute reduction in rough set theory," Soft Comput., vol. 12, no. 9, pp. 909–918, 2008.
[32] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms," in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML '06. New York, NY, USA: ACM, 2006, pp. 161–168.
[33] E. Frank and I. H. Witten, "Generating accurate rule sets without global optimization," in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML '98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 144–151.
[34] J. Gama, "Functional trees," Mach. Learn., vol. 55, no. 3, pp. 219–250, Jun. 2004.
[35] N. Landwehr, M. Hall, and E. Frank, "Logistic model trees," Mach. Learn., vol. 59, no. 1–2, pp. 161–205, May 2005.
[36] H. Shi, "Best-first decision tree learning," Master's thesis, University of Waikato, Hamilton, NZ, 2007.
[37] G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '01. New York, NY, USA: ACM, 2001, pp. 97–106.
[38] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[39] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001.
[40] C. Nadeau and Y. Bengio, "Inference for the generalization error," Mach. Learn., vol. 52, no. 3, pp. 239–281, Sep. 2003.

