Clark 2017

Leaf-based Automated Species Classification Using Image Processing and Neural Networks
JONATHAN Y. CLARK1, DAVID P. A. CORNEY2 and PAUL WILKIN3

Abstract

Automated identification of plant species is highly desirable to facilitate species diversity
surveys, especially because of the relatively low numbers of qualified expert taxonomists
compared with the high number of species and specimens in the current biodiversity crisis
fuelled by global warming and climate change. Often only vegetative parts are available, and
conventional identification techniques usually require floral structures for effective
identification. Building on our earlier work, this chapter describes a system that can be used as
a practical method for classification, and subsequent identification of images of botanical
herbarium specimens. The system is composed of an almost completely automated leaf shape
Biological Shape Analysis Downloaded from www.worldscientific.com

extraction component and two artificial neural networks, specifically a multilayer perceptron
by LA TROBE UNIVERSITY on 10/16/17. For personal use only.

(MLP) and a self-organizing map (SOM). Leaves are detected in the specimen images using
deformable templates and evolutionary algorithms, and then a level set method is used to
separate the leaf outline from the background. Length and width and other shape
measurements are then automatically extracted, together with measurements of the teeth on the
leaf margin. The MLP is used to filter out poor quality data before passing the relatively
information-rich high quality data to the SOM. The SOM then performs unsupervised
classification, creating a topological map, which is then used for the identification of unknown
specimens. The system is thus able to identify plants from leaf shape alone, from images of
traditional herbarium specimens. In addition, as in our earlier work, the system is able to refer
difficult specimens to a botanist for further expert examination, in cases of uncertainty,
together with suggestions as to their identity. Thus a methodology is presented here to provide
a practical way for taxonomists to use a combination of different neural networks and image
processing as an automated tool for plant identification. A case study is provided using data
extracted from images of specimens of four species of the tree genus Tilia in the herbarium of the
Royal Botanic Gardens, Kew, UK. It was found that over half the leaves were identified
correctly to the species level by the SOM using 21 automatically extracted leaf shape
parameters, without test set leaf quality filtering. A full detailed comparison is carried out with
updated results from earlier studies using the MLP alone. After applying filters for leaf quality,
and referral, although the results for specimen identification from the SOM are good (65%),
those using the MLP were found to be much better (95%).

INTRODUCTION
Plant classification and identification are both important for those who need to
know which species they are dealing with, e.g., because some species contain
compounds of medical interest, and for those who are interested in establishing
levels of biodiversity, for instance to investigate changes in plant distributions due to
climate change. It is still common to carry out biological identification using a
printed "taxonomic key", although there is a trend towards computer-based or
computer-aided systems [1, 2]. Traditional keys are followed manually, the user
making choices from contrasting statements, usually concerning morphology,

1 Nature Inspired Computing and Engineering (NICE) Research Group, Department of Computer
Science, University of Surrey, Guildford, GU2 7XH, UK j.y.clark@surrey.ac.uk. Partly funded by
the Leverhulme Trust (grant F/00 242/H): MORPHIDAS (Morphological Herbarium Image Data
Analysis) Project.
2 Signal, 5th Floor, 32-38 Leman Street, London E1 8EW.
3 Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3AB, UK.

eventually resulting in a name. Identification performance thus depends largely on
the experience of the author of the key, and its interpretation by the user.
Artificial neural networks (ANNs) are computer-based programs that, like
humans, are able to learn from real-world examples and can therefore perform
classification and/or recognition of previously unseen data. A multilayer perceptron
(MLP) is a supervised ANN and is usually best for identification; i.e., classification
of named entities, such as species, because it is trained using data where the class of
each data record is known. This training is achieved by presenting a set of data
records (known as the training dataset) to the network, each record containing data
from a named specimen or data record of known identity. During training the
generalization ability of the network, i.e. its ability to recognize previously unseen
patterns, is periodically tested using a similar, independent validation dataset, also
containing data records of which the classes are known. This independent testing
using the validation set is necessary to enable training termination before
overtraining occurs. A completely independent test dataset containing the actual
unknown data records to be identified is finally presented to the trained network.
Information derived from this test set should never be used to optimize the network
parameters, because that would introduce a bias towards recognizing the data in that
particular test set. Instead the performance of the network using the validation set
(which is already involved in preventing overfitting) can be used to optimize
network parameters. Further information about ANNs is available in [3] and [4]. A
Self-Organizing Map, otherwise known as a Kohonen Map after its originator, is
another kind of ANN, usually used for classification in the biological sense [28, 29].
In a previous study [21], a MLP was used for supervised classification. It was trained
using data records derived from named specimens, and tested using data from
named specimens whose identity was unknown to the system. In biology, such
classification is more properly called identification. In the current study, the same
data from the same specimen images are used, but the MLP is only used to filter out
poor quality leaf image data from the training set, and to provide indication of the
quality of leaf data in the test set. Thus the quality of the training data set is
improved, although the number of data records is, understandably, reduced. This
improved data set is then used as the training set for unsupervised classification
using a Kohonen Self-Organizing Map (SOM). Although a SOM is usually used for
unsupervised classification alone, here the topological classification that it
produces is used for identification of named entities (species). This work
expands on earlier work [22, 23, 24, 10, 21]. Work by other authors includes a good
account [25] regarding the use of neural networks for biological identification and
classification.
Here, as also in the earlier study [21], a case study is presented regarding ANN-
based automated identification of four cultivated species of the genus Tilia
(Malvaceae), from leaf shape features extracted using image processing methods.
This genus comprises around 23 species of woody trees, widely distributed in north
temperate regions, many being commonly cultivated in gardens or as street trees [7,
30]. These are deciduous, with heart-shaped, pointed leaves, and are often known as
limes, lindens or basswoods, although they are not related to the citrus lime.

Fig. 1. Tilia platyphyllos herbarium specimen (PLA0136).



The images referred to are actual photographs of herbarium specimens (see Fig. 1:
Tilia platyphyllos PLA0136); i.e., actual physical dried and pressed plant specimens,
mounted on stiff white cartridge paper/card, kept in paper folders and stored in
cabinets to use as voucher specimens for identification and other studies.
To date, the only published computer-based identification studies of Tilia are
ones that separate one species of Tilia (T. cordata) from 12 other species of woody
trees (in 12 other genera) by means of a neural network using leaf image data [27],
and those of the authors [8, 10, 11, 26, 21], who used a neural network for
identification of a number of Tilia species. It is usually much simpler to distinguish
between species in different genera (or other higher level taxa, for that matter). The
real challenge is to distinguish between closely related species in the same genus, as
is the aim here. The original work by Clark [8, 10, 11] involved computer-based
identification of the 19 Tilia species cultivated in Europe. However, character states
were then recorded in the traditional way by manual observation and
measurement. In more recent work [12, 13, 21] and that presented here, information
is extracted almost totally automatically from images of complete herbarium
specimens.
A review of other plant species identification using image processing methods
is given in [5]. This includes leaf and flower shape analysis, leaf texture analysis and
vein analysis, as well as other species identification methodologies. Some work by
other authors similar to that presented here uses a MLP to distinguish between
species and varieties of the genus Banksia in the family Proteaceae [6]. Here,
software was used to automatically extract characters from leaves, such as area and
roundness, for identification purposes. However, the characters obtained were
extracted from images of single undamaged leaves as opposed to whole herbarium
specimens, making the character extraction more straightforward. In this study, leaf
images are automatically extracted from whole herbarium specimen images, and
since that is a much more difficult task, where leaves are often incomplete, folded, or
overlapping, a large number of specimens of each species are needed. Therefore, the
scope of this project is currently restricted to four species of Tilia, as these
were the only ones present in sufficient numbers in the Kew herbarium. The
work is an extension of earlier work involving similar data from the same specimens
and processed in the same way [21] using a MLP, but with the addition of a
methodology for removing poor data from the training set, and subsequent
production of a self-organizing map (SOM) for topological classification and
subsequent species identification. A detailed comparison is also made between the
effectiveness of the SOM technique and that of using the MLP alone.

MATERIALS AND METHODS


Datasets and Feature Extraction
Training of the neural network was carried out using data derived from 177
different Tilia specimens in the Herbarium of the Royal Botanic Gardens, Kew, UK.
Preliminary studies suggested that large numbers of named leaves were needed for
this kind of study, and, as in the previous study [21], it was decided to concentrate
work on the four species of Tilia that had the most leaves available in the Kew
Herbarium (Tilia cordata Mill., T. platyphyllos Scop., T. americana L. and
T. tomentosa Moench, referred to here as COR, PLA, AME and TOM respectively).
All specimens in folders labelled with these species names were initially included,
and all names were a priori assumed to be correct (although, in fact, they might not
be), in order to establish the principle that herbarium specimens alone can be used in
this kind of study without additional specialist knowledge. Full details of the leaf
extraction are given in [13], but briefly, the software finds leaves in three stages:
1) All regions in the image that might be leaves are found using a deformable
templates approach [14], optimized with a simple evolutionary algorithm [15].
The edges of the image are found using a Canny edge detector [16], and
approximate matches to leaf shapes are found using trigonometric deformations
[14] of initial templates,
2) The boundary of each candidate leaf is iteratively adjusted using a level set
method [17] until it corresponds closely to the high-contrast edges in the
image. In this way, the leaf boundaries can be found precisely, and
3) The set of possible leaves is filtered to remove non-leaf objects. Here, in the
only non-automated step of the process, a user is presented with a number of
candidate leaves and clicks on images of actual leaves (ideally up to 100)
whose outlines have been correctly determined, thereby providing information
to the system about the correct appearance of a leaf. The user does not have
to identify the species, nor carry out time-consuming tasks such as drawing
the outline.
The third stage allows flexibility in the technique, rather than being restricted to
a particular genus. Non-leaf objects are then rejected by comparing the centroid-contour
distances [18] of candidate objects with those of the chosen leaves. The
remaining objects are presumed to be leaves and are used for subsequent character
extraction. In this particular case study, this resulted in 441 data records, each
containing data from a single leaf on a named specimen.
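
The centroid-contour comparison used to reject non-leaf objects can be illustrated with a minimal sketch. This is not the authors' implementation (see [13] and [18] for that). It assumes an outline given as an ordered list of (x, y) points, approximates the centroid by the mean of the boundary points, samples the outline at evenly spaced indices, and ignores rotation and starting-point alignment. The function names and the mean-absolute-difference measure are hypothetical.

```python
import math

def centroid_contour_distances(outline, n_samples=32):
    """Return a scale-normalised centroid-contour distance signature.

    outline: list of (x, y) boundary points, in order around the shape.
    The centroid is approximated by the mean of the boundary points
    (an assumption made for this sketch).
    """
    cx = sum(x for x, _ in outline) / len(outline)
    cy = sum(y for _, y in outline) / len(outline)
    # Sample the outline at evenly spaced indices.
    step = len(outline) / n_samples
    sampled = [outline[int(k * step)] for k in range(n_samples)]
    dists = [math.hypot(x - cx, y - cy) for x, y in sampled]
    longest = max(dists)
    # Dividing by the longest distance removes the effect of leaf size.
    return [d / longest for d in dists]

def signature_difference(sig_a, sig_b):
    """Mean absolute difference between two equal-length signatures."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A candidate object whose signature differs strongly from those of the user-selected leaves would then be rejected; the cut-off value is left open here because the source does not state one.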
For use in the MLP stage of the study, data for 41 morphological characters (see
[21] for full details), comprising some conventional and some new, were extracted
automatically and algorithmically from the leaf outlines. These initially included 17
of the 22 characters used in earlier studies [12] plus 24 new characters, these mostly
pertaining to asymmetric morphometric measurements of the leaf base. Later, this
was reduced to the 21 characters used in the main SOM study (see Table 1). There
were different numbers of specimens for each species in the herbarium, due to its
historical development and the availability of specimens. Also, the software resulted
in different numbers of leaves (sometimes up to 3, sometimes none) being extracted
from each specimen image. Since training ANNs with very unbalanced class
numbers is difficult [20], specimens from the over-represented classes were
randomly discarded until all classes had approximately the same number of
examples, that is, 119 to 126. The exception was Tilia americana (AME), where data
from only 77 leaves were available. The individual leaf data records were then
randomly divided into independent training, validation and test sets in a ratio of
approximately 70:20:10 respectively. Care was taken to ensure that leaf data from
the same specimen was not present in more than one set, so that the test specimens
were not present, even in part, in the training or validation datasets.
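
A specimen-aware split of this kind could be sketched as follows. This is an illustrative assumption rather than the procedure actually used: it shuffles specimens (not leaves) and assigns whole specimens to each set, so the 70:20:10 ratio holds only approximately at the leaf level. The record layout and the function name `split_by_specimen` are hypothetical.

```python
import random

def split_by_specimen(records, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split leaf records into train/validation/test sets by specimen.

    records: list of (specimen_id, features) tuples.
    All leaves from one specimen end up in the same set.
    """
    specimens = sorted({spec for spec, _ in records})
    rng = random.Random(seed)
    rng.shuffle(specimens)
    n_train = round(ratios[0] * len(specimens))
    n_val = round(ratios[1] * len(specimens))
    assignment = {}
    for idx, spec in enumerate(specimens):
        if idx < n_train:
            assignment[spec] = "train"
        elif idx < n_train + n_val:
            assignment[spec] = "validation"
        else:
            assignment[spec] = "test"
    sets = {"train": [], "validation": [], "test": []}
    for spec, features in records:
        sets[assignment[spec]].append((spec, features))
    return sets
```

Splitting on specimen identifiers is what guarantees that no test specimen contributes leaves to the training or validation sets.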

Table 1. List of characters (attributes) extracted.


No.  Character                        Description
01   Length of leaf blade             From petiole (leaf stalk) insertion point to tip
02   Width of leaf blade              Width at widest point, perpendicular to length-axis
03   Relative position, widest point  Distance along main axis: 0 = at insertion point, 1 = at tip
04   Width-25                         Leaf blade width, 25% from insertion point to tip
05   Width-50                         Leaf blade width, 50% from insertion point to tip
06   Width-75                         Leaf blade width, 75% from insertion point to tip
07   Total perimeter of leaf blade    Including teeth
08   Total area of leaf blade         Including teeth
09   Shape ratio 1                    Width 25% / width 50%
10   Shape ratio 2                    Width 25% / width 75%
11   Shape ratio 3                    Width 50% / width 75%
12   Aspect ratio                     Width / length
13   Total number of teeth            Total count around the leaf blade outline, excluding the tip
14   Total area of teeth              Total area in mm²
15   Mean angle                       Angle at tip of tooth, averaged over all teeth
16   Tooth frequency                  Number of teeth / inner length
17   Total outer edge length          Total length of outer edges of all teeth
18   Tooth edge ratio                 Average ratio of lengths of two outer edges of each tooth
19   Tooth area / blade ratio         Total tooth area / leaf blade area excluding teeth
20   Tooth number / blade ratio       Number of teeth / total leaf blade perimeter
21   Average tooth area               Total area of teeth / number of teeth
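
Several of the characters in Table 1 are simple ratios of the basic measurements. As a small worked illustration (the function and key names are hypothetical, and the input values in the usage note are invented purely for the example), the three shape ratios and the aspect ratio follow directly from the blade length and the width measurements:

```python
def derived_shape_characters(length, width, w25, w50, w75):
    """Compute the ratio characters of Table 1 from basic blade measurements.

    length, width: blade length and maximum width.
    w25, w50, w75: blade widths at 25%, 50% and 75% of the way from the
    insertion point to the tip.
    """
    return {
        "shape_ratio_1": w25 / w50,      # Shape ratio 1: width 25% / width 50%
        "shape_ratio_2": w25 / w75,      # Shape ratio 2: width 25% / width 75%
        "shape_ratio_3": w50 / w75,      # Shape ratio 3: width 50% / width 75%
        "aspect_ratio": width / length,  # Aspect ratio: width / length
    }
```

For example, a hypothetical blade 100 mm long and 80 mm wide, with widths of 60, 80 and 40 mm at the three positions, would give shape ratios of 0.75, 1.5 and 2.0 and an aspect ratio of 0.8.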

Unlike some earlier, similar studies [12], preliminary work showed that adding
Gaussian noise did not improve the overall classification accuracy, so none was
added here.
A list of sources of the material is available in Tables 14 and 15; the species
acronyms are given in Tables 5 through 8, Tables 10 through 13 and Figures 3, 4 and
5. The original photographs of herbarium specimens are available on request from
the Herbarium of the Royal Botanic Gardens, Kew.

Neural Networks
Multilayer Perceptron (MLP)
The initial data filtering stage, intended to enable removal of the least useful
training data, involved using a simple feed-forward MLP with one input layer, one
hidden layer, and one output layer. One input node corresponded to each character
(attribute) and one output node was assigned to represent each of the four target
species. Therefore there were 41 input nodes and 4 output nodes (as in earlier work,
[21]). The network performance was optimized by varying the number of hidden
nodes. There were no connections between nodes in the same layer, and no recurrent
connections.

The network architecture is shown in Fig. 2 (though the actual number of nodes
is different from that shown in the Figure). The input vectors were normalized to the
range ±0.9, independently for each character over all training records. This was to
help reduce the training time required, and to minimize unintentional character
weighting.
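
The per-character normalization used in this section can be sketched as a linear min-max rescaling whose minima and maxima are taken from the training set only and then reused for the validation and test sets. The names `fit_scaler` and `apply_scaler` are hypothetical, and the handling of a character that is constant over the training set is an assumption of this sketch.

```python
def fit_scaler(training_rows, low=-0.9, high=0.9):
    """Record the minimum and maximum of each character over the training set."""
    columns = list(zip(*training_rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return mins, maxs, low, high

def apply_scaler(rows, scaler):
    """Rescale rows to [low, high] using the stored training minima and maxima."""
    mins, maxs, low, high = scaler
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                # Constant character: map to the middle of the range (assumption).
                new_row.append((low + high) / 2)
            else:
                new_row.append(low + (value - lo) * (high - low) / (hi - lo))
        scaled.append(new_row)
    return scaled
```

Because the scaler is fitted on the training set alone, validation or test values that fall outside the training range map slightly outside ±0.9; this sketch does not clip them.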

Fig. 2. MLP network architecture for identification.

The minimum and maximum values for each character over the entire training
set were used for identical scaling by normalization of the validation and test data
sets. The network weights were initialized to small random values between ±0.5 [3],
and the presentation order of input vectors (leaf data records) was randomized
between epochs (each epoch being one run through the complete training set). A bias
input of 1.0 was used. Further details on parameters and training algorithms used are
given in [9]. The error value reported in the results is the Squared Error Percentage
(E) [19], with corrections [8], given by:

(1)

where $o_{\max}$ and $o_{\min}$ are the maximum and minimum output values used in training,
here 0.9 and 0.1 respectively. $N$ is the number of output nodes (equal to the number
of species), and $P$ is the number of records (in this case leaf data records) in the data
set under consideration. $o_{pi}$ is the actual output at output node $i$ when input pattern $p$
is presented. $t_{pi}$ is the target (desired) output at output node $i$ when pattern $p$ is
presented.
Initially, training was carried out using a learning rate of 0.1, no momentum, a
variable number of nodes in the hidden layer and a fixed random seed. Neural
networks have a tendency to become 'overfitted' if too much training is carried out
and so they then perform badly when presented with previously-unseen data; i.e.,
this reduces their ability to generalize. To mitigate this problem of overtraining, a
validation dataset was used to enable early stopping. The number of nodes in the
hidden layer was optimized by performing trials with different numbers of hidden
nodes. The configuration that resulted in the lowest Squared Error Percentage (E) on
the validation set was considered to have an optimized number of hidden nodes.
After fixing the number of hidden nodes, further runs were carried out in order to
determine an optimized learning rate. After the number of hidden nodes and learning
rate had been optimized, the random seed itself was varied, this having the effect of
changing the initial random weights. The run with the lowest Squared Error
Percentage (E) overall, using the validation set was taken as the most appropriate to
consider for further processing.
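
The one-parameter-at-a-time tuning order described above might be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate` is a hypothetical callable standing in for a complete training run with early stopping that returns the Squared Error Percentage (E) on the validation set, and the candidate lists are supplied by the caller.

```python
def select_sequentially(evaluate, hidden_options, rate_options, seed_options,
                        start_rate, start_seed):
    """Tune hidden nodes, then learning rate, then random seed, in that order.

    evaluate(hidden, rate, seed) is a hypothetical callable that trains a
    network and returns its validation-set error; lower is better.
    """
    # Step 1: vary the number of hidden nodes with rate and seed fixed.
    best_hidden = min(hidden_options,
                      key=lambda h: evaluate(h, start_rate, start_seed))
    # Step 2: vary the learning rate with the chosen number of hidden nodes.
    best_rate = min(rate_options,
                    key=lambda r: evaluate(best_hidden, r, start_seed))
    # Step 3: vary the random seed (i.e. the initial weights).
    best_seed = min(seed_options,
                    key=lambda s: evaluate(best_hidden, best_rate, s))
    return best_hidden, best_rate, best_seed
```

Only the validation set is consulted at each step, which keeps the test set free of any influence on the chosen parameters.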
In an earlier paper [21] a novel and practical method of collating results from
this kind of study was introduced, in order to automatically and logically decide the
species identification of specimens in the test set, or refer to a botanist those
specimens that cannot be identified. After the network parameters had been
optimized, the results from the test set were examined more closely, and the concept
of a 'trusted' leaf identification was introduced. In order to establish this level of
trust, a special test is carried out, where the fully trained network with the best
performance is tested again, but this time, using the original training set as a dummy
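test set.

\[ \mathrm{MSE}_{\mathrm{correct}} = \frac{1}{\mathrm{NC}} \sum_{1}^{\mathrm{NC}} \left( \frac{(o_{pi} - t_{pi})^2}{(o_{\max} - t_{\max})^2} \right), \qquad (2) \]

\[ \mathrm{MSE}_{\mathrm{wrong}} = \frac{1}{\mathrm{NW}} \sum_{1}^{\mathrm{NW}} \left( \frac{(o_{pi} - t_{pi})^2}{(o_{\max} - t_{\max})^2} \right), \qquad (3) \]

\[ \text{threshold of trust} = \frac{\mathrm{MSE}_{\mathrm{correct}} + \mathrm{MSE}_{\mathrm{wrong}}}{2} \qquad (4) \]
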

Then the Mean Squared Error (MSE), shown in Eq. 2, was calculated for those
leaves the network identified 'correctly', where NC is the total number of leaves
'correctly' identified, and similarly the Mean Squared Error (Eq. 3) for those leaves
that were 'wrongly' identified, and NW is the total number of leaves 'wrongly'
identified. The mean of these two values was then used as a threshold of trust
(Eq. 4). The principle here is that if the system cannot identify a leaf well even if it is
in the training set, then it cannot be expected to give a trustworthy identification of a
similar leaf in the independent test set either. These thresholds are determined
separately for each species, because some species will be easier to identify than
others. Then these thresholds of trust are considered in conjunction with the network

results when using the test dataset. Test records which result in a Mean Squared
Error (MSE) greater than the threshold are regarded as 'Bad' leaves and not counted
in the final analysis (see later).
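
The threshold-of-trust calculation for a single species might be sketched as follows. Several details are assumptions made for illustration: the per-leaf error is taken as the plain mean of squared differences over the output nodes (without the scaling term of Eqs. 2 and 3), a leaf counts as 'correctly' identified when the largest output falls on the target node, and both the 'correct' and 'wrong' groups are assumed to be non-empty. All function names are hypothetical.

```python
def leaf_mse(outputs, targets):
    """Mean squared error over the output nodes for one leaf record."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

def threshold_of_trust(records):
    """Midpoint of the mean MSE of 'correct' and 'wrong' training leaves.

    records: list of (outputs, targets) pairs for the training leaves of
    one species, obtained by re-presenting the training set to the
    trained network. Both groups are assumed to be non-empty.
    """
    correct, wrong = [], []
    for outputs, targets in records:
        predicted = outputs.index(max(outputs))
        actual = targets.index(max(targets))
        group = correct if predicted == actual else wrong
        group.append(leaf_mse(outputs, targets))
    mse_correct = sum(correct) / len(correct)
    mse_wrong = sum(wrong) / len(wrong)
    return (mse_correct + mse_wrong) / 2

def is_good(mse, threshold):
    """A leaf is 'Bad' when its MSE is greater than the threshold."""
    return mse <= threshold
```

The same comparison can be applied to test records, to flag 'Bad' leaves, or to training records, to decide which of them to keep.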
In this new study, as a method of preprocessing before using the SOM, these
thresholds of trust were determined, by using the MLP as described above. Then, the
usefulness of the principle is extended to help decide which of the training data
records are 'good' and which are 'bad'. In short, these thresholds are used to increase
the quality of the training set. Only those records whose MSE was below the
respective threshold, when showing the training set to the trained network, are
retained in the training set to use for classification using the Self-Organizing Map
(SOM). Here one refers to the resultant smaller file as a distilled training set,
by analogy with distilling liquids such as alcohol to produce a more
concentrated result.

Self-Organizing Map (SOM)


After the relatively low quality data records were removed from the training set,
the resultant training set, of presumed higher quality, was then used as the training
set for unsupervised classification using a SOM, after removal of the 20 characters
pertaining to leaf asymmetry, that were found to be inhibiting the training of the
SOM. The following tests were therefore conducted using the same data, but with
only the 21 main attributes.
Although there are many variations of SOM [28, 29], it was decided to adhere
to a simple methodology, with parameters set to sensible values determined by
experiment or established heuristics, in order to simplify usage by non-experts. One
input node was allocated for each character, and a 2-D Kohonen competitive layer of
100 (10 x 10) nodes, with no recurrent connections, was used. The input vectors
were normalized between ±1.0, for each character independently over all training
records, to prevent a priori character weighting. All network weights were initialized
to random values between ±1.0. Learning was sequential, carried out on a
pattern-by-pattern basis with a constant learning rate (gain) of 0.3 ([3]: p. 285). The
neighborhood size was initially set to 9, to cover the whole competitive layer, and
then reduced periodically, the optimal number of training epochs (period) between
neighborhood reductions being determined by empirical results. When the
neighborhood size reached zero, 1000 further iterations were performed for
consolidation. During training, the neighborhood region was not allowed to extend
beyond the boundary of the array.
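
The neighborhood schedule can be illustrated with a short sketch. It assumes that the neighborhood size is reduced by one at the end of each period, which the text does not state explicitly; under that assumption the total number of iterations is 9 × period + 1000, which agrees with the iteration counts listed in Table 2 (for example, 1072 for a period of 8). The function name is hypothetical.

```python
def neighborhood_schedule(period, initial_size=9, consolidation=1000):
    """Yield the neighborhood size to use at each training iteration.

    The size starts at initial_size and is assumed to drop by one after
    every `period` iterations; once it reaches zero, `consolidation`
    further iterations are run with the winning node alone.
    """
    for size in range(initial_size, 0, -1):
        for _ in range(period):
            yield size
    for _ in range(consolidation):
        yield 0
```

For instance, `len(list(neighborhood_schedule(32)))` gives the total number of iterations for a period of 32.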
For every sample x from the total I training records, the best-matching, or
‘winning’ neuron i in the Kohonen (competitive) layer at presentation t was chosen
as follows:

\[ i(x) = \arg\min_{j} \lVert x(t) - w_j \rVert \qquad (5) \]

where j = 1, 2, ..., N indexes the Kohonen nodes and the Euclidean norm is calculated from:

\[ \lVert x(t) - w_j \rVert = \sqrt{\sum_{i=1}^{I} \left( x_i(t) - w_{ji} \right)^2}. \qquad (6) \]



Training in the competitive layer was carried out at each presentation (t) as follows:

\[ w_j(t+1) = \begin{cases} w_j(t) + \eta \, [x(t) - w_j(t)], & j \in \Lambda_{i(x)}(t) \\ w_j(t), & j \notin \Lambda_{i(x)}(t) \end{cases} \qquad (7) \]

where $\eta$ is the gain, or learning rate, and $\Lambda_{i(x)}(t)$ is the neighborhood function
centered about the winning neuron. In this study, the neighborhood function is
merely a simple step function, so only the weights of the winning node, and those in
the neighborhood are changed (with no center weighting) [29].
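
One training presentation, covering winner selection (Eqs. 5 and 6) and the step-function update (Eq. 7), might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: nodes are stored in row-major order, the neighborhood is taken to be a square window clipped at the grid boundary, and all function names are hypothetical.

```python
import math

def winning_node(x, weights):
    """Index of the node whose weight vector is closest to x."""
    def distance(w):
        return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))
    return min(range(len(weights)), key=lambda j: distance(weights[j]))

def neighborhood(winner, size, side=10):
    """Node indices within `size` grid squares of the winner, clipped to the grid."""
    row, col = divmod(winner, side)
    return [r * side + c
            for r in range(max(0, row - size), min(side, row + size + 1))
            for c in range(max(0, col - size), min(side, col + size + 1))]

def som_update(x, weights, size, gain=0.3, side=10):
    """Apply one update in place and return the index of the winning node.

    Every node in the neighborhood moves towards x by the same fraction
    `gain` (no center weighting); all other nodes are left unchanged.
    """
    winner = winning_node(x, weights)
    for j in neighborhood(winner, size, side):
        weights[j] = [w + gain * (xi - w) for xi, w in zip(x, weights[j])]
    return winner
```

With `size` set to zero only the winning node is updated, which corresponds to the consolidation phase.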
The presentation order of the data records was randomized between epochs
(complete presentations) of the training data set, in order to prevent the presentation
order influencing the topological map. After training, the entire data set was again
presented to the trained network, and the mean Euclidean norm (Ewin) for winning
nodes over all the data records was evaluated using:

\[ E_{\mathrm{win}} = \frac{1}{I} \sum_{j=1}^{I} \lVert x(t) - w_j \rVert \qquad (8) \]

This is a measure of the average distance between a data record (taxon) and the
winning node in the multidimensional data space.
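
As a brief illustration of this measure, the mean distance to the winning node can be computed by presenting every record to the trained map and averaging the smallest Euclidean distance found. The function name is hypothetical and the sketch assumes that records and weight vectors are plain lists of equal length.

```python
import math

def mean_winning_distance(records, weights):
    """Mean Euclidean distance from each record to its winning node."""
    total = 0.0
    for x in records:
        # The winning node is the one at the smallest distance from x.
        total += min(math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))
                     for w in weights)
    return total / len(records)
```

A lower value indicates that the trained map fits the presented data more closely.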
Tests were then performed as follows:
1) The random seed was set to an arbitrary fixed value. Several runs were carried
out using different period lengths, presented in Table 2. The run that resulted in
the lowest mean Ewin value, when the entire training set was again presented
to the trained network, was declared the winner. This ensured that the resultant
topological map that had the closest fit to the original data was chosen.
2) Having set the period to the optimized value at the end of the consolidation,
further tests were carried out in order to determine an optimized gain value
(learning rate). The results used for optimization are shown in Table 3.

3) Having determined the optimized period length and gain, the tests were
repeated using 30 different random seeds. A topological map (Fig. 3) was
constructed from the run with the lowest mean Ewin by presenting the entire
training set again to the trained network, and labelling the winning node for
each leaf data record accordingly, so that each grid square shows which
species resulted in the greatest number of winning nodes at that particular grid
position in the competitive layer. Fig. 4 shows more details, giving the actual
number of winning nodes for each species, while Fig. 5 displays the actual
number of winning nodes with test data.
Key: cordata (COR), platyphyllos (PLA), americana (AME), tomentosa (TOM), no species

Fig. 3. 10 x 10 SOM topological map populated with training and validation set data
showing the winning species for each competitive layer node (grid location).

4) The validation set that had previously been used for the MLP training was then
presented (with a reduced number of attributes) to the trained SOM and the
tally for each named species added to the corresponding winning node for each
data record. Although the validation set was not actually used in training, this
was a way for the data available from the validation set to enhance the SOM
model before it was used for identification.
Key: cordata (COR), platyphyllos (PLA), americana (AME), tomentosa (TOM)

Fig. 4. 10 x 10 SOM topological map populated with training set (and validation set) data
showing the number of training data records (leaves) for each species placed in each grid
location

Table 2. SOM: determination of optimized period (iterations between neighborhood reductions).

Period      8      16     24     32     40     48     56     64
Iterations  1072   1144   1216   1288   1360   1432   1504   1576
Ewin        0.191  0.200  0.190  0.184  0.191  0.193  0.188  0.190

Table 3. SOM: determination of optimized gain (learning rate).

Gain        0.01   0.02   0.05   0.10   0.15   0.20   0.30   0.40
Iterations  1288   1288   1288   1288   1288   1288   1288   1288
Ewin        0.213  0.204  0.209  0.193  0.195  0.193  0.184  0.180
Key: cordata (COR), platyphyllos (PLA), americana (AME), tomentosa (TOM)

Fig. 5. 10 x 10 SOM topological map populated with test set data showing the number of
test data records (leaves) for each species placed in each grid location.

Having constructed the SOM classification model, each single leaf record in the
test data set was presented in turn to the trained network model, and the
corresponding winning node in the competitive layer of the trained topological map
(considered as a grid) noted. At the position of this winning node, the total number
of previously recorded training set winners was counted for each species, taking into
account a neighborhood radius of 1 (that is, the winning node plus one square radius
surrounding it), thus considering a 3x3 window. Here the total was center weighted,
with each count for the actual winning node multiplied by 2; whereas those in the
surrounding 1 square neighborhood counted singly. Parts of a neighborhood
extending beyond the limit of the grid were ignored. The highest total was then said
to be the identification decision by the system for that particular test data leaf record
(see Table 9). The confidence value shown is different from that in the previous
study [21]. In this paper, it is the winning species tally divided by the total for all
species for that leaf in the target 3x3 window as shown in Eq. 9 below.

\[ \text{confidence} = \frac{\text{Number of winning nodes for chosen species in window}}{\text{Total number of winning nodes for all species in window}} \qquad (9) \]

An additional consideration is the concept of the threshold of trust for an
individual leaf data record in the test set. This concept was introduced in the earlier
paper [21], which used a MLP alone. This means that if the individual Mean Squared
Error (MSE) for a test data record, when tested against the trained MLP network, is
above that threshold, then the identification of that leaf can be said to be 'not trusted'
('Bad') and should be referred to a botanist for closer examination. The test set used
here is the same as that used in the earlier paper, and so the conclusions as to the
'Good' and 'Bad' leaves from the MLP study are of relevance here too, and therefore
are also shown here in Table 9. Unfortunately, an earlier error meant that some of
the earlier Good/Bad decisions were incorrect. The corrected Good/Bad decisions
are incorporated here in both the MLP results (Table 4, which supersedes the table in
the earlier paper) and the SOM results (Table 9). In addition, since the primary aim
of such an exercise is to identify actual specimens, rather than individual leaves, it is
of value to consider specimens as a whole, in which case the identifications of the
individual leaves contribute to the specimen identification. Here only the
identification decisions of all the 'Good' leaves on a specimen contribute to a winner-
takes-all identification decision for the whole specimen.
As in the previous study [21], the concept of referral is used. This means that if
all, or the majority of the leaves on a specimen are 'Bad', then any decision made by
the system is said to be only tentative, and not to be trusted, so the decision should
be 'Referred' to an experienced botanist. It is not surprising that
decisions made using 'Bad' leaves mostly have poor confidence levels. This principle
ensures that the system is used conservatively - that is, it is most useful if the user
can trust the system for the decisions that it says it can trust, so it is more sensible to
err on the side of caution, and refer all decisions of which the system is unsure,
whilst still maintaining a high level of automation.
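
The leaf-quality and referral rules described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the threshold is passed in as a plain number, and two details that the text does not spell out are assumed here, namely that a tie between species votes goes to the species seen first, and that a specimen with equal numbers of 'Good' and 'Bad' leaves is still 'Trusted' (which appears consistent with specimen PLA0151 in Tables 4 and 9).

```python
from collections import Counter
from typing import List, Tuple


def leaf_quality(mse: float, threshold: float) -> str:
    """'Good' if the leaf's MSE does not exceed the threshold of trust, else 'Bad'."""
    return "Good" if mse <= threshold else "Bad"


def specimen_decision(leaves: List[Tuple[str, str]]) -> Tuple[str, str]:
    """leaves holds (identified species, quality) for each leaf on one specimen.

    Returns (specimen identification, 'Trusted' or 'Referred').
    """
    if not leaves:
        raise ValueError("a specimen needs at least one leaf record")
    good = [ident for ident, quality in leaves if quality == "Good"]
    bad = [ident for ident, quality in leaves if quality == "Bad"]
    # Only 'Good' leaves vote; fall back to the 'Bad' leaves when there are none.
    voters = good if good else bad
    # Assumption: ties between species go to the species encountered first.
    species = Counter(voters).most_common(1)[0][0]
    # Referred when all, or a majority, of the leaves are 'Bad'.
    status = "Referred" if len(bad) > len(good) else "Trusted"
    return species, status


# Leaf rows for specimen PLA0169 as listed in Table 4 (MLP results).
print(specimen_decision([("PLA", "Good"), ("TOM", "Bad"), ("PLA", "Good")]))
```

Keeping the vote and the trust decision separate mirrors the text: the system always offers a tentative identification, and the 'Referred' flag only tells the user how far to rely on it.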

RESULTS
The MLP was exactly the same as that used in previous work [21] using the
same data, giving optimized parameters of 36 hidden nodes and a learning rate
of 0.065. Since the optimized random seed used was also the same, the trained
network was identical to that obtained previously. Details of the parameter
optimization are given in [21].
A similar table to Table 4, showing the leaf and specimen identifications using
the MLP alone, was presented in the previous paper [21]. Although there were some
errors relating to Leaf Quality in the previously published table, these are now
corrected and the results are shown here in Table 4, which should be considered to
be a fully updated version. Each Image ID represents a different herbarium specimen
image; Leaf ID is a unique number for a given leaf on that specimen. For leaves, 'Ident as'
is the majority rule identification for that leaf; Quality is 'Good' if the Mean Squared
Error (MSE) for that leaf is less than or equal to the threshold of trust [21] for that
species, and 'Bad' if the MSE for that leaf is greater than the threshold of trust
for that species.

Table 4. Leaf and specimen identifications (MLP - corrected).

Leaves Specimens
Image ID Leaf ID Ident as Quality Ident as Trust/Refer Confidence
COR0449 1 COR Good COR Trusted 100
COR0434 1 AME Bad AME Referred 100
COR0421 1 COR Good COR Trusted 93.3
COR0421 2 COR Good
COR0421 3 COR Good
COR0426 1 PLA Bad PLA Referred 100
COR0463 1 COR Good COR Trusted 100
COR0463 2 COR Good
COR0463 3 COR Good
COR0427 1 COR Good COR Trusted 100
PLA0151 1 PLA Good PLA Trusted 50
PLA0151 2 COR Bad
PLA0134 1 COR Bad COR Referred 100
PLA0136 1 PLA Good PLA Trusted 96.7
PLA0136 2 PLA Good
PLA0136 3 PLA Good
PLA0169 1 PLA Good PLA Trusted 66.6
PLA0169 2 TOM Bad
PLA0169 3 PLA Good
PLA0146 1 PLA Good PLA Trusted 95
PLA0146 2 PLA Good
PLA0171 1 COR Bad COR Referred 100
AME1150 1 AME Good AME Trusted 80
AME1148 1 TOM Bad TOM Referred 80
AME1156 1 PLA Bad TOM Referred 66.6
AME1156 2 TOM Bad
AME1156 3 TOM Bad
AME1129 1 PLA Bad PLA Referred 95
AME1129 2 PLA Bad
AME1136 1 PLA Bad PLA Referred 90
AME1128 1 PLA Bad PLA Referred 100
AME1128 2 PLA Bad
AME1128 3 COR Bad
TOM0060 1 PLA Bad PLA Referred 60
TOM0713 1 AME Good AME Trusted 70
TOM0101 1 TOM Good TOM Trusted 100
TOM0014 1 TOM Good TOM Trusted 100
TOM0097 1 TOM Good TOM Trusted 55
TOM0097 2 TOM Good
TOM0039 1 TOM Good TOM Trusted 80
TOM0039 2 TOM Good

Regarding the whole specimens, 'Ident as' is the majority rule species
identification decision, considering all the 'Good' leaves on that specimen (or 'Bad'
leaves, if there are no good leaves on the specimen); 'Trust/Refer' relates to whether
to trust the decision of the system, or to refer the final decision to a botanist, based
on the criteria described earlier. Species identifications in bold are those which are
considered 'correct', that is, they agree with the label of the herbarium folder in
which the specimen is kept.
A misidentification matrix for all leaves in the test set is shown in Table 5 with
the majority identifications underlined and the ‘correct’ identification shown in bold.
This is different from the similar table in the previous paper [21], because earlier the
figures were calculated from a population of results; whereas here one is using the
result of showing the test data set to the best optimized trained MLP network (as
determined by the MSE on the validation dataset). This is to enable a more practical
and realistic comparison between the results using the SOM and those obtained
using the MLP alone. Table 6 shows a similar MLP-based matrix, but instead
considering only those leaves said to be 'Good' using the threshold of trust described
earlier. Since a primary practical aim of using such a system is to identify actual
specimens, rather than just individual leaves, a misidentification matrix for all
specimens is provided in Table 7, and a similar table considering only 'Trusted'
specimens (where the number of constituent 'Good' leaves is greater than the number
of 'Bad' leaves - see earlier) is shown in Table 8.
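
As a rough illustration of how such a misidentification matrix can be built, the sketch below turns (label, identification) pairs into row percentages. It is a minimal sketch under stated assumptions: the function name and data layout are hypothetical, and the example data are the ten T. cordata leaf rows read from Table 4, which should give the COR row of the all-leaves MLP matrix.

```python
from collections import Counter
from typing import Dict, List, Tuple

SPECIES = ["COR", "PLA", "AME", "TOM"]


def misidentification_matrix(pairs: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """pairs holds (herbarium label, identified as) for each leaf or specimen.

    Each row is a label; each value is the percentage of that label's
    records identified as the column species.
    """
    counts = Counter(pairs)
    matrix = {}
    for label in SPECIES:
        row_total = sum(counts[(label, ident)] for ident in SPECIES)
        matrix[label] = {
            ident: (100.0 * counts[(label, ident)] / row_total if row_total else 0.0)
            for ident in SPECIES
        }
    return matrix


# MLP identifications of the ten T. cordata test leaves, read from Table 4:
# eight identified as COR, one as AME (COR0434) and one as PLA (COR0426).
cor_leaves = [("COR", "COR")] * 8 + [("COR", "AME"), ("COR", "PLA")]
matrix = misidentification_matrix(cor_leaves)
print(matrix["COR"])
```

Restricting the input pairs to 'Good' leaves, or to 'Trusted' specimens, gives the filtered versions of the matrix in the same way.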

Table 5. ALL LEAVES test dataset leaf misidentification matrix (MLP only).
LEAVES Identification %
Label COR PLA AME TOM
COR 80.0 10.0 10.0 0.0
PLA 25.0 66.7 0.0 8.3
AME 9.1 54.5 9.1 27.3
TOM 0.0 12.5 12.5 75.0
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench

Table 6. GOOD LEAVES ONLY test dataset leaf misidentification matrix (MLP
only).
LEAVES Identification %
Label COR PLA AME TOM
COR 100.0 0.0 0.0 0.0
PLA 0.0 100.0 0.0 0.0
AME 0.0 0.0 100.0 0.0
TOM 0.0 0.0 14.3 85.7
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench

Table 7. ALL SPECIMENS test dataset specimen misidentification matrix (MLP).
SPECIMENS Identification %
Label COR PLA AME TOM
COR 66.7 16.7 16.7 0.0
PLA 33.3 66.7 0.0 0.0
AME 0.0 50.0 16.7 33.3
TOM 0.0 16.7 16.7 66.7
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench

Table 8. TRUSTED SPECIMENS test dataset specimen misidentification matrix (MLP).
SPECIMENS Identification %
Label COR PLA AME TOM
COR 100.0 0.0 0.0 0.0
PLA 0.0 100.0 0.0 0.0
AME 0.0 0.0 100.0 0.0
TOM 0.0 0.0 20.0 80.0
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench

Regarding the SOM, optimization results are presented for period length
(Table 2) and gain parameters (Table 3). In each case, the number of iterations and
the Ewin error value at the point of the termination of training, after consolidation, are
provided. The optimized period length was found to be 32 and the optimized gain was
0.40. Fig. 3 shows the resultant topological map, with the winning species
for each node in the competitive layer. Fig. 4 shows the number of training data
records for each winning node, for each species.
Turning now to the detailed map of the trained classification model in Fig. 4,
which shows the actual number of winning nodes for each species in each competitive
layer grid square, we can see that there is more overlap between species in any
given square and its immediate neighborhood than is apparent from the winner-takes-all
view shown in Fig. 3. The test set data map shown in Fig. 5 indicates the location
of the winning nodes when presenting the test set to the fully trained network, and
shows the positions of all the winning nodes for all the individual leaf records in the
test set. These positions on the grid can then be directly correlated with the same
positions on the main model produced from the training (and validation) sets shown
in Fig. 4. The position of the winning node for each test data record thus forms the
center of a 3x3 window which can be logically placed in position over the training
(and validation) set map, in which the numbers of training set winning nodes can be
counted separately for each of the four species. Center weighting is used because,
presumably, the actual winning node position is more important and relevant than
the neighbouring positions - though the technique is designed to show relationships
by adjacent topology, so the immediate neighborhood should have relevance.
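
The center-weighted 3x3 window count can be sketched as follows. This is a minimal sketch rather than the authors' code: the grid layout (a list of rows, each node holding per-species counts of training-set winners), the function names and the toy counts are all hypothetical. Confidence is returned as a percentage, as it appears to be reported in Table 9, and ties between species are resolved in favor of the species encountered first, which is an assumption.

```python
from typing import Dict, List, Tuple

Counts = Dict[str, int]      # species code -> number of training-set winners
Grid = List[List[Counts]]    # grid[row][col] holds the counts for one map node


def window_tally(grid: Grid, row: int, col: int,
                 radius: int = 1, center_weight: int = 2) -> Dict[str, int]:
    """Sum training-set winners per species in the window centered on (row, col)."""
    totals: Dict[str, int] = {}
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            # Parts of the window beyond the edge of the grid are ignored.
            if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
                continue
            # Counts at the winning node itself are doubled.
            weight = center_weight if (r, c) == (row, col) else 1
            for species, n in grid[r][c].items():
                totals[species] = totals.get(species, 0) + weight * n
    return totals


def identify(grid: Grid, row: int, col: int) -> Tuple[str, float]:
    """Return (species with the highest tally, confidence in percent)."""
    totals = window_tally(grid, row, col)
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no training-set winners in the window")
    species = max(totals, key=totals.get)
    return species, 100.0 * totals[species] / grand_total


# Hypothetical 3x3 map (illustrative counts only, not taken from Fig. 4).
toy_grid = [
    [{"COR": 3}, {"COR": 1, "PLA": 1}, {}],
    [{"PLA": 2}, {"TOM": 1}, {}],
    [{}, {}, {"AME": 4}],
]
species, confidence = identify(toy_grid, 0, 0)
print(species, round(confidence, 1))
```

For the corner node (0, 0) only four of the nine window cells lie on the grid, so the tally is COR 7 (3 doubled, plus 1), PLA 3 and TOM 1, giving COR with a confidence of 7/11.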
Now, one can consider the identifications shown in Table 9 resulting from
running the test dataset through the fully trained, consolidated and optimized SOM.

Table 9. Leaf and specimen identifications (SOM with leaf quality from MLP).

Leaves Specimens
Image ID Leaf ID Ident as Quality Ident as Trust/Refer Confidence
COR0449 1 COR Good COR Trusted 53.5
COR0434 1 AME Bad AME Referred 52.6
COR0421 1 COR Good COR Trusted 65.8
COR0421 2 COR Good
COR0421 3 COR Good
COR0426 1 PLA Bad PLA Referred 38.7
COR0463 1 COR Good COR Trusted 60.5
COR0463 2 COR Good
COR0463 3 COR Good
COR0427 1 COR Good COR Trusted 69.6
PLA0151 1 PLA Good PLA Trusted 37.7
PLA0151 2 TOM Bad
PLA0134 1 TOM Bad TOM Referred 44.8
PLA0136 1 PLA Good PLA Trusted 61.9
PLA0136 2 PLA Good
PLA0136 3 PLA Good
PLA0169 1 PLA Good PLA Trusted 47.5
PLA0169 2 PLA Bad
PLA0169 3 PLA Good
PLA0146 1 PLA Good PLA Trusted 56.5
PLA0146 2 PLA Good
PLA0171 1 COR Bad COR Referred 62.5
AME1150 1 TOM Good TOM Trusted 41.7
AME1148 1 TOM Bad TOM Referred 41.7
AME1156 1 AME Bad TOM Referred 34.3
AME1156 2 TOM Bad
AME1156 3 PLA Bad
AME1129 1 PLA Bad PLA Referred 41.8
AME1129 2 PLA Bad
AME1136 1 TOM Bad TOM Referred 43.5
AME1128 1 COR Bad COR Referred 59.2
AME1128 2 COR Bad
AME1128 3 COR Bad
TOM0060 1 PLA Bad PLA Referred 40.5
TOM0713 1 AME Good AME Trusted 52.6
TOM0101 1 TOM Good TOM Trusted 50.0
TOM0014 1 TOM Good TOM Trusted 47.2
TOM0097 1 TOM Good TOM Trusted 38.3
TOM0097 2 COR Good
TOM0039 1 TOM Good COR Trusted 49.3
TOM0039 2 COR Good

Considering the test dataset as a whole, treating the leaf identifications
individually, and ignoring the quality of the leaves, then more than half (53.7%)
of the leaves are identified correctly. If one only considers the 'Good' leaves, then
20 out of 24 leaves were identified correctly, giving an accuracy of 83.3%. This is
very good, considering that it is achieved with minimum (non-expert) human
intervention, and is only operating using leaf outlines obtained almost completely
automatically from images of whole herbarium specimens. Furthermore, a botanist
would typically also need to use features from other parts of the plant. Incidentally, it
is not surprising that COR0427 is identified correctly as T. cordata, given that it is a
type specimen (K isotype of Pigott's BM neotype) [31].
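
The 53.7% figure counts every leaf equally. A second way of summarizing the same results is to average the per-species percentages on the diagonal of the all-leaves SOM matrix (Table 10), which gives a slightly different value and appears to be the 'average' quoted in the Conclusions. The short sketch below shows both calculations; the per-species counts were tallied here from Table 9 and the variable names are illustrative.

```python
# (correct, total) SOM test leaves per species, tallied from Table 9.
som_leaves = {"COR": (8, 10), "PLA": (9, 12), "AME": (1, 11), "TOM": (4, 8)}

correct = sum(c for c, _ in som_leaves.values())
total = sum(t for _, t in som_leaves.values())
overall_pct = 100.0 * correct / total  # every leaf counts equally

per_species_pct = [100.0 * c / t for c, t in som_leaves.values()]
mean_species_pct = sum(per_species_pct) / len(per_species_pct)  # every species counts equally

print(f"overall: {overall_pct:.1f}%  per-species mean: {mean_species_pct:.1f}%")
# prints: overall: 53.7%  per-species mean: 53.5%
```

The two figures differ because the species contribute different numbers of leaves to the test set, so a poorly identified species such as T. americana weighs more heavily in the per-species mean.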
Moving now to consider identifications of whole specimens as shown in
Table 9, if one considers all specimens in the test set without regard to leaf quality
(except when considering the winner-takes-all approach to specimen identification),
then 11 out of all the 24 specimens (45.8%) are identified correctly. If one considers
only 'Trusted' identifications, then 11 out of the 14 'Trusted' specimen identifications
are correct, giving an accuracy of 78.6%. The remaining three 'Trusted'
specimens (AME1150, TOM0713 and TOM0039) are 'incorrectly' identified and are
discussed later. Of the remaining specimens, 10 of
the 24 test specimens have no ‘Trusted’ ('Good') leaves, or the majority of leaves are
'Bad' - these are therefore referred to a botanist for further study. These specimens
are assumed to be too poor (with regard to detectable leaf outline) for the system to
identify well, and are likely to be difficult for an expert botanist. It is not surprising
that, given that the specimens in the test set were chosen at random, many
are difficult to identify because of their poor quality.
A misidentification matrix using the SOM for identification of all leaves in the
test set is shown in Table 10 with the majority identification underlined and the
‘correct’ identification shown in bold. Table 11 contains a similar matrix, but instead
considers only those leaves said to be 'Good' using the threshold of trust described
earlier. The 'weighting' refers to the center weighting used when performing counts
in the 3x3 window. A misidentification matrix for all specimens using the SOM is
provided in Table 12, and a similar matrix considering only ‘Trusted’ specimens
(where the number of constituent 'Good' leaves is greater than the number of 'Bad'
leaves - see earlier) is presented in Table 13. This facilitates comparison with the
results produced using the MLP alone. Details of all the specimens from the Kew
Herbarium that were used as test data in this study are presented in Tables 14 and 15.

Table 10. ALL LEAVES test dataset leaf misidentification matrix (SOM).
LEAVES Identification % (weighted)
Label COR PLA AME TOM
COR 80.0 10.0 10.0 0.0
PLA 8.3 75.0 0.0 16.7
AME 27.3 27.3 9.1 36.4
TOM 25.0 12.5 12.5 50.0
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench

Table 11. GOOD LEAVES ONLY test dataset leaf misidentification matrix (SOM).
LEAVES Identification % (weighted)
Label COR PLA AME TOM
COR 100.0 0.0 0.0 0.0
PLA 0.0 100.0 0.0 0.0
AME 0.0 0.0 0.0 100.0
TOM 28.6 0.0 14.3 57.1
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench

Table 12. ALL SPECIMENS test dataset specimen misidentification matrix (SOM).
SPECIMENS Identification %
Label COR PLA AME TOM
COR 66.7 16.7 16.7 0.0
PLA 16.7 66.7 0.0 16.7
AME 16.7 16.7 0.0 66.7
TOM 16.7 16.7 16.7 50.0
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench

Table 13. TRUSTED SPECIMENS ONLY test dataset specimen misidentification matrix
(SOM).

SPECIMENS Identification %
Label COR PLA AME TOM
COR 100.0 0.0 0.0 0.0
PLA 0.0 100.0 0.0 0.0
AME 0.0 0.0 0.0 100.0
TOM 20.0 0.0 20.0 60.0
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench

CONCLUSIONS
This paper builds on earlier studies by the authors that used neural networks for
identification and classification of Tilia herbarium specimens from images. The
principle of using an MLP to help filter out relatively noisy/bad data from a training
set before feeding into a SOM seems useful, since this is clearly a way of removing
much potentially inhibitory, badly damaged leaf data. There is much noise in these
data because the leaf shape information was extracted automatically from images of
herbarium specimens.

In the previous study [21], the concept of automated referral was introduced, in
which it is accepted that some specimens are not identifiable by the system and these
are instead referred to an expert botanist. This is reasonable since here one is
attempting to identify specimens automatically from images, with minimal human
intervention, and also only using leaf outlines. The quality of herbarium specimens
can vary considerably, and if there are many leaves overlapping, it is difficult to
separate individual leaves by image processing. Furthermore, one aim of this study is
to show the use of existing herbaria as repositories of information and accumulated
expert knowledge resulting from many years of human experts studying and naming
the specimens. This study does indeed show that it is possible to use such data to
help identify some unknown specimens, without the help of expert knowledge. For
instance, a trainee, non-expert in the plant group concerned could be provided with
help by this kind of system, and then refer difficult identifications to an expert on the
group.

Table 14. Herbarium specimens used for testing (RBG, KEW Herbarium, UK).
Material continued in Table 15.
Image ID Collection: Location Date
COR0449 Bois de Premieres (Cote d'Or), France 12 July 1875
COR0434 Kings Wood, Yatton, N. Somerset, UK 19 Sept 1906
COR0421 Little Malvern, Worcestershire, UK 21 June 1914?
COR0426 Egham, Surrey, UK 19 August 1917
COR0463 Austria
COR0427 Western edge of Chalkney Wood, near Earls Colne, Essex, UK 52/872276 14 August 1994
PLA0151 Slopes of Chanctonbury Hill, West Sussex, UK 7 July 1926
PLA0134 Near Wargrave, Berkshire, UK 5 August 1946
PLA0136 In garden in Malvern Road, Cheltenham, Gloucs, UK 12 July 1945
PLA0169 Bruton, Somerset, UK July 1936
PLA0146 Near River Mole, Stoke D'Abermon, Surrey, UK 21 July 1918
PLA0171 Oxford Road, Woodstock, UK (planted) 25 June 1945
AME1150 Quebec, Canada June 1825
AME1148 Lake region and Ontario, Canada 10 July 1877
AME1156 St. Louis, Missouri, USA July 1841
AME1129 Noel, Missouri, USA 8 Oct 1909
AME1136 Gordon Hills, Gibson County, Indiana, USA 6 July 1915
AME1128 Noel, Missouri, USA 25 April 1909
TOM0060 Hungary 7 Oct 1927/1947?
TOM0713 Northern Syria Aug? 1908
TOM0101 Cultivated At RBG Kew, UK Oct 1881
TOM0014 Northern Albania 9 July 1918
TOM0097 Arboretum, RBG Kew, UK 16 Sept 1884
TOM0039 Therapia, Turkey July 1862

Referring to the general topological map (Fig. 3), this shows the overall
classification view (of the training and validation datasets), with a winner-takes-all
view of each grid square, where the species indicated is the one with the most
winning nodes centered on that square. It can be seen that, as expected, there has
been some separation into areas relating to individual species, although this is not
perfect. Roughly speaking, there is a concentration of T. americana (AME) in the
bottom right of the map; T. cordata (COR) tends to be on the left side of the map; T.
platyphyllos (PLA) tends to be on the right side of the map, but there is also a
significant concentration at the top left; T. tomentosa (TOM) seems to be more
randomly distributed, although it avoids the top left of the map. It is not too
surprising that it is difficult to separate these species, given that they are closely
related, and one is only using automatically extracted characters from leaves on
whole herbarium specimens.

Table 15. Herbarium specimens used for testing (RBG, KEW Herbarium, UK).
Continuation of Table 14.

Image ID Collector and No. Notes
COR0449 E.A Willmott Herb. Ed Bonnet, Herb. Warleyensis
COR0434 J.W.White Leaves of stump shoots. Ex Herb. C.E.Britton
COR0421 A.J.Crosfield Herb. A.B. Jackson
COR0426 J. Fraser Herb. J. Fraser
COR0463 Braun 1693 T. betulaefolia Hofm. apud Bayer in Verh. zool. bot. Ges. XII. p. 23. (1862) Pres. April 1892.
COR0427 C.D. Pigott H. 1519/96
PLA0151 H.J. Riddelsdell Vice County 13
PLA0134 L.H. Williams Herb. C.C. Townsend VC 22. Glue on leaf margins. (JYC: actually T. tomentosa - see Conclusions)
PLA0136 G. Redhead 5356 Vice County 33
PLA0169 F.K. Makins Altitude 280 feet
PLA0146 C.E. Britton No.
PLA0171 W.B. Turril Pollen was sent to Kings College London June 1973
AME1150 Hubbard? Labelled T. pubescens.
AME1148 Prof. Macoun Flora Canadensis No. 256
AME1156 Riebl? Herb. J. Gay. T. nigra, T. glabra, T. canadensis
AME1129 B.F. Bush 5983 Rich woods
AME1136 C.C. Deam 16,871 T. americana var. neglecta
AME1128 B.F. Bush 5530 Rich woods. Poor specimen.
TOM0060 J. Wagner 11 Flora of Hungaria. T. argentea var. dolichocarpa Wagn.
TOM0713 M. Haradjian 2472 Plantae Syriae borealis
TOM0101 Cult. in Hort. Kew as T. alba. Now T. tomentosa 'Petiolaris'
TOM0014 J. B. Kümmerle 700m Mus. Nat Hung. Budapest
TOM0097 G. Nicholson 2730 T. tomentosa 'Petiolaris'. Glue on leaf margins
TOM0039 Pres. by Helen Taylor Mar 1875

This does justify the technique used here, however, of including the immediate
neighborhood, with center weighting, when calculating identification counts, as this
should make the best use of the information included in this kind of incomplete
separation in the topological map.
Referring to the SOM misidentification matrices (Tables 10-13), three of the
four species are mostly correctly placed, with at least half of the leaves, often more,
identified as the appropriate species. It is clear, however, that Tilia americana is not
identified well by the system. Only one leaf out of 11 is 'Good'; all the rest are 'Bad', so
most specimens are ‘Referred’, and the one ‘Trusted’ specimen is incorrect. This is
likely to be due to the inherent noisy nature of the data, and inaccuracies in outlines
derived automatically, especially when leaves are overlapping in the specimen,
which is often the case because of the species' mostly large leaves. In addition, it is
possible that problems are caused by this species being the one with the least data
records in the training set - the number of leaves may be below a critical threshold
when compared with the other species. The system is still useful, though, because
almost all of the T. americana leaves in the test set were automatically classified as
'Bad' by the MLP stage of the process, and could therefore be discounted from the
final analysis. Another influence is likely to be the number of leaves that the system
can successfully extract from images of whole specimens - usually this is only about
2 or 3, and the decision of the system is thus likely to be heavily influenced by the
proportion of immature to mature leaves extracted.
Considering the incorrect 'Trusted' identifications made by the
system:
AME1150 is from Quebec, Canada, so it cannot be T. tomentosa. It is labelled
T. pubescens, which is a synonym of T. caroliniana, which in the strict sense
grows further south. However, its collection location suggests that it is
just possible that it is T. caroliniana subsp. heterophylla (Venténat) Pigott
(synonym: T. americana var. heterophylla (Venténat) Loudon), which does just
extend that far north, or even a hybrid T. americana × T. caroliniana ([7],
p. 261). Both of these are out of the scope of this study.
TOM0713 is a specimen from Syria, so it is almost certainly correctly labelled as T.
tomentosa. However, the leaves of this particular specimen are very similar in
outline shape to those of the American species, so it is not too surprising that
the system is confused.
TOM0039 is almost certainly correctly labelled, given that the specimen was collected
near Istanbul in Turkey. However, the leaf shape of T. tomentosa is very
variable, and this particular specimen has leaves of very similar shape to those
of T. cordata. So it is understandable that the system identified it as that
species on leaf shape characters alone.
Also, many test specimens 'Referred' to a botanist had problems that would make
the decision of the system understandable:
COR0434 is a stump sprout of T. cordata. As stated in the previous paper [21],
it is extremely difficult, if not impossible, for a human to identify leaves on
shoots sprouting from Tilia tree stumps, as their leaf morphology is often
extremely different from the normal canopy leaves. Indeed, all existing keys to
identify Tilia species state that they are for identifying shoots from the canopy,
and also often specify flowering or fruiting shoots.
COR0426 (a perfectly reasonable T. cordata) is identified tentatively as T.
platyphyllos by the system, with low confidence - however, this is probably just
an 'educated guess', considering the decision is based on a single 'Bad' leaf.
PLA0134 is 'Referred' for good reason. Clearly the system has difficulties
because of the spots of glue on the edges of the leaves. The specimen label says
T. platyphyllos, but the system, although uncertain, identifies this specimen as
T. tomentosa. On closer examination of the specimen, unfortunately, the bottom
surface of every leaf is stuck down, making it very difficult to view the
underside. With careful investigation it was just possible to notice a few stellate
hairs on the leaf underside, which T. platyphyllos does not possess, but T.
tomentosa does. They are also clearly visible on the bract surface. Furthermore,
closer examination of the flowers revealed the presence of staminodes - again
possessed by T. tomentosa, but not T. platyphyllos. Case closed - this specimen
is a cultivated specimen of T. tomentosa. A standard determinavit slip with the
corrected name was therefore attached to the sheet. When this specimen is
correctly considered to be T. tomentosa instead, this improves the 'all leaf'
identification using the SOM system, shown in Table 10 for PLA and TOM, so
that the average correct identification for all leaves (with no quality
assessment) is raised from 53.5% to 58.7%. Similarly, the performance in
Table 12 (Specimen identification with no specimens Referred) improves from
an average of 45.9% to 50.95%.
PLA0171: Closer examination of this specimen reveals that although it is
labelled T. platyphyllos, it does have a slightly glaucous (bluish) leaf underside
- which suggests it might be the hybrid T. × europaea - though the relatively
few flowers per inflorescence still suggest T. platyphyllos.
TOM0060 had paper strips crossing the leaves, as part of the specimen
mounting, so it is possible that this is why the single leaf extracted was 'Bad',
causing a poor identification.
It is interesting to compare the effectiveness of the technique presented here, in
which an MLP is used to provide an enhanced distilled training dataset for use with a
SOM, and the SOM is subsequently used for identification, with that of just using
the MLP for identification alone, using the same data (see also [12]). In short, the
MLP technique alone, using the threshold of trust to filter out poorer quality leaf
data from the test set, works better than the SOM technique, even if the training set
for the latter is enhanced by removing poor quality data. The
principle of thus labelling test leaves as 'Good' or 'Bad', and then using this in a
winner-takes-all decision for each test specimen works well with both the MLP and
the SOM. However, as can be seen clearly from the misidentification matrices
(Tables 5-8 and 10-13), whilst the two techniques produce rather similar results
when considering ALL leaves or ALL specimens, when this is combined with
automated rejection of 'Bad' leaves and 'Referral' of particularly poor or challenging
specimens, the MLP technique is clearly demonstrated to be the better one. Either
way, the principle of 'Referring' to a botanist those specimens of whose identification
the system is unsure, is clearly a sensible procedure in order to increase the degree of
trust the user can have in the identifications that the system makes. Even so, as
explained above, the system can still make helpful suggestions.

After referral of such specimens, the remaining identification decisions made by
the MLP-based system are extremely good for the four species (on average 95%; see
Table 8). The SOM-based system is not as good (average 65%; see Table 13), let down by the
poor identification of Tilia americana.
An obvious limitation of the system is that it can currently only be used to help
classify and identify leaf shape data from four species of Tilia. Also, the specimen to
be identified will need to be one of these four in order for the system to work well.
This is a problem shared with traditional keys, in that they only work within their
defined scope. The methodology should work with other broadleaved plant groups,
given enough leaves of sufficient quality. More work is needed, though, to determine
how this kind of system reacts to data that is out of scope - it is likely to simply
return a low-confidence decision. Extending the system to cover other Tilia species
would obviously be advantageous, but undoubtedly, because of similarities in leaf
shape that would become more apparent with a greater number of species, it is likely
to be necessary to extract other character information such as floral structures
mentioned earlier, or information regarding the hairs on the leaves, which are
critically important in published keys to the genus [7], in order to provide accurate
identification. In addition, the inclusion of simple geographic information, such as
continent of origin, often available on herbarium specimen labels, has already been
shown to greatly enhance the identification performance of this kind of system [11].
In conclusion, as has also been demonstrated in earlier work, practical use can
be made of a largely automated artificial intelligence methodology involving neural
networks for identification, such as that presented here. The primary aim, to provide
taxonomists with a methodology for creating an advisory tool that offers
an alternative taxonomic opinion based on the same real-world data, has been met. Here
this has been demonstrated effectively with a real-world case study using herbarium
material.
As a result of this work, it is now very clear that care should be taken when
mounting herbarium specimens, that leaves are not obscured with glue or paper
strips, as this reduces the effectiveness of such automated systems that combine
image analysis with artificial intelligence. Fortunately such practice is less common
today, but herbarium curators should be made aware of these issues. Clearly, such
techniques are especially useful for analysis of these kinds of datasets, where the
attributes are extremely variable, the data is noisy or incomplete, and relationships
between taxa are unclear, making it difficult to perform effective classification by
traditional methods. Furthermore, such methods are more objective and repeatable
than the creation of identification and classification systems solely by human
experts, and they are theoretically able to detect non-linear relationships between
taxa, characters and character states. As such, they are particularly useful to
elucidate relationships between species, or taxa of other ranks, from which DNA
data cannot easily be obtained, or where such data does not provide enough
meaningful information for effective classification.

ACKNOWLEDGMENTS
Grateful thanks are due to the Trustees of the Royal Botanic Gardens, Kew for
permission to study and photograph specimens in the Herbarium, and to include the
photographs here. Also, gratitude is due, as always, to Donald Pigott for earlier
encouragement and especially for writing his wonderful monograph on Tilia. Thanks
also go to Lilian Tang, Y. Hu and J. Jin for help with extraction algorithms, and to
Kulamagul Mahendrarajah for development of the Java-based shell for the MLP
program. Also, thanks are due to Scott Notley for work with the character extraction
software and preparation of datafiles. Many thanks also go to Katherine Clark for
help with the Figures.

REFERENCES

[1] Pankhurst RJ (1991) Practical Taxonomic Computing. Cambridge University
Press, Cambridge, UK.

[2] MacLeod N (Ed.) (2007) Automated Taxon Identification in Systematics
(Systematics Association Special Volume). CRC Press, Boca Raton, FL, USA.

[3] Freeman JA and Skapura DM (1992) Neural networks: algorithms, applications,
and programming techniques. Addison-Wesley, Reading, Massachusetts, USA.

[4] Haykin S (1994) Neural networks – A comprehensive foundation. Macmillan
College Publishing Company, New York.

[5] Cope JS, Corney DPA, Clark JY, Remagnino P and Wilkin P (2012) Plant
species identification using digital morphometrics: a review. Expert Systems
with Applications 39: 7562-7573.

[6] Messina G, Pandolfi C, Mugnai S, Azzarello E, Dixon K and Mancuso S
(2009) Phyllometric parameters and artificial neural networks for the
identification of Banksia accessions. Australian Syst. Bot. 22: 31-38.

[7] Pigott D (2012) Lime Trees and Basswoods: A Biological Monograph of the
genus Tilia. Cambridge University Press, Cambridge, UK.

[8] Clark JY (2000) Botanical identification and classification using artificial
neural networks. PhD Thesis, University of Reading, Reading, UK.

[9] Clark JY (2003) Artificial neural networks for species identification by
taxonomists. BioSystems 72: 131-147.

[10] Clark JY (2004) Identification of botanical specimens using artificial neural
networks. IEEE Symposium on Computational Intelligence in Bioinformatics
and Computational Biology (CIBCB), San Diego, CA, USA - October 2004:
87-94.

[11] Clark JY (2007) Plant identification from characters and measurements using
artificial neural networks. In: MacLeod N. (Ed.) Automated Taxon
Identification in Systematics (Systematics Association Special Volume):
207-224. CRC Press, Boca Raton, FL, USA.

[12] Clark JY, Corney DPA and Tang HL (2012) Automated plant identification
using artificial neural networks. IEEE Symposium on Computational
Intelligence in Bioinformatics and Computational Biology (CIBCB). San Diego,
CA, USA: 343-348.

[13] Corney DPA, Clark JY, Tang HL and Wilkin P (2012) Automatic extraction of
leaf characters from herbarium specimens. Taxon 61: 231-244.

[14] Jain AK, Zhong Y and Lakshmanan S (1996) Object matching using
deformable templates. IEEE Trans. Pattern Analysis 18: 267-278.

[15] Fogel DB (2006) Evolutionary Computation: Toward a new philosophy of
machine intelligence, 3rd ed. Wiley-IEEE Press, Hoboken, NJ, USA.

[16] Canny JF (1986) A computational approach to edge detection. IEEE Trans.
Pattern Anal. 8: 679-698.

[17] Malladi R, Sethian JA and Vemuri BC (1995) Shape modeling with front
propagation: a level set approach. IEEE Trans. Pattern Anal. 17: 158-175.

[18] Ye L and Keogh E (2009) Time series shapelets: a new primitive for data
mining. Proc. of the 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. Paris: 947-956.

[19] Prechelt L (1994) Proben1 – A set of neural network benchmark problems and
benchmarking rules. Technical Report 21/94, Universität Karlsruhe, Germany.

[20] He H and Garcia EA (2009) Learning from imbalanced data. IEEE
Transactions on Knowledge and Data Engineering 21: 1263-1284.

[21] Clark JY, Corney D, Notley S and Wilkin P (2015) Image processing and
artificial neural networks for automated plant species identification from leaf
outlines. In: Lestrel PE (Ed.) Proc. 3rd Int Symp Biol Shape Anal (ISBSA)
World Scientific Pub Singapore and New Jersey, USA.

[22] Clark JY and Warwick K (1998) Artificial keys for botanical identification
using a multilayer perceptron neural network (MLP). Artificial Intelligence
Review 12: 105-115.

[23] Clark JY (2003) Computer-aided taxonomy of Lithops. Mesemb Study Group
Bulletin 18 (1): 10-12.

[24] Clark JY (2003) Artificial neural networks for species identification by
taxonomists. BioSystems 72: 131-147.

[25] Boddy L, Morris CW and Morgan A (1998) Development of artificial neural
networks for identification. In Information Technology, Plant Pathology &
Biodiversity. Bridge P, Jeffries P, Morse DR and Scott PR (Eds.) Wallingford,
UK: CAB International, 221-231.

[26] Clark JY (2009) Neural networks and cluster analysis for unsupervised
classification of cultivated species of Tilia (Malvaceae). Bot. J Linnean Soc.
159: 300-314.

[27] Rath T (1996) Klassifikation und Identifikation gartenbaulicher Objekte mit
künstlichen neuronalen Netzwerken. Gartenbauwissenschaft 61: 153-159.

[28] Kohonen T (1982) Self-organized formation of topologically correct feature
maps. Biol. Cybernetics 43: 59-69.

[29] Kohonen T (1989) Self-Organization and Associative Memory. 3rd Edition.
New York: Springer.

[30] Pigott CD (1997) Tilia. In The European Garden Flora, Walters, SM et al.
Cambridge University Press, Cambridge, UK: 205-212.

[31] Pigott CD (1997) Two proposals to maintain the names Tilia cordata and Tilia
platyphyllos (Tiliaceae) in their current use. Taxon 46: 351-353.
