
APPENDIX 6

Glossary of technical terms for non-technical readers.


aggregation: (noun) any process in which multiple items of information are gathered and
expressed in a summary form such as a mean or median. Aggregating is used by
some to mean "averaging", but it also includes finding the median. Aggregation
helps reduce the influence of sources of error such as outliers.

cluster analysis: (noun) a rather loose collection of numerical methods that can be used
to assign objects to hierarchical groups (clusters) usually resulting in a dendrogram
(branching tree-like graphic, also called a tree) showing similarities or associations
between objects.

coefficient: (noun) a multiplier that measures some property of a particular set of data,
for which it is constant, while differing for different data sets, or a number by which
a variable is multiplied, as in 2x, where 2 is the coefficient.

Coefficient of Variation (CV): the standard deviation expressed as a percentage of the
mean; roughly a measure of how much the data are spread out around the mean. The CV
works well in data sets which are normally distributed.
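
As a rough illustration of the definition above, the CV can be computed with Python's
standard library. The measurements are made-up values, and the use of the sample
standard deviation (`statistics.stdev`) is an assumption; the glossary does not say
whether the sample or the population form is meant.

```python
from statistics import mean, stdev

# Hypothetical measurements (not data from the study).
measurements = [8.0, 10.0, 12.0]


def coefficient_of_variation(values):
    """Return the standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


cv = coefficient_of_variation(measurements)
print(f"CV = {cv:.1f}%")
```

Here the mean is 10 and the sample standard deviation is 2, so the CV is 20%.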

correlation: a measure of association between two variables, similar to the idea of things
being proportional to each other.

Delta Score: (noun) a statistic of mensural reliability developed during this study. The
Delta score uses pairs of specimens taken from single plants to represent genetically
identical specimens grown in similar environments. Multiple sets of these pairs from
any number of plants are used in the analysis. The mean (average) is calculated for
each pair (see Figure 1). The absolute value of the difference between the mean and one
of the pair members (either one since they are the same) is recorded as the delta (∆).
The delta is divided by the mean to arrive at a Delta percent. The distribution of the
whole set of Delta percents, most of which will be near zero, is examined to find the
median Delta Percent. This median is called the Delta Score.
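
The steps in this definition can be sketched in a few lines of Python. This is only an
illustration: the function names `delta_percent` and `delta_score` and the pair values
are hypothetical, not taken from the study.

```python
from statistics import median


def delta_percent(first, second):
    """Delta percent for one pair of measurements taken from a single plant."""
    pair_mean = (first + second) / 2
    delta = abs(pair_mean - first)  # the same value results from either pair member
    return delta / pair_mean


def delta_score(pairs):
    """Median of the Delta percents over all pairs."""
    return median([delta_percent(a, b) for a, b in pairs])


# Hypothetical leaf-length pairs (mm), one pair per plant.
pairs = [(10.0, 10.0), (9.0, 11.0), (8.0, 12.0)]
score = delta_score(pairs)
print(f"Delta Score = {score:.1%}")
```

The three pairs give Delta percents of 0%, 10%, and 20%, so the Delta Score (their
median) is 10%.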

FIGURE 1. CALCULATION OF THE DELTA SCORES. Delta score is the median delta percentage.
Histogram of style tip length was compressed to be at a similar scale to the leaf length histogram.
[Figure panels: calculation of Delta percent (∆ %); distribution of ∆ %s for leaf length,
median = 0.058 = 5.8%; distribution of ∆ %s for style tip length, median = 0.0199 = 2%.]

dimension reduction: the effect of expressing the characteristics of an object or set of
objects for which there are several measured variables, in a format summarizing those
variables by a smaller number of variables. The simplest form of dimension
reduction is expressing two measures of an object as a fraction.

distance: a measure of the closeness of objects, where "closeness" might not be in
physical space but in terms of some other dimensions. Several distances are similar to
the familiar concept of Euclidean distance calculated by the Pythagorean theorem,
$c = \sqrt{a^{2} + b^{2}}$, where a and b are quantified differences between the objects
in two measures, and c is the calculated distance between the objects in two dimensions.
This concept is extended to many more than 2 measures to create n-dimensional
space, where n is the number of variables.
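
A minimal sketch of the Euclidean distance, first in two dimensions and then in four,
using made-up measurements. The function name `euclidean_distance` is chosen for this
example only.

```python
import math


def euclidean_distance(object_a, object_b):
    """Square root of the summed squared differences over all measures."""
    if len(object_a) != len(object_b):
        raise ValueError("both objects need the same number of measures")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(object_a, object_b)))


# Two measures: differences of 3 and 4 give the familiar 3-4-5 triangle.
two_d = euclidean_distance((0.0, 0.0), (3.0, 4.0))

# Four measures: the same rule applied in 4-dimensional space.
four_d = euclidean_distance((1.0, 2.0, 3.0, 4.0), (2.0, 3.0, 4.0, 5.0))

print(two_d, four_d)
```

The first call returns 5.0 and the second returns 2.0 (four squared differences of 1
sum to 4).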

eigenvalues: also known as characteristic roots or latent roots of a square matrix, these
values correspond to a stretch/shrink factor associated with transformation of a
matrix. Eigenvalues show the measure of the strength of the axis or the amount of the
total sample variance accounted for in that axis. Higher eigenvalues in the first
components or axes indicate that more of the variance is accounted for in those axes
and are preferred to an outcome where the eigenvalues gradually trail off.

eigenvectors: also known as characteristic vectors or latent vectors of a square matrix,
these numbers correspond to a change in direction associated with transformation of a
matrix. Eigenvectors are normally shown as columns of coefficients showing the
relative contribution of the original variable to the component (a correlation between
the axis in the new matrix and the original variable). These are used to identify the
important variables along the new axis.

frequency distribution: a summary of the data found by dividing the range of
measurements (minimum to maximum) into a given number of equal-sized units
(intervals) and then counting the number of measurements that fall into each interval.
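
The counting procedure can be sketched as follows. The data are invented, and placing
the maximum value in the last interval is one common convention assumed here, not
something the definition specifies.

```python
def frequency_distribution(values, n_intervals):
    """Count how many values fall into each of n equal-sized intervals."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("the range is zero, so it cannot be divided into intervals")
    width = (highest - lowest) / n_intervals
    counts = [0] * n_intervals
    for value in values:
        # The maximum would index one past the end, so clamp it into the last interval.
        index = min(int((value - lowest) / width), n_intervals - 1)
        counts[index] += 1
    return counts


# Hypothetical measurements ranging from 1 to 10, split into 3 intervals of width 3.
values = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
counts = frequency_distribution(values, 3)
print(counts)
```

With these values the intervals 1 to 4, 4 to 7, and 7 to 10 hold 4, 2, and 4
measurements respectively.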

histogram: a graphic for showing a frequency distribution. Usually the count of items in
an interval is shown by the height of a rectangle, but it could also be stacked icons,
alpha-numeric characters, or even shades of a color. A pair of histograms is
illustrated under Delta score.

kurtosis: in terms of a normal bell-shaped distribution, kurtosis is a measure of how high
the bell is relative to how wide it is at the base (in the "tails").

leptokurtic: relative to the normal bell-shaped curve, leptokurtic implies that there are
more items in the center (the bell is taller) and the "tails" shorter.

measurement error: the component of an observed score that accounts for the difference
between the observed score and the true score.

median: a measure of the center or middle of the dataset which is calculated by sorting
the data into ascending order and finding the observation that has an equal number of
observations on either side of it, thus the middlemost value in a distribution of data.
The median is not as influenced by outliers as the mean.
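
A small sketch of the sorting procedure, using invented values with one deliberate
outlier. For an even number of observations there is no single middle item;
`statistics.median` then averages the two middle values.

```python
from statistics import mean, median

# Hypothetical lengths (mm); 54.0 is a deliberate outlier.
lengths = [12.0, 10.0, 54.0, 11.0, 13.0]

ordered = sorted(lengths)            # [10.0, 11.0, 12.0, 13.0, 54.0]
middle = ordered[len(ordered) // 2]  # middlemost value for an odd count

print("median:", middle)             # same value as statistics.median(lengths)
print("mean:  ", mean(lengths))      # pulled upward by the outlier
```

The median is 12.0 while the mean is 20.0, which shows how a single outlier moves the
mean but not the median.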

metric: 1) (general usage - as used in this document) a quantitative measure that is
expressed on an interval or ratio scale; 2) (mathematical) a distance function
formalizing the idea of a "space" in which objects can occur: some function d is a
metric if it cannot be negative, d(1,2) = d(2,1) (symmetry), d(1,2) is less than or equal
to d(1,3) + d(2,3) (triangle inequality), and d(1,2) = 0 only if objects 1 and 2 are the
same.

nested: a condition of one thing being contained within another as in cities nested within
states, here used as 1) a concept of hierarchical order wherein one level of
classification is subordinate to another level of classification; 2) where a physical
structure is contained within a second physical structure.

nested variance components: variance components are, in general terms, the variance
accounted for by a source; they are estimated using the mean squares for random
effects and their expected values. In nested variance components, the sources of
variance are nested effects, so that in the city(state) model, there is an effect of being
in a state that all cities will be subject to, there is the effect of each city itself, and
there is also some variation that can't be assigned to an effect of either (model error).

non-normal frequency distribution: a distribution which is not normal, thus having a
less definitive relationship between the standard deviation and the mean, creating
problems in the use of any statistics that rely on a defined relationship between the
mean and the standard deviation. Most often non-normal distributions are so because
they are skewed, but also because they are kurtotic.

normal frequency distribution: technically a distribution is normal if it fits the normal
probability density function, but in non-technical terms a normal distribution is a
symmetrical "bell-shape" frequency distribution in which specific proportions of the
items in the distribution fall within given distances from the mean (e.g. 50% of the
items fall within ± 0.674 standard deviations of the mean).
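
The proportions mentioned here can be checked numerically with `statistics.NormalDist`
(available in Python 3.8 and later). This is only a sketch; the helper name
`fraction_within` is made up for the example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Number of standard deviations that encloses the middle 50% of the items.
half_width = standard_normal.inv_cdf(0.75)

# Fraction of items within 1 standard deviation of the mean.
within_one_sd = fraction_within(1.0)

print(f"{half_width:.3f}", f"{within_one_sd:.3f}")
```

The middle 50% of a normal distribution lies within about 0.674 standard deviations of
the mean, and about 68% lies within 1 standard deviation.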

ordination: in a secular (non-religious) context, ordination is the ordering of objects
along axes according to an estimate of similarity or dissimilarity between the objects.
The major objective of ordination is to achieve an effective dimension reduction,
expressing many-dimensional relationships in a smaller number of dimensions. This
amounts to finding the strongest structure in the data and using it to position objects
in the ordination space. Objects close in the ordination space are generally more similar
than objects distant in the ordination space.

outgroup: a group of organisms that is related to but removed from the study taxa.

outlier: an observation that is extremely small or large relative to the rest of the
observations. These may be due to errors of measurement, to recording errors, or
to aberrant individuals. If the sample is small, outliers can have a big effect on the
mean.

Pearson Correlation Coefficient: the most commonly used measure of correlation, used
with continuous data. The assumptions for tests based on it are that the data are
normally distributed and the variances are equal.
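
For readers who want to see the arithmetic, here is a minimal sketch of the Pearson
coefficient computed from its textbook formula. The function name `pearson_r` and the
data are invented; constant data (zero spread) would make the denominator zero and is
not handled.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their separate spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)


# Hypothetical data: y is exactly twice x, then the same values in reverse order.
xs = [1.0, 2.0, 3.0, 4.0]
r_up = pearson_r(xs, [2.0, 4.0, 6.0, 8.0])    # perfectly proportional
r_down = pearson_r(xs, [8.0, 6.0, 4.0, 2.0])  # perfectly inverse
print(r_up, r_down)
```

A value near +1 means the two variables rise together, a value near -1 means one falls
as the other rises, and a value near 0 means little linear association.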

platykurtic: a non-normal distribution that has fewer items in the center of the
distribution and more on the edges.

Principal Components Analysis (PCA): a dimension reduction technique for
describing objects for which one has a large number of measures/variables. This
technique uses a correlation matrix to find the linear combinations of original
variables which account for most of the variance and then represents the objects in
terms of these new linear combinations, which are called principal components axes.
The first axis will be defined to account for the maximum amount of variation
possible to depict in this way and each subsequent axis will sequentially account for a
maximum of the remaining variation.
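
A small worked case, added for illustration and not taken from the study: with only two
standardized variables whose correlation is r (assumed here to be positive), the
correlation matrix R and its principal components can be written down exactly. The
symbols R, λ, and v are notation chosen for this example.

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
\lambda_1 = 1 + r, \quad
\mathbf{v}_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
\lambda_2 = 1 - r, \quad
\mathbf{v}_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}
```

The eigenvalues sum to 2 (the number of variables), so the first axis accounts for
(1 + r)/2 of the total variance. For r = 0.8 that is 90%, leaving 10% for the second
axis. The equal coefficients in the first eigenvector show that both original variables
contribute equally to that axis.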

p-value: the statistic associated with the probability of getting this result (or one more
extreme) in the sample you took, if there were really no differences or correlations
between the things you are studying. Typically a p-value of 0.05 or less is taken to
indicate that some difference or correlation exists.

quantile: points at which various percentages of the total sample are above or below; the
median (50% quantile) and the quartiles (quantiles at 25% steps) are both quantiles. Besides
dividing the sample into halves or quarters, the quantile points can be defined to
divide the sample into any number of parts.

quartile: the set of quantiles that divides the number of items into four equal-sized sets;
25 percent of the items are between the 1st quartile and the median.
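
A brief sketch of finding quartiles with `statistics.quantiles` (Python 3.8 and later).
The sample is invented, and the function's default "exclusive" method is assumed; other
software may use slightly different conventions and give slightly different cut points.

```python
from statistics import median, quantiles

# Hypothetical sample: the whole numbers 1 through 11.
items = list(range(1, 12))

# n=4 asks for the three cut points that split the sample into four parts.
quartiles = quantiles(items, n=4)
first_quartile, second_quartile, third_quartile = quartiles

print(quartiles)
```

For this sample the cut points are 3, 6, and 9, and the second quartile is the median.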

range: the interval between the minimum and maximum observations.

reliability: the extent to which a measure represents the "true" measure; a measure of
consistency between repeated measurements of the same thing.

r-value: the correlation coefficient; this value shows the nature and size of the
relationship between two variables.

skewness: a measure of how asymmetric, or lop-sided, a frequency distribution is. A
skewed distribution is non-normal.

standard deviation: a measure of variability equal to the square root of the variance. In
a normal distribution, approximately 68% of the sample will fall within 1 standard
deviation (above and below) of the mean. In non-normal distributions this measure is
less useful.

standardize: a transformation of the data which makes it proportional to the standard
deviation. Most often standardizing means subtracting the mean of all the
observations from each observation, and dividing the result by the standard
deviation of all the observations.
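
The two steps of standardizing can be sketched as below. The values are made up, and
the sample standard deviation (`statistics.stdev`) is an assumption of this example.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean from each observation, then divide by the standard deviation."""
    center = mean(values)
    spread = stdev(values)
    return [(value - center) / spread for value in values]


# Hypothetical observations with mean 10 and sample standard deviation 2.
z_scores = standardize([8.0, 10.0, 12.0])
print(z_scores)
```

The results are -1.0, 0.0, and 1.0: the first observation lies one standard deviation
below the mean, the second is at the mean, and the third lies one standard deviation
above it.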

stem and leaf plot: (also called a stemplot) a graphic device for showing a frequency
distribution, but where the intervals are defined by the first digits of the numbers and
the items are shown as the final digits of the numbers. For example if the items
ranged from 1 to 100, the intervals would be units of ten, labeled as 0, 1, 2,…, 10, and
the items would be shown as the last digit of each number; for example 73 would
show up as a 3 in the 7 interval.
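
A minimal sketch of building such a plot for whole numbers, with the tens as stems and
the final digit as leaves. The items are invented, and intervals that contain no items
are simply left out of the output in this version.

```python
def stem_and_leaf(values):
    """Group whole numbers by their leading digits (stem), keeping the last digit (leaf)."""
    plot = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot.setdefault(stem, []).append(leaf)
    return plot


# Hypothetical items between 1 and 100.
items = [7, 12, 15, 23, 23, 73, 100]
plot = stem_and_leaf(items)

for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

In the printed plot, 73 appears as a 3 in the row labeled 7, and 100 appears as a 0 in
the row labeled 10.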

symplesiomorphy: a character shared among a group of individuals which is found in
their common ancestor and thought to have originated in that ancestor.

Taxonomic Distance: a specific type of distance defined by the equation
$E_{ij} = \sqrt{\frac{1}{n} \sum_{k} (x_{ki} - x_{kj})^{2}}$. It is the most commonly
used distance measure.
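
A sketch of this distance in code, under a stated assumption: the equation is read here
as the average taxonomic distance, that is, the square root of the mean squared
difference between objects i and j over the n variables. If the intended formula
differs, the code would need to change accordingly. The function name and data are
hypothetical.

```python
import math


def taxonomic_distance(object_i, object_j):
    """Assumed form: sqrt of the mean squared difference over the n variables."""
    n = len(object_i)
    if n == 0 or n != len(object_j):
        raise ValueError("both objects need the same, non-zero number of variables")
    squared_sum = sum((x_ki - x_kj) ** 2 for x_ki, x_kj in zip(object_i, object_j))
    return math.sqrt(squared_sum / n)


# Hypothetical objects measured on four variables, each differing by 1.
e_ij = taxonomic_distance((1.0, 2.0, 3.0, 4.0), (2.0, 3.0, 4.0, 5.0))
print(e_ij)
```

Because the sum is divided by n, the result (1.0 here) does not grow merely because more
variables are measured, unlike the plain Euclidean distance.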

tokogenetic: genetic relationships within a species caused by family lines, patterns of
interbreeding, etc.

transform: to change the scale on which the observations occur. Common transformations
include subtracting the mean from each item (which changes the observations to have
a mean of zero), dividing by the standard deviation (which makes them a proportion of
the standard deviation), or converting them to logarithmic values.

tree: a diagram of hierarchical associations between things.

Ultrametric distance: technically a distance function like a metric, but with an additional
property. Here the significance is that this concept of distance allows a concise
description of a branching pattern. In morphometrics, ultrametric functions are used
to quantify clustering dendrograms, but it was suggested that this could also be used
to compare vein branching patterns.

unordered pair: a pair of measurements of similar things for which there is no reason to
have one first and the other second.

variance component: a measure of the variability in the dataset that corresponds to or is
accounted for by a particular source of variation.

weighting: using a coefficient to emphasize some part of the data set. Weighting is used
in ordination to control what variables are likely to fall on the major axis of
variation.

z-score: the numerical result of transforming the data by subtracting the mean from each
observation (thus centering it on zero) and dividing it by the standard deviation
(making it an expression of how many standard deviations away the observation is
from the mean).
