
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/307598241

User’s Guide for Online Program SpadeR (Species-richness Prediction And Diversity Estimation in R)

Code · August 2016


DOI: 10.13140/RG.2.2.20744.62722


All content following this page was uploaded by Anne Chao on 04 September 2016.



Original Version (March 2015)
Latest Version (August 2016)

User’s Guide for Online Program SpadeR
(Species-richness Prediction And Diversity Estimation in R)
by
Anne Chao, K. H. Ma, T. C. Hsieh and Chun-Huo Chiu
Institute of Statistics
National Tsing Hua University, TAIWAN 30043

Table of Contents:

Overview
Data Input Formats
    Data Formats for One Community
        Type (1): Abundance (or Frequency) Data
        Type (1A): Abundance-Frequency Counts Data
        Type (2): Incidence-Frequency Data Based on Multiple Sampling Units
        Type (2A): Incidence-Frequency Counts Data
        Type (2B): Incidence-Raw Data
    Data Formats for Two Communities
        Type (1), (2) and (2B)
    Data Formats for More Than Two Communities
        Type (1), (2) and (2B)
    Genetics Data
        Type (1)
Part I: Species (Species Richness Estimation in One Community)
    Example 1.1: Birds Data (Abundance Data)
    Example 1.1A: Vascular Plant Data (Abundance-Frequency Counts Data)
    Example 1.2: Seed-bank Data (Incidence-Frequency Data)
    Example 1.2A: Cotton Tail Rabbit Data (Incidence-Frequency Counts Data)
    Example 1.2B: Cotton Tail Rabbit (Incidence-Raw Data)
Part II: Diversity Profile Estimation (in One Community)
    Example 2.1: Rain Forest Data (Abundance Data)
    Example 2.1A: Beetles Data (Abundance-Frequency Counts Data)
    Example 2.2: Ants Data (Incidence-Frequency Data)
    Example 2.2A: Seed-bank Data (Incidence-Frequency Counts Data)
    Example 2.2B: Seed-bank Data (Incidence-Raw Data)
Part III: Shared Species (Shared Species Richness Estimation in Two Communities)
    Example 3.1: Birds Data in Two Estuaries (Abundance Data)
    Example 3.2: The Hong Kong Big Bird Race Data (Incidence-Frequency Data)
    Example 3.2B: The Hong Kong Big Bird Race Data (Incidence-Raw Data)
Part IV: Two-Community (Similarity) Measures
    Example 4.1: Rain Forest Data (Abundance Data)
    Example 4.2: Ants Data (Incidence-Frequency Data)
    Example 4.2B: Soil Ciliates Data (Incidence-Raw Data)
Part V: Multiple-Community (Similarity) Measures
    Example 5.1: Rain Forest Data (Abundance Data)
    Example 5.2: Ants Data (Incidence-Frequency Data)
    Example 5.2B: Soil Ciliates Data (Incidence-Raw Data)
Part VI: Genetics (Differentiation) Measures
    Example 6.1: Hypothetical Allele Frequency Data (Abundance Data)
    Example 6.2: Human Allele Frequency Data (Abundance Data)
Acknowledgements
References
Appendix (Mathematical formulas and details)

Overview

The program SpadeR is the R-based online version of SPADE, available via
http://chao.stat.nthu.edu.tw/wordpress/software_download/ or https://chao.shinyapps.io/SpadeR/.
Either link directs you to the online interface window. Users do not need to
learn or understand R to run SpadeR. The interactive web application was built using Shiny (a
web application framework for R). SpadeR includes nearly all of the important features of the
original program SPADE, with the added advantages of expanded output displays and
simplified data input formats. Further, owing to the power of the R language, SpadeR offers
high-resolution plots/figures that the original SPADE lacks. Note, however, that some features of
SPADE have been expanded into an independent online program, iNEXT; see below.

Like the original SPADE, the program SpadeR computes various biodiversity indices based on
two major types of sample data (abundance data and replicated incidence data) taken from one or
multiple communities. This user guide illustrates how to run this program in an easily accessible
way through numerical examples with proper interpretations of portions of the output. SpadeR is
divided into six parts:

• Part I: Species (estimating species richness for one community).

• Part II: Diversity Profile Estimation (estimating a continuous diversity profile and
various diversity indices in one community). This expanded part also features plots of
empirical and estimated continuous diversity profiles.

• Part III: Shared Species (estimating the number of shared species between two
communities).

• Part IV: Two-Community (Similarity) Measures (estimating various similarity indices
between two assemblages). Both richness- and abundance-based two-community
similarity indices are included.

• Part V: Multiple-Community (Similarity) Measures (estimating various N-community
similarity indices). Both richness- and abundance-based N-community similarity
indices are included.

• Part VI: Genetics (Differentiation) Measures (applying Part V to estimate allelic
dissimilarity/differentiation among sub-populations based on multiple-population
genetics data).

NOTE: Part III of the original SPADE (Prediction) has been expanded to become an independent
program known as iNEXT (iNterpolation and EXTrapolation) which provides diversity estimates
for rarefied and extrapolated samples up to a maximum sample size or sample completeness
specified by the user. The program iNEXT also features seamless plots of sample-size- and
coverage-based rarefaction and extrapolation sampling curves. Users can download iNEXT from
the same links given above.

The running procedures for each part are illustrated with numerical examples with interpretations.
There are several estimators and predictors available in each part. To gain familiarity with the
program, we suggest that users first run the demo data sets included in SpadeR and check the
output with that given in this guide. Part of the output for each example is also interpreted in this
guide to help users understand the numerical results. The formulas for each estimator or predictor
with relevant references are provided in the Appendix. The user is referred to the books by
Magurran (1988, 2004) and the review papers by Chao (2005), Chao et al. (2014a, b), Chao and
Chiu (2012, 2016a, b) for pertinent background on biodiversity measures and statistical analyses.
These papers can be directly downloaded from Anne Chao’s website.

Please do not use SpadeR in any commercial form or distribute it to other people. Instead,
potential users should access the program directly through the SpadeR webpage (see above). If
you publish your work based on results from SpadeR, please make references to the relevant
papers mentioned in the following sections and also use the following reference to cite SpadeR:

Chao, A., Ma, K. H., Hsieh, T. C. and Chiu, C. H. (2015) Online Program SpadeR
(Species-richness Prediction And Diversity Estimation in R). Program and User’s Guide
published at http://chao.stat.nthu.edu.tw/wordpress/software_download/.

The SpadeR interface window is shown below in Figure 1. Along the top menu, there are six
selection tabs: Species (Part I), Diversity Profile Estimation (Part II), Shared Species (Part III),
Two-Community Measures (Part IV), Multiple-Community Measures (Part V) and Genetics
Measures (Part VI). Throughout this document, we use “community” and “assemblage”
interchangeably.

Figure 1. The Interface Window of SpadeR Online

Along the second row (output) menu, there are five output selection tabs:

• In the “Estimation” tab panel, basic data information and various estimates are shown for
the demo/uploaded data. You can click “download as txt file” at the bottom of the output to
download these estimates as a file.

• The functionality of the “Visualization” tab panel varies by part. For the Species and Shared
Species parts, this tab panel shows various estimates and their confidence intervals to
facilitate clear comparisons, while for the Diversity Profile Estimation part, it shows
plots of the empirical and estimated diversity profiles. These figures can be
downloaded by clicking “download as PNG file” at the bottom of the displayed figure.

• The “Data Viewer” tab panel displays the first ten records of the
demo/uploaded data. The entire demo/uploaded data can be downloaded by clicking
“download as txt file” at the bottom of the displayed data.

• In the “Introduction” tab panel, users can view a brief introduction to SpadeR and a summary
of the running procedures.

• In the “User Guide” tab panel, a link directs users to this user guide.

Data Input Formats

In the left panel of the SpadeR window, you are asked to select your data type. Two major types
of sampling data (abundance data and incidence data) are supported. For abundance data, sample
size means “the number of individuals”; for incidence data, sample size means “the number of
sampling units.” Data are classified into the following five types: (See later sections for SpadeR
data input formats.)

Type (1) Data. Abundance (or frequency) data: data for each community include sample
abundances/frequencies for all observed species in an empirical sample of individuals.

Type (1A) Data. Abundance-frequency counts data (a simplified version of abundance data): data
for each community include sample species abundance-frequency counts (f1, f2, ∙∙∙, fr), where r
denotes the maximum frequency and fk denotes the number of species represented by exactly k
individuals/times in the sample.

Type (2) Data. Incidence-frequency data: data for each community include the observed species
incidence frequencies (the number of detections or the number of sampling units in which a
species is detected) based on detection/non-detection in multiple sampling units.

Type (2A) Data. Incidence-frequency counts data (a simplified version of incidence-frequency
data): data for each community include sample incidence-frequency counts (Q1, Q2, ∙∙∙, Qr),
where Qk denotes the number of species that were detected in exactly k sampling units, while r
denotes the number of sampling units in which the most frequent species were found.

Type (2B) Data. Incidence-raw data (raw detection/non-detection data): data for each community
consist of a species-by-sampling-unit matrix (typically, “1” denotes a detection and “0” denotes a
non-detection). When there are N communities, input data consist of an augmented
species-by-sampling-unit matrix.

All five data types can be used in Part I and Part II. Three data types (1, 2 and 2B) are
supported in Parts III, IV and V. For Part VI, only abundance data are supported. See below for a
summary of the data types supported in each part.

Analysis part                           Data types/formats options

Part I: Species                         One community, Type (1), (1A), (2), (2A) and (2B)
Part II: Diversity Profile Estimation   One community, Type (1), (1A), (2), (2A) and (2B)
Part III: Shared Species                Two communities, Type (1), (2) and (2B)
Part IV: Two-Community Measures         Two communities, Type (1), (2) and (2B)
Part V: Multiple-Community Measures     Multiple communities, Type (1), (2) and (2B)
Part VI: Genetics Measures              Multiple communities, Type (1) only

If your data are stored in a database or spreadsheet program, you must save your data file as an
ASCII text file to run SpadeR. Data entries (meaning the actual numbers) must be separated by at
least one blank space or a tab, or placed in separate rows (with a “return” at the end of each row). If
you have comma-delimited data (or data delimited with other symbols, e.g., semicolons), you
need to replace the delimiters with a blank space ( ) or a tab (→), for example by using the
find-and-replace function of a word processing program.
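
The replacement of delimiters can also be scripted. The following Python sketch is one possible way to do it and is not part of SpadeR; the function name `to_whitespace_delimited` is hypothetical, and the sketch assumes that commas and semicolons are the only delimiters to be replaced. It stops with an error on an empty cell, because an empty cell should be filled in by hand rather than silently dropped.

```python
import re

def to_whitespace_delimited(text):
    """Return the text with commas/semicolons between entries replaced by one blank space."""
    out = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip completely empty lines
        # Split at every comma or semicolon and trim the spaces around each entry.
        entries = [e.strip() for e in re.split(r"[,;]", line)]
        if any(e == "" for e in entries):
            raise ValueError(f"empty cell on line {line_no}")
        # Collapse any remaining runs of blanks or tabs inside an entry.
        out.append(" ".join(" ".join(e.split()) for e in entries))
    return "\n".join(out) + "\n"

print(to_whitespace_delimited("39,42\n0;70\n"), end="")
```

To convert a file, read its contents into a string, pass the string to the function, and write the result to a new text file.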

The following sections explain how the different ASCII text files need to be formatted so that SpadeR
can read them properly. They concern only how the data files must
be structured; their use when actually running the analyses is explained in later sections
(beginning with “Running Procedures” for Part I).

These sections on data input formats are divided into four parts: the first part explains data input
formats for one community, and the second, third and fourth parts explain, respectively, data
input formats for two communities, more than two communities, and genetics data.

Data Formats for One Community/Assemblage

When your data originate from only one community or assemblage, the program accepts the
following five types of data input formats:

Type (1): Species Abundance (or Frequency) Data

Demo Data at Species Part (Example 1.1)


For any species found in the sample, this type of data includes the number of times (frequency)
that the species was discovered, or the number of individuals (abundance) representing the
species in the sample. We use a set of birds data from Magurran (1988, p. 152) as an illustrative
example; the data are stored in demo data for Part I (Species). The data set contains 25 observed
species, and their observed frequencies are respectively 752, 276, 194, 126, 121, 97, 95, 83, 72,
44, 39, 16, 15, 13, 9, 9, 9, 8, 7, 4, 2, 2, 1, 1, 1 (that is, the first species is represented by 752
individuals, the second species by 276 individuals, ..., and the last three species are each
represented by only one individual). The required data entry format is shown below. First, the
frequency data must be entered in a column. Second, if species names or any identification codes
are recorded in your original data, they must be removed to conform to the data format shown
below. Third, rows with frequency 0 for an unobserved species may be included, but that species
will not have any effect on the analysis. We deliberately add some zeros to the preceding demo
data frequencies in Part I for illustrative purposes. Frequencies do not have to be ordered
according to magnitude, as can be seen in the example below. The number in each row needs to
be followed by a return symbol, as in the example below:

752
276
194
126
121
97
95
83
72
44
39
0
16
15
0
13
9
9
9
8
7
4
0
0
2
2
1
1
1

Type (1A): Abundance-Frequency Counts Data

Species frequency data are often classified by their frequencies using the simple form
(f1, f2, ∙∙∙, fr), where r denotes the maximum frequency and fk denotes the number of species
represented by exactly k individuals/times in the sample. The statistics (f1, f2, ∙∙∙, fr) are often
referred to as “frequencies of frequencies” or “frequency counts”. As an example, for the above
bird data set, f1 = 3 (there are three species represented by one individual), f2 = 2 (there are two
species represented by two individuals), f3 = 0 (there are no species represented by three
individuals), f4 = 1, f5 = 0, ∙∙∙, f752 = 1, with the maximum frequency r = 752. Again, for data entry,
we only require the non-zero values of fk’s. Data must be arranged in the following order:
(1 f1 2 f2 ∙∙∙ r fr), where r denotes the maximum frequency. For example, the data entry for the
bird data from above is (1 f1 2 f2 ∙∙∙ r fr) = (1 3 2 2 4 1 7 1 ∙∙∙ 752 1); each number needs to be
separated by at least one blank space or separated by rows. (You may include zero frequencies in
the input; the analysis will not be affected.)
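
The conversion from Type (1) data to Type (1A) data can be sketched in a few lines of Python. This is only an illustration, not SpadeR code; the function name `abundance_to_frequency_counts` is hypothetical. The input is the bird demo column from the Type (1) section, including the zeros that were added for illustration.

```python
from collections import Counter

def abundance_to_frequency_counts(abundances):
    """Return the entries [1, f1, 2, f2, ...] in increasing k, omitting zero counts."""
    # Zeros stand for unobserved species and do not contribute to any fk.
    counts = Counter(x for x in abundances if x > 0)
    entries = []
    for k in sorted(counts):
        entries.extend([k, counts[k]])
    return entries

birds = [752, 276, 194, 126, 121, 97, 95, 83, 72, 44, 39, 0, 16, 15, 0, 13,
         9, 9, 9, 8, 7, 4, 0, 0, 2, 2, 1, 1, 1]
entries = abundance_to_frequency_counts(birds)
print(" ".join(str(v) for v in entries))
```

The printed line begins with 1 3 2 2 4 1 7 1 and ends with 752 1, in agreement with the data entry given above.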

Demo Data at Species Part (Example 1.1A)


The demo data for frequency counts are taken from Miller and Wiegert (1989) who documented
the species abundance distribution of endangered and rare vascular plant species in the central
portion of the southern Appalachian region. A total of 188 species were recorded out of 1008
individuals compiled over a span of 150 years of field observations. The frequency counts are
shown in the following table:

i    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
fi  61  35  18  12  15   4   8   4   5   5   1   2   1   2   3

i   16  19  20  22  29  32  40  43  48  67
fi   2   1   2   1   1   1   1   1   1   1

That is, 61 species were represented by only one individual, 35 species were represented by two
individuals, and so on. The most abundant species was represented by 67 individuals.
Therefore, the maximum frequency r = 67. The data (1 f1 2 f2 ∙∙∙ r fr) are read as: (1 61 2 35 3
18 4 12 ∙∙∙ 67 1). Here the first pair (1, 61) indicates that there are 61 singletons, the second pair
(2, 35) indicates there are 35 doubletons, and so on, with the last pair (67, 1) indicating that there
is one species that is represented by 67 individuals.
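
A quick consistency check of a frequency counts file is to recover the number of observed species (the sum of the fk) and the number of individuals (the sum of k times fk). The Python sketch below does this for the vascular plant table; it is an illustration only, and the function name `summarize_frequency_counts` is hypothetical.

```python
def summarize_frequency_counts(entries):
    """Return (observed species, individuals) for entries ordered as 1 f1 2 f2 ... r fr."""
    if len(entries) % 2 != 0:
        raise ValueError("entries must come in (k, fk) pairs")
    pairs = list(zip(entries[0::2], entries[1::2]))
    species = sum(fk for _, fk in pairs)          # each fk counts species
    individuals = sum(k * fk for k, fk in pairs)  # each of those species has k individuals
    return species, individuals

plants_text = ("1 61 2 35 3 18 4 12 5 15 6 4 7 8 8 4 9 5 10 5 11 1 12 2 13 1 14 2 15 3 "
               "16 2 19 1 20 2 22 1 29 1 32 1 40 1 43 1 48 1 67 1")
plants = [int(v) for v in plants_text.split()]
print(summarize_frequency_counts(plants))  # (188, 1008)
```

The result matches the 188 species and 1008 individuals reported for this data set.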

Remark: Note that the input format based on species frequency counts for SpadeR is different
from that for the original SPADE. This is because SPADE was written using C++, and, as such,
the input requires additional information about the maximum frequency r and the number of
non-zero frequencies m. For SpadeR, the input format is (1 f1 2 f2 ∙∙∙ r fr) whereas the input
format for SPADE is (r m 1 f1 2 f2 ∙∙∙ r fr).
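
For users who keep files in both formats, the difference described in the remark can be sketched as follows. This Python snippet is an assumption-laden illustration, not code from either program: the function name `spader_to_spade_counts` is hypothetical, and it assumes that zero counts are dropped before m is computed.

```python
def spader_to_spade_counts(entries):
    """Turn [1, f1, 2, f2, ...] into [r, m, 1, f1, 2, f2, ...] (zero counts dropped)."""
    if len(entries) % 2 != 0:
        raise ValueError("entries must come in (k, fk) pairs")
    pairs = sorted((k, fk) for k, fk in zip(entries[0::2], entries[1::2]) if fk > 0)
    if not pairs:
        raise ValueError("no non-zero frequency counts found")
    r = pairs[-1][0]   # maximum frequency
    m = len(pairs)     # number of non-zero frequency counts
    out = [r, m]
    for k, fk in pairs:
        out.extend([k, fk])
    return out

print(spader_to_spade_counts([1, 3, 2, 2, 3, 0, 4, 1]))  # [4, 3, 1, 3, 2, 2, 4, 1]
```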

Demo Data at Diversity Profile Estimation Part (Example 2.1A)
Janzen (1973) tabulated many data sets of tropical foliage insects from sweep samples in Costa
Rica. Here we select one beetle data set from the Osa old-growth forest site. In the data set, there
were 112 species among 237 individuals. The frequency counts are shown in the following table:

i    1   2   3   4   5   6   7   8  14  42
fi  84  10   4   3   5   1   2   1   1   1

Here the most abundant beetle species was represented by 42 individuals. Therefore, the
maximum frequency r = 42. The data (1 f1 2 f2 ∙∙∙ r fr) are read as: (1 84 2 10 3 4 4 3 ∙∙∙ 42 1).
Here the first pair (1, 84) indicates that there are 84 singletons, the second pair (2, 10) indicates
there are 10 doubletons, and so on, with the last pair (42, 1) indicating that there is one species
that is represented by 42 individuals.

Type (2): Incidence-Frequency Data


Demo Data at Species Part (Example 1.2)
Assume that in one community, several independent sampling units are randomly selected, and
the presence/absence (or detection/non-detection) of each species is recorded for each
sampling unit. In contrast to Type (1) data above, here we enter the total number of detections
for each species (i.e., species incidence frequencies or the number of sampling units in which a
species is detected; each frequency cannot exceed the total number of sampling units). As an
example, we use the seed-bank data (Butler and Chazdon, 1998) which are stored in demo data
for Part I (Species). There were 121 soil samples (each soil sample is regarded as a sampling unit)
and species of seedlings that germinated from each soil sample were recorded. A total of 34
species of seedlings were found in the 121 soil samples. The incidence frequencies for the
observed species should be entered in a column as listed below. Notice that for this type of data,
the first entry must be the number of sampling units. Then, beginning with the second row, each
row denotes the total number of detections of a given species in all the sampling units. In this
example, the second row denotes that the first species was detected in only one soil sample (a
“unique” species), and the last row denotes that the most frequent species was detected in 61 soil
samples.

121
1
1
1
2
2
3
3
3
4
4
4
5
6
6
6
6
6
7
8
9
9
9
10
11
11
13
17
24
24
43
43
47
52
61

Although the above frequencies are arranged from the least frequent to the most frequent species,
their ordering is irrelevant in the analysis.

Type (2A): Incidence-Frequency Counts Data

Demo Data at Diversity Profile Estimation Part (Example 2.2A)


Species incidence frequency data based on T sampling units are often classified by their
frequencies using the incidence counts (Q1, Q2, ∙∙∙, Qr), where Qk denotes the number of species
that were detected in exactly k sampling units in the data and r denotes the number of sampling
units in which the most frequent species were found. Thus, Q1 and Q2 represent the number of
species that occur in exactly one sampling unit (“uniques”) or in exactly two sampling units
(“duplicates”), respectively. Data are to be arranged in the following order:
(T 1 Q1 2 Q2 ∙∙∙ r Qr), where the first entry T denotes the number of sampling units. Each
number needs to be separated by at least one blank space or by rows. For the seed-bank data, the
incidence frequency counts should be read as:

121 1 3 2 2 3 3 4 3 5 1 6 5 7 1 8 1 9 3 10 1
11 2 13 1 17 1 24 2 43 2 47 1 52 1 61 1

Note that only the non-zero values of Qk are required to be entered. (The analysis will not be
affected by adding zero frequencies.) The first entry, indicating that there are 121 soil samples, is
followed by the 18 pairs (1, 3), (2, 2), (3, 3), (4, 3), (5, 1), (6, 5), and so on, up to (61, 1). Here (1,
3) indicates that there are 3 unique species, (2, 2) indicates there are 2 duplicate species, and so
on, with (61, 1) indicating that there is one species found in 61 soil samples.
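
The step from the Type (2) column to the Type (2A) entries can be sketched in Python as follows. The snippet is illustrative only and not part of SpadeR; the function name `incidence_to_frequency_counts` is hypothetical. The input is the seed-bank column from the Type (2) section, with the number of sampling units as its first entry.

```python
from collections import Counter

def incidence_to_frequency_counts(column):
    """Convert [T, y1, y2, ...] into [T, 1, Q1, 2, Q2, ...], omitting zero counts."""
    T, freqs = column[0], column[1:]
    if any(y < 0 or y > T for y in freqs):
        raise ValueError("each incidence frequency must lie between 0 and T")
    # Species with frequency 0 were never detected and are left out.
    counts = Counter(y for y in freqs if y > 0)
    entries = [T]
    for k in sorted(counts):
        entries.extend([k, counts[k]])
    return entries

seed_bank = [121, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 8, 9, 9, 9,
             10, 11, 11, 13, 17, 24, 24, 43, 43, 47, 52, 61]
counts_entry = incidence_to_frequency_counts(seed_bank)
print(" ".join(str(v) for v in counts_entry))
```

The printed line reproduces the seed-bank entry shown above, starting with 121 1 3 2 2 and ending with 52 1 61 1.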

Demo Data at Species Part (Example 1.2A)


We purposely use a data set based on capture-recapture data to illustrate another important
application. There is a simple analogy between species richness estimation for a multiple-species
assemblage and population size estimation for a single species. An “individual” animal in
capture-recapture studies corresponds to a “species” in richness estimation. In the former, the
estimating target is population size, while in the latter, it is species richness. Also, the capture
probability in a capture-recapture experiment corresponds to species detection probability, which
is defined as the chance of encountering at least one individual of a given species. Therefore,
species richness estimators can be directly applied to estimate population size for
capture-recapture data.

In the context of population size estimation, there are some well-known real capture-recapture
data sets collected from populations of known size. We use the cottontail capture-recapture
data provided in Edwards and Eberhardt (1967) for illustration. They conducted a live-trapping
study on a confined cottontail population of known size. In their study, 135 cottontail rabbits
were penned in a 4-acre rabbit-proof enclosure. Live trapping was conducted for 18 consecutive
nights. Marking or tagging is needed to distinguish individuals. Often, a unique tag was attached
to each individual so that the capture history of each individual captured in the experiment was
known. A total of 142 captures were recorded for the 76 distinct rabbits. For these data, the
incidence frequency counts (Q1 to Q7) were 43, 16, 8, 6, 0, 2, 1. The input data should be read as
follows: (18 1 43 2 16 3 8 4 6 6 2 7 1); each number needs to be separated by at least one blank
space or separated by rows. Here the pair (1, 43) indicates that 43 rabbits were captured exactly
once (the analogue of “unique” species), the next pair (2, 16) indicates that 16 rabbits were
captured exactly twice (the analogue of “duplicate” species), and so on. You may include zero
frequencies, e.g., (18 1 43 2 16 3 8 4 6 5 0 6 2 7 1); the analysis will not be affected.

Remark: The input format based on species incidence frequency counts for SpadeR is different
from that of the original SPADE. This is because SPADE was written using C++, and, as such,
the input requires additional information about the maximum frequency r and the number of
non-zero frequencies m. That is, for SpadeR, the input format is (T 1 Q1 2 Q2 ∙∙∙ r Qr)
whereas the format for SPADE is (T r m 1 Q1 2 Q2 ∙∙∙ r Qr), where r denotes the maximum
frequency and m denotes the number of non-zero values of Qk’s.

Type (2B): Incidence-Raw Data


Demo Data at Species Part (Example 1.2B)
The original cottontail rabbit data discussed above are stored as demo data for the Species part. A
total of 76 distinct individuals were found in 18 trapping nights. The incidence-raw data consist
of a capture/non-capture matrix (where “1” means a capture and “0” means a non-capture) with
76 rows and 18 columns. The first five records (rows) of the 76 x 18 raw data matrix are shown
below:

1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

Note that Type (2A) data (incidence-frequency counts) consist only of the frequency counts of the row
sums of this original raw matrix.
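
The relation between the raw matrix, its row sums and the frequency counts of those row sums can be sketched in Python as follows. The snippet is an illustration only; both function names are hypothetical, and the input is limited to the five rows displayed above rather than the full 76 x 18 matrix.

```python
from collections import Counter

def raw_to_incidence_frequencies(matrix):
    """Return [T, y1, y2, ...]: the number of columns T followed by the row sums."""
    T = len(matrix[0])
    for row in matrix:
        if len(row) != T or any(v not in (0, 1) for v in row):
            raise ValueError("every row needs T entries, each 0 or 1")
    return [T] + [sum(row) for row in matrix]

def incidence_to_frequency_counts(column):
    """Return [T, 1, Q1, 2, Q2, ...] for the non-zero row sums, omitting zero counts."""
    T, freqs = column[0], column[1:]
    counts = Counter(y for y in freqs if y > 0)
    entries = [T]
    for k in sorted(counts):
        entries.extend([k, counts[k]])
    return entries

first_five_text = """\
1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
"""
first_five = [[int(v) for v in line.split()] for line in first_five_text.splitlines()]
row_sums = raw_to_incidence_frequencies(first_five)
print(row_sums)                                 # [18, 6, 2, 6, 3, 2]
print(incidence_to_frequency_counts(row_sums))  # [18, 2, 2, 3, 1, 6, 2]
```

For these five rows, two rabbits were captured twice, one was captured three times and two were captured six times.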

Demo Data at Diversity Profile Estimation Part (Example 2.2B)


The original seed-bank data (Butler and Chazdon, 1998) discussed earlier are stored as demo data
for Diversity Profile Estimation Part. A total of 34 species of seedlings were found in the 121 soil
samples. The incidence-raw data consist of a detection/non-detection matrix (where “1” means a
detection and “0” means a non-detection) with 34 rows and 121 columns. You can download and
view the original data using the “Data Viewer” panel along the second row menu. Note that Type
(2) data (incidence-frequency) only consist of the row sums of this original raw matrix, and Type
(2A) data only consist of the incidence-frequency counts of those row sums.

Data Formats for Two Communities/Assemblages

When your data originate from two communities or assemblages, the program accepts three
different types of data input formats:

Type (1): Species Abundance Data

Assume that one sample of individuals is taken from each of two communities. Each sample
contains different numbers of individuals for several species. These data must be stored as a
species (in rows) by community (in two columns) matrix file. The first column denotes the
abundance (or frequency) of a species discovered in Community 1, and the second column
denotes the abundance (or frequency) of the same species in Community 2. The two abundances
are separated by at least one blank space or a tab.

Demo Data at Shared Species Part (Example 3.1)


The demo data for Part III (Shared Species) are bird frequency records of two estuaries. The data
are arranged in two columns as follows:

39 42
0 70
0 1
842 616
. .
. .
1 1

In this dataset, the first species was observed 39 times in Community 1 and 42 times in
Community 2, the second species was not observed in Community 1 whereas it was observed 70
times in Community 2, and the last species was observed once in each community. Each row lists
the frequencies or abundances for a specific species and different rows refer to frequencies of
different species. SpadeR cannot handle missing data, which are empty cells without numbers.
Therefore, when a species was not observed or detected in a community, you must enter the
frequency 0 (as, e.g., in the second and third row of the first column above). Again, SpadeR
cannot handle species identification names so they need to be removed from the data file. Our
analysis does not depend on species identification names.
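
Before uploading a species-by-community file, it can help to check that every row has one non-negative integer per community, with no empty cells and no species names. The Python sketch below is one possible check and is not part of SpadeR; the function name `parse_community_matrix` is hypothetical, and the example uses only the rows displayed above, not the full demo data.

```python
def parse_community_matrix(text, n_communities):
    """Parse whitespace-delimited rows and check that each has n_communities counts."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        entries = line.split()
        if not entries:
            continue  # ignore blank lines
        if len(entries) != n_communities:
            raise ValueError(
                f"line {line_no}: expected {n_communities} entries, found {len(entries)}")
        if not all(e.isascii() and e.isdigit() for e in entries):
            raise ValueError(f"line {line_no}: entries must be non-negative integers")
        rows.append([int(e) for e in entries])
    return rows

shown_rows = "39 42\n0 70\n0 1\n842 616\n1 1\n"
rows = parse_community_matrix(shown_rows, n_communities=2)
# Species observed in both communities among the rows shown.
observed_in_both = sum(1 for a, b in rows if a > 0 and b > 0)
print(len(rows), observed_in_both)  # 5 3
```

A row such as "Sparrow 39 42" or a row with a single number raises an error, which points to the line that needs to be fixed.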

Type (2): Incidence-Frequency Data


Assume that in each of two focal communities, several sampling units are randomly taken, and
the detection/non-detection of each species is recorded in each sampling unit. In
contrast to Type (1) data above, here we enter the total number of detections for each species
(which cannot exceed the total number of sampling units).

Demo Data at Shared Species Part (Example 3.2)

We here use the Hong Kong Big Bird Race (BBR) data, which are stored in demo data for Part III
(Shared Species); see http://www.wwf.org.hk/en/getinvolved/hkbbr/ for data sources. The Hong
Kong BBR is an annual competition among teams of birdwatchers. The challenge is to record as
many bird species as possible during a fixed interval of time in the Hong Kong territory. For
example, 16 teams competed in 2015 while 17 teams competed in 2016. The two-column data
are shown below:

16 17
1 0
3 8
. .
8 1
. .
. .
0 0
0 1

In this data format, the first row denotes the number of sampling units (16 and 17 teams,
respectively, as each team is regarded as a sampling unit). Beginning with the second row, the
entry for the first column denotes the total number of detections of a given species in all
sampling units taken from Community 1, and the entry for the second column denotes the total
number of detections of the same species in all sampling units taken from Community 2. To
illustrate, in the above data, the first species entered as (1, 0) was observed by only one team in
2015 and was not observed by any team in 2016. The second species entered as (3, 8) was
observed by three teams in 2015 and was observed by eight teams in 2016. A species with record
(0, 0) was not observed by any team in 2015 or 2016, although the species was
listed in the bird checklist. Such records will not affect our statistical analysis. Finally, the
species in the last row was not observed by any team in 2015 and was observed by only one team
in 2016.

Type (2B): Incidence-Raw Data


Demo Data at Shared Species Part (Example 3.2B)
For the above Hong Kong BBR data, we use a 280-species checklist, although only 250 species
were found in the pooled data of the two years. (That is, there were 30 species in the checklist
that were not observed by any team in the two-year data.) The raw data for the two years consist
of a 280 x 33 matrix of 1s (detections) and 0s (non-detections); each row of the matrix
refers to the detection/non-detection records for the same species so that the information about
shared species can be computed. If these data were user-uploaded data, then in running SpadeR,
users must specify in the left panel the number of sampling units taken from each community.
For the above BBR data, users must type in “16 17” (separated by at least one space) to specify
that 16 and 17 teams (sampling units) were taken from Community 1 and Community 2,
respectively. Also, when a species checklist is not available in data collection, users
must first merge the multiple-community data by species identity to obtain a pooled list of species;
the rows of the input data then refer to this pooled list.
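
How the pooled raw matrix and the sampling-unit specification fit together can be sketched in Python. This is only an illustration of the bookkeeping, not SpadeR code: the function name `split_raw_matrix` is hypothetical, and the small matrix is made-up data with 2 and 3 sampling units (for the BBR data the specification would be "16 17").

```python
def split_raw_matrix(matrix, units_spec):
    """Return incidence-frequency rows: [T1, T2, ...] first, then one row per species."""
    sizes = [int(s) for s in units_spec.split()]
    if any(len(row) != sum(sizes) for row in matrix):
        raise ValueError("each row must have as many entries as the total number of units")
    result = [sizes]
    for row in matrix:
        freqs, start = [], 0
        for size in sizes:
            # Detections of this species within the columns of one community.
            freqs.append(sum(row[start:start + size]))
            start += size
        result.append(freqs)
    return result

toy_matrix = [
    [1, 0, 0, 0, 0],  # detected in community 1 only
    [1, 1, 1, 0, 1],  # detected in both communities
    [0, 0, 0, 0, 0],  # on the checklist but never detected
]
print(split_raw_matrix(toy_matrix, "2 3"))  # [[2, 3], [1, 0], [2, 2], [0, 0]]
```

The output has the same layout as the two-community Type (2) data: the first row gives the numbers of sampling units, and each later row gives the incidence frequencies of one species.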

Data Formats for More Than Two Communities/Assemblages

When your data originate from more than two communities or assemblages, the program accepts
three types of data input format:

Type (1): Species Abundance Data

Demo Data at Two- and Multiple-Community Measures Parts (Examples 4.1 and 5.1)
This data format is simply an extension of that for two communities (see above). Instead of two
columns, we have N columns if there are N communities or assemblages. For example, the data
stored in demo data for Part V (Multiple-Community Measures) are frequencies of an assemblage
of seedlings (column 1), an assemblage of saplings (column 2) and an assemblage of trees
(column 3) recorded in a site called LEP old-growth rain forest in Costa Rica. The original data
were collected by Dr. Robin Chazdon and colleagues and discussed in Chao et al. (2005, 2008).

17 48 7
14 38 0
. . .
. . .
1 0 0

In each of these N communities, one sample of individuals is taken; each sample contains
different numbers of individuals for several species. Again, these data must be stored as a species
(in rows) by community (in columns) matrix file. The first column denotes the frequency (or
abundance) of a species discovered in Community 1, the second column denotes the frequency
(or abundance) of the same species in Community 2, and so on. The columns need to be
separated by at least one blank space or a tab. In this data set, the first species is represented by
17 seedlings, 48 saplings and 7 trees; the second species is represented by 14 seedlings, 38
saplings and no trees; and the last species is represented by 1 seedling, no saplings and no trees.
Again, missing data are not allowed; therefore, you must store 0 in the data file if no observation
was made for that species (i.e., 0 for an undetected species). Again, currently SpadeR cannot
handle species names so they need to be removed from the data file.
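
As a minimal sketch of the format rules above (this is not part of SpadeR; the function name `read_abundance_matrix` is hypothetical), the following Python snippet parses space- or tab-separated text into a species-by-community matrix and rejects species names, negative counts and rows with missing entries:

```python
def read_abundance_matrix(text):
    """Parse whitespace-separated counts; one row per species, one column per community."""
    rows = []
    for line_no, line in enumerate(text.strip().splitlines(), start=1):
        fields = line.split()  # splits on any run of blanks or tabs
        if not fields:
            continue  # ignore blank lines
        try:
            counts = [int(x) for x in fields]
        except ValueError:
            raise ValueError(
                f"line {line_no}: non-numeric entry (species names must be removed)"
            ) from None
        if any(c < 0 for c in counts):
            raise ValueError(f"line {line_no}: counts must be non-negative")
        if rows and len(counts) != len(rows[0]):
            raise ValueError(
                f"line {line_no}: expected {len(rows[0])} columns "
                "(store 0 for an undetected species)"
            )
        rows.append(counts)
    return rows


# First two rows and last row of the seedling/sapling/tree demo data shown above.
matrix = read_abundance_matrix("17 48 7\n14\t38 0\n1 0 0")
print(matrix)
```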

Type (2): Incidence-frequency Data

This data format is also simply an extension of that for two communities (see above). Instead of
two columns, we have N columns if there are N communities or assemblages. In each of these N
communities or assemblages, several independent sampling units are taken, and the
detection/non-detection of each species is recorded in each sampling unit.

Demo Data at Two- and Multiple-Community Measures Parts (Examples 4.2 and 5.2)
Here, we use a data set of tropical rainforest ants collected in Costa Rica (Longino et al. 2002) to
illustrate the data format. Ants were captured by using three techniques: (a) Berlese extraction of
soil samples (217 samples), (b) fogging samples from canopy fogging (459 samples), and (c)
Malaise trap samples for flying and crawling insects (62 samples). Therefore, each sampling
technique, representing a different assemblage, is presented in one of the three columns:

217 459 62
1 0 0
0 27 3
. . .
. . .

0 12 0

Again, in this data format, the first row denotes the number of sampling units (217, 459 and 62
denote the number of sampling units for Berlese, fogging, and Malaise, respectively). Beginning
with the second row, the three numbers in each row denote, respectively, the total number of
detections of a given species in 217 Berlese samples, 459 fogging samples, and 62 Malaise
samples. Each row of the matrix refers to the incidence frequencies of the same species in the
three communities. To illustrate, the first species was captured only in one Berlese sample and in
none of the fogging and Malaise samples; the second species was not recorded in any of the
Berlese samples, but was recorded in 27 fogging samples and 3 Malaise samples. The last species
was only captured in 12 fogging samples.
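
The layout just described can be checked with a short Python sketch (not part of SpadeR; `read_incidence_frequency` is a hypothetical helper). It treats the first row as the numbers of sampling units and verifies that no incidence frequency exceeds the number of sampling units in its column:

```python
def read_incidence_frequency(text):
    """Return (sampling units per community, incidence-frequency rows per species)."""
    rows = [[int(x) for x in line.split()]
            for line in text.strip().splitlines() if line.split()]
    units, freqs = rows[0], rows[1:]
    for i, row in enumerate(freqs, start=1):
        if len(row) != len(units):
            raise ValueError(f"species {i}: expected {len(units)} columns")
        for count, t in zip(row, units):
            # A species cannot be detected in more sampling units than were taken.
            if not 0 <= count <= t:
                raise ValueError(f"species {i}: frequency {count} outside 0..{t}")
    return units, freqs


# First row plus the three species discussed in the text above.
units, freqs = read_incidence_frequency("217 459 62\n1 0 0\n0 27 3\n0 12 0")
print(units, freqs)
```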

Type (2B): Incidence-Raw Data

Demo Data at Two- and Multiple-Community Measures Parts (Examples 4.2B and 5.2B)
This data set includes soil ciliate species detection/non-detection data for a total of 51 soil
samples from three areas (Etosha Pan, Central Namib Desert and Southern Namib Desert) of
Namibia, Africa. The original data are provided in Foissner et al. (2002, p. 58-63). The numbers
of soil samples are respectively 19, 17 and 15 in the three areas. The observed numbers of
species in Etosha Pan, Central Namib Desert and Southern Namib Desert are respectively 216,
130 and 150. A total of 304 species were recorded in the pooled data.

The species checklist in the original data includes 365 ciliate species, although only 304 species
were found in the pooled data of the three areas. The incidence raw data are stored in a 365 x 51
matrix of 0’s and 1’s (“0” denotes non-detection while “1” denotes detection); each row of the matrix
refers to the detection/non-detection records for the same species so that the information about
shared species can be computed. If these data were user-uploaded data, then in running SpadeR,
users must specify in the left panel the number of sampling units taken from each community;
users must type in “19 17 15” (separated by at least one space) to specify the number of sampling
units taken from each community. Also, in cases where a species checklist is not available in
data collection, users must first merge multiple-community data by species identity to obtain a
pooled list of species; then the rows of the input data refer to this pooled list.
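
To make the relationship between raw incidence data and incidence-frequency data concrete, here is a minimal Python sketch (not SpadeR code; the function name and the toy matrix are hypothetical). It assumes that the columns of the raw matrix are ordered community by community, in the same order as the sampling-unit counts typed into the left panel:

```python
def raw_to_incidence_frequency(raw, units_per_community):
    """Collapse a species-by-sampling-unit 0/1 matrix into incidence frequencies."""
    total_units = sum(units_per_community)
    table = []
    for i, row in enumerate(raw, start=1):
        if len(row) != total_units:
            raise ValueError(f"species {i}: expected {total_units} columns, got {len(row)}")
        if any(x not in (0, 1) for x in row):
            raise ValueError(f"species {i}: entries must be 0 or 1")
        freqs, start = [], 0
        for t in units_per_community:
            # Number of sampling units of this community in which the species was detected.
            freqs.append(sum(row[start:start + t]))
            start += t
        table.append(freqs)
    return list(units_per_community), table


# Toy example: 3 species, 2 sampling units in community 1 and 3 in community 2.
raw = [
    [1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
]
header, table = raw_to_incidence_frequency(raw, [2, 3])
print(header, table)
```

For the ciliate data the second argument would be `[19, 17, 15]`, matching the “19 17 15” entry described above.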

Genetics Data

Type (1): Allele Frequency Data

Demo Data at Genetics Measures Part (Example 6.1)


When your data are allele frequency data, the program accepts only one input format. Store your
data as an allele (row) by subpopulation (column) matrix file. In many genetic data sets, only
allele proportions are given. To use SpadeR, you must convert proportional data to frequency
data (the actual number of sampled instances of allele j).
As an example, one of the demo data sets in Part VI includes allele frequency counts in locus
D3S2427 from four human subpopulations (BiakaPyg, Palestin, Bedouin and Druze); see
Rosenberg et al. (2002) for details. The allele frequencies for the four subpopulations are listed
below:

0 6 3 1
11 0 0 0
5 0 0 0
. . . .
. . . .

2 0 0 0
0 0 0 0

Each row of the matrix refers to the abundances of the same allele in the four subpopulations. For
any allele which is not found in a subpopulation, you must enter the frequency 0 in the data file.
For example, as the first allele was not found in the BiakaPyg subpopulation, an entry of 0 was entered. The last
allele was not found in any of the four chosen subpopulations (although it was found in another
subpopulation whose data are not shown in this example); therefore, we have four 0’s in the last
row. Such alleles will not affect our statistical analysis.
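
The conversion from allele proportions to frequency counts mentioned above can be sketched as follows. This is only an illustration: it assumes the number of sampled allele copies in the subpopulation is known, and the proportions and sample size used here are hypothetical, not values from Rosenberg et al. (2002).

```python
def proportions_to_counts(proportions, sample_size):
    """Convert allele proportions for one subpopulation into integer counts."""
    counts = [round(p * sample_size) for p in proportions]
    # If rounding does not reproduce the sample size, the proportions were probably
    # reported with too few digits; stop rather than silently adjust the data.
    if sum(counts) != sample_size:
        raise ValueError(
            f"rounded counts sum to {sum(counts)}, expected {sample_size}"
        )
    return counts


# Hypothetical subpopulation: 20 sampled allele copies, four alleles in the pooled list.
counts = proportions_to_counts([0.55, 0.25, 0.20, 0.0], 20)
print(counts)
```

An allele that is absent from the subpopulation keeps an explicit 0, as the data format requires.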

Part I: Species (Species Richness Estimation in One Community)

Running Procedures

Species identification names/codes or site/community labels in your original data must be
removed to conform to the SpadeR data formats.

Step 1. Select the “Species” tab from the top menu.


Step 2. Select Abundance data, Frequency counts, Incidence-frequency data,
Incidence-frequency counts or Incidence-raw data from the Data Setting.
Step 3. Check the Demo data radio button to load data (you can load your own data by
checking Upload data).
Step 4. Press the Run! button to get the output.

In the following, we run separate examples for each of the five types of data input formats.

Example 1.1. Using the Birds Data to Demo Type (1) Abundance Data

We use the bird data described in Type (1) Data in the Section “Data Formats for One
Community” to show the output:

OUTPUT:
(1) BASIC DATA INFORMATION:

Variable Value
Sample size n 1996
Number of observed species D 25
Coverage estimate for entire data set C 0.998
CV for entire data set CV 1.916

Variable Value
Cut-off point k 10
Number of observed individuals for rare group n_rare 53
Number of observed species for rare group D_rare 11
Estimate of the sample coverage for rare group C_rare 0.943
Estimate of CV for rare group in ACE CV_rare 0.629
Estimate of CV1 for rare group in ACE-1 CV1_rare 0.74
Number of observed individuals for abundant group n_abun 1943
Number of observed species for abundant group D_abun 14

f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
Frequency counts 3 2 0 1 0 0 1 1 3 0

(2) SPECIES RICHNESS ESTIMATORS TABLE:

Estimate s.e. 95%Lower 95%Upper


Homogeneous Model 25.660 0.954 25.082 30.295
Homogeneous (MLE) 25.000 0.975 25.000 28.500
Chao1 (Chao, 1984) 27.249 3.394 25.266 44.030
Chao1-bc 25.999 1.817 25.094 35.673
iChao1 (Chiu et al. 2014) 27.249 3.394 25.266 44.030
ACE (Chao & Lee, 1992) 26.920 2.367 25.292 37.639
ACE-1 (Chao & Lee, 1992) 27.399 3.163 25.336 42.153
1st order jackknife 27.998 2.449 25.739 37.171
2nd order jackknife 28.998 4.240 25.730 46.915

(3) DESCRIPTION OF ESTIMATORS/MODELS:

Homogeneous Model: This model assumes that all species have the same abundances or discovery probabilities. See Eq. (2.3)
of Chao and Lee (1992) or Eq. (7a) of Chao and Chiu (2016b).

Homogeneous (MLE): An approximate maximum likelihood estimate under homogeneous model. See Eq. (1.1) and Eq. (1.2) of Chao
and Lee (1992) or Eq. (3) of Chao and Chiu (2016b).

Chao1 (Chao, 1984): This approach uses the numbers of singletons and doubletons to estimate the number of undetected species
because undetected species information is mostly concentrated on those low frequency counts; see Chao (1984), and Chao
and Chiu (2012, 2016a,b).

Chao1-bc: A bias-corrected form for the Chao1 estimator; see Chao (2005) or Eq. (6b) of Chao and Chiu (2016b).

iChao1: An improved Chao1 estimator; see Chiu et al. (2014).

ACE (Abundance-based Coverage Estimator): A non-parametric estimator proposed by Chao and Lee (1992) and Chao, Ma and Yang
(1993). The observed species are separated as rare and abundant groups; only data in the rare group is used to estimate
the number of undetected species. The estimated CV of the species in rare group characterizes the degree of heterogeneity
among species discovery probabilities. See Eq. (2.14) in Chao and Lee (1992) or Eq. (7b) of Chao and Chiu (2016b).

ACE-1: A modified ACE for highly-heterogeneous communities when CV of the entire data set > 2 and species richness > 1000.
See Eq. (2.15) in Chao and Lee (1992).

1st order jackknife: It uses the number of singletons to estimate the number of undetected species; see Burnham and Overton
(1978).

2nd order jackknife: It uses the numbers of singletons and doubletons to estimate the number of undetected species; see
Burnham and Overton (1978).

95% Confidence interval: A log-transformation is used for all estimators so that the lower bound of the resulting interval
is at least the number of observed species. See Chao (1987).
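
The log-transformed interval mentioned in the last item can be sketched in Python. The formula is not spelled out in this section, so the form below is an assumption: it is the log-normal interval for the number of undetected species usually attributed to Chao (1987). With the rounded Chao1 estimate and standard error from the table it gives an interval close to the tabulated (25.266, 44.030). The function name is hypothetical.

```python
import math


def log_transform_ci(estimate, se, observed, z=1.96):
    """Interval for species richness whose lower bound cannot fall below `observed`.

    Assumes (estimate - observed) is log-normally distributed.
    """
    undetected = estimate - observed
    if undetected <= 0 or se <= 0:
        raise ValueError("this sketch only covers estimate > observed and se > 0")
    ratio = math.exp(z * math.sqrt(math.log(1.0 + (se / undetected) ** 2)))
    return observed + undetected / ratio, observed + undetected * ratio


# Chao1 row of the table above: estimate 27.249, s.e. 3.394, D = 25 observed species.
lower, upper = log_transform_ci(27.249, 3.394, 25)
print(round(lower, 2), round(upper, 2))
```

Because the interval is built around the undetected part of the estimate, it is asymmetric and its lower limit always exceeds the observed count.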

The output shown above is divided into three parts. The first part shows the basic data
information for the whole data set, including the total number of individuals, the observed number
of species, the coverage estimate, and the estimated coefficient of variation (CV) value (which is
always ≥ 0). The coverage estimate is an objective measure of sample completeness. It represents
the estimated fraction of the entire population of individuals in the community that belong to the
species represented in the sample. The CV measure is used to characterize the degree of
heterogeneity among species abundances or species discovery probabilities. CV = 0 would mean
that all species are homogeneous (i.e., they all have equal abundances or equal discovery
probabilities in the community). Therefore, the larger the CV, the greater the degree of
heterogeneity for species discovery probabilities. The other statistics listed in the first part for
species in the rare group are explained later because they are used to calculate two of the
estimators (ACE and ACE-1) involved in the second part of the output.

The second part lists estimates, standard errors and confidence intervals for each model or
estimator of species richness. A brief explanation of each model/estimator is provided below this
list in the third part, but for more details, refer to Colwell and Coddington (1994), Chao (2005)
and Chao and Chiu (2012, 2016a, b). The formulas for each model/estimator are provided in the
Appendix of this user’s guide.

The estimator proposed in Chao (1984) is referred to as the Chao1 estimator in the ecological
literature, e.g., in the program EstimateS (Colwell, 2013) and in Colwell and Coddington (1994).
The estimator Chao1 is a simple and useful estimator; it uses only the numbers of singletons and
doubletons to estimate the number of undetected species. It was derived as a lower bound of
species richness in Chao (1984). A greater lower bound based on the additional information of
tripletons and quadrupletons was recently derived by Chiu et al. (2014) and is referred to as the
iChao1 estimator (here the sub-index i stands for “improved”); see below and the Appendix for
formulas of these estimators. Although both Chao1 and iChao1 were derived as lower bounds of
species richness, they work satisfactorily as point estimators when the size n is sufficiently large
relative to species richness or when the very rare species (including singletons and undetected
species) have approximately the same abundances or discovery probabilities; see Chao and Chiu
(2016a, b). In these cases, their use as species richness estimators (and not just estimators of the
lower bound of species richness) can be justified.

The Abundance-based Coverage Estimator (ACE), proposed by Chao and Lee (1992) and Chao,
Ma and Yang (1993), uses the additional information contained in the frequencies (f3 up to fκ) to
estimate the number of undetected species, where κ is a cut-off point. In the left panel of Part I
window, you can specify the cut-off point which separates the observed species into two groups:
“rare” and “abundant”; the former group includes species observed at most κ times and the latter
includes those observed at least κ+1 times. The default cut-off point is κ = 10. That is, only the
frequency statistics (f1, f2, ∙∙∙, f10) are used to estimate the number of undetected species. The
basic concept is that the information contained in the already detected rare species carries almost
all useful information about the undetected species. We do not use the abundant species for the
estimation of undetected species because they would be observed in almost every sample
regardless of the observer and thus carry negligible information about the undetected species. In
the first part of the output, we display for any given cut-off point these κ frequencies along with
information about the two groups, as described below.

The ACE-1 estimator is a modified ACE estimator for a highly heterogeneous community. We
recommend the use of ACE-1 only if the CV for entire set of data exceeds 2.0 and the species
richness is large (> 1000). When the degree of heterogeneity is low to moderate, the ACE-1
estimator may be substantially greater than other estimates; in these cases, Chao1, iChao1 and
ACE are recommended as point estimators.

For the rare species group, the output of the first part (Basic Data Information) includes the
cut-off point, sample size n_rare, the observed rare species D_rare, the coverage estimate C_rare,
and two CV values (CV_rare for the ACE estimator, and CV1_rare for the ACE-1 estimator).
These statistics are needed to calculate the ACE and ACE-1 estimators; see the Appendix for the
formulas of these statistics. For the abundant species group, the output includes only the sample
size n_abun and the observed species D_abun; only D_abun is needed in calculating the ACE and
ACE-1 estimators.
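
The group sizes reported in the first part of the output can be recovered from the frequency counts alone. The following Python sketch (a hypothetical helper, not SpadeR code) does this bookkeeping for the bird data with the default cut-off κ = 10; it does not compute the coverage or CV statistics, whose formulas are given in the Appendix.

```python
def rare_abundant_summary(freq_counts, n, D):
    """Split a sample into rare and abundant groups.

    freq_counts[i] is f_(i+1), the number of species observed exactly i+1 times;
    the list length is the cut-off point kappa.
    """
    n_rare = sum(k * f for k, f in enumerate(freq_counts, start=1))  # individuals
    D_rare = sum(freq_counts)                                        # species
    return {
        "n_rare": n_rare,
        "D_rare": D_rare,
        "n_abun": n - n_rare,
        "D_abun": D - D_rare,
    }


# f1..f10 from the Example 1.1 output, with n = 1996 individuals and D = 25 species.
summary = rare_abundant_summary([3, 2, 0, 1, 0, 0, 1, 1, 3, 0], n=1996, D=25)
print(summary)
```

The result agrees with the output table: 53 individuals and 11 species in the rare group, 1943 individuals and 14 species in the abundant group.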

For the demo example with the default cut-off κ = 10, based on the ACE, the estimated CV =
0.629 (which is the numerical value for the estimator CV_rare) signifies moderate heterogeneity
in species abundances or discovery probabilities. Based on the ACE-1, the estimated CV = 0.740
(which is the numerical value for the estimator CV1_rare) signifies slightly higher heterogeneity.
Due to the presence of heterogeneity, the estimate of species richness calculated from the
“Homogeneous Model” (assuming CV = 0) will underestimate the true number of species. The
Chao1 and iChao1 are identical for this data set. Their estimates are close to those based on ACE
and ACE-1. All four of the estimators (Chao1, iChao1, ACE and ACE-1) imply that two species
remain undetected, although their respective 95% confidence intervals differ slightly due to
difference in sampling models. Both Chao1 and iChao1 estimators produce a 95% confidence
interval ranging between 25.3 and 44.0 species. The corresponding 95% confidence intervals for
ACE and ACE-1 are (25.3, 37.6) and (25.3, 42.2), respectively.

Species richness estimates satisfy the following ordering: ACE-1 ≥ ACE, and iChao1 ≥ Chao1.
Although both Chao1 and iChao1 were derived as lower bounds, due to sampling errors and
possible negative bias for the CV estimate, it is sometimes the case that Chao1 and iChao1
estimates are both higher than ACE, as in this example.

Remark 1 It is generally more useful to provide an accurate lower bound than an unstable point
estimate of total species richness. The Chao1 lower bound has the form

$$
\hat{S}_{\mathrm{Chao1}} =
\begin{cases}
D + \dfrac{n-1}{n}\,\dfrac{f_1^2}{2 f_2}, & \text{if } f_2 > 0, \\[1ex]
D + \dfrac{n-1}{n}\,\dfrac{f_1 (f_1 - 1)}{2}, & \text{if } f_2 = 0,
\end{cases}
\;\approx\;
\begin{cases}
D + \dfrac{f_1^2}{2 f_2}, & \text{if } f_2 > 0, \\[1ex]
D + \dfrac{f_1 (f_1 - 1)}{2}, & \text{if } f_2 = 0,
\end{cases}
$$

where D denotes the number of distinct species observed in the sample. It is a very accurate
universally valid lower bound for any type of abundance distribution. The variance of this
estimator for f2 > 0 is approximately

$$
\widehat{\mathrm{var}}(\hat{S}_{\mathrm{Chao1}}) = f_2 \left[ \frac{K}{2} \left( \frac{f_1}{f_2} \right)^{2} + K^2 \left( \frac{f_1}{f_2} \right)^{3} + \frac{K^2}{4} \left( \frac{f_1}{f_2} \right)^{4} \right], \qquad K = \frac{n-1}{n},
$$

which can be subsequently used for constructing a confidence interval. In the special case of
f2 = 0, the Chao1 estimator is modified to its bias-corrected form; see Remark 2 below. The
iChao1 estimator, which uses the additional information of tripletons and quadrupletons to
estimate the number of undetected species, has the following form:

$$
\hat{S}_{\mathrm{iChao1}} = \hat{S}_{\mathrm{Chao1}} + \frac{n-3}{4n}\,\frac{f_3}{f_4}\, \max\left( f_1 - \frac{n-3}{2(n-1)}\,\frac{f_2 f_3}{f_4},\; 0 \right)
\approx \hat{S}_{\mathrm{Chao1}} + \frac{f_3}{4 f_4}\, \max\left( f_1 - \frac{f_2 f_3}{2 f_4},\; 0 \right).
$$

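
The Chao1 formulas of Remark 1 can be sketched in Python as a numerical check. This is an illustrative implementation with hypothetical function names, not SpadeR's own code. It uses the forms that keep the (n−1)/n and (n−3) factors, and it leaves the case f4 = 0 for iChao1 uncovered because this section does not say how that case is handled.

```python
import math


def chao1(D, n, f1, f2):
    """Chao1 lower bound from observed species D, sample size n, singletons, doubletons."""
    K = (n - 1) / n
    if f2 > 0:
        return D + K * f1 ** 2 / (2 * f2)
    return D + K * f1 * (f1 - 1) / 2


def chao1_variance(n, f1, f2):
    """Approximate variance of Chao1; the formula applies to f2 > 0 only."""
    if f2 <= 0:
        raise ValueError("variance formula requires f2 > 0")
    K = (n - 1) / n
    r = f1 / f2
    return f2 * (K / 2 * r ** 2 + K ** 2 * r ** 3 + K ** 2 / 4 * r ** 4)


def ichao1(D, n, f1, f2, f3, f4):
    """iChao1: Chao1 plus a non-negative correction based on f3 and f4."""
    if f4 <= 0:
        raise ValueError("f4 = 0 is not covered by this sketch")
    inner = f1 - (n - 3) / (2 * (n - 1)) * f2 * f3 / f4
    return chao1(D, n, f1, f2) + (n - 3) / (4 * n) * (f3 / f4) * max(inner, 0)


# Bird data of Example 1.1: D = 25, n = 1996, f1 = 3, f2 = 2, f3 = 0, f4 = 1.
estimate = chao1(25, 1996, 3, 2)
se = math.sqrt(chao1_variance(1996, 3, 2))
improved = ichao1(25, 1996, 3, 2, 0, 1)
print(round(estimate, 3), round(se, 3), round(improved, 3))
```

For these data the estimate and standard error agree with the Chao1 row of the output table (27.249 and 3.394), and because f3 = 0 the iChao1 correction is zero, which is why the two estimators coincide in Example 1.1.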
Remark 2 When very rare species (including singletons and undetected species in the sample) all
have approximately the same abundances or discovery probabilities, the Chao1 estimator is
a nearly unbiased point estimate of species richness; its bias can be evaluated and an
unbiased or “bias-corrected” estimator of the Chao1 estimator has the form

$$
\hat{S}^{*}_{\mathrm{Chao1}} = D + \frac{n-1}{n}\,\frac{f_1 (f_1 - 1)}{2 (f_2 + 1)},
$$

which we call Chao1-bc. This bias-corrected estimator is always obtainable even in the case
of f2 = 0. Thus, in the special case where f2 = 0, Chao1-bc is adopted as the Chao1 estimator.
The variance formula of Chao1-bc for f2 > 0 is approximately

$$
\widehat{\mathrm{var}}(\hat{S}^{*}_{\mathrm{Chao1}}) = \frac{K f_1 (f_1 - 1)}{2 (f_2 + 1)} + \frac{K^2 f_1 (2 f_1 - 1)^2}{4 (f_2 + 1)^2} + \frac{K^2 f_1^2 f_2 (f_1 - 1)^2}{4 (f_2 + 1)^4}.
$$

In the special case of f2 = 0, this variance formula should be adjusted slightly; the third term
in the above variance formula vanishes, but an additional term should be added as follows
(Chao and Chiu, 2016b):

$$
\widehat{\mathrm{var}}(\hat{S}^{*}_{\mathrm{Chao1}}) = \frac{K f_1 (f_1 - 1)}{2} + \frac{K^2 f_1 (2 f_1 - 1)^2}{4} - \frac{K^2 f_1^4}{4 \hat{S}^{*}_{\mathrm{Chao1}}}.
$$
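
A short Python sketch of the Chao1-bc estimator and its variance follows (hypothetical function names, not SpadeR code). One point is an assumption of this sketch: in the f2 = 0 case the additional term is subtracted, which is the form commonly given for this estimator.

```python
import math


def chao1_bc(D, n, f1, f2):
    """Bias-corrected Chao1; defined for f2 = 0 as well."""
    K = (n - 1) / n
    return D + K * f1 * (f1 - 1) / (2 * (f2 + 1))


def chao1_bc_variance(D, n, f1, f2):
    """Approximate variance of Chao1-bc, with the separate f2 = 0 adjustment."""
    K = (n - 1) / n
    term1 = K * f1 * (f1 - 1) / (2 * (f2 + 1))
    term2 = K ** 2 * f1 * (2 * f1 - 1) ** 2 / (4 * (f2 + 1) ** 2)
    if f2 > 0:
        term3 = K ** 2 * f1 ** 2 * f2 * (f1 - 1) ** 2 / (4 * (f2 + 1) ** 4)
        return term1 + term2 + term3
    # f2 = 0: the third term vanishes and the adjustment term is subtracted
    # (assumed sign, see the note above).
    return term1 + term2 - K ** 2 * f1 ** 4 / (4 * chao1_bc(D, n, f1, 0))


# Bird data of Example 1.1: D = 25, n = 1996, f1 = 3, f2 = 2.
estimate_bc = chao1_bc(25, 1996, 3, 2)
se_bc = math.sqrt(chao1_bc_variance(25, 1996, 3, 2))
print(round(estimate_bc, 3), round(se_bc, 3))
```

These values agree with the Chao1-bc row of the Example 1.1 output (25.999 and 1.817).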

Remark 3 Note in Remark 2 that the bias-corrected estimator is nearly “unbiased” when the very
rare species (including singletons and undetected species) have approximately the same
abundances or discovery probabilities. Otherwise, it is impossible to evaluate the bias of the
Chao1 estimator with respect to the true species richness because the heterogeneity pattern
for rare species is generally unknown. Only under a homogeneous case for very rare species,
should we use the unbiased form as a legitimate point estimate of total species richness.

Remark 4 Extensive simulations conducted by Chiu et al. (2014) based on various species
abundance models revealed that the two jackknife estimators typically underestimate when
the sample size is relatively small, but exceed the true species richness and overestimate at
larger sample sizes. Thus, there is a limited range of sample sizes (near crossing points)
where jackknife estimators are close to the true species richness. This is likely the reason
why many studies found a relatively good performance of the jackknife estimators. However,
this narrow range of good performance changes with each model, meaning that the
theoretical behavior of the jackknife estimators is not predictable. Outside this range, the
two jackknife estimators may have appreciable biases. The jackknife estimators also exhibit
counter-intuitive patterns: their bias, accuracy and coverage probability regularly do not
improve as sample size increases whereas the other estimators presented in the output
always improve.
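
For reference, the two jackknife estimators discussed in Remark 4 can be written in a few lines of Python. Their formulas are not stated in this section, so the forms below are an assumption: they are the standard first- and second-order jackknife expressions associated with Burnham and Overton (1978), and they reproduce the jackknife rows of the Example 1.1 output. The function names are hypothetical.

```python
def jackknife1(D, n, f1):
    """First-order jackknife: uses singletons only."""
    return D + (n - 1) / n * f1


def jackknife2(D, n, f1, f2):
    """Second-order jackknife: uses singletons and doubletons."""
    return D + (2 * n - 3) / n * f1 - (n - 2) ** 2 / (n * (n - 1)) * f2


# Bird data of Example 1.1: D = 25, n = 1996, f1 = 3, f2 = 2.
jk1 = jackknife1(25, 1996, 3)
jk2 = jackknife2(25, 1996, 3, 2)
print(round(jk1, 3), round(jk2, 3))
```

Both estimators add roughly a fixed multiple of the rare-species counts to D, whatever the sample size, which may help explain the sample-size dependence described in the remark.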

Example 1.1A. Using the Vascular Plant Data to Demo Type (1A) Abundance-Frequency
Counts Data

The plant data are described in Type (1A) Data in the Section “Data Formats for One
Community”. The running procedures are the same as those for Example 1.1 except that the data
entry type is different. The output and interpretation are also similar to those in Example 1.1. As
such, here we only show the main part of the output without interpretation.

OUTPUT:

(1) BASIC DATA INFORMATION:

Variable Value
Sample size n 1008
Number of observed species D 188
Coverage estimate for entire dataset C 0.94
CV for entire dataset CV 1.567

Variable Value
Cut-off point k 10
Number of observed individuals for rare group n_rare 515
Number of observed species for rare group D_rare 167
Estimate of the sample coverage for rare group C_rare 0.882
Estimate of CV for rare group in ACE CV_rare 0.715
Estimate of CV1 for rare group in ACE-1 CV1_rare 0.891
Number of observed individuals for abundant group n_abun 493
Number of observed species for abundant group D_abun 21

f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
Frequency counts 61 35 18 12 15 4 8 4 5 5

(2) SPECIES RICHNESS ESTIMATORS TABLE:

Estimate s.e. 95%Lower 95%Upper


Homogeneous Model 210.438 6.323 201.053 226.572
Homogeneous (MLE) 188.910 0.969 188.165 193.011
Chao1 (Chao, 1984) 241.104 17.849 215.967 288.836
Chao1-bc 238.783 17.099 214.716 284.530
iChao1 (Chiu et al. 2014) 254.136 10.632 236.358 278.448
ACE (Chao & Lee, 1992) 245.828 15.195 222.850 283.957
ACE-1 (Chao & Lee, 1992) 265.366 22.736 232.012 323.998
1st order jackknife 248.939 11.037 230.852 274.661
2nd order jackknife 274.923 19.105 244.787 321.050

Example 1.2. Using the Seed-bank Data to Demo Type (2) Incidence-Frequency Data

The seedlings data are described in Type (2) Data in the Section “Data Formats for One
Community”. There are 121 soil samples and 34 observed species. The running procedures are
the same as those in Example 1.1 except that the data entry type is different. The output shown
below is generally parallel to that in Example 1.1.

OUTPUT:
(1) BASIC DATA INFORMATION:

Variable Value
Number of sampling units T 121
Number of observed species D 34
Total number of incidences U 461
Coverage estimate for entire data set C 0.994
CV for entire data set CV 1.162

Variable Value
Cut-off point k 10
Total number of incidences for infrequent group U_infreq 115
Number of observed species for infrequent group D_infreq 23
Estimated sample coverage for infrequent group C_infreq 0.974
Estimated CV for infrequent group in ICE CV_infreq 0.384
Estimated CV1 for infrequent group in ICE-1 CV1_infreq 0.412
Number of observed species for frequent group D_freq 11

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
Incidence freq. counts 3 2 3 3 1 5 1 1 3 1

(2) SPECIES RICHNESS ESTIMATORS TABLE:

Estimate s.e. 95%Lower 95%Upper


Homogeneous Model 34.609 0.880 34.076 38.878
Chao2 (Chao, 1987) 36.231 3.370 34.263 52.900
Chao2-bc 34.992 1.805 34.093 44.606
iChao2 (Chiu et al. 2014) 36.723 2.403 34.615 46.053
ICE (Lee & Chao, 1994) 35.064 1.371 34.153 41.398
ICE-1 (Lee & Chao, 1994) 35.132 1.473 34.161 41.966
1st order jackknife 36.975 2.434 34.731 46.103
2nd order jackknife 37.975 4.193 34.730 55.652

(3) DESCRIPTION OF ESTIMATORS/MODELS:

Homogeneous Model: This model assumes that all species have the same incidence or detection probabilities. See Eq. (3.2)
of Lee and Chao (1994) or Eq. (12a) in Chao and Chiu (2016b).

Chao2 (Chao, 1987): This approach uses the frequencies of uniques and duplicates to estimate the number of undetected species;
see Chao (1987) or Eq. (11a) in Chao and Chiu (2016b).

Chao2-bc: A bias-corrected form for the Chao2 estimator; see Chao (2005).

iChao2: An improved Chao2 estimator; see Chiu et al. (2014)

ICE (Incidence-based Coverage Estimator): A non-parametric estimator originally proposed by Lee and Chao (1994) in the
context of capture-recapture data analysis. The observed species are separated as frequent and infrequent species groups;
only data in the infrequent group are used to estimate the number of undetected species. The estimated CV for species in
the infrequent group characterizes the degree of heterogeneity among species incidence probabilities. See Eq. (12b) of
Chao and Chiu (2016b), which is an improved version of Eq. (3.18) in Lee and Chao (1994). This model is also called Model(h)
in capture-recapture literature where h denotes heterogeneity.

ICE-1: A modified ICE for highly-heterogeneous cases.

1st order jackknife: It uses the frequency of uniques to estimate the number of undetected species; see Burnham and Overton
(1978).

2nd order jackknife: It uses the frequencies of uniques and duplicates to estimate the number of undetected species; see
Burnham and Overton (1978).

95% Confidence interval: A log-transformation is used for all estimators so that the lower bound of the resulting interval
is at least the number of observed species. See Chao (1987).

As with the abundance data, the output shown above is divided into three parts. The first part
shows the basic data information for the whole data set including the observed number of species,
the number of sampling units, the total number of incidences, the coverage estimate, and the
estimated coefficient of variation (CV) value (which is always ≥ 0). For any sampling unit, the
incidence-data model assumes that each species has its own unique incidence (or detection)
probability that is constant for any randomly selected sampling unit. The sample coverage
represents the fraction of incidence probabilities that are associated with the detected species. It
is an objective measure of sample completeness. The CV measure is used to characterize the
degree of heterogeneity among species incidence or detection probabilities. The other statistics
for the infrequent species group in the first part will be explained in later text because they are
used to calculate two of the estimators (ICE and ICE-1) involved in the second part of the output.

The second part of the output shown above lists all estimation results while the third part
describes estimators/models used in the program. All estimators considered in the example were
originally derived from the context of capture-recapture studies (e.g., estimating the number of
individual animals by tagging or marking) because there is a simple analogy between species
richness estimation and population size estimation, with the species-by-sampling-unit incidence
matrix being similar to the capture-recapture history matrix used for population size estimation.
The capture probabilities in a capture-recapture study thus correspond to detection or occurrence
probabilities of species, defined here as the chance of encountering at least one individual of a
given species.

As with the Chao1 and iChao1 estimators in abundance data, the corresponding estimators in
incidence data (Chao, 1987; Chiu et al. 2014) are called the Chao2 and iChao2 estimators. Like
the Chao1 estimator, the Chao2 estimator was developed as a lower bound, and it only uses the
numbers of uniques (species detected in only one sampling unit) and duplicates (species detected
in two sampling units) to estimate the number of undetected species. Chiu et al. (2014) derived
the corresponding iChao2 estimator, which is a greater lower bound; see below for the formulas
of these estimators.

Analogous to the ACE and ACE-1 estimators in abundance data, the corresponding estimators in
incidence data (Lee and Chao, 1994) are referred to as the Incidence-based Coverage Estimator
(ICE) and ICE-1 in the ecological literature. Like the abundance data, we also select a cut-off
point to separate the data into two groups, but here we call the groups the “frequent” and
“infrequent” groups (corresponding to “abundant” and “rare” groups in the abundance data). We
select a default cut-off of κ = 10 in all analyses. A different cut-off point κ can be specified in the left
panel of the SpadeR window. For a given value of κ, those species present in at least κ+1
sampling units are placed into the “frequent” group, while the other species, present in 1 to κ
sampling units, are placed in the “infrequent” group. Let Qk (incidence frequency count) denote
the number of species that were detected in exactly k samples. For a default cut-off, only the
counts (Q1, Q2 ,..., Q10 ) are used to estimate the number of undetected species. These counts
along with other information for “frequent” and “infrequent” groups are displayed in the first
part of the output shown above.

For the infrequent species group, the output in the first part includes the number of sampling
units T_infreq (the number of sampling units which include at least one infrequent species), the
observed infrequent species D_infreq, the total number of incidences for infrequent species, the
coverage estimate C_infreq, and two CV values (CV_infreq for the ICE, and CV1_infreq for the
ICE-1). These statistics are needed to calculate the ICE and ICE-1 estimators; see the Appendix
for the formulas of these estimators. We do not use the frequent species for the estimation of
undetected species because they would be observed in almost every sample and thus carry
negligible information about the undetected species. For the frequent species group, the output
includes only the observed species D_freq. That is, only D_freq is needed in calculating ICE and
ICE-1.

Using the seedlings data with a default cut-off, both Chao2 and iChao2 consistently estimate that
there were two undetected species, whereas ICE and ICE-1 consistently estimate that there was
only one undetected species. Their 95% confidence intervals differ to some extent.

Species richness estimates for incidence data satisfy the following ordering: ICE-1 ≥ ICE, and
iChao2 ≥ Chao2. Although both Chao2 and iChao2 were derived as lower bounds, due to
sampling errors and possible negative bias for CV estimate, it may be the case that Chao2 and
iChao2 are both higher than ICE, as in this example.

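Remark 1 The Chao2 estimator has the form

$$
\hat{S}_{\mathrm{Chao2}} =
\begin{cases}
S_{\mathrm{obs}} + \dfrac{T-1}{T}\,\dfrac{Q_1^2}{2 Q_2}, & \text{if } Q_2 > 0, \\[1ex]
S_{\mathrm{obs}} + \dfrac{T-1}{T}\,\dfrac{Q_1 (Q_1 - 1)}{2}, & \text{if } Q_2 = 0,
\end{cases}
$$

where S_obs is the number of observed species (written D in the output and elsewhere in this guide).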

Unlike in the Chao1 estimator, here the factor (T−1)/T cannot be neglected because T may not be
sufficiently large for incidence data. Like the Chao1 estimator, Chao2 returns a quite precise
lower bound. The variance of this estimator is approximately

$$
\widehat{\mathrm{var}}(\hat{S}_{\mathrm{Chao2}}) = Q_2 \left[ \frac{K}{2} \left( \frac{Q_1}{Q_2} \right)^{2} + K^2 \left( \frac{Q_1}{Q_2} \right)^{3} + \frac{K^2}{4} \left( \frac{Q_1}{Q_2} \right)^{4} \right],
$$

where K = (T − 1)/T. In the special case of Q2 = 0, the Chao2 estimator is modified to its
bias-corrected form; see Remark 2. The iChao2 estimator, which uses the additional information
of the incidence counts Q3 and Q4 to estimate the number of undetected species, has the following
form:

$$
\hat{S}_{\mathrm{iChao2}} = \hat{S}_{\mathrm{Chao2}} + \frac{T-3}{4T}\,\frac{Q_3}{Q_4}\, \max\left( Q_1 - \frac{T-3}{2(T-1)}\,\frac{Q_2 Q_3}{Q_4},\; 0 \right).
$$
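
As a numerical illustration of the Chao2 and iChao2 formulas in this remark, the following Python sketch (hypothetical function names, not SpadeR code) evaluates them for the seed-bank data of Example 1.2. The case Q4 = 0 is not covered because this section does not say how it is handled.

```python
def chao2(D, T, Q1, Q2):
    """Chao2 lower bound from observed species D, T sampling units, uniques, duplicates."""
    K = (T - 1) / T
    if Q2 > 0:
        return D + K * Q1 ** 2 / (2 * Q2)
    return D + K * Q1 * (Q1 - 1) / 2


def ichao2(D, T, Q1, Q2, Q3, Q4):
    """iChao2: Chao2 plus a non-negative correction based on Q3 and Q4."""
    if Q4 <= 0:
        raise ValueError("Q4 = 0 is not covered by this sketch")
    inner = Q1 - (T - 3) / (2 * (T - 1)) * Q2 * Q3 / Q4
    return chao2(D, T, Q1, Q2) + (T - 3) / (4 * T) * (Q3 / Q4) * max(inner, 0)


# Seed-bank data of Example 1.2: D = 34, T = 121, Q1 = 3, Q2 = 2, Q3 = 3, Q4 = 3.
est2 = chao2(34, 121, 3, 2)
est2_improved = ichao2(34, 121, 3, 2, 3, 3)
without_factor = 34 + 3 ** 2 / (2 * 2)  # Chao2 with the (T-1)/T factor dropped
print(round(est2, 3), round(est2_improved, 3), round(without_factor, 3))
```

The first two values agree with the Chao2 and iChao2 rows of the Example 1.2 output (36.231 and 36.723). Dropping the factor (T−1)/T would shift the Chao2 estimate to 36.25, a visible change in the second decimal even with 121 sampling units.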

Remark 2 When extremely infrequent species (including uniques in the sample and the
undetected) all have approximately the same detection probabilities in any sample, a
“bias-corrected” estimator has the following form:

$$
\hat{S}^{*}_{\mathrm{Chao2}} = D + \frac{T-1}{T}\,\frac{Q_1 (Q_1 - 1)}{2 (Q_2 + 1)}.
$$

The variance formula for this bias-corrected estimator for Q2 > 0 is approximately

$$
\widehat{\mathrm{var}}(\hat{S}^{*}_{\mathrm{Chao2}}) = \frac{K Q_1 (Q_1 - 1)}{2 (Q_2 + 1)} + \frac{K^2 Q_1 (2 Q_1 - 1)^2}{4 (Q_2 + 1)^2} + \frac{K^2 Q_1^2 Q_2 (Q_1 - 1)^2}{4 (Q_2 + 1)^4}.
$$

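In the special case that Q2 = 0, it is reduced to $\hat{S}^{*}_{\mathrm{Chao2}} = D + K Q_1 (Q_1 - 1)/2$ and the variance is
modified to:

$$
\widehat{\mathrm{var}}(\hat{S}^{*}_{\mathrm{Chao2}}) = \frac{K Q_1 (Q_1 - 1)}{2} + \frac{K^2 Q_1 (2 Q_1 - 1)^2}{4} - \frac{K^2 Q_1^4}{4 \hat{S}^{*}_{\mathrm{Chao2}}}.
$$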

Remarks analogous to Remark 3 and Remark 4 for abundance data apply to incidence data as
well.

Example 1.2A. Using the Cottontail Rabbit Data to Demo Type (2A) Incidence-frequency
Counts Data
The rabbit data are described in Type (2A) Data in the Section “Data Formats for One
Community”. The running procedures are the same as those for Example 1.2 except that the data
entry type is different. The output and interpretation are also similar to those in Example 1.2. As
such, here we only show the main part of the output without interpretation.

OUTPUT:

(1) BASIC DATA INFORMATION:

Variable Value
Number of sampling units T 18
Number of observed species D 76
Total number of incidences U 142
Coverage estimate for entire dataset C 0.71
CV for entire dataset CV 0.654

Variable Value
Cut-off point k 10
Total number for incidences in infrequent group U_infreq 142
Number of observed species for infrequent group D_infreq 76
Estimated sample coverage for infrequent group C_infreq 0.71
Estimated CV for infrequent group in ICE CV_infreq 0.662
Estimated CV1 for infrequent group in ICE-1 CV1_infreq 0.891
Number of observed species for frequent group D_freq 0

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
Incidence freq. counts 43 16 8 6 0 2 1 0 0 0

(2) SPECIES RICHNESS ESTIMATORS TABLE:

                               Estimate     s.e.   95%Lower   95%Upper
Homogeneous Model               107.060   10.201     92.587    134.161
Chao2 (Chao, 1987)              130.571   22.753    100.899    195.605
Chao2-bc                        126.167   20.718     99.048    185.193
iChao2 (Chiu et al. 2014)       139.901   15.602    115.874    178.408
ICE (Lee & Chao, 1994)          133.595   20.793    105.003    190.374
ICE-1 (Lee & Chao, 1994)        155.184   34.275    111.147    254.396
1st order jackknife             116.611    8.886    102.581    138.048
2nd order jackknife             141.448   14.872    118.160    177.598

Example 1.2B. Using the Cottontail Rabbit Data to Demo Type (2B) Incidence-Raw Data

The rabbit data are described in Type (2B) Data in the Section “Data Formats for One
Community”. The running procedures are the same as those for Example 1.2 except that the data
entry type is different. Here we have a 76 x 18 (individual-by-trapping-occasion) matrix of
capture/non-capture records. The output is the same as that in Example 1.2A.

Part II: Diversity Profile Estimation (Estimating Various Diversity Measures)

All five data input formats are supported; see the Section “Data Input Formats” for details. The
interface window for this part, shown below, is slightly different from that for the Species Part
(Here in the left panel, you can specify the orders of q).

Figure 2. The Interface Window of SpadeR Online for Part II (Diversity Profile Estimation)

For abundance data, the basic theoretical framework is briefly described below. The
compositional complexity of a community is not expressible as a single number. Hill numbers (or
the effective number of species) have been increasingly used to quantify species/taxonomic
diversity of a community because they represent an intuitive and statistically amenable
alternative to other diversity indices; see Chao et al. (2014a, b) and Chao and Jost (2015) for a
recent review. Hill numbers are parameterized by a diversity order q, which determines the
measures’ sensitivity to species relative abundances. Given a species relative abundance set
$\{p_1, p_2, \ldots, p_S\}$, the Hill number of order q is defined as:

$${}^{q}D = \left( \sum_{i=1}^{S} p_i^{q} \right)^{1/(1-q)}, \quad q \ge 0,\ q \ne 1;$$

$${}^{1}D = \lim_{q \to 1} {}^{q}D = \exp\left( -\sum_{i=1}^{S} p_i \log p_i \right).$$

Hill numbers include three widely used measures as special cases: species richness (q = 0),
Shannon diversity (the exponential of Shannon entropy, q = 1) and Simpson diversity (the
inverse of Simpson concentration, q = 2), all of which are expressed in units of “species
equivalents”. The measure for q = 0 counts species equally without regard to their relative
abundances. The measure for q = 1 counts individuals equally and thus weighs species in
proportion to their abundances; the measure ${}^{1}D$ can be interpreted as the effective number of
common species in the community. The measure for q = 2 discounts all but the dominant species
and can be interpreted as the effective number of dominant species in the community. Relevant
formulas are given in the Appendix.
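
To make the definition concrete, here is a minimal Python sketch of the Hill number of order q for a known set of relative abundances, including the q = 1 limit. The function name `hill_number` and the toy abundance vectors are assumptions for illustration; this is the plug-in definition only, not the Chao and Jost (2015) estimator that SpadeR reports.

```python
import math


def hill_number(p, q):
    """Hill number (effective number of species) of order q for relative abundances p."""
    p = [x for x in p if x > 0]  # absent species contribute nothing
    if abs(q - 1.0) < 1e-12:
        # Limit as q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


uneven = [0.5, 0.25, 0.25]   # hypothetical community with one dominant species
even = [0.25] * 4            # hypothetical perfectly even community

for q in (0, 1, 2):
    print(q, round(hill_number(uneven, q), 4), round(hill_number(even, q), 4))
```

For the perfectly even community every order returns the species count, whereas for the uneven community the value falls as q increases, which is the sensitivity to relative abundances described above.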

Rather than selecting one or a few measures to describe a community, it is preferable to convey
the complete story by presenting a continuous profile, i.e., a plot of diversity as a function of q ≥
0. This makes it easy to visually compare the compositional complexities of multiple
communities, and to judge the evenness of the relative abundance distributions of the
communities. These profiles are usually generated by substituting species sample proportions
into the complexity measures. However, this empirical approach typically underestimates the true
profile for low values of q, because samples usually miss some of the community’s species due to
under-sampling. We adopt the analytic method of Chao and Jost (2015) to obtain accurate,
continuous, low-bias diversity profiles. For the special case of q = 0, their estimator reduces to
the Chao1 estimator; for q = 1, their estimator reduces to the Shannon diversity estimator
proposed in Chao et al. (2013); for q = 2, their estimator reduces to the inverse of the minimum
variance unbiased estimator of the Simpson index; see Gotelli and Chao (2013). They also
proposed a bootstrap method to obtain approximate variances to construct the associated
confidence intervals.

This part computes empirical diversity and Chao and Jost (2015) diversity estimates for any
specified orders of q; by default, q runs from 0 to 3 (beyond which the profile generally
changes little) in increments of 0.25. SpadeR also features the plots of empirical and estimated
continuous diversity profiles for all specified values of order q. As researchers in some
disciplines may be interested in estimating Shannon entropy and the Gini-Simpson index, their
empirical diversity profiles and estimates are also provided.

For incidence data, the model framework for Hill numbers is slightly modified. Following Colwell
et al. (2012), we assume that the ith species has its own unique incidence probability πi that is
constant for any randomly selected sampling unit, i = 1, 2, ∙∙∙, S. Since $\sum_{i=1}^{S} \pi_i$ may be greater
than 1, we first normalize each parameter πi (i.e., divide each πi by the sum $\sum_{i=1}^{S} \pi_i$) to yield the
relative incidence of the ith species in the community. Hill numbers of order q for incidence data
are defined as (Chao et al. 2014b):

$${}^{q}\Delta = \left[ \sum_{i=1}^{S} \left( \frac{\pi_i}{\sum_{j=1}^{S} \pi_j} \right)^{q} \right]^{1/(1-q)}, \quad q \ge 0,\ q \ne 1;$$

$${}^{1}\Delta = \lim_{q \to 1} {}^{q}\Delta = \exp\left( -\sum_{i=1}^{S} \frac{\pi_i}{\sum_{j=1}^{S} \pi_j} \log \frac{\pi_i}{\sum_{j=1}^{S} \pi_j} \right).$$

Suppose T sampling units are randomly selected from the focal community. The incidence raw
data (Type 2B) consist of a matrix with S rows and T columns. Type 2 data include only the row
sums of the raw matrix, and Type 2A data consist only of incidence frequency counts. Estimation
for this class of incidence-based diversity measures parallels that for abundance
data; see the Appendix for formulas and Chao and Jost (2015) for details.
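
The relationship among the three incidence formats can be sketched in a few lines of Python. The helper `summarize_incidence_raw` and the 4 x 5 toy matrix are hypothetical; the Type (2A) layout (T followed by pairs k, Q_k) follows Example 2.2A, while the Type (2) layout used here (T followed by the row sums) is an assumption that should be checked against the Section “Data Input Formats”.

```python
from collections import Counter


def summarize_incidence_raw(raw):
    """Reduce a species-by-sampling-unit 0/1 matrix to Type (2) and Type (2A) style lists."""
    T = len(raw[0])
    if any(len(row) != T for row in raw):
        raise ValueError("every species row needs one entry per sampling unit")

    # Incidence frequency of each species = its row sum.
    row_sums = [sum(row) for row in raw]

    # Q_k = number of species detected in exactly k sampling units (k >= 1).
    counts = Counter(y for y in row_sums if y > 0)

    type2 = [T] + row_sums          # assumed Type (2) layout: T, then row sums
    type2a = [T]                    # Type (2A) layout: T, then pairs (k, Q_k)
    for k in sorted(counts):
        type2a += [k, counts[k]]
    return type2, type2a


raw = [
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1],
]
type2, type2a = summarize_incidence_raw(raw)
print(type2)   # T followed by the incidence frequency of each species
print(type2a)  # T followed by (k, Q_k) pairs
```

Each reduction discards information (which units, then which species), which is why the raw matrix can always be converted to the other two formats but not the reverse.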

Running Procedures
Species identification names/codes or site/community labels in your original data must be
removed to conform to the SpadeR data formats.

Step 1. Select the “Diversity Profile Estimation” tab panel from the top menu.
Step 2. Select Abundance data, Abundance-frequency counts, Incidence-frequency data,
Incidence-frequency counts or Incidence-raw data from the Data Setting.
Step 3. Check the Demo data radio button to load data (you can load your own data by
checking Upload data).
Step 4. Specify the orders of q to compute diversity. (In the left panel, you can specify the
orders of q for your output and plot; by default, q runs from 0 to 3 in
increments of 0.25.)
Step 5. Press the Run! button to get the output.

Example 2.1. Using the Rain Forest Data to Demo Type (1) Abundance Data

The demo dataset in the Diversity Profile Estimation part is a subset of that described in Type (1)
Data in the Section “Data Formats for More Than Two Communities”. The demo data here
include the sample tree frequencies from a LEP old-growth rain forest. The output is shown
below:
(1) BASIC DATA INFORMATION:
Variable Value
Sample size n 557
Number of observed species D 69
Estimated sample coverage C 0.957
Estimated CV CV 2.237

(2) ESTIMATION OF SPECIES RICHNESS (DIVERSITY OF ORDER 0):

Estimate s.e. 95%Lower 95%Upper


Chao1 (Chao, 1984) 104.9 20.3 81.8 169.9
Chao1-bc 99.6 16.9 80.1 153.2
iChao1 113.9 12.7 95.1 146.4
ACE (Chao & Lee, 1992) 92.1 10.2 79.1 121.8
ACE-1 (Chao & Lee, 1992) 100.4 15.7 81.4 148.1

Descriptions of richness estimators (See Species Part)

(3a) SHANNON ENTROPY:

Estimate s.e. 95%Lower 95%Upper


MLE 3.193 0.076 3.044 3.343
Jackknife 3.280 0.081 3.120 3.440
Chao & Shen 3.308 0.082 3.146 3.469
Chao et al. (2013) 3.293 0.083 3.130 3.456

MLE: empirical or observed entropy.


Jackknife: see Zahl (1977).
Chao & Shen: based on the Horvitz-Thompson estimator and sample coverage method; see Chao and Shen (2003).
Chao et al. (2013): A nearly optimal estimator of Shannon entropy; see Chao et al. (2013).
Estimated standard error is computed based on a bootstrap method.

(3b) SHANNON DIVERSITY (EXPONENTIAL OF SHANNON ENTROPY):

Estimate s.e. 95%Lower 95%Upper


MLE 24.372 1.829 20.787 27.956
Jackknife 26.573 2.125 22.409 30.737
Chao & Shen 27.320 2.218 22.972 31.668
Chao et al. (2013) 26.917 2.176 22.651 31.183

(4a) SIMPSON CONCENTRATION INDEX:

          Estimate      s.e.    95%Lower   95%Upper
MVUE       0.08328   0.00677     0.07001    0.09655
MLE        0.08493   0.00676     0.07168    0.09817

MVUE: minimum variance unbiased estimator; see Eq. (2.27) of Magurran (1988).
MLE: maximum likelihood estimator or empirical index; see Eq. (2.26) of Magurran (1988).

(4b) SIMPSON DIVERSITY (INVERSE OF SIMPSON CONCENTRATION):

Estimate s.e. 95%Lower 95%Upper


MVUE 12.00729 0.95424 10.13698 13.87760
MLE 11.77460 0.91635 9.97855 13.57065

(5) CHAO AND JOST (2015) ESTIMATES OF HILL NUMBERS

q ChaoJost 95%Lower 95%Upper Empirical 95%Lower 95%Upper


1 0.00 104.935 76.220 133.650 69.000 62.694 75.306
2 0.25 75.753 60.454 91.051 54.139 49.186 59.092
3 0.50 53.093 45.536 60.650 41.565 37.644 45.487
4 0.75 37.214 32.899 41.528 31.658 28.418 34.898
5 1.00 26.917 23.707 30.127 24.372 21.573 27.170
6 1.25 20.469 17.774 23.164 19.280 16.810 21.750
7 1.50 16.411 14.072 18.750 15.806 13.610 18.002
8 1.75 13.781 11.716 15.845 13.430 11.465 15.396
9 2.00 12.007 10.156 13.859 11.775 9.997 13.552
10 2.25 10.763 9.076 12.450 10.588 8.961 12.215
11 2.50 9.856 8.296 11.417 9.711 8.202 11.221
12 2.75 9.174 7.711 10.636 9.045 7.628 10.461
13 3.00 8.644 7.257 10.032 8.524 7.181 9.868

ChaoJost: diversity profile estimator derived by Chao and Jost (2015).


Empirical: maximum likelihood estimator (observed index).

The above output is divided into five parts. Output (1) includes basic data information: the
sample size, observed species richness, estimated sample coverage and estimated CV value.
Output (2) provides species richness estimators (diversity of order 0); these estimators are
extracted from the output of Part I, which features other estimators; see Examples 1.1 and 1.1A for
details. Output (3a) shows four estimates of the Shannon entropy index; the corresponding effective
number of species (diversity of order 1) is given in Output (3b). Output (4a) shows two estimates of the
Simpson concentration index; the corresponding effective number of species (diversity of order 2) is
given in Output (4b). Output (5) provides empirical diversity values and Chao and Jost (2015) diversity profile
estimates and their 95% confidence intervals based on the bootstrap method.
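
The link between Outputs (3a)/(3b) and (4a)/(4b) is a simple transformation, sketched below in Python with the rounded estimates printed in the output above. The two helper names are hypothetical. Because the inputs are rounded to three or five decimals, the results agree with the (3b) and (4b) tables only approximately.

```python
import math


def diversity_from_entropy(shannon_entropy):
    """Order-1 diversity: exponential of Shannon entropy."""
    return math.exp(shannon_entropy)


def diversity_from_concentration(simpson_concentration):
    """Order-2 diversity: inverse of Simpson concentration."""
    return 1.0 / simpson_concentration


# Rounded estimates copied from Outputs (3a) and (4a) of Example 2.1.
shannon_mle = diversity_from_entropy(3.193)
simpson_mle = diversity_from_concentration(0.08493)
simpson_mvue = diversity_from_concentration(0.08328)

print(round(shannon_mle, 2), round(simpson_mle, 2), round(simpson_mvue, 2))
```

The standard errors and confidence limits in Outputs (3b) and (4b) are not obtained by transforming those in (3a) and (4a) in this sketch; the guide states that they come from a bootstrap method.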

In the output “Visualization” panel, the empirical diversity profile and Chao and Jost (2015)
diversity estimates profile are shown. The latter always lies above the empirical diversity profile
which is plotted for comparison purposes; the reader is advised to always use the Chao and Jost
estimated diversity profile because the empirical profile generally underestimates the true profile.

Example 2.1A. Using the Beetles Data to Demo Type (1A) Abundance-Frequency Counts
Data

The demo dataset (Janzen, 1973) in the Diversity Profile Estimation part is described in Type (1A)
Data in the Section “Data Formats for One Community”. In the data set, there were 112 species
among 237 individuals. The input frequency counts are read as: (1 84 2 10 3 4 4 3 ∙∙∙ 42 1). Here
the first pair (1, 84) indicates that there are 84 singletons, the second pair (2, 10) indicates there
are 10 doubletons, and so on, with the last pair (42, 1) indicating that one species is represented
by 42 individuals. The output is similar to that in Example 2.1 and thus is omitted.

Example 2.2. Using the Ants Data to Demo Type (2) Incidence-Frequency Data

The demo dataset in the Diversity Profile Estimation part is a subset of that described in Type (2)
Data in the Section “Data Formats for More Than Two Communities”. The demo data here
include the sample incidence-based frequencies of tropical rainforest ants collected by Berlese
extraction of soil samples (217 samples) in Costa Rica (Longino et al. 2002). The output is
shown below:

OUTPUT:

(1) BASIC DATA INFORMATION:


Variable Value
Number of sampling units T 217
Number of observed species D 117
Total number of incidences U 775
Estimated sample coverage C 0.958
Estimated CV CV 1.553

(2) ESTIMATION OF SPECIES RICHNESS (DIVERSITY OF ORDER 0):

                             Estimate    s.e.   95%Lower   95%Upper
Chao2 (Chao, 1987)              145.5    13.0      129.1      184.0
Chao2-bc                        143.3    12.1      128.2      178.9
iChao2                          152.6     7.8      140.2      171.5
ICE (Lee & Chao, 1994)          144.4     9.9      130.8      171.5
ICE-1 (Lee & Chao, 1994)        152.5    14.0      133.8      191.9

Descriptions (see Species Part)

(3a) SHANNON ENTROPY:

Estimate s.e. 95%Lower 95%Upper


MLE 4.093 0.041 4.012 4.174
Chao et al. (2013) 4.194 0.044 4.109 4.280

(3b) SHANNON DIVERSITY (EXPONENTIAL OF SHANNON ENTROPY):

Estimate s.e. 95%Lower 95%Upper


MLE 59.901 2.332 55.329 64.473
Chao et al. (2013) 66.307 2.709 60.997 71.617

(4a) SIMPSON CONCENTRATION INDEX:

Estimate s.e. 95%Lower 95%Upper


MVUE 0.02793 0.00213 0.02375 0.03211
MLE 0.02909 0.00213 0.02492 0.03327

(4b) SIMPSON DIVERSITY (INVERSE OF SIMPSON CONCENTRATION):

Estimate s.e. 95%Lower 95%Upper


MVUE 35.80412 2.57417 30.75875 40.84949
MLE 34.37446 2.37556 29.71836 39.03057

(5) CHAO AND JOST (2015) ESTIMATES OF HILL NUMBERS

q ChaoJost 95%Lower 95%Upper Empirical 95%Lower 95%Upper


1 0.00 145.526 118.699 172.353 117.000 108.000 126.000
2 0.25 120.280 103.210 137.351 99.193 92.029 106.357
3 0.50 98.418 87.729 109.107 83.712 77.927 89.497
4 0.75 80.465 73.283 87.646 70.652 65.710 75.595
5 1.00 66.307 60.695 71.918 59.901 55.350 64.452
6 1.25 55.403 50.364 60.441 51.208 46.789 55.627
7 1.50 47.076 42.233 51.920 44.262 39.879 48.646
8 1.75 40.709 35.959 45.458 38.747 34.390 43.104
9 2.00 35.804 31.140 40.468 34.374 30.062 38.687
10 2.25 31.991 27.423 36.558 30.899 26.653 35.146
11 2.50 28.993 24.532 33.455 28.122 23.957 32.287
12 2.75 26.610 22.258 30.962 25.884 21.809 29.959
13 3.00 24.691 20.447 28.935 24.064 20.081 28.046

ChaoJost: diversity profile estimator derived by Chao and Jost (2015).


Empirical: maximum likelihood estimator (observed index).

The numerical results in the output are generally parallel to those for abundance data. As such,
the details are omitted. In the output “Visualization” panel, the empirical diversity profile and
Chao and Jost (2015) diversity estimates profile are shown. The latter always lies above the
empirical diversity profile. Again, the reader is advised to always use the Chao and Jost estimated
diversity profile.

Example 2.2A. Using the Seed-bank Data to Demo Type (2A) Incidence-Frequency Counts
Data

The demo seed-bank dataset in the Diversity Profile Estimation part is described in Type (2A) Data
in the Section “Data Formats for One Community”. In this dataset, 34 seedling species were
germinated from 121 soil samples. The incidence frequency counts are read as:

121 1 3 2 2 3 3 4 3 5 1 6 5 7 1 8 1 9 3 10 1
11 2 13 1 17 1 24 2 43 2 47 1 52 1 61 1

The first entry, indicating that there are 121 soil samples, is followed by the 18 pairs (1, 3), (2, 2),
(3, 3), (4, 3), (5, 1), (6, 5), and so on, up to (61, 1). Here (1, 3) indicates that there are 3 unique
species, (2, 2) indicates there are 2 duplicate species, and so on, with (61, 1) indicating that there
is one species found in 61 soil samples. The output shown below is similar to that in Example
2.2.
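
The input format just described can be parsed and summarized with a short Python sketch. The names `parse_incidence_counts` and `empirical_hill` are hypothetical. The empirical (plug-in) profile is computed by treating each species' incidence frequency divided by U as its relative incidence, which is an assumption about how the “Empirical” column is obtained; it does reproduce that column for q = 0, 1, 2 and 3 to three decimals.

```python
import math

SEED_BANK = (
    "121 1 3 2 2 3 3 4 3 5 1 6 5 7 1 8 1 9 3 10 1 "
    "11 2 13 1 17 1 24 2 43 2 47 1 52 1 61 1"
)


def parse_incidence_counts(text):
    """Split 'T k1 Q1 k2 Q2 ...' into T and a dict {k: Q_k}."""
    values = [int(v) for v in text.split()]
    T, rest = values[0], values[1:]
    if len(rest) % 2 != 0:
        raise ValueError("after T, entries must come in (k, Q_k) pairs")
    return T, {rest[i]: rest[i + 1] for i in range(0, len(rest), 2)}


def empirical_hill(counts, q):
    """Plug-in Hill number of order q using relative incidences k / U."""
    U = sum(k * Qk for k, Qk in counts.items())
    if abs(q - 1.0) < 1e-12:
        entropy = -sum(Qk * (k / U) * math.log(k / U) for k, Qk in counts.items())
        return math.exp(entropy)
    return sum(Qk * (k / U) ** q for k, Qk in counts.items()) ** (1.0 / (1.0 - q))


T, counts = parse_incidence_counts(SEED_BANK)
D = sum(counts.values())                         # number of observed species
U = sum(k * Qk for k, Qk in counts.items())      # total number of incidences

q_grid = [0.25 * i for i in range(13)]           # default orders 0, 0.25, ..., 3
profile = {q: empirical_hill(counts, q) for q in q_grid}

print(T, D, U)
for q in (0.0, 1.0, 2.0, 3.0):
    print(q, round(profile[q], 3))
```

The values of T, D and U can be compared with Output (1) below, and the profile values with the “Empirical” column of Output (5). The ChaoJost column requires the estimator in the Appendix and is not attempted here.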

OUTPUT:

(1) BASIC DATA INFORMATION:


Variable Value
Number of sampling units T 121
Number of observed species D 34
Total number of incidences U 461
Estimated sample coverage C 0.994
Estimated CV CV 1.162

(2) ESTIMATION OF SPECIES RICHNESS (DIVERSITY OF ORDER 0):

                             Estimate    s.e.   95%Lower   95%Upper
Chao2 (Chao, 1987)               36.2     3.4       34.3       52.9
Chao2-bc                         35.0     1.8       34.1       44.6
iChao2                           36.7     2.4       34.6       46.1
ICE (Lee & Chao, 1994)           35.1     1.4       34.2       41.4
ICE-1 (Lee & Chao, 1994)         35.1     1.5       34.2       42.0

Descriptions (see Species Part)

(3a) SHANNON ENTROPY:

Estimate s.e. 95%Lower 95%Upper


MLE 2.986 0.044 2.899 3.072
Chao et al. (2013) 3.023 0.044 2.936 3.109

(3b) SHANNON DIVERSITY (EXPONENTIAL OF SHANNON ENTROPY):

Estimate s.e. 95%Lower 95%Upper


MLE 19.797 0.848 18.135 21.460
Chao et al. (2013) 20.550 0.881 18.823 22.276

(4a) SIMPSON CONCENTRATION INDEX:

Estimate s.e. 95%Lower 95%Upper


MVUE 0.06865 0.00362 0.06155 0.07576
MLE 0.07026 0.00360 0.06319 0.07732

(4b) SIMPSON DIVERSITY (INVERSE OF SIMPSON CONCENTRATION):

Estimate s.e. 95%Lower 95%Upper


MVUE 14.56563 0.72741 13.13990 15.99136
MLE 14.23354 0.69093 12.87931 15.58777

(5) CHAO AND JOST (2015) ESTIMATES OF HILL NUMBERS

q ChaoJost 95%Lower 95%Upper Empirical 95%Lower 95%Upper


1 0.00 36.231 29.927 42.536 34.000 31.691 36.309
2 0.25 31.027 27.593 34.462 29.324 27.484 31.165
3 0.50 26.734 24.561 28.907 25.441 23.771 27.110
4 0.75 23.276 21.485 25.068 22.294 20.677 23.910
5 1.00 20.550 18.856 22.243 19.797 18.212 21.383
6 1.25 18.431 16.789 20.073 17.846 16.300 19.392
7 1.50 16.798 15.209 18.387 16.331 14.835 17.827
8 1.75 15.540 14.009 17.071 15.154 13.711 16.596
9 2.00 14.566 13.091 16.040 14.234 12.844 15.623
10 2.25 13.804 12.381 15.226 13.506 12.165 14.848
11 2.50 13.200 11.822 14.578 12.924 11.624 14.223
12 2.75 12.715 11.374 14.056 12.451 11.187 13.715
13 3.00 12.320 11.009 13.632 12.062 10.827 13.297

ChaoJost: diversity profile estimator derived by Chao and Jost (2015).


Empirical: maximum likelihood estimator (observed index).

Example 2.2B. Using the Seed-bank Data to Demo Type (2B) Incidence-Raw Data
The demo seed-bank dataset in the Diversity Profile Estimation part is described in Type (2B) Data
in the Section “Data Formats for One Community”. The incidence-raw data form a 34 x 121
(species-by-soil-sample) matrix. You can download and view the original data using the “Data Viewer”
panel and examine the output in the “Estimation” panel along the second row menu.

Part III: Shared Species (Shared Species Richness Estimation in Two Communities)

This part of the program features estimates for the number of shared species in two communities
based on species abundance data (Type 1), incidence-frequency data (Type 2) or incidence raw
data (Type 2B). The reader is referred to Chao and Chiu (2012) for the models and theoretical
background.

Running Procedures
Species identification names/codes or site/community labels in your original data must be
removed to conform to the SpadeR data formats.

Step 1. Select “Shared Species” tab from the top menu.


Step 2. Select Abundance data, Incidence-frequency data or Incidence-raw data from the
Data setting.
Step 3. Check the Demo data radio button to load data (you can also upload your own data by
checking Upload data).
(If incidence-raw data are loaded, you must specify the number of sampling units in
each community in the left panel.)
Step 4. Press the Run! button to get the output.

In the following, we run separate examples for each of the three types of data.

Example 3.1. Using the Birds Data in Two Estuaries to Demo Type (1) Abundance Data

The birds data are described in Type (1) Data in the Section “Data Formats for Two
Communities”. These data, stored as the demo data in the Shared Species part, were analyzed
in detail by Chao et al. (2000). A total of 85867 and 59646 observations were made from two
estuaries (hereafter referred to as Community I and II, respectively). In these two
communities, there were, respectively, 155 and 140 species observed, of which 111 were
recorded for both areas (shared species). The purpose was to estimate the true number of shared
species in the two communities because it was expected that some shared species were not
observed or were only observed in one community.

The output, shown at the end of this example, has three parts. The first part shows basic data
information for the two samples. The second part lists four estimates of shared species richness,
along with their standard errors and confidence intervals. A brief explanation of each
model/estimator is provided below this list in the third part; for more details, see Chao et al.
(2000) and Chao and Chiu (2012). The formulas for each model/estimator are provided in the
Appendix of this user’s guide.

The homogeneous model assumes that the shared species in each community have the same
discovery probabilities. This model yields an estimate of 114 (s.e. 5.978) for the number of
shared species. In practice, the homogeneous model is rarely valid. The ACE and Chao1
estimators for estimating species richness in one community have each been extended to
estimate shared species richness for two communities. The corresponding shared species richness
estimators for ACE, Chao1 and Chao1-bc (the bias-corrected form of Chao1) are respectively named
Heterogeneous (ACE-shared), Chao1-shared and Chao1-shared-bc in the output. The
ACE-shared estimator was derived by Chao et al. (2000) whereas the Chao1-shared and its
bias-corrected form were derived as lower bounds by Pan et al. (2009).

To extend the ACE approach to estimate shared species richness, we select the cut-off point κ =
10 which separates the observed shared species into two groups: rare and abundant. The first part
of the output reveals that among the observed 111 shared species, there were 21 rare shared
species which were observed less than or equal to 10 times in both communities; these are
classified into the “rare” shared species group and the other 90 shared species are classified into
the “abundant” group. Only the former group is used for estimating the number of undetected
shared species. For those 21 species in the “rare” group, the sample sizes and relevant
information are also shown.
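
The grouping rule described above can be sketched as follows. The function `split_shared_species` and the two short abundance vectors are hypothetical and serve only to show how the cut-off of 10 partitions the observed shared species; the estimation step that uses the rare group is given in the Appendix and is not reproduced here.

```python
def split_shared_species(x1, x2, cutoff=10):
    """Partition observed shared species into 'rare' and 'abundant' groups.

    x1[i] and x2[i] are the abundances of species i in Community 1 and 2.
    A shared species is 'rare' when both abundances are at most `cutoff`.
    """
    shared = [(a, b) for a, b in zip(x1, x2) if a > 0 and b > 0]
    rare = [(a, b) for a, b in shared if a <= cutoff and b <= cutoff]
    abundant = [(a, b) for a, b in shared if a > cutoff or b > cutoff]
    return {
        "D1": sum(a > 0 for a in x1),   # observed species in Community 1
        "D2": sum(b > 0 for b in x2),   # observed species in Community 2
        "D12": len(shared),             # observed shared species
        "rare": rare,
        "abundant": abundant,
    }


# Hypothetical abundances for seven species (same species order in both lists).
community_1 = [0, 3, 12, 1, 5, 40, 2]
community_2 = [4, 2, 1, 0, 10, 25, 11]

groups = split_shared_species(community_1, community_2)
print(groups["D1"], groups["D2"], groups["D12"])
print(len(groups["rare"]), len(groups["abundant"]))
```

In the birds data the same rule yields 21 rare and 90 abundant shared species out of the 111 observed.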

For the ACE-shared estimator, we use the coefficient of covariation (CCV) to characterize the
heterogeneity of species discovery probabilities in each community and the correlation of the two
sets of discovery probabilities; see Chao et al. (2000, pp. 232-233). Some relevant formulas are
provided in the Appendix. As shown in the output, the CCV estimates are CCV1 = 0.733, CCV2 =
1.007 and CCV12 = 0.457; see Equations (3.9a) and (3.9b) of Chao et al. (2000). These large
values of CCVs show strong evidence for the existence of heterogeneity and correlation.
Therefore, we need to incorporate the estimated CCVs in the resulting estimator. The estimated
number of shared species in the ACE-shared approach by Chao et al. (2000) is 134, implying that
there are still 23 shared species not discovered in the survey. In the run shown below, the
bootstrap s.e. estimate for the ACE-shared estimator based on 100 bootstrap replications is 18.248,
which implies a 95% confidence interval of (116, 200). Because bootstrap resampling is random,
two different runs may result in different s.e. estimates and different confidence intervals.
Both the Chao1-shared estimator and its bias-corrected
version yield a minimum number of 142 shared species, implying that there are at least 31 shared
species not discovered in the survey. A bootstrap s.e. estimate for the Chao1-shared estimator
based on 100 bootstrap replications is approximately 18.364, which implies a 95% confidence
interval of (121, 201).

OUTPUT:
(1) BASIC DATA INFORMATION:

Sample size in Community 1 n1 = 85867


Sample size in Community 2 n2 = 59646
Number of observed species in Community 1 D1 = 155
Number of observed species in Community 2 D2 = 140
Number of observed shared species D12 = 111
Bootstrap replications for s.e. estimate 100

"Entire" Shared Species Group:


Some Statistics:
--------------------------------------------------------------------------
f[11] = 4 ; f[1+] = 10 ; f[+1] = 15 ; f[2+] = 2 ; f[+2] = 7 ; f[22] = 0
--------------------------------------------------------------------------

"Rare" Shared Species Group: (Both frequencies can only up to 10)


Some Statistics:
-------------------------------------------------------------------
f[1+]_rare = 8 ; f[+1]_rare = 9 ; f[2+]_rare = 1 ; f[+2]_rare = 4
-------------------------------------------------------------------
Number of observed individuals in Community 1   n1_rare  = 3358
Number of observed individuals in Community 2   n2_rare  = 558
Number of observed shared species               D12_rare = 21
Estimated sample coverage                       C12_rare = 0.86
Estimated CCVs                                  CCV_1    = 0.733
                                                CCV_2    = 1.007
                                                CCV_12   = 0.457

(2) ESTIMATION RESULTS OF THE NUMBER OF SHARED SPECIES:

Estimate s.e. 95%Lower 95%Upper


Homogeneous 114.429 5.978 110.602 142.603
Heterogeneous (ACE-shared) 133.921 18.248 116.344 200.205
Chao1-shared 142.125 18.364 121.329 201.090
Chao1-shared-bc 142.125 18.364 121.329 201.090

(3) DESCRIPTION OF MODELS FOR ESTIMATING SHARED SPECIES RICHNESS:


Homogeneous: This model assumes that the shared species in each community have the same discovery probabilities; see Eq.
(3.11a) of Chao et al. (2000).

Heterogeneous (ACE-shared): This model allows for heterogeneous discovery probabilities among shared species; see Eq. (3.11b)
of Chao et al. (2000). An extension of the ACE estimator to two communities; it is replaced by Chao1-shared when the estimated
sample coverage for the rare shared species group (C12_rare in the output) is zero.

Chao1-shared: An extension of the Chao1 estimator to estimate shared species richness between two communities. It provides
a lower bound of shared species richness. See Eq. (3.6) of Pan et al. (2009). It is replaced by Chao1-shared-bc for the
case f[2+]=0 or f[+2]=0.

Chao1-shared-bc: A bias-corrected form of the Chao1-shared estimator; see Pan et al. (2009).

Example 3.2. Using the Hong Kong Big Bird Race Data to Demo Type (2)
Incidence-Frequency Data

The Hong Kong Big Bird Race data are described in Type (2) Data in the Section “Data Formats
for Two Communities”. In 2015, a total of 223 species was observed by 16 teams competing in
the Hong Kong Big Bird Race. In 2016, the same number of species was observed by 17 teams.
Merging the two-year data by species names, we found that there were 250 species, of which 196
were observed in both years (“shared”). In 2015, the winning team recorded 157 species. This
means that the winning team missed 66 species that were observed by at least one of the other
teams in that year’s competition. In 2016, the winning team recorded 159 species; thus, 64
species that were observed by the other teams were missed by the winning team.

The Chao2 estimator for estimating species richness in one community and its bias-corrected
version have been extended to estimate shared species richness for two communities (Pan et al.
2009), with the resulting estimators referred to as the Chao2-shared and Chao2-shared-bc in the
output. The ICE method has not yet been extended to shared species richness estimation. To
estimate the number of shared species between the two sets of data, all the step-by-step running
procedures are the same as in Example 3.1 except that the data type is different. The output is
shown below.

There are three parts in the output. The first part shows basic data information for the two
samples. The second part lists two estimates of shared species richness (the Chao2-shared and
Chao2-shared-bc), along with their bootstrap standard errors and 95% confidence intervals. A
brief explanation of each model/estimator is provided below this list in the third part; see Pan et
al. (2009) and Chao and Chiu (2012) for pertinent theory. The formulas for each model/estimator
are provided in the Appendix of this user’s guide.

From the merged (by species identity) data of the two years, we obtain frequency counts Q1+ = 16,
Q2+ = 16, Q+1 = 11, Q+2 = 6, Q11 = 5, and Q22 = 2 which are the main statistics used in
Chao2-shared. Based on the Chao2-shared estimate, the shared species richness is 216 (s.e. 12.0)
with a 95% confidence interval of (203, 255). Thus, we can conclude that there were about 20
shared species not detected with a 95% confidence interval of (7, 59). Based on the
Chao2-shared-bc estimate, the shared species richness is 211 (s.e. 9.245) with a 95% confidence
interval of (201, 242). Thus, we can conclude that there were about 15 shared species not
detected with a 95% confidence interval of (5, 46).
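
The two point estimates quoted above can be recomputed from the listed statistics with the Python sketch below. The functional forms are written here as assumed from Pan et al. (2009), with K1 = (T1 − 1)/T1 and K2 = (T2 − 1)/T2; they are not copied from this guide's Appendix, which remains the authoritative source. The function names and argument names (for example `Q1p` for Q1+ and `Qp1` for Q+1) are hypothetical.

```python
def chao2_shared(D12, T1, T2, Q1p, Q2p, Qp1, Qp2, Q11, Q22):
    """Assumed Chao2-shared lower bound; requires Q2+ > 0, Q+2 > 0 and Q22 > 0."""
    K1, K2 = (T1 - 1) / T1, (T2 - 1) / T2
    return (
        D12
        + K1 * Q1p ** 2 / (2 * Q2p)
        + K2 * Qp1 ** 2 / (2 * Qp2)
        + K1 * K2 * Q11 ** 2 / (4 * Q22)
    )


def chao2_shared_bc(D12, T1, T2, Q1p, Q2p, Qp1, Qp2, Q11, Q22):
    """Assumed bias-corrected form; defined even when a doubleton count is zero."""
    K1, K2 = (T1 - 1) / T1, (T2 - 1) / T2
    return (
        D12
        + K1 * Q1p * (Q1p - 1) / (2 * (Q2p + 1))
        + K2 * Qp1 * (Qp1 - 1) / (2 * (Qp2 + 1))
        + K1 * K2 * Q11 * (Q11 - 1) / (4 * (Q22 + 1))
    )


# Statistics for the Hong Kong Big Bird Race data as listed in the text.
stats = dict(D12=196, T1=16, T2=17, Q1p=16, Q2p=16, Qp1=11, Qp2=6, Q11=5, Q22=2)

estimate = chao2_shared(**stats)
estimate_bc = chao2_shared_bc(**stats)
print(round(estimate, 3), round(estimate_bc, 3))
print(round(estimate - stats["D12"]), round(estimate_bc - stats["D12"]))  # undetected shared species
```

Subtracting D12 = 196 from each estimate gives the approximate numbers of undetected shared species discussed above. The standard errors and confidence intervals come from the program's bootstrap and are not reproduced in this sketch.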

OUTPUT:
(1) BASIC DATA INFORMATION:

Number of sampling units in Community 1 T1 = 16


Number of sampling units in Community 2 T2 = 17
Number of total incidences in Community 1 U1 = 1958
Number of total incidences in Community 2 U2 = 2215
Number of observed species in Community 1 D1 = 223
Number of observed species in Community 2 D2 = 223
Number of observed shared species in two communities D12 = 196
Bootstrap replications for s.e. estimate 100

Some Statistics:
--------------------------------------------------------------------------
Q[11] = 5 ; Q[1+] = 16 ; Q[+1] = 11 ; Q[2+] = 16 ; Q[+2] = 6 ; Q[22] = 2
--------------------------------------------------------------------------

(2) ESTIMATION RESULTS OF THE NUMBER OF SHARED SPECIES:

Estimate s.e. 95%Lower 95%Upper


Chao2-shared 215.748 12.000 202.581 255.253
Chao2-shared-bc 211.483 9.245 201.246 241.697

(3) DESCRIPTION OF MODELS FOR ESTIMATING SHARED SPECIES RICHNESS:

Chao2-shared: An extension of the Chao2 estimator to estimate shared species richness between two communities. It provides
a lower bound of shared species richness. See Eq. (3.6) of Pan et al. (2009). It is replaced by Chao2-shared-bc for the
case Q[2+]=0 or Q[+2]=0.

Chao2-shared-bc: A bias-corrected form of Chao2-shared. See Pan et al. (2009).

Example 3.2B. Using the Hong Kong Big Bird Race Data to Demo Type (2B) Incidence-Raw
Data
Demo Data at Shared Species Part (Example 3.2B)
The raw data for the Hong Kong Big Bird Race are described in Type (2B) Data in the Section
“Data Formats for Two Communities”. We use a 280-species checklist, although only 250
species were found in the pooled data of the two years. If these were user-uploaded data, then in
the left panel, users would specify the number of sampling units taken from each community.
That is, users would have to type in “16 17” (separated by at least one space) to specify that 16
and 17 teams (sampling units) were taken from Community 1 and Community 2, respectively.
The raw data for the two years consist of a 280 x 33 (species-by-team) matrix whose elements
are 1’s (detection) or 0’s (non-detection); each row of the matrix is devoted to the
detection/non-detection records of the same species so as to make computations related to shared
species information more convenient.
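
How the “16 17” specification partitions the pooled matrix can be illustrated with a small Python sketch. The function `split_two_communities` and the 4 x 5 toy matrix (2 units in Community 1, 3 in Community 2) are hypothetical; the sketch only shows why keeping one row per checklist species makes shared-species bookkeeping straightforward.

```python
def split_two_communities(raw, units_text):
    """Split a pooled species-by-unit 0/1 matrix using a 'T1 T2' specification."""
    T1, T2 = (int(v) for v in units_text.split())
    if any(len(row) != T1 + T2 for row in raw):
        raise ValueError("each row must have T1 + T2 entries")

    # Incidence frequency of every checklist species in each community.
    y1 = [sum(row[:T1]) for row in raw]
    y2 = [sum(row[T1:]) for row in raw]

    return {
        "y1": y1,
        "y2": y2,
        "D1": sum(a > 0 for a in y1),
        "D2": sum(b > 0 for b in y2),
        "D12": sum(a > 0 and b > 0 for a, b in zip(y1, y2)),
    }


# Hypothetical pooled matrix: first 2 columns = Community 1, last 3 = Community 2.
pooled = [
    [1, 0, 0, 1, 1],   # detected in both communities (shared)
    [0, 0, 1, 0, 0],   # detected only in Community 2
    [1, 1, 0, 0, 0],   # detected only in Community 1
    [0, 0, 0, 0, 0],   # on the checklist but never detected
]

summary = split_two_communities(pooled, "2 3")
print(summary)
```

The all-zero row plays the role of the checklist species that were never observed: it contributes to neither D1, D2 nor D12, just as 30 of the 280 checklist species do in the bird race data.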

Part IV: Two-Community (Similarity) Measures

This part features various similarity indices for quantifying species compositional resemblance or
overlap between two communities/assemblages. All indices take values in the interval [0, 1].
Thus, the corresponding dissimilarity or differentiation indices are the one-complements
(1 minus similarity) of the similarity measures. This part is restricted to the special case of two
communities because pairwise similarity measures have been widely used in many research fields.
Even in the case of more than two communities, pairwise similarity measures provide useful
information. Three types of data are supported: species abundance data (Type 1),
incidence-frequency data (Type 2) or incidence raw data (Type 2B).

Running Procedures
Species identification names/codes or site/community labels in your original data must be
removed to conform to the SpadeR data formats.

Step 1. Select the “Two-Community Measures” tab from the top menu.
Step 2. Select Abundance data, Incidence-frequency data, or Incidence-raw data from the
Data Setting.
Step 3. Check the Demo data radio button to load data (you can load your own data by checking
Upload data).
(If incidence-raw data are loaded, you must specify the number of sampling units in
each community in the left panel.)
Step 4. Press the Run! button to get the output.

In the following, we run separate examples for each of the three types of data.

Example 4.1. Using the Rain Forest Data to Demo Type (1) Abundance Data

This dataset is a subset of that described in Type (1) Data in the Section “Data Formats for More
Than Two Communities”. The original data include species frequencies from three communities
(seedlings, saplings and trees) sampled in an old-growth rain forest. Here we only select two
communities for illustration. The data stored in the demo data include species frequencies from
seedlings (column 1), and trees (column 2). The goal here is to assess the species compositional
similarity between seedlings and trees. The output is shown below.

OUTPUT:
(1) BASIC DATA INFORMATION:

The loaded set includes abundance/incidence data from 2 communities
and a total of 86 species.

Samples size in Community 1                              n1  = 557
Samples size in Community 2                              n2  = 111
Number of observed species in Community 1                D1  = 69
Number of observed species in Community 2                D2  = 43
Number of observed shared species in two communities     D12 = 26
Number of bootstrap replications for s.e. estimate       100

Some statistics:
f[11]= 4 ; f[1+]= 8 ; f[+1]= 10 ; f[2+]= 2 ; f[+2]= 8 ; f[22]= 0

(2) EMPIRICAL SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity

C02 (q=0, Sorensen) 0.4643 0.0398 0.3863 0.5423


U02 (q=0, Jaccard) 0.3023 0.0307 0.2421 0.3625

(b) Measures for comparing species relative abundances

C12=U12 (q=1, Horn) 0.5502 0.0282 0.4949 0.6056

C22 (q=2, Morisita-Horn) 0.7442 0.0413 0.6632 0.8252


U22 (q=2, Regional overlap) 0.8533 0.0283 0.7978 0.9088

ChaoJaccard-abundance 0.4040 0.0288 0.3476 0.4604


ChaoSorensen-abundance 0.5755 0.0304 0.5159 0.6351

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.6002 0.0326 0.5363 0.6640

(d) Measures for comparing species absolute abundances

C12=U12 (q=1) 0.3894 0.0326 0.3256 0.4533

C22 (Morisita-Horn) 0.3039 0.0355 0.2343 0.3734


U22 (Regional overlap) 0.4661 0.0419 0.3839 0.5483

Bray-Curtis 0.2425 0.0159 0.2113 0.2738

(3) ESTIMATED SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity:

C02 (q=0, Sorensen) 0.4966 0.1495 0.2036 0.7896


U02 (q=0, Jaccard) 0.3303 0.1606 0.0155 0.6451

(b) Measures for comparing species relative abundances

C12=U12 (q=1, Horn) 0.6763 0.0637 0.5516 0.8011

C22 (q=2, Morisita-Horn) 0.7868 0.0426 0.7033 0.8702


U22 (q=2, Regional overlap) 0.8807 0.0280 0.8259 0.9355

ChaoJaccard-abundance 0.4845 0.1227 0.2440 0.7250


ChaoSorensen-abundance 0.6527 0.0960 0.4645 0.8409

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.7586 0.0841 0.5938 0.9234

(d) Measures for comparing species absolute abundances

C12=U12 (q=1) 0.4922 0.0841 0.3274 0.6570

C22 (q=2, Morisita-Horn) 0.3114 0.0364 0.2401 0.3828


U22 (q=2, Regional overlap) 0.4749 0.0425 0.3916 0.5583

Bray-Curtis 0.2725 0.0506 0.1734 0.3716

NOTE: If an estimate is greater than 1, it is replaced by 1.

The output includes three parts. The first part includes basic data information, while the second
part includes empirical (i.e., observed) similarity measures. These empirical measures are for
comparison purposes only because they are generally subject to large negative bias due to
undetected species and undetected shared species in samples. When a community is
under-sampled, an empirical similarity index generally underestimates the true parameter
whereas an empirical dissimilarity index overestimates the true parameter. The third part
includes estimated similarity measures which were developed to correct for under-sampling bias
and are recommended for practical use. Note that any estimated similarity index is higher than its
corresponding empirical value, with the magnitude of this difference representing the
under-sampling bias associated with the empirical index. Although we restrict ourselves to the
case of two communities in this example, a general framework regarding N-community similarity
measures is reviewed below to help readers understand the theory behind all the measures. Later
on, in the section describing the output, further explanation of the measures is given.
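As a quick check on how the first two parts of the output relate, the empirical richness-based indices can be recomputed from the counts in the basic data information. The sketch below uses the classic two-community Sørensen and Jaccard formulas; the function names are illustrative and are not part of SpadeR.

```python
def sorensen(d1: int, d2: int, d12: int) -> float:
    """Classic two-community Sorensen index: 2*D12 / (D1 + D2)."""
    return 2 * d12 / (d1 + d2)


def jaccard(d1: int, d2: int, d12: int) -> float:
    """Classic two-community Jaccard index: D12 / (D1 + D2 - D12)."""
    return d12 / (d1 + d2 - d12)


# Counts taken from the basic data information of Example 4.1.
d1, d2, d12 = 69, 43, 26
print(round(sorensen(d1, d2, d12), 4))  # 0.4643
print(round(jaccard(d1, d2, d12), 4))   # 0.3023
```

These values agree with the empirical C02 and U02 rows of the output; the estimated values in the third part additionally correct for undetected species and cannot be reproduced from these three counts alone.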

Theory of compositional similarity measures

Assume that there are N communities, with S species indexed by 1, 2, ∙∙∙, S in the pooled
community. Nearly all (dis)similarity indices are closely related to or derived from beta diversity
which measures the extent of differentiation in species composition among a set of communities
in a geographical area, over a time period, or along an environmental gradient (Whittaker 1960,
1972; Legendre and Legendre 2012). Numerous concepts and measures of beta diversity exist in
the literature along with related similarity/differentiation indices. The variance framework
derived from the total variance of a community species abundance matrix (Legendre and De
Cáceres 2013) and diversity decomposition based on partitioning gamma diversity into alpha and
beta components (e.g., see Chao et al. 2014) are two major approaches.

Chao and Chiu (2016c) bridged the above two major beta-diversity approaches and proved that
both approaches lead to the same two classes of similarity measures, as described below. These
two classes of measures are also functions of diversity order q defined in species diversity (Hill
numbers) in Part II. When q = 0, similarity measures are based only on species richness, whereas
for q > 0, similarity measures incorporate species abundances (for abundance data) and species
incidence frequencies (for incidence data). SpadeR mainly focuses on the three orders of q = 0, 1
and 2. The two convergent classes of similarity measures are formulated from a local perspective
(i.e., property of an individual community) or from a regional perspective (i.e., property of a
region or pooled community).

(1) A class of local (Sørensen-type) species-overlap measures: CqN.

This class of measures reduces to the classic N-community Sørensen index if q = 0, whereas
it generalizes the classic N-community Sørensen index to incorporate species abundances if q
> 0. It quantifies the effective average proportion of a community’s species (a local
perspective) that are shared across all communities. The measure 1 − CqN quantifies the
effective average proportion of non-shared species in a community.

When the measures are applied to compare sets of within-community relative abundances,
communities are equally weighted and the measures for q = 1 and 2 respectively reduce to the
traditional N-community Horn (1966) and Morisita-Horn (Morisita 1959) similarity measures.
These measures are valid for all types of data, meaning that they also generalize the traditional
Horn and Morisita-Horn measures to compare sets of species absolute abundances.

(2) A class of regional (Jaccard-type) species-overlap measures: UqN.

This class of measures reduces to the classic N-community Jaccard index if q = 0, whereas it
generalizes the classic N-community Jaccard index to incorporate species abundances if q > 0.
It quantifies the effective proportion of species in the pooled community (a regional
perspective) that are shared across all communities. The measure 1 − UqN is a
complementarity measure that quantifies the effective proportion of non-shared species in the
pooled community.

When the measures are applied to compare sets of within-community relative abundances,
communities are equally weighted and the measures for q = 1 and 2 respectively reduce to the
traditional N-community Horn (1966) and regional species-overlap (Chiu et al. 2014)
similarity measures. These measures also generalize the traditional Horn and regional
species-overlap measures to compare sets of species absolute abundances.
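Both classes can be viewed as monotone transformations of multiplicative beta diversity (gamma divided by alpha) onto [0, 1]. The sketch below uses the transformations as they are usually written in the diversity-decomposition literature (Chiu et al. 2014); it is an illustration under that assumption, the Appendix remains the reference for the exact formulas, and the function names are hypothetical.

```python
import math


def c_qn(beta: float, n: int, q: float) -> float:
    """Local (Sorensen-type) overlap from beta diversity of order q."""
    if abs(q - 1.0) < 1e-12:
        return 1.0 - math.log(beta) / math.log(n)  # limit as q -> 1
    e = q - 1.0
    return ((1.0 / beta) ** e - (1.0 / n) ** e) / (1.0 - (1.0 / n) ** e)


def u_qn(beta: float, n: int, q: float) -> float:
    """Regional (Jaccard-type) overlap from beta diversity of order q."""
    if abs(q - 1.0) < 1e-12:
        return 1.0 - math.log(beta) / math.log(n)  # C1N = U1N
    e = 1.0 - q
    return ((1.0 / beta) ** e - (1.0 / n) ** e) / (1.0 - (1.0 / n) ** e)


# q = 0 for the three rain-forest assemblages of Example 5.1:
# gamma is the pooled observed richness, alpha the mean observed richness.
n = 3
richness = [69, 102, 43]
gamma = 120
beta0 = gamma / (sum(richness) / n)
print(round(c_qn(beta0, n, 0), 4))  # 0.6589
print(round(u_qn(beta0, n, 0), 4))  # 0.3917
```

The printed values match the empirical C03 and U03 rows of Example 5.1. Under the same formulas, the q = 2 measures satisfy U2N = N·C2N / (1 + (N − 1)·C2N), which is consistent with the C22 and U22 rows printed in the examples.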

The two classes of measures range in the interval [0, 1] and thus can be compared across multiple
sets of communities. All measures attain the maximum value of 1 when communities are identical,
whereas all measures attain the minimum value of 0 when communities are completely distinct
(no shared species). While SpadeR mainly features the above two types of similarity measures, it
also accommodates some additional measures. For the two-community case, the additional
measures include the Bray-Curtis similarity index, Horn’s size-weighted overlap measure and the
abundance-based ChaoSørensen and ChaoJaccard measures (Chao et al. 2005, 2006). Below we
list and provide some details for all measures featured by SpadeR; see the Appendix for the
formulas and Table A1 for a summary of these measures.

A summary of similarity measures featured in Part IV (see Table A1 in the Appendix):

(a) Classic richness-based similarity measures (q = 0)


-- C02 index: classic two-community Sørensen similarity index;
-- U02 index: classic two-community Jaccard similarity index.

For the cases of q = 1 and q = 2, the choice of measures depends on the study goal.

(b) If the goal is to compare the two sets of species relative abundances, then communities are
equally weighted and SpadeR features the following indices:

-- C12 = U12 (q = 1, which is identical to the equal-weight Horn overlap measure);


-- C22 (q = 2, two-community equal-weight Morisita-Horn similarity index);
-- U22 (q = 2, two-community equal-weight regional overlap index).

(c) If the goal is to compare two sets of species relative abundances with size as community weight,
then only the size-weighted Horn (1966) index is accommodated:

-- Horn size-weighted index (q = 1).

(d) When the goal is to compare two sets of species absolute abundances, SpadeR features the
following four similarity indices:

-- C12 = U12 (q = 1);


-- C22 (q = 2, two-community Morisita-Horn similarity index);

-- U22 (q = 2, two-community regional overlap index);
-- The two-community Bray-Curtis index.
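To make the summary concrete, the sketch below computes empirical (uncorrected) versions of the equal-weight relative-abundance measures in group (b) and the Bray-Curtis index in group (d) from two abundance vectors. It uses the standard textbook forms of these indices, which are assumed here to match the Appendix formulas; the function names and toy data are illustrative only, and none of SpadeR's bias corrections are applied.

```python
import math


def _rel(x):
    """Convert counts to relative abundances."""
    total = sum(x)
    return [v / total for v in x]


def _entropy(p):
    """Shannon entropy, skipping zero entries."""
    return -sum(v * math.log(v) for v in p if v > 0)


def horn(x1, x2):
    """Equal-weight Horn overlap (q = 1)."""
    p1, p2 = _rel(x1), _rel(x2)
    pooled = [(a + b) / 2 for a, b in zip(p1, p2)]
    h_alpha = (_entropy(p1) + _entropy(p2)) / 2
    return 1 - (_entropy(pooled) - h_alpha) / math.log(2)


def morisita_horn(x1, x2):
    """Equal-weight Morisita-Horn index (q = 2)."""
    p1, p2 = _rel(x1), _rel(x2)
    cross = sum(a * b for a, b in zip(p1, p2))
    return 2 * cross / (sum(a * a for a in p1) + sum(b * b for b in p2))


def regional_overlap(x1, x2):
    """Regional overlap (q = 2), obtained from Morisita-Horn for N = 2."""
    c = morisita_horn(x1, x2)
    return 2 * c / (1 + c)


def bray_curtis(x1, x2):
    """Bray-Curtis similarity on absolute abundances."""
    return 2 * sum(min(a, b) for a, b in zip(x1, x2)) / (sum(x1) + sum(x2))


# Toy data: three species, one shared between the two communities.
x1, x2 = [10, 5, 0], [0, 5, 10]
for f in (horn, morisita_horn, regional_overlap, bray_curtis):
    print(f.__name__, round(f(x1, x2), 4))
```

Each function returns 1 for two identical vectors and 0 for two vectors with no shared species, which is a convenient sanity check before applying the measures to real data.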

Remarks

Remark 1 When sampling effort in each community is standardized (i.e., equal-effort sampling),
one can compare both relative abundances as in Part (b) above, and absolute abundances as in
Part (d). However, if sampling effort is not standardized, then only the measures for comparing
relative abundances are statistically meaningful. Thus, the widely used Bray-Curtis index for
comparing absolute abundances makes sense only under equal-effort sampling. Equal-effort
sampling implies that the sampling fractions (ratio of sample size and community size) in all
samples are approximately the same; see Chao et al. (2006) for further explanation.

Remark 2 The two classes of similarity measures, CqN and UqN, for q > 0 basically match species
relative abundances species by species. For instance, the measure of q = 2 assesses a normalized
probability that two randomly-chosen individuals, one from each community, belong to the same
species. Chao et al. (2005, 2006) derived a class of measures which look at a different kind of
similarity, based on assessing the normalized probability that two randomly-chosen individuals,
one from each community, belong to shared species (not necessarily the same species). It is
simple to convert any richness-based measure to its corresponding abundance version. Chao et al.
(2005, 2006) converted the classic Jaccard and Sørensen to their corresponding abundance
versions, with the resulting indices referred to as the Chao-Jaccard abundance and
Chao-Sørensen abundance indices in the literature. Unlike the ordinary similarity indices that
match species-by-species abundances, these measures match the total relative abundances of
species shared between two communities.

One main advantage of this class of measures is that the under-sampling bias due to undetected
shared species can be more accurately evaluated and corrected. These measures can also be
extended to deal with replicated incidence-frequency and incidence-raw data. However, they
should be used only if their concept of similarity is relevant to the question of interest. For
example, the complements of CqN and UqN are useful if the focus is to construct an
abundance-based complementarity measure from a local or regional perspective (Gotelli and
Chao 2013).
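The idea of matching total relative abundances of shared species can be sketched as follows. Here U and V denote the total relative abundances of the shared species in Community 1 and Community 2; the Jaccard-type index is UV / (U + V − UV) and the Sørensen-type index is 2UV / (U + V), following Chao et al. (2005). This is only the empirical (plug-in) version with illustrative function names; the estimated versions reported by SpadeR also adjust U and V for undetected shared species.

```python
def shared_totals(x1, x2):
    """Total relative abundance of shared species in each community (U, V)."""
    n1, n2 = sum(x1), sum(x2)
    u = sum(a for a, b in zip(x1, x2) if a > 0 and b > 0) / n1
    v = sum(b for a, b in zip(x1, x2) if a > 0 and b > 0) / n2
    return u, v


def chao_jaccard_abundance(x1, x2):
    """Empirical abundance-based Jaccard index: UV / (U + V - UV)."""
    u, v = shared_totals(x1, x2)
    if u == 0 or v == 0:
        return 0.0
    return u * v / (u + v - u * v)


def chao_sorensen_abundance(x1, x2):
    """Empirical abundance-based Sorensen index: 2UV / (U + V)."""
    u, v = shared_totals(x1, x2)
    if u == 0 or v == 0:
        return 0.0
    return 2 * u * v / (u + v)


# Toy data: species 2 and 4 are shared, so U = V = 0.5.
x1, x2 = [10, 5, 0, 5], [0, 5, 10, 5]
print(round(chao_jaccard_abundance(x1, x2), 4))   # 0.3333
print(round(chao_sorensen_abundance(x1, x2), 4))  # 0.5
```

Note that the two shared species have identical counts in this toy example, yet the indices are well below 1 because half of each community's individuals belong to non-shared species.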

Remark 3 The s.e. estimates for the ChaoSørensen and ChaoJaccard measures, as shown in the
output, are different from those in Chao et al. (2005, 2006) because we have modified our
original bootstrap variance estimation procedures. Our original procedure over-estimated the
variance, making the s.e. estimates overly conservative. The current modified procedure
yields more reasonable s.e. estimates.
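For readers unfamiliar with bootstrap s.e. estimation, the following is a deliberately naive sketch: each community is resampled with replacement in proportion to its observed counts, the statistic is recomputed, and the standard deviation over replications is reported. This is not SpadeR's procedure, whose details are not given in this guide; in particular, the naive version ignores undetected species. All names are illustrative.

```python
import random
import statistics


def sorensen_from_counts(x1, x2):
    """Empirical Sorensen index computed from two abundance vectors."""
    d1 = sum(1 for a in x1 if a > 0)
    d2 = sum(1 for b in x2 if b > 0)
    d12 = sum(1 for a, b in zip(x1, x2) if a > 0 and b > 0)
    return 2 * d12 / (d1 + d2)


def resample(x, rng):
    """Draw sum(x) individuals with replacement, proportional to counts."""
    draws = rng.choices(range(len(x)), weights=x, k=sum(x))
    out = [0] * len(x)
    for i in draws:
        out[i] += 1
    return out


def bootstrap_se(x1, x2, stat, reps=100, seed=1):
    """Standard deviation of the statistic over bootstrap replications."""
    rng = random.Random(seed)
    values = [stat(resample(x1, rng), resample(x2, rng)) for _ in range(reps)]
    return statistics.stdev(values)


x1, x2 = [10, 5, 1, 0], [0, 5, 1, 10]
print(bootstrap_se(x1, x2, sorensen_from_counts, reps=100, seed=1))
```

Changing the seed changes the result slightly, which mirrors the run-to-run variation in s.e. estimates seen in the program output.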

Remark 4 The diversity analysis within each community is featured in Part II (Diversity Profile
Estimation). The diversity of orders 0, 1 and 2 as well as their s.e. estimates and 95% confidence
intervals are provided in that part.

Incidence data

For incidence data (Data Type 2 and Type 2B), although the sampling model is different from
that of abundance data, all similarity measures are either identical or only slightly different. As
such, we omit the details here; see the Appendix for the formulas of the similarity estimators.
Below we show the output for two incidence data examples.

Example 4.2. Using the Ants Data to Demo Type (2) Incidence-Frequency Data

The data include incidence frequencies of tropical rainforest ants using two sampling techniques:
(a) Berlese extraction of soil samples (217 samples) and (b) Malaise trap samples for flying and
crawling insects (62 samples). The full data (with three sampling techniques) are described in
Type (2) data in the Section “Data Formats for More than Two Communities”. In comparing the
two methods (Berlese vs. Malaise) for their species composition, the resulting output is as
follows (see Chao et al. 2005 for an interpretation):

OUTPUT:
(1) BASIC DATA INFORMATION:

The loaded set includes abundance/incidence data from 2 communities
and a total of 200 species.

Number of sampling units in Community 1                  T1  = 217
Number of sampling units in Community 2                  T2  = 62
Number of total incidences in Community 1                U1  = 775
Number of total incidences in Community 2                U2  = 455
Number of observed species in Community 1                D1  = 117
Number of observed species in Community 2                D2  = 103
Number of observed shared species in two communities     D12 = 20
Number of bootstrap replications for s.e. estimate       100

Some Statistics:
Q[11]= 2 ; Q[1+]= 6 ; Q[+1]= 7 ; Q[2+]= 2 ; Q[+2]= 3 ; Q[22]= 1

(2) EMPIRICAL SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity:

C02 (q=0, Sorensen) 0.1818 0.0262 0.1304 0.2332


U02 (q=0, Jaccard) 0.1000 0.0160 0.0686 0.1314

(b) Measures for comparing species relative abundances

C12=U12 (q=1, Horn) 0.1421 0.0148 0.1131 0.1711

C22 (q=2, Morisita-Horn) 0.0931 0.0166 0.0605 0.1256


U22 (q=2, Regional overlap) 0.1703 0.0090 0.1526 0.1880

ChaoJaccard-incidence 0.1335 0.0171 0.1000 0.1670


ChaoSorensen-incidence 0.2356 0.0277 0.1813 0.2899

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.1417 0.0164 0.1096 0.1738

(d) Measures for comparing species absolute abundances

C12=U12 (q=1) 0.1347 0.0164 0.1026 0.1668

C22 (Morisita-Horn) 0.0778 0.0162 0.0461 0.1095


U22 (Regional overlap) 0.1444 0.0282 0.0892 0.1996

Bray-Curtis 0.0813 0.0084 0.0649 0.0977

(3) ESTIMATED SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper

(a) Classic richness-based similarity:

C02 (q=0, Sorensen) 0.2618 0.0645 0.1354 0.3882


U02 (q=0, Jaccard) 0.1506 0.0432 0.0659 0.2353

(b) Measures for comparing species relative abundances

C12=U12 (q=1, Horn) 0.2286 0.0835 0.0649 0.3923

C22 (q=2, Morisita-Horn) 0.0994 0.2205 0.0000 0.5316


U22 (q=2, Regional overlap) 0.1809 0.0436 0.0954 0.2664

ChaoJaccard-incidence 0.1918 0.0356 0.1220 0.2616


ChaoSorensen-incidence 0.3218 0.0528 0.2183 0.4253

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.2281 0.1032 0.0259 0.4303

(d) Measures for comparing species absolute abundances

C12=U12 (q=1) 0.2168 0.1032 0.0146 0.4191

C22 (Morisita-Horn) 0.0835 0.0173 0.0495 0.1175


U22 (Regional overlap) 0.1541 0.0299 0.0956 0.2127

Bray-Curtis 0.1247 0.0498 0.0271 0.2222

NOTE: If an estimate is greater than 1, it is replaced by 1.

Example 4.2B. Using the Soil Ciliates Data to Demo Type (2B) Incidence-Raw Data

The original dataset includes soil ciliate species detection/non-detection data for a total of 51 soil
samples from three areas (Etosha Pan, Central Namib Desert and Southern Namib Desert) of
Namibia, Africa. The full data set (including all three areas) consists of a 365 x 51 matrix, as
described in Type (2B) data in the Section “Data Formats for More than Two Communities”.
Here we select two areas: Etosha Pan (Community 1, 19 soil samples) and Central Namib Desert
(Community 2, 17 soil samples) for the purpose of illustrating two-community similarity
measures. Thus, the data for these two areas consist of a 365 x 36 matrix (19 + 17 = 36 columns).
For any uploaded data, users must type in the number of sampling units in each community. For
example, if these data were user-uploaded data, then in the left panel, you would have to type in
the two numbers “19 17” (separated by at least one space) to specify that 19 and 17 sampling units
were taken from Community 1 and Community 2, respectively. The output for the two-area
ciliates data is shown below:

(1) BASIC DATA INFORMATION:

The loaded set includes abundance/incidence data from 2 communities
and a total of 262 species.

Number of sampling units in Community 1                  T1  = 19
Number of sampling units in Community 2                  T2  = 17
Number of total incidences in Community 1                U1  = 516
Number of total incidences in Community 2                U2  = 380
Number of observed species in Community 1                D1  = 216
Number of observed species in Community 2                D2  = 130
Number of observed shared species in two communities     D12 = 84
Number of bootstrap replications for s.e. estimate       100

Some Statistics:
Q[11]= 16 ; Q[1+]= 28 ; Q[+1]= 33 ; Q[2+]= 15 ; Q[+2]= 18 ; Q[22]= 4

(2) EMPIRICAL SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity:

C02 (q=0, Sorensen) 0.4855 0.0257 0.4351 0.5359


U02 (q=0, Jaccard) 0.3206 0.0210 0.2794 0.3618

(b) Measures for comparing species relative abundances

C12=U12 (q=1, Horn) 0.6142 0.0208 0.5735 0.6549

C22 (q=2, Morisita-Horn) 0.6551 0.0152 0.6254 0.6848


U22 (q=2, Regional overlap) 0.7916 0.0090 0.7739 0.8093

ChaoJaccard-incidence 0.5033 0.0256 0.4531 0.5535


ChaoSorensen-incidence 0.6696 0.0250 0.6206 0.7186

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.6218 0.0213 0.5801 0.6635

(d) Measures for comparing species absolute abundances

C12=U12 (q=1) 0.6114 0.0213 0.5697 0.6532

C22 (Morisita-Horn) 0.6912 0.0293 0.6338 0.7486


U22 (Regional overlap) 0.8174 0.0220 0.7742 0.8606

Bray-Curtis 0.4821 0.0206 0.4418 0.5225

(3) ESTIMATED SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity:

C02 (q=0, Sorensen) 0.5699 0.0955 0.3827 0.7571


U02 (q=0, Jaccard) 0.3985 0.0974 0.2076 0.5894

(b) Measures for comparing species relative abundances

C12=U12 (q=1, Horn) 0.7368 0.0554 0.6281 0.8454

C22 (q=2, Morisita-Horn) 0.7870 0.1022 0.5868 0.9872


U22 (q=2, Regional overlap) 0.8808 0.0724 0.7389 1.0000

ChaoJaccard-incidence 0.6534 0.0531 0.5493 0.7575


ChaoSorensen-incidence 0.7904 0.0435 0.7051 0.8757

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.7439 0.0441 0.6575 0.8303

(d) Measures for comparing species absolute abundances

C12=U12 (q=1) 0.7315 0.0441 0.6450 0.8179

C22 (Morisita-Horn) 0.7925 0.0326 0.7286 0.8564


U22 (Regional overlap) 0.8842 0.0223 0.8405 0.9280

Bray-Curtis 0.5440 0.0268 0.4915 0.5964

NOTE: If an estimate is greater than 1, it is replaced by 1.
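The role of the sampling-unit counts entered in the left panel (such as “19 17” in Example 4.2B) can be illustrated with a small sketch: the counts tell the program where to split the columns of the species-by-sampling-unit matrix, after which row sums within each block give the incidence frequencies. The function name and the toy matrix below are hypothetical and are not taken from the ciliates data or from SpadeR's code.

```python
def split_incidence_raw(matrix, units):
    """Split a species-by-unit 0/1 matrix into per-community incidence frequencies."""
    if any(len(row) != sum(units) for row in matrix):
        raise ValueError("row length must equal the total number of sampling units")
    freqs, start = [], 0
    for t in units:
        # Row sums over this community's block of columns.
        freqs.append([sum(row[start:start + t]) for row in matrix])
        start += t
    return freqs


# Toy matrix: 4 species (rows); 3 units for Community 1, 2 for Community 2.
raw = [
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
]
y1, y2 = split_incidence_raw(raw, [3, 2])
d1 = sum(1 for v in y1 if v > 0)                          # observed species, Community 1
d2 = sum(1 for v in y2 if v > 0)                          # observed species, Community 2
d12 = sum(1 for a, b in zip(y1, y2) if a > 0 and b > 0)   # observed shared species
print(y1, y2)            # [2, 3, 0, 1] [0, 1, 2, 1]
print(sum(y1), sum(y2))  # 6 4  (total incidences U1, U2)
print(d1, d2, d12)       # 3 3 2
```

If the unit counts do not add up to the number of columns, the split is ambiguous, which is why the sketch raises an error in that case.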

Part V: Multiple-Community (Similarity) Measure

This part computes N-community similarity indices for comparing species abundance data from
more than two communities. A traditional approach for simultaneously comparing more than two
communities in ecology is to use multiple pairwise comparisons. However, Chao et al. (2008)
demonstrated that two sets of (more than two) communities may have the same pairwise
similarities but are still not globally similar (e.g., they may differ in total species richness). Thus,
pairwise similarity indices do not completely characterize multiple-community similarity because
the information shared by more than two communities is not being considered in the pairwise
analyses. See “Theory of compositional similarity measures” in Part IV for a review.

In this part, three types of data are supported: species abundance data (Type 1),
incidence-frequency data (Type 2) or incidence-raw data (Type 2B). The interface window for
these analyses is shown below.

Figure 3: The Interface Window of SpadeR for Multiple-Community Similarity Measures

A summary of similarity measures featured in Part V (see Table A2 in the Appendix):

SpadeR mainly features the two types of similarity measures, CqN and UqN (see Part IV for a
review of these two classes of measures) along with two additional ones: the size-weighted Horn
similarity measures (Horn, 1966) and the N-community Bray-Curtis measure (Chao and Chiu,
2016c). Here we briefly summarize the measures featured in our output: (See the Appendix for
the formulas and Table A2 for a summary of these measures.)

(a) Classic richness-based similarity measures (q = 0):


-- C0N index: N-community Sørensen similarity index;
-- U0N index: N-community Jaccard similarity index.

For the cases of q = 1 and q = 2, the choice of measures depends on the study goal.

(b) If the goal is to compare N sets of species relative abundances, then communities are equally
weighted and SpadeR features the following indices:

-- C1N = U1N (q = 1, which is identical to the equal-weight Horn overlap measure);


-- C2N (q = 2, N-community equal-weight Morisita-Horn similarity index);
-- U2N (q = 2, N-community equal-weight regional overlap index).

(c) If the goal is to compare N sets of species relative abundances with size as community weight,
then only the size-weighted Horn (1966) index is accommodated:

-- Horn size-weighted index (q = 1).

(d) When the goal is to compare N sets of species absolute abundances, SpadeR features the
following four similarity indices:

-- C1N = U1N (q = 1);


-- C2N (q = 2, N-community Morisita-Horn similarity index);
-- U2N (q = 2, N-community regional overlap index);
-- The Bray-Curtis index.

In addition to the above N-community measures, SpadeR also supplies similarity values for any
pair of samples. Since there are several measures, users need to specify an order q and
comparison target (relative abundances or absolute abundances) in the left panel to output all
pairwise similarities.

Running Procedures
Species identification names/codes or site/community labels in your original data must be
removed to conform to the SpadeR data formats.

Step 1. Select the “Multiple-Community Measure” tab from the top menu.
Step 2. Select Abundance data, Incidence-frequency data, or Incidence-raw data from the
Data Setting.
Step 3. Check the Demo data radio button to load data (you can load your data by checking
Upload data).
(If incidence-raw data are loaded, you must specify the number of sampling units in
each community in the left panel.)
Step 4. Choose an order q (0, 1 or 2) and the comparison target (relative abundances or
absolute abundances) that you would like to use to compute the similarity for any
pair of samples. Here we select q = 1 and Relative abundances as an example.
Step 5. Press the Run! button to get the output.

Example 5.1. Using the Rain Forest Data to Demo Type (1) Abundance Data

The demo data for the Multiple-Community Measure part include the frequencies from three
assemblages: seedlings (column 1), saplings (column 2) and trees (column 3) in the LEP
old-growth rain forest; see the Section “Data Formats for More Than Two Communities” for
details. In this part, we calculate the similarity indices for comparing three communities. The
output for comparing these three assemblages is shown below:

OUTPUT:

(1) BASIC DATA INFORMATION:

The loaded set includes abundance/incidence data from 3 communities
and a total of 120 species.

Sample size in each community                            n1   = 557
                                                         n2   = 729
                                                         n3   = 111

Number of observed species in one community              D1   = 69
                                                         D2   = 102
                                                         D3   = 43

Number of observed shared species in two communities     D12  = 59
                                                         D13  = 26
                                                         D23  = 32

Number of observed shared species in three communities   D123 = 23

Number of bootstrap replications for s.e. estimate       100

(2) EMPIRICAL SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity

C03 (q=0, Sorensen) 0.6589 0.0238 0.6123 0.7054


U03 (q=0, Jaccard) 0.3917 0.0238 0.3450 0.4384

(b) Measures for comparing species relative abundances

C13=U13 (q=1, Horn) 0.6521 0.0176 0.6177 0.6865

C23 (q=2, Morisita-Horn) 0.5802 0.0286 0.5242 0.6362


U23 (q=2, Regional overlap) 0.8057 0.0193 0.7679 0.8435

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.6885 0.0174 0.6545 0.7226

(d) Measures for comparing species absolute abundances

C13=U13 (q=1) 0.5686 0.0174 0.5345 0.6026

C23 (Morisita-Horn) 0.4069 0.0255 0.3569 0.4569


U23 (Regional overlap) 0.6730 0.0241 0.6258 0.7202

Bray-Curtis 0.3962 0.0147 0.3675 0.4249

(3) ESTIMATED SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity

C03 (q=0, Sorensen) 0.8235 0.0649 0.6963 0.9506


U03 (q=0, Jaccard) 0.6086 0.1013 0.4101 0.8071

(b) Measures for comparing species relative abundances

C13=U13 (q=1, Horn) 0.7036 0.0240 0.6565 0.7506

C23 (q=2, Morisita-Horn) 0.6134 0.0316 0.5515 0.6753
U23 (q=2, Regional overlap) 0.8264 0.0201 0.7870 0.8657

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.7430 0.0230 0.6980 0.7880

(d) Measures for comparing species absolute abundances

C13=U13 (q=1) 0.6136 0.0230 0.5686 0.6586

C23 (Morisita-Horn) 0.4213 0.0265 0.3694 0.4732


U23 (Regional overlap) 0.6859 0.0243 0.6384 0.7335

Bray-Curtis 0.4287 0.0175 0.3944 0.4629

(4) ESTIMATED PAIRWISE SIMILARITY:

----------------------Measure C12 (=U12)-----------------------

Estimator Estimate s.e. 95% Confidence Interval

C12(1,2) 0.771 0.034 ( 0.704 , 0.837 )


C12(1,3) 0.676 0.073 ( 0.533 , 0.820 )
C12(2,3) 0.635 0.077 ( 0.484 , 0.786 )

Average pairwise similarity= 0.694

Pairwise similarity matrix:

C12(i,j) 1 2 3
1 1.000 0.771 0.676
2 1.000 0.635
3 1.000

-------------------Measure Horn size-weighted--------------------

Estimator Estimate s.e. 95% Confidence Interval

Horn(1,2) 0.768 0.033 ( 0.703 , 0.833 )


Horn(1,3) 0.759 0.075 ( 0.612 , 0.905 )
Horn(2,3) 0.688 0.060 ( 0.570 , 0.806 )

Average pairwise similarity = 0.738

Pairwise similarity matrix:

Horn(i,j) 1 2 3
1 1.000 0.768 0.759
2 1.000 0.688
3 1.000

NOTE: Any estimate greater than 1 is replaced by 1; any estimate less than 0 is replaced by 0.
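The pairwise part of the output can be reassembled from the three pairwise estimates: the matrix is symmetric with ones on the diagonal, and the reported average is the plain mean of the off-diagonal pairs. The sketch below is only a reading aid for that output, with illustrative function names.

```python
def similarity_matrix(pairwise, n):
    """Build a symmetric n-by-n matrix from {(i, j): value} with 1-based labels."""
    m = [[1.0] * n for _ in range(n)]
    for (i, j), v in pairwise.items():
        m[i - 1][j - 1] = v
        m[j - 1][i - 1] = v
    return m


def average_pairwise(pairwise):
    """Mean of the pairwise similarity estimates."""
    return sum(pairwise.values()) / len(pairwise)


# Estimates copied from the pairwise output of Example 5.1.
c12 = {(1, 2): 0.771, (1, 3): 0.676, (2, 3): 0.635}
horn_sw = {(1, 2): 0.768, (1, 3): 0.759, (2, 3): 0.688}

print(round(average_pairwise(c12), 3))      # 0.694
print(round(average_pairwise(horn_sw), 3))  # 0.738
for row in similarity_matrix(c12, 3):
    print(row)
```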

There are four output parts. The first part shows some basic data information, including the
number of individuals and the number of observed species in each community, as well as the
number of species shared by each pair of communities and the number of species shared by all
three communities. For comparing more than three communities, only shared information between
any two communities is given. The second part includes empirical similarity measures for those
measures described earlier and their s.e. estimates along with 95% confidence intervals. These
measures are for comparison purposes only because they are generally subject to negative bias
due to undetected species and undetected shared species in samples. The third part includes
estimated similarity measures which are recommended for practical use. The s.e. estimate for
each estimator is obtained by a bootstrap method based on 100 replications (default) along with
95% confidence intervals; see Chao et al. (2008) for details. You can increase the number of
replications in the left panel, but a large number of bootstrap replications will take longer
to produce the output. For a selected order q and comparison target (relative abundances or
absolute abundances) in the left panel, the fourth part includes similarity values for all pairs of
samples and their average.
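The printed confidence limits appear consistent with a normal approximation, estimate ± 1.96 × s.e., truncated to the interval [0, 1]. This is inferred from the numbers in the output rather than stated in the guide, so the sketch below should be read as an assumption; the function name is illustrative.

```python
def normal_ci(estimate, se, z=1.96):
    """Approximate 95% interval, truncated to the admissible range [0, 1]."""
    lower = max(0.0, estimate - z * se)
    upper = min(1.0, estimate + z * se)
    return lower, upper


# Empirical Sorensen index in Example 4.1.
print([round(v, 4) for v in normal_ci(0.4643, 0.0398)])  # [0.3863, 0.5423]
# Estimated Morisita-Horn index in Example 4.2 (lower limit truncated at 0).
print([round(v, 4) for v in normal_ci(0.0994, 0.2205)])  # [0.0, 0.5316]
# Estimated regional overlap in Example 4.2B (upper limit truncated at 1).
print([round(v, 4) for v in normal_ci(0.8808, 0.0724)])  # [0.7389, 1.0]
```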

For pairwise similarity, here we select q = 1 and the measures for comparing species relative
abundances in Step 4 of the running procedures. Thus, SpadeR supplies the results for the two
measures: C12 (= U12 = the equal-weight Horn measure) and the Horn size-weighted measure. For
this example, these two measures of q = 1 yield very close values, and the similarity values for
the three pairs of communities (trees vs. seedlings, trees vs. saplings and saplings vs. seedlings)
are generally comparable. If we select q = 2 and the measures for comparing species relative
abundances in Step 4 of the running procedures, then SpadeR supplies the results for the two
measures: C22 (Morisita-Horn) and U22 (regional overlap). By contrast, the two values
C22(1,3) = 0.787 and U22(1,3) = 0.881 reflect that the similarity between trees and seedlings is
significantly higher than the similarity between seedlings and saplings (C22(1,2) = 0.471 and
U22(1,2) = 0.641), and also higher than the similarity between saplings and trees
(C22(2,3) = 0.483 and U22(2,3) = 0.652). The different interpretation arises because the
measures of q = 2 are mainly sensitive to dominant species. The results for q = 2 emphasize the
resemblance among the more abundant species, whereas the results for q = 1 weigh species by
their abundances.

Note that in Example 4.1, we show all measures for comparing seedlings and trees for the same
data set. All the estimated pairwise similarity values between seedlings and trees in the above
output are exactly the same as those given in Example 4.1, but the estimated standard errors and
confidence intervals are slightly different due to the variation of the bootstrap replications (i.e.,
two different runs may result in different s.e. estimates and different confidence intervals).

Users may wonder which of the similarity measures featured in SpadeR is the most appropriate
choice. Like the diversity profile presented in Part II, we recommend using a similarity profile
which depicts the estimated similarity as a function of q. The relative merits of the measures of q
= 0, 1 and 2 are summarized as follows.

(1) The richness-based measures (q = 0, Jaccard and Sørensen measures), while intuitive, are
very sensitive to undetected species. Thus, they are difficult to estimate accurately when
samples are incomplete. Our estimated measures for q = 0 may be subject to substantial
under-sampling bias.

(2) The similarity measures of q = 2 (C2N and U2N) can be estimated accurately from sampling
data because they are focused on the abundant species and thus unaffected by undetected species.
The measure of q = 1 (C1N = U1N) weighs species by their abundances; although undetected
species have an effect to some extent, the under-sampling bias can be statistically evaluated and
a large portion of the bias can be corrected for (Chao et al., 2015).

(3) Chao et al. (2015, Appendix S6) proved that the Shannon measures (q = 1) satisfy some
essential monotonicity properties that measures of q = 2 lack. Therefore, our practical
guidelines regarding the choice of measures are as follows: if the study goal is to focus on
dominant species, then the measures of q = 2 are recommended; otherwise, measures of q = 1
are recommended.
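A similarity profile of the kind recommended above can be sketched by evaluating a measure over several orders q. The example below computes the empirical equal-weight CqN from raw abundance vectors via the gamma/alpha decomposition of Hill numbers, using the definitions as usually given in Chiu et al. (2014); these are assumed to match the Appendix. It is a plug-in calculation with illustrative names and does not include SpadeR's bias corrections.

```python
import math


def hill(values, q):
    """(sum v^q)^(1/(1-q)) over positive entries, with the q = 1 limit."""
    vals = [v for v in values if v > 0]
    if abs(q - 1.0) < 1e-12:
        return math.exp(-sum(v * math.log(v) for v in vals))
    return sum(v ** q for v in vals) ** (1.0 / (1.0 - q))


def beta_diversity(abundances, q):
    """Multiplicative beta diversity for N equally weighted communities."""
    n = len(abundances)
    # z[k][i] = relative abundance of species i in community k, divided by N.
    z = [[v / sum(x) / n for v in x] for x in abundances]
    pooled = [sum(col) for col in zip(*z)]
    gamma = hill(pooled, q)
    alpha = hill([v for row in z for v in row], q) / n
    return gamma / alpha


def local_overlap(abundances, q):
    """Empirical CqN computed from the beta diversity of order q."""
    n = len(abundances)
    beta = beta_diversity(abundances, q)
    if abs(q - 1.0) < 1e-12:
        return 1.0 - math.log(beta) / math.log(n)
    e = q - 1.0
    return ((1.0 / beta) ** e - (1.0 / n) ** e) / (1.0 - (1.0 / n) ** e)


# Toy data: two communities, three species, one shared.
data = [[10, 5, 0], [0, 5, 10]]
profile = {q: round(local_overlap(data, q), 4) for q in (0, 1, 2)}
print(profile)  # {0: 0.5, 1: 0.3333, 2: 0.2}
```

In this toy example the similarity decreases with q because the only shared species is not the dominant one in either community, which illustrates how the profile shifts emphasis from rare to abundant species as q increases.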

As noted in Part IV, only when sampling effort in each community is standardized (i.e.,
equal-effort sampling), are the measures for comparing absolute abundances statistically
meaningful. Thus, the widely used Bray-Curtis index for comparing absolute abundances makes
sense only under equal-effort sampling. However, the estimation of the Bray-Curtis index is
statistically challenging. We offer an estimator in SpadeR; derivation details will be published
soon.

Example 5.2. Using the Ants Data to Demo Type (2) Incidence-Frequency Data

The demo data include incidence frequencies of tropical rainforest ants collected using three
sampling techniques: (a) Berlese extraction of soil samples (217 samples), (b) canopy fogging
(459 samples), and (c) Malaise traps for flying and crawling insects (62 samples). The full
dataset is described under Type (2) data in the Section “Data Formats for More Than Two
Communities”. The output shown below can be used to compare the species composition obtained
by the three methods (see Chao et al. 2005 for an interpretation):

OUTPUT:
(1) BASIC DATA INFORMATION:

The loaded set includes abundance/incidence data from 3 communities
and a total of 276 species.

Number of sample units in each community                  T1 = 217
                                                          T2 = 459
                                                          T3 = 62

Number of total incidences in each community              U1 = 775
                                                          U2 = 3262
                                                          U3 = 455

Number of observed species in one community               D1 = 117
                                                          D2 = 165
                                                          D3 = 103

Number of observed shared species in two communities      D12 = 24
                                                          D13 = 20
                                                          D23 = 78

Number of observed shared species in three communities    D123 = 13

Number of bootstrap replications for s.e. estimate        100

(2) EMPIRICAL SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity

C03 (q=0, Sorensen) 0.4247 0.0150 0.3952 0.4542


U03 (q=0, Jaccard) 0.1975 0.0256 0.1473 0.2477

(b) Measures for comparing species relative abundances

C13=U13 (q=1, Horn) 0.3633 0.0078 0.3480 0.3786

C23 (q=2, Morisita-Horn) 0.2713 0.0108 0.2501 0.2924


U23 (q=2, Regional overlap) 0.5276 0.0039 0.5199 0.5353

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.3664 0.0052 0.3562 0.3766

(d) Measures for comparing species absolute abundances

C13=U13 (q=1) 0.2559 0.0052 0.2457 0.2661

C23 (Morisita-Horn) 0.3271 0.0167 0.2943 0.3598


U23 (Regional overlap) 0.5932 0.0190 0.5560 0.6304

Bray-Curtis 0.1530 0.0128 0.1280 0.1781

(3) ESTIMATED SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity

C03 (q=0, Sorensen) 0.5229 0.0375 0.4495 0.5964


U03 (q=0, Jaccard) 0.2676 0.0835 0.1040 0.4312

(b) Measures for comparing species relative abundances

C13=U13 (q=1, Horn) 0.4325 0.0081 0.4166 0.4484

C23 (q=2, Morisita-Horn) 0.2851 0.1874 0.0000 0.6523


U23 (q=2, Regional overlap) 0.5447 0.0151 0.5150 0.5744

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.4378 0.0051 0.4278 0.4478

(d) Measures for comparing species absolute abundances

C13=U13 (q=1) 0.3058 0.0051 0.2958 0.3158

C23 (Morisita-Horn) 0.3423 0.0174 0.3082 0.3765


U23 (Regional overlap) 0.6096 0.0192 0.5721 0.6472

Bray-Curtis 0.1931 0.0831 0.0303 0.3560

(4) ESTIMATED PAIRWISE SIMILARITY:

----------------------Measure C12 (=U12)-----------------------

Estimator Estimate s.e. 95% Confidence Interval

C12(1,2) 0.211 0.073 ( 0.068 , 0.354 )


C12(1,3) 0.229 0.082 ( 0.067 , 0.390 )
C12(2,3) 0.720 0.038 ( 0.646 , 0.793 )

Average pairwise similarity= 0.386

Pairwise similarity matrix:

C12(i,j) 1 2 3
1 1.000 0.211 0.229
2 1.000 0.720
3 1.000

-------------------Measure Horn size-weighted--------------------

Estimator Estimate s.e. 95% Confidence Interval

Horn(1,2) 0.222 0.098 ( 0.031 , 0.414 )


Horn(1,3) 0.228 0.064 ( 0.103 , 0.353 )
Horn(2,3) 0.737 0.037 ( 0.665 , 0.809 )

Average pairwise similarity = 0.396

Pairwise similarity matrix:

Horn(i,j) 1 2 3
1 1.000 0.157 0.217
2 1.000 0.395
3 1.000

NOTE: Any estimate greater than 1 is replaced by 1; any estimate less than 0 is replaced by 0.
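The intervals in the output appear consistent with a normal approximation (estimate ± 1.96 × s.e.) whose end points are truncated to the range [0, 1], in line with the NOTE above. The short sketch below illustrates that reading; it is an assumption about how the printed numbers relate to one another, not a description of SpadeR's internal code, and the function name is hypothetical.

```python
def truncated_normal_ci(estimate, se, z=1.96):
    """Normal-approximation interval with both end points clipped to [0, 1]."""
    lower = max(0.0, estimate - z * se)
    upper = min(1.0, estimate + z * se)
    return lower, upper

# C12(1,2) from the pairwise output above: estimate 0.211, s.e. 0.073.
low, high = truncated_normal_ci(0.211, 0.073)
print(round(low, 3), round(high, 3))

# C23 (q=2, Morisita-Horn) from part (3): the lower end point would be negative
# and is therefore clipped to 0.
print(truncated_normal_ci(0.2851, 0.1874))
```

Small differences in the last digit relative to the printed output are expected, because the program works with unrounded estimates and standard errors.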

Example 5.2B. Using the Soil Ciliates Data to Demo Type (2B) Incidence-Raw Data

The demo data include soil ciliate species detection/non-detection data for a total of 51 soil
samples from three areas of Namibia, Africa: Etosha Pan (19 samples), Central Namib Desert (17
samples) and Southern Namib Desert (15 samples). The full dataset is described under Type (2B)
data in the Section “Data Formats for More Than Two Communities”. If these were user-uploaded
data (rather than demo data), then in the left panel you must type in the three numbers “19 17
15” (separated by at least one space) to specify the number of sampling units in the three areas.
The species checklist in the original data includes 365 ciliate species. The demo incidence-raw
data are formatted as a 365 x 51 matrix of 0’s and 1’s (“0” denotes non-detection while “1”
denotes detection); each row of the matrix holds the detection/non-detection records for the same
species so that the information about shared species can be computed. The output is shown below
(see Chao et al. 2005 for an interpretation):
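The role of the numbers “19 17 15” can be illustrated with a small sketch: they tell the program how to split the columns of the raw 0/1 matrix into areas, after which each species has one incidence frequency per area. The code below is a hypothetical illustration with a toy 3 x 5 matrix, not SpadeR code and not the ciliate data.

```python
def incidence_frequencies(raw, units_per_area):
    """Collapse a species-by-sampling-unit 0/1 matrix into incidence frequencies.

    raw: list of rows (one per species), columns ordered area by area.
    units_per_area: number of sampling units in each area, e.g. [19, 17, 15].
    """
    total_units = sum(units_per_area)
    freqs = []
    for row in raw:
        if len(row) != total_units:
            raise ValueError("row length does not match the number of sampling units")
        counts, start = [], 0
        for t in units_per_area:
            counts.append(sum(row[start:start + t]))
            start += t
        freqs.append(counts)
    return freqs

# Toy example: 3 species, 2 sampling units in area 1 and 3 in area 2.
toy_raw = [
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
]
toy_freqs = incidence_frequencies(toy_raw, [2, 3])
print(toy_freqs)

# A species is shared when it was detected at least once in every area.
shared = sum(1 for counts in toy_freqs if all(c > 0 for c in counts))
print("shared species:", shared)
```

Because every row refers to the same species across all areas, shared-species counts follow directly from the collapsed frequencies.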
(1) BASIC DATA INFORMATION:

The loaded set includes abundance/incidence data from 3 communities
and a total of 304 species.

Number of sample units in each community                  T1 = 19
                                                          T2 = 17
                                                          T3 = 15

Number of total incidences in each community              U1 = 516
                                                          U2 = 380
                                                          U3 = 358

Number of observed species in one community               D1 = 216
                                                          D2 = 130
                                                          D3 = 150

Number of observed shared species in two communities      D12 = 84
                                                          D13 = 97
                                                          D23 = 76

Number of observed shared species in three communities    D123 = 65

Number of bootstrap replications for s.e. estimate        100

(2) EMPIRICAL SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity

C03 (q=0, Sorensen) 0.5806 0.0146 0.5520 0.6093


U03 (q=0, Jaccard) 0.3158 0.0199 0.2767 0.3549

(b) Measures for comparing species relative abundances

C13=U13 (q=1, Horn) 0.6929 0.0106 0.6722 0.7136

C23 (q=2, Morisita-Horn) 0.7523 0.0132 0.7265 0.7781


U23 (q=2, Regional overlap) 0.9011 0.0056 0.8901 0.9121

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.6903 0.0097 0.6713 0.7094

(d) Measures for comparing species absolute abundances

C13=U13 (q=1) 0.6818 0.0097 0.6628 0.7009

C23 (Morisita-Horn) 0.7652 0.0198 0.7265 0.8040


U23 (Regional overlap) 0.9072 0.0107 0.8863 0.9282

Bray-Curtis 0.5658 0.0103 0.5455 0.5860

(3) ESTIMATED SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based similarity

C03 (q=0, Sorensen) 0.6658 0.0384 0.5906 0.7409


U03 (q=0, Jaccard) 0.3990 0.0897 0.2232 0.5749

(b) Measures for comparing species relative abundances

C13=U13 (q=1, Horn) 0.7440 0.0113 0.7219 0.7662

C23 (q=2, Morisita-Horn) 0.9158 0.0616 0.7950 1.0000


U23 (q=2, Regional overlap) 0.9703 0.0305 0.9106 1.0000

(c) Measures for comparing size-weighted species relative abundances

Horn size-weighted (q=1) 0.7417 0.0125 0.7172 0.7662

(d) Measures for comparing species absolute abundances

C13=U13 (q=1) 0.7326 0.0125 0.7080 0.7571

C23 (Morisita-Horn) 0.8833 0.0214 0.8414 0.9252


U23 (Regional overlap) 0.9578 0.0099 0.9383 0.9773

Bray-Curtis 0.5938 0.0100 0.5743 0.6134

(4) ESTIMATED PAIRWISE SIMILARITY:

----------------------Measure C12 (=U12)-----------------------

Estimator Estimate s.e. 95% Confidence Interval

C12(1,2) 0.737 0.045 ( 0.648 , 0.826 )


C12(1,3) 0.787 0.047 ( 0.694 , 0.880 )
C12(2,3) 0.817 0.041 ( 0.735 , 0.898 )

Average pairwise similarity= 0.780

Pairwise similarity matrix:

C12(i,j) 1 2 3
1 1.000 0.737 0.787
2 1.000 0.817
3 1.000

-------------------Measure Horn size-weighted--------------------

Estimator Estimate s.e. 95% Confidence Interval

Horn(1,2) 0.744 0.045 ( 0.656 , 0.832 )


Horn(1,3) 0.793 0.045 ( 0.705 , 0.882 )
Horn(2,3) 0.816 0.040 ( 0.738 , 0.894 )

Average pairwise similarity = 0.784

Pairwise similarity matrix:

Horn(i,j) 1 2 3
1 1.000 0.731 0.775
2 1.000 0.815
3 1.000

NOTE: Any estimate greater than 1 is replaced by 1; any estimate less than 0 is replaced by 0.

Part VI: Genetics (Differentiation) Measures

Nearly all measures featured in this genetics part are taken from Part V. The interface shown
below is similar to that for Part V. However, all ecological terminology in the output is converted
to its corresponding genetic version. The changes include: (a) “species” is replaced by “alleles”;
(b) “community/assemblage” is replaced by “subpopulation”; (c) “similarity index” is replaced
by its complement because in genetics, dissimilarity/differentiation measures are more widely
used; and (d) in most applications, only allele relative frequency distributions among
subpopulations are compared, thus the output does not include comparisons of allele absolute
abundances.

Store your data as an allele (row) by subpopulation (column) matrix file. In this part, only
abundance data (Type 1) are supported; see the Section “Data Format for Genetics data” for data
input format. In many genetic data sets, only allele proportions are given. To run SpadeR, you
must first convert proportional data (e.g., pj = 0.04) to frequency data (e.g., 36, the actual
number of sampled instances of allele j) and then load the data.
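A minimal sketch of this conversion is given below. It assumes that the sample size of each subpopulation is known; the example in the text (pj = 0.04 giving 36) would correspond to a sample size of 900, which is an inference and is not stated in the guide. The function name and the other proportions are hypothetical.

```python
def proportions_to_counts(proportions, sample_size):
    """Convert allele proportions of one subpopulation to integer frequencies."""
    counts = [round(p * sample_size) for p in proportions]
    if sum(counts) != sample_size:
        # Rounding can shift the total; report it rather than silently adjust.
        raise ValueError(
            f"counts sum to {sum(counts)}, expected {sample_size}; "
            "check the proportions or the sample size"
        )
    return counts

# Hypothetical subpopulation with three alleles and 900 sampled gene copies.
print(proportions_to_counts([0.04, 0.36, 0.60], 900))
```

Raising an error on a mismatched total is a design choice for the sketch: it forces the user to decide how to resolve rounding rather than hiding it.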

Figure 4: The Interface Window of SpadeR for Part VI (Genetics Measures)

Running Procedures
Species identification names/codes or site/community labels in your original data must be
removed to conform to the SpadeR data formats.

Step 1. Select the “Genetics Measures” tab from the top menu.
Step 2. Select Hypothetical data or Human allele data from the Data Setting.
(No options for data types; only abundance/frequency data are supported).
Step 3. Check the Demo data radio button to load data (you can load your data by checking
Upload data).
Step 4. Choose an order q (0, 1 or 2) that you would like to use to compute the dissimilarity
for any pair of samples. In the following demo examples, we select q = 2 for the first
run. Then in the second run, we select q = 1 for comparison.
Step 5. Press the Run! button to get the output.

Before measuring allelic differentiation among subpopulations, you may first assess genetic
diversity within each subpopulation by running Part II (Diversity Profile Estimation). Part II
provides the number of effective alleles (diversity of orders 0, 1 and 2) along with their s.e.
estimate and 95% confidence interval; refer to examples there for an interpretation of the output
(replacing “species” and “community” by “alleles” and “subpopulation”). See Jost (2006, 2007,
2008) and Chao et al. (2015) for pertinent theoretical background.

Example 6.1. Using Hypothetical Alleles Frequency Data to Demo Type (1) Abundance Data

We first use a hypothetical dataset to demonstrate the performance of some differentiation
measures. The data listed below include the allele frequencies from three subpopulations. For any
allele which is not found in the sample from a subpopulation, you must store the frequency 0 in
the data file. For example, in the following data, the first 13 alleles were not found in the
second subpopulation and the other 7 alleles were not found in the first subpopulation; we have
thus entered 0’s in the first 13 rows of the second column and in the next 7 rows of the first
column.

Table 1. Hypothetical allele by subpopulation frequency matrix

24 0 1
20 0 3
28 0 4
20 0 5
24 0 7
15 0 8
18 0 9
24 0 10
24 0 11
20 0 13
8 0 14
15 0 15
13 0 16
0 38 18
0 40 19
0 30 20
0 33 21
0 43 23
0 35 24
0 33 13

OUTPUT: (In the fourth part of the output, we present all pairwise dis-similarity estimates for
both q = 1 and q = 2. The estimates used in the text are high-lighted in red)
(1) BASIC DATA INFORMATION:

The loaded set includes abundance (or frequency) data from 3 subpopulations
and a total of 20 distinct alleles are found.

Sample size in each subpopulation                          n1 = 253
                                                           n2 = 252
                                                           n3 = 254

Number of observed alleles in one subpopulation            D1 = 13
                                                           D2 = 7
                                                           D3 = 20

Number of observed shared alleles in two subpopulations    D12 = 0
                                                           D13 = 13
                                                           D23 = 7

Number of shared alleles in three subpopulations           D123 = 0

Number of bootstrap replications for s.e. estimate         100

(2) EMPIRICAL DIS-SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based dis-similarity

1-C03 (q=0, Sorensen) 0.2500 0.0092 0.2321 0.2679


1-U03 (q=0, Jaccard) 0.5000 0.0119 0.4767 0.5233

(b) Measures for comparing alleles relative abundances

1-C13=1-U13 (q=1, Horn) 0.4516 0.0082 0.4355 0.4678

1-C23 (q=2, Morisita-Horn) 0.6223 0.0106 0.6015 0.6431


1-U23 (q=2, Regional diff.) 0.3545 0.0107 0.3336 0.3754

Gst 0.0427 0.0014 0.0399 0.0455

(c) Measures for comparing size-weighted alleles relative abundances

Horn size-weighted(q=1) 0.4509 0.0087 0.4338 0.4680

(3) ESTIMATED DIS-SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based dis-similarity

1-C03 (q=0, Sorensen) 0.2500 0.0138 0.2229 0.2771


1-U03 (q=0, Jaccard) 0.5000 0.0187 0.4633 0.5367

(b) Measures for comparing alleles relative abundances

1-C13=1-U13 (q=1, Horn) 0.4516 0.0085 0.4351 0.4682

1-C23 (q=2, Morisita-Horn) 0.6078 0.0111 0.5861 0.6295


1-U23 (q=2, Regional diff.) 0.3406 0.0108 0.3195 0.3618

Gst 0.0401 0.0014 0.0373 0.0429

(c) Measures for comparing size-weighted alleles relative abundances

Horn size-weighted (q=1) 0.4509 0.0083 0.4346 0.4672

(4) ESTIMATED PAIRWISE DIS-SIMILARITY: (An order q = 2 is selected in Step 4 in the running procedures)

-----------------------Measure 1-C22------------------------

Estimator Estimate s.e. 95% Confidence Interval

1-C22(1,2) 1.000 0.000 ( 1.000 , 1.000 )


1-C22(1,3) 0.540 0.049 ( 0.444 , 0.637 )
1-C22(2,3) 0.225 0.036 ( 0.154 , 0.296 )

Average pairwise dis-similarity = 0.588

Pairwise dis-similarity matrix:

1-C22(i,j) 1 2 3
1 0 1.000 0.540
2 0 0.225
3 0

-----------------------Measure 1-U22------------------------

Estimator Estimate s.e. 95% Confidence Interval

1-U22(1,2) 1.000 0.000 ( 1.000 , 1.000 )


1-U22(1,3) 0.370 0.048 ( 0.275 , 0.465 )
1-U22(2,3) 0.127 0.024 ( 0.080 , 0.174 )

Average pairwise dis-similarity = 0.499

Pairwise dis-similarity matrix:

1-U22(i,j) 1 2 3
1 0 1.000 0.370
2 0 0.127
3 0

NOTE: Any estimate greater than 1 is replaced by 1; any estimate less than 0 is replaced by 0.

(4) ESTIMATED PAIRWISE DIS-SIMILARITY: (An order q = 1 is selected in Step 4 in the running procedures)

--------------------Measure 1-C12 (=1-U12)---------------------

Estimator Estimate s.e. 95% Confidence Interval

1-C12(1,2) 1.000 0.000 ( 1.000 , 1.000 )


1-C12(1,3) 0.414 0.028 ( 0.360 , 0.469 )
1-C12(2,3) 0.282 0.026 ( 0.232 , 0.332 )

Average pairwise dis-similarity = 0.565

Pairwise dis-similarity matrix:

1-C12(i,j) 1 2 3
1 0 1.000 0.414
2 0 0.282
3 0

------------------Measure Horn size-weighted--------------------

Estimator Estimate s.e. 95% Confidence Interval

Horn(1,2) 1.000 0.000 ( 1.000 , 1.000 )


Horn(1,3) 0.414 0.028 ( 0.359 , 0.469 )
Horn(2,3) 0.282 0.020 ( 0.242 , 0.322 )

Average pairwise dis-similarity = 0.565

Pairwise dis-similarity matrix:

Horn(i,j) 1 2 3
1 0 1.000 0.414
2 0 0.282
3 0

NOTE: Any estimate greater than 1 is replaced by 1; any estimate less than 0 is replaced by 0.

The output is generally parallel to that in Part V except that here dis-similarity is used instead of
similarity, and the part related to comparing absolute abundances is omitted. See Examples 5.1,
5.2 and 5.2B for an explanation of the output. Here any empirical dis-similarity measure (in the
second part of the output) is subject to positive bias for under-sampled data due to the effect of
undetected alleles and undetected shared alleles in samples. Therefore, any estimated
dis-similarity index (in the third part of the output) is generally lower than the corresponding
empirical index, with the magnitude of this difference representing the under-sampling bias
associated with the empirical index. In the fourth part of the output, we present all pairwise
dis-similarity estimates for both q = 1 and q = 2.

A conventional genetic differentiation measure is Nei’s GST based on the heterozygosity measure
(Nei, 1973). When the relative allele frequencies of K subpopulations are compared, the
Morisita-Horn (= Jost’s D), regional differentiation and Shannon differentiation (i.e., the
Horn equal-weight index) measures are identical to the measures 1−C2K, 1−U2K and 1−C1K
(= 1−U1K), respectively. Using the hypothetical data given in Table 1, we compare in Table 2 the
estimates of these three measures with GST for the three pairs of subpopulations and also for all
three subpopulations jointly.
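As a rough check on how the q = 2 measures relate to GST, the sketch below computes the empirical (plug-in) values of GST, 1−C2K and 1−U2K from the Table 1 counts, using equal subpopulation weights. It reproduces the empirical indices in part (2) of the output for the three-subpopulation comparison, not the bias-corrected estimates shown in Table 2, and it is not SpadeR code; the function names are hypothetical.

```python
# Allele counts from Table 1 (rows = alleles, columns = subpopulations 1, 2, 3).
TABLE1 = [
    (24, 0, 1), (20, 0, 3), (28, 0, 4), (20, 0, 5), (24, 0, 7),
    (15, 0, 8), (18, 0, 9), (24, 0, 10), (24, 0, 11), (20, 0, 13),
    (8, 0, 14), (15, 0, 15), (13, 0, 16), (0, 38, 18), (0, 40, 19),
    (0, 30, 20), (0, 33, 21), (0, 43, 23), (0, 35, 24), (0, 33, 13),
]

def relative_frequencies(matrix):
    """Return one list of relative allele frequencies per subpopulation."""
    k = len(matrix[0])
    sizes = [sum(row[j] for row in matrix) for j in range(k)]
    return [[row[j] / sizes[j] for row in matrix] for j in range(k)]

def q2_measures(p):
    """Empirical GST, 1-C2K (Morisita-Horn) and 1-U2K (regional) for K subpopulations."""
    k = len(p)
    within = sum(x * x for sub in p for x in sub) / k   # mean homozygosity
    pooled = [sum(col) / k for col in zip(*p)]
    total = sum(x * x for x in pooled)                   # pooled homozygosity
    gst = 1.0 - (1.0 - within) / (1.0 - total)           # 1 - H_S / H_T
    beta = within / total                                # gamma / alpha for q = 2
    one_minus_c = (1.0 - 1.0 / beta) / (1.0 - 1.0 / k)
    one_minus_u = (beta - 1.0) / (k - 1.0)
    return gst, one_minus_c, one_minus_u

p = relative_frequencies(TABLE1)
gst, mh, regional = q2_measures(p)
print(f"all three: GST = {gst:.4f}, 1-C23 = {mh:.4f}, 1-U23 = {regional:.4f}")

# Subpopulations 1 and 2 share no alleles.
gst12, mh12, regional12 = q2_measures([p[0], p[1]])
print(f"pair (1,2): GST = {gst12:.4f}, 1-C22 = {mh12:.4f}, 1-U22 = {regional12:.4f}")
```

For the pair (1, 2) both overlap-based measures equal 1 exactly, while the plug-in GST stays close to 0, which is the contrast that Table 2 is meant to show.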

Table 2. Comparison of the allele relative frequency distributions based on the estimates of
four differentiation measures (GST, Morisita-Horn, regional, and Shannon) for the hypothetical
data given in Table 1

Subpopulations   GST (q = 2)   Morisita-Horn = Jost’s D (q = 2)   Regional differentiation (q = 2)   Shannon or Horn differentiation (q = 1)
(1, 2)           0.058         1−C22(1,2) = 1.000                 1−U22(1,2) = 1.000                 1−C12(1,2) = 1.000
(1, 3)           0.020         1−C22(1,3) = 0.540                 1−U22(1,3) = 0.370                 1−C12(1,3) = 0.414
(2, 3)           0.012         1−C22(2,3) = 0.225                 1−U22(2,3) = 0.127                 1−C12(2,3) = 0.282
(1, 2, 3)        0.040         1−C23 = 0.608                      1−U23 = 0.341                      1−C13 = 0.452

In our hypothetical example, we deliberately constructed the data so that subpopulations 1 and 2
are completely differentiated (i.e., they have no shared alleles). Since the theoretical range for
all three measures is between 0 and 1, the Morisita-Horn, regional differentiation and Shannon
measures correctly indicate that subpopulations 1 and 2 achieve complete differentiation: all
three attain the maximum value of unity. By contrast, GST (= 0.058, not shown in our output for
pairwise comparison) indicates little differentiation.

In our hypothetical example, we also deliberately arranged that the infrequent alleles in
subpopulation 3 correspond to frequent alleles in subpopulation 1, whereas the frequent alleles
in subpopulation 3 are non-shared alleles (i.e., they have zero frequency in subpopulation 1). The
Morisita-Horn (= 0.540), regional differentiation (= 0.370), and Shannon (= 0.414) measures all
show moderate differentiation between these two subpopulations. These measures are interpreted
as follows: if we focus on dominant alleles, then on average about 54% of the alleles in a
subpopulation are not shared with the other subpopulation, and approximately 37% of the alleles
in the total population are non-shared alleles. If we weigh alleles by their frequencies, then both
proportions become 41.4%. All these proportions reflect allelic difference between the two
subpopulations. By contrast, the value of GST (= 0.020) implies that there is almost no difference.

Similar contrasting results are also shown in the three-subpopulation comparison. Table 2 shows
that the estimates for the above three differentiation measures are respectively 0.608, 0.341 and
0.452. That is, based on dominant alleles, about 60.8% of the alleles in a subpopulation are not
shared with the other two subpopulations, and approximately 34.1% of the alleles in the total
population are not shared by at least two subpopulations. If we weigh alleles by their frequencies,
then both proportions become 45.2%. Again, all these proportions reflect the existence of allelic
difference among the three subpopulations. However, GST yields a value of 0.040, implying very
low differentiation among the three subpopulations. Application to a real data set, as illustrated
below, further reveals similarly contrasting results for the estimates of these measures.

Example 6.2. Using Human Alleles Frequency Data to Demo Type (1) Abundance Data

We chose allele frequencies from four human subpopulations (BiakaPyg, Palestin, Bedouin and
Druze) taken from the data set provided by Rosenberg et al. (2002). The allele frequency data at
locus D3S2427 for the four subpopulations are shown below. In the fourth part of the output, we
present all pairwise dis-similarity estimates for both q = 1 and q = 2.

Table 3. Human allele frequency data in Locus D3S2427 for four subpopulations

Allele Subpopulations
Code BiakaPyg Palestin Bedouin Druze
203 0 6 3 1
209 11 0 0 0
213 5 0 0 0
215 3 0 3 0
217 1 0 0 0
219 0 2 3 3
221 0 0 0 1
223 2 0 0 0
225 3 4 9 6
227 6 2 3 0
229 1 4 6 4
231 5 16 9 5
233 3 4 0 5
235 3 11 16 29
237 9 16 17 9
239 2 15 9 18
241 1 2 4 6
243 2 6 5 4
245 4 4 3 1
247 4 2 6 0
249 1 1 0 0
251 0 1 1 0
253 0 0 1 0
257 2 0 0 0
259 0 0 0 0
261 2 0 0 0
263 0 0 0 0
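The summary counts in part (1) of the output below can be reproduced directly from Table 3. The following sketch first drops the allele-code column (SpadeR data files must not contain labels) and then tallies sample sizes, observed alleles and pairwise shared alleles. It is an illustration only, and the function name is hypothetical.

```python
from itertools import combinations

# Rows copied from Table 3: (allele code, BiakaPyg, Palestin, Bedouin, Druze).
TABLE3 = [
    (203, 0, 6, 3, 1), (209, 11, 0, 0, 0), (213, 5, 0, 0, 0), (215, 3, 0, 3, 0),
    (217, 1, 0, 0, 0), (219, 0, 2, 3, 3), (221, 0, 0, 0, 1), (223, 2, 0, 0, 0),
    (225, 3, 4, 9, 6), (227, 6, 2, 3, 0), (229, 1, 4, 6, 4), (231, 5, 16, 9, 5),
    (233, 3, 4, 0, 5), (235, 3, 11, 16, 29), (237, 9, 16, 17, 9), (239, 2, 15, 9, 18),
    (241, 1, 2, 4, 6), (243, 2, 6, 5, 4), (245, 4, 4, 3, 1), (247, 4, 2, 6, 0),
    (249, 1, 1, 0, 0), (251, 0, 1, 1, 0), (253, 0, 0, 1, 0), (257, 2, 0, 0, 0),
    (259, 0, 0, 0, 0), (261, 2, 0, 0, 0), (263, 0, 0, 0, 0),
]

def basic_information(matrix):
    """Summarize an allele-by-subpopulation count matrix (no label column)."""
    k = len(matrix[0])
    sizes = [sum(row[j] for row in matrix) for j in range(k)]
    observed = [sum(1 for row in matrix if row[j] > 0) for j in range(k)]
    shared = {
        (i + 1, j + 1): sum(1 for row in matrix if row[i] > 0 and row[j] > 0)
        for i, j in combinations(range(k), 2)
    }
    distinct = sum(1 for row in matrix if any(c > 0 for c in row))
    return sizes, observed, shared, distinct

counts_only = [row[1:] for row in TABLE3]   # remove the allele codes
sizes, observed, shared, distinct = basic_information(counts_only)
print("sample sizes:", sizes)
print("observed alleles:", observed)
print("shared alleles:", shared)
print("distinct alleles:", distinct)
```

The two all-zero rows (codes 259 and 263) explain why the program reports 25 distinct alleles although the table lists 27 codes.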

OUTPUT: (In the fourth part of the output, we present all pairwise dis-similarity estimates for
both q = 1 and q = 2. The estimates used in the text are high-lighted in red)
(1) BASIC DATA INFORMATION:

The loaded set includes abundance (or frequency) data from 4 subpopulations
and a total of 25 distinct alleles are found.

Sample size in each subpopulation                          n1 = 70
                                                           n2 = 96
                                                           n3 = 98
                                                           n4 = 92

Number of observed alleles in one subpopulation            D1 = 20
                                                           D2 = 16
                                                           D3 = 16
                                                           D4 = 13

Number of observed shared alleles in two subpopulations    D12 = 13
                                                           D13 = 12
                                                           D14 = 10
                                                           D23 = 14
                                                           D24 = 12
                                                           D34 = 11

Number of bootstrap replications for s.e. estimate         100

(2) EMPIRICAL DIS-SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based dis-similarity

1-C04 (q=0, Sorensen) 0.1795 0.0221 0.1361 0.2229


1-U04 (q=0, Jaccard) 0.4667 0.0379 0.3923 0.5410

(b) Measures for comparing alleles relative abundances

1-C14=1-U14 (q=1, Horn) 0.2044 0.0260 0.1534 0.2553

1-C24 (q=2, Morisita-Horn) 0.3259 0.0442 0.2393 0.4125


1-U24 (q=2, Regional diff.) 0.1078 0.0219 0.0649 0.1508

Gst 0.0303 0.0057 0.0190 0.0415

(c) Measures for comparing size-weighted alleles relative abundances

Horn size-weighted (q=1) 0.1933 0.0261 0.1421 0.2445

(3) ESTIMATED DIS-SIMILARITY INDICES:

Estimate s.e. 95%Lower 95%Upper


(a) Classic richness-based dis-similarity

1-C04 (q=0, Sorensen) 0.1524 0.0583 0.0381 0.2667


1-U04 (q=0, Jaccard) 0.4183 0.1142 0.1945 0.6421

(b) Measures for comparing alleles relative abundances

1-C14=1-U14 (q=1, Horn) 0.2013 0.0290 0.1445 0.2581

1-C24 (q=2, Morisita-Horn) 0.2585 0.0493 0.1618 0.3551


1-U24 (q=2, Regional diff.) 0.0802 0.0218 0.0374 0.1229

Gst 0.0218 0.0058 0.0104 0.0331

(c) Measures for comparing size-weighted alleles relative abundances

Horn size-weighted (q=1) 0.1903 0.0260 0.1393 0.2413

(4) ESTIMATED PAIRWISE DIS-SIMILARITY: (An order q = 2 is selected in Step 4 for pairwise comparisons)

-----------------------Measure 1-C22------------------------

Estimator Estimate s.e. 95% Confidence Interval

1-C22(1,2) 0.338 0.091 ( 0.161 , 0.516 )


1-C22(1,3) 0.290 0.103 ( 0.088 , 0.492 )
1-C22(1,4) 0.602 0.089 ( 0.428 , 0.777 )
1-C22(2,3) 0.015 0.063 ( 0.000 , 0.138 )
1-C22(2,4) 0.188 0.080 ( 0.031 , 0.346 )
1-C22(3,4) 0.137 0.091 ( 0.000 , 0.315 )

Average pairwise dis-similarity = 0.262

Pairwise dis-similarity matrix:

1-C22(i,j) 1 2 3 4
1 0 0.338 0.290 0.602
2 0 0.015 0.188
3 0 0.137
4 0

-----------------------Measure 1-U22------------------------

Estimator Estimate s.e. 95% Confidence Interval

1-U22(1,2) 0.204 0.076 ( 0.055 , 0.352 )


1-U22(1,3) 0.170 0.085 ( 0.003 , 0.336 )
1-U22(1,4) 0.431 0.097 ( 0.240 , 0.622 )
1-U22(2,3) 0.007 0.035 ( 0.000 , 0.077 )
1-U22(2,4) 0.104 0.054 ( 0.000 , 0.210 )
1-U22(3,4) 0.074 0.058 ( 0.000 , 0.187 )

Average pairwise dis-similarity = 0.165

Pairwise dis-similarity matrix:

1-U22(i,j) 1 2 3 4
1 0 0.204 0.170 0.431
2 0 0.007 0.104
3 0 0.074
4 0

NOTE: Any estimate greater than 1 is replaced by 1; any estimate less than 0 is replaced by 0.

(4) ESTIMATED PAIRWISE DIS-SIMILARITY: (An order q = 1 is selected in Step 4 for pairwise comparisons)

--------------------Measure 1-C12 (=1-U12)---------------------

Estimator Estimate s.e. 95% Confidence Interval

1-C12(1,2) 0.300 0.061 ( 0.181 , 0.419 )
1-C12(1,3) 0.290 0.058 ( 0.176 , 0.403 )
1-C12(1,4) 0.421 0.068 ( 0.288 , 0.554 )
1-C12(2,3) 0.085 0.034 ( 0.018 , 0.152 )
1-C12(2,4) 0.105 0.050 ( 0.008 , 0.203 )
1-C12(3,4) 0.128 0.043 ( 0.044 , 0.212 )

Average pairwise dis-similarity = 0.222

Pairwise dis-similarity matrix:

1-C12(i,j) 1 2 3 4
1 0 0.300 0.290 0.421
2 0 0.085 0.105
3 0 0.128
4 0

------------------Measure Horn size-weighted--------------------

Estimator Estimate s.e. 95% Confidence Interval

Horn(1,2) 0.310 0.060 ( 0.191 , 0.428 )


Horn(1,3) 0.301 0.053 ( 0.197 , 0.404 )
Horn(1,4) 0.431 0.060 ( 0.314 , 0.547 )
Horn(2,3) 0.085 0.034 ( 0.018 , 0.152 )
Horn(2,4) 0.105 0.046 ( 0.015 , 0.195 )
Horn(3,4) 0.128 0.047 ( 0.036 , 0.220 )

Average pairwise dis-similarity = 0.226

Pairwise dis-similarity matrix:

Horn(i,j) 1 2 3 4
1 0 0.310 0.301 0.431
2 0 0.085 0.105
3 0 0.128
4 0

NOTE: Any estimate greater than 1 is replaced by 1; any estimate less than 0 is replaced by 0.

Based on the above output, we compare in Table 4 the performance of the four measures: GST,
Morisita-Horn (= Jost’s D), regional differentiation, and Shannon measure (= Horn equal-weight
index) as we did in Table 2. All values of GST in Table 4 indicate little differentiation between any
two subpopulations and among the four subpopulations. In contrast, the other three
differentiation measures reveal moderate differentiation. The averages of the six pairwise
dissimilarity estimates for the Morisita-Horn, regional differentiation and Shannon measures are
respectively 0.262, 0.165 and 0.222. Among the four subpopulations, the estimates of the
Morisita-Horn measure (1−C24), regional differentiation (1−U24) and Shannon measure (1−C14
= 1−U14) are respectively 0.258, 0.080 and 0.201, implying notable differentiation. However, the
value of GST is 0.022, which shows almost no differentiation among the four subpopulations.

Table 4. Comparison of the allele relative frequency distributions based on the estimates of
four differentiation measures (GST, Morisita-Horn, regional and Shannon) for human allele
data given in Table 3

Subpopulations   GST (q = 2)   Morisita-Horn = Jost’s D (q = 2)   Regional differentiation (q = 2)   Shannon or Horn differentiation (q = 1)
(1, 2)           0.015         1−C22(1,2) = 0.338                 1−U22(1,2) = 0.204                 1−C12(1,2) = 0.300
(1, 3)           0.012         1−C22(1,3) = 0.290                 1−U22(1,3) = 0.170                 1−C12(1,3) = 0.290
(1, 4)           0.036         1−C22(1,4) = 0.602                 1−U22(1,4) = 0.431                 1−C12(1,4) = 0.421
(2, 3)           0.001         1−C22(2,3) = 0.015                 1−U22(2,3) = 0.007                 1−C12(2,3) = 0.085
(2, 4)           0.014         1−C22(2,4) = 0.188                 1−U22(2,4) = 0.104                 1−C12(2,4) = 0.105
(3, 4)           0.010         1−C22(3,4) = 0.137                 1−U22(3,4) = 0.074                 1−C12(3,4) = 0.128
(1, 2, 3, 4)     0.022         1−C24 = 0.258                      1−U24 = 0.080                      1−C14 = 0.201

As shown by Jost (2008) and Chao et al. (2015), the conventional measure GST and its relatives
do not quantify allelic differentiation among subpopulations. This is because heterozygosity
cannot be decomposed into the sum of independent within- and between-subpopulation
components. The resulting between-subpopulation component is confounded with
subpopulation-heterozygosity and also with total-heterozygosity. The resulting “differentiation”
measure GST always tends to 0 when subpopulation- or total-heterozygosity is high, regardless of
the structure among subpopulations (e.g., even if there are no shared types among
subpopulations). Thus, this differentiation measure ceases to be biologically informative. Chao
and Chiu (2016c) proved the following for the case of q = 2: (1) removing the dependence of the
between-subpopulation component on the subpopulation-heterozygosity, one obtains the
Morisita-Horn measure (= 1−C2N); (2) removing the dependence of the between-subpopulation
component on the total-heterozygosity, one obtains the regional differentiation measure
(= 1−U2N).
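The dependence of GST on heterozygosity can be seen in a simple constructed case: K subpopulations that share no alleles at all, each carrying S equally frequent alleles. The sketch below (an illustration under exactly these assumptions, with hypothetical function names) shows GST shrinking toward 0 as S grows, while the Morisita-Horn differentiation stays at its maximum of 1.

```python
def disjoint_case(num_subpops, alleles_per_subpop):
    """GST and 1-C2K for K completely distinct subpopulations with S even alleles each."""
    k, s = num_subpops, alleles_per_subpop
    within_homozygosity = 1.0 / s          # sum of squared frequencies within a subpopulation
    total_homozygosity = 1.0 / (k * s)     # same sum for the equally weighted pooled population
    h_s = 1.0 - within_homozygosity
    h_t = 1.0 - total_homozygosity
    gst = 1.0 - h_s / h_t
    beta = within_homozygosity / total_homozygosity   # equals K when nothing is shared
    morisita_horn = (1.0 - 1.0 / beta) / (1.0 - 1.0 / k)
    return gst, morisita_horn

for s in (2, 10, 100, 1000):
    gst, mh = disjoint_case(2, s)
    print(f"S = {s:4d}  GST = {gst:.4f}  1-C22 = {mh:.4f}")
```

Even though the two subpopulations have nothing in common in every row, GST falls from about 0.33 (S = 2) to about 0.005 (S = 100) simply because within-subpopulation heterozygosity increases.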

Users may wonder which of the differentiation measures featured in SpadeR (other than the
conventional measure GST) is the most appropriate choice. Like the diversity profile presented in
Part II, we recommend using a differentiation profile which depicts the estimated dissimilarity as
a function of q. See Part V for our discussion about the relative merits of the measures of q = 0, 1
and 2 and suggested guidelines regarding the choice of measures.

Acknowledgements

We thank a number of users for feedback and suggestions/comments, which have helped improve
SPADE and SpadeR and led to the removal of several bugs. We acknowledge Taiwan Ministry of
Science and Technology for continuous support of our research projects related to SPADE and
SpadeR.

References

Burnham, K. P. and Overton, W. S. (1978). Estimation of the size of a closed population when
capture probabilities vary among animals. Biometrika, 65, 625–633.
Butler, B. J., and Chazdon, R. L. (1998). Species richness, spatial variation, and abundance of the
soil seed bank of a secondary tropical rain forest. Biotropica, 30, 214–222.
Chao, A. (1984). Nonparametric estimation of the number of classes in a population.
Scandinavian Journal of Statistics, 11, 265–270.
Chao, A. (1987). Estimating the population size for capture-recapture data with unequal
catchability. Biometrics, 43, 783–791.
Chao, A. (2005). Species estimation and applications. In N. Balakrishnan, C. B. Read and B.
Vidakovic (eds), Encyclopedia of Statistical Sciences, Second Edition, Vol. 12, p.7907–7916,
Wiley, New York.
Chao, A., Chazdon, R. L., Colwell, R. K. and Shen, T.-J. (2005). A new statistical approach for
assessing similarity of species composition with incidence and abundance data. Ecology
Letters, 8, 148–159.
Chao, A., Chazdon, R. L., Colwell, R. K. and Shen, T.-J. (2006). Abundance-based similarity
indices and their estimation when there are unseen species in samples. Biometrics, 62,
361–371.
Chao, A., and Chiu, C. H. (2012). Estimation of species richness and shared species richness. In
N. Balakrishnan (ed). Methods and Applications of Statistics in the Atmospheric and Earth
Sciences. p.76–111, Wiley, New York.
Chao, A., and Chiu, C. H. (2016a). Nonparametric estimation and comparison of species richness.
Wiley Online Reference in the Life Science. In: eLS. John Wiley & Sons, Ltd: Chichester. DOI:
10.1002/9780470015902.a0026329.
Chao, A., and Chiu, C. H. (2016b). Species richness: estimation and comparison. Wiley StatsRef:
Statistics Reference Online. 1–26.
Chao, A., and Chiu, C. H. (2016c). Bridging the variance and diversity decomposition
approaches to beta diversity via similarity and differentiation measures. Methods in Ecology
and Evolution. Early view at
http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12551/abstract.
Chao, A., Chiu, C. H. and Hsieh, T. C. (2012). Proposing a resolution to debates on diversity
partitioning. Ecology, 93, 2037–2051.
Chao, A., Chiu, C. H. and Jost, L. (2014a). Unifying species diversity, phylogenetic diversity,
functional diversity, and related similarity/differentiation measures through Hill
numbers. Annual Review of Ecology, Evolution, and Systematics, 45, 297–324.
Chao, A., Gotelli, N. J., Hsieh, T. C., Sander, E. L., Ma, K. H., Colwell, R. K. and Ellison, A. M.
(2014b). Rarefaction and extrapolation with Hill numbers: a framework for sampling and
estimation in species diversity studies. Ecological Monographs, 84, 45–67.

Chao, A., Hwang, W.-H., Chen, Y.-C. and Kuo, C.-Y. (2000). Estimating the number of shared
species in two communities. Statistica Sinica, 10, 227–246.
Chao, A. and Jost, L. (2015). Estimating diversity and entropy profiles via discovery rates of new
species. Methods in Ecology and Evolution, 6, 873–882.
Chao, A., Jost, L., Chiang, S.-C., Jiang, Y.-H. and Chazdon, R. L. (2008). A two-stage
probabilistic approach to multiple-community similarity indices. Biometrics, 64, 1178–1186.
Chao, A., Jost, L., Hsieh, T. C., Ma, K. H., Sherwin, W. B. and Rollins, L. A. (2015). Expected
Shannon entropy and Shannon differentiation between subpopulations for neutral genes under
the finite island model. PLoS ONE, 10:e0125471.
Chao, A. and Lee, S.-M. (1992). Estimating the number of classes via sample coverage. Journal
of the American Statistical Association, 87, 210–217.
Chao, A., Ma, M.-C. and Yang, M. C. K. (1993). Stopping rules and estimation for recapture
debugging with unequal failure rates. Biometrika, 80, 193–201.
Chao, A. and Shen, T.-J. (2003). Nonparametric estimation of Shannon's index of diversity when
there are unseen species in sample. Environmental and Ecological Statistics, 10, 429–443.
Chao, A., Wang, Y. T. and Jost, L. (2013). Entropy and the species accumulation curve: a novel
entropy estimator via discovery rates of new species. Methods in Ecology and Evolution, 4,
1091–1100.
Chiu, C. H., Jost, L. and Chao, A. (2014). Phylogenetic beta diversity, similarity, and
differentiation measures based on Hill numbers. Ecological Monographs, 84, 21–44.
Chiu, C. H., Wang, Y. T., Walther, B. A. and Chao, A. (2014). An improved nonparametric lower
bound of species richness via a modified Good–Turing frequency formula. Biometrics, 70,
671–682.
Colwell, R. K. (2013). EstimateS: Statistical estimation of species richness and shared species
from samples. Version 9 and earlier. User’s Guide and application. Published at:
http://purl.oclc.org/estimates.
Colwell, R. K., Chao, A., Gotelli, N. J., Lin, S.-Y., Mao, C. X., Chazdon, R. L. and Longino, J. T.
(2012). Models and estimators linking individual-based and sample-based rarefaction,
extrapolation and comparison of assemblages. Journal of Plant Ecology, 5, 3–21.
Colwell, R. K., and Coddington, J. A. (1994). Estimating terrestrial biodiversity through
extrapolation. Philosophical Transactions of the Royal Society of London B - Biological
Sciences, 345, 101–118.
Edwards, W. R. and Eberhardt, L. (1967). Estimating cottontail abundance from live trapping
data. The Journal of Wildlife Management, 31, 87–96.
Foissner, W., Agatha, S. and Berger, H. (2002). Soil ciliates (Protozoa, Ciliophora) from Namibia
(Southwest Africa), with emphasis on two contrasting environments, the Etosha Region and
the Namib Desert. Denisia, 5, 1–1063.
Gotelli, N. J. and Chao, A. (2013). Measuring and estimating species richness, species diversity,
and biotic similarity from sampling data. In Levin, S. A. (Ed.), Encyclopedia of Biodiversity,
2nd Edition, Vol. 5, 195–211, Waltham, MA.
Horn, H. S. (1966). Measurement of "overlap" in comparative ecological studies. American
Naturalist, 100, 419–424.
Janzen, D. H. (1973). Sweep samples of tropical foliage insects: description of study sites, with
data on species abundances and size distributions. Ecology, 54, 659–686.
Jost, L. (2006). Entropy and diversity. Oikos, 113, 363–375.
Jost, L. (2007). Partitioning diversity into independent alpha and beta components. Ecology, 88,
2427–2439.
Jost, L. (2008). GST and its relatives do not measure differentiation. Molecular Ecology, 17,
4015–4026.
Lee, S.-M. and Chao, A. (1994). Estimating population size via sample coverage for closed
capture-recapture models. Biometrics, 50, 88–97.
Legendre, P. and De Cáceres, M. (2013). Beta diversity as the variance of community data:
dissimilarity coefficients and partitioning. Ecology Letters, 16, 951–963.
Legendre, P. and Legendre, L. (2012). Numerical Ecology. 3rd English Edition, Elsevier,
Amsterdam.
Longino, J. T., Coddington, J. A. and Colwell, R. K. (2002). The ant fauna of a tropical rain
forest: estimating species richness three different ways. Ecology, 83, 689–702.
Magurran, A. E. (1988). Ecological Diversity and Its Measurement. Princeton University Press,
Princeton, New Jersey.
Magurran, A. E. (2004). Measuring Biological Diversity. Blackwell, Oxford.
Miller, R. I. and Wiegert, R. G. (1989). Documenting completeness, species-area relations, and
the species-abundance distribution of a regional flora. Ecology, 70, 16–22.
Morisita, M. (1959). Measuring of interspecific association and similarity between communities.
Memoirs of the Faculty of Science, Kyushu University, Series E (Biology), 3, 65–80.
Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proceedings of the
National Academy of Sciences USA, 70, 3321–3323.
Pan, H.-Y., Chao, A. and Foissner, W. (2009). A nonparametric lower bound for the number of
species shared by multiple communities. Journal of Agricultural, Biological, and
Environmental Statistics, 14, 452–468.
Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A. and
Feldman, M. W. (2002). Genetic structure of human populations. Science, 298, 2381–2385.
Whittaker, R. H. (1960). Vegetation of the Siskiyou mountains, Oregon and California.
Ecological Monographs, 30, 279–338.
Whittaker, R. H. (1972). Evolution and measurement of species diversity. Taxon, 21, 213–251.
Zahl, S. (1977). Jackknifing an index of diversity. Ecology, 58, 907–913.

Appendix: Description and Formulas for Models/Estimators

Common notation to all parts

$S$: unknown number of species in a community;

$X_i$: number of individuals (frequency) of the $i$th species observed in the sample, $i = 1, 2, \ldots, S$ (only those species with $X_i > 0$ are observable in the sample);

$n$: sample size, $n = \sum_{i=1}^{S} X_i = \sum_{j \ge 1} j f_j$ (the $f_j$ are defined below);

$I[A]$: the usual indicator function, i.e., $I[A] = 1$ if the event $A$ occurs, 0 otherwise;

$f_j$: number of species that are represented exactly $j$ times in the sample, $j = 0, 1, \ldots, n$, $f_j = \sum_{i=1}^{S} I[X_i = j]$ ($f_0$ denotes the number of undetected species);

$C$: sample coverage;

$\hat{\ }$ (hat): an estimator from data; e.g., $\hat{S}$ (estimator of $S$) and $\hat{C}$ (estimator of $C$), and so on;

$D$: number of distinct species discovered in the sample, $D = \sum_{i=1}^{S} I[X_i > 0] = \sum_{j \ge 1} f_j$;

$T$: number of sampling units for incidence data;

$Q_k$: number of species that are observed in exactly $k$ sampling units, $k = 0, 1, \ldots, T$, based on presence/absence or detection/non-detection data;

$\kappa$: the cut-off point (default = 10), which separates species into “abundant” and “rare” groups for abundance data; it separates species into “frequent” and “infrequent” groups for incidence data;

CV: coefficient of variation.
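
As an illustration of the notation above, the following minimal Python sketch derives $n$, $D$ and the frequency counts $f_j$ from a vector of species abundances. The function name `frequency_counts` and the example data are hypothetical and are not part of SpadeR.

```python
from collections import Counter


def frequency_counts(abundances):
    """Return (n, D, f) for a list of species abundances X_i.

    n    : sample size (sum of all X_i)
    D    : number of distinct species observed (X_i > 0)
    f[j] : number of species observed exactly j times (j >= 1)
    """
    observed = [x for x in abundances if x > 0]
    n = sum(observed)
    D = len(observed)
    f = Counter(observed)  # missing keys count as 0
    return n, D, f


# Hypothetical sample: six observed species and one undetected species.
n, D, f = frequency_counts([5, 3, 2, 1, 1, 1, 0])
```

The two identities in the notation, $n = \sum_j j f_j$ and $D = \sum_j f_j$, can be checked directly on the returned counts.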

Part I: Species (Species Richness Estimation in One Community)

Formulas for abundance data (Examples 1.1 and 1.1A):


Additional Notation

$n_{rare}$: sample size for the “rare” group;

$D_{rare}$: the number of distinct species for the “rare” group, $D_{rare} = \sum_{i=1}^{\kappa} f_i$;

$D_{abun}$: the number of distinct species for the “abundant” group, $D_{abun} = \sum_{i > \kappa} f_i$;

$\hat{C}_{rare}$: estimated sample coverage for the “rare” group, $\hat{C}_{rare} = 1 - f_1 / \sum_{i=1}^{\kappa} i f_i$;

$\hat{\gamma}_{rare}$ or $CV_{rare}$: estimated CV for the “rare” group (CV_rare in the output);

$\tilde{\gamma}_{rare}$: estimated CV for the highly heterogeneous case (CV1_rare in the output).

• Homogeneous Model (Chao and Lee, 1992):

$$\hat{S} = D_{abun} + \frac{D_{rare}}{\hat{C}_{rare}},$$

where $D_{rare} = \sum_{i=1}^{\kappa} f_i$, $\hat{C}_{rare} = 1 - f_1 / \sum_{i=1}^{\kappa} i f_i$ and $\kappa$ = cut-off point.

 Homogeneous (MLE):
Approximate MLE under the homogeneous model

$\hat{S}$ is the solution of $D = S[1 - \exp(-n/S)]$.


• Chao1 (Chao, 1984):

$$\hat{S} = \begin{cases} D + f_1^2/(2 f_2), & \text{if } f_2 > 0, \\ D + f_1(f_1 - 1)/2, & \text{if } f_2 = 0. \end{cases}$$

• Chao1-bc (bias-corrected estimator for Chao1):

$$\hat{S} = D + f_1(f_1 - 1)/[2(f_2 + 1)].$$

• iChao1 (Chiu et al. 2014):

$$\hat{S} = \hat{S}_{Chao1} + \frac{(n-3)}{4n}\frac{f_3}{f_4} \times \max\left\{ f_1 - \frac{(n-3)}{2(n-1)}\frac{f_2 f_3}{f_4},\ 0 \right\}.$$
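
The three Chao1-type formulas above translate almost line by line into code. The sketch below is only illustrative: the function names are hypothetical, and raising an error when $f_4 = 0$ is an assumption made here because the formula as written divides by $f_4$ and no convention for that case is stated in this section.

```python
def chao1(D, f1, f2):
    """Chao1 lower bound; uses the f2 = 0 branch when there are no doubletons."""
    if f2 > 0:
        return D + f1 ** 2 / (2 * f2)
    return D + f1 * (f1 - 1) / 2


def chao1_bc(D, f1, f2):
    """Bias-corrected Chao1; always defined because the denominator is f2 + 1."""
    return D + f1 * (f1 - 1) / (2 * (f2 + 1))


def ichao1(D, n, f1, f2, f3, f4):
    """iChao1 = Chao1 plus a non-negative correction based on f3 and f4."""
    if f4 == 0:
        # Assumption of this sketch: refuse rather than guess a convention.
        raise ValueError("f4 = 0: the iChao1 correction term is undefined as written")
    correction = (n - 3) / (4 * n) * (f3 / f4) * max(
        f1 - (n - 3) / (2 * (n - 1)) * (f2 * f3 / f4), 0
    )
    return chao1(D, f1, f2) + correction


# Hypothetical counts: D = 20 species, n = 100 individuals.
estimates = (chao1(20, 8, 4), chao1_bc(20, 8, 4), ichao1(20, 100, 8, 4, 2, 1))
```

Because the correction term is a maximum with 0, iChao1 can never fall below Chao1 for the same data.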
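• ACE (Chao and Lee, 1992):

$$\hat{S} = D_{abun} + \frac{D_{rare}}{\hat{C}_{rare}} + \frac{f_1}{\hat{C}_{rare}}\hat{\gamma}_{rare}^2,$$

where $D_{abun} = \sum_{i > \kappa} f_i$ and

$$\hat{\gamma}_{rare}^2 = \max\left\{ \frac{D_{rare}}{\hat{C}_{rare}} \frac{\sum_{i=1}^{\kappa} i(i-1) f_i}{\left(\sum_{i=1}^{\kappa} i f_i\right)\left(\sum_{i=1}^{\kappa} i f_i - 1\right)} - 1,\ 0 \right\}$$

(CV_rare in the output).

• ACE-1 (Chao and Lee, 1992) for highly-heterogeneous case:

$$\hat{S} = D_{abun} + \frac{D_{rare}}{\hat{C}_{rare}} + \frac{f_1}{\hat{C}_{rare}}\tilde{\gamma}_{rare}^2,$$

where

$$\tilde{\gamma}_{rare}^2 = \max\left\{ \hat{\gamma}_{rare}^2 \left[ 1 + \frac{(1 - \hat{C}_{rare}) \sum_{i=1}^{\kappa} i(i-1) f_i}{\hat{C}_{rare}\left(\sum_{i=1}^{\kappa} i f_i - 1\right)} \right],\ 0 \right\}$$

(CV1_rare in the output).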
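
The ACE computation above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a mapping from $j$ to $f_j$, the function name `ace` is hypothetical, and the degenerate case in which every rare species is a singleton (so that $\hat{C}_{rare} = 0$) is simply rejected.

```python
def ace(f, kappa=10):
    """ACE estimate from frequency counts f (dict: j -> f_j), cut-off kappa."""
    rare = {j: c for j, c in f.items() if 1 <= j <= kappa}
    D_abun = sum(c for j, c in f.items() if j > kappa)
    D_rare = sum(rare.values())
    n_rare = sum(j * c for j, c in rare.items())
    f1 = f.get(1, 0)
    if n_rare < 2 or f1 == n_rare:
        # Assumption of this sketch: no fallback is attempted here.
        raise ValueError("rare-group coverage estimate is zero or undefined")
    C_rare = 1 - f1 / n_rare
    ss = sum(j * (j - 1) * c for j, c in rare.items())
    # Squared CV estimate for the rare group, truncated at 0.
    gamma2 = max(D_rare / C_rare * ss / (n_rare * (n_rare - 1)) - 1, 0)
    return D_abun + D_rare / C_rare + f1 / C_rare * gamma2


# Hypothetical counts: three singletons, two doubletons, one tripleton,
# and one species seen 12 times (above the default cut-off of 10).
estimate = ace({1: 3, 2: 2, 3: 1, 12: 1})
```

When the squared CV estimate is truncated to 0, the result coincides with the homogeneous-model estimate $D_{abun} + D_{rare}/\hat{C}_{rare}$, which is the case for the example data.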
• 1st order jackknife (Abundance Data):

$$\hat{S} = D + \frac{n-1}{n} f_1.$$

• 2nd order jackknife (Abundance Data):

$$\hat{S} = D + \frac{2n-3}{n} f_1 - \frac{(n-2)^2}{n(n-1)} f_2.$$

Formulas for incidence data (Examples 1.2, 1.2A and 1.2B):


Additional Notation

$D_{freq}$: the number of species for the “frequent” group, $D_{freq} = \sum_{i > \kappa} Q_i$;

$D_{infreq}$: the number of species for the “infrequent” group, $D_{infreq} = \sum_{i=1}^{\kappa} Q_i$;

$\hat{C}_{infreq}$: estimated sample coverage for the “infrequent” group;

$T_{infreq}$: number of sampling units that include at least one infrequent species;

$\hat{\gamma}_{infreq}$ or $CV_{infreq}$: estimated CV for the “infrequent” group (CV_rare in the output);

$\tilde{\gamma}_{infreq}$: estimated CV for the highly heterogeneous case (CV1_rare in the output).

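• Homogeneous Model (Lee and Chao, 1994):

$$\hat{S} = D_{freq} + \frac{D_{infreq}}{\hat{C}_{infreq}},$$

where $D_{freq} = \sum_{i > \kappa} Q_i$, $D_{infreq} = \sum_{i=1}^{\kappa} Q_i$, and

$$\hat{C}_{infreq} = 1 - \frac{Q_1}{\sum_{i=1}^{\kappa} i Q_i} \left[ \frac{(T_{infreq} - 1) Q_1}{(T_{infreq} - 1) Q_1 + 2 Q_2} \right].$$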
• Chao2 (Chao, 1987):

$$\hat{S} = \begin{cases} D + (T-1) Q_1^2/(2 T Q_2), & \text{if } Q_2 > 0, \\ D + (T-1) Q_1 (Q_1 - 1)/(2T), & \text{otherwise.} \end{cases}$$

• Chao2-bc (bias-corrected form for Chao2):

$$\hat{S} = D + [(T-1)/T]\, Q_1 (Q_1 - 1)/[2(Q_2 + 1)].$$

• iChao2 (Chiu et al. 2014):

$$\hat{S} = \hat{S}_{Chao2} + \frac{(T-3)}{4T}\frac{Q_3}{Q_4} \times \max\left\{ Q_1 - \frac{(T-3)}{2(T-1)}\frac{Q_2 Q_3}{Q_4},\ 0 \right\}.$$
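
For incidence data, the Chao2 and Chao2-bc formulas above differ from their abundance counterparts only through the factor $(T-1)/T$. A minimal sketch, with hypothetical function names and example counts:

```python
def chao2(D, T, Q1, Q2):
    """Chao2 lower bound for incidence data with T sampling units."""
    if Q2 > 0:
        return D + (T - 1) * Q1 ** 2 / (2 * T * Q2)
    return D + (T - 1) * Q1 * (Q1 - 1) / (2 * T)


def chao2_bc(D, T, Q1, Q2):
    """Bias-corrected Chao2; defined even when Q2 = 0."""
    return D + (T - 1) / T * Q1 * (Q1 - 1) / (2 * (Q2 + 1))


# Hypothetical data: 30 observed species in 10 sampling units,
# 6 uniques (Q1) and 3 duplicates (Q2).
estimates = (chao2(30, 10, 6, 3), chao2_bc(30, 10, 6, 3))
```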
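• ICE (Lee and Chao, 1994):

$$\hat{S} = D_{freq} + \frac{D_{infreq}}{\hat{C}_{infreq}} + \frac{Q_1}{\hat{C}_{infreq}}\hat{\gamma}_{infreq}^2,$$

where

$$\hat{\gamma}_{infreq}^2 = \max\left\{ \frac{D_{infreq}}{\hat{C}_{infreq}} \frac{T_{infreq}}{(T_{infreq} - 1)} \frac{\sum_{i=1}^{\kappa} i(i-1) Q_i}{\left(\sum_{i=1}^{\kappa} i Q_i\right)\left(\sum_{i=1}^{\kappa} i Q_i - 1\right)} - 1,\ 0 \right\}$$

(CV_rare in the output).

• ICE-1 for highly-heterogeneous case:

$$\hat{S} = D_{freq} + \frac{D_{infreq}}{\hat{C}_{infreq}} + \frac{Q_1}{\hat{C}_{infreq}}\tilde{\gamma}_{infreq}^2,$$

where

$$\tilde{\gamma}_{infreq}^2 = \max\left\{ (\hat{S}_{ICE} - D_{freq}) \frac{T_{infreq}}{T_{infreq} - 1} \frac{\sum_{i=1}^{\kappa} i(i-1) Q_i}{\left(\sum_{i=1}^{\kappa} i Q_i\right)\left(\sum_{i=1}^{\kappa} i Q_i - 1\right)} - 1,\ 0 \right\}$$

(CV1_rare in the output) and $\hat{S}_{ICE}$ denotes the ICE estimate.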

• 1st order jackknife (Incidence Data):

$$\hat{S} = D + \frac{T-1}{T} Q_1.$$

• 2nd order jackknife (Incidence Data):

$$\hat{S} = D + \frac{2T-3}{T} Q_1 - \frac{(T-2)^2}{T(T-1)} Q_2.$$

Part II: Diversity Profile Estimation

Formulas for abundance data (Examples 2.1 and 2.1A)

Shannon’s Index $\hat{H}$ and its effective number of species, $\exp(\hat{H})$ (diversity of order 1):

• MLE

$$\hat{H} = -\sum_{i=1}^{S} I(X_i > 0)\, \frac{X_i}{n} \log\left(\frac{X_i}{n}\right).$$

• Jackknife

$$\hat{H} = n \hat{H}_{obs} - \frac{(n-1)}{n} \sum_{i=1}^{n} \hat{H}_{obs(-i)},$$

where $\hat{H}_{obs(-i)}$ is the observed entropy based on a sub-sample in which the $i$-th individual is deleted. It can also be written as

$$\hat{H} = n \log(n) - (n-1)\log(n-1) + \frac{1}{n}\sum_{k=2}^{n} f_k k^2 \log\left(\frac{k-1}{k}\right) - \frac{1}{n}\sum_{k=2}^{n} f_k k \log(k-1).$$
• Chao and Shen (2003)

$$\hat{H}_{CAE} = -\sum_{X_i > 0} \frac{\hat{C}(X_i/n) \log[\hat{C}(X_i/n)]}{1 - [1 - \hat{C}(X_i/n)]^n},$$

where $\hat{C} = 1 - f_1/n$ is an estimate of sample coverage. It can also be written as

$$\hat{H} = -\sum_{k=1}^{n} f_k \frac{(k(1 - f_1/n)/n) \log[k(1 - f_1/n)/n]}{1 - [1 - k(1 - f_1/n)/n]^n}.$$
• Chao et al. (2013)

$$\hat{H} = \sum_{k=1}^{n-1} \frac{1}{k} \sum_{i=1}^{S} \frac{X_i}{n} \frac{\binom{n - X_i}{k}}{\binom{n-1}{k}} + \frac{f_1}{n}(1 - A)^{-n+1} \left\{ -\log(A) - \sum_{r=1}^{n-1} \frac{1}{r}(1 - A)^r \right\},$$

where

$$A = \begin{cases} 2 f_2/[(n-1) f_1 + 2 f_2], & \text{if } f_1 > 0,\ f_2 > 0, \\ 2/[(n-1)(f_1 - 1) + 2], & \text{if } f_1 > 0,\ f_2 = 0, \\ 1, & \text{if } f_1 = 0. \end{cases}$$

Simpson Index $\hat{\lambda}$ and its effective number of species ${}^{2}\hat{D} = 1/\hat{\lambda}$ (diversity of order 2):

• MVUE (Minimum Variance Unbiased Estimator)

$$\hat{\lambda} = \sum_{i=1}^{S} \frac{X_i (X_i - 1)}{n(n-1)} = \sum_{k=1}^{n} f_k \frac{k(k-1)}{n(n-1)}.$$

• MLE (Maximum Likelihood Estimator)

$$\hat{\lambda} = \sum_{i=1}^{S} \frac{X_i^2}{n^2} = \sum_{k=1}^{n} f_k \left(\frac{k}{n}\right)^2.$$

Hill number of order $q$, $q \ge 0$:

• ChaoJost (Chao and Jost 2015) for abundance data

$${}^{q}\hat{D} = \left[ \sum_{k=0}^{n-1} \binom{q-1}{k} (-1)^k \hat{\Delta}(k) + \frac{f_1}{n}(1 - A)^{-n+1} \left\{ A^{q-1} - \sum_{r=0}^{n-1} \binom{q-1}{r} (A - 1)^r \right\} \right]^{1/(1-q)},$$

where

$$\hat{\Delta}(k) = \sum_{1 \le X_i \le n-k} \frac{X_i}{n} \frac{\binom{n - X_i}{k}}{\binom{n-1}{k}} = \sum_{1 \le X_i \le n-k} \frac{\binom{n-k-1}{X_i - 1}}{\binom{n}{X_i}}, \quad k < n,$$

and $A$ is defined as above in the entropy estimator.

Special cases:
(1) For q = 0, it reduces to the Chao1 estimator; see Species Part.
(2) For q = 1, it reduces to the estimator derived in Chao et al. (2013).
(3) For q = 2, it reduces to

$${}^{2}\hat{D} = \left[ \sum_{X_i \ge 2} \frac{X_i (X_i - 1)}{n(n-1)} \right]^{-1}.$$

Formulas for incidence data (Examples 2.2, 2.2A and 2.2B)

Shannon’s Index $\hat{H}$ and its effective number of species, $\exp(\hat{H})$ (diversity of order 1):

• MLE

$$\hat{H} = -\sum_{i=1}^{S} I(Y_i > 0)\, \frac{Y_i}{U} \log\left(\frac{Y_i}{U}\right).$$

• Chao et al. (2013, their Appendix S6)

$$\hat{H} = \frac{T}{U} \left\{ \sum_{k=1}^{T-1} \frac{1}{k} \sum_{i=1}^{S} \frac{Y_i}{T} \frac{\binom{T - Y_i}{k}}{\binom{T-1}{k}} + \frac{Q_1}{T}(1 - A^*)^{-T+1} \left[ -\log(A^*) - \sum_{r=1}^{T-1} \frac{1}{r}(1 - A^*)^r \right] \right\} + \log\left(\frac{U}{T}\right),$$

where

$$A^* = \begin{cases} 2 Q_2/[(T-1) Q_1 + 2 Q_2], & \text{if } Q_2 > 0, \\ 2/[(T-1)(Q_1 - 1) + 2], & \text{if } Q_2 = 0,\ Q_1 \ne 0, \\ 1, & \text{if } Q_2 = Q_1 = 0. \end{cases}$$

Simpson Index $\hat{\lambda}$ and its effective number of species $1/\hat{\lambda}$ (diversity of order 2):

• MVUE (Minimum Variance Unbiased Estimator)

$$\hat{\lambda} = \sum_{i=1}^{S} \frac{Y_i (Y_i - 1)}{U^2 (1 - 1/T)}.$$

• MLE (Maximum Likelihood Estimator)

$$\hat{\lambda} = \sum_{i=1}^{S} \frac{Y_i^2}{U^2}.$$

Hill number of order $q$, $q \ge 0$:

• ChaoJost (Chao and Jost 2015) for incidence data

$${}^{q}\hat{D} = \left(\frac{U}{T}\right)^{q/(q-1)} \left[ \sum_{t=0}^{T-1} \binom{q-1}{t} (-1)^t \hat{\Delta}^*(t) + \frac{Q_1}{T}(1 - A^*)^{-T+1} \left\{ (A^*)^{q-1} - \sum_{r=0}^{T-1} \binom{q-1}{r} (A^* - 1)^r \right\} \right]^{1/(1-q)},$$

where

$$\hat{\Delta}^*(t) = \sum_{1 \le Y_i \le T-t} \frac{Y_i}{T} \frac{\binom{T - Y_i}{t}}{\binom{T-1}{t}} = \sum_{1 \le Y_i \le T-t} \frac{\binom{T-t-1}{Y_i - 1}}{\binom{T}{Y_i}}, \quad t < T,$$

and $A^*$ is defined as

$$A^* = \begin{cases} 2 Q_2/[(T-1) Q_1 + 2 Q_2], & \text{if } Q_2 > 0, \\ 2/[(T-1)(Q_1 - 1) + 2], & \text{if } Q_2 = 0,\ Q_1 \ne 0, \\ 1, & \text{if } Q_2 = Q_1 = 0. \end{cases}$$

Special cases:
(1) For q = 0, it reduces to the Chao2 estimator; see Species Part.
(2) For q = 1, it reduces to the Shannon diversity estimator described above.
(3) For q = 2, it reduces to

$${}^{2}\hat{D} = \left(\frac{U}{T}\right)^{2} \left[ \sum_{Y_i \ge 2} \frac{Y_i (Y_i - 1)}{T(T-1)} \right]^{-1} = \frac{(1 - 1/T)\, U^2}{\sum_{Y_i \ge 2} Y_i (Y_i - 1)}.$$

Part III: Shared Species (Shared Species Richness Estimation in Two Communities)

Formulas for abundance data (Example 3.1)


$S_{12}$: the number of shared species (unknown);

$D_{12}$: the number of shared species between Sample 1 and Sample 2;

$n_i$: size of Sample $i$, $i = 1, 2$;

$(X_{i1}, X_{i2})$: sample frequencies of the $i$th species in Sample 1 and Sample 2;

$D_{12(rare)}$: the number of shared species in the “rare” group, $D_{12(rare)} = \sum_{i=1}^{S_{12}} I(0 < X_{i1} \le 10,\ 0 < X_{i2} \le 10)$;

$D_{12(abun)}$: the number of shared species in the “abundant” group, $D_{12(abun)} = D_{12} - D_{12(rare)}$;

$f_{1+(rare)}$: number of shared species that are singletons in Sample 1 for the “rare” group, $f_{1+(rare)} = \sum_{i=1}^{D_{12}} I(X_{i1} = 1,\ 0 < X_{i2} \le 10)$;

$f_{+1(rare)}$: number of shared species that are singletons in Sample 2 for the “rare” group, $f_{+1(rare)} = \sum_{i=1}^{D_{12}} I(0 < X_{i1} \le 10,\ X_{i2} = 1)$;

$f_{11}$: number of shared species that are singletons in both samples, $f_{11} = \sum_{i=1}^{D_{12}} I(X_{i1} = X_{i2} = 1)$;

$f_{22}$: number of shared species that are doubletons in both samples, $f_{22} = \sum_{i=1}^{D_{12}} I(X_{i1} = X_{i2} = 2)$;

$f_{1+} = \sum_{i=1}^{D_{12}} I[X_{i1} = 1,\ X_{i2} \ge 1]$: the observed number of shared species that occur once in Sample 1;

$f_{2+} = \sum_{i=1}^{D_{12}} I[X_{i1} = 2,\ X_{i2} \ge 1]$: the observed number of shared species that occur twice in Sample 1;

$f_{+1} = \sum_{i=1}^{D_{12}} I[X_{i1} \ge 1,\ X_{i2} = 1]$: the observed number of shared species that occur once in Sample 2;

$f_{+2} = \sum_{i=1}^{D_{12}} I[X_{i1} \ge 1,\ X_{i2} = 2]$: the observed number of shared species that occur twice in Sample 2.

• Homogeneous (Chao et al., 2000):

$$\hat{S}_{12} = D_{12(abun)} + \frac{D_{12(rare)}}{\hat{C}_{12(rare)}},$$

where $D_{12(rare)} = \sum_{i=1}^{D_{12}} I(0 < X_{i1} \le 10,\ 0 < X_{i2} \le 10)$ and

$$\hat{C}_{12(rare)} = 1 - \frac{\sum_{i=1}^{D_{12}} [X_{i2} I(X_{i1} = 1, X_{i2} \le 10) + X_{i1} I(X_{i2} = 1, X_{i1} \le 10) - I(X_{i1} = X_{i2} = 1)]}{\sum_{i=1}^{D_{12}} X_{i1} X_{i2} I(X_{i1} \le 10, X_{i2} \le 10)}.$$

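• Heterogeneous (ACE-shared) (Chao et al., 2000):

$$\hat{S}_{12} = D_{12(abun)} + \frac{D_{12(rare)}}{\hat{C}_{12(rare)}} + \frac{1}{\hat{C}_{12(rare)}} \left[ f_{1+(rare)} \hat{\Gamma}_1 + f_{+1(rare)} \hat{\Gamma}_2 + f_{11} \hat{\Gamma}_{12} \right],$$

where

(CCV1) $$\hat{\Gamma}_1 = \frac{\hat{S}_{120}\, n_{1(rare)}\, T_{21}}{(n_{1(rare)} - 1)\, T_{10} T_{11}} - 1,$$

(CCV2) $$\hat{\Gamma}_2 = \frac{\hat{S}_{120}\, n_{2(rare)}\, T_{12}}{(n_{2(rare)} - 1)\, T_{01} T_{11}} - 1,$$

(CCV12) $$\hat{\Gamma}_{12} = \frac{n_{1(rare)}\, n_{2(rare)}\, (\hat{S}_{120})^2\, T_{22}}{(n_{1(rare)} - 1)(n_{2(rare)} - 1)\, T_{10} T_{01} T_{11}} - \frac{\hat{S}_{120}\, T_{11}}{T_{10} T_{01}} - \hat{\Gamma}_1 - \hat{\Gamma}_2,$$

where $\hat{S}_{120} = D_{12(rare)} / \hat{C}_{12(rare)}$,

$T_{10} = \sum_{i=1}^{D_{12}} X_{i1} I(X_{i1} \le 10, X_{i2} \le 10)$, $T_{01} = \sum_{i=1}^{D_{12}} X_{i2} I(X_{i1} \le 10, X_{i2} \le 10)$,

$T_{11} = \sum_{i=1}^{D_{12}} X_{i1} X_{i2} I(X_{i1} \le 10, X_{i2} \le 10)$,

$T_{21} = \sum_{i=1}^{D_{12}} X_{i1} (X_{i1} - 1) X_{i2} I(X_{i1} \le 10, X_{i2} \le 10)$,

$T_{12} = \sum_{i=1}^{D_{12}} X_{i1} X_{i2} (X_{i2} - 1) I(X_{i1} \le 10, X_{i2} \le 10)$,

$T_{22} = \sum_{i=1}^{D_{12}} X_{i1} (X_{i1} - 1) X_{i2} (X_{i2} - 1) I(X_{i1} \le 10, X_{i2} \le 10)$.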

• Chao1-shared lower bound (Pan et al. 2009)

$$\hat{S}_{12} = D_{12} + K_1 \frac{f_{1+}^2}{2 f_{2+}} + K_2 \frac{f_{+1}^2}{2 f_{+2}} + K_1 K_2 \frac{f_{11}^2}{4 f_{22}},$$

where $K_i = (n_i - 1)/n_i$, $i = 1, 2$.

• Chao1-shared-bc lower bound (Pan et al. 2009)

$$\hat{S}_{12} = D_{12} + K_1 \frac{f_{1+}(f_{1+} - 1)}{2(f_{2+} + 1)} + K_2 \frac{f_{+1}(f_{+1} - 1)}{2(f_{+2} + 1)} + K_1 K_2 \frac{f_{11}(f_{11} - 1)}{4(f_{22} + 1)},$$

where $K_i = (n_i - 1)/n_i$, $i = 1, 2$.
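
The bias-corrected Chao1-shared lower bound above needs only the shared-species counts $f_{1+}$, $f_{2+}$, $f_{+1}$, $f_{+2}$, $f_{11}$ and $f_{22}$, all of which can be tallied from two aligned abundance vectors. A minimal sketch follows; the function name and the example data are hypothetical, and the two vectors are assumed to list the same species in the same order.

```python
def chao1_shared_bc(x1, x2):
    """Bias-corrected Chao1-shared estimate from two aligned abundance vectors."""
    n1, n2 = sum(x1), sum(x2)
    shared = [(a, b) for a, b in zip(x1, x2) if a > 0 and b > 0]
    D12 = len(shared)
    f1p = sum(1 for a, b in shared if a == 1)   # singletons in Sample 1
    f2p = sum(1 for a, b in shared if a == 2)   # doubletons in Sample 1
    fp1 = sum(1 for a, b in shared if b == 1)   # singletons in Sample 2
    fp2 = sum(1 for a, b in shared if b == 2)   # doubletons in Sample 2
    f11 = sum(1 for a, b in shared if a == 1 and b == 1)
    f22 = sum(1 for a, b in shared if a == 2 and b == 2)
    K1, K2 = (n1 - 1) / n1, (n2 - 1) / n2
    return (
        D12
        + K1 * f1p * (f1p - 1) / (2 * (f2p + 1))
        + K2 * fp1 * (fp1 - 1) / (2 * (fp2 + 1))
        + K1 * K2 * f11 * (f11 - 1) / (4 * (f22 + 1))
    )


# Hypothetical data: six species, four of them observed in both samples.
estimate = chao1_shared_bc([1, 1, 2, 5, 0, 3], [2, 1, 1, 4, 3, 0])
```

The bias-corrected form is used here because its denominators are never zero, whereas the uncorrected form divides by $f_{2+}$, $f_{+2}$ and $f_{22}$.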

Formulas for incidence data (Examples 3.2 and 3.2B)


$T_i$: size of Sample $i$ (number of sampling units taken from Community $i$), $i = 1, 2$;

$(Y_{i1}, Y_{i2})$: observed incidence frequencies of the $i$th species in Sample 1 and Sample 2;

$Q_{11} = \sum_{i=1}^{D_{12}} I(Y_{i1} = 1,\ Y_{i2} = 1)$: number of shared species that are detected in only one sampling unit in each of the two samples;

$Q_{22} = \sum_{i=1}^{D_{12}} I(Y_{i1} = 2,\ Y_{i2} = 2)$: number of shared species that are detected in two sampling units in each of the two samples;

$Q_{1+} = \sum_{i=1}^{D_{12}} I[Y_{i1} = 1,\ Y_{i2} \ge 1]$: the number of shared species that are detected in only one sampling unit in Community 1;

$Q_{2+} = \sum_{i=1}^{D_{12}} I[Y_{i1} = 2,\ Y_{i2} \ge 1]$: the number of shared species that are detected in two sampling units in Community 1;

$Q_{+1} = \sum_{i=1}^{D_{12}} I[Y_{i1} \ge 1,\ Y_{i2} = 1]$: the number of shared species that are detected in only one sampling unit in Community 2;

$Q_{+2} = \sum_{i=1}^{D_{12}} I[Y_{i1} \ge 1,\ Y_{i2} = 2]$: the number of shared species that are detected in two sampling units in Community 2.

• Chao2-shared lower bound (Pan et al. 2009)

$$\hat{S}_{12} = D_{12} + K_1 \frac{Q_{1+}^2}{2 Q_{2+}} + K_2 \frac{Q_{+1}^2}{2 Q_{+2}} + K_1 K_2 \frac{Q_{11}^2}{4 Q_{22}},$$

where $K_i = (T_i - 1)/T_i$, $i = 1, 2$.

• Chao2-shared-bc lower bound (Pan et al. 2009)

$$\hat{S}_{12} = D_{12} + K_1 \frac{Q_{1+}(Q_{1+} - 1)}{2(Q_{2+} + 1)} + K_2 \frac{Q_{+1}(Q_{+1} - 1)}{2(Q_{+2} + 1)} + K_1 K_2 \frac{Q_{11}(Q_{11} - 1)}{4(Q_{22} + 1)},$$

where $K_i = (T_i - 1)/T_i$, $i = 1, 2$.

Part IV: Two-Community (Similarity) Measures

Assume that there are N communities, with S species indexed by 1, 2,..., S in the pooled
community. The comparison target may be the species within-community relative abundance sets,
species absolute abundance sets or any transformation of these abundances. We first review the
theoretical community-level similarity and dissimilarity/differentiation indices; all measures are
in terms of true species richness and abundances. See Chiu et al. (2014) and Chao et al. (2014a, b)
for pertinent background and developments.

Denote zik as the true (community-level) species abundance of the i-th species in the k-th
community. The abundance zik can be any measure of species importance such as a
species-incidence (presence-absence or detection/non-detection) record, species absolute
abundance (i.e., the number of individuals), species relative abundance in the pooled community,
within-community relative abundance, density, biomass, coverage of plants or corals, or basal
area of plants. Under this general definition, compositional similarity or differentiation refers to
the resemblance or difference in any chosen species abundance or species importance measure
among communities.

Most of the indices are special cases of the $C_{qN}$ and $U_{qN}$ measures developed by Chiu et al. (2014) and Chao et al. (2014a). Define $z_{i+} = \sum_{k=1}^{N} z_{ik}$, $\bar{z}_{i+} = z_{i+}/N$, $z_{+k} = \sum_{i=1}^{S} z_{ik}$ (the size of the $k$-th community), and $z_{++} = \sum_{k=1}^{N} \sum_{i=1}^{S} z_{ik}$. The formulas for $C_{qN}$ and $U_{qN}$ are given below:

(1) A class of local (Sørensen-type) species-overlap measures: $C_{qN}$.

$$C_{qN} = 1 - \frac{\sum_{i=1}^{S} \sum_{k=1}^{N} (z_{ik}^q - \bar{z}_{i+}^q)}{(1 - N^{1-q}) \sum_{i=1}^{S} \sum_{k=1}^{N} z_{ik}^q}, \qquad C_{1N} = 1 - \frac{1}{z_{++} \log N} \sum_{i=1}^{S} \sum_{k=1}^{N} z_{ik} \log\frac{z_{ik}}{\bar{z}_{i+}}.$$

(2) A class of regional (Jaccard-type) species-overlap measures: $U_{qN}$.

$$U_{qN} = 1 - \frac{\sum_{i=1}^{S} \sum_{k=1}^{N} (z_{ik}^q - \bar{z}_{i+}^q)}{(N^q - N) \sum_{i=1}^{S} \bar{z}_{i+}^q}, \qquad U_{1N} = C_{1N}.$$
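
The general $C_{qN}$ and $U_{qN}$ formulas above can be evaluated directly from a species-by-community matrix. The sketch below is illustrative only: the function names are hypothetical, and it is restricted to $q > 0$ because the $q = 0$ case requires treating $0^0$ as 0, which Python's power operator does not do.

```python
import math


def c_qn(z, q):
    """Sørensen-type overlap C_qN; z is a list of rows (species) of N values."""
    if q <= 0:
        raise ValueError("this sketch only handles q > 0")
    N = len(z[0])
    if q == 1:
        z_pp = sum(sum(row) for row in z)
        s = sum(v * math.log(v / (sum(row) / N)) for row in z for v in row if v > 0)
        return 1 - s / (z_pp * math.log(N))
    num = sum(v ** q - (sum(row) / N) ** q for row in z for v in row)
    den = (1 - N ** (1 - q)) * sum(v ** q for row in z for v in row)
    return 1 - num / den


def u_qn(z, q):
    """Jaccard-type overlap U_qN; equals C_1N when q = 1."""
    if q == 1:
        return c_qn(z, 1)
    if q <= 0:
        raise ValueError("this sketch only handles q > 0")
    N = len(z[0])
    num = sum(v ** q - (sum(row) / N) ** q for row in z for v in row)
    den = (N ** q - N) * sum((sum(row) / N) ** q for row in z)
    return 1 - num / den


# Hypothetical relative abundances: one shared species, two unshared species.
p = [[0.5, 0.5], [0.5, 0.0], [0.0, 0.5]]
overlap = (c_qn(p, 2), u_qn(p, 2))
```

For two communities and $q = 2$, these values agree with the Morisita-Horn and regional-overlap expressions $1 - \sum_i (p_{i1} - p_{i2})^2 / \sum_i (p_{i1}^2 + p_{i2}^2)$ and $1 - \sum_i (p_{i1} - p_{i2})^2 / \sum_i (p_{i1} + p_{i2})^2$.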

All measures featured in SpadeR for the special case of two communities are summarized in
Table A1.

Table A1. (The special case of two communities.) A summary of similarity measures at the community level. All are in terms of true species richness and species abundances or frequencies. The index $i$ is summed over all species. $(z_{i1}, z_{i2})$ denote the true abundances of species $i$ in Community 1 and Community 2, and $(p_{i1}, p_{i2})$ denote the true within-community relative abundances of species $i$.

| Goal | $C_{qN}$ or $U_{qN}$ | Sørensen-type similarity measures | Jaccard-type similarity measures |
|---|---|---|---|
| (a) Comparing species richness | Yes ($q=0$) | Classic Sørensen^a: $C_{02} = \dfrac{2 S_{12}}{S_1 + S_2}$ | Classic Jaccard^a: $U_{02} = \dfrac{S_{12}}{S_1 + S_2 - S_{12}}$ |
| (b) Comparing relative abundances (a special case of Goal (d) obtained by letting $z_{ik} = p_{ik}$, $z_{+k} = 1$ and $z_{++} = N$ in (d)) | Yes ($q=1$) | Horn (1966) equal-weight index^b: $C_{12} = U_{12} = 1 - \dfrac{H_\gamma - H_\alpha}{\log N}$ | (same measure as the Sørensen-type column) |
| | Yes ($q=2$) | Morisita-Horn^c: $C_{22} = 1 - \dfrac{\sum_i (p_{i1} - p_{i2})^2}{\sum_i (p_{i1}^2 + p_{i2}^2)} = 1 - \dfrac{H_{GS,\gamma} - H_{GS,\alpha}}{(1/2)(1 - H_{GS,\alpha})}$ | Regional-overlap^c: $U_{22} = 1 - \dfrac{\sum_i (p_{i1} - p_{i2})^2}{\sum_i (p_{i1} + p_{i2})^2} = 1 - \dfrac{H_{GS,\gamma} - H_{GS,\alpha}}{1 - H_{GS,\gamma}}$ |
| | No | ChaoSørensen abundance^d: $\dfrac{2UV}{U + V}$ | ChaoJaccard abundance^d: $\dfrac{UV}{U + V - UV}$ |
| (c) Comparing weighted relative abundances | No | Horn (1966) size-weighted index^b: $1 - \dfrac{H_\gamma - H_\alpha}{-\sum_{k=1}^{N} (z_{+k}/z_{++}) \log(z_{+k}/z_{++})}$ | (same measure as the Sørensen-type column) |
| (d) Comparing absolute abundances (sampling effort must be standardized in all communities) | Yes ($q=1$) | Chiu et al. (2014) Shannon-entropy-based index^b: $C_{12} = U_{12} = \dfrac{H_\alpha - H_\gamma - \sum_{k=1}^{N} (z_{+k}/z_{++}) \log(z_{+k}/z_{++})}{\log N}$ | (same measure as the Sørensen-type column) |
| | Yes ($q=2$) | Morisita-Horn: $C_{22} = 1 - \dfrac{\sum_i (z_{i1} - z_{i2})^2}{\sum_i (z_{i1}^2 + z_{i2}^2)}$ | Regional-overlap: $U_{22} = 1 - \dfrac{\sum_i (z_{i1} - z_{i2})^2}{\sum_i (z_{i1} + z_{i2})^2}$ |
| | No | Classic Bray-Curtis: $1 - \dfrac{\sum_i \lvert z_{i1} - z_{i2} \rvert}{\sum_i (z_{i1} + z_{i2})} = 1 - \dfrac{\sum_i \lvert z_{i1} - z_{i2} \rvert}{z_{++}}$ | (same measure as the Sørensen-type column) |

^a $S_1$, $S_2$ = species richness in Community 1 and Community 2; $S_{12}$ = shared species richness.

^b $H_\gamma$, $H_\alpha$ = gamma and alpha Shannon entropy in Goal (c) and Goal (d):

$$H_\gamma = -\sum_{i=1}^{S} \frac{z_{i+}}{z_{++}} \log\left(\frac{z_{i+}}{z_{++}}\right); \qquad H_\alpha = \sum_{k=1}^{2} \frac{z_{+k}}{z_{++}} \left[ -\sum_{i=1}^{S} \frac{z_{ik}}{z_{+k}} \log\left(\frac{z_{ik}}{z_{+k}}\right) \right].$$

In Goal (b), let $z_{ik} = p_{ik}$ = within-community relative abundance, implying $z_{+k} = 1$, $z_{++} = N$. The above formulas reduce to

$$H_\gamma = -\sum_{i=1}^{S} \frac{p_{i1} + p_{i2}}{2} \log\left(\frac{p_{i1} + p_{i2}}{2}\right), \qquad H_\alpha = -\frac{1}{2}\sum_{i=1}^{S} p_{i1} \log p_{i1} - \frac{1}{2}\sum_{i=1}^{S} p_{i2} \log p_{i2}.$$

^c $H_{GS,\gamma}$, $H_{GS,\alpha}$ = gamma and alpha Gini-Simpson index in Goal (c) and Goal (d):

$$H_{GS,\gamma} = 1 - \sum_{i=1}^{S} \left(\frac{z_{i+}}{z_{++}}\right)^2; \qquad H_{GS,\alpha} = \sum_{k=1}^{N} \frac{z_{+k}}{z_{++}} \left[ 1 - \sum_{i=1}^{S} \left(\frac{z_{ik}}{z_{+k}}\right)^2 \right].$$

In Goal (b), let $z_{ik} = p_{ik}$ = within-community relative abundance, implying $z_{+k} = 1$, $z_{++} = N$. The above formulas reduce to

$$H_{GS,\gamma} = 1 - \sum_{i=1}^{S} \left(\frac{p_{i1} + p_{i2}}{2}\right)^2, \qquad H_{GS,\alpha} = \frac{1}{2}\left(1 - \sum_{i=1}^{S} p_{i1}^2\right) + \frac{1}{2}\left(1 - \sum_{i=1}^{S} p_{i2}^2\right).$$

^d $U$ and $V$ denote the total relative abundances of the shared species in Community 1 and Community 2; see Chao et al. (2005, 2006). That is, $U = \sum_{i=1}^{S_{12}} p_{i1}$, $V = \sum_{i=1}^{S_{12}} p_{i2}$, assuming that the first $S_{12}$ species are shared.

_________________________________________________________________________
In the second part of the output, all empirical similarity indices are obtained by plugging the
observed data in the above formulas. In the third part of the output, the estimation procedures are
briefly summarized below. Let (Xi1, Xi2) denote the observed abundances of species i in Sample
1 and Sample 2, and let (X+1, X+2) ≡ (n1, n2) denote the sample sizes of Sample 1 and Sample 2.

Formulas for abundance data (Example 4.1)

(a) Classic richness-based similarity

• C02 (q=0, Sørensen): $\hat{C}_{02} = \dfrac{2 \hat{S}_{Chao1\text{-}shared}}{\hat{S}_{1,Chao1} + \hat{S}_{2,Chao1}}$.

• U02 (q=0, Jaccard): $\hat{U}_{02} = \dfrac{\hat{S}_{Chao1\text{-}shared}}{\hat{S}_{1,Chao1} + \hat{S}_{2,Chao1} - \hat{S}_{12,Chao1}}$.

(See Part I and Part III for the estimators of species richness and shared richness.)

(b) Measures for comparing species relative abundances

• C12=U12 (q=1, Horn): $\hat{C}_{12} = \hat{U}_{12} = 1 - \dfrac{\hat{H}_\gamma - \hat{H}_\alpha}{\log 2}$,

where $\hat{H}_\alpha$ is based on Chao et al. (2013), and the estimation of $\hat{H}_\gamma$ will be published soon.

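• C22 (q=2, Morisita-Horn)

$$\hat{C}_{22} = \frac{2 \sum_{i=1}^{S} \dfrac{X_{i1}}{n_1} \dfrac{X_{i2}}{n_2}}{\sum_{X_{i1} \ge 2} \dfrac{X_{i1}^{(2)}}{n_1^{(2)}} + \sum_{X_{i2} \ge 2} \dfrac{X_{i2}^{(2)}}{n_2^{(2)}}},$$

where $x^{(2)} = x(x-1)$.

• U22 (q=2, Regional overlap)

$$\hat{U}_{22} = \frac{4 \sum_{i=1}^{S} \dfrac{X_{i1}}{n_1} \dfrac{X_{i2}}{n_2}}{\sum_{X_{i1} \ge 2} \dfrac{X_{i1}^{(2)}}{n_1^{(2)}} + \sum_{X_{i2} \ge 2} \dfrac{X_{i2}^{(2)}}{n_2^{(2)}} + 2 \sum_{i=1}^{S} \dfrac{X_{i1}}{n_1} \dfrac{X_{i2}}{n_2}}.$$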
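
The two order-2 estimators above share the same building blocks, so they can be computed together. The sketch below uses hypothetical function names and data, and assumes the two abundance vectors list the same species in the same order with each sample size at least 2.

```python
def falling2(x):
    """x^(2) = x (x - 1)."""
    return x * (x - 1)


def morisita_horn_estimates(x1, x2):
    """Return (C22_hat, U22_hat) for two aligned abundance vectors."""
    n1, n2 = sum(x1), sum(x2)
    cross = sum((a / n1) * (b / n2) for a, b in zip(x1, x2))
    # Species with X < 2 contribute 0 to these sums, matching the X >= 2 range.
    within = (
        sum(falling2(a) for a in x1) / falling2(n1)
        + sum(falling2(b) for b in x2) / falling2(n2)
    )
    c22 = 2 * cross / within
    u22 = 4 * cross / (within + 2 * cross)
    return c22, u22


# Hypothetical samples of 10 individuals each.
c22, u22 = morisita_horn_estimates([6, 3, 1, 0], [1, 2, 0, 7])
```

Writing $a$ for the cross term and $b$ for the within term, $\hat{C}_{22} = 2a/b$ and $\hat{U}_{22} = 4a/(b + 2a)$, so the two estimates are linked by $\hat{U}_{22} = 2\hat{C}_{22}/(1 + \hat{C}_{22})$.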

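• ChaoJaccard Abundance: $\dfrac{\hat{U}\hat{V}}{\hat{U} + \hat{V} - \hat{U}\hat{V}}$,

where

$$\hat{U} = \sum_{i=1}^{D_{12}} \frac{X_{i1}}{n_1} + \frac{(n_2 - 1)}{n_2} \frac{f_{+1}}{2 f_{+2}} \sum_{i=1}^{D_{12}} \frac{X_{i1}}{n_1} I(X_{i2} = 1);$$

$$\hat{V} = \sum_{i=1}^{D_{12}} \frac{X_{i2}}{n_2} + \frac{(n_1 - 1)}{n_1} \frac{f_{1+}}{2 f_{2+}} \sum_{i=1}^{D_{12}} \frac{X_{i2}}{n_2} I(X_{i1} = 1).$$

• ChaoSørensen Abundance: $\dfrac{2\hat{U}\hat{V}}{\hat{U} + \hat{V}}$.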
(c) Measures for comparing size-weighted species relative abundances

• Horn (q=1): $1 - \dfrac{\hat{H}_\gamma - \hat{H}_\alpha}{-\sum_{k=1}^{N} (n_k/n_+) \log(n_k/n_+)}$,

where $\hat{H}_\gamma$ and $\hat{H}_\alpha$ are the estimators developed by Chao et al. (2013) when applied to the pooled and individual communities, respectively.

(d) Measures for comparing species absolute abundances

• C12=U12 (q=1)

$$\hat{C}_{12} = \hat{U}_{12} = \frac{\hat{H}_\alpha - \hat{H}_\gamma - \sum_{k=1}^{N} (n_k/n_+) \log(n_k/n_+)}{\log N},$$

where $\hat{H}_\gamma$ and $\hat{H}_\alpha$ are the estimators developed by Chao et al. (2013) when applied to the pooled and individual communities, respectively.


• C22 (Morisita-Horn) under a Poisson model and equal-effort sampling scheme

$$\hat{C}_{22} = \frac{2 \sum_{i=1}^{S} X_{i1} X_{i2}}{\sum_{X_{i1} \ge 2} X_{i1}^{(2)} + \sum_{X_{i2} \ge 2} X_{i2}^{(2)}}.$$

• U22 (Regional overlap) under a Poisson model and equal-effort sampling scheme

$$\hat{U}_{22} = \frac{4 \sum_{i=1}^{S} X_{i1} X_{i2}}{\sum_{X_{i1} \ge 2} X_{i1}^{(2)} + \sum_{X_{i2} \ge 2} X_{i2}^{(2)} + 2 \sum_{i=1}^{S} X_{i1} X_{i2}}.$$

 Bray-Curtis
The proposed estimator will be published.

Formulas for incidence data (Examples 4.2 and 4.2B)


Suppose T1 and T2 sampling units are taken from Community 1 and Community 2, respectively.
Let (Yi1, Yi2) denote the observed incidence frequencies of species i in Sample 1 and
Sample 2, i.e., Yi1 and Yi2 denote the number of sampling units in which the ith species is detected in
Sample 1 and Sample 2, respectively. For most of the similarity measures based on replicated
incidence data, (Yi1, Yi2) can be regarded as a proxy for species “abundances”, and (Y+1, Y+2), the
total numbers of incidences in Sample 1 and Sample 2, play the same role as the sample sizes. All
formulas are the same as or parallel to those for abundance data.
(a) Classic richness-based similarity
The estimators are the same as those presented in abundance data.
(b) Measures for comparing species relative detection probabilities
 C12=U12 (q=1, Horn), C22 (q=2, Morisita-Horn), and U22 (q=2, Regional
overlap)
The estimators are the same as those derived in abundance data by regarding (Yi1,
Yi2) as proxy of species abundances.

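• ChaoJaccard Incidence: $\dfrac{\hat{U}\hat{V}}{\hat{U} + \hat{V} - \hat{U}\hat{V}}$,

where

$$\hat{U} = \sum_{i=1}^{D_{12}} \frac{Y_{i1}}{Y_{+1}} + \frac{(T_2 - 1)}{T_2} \frac{Q_{+1}}{2 Q_{+2}} \sum_{i=1}^{D_{12}} \frac{Y_{i1}}{Y_{+1}} I(Y_{i2} = 1);$$

$$\hat{V} = \sum_{i=1}^{D_{12}} \frac{Y_{i2}}{Y_{+2}} + \frac{(T_1 - 1)}{T_1} \frac{Q_{1+}}{2 Q_{2+}} \sum_{i=1}^{D_{12}} \frac{Y_{i2}}{Y_{+2}} I(Y_{i1} = 1).$$

• ChaoSørensen Incidence: $\dfrac{2\hat{U}\hat{V}}{\hat{U} + \hat{V}}$.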
(c) Measures for comparing size-weighted species relative detection probabilities

• Horn (q=1): $1 - \dfrac{\hat{H}_\gamma - \hat{H}_\alpha}{-\sum_{k=1}^{N} (Y_{+k}/Y_{++}) \log(Y_{+k}/Y_{++})}$.

The estimator is the same as that derived for abundance data, regarding (Yi1, Yi2) as a proxy for species abundances.

(d) Measures for comparing species detection probabilities


 C12=U12 (q=1)
The estimator is the same as those derived in abundance data by regarding (Yi1, Yi2)
as proxy of species abundances.
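• C22 (Morisita-Horn)

$$\hat{C}_{22} = \frac{2 \sum_{i=1}^{S} \dfrac{Y_{i1}}{T_1} \dfrac{Y_{i2}}{T_2}}{\sum_{Y_{i1} \ge 2} \dfrac{Y_{i1}^{(2)}}{T_1^{(2)}} + \sum_{Y_{i2} \ge 2} \dfrac{Y_{i2}^{(2)}}{T_2^{(2)}}}.$$

• U22 (Regional overlap)

$$\hat{U}_{22} = \frac{4 \sum_{i=1}^{S} \dfrac{Y_{i1}}{T_1} \dfrac{Y_{i2}}{T_2}}{\sum_{Y_{i1} \ge 2} \dfrac{Y_{i1}^{(2)}}{T_1^{(2)}} + \sum_{Y_{i2} \ge 2} \dfrac{Y_{i2}^{(2)}}{T_2^{(2)}} + 2 \sum_{i=1}^{S} \dfrac{Y_{i1}}{T_1} \dfrac{Y_{i2}}{T_2}}.$$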
 Bray-Curtis
The proposed estimator will be published.

Part V: Multiple-Community (Similarity) Measures

All measures featured in SpadeR for the general case of N communities are summarized in Table
A2.

Table A2. (A general case of N communities.) A summary of all similarity indices at the community level. All indices are in terms of true species richness and species abundances or frequencies. $(z_{i1}, z_{i2}, \ldots, z_{iN})$ denote the true abundances of species $i$ in Community 1 to Community $N$, and $(p_{i1}, p_{i2}, \ldots, p_{iN})$ denote the true within-community relative abundances of species $i$.

| Goal | $C_{qN}$ or $U_{qN}$ | Sørensen-type similarity measures | Jaccard-type similarity measures |
|---|---|---|---|
| (a) Comparing species richness | Yes ($q=0$) | Classic Sørensen^a: $C_{0N} = \dfrac{N - S/\bar{S}}{N - 1}$ | Classic Jaccard^a: $U_{0N} = \dfrac{\bar{S}/S - 1/N}{1 - 1/N}$ |
| (b) Comparing relative abundances (a special case of Goal (d) obtained by letting $z_{ik} = p_{ik}$, $z_{+k} = 1$ and $z_{++} = N$ in (d)) | Yes ($q=1$) | Horn (1966) equal-weight index^b: $C_{1N} = U_{1N} = 1 - \dfrac{H_\gamma - H_\alpha}{\log N}$ | (same measure as the Sørensen-type column) |
| | Yes ($q=2$) | Morisita-Horn^c: $C_{2N} = 1 - \dfrac{\sum_{i=1}^{S} \sum_{m > k} (p_{im} - p_{ik})^2}{(N-1) \sum_{i=1}^{S} \sum_{k=1}^{N} p_{ik}^2} = 1 - \dfrac{H_{GS,\gamma} - H_{GS,\alpha}}{(1 - 1/N)(1 - H_{GS,\alpha})}$ | Regional-overlap^c: $U_{2N} = 1 - \dfrac{\sum_{i=1}^{S} \sum_{m > k} (p_{im} - p_{ik})^2}{(N-1) \sum_{i=1}^{S} p_{i+}^2} = 1 - \dfrac{H_{GS,\gamma} - H_{GS,\alpha}}{(N-1)(1 - H_{GS,\gamma})}$ |
| (c) Comparing weighted relative abundances | No | Horn (1966) size-weighted index^b: $1 - \dfrac{H_\gamma - H_\alpha}{-\sum_{k=1}^{N} (z_{+k}/z_{++}) \log(z_{+k}/z_{++})}$ | (same measure as the Sørensen-type column) |
| (d) Comparing absolute abundances (sampling effort must be standardized in all communities) | Yes ($q=1$) | Chiu et al. (2014) Shannon-entropy-based index^b: $C_{1N} = U_{1N} = \dfrac{H_\alpha - H_\gamma - \sum_{k=1}^{N} (z_{+k}/z_{++}) \log(z_{+k}/z_{++})}{\log N}$ | (same measure as the Sørensen-type column) |
| | Yes ($q=2$) | Morisita-Horn: $C_{2N} = 1 - \dfrac{\sum_{i=1}^{S} \sum_{m > k} (z_{im} - z_{ik})^2}{(N-1) \sum_{i=1}^{S} \sum_{k=1}^{N} z_{ik}^2}$ | Regional-overlap: $U_{2N} = 1 - \dfrac{\sum_{i=1}^{S} \sum_{m > k} (z_{im} - z_{ik})^2}{(N-1) \sum_{i=1}^{S} z_{i+}^2}$ |
| | No | N-community Bray-Curtis type: $1 - \dfrac{\sum_{i=1}^{S} \sum_{k=1}^{N} \lvert z_{ik} - \bar{z}_{i+} \rvert}{2(1 - 1/N)\, z_{++}}$ | (same measure as the Sørensen-type column) |

^a $S$ = species richness in the pooled community, $\bar{S}$ = average species richness per community.

^b $H_\gamma$, $H_\alpha$ = gamma and alpha Shannon entropy in Goal (c) and Goal (d):

$$H_\gamma = -\sum_{i=1}^{S} \frac{z_{i+}}{z_{++}} \log\left(\frac{z_{i+}}{z_{++}}\right); \qquad H_\alpha = \sum_{k=1}^{N} \frac{z_{+k}}{z_{++}} \left[ -\sum_{i=1}^{S} \frac{z_{ik}}{z_{+k}} \log\left(\frac{z_{ik}}{z_{+k}}\right) \right].$$

In Goal (b), let $z_{ik} = p_{ik}$ = within-community relative abundance, implying $z_{+k} = 1$, $z_{++} = N$. The above formulas reduce to

$$H_\gamma = -\sum_{i=1}^{S} \left(\sum_{k=1}^{N} \frac{p_{ik}}{N}\right) \log\left(\sum_{k=1}^{N} \frac{p_{ik}}{N}\right), \qquad H_\alpha = -\frac{1}{N} \sum_{i=1}^{S} \sum_{k=1}^{N} p_{ik} \log p_{ik}.$$

^c $H_{GS,\gamma}$, $H_{GS,\alpha}$ = gamma and alpha Gini-Simpson index in Goal (c) and Goal (d):

$$H_{GS,\gamma} = 1 - \sum_{i=1}^{S} \left(\frac{z_{i+}}{z_{++}}\right)^2; \qquad H_{GS,\alpha} = \sum_{k=1}^{N} \frac{z_{+k}}{z_{++}} \left[ 1 - \sum_{i=1}^{S} \left(\frac{z_{ik}}{z_{+k}}\right)^2 \right].$$

In Goal (b), let $z_{ik} = p_{ik}$ = within-community relative abundance, implying $z_{+k} = 1$, $z_{++} = N$. The above formulas reduce to

$$H_{GS,\gamma} = 1 - \sum_{i=1}^{S} \left(\sum_{k=1}^{N} \frac{p_{ik}}{N}\right)^2, \qquad H_{GS,\alpha} = \frac{1}{N} \sum_{k=1}^{N} \left(1 - \sum_{i=1}^{S} p_{ik}^2\right).$$
__________________________________________________________________________
In the second part of the output, all empirical similarity indices are obtained by plugging the
observed data in the above formulas. In the third part of the output, the estimation procedures are
briefly summarized below. Assume that a random sample (Sample k) of individuals is taken from
Community k for k = 1, 2, …, N. Let (Xi1, Xi2, ∙∙∙, XiN) denote the observed abundances of species
i in Sample 1 to Sample N, and (X+1, X+2, ∙∙∙, X+N) ≡ (n1, n2, ∙∙∙, nN) denote the sample sizes of
Sample 1 to Sample N.

Formulas for abundance data (Example 5.1)

(a) Classic richness-based similarity

• C0N (q=0, Sørensen): $\hat{C}_{0N} = \dfrac{N - \hat{S}/\hat{\bar{S}}}{N - 1}$.

• U0N (q=0, Jaccard): $\hat{U}_{0N} = \dfrac{\hat{\bar{S}}/\hat{S} - 1/N}{1 - 1/N}$,

where $\hat{S}$ is the Chao1 species richness estimator of the pooled community (Part I), $\hat{\bar{S}} = \sum_{k=1}^{N} \hat{S}_k / N$, and $\hat{S}_k$ denotes the Chao1 species richness estimator of the $k$th community.
(b) Measures for comparing species relative abundances

• C1N=U1N (q=1, Horn): $\hat{C}_{1N} = \hat{U}_{1N} = 1 - \dfrac{\hat{H}_\gamma - \hat{H}_\alpha}{\log N}$,

where $\hat{H}_\alpha$ is based on Chao et al. (2013), and the estimation of $\hat{H}_\gamma$ will be published soon.

• C2N (q=2, Morisita-Horn)

We adopt the nearly unbiased estimator developed by Chao et al. (2008). Generally, the nearly unbiased estimator is formulated by noting that both $C_{2N}$ and $U_{2N}$ can be expressed as a function of $p_{i1}^{r_1} p_{i2}^{r_2} \cdots p_{iN}^{r_N}$ (with $\sum_{k=1}^{N} r_k = 2$), and the unbiased estimator for this term is $X_{i1}^{(r_1)} X_{i2}^{(r_2)} \cdots X_{iN}^{(r_N)} / (n_1^{(r_1)} n_2^{(r_2)} \cdots n_N^{(r_N)})$, where $x^{(k)} = x(x-1)\cdots(x-k+1)$. For example, in the case of N = 3, we have

$$\hat{C}_{23} = \frac{\sum_{i=1}^{S} \left( \dfrac{X_{i1} X_{i2}}{n_1 n_2} + \dfrac{X_{i1} X_{i3}}{n_1 n_3} + \dfrac{X_{i2} X_{i3}}{n_2 n_3} \right)}{\sum_{i=1}^{S} \left( \dfrac{X_{i1}^{(2)}}{n_1^{(2)}} + \dfrac{X_{i2}^{(2)}}{n_2^{(2)}} + \dfrac{X_{i3}^{(2)}}{n_3^{(2)}} \right)},$$

where $x^{(2)} = x(x-1)$.

• U2N (q=2, Regional overlap)

The method described above can be applied to obtain a nearly unbiased estimator of $U_{2N}$. For example, in the case of N = 3, we have

$$\hat{U}_{23} = \frac{3 \sum_{i=1}^{S} \left( \dfrac{X_{i1} X_{i2}}{n_1 n_2} + \dfrac{X_{i1} X_{i3}}{n_1 n_3} + \dfrac{X_{i2} X_{i3}}{n_2 n_3} \right)}{\sum_{i=1}^{S} \left( \dfrac{X_{i1}^{(2)}}{n_1^{(2)}} + \dfrac{X_{i2}^{(2)}}{n_2^{(2)}} + \dfrac{X_{i3}^{(2)}}{n_3^{(2)}} + \dfrac{2 X_{i1} X_{i2}}{n_1 n_2} + \dfrac{2 X_{i1} X_{i3}}{n_1 n_3} + \dfrac{2 X_{i2} X_{i3}}{n_2 n_3} \right)}.$$

(c) Measures for comparing size-weighted species relative abundances

• Horn (q=1)

\[ 1 - \frac{\hat{H}_{\gamma} - \hat{H}_{\alpha}}{-\sum_{k=1}^{N} (n_k/n) \log(n_k/n)}, \]

where $n = n_1 + n_2 + \cdots + n_N$ is the total sample size, and $\hat{H}_{\gamma}$ and
$\hat{H}_{\alpha}$ are the estimators developed by Chao et al. (2013) when applied to the pooled
and individual communities, respectively.

(d) Using size as community weight (for comparing species absolute abundances)

• C1N=U1N (q=1)

\[ \hat{C}_{1N} = \hat{U}_{1N} = \frac{\hat{H}_{\alpha} - \hat{H}_{\gamma} - \sum_{k=1}^{N} (n_k/n) \log(n_k/n)}{\log N}, \]

where $n = n_1 + n_2 + \cdots + n_N$, and $\hat{H}_{\gamma}$ and $\hat{H}_{\alpha}$ are the estimators
developed by Chao et al. (2013) when applied to the pooled and individual communities, respectively.
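The two size-weighted q = 1 measures in (c) and (d) differ only in their normalization, which a small Python sketch makes explicit. It assumes that the gamma and alpha entropy estimates are already available (they are plain inputs here, not computed), that natural logarithms are used, and that the weights are n_k/n; the function names are hypothetical.

```python
import math


def weight_entropy(sizes):
    """Entropy of the size weights: -sum_k (n_k/n) log(n_k/n)."""
    total = sum(sizes)
    return -sum((n / total) * math.log(n / total) for n in sizes)


def horn_size_weighted(h_gamma, h_alpha, sizes):
    """Size-weighted Horn measure of (c): 1 - (H_gamma - H_alpha) / H_w."""
    return 1 - (h_gamma - h_alpha) / weight_entropy(sizes)


def c1n_absolute(h_gamma, h_alpha, sizes):
    """C1N = U1N of (d): (H_alpha - H_gamma + H_w) / log N."""
    return (h_alpha - h_gamma + weight_entropy(sizes)) / math.log(len(sizes))


# Hypothetical entropy estimates with unequal sample sizes.
sizes = [10, 30]
horn_same = horn_size_weighted(1.2, 1.2, sizes)
c1n_same = c1n_absolute(1.2, 1.2, sizes)
```

When the gamma and alpha entropies coincide, the Horn measure in (c) is 1 regardless of the sample sizes, whereas the measure in (d) reaches 1 only if the sizes are also equal, reflecting that it compares absolute rather than relative abundances.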


• C2N (Morisita-Horn)

Generally, the nearly unbiased estimator is formulated by noting that both C2N and
U2N can be expressed as a function of $z_{i1}^{r_1} z_{i2}^{r_2} \cdots z_{iN}^{r_N}$ (with
$\sum_{k=1}^{N} r_k = 2$), and the unbiased estimator for this term is
$X_{i1}^{(r_1)} X_{i2}^{(r_2)} \cdots X_{iN}^{(r_N)}$. For the special case of N = 3, we have

\[ \hat{C}_{23} = \frac{\sum_{i=1}^{S} \left( X_{i1}X_{i2} + X_{i1}X_{i3} + X_{i2}X_{i3} \right)}{\sum_{i=1}^{S} \left( X_{i1}^{(2)} + X_{i2}^{(2)} + X_{i3}^{(2)} \right)}. \]

• U2N (Regional overlap)

The method described above can be applied to obtain a nearly unbiased estimator
of U2N. For example, in the case of N = 3, we have

\[ \hat{U}_{23} = \frac{3 \sum_{i=1}^{S} \left( X_{i1}X_{i2} + X_{i1}X_{i3} + X_{i2}X_{i3} \right)}{\sum_{i=1}^{S} \left( X_{i1}^{(2)} + X_{i2}^{(2)} + X_{i3}^{(2)} + 2X_{i1}X_{i2} + 2X_{i1}X_{i3} + 2X_{i2}X_{i3} \right)}. \]

The denominator of the above formula is identical to $\sum_{i=1}^{S} X_{i+}^{(2)}$, where
$X_{i+} = X_{i1} + X_{i2} + X_{i3}$.
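The absolute-abundance versions for N = 3, together with the denominator identity, can be checked numerically with a minimal Python sketch. The function names and the toy counts are hypothetical and serve only as an illustration.

```python
def absolute_overlap_n3(counts):
    """Return (C23-hat, U23-hat) for rows (X_i1, X_i2, X_i3) of raw counts."""
    # Cross products between samples, summed over species.
    cross = sum(a * b + a * c + b * c for a, b, c in counts)
    # Falling factorials X^(2) = X(X - 1) within samples, summed over species.
    within = sum(a * (a - 1) + b * (b - 1) + c * (c - 1) for a, b, c in counts)
    c23 = cross / within
    u23 = 3 * cross / (within + 2 * cross)
    return c23, u23


def pooled_falling2(counts):
    """Sum over species of X_i+ (X_i+ - 1), with X_i+ the pooled count."""
    return sum(sum(row) * (sum(row) - 1) for row in counts)


# Toy abundance table: 3 species (rows) by 3 samples (columns).
toy_counts = [(3, 1, 0), (1, 3, 2), (0, 0, 2)]
c23_abs, u23_abs = absolute_overlap_n3(toy_counts)
denominator = pooled_falling2(toy_counts)
```

For these toy counts the cross-product sum is 14 and the within-sample sum is 16, so the denominator of the U2N estimator is 16 + 2 × 14 = 44, which matches the pooled falling-factorial sum.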

• Bray-Curtis
The proposed estimator will be published soon.

Formulas for incidence data (Examples 5.2 and 5.2B) are direct extensions of the corresponding
two-community measures.

Part VI: Genetics Measures

The genetic differentiation measures featured in this part are a subset of those presented in Part V;
see the formulas in Part V for details. The output in Part VI does not include comparisons of
allele absolute abundances because most genetic applications focus on comparing allele
relative frequencies. The only additional measure is GST, which has been widely used in
genetics but not in ecology. The GST differentiation measure is expressed as

\[ G_{ST} = 1 - H_S / H_T, \]

where HT denotes the heterozygosity in the total population and HS denotes the average
heterozygosity in a sub-population. Note that HT and HS are the same as the gamma and alpha
Gini-Simpson indices in ecology; see the footnotes in Table A2. Given allele frequency data from
N sub-populations, an unbiased estimator for the sub-population heterozygosity is

\[ \hat{H}_S = 1 - \frac{1}{N} \sum_{k=1}^{N} \sum_{i} \frac{X_{ik}(X_{ik} - 1)}{n_k(n_k - 1)}, \]

where $X_{ik}$ denotes the observed frequency of allele i in the kth sub-population. An unbiased
estimator for the total-population heterozygosity $H_T$ is

\[ \hat{H}_T = 1 - \frac{1}{N^2} \sum_{k=1}^{N} \sum_{i} \frac{X_{ik}(X_{ik} - 1)}{n_k(n_k - 1)} - \frac{1}{N^2} \sum_{j \neq k} \sum_{i} \frac{X_{ij} X_{ik}}{n_j n_k}. \]

The estimated GST measure is then computed as $\hat{G}_{ST} = 1 - \hat{H}_S / \hat{H}_T$.
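The three estimators can be combined in a short Python sketch. The function name `gst_estimate` and the allele counts are hypothetical; the sum over j ≠ k is taken over ordered pairs of sub-populations, and each sub-population is assumed to contain at least two sampled allele copies.

```python
def gst_estimate(counts):
    """Return (H_S-hat, H_T-hat, G_ST-hat) for counts[i][k] = X_ik.

    Rows are alleles, columns are sub-populations.
    """
    n_pop = len(counts[0])
    sizes = [sum(row[k] for row in counts) for k in range(n_pop)]  # n_k
    # sum_k sum_i X_ik (X_ik - 1) / (n_k (n_k - 1))
    within = sum(
        row[k] * (row[k] - 1) / (sizes[k] * (sizes[k] - 1))
        for row in counts
        for k in range(n_pop)
    )
    # sum over ordered pairs j != k and alleles i of X_ij X_ik / (n_j n_k)
    between = sum(
        row[j] * row[k] / (sizes[j] * sizes[k])
        for row in counts
        for j in range(n_pop)
        for k in range(n_pop)
        if j != k
    )
    h_s = 1 - within / n_pop
    h_t = 1 - within / n_pop**2 - between / n_pop**2
    return h_s, h_t, 1 - h_s / h_t


# Two sub-populations fixed for different alleles (hypothetical counts).
h_s, h_t, gst = gst_estimate([[10, 0], [0, 10]])
```

In this extreme case every sub-population has zero heterozygosity while the total population does not, so the estimated GST equals 1. With identical small samples the estimate can be slightly negative, which is a consequence of the unbiased components rather than a coding error.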
