07 - Diversity - Stats in R
Diversity statistics
Load example data
If you are starting the workshop at this section, or had problems running code in a previous section, use the
following to load the data used in this section. You can download the “clean_data.Rdata” file here
(clean_data.Rdata). If obj and sample_data are already in your environment, you can ignore this and proceed.
load("clean_data.Rdata")
Measures of diversity
Diversity in the ecological sense is intuitively understood as the complexity of a community of organisms.
There are many ways to quantify this complexity so that we can compare communities objectively. The two
main categories of methods are known as alpha diversity and beta diversity (Whittaker 1960). Alpha diversity
measures the diversity within a single sample and is generally based on the number and relative abundance of
taxa at some rank (e.g. species or OTUs). Beta diversity also uses the number and relative abundance of taxa at
some rank, but measures variation between samples. In other words, an alpha diversity statistic describes a
single sample and a beta diversity statistic describes how two samples compare.
The vegan package is the main tool set used for calculating biological diversity statistics in R.
There are also some diversity indexes that take into account the taxonomic similarity of the species called
“taxonomic diversity” and “taxonomic distinctness”, but we will not go into those.
The diversity function from the vegan package can be used to calculate the alpha diversity of a set of
samples. Like other vegan functions, it assumes that samples are in rows, but they are in columns in our data,
so we need to use the MARGIN = 2 option. We also need to exclude the taxon ID column by subsetting the
columns to only samples (i.e. all columns besides the first one). Since alpha diversity is a per-sample attribute,
we can just add this as a column to the sample data table:
library(vegan)
sample_data$alpha <- diversity(obj$data$otu_rarefied[, sample_data$SampleID],
                               MARGIN = 2,
                               index = "invsimpson")
hist(sample_data$alpha)
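To make the index less of a black box, here is a minimal base-R sketch of the inverse Simpson index, computed by hand as one over the sum of squared relative abundances. The function name inverse_simpson and the toy_counts matrix are hypothetical and are not part of the workshop data; the real analysis uses diversity from vegan as shown above.

```r
# Inverse Simpson index for one sample, computed by hand.
# `counts` is a hypothetical vector of per-taxon counts.
inverse_simpson <- function(counts) {
  p <- counts / sum(counts)  # relative abundance of each taxon
  1 / sum(p^2)               # reciprocal of the summed squared proportions
}

# Toy matrix (made up): taxa in rows, samples in columns
toy_counts <- cbind(even   = c(10, 10, 10, 10),
                    uneven = c(37, 1, 1, 1))

# apply(..., 2, ...) works column by column, mirroring MARGIN = 2
toy_alpha <- apply(toy_counts, 2, inverse_simpson)
print(toy_alpha)
```

A sample with four equally abundant taxa scores 4, while a sample dominated by one taxon scores close to 1, which is why the index is often read as an "effective number" of taxa.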
https://grunwaldlab.github.io/analysis_of_microbiome_community_data_in_r/07--diversity_stats.html 1/25
1/19/23, 5:58 PM https://grunwaldlab.github.io/analysis_of_microbiome_community_data_in_r/07--diversity_stats.html
Adding this as a column to the sample data table makes it easy to graph using ggplot2 .
library(ggplot2)
ggplot(sample_data, aes(x = Site, y = alpha)) +
  geom_boxplot()
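The Tukey test later in this section takes an object named anova_result as input. The following is only a sketch, on made-up data, of how a one-way ANOVA of this kind can be fit with aov from base R. The toy_samples table and its values are hypothetical; in the workshop the equivalent call would presumably use sample_data and Site, by analogy with the aov(alpha ~ grouping, sample_data) call that appears further down.

```r
# Made-up data: three hypothetical sites with three samples each
toy_samples <- data.frame(
  Site  = rep(c("A", "B", "C"), each = 3),
  alpha = c(1, 2, 3, 11, 12, 13, 21, 22, 23)
)

# One-way ANOVA: does mean alpha diversity differ among sites?
toy_anova <- aov(alpha ~ Site, data = toy_samples)
toy_summary <- summary(toy_anova)
print(toy_summary)

# P-value of the Site term
toy_p <- toy_summary[[1]][["Pr(>F)"]][1]
```

A small p-value for the Site term suggests that at least one site mean differs from the others, without saying which one.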
The ANOVA result (anova_result) tells us that there is a difference, but does not tell us which means are
different. A Tukey’s Honest Significant Difference (HSD) test can do pairwise comparisons of the means to
find this out. We will use the HSD.test function from the agricolae package since it provides grouping
codes that are useful for graphing.
library(agricolae)
tukey_result <- HSD.test(anova_result, "Site", group = TRUE)
print(tukey_result)
## $statistics
## MSerror Df Mean CV
## 2722.819 289 57.48091 90.77907
##
## $parameters
## test name.t ntr StudentizedRange alpha
## Tukey Site 3 3.33168 0.05
##
## $means
## alpha std r Min Max Q25 Q50 Q75
## Jam 32.62941 31.37153 84 1.266110 101.3504 1.676679 32.45049 58.00689
## Mah 62.03473 49.53254 104 2.144132 153.5168 13.061531 58.63623 104.13730
## Sil 72.99945 66.28129 104 3.431795 224.0853 12.485001 37.78903 142.18317
##
## $comparison
## NULL
##
## $groups
## alpha groups
## Sil 72.99945 a
## Mah 62.03473 a
## Jam 32.62941 b
##
## attr(,"class")
## [1] "group"
Looking at the tukey_result$groups table it appears that the alpha diversity of sites “Sil” and “Mah” might not
be different, but there is evidence that the diversity in site “Jam” is lower. We can add this information to the
graph using the tukey_result$groups$groups codes:
So that takes care of comparing the alpha diversity of sites, but there are other interesting groupings we can
compare, such as the genotype and the type of the sample (roots vs leaves). We could do the above all over
with minor modifications, but one of the benefits of using a programming language is that you can create your
own functions to automate repeated tasks. We can generalize what we did above and put it in a function like
so:
compare_alpha <- function(sample_data, grouping_var) {
  # Do ANOVA
  sample_data$grouping <- sample_data[[grouping_var]] # needed for how `aov` works
  anova_result <- aov(alpha ~ grouping, sample_data)
  # Do Tukey's HSD test to get the grouping codes
  tukey_result <- HSD.test(anova_result, "grouping", group = TRUE)
  # Plot result
  group_data <- tukey_result$groups[order(rownames(tukey_result$groups)),]
  my_plot <- ggplot(sample_data, aes(x = grouping, y = alpha)) +
    geom_text(data = data.frame(),
              aes(x = rownames(group_data),
                  y = max(sample_data$alpha) + 1,
                  label = group_data$groups),
              col = 'black',
              size = 10) +
    geom_boxplot() +
    ggtitle("Alpha diversity") +
    xlab(grouping_var) +
    ylab("Alpha diversity index")
  # Return plot
  return(my_plot)
}
Using this function, we can plot the alpha diversities by type of sample and genotype:
compare_alpha(sample_data, "Type")
compare_alpha(sample_data, "Genotype")
It looks like there is no difference in the alpha diversity between genotypes, but a large difference between the
diversity of roots and leaves.
The phyloseq package (McMurdie and Holmes 2013) can be used to quickly plot a variety of alpha diversity
indexes per sample using the plot_richness function. First we need to convert the taxmap object to a
phyloseq object, since all of the phyloseq functions expect phyloseq objects.
library(phyloseq)
##
## Attaching package: 'phyloseq'
Each dot is a sample and the error bars in some of the indexes are the standard error.
NOTE: the as_phyloseq function is only available in the development version of metacoder. If you want to try
it, you can install it by typing:
devtools::install_github("ropensci/taxa")
devtools::install_github("grunwaldlab/metacoder")
or you can use the current version of metacoder, download the file “ps_obj.Rdata” here (ps_obj.Rdata), and load
the ps_obj object this way:
load("ps_obj.Rdata")
## Summing per-taxon counts from 292 columns in 2 groups for 489 taxa
print(obj$data$type_abund)
## # A tibble: 489 x 3
## taxon_id leaf root
## * <chr> <dbl> <dbl>
## 1 aad 271122. 271122.
## 2 aaf 175195. 123773.
## 3 aag 28010. 75115.
## 4 aah 19116. 4137.
## 5 aai 43662. 29404.
## 6 aaj 1144. 1195.
## 7 aak 369. 7350.
## 8 aal 284. 8723.
## 9 aam 213. 3625.
## 10 aan 1548. 12523.
## # ... with 479 more rows
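The table above holds per-taxon counts summed within each sample type. As a rough base-R sketch of that kind of summation (not the workshop's own code, which is not shown here), the example below sums the columns of a small made-up count matrix by group. The names toy_counts, toy_type and toy_abund and all of the numbers are hypothetical.

```r
# Hypothetical per-sample counts: taxa in rows, samples in columns.
# matrix() fills column by column, so each line below is one sample.
toy_counts <- matrix(c(5, 0, 2,    # sample s1
                       3, 1, 0,    # sample s2
                       0, 4, 6,    # sample s3
                       1, 7, 9),   # sample s4
                     nrow = 3,
                     dimnames = list(c("aad", "aaf", "aag"),
                                     c("s1", "s2", "s3", "s4")))

# Which group each sample (column) belongs to
toy_type <- c("leaf", "leaf", "root", "root")

# For each group, sum the counts of its samples within every taxon
toy_abund <- sapply(split(colnames(toy_counts), toy_type),
                    function(cols) rowSums(toy_counts[, cols, drop = FALSE]))
print(toy_abund)
```

The result has one row per taxon and one column per group, the same shape as the leaf and root columns printed above.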
Now we can use these taxon counts to make heat trees of the primary taxa present in leaves and roots:
set.seed(2)
obj %>%
  taxa::filter_taxa(leaf > 50) %>% # taxa:: needed because of phyloseq::filter_taxa
  heat_tree(node_label = taxon_names,
            node_size = leaf,
            node_color = leaf,
            layout = "da", initial_layout = "re",
            title = "Taxa in leafs")
set.seed(3)
obj %>%
  taxa::filter_taxa(root > 50) %>% # taxa:: needed because of phyloseq::filter_taxa
  heat_tree(node_label = taxon_names,
            node_size = root,
            node_color = root,
            layout = "da", initial_layout = "re",
            title = "Taxa in roots")
Note that we needed to qualify filter_taxa with taxa:: . This is because phyloseq is loaded and it also has
a function called filter_taxa .
Bray–Curtis: The sum of lesser counts for species present in both communities divided by the sum of all
counts in both communities. This can be thought of as a quantitative version of the Sørensen index.
Weighted Unifrac: The fraction of the phylogenetic tree branch lengths shared by the two communities,
weighted by the counts of organisms, so more abundant organisms have a greater influence.
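To make the Bray–Curtis description concrete, here is a small sketch in base R. It uses the dissimilarity form, the sum of absolute count differences divided by the sum of all counts, which to my understanding is what the "bray" method of vegdist computes. It also shows the equivalent form based on the shared (lesser) counts. The function name bray_curtis and the two toy communities are hypothetical.

```r
# Bray-Curtis dissimilarity between two count vectors (same taxa, same order)
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# Two made-up communities with counts for three taxa
a <- c(10, 0, 5)
b <- c(5, 5, 5)

bc <- bray_curtis(a, b)

# Equivalent form using the lesser count of each taxon
shared <- sum(pmin(a, b))
bc_from_shared <- 1 - 2 * shared / sum(a + b)

print(c(bc, bc_from_shared))
```

Both expressions give the same value: 0 means identical communities and 1 means no shared taxa.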
The vegan function vegdist is used to calculate the pairwise beta diversity indexes for a set of samples.
Since this is a pairwise comparison, the output is a triangular matrix. In R, a matrix is like a data.frame , but
all of its values have the same type (e.g. all numeric), and it has some different behavior.
Since vegdist does not have a MARGIN option like diversity , we need to transpose the matrix with the t
function so that samples are in rows.
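The sketch below illustrates the transposition step and the triangular output on made-up data. It deliberately swaps in dist from base R (Euclidean distance) as a stand-in for vegdist, because both work on rows and return the same kind of pairwise "dist" object. The toy_counts matrix is hypothetical.

```r
# Made-up counts: taxa in rows, samples in columns
toy_counts <- cbind(s1 = c(10, 0, 5),
                    s2 = c(5, 5, 5),
                    s3 = c(0, 20, 1))
rownames(toy_counts) <- c("aad", "aaf", "aag")

# Transpose so that samples are in rows, as row-wise distance functions expect
by_sample <- t(toy_counts)

# Stand-in for vegdist: pairwise Euclidean distances between the rows (samples)
toy_dist <- dist(by_sample)
print(toy_dist)

# The same values as a full square matrix
print(as.matrix(toy_dist))
```

Without the call to t, the distances would be computed between taxa rather than between samples.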
Ordination
The typical way beta diversity is plotted is using ordination. Ordination is a way to display “high dimensional”
data in a number of dimensions that can be viewed (2 to 3). Our data is “high dimensional” because we have
many samples with many species and each species can be considered a “dimension”. If we had only two species,
we could make a scatter plot of their abundance in each sample and get an idea of how the samples differ. With
thousands of species, this is not possible. Instead, ordination is used to try to capture the information in many
dimensions in a smaller number of new “artificial” dimensions.
That transformed our beta diversity matrix into a set of coordinates in two dimensions, which attempt to
capture the differences in the data. However, it is in a format specific to vegan , so we will have to convert the
data to a form that we can use for plotting.
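As an illustration of the general idea only, the sketch below runs an ordination in base R and converts the result to a plain table. It uses cmdscale (classical multidimensional scaling) on Euclidean distances, which is a stand-in and not necessarily the method used in the workshop. The toy_counts matrix is made up, and the column names MDS1 and MDS2 are chosen to match the plotting code that follows.

```r
# Made-up counts: taxa in rows, samples in columns
toy_counts <- cbind(s1 = c(10, 0, 5, 1),
                    s2 = c(9, 1, 4, 2),
                    s3 = c(0, 20, 1, 8),
                    s4 = c(1, 18, 0, 9))

# Pairwise distances between samples (rows after transposing)
toy_dist <- dist(t(toy_counts))

# Classical MDS: find 2 new dimensions that preserve the distances
toy_coords <- cmdscale(toy_dist, k = 2)

# Convert the coordinate matrix to a data.frame suitable for plotting
toy_mds <- data.frame(SampleID = rownames(toy_coords),
                      MDS1 = toy_coords[, 1],
                      MDS2 = toy_coords[, 2])
print(toy_mds)
```

Samples with similar counts (s1 and s2, or s3 and s4) end up close together in the two new dimensions.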
To use the sample data in the plotting, we can combine the coordinate data with the sample data table:
## Joining, by = "SampleID"
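The join message above indicates that the two tables were matched on their shared SampleID column. A minimal sketch of such a join using merge from base R is shown below. The tables toy_mds and toy_samples are made up, and merge is only one of several ways to do this.

```r
# Made-up ordination coordinates, one row per sample
toy_mds <- data.frame(SampleID = c("s1", "s2", "s3"),
                      MDS1 = c(-1, 0, 1),
                      MDS2 = c(0.5, -0.5, 0))

# Made-up sample metadata, in a different row order
toy_samples <- data.frame(SampleID = c("s3", "s1", "s2"),
                          Type = c("root", "leaf", "leaf"))

# Match rows by SampleID so each coordinate gets its metadata
toy_joined <- merge(toy_mds, toy_samples, by = "SampleID")
print(toy_joined)
```

Because rows are matched by ID rather than by position, the two tables do not need to be in the same order.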
Now that we have the data in a format ggplot2 likes, we can plot it. Let's plot our two new dimensions and color
them by sample type (i.e. leaves vs roots).
library(ggplot2)
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Type)) +
geom_point()
This shows that leaf and root samples are quite distinct, as we would expect. We can also color them by Site:
It appears that within the leaf and root clusters, there are “sub-clusters” corresponding to site. Finally, let's
look at genotype:
There is no discernible pattern there, suggesting plant genotype does not correspond to community structure.
We can also do the above quickly in phyloseq using the ordinate and plot_ordination functions. Let's look
at the differences between leaf and root samples again, but using a different index this time.
For each taxon, at every rank, the compare_groups function compares two groups of counts. We have to define
which sample belongs to which group using the groups option:
## # A tibble: 489 x 7
## taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 aad leaf root 0. 0. 0. NaN
## 2 aaf leaf root 0.446 312. 352. 2.51e-17
## 3 aag leaf root -1.22 -266. -323. 9.87e-43
## 4 aah leaf root 2.00 46.5 103. 3.86e-18
## 5 aai leaf root 0.440 69.5 97.7 1.03e- 2
## 6 aaj leaf root -1.81 -5.00 -0.349 2.88e- 8
## 7 aak leaf root -Inf -50.0 -47.8 4.36e-49
## 8 aal leaf root -Inf -54.0 -57.8 2.71e-51
## 9 aam leaf root -Inf -24.0 -23.4 2.42e-49
## 10 aan leaf root -4.82 -81.5 -75.2 1.04e-44
## # ... with 479 more rows
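To show what the columns of this table mean, the sketch below computes the same kinds of statistics for a single taxon in base R. This illustrates the column definitions as suggested by their names; it is not a copy of how compare_groups is implemented. The counts in leaf_counts and root_counts are made up.

```r
# Made-up counts of one taxon in three leaf and three root samples
leaf_counts <- c(16, 32, 64)
root_counts <- c(2, 4, 8)

toy_stats <- c(
  # log2 of (median in group 1 / median in group 2)
  log2_median_ratio = log2(median(leaf_counts) / median(root_counts)),
  median_diff       = median(leaf_counts) - median(root_counts),
  mean_diff         = mean(leaf_counts) - mean(root_counts),
  # Wilcoxon rank-sum test for a difference between the two groups
  wilcox_p_value    = wilcox.test(leaf_counts, root_counts)$p.value
)
print(toy_stats)
```

A median of zero in one group makes the ratio infinite, and medians of zero in both groups make it undefined, which is one plausible reason for the -Inf values in the table above.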
The most useful statistic for plotting is the log of the ratio of median abundances in the two groups, since it is
centered on 0 and is symmetric (e.g., a value of -4 is the same magnitude as 4). Let's set any differences that
are not significant to 0 so all differences shown on the plot are significant.
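A minimal sketch of this step on a made-up table is shown below. It adjusts the p-values for multiple comparisons with p.adjust from base R and then sets the log ratio to 0 wherever the adjusted p-value is above a cutoff. The "fdr" method and the 0.05 cutoff are assumptions made for illustration, and toy_diff is a hypothetical stand-in for the table of differences.

```r
# Made-up table with the same column names as the comparison output
toy_diff <- data.frame(
  taxon_id          = c("aad", "aaf", "aag"),
  log2_median_ratio = c(2, -1.5, 0.3),
  wilcox_p_value    = c(0.001, 0.04, 0.5)
)

# Correct the p-values for multiple comparisons (method is an assumption)
toy_diff$wilcox_p_value <- p.adjust(toy_diff$wilcox_p_value, method = "fdr")

# Treat missing p-values (e.g. NaN) as not significant
not_significant <- is.na(toy_diff$wilcox_p_value) |
  toy_diff$wilcox_p_value > 0.05

# Zero out the differences that are not significant
toy_diff$log2_median_ratio[not_significant] <- 0
print(toy_diff)
```

In this toy case the second taxon has a raw p-value below 0.05 but is no longer significant after correction, so only the first difference is kept.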
Now we have all the info needed to make a differential heat tree. The standard heat_tree function can be
used, but a few things need to be done to make an effective differential heat tree:
A diverging color scale should be used, preferably with a neutral color in the middle, like gray.
The interval of values displayed must be symmetric around zero so the neutral middle color is centered
on zero.
The node color should be set to the log of the ratio of median abundances in the two groups.
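The second point above, a symmetric interval, can also be derived from the data rather than chosen by hand. The sketch below does this for a handful of values taken from the log2_median_ratio column shown earlier; the variable names are hypothetical. Infinite values are ignored when finding the limit.

```r
# A few log2 ratios, including an infinite one
toy_ratio <- c(0.446, -1.22, 2.00, -Inf, -4.82)

# Largest finite magnitude in either direction
finite_ratio <- toy_ratio[is.finite(toy_ratio)]
limit <- max(abs(finite_ratio))

# Interval centered on zero, so the neutral color lands on zero
toy_interval <- c(-limit, limit)
print(toy_interval)
```

A vector like toy_interval could be supplied as node_color_interval. The workshop code uses a fixed interval of c(-10, 10), which serves the same purpose.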
set.seed(1)
heat_tree(obj,
          node_label = taxon_names,
          node_size = n_obs, # number of OTUs
          node_color = log2_median_ratio, # difference between groups
          node_color_interval = c(-10, 10), # symmetric interval
          node_color_range = c("cyan", "gray", "magenta"), # diverging colors
          node_size_axis_label = "OTU count",
          node_color_axis_label = "Log 2 ratio of median counts")
That’s not too bad looking, but we can tweak it a bit to make it better by changing the layout, adding a title, and
filtering out some of the taxa with odd names (depending on what the plot is for, you might not want to remove
these). Here mutate_obs is used to add a temporary variable to the taxmap object that will contain the taxon
names with special characters like [ removed.
set.seed(1)
obj %>%
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  taxa::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
  heat_tree(node_label = cleaned_names,
            node_size = n_obs, # number of OTUs
            node_color = log2_median_ratio, # difference between groups
            node_color_interval = c(-10, 10), # symmetric interval
            node_color_range = c("cyan", "gray", "magenta"), # diverging colors
            node_size_axis_label = "OTU count",
            node_color_axis_label = "Log 2 ratio of median counts",
            layout = "da", initial_layout = "re", # good layout for large trees
            title = "leaf vs root samples")
What color corresponds to each group depends on the order they were given in the compare_groups function.
Since “leaf” is “treatment_1” in the “diff_table”, and “log2_median_ratio” is defined as “log2(treatment_1 /
treatment_2)”, when a taxon has more counts in leaf samples, the ratio is positive. Therefore, taxa more
abundant in leaves are colored magenta in this case.
Comparing taxon abundance with more than 2 groups
For pair-wise comparisons with more than two groups we have developed a graphing technique we call a heat
tree matrix, in which trees are made for all pairs of sample groupings and arranged in a matrix with a larger key
tree that has the taxon names and a legend. The code to do this is similar to the code for making a single
differential heat tree like we did above, but uses the heat_tree_matrix function. First we need to use
compare_groups to generate data for all pair-wise comparisons for a grouping with more than two treatments.
The code below compares all the sites used in this study to each other:
## # A tibble: 1,467 x 7
## taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 aad Mah Jam 0. 0. 0. NaN
## 2 aaf Mah Jam 0.186 114. -174. 8.01e- 1
## 3 aag Mah Jam -0.107 -27.5 -58.4 7.99e- 1
## 4 aah Mah Jam 0.752 13.0 13.5 2.64e- 1
## 5 aai Mah Jam 0.867 103. 197. 2.09e-13
## 6 aaj Mah Jam 0.322 1.00 0.622 4.17e- 1
## 7 aak Mah Jam -0.112 -1.50 -2.60 9.68e- 1
## 8 aal Mah Jam 0.307 4.50 9.90 1.32e- 1
## 9 aam Mah Jam -0.900 -6.50 -3.78 7.18e- 2
## 10 aan Mah Jam 0.186 4.00 12.3 6.23e- 3
## # ... with 1,457 more rows
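The row count of this table follows from the number of pairwise comparisons. As a quick check in base R, using the three site codes and the 489 taxa reported earlier in this section:

```r
# The three site codes used in this study
sites <- c("Jam", "Mah", "Sil")

# All unordered pairs of sites, one pair per column
site_pairs <- combn(sites, 2)
print(site_pairs)

# One row per taxon per pair
n_taxa <- 489
n_rows <- n_taxa * ncol(site_pairs)
print(n_rows)
```

Three sites give three pairs, so 489 taxa produce 1,467 rows. The number of pairs grows quickly with more groups (for example, six groups give 15 pairs), which is where the matrix layout becomes useful.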
We then need to correct for multiple comparisons and set non-significant differences to zero like we did
before:
Finally we call the heat_tree_matrix command with the same options that would be used for a single tree.
obj %>%
  taxa::filter_taxa(taxon_ranks == "o", supertaxa = TRUE) %>%
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  taxa::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
  heat_tree_matrix(dataset = "diff_table",
                   node_label = cleaned_names,
                   node_size = n_obs, # number of OTUs
                   node_color = log2_median_ratio, # difference between groups
                   node_color_trans = "linear",
                   node_color_interval = c(-3, 3), # symmetric interval
                   edge_color_interval = c(-3, 3), # symmetric interval
                   node_color_range = diverging_palette(), # diverging colors
                   node_size_axis_label = "OTU count",
                   node_color_axis_label = "Log 2 ratio of median counts",
                   layout = "da", initial_layout = "re",
                   key_size = 0.67,
                   seed = 2)
We have only compared three groups here, due to the nature of this dataset, so this technique is not much
better than three separate, full-sized graphs in this case, but with more groups it can be a uniquely effective
way to show lots of comparisons. See the end of the example analysis (02--quick_example.html) for a better
example of this technique.
Exercises
In these exercises, we will be using the ps_obj and obj from the analysis above. If you did not run the code
above or had problems, run the following code to get the objects used. You can download the
“diversity_data.Rdata” file here (diversity_data.Rdata).
load("diversity_data.Rdata")
1a) Look at the documentation for the plot_richness function from phyloseq . Try to make a plot using only
the Simpson and inverse Simpson indexes, colored by site and split up by sample type (leaf vs root).
1b) The Simpson and inverse Simpson indexes display the same information in different ways (if you know
one, you can calculate the other). How do they differ?
2) Rarefaction and converting counts to proportions are two ways of accounting for unequal sample depth.
Although proportions are more intuitive and easier to understand, why might rarefaction be better when
calculating diversity indexes?
3a) Using the techniques presented in the section on plotting the abundance of taxa, make a plot of taxon
abundance for the site encoded “Jam”.
3b) What are some of the most abundant families in the site “Jam”?
4a) The ordination analysis colored by genotype showed no community differences corresponding to
genotype. Try to use compare_groups to see if any taxa are significantly differentially abundant between
genotypes to verify this result.
References
McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis
and Graphics of Microbiome Census Data.” PloS One 8 (4). Public Library of Science: e61217.
https://doi.org/10.1371/journal.pone.0061217.
Whittaker, Robert Harding. 1960. “Vegetation of the Siskiyou Mountains, Oregon and California.” Ecological
Monographs 30 (3). Wiley Online Library: 279–338.
Analysis of Microbiome Community Data in R by The Grunwald lab and the USDA Horticultural Crops Research Unit
(http://grunwaldlab.cgrb.oregonstate.edu/) is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
(http://creativecommons.org/licenses/by-sa/4.0/).
Based on a work at https://github.com/grunwaldlab/analysis_of_microbiome_community_data_in_r.