07 - Diversity - Stats in R

1/19/23, 5:58 PM https://grunwaldlab.github.io/analysis_of_microbiome_community_data_in_r/07--diversity_stats.html
Diversity statistics
Load example data
If you are starting the workshop at this section, or had problems running code in a previous section, use the
following to load the data used in this section. You can download the “clean_data.Rdata” file here
(clean_data.Rdata). If obj and sample_data are already in your environment, you can ignore this and proceed.
load("clean_data.Rdata")
Measures of diversity
Diversity in the ecological sense is intuitively understood as the complexity of a community of organisms.
There are many ways to quantify this complexity so that we can compare communities objectively. The two
main categories of methods are known as alpha diversity and beta diversity (Whittaker 1960). Alpha diversity
measures the diversity within a single sample and is generally based on the number and relative abundance of
taxa at some rank (e.g. species or OTUs). Beta diversity also uses the number and relative abundance of taxa at
some rank, but measures variation between samples. In other words, an alpha diversity statistic describes a
single sample and a beta diversity statistic describes how two samples compare.
The vegan package is the main tool set used for calculating biological diversity statistics in R.
Alpha (within sample) diversity

Common alpha diversity statistics include:
Shannon: How difficult it is to predict the identity of a randomly chosen individual.

Simpson: The probability that two randomly chosen individuals are the same species.
Inverse Simpson: The number of species that a theoretical community of equally abundant species would
need in order to have the same Simpson index value as the community being analyzed.
There are also some diversity indexes that take into account the taxonomic similarity of the species called
“taxonomic diversity” and “taxonomic distinctness”, but we will not go into those.
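To make these definitions concrete, here is a minimal base R sketch that computes the three indexes by hand for a single sample. The counts and variable names are invented for illustration. Note that vegan's diversity function with index = "simpson" reports one minus the probability computed below, so that higher values mean more diversity.

```r
# Hypothetical counts of four species in one sample (invented data)
counts <- c(sp_a = 50, sp_b = 30, sp_c = 15, sp_d = 5)
p <- counts / sum(counts)             # relative abundances

shannon      <- -sum(p * log(p))      # uncertainty about a random individual's identity
simpson_same <- sum(p^2)              # chance two individuals drawn with replacement match
inv_simpson  <- 1 / simpson_same      # "effective" number of equally abundant species

# Sanity check: four equally abundant species give an inverse Simpson of 4
p_even <- rep(1 / 4, 4)
inv_simpson_even <- 1 / sum(p_even^2)

print(c(shannon = shannon, simpson_same = simpson_same, inv_simpson = inv_simpson))
print(inv_simpson_even)
```

Because the invented sample is dominated by one species, its inverse Simpson value is well below the four species actually present.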
The diversity function from the vegan package can be used to calculate the alpha diversity of a set of
samples. Like other vegan functions, it assumes that samples are in rows, but they are in columns in our data,
so we need to use the MARGIN = 2 option. We also need to exclude the taxon ID column by subsetting the
columns to only samples (i.e. all columns besides the first one). Since alpha diversity is a per-sample attribute,
we can just add this as a column to the sample data table:
library(vegan)
sample_data$alpha <- diversity(obj$data$otu_rarefied[, sample_data$SampleID],
                               MARGIN = 2,
                               index = "invsimpson")
hist(sample_data$alpha)
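The effect of the MARGIN option can be illustrated without vegan. The sketch below uses base R's apply function, whose second argument plays the same role, on an invented OTU table with samples in columns. The table and the inv_simpson helper are hypothetical and are not part of the workshop data.

```r
# Invented OTU table: rows are OTUs, columns are samples.
# matrix() fills column by column, so each line of numbers below is one sample.
otu_counts <- matrix(c(10, 10, 10,
                       28,  1,  1,
                        0,  0,  5),
                     nrow = 3,
                     dimnames = list(c("otu_1", "otu_2", "otu_3"),
                                     c("sample_a", "sample_b", "sample_c")))

# Hypothetical helper: inverse Simpson index of one vector of counts
inv_simpson <- function(x) {
  p <- x / sum(x)
  1 / sum(p^2)
}

per_sample <- apply(otu_counts, 2, inv_simpson)  # 2 = one value per column (sample)
per_otu    <- apply(otu_counts, 1, inv_simpson)  # 1 = one value per row (not what we want)

print(per_sample)
```

Here sample_a, with three equally abundant OTUs, scores 3, while sample_c, with a single OTU, scores 1.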
Adding this as a column to the sample data table makes it easy to graph using ggplot2 .
library(ggplot2)
ggplot(sample_data, aes(x = Site, y = alpha)) +
  geom_boxplot()
We can use analysis of variance (ANOVA) (00--glossary.html#analysis_of_variance_(anova)_anchor) to tell if
at least one of the diversity means is different from the rest.
anova_result <- aov(alpha ~ Site, sample_data)
summary(anova_result)
##              Df Sum Sq Mean Sq F value   Pr(>F)
## Site          2  79081   39540   14.52 9.78e-07 ***
## Residuals   289 786895    2723
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
That tells us that there is a difference, but does not tell us which means are different. A Tukey’s Honest
Significant Difference (HSD) (00--glossary.html#tukey's_honest_significant_difference_(hsd)_anchor) test can
do pairwise comparisons of the means to find this out. We will use the HSD.test function from the agricolae
package since it provides grouping codes that are useful for graphing.
library(agricolae)
tukey_result <- HSD.test(anova_result, "Site", group = TRUE)
print(tukey_result)
## $statistics
##    MSerror  Df     Mean       CV
##   2722.819 289 57.48091 90.77907
##
## $parameters
##    test name.t ntr StudentizedRange alpha
##   Tukey   Site   3          3.33168  0.05
##
## $means
##        alpha      std   r      Min      Max       Q25      Q50       Q75
## Jam 32.62941 31.37153  84 1.266110 101.3504  1.676679 32.45049  58.00689
## Mah 62.03473 49.53254 104 2.144132 153.5168 13.061531 58.63623 104.13730
## Sil 72.99945 66.28129 104 3.431795 224.0853 12.485001 37.78903 142.18317
##
## $comparison
## NULL
##
## $groups
##        alpha groups
## Sil 72.99945      a
## Mah 62.03473      a
## Jam 32.62941      b
##
## attr(,"class")
## [1] "group"
Looking at the tukey_result$groups table it appears that the alpha diversity of sites “Sil” and “Mah” might not
be different, but there is evidence that the diversity in site “Jam” is lower. We can add this information to the
graph using the tukey_result$groups$groups codes:
group_data <- tukey_result$groups[order(rownames(tukey_result$groups)),]
ggplot(sample_data, aes(x = Site, y = alpha)) +
  geom_text(data = data.frame(),
            aes(x = rownames(group_data), y = max(sample_data$alpha) + 1, label = group_data$groups),
            col = 'black',
            size = 10) +
  geom_boxplot() +
  ggtitle("Alpha diversity") +
  xlab("Site") +
  ylab("Alpha diversity index")
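The ANOVA and pairwise comparison steps can also be tried on a small invented data set using only base R. This sketch swaps in TukeyHSD from the stats package for agricolae's HSD.test. It reports pairwise differences with adjusted p-values rather than grouping letters. The toy data frame and its values are made up for illustration.

```r
# Invented alpha diversity values for three hypothetical sites
toy <- data.frame(
  site  = factor(rep(c("A", "B", "C"), each = 5)),
  alpha = c(1, 2, 3, 4, 5,             # site A
            1.5, 2.5, 3.5, 4.5, 5.5,   # site B, nearly the same as A
            11, 12, 13, 14, 15)        # site C, clearly higher
)

toy_anova <- aov(alpha ~ site, data = toy)   # is at least one mean different?
toy_tukey <- TukeyHSD(toy_anova, "site")     # which pairs of means differ?

print(summary(toy_anova))
print(toy_tukey)

# Adjusted p-value for each pair, named like "B-A"
pairwise_p <- toy_tukey$site[, "p adj"]
```

In this toy case the pairs involving site C have small adjusted p-values while the A versus B comparison does not, which mirrors the "a"/"b" grouping codes shown above.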
So that takes care of comparing the alpha diversity of sites, but there are other interesting groupings we can
compare, such as the genotype and the type of the sample (roots vs leaves). We could do the above all over
with minor modifications, but one of the benefits of using a programming language is that you can create your
own functions to automate repeated tasks. We can generalize what we did above and put it in a function like
so:
compare_alpha <- function(sample_data, grouping_var) {
  # Calculate alpha diversity
  sample_data$alpha <- diversity(obj$data$otu_rarefied[, sample_data$SampleID],
                                 MARGIN = 2,
                                 index = "invsimpson")
  # Do ANOVA
  sample_data$grouping <- sample_data[[grouping_var]] # needed for how `aov` works
  anova_result <- aov(alpha ~ grouping, sample_data)
  # Do Tukey's HSD test
  tukey_result <- HSD.test(anova_result, "grouping", group = TRUE)
  # Plot result
  group_data <- tukey_result$groups[order(rownames(tukey_result$groups)),]
  my_plot <- ggplot(sample_data, aes(x = grouping, y = alpha)) +
    geom_text(data = data.frame(),
              aes(x = rownames(group_data),
                  y = max(sample_data$alpha) + 1,
                  label = group_data$groups),
              col = 'black',
              size = 10) +
    geom_boxplot() +
    ggtitle("Alpha diversity") +
    xlab(grouping_var) +
    ylab("Alpha diversity index")
  # Return plot
  return(my_plot)
}
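The function relies on `[[` to pick a column whose name is stored in a variable. The short sketch below, using an invented data frame, shows why `$` would not work for this purpose.

```r
# Invented sample table for illustration
toy <- data.frame(Type = c("leaf", "root", "leaf"),
                  Site = c("Jam", "Mah", "Sil"),
                  stringsAsFactors = FALSE)

grouping_var <- "Type"

by_name   <- toy[[grouping_var]]  # uses the value of the variable: the "Type" column
by_dollar <- toy$grouping_var     # looks for a column literally named "grouping_var"

print(by_name)
print(is.null(by_dollar))         # no such column exists, so the result is NULL
```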
Using this function, we can plot and compare the alpha diversities by type of sample and genotype:
compare_alpha(sample_data, "Type")
compare_alpha(sample_data, "Genotype")
Looks like there is no difference in the alpha diversity between genotypes, but a large difference between the
diversity of roots and leaves.
The phyloseq package (McMurdie and Holmes 2013) can be used to quickly plot a variety of alpha diversity
indexes per sample using the plot_richness function. First we need to convert the taxmap object to a
phyloseq object, since all of the phyloseq functions expect phyloseq objects.
library(phyloseq)
##
## Attaching package: 'phyloseq'
## The following object is masked from 'package:taxa':
##
##     filter_taxa
ps_obj <- as_phyloseq(obj,
                      otu_table = "otu_rarefied",
                      otu_id_col = "OTU_ID",
                      sample_data = sample_data,
                      sample_id_col = "SampleID")
plot_richness(ps_obj, color = "Type", x = "Site")
Each dot is a sample and the error bars in some of the indexes are the standard error.
NOTE: the as_phyloseq function is only available in the development version of metacoder. If you want to try
it, you can install it by typing:
devtools::install_github("ropensci/taxa")
devtools::install_github("grunwaldlab/metacoder")
or you can use the current version of metacoder, download the file “ps_obj.Rdata” here (ps_obj.Rdata), and load
the ps_obj object this way:
load("ps_obj.Rdata")
Plotting taxon abundance with heat trees

Alpha diversity statistics capture the diversity of whole samples in a single number, but to see the abundance
of each taxon in a group of samples (e.g., root samples), we need to use other techniques. Stacked bar charts
are typically used for this purpose, but we will be using heat trees. First, we need to calculate the abundance of
each taxon for a set of samples from our OTU abundance information. We can use the calc_taxon_abund
function to do this, grouping by a sample characteristic:
obj$data$type_abund <- calc_taxon_abund(obj, "otu_rarefied",
                                        cols = sample_data$SampleID,
                                        groups = sample_data$Type)
## Summing per-taxon counts from 292 columns in 2 groups for 489 taxa
print(obj$data$type_abund)
## # A tibble: 489 x 3
##    taxon_id    leaf    root
##  * <chr>      <dbl>   <dbl>
##  1 aad      271122. 271122.
##  2 aaf      175195. 123773.
##  3 aag       28010.  75115.
##  4 aah       19116.   4137.
##  5 aai       43662.  29404.
##  6 aaj        1144.   1195.
##  7 aak         369.   7350.
##  8 aal         284.   8723.
##  9 aam         213.   3625.
## 10 aan        1548.  12523.
## # ... with 479 more rows
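The grouping step can be pictured with a small base R sketch. It is only an illustration of summing sample columns by group on invented data, not metacoder's implementation. calc_taxon_abund additionally sums the counts of all OTUs that belong to each taxon, which this sketch leaves out.

```r
# Invented per-taxon counts: rows are taxa, columns are samples.
# matrix() fills column by column, so each line of numbers below is one sample.
counts <- matrix(c(1, 10,
                   2, 20,
                   3, 30,
                   4, 40),
                 nrow = 2,
                 dimnames = list(c("taxon_a", "taxon_b"), paste0("s", 1:4)))

# Hypothetical group label for each sample column, in column order
groups <- c("leaf", "root", "leaf", "root")

# For each group, add up the columns that belong to it
group_sums <- sapply(split(seq_len(ncol(counts)), groups),
                     function(i) rowSums(counts[, i, drop = FALSE]))

print(group_sums)
```

The result has one row per taxon and one column per group, the same shape as the type_abund table printed above.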
Now we can use these taxon counts to make heat trees of the primary taxa present in leaves and roots:
set.seed(2)
obj %>%
  taxa::filter_taxa(leaf > 50) %>% # taxa:: needed because of phyloseq::filter_taxa
  heat_tree(node_label = taxon_names,
            node_size = leaf,
            node_color = leaf,
            layout = "da", initial_layout = "re",
            title = "Taxa in leaves")
set.seed(3)
obj %>%
  taxa::filter_taxa(root > 50) %>% # taxa:: needed because of phyloseq::filter_taxa
  heat_tree(node_label = taxon_names,
            node_size = root,
            node_color = root,
            title = "Taxa in roots")
Note that we needed to qualify filter_taxa with taxa:: . This is because phyloseq is loaded and it also has
a function called filter_taxa .
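The same kind of name clash can be reproduced in base R. In this sketch we deliberately define our own function with the same name as stats::median, purely for illustration, and use the `::` qualifier to reach the original.

```r
# Define a function that shares its name with stats::median (illustration only)
median <- function(x) "this is our own median, not the one from stats"

unqualified <- median(c(1, 5, 3))         # finds our definition first
qualified   <- stats::median(c(1, 5, 3))  # always the version from the stats package

print(unqualified)
print(qualified)

rm(median)                                # remove our definition to undo the masking
```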
Beta (between sample) diversity

Beta diversity is a way to quantify the difference between two communities. There are many metrics that are
used for this, but we will only mention a few of the more popular ones. A few also incorporate phylogenetic
relatedness and require a phylogenetic tree of the organisms in either community to be calculated.
Examples of indexes used with presence/absence data:

Sørensen: two times the number of species common to both communities divided by the sum of the
number of species in each community.
Jaccard: the number of species common to both communities divided by the number of species in
either community.
Unifrac: The fraction of the phylogenetic tree branch lengths shared by the two communities.
Examples of indexes used with count data:
Bray–Curtis: Two times the sum of lesser counts for species present in both communities divided by the sum of all
counts in both communities. This can be thought of as a quantitative version of the Sørensen index.
Weighted Unifrac: The fraction of the phylogenetic tree branch lengths shared by the two communities,
weighted by the counts of organisms, so more abundant organisms have a greater influence.
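The non-phylogenetic indexes above are simple enough to compute by hand. The following base R sketch does so for two invented communities. All names and counts are made up. The indexes are written as similarities, as in the descriptions above, whereas vegdist reports dissimilarities (one minus the similarity in the case of Bray–Curtis).

```r
# Invented counts of four species in two communities
comm_a <- c(sp1 = 10, sp2 = 0, sp3 = 5, sp4 = 3)
comm_b <- c(sp1 = 4,  sp2 = 6, sp3 = 0, sp4 = 3)

# Presence/absence versions of the data
present_a <- comm_a > 0
present_b <- comm_b > 0
shared <- sum(present_a & present_b)   # species found in both communities
either <- sum(present_a | present_b)   # species found in at least one community

sorensen <- 2 * shared / (sum(present_a) + sum(present_b))
jaccard  <- shared / either

# Count-based: the "lesser count" is the smaller of the two counts for each species
bray_sim    <- 2 * sum(pmin(comm_a, comm_b)) / (sum(comm_a) + sum(comm_b))
bray_dissim <- 1 - bray_sim

print(c(sorensen = sorensen, jaccard = jaccard,
        bray_sim = bray_sim, bray_dissim = bray_dissim))
```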
The vegan function vegdist is used to calculate the pairwise beta diversity indexes for a set of samples.
Since this is a pairwise comparison, the output is a triangular matrix (stored as a "dist" object). In R, a matrix is
like a data.frame , but all of its values are of the same type (e.g. all numeric), and it has some different behavior.
beta_dist <- vegdist(t(obj$data$otu_rarefied[, sample_data$SampleID]),
                     method = "bray") # vegdist selects the index with `method`, not `index`
Since vegdist does not have a MARGIN option like diversity , we need to transpose
(00--glossary.html#transpose_anchor) the matrix with the t function.
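The effect of transposing can be checked on a small invented table. This sketch uses base R's dist function (Euclidean distance) as a stand-in for vegdist, since both compare the rows of their input. The table below is made up.

```r
# Invented OTU table: 4 OTUs in rows, 3 samples in columns.
# matrix() fills column by column, so each line of numbers below is one sample.
otu_counts <- matrix(c(10, 0, 5, 1,
                        8, 1, 6, 1,
                        0, 9, 2, 4),
                     nrow = 4,
                     dimnames = list(paste0("otu_", 1:4),
                                     c("sample_a", "sample_b", "sample_c")))

between_otus    <- dist(otu_counts)     # compares rows as given: OTU vs OTU
between_samples <- dist(t(otu_counts))  # after transposing: sample vs sample

print(between_samples)                  # printed as a lower triangle
print(as.matrix(between_samples))       # the same values as a full square matrix
```

Without the call to t, the result would describe how OTUs differ from each other rather than how samples differ.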
Ordination
The typical way beta diversity is plotted is using ordination. Ordination is a way to display “high dimensional”
data in a number of dimensions that can be viewed (2 to 3). Our data is “high dimensional” because we have many samples
with many species and each species can be considered a “dimension”. If we had only two species, we could make a
scatter plot of their abundance in each sample and get an idea of how the samples differ. With thousands of
species, this is not possible. Instead, ordination is used to try to capture the information in many dimensions
in a smaller number of new “artificial” dimensions.
mds <- metaMDS(beta_dist)
## Run 0 stress 0.1135986
## Run 1 stress 0.1171675
## Run 2 stress 0.1158758
## Run 3 stress 0.1159449
## Run 4 stress 0.1143069
## Run 5 stress 0.1143073
## Run 6 stress 0.1141812
## Run 7 stress 0.1162369
## Run 8 stress 0.1133639
## ... New best solution
## ... Procrustes: rmse 0.00602089 max resid 0.06356699
## Run 9 stress 0.1151679
## Run 10 stress 0.1163855
## Run 11 stress 0.1154791
## Run 12 stress 0.1146687
## Run 13 stress 0.1442563
## Run 14 stress 0.1168622
## Run 15 stress 0.1141063
## Run 16 stress 0.1314841
## Run 17 stress 0.1134438
## ... Procrustes: rmse 0.00165907 max resid 0.02791924
## Run 18 stress 0.1155482
## Run 19 stress 0.1164316
## Run 20 stress 0.1143947
## *** No convergence -- monoMDS stopping criteria:
## 19: stress ratio > sratmax
## 1: scale factor of the gradient < sfgrmin
That transformed our beta diversity matrix into a set of coordinates in two dimensions, which attempt to
capture the differences in the data. However, it is in a format specific to vegan , so we will have to convert the
data to a form that we can use for plotting.
mds_data <- as.data.frame(mds$points)
To use the sample data in the plotting, we can combine the coordinate data with the sample data table:
mds_data$SampleID <- rownames(mds_data)
mds_data <- dplyr::left_join(mds_data, sample_data)
## Joining, by = "SampleID"
Now that we have the data in a format ggplot2 likes, we can plot it. Let's plot our two new dimensions and color
them by sample type (i.e. leaves vs roots).
library(ggplot2)
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Type)) +
  geom_point()
This shows that leaf and root samples are quite distinct, as we would expect. We can also color them by Site:
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Site)) +
  geom_point()
It appears that within the leaf and root clusters, we see “sub-clusters” corresponding to site. Finally, let's look at
genotype:
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Genotype)) +
  geom_point()
There is no discernible pattern there, suggesting plant genotype does not correspond to community structure.
We can also do the above quickly in phyloseq using the ordinate and plot_ordination functions. Let's look
at the differences between leaf and root samples again, but using a different index this time.
ps_ord <- ordinate(ps_obj, method = "NMDS", distance = "jsd")
## Run 0 stress 0.1075255
## Run 1 stress 0.1090522
## Run 2 stress 0.109622
## Run 3 stress 0.1164171
## Run 4 stress 0.108985
## Run 5 stress 0.1075254
## ... New best solution
## ... Procrustes: rmse 6.336693e-05 max resid 0.0006860886
## ... Similar to previous best
## Run 6 stress 0.1101178
## Run 7 stress 0.1090525
## Run 8 stress 0.1108608
## Run 9 stress 0.1082951
## Run 10 stress 0.1084189
## Run 11 stress 0.1084414
## Run 12 stress 0.1272716
## Run 13 stress 0.1110661
## Run 14 stress 0.1107109
## Run 15 stress 0.1082951
## Run 16 stress 0.1084411
## Run 17 stress 0.1104917
## Run 18 stress 0.1084424
## Run 19 stress 0.1117763
## Run 20 stress 0.109174
## *** Solution reached
plot_ordination(ps_obj, ps_ord, type = "samples", color = "Type")
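To build intuition for what an ordination does, the sketch below reduces a tiny invented data set to two dimensions with base R's cmdscale. Note that this is classical (metric) MDS, a simpler relative of the non-metric MDS used by metaMDS and ordinate above, and that Euclidean distance is used as a stand-in for a beta diversity index. All sample names and counts are made up.

```r
# Four invented samples measured on three species: two leaf-like, two root-like
counts <- rbind(leaf_1 = c(20, 5, 0),
                leaf_2 = c(18, 7, 1),
                root_1 = c(2, 6, 25),
                root_2 = c(1, 9, 22))

d      <- dist(counts)         # pairwise distances between samples (rows)
coords <- cmdscale(d, k = 2)   # classical MDS: two new "artificial" dimensions
colnames(coords) <- c("axis_1", "axis_2")

print(round(coords, 2))

# Distances between samples measured in the reduced two-dimensional space
reduced <- as.matrix(dist(coords))
print(round(reduced, 2))
```

Samples that were similar in the original three dimensions (the two leaf-like samples) remain close together in the two new dimensions, which is the property that makes ordination plots interpretable.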
Differential heat trees

Beta diversity summarizes the difference between two samples in a single number, but does not tell you why
the samples are different. We developed a plotting technique we call “differential heat trees” to display
differences in abundance of each taxon in two samples or groups of samples. These are like the heat trees
shown in the plotting section, but colored with a diverging color scale that indicates which sample each taxon
is more abundant in. In this way, it is similar to heat maps used for the same purpose and in gene expression
studies.
Comparing taxon abundance in two groups

Before we can make a differential heat tree, we need to calculate the difference in abundance for each taxon
between groups of samples, such as root vs leaf samples. There are many ways this can be done, ranging from
simple differences in mean read counts, to outputs from specialized programs designed for microbiome data.
We will be using a function in metacoder called compare_groups to do the comparisons. First however, we
need to calculate the per-taxon abundance from our OTU counts. We can use the calc_taxon_abund function
to do this:
obj$data$tax_abund <- calc_taxon_abund(obj, "otu_rarefied",
                                       cols = sample_data$SampleID)
## Summing per-taxon counts from 292 columns for 489 taxa
For each taxon, at every rank, the compare_groups function compares two groups of counts. We have to define
which sample belongs to which group using the groups option:
obj$data$diff_table <- compare_groups(obj, dataset = "tax_abund",
                                      groups = sample_data$Type)
print(obj$data$diff_table)
## # A tibble: 489 x 7
##    taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
##    <chr>    <chr>       <chr>                   <dbl>       <dbl>     <dbl>          <dbl>
##  1 aad      leaf        root                       0.          0.        0.            NaN
##  2 aaf      leaf        root                    0.446        312.      352.       2.51e-17
##  3 aag      leaf        root                    -1.22       -266.     -323.       9.87e-43
##  4 aah      leaf        root                     2.00        46.5      103.       3.86e-18
##  5 aai      leaf        root                    0.440        69.5      97.7       1.03e- 2
##  6 aaj      leaf        root                    -1.81       -5.00    -0.349       2.88e- 8
##  7 aak      leaf        root                     -Inf       -50.0     -47.8       4.36e-49
##  8 aal      leaf        root                     -Inf       -54.0     -57.8       2.71e-51
##  9 aam      leaf        root                     -Inf       -24.0     -23.4       2.42e-49
## 10 aan      leaf        root                    -4.82       -81.5     -75.2       1.04e-44
## # ... with 479 more rows
By default, the Wilcoxon rank-sum test (00--glossary.html#wilcoxon_rank_sum_test_anchor) is used to test for
the statistical significance of differences, and various summary statistics are reported to measure the size of the
difference. Since we have done many independent tests (one for each taxon), we need to correct for multiple
comparisons (00--glossary.html#multiple_comparison_corrections_anchor). We will do that with a false
discovery rate (FDR) correction, but other types can be used as well.
obj <- mutate_obs(obj, "diff_table",
                  wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
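The two statistical steps used here, one Wilcoxon rank-sum test per taxon followed by an FDR correction, can be tried in base R on invented numbers. The sketch below is only meant to show the idea and does not reproduce what compare_groups does internally. The taxon names and abundances are made up.

```r
# Invented abundances of three taxa in five leaf and five root samples
leaf <- list(taxon_a = c(10, 12, 11, 13, 14),
             taxon_b = c(5, 7, 6, 9, 8),
             taxon_c = c(3, 1, 4, 2, 6))
root <- list(taxon_a = c(1, 2, 3, 2.5, 4),
             taxon_b = c(5.5, 6.5, 7.5, 8.5, 4.5),
             taxon_c = c(2.5, 3.5, 1.5, 5, 7))

# One rank-sum test per taxon, then one correction over all of the p-values
raw_p <- mapply(function(x, y) wilcox.test(x, y)$p.value, leaf, root)
adj_p <- p.adjust(raw_p, method = "fdr")
print(rbind(raw_p, adj_p))

# A hand-checkable example: each sorted p-value is multiplied by n / rank,
# then capped so that the adjusted values never decrease with rank
example_p   <- c(0.001, 0.01, 0.03, 0.04, 0.5)
example_adj <- p.adjust(example_p, method = "fdr")
print(example_adj)   # 0.005 0.025 0.050 0.050 0.500
```

Adjusted p-values are never smaller than the raw ones, so fewer taxa pass a 0.05 threshold after correction.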
The most useful statistic for plotting is the log of the ratio of median abundances in the two groups, since it is
centered on 0 and is symmetric (e.g., a value of -4 is the same magnitude as 4). Let's set any differences that
are not significant to 0 so all differences shown on the plot are significant.
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0
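A few lines of base R show why the log ratio is symmetric and how the masking step works. The medians and p-values below are invented for illustration.

```r
# The same four-fold difference in either direction gives the same magnitude
up   <- log2(8 / 2)    #  2: four times higher in the first group
down <- log2(2 / 8)    # -2: four times higher in the second group

# A median of zero in the first group gives -Inf, as seen in the table above
none_in_first <- log2(0 / 5)

# Invented ratios and adjusted p-values for three taxa
ratio <- c(taxon_a = up, taxon_b = down, taxon_c = log2(6 / 5))
adj_p <- c(0.001, 0.02, 0.4)

ratio[adj_p > 0.05] <- 0   # hide differences that are not significant
print(ratio)
```

After masking, only taxon_c is set to zero, so it would be drawn in the neutral middle color of a diverging scale.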
Now we have all the info needed to make a differential heat tree. The standard heat_tree function can be
used, but a few things need to be done to make an effective differential heat tree:
A diverging color scale should be used, preferably with a neutral color in the middle, like gray.
The interval of values displayed must be symmetric around zero so the neutral middle color is centered
on zero.
The node color should be set to the log of the ratio of median abundances in the two groups.
set.seed(1)
heat_tree(obj,
          node_label = taxon_names,
          node_size = n_obs, # number of OTUs
          node_color = log2_median_ratio, # difference between groups
          node_color_interval = c(-10, 10), # symmetric interval
          node_color_range = c("cyan", "gray", "magenta"), # diverging colors
          node_size_axis_label = "OTU count",
          node_color_axis_label = "Log 2 ratio of median counts")
That’s not too bad looking, but we can tweak it a bit to make it better by changing the layout, adding a title, and
filtering out some of the taxa with odd names (depending on what the plot is for, you might not want to remove
these). Here mutate_obs is used to add a temporary variable to the taxmap object that will contain the taxon
names with special characters like [ removed.
set.seed(1)
obj %>%
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  taxa::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
  heat_tree(node_label = cleaned_names,
            node_size = n_obs, # number of OTUs, as in the previous plot
            node_color = log2_median_ratio, # difference between groups, as in the previous plot
            node_color_interval = c(-10, 10), # symmetric interval
            node_color_range = c("cyan", "gray", "magenta"), # diverging colors
            node_size_axis_label = "OTU count",
            node_color_axis_label = "Log 2 ratio of median counts",
            layout = "da", initial_layout = "re", # good layout for large trees
            title = "leaf vs root samples")
## Adding a new "character" vector of length 489.
What color corresponds to each group depends on the order they were given in the compare_groups function.
Since “leaf” is “treatment_1” in the “diff_table”, and “log2_median_ratio” is defined as “log2(treatment_1 /
treatment_2)”, when a taxon has more counts in leaf samples, the ratio is positive, therefore taxa more
abundant in leaves are colored magenta in this case.
Comparing taxon abundance with more than 2 groups
For pair-wise comparisons with more than two groups we have developed a graphing technique we call a heat
tree matrix, in which trees are made for all pairs of sample groupings and arranged in a matrix with a larger key
tree that has the taxon names and a legend. The code to do this is similar to the code for making a single
differential heat tree like we did above, but uses the heat_tree_matrix function. First we need to use
compare_groups to generate data for all pair-wise comparisons for a grouping with more than two treatments.
The code below compares all the sites used in this study to each other:
obj$data$diff_table <- compare_groups(obj, dataset = "tax_abund",
                                      groups = sample_data$Site)
print(obj$data$diff_table)
## # A tibble: 1,467 x 7
##    taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
##    <chr>    <chr>       <chr>                   <dbl>       <dbl>     <dbl>          <dbl>
##  1 aad      Mah         Jam                        0.          0.        0.            NaN
##  2 aaf      Mah         Jam                     0.186        114.     -174.       8.01e- 1
##  3 aag      Mah         Jam                    -0.107       -27.5     -58.4       7.99e- 1
##  4 aah      Mah         Jam                     0.752        13.0      13.5       2.64e- 1
##  5 aai      Mah         Jam                     0.867        103.      197.       2.09e-13
##  6 aaj      Mah         Jam                     0.322        1.00     0.622       4.17e- 1
##  7 aak      Mah         Jam                    -0.112       -1.50     -2.60       9.68e- 1
##  8 aal      Mah         Jam                     0.307        4.50      9.90       1.32e- 1
##  9 aam      Mah         Jam                    -0.900       -6.50     -3.78       7.18e- 2
## 10 aan      Mah         Jam                     0.186        4.00      12.3       6.23e- 3
## # ... with 1,457 more rows
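The row count of this table follows directly from the number of site pairs. A quick base R check, using the three site codes from this data set, is sketched below. The order in which combn lists the pairs is not necessarily the order compare_groups uses.

```r
sites <- c("Jam", "Mah", "Sil")

site_pairs <- combn(sites, 2)   # every unordered pair of sites, one pair per column
n_pairs    <- ncol(site_pairs)  # same as choose(length(sites), 2)

n_taxa <- 489                   # number of taxa reported earlier in this section
n_rows <- n_taxa * n_pairs      # one row per taxon per pair of sites

print(site_pairs)
print(n_rows)
```

With three sites there are three pairs, so 489 taxa give the 1,467 rows shown above. The number of pairs grows quickly with the number of groups, which is what motivates arranging the trees in a matrix.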
We then need to correct for multiple comparisons and set non-significant differences to zero like we did
before:
obj <- mutate_obs(obj, "diff_table",
                  wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0
Finally we call the heat_tree_matrix command with the same options that would be used for a single tree.
obj %>%
  taxa::filter_taxa(taxon_ranks == "o", supertaxa = TRUE) %>%
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  taxa::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
  heat_tree_matrix(dataset = "diff_table",
                   node_label = cleaned_names,
                   node_size = n_obs, # number of OTUs
                   node_color = log2_median_ratio, # difference between groups
                   node_color_trans = "linear",
                   node_color_interval = c(-3, 3), # symmetric interval
                   edge_color_interval = c(-3, 3), # symmetric interval
                   node_color_range = diverging_palette(), # diverging colors
                   node_color_axis_label = "Log 2 ratio of median counts",
                   key_size = 0.67,
                   seed = 2)
## Adding a new "character" vector of length 247.
We have only compared three groups here, due to the nature of this dataset, so this technique is not much
better than 3 separate, full sized graphs in this case, but with more groups it can be a uniquely effective way to
show lots of comparisons. See the end of the example analysis (02--quick_example.html) for a better example
of this technique.
Exercises
In these exercises, we will be using the ps_obj and obj from the analysis above. If you did not run the code
above or had problems, run the following code to get the objects used. You can download the
“diversity_data.Rdata” file here (diversity_data.Rdata).
load("diversity_data.Rdata")
1a) Look at the documentation for the plot_richness function from phyloseq . Try to make a plot using only
the Simpson and inverse Simpson indexes, colored by site and split up by sample type (leaf vs root).
1b) The Simpson and inverse Simpson indexes display the same information in different ways (if you know
one, you can calculate the other). How do they differ?
2) Rarefaction and converting counts to proportions are two ways of accounting for unequal sample depth.
Although proportions are more intuitive and easier to understand, why might rarefaction be better when
calculating diversity indexes?
3a) Using the techniques presented in the section on plotting the abundance of taxa, make a plot of taxon
abundance for the site encoded “Jam”.
3b) What are some of the most abundant families in the site “Jam”?
3c) What is the most abundant phylum in the site “Jam”?
4a) The ordination analysis colored by genotype showed no community differences corresponding to
genotype. Try to use compare_groups to see if any taxa are significantly differentially abundant between
genotypes to verify this result.
4b) How many taxa have significant differences between genotypes?
References
McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis
and Graphics of Microbiome Census Data.” PloS One 8 (4). Public Library of Science: e61217.
https://doi.org/10.1371/journal.pone.0061217.
Whittaker, Robert Harding. 1960. “Vegetation of the Siskiyou Mountains, Oregon and California.” Ecological
Monographs 30 (3). Wiley Online Library: 279–338.
Analysis of Microbiome Community Data in R by The Grunwald lab and the USDA Horticultural Crops Research Unit
(http://grunwaldlab.cgrb.oregonstate.edu/) is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
(http://creativecommons.org/licenses/by-sa/4.0/).
Based on a work at https://github.com/grunwaldlab/analysis_of_microbiome_community_data_in_r.
