Professional Documents
Culture Documents
Learning Objectives of Module: Fancy Statistical Methods
Learning Objectives of Module: Fancy Statistical Methods
Module bioinformatics.ca
disclaimer
i am not even going
to pretend
that we will make it through
all of these slides
Module bioinformatics.ca
• No!
– Well, we can still explore the data
Module bioinformatics.ca
Fun things that break statistics
• Weird distribution of observations (lots of 0s. So many
0s)
Module bioinformatics.ca
Unsupervised -
Clustering and Correlation
Module bioinformatics.ca
Supervised –
Classification and Regression
Classification –
Distinguish 2 or more
discrete classes
based on underlying
features
Regression –
Predict quantitative
values based on 1 or
more features
Module bioinformatics.ca
Compositionality
10
Module bioinformatics.ca
N = 10 N = 30
Module bioinformatics.ca
N = 10 N = 30
Module bioinformatics.ca
N = 10 N = 30
Module bioinformatics.ca
14
Module bioinformatics.ca
So what?
Module bioinformatics.ca
Module bioinformatics.ca
Let’s talk about…
Module bioinformatics.ca
18
Module bioinformatics.ca
ANOVA
• Parametric!
– A good thing
– Also a bad thing
• ANOVA can tell you *if* a difference exists, but not *where* - post-
hoc tests required!!
Module bioinformatics.ca
Kruskal-Wallace
• The nonparametric answer to ANOVA
Module bioinformatics.ca
perMANOVA
• Use permutations to simulate a null distribution
21
Module bioinformatics.ca
Differential abundance
• What features (ASVs, OTUs, species, pathways, etc) are
different between two or more samples?
22
Module bioinformatics.ca
Linear discriminant analysis for Effect Size
(LEfSE)
• The big idea: find groupings that distinguish two
categories of samples, based on effect size
Module bioinformatics.ca
Module bioinformatics.ca
LEfSE step 2 – differences between
(functional, phylogenetic) subclasses
unpaired Wilcoxon rank-sum test
Module bioinformatics.ca
Module bioinformatics.ca
Comparing two populations of mice Transcription
factor; loss
linked to colon
cancer
27
Module bioinformatics.ca
28
Module bioinformatics.ca
Log-ratio transformations
• Divide the values in each sample vector by some
quantity, and take the logarithm
• Divide by what?
– Some magic invariant feature (?): the additive log ratio
– The geometric mean: centred log ratio
29
Module bioinformatics.ca
Module bioinformatics.ca
Correlations
Goal: Infer different types of ecological interaction by
examining shared abundance patterns among taxa in all
samples in a study
Taxon 2
Taxon 1
31
Weiss et al. (2016) ISME J PMID 26905627
Module bioinformatics.ca
Correlation networks
32
Module bioinformatics.ca
Edges in the network
• To find the Pearson correlation between taxa or functions
X and Y, line up the corresponding samples and apply this
formula:
Covariance
#$%(', )) /
!= .!
+, +-
Module bioinformatics.ca
34
Friedman and Alm (2012) PLoS Comp Biol (PMID 23028285)
Module bioinformatics.ca
Pearson Pearson SparCC
(real data) (fake data)
35
Module bioinformatics.ca
Correlation network of
OTUs, size
proportional to
abundance
36
Module bioinformatics.ca
Machine Learning – the leap (?)
37
Module bioinformatics.ca
38
Module bioinformatics.ca
Key things to keep in mind
• Holy frig there are a lot of classifiers
39
Module bioinformatics.ca
Generalization
• A sufficiently complex classifier can learn pretty much
anything in your training set
Module bioinformatics.ca
Data set splitting
(holdout method)
Test classifier
Train classifier
41
Module bioinformatics.ca
Cross-validation
etc…
Module bioinformatics.ca
So let’s pick a classifier.
• Support vector machines are based on a simple
principle: try to fit a model that gives the best chance of
generalizing well
43
Module bioinformatics.ca
There’s a line
Here’s a line 44
Module bioinformatics.ca
Define a maximum margin line
(plane, hyperplane)
45
Module bioinformatics.ca
46
Module bioinformatics.ca
Things to know about SVMs
• They can actually handle non-separable problems
Module bioinformatics.ca
48
Module bioinformatics.ca
Data encoding
1. OTUs L
Firmicutes
2. Phylogenetic trees
Actinobacteria
3. Predicted functions using PICRUSt
Proteobacteria Bacteroidetes
49
Module bioinformatics.ca
Accuracy
• OTU: 77-80%
50
Module bioinformatics.ca
What is the limit of classification?
• Probably not 100%
• Other classifiers get about the same accuracy, but on
different subsets of samples
• 10% of samples appear to be hopeless
Both right
SVM right,
SourceTracker
wrong
51
Both wrong
Module bioinformatics.ca
Summary
• Microbial community data make life difficult
– Compositional
– Lots of 0s
– Biased data recovery
– Hierarchical
52
Module bioinformatics.ca