Course 7 - Statistics For Genomic Data Science - Week 4


Genomic Data Science Specialization

Course 7 - Statistics for Genomic Data Science


Week 4 – Module 4

Video: Module 4 Overview (1:21)

Welcome to week four of Statistics for Genomic Data Science, we're in the home stretch now. So
last week, we talked a lot about statistical significance and multiple testing. This week, we're
going to talk about what you do after you get a list of genes or SNPs or other features that you
find interesting: how do you summarize them and communicate them? We're going to be
talking a little bit about things like Gene Set Enrichment Analysis. Basically ways that you
combine all of the statistically significant results in a study with information that we have about
what those genes or SNPs do to try to integrate them in a way that will give you something that
you can communicate biologically about the bigger picture process of what's happening when
you do a statistical analysis on genomic data. We're also going to touch briefly on other
experimental designs that are a little bit more complicated. For example, eQTL studies, where
you combine multiple different genomic measurements, for example, measurements about SNPs
or DNA information with measurements about RNA or gene expression information, and try to
combine those to identify places where SNP variation associates with gene expression
variation. So there are a whole bunch of more complicated genomic experimental designs. We're
going to talk about just a few of them and start to give you a sneak preview of how you can do
integration of multiple data sets in this last week of the class. Good luck with the last week.

Video: Gene Set Enrichment (4:19)

Once you fit a statistical model and you've identified those genes or those features that are
statistically significantly associated with the phenotype you care about after correcting for
multiple testing, you might want to identify if there's some biological pattern to those genes or to
those features that you've identified that are differentially expressed. So again I'm going to go
back to this example where we're trying to predict the response to Lenalidomide in
Myelodysplastic Syndrome. So again we find 47 genes that are differentially expressed at a false
discovery rate of 10%. And so you can see, for example, that there appear to be some genes that
have something in common here near the top of this list of differentially expressed genes, but is
there a way that we can quantify that? So one way that you can do that is
you can take the statistic for every gene that you calculated, and you can order them from largest
to smallest. Alternatively you can order from the smallest p-value to the largest p-value. And so over
here are the most statistically significant associations and over here are the least statistically
significant associations. Then you can take some gene set that you care about and label all the
genes that are in that gene set. In this case, I've made them red. So what you can do is then you
can calculate a running statistic that goes up every time you have a gene in the gene set and goes
down every time you have a gene out of the gene set. And so what you can see is, if all of the
genes that are in the gene set cluster near the most statistically significant values, then you'll see
many more steps up than steps down, and you'll get a high peak here. And so
the statistic here is actually the maximum deviation from zero. That's the gene set enrichment statistic.
This is related to something called the Kolmogorov-Smirnov statistic if you know a little bit
more about advanced statistics.
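
Here is a rough sketch in R of the running enrichment statistic just described. The gene statistics and the gene set are simulated purely for illustration, and the up/down step sizes are one common weighting choice, not necessarily the exact one used in the lecture's figure.

```r
# A rough sketch of the running gene set enrichment statistic described above.
set.seed(1234)
stats <- rnorm(1000)                      # one test statistic per gene (simulated)
names(stats) <- paste0("gene", 1:1000)
gene_set <- sample(names(stats), 50)      # a hypothetical gene set

stats <- sort(stats, decreasing = TRUE)   # order from largest to smallest
in_set <- names(stats) %in% gene_set

# step up for genes in the set, down for genes outside it
# (weights chosen so the running sum ends at zero)
steps <- ifelse(in_set, 1 / sum(in_set), -1 / sum(!in_set))
running <- cumsum(steps)

# the enrichment statistic is the maximum deviation from zero
es <- running[which.max(abs(running))]
plot(running, type = "l", xlab = "Gene rank", ylab = "Running enrichment statistic")
```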

And so the idea here is that we want to identify whether this enrichment is statistically significant: is it
more than we would expect to see by chance? So one way that people do that is they again
permute the sample labels. We've permuted the responders and the non-responders. And now we
get the new set of labels. And so, once we get the new set of labels, we can recalculate the
statistics and reorder them. And so now we see that the genes that belong to the gene set are a
little bit more scattered throughout this profile and so you see that the profile goes down and then
up and then down and then up. It wiggles a little bit more but it doesn't deviate from zero as far
and so there appears to be less of an enrichment of those values. So you can recalculate for
several permutations the value of this gene set statistic, and then you can calculate again a p-value
for each gene set category as to whether the permuted values are more extreme than the
observed value. And so you can calculate a p-value for each of the gene sets and then again do a
false discovery correction and identify gene sets that are associated with those statistically
significant results. So what are the gene sets you can look at? The Gene Ontology Consortium
has a large ontology of gene sets that are based on their function and based on their spatial
location within the cell and so forth. You can also look at molecular signatures that have been
curated. For example this set of molecular signatures that you can get from this MSigDB
database. Or you can look at things like interactions between proteins and then see is there an
enrichment for a particular set of interactions among the genes that you found to be differentially
expressed. Really, any previously defined set of genes that has some function that you care
about can be used for a gene set enrichment analysis. So one thing to keep in mind is this can be
very hard to interpret especially if the categories are broad or vague. So for example, if you get a
category that comes out as transcriptional regulation, that's a very broad category, there's lots of
different subcategories of that. And so if that's enriched, it's not clear how much added value it's
giving you. It's better if you can find specific, concrete categories that are enriched. Here, if
you're not very careful you can tell stories, so again you have to correct for the multiple testing
problem and you have to be very aware of your own implicit biases. This also incurs a second
multiple testing problem on top of the one involved in identifying differentially expressed genes:
now you're testing multiple gene sets, and so you have to account for that as well. This idea can
actually be simplified. The statistic I showed you here, this gene set enrichment statistic, can be
simplified into basically a very simple t-statistic comparing the genes that are in the set to the
genes that are out of the set, and so you can read
about that here in this paper.
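
As a rough illustration of the simplified version mentioned at the end, here is a sketch that compares the statistics of genes in a set to those outside it with a two-sample t-test; all values are simulated for illustration.

```r
# A simplified "in-set vs. out-of-set" comparison: two-sample t-test on the
# per-gene statistics; everything here is simulated.
set.seed(5678)
stats <- rnorm(1000)                               # per-gene test statistics
in_set <- seq_along(stats) %in% sample(1000, 50)   # hypothetical gene set membership
t.test(stats[in_set], stats[!in_set])
```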

Video: More Enrichment (3:59)

Enrichment analysis can be useful for summarizing results that are statistically significant for a
number of different things, not just for gene sets. So here I'm going to talk about an example
where we're looking for enrichment when we compare, for example, two particular sets of results
and see if they're enriched for one another. And so as an example of this we're going to be
looking at SNPs, Single Nucleotide Polymorphisms. And so we're going to be looking at SNPs
that are labeled with one of two different labels for two different analyses. So first we are going
to look at eQTLs, Expression Quantitative Trait Loci. We'll talk about what those are later in the
class, but for now just consider that one analysis has been done and one set of labels. And then a
second is a set of SNPs that have been implicated when genome wide association studies have
been done. So the SNPs that have been implicated in genome wide association studies, and the
SNPs that have been implicated as eQTLs, we want to know if they are enriched with respect to
one another. So one thing that you could do is you could count, for example, the
number of eQTLs that correspond to SNPs that are in a particular set of GWAS hits or GWAS
SNPs. And so here that's that number at a particular p-value cutoff of one times ten to the negative
four. And so then what you could do is you could take another randomly selected subset of SNPs
that aren't GWAS hits and count the number of eQTLs. And if you do that for different random
selections you get this distribution here. And so you can see that this distribution in general
has fewer eQTLs than what you get in the observed set of samples. So again you're
doing a permutation scheme to try to identify if there's an enrichment of eQTL among GWAS
hits. So another thing that you can do is you can make this two by two table. So we count the
number of genes that have an eQTL and are GWAS hits. And then we can count the number of
genes that don't have an eQTL but are GWAS hits. We can also count the number that aren't
GWAS hits but are eQTLs, and the number that aren't GWAS hits and aren't eQTLs. And then we
calculate, are these two things independent of each other? In other
words, is being an eQTL independent of being a GWAS hit? You can do this with Fisher's exact
test or chi-square test. Those are statistical tests that you could use. You could also use the
logistic regression technique that we learned previously in the class. But what people typically
do is they permute the samples or, again, they produce permutations. So, here, they're not
necessarily permuting the samples. They're permuting the set of SNPs that are
associated with GWAS hits, or they resample from the total possible set of SNPs. And so then
they compare the enrichment they've observed to these resampled sets. When they do that, it's actually quite
complicated because they have to account for the fact that, for example, it's much more likely to
get a GWAS hit if you have a higher minor allele frequency, because you'll have more power to
be able to detect it. So they have to do this permutation or this resampling within levels of minor
allele frequency. So this is a common problem that you run into when you're doing this sort of
enrichment along the genome. You find that genomic features are usually clustered together or
they have common properties. And so you have to take those into account when you're doing this
analysis. This is one example of a package that attempts to take into account some of that spatial
structure that's involved in genomic enrichment. But there's also a number of other issues that
you have to deal with. And so getting the null right is really hard in this case. In this case you
want to know, what is the case where we expect to see no enrichment at all, what does that look like? That's very hard to simulate from
given the spatial structure in the genome, given the other things like GC content or the minor
allele frequency that might influence the way that the null looks. And again it's easy to tell stories
if you aren't careful. This incurs a second multiple testing problem because you might be testing
multiple genomic features to see if they're enriched. And every time you do that you're going to
have a new test. So again, to be able to do this really well, you have to be very careful in the very
specific circumstance you're looking for to account for all the potential variability due to
genomic features, genomic structure, and so forth when doing this type of enrichment.
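
To make the two-by-two table idea above concrete, here is a hedged sketch in R with made-up counts; Fisher's exact test and the chi-squared test are the two options mentioned, and note that neither accounts for the minor allele frequency structure or spatial clustering discussed above.

```r
# Made-up counts of genes cross-classified by eQTL status and GWAS-hit status.
tab <- matrix(c(40,   60,      # eQTL: GWAS hit, not a GWAS hit
                200, 2000),    # not an eQTL: GWAS hit, not a GWAS hit
              nrow = 2, byrow = TRUE,
              dimnames = list(eQTL = c("yes", "no"), GWAS_hit = c("yes", "no")))
fisher.test(tab)   # exact test of independence
chisq.test(tab)    # chi-squared alternative
```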

Video: Gene Set Analysis in R (7:43)

After you've done a differential expression analysis, often you want to be able to combine the
results into some interpretable story of what's happened. And so one way to do that is to use gene
set enrichment analysis to identify gene sets that are enriched among the differentially
expressed genes. You could also use enrichment in lots of other ways to identify regions in the
genome that are enriched for different signals. Here I'm going to focus on gene set enrichment for
a differential expression experiment. So, I'm going to set up my parameters as always, and I'm
going to load the packages that we're going to need, in this case the two main drivers are going to
be DESeq2, and goseq. So I'm going to load those packages, and then I'm going to perform an
analysis using this goseq package. So the first thing you might want to look at is which genomes
are supported, in other words, which genomes the package has information on. So, it has these
different databases and the species. And that way you can go and look and see whether the
genome you're using for alignment or analysis, for whatever organism you're using, exists in the
database. Because it's going to need to be able to calculate the length of the genes in the genome
in order to do the appropriate adjustment. You could also look and see
which gene IDs are supported. You can actually build your own supported genome with the
goseq package if you have the data available in the right format. You can sort of reformat and
add to these lists of available cases. So here I'm going to use an example from the goseq package.
I'm going to load that example here. So I'm loading a system file. So basically this is just loading
a data set from this package, from the goseq package. And so I've read that in to R, and so this
data set actually, so if you look at the first row of this data set, you can see that the genes are
labeled within the data set. So we're going to remove that first column. And we're going to label
the genes with that column, but we're going to remove it from the data set. So the first thing that
I do is I remove that column from the data set. Then I'm going to make the row names of our
new expression object to be equal to that first row. So I've made the row names equal to the gene
names.
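
A rough sketch of the setup just described, including the low-expression filter described in the next paragraph. The example count table name (Li_sum.txt) and the filtering cutoff are from memory and are assumptions to check against your own goseq installation.

```r
# Assumed file name; check system.file("extdata", package = "goseq") on your system.
library(goseq)
library(DESeq2)

head(supportedGenomes())    # genomes goseq has gene length data for
head(supportedGeneIDs())    # gene identifier systems it understands

cnts_file <- system.file("extdata", "Li_sum.txt", package = "goseq")
temp_data <- read.table(cnts_file, sep = "\t", header = TRUE, stringsAsFactors = FALSE)

expr <- temp_data[, -1]                # drop the column that holds the gene names
rownames(expr) <- temp_data[, 1]       # and use it as row names instead
expr <- expr[rowMeans(expr) > 5, ]     # filter lowly expressed genes (arbitrary cutoff)
```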

And then I'm going to do some filtering here to remove lowly expressed genes. This is
particularly important in this case because otherwise we'll end up with some genes that can't be
tested. And so, then once you start combining results for genes that can't be tested, it's going to
be harder downstream. So in this case, I'm going to have to set the group variable up. It doesn't
come with the pre-loaded data set. But the first four samples are control samples, and the last
three samples are treated samples. So then I'm going to create a pdata object by just basically
building a data.frame with just that one variable in it. And then I can pass that data to DESeq. So
again I have to build the DESeqDataSetFromMatrix. I have to use that quite complicated long
function to pass it the expression data, the phenotype data and the model we want to fit, in this
case the relationship to group. And then I can use DESeq to identify differentially expressed
genes. So it's going to take a second to fit that model because it's again fitting a relationship
between the mean and the variance with the model fit.
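
Here is a minimal sketch of the DESeq2 step just described, assuming the filtered count table `expr` from the previous chunk and the four-control, three-treated grouping mentioned in the lecture.

```r
# Build the DESeq2 object and fit the model (continuing from `expr` above).
grp   <- factor(rep(c("control", "treated"), times = c(4, 3)))  # 4 controls, 3 treated
pdata <- data.frame(group = grp)

dds <- DESeqDataSetFromMatrix(countData = as.matrix(expr),
                              colData   = pdata,
                              design    = ~ group)
dds <- DESeq(dds)       # fits the mean-variance relationship and the model
de  <- results(dds)     # per-gene statistics, p-values, and adjusted p-values
```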

Okay, so now I have the differentially expressed results. And one of the things that I see in
that differentially expressed results table is I've got a pvalue or an adjusted pvalue for every
single gene. So the first thing I'm going to do is I'm going to find those genes that are
differentially expressed at an FDR level of 0.05, and so I'm going to do that by looking at the p
adjusted values. So these are the adjusted p-values for
false discovery rate. So we're going to look at those that are < 0.05 to identify the genes that are
differentially expressed at that level. Okay, and then I'm going to remove the ones where there
wasn't enough data to check, basically, for differential expression, differentially expressed genes.
And then I'm going to apply row names. So, I have to apply the row names to this data set
because the function requires the row names for the genes to match the genomic data that
we're going to be downloading in just a minute. So, then I'm going to remove the genes where
we can't calculate a statistic. Okay, so now what I am going to do is I am going to basically go
back and identify in this case the hg19 genome. So what I am going to do is I am going to have
to calculate the probability weighting function. And so you can do that by passing it the genes, the
ones that are differentially expressed. You can pass it the genome that it's going to be using, and
then you can tell it that it's going to be using the Ensembl gene identifiers. And, so now we need
to calculate, basically, a set of weights that are calculated basically proportional to the, sort of the
size of the genes when you're doing a gene expression analysis. In RNA sequencing data, the
length of the gene actually matters for the levels of expression which relates to power, which
means you have to adjust for that when you're doing a GO analysis. So then you can use the
goseq package to actually calculate the statistics for enrichment. And so basically you apply the goseq
function to the probability weighting function that you calculated from these genes. And then you
pass it again the genome that you're looking at. And so it's going to go out, and it's going to try
to find the GO annotations from the web. And then it's going to tell you if there are any genes
where it couldn't find the annotations. And so then it's going to calculate the p-values. This is
relatively quick; you can also use a non-parametric version of this test by changing
some of the parameters. But if you do that, it'll be slower because it's basically going to
permute the labels and recalculate.
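
A hedged sketch of the probability weighting function and goseq steps just described, continuing from the DESeq2 results `de` above and assuming Ensembl gene identifiers on hg19, matching the "hg19" and "ensGene" arguments used in the lecture.

```r
# Flag genes DE at FDR 0.05, drop those that couldn't be tested, then compute
# the probability weighting function and run goseq.
genes <- as.integer(de$padj < 0.05)      # 1 = differentially expressed, 0 = not
names(genes) <- rownames(de)
genes <- genes[!is.na(genes)]            # remove genes with no adjusted p-value

pwf <- nullp(genes, "hg19", "ensGene")   # gene-length bias weighting function
go_results <- goseq(pwf, "hg19", "ensGene")
head(go_results)
```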

So if I look at the result of that, it tells me the GO category names, which you can look up on the
Gene Ontology website, the p-value for over-representation, and the p-value for
under-representation. And then it tells you the number of genes that were differentially expressed
in that category, and the number of genes that are in that category overall. And then it tells you what the category is, so
single-organism cellular processes, extracellular region part, and so forth for this differential
expression experiment. Okay, if you want to look at a particular category, you can do that too.
So, here, it's the exact same test I ran before, the goseq test. I pass it the probability weighting
function and the genome that we care about. But now I can change the categories, and say only
test the molecular function categories of gene ontology. So gene ontology is broken up into
biological processes, molecular functions, and cellular locations. And so if you only want to test
the molecular functions, you can actually have goseq go and actually only analyse certain subsets
of the annotation. So this is helpful if you know for example that you only care about the
function of the genes, and you don't really care about where they're located in the cell. And so,
then I look at this, and you can see that now that ontology term is molecular function for every
single one. So that's a brief introduction to how you identify these categories. Remember to keep
in mind that the more specific they are, the more interesting they are. If they're really general
purpose categories, it's sometimes a little bit hard to interpret.
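
For the molecular function restriction just described, goseq's test.cats argument can be used; here is a short sketch, continuing from the probability weighting function `pwf` in the earlier chunk.

```r
# Restrict the enrichment test to the molecular function branch of GO.
go_mf <- goseq(pwf, "hg19", "ensGene", test.cats = c("GO:MF"))
head(go_mf)
```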

Video: The Process for RNA-seq (3:59)

So far I've talked in pretty general terms about statistical modeling strategies and now I want to
talk about a few very specific pipelines. The first one I'm going to talk about are the steps in an
RNA sequencing analysis. So recall that the central dogma of molecular biology says that
information is passed from DNA to RNA to protein. And if we want to measure this RNA we
use the technology RNA sequencing. And so again, this is just review from the genomic
technology class. Imagine we have a fragmented RNA molecule like this. And we want to
capture that RNA molecule using it's poly(A) tail. That's one of the ways in which you can
capture RNA or you can extract the mature RNA.

Then you can reverse transcribe it into complementary DNA, and then you sequence that using a
sequencing machine. Then you can get that sequence, and you can turn it back into the RNA
sequence. And so using that, you can identify the RNA transcripts by using the RNA sequencing
reads. And so there are a number of different steps that come involved once you have these RNA
sequencing reads in analyzing an RNA sequencing data set. So, the first step is you need to align
those reads to the transcriptome, to the genome or transcriptome. This is assuming that a
genome or transcriptome is known. For some organisms that's not necessarily true; here I'm
going to assume that you do have such a genome. And there you can use different software:
HISAT, Rail-RNA, STAR, and TopHat2 all align the reads to the genome, accounting for the potential splicing
between exonic fragments from the genome. Then you need to count the reads that correspond
to a particular gene. You can either count them corresponding to a particular gene, or you
can count them corresponding to a particular transcript. Kallisto is one exception to these
different software that just do basic gene counting. In this case, Kallisto doesn't require
alignment, it does a pseudo-alignment so it does the counting for transcripts relatively quickly.

You can also assemble and quantify these samples rather than just counting the genes and so
here, this software is basically going to try to actually take these reads and estimate what
transcripts actually exist in the sample rather than just counting the annotated genes. And then
once it does that, it's going to try to estimate abundances for each of the different transcripts. So,
StringTie and Cufflinks do this in the case where there's a reference genome that's known. If
there isn't a reference genome that's known you can use Trinity. And RSEM will take a
transcriptome and then calculate the abundance for each transcript. Then the next step is you
need to normalize the data just like in any kind of genomic data, you do the normalization and
preprocessing. EDAseq and cqn are R packages, or bioconductor packages that you can use to
normalize for GC content. DESeq2 and edgeR are actually differential expression packages that
have some normalization built in, as do Ballgown and derfinder. Ballgown is a
backend for the Cufflinks and RSEM pipelines, and derfinder does single-base-resolution
differential expression analysis; both of these have their own built-in normalization. To remove
batch effects you can use the sva package and the svaseq function, or RUVSeq, to remove batch effects
in these RNA sequencing data. Then you need to perform statistical tests and statistical
modeling. You can do that with edgeR and DESeq2 for the case where you have count
data. In the case where you have transcript quantification, you can use the Ballgown package as a
back-end to RSEM or as a back-end to Cufflinks. And then if you want to do single-base
resolution analysis, you can use the derfinder package to perform statistical tests or statistical
modeling of the RNA sequencing data. Finally, you need to do some sort of gene set enrichment
analysis to identify gene sets or categories that are enriched within the sets of genes that are
differentially expressed. You can do that with the goseq or the SeqGSEA packages in
bioconductor.

So that's a little bit of a quick run down of the different steps that you can do in an RNA
sequencing analysis and what are the software that can be used to do that analysis.

Video: The Process for Chip-Seq (5:25)

This video is about the process for a ChIP-seq analysis. ChIP-seq can be used for a couple of
different things, but what we're going to be using it for, is actually measuring the way in which
proteins interact with DNA. So if you remember the central dogma of molecular biology,
information flows from DNA to RNA to protein. But you can actually regulate the amount of
transcription in many different ways, but one of those is by binding of proteins to the DNA. This
is the case for transcription factors for example, that might regulate the expression of particular
genes. And so you can actually adapt the next generation sequencing technology to measure how
much particular proteins are bound to particular locations on the DNA.

So the first step in this process is to cross-link proteins to DNA, as you'll recall from the
Genomic Technologies class, and then fragment the DNA. And then if you have an antibody
for that particular protein that you've attached to the DNA, or cross-linked to the DNA, you can
then do an antibody pulldown. So this basically enriches for a particular subset of the DNA
that's associated with the proteins that you're interested in, whatever they might be, whether
they are transcription factors or something else. And so then we pull down just
those fragments, and then we sequence them. And then we're going to be looking at basically
how do we analyze the data that come from these sequences, that come from this pull down
experiment. So the first step in this process, just like the first step often in a next generation
sequencing experiment, is to align. And in this case, unlike with something like RNA-seq, you
don't necessarily need to worry about doing anything other than a sort of straight-ahead alignment to
the genome. And so the popular software for doing this are aligners like Bowtie2 and BWA,
which are very fast aligners to the genome. The next step in the process is to detect peaks,
basically to identify, again we've enriched particular sequences because they've been pulled
down by proteins. And so we want to be able to identify where there are peaks, or where there
are big piles of reads corresponding to those sequences. And so there are a couple of different
software for doing this; CisGenome, MACS, and PICS are a couple of the popular ones. PICS
is in Bioconductor, and CisGenome and MACS are both sort of standalone software. And they
can be used to basically detect where those peaks are in the sequence. And then the next step
is to count, basically to try to obtain a measurement for the number of reads that cover a
particular peak. Now, there is a question as to how quantitative the ChIP-seq technology is, in
terms of how much binding there is. But it is useful to have the quantification of how
many reads fall into each of the different peaks. The next step is normalization, especially
cross-sample normalization, and it's actually relatively recently that this has been heavily
introduced into these sorts of pipelines. Until relatively recently, many
ChIP-seq experiments didn't have a large number of replicates, but they're definitely increasing
over time. And so some of the ideas that have been used in RNA sequencing analysis and other
places have been moved over into the ChIP-seq world. And so, it's now common to apply some
kind of normalization, whether that's MA normalization as I've shown in this figure here, or
some other kind. And so the DiffBind package in Bioconductor and the MAnorm package are
two approaches that use various different types of normalization to make the
peak counts comparable to each other. And then you need to do some sort of statistical tests. And
so this is basically to identify whether there are any differences between the cases that you care
about. Usually you do this in a comparative experiment with ChIP-seq, and when you do that
there are different tests that people use. Some people use sort of a binomial test.

Or you can use other types of tests. And so these are the software packages: CisGenome, a
sort of suite, covers the whole process, and so does MACS; these typically focus on the two-sample
case. DiffBind is a package that can analyze multi-sample ChIP-seq data, or ChIP-seq with
outcomes that aren't necessarily just two groups.

And then the next step in the ChIP-seq analysis: so we've identified these regions in the genome
where the particular transcription factor has bound. And so the next step is
we might want to understand what are the sequence motifs underlying that particular transcription
factor. And so there's several sets of software that will allow you to annotate the sequences that
have been sort of bound by the transcription factor, and try to identify their transcription factor
binding sites, or their sequence motifs that are associated with that particular protein. And then
downstream from that, it's often the case that we want to look for further annotation that we
can do with these sequences. Like, are they commonly enriched in particular genomic motifs?
Also, people will try to build a network model that relates the transcription factor binding
back to gene expression. That's a little bit beyond the scope here, but these software will kind of
get you started on annotating, and pointing you toward the direction of analyzing these ChIP-seq
data after you've done the actual statistical analysis.

Video: The Process for DNA Methylation (5:03)

DNA methylation is one of many epigenetic marks that are often measured with high throughput
technology, such as next-generation sequencing or microarrays. So we're going to talk a little bit
about the analytical pipeline. So the first thing is, what is DNA methylation? Well, DNA
methylation refers to a particular methyl group binding to CpG sites in the genome. A CpG site
is a C followed by a G, so you can see here is a CpG. And here's another CpG, and another CpG,
and those are bound by this molecule here we've shown in cartoons with this methyl group. And
so the idea is where can we identify those places where that binding is occurring? And so there
are multiple technologies using next generation sequencing and microarrays to measure this, and
so I'm going to talk about two of those. One is bisulfite conversions followed by sequencing. So
the idea here is you split the DNA into two identical samples. So you basically randomly create
two samples from the same set of DNA. And then you bisulfite convert one of the two samples,
and so that basically converts cytosines that are not methylated into uracils. And so then you
can align these back to the genome once they've been sequenced, and you have to account for the
fact that you've had this conversion happen, so this makes the alignment process a little
bit more tricky. But once you do that you can compare the converted samples to the not
converted samples and see do you see more of one or the other as a measure of how much DNA
methylation is at that location. Another way that this is often measured is actually through
Illumina methylation arrays or other types of methylation arrays. And so the idea here is you,
similarly, do this sort of bisulfite conversion step, but then you do hybridization to a microarray.
And then after hybridization to a microarray, you have both the probe that corresponds to the
case where you have an unmethylated probe and a probe for the methylated case, and you
measure the intensity of both of those. And then you use that same information to try to estimate how much
methylation is happening at a particular locus. So the first step in either of these processes,
whether you use bisulfite sequencing or DNA methylation arrays, is to normalize the samples.
And so you have to process a couple of different things. First, you want to be able to detect
whether there was methylation that was at that locus. You have to compare the methylated DNA
to the unmethylated DNA, whether that's through the bisulfite conversion comparison with the
sequence samples, or with the hybridization signal. And so there's a couple of packages that you
can use to do this, the charm package and the minfi package. The minfi package specifically
deals with both bisulfite sequencing and microarrays, and charm focuses more exclusively on the
microarray version. Now, I'm talking about microarrays here although we primarily talk about
sequencing throughout the class because still a large number of studies that are performed in
DNA methylation are focused on using microarray technology, especially for large samples. So
the next step is smoothing. And so often you see DNA methylation data that look like this where
you measured it across the genome. So this is genomic location on the x-axis, and this is sort of
a methylation measurement after normalization on the y-axis. And you can see that it sort of
jumps around, and so the idea is you want to find sort of clumps of points like this that are above
a particular level, where they're highly methylated. And so, the way to do that often is to smooth
the data and then identify bumps or regions that are differential, or that are methylated. And so,
you can do that with the charm or bsseq package.

And then you want to do region finding. So once you've done that smoothing, you want to
basically identify regions that are different between groups; in this case it's going to be
different categories of tissue, so it's brain versus liver versus spleen. And you can see that the
smooth curve through each of these is a little bit different, and so you basically have to fit a
statistical model to identify those regions, and then maybe label them with where they are. And
so, you can do that, again, with the charm or bsseq packages depending on what type of data that
you have.

And then the next step is you want to annotate them. So you basically want to be able to annotate
the regions to different components of the genome and in particular, often for DNA methylation,
there are these particular categories that are specific to DNA methylation, including sort of CpG
islands or places where you see

Video: The Process for GWAS/WGS (6:12)

One of the most common applications of sequencing or high throughput technology to genomics
that uses statistics is whole genome sequencing, or what was previously called genome wide
association studies. So the idea here is we're basically directly measuring variability in the DNA.
So we want to identify different types of variants, whether it's the single nucleotide variant or a
deletion or insertion of a particular region in the genome. And so one way that people do this is
with just direct re-sequencing of DNA. And so the idea is you have a DNA molecule, you
fragment that DNA, and then you sequence it and then you look for variability. You basically
look for variations in the genome compared to the standard reference genome and identify if
there's any variability there. Now, more recently people are thinking about whether there should
be more than one reference genome and there are other ways to identify and define variability.
But for the moment the sort of standard pipeline is to identify those DNA variations that are
different from the reference and then quantify how much those variations associate with different
outcomes. And so basically once you've got the genome here and you've got your fragments you
can identify, for example, if many fragments have a C versus a G there, you might say oh this
might be a heterozygote for a particular variant that isn't necessarily the variant that appears in
the reference sequence. And so the idea here is you can also do the same sort of thing with a
microarray, and in fact it was done very often with microarrays, so that's why the genome-wide
association study approach has typically been applied on microarrays. Now it's called whole
genome sequencing, which is sort of the natural extension of that idea to sequencing. But with a
microarray you basically do the similar sort of thing, you do a digestion and then a step where
you compare basically the fragmentation of the DNA. You have these fragmented samples and
then you compare them on probes that probe for the homozygous reference allele, the
homozygous variant and the heterozygote. And then you identify which one it is and so you can
basically use that to genotype samples. This is sort of the technology that at least currently
23andme and other places like that use to very cheaply genotype lots of people.

So the first step is variant identification. If you're using a SNP chip or a
microarray you can use software like the crlmm package in Bioconductor. It'll basically compare
the levels that you have observed for the different genotype probes and then make a genotype
call for each different variant. For variant identification
within sequencing, it's a little bit more complicated, especially for whole genome sequencing,
given the very high number of samples, or high amount of data that's being generated. And so two
very common pipelines that are used for this are FreeBayes and GATK, which often require sort of
heavy computation to be able to identify variants in whole genome sequencing data, which is
often very large. The next thing to take into account, or sort of the standard way you would do
any statistical analysis, is the confounders and one of the most common confounders is
population stratification. So basically, often if you're looking for association between variants
that might cause disease, the most common confounder is that there's population structure, and that that population
structure might also be associated with disease. And so, there's two different ways that you can
address this. There are actually many, but here are two concrete examples: the EIGENSTRAT
software, and the snpStats package in Bioconductor, will do sort of this PCA-based
adjustment for population stratification. So once you have those confounders,
you do a set of statistical tests. Usually this is done SNP by SNP or variant by variant and you
basically test for an association with the outcome. You might do this with a logistic regression
model like we talked about, adjusting for some principal components. And at the end of the day
you basically calculate a p-value for every SNP, and then you often make these
sort of Manhattan plots where you plot the minus log 10 p-value. So this means the smaller
the P-value, the higher the value on this chart since you have the negative there. And so what you
end up getting is sort of plots that look like this where the signals look like sort of spikes along
the P-values like this that go above a threshold and people often use sort of Bonferroni
corrections because there's typically expected to be relatively few signals when doing a disease
association study. And so once you've identified those variants, then you basically can go down
and annotate them and try to determine if they're sort of the causal variant. This is actually very
tricky and a hard thing to do. So what you've done in an association study is basically just
identify variants that are associated with the disease. But you're trying to figure out which one
might cause it if you can, and so people use software like PLINK, and the annotating genomic
variants workflow in Bioconductor, to try to drill down into particular regions and look
and see if we can identify which SNPs are most highly associated, and what's the LD structure
associated with those, what genes they're near or what regions they're in, and sort of identify
the variability genomically in the regions you've identified. Now, actually defining them
as the causal variant is quite another story and it's quite a bit harder to necessarily deal with and
so there's been a whole bunch of software that's been developed to sort of like take steps towards
that. So for example, the CADD and VariantAnnotation software will basically categorize the
variants that you've identified into these various different sorts of things: whether they're a
synonymous or nonsynonymous variation, whether they're in an intronic region or whether they're in a
splice site. By using that information, they can assign a score that says, this is how deleterious
we think this variant might be. Ultimately what you need to do is downstream experiments to
identify the functional variation associated with those genetic variants. Now here, I've talked
about one particular type of application of full genome sequencing, or genome wide association
studies. That's the sort of population based inference of disease-causing variants but there's
obviously many other applications, including family-based studies and rare variant studies,
which we are not going to be talking about here.
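
As a toy illustration of the SNP-by-SNP testing and Manhattan plot described above, here is a hedged R sketch on simulated data; the genotypes, outcome, and number of principal components are all made up, and real GWAS pipelines use specialized tools like PLINK rather than a plain glm loop.

```r
# Simulated toy GWAS: logistic regression of case/control status on each SNP,
# adjusting for principal components, then a -log10 p-value plot.
set.seed(101)
n_samples <- 500
n_snps <- 1000
snps <- matrix(rbinom(n_samples * n_snps, size = 2, prob = 0.3),
               nrow = n_samples)                    # 0/1/2 genotype matrix
status <- rbinom(n_samples, size = 1, prob = 0.5)   # case/control outcome
pcs <- prcomp(snps)$x[, 1:2]                        # top principal components

pvals <- apply(snps, 2, function(g) {
  fit <- glm(status ~ g + pcs, family = binomial())
  summary(fit)$coefficients["g", "Pr(>|z|)"]
})

plot(-log10(pvals), pch = 19, cex = 0.4,
     xlab = "SNP index", ylab = "-log10 p-value")   # simple Manhattan-style plot
abline(h = -log10(0.05 / n_snps), col = "red")      # Bonferroni threshold
```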

Video: Combining Data Types (eQTL) (6:04)

Expression quantitative trait loci, or eQTL, analysis is one of the most common integrative analyses that
are performed in genomics. So an eQTL is an analysis where you're trying to identify variations
in DNA that correlate with variations in RNA. So basically what you do is you measure the
abundance of different RNA molecules. And measure the DNA in those same samples and then
you try to correlate the variation in DNA to the variation in RNA. This is representative of a
whole class of problems that are associated with combining different genomic data types.
Whether it's measuring proteomic data and RNA data, or DNA data and RNA data, or RNA and
methylation. And then trying to integrate those data together to try to identify their sort of cross
regulation between these different measurements. So one of the first examples of an eQTL study
was this study by Brem et al. in 2002 in Science. And they basically crossed two strains of yeast
and they created 112 random segregants. And so once they had those yeast segregants, they
measured mRNA expression, at the time using gene expression microarrays, and then
they measured genotypes using a microarray genotyping tool. And the goal was to identify
associations between the expression levels as well as the genotype levels. And so you can think
of this as basically having two components. One is the SNP data, so that's the marker
or SNP associated with each gene in the genome. In this case, it's the yeast genome. And so you
have the position of the particular SNP that you're measuring and then you also have information
about a particular gene. Like how much that gene is turned on or expressed. And then you have
the information on where that gene is located in the genome as well. So, you're basically trying
to do an association between all possible gene expression levels and all possible SNP levels. So,
this obviously complicates the issue of multiple testing because you're doing all possible SNPs
versus all possible gene expression values. So if you think about it as for every single SNP,
you're performing basically a gene expression microarray analysis for every single SNP. And if
you have thousands or hundreds of thousands of SNPs, that's thousands or hundreds of thousands
of microarray experiments. And you're basically looking for cases like this where you see, so in this
case, there are the two strains. They have the BY and RM strains, so those are surrogates for the
genotypes in this case. And so here, you're looking for differences in expression. So here you
don't see any difference or not a very strong difference in expression between the BY and RM
strains for this particular gene, for this particular variant. Here for this other variant for this other
gene, you do see differences in the mean level of expression between the two genotypes. And so
that would be sort of classified as an eQTL if it passed the significance thresholds. And so this is
typically the kind of plot that you can make when you do an eQTL analysis, so on the x-axis
here, we've got the position of the marker or the genotype. So again, that was where that SNP
was positioned in the genome and then you also have the trait position. So that's where the gene
expression levels were located at. So basically you can imagine where's the gene that codes for
the mRNA that is being measured and where is the SNP that's being measured. So then you'd just
line up the chromosomes on each axis and so this circled component right here, this diagonal line
represents what are typically called cis-eQTL. So cis-eQTL are often defined as eQTL where the
SNP position is close to the gene expression position.

And then there are also what are called trans-eQTL; now in this case, there appear to be lots of
trans-eQTL. But what's often been noticed is that you see these sorts of big stripes of loci
that seem to associate with many genes' expression levels. Very often, those tend to be artifacts,
so it might be a batch effect or some sort of artifact in the data that basically are driving the sort
of variability. Now sometimes that may or may not be true. Like if you identify, for example, a
biological reason that there might be a large number of associations between a particular locus
and lots of genes' expression, that might be true. But typically, your assumption is that it might
be an artifact if you see these sort of large stripes in the pattern here where there's a particular
marker that's associated with many genes. So this idea is actually really popular right now. It's
being used in a whole large number of studies. One of the most recent and very large scale
studies of gene expression variation in context of eQTL is the GTEx project. Where they took
multiple people, multiple donors, and they took from each donor, multiple tissues and they
measured information about their DNA sequence. And they also measured their level of
expression in various different tissues, say their brain, heart, and liver, and then they performed
eQTL analysis that are both across tissues and within tissues. And so, they've identified a large
number of eQTL including sort of cross tissue eQTL. That data is all available and you can start
analyzing it yourself if you're interested. And so, eQTL is sort of an area that's here to stay and is
probably the most popular of the integrative genomic sorts of applications. So just some notes
and further reading. So the cis-eQTL tend to be more believable than trans-eQTL. So the cis-
eQTL being those eQTL where the SNP position or the variant position are close to the coding
region of the gene. Then the trans-eQTL where you see SNP position that's very distant from the
position of the coding gene. There are many potential confounders here. So in this analysis,
usually you have to just like in the sort of a GWAS analysis you have to adjust for population
stratification. You have to do that here. You also have to adjust for things like batch effects on
the gene expression data just like you would do in a gene expression analysis. And then there's
even more complicated things like sequence artifacts. Where a sequence artifact could actually
make it look like that there's eQTL, especially a trans-eQTL, when they're not actually there. So
this paper I've linked to here is actually an excellent review of many of the issues associated with
eQTL analysis if you want to learn a little bit more about that.

Video: eQTL in R (10:36)

The most common integrative analysis is to do an expression quantitative trait loci
analysis, or eQTL analysis, where you're combining gene expression information with genetic
information to try to identify associations. So I'm going to show you an example of that using the
MatrixEQTL package. So here I'm going to set my parameters like I usually do, and then I'm going to
load the packages that we're going to need; in this case the primary driver of most of this analysis
is the MatrixEQTL package. As you can see there, it has sort of strange capitalization: you
need all caps for EQTL and a capital M for Matrix. So then what I can do is I can basically load some
data from this. I'm going to load SNP data, expression data, and covariate data. The SNP data is the
genotype information, the expression data is the gene expression information, and the covariates
are anything you might want to adjust for. I'm going to do that by copying these lines here.

These are basically just finding out the location of the package. So it's the base directory. And
then I'm going to paste together that directory with the names of the different text files to
basically pull them out. So if I look at the base directory, it basically finds the base directory on
my computer, and then the SNP file name is going to basically paste that together with some
downstream information that we might need, to make that txt file. And then it's going to paste
those together with no separation between them. And so you can see it creates a file name that
points straight to the SNP.txt file. So then I can read those in with
read.table. So I'm basically going to read in the expression file with read.table on the expression
file name; I'm going to tell it that the separator is a tab separator, it's got a header, and it has row
names in column one. I knew that because I'd already looked at the files in advance, but in
general you might have to check and see if they have a header, or if they have row names. So,
here you go. So now we see the expression data. We can do the same thing with the SNP data,
we can read it in.

We read it in from basically the R package itself, and again this one has a header and it has row
names.
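
A sketch of the loading steps just described; the file names follow the MatrixEQTL package's own tutorial, but treat them as assumptions and check the data directory of your installation.

```r
library(MatrixEQTL)

base.dir <- find.package("MatrixEQTL")
SNP_file_name        <- paste(base.dir, "/data/SNP.txt", sep = "")
expression_file_name <- paste(base.dir, "/data/GE.txt", sep = "")
covariates_file_name <- paste(base.dir, "/data/Covariates.txt", sep = "")

expr <- read.table(expression_file_name, sep = "\t", header = TRUE, row.names = 1)
snps <- read.table(SNP_file_name,        sep = "\t", header = TRUE, row.names = 1)
cvrt <- read.table(covariates_file_name, sep = "\t", header = TRUE, row.names = 1)

snps[1, 1:5]   # genotypes coded 0/1/2
expr[1, 1:5]   # quantitative expression measurements
```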

So, I can do the same thing, and you can see the genotype information is coded as zero, one, two.
The expression measurements are quantitative. I can do the same thing for the covariates here.
And so I can look at the covariates, which have information on the gender and other information
you might care about for the people, for example, their age. Okay, so then, basically the idea
behind an eQTL analysis is to do a linear regression relating gene expression information to
genotype information. So we could do that one by one. So here's how you'd do that. You could
basically take the first gene's expression levels by extracting them out from the first row of the
expression matrix, and you can extract out the first set of SNPs from the SNP matrix. So now I
have e1 and s1, and I just want to relate those two things together; you could use a linear model.

And so, if you look at that, you see that there might be a relationship here between this gene's
expression levels and this SNP, because it has a relatively reasonable effect size, but the p-value
is too big. So it's probably not highly associated. So then
what we can do is we can actually plot the data to see what this looks like. So what I'm going to
do is I'm actually just going to plot the expression levels vs the genotype, so here I'm plotting
expression level vs a jittered version of the genotype so just so the points don't all land on top of
each other and then I'm going to color it by the genotype. So you can see here they're colored by
genotype, and then I'm going to add an axis label so you can see which genotypes they are. Then
you can plot the fitted values versus the observed genotypes; in this case I'm going to color those
in dark gray, and here are the fitted values. You can see that this is the average in the homozygous
major, the heterozygote, and the homozygous minor. In this case I've assumed an additive model,
since I've just directly included the SNP as a number. So you can see it's the same difference between the
homozygous major and the het as between the het and the homozygous minor allele.
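
Here is a minimal sketch of the single-gene, single-SNP regression and plot just described, continuing from the `expr` and `snps` tables read in above; the genotype axis labels are hypothetical.

```r
# Pull out the first gene and first SNP, fit an additive linear model, and plot.
e1 <- as.numeric(expr[1, ])
s1 <- as.numeric(snps[1, ])

lm1 <- lm(e1 ~ s1)            # genotype entered as 0/1/2, so an additive model
summary(lm1)

plot(e1 ~ jitter(s1), col = s1 + 1, pch = 19,
     xaxt = "n", xlab = "Genotype", ylab = "Expression")
axis(1, at = c(0, 1, 2),
     labels = c("hom. major", "het", "hom. minor"))   # hypothetical labels
points(s1, fitted(lm1), col = "darkgrey", pch = 15)   # fitted mean at each genotype
```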

So now we can set it up to do this on every single case. Because if you wanted to use lm, in this
case the data sets are really quite small: the expression data has only got ten genes in it. But in
general, you might have tens of thousands or hundreds of thousands of genes or exons that you're
testing, and equivalently, millions of SNPs, and so you're doing billions and billions of regression
models. If you do that with lm, it will take a long time, so instead we can use this
MatrixEQTL package. So, the first thing we need to do is set up some parameters.
The p-value threshold is how statistically significant a p-value needs to be
before it actually saves the output. So it's basically going to throw away the calculations for
everything that is above the threshold, and that'll save a lot of
computational time and space.

So you should choose this to be as liberal as you need to catch everything that you think might
potentially be interesting. It doesn't have to be just the ones that are at the most extreme
significance levels, if you want to look at the broader distribution.

The error covariance, if you just set it to numeric(), that's basically saying that, after accounting
for the genotypes and the covariates, there is essentially an independent error model for
the gene expression variation, which is usually the most common assumption.

And then you have to tell it which model to use; in this case it's going to say use the linear model,
so just use the standard additive linear model rather than trying to do a dominant or recessive
model.
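
A short sketch of the general parameters just described; the threshold value is a tutorial-style default rather than anything specific to this data set.

```r
pvOutputThreshold <- 1e-2         # only save results with p-values below this
errorCovariance   <- numeric()    # empty: assume independent errors
useModel          <- modelLINEAR  # standard additive linear model
```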

Okay, so then once you've got those general parameters set up, you have to set up the files for
MatrixEQTL. Because it is very, very fast, but to do that it has to be organized about
how it's going to analyze the data. And so here we have to create a new SlicedData object for both
the SNPs and the gene expression information. And so to do that we have to tell it what the file
delimiter is; this is basically for the file you're reading in: if it's tab separated, then you give
it a file delimiter of tab.

You also tell it what the definition of a missing value is, and how many rows and columns to
skip. So in this case the data set had both a set of column names and a set of row names, so we
want to skip one of each, because it's just going to test the values that are numeric. And then the
file slice size is going to basically break the files up into chunks so that they can be computed on
more easily in R, and so this tells you how big the chunks are going to be, in this case a size of
2000. The bigger the chunks, probably, the faster it can compute, but the slower it is to load
because it has to read more data in, so there's a balance there. 2000 seems like a reasonable
compromise here. In this case, there's no need to chunk it up, because the data sets are so small,
and so then I just load the file name in,
using all those parameters I just defined. I do something very similar for the expression data,
to create another SlicedData object. So here it's 15 SNPs and ten genes, and in this case I'm not
going to adjust for any covariates; you can do the same thing with the covariates but I'm just going
to use an empty object, so we're not going to do any adjustment. So then here's the main
command for running Matrix eQTL; you've got to be careful about the capitalization, capital M
and lowercase e, capital QTL. Then you pass it the SNP sliced data object, the gene sliced data
object, and the covariates sliced data object. If you want to, you can have it output the results
directly to a text file rather than into R. This can be useful because sometimes these computations,
especially if you have many genes or many SNPs, can take a long time to run, and so you might
just want to get this running in the background and then come back to it and have it output the
results. You also pass the p-value threshold, what model to use, and the error covariance. In this
case we want to be able to look at the p-value distribution for all the different SNPs. This will
slow it down a little bit and require a little more memory, but it allows you to do a better, deeper
analysis, so I suggest always setting that option to TRUE, and then you can
tell it whether to calculate the false discovery rate or not.
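
Putting the pieces above together, here is a hedged sketch of the SlicedData setup and the main call, continuing from the file names and parameters defined earlier; the function and slot names follow the MatrixEQTL documentation as I recall it, so verify them against the package help.

```r
# Set up the SNP and expression data as SlicedData objects.
snps_sd <- SlicedData$new()
snps_sd$fileDelimiter      <- "\t"    # tab-separated input
snps_sd$fileOmitCharacters <- "NA"    # how missing values are coded
snps_sd$fileSkipRows       <- 1       # one header row
snps_sd$fileSkipColumns    <- 1       # one column of row names
snps_sd$fileSliceSize      <- 2000    # read the file in chunks of 2000 rows
snps_sd$LoadFile(SNP_file_name)

gene_sd <- SlicedData$new()
gene_sd$fileDelimiter      <- "\t"
gene_sd$fileOmitCharacters <- "NA"
gene_sd$fileSkipRows       <- 1
gene_sd$fileSkipColumns    <- 1
gene_sd$fileSliceSize      <- 2000
gene_sd$LoadFile(expression_file_name)

cvrt_sd <- SlicedData$new()           # left empty: no covariate adjustment

me <- Matrix_eQTL_engine(
  snps = snps_sd,
  gene = gene_sd,
  cvrt = cvrt_sd,
  output_file_name = NULL,            # keep results in R instead of a text file
  pvOutputThreshold = pvOutputThreshold,
  useModel = useModel,
  errorCovariance = errorCovariance,
  verbose = TRUE,
  pvalue.hist = TRUE,                 # keep the full p-value distribution
  min.pv.by.genesnp = FALSE,
  noFDRsaveMemory = FALSE)

me$all$ntests   # number of gene-SNP combinations tested
me$all$neqtls   # number passing the p-value threshold
me$all$eqtls    # SNP, gene, statistic, p-value, FDR, and effect estimate
```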

So in this case, it ran very quickly, since there are hardly any SNPs and genes, but it's still
incredibly fast even if you do many analyses. And so the first thing you can do is look at the
object that gets created. It saved the p-values for all 150 tests. There are 150 tests because there
are ten genes and 15 SNPs, and it's basically testing every possible combination of SNPs and
genes; that's 150, 10 times 15. And these are the p-values from all of them. Here, they're
relatively flat, so you see that there's probably not much statistical significance, but also maybe
there are no artifacts.

The me object that comes out has several components to it. It tells you how long it took to run
and it tells you something about the parameters that you used to run the analysis, so you can save
those as well. And then the all component tells you the number of eQTLs it found and how
many tests it did. So, if I look at the number of tests, it calculated 150 tests just like we talked
about. And then you can look at the number of eQTLs it found; that's the number that passed
that significance threshold. And then, the next step is looking at the eQTLs themselves. Here it
is. It tells you which SNP, which gene, what the statistic was, what the p value was, the false
discovery rate, and the estimate of the association statistic.
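A quick sketch of pulling those pieces out of the object from the run above:

me$time.in.sec        # how long the analysis took
me$param              # the parameters used for the run
me$all$ntests         # number of tests performed (here 10 genes x 15 SNPs = 150)
me$all$neqtls         # number of eQTLs passing the p-value threshold
head(me$all$eqtls)    # SNP, gene, statistic, p-value, FDR, and effect estimate
plot(me)              # histogram of all p-values (since pvalue.hist = TRUE)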

So the next step is to go back and check. In general you've got to check to make sure that there aren't any artifacts and that the plots look reasonable, so you can go back and plot the eQTLs after you've done this and make sure that they really do appear to be associated. But this is the first step in doing an eQTL analysis: just doing all the possible comparisons between all SNPs and all genes.
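For example, a minimal sketch of that kind of check, assuming you still have the raw genotype and expression matrices in R (called snpsmat and expr here, which are hypothetical names) with samples in the columns:

top = me$all$eqtls[1, ]                                 # the top eQTL from the results
geno = as.numeric(snpsmat[as.character(top$snps), ])    # genotypes coded 0/1/2
ex = as.numeric(expr[as.character(top$gene), ])         # expression for that gene
boxplot(ex ~ as.factor(geno), xlab = "Genotype", ylab = "Expression",
        main = paste(top$snps, "vs", top$gene))
points(jitter(as.numeric(as.factor(geno))), ex, col = "darkgrey", pch = 19)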

Video: Researcher Degrees of Freedom (5:49)

As we've seen in many of the analyses we've talked about throughout this class, there are a large
number of steps that are involved in doing a statistical genomics project from pre-processing and
normalization, to statistical modeling, to post hoc analyses of the results that you get. So I
wanted to talk a little bit about Researcher degrees of freedom. This is an idea that was originally
proposed in psychology, and there was this paper that said, basically, undisclosed flexibility in
data collection and analysis allows for presenting anything as statistically significant. And, so,
what are they talking about here? They're talking about how there's a large number of steps in the
sort of data analytic pipeline. They go from experimental design, all the way from the raw data to
the summary statistics, and then finally there's a p-value at the end. Now usually when people are
talking about statistical significance, they talk about p-values or multiple testing corrected p-
values, and often a lot depends on that p-value being sort of small enough that a journal will
publish the paper, or something like that. And so that dependence is going down a little bit over
time, but traditionally there has been a lot of focus on it. And there are a lot of steps underneath that process, before you get to the p-value, that could change what that p-value is.
So, for example, if you throw out a particular outlier, or if you normalize the data a little bit
differently, you might get different results. And so, there's lots of different ways you can analyze
data. And the danger here is that, when they were talking about it in this paper, they were sort of
talking about a nefarious case where you keep doing everything you can until you get a p-value
that's significant, but you could imagine doing this just sort of by accident. You make a large
number of choices when doing a genomic data analysis, and once you've made those choices,
you get some result. And maybe you don't like that result so you redo the analysis. So one thing
that you have to be very careful about when doing genomic analyses is redoing the analysis too many times. It makes sense to redo the analysis when there's new, updated software or when new biological or scientific knowledge has been brought to bear. But if you keep redoing it over and over again, you fall into this trap. And so, you can imagine how that would
happen with different teams. This comes from a recent analysis, not one in genomics, but it kind of illustrates the point: 29 different research teams were asked to see whether referees were more likely to give red cards to dark-skinned players. And so each team analyzed
the data a little bit differently. And here you can see the dots represent the different effect sizes
that they estimated for the different studies, and so you can see that they're all different. And then
the confidence intervals, or uncertainty intervals, for each of these estimates are also different from each other. And so, while they're comfortingly
similar for many of the estimates here in the middle, you can get quite big variability just by
changing the way that you analyze the data. And so, you have to be careful to make sure that you
don't do this over and over and over again until you find just the one case where you get a large
estimate of the effect, even if it's probably not necessarily due to anything other than the way that
you analyze the data. And so the difficult thing about this is that if you do a different analysis, particularly if you adjust for different covariates, you are actually answering a different question. The question is conditional on what your model is. So if you have a whole bunch of extra covariates in the model, then you're asking, is there a difference
in gene expression after I account for all of these other variables? That's a very different question
than, is there just a gene expression difference overall, which might mean something totally
different. And so you have to be a little bit careful about this idea of researcher degrees of freedom as it relates to knowing what question it is that you're answering. This whole idea was summarized in a paper by Andrew Gelman and Eric Loken where they talk about the
garden of forking paths. What they mean by that is basically that you start off doing an analysis
where you just haven't seen the data, and maybe you have an analysis plan in mind. Then once
you collect the data you realize, oh that there's a problem of a particular type. This happens all
the time in genomic data. And then you start making decisions based on the data that you've
observed, and once you start doing that you start playing into this researcher degrees of freedom
idea. You're basically changing the way that you're analyzing the data based on the data, and you
can end up with a little bit of trouble. So the key is to be thinking ahead right from the beginning,
how am I going to analyze these data, what decisions am I going to make before looking at the
data, so that you're not sort of driven by those, and sort of end up chasing a false positive. So the
key take home message here, have a specific hypothesis that you're looking for. So with genomic
data there's this sort of tendency to just sort of do discovery for the sake of doing discovery
without a specific hypothesis. And that can often lead towards this sort of garden of forking
paths or these researcher degrees of freedom. Another thing that you can do is pre-specify your analysis plan, even if it's just internal to you: say, this is the way we're going to analyze the data, and we're going to stick to it. Then, even if you end up adapting it later, it's good to
just analyze the data once exactly how you planned on analyzing it, even if it has problems, just
so you know what would have happened, and see if there's big differences and why those
differences might be.

Another thing that you can do, if you have enough data, although that's often not the case in genomics, is to use training and test sets: split your data into a first analysis data set and then validate the results you get in the remaining data. Also, analyze your data only once. A very common temptation in genomics is to fit increasingly complicated models until you find more and more things, and that often leads to false positives. The other thing that you can do is report all of the analyses that you ran; that gives people the opportunity to judge whether there's potential for data dredging or researcher degrees of freedom in your analysis. So this is a cautionary
note that genomic data is complicated, and if you add complicated analysis on top, you can often
run into extra false positives.
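For the training/test idea, here is a minimal sketch, assuming an expression matrix edata with samples in the columns and a matching phenotype data frame pdata (hypothetical names):

set.seed(12345)                                  # make the split reproducible
n = ncol(edata)
train_index = sample(1:n, size = round(n / 2))   # half the samples for discovery
edata_train = edata[, train_index]
edata_test = edata[, -train_index]
pdata_train = pdata[train_index, ]
pdata_test = pdata[-train_index, ]
# Fit the pre-specified model on the training set only, then check whether
# the hits replicate in the held-out test set before reporting them.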

Video: Inference vs. Prediction (8:52)

Throughout most of this course, we've talked about statistical inference for genomics, but there's
also this idea of statistical prediction or machine learning for genomics. So recall that the central
dogma of inference is basically that we're going to have this population and we want to use
probability to sample from that population. So once we get that sample, we're going to try to say
something about this global population. So this is sort of a population level analysis. By contrast
you can think of sort of the central dogma of prediction, is you take some sample from a
population again and you build that into a training set, where you have two different kinds of
things that you're trying to predict, and then you use that data to build a prediction function. And
so once you have that prediction function, if you get a new sample and you don't know what color it is, the function assigns it to one of the two colors based on some of its properties. And so prediction is a somewhat different problem than inference, and we haven't covered too much
about it, but wanted to cover just a little bit about some of the key issues that often come up
related to prediction in genomics.

So the first thing to keep in mind, is that inference and prediction can give you very different
answers, totally sensibly. So here's an example, suppose we want to test for the differences
between the values between two different distributions, and we collect a whole bunch of data. If
you do inference, and you ask are these two populations different? In this case, they're definitely
different from each other, the distributions are very different from each other. But they're not
necessarily very predictive. So imagine that I wanted to predict which of the two distributions the
data point came from. If it came from sort of out here, you might be able to predict, oh, it's
maybe a little bit more likely to come from the light gray sample than the dark grey sample. But
if it came from here, it's not very predictive at all; it could plausibly have come from either the dark grey sample or the light grey sample.

On the other hand, this is another case where inference would definitely tell you that there's a
difference, just like it would in the previous example, but here it's much more predictive.
Basically if you have any data point out here, it's going to be easily assigned to one of the two
distributions. But a data point here in the middle might not necessarily be assigned, but that's just
a very small fraction of the cases. So the first thing to keep in mind, is that in the case of
inference we're looking for differences that may or may not be predictive. So if you do, say, a
differential expression analysis, you might identify lots of differences. Many of those might not
necessarily be good for prediction. So the other thing to keep in mind is the quantities of interest.
So suppose that you're doing genomic tests and you have some disease that you want to test for,
then the quantities that you care about are: the case where the test says you have the disease and you actually do, that's a true positive. Or the case where the test says you have the disease but you actually don't, that's a false positive.

Or the case where the test says you do not have the disease and you actually do, that's a false
negative. And then the case where the test says you do not have the disease and you actually
don't, that's a true negative. Usually people in genomics talk a lot about false positives when they're talking about inference, and they also talk about true positives. But in prediction you need to carefully balance how all of these different categories trade off. So here's a
really simple definition of some of the key quantities. You might hear about the sensitivity of a test: that's the probability that you get a positive test given that you actually do have the disease. Specificity is the probability of a negative test given that you don't have the disease. The positive predictive value is, if I do have a positive test, how likely is it that I actually have the disease? Similarly for the negative predictive value, with a negative test. And the accuracy is just the probability that you get the correct outcome. That's the sum of the true positives
here and the true negatives here divided by the total number of cases. And so, here you're going
to again define all of these sort of things in terms of the true positives, false positives, false
negatives, and true negatives. So, for example, sensitivity is the TP / (TP+FN). So these
definitions that I'm showing you here, in terms of these quantities, correspond to the probability
definitions that you saw on the previous screen, so the probability of a positive test given that you have the disease. Here, (TP+FN) is all the cases where you have the disease, and then you're looking at the fraction of the time that you actually identify them. So
that's TP / (TP+FN).
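Written out as code, with made-up counts just for illustration:

TP = 99; FP = 999; FN = 1; TN = 98901        # hypothetical 2x2 table of counts
sensitivity = TP / (TP + FN)                 # P(positive test | disease)
specificity = TN / (TN + FP)                 # P(negative test | no disease)
ppv = TP / (TP + FP)                         # P(disease | positive test)
npv = TN / (TN + FN)                         # P(no disease | negative test)
accuracy = (TP + TN) / (TP + FP + FN + TN)   # P(correct outcome)
round(c(sensitivity, specificity, ppv, npv, accuracy), 3)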

Okay, so let's use an example. This just illustrates how any kind of screening can be tricky, but particularly genomic screening. So, assume that there is a disease that only about 0.1% of the population has, and that we have a test that's 99% sensitive. That is, if you have the disease, 99% of the time the test will say you have the disease. And it's 99% specific, meaning that if you don't have the disease, 99% of the time it'll say that you don't have the disease. So, that seems like a pretty good test.
So, the question is, what's the probability of a person having the disease given the test result is
positive? In other words, what's the positive predictive value of this test? So we're going to
consider two cases, a general population where the rate of this disease is 0.1% and then a higher
at-risk sub-population. So in the general population this is what it might boil down to. So
remember, it's a very accurate test. So if you have the disease, 99% of the time it'll tell you that
you have it. And then if you don't have the disease, 99% of the time it'll tell you that you don't
have it. But these numbers are a little bit sort of unbalanced because almost no one has the
disease. It's a highly rare disease, so if you actually go calculate the sensitivity and the
specificity, they're both very high, just like we expected. But the positive predictive value is only
9%. Why is that? It's because you're testing a huge number of people who don't have the disease, so even though only a tiny fraction of them come up as false positives, that's still a large number because you tested so many. So it turns out that the positive predictive value, the probability you actually have the disease if the test says you do, is only 9%, which might not be that great for lots of reasons. For one, we might give you all sorts of treatments you don't necessarily want to get. For two, you might be nervous or scared because we told you you have the disease, even though it's actually kind of unlikely that you have it. Even
though the test is, in this case what a lot of people consider to be a really, really sensitive and
specific test. Now except for sort of rare disorders and some very specific variations, it's very
rare that you would get these numbers to be this high in a genomics experiment. Typically the
sensitivity and the specificity are relatively low compared to what we're showing here.

And so this affects even what people consider to be really useful and quite strong screening tests. For example, with mammogram screening, particularly in young women, or with PSA screening for prostate cancer in younger men, it turns out that even though the test might be pretty good, you're testing so many people, most of whom don't have the disease, that you'll get lots of false positives, which can lead to real consequences. In particular, the consequences tend to relate to how much money people spend on downstream therapy, and how much difficulty they go through with that downstream therapy. So one way to address this, particularly
this is useful for genomics but also other areas, is to basically go to a population where there's a
higher risk of the disease. So in this case now, we have again a 99% sensitive and specific test,
but now we've gone to a situation where there's a higher risk of that disease in the population overall. So you can see, now there are 10,000 people who have the disease, compared to 100,000 who don't, so the frequency of the disease in the population is much higher. And if you do those calculations again, you still have a 99% sensitive and specific test, but now you don't get overwhelmed by the fact that there are so many more non-diseased people than diseased people, and your positive predictive value stays pretty high. So this is one example
of the ways that it can be a little bit tricky to do screening or to use genomics measurements for
prediction.
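A small sketch of that calculation using Bayes' rule, with the same 99% sensitive and specific test; the two prevalences mirror the general population (0.1%) and the higher-risk group (10,000 diseased out of 110,000) described above:

ppv = function(sens, spec, prev) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}
ppv(sens = 0.99, spec = 0.99, prev = 0.001)           # general population, ~0.09
ppv(sens = 0.99, spec = 0.99, prev = 10000 / 110000)  # higher-risk group, ~0.91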

Prediction is really a whole class in itself, so I'm linking here to one class that I actually teach on Coursera, but there are other really good prediction and machine learning classes.

And then this is sort of the idea that's underlying precision medicine. Some of precision medicine
is focused on sort of rare diseases and Mendelian disorders, where it's a little bit more targeted,
and you tend to get much higher sensitivity and specificity. But particularly for precision
medicine for common complex diseases, these are the issues that will come up. And so far this
has sort of been a major challenge for genomics, it's sort of an open area where a lot of people
are working.

Video: Knowing When to Get Help (2:31)

In this class, I've tried to cover the key topics of statistical genomics, but as you might imagine,
this is a really complicated topic. And actually, a whole sequence of classes has been designed
around just this one topic, taught by my colleague Rafael Irizarry at edX. I think that's a really useful set of classes; if you've enjoyed what you've learned here and you want to go a little more in depth, it's a great place to start. But genomics and statistical genomics is a huge, actively evolving area of research, so often you need to get some help. And that help isn't always in the form of classes: we can formalize the basic ideas, but it's very hard to formalize the latest and greatest for any new technology. I do suggest that you take more statistics classes if you're
interested. We also teach a set of statistics classes in data science that are more general purpose,
but will teach some of the topics we've covered here in a little more depth. So one thing to keep in mind is that no matter how much you've learned, in these classes or in others, there's always a lot more to learn. There's a sort of default assumption that once you've learned the basic concepts of statistics, you can just apply them without thinking about it too much. I, for certain, am of the opinion that statistics is a very complicated and difficult topic
that deserves careful attention just like any other part of scientific endeavor. And so, this is a post
I wrote kind of tongue-in-cheek talking about how we require surgeons to be very well trained
and to be very careful, but sometimes we let data scientists kind of get away with whatever they
want. And so, I think it's important that you know when and where to look for help. And so you
can look for help locally. If you're here at Johns Hopkins, we have a biostatistics consulting center which can help out with statistical genomics experiments, and at many research institutions, universities, or colleges, there will be a consulting center that can be a local resource for you to talk to people about statistics. You can also go to online resources like Stack Overflow if you want to ask questions about coding or R programming. If you have questions about Bioconductor packages, you can go to the Bioconductor support site. Or you can go to sequencing-specific support sites, like SEQanswers, which will give you answers to questions that are specific to the latest sequencing technologies that people are seeing in their labs or research groups. So the key take-home message is to know that if you're getting a little bit uncomfortable with the statistics or the functions that you're using, it's worth looking into it more and it's worth going and asking for help.

Video: Statistics for Genomic Data Science Wrap-Up (1:53)

Welcome to the end of the class. I hope that you've really enjoyed learning about statistics for
genomic data science. As you know, statistics is one of the main branches of genomic data science. But it's sometimes the one that gets left out a little bit, the one that's talked about the least when people talk about genomic data science. Computer science and biology often
get a lot of the attention. You can see that in, and this is no joke, a published paper where they left the statement "(insert statistical method here)" right there in the abstract. That sort of gives you an idea that statistics isn't always at the top of people's minds. But it's really
important and it's a huge component of the process. In fact, the p-value that we talked about in
the class is the most widely used statistic in all of genomics, and all of science really.

If the original p-value paper got a citation every time someone used a p-value, it would have over 3 million citations, which would make it the most highly cited paper in the history of any scientific discipline. So statistics is incredibly important, and I hope that you've learned enough about it to
get you sort of started and get you excited on your journey towards being a statistical genomic
data scientist. So there's a lot more to learn that what we've covered in these four weeks.
Obviously we had to move pretty quickly to get through all the material in that time. But here's a
couple of other things that you could go check out. There's an Advanced Statistics for Life
Sciences course taught by one of my buddies, Rafa Irizarry. And you should go check that out if
you have liked what you learned here and want to learn a little bit more. You could also take the
other classes in the genomic data science specialization, or the classes from the Johns Hopkins
data science specialization, which focus more on basic statistics, and which you might want to learn more about if you're into the material you've learned in this class. So finally, I'd just like
to thank you for sticking around. And I hope you've enjoyed the class, and good luck with your
genomics career.

2 – Quiz

Module 4 Quiz

Question 1. When performing gene set analysis it is critical to use the same annotation as was used in pre-processing steps. Read the paper behind the Bottomly data set on the ReCount database: http://www.ncbi.nlm.nih.gov/pubmed?term=21455293

Using the paper and the supportedGenomes() function in the goseq package, can you figure out which of the mouse genome builds they aligned the reads to?

UCSC hg19

UCSC mm9

NCBI Build 35

UCSC hg18

Question 2. Load the Bottomly data with the following code and perform a differential expression analysis using limma with only the strain variable as an outcome. How many genes are differentially expressed at the 5% FDR level using Benjamini-Hochberg correction? What is the gene identifier of the first gene differentially expressed at this level (just in order, not the smallest FDR)? (hint: the featureNames function may be useful)

library(Biobase)
library(limma)
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset
  .RData")
load(file=con)
close(con)
bot = bottomly.eset
pdata_bot=pData(bot)
fdata_bot = featureData(bot)
edata = exprs(bot)
fdata_bot = fdata_bot[rowMeans(edata) > 5]
edata = edata[rowMeans(edata) > 5, ]

edata = log2(edata+1)
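As a rough sketch (not an answer key), one way to set up this limma fit and pull out the quantities the question asks about:

mod = model.matrix(~ pdata_bot$strain)
fit = lmFit(edata, mod)
ebayes_fit = eBayes(fit)
limma_table = topTable(ebayes_fit, coef = 2, number = nrow(edata),
                       adjust.method = "BH", sort.by = "none")
de = which(limma_table$adj.P.Val < 0.05)   # genes passing 5% FDR
length(de)                                 # how many are differentially expressed
rownames(edata)[de[1]]                     # identifier of the first DE gene, in order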

9431 at FDR 5%; ENSMUSG00000027855 first DE gene

90 at FDR 5%; ENSMUSG00000000402 first DE gene

223 at FDR 5%; ENSMUSG00000000001 first DE gene

223 at FDR 5%; ENSMUSG00000000402 first DE gene

Question 3. Use the nullp and goseq functions in the goseq package to perform a gene ontology analysis. What is the top category that comes up as over-represented? (hint: you will need to use the genome information from question 1 and the differential expression analysis from question 2.)
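A sketch of that protocol, building on the limma fit sketched under question 2; the genome and gene identifier strings shown here ("mm9" and "ensGene") are typical choices for Ensembl mouse IDs, but confirm them against your answer to question 1:

library(goseq)
genes = as.integer(limma_table$adj.P.Val < 0.05)   # 0/1 vector of DE calls from question 2
names(genes) = rownames(edata)
pwf = nullp(genes, "mm9", "ensGene")               # probability weighting function
go_results = goseq(pwf, "mm9", "ensGene")
head(go_results)                                   # top over-represented categories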

GO:0004888

GO:0001653

GO:0043231

GO:0038023

Question 4. Look up the GO category that was the top category from the previous question.
What is the name of the category?

signaling receptor activity

transmembrane signaling receptor activity

detection of chemical stimulus involved in sensory perception of sour taste

G-protein coupled peptide receptor activity

Question 5. Load the Bottomly data with the following code and perform a differential expression analysis using limma, treating strain as the outcome but adjusting for lane as a factor. Then find genes significant at the 5% FDR rate using the Benjamini-Hochberg correction and perform the gene set analysis with goseq following the protocol from the first 4 questions. How many of the top 10 over-represented categories are the same for the adjusted and unadjusted analysis?

library(Biobase)
library(limma)
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset
  .RData")
load(file=con)
close(con)
bot = bottomly.eset
pdata_bot=pData(bot)
fdata_bot = featureData(bot)
edata = exprs(bot)
fdata_bot = fdata_bot[rowMeans(edata) > 5]

edata = edata[rowMeans(edata) > 5, ]
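A sketch of the adjusted analysis, assuming the lane variable is stored as lane.number in the phenotype data (check pData(bot) for the exact column name), followed by the same goseq steps as sketched under question 3:

edata = log2(edata + 1)
mod_adj = model.matrix(~ pdata_bot$strain + as.factor(pdata_bot$lane.number))
fit_adj = lmFit(edata, mod_adj)
ebayes_adj = eBayes(fit_adj)
limma_adj = topTable(ebayes_adj, coef = 2, number = nrow(edata),
                     adjust.method = "BH", sort.by = "none")
genes_adj = as.integer(limma_adj$adj.P.Val < 0.05)
names(genes_adj) = rownames(edata)
pwf_adj = nullp(genes_adj, "mm9", "ensGene")
go_adj = goseq(pwf_adj, "mm9", "ensGene")
# Compare the top 10 categories with and without the lane adjustment
sum(go_adj$category[1:10] %in% go_results$category[1:10])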

3

0

5

