StatQuest:
A Gentle Introduction To RNA-seq
© 2017 Joshua Starmer, http://statquest.org, https://youtu.be/tlf6wYJrwKY

= a normal neural cell
= a mutated neural cell
A bunch of normal neural cells. A bunch of mutated neural cells.
The mutated cells behave differently than the normal cells.
We want to know what genetic mechanism is causing the difference…

This means we want to look at differences in gene expression.

Each cell has a bunch of chromosomes.

Gene1 Gene2 Gene3
Each chromosome has a bunch of genes…

Some genes are active…

These wavy lines represent mRNA transcripts.

…but this gene is not active.

High throughput sequencing tells us which genes are active, and how
much they are transcribed.

We can use RNA-seq to measure gene expression in normal cells…
…then use it to measure gene expression in mutated cells…
Then we can compare the two cell types and figure out what’s different in the mutated cells.
Gene1: No difference between normal and mutated cells.

Gene2: A big difference between normal and mutated cells.

Gene3: A subtle difference between normal and mutated cells.

3 Main Steps for RNA-Seq:

1) Prepare a sequencing library
2) Sequence
3) Data analysis

Preparing an RNA-seq library

NOTE: I’m using the Illumina protocol and sequencer as my example because they are commonly used, but keep in mind there are other protocols and sequencers that do it differently.

Step 1: Isolate the RNA
Step 2: Break the RNA into small fragments.
We do this because RNA transcripts can be thousands of bases long, but the sequencing machine can only sequence short (200-300 bp) fragments.
Step 3: Convert the RNA fragments into double stranded DNA.
Double stranded DNA is more stable than RNA and can be easily amplified and modified. This leads us to the next step…

Step 4: Add sequencing adapters.
The adapters do two things:
1) Allow the sequencing machine to recognize the fragments.
2) Allow you to sequence different samples at the same time, since different samples can use different adapters.
Notice that this step doesn’t work 100% of the time.

Step 5: PCR amplify.
Only the fragments with sequencing adapters are amplified; they are enriched.
Step 6: QC
1) Verify library concentration
2) Verify library fragment lengths

Hooray! Now we sequence the library!
Let’s see how this is done…
NOTE: I’m using the Illumina sequencer as my example because it is commonly used, but keep in mind there are other machines that do it differently.
Imagine this is a fragment of DNA we want to sequence…
G C A G C A C A
It’s vertical, because that’s how it is inside the sequencer.

Actually, there are about 400,000,000 fragments laid out vertically in a grid.
I’m just showing you 4 fragments so your brain doesn’t explode.
This grid is called a “flow cell”.

The machine has fluorescent probes that are color coded according to the type of nucleotide they can bind to (one color each for A, G, C and T).
The probes are attached to the first base in each sequence.

Once the probes have attached, the machine takes a picture of the flow cell from above that looks like this…
The picture tells the machine that the first base in the bottom left-hand corner is an “A”.
This base is a “G”.
These two bases are “C”.

Then the machine washes the color off of the probes….
Then probes are bound to the next base in each fragment.
The machine takes a picture from above…
And now it knows that this base is “C”.
This base is “G”.
These two bases are “T”.
Then the machine washes the color off of the probes….

And the process repeats until the machine has
determined each sequence of nucleotides.
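
Here is a toy sketch of that bind, photograph, wash, repeat loop. It is only an illustration: the four fragments and the probe colors are made up (the slides don't say which color goes with which base), and a real sequencer works from images, not strings.

```python
# Toy model of the cycle described above: probes bind, a picture is taken,
# the colors are turned into base calls, and the machine moves to the next base.
# ASSUMPTION: these probe colors are invented for illustration.
PROBE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in PROBE_COLOR.items()}


def take_picture(fragments, cycle):
    """Return the color seen at each flow cell position during one cycle."""
    return [PROBE_COLOR[fragment[cycle]] for fragment in fragments]


def sequence(fragments):
    """Call one base per fragment per cycle until every base has been read."""
    reads = ["" for _ in fragments]
    for cycle in range(len(fragments[0])):
        picture = take_picture(fragments, cycle)  # probes bind, camera fires
        for position, color in enumerate(picture):
            reads[position] += COLOR_TO_BASE[color]  # color -> base call
        # the real machine washes the color off here before the next cycle
    return reads


# ASSUMPTION: a hypothetical 4-fragment flow cell (all fragments the same length).
fragments = ["ACGT", "GGCA", "CTTA", "CTAG"]
reads = sequence(fragments)
print(reads)
```

In this made-up flow cell the first picture shows A, G, C, C and the second shows C, G, T, T, just like the walk-through above.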
This is how it works with 4 DNA fragments.
With 400,000,000 DNA fragments, the matrix is much denser.
This matrix still isn’t 400,000,000 DNA fragments, but it illustrates one type of problem that can occur.

Sometimes a probe will not shine as bright as it should and the machine isn’t super confident that it is calling the correct color.
Quality scores, which are part of the output, reflect how confident the machine is that it correctly called a base.
In this case, the faded dot would get a low quality score.
Another reason you might get a low quality score is when there are lots of probes the same color in the same region.
This is called “low diversity”, and the overabundance of a single color can make it hard to identify the individual sequences; the colors will blur together.
“Low diversity” is especially a problem when the first few nucleotides are sequenced, because that is when the machine determines where the DNA fragments are located on the grid.
The raw data…
@NS500177:196:HFTTTAFXX:1:11101:10916:1458 2:N:0:CGCGGCTG
ACACGACGATGAGGTGACAGTCACGGAGGATAAGATCAATGCCCTCATTAAAGCAGCCGGTGTAA
+
AAAAAEEEEEEEEEEE//AEEEAEEEEEEEEEEE/EE/<<EE/AAEEAEE///EEEEAEEEAEA<
Each sequencing “read” consists of 4 lines of data.
The first line (which always starts with ‘@’) is a unique ID for the sequence that follows.
The second line contains the bases called for the sequenced fragment.
The third line is always a “+” character. I have no idea why... I asked the internet and I don’t think it knows either…
The fourth line contains the quality scores for each base in the sequenced fragment.
A typical sequence run with 400,000,000 reads will generate a file containing 1.6 billion lines of data!!!
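
To make the 4-line layout concrete, here is a minimal sketch that reads the record shown above. One thing the slides don't state, so treat it as an assumption: the quality characters are decoded as Phred+33 (ASCII code minus 33), and a score Q is turned into an error probability with 10^(-Q/10). The function names are hypothetical, not from any particular library.

```python
import io

# The example record from the slide, exactly as shown.
FASTQ_TEXT = """@NS500177:196:HFTTTAFXX:1:11101:10916:1458 2:N:0:CGCGGCTG
ACACGACGATGAGGTGACAGTCACGGAGGATAAGATCAATGCCCTCATTAAAGCAGCCGGTGTAA
+
AAAAAEEEEEEEEEEE//AEEEAEEEEEEEEEEE/EE/<<EE/AAEEAEE///EEEEAEEEAEA<
"""


def read_fastq(handle):
    """Yield (read_id, bases, quality_scores) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        bases = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("this does not look like a 4-line FASTQ record")
        # ASSUMPTION: Phred+33 encoding, i.e. score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality_line]
        yield header[1:], bases, scores


def error_probability(score):
    """Chance that a base call with this score is wrong (Phred scale)."""
    return 10 ** (-score / 10)


records = list(read_fastq(io.StringIO(FASTQ_TEXT)))
read_id, bases, scores = records[0]
print(read_id)
print(bases[:10], scores[:10])

# 4 lines per read is where the 1.6 billion lines come from.
lines_per_run = 400_000_000 * 4
print(lines_per_run)
```

Under that assumption, an “E” is a confident call and a “/” is a shaky one.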

Now that we understand the raw data and how it’s
generated, we need to:

• Filter out garbage reads
• Align the high quality reads to a genome
• Count the number of reads per gene



Filter out garbage reads

Garbage reads are:
1) Reads with low quality base calls
2) Reads that are clearly artifacts of the chemistry.
A typical read is a DNA fragment…
…plus adapter sequences…
…but sometimes the adapters just bind to each other and the “read” is just adapter sequence. This is a garbage read…

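
The two kinds of garbage reads above can be sketched as a simple filter. This is a rough illustration, not what any particular trimming tool does: the adapter string is a placeholder you would swap for your kit's sequence, and the mean-quality cutoff of 20 is an arbitrary assumption.

```python
# ASSUMPTION: placeholder adapter sequence; substitute the one from your library kit.
ADAPTER = "AGATCGGAAGAGC"


def mean_quality(scores):
    """Average quality score across a read."""
    return sum(scores) / len(scores)


def is_garbage(bases, scores, min_mean_quality=20, adapter=ADAPTER):
    """Return True for reads we would throw away before alignment."""
    if not bases:
        return True
    # 1) Reads with low quality base calls (cutoff is an arbitrary assumption).
    if mean_quality(scores) < min_mean_quality:
        return True
    # 2) Artifacts of the chemistry: the read is adapter from the very first base,
    #    which is what you see when adapters bind to each other.
    if bases.startswith(adapter):
        return True
    return False


good_read = ("ACACGACGATGAGG", [32] * 14)
low_quality_read = ("ACACGACGATGAGG", [10] * 14)
adapter_only_read = (ADAPTER + "AAAA", [36] * (len(ADAPTER) + 4))

for bases, scores in (good_read, low_quality_read, adapter_only_read):
    print(bases, is_garbage(bases, scores))
```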
Align the reads to a genome

Genome:
The genome sequence
gattacataccagga…
Split into small fragments (for reasons that will be explained in a little bit).
gattac attaca ttacat
tacata acatac catacc
atacca taccag accagg
ccagga cagga…
Index of all the fragments and locations

A sequenced read:
ACACGACGATGAG...
Split the read into fragments:
ACACGA CACGAC ACGACG
CGACGA GACGAT ACGATG
Match the read fragments to the genome fragments.
The genome fragments that matched the read fragments will determine a location (chromosome and position) in the genome.

Why are we breaking the sequences up into small fragments?
It allows us to align reads even if they are not exact matches to the “reference” genome.
Imagine this base wasn’t in the reference genome (because, for example, my genome is slightly different from yours).
Then this fragment won’t match anything in the index, but the other fragments will, and we will still be able to figure out where the read came from.
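
Here is a small sketch of the fragment index idea, using the toy genome from the slide. It is a simplification under stated assumptions: fragments are 6 bases long (as drawn above), each matching fragment "votes" for a start position, and the most-voted position wins. Real aligners are much more sophisticated; the function names here are made up.

```python
from collections import Counter, defaultdict

K = 6  # fragment length, matching the 6-base fragments on the slide


def fragments_of(sequence, k=K):
    """Split a sequence into every overlapping fragment of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(genome, k=K):
    """Map each genome fragment to the positions where it occurs."""
    index = defaultdict(list)
    for position, fragment in enumerate(fragments_of(genome, k)):
        index[fragment].append(position)
    return index


def locate(read, index, k=K):
    """Return the genome position most read fragments agree on, or None."""
    votes = Counter()
    for offset, fragment in enumerate(fragments_of(read, k)):
        for position in index.get(fragment, []):
            votes[position - offset] += 1  # where the read would have to start
    if not votes:
        return None
    start, _ = votes.most_common(1)[0]
    return start


genome = "gattacataccagga"
index = build_index(genome)

print(locate("tacatacc", index))      # exact match to the genome
print(locate("gacataccagga", index))  # first base differs from the reference
print(locate("acacgacgatgag", index)) # the read from the slide: not in this toy genome
```

The second read has one base that isn't in the reference, so its first fragment finds nothing in the index, but the remaining fragments still agree on the same location.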

Count reads per gene

Once we know the chromosome and position for a read, we can see if it falls within the coordinates of a gene (or some other interesting feature).
Xkr4 – Chromosome 1, position: 3204563-3661579
Rp1 – Chromosome 1, position: 4280927-4399322
etc. (for all 20,000 genes in the genome)
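
As a rough sketch, counting comes down to checking each read's position against the gene coordinates. The two genes below are the ones listed above; the aligned read positions are invented, the "chr1" naming is an assumption, and so is treating both ends of a gene as inclusive. A real counter would also use an interval index instead of looping over every gene.

```python
from collections import Counter

# Gene coordinates from the slide: name -> (chromosome, start, end).
GENES = {
    "Xkr4": ("chr1", 3204563, 3661579),
    "Rp1": ("chr1", 4280927, 4399322),
}


def gene_for(chromosome, position, genes=GENES):
    """Return the gene whose coordinates contain this position, or None."""
    for name, (gene_chromosome, start, end) in genes.items():
        if chromosome == gene_chromosome and start <= position <= end:
            return name
    return None


def count_reads(alignments, genes=GENES):
    """Count how many aligned reads fall inside each gene."""
    counts = Counter({name: 0 for name in genes})
    for chromosome, position in alignments:
        name = gene_for(chromosome, position, genes)
        if name is not None:
            counts[name] += 1
    return counts


# ASSUMPTION: made-up alignment results (chromosome, position) for five reads.
alignments = [
    ("chr1", 3300000),  # inside Xkr4
    ("chr1", 3661579),  # last base of Xkr4
    ("chr1", 4000000),  # between the two genes
    ("chr1", 4300000),  # inside Rp1
    ("chr2", 3300000),  # right position, wrong chromosome
]
counts = count_reads(alignments)
print(dict(counts))
```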

After you count the reads per gene, you end up with a matrix of numbers like this…

Gene       Sample1   Sample2   Sample3…
A1BG       30        5         13…
A1BG-AS1   24        10        18…
A1CF       0         0         0…
A2M        5         9         7…
A2M-AS1    3563      5771      4123…
A2ML1      13        8         7…
...        ...       ...       ...

The first column contains gene names. The human genome has about 20,000 genes, so this matrix has about 20,000 rows. (We’re just looking at the first few!)
The remaining columns contain counts for each sample you sequenced.
There are usually between 6 and 800+ samples.
“Bulk” RNA-seq, where a “sample” is the average of a pool of cells (usually 6 million cells), might have 3 “normal” samples and 3 “disease state” samples, or 6 total.
“Single-cell” RNA-seq treats each cell like an individual sample, so it can generate a lot of samples.
Each row gives counts, per sample, for a specific gene.
If this were a Single Cell RNA-seq experiment, we would have 20,000 rows (genes) by 800+ columns (samples), giving us at least 16 million values to keep track of.
That’s a huge matrix, and it’s only going to get bigger, since sequencing gets cheaper and people are doing more and more samples.

The last thing we do before analysis is normalize the data.
This is because each sample will have a different number of reads assigned to it, due to the fact that one sample might have more low quality reads, or another sample might have a slightly higher concentration on the flow cell.
Here’s an example:

Gene       Sample #1   Sample #2
           635 reads   1,270 reads
A1BG       30          60
A1BG-AS1   24          48
A1CF       0           0
A2M        563         1126
A2M-AS1    5           10
A2ML1      13          26

Sample #1 has 635 reads assigned to it.
Sample #2 has 1,270 reads assigned to it, twice as many reads as Sample #1.
This does not mean that the genes in Sample #2 were all transcribed twice as much as in Sample #1. Instead, it means that Sample #2 had fewer low quality reads and might have landed on more spots on the flow cell than Sample #1.
However, the read counts make it look like the genes in Sample #2 were transcribed twice as much as in Sample #1.
So we need to adjust the read counts per gene to reflect differences in how many reads were assigned to each sample.
The simplest method is to just divide the read counts per gene by the total mapped to each sample.
However, there are many more sophisticated ways to do this.
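
Here is the simplest method worked through on the example table. It is only a sketch of the divide-by-total idea (scaled to counts per million for readability), not one of the more sophisticated methods.

```python
# Read counts from the example: gene -> (Sample #1, Sample #2).
counts = {
    "A1BG": (30, 60),
    "A1BG-AS1": (24, 48),
    "A1CF": (0, 0),
    "A2M": (563, 1126),
    "A2M-AS1": (5, 10),
    "A2ML1": (13, 26),
}

# Total reads assigned to each sample (the column sums).
totals = [sum(column) for column in zip(*counts.values())]

# Divide each count by its sample's total, then scale to counts per million.
normalized = {
    gene: tuple(count / total * 1_000_000 for count, total in zip(row, totals))
    for gene, row in counts.items()
}

print(totals)
for gene, (sample1, sample2) in normalized.items():
    print(f"{gene:10s} {sample1:12.1f} {sample2:12.1f}")
```

After dividing by the totals, every gene has the same value in both samples, so the apparent doubling goes away.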

For more details, see my StatQuest videos at StatQuest.com

We started out with a bunch of normal neural cells… …and a bunch of mutated neural cells.
Then we extracted the mRNA…
Then we sequenced, aligned, counted the reads per gene in each sample and normalized.

Gene       WT1   WT2   WT3…
A1BG       30    5     13…
A1BG-AS1   24    10    18…
...        ...   ...   ...

Now it’s time to analyze the data!!!
Step 1 in any analysis is always the same: Plot the data
Remember, the data is a huge matrix…

A1BG       30     5      13…
A1BG-AS1   24     10     18…
A1CF       0      0      0…
A2M        5      9      7…
A2M-AS1    3563   5771   4123…
A2ML1      13     8      7…
...        ...    ...    ...

If there were only two genes, then plotting the data would be easy.

A1BG       30   5    13…
A1BG-AS1   24   10   18…

First, we’d replace the gene names with “X” and “Y”

X   30   5    13…
Y   24   10   18…

And then just plot the samples on an X/Y graph.
[Figure: X/Y graph with both axes marked at 10, 20 and 30, showing Sample1 at (30, 24), Sample2 at (5, 10) and Sample3 at (13, 18).]

But we have 20,000 genes…
So we would need a graph with 20,000 axes to plot the raw data…
So we use PCA (principal component analysis) or something like it to plot this data.
PCA reduces the number of axes you need to display the important aspects of the data.
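
To give a feel for what “reducing the number of axes” means, here is a bare-bones sketch that squeezes the six gene axes from the matrix shown earlier down to one axis (the first principal component). It is a teaching toy under stated assumptions: it uses plain power iteration on the covariance matrix, works on raw counts, and only finds the first component. Real analyses normally transform the counts first and use a proper PCA routine.

```python
import math


def center(rows):
    """Subtract each gene's mean so every column averages to zero."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance(centered):
    """Gene-by-gene sample covariance matrix of centered data."""
    n = len(centered)
    columns = list(zip(*centered))
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
        for ci in columns
    ]


def top_component(cov, iterations=200):
    """Power iteration: converges to the direction of greatest variance."""
    size = len(cov)
    vector = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        product = [sum(c * v for c, v in zip(row, vector)) for row in cov]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    return vector


# Rows are samples, columns are the six genes from the count matrix.
samples = [
    [30, 24, 0, 5, 3563, 13],  # Sample1
    [5, 10, 0, 9, 5771, 8],    # Sample2
    [13, 18, 0, 7, 4123, 7],   # Sample3
]

centered = center(samples)
cov = covariance(centered)
pc1 = top_component(cov)

# Each sample is now a single number: its position along the new axis.
scores = [sum(value * weight for value, weight in zip(row, pc1)) for row in centered]
print(scores)
```

Each sample started with six numbers and ends with one, which is something we can actually draw.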

This is a PCA plot from a real RNA-seq experiment done on neural cells.
The “wt” samples are “normal”.
The “ko” samples are samples that were mutated by the researchers.
[Figure: plot of samples wt1 to wt6 and ko1 to ko6. X-axis: Leading logFC dim 1 (−1.0 to 1.0). Y-axis: Leading logFC dim 2 (−0.2 to 1.0).]
The “ko” samples make a nice little cluster in the corner.
The “wt” samples are all on the left side, but spread out on the y-axis.
The way these graphs are drawn, the most important differences are on the X-axis.
Differences along the Y-axis are not as important.
This means that the biggest differences are between the “wt” and the “ko” samples.
However, when we do further analyses, we may wish to exclude “wt2”.

Excessive Self Promotion!!!!
If you want to learn about how PCA does what it does, check out…
StatQuest: Principal Component Analysis (PCA) clearly explained
It’s on my YouTube Channel:
https://www.youtube.com/user/joshstarmer
You can find it by googling “StatQuest PCA”

This is a Single-Cell RNA-seq PCA plot from neural cells.
[Figure: PCA plot. X-axis: PC1 (−6 to 6). Y-axis: PC2 (−2.5 to 5.0). Annotation: Mig, non−Mig.]
The colors were added based on what we knew about the cells.
The green cells were stationary. The orange cells moved around the petri dish.
Most of the orange cells are separated from the green cells.
However, there are a few orange cells that seem more like the green cells.
If we want to determine what is different between these cells… … and these cells… …we might exclude these cells from the analysis.

In summary, plotting the data…
1) Tells us if we can expect to find interesting differences.
2) Tells us if we should exclude some samples from any downstream analysis.

Step 2) Identify differentially expressed genes between the “normal” and “mutant” samples.

This is typically done using R with either edgeR or DESeq2, and the results are generally displayed using this sort of graph.
[Figure: scatter plot of genes. X-axis: logCPM (0 to 15). Y-axis: logFC (−4 to 4).]
A red dot is a gene that is different between “normal” and “mutant” samples.
Black dots are genes that are the same.
The X-axis tells you how much each gene is transcribed.
CPM stands for “counts per million”.
The Y-axis tells you how big the relative difference is between “normal” and “mutant”.
logFC = log(fold change)
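
For a rough feel of the two axes, here is a sketch of how a CPM and a log fold change could be computed for one gene. Two things here are assumptions rather than anything stated above: the log is taken base 2, and a small pseudocount (the hypothetical `prior` argument) is added so genes with zero counts don't break the log. edgeR and DESeq2 estimate these values with proper statistical models, so this is only the back-of-the-envelope version.

```python
import math


def cpm(count, library_size):
    """Counts per million: reads for a gene per million reads in the sample."""
    return count / library_size * 1_000_000


def log2_fold_change(normal_cpm, mutant_cpm, prior=1.0):
    """log2 of mutant over normal; `prior` is a pseudocount (an assumption)."""
    return math.log2((mutant_cpm + prior) / (normal_cpm + prior))


# A made-up gene: same expression in both conditions, so logFC is 0.
print(log2_fold_change(100, 100))

# A made-up gene that is about four times higher in the mutant, so logFC is 2.
print(log2_fold_change(99, 399))

# CPM for A1BG in Sample #1 of the normalization example (30 of 635 reads).
print(cpm(30, 635))
```

On this scale a logFC of 2 means about four times more in the mutant, and a logFC of −2 means about four times less.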

We’ve identified interesting genes, now what?
1) If you know what you’re looking for, you can see if the experiment validated your hypothesis.
2) If you don’t know what you’re looking for, you can see if certain pathways are enriched in either the normal or mutant gene sets.
And then what?

Check out StatQuest!!!
Google: “StatQuest”
You will find complete tutorials on all kinds of stuff related to RNA-seq:
PCA
Heatmaps
Hierarchical Clustering
K-means Clustering
P-values
False Discovery Rates
Differential Gene Expression
Linear Models
Thanks to…
Terry Magnuson and his lab full of awesome people!!
Scott Magness and his lab full of awesome people!!!
All my collaborators and friends here at UNC!!!!!!
Weipeng Mu
Jesse Rab
John Runge
Prabuddha Chakraborty
Dominic Ciavatta
Keri Smith
Cam Spear
Karl Shpargel
Della Yee
The Debu
Chuan-Wei Jang
Sarah Miller
The End!!!!
