
StatQuest:

A Gentle Introduction To RNA-seq

© 2017 Joshua Starmer, http://statquest.org, https://youtu.be/tlf6wYJrwKY


= a normal neural cell

A bunch of normal neural cells.

= a normal neural cell = a mutated neural cell

A bunch of normal neural cells.
A bunch of mutated neural cells.

The mutated cells behave differently than the normal cells.

We want to know what genetic mechanism is causing the difference…

This means we want to look at differences in gene expression.

Each cell has a bunch of chromosomes…

Gene1 Gene2 Gene3

Each chromosome has a bunch of genes…


Some genes are active…

These wavy lines represent mRNA transcripts.

…but this gene is not active.

High-throughput sequencing tells us which genes are active, and how much they are transcribed.

We can use RNA-seq to measure gene expression in normal cells…

…then use it to measure gene expression in mutated cells…

Then we can compare the two cell types and figure out what’s different in the mutated cells.

Gene1: No difference between normal and mutated cells.



Gene2: A big difference between normal and mutated cells.



Gene3: A subtle difference between normal and mutated cells.


3 Main Steps for RNA-Seq:

1) Prepare a sequencing library

2) Sequence

3) Data analysis



Preparing an RNA-seq library

NOTE: I’m using the Illumina protocol and sequencer as my example because they are commonly used, but keep in mind there are other protocols and sequencers that do it differently.

Step 1: Isolate the RNA

Step 2: Break the RNA into small fragments.

We do this because RNA transcripts can be thousands of bases long, but the sequencing machine can only sequence short (200-300 bp) fragments.

Step 3: Convert the RNA fragments into double stranded DNA.

Double stranded DNA is more stable than RNA and can be easily amplified and modified. This leads us to the next step…

Step 4: Add sequencing adaptors.

The adaptors do two things:

1) Allow the sequencing machine to recognize the fragments.

2) Allow you to sequence different samples at the same time, since different samples can use different adaptors.

Notice that this step doesn’t work 100% of the time.

Step 5: PCR amplify.

Only the fragments with sequencing adapters are amplified; they are enriched.

Step 6: QC

1) Verify library concentration

2) Verify library fragment lengths

Hooray! Now we sequence the library!

Let’s see how this is done…

NOTE: I’m using the Illumina sequencer as my example because it is commonly used, but keep in mind there are other machines that do it differently.

Imagine this is a fragment of DNA we want to sequence…

[Figure: a DNA fragment drawn vertically, reading G, C, A, G, C, A, C, A from top to bottom]

It’s vertical, because that’s how it is inside the sequencer.

Actually, there are about 400,000,000 fragments laid out vertically in a grid.

I’m just showing you 4 fragments so your brain doesn’t explode.

[Figure: 4 DNA fragments standing vertically on a grid]

This grid is called a “flow cell”.

The machine has fluorescent probes that are color coded according to the type of nucleotide they can bind to.

[Figure: the flow cell with a color legend for the A, G, C and T probes]

The probes are attached to the first base in each sequence.

Once the probes have attached, the machine takes a picture of the flow cell from above that looks like this…

[Figure: picture of the flow cell from above, one colored dot per fragment]

The picture tells the machine that the first base in the bottom left-hand corner is an “A”.

This base is a “G”.

These two bases are “C”.

Then the machine washes the color off of the probes…

Then probes are bound to the next base in each fragment.

The machine takes a picture from above…

And now it knows that this base is “C”.

This base is “G”.

These two bases are “T”.

Then the machine washes the color off of the probes…

And the process repeats until the machine has determined each sequence of nucleotides.

This is how it works with 4 DNA fragments.


With 400,000,000 DNA fragments, the matrix is much denser.
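The probe, picture, wash cycle described above can be sketched as a toy simulation. This is only an illustration, not how the instrument's software works: the first fragment is the one drawn earlier, the other three are made up, and "taking a picture" is modeled as simply looking at one position in every fragment.

```python
# Toy model of cycle-by-cycle sequencing. The first fragment is the one from
# the slides; the other three are hypothetical stand-ins.
fragments = ["GCAGCACA", "TACGTTAG", "CATTACGG", "ACCGATTC"]


def sequence_by_cycles(fragments):
    """Build up one read per fragment, one base per cycle."""
    reads = ["" for _ in fragments]
    n_cycles = max(len(f) for f in fragments)
    for cycle in range(n_cycles):
        # "Take a picture": observe the base at this position in every fragment.
        picture = [f[cycle] if cycle < len(f) else None for f in fragments]
        for i, base in enumerate(picture):
            if base is not None:
                reads[i] += base
        # "Wash": nothing to do in this simulation; the next cycle moves on.
    return reads


reads = sequence_by_cycles(fragments)
print(reads)
```

The point of the sketch is that every cycle adds one base to every read at the same time, so the number of cycles sets the read length.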
This matrix still isn’t 400,000,000 DNA fragments, but
it illustrates one type of problem that can occur.

Sometimes a probe will not shine as bright as it should and the machine isn’t super confident that it is calling the correct color.

Quality scores, which are part of the output, reflect how confident the machine is that it correctly called a base.

In this case, the faded dot would get a low quality score.

Another reason you might get a low quality score is when there are lots of probes of the same color in the same region.

This is called “low diversity”, and the overabundance of a single color can make it hard to identify the individual sequences; the colors will blur together.

“Low diversity” is especially a problem when the first few nucleotides are sequenced, because that is when the machine determines where the DNA fragments are located on the grid.
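One rough way to get a feel for “low diversity” is to ask, for each cycle, what fraction of the reads share the most common base. This is a sketch under assumptions: the reads are made up, the function name is hypothetical, and the real machine works on images of the flow cell rather than on finished reads.

```python
from collections import Counter


def dominant_base_fraction(reads, cycle):
    """Fraction of reads that share the most common base at a given cycle."""
    bases = [r[cycle] for r in reads if cycle < len(r)]
    if not bases:
        return 0.0
    most_common_count = Counter(bases).most_common(1)[0][1]
    return most_common_count / len(bases)


# Hypothetical reads that all start with the same 3 bases.
reads = ["ACGTTGCA", "ACGGATCC", "ACGCCTAG", "ACGATGGT"]
fractions = [dominant_base_fraction(reads, c) for c in range(5)]
print(fractions)
```

A fraction close to 1.0 in the first few cycles means almost every spot on the grid lights up the same color, which is the situation described above.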

The raw data…
@NS500177:196:HFTTTAFXX:1:11101:10916:1458 2:N:0:CGCGGCTG
ACACGACGATGAGGTGACAGTCACGGAGGATAAGATCAATGCCCTCATTAAAGCAGCCGGTGTAA
+
AAAAAEEEEEEEEEEE//AEEEAEEEEEEEEEEE/EE/<<EE/AAEEAEE///EEEEAEEEAEA<

Each sequencing “read” consists of 4 lines of data.

The first line (which always starts with ‘@’) is a unique ID for the sequence that follows.

The second line contains the bases called for the sequenced fragment.

The third line is always a “+” character.

I have no idea why...

I asked the internet and I don’t think it knows either…

The fourth line contains the quality scores for each base in the sequenced fragment.

A typical sequence run with 400,000,000 reads will generate a file containing 1.6 billion lines of data!!!
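Here is a minimal sketch of reading the 4-line records described above and turning the fourth line into numbers. It assumes the quality characters use the common Phred+33 encoding (score = ASCII code minus 33); the slides do not state the encoding, so treat the `offset=33` default and the function names as assumptions.

```python
import io


def read_fastq(handle):
    """Yield (read_id, bases, qualities) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        bases = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quals = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record")
        yield header[1:], bases, quals


def phred_scores(quals, offset=33):
    """Decode quality characters, assuming the Phred+33 encoding."""
    return [ord(ch) - offset for ch in quals]


# The example read from the slides, wrapped in an in-memory file.
example = io.StringIO(
    "@NS500177:196:HFTTTAFXX:1:11101:10916:1458 2:N:0:CGCGGCTG\n"
    "ACACGACGATGAGGTGACAGTCACGGAGGATAAGATCAATGCCCTCATTAAAGCAGCCGGTGTAA\n"
    "+\n"
    "AAAAAEEEEEEEEEEE//AEEEAEEEEEEEEEEE/EE/<<EE/AAEEAEE///EEEEAEEEAEA<\n"
)
records = list(read_fastq(example))
read_id, bases, quals = records[0]
scores = phred_scores(quals)

# 4 lines per read times 400,000,000 reads.
lines_in_run = 4 * 400_000_000
```

Under that encoding, “E” decodes to 36, “A” to 32, “<” to 27 and “/” to 14, so the “/” characters mark the least confident base calls in this read.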

Now that we understand the raw data and how it’s generated, we need to:

• Filter out garbage reads

• Align the high quality reads to a genome

• Count the number of reads per gene

Filter out garbage reads
Garbage reads are:
1) Reads with low quality base calls
2) Reads that are clearly artifacts of the chemistry.

A typical read is a DNA fragment…

…plus adapter sequences…

…but sometimes the adapters just bind to each other and the “read” is just adapter sequence.

This is a garbage read…
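The two kinds of garbage reads above can be sketched as a simple filter. Everything specific here is an assumption for illustration: the mean-quality threshold of 20, the Phred+33 encoding, the adapter string (a placeholder, not necessarily the adapter used in any real library), and the function names.

```python
# Placeholder adapter sequence, used only for illustration.
ADAPTER = "AGATCGGAAGAGC"


def mean_quality(quals, offset=33):
    """Average quality score of a read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quals) / len(quals)


def is_garbage(bases, quals, adapter=ADAPTER, min_mean_quality=20):
    """Flag reads with low quality calls or reads that are just adapter."""
    # 1) Low quality base calls (an empty read also counts as garbage).
    if not bases or mean_quality(quals) < min_mean_quality:
        return True
    # 2) Chemistry artifact: the read begins with the adapter itself.
    return bases.startswith(adapter)


good = ("ACACGACGATGAGGTG", "EEEEEEEEEEEEEEEE")
low_quality = ("ACACGACGATGAGGTG", "################")
adapter_only = (ADAPTER + "ACGT", "E" * (len(ADAPTER) + 4))

kept = [r for r in (good, low_quality, adapter_only) if not is_garbage(*r)]
```

Real filtering tools usually trim adapters and low quality ends rather than only dropping whole reads; this sketch just mirrors the two categories listed above.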
Align the reads to a genome

Genome:

The genome sequence:

gattacataccagga…

Split into small fragments (for reasons that will be explained in a little bit):

gattac attaca ttacat
tacata acatac catacc
atacca taccag accagg
ccagga cagga…

Index of all the fragments and locations

A sequenced read:

ACACGACGATGAG...

Split the read into fragments:

ACACGA CGACGA
CACGAC GACGAT
ACGACG ACGATG

Match the read fragments to the genome fragments.

The genome fragments that matched the read fragments will determine a location (chromosome and position) in the genome.

Why are we breaking the sequences up into small fragments?

It allows us to align reads even if they are not exact matches to the “reference” genome.

Imagine this base wasn’t in the reference genome (because, for example, my genome is slightly different from yours).

Then this fragment won’t match anything in the index, but the other fragments will, and we will still be able to figure out where the read came from.
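The index-and-match idea above can be sketched in a few lines. This is a toy version under assumptions: fragments of length 6 (as in the slides), a voting scheme I chose for illustration, and made-up reads taken from the example genome. Real aligners use far more compact indexes and also check the full alignment.

```python
from collections import Counter, defaultdict

K = 6  # fragment length used in the slides


def fragments_of(seq, k=K):
    """All overlapping length-k fragments of seq, with their start offsets."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(genome, k=K):
    """Map each genome fragment to the positions where it occurs."""
    index = defaultdict(list)
    for pos, frag in fragments_of(genome, k):
        index[frag].append(pos)
    return index


def locate(read, index, k=K):
    """Let each matching read fragment vote for where the read starts."""
    votes = Counter()
    for offset, frag in fragments_of(read, k):
        for pos in index.get(frag, []):
            votes[pos - offset] += 1
    return votes.most_common(1)[0][0] if votes else None


genome = "gattacataccagga"
index = build_index(genome)

exact_read = "tacataccag"     # copied from the genome, starting at position 3
variant_read = "tacataccat"   # same read, but the last base differs

print(locate(exact_read, index), locate(variant_read, index))
```

In the variant read, the one fragment that covers the changed base finds nothing in the index, but the other four fragments still agree on the same starting position, which is the reason given above for breaking sequences into small fragments.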

Count reads per gene
Once we know the chromosome and position for a read, we can see if it falls within the coordinates of a gene (or some other interesting feature).

Xkr4 – Chromosome 1, position: 3204563-3661579

Rp1 – Chromosome 1, position: 4280927-4399322

etc. (for all 20,000 genes in the genome)
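Counting can be sketched as a lookup of each read's position against the gene coordinates. The two genes are the ones listed above; the aligned read positions are made up, and treating both ends of a gene's range as inclusive is an assumption.

```python
from collections import Counter

# Gene coordinates from the slide: (chromosome, start, end).
genes = {
    "Xkr4": ("1", 3204563, 3661579),
    "Rp1": ("1", 4280927, 4399322),
}


def gene_for(chrom, pos):
    """Return the gene whose coordinates contain this position, if any."""
    for name, (g_chrom, start, end) in genes.items():
        if chrom == g_chrom and start <= pos <= end:
            return name
    return None


# Hypothetical aligned reads as (chromosome, position) pairs.
aligned = [
    ("1", 3300000),
    ("1", 3661579),
    ("1", 4300000),
    ("1", 5000000),  # falls between genes, so it is not counted
    ("2", 3300000),  # right position, wrong chromosome
]

counts = Counter()
for chrom, pos in aligned:
    name = gene_for(chrom, pos)
    if name is not None:
        counts[name] += 1

print(dict(counts))
```

With 20,000 genes and hundreds of millions of reads, a linear scan like this would be too slow, so real counting tools use sorted or tree-based lookups; the logic per read is the same.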

Gene Sample1 Sample2 Sample3…
A1BG 30 5 13…
A1BG-AS1 24 10 18…
A1CF 0 0 0…
A2M 5 9 7…
A2M-AS1 3563 5771 4123…
A2ML1 13 8 7…
... ... ... ...

After you count the reads per gene, you end up with a matrix of numbers like this…

© 2017 Joshua Starmer, http://statquest.org, https://youtu.be/tlf6wYJrwKY


Gene Sample1 Sample2 Sample3…
A1BG 30 5 13…
A1BG-AS1 24 10 18…
A1CF 0 0 0…
A2M 5 9 7…
A2M-AS1 3563 5771 4123…
A2ML1 13 8 7…
... ... ... ...

The first column contains gene names.

The human genome has about 20,000


genes, so this matrix has about 20,000
rows. (We’re just looking at the first few!)
© 2017 Joshua Starmer, http://statquest.org, https://youtu.be/tlf6wYJrwKY
Gene Sample1 Sample2 Sample3…
A1BG 30 5 13…
A1BG-AS1 24 10 18…
A1CF 0 0 0…
A2M 5 9 7…
A2M-AS1 3563 5771 4123…
A2ML1 13 8 7…
... ... ... ...

The first column contains gene names.

The human genome has about 20,000


genes, so this matrix has about 20,000
rows. (We’re just looking at the first few!)
© 2017 Joshua Starmer, http://statquest.org, https://youtu.be/tlf6wYJrwKY
Gene Sample1 Sample2 Sample3…
A1BG 30 5 13…
A1BG-AS1 24 10 18…
A1CF 0 0 0…
A2M 5 9 7…
A2M-AS1 3563 5771 4123…
A2ML1 13 8 7…
... ... ... ...

The remaining columns contain counts for each


sample you sequenced.

There are usually between 6 and 800 samples.

© 2017 Joshua Starmer, http://statquest.org, https://youtu.be/tlf6wYJrwKY


Gene Sample1 Sample2 Sample3…
A1BG 30 5 13…
A1BG-AS1 24 10 18…
A1CF 0 0 0…
A2M 5 9 7…
A2M-AS1 3563 5771 4123…
A2ML1 13 8 7…
... ... ... ...

The remaining columns contain counts for each


sample you sequenced.

There are usually between 6 and 800+ samples.

© 2017 Joshua Starmer, http://statquest.org, https://youtu.be/tlf6wYJrwKY


Gene Sample1 Sample2 Sample3…
A1BG 30 5 13…
A1BG-AS1 24 10 18…
A1CF 0 0 0…
A2M 5 9 7…
A2M-AS1 3563 5771 4123…
A2ML1 13 8 7…
... ... ... ...

“Bulk” RNA-seq, where a “sample” is the average


of a pool of cells (usually 6 million cells), might
The remaining columns contain counts for each
have 3 “normal” samples and 3 “disease state” sample you sequenced.
samples, or 6 total.
There are usually between 6 and 800+ samples.

© 2017 Joshua Starmer, http://statquest.org, https://youtu.be/tlf6wYJrwKY


Gene Sample1 Sample2 Sample3…
A1BG 30 5 13…
A1BG-AS1 24 10 18…
A1CF 0 0 0…
A2M 5 9 7…
A2M-AS1 3563 5771 4123…
A2ML1 13 8 7…
... ... ... ...

“Single-cell” RNA-seq
treats each cell like an
The remaining columns contain counts for each
individual sample, so it can sample you sequenced.
generate a lot of samples.
There are usually between 6 and 800+ samples.

Each row gives counts, per sample, for a specific gene.

If this were a single-cell RNA-seq experiment, we would have 20,000 rows (genes) by 800+ columns (samples), giving us at least 16 million values to keep track of.

That’s a huge matrix, and it’s only going to get bigger, since sequencing keeps getting cheaper and people are sequencing more and more samples.
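
To make the size concrete, here is a minimal Python sketch of the count matrix stored as one row per gene. The counts are the first few rows from the table above; the names `counts` and `matrix_shape` are made up for this illustration, and 20,000 by 800 are just the round numbers from the text, not a real dataset.

```python
# Toy count matrix: one row per gene, one count per sample.
# Values are the first few rows shown in the table above.
samples = ["Sample1", "Sample2", "Sample3"]
counts = {
    "A1BG":     [30, 5, 13],
    "A1BG-AS1": [24, 10, 18],
    "A1CF":     [0, 0, 0],
    "A2M":      [5, 9, 7],
    "A2M-AS1":  [3563, 5771, 4123],
    "A2ML1":    [13, 8, 7],
}


def matrix_shape(matrix):
    """Return (number of genes, number of samples)."""
    n_genes = len(matrix)
    n_samples = len(next(iter(matrix.values()))) if matrix else 0
    return n_genes, n_samples


print(matrix_shape(counts))  # (6, 3)

# Round numbers from the text: about 20,000 genes by 800 single cells.
print(20_000 * 800)          # 16000000
```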

The last thing we do before analysis is normalize the data.

This is because each sample will have a different number of reads assigned to it: one sample might have more low-quality reads, or another sample might have a slightly higher concentration on the flow cell.

Here’s an example:

Gene        Sample #1      Sample #2
            (635 reads)    (1,270 reads)
A1BG        30             60
A1BG-AS1    24             48
A1CF        0              0
A2M         563            1,126
A2M-AS1     5              10
A2ML1       13             26

Sample #1 has 635 reads assigned to it.

Sample #2 has 1,270 reads assigned to it, twice as many reads as Sample #1.

This does not mean that the genes in Sample #2 were all transcribed twice as much as in Sample #1. Instead, it could simply mean that Sample #2 had fewer low-quality reads or landed on more spots on the flow cell than Sample #1.

However, the read counts make it look like the genes in Sample #2 were transcribed twice as much as in Sample #1.

So we need to adjust the read counts per gene to reflect differences in how many reads were assigned to each sample.

The simplest method is to just divide the read counts per gene by the total number of reads mapped to each sample.

However, there are many more sophisticated ways to do this.
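
Here is a minimal Python sketch of that simplest method, using the two-sample example above. The function name `normalize_by_total` is made up for this illustration, and this is only the divide-by-total approach, not one of the more sophisticated methods.

```python
# Read counts from the two-sample example above: (Sample #1, Sample #2).
counts = {
    "A1BG":     (30, 60),
    "A1BG-AS1": (24, 48),
    "A1CF":     (0, 0),
    "A2M":      (563, 1126),
    "A2M-AS1":  (5, 10),
    "A2ML1":    (13, 26),
}


def normalize_by_total(matrix):
    """Divide each count by the total number of reads in its sample."""
    n_samples = len(next(iter(matrix.values())))
    totals = [sum(row[j] for row in matrix.values()) for j in range(n_samples)]
    normalized = {
        gene: tuple(row[j] / totals[j] for j in range(n_samples))
        for gene, row in matrix.items()
    }
    return totals, normalized


totals, normalized = normalize_by_total(counts)
print(totals)  # [635, 1270]
for gene, (s1, s2) in normalized.items():
    print(f"{gene:10s} {s1:.4f} {s2:.4f}")
```

After dividing, every gene has the same value in both samples (A1BG is about 0.047 in each), so the apparent two-fold difference goes away.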

For more details, see my StatQuest videos at StatQuest.com

We started out with a bunch of normal neural cells and a bunch of mutated neural cells.


Then we extracted the mRNA…

Then we sequenced, aligned, counted the reads per gene in each sample, and normalized.

Gene WT1 WT2 WT3…
A1BG 30 5 13…
A1BG-AS1 24 10 18…
... ... ... ...

Now it’s time to analyze the data!!!

Step 1 in any analysis is always the same:

Plot the data

Remember, the data is a huge matrix…

Gene Sample1 Sample2 Sample3…


A1BG 30 5 13…
A1BG-AS1 24 10 18…
A1CF 0 0 0…
A2M 5 9 7…
A2M-AS1 3563 5771 4123…
A2ML1 13 8 7…
... ... ... ...



If there were only two genes, then plotting the data would be easy.

Gene Sample1 Sample2 Sample3…


A1BG 30 5 13…
A1BG-AS1 24 10 18…

First, we’d replace the gene names with “X” and “Y”.

Gene Sample1 Sample2 Sample3…


X 30 5 13…
Y 24 10 18…

And then just plot the samples on an X/Y graph.

[Figure: X/Y graph with both axes marked 10, 20, 30; Sample1, Sample2, and Sample3 appear as labeled points.]
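
As a small Python sketch of this step, each sample becomes one (x, y) point, with the X count as its x-coordinate and the Y count as its y-coordinate. The variable names are made up for this illustration; the numbers are the ones in the two-gene table.

```python
# Two-gene version of the matrix, after renaming the genes "X" and "Y".
samples = ["Sample1", "Sample2", "Sample3"]
x_counts = [30, 5, 13]   # gene "X" (was A1BG)
y_counts = [24, 10, 18]  # gene "Y" (was A1BG-AS1)

# Each sample becomes one (x, y) point on the graph.
points = {name: (x, y) for name, x, y in zip(samples, x_counts, y_counts)}

for name, (x, y) in points.items():
    print(f"{name}: x={x}, y={y}")
```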

But we have 20,000 genes…

So we would need a graph with 20,000 axes to plot the raw data…

So we use PCA (principal component analysis) or
something like it to plot this data.

PCA reduces the number of axes you need to display the important aspects of the data.
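
To give a feel for what “reducing the number of axes” means, here is a rough Python sketch that finds a single axis (one weight per gene) along which the samples differ the most, and then gives each sample one number along that axis. It uses power iteration, which is just one simple way to find the first principal axis and is an assumption of this sketch; real analyses would use a library routine, and often work on normalized, log-transformed values rather than the raw counts used here. All function names are made up for this illustration.

```python
import math

# Rows are genes, columns are samples (first few rows of the table above).
counts = [
    [30, 5, 13],         # A1BG
    [24, 10, 18],        # A1BG-AS1
    [0, 0, 0],           # A1CF
    [5, 9, 7],           # A2M
    [3563, 5771, 4123],  # A2M-AS1
    [13, 8, 7],          # A2ML1
]


def center_rows(matrix):
    """Subtract each gene's mean so every gene is centered on 0."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered


def project(centered, axis):
    """Position of each sample along the given axis."""
    n_samples = len(centered[0])
    return [sum(w * row[j] for w, row in zip(axis, centered))
            for j in range(n_samples)]


def first_principal_axis(centered, n_iter=100):
    """Power iteration: repeatedly pull the axis toward the direction
    in which the samples are most spread out."""
    n_genes = len(centered)
    axis = [1.0 / math.sqrt(n_genes)] * n_genes
    for _ in range(n_iter):
        scores = project(centered, axis)
        new_axis = [sum(v * s for v, s in zip(row, scores)) for row in centered]
        norm = math.sqrt(sum(a * a for a in new_axis))
        if norm == 0.0:
            break  # no variation left to find
        axis = [a / norm for a in new_axis]
    return axis


centered = center_rows(counts)
axis = first_principal_axis(centered)
pc1 = project(centered, axis)  # one number per sample instead of six
print([round(v, 1) for v in pc1])
```

Because this sketch uses raw counts, the gene with the biggest counts (A2M-AS1) ends up with almost all of the weight on the axis.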

This is a PCA plot from a real RNA-seq experiment done on neural cells.
The “wt” samples are “normal”.
The “ko” samples are samples that were mutated by the researchers.

[Figure: PCA plot of samples wt1 to wt6 and ko1 to ko6. X-axis: Leading logFC dim 1; Y-axis: Leading logFC dim 2.]

The “ko” samples make a nice little cluster in the corner.
The “wt” samples are all on the left side, but spread out on the Y-axis.
The way these graphs are drawn, the most important differences are on the X-axis.
Differences along the Y-axis are not as important.
This means that the biggest differences are between the “wt” and the “ko” samples.
However, when we do further analyses, we may wish to exclude “wt2”, which sits far from the other “wt” samples.

Excessive Self Promotion!!!!
If you want to learn about how PCA does what it does, check out…

StatQuest: Principal Component Analysis (PCA) clearly explained

It’s on my YouTube Channel:

https://www.youtube.com/user/joshstarmer

You can find it by googling “StatQuest PCA”



This is a Single-Cell RNA-seq PCA plot from neural cells.

[Figure: PCA plot of single cells. X-axis: PC1; Y-axis: PC2. Cells are colored by annotation: Mig or non−Mig.]

The colors were added based on what we knew about the cells.

The green cells were stationary. The orange cells moved around
the petri dish.
Most of the orange cells are separated from the green cells.
However, there are a few orange cells that seem more like the green
cells.

If we want to determine what is different between one group of cells and the other, we might exclude the cells that do not fit their group (such as the few orange cells that seem more like the green cells) from the analysis.

In summary, plotting the data…

1) Tells us if we can expect to find interesting differences.

2) Tells us if we should exclude some samples from any downstream analysis.

Step 2) Identify differentially expressed genes between
the “normal” and “mutant” samples.

This is typically done using R with either edgeR or DESeq2, and the results are generally displayed using this sort of graph.

[Figure: scatter plot with one dot per gene. X-axis: logCPM; Y-axis: logFC.]

A red dot is a gene that is different between “normal” and “mutant” samples.
Black dots are genes that are the same.

The X-axis tells you how much each gene is transcribed. CPM stands for “counts per million”.

The Y-axis tells you how big the relative difference is between “normal” and “mutant”.

logFC = log(fold change)
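
Here is a bare-bones Python sketch of what the two axes measure. The counts are made up for a hypothetical gene (they are not from the slides), the function names are invented for this illustration, and using log base 2 is an assumption. edgeR and DESeq2 do considerably more than this (for example, they handle zero counts and noisy low-count genes), so treat this as the definitions only.

```python
import math


def cpm(count, total_reads):
    """Counts per million: reads for a gene per million reads in the sample."""
    return count * 1_000_000 / total_reads


def log_fold_change(normal_cpm, mutant_cpm):
    """log2 of the mutant/normal ratio (base 2 is an assumption here)."""
    return math.log2(mutant_cpm / normal_cpm)


# Hypothetical gene: made-up counts, not from the slides.
normal = cpm(30, 3_000_000)    # 10.0 CPM
mutant = cpm(120, 3_000_000)   # 40.0 CPM

log_cpm = math.log2((normal + mutant) / 2)  # X-axis: average expression
log_fc = log_fold_change(normal, mutant)    # Y-axis: relative difference

print(log_fc)  # 2.0, i.e. four times as much in the mutant
```

Note that this sketch breaks if either CPM is zero, which is one of the things the real tools take care of.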


We’ve identified interesting genes, now what?

1) If you know what you’re looking for, you can see if the
experiment validated your hypothesis.

2) If you don’t know what you’re looking for, you can see if
certain pathways are enriched in either the normal or
mutant gene sets.
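
The slides don’t say how enrichment is tested, so the following Python sketch is just one common way to ask the question: if we picked the same number of genes at random, how often would at least this many land in the pathway? That is a hypergeometric tail probability. The function name and all of the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genes, n_in_pathway, n_selected, n_overlap):
    """Probability of seeing at least n_overlap pathway genes when
    n_selected genes are drawn at random from n_genes, of which
    n_in_pathway belong to the pathway."""
    total = comb(n_genes, n_selected)
    upper = min(n_in_pathway, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_genes - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes, a pathway with 100 genes,
# 200 differentially expressed genes, 10 of which are in the pathway.
p = enrichment_p_value(20_000, 100, 200, 10)
print(p)
```

With these made-up numbers we would expect about 1 pathway gene by chance, so seeing 10 gives a very small probability. In practice you would also correct for testing many pathways at once.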
And then what?



Check out StatQuest!!!
Google: “StatQuest”

You will find complete tutorials on all kinds of stuff related to RNA-seq:

PCA
Heatmaps
Hierarchical Clustering
K-means Clustering
P-values
False Discovery Rates
Differential Gene Expression
Linear Models
Thanks to…
Terry Magnuson and his lab full of awesome people!!
Weipeng Mu
Jesse Rab
John Runge
Prabuddha Chakraborty
Dominic Ciavatta
Keri Smith
Cam Spear
Karl Shpargel
Della Yee
The Debu
Chuan-Wei Jang
Sarah Miller
Scott Magness and his lab full of awesome people!!!
All my collaborators and friends here at UNC!!!!!!
The End!!!!