
Analyzing Sequencing Electropherograms

-by Scott Nicholson-


Your sequence data arrives in two separate file formats: a .txt file and an .ab1 file. The .txt
file is just the sequence; the .ab1 is the electropherogram. It’s tempting to just copy the
sequence from the .txt file, BLAST it, and go. But you should always look at your sequencing
electropherograms. They will help you get a handle on the quality and reliability of your
sequence, and will reveal any problems in template purification or primer specificity. They will
also display any evidence of sequence heterozygosity or divergence. In cases of sequencing
failure, the electropherogram can guide you to success.

What are we looking at in the sequencing electropherogram?


The sequencing electropherogram is a chart of light emission over time. Your sample is processed
in a series of Sanger reactions that generate a pool of labelled fragments ranging in size from the
primer plus one base to however many bases are in your template (and occasionally beyond –
addressed below). The fragments are labelled with one of four dyes attached to a modified
nucleotide base. Look up Sanger Sequencing to find out more.
These fragments are run through a capillary. When they reach the end of the capillary, they’re
blasted with a laser. The dyes emit at different wavelengths, and the intensity of each of these
wavelengths is measured throughout the run at a frequency of about 3 Hz.

Now let’s do some exploration of how to interpret electropherograms and how they can be used
to solve your sequencing problems. Let’s start with an example of what you would really like to
see.
An Excellent Template:
Let’s start with an excellent template and primer run in an optimal reaction. The beginning,
middle, and end of the sequence are shown below:
Beginning:

In the beginning, things are kind of rough. We’re talking about very small fragments that tend to
produce messy peaks, so sequence before about base 30 is usually not reliable. From there on, it’s very
good. Look at the gray bars above the sequence identities – anything above the green dashed
line has a Q value over 20, which is considered very good (>99% probability of correct call). Note
the tall, narrow peaks – that’s what you want.
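To put numbers on those Q values, here is a minimal sketch. It assumes the Q values are Phred-scaled (Q = -10 x log10 of the error probability), which is consistent with the Q20 = 99% figure quoted above. The function names are just illustrative.

```python
import math


def phred_to_error(q):
    """Probability that a base call with Phred-scaled quality q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p_error):
    """Phred-scaled quality for a given probability of a wrong call."""
    return -10 * math.log10(p_error)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p_correct = 1 - phred_to_error(q)
        print(f"Q{q}: {100 * p_correct:.2f}% chance the call is correct")
```

Each step of 10 in Q is a tenfold drop in the chance of a wrong call, so a bar well above the green line is much better than one that just clears it.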
Middle:

When you get around 600 bases out, peaks start to decline in height a bit, but still produce very
reliable sequence data. Since we’re using dye terminator technology, extension stops wherever
a dye-labelled terminator is incorporated. The longer a fragment is, the fewer copies of it there
are going to be, so the peaks get smaller as the fragments get bigger.
End:

With an exceptional template, you can get up to 1,000 reliable bases and slightly beyond. Notice
here that peak quality begins to degrade after about 900 (the peaks become broad and jagged),
and the base Q value drops below acceptability after 1000. Anything beyond that should be
ignored; it’s low-confidence data.
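If you have the per-base Q values as a list of numbers, the trimming described above (drop the rough start, stop where quality falls apart) can be sketched roughly like this. The 30-base skip and the Q20 cutoff come from this guide; the 10-base sliding window and the function name are assumptions made for illustration, not what the basecalling software does.

```python
def reliable_span(quals, min_q=20, skip_start=30, window=10):
    """Return (start, end) of the bases worth keeping, end exclusive.

    Skips the first skip_start bases, then keeps going until the mean
    quality of a sliding window drops below min_q. Low-quality bases
    left dangling at the end of the span are trimmed off.
    """
    start = min(skip_start, len(quals))
    end = start
    for i in range(start, len(quals) - window + 1):
        chunk = quals[i:i + window]
        if sum(chunk) / window < min_q:
            break
        end = i + window
    # Walk back over any trailing bases that are individually below min_q.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


if __name__ == "__main__":
    # Synthetic read: rough start, good middle, degraded tail.
    quals = [10] * 30 + [45] * 100 + [8] * 40
    print(reliable_span(quals))
```

Using a window average rather than a hard per-base cutoff means one stray low-Q base in the middle of a good read doesn't end the read early.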

Let’s also look at the raw data from this run:

The Y-axis is intensity, and the X-axis is time expressed in scan numbers (the instrument scans
about 3 times per second). The red horizontal curved line is just a hand-drawn general trend
line to demonstrate the relative shape of a good capillary trace. With most any protocol, it
takes about 2400 scans for the initial products to travel down the capillaries to the detector.
Then data is constantly recorded as each reaction product passes across the detector. You can
see that the peaks are still pretty high (about 2000-2500) at the end of the run, and there were
probably quite a few even longer fragments left when recording stopped. These fragments are
not worth reading, however, because their peaks are too broad to produce reliable basecalling
(the peak resolution with larger fragments is very low – you can see this by comparing peaks in
scans 2500-15500 with the peaks afterward – peak broadening is very noticeable after about
15500).
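Scan numbers are easier to think about as run time. A quick conversion, assuming the "about 3 scans per second" rate mentioned above holds for the whole run (it is approximate, so treat the results as ballpark figures):

```python
SCAN_RATE_HZ = 3.0  # approximate scan rate quoted in this guide


def scans_to_minutes(scan_number, rate_hz=SCAN_RATE_HZ):
    """Convert a scan number on the raw-data X-axis to elapsed minutes."""
    return scan_number / rate_hz / 60.0


if __name__ == "__main__":
    for scan in (2400, 15500):
        print(f"scan {scan}: about {scans_to_minutes(scan):.1f} min into the run")
```

At that rate, the first products show up a bit over 13 minutes in, and peak broadening becomes obvious at roughly 86 minutes.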
Sequencing Problems We Have Seen, and their solutions (if any).
DNA contamination

This is a mess, and none of the bases called are reliable. What you have is a combination of
sample contamination, mixed templates (there are multiple primer-binding sites in the sample),
and low DNA concentration. Note that dye blobs are also present in this sample (the large broad
peaks around 80 and 110, explained below). This sample somehow retained dye blobs through
multiple purifications, which is excellent evidence of some sort of severe contamination. The
reason for this was not determined, but switching sample prep methods eliminated this artefact
later on. Taking a look at the raw data for this run (in FinchTV, go to view > raw data), you see
this:

The Y-axis is intensity, and the X-axis is time expressed in scan numbers (the instrument scans
about 3 times per second). The peak intensity averages about 500 – 1000, which is generally ok,
but far from ideal (see the first example of good raw data above for comparison). The relatively
low peak intensity is a result of low DNA concentration combined with sample contamination.
Lower peak intensities than this can still provide excellent sequence data, but not here. The
problem is that too many peaks are superimposed on one another, and the contamination
likely interfered with the migration patterns. The result is a mess: no set of peaks is clearly
separated from another, and the raw data does not present an orderly
progression of peaks for the software to process. The cure for this is to generate more of your
template, gel purify, and make sure you remove any trace of alcohol, EDTA, or other reaction
inhibitors from your final prep.
Very Low DNA Concentration

This template showed dye blobs and also had very low base quality, but did extend out past 500
bases. Notice the very noisy baseline and erratic peak height. Let’s look at the raw data:

The most apparent peaks here are the dye blobs; they range up to 2500 intensity units. The
sequence we’re seeing is so faint that it just barely rises above the baseline noise. That’s why
the sequence quality is so low, and also why it looks so noisy. The verdict here is low DNA
concentration. Take your sample to the speed-vac, or try the prep again (the better choice). Also
remember, quantification readings below about 20 ng/ul should be viewed with suspicion. As a
test, load some of your blank on your nanodrop and zap it a few times. What sort of readings do
you get, and how much do they fluctuate? If your readings jump from 0 to 11 to -7 ng/ul, your
error is + or – 9 ng/ul (half the 18 ng/ul spread). Watch out for stuff like that. It’s always better to dilute from a high
concentration.
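The blank test above is simple arithmetic: take half the spread between the highest and lowest blank readings. A small sketch (the function names are made up for illustration) that also shows how big that error is relative to a low reading:

```python
def blank_noise(readings):
    """Half the spread between the highest and lowest blank readings (ng/ul)."""
    return (max(readings) - min(readings)) / 2


def relative_error(concentration, noise):
    """Noise as a fraction of a measured concentration."""
    return noise / concentration


if __name__ == "__main__":
    noise = blank_noise([0, 11, -7])
    print(f"blank noise: +/- {noise:.0f} ng/ul")
    for conc in (20, 100, 500):
        pct = 100 * relative_error(conc, noise)
        print(f"{conc} ng/ul reading: +/- {pct:.0f}%")
```

With a noise of 9 ng/ul, a 20 ng/ul reading could be off by nearly half, while a 500 ng/ul reading is off by a couple of percent. That is why diluting down from a high concentration is the safer route.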
No Template:
Example A: Seeing Nothing at All

Example A: Raw Data

There’s an old song about the one you love saying nothing at all, and it’s very sad. Seeing nothing
at all on your electropherogram is also very sad. We see nothing but random noise here, and the
peaks are very short. A look at the raw data shows only random noise that varies plus or minus
about 50 intensity units from 0 (Note the y-axis scale – this baseline is magnified compared to
other raw data plots, as there were no peaks at all). There’s no labelled product at all, because
there’s no DNA. A quantitation of the sample showed nothing. This customer re-prepped the
sample and it turned out fine, once there was actually DNA in the sample.
Example B: Nothing at all combined with primer self-complementarity

Here’s a classic example of the absence of template. The template was generated using the same
primers used to sequence, but there was no sequence data at all. There were fairly nice initial
peaks and nothing else. A look at the raw data showed:

What we’re seeing here is a case of primer self-complementarity: the primer had a little
self-homology and ran off a short copy of its 3’ end. The primer didn’t have an extremely high Tm,
and the Tm of the primer’s secondary structure was well below the annealing temperature of the
reaction. Further, the customer was able to generate a band (although a faint and fuzzy one)
using PCR with those same primers and template. What’s the deal here? We quantified the DNA,
and found that the sample had huge absorbance at 230 nm, and the only absorbance at 260 nm was part
of the shoulder of the 230 peak. What absorbs at 230? Guanidine HCl and phenol, for two,
both of which are components of many DNA isolation procedures. Here, a gel extraction procedure,
which has poor yield anyway, was done on the weak band, and only one wash was done. So,
most likely, some guanidine HCl but little to no DNA was eluted. Guanidine HCl denatures
protein, but there was relatively little in the sample, and the sample itself only made up 15% of
the final reaction mixture, so the enzyme was still active at some level. Since the primer didn’t
have any template to bind to, it bound to itself. The lesson here is to follow your kit instructions
precisely.
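You can catch this sort of thing from the absorbance readings before you submit. The sketch below relies on two widely used rules of thumb that are not from this guide: an A260 of 1.0 at a 1 cm path length corresponds to roughly 50 ng/ul of double-stranded DNA, and clean DNA usually gives an A260/A230 ratio of about 2.0 or more. The 1.8 cutoff and the function names are assumptions for illustration only.

```python
def dsdna_ng_per_ul(a260, path_cm=1.0):
    """Rough dsDNA concentration from A260 (about 50 ng/ul per A260 unit at 1 cm)."""
    return a260 * 50.0 / path_cm


def looks_contaminated(a260, a230, min_ratio=1.8):
    """True if A260/A230 falls below min_ratio (an assumed rule-of-thumb cutoff)."""
    if a230 <= 0:
        raise ValueError("A230 must be positive to compute a ratio")
    return a260 / a230 < min_ratio


if __name__ == "__main__":
    # A sample like the one above: big 230 peak, almost nothing at 260.
    print(looks_contaminated(a260=0.1, a230=1.5))
    # A cleaner prep for comparison.
    print(looks_contaminated(a260=1.0, a230=0.45))
    print(f"{dsdna_ng_per_ul(0.5):.0f} ng/ul")
```

A low ratio doesn't tell you which contaminant you have, only that something absorbing at 230 is swamping your DNA signal, so the concentration reading itself shouldn't be trusted.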
Too Much Template:

A condensed view – Note the quality bars at top. This was a very clean sample, and produced
very nice peaks to begin with. However, the sequence was only good out to about 600 in the
best cases, and only went out to about 350 most of the time. The DNA concentration was about
15X higher than we ask for. Reducing the DNA sample concentration lengthened the reads
considerably.
Too Little Template:
My template was 1 KB, but I’m only getting 300 bases! What gives?

Well, let’s look at the peak intensity in the raw data:

Notice how the peak intensities (neglecting initial peaks, they’re always stronger) start out at
about 200 and then decline to nothing? Compare that with the other examples in this guide.
Your sample concentration is too low. In this case, the DNA quality is excellent, so concentrating
the sample by speed-vac should work nicely. The concentration of this sample was less than 10%
of the required concentration.
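If you export the raw intensities as a list of numbers, a crude way to see this kind of fade-out is to take the tallest point in each block of scans and watch the trend. This is only a sketch: the 500-scan block size and the intensity floor of 100 are arbitrary assumptions, and it makes no attempt to skip the strong initial peaks.

```python
def window_peaks(trace, window=500):
    """Tallest intensity in each consecutive block of `window` scans."""
    return [max(trace[i:i + window]) for i in range(0, len(trace), window)]


def fades_out(trace, window=500, floor=100):
    """True if the trace starts at or above `floor` and ends below it."""
    peaks = window_peaks(trace, window)
    return len(peaks) > 1 and peaks[-1] < floor <= peaks[0]


if __name__ == "__main__":
    # Synthetic trace that starts near 200 and decays to nothing.
    weak = [max(0, 200 - i // 10) for i in range(3000)]
    # Synthetic trace that stays strong throughout.
    strong = [2000] * 3000
    print(window_peaks(weak), fades_out(weak))
    print(fades_out(strong))
```

A trace that starts low and runs out of signal long before the end of the template points to too little DNA rather than a problem with the DNA itself.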
Little to No Primer, or wrong primer:

This signal is 100% noise, with no amplification at all. There is a little unbound dye that got by
the column, but that’s all there is. Either the primer is absent or practically absent, or the primer
doesn’t bind at all to either the template or itself. In this case, the primer was over-diluted.
Remember, 5 pmol/uL = 5 nmol/mL = 5 umol/L = 5 uM. Your 20 uM PCR primer stock is 20
pmol/uL. Diluting it fourfold (1 part stock to 3 parts water) will bring your primers to the correct concentration.
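The 20 uM to 5 uM dilution above is a plain C1 x V1 = C2 x V2 calculation. A small sketch, with illustrative function names, that works for any stock and final volume:

```python
def dilution_volumes(stock_um, target_um, final_ul):
    """Return (ul of stock, ul of water) to make final_ul at target_um.

    Uses C1 * V1 = C2 * V2. Because 1 uM equals 1 pmol/ul, the same
    numbers work if you think in pmol/ul instead of uM.
    """
    if target_um > stock_um:
        raise ValueError("cannot dilute to a higher concentration")
    stock_ul = target_um * final_ul / stock_um
    return stock_ul, final_ul - stock_ul


def pmol_in_volume(conc_um, volume_ul):
    """Picomoles of primer in volume_ul of a conc_um solution."""
    return conc_um * volume_ul


if __name__ == "__main__":
    stock_ul, water_ul = dilution_volumes(stock_um=20, target_um=5, final_ul=100)
    print(f"{stock_ul:.0f} ul stock + {water_ul:.0f} ul water")
    print(f"{pmol_in_volume(5, 2):.0f} pmol in 2 ul of the diluted primer")
```

For 100 ul of working primer that comes to 25 ul of stock plus 75 ul of water.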
Unknown combination of template secondary structure, primer secondary structure, and
contamination:
Here we have the beginning of a sequence, but it declines rapidly and nothing can be determined.
This is probably due to secondary structure in the primer and template that prevents
elongation, combined with sample contamination by inhibitors. Note the exponential decline in
peak height over a very short distance. Not good. Come see us and let’s see what else we can
do.
Very Short Templates:

Here we have a short sequence that amplified nicely, but was too short to call using the standard
base-calling settings – you’ll notice there is no sequence information above the peaks, but the
peaks are nicely formed. We will re-analyze this sort of sequence using an alternate base-calling
procedure.
High G percentage

Notice how the signal strength declines immediately once the enzyme hits the poly-G sequence.
This can be remedied by using our high-GC protocol.
High G percentage Combined with Heterozygosity:

Here we have a precipitous decline in peak height and quality (see gray bars above base identity)
due to the presence of a sequence consisting mostly of G. We have a high-GC protocol that
usually takes care of this, but other DNA quality and concentration issues can complicate the
issue even when using that protocol (like what happened here). Also notice the presence of
heterozygosity (red circles). This was a very difficult sequence, and this was the best it got.
Poly-T (or A) homopolymer:
Example 1
Example 2

Notice how the peaks are fine to begin with, but go crazy right after the poly-T region. This is due
to “polymerase slip”. The sequence afterward is pretty bad. That’s not that many T’s, and it
doesn’t seem fair, but we all know the answer to that complaint.
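If you already have a reference or a read from the other direction, you can check ahead of time for stretches likely to cause slippage. Here is a minimal sketch; the 8-base minimum run length is an arbitrary assumption (this guide doesn't give a number), and the function name is illustrative.

```python
import re


def homopolymer_runs(seq, min_len=8, bases="AT"):
    """Find runs of a single base at least min_len long.

    Returns a list of (start index, base, run length) tuples.
    """
    # Capture one base from `bases`, then require min_len - 1 or more repeats of it.
    pattern = re.compile(r"([%s])\1{%d,}" % (bases, min_len - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    seq = "ACGC" + "T" * 10 + "GCAGGCTA"
    for start, base, length in homopolymer_runs(seq):
        print(f"poly-{base} run of {length} starting at index {start}")
```

Sequencing from the opposite strand so the run sits near the end of the read is a common workaround, since everything before the homopolymer is still fine.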
Sequence Data Continues After The End Of Your Template:
This happens occasionally. You see sequence data that follows your primary sequence, but
peak height is greatly diminished.

See those very short peaks past the natural end of the sequence? Let’s zoom in:

There it is: a small amount of contaminating sequence is producing peaks just tall enough to
produce a string of poor, badly resolved data. Notice the Q value bars are all lousy. Now you know
your prep contains an impurity. If it’s a plasmid prep, you picked up some
contamination somewhere (did you re-use a tip, perhaps?). If it’s a PCR prep, your primer most
likely binds poorly at another site. If you saw any additional bands on your gel, there you go. This
is not a problem as long as you delete the offending sequence. It can become a problem if the
contaminant sequence intensity is >10% of the primary template intensity (see Heterozygosity
in your Target Sequence below). Keep it clean.
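The 10% figure above can be checked position by position if you have the four peak heights at each base. The sketch below assumes you have already extracted those heights into one dictionary per position; how you get them out of the trace file is left open, and the function names are illustrative.

```python
def secondary_ratio(peaks):
    """Second-tallest peak height divided by the tallest, at one position.

    `peaks` maps each base ("A", "C", "G", "T") to its peak height.
    """
    heights = sorted(peaks.values(), reverse=True)
    if heights[0] <= 0:
        return 0.0
    return heights[1] / heights[0]


def contaminated_positions(positions, limit=0.10):
    """Indices where the secondary peak exceeds `limit` of the primary peak."""
    return [i for i, p in enumerate(positions) if secondary_ratio(p) > limit]


if __name__ == "__main__":
    positions = [
        {"A": 1000, "C": 30, "G": 20, "T": 10},   # clean call
        {"A": 50, "C": 900, "G": 200, "T": 10},   # G is about 22% of C
    ]
    print(contaminated_positions(positions))
```

A handful of flagged positions is nothing to worry about, but a long stretch of them means the second template is strong enough to interfere with basecalling.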
Bleed-Over from Very Strong Sample in Adjacent Capillary:

This happens rarely. You see very high quality sequence that follows your primary sequence, but
no peaks. What the What!!? But, if you increase the peak magnification, here’s what you’ll see:

Those teensy, tinsy little peaks are an actual sequence, but it’s not from your sample at all. It’s
the detector picking up an emission profile from a very intense sample running in an adjacent
capillary. This is only a problem if you include this sequence in your analysis. The practical
solution is identical to what you do in the case of a low-quantity contaminating sequence: Find
out where your sequence ends and delete the rest.
Heterozygosity in your Target Sequence:
Example 1 (below): Heterozygosity from one point that extends throughout the remaining
sequence.

Example 2 (below): Heterozygosity within a definite region that resolves after a number of bases
– Heterozygosity is confined to this region.

Example 3 (below): Single Nucleotide Polymorphism. Notice how the circled region contains two
separate peaks in the same spot. The taller peak is called (T), but you can see there is a C that is
nearly the same height, and in the same place. This means that you have two alleles, one
containing a T, and one substituting the T with a C (or vice versa) at the same location. Also notice
that the Q value for that base is low; this is because neither peak really stands out.
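Rather than accepting the taller of two nearly equal peaks, you can record a site like this with the standard IUPAC ambiguity code (Y for C or T, R for A or G, and so on). A minimal sketch, assuming you have the four peak heights at the position; the 0.5 ratio cutoff for calling a site heterozygous is an arbitrary assumption, not a value from this guide.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def call_base(peaks, min_ratio=0.5):
    """Call one position, using an ambiguity code for a strong double peak.

    `peaks` maps each base ("A", "C", "G", "T") to its peak height.
    """
    ranked = sorted(peaks, key=peaks.get, reverse=True)
    top, second = ranked[0], ranked[1]
    if peaks[top] > 0 and peaks[second] / peaks[top] >= min_ratio:
        return IUPAC[frozenset((top, second))]
    return top


if __name__ == "__main__":
    # A T peak with a C of nearly the same height, as in Example 3.
    print(call_base({"A": 15, "C": 820, "G": 10, "T": 900}))
    # A clean single peak.
    print(call_base({"A": 1000, "C": 40, "G": 30, "T": 20}))
```

Writing the site as Y keeps the information that both alleles are present, which a plain T call would hide.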

Everything’s great until you reach a certain point, then you see double peaks with low Q values.
This is a heterozygous region in your target. If your organism is diploid, tetraploid, etc., you’ll run
into this eventually. You’re copying each allele, and both copies are present in the sequencing
reaction in fairly equal numbers. The cure for this is to clone your PCR product into a PCR cloning
vector (TA cloning) and then sequence a few of those plasmids. Then you’ll see your one messed-
up sequence become two well-resolved, but different, sequences.

Dye Blob

See those two giant peaks at around 80 and 120? That’s unbound dye that managed to seep
through the purification column. It happens sometimes. The software is designed to read
through dye blobs without affecting basecalling, but we typically will take your purified reaction
and run it through the purification column again the next day, and the blobs will go away.

Now, go and bear fruit.


While this guide doesn’t cover every possible problem, it does cover 99.9% of the difficulties
you’ll see. And don’t forget, we’re here to help, so come by and have a chat if you have questions.
