Lecture 4: Detecting Mutations and Mutation Consequences

Learning Goals:

Students should understand:

• That gel electrophoresis separates DNA fragments based on their size.
• How molecular techniques such as PCR and RFLP analysis can be
used to identify mutations in individuals.
• How next generation sequencing is leading to rapid advances in
the field of genomics.
• The basic steps of sequencing.
• The basic steps of sequencing data interpretation (align short
sequences with overlapping segments and then map these
sequences to a reference genome).

Students should be able to:

• Define allele, heterozygote, homozygote, recessive, dominant.
• Evaluate genotype and phenotype data to determine which
allele is associated with the dominant/recessive phenotype.
• Design and interpret genotyping assays using PCR, restriction
fragment length polymorphisms and gel electrophoresis to
determine the genotypes of individuals that carry genetic changes
associated with different traits (for example PCR analysis of
Huntington disease, RFLP analysis of SOD1 in ALS).
• Determine the appropriate next generation sequencing strategy
(whole genome, whole exome or targeted PCR amplification) for a
given scenario.
• Interpret next-generation sequencing data to identify mutations.
• Characterize a mutation as either gain of function or loss of function
and determine if it is haploinsufficient.
• Characterize a mutation as either coding or regulatory.

Lecture Notes:

Mutations can affect a DNA sequence in many different ways; the
type of molecular change that occurs will determine how a particular
mutation can be detected. We will look at different strategies for
detecting mutations, and what types of mutations can be detected by
each.

Polymerase Chain Reaction (PCR)


PCR can be used to isolate and amplify a specific DNA sequence
from a sample. For example, it can be used to selectively amplify the
sequence of a particular gene from an individual’s genome. The main
components of the reaction are:

• Template DNA: genomic DNA from the individual to be tested.
• Primers: short fragments of DNA that determine the “boundaries” of
the sequence to be copied; each primer serves as the starting point
for synthesis of a new DNA strand.
• Taq DNA polymerase: a DNA polymerase that can withstand high
temperatures without becoming denatured. It synthesizes the new
DNA strands.
• dNTPs: deoxynucleotides, the building blocks of DNA, which will be
used to synthesize the new strands.

In performing PCR, researchers combine these components in a
small tube, then subject the mixture to repeated cycles of heating and
cooling to allow exponential amplification of the DNA. A typical PCR
cycle consists of:

1. 30 seconds at 94°C. DNA is denatured into single strands.
2. 30 seconds at the annealing temperature for the particular primer
pair used in the reaction (often 50-65°C). At the annealing
temperature, strands of DNA anneal (hybridize). Because the primers
are present at much higher concentrations than the template DNA,
the template DNA will anneal to the primers instead of reforming its
original double helix.
3. 30 seconds at 72°C. Taq polymerase extends the primers, resulting in
a new strand of DNA.

Each cycle doubles the number of target DNA sequences so they
increase by geometric progression in subsequent cycles. Assuming 100%
efficiency, 30 cycles will result in an amplification factor of over one billion!
Even with realistic efficiencies, a single copy of DNA can be amplified to
the quantities necessary for further analysis in just a few hours.
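
The doubling arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a model of a real reaction: the function name pcr_copies and the idea of a single constant per-cycle efficiency are assumptions made for illustration.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized number of target copies after a given number of PCR cycles.

    efficiency is the fraction of templates copied in each cycle
    (1.0 means perfect doubling). Treating it as constant across
    cycles is a simplifying assumption.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# Perfect doubling for 30 cycles: 2**30 = 1,073,741,824 copies.
print(f"{pcr_copies(1, 30):,.0f}")

# A lower, hypothetical efficiency still gives a very large yield.
print(f"{pcr_copies(1, 30, efficiency=0.9):,.0f}")
```
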

Gel Electrophoresis
The results of a PCR (the PCR products) are typically analyzed using
gel electrophoresis. The PCR products are loaded into a well at the end
of an agarose gel in a buffered salt solution. An electric current is then
applied across the gel. DNA is negatively charged, so it is pulled through
the gel towards the positive electrode. As it travels, DNA fragments of
different sizes are separated, because they move through the gel at
different rates. Smaller fragments move more quickly through the gel, and
larger fragments more slowly. As a consequence, when you stop the
current, larger fragments will be closer to the wells, and smaller fragments
closer to the other end. The fragments can be visualized using a DNA-
binding dye. Fragments of a particular length will appear as a discrete
band in the gel.
PCR is a very powerful molecular biology technique, but it gives you
a limited amount of information. You can determine if an individual’s
genome contains a given sequence. If the primers bind and produce a
PCR product, the sequence is present; if they are unable to bind, no
product is produced, and the sequence is likely absent. You can also
determine the length of the sequence between the primers.
However, you cannot determine the specific sequence of bases in
the PCR product. Thus, gel electrophoresis can be used to detect only
those mutations that result in a detectable change in the length of a
gene or sequence, generally between tens and a few thousand bases.
Mutations that can be detected by PCR and gel electrophoresis therefore
include trinucleotide repeat expansions, transposon insertions, and other
insertions and deletions of tens to thousands of bases.
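
To see why a repeat expansion shows up as a size change on a gel, here is a small sketch that computes PCR product lengths for two alleles with different numbers of a three-base repeat. The flanking length (100 bp) and the repeat counts (20 and 45) are hypothetical values chosen only for illustration, as are the function names.

```python
def pcr_product_length(flank_bp, repeat_count, repeat_unit_bp=3):
    """Length of a PCR product whose primers flank a repeat tract.

    flank_bp is the total number of amplified bases outside the repeat
    (a hypothetical value here); the rest is repeat_count copies of a
    repeat unit that is repeat_unit_bp bases long.
    """
    return flank_bp + repeat_count * repeat_unit_bp


def band_order(lengths_bp):
    """Distinct band sizes listed from nearest the wells to farthest.

    Larger fragments move more slowly, so they stay closer to the wells.
    """
    return sorted(set(lengths_bp), reverse=True)


# Hypothetical heterozygote: one allele with 20 repeats, one with 45.
products = [pcr_product_length(100, n) for n in (20, 45)]
print(band_order(products))  # [235, 160]
```

An individual homozygous for either allele would give a single band, because both copies produce a product of the same length.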

RFLP Analysis
To look for smaller-scale changes, on the order of one or a few
bases (such as single base substitutions, or small insertions or deletions), it is
necessary to use a technique that allows you to determine the specific
DNA sequence. Sometimes you need to actually sequence the entire
gene, but in some situations, you may be able to use a faster technique,
RFLP analysis.
Restriction Fragment Length Polymorphism (RFLP or riff-lip) analysis is
a molecular biology technique that you can use to detect a subset of
changes to DNA sequences that affect one or a few bases, such as a
single base substitution. In order for this type of analysis to be applicable
for the detection of a particular sequence variant, it needs to either
create or interrupt a restriction site in the DNA sequence. Restriction sites
are specific short DNA sequences that are recognized and cut by
restriction enzymes.
Restriction enzymes (also known as restriction endonucleases) are
found in bacteria, where they serve as a defense mechanism against
viruses. The bacteria use restriction enzymes to cut up viral DNA,
preventing an infection from progressing and the virus from replicating.
Today, biotechnology companies mass produce restriction enzymes for
use in molecular biology labs, where they are commonly used to
manipulate DNA.
Each restriction enzyme recognizes a specific DNA sequence, and
cuts that sequence between specific bases. These restriction sites are
generally between four and eight bases long, and many are palindromic
– the sequence on one strand, as read in the 5’ to 3’ direction, is the same
as the sequence on the other strand in the 5’ to 3’ direction. For example,
the restriction enzyme EcoRI, which comes from the bacterium E. coli, cuts
at the following sequence between the G and the A on each strand:

5’ – GAATTC – 3’
3’ – CTTAAG – 5’

If one variant of a gene includes the sequence 5’ – GAATTC – 3’,
the enzyme EcoRI will recognize and cut that sequence at the restriction
site. If the other variant changes that sequence to 5’ – GAGTTC – 3’,
EcoRI will not cut it. Note: a single base change is sufficient to stop EcoRI
from recognizing the site.
The general procedure for RFLP analysis involves three different
molecular biology techniques:

1. PCR to isolate and amplify the region of interest
2. Restriction digestion with an enzyme that would cut one version of the
gene, but not the other
3. Gel electrophoresis to analyze the results

In the example above, the 5’ – GAATTC – 3’ variant and the 5’ –
GAGTTC – 3’ variant can be distinguished from each other after being
amplified by PCR and digested with EcoRI by running them out on an agarose gel.
The PCR product from the “A” variant will be cut into two pieces, while the
PCR product from the “G” variant will remain uncut. The cut and uncut
PCR products can be distinguished from one another by gel
electrophoresis, as the uncut version will be the same length as the
original PCR product, while the cut product will produce two smaller
fragments.
Digested/cut vs. undigested/uncut are not themselves phenotypes.
Instead, whether or not DNA is digested tells us information about the DNA
sequence itself—which variant exists. Note: this technique can only be
used to examine sites that can be recognized by a restriction enzyme. If a
restriction enzyme that recognizes the site of interest (mutant or non-
mutant DNA sequence) does not exist, you cannot use RFLP.
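
The EcoRI example can be sketched in code. The digest function below is a toy, and the two 34-base "PCR products" are made-up sequences used only to show the cut and uncut outcomes. Only one strand is searched, which is enough here because the EcoRI site is palindromic.

```python
ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1  # EcoRI cuts between the G and the A: G^AATTC


def digest(seq, site=ECORI_SITE, cut_offset=ECORI_CUT_OFFSET):
    """Cut a linear DNA sequence at every occurrence of a restriction site.

    Returns the fragments in order along the sequence. A sequence with
    no site comes back as a single uncut fragment.
    """
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


# Made-up PCR products that differ by a single base inside the site.
variant_a = "ATGCATGCATGCATGCATGC" + "GAATTC" + "TTAACCGG"
variant_g = "ATGCATGCATGCATGCATGC" + "GAGTTC" + "TTAACCGG"

print([len(f) for f in digest(variant_a)])  # [21, 13]
print([len(f) for f in digest(variant_g)])  # [34]
```
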

Reading RFLP gels


As we have seen, restriction fragment length polymorphism analysis
makes use of restriction enzymes to help identify differences between
sequences and can be used to genotype individuals. The analysis
requires:

1. A position, or multiple positions, that can be recognized by a
restriction enzyme. Specifically, the restriction enzyme must
recognize and cut one variant but not the other.
2. PCR amplification of the region containing the site recognized by the
restriction enzyme.
3. Gel electrophoresis, including a ladder (molecular weight markers)
used to determine the size of the DNA fragments.

For example, imagine you are trying to determine whether an
individual has a rare mutation that increases their likelihood of developing
certain cancers. Luckily for you, this mutation disrupts a restriction enzyme
recognition site, preventing the enzyme from cutting the DNA. You decide
to use PCR to amplify a 1000 bp stretch of DNA that contains this site for RFLP
analysis. To test your method, you do this for 3 controls: an individual
known not to have the mutation; an individual who is heterozygous for the
mutation; and an individual who is homozygous for the mutation. Below are the
results for the controls.
For the no mutation control, we see two bands that add up to 1000 bp in
length, showing that the PCR product was recognized by the enzyme and cut.
Since the mutation disrupts the restriction enzyme recognition site (i.e.
prevents the enzyme from cutting), the heterozygous mutant has one version
of the sequence that the restriction enzyme recognizes and cuts, and another
version that the enzyme does not cut, resulting in 3 bands. The
bands on the gel from a heterozygote may appear fainter than those
from the homozygotes because the heterozygote’s bands have half the
amount of DNA as the homozygotes’ bands. Finally, the homozygous
mutant only has the version of the sequence that the enzyme cannot
recognize, and we only see a single band corresponding to the uncut
PCR product.
Now using this method, we can run a gel of the unknown individual
and determine whether they do or do not have the mutation, including
whether they are heterozygous or homozygous for it.
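
The logic for reading the three control lanes, and then an unknown lane, can be written as a small lookup. This is a sketch under stated assumptions: the notes give only the 1000 bp product size, so the cut position (400 bp from one end) is a hypothetical value, and the function names are made up for illustration.

```python
def expected_bands(mutant_alleles, product_bp=1000, cut_at_bp=400):
    """Band sizes expected after digestion, largest first.

    mutant_alleles is 0, 1 or 2. The mutation destroys the restriction
    site, so mutant copies stay uncut. cut_at_bp is a hypothetical
    position of the site within the PCR product.
    """
    if mutant_alleles not in (0, 1, 2):
        raise ValueError("mutant_alleles must be 0, 1 or 2")
    bands = set()
    if mutant_alleles < 2:   # at least one copy still has the site
        bands.update({cut_at_bp, product_bp - cut_at_bp})
    if mutant_alleles > 0:   # at least one copy has lost the site
        bands.add(product_bp)
    return sorted(bands, reverse=True)


def call_genotype(observed_bands, product_bp=1000, cut_at_bp=400):
    """Match an observed band pattern to one of the three controls."""
    observed = sorted(set(observed_bands), reverse=True)
    labels = ("homozygous, no mutation", "heterozygous", "homozygous mutant")
    for n, label in enumerate(labels):
        if observed == expected_bands(n, product_bp, cut_at_bp):
            return label
    return "pattern does not match any control"


for n in (0, 1, 2):
    print(n, expected_bands(n))
print(call_genotype([1000, 600, 400]))  # heterozygous
```
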

DNA Sequencing
Sanger sequencing was invented in the 1970s by Fred Sanger, who
had already won a Nobel Prize for figuring out how to determine the
amino acid sequence of a protein (he got a second for DNA
sequencing). It was the standard way to sequence DNA until the
introduction of ‘next-generation’ methods in around 2007. Sanger
sequencing is still used today when, for example, we need to confirm
results derived from next gen methods.
The Sanger method uses the same enzyme that copies DNA
naturally in cells, DNA polymerase. The trick involves making the copy
partly out of bases that have been slightly altered. Instead of using only the
normal “deoxy” bases (As, Ts, Gs, and Cs) found naturally in DNA, Sanger
also added some so-called “dideoxy bases.” Dideoxy bases have a
peculiar property: DNA polymerase will happily incorporate them into the
growing DNA chain (i.e., the copy being assembled as the complement
of the template strand), but it cannot then add any further bases to the
chain. In other words, the duplicate chain cannot be extended beyond a
dideoxy base.
Imagine a template strand whose sequence is 3’-GGCCTAGTA-5’.
There are many, many copies of that strand in the experiment. Now
imagine that the strand is being copied using DNA polymerase, in the
presence of a mixture of normal A, T, G, and C plus some dideoxy A. The
enzyme will copy along, adding first a C (to correspond to the initial G),
then another C, then a G, and another G. But when the enzyme reaches
the first T, there are two possibilities: either it can add a normal A to the
growing chain, or it can add a dideoxy A. If it picks up a dideoxy A, then
the strand can grow no further, and the result is a short chain that ends in
a dideoxy A (ddA): CCGGddA. If it happens to add a normal A, however,
then DNA polymerase can continue adding bases: T, C, etc. The next
chance for a dideoxy “stop” of this kind will not come until the enzyme
reaches the next T. Here again it may add either a normal A or a ddA. If it
adds a ddA, the result is another truncated chain, though a slightly longer
one: this chain has a sequence of CCGGATCddA. And so it goes every
time the enzyme encounters a T (i.e., has occasion to add an A to the
chain); if by chance it selects a normal A, the chain continues, but in the
case of a ddA the chain terminates there.
Where does this leave us? At the end of this experiment, we have a
whole slew of chains of varying lengths copied from the template DNA;
what do they all have in common? They all end with a ddA.
Now, imagine the same process carried out for each of the other three
bases: in the case of T, for instance, we use a mix of normal A, T, G, and C
plus ddT; the resultant molecules will be either CCGGAddT or
CCGGATCAddT.
Having staged the reaction all four ways—once with ddA, once
with ddT, once with ddG, and once with ddC—we have four sets of DNA
chains: one consists of chains ending in ddA, one with chains ending with
ddT, and so on. We can use gel electrophoresis to sort by size all these
mini-chains. The speed with which a particular mini-chain will travel is a
function of its size: short chains travel faster than long ones. Within a fixed
interval of time, the smallest fragment, in our case a simple ddC, will travel
furthest; the next smallest, CddC, will travel a slightly shorter distance; and
the next one, CCddG, a slightly shorter one still. Now Sanger’s trick should
be clear: by reading off the relative positions of all these mini-chains after
a timed race through our gel, we can infer the sequence of our piece of
DNA: first is a C, then another C, then a G, and so on.
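
The chain-termination logic can be simulated directly. The sketch below uses the same template as the text; the function names are made up for illustration, and it idealizes the chemistry by assuming every possible terminated chain is produced in each of the four reactions.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def copy_strand(template_3_to_5):
    """Copy made by DNA polymerase, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def terminated_fragments(template_3_to_5, dd_base):
    """All chains that end in the given dideoxy base.

    Each returned string is a truncated copy whose last letter is the
    dideoxy base that stopped extension.
    """
    copy = copy_strand(template_3_to_5)
    return [copy[: i + 1] for i, base in enumerate(copy) if base == dd_base]


def read_gel(template_3_to_5):
    """Infer the copied sequence by sorting all fragments by length."""
    lanes = {b: terminated_fragments(template_3_to_5, b) for b in "ACGT"}
    by_length = sorted(
        (len(frag), base) for base, frags in lanes.items() for frag in frags
    )
    return "".join(base for _, base in by_length)


template = "GGCCTAGTA"  # 3'-GGCCTAGTA-5', as in the text
print(terminated_fragments(template, "A"))  # ['CCGGA', 'CCGGATCA']
print(terminated_fragments(template, "T"))  # ['CCGGAT', 'CCGGATCAT']
print(read_gel(template))                   # CCGGATCAT
```

The shortest fragment is read first because it travels furthest, so sorting by length reproduces the order in which bases were added to the copy.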

Next-generation sequencing is an umbrella term used to describe a
number of different modern sequencing technologies. Sanger
sequencing has been in use since the 1970s. However, this older
method of sequencing is limited in the amount of data it can
generate. Next-generation methods have many advantages: they have
made sequencing hundreds of times faster and far less expensive, and
they generate unprecedented amounts of data. These advances have
revolutionized the field of genetics and genomics and are leading to
major breakthroughs in the understanding and treatment of disease.
The end result of a typical next gen sequencing run is the sequences
of many small, overlapping fragments of DNA. These sequences are referred
to as ‘reads’. For example, a single sequencing run using the industry
standard Illumina technology can produce between 50 million and 1
billion reads, depending on which machine performs the sequencing.
Computational tools have been developed to make sense of such massive
amounts of data. The first step is to match up reads that have
overlapping areas, lining up all the reads to generate a much longer
sequence from many smaller overlapping fragments.
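
The idea of stitching overlapping reads into a longer sequence can be illustrated with a toy greedy assembler. This is only a sketch: the reads are made up, the function names and the minimum overlap of 3 bases are assumptions, and real software has to cope with sequencing errors, repeats and millions of reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]  # made-up reads
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```
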
Next, the sequences are compared to a reference sequence. A
reference sequence is a well-characterized genome sequence that has
been extensively studied and is of a high quality. The sequence data is
compared to the reference and any areas where the two differ are
flagged in the software. In lecture we saw an instance where the sample
was heterozygous for a single-base substitution in SOD1, which was
different from the reference sequence. This was indicated by a color
change in the reads (red and blue). Sequences that match the reference
genome are colored grey, making it easy to identify any changes by
simply looking for color differences.
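
The comparison step amounts to checking each aligned read against the reference, position by position. The sketch below assumes the reads have already been mapped (each comes with its start position on the reference); the sequences and the function name flag_differences are made up for illustration.

```python
def flag_differences(reference, aligned_reads):
    """Collect bases that differ from the reference, grouped by position.

    aligned_reads is a list of (start, sequence) pairs, where start is
    the 0-based position on the reference where the read begins.
    """
    flagged = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if base != reference[pos]:
                flagged.setdefault(pos, []).append(base)
    return flagged


reference = "ATGGCGTACCTGA"
aligned_reads = [
    (0, "ATGGCGT"),
    (3, "GCGTACC"),  # matches the reference
    (3, "GCATACC"),  # carries an A where the reference has a G
    (6, "TACCTGA"),
]
print(flag_differences(reference, aligned_reads))  # {5: ['A']}
```

Position 5 is covered by one read that matches the reference and one that does not, which is the kind of mixed pattern expected at a heterozygous site.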
Scientists are constantly inventing new and creative applications for
next-generation sequencing. Currently, three of the main uses of this
methodology are 1) whole genome sequencing, 2) whole exome
sequencing, and 3) sequencing of PCR products. Each of these methods
generates a specific type of data and has its advantages and limitations.

Whole genome sequencing generates information about the whole
genome and therefore permits detection of most mutation types. The
limitation is that each location in the genome will, in comparison with the
other uses, be sequenced fewer times. The number of times a base is
sequenced in a given sequencing run is referred to as the read depth. A
read depth of 30x, for example, means that, on average, each base has been
sequenced 30 times. Since a sequencing run produces a finite number of
reads, the more DNA you wish to sequence, the lower the read depth.
Lower read depth means that you will get fewer ‘looks’ at each
sequence. If you are sequencing less DNA (e.g. a PCR product) then
more reads are devoted to this DNA and you will get a lot more ‘looks’ at
this fragment.
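
The trade-off between how much DNA is sequenced and the read depth follows from simple arithmetic: total bases sequenced divided by the size of the target. In the sketch below, the read length (150 bases) and the target sizes are illustrative assumptions, not values from the lecture, and the calculation ignores reads that fail to map or fall outside the target.

```python
def mean_read_depth(n_reads, read_length_bp, target_size_bp):
    """Average number of times each base of the target is sequenced."""
    return n_reads * read_length_bp / target_size_bp


N_READS = 1_000_000_000  # upper end of the range quoted for one run
READ_LENGTH_BP = 150     # assumed read length

# Approximate, assumed target sizes in base pairs.
targets = {
    "whole genome": 3_100_000_000,
    "whole exome": 30_000_000,
    "PCR product": 1_000,
}

for name, size_bp in targets.items():
    depth = mean_read_depth(N_READS, READ_LENGTH_BP, size_bp)
    print(f"{name}: about {depth:,.0f}x")
```

In practice a run is usually shared among many samples or targets, so the point is the ratio between the three cases rather than the absolute numbers.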
Why do we care about read depth so much? Consider a situation
where you are sequencing a patient’s genome in an attempt to identify a
disease-causing mutation. If you notice that there is a difference between
the patient’s DNA and the reference sequence, the first question you
would ask yourself is ‘Is this a genuine mutation or is it an experimental
error?’ No methodology is error free, and mistakes are commonly made
when sequencing DNA. Therefore, scientists use read depth to help them
distinguish between genuine mutations and sequencing errors. The more
reads that contain the different base the more confidence there is in
declaring this to be a genuine mutation. However, if you have fewer
reads then you may be less confident in making this distinction. This level
of confidence is very important if the goal is to identify a disease-causing
mutation. Therefore, if you suspected the mutation was in a specific gene
or in the coding sequence, you may wish to sequence a PCR product of that
gene or perform exome sequencing, respectively, as these applications will
generate a higher read depth than whole genome sequencing.
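
One way to picture how read depth feeds into confidence is to count, at a single position, how many reads carry a base that differs from the reference. The sketch below is deliberately crude: the thresholds (at least 10 reads, at least 20% of them carrying the different base) are arbitrary assumptions for illustration, not values from the lecture or from any real variant caller.

```python
def summarize_site(ref_base, observed_bases, min_depth=10, min_alt_fraction=0.2):
    """Summarize the reads covering one position.

    observed_bases is a string or list with one base per read.
    Both thresholds are arbitrary, illustrative choices.
    """
    depth = len(observed_bases)
    alt_reads = sum(1 for base in observed_bases if base != ref_base)
    alt_fraction = alt_reads / depth if depth else 0.0
    if depth < min_depth:
        call = "too few reads to judge"
    elif alt_reads == 0:
        call = "matches reference"
    elif alt_fraction >= min_alt_fraction:
        call = "candidate mutation"
    else:
        call = "possible sequencing error"
    return {"depth": depth, "alt_reads": alt_reads, "call": call}


print(summarize_site("G", "GA"))                    # 2 reads: too few to judge
print(summarize_site("G", "G" * 15 + "A" * 15))     # half the reads differ
print(summarize_site("G", "G" * 29 + "A"))          # a single odd read
```

With only two reads, one differing base cannot be interpreted; with 30 reads, 15 differing reads point to a heterozygous change, while a single differing read looks more like an error.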

Genotype and Phenotype


In families in which Huntington disease (or any other disease) is
common, not all individuals will exhibit symptoms of the disease. These
differences are in the phenotypes of the individuals in these families. An
individual’s phenotype is their appearance or the way in which they
manifest a particular characteristic. An individual’s phenotype is often
determined by their genetic makeup, or genotype.
In humans, every individual possesses two alleles (copies) of almost
every gene (an exception is genes found on the X chromosome – males
have only one copy of each of those genes). In analyzing an individual’s
genotype, you may observe that they have two different alleles (versions)
of a particular gene, or they may have two alleles that are the same.
Individuals that have two different alleles of a gene are said to be
heterozygous for that gene, while those that have two copies of the same
allele are homozygous for that gene.
For many genes, an individual’s genotype determines their
phenotype. If an individual is heterozygous for a particular gene, they will
express the phenotype associated with one of their two alleles. The
phenotype that they express is said to be dominant. Individuals who are
homozygous for the dominant allele or heterozygous will, for simple
Mendelian traits, display the same phenotype. In contrast, individuals
must be homozygous in order to display the phenotype for a recessive
trait.
Many recessive traits are due to what we call loss-of-function (also
known as null) mutations. These types of mutations cause a gene to lose
its ability to perform its normal function. These mutations are often
recessive because having a single copy of the normal allele is sufficient for
function. However, some loss-of-function mutations cause even
heterozygous individuals to exhibit the mutant phenotype. Genes for
which a loss-of-function mutation in just one of the two copies produces
a phenotype are said to display haploinsufficiency: the single remaining
functional copy is not sufficient for normal function. Mutations in these
genes will show a dominant inheritance pattern.
Many dominant traits are due to gain-of-function (also known as
neomorphic) mutations. These mutations cause a gene to do or create
something new that it normally would not – the gene has a new function.
We can now expand the vocabulary that we use to describe
mutations using additional characteristics:

Impact on the function of the gene


• Null/knockout/loss of function: complete elimination of gene
function; an example would be if the gene is deleted (though other
types of mutations can also cause a complete loss of function)
• Gain of function (also called neomorphic): these cause the gene to
do something new, for example bind a new partner or catalyze a
new reaction

Location of the mutation

• Regulatory: mutation is in a DNA sequence that is involved in
regulating gene expression, such as an enhancer, promoter, splice
site, or terminator
• Coding: mutation is in the body of the gene that codes for the
protein

Molecular change in the DNA


• Base-pair substitution: one base pair in a DNA duplex is replaced with
another
• Insertion or deletion: one or more extra or missing nucleotides

Base-pair substitutions and small insertions or deletions in the coding
sequence can have different effects on the translation of a protein.
Based on these effects, they are categorized into the following
types:
• Synonymous (also known as silent): no change in amino acid
sequence (often found at 3rd position of a codon)
• Non-synonymous (also known as missense): mutation changes the
original amino acid to a different amino acid
• Nonsense/termination: mutation changes an amino-acid codon to a
stop codon, leading to production of a truncated protein
• Frameshift: insertion or deletion that involves a number of bases that is
not divisible by three; shifts the triplet code for the gene “out of
phase,” leading to a completely different amino acid sequence after
the mutation
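
The four categories above can be assigned mechanically once the codon change is known. The sketch below builds the standard genetic code (DNA codons, with * for stop) and classifies single-codon substitutions and simple insertions or deletions. The function names are made up for illustration, and cases outside the list above, such as a change that removes a stop codon, are only labelled rather than analyzed.

```python
from itertools import product

# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a base-pair substitution by its effect on one codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if ref_aa == "*":
        return "stop codon changed (not covered in the list above)"
    if alt_aa == "*":
        return "nonsense"
    return "non-synonymous (missense)"


def classify_indel(n_bases):
    """Classify an insertion or deletion by the number of bases involved."""
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return "frameshift" if n_bases % 3 else "in-frame insertion or deletion"


print(classify_substitution("GAA", "GAG"))  # Glu -> Glu: synonymous
print(classify_substitution("GAA", "GAT"))  # Glu -> Asp: non-synonymous (missense)
print(classify_substitution("TGG", "TGA"))  # Trp -> stop: nonsense
print(classify_indel(1))                    # frameshift
```
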

Concept Check:

Gel electrophoresis
Simple Sequence Repeats
Preimplantation Genetic Diagnosis (PGD)
Genotype
Phenotype
Synonymous mutation
Non-synonymous mutation
Frameshift mutation
Coding variant
Regulatory variant
Restriction enzyme
Restriction site
Allele
Heterozygous
Homozygous
Locus
Dominant
Recessive
Next generation sequencing
Exome
Loss of function mutation
Haploinsufficiency
Gain of function mutation
