ch1 A Gentle Introduction To Genomics

OUP CORRECTED PROOF – FINAL, 11/29/2012, SPi

CHAPTER 1

A gentle introduction to genomics

Key points
• The “central dogma” of biology states that genomic information flows from DNA to RNA to protein.
• Most human DNA does not encode genes (i.e. is non-coding) but includes various regulatory elements that facilitate the
regulation of gene activity.
• Mutations can have functional consequences in both coding regions (e.g. changes to protein sequence) and non-coding regions (e.g. altered regulation of a gene).
• The most common form of genetic variation in humans is the single nucleotide polymorphism (SNP). SNPs are typically not inherited at random, and their inheritance is often correlated.

Exploring Personal Genomics. First Edition. Joel T. Dudley and Konrad J. Karczewski.
© Joel T. Dudley and Konrad J. Karczewski 2013. Published 2013 by Oxford University Press.

1.1 Introduction

The motivation for writing this book stems from a strong belief that anyone who has an interest in exploring their personal genome has a right to gain access to the tools and knowledge required to do so. Therefore, we expect (and hope!) that some readers of this book may lack a formal education in genetics or genomics. Although we have made a strong attempt to provide appropriate explanations and background knowledge to enable a thorough understanding of the materials in each individual chapter, an understanding of fundamental concepts in genomics will enable the reader to make optimal use of the knowledge and tools presented in this book. In this chapter, we provide a brief and gentle introduction to these fundamental concepts, aimed at providing a basic understanding of the human genome, which will greatly enhance one’s ability to analyze and interpret a personal genome.

1.2 What is a genome?

A genome is quite simply the entire sum of the genetic components of a living species. Plainly, you can think of your genome as all the “genetic stuff” that defines your unique biology, both as an individual within the human species and as a member of a distinct species within the animal kingdom. Popular media often refers to the genome as the set of genes unique to an individual, but we will see that a genome is much more than genes. In fact, only ~2% of the human genome is known to code for actual genes; until recently, the remaining 98% was referred to as “junk” DNA. However, this moniker is quickly fading as we learn more and more about the important functional roles of this supposed “junk” DNA. This chapter provides an introductory minimum of knowledge about the human genome and the associated molecular biology required for informed analysis and interpretation of a personal genome.

1.2.1 The architecture of a human genome

The human genome is fundamentally composed of deoxyribonucleic acid (DNA), which is made up of four distinct nucleotides (“bases”) that combine according to specific pairing rules along a sugar-phosphate backbone, giving the genome its well-known double-helix conformation. The four bases in the human genome are adenine (A),
thymine (T), guanine (G), and cytosine (C). To form the double-helix structure, two DNA strands “pair” (through weak hydrogen bonding), and due to physicochemical constraints, A always pairs with T, and C always pairs with G (Figure 1.1). Because of this strict pattern of complementarity, it is usually sufficient to represent the human genome as a single (unpaired) strand of nucleotide bases (e.g., ATGTAACGT), since the complementary strand can always be inferred from the purine-pyrimidine pairing rule. Given that the entire genome can be encoded using only four characters representing the four nucleotides, you may be wondering how we can distinguish the protein-coding DNA regions (i.e. genes) from the non-coding DNA. As we will learn in the following sections, additional genomic features, such as the higher-order patterning of nucleotide bases in the genome and the organization of DNA in the cell nucleus, encode and govern the complex functionality of the human genome.

[Figure 1.1: diagram of the DNA double helix labeling the phosphate molecule, the deoxyribose sugar molecule, the nitrogenous bases (A-T and C-G pairs), the weak bonds between bases, and the sugar-phosphate backbone.]
Figure 1.1 Molecular structure of DNA: DNA is made up of four bases (A, T, C, and G) attached to a sugar-phosphate backbone. The two strands of DNA are held together by weak hydrogen bonds, where A pairs with T and C pairs with G. DNA is “complementary”, meaning one strand can be reconstructed from the other by reversing it and switching A with T and C with G. Image courtesy of the U.S. Department of Energy Genomic Science program (<http://genomicscience.energy.gov>). Image reprinted from Molecular Evolution and Phylogenetics by Nei and Kumar, OUP: 2000.

1.2.2 Structure of protein-coding regions

As previously mentioned, only a small fraction of our DNA resides in genes, which are broadly defined as the functional units of the genome. These regions typically “code” for proteins (with exceptions discussed below), wherein the DNA sequence is transcribed into messenger RNA (mRNA) and translated into proteins (Figure 1.2). The intermediate step, RNA (ribonucleic acid), is a molecule that is quite similar to DNA in structure, except that its sugar carries an additional hydroxyl group (-OH) and thymine (T) is replaced by a similar base, uracil (U). Additionally, RNA is most often single-stranded. The final products are the proteins, which make up the various constituent parts of cells in the human body, including structural proteins (that hold the cell together), enzymatic proteins (that catalyze cellular reactions), and signaling proteins (that communicate signals within or across cells). It is these protein-coding genes that are the most studied regions in the human genome (and have been most prominently described in the popular media), and thus, they form the basis of our classical understanding of human genetics and genomics.

Unlike the protein-coding regions of many microorganisms, such as bacteria, the protein-coding regions of human DNA are not contiguous, but instead are interrupted by non-coding DNA. The protein-coding regions are fragmented: the information that encodes a gene is broken up into chunks called exons, which are interspersed with non-coding nucleotide regions called introns (see Figure 1.2). When genes are expressed in human
cells, the cellular machinery removes (“splices out”) the intronic regions and rejoins the remainder to form messenger RNA. We will learn later how the fragmented structure of protein-coding genes enables additional functional capacity through a mechanism known as alternative splicing, where exons are occasionally skipped in different combinations to create alternative gene products called protein isoforms. Finally, these exons and introns are flanked by “control regions” for the gene, known as promoters. These regions are involved in regulating gene expression, as they encode information about when and how much of the gene should be expressed or transcribed into mRNA. For example, these regions may control whether a gene should be expressed in liver or brain or in response to a stimulus, such as a disease.

[Figure 1.2: diagram of a three-exon gene. The DNA (5’ regulatory region, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, 3’ end, with GT and AG marked at each intron) is transcribed into pre-mRNA with 5’ and 3’ untranslated regions; the introns are spliced out to give processed mRNA (Exon 1, Exon 2, Exon 3), which is translated into a polypeptide.]
Figure 1.2 The central dogma of biology: DNA is “transcribed” into RNA, which is then processed and “translated” into protein. Gene regions are transcribed into a pre-mRNA (pre-messenger RNA), which includes untranslated regions (at the 5’ and 3’ ends) as well as introns. The introns are “spliced out” in a processing step, before the processed mRNA is translated into a protein. Image reprinted from Molecular Evolution and Phylogenetics by Nei and Kumar, OUP: 2000.

1.2.3 The myth of junk DNA

Even before the human genome was initially sequenced, the discovery that most (over 95%) of the human genome did not directly code for proteins led to the common usage of the phrase “junk DNA” to refer to this large proportion of human DNA with unknown function. Recent advances in genome science have proven this phrase to be a misnomer, with scientists regularly decoding the nature and function of various aspects of this large, non-coding expanse in the human genome using new tools and technologies. By comparing the genomes of many species, we have hints from evolution that some of this DNA may be functionally important, as 4–7% of the human genome is conserved across mammals, a concept known as evolutionary conservation. It is likely that advances in functional genomics in the near future will offer insights into the precise nature and function of elements in these large non-coding regions. A breakdown of the DNA contained in the human genome is shown in Figure 1.3, but some of the key genomic elements known to comprise these non-coding regions include:

• Gene-related sequences: These include introns, untranslated regions (UTRs), and promoter regions that may have numerous roles in the regulation of gene expression. Introns can be “spliced out” to generate new gene products from the same exons (only in different combinations; see Box 1.2 “Alternative Splicing”). UTRs and promoters are involved in regulating gene expression levels by affecting transcription and degradation of mRNA (see Section 1.6).
• Non-coding RNAs (ncRNAs): ncRNAs are RNA molecules that are transcribed, but not translated, with the actual RNA molecule itself performing a functional role. These appear to have diverse roles across the genome, such as regulating the expression of genes (e.g. miRNAs, see Section 1.4) and serving roles in the structural integrity of the chromosome (e.g. telomerase RNA).
• Pseudogenes: Pseudogenes are remnants of ancient genes that were once functional in an ancestral species, but have now lost their function (i.e. they are switched off) and are accumulating genetic “noise” in the form of random mutations.
• DNA repeats: Repetitive patterns of DNA nucleotides of various lengths, sometimes referred to as satellite DNA. These repeats are thought to have come about through “slippage” errors in DNA replication; however, some have speculated that these repeat elements have important roles in shaping species evolution.
• Retrotransposons: Recent evidence has demonstrated that up to 8% of the human genome
has been formed by retrotransposon DNA from Human Endogenous Retroviruses (HERVs), although the human genome contains a larger proportion (~25%) of DNA that is identifiable as being formed by retrotransposons. These viral retrotransposons are suspected to have had a significant effect on the evolution of the mammalian genome, and particularly on the evolution of humans and other hominids.

[Figure 1.3: tree diagram of the human genome (3200 Mb):
  Genes and gene-related sequences, 1200 Mb
    Genes, 48 Mb
    Related sequences, 1152 Mb (pseudogenes, gene fragments, introns, UTRs)
  Intergenic DNA, 2000 Mb
    Interspersed repeats, 1400 Mb (LINEs 640 Mb; SINEs 420 Mb; LTR elements 250 Mb; DNA transposons 90 Mb)
    Other intergenic regions, 600 Mb (microsatellites 90 Mb; various 510 Mb)]
Figure 1.3 Breakdown of the human genome: While a small fraction of the human genome is directly translated into protein, the overwhelming majority is not. Some of this “non-coding” DNA, especially DNA near genes, is involved in the regulation of gene expression. Other segments are repetitive elements, including long- and short-interspersed DNA elements (LINEs and SINEs), long terminal repeats (LTRs), and transposons. Image reprinted with permission from Genomes 2 by Brown.

1.2.4 Packaging genomic DNA into chromosomes

If you were to stretch a human genome out into a single contiguous strand, you would find that it would be nearly 1.83 meters (6 ft) from end to end. Every cell in the human body with a nucleus contains a full copy of an individual’s genome, which is packaged inside the cell in the form of densely packed structures called chromosomes (see Figure 1.4). The number of chromosomes used to package the genome into a cell nucleus varies by species; human cells have 23 chromosome pairs (46 total chromosomes), with one of these pairs being a sex-determining chromosome pair (XX in females and XY in males). One member of each of these chromosome pairs is inherited from the mother, and the other is inherited from the father. In addition, the mitochondrial genome is a small circular genome that is inherited only from the mother (see Box 1.1).

As such, each cell in our bodies is a diploid cell, with two copies of each chromosome. Gametes, or reproductive cells (i.e. sperm and egg), are the only exception to this rule. Human gametes are haploid cells, which contain half of the typical amount, or 23 chromosomes. This is necessary because gametes from two individuals will fuse in the process of sexual reproduction to form a zygote, which will develop into an embryo with 46 chromosomes per cell. Therefore, chromosomes are not only important for the packaging of DNA into the cellular nucleus, but also as structures and units of genetic inheritance in human sexual reproduction.

[Figure 1.4: illustration zooming from a cell to its nucleus, to a chromosome (two chromatids joined at the centromere, with telomeres at the ends), to DNA wound around histones, and finally to the base pairs (A-T, G-C) of the DNA double helix.]
Figure 1.4 DNA packaged in chromatin: DNA is wrapped around histones and organized into chromatin. This chromatin makes up chromosomes, which are found in the nucleus of the cell. The cell efficiently packs the 1.8 meters of linear DNA into a nucleus that is 6 micrometers in diameter. Image courtesy of the National Human Genome Research Institute (artist: Darryl Leja).
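
To get a feel for the scale quoted in the Figure 1.4 caption, the short calculation below compares the stretched-out length of the DNA with the diameter of the nucleus. It is only back-of-the-envelope arithmetic using the two values from the caption; the variable names are illustrative and not taken from this book.

```python
# Values quoted in the Figure 1.4 caption.
dna_length_m = 1.8           # linear DNA in one cell, in meters
nucleus_diameter_m = 6e-6    # 6 micrometers, expressed in meters

# How many nucleus diameters long the DNA would be if it were stretched out.
length_ratio = dna_length_m / nucleus_diameter_m
print(f"The DNA is about {length_ratio:,.0f} times longer than the nucleus is wide.")
```

The ratio comes out at roughly 300,000, which is why the layered packaging described in this section is needed.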

Chromosomes are composed of a complex bundling of genomic DNA and proteins known as chromatin. The major protein constituents of chromatin are a special set of proteins called histone proteins, around which genomic DNA is wrapped. It can be helpful to visualize the relationship between DNA and histone proteins using a thread and spool analogy. In this way, you can think of DNA as a length of thread that is wound tightly around a histone “spool” (146 base pairs at a time to be exact!). When the DNA is wrapped around this histone “spool”, the resulting particle is called a nucleosome. This process is repeated along the length of genomic DNA for a particular chromosome, creating a “beads on a string” structure with a contiguous strand of nucleosome “beads” on a genomic “string.” These nucleosome “beads” are connected by another type of histone (known as linker histones) that organizes the nucleosomes into a more tightly-packed structure known as chromatin fiber. These chromatin fibers form the basis of chromatin, which can be further classified as heterochromatin, a condensed form, or euchromatin, an extended form. Thus, human genomic DNA is packaged into cellular nuclei in a condensed and complex structure, as opposed to floating around as one long strand. The packaging of genomic DNA in the chromatin structures of the nuclei can have a significant effect on the expression of genes (see Section 1.4).

In human cells, during mitosis, a chromosome duplicates and condenses such that it can be visualized under a microscope (Figure 1.5). The duplicated pair (also known as “sister chromatids”) is physically joined by a structure known as a
centromere, giving chromosomes their characteristic crisscross appearance (although in reality the chromosomes are not crossed over each other when joined by the centromere, but rather “pinched” together). Chromosome pairs are joined by the centromere in an asymmetrical fashion, such that the unconnected region or arm on one side of the centromere is typically shorter than the other. The short arm of a joined chromosome pair is known as the “p” arm and the longer arm is known as the “q” arm. This convention is often used in annotating specific coordinates of a genomic region (e.g. 6q refers to the long arm of chromosome 6). The regions at the end of each chromosome are called telomeres. Telomeres are composed of repetitive sequences of non-coding DNA that protect the chromosome from damage during cell division. Each time a cell divides, the telomeres are shortened, and this shortening has been implicated in some age-related diseases.

[Figure 1.5: schematic of a duplicated chromosome labeling the telomeres, the short arm (p), the centromere, and the long arm (q).]
Figure 1.5 Schematic of a chromosome: The two sister chromatids (perfectly identical copies) of a chromosome can be visualized after it is duplicated during mitosis. These copies are connected off-center by a centromere, resulting in the short and long arms (p and q arms, respectively), which contain telomeres at their ends.

Box 1.1 “The Mitochondrial Genome”

Many people will be surprised to hear that the genome found in the nuclei of our cells is not the only genome found in the human body. Mitochondria, which are energy-producing sub-cellular components found in most cells in the human body, have their own genome. In addition to being much smaller than the human genome, containing just 37 genes across ~16,569 nucleotide base pairs, mitochondrial DNA looks quite a bit different than human genomic DNA. Instead of being packaged into chromatin by histones, mitochondrial DNA takes a circular form and it does not contain any introns. These characteristics appear to make the mitochondrial genome more similar to bacterial genomes than the human genome. In fact, the leading theory on the origins of the mitochondrial genome is the endosymbiont theory, which asserts that mitochondria developed from bacteria that were taken up by an ancient eukaryotic cell and subsequently developed a symbiotic relationship with the host cell. There is a substantial amount of evidence supporting this theory. For example, the ribosomal RNA (rRNA) found in mitochondria is more similar to the rRNA found in bacteria than to that found in mammals.

From the perspective of personal genomics, there are two key properties of mitochondrial DNA that are useful to understand. First, since the mitochondria are the “power plants” of the cell, mutations in mitochondrial DNA can often lead to a host of genetic disease conditions in humans. Second, mitochondrial DNA is only inherited from the mother, and is therefore useful for investigation of inheritance and ancestry along an individual’s maternal lineage. If you push this idea of maternal inheritance to its extreme conclusion, you can assert that the mitochondrial DNA found in the present human population could be traced back to a single female individual, who is popularly referred to as “Mitochondrial Eve.” Molecular evolutionary studies of human mitochondrial DNA estimate that Mitochondrial Eve most likely lived around 200,000 years ago and most likely in East Africa. In Chapter 5, we will learn how to map mitochondrial DNA haplogroups back to particular human lineages.

1.3 How does a genome work?

Having a particular gene encoded in your genome is largely inconsequential unless it is “expressed” to result in some biological effect inside or outside the cell. Specifically, while your genes are encoded by your genomic DNA, they generally do not exert any influence until they are expressed through a process called transcription, which eventually leads to the formation of functional protein products through a process called translation. While the process of expressing a single gene to make a single protein product is a tremendously complex process in its own right, the cell must be able to precisely express thousands of genes in concert to support the viability of just a single cell, much less a complex, multicellular species such as humans. To complicate matters further, cells must also achieve a specific timing of this regulation: for instance, fetal haemoglobin is only expressed in early development and then switched off to make way for adult haemoglobin. This symphony of gene expression is achieved through a process known generally as gene regulation.

1.3.1 From genes to proteins: the central dogma of molecular biology

This sequential transfer of information from the genome to proteins is called the central dogma of molecular biology, a phrase that was coined in 1958 by Francis Crick, co-discoverer of the structure of DNA. The central dogma of molecular biology states that the flow of genetic information in molecular biology is DNA → RNA → Protein, or simply that DNA is transcribed to RNA and RNA is translated into proteins (Figure 1.2). Like many rules of thumb, this is an oversimplification of the relationships between DNA, RNA and proteins, but it is a useful tool for understanding the general relationship between these molecules and the flow of information that starts at the genome and produces protein products with biological activity. Despite their fundamental nature, aspects of gene transcription and translation are still areas of active research; for more information, we refer readers to “Recombinant DNA: Genes and Genomes” by Watson, Caudy, Myers, and Witkowski and other reviews such as [Chen and Rajewsky, 2007].

1.3.2 Gene transcription

One of the first major steps in gene expression is transcription, which creates an RNA copy of a gene, called messenger RNA (mRNA), from its genomic DNA template (Figure 1.6). The actual synthesis of mRNA is performed by an enzyme called RNA Polymerase (RNAP). In order to transcribe a gene into an mRNA transcript, RNAP must bind to genomic DNA at the beginning of a gene sequence, known as a promoter region, which lies in the “upstream” or 5’ (pronounced “five-prime”) position.

[Figure 1.6: diagram of a cell. In the nucleus, the genomic DNA template is transcribed into an RNA transcript, which undergoes RNA processing to become mRNA; in the cytoplasm, the ribosome and tRNAs translate the mRNA into a protein polypeptide.]
Figure 1.6 Transcription and translation: DNA is transcribed into RNA transcripts (pre-mRNA), which are processed into mature mRNAs in the nucleus. These mRNAs move from the nucleus to the cytoplasm, where the translational machinery (the ribosome along with tRNAs) translates the message into a protein.

Since RNAP does not know the location of particular genes in the genome or when the cell needs to express a particular gene, it is aided by proteins called transcription factors, which recognize short sequence motifs in gene promoter regions. These transcription factors then recruit RNAP to transcribe an RNA copy of the gene. During this process, the RNAP will move along the double-stranded DNA in the 5’ → 3’ direction of the coding strand and it will use the complementary strand as a template (read in the 3’ → 5’ orientation) to encode an mRNA transcript of a gene (Figure 1.6). Recall that DNA base pair complementarity ensures that a transcript encoded from the complementary strand will match the information found in the “intended” coding strand. Note that the nucleotide code used by mRNA is slightly different than that of DNA: any positions encoded with a thymine (T) in DNA will be encoded using a uracil (U) in RNA.

However, the mRNA strand produced by the RNAP does not yet represent a viable copy of a gene, as it will contain non-coding intronic regions in addition to the desired exonic regions. This immature mRNA strand is known as precursor mRNA (pre-mRNA), which must have the non-coding intronic regions removed in a process called splicing, before it can be used to create a viable protein product. This arrangement of exons can
produce a large amount of genetic diversity if certain exons are skipped, forming various combinations of exons in the mature mRNA (see Box 1.2).

Box 1.2 “Alternative splicing”

The alternative combination of exons during pre-mRNA processing is known as alternative splicing (see Figure 1.7). As you might imagine, skipping certain exonic gene regions can have the consequence of the same initial gene sequence producing very different protein sequences. Note that in alternative splicing, the order of exons is still maintained, but which ones are used to produce the final mRNA molecule may vary. Alternative splicing is facilitated by a number of complex genetic and biochemical factors, driven by a splicing machinery that can recognize and bind to particular regions in the gene. Additionally, there are a number of other factors affecting the resulting alternatively spliced transcript, such as the retention of particular introns, or a mutually exclusive relationship between certain exons. For the purposes of this book, the main takeaway point from alternative splicing is that a single gene sequence encoded in the genomic DNA can in fact produce a repertoire of differing protein products.

1.3.3 Gene translation

After the pre-mRNA is processed by the spliceosome and a mature mRNA is produced, the cell transports the mRNA out of the nucleus into the cell’s cytoplasm, where the information encoded by the mRNA sequence is used to synthesize a protein (Figure 1.6). The synthesis of proteins from mRNA by the translational machinery in the cell is perhaps one of the most amazing examples of molecular machinery yet revealed by science. The same molecular process that translates information from mRNA into proteins is capable of producing the soft proteins that make up human hair, as well as the tough collagen fibers and matrix proteins that make up the hard human endoskeleton. The translation of mRNA information into proteins is facilitated by a complex macromolecule called the ribosome, a complex made up of various protein and RNA subunits. The ribosome binds to the mRNA in the cellular cytoplasm at the 5’ (upstream) region of the mRNA sequence and proceeds to read the mRNA data in the 5’ to 3’ direction. The information regarding the precise composition of the protein encoded by the mRNA is represented in a genetic language that uses a triplet of nucleotides, or codon, to encode one of the 20 amino acids used by mammalian cells to build proteins. Each codon is recognized by a tRNA (transfer RNA) molecule that carries with it an amino acid, which is added to a peptide chain by the ribosome. The codon translation used in human cells is shown in Table 1.1.

Many of the amino acids are encoded by multiple codons; when a set of codons encodes the same amino acid, these codons are said to be synonymous, and provide redundancy in the genetic code. There are several theories as to why such redundancy exists in the genetic code, but leading theories suggest that this redundancy protects from the effects of mutation. There are also a few special codons that serve as regulatory signals to the translational machinery. The start codon (AUG in RNA) encodes the amino acid methionine and serves as a signal that “tells” the translational machinery to begin protein synthesis. The stop codons (UGA, UAG, and UAA in RNA) serve as a signal that the end of a protein has been reached and that protein synthesis by the ribosome should stop.

Table 1.1 Codon table: With only a few exceptions, the genetic code of life is shared among organisms. Each set of three nucleotides (which forms a codon) codes for a specific amino acid. Note that there is a large degree of redundancy in this code (e.g. CCU, CCC, CCA, and CCG all code for proline), the result of which is that certain mutations can occur without affecting protein structure.

First    Second letter                                  Third
letter   U          C          A          G             letter
U        UUU Phe    UCU Ser    UAU Tyr    UGU Cys       U
         UUC Phe    UCC Ser    UAC Tyr    UGC Cys       C
         UUA Leu    UCA Ser    UAA Stop   UGA Stop      A
         UUG Leu    UCG Ser    UAG Stop   UGG Trp       G
C        CUU Leu    CCU Pro    CAU His    CGU Arg       U
         CUC Leu    CCC Pro    CAC His    CGC Arg       C
         CUA Leu    CCA Pro    CAA Gln    CGA Arg       A
         CUG Leu    CCG Pro    CAG Gln    CGG Arg       G
A        AUU Ile    ACU Thr    AAU Asn    AGU Ser       U
         AUC Ile    ACC Thr    AAC Asn    AGC Ser       C
         AUA Ile    ACA Thr    AAA Lys    AGA Arg       A
         AUG Met    ACG Thr    AAG Lys    AGG Arg       G
G        GUU Val    GCU Ala    GAU Asp    GGU Gly       U
         GUC Val    GCC Ala    GAC Asp    GGC Gly       C
         GUA Val    GCA Ala    GAA Glu    GGA Gly       A
         GUG Val    GCG Ala    GAG Glu    GGG Gly       G
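
As a minimal sketch of the steps described in Sections 1.3.2 and 1.3.3, the Python snippet below transcribes a short, made-up coding strand into mRNA and translates it with the standard genetic code of Table 1.1. The function names, the one-letter amino acid codes (with * standing for a stop codon), and the example sequence are illustrative assumptions rather than anything defined in this book, and the model deliberately ignores splicing and every other detail of real cells.

```python
from itertools import product

# Standard genetic code (Table 1.1) in compact form. Codons are enumerated with
# the first letter changing slowest and the third letter fastest, in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first letter U
    "LLLLPPPPHHQQRRRR"  # first letter C
    "IIIMTTTTNNKKSSRR"  # first letter A
    "VVVVAAAADDEEGGGG"  # first letter G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")  # A pairs with T, C pairs with G
RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")  # RNA uses U where DNA uses T


def template_strand(coding_strand):
    """Return the complementary (template) strand, written 5'->3'."""
    return coding_strand.translate(DNA_COMPLEMENT)[::-1]


def transcribe(template):
    """Build the mRNA (5'->3') that pairs with a template strand given 5'->3'."""
    return template.translate(RNA_COMPLEMENT)[::-1]


def translate(mrna):
    """Translate from the first AUG until a stop codon (or the end of the mRNA)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: UAA, UAG, or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


coding = "CCATGCCTGAATGGTAAGG"  # hypothetical coding strand, 5'->3'
mrna = transcribe(template_strand(coding))
protein = translate(mrna)
print(mrna)
print(protein)
```

Because of complementarity, the transcript made from the template strand matches the coding strand with U in place of T. In this toy transcript, translation starts at the first AUG and yields Met-Pro-Glu-Trp (MPEW) before the UAA stop codon ends synthesis.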

[Figure 1.7: a gene with five exons (Exon 1 to Exon 5) is transcribed from DNA into RNA and then alternatively spliced into three mRNAs: exons 1-2-3-4-5 (translated into Protein A), exons 1-2-4-5 (Protein B), and exons 1-2-3-5 (Protein C).]
Figure 1.7 Alternative splicing: After DNA is transcribed into an RNA transcript, the relevant exons are chosen to form mature mRNAs. The choice of which exons to use can result in different mRNAs and different protein products. Note that the order of the exons is kept intact. Image courtesy of the National Human Genome Research Institute.
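
The rule illustrated in Figure 1.7, that exons may be skipped but never reordered, can be sketched in a few lines of Python. This is a toy model under stated assumptions: the function names are hypothetical, and it treats every non-empty subset of exons as possible, whereas real splicing is constrained by the splicing machinery and by relationships between exons (see Box 1.2).

```python
from itertools import combinations


def splice(pre_mrna_exons, keep):
    """Return the exons whose 1-based numbers are in `keep`, in genomic order."""
    kept = set(keep)
    return [
        exon
        for number, exon in enumerate(pre_mrna_exons, start=1)
        if number in kept
    ]


def order_preserving_isoforms(pre_mrna_exons):
    """Yield every non-empty selection of exons; the exon order is never shuffled."""
    numbers = range(1, len(pre_mrna_exons) + 1)
    for size in numbers:
        for keep in combinations(numbers, size):
            yield splice(pre_mrna_exons, keep)


exons = ["Exon 1", "Exon 2", "Exon 3", "Exon 4", "Exon 5"]

# The three mRNAs drawn in Figure 1.7.
isoform_a = splice(exons, [1, 2, 3, 4, 5])
isoform_b = splice(exons, [1, 2, 4, 5])   # Exon 3 skipped
isoform_c = splice(exons, [5, 3, 2, 1])   # Exon 4 skipped; the order of `keep` is irrelevant

print(isoform_b)
print(len(list(order_preserving_isoforms(exons))))
```

Even in this simplified model, a five-exon gene has 31 order-preserving exon combinations, which illustrates how a single gene sequence can give rise to a repertoire of protein products.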

1.3.4 Protein folding and 3-D structure

When a protein polymer is being synthesized by the ribosome during translation, it does not persist as a long string of amino acids chained together. Instead, the protein sequence immediately begins to fold into complex three-dimensional (3-D) structures (Figure 1.8). The rules governing how a protein will fold are profoundly complex and incorporate a large number of parameters. Certainly, the amino acid sequence itself will affect protein folding, where the polarity, hydrophobicity, covalent bonding, and many other physicochemical parameters of the amino acids will influence the folding dynamics. Protein folding is also influenced by many factors extrinsic to the amino acid sequence, such as the polarity of the solvent in which it is folding, solvent ion concentration, temperature, and the presence of special protein folding “helper” molecules called chaperones. Ultimately, the protein will fold into a 3-D structure that minimizes the free energy across the molecule (i.e. is chemically and energetically stable). As you might imagine, errors in protein folding can cause severe disruptions to the protein’s function, and indeed protein misfolding lies at the heart of many well-known diseases, such as Alzheimer’s disease.

The true 3-D structures of many human proteins are yet to be determined, even for very important proteins such as G-protein coupled receptors (GPCR), which serve as targets for the majority of drugs on the market. Protein structures are usually solved by scientists in the field of structural biology, where they employ a technique called X-ray crystallography; however, this approach is complex and time-consuming. In the effort to accelerate the elucidation of human protein 3-D structures, many hope to develop computational approaches to predicting 3-D protein structure from the linear amino acid sequences of human proteins, which are known. If you are interested in assisting the effort to identify the 3-D structures of human proteins, you might consider joining the Folding@Home effort from Stanford University (<http://folding.stanford.edu>), which allows you to download protein-folding software that runs on your computer and allows you to contribute your unused computing power to advanced protein-folding algorithms.

1.4 Gene regulation: when and where a gene is expressed

Gene regulation encompasses many of the underlying processes discussed so far, such as transcription and translation (Figure 1.9). However, in this section, we describe some of the higher-level processes
that govern the initiation and termination of these processes as part of more complex “programs” governing the precise expression of a gene or genes. The biology of gene regulation is still an area of intensive research, and a comprehensive look at the complex phenomena underlying gene regulation has filled the contents of entire books. Here, we briefly introduce several key concepts in gene regulation that provide the necessary biological background for understanding concepts that are discussed later in this book. Interested readers are strongly encouraged to read more on this topic, as knowledge of this subject is critical to the study of the genetic basis of human disease. One excellent resource for further study is the freely-available edition of Genomes 2 which can be found online at the NCBI Bookshelf.

[Figure 1.8 panels: Levels of protein organization. Primary protein structure is the sequence of a chain of amino acids. Secondary protein structure (pleated sheet, alpha helix) occurs when the sequence of amino acids is linked by hydrogen bonds. Tertiary protein structure occurs when certain attractions are present between alpha helices and pleated sheets. Quaternary protein structure is a protein consisting of more than one amino acid chain.]

Figure 1.8 Protein folding: Once an mRNA is translated into an amino acid peptide, the primary structure (sequence) must be folded into a functional protein. Secondary structures, including alpha helices and beta sheets, are formed and folded across each other to form tertiary protein structures. Additionally, multiple subunits of a protein may come together to form a quaternary structure. Image courtesy of the National Human Genome Research Institute.

[Figure 1.9 panels: Transcriptional control (transcription factors, chromatin state, combinatorial control, co-factors, alternative promoters, etc.). Post-transcriptional control (microRNAs, alternative splicing, alternative polyadenylation, RNA-binding proteins, etc.). Labels: TF, mRNA, miRNAs.]

Figure 1.9 Overview of gene regulation processes: While the coding segments of genes take care of the structure of the protein, other processes regulate the “expression” of the gene, or the amount present in a cell at a given time. Regulatory mechanisms at the transcriptional level include binding of transcription factors to promoter regions, as well as chromatin remodeling and state (histone modifications). Additionally, the gene can be regulated at the post-transcriptional level through alternative splicing, microRNA-mediated regulation, and other chemical modifications of the mRNA. Reprinted by permission from Macmillan Publishers Ltd: Chen, K. & Rajewsky, N. The evolution of gene regulation by transcription factors and microRNAs. Nature Reviews Genetics 8, 93–103 (2007).

• Transcription factors: A set of regulatory proteins that contain special functional features in their protein structures called DNA binding domains, such as zinc finger and leucine zipper domains. These DNA binding domains allow the proteins to bind to specific regions of the genome, called recognition sequences, which are typically adjacent to the genes that they regulate. Once a transcription factor binds to a recognition sequence upstream of a gene, a myriad of regulatory events can happen that might affect the gene’s expression. The transcription factor could initiate transcription and therefore instigate expression of the gene, or it may repress the gene by blocking the transcriptional machinery. DNA-bound transcription factors will typically interact with a vast array of additional modifier proteins, such as coactivators, corepressors, or kinases, which may alter the activity of the transcription factor or form larger regulatory complexes.
• Promoters: Regions of regulatory DNA to which transcription factors typically bind. Promoters usually lie upstream of genes and are often composed of specific patterns of DNA base pairs that are recognized by transcription factors. For example, the sequence TATAAA is known as a TATA box motif, which is recognized by TATA binding protein transcription factors. When transcription factors are bound to promoters, they will typically form complexes that attract RNA Polymerase II to bind to the promoter region and initiate transcription of the adjacent gene.
• Chromatin remodeling: Recall that the DNA in human cells is packaged into a highly condensed structure called chromatin. The chromatin state of a particular genomic region can regulate a gene’s expression because it affects the accessibility of transcription factors and RNA Polymerase to the gene region. When the genomic DNA is wound tightly in a nucleosome, transcriptional elements may have a difficult time “finding” the gene they intend to activate. In order to provide easier access to particular genes, the chromatin structure will temporarily reorganize through a process known as chromatin remodeling. The biochemistry behind chromatin remodeling is quite complex; however, we can generalize by understanding that a group of special remodeling proteins will shift the nucleosome cores along the genomic DNA strand to increase access to particular regions of DNA.
• Chromatin modification: Chromatin can also be modified by chemical modifications known as methylation and acetylation, where methyl or acetyl chemical groups, respectively, are added to the nucleosome. These chemical modifications typically occur at lysine residues along the nucleosome protein and they can affect how the transcriptional machinery is able to access and bind to regions of genomic DNA.
• microRNAs (miRNA): Micro RNAs (miRNAs) are an interesting class of small RNA molecules (22 nucleotides long on average) that are now known to post-transcriptionally regulate a large number of mammalian genes. It is thought that they primarily act to silence genes by binding to the 3’ untranslated region (UTR) of their mRNA transcripts. The human genome is predicted to express ~1000 miRNAs which are thought to be differentially expressed across different tissues and cell types. Many miRNAs are found in the intergenic regions of the human genome, and hence offer a functional basis for at least a portion of the 98% of non-coding human DNA.

1.5 The human epigenome

Just when geneticists and genomicists were getting comfortable with the recently sequenced human genome in the early 2000s, with high hopes that it would quickly help to unravel the myriad genetic mysteries behind human nature and disease, several breakthrough studies demonstrated that genes could be modified in ways that could alter their function without changing the DNA sequence. Furthermore, it was revealed that these modifications that did not alter the DNA sequence were heritable from parent to offspring. This phenomenon, known as epigenetics, clearly presented a challenge to the classical dogmas of molecular biology and genetic inheritance, which held that DNA is the source of genetic inheritance between generations, and thus complicated our view of genomics even further. The term epigenetics simply refers to any heritable changes that happen above the level of DNA sequence. There are several complex processes that can facilitate epigenetic modifications, several of which we have already described, such as methylation. The effects of epigenetics are so widespread and important to normal cellular functioning as well as human health and disease that researchers now often refer to the human epigenome as an entity that is completely distinct from the human genome. Although this distinction is fuzzy, there is currently a Human Epigenome Project underway (<http://www.epigenome.org>) that aims to identify and catalog all epigenetic features across the human genome in a similar spirit to how the human genome was initially decoded. Since researchers still have much to learn about the human epigenome, it will not be discussed heavily in the first edition of this book. However, you should understand at least a few important implications of epigenomics to prevent you from thinking that DNA is the only way to explain variability and inheritance of traits.

• Genomic imprinting: Under the classic model of genetic inheritance, you inherit one copy of a gene

from your mother, and another copy of the same gene from your father, and both of these genes will be “turned on” (i.e. expressed) in your cells with equal probability. However, in some cases you will inherit genes in this way and one of the genes, either the paternal or maternal copy, will be specifically inactivated. Therefore, despite the fact that you have two copies of a gene, you will express the gene in a parent-of-origin-specific manner. This phenomenon is called genomic imprinting, and although it affects a relatively small percentage of genes in the human genome, it can have substantial consequences for an individual’s development and disease risk. A gene will become imprinted, or “stamped”, in either the sperm or the egg through epigenetic mechanisms (typically methylation). This imprinting often happens in the promoter region of a gene and permanently prevents the gene from being expressed in the offspring. Rarely, erroneously imprinted copies of the same gene can be inherited from both the father and the mother, resulting in no expressed copies in the offspring. This phenomenon can result in disease, such as the developmental disorders Prader-Willi syndrome and Angelman syndrome.
• Environmental factors: Understanding that disease is the product of genes and environment, the discovery of epigenetics led researchers to study possible interactions between the environment and epigenetic modification. It turns out that the environment can indeed work through epigenetic mechanisms to alter our biology in profound ways. A flurry of work has linked environmental pollutants to epigenetic modifications in genomic DNA that can, for example, modify genes in a developing fetus resulting in the child developing asthma in childhood, or cause epigenetic modifications to somatic cells that lead to dramatically increased cancer risk. It has also been shown, most strongly in rats and mice, that environmental pollution can cause epigenetic modifications to germ-line DNA (i.e. in sperm and eggs) and be passed down through several generations of offspring. Therefore, environmental context can be inherited across generations through epigenetic imprinting of DNA!

1.6 Replication and reproduction

The molecular and cellular biology behind cellular replication and human reproduction are now regular parts of the curriculum of secondary school education in many countries. Therefore, we will only aim to refresh the reader’s memory of a handful of key concepts pertaining to these phenomena that are important for understanding other topics discussed in this book.

Foremost, it is important to understand the distinction between the two forms of cellular division and replication that happen in the human body. In mitosis, a cell will replicate its nuclear DNA and divide such that two diploid copies of a cell are produced. This form of cell replication occurs in regenerating body tissues such that tissues can grow or renew, such as in muscle repair after exercise. Any changes that occur to the DNA through errors in mitosis are called somatic mutations, which are not passed on to offspring because genetic material from somatic cells (i.e. cells that make up your body) is not passed to offspring during sexual reproduction. In meiosis, diploid germ cells will recombine DNA and divide to produce four haploid gamete cells (i.e. sperm or egg), which contain the genetic material passed to offspring through sexual reproduction. Modifications to DNA during meiosis can have a profound impact on the offspring, but do not affect the “host” body in any way. Although these terms sound similar, you can see that they are very different, and therefore it is important to distinguish them. One mnemonic that is useful in this case is to remember the phrase “mitosis happens in my toe” to recall that mitosis occurs in somatic cells and meiosis therefore happens in reproductive cells.

An important event that occurs during meiosis is homologous recombination, which works to maintain a sufficient level of genetic variation within a population. Homologous recombination, often referred to as simply “recombination”, is one of the reasons why a set of siblings from the same parents will have different characteristics despite their shared genetic origins. The biology underlying meiosis and recombination is quite complex, but all you should understand is that at some point during meiosis, homologous chromosomes in the germ cell will pair up and exchange DNA segments

[Figure 1.10 panels: Recombination between two homologous chromosomes. Paternal and maternal chromosomes undergo DNA replication and crossing over, producing recombined chromosomes.]

Figure 1.10 Recombination: During meiosis, genetic material is exchanged between homologous pairs of chromosomes in a process known as
recombination. While the resulting genetic material is approximately the same in the end, the specific combination of variants may be shuffled
around. Image adapted from the National Human Genome Research Institute.

which recombine to form new chromosomal variants. More plainly, let us assume we are talking about meiosis in the father’s germ cell. The father inherited two copies of a particular chromosome (for instance, chromosome 6): one from his mother and another from his father. At some point during meiosis, these maternal and paternal copies of chromosome 6 will pair up and exchange segments of DNA through a process called “crossover” (Figure 1.10). One analogy is to think of each of the parental chromosomes as a strip of colored paper, one blue and the other pink. Then, imagine cutting these strips at random points, switching the blue and pink segments, and gluing them together. Although the two new strips contain the same total “information” as before, they have been recombined to create two novel variants of this information. The points at which chromosomes break and interchange with their homologous partner are not completely random: regions that undergo recombination more frequently are known as recombination “hotspots”.

As you might imagine, this pseudo-random shuffling of DNA between ancestral paternal and maternal lineages provides the genetic variability that is necessary to allow future generations to adapt to a constantly changing environment. However, the stochastic (i.e. random) nature of recombination also provides the opportunity for deleterious genetic events to occur. For example, “errors” during recombination can lead to chromosomal variants that have extra copies of genes, or might even have entire genes completely deleted from the genomic DNA. Such changes in gene number are called Copy Number Variants (CNV), and while they might be advantageous given the right environment, they most often have either a neutral or detrimental effect on the offspring.
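The colored-paper analogy for crossover can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each chromosome is a string of equal length, there is a single cut point, and the function name `crossover` and the strip contents are hypothetical rather than taken from any genomics library.

```python
def crossover(paternal, maternal, point):
    """Swap everything after `point` between two equal-length strips.

    Simplification: one cut point only; real meioses may have several.
    """
    if len(paternal) != len(maternal):
        raise ValueError("homologous strips must have the same length")
    recombinant_1 = paternal[:point] + maternal[point:]
    recombinant_2 = maternal[:point] + paternal[point:]
    return recombinant_1, recombinant_2


# Hypothetical strips: B = blue (paternal) segment, P = pink (maternal) segment.
blue = "B" * 10
pink = "P" * 10
strip_1, strip_2 = crossover(blue, pink, 4)
print(strip_1)
print(strip_2)
```

The two new strips together still contain exactly the same segments as the two original strips; only their arrangement has changed.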

1.7 Genetic variation

Genetic variation is simply a catch-all term to describe all the observed variability in genetic and genomic characteristics between individuals and populations. In genetic variation lie the keys to understanding differences in traits, such as height or eye color, as well as differences in dispositions to genetic or complex diseases, and genetic markers of heredity and ancestry. Current estimates put the genetic similarity between any two individual humans at approximately 99.9% on average. A 0.1% difference might not seem like a big deal, but remember that the human genome comprises just over 3 billion base pairs, so a 0.1% difference entails around 3,000,000 base-pair differences! Many of these differences are likely to be benign, occurring in inconsequential regions of non-functional DNA, or contributing to common differences in human traits, such as height or hair color. Yet some proportion of this variability will cause individuals to harbor recessive genetic diseases, be more or less susceptible to complex disease, or cause them to metabolize medications differently. Therefore, relatively small changes across the vast human genome are the focal point of many modern genomic studies. We learn more about genetic variation in the human population in later chapters on ancestry and trait associations.

1.7.1 Polymorphism

Polymorphism is a word you will hear frequently when you read scientific articles pertaining to genomics. If you consider its Greek roots, the term polymorphism simply means “many forms”. In the context of genetics and genomics, it is used to describe the genetic variability that can be found along the human genome (Figure 1.11). A specific location on the genome is called a locus, and therefore, we often say that a gene is found at a particular locus. If we were to sequence a gene in a number of unrelated individuals, we might find small differences among individuals in the nucleotide sequence used to encode the gene. Therefore, we can say that several forms, or alleles, exist for this gene in the population, and that the gene is polymorphic. Let us consider just a single position in one of the coding regions (exons) of the gene’s nucleotide sequence. If we sample the DNA of the population we might find that 80% of the population has an adenine (A) at this position, while the remaining 20% has a guanine (G) at this position. We would call this a single nucleotide polymorphism (SNP), pronounced “snip”, where A is the major allele and G is the minor allele, based on the population frequency. SNPs are the most abundant polymorphisms found in the human genome (Figure 1.11).

It is important to understand that all of the alleles at a position can be “healthy”, in that the major alleles are not necessarily biologically preferred or necessary for maintaining proper function. Individuals with the minor allele can be just as biologically robust as those carrying the major allele. The SNP may simply not be functionally relevant, or for coding regions, recall the earlier discussion about the human codon table and how redundancy is built into the genetic code. Since SNPs are known to be associated with disease, it is useful to characterize SNPs across the human genome. The government-funded HapMap project (<http://hapmap.ncbi.nlm.nih.gov/>) set out to catalog all of the “normal” variation found among human populations by identifying a large number of SNPs across globally and ethnically distributed human populations. As we will find out first-hand in later chapters, the HapMap project is a highly valuable tool for exploring personal genomics.

Person 1    ATCGCATCGAT ACGCTACGCTACG
Person 2    ATCGCATCGAT GCGCTACGCTACG
Person 3    ATCGCATCGAT GCGCTACGCTACG
Person 4    ATCGCATCGAT ACGCTACGCTACG
Person 5    ATCGCATCGAT ACGCTACGCTACG
Single Nucleotide Polymorphism (SNP): the first base of the second block

Figure 1.11 Single Nucleotide Polymorphisms: A single base pair mutation (point mutation) that is observed in more than one percent of the population is known as a single nucleotide polymorphism (SNP). These mutations, which most often only have two alleles, arose at some point in human history and have spread throughout human populations. In this example, we observe a SNP with two alleles (A and G).
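To make the idea of major and minor alleles concrete, the short Python sketch below tallies the bases at each position of the five sequences shown in Figure 1.11 and reports any position where more than one allele is observed. It is a toy illustration only: the function name `allele_frequencies` is hypothetical, the two blocks of each sequence are joined into one string, and five people are far too few to estimate a real population frequency.

```python
from collections import Counter


def allele_frequencies(sequences):
    """Return {position: {base: frequency}} for positions with >1 allele."""
    polymorphic = {}
    for position, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        if len(counts) > 1:  # more than one allele seen at this position
            total = sum(counts.values())
            polymorphic[position] = {b: c / total for b, c in counts.items()}
    return polymorphic


# Sequences from Figure 1.11, with the two blocks joined (Persons 1 to 5).
people = [
    "ATCGCATCGATACGCTACGCTACG",
    "ATCGCATCGATGCGCTACGCTACG",
    "ATCGCATCGATGCGCTACGCTACG",
    "ATCGCATCGATACGCTACGCTACG",
    "ATCGCATCGATACGCTACGCTACG",
]

snps = allele_frequencies(people)
for position, freqs in snps.items():
    major = max(freqs, key=freqs.get)
    minor = min(freqs, key=freqs.get)
    print(position, freqs, "major:", major, "minor:", minor)
```

In this small sample, three of five people carry A and two carry G at the variable position, so A would be called the major allele here.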

[Figure 1.12 panels]

Normal
Amino acids:            Ala Ile Arg Leu Gly Tyr Ser Ala Cys Ile His Val Ala Ile Arg
tRNA anticodons:        CGA UAU UCC GAU CCA AUG UCA CGU ACG UAU GUG CAU CGA UAU GCG
mRNA codons (5' to 3'): GCU AUA AGG CUA GGU UAC AGU GCA UGC AUA CAC GUA GCU AUA CGC

Missense mutation
Amino acids:            Ala Ile Arg Leu Ala Tyr Ser Ala Cys Ile His Val Ala Ile Arg
tRNA anticodons:        CGA UAU UCC GAU CGA AUG UCA CGU ACG UAU GUG CAU CGA UAU GCG
mRNA codons (5' to 3'): GCU AUA AGG CUA GCU UAC AGU GCA UGC AUA CAC GUA GCU AUA CGC

Nonsense mutation
Amino acids:            Ala Ile Arg Leu Gly Tyr Ser Ala Cys stop
tRNA anticodons:        CGA UAU UCC GAU CCA AUG UCA CGU ACG AUU
mRNA codons (5' to 3'): GCU AUA AGG CUA GGU UAC AGU GCA UGC UAA CAC GUA GCU AUA CGC

Figure 1.12 Types of coding SNPs: Some SNPs that occur in coding regions can affect the amino acid sequence of the protein. These SNPs are
known as non-synonymous SNPs and can result in a change in amino acids (missense) or the introduction or loss of a stop codon (nonsense and
nonstop, respectively). Additionally, synonymous SNPs (not pictured in the figure) are SNPs that change the DNA sequence, but do not affect the
amino acid sequence (as the same amino acid is used due to redundancy in the genetic code; e.g. a point mutation that changes a codon from
GCC to GCA). Image adapted from the National Human Genome Research Institute.
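The categories of coding SNP in Figure 1.12 can be expressed as a small classification routine. The Python sketch below builds the standard genetic code for mRNA codons and labels a codon change as synonymous, missense, nonsense, or nonstop. The function name `classify_codon_change` is hypothetical, and the nonsense and nonstop test codons (UAC and UAA) are illustrative choices rather than examples from the figure.

```python
from itertools import product

# Standard genetic code, listed in UCAG order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(reference, mutant):
    """Label the effect of changing one mRNA codon into another."""
    before = CODON_TABLE[reference]
    after = CODON_TABLE[mutant]
    if before == after:
        return "synonymous"   # same amino acid (or stop) is encoded
    if after == "*":
        return "nonsense"     # an amino acid codon became a stop codon
    if before == "*":
        return "nonstop"      # a stop codon became an amino acid codon
    return "missense"         # a different amino acid is encoded


print(classify_codon_change("GCC", "GCA"))  # caption example, Ala to Ala
print(classify_codon_change("GGU", "GCU"))  # Figure 1.12, Gly to Ala
print(classify_codon_change("UAC", "UAA"))  # illustrative, Tyr to stop
```

The routine compares only the encoded amino acids, so it says nothing about how severe a missense change is likely to be.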

Polymorphisms come into existence by a mutation at a previous moment in evolutionary history that persisted and spread throughout the population (see Box 1.3). These mutations can be caused by any number of factors, such as errors in DNA replication, radiation, mutagenic toxins, structural methylation, or viral activity. Mutations can involve regions of DNA spanning one to many base pairs. When a mutation involves a single base pair, it is typically referred to as a point mutation. Mutations in regions of the genome coding for genes are of particular interest because they have the potential to change the composition, and therefore the function, of the protein product encoded by the gene. If a mutation occurs in one of the codons of a gene coding region, it can have one of several possible effects (Figure 1.12). A mutation in this codon might not change the amino acid at all because it might simply mutate one codon into another codon that codes for the same amino acid; this type of mutation is called a synonymous mutation. It is generally thought that synonymous mutations are benign, but recent studies have demonstrated that synonymous mutations can impact the structure of the mRNA or affect the efficiency of translation. A mutation that causes a region to code for a different amino acid is called a non-synonymous or missense mutation. The severity of these mutations depends on several factors, such as the physicochemical differences between the original and mutant amino acids. Another type of non-synonymous mutation is a mutation that causes an amino-acid coding codon to mutate into a stop codon, which is typically referred to as a nonsense mutation; these are often more detrimental as they can result in truncation of the protein encoded by the gene.

Although SNPs are one of the most prominent types of genetic polymorphism in the human genome, they are far from the only types found (Figure 1.13). SNPs have stolen the stage because the first phase of the personal and medical genomics revolution was enabled by a specific type of genotyping

[Figure 1.13 panels: Single nucleotide polymorphism (SNP); Insertion and deletion polymorphism (indel); Nucleotide repeat polymorphism; Copy number variation (gene deletion and gene duplication).]

Figure 1.13 Other types of mutations and polymorphisms: In addition to single nucleotide polymorphisms, other types of variations can also
alter the DNA and cause a functional change. For instance, indels can form when bases can be inserted or deleted, while the tandem duplication of
short segments can produce nucleotide repeat polymorphisms. When these mutations occur in coding regions, they can cause a “frameshift” in
codons. Additionally, larger variants can delete or duplicate whole genes or large segments of DNA, which are known as copy number variants and
structural variants (see Chapter 10). Reproduced from Hingorani, A. D., Shah, T., Kumari, M., Sofat, R. & Smeeth, L. Translating genomics into
improved healthcare. BMJ 341, c5945 (2010) with permission from BMJ Publishing Group Ltd.

technology called SNP microarrays, or “SNP chips”. With the advent of affordable full-genome sequencing technology, other forms of polymorphism will be easier to measure and evaluate. Important polymorphisms found in the human population include:

• Single nucleotide polymorphism (SNP): SNPs entail a class of polymorphisms where a single nucleotide in the human genome is found to be different among members of a population.
• Insertion/Deletion (indel): Indel polymorphisms represent a type of genetic variation in which sections of nucleotide sequence are found to be inserted into or deleted from a region in the human genome. Although indel polymorphisms can be associated with disease, especially in coding regions, there are many cases of benign indel polymorphisms in the population.
• Repeat polymorphisms: Repeat polymorphisms are repeated patterns of short DNA sequences (e.g. GATAGATA) found in abundance in the human genome. Since the repeat sequences are typically short, ranging from two to five base pairs in length, these repeats are often called short tandem repeats (STR). STR polymorphisms are important for forensics, because STR polymorphisms between individuals serve as the basis for the DNA fingerprinting used by law enforcement agencies. Another larger type of repeat polymorphism, called Alu repeats, is also used frequently to study ancestry and population structure.
• Copy number variations (CNV) and structural variants (SVs): CNVs describe segments of DNA which have a different number of copies between individual genomes. Although this

Box 1.3 “Molecular Evolution”

Evolution is critical to understanding the origins of species, genome architecture, and even human disease mutations. Evolutionary studies comparing genomes across species have contributed significantly to our understanding of the human genome. Perhaps one of the most critical things to understand is that evolutionary forces are the primary drivers of genetic variability both across species and within populations. The evolutionary forces acting on the human genome have contributed to substantial changes in our species, even within the past several thousand years. Although evolution is typically discussed in terms of millions of years, it is important to note that evolution is ongoing, and that evolutionary forces come to bear on each human generation born into existence today. Additionally, the environment is very important in shaping evolution, and our interactions with other species, such as the bacteria in our digestive tracts, can have a profound effect on our evolution. For example, it appears that up to 8% of the human genome is of viral origin, meaning that transposable DNA from ancient viruses was integrated into our ancestral genome some time in the past and has now become part of the blueprint for who we are as a species! A number of evolutionary forces have shaped the human genome and are important for understanding how genomes operate in modern humans.

• Mutation: A mutation is simply a change to the DNA sequence of a genome. With regard to evolution, we are primarily concerned with mutations that occur in the germ line of an organism as these mutations can be passed on to offspring.
• Natural selection: Natural selection is a key principle in molecular evolution, which pertains to the increase or decrease in particular genetic traits as a function of fitness and reproductive success. Although natural selection is a complex process, it can be thought of as a kind of filter that removes suboptimal alleles from a population so that the population is better adapted to its environment.
• Genetic drift: In each generation of a population, there will be individuals who will leave behind more offspring than others simply by chance. After several generations the effects of this random sampling will cause the allele frequencies to “drift” randomly towards higher or lower frequencies, which is called genetic drift (Figure 1.14). An important thing to understand about genetic drift is that if we observe changes in allele frequencies within a population, the observed changes are not necessarily the result of natural selection, but rather could simply be explained by random genetic drift. In a sufficiently large population, allele frequencies will tend to drift around some stable population equilibrium (see Box 1.4). The effects of genetic drift are more profound in smaller populations, such as populations that experience a population bottleneck (Figure 1.15). One important consequence of this is called the founder effect. Imagine a fictional ancient human population of 100 individuals that decided to separate from their main population and establish a new, isolated human settlement in some faraway land. The reduced genetic diversity and smaller pool of reproducing individuals in this population could cause extreme, random shifts in allele frequencies for this population as it expands. We might observe that formerly rare alleles reach fixation at 100% frequency in the population in just a few generations. Such populations can be more prone to certain recessive genetic disorders due to the reduced genetic diversity in their populations.
• Gene Duplication: Gene duplication is one of the primary mechanisms for functional innovation in genome evolution. Many of the important gene families in our genome, such as hormones and cell surface receptors (which facilitate the effects of medicinal drugs) came about through gene duplication events in evolutionary history. Gene duplication occurs when any region of genomic DNA containing a gene is duplicated and becomes fixed in the germ line of a species. Duplications can occur as the result of errors during recombination, the activity of retrotransposons, or in some cases entire chromosomes can be duplicated. From an evolutionary standpoint duplications are very interesting because a duplicated “copy” of a gene often has reduced selective pressures acting against it because it is usually functionally redundant to the original copy it was duplicated from. This means that it is more free to acquire mutations and potentially evolve its function towards the formation of a completely novel gene.
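The random sampling behind genetic drift can be explored with a short simulation. The Python sketch below uses the Wright-Fisher model, a standard textbook simplification that is not described in this box: each generation, every one of the 2N allele copies in a diploid population of N individuals is drawn at random according to the previous generation's allele frequency. The function name, population sizes, starting frequency, and random seed are all hypothetical choices made for illustration.

```python
import random


def simulate_drift(pop_size, start_freq, generations, seed=0):
    """Track one allele's frequency under pure drift (Wright-Fisher model)."""
    rng = random.Random(seed)      # seeded so that runs are repeatable
    n_copies = 2 * pop_size        # diploid: two allele copies per person
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Each copy in the next generation carries the allele with
        # probability equal to the current allele frequency.
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        freq = count / n_copies
        history.append(freq)
    return history


# A founder-sized population versus a much larger one (sizes are hypothetical).
small = simulate_drift(pop_size=100, start_freq=0.1, generations=50)
large = simulate_drift(pop_size=10000, start_freq=0.1, generations=50)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Running the simulation with different seeds should show the small population wandering much further from its starting frequency than the large one, and an allele that reaches a frequency of 0 or 1 stays there, which corresponds to loss or fixation.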

Box 1.4 Hardy-Weinberg equilibrium

Developed independently by British mathematician G.H. Hardy and German physician Wilhelm Weinberg, the Hardy-Weinberg Equilibrium (HWE) model is a theoretical mathematical model concerning the probability and distribution of genotype and allele frequencies in a population. Somewhat analogous to the rules of Mendelian inheritance that describe the transmission of alleles at the family level, the HWE establishes a framework that can be used to model and predict genotype frequencies in large, stable populations. To illustrate, let’s consider a single SNP locus that has three possible states: homozygous for the minor allele (aa), heterozygous (Aa), or homozygous for the major allele (AA). The parameters of the HWE equations are the frequency of the major allele, denoted p, and the frequency of the minor allele, denoted q, with the relationship between p and q expressed in the equilibrium model as

$$p^2 + 2pq + q^2 = 1$$

Note that because we are dealing in allele frequencies, we can easily infer the frequency of one allele given the frequency of the other. For example, if we measure a SNP in a population and find that the major allele (A) has a frequency of 90%, then we can subtract this proportion from one to determine that the minor allele (a) has a frequency of 10%. We can then use the HWE equation to estimate the frequency of heterozygotes (2pq) in the population.

$$0.9^2 + 2pq + 0.1^2 = 1$$

$$0.81 + 2pq + 0.01 = 1$$

$$2pq = 1 - 0.82 = 0.18$$

The HWE would assert that the frequencies and relative proportions of these genotypes remain stable (i.e. in equilibrium) over time if all the assumptions of the HWE are satisfied. The assumptions of the HWE, all of which must be satisfied to assert HWE, are:

• The population is (infinitely) large
• The mating patterns between population members are random
• All members of the population have equal reproductive success
• Males and females have similar allele frequencies (more likely on an autosomal locus)
• Mutation is not occurring
• There is no significant migration in or out of the population
• All genotypes have equal fitness (i.e. there is no selection)

At this point in the chapter, even a genomics neophyte should carry enough understanding to suspect that the assumptions of HWE are likely to be violated in the majority of cases. In fact, this is why HWE is useful: deviation from HWE often suggests that the locus has been affected by non-equilibrium forces such as mutation or evolutionary selection. One of the most straightforward ways to statistically evaluate deviations from HWE is to perform a Pearson’s chi-squared test, which looks for a significant difference between the observed genotype frequencies and the genotype frequencies that would be expected under HWE. It is also reasonable to use HWE as a basis for inferring population genotype frequencies where only the frequency of a single allele for a locus is known or reported, for example by a genetic association study.
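The arithmetic in this box, and the chi-squared check it mentions, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `hwe_expected` and `hwe_chi_squared` are hypothetical, and the test assumes genotype counts for a single biallelic SNP in which both alleles are observed, with one degree of freedom (three genotype classes, minus one, minus one allele frequency estimated from the data).

```python
import math


def hwe_expected(p):
    """Expected genotype frequencies (AA, Aa, aa) for major-allele frequency p."""
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


def hwe_chi_squared(n_AA, n_Aa, n_aa):
    """Pearson chi-squared test of observed genotype counts against HWE.

    Returns the chi-squared statistic and its p-value for one degree
    of freedom. Assumes both alleles are present in the sample.
    """
    n = n_AA + n_Aa + n_aa
    # Estimate the major-allele frequency from the genotype counts:
    # each AA individual carries two copies of A, each Aa carries one.
    p = (2 * n_AA + n_Aa) / (2 * n)
    expected = [freq * n for freq in hwe_expected(p)]
    observed = [n_AA, n_Aa, n_aa]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree
    # of freedom, written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    print(hwe_expected(0.9))
    print(hwe_chi_squared(850, 100, 50))
```

With p = 0.9, `hwe_expected` returns a heterozygote frequency of approximately 0.18, matching the worked example. A small p-value from `hwe_chi_squared` suggests that the observed genotype counts deviate from what HWE would predict.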

can pertain to any segment of DNA ranging from a few kilobases to even megabases in size, it is easiest to conceptualize CNVs in terms of gene regions. Since humans are diploid, an individual typically inherits two copies of any one gene: one from the father and the other from the mother. However, an individual may have only one copy of a gene because the other copy was deleted in one of the parental genomes, or three copies because the gene was duplicated. CNVs can be inherited or can arise spontaneously due to errors in the DNA replication machinery; the latter are known as de novo CNVs. CNVs are implicated in a number of diseases, such as schizophrenia, and are suspected to underlie many more genetic disorders. Structural variants (SVs) are rearrangements of DNA that, like CNVs, can affect large regions of DNA; they include insertions, deletions, and inversions that can affect entire genes. We will discuss CNVs and SVs further in Chapter 11.

[Figure 1.14: two line plots of allele frequency (%) against generations (0 to 50). Top panel: population size (N) = 50. Bottom panel: population size (N) = 5,000.]

Figure 1.14 Effects of genetic drift on allele frequencies: These graphs show the effect of genetic drift on allele frequencies in a small (top; N = 50) and large (bottom; N = 5,000) human population for three distinct SNP variants exhibiting Hardy-Weinberg equilibrium. In this example, each allele begins with a population frequency of 50%, and the frequency begins to vary over successive generations due to the effects of genetic drift. The effects of random sampling error are much more drastic in the small population, causing one of the alleles to drift towards fixation (solid line), and another towards elimination (i.e. loss of the allele from the population) (dashed line). In the larger population, the effects of genetic drift due to sampling error are less drastic, and the allele frequencies do not vary far from their original population frequency over successive generations.

[Figure 1.15: illustration with three stages labeled “Original population”, “Bottleneck event”, and “Next generation after bottleneck”.]

Figure 1.15 Population bottleneck: This phrase describes a scenario in which a population’s size is substantially reduced for at least one generation, perhaps due to a pandemic disease, war, natural disaster, or some other event. The reduced population size will reduce the total amount of genetic variation in the population, and, as illustrated in Figure 1.14, the effects of genetic drift on allele frequencies can be much more drastic in smaller populations. In this illustration, insects of the same species are contained within a bottle, and are distinguished by their external color, which is controlled by a single allele. Even if a small population quickly recovers to a large population size in subsequent generations, the effects of the population bottleneck, in terms of reduced genetic variation, can be apparent in the genomic makeup of a population. The genomic patterns of modern populations of European ancestry seem to bear hallmarks of a historical population bottleneck, marked by reduced overall patterns of heterozygosity compared to other worldwide populations.

1.7.2 Linkage disequilibrium

Thus far, we’ve discussed the different types of mutations independently, where a single mutation occurs and may spread throughout the population until it reaches a high enough frequency to become a polymorphism. However, the mechanics of genomic inheritance imply that the inheritance of any two polymorphisms may not be independent. To illustrate how this may come about, consider two germline point mutations that occur on the same copy of a chromosome 100 bases apart. When a chromosome is transmitted to a sperm or an egg cell during meiosis (see Section 1.6, Replication and reproduction), these two neighboring SNPs will be transmitted together, unless there is a recombination event in between. Furthermore, the probability of recombination is relatively low for small sections (such as the two SNPs 100 base pairs apart): there are, on average, 35 recombination events per meiosis, which corresponds to roughly one recombination every 100 Mb. Over time, this process continues and leads to the correlated inheritance of SNPs, which are known as linked SNPs (Figure 1.16). More broadly, this phenomenon is also termed linkage disequilibrium (LD); when two SNPs are inherited randomly (unlinked), they are said to be in equilibrium. Thus, a high level of LD implies that two SNPs are “linked”; the measure of linkage is known as $R^2$, which is the level of correlation between the
SNPs. A perfect correlation (fully linked SNPs, where genotype A of SNP 1 is always observed with genotype B of SNP 2) occurs when $R^2 = 1.0$, and two SNPs that are inherited independently of each other will have $R^2 = 0$.

As we progress through our discussion of personal genomics, the concept of LD will become crucial for clinical and phenotype risk analysis, as well as ancestry analysis. One major reason for this is the development of haplotype blocks in a population (Figure 1.17). These blocks are stretches of SNPs that are often inherited as a group; thus, we would only need to measure one SNP (a tag SNP) to predict the genotype of another (in a process known as imputation). In reality, the situation is not always as clear-cut: unless we can find a SNP in perfect LD ($R^2 = 1$), we may not be able to fully and accurately impute our SNP of interest. Additionally, different populations will have different patterns of linkage disequilibrium: populations that have undergone bottlenecks typically have longer haplotype blocks than those that have not. As we will see, these haplotype blocks will be used in the design of genotype-phenotype association experiments (Chapter 2) and the application of these associations to our personal genomes (Chapter 6), as well as ancestry analysis (Chapter 5).

[Figure 1.16: schematic of three-position haplotypes changing over time, with the positions marked as linked or unlinked.]

Figure 1.16 Polymorphisms are often inherited together due to linkage disequilibrium: As mutations arise, they are inherited together with nearby alleles, unless a recombination event occurs in between. Over time, as these alleles spread throughout the population, the presence of one of the alleles is correlated or “linked” with the other. These SNPs are said to be in linkage disequilibrium (LD).

[Figure 1.17: schematic of a stretch of DNA with SNPs marked by stars, above a triangular display of pairwise linkage disequilibrium.]

Figure 1.17 Haplotype blocks: The phenomenon of linkage disequilibrium (LD) often results in the correlated inheritance of nearby variants. In this schematic, a stretch of DNA with SNPs (represented by stars) is shown. Below is a commonly used display of linkage disequilibrium patterns: darker shaded regions correspond to more tightly linked SNPs. In this example, the white star SNP is unlinked to the gray star SNPs, which are all linked with each other. To read the diagram, follow the diagonal down from one SNP until it meets the opposite diagonal from the other SNP of interest. Here, because the triangle is dark between the two marked SNPs, the two SNPs are in high LD.

Further reading

Unfortunately, there is more fascinating breadth and depth to genomics than could be covered by this primer. We suggest the following materials and resources for those inclined to expand their study and understanding of molecular biology and genomics.

Alberts, B. (2008) Molecular biology of the cell. New York, NY: Garland Pub.
Berger, S. L. (2000) Gene regulation. Local or global? Nature 408, 412–13, 415.
Brown, T. A. (2007) Genomes Three. London: Taylor & Francis.
Chen, K. & Rajewsky, N. (2007) The evolution of gene regulation by transcription factors and microRNAs. Nature Reviews Genetics 8, 93–103.
Collins, F. S. (2011) The Language of Life. New York, NY: Harper Perennial.
Hingorani, A. D., Shah, T., Kumari, M., Sofat, R. & Smeeth, L. (2010) Translating genomics into improved healthcare. BMJ 341, c5945.
Lynch, M. (2007) The origins of genome architecture. Sunderland, MA: Sinauer Associates Inc.
Nei, M. & Kumar, S. (2000) Molecular evolution and phylogenetics. New York: Oxford University Press.
Ridley, M. (2006) Genome. New York, NY: Harper Perennial.
Watson, J. D., Caudy, A. A., Myers, R. & Witkowski, J. A. (2007) Recombinant DNA. W. H. Freeman.
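
As a supplement to Section 1.7.2, the linkage measure discussed there can be computed directly from phased haplotypes. The sketch below is a minimal illustration that assumes the chapter's $R^2$ is the usual squared correlation between alleles at two biallelic SNPs; the function name `r_squared` and the toy haplotypes are hypothetical and are not data from this chapter.

```python
def r_squared(haplotypes, allele_1, allele_2):
    """Squared correlation between two biallelic SNPs.

    haplotypes: list of (allele at SNP 1, allele at SNP 2) pairs, one
        pair per chromosome copy (phased data is assumed).
    allele_1, allele_2: the allele to track at each SNP.
    """
    n = len(haplotypes)
    p1 = sum(1 for a, _ in haplotypes if a == allele_1) / n
    p2 = sum(1 for _, b in haplotypes if b == allele_2) / n
    p12 = sum(1 for a, b in haplotypes if a == allele_1 and b == allele_2) / n
    # D measures how far the joint frequency departs from what
    # independent inheritance (p1 * p2) would predict.
    d = p12 - p1 * p2
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


if __name__ == "__main__":
    # Hypothetical toy data: allele A at SNP 1 always travels with C at SNP 2.
    linked = [("A", "C")] * 5 + [("G", "T")] * 5
    # Hypothetical toy data: every allele combination is equally common.
    unlinked = [("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")]
    print(r_squared(linked, "A", "C"))
    print(r_squared(unlinked, "A", "C"))
```

In the first toy sample the two SNPs are perfectly correlated, and in the second the alleles at one SNP say nothing about the other, which corresponds to the two extremes described in Section 1.7.2.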
