
4

Genome and Gene Structure*


Madhuri R. Hegde1,2, Michael R. Crowley1,2
1Department of Human Genetics, Emory University, Atlanta, GA, United States
2The Department of Genetics, The University of Alabama at Birmingham, Birmingham, AL, United States

The past decade of biological research has focused heavily on the human genome, and the Human Genome Project has had a significant impact on biomedical research. Our genetic material is encoded in two genomes: nuclear and mitochondrial. Both genomes reflect the molecular evolution of humans, which started about 4.5 billion years ago. The function of the human genome is to transfer information reliably from parent to daughter cells and from one generation to the next. At particular developmental times and in specific tissues, the transcriptional machinery initiates programmed patterns of gene expression that are dictated by chromatin structure and the activities of transcriptional regulatory factors. Gene expression is followed by the processes of splicing, translation, and protein localization, ultimately leading to synthesis of the protein and RNA molecules that mediate cellular function. Variability in genome structure, including single nucleotide differences and larger-scale variations at the genome level between humans, dictates the traits we manifest as well as the diseases to which individuals are predisposed.

4.1 INTRODUCTION: COMPOSITION OF THE NUCLEAR HUMAN GENOME
The present assembly of human DNA sequence contains approximately 3.1 billion bp, which covers most of the nonheterochromatic portions of the genome and contains about 250 gaps. The 3.2 Gbp are packed into 22 pairs of autosomes and two sex chromosomes, X and Y. The human chromosomes are not of equal size; the smallest, chromosome 21, is 54 Mbp long, and the largest, chromosome 1, is about 249 Mbp long (Fig. 4.1 and Table 4.1). From a functional point of view, the genomic sequences are distinguished by genes, pseudogenes, and noncoding DNA, and only a minute fraction of the sequences code for proteins (approximately 2%). There are many pseudogenes (0.5%), but most of the genome consists of introns and intergenic DNA. Almost half of intergenic sequences consist of transposons. Gene clusters have evolved from several duplication events in deep evolutionary time, and these include clusters such as the HOX and globin clusters. The chemical structure of genetic material as well as the storage, processing, and transfer of genetic information from one generation to the next are similar in all living organisms. Thus, it was expected that the complexity of the human phenotype would be explained by a significantly higher number of genes in humans compared with simpler organisms. Surprisingly, instead of the predicted 50,000–150,000 human genes, the sequence of the human genome revealed about 20,000–25,000 genes, similar to the number of genes in many other organisms. However, analysis of the genome and its products has revealed complexity in the form of exquisite temporal and spatial regulation of gene alternative transcripts,

* This article is a revision of the previous edition article by David H. Cohen, vol. 1, pp. 61–80, © 2007, Elsevier Ltd.

Emery and Rimoin’s Principles and Practice of Medical Genetics and Genomics: Foundations, https://doi.org/10.1016/B978-0-12-812537-3.00004-4
Copyright © 2019 Elsevier Inc. All rights reserved.
54 CHAPTER 4  Genome and Gene Structure

[Figure 4.1: stacked bar chart. x-axis: Chromosome (1–22, X, Y, M); y-axis: Megabase (%), 0%–100%. Series: sequenced; gaps (unsequenced, euchromatic); unsequenced (heterochromatic).]

Figure 4.1  The percent proportion of the sequenced (blue), unsequenced gaps from euchromatic regions (red), and unsequenced
gaps from heterochromatic regions (green) of the human genome, listed by chromosome numbers. Statistics are from the NCBI
Build 36.1, UCSC assembly of March 2006, Assembly hg18. (Data from http://genome.cse.ucsc.edu/goldenPath/stats.html#hg18.)

and their expression derived from a single locus, which ranges from 35% to 60%, but there remains an uncertainty in determining the extent to which these reflect functional splice variants or splice errors (including tissue-specific variants) for most genes. The complex posttranslational modifications of proteins also create a yet-undiscovered endless diversity in gene products and their functions.

Since the initial sequence of the human genome was determined, we have gained tremendous new insights into genome structure such as how the sequence and structure define the complex functions of human cells and how genome architecture can be altered to produce disease. Additionally, how our genome compares with the sequences of the genomes of closely and distantly related organisms has provided insights into evolutionary conservation of gene and protein functions as well as our origins as a species. Technological innovations have facilitated defining the genomic sequences of many individual humans, particularly the differences that distinguish individuals, allowing a fuller description of the history of our species and the traits we manifest. We are also at the point where we can conceive of understanding at the molecular level the genetic contributions to disease across the human population, facilitating targeted medical intervention based on these findings.

TABLE 4.1  Physical Sizes (Megabase Pairs, Mbp) of Human Chromosomes

Chromosome    Size (Mbp)
1             249
2             237
3             192
4             183
5             174
6             165
7             152
8             135
9             132
10            132
11            132
12            123
13            108
14            105
15            99
16            84
17            81
18            75
19            69
20            63
21            54
22            57
X             141
Y             60
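As a rough consistency check on Table 4.1, the per-chromosome sizes can be summed and compared with the genome size quoted in the text. The sketch below is illustrative only: the dictionary and the `summarize` helper are hypothetical names introduced here, and the values are simply the rounded figures transcribed from the table.

```python
# Chromosome sizes in Mbp, transcribed from Table 4.1 (rounded values).
CHROMOSOME_SIZES_MBP = {
    "1": 249, "2": 237, "3": 192, "4": 183, "5": 174, "6": 165,
    "7": 152, "8": 135, "9": 132, "10": 132, "11": 132, "12": 123,
    "13": 108, "14": 105, "15": 99, "16": 84, "17": 81, "18": 75,
    "19": 69, "20": 63, "21": 54, "22": 57, "X": 141, "Y": 60,
}


def summarize(sizes):
    """Return (total size, smallest chromosome, largest chromosome)."""
    total = sum(sizes.values())
    smallest = min(sizes, key=sizes.get)
    largest = max(sizes, key=sizes.get)
    return total, smallest, largest


total_mbp, smallest, largest = summarize(CHROMOSOME_SIZES_MBP)
print(f"Total: {total_mbp} Mbp (about {total_mbp / 1000:.1f} Gbp)")
print(f"Smallest: chromosome {smallest} ({CHROMOSOME_SIZES_MBP[smallest]} Mbp)")
print(f"Largest: chromosome {largest} ({CHROMOSOME_SIZES_MBP[largest]} Mbp)")
```

With these rounded values the total comes to roughly 3,000 Mbp, the same order of magnitude as the approximately 3.1 billion bp quoted for the assembly. The remaining difference may reflect rounding and the assembly version behind the table.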

4.2 DOUBLE HELIX STRUCTURE, DNA REPLICATION, TRANSCRIPTION, AND MEIOTIC RECOMBINATION
4.2.1 Double Helix
The function of the human genome is to transfer information reliably from parent cells to daughter cells and from one generation to the next. This is carried out in a semiconservative manner. One of the two parental DNA strands of a double helix remains intact in every cell division, serving as a template for copying the sequence. The two DNA strands form the double helix by hydrogen bonding between the nitrogenous bases: guanine (G) pairs with cytosine (C) and adenine (A) pairs with thymine (T) (Fig. 4.2). The hydrogen bonds formed between these pyrimidine–purine pairs (guanine and adenine are purines; cytosine and thymine are pyrimidines) stabilize the double helix and ensure that the two complementary strands remain together and in register. The strands are oriented antiparallel to each other, meaning that they run in opposite directions: one strand is oriented in a 5′–3′ direction, whereas the other is in a 3′–5′ direction. These opposite strands are often referred to as the Watson strand and the Crick strand after the two investigators who initially described the DNA double helix, James Watson and Francis Crick, with X-ray diffraction images produced by Rosalind Franklin.

Figure 4.2  Complementary structure of double-stranded DNA.

Figure 4.3  Flow of genetic information.

4.2.2 Replication
Genetic information is preserved and transmitted via DNA replication, a process that produces two identical copies of the DNA. During this process, the two parental strands separate, and each serves as a template for synthesis of a new complementary strand by an enzyme called DNA polymerase (Fig. 4.3). As a consequence, each daughter cell inherits one strand of the parental duplex. Every DNA molecule thus contains a “young” strand that was synthesized in the parental cell during DNA replication and an “old” strand that was inherited from the parental cell and synthesized in the grandparental cell. This semiconservative manner of replication guarantees transmission of intact information from one generation to the next. Remarkably, the genome copies itself through millions of cell divisions during an individual’s life with amazing precision. The error rate of about 1 × 10^-8 per bp per generation means that replication of the 3 × 10^9 bp comprising the human genome leads to about 60 new single-base mutations per individual.
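The arithmetic behind the figure of about 60 new mutations can be sketched as follows. The error rate and genome size are the values quoted in the text. The factor of two is an assumption made here, not something the text states: it treats the 3 × 10^9 bp as one haploid copy and counts both copies carried by a diploid individual. The function name is hypothetical.

```python
# Values quoted in the text.
ERROR_RATE_PER_BP = 1e-8    # new mutations per bp per generation
HAPLOID_GENOME_BP = 3e9     # bp in one copy of the genome

# Assumption (not stated in the text): an individual carries two copies.
PLOIDY = 2


def expected_new_mutations(rate_per_bp, genome_bp, copies):
    """Expected number of new single-base mutations per generation."""
    return rate_per_bp * genome_bp * copies


per_haploid = expected_new_mutations(ERROR_RATE_PER_BP, HAPLOID_GENOME_BP, 1)
per_individual = expected_new_mutations(ERROR_RATE_PER_BP, HAPLOID_GENOME_BP, PLOIDY)

print(f"Per haploid copy: about {per_haploid:.0f}")
print(f"Per diploid individual: about {per_individual:.0f}")
```

Under that assumption the product is about 30 per haploid copy and about 60 per individual, which matches the number given in the text.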

4.2.3 Transcription
Only about 1%–1.5% of the genome is reflected in the population of mature protein-coding transcripts. Protein-coding genes are transcribed from DNA into messenger RNA (mRNA) (see Fig. 4.3). A single gene can give rise to multiple transcripts through alternative splicing and alternative sites of transcription initiation and termination, generating functional diversity. During transcription, the DNA duplex unwinds, and one of the strands serves as the template for the synthesis of a complementary RNA strand. RNA is distinguished from DNA by the presence of uracil instead of thymine, ribose instead of deoxyribose, and a different three-dimensional folding pattern. The mRNA molecules are single stranded and function as the vehicle for translating genomic information into a protein.

Figure 4.4  Packaging of DNA into chromatin.

In eukaryotes, genes are transcribed by one of three different RNA polymerases (I, II, and III). RNA polymerase I transcribes ribosomal RNAs (rRNAs, a structural noncoding RNA) (except for 5S rRNA); RNA polymerase II transcribes mRNAs, microRNAs (miRNAs), and many long noncoding RNAs (lncRNAs); and RNA polymerase III transcribes 5S rRNA, transfer RNA (tRNA), and other small RNAs. In addition to RNA polymerase, initiation of gene transcription requires other proteins, so that multiple factors form the complex responsible for transcriptional initiation. This complex attaches to the initiation site of transcription at the 5′ end of the gene (the promoter) and determines which genes are transcribed in different cell types or during different developmental stages. The transcription factors (TFs), along with the activities of cis-acting enhancer and inhibitor sequences, also determine the level of gene expression (Fig. 4.4). The enhancers and inhibitors can be located near the promoter of a gene, at the 5′ or 3′ side of the promoter, or at significant distances away from the transcription start site. Such sequences are commonly found within first introns of many mammalian genes. These regions of the genome also contain the “ultraconserved elements” (UCEs), which are extraordinarily highly conserved between evolutionarily distant species. The human genome contains 481 such regions, which are >200 bp in length and are 100% invariant between human, rat, and mouse sequences.

The DNA strand that is similar to the transcribed mRNA sequence is referred to as the sense strand. The DNA sequence that serves as the transcriptional template is referred to as the antisense strand. However, in recent years it has become more apparent that transcription occurs from both strands of the DNA. There is a substantial amount of transcription of the genome, between 65% and 80% depending on the study. Yet, as stated above, only 1.0%–1.5% of the genome encodes proteins. The remainder of the transcription is of miRNA, small structural and regulatory RNAs, and long noncoding RNAs.

4.2.4 Meiotic Recombination
Meiotic recombination [1] refers to the reciprocal physical exchange of chromosomal DNA between the parental chromosomes and occurs at meiosis during spermatogenesis and oogenesis, serving to ensure proper chromosome segregation. During the four-strand stage of meiosis, two duplex DNA molecules (one from each parent) form a hybrid, and a single strand of

one duplex is paired with its complement from the other duplex. Single-stranded DNA is exchanged between the homologous chromosomes, and the process involves DNA strand breakage and resealing, resulting in the precise recombination and exchange of DNA sequences between the two homologous chromosomes. This process is highly efficient and does not usually result in mutations at the sites of recombination. Recombination thus shuffles genetic material between homologous chromosomes, generating much of the genetic diversity that characterizes differences between individuals, even within the same family.

The frequency of recombination between two loci along a chromosome is proportional to the physical distance between them, and historically, this provided the basis for defining the genetic distance between loci, allowing genetic maps to be constructed. The genetic proximity of two loci is measured by the percentage of recombination between them; a map distance of 1 centimorgan (cM) indicates 1% recombination frequency between the two loci. The human genome sequence has made it possible to compare genetic and physical distances and to analyze variations in recombination frequency in different chromosomal regions. On average, 1 million bp (1 Mb) correspond to 1 cM (1% recombination frequency). However, there is tremendous local variation between individual chromosomes and among particular chromosomal regions. For example, the average recombination rate is higher in the short arms of chromosomes and at the distal segments of the arms but overall is suppressed near the centromeres. There is also significant variation in the recombination rates between the sexes, with 1.6-fold more recombination on average in females relative to males. On average, female recombination is higher at the centromeres and male recombination is higher at the telomeres [2].

4.2.5 DNA and RNA Synthesis
Each chromosome in the human cell consists of a continuous double-helical DNA molecule; an average chromosome contains about 4–5 cm of DNA. The polarity of a single-stranded nucleic acid is defined by the position of the phosphodiester bonds, which connect the 3′ hydroxyl group of one nucleoside to the 5′ hydroxyl group of the next (see Fig. 4.2). A nucleoside is composed of a purine or pyrimidine base and a deoxyribose (in DNA) or ribose (in RNA), and a nucleotide is composed of a nucleoside and one or more phosphate groups. The phosphate groups in a single nucleoside triphosphate are attached to the ribose or deoxyribose moiety via the 5′ hydroxyl residue, and two of the three phosphates are removed during the incorporation of each nucleoside triphosphate into DNA or RNA. When the new DNA strand is synthesized during replication or copied during transcription, the polymerase enzymes add new nucleotides to the 3′ hydroxyl group of a growing polynucleotide chain. The new strand of DNA or RNA is thus synthesized in the 5′–3′ direction, so the parental or template DNA strand is read in the 3′–5′ direction. If two proteins are encoded by adjacent genes that lie on different strands along the chromosome, those genes are said to have opposite transcriptional orientations.

4.3 ORGANIZATION OF GENOMIC DNA
The 3 billion bp that constitute the human genome are packaged into 22 pairs of autosomes and the X and Y sex chromosomes. The chromosomal DNA can be divided into regions of heterochromatin and euchromatin: heterochromatin represents the “tightly” packed regions of chromosomal DNA and euchromatin represents the “loose” regions, which are generally the actively transcribed DNA regions. Using cytogenetic staining methods, differently packed chromosomal regions can be viewed as G (Giemsa staining)-bands, with the banding pattern characteristic of each individual chromosome providing the basis for the cytogenetic identification of each human chromosome. From early on, the dark G-bands were considered to reflect “gene-poor and GC-poor” regions that are comparatively more condensed and more AT-rich and that replicate later than the DNA within the lighter staining bands; the sequence of the human genome has proved this concept to be accurate. The light staining bands correspond to the R-bands via an alternative staining technique. However, human genome sequence information has revealed that there can be tremendous variation in the GC content across chromosomal regions.

Each chromosome of the 23 human pairs contains a single DNA duplex extending between the two telomeres. If the DNA in the human genome were stretched from one end to the other, it would be longer than 1 m (approximately 3 ft)! Remarkably, compacting the DNA by greater than 100,000-fold, which is required to fit the chromosomes into the

Figure 4.5  Superhelical turns in DNA.

nucleus, is achieved by coiling and folding the double helix into a series of progressively shorter and thicker structures (Fig. 4.5). Proteins that bind to DNA help direct and organize this folding, and the folded complex of DNA and protein is referred to as chromatin.

4.3.1 Nucleosomes and Higher Order Chromatin Structure
In addition to compacting the genetic material to fit into the nucleus, chromatin condensation can regulate accessibility of the DNA for transcription and other processes. The simplest level of chromatin structure is the organization of DNA and histones into nucleosomes [2]. Each nucleosome is a 147-bp-long segment of DNA tightly wrapped almost two times around an octamer histone core. This octamer core contains two molecules each of the histones H2A, H2B, H3, and H4. Nucleosomes are the fundamental feature of all eukaryotic DNA, and the sequences of the core histones are well conserved even among distantly related species. A fifth histone, H1, binds to a particular modification on the core histone H3 at lysine 9, and its sequence is less well conserved. A region of linker DNA, about 60 bp in length in humans, usually separates adjacent nucleosomes, so that nucleosomes are for the most part regularly spaced.

Nucleosomes represent the first level in the packaging of naked DNA into chromatin and appear in the electron microscope as strings of 11-nm “beads.” Previous models have the next level in packaging as the coiling of the nucleosomes to form a 30-nm structure with a solenoid conformation. However, recent data suggest the 30-nm solenoid structure is an artifact of X-ray diffraction studies of chromatin in vitro and does not reflect chromatin organization in the nucleus. Indeed, studies by Maeshima and colleagues [3] failed to find the 30-nm structure using small-angle X-ray scattering on purified nuclei or chromosomes in the absence of contaminating ribosomes. More recently, a novel method of decorating the chromatin with photoactivatable long polymers of diaminobenzidine (DAB), staining with OsO4, and visualization by multi-tilt scanning electron microscopy tomography reveals in situ chromatin to be disordered polymers of 5 to 24 nm in diameter with no evidence of the 30-nm fibers (Fig. 4.6) [4]. Epigenetic modifications,

Figure 4.6  Eukaryotic gene structure and the pathway of gene expression.

primarily DNA methylation, and histone modifications also regulate the structure and dynamics of chromatin folding. Additional levels of folding can compress DNA into 300-nm fibers and a nearly 1000-nm metaphase chromatid.

Following separation of sister chromatids at meiosis II, or sister chromosomes at mitosis, the DNA decompacts as the cell returns to interphase or the G1 phase of the cell cycle. Chromosomes do not, however, distribute randomly throughout the nucleus, nor does the DNA of every chromosome simply fill the nucleus. Instead, the chromosomes reside in discrete territories, which can be observed by staining chromosomes with interphase spectral karyotyping (SKY karyotyping or “painting” each chromosome with a unique color) [5].

The human genome is also compartmentalized into large (>300-kb) segments of DNA that are homogeneous in base composition and are referred to as isochores, based on sequence analysis and compositional mapping. L1 and L2 are GC-poor (“light”) isochore families, which represent 62% of the genome. The heavy H1, H2, and H3 isochores are GC-rich. This also corresponds to the G-bands, which are composed of GC-poor isochores, whereas the R-bands are composed of the GC-rich isochores and some GC-poor isochores.

4.3.2 Euchromatin and Heterochromatin
During metaphase, the entire chromosome is highly condensed, but at other times, most chromatin is organized into one of two separate compartments within the nucleus. Some portions of the genome remain highly condensed throughout the entire cell cycle, associate with the nuclear periphery and the nucleolus, are gene poor, are enriched in long interspersed nuclear element (LINE)-1–type repetitive elements, and are late replicating during S phase; these portions are termed heterochromatin. Many areas of heterochromatin are located close to chromosome centromeres and at the telomeres of acrocentric chromosomes (chromosomes with the centromere at one end), contain highly repetitive or simple-sequence DNA instead of genes, and may play a structural role in chromosome organization. In addition, heterochromatin can be subdivided into facultative and constitutive heterochromatin. Constitutive heterochromatin is found at highly repetitive features, such as α-satellite DNA and transposons. Constitutive heterochromatin is also identified by its high compaction, the absence of histone acetylation, and the presence of histone hypermethylation, especially histone H3 lysine 9 trimethylation (H3K9me3). Facultative heterochromatin, in contrast, results from tissue- or cell-specific downregulation of transcription of specific genes, such as during cell or tissue differentiation. Facultative heterochromatin is found in the inactive X chromosome in female cells, which contains genes but generally does not express them. This is partially due to a high degree of chromatin condensation and histone modification that does not allow access to the DNA by the transcriptional machinery. Inactive X heterochromatin remains highly condensed during the lifetime of somatic cells, but in germ cells, during oogenesis, it becomes active and euchromatic by the time of entrance into meiosis.

Formation of heterochromatin involves proteins that help direct condensation and packaging to achieve assembly of these proteins and DNA into heterochromatin. The molecular mechanisms behind heterochromatin formation have been revealed by studies of inactivation of the second X chromosome in female cells [6]. Heterochromatin spreading is also thought to be involved in the initiation of X-chromosome inactivation that occurs in all female cells because condensation begins from a specific site on the X chromosome, the X-inactivation center, which in humans is located on Xq13. At this center, the X-inactivation-specific transcript (XIST) gene encodes a 17- to 19-kb-long noncoding RNA, which is transcribed only from the inactive X chromosome and acts in cis to initiate X-inactivation. During early embryonic development, X-inactivation is random, so females are mosaics with respect to whether the maternal or paternal X is active in each cell. Once established in somatic cells, the inactive X is stable through replication and cell division (i.e., the same inactive X will be inactive in all daughter cells).

XIST RNA inactivates genes at a significant distance from the gene that encodes it. Although XIST is essential for X-inactivation, regulatory genes including TSIX and a variety of TFs control XIST. While XIST is essential for the initiation of X-chromosome inactivation, it is not required for the maintenance of X-inactivation. The detailed analyses of the process of X-inactivation have informed more general aspects of heterochromatin formation and gene inactivation. A more general form of heterochromatin formation is directed by the removal of acetyl lysine histone marks through the increased association of histone deacetylases and the recruitment of histone methyltransferases [7]. Acetylation of H3

histones neutralizes the positive charge of lysine residues in the histone tail, which weakens the tail's attraction to the negatively charged DNA strand and is associated with open chromatin and active gene transcription. In contrast, removal of the acetyl group from the H3 histone tails in association with methylation of different H3 lysine residues, specifically H3K9 and H3K27, is associated with histone H1 recruitment, silencing of transcription, and heterochromatin formation.

Euchromatin, on the other hand, has an open chromatin configuration, is gene-rich, is early replicating, is enriched in short interspersed nuclear element (SINE) transposable elements, and is found in the center of the nucleus. There are also several histone modifications that delineate euchromatin, including H3K27 acetylation and H3K36 methylation. Indeed, differential histone modifications define chromatin type and hence transcriptional potential. Most gene transcription occurs within the euchromatic regions of the genome.

4.3.3 Centromeres and Telomeres
Special features of the DNA molecule that composes each human chromosome are required at the centromeres and the telomeres. Located close to most centromeres are many copies of a 171-bp α-satellite repeat that forms the core of the centromere. These sequences bind structural proteins that serve as a site for kinetochore formation and spindle attachment during metaphase. Certain alphoid repeats are found close to the centromeres of all chromosomes, while others are specific for one or a small number of chromosomes. The proteins and DNA sequences that make up the centromeres must also ensure that the two daughter chromatids are partitioned to different cells during cytokinesis [8].

As template-directed replication of DNA can only be performed in a 5′–3′ direction, one strand of each duplex cannot be fully replicated at its 3′-terminus by the DNA polymerase. Therefore, the telomeres of each human chromosome contain many copies of a short repeat, 5′-TTAGGG-3′, which can be replicated using the enzyme telomerase [9]. This enzyme has an RNA component, which itself serves as a template and can elongate the 5′-TTAGGG-3′ repeat in a manner that does not depend on the DNA strand.

4.3.4 Repeat Content of the Human Genome
Less than 5% of the human genome sequence encodes proteins, whereas repeat sequences account for at least 50% of the sequence. The human genome contains long stretches of DNA sequences of various lengths that also exist in variable copy number [10,11]. The repeats fall into five categories: (1) transposon-derived repeats, often referred to as interspersed repeats; (2) inactive retrotransposed copies of cellular genes (referred to as processed pseudogenes); (3) segmental duplications (SDs) (also known as low copy repeats [LCRs]) consisting of blocks of around 10–300 kb that have been copied from one region of the genome into another; (4) blocks of tandemly repeated sequences such as centromeres and ribosomal gene clusters; and (5) simple-sequence repeats consisting of direct repeats of short sequences such as (CA)n or (CGG)n, which have been extremely important for human genetic studies as they have been used as genetic markers (see Section 4.3.6).

Transposable elements in humans, as in all mammals, fall into four types: LINEs, SINEs (including Alu sequences), long terminal repeat (LTR) retrotransposons, and DNA transposons. Both the number and age of transposable elements in the human genome are strikingly different from those in other species. The density of transposable elements is much higher in humans than in other species, and the human genome contains more ancient transposons than do other species. It thus appears that these repeats have survived because of a significant evolutionary advantage, although their selective advantage and precise function are not well understood. Some chromosomal regions are extremely crowded with repeat elements (e.g., a 500-kb region on the short arm of the X chromosome has an overall transposable element density of 89%), whereas other chromosomal regions are nearly devoid of repeats (e.g., the homeobox gene clusters). Several transposable elements, LINEs and SINEs, are still active within the human genome and have been associated with gene disruption, thereby causing disease. The human genome does not contain any active DNA transposons. An important distinction between LINEs and SINEs and the DNA transposons is their mechanism of movement within a genome. LINEs and SINEs are types of retrotransposons and therefore move through an RNA intermediate. This is often referred to as a copy-and-paste mechanism. Notably, SINEs do not encode the proteins required for transposition but rather co-opt the proteins produced from LINEs. DNA transposons (of which there are many active transposons in species other than humans) are excised from their original location, usually during

DNA replication, and reinserted into the genome in a new location. These transposons use a cut-and-paste mechanism for moving through genomes.

Pseudogenes (full and partial) are regions of DNA with many sequence elements of a potential transcriptional unit (e.g., promoter, protein-coding region, splice junctions, etc.), yet do not code for a functional product. They can originate after gene duplication when the duplicated sequence acquires a mutation that prevents its expression. For example, a member of the α-globin gene family, ψζ, has all the sequence characteristics of a functional globin gene, but the protein-coding region contains a point mutation that prevents the expression of a full-length globin [12]. A second way in which pseudogenes originate is via the pathway of reverse transcription and integration. If the mRNA of a cellular gene is converted into complementary DNA by reverse transcriptase, a duplex DNA molecule can be formed that lacks introns and contains a poly(A) tract. Pseudogenes with this pattern are commonly found in genomic DNA, showing that cellular mRNAs are occasional substrates for reverse transcriptase and that the DNA products can integrate back into the genome.

Large segmental duplications [13] are especially enriched in pericentromeric and subtelomeric regions of chromosomes. These intrachromosomal or interchromosomal duplications range between 1 and 300 kb, are >90% identical at the sequence level, and are much more common in humans than in yeast, flies, or worms, suggesting a relatively recent origin for these genomic elements. Segmental duplications have been demonstrated to serve as templates for the production of copy number variants (CNVs) within the genome through different mechanisms (see later).

Highly homologous sequences can reside within the CDS of the same gene (intragenic homology; same gene), have homology to functional genes (different genes), and have homology to nonfunctional pseudogenes or homology to sequence regions that are still poorly annotated.

Satellite DNA consists of arrays of simple tandem repeats; microsatellites are composed of repeats primarily of 4 bp or less dispersed throughout the genome [14]. These cover about 0.5% of the genome and exist in the form of the dinucleotide repeat CA/TG. Microsatellites are well known for their causative roles in as many as 40 neurological diseases. Certain specific triplet repeats can be unstable, expanding and contracting during meiosis and/or mitosis. If the repeat becomes excessively long, it can cause diseases such as Huntington disease or spinocerebellar ataxia, both of which are caused by the expanded repeat CAG in the coding region of a gene and result in a long polyglutamine tract within the gene product. Some expanded repeats occur in 5′- and 3′-UTRs and result in a disease due to an inhibitory effect on gene expression (Fig. 4.7).

Effects of microsatellites can occur at different levels of RNA, including alternative splicing, structural changes, and working as microRNAs (Fig. 4.7). Minisatellites are tandemly repeated sequences of DNA with lengths from 1 to 15 kbp. The telomeric DNA sequences contain 10–15 kb of hexanucleotide repeats (TTAGGG). Macrosatellites are very long arrays of sequences, up to hundreds of kilobases of tandemly repeated DNA. The α-satellite DNA constitutes the bulk of the centromeric heterochromatin on all chromosomes.

4.3.5 Gene Families
Many genes belong to families of closely related DNA sequences, grouped together because of similarity in their nucleotide or amino acid sequences. Gene families consist of structurally (and usually functionally) related genes with a common evolutionary origin. Multiple levels of hierarchical subfamily structure are common. A well-studied and extensively described example, owing to its clinical significance, is the gene family that makes up the α- and β-globin gene clusters, which are assumed to have arisen from a gene duplication event 500 million years ago. This gene family is an example of duplication events due to retrotransposition. These two gene clusters also code for globin chains that are expressed during different stages of human development. Another example of a gene family with extensive diversity is the immunoglobulin superfamily.

Some gene family members, such as collagens, are dispersed among different chromosomal locations, but many others, such as the κ or λ variable region genes, are physically linked. Among gene families that exhibit linkage, members are usually oriented in the same direction. In some cases, linkage is thought to be an evolutionary footprint without functional significance, suggesting that evolutionary divergence of the gene family occurred through successive rounds of duplication in tandem arrays. However, in other gene families, such as the immunoglobulins, linkage has been conserved

Figure 4.7 Effects of microsatellites at the level of RNA. Long noncoding RNAs (lncRNAs) predominantly consisting of micro-
satellites have been observed to function in the nuclear matrix and to aggregate into nuclear foci with indications of functional
significance. They also may associate with DNA microsatellites (short tandem repeat [STR]) in both UTRs. Microsatellite-dominated
microRNAs have been observed, but their function is not yet known. Intronic microsatellites can regulate splicing efficiency that
can lead to exon skipping, intron inclusion or new splice site selection. STRs located in the UTRs can influence the locations of the
start and end sites of transcription. Microsatellites transcribed can also affect the mRNA half-life, which may be due to formation
of secondary structures such as hairpins.

during evolution because it provides a mechanism for coordinated or regulated control of gene expression. Even when dispersed, coordinated expression of members of gene families can be regulated by similar control mechanisms by carrying similar response elements in their 5′ regulatory regions.

4.3.6 Interindividual Variations in the Human Genome

The Human Haplotype Map (HapMap) project is a key component in understanding the genetic potential of the Human Genome Project. Sequencing of the human genome has exposed multiple interindividual variations. These variants include both simple-sequence repeat polymorphisms and single nucleotide polymorphisms (SNPs). Repeat polymorphisms can represent di- (mostly CA), tri-, or tetranucleotide repeats, and they form the basis of the genetic map of the human genome. These repeat markers are multiallelic; the alleles differ in the number of repeat units and thus can be used to identify the maternal and paternal alleles of individuals as well as to define recombinations between marker loci. The high degree of length polymorphism among simple-sequence repeats within the human population is due to frequent slippage by DNA polymerase during replication. These repeats comprise about 3% of the human genome, and there is approximately one such repeat per 2 kb of genomic sequence. The large number and wide distribution of these repeats have facilitated mapping and identification of many genes associated with inherited human disorders solely based on their chromosomal position.

SNPs are also nonrandomly distributed in the human genome. Millions of SNPs have been identified in the genome sequence but only a small fraction is predicted to affect the protein sequence. This limits the extent to which such genetic variations contribute to the structural diversity of human polypeptides, but regulatory effects on gene expression may cause or result in susceptibility to a variety of human phenotypes.

Copy number variation (CNV) describes the variation identified within the population or associated with human diseases in genomic segments larger than the SNP and simple-sequence repeat polymorphisms but smaller than cytogenetically visible chromosomal abnormalities [15]. An appreciation of the number and diversity of such variants has primarily been a product of comparative genomic hybridization studies in
normal individuals (copy number polymorphisms) and in patients with a wide variety of genetic disorders. Although the number of sites that vary is small when compared with SNPs and simple-sequence repeat polymorphisms, the number of base pairs involved may be as much as two orders of magnitude greater [16]. Similar to the SNP variations, CNVs may be associated with susceptibility to particular disorders or may be causative, especially when they arise de novo in an individual. There are three prevailing mechanisms for the creation of large CNVs: nonhomologous end joining (NHEJ), nonallelic homologous recombination (NAHR), and fork stalling template switching (FoSTeS). More recently, aberrant firing of replication origins has been described as a further mechanism [17] (Maya-Mendoza, Nature. June 27, 2018).

NHEJ is the cellular process that repairs double-strand breaks (DSBs) and proceeds in four basic steps: recognition of the DSB, bridging of both broken DNA ends, preparation of the ends for ligation, and finally, ligation of the two DNA strands. NHEJ can result in small alterations within the genome such as microdeletions at the repair site, which may or may not have an effect on the organism. Moreover, NHEJ can lead to larger rearrangements such as chromosomal translocations, which have been associated with different types of cancer. NHEJ events are nonrecurrent types of CNVs and are sometimes, but not always, found within low-copy repeats or segmental duplications (LCRs/SDs). NAHR events, on the other hand, are associated with LCRs/SDs and can be recurrent due to the presence of similar alterations in different individuals between the same LCR/SD. NAHR can occur during meiosis or mitosis and will result in either constitutive deletions/duplications that are associated with genetic abnormalities or sporadic deletion/duplication events that are mosaic or observed in cancers, respectively. Normal homologous recombination occurs between sister chromatids during meiosis to exchange genetic information or between sister chromosomes during mitosis. If the pairing of the different chromosomes occurs through allelic segments of the genome, the resulting recombination has no phenotypic effect. If, however, the pairing occurs between two nonallelic segments of the genome, the result can be duplications or deletions if the LCRs/SDs are in a direct orientation or an inversion if the LCRs/SDs are in opposite orientations. Translocations can also occur if the recombination occurs between LCRs/SDs on different chromosomes.

Finally, the last major mechanism involved in the production of CNVs is FoSTeS. The formation of CNVs through this mechanism starts during DNA replication with a stalled replication fork, whereby the 3′ end of the lagging strand can dissociate from its current strand and anneal, through microhomology, in a nearby replication fork. The nearby replication fork will be close in spatial proximity, not necessarily close in the linear sequence context. DNA synthesis can commence through the newly “primed” site. The newly synthesized DNA can dissociate from the replication fork and reintegrate back into the original fork or could invade and anneal to another replication fork. The process of disengagement and invasion could occur several times, resulting in very complex rearrangements [18].

4.3.7 DNA Looping and TADs

It has become increasingly clear in the past several years that the DNA in the nucleus has a higher order structure in the form of loops. As stated earlier, enhancer elements that lie either upstream or downstream of a gene’s promoter can influence developmental timing of gene expression or tissue or cell type–specific gene expression. Enhancer sequences work in a position- and orientation-independent manner to affect gene transcription. Some early studies had suggested that the enhancer and the promoter form a loop in the DNA, bringing the proteins that bind both elements in close proximity to regulate transcription [19]. The locus control region (LCR) of the globin locus is a notable example where the LCR resides 50 kb upstream of the globin genes (α, β, and γ) themselves and regulates the switch between fetal hemoglobin and adult hemoglobin. More recently, with the advent of next-generation sequencing (see later), it has become possible to assess contacts between regions of the genome in cis and in trans on a genome-wide scale. Indeed, through the use of a proximity-based ligation assay termed Hi-C (a process whereby DNA and proteins are crosslinked, the DNA is digested with a restriction enzyme, and the DNA within the complex is ligated, bringing together normally distant sequences; the ligation products are eventually sequenced with high-throughput DNA sequencing), it has been shown that regions of chromatin that are in close proximity to each other have a higher probability to interact with each other, whereas more distant regions of chromatin have a lower probability of interaction. However, there are many regions of the genome that are separated by
great distances on the linear chromatin fiber that preferentially interact with each other. These segments form large DNA loops. The prevailing hypothesis of how the loops form is the loop extrusion model, whereby the ring protein cohesin binds to and draws the DNA through itself until the complex encounters the protein CTCF, which is also bound to the DNA. These proteins demarcate most, although not all, DNA loops. They also act as boundary elements restricting access to the loop. The loops have been found to persist between cell types and throughout vertebrate evolution and have been termed topologically associated domains (TADs) [20,21]. One example of how TADs are important for regulating gene transcription was the discovery that copy number alterations at 2q35-36, which caused malformation of the hands and feet, were associated with the deletion of boundary elements and not with mutation of individual genes [22]. The 2q35-36 region contains the genes WNT6, IHH (Indian hedgehog), EPHA4, and PAX3 separated into TADs with boundaries between the WNT6/IHH and EPHA4 TADs and between the EPHA4 and PAX3 TADs. In other words, there is a WNT6/IHH TAD, an EPHA4 TAD, and a PAX3 TAD, each separated by binding sites for the boundary proteins CTCF and cohesin. Importantly, the chromosome structure is conserved between mouse and humans at this locus. Modeling the human CNVs in the mouse resulted in similar phenotypes in the paws of the affected mice. Deletion of the boundary element, and hence the binding sites for the CTCF/cohesin complex, was responsible for the observed phenotypes in both humans and mice [22]. These results provide evidence that alterations in chromatin architecture can lead to phenotypic changes in the organism without mutations to the genes that reside within the loops.

4.3.8 Analysis of Genomes

During the past 10 years there has been an explosion of genomic data due primarily to the use of next-generation sequencing (NextGen or NGS). NGS, at its most basic level, is the process of sequence analysis of DNA (or cDNA) in a massively parallel configuration [23–25]. There are several competing concepts for NGS, but the most prevalent technology in use today is the Illumina Sequencing by Synthesis method. All NGS assays begin with the creation of a sequencing library. The process is basically the same for genomic sequencing and begins with shearing the DNA in some fashion, either by mechanical means (sonication) or enzymatically (tagmentation with a DNA transposon). Following fragmentation, the ends of the DNA are repaired to blunt ends and phosphorylated, and an adenosine residue is attached to provide a more efficient substrate for ligation of platform-specific adaptors. (These steps are not necessary during tagmentation as oligomers are inserted into the DNA by the transposase.) Following adaptor ligation, the library is amplified by PCR and is ready for sequencing [23–25].

Sequencing by synthesis on the Illumina platform begins by creating individual sequencing reactions, known as clusters, on the surface of a flow cell. The prepared DNA libraries are denatured and flowed across the surface of the flow cell in order for the adaptor sequences to bind their complementary sequences that are physically attached to the flow cell surface. Once attached, the molecules are amplified in situ by bridge amplification, and the resulting clusters are ready for sequencing. In the Illumina system, nucleotides, with a specific fluorescence for each base and a 3′ block, are added to the flow cell with the polymerase and one base is incorporated into the growing strand. The base is “read” by its unique color tag and the nucleotide is recorded. The 3′ block is removed from the nucleotide and the cycle begins again. This is repeated from 50 cycles up to 600 cycles in either one direction (single end sequencing) or in both directions (paired end sequencing) depending on the instrument and length of chemistry. Once the sequencing is finished, the raw data file is converted to a usable format, the FASTQ file. The FASTQ file format is similar to FASTA except that it adds run quality metrics for each base position within the read. The FASTQ files are used in downstream bioinformatics analysis of the sequencing run.

4.4 STRUCTURE OF GENES (TRANSCRIPTIONAL UNITS): EXONS AND mRNA

Sequences coding for a single eukaryotic mRNA molecule are typically separated by noncoding sequences into noncontiguous segments along the chromosomal DNA strand (Fig. 4.6). The segments that are retained in the mature mRNA are referred to as exons. During transcription, the exons are spliced together from a larger precursor RNA that contains, in addition to the exons, interspersed noncoding segments referred to as introns.

Figure 4.8  Organization of a human gene. At the 5′ end (upstream) of each gene lies a promoter region that includes sequences
(red lines) responsible for the proper initiation of transcription (TATA, CCAAT), including regulatory elements. At the 5′ and 3′ end of
the gene is the untranslated region (5′-UTR and 3′-UTR; white boxes). The 3′UTR contains a signal for the addition of polyA tail (yel-
low) to the end of the mature mRNA. Gene expression can be regulated by binding of specific microRNAs (purple) in microRNA
recognition sequences mostly present in the 3′-UTR. Genes (open reading frames) expressing untranslated microRNAs can also
be embedded within exons (exonic microRNA; purple) or introns (intronic microRNA) such as that in immunoglobulin lambda
variable region gene family (Das S. Mol Biol Evol 2009;26(5):1179–89.). The nucleotide sequences adjacent after the 5′-UTR or
before the 3′-UTR provide the molecular “start” (Initiation codon) and “stop” (Termination codon, black arrow) signals respectively
for the mRNA synthesis from the gene. Exons (light brown) and intervening introns (white) are the coding and noncoding parts
of the gene.
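
The exon/intron organization summarized in Fig. 4.8 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the precursor sequence and exon coordinates are invented toy values rather than a real gene, and the function names `splice_exons` and `find_orf` are hypothetical. The sketch joins the exons of a precursor RNA and then looks for the initiation codon and the first in-frame termination codon in the spliced product.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def splice_exons(pre_mrna, exons):
    """Join exon segments (0-based, end-exclusive coordinates) into one mRNA string."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def find_orf(mrna):
    """Return (start, end) of the reading frame opened by the first AUG.

    `end` is exclusive and includes the stop codon. Returns None when there
    is no AUG or no in-frame stop codon.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None

# Toy precursor (assumed values): exon 1, one intron, exon 2.
exon1 = "GGCACCAUGGCU"      # 5'-UTR "GGCACC" followed by AUG GCU
intron = "GUAAGUCCUUACAG"   # starts with GU and ends with AG
exon2 = "GAAUGAUUUAAUAAA"   # GAA UGA (stop) followed by 3'-UTR
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

mrna = splice_exons(pre_mrna, exons)
orf = find_orf(mrna)
print(mrna)
print(orf, mrna[orf[0]:orf[1]])
```

In this toy case two exons flank a single intron, and everything upstream of the AUG or downstream of the stop codon corresponds to the untranslated regions.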

The number of exons coding for a single mRNA molecule depends on the gene and the organism, but ranges from one to more than 100 (Fig. 4.8). The noncoding mRNA sequences are spliced out during mRNA maturation. Human genes tend to have small exons, with a median value of only 167 bp and mean equal to 216 bp. The shortest exon is only 12 bp while the longest is 6609 bp. The exons are separated by introns, which can be less than 100 bp, but can also exceed 10 kb. The size distribution of exons and introns of human genes based on the analyzed sequence information and comparison to worm and fly sequences is provided in Table 4.2. It is important to note that some introns carry significant information and even code for other complete genes (nested genes).

TABLE 4.2  Physical Sizes of Human Chromosomes

Feature                      Size
Median exon                  167 bp
Longest exon                 6609 bp
Smallest exon                12 bp
Exon number                  9
Introns                      3300 bp
3′-UTR                       770 bp
5′-UTR                       300 bp
Average coding sequence      1340 bp
Polypeptide                  447 aa
Overall size                 27 kb

aa, amino acids; bp, base pairs; kb, kilobase pairs; UTR, untranslated region.

Individual exons may correspond to structural and/or functional domains of the proteins for which they code, such as the signal peptide of secreted polypeptides or the heme-binding domain of globin. For some complex proteins, domains encoded by single exons often appear in apparently unrelated proteins, suggesting that the evolution of these proteins may have been facilitated by the ability to bring together different protein subdomains by exon shuffling. The origin of intron/exon structure is thought to be extremely ancient and to predate the divergence of eukaryotes and prokaryotes. However, prokaryotes and small eukaryotes

[Figure 4.9 image. Panel A labels: promoter; TF; RNA pol; sense strand; template strand (= antisense strand); RNA, 5′ PPP to OH 3′. Panel B labels: positions −75, −25, +30; elements CCAAT, BRE, TATA, INR, DPE; factors CFT, CBF, TAF, TFIIB, TFIID; CCAAT binding factors; transcription factors and RNA polymerase.]

Figure 4.9  (A) Transcription of eukaryotic genes. (B) Basic promoter elements in eukaryotes.

(e.g., yeast) have lost their introns during evolution, perhaps because of the strong selective pressure on these organisms to retain a small genome size. Therefore, exons can be classified as follows: 5′-UTR exons, coding exons, 3′-UTR exons, and all possible combinations of those three main components, including single exons that cover the whole mRNA.

4.4.1 Gene Expression

The expression of individual genes can be regulated at multiple levels. Before a gene sequence gets translated into a polypeptide sequence, multiple events take place: activation of the local DNA structure, initiation and completion of transcription, processing of the primary transcript, transport of the mature transcript to the cytoplasm, and translation of the mRNA. All these steps can be the target of regulation and thus are potential control points for altering gene expression. Some genes are needed in all cell and tissue types as they encode a crucial gene product. Such genes are often referred to as “housekeeping genes.” However, numerous human and mammalian genes show highly restricted cell- or tissue-specific expression patterns, and this spatial and/or temporal restriction of gene expression can also be regulated at multiple levels.

4.4.2 Transcription

Initiation of transcription happens when the compact DNA structure is loosened and short sequence elements in the 5′ end of the gene guide and activate RNA polymerase (Fig. 4.9). A group of such sequences is often clustered upstream of the transcription initiation site to form the promoter. The promoter is a region of DNA at the 5′ end of the gene that binds RNA polymerase.

There are different types of promoters for RNA polymerases I, II, and III. RNA polymerases I and III are dedicated to transcribing genes encoding RNA molecules (rRNA and tRNA), which assist in the translation of the polypeptide-coding genes. All RNA polymerases are large proteins and appear as aggregates consisting of 8–14 subunits. Significant amounts of information exist on promoter sequences specific for these polymerases. The basal apparatus, the generic minimal promoter sequence that is sufficient to initiate transcription of any protein-coding gene, contains an RNA polymerase II recognition signal
as well as signals for general TFs needed for the binding of the polymerase by most genes. This minimal promoter contains a consensus sequence (5′-TATA-3′, referred to as the TATA box) about 25 bp upstream of the site at which transcription begins, surrounded by GC-rich sequences, as well as the B recognition element (BRE; the TFIIB recognition element), the Inr (initiator) sequence at the start site of transcription, and the DPE (downstream promoter element) at about 30 bp 3′ from the transcription initiation site. Furthermore, about 50–200 bp upstream is the CAAT box, to which several TFs bind. The usual nomenclature of the numerous transcription factors is TF followed by a roman numeral to indicate the associated RNA polymerase. The general TFs, such as TFIIB, TFIID, TFIIE, TFIIF, and TFIIH, facilitate the binding and activation of RNA polymerase II into an activated transcriptional complex.
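
The positional relationships described above (a TATA box roughly 25 bp upstream of the transcription start site) can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the DNA string is synthetic, the search window of −35 to −20 is an arbitrary choice, and `find_motif_near_tss` is a hypothetical helper name. Offsets are counted from the start-site index, with negative values upstream.

```python
def find_motif_near_tss(dna, tss, motif, lo, hi):
    """Return offsets relative to index `tss` (negative = upstream) at which
    `motif` starts, restricted to lo <= offset <= hi."""
    hits = []
    for offset in range(lo, hi + 1):
        i = tss + offset
        if i >= 0 and dna.startswith(motif, i):
            hits.append(offset)
    return hits

# Synthetic promoter (assumed): GC-rich flank, a TATA box, spacer, start site.
upstream = "GCGCGCGCGCGCGCG" + "TATAAA" + "G" * 19   # 40 bp in total
tss = len(upstream)                                   # index of the start site
dna = upstream + "A" + "C" * 40                       # "A" marks the start site

print(find_motif_near_tss(dna, tss, "TATA", -35, -20))
```

The same helper could be pointed at a downstream window to look for other elements, provided a suitable motif string is supplied; real promoter elements are degenerate, so an exact string match like this one is a simplification.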
Genes are constitutively expressed at some basal minimum rate determined by the core promoter. However, transcription can be increased or totally switched off by additional positive or negative elements (enhancers or silencers), which regulate the efficiency and specificity with which a promoter is recognized by the transcriptional apparatus. These cis-acting regulatory elements are typically short sequences located at about 200 bp upstream from the promoter sequence but may also be placed at more distant locations. Finally, gene expression can be regulated by elements that respond to external stimuli. These response elements are often within 1000 bp upstream from the transcription start site (Fig. 4.8). Genes under common control share similar response elements, recognized by regulatory TFs. Some response elements are extremely well characterized, such as heat shock response elements or glucocorticoid response elements.

Promoters do not necessarily have to lie upstream of the transcription initiation site. For example, most promoters for RNA polymerase III, including the 5S RNA promoter, lie downstream of the transcription start site within the coding sequence [26]. This promoter binds the general transcription factor TFIIIA, a large protein with several zinc fingers, which then, along with the factors TFIIIB and TFIIIC, binds RNA polymerase III in a manner such that the polymerase is positioned at the exact spot where transcription begins. Although the mechanism of TFIIIA binding appears to be a common one, promoters that lie in exon sequences may be limited to the special situations in which multicopy genes such as 5S RNA or tRNA are subject to coordinate regulation.

Figure 4.10  Three-dimensional structure of DNA helix bound to TFs.

Comparisons among many TFs have exposed some structural domains characteristic of the DNA-binding character of these proteins (Fig. 4.10). These include zinc finger motifs, helix-turn-helix motifs, helix-loop-helix motifs, and leucine zipper motifs. The zinc finger motif binds a zinc ion with four highly conserved amino acids, two cysteines and two histidines (C2H2), to form a finger-like loop [27]. The typical loop is 25 amino acids long, and the finger structure is often tandemly repeated. The helix-turn-helix motif is a common element of homeobox proteins. It consists of two short α-helices separated by a short linker region and confers sequence specificity to DNA binding [28]. The helix-loop-helix motif consists of two α-helices separated by a loop that is flexible enough to allow two helices to pack against each other. The contact of the helix-loop-helix motif with DNA is considered to be looser than other TFs. The leucine zipper motif is a helical stretch of amino acids, with leucine at every seventh amino acid position and occurring once in every two turns of the helix. Characteristic of most TFs is that they recognize and bind a short nucleotide sequence and their binding surfaces have extensive complementarity to the surface of the DNA double helix. Typically, eukaryotic TFs have two functional domains: a DNA-binding domain that binds to the DNA of the target gene and an activation domain that interacts with other proteins, which regulate transcription.
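
The "leucine at every seventh amino acid position" pattern of the leucine zipper lends itself to a simple scan. The Python sketch below is a toy illustration only: the peptide is invented, the threshold of four consecutive heptad leucines is an assumed cut-off, and `leucine_heptad_runs` is a hypothetical function name. It does not predict helical structure; it merely reports runs of leucines spaced exactly seven residues apart.

```python
def leucine_heptad_runs(protein, min_repeats=4):
    """Return (start_index, count) for each maximal run of leucines ("L")
    spaced exactly seven residues apart, keeping runs of at least min_repeats."""
    runs = []
    for start in range(len(protein)):
        if protein[start] != "L":
            continue
        # Skip leucines that merely continue a run begun seven residues earlier.
        if start >= 7 and protein[start - 7] == "L":
            continue
        count = 0
        pos = start
        while pos < len(protein) and protein[pos] == "L":
            count += 1
            pos += 7
        if count >= min_repeats:
            runs.append((start, count))
    return runs

# Invented peptide: four heptads, each beginning with leucine.
toy_peptide = "MK" + "LEEKVAE" * 4 + "GS"
print(leucine_heptad_runs(toy_peptide))
```

A zinc finger search could be sketched in a similar way by looking for paired cysteines and histidines at characteristic spacings, although the spacing rules would have to be taken from a dedicated motif database rather than from this text.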

4.4.3 Enhancers and cis-Acting Regulatory Elements

Enhancers are defined as cis-acting sequences that increase transcriptional initiation but, unlike promoters, are not dependent on their orientation or their distance from the transcriptional start site [29]. They may be found within the introns of the genes they regulate, within adjacent genes or, in extreme cases, 1 Mb or more away. Enhancer sequences are generally short, on the order of 20–30 bp, and bind specific TFs. While there is mechanistic diversity in enhancer function, many enhancers facilitate the assembly of an activated transcriptional complex at the promoter via a chromatin looping mechanism. Other mechanisms involve recruitment of RNA polymerase II by enhancers or transcription of enhancer sequences to generate long noncoding RNAs that can facilitate transcription. Enhancers have roles in differentiation, tissue specification, and tissue-selective gene expression, playing important roles during development and in specific cell types.

Silencers are another class of cis-acting regulatory elements that reduce transcription levels [30]. They are less well characterized than enhancers, and some of them are position dependent while others seem to be position independent. They can bind TFs that act in transcriptional initiation, and many genes contain a combination of both positive and negative upstream regulatory elements that act in concert on a single promoter. This diversity of regulatory elements has the potential to precisely modulate gene expression with regard to cell type, developmental stage, and environmental conditions. Boundary elements are insulators, most of which block or isolate the effects of enhancers or silencers, limiting their action to the target genes.

Variation of gene promoters or enhancers can alter the pattern of gene expression but not the structure of a particular gene product. While such variation is much less frequent than structural variation in genes, it provides insight into the elements of transcriptional regulation. For example, SNVs and partial deletions of the β-globin gene cluster that affect upstream regulatory sequences lead to reduced expression of adult β chains in β-thalassemia and/or increased expression of fetal γ chains in hereditary persistence of fetal hemoglobin [31].

4.4.4 5′-Untranslated Sequences

Shortly after initiation of mRNA transcription, a 7-methylguanosine residue is added to the 5′ end of the primary transcript (see Fig. 4.6). This 5′ cap is a characteristic of nearly every mRNA molecule [32]. Many functions have been ascribed to the cap, the most notable of which is protection of the mRNA from degradation by exonucleases. The cap may also promote splicing and nuclear export of the RNA and is recognized by the translational machinery. The 5′-UTR extends from the capping site to the beginning of the protein-coding sequence and can be several hundred base pairs in length. The 5′-UTR regions of most mRNAs contain a consensus sequence, 5′-CCA/GCCAUGG-3′, known as a Kozak consensus sequence, involved in the initiation of protein synthesis. In addition, some 5′-UTRs contain upstream AUG codons that can affect the initiation of protein synthesis and thus could serve to control expression of selected genes at the translational level.

4.4.5 Introns and Splice Junctions

The number of introns in a simple transcriptional unit will be one less than the number of exons. More complicated arrangements exist in which an upstream exon can be spliced to any of several different downstream exons, or in which a complete transcriptional unit is nested inside an intron of a second transcriptional unit. In these situations, the same DNA sequence can be used as both exon and intron, depending on the transcriptional unit. Regardless of how the transcriptional unit is organized, the boundaries between potential exons and introns share common features that are important in the splicing process. Beginning from the upstream or 5′ exon, these splice junctions have the sequence where the brackets define the exon–intron junctions, the underlined nucleotides represent the splice donor, branch point and splice acceptor sequences, respectively, the uppercase sequences are virtually invariant characteristics of every splice junction, and the lowercase sequences are other conserved bases within the consensus splice sites. These conserved intron sequences serve a critical role in the splicing process, and many inherited diseases are caused by mutations of the consensus splice junctions. Most splicing mutations alter one of the invariant GT or AG nucleotides of the splice donor or splice acceptor [33] and result in abnormal splicing and either loss of the gene product or synthesis of an abnormal polypeptide chain.

A transcribed precursor RNA molecule must have its introns spliced out and its ends modified before export to the cytoplasm as mature mRNA. The spliceosome, which is composed of small nuclear ribonucleoproteins

Figure 4.11  Splicing of mRNA.

(snRNPs), mediates the splicing of the large number of pre-mRNA transcripts, collectively referred to as heterogeneous nuclear RNA (hnRNA). Spliceosomes are multienzyme complexes that both catalyze the splicing reaction and stabilize the intermediates in the splicing process. The snRNPs composing the spliceosome consist of a set of five integral snRNA molecules (U1, U2, U4, U5, and U6) tightly associated with a large number of proteins [34]. RNA molecules in the snRNPs are among the most highly evolutionarily conserved sequences among eukaryotes. An initial intermediate of the splicing reaction is formed when the 5′ guanylate end of an intron (the splice donor) is joined to an adenylate residue near the 3′ end of the intron (the branch point) through a 2′–5′
phosphodiester linkage (Fig. 4.11). After the completion of exon–exon fusion, the excised intron is released as a “lariat structure” by cleavage at the splice acceptor.

The genes encoding rRNA and tRNA also contain exons and introns but are spliced by mechanisms different from those required for mRNA splicing. Self-splicing of RNA without any protein factors is known to happen in prokaryotes, which suggests that introns have an extremely ancient evolutionary origin, predating not only the eukaryote/prokaryote divergence but perhaps also the origin of proteins.

4.4.6 3′-Untranslated Sequences and Transcriptional Termination

The 3′ ends of primary transcripts are determined by transcriptional termination signals located downstream of the ends of each coding region. However, the 3′ ends of mature mRNA molecules are created by cleavage of each primary precursor RNA and the addition of a polyadenylate [poly(A)] tail of several hundred nucleotides (see Fig. 4.6). The cleavage site is marked by the sequence 5′-AAUAAA-3′ located 15–20 nucleotides upstream of the poly(A) site and by additional GU-rich sequences 10–30 nucleotides downstream. Histone mRNAs, which do not have poly(A) tails, have stem-loop structures instead, with cleavage of the primary transcript mediated by a distinct protein complex that includes the U7 snRNP [35].

Some complex transcriptional units contain several potential polyadenylation and/or transcription termination sites. It is often difficult to distinguish the latter from the former, as the product available for analysis (mRNA) has lost the portion of the 3′-terminus originally transcribed by RNA polymerase. Alternative polyadenylation (or termination) sites can determine final protein structure if the longer precursor RNA contains an exon not found in the shorter precursor RNA. In a simple case, two proteins with different carboxyl termini are formed. But if alternative exon splice sites are made available in the longer precursor RNA, proteins with entirely different sequences can be produced.

The region from the translation termination codon to the poly(A) addition site may contain up to several hundred nucleotides of a 3′-UTR, which includes signals that affect mRNA processing and stability. Many mRNAs that are known to have a very short half-life contain AU-rich elements, 50- to 150-nucleotide sequences containing AUUUA motifs that regulate mRNA stability [36]. Other, less well-characterized sequences can have similar effects. Removal or alteration of these sequences can prolong the half-life of mRNA, indicating that such elements represent a general regulatory feature of mRNAs whose level of expression can be rapidly altered.

4.5 TRANSLATION OF RNA INTO PROTEIN

4.5.1 Genetic Code

After intron sequences are spliced out of the primary RNA transcript and the 3′-terminus is generated [in most cases, by the addition of a poly(A) tail], the mature mRNA is transported from the nucleus to the cytoplasm, where it is translated into a polypeptide chain. In the cytoplasm, tRNA molecules provide a bridge between mRNA and free amino acids (see Fig. 4.3). Adjacent groups of three nucleotides in the mRNA (codons) each bind to complementary three-nucleotide sequences in tRNA (anticodons). Unlike most other nucleic acids, tRNA molecules have rigid tertiary structures. All tRNAs are L-shaped, with the anticodon located at one end and the amino acid binding site at the other end. Modified nucleotides, such as methylguanosine (mG) and pseudouridine (ψ), are common in tRNA and help determine the specific three-dimensional characteristics of tRNA molecules. Aminoacyl tRNA synthetases specifically recognize different tRNAs and attach each tRNA to the correct amino acid. The last base in each codon is followed by the first base in the next, and thus the first codon in an mRNA molecule determines the reading frame for all subsequent codons.

The relationship between codon and amino acid sequence is referred to as the genetic code (Fig. 4.12). Different tertiary structures of each tRNA are specifically recognized by the proper tRNA synthetase, ensuring the accuracy of the code. As the anticodon sequence itself does not determine tRNA tertiary structure, each amino acid may have several possible codons recognized by tRNAs with different anticodons but similar tertiary structures; that is, they are recognized by the same tRNA synthetase. For example, 5′-AAA-3′ tRNA(Phe) (the tRNA coding for phenylalanine with the anticodon 5′-AAA-3′) has the same tertiary structure and is charged by the same tRNA synthetase as 5′-GAA-3′ tRNA(Phe). Thus, both codons 5′-UUU-3′ and 5′-UUC-3′ code for phenylalanine using different tRNAs but the same tRNA synthetase. Additional redundancy in the genetic code arises because the third base in each codon–anticodon duplex (which is the first base from the 5′ end of the anticodon)

can deviate from the strict rules of Watson–Crick base pairing. In particular, G:U or U:G base pairs are often found in the third position of a codon–anticodon duplex, and the guanine analog inosine, found only in tRNA, can pair or wobble with A, C, or U in the codon. Despite the redundancy of the genetic code, synonymous codons are not used with equal frequency, and the pattern of codon usage (codon bias) may vary tremendously among different species and between nuclear and mitochondrial mRNAs.

UUU Phe    UCU Ser    UAU Tyr     UGU Cys
UUC Phe    UCC Ser    UAC Tyr     UGC Cys
UUA Leu    UCA Ser    UAA Stop    UGA Stop
UUG Leu    UCG Ser    UAG Stop    UGG Trp

CUU Leu    CCU Pro    CAU His     CGU Arg
CUC Leu    CCC Pro    CAC His     CGC Arg
CUA Leu    CCA Pro    CAA Gln     CGA Arg
CUG Leu    CCG Pro    CAG Gln     CGG Arg

AUU Ile    ACU Thr    AAU Asn     AGU Ser
AUC Ile    ACC Thr    AAC Asn     AGC Ser
AUA Ile    ACA Thr    AAA Lys     AGA Arg
AUG Met    ACG Thr    AAG Lys     AGG Arg

GUU Val    GCU Ala    GAU Asp     GGU Gly
GUC Val    GCC Ala    GAC Asp     GGC Gly
GUA Val    GCA Ala    GAA Glu     GGA Gly
GUG Val    GCG Ala    GAG Glu     GGG Gly

Figure 4.12  The genetic code.

The AUG codon, which codes for methionine, nearly always begins the protein-coding portion of each mRNA molecule. Therefore, the vast majority of newly synthesized peptides begin with methionine. The tRNA(Met) for the initiator AUG codon has a different tertiary structure from all other tRNAs, including the tRNA(Met) that functions in elongation. Translation of most mRNAs generally begins with the first AUG from the 5′ end, which is typically embedded within a Kozak consensus sequence (5′-CCA/GCCAUGG-3′) and establishes the reading frame [37]. The UAA, UAG, and UGA codons are stop codons and have no cognate tRNAs. Thus, recognition of any one of these codons by the protein synthesis machinery terminates the protein-coding portion of every mRNA molecule.

Mutations that change a codon into one encoding a different amino acid result in a protein with an amino acid substitution and are described as missense mutations. However, the UAA, UAG, and UGA codons do not code for an amino acid but instead serve as a signal to terminate protein synthesis. Mutations that produce one of these codons in the middle of a normal reading frame cause truncation of the newly synthesized protein during protein synthesis and are referred to as nonsense mutations. Frequently, transcripts with a premature termination codon are degraded by a process called nonsense-mediated decay, so that no protein product is synthesized from the mutant allele, resulting in haploinsufficiency for the gene product.

4.5.2 Protein Synthesis

The biochemistry of protein synthesis (Fig. 4.13) can be divided into the stages of initiation, elongation, and termination. All three processes occur on ribosomes, cytoplasmic particles of protein and rRNA that align the different substrates of each reaction. When inactive, ribosomes exist as separate pools of the two ribosome subunits, described by their size or sedimentation coefficient (S value). The small 40S ribosomal subunit contains 18S rRNA and approximately 33 different proteins, and the large 60S subunit contains 28S rRNA, 5.8S rRNA, 5S rRNA, and approximately 50 different proteins. Beyond these structural components, there is a wealth of additional factors, including both proteins and functional RNA molecules, that are required for ribosome biogenesis [38]. Translation begins with the formation of a preinitiation complex that contains the 40S ribosomal subunit, initiator tRNA(Met), GTP, and several protein initiation factors. An mRNA molecule initially binds to the preinitiation complex in conjunction with several initiation factors that interact with the 5′ cap structure. The canonical model for identification of the AUG start codon involves scanning the mRNA in a 5′–3′ direction until the consensus sequence 5′-CCA/GCCAUGG-3′ is reached. However, internal ribosome entry sites (IRES) mediate ribosome recruitment and translational initiation for uncapped mRNA molecules and for translation when cap-dependent processes are inhibited [39]. Binding of the 60S ribosomal subunit and dissociation of several initiation factors generate a complex of proteins and subcellular particles poised to begin synthesis of the first peptide bond.

A ribosome contains room for two tRNAs and their respective amino acids (see Fig. 4.13). One tRNA at the peptidyl or P site is attached to the amino acid that has just been incorporated into a nascent peptide chain, and another tRNA at the aminoacyl or A site is attached to its cognate amino acid and is ready to participate in protein synthesis. During elongation, a peptide bond

Figure 4.13  Translation of mRNA into protein.

is formed between the two adjacent amino acids, the ribosome moves to the next codon in the mRNA, the tRNA at the P site dissociates from the nascent peptide chain, and the tRNA at the A site is translocated to the P site. This series of reactions is dependent on elongation factor 1 (EF1), which binds to free charged tRNAs, and EF2, which facilitates translocation from the A site to the P site. A single mRNA can be simultaneously translated by several active ribosomes, forming a polysome that can contain as many as 50 ribosomes.

When a codon signifying termination of protein synthesis (UAA, UAG, or UGA) is reached, the completed polypeptide separates from the tRNA at the P site, and the ribosome dissociates. In bacteria and yeast, a unique group of suppressor tRNA mutations is caused by changes in an anticodon that then permit binding of a charged tRNA to a termination codon. Point mutations in an mRNA that would normally lead to a premature termination codon (e.g., by changing a UUA codon into a UAA codon) are then partially suppressed by the mutant tRNA, allowing synthesis of a full-length protein with a missense change at the position of the mutant codon. The principle of suppression of nonsense codons, which represent about 30% of disease-causing mutations in humans, can be mimicked by treatment with aminoglycoside antibiotics and other pharmaceutical compounds [40]. Used as a therapeutic approach, such treatment has the potential to affect therapy in a wide variety of genetic disorders.

4.5.3 Protein Localization

Gene products function in particular cellular compartments. For example, histones, tubulin, glycosyltransferases, peptide hormone receptors, and collagen are specifically localized to the nucleus, cytosol, Golgi apparatus, cell membrane, and extracellular space, respectively. Although many membranes contain pores large enough to accommodate a linear polypeptide chain, completely folded proteins are generally too large to fit through these pores. In addition to the problem of translocating soluble proteins across membranes, proteins that remain attached to the membrane must be placed and oriented in specific ways. These problems, namely protein sorting, translocation, and membrane

Figure 4.14  Translocation of newly synthesized proteins across the endoplasmic reticulum. [Figure labels: mRNA with AUG, UAA, and poly(A); signal peptide; cytoplasmic surface; lumenal surface; signal recognition particle; docking protein; signal peptidase cleavage; disulfide bond formation (S–S); glycosylation (CHO); rough endoplasmic reticulum.]

orientation, have been solved by complex biochemical mechanisms that depend in part on short peptide sequences in each protein. One of the best-understood pathways is the initial sorting of gene products into those that will remain inside the cytosol or nucleus and those that pass across the endoplasmic reticulum (ER) membrane and are then available for secretion into the extracellular space. This initial sorting is determined early in the translation of proteins destined to cross the ER by the presence of a specialized hydrophobic signal sequence of 20–30 amino acids [41] usually located at the N-terminus (Fig. 4.14).

The signal sequence is first recognized when about 25 amino acids of the growing polypeptide have emerged from the ribosome and bind to a protein–RNA complex called the signal recognition particle (SRP). The SRP stops further translation until bound to a docking protein complex, the translocon, which is located on the surface of the ER and forms a hydrophilic membrane pore [42]. The signal peptide passes through the pore, translation recommences, and the growing polypeptide crosses the membrane cotranslationally. After the protein has passed into the lumen of the ER, signal peptidase, a protein on the luminal surface of the ER, cleaves the signal peptide to complete the initial phase of protein sorting.

Proteins extruded from the RER pass through the Golgi apparatus into secretory vesicles for transport to the cell surface. Proteins destined for the extracellular matrix are secreted from the cell when the vesicles fuse with the plasma membrane. Insertion and orientation of proteins destined for the cell membrane, however, require additional sequences that function either to stop transfer across the membrane (stop transfer sequences) or to initiate transfer of an internal loop of the nascent polypeptide chain (start transfer sequences). Start transfer sequences are recognized by the SRP, as N-terminal signal sequences are, but are not cleaved from the protein after translocation. The number, order, and orientation of start transfer and stop transfer sequences determine the conformation of complex integral membrane proteins that span the membrane multiple times. Some soluble proteins that contain the short peptide sequence Lys–Asp–Glu–Leu (KDEL), such as binding protein (BiP) and protein disulfide isomerase (PDI), remain in the lumen of the RER. Both BiP and PDI facilitate the folding of newly synthesized proteins in the RER. PDI catalyzes the rearrangement of Cys–Cys disulfide bonds; BiP is a so-called chaperone that binds temporarily to portions of other proteins not normally exposed at the surface and, in doing so, prevents partially folded proteins from misfolding and/or aggregating. Soluble proteins destined for specialized compartments inside the cell, such as lysosomes or peroxisomes, use a signal sequence to gain access to the ER lumen but require additional mechanisms for proper subcellular localization. Lysosomal sorting depends on amino acid sequences that specify posttranslational addition of a mannose 6-phosphate residue. Proteins containing this modification are selectively transferred from the Golgi apparatus to the lysosomal interior. Failure to modify proteins destined for the lysosome in

this way is responsible for the inherited disease mucolipidosis II (I-cell disease).

There are more than 1000 mitochondrial proteins, the great majority of which are encoded in the nucleus and synthesized in the cytosol. Similar to the proteins destined for the ER, transport of proteins across the mitochondrial membrane also depends on a signal sequence and uses several translocator complexes [43,44]. The targeting sequence is usually located at the N-terminus and is cleaved on import, but sometimes is internal to the protein and is therefore not cleaved. Unlike RER proteins, for which translation and translocation are codependent, mitochondrial proteins are first translated completely, released into the cytosol, and then translocated into the mitochondrial membranes, intermembrane space, or matrix. The potential problem of translocating a completely folded protein across a membrane pore is solved for mitochondria by complexes that include chaperone proteins of the Hsp70 family, which bind to proteins destined for the mitochondria and stabilize them in the unfolded state until after they have passed across the mitochondrial membrane, after which they assume their folded conformation.

4.5.4 Posttranslational Modification

Alterations to protein structure that occur after translation include the formation of disulfide bonds, hydroxylation, glycosylation, proteolytic cleavage, and phosphorylation. Phosphorylation of serine, tyrosine, and threonine residues is a common reversible modification that alters protein–protein interactions or controls enzymatic activity, mostly of intracellular proteins. The formation of disulfide bonds, hydroxylation, glycosylation, and proteolytic cleavage are generally not reversible and mostly involve extracellular proteins.

Intramolecular disulfide bond formation can begin cotranslationally as the growing polypeptide chain enters the lumen of the ER. Some proteins, such as immunoglobulin light chains, have a sequential pattern of intrachain disulfide bonds (e.g., between the first and second cysteines or third and fourth cysteines). Other proteins, such as proinsulin, have a more complicated pattern. Protein folding and establishment of the correct arrangement of disulfide bonds are critical steps in synthesizing a three-dimensional protein structure. Glycosylation of newly synthesized proteins may be O-linked via serine, threonine, or hydroxylysine residues, or N-linked via asparagine residues. O-linked glycosylation is catalyzed by glycosyltransferases located on the luminal surface of the Golgi apparatus. N-linked glycosylation begins with transfer of a 14-residue oligosaccharide from a lipid molecule (dolichol) embedded in the RER membrane to the asparagine residue of a growing polypeptide chain. At some sites, the oligosaccharide is highly modified by removal of some carbohydrates and addition of other carbohydrates to form a complex glycoprotein modification. Other sites are less modified and contain the original high mannose composition of the dolichol intermediate. Many glycoprotein modifications help determine the specificity of extracellular protein–protein interactions, such as antigen–antibody binding or attachment of cells to the extracellular matrix.

Proteoglycans are a specialized class of extensively glycosylated proteins that contain a protein core with long disaccharide chains branching off at regular intervals and can contain as much as 95% carbohydrate by weight. Proteoglycans are extremely hydrophilic and form hydrated gels that provide structural integrity to the extracellular space. During growth and development, extracellular remodeling is accompanied by endocytosis and degradation of proteoglycans by lysosomal enzymes specific for different disaccharide chains; absence of these lysosomal enzymes produces mucopolysaccharidoses, such as Hunter syndrome or Hurler syndrome.

4.5.5 Expression of Housekeeping and Tissue-Specific Genes

Many proteins that operate in basic metabolic functions such as energy generation or nutrient transport are found in all cells, and the genes that encode these proteins are described as housekeeping genes. They are characteristically expressed at a relatively constant level in all cells. More specialized genes that are not housekeeping are used only at specific times and places during development or in one or a limited set of tissues. The sequence of the human genome has revealed that the most common genes in our genome are those encoding transcription factors (TFs) and nucleic acid binding proteins, which together account for 26.8% of the proteins with known or putative function. Other highly represented genes encode receptors, transferases, signaling molecules, and transporters [45]. These and other housekeeping genes usually account for 90% or more of the transcripts expressed in any particular cell type.
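The distinction drawn in Section 4.5.5 between housekeeping genes (expressed at a relatively constant level in all cells) and tissue-specific genes (used in one or a limited set of tissues) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and does not come from the chapter: the function name classify_gene, the detection threshold min_level, the variability cutoff max_cv (a coefficient of variation), and the example expression values, which are invented toy numbers rather than measurements.

```python
from statistics import mean, pstdev


def classify_gene(expression_by_tissue, min_level=1.0, max_cv=0.5):
    """Label a gene from its expression levels across tissues.

    expression_by_tissue maps a tissue name to an expression level in
    arbitrary units. min_level and max_cv are hypothetical thresholds
    chosen only for this illustration.
    """
    levels = list(expression_by_tissue.values())
    # Tissues in which the gene is detectably expressed
    expressed = [x for x in levels if x >= min_level]
    if not expressed:
        return "not expressed"
    if len(expressed) < len(levels):
        # Detected in only a subset of the tissues examined
        return "tissue-specific"
    # Detected everywhere: check how constant the level is
    cv = pstdev(levels) / mean(levels)
    return "housekeeping-like" if cv <= max_cv else "broadly expressed, variable"


if __name__ == "__main__":
    # Invented toy profiles, not real data
    profiles = {
        "gene_a": {"liver": 10.0, "brain": 12.0, "muscle": 11.0},
        "gene_b": {"liver": 50.0, "brain": 0.0, "muscle": 0.2},
        "gene_c": {"liver": 100.0, "brain": 2.0, "muscle": 3.0},
    }
    for name, profile in profiles.items():
        print(name, classify_gene(profile))
```

In practice, both thresholds would have to be chosen for the particular expression assay and tissue panel; the sketch only makes the two criteria in the text (breadth and constancy of expression) explicit.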

Analysis of genome sequence data has revealed that segments of DNA with a relatively large proportion of 5′-CpG-3′ dinucleotide pairs and a very high GC content are frequently located near the 5′ ends of all housekeeping and a proportion of tissue-selective genes [46]. These CpG islands thus mark promoters and are present adjacent to about 70% of annotated genes, accounting for about half of the known CpG islands. Some of the remaining CpG islands have been associated with transcriptional activity, suggesting that these sequences mark the promoters of unknown transcripts and genes. On average, CpG islands are about 1000 bp long and most are less than 2000 bp in length. The CpG sequences within each island are generally nonmethylated and are thought to facilitate transcription either by preferential binding of TFs or by altering chromatin structure to a nucleosome-deficient and transcriptionally permissive state. Methylation of CpG islands is associated with reduced transcriptional activity, such as on the inactive X chromosome, although the methylation event appears to follow rather than initiate transcriptional silencing.

REFERENCES

[1] Kauppi L, Jeffreys AJ, Keeney S. Where the crossovers are: recombination distributions in mammals. Nat Rev Genet 2004;5:413–24.
[2] Broman KW, Murray JC, Sheffield VC, White RL, Weber JL. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 1998;63:861–9.
[3] Maeshima K, Imai R, Hikima T, Joti Y. Chromatin structure revealed by X-ray scattering analysis and computational modeling. Methods 2014;70:154–61.
[4] Ou HD, Phan S, Deerinck TJ, Thor A, Ellisman MH, O’Shea CC. ChromEMT: visualizing 3D chromatin structure and compaction in interphase and mitotic cells. Science 2017;357.
[5] Cremer T, Cremer M. Chromosome territories. Cold Spring Harb Perspect Biol 2010;2:a003889.
[6] Arthold S, Kurowski A, Wutz A. Mechanistic insights into chromosome-wide silencing in X inactivation. Hum Genet 2011;130:295–305.
[7] Wang J, Jia ST, Jia S. New insights into the regulation of heterochromatin. Trends Genet 2016;32:284–94.
[8] Mehta GD, Agarwal MP, Ghosh SK. Centromere identity: a challenge to be faced. Mol Genet Genom 2010;284:75–94.
[9] Grandin N, Charbonneau M. Protection against chromosome degradation at the telomeres. Biochimie 2008;90:41–59.
[10] Little PF. Structure and function of the human genome. Genome Res 2005;15:1759–66.
[11] Makalowski W. The human genome structure and organization. Acta Biochim Pol 2001;48:587–98.
[12] Efstratiadis A, Posakony JW, Maniatis T, Lawn RM, O’Connell C, Spritz RA, DeRiel JK, Forget BG, Weissman SM, Slightom JL, Blechl AE, Smithies O, Baralle FE, Shoulders CC, Proudfoot NJ. The structure and evolution of the human beta-globin gene family. Cell 1980;21:653–68.
[13] Cooper GM, Nickerson DA, Eichler EE. Mutational and selective effects on copy-number variants in the human genome. Nat Genet 2007;39:S22–9.
[14] Bagshaw ATM. Functional mechanisms of microsatellite DNA in eukaryotic genomes. Genome Biol Evol 2017;9:2428–43.
[15] Girirajan S, Campbell CD, Eichler EE. Human copy number variation and complex genetic disease. Annu Rev Genet 2011;45:203–26.
[16] Lupski JR. Genomic rearrangements and sporadic disease. Nat Genet 2007;39:S43–7.
[17] Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet 2009;10:551–64.
[18] Gu W, Zhang F, Lupski JR. Mechanisms for human genomic rearrangements. Pathogenetics 2008;1:4.
[19] Kim A, Dean A. Chromatin loop formation in the beta-globin locus and its role in globin gene transcription. Mol Cell 2012;34:1–5.
[20] Gonzalez-Sandoval A, Gasser SM. On TADs and LADs: spatial control over gene expression. Trends Genet 2016;32:485–95.
[21] Lupianez DG, Spielmann M, Mundlos S. Breaking TADs: how alterations of chromatin domains result in disease. Trends Genet 2016;32:225–37.
[22] Lupianez DG, Kraft K, Heinrich V, Krawitz P, Brancati F, Klopocki E, Horn D, Kayserili H, Opitz JM, Laxova R, Santos-Simarro F, Gilbert-Dussardier B, Wittler L, Borschiwer M, Haas SA, Osterwalder M, Franke M, Timmermann B, Hecht J, Spielmann M, Visel A, Mundlos S. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 2015;161:1012–25.
[23] Heather JM, Chain B. The sequence of sequencers: the history of sequencing DNA. Genomics 2016;107:1–8.
[24] Shendure JA, Porreca GJ, Church GM, Gardner AF, Hendrickson CL, Kieleczawa J, Slatko BE. Overview of DNA sequencing strategies. Curr Protoc Mol Biol 2011 (Chapter 7, Unit 7.1).

[25] van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends Genet 2014;30:418–26.
[26] Paule MR, White RJ. Survey and summary: transcription by RNA polymerases I and III. Nucleic Acids Res 2000;28:1283–98.
[27] Klug A. The discovery of zinc fingers and their development for practical applications in gene regulation and genome manipulation. Q Rev Biophys 2010;43:1–21.
[28] Scott MP, Tamkun JW, Hartzell 3rd GW. The structure and function of the homeodomain. Biochim Biophys Acta 1989;989:25–48.
[29] Bulger M, Groudine M. Functional and mechanistic diversity of distal transcription enhancers. Cell 2011;144:327–39.
[30] Riethoven JJ. Regulatory regions in DNA: promoters, enhancers, silencers, and insulators. Methods Mol Biol 2010;674:33–42.
[31] Thein SL, Menzel S, Lathrop M, Garner C. Control of fetal hemoglobin: new insights emerging from genomics and clinical implications. Hum Mol Genet 2009;18:R216–23.
[32] Cowling VH. Regulation of mRNA cap methylation. Biochem J 2009;425:295–302.
[33] Krawczak M, Thomas NS, Hundrieser B, Mort M, Wittig M, Hampe J, Cooper DN. Single base-pair substitutions in exon-intron junctions of human genes: nature, distribution, and consequences for mRNA splicing. Hum Mutat 2007;28:150–8.
[34] Valadkhan S, Jaladat Y. The spliceosomal proteome: at the heart of the largest cellular ribonucleoprotein machine. Proteomics 2010;10:4128–41.
[35] Dominski Z, Marzluff WF. Formation of the 3′ end of histone mRNA: getting closer to the end. Gene 2007;396:373–90.
[36] Barreau C, Paillard L, Osborne HB. AU-rich elements and associated factors: are there unifying principles? Nucleic Acids Res 2005;33:7138–50.
[37] Kozak M. Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene 2005;361:13–37.
[38] Strunk BS, Karbstein K. Powering through ribosome assembly. RNA 2009;15:2083–104.
[39] Gilbert WV. Alternative ways to think about cellular internal ribosome entry. J Biol Chem 2010;285:29033–8.
[40] Linde L, Kerem B. Introducing sense into nonsense in treatments of human genetic diseases. Trends Genet 2008;24:552–63.
[41] High S. Protein translocation at the membrane of the endoplasmic reticulum. Prog Biophys Mol Biol 1995;63:233–50.
[42] Swanton E, Bulleid NJ. Protein folding and translocation across the endoplasmic reticulum membrane. Mol Membr Biol 2003;20:99–104.
[43] Endo T, Yamano K. Multiple pathways for mitochondrial protein traffic. Biol Chem 2009;390:723–30.
[44] Mokranjac D, Neupert W. Thirty years of protein translocation into mitochondria: unexpectedly complex and still puzzling. Biochim Biophys Acta 2009;1793:33–41.
[45] Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 2003;13:2129–41.
[46] Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25:1010–22.
