
Chromosomes and Chromatin

Not only are the genomes of most eukaryotes much more complex than those of prokaryotes, but
the DNA of eukaryotic cells is also organized differently from that of prokaryotic cells. The
genomes of prokaryotes are contained in single chromosomes, which are usually circular DNA
molecules. In contrast, the genomes of eukaryotes are composed of multiple chromosomes, each
containing a linear molecule of DNA. Although the numbers and sizes of chromosomes vary
considerably between different species (Table 4.2), their basic structure is the same in all
eukaryotes. The DNA of eukaryotic cells is tightly bound to small basic proteins (histones) that
package the DNA in an orderly way in the cell nucleus. This task is substantial, given the DNA
content of most eukaryotes. For example, the total extended length of DNA in a human cell is
nearly 2 m, but this DNA must fit into a nucleus with a diameter of only 5 to 10 μm. Although
DNA packaging is also a problem in bacteria, the mechanism by which prokaryotic DNAs are
packaged in the cell appears distinct from that of eukaryotes and is not well understood.
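
The figure of nearly 2 m can be checked with a short calculation. The sketch below is only a rough estimate: it assumes the usual B-DNA rise of about 0.34 nm per base pair (a value not given in the text) and takes the haploid human genome size of 3,000 Mb from Table 4.2. All variable names are illustrative.

```python
# Rough check of the "nearly 2 m of DNA per human cell" figure.
# Assumption (not from the text): B-form DNA rises about 0.34 nm per base pair.
RISE_NM_PER_BP = 0.34
HAPLOID_GENOME_BP = 3_000 * 1_000_000      # 3,000 Mb, from Table 4.2
DIPLOID_GENOME_BP = 2 * HAPLOID_GENOME_BP  # two chromosome sets per somatic cell

# Convert nanometers to meters.
total_length_m = DIPLOID_GENOME_BP * RISE_NM_PER_BP * 1e-9

# Upper end of the 5 to 10 micrometer range quoted in the text.
nucleus_diameter_m = 10e-6
length_to_diameter = total_length_m / nucleus_diameter_m

print(f"Extended DNA length: {total_length_m:.2f} m")
print(f"DNA length / nucleus diameter: {length_to_diameter:,.0f}")
```

Under these assumptions the DNA is roughly two hundred thousand times longer than the nucleus is wide, which is why orderly packaging is needed.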

Table 4.2 Chromosome Numbers of Eukaryotic Cells

Organism                            Genome size (Mb)(a)   Chromosome number(a)
Yeast (Saccharomyces cerevisiae)                     12                     16
Slime mold (Dictyostelium)                           70                      7
Arabidopsis thaliana                                130                      5
Corn                                              5,000                     10
Onion                                            15,000                      8
Lily                                             50,000                     12
Nematode (Caenorhabditis elegans)                    97                      6
Fruit fly (Drosophila)                              180                      4
Toad (Xenopus laevis)                             3,000                     18
Lungfish                                         50,000                     17
Chicken                                           1,200                     39
Mouse                                             3,000                     20
Cow                                               3,000                     30
Dog                                               3,000                     39
Human                                             3,000                     23

(a) Both genome size and chromosome number are for haploid cells. Mb = millions of base pairs.
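
To see how loosely genome size and chromosome number are related, the values in Table 4.2 can be tabulated as average DNA per chromosome. This is a minimal sketch; the dictionary simply copies the table, and the shortened organism labels and the helper name `mean_mb_per_chromosome` are illustrative.

```python
# (haploid genome size in Mb, haploid chromosome number), copied from Table 4.2.
TABLE_4_2 = {
    "Yeast (S. cerevisiae)": (12, 16),
    "Slime mold": (70, 7),
    "Arabidopsis": (130, 5),
    "Corn": (5_000, 10),
    "Onion": (15_000, 8),
    "Lily": (50_000, 12),
    "Nematode (C. elegans)": (97, 6),
    "Fruit fly": (180, 4),
    "Toad (Xenopus)": (3_000, 18),
    "Lungfish": (50_000, 17),
    "Chicken": (1_200, 39),
    "Mouse": (3_000, 20),
    "Cow": (3_000, 30),
    "Dog": (3_000, 39),
    "Human": (3_000, 23),
}


def mean_mb_per_chromosome(organism):
    """Average DNA content per chromosome, in Mb."""
    genome_mb, n_chromosomes = TABLE_4_2[organism]
    return genome_mb / n_chromosomes


# List organisms from the smallest to the largest average chromosome.
for organism in sorted(TABLE_4_2, key=mean_mb_per_chromosome):
    genome_mb, n_chromosomes = TABLE_4_2[organism]
    print(f"{organism:<22} {genome_mb:>7,} Mb / {n_chromosomes:>2} = "
          f"{mean_mb_per_chromosome(organism):>8,.1f} Mb per chromosome")
```

Dog and human have the same genome size in the table but 39 and 23 chromosomes, so genome size alone does not predict chromosome number.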

Chromatin
The complexes between eukaryotic DNA and proteins are called chromatin, which typically
contains about twice as much protein as DNA. The major proteins of chromatin are
the histones, small proteins containing a high proportion of basic amino acids (arginine and
lysine) that facilitate binding to the negatively charged DNA molecule. There are five major
types of histones—called H1, H2A, H2B, H3, and H4—which are very similar among different
species of eukaryotes (Table 4.3). The histones are extremely abundant proteins in eukaryotic
cells; together, their mass is approximately equal to that of the cell's DNA. In
addition, chromatin contains an approximately equal mass of a wide variety of nonhistone
chromosomal proteins. There are more than a thousand different types of these proteins, which
are involved in a range of activities, including DNA replication and gene expression. Histones
are not found in eubacteria (e.g., E. coli), although the DNA of these bacteria is associated with
other proteins that presumably function like histones to package the DNA within the bacterial
cell. Archaebacteria, however, do contain histones that package their DNAs in structures similar
to eukaryotic chromatin.

Table 4.3 The Major Histone Proteins

Histone   Molecular weight   Number of amino acids   Percentage lysine + arginine
H1                  22,500                     244                           30.8
H2A                 13,960                     129                           20.2
H2B                 13,774                     125                           22.4
H3                  15,273                     135                           22.9
H4                  11,236                     102                           24.5

The basic structural unit of chromatin, the nucleosome, was described by Roger Kornberg in
1974 (Figure 4.8). Two types of experiments led to Kornberg's proposal of
the nucleosome model. First, partial digestion of chromatin with micrococcal nuclease (an
enzyme that degrades DNA) was found to yield DNA fragments approximately 200 base pairs
long. In contrast, a similar digestion of naked DNA (not associated with proteins) yielded a
continuous smear of randomly sized fragments. These results suggested that the binding of
proteins to DNA in chromatin protects regions of the DNA from nuclease digestion, so that the
enzyme can attack DNA only at sites separated by approximately 200 base pairs. Consistent with
this notion, electron microscopy revealed that chromatin fibers have a beaded appearance, with
the beads spaced at intervals of approximately 200 base pairs. Thus, both the nuclease digestion
and the electron microscopic studies suggested that chromatin is composed of repeating 200-
base-pair units, which were called nucleosomes.
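
The logic of the nuclease experiment can be illustrated with a toy model. The sketch below assumes, as a simplification, that the nuclease can cut only in the exposed DNA between adjacent nucleosomes and that each such site is cut independently with some probability. The function name `partial_digest`, the cut probability, and the random seed are all hypothetical choices for illustration.

```python
import random

REPEAT_BP = 200  # approximate nucleosome repeat length from the text


def partial_digest(n_nucleosomes, cut_probability, seed=0):
    """Return fragment lengths (bp) after cutting only between nucleosomes."""
    rng = random.Random(seed)
    fragments = []
    current = REPEAT_BP
    # There is one potential cut site between each pair of adjacent nucleosomes.
    for _ in range(n_nucleosomes - 1):
        if rng.random() < cut_probability:
            fragments.append(current)   # site cut: the fragment ends here
            current = REPEAT_BP
        else:
            current += REPEAT_BP        # site protected: the fragment grows
    fragments.append(current)
    return fragments


fragments = partial_digest(n_nucleosomes=50, cut_probability=0.5)
print(sorted(set(fragments)))
```

Every fragment in this model is a multiple of 200 base pairs, in contrast to the continuous smear expected when naked DNA can be cut anywhere.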

Figure 4.8 (A) The DNA is wrapped around histones in nucleosome core particles and sealed by
histone H1. Nonhistone proteins bind to the linker DNA between nucleosome core particles.
More extensive digestion of chromatin with micrococcal nuclease was found to yield
particles (called nucleosome core particles) that correspond to the beads visible by electron
microscopy. Detailed analysis of these particles has shown that they contain 146 base pairs
of DNA wrapped 1.65 times around a histone core consisting of two molecules each of H2A,
H2B, H3, and H4 (the core histones) (Figure 4.9). One molecule of the fifth histone, H1, is
bound to the DNA as it enters each nucleosome core particle. This forms a chromatin subunit
known as a chromatosome, which consists of 166 base pairs of DNA wrapped around the histone
core and held in place by H1 (a linker histone).

Figure 4.9
(A) The nucleosome core particle consists of 146 base pairs of DNA wrapped 1.65 turns around
a histone octamer consisting of two molecules each of H2A, H2B, H3, and H4.
A chromatosome contains two full turns of DNA (166 base pairs) locked in place by one
molecule of H1.
The packaging of DNA with histones yields a chromatin fiber approximately 10 nm in
diameter that is composed of chromatosomes separated by linker DNA segments averaging about
80 base pairs in length (Figure 4.10). In the electron microscope, this 10-nm fiber has the beaded
appearance that suggested the nucleosome model. Packaging of DNA into such a 10-
nm chromatin fiber shortens its length approximately sixfold. The chromatin can then be
further condensed by coiling into 30-nm fibers, the structure of which still remains to be
determined. Interactions between histone H1 molecules appear to play an important role in this
stage of chromatin condensation.
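
As a rough sense of scale, the numbers quoted so far can be combined. The sketch below uses only figures from the text (a repeat of about 200 base pairs, roughly 2 m of DNA per cell, sixfold shortening) plus the diploid genome size implied by Table 4.2; the results are order-of-magnitude estimates, and the variable names are illustrative.

```python
DIPLOID_GENOME_BP = 6_000_000_000  # 2 x 3,000 Mb (Table 4.2)
REPEAT_BP = 200                    # approximate nucleosome repeat
EXTENDED_LENGTH_M = 2.0            # "nearly 2 m" of DNA per human cell
FOLD_10NM = 6                      # approximate shortening in the 10-nm fiber

# One nucleosome per repeat of about 200 bp.
nucleosomes_per_cell = DIPLOID_GENOME_BP // REPEAT_BP

# Total length once the DNA is packaged into the 10-nm fiber.
length_10nm_fiber_m = EXTENDED_LENGTH_M / FOLD_10NM

print(f"Nucleosomes per diploid cell: about {nucleosomes_per_cell:,}")
print(f"Length as 10-nm fiber: about {length_10nm_fiber_m:.2f} m")
```

Even after this first level of packaging, about a third of a meter of fiber remains, so further condensation is needed to fit the nucleus.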

Figure 4.10
Chromatin fibers. The packaging of DNA into nucleosomes yields a chromatin fiber
approximately 10 nm in diameter. The chromatin is further condensed by coiling into a 30-nm
fiber, containing about six nucleosomes per turn.
The extent of chromatin condensation varies during the life cycle of the cell.
In interphase (nondividing) cells, most of the chromatin (called euchromatin) is relatively
decondensed and distributed throughout the nucleus (Figure 4.11). During this period of the cell
cycle, genes are transcribed and the DNA is replicated in preparation for cell division. Most of
the euchromatin in interphase nuclei appears to be in the form of 30-nm fibers, organized into
large loops containing approximately 50 to 100 kb of DNA. About 10% of the euchromatin,
containing the genes that are actively transcribed, is in a more decondensed state (the 10-nm
conformation) that allows transcription. Chromatin structure is thus intimately linked to the
control of gene expression in eukaryotes.
In contrast to euchromatin, about 10% of interphase chromatin (called heterochromatin)
is in a very highly condensed state that resembles the chromatin of cells undergoing mitosis.
Heterochromatin is transcriptionally inactive and contains highly repeated DNA sequences, such
as those present at centromeres and telomeres.
As cells enter mitosis, their chromosomes become highly condensed so that they can be
distributed to daughter cells. The loops of 30-nm chromatin fibers are thought to fold upon
themselves further to form the compact metaphase chromosomes of mitotic cells, in which
the DNA has been condensed nearly 10,000-fold. Such condensed chromatin can no longer be
used as a template for RNA synthesis, so transcription ceases during mitosis. Electron
micrographs indicate that the DNA in metaphase chromosomes is organized into large loops
attached to a protein scaffold, but we currently understand neither the detailed structure of this
highly condensed chromatin nor the mechanism of chromatin condensation.
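
The 10,000-fold figure can be put in per-chromosome terms. This is a back-of-the-envelope sketch that assumes the roughly 2 m of DNA is divided evenly among the 46 human chromosomes (in reality they differ in size); the names are illustrative.

```python
EXTENDED_LENGTH_M = 2.0   # "nearly 2 m" of DNA per human cell
N_CHROMOSOMES = 46        # diploid human chromosome number (2 x 23, Table 4.2)
METAPHASE_FOLD = 10_000   # condensation in metaphase chromosomes

# Average extended length of one chromosomal DNA molecule, in centimeters.
mean_extended_cm = EXTENDED_LENGTH_M / N_CHROMOSOMES * 100

# The same molecule after 10,000-fold condensation, in micrometers.
mean_metaphase_um = EXTENDED_LENGTH_M / N_CHROMOSOMES / METAPHASE_FOLD * 1e6

print(f"Mean extended DNA molecule: about {mean_extended_cm:.1f} cm")
print(f"Mean condensed length: about {mean_metaphase_um:.1f} micrometers")
```

A DNA molecule several centimeters long is thus reduced to a few micrometers, a size that fits within a nucleus 5 to 10 μm across.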
Metaphase chromosomes are so highly condensed that their morphology can be studied
using the light microscope. Several staining techniques yield characteristic patterns of alternating
light and dark chromosome bands, which result from the preferential binding of stains or
fluorescent dyes to AT-rich versus GC-rich DNA sequences. These bands are specific for each
chromosome and appear to represent distinct chromosome regions. Genes can be localized to
specific chromosome bands by in situ hybridization, indicating that the packaging of DNA
into metaphase chromosomes is a highly ordered and reproducible process.

Figure 4.11
Interphase chromatin. Electron micrograph of an interphase nucleus. The euchromatin is
distributed throughout the nucleus. The heterochromatin is indicated by arrowheads, and the
nucleolus by an arrow.

Centromeres
The centromere is a specialized region of the chromosome that plays a critical role in ensuring
the correct distribution of duplicated chromosomes to daughter cells during mitosis (Figure
4.15). The cellular DNA is replicated during interphase, resulting in the formation of two copies
of each chromosome prior to the beginning of mitosis. As the cell enters
mitosis, chromatin condensation leads to the formation of metaphase chromosomes consisting
of two identical sister chromatids. These sister chromatids are held together at the centromere,
which is seen as a constricted chromosomal region. As mitosis proceeds, microtubules of
the mitotic spindle attach to the centromere, and the two sister chromatids separate and move to
opposite poles of the spindle. At the end of mitosis, nuclear membranes re-form and the
chromosomes decondense, resulting in the formation of daughter nuclei containing one copy of
each parental chromosome.
The centromeres thus serve both as the sites of association of sister chromatids and as the
attachment sites for microtubules of the mitotic spindle. They consist of specific DNA sequences
to which a number of centromere-associated proteins bind, forming a specialized structure called
the kinetochore (Figure 4.16). The binding of microtubules to kinetochore proteins mediates the
attachment of chromosomes to the mitotic spindle. Proteins associated with the kinetochore then
act as “molecular motors” that drive the movement of chromosomes along the spindle fibers,
segregating the chromosomes to daughter nuclei.

Figure 4.15
Chromosomes during mitosis. Since DNA replicates during interphase, the cell contains two
identical duplicated copies of each chromosome prior to entering mitosis.
Centromeres of humans and other mammals have not yet been defined by functional studies, but
they have been identified by the binding of centromere-associated proteins. Mammalian
centromeres are characterized by extensive regions of heterochromatin consisting of highly
repetitive satellite DNA sequences. In humans and other primates the primary centromeric
sequence is α satellite DNA, which is a 171-base-pair sequence arranged in tandem repeats
spanning up to millions of base pairs. The α satellite DNA appears to play a role in centromere
structure and function, since it has been found to bind centromere-associated proteins. However,
the precise function of α satellite DNA, as well as the potential roles of other sequences in
mammalian centromeres, remains to be established. Consistent with their large size, mammalian
centromeres form large kinetochores that bind 30 to 40 microtubules, whereas only single
microtubules bind to the centromeres of S. cerevisiae.

Figure 4.16
The centromere of a metaphase chromosome. The centromere is the region at which the two
sister chromatids remain attached at metaphase. Specific proteins bind to centromeric DNA,
forming the kinetochore, which is the site of spindle fiber attachment.

Telomeres
The sequences at the ends of eukaryotic chromosomes, called telomeres, play critical roles in
chromosome replication and maintenance. Telomeres were initially recognized as distinct
structures because broken chromosomes were highly unstable in eukaryotic cells, implying that
specific sequences are required at normal chromosomal termini. This was subsequently
demonstrated by experiments in which telomeres from the protozoan Tetrahymena were added to
the ends of linear molecules of yeast plasmid DNA. The addition of these telomeric DNA
sequences allowed these plasmids to replicate as linear chromosome-like molecules in yeasts,
demonstrating directly that telomeres are required for the replication of linear DNA molecules.
The telomere DNA sequences of a variety of eukaryotes are similar, consisting of repeats
of a simple-sequence DNA containing clusters of G residues on one strand (Table 4.4). For
example, the sequence of telomere repeats in humans and other mammals is AGGGTT, and the
telomere repeat in Tetrahymena is GGGGTT. These sequences are repeated hundreds or
thousands of times, thus spanning up to several kilobases, and terminate with an overhang of
single-stranded DNA. Recent results suggest that the repeated sequences of telomere DNA form
loops at the ends of chromosomes, thereby protecting the chromosome termini from degradation
(Figure 4.19).
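
The repeat structure described above is easy to reproduce. The sketch below builds a short stretch of telomeric DNA from the human repeat unit given in the text and prints the complementary strand; the helper names `telomere` and `reverse_complement` are illustrative.

```python
def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def telomere(repeat_unit, n_repeats):
    """Tandem array of a simple-sequence repeat."""
    return repeat_unit * n_repeats


human = telomere("AGGGTT", 4)
print(human)                      # G-rich strand
print(reverse_complement(human))  # complementary C-rich strand
print(len(telomere("AGGGTT", 1000)), "bp for 1,000 repeats")
```

One thousand copies of a 6-base repeat give 6 kb, in line with the statement that telomeres span up to several kilobases.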
Telomeres play a critical role in replication of the ends of linear DNA molecules. DNA
polymerase is able to extend a growing DNA chain but cannot initiate synthesis of a new chain at
the terminus of a linear DNA molecule. Consequently, the ends of linear chromosomes cannot be
replicated by the normal action of DNA polymerase. This problem has been solved by the
evolution of a special enzyme, telomerase, which uses reverse transcriptase activity to replicate
telomeric DNA sequences. Maintenance of telomeres appears to be an important factor in determining the
lifespan and reproductive capacity of cells, so studies of telomeres and telomerase have the
promise of providing new insights into conditions such as aging and cancer.
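
The consequence of incomplete end replication can be sketched with a toy model. Everything numerical here is hypothetical: the starting length, the number of base pairs lost per division, and the number of repeats restored are placeholder values chosen for illustration, not measurements from the text.

```python
REPEAT_BP = 6  # length of the human repeat unit AGGGTT


def simulate_telomere(start_bp, divisions, loss_bp, repeats_added=0):
    """Telomere length (bp) after each division in a toy model.

    loss_bp and repeats_added are hypothetical parameters.
    """
    length = start_bp
    history = []
    for _ in range(divisions):
        length -= loss_bp                    # end not fully replicated
        length += repeats_added * REPEAT_BP  # repeats restored, if any
        length = max(length, 0)
        history.append(length)
    return history


# Without any restoring activity the telomere shortens steadily.
print(simulate_telomere(start_bp=6000, divisions=5, loss_bp=100))
# With enough repeats added per division, length is maintained.
print(simulate_telomere(start_bp=6000, divisions=5, loss_bp=100, repeats_added=17))
```

The model only illustrates the bookkeeping: a linear molecule that loses sequence at every round of replication needs a compensating mechanism to keep its ends.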

Table 4.4 Telomeric DNAs

Organism                          Telomeric repeat sequence
Yeasts
  Saccharomyces cerevisiae        G(1–3)T
  Schizosaccharomyces pombe       G(2–5)TTAC
Protozoans
  Tetrahymena                     GGGGTT
  Dictyostelium                   G(1–8)A
Plant
  Arabidopsis                     AGGGTTT
Mammal
  Human                           AGGGTT

Figure 4.19
Structure of a telomere. Telomere DNA loops back on itself to form a circular structure that
protects the ends of chromosomes.

CHROMOSOME NOMENCLATURE AND CLASSIFICATION


In the first decade of the twentieth century, cytologists (scientists who use the microscope to
study cell structure) showed that the chromosomes in a fertilized egg actually consist of two
matching sets, one contributed by the maternal gamete, the other by the paternal gamete. The
corresponding maternal and paternal chromosomes appear alike in size and shape, forming pairs
(with one exception: the sex chromosomes). Gametes and other cells that carry only a single set
of chromosomes are called haploid (from the Greek word for “single”). Zygotes and other cells
carrying two matching sets are diploid (from the Greek word for “double”).
The number of chromosomes in a normal haploid cell is designated by the shorthand
symbol n; the number of chromosomes in a normal diploid cell is then 2n. In humans, 2n = 46 and
n = 23. To study the chromosomes of a single organism, geneticists arrange micrographs of the
stained chromosomes in homologous pairs of decreasing size to produce a karyotype. Karyotype
assembly can now be sped up and automated by computerized image analysis. In a human male
karyotype, the 46 chromosomes are arranged in 22 matching pairs and one nonmatching pair. The 44
chromosomes in matching pairs are known as autosomes. The two unmatched chromosomes in
a male karyotype are called sex chromosomes, because they determine the sex of the
individual.
You can see how the halving of chromosome number during meiosis and gamete
formation, followed by the union of two gametes’ chromosomes at fertilization, normally allows
a constant 2n number of chromosomes to be maintained from generation to generation in all
individuals of a species. The chromosomes of every pair must segregate from each other during
meiosis so that the haploid gametes will each have one complete set of chromosomes. After
fertilization forms the zygote, the process of mitosis then ensures that all the cells of the
developing individual have identical diploid chromosome sets.
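
The bookkeeping of chromosome number across a generation can be sketched as follows. This is a deliberately simplified model: it ignores recombination and treats the sex chromosomes like any other pair, and the labels and function names are illustrative.

```python
import random

N = 23  # haploid chromosome number in humans


def make_diploid(maternal_label, paternal_label):
    """A diploid set: one (maternal, paternal) homolog pair per chromosome."""
    return [(f"{maternal_label}{i}", f"{paternal_label}{i}") for i in range(1, N + 1)]


def meiosis(diploid, rng):
    """Each gamete receives exactly one homolog from every pair."""
    return [rng.choice(pair) for pair in diploid]


def fertilization(egg, sperm):
    """The union of two haploid sets restores the homologous pairs."""
    return list(zip(egg, sperm))


rng = random.Random(1)
mother = make_diploid("Mm", "Mp")
father = make_diploid("Fm", "Fp")
egg, sperm = meiosis(mother, rng), meiosis(father, rng)
zygote = fertilization(egg, sperm)

print(len(egg), len(sperm), 2 * len(zygote))  # n, n, 2n
```

Because each gamete carries one member of every pair, the zygote again has 23 pairs, and the 2n number stays constant from one generation to the next.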

Figure: The complete set of human chromosomes. These chromosomes, from a female, were
isolated from a cell undergoing nuclear division (mitosis) and are therefore highly compacted.
Each chromosome has been “painted” a different color to permit its unambiguous identification
under the fluorescence microscope, using a technique called “spectral karyotyping.” (A) The
chromosomes visualized as they originally spilled from the lysed cell. (B) The same
chromosomes artificially lined up in their numerical order. This arrangement of the full
chromosome set is called a karyotype.

Number and shape of chromosomes

Scientists analyze the chromosomal makeup of a cell when the chromosomes are most visible—
at a specific moment in the cell cycle of growth and division, just before the nucleus divides. At
this point, known as metaphase, individual chromosomes have duplicated and condensed from
thin threads into compact rodlike structures.
Each chromosome now consists of two identical halves known as sister chromatids
attached to each other at a specific location called the centromere (Fig. 4.3). Based on the
position of the centromere and the length of the chromosomal arms, chromosomes are classified into
four groups. A chromosome with the centromere at or near the middle is called metacentric. A
submetacentric chromosome has its centromere slightly away from the midpoint. In
acrocentric chromosomes, the centromere is very close to one end. In telocentric chromosomes,
the centromere lies at the very end, so no p arm exists; telocentric chromosomes are not found in humans.
The sister chromatids of all human chromosomes therefore have two “arms” separated by a
centromere, even if one of the arms is very short. The shorter arm is designated p and the longer
arm q; the p arm is equal to or shorter than the q arm.
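
The four categories can be expressed as a simple rule on the ratio of arm lengths. The text gives no numerical boundaries, so the cutoffs below (q/p ratios of 1.7 and 3.0) are an assumption made purely for illustration, as is the function name `classify_by_centromere`.

```python
def classify_by_centromere(p_arm, q_arm):
    """Classify a chromosome from its short (p) and long (q) arm lengths.

    The ratio cutoffs 1.7 and 3.0 are illustrative assumptions,
    not values taken from the text.
    """
    if p_arm < 0 or q_arm <= 0 or p_arm > q_arm:
        raise ValueError("expected 0 <= p_arm <= q_arm and q_arm > 0")
    if p_arm == 0:
        return "telocentric"      # centromere at the very end, no p arm
    ratio = q_arm / p_arm
    if ratio <= 1.7:
        return "metacentric"      # centromere at or near the middle
    if ratio <= 3.0:
        return "submetacentric"   # centromere somewhat off center
    return "acrocentric"          # centromere very close to one end


for p, q in [(1.0, 1.0), (1.0, 2.0), (1.0, 5.0), (0.0, 5.0)]:
    print(p, q, classify_by_centromere(p, q))
```

Arm lengths can be in any unit, since only their ratio matters.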
Cells in metaphase can be fixed and stained with one of several dyes that highlight the
chromosomes and accentuate the centromeres. The dyes also produce characteristic
banding patterns made up of lighter and darker regions.
Chromosomes that match in size, shape, and banding are called homologous
chromosomes, or homologs. The two homologs of each pair contain the same set of genes,
although for some of those genes, they may carry different alleles. Nonhomologous
chromosomes, by contrast, carry completely unrelated sets of genetic information (Fig. 4.3).

Figure 4.3 Metaphase chromosomes can be classified by centromere position. Before cell
division, each chromosome replicates into two sister chromatids connected at a centromere. In
highly condensed metaphase chromosomes, the centromere can appear near the middle (a
metacentric chromosome), very near an end (an acrocentric chromosome), or anywhere in
between. In a diploid cell, one homologous chromosome in each pair is from the mother and the
other from the father.

Modern methods of DNA analysis can reveal differences between the maternally and
paternally derived chromosomes of a homologous pair, and can thus track the origin of the extra
chromosome 21 that causes Down syndrome in individual patients. In 80% of cases, the third
chromosome 21 comes from the egg; in 20%, from the sperm. Prenatally, roughly three months
after a fetus is conceived.
The main difference between a karyotype and a karyogram is that the karyotype is the
number, size, and shape of the chromosomes of a particular organism, whereas the karyogram is a
visual profile of stained chromosomes in a standard format.
An idiogram is a diagrammatic representation of the karyotype showing all the morphological
features of the chromosomes, grouped by centromere position and ordered in a
series of decreasing size.
Human chromosomes are classified into seven groups, A to G, by length and centromere position:
• Group A (1-3) – large metacentric (chromosome 2 is submetacentric)
• Group B (4-5) – large submetacentric
• Group C (6-12, X) – medium-sized submetacentric
• Group D (13-15) – medium-sized acrocentric
• Group E (16-18) – short submetacentric
• Group F (19-20) – short metacentric
• Group G (21-22, Y) – short acrocentric (the Y chromosome is submetacentric)
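
The grouping above amounts to a lookup table. A minimal sketch, with the group memberships copied from the list and an illustrative helper name `group_of`:

```python
# Membership of the seven groups, copied from the list above.
KARYOTYPE_GROUPS = {
    "A": [str(i) for i in range(1, 4)],
    "B": ["4", "5"],
    "C": [str(i) for i in range(6, 13)] + ["X"],
    "D": [str(i) for i in range(13, 16)],
    "E": [str(i) for i in range(16, 19)],
    "F": ["19", "20"],
    "G": ["21", "22", "Y"],
}


def group_of(chromosome):
    """Return the group letter (A to G) for a human chromosome."""
    name = str(chromosome).upper()
    for group, members in KARYOTYPE_GROUPS.items():
        if name in members:
            return group
    raise ValueError(f"unknown human chromosome: {chromosome!r}")


print(group_of(2), group_of("X"), group_of(21), group_of("Y"))
```

The seven groups together account for all 24 distinct human chromosomes (22 autosomes plus X and Y).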

THE HUMAN GENOME AND ITS CHROMOSOMES

With the exception of cells that develop into gametes (the germline), all cells that contribute to
one’s body are called somatic cells (soma, body). The genome contained in the nucleus of
human somatic cells consists of 46 chromosomes, arranged in 23 pairs. Of those 23 pairs, 22 are
alike in males and females and are called autosomes, numbered from the largest to the smallest.
The remaining pair comprises the sex chromosomes: two X chromosomes in females and an X
and a Y chromosome in males. Each chromosome carries a different subset of genes that are
arranged linearly along its DNA. Members of a pair of chromosomes (referred to as homologous
chromosomes or homologues) carry matching genetic information; that is, they have the same
genes in the same sequence. At any specific locus, however, they may have either identical or
slightly different forms of the same gene, called alleles. One member of each pair of
chromosomes is inherited from the father, the other from the mother.
Normally, the members of a pair of autosomes are microscopically indistinguishable from each
other. In females, the sex chromosomes, the two X chromosomes, are likewise largely
indistinguishable. In males, however, the sex chromosomes differ. One is an X, identical to the
X’s of the female, inherited by a male from his mother and transmitted to his daughters; the
other, the Y chromosome, is inherited from his father and transmitted to his sons.
In addition to the nuclear genome, a small but important part of the human genome resides in
mitochondria in the cytoplasm. The mitochondrial chromosome has a number of unusual
features that distinguish it from the rest of the human genome.

Organization of Human Chromosomes

The composition of genes in the human genome, as well as the determinants of their expression,
is specified in the DNA of the 46 human chromosomes in the nucleus plus the mitochondrial
chromosome. Each human chromosome consists of a single, continuous DNA double helix; that
is, each chromosome in the nucleus is a long, linear double-stranded DNA molecule, and the
nuclear genome consists, therefore, of 46 DNA molecules, totaling more than 6 billion
nucleotides.
Chromosomes are not naked DNA double helices, however. Within each cell, the genome
is packaged as chromatin, in which genomic DNA is complexed with several classes of
chromosomal proteins. Except during cell division, chromatin is distributed throughout the
nucleus and is relatively homogeneous in appearance under the microscope. When a cell divides,
however, its genome condenses to appear as microscopically visible chromosomes.
Chromosomes are thus visible as discrete structures only in dividing cells, although they retain
their integrity between cell divisions.

The DNA molecule of a chromosome exists in chromatin as a complex with a family of
basic chromosomal proteins called histones and with a heterogeneous group of nonhistone
proteins that are much less well characterized but that appear to be critical for establishing a
proper environment to ensure normal chromosome behavior and appropriate gene expression.

The Mitochondrial Genome

As mentioned earlier, a small but important subset of genes encoded in the human genome
resides in the cytoplasm in the mitochondria. Mitochondrial genes exhibit exclusively maternal
inheritance. Human cells can have hundreds to thousands of mitochondria, each containing a
number of copies of a small circular molecule, the mitochondrial chromosome. The
mitochondrial DNA molecule is only 16 kb in length (less than 0.03% of the length of the
smallest nuclear chromosome) and contains only 37 genes, which encode 13 proteins, 22 tRNAs,
and 2 rRNAs. The 13 proteins are all subunits of the enzyme complexes of the oxidative
phosphorylation system, which enables mitochondria to act as the powerhouses of our cells.
The products of these genes function in mitochondria, although the majority of proteins within
the mitochondria are, in fact, the products of nuclear genes. Mutations in mitochondrial genes
have been demonstrated in several maternally inherited as well as sporadic disorders.

Organization of the Human Genome

Regions of the genome with similar characteristics of organization, replication, and expression
are not arranged randomly but rather tend to be clustered together. This functional organization
of the genome correlates remarkably well with its structural organization as revealed by
laboratory methods of chromosome analysis. The overall significance of this functional
organization is that chromosomes are not just a random collection of different types of genes and
other DNA sequences. Some chromosome regions, or even whole chromosomes, are high in
gene content (“gene rich”), whereas others are low (“gene poor”) (Fig. 2-8). Certain types of
sequence are characteristic of the different structural features of human chromosomes.
The clinical consequences of abnormalities of genome structure reflect the specific nature
of the genes and sequences involved. Thus, abnormalities of gene-rich chromosomes or
chromosomal regions tend to be much more severe clinically than similar-sized defects involving
gene-poor parts of the genome.
As a result of knowledge gained from the Human Genome Project, it is apparent that the
organization of DNA in the human genome is far more varied than was once widely appreciated.
Of the 3 billion base pairs of DNA in the genome, less than 1.5% actually encodes proteins and
only about 5% is thought to contain regulatory elements that influence or determine patterns of
gene expression during development or in different tissues. Only about half of the total linear
length of the genome consists of so-called single copy or unique DNA, that is, DNA whose
nucleotide sequence is represented only once (or at most a few times). The rest of the genome
consists of several classes of repetitive DNA and includes DNA whose nucleotide sequence is
repeated, either perfectly or with some variation, hundreds to millions of times in the genome.
Whereas most (but not all) of the estimated 25,000 genes in the genome are represented in single
copy DNA, sequences in the repetitive DNA fraction contribute to maintaining chromosome
structure and are an important source of variation between different individuals; some of this
variation can predispose to pathological events in the genome.
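
The percentages in this paragraph translate into absolute amounts of DNA as follows. The sketch uses only the figures quoted above and treats the approximate values ("less than 1.5%", "about 5%", "about half") as if they were exact, so the results are rough estimates; the variable names are illustrative.

```python
GENOME_BP = 3_000_000_000  # base pairs in the genome, as quoted in the text
N_GENES = 25_000           # estimated number of genes, as quoted in the text

coding_bp_max = GENOME_BP * 15 // 1000      # "less than 1.5%": an upper bound
regulatory_bp = GENOME_BP * 5 // 100        # "about 5%"
single_copy_bp = GENOME_BP // 2             # "about half"
repetitive_bp = GENOME_BP - single_copy_bp  # the rest of the genome

print(f"Protein-coding DNA: under {coding_bp_max / 1e6:,.0f} Mb")
print(f"Regulatory DNA:     about {regulatory_bp / 1e6:,.0f} Mb")
print(f"Single-copy DNA:    about {single_copy_bp / 1e6:,.0f} Mb")
print(f"Repetitive DNA:     about {repetitive_bp / 1e6:,.0f} Mb")
print(f"Coding sequence per gene: under {coding_bp_max // N_GENES:,} bp on average")
```

On these figures, the average gene would account for less than about 2 kb of protein-coding sequence.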

Single-Copy DNA Sequences

Although single-copy DNA makes up at least half of the DNA in the genome, much of its
function remains a mystery because, as mentioned, sequences actually encoding proteins (i.e.,
the coding portion of genes) constitute only a small proportion of all the single-copy DNA. Most
single-copy DNA is found in short stretches (several kilobase pairs or less), interspersed with
members of various repetitive DNA families.

Repetitive DNA Sequences

Several different categories of repetitive DNA are recognized. A useful distinguishing feature is
whether the repeated sequences (“repeats”) are clustered in one or a few locations or whether
they are interspersed, throughout the genome, with single-copy sequences along the
chromosome. Clustered repeated sequences constitute an estimated 10% to 15% of the genome
and consist of arrays of various short repeats organized tandemly in a head-to-tail fashion. The
different types of such tandem repeats are collectively called satellite DNAs, so named because
many of the original tandem repeat families could be separated by biochemical methods from the
bulk of the genome as distinct (“satellite”) fractions of DNA.
Tandem repeat families vary with regard to their location in the genome, the total length
of the tandem array, and the length of the constituent repeat units that make up the array. In
general, such arrays can stretch several million base pairs or more in length and constitute up to
several percent of the DNA content of an individual human chromosome. Many tandem repeat
sequences are important as molecular tools that have revolutionized clinical cytogenetic analysis
because of their relative ease of detection. Some human tandem repeats are based on repetitions
(with some variation) of a short sequence such as a pentanucleotide.
Long arrays of such repeats are found in large genetically inert regions on chromosomes
1, 9, and 16 and make up more than half of the Y chromosome. Other tandem repeat families are
based on somewhat longer basic repeats. For example, the α-satellite family of DNA is
composed of tandem arrays of different copies of an approximately 171–base pair unit, found at
the centromere of each human chromosome, which is critical for attachment of chromosomes to
microtubules of the spindle apparatus during cell division. This repeat family is believed to play
a role in centromere function by ensuring proper chromosome segregation in mitosis and
meiosis.
In addition to tandem repeat DNAs, another major class of repetitive DNA in the genome
consists of related sequences that are dispersed throughout the genome rather than localized.
Although many small DNA families meet this general description, two in particular warrant
discussion because together they make up a significant proportion of the genome and because
they have been implicated in genetic diseases. Among the best-studied dispersed repetitive
elements are those belonging to the so-called Alu family. The members of this family are about
300 base pairs in length and are recognizably related to each other although not identical in DNA
sequence. In total, there are more than a million Alu family members in the genome, making up
at least 10% of human DNA. In some regions of the genome, however, they make up a much
higher percentage of the DNA.
A second major dispersed repetitive DNA family is called the long interspersed nuclear
element (LINE, sometimes called L1) family. LINEs are up to 6 kb in length and are found in
about 850,000 copies per genome, accounting for about 20% of the genome. They also are
plentiful in some regions of the genome but relatively sparse in others.
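
The copy numbers and genome fractions quoted for the Alu and LINE families can be cross-checked with simple arithmetic. The sketch assumes a genome of 3 billion base pairs and uses the rounded figures from the text, so the outputs are approximate; the variable names are illustrative.

```python
GENOME_BP = 3_000_000_000

ALU_COPIES, ALU_LENGTH_BP = 1_000_000, 300
LINE_COPIES, LINE_FRACTION_PCT, LINE_MAX_BP = 850_000, 20, 6_000

# Alu: copies x length, expressed as a share of the genome.
alu_total_bp = ALU_COPIES * ALU_LENGTH_BP
alu_fraction_pct = 100 * alu_total_bp / GENOME_BP

# LINE: work backward from the genome share to the average copy length.
line_total_bp = GENOME_BP * LINE_FRACTION_PCT // 100
mean_line_bp = line_total_bp / LINE_COPIES

print(f"Alu:  {alu_total_bp / 1e6:,.0f} Mb, {alu_fraction_pct:.0f}% of the genome")
print(f"LINE: {line_total_bp / 1e6:,.0f} Mb, mean copy length about {mean_line_bp:,.0f} bp")
print("All LINE copies at 6 kb would exceed the genome:",
      LINE_COPIES * LINE_MAX_BP > GENOME_BP)
```

The Alu figures agree with the stated 10%. For LINEs, the implied average copy is far shorter than 6 kb, which fits the wording that LINEs are "up to" 6 kb in length: most copies would have to be shorter than the maximum.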

Figure 2-8. Size and gene content of the 24 human chromosomes. A, Size of each human
chromosome, in millions of base pairs (1 million base pairs = 1 Mb). Chromosomes are ordered
left to right by size. B, Number of genes identified on each human chromosome. Chromosomes
are ordered left to right by gene content.

Repetitive DNA and Disease

Families of repeats dispersed throughout the genome are clearly of medical importance. Both Alu
and LINE sequences have been implicated as the cause of mutations in hereditary disease. At
least a few copies of the LINE and Alu families generate copies of themselves that can integrate
elsewhere in the genome, occasionally causing insertional inactivation of a medically important
gene. The frequency of such events causing genetic disease in humans is currently unknown, but
they may account for as many as 1 in 500 mutations. In addition, aberrant recombination events
between different LINE or Alu repeats can also be a cause of mutation in some genetic diseases.
An important additional class of repetitive DNA includes sequences that are duplicated, often
with extraordinarily high sequence conservation, in many different locations around the genome.
Duplications involving substantial segments of a chromosome, called segmental duplications,
can span hundreds of kilobase pairs and account for at least 5% of the genome.
When the duplicated regions contain genes, genomic rearrangements involving the
duplicated sequences can result in the deletion of the region (and the genes) between the copies
and thus give rise to disease. In addition, rearrangements between segments of the genome are a
source of significant variation between individuals in the number of copies of these DNA
sequences.

Coding vs. noncoding DNA


The content of the human genome is commonly divided into coding and noncoding DNA
sequences. Coding DNA is defined as those sequences that can be transcribed
into mRNA and translated into proteins during the human life cycle; these sequences occupy
only a small fraction of the genome (<2%). Noncoding DNA is made up of all of those
sequences (ca. 98% of the genome) that are not used to encode proteins.
Some noncoding DNA contains genes for RNA molecules with important biological
functions (noncoding RNA, for example ribosomal RNA and transfer RNA). The exploration of
the function and evolutionary origin of noncoding DNA is an important goal of contemporary
genome research, including the ENCODE (Encyclopedia of DNA Elements) project, which aims
to survey the entire human genome, using a variety of experimental tools whose results are
indicative of molecular activity.
Because noncoding DNA far exceeds coding DNA in amount, the sequenced genome as a whole,
rather than the classical protein-coding gene alone, has become the main unit of
analysis.

Coding sequences (protein-coding genes)


Protein-coding sequences represent the most widely studied and best understood component of
the human genome. These sequences ultimately lead to the production of all human proteins,
although several biological processes (e.g. DNA rearrangements and alternative pre-mRNA
splicing) can lead to the production of many more unique proteins than the number of protein-
coding genes. The complete modular protein-coding capacity of the genome is contained within
the exome, and consists of DNA sequences encoded by exons that can be translated into proteins.
Because of its biological importance, and the fact that it constitutes less than 2% of the genome,
sequencing of the exome was the first major milepost of the Human Genome Project.

Number of protein-coding genes. About 20,000 human proteins have been annotated in
databases such as UniProt. Historically, estimates for the number of protein genes have varied
widely, ranging up to 2,000,000 in the late 1960s, but several researchers pointed out in the early
1970s that the estimated mutational load from deleterious mutations placed an upper limit of
approximately 40,000 for the total number of functional loci (this includes protein-coding and
functional non-coding genes). The number of human protein-coding genes is not significantly
larger than that of many less complex organisms, such as the roundworm and the fruit fly. The
greater complexity of humans may instead result from the extensive use of alternative pre-mRNA
splicing, which provides the ability to build a very large number of modular proteins through
the selective incorporation of exons.
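The combinatorial effect of selective exon incorporation can be illustrated with a toy model. The sketch below assumes, purely for illustration, that the first and last exons are always retained and that every internal exon is independently included or skipped; real splicing is regulated and far from every combination occurs. The function names are hypothetical.

```python
from itertools import combinations

def count_isoforms(n_exons: int) -> int:
    """Count exon combinations when the first and last exons are always kept
    and each internal exon is independently included or skipped (toy model)."""
    if n_exons < 2:
        return 1
    return 2 ** (n_exons - 2)

def enumerate_isoforms(exons):
    """List every isoform under the same toy model, preserving exon order.
    Expects at least two exons."""
    first, *internal, last = exons
    isoforms = []
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            isoforms.append([first, *kept, last])
    return isoforms

example = ["E1", "E2", "E3", "E4"]
for isoform in enumerate_isoforms(example):
    print("-".join(isoform))
print(count_isoforms(8))
```

Even in this simplified picture the number of possible products doubles with each additional internal exon, which is why a modest number of genes can in principle give rise to a much larger number of distinct proteins.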
Protein-coding capacity per chromosome. Protein-coding genes are distributed
unevenly across the chromosomes, ranging from a few dozen to more than 2000, with an
especially high gene density within chromosomes 1, 11, and 19. Each chromosome contains
various gene-rich and gene-poor regions, which may be correlated with chromosome
bands and GC-content. The significance of these nonrandom patterns of gene density is not well
understood.
Size of protein-coding genes. The size of protein-coding genes within the human
genome shows enormous variability. For example, the gene for histone H1a (HIST1H1A) is
relatively small and simple, lacking introns and encoding a 781-nucleotide-long mRNA that
produces a 215 amino acid protein from its 648 nucleotide open reading
frame. Dystrophin (DMD) was the largest protein-coding gene in the 2001 human reference
genome, spanning a total of 2.2 million nucleotides, while more recent systematic meta-analysis
of updated human genome data identified an even larger protein-coding gene, RBFOX1 (RNA
binding protein, fox-1 homolog 1), spanning a total of 2.47 million nucleotides. Titin (TTN) has
the longest coding sequence (114,414 nucleotides), the largest number of exons (363), and the
longest single exon (17,106 nucleotides). As estimated based on a curated set of protein-coding
genes over the whole genome, the median size is 26,288 nucleotides (mean = 66,577), the
median exon size, 133 nucleotides (mean = 309), the median number of exons, 8 (mean = 11),
and the median encoded protein is 425 amino acids (mean = 553) in length.
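The histone H1a figures quoted above are internally consistent, as a short calculation shows. The sketch assumes that the quoted open reading frame length includes the stop codon; the constant names are illustrative only.

```python
# Figures quoted in the text for the histone H1a gene.
MRNA_LENGTH_NT = 781
ORF_LENGTH_NT = 648
PROTEIN_LENGTH_AA = 215

# Each amino acid is encoded by one 3-nucleotide codon; the ORF also
# ends with a stop codon that is not translated into an amino acid.
# Assumption: the quoted ORF length counts that stop codon.
expected_orf_nt = PROTEIN_LENGTH_AA * 3 + 3

# Whatever is left of the mRNA outside the ORF is untranslated (5' + 3' UTR).
untranslated_nt = MRNA_LENGTH_NT - ORF_LENGTH_NT

print(expected_orf_nt == ORF_LENGTH_NT)
print(untranslated_nt)
```

The same codon arithmetic can be applied to any intronless gene for which the mRNA, open reading frame, and protein lengths are all known.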

Noncoding DNA (ncDNA)


Noncoding DNA is defined as all of the DNA sequences within a genome that are not found
within protein-coding exons, and so are never represented within the amino acid sequence of
expressed proteins. By this definition, more than 98% of the human genome is composed of
ncDNA.
Numerous classes of noncoding DNA have been identified, including genes for
noncoding RNA (e.g. tRNA and rRNA), pseudogenes, introns, untranslated regions of mRNA,
regulatory DNA sequences, repetitive DNA sequences, and sequences related to mobile genetic
elements.
Numerous sequences that are included within genes are also defined as noncoding DNA.
These include genes for noncoding RNA (e.g. tRNA, rRNA), and untranslated components of
protein-coding genes (e.g. introns, and 5' and 3' untranslated regions of mRNA).
Protein-coding sequences (specifically, coding exons) constitute less than 1.5% of the
human genome. In addition, about 26% of the human genome is introns.[44] Aside from genes
(exons and introns) and known regulatory sequences (8–20%), the human genome contains
regions of noncoding DNA. The exact amount of noncoding DNA that plays a role in cell
physiology has been hotly debated. Recent analysis by the ENCODE project indicates that 80%
of the entire human genome is either transcribed, binds to regulatory proteins, or is associated
with some other biochemical activity.
It remains controversial, however, whether all of this biochemical activity contributes to
cell physiology, or whether a substantial portion of it is the result of transcriptional and
biochemical noise, which must be actively filtered out by the organism.[45] Excluding
protein-coding sequences, introns, and regulatory regions, much of the noncoding DNA is
composed of pseudogenes, genes for noncoding RNA, repetitive DNA sequences, and mobile genetic
elements and their relics.
Many DNA sequences that do not play a role in gene expression have important biological
functions. Comparative genomics studies indicate that about 5% of the genome contains
sequences of noncoding DNA that are highly conserved, sometimes on time-scales representing
hundreds of millions of years, implying that these noncoding regions are under
strong evolutionary pressure and purifying selection.[46]
Many of these sequences regulate the structure of chromosomes by limiting the regions
of heterochromatin formation and regulating structural features of the chromosomes, such as
the telomeres and centromeres. Other noncoding regions serve as origins of DNA replication.
Finally, several regions are transcribed into functional noncoding RNAs that regulate the
expression of protein-coding genes, mRNA translation and stability (see miRNA),
chromatin structure (including histone modifications), DNA methylation, and DNA
recombination, and that cross-regulate other noncoding RNAs. It is also likely that many
transcribed noncoding regions do not serve any role and that
this transcription is the product of non-specific RNA polymerase activity.
Pseudogenes
Pseudogenes are inactive copies of protein-coding genes, often generated by gene duplication,
that have become nonfunctional through the accumulation of inactivating mutations. The number
of pseudogenes in the human genome is on the order of 13,000, and in some chromosomes is
nearly the same as the number of functional protein-coding genes. Gene duplication is a major
mechanism through which new genetic material is generated during molecular evolution.
For example, the olfactory receptor gene family is one of the best-documented examples
of pseudogenes in the human genome. More than 60 percent of the genes in this family are non-
functional pseudogenes in humans. By comparison, only 20 percent of genes in the mouse
olfactory receptor gene family are pseudogenes. Research suggests that this is a species-specific
characteristic, as the most closely related primates all have proportionally fewer pseudogenes.
This genetic discovery helps to explain the less acute sense of smell in humans relative to other
mammals.
Genes for noncoding RNA (ncRNA)
Noncoding RNA molecules play many essential roles in cells, especially in the many reactions
of protein synthesis and RNA processing. Noncoding RNAs
include tRNA, ribosomal RNA, microRNA, snRNA, and other noncoding RNAs, including
about 60,000 long noncoding RNAs (lncRNAs). Although the number of reported lncRNA
genes continues to rise and the exact number in the human genome is yet to be defined, many of
them are argued to be non-functional.

Many ncRNAs are critical elements in gene regulation and expression. Noncoding RNA
also contributes to epigenetics, transcription, RNA splicing, and the translational machinery. The
role of RNA in genetic regulation and disease offers a new potential level of unexplored genomic
complexity.
Introns and untranslated regions of mRNA
In addition to the ncRNA molecules that are encoded by discrete genes, the initial transcripts of
protein coding genes usually contain extensive noncoding sequences, in the form of introns, 5'-
untranslated regions (5'-UTR), and 3'-untranslated regions (3'-UTR). Within most protein-coding
genes of the human genome, the length of intron sequences is 10- to 100-times the length of exon
sequences.
Regulatory DNA sequences
The human genome has many different regulatory sequences that are crucial to
controlling gene expression. Conservative estimates indicate that these sequences make up 8% of
the genome; extrapolations from the ENCODE project, however, suggest that 20–40% of the genome
is gene regulatory sequence. Some types of noncoding DNA, called enhancers, are genetic
"switches" that do not encode proteins but do regulate when and where genes are expressed.
Regulatory sequences have been known since the late 1960s. The first identification of
regulatory sequences in the human genome relied on recombinant DNA technology. Later, with
the advent of genomic sequencing, these sequences could be inferred from
evolutionary conservation. The lineages leading to primates and to the mouse, for
example, diverged 70–90 million years ago, so noncoding sequences that computer comparisons
show to be conserved between the two are likely to be important for functions such
as gene regulation.
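A conservation-guided comparison of this kind can be sketched as a sliding-window identity scan. The example below is a minimal illustration, not the method used by any particular study: it assumes the two sequences have already been aligned to equal length, the window size and threshold are arbitrary, the function names are hypothetical, and the sequences are made up.

```python
def window_identity(seq_a: str, seq_b: str, window: int = 10):
    """Return (start, identity) for each sliding window over two pre-aligned
    sequences of equal length. Gap characters ('-') count as mismatches."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(len(seq_a) - window + 1):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, matches / window))
    return results

def conserved_windows(seq_a, seq_b, window=10, threshold=0.9):
    """Start positions of windows whose identity meets the threshold."""
    return [start for start, identity in window_identity(seq_a, seq_b, window)
            if identity >= threshold]

# Made-up aligned sequences, for illustration only (not real genomic data).
human = "ACGTTGCATTAGCCGTAGGCTAACGT"
mouse = "ACGTTGCATTAGTTACGAACTAACGT"
print(conserved_windows(human, mouse, window=10, threshold=0.9))
```

Windows that stay above the threshold across distantly related species are candidates for functional elements; in practice the alignment step and the choice of threshold matter a great deal.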
Other genomes, for example the pufferfish genome, have been sequenced with the same
intention of aiding conservation-guided methods. However, regulatory sequences disappear
and re-evolve during evolution at a high rate.
As of 2012, efforts have shifted toward finding interactions between DNA and
regulatory proteins by the ChIP-Seq technique, or gaps where the DNA is not packaged
by histones (DNase hypersensitive sites), both of which indicate where active regulatory
sequences lie in the investigated cell type.
Repetitive DNA sequences
Repetitive DNA sequences comprise approximately 50% of the human genome. About 8% of the
human genome consists of tandem DNA arrays or tandem repeats, low complexity repeat
sequences that have multiple adjacent copies (e.g. "CAGCAGCAG..."). The tandem sequences
may be of variable lengths, from two nucleotides to tens of nucleotides. These sequences are
highly variable, even among closely related individuals, and so are used for genealogical DNA
testing and forensic DNA analysis.
Repeated sequences of fewer than ten nucleotides (e.g. the dinucleotide repeat (AC)n) are
termed microsatellite sequences. Among the microsatellite sequences, trinucleotide repeats are of
particular importance, as they sometimes occur within coding regions of genes for proteins and may
lead to genetic disorders. For example, Huntington's disease results from an expansion of the
trinucleotide repeat (CAG)n within the Huntingtin gene on human chromosome
4. Telomeres (the ends of linear chromosomes) end with a microsatellite hexanucleotide repeat
of the sequence (TTAGGG)n.
Tandem repeats of longer sequences (arrays of repeated sequences 10–60 nucleotides
long) are termed minisatellites.
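The definitions above can be turned into a small search for perfect tandem repeats. The sketch below uses a regular expression with a backreference; it assumes uppercase A/C/G/T input, finds only exact (uninterrupted) repeats, and uses the length classes given in the text. The function name and default parameters are hypothetical.

```python
import re

def find_tandem_repeats(seq: str, min_unit: int = 2, max_unit: int = 60,
                        min_copies: int = 3):
    """Find perfect tandem repeats: a unit of min_unit..max_unit nucleotides
    followed immediately by at least (min_copies - 1) further copies."""
    # The lazy quantifier makes the engine try the shortest unit first.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        # Length classes as defined in the text.
        kind = "microsatellite" if len(unit) < 10 else "minisatellite"
        hits.append((m.start(), unit, copies, kind))
    return hits

# Made-up test sequences containing a (CAG)n tract and a (TTAGGG)n tract.
print(find_tandem_repeats("GAT" + "CAG" * 6 + "TCA"))
print(find_tandem_repeats("TTAGGG" * 4))
```

Each hit reports the start position, the repeat unit, the number of copies, and the size class. Real repeat tracts often contain interruptions, so dedicated tools tolerate mismatches rather than requiring exact copies as this sketch does.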
Mobile genetic elements (transposons) and their relics
Transposable genetic elements, DNA sequences that can replicate and insert copies of
themselves at other locations within a host genome, are an abundant component in the human
genome. The most abundant transposon lineage, Alu, has about 50,000 active copies, and can be
inserted into intragenic and intergenic regions. One other lineage, LINE-1, has about 100 active
copies per genome (the number varies between people). Together with non-functional relics of
old transposons, they account for over half of total human DNA. Sometimes called "jumping
genes", transposons have played a major role in sculpting the human genome. Some of these
sequences represent endogenous retroviruses, DNA copies of viral sequences that have become
permanently integrated into the genome and are now passed on to succeeding generations.
Mobile elements within the human genome can be classified into LTR
retrotransposons (8.3% of total genome), SINEs (13.1% of total genome) including Alu
elements, LINEs (20.4% of total genome), SVAs and Class II DNA transposons (2.9% of total
genome).
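The class percentages above can be tallied directly. The sketch below simply sums the figures as quoted; SVAs are listed without a percentage and are therefore left out, and the dictionary keys are shorthand labels.

```python
# Genome shares quoted in the text (percent of total genome).
# SVAs are listed without a percentage, so they are left out of the sum.
mobile_element_share = {
    "LTR retrotransposons": 8.3,
    "SINEs (including Alu)": 13.1,
    "LINEs": 20.4,
    "Class II DNA transposons": 2.9,
}

total_share = sum(mobile_element_share.values())
largest = max(mobile_element_share, key=mobile_element_share.get)

print(f"Total of listed classes: {total_share:.1f}%")
print(f"Largest class: {largest}")
```

The listed classes add up to a little under 45% of the genome, so the "over half" figure quoted earlier presumably rests on a broader count than these four classes alone.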
