Of all the claims on our curiosity, we want most to understand ourselves. What are we? How have we come to be what we are? What lies in our future?

Many features of our lives depend on accidents of history. The time and place of our birth largely determine what language we first learn to speak, whether we are likely to be well fed and well educated, and receive adequate medical care. Many aspects of our future depend on events outside ourselves and beyond our control.

Yet, within us, there are constraints on our lives that brook relatively little argument. In some respects, we are at the mercy of our genomes. Under normal circumstances, all of our basic anatomy and physiology, and eye colour, height, intelligence, and basic personality traits, are ingrained in our DNA sequences. This is not to say that our genomes dictate our lives. Some constraints are tight, for instance eye colour, but our genetic endowment also confers on us a remarkable robustness.

This robustness also is a product of evolution. Within the last century, lifestyles have changed with a rapidity hitherto unknown (except for the instants of asteroid impacts). We can meet and survive brutal stresses. Our talents have many opportunities to nurture themselves and to develop in novel ways. These are gifts of our genomic endowment: What genes control is the response of an organism to its environment.

The human genome is only one of the many complete genome sequences known. Taken together, genome sequences from organisms distributed widely among the branches of the tree of life give us a sense, only hinted at before, of the very great unity in detail of all life on Earth. This recognition has changed our perceptions, much as the first pictures of the Earth from space engendered a unified view of our planet.

Superimposed on this underlying unity is great variety. We ask: What is special about us? What do we share with our parents and siblings and how do we differ from them? What do we share with all other human beings and what makes us different from the other members of our species? What are the sources of our differences from our closest extant non-human relatives, the chimpanzees? What do we have in common and how do we diverge from other species of primates? Of mammals? Of vertebrates? Of eukaryotes? Of all other living things?

The complete sequences of human and other genomes reveal the underlying text of this story. We are beginning to understand how our lives shape themselves under the influence of our genes, epigenetic modifications, and our surroundings and life histories.

We are also beginning to be able to intervene. Genetic engineering of microorganisms is an established technique. Genetically-modified plants and animals exist, and are the subjects of lively debate. To override the genes for hair colour is trivial. Changes in lifestyle or behaviour can, to some extent, avoid or postpone development of diseases to which we are genetically at risk. Gene therapy offers the promise of rectifying some inborn defects. Novel gene editing techniques based on the CRISPR/Cas system offer revolutionary power in reshaping individuals, whole species, and ecosystems.

Fast, inexpensive sequencing has transformed genomics. The landmark goal, the $US1000 human genome, has been achieved. It is very likely that within the lifetime of many readers, human genome sequencing will be nearly universal. Already, hundreds of thousands of human individuals have had their full genomes sequenced, and many more are on the way.

Many people have had sequences determined for individual genes. For example, mutations in BRCA1, BRCA2, and PALB2 suggest an increased likelihood of developing breast or ovarian cancer. Many people have had these regions sequenced to provide information about their personal risk. This can be more than informative: some individuals, most famously the actress Angelina Jolie, have opted for prophylactic surgery.

The spectacular progress in high-throughput sequencing, and the resulting explosive growth of the data produced (in quality, quantity, and type) have altered the entire landscape of genomics itself, and its influence has invaded surrounding fields. No area of biology has been left unscathed.
viii Preface

Study of human disease through sequencing continues to be a major effort. The possibility of applications to improve human health was the major motivation for support of the Human Genome Project, and continues to be the major elicitor of largesse for pursuing its consequences. Identification of genes responsible for particular diseases permits testing, genetic counselling, and risk assessment and avoidance. Cancer genomics, the comparison of sequences from normal and tumour cells from single patients, has become a heavy industry.

Understanding the relationships between genes and disease will allow more precise diagnosis and warnings of increased risk of disease in patients and their offspring. For many important diseases, a patient can expect more precise diagnosis and prognosis, and more precise recommendations for treatment. Design of treatment based on a patient’s genome, called pharmacogenomics, is already part of clinical practice. Genomes of other organisms also have implications for human health, especially those of pathogenic organisms that have developed, or are threatening to develop, antibiotic resistance. We study the biology of viruses and bacteria to take advantage of their vulnerabilities, and the biology of humans to ward off the consequences of our own.

This century will see a revolution in healthcare development and delivery. Walls between ‘blue sky’ research and clinical practice are tumbling down. It is possible that a reader of this book will discover a cure for a disease that would otherwise kill him or her. It is extremely likely that Szent-Györgyi’s quip, ‘Cancer supports more people than it kills’, will come true. One hopes that this will happen because the research establishment succeeds in developing therapeutic or preventative measures against tumours, rather than by merely imitating their uncontrolled growth.

Other applications of genomics include improvement of crops and domesticated animals, enhancing food production, and support of conservation efforts dedicated to preserving endangered species. Development of alternative energy sources is a challenge to both physics and biology.

Beyond these applications, genomics offers us a profound understanding of fundamental principles of biology. The history of life is a pageant for which we are beginning to delineate the choreography. On the personal level, within this history, genome exegesis will fundamentally alter our perception of ourselves: what it means to be human.

In this book, I have tried to present a balanced view of the background of the subject, the technical developments that have so greatly increased the data flow, the current state of our knowledge and understanding of the data, and applications to medicine and other fields. With power derived from knowledge goes commitment to act wisely. We have responsibilities, to ourselves, to other people, to other species, and to ecosystems ranging up to the entire biosphere.

Ethical, legal, and social issues have been a prominent component of the Human Genome Project. Most technical questions in genomics, as in other scientific subjects, have objectively correct answers. We do not know all the answers, but they are out there for us to discover. Ethical, legal, and social issues are different. Many choices are possible. Their selection is not the privilege of scientists in individual laboratories, but of society as a whole. Scientists do have a responsibility to contribute to the informed public discussion that is essential for wise decisions.

One aspect of the first edition that I liked was its concision. Unfortunately, this has had to be sacrificed to the stampeding progress of the field. A major problem encountered in writing about genomics is the need to pick and choose from the many riches of the subject. The list of subjects that cannot be left out is too long and threatens to reduce the treatment of each to superficiality. There is also a serious organizational challenge: many phenomena must be approached from several different points of view. A reader may be relieved to conclude that a topic has been beaten thoroughly into submission in one chapter, only to encounter it again, alive and kicking, in a later section, in a different context.

The speed at which the field is moving causes other problems. An author is often pleased with a draft of a section only to find the carefully described conclusions overturned in next week’s journals. Yet, there is a great pleasure in seeing nature’s secrets emerging before one’s eyes.

Another casualty of rapid progress is the frequent dismissal of attention to history and biography. We are fantastically interested in the development of the sea urchin and the fruit fly, but not in the development of molecular genomics. This is too bad: those who do not learn from the successes of history will find it harder to emulate them. Intellectual struggles that occupied entire careers leave behind only the terse conclusions, often without any appreciation of the experiments that established the facts, much less of the alternative hypotheses tested and rejected. We remember the breakthroughs, but not the frustrations that their achievement required. The force of the scientists’ personalities, and their foibles, are forgotten. Making use of other people’s results isn’t the same thing as creating new ones: ‘To imitate the Iliad is not to imitate Homer’.

Genomics is an interdisciplinary subject. The phenomena we want to explain are biological. But many fields contribute to the methods and the intellectual approaches that we bring to bear on the data. Physicists, mathematicians, computer scientists, engineers, chemists, clinical practitioners and researchers have all joined in the enterprise. This book will appeal, at least in part, to all of them. However, the central point of view remains focused on the biology.

More specifically, the focus is on human biology. In fact, on the biology of humans who are curious about other species, albeit primarily for what the other species tell us about ourselves. This choice naturally reflects the potential readership of this book. (If bacteria or fruit flies could read, genomics textbooks would look very different.)

In the new edition, extended coverage is given both to clinical applications to humans and to the applications of genome sequences to the working out of evolutionary relationships in microorganisms, plants, and animals. Non-clinical applications to human biology and history are sufficient to justify a chapter of their own. The genomics of plant and animal domestications not only sheds light on human history, as well as plant and animal history, but also emphasizes the reciprocal interactions between humans and the rest of the biosphere.

This book assumes that the reader already has some acquaintance with modern molecular biology, and builds on and develops this background, as a self-contained presentation. It is suitable as a textbook for undergraduates or starting postgraduate students.

Progress is happening so fast as to make unavoidable a feeling of frustration in aiming at a moving target. The hope is that the third edition has erected for the reader a sound framework, both intellectual and factual, that will make it possible, when encountering subsequent developments, to see where and how they fit in.

Exercises and problems at the end of each chapter, and ‘weblems’ on the Online Resource Centre, test and consolidate understanding, and provide opportunities to practise skills and explore additional subjects. Exercises are short and straightforward applications of material in the text. Answers to exercises appear on the website associated with the book. Problems likewise rely only on information contained in the text, but require lengthier answers or, in some cases, calculations. The third category, ‘weblems’, requires access to the Internet. Weblems are designed to give readers practice with the tools required for further study and research in the field. Some of these are suitable for use as practical or laboratory assignments, or even as class projects.

Key terms are highlighted in the text, and are defined in the Glossary at the end of
the book.

Plan of the third edition

Chapter 1, Introduction and Background, sets the scene, and introduces all of the major players: DNA and protein sequences and structures, genomes and proteomes, databases and information retrieval, and bioinformatics and the Internet. Subsequent chapters develop these topics in detail. Chapter 1 sketches the framework in which the pieces fit together and sets genomics in its context among the biomedical, physical, and computational sciences.

The message is clear: genomics is the hub of biology. Whereas genome sequences are determined from individuals, to appreciate life as a whole requires extending our point of view spatially, to populations and interacting populations; and temporally, to consider life as a phenomenon with a history. We can study the characteristics of life in the present, we can determine what came before, and we can, at least to some extent, extrapolate to the future. The ‘central dogma’ and the genetic code underlie the implementation of the genome, in terms of the synthesis of RNAs and proteins. Absent from Crick’s original statement of the central dogma is the crucial role of regulation in making cells stable and robust, two characteristics essential for survival.

Chapter 2, The Human Genome Project: Achievements and Applications, focuses on the human genome and the applications of human genome sequences. It reports the current state of the data, although that gives an inadequate sense of their explosive growth. Clinical applications are maturing from hype to serious promise to actual clinical practice. Putting our species in context involves, most narrowly, comparing our genomes with those of our closest relatives, including Neanderthal man and the chimpanzee, our nearest extant relative. More general applications to anthropology appear in Chapter 9.

Applications of genome sequences in personal identification are well known. These include determinations of paternity, and, less frequently, maternity. Crime-scene investigation has proved the guilt or innocence of many suspects. It is the stuff of popular entertainment.

Although much genome sequence investigation is carried out under clinical or forensic organization, sequencing has gone public with ‘pop’ applications provided by personal genealogy companies. There are even ‘dating sites’ that use the correlation between major histocompatibility complex haplotype and mate selection to offer to identify mutually attractive individuals.

Many of these applications involve ethical, legal, and social issues. Different jurisdictions have established different guidelines or regulations.

Chapter 3, Mapping, Sequencing, Annotation, and Databases, describes how genomics has emerged from classical genetics and molecular biology. The first nucleic acid sequencing, by groups led by F. Sanger and W. Gilbert, in the 1970s, was a breakthrough comparable to the discovery of the double helix of DNA. The challenges of sequencing stimulated spectacular improvements in technology. First came automation of the Sanger method. The original sequences of the human genome were accomplished by batteries of automated Sanger sequencers. Subsequently, a series of ‘new generations’ of novel approaches have achieved the landmark goal, the $US1000 human genome. Sequencing power is very widely distributed. There are major specialized institutions, such as the Beijing Genomics Institute (now in Shenzhen). BGI sequencers generate 10 terabytes of raw data per day. (You do the math: that’s over 2 human genomes per minute!) Smaller installations are common in universities, hospitals, and companies.

Where do all the publicly available data go? Chapter 3 also introduces the databanks that archive, curate, and distribute the data, and some of the information-retrieval tools that make them accessible to scientific enquiry.

Chapter 4, Evolution and Genomic Change, treats relationships. Life has been shaped by evolution, primarily acting through natural selection. T. Dobzhansky famously said, ‘Nothing in biology makes sense except in the light of evolution’.¹ Genomics allows us to trace many aspects of this process, if not always to make sense of them.

Of course, most of evolution took place in the past. We cannot observe it directly. However, we can observe and draw inferences from its products. These products are, for the most part, the genomes (and phenotypes) of extant organisms, plus sporadic data from recently extinct species. In addition, many evolutionary events have left their traces in contemporary genomes.

Chapter 4 explores some of the tools that scientists have developed to analyse sequence data for what they can reveal about evolution. Prominent among these are methods for sequence alignment and calibration of the results, and the computation of phylogenetic trees.

We are entering an era when evolution, even in natural populations, may be under the direct control of molecular biologists wielding tools based on CRISPR/Cas and gene drive. Because most of the material in Chapter 4 is decades or even centuries old, one might think that these sections would be the parts of this book least likely to need significant revision during the coming decade or so. This may be a dangerous assumption. We shall see.

Chapter 5, Genomes of Prokaryotes and Viruses, surveys the genomes of viruses, bacteria, and archaea in more detail. Taxonomy and phylogeny present problems because of extensive horizontal gene transfer. Indeed, horizontal gene transfer challenges the whole idea of a hierarchy of biological classification. In the past, many bacteria have been cloned and studied in isolation, especially those responsible for disease. However, a new field, metagenomics, deals with the entire complement of living things in an environmental sample, allowing us to address questions about interspecies interaction in the ‘real world’. Sources include ocean water, soil, and the human body.

Chapter 6 surveys Genomes of Eukaryotes. It starts with yeast, which is about as simple as a eukaryote can get. Selected plant, invertebrate, and chordate genomes illuminate the many profound common features of eukaryotic genomes; and the very great variety of structures, biochemistry, and lifestyles that are compatible with the underlying similarities. There are examples of recovery and sequencing of DNA from extinct organisms.

The goal of Chapter 7, Comparative Genomics, is to harvest some conclusions from the surveys of viral, prokaryotic, and eukaryotic genomes presented in the preceding chapters. We begin by comparing the different modes of genome organization with which living things have experimented. If the great variety of living things has arisen through evolution by natural selection (the change in allele frequency in a population through differential reproductive success among different variants), how does the variation arise? We must consider the nature and extent of the variability in the genomes within individual populations, and the mechanisms that generate this variability. It will emerge that, especially in prokaryotes, horizontal gene transfer is a very important component of the mechanism.

Chapter 8, The Impact of Genome Sequences on Health and Disease, describes clinical applications of genome sequencing. Sometimes, when particular genes are known to correlate with disease or risk of disease, specific regions in a patient’s genome are sequenced. More and more, we will see complete genome sequencing in clinical contexts.

Applications include improved diagnosis and prognosis of the causes of presenting syndromes, genetic counselling of parents with family histories of a dangerous genetic condition, and ‘pharmacogenomics’: the tailoring of treatment to the individual patient, based on DNA sequence information. ‘Gene therapy’, the replacement of defective genes with correct ones, has already had some successes and, with the development of CRISPR/Cas technology of genome editing, will undoubtedly grow in applicability. (Guidelines emerging from a 2015 Washington conference urged a moratorium on application of CRISPR to humans. In 2017, the US National Academies of Sciences and Medicine recommended allowing human germline editing under certain stringent conditions. But the pressure is too great for deterrence to survive, even in the US.)

Chapter 9, Genomics and Anthropology, develops additional applications to the study of our own species. Although clinical applications are undoubtedly the most important, genomics has important contributions to make to human palaeontology and anthropology. The ability to extract DNA from extinct species, including Neanderthals, sheds light on our early evolution. Events in our history, including migrations and domestication of crops and animals, have left their traces in DNA sequences.

With Chapter 10, Transcriptomics, we move on from the (relatively) static genome to the selectivity and dynamics of expression patterns. Following the central dogma, there are at least two stages: transcription, in which DNA makes RNA (Chapter 10); and translation, in which RNA makes protein (Chapter 11).

Measurements of cellular RNAs describe the transcription patterns of regions of the genome. These patterns vary in response to changes in the environment: the lac operon of Escherichia coli is a classic example. Expression patterns vary among different tissues, different physiological states, and different developmental stages. Thus, the human embryo and neonate synthesize embryonic and then foetal haemoglobin, switching to expression of adult haemoglobin at about 6 months post-birth.

The inventory of mRNAs in the cell, a component of the transcriptome, is naturally of interest for what it can tell us about the distribution of cellular proteins. Certainly protein synthesis does involve many RNAs, including messenger RNAs, the RNA of the ribosome, transfer RNAs, and the RNAs of the spliceosome, which removes introns from pre-messenger RNAs. However, other RNAs have a variety of roles, including catalytic activity. Catalytically active RNAs are called ribozymes. (In both the ribosome and the spliceosome, the catalytic activity is in the RNA, not the protein component.) Other RNAs, such as microRNAs, short interfering RNAs, and silencing RNAs, control gene expression by interacting with messenger RNAs. The RNA transcripts of CRISPR sequences, bound to CRISPR-associated nucleases, identify viral invaders. There is good reason to believe that many other RNA functions remain to be discovered.

Chapter 11, Proteomics, describes briefly the principles of protein structure and the high-throughput data streams that provide information about sets of proteins in cells, including methods for predicting protein structure from amino-acid sequence.² Proteomics is an essential complement to genomics. (A colleague once entitled a keynote lecture: ‘Genes are from Venus, proteins are from Mars.’) In fact, one of the major messages of Chapters 10 and 11 is how much the genome doesn’t tell us about cellular proteins. Nevertheless, the interactions and relationships between the genome and proteome are intimate, both during cellular activity and in the longer term in evolution.

Chapters 12 and 13 introduce systems biology, a relatively new field that presents a description of biological organization based on networks. Systems biology deals with the connectivity of biological networks, and analyses and models the mechanisms that control traffic patterns through them.

A reasonable definition of life would include the criterion that a living thing executes controlled manipulations of matter, energy, and information. Chapters 12 and 13 treat the cellular networks that carry out and regulate these tasks.

Chapter 12, Metabolomics, deals with the metabolic networks that manipulate matter and energy. Cells require a source of energy, from nutrients or from light, and use it to drive metabolites through series of metabolic pathways. The flows through the paths in this network must be kept under control. To achieve this, another network, a logical one, keeps cellular activities organized. Cells contain parallel sets of networks based on physical and logical interactions among molecules. Each network also has static and dynamic aspects.

Components present in a cell at any instant are subject to direct controls, for instance feedback inhibition by an end product to shut down a metabolic pathway. But transcription is subject to very extensive oversight: turning genes on and off. Chapter 13, Systems Biology, treats the logical component of cellular networks that controls transcription and metabolism.

The genius of classical biochemistry was to take cells apart and show that the components could function in isolation. Now our job is to put things back together.

¹ I would add thermodynamics to the list of things except in the light of which nothing in biology makes sense.

² For a more thorough treatment of proteins see Liljas, A., Liljas, L., Piskur, J., Lindblom, G., Nissen, P., & Kjeldgaard, M. (2009). Textbook of Structural Biology. World Scientific, Singapore; or Lesk, A. (2016). Introduction to Protein Architecture: Structure, Function, and Genomics, 3rd ed. Oxford University Press, Oxford.
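The Chapter 3 summary invites the reader to ‘do the math’ on the figure of 10 terabytes of raw data per day. The following is a minimal sketch of that back-of-envelope check, not a description of how any sequencing centre accounts for its output. Everything beyond the 10 terabytes per day is an assumption made here for illustration: decimal terabytes, a haploid human genome of roughly 3.1 billion bases, and one byte of raw data per base. The function name and its coverage parameter are likewise hypothetical.

```python
# Back-of-envelope check of the throughput remark in the Chapter 3 summary.
# Assumptions (not from the text): decimal terabytes, a haploid human genome
# of about 3.1 billion bases, and one byte of raw data per base.

BYTES_PER_TERABYTE = 10**12
MINUTES_PER_DAY = 24 * 60
HUMAN_GENOME_BASES = 3.1e9   # assumed haploid genome size
BYTES_PER_BASE = 1.0         # assumed; real raw formats store more per base


def genomes_per_minute(terabytes_per_day, coverage=1.0):
    """Return genome-equivalents of raw data produced per minute.

    coverage is the number of times each base is read on average;
    1.0 means a single pass over the genome.
    """
    bytes_per_minute = terabytes_per_day * BYTES_PER_TERABYTE / MINUTES_PER_DAY
    bytes_per_genome = HUMAN_GENOME_BASES * BYTES_PER_BASE * coverage
    return bytes_per_minute / bytes_per_genome


if __name__ == "__main__":
    single_pass = genomes_per_minute(10)
    deep = genomes_per_minute(10, coverage=30)
    print(f"{single_pass:.2f} genome-equivalents per minute at 1x coverage")
    print(f"{deep:.3f} genome-equivalents per minute at 30x coverage")
```

Under these assumptions the single-pass figure comes out a little above two genome-equivalents per minute, consistent with the remark in the text. The 30x line is included only to show how strongly the answer depends on the assumed depth of coverage and bytes per base, neither of which the text specifies.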

Where else might the interested reader turn? This book is designed as a companion volume to three others: Introduction to Protein Architecture: The Structural Biology of Proteins; Introduction to Protein Science: Architecture, Function, and Genomics; and Introduction to Bioinformatics (all published by Oxford University Press). Of course, there are many fine books and articles by many authors, some of which are listed as recommended reading at the ends of the chapters. The goal is that each reader will come to recognize his or her own interests, and be equipped to follow them up.

For the recommendations for additional reading at the ends of chapters, I have limited myself to books, and to articles in the scientific literature. However, if one wants an introduction to a topic, there are many quite good lectures on the Internet, which might well be useful. Or, if one wants up-to-date details about a topic, there are blogs. Indeed, blog entries are often more useful than a full scientific paper. With a full paper there is the problem of extracting a take-home message from a mass of detail, whereas the blog entries often contain one or two concise and ‘to-the-point’ paragraphs.

Nevertheless, I am reluctant to include these in the recommendations. This is partly out of a commitment to the refereed scientific literature. But also: (1) there is no control over whether a recommended website will remain available. The scientific literature at least has a permanent presence. (2) Blogs can contain a mixture of some contributions that are very useful and others that are not; and it may be difficult to distinguish. But it is undeniable that the Internet contains very many useful sources of information outside of the printed scientific literature.

Many applications of genomics to healthcare are discussed in the book. However, nothing here should be taken as offering medical advice to anyone about any condition.


Results and research in genomics make use of the Internet, both for storage and distribution of data, and for methods of analysis. Readers will need to become familiar with websites in genomics, and to develop skills in using them. Many useful sites are mentioned in the book. The author’s Introduction to Bioinformatics offers a pedagogical approach to computational aspects of genomics. However, clearly, the place to learn about the Internet is on the Internet itself.

To this end, an Online Resource Centre at www. accompanies this book. This contains web-based problems (‘weblems’) and material from the book: figures and ‘movies’ of the pictures of structures, answers to exercises, hints for solving problems, and guides to useful websites.

I thank S. Ades, G.F. Anderson, E. Axelsson, M.M. Babu, S.L. Baldauf, P. Berman, B. de Bono, D.A. Bryant, C. Cirelli, A. Cornish-Bowden, R. Diskin, R.B. Eckhardt, N.V. Fedoroff, J.G. Ferry, R. Flegg, J.R. Fresco, S. Girirajan, N. Goldman, J. Gough, D. Grove, D.J. Halitsky, R. Hardison, E. Holmes, H. Klein, E. Koc, A.S. Konagurthu, T. Kouzarides, H.A. Lawson, E.L. Lesk, M.E. Lesk, V.E. Lesk, V.I. Lesk, D.A. Lomas, B. Luisi, J.A. Lumadue, P. Maas, K. Makova, W.B. Miller, C. Mitchell, J. Moult, E. Nacheva, A. Nekrutenko, G. Otto, A. Pastore, D. Perry, C. Praul, K. Reed, G.D. Rose, J. Rossjohn, S. Schuster, R.L. Serrano, B. Shapiro, J. Shi, J. Tamames, A. Tramontano, A.A. Travers, A. Valencia, G. Vriend, L. Waits, R. Wayne, J.C. Whisstock, A.S. Wilkins, V.E. Womble, and E.B. Ziff for helpful advice.

I thank the staff of Oxford University Press for their essential and superb contributions to this book.

The author and publisher would also like to thank those who gave their time and expertise to review the book throughout the writing process. Your help was invaluable:

• Richard Bingham, Department of Biological Sciences, University of Huddersfield, UK
• Joanna Kelley, School of Biological Sciences, Washington State University, USA
• Hans Michael Kohn, Institute of Biosciences and Bioengineering, Rice University, USA
• Emma Laing, Department of Microbial and Cellular Sciences, University of Surrey, UK
• Frances Pearl, School of Life Sciences, University of Sussex, UK
• Michael Shiaris, College of Science and Mathematics, University of Massachusetts Boston, USA
• Adrian Slater, Faculty of Health and Life Sciences, De Montfort University, UK
• David Studholme, College of Life and Environmental Sciences, University of Exeter, UK

Plan of the third edition
Recommended reading
Introduction to genomics on the web
List of abbreviations

1 Introduction and Background

Learning goals
Genomics: the hub of biology
Phenotype = genotype + environment + life history + epigenetics
Varieties of genome organization
Chromosomes, organelles, and plasmids
Genes
The scope and applications of genome sequencing projects
Variations in genome sequences within species
Mutations and disease
Single-nucleotide polymorphisms
Haplotypes
A clinically important haplotype: the major histocompatibility complex
Populations
Species
The biosphere
Extinctions
The future?
Genome projects and our current library of genome information
High-throughput sequencing
De novo sequencing
Resequencing
Exome sequencing
What’s in a genome?
Some regions of the genome encode non-protein-coding RNA molecules
Some regions of the genome contain pseudogenes
Other regions contain binding sites for ligands responsible for regulation of transcription
Repetitive elements of unknown function account for surprisingly large fractions of our genomes
Dynamic components of genomes
Genomics and developmental biology
Genes and minds: neurogenomics
Genetics of behaviour
xvi Contents

Proteomics
Protein evolution: divergence of sequences and structures within and among species
Mechanisms of protein evolution
Organization and regulation
Some mechanisms of regulation act at the level of transcription
Some mechanisms of regulation act at the level of translation
Some regulatory mechanisms affect protein activity
On the web: genome browsers
Genomics and computing
Archiving and analysis of genome sequences and related data
Databanks in molecular biology
Programming
Looking forward
Recommended reading
Exercises and problems

2 The Human Genome Project: Achievements and Applications 62

Learning goals 62
‘… the end of the beginning’ 63
Human genome sequencing 66
What makes us human? 67
Comparative genomics 68
Genomics and language 68
The human genome and medicine 71
Prevention of disease 71
Detection and precise diagnosis 72
Genetic counselling—carrier status 72
Discovery and implementation of effective treatment 72
Tunable healthcare delivery: pharmacogenomics 74
‘Pop’ applications of genome sequencing 76
Genomics in personal identification 76
DNA ‘fingerprinting’ 77
Personal identification by amplification of specific regions has
superseded the RFLP approach 79
Mitochondrial DNA 79
Analysis of non-human DNA sequences 82
Parentage testing 82
Inference of physical features, and even family name 83
Ethical, legal, and social issues 85
Databases containing human DNA sequence information 85
Use of DNA sequencing in research on human subjects 88
Looking forward 88
Recommended reading 88
Exercises and problems 90

3 Mapping, Sequencing, Annotation, and Databases 97

Learning goals 97
Classical genetics as background 98
What is a gene? 99
Maps and tour guides 99
Genetic maps 100
Linkage 101
Linkage disequilibrium 102
Chromosome banding pattern maps 103
Restriction maps 106
Discovery of the structure of DNA 107
DNA sequencing 110
Frederick Sanger and the development of DNA sequencing 111
DNA sequencing by termination of chain replication 111
The Maxam–Gilbert chemical cleavage method 114
Automation of DNA sequencing 114
Organizing a large-scale sequencing project 116
Bring on the clones: hierarchical—or ‘BAC-to-BAC’—genome
sequencing 116
Whole-genome shotgun sequencing 116
Next-generation sequencing 118
Roche 454 Life Sciences 120
Illumina 121
Ion Torrent/Personal Genome Machine (PGM) 124
PacBio 124
Oxford Nanopore 124
10X Genomics 125
The Bionano Irys system 125
Life in the fast lane 126
How much sequencing power is there in the world? 126
Databanks in molecular biology 128
Nucleic acid sequence databases 130
Protein sequence databases 130
Databases of genetic diseases—OMIM and OMIA 130
Databases of structures 131
Specialized or ‘boutique’ databases 132
Expression and proteomics databases 132
Databases of metabolic pathways 133
Bibliographic databases 133
Surveys of molecular biology databases and servers 134
Computer programming in genomics 134
Programming languages 135
How to compute effectively 136
Looking forward 137
Recommended reading 137
Exercises and problems 138

4 Evolution and Genomic Change 143

Learning goals 143

Evolution is exploration 144
Biological systematics 146
Biological nomenclature 146
Measurement of biological similarities and differences 148
Homologues and families 150
Pattern matching—the basic tool of bioinformatics 151
Sequence alignment 151
Defining the optimum alignment 153
Scoring schemes 155
Varieties and extensions 157
Approximate methods for quick screening of databases 158
Pattern matching in three-dimensional structures 160
Evolution of protein sequences, structures, and functions 161
Evolution of protein structure and function 164
Phylogeny 165
Calculation of phylogenetic trees 169
Short-circuiting evolution: genetic engineering 173
Looking forward 175
Recommended reading 176
Exercises and problems 177

5 Genomes of Prokaryotes and Viruses 179

Learning goals 179

Evolution and phylogenetic relationships in prokaryotes 180
Major types of prokaryotes 180
Do we know the root of the tree of life? 182
Genome organization in prokaryotes 183
Replication and transcription 184
Gene transfer 184
Archaea 184
The genome of Methanococcus jannaschii 186
Life at extreme temperatures 188
Comparative genomics of hyperthermophilic archaea:
Thermococcus kodakarensis and pyrococci 189
Bacteria 194
Genomes of pathogenic bacteria 195
Genomics and the development of vaccines 197
Viruses 198
Nucleocytoplasmic large DNA viruses (or giant viruses) 199
Viral genomes 199
Recombinant viruses 199
Viruses and evolution 201
Influenza: a past and current threat 201
’Ome, ’ome, on the range: metagenomics, the genomes in a coherent environmental sample 204
Marine cyanobacteria—an in-depth study 205
Looking forward 208
Recommended reading 208
Exercises and problems 210

6 Genomes of Eukaryotes 211

Learning goals 211

The origin and evolution of eukaryotes 212
Evolution and phylogenetic relationships in eukaryotes 212
The yeast genome 212
The evolution of plants 214
The Arabidopsis thaliana genome 214
Genomes of animals 216
The genome of the sea squirt (Ciona intestinalis) 216
The genome of the pufferfish (Tetraodon nigroviridis) 217
The genome of the chicken (Gallus gallus domesticus) 220
The genome of the platypus (Ornithorhynchus anatinus) 221
The genome of the dog 223
Palaeosequencing—ancient DNA 225
Recovery of DNA from ancient samples 225
DNA from extinct birds 226
High-throughput sequencing of mammoth DNA 227
The phylogeny of elephants 230
Looking forward 230
Recommended reading 231
Exercises and problems 231

7 Comparative Genomics 233

Learning goals 233

Introduction 234
Unity and diversity of life 234
Taxonomy based on sequences 235
Sizes and organization of genomes 238
Genome sizes 238
Genome organization in eukaryotes 241
Photosynthetic sea slugs: endosymbiosis of chloroplasts 242
How genomes differ 243
Variation at the level of individual nucleotides 243
Duplications 243
Duplication of genes 244
Family expansion: G protein-coupled receptors 247
Comparisons at the chromosome level: synteny 252
What makes us human? 252
Comparative genomics 252
Genomes of chimpanzees and humans 252
Genomes of mice and rats 254
Model organisms for study of human diseases 255
The genome of Caenorhabditis elegans 256
The genome of Drosophila melanogaster 256
Homologous genes in humans, worms, and flies 257
Looking forward 258
Recommended reading 260
Exercises and problems 261

8 The Impact of Genome Sequences on Human Health and Disease 264

Learning goals 264

Introduction 265
Some diseases are associated with mutations in specific genes 265
Haemoglobinopathies—molecular diseases caused by abnormal
haemoglobins 265
Phenylketonuria 266
Alzheimer’s disease 268
Identification of genes associated with inherited diseases 268
Genome-wide association studies 271
GWAS of sickle-cell disease 274
GWAS of type 2 diabetes 275
GWAS of schizophrenia 276
The human microbiome 278
Treatment of abnormal microbiome composition 281
Cancer genomics 281
SNPs and cancer 284
Whole-genome sequencing association studies of breast cancer 285
Copy-number alterations in cancer 286
Chromosomal aberrations 288
Epigenetics and cancer 289
microRNAs and cancer 290
Immunotherapy for cancer 291
Looking forward 292
Recommended reading 292
Exercises and problems 294

9 Genomics and Anthropology: Human Evolution, Migration, and Domestication of Plants and Animals 295

Learning goals 295

Ancestry of Homo sapiens 296
The Neanderthal genome 297
The Denisovan genome 298
What do these data tell us? 299
What have Neanderthals and Denisovans done for us lately? 300
Ancient populations and migrations 300
Western civilization? ‘I think it would be a good idea’ 308
Domestication of the dog 310
Domestication of the horse 311
Domestication of crops 313
Maize (Zea mays) 315
Rice (Oryza sativa) 317
Control of flowering time 318
History of rice domestication 320
Chocolate (Theobroma cacao) 321
The Theobroma cacao genome 322
Looking forward 324
Recommended reading 326
Exercises and problems 327

10 Transcriptomics 329

Learning goals 329

Introduction 330
Microarrays 331
Microarray data are semiquantitative 332
Applications of DNA microarrays 332
Analysis of microarray data 333
RNAseq 336
RNAseq versus microarrays 337
Expression patterns in different physiological states 339
Sleep in rats and fruit flies 340
Expression pattern changes in development 342
Variation of expression patterns during the life cycle of
Drosophila melanogaster 342
Flower formation in roses 344
Expression patterns in learning and memory: long-term potentiation 347
Conserved clusters of co-expressing genes 350
Evolutionary changes in expression patterns 351
Applications of transcriptomics in medicine 353
Development of antibiotic resistance in bacteria 353
Childhood leukaemias 357
The Encyclopedia of DNA Elements (ENCODE) 359
Looking forward 359
Recommended reading 360
Exercises and problems 361

11 Proteomics 363

Learning goals 363

Introduction 364
Protein nature and types of proteins 364
Protein structure 365
The chemical structure of proteins 365
Conformation of the polypeptide chain 367
Protein folding patterns 367
Domains 370
Disorder in proteins 370
Post-translational modifications 372
Why is there a common genetic code with 20 canonical amino acids? 374
Separation and analysis of proteins 375
Polyacrylamide gel electrophoresis (PAGE) 375
Two-dimensional PAGE 375
Mass spectrometry 376
Identification of components of a complex mixture 377
Protein sequencing by mass spectrometry 378
Quantitative analysis of relative abundance 378
Measuring deuterium exchange in proteins 379
Experimental methods of protein structure determination 381
X-ray crystallography of proteins 381
Interpretation of the electron density: model building and improvement 383
How accurate are the structures? 384
NMR spectroscopy in structural biology 386
Protein structure determination by NMR 386
Low-temperature electron microscopy (cryoEM) 387
Classifications of protein structures 388
SCOP 389
SCOP2 391
Protein complexes and aggregates 391
Protein aggregation diseases 392
Properties of protein–protein complexes 393
Stoichiometry—what is the composition of the complex? 393
Affinity—how stable is the complex? 394
How are complexes organized in three dimensions? 395
Multisubunit proteins 396
Many proteins change conformation as part of the mechanism
of their function 396
Conformational change during enzymatic catalysis 397
Motor proteins 399
Allosteric regulation of protein function 401
Allosteric changes in haemoglobin 402
Conformational states of serine protease inhibitors (serpins) 404
Protein structure prediction and modelling 405
Homology modelling 406
Secondary structure prediction 408
Prediction of novel folds: ROSETTA 408
Available protocols for protein structure prediction 409
Structural genomics 411
Directed evolution and protein design 412
Directed evolution of subtilisin E 412
Looking forward 414
Recommended reading 414
Exercises and problems 415

12 Metabolomics 419

Learning goals 419

Introduction 419
Classification and assignment of protein function 420
The Enzyme Commission 420
The Gene Ontology™ Consortium protein function classification 421
Comparison of EC and GO classifications 423
Metabolic networks 424
Databases of metabolic pathways 425
EcoCyc 425
The Kyoto Encyclopedia of Genes and Genomes 427
The Human Metabolome Database 428
Evolution and phylogeny of metabolic pathways 429
Alignment and comparison of metabolic pathways 431
Comparing linear metabolic pathways 432
Reconstruction of metabolic networks 432
Comparing non-linear metabolic pathways: the pentose phosphate
pathway and the Calvin–Benson cycle 434
Metabolomics in ecology 435
Dynamic modelling of metabolic pathways 437
Looking forward 439
Recommended reading 439
Exercises and problems 441

13 Systems Biology 443

Learning goals 443

Introduction 444
Regulatory mechanisms 444
Two parallel networks: physical and logical 445
Networks and graphs 446
Robustness and redundancy 447
Connectivity in networks 448
Dynamics, stability, and robustness 449
Protein complexes and aggregates 451
Protein interaction networks 451
Protein–DNA interactions 456
DNA–protein complexes 456
Structural themes in protein–DNA binding and sequence recognition 457
Bacteriophage T7 DNA polymerase 458
Some protein–DNA complexes that regulate gene transcription 459
Regulatory networks 463
Structures of regulatory networks 464
Structural biology of regulatory networks 465
Gene regulation 466
The transcriptional regulatory network of Escherichia coli 466
The genetic switch of bacteriophage λ 469
Regulation of the lactose operon in Escherichia coli 472
The genetic regulatory network of Saccharomyces cerevisiae 474
Adaptability of the yeast regulatory network 476
Looking forward 479
Recommended reading 479
Exercises and problems 480

Epilogue 483
Glossary 484
Index 493
