10.1007@978 981 13 1942 6
Ju Han Kim
Genome Data Analysis
Translated by Younghee Lee
Learning Materials in Biosciences
Learning Materials in Biosciences textbooks compactly and concisely discuss a specific biological, bio-
medical, biochemical, bioengineering or cell biologic topic. The textbooks in this series are based on lec-
tures for upper-level undergraduates, master’s and graduate students, presented and written by
authoritative figures in the field at leading universities around the globe.
The titles are organized to guide the reader to a deeper understanding of the concepts covered.
Each textbook provides readers with fundamental insights into the subject and prepares them to inde-
pendently pursue further thinking and research on the topic. Colored figures, step-by-step protocols and
take-home messages offer an accessible approach to learning and understanding.
In addition to being designed to benefit students, Learning Materials textbooks represent a valuable
tool for lecturers and teachers, helping them to prepare their own respective coursework.
Genome Data Analysis
Ju Han Kim
Division of Biomedical Informatics
Seoul National University College of Medicine
Seoul, South Korea
Previous publisher returns the publishing rights for the English language edition of the
work Genome Data Analysis for publication in all forms and media of expression to the
author, Dr. Ju Han Kim.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
The future is already here – it’s just not very evenly distributed.
– William Gibson
We are now in the era of “constant data creation.” Cameras are used more than human
eyes and microphones more than human ears; these biosensors obtain data constantly at
no cost. No longer is the obtaining of data a bottleneck for medical and life science
research. Instead, the bottleneck has rapidly moved to data analysis and the bioinformat-
ics field. The person who controls this bottleneck has power over the research. In the
time when obtaining data was a bottleneck, the techniques applied to the cells and tissues
of an experimental animal were the most important methodology. In the data era, it is
bioinformatics approaches, which control the whole “life cycle” of bio data, that will
become the most important methodology.
To the eye of a researcher in life science or clinical medicine, this data-central paradigm
is a new and unfamiliar thing. However, data science has already been firmly established
in many fields. The data-centric paradigm is a century-old phenomenon in the physical
sciences, including particle physics, astronomy, and earth science, and also in most engi-
neering fields and various industry fields. With the emergence of Facebook and Twitter,
even fields in the humanities have joined the era of “constant data creation.” With the
introduction of the mobile smartphone environment to assist data collection, the evolu-
tionary direction of biological science has become clearer, driven by this continuous
creation of individual bio data.
This Genome Data Analysis textbook will cover not just using bio data, but also viewing
it through the perspective of “data science,” which treats the data itself as an object with
a life cycle. Bioinformatics is a field that supervises the four phases of data life, “Birth,
Aging, Sickness, and Death,” which are the whole process of creation, interaction, devel-
opment, and extinction of bio data. This data is a projection into virtual space of phe-
nomena that occur in the natural world. From this point of view, it is wrong to focus on
“analysis” alone. However, as this book is limited to the practices for the beginner, it
could not cover in detail the major parts of “flow” and “control” of bio data.
This Genome Data Analysis practice book is not meant for experts from the bioinformatics
field. This book is for a life scientist, medical scientist, statistician, data processing researcher,
engineer, or other beginner in bioinformatics who finds it necessary to study bioinformatics
but also difficult to approach the field. However, this book can be a simple guideline for experts
unfamiliar with the new, developing subfield of genomic analysis within bioinformatics.
Genome Data Analysis was begun in 2011 and is based on the practice data bundle from
the 8-week ‘Genome Data Analysis Workshop’ conducted twice per year at the Division
of Biomedical Informatics and Biomedical Informatics Training and Education Center
of Seoul National University. Over time, I have noticed that the contents of the first ver-
sion of this book, published in 2012, became old and outdated. The bioinformatics field
has really developed fast. As I was organizing the table of contents for this second ver-
sion, I noticed that even after only 2 years, it has become impossible to publish every-
thing in just one book. This second version is being published in two books, with the
drastic development of next-generation sequencing techniques to be dealt with in the
second part, which is planned to be published later.
Without the Seoul National University Biomedical Informatics researchers, this book
could not be published. Furthermore, this book was completed with the support from
2000 participants of the GDA Workshop who gave unstinting encouragement and
review. I would like to give special thanks to Sunhee Shin, who was the person in charge,
and the publisher epublic (bummoon-sa) even though the manuscript was not prepared
well from the experimental composition. Also, I want to give special thanks to Jihoon
Kim, Je-Gun Joung, and Ji Yeon Park, the assistant professors who directly led practices;
and to Younjung Bae, Hyerim Jung, Eunmi Byun, Arirang Jang, and Hyerim Yoo from
the executive office who helped everything to run smoothly. The true authors of this
practice book Genome Data Analysis are Young-Ji Na, Su-Yeon Lee, Hee-Joon Chung, Yu
Rang Park, Dokyoon Kim, Heewon Seo, Sunmin Yun, Chan Hee Park, Jun Hee Youn,
Soo Youn Lee, Yonglae Cho, Hyun Wook Han, Kye Hwa Lee, Ki Tae Kim, and Jae Hyun
Lim, who organized the practice data; and Su Youn Baik, Hyehyeon Kim, Young Jo Yoon,
Sue Hyun Lee, Rocky, Younggyun Lim, Woo Seung Lee, Yoomi Park, and Brian Y Ryu,
who helped with the proceedings of the actual practices. It is embarrassing for me to
become the representative for all these authors. Finally, I really appreciate the efforts of
Woo Seung Lee, who was in charge of proofreading and final organization of this book.
This practice book Genome Data Analysis will continuously evolve and be updated with
new knowledge in accordance with the ongoing GDA Workshop, held every summer
and winter. I would like to disclose that the authors are responsible for any errors or
imperfections in this book.
Ju Han Kim
Seoul, South Korea
Translated by
Younghee Lee
Department of Biomedical Informatics
University of Utah School of Medicine
Salt Lake City, UT, USA
younghee.lee@utah.edu
(All Chapters and Preface)

Soo Jin Kim
Division of Hematology, School of Medicine
University of Utah
Salt Lake City, UT, USA
soo.kim@hsc.utah.edu
(Chapters 3, 6, 7 and 12)

Kanghoon Choi
Department of Biomedical Informatics
University of Utah
Salt Lake City, UT, USA
kyle.choi89@gmail.com
(Chapters 5, 10, 11, 17, 18, 20 and Preface)

Jane Ryu
Johns Hopkins University,
Nelson Laboratories LLC
Baltimore, MD, USA
jryu13@gmail.com
(Chapters 4, 8, 9, 14 and 15)

Seonggyun Han
Department of Biomedical Informatics
University of Utah School of Medicine
Salt Lake City, UT, USA
seonggyun.han@utah.edu
(Chapter 16)

Michael S. Sinclair
Department of Biomedical Informatics
University of Utah School of Medicine
Salt Lake City, UT, USA
sincla@gmail.com
(Chapter 1)

Dongwook Kim
Department of Biomedical Informatics
University of Utah School of Medicine
Salt Lake City, UT, USA
dwkim106@gmail.com
(Chapters 13 and 21)
Contents
Bioinformatics for Life
1.1 Introduction
Bibliography
1.1 Introduction
Bioinformatics has become the core research field in the post-genome era since the
completion of the Human Genome Project.1 The success of the Project not only revealed
the sequence of the 3 billion DNA base pairs that make up the human genome, but also
accelerated the adoption of new paradigms, such as personalized medicine based on
human genome variation and predictive medicine based on gene expression analysis. It
has enabled new paradigms for clinical applications, such as in the emerging fields of
biomedical informatics and genomic medicine. The Human Genome Project also precipi-
tated the massive parallelization and miniaturization of the technology used in molecular
genomics analysis, the innovative development of microarrays, “lab-on-a-chip” devices,
and next-generation sequencing (NGS), and has pioneered high-throughput biology, sys-
tems medicine, and a new era of personal genome informatics.
In its formative stage, bioinformatics consisted mainly of ancillary methods that assisted
life-science research in areas requiring computing, such as the construction of databases to
store and organize various types of biological data, data structure modeling, and analysis of
the three-dimensional structure of proteins, and it was limited to particular areas. However,
bioinformatics has become important in life science, has gone beyond the more simplistic
categories of statistical hypothesis testing techniques used in experimental biology, and
has taken a prominent place among the modern biomedical disciplines as the one which
pursues the most progressive approach and understanding. This is attributable to the basic
fact that biological phenomena are truly informatics phenomena, expressed in terms of
interactions in nature between material objects, and between matter and energy. The rapid
growth of bioinformatics shares the same context and current state of progress as the rest
of the life sciences today, which can be understood by looking, briefly, at the historical
development of biology itself.
1 https://www.genome.gov/10001772/all-about-the%2D%2Dhuman-genome-project-hgp/
1.2 · Life and Information
Originally, biology concerned itself with empirical observations and taxonomic
classifications, but began to reexamine the processes of life as chemical interactions taking
place in the material world and subject to the same laws as inorganic reactions. With
this understanding, biology developed and incorporated the findings of organic chemistry
and biochemistry, and truly came into its own as an experimental science. The next step
was the growth of molecular and cellular biology, which arose from the special attention
given to the properties and roles of the large molecules characteristic of living organisms.
Thanks to continuous progress in these fields, life sciences have experienced repeated
advancements to this day.
This book, as its title suggests, focuses its perspective especially on the various types
of genomic data. It is therefore the fruit of an effort to reflect the evolution of research
paradigms in life science, from the study of nature and classification of living things based
on observations recorded over long periods of time, to experimental biology and the use
of simulations, and finally to the use of data science today.
Before genes, proteins were for a long time singled out as the essential component of
life. In 1883, Curtius proposed the first hypothesis that proteins have a one-dimensional
sequence structure resembling ordinary text. A slightly more advanced hypothesis of
protein structure based on peptide bonds was put forward by Hofmeister (1902) and
Fischer (1902). Miescher, who first discovered DNA in 1869, predicted that the genetic
material would consist of sequence information in the form of chemical symbols,
like simple text. Such sequence data started to appear with the publication of the first
nucleotide sequence of a segment of tRNA in 1961, after which a variety of DNA base
sequences began to be presented. At least theoretically, the flow of all genetic informa-
tion from an organism’s DNA, culminating in the manifestation of an individual phe-
notype, is encoded in the pure sequence information of the nucleotide bases that make
up its DNA. This sequence information includes, among other things, the secondary
and tertiary structures of proteins, promoters, enhancers, and other gene regulatory
elements, restriction enzyme cleavage sites, splicing elements, non-coding RNA, accu-
mulated mutations, and even the history of the entire process of that genome’s evolu-
tion. The extraordinary development of modern molecular biology, in the process of
uncovering the mysteries of life, has revealed that biological phenomena are, to a sur-
prising extent, “informatic” in nature. The field that investigates the flow of information
and interacting processes of expression that govern such biological phenomena can be
defined as bioinformatics.
The usefulness of informatics research in biology has been confirmed by its successful
application to the processing of output from DNA sequencers, the latter being the great-
est contributor to the early completion of the Human Genome Project. But beyond its
benefits for the more mechanical aspects of data processing, a classic example of scientific
insight gained through an informatics approach to biology can be seen in the study of
sequence homology. Zuckerkandl and Pauling [12] elucidated the relationship between
the degree of divergence in the peptide sequences of similar proteins, and their stage of
evolution, and so established molecular evolution as an entirely new area of research.
Because sequence variation arises according to fixed rules, it can be quantified, and the
history of evolution can be traced through this molecular timepiece that leaves its record
inside genes. More recently, thanks to the availability of faster analytical tools, sequence
homology analysis is being widely employed in metagenomics.
A common problem in sequence homology studies is the comparison of two character
strings, which is a classical problem in discrete mathematics. Since DNA sequences are
essentially character strings, the problem is directly applicable to the comparison of two
DNA sequences. As a simple example, let’s say we have two strings, “BLUE” and “BILE.”
Let’s further suppose that we can modify one of the strings one step at a time by inserting
a character, deleting a character, or substituting one character for another, which happen
to be the types of single base mutations that affect DNA. The problem is to determine
how many insertions, deletions, and/or substitutions it would take to change “BLUE” into
“BILE”, or vice versa. This is the “edit distance” between strings, and if the strings are DNA
sequences, it represents the molecular evolutionary distance between them. So, we have:
“BLUE” → {series of insertions, deletions, and/or substitutions} → “BILE”
However, we could, for example, delete the final “E” from “BLUE”, insert it back, delete
it again, insert it again, and so on, in endless repetition, with the string “BLUE” remaining
in its original state. Thus there are an infinite number of possible routes of transformation
from “BLUE” to “BILE” using the types of edits defined above. The problem is then to find
the shortest route of transformation from among all the infinitely-many possible ones. The
reason for this inference, at least in the case of DNA, is that the shortest route is probably
closest to what actually happened in nature.2 In simplest terms, the problem is to find
the minimum edit distance (also known as the Levenshtein distance) for transforming
sequence S1 into sequence S2 by means of single-character insertions, deletions, and/or
substitutions, and this distance is an approximation of the evolutionary distance between
the sequences (Fig. 1.1).
In our example problem, there are two equivalent solutions that each have the min-
imum edit distance of 2 edits. The evolutionary distance between “BLUE” and “BILE”
would, therefore, be estimated as 2, as follows:
One insertion and one deletion (edit distance = 2)
B I L – E
B – L U E
Two substitutions: I for L, L for U (edit distance = 2)
BILE
BLUE
Of course, in reality, there aren’t only single base mutations, but longer sequences
can also be inserted or deleted at the same time, and the relative probabilities of occur-
rence of different types of insertions, deletions, and substitutions can also vary, so we
would have to make a somewhat more elaborate model to take them into account. This
problem, for which it is very difficult to obtain a general mathematical solution, is easily
solved with dynamic programming for pathfinding (Fig. 1.1), a classic area of research
in artificial intelligence. Figure 1.2 is a matrix of the edit distance for all possible edit
2 The basis for inference from simplicity can be found in the so-called “Occam’s razor” principle, also
known as lex parsimoniae in Latin (the law of parsimony, economy or succinctness). This principle
is founded on the insight that out of many equally compelling solutions to a given problem, the
simplest one has a high probability of being the correct one.
Fig. 1.1 Dynamic programming for sequence homology analysis (A-star search)

D(i, j)      B  I  L  E
          0  1  2  3  4
   B      1  0  1  2  3
   L      2  1  1  1  2
   U      3  2  2  2  2
   E      4  3  3  3  2

Optimal alignments (edit distance 2):
B I L – E
B – L U E

B I L E
B L U E
Fig. 1.2 Calculation of the minimum edit distance between two sequences using dynamic
programming in order to compare sequence homology. In each cell, an upward (deletion),
leftward (insertion), or up-and-left diagonal (concurrence or substitution) edit can be applied. Cells in
which the row and column indicate the same character for both sequences are marked in
blue, and the two optimal paths are marked by two sets of bolded arrows of different shape. The small
arrows indicate the possible paths to the coincident cells (blue) from the current cell. For example, in
cell B-B, because the two sequences agree at that position, the edit distance is 0, and the arrow pointing
up and to the left is the optimal path. Reaching the same cell by going up and then left, or left and then
up, costs 2, which is bigger by 2 than the optimal path distance of 0. By simply repeating this process
for every cell, the minimum edit distances are filled in and two different optimal paths are found, each
of edit distance 2
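The table in Fig. 1.2 can be reproduced with a few lines of code. The following is a minimal
Python sketch, assuming a cost of 1 for every insertion, deletion, and substitution; the function
names edit_distance_matrix and edit_distance are illustrative and do not come from the text.

```python
def edit_distance_matrix(source, target):
    """Return table D, where D[i][j] is the minimum number of single-character
    insertions, deletions, and substitutions turning source[:i] into target[:j]."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        table[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            # Assumed unit costs: 0 for a match, 1 for a substitution.
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion (move up)
                table[i][j - 1] + 1,         # insertion (move left)
                table[i - 1][j - 1] + cost,  # match or substitution (diagonal)
            )
    return table


def edit_distance(source, target):
    """Minimum edit (Levenshtein) distance: the bottom-right cell of the table."""
    return edit_distance_matrix(source, target)[-1][-1]


if __name__ == "__main__":
    for row in edit_distance_matrix("BLUE", "BILE"):
        print(row)
    print(edit_distance("BLUE", "BILE"))
```

For “BLUE” and “BILE” the bottom-right cell of the table is 2, in agreement with the two
alignments of edit distance 2 described in the text. The table has one cell per pair of prefixes,
so the work grows with the product of the two sequence lengths.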
As the scope of the sequence comparison problem gets larger, the computation required grows
rapidly. For two sequences, dynamic programming finds the exact minimum edit distance in time
proportional to the product of their lengths, but when many sequences must be aligned
simultaneously the cost of an exact solution grows exponentially with the number of sequences;
optimal multiple sequence alignment belongs to the class of NP-hard (non-deterministic
polynomial-time hard) problems. One must then apply a number of analytical methods to
efficiently find an approximate value. There are some slightly more advanced algorithms
one could apply, such as a scoring matrix that assigns differential scoring to substitutions
between similar amino acids as opposed to substitutions between non-similar amino
acids, performance improvements that make use of hash tables (e.g. FASTA), or elaborate
transition probability hidden Markov models.
Sequence similarity analysis is the basic principle behind many algorithms in bioinformatics
for elucidating molecular evolutionary hierarchies, finding motifs, reconstructing
gene regulatory networks, and finding other sequence elements such as transcription
factor binding sites, splice sites, and regulatory elements located inside introns. One can
search for genes or proteins whose sequences are highly similar to a given sequence
using the Smith-Waterman algorithm provided by the Pairwise Sequence
Alignment tools,3 or with the FASTA4 or BLAST5 tools, among others. A detailed explanation of
these search algorithms and tools will be given in Chaps. 10 and 11.
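To give a feel for the Smith-Waterman algorithm before those chapters, here is a minimal Python
sketch of local alignment. The scoring values (match +2, mismatch -1, gap -1) and the function
name smith_waterman are assumptions chosen for illustration; real tools use substitution matrices
and more elaborate gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair_score(i, j):
        # Score for aligning a[i-1] with b[j-1] (assumed simple match/mismatch).
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table; a cell never drops below zero, so an alignment may start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair_score(i, j),  # align two characters
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGTGG"))
```

Unlike the global edit distance, the score here is maximized rather than minimized and is reset
to zero whenever it would become negative, so the best-matching region can begin and end anywhere
in either sequence.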
In the early 1970s, when RNA sequences were first being revealed, Pipas and McMahon
suggested that it should be possible to predict the secondary structure of RNA from its
primary sequence information alone. Each RNA molecule has a specific secondary struc-
ture, and 16S rRNA in particular is very useful for identifying microorganisms. One can
download the Vienna RNA Package6 and enter a sequence to obtain a prediction of the
optimal secondary structure. 16S rRNA analysis is now being used as the most certain and
direct methodology in microbial analysis, including species that have so far been impos-
sible to cultivate, and in metagenomics research. For proteins, there are databases such as
the Protein Data Bank (PDB) and SWISS-PROT. The construction of predictive models
taking advantage of artificial intelligence and machine learning techniques that use graph
theory, hidden Markov models, artificial neural networks, and other concepts based on
probability and information theory, has made it possible to solve various computational
problems in biology. These problems include analysis of the tertiary structure of proteins
and RNA, prediction of the functions of large molecules that are very difficult to ascertain
experimentally, determination of open reading frames (ORFs) from expressed sequence
tags (ESTs), interpretation of gene regulatory networks, inference of restriction enzyme
cleavage sites, and investigation of splice sites.
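As a rough illustration of predicting secondary structure from the primary sequence alone, the
sketch below uses the Nussinov base-pair maximization recurrence. This is a much simpler model
than the free-energy minimization performed by packages such as the Vienna RNA Package and is
not their method. The allowed pairs (A-U, G-C, and the G-U wobble), the minimum hairpin loop of
3 unpaired bases, and the function name nussinov are assumptions made for this example.

```python
# Assumed set of allowed base pairs, including the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (pair_count, dot_bracket) maximizing the number of nested base pairs."""
    n = len(seq)
    if n == 0:
        return 0, ""
    best = [[0] * n for _ in range(n)]

    def cell(i, j):
        # Empty or out-of-range intervals contain no pairs.
        return best[i][j] if 0 <= i <= j < n else 0

    # Fill intervals in order of increasing length.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = cell(i, j - 1)  # base j left unpaired
            for k in range(i, j - min_loop):  # base j paired with base k
                if (seq[k], seq[j]) in PAIRS:
                    value = max(value, cell(i, k - 1) + 1 + cell(k + 1, j - 1))
            best[i][j] = value

    # Trace back one optimal structure in dot-bracket notation.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j or cell(i, j) == 0:
            continue
        if cell(i, j) == cell(i, j - 1):
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS and \
                    cell(i, j) == cell(i, k - 1) + 1 + cell(k + 1, j - 1):
                structure[k], structure[j] = "(", ")"
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return best[0][n - 1], "".join(structure)


if __name__ == "__main__":
    print(nussinov("GGGAAAUCC"))
```

Counting pairs ignores stacking energies and loop penalties, so the result should be read only as
a demonstration that a nested structure can be derived from sequence by dynamic programming.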
3 http://www.ebi.ac.uk/Tools/psa/
4 http://www.ebi.ac.uk/Tools/sss/fasta/
5 https://www.ncbi.nlm.nih.gov/blast/
6 http://www.tbi.univie.ac.at/RNA/
1.4 · Development of Microarray Technology and its Medical Applications
The generation of large-scale database networks as a consequence of the Human
Genome Project inevitably gave rise to complicated problems related to the storage, classi-
fication, annotation, searching, analysis, and processing of the data, as well as establishing
links between related databases. To give the simplest example, there are many critical limi-
tations in the existing haphazard nomenclature for large biological molecules. Researchers
who discover a new gene are supposed to register it with an appropriate database, such as
GenBank. However, seemingly simple tasks such as checking if that gene has already been
listed under another name become extremely complicated due to the lack of consistency
in gene nomenclature. Attempts are being made to solve this by assigning each gene in a
gene database an “accession number” as a unique identifier, but there are still a great many
entries with duplicated or overlapping information. Worse, sequences that are listed
starting from the 3′ end are sometimes mixed together with sequences listed starting from
the 5′ end without explanation. There are many cases of recording rat sequences as coming
from mouse, and vice versa. In recent times, various bodies have been formed such as the
HUGO Gene Nomenclature Committee7 and the Gene Ontology Consortium8 (2000) to
deal with the issue of systematizing the nomenclature in a semantically meaningful way,
and to secure both syntactic and semantic consistency for data processing.
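The orientation problem mentioned above can be made concrete with a small check. The following
Python sketch, with the hypothetical helper names reverse_complement and same_sequence, tests
whether two DNA records hold the same sequence written in the opposite direction or taken from
the opposite strand; it assumes plain A, C, G, T strings without ambiguity codes.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def same_sequence(a, b):
    """Describe how record b relates to record a, or return None if they differ."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    if a == b[::-1]:
        return "reversed"  # same strand, listed from the other end
    if a == reverse_complement(b):
        return "reverse complement"  # opposite strand
    return None


if __name__ == "__main__":
    print(same_sequence("ATGC", "CGTA"))
    print(same_sequence("ATGC", "GCAT"))
```

A check like this only detects exact duplicates; entries that overlap partially or differ by a few
bases still require the alignment methods described earlier.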
Microarray technologies (e.g. DNA chip, SNP chip, aCGH) and high-throughput data
acquisition (e.g. NGS), recently in the spotlight, are the main tools that are making it
possible to investigate a biological organism as a single, whole system. When the entire yeast
genome sequence was analyzed in April 1996, out of all 6200 identified gene sequences,
the function of fewer than a quarter could even be estimated, let alone determined
(Brown et al. 1999). The first tool that enabled molecular biological analysis at the
genomic level was the DNA microarray. Detailed reviews on gene chips have been put
together in relevant references,9 and we will discuss them thoroughly in Chaps. 5, 6, 7,
and 8. The use of microarrays has rapidly spread, and we now have several tens of different
types (e.g. SNP chip, aCGH, tissue microarray, tiling array, exon microarray, miRNA
microarray, cell chip). These have become established experimental methods used in most
fields of molecular and cellular biology.
Gene chips are just a large-scale integration of existing gene detection methods, such
as the reverse dot-blot procedure, parallelized for high-throughput analysis; what matters is the
paradigm change in research they have occasioned. The introduction of biochips has expanded
genetic research (i.e. research on individual genes in isolation) into genomic research (i.e.
research on all genes in a genome simultaneously), and changed gene detection methods
(despite a few persisting technical limitations) from qualitative to quantitative analysis.
In other words, mathematical and quantitative methodologies have been introduced
into molecular genetics, which had classically relied on chemical analysis and qualitative
methodologies.
7 http://www.genenames.org
8 http://www.geneontology.org
9 Schena et al., 1995; Shalon et al., 1996; Pease et al., 1994; Lockhart et al., 1996; DeRisi et al., 1997
Genome-wide gene expression analysis has proved to be a very powerful tool that can
be easily and directly applied to clinical practice. Clinical medicine can be divided into
diagnostics, risk stratification, and therapeutics, and in each of these there are reports of gene
expression analysis being successfully applied. Golub et al. [6] reported that it is possible to
distinguish between two subtypes of leukemia, acute myeloid leukemia (AML) and acute
lymphoblastic leukemia (ALL), by analyzing the expression pattern of 6817 different genes.
Alizadeh et al. [1] revealed the existence of two new subtypes of diffuse large B-cell lymphoma
(DLBCL) through genome expression analysis. They showed that of the two subtypes, the
survival rate for the subtype whose expression pattern resembled germinal center B cells was
higher than that for the subtype whose expression pattern resembled activated B cells. They
predicted that the entire paradigm of cancer diagnosis classifications and treatment policies
for future medicine will be permanently transformed by the use of gene expression analysis.
Subsequent large-scale research results have proved that prognostic subgroup prediction
in cancer is possible by means of genome expression analysis. The prospective randomized
controlled clinical trial MINDACT,10 which involves 6000 lymph node-negative breast cancer
patients and uses microarray assays to decide whether or not to treat with chemotherapy, is
expected to show that genome expression analysis will help avoid unnecessary chemotherapy in
25% of the patients without any increase in the risk of recurrence. With current treatment
methods 85–95% of patients undergo unnecessary chemotherapy with a high risk of side effects,
so reducing unnecessary chemotherapy is a major step forward. Microarrays have been rapidly
commercialized. Based on results published by van de Vijver et al. [5], a molecular diagnostic
test called “MammaPrint” that estimates the likelihood of metastasis in breast cancer
using an expression pattern analysis of 70 genes has been developed and marketed by the
Agendia company. In 2007, the US Food and Drug Administration approved the use of
MammaPrint for lymph-node negative breast cancer patients who are under 60 years of
age and with tumors under 5 cm in size. MammaPrint currently costs $4200 per assay. A
similar test, Oncotype Dx, is being offered by Genomic Health at $3650 per assay.
Some have raised doubts about the scientific character of diagnostic methods employ-
ing genome expression data. However, while existing pathology tests are based on
pathophysiological and morphological differences between clinical specimens, those dif-
ferences can also be seen as the result of accumulated differences in genome expression
patterns inside cells. If we compare genome expression analysis to other methods that
have come to be used in cytopathology when morphological classification is difficult, such
as immune histopathology or detection of cell markers, then the basic concept behind
genome expression analysis as applied to making diagnoses in the pathology laboratory
is not a new one at all, since it can simply be seen as the simultaneous detection of tens of
thousands of important cellular markers.
SNP microarrays, triggered by research into single nucleotide polymorphisms (SNPs),
have greatly contributed to the investigation of correlations between individual variation
in genotype and disease phenotypes. Out of 3 billion total bases in the genome, approxi-
mately 0.1–0.2% of them (3–6 million bases) show genotypic variation between individu-
als at specific base locations; these variations are called single nucleotide polymorphisms,
or SNPs.11 From the point of view of simple base variation, a SNP is outwardly the same
The traditional Sanger sequencing method of DNA base sequencing is now referred to as
first-generation sequencing, while the more recent sequencing methods claim to stand for
the “next generation” in sequencing, and so are called next-generation sequencing (NGS).
There are new sequencing platforms on the market today claiming to be of the third and
15 Commonly, third generation sequencing refers to single-molecule sequencing methods whose reads
are much longer than those of short-read platforms (e.g. Pacific Biosciences), while fourth generation
goes beyond established biochemistry-based methods to exploit electrical properties to determine base
sequence.
16 As of early 2017, genome sequence data of more than 3000 individuals were publicly available.
17 The view that the first biological molecule was not DNA but RNA. This is different from the so-called
“central dogma” of biology, where DNA is responsible for storing information, proteins are the mole-
cules that carry out biological functions, and RNA acts as a message or modulator such that the flow
of information is DNA → RNA → protein. In the hypothesized “RNA world”, RNA was the primordial
biological molecule which both stored information and had functional activity (as in the ribosome),
and DNA and proteins later evolved to take over the roles of information storage and functional
activity, respectively.
Personalized, predictive medicine is defined as using all data related to a patient’s genotype
and genome expression to get the best possible perspective for classifying that patient’s
disease, coming up with a treatment plan, and taking preventive measures. Pathological
examinations, image data, and the application of clinical data are also equally important.
That entails overcoming the limitations of current medical decision-making by physicians
relying mainly on clinical signs and categories, by obtaining data directly at the molecular
level for a more detailed approach based on a patient’s intrinsic characteristics. Going
forward, it means taking each patient as an individual, providing the right medication at
the right time with the right dosage for the right patient, thus optimizing treatment and
prevention.
In truth, there are not that many personalized genomic treatment options available
to give to patients. President Barack Obama (then a senator) submitted the Genomics
and Personalized Medicine Act in order to remove scientific and regulatory obstacles
and market pressures, and Secretary Michael Leavitt organized the Secretary’s Advisory
Committee on Genetics, Health, and Society. As has always been the case, the road to
data-based genomic medicine does not look easy. There are especially widespread and
influential concerns about discrimination for employment or insurance eligibility based on the
results of genetic testing. In the United States, the Genetic Information Nondiscrimination
Act has been introduced. Educating the public on the difference between personalized
Bibliography
15 1
medicine based on genomic information, and its diagnosis, treatment, and strategies for
prevention will also emerge as an important task.
Systems biology and biomedical informatics is the combination of biomedical informatics
and genomic medicine, and knowledge of the three fields of molecular biology, informatics,
and clinical medicine forms its trinity. It is an essential field of study that will play a
leading role in the life sciences of the future. It is clear that if the relevant medical
institutions had not had such excellent clinical information systems, the research of
Alizadeh et al. [1] and Golub et al. [6], which opened new prospects in the study of tumor
pathophysiology, would not have been possible. The discussion over how to combine systematic,
genome-level biological data with massive health information systems is an animated one.
Large-scale, high-throughput genomic data acquisition and integrative analysis will enable
a systems-level integration of biological phenomena and permanently change paradigms in
medicine and biotechnology. If the so-called “omics revolution” is leading us to a horizontal
integration of the constituent units of every living thing, then biomedical informatics
will lead to a vertical integration, from the microscopic level of molecular biological
phenomena to the macroscopic level of human beings and societies. Through the delicate
interweaving of the various dimensions of systems biomedical informatics, which is a synthesis
of informatics with a molecular understanding of biological phenomena, we can see
an image of the future.
Bibliography
1. Alizadeh AA et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression
profiling. Nature 403(6769):503–511
2. Altman RB (2000) The interactions between clinical informatics and bioinformatics: a case study. J Am
Med Inform Assoc 7(5):439–443
3. Brown PO, Botstein D (1999) Exploring the new world of the genome with DNA microarrays. Nat
Genet 21(1 Suppl):33–37
4. Bullinger L et al (2004) Use of gene-expression profiling to identify prognostic subclasses in adult
acute myeloid leukemia. N Engl J Med 350(16):1605–1616
5. van de Vijver MJ et al (2002) A gene-expression signature as a predictor of survival in breast cancer. N
Engl J Med 347(25):1999–2009
6. Golub TR et al (1999) Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science 286(5439):531–537
7. Lorenz R et al (2011) ViennaRNA Package 2.0. Algorithms Mol Biol 24(6):26
8. Pipas JM, McMahon JE (1975) Method for predicting RNA secondary structure. Proc Natl Acad Sci U S
A 72(6):2017–2021
9. Povey S et al (2001) The HUGO gene nomenclature committee (HGNC). Hum Genet 109(6):678–680
Epub 2001 Oct 24
10. Redon R et al (2006) Global variation in copy number in the human genome. Nature 444(7118):
444–454
11. Valk PJ et al (2004) Prognostically useful gene-expression profiles in acute myeloid leukemia. N Engl J
Med 350(16):1617–1628
12. Zuckerkandl E, Pauling L (1962) Molecular disease, evolution, and genic heterogeneity. In: Horizons in
biochemistry. Academic Press, New York
Next-Generation Sequencing Technology and Personal Genome Data Analysis
2.1 Introduction
Bibliography
2.1 Introduction
DNA sequence analysis, which unveils the genetic information of life, is the most classic
methodology of the field. As described in Sects. 1.2, 1.3, and 1.5 of Chap. 1, analy-
[Figure 2.1: plot of the number of SNPs (millions; y-axis, 0 to 120) against dbSNP release
(x-axis; releases dated from Aug 2002 to Feb 2010). Series: Submissions, Ref. Clusters,
Validated. Labeled values: 105.1 million, 23.7 million, 14.5 million]
Fig. 2.1 The trend of the registered number of SNPs in dbSNP database. Since Watson’s genome
acquired by NGS technology was released in 2007, the number of sequence variations registered in
dbSNP has dramatically increased. (Courtesy from Koboldt et al. [12])
sequencing data has limitations; and that anyone will be able to obtain a high-level of
personal genome data in the near future.
It is a new challenge. The emergence of the personal genome sequencing technology
has changed the paradigm of existing studies that compare sequences between species
or certain groups. As we are now able to utilize the data of The 1000 Genomes Project
and published medical and biomedical knowledge including references in PubMed, in
the near future, a physician will be able to provide patients with medical advice based
on patients’ respective sequencing data. It is time for physicians to learn the meaning of
personal DNA sequence, as a bioinformatician should, and to reflect this new perspective,
medical schools also need to include such programs in their curriculum.
Ironically, while personal genome sequencing costs have decreased considerably, analysis
and interpretation of personal genome sequences remain the most difficult part of the
challenge. Figure 2.2 is a workflow of personal genome analysis and interpretation operated
at the Center for Biomedical Informatics, Seoul National University School of Medicine
(SNUBI). The first step in interpreting a personal DNA sequence is to search for genomic
variations by comparing the personal DNA sequence with the reference genome sequence.
Identified variations are then assigned interpretations and annotations. Variations are not
only interpreted individually but also processed for variant-set assessment, clinical risk
analysis, pharmacogenomics, gene-environment interpretation, and other bioinformatics
analyses. These analyses require the latest technologies, such as high-performance
distributed file systems and cloud computing, and the integration of diverse biological
databases. Through the integration of various bioinformatics tools and knowledge-base
analysis methodologies compiled over the last decade, personal genome interpretation can be
achieved (Fig. 2.2).
[Figure 2.2: workflow diagram. Labels: Variants detection; Variants annotation; Variants
analysis; Clinical risk analysis; Pre-test probability analysis; Post-test probability
analysis; Pharmacogenomics; Phenotype-associated variants DB (SNPedia etc.); GET-evidence;
ANNOVAR; Promethease; Computing resource: Lustre clustered file system, LIMS, Galaxy workflow]
Fig. 2.2 A workflow of personal genome interpretation developed by the Center of Biomedical
Informatics, Seoul National University School of Medicine (SNUBI)
This chapter gives an understanding of the fundamental tools used for analyzing and
interpreting next-generation sequencing data, such as sequence data formats and the
algorithms of sequence alignment, assembly, and visualization tools, and summarizes the
process of sequence variation analysis. More elaborate processing and interpretation of NGS
data will be covered in “Genome data analysis II for NGS, Cancer and disease genome”, which
will be published as a separate volume. DNA sequence motifs and phylogenetic tree analyses
are described in Chap. 11.
2.2 Data Format
Understanding data formats is the first requirement for sequence data analysis. NGS
sequencers generate data in FASTQ and CSFASTA formats, together with a data file containing
QVs (Quality Values), a series of ASCII symbols that represent the quality of each base call.
The FASTQ format was originally developed by the Wellcome Trust Sanger Institute to store
sequence data and is now widely used; it has been adopted by Illumina NGS sequencers as well
as the Ion Torrent™ sequencer from Thermo Fisher. A FASTQ file is essentially a combination
of a FASTA file and a QV score for each base. So that each score (i.e., a number) can be
written as a single character, QVs are encoded as ASCII characters.
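The pairing of bases and ASCII-encoded QVs described above can be sketched as follows. This is a minimal illustration, assuming the common Phred+33 convention (the text does not state which offset a given sequencer uses); the record and the function name parse_fastq_record are invented for the example.

```python
def parse_fastq_record(lines, offset=33):
    """Parse one four-line FASTQ record into (identifier, bases, scores).

    offset=33 is an assumption (Phred+33); other encodings use other offsets.
    """
    header, bases, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Each quality character encodes one number: its ASCII code minus the offset.
    scores = [ord(symbol) - offset for symbol in quality]
    return header[1:], bases, scores


# Hypothetical record: identifier line, bases, separator line, quality string.
record = ["@read1", "ACGT", "+", "I5+!"]
name, bases, scores = parse_fastq_record(record)
for base, score in zip(bases, scores):
    print(base, score)
```

Keeping the offset as a parameter makes the assumption explicit, so the same sketch can be reused if a file turns out to use a different encoding.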
A sequence in FASTA format begins with a single-line description, followed by lines of
sequence data. The description line is distinguished from the sequence data by a
greater-than symbol (“>”) in the first column. The word following “>” is the identifier of
the sequence, and the rest of the line is the description (both are optional). There should
be no space between “>” and the first letter of the identifier. It is recommended that all
lines of text be shorter than 80 characters. The sequence ends when another line starts with
“>”; this indicates the start of another sequence (Fig. 2.3). The further details of FASTA
format will be
Fig. 2.3 An example of a FASTA file format used in sequence data processing
The concept of QV originates from the Phred quality score, which has long been used in
sequence analysis. It is a logarithmic transformation of the error rate: a score of 10
indicates a 10% probability of error, 20 indicates 1%, and 30 indicates 0.1% (Table 2.1).
For example, if a sequencer has 99.99% accuracy, most generated reads would have a QV of
around 40. However, QV is not an absolute measure for quality assessment because every
sequencer has its own data generation mechanism to detect a signal, which is represented
as a base call or color call at a certain level of accuracy. Therefore, the concept of QV is
applicable to every sequencer, but QV scores are not comparable between sequencers. Note
that SOLiD sequencers output raw data without the QV-based filtering that other sequencers
apply, and it is meaningful to analyze the unfiltered raw data further, as specific genomic
regions with very low QV scores can exist.
[Figure residue: color-space encoding matrix indexed by 1st base; surviving rows:
C 1 0 3 2; G 2 3 0 1; T 3 2 1 0]
Table 2.1 Phred quality scores are logarithmically linked to error probabilities
Phred Quality Score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90%
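The logarithmic relationship between a Phred-type score and the error probability is Q = -10 log10(P). The short sketch below converts in both directions; the function names are chosen for this illustration only.

```python
import math


def phred_to_error_probability(q):
    """Error probability P for a Phred-type score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred-type score Q for an error probability P."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"QV {q}: error probability {p:.4%}, base call accuracy {1 - p:.2%}")
```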
2.3 Alignment of NGS Sequencing Reads
A personal genome is reconstructed by aligning and mapping the short sequences (i.e., reads)
generated by an NGS sequencer to the human reference genome. Because existing sequence
alignment tools such as BLAST (Basic Local Alignment Search Tool) are too slow for mapping
these short reads to the genome, new tools for alignment and mapping have been developed.
Table 2.2 shows the new tools and their running times.
Most fast sequence alignment algorithms generate an index as an auxiliary data structure
in order to increase mapping speed. Indexing a huge DNA sequence enables users to find where
short sequences are located in it much faster. Two indexing methods underlie these sequence
alignment algorithms: suffix trees and hash tables. A hash table is a data structure that
stores keys and their corresponding values; a hash function computes, from a key, the
location (index) where the corresponding value is stored. A suffix tree is a data structure
that stores the suffixes of a string.
The indexing method used in a hash table (hash map) breaks the sequence into consecutive
tuples and stores the position of each tuple. A k-tuple is defined as a string composed of
k characters. For example, if the string s is “banana” and k is four, the set of all
4-tuples is {“bana”, “anan”, “nana”}. As DNA is composed of four bases, A, C, G, and T, the
maximum number of distinct k-tuples is 4^k, and the number of stored positions is close to
the DNA sequence length. Once the index table of the reference genome or the query sequences
has been constructed in this way, hash-table-based alignment looks up a seed in the table,
that is, a perfect match between the query sequence and the reference genome, and the
alignment is then extended from the matched seed. This method is called “seed-and-extend.”
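The k-tuple index and the seed-and-extend idea can be sketched as follows. This is a toy illustration rather than the algorithm of any particular aligner: the extension step only counts mismatches without gaps, and the reference, read, and max_mismatches parameter are invented for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash table mapping every k-tuple of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=1):
    """Look up exact k-tuple seeds, then extend each seed over the whole read.

    The extension here is ungapped and simply counts mismatches (a simplification).
    Returns sorted (start position, mismatch count) pairs.
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        for seed_pos in index.get(read[offset:offset + k], []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


# The 4-tuples of "banana", as in the text.
tuples = sorted(build_index("banana", 4))

# Hypothetical reference and read with one mismatch.
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_index(reference, 4)
hits = seed_and_extend("TAGCCGTT", reference, index, 4)
print(tuples, hits)
```

Because only positions suggested by a seed are examined, the read is compared against a handful of candidate windows instead of every position in the reference.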
Ma et al. [18] suggested using k non-consecutive characters as a seed, defining the
relative positions of these k characters as a model and k as its weight. For example, for
1110111, a seed model with a weight of 6, “actgact” matches “actaact”, “actcact”, and
“acttact” as well as “actgact” itself, because the fourth position is not compared. This
method, called “spaced seed,” has been reported to show better sensitivity than the
traditional consecutive-seed method.
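A spaced seed comparison can be sketched in a few lines. The model string uses “1” for positions that must match and “0” for positions that are ignored; the helper names are chosen for this illustration.

```python
def spaced_key(text, model):
    """Keep only the characters at positions where the model has a '1'."""
    return "".join(ch for ch, bit in zip(text, model) if bit == "1")


def spaced_seed_match(a, b, model):
    """True if a and b agree at every '1' position of the model."""
    if not (len(a) == len(b) == len(model)):
        return False
    return spaced_key(a, model) == spaced_key(b, model)


model = "1110111"
weight = model.count("1")  # number of compared positions
print(weight, spaced_seed_match("actgact", "acttact", model))
```

In an index, spaced_key would be used as the hash-table key in place of the consecutive k-tuple.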
A suffix is a substring that runs to the end of a given string. For example, “nana” is a
suffix of the string “banana”. A suffix tree is a tree structure that contains all suffixes,
and therefore all substrings, of a given string, and it is an efficient data structure for
string operations. There are six suffixes in “banana”: “banana”, “anana”, “nana”, “ana”,
“na”, and “a.” Because “ana” and “a” are also prefixes of “anana”, and “na” is a prefix of
“nana”, the shorter suffixes share branches with the longer ones, so the suffix tree can be
built efficiently around “banana”, “anana”, and “nana.” This tree makes it possible to search
for short and repeated patterns (Fig. 2.5). “$” marks the end position of a suffix.
[Figure 2.5 (suffix tree diagram); only fragments of the edge labels survive:
“N A N $”, “A N $ A”, “N A $”, “A $”]
However, when a suffix tree is constructed for the original strings, the completed
structure is often too large to store. To overcome this problem, the suffix array was
proposed: a sorted array of all suffixes of a given text, which is much more space-efficient
than a suffix tree. The Burrows-Wheeler Transform (BWT) is a text transformation algorithm
developed by Michael Burrows and David Wheeler in 1994.
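The suffix array and the BWT are closely related, which a small sketch can show. The construction below sorts suffixes directly, which is fine for a toy string but far too slow for a genome; production tools use specialized construction algorithms.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Burrows-Wheeler Transform: the character preceding each sorted suffix.

    The text is expected to end with a unique sentinel such as "$".
    """
    return "".join(text[i - 1] for i in suffix_array(text))


text = "banana$"
sa = suffix_array(text)
transformed = bwt(text)
print(sa, transformed)
```

Note how equal characters cluster together in the transformed text, which is what makes it easier to compress.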
A text transformed by BWT can generally be compressed more efficiently than the original
text. BWT has been widely used in text compression as well as in text search. It was
originally developed for data compression, but it is now also used to align NGS DNA
sequences because of storage space constraints. An advantage of the suffix tree, suffix
array, and BWT index is that they can find all positions of a substring in the original
string at once.
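Finding all positions of a substring at once can be illustrated with a suffix array: every occurrence of a pattern is a prefix of some suffix, and those suffixes are adjacent in sorted order. The sketch below uses a naive construction and materializes the prefixes for clarity, so it is meant for small examples only.

```python
from bisect import bisect_left, bisect_right


def find_all(text, pattern):
    """Return every start position of pattern in text using a suffix array."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Truncating sorted suffixes to the pattern length keeps them in sorted order,
    # so the matching suffixes form one contiguous block found by binary search.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])


positions = find_all("banana", "ana")
print(positions)
```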
After sequence alignment, SNP and InDel detection is one of the key processes in personal
genome sequence analysis. In early NGS studies, a filtering step was employed to select
base calls with a Phred-type quality score of 20 or greater. A heterozygous genotype was
generally called when the non-reference allele fraction was 20–80%, and a homozygous
genotype was called otherwise. This was an empirical way to determine genotypes. Its
disadvantages were that (1) much information can be lost when sequence coverage is low or
moderate and (2) the uncertainty of the inferred genotype is not measurable; various
probabilistic methods are therefore used to alleviate these problems. For example, the error
rate of the platform, the probability of incorrect mapping, and the coverage are analyzed to
calculate the likelihood that a given position is heterozygous or homozygous. For this
reason, recent SNP and InDel detection programs apply data preprocessing and a Bayesian
framework.
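The difference between a fixed cutoff and a probabilistic call can be sketched with a simple binomial model. This is an illustrative simplification, not the model of any specific variant caller: it assumes a single per-base error rate, independent reads, and hypothetical genotype priors, and it ignores mapping quality.

```python
from math import comb


def genotype_posteriors(n_ref, n_alt, error_rate=0.01, priors=None):
    """Posterior probability of each genotype given reference/alternate read counts.

    error_rate and the default priors are hypothetical values for illustration.
    """
    if priors is None:
        priors = {"hom_ref": 0.998, "het": 0.001, "hom_alt": 0.001}
    # Probability that a single read shows the alternate allele under each genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1 - error_rate}
    n = n_ref + n_alt
    joint = {
        g: priors[g] * comb(n, n_alt) * p_alt[g] ** n_alt * (1 - p_alt[g]) ** n_ref
        for g in priors
    }
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


posteriors = genotype_posteriors(n_ref=12, n_alt=9)
call = max(posteriors, key=posteriors.get)
print(call, posteriors[call])
```

Unlike the 20–80% rule, the posterior gives a measure of uncertainty for the call, and that uncertainty naturally grows when coverage is low.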
[Figure 2.6: workflow diagram. Step labels: Convert SAM to BAM; Sort the BAM file; Merge
BAM files; Remove PCR duplicates; INDEL realignment; Recalibration; Apply recalibration;
Our.snps.vcf. Tools: BWA, PICARD, SAMTOOLS, GATK]
Data preprocessing involves estimating the quality of sequence fragments (reads) and
applying filters accordingly. Depending on the results of these analyses, sequence fragments
are flagged as paralogous or repeated sequences or are filtered out. QVs are recalibrated by
various statistical methods, and reads around small InDels are realigned. After data
preprocessing, a Bayesian method is used for data filtering.
Detection of genome variation involves a series of complicated processes. It is necessary
to tailor the process to the nature of the NGS sequencing data and the purpose of the
analysis. Figure 2.6 shows the workflow for the detection of genome variations developed at
SNUBI. Each step selectively uses tools such as BWA, PICARD, SAMTOOLS, and GATK according to
the purpose of the analysis (Fig. 2.6).
After sequencing and numbering, each person has genomic data of about 3 GB. Because this is
too large an amount to keep in personal storage, downloading, saving, and copying were very
difficult in the work environment. However, if only the differences from the reference
genome (SNP, InDel, SV) are saved, the data can be compressed to about 4 MB and emailed as
an attachment. Reducing the size of the data can also speed up additional analysis.
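The idea of storing only the differences from a reference can be sketched for the simplest case, single-base substitutions between two sequences of equal length. InDels and SVs need a richer representation (as in the VCF format) and are not handled here; the sequences are invented for the example.

```python
def snp_differences(reference, sample):
    """List (1-based position, reference base, sample base) where the two differ."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only (equal lengths)")
    return [
        (pos + 1, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


def apply_snps(reference, variants):
    """Rebuild the sample sequence from the reference and the stored differences."""
    bases = list(reference)
    for pos, ref_base, alt_base in variants:
        if bases[pos - 1] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos - 1] = alt_base
    return "".join(bases)


reference = "ACGTACGTAC"   # hypothetical reference
sample = "ACGTTCGTAA"      # hypothetical personal sequence
variants = snp_differences(reference, sample)
print(variants)
```

The variant list is much shorter than the sequence itself whenever the sample is close to the reference, which is the reason the compressed size is so small.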
The next step after sequencing is to identify sequence variations and analyze them. To
interpret individual variations, we use Promethease, an informatics tool that analyzes
variations and their effects using 100,000 large data sets concurrently and summarizes the
results in an HTML report. Promethease bases its analysis on variations stored in the
SNPedia database. Promethease is available in free and paid versions; a free report takes
about 13–15 min to analyze and summarize. Promethease groups its summarized results into
several categories: Good news, Bad news, These seem interesting, Medicines, Medical
Conditions, and Topics. A paid report costs 2 US dollars and is generated substantially
faster; it also provides tab-delimited output, a tag cloud, a comparison of a personal
genome with others in an F1 report, and a family report in which an individual’s genome is
compared with those of his or her family. Magnitude, repute, population frequency, and text
color are used to visualize results and help interpret each variation, with one rsID per
variation. Files in the 23andMe, deCODEme, and Navigenics formats are supported.
Genome interpretations cannot be made solely on functional inferences of individual
genome variations. About 300,000 to 400,000 variations are identified per sample, varying
with factors such as the racial background of the individual. Interactions among these
variations are not addressed in this book; however, they must be taken into consideration.
In addition to next-generation sequencing data, the individual’s age, gender, health status,
and clinical information must be taken into account, as they may increase the risk of
disease. Though incomplete, Fig. 2.7 depicts an example of a clinical risk prediction
[Figure 2.7 residue: clinical risk prediction example; truncated labels “P...”, “TT”, “Stom...”]
Contingency table
Genotype    Case    Control    LR        Posttest Prob
TT          400     256        1.1484    0.0427
TC, CC      49      74         0.4867    0.0181
Reference
21070779 Saeki Norihisa et al. A functional single nucleotide polymorphism in mucin 1, at
chromosome 1q22, determines susceptibility to diffuse-type gastric cancer. Gastroenterology.
Mar, 2011.
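The likelihood ratios in the contingency table can be recomputed from the case and control counts, as sketched below. The pre-test probability behind the tabulated post-test values is not given in the surviving text, so the value used here is a hypothetical placeholder, and the post-test step uses the standard odds form, which may differ slightly from how the figure was produced.

```python
def likelihood_ratio(case_with, case_total, control_with, control_total):
    """LR = P(genotype | case) / P(genotype | control)."""
    return (case_with / case_total) / (control_with / control_total)


def post_test_probability(pre_test_probability, lr):
    """Convert to odds, multiply by the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Counts taken from the contingency table above.
cases = {"TT": 400, "TC, CC": 49}
controls = {"TT": 256, "TC, CC": 74}
case_total = sum(cases.values())
control_total = sum(controls.values())

ratios = {
    genotype: likelihood_ratio(cases[genotype], case_total,
                               controls[genotype], control_total)
    for genotype in cases
}

pre_test = 0.04  # hypothetical pre-test probability, not taken from the source
for genotype, lr in ratios.items():
    print(genotype, round(lr, 4), round(post_test_probability(pre_test, lr), 4))
```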
KEGG views diseases as perturbed states of the molecular biology system and classifies
them into three categories: monogenic diseases, multifactorial diseases (cancer, autoimmune
disease, metabolic disorders, etc.), and infectious diseases. Disease classification
criteria follow ICD-10, ICD-O-3, and MedDRA (Fig. 2.8). KEGG disease database organizes
diseases
[Figure 2.8: diagram. Labels: Genetic perturbations (germline/somatic mutations, etc.);
Environmental perturbations (incl. pathogens, microbiome); Molecular network system;
Disease; Diagnostic markers; Therapeutic drugs; Genomic biomarkers]
2.5.4 Pharmacogenomics
tool from Stanford University, to analyze personal drug response and interpret the results.
GENOtation supports 23andMe data format.
The 1000 Genomes Project started in January 2008 with the intent to study the relationship
between genotype and phenotype, as well as genomic variability between population groups,
variant frequency distribution, haplotype information, and linkage disequilibrium of
alternative alleles. In addition, a primary objective was to identify 95% of the variations
(SNPs, CNVs, and InDels) with allele frequencies of 0.1% to 0.5% in coding regions and of 1%
in non-coding regions.
The project comprised three phases and completed its first phase in 2010 with the release
of its project data files. Phase 3 sequencing has now been completed, and of the 2577
specimens, data for 2504 were released (Table 2.3) after excluding those with contamination
issues, alignment issues, et cetera. The released data can be downloaded from NCBI and EBI
(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502); the release dates from May
2013 and was updated in February 2015. Data for chromosomes 1–22, X, and Y are compressed in
gzip format. In Chap. 3, we will practice calculating the population distribution of
variants using 1000 Genomes Project data.
Bibliography
1. Alberto M et al (2010) Bioinformatics for next generation sequencing data. Genes 1(2):294–307
2. Antonarakis SE et al (2010) Mendelian disorders and multifactorial traits: the big divide or one for all?
Nat Rev Genet 11:380–384
3. Capriotti E et al (2006) Predicting the insurgence of human genetic diseases associated to single
point protein mutations with support vector machines and evolutionary information. Bioinformatics
22(22):2729–2734
4. Daly AK (2010) Genome-wide association studies in pharmacogenomics. Nat Rev Genet 11:241–246
5. Daly AK (2010) Pharmacogenetics and human genetic polymorphisms. Biochem J 429(3):435–449
6. Frazer KA et al (2009) Human genetic variation and its contribution to complex traits. Nat Rev Genet
10:241–251
7. Gonzalez-Angulo AM et al (2010) Future of personalized medicine in oncology: a systems biology
approach. J Clin Oncol 28(16):2777–2783
8. Haas BJ, Zody MC (2010) Advancing RNA-Seq analysis. Nat Biotechnol 28(5):421–423
9. Homer N et al (2009) BFAST: an alignment tool for large scale genome resequencing. PLoS One
4(11):e7767
10. http://en.wikipedia.org/wiki/FASTA_format
11. http://en.wikipedia.org/wiki/Phred_quality_score
12. Koboldt DC et al (2010) Challenges of sequencing human genomes. Brief Bioinform 11(5):484–498
13. Langmead B et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the
human genome. Genome Biol 10(3):R25
14. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics 25(14):1754–1760
15. Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing.
Brief Bioinform 11(5):473–483
16. Li H et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
17. Li R et al (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics
25(15):1966–1967
18. Ma B et al (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics 18(3):
440–445
19. MacLean D et al (2009) Application of ‘next-generation’ sequencing technologies to microbial
genetics. Nat Rev Microbiol 7(4):287–296
20. Manolio TA et al (2009) Finding the missing heritability of complex diseases. Nature 461:747–753
21. Morin R et al (2008) Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively
parallel short-read sequencing. BioTechniques 45(1):81–94
22. Mortazavi A et al (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth-
ods 5(7):621–628
23. Ng PC, Henikoff S (2006) Predicting the effects of amino acid substitutions on protein function. Annu
Rev Genomics Hum Genet 7:61–80
24. Nielsen R et al (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev
Genet 12(6):443–451
25. Ning Z et al (2001) SSAHA: a fast search method for large DNA databases. Genome Res 11(10):
1725–1729
26. Novelli G et al (2010) Role of genomics in cardiovascular medicine. World J Cardiol 2:428–436
27. Ramensky V et al (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30:3894–
3900
28. Schork NJ et al (2009) Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev
19:212–219
29. Snyder M et al (2010) Personal genome sequencing: current approaches and challenges. Genes Dev
24:423–431
30. Sultan M et al (2008) A global view of gene activity and alternative splicing by deep sequencing of the
human transcriptome. Science 321(5891):956–960
31. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-
scale sequencing. Nature 467:1061–1073
32. Trapnell C et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated tran-
scripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515
33. Tsuji S (2010) Genetics of neurodegenerative diseases: insights from high-throughput resequencing.
Hum Mol Genet 19:R65–R70
34. Wang Z et al (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63
35. Wheeler DA et al (2008) The complete genome of an individual by massively parallel DNA sequencing.
Nature 452(7198):872–876
36. Yandell M et al (2011) A probabilistic disease-gene finder for personal genomes. Genome Res
21(9):1529–1542
37. Yi X et al (2010) Sequencing of 50 human exomes reveals adaptation to high altitude. Science
329(5987):75–78
38. Zhang J et al (2011) The impact of next-generation sequencing on genomics. J Genet Genomics
38(3):95–109
Personal Genome Data Analysis
3.1 Prerequisites
Bibliography
3.1 Prerequisites
This chapter consists of four sections. The first three sections use the Windows operating
system (OS) and the last section runs on a Linux OS. The requirements for the Windows OS
are R for Windows (http://www.r-project.org/) and an Internet browser, either Mozilla
Firefox or Google Chrome. The requirement for the Linux OS is an installation of VCFtools.
The latest version for the Linux OS is available at http://vcftools.sourceforge.net.
In this section, you will familiarize yourself with 23andMe file and personal genome data
analysis and interpretation using an HTML analysis report created by the Windows ver-
sion of Promethease.
# 23andMe
# rsid chromosome position genotype
rs12255372 10 114808902 GT
rs12255372 10 114808902 TT
rs6152 X 66765627 GG
rs9939609 16 53820527 AA
It takes about 13 to 15 min to obtain known SNP annotations in an HTML report file.
• Step 1: Download and run Promethease for the desktop
(https://www.snpedia.com/index.php/Promethease/Desktop).
• Step 2: Click the ‘Load’ option in the ‘Genotype Files’ window. Then, select the
promeSample.txt file in the directory.
• Step 3: Click ‘Next’ and select Ethnicity (population).
• Step 4: Click ‘Next’ and type the file name and an output folder in which to save the file.
• Step 5: Click ‘Next’ in the ‘Promethease Wizard’ window.
• Step 6: When the analysis is completed, the resulting HTML file will be saved in the
output directory you defined in step 4 (Fig. 3.2).
• Good news: Repute is good (e.g., resistant to HIV, won’t go bald, resistant to several
diseases, etc.).
• Bad news: Repute is bad (e.g., increased risk for type-2 diabetes, 1.2x prostate cancer
risk, more likely to go bald before age 40).
• Interesting: Sometimes repute cannot be distinctly ‘Good’ or ‘Bad’ (e.g., you have a
variant linked to blue eye color, warfarin sensitivity (~2.5 mg/day), 1.38x increased
risk for prostate cancer).
• Top 10% somewhat rare in your chosen population: You can change your chosen population
from the default you picked when making the report to another; this updates the population
frequency scores and therefore the sorting order.
• Medicines: Information on drug response for 160 drugs is provided (e.g., Gefitinib:
rs2231142, associated with diarrhea as a side effect of Gefitinib treatment; Gleevec;
Warfarin).
• Medical Conditions: Interpretations are provided for 270 disease-associated SNPs
(e.g., AIDS, asthma, liver cancer).
• Topics: A summary of annotated SNPs is presented.
0 You have the common genotype, for which nothing interesting is known.
0.1 You have the common genotype, but it’s interesting that this varies for others.
3.3 Interpretation of the Correlation of Rare SNPs with Diseases Using SIFT and KEGG
The SIFT algorithm is an analysis tool that predicts the severity of protein damage caused
by SNPs in protein-coding sequences. In this section, we will practice the following:
(1) identifying the SNPs that are most likely to cause protein damage by using the SIFT Web
tool, (2) mapping the identified SNPs onto the disease pathways of KEGG DISEASE,
(3) predicting the relationships between the identified SNPs and diseases, (4) adding
annotations to the identified SNPs and calculating SIFT scores by using the simple SIFT Web
tool (http://sift.jcvi.org/), and (5) investigating the disease relevance of the identified
SNPs, selected using R for Windows, by mapping them onto the pathways of KEGG DISEASE.
CEU European – 180 samples of Utah residents with Northern and Western European ancestry
from the CEPH collection
When the number of SNPs is less than one million, we can analyze the data using the SIFT
Web tool. However, analyzing SNPs of the whole personal genome data requires a pro-
gram, such as ANNOVAR, that can process a larger amount of data.
Using the SNP data, go to the SIFT web site (http://sift.jcvi.org/) and annotate the
SNPs. Also, find the degree of protein damage caused by the SNPs (Fig. 3.4).
2. Upload sift_input.txt or click “upload example,” then paste the list of human genomic
variants. Select ‘Ensembl Gene ID’ and ‘Gene Name’ and click ‘submit.’
3. Click “view result table” on the result page. You can either download the files or
review only the brief report.
4. Review VARIATION, PROTEIN SEQUENCE CHANGE, PROVEAN PREDICTION,
SIFT PREDICTION, and ANNOTATION (dbSNP_ID, GENE_ID, GENE_NAME).
If the SIFT score of a SNP is smaller than 0.05, the SNP is denoted ‘DAMAGING’,
meaning that it may affect protein function.
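The 0.05 cutoff can be applied to a downloaded result table with a few lines of code. The rows below are invented for illustration (they are not output of the SIFT web tool), and the label used for scores at or above the cutoff follows SIFT’s usual ‘TOLERATED’ wording.

```python
SIFT_CUTOFF = 0.05


def sift_prediction(score):
    """Label a SIFT score using the 0.05 cutoff described above."""
    return "DAMAGING" if score < SIFT_CUTOFF else "TOLERATED"


# Hypothetical (rsID, gene name, SIFT score) rows.
rows = [
    ("rs0000001", "APC", 0.01),
    ("rs0000002", "AKT1", 0.42),
    ("rs0000003", "MLH1", 0.00),
]

damaged_genes = sorted({gene for _, gene, score in rows
                        if sift_prediction(score) == "DAMAGING"})
print(damaged_genes)
```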
Use the DAMAGING gene list obtained from the SIFT scores above, or a list selected by other
tools, to correlate the SNPs with diseases. Run a hypergeometric test using
phyper(x, m, n, k, lower.tail = FALSE) in R for Windows and evaluate the statistical
significance.
Fig. 3.4 Running the SIFT web tool using sample data
> head(pathway_list)
V1 V2 V3 V4 V5
1 Colorectal cancer 207 AKT1 0 NA
2 Colorectal cancer 208 AKT2 0 NA
3 Colorectal cancer 10000 AKT3 0 NA
4 Colorectal cancer 324 APC 1 NA
5 Colorectal cancer 10297 APC2 0 NA
6 Colorectal cancer 26060 APPL11 0 NA
> head(important_genes)
V1 V2 V3 V4 V5
4 Colorectal cancer 324 APC 1 NA
11 Colorectal cancer 581 BAX 1 NA
20 Colorectal cancer 1630 DCC 1 NA
24 Colorectal cancer 3845 KRAS 1 NA
32 Colorectal cancer 4292 MLH1 1 NA
33 Colorectal cancer 4436 MSH2 1 NA
5. Find common genes between damaged genes and colorectal cancer genes (important_genes).
> mapping_result
[1] "MLH1" "TGFBR2" "APC" "TP53" "MSH2" "MSH6" "DCC"
6. Run the hypergeometric distribution test (damaged_gene: 2900, Human gene: 30000).
The resulting P-value of 0.01199652 indicates that the overlap is statistically significant.
※ If there are any technical difficulties, please refer to the provided ‘command.txt’.
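The hypergeometric test in step 6 can also be written out directly, which makes the role of each number explicit. In the sketch below, the overlap of 7 genes, the 2900 damaged genes, and the 30000 human genes come from the text, while the size of the colorectal cancer gene list (drawn) is a hypothetical placeholder, so the printed value is not expected to reproduce the P-value quoted above. In R, the same upper tail P(X ≥ x) corresponds to phyper(x - 1, m, n, k, lower.tail = FALSE), because lower.tail = FALSE returns P(X > x).

```python
from math import comb


def hypergeom_upper_tail(overlap, marked, population, drawn):
    """P(X >= overlap) when `drawn` items are sampled without replacement
    from `population` items, of which `marked` are of interest."""
    upper = min(marked, drawn)
    favorable = sum(
        comb(marked, i) * comb(population - marked, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(population, drawn)


overlap = 7          # genes in mapping_result
marked = 2900        # damaged genes
population = 30000   # human genes
drawn = 62           # hypothetical size of the colorectal cancer gene list

p_value = hypergeom_upper_tail(overlap, marked, population, drawn)
print(p_value)
```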
The 1000 Genomes Project data include SNPs, InDels, and SVs (structural variants) of 2504
individuals. This section uses VCFtools (http://vcftools.sourceforge.net/) to define target
regions on the chromosomes. We will also practice finding allele frequencies in a
population by SNP rsID using the grep and awk commands.
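What the grep and awk step computes can be sketched in a few lines: find the VCF record with a given rsID and count alternate alleles across the genotype columns. The two VCF lines below are invented for illustration, and the sketch assumes that GT is the first entry of each sample field, as is conventional in VCF.

```python
def alt_allele_frequency(vcf_lines, rsid):
    """Fraction of non-reference alleles among all called alleles for one rsID.

    Returns None if the rsID is not found or no genotype is called.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if fields[2] != rsid:
            continue
        alt_count = total = 0
        for sample in fields[9:]:
            genotype = sample.split(":")[0]  # GT is assumed to come first
            for allele in genotype.replace("|", "/").split("/"):
                if allele == ".":
                    continue  # missing call
                total += 1
                alt_count += allele != "0"
        return alt_count / total if total else None
    return None


# Hypothetical header and record for three samples.
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "22\t16050075\trs0000001\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|1\t1|1",
]
frequency = alt_allele_frequency(vcf_lines, "rs0000001")
print(frequency)
```

For real 1000 Genomes files, the lines would be read from the gzip-compressed VCF rather than from a list in memory.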
Basic Options
• --vcf <filename>
This option specifies the VCF file to be processed.
• --gzvcf <filename>
This option processes a compressed (gzipped) VCF file without extracting it.
• --out <prefix>
This option sets the output filename prefix for all files generated by vcftools. If the
option is omitted, all output files will have the prefix “out.”
※ Additional Options
> Example of extracting a compressed .tar.gz file.
tar -xvzf ALL.chr22.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.tar.gz
Chapter 3 · Personal Genome Data Analysis
gov/1000genomes/ftp/release//20130502/ALL.chr22.phase3_shapeit2_mvncall_inte-
grated_v5a.20130502.genotypes.vcf.gz).
2. Using an editor or vi commands, enter a chromosome number and a position on each line and save the result as a text file. Please refer to position.txt as an example.
3. Type the following VCFtools command.
Exercises
[Exercise 1] - Go to SNPedia (http://www.snpedia.com) and learn which disease Herceptin is used for
Personal Genome
Interpretation and Disease
Risk Prediction
4.1 Introduction
The era of the personal genome opened with the development of next-generation sequencing and the accompanying decrease in genome analysis costs. A variety of genome testing services existed in the past, but direct-to-consumer (DTC) companies that analyze an individual consumer’s genome are now increasing explosively. Such companies include 23andMe, deCODE genetics, and Navigenics; they provide services such as individual genome analysis for disease prediction, drug response prediction, and ancestry finding. In particular, disease prediction has become one of the most important consumer genome services. The company Pathway Genomics has actively started offering consumer DTC genetic test kits through pharmacies.
One famous example of risk prediction, reported by the New York Times, involved Hollywood star Angelina Jolie. She tested positive for a BRCA1 gene mutation inherited from her mother, who died of ovarian cancer at the age of 56. This mutation is found in ~5% of breast cancer patients and in ~10% of ovarian cancer patients. At the time of testing, Angelina Jolie showed no signs of breast cancer. Because of her increased risk of developing breast cancer, she chose a preventive double mastectomy.
Angelina Jolie’s decision was one of ‘clinical rationality.’ By contrast, the various disease-risk prediction services provided by DTC companies have been criticized for their low reliability, which means their results cannot be used in clinical decisions. Furthermore, risk prediction can increase social anxiety and confusion. The FDA has announced that many commercial genetic testing services are clinically unreliable, confuse consumers, and provide clinically insignificant results, and that such services should be FDA approved. The FDA also demanded that the service provided by 23andMe, successfully marketed as individual genome analysis, be FDA approved as a ‘medical device.’ Marketing of the 23andMe service was banned after the controversy, and this event changed the marketing practices of many DTC companies.
Today, personal genome interpretation remains challenging. There are too few inter-
pretable individual genome variants, and the analysis algorithms need to be further devel-
oped by bioinformaticians. In this chapter, we practice the basics of individual genome
variation data analysis and clinical interpretation. One must know SNP prioritization
methods and genome-related databases in order to understand genetic disease risk pre-
diction algorithms. We introduce software and databases used in predicting disease risk to
provide accurate information regarding individual genome interpretation.
In the previous chapter, GWAS (genome-wide association study) analysis was introduced as a general method for correlating genotype with phenotype; it is used to predict disease risk or drug response in relation to individual differences in gene sequence. Recently, correlation analyses based on microarrays and next-generation sequencing have been developed. However, GWAS yields an excessive number of ‘statistically significant’ SNPs, the effect of each SNP on molecular biology is largely unknown, and it is difficult to use GWAS results in clinical applications.
SNP prioritization is used to pick out the more relevant SNPs from the excessive number of GWAS results. In this technique, SNPs presumed to be correlated with the phenotype are ranked. The ranking uses not only p-values but also knowledge of molecular biology; algorithms such as SPOT or SNPranker weight the significance of each SNP accordingly.
The previous chapter introduced several algorithms that use whole-genome or exome sequence data to predict the probability of hereditary disease, based on Dr. Ashley’s publication “Clinical assessment incorporating a personal genome.” Consumer genome analysis companies use a variety of algorithms, but in most cases GWAS research results are manually reviewed to extract the correlation between disease and sequence variation, which allows disease risk to be predicted. A meta-analysis of the case and control group sizes reported in the GWAS literature is used to calculate odds ratios relating sequence variation to disease. Race-specific disease risk is then calculated using the known prevalence. In practice, few papers report the specific numbers for the case and control groups, so it is often difficult to calculate disease risk.
In this exercise, we use Dr. Ashley’s example data and Stanford University’s GENOtation
tool, which uses his algorithm to provide disease risk results.
Algorithms such as Dr. Ashley’s take as input (1) sequence variants and relevant phenotypes extracted from GWAS papers and (2) tables of case/control group counts. The information in the right-hand table of Fig. 4.1 must be extracted from literature reviews and meta-analyses. Unfortunately, most papers do not provide these details, and a lot of work is needed to extract the information. Once the table is completed, use the calculation method shown in Fig. 4.2.
First, take the prevalence of the specific disease in the race of the person being analyzed as the pretest probability, and calculate the pretest odds using the formula “pretest odds = pretest probability/(1 – pretest probability).” Then calculate the likelihood ratio (LR) for the person’s genotype using the equation at the bottom of Fig. 4.2. The LR is used to determine the posttest odds, which are the product of the pretest odds and the LR. When the variants are independent, the posttest odds are updated by sequentially multiplying by the LR value of each variant. Finally, the disease risk can be calculated from the posttest odds using the formula “posttest probability = posttest odds/(1 + posttest odds).”
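The calculation described above can be written out as a short Python sketch. It assumes the genotype count layout of Fig. 4.2 (cases a, b, c and controls d, e, f for genotypes AA, Aa, aa); the function names and all numbers in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def likelihood_ratio(case_counts, control_counts, genotype):
    """LR = genotype frequency in cases / genotype frequency in controls."""
    case_freq = case_counts[genotype] / sum(case_counts.values())
    control_freq = control_counts[genotype] / sum(control_counts.values())
    return case_freq / control_freq


def posttest_probability(prevalence, likelihood_ratios):
    """Update the pretest odds with one LR per independent variant."""
    odds = prevalence / (1 - prevalence)   # pretest odds
    for lr in likelihood_ratios:
        odds *= lr                          # running posttest odds
    return odds / (1 + odds)                # posttest probability


# Hypothetical genotype counts for one SNP.
cases = {"AA": 30, "Aa": 50, "aa": 20}
controls = {"AA": 20, "Aa": 50, "aa": 30}
lr_aa = likelihood_ratio(cases, controls, "AA")

# Hypothetical prevalence of 10% and a carrier of genotype AA.
print(posttest_probability(0.10, [lr_aa]))
```

Multiplying the likelihood ratios one after another is only valid when the variants are independent, as the text notes; correlated variants would be counted more than once.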
In this exercise, we use individual genome data in the 23andMe file format. The content of the practice file Personal_Genome.txt is shown in Fig. 4.3.
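As a rough illustration of how such a file can be read, the sketch below assumes the usual 23andMe raw-data layout: comment lines starting with '#', followed by tab-separated records of rsID, chromosome, position, and genotype. The function name read_23andme and the example record are hypothetical.

```python
def read_23andme(lines):
    """Map each rsID to (chromosome, position, genotype)."""
    genotypes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        rsid, chromosome, position, genotype = line.split("\t")[:4]
        genotypes[rsid] = (chromosome, int(position), genotype)
    return genotypes


# Made-up two-line example in the assumed layout.
example = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rs0000001\t1\t12345\tCT\n",
]
print(read_23andme(example))
```

The function takes any iterable of lines, so an open file handle for Personal_Genome.txt could be passed in place of the example list.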
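[Fig. 4.1: workflow for collecting input data. Determine disease → PubMed search → paper filter (papers that contain tables). Example table from a filtered paper: “Table 2 Genotypic frequencies of ESR1 SNPs in the Study of Osteoporotic Fractures,” with columns SNP, Genotype, Case n (%), Control n (%).]
[Fig. 4.2: genotype counts and likelihood ratio]
          AA   Aa   aa
Case      a    b    c
Control   d    e    f
\[ LR(AA) = \frac{a / (a + b + c)}{d / (d + e + f)} \]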
To predict the risk of Type 2 Diabetes, select the ‘Clinical’ tab, then ‘Diabetes,’ and finally ‘Compute my Type 2 Diabetes risk.’ A file selection screen will appear. Browse to and select the practice data file, after which a race selection screen will appear. Select ‘European’ and select ‘Compute my Type 2 Diabetes risk’ again. Figure 4.13 shows screenshots of this process. The sample sizes of the GWAS experiments and the list of SNVs related to Type 2 Diabetes are displayed. The calculated LR, running LR (posttest odds), and probability (posttest probability) obtained by comparison with the user’s genotype can be viewed in Fig. 4.5.
Next, select the ‘Clinical’ tab, then ‘All GWAS studies’. This uses the NHGRI-EBI GWAS
Catalog to determine disease relevance of individual SNVs. As in the previous exercise,
choose the race (‘European’), then select ‘show my GWAS SNPs’ to view individual SNV genotype information (Fig. 4.6).
To run Promethease on the practice data, load the ‘Personal_Genome.txt’ file, set the race to JPT (Japanese in Tokyo, Japan), and set the result file to be saved to your ‘Desktop.’ When this is done, select the ‘next’ button. The analysis takes about 20 min to complete (Fig. 4.8).
4.3 · Prediction of Genetic Disease Risk
Figure 4.9 shows the analysis results screen, which is given in HTML format. At the top, you can verify run information such as the Promethease version, creation date, and input file. At the bottom, the analysis results are classified into eight sections. The first three sections group variants into ‘Good news,’ ‘Bad news,’ and a third category. A ‘Good news’ variant has low disease risk or a positive effect, whereas a ‘Bad news’ variant has high disease risk or a negative effect. The third category, ‘They seem interesting, but have not been flagged as clearly Good or Bad,’ covers ambiguous variants. You can click the ‘…more…’ button located below each section for annotated results.
The Promethease annotation format displays one SNV and one phenotype correlation, as shown in Fig. 4.10. Annotations contain the magnitude and information on the SNV, including occurrence frequency. If you select the ‘…more…’ button next to the phenotype information, you can view more details. Selecting the SNV rsID link takes you to its SNPedia page to view detailed information on the variant.
‘Medicines’ and ‘Medical Conditions’ further classify the three sections according to phenotype. As shown in Fig. 4.11, these tabs report the number of related variants, the annotated phenotype, and the annotation. The bar graph on the left shows the number of SNVs in each category: red for “Bad News,” green for “Good News,” and grey for variants not classified as either. You can view the annotation by clicking the ‘…more…’ button located next to each phenotype (Figs. 4.10 and 4.11).
4.4 · Resources for Analyzing an Individual Genome
4.4.1 dbGaP
correlation studies. These include GWAS, genotyping, clinical sequencing, and molecular
diagnostic assays. The information provided by each of these studies can be summarized into four categories: documents related to the experiments, access authorization documents, genomic data and individual pedigree information, and linkage analysis and related statistical genetic information. Documents related to the experiments include experimental instructions, protocol instructions, and the data-collecting agency. dbGaP contains both ‘open’ data that are freely accessible and ‘controlled’ data that can be accessed only with permission. Most of the data are controlled, so access requests must be submitted. Currently, dbGaP contains 4792 datasets from 780 experiments (Table 4.1).
Table 4.1 (continued)
4.4.2 SNPedia
SNPedia (http://www.snpedia.com/index.php/SNPedia) is a wiki-based database started in 2006 that provides information on sequence variation and related diseases. SNPedia is composed of information from the research literature, data contributed by several cooperating individuals, and the NCBI RSS feed (Really Simple Syndication, http://ww.ncbi.nlm.nih.
phenotype, and other terms can be used to search in SNPedia (Fig. 4.12). As shown in the figure, information and SNPs related to Type 2 Diabetes can be verified from several references, and each SNP page provides detailed information on its variations.
4.4.3 PheGenI
PheGenI (Phenotype-Genotype Integrator, http://ncbi.nlm.nih.gov/gap/PheGenI) is provided by NCBI as a genome diversity database that integrates the GWAS Catalog with other NCBI tools. The database incorporates dbSNP, NCBI Gene, Genotype-Tissue Expression (GTEx) eQTL, and dbGaP genotype-phenotype correlation analysis data. Currently, PheGenI contains 66,063 phenotype-genotype correlations, about 54,000,000 dbSNP entries, about 40,000 NCBI genes, and about 61,000 eQTLs.
We can search PheGenI on the basis of either genotype or phenotype. For this practice, let’s search using individual genome data. Access the PheGenI homepage (Fig. 4.13).
In the Traits field under ‘Phenotype Selection,’ enter the keyword ‘Type 2 Diabetes’ and select the ‘Search’ button. It should give you 256 phenotype-genotype results, 29 genes, and 36 SNPs relating to Type 2 Diabetes (Fig. 4.14).
In the lower section of the results page, under ‘Association Results,’ we can see the related SNP rsIDs, genes, locations, and p-values indicating their association with Type 2 Diabetes. Detailed information from dbSNP or NCBI can be viewed by selecting the source articles, as shown in Fig. 4.15.
Fig. 4.15 “Type 2 Diabetes” PheGenI search results (1) – Association Results
Under ‘Genome View’, we can visualize a SNP’s location on the genome map. Under the ‘Genes’ and ‘SNPs’ sections, we can view gene and SNP data relating to Type 2 Diabetes. Published studies related to Type 2 Diabetes are listed under ‘dbGaP Studies’ (Fig. 4.16).
Next, we will search on an individual SNP from Personal_Genome.txt. This time, click ‘Genotype Selection’, select the ‘SNP’ tab, input ‘rs7903146,’ and select search (Fig. 4.17).
In Sects. 4.1, 4.2, 4.3, and 4.4, we used common variants (SNPs). Common SNPs are excellent markers for genome research, but their downside is that they have poor explanatory power. In Sect. 4.5, we will identify sequence variants with better explanatory power, using familial relationship data and extracting variants that are judged to have individual effects. This exercise partially repeats the process and results of “Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis” by Yaniv Erlich and his colleagues. In order to obtain results identical to those reported in the literature, we will simulate 1000 Genomes Project with 20,000 chromosome trio sequence data (Table 4.2). The data can be downloaded from the following FTP site, and any processed files are provided separately (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz) (Fig. 4.19).
For this exercise, use a Linux system with ANNOVAR installed (see Appendix B).
gda@GDA:~$ cd chapter04
gda@GDA:~/chapter04 $ ls
: Male
: Female
: HSP
SA12878
4.5 · Disease-Related Sequence Variation Analysis
4.5.2 Extracting Main Disease-Causing Variants from Sequence Data
##fileformat=VCFv4.0
..
##fileformat=VCFv4.0
..
SA12891
NOTICE: Read 305798 lines and wrote 232950 different variants at 232950 genomic
NOTICE: Among 232950 SNPs, 156825 are transitions, 76125 are transversions
2. In the same way, convert input files for mother and child.
3. Thus far, we have created three files, one each for the father, the mother, and the
child.
3. Next, calculate PolyPhen2 Scores using ANNOVAR, just like SIFT. With PolyPhen2,
scores below 0.15 are classified as benign, scores between 0.15 and 0.85 are possibly
damaging, and scores above 0.85 are probably damaging. In contrast to SIFT, higher
scored variants are considered more damaging.
4. As with the SIFT analysis, dropped and filtered files are created. Open the file labeled
father.hg19_ljbpp2_dropped. The values in the second column are the PolyPhen2
scores.
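The score ranges quoted above translate directly into a small classifier. This is a minimal Python sketch; the function name classify_polyphen2 is hypothetical, and treating the boundary values 0.15 and 0.85 as ‘possibly damaging’ is an assumption, since the text only says ‘below,’ ‘between,’ and ‘above.’

```python
def classify_polyphen2(score):
    """Label a PolyPhen2 score using the thresholds given in the text."""
    if score < 0.15:
        return "benign"
    if score <= 0.85:  # boundary handling is an assumption
        return "possibly damaging"
    return "probably damaging"


for value in (0.02, 0.50, 0.99):
    print(value, classify_polyphen2(value))
```

Unlike SIFT, where lower scores indicate more damaging variants, the labels here become more severe as the score increases.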
misc/accessory/
Step 1. Find variants that occur in splice sites or exonic regions. This limits analysis to
variants that more likely have effects on protein function.
Step 2. Find variants in conserved regions. Variants in conserved regions are more
likely to have large effects on function.
Step 3. Find variants not in segmental duplication regions. Most variants in segmental
duplication regions have smaller effects.
Steps 4,5,6. Remove variants observed in the 1000 Genomes Project (CEU, YRI, JPT,
CHB) (Phase 1 data). These variants are common and cannot be considered as factors
causing Mendelian disease.
Step 7. Remove variants registered in dbSNP.
Step 8. Map the remaining variants to genes.
Step 9. Find genes that include many variants.
In this exercise, we will perform each step in the order stated above.
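The logic of the nine steps can be summarized as a chain of filters followed by a per-gene count. The Python sketch below only illustrates that logic and is not how ANNOVAR is implemented; the dictionary keys (region, conserved, in_segdup, in_1000g, in_dbsnp, gene) and the example variants are hypothetical.

```python
from collections import Counter


def candidate_genes(variants):
    """Apply the filters of steps 1-7, then count variants per gene (steps 8-9)."""
    kept = [
        v for v in variants
        if v["region"] in {"exonic", "splicing"}  # Step 1
        and v["conserved"]                         # Step 2
        and not v["in_segdup"]                     # Step 3
        and not v["in_1000g"]                      # Steps 4-6
        and not v["in_dbsnp"]                      # Step 7
    ]
    counts = Counter(v["gene"] for v in kept)      # Step 8
    return counts.most_common()                    # Step 9


# Made-up variants: only the first one passes every filter.
variants = [
    {"region": "exonic", "conserved": True, "in_segdup": False,
     "in_1000g": False, "in_dbsnp": False, "gene": "GENE_A"},
    {"region": "intronic", "conserved": True, "in_segdup": False,
     "in_1000g": False, "in_dbsnp": False, "gene": "GENE_B"},
    {"region": "exonic", "conserved": True, "in_segdup": False,
     "in_1000g": True, "in_dbsnp": True, "gene": "GENE_C"},
]
print(candidate_genes(variants))
```

Because each filter only removes variants, the order of steps 1 through 7 does not change the final list in this sketch, although running the cheapest filter first reduces the work done by later steps.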
NOTICE: Reading FASTA sequences from
child.step1.exonic_variant_function
2. Variants at splicing sites and in exonic regions are saved to the file child.step1.
variant_function, which is also the input file for the next step. Execute the following
command to create the file child.step2.varlist, and use this file for step 2.
gda@GDA:~/chapter04 $ next_step.pl -outfile child -step 1
child.step2.hg18_phastConsElements44way
2. Variants in conserved regions are saved to the file child.step2.hg19_phastConsElements46way. For the next step, execute the command below, and verify that the file child.step3.varlist is created.
gda@GDA:~/chapter04 $ next_step.pl -outfile child -step 2
10209 regions
child.step4.hg18_CEU.sites.2009_04_filtered
/home/gda/program/annovar/humandb/hg18_CEU.sites.2009_04.txt...Done
child.step5.hg18_YRI.sites.2009_04_filtered
/home/gda/program/annovar/humandb/hg18_YRI.sites.2009_04.txt...Done
2. As before, variants remaining after removing known African population variants are
saved to the child.step5.hg19_1000g2012apr_afr_filtered file. Execute the following
command for the next step’s input file.
child.step6.hg18_JPTCHB.sites.2009_04_filtered
/home/gda/program/annovar/humandb/hg18_JPTCHB.sites.2009_04.txt...Done
2. Variants remaining after the process above are saved to the child.step6.hg19_1000g2012apr_asn_filtered file. Execute the following command for the next step’s input file.
child.step7.hg18_snp130_filtered
/home/gda/program/annovar/humandb/hg18_snp130.txt...Done
2. Variants remaining after removing registered dbSNP variants are saved to the file
child.step6.hg19_1000g2012apr_asn_filtered. Execute the following command for the
next step’s input file.
$ANNOVAR_HUMANDB
2. The default command automatically executes steps 1 through 9. However, users will
sometimes choose to leave out steps. In this exercise, we will execute the ANNOVAR
analysis without step 2. The command is as follows.
nonsynonymous SNV
CGREF1:NM_001166239:exon6:c.G430A:p.G144R,CGREF1:
NM_006569:exon6:c.G430A:p.G144R,CGREF1:NM_001166241:exon5:c.G142A:p.
nonsynonymous SNV
KIF1A:NM_001244008:exon47:c.C5114T:p.A1705V,KIF1A:
A het . 138
mother.step9.varlist child.step9.varlist
BMPR2
KIF1A
3. Double click the ‘training_set.genes’ file to open it, select all and copy, then paste them into the left field titled ‘Training Gene Set.’ In the right field titled ‘Test gene set,’ input the genes BMPR2 and KIF1A, identified in the previous exercise, and also input VWA3B and CGREF1, which were only found in the parent. Then select ‘Submit Query’ (Fig. 4.21).
4. Under “Training parameters,” select features to be analyzed, correction types, and a threshold value; then click ‘Start’ to begin the prioritization analysis (Fig. 4.22).
5. The four test genes entered are ranked using information from the training set. With
this data, the gene ‘KIF1A’ was identified as the most significant.
Exercises
[Exercise 1] - Using the VCF you own as the input, run GENOtation and Promethease and check the
medical interpretation results.
[Exercise 2] - If any variant is annotated as “Bad News” in [Exercise 1], use ANNOVAR to look into the variant more closely. Its SIFT and PolyPhen2 scores give more information on how damaging the variant is likely to be.
Take Home Message
• What the algorithm for predicting genetic disease risk is and what it means.
• How to analyze disease-related sequence variation.
Bibliography
1. 1000 Genomes Project Consortium et al (2010) A map of human genome variation from population-
scale sequencing. Nature 467:1061–1073
2. Ashley EA et al (2010) Clinical assessment incorporating a personal genome. Lancet 375(9725):1525–
1535
3. Cariaso M, Lennon G (2012) SNPedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic Acids Res 40(D1):D1308–D1312
4. Chen J et al (2007) Improved human disease candidate gene prioritization using mouse phenotype.
BMC Bioinformatics 8:392
5. dbGaP – http://www.ncbi.nlm.nih.gov/gap
6. Erlich Y et al (2011) Exome sequencing and disease-network analysis of a single family implicate a
mutation in KIF1A in hereditary spastic paraparesis. Genome Res 21:658–664
7. GENOtation – http://genotation.stanford.edu/
8. Karczewski KJ et al (2012) Interpretome: a freely available, modular, and secure personal genome
interpretation engine. Pac Symp Biocomput:339–350
9. Mailman MD et al (2007) The NCBI dbGaP database of genotypes and phenotypes. Nat Genet
39(10):1181–1186
10. Ng SB et al (2010) Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 42:
30–35
11. PheGenI – http://www.ncbi.nlm.nih.gov/gap/PheGenI
12. Promethease – http://snpedia.com/index.php/Promethease
13. Ramos EM et al (2013) Phenotype–genotype integrator (PheGenI): synthesizing genome-wide asso-
ciation study (GWAS) data with existing genomic resources. Eur J Hum Genet
14. Saccone SF et al (2010) SPOT: a web-based tool for using biological databases to prioritize SNPs after a genome-wide association study. Nucleic Acids Res 38(suppl 2):W201–W209
15. SNPedia – http://www.snpedia.com/index.php/SNPedia
16. SPOT – https://spot.cgsmd.isi.edu/submit.php
17. ToppGene site – https://toppgene.cchmc.org
18. Wang J et al (2007) Estrogen receptor alpha haplotypes and breast cancer risk in older Caucasian
women. Breast Cancer Res Treat 106(2):273–280
19. Wang K et al (2010) ANNOVAR: functional annotation of genetic variants from high-throughput
sequencing data. Nucleic Acids Res 38:e164
Advanced Microarray
Data Analysis
5.1 Introduction
Because traditional molecular research methods were developed to target a single gene, it has been difficult to go beyond understanding the product of a single gene and to study the network of interactions among numerous genes. However, molecular biological analysis has been transformed by new analysis technologies. High-throughput experimental techniques such as the DNA microarray measure tens of thousands of genes at the same time in a single experiment. Microarray research is also gaining significance through integrated analysis with continuously developing next-generation sequencing (NGS).
Because the microarray technique uses a single chip to measure tens of thousands of genes, it generates copious data in a short period, and gene expression is analyzed quantitatively.
One of the difficulties of quantifying actual gene expression under conditions standardized for whole-genome analysis is noise and low reproducibility. Because the experimental conditions have to be applied to the entire genome, the best conditions for measuring the expression of an individual gene cannot be chosen. For example, with the Northern blot technique it is possible to adjust the experimental conditions to detect low gene expression, but in a microarray experiment, which examines all genes at once, it is not easy to measure low gene expression accurately. The intrinsic property of the microarray is that it quantitatively analyzes an entire set of genes, and new analysis methods for microarray data must be developed accordingly.
A quantitative experiment with standardized experimental conditions also has an advantage. Because the measured values are standardized and quantified, meta-analysis of data from various experiments is gaining more attention. This property makes it possible to store and share microarray data generated around the world, and it is the backbone of public databases such as GEO,1 ArrayExpress,2 and Oncomine.3 Because microarray data in these databases are open to the public, researchers can use public data without performing experiments themselves. Understanding microarray data and acquiring the techniques to analyze them are therefore becoming more important.
1 https://www.ncbi.nlm.nih.gov/geo/
2 https://www.ebi.ac.uk/arrayexpress/
3 https://www.oncomine.org/
5.2 Microarray Experiment
A DNA microarray experiment proceeds in five steps, shown in Fig. 5.1. Step 1 is making the DNA chip. In the past, each lab built its own robot to spot DNA clones, amplified and purified by PCR, onto the chip, producing a ‘robotically spotted cDNA microarray.’ However, it is more cost-effective to buy and use a mass-produced chip. A mass-produced chip is manufactured by synthesizing oligonucleotides directly on the chip or by depositing them with a robotic printer.
DNA chips can be divided into cDNA chips (200–500 base pairs) and oligonucleotide chips (15–100 base pairs), depending on the sequences used. They can also be divided by manufacturing method: robotically spotted chips made with pin or inkjet printing, and photolithography chips, such as those from Affymetrix, made with a semiconductor manufacturing process. A cDNA chip carries spotted ORFs (open reading frames, i.e., genes) or ESTs (expressed sequence tags), and detects genes that have the complementary sequence. (1) Affymetrix’s short-oligonucleotide chip is synthesized layer by layer on a small glass plate using the photolithography technology used to manufacture semiconductor chips. Its 25-mers, each composed of a 25-base sequence, can theoretically distinguish 4^25 different sequences. A robotically spotted chip typically carries 10,000–30,000 sequences, whereas a photolithography chip can carry more than 1,000,000 sequences on a single slide. (2) Instead of cDNA, technology that prints 60–100-mer synthetic oligonucleotides has received attention. A 50–70-mer synthetic oligonucleotide library is cheaper; it overcomes both the disadvantages of Affymetrix’s short, expensive 25-mers and the disadvantage of cDNA, namely the difficulty of maintaining a clone library.
[Fig. 5.1: five steps of a microarray experiment. (1) Make biochip. (2) Interesting patients, animals, or cell lines; appropriate tissue and appropriate conditions; extract RNA. (3) Hybridize biochip. (4) Scan biochip. (5) Data pre-processing, functional clustering, access significance, post-cluster analysis & integration, biological validation, informatical validation.]
Step 2 is preparation of the specimens: mRNA is isolated separately from the experimental and control specimens, and cDNA labeled with the red Cy5 or green Cy3 fluorescent dye is created during reverse transcription. Step 3 is hybridization: equal amounts of the fluorescently labeled samples, with the experimental specimen labeled red and the control specimen labeled green, hybridize with the complementary sequence probes on the chip.4 In this process, the two differently labeled samples competitively hybridize with the fixed probes on the chip under the same conditions. A robotically spotted chip often uses this ‘two-dye technique,’ in which two samples labeled with two colors bind competitively. Specimen that did not hybridize must be removed by washing after the reaction.
Step 4 is quantification, with a laser fluorescence scanner, of the brightness of the fluorescence of each probe that hybridized with the fluorescently labeled samples. mRNA expression is quantified from the ratio of the red to green intensities of the two fluorescent dyes. The color intensity is converted into a numerical value after the image is analyzed. The amount of red light obtained from Cy5 and the amount of green light obtained from Cy3 are combined computationally into a red~yellow~green spectrum, which visualizes the expression values as ratios between the experimental and control specimens. The Affymetrix chip arranges 25-mer sequences that match the target gene perfectly alongside sequences that have a mismatch at base 13, the midpoint of the 25-mer, and compares the binding ratios of the two sequences as a control for noise due to cross-hybridization.
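The red:green ratio described in Step 4 is commonly handled on a log2 scale so that up- and downregulation are symmetric around zero. The following Python sketch illustrates this under stated assumptions: the function name log2_ratio, the floor value that guards against zero intensities, and the example intensities are all hypothetical.

```python
import math


def log2_ratio(red, green, floor=1.0):
    """log2 of the red (Cy5) to green (Cy3) intensity ratio for one spot.

    Intensities below `floor` are raised to it to avoid division by zero.
    """
    return math.log2(max(red, floor) / max(green, floor))


# Made-up spot intensities: (red, green)
spots = [(400.0, 100.0), (100.0, 400.0), (250.0, 250.0)]
for red, green in spots:
    print(red, green, log2_ratio(red, green))
```

On this scale a fourfold increase and a fourfold decrease give values of equal magnitude and opposite sign, and a spot with equal red and green intensity gives zero, which corresponds to yellow in the red~yellow~green display.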
Step 5 is the analysis of the numerical expression value of each gene. The detailed analysis steps are introduced in Sects. 5.3, 5.4, and 5.5. The experimental steps described above are based on the two-color technique that uses red and green. However, as the technology has become more precise, the use of the one-color (monotone) technique is growing, and its measured values have become more reliable. The basic data analysis principles of the two techniques do not differ considerably.
The data of microarray experiments are represented as a gene expression matrix with three parts. (1) A list and annotation of genes, which provides links to the relevant databases. (2) A list and annotation of specimens, which provides links between the classification of species and a taxonomy database. (3) The elements of the gene expression matrix, each of which is the numerical value of gene expression. This gene expression matrix is generated in three steps: (1) raw data created by laser fluorescence scanning, (2) quantification matrices, which hold the various measured values and reference counts from each experiment (i.e., each chip or each hybridization), and (3) the gene expression matrix, which collects the numerical gene expression values calculated from the quantification matrices of many experiments (Fig. 5.2).
4 In the labeling method, reverse transcription (RT) converts mRNA into complementary single-stranded cDNA using an oligo-dT primer and reverse transcriptase, for technical convenience. Cy3-dUTP, which emits green light, and Cy5-dUTP, which emits red light, are added to the two reactions separately, so that all mRNA in the reference and test cells is converted into Cy3- or Cy5-labeled target cDNA.
5.3 · Structure and Normalization of Microarray Data
[Fig. 5.2: from raw data to the gene expression matrix. Raw data (array scans) → quantification matrices (spots × quantifications; each element is a quantification datum) → gene expression data matrix (genes × samples; each element is a gene expression level), accompanied by gene annotation and sample annotation.]
simple and used for qualitative or semi-quantitative analysis, which skips data normalization and preprocessing. Microarray experiments, in contrast, produce large-scale data based on relative values5 of light intensity, and each chip varies because of the manufacturing process. Moreover, the red and green dyes have different DNA binding and signal detection efficiencies. Data normalization and preprocessing are therefore required.
Data normalization normally uses one of three techniques. (1) Spiked control or index gene: a housekeeping gene⁶ that has a comparatively constant expression level is used as an index to compute the expression of other genes. In a two-color experiment, the housekeeping genes of the experimental and control specimens, dyed red and green, are calibrated to a ratio of 1:1. This assumes that housekeeping gene expression is constant across all specimens and conditions. (2) Global normalization of the total amount of fluorescence: assuming that the total fluorescence emitted from one chip (i.e., the total amount of mRNA) is constant, the intensities are scaled so that the total expression of the specimens being compared is equal. (3) Fluorescence intensity-dependent normalization: this follows research showing that the detected fluorescence ratio is a function of the fluorescence intensity.⁷ A curvilinear regression technique is used to estimate this intensity-dependent trend, under the assumption that the total amount of increased expression and the total amount of decreased expression are the same. Various other normalization methods are also recommended, depending on the properties of the data. The three methods above apply to both single-color and two-color experimental techniques.
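Methods (2) and (3) can be sketched in a few lines of base R. This is a minimal illustration on simulated two-color intensities, not the procedure of any particular package: the variable names, the shape of the simulated dye bias, and the lowess span `f = 0.3` are all assumptions made for the example.

```r
# Simulated two-color data (hypothetical values, for illustration only)
set.seed(1)
n_spots  <- 2000
A_true   <- runif(n_spots, 6, 14)           # average log2 intensity of each spot
M_true   <- rnorm(n_spots, 0, 0.3)          # true log2 ratios, centered on 0
dye_bias <- 0.8 * exp(-(A_true - 6) / 3)    # assumed intensity-dependent dye bias
red   <- 2^(A_true + (M_true + dye_bias) / 2)
green <- 2^(A_true - (M_true + dye_bias) / 2)

# (2) Global normalization: rescale one channel so that both channels
#     have the same total fluorescence
green_scaled <- green * sum(red) / sum(green)
M_global <- log2(red) - log2(green_scaled)

# (3) Intensity-dependent normalization: fit a smooth curve of the log ratio M
#     against the average log intensity A and subtract the fitted trend
M <- log2(red) - log2(green)
A <- (log2(red) + log2(green)) / 2
fit <- lowess(A, M, f = 0.3)
M_lowess <- M - approx(fit$x, fit$y, xout = A, ties = mean)$y

# The global method only shifts the ratios; the intensity trend remains
round(c(global = cor(M_global, A), lowess = cor(M_lowess, A)), 3)
```

Plotting `M_global` and `M_lowess` against `A` shows why the intensity-dependent method is preferred when the dye bias changes with spot intensity: the global method leaves the curved trend in place.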
In two-color data, where expression is calculated as the ratio of the experimental specimen to the control specimen, a gene with a low expression level shows large fluctuations in its expression ratio.
The objective of the most typical microarray experiment is to obtain a list of genes that are upregulated or downregulated in a specimen compared to a control. One can get a good result by analyzing the fold difference due to the intervention (DeRisi et al. 1997; Heller et al. [13]). Statistical analysis, however, compares average values. The t-test, which uses the t-distribution, is most often used to compare two independent or paired samples, and ANOVA, which uses the F-distribution, is used to compare the average values of three or more groups. With slight modifications, these principles are applied to find differentially expressed genes in microarray data.
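As a minimal sketch of the two ideas in this paragraph, the following base R code computes a fold difference and a t-test p-value for every gene of a small simulated expression matrix. The matrix, the group labels, and the size of the simulated effect are assumptions made for illustration; the values are taken to be on a log2 scale, so the difference in group means is a log fold change.

```r
set.seed(42)
n_genes <- 100
group <- factor(rep(c("control", "treated"), each = 3))

# Simulated log2 expression matrix: genes in rows, samples in columns
expr <- matrix(rnorm(n_genes * 6, mean = 8, sd = 0.3), nrow = n_genes)
expr[1:10, 4:6] <- expr[1:10, 4:6] + 2    # the first 10 genes are truly upregulated

# Fold difference: difference of the group means on the log2 scale
log_fc <- rowMeans(expr[, group == "treated"]) - rowMeans(expr[, group == "control"])

# Welch t-test for every gene (the default of t.test)
pval <- apply(expr, 1, function(x) t.test(x ~ group)$p.value)

# Genes ranked by p-value; the fold change alone ignores the within-group variability
head(order(pval), 10)
```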
However, a microarray experiment often cannot fulfill the assumptions of the traditional t- or F-test because of the low number of repeated experiments. In particular, when the two-color technique is used, the distribution of the expression values, which are expressed as ratios, is difficult to characterize. There are further limitations, such as many missing values. Analytic methods for microarray data have been developed to overcome these limitations of traditional statistical analysis of average values. Pierre Baldi's Cyber-T⁸ and Tibshirani's SAM (significance analysis of microarrays) are representative methodologies that can generate comparatively stable results with a low number of repeat experiments.
Even if the differential expression analysis of a single gene is performed perfectly, a large number of significance tests cannot simply be conducted at the same time. The multiple hypothesis testing problem is that, as the number of hypotheses tested simultaneously increases, the number of false positives increases with the number of tests. There are various ways to address the multiple hypothesis testing problem, but the representative methods are (1) controlling the FWER (family-wise error rate) and (2) controlling the FDR (false discovery rate). FWER is the probability of at least one false positive among the n tests performed.⁹ However, FWER control tends to reach an excessively conservative conclusion in microarray analysis, where the number of tests is large, and limiting "the probability that the presented result contains at least one false positive" is of little practical meaning; thus, FDR is applied more extensively in microarray data analysis. FDR estimates the proportion of false positives among the identified genes; for instance, of 100 identified genes, 5 are expected to be false positives, although which ones is not specified.¹⁰
⁵ Instead of using absolute light intensity, the relative value of the opening of the aperture or the sensitivity setting of the light sensor is used.
⁶ Often, rRNA and GAPDH (glyceraldehyde 3-phosphate dehydrogenase) are used.
⁷ Dye bias is dependent on spot intensity.
⁸ http://cybert.ics.uci.edu
5.5 · Cluster Analysis and Interpretation
In Sect. 5.3, cluster analysis of differentially expressed genes is performed. Cluster analysis is an analysis of similarity structure. For example, the 800 genes that are involved in the yeast cell cycle were analyzed with regard to their periodicity and correlation coefficients. Cluster analysis finds correlated genes, but it does not reveal their function. Instead, it generates new hypotheses about how an unknown gene relates to other genes, which reduces the time and effort of planning the next experiment. After finding gene clusters under various experimental conditions, it is a good strategy to look for genes that remain tightly co-regulated regardless of changing conditions.
Cluster analysis can be separated into (1) hierarchical cluster analysis and (2) partitional cluster analysis, depending on the cluster structure. Hierarchical cluster analysis organizes the data into a hierarchical tree, whereas partitional cluster analysis divides the data into separate partitions. Clustering methods can also be separated into two types by the way the clusters are formed: the partition type and the merge type. The partition type starts with the whole data set as one cluster and splits it into smaller, dissimilar clusters, whereas the merge type starts with N clusters of size 1 and then accumulates similar clusters into one big cluster.
Various measures can be used as the similarity distance, such as Euclidean distance, vector angle, and correlation coefficient. The result differs depending on the objective of the analysis and the distance applied. To capture the similarity of gene profiles, the vector angle or the correlation coefficient, which reflects the shape of the profile, is more reasonable than Euclidean distance. In some cases it is reasonable to use the square or the absolute value of the correlation coefficient, so that inversely correlated profiles fall into one cluster. The distance should therefore be chosen to suit the purpose of the analysis. The way clusters are formed likewise needs careful consideration. To address these issues, there are various analysis strategies, such as methods that first divide the data into many clusters and then relate the clusters to one another, and the SOM (self-organizing map) method, which reconstitutes the interrelationship of the obtained clusters.
⁹ FWER = Pr(α > 0).
¹⁰ pFDR = E(α / rejected | rejected > 0).
Chapter 5 · Advanced Microarray Data Analysis
Fig. 5.3 Over-representation analysis of an overexpressed gene list resulting from differential expression analysis and cluster analysis. (a) Differential gene expression. (b) Set-wise differential expression (GSEA)
[Figure labels: expression level of genes in a cluster; selected and not selected genes; biological pathways and Gene Ontology terms (GO:000123, GO:000126); significantly over-represented versus random.]
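The effect of the distance choice can be seen on four toy profiles. The sketch below uses base R only; the profiles `g1` to `g4` are invented for illustration, and "1 minus correlation" is one common way, assumed here, of turning a correlation coefficient into a distance.

```r
# Four hypothetical expression profiles over five conditions
g1 <- c(1, 2, 3, 4, 5)
g2 <- g1 + 10                      # same shape as g1, much higher level
g3 <- c(5, 4, 3, 2, 1)             # mirror image of g1 (inversely correlated)
g4 <- c(3.1, 2.9, 3.0, 3.2, 2.8)   # flat profile at a level similar to g1
profiles <- rbind(g1, g2, g3, g4)

d_euclid <- dist(profiles)             # Euclidean distance: sensitive to level
r        <- cor(t(profiles))           # Pearson correlation between profiles
d_cor    <- as.dist(1 - r)             # shape-based distance
d_abscor <- as.dist(1 - abs(r))        # treats inverse correlation as similarity

round(as.matrix(d_euclid), 2)
round(as.matrix(d_cor), 2)
round(as.matrix(d_abscor), 2)

# Hierarchical clustering on the shape-based distance
hc <- hclust(d_cor, method = "average")
cutree(hc, k = 2)
```

With Euclidean distance, `g1` is closer to the flat profile `g4` than to `g2`, although `g1` and `g2` have exactly the same shape. With the correlation-based distance, `g1` and `g2` are at distance 0, and with the absolute correlation the mirrored profile `g3` joins them as well.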
A cluster extracted by the analysis is normally interpreted biologically as a set of tightly co-regulated genes. The obtained cluster can then be annotated with GO (Gene Ontology) terms or biological pathways and interpreted biologically through a significance test of these annotations. Over-representation analysis (ORA) tests the significance of an annotation as follows: (1) a gene list is given as input; (2) the number of genes in the list that carry a specific GO annotation or belong to a specific pathway is compared with the number expected at random; and (3) the comparison is tested with the hypergeometric distribution (Fig. 5.3). The input gene list is typically the list of differentially expressed genes. For instance, only upregulated genes, or only genes whose expression has risen at least twofold, can be used as input. Over-representation analysis will be explained in Chap. 6.
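The hypergeometric test in step (3) can be written directly with base R. The counts below (size of the gene universe, number of annotated genes, length of the input list, and overlap) are made-up numbers for illustration; the one-sided Fisher exact test on the corresponding 2 × 2 table gives the same p-value and is shown as a cross-check.

```r
# Hypothetical counts for one GO term or pathway
n_universe  <- 10000   # all genes measured on the array
n_annotated <- 200     # genes in the universe that carry the annotation
n_list      <- 100     # genes in the input list (e.g., upregulated genes)
n_overlap   <- 8       # genes in the list that carry the annotation

# Number of annotated genes expected in the list by chance
expected <- n_list * n_annotated / n_universe

# P(overlap >= n_overlap) under the hypergeometric distribution
p_hyper <- phyper(n_overlap - 1, n_annotated, n_universe - n_annotated,
                  n_list, lower.tail = FALSE)

# Equivalent one-sided Fisher exact test on the 2 x 2 table
tab <- matrix(c(n_overlap,               n_annotated - n_overlap,
                n_list - n_overlap,      n_universe - n_annotated - (n_list - n_overlap)),
              nrow = 2,
              dimnames = list(c("in list", "not in list"),
                              c("annotated", "not annotated")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(expected = expected, p_hyper = p_hyper, p_fisher = p_fisher)
```

In practice this test is repeated for every GO term or pathway, so the resulting p-values also need a multiple testing correction.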
5.6 · Classification Analysis
Classification analysis assigns samples to classes that have been defined in advance. Cluster analysis, which is an exploratory analysis technique, creates new clusters by similarity analysis without any information about how the samples are classified, whereas classification analysis is a machine learning technique that can be applied when class label information for the samples is given. Classification analysis is called 'supervised learning,' because it uses the class label information, whereas cluster analysis is called 'unsupervised learning,' because it does not.
For example, each gene expression profile obtained from n liver tissues and m brain cells receives a class label, 'liver' or 'brain.' These labels are treated as the standard. From the two labeled data matrices, a classifier, which is a type of discriminant function, is created. The classifier is then applied to unknown data to predict and deduce the most appropriate label. Medically, diagnosis and classification are similar, in that a diagnostic problem can be solved through the classification analysis of gene expression data.
The created classifier can be evaluated by cross-validation. The original data are divided into a training set and a test set; a classifier is derived from the training set and then applied to the test set to evaluate its performance. Cross-validation is performed numerous times to optimize a classifier under various designs. In Chap.
Linear discriminant analysis (LDA), a statistical method, finds the linear combination of variables that best separates two or more classes of samples. It is used as a linear classifier and for dimension reduction before further classification analysis. Its advantages in microarray data analysis are evident: the resulting model is easy to explain. The downside is that it is sensitive to the measured values of the individual genes. Moreover, if the variance differs between classes, accuracy decreases. Dudoit et al. reported that DLDA, which uses only the p per-gene variances, showed better results on re-substitution error rates than LDA, which uses the full covariance structure.
The SVM (support vector machine), based on 'Statistical Learning Theory,' was established by Vapnik (1995) and was first applied to microarray data by Terrence et al. (2000). SVM is a technique that finds a hyperplane that separates two groups in a high-dimensional space; the hyperplane is defined by the support vectors.
K-NN (k-nearest neighbors) is conceptually the most intuitive method. Based on distance, K-NN treats objects that are close to each other as neighbors and assumes that an object shares similar information with its neighbors. Because of this property, the K-NN method is widely used for imputing missing values and for estimating probability density. It is an old, elementary, and intuitive algorithm, but it remains useful.
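The K-NN rule and its evaluation by cross-validation fit in a short base R sketch. The data are simulated, the helper `knn_predict` is a hypothetical function written for this example (it is not the `knn` function of the class package), and leave-one-out cross-validation is used as one simple form of the training/test split described above.

```r
set.seed(7)
n_per_class <- 20
n_genes     <- 50
labels <- factor(rep(c("liver", "brain"), each = n_per_class))

# Simulated expression matrix: samples in rows, genes in columns
expr <- matrix(rnorm(2 * n_per_class * n_genes), ncol = n_genes)
expr[labels == "liver", 1:10] <- expr[labels == "liver", 1:10] + 3  # 10 informative genes

# Hypothetical helper: classify one sample by majority vote of its k nearest neighbors
knn_predict <- function(train, train_labels, test_row, k = 3) {
  d <- sqrt(colSums((t(train) - test_row)^2))   # Euclidean distance to every training sample
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_labels[nearest])
  names(votes)[which.max(votes)]
}

# Leave-one-out cross-validation: each sample is predicted from all other samples
pred <- vapply(seq_len(nrow(expr)), function(i) {
  knn_predict(expr[-i, , drop = FALSE], labels[-i], expr[i, ], k = 3)
}, character(1))

accuracy <- mean(pred == as.character(labels))
table(predicted = pred, actual = labels)
accuracy
```

An odd `k` avoids tied votes when there are two classes. In a real analysis, any gene selection step must be repeated inside each cross-validation fold, otherwise the estimated accuracy is too optimistic.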
5.7 · Gene Set Enrichment Analysis
biologically predefined gene set with statistically significant difference in its gene expression. Thus, unlike differential expression analysis, the results of GSEA lend themselves directly to biological interpretation.
GSEA also differs from optimization analysis in how its results can be interpreted. With optimization analysis it is possible, for instance, to search all possible combinations of genes for the gene group that shows the largest differential expression between two experimental conditions. Optimization analysis can therefore find a gene group with greater significance than GSEA does. However, the biological meaning of an optimized gene group remains unknown.
From this perspective, GSEA is the result of bioinformatics research that combines informatics algorithms with biological knowledge and intuition. GSEA is a very important algorithm that has been developed further in various modified forms and has opened a new paradigm of microarray analysis. This research showed the importance of large, biologically meaningful gene sets and of establishing the relevant databases. When GSEA was introduced, it was applied to interpret microarray data on muscle tissue of diabetic patients. These data had originally been created by another research group, and no differentially expressed genes had been detected between diabetic patients and normal people in the differential expression analysis. Mesirov's group developed GSEA and defined various gene sets beforehand. When the differential expression of each gene set was tested for significance, the biological pathways that differed significantly between the two groups were remarkably relevant to diabetes (Fig. 5.4b). Individually, the genes of gene set S0 in Fig. 5.4b showed slight differential expression that did not reach significance. Collectively, however, as shown in the right panel, they were distributed on one side (i.e., below the line C1 = C2), so the set as a whole could be concluded, without much objection, to be statistically significant. This method helps to overcome the noise of individual gene analysis methods, the lack of individual genes that differ significantly between two groups, and inconsistent results between two microarray platforms. In Chap. 8, a detailed exercise for GSEA will be discussed.
Fig. 5.4 Comparison of differential expression, GSEA, and clustering in microarray data analysis. (a) Differential expression. (b) Set-wise differential expression (GSEA). (c) Coexpression (or clustering)
• Based on the expression pattern, range, and ranks of all genes, ordered by the selected distance function (e.g., expression ratio), Kolmogorov-Smirnov statistics are used to calculate the degree to which a designated gene set is concentrated in a specific portion of the list (e.g., where the expression ratio is high).
• At this point, both the differentially expressed genes and all of the other genes in the relevant gene set are used for the calculation.
• After the maximum total score is confirmed, the genes with greater significance in the set can be extracted.
• After random permutation of the phenotype labels of the samples, the p-value and FDR can be determined by calculating the enrichment score for each permutation.
• Adjust the significance level for multiple hypothesis testing. For each gene set, the enrichment score obtained in the previous step is normalized to the size of the gene set to create an NES (normalized enrichment score). The statistically significant gene sets are decided by controlling the false positive rate through the FDR calculated for each NES.
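The steps above can be sketched for a single gene set in base R. This is a simplified, unweighted Kolmogorov-Smirnov-like running sum on simulated data, not the published GSEA implementation: the ranking metric (difference of group means), the step sizes, the number of permutations, and all function and variable names are assumptions made for illustration, and the normalization to an NES and the FDR across many gene sets are omitted.

```r
set.seed(123)
n_genes     <- 500
n_per_group <- 10
pheno <- rep(c(0, 1), each = n_per_group)           # phenotype labels of the samples

expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes,
               dimnames = list(paste0("g", seq_len(n_genes)), NULL))
gene_set <- paste0("g", 1:25)                        # hypothetical predefined gene set
expr[gene_set, pheno == 1] <- expr[gene_set, pheno == 1] + 1   # modest coordinated shift

# Ranking metric for every gene: difference of the group means
rank_metric <- function(expr, pheno) {
  rowMeans(expr[, pheno == 1]) - rowMeans(expr[, pheno == 0])
}

# Enrichment score: walk down the ranked list, step up at genes in the set,
# step down otherwise, and keep the maximum deviation from zero
enrichment_score <- function(metric, gene_set) {
  ordered <- names(sort(metric, decreasing = TRUE))
  hit     <- ordered %in% gene_set
  step    <- ifelse(hit, 1 / sum(hit), -1 / sum(!hit))
  running <- cumsum(step)
  running[which.max(abs(running))]
}

es_obs <- enrichment_score(rank_metric(expr, pheno), gene_set)

# Null distribution: permute the phenotype labels and recompute the score
es_null <- replicate(200, enrichment_score(rank_metric(expr, sample(pheno)), gene_set))
p_value <- (sum(abs(es_null) >= abs(es_obs)) + 1) / (length(es_null) + 1)

c(ES = es_obs, p = p_value)
```

Permuting the sample labels, rather than the genes, keeps the correlation structure between genes intact, which is why the bullet list above permutes phenotype labels.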
Gene annotation databases, which contain biological information about genes, include Cytogenetic Band, KEGG pathway, and Gene Ontology. MIT created MSigDB,¹¹ which combines many gene set databases for GSEA. MSigDB contains 17,779 gene sets in 8 main groups.
• Hallmark gene sets: Summarize and represent specific well-defined biological states or processes and display coherent expression.
• Positional gene sets: Sets of genes grouped by chromosome and cytogenetic band, carrying the chromosomal location information that is relevant to events such as deletion, amplification, and epigenetic silencing.
• Curated gene sets: Gene sets collected and organized from biological pathway databases, PubMed papers, or domain experts; they comprise chemical and genetic perturbations and canonical pathways (BioCarta, KEGG, and REACTOME).
• Motif gene sets: Gene sets that share a conserved cis-regulatory motif across the human, mouse, rat, and dog genomes, including the target genes of miRNAs and transcription factors.
• Computational gene sets: Gene sets that show expression patterns similar to about 380 cancer-related genes (oncogenes, tumor suppressor genes).
• GO gene sets: Each gene set comprises the genes mapped to one Gene Ontology term.
• Oncogenic signatures: Gene sets that represent signatures of cellular pathways which are often dysregulated in cancer.
• Immunologic signatures: Gene sets that represent cell states and perturbations within the immune system, collected from human and mouse immunology.
the same time. Alizadeh et al. suggested that gene expression can reveal differences in survival rate between subgroups of clinically similar lymphoma. New subgroups can thus be added to the existing disease classification system. Moreover, within the subgroups classified by the International Prognostic Index (IPI) value used for prognosis, they confirmed that the survival function differs depending on the gene expression information. This implies that gene expression information can be used together with the IPI to establish a specific therapy for each patient. Such a standard classifies a patient's condition in more detail by adding genetic information to the patient's symptoms. Moreover, this standard is the foundation of customized medicine, facilitating individual prescriptions by predicting the prognosis¹³ with genetic information.
Bibliography
1. Alizadeh AA et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression
profiling. Nature 403(6769):503–511
2. Baldi P, Long AD (2001) A Bayesian framework for the analysis of microarray expression data: regular-
ized t-test and statistical inferences of gene changes. Bioinformatics 17(6):509–519
3. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 57:289–300
4. DeRisi JL et al (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278(5338):680–686
5. Dudoit S et al (2002) Comparison of discrimination methods for the classification of tumors using
gene expression data. J Am Stat Assoc 97(457):77–87
6. Dudoit S et al (2002) Statistical methods for identifying differentially expressed genes in replicated
cDNA microarray experiments. Stat Sin 12:111–139
7. Dudoit S et al (2003) Multiple hypothesis testing in microarray experiments. Stat Sci 18:71–103
8. Durbin BB et al (2002) A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18(suppl 1):S105–S110
9. Eisen MB, Brown PO (1999) DNA arrays for analysis of gene expression. Methods Enzymol 303:3–18
10. Eisen MB et al (1998) Cluster analysis and display of genome-wide expression patterns. PNAS
95(25):14863–14868
11. Golub TR et al (1999) Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science 286(5439):531–537
12. Guyon I et al (2002) Gene selection for cancer classification using support vector machines. Mach
Learn 46(1/3):389–422
13. Heller RA et al (1997) Discovery and analysis of inflammatory disease-related genes using cDNA
microarrays. Proc Natl Acad Sci U S A 94(6):2150–2155
14. Huber W et al (2002) Variance stabilization applied to microarray data calibration and to the quantifi-
cation of differential expression. Bioinformatics 18(suppl 1):S96–S104
15. Kohonen T (1995) Self-organizing maps. Springer
16. Newton MA et al (2001) On differential variability of expression ratios: improving statistical inference
about gene expression changes from microarray data. J Comput Biol 8(1):37–52
Gene Expression Data Analysis
6.1 Introduction
6.2 Prerequisites
Bibliography
6.1 Introduction
Functional cluster analysis is a systematic method for analyzing microarray data. Cluster analysis is useful for identifying gene clusters that are similarly expressed or tightly co-regulated according to changes in experimental conditions. It is an exploratory data analysis method that can process large amounts of data, because it does not require a hypothesis for data analysis. We can assume that genes with similar expression patterns share similar functions. However, many genes within a co-regulated gene family have unknown functions. Cluster analysis helps researchers generate data-driven hypotheses through correlation analysis of unknown genes in complicated data, saving their time and effort in designing research plans.
Cluster analysis can be a good tool when there is no prior knowledge of the complicated observed data. This method is categorized as unsupervised machine learning in the field of artificial intelligence. Supervised machine learning, on the other hand, is useful when there is prior knowledge of the observed data. For example, the prior knowledge may consist of class labels, such as “control group and disease group” or “malignant neoplasm and benign neoplasm.” Supervised learning is used in classification analysis, which is utilized in many areas in medicine, such as diagnosis and prognosis of diseases. Figure 6.1 is an introduction to
6.2 Prerequisites
• Download five txt files in the ch06 directory of the Appendix files.
[Figure fragment. Clustering analysis: hierarchical clustering, K-means clustering, Self-Organizing Map (SOM) clustering.]
> biocLite("genefilter")
> library(hgu95av2)
> library(annotate)
> library(samr)
> library(som)
> library(golubEsets)
> library(class)
> library(e1071)
> library(MASS)
> library(genefilter)
※ In Windows version R, there might be an error saying that your library function is not
installed. This kind of error can be resolved by installing the package as described in Sect.
B.2 of Appendix B.
library (sam)
Practice R code
The file (input_microarray.txt) used for this exercise is a tab-delimited ASCII file. The format of this file is described in more detail in Sect. 5.3 of Chap. 5. Once you open the input_microarray.txt file, you will find a matrix of 10 columns and 22777 rows. The execution codes are in the process.txt file (Fig. 6.2). If you need additional practice, you can refer to Appendix A for basic analysis using the R program. You can find the R codes for the additional exercises in the R_basics.txt file.
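# Find the file using the menu, save the path in fn, and read it using read.table
> fn <- file.choose()
> input <- read.table(fn, header = TRUE, row.names = 1, sep = "\t") # Arguments assumed from the tab-delimited format described above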
> head(input)
             G1.CEL    G2.CEL    G3.CEL   L1.CEL    L2.CEL    L3.CEL    N1.CEL    N2.CEL   N3.CEL
1007_s_at 10.461447 11.341586 11.023860 9.320934 11.440068 11.309118 10.086800 10.129414 9.966061
1053_at    8.675580  8.267232  8.486979 9.243647  8.732876  9.230934  7.757453  7.886887 8.014026
117_at     7.847506  7.571837  8.113888 7.619820  7.897814  7.285058  7.402672  7.559963 7.583754
121_at     8.300985  7.837589  7.915578 7.747076  7.772876  8.452841  8.407166  8.504255 8.102939
1255_g_at  7.002520  6.970377  7.000953 7.030557  7.061738  7.048979  8.059134  7.563080 7.658484
1294_at    8.304798  8.297717  8.010718 7.589431  7.520612  7.834773  7.698972  7.778027 7.783477
The following is the code in microarray.txt. Please refer to this to perform the exercise in Sect. 6.4.
#########################
# Differential Expression Analysis #
#########################
# T-Test
> input_gn <- input[, c(1, 2, 3, 7, 8, 9)] # Columns 1 to 3 form group 1 and columns 7 to 9 form group 2
> list_gn <- c(1, 1, 1, 2, 2, 2)
> group_gn <- factor(list_gn)
> t.test(as.numeric(input_gn[1, ]) ~ group_gn) # Perform a t-test for the gene in the first row of the data
> fc <- vector() # Produce a vector to save the fold change values
> pval <- vector() # Produce a vector to save the p-values
> for (i in 1:nrow(input_gn)) # Perform a t-test for all genes
{
rst <- t.test(as.numeric(input_gn[i, ]) ~ group_gn) # as.numeric() lets the row of a data frame be used in the formula
fc[i] <- diff(rst$estimate) # Save the difference of the group means (log fold change) in the vector fc
pval[i] <- rst$p.value # Save the p-value in the vector pval
}
# Volcano Plot
> id <- which(pval < 0.05) # Extract the genes with p-values less than 0.05
> plot(fc, -log10(pval)) # Plot the fold changes against the p-values
> points(fc[id], -log10(pval)[id], col = "red") # Mark the significantly differential genes with red dots
# FDR
> fdr <- p.adjust(pval, method = "fdr") # Adjust the p-values by the FDR method
> id <- which(fdr < 0.05) # Extract the genes with FDR-adjusted p-values less than 0.05
> fc[id]
# SAM
> library(samr) # Load the samr R package
> sam_list <- list(x = as.matrix(input_gn), y = group_gn, logged2 = TRUE) # Prepare the data used to perform SAM (x as a numeric matrix)
> sam_input <- samr(sam_list, resp.type = "Two class unpaired") # Perform SAM
> delta.table <- samr.compute.delta.table(sam_input) # Calculate the delta values
> samr.plot(sam_input, 0.4) # Draw the SAM plot
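# ANOVA
> group <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3)) # Group labels assumed from the column names G1-G3, L1-L3, N1-N3
> anova(lm(as.numeric(input[1, ]) ~ group)) # Perform ANOVA for the gene in the first row of the data
> an_pval <- vector() # Produce a vector to save the ANOVA p-values
> for (i in 1:nrow(input)) # Perform ANOVA for all genes in the data
{
an_pval[i] <- anova(lm(as.numeric(input[i, ]) ~ group))$"Pr(>F)"[1] # Loop body reconstructed: save the p-value of the F-test
}
> fwer_p <- p.adjust(an_pval, method = "bonferroni") # Adjust the p-values by FWER control
> id <- which(fwer_p < 0.05) # Extract genes with Bonferroni-adjusted p-values less than 0.05
> an_pval[id]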
################
# Clustering Analysis #
################
# Hierarchical clustering
> d <- as.data.frame(input)
# Heatmap
# K-means clustering
> k <- kmeans(input, centers = 5) # Produce 5 clusters using the k-means algorithm
# SOM clustering
#################
# Classification Analysis #
#################
# LDA
> library(MASS)
# KNN
# SVM
In this section, we will learn to find the list of genes that are differentially expressed in microarray data by using the t-test, SAM, and ANOVA. The t-test compares the means of the experimental and control groups, taking their variability into account, under the null hypothesis that the means are equal. ANOVA uses the F-test and calculates the likelihood of the observed differences in the means of more than two groups, assuming that the null hypothesis is true. Both the t-test and ANOVA assume a normal distribution of the samples.
ANOVA (analysis of variance) is used for the comparison of three or more groups. When the null hypothesis is rejected, post hoc analysis is used to find out which groups differ significantly from each other in their sample means. Multiple hypothesis testing, or multiple comparisons, is a statistical analysis that tests multiple hypotheses simultaneously. In order to correct the false positive error rate associated with performing multiple statistical tests, a correction method must be chosen after determining the type of error that needs to be controlled. Controlling the family-wise error rate (FWER) leaves little chance of a type I error, because it adjusts the significance level for all of the hypotheses that are tested simultaneously. However, it has lower statistical power, because the type I error is minimized. FDR is an alternative method that provides less stringent control of type I errors and greater statistical power than FWER. FDR controls the proportion of false positives among the results that are determined to be significant (please refer to Sect. 5.4 of Chap. 5).
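The trade-off between FWER and FDR control can be illustrated with `p.adjust` from base R on simulated p-values. The mixture below (100 genes with a real effect, 900 without) and the effect size are assumptions chosen only to make the difference visible.

```r
set.seed(2024)
n_tests <- 1000

# Simulated p-values: 100 genes with a real effect, 900 genes with none
p_alt  <- pnorm(rnorm(100, mean = 3.5), lower.tail = FALSE)
p_null <- runif(n_tests - 100)
pval   <- c(p_alt, p_null)

p_fwer <- p.adjust(pval, method = "bonferroni")   # controls the family-wise error rate
p_fdr  <- p.adjust(pval, method = "fdr")          # Benjamini-Hochberg false discovery rate

n_raw  <- sum(pval   < 0.05)
n_fwer <- sum(p_fwer < 0.05)
n_fdr  <- sum(p_fdr  < 0.05)
c(raw = n_raw, FWER = n_fwer, FDR = n_fdr)

# False positives among the genes called significant (known here because the data are simulated)
c(raw  = sum(pval[-(1:100)]   < 0.05),
  FWER = sum(p_fwer[-(1:100)] < 0.05),
  FDR  = sum(p_fdr[-(1:100)]  < 0.05))
```

The Bonferroni adjustment simply multiplies every p-value by the number of tests (capped at 1), which is why it is the most conservative of the three; the FDR adjustment always selects at least as many genes as Bonferroni and never more than the unadjusted p-values.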
> pdat <- pData(ALL) # Import phenotype and metadata from ALL and assign them to pdat
# Extract the indexes of which $mol.biol values are "BCR/ABL" and "NEG" in ALL.
> eset <- ALL[, subset] # Extract the subset of ALL, the extracted indexes.
> emat <- data.matrix(exprs(eset)) # Format the data set into the gene expression matrix
The analysis will start with the generated gene expression matrix.
6.3.1.1 T-Test
The t-test compares two sets of data. We can identify the genes that are significantly differentially expressed by running a t-test. In the previously generated data, we can define BCR/ABL (BCR/ABL fusion gene) as the experimental group and NEG (cytogenetically normal) as the control group. We will practice selecting the genes that are significantly differentially expressed according to the t-test and drawing a volcano plot, a distribution graph of fold-changes and p-values.
In order to run a t-test, we need to look up the index labels saved in eset. These labels are stored in mol.biol of the extracted subset of ALL. We should define “BCR/ABL” and “NEG” as the two comparison groups.
> ttest.p <- vector() #Generate a vector variable that can take p-values.
> ttest.est <- vector() #Generate a vector variable that can take fold-changes.
{
calc <- t.test(emat[i, ] ~ group) #Perform t-test against gene i and save
ttest.p[i] <- calc$p.value # Calculate p-value and save the result in the ttest.p vector.
We have now generated the t-test results using a for-loop and saved the p-values and fold-changes in the ttest.p and ttest.est vectors, respectively. In order to generate a distribution graph of fold-changes and p-values, we need to draw a volcano plot using all listed genes first and then mark the indexes of the significantly differential genes in red (Fig. 6.3).
[Fig. 6.3. Volcano plot; x-axis: ttest.est, y-axis: –log10(ttest.p).]
> index <- which(ttest.p < 0.05) # Extract the genes with p-values less than 0.05.
> ttest.est[index] #Extract the t-test results on the extracted gene from the previous step.
> plot(ttest.est, -log10(ttest.p), pch = 20, cex = 0.3, ylim = c(0, 7), main =
"Volcano plot for differential expression of BCR/ABL vs. NEG") #Draw a volcano plot
> index <- which(pfdr < 0.05) #Extract gene indexes with pfdr < 0.05.
# Bonferroni method.
> index <- which(pfwer < 0.05) #Extract gene indexes with pfwer < 0.05.
So far, we have completed a basic differential expression analysis. The next step is to annotate the selected genes. The following example shows how to print the extracted genes with their gene symbols and expression profiles.
> deg <- cbind(emat, pfwer, pfdr)[index, ] # Extract the expression profile data, pfwer, and pfdr.
> head(deg) #Return the gene symbols and gene expression of selected rows.
> head(deg)
01005 01010 03002 04007 04008 04010
DDR1 7.623562 7.543604 7.916954 7.516981 7.726716 7.288960
CD19 9.812015 10.003141 9.164655 9.876270 9.032565 10.114688
TRD@ 5.162924 5.966752 5.987986 5.201913 7.268455 5.310706
TNK2 8.654172 8.443706 8.521821 8.466544 7.571240 7.638699
ITGAE 6.117544 5.494834 6.588648 6.523782 6.008995 6.650002
GAB1 4.441590 5.095521 3.851317 4.814972 3.799737 4.844462
04016 06002 08001 08011 08012 08024
DDR1 7.724153 6.763490 7.543907 7.784205 7.041890 6.769948
CD19 8.917097 7.823291 10.391308 9.442780 10.449013 9.621440
TRD@ 5.571119 5.656040 6.109881 5.453425 5.302523 6.412112
TNK2 8.168028 8.011405 8.795978 8.932922 8.564040 7.312690
ITGAE 8.213070 6.427095 5.995374 7.109062 5.895099 6.734095
GAB1 3.007319 3.014779 4.945371 5.922863 3.718701 3.678384
> label <- sub("0", "2", BAlabels) #Assign numbers to the two groups.
> a.sam <- list(x = exprs(eset), y = label, logged2 = TRUE) #Input matrix of SAM
> samr.obj <- samr(a.sam, resp.type = "Two class unpaired", nperm = 100) # Implement SAM
[Figure: SAM plot; x-axis: expected score, y-axis: observed score.]
Analysis of variance (ANOVA) tests a hypothesis using the F-distribution when three or more groups are compared. The F-test compares the variance between the group means with the variance within each group. This section practices how to extract genes that differ significantly among three groups: BCR/ABL, NEG, and ALL1/AF4.¹
First, extract three groups for analysis.
> eset <- ALL[, subset] #Extract gene expression profile data
> emat <- data.matrix(exprs(eset)) # Save the gene expression profile in matrix form to emat.
Label the extracted data and divide the data into three groups.
“ALL1/AF4” to numbers.
Perform ANOVA analysis on three groups for the first gene of the matrix.
Response: emat[1, ]
Df Sum Sq Mean Sq F value Pr(>F)
BAgroup 2 0.3504 0.175220 2.6018 0.07839 .
Residuals 118 7.9468 0.067346
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
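To make the columns of the ANOVA table above concrete, the following sketch recomputes the F value and its p-value by hand for one simulated gene and compares them with `anova(lm())`. The expression values and the group sizes are invented for illustration; only the group names are taken from the text.

```r
set.seed(11)
group <- factor(rep(c("BCR/ABL", "NEG", "ALL1/AF4"), times = c(8, 8, 8)))
x <- rnorm(24, mean = 7.5, sd = 0.3)                       # expression of one gene
x[group == "ALL1/AF4"] <- x[group == "ALL1/AF4"] + 0.5    # one group is shifted

tab <- anova(lm(x ~ group))

# Manual computation of the same quantities
grand_mean  <- mean(x)
group_means <- tapply(x, group, mean)
group_sizes <- tapply(x, group, length)

ss_between <- sum(group_sizes * (group_means - grand_mean)^2)  # "Sum Sq" of the group row
ss_within  <- sum((x - group_means[as.character(group)])^2)    # "Sum Sq" of the residuals
df_between <- nlevels(group) - 1
df_within  <- length(x) - nlevels(group)

f_stat <- (ss_between / df_between) / (ss_within / df_within)  # "F value"
p_val  <- pf(f_stat, df_between, df_within, lower.tail = FALSE) # "Pr(>F)"

tab
c(F = f_stat, p = p_val)
```

A large F value means that the variance between the group means is large relative to the variance within the groups, which is the comparison the F-test is built on.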
> index <- which(pfdr < 0.05) #Extract gene indexes with FDR < 0.05
Extract and print the gene symbols of the differentially expressed genes.
> gene = featureNames(ALL)[index] #Extract gene symbols with corrected p-value < 0.05
> genesymbols <- mget(gene[!is.na(gene)], hgu95av2SYMBOL) #Import gene symbols.
> deg <- emat[index, ] #Save the differentially expressed genes to variable.
> rownames(deg) <- genesymbols #Set row names with gene symbol
> head(deg)
> head(deg)
01005 01010 03002 04006 04007 04008 04010
DDR1 7.623562 7.543604 7.916954 6.816397 7.516981 7.726716 7.288960
HIF1A 8.063428 6.740539 8.126621 5.739973 8.533088 8.232524 6.922755
CD19 9.812015 10.003141 9.164655 7.474483 9.876270 9.032565 10.114688
CD44 6.875240 6.315132 6.730446 9.338491 7.282684 8.907688 6.781251
TNK2 8.654172 8.443706 8.521821 7.854585 8.466544 7.571240 7.638699
ITGAE 6.117544 5.494834 6.588648 8.039748 6.523782 6.008995 6.650002
04016 06002 08001 08011 08012 08024 09008
DDR1 7.724153 6.763490 7.543907 7.784205 7.041890 6.769948 8.709475
HIF1A 7.815199 8.510725 8.126982 7.717734 8.513018 6.930541 7.440068
CD19 8.917097 7.823291 10.391308 9.442780 10.449013 9.621440 10.618220
CD44 7.953812 8.697630 7.826722 7.420834 5.942588 7.951445 5.802265
TNK2 8.168028 8.011405 8.795978 8.932922 8.564040 7.312690 8.848862
ITGAE 8.213070 6.427095 5.995374 7.109062 5.895099 6.734095 5.878245
Clustering analysis groups genes with similar expression profiles. This section practices
hierarchical clustering, K-means clustering, and SOM.
[Figure: cluster dendrogram of the samples; vertical axis: Height; method: hclust (*, "complete")]
Fig. 6.5 The result of hierarchical clustering analysis was visualized in a dendrogram
Figure 6.5 shows a dendrogram, the result of this hierarchical clustering analysis.
> d <- as.data.frame(deg) #Save the differentially expressed genes in data frame
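As a rough illustration of hierarchical clustering with complete linkage, the sketch below clusters the samples of a simulated matrix. The object deg_sim is a hypothetical stand-in for the differentially expressed gene matrix, and the two built-in sample groups are an assumption made so that the result is easy to check.

```r
set.seed(2)
# Simulated stand-in for the DEG matrix: 40 genes x 12 samples, two sample groups
deg_sim <- matrix(rnorm(40 * 12), nrow = 40,
                  dimnames = list(paste0("gene", 1:40), paste0("s", 1:12)))
deg_sim[, 7:12] <- deg_sim[, 7:12] + 4

# dist() measures distances between rows, so transpose to cluster samples
hc <- hclust(dist(t(deg_sim)), method = "complete")
groups <- cutree(hc, k = 2)   # cut the tree into two clusters
table(groups)
# plot(hc) draws the dendrogram; heatmap(deg_sim) draws a heat map
```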
The result of hierarchical clustering analysis can be displayed in a heat map (Fig. 6.6).
[Figure: heat map with genes as rows and samples as columns; color key from −3 to 3]
Fig. 6.6 Heatmap
> kmeans.row$cluster[1:10]
DDR1 HIF1A CD19 CD44 TNK2 ITGAE GAB1 XPA XPA CASP10
3 3 5 3 5 2 4 2 2 2
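The cluster assignment printed above comes from K-means. A minimal sketch on simulated data follows; deg_sim and the three planted gene groups are hypothetical, and nstart = 10 is an assumed setting rather than the one used in the text.

```r
set.seed(3)
# Simulated stand-in for the DEG matrix: 3 groups of 20 genes with distinct profiles
centers <- rbind(rep(0, 8), rep(5, 8), rep(-5, 8))
deg_sim <- centers[rep(1:3, each = 20), ] + matrix(rnorm(60 * 8), nrow = 60)
rownames(deg_sim) <- paste0("gene", 1:60)

# K-means on rows (genes); nstart restarts guard against poor initial centers
km <- kmeans(deg_sim, centers = 3, nstart = 10)
km$cluster[1:10]    # cluster ID of the first ten genes
table(km$cluster)   # cluster sizes
```

Cluster numbers are arbitrary labels, so two runs can number the same clusters differently.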
SOM clustering analysis can also be used for the significantly differentially expressed
genes that have p-values less than 0.05 with Bonferroni correction after ANOVA analysis.
SOM displays similar clusters closer to each other (Fig. 6.7).
[Figure: Data matrix and response variables → Pre-analysis of the data and analysis of differential expression → Generation of classifier (Training set) → Prediction of classification (Test set)]
Classification analysis deals with unknown samples and assigns predefined class labels to
the samples. Clustering analysis, a type of unsupervised learning, groups genes by similar
expression without additional information. Classification analysis, a type of supervised
learning, is instead used when a certain amount of labeled training data is available.
Based on the training data, classification analysis generates classifiers and infers the
classes of new data. The steps of classification analysis are shown in Fig. 6.8.
For classification analysis, the data have to be divided into a training set and a test set.
This section uses the golubEsets data, which are provided as an R package. The data consist of
the following items.
- Training set (Golub_Train): Total 38 individuals (ALL 27; AML 11), 7129 genes
- Test set (Golub_Test): Total 34 individuals (ALL 20; AML 14), 7129 genes
- Merge set (Golub_Merge): Total 72 individuals (ALL 47; AML 25), 7129 genes
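The split into training and test sets can be sketched as follows. This is only an illustration on simulated data with the same sample counts as the merged Golub set; the matrix x, the planted class difference, and k = 3 are assumptions, and knn() comes from the class package.

```r
library(class)   # provides knn(); ships with standard R installations
set.seed(4)
# Simulated stand-in: 72 samples x 50 genes, two classes
labels <- factor(rep(c("ALL", "AML"), times = c(47, 25)))
x <- matrix(rnorm(72 * 50), nrow = 72)
x[labels == "AML", 1:20] <- x[labels == "AML", 1:20] + 3

# Split the samples into a training set (38) and a test set (34)
train_idx <- sample(seq_len(72), 38)
pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
            cl = labels[train_idx], k = 3)
accuracy <- mean(pred == labels[-train_idx])
table(predicted = pred, actual = labels[-train_idx])
```

Here samples are rows, which is the layout knn() expects; an expression matrix with genes as rows would need to be transposed first.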
> library(MASS)
#training set
#test set
6.5.2 KNN (K-Nearest Neighbor)
> library(class)
> library(e1071)
In this section, we practice additional simple data analysis in R for beginners. Easier
examples are provided in the basic data analysis practices of Appendix A. R is an essential
part of genome data analysis, and it is very important to have a basic understanding of
the program.
In order to obtain accurate differential gene expression data, it is crucial to obtain a good
understanding of the data frame that we will use for our practice.
> head(exprs(ALL))
01005 01010 03002 04006 04007 04008 04010
1000_at 7.597323 7.479445 7.567593 7.384684 7.905312 7.065914 7.474537
1001_at 5.046194 4.932537 4.799294 4.922627 4.844565 5.147762 5.122518
1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923 3.945869 4.150506
1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997 6.208061 6.292713
1004_at 5.925260 5.912780 5.893209 6.170245 5.615210 5.923487 6.046607
1005_at 8.570990 10.428299 9.616713 9.937155 9.983809 10.063484 10.662059
04016 06002 08001 08011 08012 08018 08024
1000_at 7.536119 7.183331 7.735545 7.591498 7.824284 7.231814 7.879988
1001_at 5.016132 5.288943 4.633217 4.583148 4.685951 5.059300 4.830464
1002_f_at 3.576360 3.900935 3.630190 3.609112 3.902139 3.804705 3.862914
1003_s_at 5.665991 5.842326 5.875375 5.733157 5.762857 5.770914 6.079410
1004_at 5.738218 5.994515 5.748350 5.922568 5.679899 6.044520 6.057632
1005_at 11.269115 8.812869 10.165159 9.381072 8.227970 7.627248 7.667445
> setwd("C:/gda") #Set the working directory where generated data will be saved.
After reviewing the phenotype data above, we can design the experimental conditions
for analysis.
> eset <- ALL[ , subset] #Extract gene expression data that have “BCR/ABL,” “NEG” labels.
> table(eset$mol)
> library(genefilter)
> head(selected)
> sum(selected)
> head(esetSub)
Exercises
[Exercise 1] - Perform the following procedures using Affymetrix GeneChip.
1.1. Install affy and affydata packages for analysis
1.2. Load the data.
1.3. Load the CEL image.
1.4. Draw a histogram.
1.5. Perform normalization.
1.6. Draw a box plot using the obtained data from the previous steps.
1.7. Draw a scatter plot.
1.8. Run T-test.
[Exercise 2] - This exercise is to practice clustering analysis. (Run the following codes first to solve this
problem.)
> source("http://bioconductor.org/biocLite.R")
> biocLite("golubEsets")
> library(Biobase)
> library(golubEsets)
> data(Golub_Train)
Bibliography
1. Chiaretti S et al (2004) Gene expression profile of adult T-cell acute lymphocytic leukemia identifies
6 distinct subsets of patients with different response to therapy and survival. Blood 103(7):2771–2778
2. Eisen MB et al (1998) Cluster analysis and display of genome-wide expression patterns. PNAS
95(25):14863–14868
3. Gentleman R (2016) Annotate: annotation for microarrays. R package version 1.52.1
4. Gentleman R, Carey V, Huber W, Hahne F (2016) genefilter: genefilter: methods for filtering genes from
high-throughput experiments. R package version 1.56.0
5. Golub T (2016) golubEsets: exprSets for golub leukemia data. R package version 1.16.0
6. Huber W et al (2002) Variance stabilization applied to microarray data calibration and to the quantifi-
cation of differential expression. Bioinformatics 18(suppl 1):S96–S104
7. Li X (2009) ALL: a data package. R package version 1.16.0
8. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2017) e1071: misc functions of the depart-
ment of statistics, probability theory group (formerly: E1071), TU Wien. R package version 1.6-8.
https://CRAN.R-project.org/package=e1071
9. Tamayo P et al (1999) Interpreting patterns of gene expression with self-organizing maps: methods
and application to hematopoietic differentiation. PNAS 96(6):2907–2912
10. Tibshirani R, Chu G, Narasimhan B, Li J (2011) samr: SAM: Significance analysis of microarrays. R pack-
age version 2.0. https://CRAN.R-project.org/package=samr
11. Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York. ISBN
0-387-95457-0
12. Yan J (2016) som: Self-Organizing Map. R package version 0.3-5.1. https://CRAN.R-project.org/
package=som
7 Gene Ontology and Biological Pathway-Based Analysis
7.1 Introduction
7.2 Prerequisites
7.4 DAVID
7.5 ArrayXPath
7.6 BioLattice
Bibliography
7.1 Introduction
[Figure: Gene set (i.e., differential expression group and cluster) → Biological interpretation]
              GO (−)   GO (+)   Total
Significant     70       10       80
[1] 5.935335e-07
The hypergeometric test used in ORA is the same as a one-tailed Fisher’s exact test. In
other words, the result of ORA is the probability of obtaining 10 or more genes
with the GO annotation. The formula is as follows:
$$P(X \le k) = \sum_{y=0}^{k} h(y \mid N; M; n) = \sum_{y=0}^{k} \frac{\binom{M}{y}\binom{N-M}{n-y}}{\binom{N}{n}}$$
> a <- 0
> a
[1] 6.57381e-07
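The probability of observing k or more annotated genes is the upper tail of the hypergeometric distribution, that is, 1 − P(X ≤ k − 1). The sketch below computes it three ways with base R. The counts N, M, n, and k are hypothetical and are not the values behind the outputs printed above.

```r
# Hypothetical counts, for illustration only
N <- 10000   # all genes on the array
M <- 100     # genes annotated with the GO term
n <- 80      # significant genes
k <- 10      # significant genes that carry the annotation

# P(X >= k): upper tail of the hypergeometric distribution
p_ora <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same value by summing the probability mass from k upward
p_sum <- sum(dhyper(k:min(n, M), M, N - M, n))

# The same value from a one-tailed Fisher's exact test on the 2 x 2 table
tab <- matrix(c(k, M - k, n - k, N - M - (n - k)), nrow = 2,
              dimnames = list(c("Significant", "Not significant"),
                              c("GO (+)", "GO (-)")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
c(p_ora, p_sum, p_fisher)
```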
7.2 Prerequisites
The details of R program installation are described in Appendix B, Sect. B.1; R package
installation examples are given in Appendix B, Sect. B.2. Appendix A provides the tuto-
rial for basic usage of R. The examples in this section require R packages which can be
installed as follows:
> source("http://bioconductor.org/biocLite.R")
We will use the public ALL data set in examples for this section, and easily accessed
web-based biological interpretation tools. This section will also review each knowledge
resource.
7.3 Dataset and Biological Interpretation Tools
Samples 128
The example data is ALL, which is provided by Bioconductor. Table 7.2 describes the
structure of the ALL dataset. We will extract significantly differentially expressed genes
and run K-means clustering.
First, load analysis modules and the ALL dataset into R. Load the installed R Library.
> library("Biobase")
The show(ALL) command provides an overview of the structure of the dataset
(Fig. 7.2).
The pData function returns phenotypic information from the dataset. If the output is very long,
use summary(pData(ALL)) to get a summary overview. With these commands, we can
see that ALL consists of 128 samples and 12,625 features related to gene expression. The
names of the 128 samples are 01005, 01010, …, LAL4. The following link provides more
detailed information on the ALL data set.
http://www.bioconductor.org/packages/release/data/experiment/html/ALL.html
Fig. 7.2 Usage of show function to view data structure
Using summary(pData(ALL)) also shows the phenotypic distribution of the molecular
biological index, which is titled mol.biol, for the 128 samples.
ALL1/AF4: 10
BCR/ABL : 37
E2A/PBX1: 5
NEG : 74
NUP-98 : 1
p15/p16 : 1
In order to perform two group comparisons, we will extract 37 cases of BCR/ABL and
10 cases of ALL1/AF4.
> ALL$mol.biol #Print and review mol.biol label of ALL data set.
> eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")] #Save only 47 samples,
Next, we will extract a list of genes which are statistically significantly different between
the two groups (adjusted p-value < 0.05). There are 165 significantly different probes.
K-means clustering can be run twice for the 165 probes, and generate 10 clusters. We can
also run this analysis on the differential gene expression files generated in Chap. 6.
> selected <- p.adjust(fit$p.value[, 2]) < 0.05 #Extract the indexes of significant analysis results.
> esetSel <- eset[selected, ] #Save the significant subset in esetSel.
> res <- exprs(esetSel) #Save the expression profile of the significant subsets.
The following example shows how to convert cluster analysis results into an input for-
mat accepted by an interpretation tool. The first line of this input format describes the
experimental condition for each sample. For DAVID, we need to generate a file listing the
165 probes, because DAVID does not accept cluster analysis results as input. However,
for ArrayXPath, we need to add a cluster ID to each probe because it will incorporate the
cluster analysis results.
Print a list of significant probes and clustering results in a system input file format for
interpretation.
differential gene ID of res in the first row of res.input and cluster IDs created by K-means
> res.input <- cbind(res.input, res) #From the third line, generate a matrix combining gene
expression profiles.
of a data-frame
row.names = FALSE, col.names = FALSE) #Generate an input file for DAVID analysis.
row.names = FALSE, col.names = TRUE) #Generate an input file for ArrayXPath and BioLattice.
The input file for DAVID is a simple list of genes arranged in a single column. However,
as shown in Fig. 7.3, input files for ArrayXPath and BioLattice contain probe IDs,
cluster IDs, and gene expression profiles for each sample. The first row contains column
labels.
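The two file layouts described above can be sketched with write.table. This is an illustration on simulated values, not the 165 probes from the text: res_sim, the probe names, the column labels ProbeID and Cluster, and the temporary output directory are all hypothetical.

```r
set.seed(5)
# Simulated stand-in for the expression profile: 6 probes x 4 samples
res_sim <- matrix(round(rnorm(24, mean = 8), 3), nrow = 6,
                  dimnames = list(paste0("probe", 1:6, "_at"), paste0("S", 1:4)))
cluster_id <- kmeans(res_sim, centers = 2, nstart = 5)$cluster

out_dir <- tempdir()   # hypothetical output location
david_file <- file.path(out_dir, "test_input_DAVID.txt")
axp_file <- file.path(out_dir, "test_input_ArrayXPath.txt")

# DAVID: one probe ID per line, no header
write.table(rownames(res_sim), david_file, quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# ArrayXPath/BioLattice: probe ID, cluster ID, then expression values
axp <- data.frame(ProbeID = rownames(res_sim), Cluster = cluster_id, res_sim)
write.table(axp, axp_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = TRUE)
readLines(axp_file, n = 2)
```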
7.3.2.1 DAVID
The Database for Annotation, Visualization and Integrated Discovery (DAVID) is a
web-based tool designed for understanding the biological meaning of a given gene list.
It includes functions for functional annotation, gene functional classification, gene ID
conversion, and pathogen annotation.
The website is: https://david.ncifcrf.gov
7.3.2.2 ArrayXPath
ArrayXPath is a microarray data interpretation system based on biological pathways. It
also provides interpretations using GO and visualizes the genes in terms of clusters,
diseases, gene ontology, and biological pathways.
http://www.snubi.org/software/ArrayXPath/
7.3.2.3 BioLattice
BioLattice is a web-based clustering analysis tool. DAVID and ArrayXPath provide
interpretations for an individual cluster. However, BioLattice analyzes correlations of multiple
clusters using Formal Concept Analysis, and thus provides understanding of the overall
context of an experiment. It helps identify clusters which have important meaning,
visualizes the results by running core-periphery analysis, and prints the results in a table
format.
http://www.snubi.org/software/biolattice/
7.4 DAVID
Go to the DAVID website. Select “Start Analysis” from the top menu, then click “Upload”
in the left input window. We will use the “test_input_DAVID.txt” file which we created in
the previous example. There are two ways of uploading the data.
First, you can upload a file directly by choosing “Browse” under the “B: Choose From a
File” option of “Step 1.” A pop-up window will appear for you to select the file for uploading.
You can also copy the gene list and paste it directly into the field under the “A: Paste a list”
option of “Step 1.” Since the example is Affymetrix probe data, choose “AFFYMETRIX_3PRIME_IVT_ID”
in “Step 2: Select Identifier.” There is an option to convert probe IDs to gene IDs
when you enter the data. Next, select “Gene List” in “Step 3,” then click “Submit List” under
“Step 4” to submit the data. The submission results will appear on the screen (Fig. 7.4).
Based on various databases and knowledge resources, DAVID offers various analysis
tools for our 165 probes. You can see detailed results for any analysis by clicking each
resource link (Fig. 7.4). There are two different presentations of results, the “Functional
In the “Functional Annotation Chart,” the ‘category’ column describes the original
database and specific term name. The table also provides a p-value and adjusted p-value
for each term. The “Functional Annotation Table” matches database terms with individual
submitted probes (Fig. 7.6). You can confirm the probes at gene level and download the
Fig. 7.7 ArrayXPath
7.5 ArrayXPath
The ArrayXPath web tool provides interpretations of bulk genome data based on biological
pathways. It is available for biological pathways of Homo sapiens, Mus musculus, and
Rattus norvegicus. ArrayXPath automatically converts probe IDs to gene IDs and provides
options for p-value and adjusted p-value calculations (Fig. 7.7). To perform a test run,
upload the file test_input_ArrayXPath.txt created in the previous section. Figure 7.8
shows the pathway analysis results. We can confirm the results of statistics and hypergeometric
tests of the listed genes. The results can be evaluated by cluster or with a biological
pathway database. Using Scalable Vector Graphics (SVG), ArrayXPath provides a web-based
service for visualizing gene-expression profiles and biological pathway graphs. In
addition, it provides a web-enabled interactive visualization of pathways integrated with
gene-expression profiles so a user can visualize changes in gene expression according to
experimental conditions (Fig. 7.8).
7.6 BioLattice
DAVID suggests GO terms that best describe the biological meanings of a gene list.
ArrayXPath combines GO analysis and biological pathway analysis. In most cases, though
not always, pathway analysis is more likely to give meaningful results than GO term analysis.
Recently, DAVID and ArrayXPath have been improved and made more compatible with
each other. However, both GO and pathway analyses have the limitation of only interpreting
the genes of a single cluster. BioLattice overcomes this limitation by mapping the correlations
of each cluster through graphical representations, creating a lattice of concepts. Also, it
provides a core-periphery analysis to present different degrees of importance at a semantic level.
Fig. 7.8 Results of ArrayXPath analysis. When you click each pathway, it will produce an interactive graph
BioLattice provides interpretations using GO or biological pathways. It asks users to
select one of three GO categories, “Biological Process,” “Molecular Function,” or “Cellular
Component,” and provides an option to filter based on degrees of statistical significance
or the number of probes (Fig. 7.9). In this section, we will perform a test run using the
test_input_BioLattice.txt file.
Fig. 7.10 BioLattice Graph. It divides the whole experiment into multiple clusters and rearranges
them according to their correlations
After you run BioLattice, you will find the lattice graph shown in Fig. 7.10. In the
graph, red represents “core” clusters, which carry the most important experimental context.
The green clusters are “communicating” clusters, which have high correlations with “core”
clusters. Yellow clusters are independent and do not have any correlation with other clusters.
The rest are in grey and classified as “peripheral.” As shown in Fig. 7.9, the same
Exercises
[Exercise 1] - The structure of the Golub data provided by Bioconductor is shown in Table 7.3 below.
Download the data (data name: golubEsets), find the differentially expressed genes (DEGs) between ALL
and AML patients, and convert the DEG results to a DAVID input file format.
1.1. Upload the data generated from exercise 1 in DAVID and check which GO terms and OMIM (Online
Mendelian Inheritance in Man) diseases in the “Functional Annotation Chart” and “Functional
Annotation Table” are significantly enriched in the uploaded genes.
1.2. Using the DEGs having p-values less than 0.05 extracted from the Golub data, perform K-means
using K = 20. Using the results of the K-means, run ArrayXPath.
Samples 72
62 10
Take Home Message
- Understanding of Gene Ontology and pathway analysis as a biological interpretation
of gene expression data using R.
Bibliography
1. ArrayXPath – http://www.snubi.org/software/ArrayXPath/
2. BioLattice – http://www.snubi.org/software/biolattice/
3. Chiaretti S et al (2004) Gene expression profile of adult T-cell acute lymphocytic leukemia identifies
distinct subsets of patients with different response to therapy and survival. Blood 103(7):2771–2778
4. Chung HJ et al (2004) ArrayXPath: mapping and visualizing microarray gene expression data with
integrated pathway resources using Scalable Vector Graphics. Nucleic Acids Res 32:W460–W464
5. Chung HJ et al (2005) ArrayXPath II: mapping and visualizing microarray gene expression data with
biomedical ontologies and integrated pathway resources using Scalable Vector Graphics. Nucleic
Acids Res 33:W621–W626
6. DAVID – https://david.ncifcrf.gov
7. Huang DW et al (2009) Systematic and integrative analysis of large gene lists using DAVID bioinfor-
matics resources. Nat Protoc 4(1):44–57
8. Huber W, Carey VJ, Gentleman R et al (2015) Orchestrating high-throughput genomic analysis with
Bioconductor. Nat Methods 12:115–121
9. Kim J et al (2008) BioLattice: a framework for the biological interpretation of microarray gene expres-
sion data using concept lattice analysis. J Biomed Inform 41(2):232–241
10. Li X (2009) ALL: a data package. R package version 1.16.0.
11. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015) Limma powers differential expres-
sion analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43(7):e47
8 Gene Set Approaches and Prognostic Subgroup Prediction
8.2 Prerequisites
Bibliography
What You Will Learn in This Chapter
In this chapter, we implement Gene Set Enrichment Analysis (GSEA) to analyze microarray
data. We perform Kaplan-Meier survival analysis for the clustered genes obtained by microarray
data clustering analysis and test the statistical significance of different prognoses
between clusters. The chapter provides an understanding of the relationship between biological
interpretation and GO and pathway analysis of the clustered genes, and of the interpretation
of the clustered genes with GSEA.
8.1 Introduction
Gene Set Enrichment Analysis (GSEA) is a valuable tool for analyzing and interpreting
microarray data. GSEA detects whether a gene set shows significant differential
expression or pathway-specific biological meaning. While differential expression analysis,
as explained in Chap. 5, identifies individual genes that are important for the difference
between classes, GSEA identifies gene sets from the same comparison. The former uses
the gene as the unit of analysis (Fig. 8.1), and the latter uses the gene set as the unit of
analysis (Fig. 8.2). Although the number of genes tested is already fixed at the
experimental step, the number of gene sets tested varies with the analyst’s purpose
(footnote 1). After testing, both methods can correct for the effect of multiple
hypotheses (see Chap. 5, Sect. 5.5 and Chap. 6, Sect. 6.3).
[Figure: Class A versus Class B; t-test cut-off]
Fig. 8.1 After performing differential expression analysis on the gene level, the results of the analysis
are used to find biological meaning
[Figure: Gene set 2 enriched in Class B; Gene set 3 enriched in Class A; ES/NES statistic; t-test cut-off]
Fig. 8.2 GSEA analysis of the gene set unit performed for differential expression significance
connotes biological meaning
There are two advantages to this type of set-level analysis. First, small signals that are
not detected by individual gene analysis can be amplified by set-level analysis. Second,
because the analysis is run with gene sets that are already known to have biological meaning,
a significant differential expression result can be interpreted biologically right away.
In contrast, to interpret the gene list from gene-level differential
analysis, ontology or pathway analysis must be run afterward. GSEA can use several
gene annotation databases (KEGG pathway, Gene Ontology, Cytogenetic Band) to mine
biological information. Of course, gene-level differential expression analysis can also be
used.
Survival analysis is a way to statistically infer the time until a specific event happens,
based on observations from animal testing or human clinical trials. Using microarray
gene expression data, samples that cannot be differentiated clinically can be sorted into
categories. The survival analysis practice in Chap. 7 uses microarray gene expression data and
clinical data containing survival data. We verify the statistical significance of the difference in
survival curves between the groups clustered by gene expression data.
If there is a significant difference between the clusters, the prognosis can be classified into two
different categories, which is a first step toward suggesting different treatment strategies.
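Comparing survival curves between two expression-derived clusters can be sketched with the survival package. The data below are simulated: the cluster labels, hazard rates, and censoring fraction are assumptions chosen only to make the example self-contained.

```r
library(survival)   # recommended package shipped with R
set.seed(6)
# Simulated stand-in: two expression-derived clusters with different hazards
cluster <- factor(rep(c("A", "B"), each = 40))
time <- rexp(80, rate = ifelse(cluster == "A", 0.05, 0.20))
status <- rbinom(80, 1, 0.8)   # 1 = event observed, 0 = censored

km_fit <- survfit(Surv(time, status) ~ cluster)   # Kaplan-Meier curves
lr <- survdiff(Surv(time, status) ~ cluster)      # log-rank test
p_logrank <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_logrank
# plot(km_fit, col = 1:2) draws the two survival curves
```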
1 Theoretically, once the number of genes is decided, the possible number of gene sets is defined by
combinatorics, i.e., “Stirling numbers.” However, this number is a very big number, resulting in a
“computationally unfeasible” state. In addition, not all combinations are reasonable enough to have
a biological meaning. Therefore, practically, a small portion of all Stirling numbers that are
meaningful are testable.
8.3 · Input Files
8.2 Prerequisites
The main methods of GSEA include Java GSEA Desktop Application and R Package. In
practice, we will use methods (1) and (2). For detailed information on installation, refer
to Appendix B in Sect. B.8.
- If the latest version of Java has not been installed, download and run “jxpiinstall.exe”
from the website: https://www.java.com/en/download/.
- Download the dataset necessary for the practice, create the directory “C:\gda\ch08\”,
and unzip the dataset in this directory.
These datasets can be downloaded from the Broad Institute; the files have been modified and
are provided with this textbook.
Since the GSEA input file follows the ASCII (footnote 2) text file format, it can be prepared
with a text editor or MS Excel (footnote 3). The input file follows a table or matrix format in
which rows are separated by line breaks (footnote 4) and columns by tabs; this format is called
a tab-delimited file. A selection box opens when you select “File” -> “Save As…” in the Excel
menu. You can save by entering the “File name (N)” (e.g., “p53.gct”; footnote 5) and selecting
the “Text (Separated by Tab) (*.txt)” option in “File Format (T)” (footnote 6). When using a
text editor, use “Tab” on the keyboard to separate columns and “Enter” to start a new row.
The types of formats used in GSEA input files are listed below. All files are in ASCII
tab-separated format.
* Gene Expression Data Format
GCT: Gene Cluster Text file format (*.gct)
RES: ExpRESsion (with P and A calls) file format (*.res)
PCL: Stanford cDNA file format (*.pcl)
TXT: Text file format for expression dataset (*.txt)
* Phenotype Label Data Format
CLS: Categorical (ex, tumor vs normal) class file format (*.cls)
CLS: Continuous (ex, time series or gene profile) file format (*.cls)
2 ASCII (American Standard Code for Information Interchange) is a 7-bit character code for
representing English characters in a computer. It has 128 codes. Codes 0 to 31 are control
codes used, for example, to control peripherals such as printers; codes 32–47 are the space and
punctuation symbols, 48–57 are the digits, and 65–90 are the uppercase letters.
3 Auto-formatting function can lead to “auto-error” when a gene name is entered in the Excel file
(Zeeberg et al. 2004).
4 CR/LF, line feed format is OS-dependent.
5 Do not use hyphen “-” in the file name. It cannot be recognized in the GSEA input window due to
some JAVA libraries.
6 Excel sends a warning that it has features unable to support tab-delimited files. Nevertheless, please
select “Yes” to save.
contains 48 samples.
- Third row: Columns 1 and 2 contain “NAME” and “DESCRIPTION”, and each column
after column 2 contains one of the 48 sample names, in order.
- The fourth row and below: Gene name, description, and expression values are listed in
that order. In other words, the actual gene expression data start at
row 4, column 3 of the input.
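A small file in this layout can be written from R as sketched below. The matrix, gene names, sample names, and file name are hypothetical. The content of the first two rows (a "#1.2" version tag, then the number of genes and the number of samples) follows the GCT convention as commonly documented and is stated here as an assumption, since those rows are not described in the text above.

```r
set.seed(7)
# Simulated expression matrix: 5 genes x 4 samples (hypothetical names)
expr <- matrix(round(rnorm(20, mean = 100, sd = 20), 1), nrow = 5,
               dimnames = list(paste0("GENE", 1:5), paste0("sample", 1:4)))
gct_file <- file.path(tempdir(), "example.gct")

# Rows 1-2 (assumed layout): version tag, then number of genes and samples
writeLines(c("#1.2", paste(nrow(expr), ncol(expr), sep = "\t")), gct_file)

# Row 3 onward: NAME, DESCRIPTION, then one column per sample
gct <- data.frame(NAME = rownames(expr), DESCRIPTION = "na", expr,
                  check.names = FALSE)
suppressWarnings(   # R warns when a header is appended to an existing file
  write.table(gct, gct_file, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = TRUE, append = TRUE)
)
readLines(gct_file, n = 4)
```

Writing the file from R rather than Excel also sidesteps the gene-name auto-formatting problem noted in footnote 3.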
Fig. 8.4 Example of GSEA input file format tab separation *.txt file
Fig. 8.5 Example of GSEA phenotype label input file format *.cls
Categorical phenotype: the categorical phenotype expression input file format used in
practice is as follows. The first line contains the number of samples, number of phenotype
labels, and the number 1; a space separates the numbers. In the second line after “#,” the
phenotype label type is annotated. In the third line, the phenotype label is listed, separated
by a space, in the same order as the samples are in the gene expression data file (Fig. 8.5).
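The three lines of a categorical *.cls file can be generated as in the following sketch. The phenotype vector and the file name are hypothetical; the layout follows the description above.

```r
# Hypothetical phenotype labels, in the same order as the expression columns
pheno <- c("ALL", "ALL", "AML", "ALL", "AML", "AML")
classes <- unique(pheno)
cls_file <- file.path(tempdir(), "example.cls")

writeLines(c(
  paste(length(pheno), length(classes), 1),     # samples, labels, the number 1
  paste("#", paste(classes, collapse = " ")),   # label names after "#"
  paste(pheno, collapse = " ")                  # one label per sample
), cls_file)
readLines(cls_file)
```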
Each row in the gene set database represents one gene set. Column 1 contains the gene
set name, column 2 contains a description of the gene set, and the member gene names are
listed from column 3 onward. Since each set may have a different number of member genes, the
number of columns in each row of the *.gmt file can differ. The gene set name in
column 1 must be unique and cannot be duplicated (Fig. 8.6).
Fig. 8.6 Example of gene set database *.gmt file format used in GSEA
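Because rows of a *.gmt file have different numbers of columns, reading it line by line is simpler than reading it as a table. A minimal sketch follows, using a two-set file written on the spot; the set names and descriptions are hypothetical.

```r
gmt_file <- file.path(tempdir(), "example.gmt")
# Two hypothetical gene sets with different numbers of member genes
writeLines(c("SET_A\tfirst example set\tTP53\tMDM2\tCDKN1A",
             "SET_B\tsecond example set\tABL1\tBCR"), gmt_file)

# Split each line on tabs: name, description, then the member genes
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)
gene_sets <- lapply(fields, function(f) f[-(1:2)])
names(gene_sets) <- vapply(fields, `[`, character(1), 1)
stopifnot(!anyDuplicated(names(gene_sets)))   # set names must be unique
lengths(gene_sets)
```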
Microarray chip companies produce annotation files in their own formats, and this
information is put into a *.chip file. Although this file is not used directly in GSEA, it is used in
the interpretation.
The file is arranged in three columns of tab-separated ASCII format. The first row contains
the three column labels (“Probe Set ID,” “Gene Symbol,” and “Gene Title”). Starting from
the second row, each row lists a Probe Set ID of the microarray chip, followed by the
gene symbol and its description.
The gene set database, gene expression data, and phenotype data necessary for the analysis
are loaded into the installed GSEA program. To load the files, open the “Load Data” tab
by selecting “Load data” in the left menu “Steps in GSEA Analysis.” Select either
“Method 1: Browse for files…” or “Method 2: Load last dataset used.” Alternatively, use
“Method 3: drag and drop files here” to move input files with the mouse. You can confirm the
input file list in the “Object cache” on the bottom right. A detailed menu can be viewed
by double-clicking or right-clicking an entry in the list. File content can be confirmed with
“Phenotype Viewer” or “Dataset Viewer” (Fig. 8.7).
Fig. 8.7 The GSEA program window after uploading three input files
The following practice example uses Leukemia_example.gct, for the expression data,
Leukemia_example.cls for phenotype data, and C1.gmt for the gene set database.
Execute GSEA by selecting “Run GSEA” in the left menu after uploading all three input
files.
An additional information input window to execute GSEA can be viewed by selecting “Run
GSEA” in the left menu. There are three categories: “Required,” “Basic,” and “Advanced”
(Fig. 8.8).
Fig. 8.8 An additional information input window necessary for GSEA execution is visible by opening
the “Run GSEA” tab
Fig. 8.9 MSigDB gene set listed in “Gene matrix (from website)” tab
• Number of permutations: the number of phenotype label or gene set permutations
performed. The larger the number, the higher the confidence in the result but also the
longer the calculation time.
• Phenotype labels: select the phenotype labels to compare.
• Collapse dataset to gene symbol
-> True: if a *.chip file is present, use it to map probes to gene symbols
-> False: if the gene symbols are already mapped or there is no *.chip file
• Permutation type: select whether phenotype labels or gene sets are permuted.
• Chip platform(s): designate the *.chip file. The Leukemia.gct file used in the practice has
already been mapped to gene symbols, so no chip designation is needed. Therefore,
“Collapse dataset to gene symbol” is set to “False.”
The analysis is started by selecting “Run” at the bottom right. A permutation test run
1000 times on the practice data takes 1–2 min. When the analysis is complete, “Success”
is listed under “GSEA reports” at the bottom left. The results report can be viewed as
an HTML file by selecting an entry in the list.
GSEA result files are saved by default in the directory “gsea_home” (e.g.,
C:\Documents and Settings\username\gsea_home). The report is generated as an
HTML file that links to the relevant files through hyperlinks in a browser, as shown below
(Fig. 8.10).
By default, the report provides the analysis results for both phenotype labels (e.g., ALL
vs. AML). Gene sets with FDR < 25% or nominal p-value < 5% can be interpreted as
significantly enriched. The nominal p-value is estimated for each gene set separately
and is not corrected for multiple hypothesis testing; therefore, FDR < 25% is used as
the standard criterion.
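The difference between a nominal p-value and a value corrected for multiple testing can be illustrated in R. The sketch below is an assumption-laden stand-in: it uses the Benjamini-Hochberg adjustment from base R rather than the permutation-based FDR that the GSEA program reports, and the p-values and set names are made up.

```r
# Made-up nominal p-values for five hypothetical gene sets.
nominal_p <- c(SET_A = 0.001, SET_B = 0.010, SET_C = 0.040,
               SET_D = 0.300, SET_E = 0.600)

# Benjamini-Hochberg adjustment (a stand-in for the GSEA FDR q-value).
fdr <- p.adjust(nominal_p, method = "BH")

# Gene sets passing each criterion.
passes_nominal <- names(nominal_p)[nominal_p < 0.05]
passes_fdr <- names(fdr)[fdr < 0.25]
```

With many gene sets tested at once, the corrected value is the safer criterion, which is why the threshold is placed on the FDR.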
The individual analysis results in Fig. 8.11 can be viewed by selecting “Detailed
enrichment results” in the analysis results report and then selecting the results of each
gene set in the third column, “GS Details” (Fig. 8.10). The leading edge subset
is an important subset of the relevant gene set: it contains the top-ranking genes
that contribute to the enrichment score (ES), from 0 up to the top score. An additional
analysis to find the biological meaning, “Leading edge subset analysis,” is available in the left
menu window. The GSEA results can be loaded by selecting “Load GSEA
Results” in the top right menu. In the results page, select the target gene sets and then select
“Run leading edge analysis” to run the analysis (Fig. 8.12).
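To make the idea of the running score and the leading edge subset concrete, the following minimal sketch computes a weighted running-sum enrichment score on made-up data. The function name enrichment, the gene names, and the scores are assumptions for illustration; the GSEA program itself also performs permutation and normalization steps that are omitted here.

```r
# ranked_genes: genes ordered by decreasing ranking metric.
# scores: the ranking metric in the same order; p = 1 weights hits by |score|.
enrichment <- function(ranked_genes, scores, gene_set, p = 1) {
  hit <- ranked_genes %in% gene_set
  w <- abs(scores)^p
  # Step up at set members (weighted), step down at all other genes.
  step <- ifelse(hit, w / sum(w[hit]), -1 / sum(!hit))
  running <- cumsum(step)
  peak <- which.max(abs(running))   # position of maximum deviation from 0
  es <- running[peak]
  idx <- seq_along(hit)
  # Leading edge: set members ranked up to the peak (after it for a negative ES).
  leading <- if (es >= 0) {
    ranked_genes[hit & idx <= peak]
  } else {
    ranked_genes[hit & idx >= peak]
  }
  list(es = es, running = running, leading_edge = leading)
}

ranked_genes <- paste0("g", 1:10)
scores <- c(5, 4, 3, 2, 1, -1, -2, -3, -4, -5)
es_res <- enrichment(ranked_genes, scores, gene_set = c("g1", "g2", "g4"))
es_res$leading_edge
```

In this toy example all three set members lie before the peak of the running score, so the whole set forms the leading edge.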
(Upper left) Heat map of the expression profiles of the leading edge subset genes in the
selected gene sets
(Upper right) Overlap between the leading edge subsets of the selected gene sets: higher
overlap in green, lower in white
(Lower left) Bar graph of the gene sets containing each leading edge subset gene
(Lower right) Histogram of the Jaccard distance between the leading edge subsets
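The overlap measure named in the last panel can be computed directly. The sketch below uses made-up leading edge subsets; the function name jaccard_index is an assumption. The Jaccard index is the size of the intersection divided by the size of the union, and the Jaccard distance is one minus the index.

```r
# Hypothetical helper: Jaccard index of two gene vectors.
jaccard_index <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

le1 <- c("g1", "g2", "g4")        # made-up leading edge subset 1
le2 <- c("g2", "g4", "g7", "g9")  # made-up leading edge subset 2

j_index <- jaccard_index(le1, le2)  # 2 shared genes out of 5 distinct genes
j_distance <- 1 - j_index
```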
8.6 · GSEA Analysis Via R
As introduced in Sects. 8.3, 8.4, and 8.5, the Java version of GSEA is more user-friendly.
However, R is highly recommended for handling the analysis results more freely. In this
practice, the acute lymphoblastic leukemia (ALL) data reported by Chiaretti et al. [2] are
analyzed by GSEA using R.
8.6.1 Installation
• Select the “Downloads” tab on the MIT Broad Institute website (http://
• For Windows users, download the file into “C:\gda\ch08\” and unzip it (see
Sect. 8.2, Prerequisites).
> setwd("C:/gda/ch08/GeneSet/")
> source("GSEA.1.0.R")
Before running GSEA in R, the genes must be mapped to gene symbols. The practice file
has already been mapped.
A directory is needed to store the results. In this practice, the directory with
the input files, datasets/, and the directory with the gene set data, GeneSetDatabases/, are
used. The result reports are saved as text and image files in the Reports/ directory
(Figs. 8.13 and 8.14).
Fig. 8.13 Comprehensive report of GSEA results for each gene set
Fig. 8.14 Survival plot for association of BRCA 1/2 mutations with survival (x-axis: Time, days; y-axis: Proportion; groups: BRCA1 mutation, BRCA2 methylation, BRCA2 mutation, BRCA wild-type)
> GSEA(
# Check working directory with getwd() and set working directory with setwd("C:/gda/ch08/")
#nperm = 1000,
  weighted.score.type = 1,
  preproc.type = 0,
  perm.type = 0,
)
8.7 Survival Analysis
With the CGDS-R package provided by cBioPortal, we can access TCGA data relatively
easily.
We can retrieve the list of cancer studies available on the cBioPortal servers. Among these
studies, select ovarian cancer and save it in the “mycancerstudy” variable.
> head(cancerstudy)
> mycancerstudy
[1] "ov_tcga_pub"
#"ov_tcga_pub_mutations"
#"ov_tcga_pub_methylation_hm27"
To verify the prognostic effects of mutation and methylation of the BRCA1/2
genes in ovarian cancer, it is crucial to compare these factors with overall survival. Therefore,
before the analysis, we first split the patients into a BRCA1 mutation group, a BRCA2
mutation group, a BRCA1 methylation group, and a wild-type group.
c('BRCA1', 'BRCA2'))
2 mutation
== 'BRCA2'), 3]
> table(brca_mutation$gene_symbol)
BRCA1 BRCA2
37 35
Methylation of the BRCA1/2 genes leads to lowered gene expression, which may have a
negative impact on survival. Therefore, we will extract patients with BRCA1 or BRCA2
methylation.
mycaselist)
> brca1_methylation_cases =
> brca2_methylation_cases =
So far, we have extracted the lists of patients that have either mutations or methylation in
the BRCA1/2 genes. Next, we have to extract the survival time variable from the clinical
variables in order to proceed with the survival analysis.
> myclinicaldata = getClinicalData(mycgds, mycaselist) # Get clinical data for the case list
> myclinicaldata$OS_MONTHS
> myclinicaldata$OS_MONTHS
[1] 12.29 39.82 8.41 36.99 17.77 57.53 31.14 35.78 29.37 24.67 23.98
Differentiate patients with mutations or methylation in either BRCA gene from those
without.
> total_sample <- rownames(myclinicaldata) # Extract all patients with clinical data
> type <- rep('Wild', length(total_sample)) # Assume all patients are wild type and make a
> names(type) <- total_sample # Matching patient IDs: matching patient IDs is a crucial
# step because patient IDs used in mutation profile, and patient IDs used in clinical, methylation
Perform survival analysis on four patient groups categorized by BRCA mutation and
methylation profile.
> library(survival)
myclinicaldata)
> plot(out, col=color, main = "Association of BRCA1/2 Mutations with Survival", xlab=
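Since the commands above depend on data retrieved from cBioPortal, the same type of analysis can be tried on made-up data first. The following is a minimal sketch, not the analysis of this section: the data frame toy, its values, and the object names fit and logrank are assumptions for illustration. It fits Kaplan-Meier curves per group with the survival package and compares the groups with a log-rank test.

```r
library(survival)

# Made-up follow-up data; the group labels only mimic those of this section.
toy <- data.frame(
  months = c(12, 40, 8, 37, 18, 58, 31, 36, 29, 25, 24, 50),
  dead   = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1),   # 1 = death, 0 = censored
  type   = rep(c("BRCA1 mutation", "BRCA2 mutation", "Wild"), each = 4)
)

fit <- survfit(Surv(months, dead) ~ type, data = toy)      # Kaplan-Meier curves
logrank <- survdiff(Surv(months, dead) ~ type, data = toy) # log-rank test

color <- c("red", "blue", "black")
plot(fit, col = color, xlab = "Time, months", ylab = "Proportion")
legend("bottomleft", legend = names(fit$strata), col = color, lty = 1)
```

With real data, the grouping variable would be the type vector built above, aligned to the rows of the clinical data by patient ID.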
Exercise
[Exercise 1] - Using the gene set database C2.gmt with the ‘Leukemia’ data, find the gene sets with FDR < 25%.
[Exercise 2] - Conduct a leading edge subset analysis on the analysis results above using the gene sets
with FDR < 25%.
[Exercise 3] - Using the gender data gender.gct and gender.cls and the motif gene set C3.gmt provided by the
Broad Institute, find the gene sets with FDR < 25%.
9.1 Introduction
MicroRNA (miRNA) is a comparatively short, single-stranded RNA of about 21–22 bases
that does not code for protein, i.e., a non-coding RNA. miRNA is known as an important
post-transcriptional regulator of gene expression. Therefore, we can infer the functions of
the genes that are controlled by miRNA (or are related to its targets).
Investigating miRNA target relationships is important in miRNA research. However,
since the number of verified miRNAs and their target mRNAs is small, many target
prediction algorithms have been developed and used. Furthermore, in contrast to plants,
where prediction is quite accurate, the accuracy of miRNA target prediction algorithms
is lower in higher organisms because the miRNA interacts with its target gene through
imperfect base pairing around the seed region. There is a high chance of false positives.
Therefore, there is a limit to using base pair pattern analysis to predict miRNA-gene
interaction mechanisms.
Expression pattern analysis of gene expression data obtained from microarray
experiments can be used to infer the biological functions of miRNAs. Therefore, in the exercise,
paired miRNA-mRNA expression profile data from the same samples are used together with
target prediction algorithms and known biological knowledge to carry out a functional
analysis and to increase our understanding of miRNA-gene interactions. In this chapter, we do not
use RNA-seq data obtained from next-generation sequencing technology to analyze miRNA
expression. RNA-seq data will be used in Genome Data Analysis Version II for NGS.
9.2 Prerequisites
> mRNA_nor <- read.table("mRNA.txt", header = TRUE, sep = "\t") # Load mRNA expression data
> miRNA_nor <- read.table("miRNA.txt", header = TRUE, sep = "\t") # Load miRNA expression data
> head(miRNA_nor)
MetaCol MetaRow Column Row Reporter_ID Reporter_Name I04hr_1 I04hr_2
1 1 1 2 BM11097 hsa_miR_518a_2_AS 2.605266 4.638006
1 1 2 2 BM11207 ambi_miR_7920 3.463365 3.822202
1 1 3 2 BM11352 rno_miR_499 3.360941 4.243580
1 1 4 2 BM11262 ambi_miR_10394 2.843248 4.334861
1 1 5 2 BM10438 hsa_let_7e 10.387571 9.903674
1 1 6 2 BM10719 hsa_miR_99a 9.271923 8.628829
1 1 7 2 BM10137 hsa_miR_133a 4.048892 4.497045
1 1 8 2 BM11246 ambi_miR_11576 4.186156 4.428559
1 1 9 2 BM11236 ambi_miR_279 3.638117 4.700757
1 1 10 2 BM10165 hsa_miR_372 2.775328 4.059538
* column 1~4: MetaCol, MetaRow, Column, Row (The location of the probe on the
microarray)
* column 5: reporter_id
* column 6: miRNA_id
* column 7~30: normalized expression values (04hr_1, 04hr_2, 08hr_1, 08hr_2, 12hr_1,
12hr_2, 16hr_1, 16hr_2, 20hr_1, 20hr_2, 24hr_1, 24hr_2, 28hr_1, 28hr_2, 32hr_1,
32hr_2, 36hr_1, 36hr_2, 40hr_1, 40hr_2, 44hr_1, 44hr_2, 48hr_1, 48hr_2 sample)
If the number of missing values is greater than 0, we remove them to obtain the expression
matrix. In this case, we obtain an expression matrix without missing values by removing
every row in which an miRNA expression value is missing.
> dim(miRNA_nor.omit_na)
The final matrices are saved in the mRNA and miRNA variables. We use them as the input
for further operations.
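The removal of rows with missing values can be sketched on a small made-up table. The object names expr and expr_omit_na and the values are assumptions for illustration; na.omit is the base R function that drops every row containing at least one missing value.

```r
# Made-up expression table with one missing value.
expr <- data.frame(
  Reporter_Name = c("miR_a", "miR_b", "miR_c"),
  s04hr_1 = c(2.6, NA, 3.4),
  s04hr_2 = c(4.6, 3.8, 4.2)
)

n_missing <- sum(is.na(expr))   # number of missing values in the table
expr_omit_na <- na.omit(expr)   # drop every row that contains a missing value
dim(expr_omit_na)               # rows and columns that remain
```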
Let us calculate the correlation between the 2006th gene of the mRNA matrix and the
625th miRNA. First, verify that the 2006th gene is GI_3807572-S (Mus musculus similar to
ribosomal protein S24 (Loc380888)) and the 625th miRNA is “hsa_miR_489.”
> res
9.4 · miRNA-mRNA Expression Correlation Coefficient
Use the cor.test function to verify the applicable miRNA-mRNA correlation results as
shown below.
> this_mat <- cbind(x, t_miRNA) #ADD the expression in the input matrix
t (transpose) is a function that converts rows to columns and vice versa. After transposing
the miRNA matrix so that the samples are in the rows, add the expression of the selected
mRNA to the matrix as a column (Fig. 9.1). With this input matrix, we can calculate
the correlations using the cor function. We can repeat this process for 10 mRNAs.
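The construction of the input matrix can be tried on simulated data. The sketch below is illustrative only: the matrix miRNA, the vector x, and all values are made up, and the toy mRNA is deliberately built to be anti-correlated with one miRNA so that the result is easy to recognize.

```r
set.seed(1)
samples <- paste0("s", 1:24)

# Made-up miRNA matrix: 5 miRNAs (rows) x 24 samples (columns).
miRNA <- matrix(rnorm(5 * 24), nrow = 5,
                dimnames = list(paste0("miR_", 1:5), samples))

# Toy mRNA profile, anti-correlated with miR_3 by construction.
x <- -miRNA["miR_3", ] + rnorm(24, sd = 0.1)

t_miRNA <- t(miRNA)             # samples in rows, miRNAs in columns
this_mat <- cbind(x, t_miRNA)   # mRNA expression added as the first column
this_cor <- cor(this_mat)       # all pairwise Pearson correlations

# First column without its first entry: the mRNA against every miRNA.
mRNA_vs_miRNA <- round(this_cor[-1, 1], 5)

# cor.test gives the same coefficient for one pair, together with a p-value.
ct <- cor.test(x, t_miRNA[, "miR_3"])
```

Dropping the first row of the first column removes the correlation of the mRNA with itself, which is always 1.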
[Fig. 9.1: schematic of the input matrix; labels: Samples, mRNA, miRNAs]
Repeat this process for 10 mRNAs. The input for the repeat process is as shown below.
> start_i <- 2001 #Select the 2001st mRNA to the 2010th mRNA
> for (i in start_i:end_i) #Repeat as many times as there are number of mRNAs
res <- rbind(res, round(this_cor[-1, 1], 5)) #Extract correlation values of selected
> dim(res)
This gives the correlation matrix res with a size of 10 (rows) × 2662 (columns),
in which the rows are mRNAs and the columns are miRNAs. Find the maximum and minimum
correlation values.
> max(res)
> min(res)
Generally, miRNAs show a negative correlation with mRNA gene expression; search for
miRNA-mRNAs showing a negative correlation. In this practice, we find correlation coef-
ficients less than −0.7.
#Create a function converting a location into the row and column position of a two-dimensional array
In this case, the position whose correlation coefficient is less than −0.7 has the sequence
index 6246, which corresponds to the sixth row and the 625th column when converted to
rows and columns. We can then verify the gene names as below.
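The conversion from a sequence index to a row and column follows from the column-by-column storage of R matrices. The sketch below is an illustration with a made-up matrix: the helper name to_position and the object res_toy are assumptions, and only one entry is set below −0.7 so that the index matches the example in the text.

```r
# Hypothetical helper: convert a sequence (linear) index into row and column.
to_position <- function(index, n_row) {
  c((index - 1) %% n_row + 1, (index - 1) %/% n_row + 1)
}

res_toy <- matrix(0, nrow = 10, ncol = 2662)  # toy stand-in for res
res_toy[6, 625] <- -0.8

index <- which(res_toy < -0.7)   # sequence index, counted column by column
this_position <- to_position(index, nrow(res_toy))

# Base R can return the row and column directly.
which(res_toy < -0.7, arr.ind = TRUE)
```

With 10 rows, index 6246 is the sixth entry of the 625th column, because the first 624 columns account for 6240 entries.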
166 Chapter 9 · MicroRNA Data Analysis
> res[this_position[1],this_position[2]]
rownames(res)[this_position[1]], colnames(res)[this_position[2]],
res[this_position[1],this_position[2]])
> position_res
> plot(x_at, y_m, type = "l", pch = 1, xlim = c(1, length(x)), ylim=c(0, max(y_m,
> points(x_at, y_mi, type = "l", col = "blue", xlim = c(1, length(x_at)), ylim = c(0,
[Fig. 9.2: line plot; y-axis: exprs (0 to 8); x-axis: x_at (time points 4 to 48)]
A graph similar to the one below will be printed. The red line represents mRNA and the
blue line miRNA (. Fig. 9.2).
TarBase v7.0, released in 2014, covers about 20 species and about 65,000 miRNA-mRNA
target relationships. We can use it to check whether a significant miRNA-mRNA
pair has been verified experimentally. First, access the site below (Fig. 9.3).
http://www.microrna.gr/tarbase
Gene or miRNA names are searchable. In the practice, we search for the miRNA
“hsa-let-7a-5p”. The search results are similar to Fig. 9.4.
For each target gene of “hsa-let-7a-5p,” the validation methods (reporter gene
assay, Northern blot, Western blot) are shown, and if the target is also predicted by the
microT prediction algorithm provided by TarBase, the prediction score is supplied as well.
In order to run the analysis, select the species of interest, enter the gene name
or miRNA, and click “Submit.” In this exercise, we attempt to predict the target genes of
human “miR-155-5p”.
Detailed Search Results provide information on the predicted target gene name,
representative transcript, type of miRNA target site, and total context score. In the “Links
to sites in UTRs” column, you can view the miRNAs that target the 3’UTR region of the
applicable gene and their binding sites (Figs. 9.6, 9.7 and 9.8).
Fig. 9.8 TargetScan detailed search: click the gene symbol ID in the “Target gene” column or the
entry in the “Representative transcript” column to view detailed information linked to Ensembl
Exercises
[Exercise 1] - Calculate the correlation coefficients between the applicable gene and all miRNAs, and
graph the correlation distribution. Also, find miRNA-mRNA correlation coefficients that are lower than
−0.7 or higher than 0.7.
[Exercise 2] - Verify the significantly correlated miRNA-mRNA pairs found in [Exercise 1] using TarBase
and TargetScan.
Bibliography
1. Agarwal V et al (2015) Predicting effective microRNA target sites in mammalian mRNAs. Elife 12:4. https://doi.org/10.7554/eLife.05005
2. Barbato C et al (2009) Computational challenges in miRNA target predictions: to be or not to be a true target? J Biomed Biotechnol 2009:803069
3. Bartel DP (2009) MicroRNAs: target recognition and regulatory functions. Cell 136(2):215–233
4. Garofalo M, Croce CM (2011) microRNAs: master regulators as potential therapeutics in cancer. Annu Rev Pharmacol Toxicol 51:25–43
5. Na YJ et al (2009) Comprehensive analysis of microRNA-mRNA co-expression in circadian rhythm. Exp Mol Med 41(9):638–647
6. R Core Team (2016) R: a language and environment for statistical computing. R foundation for statistical computing. Vienna, Austria. URL https://www.R-project.org/
7. TarBase – http://www.microrna.gr/tarbase
8. TargetScan – http://www.targetscan.org
9. Vlachos IS et al (2015) DIANA-TarBase v7.0: indexing more than half a million experimentally supported miRNA:mRNA interactions. Nucleic Acids Res 43(Database issue):D153–D159
Network Biology, Sequence, Pathway and Ontology Informatics
10.1 Introduction
1 Homology means similar DNA or protein sequences of an individual of the same or different
species. It is used to infer sequence function by searching for a highly homologous sequence
with a sequence of interest or to predict evolutionary correlation between sequences.
10.2 · Sequence Data Analysis
Fig. 10.1 A sequence logo visualizes the frequency of a base or amino acid at each position of a sequence
as a log value
Before analyzing a sequence, it is essential to obtain basic sequence information. The
National Center for Biotechnology Information (NCBI) holds an extensive amount of
biotechnology information, including PubMed, a massive archive of biotechnology and
medical paper indices, and GenBank, a genome sequence database. Entrez is an integrated
retrieval system that provides not only DNA and protein sequence data but also the related
MEDLINE (footnote 2) literature, the genomic data of GenBank, taxonomy, and tertiary
protein structure databases.
Sequence alignment can be classified into two methods depending on the number of
input sequences:
• Pairwise alignment
• Multiple alignment
be searched by using the PubMed system. Also, it is possible to retrieve the full text from 25
academic journals from the website in the PubMed system.
178 Chapter 10 · Network Biology, Sequence, Pathway and Ontology Informatics
Pairwise alignment is a method for indicating homology when there are two alignment
objects. The objects can be a pair of amino acid sequences or a pair of base sequences.
Pairwise alignment is the most fundamental method for comparing homology. BLAST,
one of the pairwise alignment tools, is mostly used for searching a gene sequence database
for homologous sequences, and it is also used for comparing the similarity of two gene
sequences. BLAST uses a heuristic algorithm, similar to FASTA, to search for homology.
BLAST is provided online by the European Bioinformatics Institute (EBI) and NCBI, but it
can also be run as a stand-alone program.
• EBI BLAST: http://www.ebi.ac.uk/Tools/psa/
There are different types of BLAST, and NCBI provides the following three methods:
• BLASTN (Nucleotide BLAST): comparison between base sequences
• BLASTP (Protein BLAST): comparison between protein sequences
• BLASTX: comparison against a protein sequence database after translating the
input base sequence in six frames.
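The “six frames” used by BLASTX are the three reading frames of the input strand and the three reading frames of its reverse complement. The following minimal sketch only splits a toy sequence into codons for each frame and does not translate them into amino acids; the function names reverse_complement and six_frames and the input sequence are assumptions for illustration and are not part of BLAST.

```r
# Hypothetical helper: reverse complement of an upper-case DNA string.
reverse_complement <- function(dna) {
  comp <- chartr("ACGT", "TGCA", dna)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Hypothetical helper: codons of the six reading frames.
six_frames <- function(dna) {
  strands <- c(forward = dna, reverse = reverse_complement(dna))
  frames <- list()
  for (s in names(strands)) {
    for (offset in 0:2) {
      sq <- substring(strands[[s]], offset + 1)   # skip 0, 1, or 2 bases
      n_codons <- nchar(sq) %/% 3                 # incomplete codons are dropped
      starts <- seq(1, by = 3, length.out = n_codons)
      frames[[paste0(s, "_", offset + 1)]] <- substring(sq, starts, starts + 2)
    }
  }
  frames
}

frames <- six_frames("ATGGCCTAA")
frames$forward_1
frames$reverse_1
```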
Global alignment is a method used to search for the optimal alignment over the entire
length of two sequences. It is suitable when the lengths of the compared sequences are
similar and the overall similarity is extensive. Global alignment with the Needleman-Wunsch
algorithm, a representative method, consists of three steps: (1) initialization; (2)
score calculation; and (3) alignment. The scoring scheme is as follows: match, add 1;
mismatch, subtract 1; indel, subtract 1. The sequences are aligned so that the highest
score is obtained.
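The three steps can be written out in a few lines of R. The following is a minimal sketch under the scoring scheme stated above (match +1, mismatch −1, indel −1); the function name nw_align and the helper subst are assumptions for illustration, and when several alignments share the highest score only one of them is returned.

```r
nw_align <- function(s1, s2, match = 1, mismatch = -1, indel = -1) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- length(a); m <- length(b)
  subst <- function(i, j) if (a[i] == b[j]) match else mismatch

  # (1) Initialization: first row and column hold cumulative indel penalties.
  S <- matrix(0, n + 1, m + 1)
  S[, 1] <- (0:n) * indel
  S[1, ] <- (0:m) * indel

  # (2) Score calculation: best of diagonal (match/mismatch), up, or left (indel).
  for (i in seq_len(n)) for (j in seq_len(m)) {
    S[i + 1, j + 1] <- max(S[i, j] + subst(i, j),
                           S[i, j + 1] + indel,
                           S[i + 1, j] + indel)
  }

  # (3) Alignment: trace back from the bottom-right cell to the top-left cell.
  i <- n; j <- m; top <- character(0); bottom <- character(0)
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && S[i + 1, j + 1] == S[i, j] + subst(i, j)) {
      top <- c(a[i], top); bottom <- c(b[j], bottom); i <- i - 1; j <- j - 1
    } else if (i > 0 && S[i + 1, j + 1] == S[i, j + 1] + indel) {
      top <- c(a[i], top); bottom <- c("-", bottom); i <- i - 1
    } else {
      top <- c("-", top); bottom <- c(b[j], bottom); j <- j - 1
    }
  }
  list(score = S[n + 1, m + 1],
       s1 = paste(top, collapse = ""), s2 = paste(bottom, collapse = ""))
}

small <- nw_align("ACGT", "AGT")
book <- nw_align("aaagcggaagtcacag", "aaggctgaagtatag")
```

The second call uses the two sequences of Fig. 10.2 with the gap removed from the second sequence.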
10.3 · Phylogenetic Tree Analysis
Fig. 10.2 Performing global alignment for two sequences
a a a g c g g a a g t c a c a g
    •     •               •
a a g g c t g a a g t - a t a g
A local alignment aligns only the region of strongest homology instead of aligning the
two sequences in their entirety. It is suitable when the overall similarity of the sequences
is low, especially when the lengths of the compared sequences are quite different.
The Smith-Waterman algorithm is most commonly used for local alignment.
It is similar to the Needleman-Wunsch algorithm used in global alignment, but the score
calculation method is different. We can see different results for the same input as in
Fig. 10.2 (global alignment by the Needleman-Wunsch algorithm).
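The difference in the score calculation can be seen in a short sketch of the local alignment score. This is an illustration under the same assumed scoring scheme as for global alignment (match +1, mismatch −1, indel −1); the function name sw_score and the toy sequences are assumptions, and the traceback that recovers the aligned region is omitted.

```r
sw_score <- function(s1, s2, match = 1, mismatch = -1, indel = -1) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  H <- matrix(0, length(a) + 1, length(b) + 1)  # first row and column stay 0
  for (i in seq_along(a)) for (j in seq_along(b)) {
    d <- H[i, j] + if (a[i] == b[j]) match else mismatch
    # Unlike global alignment, a cell never drops below 0 ...
    H[i + 1, j + 1] <- max(0, d, H[i, j + 1] + indel, H[i + 1, j] + indel)
  }
  max(H)  # ... and the best local alignment ends at the highest-scoring cell
}

# The shared region "ACGT" is found although the flanking parts differ.
local_score <- sw_score("TTTTACGTAAAA", "GGGACGTCCC")
```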
[Figure labels: Bacteria, Eukaryota, Archaea]
The widely used visualization tools for sequence analysis are the UCSC Genome Browser,
NCBI Map Viewer, and Ensembl Genome Browser. Each tool provides annotation
information and visualization along the genome coordinates. One drawback is that,
although the different genome databases share the same basic genome sequence, their
annotation information differs in many cases. The number of genes, the exon-intron
structures, and their specific locations in the genome coordinate system show subtle
differences between the databases. Thus, it is very important to indicate the information
source when you use the sequence information.
The most useful feature of the UCSC Genome Browser (footnote 3) is that it provides
annotation information as multiple layered tracks by arranging all of the data along the
genome coordinate system of the reference genome sequence. NCBI’s Map Viewer
(footnote 4) also provides genome-centered data, but the UCSC Genome Browser is better
with regard to ease of use and utility. The UCSC Genome Browser bundles various related
annotation information into tracks. It provides not only graphic data but also downloads
of text data and result files in HTML format. The UCSC Genome Browser allows not only
the retrieval of individual gene information but also the automatic processing of extensive
genomic data, which previously required considerable bioinformatics knowledge.
3 https://genome.ucsc.edu/
4 http://www.ncbi.nlm.nih.gov/projects/mapview
10.5 Biological Pathway Analysis
The construction of pathway databases has laid the foundation for biological
pathway data to become a core component of bioinformatics research in genomics data
analysis. Table 10.1 shows the list of known pathway databases. With recent increase in the
sis are core methodologies of genome data analysis. Text mining techniques have also
been developed based on pathway information. PharmGKB provides pathways based on
pharmacogenetics.
Pathways, the representative aggregate of biological knowledge, will only become
more valuable in the field of bioinformatics.
[Figure: signaling pathway diagram; labels include survival factors (e.g., IGF1); chemokines, hormones, transmitters (e.g., interleukins, serotonin, etc.); growth factors (e.g., TGFα, EGF); extracellular matrix; GPCR; integrins; RTK; cdc42; Wnt; Frizzled; Patched; SMO; cytokines (e.g., EPC); JAKs; STAT3,5; NF-κB; IκB; PKA; PLC; Grb2/SOS; Fyn/Shc; MEKK; MAPK; MKK; ERK; JNKs; β-catenin:TCF; Myc:Max; Mad:Max; Fos; Jun; Bcl-xL; death factors (e.g., FasL, Tnf)]
and Genomes
BioCyc       https://biocyc.org/
PANTHER      http://www.panther.org/pathway/
TRANSPATH    http://genexplain.com/transpath/
BioCarta     https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
Reactome     http://www.reactome.org
PharmGKB     https://www.pharmgkb.org/view/pathways.do
is under BP.
As you can see in the figure, the standardized terminology nodes are connected by
‘is_a’ and ‘part_of’ relationships. Recently, some new relationships have been added,
including ‘regulates’. For instance, “DNA replication” is under “DNA metabolism” (Fig. 10.6).
The figure shows that the term “DNA replication” is annotated
to the RNH3 and RNR1 enzymes, Drosophila Rntl and Rnrs, and mouse Rccc1, Rrm1, and
Rrm2. GO gene annotation is maintained independently in each genomic database, but it
can be logically concluded from the figure that the genes of these three species are
connected to the “DNA replication” function. More precisely, since a gene is connected to
a GO term and the GO term is in turn connected to other genes, all genomic
databases are connected to each other, which creates a massive bipartite graph structure.
GO annotation is a core resource for genome data analysis, used in a manner similar to
biological pathways, as shown in Chap. 5.
10.7 · Biomedical Text Mining
GO terminology includes additional information: (1) references; (2) the data; (3) the
source of the annotation; and (4) the evidence code (footnote 5). GO is led by the Gene
Ontology Consortium (http://www.geneontology.org/). GO is used widely in the genome data
5 http://www.geneontology.org/Go.evidence.shtml
Biological text mining applies text-mining technology to biomedical science data
and to the extensive body of molecular biology documents and papers. It has a central role
in the biomedical science field due to the huge increase in the amount of data. Biological
text mining techniques extract and manage life science knowledge from documents
written in human-readable language. PubMed, in particular, is the biggest
literature database in the biomedical science field; therefore, PubMed data is the
most mined information. While abstracts are the most utilized, text mining has been
extended to the full texts of documents as open access policies and online publication
have spread. Krallinger and others have classified the text-mining tools
developed in the biomedical science field into eight categories according to the
user’s question and purpose (Table 10.2).
Table 10.2 Biological text mining tools categorized according to the purpose of the user
Source of biomedical data | Functions used by the text mining tool | Text mining tools
Entity pair & lists | Gene group & lists analysis | Chilibot, SimConcept, GNormPlus
10.8 · Biomedical Network Analysis
Network science, a part of complex systems research, studies the intrinsic properties
of networks represented with nodes and edges (links). Studying a complicated
system from the perspective of the relationships between its nodes helps significantly in
understanding the network properties that are generally embedded in the natural and social
sciences. Acclaimed scientists, including Albert Barabási et al., found that
the connection structure of the Internet is a “scale-free” network, which attracted much
attention from academics in the field.
To understand the scale-free network, an understanding of the degree distribution
is required. The link number (degree) of a node is the number of links that the node has.
If the link number of a node is N-1 in a network of N nodes, that node
is connected to all other nodes in the network. If the link number of every node is
N-1, the network is a completely connected graph. The link
number distribution of a random network is generally symmetrical around a central value,
with few nodes at either extreme of many or few links.
[Fig. 10.7: link number distribution P(k) on a log scale (tick labels 10^-2.5, 10^-3, 10^-3.5); annotation: a few hubs with large number of links]
Then, what would the structure of the Internet be like? The Internet seems to have been
created randomly, but a precise analysis of its link structure shows that it is not
a random network but rather a scale-free network, as described previously. The link
number distribution of a scale-free network follows a power law. Thus, hub nodes, which
have a high link number, are few, while outer nodes with few links are plentiful,
which yields a straight line when the logarithm of the link number is used (Fig. 10.7).
Thus, the Internet consists of a few hubs and many outer nodes. The formation
of a scale-free network can be explained by growth and preferential attachment.
When a new node joins the network, it tends to connect to a relatively well-connected hub
node nearby, and the scale-free distribution is the result. A new member of the Internet
therefore tends to connect to a nearby hub, which creates this structure.
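Growth and preferential attachment can be simulated in a few lines to see how hubs emerge. The following is a minimal sketch and not a model of the real Internet: the number of nodes, the random seed, and the rule that each new node brings exactly one link are assumptions chosen for illustration.

```r
set.seed(42)
n_nodes <- 2000
degree <- integer(n_nodes)
degree[1:2] <- 1L   # start from two nodes joined by one link

for (new in 3:n_nodes) {
  # Growth: one new node per step. Preferential attachment: the probability of
  # linking to an existing node is proportional to its current link number.
  target <- sample.int(new - 1, 1, prob = degree[1:(new - 1)])
  degree[new] <- 1L
  degree[target] <- degree[target] + 1L
}

k_counts <- table(degree)   # how many nodes have each link number k

# On log-log axes, a power law P(k) appears roughly as a straight line.
plot(as.integer(names(k_counts)), as.numeric(k_counts) / n_nodes, log = "xy",
     xlab = "Link number k", ylab = "P(k)")
```

Most nodes keep only one or two links, while a few early nodes accumulate many links and act as hubs.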
The scale-free network is highly efficient for information exchange. In other
words, the average path length from one node to another is noticeably
shorter than in a random network; this is called the “small
world phenomenon”. The observation that “there are many people living in the
world, but it is possible to reach a specific person in an average of 6~7 connections”
can be explained by the link number analysis of the scale-free network. An additional
property is robustness against random attack: a random attack is likely to hit
one of the numerous outer nodes rather than one of the scarce hubs.
Conversely, the network is vulnerable to selective attacks on hubs. For instance, the airline
network of worldwide airports and air routes is a typical scale-free network;
the airline network as a whole is hardly affected by a random attack such as a natural
disaster, but it is very vulnerable to a selective attack on a hub airport by terrorists.
These phenomena are observed not only in the Internet but also in various other
networks. Of special interest is the protein interaction network, a biological
network that, like the Internet, has a scale-free structure and therefore high information
exchange efficiency and robustness against random attack. Another property
is that hub nodes are evolutionarily old. Since the network formation process
mentioned earlier consists of growth and preferential attachment, this
corresponds with the direct interpretation that the hubs are the older nodes (proteins).
Indeed, analyses of the protein interaction network have shown that hub
proteins are evolutionarily old and that many hub proteins are essential,
i.e., knocking them out is lethal. Many studies have shown
that the scale-free network has a self-organizing property. Computer networks such as
the World Wide Web, life networks such as genomic and metabolic networks,
and the rapidly growing social networks all have the scale-free network property. This
property is characteristic of complex system networks, as seen with the development of
social network services.
Life network research has become possible owing to the identification of extensive
protein/gene interactions from the analysis of publication abstracts in PubMed over the
past few decades, microarrays, which made large-scale transcriptome analysis possible,
and the development of next-generation sequencing (NGS), a large-scale genomic
technology. These days, network medicine, which originated from network biology,
has become a new, emerging field.
Life network research has drawn attention in current life science because it may
overcome the limits of existing reductionist research, which
focuses on a specific life process. For example, genetic diseases with
specific phenotypes are affected not only by a specific gene variation but also by
interactions between gene variations. One interesting study in network medicine is “Network
Medicine: From Obesity to the Diseasome”, published in NEJM in 2007. This
paper proposed that the pathophysiology of obesity results from life network
interactions at various levels together with the propagation of obesity through social
networks. “The Spread of Obesity in a Large Social Network over 32 Years” (Christakis et al. 2007),
published together with the paper mentioned above, claimed that obesity can be
propagated socially, based on an analysis of the social network data of 12,067 people from
1971 to 2003. That is, a person’s weight gain is related to the weight gain of
their friends, siblings, spouse, and neighbors. From now on, network analysis
can be expected to be applied in various fields, such as the structural and evolutionary
mechanisms of life, new drug development, and disease gene discovery.
Bibliography
1. Ashburner M et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Con-
sortium. Nat Genet 25(1):25–29
2. Bader GD et al (2006) Pathguide: a pathway resource list. Nucleic Acids Res 34(Database issue):D504–
D506
3. Barabási AL (2007) Network medicine – from obesity to the “diseasome”. N Engl J Med 357(4):404–407.
Epub 2007 Jul 25
4. Barabási AL et al (2000) Scale-free characteristics of random networks: the topology of the world-
wide web. Physica A Stat Mech Appl 281(1–4):69–77
5. Ciccarelli FD et al (2006) Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science
311(5765):1283–1287
6. Crooks GE et al (2004) WebLogo: a sequence logo generator. Genome Res 14(6):1188–1190
7. Johnson M et al (2008) NCBI BLAST: a better web interface. Nucleic Acids Res 36(Web Server issue):
W5–W9
8. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res
28(1):27–30
9. Karolchik D et al (2012) The UCSC Genome Browser. Curr Protoc Bioinformatics Chapter 1:Unit1.4
10. Karp PD et al (2005) Expansion of the BioCyc collection of pathway/genome databases to 160
genomes. Nucleic Acids Res 33(19):6083–6089. Print 2005
11. Krull M et al (2006) TRANSPATH: an information resource for storing and visualizing signaling path-
ways and their pathological aberrations. Nucleic Acids Res 34(Database issue):D546–D551
12. Larkin MA et al (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23(21):2947–2948. Epub
2007 Sep 10
13. Letunic I, Bork P (2007) Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and
annotation. Bioinformatics 23(1):127–128. Epub 2006 Oct 18
14. Matthews L et al (2009) Reactome knowledgebase of human biological pathways and processes.
Nucleic Acids Res 37(Database issue):D619–D622
15. McDonagh EM et al (2011) From pharmacogenomic knowledge acquisition to clinical applications:
the PharmGKB as a clinical pharmacogenomic biomarker resource. Biomark Med 5(6):795–806
16. McWilliam H et al (2013) Analysis Tool Web Services from the EMBL-EBI. Nucleic Acids Res 41(Web
Server issue):W597–W600
17. Schaefer CF et al (2009) PID: the Pathway Interaction Database. Nucleic Acids Res 37(Database
issue):D674–D679
18. Tatusova TA, Madden TL (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide
sequences. FEMS Microbiol Lett 174(2):247–250
19. Wolfsberg TG (2011) Using the NCBI Map Viewer to browse genomic sequence data. Curr Protoc Hum
Genet Chapter 18:Unit18.5
11 Motif and Regulatory Sequence Analysis
11.1 Introduction
Bibliography
11.1 Introduction
‘Gene finding’ and ‘transcription regulatory site finding’ through analysis of a given
genomic base sequence are the most fundamental informatics practices. There are still
many unsolved problems in gene finding, but transcription regulatory site finding is an
even harder challenge due to the complex patterns and short sequence lengths. Sequence
analysis algorithms are also used in evolutionary phylogenetic correlation analysis, and
such correlation leads to building phylogenetic trees through the comparison of gene,
protein, and RNA sequences. A peptide sequence comparison is used to find evolution-
arily conserved sequences of related proteins. These conserved sequences often provide an
essential function of the protein. Therefore, one strategy for identifying functional non-
coding regions like transcription regulatory sites is to find consensus sequences of bases
conserved across multiple species. Further analysis of the consensus sequence can infer
the function of that sequence region. ‘Phylogenetic analysis’ is a term used when building
and analyzing relationships of sequence similarities.
In this exercise, we will reproduce a major result reported by Cliften et al. (2003). We use
the Clustal Omega software, which applies a multiple sequence alignment technique, to search
for transcription factor binding sites (TFBS) in the yeast genome. We will also learn to use
various tracks in the UCSC Genome Browser to visualize the related information that we find.
Finally, we will practice making miRNA target gene predictions.
This practice uses sequence data from the yeast genome. The data consist of the four DNA
bases, adenine (A), guanine (G), cytosine (C), and thymine (T).
Before sequence analysis, sequence data acquired from an experiment need to be converted
into FASTA format. Sequence alignment algorithms such as BLAST and Clustal Omega, which are
used in this practice, widely support FASTA input and output. A FASTA record consists of two
parts: the annotation about the sequence and the sequence itself. The annotation line must
start with the symbol “>”, and the actual sequence data must start right below the
annotation line. The annotation line can include Entry Name, Molecule Type, Gene Name, and
Sequence Length. Only Entry Name is required, but including Molecule Type, Gene Name, and
Sequence Length is recommended. Sequence data may be DNA bases or amino acids. Figure 11.1
shows an example FASTA sequence with the gene name ‘YDR374C’ and a sequence length of ‘362’.
In the practice below, we will convert sequence data into FASTA format. The sequence data
used in this practice are from Cliften et al.: the DNA base sequence of the gene YDR374C
from four species of yeast.
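The FASTA layout described above can also be produced programmatically. The following is a minimal sketch in base R, not part of the original practice: the function names `write_fasta` and `read_fasta`, the record names, the temporary file, and the line width of 20 are all illustrative assumptions.

```r
# Minimal sketch (assumed helper names): write and read FASTA records with base R.
write_fasta <- function(seqs, path, width = 60) {
  lines <- character(0)
  for (name in names(seqs)) {
    s <- seqs[[name]]
    starts <- seq(1, nchar(s), by = width)
    # Cut the sequence into chunks of at most `width` characters
    chunks <- substring(s, starts, pmin(starts + width - 1, nchar(s)))
    # Annotation line starts with ">", sequence lines follow right below it
    lines <- c(lines, paste0(">", name), chunks)
  }
  writeLines(lines, path)
}

read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                 # record index of every line
  headers <- sub("^>", "", lines[is_header])
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  setNames(as.character(seqs), headers)
}

# Toy records: the first 39 bases of two of the sequences used in this practice
toy <- c(seqA = "GGAAGAATGTTAGGAACTGTTGCTATTGTTGTACTTTGG",
         seqB = "TAAAACCCTCAAGAACTCTTGACACTACTGTGCTCTGTC")
path <- file.path(tempdir(), "toy.fasta")
write_fasta(toy, path, width = 20)
back <- read_fasta(path)
print(back)
```

Reading the file back and comparing it with the input is a quick way to check that no bases were lost when the sequence was wrapped over several lines.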
* Create FASTA files.
1. Assume that the following sequence data is obtained from an experiment.
GGAAGAATGTTAGGAACTGTTGCTATTGTTGTACTTTGGTTATACGACAGTAAGTAAC
GTTGACTTGGTGACCGAAAATAGACACGAAATCGCTACCCGTTTCCCCAGAATATCACT
CCTCACGATGTACCTCGGCGGCTAATCTTTTTGGTAGCCTTTTGTGATATATATATAAA
TAAATAAGTATACATACATATATATATATATATATTTATACAGCTACATTGTTTTCCTC
CAAAATTTTCTGTTGGTTATGAATCGCAAAAGAAGTTTTCAGATTGTGTCCTCTGTTAC
TATTTCGTTAAGAAAGGAAGATATCGTCTACGGCTGGTGTGACGTAAGTATTGCGTTGT
GCTCTAAAA
TAAAACCCTCAAGAACTCTTGACACTACTGTGCTCTGTCTTCTTATTAAATGTAGAAGC
ATTTGCCTAAAGTAAACAAGAATAAATATACTGCATGGGGCTACCCGTTCCATATGATA
TCATCGGTCACGAAGTGTCGGCGGCTAATTTAGAGTACGCCTTTTGTGATATATATATA
TATATATATATACATAGAATGAACTACCGCTATTTTAAAACTCTTTTTGGTGGCTATGA
TTGCAGAAAAAGTGTCTAATAATAAGTGTGTTCTGTCACTTTGAGAAAGAATATTGCA
TATACGGTAAACAGTGGTGTGAGCTTTCTATTTTTTATTTTAAGAAAT
8. You can also put many different sequences in one FASTA file as below. Using
sequence data from all four species of yeast, create a new FASTA file called
“saccharomyces.fasta”.
>S.cerevisiae
GGAAGAATGTTAGGAACTGTTGCTATTGTTGTACTTTGGTTATACGACAGTAAGTAAC
GTTGACTTGGTGACCGAAAATAGACACGAAATCGCTACCCGTTTCCCCAGAATATCACT
CCTCACGATGTACCTCGGCGGCTAATCTTTTTGGTAGCCTTTTGTGATATATATATAAA
TAAATAAGTATACATACATATATATATATATATATTTATACAGCTACATTGTTTTCCTC
CAAAATTTTCTGTTGGTTATGAATCGCAAAAGAAGTTTTCAGATTGTGTCCTCTGTTAC
TATTTCGTTAAGAAAGGAAGATATCGTCTACGGCTGGTGTGACGTAAGTATTGCGTTGT
GCTCTAAAA
>S.bayanus
TAAAACCCTCAAGAACTCTTGACACTACTGTGCTCTGTCTTCTTATTAAATGTAGAAGC
ATTTGCCTAAAGTAAACAAGAATAAATATACTGCATGGGGCTACCCGTTCCATATGATA
TCATCGGTCACGAAGTGTCGGCGGCTAATTTAGAGTACGCCTTTTGTGATATATATATA
TATATATATATACATAGAATGAACTACCGCTATTTTAAAACTCTTTTTGGTGGCTATGA
TTGCAGAAAAAGTGTCTAATAATAAGTGTGTTCTGTCACTTTGAGAAAGAATATTGCA
TATACGGTAAACAGTGGTGTGAGCTTTCTATTTTTTATTTTAAGAAAT
>S.mikate
GGACGACTCTAAAAAATGTTGTCACTGCAGCATTTTGGTTTAAGCGAGAGTTAATTATG
TTGGTCTGAGCAACCAAAAATAAACAGTTCAAGTGTTGCTACCCGTTTTTGCAGTTAAG
ATCACTTACCACGGATAAGTATCGGCGGCTAATCCTCATGGGACGCCTTTTGTGATATA
TAAATACATGCATCTAGTGAAACCTTTTCTTCAAAATTCACTCGCTGACTATAAGCCCC
AAACAGAAGCTTTAAAACTACGTATTCTACTACTAATTGATTAGAAAATATCACTTCAT
ACACGGTTGAAGTGGCTTAAGCATTGTTTGTGCTTGAAAAAT
>S.kudriazevii
GAGATTATTTAGTAACTTTGTTGCTACACTACCTCTTTATACGAGAATTGATAGGATTG
ACCAAAGCATCTAGGATAAATAAGATGTGAATGTATTACCCGTTTTGTATTCAAGATCA
CCTCTCACGGAGGGGTTTCGGCGGCTAATCGTTATTAGCGCCTTTTGTGATATGCGTAT
AAATAAAGTGACTACTTCTAGCTTCAAAAAATTGCTTACTGCTATACCCCTCGCTCTAA
GCGCGAAGTTTCAAAATTGTCTGTTCTACCATTCCTTGGTTAAGAAAATACTGCTAGGG
TGGTGTGAACATTGTCTTGTGCTTGAGAAAT
In this chapter, we will explain how to perform homology analysis using sequence alignments.
Sequence homology refers to the similarity of DNA base or amino acid sequences between
individuals or between species. A sequence alignment quantitatively and visually represents
the correspondence of bases, which can be used to infer the degree of functional,
evolutionary, and structural relationship among sequences. Sequence alignment is similar to
lining up heads, legs, and tails to compare organs. Finding sequences that are highly
homologous to your sequence allows you to predict evolutionary relationships or to infer the
function of your sequence.
First, we introduce sequence alignment techniques and practice a homology analysis
using the sequence alignment algorithms BLAST and Clustal Omega. BLAST and Clustal
Omega are available on the NCBI and EBI websites, respectively. Refer to Chap. 1, Sect. 1.2 for funda-
pair of sequences. It has the disadvantage of not considering sequence gaps, which has
been overcome with BLAST2.
(Fig. 11.3)
When you run Clustal Omega, the following five result files are generated:
the “*.output” result file. The “*.ph” file is a phylogenetic tree file, and a detailed explanation
will be provided in Sect. 11.3.3. A multiple alignment practice will also be performed.
Tables 11.1 and 11.2 show the color coding for aligned sequences and the symbols denoting
conserved sequence.
Fig. 11.5 Explanation of multiple sequence alignment results obtained from Clustal Omega
Table 11.1 Key for color coding of Clustal Omega sequence alignment
Residue   Color     Property
DE        Blue      Acidic
RK        Magenta   Basic-H
Symbol    Meaning
.         Semi-conserved substitution
There are many different algorithms for each type, such as the Needleman-Wunsch (global)
and Smith-Waterman (local) algorithms. Refer to their Wikipedia pages (footnote 2) for
detailed explanations. Next, you will perform both global and local alignments of sequence
pairs and learn the practical differences between the two methods. This will be done with the
EMBOSS_Needle and EMBOSS_Water web services provided by the EMBOSS-ALIGN
alignment tool (footnote 3) from EBI.
- EMBOSS_Needle: Performs global alignments using the Needleman-Wunsch algorithm
- EMBOSS_Water: Performs local alignments using the Smith-Waterman algorithm
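To make the difference between the two algorithms concrete, the sketch below computes only the optimal alignment score (not the alignment itself) by dynamic programming. It is an illustration under stated assumptions rather than a reimplementation of the EMBOSS tools: the function name `align_score` and the scoring scheme (match = 1, mismatch = -1, linear gap = -1) are chosen here for simplicity, whereas the EMBOSS services use substitution matrices and separate gap-open and gap-extend penalties.

```r
# Sketch: optimal global (Needleman-Wunsch) or local (Smith-Waterman) score.
# Scoring values are illustrative assumptions, not the EMBOSS defaults.
align_score <- function(a, b, local = FALSE, match = 1, mismatch = -1, gap = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  H <- matrix(0, nrow = length(x) + 1, ncol = length(y) + 1)
  if (!local) {
    # Global alignment: gaps at the ends of the sequences are penalized
    H[, 1] <- gap * (0:length(x))
    H[1, ] <- gap * (0:length(y))
  }
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      s <- if (x[i] == y[j]) match else mismatch
      best <- max(H[i, j] + s,          # align x[i] with y[j]
                  H[i, j + 1] + gap,    # x[i] against a gap
                  H[i + 1, j] + gap)    # y[j] against a gap
      # Local alignment: a score never drops below zero (restart anywhere)
      H[i + 1, j + 1] <- if (local) max(0, best) else best
    }
  }
  if (local) max(H) else H[length(x) + 1, length(y) + 1]
}

global_score <- align_score("TACCCG", "GGTACCCGTT")
local_score  <- align_score("TACCCG", "GGTACCCGTT", local = TRUE)
print(c(global = global_score, local = local_score))
```

With these toy inputs the local score is higher than the global score, because the global alignment must pay for the four unmatched flanking bases while the local alignment ignores them.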
11.3.2.1 BLAST
* Understand the use of global and local pairwise sequence alignment
1. Go to “Tools” > “Pairwise Sequence Alignment” on the EBI webpage in order to do
pairwise sequence alignments (Fig. 11.6). (http://www.ebi.ac.uk/Tools/psa)
2 http://en.wikipedia.org/wiki/Sequence-alignment
3 EMBOSS (European Molecular Biology Open Software Suite).
6. After opening a new window, repeat steps 1 to 5, but in step 2 select “Local
Alignment” > “Water Tools” > “Nucleotide” instead (Fig. 11.8).
7. Check the differences between the results of global alignment and local alignment
(Figs. 11.7 and 11.8).
2. Select the type of the sequence set and upload the saccharomyces.fasta file in STEP 1
(upload the sequences to be compared as a set).
3. Click “Submit”. If you want the results to be sent to your email, then check the
appropriate box in STEP 3 and input your email address.
4. If you select “Show Colors” in the result tab “Alignments”, then you will see color-
coded values. Refer to Table 11.1 for the color key (Fig. 11.9).
5. Check the four result files given in “Result Summary” (Fig. 11.10). The Percent
Identity Matrix gives a comparison score for each sequence pair; the higher the score,
the greater the homology of the two sequences.
6. Click “Start Jalview” under “Jalview” in the “Result Summary” tab. Jalview (footnote 4)
visualizes the degree of consensus across sequences (Fig. 11.11). It is a multiple sequence
alignment editor provided as a Java applet, and it requires that the Java
Virtual Machine (JVM; footnote 5) be installed on the user’s computer.
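The idea behind a percent identity matrix can be sketched in a few lines of base R. The example below is an assumption-laden illustration: the three aligned toy sequences are hypothetical, the function name `percent_identity` is invented here, and the denominator (columns without a gap in either sequence) is one common convention that may differ from the one Clustal Omega applies.

```r
# Sketch: pairwise percent identity from sequences that are ALREADY aligned
# (equal length, "-" marks a gap). Toy alignment; not Clustal Omega output.
percent_identity <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  keep <- x != "-" & y != "-"            # ignore columns containing a gap
  100 * sum(x[keep] == y[keep]) / sum(keep)
}

aln <- c(sp1 = "GCTACCCGTTT-C",
         sp2 = "GCTACCCGTTCCA",
         sp3 = "ATTACCCGTTTTG")

# Apply the pairwise function to every pair of rows of the alignment
pim <- outer(seq_along(aln), seq_along(aln),
             Vectorize(function(i, j) percent_identity(aln[i], aln[j])))
dimnames(pim) <- list(names(aln), names(aln))
print(round(pim, 1))
```

The matrix is symmetric with 100 on the diagonal, and larger off-diagonal values indicate more similar sequence pairs.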
A cladogram does not assign any meaning to the lengths of the branches in the tree;
cladograms only return approximate classifications. In phylograms, the length of a branch is
based upon the genetic relationship between taxa (e.g., among various biological species or
entities), where taxa on shorter branches are genetically closer to one another. Finally, an
ultrametric tree focuses more on the time-based relation of taxa than on their evolutionary
relationships, with the length of each branch indicating when the two taxa diverged.
5 http://www.java.com/en/download/index.jsp
* Creating a Phylogenetic Tree
1. Open the Clustal Omega webpage given above.
2. Select “DNA” under STEP 1 and upload or paste the results file from [Example 11.3]
into the input field.
3. Click “Submit”. If you want the results to be sent to your email, then check the
appropriate box in STEP 3 and input your email address.
4. Results will be shown both as a graphic cladogram and as text (Fig. 11.12). In the text
notation, parentheses group taxa together by their relatedness. For instance, S. mikate
and S. kudriazevii are in the same set of parentheses, which means they are most closely
related.
5. Clicking on the “Cladogram” and “Real” option buttons will change the graphical tree
from cladogram to phylogram. As shown in Fig. 11.12, S. mikate and S. kudriazevii
Fig. 11.13 Multiple sequence alignment results and conserved sequence regions
3. Copy and paste a selected conserved region into the Motif Database of SCPD (footnote 6).
For example, enter “TACCCG” in the motif field. Enter 0 for “Allowed mismatches”, then
click “Submit” (Fig. 11.14).
“GCCTTTGTGATAT” (Fig. 11.16).
6. Finally, perform the motif search again for “TACCCG”. Now, let’s analyze the search
results.
“Factor” means the transcription factor, and “gene” is the gene controlled by that
transcription factor. Text in black, such as “TTATTACCCG,” shows the full transcription
factor binding site, while the aligned input motif, “TACCCG”, is shown below it in red.
7. Among the 15 search results, three have only one match (MCM1, PHO4, and
GRF2). Among these three, GRF2 has one TFBS (TTATTACCCG) that is also listed under
REB1. Since REB1 and GRF2 have the same TFBS, they are assumed to be synonyms.
8. Go to the SGD website (http://www.yeastgenome.org) and search for GRF2
Fig. 11.17 GRF2 search results from the Saccharomyces genome database
9. Back in the motif search results, click REB1 to bring up its page, then “Get consensus” to
confirm that the “TACCCG” motif is the transcription factor’s binding site (Fig. 11.18).
10. The consensus is “YYACCCG”, where “Y” is the IUPAC (footnote 7) notation for any
pyrimidine (C or T). If you look back at the multiple alignment corresponding to the first
conserved region, you see that S. cerevisiae, S. bayanus, and S. mikate have
“CT” while S. kudriazevii has “TT”. We can conclude that this conserved
sequence is related to REB1 binding.
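The check in the last step can be repeated in code by translating the IUPAC consensus into a regular expression. The sketch below is illustrative only: it assumes the consensus is written as YYACCCG (so that it contains the searched motif TACCCG), the helper name `consensus_to_regex` is invented here, and the four strings are 20-base excerpts copied from the upstream sequences listed earlier in this chapter.

```r
# Sketch: IUPAC consensus -> regular expression -> scan of the four sequences.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

consensus_to_regex <- function(consensus) {
  paste(iupac[strsplit(consensus, "")[[1]]], collapse = "")
}

# 20-base excerpts around the first conserved region (species labels as in the text)
upstream <- c(S.cerevisiae  = "AAATCGCTACCCGTTTCCCC",
              S.bayanus     = "ATGGGGCTACCCGTTCCATA",
              S.mikate      = "GTGTTGCTACCCGTTTTTGC",
              S.kudriazevii = "GAATGTATTACCCGTTTTGT")

pattern <- consensus_to_regex("YYACCCG")   # assumed spelling of the consensus
m <- regexpr(pattern, upstream)
start <- as.integer(m)
len <- attr(m, "match.length")
hits <- data.frame(species = names(upstream),
                   start = start,
                   site = unname(substring(upstream, start, start + len - 1)),
                   stringsAsFactors = FALSE)
print(hits)
```

Three of the matched sites begin with “CT” and one with “TT”, which is the pattern of pyrimidine pairs described in the step above.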
The greatest advantage of the UCSC Genome Browser (footnote 8) is that it provides detailed
annotation information and other relevant data centered around the reference genome
sequence. The NCBI Map Viewer (footnote 9) also provides genome sequence-based information,
but the UCSC Genome Browser is simpler and generally more useful. The UCSC Genome
Browser provides each of its various annotations as a track. In addition to the graphic
output screen, ASCII text data and HTML format reports are freely downloadable. The UCSC
Genome Browser can provide a large amount of the information required for genome data
analysis.
1. Go to the UCSC Genome Browser website given above.
2. Click on “Genome Browser”, then select Yeast under “Popular Species” at the top.
3. Select “June 2008 (SGD/sacCer2)” as the assembly and enter “REB1” as the search
term, then click “Submit” (Fig. 11.19).
8 http://genome.ucsc.edu/
9 http://www.ncbi.nlm.nih.gov/projects/mapview
A list of search results is returned (Fig. 11.20). In some cases, many candidates are
shown. Check the first gene’s name and description, then select it.
4. Look at the layout of the genome browser page (Fig. 11.21).
The genome base position numbers are at the top of the view. There are several tracks
below, each providing relevant information. Each track type has a control panel (dark
blue horizontal bands in Fig. 11.21), and when you click the “refresh” button, changed
Fig. 11.23 Open ‘SGD Genes’ track to visualize REB1 gene in genome browser
Fig. 11.24 Description page for REB1 in UCSC genome browser
6. Under “Genes and Gene Predictions”, select “pack” in the menu for “SGD Genes”.
Click any “refresh” button.
7. REB1 information is added back into the genome view (Fig. 11.23). Track control
panels have up to five options (hide, dense, squish, pack, full), and more detailed
information is provided in the viewer as you go from dense to full. This is very
convenient for controlling the display of EST or SNP data. The “pack” option is generally used.
8. The UCSC Genome Browser also contains information about a transcription factor’s
binding sites. First, open the REB1 information page by clicking REB1 on the side of the
genome view (Fig. 11.24).
9. In the section titled “DNA Binding Motif from CHIP/CHIP Experiments”, the motif
“CGGGTAA” is shown. Its reverse complement, “TTACCCG”, contains the sequence motif
“TACCCG” that was used in [Example 11.5] (Fig. 11.25).
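The relationship between the browser motif and the searched motif can be verified with a short reverse-complement computation. This is a minimal base R sketch; the function name `reverse_complement` is an illustrative choice, and only the four unambiguous DNA bases are handled.

```r
# Sketch: reverse complement of a DNA motif (A, C, G, T only).
reverse_complement <- function(dna) {
  bases <- rev(strsplit(toupper(dna), "")[[1]])          # reverse the order
  chartr("ACGT", "TGCA", paste(bases, collapse = ""))    # complement each base
}

rc <- reverse_complement("CGGGTAA")        # motif shown in the genome browser
contains_motif <- grepl("TACCCG", rc, fixed = TRUE)
print(rc)
print(contains_motif)
```

Because DNA is double-stranded, a binding motif reported on one strand and its reverse complement on the other strand describe the same site.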
DianaTools/index.php?r=MicroT_CDS/index)
2. Search for “ENSG00000110092”. Select one of the miRNAs that has a green box in the
“Also Predicted” column, which indicates that it has been verified experimentally (Fig. 11.26).
3. The details page provides gene name, miRNA details, and methods information.
Click the grey circled “i” next to the gene name for more details.
4. Review the detailed information (Fig. 11.27).
5. Similarly, details for the targeting miRNA can be obtained from the grey circle next
to an miRNA name (Fig. 11.28).
Fig. 11.27 Gene details for ‘ENSG00000110092’ from the DIANA-microT Webserver
Fig. 11.28 Detailed miRNA info for hsa-let-7b in the DIANA-microT Webserver
Exercises
[Exercise 1] - Using the myosin protein sequences of human, chicken, and the other provided
species, work through the following problems (sequences are in the file /gda/ch11/myosin.fasta).
1.1 Perform multiple sequence alignment using Clustal Omega
1.2 Find the species whose MYH9 myosin protein is closest to human using the hierarchical
tree diagram
1.3 Search for MYH9 in the UCSC Genome Browser
1.4 Hide all tracks, then check if there are myosin ESTs from brain tissue by using the “Human ESTs”
track in the “mRNA and EST” group
- Select pack for “Display mode,” red for “Filter color”, and brain for “tissue”
1.5 Hide the Human ESTs track again, and show SNPs in MYH9
- Use the SNPs track under “Variation and Repeats”
Bibliography
1. Cliften P et al (2003) Finding functional features in Saccharomyces genome by phylogenetic footprint-
ing. Science 301:71–76
2. Clustal Omega – http://www.ebi.ac.uk/Tools/msa/clustalo
3. DIANA-microT – http://diana.imis.athena-innovation.gr/DianaTools/index.php?r=MicroT_CDS/inde
4. http://en.wikipedia.org/wiki/Sequence-alignment
5. Leung W (2008) Identifying regulatory regions using multiple sequence alignments.
http://www.nslc.wustl.edu/courses/archives/Bio5924/elgin/MSA_Intro_rv1.pdf
6. Maragkakis M et al (2009) DIANA-microT web server: elucidating microRNA functions through target
prediction. Nucleic Acids Res 37:W273–W276
7. Saccharomyces cerevisiae Promoter Database (SCPD) – http://rulai.cshl.edu/SCPD/searchmotif.html
8. Saccharomyces Genome Database (SGD) – http://www.yeastgenome.org
9. UCSC Genome Browser – http://genome.ucsc.edu/
12 Molecular Pathways and Gene Ontology
12.1 Introduction
Bibliography
12.1 Introduction
With the development of genomic technology, large-scale genomic data have accumulated and
various bioinformatic analysis methodologies have been developed. These analytical
approaches find patterns in or classify genomic data and then interpret the data
semantically. This interpretation requires a variety of additional data, such as life
science knowledge: Gene Ontology (GO), biological pathways, and literature information are
used for the analysis of large-scale genomic data. Various algorithms and biological text
mining techniques have been developed for this analysis.
In Chap. 12, after conducting biological text data mining, we will try to understand the
semantic interpretation process of genomic data based on biomedical knowledge.
12.2 Prerequisites
This section will use acute lymphoblastic leukemia (ALL) data by Chiaretti S et al. (Blood
2004). Instead of the ALL microarray data, we will use the 20 genes selected in the paper:
(1) IL-8; (2) CHC1L; (3) AHNAK; (4) SEC31B-1; (5) MEF2A; (6) CAT; (7) MYC; (8) HNRPH1;
(9) DEK; (10) CDC7L1; (11) BUB1B; (12) H2AFX; (13) CENPF; (14) KIAA0175; (15) HEC;
(16) CD2; (17) USP1; (18) TTK; (19) CD8; and (20) LRMP.
* Gene Ontology Tools
AmiGO (http://amigo.geneontology.org/amigo)
G-SESAME (http://bioinformatics.clemson.edu/G-SESAME/)
* Pathway Tools
Reactome (http://www.reactome.org/)
Currently, GO provides gene annotation information for 319 species, so a search yields
IL-related genes or proteins across a variety of species. When an item such as “Protein
from Homo sapiens” on the right side of each row or “Interleukin-8” is selected, we can obtain
information on human Interleukin-8, including name, type, source database, and sequences.
To search the GO annotations for a given gene, select “# associations” on the left of “Protein
from Homo sapiens.” You will find the annotated GO terms for IL-8 as shown in Fig. 12.2. You
can confirm that IL-8 is annotated with various GO terms such as angiogenesis,
calcium-mediated signaling, cell cycle arrest, cellular component movement, cellular
response to lipopolysaccharide, chemotaxis, and embryonic digestive tract development. When
a GO term is annotated to a gene, the evidence code is marked differently according to the
methodology. Therefore, the quality of the GO annotation is determined by the evidence code.
The top left filter allows you to filter annotations based on the evidence code and GO type.
If you select “GO:0001525 angiogenesis” in the annotation, a detailed description of
the corresponding GO term is displayed. If you select the “Inferred Tree View” tab in the
upper tab menu, the relationship to the other terms is shown in a tree diagram
(Fig. 12.3). “Graph View” shows the relationship and hierarchical structure between
1 http://amigo.geneontology.org/amigo
Fig. 12.3 Tree view and graph view of search results of IL-8 in AmiGO
12.3.2 Search GO Annotations for a Gene List
The previous exercise examined the GO annotations of one gene and the characteristics of
individual GO terms. The strength of GO is that it helps to find semantic similarities in
gene lists obtained from differential expression or cluster analysis. In this section, we will
practice a GO annotation search for a gene list (or set) composed of multiple genes. Select
“Search” from the first screen menu of AmiGO and the “Advanced Search” screen appears.
You can search the GO terms of a gene list here. In this exercise, we will perform a GO
annotation search for IL-8, AHNAK, CD2, and TTK (Fig. 12.4). Go to the settings at the
bottom of the page and select genes or proteins for “Search Type.” Select “symbol” since
the input data are composed of gene symbols. In “Filter results”, select Homo sapiens for
“Species” and then select “send query.” The non-redundant GO annotations of the input genes
are then retrieved. As explained in Chap. 7, GO annotation results for gene
clusters are important information that can be used in further analysis to infer the
biological significance of the clusters.
Fig. 12.4 GO annotation search for gene list (IL-8, AHNAK, CD2, TTK)
It can be assumed that genes with similar GO annotations are genes that code for similar
functions. Therefore, the GO terms annotated to genes can be used to assess the semantic
similarities between genes or between gene clusters. Such an approach is very important
for inferring biological meaning in genome data. This section uses tools that support
semantic similarity or distance measurements between genes. G-SESAME (footnote 2) is a tool
that calculates the similarity between genes based on GO terms. Enter the two gene names and
set the ontology type, species, database, and evidence code, and then click “Submit” to
calculate the semantic similarity between the two genes. In this exercise, we will calculate
the semantic similarity between two different human genes, CD2 and TTK, using the GO
Molecular Function category (Fig. 12.5).
The semantic similarity between the two genes is 0.746. The closer the value is to 1, the
more similar the annotated GO terms of the two genes are (Fig. 12.6). We then
should review the relationship between the annotated GO terms of both genes.
Figure 12.7 shows the relationship between the two genes and the annotated GO terms
as a DAG. The cyan nodes denote CD2 terms, the orange nodes TTK terms, and the grey nodes
terms of both. In addition, the correlation of each term on the GO DAG can also be analyzed.
2 http://bioinformatics.clemson.edu/G-SESAME/
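G-SESAME derives its score from the structure of the GO graph, which is beyond a short example. As a much simpler illustration of the underlying idea that shared annotations imply similarity, the sketch below computes a Jaccard index between two sets of GO terms. This is explicitly not the G-SESAME algorithm; the function name `jaccard`, the gene labels, and the term identifiers are hypothetical placeholders rather than real annotations of CD2 or TTK.

```r
# Sketch: set-overlap (Jaccard) similarity between two genes' GO term sets.
# NOT the G-SESAME method; all term IDs below are placeholders.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

terms_gene1 <- c("GO:T1", "GO:T2", "GO:T3", "GO:T4")   # hypothetical annotations
terms_gene2 <- c("GO:T3", "GO:T4", "GO:T5")            # hypothetical annotations

similarity <- jaccard(terms_gene1, terms_gene2)
print(similarity)
```

Like the G-SESAME score, this value lies between 0 and 1, with 1 meaning identical annotation sets; unlike G-SESAME, it gives no credit to terms that differ but are close to each other in the GO hierarchy.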
Web-based tools have limits for handling large volumes of data or performing GO annotation
searches in conjunction with other R-based statistical analysis results. This section
provides R scripts for advanced users. If you run the R scripts provided in this section,
you can do the same exercises as you performed on the web.
Fig. 12.6 G-SESAME results screen 1 for CD2 and TTK genes
* Package installation
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("GOSim")
> biocLite("topGO")
> biocLite("sigPathway")
> library(GOSim)
#If the working directory is set, you can open it directly as shown below.
> head(InputGenes)
molecular_function
GO:0003674
> head(InputGenes)
ProbeID EntrezGene GeneSymbole
1 31491_s_at 841 CASP8
2 31506_s_at 1668 DEFA3
3 31696_at 5912 RAP2B
4 31737_at 58 ACTA1
5 32393_s_at 9987 HNRPDL
6 34098_f_at 9270 ICAP-1A
Some of the genes stored in InputGenes have an “NA” value in the second column, which
is the EntrezGene column. The next step is to remove the genes whose second-column
value is ‘NA’ and keep only the analyzable genes (283 of 314). 238 genes with EntrezGene
present in the second column are included in the next process.
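The filtering step described above can be sketched with base R as follows. The small data frame is a stand-in for InputGenes: the two Entrez IDs and symbols are taken from the listing above, while the probe IDs, the NA rows, and the object names `genes`, `kept`, and `ids` are illustrative assumptions.

```r
# Sketch: drop rows without an Entrez Gene ID, then keep the IDs as characters.
genes <- data.frame(ProbeID    = c("p1_at", "p2_at", "p3_at", "p4_at"),  # placeholders
                    EntrezGene = c(841, NA, 5912, NA),
                    GeneSymbol = c("CASP8", NA, "RAP2B", NA),
                    stringsAsFactors = FALSE)

kept <- genes[!is.na(genes$EntrezGene), ]    # rows with a usable Entrez ID
ids <- as.character(kept$EntrezGene)         # character IDs for later steps

print(nrow(genes))
print(nrow(kept))
print(ids)
```

Comparing the row counts before and after filtering is a simple way to confirm how many genes were removed.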
[Figure: Cluster Dendrogram of the genes (leaves labeled by Entrez Gene ID); y-axis: Height;
as.dist(1 - sim), hclust (*, "ward.D")]
> length(GenesOfInterest)
We will calculate the distances among only the first 30 genes, because calculating the
distances among all genes takes a long time.
[Figure: silhouette plot; n = 27, 3 clusters. Cluster size and average silhouette width:
cluster 1: 10 | 0.48; cluster 2: 5 | −0.15; cluster 3: 12 | 0.46]
> plot(hc)
> cl = cutree(hc, k = 3)
> if(require(cluster))
}
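The clustering steps shown in fragments above (hierarchical clustering of 1 - sim, cutting the tree, and a silhouette check) can be sketched end to end on a toy matrix. Everything in this example is an assumption made for illustration: the 6 x 6 similarity matrix is invented, the gene labels g1 to g6 are placeholders, and k = 2 is used instead of the three clusters of the exercise.

```r
# Sketch: hierarchical clustering of a toy similarity matrix, as in the exercise.
labels <- paste0("g", 1:6)                       # placeholder gene labels
sim <- matrix(0.1, 6, 6, dimnames = list(labels, labels))
sim[1:3, 1:3] <- 0.9                             # first block of similar genes
sim[4:6, 4:6] <- 0.8                             # second block of similar genes
diag(sim) <- 1

d <- as.dist(1 - sim)                            # similarity -> distance
hc <- hclust(d, method = "ward.D")
cl <- cutree(hc, k = 2)                          # cluster label for each gene
print(cl)

# Silhouette widths need the 'cluster' package; skip quietly if it is missing
if (requireNamespace("cluster", quietly = TRUE)) {
  sil <- cluster::silhouette(cl, d)
  print(summary(sil)$avg.width)
}
```

An average silhouette width near 1 indicates well-separated clusters, while values near or below 0, as for cluster 2 in the exercise, suggest that the members of a cluster do not fit together well.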
Next, using the GO enrichment function, we perform GO enrichment of the genes that
belong to a cluster. Since GO enrichment only accepts character type variables, we
convert the variables to character type using the as.character() function.
> typeof(GenesOfInterest)
> typeof(mGOI)
Using the GO enrichment function, we can perform a GO enrichment analysis, based on
Fisher’s exact test, for cluster 1 of the clusters in Fig. 12.8. The cutoff value is a
parameter of the GO enrichment function. In this exercise, only results with a p-value
< 0.05 are extracted, and the resulting value is assigned to the variable GOEResult.
> if(require(topGO))
> names(GOEResult)
> names(GOEResult)
[1] "GOTerms" "p.values" "genes"
Print the items we want to see. For example, the following example returns the genes
enriched in GO terms.
> GOEResult$genes
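The statistical idea behind the enrichment step, Fisher's exact test on a 2 x 2 table, can be illustrated with base R alone. The counts below are hypothetical and the variable names are chosen here; this sketch shows the test itself, not the internals of the enrichment function used above.

```r
# Sketch: one-sided Fisher's exact test for enrichment of one GO term in a cluster.
# All counts are hypothetical.
N <- 1000   # annotated genes in the background
K <- 40     # background genes annotated with the GO term
n <- 10     # genes in the cluster
k <- 4      # cluster genes annotated with the GO term

counts <- matrix(c(k, n - k,                 # cluster: with term, without term
                   K - k, N - K - (n - k)),  # rest:    with term, without term
                 nrow = 2,
                 dimnames = list(c("term", "other"), c("cluster", "rest")))

p_fisher <- fisher.test(counts, alternative = "greater")$p.value

# The same tail probability from the hypergeometric distribution
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

print(counts)
print(c(fisher = p_fisher, hypergeometric = p_hyper))
```

Here only 0.4 annotated genes would be expected in a cluster of 10 by chance, so observing 4 gives a small p-value; the one-sided Fisher test and the hypergeometric tail are two ways of writing the same calculation.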
A biological pathway is a well-organized summary of key life science research
findings. A pathway database is a necessary resource for biological data analysis using
biological pathways. However, there are very few available pathway databases. The exercise in
this section briefly introduces a simple web-based tool, Reactome. Reactome provides a
variety of well-organized information, from basic pathways to complex biological pathways
such as GPCRs and apoptosis. A user guide can be found on the website
(http://wiki.reactome.org/index.php/Usersguide).
Selecting the “Overrepresentation analysis” option at the bottom of the previous exercise
gives a list of pathways in which input genes are significantly enriched (refer to Chap. 4,
Sect. 4.2 to use the overrepresentation analysis). The statistical significance of the GO
Selecting “Analyze Expression Data” on the left side of the Reactome homepage opens the
“Upload expression data” window (Fig. 12.10). When genome expression profile data from
microarray or RNA-Seq are entered, the input genes are mapped to the Reactome
pathways as described above, and the number of mapped genes and the mapping
ratio (%) are shown (Fig. 12.11). The gene expression level is shown in a color-coded
diagram of the pathway. In this pathway analysis, all genes of the “Extrinsic Pathway for
Apoptosis” were mapped (100% mapping). When “Extrinsic Pathway for Apoptosis” is
clicked, the gene expression profiles of this pathway are displayed color coded. High and
low expression levels are represented on a red-to-blue gradient (Fig. 12.12).
This section provides R scripts for advanced R users. We use the sigPathway package to
perform biological pathway overrepresentation analysis with the MuscleExample data.
MuscleExample is a microarray data set from patients with a muscular system disease
called inclusion body myositis. It is composed of 5000 genes in rows and 15
experiments (samples) in columns.
> library(sigPathway)
> data(MuscleExample)
> dim(tab)
[1] 5000 15
The fifteen samples are composed of eight patients and seven healthy subjects, who serve as
the control group phenotype.
> table(phenotype)
0_NORM 1_IBM
7 8
Using the t-test, the differentially expressed genes between cases and controls were
identified. The resulting p-values are plotted in a histogram (Fig. 12.13).
> hist(statList$pval, breaks = seq(0, 1, 0.025), xlab = "p -value", ylab = "Frequency", main = " ")
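The computation that precedes the histogram is not shown above, so the following sketch illustrates one plain way to obtain per-gene p-values with base R. It runs on simulated data rather than MuscleExample, it uses Welch's t-test from `t.test`, which may differ from the statistic computed by sigPathway, and the names `expr`, `group`, and `pval` as well as the matrix size and effect size are assumptions made for this example.

```r
# Sketch: per-gene two-sample t-tests on simulated data (genes in rows).
set.seed(42)
n_genes <- 200
group <- factor(rep(c("0_NORM", "1_IBM"), times = c(7, 8)))   # 7 controls, 8 cases
expr <- matrix(rnorm(n_genes * length(group)), nrow = n_genes)

# Simulate differential expression: shift the first 20 genes in the cases
expr[1:20, group == "1_IBM"] <- expr[1:20, group == "1_IBM"] + 2

# One t-test per row (gene); collect the p-values
pval <- apply(expr, 1, function(x) {
  t.test(x[group == "1_IBM"], x[group == "0_NORM"])$p.value
})

hist(pval, breaks = seq(0, 1, 0.025), xlab = "p-value", ylab = "Frequency", main = "")
```

When many genes are differentially expressed, the histogram shows an excess of small p-values on top of the roughly flat background expected from unaffected genes.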
As shown in Fig. 12.13, a large number of genes are differentially
expressed between the case and control groups. Next, using the runSigPathway function, we
search for pathways that differ between the two groups.
> set.seed(1234)
> res.muscle <- runSigPathway(G, 20, 500, tab, phenotype, nsim = 1000,
This search returns a list of pathways that show statistically significant differences between
case and control groups.
> print(res.muscle$df.pathways[1:10,])
> names(res.muscle$df.pathways[1:10,])
 [1] "IndexG"      "Gene Set Category" "Pathway"     "Set Size"
 [5] "Percent Up"  "NTk Stat"          "NTk q-value" "NTk Rank"
 [9] "NEk Stat"    "NEk q-value"       "NEk Rank"
Fig. 12.14 Screenshot of the leukemia keyword entered into the COREMINE main screen
"NTk Rank")])
When you enter a search term on the COREMINE main screen, similar terms belonging to different concepts are displayed, and the appropriate category can then be selected (Fig. 12.14).
Figure 12.16 shows the results of a two-keyword search, leukemia and IL-8. Wikipedia, ClinicalTrials, and PubMed articles were used for text mining. As a result, there were 13 concepts and keywords related to the input. On the first screen, highly related keywords appear as nodes, and the user can show more or fewer relevant keywords by clicking the ‘+’ or ‘−’ buttons on the left. Clicking a node presents basic information and displays the extracted associations on the right side. The degree of connection between nodes is shown in a bar graph. Briefly, leukemia and IL-8 are closely related to CSF3, LIF, CSF2, IL6, and recombinant Interleukin-6 (Fig. 12.16).
Fig. 12.15 Screenshot showing concurrent input of leukemia (disease) and IL-8 (gene/protein)
Fig. 12.16 Screenshot of the results of a two-keyword search, leukemia (disease) and IL-8 (gene/protein)
Clicking on the link between two nodes shows the number of PubMed articles that support their relationship. On the results page, a list of 243 papers that mention the relationship between the two keywords is displayed on the right side. The title of each paper links to the PubMed site (Fig. 12.17).
Fig. 12.17 References for the relationship between leukemia (disease) and IL-8 (gene/protein)
Exercises
[Exercise 1] - Download the MSigDB gene set and extract GO annotations from AmiGO for oncogene
gene sets.
[Exercise 2] - Calculate the similarity of CD74 and MYC genes among the genes entered in [Exercise 1].
Select Homo sapiens and biological process for GO.
[Exercise 3] - Perform pathway assignment and overrepresentation analysis for the genes entered in [Exercise 1].
[Exercise 4] - Build a literature network with the CD74 and MYC genes used in [Exercise 1], using carcinoma as a keyword.
[Exercise 5] - Build the literature network with the genes entered in [Exercise 1] and carcinoma as a keyword.
Bibliography
1. Alexa A, Rahnenfuhrer J (2016) topGO: enrichment analysis for Gene Ontology. R package version
2.26.0
2. AmiGO – http://amigo.geneontology.org/amigo
3. Carbon S et al (2009) AmiGO: online access to ontology and annotation data. Bioinformatics
25(2):288–289
4. Chen H, Sharp BM (2004) Content-rich biological network constructed by mining PubMed abstracts.
BMC Bioinformatics 5:147
5. Chiaretti S et al (2004) Gene expression profile of adult T-cell acute lymphocytic leukemia identifies
distinct subsets of patients with different response to therapy and survival. Blood 103:2771–2778
6. COREMINE – https://www.coremine.com
7. Croft D et al (2011) Reactome: a database of reactions, pathways and biological processes. Nucleic
Acids Res 39(Database):D691–D697
8. G-SESAME – http://bioinformatics.clemson.edu/G-SESAME
9. Harris MA et al (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res
32(Database issue):D258–D261
10. Holger F, Speer N, Poustka A, Beissbarth T (2007) GOSim – An R-package for computation of informa-
tion theoretic GO similarities between terms and gene products. BMC Bioinformatics 8:166
11. Lai W, Tian L, Park P (2008) sigPathway: pathway analysis. http://www.pnas.org/cgi/doi/10.1073/pnas.
0506577102, http://www.chip.org/~ppark/Supplements/PNAS05.html
12. Reactome – http://www.reactome.org
Chapter 13 · Biological Network Analysis
13.1 Introduction
The human body is composed of more than 60 trillion cells, and each cell contains many genes and proteins. In addition, human beings are surrounded by numerous environmental factors and other organisms. Each of these elements is not only independent but also interacts constantly with the others, creating a huge and complex network. Therefore, to understand life phenomena, it is important to understand the function and structure of the interactions between the various elements that make up a living system.
Biological network studies cover protein-protein interaction (PPI), transcriptional regulatory, metabolic, and perturbation networks. Gene-disease, disease-disease, and drug interaction networks have also been actively studied.
Biological networks are expected to change the current research paradigm in the life sciences and to provide clues for understanding the structure of the genome and its evolutionary mechanisms. The development of massively parallel technologies, including NGS, and of bioinformatics interpretation techniques has produced advances in new fields such as network biology and network medicine.
In this chapter, we learn how to use “igraph”, one of the R packages for biological network analysis.
13.2 Preparations
First, install R and the “igraph” package, and install the “rgl” package for visualizing networks in three-dimensional space. The exercises can be carried out on a Windows system. After starting R, enter the following commands.
> # igraph and rgl are distributed on CRAN; the biocLite() installer used in
> # earlier printings has been retired by Bioconductor
> install.packages(c("rgl", "igraph"))
A variety of software has been developed to analyze and visualize biological networks. Some of it is distributed for free by its developers, and some is sold commercially by bioinformatics companies. Figure 13.1 shows such analysis tools.
Software such as Cytoscape and Pajek runs stand-alone once installed on your computer, whereas NetworkX and igraph are provided as Python or R packages. Stand-alone software offers an easy user interface but is limited when varied or sophisticated functions are needed. Python and R, on the other hand, allow flexible and sophisticated analysis but require programming skills. In this chapter, we learn how to use the igraph package in R, a language widely used in biological analyses such as microarray analysis, and with this package we practice network analysis and visualization of protein interactions.
“igraph” is a network analysis software package based on graph theory. The core library is written in C/C++, and igraph provides interfaces to programming languages such as GNU R, Python, and Ruby (Fig. 13.2). igraph allows for direc-
> library(igraph)
Fig. 13.2 igraph
2. Create a directed graph object with seven nodes and check the connectivity between the nodes.
> g
> are.connected(g, 1, 2)
> are.connected(g, 1, 3)
> V(g)$name
> E(g)$weight
4. Network visualization exercise: Build a star-like network, and set the color of each connection line according to its weight. Practice functions: (1) plot, (2) tkplot, and (3) rglplot (Fig. 13.4).
edge.color = E(g)$color)
edge.color = E(g)$color)
> library(rgl)
edge.color = E(g)$color)
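The graph-building and plotting steps above can be sketched as follows. This is a minimal illustration rather than the book's original script: the star graph, the vertex names, the weights, and the color palette are all made up for the example, and it uses the function names make_star, are_adjacent, and layout_as_star found in recent igraph releases (are_adjacent is the newer name for are.connected).

```r
library(igraph)

# Directed star with seven nodes: node 1 points to nodes 2..7
g <- make_star(7, mode = "out")

# Connectivity checks between pairs of nodes
are_adjacent(g, 1, 2)   # the star contains the edge 1 -> 2
are_adjacent(g, 2, 3)   # leaves are not linked to each other

# Illustrative vertex names and edge weights (values are made up)
V(g)$name <- paste0("n", 1:7)
E(g)$weight <- c(1, 1, 2, 2, 3, 3)

# Color each edge by its weight: one color per weight level
palette_by_weight <- c("grey60", "orange", "red")
E(g)$color <- palette_by_weight[E(g)$weight]

# Static plot; tkplot(g) and rglplot(g) accept similar arguments
plot(g, layout = layout_as_star(g), edge.color = E(g)$color,
     edge.width = E(g)$weight, vertex.label = V(g)$name)
```

Replacing plot with tkplot gives an interactive window, and rglplot (with the rgl package loaded and a three-dimensional layout) gives a 3D view.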
Fig. 13.5 Scale-free network (degree.distribution(g) plotted against Index on logarithmic axes)
Note: igraph provides a variety of layout options for visualization. Let’s draw the network with different layouts to see various types of network graphs (Fig. 13.5 and Table 13.1).
layout.random               layout.circle
layout.sphere               layout.fruchterman.reingold
layout.kamada.kawai         layout.spring
layout.reingold.tilford     layout.fruchterman.reingold.grid
layout.lgl                  layout.graphopt
layout.mds                  layout.svd
[Figure: degree.distribution(g) plotted against Index]
2. Create and visualize an Erdős-Rényi network model, and plot the distribution of the number of connections per node, that is, the degree distribution (Fig. 13.6).
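A minimal sketch of this comparison is given below, assuming the function names sample_gnp, sample_pa, and degree_distribution of recent igraph releases. The network size, the edge probability, and the attachment parameter m are arbitrary choices for illustration, not values taken from the book.

```r
library(igraph)
set.seed(1)

n <- 1000

# Erdos-Renyi random graph G(n, p): every pair is linked with probability p
er <- sample_gnp(n, p = 0.004)

# Preferential-attachment (Barabasi-Albert type) graph with a similar mean degree
ba <- sample_pa(n, m = 2, directed = FALSE)

# Fraction of nodes having degree 0, 1, 2, ...
dd_er <- degree_distribution(er)
dd_ba <- degree_distribution(ba)

# Hubs appear only in the preferential-attachment graph
max(degree(er))
max(degree(ba))

# Log-log plot of P(k); zero entries are dropped because log(0) is undefined
k <- seq_along(dd_ba) - 1
keep <- dd_ba > 0 & k > 0
plot(k[keep], dd_ba[keep], log = "xy", xlab = "Degree k", ylab = "P(k)")
```

On logarithmic axes the preferential-attachment graph gives a roughly straight, slowly decaying line, while the random graph has a narrow distribution concentrated around its mean degree.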
In this chapter, we analyze the protein interaction network of yeast, which is the most studied protein interaction network. In practice, we reproduce the networks of “Evolutionary Rate in the Protein Interaction Network” (Fraser et al. [4]) and “Emergence of Scaling in Random Networks” (Barabási and Albert [2]), both published in Science.
interactions should be removed for the analysis in this chapter. The lower two panels in Fig. 13.7 show a re-formatted input file that is readable in igraph. In total, 5543 genes (nodes) and 54,484 connections (links) exist in this data set. Download the evolutionary rate (non-synonymous substitution rate, dN) data published in PNAS in 2005, “Functional genomic analysis of the rates of protein evolution” (upper right panel in Fig. 13.7). For each of the 5543 genes, if the gene has a dN value in the file, assign that value to the gene; otherwise, assign −1, which marks a missing value because a real evolutionary rate is never negative.
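The assignment rule can be sketched in base R as follows. The gene names and dN values are hypothetical stand-ins; the real node list and dN file have their own layouts, which this example does not try to reproduce.

```r
# Hypothetical node list of the network and a dN table covering only some genes
genes <- c("YAL001C", "YAL002W", "YAL003W", "YAL005C")
dn_table <- data.frame(gene = c("YAL003W", "YAL001C"),
                       dN = c(0.021, 0.134))

# Position of each network gene in the dN table (NA when the gene is absent)
idx <- match(genes, dn_table$gene)

# Use the reported dN when available, otherwise the sentinel value -1
dN <- ifelse(is.na(idx), -1, dn_table$dN[idx])
traits <- data.frame(gene = genes, dN = dN)

# Later analyses keep only genes with a real rate
has_rate <- traits$dN >= 0
traits[has_rate, ]
```

Using a sentinel keeps the table rectangular, and the filter dN >= 0 then selects exactly the genes for which a rate was reported.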
Database                                   Feature
HPRD (Human Protein Reference Database)    Construction of a human PPI database through literature collection and verification. 41,327 interactions, 30,047 proteins, 470 domains.
13.5.1 Visualization
In R, read the protein interaction network files provided on the CD (“vertexes.txt” and “relations.txt”), create a graph object, and visualize it with the Fruchterman-Reingold layout (Fig. 13.8).
# Confirm the current working directory using “getwd()” and move the working directory
> library(igraph)
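As a rough sketch of these steps, the following builds a graph from a vertex table and an edge table and draws it with a Fruchterman-Reingold layout. The two small data frames are hypothetical stand-ins for “vertexes.txt” and “relations.txt”, whose actual column layout is not shown here; with the real files they would be read with read.table first.

```r
library(igraph)

# Hypothetical stand-ins for the two input files
# (with real files: vertices <- read.table("vertexes.txt", header = TRUE), etc.)
vertices <- data.frame(name = c("A", "B", "C", "D"),
                       dN = c(0.10, -1, 0.30, 0.05))
relations <- data.frame(from = c("A", "A", "B"),
                        to = c("B", "C", "C"))

# Undirected graph; extra vertex columns such as dN become vertex attributes
g <- graph_from_data_frame(relations, directed = FALSE, vertices = vertices)

# Fruchterman-Reingold force-directed coordinates, one row per vertex
coords <- layout_with_fr(g)

plot(g, layout = coords, vertex.size = 5, vertex.label = NA)
```

Vertices listed in the vertex table but absent from the edge table (here D) are kept as isolated nodes.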
[Figures: degree.distribution(g) plotted against Index, and P(k) (panel A), both on logarithmic axes]
In this section, we confirm that hub proteins in a scale-free network are evolutionarily old. Plot the evolutionary rate against degree as small dots and add the regression line. Confirm that hub genes are evolutionarily older (Fig. 13.11).
> ev_table <- matrix(0, 2569, 1) # Make matrix to calculate evolutionary rate
> plot(k_degree, y_value, type = "n", ylim = c(0, 1.2), xlab = "Degree", ylab = "dN",
temp <- traits[ , 3] == i & traits[ , 4] >= 0 # Search data matching the ppi value
each_degree_data = traits[temp, ]
data is
Using data obtained from the above practice, calculate the Pearson correlation coefficient
with p-value or Spearman correlation coefficient with p-value.
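Both coefficients and their p-values can be obtained with cor.test from base R. The sketch below uses simulated degree and dN values, since the book's data are not reproduced here; the variable names and the simulated negative trend are assumptions made only for illustration.

```r
set.seed(42)

# Simulated data: degree k and an evolutionary rate that tends to fall with k
k_degree <- rep(1:20, each = 5)
dN <- pmax(0, 0.3 - 0.01 * k_degree + rnorm(length(k_degree), sd = 0.05))

# Pearson correlation: strength of the linear relationship
pearson <- cor.test(k_degree, dN, method = "pearson")

# Spearman rank correlation: monotonic relationship, robust to outliers;
# exact = FALSE avoids warnings about ties in the degree values
spearman <- cor.test(k_degree, dN, method = "spearman", exact = FALSE)

c(r = unname(pearson$estimate), p = pearson$p.value)
c(rho = unname(spearman$estimate), p = spearman$p.value)
```

A negative coefficient with a small p-value would support the claim that highly connected proteins evolve more slowly.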
Exercises
[Exercise 1] - Extract proteins which interact with p53 using the protein interaction databases in Table 13.2.
[Exercise 2] - Visualize the interaction network using data from [Exercise 1].
[Exercise 3] - Find proteins in the OMIM database. Draw proteins associated with disease in red and
proteins not related to disease in blue.
[Exercise 4] - Draw proteins associated with metabolism in red and proteins not associated with
metabolism in blue using physical interaction data from the yeast genome database.
[Exercise 5] - From [Exercise 4], extract the proteins related to metabolism and plot the distribution of
connections between them.
Bibliography
1. Adler D, Murdoch D, et al (2017) rgl: 3D visualization using OpenGL. R package version 0.98.1. https://CRAN.R-project.org/package=rgl
2. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
3. Csardi G, Nepusz T (2006) The igraph software package for complex network research, InterJournal,
Complex Systems 1695. http://igraph.org
4. Fraser HB et al (2002) Evolutionary rate in the protein interaction network. Science 296:750–752
5. Jeong H, Mason SP, Barabasi AL et al (2001) Lethality and centrality in protein networks. Nature
411:41–42
6. Saccharomyces Genome Database – http://www.yeastgenome.org
7. Shannon P et al (2003 Nov) Cytoscape: a software environment for integrated models of biomolecular
interaction networks. Genome Res 13(11):2498–2504
8. Wall DP, Hirsh AE, Fraser HB et al (2005) Functional genomic analysis of the rates of protein evolution. PNAS 102(15):5483–5488
9. Wuchty S, Almaas E (2005) Peeling the yeast protein network. Proteomics 5(2):444–449
Chapter 14 · SNPs, GWAS, CNVs: Informatics for Human Genome Variations
14.1 Introduction
When the first draft of the human genome was completed by the Human Genome Project at the end of the twentieth century, we expected that the mysteries of human genetics would be solved. News that the blueprint of the human genome was finished spread rapidly. It was also reported that the entire chimpanzee genome had been obtained and that humans and chimpanzees share 99% of their genomes. It was said that we were close to conquering all diseases. In reality, however, the results were not sufficient to explain the molecular mechanisms of life.
Investigating the relationship between genotype and phenotype is one of the oldest research topics. The results of the Human Genome Project showed that there is more variation among individuals than initially speculated. Genome variation can be broadly divided into variations shorter than 1000 base pairs, called base variations, and variations longer than 1000 base pairs, called structural variations. Base variations include single nucleotide polymorphisms (SNPs) and insertions and deletions (InDels); structural variations include copy number variation (CNV), segmental duplication, translocation, and inversion (Figs. 14.1 and 14.2). Copy number variation can be further divided into amplification and deletion. Although 1000 base pairs is used as the threshold for structural variation, the threshold is arbitrary, and copy number variations of 800~1000 base pairs have been reported. It is therefore important to regard 1000 base pairs as no more than a convenient convention. Figure 14.2 outlines the distribution of the various genome variations and their criteria.
Among the many kinds of genome variation, SNPs drew the attention of scientists early on. DNA is made up of four bases, A, G, T, and C, and encodes the crucial information of life. It is well known that each base can undergo variations such as mutations, insertions, and deletions. SNPs and mutations are essentially the same in that both are single-base variations. They differ in that a mutation points to an exceptional (or diseased) condition, whereas an SNP reflects common diversity in the population. The distinction is drawn solely for convenience: if the frequency of the rarer allele at a single base is >1%, the variation is defined as an SNP, and if it is <1%, it is defined as a mutation. Mutations are therefore understood as exceptions, or what we commonly call ‘disease’, and SNPs as factors that can explain human diversity. This distinction is bound to cause controversy.
In data analysis, SNPs have two main characteristics: (1) they are dense markers, and (2) they are disease related. Because an SNP usually has only two alleles, a single SNP has little power to discriminate between alleles, but SNPs occur at high frequency along the genome. If several SNPs are used together, the number of allele combinations that can be distinguished grows in powers of two. They can therefore be used as dense markers in dynamics and genealogy research. The second characteristic, disease relatedness, could also be understood in terms of dense markers. However, the hypothesis states that a specific SNP or
[Figure: tree of genome variation, with the labels Genome Variation, Base Variation, Single Nucleotide Polymorphism (SNP), Copy Number Variation (CNV), Copy Number Gain, Translocation, and Inversion]
[Figure: types of genome variation, listed by column of the original figure]
• Base change (substitution, point mutation); insertion-deletions (“indels”); SNPs, tagSNPs
• Microsatellites, minisatellites; indels; inversions; di-, tri-, tetranucleotide repeats; VNTRs
• Copy number variants (CNVs); segmental duplications; inversions, translocations; CNV regions (CNVRs); microdeletions, microduplications
• Segmental aneusomy; chromosomal deletions (losses); chromosomal insertions (gains); chromosomal inversions; intrachromosomal translocations; heteromorphisms; fragile sites
• Interchromosomal translocations; ring chromosomes, isochromosomes; marker chromosomes; aneuploidy; aneusomy
Despite the attention SNPs have received, SNP research has made little progress on diabetes and other complex diseases that stem from multiple causes. It may be necessary to readjust the research direction. Regardless, research on genomic variation continues, and the limits of GWAS are being supplemented by NGS technology and the 1000 Genomes Project. These are efforts to list all genomic variations and defects in order to understand human diseases. In this exercise, we analyze SNP data using GWAS, as well as copy-number variation data. Other types of variation, including InDels, microsatellites, and cancer genomic markers, are also important research targets; they will, however, be explained elsewhere.
14.2 dbSNP
dbSNP, provided by NCBI, is an important resource that is highly reliable and contains a comprehensive collection of genomic variation data. A trustworthy database is crucial to research on genome variation, and the underlying informatics must be upgraded continually to keep data analysis reliable. Users of SNP resources need to know when a resource was created, whether it has been updated with the most recent data, and whether the versions of the data, the analysis software, and the database being used are consistent with one another.
Each distinct SNP needs to be given a distinct identifier. Reference SNP identifiers, or rsIDs, are the well-known identifiers provided by dbSNP; InDels and repeats are also assigned identifiers in dbSNP. The constant increase in the number of SNPs makes them difficult to manage, and several rsIDs may end up identifying the same SNP. In this case, the rsIDs are merged into a single reference, and the others are kept as aliases. Therefore, if alias rsIDs are not taken into account when looking up rsIDs cited in articles or used by software, the search may yield incorrect results. dbSNP search results show the history of each SNP. For example, the web-based analysis tool SNAP (http://
The goal of the 1000 Genomes Project was to analyze the genomes of 1000 people, including 697 samples used in HapMap. The project was first announced in 2009 and continues to provide data on common and rare human genomic variation. Some SNPs in ThaiSNP2, Taiwan Biobank, the HGDP-CEPH Database, and ALFRED can also be obtained from dbSNP or HapMap. ALFRED is known to provide information on 664,292 variations across 724 populations. Such diverse samples can be used to calculate SNP frequencies in population controls or to trace the ancestral origin of an SNP of interest.
BioMart, SPSmart, and similar tools can quickly return SNP results that match user preferences and can also organize searches. Databases that provide useful information about a specific variation combine the literature with several different databases. OMIM is a database manually curated by experts that can be searched for allelic variants and SNP IDs. Similarly, the National Institutes of Health Genetic Association Database (NIH GAD) provides population-specific studies, statistics on specific variations, and research conclusions from 40,000 resources gathered from several different sources.
The GWAS catalog contains a large set of analyzed data, although portions of the GWAS results are incomplete. There are efforts to increase the use of GWAS data, in which the NCBI Database of Genotypes and Phenotypes (dbGaP) plays a major role. The results of GWAS are maintained by the National Human Genome Research Institute (NHGRI).
14.4 PharmGKB
PharmGKB was created at Stanford University in 2000. At that time, there was no file format for storing the phenotypic and genotypic data produced by pharmacogenetics research. Appropriate file and search formats were developed as the data and research results grew. Not only gene-drug relationships but also genomic
14.5 Genome-Wide Association Studies (GWAS)
The main idea of an association study is to investigate differences in the allele frequencies of SNPs between case and control groups. When the difference in allele frequency is statistically significant, the given allele has a significant relationship with the phenotype.
1 In this chapter, an association study is considered separately from a linkage study. A linkage study is used to explore trait-associated loci within families, whereas an association study is used to explore the association between a specific genotype and a phenotype at the population level. The two are therefore complementary: a linkage analysis can be used to identify related genes or loci, and an association study can be used to estimate how strongly specific variants are associated with a trait.
Using a chi-square test, we can summarize the difference between the expected and observed allele frequencies. This type of frequency analysis is called an allelic test, and its unit of analysis is the allele. The genotypes aa, aA, and AA are split into the allele pairs a-a, a-A, and A-A, and the relationship between each single allele and the phenotype is then tested. The interaction between the two alleles of a genotype is ignored. For example, if there are 100 people, the expected contingency table is as below.
allele a allele A
allele a allele A
We use a chi-square test of independence to investigate the statistical difference between the two distributions above.
\[ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \]
The chi-square value is 50 and the associated p-value is <0.0001. Therefore, the marker can be regarded as statistically significant.
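The allelic test can be sketched with chisq.test from base R. The counts below are hypothetical: they assume 50 cases and 50 controls (100 people, 200 alleles) and were chosen so that the statistic equals the value of 50 quoted above. They are not the original tables of this example.

```r
# Hypothetical allele counts: 50 cases and 50 controls, two alleles per person
allele_counts <- matrix(c(75, 25,
                          25, 75),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(group = c("case", "control"),
                                        allele = c("a", "A")))

# Pearson chi-square test of independence without continuity correction
res <- chisq.test(allele_counts, correct = FALSE)

res$expected    # expected counts under independence (50 in every cell)
res$statistic   # chi-square statistic
res$p.value     # p-value on 1 degree of freedom

# The same statistic computed directly as the sum of (O - E)^2 / E over cells
manual <- sum((allele_counts - res$expected)^2 / res$expected)
manual
```

Each cell contributes (25)^2 / 50 = 12.5, so the four cells sum to 50, and the p-value on one degree of freedom is far below 0.0001.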
A genotype test uses a 2×3 contingency table (homozygous aa, heterozygous Aa, homozygous AA) instead of the 2×2 table used for an allelic test. Broadly, there are three types of models, as described below:
• Additive Model: Assume that two copies of the minor allele (e.g., the AA genotype) have twice the effect of a single copy (e.g., the Aa genotype). The corresponding test is also called the Cochran-Armitage trend test.
• Dominant Model: Assume that at least one copy of the minor allele (Aa or AA) is sufficient for the phenotypic effect.
• Recessive Model: Assume that a phenotypic effect occurs only with two copies of the minor allele.
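The three models, together with the general 2×3 genotype test, can be sketched in base R as follows. The genotype counts are hypothetical, with A taken as the minor allele, and prop.trend.test from the stats package is used here as one way to run a Cochran-Armitage type trend test; PLINK's own implementation may differ in detail.

```r
# Hypothetical genotype counts (A is taken to be the minor allele)
cases    <- c(aa = 30, Aa = 50, AA = 20)
controls <- c(aa = 50, Aa = 40, AA = 10)
geno <- rbind(case = cases, control = controls)

# General genotype test on the full 2x3 table (2 degrees of freedom)
genotypic <- chisq.test(geno)

# Additive model: trend in the proportion of cases across 0, 1, 2 copies of A
additive <- prop.trend.test(x = cases, n = cases + controls, score = c(0, 1, 2))

# Dominant model: carriers of at least one A (Aa or AA) versus aa
dominant <- chisq.test(cbind(aa = geno[, "aa"],
                             carrier = geno[, "Aa"] + geno[, "AA"]),
                       correct = FALSE)

# Recessive model: AA versus everyone else
recessive <- chisq.test(cbind(other = geno[, "aa"] + geno[, "Aa"],
                              AA = geno[, "AA"]),
                        correct = FALSE)

pv <- c(genotypic = genotypic$p.value, additive = additive$p.value,
        dominant = dominant$p.value, recessive = recessive$p.value)
pv
```

The dominant and recessive models collapse the 2×3 table into a 2×2 table in different ways, which is why the same data can give different p-values under each model.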
In Chap. 16, the example works with 83,534 SNPs. We run the allelic test and four genotype-based tests on each SNP, for a total of 417,670 tests (= 5 × 83,534). With the significance level set at p < 0.05, each hypothesis test has a 5% false positive rate, so the number of false positives grows with the number of comparisons; this is known as the ‘multiple comparisons problem’ (see footnote 3). To reduce these false positive associations, we can either adjust the p-value threshold or adjust the p-values themselves, and both can be done with the PLINK program. PLINK offers seven methods. Table 14.1 explains each method and its corresponding symbol.
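Both strategies can be illustrated with p.adjust from base R; this is a sketch of the general idea and not a reproduction of PLINK's output. The raw p-values are made up and are assumed to be the seven smallest p-values of the whole scan.

```r
# Number of tests: five tests on each of the 83,534 SNPs
n_tests <- 5 * 83534

# Strategy 1: tighten the threshold (Bonferroni)
bonf_threshold <- 0.05 / n_tests
bonf_threshold

# Made-up raw p-values, assumed to be the seven smallest of the whole scan
p_raw <- c(1e-9, 2e-7, 4e-5, 0.003, 0.04, 0.2, 0.6)

# Strategy 2: adjust the p-values and keep the usual 0.05 threshold.
# n = n_tests tells p.adjust how many tests were really performed.
p_bonf <- p.adjust(p_raw, method = "bonferroni", n = n_tests)
p_bh   <- p.adjust(p_raw, method = "BH", n = n_tests)

data.frame(p_raw, p_bonf, p_bh,
           pass_threshold = p_raw < bonf_threshold)
```

Comparing raw p-values with the tightened threshold and comparing Bonferroni-adjusted p-values with 0.05 give the same decisions, while the Benjamini-Hochberg adjustment controls the false discovery rate and is less conservative.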
The human genome has 23 pairs of chromosomes, one set of paternal and one set of maternal origin. This paired state is called diploid (2n). However, haploid (1n) and triploid (3n) states also exist. Polyploidy and aneuploidy are quite common in cancer cells, where such changes are called copy number alterations (CNAs). As genomics developed, however, copy-number changes restricted to partial regions of the genome were found to be common as well. This is called copy number variation (CNV).
3 Refer to Chap. 5, Sect. 5.4 and Chap. 6, Sect. 6.3 for the multiple comparisons problem in microarray analysis.
More recently, copy-number variation has come to be regarded not merely as variation but as polymorphism, similar to SNPs. This is called copy number polymorphism (CNP). It is interesting that the prevailing view has shifted from normal (diploid), to altered (abnormality, CNA), to variation (exceptional diversity, CNV), and finally to polymorphism (common polymorphism, CNP).
In early research on copy-number variation, about 100 variations per individual were detected. Recent technological advances, however, detect an average of about 1000 variations per individual, and that number is expected to increase. Since copy-number variation is known to be inherited from parents, there have been studies on how CNV can explain genetic diseases and how it relates to specific phenotypes. Active genes may lie within copy-number variation regions. In that case, gene dosage changes, and the chances of resulting molecular or pathological changes are high. In this respect, there are high expectations that CNV can compensate for the biggest weakness of SNPs. As a real example, childhood autism is being studied in relation to copy-number variation, and many causal relationships involving CNVs have been reported to date.
The early focus on SNPs can be seen as a consequence of technological limits: only single bases or short base sequences could be detected. Long structural variations were difficult to detect and, at first, difficult to interpret meaningfully. Recent technology, however, allows us to detect variations of all lengths, and detection accuracy continues to improve. Whereas SNPs affect about 1% of the human genome, CNVs are known to affect over 10% of it and are widely distributed. We expect that CNV may be more important than SNP.
To date, not many diseases can be explained by CNV. Mutations arising during meiosis (reduction division) are known to cause disease. During mutation formation, genes in regions containing CNVs may be severely damaged or fused, causing defects in gene function. Furthermore, when the variation is large, many genes may be affected and a range of symptoms may be observed. This is known as contiguous gene syndrome. Diseases in which CNV is known to affect a single gene include Duchenne/Becker muscular dystrophy, neurofibromatosis type 1, tuberous sclerosis, Sotos syndrome, CHARGE syndrome, Pelizaeus-Merzbacher disease, early-onset Alzheimer’s disease, autism, psoriasis, Crohn’s disease, and Parkinson’s disease. Contiguous gene syndromes include DiGeorge and Williams syndromes.
At first, using single-base polymorphisms to investigate the causes of disease was not fruitful, and the search for explanations of the missing heritability continued. Within this search, attention turned to CNVs because they change more of the gene sequence than single-base polymorphisms do, which in turn may explain diseases or specific phenotypes. We expect that about 10% of the genome contains variations, and that the other 1% difference can be explained solely by human diversity. The copy-number variation data registered in the Database of Genomic Variants have been increasing significantly since 2005.
There are two main methods for detecting CNVs: (1) microarray chips and (2) NGS technology. Chip technology is more cost efficient than sequencing and is suitable for detecting known variations. However, its accuracy is lower, and it cannot detect new variations. Sequencing technology, for its part, has great difficulty detecting CNVs in repetitive sequence regions, and even when detection succeeds, the resolution is very low. Recently, many researchers have been using chip analysis first and sequencing technology afterwards to verify the results.
Structural variation was initially detected with array CGH (aCGH), a microarray chip based on comparative genomic hybridization: (1) the sample and control genomes are first labeled with different fluorescent dyes; (2) the sample and control sequences bind competitively to the complementary sequences of the probes fixed on the aCGH chip; (3) the amount bound is quantified from the fluorescence intensity. The weakness of this method, however, is that it cannot accurately locate the start and end points (breakpoints) of a structural variation. Recently, high-density, high-resolution chips with short-sequence probes have been used.
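Step (3) is commonly summarized as a log2 ratio of sample to control intensity per probe. The following base R sketch shows the idea; the intensities and the cut-offs of ±0.5 are made-up values for illustration, not recommended settings.

```r
# Made-up fluorescence intensities for eight probes along a chromosome
sample_int  <- c(1000,  980, 2100, 1950, 1020,  480,  510,  990)
control_int <- c(1010, 1000, 1000,  990, 1000, 1000, 1020, 1000)

# log2 ratio: about 0 for equal copy number, positive for gain, negative for loss
log2_ratio <- log2(sample_int / control_int)

# Illustrative cut-offs: below -0.5 is called a loss, above 0.5 a gain
call <- cut(log2_ratio, breaks = c(-Inf, -0.5, 0.5, Inf),
            labels = c("loss", "normal", "gain"))

data.frame(probe = seq_along(log2_ratio),
           log2_ratio = round(log2_ratio, 2),
           call = call)
```

Because each probe reports only one averaged signal, the boundaries of a gain or loss can be located no more precisely than the spacing between probes, which is the breakpoint limitation mentioned above.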
Two sequencing-based methods are used in CNV detection: (1) paired-end mapping (PEM) and (2) depth of coverage (DOC) analysis.
A paired-end read carries distance information, because the two reads of a pair come from the two ends of one fragment of the genome of interest. Structural variation can be inferred by comparing the distance and orientation of the mapped pair with what is expected from the fragment. In addition, the beginning and end of the structural variation can be identified accurately, since the linked region is already defined. Because the mapping of each read to the reference genome takes the forward and reverse directions into account, a region that is inverted or copied onto the opposite strand can be found, which reveals structural variations not detectable by chip-based analysis. DOC analysis detects CNVs from the ratio of reads mapped to each region of the reference genome. Assuming that reads map uniformly across the genome, the mapping depth increases by a specific ratio in regions of high copy number and decreases in deleted regions, and this difference can be quantified. The drawback of this type of analysis is that it is difficult to detect structural variation in short sequences and to identify start and end points. Therefore, the two methods are used together, so that each compensates for the weakness of the other in detecting structural variation.
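The idea behind DOC analysis can be sketched with simulated read counts in base R. The window layout, the mean depth of 400 reads per diploid window, and the positions of the gain and the loss are all assumptions made for this example; real pipelines also correct for GC content and mappability, which is omitted here.

```r
set.seed(7)

# Simulated genome of 60 windows: diploid everywhere except one gain and one loss
n_win <- 60
true_cn <- rep(2, n_win)
true_cn[21:30] <- 3   # copy-number gain
true_cn[41:45] <- 1   # heterozygous deletion

# Read depth per window is proportional to copy number (Poisson sampling noise)
depth <- rpois(n_win, lambda = 400 * true_cn / 2)

# Normalize by the median depth, which represents the diploid state
ratio <- depth / median(depth)
log2_ratio <- log2(ratio)

# Estimated copy number and a simple state label per window
est_cn <- round(2 * ratio)
state <- ifelse(est_cn > 2, "gain", ifelse(est_cn < 2, "loss", "normal"))

table(state)
```

The estimate is only as fine as the window size, which illustrates why DOC analysis struggles with short variations and with exact start and end points.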
14.8 Conclusion
In the last 10 years, genome-wide expression analysis, GWAS, and copy-number variation analysis based on large genome data sets have rapidly advanced our understanding of the genes and genomic regions involved in human diseases. This has been a major starting point for genomic medicine and the enhancement of human health. However, GWAS has yielded disappointing results: missing heritability, small effect sizes, and inconsistent phenotypes. Furthermore, GWAS was originally based on the common disease-common variant (CDCV) hypothesis, but the field has moved toward the common disease-rare variant (CDRV) hypothesis, in which a few rare alleles affect diseases (Fig. 14.3). We hope that NGS technology will allow us to find causal variations, including rare mutations, and ultimately redefine genome research.
[Figure: classes of variants arranged by penetrance (High, Intermediate, Modest, Low): rare variants causing Mendelian disease; highly unusual common variants influencing common disease; less common variants with intermediate penetrance; common variants influencing common disease, identified by GWA studies; private variants, hard to identify by genetic means]
Bibliography
1. Hirschhorn JN et al (2002) A comprehensive review of genetic association studies. Genet Med 4:45–61
2. International HapMap Consortium (2003) The international HapMap project. Nature 426(6968):789–796
3. Lander ES (1996) The new genomics: global views of biology. Science 274:536–539
4. Lander ES et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
5. Lohmueller KE et al (2003) Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet 33:177–182
6. Manolio TA et al (2009) Finding the missing heritability of complex diseases. Nature 461:747–753
7. McCarthy MI et al (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9:356–369
8. Osier MV et al (2001) ALFRED: an allele frequency database for diverse populations and DNA polymorphisms – an update. Nucleic Acids Res 29(1):317–319
9. Purcell S et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575. Epub 2007 Jul 25
10. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516–1517
11. Sherry ST et al (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311
12. Thorn CF et al (2013) PharmGKB: the pharmacogenomics knowledge base. Methods Mol Biol 1015:311–320. https://doi.org/10.1007/978-1-62703-435-7_20
13. Witte JS (2009) Prostate cancer genomics: towards a new understanding. Nat Rev Genet 10:77–82
14. Witte JS (2010) Genome-wide association studies and beyond. Annu Rev Public Health 31:9–20. 4 p following 20
14
261 15
Bibliography – 279
15.1 Introduction
15.1.1 dbSNP
dbSNP is the basic search database for SNPs (single nucleotide polymorphisms). dbSNP was jointly created by NHGRI and NCBI and has been available since December 2001; build 101 was the first version of the database. Currently, it is run through the NCBI Entrez system (Table 15.1). A typical SNP analysis begins by searching for SNPs that are already registered in dbSNP. The current version of dbSNP includes over ten million SNPs, each with a given rs number.
The International HapMap Project officially started in 2002 in Canada, China, Japan, Nigeria, England, and the United States to establish a haplotype map of the human genome. Stage 1 was announced in 2005, followed by Stage 2 in 2007, and Stage 3 in 2009 (Tables
15.1.3 PharmGKB
April 2000 online, it has become one of the most-used drug-related databases, along with
DrugBank. PharmGKB provides curated data for the relationships between genes, drugs,
and diseases; various information about the drug itself; and information on important
genes in drug-related pathways.
.. Table 15.1 Summary of dbSNP database
Sample & POP panels: 269 samples (4 panels); 270 samples (4 panels); 1115 samples (11 panels)
Unique QC + SNPs: 1.1 M; 3.8 M (phase I + II); 1.6 M (Affy 6.0 & Illumina 1 M)
71 ASW 1,543,731
82 CHB 1,342,348
70 CHD 1,312,343
83 GIH 1,409,510
82 JPT 1,294,974
83 LWK 1,527,403
71 MEX 1,453,659
77 TSI 1,420,526
Attribute #(number)
Gene 27,007
Chemicals 3634
Diseases 3518
Pathways 114
15.2 dbSNP
dbSNP is composed of two types of data: (1) data submitted by users (ssID) and (2) data maintained or adjusted by the system (rsID). These IDs are the basic identifiers and the format used when referring to a SNP (Fig. 15.1).
264 Chapter 15 · SNP Data Analysis
browser.
2. In the “Reference cluster ID (rs#)” search box type rs380390 and click “Search”.
3. You can verify CFH gene related information on the results page.
Map weight; [WEIGHT], [MPWT], [HIT]; Integer; SNP map weight info – the number of times a SNP maps to the genome contig (range 1–10)
15.3 1000 Genomes Project
The 1000 Genomes Project was planned during a meeting at the Wellcome Genome Campus in September 2007. Taking advantage of developments in sequencing technology and the reduced cost of sequencing, this project produced the largest public catalogue of human variation and genotype data for various populations. The 1000 Genomes Project includes the initial HapMap data, and the HapMap site has since disappeared. In 2013, the 1000 Genomes Project released phase 3, and data aligned to GRCh38 were released more recently.
15.3.1 Data
The 1000 Genomes Project data can be downloaded from the FTP sites of EBI and NCBI. Aspera and Globus software can be used for faster and more reliable downloads. Additional portals are available to categorize data according to sample, population, and release version (Fig. 15.2 and Table 15.6).
the data site. Browse the directories to look at the data provided by the 1000 Genomes Project. The release directory is organized by version. You can access the latest phase 3 data through the 20130502 directory (genome build hg19). The data is organized by chromosome.
.. Table 15.6 (continued)
15.3.2 Browser
On the browser page, data can be searched using Ensembl or the 1000 Genomes Browser. Ensembl is linked from the site, and data can be retrieved for the human genome builds GRCh37 and GRCh38. The pilot, phase 1, and phase 3 project releases can be viewed in the 1000 Genomes Browser. The current main version is phase 3 (http://phase3browser.1000genomes.org) (Fig. 15.3).
Enter the previous search term ‘rs380390’ in the search bar. The results page shows a brief summary of the variation together with its genomic context, genes and regulation, population genetics, individual genotypes, linkage disequilibrium, phenotype data, citations, and so on (Fig. 15.4).
For example, under population genetics, allele frequencies for the 2504 individuals of the 1000 Genomes Project Phase 3 are available. We can verify that the overall MAF is 0.25 (G), while the frequency of G is 0.05 in EAS, 0.4 in EUR, and 0.29 in SAS.
15.3 · 1000 Genomes Project
.. Fig. 15.4 The result page of rs380390 search and detailed allele frequencies
15.4 PharmGKB
The main page provides the PharmGKB knowledge pyramid and news. In the search bar at the top of the page, we can search by gene, rsID, drug name, or disease name. In this exercise, let’s search for the drug warfarin (Figs. 15.5 and 15.6).
Clinical information related to warfarin is organized into tabs. On the results screen, click dosing guidelines. The important points regarding warfarin dosing are summarized from the published literature, and we can also identify how genotype affects dosing.
The drug labels tab shares drug label information from the Food and Drug Administration (FDA) and Health Canada (Santé Canada, HCSC). The clinical annotations tab lists drug-related genes and variations. Level
These are the search results for variation rs105710 of the gene CYP2C9. After logging in, we can see additional details for every clinical annotation, including race, genotype, and other factors (Fig. 15.8).
In this database, VIP stands for very important pharmacogene and shows information about the gene. The pathway tab shows diagrams of the pharmacodynamic and pharmacokinetic effects of the drugs of interest (Fig. 15.9).
15.5.1 Preparation
In this exercise, you will practice SNP analysis in Linux with a single sample of the IBS population from the 1000 Genomes Project. You will need to install ANNOVAR and VCFtools. The download locations are shown below; refer to Appendix B for detailed installation instructions.
• ANNOVAR (http://annovar.openbioinformatics.org/en/latest/)
• VCFtools (http://vcftools.sourceforge.net/index.html)
• Data (Table 15.7)
##fileformat=VCFv4.1
..
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT EUR_IBS_HG01500_M EUR_IBS_HG01501_F
19 60842 . A G 100 PASS AC=2;AF=0.000399361;AN=5008;NS=2504;DP=19533;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=.||| GT 0|0 0|0
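The INFO column of the record above can be checked by hand: AF should equal AC divided by AN, and AN should be twice the number of samples (NS). The following is a minimal Python sketch of that check, independent of the VCFtools workflow used in this chapter; the helper name parse_info is hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'AC=2;AN=5008' into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields

# The example record from the text, split on whitespace for illustration
# (real VCF files are tab-delimited).
record = ("19 60842 . A G 100 PASS "
          "AC=2;AF=0.000399361;AN=5008;NS=2504;DP=19533;EAS_AF=0;"
          "AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=.||| GT 0|0 0|0")
columns = record.split()
info = parse_info(columns[7])          # INFO is the eighth column

# Allele frequency = alternate allele count / total number of called alleles.
allele_frequency = int(info["AC"]) / int(info["AN"])
print(allele_frequency, info["AF"])
```

Two alternate alleles among 5008 chromosomes (2504 diploid samples) give the very low frequency reported in AF.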
15.5 · Variant Analysis Using NGS
.. Fig. 15.8 The significant related variant with high evidence level
Extract information for the sample “EUR_IBS_HG01612_F” using the command vcf-subset from VCFtools, and verify the file.
EUR_IBS_HG01612_F.vcf &
##fileformat=VCFv4.1
..
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT EUR_IBS_HG01612_F
19 60842 . A . 100 PASS AA=.|||;AC=0;AF=0.000399361;AFR_AF=0;AMR_AF=0.0014;AN=2;DP=19533;EAS_AF=0;EUR_AF=0;NS=2504;SAS_AF=0.001 GT 0|0
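Conceptually, extracting one sample from a multi-sample VCF means keeping the nine fixed columns plus that sample's genotype column. The Python sketch below illustrates only this column selection on a toy file; it is not the vcf-subset implementation. As the output above suggests (AC=0, AN=2, ALT set to "."), the real tool also recalculates some fields, which this sketch does not attempt. The function name subset_sample and the toy sample names S1 and S2 are hypothetical.

```python
def subset_sample(vcf_lines, sample):
    """Keep the nine fixed VCF columns plus the column of one sample."""
    kept = []
    sample_index = None
    for line in vcf_lines:
        if line.startswith("##"):          # meta-information lines pass through
            kept.append(line)
            continue
        columns = line.split("\t")
        if line.startswith("#CHROM"):      # header row names the sample columns
            sample_index = columns.index(sample)
        kept.append("\t".join(columns[:9] + [columns[sample_index]]))
    return kept

# Toy two-sample VCF (hypothetical data, tab-delimited as in real files).
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "19\t60842\t.\tA\tG\t100\tPASS\tAC=1;AN=4\tGT\t0|0\t0|1",
]
subset = subset_sample(toy_vcf, "S2")
print(subset[-1])
```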
[Figure: warfarin pathway diagram. Recoverable labels: cell membrane; R-warfarin and S-warfarin; NADH and NAD+; EPHX1, VKORC1, GGCX, and CALU; hydroxy-vitamin K1; functional and hypofunctional forms of F2, F10, F9, GAS6, F7, PROZ, BGLAP, PROC, PROS1, and MGP.]
.. Table 15.7 List of data for practice
IBS.chr19.vcf.gz: A VCF format file that contains gene information from 107 IBS samples.
IBS_chr19.exonic.txt: An annotation file including gene and mutation type for all variants located on chromosome 19.
EUR_IBS_HG01612_F.chr19.vcf: A file for the single IBS sample “EUR_IBS_HG01612_F” out of 107, containing gene and mutation type for all variants located on chromosome 19.
RVIS_v3_12MAR16.txt: A file containing RVIS scores, which was provided by the publication.
code: A directory containing code files that will be used in the practice.
EUR_IBS_HG01612_F.step1
--maf 0.001
Run the following command to generate the input file “EUR_IBS_HG01612_F.step2” for
the next step.
Functional variants in splice site and exonic regions are saved in the file “EUR_IBS_HG01612_F.step2.exonic_variant_function”.
In this practice, you will learn to calculate the gene score using whole-genome sequencing data, since many research institutes (including the 1000 Genomes Project) provide this data. Run the following command to generate the file “EUR_IBS_HG01612_F.step3”, which will be the input for the next step.
-buildver hg19
-operation g,f,f
RVIS fits a regression line that relates the number of common functional variants identified in a gene to the total number of variants identified in that gene. Therefore, you first need to count the total and the common functional variants in each gene’s coding region. In this practice, common variants are defined as those with an MAF >0.1%, and functional variants are defined as those with functional annotations of ‘nonsynonymous’, ‘frameshift insertion’, ‘frameshift deletion’, or ‘stopgain’. Count the number of variants per gene using the following R code (Fig. 15.10):
[Figure: scatter plot of the sum of all common functional variants in a gene (Y) against the sum of all variant sites in a gene (X); labeled genes include KIAA1683 and RYR1.]
Gene     RVIS_score
GRIN3B   1.97729539195145
COL5A3   -1.58064893054413
C19orf6  -2.51055613470339
A negative (−) RVIS score means the gene is intolerant of variation, having fewer observed variants than expected. A positive (+) RVIS score means the gene is tolerant of variation, having a higher number of observed variants than expected. An intolerant gene is a potentially pathogenic locus that may cause Mendelian disease, as the gene is highly conserved and has a relatively lower chance of mutation.
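The regression idea behind the score can be sketched in a few lines of Python. This is a simplified illustration on invented counts, not the published RVIS procedure: the published score is based on studentized residuals, whereas this sketch divides each residual by the plain standard deviation of all residuals. The function name residual_scores and the gene labels are hypothetical.

```python
from statistics import mean, pstdev

def residual_scores(total_variants, common_functional):
    """Regress Y (common functional variants) on X (all variants) by least
    squares and return each gene's residual divided by the residual SD."""
    mx, my = mean(total_variants), mean(common_functional)
    sxx = sum((x - mx) ** 2 for x in total_variants)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(total_variants, common_functional))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(total_variants, common_functional)]
    sd = pstdev(residuals)
    return [r / sd for r in residuals]

# Invented per-gene counts: X = all variant sites, Y = common functional variants.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
x = [10, 20, 30, 40, 50]
y = [2, 4, 3, 8, 10]
scores = residual_scores(x, y)
for gene, score in zip(genes, scores):
    print(gene, round(score, 3))
```

GENE_C carries fewer common functional variants than the regression line predicts for its total variant count, so it receives a negative score and would be read as the intolerant gene in this toy set.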
Exercises
[Exercise 1] - Confirm the number of SNPs in the human gene CYP2D6 and search for information on SNP rs1081003 at the dbSNP website.
[Exercise 2] - Set SNP_Class to SNP for human chromosomes 5 and 6 and filter Ventor SNP data on the dbSNP website. Print the result.
[Exercise 3] - Repeat Exercise 2 using an Entrez Gene query on the dbSNP website.
[Exercise 4] - Search for all drugs related to the drug warfarin. If a patient carries the A allele of variation rs2292566, search the PharmGKB website for the phenotype that is expected.
[Exercise 5] - Investigate the distribution of the allele frequency of rs1799853 among different populations, and investigate which related phenotypes are reported on each of the websites.
Bibliography
1. 1000 Genomes Project – http://www.internationalgenome.org
2. 1000 Genomes Project Consortium et al (2010) A map of human genome variation from population-
scale sequencing. Nature 467(7319):1061–1073
3. Adzhubei I et al (2013) Predicting functional effect of human missense mutations using PolyPhen-2.
Curr Protoc Hum Genet Chapter 7:Unit7.20. https://doi.org/10.1002/0471142905.hg0720s76
4. Danecek P et al (2011) The variant call format and VCFtools. Bioinformatics 27(15):2156–2158. https://
doi.org/10.1093/bioinformatics/btr330. Epub 2011 Jun 7
5. dbSNP site – http://www.ncbi.nlm.nih.gov/projects/SNP
6. Hewett M et al (2002) PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res
30(1):163–165
7. Petrovski S et al (2013) Genic intolerance to functional variation and the interpretation of personal
genomes. PLoS Genet 9(8):e1003709
8. Petrovski S et al (2015) The intolerance of regulatory sequence to genetic variation predicts gene
dosage sensitivity. PLoS Genet 11(9):e1005492
9. PharmGKB – www.pharmgkb.org
10. Sherry ST et al (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311
11. The International HapMap Consortium (2005) A haplotype map of the human genome. Nature
437:1299–1320
12. The International HapMap Consortium (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467:52–58
13. Wang K et al (2010) ANNOVAR: functional annotation of genetic variants from high-throughput
sequencing data. Nucleic Acids Res 38(16):e164
16.1 Introduction
We use the PLINK software (footnote 1), HaploView, and gPLINK (footnote 2) in this practice. PLINK is an open source tool for GWAS developed by the Broad Institute. HaploView is a tool to analyze and visualize genetic information, especially haplotypes. gPLINK is a GUI (graphical user interface) tool for visualization using PLINK and HaploView. gPLINK makes it easy to run PLINK command lines from the graphical interface.
16.2 Prerequisites
• Install the PLINK program. See the installation details in Appendix B. This chapter explains a simple installation on Windows. Java is a prerequisite; the Java download page is https://java.com/en/download/
The following procedure uses the menu option to set the data directory (Fig. 16.1).
1 Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.
Package: PLINK (including version number)
Author: Shaun Purcell
URL: http://pngu.mgh.harvard.edu/purcell/plink/
2 http://pngu.mgh.harvard.edu/~purcell/PLINK/gPLINK.shtml
16.3 *.ped and *.map Files of the PLINK
The files genotypes.ped and genotypes.map contain data on 83,534 SNPs from 89 unrelated individuals (44 cases and 45 controls), created for the purpose of this exercise. The purpose of this exercise is to test the association of phenotype with genotype. The GWAS data is saved in these two text files.
Column 1 = Family ID
Column 2 = Individual ID
Column 3 = Paternal ID (zero for missing)
Column 4 = Maternal ID (zero for missing)
Column 5 = Sex
Column 6 = Phenotype (1=unaffected, 2=affected, and 0=missing)
Column 7, 8 = genotype pair of the first SNP, SNP1 (zero for missing)
Column 9, 10 = genotype pair of the second SNP, SNP2 (zero for missing)
…
Column 457393, 457394 = genotype pair of the last SNP228694
In GWAS, there is a case in which paternal and maternal IDs are zero, but the individual
ID is 1 (. Fig. 16.2).
The locus information of each SNP genotype is tabulated in the genotypes.map file
(. Table 16.1).
genotypes.ped (fields: Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype, then the genotype pairs from SNP1 onward)
JA19012 NA19012 0 0 1 2 G A G G G A T T C C T T G G 0 0 G G 0 0
......
genotypes.map
1 rs6681049 0 789870
1 rs4074137 0 1016570
1 rs7540009 0 1050098
1 rs1891905 0 1090080
1 rs9729550 0 1125105
1 rs3813196 0 1159244
1 rs6704013 0 1187454
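The column layout of a *.ped line can be made concrete with a short Python sketch that splits the example line shown above into its six leading fields and its genotype pairs. This is only an illustration of the file format, not part of PLINK or gPLINK; the function name parse_ped_line and the dictionary keys are hypothetical.

```python
def parse_ped_line(line):
    """Split one whitespace-delimited *.ped line into the six leading
    fields and a list of (allele1, allele2) genotype pairs."""
    tokens = line.split()
    family_id, individual_id, paternal_id, maternal_id, sex, phenotype = tokens[:6]
    alleles = tokens[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("genotype columns must come in pairs")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "family_id": family_id,
        "individual_id": individual_id,
        "paternal_id": paternal_id,
        "maternal_id": maternal_id,
        "sex": sex,
        "phenotype": phenotype,      # 1 = unaffected, 2 = affected, 0 = missing
        "genotypes": genotypes,
    }

example = "JA19012 NA19012 0 0 1 2 G A G G G A T T C C T T G G 0 0 G G 0 0"
person = parse_ped_line(example)
missing = sum(1 for pair in person["genotypes"] if pair == ("0", "0"))
print(person["individual_id"], len(person["genotypes"]), missing)
```

For the example line, the individual is affected (phenotype 2) and two of the listed genotype pairs are missing (0 0).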
As described above, gPLINK is the graphic interface that enables users to use PLINK and
HaploView software. Thus, gPLINK options need to be configured in order to use PLINK
and HaploView.
1. “Project” -> “Configure”: this shows the setup options for PLINK and HaploView.
2. Select “Browse” for “Haploview path” and select the *.jar file in C:\GWAS\PLINK (for Linux OS).
3. In Windows, select “Browse” for “PLINK path” and select PLINK1.07_windows.exe under C:\GWAS\PLINK\. In Mac OS X, select PLINK1.07_mac_intel (Fig. 16.4).
Before analysis, check that the input data is formatted for the PLINK program.
1. Select “PLINK” -> “Summary Statistics” -> “Validate Fileset” (Fig. 16.5)
.. Fig. 16.6 The command to execute gPLINK options in the PLINK software
6. When the window shows the PLINK command line as shown in Fig. 16.6, click
GENO > 1 refers to the genotype missingness test: a SNP is filtered out when its missing genotype rate exceeds the threshold. No SNP is filtered out here because no filtering option has been set yet, so the threshold is 1 and no missing rate can exceed it.
MAF < 0: no SNPs are filtered out, as no SNP can have an MAF lower than zero. As mentioned above, check whether the data is correct.
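The two summary values behind these tests can be computed by hand for a single SNP. The Python sketch below is an illustration on invented genotypes, not PLINK's implementation; it assumes a biallelic SNP with "0" marking a missing allele, and the function names snp_summary and passes_filters are hypothetical.

```python
from collections import Counter

def snp_summary(genotypes):
    """Return (missing rate, minor allele frequency) for one SNP.
    genotypes: list of (allele1, allele2) pairs, '0' meaning missing."""
    called = [pair for pair in genotypes if "0" not in pair]
    missing_rate = 1 - len(called) / len(genotypes)
    counts = Counter(allele for pair in called for allele in pair)
    if len(counts) < 2:                 # monomorphic or nothing called
        maf = 0.0
    else:                               # assumes a biallelic SNP
        maf = min(counts.values()) / sum(counts.values())
    return missing_rate, maf

def passes_filters(genotypes, max_missing=1.0, min_maf=0.0):
    """A SNP fails when its missing rate is above max_missing or its MAF is
    below min_maf; the defaults mirror GENO > 1 and MAF < 0, which remove nothing."""
    missing_rate, maf = snp_summary(genotypes)
    return missing_rate <= max_missing and maf >= min_maf

snp = [("G", "A"), ("G", "G"), ("0", "0"), ("G", "G")]   # invented genotypes
print(snp_summary(snp))
print(passes_filters(snp), passes_filters(snp, max_missing=0.1))
```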
288 Chapter 16 · GWAS Data Analysis
Filtering options are required because of the noise and missing values that can exist in real data (Fig. 16.8).
.... (skip)
.... (skip)
We will perform a basic association test to identify alleles associated with the phenotype. This process filters out SNPs and samples that have low MAFs or high missing rates.
1. Select “PLINK” -> “Association” -> “Allelic Association Tests”
2. Select the “Standard Input” tab.
3. Select “Threshold” and enter the same filtering options as before, as shown in Fig. 16.8.
5. Right click on “Operations Viewer” -> “genotypes_allelictest” -> “Output files” -> genotypes_allelictest.assoc and select “Open in Haploview” (Fig. 16.11).
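An allelic association test compares allele counts between cases and controls in a 2 x 2 table. The Python sketch below computes the standard 1-degree-of-freedom chi-square statistic for such a table on invented counts; it illustrates the principle only and does not reproduce PLINK's output. The function name allelic_chi_square is hypothetical.

```python
import math

def allelic_chi_square(case_a1, case_a2, control_a1, control_a2):
    """Pearson chi-square (1 df) for a 2 x 2 table of allele counts,
    with its p-value from the chi-square survival function."""
    a, b, c, d = case_a1, case_a2, control_a1, control_a2
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # exact for 1 degree of freedom
    return chi2, p_value

# Invented counts: 44 cases (88 alleles) and 45 controls (90 alleles).
chi2, p_value = allelic_chi_square(60, 28, 40, 50)
print(round(chi2, 3), p_value)
```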
We perform the additive genotypic test using the Cochran-Armitage trend test (Fig. 16.12).
6. After completing the test, right click on “Operation Viewer” -> “genotypes_trendtest” -> “Output files” -> “genotypes_trendtest.model.trend.adjusted” and click “Open in.”
7. Right click on “Operation Viewer” -> “genotypes_trendtest” -> “Output files” -> “genotypes_trendtest.model.trend.adjusted”, and then click “Open in Haploview.”
8. Read the result table of the genotypes_trendtest.model.trend.adjusted file as shown in Fig. 16.14.
9. Open the genotypes_trendtest.model to check the result of each genetic model for SNP rs11830226, which is the SNP in the first row. Enter “rs11830226” in the “Specify Marker” text box and click the “Prune Table” button (Fig. 16.15).
ALLELIC and TREND (= additive test) are shown to be significantly associated, while GENO (= basic genotypic), DOM (= dominant model), and REC (= recessive model) are not.
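The trend (additive) test works on genotype counts rather than allele counts. The Python sketch below implements the textbook form of the Cochran-Armitage statistic with weights 0, 1, and 2 for the number of copies of the tested allele; the counts are invented, the function name trend_test is hypothetical, and the result is not meant to match PLINK's numbers for the exercise data.

```python
import math

def trend_test(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for genotype counts ordered by the
    number of copies of the tested allele. Returns (chi-square, p-value)."""
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n = sum(totals)
    n_cases = sum(case_counts)
    n_controls = n - n_cases
    sum_tr = sum(t * r for t, r in zip(weights, case_counts))
    sum_tn = sum(t * m for t, m in zip(weights, totals))
    sum_t2n = sum(t * t * m for t, m in zip(weights, totals))
    numerator = n * (n * sum_tr - n_cases * sum_tn) ** 2
    denominator = n_cases * n_controls * (n * sum_t2n - sum_tn ** 2)
    chi2 = numerator / denominator
    return chi2, math.erfc(math.sqrt(chi2 / 2))   # 1 degree of freedom

# Invented genotype counts (0, 1, 2 copies) for 44 cases and 45 controls.
chi2, p_value = trend_test((10, 20, 14), (20, 20, 5))
print(round(chi2, 3), p_value)
```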
The Manhattan plot is a scatter plot used to display a large number of data points. In GWAS, it is useful for picking out at a glance the few SNPs with low P-values among all tested SNPs. The chromosomes are plotted on the X-axis, and the –log10 (P-values) are plotted on the Y-axis.
1. Move to the genotypes_trendtest.model.trend.adjusted window.
2. Click “Plot” at the bottom.
3. Enter “AssocTest” for Title, select “Chromosomes” for X-Axis, “FDR_BH” for Y-Axis, and “-log10” for Scale (Fig. 16.16).
4. Click the “OK” button and visualize the results as a Manhattan plot (Fig. 16.17).
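The Y-axis values in this plot are multiple-testing-adjusted P-values on a –log10 scale. The Python sketch below shows, under the assumption that FDR_BH denotes the Benjamini-Hochberg adjustment, how such values could be derived from raw P-values; the P-values are invented and the function name benjamini_hochberg is hypothetical.

```python
import math

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

raw = [0.001, 0.02, 0.03, 0.5]                     # invented P-values
fdr = benjamini_hochberg(raw)
heights = [-math.log10(q) for q in fdr]            # Y-axis values of the plot
print(fdr)
print([round(h, 3) for h in heights])
```

The smaller the adjusted P-value, the taller the point, which is why strongly associated SNPs stand out as peaks.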
16.9 · Manhattan Plot
Exercises
[Exercise 1] - How many markers are filtered out when the “Hardy-Weinberg” option is selected in the “PLINK” -> “Summary Statistics” menu?
[Exercise 2] - Perform association tests using diverse genetic models such as --model-gen, --model-dom, and --model-rec from the “PLINK” -> “Association” -> “Genotypic C/C association tests” menu.
Take Home Message
• How to analyze and use GWAS data with gPLINK and visualize with HaploView.
Bibliography
1. Barrett JC et al (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics
21(2):263–265. Epub 2004 Aug 5
2. Purcell S et al (2007) PLINK: a toolset for whole-genome association and population-based linkage
analysis. Am J Hum Genet 81(3):559–575. Epub 2007 Jul 25
CNV Analysis
17.1 Introduction
Humans are genetically different, and they carry various genomic variations. The characteristic genomic variations of each person affect not only traits such as blood type, height, and weight, but also the pathogenesis of complex diseases such as malignant tumors, hypertension, and diabetes. SNPs, CNVs, and variations spanning wider genomic regions contribute to genetic diversity. CNVs are expected to be highly likely to be related to susceptibility to various diseases. A CNV is a variation in which a stretch of hundreds to millions of base pairs is deleted or amplified, which changes the copy number of the genes located within the CNV region.
17.2 Prerequisites
In this practice, you will conduct a basic CNV analysis using the R package copynumber, which was published in BMC Genomics in 2012. Figure 17.1 briefly describes the analysis pipeline.
[Fig. 17.1: analysis pipeline, including the diagnosis plot produced by plotGamma(...)]
In this chapter, we use array CGH data consisting of 21 biopsy samples from eight lymphoma patients.
> # Legacy installer; recent Bioconductor releases use BiocManager::install("copynumber") instead
> source("https://www.bioconductor.org/biocLite.R")
> biocLite("copynumber")
> library(copynumber)
You can confirm two points: (1) the data consists of the chromosome, the median base pair position, and the value for each sample, and (2) the data has 3091 rows and 23 columns. The first two columns of the data indicate the chromosome and the median base pair position, respectively, and the subsequent columns hold the copy number measurements for the 21 samples.
17.3.1 Normalization
> # Check the outliers in each sample and assign the corrected data to a new variable
> lymph.wins <- winsorize(data = lymphoma, verbose = FALSE)
The winsorize function takes return.outliers as a parameter. Using this parameter, you can check the position and status of the outliers in each sample. When the result obtained with return.outliers = TRUE is assigned to the variable wins.res, wins.res holds two data frames, wins.data and wins.outliers; wins.outliers reports which samples and probes are outliers.
These commands return the positions of the outliers for each sample. A value of 1 marks a high outlier and a value of −1 marks a low outlier; a value of 0 means the observation is not an outlier.
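The idea of winsorization and of the 1 / −1 / 0 outlier codes can be illustrated with a small Python sketch. This is a deliberately simplified version that clips values against a single global median and a MAD-based spread; it is not the algorithm of the copynumber package, which works on local estimates along the genome. The function name winsorize_mad, the cutoff tau = 2.5, and the values are all assumptions made for illustration.

```python
from statistics import median

def winsorize_mad(values, tau=2.5):
    """Clip values lying more than tau robust standard deviations from the
    median and flag them: 1 = high outlier, -1 = low outlier, 0 = not an outlier."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    spread = 1.4826 * mad                 # MAD scaled to approximate a normal SD
    low, high = center - tau * spread, center + tau * spread
    clipped = [min(max(v, low), high) for v in values]
    flags = [1 if v > high else -1 if v < low else 0 for v in values]
    return clipped, flags

log_ratios = [0.0, 0.1, -0.1, 0.05, -0.05, 3.0, -2.5]    # invented values
clipped, flags = winsorize_mad(log_ratios)
print(flags)
print([round(v, 5) for v in clipped])
```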
Now, let’s practice data segmentation of the 21 samples. The ‘pcf’ function is used for segmentation of single-sample data, and the ‘multipcf’ function is used for joint segmentation of data from multiple samples.
Here is a segmentation of single sample data.
> head(single.lymphoma)
The ‘pcf’ function is run on the ‘lymph.wins’ data produced by the normalization described above, with the gamma value, the penalty parameter applied to each change in copy number level, set to 12.
The single-sample segmentation can then be plotted with the ‘plotGenome’ function (Fig. 17.2).
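The role of the penalty gamma can be seen in a small Python sketch of penalized least squares segmentation: every segment costs its squared deviation from the segment mean plus a fixed penalty gamma, and dynamic programming finds the cheapest partition. This is a simplified illustration of the principle and not the pcf implementation; in particular it ignores any scaling of the penalty that the package may apply. The function name pcf_sketch and the data are hypothetical.

```python
def pcf_sketch(values, gamma):
    """Return segments as (start, end) index pairs (end exclusive) minimizing
    the sum of within-segment squared errors plus gamma per segment."""
    n = len(values)
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:                      # prefix sums give each segment error in O(1)
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def squared_error(i, j):              # error of values[i:j] around its mean
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    best = [0.0] + [float("inf")] * n     # best[j] = minimal cost of values[:j]
    previous = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + squared_error(i, j) + gamma
            if cost < best[j]:
                best[j], previous[j] = cost, i
    segments, j = [], n
    while j > 0:                          # trace the breakpoints back from the end
        segments.append((previous[j], j))
        j = previous[j]
    return segments[::-1]

log_ratios = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0]   # invented values
print(pcf_sketch(log_ratios, gamma=0.5))
print(pcf_sketch(log_ratios, gamma=5.0))
```

With the small penalty the jump in the middle of the series is kept as a breakpoint, while the large penalty merges everything into a single segment.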
> multi.seg <- multipcf(data = lymph.wins, verbose = FALSE) # Use ‘multipcf’ function
> head(multi.seg)
Using the ‘multipcf’ function, you can confirm that a new data set is generated that includes the chromosome arm (p or q) and probe information for each segment. You can see the distribution of the 21 samples by chromosome position. Assign a value to the gamma argument of the multipcf function. Gamma is the penalty applied for each additional segment: the higher the gamma value, the fewer segments are produced. If there is no value for gamma, you can see distribution of segmentation at an interval of 10, which is the division of 100 by gamma value = 10. In addition, by assigning chromosomes 1–23 to the chromosome argument, you can draw a plot showing the distribution of the samples by chromosome (Fig. 17.3).
17.4 · Genomic Alteration Detection
[Figure: genome plot for sample X01.B2; y-axis Log R, x-axis chromosomes 1–23]
Now, let’s draw a plot of allele-specific segmentation using logR and BAF data. Generally, allele-specific segmentation cannot be performed with array CGH data; therefore, SNP array data is used in this analysis.
Load the data using the data command: (1) logR holds the log-scaled value of the total copy number (copy number analysis generally works with log-scaled values) and (2) B-allele frequency (BAF), computed as B/(A + B), provides the information needed to confirm an allelic imbalance in each individual patient.
The two data sets have the same structure: the first and second columns are the chromosome and probe position, and S1 and S2 indicate the samples.
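The two quantities can be illustrated numerically with a short Python sketch. BAF follows the formula B/(A + B) given above; for logR the sketch assumes the common convention of a base-2 logarithm of the observed total signal over a reference signal, which the text does not spell out. The function names and the numbers are hypothetical.

```python
import math

def b_allele_frequency(a_signal, b_signal):
    """BAF = B / (A + B): the share of the total signal carried by the B allele."""
    return b_signal / (a_signal + b_signal)

def log_r(observed_total, reference_total):
    """Assumed convention: log2 ratio of observed to reference total signal."""
    return math.log2(observed_total / reference_total)

# Balanced heterozygous probe: one A copy, one B copy, two copies in total.
print(b_allele_frequency(1.0, 1.0), log_r(2.0, 2.0))
# Gain of one B copy: BAF moves away from 0.5 and logR becomes positive.
print(round(b_allele_frequency(1.0, 2.0), 3), round(log_r(3.0, 2.0), 3))
```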
> data(BAF)
> head(BAF)
You perform the data normalization process and use the aspcf function for allele specific
segmentation.
304 Chapter 17 · CNV Analysis
[Figure: Log R segmentation plots on chromosome 1 for samples X01.B1, X01.B2, and X01.B3]
> head(allele.seg)
Likewise, use ‘plotAllele’ to draw an allele-specific plot that presents the logR of the chromosomes of an individual patient together with the segmentation and the BAF information (Fig. 17.4).
> plotAllele(logR.wins, BAF, allele.seg, sample = 1, chrom = c(1:4), layout = c(2, 2))
# Set the option to draw multiple plots for four chromosomes (1, 2, 3, and 4) in a page.
17.4.2 Identification of Gain and Loss at Genomic Position
Using the same data as above, practice plotting a frequency graph by designating gain and loss conditions. At this point, it is necessary to designate the thresholds for gain and loss. In this practice, set the threshold for gain to 0.2 and the threshold for loss to −0.1 (Fig. 17.5).
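What the frequency plot shows at one genomic position can be written out in a few lines of Python: the percentage of samples whose segment value lies above the gain threshold and the percentage lying below the loss threshold. This is a sketch on invented values using the thresholds of this practice; the function name aberration_frequency is hypothetical and the code is not part of the copynumber package.

```python
def aberration_frequency(segment_values, gain=0.2, loss=-0.1):
    """Percentage of samples with a gain (value > gain) and with a loss
    (value < loss) at one genomic position."""
    n = len(segment_values)
    percent_gain = 100 * sum(1 for v in segment_values if v > gain) / n
    percent_loss = 100 * sum(1 for v in segment_values if v < loss) / n
    return percent_gain, percent_loss

# Invented segment values of four samples at the same position.
values_at_position = [0.3, 0.05, -0.2, -0.05]
print(aberration_frequency(values_at_position))
```

Values between the two thresholds, such as 0.05 and −0.05 here, count as neither gain nor loss.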
.. Fig. 17.4 Four allele specific segmentation plots in four chromosomes, for one individual (sample S1; panels for chromosomes 1–4, each showing logR and BAF)
.. Fig. 17.5 Frequency plots with gain value of 0.2 and loss value of −0.1 (y-axis: % with gain or loss; x-axis: chromosomes 1–23)
The variables described above are used to set the default values of the lines that show correlation in the circle plot, including the start and end positions on the chromosomes and the color; plotCircle is then used to visualize the data with the default parameters (Fig. 17.6).
Apply this to all chromosomes. Set a threshold for the correlation and draw correlation graphs showing any correlations among chromosomes.
.. Fig. 17.6 Circle plot showing correlation of copy number aberration within chromosomes (chromosomes 1–22, X, and Y arranged around the circle)
for (j in (i + 1):nseg) {
multiseg$chrom[j]) {
}else{
After running the code, we can see the orange line for the positive correlation and the blue
line for the negative correlation as shown below (. Fig. 17.7).
17.5 Visualization
This heatmap shows the tendency of copy number alteration in the 21 samples. Draw an aberration plot using the gain and loss values set above (Fig. 17.9).
[Figure: circle plot with chromosomes 1–22, X, and Y arranged around the circle]
In the plot shown above, define an interesting region and extract genes belonging in this
region (. Fig. 17.10).
14, ]$mean)
(lymphoma.res$mean == minimum.mean), ]
-0.1, chrom=c(14))
[Figure: heatmaps of copy number for the 21 samples (rows X01.B1 to X09.B3) across chromosomes 1–23, drawn with limits = [−0.3, 0.3] and limits = [−0.2, 0.2]]
.. Fig. 17.9 Aberration plot using predefined loss and gain threshold
17.6 · Obtaining Genomic Regions
.. Fig. 17.10 A plot showing the region that had the lowest copy number value in chr14 (thresholds = [−0.1, 0.2]; y-axis: % with gain or loss)
For example, the above code draws a plot of the region that has the lowest copy number value in chromosome 14, which was plotted in Chap. 5. The result is as follows.
> min.14.lympoma.res
    sampleID chrom arm start.pos   end.pos n.probes   mean
879   X06.B1    14   q 102399698 106227119        5 -0.406
For example, the TRAF3 gene in this region is translated into a key protein in the lymphotoxin-beta receptor signaling complex.
Bibliography
1. Komura D, Shen F, Ishikawa S et al (2006) Genome-wide detection of human copy number variations
using high density DNA oligonucleotide arrays. Genome Res 16(12):1575–1584
2. Nilsen G, Liestol K, Lingjaerde OC (2013) Copynumber: segmentation of single- and multi-track copy
number data by penalized least squares regression. R package version 1.14.0
3. UCSC Genome Browser – https://genome.ucsc.edu
Part V Metagenome and Epigenome, Basic Data Analysis
Metagenome and Epigenome Data Analysis
18.1 Metagenome
Microorganisms live in and adapt to a wide range of environments, such as sea water 1000 m deep or hypoxic high altitudes. They are considered essential components for the normal functioning of Earth’s ecosystems. Microorganisms were first reported by the Dutch scientist Leeuwenhoek in 1683, who observed the shape of bacteria with a microscope for the first time. Since then, research with pure cultures was carried out by the German scientist Schrader.
Previously, microbiologists could only study microorganisms that can be cultivated in the lab, since research was impossible without cultures. The known microorganism species are therefore estimated to be only the tip of the iceberg compared with the total number of unknown species. In nature, microorganisms live by colonizing various habitats and interacting with the same or different species, but a pure culture does not mimic this environment. Thus, a different, macroscopic approach was required in order to properly understand the variety of microorganisms existing in the natural world and their ecology.
Metagenome research has emerged as an innovative method to overcome this fundamental limitation of microorganism research. The approach consists of extracting the total DNA, the genetic material of the microorganisms existing in a natural environment, constructing a library, and investigating new genetic material using NGS techniques. In this case, the bacterial origin of a given piece of genetic material cannot be identified directly, but all genetic resources can be obtained and the genome sequences of the source bacteria can be reconstructed with the latest bioinformatics techniques. The metagenome is defined as the set of genomes of all microorganisms existing in a specific environment, and metagenomics is the field that studies this collective genome to understand the functional and structural diversity of the microorganisms in the natural world. Metagenomic research samples can be obtained from any natural environment, such as soil, oceans, rivers, and swamps, and also from the various organs of animals and humans.
The first metagenomic study is represented by the work of Schmidt in 1991, in which 10 Kb DNA fragments from the genomes of the microorganisms living in the Pacific Ocean were cloned into E. coli and constructed into a library. NGS technology used in metagenomic analysis constructs a library by reading longer sequences, rather than reading short sequences and mapping them to a reference genome. This makes it possible to preserve the gene sequences of various microorganisms without information about each species’ genome.
Figure 18.1 schematizes the generalized process of metagenome research. In the first step, sample collection, enough sample should be collected from a place appropriate to the purpose. At present, metadata is required for metagenomics; it has to describe in detail the collecting environment, information that may be needed when a useful candidate material is discovered through the metagenomic analysis. The second step, DNA/RNA extraction, should eliminate the impurities from the obtained sample. This step uses either a chemical method, in which various enzymes eliminate the impurities, or a physical method. The physical method is more useful for actual metagenomic analysis, but it has the disadvantage of fragmenting the genome.
.. Fig. 18.1 Metagenome research process: sample collecting, DNA/RNA extraction, library construction, screening
RNA extraction is harder than DNA extraction due to the structural instability of RNA. The next step is library construction, which builds a library by amplification in various cloning vectors such as bacteriophage, cosmid, fosmid, and BAC. Metagenomics creates libraries with large inserts, which is advantageous for producing novel products through phylogenetic analysis or antibiotic selection. The final screening step varies depending on whether a functional metagenomic analysis or a base sequence analysis is performed.
Metagenomics is a field in which technical advances have appeared earlier than a complete scientific understanding of the field. It has a shorter history than classical molecular biology. Craig Venter, who is well known for the Human Genome Project, initiated an advanced research concept called ‘Ecosystem Sequencing.’ It applies whole genome shotgun sequencing to a metagenomic library of samples collected from the waters around Bermuda in order to determine hundreds of millions of bases of sequence. The report, entitled “Environmental Genome Shotgun Sequencing of the Sargasso Sea,” was published in Science in 2004. It reported that 1800 species of microorganisms had been identified, including more than 150 new species, and that more than 1.2 million novel genes had been discovered.
Brady, Gillespie, Diaz-Torres, Beja, and others have extracted various new antibiotics
through metagenome research of the ocean, soil, oral cavity, and intestine. They have
also discovered new genes conferring resistance to specific antibiotics, and several
useful enzyme resources are now at the stage of being applied in the medical industry.
Turnbaugh et al. have claimed that obesity is related to the various groups of
microorganisms that coexist in the intestine, indicating that obesity originates not
only from an inherent genetic predisposition but also from infection by various
microorganisms that exist around us. Such obesity-related microorganisms may be more
likely to spread between people who often come into contact with them. This would
eventually increase the probability that obesity-related microorganisms live in the
intestines of an individual and his/her family members. Through this process, evidence
indicating that specific microorganisms cause obesity can be discovered. Furthermore,
Barabasi claimed that obesity is propagated through network structures at various levels
in the article entitled “Network Medicine–From Obesity to the Diseasome,” published in
NEJM in 2007.
18.2 Epigenome
Epigenome is a compound of ‘epi-’, which means ‘above’ in Greek, and ‘genome’. In other
words, it means ‘something beyond the genome’: without changing the DNA base sequence
itself, it can alter genomic expression patterns through DNA modification or chromatin
reconstruction. This property can be passed on to the next generation. The epigenome
means that the DNA and genes inherited from one’s parents at birth can be expressed
differently depending on the environment and lifestyle. Even in identical twins, the
same DNA can be expressed differently depending on the environment. The function of the
epigenome is to regulate gene expression. There are three types of epigenetic
modifications affecting gene expression: (1) DNA methylation; (2) histone modification;
and (3) non-coding RNAs. The first two are known as molecular-based mechanisms, and the
last one is known as a transcriptional regulatory mechanism. These mechanisms can affect
the whole individual, but they can also affect a specific group of cells. Therefore, the
epigenome is one of the major interests in the study of various diseases, including
cancer.
Unlike a genetic mutation, which is caused by DNA sequence alteration, an epigenetic
alteration does not change the sequence and is affected more by the environment. The
epigenetic profiles of twins who were raised in the same environment showed more
similarity than those of twins raised in different environments. In another example, the
DNA methylation status of identical twins who were around 3 years old showed very
similar patterns, but the status of 50-year-old twins was very different. Such
epigenetic changes can lead to differential gene expression over one’s life span as one
gets older, even with the same genetic information. This might explain how many
environmental factors affect genes.
DNA methylation is the attachment of a methyl group to the DNA itself at the C5 position
of the pyrimidine ring of cytosine. This reaction occurs where guanine follows cytosine
(CpG); the methyl group is donated by S-adenosyl-methionine, and the reaction is mediated
by DNA methyltransferase (DNMT). Mammals show a high tendency for cytosine methylation:
3–5% of cytosines exist as 5-methylcytosine, and around 70% of these lie in CpG regions.
DNA methylation is known to specifically inhibit gene expression and is involved in the
formation of heterochromatin, which is not transcribed because the DNA strands are
condensed, in X chromosome inactivation, and in genomic imprinting. The methylation rate
and pattern of CpGs vary across mammalian species and tissues. Most CpGs are known to be
methylated (60–90%). However, mammalian genomes contain CpG islands, regions where CpGs
are concentrated (typically about 300–3000 base pairs in length and located in promoter
regions), and CpG islands have been shown to be less methylated. If the CpG islands in
the promoter region of a specific gene are methylated, the binding of transcription
factors to the promoter region is inhibited, resulting in repressed expression of that
gene. Methylation in a promoter region is a well-known epigenetic mechanism for
regulating gene expression. Three mammalian methyltransferases are known. DNMT1 is
involved in maintaining the methyl groups on DNA when DNA is synthesized during cell
division. DNMT3a and DNMT3b catalyze new (de novo) methylation of DNA. In addition,
methylated DNA allows proteins with a methyl-CpG-binding domain (MBD) to bind to HDAC (a
chromatin modification protein leading to chromatin condensation), which leads to
transcriptional repression by blocking the binding of transcription factors to DNA
sequences.
Abbreviations: DNMT DNA methyltransferase, MeCP methyl-CpG binding protein, MBD
methyl-CpG binding domain, HAT histone acetyltransferase, HDAC histone deacetylase, HMT
histone methyltransferase, HDM histone demethylase
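The notion of a CpG island as a region where CpGs are concentrated can be made concrete with a small calculation. The following base R sketch computes the GC fraction and the observed/expected CpG ratio of a sequence. The helper name `cpg_stats` is hypothetical, and the thresholds mentioned in the comments (GC fraction above 0.5, ratio above 0.6) are commonly quoted criteria that are assumed here rather than taken from this chapter.

```r
# Illustrative helper (hypothetical name): GC fraction and observed/expected
# CpG ratio of a DNA string. Commonly quoted CpG-island criteria (an assumption,
# not stated in this chapter) are GC fraction > 0.5 and obs/exp ratio > 0.6.
cpg_stats <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  n <- length(bases)
  n_c <- sum(bases == "C")
  n_g <- sum(bases == "G")
  # Count positions where a C is immediately followed by a G
  n_cg <- sum(bases[-n] == "C" & bases[-1] == "G")
  list(
    gc = (n_c + n_g) / n,
    obs_exp = if (n_c > 0 && n_g > 0) n_cg * n / (n_c * n_g) else NA_real_
  )
}

island_like <- cpg_stats("CGCGCGCG")  # toy sequence made only of CpG dinucleotides
at_rich     <- cpg_stats("ATATATAT")  # toy sequence without any C or G
```

In practice such statistics would be computed in sliding windows along a promoter region; the single-string version above only illustrates the definition.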
DNA methylation is involved in genomic imprinting, the phenomenon in which a gene is
expressed from only one of the two chromosomes inherited from the parents. In the early
stages after fertilization of the egg, the CpG islands of the corresponding gene are
selectively methylated to repress its expression. Methylation of CpG islands occurs and
is regulated in a tissue-specific manner. In cancer cells, genome-wide DNA methylation
is generally decreased, but some specific promoter regions are highly methylated. For
example, some CpG islands are highly methylated in various cancers, including
colorectal, lung, and breast cancers. Such abnormal methylation can occur at the
beginning of cancer development; therefore, many studies have focused on marker
discovery for early diagnosis (Table 18.1).
Chromatin is the structural unit of a eukaryotic chromosome. The total DNA of the 46
chromosomes in the nucleus of a human somatic cell is very long (2 m). DNA exists as a
highly condensed nuclear chromatin structure with a size of 10–100 in diameter. DNA is
condensed into nucleosomes, the base units of chromatin. A nucleosome consists of a core
particle wrapped with a DNA strand of 146 base pairs, and the core particle is composed
of two copies each of H2A, H2B, H3, and H4. In a nucleosome, the N-terminal tails of the
histone proteins have been shown to be modified. This is called histone modification,
and the following types of modification have been discovered to date:
• Acetylation
• Methylation of lysine and arginine
• Ubiquitination
• Phosphorylation
• Sumoylation
• ADP-ribosylation
• Deamination, proline isomerization
Among transcribed RNAs, several are not translated into proteins: (1) tRNA (transfer
RNA); (2) rRNA (ribosomal RNA); (3) miRNA (microRNA); and (4) siRNA (short-interfering
RNA). These RNAs are called non-coding RNAs (ncRNAs), and some ncRNAs have been shown to
control gene expression. miRNA, which consists of around 22 bases, represses its target
mRNA. In animals, miRNA binds to the 3’ UTR of the mRNA and inhibits gene expression by
degrading the mRNA or inhibiting its translation into protein. miRNA has been shown to
be involved in ontogenesis, apoptosis, proliferation, hematopoiesis, insulin secretion,
and immune responses.
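The binding of a miRNA to the 3’ UTR of its target can be illustrated with a simple sequence search. The base R sketch below assumes the widely used "seed" model, in which nucleotides 2 to 8 of the miRNA pair with a complementary stretch of the UTR; this model, the function names, and the toy UTR are assumptions introduced for illustration and are not described in this chapter. Real target prediction tools use considerably more evidence than an exact seed match.

```r
# Reverse complement of an RNA string (Watson-Crick pairing only)
rev_comp_rna <- function(x) {
  comp <- chartr("ACGU", "UGCA", toupper(x))
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Start positions in the UTR that are exactly complementary to miRNA
# nucleotides 2-8 (the assumed "seed" region)
seed_sites <- function(mirna, utr) {
  seed <- substr(toupper(mirna), 2, 8)
  target <- rev_comp_rna(seed)
  hits <- gregexpr(target, toupper(utr), fixed = TRUE)[[1]]
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

mirna <- "UGAGGUAGUAGGUUGUAUAGUU"  # 22-nt let-7-like sequence, for illustration only
utr   <- "AAACUACCUCAGGG"          # toy 3' UTR containing one complementary site
sites <- seed_sites(mirna, utr)
```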
• MethylomeDB (http://www.neuroepigenomics.org/methylomedb)
  A collection of NGS data obtained with DNA bisulfite conversion. It includes
  high-quality chromosome methylation information from various tissues, pathological
  conditions, and species.
• MethBase (http://smithlabresearch.org/software/methbase)
• 4DGenome (http://4dgenome.research.chop.edu)
• HHMD (…hhmd)
  A collection of human histone modification data obtained by experimental methods.
• PeakSeq (http://info.gersteinlab.org/PeakSeq)
  A program to search for peak regions in ChIP-Seq data and rank them. The input is the
  sequence fragments mapped by ChIP-Seq, and the output is a list of results ranked by
  Q-value.
• ChIP-Seq Analysis Server (http://ccg.vital-it.ch/chipseq)
• miRBase (http://www.mirbase.org)
  One of the most widely used resources. It provides currently known miRNA sequences
  and annotation.
• TarBase (http://www.microrna.gr/tarbase)
• lncRNAdb
  Database of eukaryotic long non-coding RNAs. It provides a text search tool and a
  BLAST search tool.
Currently, more than 20 different techniques exist for DNA methylation detection. DNA
methylation profiling techniques are categorized by three methods of detecting
methylation status:
• Bisulfite conversion
• DNA digestion with restriction enzymes sensitive to methylation status
• Capture of methylated DNA fragments with recombinant MBD (methyl-DNA binding protein
  domain) or a monoclonal anti-5-methylcytosine antibody
Recently, Bisulfite-seq and MeDIP-seq, which use NGS techniques, have been widely used
to analyze genome-wide DNA methylation.
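For the bisulfite conversion approach, the methylation level at a cytosine is usually summarized from read counts: bisulfite treatment converts unmethylated cytosine so that it is read as T, while methylated cytosine is still read as C. The base R sketch below computes the fraction of reads supporting methylation at each position. The function name `meth_level`, the column names, and the counts are made up for illustration.

```r
# Fraction of reads supporting methylation at each cytosine position:
# n_c = reads still showing C (methylated), n_t = reads showing T (converted).
meth_level <- function(n_c, n_t) {
  depth <- n_c + n_t
  ifelse(depth > 0, n_c / depth, NA_real_)  # undefined where there is no coverage
}

# Hypothetical read counts at three CpG positions
calls <- data.frame(pos = c(101, 150, 200),
                    n_c = c(18, 2, 0),
                    n_t = c(2, 18, 0))
calls$beta <- meth_level(calls$n_c, calls$n_t)
```

Positions without coverage are returned as NA rather than 0, so that missing data are not mistaken for unmethylated sites.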
The ChIP-Seq technique is used to analyze histone modifications. ChIP-Seq combines
chromatin immunoprecipitation (ChIP) with sequencing. ChIP-Seq is generally used to
search for transcription factor-binding sites (TFBS) and has recently been used for
epigenomic profile analysis. ChIP-Seq can be used to find short sequence fragments that
map to a specific region of the genome and to evaluate how strongly these fragments are
enriched compared to other regions.
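The idea of evaluating how strongly fragments are enriched in a region can be sketched with a simple count model. The base R example below assumes that reads have already been counted in fixed windows and that the expected background count per window (`lambda_bg`) is known; it then computes fold enrichment, a Poisson upper-tail p-value, and a Benjamini-Hochberg adjusted value for ranking. This is a deliberately simplified stand-in and not the scoring procedure of PeakSeq or any other specific tool; the function name and the counts are hypothetical.

```r
# Simplified enrichment scoring for windowed read counts (illustrative only).
# lambda_bg is an assumed, externally supplied background rate per window.
chip_enrichment <- function(counts, lambda_bg) {
  # P(X >= count) under a Poisson background with mean lambda_bg
  p <- ppois(counts - 1, lambda = lambda_bg, lower.tail = FALSE)
  data.frame(count = counts,
             fold  = counts / lambda_bg,
             p     = p,
             q     = p.adjust(p, method = "BH"))
}

window_counts <- c(3, 5, 40, 4, 25, 2)  # hypothetical reads per window
res_chip <- chip_enrichment(window_counts, lambda_bg = 4)
peaks <- which(res_chip$q < 0.05)       # windows called enriched
```

Sorting `res_chip` by the `q` column gives a ranking of candidate regions in the same spirit as a Q-value ranked peak list.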
Bibliography
1. Amaral PP et al (2011) lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res
39(Database issue):D146–D151
2. Berger SL et al (2009) An operational definition of epigenetics. Genes Dev 23(7):781–783
3. Goldberg AD et al (2007) Epigenetics: a landscape takes shape. Cell 128(4):635–638
4. Hackenberg M et al (2011) NGSmethDB: a database for next-generation sequencing single-cytosine-
resolution DNA methylation data. Nucleic Acids Res 39(Database issue):D75–D79
5. Huang WY et al (2015) MethHC: a database of DNA methylation and gene expression in human can-
cer. Nucleic Acids Res 43(Database issue):D856–D861
6. Kharchenko PV et al (2008) Design and analysis of chIP experiments for DNA binding proteins. Nat
18 Biotechnol 26:1351–1359. Park PJ (2009). ChIP-seq: advantages and challenges of a maturing technol-
ogy. Nat Rev Genet 10(10):669–680
7. Schones DE, Zhao K (2008) Genome-wide approaches to studying chromatin modifications. Nat Rev
Genet 9(3):179–191
8. Ongenaert M et al (2008) PubMeth: a cancer methylation database combining text-mining and expert
annotation. Nucleic Acids Res 36(Database issue):D842–D846 Epub 2007 Oct 11
9. Rozowsky J et al (2009) PeakSeq enables systematic scoring of ChIP-seq experiments relative to con-
trols. Nat Biotechnol 27(1):66–75
Bibliography
323 18
10. Sethupathy P et al (2006) TarBase: a comprehensive database of experimentally supported animal
microRNA targets. RNA 12(2):192–197 Epub 2005 Dec 22
11. Song Q et al (2013) A reference methylome database and analysis pipeline to facilitate integrative and
comparative epigenomics. PLoS One 8(12):e81148
12. Teng L et al (2015) 4DGenome: a comprehensive database of chromatin interactions. Bioinformatics
31(15):2560–2564
13. Ward LD, Kellis M (2012) HaploReg: a resource for exploring chromatin states, conservation, and regu-
latory motif alterations within sets of genetically linked variants. Nucleic Acids Res 40(Database
issue):D930–D934
14. Xin Y et al (2012) MethylomeDB: a database of DNA methylation profiles of the brain. Nucleic Acids
Res 40(Database issue):D1245–D1249
15. Yun X et al (2016) 3CDB: a manually curated database of chromosome conformation capture data.
Database (Oxford). 2016. pii:baw044. https://doi.org/10.1093/database/baw044. Print 2016
16. Zhang Y et al (2010) HHMD: the human histone modification database. Nucleic Acids Res 38(Database
issue):D149–D154
19.1 Introduction
A metagenome is the set of genomes of all microorganisms that exist in a certain
environment. NGS techniques have recently advanced the metagenome field. In this
chapter, we will therefore use metagenomeSeq, one of the major metagenome analysis
tools, which provides functional analysis of microbial genomes. The chapter covers the
basic use of metagenomeSeq (an R package) and gene-based metagenome analysis using the
mouseData dataset provided in the R package.
19.2 Prerequisites
> source("http://www.bioconductor.org/biocLite.R") # Legacy installer; recent Bioconductor releases use BiocManager::install() instead
> biocLite("interactiveDisplay")
> biocLite("vegan")
> biocLite("metagenomeSeq")
Figure 19.1 shows typical software developed for metagenome analysis. Among these,
MEGAN (http://ab.inf.uni-tuebingen.de/software/megan6/), a standalone program, maps the
sequence fragments resulting from metagenome sequencing to the NCBI taxonomy database
and performs phylogenetic analysis, diversity assessment, and functional analysis.
MG-RAST (http://metagenomics.anl.gov), a web-based analysis tool, analyzes and
visualizes raw data in FASTQ format. R packages for metagenomic analysis also exist.
Web-based or standalone software provides various functions and an easier user
interface, but it has limitations for some types of analyses. In contrast, an R package
allows various analytical approaches, but it requires a certain level of programming
skill.
19.4 Metagenome Analysis
19.4.1 Dataset
We will use the dataset published in Science Translational Medicine in 2009 in the
article entitled “The effect of diet on the human gut microbiome: a metagenomic analysis
in humanized gnotobiotic mice.” This dataset includes metagenomes of the intestinal
microorganisms obtained from 12 germ-free adult male mice treated under two conditions:
(1) six mice on a low-fat, plant-polysaccharide-rich diet and (2) six mice on a
high-fat, high-sugar diet.
In this exercise, we will learn the basic usage of metagenomeSeq as a preprocessing
technique for metagenome data.
1. Import a library
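> library(metagenomeSeq)
> library(vegan)
> library(interactiveDisplay)
> ls("package:metagenomeSeq") # List the objects exported by metagenomeSeq
> data(mouseData) # Load the example dataset used below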
Using the command lines below, print the phenotypic data and feature data of mouseData
and examine their properties. featureData (or phenoData) returns an object containing
information on both variable values and variable meta-data, whereas fData (or pData)
returns a data frame with features (or samples) as rows and variables as columns. You
can use the fvarLabels (or varLabels) function to return a character vector of the
measured variable names. mouseData is an MRexperiment object, a modified eSet object
for data from high-throughput sequencing experiments.
> head(pData(mouseData), 3) # Return the top three rows of the phenotype data
> head(fData(mouseData)[ , -c(1, 7)], 3) # Return the top three rows of the feature data, excluding the first and seventh columns
19.4.3 Statistical Testing
> head(MRcounts(mouseData[, 1:2])) # Return the first six rows (features) of the count matrix for the first and second samples
> filterData(mouseData, present = 10, depth = 1000) # Keep features present in at least 10 samples and samples with a depth of at least 1000
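The effect of the two filtering thresholds can be seen on a small count matrix. The base R sketch below is an illustration of the idea only, written as an assumption about what such a filter does rather than as the package's implementation: samples (columns) are kept if their total count reaches `depth`, and features (rows) are kept if they are detected in at least `present` of the remaining samples. The function name `filter_counts` and the toy matrix are hypothetical.

```r
# Illustrative filter on a features x samples count matrix (not the package code)
filter_counts <- function(counts, present = 1, depth = 1) {
  keep_samples <- colSums(counts) >= depth            # enough total reads
  counts <- counts[, keep_samples, drop = FALSE]
  keep_features <- rowSums(counts > 0) >= present     # detected in enough samples
  counts[keep_features, , drop = FALSE]
}

toy_counts <- matrix(c(10, 0, 5, 1,
                        0, 0, 7, 0,
                       20, 1, 9, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("otu1", "otu2", "otu3"),
                                     c("s1", "s2", "s3", "s4")))
filtered <- filter_counts(toy_counts, present = 2, depth = 10)
```

Here samples s2 and s4 are dropped for insufficient depth, and otu2 is dropped because it is detected in only one of the remaining samples.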
For group comparisons, perform a presence-absence test using 2 × 2 contingency tables.
The fitPA function provided in metagenomeSeq calculates p-values, odds ratios, and
lower and upper confidence limits for all rows.
> classes = pData(mouseData)$diet # Group label (diet) of each sample
> res = fitPA(mouseData[1:5, ], cl = classes) # Test the first five features (rows)
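To see what a presence-absence test on a 2 × 2 contingency table involves, the following base R sketch builds the table for a single feature and applies Fisher's exact test. Using `fisher.test` here is an assumption made for illustration; the function name `presence_absence_test` and the counts are hypothetical.

```r
# Presence-absence test for one feature across two groups (illustrative sketch)
presence_absence_test <- function(counts, group) {
  group <- factor(group)
  stopifnot(nlevels(group) == 2, length(counts) == length(group))
  status <- factor(ifelse(counts > 0, "present", "absent"),
                   levels = c("present", "absent"))
  tab <- table(status, group)        # 2 x 2 contingency table
  ft <- fisher.test(tab)
  c(odds_ratio = unname(ft$estimate),
    lower      = ft$conf.int[1],
    upper      = ft$conf.int[2],
    p_value    = ft$p.value)
}

# Hypothetical counts of one feature in six BK and six Western samples
feature_counts <- c(5, 3, 0, 8, 2, 1, 0, 0, 0, 4, 0, 0)
diet <- rep(c("BK", "Western"), each = 6)
pa_res <- presence_absence_test(feature_counts, diet)
```

The feature is present in five of six BK samples but only one of six Western samples, so the estimated odds ratio is above 1, although with so few samples the evidence is weak.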
To calculate the correlation between samples or between features, use the
correlationTest function in metagenomeSeq, which calculates basic Pearson, Spearman,
and Kendall correlation statistics and the corresponding p-values.
# Calculates the (pairwise) correlation statistics and associated P-values of a matrix or the
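The pairwise correlation statistics described above can be reproduced on a small matrix with base R. The sketch below applies `cor.test` to every pair of rows and collects the correlation and p-value; it is an illustration of the calculation under the assumption that rows are features, not the package's own code. The function name `pairwise_cor` and the values are made up.

```r
# Pairwise correlation and p-value for every pair of rows of a matrix
pairwise_cor <- function(mat, method = "pearson") {
  pairs <- t(combn(nrow(mat), 2))               # one row per pair of features
  out <- apply(pairs, 1, function(ix) {
    ct <- suppressWarnings(cor.test(mat[ix[1], ], mat[ix[2], ], method = method))
    c(correlation = unname(ct$estimate), p = ct$p.value)
  })
  res <- t(out)
  rownames(res) <- paste(rownames(mat)[pairs[, 1]],
                         rownames(mat)[pairs[, 2]], sep = "-")
  res
}

# Hypothetical abundances of three features in five samples
feat <- rbind(f1 = c(1, 2, 3, 4, 5),
              f2 = c(2, 4, 6, 8, 10.5),
              f3 = c(5, 3, 4, 1, 2))
cors_pearson  <- pairwise_cor(feat, method = "pearson")
cors_spearman <- pairwise_cor(feat, method = "spearman")
```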
aggTax, which stands for aggregateByTaxonomy, aggregates the count data at each
taxonomic level using the feature information.
aggSamp, which stands for aggregateBySample, aggregates the counts of each sample using
the phenoData information.
> obj = aggSamp(mouseData, fct = 'mouseID', out = 'matrix')
Using the phenoData information, calling aggSamp on a particular phenoData column
(e.g., 'diet') will aggregate counts using the aggfun function (default rowMeans).
Possible aggfun alternatives include rowMeans and rowMedians.
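Aggregating counts by a sample-level variable amounts to applying a row-wise summary to each group of columns. The base R sketch below illustrates this on a toy matrix; the function name `agg_by_group` and the data are hypothetical, and the sketch only mirrors the idea of an `aggfun` argument rather than reproducing the package function.

```r
# Aggregate the columns of a features x samples matrix by a grouping vector
agg_by_group <- function(counts, group, aggfun = rowMeans) {
  groups <- unique(as.character(group))
  cols <- lapply(groups, function(g) aggfun(counts[, group == g, drop = FALSE]))
  do.call(cbind, setNames(cols, groups))   # one column per group
}

toy <- matrix(c(1, 3, 10, 20,
                0, 2,  4,  6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("otu1", "otu2"), c("s1", "s2", "s3", "s4")))
mouse_id <- c("PM1", "PM1", "PM2", "PM2")   # hypothetical sample-to-mouse mapping

agg_mean <- agg_by_group(toy, mouse_id)                    # default: row means
agg_sum  <- agg_by_group(toy, mouse_id, aggfun = rowSums)  # alternative summary
```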
19.5 Visualization
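metagenomeSeq provides functions for drawing various plots. Draw several plots using
mouseData. The classIndex variable is assigned the indices of 54 Western samples and 85
BK samples.
> classIndex = list(Western = which(pData(mouseData)$diet == "Western"), BK = which(pData(mouseData)$diet == "BK"))
> classIndex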
$Western
[1] 14 15 17 19 20 21 22 23 24 38 39 40 42 43 44 45 46 47 84 85 87 89 90
91 92 93 94 96 97 98 100 101 102 103 104 105 117 118 120 122 123 124 125 126
127 129 130 132 134 135 136 137 138 139
$BK
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 16 18 25 26 27 28 29 30 31 32 33 34
35 36 37 41 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 86 88 95 99 106 107 108
109 110 111 112 113 114 115 116 119 121 128 131 133
> par(mfrow = c(1, 2)) # Show two plots side by side
col = dates, sortby = dates, ylab = "Raw reads") # Divide data into two groups
This function plots the abundance of a particular OTU by class. The plot is a typical
Manhattan plot of the abundances (Fig. 19.2).
[Fig. 19.2: raw reads of the selected OTU by class; panels: Western, BK; y-axis: Raw reads]
[Fig. 19.3: heatmap with color key and histogram (Count vs. Value); rows: taxa (e.g.,
Lachnospiraceae:3796, Bacteroides:1084, Prevotella:83); columns: samples labeled by
mouse and date (e.g., PM12:20080128)]
[Fig. 19.4: correlation heatmap with color key and histogram (Count vs. Value, –1 to 1);
rows and columns: taxa (e.g., Lachnospiraceae:3663, Bacteroides:868)]
Calculate the correlations and express them as a heatmap using the colors set in the
heatmapCols variable above. The higher the correlation value, the darker the blue, and
the lower the correlation value, the darker the red (Fig. 19.4).
> plotCorr(obj = mouseData, n = 200, cexRow = 0.25, cexCol = 0.25, trace = "none",
Fig. 19.5 Plot of either PCA or MDS coordinates for the distances (x-axis: MDS component 1; y-axis: MDS component 2)
The plotOrd function plots the PCA or MDS coordinates of the samples, colored here by diet.
The distances between samples in this plot can reveal how the two diet groups relate to each other (Fig. 19.5).
> plotOrd(mouseData, tran = TRUE, usePCA = FALSE, useDist = TRUE, bg = cl, pch = 21)
Fig. 19.6 The number of observed features vs. the depth of coverage following diet values (y-axis: Number of detected features; legend: Diet1-BK, Diet2-Western)
plotRare plots the number of observed features against the depth of coverage for each sample (Fig. 19.6).
> res = plotRare(mouseData, cl = cl, pch = 21, bg = cl) # This function plots the number
abline(tmp[[i]], col = i)
> legend("topright", c("Diet1 – BK", "Diet2 – Western"), text.col = c(1, 2), box.col = NA)
Fig. 19.7 A web browser opens where the graph can be examined interactively by adjusting the
available options
Shiny is an R framework for building interactive web applications, with reactive bindings
between inputs and outputs. All of the plots generated above can be explored interactively:
the metagenomeSeq and interactiveDisplay R packages provide a Shiny-based visualization
tool (Fig. 19.7).
Exercises
[Exercise 1] - Perform the same analysis with the lungData provided in metagenomeSeq.
[Exercise 2] - Understand the characteristics of the data used in the analysis.
[Exercise 3] - Apply the fitDO function to perform a discovery odds ratio test.
[Exercise 4] - Find a combination of samples with a high correlation value.
[Exercise 5] - Draw plots with several display options (mouseData).
Bibliography
1. DeLong EF et al (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311(5760):496–503
2. Huson DH et al (2007) MEGAN analysis of metagenomic data. Genome Res 17(3):377–386. Epub 2007 Jan 25
3. Meyer F et al (2008) The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma 9:386. https://doi.org/10.1186/1471-2105-9-386
4. Paulson JN et al (2013) Differential abundance analysis for microbial marker-gene surveys. Nat Methods 10(12):1200–1202. https://doi.org/10.1038/nmeth.2658. Epub 2013 Sep 29
5. Paulson JN, Talukder H, Pop M, Bravo HC metagenomeSeq: statistical analysis for sparse high-throughput sequencing. Bioconductor package: 1.16.0. http://cbcb.umd.edu/software/metagenomeSeq
6. Turnbaugh PJ et al (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444(7122):1027–1031
7. Turnbaugh PJ et al (2009) The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med 1(6):6ra14. https://doi.org/10.1126/scitranslmed.3000322
Epigenome Database
and Analysis Tools
20.1 Introduction
20.2 Prerequisites
> source("https://bioconductor.org/biocLite.R")
> biocLite("bsseq")
> biocLite("bsseqData")
> library(bsseq)
> library(bsseqData)
dbEM is a database that has information relating to epigenetic modifiers. It includes data
such as mutations, CNVs, and gene expression in thousands of tumor samples, cancer cell
lines, and healthy individuals obtained from COSMIC, CCLE, and the 1000 Genomes
Project. It includes both cancerous and normal genomes, and it aims to study the roles of
epigenetic proteins in oncogenesis and cancer drug resistance.
20.3 · Epigenome Database
Both of the sequences that we input are predicted to be cancer sensitive. The prediction is
based on the HMM score from EMBL-EBI HMMER. If the original HMM score is
higher, the sequence is considered a “Normal Variant”; if the mutant HMM score is higher,
it is predicted to be “Cancer Sensitive.”
20.3.1.2 Browse
From the Browse menu, you can view all data in a given database. (1) The Epigenetic
Modifiers database is sorted by each protein, and shows annotated IDs from Uniprot,
Homologues, PDB, and PubChem. (2) Selecting Chromosomes lists epigenetic proteins
by their position on each of the 24 human chromosomes. (3) The Frequency of Mutation
page shows the number of cancer mutant and normal variants per gene, referenced from
CCLE, COSMIC, and the 1000 Genomes Project. (4) Browsing by Genomic Features
allows searching for epigenetic proteins based on Mutation Frequency, Expression level,
and CNV copy number change. (5) Finally, under Drugs/Inhibitors, it shows a table of
drugs affecting epigenetic modifiers, listed by DrugBank or PubChem ID (Fig. 20.2).
EpiFactor is a database of epigenetic factors, their genes, and their products. In
this database, an “epigenetic factor” is a protein that causes chromatin remodeling.
These can include: (1) proteins that act upon post-translational modifications of histones
(histone modification read, write, and erase); (2) proteins that move, eject, or restructure
nucleosomes (ATP-dependent chromatin remodelers); and (3) proteins that incorporate
histone variants into the nucleosomes.
The Epifactors menu has options for viewing Genes, Complexes, Histones and
Protamines, Expression, and Docs and Downloads. Let’s check Genes first. This page
shows information for 815 genes, including annotations for the HGNC symbol, HGNC
name, UniProt ID, Pfam domains ID, UniProt mouse ID, gene function, modification,
associated protein complexes, target entity, and Product (Fig. 20.3).
Let’s search for a gene of interest. For the demonstration, let’s put “DNMT3A” in the
search field under “HGNC approved Symbol.” When the search results come up, click
‘details’ to go to the detail screen. Here, when you click on any database ID, you will go to
the information page associated with that database (Fig. 20.4).
If you click on the Complexes, Histones and Protamines, or Expression menu, you can
perform a similar simple query. EpiFactor includes information for 69 Protein Complexes
and 95 Histones and Protamines (Fig. 20.5).
Fig. 20.4 Search results for the HGNC symbol ‘DNMT3A’ in EpiFactor
The Expression menu includes 255 cell lines, 12 fractionations, 458 primary cells, 29
time courses, and 135 tissue types. After performing a query for ‘brain’ samples, the result
shows 3 primary cell and 3 tissue samples. The links in each row lead to pages with expression
results for the relevant genes (Fig. 20.6).
MethylomeDB provides human and mouse DNA methylation profiles. Methylation profiles
include over 80% of CpG dinucleotides in the brains of humans and mice at single CpG resolution.
Profiles are generated through Methylation Mapping Analysis by Paired-end
Sequencing (Methyl-MAPS), and analyzed with the Methyl-Analyzer software package. In
this database, the integrated genome browser, an edited version of the UCSC genome browser,
is used to search for DNA methylation profiles at specific genomic loci, to search for specific
methylation patterns, and to compare methylation patterns between samples.
MethylomeDB provides Browse/Search and Download functions. Clicking on
“Browse” takes you to the MethylomeDB Browser. It uses the UCSC Genome Browser to
access the Brain Methylation Database, and you can query through either the Genome
Browser or the Table Browser interfaces. From the Download page, you can download
methylation data for each sample. Data are arranged by sample, tissue, organism, gender,
age, and PMI (hour), with links on the right for downloading (Fig. 20.7).
[Figure: NGSmethDB overview. Labels: short-read datasets from WGBS projects; MethFlow pipeline; NGSmethDB (MongoDB); standalone client; NGSmethDB API client; VM]
Bsseq is an R package that uses the BSmooth algorithm to analyze and visualize whole-
genome shotgun bisulfite sequencing (WGBS) datasets. The BSmooth algorithm uses
smoothing to obtain reliable semi-local methylation estimates in low-coverage regions.
After smoothing, BSmooth estimates biological variation and confirms differentially
methylated regions (DMRs) using biological replicates. We will also need to consider the
variation between individuals, which may relate to disease.
For practice, we will use the data from Hansen KD et al., (2011) Nature Genetics. This
is a primary dataset focused on chromosomes 21 and 22, which was obtained from colon
cancer patients, and it contains both normal and cancer tissue. The data consists of 50 bp
single-end reads and was generated using ABI SOLiD sequencing.
20.4.1.1 Preparation
Download, install, and load the packages bsseq and bsseqData from Bioconductor using
biocLite.R
> source("https://bioconductor.org/biocLite.R")
> biocLite("bsseq")
> biocLite("bsseqData")
> library(bsseq)
> library(bsseqData)
> data(BS.cancer.ex)
> BS.cancer.ex
An object of type 'BSseq' with
958541 methylation loci
6 samples
has not been smoothed
20.4 · Epigenome Analysis Tool
Also check the phenotypic information in the dataset; it should be composed of three
cancer tissues and three normal tissues.
> pData(BS.cancer.ex)
DataFrame with 6 rows and 2 columns
Type Pair
<character> <character>
C1 cancer pair1
C2 cancer pair2
C3 cancer pair3
N1 normal pair1
N2 normal pair2
N3 normal pair3
Execute BSmooth on the BS.cancer.ex data. The run will take around 2 min for each sam-
ple. The command option mc.cores defines the number of CPU cores to use, and can be
optimized to speed the analysis. (Please note that on Windows, the number of usable cores
is limited to one.)
Also check that the dataset BS.cancer.ex.fit is loaded. This data, composed of 958,541
methylation loci and 6 samples, has already been processed with “smoothing.”
> data(BS.cancer.ex.fit)
> BS.cancer.ex.fit
20.4.1.3 Poisson
Calculate the (log) distribution function of the Poisson distribution using the lambda
parameter.
Since CpG coverage is approximately 4x per sample, many zero-coverage CpGs would
be expected by chance, but the same CpG should not have zero coverage in all six samples.
> round(colMeans(getCoverage(BS.cancer.ex)), 1)
[1] 3.5 4.2 3.7 4.0 4.3 3.9
With the assumption that coverage genome-wide follows a Poisson distribution with a
parameter (lambda) of 4, it is expected that 0.105 of CpGs have a zero coverage in at least
one sample.
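The 0.105 figure can be reproduced with a short base R calculation. This is a minimal sketch, assuming independent Poisson coverage with lambda = 4 in each of the six samples; the variable names are illustrative and not taken from the text.

```r
# Assumption: coverage at a CpG is Poisson(lambda = 4), independently in 6 samples.
lambda <- 4
n_samples <- 6

# P(coverage = 0) in a single sample.
p_zero <- dpois(0, lambda = lambda)

# log P(coverage > 0) in a single sample, kept on the log scale.
log_p_covered <- ppois(0, lambda = lambda, lower.tail = FALSE, log.p = TRUE)

# P(at least one of the six samples has zero coverage at a given CpG).
p_any_zero <- 1 - exp(n_samples * log_p_covered)

# P(all six samples have zero coverage at the same CpG).
p_all_zero <- p_zero^n_samples

round(p_any_zero, 3)
```

Under the same assumptions, p_all_zero is tiny, which is why a CpG with zero coverage in all six samples is unlikely to arise by chance alone.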
20.4.1.4 BSmooth.tstat
Before searching for DMRs between the two tissue groups, remove low-coverage
CpGs to reduce false positive results. To do this, calculate the coverage of the
BS.cancer.ex.fit dataset, which is composed of 958,541 methylation loci and 6 samples,
and save it in the variable BS.cov.
In the BS.cancer.ex dataset, columns 1–3 are cancer tissues and columns 4–6 are normal
tissues. Before computing t-statistics with BSmooth.tstat, select the loci in BS.cov that
have a coverage of two or more in each group, using rowSums to count the qualifying
samples per locus, and save the indices of the selected loci in the variable keepLoci.ex.
Next, subset BS.cancer.ex.fit to the rows listed in keepLoci.ex and save the result back to
the variable BS.cancer.ex.fit. The number of methylation loci in the dataset decreases from
958,541 to 597,371.
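The filtering logic can be tried on a toy coverage matrix without loading bsseq. This is a sketch only: the matrix values are made up, and the rule "at least two samples per group with coverage of two or more" is an assumption about how the threshold is applied, not something the text spells out.

```r
# Toy coverage matrix: 5 loci (rows) x 6 samples (columns); values are hypothetical.
cov_toy <- matrix(
  c(4, 3, 5, 2, 6, 3,   # locus1: well covered in both groups
    0, 1, 0, 4, 5, 3,   # locus2: poorly covered in cancer
    2, 2, 0, 0, 2, 2,   # locus3: two qualifying samples per group
    5, 4, 3, 1, 0, 1,   # locus4: poorly covered in normal
    0, 0, 0, 0, 0, 0),  # locus5: no coverage at all
  ncol = 6, byrow = TRUE,
  dimnames = list(paste0("locus", 1:5), c("C1", "C2", "C3", "N1", "N2", "N3"))
)
type <- c("cancer", "cancer", "cancer", "normal", "normal", "normal")

min_cov <- 2      # assumed minimum coverage per sample
min_samples <- 2  # assumed minimum number of qualifying samples per group

# Count, per locus, the samples in each group that reach the coverage threshold.
n_ok_cancer <- rowSums(cov_toy[, type == "cancer"] >= min_cov)
n_ok_normal <- rowSums(cov_toy[, type == "normal"] >= min_cov)

# Row indices of the loci that pass in both groups.
keep_loci <- which(n_ok_cancer >= min_samples & n_ok_normal >= min_samples)
cov_kept <- cov_toy[keep_loci, , drop = FALSE]
```

Here only locus1 and locus3 survive; applying the same index vector to the smoothed object is what shrinks the dataset in the step described above.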
Assign the cancer samples as group 1 and the normal samples as group 2, then compute
t-statistics for the reduced BS.cancer.ex.fit dataset. Save the result to the variable
BS.cancer.ex.tstat.
verbose = TRUE)
Now plot the values of BS.cancer.ex.tstat. The blue line shows the uncorrected t-statistics
while the black line shows the corrected ones. The option estimate.var defines which
samples are used to estimate variability. In a cancer dataset, the normal samples are used,
since cancer tissue has higher variability than normal tissue. The option local.correct
applies a large-scale mean correction, which we use for this dataset because large-scale
methylation differences exist between cancer and normal tissues.
[Figure: density of the t-statistics, uncorrected and corrected (y-axis: Density; N = 538851, Bandwidth = 0.1625)]
20.4.1.5 plotTstat
This function displays the distribution of the t-statistics.
> plot(BS.cancer.ex.tstat)
Fig. 20.10 Plots for the top 200 individual DMRs with plotRegions (y-axis: Methylation)
Select DMRs that match the following criteria: (1) the number of affected CpGs is equal
to or greater than three, and (2) the difference in average methylation between cancer and
normal tissue is at least 0.1. Save this subset to the variable BS.cancer.ex.dmr.
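The two criteria amount to a row filter on a table of candidate regions. The sketch below uses a made-up data frame; the column names n (number of CpGs) and meanDiff (mean methylation difference) are assumptions modelled on typical DMR tables, and all values are hypothetical.

```r
# Hypothetical candidate DMRs; every value here is invented for illustration.
dmr_toy <- data.frame(
  chr      = c("chr21", "chr21", "chr22", "chr22"),
  start    = c(1000, 5000, 9000, 12000),
  end      = c(1800, 5200, 9900, 12500),
  n        = c(5, 2, 10, 3),               # number of CpGs in the region
  meanDiff = c(-0.25, 0.40, 0.05, 0.10),   # cancer minus normal
  stringsAsFactors = FALSE
)

# Keep regions with at least three CpGs and an absolute mean difference of at least 0.1.
dmr_kept <- subset(dmr_toy, n >= 3 & abs(meanDiff) >= 0.1)
```

Taking the absolute value keeps both hyper- and hypomethylated regions; the second row is dropped for having too few CpGs and the third for too small a difference.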
20.4.1.7 plotSetup
For clear visualization, we will color data according to the associated phenotypes in
BS.cancer.ex.fit. Create a column named “col” in pData, assigning “red” to cancer samples
and “blue” to normal samples.
Now plot the BS.cancer.ex.fit data. First, draw a plot for just the first row of
BS.cancer.ex.dmr (Fig. 20.10).
addRegions = BS.cancer.ex.dmr)
Next, draw plots for the first 200 rows of BS.cancer.ex.dmr. Save the plots as a PDF file named
“dmrs_200_plots.pdf,” then close the graphics device with dev.off().
> pdf(file = "dmrs_200_plots.pdf", width = 10, height = 5)
addRegions = dmr)
> dev.off()
software/methpipe/)
MethBase is a central reference methylome database built from public BS-seq datasets.
It includes hundreds of methylomes from various organisms, including
Arabidopsis, human, mouse, zebrafish, chimp, and dog. For each methylome, MethBase
provides methylation levels at individual sites, regions of allele-specific methylation,
hypo- and hyper-methylated regions, and partially methylated regions, along with
detailed metadata and summary statistics. These results
are created using the MethPipe software package, a comprehensive standalone pipeline
for analyzing WGBS and RRBS data.
PAVIS (Peak Annotation and Visualization) is a tool for annotating and visualizing ChIP-seq
and BS-seq data, and is useful for hypothesis generation and data analysis. The annotation
function reports the location of each query peak relative to genes and to other
comparison peaks in the genome, as well as the relative enrichment of peaks
in different genomic regions. The visualization function offers simultaneous viewing of
multiple peaks in the context of genomic features and nearby comparison peaks. PAVIS
takes as input peak locations produced by a peak-calling tool. Accepted
inputs include files in the UCSC BED format, the GFF3 format, and the peak files output by
most ChIP-seq data analysis tools.
Exercises
[Exercise 1] - Analyze Rhesus macaque data (Trung J et al., (2012) PNAS) using the bsseq R package.
[Exercise 2] - Check your data of interest against the MethBase database. Analyze it with MethPipe and
visualize the data through the Genome Browser.
Bibliography
1. dbEM – http://crdd.osdd.net/raghava/dbem/
2. EpiFactors – http://epifactors.autosome.ru
3. Hansen KD et al (2012) BSmooth: from whole genome bisulfite sequencing reads to differentially
methylated regions. Genome Biol 13(10):R83. https://doi.org/10.1186/gb-2012-13-10-r83
4. Hansen KD, Langmead B, Irizarry RA (2012) BSmooth: from whole genome bisulfite sequencing reads
to differentially methylated regions. Genome Biol 13(10):R83. https://doi.org/10.1186/gb-2012-13-10-r83
5. Hansen KD et al (2016) bsseqData: example whole genome bisulfite data for the bsseq package. R
package version 0.12.0
6. Huang W et al (2013) PAVIS: a tool for peak annotation and visualization. Bioinformatics 29(23):3097–3099.
https://doi.org/10.1093/bioinformatics/btt520. Epub 2013 Sep 4
7. Lebrón R et al (2017) NGSmethDB 2017: enhanced methylomes and differential methylation. Nucleic
Acids Res 45(D1):D97–D103. https://doi.org/10.1093/nar/gkw996. Epub 2016 Oct 27
8. Medvedeva YA et al (2015) EpiFactors: a comprehensive database of human epigenetic factors and
complexes. Database (Oxford) 2015:bav067. https://doi.org/10.1093/database/bav067. Print 2015
9. Plongthongkum N et al (2014) Advances in the profiling of DNA modifications: cytosine methylation
and beyond. Nat Rev Genet 15(10):647–661. https://doi.org/10.1038/nrg3772. Epub 2014 Aug 27
10. Singh Nanda J et al (2016) dbEM: A database of epigenetic modifiers curated from cancerous and
normal genomes. Sci Rep 6:19340. https://doi.org/10.1038/srep19340
11. Song Q et al (2013) A reference methylome database and analysis pipeline to facilitate integrative and
comparative epigenomics. PLoS One 8(12):e81148. https://doi.org/10.1371/journal.pone.0081148.
eCollection 2013
12. Xin Y et al (2012) MethylomeDB: a database of DNA methylation profiles of the brain. Nucleic Acids
Res 40(Database issue):D1245–D1249. https://doi.org/10.1093/nar/gkr1193. Epub 2011 Dec 2
Electronic supplementary material The online version of this chapter
(https://doi.org/10.1007/978-981-13-1942-6_21) contains supplementary material,
which is available to authorized users.
21.1 Introduction
21.2 Preparations
* Environment
• Software: methVisual Bioconductor package (R version 2.11.0 or higher required)
• Data: Sample data provided with the BiQ Analyzer program (http://biq-analyzer.bioinf.mpi-inf.mpg.de/)
• Practice: Bisulfite sequencing samples from the mouse Gm9 region (Oda et al. [1])
> source("http://bioconductor.org/biocLite.R")
> biocLite("methVisual")
> library(methVisual)
The input file for this analysis must be in FASTA format. FASTA format sequence data is
composed of (1) a line starting with “>” which contains summary information and (2)
the sequence lines that follow. For more details, refer to Chap. 11, Sect. 11.3.2. The directory
We will use the MethDataInput function to load the sample sequences into the workspace.
The MethDataInput function takes as input a text file that contains a list of file names and
paths, which are tab-delimited. Assign the loaded sequences to the methData variable.
> methData
FILE PATH
1 seq_A.fasta C:/gda/ch21/
2 seq_B.fasta C:/gda/ch21/
3 seq_C.fasta C:/gda/ch21/
4 seq_D.fasta C:/gda/ch21/
5 seq_E.fasta C:/gda/ch21/
6 seq_F.fasta C:/gda/ch21/
7 seq_G.fasta C:/gda/ch21/
8 seq_H.fasta C:/gda/ch21/
9 seq_I.fasta C:/gda/ch21/
10 seq_J.fasta C:/gda/ch21/
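A listing with this layout can be written with base R alone. The sketch below is an illustration under assumptions: the FILE and PATH column names are copied from the printed output above, the file name PathFileTab.txt is hypothetical, and whether MethDataInput expects exactly this header is not confirmed by the text.

```r
# Directory holding the FASTA files, as shown in the printed listing.
seq_dir <- "C:/gda/ch21/"

# One row per sample sequence: file name and the directory it lives in.
file_list <- data.frame(
  FILE = sprintf("seq_%s.fasta", LETTERS[1:10]),
  PATH = seq_dir,
  stringsAsFactors = FALSE
)

# Write a tab-delimited text file (hypothetical name) to a temporary location.
list_path <- file.path(tempdir(), "PathFileTab.txt")
write.table(file_list, file = list_path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout.
check <- read.delim(list_path, stringsAsFactors = FALSE)
```

Writing without quotes and without row names keeps the file to two plain tab-separated columns, which is the format the paragraph above describes.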
The reference sequence must be loaded as well. Using the selectRefSeq function, load the
“Master_Sequence.txt” file, which contains the reference sequence, into the variable refseq.
> refseq
[1]
CCCGGGATCGCTCTCCCAGCAGGTGAAGCCTCGCCATGGACCCTCCCCGTCGGGGCCCCGCGCT
G$
For methylation analysis, alignment control (AC) is the process of reducing misalignments
that can occur when sample sequences are compared with the reference sequence.
Misalignment can arise in two ways. First, a sample sequence may be reversed relative to
the reference. Second, it may be the complement or the reverse complement of the
reference. In the AC process, the Needleman-Wunsch algorithm is used to score alignments
and evaluate their reliability.
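The three orientations mentioned above are easy to generate in base R. This is a small sketch using the first ten bases of the reference sequence printed earlier; the helper names are illustrative and are not methVisual functions.

```r
# Reverse the order of the bases.
reverse_seq <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")

# Swap each base for its Watson-Crick partner.
complement_seq <- function(s) chartr("ACGT", "TGCA", s)

# Reverse complement: the opposite strand read 5' to 3'.
reverse_complement <- function(s) reverse_seq(complement_seq(s))

ref_start <- "CCCGGGATCG"  # first ten bases of the reference shown above

orientations <- c(
  forward            = ref_start,
  reversed           = reverse_seq(ref_start),
  complement         = complement_seq(ref_start),
  reverse_complement = reverse_complement(ref_start)
)
```

Scoring a sample against each of these orientations and keeping the best-scoring one is one way to catch the misalignment cases listed above.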
356 Chapter 21 · Epigenome Data Analysis
Note: If you do as follows, the minimum agreement rate will automatically be set to 80%
and the minimum conversion will be set to 85%.
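To make the two thresholds concrete, the sketch below computes an agreement rate and a conversion rate for gap-free toy alignments. The definitions are assumptions, since the text does not show how methVisual computes them: agreement treats a reference C read as T as a match, and conversion is the fraction of non-CpG cytosines read as T. The function name qc_rates is hypothetical.

```r
# Assumed definitions; sequences must already be aligned and of equal length.
qc_rates <- function(ref, read) {
  ref_b <- strsplit(ref, "")[[1]]
  read_b <- strsplit(read, "")[[1]]
  stopifnot(length(ref_b) == length(read_b))

  next_b <- c(ref_b[-1], "")                 # base following each position
  is_cpg_c <- ref_b == "C" & next_b == "G"   # cytosines in CpG context
  non_cpg_c <- ref_b == "C" & !is_cpg_c      # cytosines expected to convert

  # Bisulfite-aware match: identical base, or reference C read as T.
  is_match <- ref_b == read_b | (ref_b == "C" & read_b == "T")

  c(agreement = mean(is_match),
    conversion = mean(read_b[non_cpg_c] == "T"))
}

ref_start <- "CCCGGGATCG"                      # start of the reference
rates_c <- qc_rates(ref_start, "TTCGGGATCG")   # both non-CpG Cs converted
rates_d <- qc_rates(ref_start, "TCCGGGATCG")   # one non-CpG C left unconverted

# Apply the default thresholds quoted in the note above.
passes_c <- rates_c[["agreement"]] >= 0.80 && rates_c[["conversion"]] >= 0.85
passes_d <- rates_d[["agreement"]] >= 0.80 && rates_d[["conversion"]] >= 0.85
```

On these ten-base fragments the second read fails the conversion threshold; on full-length sequences the rates would of course be computed over all positions.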
Using the sample sequences that pass alignment control (AC) and quality control (QC), we
extract their methylation status. We will use the MethAlignNW function to extract this
information and then interpret the results it returns.
As above, save information from 8 sequences into the object methQCData. Use the names
function to find the variable names stored in the object. There are six pre-named variables,
as shown below.
21.5 · Analysis of Methylation Status
> names(methQCData)
[1] "seqName" "alignment" "methPos" "positionCGIRef" "startEnd" "lengthRef"
> methQCData$seqName
[1] "QC_seq_B.fasta" "QC_seq_C.fasta" "QC_seq_D.fasta" "QC_seq_E.fasta" "QC_seq_F.fasta"
[6] "QC_seq_G.fasta" "QC_seq_H.fasta" "QC_seq_I.fasta"
21.5.2 Alignment
> methQCData$alignment
[1]"TTTGGGATTGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTTGTTGGGGTTTT
GTGTTGTTTT$
[2]"TTCGGGATCGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTCGTTGGGGTTTCG
TGTTGTTTT$
[3]"TCCGGGATCGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTCTTTGTTGGGGTTTTG
TGTTGTTTT$
[4]"TTTGGGATTGTTTTTTTTAGCAGGTGAAGTTTTGTTATGGATTTTTTTTGTTGGGGTTTT
GATGTTGTT$
[5]"TTCGGGATCGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTCGTTGGGGTTTCG
TGTTGCCCT$
[6]"TTTGGGATTGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTTGTTGGGGTTTT
GTGTTGTTTT$
[7]"TTCGGGATTGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTTGTTGGGGTTTCG
TGTTGTTTT$
[8]"TTTGGGATTGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTTGTTGGGGTTTT
GTGTTGTTTT$
> methQCData$startEnd
[,1] [,2]
[1,] 1 233
[2,] 1 233
[3,] 1 233
[4,] 1 233
[5,] 1 233
[6,] 1 233
[7,] 1 233
[8,] 1 233
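The methPos element returned by MethAlignNW holds the methylation state of each CpG
site (columns) in each sample sequence (rows). A value of 1 means that the sample agrees
with the reference at that site (the cytosine is retained, so the site is methylated), and 0
means that it does not (the cytosine was converted, so the site is unmethylated).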
> methQCData$methPos
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 0 1 0 1 0 0 1 1 0 1 1 1
[3,] 1 1 0 0 0 0 0 0 0 1 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 1 0 0 0
[5,] 1 1 0 1 0 1 0 0 1 1 0 1 1 1
[6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[7,] 1 0 0 0 0 1 0 0 1 0 1 1 1 1
[8,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
[1,] 0 0 0 0 1 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 0 0 0 0
[3,] 0 1 0 0 1 1 0 0 0 1 0 0
[4,] 1 1 1 1 1 0 0 1 1 1 0 0
[5,] 1 1 1 1 1 1 1 1 0 0 0 0
[6,] 0 0 0 0 1 0 0 0 0 0 0 0
[7,] 1 1 1 1 1 1 0 1 1 1 1 0
[8,] 0 0 0 0 0 0 0 0 0 0 0 0
[,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
[1,] 0 0 0 0 0 0 0 0 0
[2,] 1 1 0 1 1 0 0 1 0
[3,] 1 0 1 0 1 1 0 1 1
[4,] 0 1 0 1 0 0 0 0 0
[5,] 1 1 0 1 1 0 0 1 0
[6,] 0 0 0 0 0 0 0 0 0
[7,] 1 1 1 0 0 0 1 1 1
[8,] 0 0 0 0 0 0 0 0 0
> methQCData$positionCGIRef
[1] 3 9 32 48 51 59 61 69 73 83 96 98 102 104 110 118 120 125 128
[20] 131 136 145 148 153 160 163 165 167 175 177 198 200 209 217 232
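The relationship between positionCGIRef and methPos can be illustrated on the first ten bases of the sequences printed above. This is a base R sketch of the idea, not the methVisual implementation; the helper call_methylation is hypothetical, and the reads are assumed to be aligned to the reference without gaps.

```r
ref_start <- "CCCGGGATCG"  # first ten bases of the reference

# First ten bases of the first three aligned sample sequences shown above.
reads <- c(QC_seq_B = "TTTGGGATTG",
           QC_seq_C = "TTCGGGATCG",
           QC_seq_D = "TCCGGGATCG")

# Positions of the C in every CpG dinucleotide of the reference.
cpg_pos <- as.integer(gregexpr("CG", ref_start, fixed = TRUE)[[1]])

# 1 if the read still shows C at a CpG position (methylated), 0 otherwise.
call_methylation <- function(read, pos) {
  as.integer(substring(read, pos, pos) == "C")
}

# Rows are samples, columns are CpG positions.
meth_calls <- t(sapply(reads, call_methylation, pos = cpg_pos))
colnames(meth_calls) <- cpg_pos
```

The positions found (3 and 9) are the first two entries of positionCGIRef, and the resulting rows agree with the first two columns of methPos for these three samples.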
So far, we have simply determined methylated positions. It is not easy to understand what
these results mean. Therefore, we will visualize the results to perform more meaningful
interpretation. Below are several examples of commonly used visualization methods.
The plotAbsMethyl function counts, for each CpG position, the number of sample
sequences that are methylated at that position, so you can plot the methylation level by
sequence position (Fig. 21.1).
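The counts behind such a plot are column sums of the 0/1 methylation matrix. The sketch below uses only the first five columns of the methPos output shown earlier, labeled with the first five entries of positionCGIRef; it illustrates the calculation and is not the plotAbsMethyl source.

```r
# First five columns of methPos for the eight QC-passed sequences (from the output above).
meth_pos <- matrix(
  c(0, 0, 0, 0, 0,
    1, 1, 0, 1, 0,
    1, 1, 0, 0, 0,
    0, 0, 0, 0, 0,
    1, 1, 0, 1, 0,
    0, 0, 0, 0, 0,
    1, 0, 0, 0, 0,
    0, 0, 0, 0, 0),
  nrow = 8, byrow = TRUE
)
colnames(meth_pos) <- c(3, 9, 32, 48, 51)  # genomic positions of these CpG sites

# Absolute number of methylated sequences at each CpG position.
abs_meth <- colSums(meth_pos)
```

Passing abs_meth to barplot() would give a picture of the same kind as the figure, with one bar per CpG position.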
The MethLollipops function returns the methylation status of the sample sequence and
the reference sequence at the indexed CpG sites. The graph visually compares methylation
status per position between the sample and the reference (Fig. 21.2).
21.6 · Exploratory Analysis and Visualization
[Fig. 21.1 y-axis: absolute number of methylated CpGs]
> MethLollipops(methQCData)
LABEL_Y_AXIS Experiment
1 1 QC_seq_B.fasta
2 2 QC_seq_C.fasta
3 3 QC_seq_D.fasta
4 4 QC_seq_E.fasta
5 5 QC_seq_F.fasta
6 6 QC_seq_G.fasta
7 7 QC_seq_H.fasta
8 8 QC_seq_I.fasta
9 refSeq refernceSequence
[Fig. 21.2 x-axis: index of CpG methylation; y-axis: index of clone sequences, with refSeq as the top row]
This plot shows the correlation of methylation status between two neighboring CpG sites
(Fig. 21.3).
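A correlation between two CpG sites can be computed directly from their 0/1 columns. The sketch below uses the first three columns of the methPos output shown earlier and plain Pearson correlation; whether the package uses exactly this estimator is an assumption. It also shows how an NA can arise when a site has the same state in every sequence.

```r
# Columns 1 to 3 of methPos across the eight QC-passed sequences.
site_a <- c(0, 1, 1, 0, 1, 0, 1, 0)
site_b <- c(0, 1, 1, 0, 1, 0, 0, 0)
site_c <- rep(0, 8)  # never methylated, so its variance is zero

# Pearson correlation between the methylation states of two sites.
r_ab <- cor(site_a, site_b)

# With a constant column the correlation is undefined and R returns NA
# (with a warning, suppressed here).
r_ac <- suppressWarnings(cor(site_a, site_c))
```

A site that is unmethylated (or methylated) in all sequences therefore contributes only NA entries to a correlation matrix.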
[Figure: correlation (cor) of methylation status between CpG sites, labeled by genomic position of CpG site]
This visualization estimates the co-occurrence between a position A and its neighboring
base positions, as well as between A and distant base positions (Fig. 21.4).
We can perform Fisher’s exact test to determine, for each CpG site, whether methylation
differs significantly between two groups. Run Fisher’s exact test for the 35 CpG sites (Fig. 21.5).
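The per-site test can be sketched with base R's fisher.test. Everything in this example is hypothetical: the two groups of four sequences and the two toy CpG columns are invented to show one clearly different site and one identical site, and the package's own test function may organize the comparison differently.

```r
# Hypothetical grouping of eight sequences into two groups of four.
group <- factor(c("g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"))

# Toy methylation calls (rows: sequences, columns: CpG sites).
meth_toy <- cbind(
  CpG_a = c(1, 1, 1, 1, 0, 0, 0, 0),  # methylated only in g1
  CpG_b = c(0, 1, 1, 0, 1, 0, 1, 0)   # same proportion in both groups
)

# One Fisher's exact test per CpG site on the 2 x 2 table of state by group.
p_values <- apply(meth_toy, 2, function(site) {
  tab <- table(factor(site, levels = c(0, 1)), group)
  fisher.test(tab)$p.value
})
```

Fixing the factor levels to 0 and 1 keeps every table 2 x 2; a site with a single observed state would still need separate handling before testing.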
[Fig. 21.5 x-axis: CpG index; y-axis: p-value]
Clustering analysis is used to search for similar patterns of methylation state across samples.
We can draw a heatmap using hierarchical clustering and infer groups that show
similar patterns of methylation (Fig. 21.6).
> heatMapMeth(methQCData)
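The grouping step behind such a heatmap can be tried with base R's dist, hclust, and cutree. This is a sketch on a made-up 0/1 matrix; the Euclidean distance and average linkage are assumptions, since the text does not state which settings heatMapMeth uses.

```r
# Hypothetical methylation calls for four clones at six CpG sites.
meth_toy <- rbind(
  clone1 = c(0, 0, 0, 0, 1, 0),
  clone2 = c(1, 1, 0, 1, 1, 1),
  clone3 = c(1, 1, 0, 1, 1, 1),
  clone4 = c(0, 0, 0, 0, 0, 1)
)

# Pairwise Euclidean distances between clones (rows).
d <- dist(meth_toy)

# Agglomerative hierarchical clustering with average linkage (assumed choice).
hc <- hclust(d, method = "average")

# Cut the tree into two groups of clones.
groups <- cutree(hc, k = 2)
```

The two heavily methylated clones fall into one group and the two sparsely methylated clones into the other; calling heatmap(meth_toy) would draw the corresponding clustered image.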
[Fig. 21.6: heatmap of methylation status, with clones and CpG indices (1 to 35) reordered by hierarchical clustering]
> methCA(methQCData)
Exercise
[Exercise 1] - Load sample and reference sequences from the bisulfite sequenced samples of the Mouse
Gm9 region (Oda et al. [1]). Return the results of the following questions.
1.1. Select a sequence with an agreement rate of 70% and a conversion rate of 80% or less.
1.2. Perform alignment and extract CpG sites.
1.3. Visualize the methylation pattern using a Lollipop plot.
1.4. Perform statistical analysis to determine the difference between the following two groups: Group
1 comprising sequences 1,2,5,7, and 8 vs. Group 2 comprising sequences 3,4,6,9, and 10.
1.5. Draw a heatmap of the sequence and methylation sites using cluster analysis.
[Figure: correspondence analysis plot of clones and CpG sites (y-axis: Dimension 2 (29.3%))]
Bibliography
1. Oda M et al (2006) DNA methylation regulates long-range gene silencing of an X-linked homeobox
gene cluster in a lineage-specific manner. Genes Dev 20:3382–3394
2. Zackay A, Steinhoff C (2010) MethVisual - visualization and exploratory statistical analysis of DNA
methylation profiles from bisulfite sequencing. BMC Res Notes 3:337
3. Zackay A, Steinhoff C (2016) methVisual: Methods for visualization and statistics on DNA methylation
data. R package version 1.26.0