HoytEtAl 2018 BioInformatics
Bioinformatics
LEARNING OBJECTIVES
After reading this chapter the reader should be able to:
• Define bioinformatics, translational bioinformatics, and other bioinformatics-related terms
• State the importance of bioinformatics in future medical treatments and prevention
• Describe genomics and its important implications for health care
• List major private and governmental bioinformatics databases and projects
• Enumerate several bioinformatics projects that involve electronic health records
• Describe the application of bioinformatics in genetic profiling of individuals and large populations
INTRODUCTION

This chapter is focused on “bioinformatics,” the study of data and information as it relates to knowledge within the context of the life sciences. Bioinformatics traces its formal beginning to 1970, when the term was first introduced in scientific literature.1 In many ways, bioinformatics has evolved in parallel with health informatics. Significant advances in bioinformatics have given rise to contemplation of its applications within the context of biomedicine and health (the sub-discipline of biomedical informatics referred to as “translational bioinformatics”).

Definitions

The chapter begins with common definitions and the next section provides a short primer on genomics, which underpins many of the concepts used for bioinformatics within the context of health.

Bioinformatics has been defined as, “the field of science in which biology, computer science and information technology merge to form a single discipline.”2 Bioinformatics makes use of fundamental aspects of computer science (such as databases and artificial intelligence) to develop algorithms for facilitating the development and testing of biological hypotheses, such as: finding the genes of various organisms, predicting the structure or function of newly developed proteins, developing protein models and examining evolutionary relationships.3-4 A related term is computational biology, which refers to the computational aspects of molecular biology. Translational bioinformatics focuses on the “development of storage, analytic and interpretive methods to optimize the transformation of increasingly voluminous biomedical data into proactive, predictive, preventive and participatory health.”5 Simply put, translational bioinformatics is the specialization of bioinformatics for human health.

Bioinformatics is sometimes said to work with the various “omes” and “omics.” They include:
• Genomics - the study of genetic material in an organism (e.g., the genes that may be associated with a disease).
• Proteomics - the study at the level of proteins (e.g., through the components, structure, and functions).
• Pharmacogenomics - the study of genetic material in relationship to drug targets.
• Metabolomics - the study of genes, proteins or metabolites.
• Interactome - biomolecular pathways and interactions of proteins
• Microbiome - microorganisms inhabiting an individual
• Exposome - environmental factors to which an organism is exposed
• Bibliome - the literature of science
358 Chapter 18: Bioinformatics
• Metagenomics - the analysis of genetic material derived from complete microbial communities harvested from natural environments.6

In addition, bioinformatics studies the relationship between the genotype, which is the genetic information that is associated with biological function,7 and the phenotype, which is the observable characteristic, structure, function and behavior of a living organism. Examples of phenotypes include hair color, height, and development of diseases. The phenome refers to the total phenotypic traits.

GENOMIC PRIMER

The human body has about 100 trillion cells and each one contains a complete set of genetic information (chromosomes) in the nucleus; exceptions are eggs, sperm, and red blood cells. Humans have 23 pairs of chromosomes in each cell, including one pair of sex chromosomes: an X and a Y chromosome for males and two Xs for females. Offspring inherit one chromosome of each pair from each parent. Chromosomes are numbered approximately by size, from chromosome 1 (the largest) to chromosome 22 (among the smallest). Organisms have differing numbers of chromosomes (e.g., our closest extant primate relatives, chimpanzees, have 24 pairs). Chromosomes consist of double helices of deoxyribonucleic acid (DNA). DNA is composed of four sugar-based building blocks (“nucleotides”: adenine [A], thymine [T], cytosine [C], and guanine [G]) that are generally found in pairs (following “Watson-Crick” pairing templates: A-T, C-G). DNA is often referred to as the “blueprint for life.” As such, a given organism’s DNA encodes its full complement of proteins essential for cellular function. Some of the encoding of DNA also enables it to control the expression of proteins or affect how other portions of DNA may be decoded based on a biological context (e.g., to accommodate faulty DNA decoding or DNA damage that may be encountered due to environmental phenomena). Genes are regions on chromosomes that encode instructions, which may result in proteins that then in turn enable biological functions. The process of decoding genes involves transcribing the DNA into ribonucleic acid (RNA) and then translating the RNA into amino acids that form the building blocks for proteins (Figure 18.1). Collectively, the complete set of genes is referred to as the “genome” (based on the combination of the terms “gene” and “chromosome”).

It is estimated that humans have between 20,000 and 30,000 genes and that genomes are about 99.9% similar between individuals. Variations in genomes between individuals are known as single nucleotide polymorphisms (SNPs) (pronounced “snips”). There are three general types of alterations: single base-pair changes, insertions or deletions of nucleotides, and reshuffled DNA sequences. As an example, one individual might have a chromosome with the sequence TGGC, while another might have the sequence TAGC. Each of these is referred to as an allele. Although SNPs are common, their significance is complex and difficult to decipher.8-10

Another type of genetic variation is copy number variation (CNV). CNVs are repeats of DNA sequences of 50 nucleotides or longer. There may be many CNVs in an individual’s genome. Anywhere from 4.8% to 9.5% of the human genome is CNVs. Some of these copied segments are deleterious, others are not. When they are deleterious, they are so-called unbalanced rearrangements, involving either loss or gain of segments of the genome.11

An additional cause of genetic variation is epigenetics, which is the variation in the phenotype or gene expression that is caused by mechanisms other than DNA sequence differences.12 The molecular mechanisms for epigenetics, such as DNA methylation, are beginning to be unraveled.13 Epigenetics shows that the environment influences the expression of genes, thereby leading to genetic variation.

A great deal of progress has been made with genetic testing and our understanding of the human genome and genetic variations. Genome-wide association studies (GWAS) look at associations between genomic variants and traits of the phenotype.14 The variations or SNPs discovered are said to be associated with the disease, but true cause and effect cannot be ascertained.15 Similarly, phenome-wide association studies (PheWAS) are being carried out comparing genes to disease associations, most recently using the electronic health record for phenotypical information.16

Genetic material can be obtained from blood, saliva, skin and hair samples. Full genome sequencing has historically been an expensive, time-consuming, and complicated process, although the cost of full genome sequencing has dropped to approximately $1000 (US). SNP-based genomic profiling is available now for less than $200 US (e.g., 23andMe). This cost differential is largely because SNP genotyping analyzes about 0.1 - 0.2% of the genome in contrast to every single nucleotide. Even more cost effective are ultra-low-coverage (ULC) sequencing techniques that analyze the same 10-20% of the genome and cost $60 US. SNP-based genotyping has more specificity since it uses a technique that seeks to identify SNPs from a library of a priori selected SNPs of interest (using a technology called “microarrays”);
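The decoding process described in the genomic primer (Watson-Crick pairing, transcription of DNA into RNA, and translation into amino acids) can be illustrated with a minimal Python sketch. The sequences below are hypothetical, the codon table is deliberately partial, and the function names are illustrative rather than taken from any bioinformatics library. The sketch also locates the single-base difference between the TGGC and TAGC alleles used as the SNP example.

```python
# Minimal sketch of base pairing, transcription, translation, and SNP detection.
# Assumption: only a handful of codons are included; a real table has 64 entries.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

CODON_TABLE = {
    "AUG": "M",  # methionine, the usual start codon
    "UGG": "W",  # tryptophan
    "GGC": "G",  # glycine
    "UUU": "F",  # phenylalanine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def complement_strand(dna):
    """Return the Watson-Crick partner of each base (A-T, C-G)."""
    return "".join(COMPLEMENT[base] for base in dna)


def transcribe(coding_dna):
    """Transcribe the coding strand into mRNA by replacing T with U."""
    return coding_dna.replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = codon not in the partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def snp_positions(allele_a, allele_b):
    """Return zero-based positions where two equal-length sequences differ."""
    return [i for i, (a, b) in enumerate(zip(allele_a, allele_b)) if a != b]


# Hypothetical coding sequences that embed the TGGC and TAGC alleles.
reference = "ATGTGGGGCTAA"
variant = "ATGTAGGGCTAA"

print(complement_strand("TGGC"))         # ACCG
print(snp_positions("TGGC", "TAGC"))     # [1]
print(translate(transcribe(reference)))  # MWG
print(translate(transcribe(variant)))    # M
```

In this hypothetical reading frame the single G-to-A change turns the tryptophan codon UGG into the stop codon UAG, so translation ends early. Real SNPs often have far subtler effects, which is part of why their significance is difficult to decipher.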
of translational bioinformatics is primarily due to the rapid advances in both sub-disciplines, and the realization of the potential to leverage biological data within the context of clinical care. In other words, a variety of advances in bioinformatics, such as faster and cheaper DNA sequencing, and more widespread adoption of electronic health records have made this possible.

Pharmacogenomics is an illustrative example of how translational bioinformatics can be used within the context of pharmaceutical development to make use of genomic information for better drug discovery and utilization. Drug companies are faced with the huge expense of drug development, the long road to producing a new drug and expiring patents. Drug failures are common and can be due to a complex combination of a lack of clinical efficacy, side effects and commercial issues. Unfortunately, animal models are often inadequate for the development and evaluation of drugs for treating human conditions. It is thus the goal to use genetic information for:
• New indications for an old drug (repurposing)
• New targets for existing drugs (e.g., treatment of tongue cancer using RET inhibitors)
• Drugs to work better in certain patient groups (gender, age, race, ethnicity, etc.) with possible genetic variants
• Knowing ahead of time what drugs to avoid due to higher incidence of side effects that are genetically modulated
• Developing clinical decision support in electronic health records based on pharmacogenomics.20-21

Multiple projects are underway to integrate genetic and clinical data; these will be discussed later in the chapter. Electronic health records (EHRs) and health information exchange (HIE) efforts, which are rapidly becoming ubiquitous, thanks in large part to federal incentives, are poised to contribute massive amounts of patient information (including demographic, laboratory, and clinical data). It is important to also note that in addition to genomic and clinical data, environmental data may offer valuable insights into the understanding and eventual treatment of disease.

Another important application of translational bioinformatics is in cancer genomics. There have been so-called hallmarks papers that describe the state of the knowledge on how cancer cells proliferate and evade death by a number of mechanisms within living organisms.22 When whole genome sequencing is applied to tumor cells, it shows that they undergo genomic changes that give them the ability to proliferate and metastasize within living organisms.23 It has also been discovered that cancer gene mutations, within tumors, are heterogeneous, so cancer cells evolve and attain this ability to proliferate outside the normal defense mechanisms of organisms.24 Another important discovery is that some genomic changes representing common tumor types occur in different cancer locations, so that the same mechanism may occur in more than one type of cancer.25-26 There has also been improved understanding of gene function, and of course this may aid in the development of better treatments.27

BIOINFORMATICS PROJECTS AND CENTERS

The Human Genome Project

One of the greatest accomplishments in biomedicine was the completion of the Human Genome Project (HGP). This international collaborative project, sponsored by the US Department of Energy and the National Institutes of Health, was started in 1990 and finished in 2003. In the process of acquiring the human genome (as a complete set of DNA sequences, encompassing all 23 chromosomes), genome sequences for several other key organisms (“model” organisms) were also acquired. These included the Escherichia coli bacterium, fruit fly (Drosophila melanogaster), and house mouse (Mus musculus). By mid-2007 about three million differences (SNPs) had been identified in human genomes. Appreciating the potential significant societal impact, the HGP also addressed the ethical, legal and social issues associated with the project. Since the completion of the HGP, attention is now more focused on the development of approaches to analyze and learn from volumes of data representing increasing numbers of individuals.11,28-29 These analyses include the annotation of information associated with disease onto chromosomes. Figure 18.3 displays the DNA sequencing of just chromosome number 12. Huge relational databases are necessary to store and retrieve this information. New technologies continue to emerge that reduce the necessity to sequence an entire human genome, such as DNA arrays (gene chips) that help speed the analysis and comparison of DNA fragments.30 The cost of the HGP was close to $3 billion, but over time, costs have dramatically dropped for genetic analysis.7

National Human Genome Research Institute (NHGRI)

NHGRI is an NIH institute that provides many educational resources on their web site. Like other NIH institutes, they conduct and fund research within their intramural division, as well as support extramural research with external partners. Their health section has
multiple resources for patients and healthcare professionals with emphasis on the Human Genome Project. The “Issues in Genetics” section covers important controversies in policy, legal and ethical issues in genetic research. They include a large glossary (200+) of genetics-related definitions, also available as a software app for the iPhone and iPad.

In 2003, NHGRI launched the Encyclopedia of DNA Elements (ENCODE) Project. ENCODE consists of a consortium of laboratories with the goal to study and characterize the functional elements of the human genome. All ENCODE data are free for research purposes. In 2012, 1640 data sets were published, which continue to produce controversy. For example, ENCODE researchers posited that 80% of the human genome is active and performing a role (and thus not “junk” DNA as has been previously thought).31

Human Microbiome Project (HMP)

It is estimated that less than 0.01% of microbes on Earth have been cultured, characterized, and sequenced. As an exception, the complete genome for the common human parasite Trichomonas vaginalis was reported in 2007 in the journal Science.32 The HMP is an NIH-sponsored initiative that catalogued the myriad of organisms that co-exist with humans and heretofore have been rarely studied (e.g., flora from oral, nasal, skin, and the gastrointestinal tract). It is important to note that microbial cells on the human body outnumber human cells by a factor of 10 to 1. Initial efforts were aimed at identifying the microbiome in healthy patients. More recently, extensive work has been done to identify the microbiome in multiple disease states with results too comprehensive to cite in this chapter. Three areas the HMP is currently focusing on include pregnancy and pre-term birth, onset of inflammatory bowel disease and onset of type 2 diabetes.

The HMP used metagenomics, as explained in the definitions section. As detailed on the HMP web site, their goals were as follows:
• Determine whether individuals share a core human microbiome
• Understand whether changes in the human microbiome can be correlated with changes in human health
• Develop new technological and bioinformatics tools needed to support these goals
• Address the ethical, legal and social implications raised by human microbiome research6
health data sets. Health data exist in a variety of file formats and often require standardization for both use and supporting data integration tasks. From the perspective of translational bioinformatics, OHDSI represents a major initiative that provides datasets that can be integrated with biological data. For example, within the context of studying drugs, it is useful to know reported adverse events. While the US FDA does make available data collected from its adverse event reporting system (FAERS), these data are challenging to use for large scale analyses. AEOLUS is an artifact of an OHDSI initiative (LAERTES) that standardizes FAERS data such that it can be used within a range of studies, including those involving translational bioinformatics.44

All of Us

Conceptualized at the end of the Obama administration, the All of Us program (which is a key element of the Precision Medicine Initiative) is a White House initiative to develop a US cohort of individuals that will support the development and evaluation of next-generation, personalized treatments. The All of Us program is built around the fundamental tenets of precision medicine, i.e., the right treatment for the right patient at the right time. In this way, the success of this endeavor, which is just formally launching at the writing of this text, will result in a paradigm shift in how biological data can be used to inform health care in a meaningful and efficient manner.45

The Cancer Genome Atlas (TCGA)

The suite of conditions that comprise cancer has a strong genomic component. The TCGA project is a joint initiative between the National Human Genome Research Institute and the National Cancer Institute, both of the National Institutes of Health, that aims to provide one of the largest collections of cancer-related genomic data. Currently, the initiative focuses on 33 cancers and the TCGA provides access to genomic data associated with tumor samples gathered from more than 11,000 patients.46

Bioinformatics Data and Information Resources

The bioinformatics community has produced a wide variety of knowledge-based information resources that not only provide access to research results but also facilitate scientific discovery from genomic information. Much of this activity is led by the National Center for Biotechnology Information (NCBI), which is part of the National Library of Medicine. A catalog of NCBI and other resources is provided annually in the open-access journal Nucleic Acids Research. In fact, there is an annual database issue that covers the myriad of databases available. There is also a web server issue that covers resources that can be accessed over the Internet. NCBI was created in 1988 and hosts dozens of databases associated with biomedicine, including the popular MEDLINE and GenBank databases. NCBI provides access to sequences from over 500,000 organisms (via GenBank), including the complete genomes of thousands of organisms (via NCBI Genome). Genomes represent both completely sequenced organisms and those for which sequencing is still in progress. Popular NCBI databases, which are linked by a common interface (Entrez), are listed in Figure 18.4. On the Genome project web site one can search for specific genes or proteins from different species. Figure 18.5 shows the result of a search for the tumor protein TP53.

The NCBI site also provides access to BLAST+ (new Basic Local Alignment Search Tool) that enables the identification of significantly related (based on an “expectation” value or “e-value”) nucleotide or protein sequences from within the protein and nucleotide databases.47 Magic BLAST is a recently developed tool that enables rapid searching for related sequences (e.g., as might arise from a full sequence of a human being) to a reference genome.48 Also available on the NCBI site are databases where one can find data about a gene (Gene), its location on the chromosome (MapViewer), its variants (dbVar), and data about patients who have the gene or its variants (dbGaP).

GenBank

This database was established in 1982 and is the NIH sequence database that is a collection of all publicly available DNA sequences. Along with EMBL (Europe) and DDBJ (Asia), GenBank is a member of the International Nucleotide Sequence Database Collaboration (INSDC), which provides free access to sequence data from nearly anywhere with an Internet connection. Interestingly, many biological and medical journals now require submission of sequences to a database prior to publication, which can be done with NCBI tools such as BankIt.49

The Online Mendelian Inheritance in Man (OMIM)

Originally an NCBI database but now a standalone resource, Online Mendelian Inheritance in Man (OMIM) is a database of genetic data and human genetic disorders. It was originally developed by Johns Hopkins University and Dr. Victor McKusick, a pioneer in genetic metabolic abnormalities. It includes an extensive reference section linked to PubMed that is continuously updated.50
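To make the BLAST “expectation” value more concrete, the following minimal Python sketch applies the standard Karlin-Altschul relationship, in which the expected number of chance alignments grows with the size of the search space and shrinks exponentially with the alignment score. This is an illustration under stated assumptions, not a reproduction of BLAST+ itself: the function names are hypothetical, the statistical parameters `lam` and `k` are illustrative values that in practice depend on the scoring system, and real BLAST applies length corrections that are omitted here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_bit_score(bits, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


def evalue_from_raw_score(raw_score, lam, k, query_len, db_len):
    """Equivalent form using the raw score: E = K * m * n * exp(-lam * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameters only (assumptions, not taken from the chapter).
lam, k = 0.267, 0.041
query_len, db_len = 250, 1_000_000_000

for raw in (50, 100, 150):
    bits = bit_score(raw, lam, k)
    e = evalue_from_bit_score(bits, query_len, db_len)
    print(f"raw score {raw:3d} -> {bits:6.1f} bits, e-value {e:.3g}")
```

A smaller e-value means the match is less likely to have arisen by chance. Because the search-space size enters the formula directly, the same alignment receives a larger e-value when it is found in a larger database.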
Figure 18.5: Entrez search for tumor protein (Courtesy National Library of Medicine)
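A search like the one in Figure 18.5 can also be expressed programmatically through NCBI's Entrez Programming Utilities (E-utilities). The minimal Python sketch below only builds the request URL for the `esearch` endpoint and does not send it. The helper name `build_esearch_url` is hypothetical, and the field tags in the query (`[Gene Name]`, `[Organism]`) are assumptions about how one might phrase a TP53 search against the Gene database.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=5):
    """Return an esearch URL for the given Entrez database and query term."""
    params = {
        "db": db,           # Entrez database to search, e.g. "gene" or "pubmed"
        "term": term,       # query string, optionally with field tags
        "retmode": "json",  # ask for a JSON response
        "retmax": retmax,   # maximum number of identifiers to return
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query for the human TP53 gene.
url = build_esearch_url("gene", "TP53[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. The response lists matching record identifiers, which can then be passed to other E-utilities to retrieve the full records.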
Figure 18.6: Cost per Genome over time (Courtesy National Human Genome Research Institute, National
Institutes of Health)11
• AncestryDNA is a separate service offered by Ancestry.com. Their analysis will determine ethnicity estimates and will identify remote relatives. Saliva samples are needed; the cost is $79 and the turnaround time is six to eight weeks.54
• 23andMe is a direct-to-consumer online genetic testing company. For $199 they send a saliva-based testing kit to homes, with a turnaround time of four to six weeks. Currently, they look for 240 diseases, multiple carrier states and drug response conditions (a substantial increase in the last two years). They also offer an analysis of ancestry based on the genetic profile.55 In 2010 a genome-wide association study (GWAS) was published that used this technology and showed that patient questionnaire results correlated well with genetic results. Additionally, they were able to describe five new genotype-phenotype associations: freckling, photic sneeze reflex, hair curl and failure to smell asparagus.56 Google’s co-founder Sergey Brin was one of the early funders of 23andMe, focusing on a project through this company to study the genetic inheritance of Parkinson’s disease. They hope to recruit 10,000 subjects from various organizations and offer a discount price for complete analysis. In late 2013 the FDA instructed the company to stop performing genetic analyses for medical conditions until they received 510(k) (pre-market) clearance, which they subsequently received.57 In 2017, they offered a Health and Ancestry analysis for $199 and an Ancestry analysis for $99. The Health analysis includes genetic risk (e.g., celiac disease, macular degeneration, etc.), wellness reports, trait reports (e.g., eye color, skin pigment) and carrier status reports (e.g., polycystic kidney disease, cystic fibrosis, etc.).
• Myriad™ specializes in genetic testing for cancers with a hereditary component, such as breast, endometrial, melanoma, ovarian, colon, prostate, gastric and pancreatic cancer.58 A landmark Supreme Court decision in 2013 determined that Myriad could not patent BRCA gene testing.59

As pointed out by Harold Varmus (American Nobel Prize winner, former director of the NIH, and the current director of the NCI), personal genetics “is not regulated, lacks external standards for accuracy, has not demonstrated economic viability or clinical benefit and has the potential to mislead customers.”60 For genetics to enter the mainstream, new technologies and specialties will need to be developed and numerous ethical questions will arise. Just finding the abnormal gene or SNP is the starting point. Understanding the sensitivity and specificity of genetic tests within clinical contexts will be essential for them to be accepted. In general, patients may not be willing to undergo major procedures (e.g., a prophylactic mastectomy or prostatectomy to prevent cancer) unless the genetic testing is nearly perfect. It is also important that genetic counseling be available to help patients understand the implication of genetic susceptibility tests (versus genetic guarantee of disease, such as the mutations associated with Huntington’s disease).

Additionally, the Genetic Information Nondiscrimination Act of 2008 was passed to protect patients against discrimination by employers and healthcare insurers based on genetic information. Specifically, the Act prohibits health insurers from denying coverage to a healthy individual or charging that person higher premiums based solely on genetic information and bars employers from using individuals’ genetic information when making decisions related to hiring, firing, job placement, or promotion.61

Many obstacles face the routine ordering of genetic tests by the average patient. Ioannidis et al. pointed out that for genetic testing to be reasonable several facts must be true. The disease of interest must be common. Even with breast cancer, when seven established genetic variants are evaluated, they only explain about 5% of the risk for the cancer. If the disease (e.g., Crohn’s disease) is rare, then the test must be highly predictive. For genetic testing to be relevant one should have an effective treatment to offer, otherwise there is little benefit. The test must be cost effective, as many currently are too expensive. As an example, screening for sensitivity to the blood thinner warfarin (Coumadin) makes little sense now due to cost.62

A 2010 Lancet journal commentary warned of additional concerns. Whole-genome sequencing will generate a tremendous amount of information that the average physician and patient will not understand without extensive training. At this point, health care lacks adequate numbers of geneticists and genetic counselors who understand the implications of data being made available thanks to continued advances in biotechnology. Patients will need to sign an informed consent to confirm that many of the findings will have unclear meaning. They will have to deal with the fact that they may be found to be carriers of certain diseases that may have impact on childbearing, etc. Genetic testing may cause many further tests to be ordered, thus leading to increased healthcare expenditures. As more information about whole-genome sequencing is gained, more patients will desire it, but who will pay for it? And can the costs be justified?63

Two other articles drive home additional practical points. When the risk of cardiovascular disease based on the chromosome 9p21.3 abnormality was evaluated