
OUP CORRECTED PROOF – FINAL, 11/29/2012, SPi

CHAPTER 4

Visualizing personal genomics

 Key points
• Visualization is a critical aspect of personal genomics due to the quantity and richness of the data.
• Visualization modalities can both enhance and limit the interpretation of personal genomic information.
• Different visualization techniques are used to facilitate distinct aspects of exploring personal genomic information; e.g.
visualizations for clinical interpretation will be distinct from those facilitating biological discovery and interpretation.

4.1 Introduction

A personal genome is a vast and rich source of information. Specialized visualization tools and techniques are needed to explore and interpret personal genomic information. To illustrate this point, recall that a complete human genome contains around 3 billion nucleotide base pairs. Using a common 12-point font, roughly 3000 characters can fit on a single A4 page. At this rate, a text representation of a single personal genome would span 1 million pages! Obviously, such a representation would be extremely cumbersome and difficult to interpret by standard human perception.

Although it may be apparent that visualization is critical for exploring and interpreting personal genomic information, the tools and techniques presently available for visualizing such information are quite sparse. Therefore, visualization represents a great opportunity for professionals and hobbyists alike to contribute towards the broader utility of personal genomics by developing novel and useful visualization tools and techniques for viewing, exploring, integrating, and summarizing personal genomic information. Certainly, continued efforts to improve the biological and medical understanding of personal genomic information are also quite important. However, the relevance and impact of any such efforts could be limited if they are not also accompanied by advances in the means to deliver and communicate the information and its significance through effective visualization. For example, genetic studies are continually identifying genetic variants associated with a multitude of disease traits, to the point that some diseases are now known to have hundreds of variants associated with their susceptibility or etiology. If a physician is unable to quickly assess this information within the time frame of a typical office visit, which is less than 20 minutes in the United States, then the clinical utility of this information is potentially limited or lost.

Ultimately, there will be many types of visualizations required for exploring and interpreting personal genomic information. Genomic researchers may desire visualizations that allow them to easily navigate a personal genome from a fine-grained molecular perspective, a genomic genealogist may desire a visualization that maps genomic information on to geographic projections, or a physician may seek visualizations that concisely summarize a patient’s genetic health risks over hundreds of genetic disease risk variants. In this chapter, we will discuss several fundamental types of visualization that are relevant to personal genomics, and we will review some of the existing visualization tools and techniques that represent initial efforts in this area. While we expect that those already engaged in the development of such visualization tools and

Exploring Personal Genomics. First Edition. Joel T. Dudley and Konrad J. Karczewski.
© Joel T. Dudley and Konrad J. Karczewski 2013. Published 2013 by Oxford University Press.

Elevated Risk

Name                                            Confidence  Your Risk  Avg. Risk  Compared to Average
Atrial Fibrillation                                         33.9%      27.2%      1.25x
Psoriasis                                                   16.8%      11.4%      1.48x
Colorectal Cancer                                           6.8%       5.6%       1.23x
Exfoliation Glaucoma                                        2.2%       0.7%       2.90x
Celiac Disease                                              0.6%       0.1%       4.98x
Esophageal Squamous Cell Carcinoma (ESCC)                   0.4%       0.4%       1.21x
Stomach Cancer (Gastric Cardia Adenocarcinoma)              0.3%       0.2%       1.22x
Bipolar Disorder                                            0.2%       0.1%       1.44x
Scleroderma (Limited Cutaneous Type)                        0.08%      0.07%      1.24x
Ankylosing Spondylitis
Chronic Lymphocytic Leukemia

Figure 4.1 Tabular view of genetic disease risk: This tabular view of genetic disease risk is generated as part of the analysis provided by a DTC
personal genomics service. This view offers a fairly concise summary of the increased disease risks assessed from personal genomic information,
clearly ranked according to overall risk estimates. However, this view offers limited interactivity, and does not reveal relationships
between individual diseases (e.g. comorbid conditions or shared risk alleles). Image from 23andMe genetic testing service.
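
The ranking in Figure 4.1 is only one of several orderings the same rows could take. The following is a minimal Python sketch, not the code of any DTC service, that re-sorts a few rows transcribed from the figure. The helper name `rank_by` and the tuple layout are assumptions made for illustration.

```python
# Rows transcribed from Figure 4.1:
# (condition, your risk %, average risk %, "Compared to Average" ratio).
RISK_ROWS = [
    ("Atrial Fibrillation", 33.9, 27.2, 1.25),
    ("Psoriasis", 16.8, 11.4, 1.48),
    ("Colorectal Cancer", 6.8, 5.6, 1.23),
    ("Exfoliation Glaucoma", 2.2, 0.7, 2.90),
    ("Celiac Disease", 0.6, 0.1, 4.98),
    ("Bipolar Disorder", 0.2, 0.1, 1.44),
]


def rank_by(rows, column, descending=True):
    """Return the rows sorted on one column index (hypothetical helper)."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


# Ordering used in the figure: highest absolute risk first.
by_absolute = rank_by(RISK_ROWS, column=1)
# Alternative ordering: highest risk relative to the population average first.
by_relative = rank_by(RISK_ROWS, column=3)

for name, your_risk, avg_risk, ratio in by_relative:
    print(f"{name:<22} {your_risk:>5.1f}% {avg_risk:>5.1f}% {ratio:>5.2f}x")
```

Sorting on the relative column moves low-prevalence conditions such as celiac disease to the top, which shows how much the choice of sort column shapes what a reader of a table notices first.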

techniques will make continued progress, we hope that readers of this book, from all backgrounds and capabilities, might become inspired by the information in this chapter to add their own unique contributions to personal genome visualization.

4.2 Tabular views

Although tabular views, where information is organized into a table-like format, may seem a bit pedestrian from the standpoint of data visualization, they are used frequently to present personal genomic information. Customers of direct-to-consumer (DTC) genomics services are typically greeted with tabular views of their data when they view their results using the web interface provided by the DTC company (Figure 4.1). These interfaces often use tabular views to provide a high-level summary of annotated personal genomic information, such as a list of diseases with calculated genetic risk based on an individual’s personal genome, and they provide links to drill down on a particular line item (e.g. disease) to view the underlying data, and perhaps additional visualizations.

Tabular views are also the predominant visualization technique employed by many third-party genome interpretation tools. Both the GET-Evidence tool from the Personal Genome Project and the shareware Promethease tool from the makers of SNPedia use tabular views as their primary visualization technique. Like many DTC web interfaces, Promethease uses tabular views to provide a high-level summary of annotated personal genomic information; however, nested tables are used to reveal underlying information. GET-Evidence uses tabular views to present a more fine-grained view of personal genomic data, presenting annotations relating to the gene in or near which a particular locus is found, and information related to the population frequency and clinical importance of the genotype along with free-text commentary (Figure 4.2).

Variant report for huE80E3D (PGP4: Misha Angrist) CGI var file, build 36
· Name: huE80E3D (PGP4: Misha Angrist) CGI var file, build 36
· This report: evidence.personalgenomes.org/genomes?fe9f72be9699820adc9af9e001500e02189adc84
· public profile: my.personalgenomes.org/profile/huE80E3D
· Download: source data (373 MB), dbSNP and nsSNP report (126 MB)

Genome report: rare (f<10%) pathogenic variants

Variant: CPT2-S113L
Clinical importance: High
Impact: Well-established pathogenic. Recessive, Carrier (Heterozygous)
Allele freq: 0.78%
Summary: This is the most common variant associated with late-onset carnitine palmitoyltransferase deficiency, which is classically viewed as recessive. Many patients are heterozygous for this, but are presumably compound heterozygous.

Variant: TREM2-R47H
Clinical importance: High
Impact: Uncertain pathogenic. Recessive, Carrier (Heterozygous)
Allele freq: 0.78%
Summary: Unreported, predicted to be damaging. Other recessive mutations in this gene cause polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy (a severe genetic disorder, usually lethal by age 50).

Variant: CC2D2A-G776R
Clinical importance: High
Impact: Uncertain pathogenic. Recessive, Carrier (Heterozygous)
Allele freq: 0.78%
Summary: Unreported, predicted to be damaging. Other recessive mutations in this gene cause Joubert Syndrome and Meckel Syndrome.

Variant: SPG11-K1013E
Clinical importance: High
Impact: Uncertain pathogenic. Recessive, Carrier (Heterozygous)
Allele freq: 0.78%
Summary: One unpublished report links this to causing spastic paraplegia in a recessive manner, but insufficient data exists to evaluate significance. Most mutations in this reported to cause the disease in this gene are more severe null mutations (frameshift or nonsense).

Figure 4.2 Tabular view of rare pathogenic variants: The tabular view of personal genomic information provided by the GET-Evidence report
offers the ability to interactively sort personal genomic information according to various properties or annotations. This sorting functionality goes
beyond the functionality typically provided by DTC genomics services. However, the heavy reliance on text limits the effectiveness of this visual
representation. Image from GET-Evidence report.

Despite their lack of stimulating aesthetics, there are a number of pragmatic reasons why these companies and third-party tools have chosen to use tabular views:

• Familiarity: A quick scan of any popular magazine or daily newspaper will reinforce the notion that modern humans are generally familiar with information presented in a tabular format. From train schedules to sports scores, our perception has been well conditioned to quickly scan, interpret, and assess information represented in this way. Given any tabular representation of personal genomic information, most individuals would automatically understand how to scan the header row of the table to read the types of data in each column, and would innately understand that the data on each row are related to a single item (e.g. information related to a single SNP).
• Summarization: Although tables can be used in numerous ways, they are commonly used to summarize information, and therefore most individuals are conditioned to perceive summary data using a tabular representation. One of the most common uses of tabular summarization is the representation of frequency tables, where each row typically represents a count or proportion of the occurrence of some value in a particular group or interval (e.g. the percentage of individuals who responded “yes” or “no” to particular questions in a survey). Tabular views are frequently used by DTC companies in their user interfaces to summarize quantitative (e.g. combined genetic disease risk) or categorical (e.g. presence or absence of a trait) properties of a personal genome.
• Sorting/ranking: Another familiar or intuitive property of tabular views is the notion of sorting or ranking the rows according to values in one or several of the columns. Most readers will be familiar with the tabular views utilized by online merchants that typically allow items to be sorted according to price. We can just as easily imagine sorting or ranking tabular summarizations of annotated personal genomic information according to the odds ratio or allele frequency of a particular SNP. The ability to sort or rank data in a tabular view adds another dimension along which the data can be compared.
• Transparency: Of course, due to pervasive use of spreadsheet paradigms and row-based databases, data is often stored using tabular representations. Therefore tabular representations are often the most straightforward way to display “raw” data underlying a personal genome, or cross-reference this information with other relevant data tables (e.g. a look-up table containing disease-associated SNPs). One example of this is the GET-Evidence Variant Report tool, which displays rows of metadata for annotated variants that are found in a personal genome by cross-reference.

Despite these positive attributes, there are a number of reasons why tabular views may be insufficient or limited for visualizing personal genomic information:

• Limited ability to show relationships: Perhaps one of the most striking limitations of tabular views for visualizing personal genomic information is the limited ability to show relationships between row entities. Nested tables are sometimes used to imbue hierarchy into tabular views, but this approach can quickly clutter a table, and cannot be applied when there are many-to-many relationships between entities. To illustrate this limitation in the context of personal genomics, take a look at the tabular representation of disease risk presented by the 23andMe interface (Figure 4.1). In this case, each disease is represented as a discrete row; however, what is not shown are the known relationships between the diseases, which can be important for evaluating disease risk. For example, Type 2 diabetes is known to be a risk factor for myocardial infarction (i.e. heart attack), yet this important disease interaction is lost in the tabular representation. In fact, the majority of interfaces provided by DTC companies and third-party tools succumb to this same limitation due to their use of tabular views for representing disease risk.
• Limited perception of trends: While tabular views might provide the most accurate representation of the underlying data, in that they are often used to display “raw” or minimally processed data, the limitations of human perception can constrain what can be effectively communicated using tabular views. In particular, human perception is decidedly poor in its ability to distinguish trends from tabular data representations. Although human perception is attuned to perceiving rank order in sorted tables, it is quite limited in its ability to assess the possible magnitude or slope of a trend, or the magnitude of the difference between trends in a multidimensional table. For example, as shown in Figures 4.3 and 4.10, the dramatic effect of SNP selection on running genetic risk is much more apparent in the graphical representation of the data in comparison to the tabular representation.

[Graph: Running Total (By Likelihood Ratios). y-axis: Adjusted Probability (%); x-axis: SNP Index (Ordered By Study Size); plotted series: Running LR and Prior.]

dbSNP     Genotype  Study size  Imputed from   R squared  LR     Running LR  Probability
Prior                                                     0.311  0.311       23.700%
2283228   AC        9387        2237892 (CT)   1          0.833  0.259       20.563%
9295475   AA        9294                                  0.922  0.239       19.266%
6769511   TT        9294                                  0.870  0.208       17.185%
9460546   TT        9294        7754840 (GG)   1          0.872  0.181       15.330%
4376068   AA        9294        1470579 (AA)   1          0.887  0.161       13.844%
9465871   TT        5000                                  0.905  0.145       12.696%
9939609   AA        5000                                  1.295  0.188       15.853%
2792248   AG        3093                                  0.920  0.173       14.779%
7901695   CT        2877                                  1.518  0.263       20.844%
7903146   CT        2877                                  1.542  0.406       28.874%
3020317   TT        1929                                  0.928  0.377       27.354%
985694    CT        1929                                  1.200  0.452       31.123%
726281    AG        1929                                  1.026  0.463       31.668%
1884051   AA        1929                                  0.917  0.425       29.816%
3753242   CC        772                                   0.970  0.412       29.186%

Figure 4.3 Likelihood ratios plot juxtaposed with underlying tabular data: The 2-D graph displays the Running Likelihood Ratio (LR) quantity
from the table below the graph. Although both the graph and the table represent the same basic Running LR data, the trend of this data is much
more apparent in the graph representation. In this case, the graph readily reveals that the included SNPs in the risk model first cause the overall
risk estimate to fall below the population average (red line), but then to trend above the average as additional SNPs are added to the model. This
trend is perceptible, although less apparent in the tabular representation of the data.
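
The running values in Figure 4.3 can be approximately reproduced from the prior and the per-SNP likelihood ratios. The sketch below is a reconstruction, not the authors' code. It assumes that the "Running LR" column holds running odds (the prior row's 0.311 equals 0.237/0.763), that each SNP's likelihood ratio is multiplied in as if the SNPs were independent, and that the rounded LR values printed in the table are close enough to the unrounded ones. The function name `running_probabilities` is hypothetical.

```python
PRIOR_PROBABILITY = 0.237  # pre-test probability from the "Prior" row

# Per-SNP likelihood ratios, in the row order of the Figure 4.3 table.
LIKELIHOOD_RATIOS = [0.833, 0.922, 0.870, 0.872, 0.887, 0.905, 1.295, 0.920,
                     1.518, 1.542, 0.928, 1.200, 1.026, 0.917, 0.970]


def running_probabilities(prior, ratios):
    """Multiply the prior odds by each LR in turn.

    Returns the probability before any SNP and after each successive SNP.
    """
    odds = prior / (1.0 - prior)
    probabilities = [prior]
    for lr in ratios:
        odds *= lr
        probabilities.append(odds / (1.0 + odds))
    return probabilities


trajectory = running_probabilities(PRIOR_PROBABILITY, LIKELIHOOD_RATIOS)

# Index of the step with the lowest running probability (0 is the prior).
lowest_step = min(range(len(trajectory)), key=trajectory.__getitem__)

# A crude text plot: even this makes the dip and rebound easier to see
# than the column of numbers does.
for step, probability in enumerate(trajectory):
    print(f"{step:2d} {probability:6.1%} " + "#" * round(probability * 100))
```

Under these assumptions the computed values track the table's Probability column closely, with small differences that come from the rounded LRs.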

• Inefficiency: Other relevant limitations of human perception are the lack of any specialized ability to perceive and assess large amounts of quantitative information rapidly, and finite attention spans. While the ability to recognize shapes and colors quickly and accurately was critically important to survival of the human species, the ability to summarize and interpret large tables of quantitative information quickly was not. As such, the human brain does not have any specialized functionality for processing this type of information, and is comparatively inefficient in doing so compared to many other tasks of perception. One needs to simply open a newspaper to a full page of stock quotes, or stare at a large spreadsheet of numbers to feel the weight of the cognitive load that such representations impart. Therefore, tabular views may cause the viewer to expend comparatively large amounts of time and energy to interpret if they are sufficiently large or complex. In addition, humans have particularly low attention thresholds in perceiving such data, and often prefer to simply focus on the top-most portion of the data. This tendency is captured in the behavior of people who use web search engines, where studies find that more than 90% of web searchers tend to click links found in the first set of search results presented, despite the fact that relevant links might be found in the next several pages of search results. These variant reports demonstrate these limitations using personal genomic information (Figure 4.3). Most viewers will have a tendency to focus on the first few rows displayed in the table and use the sorting feature to adjust this view, rather than scroll down and view all the rows presented by the viewer.

Although tabular views are commonly used for visualizing personal genomic information, we hope that readers will understand the strengths and weaknesses of these views as they interpret their own personal genomic information using established tools, or embark on the development of their own visualizations.

4.3 Ideograms

Ideogram representations are used to display iconographic maps of genomic information using the visual metaphor of chromosomal organization. Because the term karyotype is also used to refer to a visual representation of chromosomes and their organization, ideograms are sometimes called karyograms or digital karyotypes (if computer generated), or virtual karyotypes. Karyotype representations are born out of a subfield of genetics called cytogenetics, which is largely concerned with the study of relationships between chromosomal structure and organization, and how this relates to the structure and function of cells (Figure 4.4). Early efforts in karyotyping used a technique called G-banding, in which chromosomes are stained with a chemical solution that will cause dark bands to appear at certain locations in the chromosomes, which are related to A-T rich regions and contain fewer functional gene regions. Many digitally generated karyotypes (i.e. ideograms) will shade chromosomes according to locations of known genes to simulate this G-banding. However, more modern cytogenetic techniques, such as fluorescence in situ hybridization (FISH), can offer more detailed, and colorful, representations of chromosomal structure and organization.

In addition to their practical use for studying chromosomes for genetic research, karyotypes have also been used as diagnostic aids in medicine to detect and diagnose chromosomal abnormalities in individual patients. In this regard, a clinician will typically scan an individual’s karyotype to look for signs of aneuploidy, or an abnormal number of chromosomes. One of the most widely known aneuploidies in humans, especially for those who have experienced prenatal medical care, is trisomy 21 (i.e. having three copies of chromosome 21), which is the primary etiological factor underlying Down’s syndrome. Trisomy 21 is a congenital aneuploidy because it is present from conception. Other common congenital aneuploidies in humans include Edwards syndrome (trisomy 18), Patau syndrome (trisomy 13), and Turner syndrome (monosomy of the X chromosome in females). Somatic aneuploidy, i.e. chromosomal abnormality arising after birth and into adulthood, is also observed in many cancer cells, particularly in the hematological cancers known as leukemias. Because full genome sequencing is not yet commonly performed on infants or to profile individual patient tumors, ideographic representations of individual karyotypes from personal genomic information typically conform to the “normal” arrangement of human chromosomes. However, there are methods for detecting large structural variations in personal genomic information (discussed in Chapter 11), and methods to visualize these variations are likely to become an important aspect of spatially visualizing personal genomic information.

[Figure 4.4 image: panels (a) to (d), each showing G-banded human chromosomes next to ideograms labeled with cytogenetic band numbers.]

Figure 4.4 The biomimetic basis of ideograms: Ideograms are digital graphic representations of chromosomes that are inspired by the classical
cytogenetic techniques for karyotyping chromosomes using chemical stains. The dark bands used to distinguish regions of an ideogram are analogs
of the G-banding pattern that is observed as the result of karyotype staining of human chromosomes. This figure shows micrographs of
several G-banded human chromosomes next to their ideogram analogues. Reproduced from Harada, N. et al. Subtelomere specific microarray
based comparative genomic hybridization: a rapid detection system for cryptic rearrangements in idiopathic mental retardation. Journal of Medical
Genetics 41, 130–6 (2004) with permission from BMJ Publishing Group Ltd.

From the standpoint of personal genomics, the primary utility of ideogram representations is to visualize patterns of personal genomic information across the genome, and to orient exploration of personal genomic information (Figure 4.5). The primary benefits of using ideograms to visualize personal genomic information include the following.

• Biological organizing principle: Ideograms organize personal genomic information in a manner that is analogous to the way in which an individual’s DNA is organized in their cells. This allows for visual orientation around important biological features of chromosomal DNA, such as centromeres and telomeres. Additionally, recall

Chromosome  Bases  Genes  SNPs
1           247M   2815   81k
2           242M   1860   81k
3           199M   1497   66k
4           191M   1136   57k
5           180M   1276   58k
6           170M   1575   72k
7           158M   1512   56k
8           146M   1013   51k
9           140M   1245   44k
10          135M   1105   52k
11          134M   1850   51k
12          132M   1370   49k
13          114M   545    38k
14          106M   1315   32k
15          100M   998    31k
16          88M    1098   31k
17          78M    1494   30k
18          76M    429    29k
19          63M    1743   19k
20          62M    768    24k
21          46M    360    14k
22          49M    749    15k
X           154M   1376   30k
Y           57M    284    5k
MT          16k    37     4k

Figure 4.5 Using ideograms for visual orientation: Commercial DTC genomics companies will often employ ideogram representations in their
products to enable the visualization or browsing of personal genomic information guided by the biological organizing principle of chromosomal
arrangement. The top graphic demonstrates the use of ideograms to facilitate the navigation of the raw genotype data for an individual. This view
enables an individual to click on a chromosome to begin exploring the personal genotyping data available for the selected chromosome. The
bottom graphic demonstrates the use of ideograms for organizing the results of analysis based on personal genomic information. In this case, the
ideogram is color coded to represent the predicted ancestral origins of each chromosomal region for a personal genome. Image from Lumigenix
genetic testing service.
[Figure 4.6 screenshot. Karyotype level: chromosomes 1 to 22, X, and Y. Chromosome level: chromosome 11, Start: 1, End: 135006516, Length: 135006516, with bands p15.5 through q25; Chromosome Start: 112655437, Chromosome End: 115655582. Zoom level: chromosome 11, Start: 112785528, End: 112851103, Length: 65576; tracks: Gene Names (1), Variable Regions (0), Cancer Mutations (7), Son (56), Mom (67), Dad (67), Genes Involved in Disease (1); gene shown: DRD2 (q23.2).]
Figure 4.6 Linked ideogram visualization: The browsing interface provided by the myKaryoView tool uses ideogram representations at multiple levels of detail to enable interactive browsing
and navigation from the broader genome level to individual chromosomes, and the individual personal genome sequence level data of individuals. All three levels are linked such that changes
in one level of detail, such as a shifting of focus to a different chromosome at the top level, are reflected with appropriate changes at all three levels of detail. Image from myKaryoView tool.
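
The coordinates shown in the Figure 4.6 screenshot give a sense of why linked views at several levels of detail are useful. The sketch below is an illustration, not part of myKaryoView. It maps the zoom-level window onto a chromosome-wide bar of an assumed width of 1000 pixels; the function name `pixel_span` and the pixel width are assumptions.

```python
# Coordinates read from the Figure 4.6 screenshot.
CHROM_11_LENGTH = 135_006_516              # chromosome-level view
ZOOM_WINDOW = (112_785_528, 112_851_103)   # zoom-level view (65,576 bases)

OVERVIEW_WIDTH_PX = 1000  # assumed drawing width of the chromosome bar


def pixel_span(start, end, chrom_length, width_px):
    """Map a 1-based inclusive base-pair interval to a half-open pixel range."""
    left = (start - 1) * width_px // chrom_length
    right = -(-end * width_px // chrom_length)  # ceiling division
    return left, max(right, left + 1)


bases_per_pixel = CHROM_11_LENGTH / OVERVIEW_WIDTH_PX
left, right = pixel_span(*ZOOM_WINDOW, CHROM_11_LENGTH, OVERVIEW_WIDTH_PX)

print(f"one pixel covers about {bases_per_pixel:,.0f} bases")
print(f"zoom window occupies pixels {left} to {right} ({right - left} px wide)")
```

At this assumed width each pixel stands for roughly 135,000 bases, so the entire zoom window collapses into a single pixel of the chromosome-level view. A single ideogram cannot show both scales at once, which is the gap the linked levels are meant to bridge.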

that variants that are nearby are often inherited together, a phenomenon known as linkage disequilibrium (LD; see Chapter 1). To this end, variants in neighboring genes can be quickly visualized using ideograms.
• Spatial orientation: Many genetic studies report findings using a standardized nomenclature to refer to a particular genomic region, such as a region associated with a particular disease mechanism. For example, eye color and skin pigmentation traits are associated with the region 5p13.2 (see Chapter 1). An ideogram makes it easy to use this genomic coordinate nomenclature to navigate to relevant regions on a personal genome.
• High information density: One major benefit of an ideogram representation is that it offers a compact means to visualize the entire breadth of information contained in the more than 3 billion base pairs represented in a personal genome.
• Standardization: Linear representations of chromosomes are often used in genomics literature, such as the Manhattan plots often found in genome-wide association studies (GWAS). Ideograms are effective tools for visualizing these data as they map directly to linear representations.

Despite these positive aspects of ideograms, their frequent use for representing genomic information should not serve as a sign that they should be considered as necessary for genomic visualization. There are several reasons why ideograms may not be appropriate for visualizing personal genomic information.

• Lack of general familiarity: While the chromosomal metaphor embodied by ideograms may be intuitive for those who have formal training in genetics, it is unlikely for amateurs, hobbyists, or lay individuals to find them familiar or intuitive. Even many medical professionals are unlikely to find familiarity or be comfortable with ideogram representations.
• Biological analogy not always beneficial: Although ideograms organize personal genomic information in a manner that is analogous to the biological organization of DNA in the cell, such efforts to maintain biological accuracy may obscure or limit the primary purpose of personal genome visualization, which is to convey important information about a personal genome. Therefore, factors such as the intended audience and the nature of the information being communicated should take precedence over biological accuracy.
• Limited resolution: One of the main benefits of the ideogram representation mentioned previously is that it provides a relatively compact and succinct means to visualize the tremendous quantity of information in a personal genome. However, as with most visualizations, this benefit comes as a trade-off that can limit other aspects of the visualization. In this case, the compact representation of the ideogram limits the resolution at which information can be displayed. Therefore, ideograms may be useful for showing gross patterns along a chromosome, such as recombination patterns, but insufficient for showing details that require base-pair level resolution, such as the absence or presence of a trait-associated allele.

The pervasiveness of ideograms in genome visualization is largely driven by the advantages that this representation provides. However, this visualization technique should only be implemented after careful consideration for the purpose of the visualization, the intended audience, the nature of the information being communicated, and the known limitations. Some of the limitations of ideograms can be compensated for by using them in combination with other visualization modalities, such as the brushing and linking technique used by myKaryoView to connect visualizations representing different levels of detail (Figure 4.6).

4.4 Genome browsers

It could be easily argued that the visualization modality known as the genome browser represents the ancestor of most modern approaches to visualize genomic information. When researchers needed a means to provide access to and disseminate the initial draft data for the Human Genome Project, a team at the University of California Santa Cruz (UCSC) developed what has become

Figure 4.7 Linear genome browser with multiple tracks of genomic information. Linear genome browsers, such as the venerable UCSC Genome Browser shown in this graphic, typically enable
rich, multidimensional views of genomic information by providing the ability to add synchronized “tracks” of information along with a genome sequence. These tracks can contain additional
genomes for visual comparison, or other sources of relevant genomic information, such as the locations of disease variants, measures of evolutionary conservation, or functional genomic
information such as transcription factor binding regions. Image from UCSC Genome Browser.

one of the most widely used and recognized tools in genomic research, the UCSC Genome Browser. The UCSC Genome Browser provides a visualization interface that renders the genome sequence data as a linear sequence at varying levels of granularity and allows the viewer to navigate along the length of the sequence, jump to specific locations in the genome using genomic coordinates, or zoom in and out to varying levels of detail (Figure 4.7). Since its initial release, the UCSC Genome Browser has evolved into a generalized tool for visualizing nearly any type of genomic information, including the genomes of other species, and annotative data such as the locations of polymorphisms or transcription factor binding regions. Since the inception of the UCSC Genome Browser, many additional genome browsers have been created using the same visualization paradigm, giving way to an entire class of linear genome browsers.

4.4.1 Linear genome browsers

Linear genome browsers present genome sequence data along a linear coordinate system that allows for viewing and navigation of the sequence data along the entire length of the sequence at varying levels of detail. One of the core features of most genome browsers, and one of their primary strengths, is the ability to visualize and navigate multiple tracks of data in parallel with the genome sequence data. These tracks can incorporate any type of information that can be mapped to a genomic coordinate system, such as the location of disease-associated SNPs, or the relative evolutionary conservation of positions across mammalian genomes. The benefit of this approach is that it enables the visualization of multiple forms and dimensions of genomic information across parallel tracks in the same field of view. Each individual track can employ any type of visualization or glyph that fits on to the genomic coordinate system, and can scale appropriately as the user zooms in and out to varying levels of detail. For example, the relative evolutionary conservation of positions or regions might be represented by a histogram to convey the relative degree of conservation (Figure 4.7). If designed appropriately, these visualizations can engage the powerful pattern matching capability of human perception to enable the discovery and visualization of informative patterns in genomic information.

Linear genome browsers provide an effective visual framework for several important aspects of exploring and interpreting personal genomic information:

• Browsable access to raw data: Because linear genome browsers typically incorporate standard genomic coordinate systems (e.g. based on a reference genome), they can offer a convenient means for browsing or searching along personal genome sequence data. Furthermore, most linear genome browsers employ a straightforward biological organizing principle, in which the data are organized by their linear orientation along human chromosomes. This makes it straightforward to navigate to regions containing specific genes, or other functional elements, whose chromosomal locations and exact reference coordinates are typically known a priori.
• Integrative assessment: One of the main benefits of linear genome browsers is the ability to render integrated views of genomic information in which different data types are rendered in parallel tracks along a common coordinate system. This paradigm can be useful for mostly qualitative functional assessment of personal genomic information, where functional genomic information can be rendered in parallel to a personal genome sequence to visually assess possible functional consequences of personal genome variation. For example, a personal genome might be visualized in a linear genome browser along with one parallel track showing population allele frequency information, another track showing the strength of transcription factor binding activity measured from some population reference, and another showing a measure of the conservation of each position across mammals (Figure 4.7). Visual inspection of these tracks might quickly identify positions where an individual harbors a rare (< 1% MAF) or personal variant allele at a position that is both highly conserved across mammals (and therefore likely to be functionally important) and located within a peak region of transcription
factor binding activity upstream of a gene. Such positions or regions might be rapidly identified by visual inspection in a genome browser, and further investigated using quantitative or experimental methods for functional assessment.
• Comparative assessment: Because linear genome browsers are able to lay out multiple personal genome sequences along parallel tracks that are spatially synchronized along a common coordinate system, they are useful for facilitating visual comparative analysis of personal genome sequences. There are a number of scenarios in which comparative analysis of personal genomes may be useful and informative. For example, consider a scenario in which there is a need to assess differences between the somatic DNA sequence of an individual cancer patient and the complete DNA sequence of the patient’s tumor genome. In this scenario, a pathologist might use a linear genome browser to compare the sequences at a low level of detail to scan for large structural differences (e.g. segmental insertions or deletions) between the two genomes, or zoom in to a finer level of detail to scan for single-nucleotide variants in known oncogenes. Another example is the comparative assessment of the personal genomes of family members, where the family is suspected of harboring rare disease variants. Although pedigree analysis is typically used for such scenarios, a linear genome browser might be used to visually identify genomic regions with obvious differences between affected and unaffected individuals in the family.

Despite their benefits and popular use, there are a number of potential limitations of linear genome browsers that might limit their usefulness in exploring personal genomic information:

• Use of reference coordinate system: While the reference coordinate system used by linear genome browsers is necessary for spatially synchronizing the data across multiple tracks, it can place some restrictions on the type of data that can be represented. For example, it can be difficult to represent variation in tandem repeat regions, which can have functional consequences. Also, it may not be possible to represent significant structural variations, such as inversions or other complex rearrangements, using linear coordinate systems.
• Inability to visualize non-linear interactions: While a linear representation of a genome serves as a straightforward and intuitive paradigm for exploring and visualizing a personal genome, it is far removed from the actual biological context of a genome, which is coiled around histone proteins and packaged into chromatin in the cell nucleus. Within the nuclear environment, genomic DNA participates in a large number of non-linear interactions, such as long-range chromatin interactions between non-coding regulatory regions and gene-coding regions facilitated by transcription factors. Genomic regions that are distant in linear terms (i.e. many megabases apart) may actually be relatively proximal within the chromatin structure of the cell. Such interactions can be difficult to visualize or interpret using linear representations of a personal genome.
• Cognitive overload: Linear genome browsers excel at facilitating comparative analysis of local genomic regions (e.g. a region containing the upstream promoter and exons of a single gene) across parallel tracks of information; however, this modality requires the user to specify a genomic region of interest a priori. Linear genome browsers are less useful for ad hoc browsing of a personal genome to identify gross patterns of interest, because the density of information across even a single chromosome can quickly overload the cognitive machinery of human perception, limiting its ability to identify meaningful patterns in the visual data.

4.4.2 Nonlinear genome browsers

Due to the limitations inherent to linear genome browsers, several efforts to develop non-linear genome browsers that aim to overcome some of these limitations have emerged. While these efforts are relatively nascent compared to the long tradition of the linear genome browsers, they have already produced a number of notable tools with clear advantages over the linear genome browser paradigm.

• Annular genome browsers: Annular, or “ring-like”, genome browsers arrange personal genomic information using a circular composition, where the genomic information is represented as a circular band, or track. The most notable

[Figure 4.8 graphic: chromosomes 1–22 and X arranged around a circle, with callouts for interchromosomal rearrangement, intrachromosomal rearrangement, point mutation, and copy-number change.]

Figure 4.8 Circular genome graphic displaying multiple layers of information for a cancer genome. Circular representations of genomic
information, such as the Circos plot of cancer genome information shown in this figure, allow for dense representations of genomic information
that are still easily interpreted by human perception. Like linear genome browsers, multiple tracks of information are used to add various data
dimensions to the visualization. However, circular representations allow for the representation of long-range relationships, such as the
interchromosomal rearrangements represented by the connecting segments in the center of the graphic here (see also Plate 2). Reprinted by permission
from Macmillan Publishers Ltd: Ledford, H. Big science: The cancer genome challenge. Nature 464, 972–4 (2010).

implementation of this paradigm is the Circos software package (<http://circos.ca>), which has been used to render genomic information in numerous high-profile scientific and popular media publications (Figure 4.8). Other notable tools employing the annular paradigm include Mitowheel (<http://mitowheel.org>), DNAPlotter (<http://www.sanger.ac.uk/resources/software/dnaplotter/>), and CGView (<http://wishart.biology.ualberta.ca/cgview/>). Similar to the linear genome browsers, these browsers can render parallel tracks along the same coordinates as the primary genome track, which are rendered as concentric circular tracks. Circular compositions offer a number of advantages over linear representations. Chief among these is the ability to succinctly visualize interactions or other types of relationships between spatially distant regions, such as regulatory or physical (e.g. protein-protein) interactions. Additionally, circular compositions are capable of rendering greater information densities, and of simultaneously displaying variable-resolution data, where low-resolution representations (e.g. chromosomal organization) can be displayed near the circular center, and higher-resolution information (e.g. exons or SNP locations) can be displayed coordinately along circular tracks at outer radii.

Figure 4.9 Custom genome browser based on the JBrowse software client. There are several projects aimed at providing genome browser frameworks that can be used to serve as the basis for customized genome browsers. The genome browser shown in this figure was adapted from the open-source JBrowse genome browser to provide a customized browser for the Genomes Unzipped personal genomics community. Image from Genomes Unzipped personal genome browser.

• Cartographic genome browsers: Building on the paradigm of traditional geospatial maps, such as the popular Google Maps web tool, cartographic genome browsers allow for navigation and visualization of personal genomic information along a planar coordinate system. Whereas linear genome browsers typically allow freedom of navigation bilaterally along the length of a genome sequence, cartographic genome browsers represent the entire genomic sequence across a plane, and allow for 360° navigational freedom at varying levels of detail (i.e. zoom). To use the traditional roadmap analogy, linear genome browsers are like a constrained map interface that would
only allow you to visualize and navigate along a single road or highway at a time, whereas a cartographic genome browser is like a fully functional map browser that allows you to navigate in any direction across a city or country. Examples of cartographic genome browsers include DNA Guide (<http://dnaguide.com>) and Genome Projector (<http://www.g-language.org/g3/>).

4.4.3 Building custom genome browsers

Although there are numerous genome browsers available, many of them as free downloads or public web tools, there are likely to be many present and future scenarios in which the existing genome browsers lack some necessary functionality. This is likely to be especially true for scenarios involving personal genomes, because the established genome browsers were largely developed to serve the needs of the various genome science and research communities. For instance, a personal genome community, Genomes Unzipped, has developed its own genome browser to facilitate personal genome exploration (Figure 4.9). Because personal genomics has applications and an audience beyond scientific research, it is likely that new types of genome browsers will need to be developed to serve the various aspects of personal genome exploration, ranging from clinical interpretation to genomic genealogy. Fortunately, there are already software tools available that help to facilitate the development of customized genome browsers. These tools provide the necessary toolkit and software “scaffolding” by which a custom genome browser can be created. However, it should be noted that the currently available genome browser toolkits require substantial prerequisite technical knowledge for their use. There does not yet exist a “point-and-click” genome browser toolkit that would allow non-technical individuals to create custom genome browsers with the same ease with which one interacts with modern word-processing software. However, it is likely that such software will emerge in the future, at least in some form, as personal genomics continues to gain interest among broader demographics.

4.5 Visual quantitative assessment

Although genomic data visualizations are often designed to facilitate the display of summary information (e.g. a list or location of variants in a given gene), or to identify general patterns (e.g. locations of genomic rearrangements), visualization techniques can also be used to facilitate quantitative assessment and interpretation of personal genomic information. Visualizations designed for quantitative assessment differ from other visualizations in that their purpose is to guide or otherwise enable the viewer to determine some accurate quantity through the aid of graphical devices. Careful consideration must be taken in the design of such visualizations to ensure that data are represented accurately, and that each graphical element serves a clear purpose, so that ambiguity cannot confuse or distort interpretation. Although graphical representations of quantitative data are commonplace in daily life, such as stock market or sports-score graphs in newspapers, it is non-trivial to design graphical representations that facilitate accurate and effective assessment of quantitative information. One could argue that this is especially true for personal genomic information, which is not only vast and complex, but also deeply personal, with potentially hazardous consequences from misinterpretation. For example, an inaccurate quantitative representation of risk for a particular disease could cause an individual to worry needlessly about their personal health. Although it does not address genomics or biology specifically, we suggest that readers interested in developing these types of visualizations become familiar with the text “The Visual Display of Quantitative Information” by Edward Tufte. This book serves as a fundamental treatise for understanding the theory and practice of designing accurate and effective visual representations of quantitative information.

4.5.1 Nomograms for disease risk assessment

One frequent use of personal genomic information is to assess an individual’s genetic risk for various disease conditions. However, in order to assess genetic risk for even a single disease, a substantial amount of information must be taken into
consideration. This includes the incidence of the disease and the set of variants known to be associated with risk of developing the disease, along with their relevant statistical properties (e.g. the odds ratio and statistical significance of the effect). Information from multiple variants must be combined with the disease incidence data to quantify an individual’s overall genetic risk for developing a disease (Figure 4.10). The process and relevant methods for combining this information will be discussed in more detail in Chapter 6, but here we must consider that each variant contributes unique quantitative information to the final disease risk score, and variants rarely contribute equally to the final composite risk score. Because there are as yet no commonly established guidelines regarding the inclusion of variants in risk assessment models, it is good practice to reveal the specific variants that comprise a particular genetic disease risk model, and also to represent their individual contribution to the composite risk score.

[Figure 4.10 graphic: scatter plot with x-axis SNP Index (1–17) and y-axis Posterior Probability of developing Type 2 Diabetes (%) (10–32); points are labeled with SNP identifiers: rs726281, rs985694, rs1884051, rs7903146, rs3753242, rs3020317, rs2283228, rs7901695, rs9295475, rs6769511, rs9939609, rs9460546, rs2792248, rs4376068, rs9465871.]

Figure 4.10 Graphical representation of multilocus genetic risk for Type 2 diabetes estimated from a personal genome. Each point represents a SNP locus in the genetic risk model; the points are plotted to show the running posterior probability of developing disease based on the inclusion of individual risk variants.

Most DTC genomics companies have internal policies governing the inclusion of variants into disease risk profiles, and do provide visual representations showing both the overall risk for a disease and the contributions of individual variants to the disease risk model. However, these representations are not designed to facilitate decision making, but rather to simply display the constituent elements of the disease model. Therefore, they do not enable the individual to assess the specific quantitative effects of inclusion or exclusion of a particular variant. Of course, this is reasonable, because one aspect of the perceived value of DTC genomics services is that they obviate the need for the customer to perform any complex decision making regarding the quantitative assessment of their genomic information. However, these visualizations essentially preclude individuals from easily drawing quantitative assessments that would deviate from the model imposed by the company (e.g. removing a variant discovered in a population with a different ethnic background than the individual).

One novel approach towards a graphical representation that enables precise visual quantitative assessment of personal genetic disease risk is based on a classical graphical computing paradigm, the nomogram. Nomograms were invented by the French mathematician Philbert Maurice d’Ocagne in the late 1800s as a means to facilitate visual computation of complex mathematical equations. Nomograms became popular in medicine as a graphical tool for visually computing probability equations pertaining to diagnosis. For example, a nomogram might be designed such that it would allow a physician to easily compute the post-test probability of a particular disease diagnosis given the results of a particular blood test, which has some known probability characteristics (e.g. likelihood ratio) associated with it (Figure 4.11). The visual computation would be performed by marking a straight line from the pre-test probability value through the quantitative result of the test to arrive at an estimated post-test probability on the right-most vertical axis.

To facilitate visual quantitative assessment of individual genetic risk based on a personal genome in a clinical setting, Ashley et al. described a novel genetic risk nomogram that enables the visual assessment of multi-allelic genetic risk for a disease using variable inclusion criteria for risk variants (Figure 4.12). In this representation, the disease prevalence, which represents the pre-test probability of disease risk, is represented by a blue dot at the top of the graph. Then, the entire set of SNPs reported to be associated with risk of the disease are represented
sequentially below the dot. For each SNP, relevant annotative data is shown in the left columns: any genes associated with the SNP, the SNP identifier (rsID), and the patient’s genotype at the SNP locus. The right columns contain quantitative information for each SNP: the likelihood ratio of disease risk conferred by the patient’s genotype at the locus, the number of published studies reporting the SNP’s association with the disease trait, and the number of samples (i.e. individuals) measured in those studies. The graphical marks in the center of the nomogram relate to these quantities: each SNP is represented by a box, the size of the box is proportional to the number of samples used to determine the SNP association, and the shade of the box represents the number of distinct studies reporting the association (more studies result in darker shades). As is apparent in Figure 4.12, the set of associated SNPs is ordered in descending fashion according to the sample size and number of studies, putting SNPs with the greatest statistical confidence near the top of the order. The boxes are arranged on a horizontal axis according to the running post-test probability of disease estimated from the chain of alleles leading up to that SNP in descending order. This allows the observer to visualize the effect of each individual SNP on the overall risk model. To determine the individual’s multi-allelic genetic risk for the disease, an observer would start from the pre-test probability at the top, scan down the list of variants, and stop at the last SNP that meets some inclusion criterion, such as a minimum number of samples used to determine the association. Then, the observer would look at the value of the post-test probability in the right-most column to determine the individual’s genetic risk of disease based on the set of SNPs included up until that point.

The genetic disease risk nomogram introduced by Ashley et al. has several advantages over the representations employed before it. First, it generally follows good quantitative graphical design principles, such as showing the underlying data in a complete and coherent manner, having a good data-to-ink ratio, and having a clear purpose for the design and orientation of graphical marks. Another notable aspect of this visualization is that it integrated a novel form of clinical data (genomic risk markers) with a graphical paradigm that was already established and familiar to those in the clinical domain (nomograms). By leveraging a graphical paradigm that is already established in the domain, the cognitive burden or hindrance to understanding is substantially reduced.

[Figure 4.11 graphic: three vertical axes labeled Pre-test probability (0.1 at the top to 99 at the bottom), Likelihood ratio (1,000 at the top to 0.001 at the bottom), and Post-test probability (99 at the top to 0.1 at the bottom).]

Figure 4.11 Schematic representation of a nomogram. Nomograms are visual computing devices that have traditionally been used in medical practice to visually estimate the posterior probability of a diagnosis given the results of a medical test. The left-most axis represents the pre-test probability of a diagnosis, the middle axis represents the likelihood ratios of test outcomes, and the right-most axis represents the post-test probability, which corresponds to multiplying the pre-test odds by the likelihood ratio of the test outcome and converting the result back to a probability. The idea behind a nomogram like the one shown is to begin at the pre-test probability axis, extend a straight line from the appropriate pre-test probability value to intercept the middle axis at the value of a test result, and terminate the straight line on the post-test probability axis to determine the value. Image reprinted with permission from Morgan et al. 2010.
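
The straight-line reading of a diagnostic nomogram such as the one in Figure 4.11 corresponds to a short calculation: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back to a probability. The following Python sketch illustrates that calculation only; the function names are illustrative and are not taken from the chapter or from any particular library.

```python
def probability_to_odds(probability: float) -> float:
    """Convert a probability strictly between 0 and 1 to odds."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return probability / (1.0 - probability)


def odds_to_probability(odds: float) -> float:
    """Convert non-negative odds back to a probability."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Numerical equivalent of drawing one straight line across the nomogram."""
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood ratio must be positive")
    pre_test_odds = probability_to_odds(pre_test_probability)
    return odds_to_probability(pre_test_odds * likelihood_ratio)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the direction of the effect.
    for lr in (0.5, 1.0, 2.0, 10.0):
        print(f"pre-test 10%, LR {lr}: post-test {post_test_probability(0.10, lr):.1%}")
```

A likelihood ratio of 1 leaves the probability unchanged, values above 1 raise it, and values below 1 lower it, which matches the geometry of the three axes.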

D Alzheimer’s disease
Pre-test probability: 9·0%

Gene*    SNP location  Patient genotype  LR    Studies†  Samples‡  Post-test probability (%)
TOMM40   rs157581      CT                1·6   6         7740      13·90%
DAPK1    rs4878104     TT                0·7   5         10 397    10·19%
TRAK2    rs13022344    CT                1·0   4         6512      10·12%
DAPK1    rs4877365     AA                0·6   4         4841      5·89%
EBF3     rs11016976    TT                1·0   3         5736      5·87%
TNK1     rs1554948     AA                0·9   3         5736      5·32%
MYH13    rs2074877     CT                1·0   3         5366      5·55%
GALP     rs3745833     CC                0·9   3         5366      4·82%
PCK1     rs8192708     AA                0·9   3         5366      4·47%
         rs1859849     TT                0·9   3         5304      4·02%
         rs11622883    AT                1·0   3         5248      3·97%
WWC1     rs17070145    CC                0·9   3         2545      3·65%
LMNA     rs505058      TT                1·0   2         4646      3·49%
ACAN     rs2882676     CC                0·9   2         4590      3·22%
PGBD1    rs3800324     GG                0·6   2         4590      2·11%
GOLM1    rs10868366    GG                1·1   2         2156      2·30%
GOLM1    rs7019241     CC                1·1   2         2156      2·49%
         rs9886784     CC                0·9   2         2156      2·36%
         rs10519262    GG                0·9   2         2156      2·22%
         rs463946      CG                0·5   2         1922      1·04%
PLAU     rs2227564     CT                0·9   2         956       0·98%
ADAM12   rs1278279     GG                1·2   1         2320      1·23%
SORL1    rs2070045     GT                1·1   1         2031      1·36%
ABCA1    rs2230806     CT                1·1   1         1691      1·50%
PSEN1    rs165932      GT                0·9   1         170       1·37%

Horizontal axis: Risk (%), logarithmic scale with ticks at 0·1, 1, 10, 100

C Prostate cancer
Pre-test probability: 16%

Gene*    SNP location  Patient genotype  LR    Studies†  Samples‡  Post-test probability (%)
         rs1447295     CC                0·9   19        56 485    15%
TNRC6B   rs9623117     TT                0·9   8         35 869    14%
DAB2IP   rs1571801     GT                1·2   6         13997     16%
         rs6983267     GT                1·0   3         3985      16%
CDH1     rs16260       CC                0·8   3         2238      13%
         rs6983561     AA                1·0   2         1846      12%
         rs1551512     TT                0·9   2         1846      12%
MMP2     rs1477017     AG                1·2   1         2878      13%
HIF1A    rs11549465    CC                1·0   1         2878      14%
MMP2     rs11639960    AG                1·2   1         2878      16%
ESR2     rs2987983     AG                1·1   1         2216      18%
TLR10    rs4129009     TT                0·9   1         2163      17%
TLR10    rs4274855     CC                0·9   1         2163      16%
TLR1     rs5743604     AA                0·9   1         2163      15%
         rs7837688     GT                1·7   1         2139      23%
         rs4242382     GG                0·9   1         2139      21%
         rs10086908    TT                1·0   1         2139      22%
         rs7000448     TT                1·1   1         1012      23%

Horizontal axis: Risk (%), logarithmic scale with ticks at 10, 100

Figure 4.12 Genomic nomograms for genetic disease risk assessment. The genomic nomogram was introduced by Ashley et al. as a visual
representation to facilitate clinical assessment and interpretation of genetic disease risk in a personal genome. Disease-associated SNPs are shown
in decreasing order of sample size and number of studies showing association. Darker shaded boxes indicate SNPs having the most studies
reporting association with disease. The sizes of the boxes are scaled in proportion to the logarithm of the number of samples used to calculate the
likelihood ratio (LR). The ordering places the SNPs reported in the most and largest studies, which carry the most confidence for
their association with disease, at the top of the graph. Using this visualization, a clinician could scan down the list from the most confident associations towards the least
to choose a threshold based on personal criteria for SNP inclusion. For example, the clinician may only have confidence in SNPs that have been
reported to be associated with the disease in two or more studies. In this case, they would stop scanning the list of SNPs at the last SNP meeting
the minimum study criteria and look at the post-test probability to the right to determine the individual’s post-test probability of disease given the
pre-test probability and the likelihood ratios of the SNPs included into the model up to the chosen point. Image reprinted from Ashley, E. A. et al.
Clinical assessment incorporating a personal genome. Lancet 375, 1525–35 (2010) with permission from Elsevier.
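
The scanning procedure described in the caption can be sketched in a few lines of Python. This is an illustrative sketch, not the method used by Ashley et al.: the class and function names are hypothetical, the sort key (studies first, then samples) is an assumption based on the ordering visible in the figure, and the example rows are a subset of the prostate cancer panel of Figure 4.12. Because the figure displays rounded likelihood ratios, values computed this way may differ slightly from the printed post-test probabilities.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SnpEvidence:
    rsid: str
    likelihood_ratio: float  # LR conferred by the patient's genotype
    studies: int             # number of studies reporting the association
    samples: int             # number of individuals measured in those studies


def running_post_test(pre_test_probability: float,
                      snps: List[SnpEvidence]) -> List[Tuple[SnpEvidence, float]]:
    """Chain likelihood ratios in order of decreasing evidence (assumed sort key)."""
    ordered = sorted(snps, key=lambda s: (s.studies, s.samples), reverse=True)
    odds = pre_test_probability / (1.0 - pre_test_probability)
    trail = []
    for snp in ordered:
        odds *= snp.likelihood_ratio
        trail.append((snp, odds / (1.0 + odds)))
    return trail


def risk_at_threshold(pre_test_probability: float,
                      snps: List[SnpEvidence],
                      min_studies: int) -> float:
    """Stop at the last SNP meeting the study threshold and report the running risk."""
    risk = pre_test_probability
    for snp, probability in running_post_test(pre_test_probability, snps):
        if snp.studies < min_studies:
            break
        risk = probability
    return risk


# Subset of the prostate cancer rows shown in Figure 4.12.
PROSTATE_PRE_TEST = 0.16
PROSTATE_SNPS = [
    SnpEvidence("rs1447295", 0.9, 19, 56485),
    SnpEvidence("rs9623117", 0.9, 8, 35869),
    SnpEvidence("rs1571801", 1.2, 6, 13997),
    SnpEvidence("rs6983267", 1.0, 3, 3985),
    SnpEvidence("rs16260", 0.8, 3, 2238),
    SnpEvidence("rs6983561", 1.0, 2, 1846),
    SnpEvidence("rs1477017", 1.2, 1, 2878),
]

if __name__ == "__main__":
    for threshold in (1, 2, 3, 6):
        risk = risk_at_threshold(PROSTATE_PRE_TEST, PROSTATE_SNPS, threshold)
        print(f"at least {threshold} studies: {risk:.1%}")
```

Raising `min_studies` shortens the chain of included SNPs, which is the numerical counterpart of a clinician stopping higher up the list.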

[Figure 4.13 graphic: four parallel vertical axes, each scaled from 0 to 90, labeled Pre-test probability, Body mass index (BMI) (bands < 18, 18–25, 25–30, > 30), Genetics (individual SNP identifiers), and Post-test probability.]

Figure 4.13 Integrating clinical and genetic information for disease risk assessment. Likelihood ratios can be chained together to form a
composite measure of disease risk. Traditional nomograms (left, as in Figure 4.11) could be integrated with genomic nomograms (right, as in
Figure 4.12) to illustrate the effect of both clinical factors (such as height, weight, etc.) and genetic factors on a personal disease risk assessment.
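
The chaining of likelihood ratios mentioned in the caption can be written out explicitly. The following is a sketch under the usual assumption that the factors contribute independently; the symbols $p_0$, $O_n$, and $\mathrm{LR}_i$, and the numbers in the worked line, are illustrative and are not taken from the chapter.

```latex
\begin{align*}
O_0 &= \frac{p_0}{1 - p_0}
  && \text{(pre-test odds from pre-test probability)} \\
O_n &= O_0 \prod_{i=1}^{n} \mathrm{LR}_i
  && \text{(one factor per clinical or genetic axis)} \\
p_n &= \frac{O_n}{1 + O_n}
  && \text{(post-test probability)}
\end{align*}
% Worked example with hypothetical values:
% p_0 = 0.20, LR_BMI = 2.0, LR_genetics = 1.5
% O_0 = 0.20 / 0.80 = 0.25
% O_2 = 0.25 * 2.0 * 1.5 = 0.75
% p_2 = 0.75 / 1.75 \approx 0.43
```

Taking logarithms turns the product into a sum, so each clinical or genetic factor contributes an additive shift in log-odds.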

4.6 Integrative visualizations environmental factors and genetic factors into a


visual risk model (Figure 4.13). This could enable
While the genetic information comprising a personal personal genomic information to be integrated
can alone serve as the basis of useful and informative with other environmental or even clinical (e.g.
visualizations, we can gain powerful new perspec- blood test results) information through a visual
tives on personal genomic information by visually paradigm. If such a visualization were made inter-
combining or comparing it with other forms of infor- active, it might be used to estimate the risk or
mation. We have already discussed a basic example of benefits of various behavioral, environmental, or
this approach demonstrated by the parallel tracks of other interventions visually with regards to amel-
information offered by linear genome browsers. These iorating overall disease risk in light of genetic
types of integrative visualizations can be used to give predisposition.
personal genomic information context, and aid in
interpretation or assessment of the raw genomic infor-
4.6.1 Relational maps
mation. As DNA sequencing or other molecular profil-
ing of personal traits, such as personal microbiomes or personal metabolomes, accompany personal genome sequence profiling in the future, integrative visualizations are likely to serve as critical tools in exploring and interpreting the connections and dynamics between these rich and complex measurements. As a basic example, we could imagine a visualization that uses parallel coordinates to integrate

Relational maps are a visualization technique designed primarily to reveal the relationships between entities. Common implementations include networks, directed acyclic graphs, and cladograms. There are potentially many aspects of personal genomics in which relational map techniques can have applications. For example, if we have personal genomic information for multiple individuals, we may want to explore how similar they might be on a genomic level. The exact methodology for performing this comparison is discussed in Chapter 5; however, from a visualization perspective we could use a relational technique known as a cladogram, or more simply a “tree”. Ashley et al. described a novel relational representation that summarized an individual’s personal disease risk along with environmental factors that are known to play a role in the etiology of each disease, which were connected by lines to indicate the relationship (Figure 4.14). In this representation, diseases that are known to predispose to other diseases were also connected by lines to indicate that relationship. These inter-disease relationships are as important as the individual disease risk, because even though an individual might have low risk for one disease, they may have increased risk for diseases that predispose to it.

[Figure 4.14 image: network diagram. Labels shown: Smoking, Stress, Pesticides, Alcohol, NSAID, Parkinson's disease, MAO inhibitors, Cocaine, Sodium, Anticoagulants, Interferon-α, Abdominal aortic aneurysm, Antihypertensives, Statins, Myocardial infarction, Hypertension, Depression, Coronary artery disease, Type 2 diabetes, Exercise, Asthma, Air pollution, Diet, Obesity, Allergens, Antipsychotics, Prostate cancer, Injury, Osteoarthritis.]

Figure 4.14 Integrative visualization for gene-environment interactions combining personal genomic disease risk estimates with modifiable environmental factors. The integrative gene-environment visualization shown in this figure was introduced by Ashley et al. as a prototype for a gene-environment “report card” which could succinctly and intuitively summarize the complete disease risk profile for an individual patient based on their personal genomic information. The disease labels are scaled proportional to the individual’s estimated overall risk for each disease, and the directed edges connecting the diseases indicate that one disease predisposes to another in the direction of the arrow. The edge of the circle is annotated with circles that represent modifiable environmental factors that are relevant to the diseases. These environmental factors are connected to the diseases whose risk they are known to modulate by dashed arrows. The size of the circle marking the environmental factor is scaled according to the number of diseases it affects, and the color of the circle represents the maximum overall risk for the diseases to which it is connected, where darker colors represent high risk. The innovation behind this visualization is that disease–disease interactions are represented, and that a visual scan of the periphery of the visualization facilitates easy identification of environmental risk factors that are likely to have the largest effect on modifying health risks. Image reprinted from Ashley, E. A. et al. Clinical assessment incorporating a personal genome. Lancet 375, 1525–35 (2010) with permission from Elsevier.
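The two scaling rules given in the Figure 4.14 caption for environmental factors (circle size from the number of diseases affected, circle color from the maximum risk among connected diseases) can be sketched in a few lines of Python. This is only an illustrative sketch: the risk values, the edges, and the function names `factor_encoding` and `upstream_risk` are hypothetical and are not taken from Ashley et al.

```python
# Hypothetical inputs for illustration only; none of these numbers come from Ashley et al.
disease_risk = {
    "Type 2 diabetes": 0.30,
    "Coronary artery disease": 0.45,
    "Obesity": 0.20,
    "Asthma": 0.05,
}

# Directed disease -> disease edges: the first disease predisposes to the second.
predisposes = [
    ("Obesity", "Type 2 diabetes"),
    ("Type 2 diabetes", "Coronary artery disease"),
]

# Environmental factor -> diseases whose risk it modulates (dashed arrows in the figure).
modulates = {
    "Diet": ["Obesity", "Type 2 diabetes", "Coronary artery disease"],
    "Smoking": ["Coronary artery disease", "Asthma"],
    "Air pollution": ["Asthma"],
}


def factor_encoding(modulates, disease_risk):
    """Return {factor: (size, shade)} using the caption's two rules:
    size = number of diseases affected, shade = maximum risk among them."""
    encoding = {}
    for factor, diseases in modulates.items():
        size = len(diseases)
        shade = max(disease_risk[d] for d in diseases)
        encoding[factor] = (size, shade)
    return encoding


def upstream_risk(disease, predisposes, disease_risk):
    """Highest risk among diseases that directly predispose to `disease`."""
    parents = [src for src, dst in predisposes if dst == disease]
    return max((disease_risk[p] for p in parents), default=0.0)


encoding = factor_encoding(modulates, disease_risk)

# Mimic the "visual scan of the periphery": darkest, then largest, circles first.
ranked = sorted(encoding, key=lambda f: (encoding[f][1], encoding[f][0]), reverse=True)
```

The `upstream_risk` helper reflects the point made in the text that a low-risk disease can still matter when a disease that predisposes to it carries higher risk; a real report card would need a principled way to combine these values, which this sketch does not attempt.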

[Figure 4.15 image: pathway diagram covering adipocyte, hepatocyte, and skeletal muscle cells and the pancreatic β-cell. Labels shown include the insulin signalling and adipocytokine signalling pathways; INS, INSR, IRS, PI3K, SOCS, ERK, IKK, JNK, mTOR, PKCζ, PKCδ, TNFα, FFA, ADIPO, GLUT4, GLUT2, GK, PYK, PDX-1, MafA, VDCC, SUR1, Kir6.2, and ROS; and the states obesity, insulin resistance, hyperinsulinism, transient hyperglycemia, impaired insulin secretion, mitochondrial dysfunction, apoptosis, maturity onset diabetes of the young, and Type II diabetes mellitus.]

Figure 4.15 Functional map of metabolic signaling pathways annotated with personal genomic information. In the case of rare variants, we do not typically have population-level statistical data from which to draw inferences about the potential consequences of the genetic variants. One approach for a biological investigation of such rare variants is to integrate personal genomic information with functional biological maps such as known biological pathways. This figure shows a schematic representation of a metabolic signaling pathway associated with insulin signaling and Type II diabetes mellitus as it is represented in the Kyoto Encyclopedia of Genes and Genomes (KEGG) biological pathway database. The stars indicate genes for which an individual was found to have potentially damaging non-synonymous mutations (i.e. potentially deleterious changes to the protein coding sequence) based on an analysis of their personal genomic information. Such information could be used as the basis to formulate or investigate various biological hypotheses concerning individual physiology.
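The overlay described in the Figure 4.15 caption, marking pathway genes that carry potentially damaging non-synonymous variants, can be sketched as a simple join between a pathway node list and a personal variant list. This is a minimal sketch under stated assumptions: the `Variant` record, the `predicted_damaging` flag (standing in for the output of some in silico assessment tool), and the example variants are all hypothetical, and the node list reuses a few labels from the figure rather than the full KEGG pathway definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str                 # pathway node label or gene symbol (hypothetical data)
    consequence: str          # e.g. "non-synonymous" or "synonymous"
    predicted_damaging: bool  # assumed to be precomputed by an in silico tool


def starred_genes(pathway_nodes, variants):
    """Return pathway nodes, in pathway order, that carry at least one
    non-synonymous variant predicted to be damaging."""
    hits = {
        v.gene
        for v in variants
        if v.consequence == "non-synonymous" and v.predicted_damaging
    }
    return [node for node in pathway_nodes if node in hits]


def render(pathway_nodes, variants):
    """Plain-text stand-in for the figure: append '*' to starred nodes."""
    starred = set(starred_genes(pathway_nodes, variants))
    return ", ".join(f"{n}*" if n in starred else n for n in pathway_nodes)


# A few node labels from Figure 4.15; illustrative, not the complete pathway.
pathway_nodes = ["INS", "INSR", "IRS", "PI3K", "GLUT4"]

# Hypothetical personal variants, invented for this example.
variants = [
    Variant("INSR", "non-synonymous", True),
    Variant("IRS", "non-synonymous", False),   # predicted benign, so not starred
    Variant("PI3K", "synonymous", False),      # synonymous, so not starred
    Variant("GLUT4", "non-synonymous", True),
    Variant("TP53", "non-synonymous", True),   # not in this pathway, so ignored
]

summary = render(pathway_nodes, variants)
```

Keeping the pathway definition and the variant calls as separate inputs means the same variant list could be overlaid on any other pathway or functional data set without changing the joining logic.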

4.6.2 Functional maps


Functional maps are used to integrate personal genomic information with other functional molecular data to give it functional context. Functional maps can be used to explore potential functional consequences of various personal genomic attributes, such as the potential effects of non-synonymous SNPs on protein function, or the aggregate effects of rare variants in a key biological pathway. An example of the latter is shown in Figure 4.15. Here, we see the insulin signaling pathway defined in the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway resource. The red stars indicate genes harboring potentially
deleterious non-synonymous mutations determined by an in silico mutation assessment tool. If this genome was obtained from an individual exhibiting signs of idiopathic diabetes, this functional map might offer a clue as to the potential genetic basis of the dysfunction in the insulin signaling pathways. There are many types of functional information that could be used to create functional maps for personal genomic information, such as gene expression, ChIP-Seq, RNA-Seq, and RNAi data. There are also many established tools that can be used to facilitate the creation of such maps, such as Cytoscape or the Integrative Genomics Viewer (IGV).

4.7 Conclusion

The primary purpose of exploring personal genomics is to gain knowledge and understanding from the wealth of information encoded in an individual human genome. Visualization plays a key role in this process by engaging the adept capabilities of human visual perception to summarize, navigate, reveal patterns in, and quantitatively assess the vast and rich source of information latent in a personal genome. However, just as human perception is highly tuned for complex visual tasks, such as pattern recognition, it can just as easily perceive patterns or differences that stem from visual artifacts as it can perceive those based in the actual data. Although accuracy and faithful representation are paramount for any type of visualization designed for the display of real data, this is especially true for personal genomics, where misinterpretation can have severe physical, emotional, or ethical consequences.

As with any effort that incorporates visualization techniques, it is important to choose visualization techniques that are most appropriate for the task and the intended audience. For example, genome browsers offer a powerful means for visually exploring a personal genome at molecular resolution; however, a genome browser visualization may be completely unnecessary, overwhelming, or confusing if the goal is to provide summary information regarding personal genetic disease risk to a patient or physician. In this chapter, we have covered several types of visualizations currently used for exploring personal genomic information. We expect that there will be a need to create many novel types and implementations of visualizations for personal genomic information, as the awareness and availability of personal genomics continue to expand beyond their current domains. Also, it will be necessary to connect several types of personal genome visualizations to facilitate more advanced applications in personal genome exploration. We hope that the descriptions and critiques of the various visualizations relevant to personal genomics in this chapter will both serve as a basis for those who may participate directly in the design or development of novel personal genomics visualizations and enable observers of these visualizations to assess the benefits and potential limitations of a visualization tool or technique for the purpose of exploring personal genomics.

Further reading

Genome browsers and browser toolkits
The Generic Genome Browser (GBrowse) <http://gmod.org/wiki/GBrowse> accessed 5 August 2012.
The UCSC Genome Browser <http://genome.ucsc.edu/> accessed 5 August 2012.
Genomes Unzipped public personal genomics custom browser <http://www.genomesunzipped.org/jbrowse/> accessed 5 August 2012.
Dalliance interactive genome viewer <http://www.biodalliance.org/> accessed 5 August 2012.
University of Tokyo Genome Browser Toolkit <http://utgenome.org/> accessed 5 August 2012.
JBrowse genome browser <http://jbrowse.org/> accessed 5 August 2012.
Rover Genome Browser framework <http://chmille4.github.com/Rover/site/home.html> accessed 5 August 2012.

Other software and internet resources
Interpretome <http://interpretome.com>
GET-Evidence <http://evidence.personalgenomes.org>
Promethease <http://www.snpedia.com/index.php/Promethease>
Circos <http://circos.ca>
myKaryoView <http://mykaryoview.com/>
Scribl <http://chmille4.github.com/Scribl/>
Cytoscape <http://www.cytoscape.org/>
MitoWheel <http://mitowheel.org/mitowheel.html>
SNPduo <http://pevsnerlab.kennedykrieger.org/SNPduo/>
Integrative Genomics Viewer (IGV) <http://www.broadinstitute.org/igv/>

Further reading on data visualization
Ashley, E. A. et al. (2010) Clinical assessment incorporating a personal genome. Lancet 375, 1525–35.
Fry, B. (2007) Visualizing Data. Sebastopol, CA: O’Reilly Media.
Harada, N. et al. (2004) Subtelomere specific microarray based comparative genomic hybridisation: a rapid detection system for cryptic rearrangements in idiopathic mental retardation. Journal of Medical Genetics 41, 130–6.
Ledford, H. (2010) Big science: The cancer genome challenge. Nature 464, 972–4.
Morgan, A. A., Chen, R. & Butte, A. J. (2010) Likelihood ratios for genome medicine. Genome Medicine 2, 30.
Tufte, E. (1986) The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
Stanford Visualization Group <http://vis.stanford.edu/>
Segaran, T. and Hammerbacher, J. (2009) Beautiful Data: The Stories Behind Elegant Data Solutions. Sebastopol, CA: O’Reilly Media.
