10.1007@978 981 13 1942 6

Learning Materials in Biosciences
Ju Han Kim
Genome
Data
Analysis
Translated by Younghee Lee
Learning Materials in Biosciences textbooks compactly and concisely discuss a specific biological, bio-
medical, biochemical, bioengineering or cell biologic topic. The textbooks in this series are based on lec-
tures for upper-level undergraduates, master’s and graduate students, presented and written by
authoritative figures in the field at leading universities around the globe.
The titles are organized to guide the reader to a deeper understanding of the concepts covered.
Each textbook provides readers with fundamental insights into the subject and prepares them to inde-
pendently pursue further thinking and research on the topic. Colored figures, step-by-step protocols and
take-home messages offer an accessible approach to learning and understanding.
In addition to being designed to benefit students, Learning Materials textbooks represent a valuable
tool for lecturers and teachers, helping them to prepare their own respective coursework.
More information about this series at http://www.springer.com/series/15430

Ju Han Kim
Genome Data
Analysis
Ju Han Kim
Division of Biomedical Informatics
Seoul National University College of Medicine
Seoul, South Korea
Additional material to this book can be downloaded from http://extras.springer.com.
Previous publisher returns the publishing rights for the English language edition of the
work Genome Data Analysis for publication in all forms and media of expression to the
author, Dr. Ju Han Kim.
ISSN 2509-6125 ISSN 2509-6133 (electronic)

ISBN 978-981-13-1941-9 ISBN 978-981-13-1942-6 (eBook)
https://doi.org/10.1007/978-981-13-1942-6
Library of Congress Control Number: 2018966388
© Springer Nature Singapore Pte Ltd. 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recita-
tion, broadcasting, reproduction on microfilms or in any other physical way, and transmission or infor-
mation storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publica-
tion does not imply, even in the absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein
or for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
v
Preface
»» The future is already here – it’s just not very evenly distributed.
– William Gibson
In the era of experimental biology, which is represented by traditional experiments in

biochemistry and molecular cell biology, obtaining data caused the most critical bottle-
neck. The researcher first established a hypothesis, designed the necessary experiment to
verify that hypothesis, and finally obtained experimental data with difficulty. It has been
20 years since modern bioinformatics was introduced, and now the Age of Data Tech-
nology and Artificial Intelligence has come. In this new era, researchers are awash with
big data. Frequently the data comes first, followed by creating a hypothesis and planning
an experiment for verification. This age demands a paradigmatic change from hypothesis-
driven experimental biology to making data-driven bioinformatics experiments.
We are now in the era of “constant data creation.” Cameras are used more than human
eyes and microphones more than human ears; these biosensors obtain data constantly at
no cost. No longer is the obtaining of data a bottleneck for medical and life science
research. Instead, the bottleneck has rapidly moved to data analysis and the bioinformat-
ics field. The person who controls this bottleneck has power over the research. In the
time when obtaining data was a bottleneck, the techniques applied to the cells and tissues
of an experimental animal were the most important methodology. In the data era, it is
bioinformatics approaches, which control the whole “life cycle” of bio data, that will
become the most important methodology.
To the eye of a researcher in life science or clinical medicine, this data-central paradigm
is a new and unfamiliar thing. However, data science has already been firmly established
in many fields. The data-centric paradigm is a century-old phenomenon in the physical
sciences, including particle physics, astronomy, and earth science, and also in most engi-
neering fields and various industry fields. With the emergence of Facebook and Twitter,
even fields in the humanities have joined the era of “constant data creation.” With the
introduction of the mobile smartphone environment to assist data collection, the evolu-
tionary direction of biological science has become clearer, driven by this continuous
creation of individual bio data.
This Genome Data Analysis textbook will cover not just using bio data, but also viewing
it through the perspective of “data science,” which treats the data itself as an object with
a life cycle. Bioinformatics is a field that supervises the four phases of data life, “Birth,
Aging, Sickness, and Death,” which are the whole process of creation, interaction, devel-
opment, and extinction of bio data. This data is a projection into virtual space of phe-
nomena that occur in the natural world. From this point of view, it is wrong to focus on
“analysis” alone. However, as this book is limited to the practices for the beginner, it
could not cover in detail the major parts of “flow” and “control” of bio data.
This Genome Data Analysis practice book is not meant for experts from the bioinformatics
field. This book is for a life scientist, medical scientist, statistician, data processing researcher,
engineer, or other beginner in bioinformatics who finds it necessary to study bioinformatics
vi
Preface
but also difficult to approach the field. However, this book can be a simple guideline for experts
unfamiliar with the new, developing subfield of genomic analysis within bioinformatics.
Genome Data Analysis was begun in 2011 and is based on the practice data bundle from
the 8-week ‘Genome Data Analysis Workshop’ conducted twice per year at the Division
of Biomedical Informatics and Biomedical Informatics Training and Education Center
of Seoul National University. Over time, I have noticed that the contents of the first ver-
sion of this book, published in 2012, became old and outdated. The bioinformatics field
has really developed fast. As I was organizing the table of contents for this second ver-
sion, I noticed that even after only 2 years, it has become impossible to publish every-
thing in just one book. This second version is being published in two books, with the
drastic development of next-generation sequencing techniques to be dealt with in the
second part, which is planned to be published in later.
Each exercise is comprised of one chapter of theoretical background, which is relevant to

the outline, followed by three chapters of practice. Each practice provides a prepared sce-
nario with an introduction to the data, analysis strategy, and learning goal as well as prac-
tice problems. Due to the scope of the field, not only the Windows environment but also R
programming and simple script practices are included, which are a challenge for the begin-
ner. All practices follow the “blind imitation” method, which anyone can try at least once,
and most of the necessary programs and data for exercises will be provided with the DVD.
Without the Seoul National University Biomedical Informatics researchers, this book
could not be published. Furthermore, this book was completed with the support from
2000 participants of the GDA Workshop who gave unstinting encouragement and
review. I would like to give special thanks to Sunhee Shin, who was the person in charge,
and the publisher epublic (bummoon-sa) even though the manuscript was not prepared
well from the experimental composition. Also, I want to give special thanks to Jihoon
Kim, Je-Gun Joung, and Ji Yeon Park, the assistant professors who directly led practices;
and to Younjung Bae, Hyerim Jung, Eunmi Byun, Arirang Jang, and Hyerim Yoo from
the executive office who helped everything to run smoothly. The true authors of this
practice book Genome Data Analysis are Young-Ji Na, Su-Yeon Lee, Hee-Joon Chung, Yu
Rang Park, Dokyoon Kim, Heewon Seo, Sunmin Yun, Chan Hee Park, Jun Hee Youn,
Soo Youn Lee, Yonglae Cho, Hyun Wook Han, Kye Hwa Lee, Ki Tae Kim, and Jae Hyun
Lim, who organized the practice data; and Su Youn Baik, Hyehyeon Kim, Young Jo Yoon,
Sue Hyun Lee, Rocky, Younggyun Lim, Woo Seung Lee, Yoomi Park, and Brian Y Ryu,
who helped with the proceedings of the actual practices. It is embarrassing for me to
become the representative for all these authors. Finally, I really appreciate the efforts of
Woo Seung Lee, who was in charge of proofreading and final organization of this book.
This practice book Genome Data Analysis will continuously evolve and be updated with
new knowledge in accordance with the ongoing GDA Workshop, held every summer
and winter. I would like to disclose that the authors are responsible for any errors or
imperfections in this book.
At the Hamchunmoon Gate, in May 2015
Ju Han Kim
Seoul, South Korea
vii
Translated by
Younghee Lee Soo Jin Kim
Department of Biomedical Informatics Division of Hematology, School of Medicine
University of Utah School of Medicine University of Utah
Salt Lake City, UT, USA Salt Lake City, UT, USA
younghee.lee@utah.edu soo.kim@hsc.utah.edu
(All Chapters and Preface) (Chapters 3, 6, 7 and 12)
Kanghoon Choi Jane Ryu
Department of Biomedical Informatics Johns Hopkins University,
University of Utah Nelson Laboratories LLC
Salt Lake City, UT, USA Baltimore, MD, USA
kyle.choi89@gmail.com jryu13@gmail.com
(Chapters 5, 10, 11, 17, 18, 20 and Preface) (Chapters 4, 8, 9, 14 and 15)
Seonggyun Han Michael S. Sinclair
Department of Biomedical Informatics Department of Biomedical Informatics
University of Utah School of Medicine University of Utah School of Medicine
Salt Lake City, UT, USA Salt Lake City, UT, USA
seonggyun.han@utah.edu sincla@gmail.com
(Chapter 16) (Chapter 1)
Dongwook Kim
Department of Biomedical Informatics
University of Utah School of Medicine
Salt Lake City, UT, USA
dwkim106@gmail.com
(Chapters 13 and 21)
ix
Contents
I Bioinformatics for Life and Personal Genome

Interpretation
1 Bioinformatics for Life...................................................................................................................... 3

1.1 Introduction........................................................................................................................................... 4
1.2 Life and Information............................................................................................................................ 5
1.3 The Human Genome Project............................................................................................................ 8
1.4 Development of Microarray Technology and its Medical Applications........................... 9
1.5 Explosion of Data from Next-Generation Sequencing........................................................... 12
1.6 The Era of Systems Biology and Biomedical Informatics...................................................... 14
Bibliography........................................................................................................................................... 15
2 Next-Generation Sequencing Technology and Personal

Genome Data Analysis..................................................................................................................... 17
2.1 Introduction........................................................................................................................................... 18
2.2 Data Format............................................................................................................................................ 20
2.2.1 FASTQ Format.......................................................................................................................................... 20
2.2.2 CSFASTA Format..................................................................................................................................... 21
2.2.3 Concept of QV......................................................................................................................................... 21
2.3 Alignment of NGS Sequencing Reads........................................................................................... 22
2.3.1 Sequence Alignment with a Hash Table......................................................................................... 23
2.3.2 Sequence Alignment with a Suffix Tree.......................................................................................... 23
2.4 SNP and InDel Detection................................................................................................................... 24
2.5 Annotation of Sequence Variation and Function Prediction............................................... 25
2.5.1 Annotation and Clinical Interpretation of Common Variations.............................................. 25
2.5.2 Annotation and Clinical Interpretation of Rare Variants........................................................... 26
2.5.3 KEGG Disease Pathways Mapping.................................................................................................... 27
2.5.4 Pharmacogenomics.............................................................................................................................. 28
2.5.5 Distribution of Genetic Variants in the Population..................................................................... 28
Bibliography........................................................................................................................................... 30
3 Personal Genome Data Analysis................................................................................................. 33

3.1 Prerequisites........................................................................................................................................... 34
3.2 SNP Annotations and Interpretations Using SNPedia and Promethease....................... 34
3.2.1 Create Input File..................................................................................................................................... 34
3.2.2 Running Promethease.......................................................................................................................... 35
3.2.3 Results Format and Interpretations of Promethease................................................................. 35
3.3 Interpretation of the Correlation of Rare SNPs with Diseases
Using SIFT and KEGG�� 37
3.3.1 Run the SIFT Web Tool.......................................................................................................................... 38
3.3.2 KEGG DISEASE Pathway Mapping.................................................................................................... 38
3.4 Pharmacogenomic Analysis............................................................................................................. 41
3.4.1 Pharmacogenomic Analysis Using GENOtation.......................................................................... 41
3.5 Calculations of Allele Frequencies in a Population................................................................. 43
Bibliography........................................................................................................................................... 45
x
Contents
4 Personal Genome Interpretation and Disease Risk Prediction.................................... 47

4.1 Introduction........................................................................................................................................... 48
4.2 SNP Prioritization................................................................................................................................. 48
4.3 Prediction of Genetic Disease Risk................................................................................................ 49
4.3.1 Algorithm for Prediction of Genetic Disease Risk........................................................................ 49
4.3.2 Practice Data........................................................................................................................................... 49
4.3.3 Risk Prediction Using GENOtation.................................................................................................... 50
4.3.4 Data Analysis Using Promethease.................................................................................................... 52
4.4 Resources for Analyzing an Individual Genome....................................................................... 55
4.4.1 dbGAP....................................................................................................................................................... 55
4.4.2 SNPedia..................................................................................................................................................... 57
4.4.3 PheGenI..................................................................................................................................................... 58
4.5 Disease-Related Sequence Variation Analysis.......................................................................... 62
4.5.1 Verification of File List and Moving to Practice Directory Using Terminal........................... 62
4.5.2 Extracting Main Disease-Causing Variants from Sequence Data........................................... 63
4.5.3 Finding the Disease Gene of a Child Using Kinship Data.......................................................... 72
4.5.4 Analysis of Gene Lists Known to Affect Phenotype.................................................................... 72
Bibliography........................................................................................................................................... 75
II Advanced Microarray Data Analysis

5 Advanced Microarray Data Analysis......................................................................................... 79
5.1 Introduction........................................................................................................................................... 80
5.2 Microarray Experiment....................................................................................................................... 81
5.3 Structure and Normalization of Microarray Data..................................................................... 82
5.4 Differentially Expressed Genes (DEGs)......................................................................................... 84
5.5 Cluster Analysis and Interpretation............................................................................................... 85
5.6 Classification Analysis......................................................................................................................... 86
5.6.1 Linear Discriminant Analysis (LDA).................................................................................................. 87
5.6.2 Support Vector Machine (SVM)......................................................................................................... 87
5.6.3 K-Nearest Neighborhood (KNN)........................................................................................................ 88
5.7 Gene Set Enrichment Analysis......................................................................................................... 88
5.7.1 Analysis Method and Feature............................................................................................................ 89
5.7.2 Gene Set Database................................................................................................................................ 90
5.8 Survival Analysis and Prognostic Subgroup Prediction........................................................ 90
Bibliography........................................................................................................................................... 92
6 Gene Expression Data A

nalysis................................................................................................... 95
6.1 Introduction........................................................................................................................................... 96
6.2 Prerequisites........................................................................................................................................... 96
6.3 Differential Gene Expression Analysis.......................................................................................... 103
6.3.1 Analysis of Significant Differential Expression in Two-Group Comparisons:
T-Test and SAM........................................................................................................................................ 104
6.3.2 Analysis of Significant Differential Expression in Comparisons of
More than Three Groups: ANOVA..................................................................................................... 109
xi
Contents
6.4 Clustering Analysis.............................................................................................................................. 111

6.4.1 Hierarchical Clustering......................................................................................................................... 111
6.4.2 K-Means Clustering............................................................................................................................... 113
6.4.3 SOM Clustering....................................................................................................................................... 114
6.5 Classification Analysis......................................................................................................................... 115
6.5.1 LDA (Linear Discriminant Analysis).................................................................................................. 116
6.5.2 KNN (K-Nearest Neighbor).................................................................................................................. 116
6.5.3 SVM (Support Vector Machine)......................................................................................................... 116
6.6 Basic Data Processing in R Program.............................................................................................. 117
6.6.1 Understanding the Data Structure................................................................................................... 117
Bibliography........................................................................................................................................... 120
7 Gene Ontology and Biological Pathway-Based Analysis................................................ 121

7.1 Introduction........................................................................................................................................... 122
7.2 Prerequisites........................................................................................................................................... 124
7.3 Dataset and Biological Interpretation Tools.............................................................................. 124
7.3.1 Generate Example Data Set................................................................................................................ 125
7.3.2 Biological Interpretation Tools.......................................................................................................... 128
7.4 DAVID........................................................................................................................................................ 129
7.5 ArrayXPath.............................................................................................................................................. 131
7.6 BioLattice................................................................................................................................................. 131
Bibliography........................................................................................................................................... 134
8 Gene Set Approaches and Prognostic Subgroup Prediction........................................ 135

8.1 Introduction........................................................................................................................................... 137
8.2 Prerequisites........................................................................................................................................... 139
8.3 Input Files................................................................................................................................................ 139
8.3.1 Gene Expression Data File Format.................................................................................................... 140
8.3.2 Phenotype Label File Format............................................................................................................. 141
8.3.3 Gene Set Database File........................................................................................................................ 141
8.3.4 Microarray Chip Annotation File....................................................................................................... 142
8.3.5 Data Input................................................................................................................................................ 142
8.4 GSEA Execution..................................................................................................................................... 143
8.4.1 Required Fields....................................................................................................................................... 143
8.4.2 Basic Fields............................................................................................................................................... 145
8.4.3 GSEA Execution...................................................................................................................................... 145
8.4.4 GSEA Analysis Results........................................................................................................................... 145
8.5 Leading Edge Subset Analysis......................................................................................................... 146
8.6 GSEA Analysis Via R.............................................................................................................................. 147
8.6.1 Installation............................................................................................................................................... 148
8.6.2 Input File................................................................................................................................................... 149
8.6.3 GSEA Execution...................................................................................................................................... 149
8.7 Survival Analysis................................................................................................................................... 152
8.7.1 Install CGDS-R Package........................................................................................................................ 152
8.7.2 Get List of Cancer Studies from Server........................................................................................... 153
8.7.3 Extract Samples and Features............................................................................................................ 153
8.7.4 Get Mutation Profiles for BRCA1 and BRCA2................................................................................ 154
xii
Contents
8.7.5 Extraction Samples with BRCA1 and BRCA2 Methylation........................................................ 154

8.7.6 Clinical Data Integration...................................................................................................................... 155
8.7.7 Survival Analysis..................................................................................................................................... 156
Bibliography........................................................................................................................................... 157
9 MicroRNA Data Analysis.................................................................................................................. 159

9.1 Introduction........................................................................................................................................... 160
9.2 Prerequisites........................................................................................................................................... 160
9.3 Retrieving miRNA-mRNA Pair Expression Data........................................................................ 160
9.3.1 Input File................................................................................................................................................... 160
9.3.2 Missing Value Removal........................................................................................................................ 161
9.4 miRNA-mRNA Expression Correlation Coefficient................................................................... 162
9.4.1 Finding One miRNA and mRNA Correlation.................................................................................. 162
9.4.2 Finding the Correlation Coefficient of All Pair-Wise mRNAs of Several Genes................... 163
9.4.3 Find Significant miRNA-mRNA Pairs................................................................................................ 165
9.4.4 Draw Significant miRNA-mRNA Plot................................................................................................ 166
9.5 Verifying a Significant miRNA-mRNA Correlations in Information Databases............. 167
9.5.1 Verify mRNA Target Against Verified Experiments...................................................................... 167
9.5.2 Confirm Predicted Target Gene......................................................................................................... 169
Bibliography........................................................................................................................................... 171
III Network Biology, Sequence, Pathway

and Ontology Informatics
10 Network Biology, Sequence, Pathway and Ontology Informatics............................. 175

10.1 Introduction........................................................................................................................................... 176
10.2 Sequence Data Analysis..................................................................................................................... 177
10.2.1 Sequence Information......................................................................................................................... 177
10.2.2 Pair-wise Alignment.............................................................................................................................. 178
10.2.3 Multiple Alignment............................................................................................................................... 178
10.2.4 Global Alignment................................................................................................................................... 178
10.2.5 Local Alignment..................................................................................................................................... 179
10.3 Phylogenetic Tree Analysis............................................................................................................... 179
10.4 Visualization for Sequence Analysis.............................................................................................. 180
10.5 Biological Pathway Analysis............................................................................................................. 181
10.6 Gene Ontology...................................................................................................................................... 182
10.7 Biomedical Text Mining...................................................................................................................... 183
10.8 Biomedical Network Analysis.......................................................................................................... 184
Bibliography........................................................................................................................................... 186
11 Motif and Regulatory Sequence Analysis............................................................................... 189

11.1 Introduction........................................................................................................................................... 190
11.2 Sequence Data Structure................................................................................................................... 190
11.3 Sequence Alignment and Phylogenetic Tree Analysis........................................................... 193
11.3.1 Sequence Alignment............................................................................................................................ 193
11.3.2 Sequence Alignment Practice............................................................................................................ 196
11.3.3 Phylogenetic Tree Analysis................................................................................................................. 199
xiii
Contents
11.4 Transcription Regulatory Site Prediction Using Sequence Alignments.......................... 202

11.4.1 Transcription Factors and Sequence Motifs.................................................................................. 202
11.4.2 Conserved Sequence Region Detection......................................................................................... 202
11.5 UCSC Genome Browser...................................................................................................................... 206
11.6 Prediction of Targeting microRNAs............................................................................................... 209
Bibliography........................................................................................................................................... 211
12 Molecular Pathways and Gene Ontology............................................................................... 213

12.1 Introduction........................................................................................................................................... 214
12.2 Prerequisites........................................................................................................................................... 214
12.3 Gene Ontology...................................................................................................................................... 214
12.3.1 Search GO Annotation for a Single Gene....................................................................................... 214
12.3.2 Search GO Annotations for a Gene List.......................................................................................... 216
12.3.3 Calculation of the Semantic Distance Between Genes.............................................................. 217
12.3.4 GO Annotation Analysis Using R....................................................................................................... 218
12.4 Biological Pathway Analysis............................................................................................................. 224
12.4.1 Search a Biological Pathway of a Specific Gene........................................................................... 224
12.4.2 Search Significant Biological Pathways of a Gene List............................................................... 224
12.4.3 Data Analysis Using Gene Expression Data................................................................................... 225
12.4.4 Biological Pathway Analysis Using R................................................................................................ 226
12.5 Biological Text Mining – COREMINE.............................................................................................. 229
Bibliography........................................................................................................................................... 231
13 Biological Network Analysis.......................................................................................................... 233

13.1 Introduction........................................................................................................................................... 234
13.2 Preparations........................................................................................................................................... 234
13.3 The Network Analysis Tools.............................................................................................................. 234
13.3.1 Major Network Analysis Tools............................................................................................................ 234
13.3.2 Introduction to igraph and an Example of Its Use...................................................................... 235
13.4 Introduction to Data and Publication Used for Analysis....................................................... 240
13.4.1 Introduction to Publication and Dataset........................................................................................ 240
13.4.2 Introduction to Data and Preprocessing........................................................................................ 240
13.5 Analysis of Protein Interaction Networks.................................................................................... 241
13.5.1 Visualization............................................................................................................................................ 241
13.5.2 Distribution of Connections............................................................................................................... 242
13.5.3 Evolutionary Analysis........................................................................................................................... 244
Bibliography........................................................................................................................................... 246
IV SNPS, GWAS and CNVS, Informatics for Genome Variants

14 SNPs, GWAS, CNVs: Informatics for Human Genome Variations................................. 249
14.1 Introduction........................................................................................................................................... 250
14.2 dbSNP....................................................................................................................................................... 252
14.3 International HapMap Project......................................................................................................... 253
14.4 PharmGKB............................................................................................................................................... 253
xiv
Contents
14.5 Genome-Wide Association Studies (GWAS)............................................................................... 254

14.5.1 Allelic Test................................................................................................................................................. 254
14.5.2 Genotypic Test........................................................................................................................................ 255
14.5.3 Multiple Testing Correction................................................................................................................ 256
14.6 Definition and Importance of Copy-Number Variation......................................................... 256
14.7 Analysis Methods of CNV................................................................................................................... 257
14.7.1 Chip Method to Detect CNV............................................................................................................... 258
14.7.2 Sequencing Method to Detect CNV................................................................................................ 258
14.8 Conclusion............................................................................................................................................... 258
Bibliography........................................................................................................................................... 259
15 SNP Data Analysis.............................................................................................................................. 261

15.1 Introduction........................................................................................................................................... 262
15.1.1 dbSNP........................................................................................................................................................ 262
15.1.2 International HapMap Project........................................................................................................... 262
15.1.3 PharmGKB................................................................................................................................................ 262
15.2 dbSNP....................................................................................................................................................... 263
15.2.1 rsID, ssID Search..................................................................................................................................... 263
15.2.2 Entrez SNP Search.................................................................................................................................. 264
15.3 1000 Genomes Project....................................................................................................................... 266
15.3.1 Data............................................................................................................................................................ 266
15.3.2 Browser..................................................................................................................................................... 268
15.4 PharmGKB............................................................................................................................................... 270
15.4.1 Search PharmGKB.................................................................................................................................. 270
15.4.2 Clinical Annotations.............................................................................................................................. 270
15.5 Variant Analysis Using NGS.............................................................................................................. 271
15.5.1 Preparation.............................................................................................................................................. 271
15.5.2 Extraction of Disease-Associated Variants from Variants Identified Through NGS........... 272
Bibliography........................................................................................................................................... 279
16 GWAS Data Analysis.......................................................................................................................... 281

16.1 Introduction........................................................................................................................................... 282
16.2 Prerequisites........................................................................................................................................... 282
16.3 *.ped and *.map Files of the PLINK................................................................................................ 283
16.3.1 The genotypes.ped File........................................................................................................................ 283
16.3.2 The genotypes.map File...................................................................................................................... 284
16.4 Configuration of gPLINK and Other Programs that Work with gPLINK........................... 285
16.5 Validation of the GWAS Data Set and Summary of Statistics.............................................. 285
16.6 Filtering Out Data with a Threshold.............................................................................................. 288
16.7 Basic Association Test......................................................................................................................... 291
16.8 Additive Genotypic Test..................................................................................................................... 292
16.9 Manhattan Plot...................................................................................................................................... 294
Bibliography........................................................................................................................................... 297
17 CNV Analysis......................................................................................................................................... 299

17.1 Introduction........................................................................................................................................... 300
17.2 Prerequisites........................................................................................................................................... 300
xv
Contents
17.3 Data............................................................................................................................................................ 301

17.3.1 Normalization......................................................................................................................................... 301
17.4 Genomic Alteration Detection........................................................................................................ 302
17.4.1 Single & Multi-sample Segmentation and Allele-specific Segmentation............................ 302
17.4.2 Identification of Gain and Loss at Genomic Position................................................................. 304
17.5 Visualization........................................................................................................................................... 308
17.5.1 Analysis of the Differences Between Samples Using the Heatmap....................................... 308
17.6 Obtaining Genomic Regions............................................................................................................ 309
Bibliography........................................................................................................................................... 312
V Metagenome and Epigenome, Basic Data Analysis

18 Metagenome and Epigenome Data Analysis....................................................................... 315
18.1 Metagenome.......................................................................................................................................... 316
18.2 Epigenome.............................................................................................................................................. 318
18.2.1 DNA Methylation................................................................................................................................... 318
18.2.2 Histone Modification............................................................................................................................ 319
18.2.3 Non-coding RNA (ncRNA)................................................................................................................... 320
18.3 Epigenome Databases and Analysis Tools.................................................................................. 320
18.3.1 DNA Methylation Databases and Analysis Tools.......................................................................... 320
18.3.2 Histone Modification Databases and Analysis Tools.................................................................. 321
18.3.3 Noncoding RNA Databases and Analysis Tools............................................................................ 321
18.4 Epigenome Analysis............................................................................................................................ 322
Bibliography........................................................................................................................................... 322
19 Metagenome Data Analysis.......................................................................................................... 325

19.1 Introduction........................................................................................................................................... 326
19.2 Prerequisites........................................................................................................................................... 326
19.3 Metagenome Analysis Tool............................................................................................................... 326
19.3.1 Major Metagenome Analysis Tools.................................................................................................. 326
19.3.2 Introduction to metagenomeSeq.................................................................................................... 327
19.4 Metagenome Analysis........................................................................................................................ 327
19.4.1 Dataset...................................................................................................................................................... 327
19.4.2 Basic Example of metagenomeSeq.................................................................................................. 327
19.4.3 Statistical Testing................................................................................................................................... 329
19.4.4 Aggregating Counts.............................................................................................................................. 329
19.5 Visualization........................................................................................................................................... 330
Bibliography........................................................................................................................................... 337
20 Epigenome Database and Analysis Tools............................................................................... 339

20.1 Introduction........................................................................................................................................... 340
20.2 Prerequisites........................................................................................................................................... 340
20.3 Epigenome Database.......................................................................................................................... 340
20.3.1 dbEM: A Database of Epigenetic Modifiers................................................................................... 340
20.3.2 EpiFactors................................................................................................................................................. 341
20.3.3 MethylomeDB......................................................................................................................................... 343
20.3.4 NGSmethDB: High-Quality Methylomes and Differential Methylation ............................... 345
xvi
Contents
20.4 Epigenome Analysis Tool................................................................................................................... 346

20.4.1 Bsseq: Analyzing WGBS with the Bsseq Package......................................................................... 346
20.4.2 MethPipe: a Computational Pipeline for Analyzing Bisulfite Sequencing Data ............... 351
20.4.3 PAVIS: Peak Annotation and Visualization...................................................................................... 351
Bibliography........................................................................................................................................... 352
21 Epigenome Data Analysis............................................................................................................... 353

21.1 Introduction........................................................................................................................................... 354
21.2 Preparations........................................................................................................................................... 354
21.3 Input Sequence Data........................................................................................................................... 354
21.3.1 Download Sequence Data.................................................................................................................. 354
21.3.2 Specify Input Sequence Data............................................................................................................. 355
21.4 Data Filtering......................................................................................................................................... 355
21.4.1 Alignment Control (AC)........................................................................................................................ 355
21.4.2 Quality Control (QC).............................................................................................................................. 356
21.5 Analysis of Methylation Status........................................................................................................ 356
21.5.1 Sequence Name..................................................................................................................................... 357
21.5.2 Alignment................................................................................................................................................ 357
21.5.3 Sequence Start and End Positions.................................................................................................... 358
21.5.4 Length of the Reference Sequence.................................................................................................. 358
21.5.5 Methylated Positions in the Sample Sequence............................................................................ 359
21.5.6 Methylated Positions in the Reference Sequence....................................................................... 360
21.6 Exploratory Analysis and Visualization........................................................................................ 360
21.6.1 Plotting Methylation Status................................................................................................................ 360
21.6.2 Lollipop Figure: Methylation Status................................................................................................. 360
21.6.3 Neighboring Co-occurrence Display............................................................................................... 362
21.6.4 Distant Co-occurrence Display.......................................................................................................... 364
21.7 Statistical Tests...................................................................................................................................... 364
21.7.1 Fisher’s Exact Test................................................................................................................................... 364
21.7.2 Clustering Analysis................................................................................................................................ 365
21.7.3 Correspondence Analysis.................................................................................................................... 366
Bibliography........................................................................................................................................... 367
1 I
Bioinformatics for Life

and Personal Genome
Interpretation
Contents
Chapter 1 Bioinformatics for Life – 3
Chapter 2 Next-Generation Sequencing Technology and

Personal Genome Data Analysis – 17
Chapter 3 Personal Genome Data A

nalysis – 33
Chapter 4 Personal Genome Interpretation and Disease

Risk Prediction – 47
3 1
Bioinformatics for Life
1.1 Introduction – 4
1.2 Life and Information – 5
1.3 The Human Genome Project – 8
1.4 Development of Microarray Technology and

its Medical Applications – 9
1.5 Explosion of Data from Next-Generation

Sequencing – 12
1.6 The Era of Systems Biology and

Biomedical Informatics – 14
Bibliography – 15

J. H. Kim, Genome Data Analysis, Learning Materials in Biosciences,
https://doi.org/10.1007/978-981-13-1942-6_1
4 Chapter 1 · Bioinformatics for Life
What You Will Learn in This Chapter

1 Bioinformatics is an essential field of study that will play a leading role in the post-genome
era. Bioinformatics understands biological phenomena as informatics phenomena that take
the form of sophisticated interactions between the materials that constitute living organ-
isms, matter in general, and energy. As a consequence of the development of technology
for acquiring vast amounts of biological information, molecular genetics has progressed
from the study of individual genes to the systematic study of the entire genome. Together
with technology for the acquisition of biological information, the remarkable development
of advanced data processing techniques means that many of the traditional problems of
medicine and the life sciences have become objects of research for bioinformatics. The intro-
duction of the microarray has expanded research on single genes to the entire genome, and
has changed genetic analysis from qualitative to quantitative. Next-generation sequencing
(NGS) technology, by making analysis of the genomic sequences that form the basis of bio-
logical phenomena widely available, is constantly presenting new views on biological and
disease-related phenomena. In the first chapter, we will summarize the overall historical
development of the rapidly growing field of bioinformatics, and provide a general view of
the content of this book.
1.1 Introduction
Bioinformatics has become the core research field in the post-genome era since the
completion of the Human Genome Project.1 The success of the Project not only revealed
the sequence of the 3 billion DNA base pairs that make up the human genome, but also
accelerated the adoption of new paradigms, such as personalized medicine based on
human genome variation and predictive medicine based on gene expression analysis. It
has enabled new paradigms for clinical applications, such as in the emerging fields of
biomedical informatics and genomic medicine. The Human Genome Project also precipi-
tated the massive parallelization and miniaturization of the technology used in molecular
genomics analysis, the innovative development of microarrays, “lab-on-a-chip” devices,
and next-generation sequencing (NGS), and has pioneered high-throughput biology, sys-
tems medicine, and a new era of personal genome informatics.
In its formative stage, bioinformatics was mainly ancillary methods to assist life-science
research in areas that required computing, for example the construction of databases to
store and organize various types of biological data, data structure modeling, analysis of
the three-dimensional structure of proteins, and was limited to a particular area. However,
bioinformatics has become important in life science, has gone beyond the more simplistic
categories of statistical hypothesis testing techniques used in experimental biology, and
has taken a prominent place among the modern biomedical disciplines as the one which
pursues the most progressive approach and understanding. This is attributable to the basic
fact that biological phenomena are truly informatics phenomena, expressed in terms of
interactions in nature between material objects, and between matter and energy. The rapid
growth of bioinformatics shares the same context and current state of progress as the rest
of the life sciences today, which can be understood by looking, briefly, at the historical
development of biology itself.
1 7 https://www.genome.gov/10001772/all-about-the%2D%2Dhuman-genome-project-hgp/

1.2 · Life and Information
5 1
Originally, biology concerned itself with empirical observations and taxonomic
c lassifications, but began to reexamine the processes of life as chemical interactions tak-
ing place in the material world and subject to the same laws as inorganic reactions. With
this understanding, biology developed and incorporated the findings of organic chemistry
and biochemistry, and truly came into its own as an experimental science. The next step
was the growth of molecular and cellular biology, which arose from the special attention
given to the properties and roles of the large molecules characteristic of living organisms.
Thanks to continuous progress in these fields, life sciences have experienced repeated
advancements to this day.
This book, as its title suggests, focuses its perspective especially on the various types
of genomic data. It is therefore the fruit of an effort to reflect the evolution of research
paradigms in life science, from the study of nature and classification of living things based
on observations recorded over long periods of time, to experimental biology and the use
of simulations, and finally to the use of data science today.
1.2 Life and Information
Before genes, proteins were for a long time singled out as the essential component of
life. In 1883, Curtius proposed the first hypothesis that proteins have a one-dimensional
sequence structure resembling ordinary text. A slightly more advanced hypothesis of
protein structure based on peptide bonds was put forward by Hofmeister (1902) and
Fischer (1902). Miescher, who first discovered DNA in 1869, predicted that the genetic
material would consist of sequence information in the form of chemical symbols,
like simple text. Such sequence data started to appear with the publication of the first
nucleotide sequence of a segment of tRNA in 1961, after which a variety of DNA base
sequences began to be presented. At least theoretically, the flow of all genetic informa-
tion from an organism’s DNA, culminating in the manifestation of an individual phe-
notype, is encoded in the pure sequence information of the nucleotide bases that make
up its DNA. This sequence information includes, among other things, the secondary
and tertiary structures of proteins, promoters, enhancers, and other gene regulatory
elements, restriction enzyme cleavage sites, splicing elements, non-coding RNA, accu-
mulated mutations, and even the history of the entire process of that genome’s evolu-
tion. The extraordinary development of modern molecular biology, in the process of
uncovering the mysteries of life, has revealed that biological phenomena are, to a sur-
prising extent, “informatic” in nature. The field that investigates the flow of information
and interacting processes of expression that govern such biological phenomena can be
defined as bioinformatics.
The usefulness of informatics research in biology has been confirmed by its successful
application to the processing of output from DNA sequencers, the latter being the great-
est contributor to the early completion of the Human Genome Project. But beyond its
benefits for the more mechanical aspects of data processing, a classic example of scientific
insight gained through an informatics approach to biology can be seen in the study of
sequence homology. Zuckerkandl and Pauling [12] elucidated the relationship between
the degree of divergence in the peptide sequences of similar proteins, and their stage of
evolution, and so established molecular evolution as an entirely new area of research.
Because sequence variation arises according to fixed rules, it can be quantified, and the
history of evolution can be traced through this molecular timepiece that leaves its record
inside genes. More recently, thanks to the availability of faster analytical tools, sequence
1 homology analysis is being widely employed in metagenomics.
A common problem in sequence homology studies is the comparison of two character
strings, which is a classical problem in discrete mathematics. Since DNA sequences are
essentially character strings, the problem is directly applicable to the comparison of two
DNA sequences. As a simple example, let’s say we have two strings, “BLUE” and “BILE.”
Let’s further suppose that we can modify one of the strings one step at a time by inserting
a character, deleting a character, or substituting one character for another, which happen
to be the types of single base mutations that affect DNA. The problem is to determine
how many insertions, deletions, and/or substitutions it would take to change “BLUE” into
“BILE”, or vice versa. This is the “edit distance” between strings, and if the strings are DNA
sequences, it represents the molecular evolutionary distance between them. So, we have:
“BLUE” → {series of insertions, deletions, and/or substitutions} → “BILE”
However, we could, for example, delete the final “E” from “BLUE”, insert it back, delete
it again, insert it again, and so on, in endless repetition, with the string “BLUE” remaining
in its original state. Thus there are an infinite number of possible routes of transformation
from “BLUE” to “BILE” using the types of edits defined above. The problem is then to find
the shortest route of transformation from among all the infinitely-many possible ones. The
reason for this inference, at least in the case of DNA, is that the shortest route is probably
closest to what actually happened in nature.2 In simplest terms, the problem is to find
the minimum edit distance (also known as the Levenshtein distance) for transforming
sequence S1 into sequence S2 by means of single-character insertions, deletions, and/or
substitutions, and this distance is an approximation of the evolutionary distance between
the sequences (. Fig. 1.1).

In our example problem, there are two equivalent solutions that each have the min-
imum edit distance of 2 edits. The evolutionary distance between “BLUE” and “BILE”
would, therefore, be estimated as 2, as follows:
One insertion and one deletion (edit distance = 2)
B I L – E
B – L U E
Two substitutions: I for L, L for U (edit distance = 2)
BILE
BLUE
Of course, in reality, there aren’t only single base mutations, but longer sequences
can also be inserted or deleted at the same time, and the relative probabilities of occur-
rence of different types of insertions, deletions, and substitutions can also vary, so we
would have to make a somewhat more elaborate model to take them into account. This
problem, for which it is very difficult to obtain a general mathematical solution, is easily
solved with dynamic programming for pathfinding (. Fig. 1.1), a classic area of research

in artificial intelligence. . Figure 1.2 is a matrix of the edit distance for all possible edit

routes in our example, determined by dynamic programming. Vertical and horizontal

arrows represent deletions or insertions, respectively, and diagonal arrows represent sub-
stitutions.
2 The basis for inference from simplicity can be found in the so-called “Occam’s razor” principle, also
known as lex parisomniae in Latin (the law of parsimony, economy or succinctness). This principle
is founded on the insight that out of many equally compelling solutions to a given problem, the
simplest one has a high probability of being the correct one.
1.2 · Life and Information
7 1
Input: two strings S1 and S2

Output: minimum number of edits in editing process D(i, j) to change S1(1, ..., i) into S2(1, ..., j)
Method: recursion
D(i-1, j-1) + cost(i→j)
D(i, j) = min D(i-1, j) + cost(del(i))
D(i, j-1) + cost(add(j))
S1(1, ..., [i-1])
S2(1, ..., [j-1])
.. Fig. 1.1 Dynamic programming for sequence homology analysis (A-star search)
D(i, j) B I L E
0 1 2 3 4
B 1 1 2 3 B I L – E
B – L U E
L 2 1 1 2
U 3 2 2 2 2
B I L E
E 4 3 3 2 B L U E
.. Fig. 1.2 Calculation of minimum edit distance between two comparison sequences using dynamic
programming techniques in order to compare sequence homology. In each cell, an upward (deletion),
leftward (insertion), or up-and-left diagonal (concurrence or substitution) edit can be applied. Cells in
which the row and column coincide to indicate the same character for both sequences are marked in
blue, and the two optimal paths are marked by two sets of bolded arrows of different shape. The small
arrows indicate the possible paths to the coincident cells (blue) from the current cell. For example, in
cell B-B, because the two sequences agree at that position, the edit distance is 0, and the arrow pointing
up and to the left is the optimal path. The edit distance in the case of going up and left, or left and up, is
bigger by 2 than the optimal path distance of 0. By simply repeating this process, the minimum edit dis-
tances represented by numbers to the right can precisely find two different paths, each of edit distance 2
As the scope of the minimum edit distance problem gets larger, the computational
complexity required increases geometrically such that it eventually becomes impos-
sible to find an exact solution; it belongs to the class of NP-hard (Non-deterministically
Polynomially hard) problems. One must then apply a number of analytical methods to
efficiently find an approximate value. There are some slightly more advanced algorithms
one could apply, such as a scoring matrix that assigns differential scoring to substitutions
between similar amino acids as opposed to substitutions between non-similar amino
acids, performance improvements that make use of hash tables (e.g. FASTA), or elaborate
1 transition probability Hidden Markov models.
Sequence similarity analysis is the basic principle behind a many algorithms in bioin-
formatics for elucidating molecular evolutionary hierarchies, finding motifs, reconstitut-
ing gene regulatory networks, and finding other sequence elements such as transcription
factor binding sites, splice sites, and regulatory elements located inside introns. One can
try searching for a large number of genes or proteins with highly-similar sequences to a
given sequence using the Smith-Waterman algorithm provided by the Pairwise Sequence
Alignment,3 or with FASTA4 or BLAST tool,5 among others. A detailed explanation of
these search algorithms and tools will be given in 7 Chaps. 10 and 11.

In the early 1970s, when RNA sequences were first being revealed, Pipas and McMahon
suggested that it should be possible to predict the secondary structure of RNA from its
primary sequence information alone. Each RNA molecule has a specific secondary struc-
ture, and 16S rRNA in particular is very useful for identifying microorganisms. One can
download the Vienna RNA Package6 and enter a sequence to obtain a prediction of the
optimal secondary structure. 16S rRNA analysis is now being used as the most certain and
direct methodology in microbial analysis, including species that have so far been impos-
sible to cultivate, and in metagenomics research. For proteins, there are databases such as
the Protein Data Bank (PDB) and SWISS-PROT. The construction of predictive models
taking advantage of artificial intelligence and machine learning techniques that use graph
theory, hidden Markov models, artificial neural networks, and other concepts based on
probability and information theory, has made it possible to solve various computational
problems in biology. These problems include analysis of the tertiary structure of proteins
and RNA, prediction of the functions of large molecules that are very difficult to ascertain
experimentally, determination of open reading frames (ORFs) from expressed sequence
tags (ESTs), interpretation of gene regulatory networks, inference of restriction enzyme
cleavage sites, and investigation of splice sites.
1.3 The Human Genome Project

If, early on, the effort that built the substructure of bioinformatics was research into algo-
rithms and molecular structure modeling, constructing databases for sequence informa-
tion, computer simulations of biochemical reactions, etc., then the driving force behind
the explosive growth of bioinformatics in recent times was the Human Genome Project.
Despite initial concerns, the Human Genome Project not only completed its ambitious
plan approximately 2 years ahead of schedule, it facilitated the development of sev-
eral related areas of endeavor. Among them are the generation of global gene database
networks, development of ultra-fast sequencing technology, high throughput technol-
ogy, functional genomics, and personal genomics. But above all else, the establishment of
genomics, that has enabled studying the entire genome as a whole rather than individual
genes one at a time, can be considered its most meaningful achievement.
3 7 http://www.ebi.ac.uk/Tools/psa/
4 7 http://www.ebi.ac.uk/Tools/sss/fasta/
5 7 https://www.ncbi.nlm.nih.gov/blast/
6 7 http://www.tbi.univie.ac.at/RNA/
1.4 · Development of Microarray Technology and its Medical Applications
9 1
The generation of large-scale database networks as a consequence of the Human
Genome Project inevitably gave rise to complicated problems related to the storage, classi-
fication, annotation, searching, analysis, and processing of the data, as well as establishing
links between related databases. To give the simplest example, there are many critical limi-
tations in the existing haphazard nomenclature for large biological molecules. Researchers
who discover a new gene are supposed to register it with an appropriate database, such as
GenBank. However, seemingly simple tasks such as checking if that gene has already been
listed under another name become extremely complicated due to the lack of consistency
in gene nomenclature. Attempts are being made to solve this by assigning each gene in a
gene database an “accession number” as a unique identifier, but there is still a great deal
of entries with duplicated or overlapping information. Worse, sequences that are listed
starting from the 3’end are sometimes mixed together with sequences listed starting from
the 5’end without explanation. There are many cases of recording rat sequences as coming
from mouse, and vice versa. In recent times, various bodies have been formed such as the
HUGO Gene Nomenclature Committee7 and the Gene Ontology Consortium8 (2000) to
deal with the issue of systematizing the nomenclature in a semantically meaningful way,
and to secure both syntactic and semantic consistency for data processing.
1.4 Development of Microarray Technology

and its Medical Applications
Microarray technologies (e.g. DNA chip, SNP chip, aCGH) and high-throughput data
acquisition (e.g. NGS), recently in the spotlight, are the main tools that are making it pos-
sible to investigate a biological organism as a single, whole system. When the entire yeast
genome sequence was analyzed in April, 1994, out of all 6200 identified gene sequences,
they could only even estimate, let alone determine, the function of less than a quarter of
them (Brown et al. 1999). The first tool that enabled molecular biological analysis at the
genomic level was the DNA microarray. Detailed reviews on gene chips have been put
together in relevant references,9 and we will discuss them thoroughly in 7 Chaps. 5, 6, 7,
and 8. The use of microarrays has rapidly spread, and we now have several tens of differ-
ent types (e.g. SNP chip, aCGH, tissue microarray, tiling array, exon microarray, miRNA
microarray, cell chip). These have become established experimental methods used in most
of molecular and cellular biology fields.
Gene chips are just a large-scale integration of existing gene detection methods, such
as the reverse dot-blot procedure, parallelized for high-throughput analysis, paradigm
change in research they have occasioned. The introduction of biochips has expanded
genetic research (i.e. research on individual genes in isolation) into genomic research (i.e.
research on all genes in a genome simultaneously), and changed gene detection methods
(despite a few persisting technical limitations) from qualitative to quantitative analysis.
In other words, mathematical and quantitative methodologies have been introduced
into molecular genetics, which had classically relied on chemical analysis and qualitative
methodologies.
7 7 http://www.genenames.org

8 7 http://www.geneontology.org

9 Shena et al. 1995; Shalon et al., 1996; Peaseet et al., 1994; Lockhart et al., 1996; DeRisi et al., 1997
Genome-wide gene expression analysis has proved to be a very powerful tool than can
1 be easily and directly applied to clinical practice. Clinical medicine can be divided into diag-
nostics, risk stratification, and therapeutics, and in each of these there are reports of gene
expression analysis being successfully applied. Golub et al. [6] reported that it is possible to
distinguish between two subtypes of leukemia, acute myeloid leukemia (AML) and acute
lymphoid leukemia (ALL), by analyzing the expression pattern of 6817 different genes.
Alizadeh et al. [1] revealed the existence of two new subtypes of diffuse large B-cell lymphoma
(DLBCL) through genome expression analysis. He showed that of the two subtypes, the sur-
vival rate for the subtype whose expression pattern resembled germinal center B cells was
higher than that for the subtype whose expression pattern resembled activated B cells. They
predict that the entire paradigm of cancer diagnosis classifications and treatment policies
for future medicine will be permanently transformed by the use of gene expression analysis.
Subsequent large-scale research results have proved that prognostic subgroup predic-
tion in cancer is possible by means of genome expression analysis. It is expected accord-
ing to results from the prospective randomized controlled clinical trial MINDACT,10
which involves 6000 lymph node-negative breast cancer patients and determines based
on microarray assays whether or not to treat with chemotherapy, that genome expression
analysis will help avoid unnecessary chemotherapy in 25% of the patients without any
increase in the risk of recurrence. With current treatment methods 85–95% of patients
undergo unnecessary chemotherapy with a high risk of side effects, so reducing unneces-
sary chemotherapy is a major step forward. Microarrays have been rapidly commercial-
ized. Based on results published by van de Vijver MJ et al. [5], a molecular diagnostic
test called “MammaPrint” that estimates the likelihood of metastasis in breast cancer
using an expression pattern analysis of 70 genes has been developed and marketed by the
Agendia company. In 2007, the US Food and Drug Administration approved the use of
MammaPrint for lymph-node negative breast cancer patients who are under 60 years of
age and with tumors under 5 cm in size. MammaPrint currently costs $4200 per assay. A
similar test, Oncotype Dx, is being offered by Genomic Health at $3650 per assay.
Some have raised doubts about the scientific character of diagnostic methods employ-
ing genome expression data. However, while existing pathology tests are based on
pathophysiological and morphological differences between clinical specimens, those dif-
ferences can also be seen as the result of accumulated differences in genome expression
patterns inside cells. If we compare genome expression analysis to other methods that
have come to be used in cytopathology when morphological classification is difficult, such
as immune histopathology or detection of cell markers, then the basic concept behind
genome expression analysis as applied to making diagnoses in the pathology laboratory
is not a new one at all, since it can simply be seen as the simultaneous detection of tens of
thousands of important cellular markers.
SNP microarrays, triggered by research into single nucleotide polymorphisms (SNPs),
have greatly contributed to the investigation of correlations between individual variation
in genotype and disease phenotypes. Out of 3 billion total bases in the genome, approxi-
mately 0.1–0.2% of them (3–6 million bases) show genotypic variation between individu-
als at specific base locations; these variations are called single nucleotide polymorphisms,
or SNPs.11 From the point of view of simple base variation, a SNP is outwardly the same
10 Microarray In Node negative Disease may Avoid Chemo Therapy.

11 According to the latest research, SNPs occur at up to 10–50 million base positions.
1.4 · Development of Microarray Technology and its Medical Applications
11 1
as a single base mutation, but a distinction is made between cases with minor allele fre-
quencies of less than 1%, considered as bona fide mutations, and cases with frequencies
higher than 1%, considered as polymorphisms. There is a big difference in the medical
interpretation between these two types of cases. On the one hand, mutations are singled
out as the cause of Mendelian, simple-trait, “rare genetic disorders”, such as inborn errors
of metabolism, while on the other hand, polymorphisms are suggested as the mechanism
leading to “common complex diseases”, exhibiting complex traits. The “common disease-
common variant hypothesis” is a view of disease etiology that sees diseases that appear
evenly throughout groups with various genetic and environmental backgrounds, such
as schizophrenia, bipolar disorder, diabetes, and autoimmune disorders, as being caused
by “interactions between common genetic variants”, and thus such diseases are seen as
genomic disorders, differentiated from Mendelian genetic disorders which are caused by
mutations in single genes specific to the disorder. Recently, a lot of questions are being
raised about the truth of the “common disease-common variant hypothesis”, and debate
is heating up over whether or not the genetic causes of complex diseases might be better
explained as a mixture of a large number of rare variants instead of a small number of
common variants.
Apart from SNPs, which are confined to variations in single DNA bases, there can be
losses, gains, or transfers of large genomic segments, called “Copy Number Alterations
(CNA).” They were at first discovered in malignant tumors and studied as a type of trans-
formation specific to cancer cells, but it gradually began to be understood that they also
appear in normal cells, and in that context the phenomenon is called “Copy Number
Variation (CNV).” Moreover, through subsequent, large-scale international collabora-
tive studies, it came to be understood that this was part of a larger phenomenon, “Copy
Number Polymorphism (CNP),” that comprises a staggering 12% of the human genome, or
360 million bases, in widely dispersed regions. Variations in chromosomal segments that
have in the meantime occasionally been discovered in cytogenetic studies (e.g. Down syn-
drome, Philadelphia chromosome) can, at the molecular level, be counted among count-
less observations of CNP. Henceforth, CNP will greatly contribute to our understanding
of many complex trait diseases. The transformation of the way variations at the scale of
genomic segments are viewed, from CNA to CNV to CNP, can be the process of coming
to understand genomic variation not as merely a disease process, but as a more fundamen-
tal informatics property of biological phenomena. A detailed discussion of the analysis of
genomic polymorphism and variation data will be taken up in 7 Chaps. 14, 15, 16, and 17.

The first application of genomic polymorphisms is personalized medicine.

Pharmacogenomics has revealed differences according to SNPs and other genomic poly-
morphisms in the reactions of patients to drugs used to treat complex trait diseases, such
as depression and high blood pressure, and tens of relevant studies are currently under-
way. In particular, genotyping of polymorphisms in the genes for liver metabolic enzymes
responsible for drug metabolism has just been recognized as key information for deciding
on the dose and duration of drug treatment. Recently a clinical study to directly com-
pare established clinical protocols for deciding on the right dosage of the thrombolytic
agent warfarin, which has a narrow therapeutic window and whose appropriate dosage
for administration is difficult to determine, with protocols derived from the latest phar-
macogenomics research, was conducted at the IWPC.12 It confirmed the superiority of the
12 International Warfarin Pharmacogenomics Consortium.

pharmacogenomics approach, and pharmacogenomics testing is rapidly being applied to

1 clinical decision-making.
If pharmacogenomics mainly utilizes genome structural polymorphisms derived from
SNP microarrays, then toxicogenomics mainly utilizes genomic expression patterns of
different tissues and organs derived from DNA microarrays, and conforms to the par-
adigm of predictive toxicology, which seeks out markers at the genomic level that are
predictive of toxicity. While the number of chemical substances that toxicogenomics is
concerned with is extremely large (106 ~), the number of serological and pathological
predictive factors that one can test for is limited, so researchers are looking for ways to
overcome the limitations on predictive toxicology which they have previously been unable
to do. Therefore, toxicogenomics has a different goal from pharmacogenomics, which has
pursued a “personalized pharmacology.” Toxicogenomics pursues rather a “predictive
toxicology” for a large number of chemical compounds. It is very exciting to see how phar-
macogenomics and toxicogenomics are converging so that pharmacogenomics is aiming
at predictive pharmacology through genome expression analysis, while toxicogenomics is
aiming at personalized toxicology through analysis of genomic variation.
One important thing in bioinformatics research is new drug discovery. Existing drug
discovery strategies mainly involve examining candidate substances that interact with a
target protein of interest. The bioinformatics approach is to track down candidate sub-
stances capable of affecting genes involved in metabolic pathways of interest, and the
expression of those genes. Moreover, thanks to the massive databases that have already
been accumulated, bioinformatics-based drug discovery research makes use of an infor-
matics paradigm based on data mining rather than traditional experimental biology. For
example, the US National Cancer Institute’s NCI-60 screen has produced data on the
optimal concentrations of over 50,000 chemical substances for inhibition of growth13 of
60 representative human cancer cell lines, and this information has been deposited into
publicly-available electronic databases.14 Comprehensive gene expression profiles for all
of the cell lines are also available, and links to databases containing the 3-dimensional
structure of the chemical substances and information on their functional regions are pro-
vided, as well as links to databases on protein function and tertiary structure organized
by gene. With these resources, and through pure data processing and artificial intelligence
techniques, one can efficiently find a great many candidates for new drugs. Finally, by first
running computer simulations on the reactivity between potential ligands and receptors
as a basic preliminary experiment on the selected candidate substances, and identifying
those candidates that have high specificity, the time required for new drug discovery can
be innovatively reduced.
1.5 Explosion of Data from Next-Generation Sequencing
The traditional Sanger sequencing method of DNA base sequencing is now referred to as
first-generation sequencing, while the more recent sequencing methods claim to stand for
the “next generation” in sequencing, and so are called next-generation sequencing (NGS).
There are new sequencing platforms on the market today claiming to be of the third and
13 GI50: 50% Growth Inhibition, LD: Lethal Dose, etc.

14 In cutting-edge drug discovery research at most pharmaceutical companies, a database of approxi-
mately 20,000 candidate substances is used.
1.5 · Explosion of Data from Next-Generation Sequencing
13 1
fourth generations.15 The new technologies each have their pros and cons, but the key fac-
tor for competitiveness at the present time is speed and cost-effectiveness.
The term “Next-Generation Sequencing” first started being used in 2007 when Solexa
was merging with Illumina. Owing to the simplicity and greater efficiency of NGS,
researchers are either replacing their current equipment for NGS platforms, or supple-
menting it in order to keep up with the rapid growth in the field. Specifically, RNA-Seq
has replaced DNA microarrays for the most part, and SNP microarrays and aCGH are
also being replaced by NGS platforms. Through whole transcriptome sequencing, not just
mRNA, RNA-Seq is blazing a trail into new domains which have previously been difficult
to study, such as research on splicing, isoforms, and ncRNA.
But understanding the new paradigm sparked by the introduction of NGS technol-
ogy is more important. The MIT-lead 1000 Genomes Project has made publicly available
genomic sequence information of populations from across the world, which anyone can
access.16 This means that instead of research being driven by interest in a particular bio-
logical species or a human population with a specific disease, as has up to now been the
case, it can proceed to focus on the genomic sequence of each and every individual, and
the era of personal genomics is truly beginning. Exome and targeted sequencing have
quickly brought about advancement in the research of rare genetic disorders which are
understudied, and is fundamentally altering our traditional understanding of human
genetic variation as being based on Mendelian laws of heredity.
Cancer is understood as a disease with genomic instability as its source. The Cancer
Genome Atlas (TCGA) began as an effort to collect microarray data on the genomes of
cancer cells, but is now not only actively employing NGS to obtain genome sequence data,
but also Chip-Seq, RNA-Seq, and methyl-Seq, to acquire almost all types of genomic
data from a single sample, and by making them publicly available, setting the stage for
an epoch-making turning point in cancer research. The International Cancer Genome
Consortium (ICGC), started in 2008, is making its goal the sequencing of samples from
over 10,000 individuals, representing tens of different kinds of cancer, with the participa-
tion of over 10 different countries worldwide. We will cover NGS data analysis in more
detail in our next volume, Genome Data Analysis II: NGS.
The re-interpretation of the early history of life in the “RNA world17” hypothesis is
in part based on new discoveries showing that intergenic regions are also transcribed
and have biological meaning. Analogously, the field of “epigenomics” is based on the
discovery that the genome can be extensively regulated by modifications to DNA and
the surrounding chromatin proteins without altering the actual genomic sequence. This
can occur through direct methylation of DNA, or “chromatin remodeling” by means
15 Commonly, third generation sequencing refers to methods for sequencing short reads that are
too long for single molecule sequencing (e.g. Pacific Bioscience), while fourth generation goes
beyond established biochemistry-based methods to exploit electrical properties to determine base
sequence.
16 As of early 2017, genome sequence data of more than 3000 individuals were publicly available.
17 The view that the first biological molecule was not DNA but RNA. This is different from the so-called
“central dogma” of biology, where DNA is responsible for storing information, proteins are the mole-
cules that carry out biological functions, and RNA acts as a message or modulator such that the flow
of information is DNA → RNA → protein. In the hypothesized “RNA world”, RNA was the primordial
biological molecule which both stored information and had functional activity (as in the ribosome),
and DNA and proteins later evolved to take over the roles of information storage and functional
activity, respectively.
of methylation, acetylation, or ubiquitination of the surrounding histone proteins, and

1 is related to an individual’s ability to adapt to short term changes in the environment.
When it was revealed that cancer cells undergo considerable chromatin remodeling, a
more conceptually-advanced cancer medicine was developed and new diagnostic and
classification techniques were introduced. Lifestyle factors such as change in the environ-
ment, diet, and exercise exert influence over chromatin remodeling. It had been thought
that such “acquired characteristics” could not be passed down to one’s offspring, but a
molecular mechanism for this possibility has emerged when it was reported, surprisingly,
that parts of chromatin remodeling can be duplicated in the process of DNA replication,
thus leading to the notion of “transgenerational epigenetic inheritance”. This not only
broadened our understanding of the interaction between genes and the environment, but
provided an opportunity for Lamarck’s theory of “use and disuse” to receive new attention.
Epigenetics will likely make a large contribution to our understanding of diseases related
to lifestyle and environment, a previously unexplored field for genetics. Epigenomics is
also, from a bioinformatics research perspective, in a very early stage, and is challenging
and controversial.
Ultra-fast sequencing technology has gone beyond the era of human genomics and
has initiated the new era of personal genomics, signaling the true beginning of personal-
ized, predictive medicine. Increase in the speed of sequencing given birth not only to
epigenomics, but to the new study of microbial metagenomics, by making possible the
sequencing of genomes of microbial species that cannot be cultivated in the laboratory.
These advances can also be used to analyze the genomes of all human intestinal bacteria,
raising new prospects for research into gastrointestinal pathophysiology. Epigenomics
and metagenomics data analysis using NGS will be discussed in detail in 7 Chaps. 18, 19,

20, and 21.
1.6 The Era of Systems Biology and Biomedical Informatics
Personalized, predictive medicine is defined as using all data related to a patient’s genotype
and genome expression to get the best possible perspective for classifying that patient’s
disease, coming up with a treatment plan, and taking preventive measures. Pathological
examinations, image data, and the application of clinical data are also equally important.
That entails overcoming the limitations of current medical decision-making by physicians
relying mainly on clinical signs and categories, by obtaining data directly at the molecular
level for a more detailed approach based on a patient’s intrinsic characteristics. Going
forward, it means taking each patient as an individual, providing the right medication at
the right time with the right dosage for the right patient, thus optimizing treatment and
prevention.
In truth, there are not that many personalized genomic treatment options available
to give to patients. President Barack Obama (then a senator) submitted the Genomics
and Personalized Medicine Act in order to remove scientific and regulatory obstacles
and market pressures, and Secretary Michael Leavitt organized the Secretary’s Advisory
Committee on Genetics, Health, and Society. As has always been the case, the road to data-
based, genomic medicine does not look good. There are especially widespread and influ-
ential concerns about discrimination for employment or insurance eligibility based on the
results of genetic testing. In the United States, the Genetic Information Nondiscrimination
Act has been introduced. Educating the public on the difference between personalized
Bibliography
15 1
medicine based on genomic information, and its diagnosis, treatment, and strategies for
prevention will also emerge as an important task.
Systems biology and biomedical informatics is the combination of biomedical infor-
matics and genomic medicine, and the knowledge of the three fields of molecular biology,
informatics, and clinical medicine form its trinity. It is an essential field of study that will
play a leading role in the life sciences of the future. It’s obvious that if the relevant medical
institutions had not had such excellent clinical information systems, then the research of
Alizadeh et al. [1] and Golub et al. (2000), which opened new prospects in the study of
tumor pathophysiology, would not have been possible. The discussion over how to com-
bine systematic, genome-level biological data with massive health information systems is
an animated one.
Large high-throughput genomic data acquisition and integrative analysis will enable
a systems-level integration of biological phenomena permanently change paradigms in
medicine and biotechnology. If the so-called “omics revolution” is leading us to a hori-
zontal integration of the constituent units of every living thing, then biomedical informat-
ics will lead to a vertical integration, from the microscopic level of molecular biological
phenomena to the macroscopic level of human beings and societies. Through the delicate
interweaving of the various dimensions of systems biomedical informatics, which is a syn-
thesis of informatics with a molecular understanding of biological phenomena, we can see
an image of the future.
Take Home Message

55 Understand bioinformatics and paradigm-shift in a traditional genetic study by
the data.
Bibliography
1. Alizadeh AA et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression
profiling. Nature 403(6769):503–511
2. Altman RB (2000) The interactions between clinical informatics and bioinformatics: a case study. J Am
Med Inform Assoc 7(5):439–443
3. Brown PO, Botstein D (1999) Exploring the new world of the genome with DNA microarrays. Nat
Genet 21(1 Suppl):33–37
4. Bullinger L et al (2004) Use of gene-expression profiling to identify prognostic subclasses in adult
acute myeloid leukemia. N Engl J Med 350(16):1605–1616
5. van de Vijver MJ et al (2002) A gene-expression signature as a predictor of survival in breast cancer. N
Engl J Med 347(25):1999–2009
6. Golub TR et al (1999) Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science 286(5439):531–537
7. Lorenz R et al (2011) ViennaRNA Package 2.0. Algorithms Mol Biol 24(6):26
8. Pipas JM, McMahon JE (1975) Method for predicting RNA secondary structure. Proc Natl Acad Sci U S
A 72(6):2017–2021
9. Povey S et al (2001) The HUGO gene nomenclature committee (HGNC). Hum Genet 109(6):678–680
Epub 2001 Oct 24
10. Redon R et al (2006) Global variation in copy number in the human genome. Nature 444(7118):
444–454
11. Valk PJ et al (2004) Prognostically useful gene-expression profiles in acute myeloid leukemia. N Engl J
Med 350(16):1617–1628
12. Zuckerkandl E, Pauling L (1962) Molecular disease, evolution, and genic heterogeneity. In: Horizons in
biochemistry. Academic Press, New York
17 2
Next-Generation
Sequencing Technology
and Personal Genome
Data Analysis
2.2 Data Format – 20

2.2.1 FASTQ Format – 20
2.2.2 CSFASTA Format – 21
2.2.3 Concept of QV – 21
2.3 Alignment of NGS Sequencing Reads – 22

2.3.1 Sequence Alignment with a Hash Table – 23
2.3.2 Sequence Alignment with a Suffix Tree – 23
2.4 SNP and InDel Detection – 24
2.5 Annotation of Sequence Variation and Function

Prediction – 25
2.5.1 Annotation and Clinical Interpretation of Common
Variations – 25
2.5.2 Annotation and Clinical Interpretation of Rare Variants – 26
2.5.3 KEGG Disease Pathways Mapping – 27
2.5.4 Pharmacogenomics – 28
2.5.5 Distribution of Genetic Variants in the Population – 28
Bibliography – 30

https://doi.org/10.1007/978-981-13-1942-6_2
18 Chapter 2 · Next-Generation Sequencing Technology and Personal Genome Data Analysis

In this chapter, one will learn about data format of next-generation sequencing, a method
of sequence alignment, assembly and visualization tools, and various annotation and
2 detection methods of SNP/InDel from the obtained sequences. This chapter describes the
difference in analytical methods between common variants and rare variants, and an analy-
sis approach using biological pathways, pharmacogenomics, and information of racial dif-
ferences using The 1000 Genomes Project Data.
2.1 Introduction
DNA sequence analyses, which unveils the genetic information of life, is the most classic
methodology of the field. As described in 7 Sects. 1.2, 1.3, and 1.5 of 7 Chap. 1, analy-

sis of homologous sequences and researches in sequence alignment sparked numerous

informatics-based approaches to biological phenomena, which led to the Human Genome
Project and the groundbreaking advancement of next-generation sequencing techniques.
Thereafter, informatics researches have become the major field in the molecular biology
and human genetics. The early DNA sequence-centric research was expanded to the entire
field of genetics including transcriptomics and epigenetics, provided the foundation of
personal genomics and personalized medicine, and pioneered a new field such as metage-
nomics in the microbiology. As the 1000 Genome Project and The Cancer Genome Atlas
(TCGA) project opened an era of bio-data by releasing publicly-available big data, anyone
is now able to analyze human genome data.
DNA sequences can be used to (1) identify individual and racial characteristics, (2)
elucidate a cause of an inherited disease (i.e., genetic variant and chromosomal abnor-
malities), and (3) look for genetic defects repressing human complex disease (i.e., diabetes
and hypertension). Furthermore, gene expression, genome diversities, and their interac-
tion can be analyzed with DNA sequences. The term “Next Generation Sequencing” was
used when SOLEXA was acquired by Illumina in 2007. Nowadays the third and fourth
generation products are on sale.
Next-generation sequencing technology has also brought significant progress in
genome variation studies. Since the release of James Watson’s genome was released in
2007, the number of sequence variations registered in dbSNP has considerably increased
(. Fig. 2.1) and there are still many genetic variations as yet unknown. As there are also a

lot of unknown disease-related variations, it is expected that the next-generation sequenc-

ing technology will make it easier to discover SNPs (one base pair), InDels (one to 100
base pairs), and SVs (>1000 base pairs) easier to discover. Studies of human genome
diversity and variation data analysis will be descried in 7 Chap. 14. Studies of these

genetic variations will be extended to understanding of human DNA sequence variations

that are related to various phenotypes including disease susceptibility and response to
treatment.
In mid-2011, sequencing an individual costed approximately $4000 and opened the
$1000-per-genome era in 2013. However, the accuracy and completeness of the current
personal genome sequencing is still at its early stage, where the quality of personal genome
is still far away from medical application. In 2011, Nature Genetics published an article
titled “Towards a medical grade human genome sequence”, which encouraged that we
need a sequencing technology powerful enough to obtain a highly qualified personal
genome sequence data. The article’s arguments are two-forld: that the quality of current
2.1 · Introduction
19 2
120
Submissions
105.1 million
Ref. Clusters
100 Validated
Number of SNPs (millions)
80
60
40
23.7 million
20
14.5 million
0
Nov02
Nov03
Nov04
Aug02
Jan03
Apr03
May03
Jun03
Aug03
Jan04
Jun04
Aug04
Jan05
May06
Apr08
May09
Feb10
Dec02
Mar03
Mar03
Mar04
Sep05
Mar07
Oct02
Oct03
Oct07
dbSNP Release
.. Fig. 2.1 The trend of the registered number of SNPs in dbSNP database. Since Watson’s genome
acquired by NGS technology was released in 2007, the number of sequence variations registered in
dbSNP has dramatically increased. (Courtesy from Koboldt et al. [12])
sequencing data has limitations; and that anyone will be able to obtain a high-level of
personal genome data in the near future.
It is a new challenge. The emergence of the personal genome sequencing technology
has changed the paradigm of existing studies that compare sequences between species
or certain groups. As we are now able to utilize the data of The 1000 Genomes Project
and published medical and biomedical knowledge including references in PubMed, in
the near future, a physician will be able to provide patients with medical advice based
on patients’ respective sequencing data. It is time for physicians to learn the meaning of
personal DNA sequence, as a bioinformatician should, and to reflect this new perspective,
medical schools also need to include such programs in their curriculum.
Ironically, while personal genome sequencing costs have decreased considerably,
analysis and interpretation of personal genome sequences still remain to be the most dif-
ficult part of the challenge. . Figure 2.2 is a workflow of personal genome analysis and

interpretation that are operated in the Center for Biomedical Informatics, Seoul National
University School of Medicine (SNUBI). The first step with interpreting the personal DNA
sequence is to search for genomic variations by comparing the personal DNA sequence
with the reference genome sequence. Identified variations are assigned with any of
interpretation and annotations. Variations are not only individually interpreted but also
processed for variant-sets assessment, clinical risk analysis, pharmacogenomics, gene-
environment interpretation, and other bioinformatics analyses. These analyses require an
incorporation of the latest technologies such as highly processed distributed file systems
and cloud computing and an integration of diverse biological databases. Through integra-
tion of various bioinformatics tools and knowledgebase analysis methodologies compiled
for the last decade, personal genome interpretation could be achieved (. Fig. 2.2).
NGS analysis(preprocessing) Variant-set assessment External

knowledge and
Over expresentation analysis databases
2 Variants detection
Phenotype-
Clinical risk analysis associated
variants DB
Variants analysis Pre-test probability analysis (SNPedia
etc..)
Variants annotation
Post-test probability analysis
GET-evidence
ANNOVAR Promethease Pharmacogenomics
VAAST SVA Pharmacogenomic analysis

PharmGKB
Interpretome ETC Gene-environment interaction
Gene associated with disease
Etiome
Variants visualization
using GBrowser Disease-environment analysis
Computing resource
Lustre clustered LIMS Galaxy workflow
file system
.. Fig. 2.2 A workflow of personal genome interpretation developed by the Center of Biomedical Infor-
matics, Seoul National University School of Medicine (SNUBI)
This chapter will give an understanding of fundamental tools used for analyzing and
interpreting next-generation sequencing data such as sequence data format, algorithms
of sequence alignment/assembly/visualization tools, and summarizes the process of
sequence variation analysis. More elaborate processes and interpretation of NGS data will
be provided in “Genome data analysis II for NGS, Cancer and disease genome”, which will
be published as a separate volume. DNA sequence motifs and phylogenetic tree analyses
will be described in 7 Chap. 11.
2.2 Data Format
Understanding of data format is the first requirement for sequence data analysis. NGS
sequencers generate data in FASTQ and CSFASTA formats, and a data file including QV
(Quality Value), a series of ASCII symbols which represent quality of each base call.
2.2.1 FASTQ Format
FASTQ format was originally developed by Wellcome Trust Sanger Institute in order to
save sequence data and is now a widely used format, adopted by Illumina NGS sequencer
as well as the Ion TorrentTM sequencer from Thermo Fisher. FASTQ files are essentially
a combination of QV score for each base and FASTA file. To convert all scores (i.e., num-
bers) to one letter, QV includes various ASCII codes.
A sequence in FASTA format begins with a single-line description, followed by lines of
sequence data. The description line is distinguished from the sequence data by a bracket
2.2 · Data Format
21 2
.. Fig. 2.3 An example of a FASTA file format used in sequence data processing
(“>”) in the first column. The word following “>” is the identifier of the sequence, and the
rest of the line is the description (both are optional). There should be no space between
“>” and the first letter of the identifier. It is recommended that all lines of text be shorter
than 90 characters. The sequence ends when another line starts with “>”; this indicates
the start of another sequence (. Fig. 2.3). The further details of FASTA format will be

described in 7 Sect. 11.2 of 7 Chap. 11.

2.2.2 CSFASTA Format

CSFASTA is a data format generated by the SOLiD DNA sequencer. Much like FASTQ for-
mat, CSFASTA format includes QV and sequence information but generates two separate
files for each. One file presents four bases (A, G, T, C) with four color-coded numbers (0,
1, 2, 3) and these numbers are later converted into DNA sequences. Each color codes of
QV are given a number and one color represents two consecutive bases. In other words,
two consecutive colors represent one common base of two consecutive bases. For exam-
ple, 0 indicates sequences of AA, CC, GG, and TT. The second base of the two consecutive
bases defines the first base of the next two consecutive bases. As another example, CG is 3
and AACG is 013 (0 → AA, 1 → AC, 3 → CG, . Fig. 2.4).
2.2.3 Concept of QV
The concept of QV originates from the Phred quality score, which has been used in
sequence analysis for a long time. It takes a log value of an error rate and a score of 10 indi-
cates 10% possibility of error rate, 20 for 1%, and 30 for 0.1% (. Table 2.1). For example,

if a sequencer has 99.99% of accuracy, most generated reads would have QV greater than
40. However, QV is not an absolute measurement for quality assessment because every
.. Fig. 2.4 Color codes used in

CSFASTA format
2nd base
A C G T
2 0 1 2 3
A
C 1 0 3 2
1st base
G 2 3 0 1
T 3 2 1 0
CY5 TXR CY3 FAM

Ex) AA = 0, CG = 3, AACG = 013
.. Table 2.1 Phred quality scores are logarithmically linked to error probabilities
Phred Quality Score Probability of incorrect base call Base call accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10000 99.99%
50 1 in 100000 99.999%
sequencer has its own data generation mechanism to detect a signal, which is represented
as a base call or color call at a certain level of accuracy. Therefore, the concept of QV is
applicable to every sequence but QV scores are not comparable between sequencers. Note
that SOLiD sequencers output raw data without any QV-based filtering process as other
sequencers do and it is meaningful to take further analysis of the unfiltered raw data as
specific genomic regions with very low QV scores can exist.
2.3 Alignment of NGS Sequencing Reads
Personal DNA genome is re-constructed by the process of aligning and mapping, which align
and map short sequences (i.e., reads) generated from NGS sequencer, to the human refer-
ence genome. As existing sequence alignment tools such as BLAST (Basic Local Alignment
Search Tool) are slow for mapping these short reads to the genome, new tools for alignment
and mapping have been developed. . Table 2.2 shows new tools and running time.

Most fast sequence alignment algorithms generate an index as an auxiliary data struc-
ture in order to increase mapping speed. Indexing of huge DNA sequence enables users to
2.3 · Alignment of NGS Sequencing Reads
23 2
.. Table 2.2 Sequence alignment tools for short sequences
Program Author(s) Web site Platform Aligned Gbp

per CPU day
Maq Li H et al. 7 http://maq.sourceforge.net/

Illumina, ~0.2
SOLiD(partial)
Bowtie Langmead B 7 http://bowtie-bio.

Illumina ~7
et al. sourceforge.net/index.shtml
SSAHA2 Ning Z et al. 7 http://www.sanger.ac.uk/

Illumina, ~0.5
resources/software/ssaha2/ SOLiD, 454
BWA Li H and 7 http://bio-bwa.sourceforge.

Illumina, ~7
Durbin R et al. net/bwa.shtml SOLiD, 454
SOAP2 Li R et al. 7 http://www.sanger.ac.uk/

Illumina ~7
resources/software/ssaha2/
find where short sequences are located in the DNA sequence at a much faster rate. There
are two sequence alignment algorithms based on indexing method: suffix trees and hash
table. Hash table is a data structure that stores keys and corresponding values, and hash
function searches the location (index) where value is stored, whose is corresponded to the
key. Suffix tree is a data structure that stores suffixes of a string.
2.3.1 Sequence Alignment with a Hash Table
Indexing method used in the hash table (hash map) stores the position of each sequence,
which are processed by breaking them into consecutive tuples. K-tuple is defined as a
string that is composed of k number of characters. For example, if string s is “banana”
and k is four, a set of all 4-tuples is {“bana”, “anan”, “nana”}. As DNA is composed of four
bases, A, C, G, and T, the maximum number of k-tuple is four**k and the number of its
position information is similar to the DNA sequence length. Once the indexing table of
the reference genome or query sequence are constructed by this method, alignment based
on the hash table search the seed from the hash table, which is a perfect match between
query sequence and reference genome, and sequence alignment start to extend from the
matched seed. This method is called “seed-and-extend.”
Bin Ma suggests to use k things of non-consecutive string as a seed sequence and
define relative position of k things of string as a model and k as a weight. For example,
for 1110111, a sequence model with weight of 6, “actgact”, “actgact” as well as “actgact”
and “acttact” can all match with the seed. “Spaced seed,” as the method is called, has been
reported to show better sensitivity than traditional seed alignment method.
2.3.2 Sequence Alignment with a Suffix Tree
Suffix is a word ending, which is a substring of the given string. For example, a suffix of
the string of “banana” is the string “nana”. A suffix tree is a tree structure including any
.. Fig. 2.5 Suffix tree for the

string “banana”
^
B A N $
2 A N $ A
N A N $
A N $ A
N A $
A $
substrings of given strings and an efficient data structure to operate a string. There are six
suffixes in “banana”: “banana”, “anana”, “nana”, “ana”, “na”, and “a.” As “ana” is a suffix of
“anana”, “na” is a suffix of “nana”, “a” is also a suffix of “anana”, the efficient way to construct
a suffix tree is with only itself and “anana” and “nana.” This tree enables to search a pattern
of small and repeated sequences (. Fig. 2.5). “$” marks the end position of the suffix.

However, when the suffix tree is constructed for original strings, file size of the com-
pleted structure is often too big to store. To overcome this problem, suffix array is sug-
gested, which is a stored array of all suffixes of a given text, as it is highly space-efficient
than suffix tree. Burrow-Wheeler Transform (BWT) is a text transformation algorithm,
developed by Michael Burrow and Davie Wheeler in 1994.
A text transformed by BWT is more efficiently compressed than the original text in
general. BWT has been widely used in text compression as well as in search field. BWT
is originally developed for data compression but now it is also used to sort NGS DNA
sequences due to storage space problems. An advantage of suffix tree, suffix array, and
WT index is that they can search all positions of a substring in the original string at once.
2.4 SNP and InDel Detection
After sequence alignment, SNP and InDel detection is one of key processes in personal
genome sequence analysis. In early NGS sequencing studies, filtering step was employed
to select cases that have Phred-type Quality Score of 20 or greater. Heterozygous
genotype was generally accepted to have non-reference allelic rate of 20–0%, and for
homozygous genotype, 100–21%. This method was an empirical method to determine
genotype. However, disadvantages of this method was that (1) there can be a loss of
many information due to low coverage of sequence alignment or moderate experiment
and (2) the uncertainty of inferred genotype is not measurable, and therefore, various
probabilistic methods are used to alleviate these problems. For example, error rate of
used platform, bad mapping probability, coverage is analyzed to calculate a likelihood
to determine whether the given position is heterozygous or homozygous. For this rea-
son, recent SNP and InDel detection programs apply data preprocessing and Bayes’
framework.
2.5 · Annotation of Sequence Variation and Function Prediction
25 2
Variant detection pipeline at SNUBI
Sample Ref. Low

FASTQ FASTA VCF
Variants Variant quality

Alignment
calling recalibration
Convert Apply
Recalibration
SAM to BAM recalibration
Merge INDEL
Our.snps.vcf
BAM files realignment
TOOLS
Sort Remove
the BAM file PCR duplicates BWA PICARD
SAMTOOLS GATK
.. Fig. 2.6 Workflow for genome variations, detection developed at SNUBI
Data preprocessing involves quality estimation of sequence fragments and applies filters
accordingly. According to the results from different sequence fragment analyses, sequence frag-
ments are evaluated as a paralogue or repeated sequences or filtered out. QV is re-constructed
by various statistical methods. For example, small InDel are re-arranged by a process of
sequence rearrangement. After data preprocessing, Bayes’ method is used for data filtering.
Detection of genome variation involves a series of complicated processes. It is neces-
sary to specialize the process according to the nature of NGS sequencing data and the pur-
pose of analysis. . Figure 2.6 shows the workflow of the detection of genome variations

developed at SNUBI. Each step selectively uses tools such as BWA, PICARD, SAMTOOLS,
and GATK for any analysis purpose (. Fig. 2.6).
2.5 Annotation of Sequence Variation and Function Prediction
2.5.1 nnotation and Clinical Interpretation

A
of Common Variations
After sequencing and numbering, each person has a genomic data of about 3GB. Because
this is too large an amount to possess in personal storage, downloading, saving, and
copying were very difficult in the work environment. However, if you only save the dif-
ferent modifications of space, (SNP, InDel, SV) it is possible to compress the data into
about 4 MB and email as an attached file. If we can decrease the size of data, the speed of
additional analysis of data can also be increased.
The next step after sequencing is identifying sequence variation and attempting to
2 analyze the data. In order to interpret the individual variation, we use an informatics tool
Promethease that can analyze the variation and the effects of these variations using 100,000
large data sets concurrently, which is summarized into a HTML report. Promethease
Analysis bases its analysis on variations stored in SNPedia database. There is a free and
paid version of Promethease, where free reports take about 13–15 min to analyze and
summarize. Promethease categorizes its summarized results into several categories: Good
News, Bad news, These seem interesting, Medicines, Medical Conditions, and Topics. Paid
reports cost 2 US dollars, and provides a substantial increase in report generation speed
and also provides Tab delimited, Tag Cloud, comparison of personal genome to others in
an F1 report, and a family report in which an individual’s genome is compared to his or
her family. Magnitude, repute, population frequency, and text color are used to visual-
ize results and help interpret one rsID per one variation. The report file is supported in
23andMe, deCODeme, Navigenics format.
Genome interpretations cannot be made solely on functional inferences of individual
genome variations. About 300,000 to 400,000 variations are identified per sample, varied
by factors such as racial background of an individual. The discussion of these interactions
isnot addressed in this book, however those interactions must be taken into consideration.
Not just next generation sequencing data but also the individual’s age, gender and health
status, clinical information must be taken into consideration as it may increase the risk
of disease. Though incomplete, . Fig. 2.7 depicts an example of a clinical risk prediction

system developed at SNUBI.
2.5.2 Annotation and Clinical Interpretation of Rare Variants
Promethease is a tool for medical interpretation on SNP-based common variations and

utilizes results from case-control-based epidemiological studies like GWAS. Unlike SNPs,
in the case of a rare variants, epidemiological studies are not easily executable and there-
fore not many known cases of such experiments exist. Although such analysis of non-
synonymous variants located within coding regions provide useful information, identifying
disease variants still remains to be a challenge. Identifying functional variations from Next
Generation Sequencing data is a major task in bioinformatics. Key function prediction and
annotation tools in bioinformatics that are available today are the following.
2.5.2.1 SIFT (Sorting Intolerant from Tolerant)

Based on physics and sequence homology, SIFT algorithm predicts functional con-
sequences of amino acid substitutions caused by nsSNP. The algorithm calculates how
damaging a non-synonymous coding variants could be to protein function without any
experimentation. SIFT scores range from 0 to 1, where 0 indicates that the variant is del-
eterious. SIFT algorithm predicts NMD (Nonsense Mediated Decay) for InDels.
2.5.2.2 PolyPhen (Polymorphism Phenotype)

Just like SIFT, Polyphen predicts function changes and structural changes caused by
amino acid changes by nsSNP modification. The difference is that Polyphen obtains its
results from molecular physics and comparison analysis based on evolution.
27 2
Risk analysis for genetic variants at SNUBI

Stomach cancer
Gastric cancer, also called stomach cancer, may cause up to 1 million deaths per year worldwide. In
several countries it is the second largest cause of cancer deaths; more information is available at
Wikipedia on the general topic. Although the most frequent correlation for increased gastric cancer risk
is not genetic - it is infection by Helicobacter pylori - genetic predisposition may also play a role.
(snpedia)
P1, rs2294008, TT
A variant genetic risk for Stomach cancer
P...
Genotype
P...
TT
0.00036 0.00038 0.00040 0.00042 0.00044
Pre-test Prob: 0.0372%

Post-test Prob: 0.0427 %
Your genetic risk for Stomach cancer

P...
Disease
P...
Stom...
0.00036 0.00038 0.00040 0.00042 0.00044
Pre-test Prob: 0.0372%

Post-test Prob: 0.0427 %
Contingency table
Genotype Case Control LR Posttest Prob
TT 400 256 1.1484 0.0427
TC, CC 49 74 0.4867 0.0181
Reference
21070779 Saeki Norihisa et al. A functional single nucleotide polymorphism in mucin 1, at chromosome
1q22, determines susceptibility to diffuse-type gastric cancer. Gastroenterology. Mar, 2011.
.. Fig. 2.7 A program for disease risk analysis developed at SNUBI
2.5.2.3 hD-SNP (Prediction of Human Deleterious Single

P
Nucelotide Polymorphisms)
PhD-SNP takes single amino acid changes caused by nsSNP as input and classifies these
mutations for disease associations based on protein sequence and profile information
using support vector machine.
2.5.2.4 VAAST (the Variant Annotation, Analysis and Search Tool)

VAAST extracts damaged genes and disease-causing variants from genomic sequences.
Unlike other programs, VAAST scores variants from a wide array of regions, including
non-coding regions, and also identifies rare variants that cause rare genetic diseases.
2.5.3 KEGG Disease Pathways Mapping
KEGG views diseases as molecular biology system in its perturbed state and classifies them
in 3 categories: monogenic disease, multi-factor diseases (cancer, autoimmune disease,
Diagnostic markers
2 Genetic perturbations
(germline/somatic mutations, etc.)
Molecular
network system Disease
Environmental perturbations
(incl. pathogens, microbiome)
Therapeutic Genomic
drugs biomarkers
.. Fig. 2.8 Basic interpretation of KEGG in disease pathway
metabolic disorders, etc.), and infectious diseases. Disease classification criteria follows
ICD-10, ICD-O-3, and MedDRA (. Fig. 2.8). KEGG disease database organizes diseases

according to genomic perturbation and environmental perturbation. KEGG organizes

each disease into a disease pathway map that includes information such as diagnostic
indicators, treatment, and genomic biomarkers. KEGG allows us to analyze different fac-
tors of genomic variability in diseases using the disease pathway map.
2.5.4 Pharmacogenomics
Pharmacogenomics is the study of different responses to drugs caused by variability in an

individual’s genome. As the importance of pharmacogenomics has been steadily increasing
in the last decade, the development of next generation sequencing has improved research
in pharmacogenomics and as a result, clinical use has also been increasing. Although
the direction of pharmacogenomics research has largely been centered around common
variations, rare disease and epigenomic analysis methods have also been developing at
a fast rate. However, the correlative relationship between colossal amounts of identified
genomic variations and drugs increase exponentially, and the confirmatory clinical stud-
ies are, in reality,too difficult to perform because of the sheer number of variant-drug
relationships. Failure to standardize drug reaction phenotypes and the limited experimen-
tal resources regarding pharmacogenomics and pharmacology are obstacles that need to
be overcome. In our detailed discussion regarding pharmacogenomics, we will utilize
GENOtation (7 http://genotation.stanford.edu/), a personal genomic variation analysis

tool from Stanford University, to analyze personal drug response and interpret the results.
GENOtation supports 23andMe data format.
2.5.5 Distribution of Genetic Variants in the Population
1000 Genomes Project started in January of 2008 with the intent to study the correlative
relationship between genotype and phenotype as well as genomic variability between pop-
ulation groups, variant frequency distribution, haploid information, and linkage disequi-
librium of alternative allele. In addition, identifying 95% of the variations (SNPs, CNVs,
29 2
and InDels) distributed in 0.1% to 0.5% of the coding region and 1% of the non-coding
region was also their primary objective.
Comprised of three phases, the project completed its first phase in 2010 with the release
of their project data files. As of now, the Phase 3 sequencing has been completed and of the
2577 specimens, 2504 specimen data was released (. Table 2.3) after excluding those that

had contamination issues, alignment issues, et cetera. The released data can be downloaded
from NCBI and EBI (7 ftp://ftp-trace.ncbi.nih.gov/1000genomics/ftp/release/20130502,

7 ftp://ftp.1000genomics.ebi.ac.uk/vol1/ftp/release/20130502/). The most current data is

from May of 2013, and has since been updated in February of 2015. Chromosomes 1–22,
X, and Y data is compressed in gunzip format data. In 7 Chap. 3, we will practice calculat-

ing the population distribution of variants using 1000 Genomes Project data.
.. Table 2.3 The 1000 Genomes Project Phase 3 data set
Super Population, Description Phase 1 Final Sub-

Population n = 26 Samples Release Total
Samples
AFR (African ASW Americans of African 61 61 661

Ancestry) Ancestry in Southwest US
ACB African Caribbeans in 0 96

Barbados
ESN Esan in Nigeria 0 99
GWD Gambian in Western 0 113

Divisions in the Gambia
LWK Luhya in Webuye, Kenya 97 99
MSL Mende in Sierra Leone 0 85
YRI Yoruba in Ibadan, Nigeria 88 108
AMR CLM Colombians from Mendelin, 60 94 347

(American Colombia
Ancestry)
MXL Mexican Ancestry from Los 66 64
Angeles, California
PEL Peruvians in Lima, Peru 0 85
PUR Puerto Ricans from Puerto 55 104

Rico
EAS (East CDX Chinese Dai in Xishuang- 0 93 504

Asian banna, China
Ancestry)
CHB Han Chinese in Beijing, 97 103
China
JPT Japanese in Tokyo, Japan 89 104
KHV Kinh in Ho Chi Minh City, 0 99

Vietnam
CHS Southern Han Chinese 100 105
(continued)
.. Table 2.3 (continued)
Super Population, Description Phase 1 Final Sub-

2 Population n = 26 Samples Release
Samples
Total
SAS (South BEB Bengali in Bangladesh 0 86 489

Asian
Ancestry) GIH Gujarati Indian in Houston, 0 103
Texas
ITU Indian Telugu from the UK 0 102
PLJ Punjabi from Lahore, 0 96

Pakistan
STU Sri Lankan Tamil from the 0 102

UK
EUR CEU Utah Residents (CEPH) with 85 99 503

(European Northern and Western
Ancestry) European Ancestry
FIN Finnish in Finland 93 99
GBR British in England and 89 91

Scotland
IBS Iberian Population in Spain 14 107
TSI Toscani in Italia 98 107
Total 1,092 2,504 2,504
Take Home Message

55 What the data format of genetic variants (i.e., data from next-generation
sequencing and personal genome) and its characteristics is
55 How to annotate sequence variation, predict function, and interpret.
Bibliography
1. Alberto M et al (2010) Bioinformatics for next generation sequencing data. Genes 1(2):294–307
2. Antonarakis SE et al (2010) Mendelian disorders and multifactorial traits: the big divide or one for all?
Nat Rev Genet 11:380–384
3. Capriotti E et al (2006) R. Predicting the insurgence of human genetic diseases associated to single
point protein mutations with support vector machines and evolutionary information. Bioinformatics
22(22):2729–2734
4. Daly AK (2010) Genome-wide association studies in pharmacogenomics. Nat Rev Genet 11:241–246
5. Daly AK (2010) Pharmacogenetics and human genetic polymorphisms. Biochem J 429(3):435–449
6. Frazer KA et al (2009) Human genetic variation and its contribution to complex traits. Nat Rev Genet
10:241–251
7. Gonzalez-Angulo AM et al (2010) Future of personalized medicine in oncology: a systems biology
approach. J Clin Oncol 28(16):2777–2783
8. Haas BJ, Zody MC (2010) Advancing RNA-Seq analysis. Nat Biotechnol 28(5):421–423
Bibliography
31 2
9. Homer N et al (2009) BFAST: an alignment tool for large scale genome resequencing. PLoS One
4(11):e7767
10. http://en.wikipedia.org/wiki/FASTA_format
11. http://en.wikipedia.org/wiki/Phred_quality_score
12. Koboldt DC et al (2010) Challenges of sequencing human genomes. Brief Bioinform 11(5):484–498
13. Langmead B et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the
human genome. Genome Biol 10(3):R25
14. Li H, Durbin R (2009) Fast and accurate short read alignment with burrows-wheeler transform. Bioin-
formatics 25(14):1754–1760
15. Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing.
Brief Bioinform 11(5):473–483
16. Li H et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
17. Li R et al (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics

25(15):1966–1967
18. Ma B et al (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics 18(3):
440–445
19. MacLean D et al (2009) Application of ‘next-generation’ sequencing technologies to microbial genet-
ics. Nat Tev Microbiol 7(4):287–296
20. Manolio TA et al (2009) Finding the missing heritability of complex diseases. Nature 461:747–753
21. Morin R et al (2008) Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively
parallel short-read sequencing. BioTechniques 45(1):81–94
22. Mortazavi A et al (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth-
ods 5(7):621–628
23. Ng PC, Henikoff S (2006) Predicting the effects of amino acid substitutions on protein function. Annu
Rev Genomics Hum Genet 7:61–80
24. Nielsen R et al (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev
Genet 12(6):443–451
25. Ning Z et al (2001) SSAHA: a fast search method for large DNA databases. Genome Res 11(10):
1725–1729
26. Novelli G et al (2010) Role of genomics in cardiovascular medicine. World J Cardiol 2:428–436
27. Ramensky V et al (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30:3894–
3900
28. Schork NJ et al (2009) Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev
19:212–219
29. Snyder M et al (2010) Personal genome sequencing: current approaches and challenges. Genes Dev
24:423–431
30. Sultan M et al (2008) A global view of gene activity and alternative splicing by deep sequencing of the
human transcriptome. Science 321(5891):956–960
31. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-
scale sequencing. Nature 467:1061–1073
32. Trapnell C et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated tran-
scripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515
33. Tsuji S (2010) Genetics of neurodegenerative diseases: insights from high-throughput resequencing.
Hum Mol Genet 19:R65–R70
34. Wang Z et al (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63
35. Wheeler DA et al (2008) The complete genome of an individual by massively parallel DNA sequencing.
Nature 452(7198):872–876
36. Yandell M et al (2011) A probabilistic disease-gene finder for personal genomes. Genome Res
21(9):1529–1542
37. Yi X et al (2010) Sequencing of 50 human exomes reveals adaptation to high altitude. Science
329(5987):75–78
38. Zhang J et al (2011) The impact of next-generation sequencing on genomics. J Genet Genomics
38(3):95–109
33 3
Personal Genome
Data Analysis
3.1 Prerequisites – 34
3.2 SNP Annotations and Interpretations Using

SNPedia and Promethease – 34
3.2.1 Create Input File – 34
3.2.2 Running Promethease – 35
3.2.3 Results Format and Interpretations of Promethease – 35
3.3 Interpretation of the Correlation of Rare SNPs

with Diseases Using SIFT and KEGG – 37
3.3.1 Run the SIFT Web Tool – 38
3.3.2 KEGG DISEASE Pathway Mapping – 38
3.4 Pharmacogenomic Analysis – 41

3.4.1 Pharmacogenomic Analysis Using GENOtation – 41
3.5 Calculations of Allele Frequencies

in a Population – 43
Bibliography – 45
Electronic supplementary material The online version of this chapter

(https://doi.org/10.1007/978-981-13-1942-6_3) contains supplementary material,
which is available to authorized users.

https://doi.org/10.1007/978-981-13-1942-6_3
34 Chapter 3 · Personal Genome Data Analysis

After obtaining personal genome data by next generation sequencing, we now are ready to
analyze and interpret the data. This chapter introduces various approaches for adding SNP
annotations and medicinal interpretations, using open sources based on personal genome
data and genome variation information. This chapter will cover the following: (1) effective
use of SNP data in SNPedia, (2) auto annotations of large volume of SNPs using Promethease
3 application program, (3) calculation methods of damage degree of proteins through SIFT
algorithm, (4) analysis of the effect of SNPs on coding regions to diseases or drug responses
by mapping with the pharmacogenomics knowledge base or biological pathways, and (5)
practicums for acquiring and using allele frequencies among races based on public data of
the 1000 genomes project.
3.1 Prerequisites
This chapter consists of four sections. The first three sections use the Windows operating
system (OS) and the last section runs on a Linux OS. The requirement for the Windows
OS is to install Windows R program (7 http://www.r-project.org/) and an Internet browser,

either Mozilla Firefox or Google Chrome. The requirement for the Linux OS is an instal-
lation of VCFtools. The updated version for the Linux OS is available at the website,
7 http://vcftools.sourceforge.net.

3.2 NP Annotations and Interpretations

S
Using SNPedia and Promethease
In this section, you will familiarize yourself with 23andMe file and personal genome data
analysis and interpretation using an HTML analysis report created by the Windows ver-
sion of Promethease.
3.2.1 Create Input File
# 23andMe
# rsid chromosome position genotype
rs12255372 10 114808902 GT
rs12255372 10 114808902 TT
rs6152 X 66765627 GG
rs9939609 16 53820527 AA
1. Type “#23andMe” on the first line using text editor.

2. On the second line, type “#rsID chromosome position genotype.” Use the tab key to
set apart each item (TAB-delimited ASCII file).
3. On the third line, type the rsID, chromosome, position, and genotype values
obtained from the link, 7 http://snpedia.com/index.php/SNPedia (. Fig. 3.1).

※ Promethease supports 23andMe and deCODEme formats.

3.2 · SNP Annotations and Interpretations Using SNPedia and Promethease
35 3
.. Fig. 3.1 Retrieving sequence variant information from SNPedia
3.2.2 Running Promethease
It takes about 13 to 15 min to obtain known SNP annotations in an HTML report file.
55 Step 1: Download and run Promethease for the desktop. (7 https://www.snpedia.

com/index.php/Promethease/Desktop).
55 Step 2: Click ‘Load’ option in ‘Genotype Files’ window. Then, select promeSample.txt
file in the directory.
55 Step 3: Click ‘Next’ and select Ethnicity (population).
55 Step 4: Click ‘Next’ and type the file name and a output folder to save the file.
55 Step 5: Click ‘Next’ in ‘Promethease Wizard’ window.
55 Step 6: When the analysis is completed, the resulting HTML file will be saved in the
output directory you defined in step 4 (. Fig. 3.2).

3.2.3 Results Format and Interpretations of Promethease
The subcategories in the Promethease report are as follows: (1) Good/Bad/interesting

news, (2) Top 10% in chosen population, (3) Medicines, (4) Medical Conditions, (5)
Topics, and (6) Complicated data.
.. Fig. 3.2 Result of

Promethease Analysis
55 Good news: Repute is good (i.e. resistant to HIV, won’t go bald, resistant to several
diseases, etc.)
55 Bad news: Repute is bad. (i.e. increased risk for type-2 diabetes, 1.2x prostate cancer
risk, more likely to go bald before age 40)
55 Interesting: Sometimes repute cannot be distinctly ‛Good’ or ‛Bad.’ (i.e. You have a
variant linked to blue eye color, warfarin sensitivity (~2.5 mg/day), 1.38x increased
risk for prostate cancer)
55 Top 10% somewhat rare in your chose population: you can change your chosen
population from whatever default you picked when making the report to others, this
will update the population frequency scores, and therefore the sorting order.
55 Medicines: Information on drug response for 160 drugs is provided. (i.e. Gefitinib:
rs2231142, associated with diarrhea as a side effect of Gefitinib treatment, Gleevec,
Warfarin)
55 Medical Conditions: Interpretations are provided for 270 disease-associated SNPs.
(i.e. AIDS, asthma, liver cancer)
55 Topics: A summary of annotated SNPs is presented.
Promethease visualizes the significance of each SNP by magnitudes, reputes, frequencies,

and text colors.
Magnitude explains the subjective measure of interest. It grades the degree of SNPedia
editors’ interest on a 0–10 scale (. Table 3.1).

Frequency provides a graph created according to population codes of the International

HapMap Project (. Fig. 3.3 and . Table 3.2).

In the case of CEU,

50% of Europeans have the (G;G) genotype
35% of Europeans have the (G;T) genotype
15% of Europeans have the (T;T) genotype
In the case of CHB,
80% of Chinese have the (G;G) genotype
10% of Chinese have the (G;T) genotype
10% of Chinese have the (T;T) genotype
3.3 · Interpretation of the Correlation of Rare SNPs with Diseases Using SIFT and KEGG
37 3
.. Table 3.1 Magnitude of Promethease
0 You have the common genotype, for which nothing interesting is known.
0.1 You have the common genotype, but its interesting that this varies for others.
1 Semi-plausible but not very exciting.
blank No one has yet assigned a magnitude. Treated as a 1.
2 Looks interesting enough to be worth reading.
2.1 Hmm, interesting.
3 Probably worth your time.
10 Really significant information!
(G;G) (G;T) (T;T)

CEU
CHB
JPT
YRI
ASW
CHD
GIH
LWK
MEX
MKK
TSI
AVG
0 20 40 60 80 100
.. Fig. 3.3 Genotype frequency bar by population
3.3 I nterpretation of the Correlation of Rare SNPs

with Diseases Using SIFT and KEGG
The SIFT algorithm is an analysis tool that enables predicting the severity of protein dam-
age by SNPs in protein coding sequences. In this section, we will practice the following:
(1) Identifying SNPs that are most likely to cause protein damages by using SIFT Web
tool, (2) Mapping the identified SNPs on the disease pathway of KEGG DISEASE, (3)
Predicting the relationships between the identified SNPs and the diseases, (4) Adding
annotations to identified SNPs and calculating SIFT scores by using the simple SIFT Web
tool (7 http://sift.jcvi.org/), (5) Investigating the relevance to diseases of the identified

SNPs, which are selected using the Windows R program, by mapping on the pathway of
KEGG DISEASE.
.. Table 3.2 Population frequency of Promethese
CEU European – 180 samples of Utah residents with Northern and Western European ancestry
from the CEPH collection
CHB Han Chinese – 90 samples of Han Chinese in Beijing, China

3 JPT Japanese Tokyo – 91 samples of Japanese in Tokyo, Japan
YRI Yoruba African – 180 samples of Yoruba in Ibadan, Nigeria
ASW 90 samples of African ancestry in Southwest USA
CHD 100 samples of Chinese in Metropolitan Denver, Colorado
GIH 100 samples of Gujarati Indians in Houston, Texas
LWK 100 samples of Luhya in Webuye, Kenya
MEX 90 samples of Mexican ancestry in Los Angeles, California
MKK 180 samples of Maasai in Kinyawa, Kenya
TSI 100 samples of Toscani in Italia
AVG Mathematical average of all samples from above groups
3.3.1 Run the SIFT Web Tool
When the number of SNPs is less than one million, we can analyze the data using the SIFT
Web tool. However, analyzing SNPs of the whole personal genome data requires a pro-
gram, such as ANNOVAR, that can process a larger amount of data.
Using the SNPs data, go to the SIFT web site (7 http://sift.jcvi.org/) and annotate the

SNPs. Also, find the degree of protein damage from the SNPs (. Fig. 3.4).

1. Go to 7 http://sift.jcvi.org/ and click SIFT/PROVEAN Human SNPs.

2. Upload sift_input.txt or click “upload example,” then paste the list of human genomic
variants. Select ‘Ensembl Gene ID’ and ‘Gene Name’ and click ‘submit.’
3. Click “view result table” on the result page. You can either download the files or
review only the brief report.
4. Review VARIATION, PROTEIN SEQUENCE CHANGE, PROVEAN PREDICTION,
SIFT PREDICTION, and ANNOTATION (dbSNP_ID, GENE_ID, GENE_NAME).
If the SIFT score of a SNP is smaller than 0.05, we denote the SNP as ‘DAMAGING’
interpreted that the SNP may affect protein function.
3.3.2 KEGG DISEASE Pathway Mapping
Use the above DAMAGING gene list by SIFT scores or the selected list by other tools to
correlate the SNPs and the diseases. Run hypergeometric test using phyper (x,m,n,k,lower.
tail = FALSE) of Windows R and evaluate the statistical significance.
3.3 · Interpretation of the Correlation of Rare SNPs with Diseases Using SIFT and KEGG
39 3
.. Fig. 3.4 Running the SIFT web tool using sample data
Using the given DAMAGED gene list (damaged_gene_list.txt), correlate KEGG

Colorectal Cancer (colorectal_cancer.txt) with the genes by hypergeometric test.
1. download and install R (7 https://cran.r-project.org). The details of R program

installation are described in Appendix B, 7 Sect. B.1.

2. Import DAMAGED gene list (damaged_gene_list.txt).
> damaged_genes <- read.table(file.choose(), header = F, sep = “\t”)

3. Import KEGG Colorectal Cancer pathway (colorectal_cancer.txt).
> pathway_list <- read.table(file.choose(), header = F, sep = “\t”)
3 > head(pathway_list)
V1 V2 V3 V4 V5
1 Colorectal cancer 207 AKT1 0 NA
4 Colorectal cancer 324 APC 1 NA
5 Colorectal cancer 10297 APC2 0 NA
6 Colorectal cancer 26060 APPL11 0 NA
4. Extract Gene Symbols that have value 1 on the fourth column.
> important_genes <- pathway_list[(pathway_list$V4 == 1), ]
> head(important_genes)
V1 V2 V3 V4 V5
4 Colorectal cancer 324 APC 1 NA
11 Colorectal cancer 581 BAX 1 NA
20 Colorectal cancer 1630 DCC 1 NA
24 Colorectal cancer 3845 KRAS 1 NA
32 Colorectal cancer 4292 MLH1 1 NA
33 Colorectal cancer 4436 MSH2 1 NA
5. Find common genes between damaged genes and colorectal cancer genes (impor-
tant_genes).
> mapping_result <- intersect(damaged_genes$V1, important_genes$V3)
> mapping_result
[1] "MLH1" "TGFBR2" "APC" "TP53" "MSH2" "MSH6" "DCC"
3.4 · Pharmacogenomic Analysis
41 3
6. Run Hypergeometric distribution test. (damaged_gene: 2900, Human gene: 30000)
> m <- lengh(important_genes$V3)
> n <- 30000 – m
> k <- 2900
> x <- length(mappng_result) –1
> phyper(x, m, n, k, lower.tail = FALSE)
7. Confirm using KEGG Pathway
> phyper(x, m, n, k, lower.tail = FALSE) * 300
Resulting P-value of 0.01199652 indicates that our test results are statistically significant.
※ If there are any technical difficulties, plaease refer to ‘command.txt’ provided.
3.4 Pharmacogenomic Analysis
This section covers GENOtation, a personal genome analysis tool designed by

Stanford University, and practices predictions of individual drug responses and inter-
pretations of data. GENOtation runs on a web browser and takes input in the
“23andMe format”.
3.4.1 Pharmacogenomic Analysis Using GENOtation
1. GENOtation Main Page (7 http://genotation.stanford.edu/).

2. Click the “Clinical” Tab and Go to the Pharmacogenomics Menu Page.

3. Select ‘Show my Common PGx Variants’ and load ‘genotation _input.txt’ file.
4. Review the results when the analysis is done. The results show Common
Pharmacogenomic Variants, Rare Pharmacogenomic Variants, output identified
SNPs and relevant information.
55 Common Pharmacogenomic Variants: Evidence level 1~3: Highly curated and
annotated based on literature
55 Rare Pharmacogenomic Variants: Based on scientific evidence, but unverified by
FDA (. Figs. 3.5 and 3.6)

.. Fig. 3.5 Web page of GENOtation
.. Fig. 3.6 Result of Pharmacogenomics – Common Pharmacogenomic Variants

3.5 · Calculations of Allele Frequencies in a Population
43 3
3.5 Calculations of Allele Frequencies in a Population
The 1000 Genomes Project data include SNP, InDel, SV (structural variants) of 2054 indi-
viduals. This section uses VCFtools (7 http://vcftools.sourceforge.net/) to define target

regions on the chromosomes. Also, we will practice finding allele frequencies in a popula-
tion by SNP rsIDs using grep and awk commands.
<Basic and Additional Options of VCFtools>
※ Basic Options of VCFtools

> Examples of allele frequency calculations
vcftools –vcf chr22.vcf –chr 22 --freq
Basic Options
•--vcf <filename>
This option is to state that a VCF file is to be used for the process.
•--gzvcf <filename>
This option is to process a compressed, gzipped, VCF file without extracting it.
•--out <prefix>
This option is for the output filename prefix for all files generated by vcftools. If the
option is omitted, all output files will have the prefix.out
Site Filtering Options

•--chr <chromosome>
This option includes the sites on a given chromosome.
•--snp <string>
This option includes SNP(s) with matching rsID(s).
•--snps <filename>
This option includes multiple SNPs given in a file. The file should contain a list of one
ID per line.
•--exclude <filename>
This option excludes multiple SNPs given in a file. The file should contain a list of one
ID per line.
•--positions <filename>
This option includes multiple position lists in a file. The file should contain a tab-
seperated chromosome and position in each line.
•--exclude-positions <filename>
This option excludes multiple position lists in a file. The file should contain a tab-
seperated chromosome and position in each line.
※ In order to review all options, go to 7 http://vcftools.sourceforge.net/options.html.

※ Additional Options
> Example of extracting a compressed.gz file.
tar-xvfz
ALL.chr22.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.tar.gz
> Querying (Retrieves data using the genomic positions)

vcf-query chr22.vcf.gz 22:10327-10330
> Comparing (compares two or three compressed files directly)
vcf-compare A.vcf.gz B.vcf.gz C.vcf.gz
> Concatenating (combines files)
vcf-concat A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
3 > Stripping Columns (separates individual variants)
vcf-subset –c NA18960,NA18961 chr22.vcf.gz
Using VCFtools, calculate allele frequency of a SNP on chromosome 22: position
16050984.
1. Use chr22.vcf file or 1000 Genomes Project data (7 ftp://ftp-trace.ncbi.nih.

gov/1000genomes/ftp/release//20130502/ALL.chr22.phase3_shapeit2_mvncall_inte-
grated_v5a.20130502.genotypes.vcf.gz).
2. Using an editor or vi commands, enter chromosome number and position in a line
and save it as a text tile. Please refer to position.txt as an example.
3. Type the following VCFtools command.
> vcftools --vcf chr22.vcf --positions position.txt --freq —out vcf_position
4. Review the results in vcf_position.frq file.
> cat vcf_position.frq
CHROM POS N_ALLELES N_CHR (ALLELE-FREQ)
22 16050984 2 C:0.997711 G:0.00228938
Exercises
[Exercise 1] - Go to SNPedia (7 http://www.snpedia.com) and learn which disease Herceptin is used for

and find a related SNP (rsID).

[Exercise 2] - Search ‘rs184313718’ using grep command in chr22.vcf and extract the 8th field using awk
command. Then, calculate AF(allele frequency) and ASN_AF (Asian allele frequency).
[Exercise 3] - Using vcf-query from VCFtools, save the list of SNPs on chromosome 22: position
16304864-16399994 as result_out.txt.
Take Home Message

55 Learn the analysis methods of clinical annotation of SNP.
55 What methods and public data are used for clinical annotation of SNP.
Bibliography
45 3
Bibliography
1. 1000 Genomes Project Consortium et al (2010) A map of human genome variation from population-
scale sequencing. Nature 467(7319):1061–1073
2. Cariaso M, Lennon G (2012) SNPedia: a wiki supporting personal genome annotation, interpretation
and analysis. Nucleic Acids Res 40(Database issue):D1308–D1312
3. Danecek P et al (2011) The variant call format and VCFtools. Bioinformatics 27(15):2156–2158
4. Deverka PA (2009) Pharmacogenomics, evidence, and the role of payers. Public Health Genomics
12(3):149–157
5. Hulot JS (2010) Pharmacogenomics and personalized medicine: lost in translation? Genome Med
2(2):13
6. International HapMap Consortium (2003) The International HapMap Project. Nature 426(6968):789–
796
7. Karczewski KJ et al (2012) Interpretome: a freely available, modular, and secure personal genome
interpretation engine. Pac Symp Biocomput 339–350
8. Kitzmiller JP et al (2011) Pharmacogenomic testing: relevance in medical practice: why drugs work in
some patients but not in others. Cleve Clin J Med 78(4):243–257
9. McDonagh EM et al (2012) PharmGKB summary: very important pharmacogene information for
G6PD. Pharmacogenet Genomics 22(3):219–228
10. McKinnon RA et al (2007) A critical analysis of barriers to the clinical implementation of pharmacoge-
nomics. Ther Clin Risk Manag 3(5):751–759
11. Mills RE et al (2011) Mapping copy number variation by population-scale genome sequencing. Nature
470(7332):59–65
12. Sherry ST et al, dbSNP: the NCBI database of genetic variation
13. Pirmohamed M et al (2011) The phenotype standardization project: improving pharmacogenetic
studies of serious adverse drug reactions. Clin Pharmacol Ther 89(6):784–785
14. R Core Team (2016) R: A language and environment for statistical computing. R Foundation for statis-
tical computing, Vienna, Austria. https://www.R-project.org/
47 4
Personal Genome
Interpretation and Disease
Risk Prediction
4.2 SNP Prioritization – 48
4.3 Prediction of Genetic Disease Risk – 49

4.3.1 Algorithm for Prediction of Genetic Disease Risk – 49
4.3.2 Practice Data – 49
4.3.3 Risk Prediction Using GENOtation – 50
4.3.4 Data Analysis Using Promethease – 52
4.4 Resources for Analyzing an Individual Genome – 55

4.4.1 dbGAP – 55
4.4.2 SNPedia – 57
4.4.3 PheGenI – 58
4.5 Disease-Related Sequence Variation Analysis – 62

4.5.1 Verification of File List and Moving to Practice
Directory Using Terminal – 62
4.5.2 Extracting Main Disease-Causing Variants
from Sequence Data – 63
4.5.3 Finding the Disease Gene of a Child Using Kinship Data – 72
4.5.4 Analysis of Gene Lists Known to Affect Phenotype – 72
Bibliography – 75


https://doi.org/10.1007/978-981-13-1942-6_4
48 Chapter 4 · Personal Genome Interpretation and Disease Risk P
rediction

It is not an easy task to interpret an individual genome, even though we have a lot of useful
annotation information from personal genome sequences and variants. A number of soft-
ware programs have been developed to identify statistically significant and clinically relevant
SNPs, predict disease risk, and identify disease-related rare variants. This chapter introduces
various bioinformatics resources used for the interpretation of personal genomic data.
4 4.1 Introduction
The Era of the Personal Genome opened with the development of next generation
sequencing and accompanying decreases in genome analysis costs. There has been a vari-
ety of genome testing services in the past. However, Direct-to-consumer (DTC) com-
panies that can analyze consumer’s individual genome are explosively increasing. Such
companies include 23andMe, deCODE genetics, and Navigenics; they provide services
such as individual genome analysis for disease prediction, drug reactivity prediction, and
finding ancestry. In particular, disease prediction has become one of the most important
consumer genome services. The company Pathway Genomics has actively started provid-
ing services through pharmacies for consumer-based DTC genetic test kits.
One famous example of risk prediction, reported on by the New York Times, involved
Hollywood star Angelina Jolie. She tested positive for BRCA1 gene mutation inherited
from her mother, who had been lost to ovarian cancer at the age of 56. This mutation is
found in ~5% of breast cancer patients, and in ~10% of ovarian cancer patients. During
the time of testing, there were no signs of breast cancer from Angelina Jolie. Due to the
increased risk of developing breast cancer, Jolie chose preventative double mastectomy.
Angelina Jolie’s decision was one of ‘clinical rationalities.’ Various disease-risk rate pre-
diction services provided by DTC companies have been criticized for having low reliabil-
ity; this cannot be used in clinical decisions. Furthermore, risk prediction increases social
anxiety and confusion. The FDA has announced that many commercial genetic testing
services are clinically unreliable, confuses the consumers, and provide clinically insignifi-
cant results. Such service should be FDA approved. The FDA also demanded that services
provided by 23andMe, marketed successfully as an individual genome analysis, be FDA
approved as a ‘medical device.’ 23andMe marketing was banned after the controversy, and
this event changed marketing techniques for many DTC companies.
Today, personal genome interpretation remains challenging. There are too few inter-
pretable individual genome variants, and the analysis algorithms need to be further devel-
oped by bioinformaticians. In this chapter, we practice the basics of individual genome
variation data analysis and clinical interpretation. One must know SNP prioritization
methods and genome-related databases in order to understand genetic disease risk pre-
diction algorithms. We introduce software and databases used in predicting disease risk to
provide accurate information regarding individual genome interpretation.
4.2 SNP Prioritization
In the previous chapter, GWAS genome correlation analysis was introduced as a general
method to correlate genotype to phenotype; used to predict disease risk or drug reactivity in
relation to individual differences in gene sequence. Recently, correlation analyses based on
4.3 · Prediction of Genetic Disease Risk
49 4
microarrays and next generation sequencing have been developed. However, GWAS gives an
excessive amount of ‘statistically significant’ SNPs, the effects of each SNP on molecular biol-
ogy are largely unknown, and it is difficult to use GWAS analysis results in clinical applications.
SNP Prioritization is used to sort out more relevant SNPs from the excessive number
of GWAS results. In this technique, SNPs that are presumed to have correlation with the
phenotype are ranked. This technique does not only utilize p-values for ranking, but also
knowledge of molecular biology, SNP also weights its significance using algorithms such
as SPOT or SNPranker.
4.3 Prediction of Genetic Disease Risk
The previous chapter introduced several algorithms using whole genome or exome
sequence data to predict hereditary disease probability, based on Dr. Ashley’s publica-
tion “Clinical assessment incorporating a personal genome.” Consumer-based genome
analysis companies use a variety of algorithms, while most GWAS research results are
manually reviewed to extract a correlation between disease and sequence variation, allow-
ing prediction of disease risk. GWAS literature uses a meta-analysis of case study/control
group patient numbers to calculate odds ratios relating sequence variation to disease. We
further calculate race-related disease risk using known prevalence factors. Fundamentally,
there is not much literature reporting specific numbers for case/control groups. Therefore,
it is often difficult to calculate disease risk rates.
In this exercise, we use Dr. Ashley’s example data and Stanford University’s GENOtation
tool, which uses his algorithm to provide disease risk results.
4.3.1 Algorithm for Prediction of Genetic Disease Risk
Algorithms, such as Dr. Ashley’s one, take as input data (1) sequence variants and relevant
phenotypes extracted from GWAS papers and (2) case/control group number charts. The
information on the right chart of . Fig. 4.1 must be extracted from literature reviews and

meta-analyses. Unfortunately, most literature does not provide these details, and a lot of
work is needed to extract the information. After the chart is completed, use the calculation
method shown in . Fig. 4.2.

First, assign the prevalence in the race of analyst with the specific disease to the pre-
test probability and calculate the pretest odds using the formula “pretest odds = pretest
probability/(1 – pretest probability).” Then calculate the likelihood ratio (LR) using the
equation at the bottom of . Fig. 4.2, comparing with the allele. This is used to determine

posttest odds, the product of pretest odds and LR. When each variant is independent,
the posttest odds are renewed by sequentially multiplying the LR values of each variant.
Finally, the disease risk can be calculated from posttest odds using the formula “posttest
probability = posttest odds/(1 + posttest odds).”
4.3.2 Practice Data
In this exercise, we use individual genome data in 23andMe file format. The content of
practice file Personal_Genome.txt is as shown in . Fig. 4.3.

rediction
Determine disease
PubMed search
Table 2 Genotypic frequencies of ESR1 SNPs
in the Study of Ostoporotic Fractures
Paper filter (contain tables) SNP Genotype Case n (%) Control n (%)
4 Review papers make SNV list

rs746432 C/C
C/G
G/G
311 (0.80)
72 (0.18)
7 (0.02)
648 (0.82)
126 (0.16)
12 (0.02)
rs2234693 T/T 117 (0.30) 214 (0.27)
T/C 188 (0.48) 393 (0.50)
C/C 87 (0.22) 176 (0.23)
SNV detected rs9340799 A/A 178 (0.45) 315 (0.40)
A/G 176 (0.45) 365 (0.46)
G/G 38 (0.10) 108 (0.14)
rs1801132 C/C 237 (0.60) 461 (0.58)
Mapping to personal vcf using SNV list C/G 137 (0.35) 299 (0.38)
G/G 19 (0.05) 29 (0.04)
(contain rs_id)
Calculated clinical risk
.. Fig. 4.1 Literature meta-analysis process
Risk assessment algorithm

Pretest odds = pretest prob. / (1–pretest prob.)
Posttest odds = pretest odds * LR
Posttest prob. = posttest odds / (1 + posttest odds)
Calculating likelihood ratio (LR) of disease risk for SNP genotypes
probabolity of the genotype in the case population

LR =
probability of the genotype in the control population
AA Aa aa a
Case a b c LR(AA) = a + b +c
d
Control d e f d+e+f
LR
.. Fig. 4.2 Calculation of disease risk prediction
4.3.3 Risk Prediction Using GENOtation
GENOtation by Stanford University is an educational and non-commercial software

that provides tools for ancestry finding, diabetes prediction, sequence variation mapping
related to disease, warfarin drug dosing prediction, pharmacogenomic informatic analysis,
and other analyses of genomic information. In this exercise, we perform risk prediction for
Type 2 Diabetes and determine the relations of sequence variants to the disease (. Fig. 4.4).
51 4
.. Fig. 4.3 Practice data file
Personal_Genome.txt # rsid chromosome position genotype
rs3753242 1 2059541 CT
rs2792248 1 162891886 GG
rs2943634 2 226776324 AA
rs4402960 3 186994381 GT
rs6769511 3 187012984 CT
rs2736100 5 1286516 AC
rs17070145 5 167778369 CT
rs3020317 6 152320434 TT
rs1884051 6 152324972 AA
rs985694 6 152328318 CC
rs726281 6 152344271 AA
rs9295475 6 20760744 AG
rs4712523 6 20765543 AA
rs9465871 6 20825234 TT
rs4722404 7 3128789 TC
rs10811661 9 22124094 CT
rs505922 9 135139050 CC
rs7901695 10 114744078 TT
rs7903146 10 114748339 CC
rs2237892 11 2796327 CT
rs3758650 11 606865 AA
rs5443 12 6825135 CT
rs1667394 15 26203777 TT
rs9939609 16 52378028 AT
rs7216064 17 65898809 AG
rs1982073 19 46550761 AG
rs2075650 19 50087459 GG
rs16999165 20 52807221 AG
rs36600 22 30337936 TC
rs1385699 X 65741711 TT
.. Fig. 4.4 Risk prediction process using GENOtation
In order to predict the risk rate of Type 2 Diabetes, select the ‘Clinical’ tab, then
‘Diabetes,’ and finally select ‘Compute my Type 2 Diabetes risk.’ A file selection screen will
appear. Browse and select the practice data file, after which a race selection screen will
appear. Select ‘European’ and select ‘Compute my Type 2 Diabetes risk’ again. . Figure 4.13

rediction
.. Fig. 4.5 GENOtation results for Type 2 Diabetes risk prediction
shows screenshots of this process. The sample size of GWAS experiments and the SNV
list related to Type 2 diabetes are seen. The calculated LR, running LR (posttest odds),
and probability (posttest probability) results from the user genotype comparison can be
viewed in . Fig. 4.5.

Next, select the ‘Clinical’ tab, then ‘All GWAS studies’. This uses the NHGRI-EBI GWAS
Catalog to determine disease relevance of individual SNVs. As in the previous exercise,
choose the race (‘European’), then select ‘show my GWAS SNPs’ to view individual SNV
genotype information (. Fig. 4.6).

4.3.4 Data Analysis Using Promethease
Promethease uses SNPedia to provide annotations for phenotype-related data, includ-

ing but not limited to drug interactions and disease risk rates of individual genome
variants (7 https://www.snpedia.com/index.php/Promethease). You can download the

Promethease program from its website, as shown in . Fig. 4.7.
To run Prometheus on the practice data, load the ‘Personal_Genome.txt’ file, set race
as JPT, Japanese in Tokyo, Japan, and set the result file to save to your ‘Desktop.’ After
this is all done, select the ‘next’ button. The analysis will take about 20 min to complete
(. Fig. 4.8).

53 4
.. Fig. 4.6 Results for disease relevance of SNVs
.. Fig. 4.7 Download program from the Promethease access page

rediction
.. Fig. 4.8 Executing Promethease
.. Fig. 4.9 Promethease results

overview
. Figure 4.9 shows the analysis results screen, given in HTML format. At the top, you

can verify run information such as the Promethease version, date created, and input file.
At the bottom, the analysis results are classified into eight articles. The first three articles
group variants into ‘Good news,’ and ‘Bad news’. ‘Good news’ variant has low disease risk
or causes positive effects whereas ‘Bad news,’ variant has high disease risk or causes nega-
tive effects. There is a third category – ‘They seem interesting, but have not been flagged as
clearly Good or Bad’ – for other ambiguities. You can click the ‘…more…’ button located
below each article for annotated results.
The Promethease annotation format displays one SNV and one phenotype correlation,
as shown in . Fig. 4.10. Annotations contain the magnitude and information on the SNV,

including occurrence frequency. If you select the ‘…more…’ button next to the pheno-
type information, you can view more details. Selecting the SNV rsID link takes you to its
SNPedia page to view detailed information on the variant.
‘Medicines’ and ‘Medical Conditions’ further classify the three articles according to
the phenotype. As shown in . Fig. 4.11, these tabs report the number of related variants,

annotated phenotype, and annotated information. The bar graph on the left shows the
number of SNVs in each category: red as “Bad News”, green as “Good News”, and grey
for variants with no standings. You can view annotated information by clicking the ‘…
more…’ button located next to each phenotype (. Figs. 4.10 and 4.11).

4.4 · Resources for Analyzing an Individual Genome
55 4
.. Fig. 4.10 Promethease detailed analysis result – News
.. Fig. 4.11 Promethease detailed analysis result – Medical conditions
4.4 Resources for Analyzing an Individual Genome
4.4.1 dbGAP
dbGAP is a database provided by NCBI (National Center for Biotechnology Information,

7 http://www.ncbi.nlm.nih.gov/gap) that contains data from genotype and phenotype

correlation studies. These include GWAS, genotyping, clinical sequencing, and molecular
rediction
diagnostic assays. The information provided by each of the mentioned studies can be
summarized into four categories: documents related to experiments, accessible authority
document, information on genomic data and individual genomic pedigrees and lastly,
linkage analysis and related statistical genetic information. Documents relevant to experi-
ments include experimental instructions, protocol instructions, and the data-collecting
agency. dbGaP contains both ‘open’ data that is easily accessible and ‘controlled’ data that
can only be accessed with permission. Most of the data is controlled, for which access
requests must be submitted. Currently, dbGAP contains 4792 datasets from 780 experi-
4 ments (. Table 4.1).

.. Table 4.1 List of projects participating in dbGAP
Projected No. Study name or Sponsor Type Number of

availability studies disease focus participants
Nov. 2006 1 AREDS NEI Case-control 600

GWAS
Nov. 2006 1 Parkinsonism NINDS/NIA Case-control 2,573

GWAS
Jun. 2007 1 Attention deficit GAIN Trio GWAS 2,874

hyperactivity
disorder
Aug. 2007 2 Diabetic nephropa- GAIN Case-control 1,835

thy GWAS
Sept. 2007 0 GeneLink NHLBI Multipoint TBD

linkage
analyses
Sept. 2007 1 Stroke NINDS Case-control 1,555

GWAS
Sept. 2007 1 Motor neuron NINDS Case-control 1,876

disease and GWAS
amyotrophic lateral
sclerosis
Sept. 2007 1 LEAPS MJFF Tired 886

case-control
GWAS
Sept. 2007 1 Major depression GAIN Case-control 3,720

GWAS
Oct. 2007 1 Framingham SHARe NHLBI Family-based ~9,500

longitudinal
GWAS
Oct. 2007 1 Psoriasis GAIN Case-control 2,898

GWAS
Nov. 2007 2 DCCT/EDIC NIDDK Longitudinal 1,441

GWAS
(continued)
57 4
Projected No. Study name or Sponsor Type Number of

availability studies disease focus participants
Dec. 2007 1 Schizophrenia GAIN Case-control 2,909

GWAS
Dec. 2007 1 Bipolar disorder GAIN Case-control 2,400

GWAS
Early 20008 1 Alzheimer’s disease NIA Case-control 10,000

GWAS
Late 2008 8 8 GEI studies NHGRI TBD >30,000
Late 2008 0 Medical resequenc- NHGRI TBD ~15,000

ing, phase 1
Late 2008 1 MESA SHARe NHGRI Longitudinal 8,000

GWAS
TBD 0 Women’s Health Brigham and Longitudinal ~28,000

Genome Study Women’s GWAS
Hospital,
NHLBI, Amgen
TBD 0 Cornelia de Lange CETT Clinical TBD

study diagnosis
TBD 0 Duchenne muscular CETT Clinical TBD

dystrophy study diagnosis
TBD 0 Kallman syndrome CETT Clinical TBD

diagnosis
TBD 0 Tuberous sclerosis 2 CETT Clinical TBD

diagnosis
TBD 0 OMIM Johns Literature TBD

Hopkins
University
TBD 0 Dystrophin CETT LSMDB TBD

mutation sutdy
25
4.4.2 SNPedia
SNPedia (7 http://www.snpedia.com/index.php/SNPedia) is a wiki-based database started

in 2006 that provides information on sequence variation and related diseases. SNPedia is
composed of information from research literature, data from the cooperation of several
individuals, and the NCBI RSS feed (really Simple Syndication, 7 http://ww.ncbi.nlm.nih.

gov/feed/). Information in SNPedia includes but is not limited to a variant’s associated

gene, genosets, genotypes, medicine interactions, and medical conditions. Currently, the
database contains 1840 genes, 87,963 SNPs, 179 drugs, and 415 medical conditions. SNP,
rediction
.. Fig. 4.12 “Type 2 Diabetes” search in SNPedia
phenotype, and other terms can be used to search in SNPedia (. Fig. 4.12). As shown in

the figure, information and SNPs related to Type 2 Diabetes can be verified from several
references, and each SNP page provides detailed information on its variations.
4.4.3 PheGenI
PheGenI (Phenotype-Genotype Integrator, 7 http://ncbi.nlm.nih.gov/gap/PheGenI) is pro-

vided by NCBI as a genome diversity database that incorporates the GWAS Catalog with
other NCBI tools. The database incorporates dbSNP, NCBI Gene, and Genotype-Tissue
Expression (GTEx) eQTL, and dbGAP genotype-phenotype correlation analysis data.
Currently, PheGenI contains 66,063 Phenotype-Genotype correlations, about 54,000,000
dbSNP entries, about 40,000 NCBI genes, and about 61,000 eQTLs.
We can search PheGenI on the basis of either genotype or phenotype. For this practice,
let’s use individual genome data to search. Access the PheGenI homepage (. Fig. 4.13).

In the Traits field under ‘Phenotype Selection,’ input the keyword ‘Type 2 Diabetes’
and select the ‘Search’ button. It should give you 256 Phenotype Genotype results, 29
genes, and 36 SNPs relating to Type 2 Diabetes (. Fig. 4.14).
In the lower section of the results page, under ‘Association Results,’ we can see related
SNP rsIDs, gene, location, and p-value indicating its linkage to Type 2 Diabetes. Detailed
information from dbSNP or NCBI can be viewed by selecting the source articles as shown
in . Fig. 4.15.

59 4
.. Fig. 4.13 PheGenI webpage access screen
.. Fig. 4.14 Searching on “Type 2 Diabetes” in PheGenI

rediction
.. Fig. 4.15 “Type 2 Diabetes” PheGenI search results (1) – Association Results
.. Fig. 4.16 “Type 2 Diabetes” PheGenI search results (2)
Under ‘Genome View’, we can visualize a SNP’s location on the genome map. Under
the ‘Genes’ and ‘SNPs’ sections, we can view gene and SNP data relating to Type 2
Diabetes. Published studies related to Type 2 Diabetes are listed under ‘dbGaP Studies’
(. Fig. 4.16).

61 4
Next, we will search on an individual SNP from Personal_Genome.txt. This time, click
‘Genotype Selection’, select the ‘SNP’ tab, input ‘rs7903146,’ and select search (. Fig. 4.17).

As shown in . Fig. 4.18, 16 Association Results are returned for rs7903146, most of

them relating to type 2 diabetes.
.. Fig. 4.17 Searching on

“rs7903146” in PheGenI
.. Fig. 4.18 Search result page of “rs7903146”

rediction
4.5 Disease-Related Sequence Variation Analysis
4.5.1 erification of File List and Moving to Practice Directory

V
Using Terminal
In 7 Sects. 4.1, 4.2, 4.3, and 4.4, we used common variants (SNPs). Common SNPs are

excellent genome research markers, but them being poor explanatory markers are their
downsides. In 7 Sect. 4.5, we will identify a sequence variant with better explanation, using
4

familial relationship data and extracting variants that are judged to have individual effects.
This exercise will partially repeat the process and result from “Exome sequencing and
disease-network analysis of a single family implicate a mutation in KIF1A in hereditary
spastic paraparesis” by Yaniv Erlich and his colleagues. In order to obtain results identi-
cal to those reported in literature, we will simulate 1000 Genomes Project with 20,000
chromosome trio sequence data (. Table 4.2). The data can be downloaded from the fol-

lowing FTP site, and any processed files are provided separately (7 ftp://ftp.1000genomes.

ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz)
(. Fig. 4.19).

For this exercise, use a Linux system with ANNOVAR installed (see 7 Appendix B).
gda@GDA:~$ cd chapter04
gda@GDA:~/chapter04 $ ls
process.txt training_set.genes trio.chr2.vcf.gz trio.chr2.vcf.gz.tbi
.. Table 4.2 Files used in the exercise
File name Description
process.txt File providing a command used in the practice
training_set.genes Gene list known to be correlated with hereditary spastic para-paresis

(HSP) suggested in the literature
trio.chr2.vcf.gz A VCF file with variant information for trios
trio.chr2.vcf.gz.tbi An index file for the trio VCF file above
.. Fig. 4.19 Trio sequence

analysis using the 1000 Genomes SA12891 SA12892
Project data
: Male
: Female
: HSP
SA12878
4.5 · Disease-Related Sequence Variation Analysis
63 4
4.5.2 xtracting Main Disease-Causing Variants
E
from Sequence Data
4.5.2.1 Separate Samples Saved in VCF Files

1. If you download the 1000 Genomes Project data, there should be variants from 2000
people in the file. Therefore, a tool is needed to extract the desired samples. First, we
verify that the needed three samples exist in the file. To open the compressed file for
viewing, use command ‘zcat <file>.’
gda@GDA:~/ chapter04 $ zcat trio.chr2.vcf.gz | head
##fileformat=VCFv4.0
..
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
SA12891 SA12892 SA12878
2 3241 . T C . PASS AA=.;DP=110 GT:GQ:DP
2. Separate one sample file ‘SA12891’ using the following command.
gda@GDA:~/chapter04 $ vcf-subset -c SA12891 trio.chr2.vcf.gz > SA12891.vcf
gda@GDA:~/chapter04 $ more SA12891.vcf
..
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
SA12891
3. Run the same command for the two samples.
gda@GDA:~/chapter04 $ vcf-subset-c SA12892 trio.chr2.vcf.gz > SA12892.vcf
gda@GDA:~chapter04$ vcf-subset-c SA12878 trio.chr2.vcf.gz > SA12878.vcf

rediction
4.5.2.2 Convert into ANNOVAR Input Format

1. The VCF file must be converted into an appropriate format for ANNOVAR input.
Create a father file as shown below.
gda@GDA:~/chapter04 $ convert2annovar.pl SA12891.vcf -format vcf4 > father

..
4 NOTICE: Read 305798 lines and wrote 232950 different variants at 232950 genomic
positions (232950 SNPs and 0 indels)
NOTICE: Among 232950 different variants at 232950 positions, 144962 are
heterozygotes, 87988 are homozygotes
NOTICE: Among 232950 SNPs, 156825 are transitions, 76125 are transversions
2. In the same way, convert input files for mother and child.
gda@GDA:~/chapter04 $ convert2annovar.pl SA12892.vcf -format vcf4 > mother
gda@GDA:~/chapter04 $ convert2annovar.pl SA12878.vcf -format vcf4 > child
3. Thus far, we have created three files, one each for the father, the mother, and the
child.
4.5.2.3 Adding Annotation to Variants (SIFT, PolyPhen2)

1. You can add annotate variants with predictions of their effects on protein structure.
First, calculate SIFT (Sorting Intolerant From Tolerant) Score for the father’s variants
using ANNOVAR.
gda@GDA:~/chapter04 $ annotate_variation.pl –filter –dbtype avsift father
$ANNOVAR_HUMANDB –buildver hg19
2. After the calculation, the files father.hg19_avsift_dropped and father.hg19_avsift_fil-

tered will be created. Because only coding variants are included in SIFT predictions,
noncoding variants will be in the _filtered instead of the _dropped fie.
Verify the father.hg19_avsift_dropped file. The values in the second column are
SIFT Scores, ranging between 0 and 1. Values under 0.05 are variants predicted to
65 4
have amino acid substitutions that damage protein structure or function, while values
above 0.05 are classified as benign.
gda@GDA:~/chapter04 $ head father.hg19_avsift_dropped
3. Next, calculate PolyPhen2 Scores using ANNOVAR, just like SIFT. With PolyPhen2,
scores below 0.15 are classified as benign, scores between 0.15 and 0.85 are possibly
damaging, and scores above 0.85 are probably damaging. In contrast to SIFT, higher
scored variants are considered more damaging.
gda@GDA:~/chapter04 $ annotate_variation.pl –filter –dbtype ljb_pp2 father
$ANNOVAR_HUMANDB –buildver hg19
4. As with the SIFT analysis, dropped and filtered files are created. Open the file labeled
father.hg19_ljbpp2_dropped. The values in the second column are the PolyPhen2
scores.
gda@GDA:~/chapter04 $ head father.hg19_ljb_pp2_dropped
※ Analysis workflow using ANNOVAR

The 9-step flow process used in several literary articles has a 1 in 10^8 chance to find a
rare variant. ANNOVAR can execute each step of the process, outlined below. For more
details, see the ANNOVAR website: 7 http://annovar.openbioinformatics.org/en/latest/

misc/accessory/
Step 1. Find variants that occur in splice sites or exonic regions. This limits analysis to
variants that more likely have effects on protein function.
Step 2. Find variants in conserved regions. Variants in conserved regions are more
likely to have large effects on function.
Step 3. Find variants not in segmental duplication regions. Most variants in segmental
duplication regions have smaller effects.
Steps 4,5,6. Remove variants observed in the 1000 Genomes Project (CEU, YRI, JPT,
CHB) (Phase 1 data). These variants are common and cannot be considered as factors
causing Mendelian disease.
Step 7. Remove variants registered in dbSNP.
Step 8. Map the remaining variants to genes.
Step 9. Find genes that include many variants.
In this exercise, we will perform each step in the order stated above.
4.5.2.4 Step 1. Extract Variants in Splice Sites and in Exonic Regions

1. This step can be skipped for the whole genome sequence and exon sequencing
analysis. However, as 1000 Genome project provide a whole genome sequencing
data, this practice analyzes whole genome sequencing data.
rediction
gda@GDA:~/chapter04 $ annotate_variation.pl -geneanno -dbtype refgene - outfile
child.step1 child $ANNOVAR_HUMANDB
NOTICE: Reading gene annotation from

/home/gda/program/annovar/humandb/hg18_refGene.txt ... Done with 5045
transcripts (including 920 without coding sequence annotation) for 2974
unique genes
4
NOTICE: Reading FASTA sequences from
/home/gda/program/annovar/humandb/hg18_refGeneMrna.fa ... Done with 164

sequences
NOTICE: Finished gene-based annotation on 50636 genetic variants in child
NOTICE: Output files were written to child.step1.variant_function,
child.step1.exonic_variant_function
2. Variants at splicing sites and in exonic regions are saved to the file child.step1.
variant_function, which is also the input file for the next step. Execute the following
command to create the file child.step2.varlist, and use this file for step 2.
gda@GDA:~/chapter04 $ next_step.pl -outfile child -step 1
4.5.2.5 Step 2. Extract Variants in Conserved Regions

1. The UCSC Genome browser has a track documenting conservation of sites across 46
species. ANNOVAR uses this track to determine conserved regions.
gda@GDA:~/chapter04 $ annotate_variation.pl -regionanno -dbtype mce44way

-outfile child.step2 child.step2.varlist $ANNOVAR_HUMANDB
NOTICE: Reading annotation database
/home/gda/program/annovar/humandb/hg18_phastConsElements44way.txt ... Done

with 415560 regions
NOTICE: Finished region-based annotation on 105 genetic variants in child.step2.varlist
NOTICE: Output files were written to
child.step2.hg18_phastConsElements44way
67 4
2. Variants in conserved regions are saved to the file child.step2.hg19_phastCons
Elements46way. For the next step, execute the command below, and verify that the
file child.step3.varlist is created.
4.5.2.6 Step 3. Extract Variants Not in Segmental

Duplication Regions
1. Variations in segmental duplications have problems in sequence alignment and
genotype calling, and therefore we exclude them from analysis. These regions are
already defined, and using the following command we can exclude variants occurring
in them.
gda@GDA:~/chapter04 $ annotate_variation.pl -regionanno -dbtype segdup
-outfile child.step3 child.step3.varlist $ANNOVAR_HUMANDB
NOTICE: Reading annotation database
/home/gda/program/annovar/humandb/hg18_genomicSuperDups.txt ... Done with
10209 regions
NOTICE: Finished region-based annotation on 57 genetic variants in child.step3.varlist
NOTICE: Output files were written to child.step3.hg18_genomicSuperDups
2. Variants not present in segmental duplication regions are saved to child.step3.

hg19_genomicSuperDups. To generate the input file for the next step, execute the
following command.
4.5.2.7 tep 4. Remove CEU Population Variants Observed

S
in the 1000 Genome Project
1. The 1000 Genomes Project analyzed people without any particular diseases.
Therefore, we assume that variants identified in these people will not play a large role
in disease, and remove them from our data. First, we remove those variants known in
populations of European descent with the following command.
rediction
gda@GDA:~/chapter04 $ annotate_variation.pl -filter -dbtype 1000g_ceu - outfile
child.step4 child.step4.varlist $ANNOVAR_HUMANDB
NOTICE: Variants matching filtering criteria are written to
child.step4.hg18_CEU.sites.2009_04_dropped, other variants are written to
4 child.step4.hg18_CEU.sites.2009_04_filtered
NOTICE: Processing next batch with 57 unique variants in 57 input lines
NOTICE: Scanning filter database
/home/gda/program/annovar/humandb/hg18_CEU.sites.2009_04.txt...Done
2. The remaining variants are saved to child.step4.hg19_1000g2012apr_eur_filtered.

Execute the following command for the next step’s input file.
4.5.2.8 tep 5. Remove YRI Population Variants Observed

S
in the 1000 Genomes Project
1. As in 7 Sect. 4.5.2.7, remove all variants known from people of African descent.

gda@GDA:~/chapter04 $ annotate_variation.pl -filter -dbtype 1000g_yri - outfile
child.step5.hg18_YRI.sites.2009_04_dropped, other variants are written to
child.step5.hg18_YRI.sites.2009_04_filtered
/home/gda/program/annovar/humandb/hg18_YRI.sites.2009_04.txt...Done
2. As before, variants remaining after removing known African population variants are
saved to the child.step5.hg19_1000g2012apr_afr_filtered file. Execute the following
command for the next step’s input file.

69 4
4.5.2.9 tep 6. Remove JPT, CHB Population Variants Observed
S
in the 1000 Genomes Project
1. The process to remove variants known to occur in Asian populations is as shown
below.
gda@GDA:~/chapter04 $ annotate_variation.pl -filter -dbtype 1000g_jptchb - outfile
child.step6.hg18_JPTCHB.sites.2009_04_dropped, other variants are written to
child.step6.hg18_JPTCHB.sites.2009_04_filtered
/home/gda/program/annovar/humandb/hg18_JPTCHB.sites.2009_04.txt...Done
2. Variants remaining after the process above are saved to the child.step6.
hg19_1000g2012apr_asn_filtered file. Execute the following command for the next
step’s input file.
4.5.2.10 Step 7. Remove Variants Registered in dbSNP

1. In this exercise, the goal is to identify rare variants unique to the individual, so we
remove those variations previously registered in dbSNP. The command for doing so
is shown below.
gda@GDA:~/chapter04 $ annotate_variation.pl -filter -dbtype snp130 - outfile
child.step7.hg18_snp130_dropped, other variants are written to
child.step7.hg18_snp130_filtered
/home/gda/program/annovar/humandb/hg18_snp130.txt...Done
rediction
2. Variants remaining after removing registered dbSNP variants are saved to the file
child.step6.hg19_1000g2012apr_asn_filtered. Execute the following command for the
next step’s input file.
4.5.2.11 Step 8. Organize the Remaining Variants and Map them

4 to Genes
1. In Step 8, we use the genomic locations of the remaining variants to associate them
with genes. This is done by simply using the gene information file and the command
below.
4.5.2.12 Step 9. Prepare a List of Genes with Many Variants

1. If there are many variants in one gene, we can conclude that this gene likely has a
large effect on the phenotype. Next, we will calculate the frequency of variants in any
given gene.
4.5.2.13 Step 9 for Parental Variants

1. Up to this point, we executed each command separately to understand the process.
However, all steps can be executed in ANNOVAR with one command.
gda@GDA:~/chapter04 $ auto_annovar.pl -model recessive father
$ANNOVAR_HUMANDB
2. The default command automatically executes steps 1 through 9. However, users will
sometimes choose to leave out steps. In this exercise, we will execute the ANNOVAR
analysis without step 2. The command is as follows.
gda@GDA:~/chapter04 $ auto_annovar.pl -model recessive -step 1,3-9 mother

$ANNOVAR_HUMANDB
71 4
4.5.2.14 Verify Variant List of Three Family Member Samples
1. At this point, we have obtained a list of potentially harmful variants for each of the
three samples.
gda@GDA:~/chapter04 $ more father.step9.varlist
nonsynonymous SNV VWA3B:NM_144992:exon11:c.T1538C:p.I513T, 2
98175864 98175864 TC het . 191
gda@GDA:~/chapter04 $ more mother.step9.varlist
nonsynonymous SNV
CGREF1:NM_001166239:exon6:c.G430A:p.G144R,CGREF1:
NM_006569:exon6:c.G430A:p.G144R,CGREF1:NM_001166241:exon5:c.G142A:p.
G48R, 2 27178173 27178173 C T het . 102
nonsynonymous SNV BMPR2:NM_001204:exon12:c.G2084T:p.G695V, 2
203128717 203128717 GT het . 150
gda@GDA:~/chapter04 $ more child.step9.varlist
nonsynonymous SNV BMPR2:NM_001204:exon12:c.G2084T:p.G695V, 2
203128717 203128717 GT het . 150
nonsynonymous SNV
KIF1A:NM_001244008:exon47:c.C5114T:p.A1705V,KIF1A:
NM_004321:exon44:c.C4811T:p.A1604V, 2241307196 241307196 G
A het . 138
2. In this exercise, we analyzed chromosome 2 only, and therefore it is easy to examine

the few resulting variants. For larger datasets, there needs to be an automatic process
to identify whether variants come from the father or mother, in addition to which
variants the child acquired.
rediction
4.5.3 Finding the Disease Gene of a Child Using Kinship Data
4.5.3.1 Analyze a Recessive Disease in Several Children

1. Homozygous: heterozygous state in the parent generation, changing to homozygous
state in the filial generation
2. Compound heterozygous: multiple variants in one gene, each heterozygous in the
parent generation and also heterozygous in the filial generation. Consider the IBD
(identity by descent) score of several factors
4 3. De novo mutation: a variant not detected in the parent generation, but detected in
the filial generation
4.5.3.2 Find De Novo Mutations of a Child in Practice Data

1. The following command generates two result lists; compound heterozygous allele and
de novo mutation considering each relationship.
gda@GDA:~/chapter04 $ pedigree_analysis.py father.step9.varlist
mother.step9.varlist child.step9.varlist
*** Putative disease gene list ***
- Due to compound heterozygous state allele
BMPR2
- Due to de novo mutation
KIF1A
4.5.4 Analysis of Gene Lists Known to Affect Phenotype
4.5.4.1 Gene Prioritization

1. As mentioned previously (. Table 4.2), we provide a list of genes known to be related

to HSP. This list is saved in the file ‘training_set.genes’ (. Table 4.3).

2. Go to the ToppGene site (7 https://toppgene.cchmc.org/).

Select ‘ToppGene: Candidate gene prioritization.’ (. Fig. 4.20).

3. Double click the ‘training_set.genes’ file to open it, select all and copy, then paste them
into the left field titled ‘Training Gene Set.’ In the right field titled ‘Test gene set,’ input
the genes BMPR2 and KIF1A, identified in the previous exercise, and also input
VWA3B and CGREF1, which were only found in the parent. Then select ‘Submit Query.’
(. Fig. 4.21).

73 4
4. Under “Training parameters,” select features to be analyzed, correction types, and a
threshold value; then click ‘Start’ to begin the prioritization analysis (. Fig. 4.22).
5. The four test genes entered are ranked using information from the training set. With
this data, the gene ‘KIF1A’ was identified as the most significant.
.. Table 4.3 List of genes related to HSP
Gene name OMIM Phenotype
CYP7B1 603711 SPG5, Bile acid synthesis defect
HSPD1 118190 SPG13, hypomelinating

leukodystrophy
KIAA0196 610657 SPG8
KIF5A 602821 SPG10
NIPA1 608145 SPG6
PLP1 300401 SPG2, Pelizaeus-Merzbacher

disease
PNPLA6 603197 SPG39
REEP1 609139 SPG31
SPAST 604277 SPG4
ATL1 606439 SPG3A
ZFYVE27 610243 SPG33
.. Fig. 4.20 Candidate Gene Prioritization using ToppGene (1)

rediction
Exercises
[Exercise 1] - Using the VCF you own as the input, run GENOtation and Promethease and check the
medical interpretation results.
[Exercise 2] - If there is a variant annotated as “Bad News” from performing [Exercise 1], use ANNOVAR
to look into the variant more closely. We can obtain more information on the damaging risk the variant
has according to its SIFT and Polyphen scores.
Bibliography
75 4
Take Home Message
55 What is the algorithm for prediction of genetic disease risk and its meaning.
55 How to analyze the disease-related sequence variation.
Bibliography
scale sequencing. Nature 467:1061–1073
2. Ashley EA et al (2010) Clinical assessment incorporating a personal genome. Lancet 375(9725):1525–
1535
3. Cariaso M, Lennon G (2012) SNPedia: a wiki supporting personal genome annotation, interpretation
and analysis. Nucleic Acids Res 40(D1):D1308–D1D12
4. Chen J et al (2007) Improved human disease candidate gene prioritization using mouse phenotype.
BMC Bioinformatics 8:392
5. dbGap – http://www.ncbi.nlm.nih.gov/gap
6. Erlich Y et al (2011) Exome sequencing and disease-network analysis of a single family implicate a
mutation in KIF1A in hereditary spastic paraparesis. Genome Res 21:658–664
7. GENOtation – http://genotation.standford.edu/
8. Karczewski KJ et al (2012) Interpretome: a freely available, modular, and secure personal genome
interpretation engine. Pac Symp Biocomput:339–350
9. Mailman MD et al (2007) The NCBI dbGaP database of genotypes and phenotypes. Nat Genet
39(10):1181–1186
10. Ng SB et al (2010) Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 42:
30–35
11. PheGenI – http://www.ncbi.nlm.nih.gov/gap/PheGenI
12. Promethease – http://snpedia.com/index.php/Promethease
13. Ramos EM et al (2013) Phenotype–genotype integrator (PheGenI): synthesizing genome-wide asso-
ciation study (GWAS) data with existing genomic resources. Eur J Hum Genet
14. Saccone SF et al (2010) SPOT: a web-based tool for using biological databases to prioritize SNPs after
a genome-wide association study. Nucleic Acids Res 38(suppl 2):W201–W2W9
15. SNPedia – http://www.snpedia.com/index.php/SNPedia
16. SPOT – https://spot.cgsmd.isi.edu/submit.php
17. ToppGene site – https://toppgene.cchmc.org
18. Wang J et al (2007) Estrogen receptor alpha haplotypes and breast cancer risk in older Caucasian
women. Breast Cancer Res Treat 106(2):273–280
19. Wang K et al (2010) ANNOVAR: functional annotation of genetic variants from high-throughput
sequencing data. Nucleic Acids Res 38:e164
77 II
Advanced Microarray
Data Analysis
Contents
Chapter 5 Advanced Microarray Data Analysis – 79
Chapter 6 Gene Expression Data A

nalysis – 95
Chapter 7 Gene Ontology and Biological Pathway-Based

Analysis – 121
Chapter 8 Gene Set Approaches and Prognostic Subgroup

Prediction – 135
Chapter 9 MicroRNA Data Analysis – 159

79 5
Advanced Microarray
Data Analysis
5.2 Microarray Experiment – 81
5.3 Structure and Normalization of Microarray Data – 82
5.4 Differentially Expressed Genes (DEGs) – 84
5.5 Cluster Analysis and Interpretation – 85
5.6 Classification Analysis – 86

5.6.1 Linear Discriminant Analysis (LDA) – 87
5.6.2 Support Vector Machine (SVM) – 87
5.6.3 K-Nearest Neighborhood (KNN) – 88
5.7 Gene Set Enrichment Analysis – 88

5.7.1 Analysis Method and Feature – 89
5.7.2 Gene Set Database – 90
5.8 Survival Analysis and Prognostic Subgroup

Prediction – 90
Bibliography – 92

https://doi.org/10.1007/978-981-13-1942-6_5
80 Chapter 5 · Advanced Microarray Data Analysis

In this chapter, we provide a basic understanding of microarray data analysis, which is the
foundation of gene expression data analysis. This chapter describes a microarray experi-
ment method and the data structure generated by microarray. There are exercises to iden-
tify differentially expressed genes between case and control groups, to perform cluster and
classification analysis, and to understand the importance of biological pathway analysis
with the interpretation of microarray data using the GSEA program and R package.
5.1 Introduction
5
Since the traditional molecular research method was developed to target a single gene, it
has been difficult to go beyond understanding the result of translating a single gene and
studying the interaction of the network between numerous genes. However, the molecu-
lar biological analysis system was reborn via advanced concept analysis technology. Such
improvement with high-speed mass experimental technique is used to measure tens of
thousands of genes at the same time in just one experiment like that of DNA microarray.
Microarray research is gaining its significance through the integrated analysis of cease-
lessly developing next-generation sequencing (NGS).
Since the microarray technique uses a single chip to measure tens of thousands of
genes, it creates copious data in a short period. At this time, gene expression is quantita-
tively analyzed.
One of the difficulties of quantifying the actual gene expression and standardizing
the condition to whole genome analysis is cacophony and low reproducibility. Since the
experimental condition has to be subjected to the total genome, the best experimental
condition for individual gene expression cannot be determined. For example, it is possible
to adjust the experimental conditions to detect lower gene expression in the Northern
blot technique, but it is not easy to separate the accuracy of lower gene expression in the
microarray technique experiment that examines the total gene. The intrinsic property of
the microarray is its quantitative analyze an entire subject, and new analysis methods for
microarray data must be developed.
A quantification experiment that has standard experimental conditions has an
advantage. The meta-analysis of various experimental data is gaining more attention,
since the measured value is based on standardization and quantification. This property
makes it possible to save and share globally generated microarray data, and this property
becomes the backbone for forging a public database, such as GEO,1 ArrayExpress,2 and
Oncomine.3 By opening microarray data in these databases to the public, researchers
can use public data without performing experiments. Understanding microarray data
and the importance of acquiring the analysis technique are becoming more important.
1 7 https://www.ncbi.nlm.nih.gov/geo/

2 7 https://www.ebi.ac.uk/arrayexpress/

3 7 https://www.oncomine.org/

5.2 · Microarray Experiment
81 5
5.2 Microarray Experiment
The DNA microarray is constructed in 5 steps, shown in . Fig. 5.1. Step 1 shows how to

make a DNA chip. In the past, each lab directly made a robot that spots a DNA clone onto
the chip. Then, it is amplified and refined by PCR. Lastly, a ‘robotically spotted cDNA
microarray’ is produced. However, it is more cost-effective to buy and use a mass-produced
chip. The mass-produced chip is manufactured by creating an oligomer nucleotide on a
chip directly or with a robotic printer.
DNA chips can be separated into a cDNA (200–500 base pairs) chip and oligobase
sequence (15–100 base pairs) chip, depending on base sequence. Moreover, a DNA chip
can be separated, depending on the manufacturing method, such as pin microarray and
inkjet for robotically spotted chips. Also, a DNA chip like photolithography chip from
Affymetrix can be separated by using semi-conductive manufacturing process. A cDNA
chip is literally attaching ORFs (open reading frames, i.e., a gene) or ESTs (expression
sequence tags) to all base sequences; it distinguishes a gene that has a complementary
sequence properly. (1) Affymetrix’s short-base sequence chip is synthesized layer-by-
layer using photolithography technology to manufacture a semi-conductive chip on a
small glass plate. It can distinguish 25-mers, which are composed of 25-base sequences,
for 425 different genes both theoretically and properly. A robotically spotted chip can
normally synthesize 10,000–30,000-base sequences, whereas a photolithography chip can
2 1
Make
Interesting Interesting Interesting
biochip
patients animals cell lines
3
Appropriate Appropriate
Extract RNA
tissue conditions
Hybridize
biochip
5
4
Access Functional Data pre-
significance clustering processing Scan
biochip
Post-cluster
analysis &
Biological integration Informatical
validation validation?
.. Fig. 5.1 Microarray experiment and data analysis

synthesize a sequence more than 1,000,000-bases in a single slide. (2) Instead of using
cDNA, technology that prints 60–100-mer synthetic base sequences has received atten-
tion. A 50–70-mer synthetic base sequence library is cheaper, overcoming the disadvan-
tage and expensive price of Affymetrix’s short 25-mer and the disadvantage of cDNA, a
difficult aspect of clone library maintenance.
Step 2 is preparation of the specimen, which separates mRNA from the experimental
and control specimens individually. This creates cDNA that is indicated by red Cy5 and
green Cy3 fluorescent dyes during reverse transcription. Step 3 is a hybridization step,
which stains the experimental specimen red and the control specimen blue by a hybridiza-
tion reaction between complementary sequence probes on the chip and samples with the
5 same amount of fluoresce.4 In this process, two different fluorescent samples competitively
undergo a synthesis reaction with the fixed probe on the chip under the same condition.
A robotically spotted chip often dyes two different samples with two colors to create more
competitive ‘two-dye technique’ binding. The unreactive specimen does not hybridize
after the reaction and it needs to be removed properly by washing.
Step 4 is quantification of the brightness of the fluorescence of each probe that hybrid-
ized with fluorescent dyes with the laser fluorescent scanner. It quantifies the mRNA
expression, depending on the ratio of red:green intensity of the two fluorescent materi-
als. The color intensity is converted into a numerical value after the image is analyzed.
The amount of red light obtained from Cy3 and the amount of green light obtained from
Cy6 are mixed computationally to create a red~yellow~green spectrum, which creates
data and visualizes the quantification of the numerical values of the expression along with
ratios of the experimental and control specimens. The Affymetrix chip arranges 25-mer
base sequences that match perfectly with the target gene and those that have a mismatch at
base 13, the midpoint of the 25-mer, to compare binding ratios between the two sequences
as a control for any noise due to cross-linking.
Step 5 is the analysis of the numerical value for the expression of each gene. The
detailed analysis steps will be introduced in 7 Sects. 5.3, 5.4, and 5.5. The previous experi-

mental step is based on a two-colored experimental technique that uses red and green.
However, due to the development of precise technology, the use of a monotone technique
is growing, and the measured values have become more credible. The basic data analysis
principles of the two techniques do not differ considerably.
5.3 Structure and Normalization of Microarray Data
The data of microarray experiments are indicated in gene expression matrix and are sepa-
rated into three parts. (1) A list and annotation of genes provide the link to the relevant
database. (2) A list and annotation of specimens provide the link between the classification
of species and a taxonomy database. (3) Each element of a gene expression matrix shows
the numerical value of gene expression. Moreover, this gene expression matrix is gener-
ated in three steps: (1) raw data that are created by the results of laser fluorescent scanning,
4 In the dyeing method, reverse transcription (RT) transforms mRNA into complementary single-
strand cDNA using oligo-dTprimer and reverse transcriptase for technical convenience. In this case,
Cy3-dUTP, which emits green light, and Cy5-dUTP, which emits red light, are added into each reac-
tion separately, which converts all mRNA in the reference and test cell to target cDNA by mixing Cy3
and Cy5.
5.3 · Structure and Normalization of Microarray Data
83 5
(2) various measured values from each experiment (i.e., each chip or each hybridization)
and the quantification matrix that is made of reference counts, and (3) a gene expression
matrix, which is the bundle of numerical gene expression values that are calculated from
the quantification matrices from many experiments (. Fig. 5.2).
Data normalization is necessary before analyzing microarray data. Data normalization

is an essential process for setting the data to be comparable. Actually, data normalization
is an important preprocessing step before analysis. Normally, molecular biology data are
Sample annotation
Samples
Gene expression
matrix
Genes
Gene expression levels
Gene annotation
Raw data Quantification Gene expression
matrices data matrix
Array scans Quantifications Samples
Genes
Spots
Gene
Quantification datum expression
level
.. Fig. 5.2 Generation and structure of microarray data

simple and used for qualitative or semi-quantitative analysis, which skips data normaliza-
tion and preprocessing. Microarray experiments use large-scale data normalization and
relative values5 of the intensity of light, and each chip varies during the manufacturing
process. Moreover, the red and green dyes have different DNA binding and signal detect-
ing efficiencies, requiring data normalization and preprocessing.
Data normalization normally uses three different technologies. (1) Spiked control of
index gene: a housekeeping gene6 that has a comparatively constant expression ratio, is
used as an index to compute the expression of other genes. A 2-colored technique cali-
brates housekeeping genes of the experimental and control specimens, dyed with red and
green at the ratio of 1:1. In this case, it is assumed that the housekeeping gene expression is
5 constant with all specimens and conditions. (2) Global normalization of the total amount
of fluorescence: Applying the assumption that the total amount of fluorescence emission
from one chip (i.e., the total amount of mRNA) is constant, and the total expression rate
of an object of a comparative analysis is made to be constant. (3) Fluorescence inten-
sity-dependent normalization: According to research that the detectable fluorescence is
calculated per the fluorescence intensity function.7 A curvilinear regression technique is
used to speculate fluorescent intensity under the assumption that the fluorescent inten-
sity increases (total fluorescence amount) and decreases gene dosage (total fluorescence
amount) are the same. Besides, various normalization methods are recommended, con-
sidering the data properties. The three aforementioned methods are applied to both the
monotone and two-colored experimental techniques.
For the data of the two-colored experiment, calculated by the expression ratio of both
the experimental and control specimens, a gene that has a low expression amount shows
a large change in difference in expression.
5.4 Differentially Expressed Genes (DEGs)
The objective to performing the most typical microarray experiment is to obtain a list
of upregulated or downregulated genes in a specimen compared to a control. One can
get a good result by analyzing the fold-difference due to the intervention (DeRisi et al.
1997; Heller et al. [13]). However, average values are used for statistical analysis. T-test,
which uses T-distribution, is used more often to compare two independent or conjugated
samples, and ANOVA, uses F-distribution, to analyze the average values of more than 3
groups. These principles are applied to find genes in microarray data that are differentially
expressed through slight modifications.
However, a microarray experiment cannot fulfill the assumption of the traditional
T- or F-test due to the low number of repeated experiments. Especially, the distribution
function, which is written as a percentage, is difficult to decipher when the two-colored
technique is used. Various limitations exist, such as many missing values. Analytic meth-
ods for microarray data have been developed to overcome the limitations of traditional
statistical analysis techniques with average values. Pierre Baldi’s Cyber T8 or Tibshirani’s
5 Instead of using absolute light intensity, the relative value of the opening of the aperture or the
sensitivity setting of the light sensor is used.
6 Often, rRNAand GAPDH (glyceraldehyde 3-phosphate dehydrogenase) are used.
7 Dye bias is dependent on spot intensity.
8 7 http://cybert.ics.uci.edu

5.5 · Cluster Analysis and Interpretation
85 5
SAM (significance analysis of microarrays) are representative methodologies that can
generate comparatively stabilized results with a low number of repeat experiments.
Even if discriminative expression analysis of a single gene is performed perfectly,
numerous significance tests can not be conducted at the same time. The multiple hypoth-
esis testing problem theory states that as the number of hypotheses increases at the same
time, the false positive rate also increases as with the number of tests. There are various
ways to solve the multiple hypothesis testing problem, but the representative methods are:
(1) the FWER (family-wise error rate) calibrating method (2) the FDR (false discovery
rate) calibrating method. FWER is limited to at least the one false positivity after perform-
ing the nth test.9 However, FWER tends to reach an excessively conservative conclusion
in microarray with the large number of testing, and limiting “the possibility of the false
positivity of the presented result to be at least more than one” as this is not clear in a
practical sense; thus, FDR is applied more extensively in microarray data analysis. FDR
is a concept that estimates the false positive ratio of identified genes —for instance, 100
identified genes are not specified, but 5 genes are decided to be false positives.10
5.5 Cluster Analysis and Interpretation
In 7 Sect. 5.3, cluster analysis of differentially expressed genes is performed. Cluster anal-

ysis is a similarity structure analysis. The 800 genes that are involved in the yeast cell cycle
were analyzed in regards to its periodicity and its correlation coefficient. Cluster analysis
finds a correlated gene, but its function is not revealed by this procedure. Cluster analysis
generates a new hypothesis of how an unknown gene correlates to another gene to reduce
the time and effort of planning the next experiment. After finding a gene cluster in various
experimental conditions, it is a good strategy to find tightly co-regulated genes, regardless
of changing conditions.
Cluster analysis can be separated into (1) hierarchical cluster analysis and (2) classified
cluster analysis, depending on the cluster structure. Hierarchical cluster analysis is a struc-
turalizing method with a hierarchical tree, whereas classified cluster analysis separates
each classified partition. Another method is to separate the data into two types: parti-
tion type and merge type by cluster format. Partition type is initially whole data that are
composed of one cluster and separates it into smaller irrelavant clusters, whereas merge
type starts with N number of clusters such that the initial size of the entire data is 1, the
accumulates similar clusters into on big cluster.
Similarity distance could be applied to various distances, such as Euclidean spatial
distance, vector angle, and correlation coefficient. Depending on the analysis objective
and applied distance, the result differs. In order to show the similarity of a gene profile,
vector angle or correlation coefficient is more reasonable compared to Euclidean distance,
which reflects profile shape. Making a profile that correlates inversely by square or abso-
lute value of the correlation coefficient into one cluster could be considered a reasonable
case. Choosing a distance is an issue that could be raised in order to find the optimally
designed distance, depending on the purpose of analysis. Likewise, forming clusters could
be a problem that needs further explanation. In order to overcome the issues, there are
9 FWER = Pr(α > 0).
10 pFDR = E(α / rejected | rejected >0).
a Differential expression b Co-expression(or clustering)

gl
Expression level
Expression level
Expression level gl S0
S0
gl
gl
C1 C1 C1
Significantly
over-represented Random
G0:000123 G0:000126 Selected
Not selected
5
Biological pathways
Gene ontology
.. Fig. 5.3 Over-representation analysis of an overexpressed gene list resulting from differential expres-
sion analysis and cluster analysis. (a) Differential gene expression. (b) Set-wise differential expression (GSEA)
various analysis strategies such as the method that correlates after dividing strategically
into many clusters and the SOM (self-organizing map) method that reconstitutes the
interrelationship of the obtained cluster.
The biological interpretation of the cluster that is extracted by analysis is normally
accepted as tightly co-regulated genes. The obtained cluster will be helpful for GO (Gene
Ontology) or biological pathway-based annotation and performing biological interpreta-
tion through the similarity test. The significance test of annotation for over-representation
analysis (ORA) is (1) the gene list that is given by input, (2) a significant comparison to a
random, specific GO annotation or pathway-affiliated gene, and (3) a technique that uses
hypergeometric distribution to test (. Fig. 5.3). Today, there is interest in the input gene

list, the list of differentially expressed genes. For instance, only upregulated genes can be
used as input, or the gene whose expression has risen twofold. Over-representation analy-
sis will be explained in 7 Chap. 6.
5.6 Classification Analysis
Classification analysis systemically classifies samples that are already prepared for clas-
sification. Cluster analysis, which is an exploratory analysis technique, creates new clus-
ters by similarity analysis without any information about the sample classification system,
whereas classification analysis is a machine learning technique that could be applied to a
situation where the classification label information on the sample classification system is
5.6 · Classification Analysis
87 5
given. Classification analysis is called ‘supervised learning,’ because it has the classifica-
tion label information, whereas cluster analysis is called ‘unsupervised learning,’ because
it does not.
For example, the gene expression profile that is obtained from n number of liver tissues
and m number of brain cells would each get a profile, called ‘liver or brain class label.’ This
label is considered as the standard. It creates a classifier that is a type of discriminant func-
tion from two different data matrices. A classifier with unknown data is used to predict and
deduce the most proper label. Medically, diagnosis and classification are similar in that the
diagnostic problem is solved through the classification analysis of gene expression data.
The created classifier can be evaluated in a cross validation experiment. The original
data can be divided into a training set and a test set; a classifier could be applied to a test set
to evaluate the performance after a classifier can be drawn from a training set. Cross valida-
tion is performed numerous times to optimize a classifier with various designs. In 7 Chap.
6, a classification analysis exercise is introduced using golubEsets data in R package.

The first successful classification analysis of microarray data was done in the publica-
tion of “Molecular Classification of Cancer” in Science. Golub et al. raised the possibility
of theoretical molecular classification of cancer by classifying high discrimination through
microarray and via classification analysis of two acute leukemia subtypes: myelogenous
leukemia and lymphatic leukemia. It could be applied to diagnose new patient or their
prognosis. As the classification analysis is confirmed, many applied experimental results
will be published. Creating a classifier that selects a few core genes that are used in the
classifier also confirms the utility of the investigation of disease mechanisms.
There are two methods of classification analysis: parametric and non-parametric. The
parametric method creates a classifier under the assumption that the sample is distrib-
uted. The non-parametric method does not assume a particular parameter or distribution
of the sample. Representative parametric methods are linear discriminant analysis (LDA),
quadratic discriminant analysis (QDA), and logistic regression analysis. Representative
non-parametric methods are support vector machine (SVM) (Vapnik 1995) and classifi-
cation and regression tree (CART) (Breiman 1984).
5.6.1 Linear Discriminant Analysis (LDA)
Linear combination, obtained from a statistical method, finds the linear combination of
a variable that classifies two or more samples. It is used as a linear classifier and used for
variable level reduction for continued classification analysis. The advantages of microar-
ray data analysis are evident and it provides an outstanding explanation of a model. The
downside is that the measure of a corresponding gene is sensitive. Moreover, if the disper-
sion of each category is different, accuracy decreases. Dudoit et al. reported that DLDA,
which uses only p number of dispersion information for each gene, showed better results
on re-substitution error rates than the covariance structure in the LDA method.
5.6.2 Support Vector Machine (SVM)
SVM or ‘Statistical Learning Theory’ was established by Vapnik (1995) and was applied to
the microarray initially by Terrence et al. (2000). SVM is a technique that finds a hyper-
plane that distinguishes two groups in hyperspace simply by applying a support vector.
Support vector is a representative vector of a contiguous decimal that is closed to the

hyperplane without using all of the samples. SVM has better extendibility compared to
the existing linear classification methods, and there is no assumption of population. Such
problems are common in all classification analyses.
5.6.3 K-Nearest Neighborhood (KNN)
K-NN is the most intuitive method conceptually. Based on the distance, K-NN detects
matters as its neighbor if they are close and assumes that it shares similar information
5 with the neighbor. Based on this property, K-NN method is widely known for revising
a missing value or estimating probability density. It is an older, elementary, and intuitive
algorithm but remains useful.
5.7 Gene Set Enrichment Analysis
As explained in “Cluster Analysis and Interpretation,” even if an expression profile

similarity-based gene cluster is obtained, an additional step is required to interpret the
biological meaning. Two methods to interpret the biological meaning involves GO analy-
sis of relative clusters and over-representation analysis of the biological pathway annota-
tion distributions. Over-representation analysis finds a corresponding cluster close to its
biological meaning (GO or pathway annotation frequency) and interprets the cluster after
composing the cluster (or the gene set) first.
Gene Set Enrichment Analysis (GSEA) defines the specific GO term or list of gene sets
in pathways differently from the over-representation analysis, and the test. Subset of the
given gene set are related to those subset of the given genes.
In other words, GSEA is used to find a correlating set among a pre-defined gene set. In
this analysis, a target list is simply the total gene group or it could be the result of the first
analysis like differentially expressed genes (DEGs).
GSEA can also be interpreted by comparison of differential expression. Differential
expression analysis finds upregulated or downregulated genes in two experimental condi-
tions (. Fig. 5.4a) to interpret the biological meaning of a gene, whereas GSEA finds a

biologically predefined gene set with statistically significant difference in its gene expres-
sion. Thus, the results of GSEA allow the possibility of biological interpretation; different
from differential expression analysis.
The interpretation of GSEA differs, depending on optimization analysis. If we use
optimization analysis, for instance, it is possible to find all possible combinations of a
gene group that shows the largest differential expression in two experimental conditions.
Therefore, optimization analysis can find the larger gene group compared to GSEA in
terms of significance. However, the significance of the biological interpretation of an opti-
mized gene group remains unknown.
In this perspective, GSEA is the result of bioinformatics research that combines infor-
matics algorithms and biological intuition of knowledge. GSEA is a very important algo-
rithm that has been developed and applied to various modifications and has opened a new
paradigm of microarray analysis. This research showed that the biological meaning indi-
cates the importance of a large gene and establishes the relevant database. When GSEA
was introduced, it was applied to interpret microarray data on muscle tissue of diabetic
5.7 · Gene Set Enrichment Analysis
89 5
a
Differential expression
exp exp
gj
gi
c1 c2
b
Set-wise differential expression (GSEA) C2
C2= C1 + b0
exp exp
S0 S0
c1 c2 C1
c
Coexpression (or clustering) gj s0
C0
exp
s0
c0
gi s0
.. Fig. 5.4 Comparison of differential expression, GSEA, and clustering in microarray data analysis.
(a) Differential expression. (b) Set-wise differential expression (GSEA). (c) Coexpression (or clustering)
patients. Originally, these data were created by another research group and differential
genes were not detected between diabetic patients and normal people in the differen-
tial expression analysis. Mesirov’s group developed GSEA and defined various gene sets
beforehand. The biological pathway was remarkably relevant to diabetes, which shows the
significant difference of the two groups after testing the significance of each differential
expression (. Fig. 5.4b). Genes from geneset S0 in . Fig. 5.4b showed slight differential

expression that did not reach significance. However, collectively in the right panel, it
showed a one-sided distribution (i.e., C1 = C2 below the line). Overall, it could be con-
cluded without much objection to have statistical significance. This method is helpful to
overcome the noise of individual gene analysis methods, the lack of significantly different
individual genes from two groups, and the inconsistent results between two microarray
platforms. In 7 Chap. 8, a detailed exercise for GSEA will be discussed.

5.7.1 Analysis Method and Feature
55 According to the selected distance function (e.g., expression rate) based on the
expression pattern, range, and ranks of all genes, the possibility of a designated gene
set concentrated in a specific portion (e.g., where the expression rate is high) can be
calculated to some degree with Kolmogorov-Smirnov statistics.
55 At this point, the differentially expressed genes and all of the genes in the relevant
gene set are used for calculation.
55 After confirming the maximum total score, genes with greater significance in the set
can be extracted.
55 After random permutation of the label of phenotype samples, the p-value and FDR
can be determined by calculating each enrichment score.
55 Adjust the significance level toward statistical multiple hypothesis testing. For each
gene set, enrichment score that is obtained from the previous step is normalized to
the size of the gene set to create an NES (normalized enrichment score). The statisti-
cally significant gene set is decided by modifying the false positive rate by calculating
5 the FDR toward each NES.
5.7.2 Gene Set Database
Gene annotation databases, which contain biological information about genes, include
Cytogenetic Band, KEGG pathway, and Gene Ontology. MIT created MSigDB,11 which
combines many gene set databases for GSEA analysis. MsigDB contains 17,779 gene sets,
composed of 8 main groups.
55 Hallmark gene sets: Summarize and represent specific well-defined biological states
or processes and display coherent expression.
55 Positional gene set: Set of genes in chromosomes and cytogenetic bands, including
the location information of the chromosome, such as deletion, amplification, and
epigenetic silencing.
55 Curated gene set: Collected and organized gene set that is obtained from Biological
pathway, PubMed papers, or domain experts; it is composed of chemical and genetic
perturbations and canonical pathways (BioCarta, KEGG, and REACTOME).
55 Motif gene set: Preserved cis-regulatory motif shared in gene set across human,
mouse, rat, and dog genomes, including miRNA and transcription factors of targeted
gene.
55 Computational gene set: Gene set that shows similar expression patterns to about 380
cancer-related genes (oncogene, tumor suppressor gene).
55 GO gene set: One gene set that comprises many genes, mapped by Gene Ontology
terms.
55 Oncogenic signatures: Gene sets that represent signatures of cellular pathways which
are often dis-regulated in cancer.
55 Immunologic signatures: Collected cell states and perturbations within the immune
system from human and mouse immunology.
5.8 Survival Analysis and Prognostic Subgroup Prediction

Survival analysis is called incident-time analysis—the time until the incident occurs is ana-
lyzed. Clinically most meaningful incident is death, but other incidents are also analyzed.
The standard method of survival analysis is to add each patient to the study at different
11 Molecular Signature Database, 7 http://software.broadinstitute.org/gsea/msigdb/

5.8 · Survival Analysis and Prognostic Subgroup Prediction
91 5
times until the incident occurs. Complete data contain the deceased patient’s data before
the research endpoint, and censored data are the patient’s data after the research endpoint.
A patient who left during the research means that the patient was alive until the last con-
firmed period. Thus, the point after which the condition is treated is the censored data.
This is called right-censored data in the former case; they are censored at the research
endpoint, which is the perfect time on the axis.
When complete or censored survival data are collected, it is hard to assume a normal
distribution; thus, statistical analysis that uses variance analysis or linear regression can-
not be used. Instead, hazard function, h(t), is used. Hazard function is defined as the
possibility of death at the moment right after period ‘t’, the period when the patient is still
alive. Hazard function is hard to calculate directly from the data, since hazard function
becomes 0 if there is no additional incident after the incident occurs. Instead, the cumula-
tive hazard function, H(t), performs in a similar manner until period ‘t’ of the total hazard.
Then, the slope of this function is a hazard. It is more convenient to deal with the hazard
ratio because hazard or cumulative hazard function changes over time during the research
period.
A 2x2 contingency table is created for calculation of every incident occurred by both
groups, one that is exposed to a specific hazard and one that is not. If the average is found
over the total research period, a weighted hazard average ratio could be estimated. Mantel-
Cos estimate of the hazard ratio (HRMC). It tests a specific hazard factor for the difference
between exposed and unexposed groups by using a x2 distribution that is based on a null
hypothesis. “HRMC = 1.” This testing method is called Mantel-Cox x2 test, and it is also
known as the log-rank test.
When there is a factor that is related to the incident, it is sufficient to use log-rank
test. If there are two factors, then it is necessary to calibrate other potential cofounders
such as method of survival analysis, and direct and indirect factors, such as age, sex, and
etiology. Logistic regression analysis focuses on the incident occurrence over time and
is a calibrating statistical technique that calibrates many independent variables at the
same time by setting odds ratio as a dependent variable. Survival analysis deals with the
hazard ratio, similar to that of odds ratio, can derive the algorithm of logistic regression
analysis. This is called Cox regression analysis. Since Cox regression analysis is performed
under “proportional hazard assumption,” also known as Cox proportional hazard regres-
sion analysis. Such analysis claims that the hazard ratio is constant without time. Under
the assumption of proportional hazard, H1(t)/H0(t) is a constant (C); if we take log of
both sides, then it becomes “log(H1(t))–log(H0(t)) = log(C),” which makes the hazard
functions of the exposed and unexposed groups parallel to each other. Mathematically,
it is “H(t) = –log(S(t));” the log(−log(S(t))) function is used for the proportional hazard
assumption test.12 If the hazard ratio changes over time, then it is not available to use,
requiring modification. The Cox model, which is used in the Kaplan-Meier method, can
quantify hazard ratio with differential survival rate. This is different from the log-rank test
yet and it is similar to odds ratio Therefore, it is easy to interpret clinically.
Research has attempted to determine the difference of survival rate of a specific group
of diseases with microarray data. Existing survival analysis has attempted to explain the
difference in survival in regards to the standard of correlation between many clinical haz-
ard factors. It could improve the analysis by considering gene expression information at
12 S(t) is survival function.

the same time. Alizadeh et al. suggest that it is possible to find out the survival rate dif-
ference of one subgroup of clinically similar lymphoma with the gene expression ratio. A
new subgroup can also be found by adding to the existing disease classification system.
Moreover, an international prognostic index value is used in the prognosis to confirm that
there is a difference in survival function, depending on the gene expression information
in the classified subgroup. Gene expression information implies that it is possible to use
to establish a specific therapy for each patient with the International Prognostic Indicator
(IPI). This standard classifies a patient’s condition in more detail by adding genetic infor-
mation of the patient’s symptom. Moreover, this standard is the foundation of customized
medicine, facilitating individual prescriptions by predicting the prognosis13 with genetic
5 information.
Take Home Message

55 How to perform the gene enrichment analysis of gene expression data and clini-
cally interpret.
Bibliography
1. Alizadeh AA et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression
profiling. Nature 403(6769):503–511
2. Baldi P, Long AD (2001) A Bayesian framework for the analysis of microarray expression data: regular-
ized t-test and statistical inferences of gene changes. Bioinformatics 17(6):509–519
3. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach
to multiple testing. JR Statist Soc B 57:289–300
4. Derisi JL et al (1997) Exploring the metabolic and genetic control of gene expression on a genomic
scale. Science 278(5338):680–686
5. Dudoit S et al (2002) Comparison of discrimination methods for the classification of tumors using
gene expression data. J Am Stat Assoc 97(457):77–87
6. Dudoit S et al (2002) Statistical methods for identifying differentially expressed genes in replicated
cDNA microarray experiments. Stat Sin 12:111–139
7. Dudoit S et al (2003) Multiple hypothesis testing in microarray experiments. Stat Sci 18:71–103
8. Durbin BB et al (2002) A variance-stabilizing transformation for gene-expression microarray data. Bio-
informatics 18(suppl 1):S105–S110
9. Eisen MB, Brown PO (1999) DNA arrays for analysis of gene expression. Methods Enzymol 303:3–18
10. Eisen MB et al (1998) Cluster analysis and display of genome-wide expression patterns. PNAS
95(25):14863–14868
11. Golub TR et al (1999) Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science 286(5439):531–537
12. Guyon I et al (2002) Gene selection for cancer classification using support vector machines. Mach
Learn 46(1/3):389–422
13. Heller RA et al (1997) Discovery and analysis of inflammatory disease-related genes using cDNA
microarrays. Proc Natl Acad Sci U S A 94(6):2150–2155
14. Huber W et al (2002) Variance stabilization applied to microarray data calibration and to the quantifi-
cation of differential expression. Bioinformatics 18(suppl 1):S96–S104
15. Kohonen T (1995) Self-organization maps. Springer
16. Newton MA et al (2001) On differential variability of expression ratios: improving statistical inference
about gene expression changes from microarray data. J Comput Biol 8(1):37–52
13 Prognostic subgroup prediction.

Bibliography
93 5
17. Spellman PT et al (1998) Comprehensive identification of cell cycle–regulated genes of the yeast Sac-
charomyces cerevisiae by microarray hybridization. Mol Biol Cell 9(12):3273–3297
18. Tamayo P et al (1999) Interpreting patterns of gene expression with self-organizing maps: methods
and application to hematopoietic differentiation. PNAS 96(6):2907–2912
19. Tusher VG et al (2001) Significance analysis of microarrays applied to the ionizing radiation response.
PNAS 98(9):5116–5121
20. Yang YH et al (2001) Normalization for cDNA microarry data. Proc SPIE 4266:141
95 6
Gene Expression
Data Analysis
6.3 Differential Gene Expression Analysis – 103

6.3.1 Analysis of Significant Differential Expression in
Two-Group Comparisons: T-Test and SAM – 104
6.3.2 Analysis of Significant Differential Expression in
Comparisons of More than Three Groups: ANOVA – 109
6.4 Clustering Analysis – 111

6.4.1 Hierarchical Clustering – 111
6.4.2 K-Means Clustering – 113
6.4.3 SOM Clustering – 114
6.5 Classification Analysis – 115

6.5.1 LDA (Linear Discriminant Analysis) – 116
6.5.2 KNN (K-Nearest Neighbor) – 116
6.5.3 SVM (Support Vector Machine) – 116
6.6 Basic Data Processing in R Program – 117

6.6.1 Understanding the Data Structure – 117
Bibliography – 120


https://doi.org/10.1007/978-981-13-1942-6_6
96 Chapter 6 · Gene Expression Data Analysis

The objectives of this chapter are to teach generating DEGs in microarray gene expression
data, extracting a gene cluster of genes with similar patterns of expression, classifying the
observed data using SVM and KNN, and learning the basic syntax of the R program, a useful
tool for genome data analysis.
6.1 Introduction
Differential gene expression analysis generates a list of differentially expressed genes

between an experimental group and a control group. It is the first step of microarray
data analysis. We use T-Test for two-group comparisons and ANOVA for multiple-group
analysis. Both cases must check for false positive results by multiple-hypothesis tests. The
6 genome consists of many genes. If we run a differential gene expression analysis for each
gene, we would have to repeat the same analysis countless times, increasing the chances
of false positive results. False discovery rate (FDR)-controlling procedures are often used
when multiple comparisons are conducted in microarray data analysis of the genome
(Refer to 7 Chap. 5, 7 Sect. 5.4).

Functional cluster analysis is a systemic analyzing method for microarray data. Cluster
analysis is useful for identifying gene clusters that are similarly expressed or tightly co-
regulated according to changes in experimental conditions. It is an exploratory data analy-
sis method that enables the processing of large amount of data because cluster analysis
does not require a hypothesis for data analysis. We can assume that genes with similar
expression patterns share similar functions. However, there are many unknown functions
of genes within a co-regulated gene family. Cluster analysis helps researchers generate
data-driven hypotheses through correlation analysis of unknown genes in complicated
data, saving their time and effort in designing research plans.
Cluster analysis can be a good tool when there is no prior knowledge of the com-
plicated observed data. This method is categorized as unsupervised machine learning in
the field of artificial intelligence. On the other hand, supervised machine learning is use-
ful when there is prior knowledge of the observed data. For example, prior knowledge
has index labels, such as “control group and disease group” or “malignant neoplasm and
benign neoplasm.” It is used in classification analysis, which is utilized in many areas in
medicine, such as diagnosis and prognosis of diseases. . Figure 6.1 is an introduction to

microarray data analysis.
6.2 Prerequisites
55 Installation of the R program: Install the R program according to Sect. B.1 of

Appendix B.
7 http://cran.r-project.org/.

55 Installation of Bioconductor: Install the analysis package of Bioconductor according

to Sect. B.2 of Appendix B.
7 http://www.bioconductor.org/.

55 Download five txt files in the ch06 directory of the Appendix files.
6.2 · Prerequisites
97 6
· Comparison of two groups: T-Test, SAM

Differential expression analysis · Comparison of more than three groups: ANOVA
· Multiple test correlation: FWER, FDR
· Hierarchical clustering
Clustering analysis · K-means clustering
· Self-Organizing Map (SOM) clustering
· Linear Discriminant Analysis (LDA)

Classification analysis · K-Nearest Neighbor (KNN)
· Support Vector Machine (SVM)
.. Fig. 6.1 Introduction to microarray data analysis
zz Description of the data

This exercise uses the data published in Blood by Chiaretti et al. in 2004. The data comprise
the results of a gene expression analysis of acute lymphocytic leukemia (ALL). Analysis was
performed using the HGU95AV2 chip, a microarray developed by Affymetrix. The data
were normalized by the RMA method, and gene expression was calculated by log2 value.
zz Installation of package library

First, we have to import the R packages that are required for this practice. The package
library should be installed before you use R program. The details are described in Appendix
B. All library packages used in this chapter are provided by Bioconductor. Import biocLite
first, and then move on to the installation step.
> source("http://bioconductor.org/biocLite.R") #Import biocLite script
> biocLite("ALL") #Install ALL data set by using biocLite
> biocLite("hgu95av2") #Import microarray annotation database
> biocLite("annotate") #GO annotation list
> biocLite("samr") #SAM (significance Analysis of Microarray) module
> biocLite("som") #SOM clustering module
> biocLite("golubEsets") #The AML/ALL analysis data of Todd Golub
> biocLite("class") #Classification analysis module

> biocLite("e1071") #SVM
> biocLite("MASS") #Data processing module by Affymetrix
> biocLite(“genefilter”)
> library(ALL) #Import ALL data set
> library(hgu95av2)
> library(annotate)
> library(samr)
6
> library(som)
> library(golubEsets)
> library(class)
> library(e1071)
> library(MASS)
> library(genefilter)
※ In Windows version R, there might be an error saying that your library function is not
installed. This kind of error can be resolved by installing the package as described in Sect.
B.2 of Appendix B.
library (sam)
Error in library(sam): there is no package called “sam”
zz Practice R code
The file (input_microarray.txt) used for this exercise is a tab-delimited ASCII file. The
format of this file will be described in more detail in 7 Sect. 5.3 of 7 Chap. 5. Once you

open the input_microarray.txt file, you will find a matrix of 10 columns and 22777 rows.
The execution codes are in process.txt file (. Fig. 6.2). If you need additional practice, you

can refer to Appendix A for basic analysis using the R program. You can find the R codes
for the additional exercises in the R_basics.txt file.
99 6
> fn < - file.choose() #Load data file
> input <- as.matrix(read.table(fn, sep = “\t”, header = T, row.names = 1))
#Import input_microarray.txt file in the form of a data table.
# Find the file using menu, save the handle in fn, and read it using read.table
# Enter the file path if it is known. The following is an example.
> input <- as.matrix(read.table(C:\gda\ch06\input_microarray.txt”, sep = “\t”, header = F))
> head(input) # Return the first 6 rows of input on screen.
.. Fig. 6.2 input_mircoarray.txt file of this exercise is a tab-delimited file

> head(input)
G1.CEL G2.CEL G3.CEL L1.CEL L2.CEL L3.CEL N1.CEL N2.CEL N3.CEL
1007_s_at 10.461447 11.341586 11.023860 9.320934 11.440068 11.309118 10.086800
10.129414 9.966061
1053_at 8.675580 8.267232 8.486979 9.243647 8.732876 9.230934 7.757453
7.886887 8.014026
117_at 7.847506 7.571837 8.113888 7.619820 7.897814 7.285058 7.402672
7.559963 7.583754
121_at 8.300985 7.837589 7.915578 7.747076 7.772876 8.452841 8.407166
8.504255 8.102939
1255_g_at 7.002520 6.970377 7.000953 7.030557 7.061738 7.048979 8.059134
7.563080 7.658484
6 1294_at 8.304798 8.297717 8.010718 7.589431 7.520612 7.834773 7.698972
7.778027 7.783477
The following is the code in microarray.txt. Please refer to this to perform the exercise in
7 Sect. 6.4.

#########################
# Differential Expression Analysis #
#########################
# T-Test
> input_gn = input[ , c(1, 2, 3, 7, 8, 9)] #The group numbers for the t-test are 1
for columns 1 to 3 and 2 for columns 7 to 9
> list_gn <- c(1, 1, 1, 2, 2, 2)
> group_gn <- factor(list_gn)
> t.test(input_gn[1, ] ~ group_gn) #Perform t-test for the gene in the first row of
the data
> fc <- vector() #Produce a vector to save the fold change value
> pval <- vector() #Produce a vector to sabe the p-value
> for (i in 1:nrow(input_gn)) #Perform t-test for all genes
{
rst <- t.test(input_gn[i, ] ~ group_gn)
fc[i] <- diff(rst$estimate) #Save the fold change values in the vector fc.
pval[i] <- rst$p.value #Save the p-values in the vector pval
}
# Volcano Plot
> id <- which(pval < 0.05) #Extract the genes with p-values less than 0.05.
> plot(fc, -log10(pval)) #Plot the fold change and p-values
> points(fc[id], -log10(pval)[id], col = "red") #Distinguish significant
differential genes with red dots.
101 6
# FDR
> fdr <- p.adjust(pval, method = "fdr") #Perform FDR.
> id <- which(fdr < 0.05) #Extract the genes with FDR adjusted p-values less
than 0.05.
> fc[id]
# SAM
> library(samr) #Load the samr R package.
> sam_list <- list(x = input_gn, y = group_gn, logged2 = TRUE) #Load the data
used to perform SAM.
> sam_input <- samr(sam_list, resp.type = "Two class unpaired") #Perform
SAM.
> delta.table <- samr.compute.delta.table(sam_input) #Caculate the delta
values.
> samr.plot(sam_input, 0.4) #Draw the SAM plot.
# ANOVA
> list <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) #Determine the groups for comparison.
> group <- factor(list)
> anova(lm(input[1, ] ~ group)) #Perform ANOVA for the gene in the first row of the data.
> an_pval <- vector() #Produce vector to save p-values.
> for (i in 1:nrow(input)) #Perform ANOVA for all genes in the data.
{ an_pval[i] <- anova(lm(input[i, ] ~ group))$Pr[1] #Save the p-values

in the vector, an_pval
}
> fwer_p <- p.adjust(an_pval, method = "bonferroni") #adjust p-values using FWER
> id <- which(fwer_p < 0.05) #Extract genes with Bonferroni-adjusted p-value less than 0.05.
> an_pval[id]
################
# Clustering Analysis #
################
# Hierarchical clustering
> d <-as.data.frame(input)
> h <- dist(t(d)) #Obtains the distance matirix
> plot(hclust(h)) #Plot the dendrogram
# Heatmap
> heatmap(input[1:1000, ]) #Plot a heatmap for 1000 genes.
6 # K-means clustering
> k <- kmeans(input, center = 5) #Produce 5 clusters using the k-means algorithms
# SOM clustering
> library(som) #Load the “som“ R package
> som_rst <- som(log(input), xdim = 3, ydim = 3) #Perform SOM clustering
> plot(som_rst) #Plot the SOM cluster graph
#################
# Classifiation Analysis #
#################
# LDA
> library(MASS)
> train_set <- read.table("C:\gda\ch06\train.txt", sep = " \t", header = F)
#Load the training set.
>test_set < - read.table("C:\gda \ch06 \test.txt", sep = " \t", header = F)
#Load the test set.
> label_tr <- train_set[ , 7130] # training set label
> expr_tr <- train_set[ , 1:10] # training set expression data
> lda(label_tr ~ ., data = expr_tr)

6.3 · Differential Gene Expression Analysis
103 6
# KNN
> library(class) #Load the “class“ R package.
> label_tst <- train_set[ , 7130] #test set label
> expr_tst <- test_set[ , 1:10] #test set expression data
> rst_knn <- knn(expr_tr, expr_tst, label_tr, k = 3)
> table(rst_knn, label_tst)
# SVM
> library(e1071) #Load the “e1071“ R package
> s <- svm(expr_tr, label_tr)
> pr <-predict(s, expr_tr) # prediction model
> table(true = label_tr, pred = pr)
6.3 Differential Gene Expression Analysis
In this section, we will learn to find the list of genes that are differentially expressed in
microarray data by using T-Test, SAM, and ANOVA. T-Test analyzes the mean and distri-
bution of experimental and control groups under the null hypothesis. ANOVA uses F-test
and calculates the likelihood of finding differences in the means between more than two
groups, assuming the truth of null hypothesis. Both T-Test and ANOVA assume a normal
distribution of the samples.
Distribution analysis is used for the comparison of more than three groups. When the
null hypothesis is rejected, it is used to find out the groups that are significantly differ-
ent from each other by sample means (post hoc analysis). Multi-hypothesis test or mul-
tiple comparisons is a statistical analysis method that encompasses simultaneous testing
of multiple hypotheses. In order to correct the false positive error rate associated with
performing multiple statistical tests, a statistical test method must be chosen after deter-
mining the type of error which needs to be corrected. The family-wise error rate (FWER)
has less chance of having type 1 error, because it is a compensation method that tests
significant levels of multiple hypotheses simultaneously. However, it has lower statistical
power, because the type 1 error is minimized. FDR is an alternative method that provides
less stringent control of type I errors and provides a greater statistical power compared to
FWER. FDR controls the rate of false positives among the results that are determined to
be significant. (Please refer to 7 Sect. 5.4 of 7 Chap. 5).

zz Creating sample data
> library(ALL) #Import ALL package
> data(ALL) #Import ALL data set
> pdat <- pData(ALL) #Import phenotype and metadata from ALL and assign them to pdat
> subset <- which(as.character(pdat$mol.biol) %in% c(“BCR/ABL”, “NEG”))
#Extract the indexes of which $mol.biol values are “BCR/ABL” and “NEG” in ALL.
> eset <- ALL[ , subset] #Extract the subset of ALL, the extracted indexes.
6
> emat <- data.matrix(exprs(eset)) #Format the data set into the gene
expression matrix
The analysis will start with the generated gene expression matrix.
6.3.1 nalysis of Significant Differential Expression in Two-Group

A
Comparisons: T-Test and SAM
6.3.1.1 T-Test
T-Test compares two sets of data. We can identify the genes that have significant differ-
ential expression by running T-Test. In previously generated data, we can define BCR/
ABL (BCR/ABL fusion gene) as an experimental group and NEG (cytogenetically nor-
mal) as a control group. We will practice generating a volcano plot for the genes selected
as significantly differentially expressed by T-Test and create a distribution graph using
fold-changes and p-values.
In order to run a t-test, we need to look up the index labels saved in eset. These index
labels are designated mol.biol, which is the subset of ALL. We should define “BCR/ABL”
and “NEG” as two comparison groups.
> eset$mol.biol #Output index labels
> labels <- as.numeric(eset$mol.biol == "BCR/ABL") #Convert ”BCR/ABL” group to 1 and

the rest to 0.
> group <- factor(labels)

105 6
Perform t-test on the first gene (m3[1,]) of the matrix.
> t.test(emat[1, ] ~ group)
> t.test(emat[1,] ~ group)

Welch Two Sample t-test
data: emat[1, ] by group

t = 0.82105, df = 89.314, p-value = 0.4138
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.05693514 0.13712876
sample estimates:
mean in group 0 mean in group 1
7.578451 7.538354
※ The syntax is composed of t-test (a gene expressions profile ~ group).

The previous example performed t-test against one gene. If you want to run T-Test
against all genes, you can use for-loop, as in the given example below.
> ttest.p <- vector() #Generate a vector variable that can take p-values.
> ttest.est <- vector() #Generate a vector variable that can take fold-changes.
> for (i in 1:nrow(emat)) #Run for-loop as much as the number of m3 rows.
{
calc <- t.test(emat[i, ] ~ group) #Perform t-test against gene i and save
the results in the variable tmp.
ttest.p[i] <- calc$p.value #Calculate p-value and save the result in the
ttest.p vector.
ttest.est[i] <- diff(calc$estimate) #Calculate fold-changes and save the

value in the ttest.est vector.
We now have generated T-Test results using for-loop and saved p-values and fold-
changes in the ttest.p and ttest.est. vectors, respectively. In order to generate a distribution
graph of fold-changes and p-values, we need to draw a volcano plot using all listed genes
first and then mark the indexes of significant differential genes in red (. Fig. 6.3).

Volcano plot for differential expression of BCR/ABL vs. NEG
7
6
5
–log10(ttest.p)
4
6
3
2
1
0
−2 −1 0 1 2 3
ttest.est
.. Fig. 6.3 Volcano plot
> index <-which(ttest.p < 0.05) #Extract the genes with p-values less than 0.05.
> ttest.est[index] #Extract the t-test results on the extracted gene from the previous step.
> plot(ttest.est, -log10(ttest.p), pch = 20, cex = 0.3, ylim = c(0, 7), main =
"Volcano plot for differential expression of BCR/ABL vs. NEG") #Draw a volcano plot
> points(ttest.est[index], -log10(ttest.p)[index], col = "red", cex = 0.4, pch = 16)
#Distinguish significant differential genes with red dots.
zz Correction of multiple comparison hypothesis test

In microarray experiments, due to the statistical examinations of many gene expres-
sion profiles are performed simultaneously, we encounter multiple comparisons. Such a
problem increases the risk of obtaining false positive errors because each examination
107 6
cannot fulfill the requirements of an independent hypothesis. The false positive error rates
increase as the numbers of examinations rises with a greater number of genes.
In order to correct false positive errors in the problem of multiple comparisons, we
can use two methods; FDR, and FWER. The main difference in these two methods are
the hypotheses that are applied. FWER controls the probability of making any single false
positive error among the selected significant genes. FDR controls the number of false
positive errors in proportion to the true positives.
* FDR < 0.05 example
> pfdr <- p.adjust(ttest.p, method = "fdr") #Calculate FDR.
> index <- which(pfdr < 0.05) #Extract gene indexes with pfdr < 0.05.
> ttest.est[index] #Output t-test results.
* FWER < 0.05 example
> pfwer <- p.adjust(ttest.p, method = "bonferroni") #Calculate FWER using
Bonferroni method.
> index <- which(pfwer < 0.05) #Extract gene indexes with pfwer < 0.05.
> ttest.est[index] #Output t-test results.
So far, we have completed a basic differential analysis. The next step is to annotate the
selected genes. The following example helps you practice to print the extracted genes with
gene symbols and expression profiles.
> library(hgu95av2) #Import Affymetrix HGU95 package
> gene = featureNames(ALL)[index] #Extract gene names of the extracted genes.
> genesymbols <- mget(gene[!is.na(gene)], hgu95av2SYMBOL) #Convert the
gene names to gene symbols.
> deg <- cbind(emat, pfwer, pfdr)[index, ] #Extract the expression profile data,pfwer,
and pfdr.
> rownames(deg) <- genesymbols
> head(deg) #Return the gene symbols and gene expression of selected rows.
> head(deg)
01005 01010 03002 04007 04008 04010
DDR1 7.623562 7.543604 7.916954 7.516981 7.726716 7.288960
CD19 9.812015 10.003141 9.164655 9.876270 9.032565 10.114688
TRD@ 5.162924 5.966752 5.987986 5.201913 7.268455 5.310706
TNK2 8.654172 8.443706 8.521821 8.466544 7.571240 7.638699
ITGAE 6.117544 5.494834 6.588648 6.523782 6.008995 6.650002
GAB1 4.441590 5.095521 3.851317 4.814972 3.799737 4.844462
04016 06002 08001 08011 08012 08024
DDR1 7.724153 6.763490 7.543907 7.784205 7.041890 6.769948
CD19 8.917097 7.823291 10.391308 9.442780 10.449013 9.621440
TRD@ 5.571119 5.656040 6.109881 5.453425 5.302523 6.412112
6 TNK2 8.168028 8.011405 8.795978 8.932922 8.564040 7.312690
ITGAE 8.213070 6.427095 5.995374 7.109062 5.895099 6.734095
GAB1 3.007319 3.014779 4.945371 5.922863 3.718701 3.678384
6.3.1.2 Significant Analysis of Microarrays (SAM)

When the gene expression level has a very small variance, there is a risk of obtaining the
wrong conclusion because the mean level of gene expression is not significantly different
between two groups, while the statistical significance is still very high. SAM offers corrected
t-test results using a fudge factor in order to resolve the problems resulting from a low vari-
ance estimate. SAM uses non-parametric statistics based on repetitive permutation analysis.
The leading advantages of SAM are that data may not have to follow a normal distribution
as well as disregarding assumptions about the distribution of individual genes. SAM uses
FDR method to correct for the problem of multiple comparisons problem (. Fig. 6.4).

> library(samr) #Import the package.
> BAlabels <- as.numeric(eset$mol == "BCR/ABL") #Label the two groups.
> label <- sub("0", "2", BAlabels) #Assign numbers to the two groups.
> a.sam <- list(x = exprs(eset), y = label, logged2 = TRUE) #Input matrix of SAM
> samr.obj <- samr(a.sam, resp.type = "Two class unpaired", nperm = 100)
#Implement SAM
> delta.table <- samr.compute.delta.table(samr.obj) #Generate delta table.
> delta <- 0.4 #Assign the values of delta
> samr.plot(samr.obj, delta) #Draw SAM plot (Fig. 6.4)
> siggenes.table <- samr.compute.siggenes.table(samr.obj, delta, a.sam,delta.table)
#A table of significant genes

109 6
5
observed score
−5 0
−3 −2 −1 0 1 2 3
expected score
.. Fig. 6.4 SAM plot
6.3.2 nalysis of Significant Differential Expression

A
in Comparisons of More than Three Groups: ANOVA
Variance analysis tests a hypothesis using F-distribution for the analysis of more than three
groups. F-distribution tests the variance within each group, group means, and distribution
comparisons based on the variance in group means between groups. This section practices
how to extract statistically significant genes between groups: BCR/ABL, NEG, and ALL/AF4.1
First, extract three groups for analysis.
> subset <- which(as.character(pdat$mol.biol) %in% c("BCR/ABL", "NEG", "ALL1/AF4"))
#Extract samples in three groups that satisfy “BCR/ABL,” “NEG,”
“ALL1/AF4” from ALL data
> eset <- ALL[, subset] #Extract gene expression profile data
> emat <- data.matrix(exprs(eset)) #Save the gene expression profile in matrix form to m3.
1 BCR/ABL fusion gene, cytogenetically normal, and ALL1/AF4 fusion gene.

Label the extracted data and divide the data into three groups.
> BAlabels <- as.numeric(eset$mol.biol) #Convert “BCR/ABL,” “NEG,”
“ALL1/AF4” to numbers.
> BAgroup <- factor(BAlabels) #Split them into three categories
Perform ANOVA analysis on three groups for the first gene of the matrix.
> anova(lm(emat[1, ] ~ BAgroup)) #sample test
> anova(lm(emat[1, ] ~ BAgroup))

6 Analysis of Variance Table
Response: emat[1, ]
Df Sum Sq Mean Sq F value Pr(>F)
BAgroup 2 0.3504 0.175220 2.6018 0.07839 .
Residuals 118 7.9468 0.067346
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Perform ANOVA analysis for all listed genes.
> anova.pvalue <-matrix(, nrow(emat), 1) #Generate vector that saves P-values.
> for(i in 1:nrow(emat)) #Repeat analysis for the assigned rows.
anova.pvalue[i, ] <-anova(lm(emat[i, ] ~ BAgroup))[1, 5] #Obtain P-values
> head(anova.pvalue) #Print a part of ANOVA results.
zz Correction of multiple comparison hypothesis tests

ANOVA also requires correction of errors of multiple comparisons because it repeats the
tests for a number of genes simultaneously. This section practices corrections using FDR
and FWER.
zz FDR < 0.05 example
> pfdr <- p.adjust(anova.pvalue, method = "fdr") #Calculate FDR
> index <- which(pfdr < 0.05) #Extract gene indexes with FDR < 0.05
> anova.pvalue[index] #Extract p-value for each gene.

6.4 · Clustering Analysis
111 6
zz FWER <0.05 example
> pfwer <- p.adjust(anova.pvalue, method = "bonferroni") #Calculate the p-value by

Bonferroni method.
> index <- which (pfwer < 0.05) #Obtain the index with pfwer < 0.05.
> anova.pvalue[index] #Obtain the p-value.
Extract and print the gene symbols of the differentially expressed genes.
> gene = featureNames(ALL)[index] #Extract gene symbols with corrected p-value < 0.05
> genesymbols <- mget(gene[!is.na(gene)], hgu95av2SYMBOL) #Import gene symbols.
> deg <- emat[index, ] #Save the differentially expressed genes to variable.
> rownames(deg) <- genesymbols #Set row names with gene symbol
> head(deg)
> head(deg)
01005 01010 03002 04006 04007 04008 04010
DDR1 7.623562 7.543604 7.916954 6.816397 7.516981 7.726716 7.288960
HIF1A 8.063428 6.740539 8.126621 5.739973 8.533088 8.232524 6.922755
CD19 9.812015 10.003141 9.164655 7.474483 9.876270 9.032565 10.114688
CD44 6.875240 6.315132 6.730446 9.338491 7.282684 8.907688 6.781251
TNK2 8.654172 8.443706 8.521821 7.854585 8.466544 7.571240 7.638699
ITGAE 6.117544 5.494834 6.588648 8.039748 6.523782 6.008995 6.650002
04016 06002 08001 08011 08012 08024 09008
DDR1 7.724153 6.763490 7.543907 7.784205 7.041890 6.769948 8.709475
HIF1A 7.815199 8.510725 8.126982 7.717734 8.513018 6.930541 7.440068
CD19 8.917097 7.823291 10.391308 9.442780 10.449013 9.621440 10.618220
CD44 7.953812 8.697630 7.826722 7.420834 5.942588 7.951445 5.802265
TNK2 8.168028 8.011405 8.795978 8.932922 8.564040 7.312690 8.848862
ITGAE 8.213070 6.427095 5.995374 7.109062 5.895099 6.734095 5.878245
6.4 Clustering Analysis
Obtain a similar gene expression profile using clustering analysis. This section practices
hierarchical clustering, K-means clustering, and SOM.
6.4.1 Hierarchical Clustering
We perform hierarchical clustering analysis on the significant differentially expressed

genes that have p-values less than 0.05 with Bonferroni correction after ANOVA analysis.
We can use the previously obtained list of DEGs as a result of the ANOVA analysis for
Cluster dendrogram
35
30
25
Height
6
20
57001 04016
08024
15
65005
49006
28024
11005
08011 27004
24001 64002
36002
31007 63001
48001
28005
64001
19017
28001
28006
24017
08012
4300728036
10005
22013 20002
25003
22010
84004
03002
10
28035
28044
09017
24008 16009
09002
28019
68001
11002 28008
22011
09008
62003
12026
37001 12008
2402262001
04010
28023
26001
28047
L AL19014
20005
04007
43001
04018
43006
43015
28007
22009
31011
15005
28028
28032
01010
28037
24005
24011
68003
08001
28009
12019
27003
6400519005
5600783001
04006
26008
04008
06002
28042
43012
33005
28031
01003
65003
28043
43004
15004
16004
01005
12007
15006
26009
24018
25006
18001 4
28021
30001
16002
44001
37013
16007
49004
24010
12006
31015
15001
26003
62002
02020
17003
5
24006
12012
14016
19002
19008 h
hclust (*, "complete")
.. Fig. 6.5 The result of hierarchical clustering analysis was visualized in a dendrogram
this hierarchical clustering analysis. . Figure 6.5 shows a dendrogram, the result of the

hierarchical clustering analysis.
> d <- as.data.frame(deg) #Save the differentially expressed genes in data frame
> h <- dist(t(d)) #Perform clustering analysis.
> plot(hclust(h)) #Visualize the results in a dendrogram.
The result of hierarchical clustering analysis can be displayed in a heat map. (. Fig. 6.6).
> heatmap(deg, col = c("black", "yellow"), xlab = "Samples", ylab = "Genes",

main = "Heatmap")
6.4 · Clustering Analysis
113 6
Heatmap
TNFRSF14
TMC6
FCGRT
PTPN18
TNK2
EDEM1
NA
PTP4A3
DPEP1
TNK2
ITGA5
ABL1
FYN
FYN
LSS
NUP93
KIAA0195
SPN
POU4F1
GPR56
SV2A
GYPC
CSTB
MAGED1
TMEM165
IGFBP7
PLXNB2
PPM1F
DENND3
TUBA4A
CDKN1A
HLA-DMA
HLA-DMB
CD79B
CD19
HLA-DQB1
HLA-DQB1
HLA-DRB5
CD52
SOCS2
FHL1
LILRA2
KLF9
C1orf38
NA
ENG
DDR1
ZNF467
DDR1
ITPR1
CDS2
HIF1A
EVI2B
ACTN1
SLC2A5
ALOX5
DUSP6
CCND2
FOXO1
NRIP1
CD24
ITGA6
SMAD1
QPRT
TBXA2R
C7orf44
PPP2R5C
PPP2R5C
WDR45L
STX1A
C15orf39
SLC5A6
PTGES3
DAD1
RGS16
DBN1
FADS1
MAST4
ICAM3
LSP1
DSTN
NUCB2
SPINK2
ATP9A
PRKCQ
GATA3
FUT4
DIAPH2
NA
UCK2
CD72
MAST3
DTX4
PMS2L3
PMS2L8
PMS2L1
LARGE
FARP1
KLRK1
PMS2L1
TTC28
MX2
ITGAE
PKIG
PPP2R5C
PPP2R5C
CSRP2
CD44
CD44
CD44
LGALS1
SNX2
PPP3CC
CYFIP1
SGSH
STX3
ECM1
S100A13
IGFBP4
RGL1
Genes
COLQ
CASP10
OPTN
ACTN1
MUC4
NPR1
PDXK
UBE2E3
tcag7.1314
CD27
F13A1
GSN
SCHIP1
IGJ
CTGF
DIP2C
RAPGEF3
MYLK
NCF2
SAMD4A
MMRN1
NA
CA6
GAB1
KHDRBS3
PON2
EPS8
ALDH1A1
EMP1
GAB1
ACVR2A
CAST
CAST
MTSS1
ADCY9
MAP3K5
NA
GALC
PFTK1
P2RY14
F3
MARCKS
SEMA6A
CNN3
CTDSPL
SPARC
NPC1
TLE4
ZNF43
WSB2
CENTD1
HBS1L
CDC42EP3
IL6ST
PSD3
NT5E
GNAI1
NRXN3
XPA
EMP2
XPA
OPTN
OPTN
YES1
MYO1B
ENPP2
EP400
PIK3C3
HOXA9
MINA
ANGPT1
AMH
PGM1
PPM1H
PALLD
GNA12
CD44
CSPG4
PLD1
RECK
EBI2
GPM6B
IGF2BP3
HOXA4
ADRB2
LOC171220
WT1
THSD7A
RNASE3
TBC1D8
ITGB1BP1
VLDLR
CCNA1
HOXA9
HOXA5
PROM1
MRPL33
WIPI1
CAP2
DDEF2
ANGPT1
PDE4DIP
MDM2
PTHR1
DPYSL3
HOXA10
MEIS1
RAB40B
PRSS12
LILRA1
HOXA4
COBLL1
DLC1
FADS1
KANK1
RHOBTB3
LCK
LCK
CD3D
CD2AP
ZEB1
ITM2A
SH2D1A
ABL1
ABL1
LST1
C6orf32
STMN1
CBX3
CYFIP2
MRPL33
ID3
PSAP
PKM2
GPX1
CD97
MME
IFITM1
HLA-DPB1
HLA-DPA1
HLA-DRA
IGHM
HLA-A
IFITM1
XBP1
08024
04006
26008
63001
28028
28032
31007
24005
19005
16004
15004
19017
28008
31015
11002
10005
43015
02020
17003
18001
19014
20005
09002
01003
44001
16002
16007
49004
04018
43006
15006
26009
28009
37001
19002
12008
56007
19008
83001
65003
24006
64005
65005
20002
22013
14016
12012
03002
84004
49006
11005
27004
36002
68003
08001
43001
12006
24010
04016
57001
64002
22010
12026
62003
26001
28047
08012
22011
24001
28024
28023
04010
15001
24008
16009
06002
04008
28042
43012
28031
33005
22009
28007
68001
25006
24018
24017
28035
28044
48001
25003
12019
43007
28001
28006
28036
09017
27003
26003
62002
28005
64001
08011
24011
12007
01005
30001
28021
28019
01010
28037
04007
43004
28043
09008
31011
15005
62001
37013
24022
LAL4
Samples
.. Fig. 6.6 Heatmap
55 col : parameter that assigns colors

55 xlab: the title of the x axis
55 ylab: the title of the y axis
55 main: the title of the plot
6.4.2 K-Means Clustering

After ANOVA analysis, K-means clustering analysis is used for the significantly differen-
tially expressed genes that have p-values less than 0.05 with Bonferroni correction.
> kmeans.row <- kmeans(deg, centers = 5, iter.max = 1000) #K-means analysis (K = 5)
> kmeans.row$cluster[1:10] #Return 10 gene clusters.

n=47 n=20 n=14 n=42

3
0
−3
n=57 n=23 n=16 n=46

6
3
0
−3
0 1 2 3 4
.. Fig. 6.7 The results of SOM clustering
> kmeans.row$cluster[1:10]
DDR1 HIF1A CD19 CD44 TNK2 ITGAE GAB1 XPA XPA CASP10
3 3 5 3 5 2 4 2 2 2
6.4.3 SOM Clustering
SOM clustering analysis can also be used for the significantly differentially expressed
genes that have p-values less than 0.05 with Bonferroni correction after ANOVA analysis.
SOM displays similar clusters closer to each other (. Fig. 6.7).

> library(som) #Import the package.
> deg.som <- som(log(deg), xdim = 4, ydim = 2) #Perform SOM analysis.
> plot(deg.som) #Visualize the results.

6.5 · Classification Analysis
115 6
Biological characteristics of data
Pre-analysis of the
Generation of Prediction of
Data matrix and data and analysis of
classifier classification
response variables differential
(Training set) (Test set)
expression
Adjusted False classification rates &

parameter True classification rates
.. Fig. 6.8 The Process of Classification Analysis
6.5 Classification Analysis
Classification analysis deals with unknown samples and assigns predefined index labels to
the samples. Clustering analysis, a type of unsupervised learning, groups the genes based
on similar gene expression without additional information. However, classification analy-
sis, a type of supervised learning, is used when we have a certain amount of labeled training
data. Based on the training data, classification analysis generates classifiers and infers new
data classifications. The diagram of steps of classification analysis is described in . Fig. 6.8.
For classification analysis, the data have to be divided into a training set and test set.
This section will use golubEsets data, which is in an R package. The data is consisted of
the following items.
- Training set (Golub_Train): Total 38 individuals (ALL 27; AML 11), 7129 genes
- Test set (Golub_Test): Total 34 individuals (ALL 20; AML 14), 7129 genes
- Merge set (Golub_Merge): Total 72 individuals (ALL 47; AML 25), 7129 genes
Divide the data into a training set and test set.
> library(golubEsets) #Import the data package
> library(MASS)
> data(Golub_Train) #Import the training data set.
> data(Golub_Test) #Import the test data set.
> trainexpr <- exprs(Golub_Train) #Import the training data set.
> testexpr <- exprs(Golub_Test) #Import the test data set.

6.5.1 LDA (Linear Discriminant Analysis)
> golub.traindf <- data.frame(t(trainexpr), ALL.AML = pData(Golub_Train)$ALL.AML)
#training set
> golub.testdf <- data.frame(t(testexpr), ALL.AML = pData(Golub_Test)$ALL.AML)
#test set
> golub.lda <- lda(ALL.AML ~ ., data = golub.traindf) #LDA
6
6.5.2 KNN (K-Nearest Neighbor)
> library(class)
> golub.knn <- knn(t(trainexpr), t(testexpr), pData(Golub_Train)$ALL.AML, k = 3)
> table(golub.knn, Golub_Test$ALL.AML)
> table(golub.knn, Golub_Test$ALL.AML)

golub.knn ALL AML
ALL 20 4
AML 0 10
6.5.3 SVM (Support Vector Machine)
> library(e1071)
> golub.svm <- svm(t(trainexpr), Golub_Train$ALL.AML)
> predicted <- predict(golub.svm, t(trainexpr))
> table(true = Golub_Train$ALL.AML, pred = predicted)
> table(true = Golub_Train$ALL.AML, pred = predicted)

pred
true ALL AML
ALL 27 0
AML 0 11
6.6 · Basic Data Processing in R Program
117 6
6.6 Basic Data Processing in R Program
In this section, we will practice additional simple data analysis using R program for begin-
ners. Easier examples using R program are provided in the basic data analysis practices of
Appendix A. R program is an essential part of genome data analysis, and it is very impor-
tant to have a basic understanding of the program.
6.6.1 Understanding the Data Structure
In order to obtain accurate differential gene expression data, it is crucial to obtain a good
understanding of the data frame that we will use for our practice.
6.6.1.1 Confirmation of Gene Expression Values
> data(ALL) #Import the data.
> exprs(ALL) #Return gene expression values for each gene.
> head(exprs(ALL)) #Return the first six rows in the matrix.
> head(exprs(ALL))
01005 01010 03002 04006 04007 04008 04010
1000_at 7.597323 7.479445 7.567593 7.384684 7.905312 7.065914 7.474537
1001_at 5.046194 4.932537 4.799294 4.922627 4.844565 5.147762 5.122518
1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923 3.945869 4.150506
1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997 6.208061 6.292713
1004_at 5.925260 5.912780 5.893209 6.1702 45 5.615210 5.923487 6.046607
1005_at 8.570990 10.428299 9.616713 9.937155 9.983809 10.063484 10.662059
04016 06002 08001 08011 08012 08018 08024
1000_at 7.536119 7.183331 7.735545 7.591498 7.824284 7.231814 7.879988
1001_at 5.016132 5.288943 4.633217 4.583148 4.685951 5.059300 4.830464
1002_f_at 3.576360 3.900935 3.630190 3.609112 3.902139 3.804705 3.862914
1003_s_at 5.665991 5.842326 5.875375 5.733157 5.762857 5.770914 6.079410
1004_at 5.738218 5.994515 5.748350 5.922568 5.679899 6.044520 6.057632
1005_at 11.269115 8.812869 10.165159 9.381072 8.227970 7.627248 7.667445
6.6.1.2 Confirmation of Experimental Design of Data
> pdat <- pData(ALL) #Confirm phenotype data
> setwd("C:/gda") #Set a route to save the data that will be generated.
> write.csv(pdat) #Write the generated data in a file as a csv format.

.. Fig. 6.9 The structure of the ALL data
After reviewing the phenotype data above, we can design the experimental conditions
for analysis.
6.6.1.3 Creating a Sample Table

In order to identify significant genes, we need to create a table that consists necessary data.
Since the required data for this example are BCR/ABL fusion gene (BCR/ABL) and cyto-
genetically normal (NEG), we will generate a table with these samples only (. Fig. 6.9).

> pdat <- pData(ALL) #Save the phenotype data to variables.
> table(pdat$mol) #Generate a table using phenotype data.
> subset <- which(as.character(pdat$mol) %in% c("BCR/ABL", "NEG"))
> eset <- ALL[ , subset] #Extract gene expression data that have “BCR/ABL,” “NEG” labels.
> table(eset$mol) #Generate a table of gene expression data.
> table(eset$mol)
ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16

0 37 0 74 0 0
6.6 · Basic Data Processing in R Program
119 6
6.6.1.4 Removing Genes with Low Expression
Depending on the purpose of analysis, genes with low expression values can be removed
before further analysis. We can apply various filtering strategies, depending on the pur-
pose of the analysis.
> library(genefilter)
> f1 <- pOverA(0.25, log2(100))
> f2 <- function(x) (IQR(x) > 0.5)
> ff <- filterfun(f1, f2)
> selected <- genefilter(eset, ff)
> head(selected)
> sum(selected)
> esetSub <- eset[selected, ]
> head(esetSub)
Exercises
[Exercise 1] - Perform the following procedures using Affymetrix GeneChip.
1.1. Install affy and affydata packages for analysis
1.2. Load the data.
1.3. Load the CEL image.
1.4. Draw a histogram.
1.5. Perform normalization.
1.6. Draw a box plot using the obtained data from the previous steps.
1.7. Draw a scatter plot.
1.8. Run T-test.
[Exercise 2] - This exercise is to practice clustering analysis. (Run the following codes first to solve this
problem.)
> source(“http://bioconductor.org/biocLite.R”)
> biocLite(“golubEsets”)
> library(Biobase)
> library(golubEsets)
> data(Golub_Train)
> golub.expr <- exprs(Golub_Train)
> golub.pData <- pData(Golub_Train)$ALL.AML

2.1. Using Euclidean distance, perform hierarchical clustering analysis.

2.2. Perform K-means clustering analysis in 5 groups.
2.3. Perform 4X2 SOM clustering.
Take Home Message

55 What the concept of differential expression genes (DEGs) is
55 How to cluster and classify gene expression data.
Bibliography
1. Chiaretti S et al (2004) Gene expression profile of adult T-cell acute lymphocytic leukemia identifies
6 distinct subsets of patients with different response to therapy and survival. Blood 103(7):2771–2778
2. Eisen MB et al (1998) Cluster analysis and display of genome-wide expression patterns. PNAS
95(25):14863–14868
3. Gentleman R (2016) Annotate: annotation for microarrays. R package version 1.52.1
4. Gentleman R, Carey V, Huber W, Hahne F (2016) genefilter: genefilter: methods for filtering genes from
high-throughput experiments. R package version 1.56.0
5. Golub T (2016) golubEsets: exprSets for golub leukemia data. R package version 1.16.0
6. Huber W et al (2002) Variance stabilization applied to microarray data calibration and to the quantifi-
cation of differential expression. Bioinformatics 18(suppl 1):S96–S104
7. Li X (2009) ALL: a data package. R package version 1.16.0
8. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2017) e1071: misc functions of the depart-
ment of statistics, probability theory group (formerly: E1071), TU Wien. R package version 1.6-8.
https://CRAN.R-project.org/package=e1071
9. Tamayo P et al (1999) Interpreting patterns of gene expression with self-organizing maps: methods
and application to hematopoietic differentiation. PNAS 96(6):2907–2912
10. Tibshirani R, Chu G, Narasimhan B, Li J (2011) samr: SAM: Significance analysis of microarrays. R pack-
age version 2.0. https://CRAN.R-project.org/package=samr
11. Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York. ISBN
0-387-95457-0
12. Yan J (2016) som: Self-Organizing Map. R package version 0.3-5.1. https://CRAN.R-project.org/

package=som
121 7
Gene Ontology
and Biological
Pathway-Based Analysis
7.3 Dataset and Biological Interpretation Tools – 124

7.3.1 Generate Example Data Set – 125
7.3.2 Biological Interpretation Tools – 128
7.4 DAVID – 129
7.5 ArrayXPath – 131
7.6 BioLattice – 131


https://doi.org/10.1007/978-981-13-1942-6_7
122 Chapter 7 · Gene Ontology and Biological Pathway-Based Analysis

In this chapter, we aim to learn and digest the usage of various tools in order to analyze and
interpret the biological meanings behind the genomic data. Analysis requires one to have a
thorough understanding of basic biology. We will go over gene sets used to interpret data
as well as analyzing data. DAVID, ArrayXPath are two apparatuses used to gather fundamen-
tal biological interpretation using gene sets given. BioLattice is also designed to analyze the
results of the data given.
7.1 Introduction
Biological interpretation of massive genome experimental data begins with functional

analysis of the genes obtained by differential gene expression or clustering analyses. The
first step of analysis uncovers the hidden patterns and tendencies in the data. The next
step is biological data interpretation, which tries to find explanations for the identified
7 gene expression patterns and tendencies using biological knowledge which has already
been established. . Figure 7.1 shows the data interpretation workflow for results of a dif-

ferential expression analysis.
.. Fig. 7.1 Analysis and

interpretation of massive
genome data
Design and run an experiment
Preprocess and normalize the data
Perform data analysis

(i.e. differential gene expression analysis and
cluster analysis)
Gene set
(i.e. differential expression group and cluster)
Biological interpretation
7.1 · Introduction
123 7
.. Table 7.1 2x2 Contingency Table
GO (−) GO (+)
Not Significant 5830 90 5920
Significant 70 10 80
5900 100 6000
In this chapter, we will practice the biological interpretation of differentially expressed

genes identified from genome data. We will learn to apply biomolecular knowledge includ-
ing GO (Gene Ontology) and biological pathway analysis. The examples in this chapter
are aimed towards web-based analysis tools. 7 Chapter 12 will cover more examples with

locally executable software.

As discussed in 7 Chap. 5 7 Sect. 5.5, the most common approach for biological

interpretation of a given gene list is Over-Representation Analysis (ORA) using a hyper-

geometric distribution. For example, we obtained a total of 80 significantly differentially
expressed genes after analyzing gene expression in yeast, which has 6000 genes. If ten
genes among those 80 belong to the “DNA replication” GO term, is that term statistically
significant? If we assume that out of all 6000 genes, 100 of them have “DNA replication”
GO annotations, we can generate the following 2X2 contingency table (. Table 7.1):

The probability of obtaining this table by sampling without replacement is calculated

by a denominator, the number of cases of sampling 80 genes among 6000 genes without an
order, and a numerator, the number of cases of sampling 10 genes out of GO(+) 100 genes
and sampling 70 genes out of GO(−) 5900 genes.
> (choose(100, 10) * choose(5900, 70)) / choose(6000, 80)
[1] 5.935335e-07
The hypergeometric test used in ORA is the same as a one-tailed Fisher’s exact test. In
other words, the result of ORA is the probability of obtaining 10 or more than 10 genes
with the GO annotation. The formula is as follows:
æ M öæ N - M ö
k ç ÷ç
k ÷
Y n- y ø
P ( X £ k ) = åh ( y|N ;M ;n ) = å è ø è
y =0 y =0 æNö
ç ÷
ènø
Using for loop, we can calculate an aggregate value from 10 to 80 genes.

> a <- 0
> for (i in 10:80)
a <- a + (choose(100, i) * choose(5900, 80 - i)) / choose(6000, 80)
>a
[1] 6.57381e-07
In conclusion, the observation of ten “DNA replication” GO annotated genes among

7 80 differentially expressed yeast genes is a significant overrepresentation. We can apply
this type of analysis to other meaningful gene sets, including biological pathways.
7.2 Prerequisites
The details of R program installation are described in Appendix B, Sect. B.1; R package
installation examples are given in Appendix B, Sect. B.2. Appendix A provides the tuto-
rial for basic usage of R. The examples in this section require R packages which can be
installed as follows:
>source("http://bioconoductor.org/biocLite.R")
>biocLite("Biobase") #base module of various analysis module of R
>biocLite("limma") #linear model analysis module of microarray data
>biocLite("ALL") #Acute Lymphoblastic Leukemia data
* Installation of R library to generate data set

The first command, above, sets the storage bin and method used when downloading the
module. The second command installs a common module for basic bioinformatics analysis.
The third command installs limma, which provides integrative analysis of linear models to
assess differential expression in the context of multifactor experiments. The last command
installs a public gene expression microarray dataset on Acute Lymphoblastic Leukemia (ALL).
7.3 Dataset and Biological Interpretation Tools
We will use the public ALL data set in examples for this section, and easily accessed
web-based biological interpretation tools. This section will also review each knowledge
resource.
7.3 · Dataset and Biological Interpretation Tools
125 7
.. Table 7.2 The structure and content of the ALL dataset
Title Acute Lymphoblastic Leukemia
Feature Probes 12,625
Samples 128
Platform AffyMetrix HGU95Av2
Type and stage of the B B1 B2 B3 B4 T T1 T2 T3 T4

cancer
5 19 36 23 12 5 1 15 10 2
Molecular class of the ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 P15/p16

cancer
10 37 5 74 1 1
Data preprocessing RMA
7.3.1 Generate Example Data Set
The example data is ALL, which is provided by Bioconductor. . Table 7.2 describes the

structure of the ALL dataset. We will extract significantly differentially expressed genes
and run K-means clustering.
First, load analysis modules and the ALL dataset into R. Load the installed R Library.
> library("Biobase")
> library("limma") #Import limma library.
> library("ALL") #Import ALL data package.
> data(ALL) #Read ALL data set.
> show(ALL) #Print the structure of ALL data set.
> summary(pData(ALL)) #Print the summary of phenotypic data related to experiment.
The show (ALL) command provides an overview of the structure of the dataset
(. Fig. 7.2).

The pData function returns phenotypic info from the dataset. If the info is very long,
use summary (pData (ALL)) to get a summary overview. With these commands, we can
see that ALL consists of 128 samples and 12,625 features related to gene expression. The
names of 128 samples are 01005, 01010, …, LAL4. The following link provides more
detailed information of the ALL data set.
7 http://www.bioconductor.org/packages/release/data/experiment/html/ALL.html

7
.. Fig. 7.2 Usage of show function to view data structure
Using summary(pData (ALL)) also shows the phenotypic distribution of the molecu-
lar biological index, which is titled mol.biol, for 128 samples.
ALL1/AF4: 10
BCR/ABL : 37
E2A/PBX1: 5
NEG : 74
NUP-98 : 1
p15/p16 : 1
In order to perform two group comparisons, we will extract 37 cases of BCR/ABL and
10 cases of ALL1/AF4.
> ALL$mol.biol #Print and review mol.biol label of ALL data set.
> eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")] #Save only 47 samples,
whose mol.biol values are “BCR/ABL” and “ALL1/AF4” in eset.
> f <- factor(as.character(eset$mol.biol)) #The values of eset, a subset of mol.biol, are
“BCR/ABL” and “ALL1/AF4” only.
> design <- model.matrix(~f)
Next, we will extract a list of genes which are statistically significantly different between
the two groups (adjusted p-value <0.05). There are 165 significantly different probes.
K-means clustering can be run twice for the 165 probes, and generate 10 clusters. We can
also run this analysis on the differential gene expression files generated in 7 Chap. 6.

7.3 · Dataset and Biological Interpretation Tools
127 7
> fit <- eBayes(lmFit(eset, design)) #Apply linear model to 47 samples.
> selected <- p.adjust(fit$p.value[, 2]) < 0.05 #Extract the indexes of significant analysis
results.
> esetSel <- eset[selected, ] #Save the significant subset in esetSel.
> res <- exprs(esetSel) #Save the expression profile of the significant subsets.
> res.kmeans <- kmeans(res, 10) #Generate 10 clusters.
The following example shows how to convert cluster analysis results into an input for-
mat accepted by an interpretation tool. The first line of this input format describes the
experimental condition for each sample. For DAVID, we need to generate a file listing the
165 probes, because DAVID does not accept cluster analysis results as input. However,
for ArrayXPath, we need to add a cluster ID to each probe because it will incorporate the
cluster analysis results.
Print a list of significant probes and clustering results in a system input file format for
interpretation.
> res.input <- cbind(rownames(res), res.kmeans$cluster) #Generate 2 row matrix with
differential gene ID of res in the first row of res.input and cluster IDs created by K-means
analysis in the second row
> res.input <- cbind(res.input, res) #From the third line, generate a matrix combining gene
expression profiles.
> colnames(res.input) <- c("probe_id", "cluster_no", colnames(res)) #Give a column names
of a data-frame
> write.table(res.input[ ,1], file = "test_input_DAVID.txt", sep = "\t", quote = FALSE,
row.names = FALSE, col.names = FALSE) #Generate an input file for DAVID analysis.
> write.table(res.input, file = "test_input_axp.txt", sep = "\t", quote = FALSE,
row.names = FALSE, col.names = TRUE) #Generate an input file for ArrayXPath and BioLattice.
The input file for DAVID is a simple list of genes arranged in a single column. How-
ever, as shown in . Fig. 7.3, input files for ArrayXPath and BioLattice contain probe IDs,

cluster IDs, and gene expression profiles for each sample. The first row contains column
labels.
.. Fig. 7.3 Data Input File for ArrayXPath and BioLattice
7.3.2 Biological Interpretation Tools
7.3.2.1 DAVID
The Database for Annotation, Visualization and Integrated Discovery (DAVID) is a
web-based tool designed for understanding the biological meaning of a given gene list.
It includes functions for functional annotation, gene functional classification, gene ID
conversion, and pathogen annotation.
The website is: 7 https://david.ncifcrf.gov

7.3.2.2 ArrayXPath
ArrayXPath is a microarray data interpretation system based on biological pathways. It
also provides interpretations using GO and visualizes the genes in terms of clusters, dis-
eases, gene ontology, and biological pathways.
7 http://www.snubi.org/software/ArrayXPath/

7.3.2.3 BioLattice
Biolattice is a web-based clustering analysis tool. DAVID and ArrayXPath provide inter-
pretations for an individual cluster. However, BioLattice analyzes correlations of multiple
clusters using Formal Concept Analysis, and thus provides understanding of the overall
context of an experiment. It helps identify clusters which have important meaning, visu-
alizes the results by running core-periphery analysis, and prints the results in a table
format.
7 http://www.snubi.org/software/biolattice/

7.4 · DAVID
129 7
7.4 DAVID
Go to the DAVID website. Select “Start Analysis” from the top menu, then click “Upload”
in the left input window. We will use the “test_input_DAVID.txt” file which we created in
the previous example. There are two ways of uploading the data.
First, you can upload a file directly by choosing “Browse” under the “B: Choose From a
File” option of “Step 1.” A pop-up window will appear for you to select the file for uploading.
You can also copy the gene list and paste it directly into the field under the “A: Paste a list”
option of “Step 1.” Since example is Affy probe data, choose “AFFYMATRIX_3PRIME_
IVT_ID” in “Step 2: Select Identifier.” There is an option to convert probe IDs to gene IDs
when you enter the data. Next, select “Gene List” in “Step 3”, then click “Submit List” under
“Step 4” to submit the data. The submission results will appear on the screen (. Fig. 7.4).

Based on various database and knowledge resources, DAVID offers various analysis
tools for our 165 probes. You can see detailed results for any analysis by clicking each
resource link (. Fig. 7.4). There are two different presentations of results, the “Functional

Annotation Chart” and the “Functional Annotation Table” (. Fig. 7.5).
In the “Functional Annotation Chart,” the ‘category’ column describes the original
database and specific term name. The table also provides a p-value and adjusted p-value
for each term. The “Functional Annotation Table” matches database terms with individual
submitted probes (. Fig. 7.6). You can confirm the probes at gene level and download the

result as a text file.
.. Fig. 7.4 Web-based GO annotation overrepresentation analysis using DAVID

.. Fig. 7.5 DAVID Functional Annotation Chart
.. Fig. 7.6 DAVID Functional Annotation Table

7.6 · BioLattice
131 7
.. Fig. 7.7 ArrayXPath
7.5 ArrayXPath
The ArrayXPath webtool provides interpretations of bulk genome data based on biologi-
cal pathways. It is available for biological pathways of Homo sapiens, Mus musculus, and
Rattus norvegicus. ArrayXPath automatically converts probe IDs to gene IDs and provides
options for p-value and adjusted p-value calculations (. Fig. 7.7). To perform a test run,

upload the file test_input_ArrayXPath.txt created in the previous section. . Figure 7.8

shows the pathway analysis results. We can confirm the results of statistics and hypergeo-
metric tests of the listed genes. The results can be evaluated by cluster or with a biological
pathway database. Using Scalable Vector Graphics (SVG), ArrayXPath provides a web-
based service for visualizing gene-expression profiles and biological pathway graphs. In
addition, it provides a web-enabled interactive visualization of pathways integrated with
gene-expression profiles so a user can visualize changes in gene expression according to
experimental conditions (. Fig. 7.8).

7.6 BioLattice
DAVID suggests GO terms that best describe the biological meanings of a gene list.
ArrayXPath combines GO analysis and biological pathway analysis. In most cases, though
not always, pathway analysis is more likely to give meaningful results than GO term analysis.
Recently, DAVID and ArrayXPath have been improved and made more compatible with
each other. However, both GO and pathway analyses have the limitation of only interpreting
the genes of a single cluster. BioLattice overcomes this limitation by mapping the correlations
7
.. Fig. 7.8 Results of ArrayXPath analysis. When you click each pathway, it will produce an interactive graph
.. Fig. 7.9 BioLattice analysis menu and results
of each cluster through graphical representations, creating a lattice of concepts. Also, it pro-
vides a core-periphery analysis to present different degrees of importance at a semantic level.
BioLattice provides interpretations using GO or biological pathways. It asks users to
select one of three GO categories, “Biological Process,” “Molecular Function,” or “Cellular
Component,” and provides an option to filter based on degrees of statistical significance
or the number of probes (. Fig. 7.9). In this section, we will perform a test run using the

test_input_BioLattice.txt file.
7.6 · BioLattice
133 7
.. Fig. 7.10 BioLattice Graph. It divides the whole experiment into multiple clusters and rearranges
them according to their correlations
After you run BioLattice, you will find the Lattice graph shown in . Fig. 7.10. In the
graph, red represents “core” clusters which are the most important experimental context.
The green clusters are “communicating” clusters which have high correlations with “core”
clusters. Yellow clusters are independent and do not have any correlation with other clus-
ters. The rest are in grey, and classified as “peripheral.” As shown in . Fig. 7.9, the same

results can be presented in a table.
Exercises
[Exercises 1] - The structure of the Golub data provided by Bioconductor is as . Table 7.3 below.

Download the data (data name: golubEsets), find the differentially expressed genes (DEGs) between ALL
and AML patients, and convert the DEG results to a DAVID input file format.
1.1. Upload the data generated from exercise 1 in DAVID and check which GO terms and OMIM (Online
Mendelian Inheritance in Man) diseases from the “Functional Annotation Chart” and “Functional
Annotation Table” significantly enriched the uploaded genes.
1.2. Using the DEGs having p-values less than 0.05 extracted from the Golub data, perform K-means
using K = 20. Using the results of the K-means, run ArrayXPath.
.. Table 7.3 Structure and information of the Golub data
Title Acute Lymphoblastic Leukemia, Acute myeloid

Leukemia
Feature Probes 7129
Samples 72
Platform AffyMetrix HU-6800
Sample Bone marrow Peripheral blood
62 10
Type of ALL AML

leukemia
47 25
7
Take Home Message
55 Understanding of the Gene Ontology and Pathway analysis as a biological inter-
pretation of gene expression data using R.
Bibliography
1. ArrayXPath – http://www.snubi.org/software/ArrayXPath/
2. BioLattice – http://www.snubi.org/software/biolattice/
3. Chiaretti S et al (2004) T-cell acute lymphocytic leukemia identifies distinct subsets of patients with
different response to therapy and survival. Blood 103(7):2771–2778. Epub 2003 Dec 18.Gene expres-
sion profile of adult
4. Chung HJ et al (2004) ArrayXPath: mapping and visualizing microarray gene expression data with
integrated pathway resources using Scalable Vector Graphics. Nucleic Acids Res 32:W460–W464
5. Chung HJ et al (2005) ArrayXPath II: mapping and visualizing microarray gene expression data with
biomedical ontologies and integrated pathway resources using Scalable Vector Graphics. Nucleic
Acids Res 33:W621–W626
6. DAVID – https://david.ncifcrf.gov
7. Huang DW et al (2009) Systematic and integrative analysis of large gene lists using DAVID bioinfor-
matics resources. Nat Protoc 4(1):44–57
8. Huber W, Carey VJ, Gentleman R et al (2015) Orchestrating high-throughput genomic analysis with
Bioconductor. Nat Methods 12:115–121
9. Kim J et al (2008) BioLattice: a framework for the biological interpretation of microarray gene expres-
sion data using concept lattice analysis. J Biomed Inform 41(2):232–241
10. Li X (2009) ALL: a data package. R package version 1.16.0.
11. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015) Limma powers differential expres-
sion analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43(7):e47
135 8
Gene Set Approaches

and Prognostic Subgroup
Prediction
8.3 Input Files – 139

8.3.1 Gene Expression Data File Format – 140
8.3.2 Phenotype Label File Format – 141
8.3.3 Gene Set Database File – 141
8.3.4 Microarray Chip Annotation File – 142
8.3.5 Data Input – 142
8.4 GSEA Execution – 143

8.4.1 Required Fields – 143
8.4.2 Basic Fields – 145
8.4.3 GSEA Execution – 145
8.4.4 GSEA Analysis Results – 145
8.5 Leading Edge Subset Analysis – 146
8.6 GSEA Analysis Via R – 147

8.6.1 Installation – 148
8.6.2 Input File – 149
8.6.3 GSEA Execution – 149


https://doi.org/10.1007/978-981-13-1942-6_8
8.7 Survival Analysis – 152
8.7.1 Install CGDS-R Package – 152
8.7.2 Get List of Cancer Studies from Server – 153
8.7.3 Extract Samples and Features – 153
8.7.4 Get Mutation Profiles for BRCA1 and BRCA2 – 154
8.7.5 Extraction Samples with BRCA1 and BRCA2 Methylation – 154
8.7.6 Clinical Data Integration – 155
8.7.7 Survival Analysis – 156
8.1 · Introduction
137 8
In this chapter, we implement Gene Set Enrichment Analysis (GSEA) to analyze microarray
data. We perform Kaplan-Meier survival analysis for the clustered genes obtained by micro-
array data clustering analysis and test the statistical significance of different prognoses
between clusters. It provides an understanding of the correlation between biological inter-
pretation and GO and pathway analysis of the clustered genes and an interpretation with
GSEA of the clustered genes.
8.1 Introduction
Gene Set Enrichment Analysis (GSEA) is a valuable tool in analyzing and interpreting
microarray data. GSEA is a method to detect if there is a gene set that has significant
expression or pathway-specific biological meaning. While differential expression analysis,
as explained in 7 Chap. 5, is a method of identifying genes important for this difference,

GSEA is a method of identifying the gene set from the comparison. The former verifies
the number of genes as an analysis unit (. Fig. 8.1), and the latter verifies the level of the

gene set as an analysis unit (. Fig. 8.2). Although the number of verified genes has already

been decided in the experimental step, the verified number of gene sets varies by the
Class A Class B
FDR < 0.05
t-test cut-off
FDR < 0.05

Biological meaning?
.. Fig. 8.1 After performing differential expression analysis on the gene level, the results of the analysis
are used to find biological meaning
138 Chapter 8 · Gene Set Approaches and Prognostic Subgroup Prediction
Class A Class B Gene set 1 Gene set 2 Gene set 3

−
Gene set 2
enriched in
Class B
ES/NES statistic
t-test cut-off
Gene set 3
enriched in
Class A
8
.. Fig. 8.2 GSEA analysis of the gene set unit performed for differential expression significance
connotes biological meaning
analyst’s purpose.1 After verification, both methods can correct for the effect of multiple
hypotheses (Reference 7 Chap. 5, 7 Sect. 5.5 and 7 Chap. 6, 7 Sect. 6.3).

There are two advantages to this type of unit set analysis. First, small signals that are
not detected by individual genome analysis can be amplified by set unit analysis. Second,
because the analysis is run with gene sets that have been proven to have biological mean-
ing, if the significance of differential expression is acknowledged, then the biological inter-
pretation is meaningful. However, in order to interpret the gene list from the differential
analysis of gene units, ontology or pathway analysis must be run. GSEA can use several
gene annotation databases (KEGG pathway, Gene Ontology, Cytogenic Band) to mine
biological information. Of course, gene unit differential expression analysis can also be
used.
Survival analysis is a way to statistically infer the time for a specific event to happen
through observation of animal testing or human clinical trials. Survival analysis of micro-
array gene expression data that are not differentiable clinically can be sorted into catego-
ries. The survival analysis practice in 7 Chap. 7 uses microarray gene expression data and

clinical data containing survival data. Verify the statistical significance of the difference in
survival curves of the groups clustered by gene expression data.
If there is a significant difference between the clusters, classify the prognosis into two
different categories that allow the first steps to suggest different treatment strategies.
1 Theoretically, once the number of genes is decided, the possible number of gene sets is defined by
combinatorics, i.e., “Stirling numbers.” However, this number is a very big number, resulting in a
“computationally unfeasible” state. In addition, not all combinations are reasonable enough to have
a biological meaning. Therefore, practically, a small portion of all Stirling numbers that are
meaningful are testable.
8.3 · Input Files
139 8
8.2 Prerequisites
The main methods of GSEA include Java GSEA Desktop Application and R Package. In
practice, we will use methods (1) and (2). For detailed information on installation, refer
to Appendix B in Sect. B.8.
55 If the latest version of Java has not been installed, download and run “jxpiinstall.exe”
from the website: 7 https://www.java.com/en/download/.

55 Download” gsea.jnlp” from 7 http://software.broadinstitute.org/gsea/downloads.jsp

55 Download the dataset necessary for the practice and create the directory “C:\gda\
ch08\” and unzip the dataset in this directory.
/GeneSet.zip is unzipped to C:\gda\ch08\GeneSet\, which contains the directory files

Datasets, GeneSetDatabases, Reports, and three other directory files. In the. /Datasets/
directory, there are *.gct and *.cls files, in the. /GeneSetDatabases directory, there is a
*.gmt file, and in the Reports/ directory, practice result files from 7 Sect. 8.6 are included.

These datasets can be downloaded from Broad Institute, and these file are modified and
provided in this textbook.
8.3 Input Files
Since the GSEA input file follows the ASCII2 text file format, it can be prepared by text
edit or MS Excel software.3 The input file format follows a table or matrix format, and the
column is separated by line and rows4 by tab; this format is called a tab-delimited file. A
selection box opens up when you select “File” -> “Save As..”in the Excel menu. You can
save by entering “File name (N) (e.g., “p53.gct”5) and selecting the “Text (Separated by
Tab) (*.txt)” option in the “File Format (T).”6 When using text edit, you can use “Tab” on
the keyboard to separate rows and “Enter” to change lines. The types of formats used in
GSEA input files are below. All files are in ASCII tab separation format.
* Gene Expression Data Format
GCT: Gene Cluster Text file format (*.gct)
RES: ExpRESsion (with P and A calls) file format (*.res)
PCL: Stanford cDNA file format (*.pcI)
TXT: Text file format for expression dataset (*.txt)
* Phenotype Label Data Format
CLS: Categorical (ex, tumor vs normal) class file format (*.cls)
CLS: Continuous (ex, time series or gene profile) file format (*.cls)
2 ASCII (American Standard Code for Information Interchange) is a 7-bit character code to present an
English character in computer. It has 128 codes. Codes 0 to 31 are used to control peripherals, such
as printers, codes 32–47 are for all the characters, 48–57 are for numbers, and 65–90 are for
alphabets.
3 Auto-formatting function can lead to “auto-error” when a gene name is entered in the Excel file
(Zeeberg et al. 2004).
4 CR/LF, line feed format is OS-dependent.
5 Do not use hyphen “-” in the file name. It cannot be recognized in the GSEA input window due to
some JAVA libraries.
6 Excel sends a warning that it has features unable to support tab-delimited files. Nevertheless, please
select “Yes” to save.
* Gene Set Database Format

GMX: Gene MatriX file format (*.gmx)
GMT: Gene Matrix Transposed file format (*.gmt)
GRP: Gene set file format (*.grp)
XML: Molecular signature database file format (msigdb_*.xml)
* Microarray Chip Annotation Format
CHIP: Chip file format (*.chip)
* Gene Ranking List
RNK: Rank list file format (*.rnk)
8.3.1 Gene Expression Data File Format
In the practice, *.gct and *.txt files are used.

* GSEA *.gct file
55 First row: The first column in the first row is always “#1.2”, and the rest of which are
kept blank.
55 Second row: The second row contains the number of rows containing the gene probe
8 name; . Fig. 8.3 portrays a sample of 10,056 probe names. The second column is the

number of samples, which is the number of columns. In . Fig. 8.3, the example

contains 48 samples.
55 Third row: Columns 1 and 2 contain “NAME” and “DESCRIPTION”, and each row
after row 3 contains sample names of all 48 in order.
55 The fourth row and below: Gene name, description, and expression value are listed in
their respective order. In other words, actual gene expression data are placed from
column 4 and rows 3 of the input.
.. Fig. 8.3 Example of GSEA input file format *.gct file

8.3 · Input Files
141 8
.. Fig. 8.4 Example of GSEA input file format tab separation *.txt file
.. Fig. 8.5 Example of GSEA phenotype label input file format *.cls
* Tab separation text file, Tab-delimited*.txt

*.gct file is an ASCII text file omitting the first two columns. When opened in Excel,
it is the same format as the third column of the *.gct file as in the following (. Fig. 8.4).

8.3.2 Phenotype Label File Format
Categorical phenotype: the categorical phenotype expression input file format used in
practice is as follows. The first line contains the number of samples, number of phenotype
labels, and the number 1; a space separates the numbers. In the second line after “#,” the
phenotype label type is annotated. In the third line, the phenotype label is listed, separated
by a space, in the same order as the samples are in the gene expression data file (. Fig. 8.5).

8.3.3 Gene Set Database File
Each row in the gene set database represents one gene set. Column 1 contains the gene
set name, column 2 has the target gene set explanation, and the set element gene name is
.. Fig. 8.6 Example of gene set database *.gmt file format used in GSEA
listed from column 3. Since each set may have a different number of gene elements, the
number of columns in each row in the *.gmt file can be different. The gene set name in
column 1 must be unique and cannot be duplicated (. Fig. 8.6).
8.3.4 Microarray Chip Annotation File
Microarray chip companies produce annotation files in their own language and put the
information into a *.chip file. Although this file is not used directly in GSEA, it is used in
the interpretation.
The file is arranged in three rows of tab separation ASCII format; in the first row, the
first three cell columns include the label name (“Probe Set ID,” “Gene Symbol”, and “Gene
Title”). The microarray chip and the corresponding Probe Set ID are first followed by the
explanation of the gene symbol, starting from the second row.
8.3.5 Data Input
The gene set database, gene expression data, and phenotype data necessary in the analy-
sis are input into the installed GSEA program. In order to open the file and open the
“Load Data” tab, select “Load data” in the left menu “Steps in GSEA Analysis.” Select either
“Method 1: Browse for files…” or “Method 2: Load last dataset used.” Or, you can use
“Method 3: drag and drop files here” to move input files by mouse. You can confirm the
input file list in the “Object cache” on the bottom right. A detailed menu can be viewed
8.4 · GSEA Execution
143 8
.. Fig. 8.7 The GSEA program window after uploading three input files
by double-clicking the entry in the list or right-clicking. File content can be confirmed by
“Phenotype Viewer” or “Dataset Viewer” (. Fig. 8.7).
The following practice example uses Leukemia_example.gct, for the expression data,
Leukemia_example.cls for phenotype data, and C1.gmt for the gene set database.
Execute GSEA by selecting “Run GSEA” in the left menu after uploading all three input
files.
8.4 GSEA Execution
An additional information input window to execute GSEA can be viewed by selecting “Run
GSEA” in the left menu. There are three categories: “Required,” “Basic,” and “Advanced”
(. Fig. 8.8).

8.4.1 Required Fields
Select the required input data or parameters in “Required Fields.”

55 Expression dataset: Select Leukemia data (ALL vs. AML)
55 Gene set database: Select the C1.gmt database by opening the tab “Gene matrix (local
gmx/gmt). There are other various gene set databases, and multiple lists of MsigDB
gene sets can be selected by opening the “Gene matrix (from website)” tab (. Fig. 8.9).

.. Fig. 8.8 An additional information input window necessary for GSEA execution is visible by opening
the “Run GSEA” tab
.. Fig. 8.9 MsigDB gene set listed in “Gene matrix (from website)” tab
8.4 · GSEA Execution
145 8
55 Number of permutations: number of phenotype labels or gene set permutations
performed. The larger the number of results, the higher the confidence but also the
longer the calculation time.
55 phenotype labels: Select phenotype labels
55 Collapse dataset to gene symbol
-> True: If *.chip file is present, use to map gene symbol
-> False: If gene symbol is already mapped or there is no *.chip file
55 Permutation type: Select target permutation from phenotype labels or gene sets.
55 Chip platform (s): Designate *.chip file. Leukemia.gct used in the practice has already
been mapped for gene symbol. No designation for Chip needed. Therefore, “Collapse
dataset to gene symbol” is designated as “False.”
8.4.2 Basic Fields
55 Analysis name: Choose a name for the analysis.

55 Enrichment statistic: Select if the weight of each gene is the same or if the correlation
coefficient between phenotype and gene will be used to weigh.
55 Metric for ranking genes: Distance function for gene ranking.
55 Gene list sorting mode: Select real value or absolute value for gene arrangement.
55 Gene list ordering mode: Descending or Ascending.
55 Max size: Gene set element maximum value.
55 Min size: Gene set element minimum value.
55 Save results in this folder: Select the folder in which to save results.
8.4.3 GSEA Execution
The analysis can be started by selecting “Run” on the bottom right. If the permutation test
is run 1000 times using practice data, it will take 1–2 min. The left-bottom “GSEA reports”
will have “Success” listed with the complete analyses. The results report can be viewed as
an HTML file by selecting entries in the list.
8.4.4 GSEA Analysis Results
GSEA result files are by default saved in the directory “gsea_home” (ex, C:\
DocumentsandSettings\username\gsea_home). The report file can be prepared as an
HTML file and linked to relevant files through a hyperlink in a browser, as shown below
(. Fig. 8.10).

By default, two phenotype label (e.g., ALL vs. AML) analysis results are provided in
the report. Results with FDR <25% or p-value <5% can be interpreted as a significant
difference between gene sets. Verification of a nominal p-value is only conducted within
the gene set and is not corrected for multiple hypothesis verification; therefore, standard
results with FDR <25% are used.
The individual analysis results in . Fig. 8.11 can be viewed by selecting “Detailed

enrichment results” in the analysis results report and selecting the analysis results of each
.. Fig. 8.10 GSEA Result report
gene set in the third column, “GS Details” (. Fig. 8.10). The summary page, enrichment

plot, heat map, and random ES distribution graphs can be viewed.
8.5 Leading Edge Subset Analysis

In the example “Enrichment plot: MORF_GNB1” in . Fig. 8.11, the “Leading edge subset”

is an important subset in the relevant gene set that contains the top-ranking gene list
of enrichment scores (ESs), from 0 to the top score. An additional analysis to find the
biological meaning is a program called “Leading edge subset analysis,” available in the left
menu window. The results page for the GSEA can be loaded by selecting the “Load GSEA
Results” in the top right menu. In the results page, select the target gene set and select
“Run leading edge analysis” to run the analysis (. Fig. 8.12).

(Upper left) Expression profile, Heat map of selected gene sets of leading edge subset
gene
(Upper right) Selected gene sets overlap with leading edge subset. Higher, green.
Lower, white
(Lower left) Bar graphs of gene sets of leading edge subset gene
(Lower right) Standard histogram of Jacquard distance of leading edge subset gene
8.6 · GSEA Analysis Via R
147 8
.. Fig. 8.11 Significant gene set analysis results
8.6 GSEA Analysis Via R
As introduced in 7 Sects. 8.3, 8.4, and 8.5, Java is more user-friendly. However, to handle the

analysis results more freely, R is highly recommended. In this practice, acute lymphoblastic
leukemia (ALL) data reported by Chiaretti et al. [2] will be analyzed by GSEA using R.
.. Fig. 8.12 Leading edge subset analysis results screen
8.6.1 Installation
55 Select the tab for “Downloads” from the MIT Broad Institute website (7 http://

software.broadinstitute.org/gsea/downloads.jsp) and download the file R-GSEA R

Script (GSEA-P-R.1.0.zip) to place into a directory after unzipping.
55 Execute R.
> setwd(directory to unzip) (ex: setwd(“C:/gda/ch08/GSEA-P-R.1.0”))
> source (“GSEA.1.0.R”)
55 For Windows users, download “C:\gda\ch08\” and unzip the file (Reference to
7 Sect. 8.2. Prerequisites).

149 8
> setwd(“C:\gda\ch08\GeneSet\”)
> source(“GSEA.1.0.R”)
The result file is saved to the directory. “C:\gda\ch08\GeneSet\Reports\”.
8.6.2 Input File
Input file is the same as mentioned above.

55 Gene expression data (.gct): Leukemia_example.gct
55 Phenotype data (.cls): Leukemia_example.cls
55 Gene set data (.gmt): C1.gmt
Before GSEA in R, the genes must be mapped as gene symbols. The practice file has
already been mapped.
8.6.3 GSEA Execution
A directory to store the compressed results. In this practice, the directories that include
the input files, datasets/, and gene set data, GeneSetDatabases/, are used. Resultsreports
in text file and image file are saved in the Reports/directory (. Figs. 8.13 and 8.14).

Interpretation of the analysis results is the same as in 7 Sect. 8.4.4.

.. Fig. 8.13 Comprehensive report of GSEA results for each gene set
Association of BRCA 1/2 Mutations with Survival
1.0
BRCA1 mutation
BRCA2 mythylation
BRCA2 mutation
BRCA wild-type
0.8
0.6
Proportion
0.4
8
0.2
0.0
0 50 100 150
Time, days
.. Fig. 8.14 Survival plot for association of BRCA 1/2 mutations with survival
#######################################################
# setwd("C:\gda\ch08\GeneSet\") #Define working directory
# source("GSEA.1.0.R") #Call executive R code.
#######################################################
> GSEA(
#######################################################
# Define input and output file
# Check working directory with getwd() and set working directory with setwd(“C:\gda\ch08\”)
151 8
#######################################################
> input.ds = "Datasets/Leukemia_example.gct",
> input.cls = "Datasets/Leukemia_example.cls",
> gs.db = "GeneSetDatabases/C1.gmt",
> output.directory = "Reports/",
#######################################################
# Set a parameter. Modify permutation number according to computer specification.
#######################################################
> doc.string = "Leukemia",
> non.interactive.run = FALSE,

> reshuffling.type = "sample.labels",
#nperm = 1000,
> weighted.score.type = 1,
> nom.p.val.threshold = -1,
> fwer.p.val.threshold = -1,
> fdr.q.val.threshold = 0.25,
> topgs = 20,
> adjust.FDR.q.val = FALSE,
> gs.size.threshold.min = 15,
> gs.size.threshold.max = 500,
> reverse.sign = FALSE,
> preproc.type = 0,
> random.seed = 111,

#######################################################
# Set for an expert
#######################################################
> perm.type = 0,
> fraction = 1.0,
> replace = FALSE,
> save.intermediate.results = FALSE,
> OLD.GSEA = FALSE,
> use.fast.enrichment.routine = TRUE
>)
8
8.7 Survival Analysis
Integrated genomic analyses of ovarian carcinoma, a study published on Nature in 2011,

obtained statistically valid results using TCGA ovarian cancer data exclusively. In our
practice analysis, we will replicate this very paper in order to get ourselves familiar with
analyzing survival rate differences between two groups and creating graphs.
8.7.1 Install CGDS-R Package
With CGDS-R package provided by cBioPortal, we could gain access to TCGA data rela-
tively easily.
> install.packages("cgdsr") # Install "cgdsr" R package
> library(cgdsr) # Load library
> mycgds = CGDS("http://www.cbioportal.org/") # Create CGDS object
> test(mycgds) # Check if data is correctly loaded

8.7 · Survival Analysis
153 8
getCancerStudies... OK
getCaseLists (1/2) ... OK
getCaseLists (2/2) ... OK
getGeneticProfiles (1/2) ... OK
getGeneticProfiles (2/2) ... OK
getClinicalData (1/1) ... OK
getProfileData (1/6) ... OK
8.7.2 Get List of Cancer Studies from Server
We could call for a list of cancer studies within cBioPortal servers. Among these studies,
select ovarian cancer and save the data into “mycancerstudy” variable.
> cancerstudy <- getCancerStudies(mycgds)
> head(cancerstudy)
> cancerstudy$name # View the list of studies
> mycancerstudy = cancerstudy[135, 1] #"ov_tcga_pub" a
> mycancerstudy
[1] "ov_tcga_pub“
8.7.3 Extract Samples and Features
Extracting a list of patients that hold preferable data.
> getCaseLists(mycgds, mycancerstudy)[ ,1]
> mycaselist = getCaseLists(mycgds, mycancerstudy)[2,1]
> mymutationprofile = getGeneticProfiles(mycgds, mycancerstudy)[6,1]
#"ov_tcga_pub_mutations"
> mymethylationprofile = getGeneticProfiles(mycgds, mycancerstudy)[5,1]
#"ov_tcga_pub_methylation_hm27"
In order to verify the prognostic effects of mutation and methylation within BRCA1/2
genes in ovarian cancer, comparing these factors to overall survival is crucial. Therefore,
we will first split the patients into BRCA1 mutation group, BRCA2 mutation group,
BRCA1 methylation group, and wild type preceding the analysis.
8.7.4 Get Mutation Profiles for BRCA1 and BRCA2
Extract patients with mutations in BRCA1 or BRCA2.
> brca_mutation = getMutationData(mycgds, mycaselist, mymutationprofile,
c('BRCA1', 'BRCA2'))
> table(brca_mutation$gene_symbol) # Count the number of samples with BRCA1 or
2 mutation
> brca1_mutated_cases <- brca_mutation[which(brca_mutation$gene_symbol

8
== 'BRCA1'), 3]
> brca2_mutated_cases <- brca_mutation[which(brca_mutation$gene_symbol
== 'BRCA2'), 3]
> table(brca_mutation$gene_symbol)
BRCA1 BRCA2
37 35
8.7.5 Extraction Samples with BRCA1 and BRCA2 Methylation
Methylated BRCA1/2 genes leads to lowered gene expression, which may have a nega-
tive impact in survival rate. Therefore, we will extract patients with BRCA1 or BRCA2
methylation.
> brca_methylation = getProfileData(mycgds, c('BRCA1', 'BRCA2'), mymethylationprofile,
mycaselist)
> brca1_methylation_cases =
rownames(brca_methylation[which(brca_methylation$BRCA1 > 0.8), ])
> brca2_methylation_cases =
rownames(brca_methylation[which(brca_methylation$BRCA2 > 0.8), ])

8.7 · Survival Analysis
155 8
8.7.6 Clinical Data Integration
So far, we have extracted a list of patients that either have mutations or methylation in
BRCA1/2 genes. We next have to extract survival time variable from clinical variables in
order to proceed with our survival analysis.
> myclinicaldata = getClinicalData(mycgds, mycaselist) # Get clinical data for the case list
> myclinicaldata$OS_STATUS[myclinicaldata$OS_STATUS == ""] <- NA
> myclinicaldata$OS_MONTHS
> myclinicaldata$OS_MONTHS
[1] 12.29 39.82 8.41 36.99 17.77 57.53 31.14 35.78 29.37 24.67 23.98$
Differentiate patients with mutations or methylation in either BRCA genes from those
that do not.
> total_sample <- rownames(myclinicaldata) # Extract all patients with clinical data
> type <- rep('Wild', length(total_sample)) # Assume all patients are wild type and make a
base type table.
> names(type) <- total_sample # Matching patients IDs: Matching patient IDs is a crucial
step because patient IDs used in mutation profile, and patient IDs used in clinical, methylation
profiles are different.
> brca1_mutated_cases = gsub("-", ".", brca1_mutated_cases)
> brca2_mutated_cases = gsub("-", ".", brca2_mutated_cases)
Categorize all patients according to their mutation profile.
> type[brca1_methylation_cases] <- "BRCA1_methylation"
> type[brca2_methylation_cases] <- "BRCA2_methylation"
> type[brca1_mutated_cases] <- "BRCA1_mutation"
> type[brca2_mutated_cases] <- "BRCA2_mutation"

8.7.7 Survival Analysis
Perform survival analysis on four patient groups categorized by BRCA mutation and
methylation profile.
> install.packages(‘survival’) #survival package install
> library(survival)
> out <- survfit(Surv(OS_MONTHS, OS_STATUS == "DECEASED") ~ type, data =
myclinicaldata)
> survdiff(Surv(OS_MONTHS, OS_STATUS == "DECEASED") ~ type, data =
myclinicaldata) # Log rank test
> coxph(Surv(OS_MONTHS, OS_STATUS == "DECEASED") ~ type, data = myclinicaldata)
8 # cox proportional hazard model
Create survival plot.
> color <- c("skyblue", "red", "blue", "black")
> plot(out, col=color, main = "Association of BRCA1/2 Mutations with Survival", xlab=
"Time, days", ylab = "Proportion", lty=c(2,4,3,1))
> legend("topright", c("BRCA1 mutation","BRCA2 mythylation","BRCA2
mutation","BRCA wild-type"), col = color, lty=c(2,4,3,1), lwd=3)
Exercise
[Exercise 1] - Using gene set C2.gmt from the ‘Leukemia’ data, find the gene set with FDR <25%.
[Exercise 2] - Conduct a leading edge subset analysis on the analysis results above using the gene sets
with FDR <25%.
[Exercise 3] - Using gender data gender.gct, gender.cls, and motif gene set C3.gmt provided by the
Broad Institute, find the gene set with FDR <25%.
Take Home Message

55 How to perform GSEA using Java GSEA application and R.
55 How to perform survival analysis using public data.
Bibliography
157 8
Bibliography
1. Cancer Genome Atlas Research Network (2011) Integrated genomic analyses of ovarian carcinoma.
Nature 474(7353):609–615. https://doi.org/10.1038/nature10166
distinct subsets of patients with different response to therapy and survival. Blood 103(7):2771–2778.
Epub 2003 Dec 18
3. Jacobsen A (2017) cgdsr: R-Based API for Accessing the MSKCC Cancer Genomics Data Server (CGDS).
R package version 1.2.6. https://CRAN.R-project.org/package=cgdsr
4. Liberzon A et al (2011) Molecular signatures database (MSigDB) 3.0. Bioinformatics 27(12):1739–1740.
https://doi.org/10.1093/bioinformatics/btr260. Epub 2011 May 5
5. Mootha VK et al (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coor-
dinately downregulated in human diabetes. Nat Genet 34(3):267–273
6. Subramanian A et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpret-
ing genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550
7. Therneau T (2015) _A package for survival analysis in S_. version 2.38, <URL: https://CRAN.R-project.
org/package=survival>
8. Zeeberg BR et al (2004) Mistaken identifiers: gene name errors can be introduced inadvertently when
using excel in bioinformatics. BMC Bioinformatics 5:80
9. Zeeberg BR et al (2004) Mistaken identifiers: gene name errors can be introduced inadvertently when
using excel in bioinformatics. BMC Bioinformatics 5:80
159 9
MicroRNA Data Analysis

9.3 Retrieving miRNA-mRNA Pair Expression Data – 160

9.3.1 Input File – 160
9.3.2 Missing Value Removal – 161
9.4 miRNA-mRNA Expression Correlation Coefficient – 162

9.4.1 Finding One miRNA and mRNA Correlation – 162
9.4.2 Finding the Correlation Coefficient of All Pair-Wise mRNAs
of Several Genes – 163
9.4.3 Find Significant miRNA-mRNA Pairs – 165
9.4.4 Draw Significant miRNA-mRNA Plot – 166
9.5 Verifying a Significant miRNA-mRNA Correlations

in Information Databases – 167
9.5.1 Verify mRNA Target Against Verified Experiments – 167
9.5.2 Confirm Predicted Target Gene – 169


https://doi.org/10.1007/978-981-13-1942-6_9
160 Chapter 9 · MicroRNA Data Analysis

In this chapter, in order to investigate the biological function, we will practice verifying
existing knowledge with concurrent experiments on microarray data analysis to identify
significant correlations of miRNA-mRNA pairs derived from the miRNA and mRNA analysis
profile of the same sample.
9.1 Introduction
Micro-RNA has a comparatively short length of about 21–22 bases and is single-stranded
RNA that does not code for protein, i.e., non-coding RNA. miRNA is known as an impor-
tant transcriptional control factor. Therefore, we can infer the gene functions that are con-
trolled by miRNA (or are related to the target).
Investigating miRNA target relationships is important in miRNA research. However,
since the number of verified miRNAs and their target mRNAs is small, many target pre-
diction algorithms have been developed and used. Furthermore, in contrast to plants with
much accuracy, organisms that are in higher hierarchy, the accuracy of the miRNA target
prediction algorithm is lower due to imperfect base pairing in the miRNA seed region that
cause interaction between miRNA and its target gene. There is a high chance of false pos-
itives. Therefore, there is a limit in using base pair pattern analysis to predict miRNA-gene
9 interaction mechanisms.
Expression pattern analysis of gene expression data gained from microarray experi-
ments can be used to infer the biological functions of miRNAs. Therefore, in the exercise,
miRNA-mRNA pair dual expression profile data from samples are used with target pre-
diction algorithm and known biological knowledge to carry out a function analysis as well
as to increase our understanding of miRNA-gene interactions. In this chapter, we do not
use RNA-seq gained from next-generation sequencing technology to analyze miRNA
expression. RNA-seq data will be used in Genome Data Analysis Version II for NGS.
9.2 Prerequisites
55 Installation of the R program: Install the R program according to Sect. B.1 of

Appendix B.
55 Download input txt files in the ch09 directory of the Exercise DVD.
9.3 Retrieving miRNA-mRNA Pair Expression Data
The practice uses Windows R program in a Microsoft Windows environment to analyze

the miRNA-mRNA gene expression and, using the information base from that analysis, to
analyze semantics. The practice uses expression profile data from Na YJ et al.:
“Comprehensive analysis of microRNA-mRNA co-expression in circadian rhythm.”
9.3.1 Input File

First, execute R, and open the miRNA.R script file by “File/Open Script.” Select the work
directory and save the gene expression profile, mRNA.txt, and miRNA.txt file in row and
column format by the read.table command.
9.3 · Retrieving miRNA-mRNA Pair Expression Data
161 9
> setwd("C:\gda\ch09") #Set the working directory
> mRNA_nor <- read.table("mRNA.txt", header = TRUE, sep = "\t") #Load mRNA
expression data
> miRNA_nor <- read.table("miRNA.txt", header = TRUE, sep = "\t") #Load miRNA
expression data
> dim(mRNA_nor) #Load expression data
> dim(miRNA_nor) #Load expression data
Data file (miRNA.txt) of miRNA expression profile.
> head(miRNA_nor)
MetaCol MetaRow Column Row Reporter_ID Reporter_Name I04hr_1 I04hr_2
1 1 1 2 BM11097 hsa_miR_518a_2_AS 2.605266 4.638006
1 1 2 2 BM11207 ambi_miR_7920 3.463365 3.822202
1 1 3 2 BM11352 rno_miR_499 3.360941 4.243580
1 1 4 2 BM11262 ambi_miR_10394 2.843248 4.334861
1 1 5 2 BM10438 hsa_let_7e 10.387571 9.903674
1 1 6 2 BM10719 hsa_miR_99a 9.271923 8.628829
1 1 7 2 BM10137 hsa_miR_133a 4.048892 4.497045
1 1 8 2 BM11246 ambi_miR_11576 4.186156 4.428559
1 1 9 2 BM11236 ambi_miR_279 3.638117 4.700757
1 1 10 2 BM10165 hsa_miR_372 2.775328 4.059538
* column 1~4: MetaCol, MetaRow, Column, Row (The location of the probe on the
microarray)
* column 5: reporter_id
* column 6: miRNA_id
* column 7~30: normalized expression values (04hr_1, 04hr_2, 08hr_1, 08hr_2, 12hr_1,
12hr_2, 16hr_1, 16hr_2, 20hr_1, 20hr_2, 24hr_1, 24hr_2, 28hr_1, 28hr_2, 32hr_1,
32hr_2, 36hr_1, 36hr_2, 40hr_1, 40hr_2, 44hr_1, 44hr_2, 48hr_1, 48hr_2 sample)
9.3.2 Missing Value Removal

We remove the NA (Not Available) from the gene expression data.
> length(which(is.na(mRNA_nor))) #Count the number of NA in mRNA_nor
> length(which(is.na(miRNA_nor))) #Count the number of NA in miRNA_nor

If the value is greater than 0, we remove the missing value and get the expression file. In
this case, we get a gene expression matrix without missing values by removing the row
with the missing value when we know that an miRNA expression value is missing.
#Obtain expression matrix without missing value
> miRNA_nor.omit_na <- na.omit(miRNA_nor)
> dim(miRNA_nor.omit_na)
The final resulting matrix is saved in the mRNA and miRNA variables. We use this input
for further operations.
> mRNA <- mRNA_nor
> miRNA <- miRNA_nor.omit_na
9 9.4 miRNA-mRNA Expression Correlation Coefficient
9.4.1 Finding One miRNA and mRNA Correlation
Let us calculate the correlation value of the mRNA matrix’s 2006th gene, the miRNA.
> ex <- c(2006, 625)
> x <- as.numeric(mRNA[ex[1], -1:-6]) #2006th gene expression
> y <- as.numeric(miRNA[ex[2], -1:-6]) #625th miRNA expression
> mRNA[ex[1], 5:6] #Checking gene information
> as.character(miRNA[ex[2], 6]) #Checking miRNA names
In effect, verify that the 2006th gene is GI_3807572-S (Mus musculus similar to ribosomal
protein S24(Loc380888)) and the 625th miRNA is “has_miR_489.”
> res <- cor.test(x, y, method="pearson")
> res
9.4 · miRNA-mRNA Expression Correlation Coefficient
163 9
Use the cor.test function to verify the applicable miRNA-mRNA correlation results as
shown below.
9.4.2 inding the Correlation Coefficient of All Pair-Wise mRNAs

F
of Several Genes
9.4.2.1 Create cor Function Input Matrix
> t_miRNA <- t(miRNA[ , -1:-6])
> x <- as.numeric(mRNA[i, -1:-6]) #The expression of the ith mRNA
> this_mat <- cbind(x, t_miRNA) #ADD the expression in the input matrix
> this_cor <- cor(this_mat, method = "pearson") #Calculate the correlation
using the cor function
t(transpose) is a function to convert the row to the column and vice versa. After convert-
ing sample information to column using sample information, arbitrarily input mRNA
expression into the matrix column-wise (. Fig. 9.1). With this input matrix, we can calcu-

late correlation using the cor function. We can repeat this process for 10 mRNAs.
Samples
mRNA
miRNAs miRNAs
Samples
Samples
miRNAs
t(transpose) add mRNA
.. Fig. 9.1 Create cor function input matrix

Repeat this process for 10 mRNAs. The input for the repeat process is as shown below.
> res <- c() #Reset a result variable
#Calculate a correlation coefficient of all pair-wise miRNAs for 10 genes
> start_i <- 2001 #Select the 2001st miRNA to the 2010th miRNA
> end_i <- 2010
> for (i in start_i:end_i) #Repeat as many times as there are number of mRNAs
x <- as.numeric(mRNA[i,-1:-6]) #Extract mRNA expression
this_mat <- cbind(x, t_miRNA) #Create input matrix
9 this_cor <- cor(this_mat, method = "pearson") #Calculate correlation
res <- rbind(res, round(this_cor[-1, 1], 5)) #Extract correlation values of selected
mRNAs and all pair-wise miRNAs
> rownames(res) <- mRNA[start_i:end_i, 5] #Extract selected mRNA gene symbols
> colnames(res) <- as.character(miRNA[ ,6]) #Extract selected miRNA names
As shown below, confirm the correlation result value.
> dim(res)
> res[1:10, 1211:1220]
We can gain the applicable correlation value of res with a size of 10(row)*2662(column),
in which the row is mRNA and the column is miRNA. Find the maximum and minimum
correlation values.
9.4 · miRNA-mRNA Expression Correlation Coefficient
165 9
#Finding the maximum and minimum correlation values
> max(res)
> min(res)
9.4.3 Find Significant miRNA-mRNA Pairs
Generally, miRNAs show a negative correlation with mRNA gene expression; search for
miRNA-mRNAs showing a negative correlation. In this practice, we find correlation coef-
ficients less than −0.7.
> sig_which <- which(res < -0.7)
> sig_which #Result: 6246
#Create a function covering a location into a two-dimensional array of rows and columns
> where_is <- function(x, matrix)
total_row <- dim(matrix)[1]
this_row <- x %% total_row
this_col <- ceiling(x/total_row)
position <- c(this_row, this_col)
position
In this case, the correlation coefficient is less than −0.7, the sequence index is 6246, and
when converted to rows and columns, it is the sixth row and the 625th column. At this
time, we can verify the gene name as below.
> this_position <- where_is(sig_which[1], res)
> res[this_position[1],this_position[2]]
> rownames(res)[this_position[1]] #mRNA: "GI_38075702-S"
> colnames(res)[this_position[2]] #miRNA: "hsa_miR_489"
# Synthesize results (when there are multiple significant pairs)
> gene_ind <- start_i-1+this_position[1]
> position_res <- c(gene_ind, this_position[2],
rownames(res)[this_position[1]], colnames(res)[this_position[2]],
res[this_position[1],this_position[2]])
> position_res <- noquote(position_res)
> position_res
9
9.4.4 Draw Significant miRNA-mRNA Plot
> x_at <- c(1:24) #X-axis: sample name
> x_str <- rep((1:12)*4, each = 2)
> y_m <- as.numeric(mRNA[gene_ind, -1:-6]) #Extract mRNA expression
> y_mi <- as.numeric(miRNA[this_position[2], -1:-6]) #Extract mRNA expression
> plot(x_at, y_m, type = "l", pch = 1, xlim = c(1, length(x)), ylim=c(0, max(y_m,
y_mi)), col = "red" ,axes = FALSE) #Draw mRNA expression
> axis(1, at = x_at, lab = x_str, cex.axis = 0.8) #Draw x-axis
> axis(2, cex.axis = 0.8, ylab = "expression") #Draw y-axis
> box() #Draw the edge
> points(x_at, y_mi, type = "l", col = "blue", xlim = c(1, length(x_at)), ylim = c(0,
max(y_m, y_mi))) #Draw miRNA expression

9.5 · Verifying a Significant miRNA-mRNA Correlations in Information Databases
167 9
miRNA-mRNA pair expression data
8
6
exprs
4
2
0
4 4 8 8 12 16 20 24 28 32 36 40 44 48
x_at
.. Fig. 9.2 miRNA-mRNA plot
A graph similar to the one below will be printed. The red line represents mRNA and the
blue line miRNA (. Fig. 9.2).

9.5 erifying a Significant miRNA-mRNA Correlations

V
in Information Databases
9.5.1 Verify mRNA Target Against Verified Experiments
Recently, (2014) TarBase v7.0 was released, and includes about 20 different species and
65,000 miRNA-mRNA target information. We can confirm if a significant miRNA-mRNA
pair has been verified experimentally. First, we use the site below to log on (. Fig. 9.3).
7 http://www.microrna.gr/tarbase

Gene or miRNA names are searchable. In the practice, we search for the target gene
“hsa-let-7a-5p”. The search results are similar to . Fig. 9.4.

If the target gene “hsa-let-7a-5p” is predicted via validation method (reporter gene
assay, Northern blot, Western blot) and the prediction algorithm microT provided by
Tarbase, then the scores are also supplied.
.. Fig. 9.3 Structure of TarbBase webpage
.. Fig. 9.4 The result page of “hsa-let-7a-5p“search

9.5 · Verifying a Significant miRNA-mRNA Correlations in Information Databases
169 9
9.5.2 Confirm Predicted Target Gene
A representative miRNA target prediction program, TargetScan predicts target genes of

miRNAs using seed match in the seed region.
TargetScan selects target genes in three main processes.
1. Target candidate selection: find mRNA 3’UTR rank (target candidate) that can bind
complementarily to miRNA seed site (8-mer or 7-mer).
2. Conservation confirm: confirm whether the target candidate site is conserved in
other species.
3. Prediction target ranking: rank target site candidates by type of target site, 3’UTR
pairing, position, etc.
We can log on to TargetScan (7 http://www.targetscan.org) (. Fig. 9.5).

In order to run the analysis, we select the species of interest and enter the gene name
or miRNA and click “Submit.” In this exercise, we attempt to predict the target gene of
human “miR-155-5p”.
Detailed Search Results provide information on predicted target gene name, represen-
tative transcript, type of miRNA target site, total context, and score. In the “Links to sites
in UTRs” column, you can view several miRNAs and their binding sites that target the
3’UTR region of the applicable gene (. Figs. 9.6, 9.7 and 9.8).

.. Fig. 9.5 Main page of TargetScan

.. Fig. 9.6 TargetScan Detailed Search Results of “miR-155-5p”

9
.. Fig. 9.7 Visualization of 3’ UTR regions searched by TargetScan

Bibliography
171 9
.. Fig. 9.8 TargetScan detailed search: In the “Target gene” column, click ID of gene symbol or
“Representative transcript” column; you can view detailed information linked to Ensembl
Exercises
[Exercise 1] - Calculate the correlation coefficient of the applicable gene and the entire miRNA, and
graph the correlation distribution. Also, find miRNA-mRNA correlation coefficients that are lower than
−0.7 or higher than 0.7.
[Exercise 2] - Verify the significantly correlated miRNA-mRNA pairs fouds in [Exercise 1] using TarBase
and TargetScan.
Take Home Message

55 The feature of two biological data, mi-RNA and mRNA.
55 How to identify correlation coefficient between miRNA and mRNA and interpret
the biological interaction.
Bibliography
1. Agarwal V et al (2015) Predicting effective microRNA target sites in mammalian mRNAs. Elife 12:4.
https://doi.org/10.7554/eLife.05005
2. Barbato C et al (2009) Computational challenges in miRNA target predictions: to be or not to be a true
target? J Biomed Biotechnol 2009:803069
3. Bartel DP (2009) MicroRNAs: target recognition and regulatory functions. Cell 136(2):215–233
4. Garofalo M, Croce CM (2011) microRNAs: master regulators as potential therapeutics in cancer. Annu
Rev Pharmacol Toxicol 51:25–43
5. Na YJ et al (2009) Comprehensive analysis of microRNA-mRNA co-expression in circadian rhythm. Exp
Mol Med 41(9):638–647
6. R Core Team (2016) R: a language and environment for statistical computing. R foundation for statisti-
cal computing. Vienna, Austria. URL https://www.R-project.org/
7. TarBase – http://www.microrna.gr/tarbase
8. TargetScan – http://www.targetscan.org
9. Vlachos IS et al (2015) DIANA-TarBase v7.0: indexing more than half a million experimentally sup-
ported miRNA:mRNA interactions. Nucleic Acids Res 43(Database issue):D153–D159
9
173 III
Network Biology,
Sequence, Pathway
and Ontology
Informatics
Contents
Chapter 10 Network Biology, Sequence, Pathway and

Ontology Informatics – 175
Chapter 11 Motif and Regulatory Sequence Analysis – 189
Chapter 12 Molecular Pathways and Gene ontology – 213
Chapter 13 Biological Network Analysis – 233

175 10
Network Biology, Sequence,

Pathway and Ontology
Informatics
10.2 Sequence Data Analysis – 177

10.2.1 Sequence Information – 177
10.2.2 Pair-wise Alignment – 178
10.2.3 Multiple Alignment – 178
10.2.4 Global Alignment – 178
10.2.5 Local Alignment – 179
10.3 Phylogenetic Tree Analysis – 179
10.4 Visualization for Sequence Analysis – 180
10.5 Biological Pathway Analysis – 181
10.6 Gene Ontology – 182
10.7 Biomedical Text Mining – 183
10.8 Biomedical Network Analysis – 184

https://doi.org/10.1007/978-981-13-1942-6_10
176 Chapter 10 · Network Biology, Sequence, Pathway and Ontology I nformatics

Genome data analysis includes not only analytical methodology and algorithms but also
extensive knowledge about life sciences. In this chapter, a short sequence analysis that
includes sequence motif analysis, phylogenetic tree analysis, and the importance of visual-
ization will be practiced in order to do a high valued analysis of genomic data. Biological
pathway and gene ontology (GO), which are extensively used techniques for analyzing
genomic data, will be explained in general terms. Biomedical knowledge-based text mining
techniques and fundamental background of biological network analysis will be studied.
10.1 Introduction
Biological sequence is the fundamental unit of genomic information, composed of ade-

nine, guanine, cytosine, and thymine. A sequence is expressed through transcription and
translation. Since the base order of sequence is becoming the basis for trait selection,
sequence data analysis is the most essential and fundamental work for life phenomenon
research. In order to establish life phenomena, sequence data have improved quantita-
tively and qualitatively due to development of a generalized method that is dependent on
experimental biological data, the success of the human genome industry, and develop-
ment of genomics “big data.” Attempts at discovery and establishment of new life phe-
nomena is continued through shared data and bioinformatics technologies. Basic analysis
of sequence data consists of sequence alignment. Sequence alignment is done in order
10 to normalize the correlation between an amino acid and base sequence, which can be
used to estimate how each sequence is related functionally, evolutionarily, and structur-
ally. A sequence alignment is used to search for sequence homology1 according to the
consensus between the comparison sequences or for predicting the correlation between
similar sequences or the relevant functional region. A phylogenetic tree analysis is also in
the realm of sequence alignment technique that analyzes the taxonomic system of newly
discovered organisms. Sequence alignment analysis compares the homologue of two dif-
ferent organisms with a common origin by head-to-head, leg-to-leg, or tail-to-tail align-
ment; however, it is limited with regard to finding the similarity between the two species
that have a long evolutionary distance.
Sequence alignment is also used in transcription factors and microRNA analyses,
which is an important role for controlling gene expression and repression. Transcription
factors read gene information from DNA and facilitate RNA polymerase to control tran-
scription. A single transcription factor is involved in multiple levels of gene expression
control. Sequence motif analysis is used to investigate transcription factors involved in
gene expression control. A protein sequence motif is composed of amino acid sequence,
which delineates a specific function in the protein structure that has a length of 5–25
base pairs. In order to activate RNA polymerase, a transcription factor combines with a
short sequence motif that is in the upstream regulation region. Based on the relationship
between the importance of the sequence motif and the evolutionary conservativeness of
the important sequence, a sequence motif can be identified by multiple sequence align-
ments between species.
1 Homology means similar DNA or protein sequences of an individual of the same or different
species. It is used to infer sequence function by searching for a highly homologous sequence
with a sequence of interest or to predict evolutionary correlation between sequences.
10.2 · Sequence Data Analysis
177 10
.. Fig. 10.1 Sequence Logo visualize frequency of a base or amino acid at each position of sequence in
a log value
Sequence motif search is based on sequence pattern analysis. A sequence logo is

the method of visualizing the sequence motif. It schematizes the various properties of
substitutive preference and information content and relative frequency of each sequence
position. It is also useful for analyzing a comparatively short sequence such as a protein
motif, and it is used more for indicating both conserved DNA binding or transcription
factor binding sites. After multiple alignments of related DNA, RNA, or protein sequences,
frequencies at each base position are converted to log scale, represented by the size of each
letter (. Fig. 10.1).

10.2 Sequence Data Analysis
10.2.1 Sequence Information
Before analyzing the sequence, it is essential to obtain basic sequence information. The
National Center for Biotechnology information (NCBI); PubMed, a massive archive of
biotechnology and medical paper indices; and GenBank, a genome sequence database,
contain an extensive amount of biotechnology information. Entrez is an integrated
retrieval system, providing not only DNA and protein sequence data, but also related
MEDLINE2 literature, the genomic data of GenBank, taxonomy, and tertiary protein
structure databases.
Sequence alignment can be classified into two different methods depending on the
input sequence number:
55 Pairwise alignment
55 Multiple alignment
Alternatively, it can be classified into two different alignments:

55 Global alignment
55 Local alignment
2 7 https://www.ncbi.nlm.nih.gov/pubmed/. More than 27 million MEDLINE references can

be searched by using the PubMed system. Also, it is possible to retrieve the full text from 25
academic journals from the website in the PubMed system.
10.2.2 Pair-wise Alignment
Pair-wise alignment is a method for indicating homology when there are two alignment
objects. The object of alignment can become a pair of amino acid sequence to amino acid
sequence or base sequence to base sequence. Pair-wise alignment is the most fundamen-
tal method for comparing homology. BLAST, which is the one of pair-wise alignment
databases, is mostly used for searching sequence homology in gene sequence database,
and it is used for comparing the similarity of two gene sequences. BLAST utilizes a heu-
ristic algorithm similar to FASTA to search for homology. BLAST provides online data
in European Bioinformatics Institute (EBI) and NCBI, but it can also run as stand-alone
program.
55 EBI BLAST: 7 http://www.ebi.ac.uk/Tools/psa/

55 NCBI BLAST: 7 http://blsat.ncbi.nlm.nih.gov/Blast.cgi

There are different types of BLAST, and NCBI provides the three methods:
55 BLASTN (Nucleotide BLAST): Comparison between the base sequences
55 BLASTP (Protein BLAST): Comparison between the protein sequences
55 BLASTX: Comparison between the protein sequence database after converting the
input base sequence into six frames.
10.2.3 Multiple Alignment

10
Multiple alignment is aligning and comparing the similarity of greater than three base
and amino acid sequences concurrently. It can be diversely applied to sequence family,
phylogeny, and domain analyses.
While BLAST is mostly used for the pair-wise alignment, Clustal is used mainly for
the multiple alignment involving aligns more than three sequences. Clustal is a general
purpose multiple sequence alignment program for DNA or proteins. By applying vari-
ous sequence analysis techniques, biologically valid multiple sequence alignment can
be performed. It gives relevance to similarities and differences between the sequences
after finding the optimal alignments of given sequences. Also, phylogenetic analysis
such as cladogram or phylogram can analyze the evolutionary relationships between the
sequences. There are several versions of Clustel: Clustal is the name of the most basic ver-
sion, a graphic user interface (GUI) version is called ClustalX, and a version that gives a
weight to each sequence is called ClustalW.
10.2.4 Global Alignment
Global alignment is a method used to search for the optimal alignment for two long
sequences. When distances of the comparative sequences are similar, it is suitable when
the overall similarity is extensive. A global alignment using the Needleman-Wunsch
algorithm as a representative method consists of three steps: (1) initialization; (2)
score calculation; and (3) alignment. Calculation scheme is as follows: match, add 1;
mismatch, subtract 1; indel, subtract 1; align sequences by which the highest score is
obtained.
10.3 · Phylogenetic Tree Analysis
179 10
.. Fig. 10.2 Performing global
alignment for two sequences a a a g c g g a a g t c a c a g
• • •
a a g g c t g a a g t - a t a g
.. Fig. 10.3 Performing local

alignment for two sequences a a a g c g g a a g t c a c a g
• • • • • • • • • •
a a g g c t g a a g t - a t a g
10.2.5 Local Alignment
A local alignment aligns only the strongest homology region instead of aligning all the
sequences within the two sequences. It is suitable for aligning the overall sequences that
have lower similarities, especially when the length of comparison sequences is quite dif-
ferent. The Smith-Waterman algorithm is most commonly used for the local alignment.
The Smith-Waterman algorithm is similar to the Needleman-Wunch algorithm, which is
used in global alignment, but the score calculation method is different. We can see dif-
ferent results for the same input in . Fig. 10.2 (global alignment by Needleman-Wunch

algorithm) and . Fig. 10.3 (local alignment by Smith-Waterman).

10.3 Phylogenetic Tree Analysis
A phylogenetic tree analysis of a sequence is an analogizing analysis of an evolutionary

relationship by drawing the evolutionary relationship tree of given sequences. It can be
used to analyze the correlation by measuring the molecular evolutionary distance between
the two organisms, and if the group of organisms are evolutionarily close, the organisms
will locate close to the stem. Taxon is an evolutionary relationship analysis parameter, and
it can be the same gene or species in some cases. In the evolutionary relationship analysis,
performing the phylogenetic tree analysis after aligning the sequence of DNA or proteins
is performed first (. Fig. 10.4).

It is possible to analogize the time of genetic mutation occurrence via a phylogenetic

tree analysis. As gaining the gene function can be the result of evolution and if the evolu-
tionary history of gene is reconstructed, then it is possible to predict the unknown gene
function. Performing various homolog analyses after creating the phylogenetic tree analy-
sis is really helpful for predicting the functional similarity of unknown genes based on
the structure of the phylogenetic tree and base sequence similarity. Of course, the base
sequence similarity is not assured with regard to the functional similarity since functional
differentiation is observed within the similarly sequenced gene. A vertically inherited
ortholog has higher functional similarity, whereas a horizontally replicated paralog has a
general functional differentiation.
Bacteria
Eukaryota
Archaea
.. Fig. 10.4 Phylogenetic tree created using genetic analysis
10.4 Visualization for Sequence Analysis
The widely used visualization tools for sequence analysis are UCSC Genome Browser,
10 NCBI Map Viewer, and Ensembl Genome Browser. Annotation information for each
special method and a visualization tool is provided with the genome coordinates. One
drawback for the visualization tool is that the basic genome structure of each different
genome database has the same base, but the annotation information describes many
other cases. A specific location for the genome coordinate system of the number of
genes, exons-introns, and structure show subtle differences between each other. Thus, it
is very important to indicate the information source when you use the sequence infor-
mation.
The most useful feature of the UCSC Genome Browser3 is that it provides annota-
tion information with a multiple layer track by arranging all of the data with the genome
coordinate system of the reference genome sequence. NCBI’s Map Viewer4 also provides
genome-centered data, but the UCSC Genome Browser is better with regard to ease and
utilization. The UCSC Genome Browser provides track with bundling of various related
annotation information. The UCSC Genome Browser provides not only the graphic data
but also the download of text data and the result file in HTML format. The UCSC genome
browser allows not only reception of individual gene information but automatically pro-
cesses extensive genomic data, which has required previous and considerable bioinfor-
matics knowledge.
3 7 https://genome.ucsc.edu/

4 7 http://www.ncbi.nlm.nih.gov/projects/mapview

10.5 · Biological Pathway Analysis
181 10
10.5 Biological Pathway Analysis
A biological pathway analysis outlines a biochemical mechanism or an interaction of bio-

logical components such as genes, proteins, or enzymes into a two-dimensional plane.
In general, pathways are structured with a node and an edge. A node represents genes,
proteins, or enzymes that are biological entities, and an edge is a line representing the
connection between the nodes. A pathway can be classified into metabolic and signaling
pathways. . Figure 10.5 is a pathway model of a well-known signal transduction pathway.

The construction of pathway databases had laid the foundation that placed biological
pathway data to be the core component of bioinformatics research for genomics data anal-
ysis. . Table 10.1 shows the list of known pathway databases. With recent increase in the

number of pathway databases, websites such as Pathguide (7 http://www.pathguide.org),
which classifies databases based on objectives, supported species, standards, et cetera,

emerged.
Due to the emergence of the pathway databases, various data analysis techniques have
been developed for the use of biological knowledge and information from pathways. As
explained in 7 Chap. 5, pathway-annotated over representation analysis and GSEA analy-

sis are core methodologies of genome data analysis. Text mining techniques have also
been developed based on pathway information. PharmGKB provides pathways based on
pharmacogenetics.
Pathways, the representative aggregate of biological knowledge, would only prove to
be more valuable in the field of bioinformatics.
Chemokines,
Hormones,
Survival Factors Transmitters Growth Factors
Extracellular
(e.g., IGF1) (e.g., interleukins, (e.g., TGFα, EGF)
Matrix
serotonin, etc.)
GPCR
Integrins
RTK RTK cdc42 Wnt
Fyn/Shc
PLC Grb2/SOS
Frizzled
PI3K G-Protein Ras FAK Dishevelled

Src
Akt Raf GSK-3β
PKC Adenylate
cyclase Hedgehog
Akkα MEK APC
Cytokine Receptor
NF-κB
PKA MEKK MAPK MKK
Cytokines IκB β-catenin
Patched
JAKs
(e.g., EPC)
STAT3,5 TCF
Myc: Mad: ERK JNKs
Bcl-xL Max Max β-catenin:TCF
Fos Jun
SMO
Cytochrome C CREB CycID Gli

p16
Rb CDK4
Caspase 9 p15
E2F
Gene Regulation
CyclE p27
Caspase 8 Apoptosis
ARF CDK2
p21
Cell
FADD mdm2
Bcl-2 Proliferation
p53
Bad
Mt Bax
Abnormality
FasR Sensor Bim
Death factors
(e.g. FasL, Tnf )
.. Fig. 10.5 Signal transduction pathway

.. Table 10.1 Biological pathway databases
Pathway databases Site
KEGG: Kyoto Encyclopedia of Genes 7 http://www.genome.jp/kegg/

and Genomes
BioCyc 7 https://biocyc.org/

PANTHER 7 http://www.panther.org/pathway/

TRANSPATH 7 http://genexplain.com/transpath/

BioCarta 7 https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways

Reactome 7 http://www.reactome.org

PharmGKB 7 https://www.pharmgkb.org/view/pathways.do

10.6 Gene Ontology
Ontology is a model for representing the vocabulary, concept, and relationships of a

genomic database. While gene ontology (GO) is not a perfect form of ontology, it is a type
of a structured vocabulary system. Many genome databases have been developed since
10 the 1990s. The administrators of different species-oriented genome databases used their
own language for genome annotation. The necessity of a standardized language was neces-
sary due to the development of the Internet and the increase in the number of genome
databases. Early GOs were developed to satisfy these demands and in the early 2000s,
GOs were developed as the fastest growing standardized terminology system among the
biomedical terminology systems. Nowadays, almost every gene in the genomic database
has been assigned to a GO.
As described above, GO is the standardized terminology representing genes. GO is
structured as a directed acyclic graph similar to the MeSH system. In other words, GO
resembles a tree structure, which has a tree-like graph. Tens of thousands of standardized
terminologies are arranged in the GO DAG file format node. GO consist of three catego-
ries: (1) biological processes (BP); (2) molecular function (MF); (3) and cellular compo-
nents (CC). . Figure 10.6 is a schematic of the GO structure for DNA metabolism, which

is under BP.
As you can see in the figure, each standardized terminology node is connected with
‘is_a’ and ‘part_of ’ relationship. Recently, some new relationships have been added, includ-
ing ‘regulates’. For instance, “DNA replication” is under “DNA metabolism” (. Fig. 10.6).
In the figure, it can be observed that the terminology called “DNA replication” is annotated
in the RNH3 and RNR1 enzymes, drosophila Rntl and Rnrs, and mouse Rccc1, Rrm1, and
Rrm2. GO gene annotation is independently included in each genomic database, but the
fact that these three species genes are connected to “DNA replication” function can be
logically concluded from the figure. More precisely since the gene is connected to GO
terminology, GO terminology is then connected to other gene again; thus, all genomic
databases are connected to each other, which creates a massive bi-partite graph structure.
GO annotation is the core source for using genome data analysis in a manner similar to a
biological pathway as shown in 7 Chap. 5.

10.7 · Biomedical Text Mining
183 10
DNA Metabolic process
DNA repair DNA recombination

CDC9mei-9 Lig1
Lig3
DNA modification DNA ligation
DNA replication CDC9DNA-lig l Lig1
RNH35 RntL Recc1 DNA-lig II Lig3
RNR1 RnrS Rrm1
mitochondrial Rrm2
DNA metabolic process
DNA–dependent
DNA replication
Mitochondrial DNA ligation involved in
DNA replication DNA dependent DNA replication
Negative regulation of
DNA endoreduplication DNA strand elongation
IS – a wg Fbxw7 involved in DNA replication
Part of Mrella
Smcla pcna DNA pol-a 180
Regulates relation Positive regulation of lagging strand
Positively regulated relation DNA dependent DNA replciation elongation
CDC2
Negatively regulated relation Regulation of DNA-dependent DNA unwinding leading strand DPB11
DNA replication involved in replication elongation POL2
Saccharomyces CDC28 Ino80
HFM1 dre4 Mcm2 CDC2 mus209 CDC9
MCM4 Mcm4 DPB11 DNA pol-β
Drosophila MHR1 Ciz1 MCM6 Mcm6 POL2 hay
NTG1
Mus RFA1 Peo1 Rad51
RFA3 Rad 51
.. Fig. 10.6 GO Biological process, “DNA metabolism”
GO terminology includes additional information: (1) references; (2) the data; (3) the
source of the annotation; and (4) the evidence code.5 GO has been led by Gene Ontology
Consortium (7 http://www.geneontology.org/). GO is used widely in the genome data

analysis, and it has 14 different objectives:

55 Ontology or annotation browser
55 Ontology or annotation search engine
55 Ontology or annotation visualization
55 Ontology or annotation editor
55 Database or data warehouse
55 Software library
55 Statistical Analysis
55 Slimmer-type tool
55 Term enrichment
55 Text mining
55 Protein interactions
55 Functional similarity
55 Semantic similarity
55 Other analysis
10.7 Biomedical Text Mining
Biological text mining is used to apply text-mining technology to biomedical science data
or the extensive amounts of molecular biological documents or papers. It has a central role
in the biomedical science field due to the huge increase in the amount of data. Biological
5 7 http://www.geneontology.org/Go.evidence.shtml

.. Table 10.2 Biological text mining tools categorized according to the purpose of the user
Source of biomedical data Functions used by the text Text mining tools
mining tool
Entity name Protein relations (binary relations COREMINE, iHOP, Chilibot

& interactions)
Entity identifier Function, annotation & localiza- GOCat, GoPubMed,

tion relations GOAnnotator
Entity pair & lists Gene group & lists analysis Chilibot, SimConcept,
GNormPlus
Protein sequence Protein sequence (mutations, tmVar

polymorphisms, modifications)
Document/full text, Text retrieval, classification, CiteSpace, askMEDLINE,

keyword, abstract clustering, similarity, ranking BabelMeSH, ReleMed
NL Question, Document/ Bio-entity tagging (genes, ABNER, LingPipe, KEX,

full text, abstract proteins, compounds, cell lines) ABGene
PMID Gene-disease association PubTator, Dnorm, COREMINE
Acronym, Term (Ontology) Acronym & term extraction Acromine
10
text mining techniques extract and manages life science knowledge from documented
data that is written in an easily conceivable language. PubMed in particular, is the big-
gest literature database in the biomedical science field; therefore, this PubMed data is the
most mined information. While abstracts are the most utilized, text mining has developed
to mine the full texts of documents ever since increased use of open access policy and
online publication has occurred. Krallinger and others have classified text-mining tools
that are developed in the biomedical science field into eight categories according to the
user’s question and purpose (. Table 10.2).

10.8 Biomedical Network Analysis
Network science, which is a part of complex system research, studies the intrinsic prop-
erty of networks represented with nodes and edges (link). Studying the complicated
system from the perspective of the correlation between the nodes significantly helps to
understand the network properties are generally embedded in the natural and social sci-
ences. Acclaimed scientists, including Albert Barabasi et al., analyzed and identified that
the Internet connection structure has a “scale-free” network structure, which caught many
eyes of academics in the field.
In order to understand the scale-free network, an understanding of degree distribu-
tion is required. An edge number indicates the number of links that one node has. If a link
number of one node is N-1 in a network that has n number of nodes, that particular node
is connected to all other nodes within the network. If the link number of all the nodes is
(N-1), then the relevant network receives a completely connected graph structure. Link
10.8 · Biomedical Network Analysis
185 10
Power Law Distribution

10−1
Number of nodes with k links
Very many nodes 10−1.5

with only a few links
10−2
P (k)
A few hubs with 10−2.5
large number of links
10−3
10−3.5
Number of links (k) 100.5 101 101.5 102 102.5 103

k
.. Fig. 10.7 Degree distribution of scale-free network
number distribution of a random network generally has median value and a symmetrical
structure that has both extreme nodes of either many or few link numbers.
Then, what would the structure of the Internet be like? On one hand, the Internet
seems to have been randomly created, but after precisely analyzing the link structure,
it turns out that the Internet structure is not a random network but rather a scale-free
structure as described previously. The link number distribution of a scale-free network has
a power law distribution. Thus, the hub node, which has a high link number, has a lower
link number distribution. It has a plenty of outside nodes that have less link numbers,
which yields a straight line distribution if it uses the log of the link number (. Fig. 10.7).
Thus, the Internet consists of a few hubs with many outside nodes. The formation func-
tion of the scale-free network can be explained with growth and preferential attachments.
When there is a new node in the network, it tends to connect to a surrounding relative hub
node, and the distributional scale free network is created by that result. Therefore, the new
Internet member tends to connect to the closed area hub to create this structure.
The scale-free network has high efficiency for the information exchange. In other
words, the average path distance for connecting from one node to other node is notice-
ably shorter compared to a random network, and this kind of phenomenon is called “small
world phenomenon”. The phenomenon, in which “there are many people living in the
world, but it is possible to connect to a specific person in an average of 6~7 connections”,
can be explained with the link number analysis of the scale-free network. An additional
property is network robustness toward random attack. A victim of random attack could
be the outside nodes that have many numbers, as opposed to the hub which are scarce.
Inversely, a network is vulnerable to selective hub attacks. For instance, an airline network
that consist of worldwide airports and air routes is a typical scale-free network, but the
entire airline network dimension is not affected by a random attack such as a natural
disaster. In contrast to this, it explains the phenomenon that the airline network is very
vulnerable to a selective hub airport attack from terrorists.
These phenomena are also observed in not only with the Internet but also with vari-
ous networks. Of special interest is the protein interaction network, which is a biological
network, and like the Internet, has a high information exchange efficiency as the structure
of it is a scale-free network, and robustness from random attack. One of other properties
is that the hub node is evolutionarily old. Since the network formation process, which
has been mentioned earlier, consists of developmental and preferential connections, it
corresponds with the direct interpretation that the older node (protein) will last longer.
Surprisingly, results of the protein interaction network analysis have shown that the hub
protein is evolutionarily old, and network research has proven that it has many essential
proteins that will be lethal when the hub protein is knocked out. Many studies have shown
that the scale-free network has a self-organizing property. A computer network such as
the World Wide Web, in addition to the life networks that represent genomic, metabolic,
and rapidly growing social connection networks, has a free-scale network property. This
property is part of a complex system network resulting from the development of social
network services.
Life network research has become possible due to identification of extensive pro-
tein/gene interactions after analyzing publication abstracts in PubMed over the past few
decades, a microarray that has made the massive transcriptome analysis available, and the
development of a massive genomic technology that is next generation sequencing tech-
nology (NGS). These days, network medicine, which originated from network biology,
has become a new, emerging field.
The reason why life network research is involved in the current life science system is
because of the possibility of overcoming the limits of existing reductive research, which
is a study that focuses on a specific life process. For example, genetic diseases that have
specific phenotypes are affected by not only the specific gene variation but also interac-
tions between gene variations. One interesting study in the medical network is “Network
10 Medicine: From Obesity to the Diseasome”, which was published in NEJM in 2007. This
paper proposed that the pathophysiology of obesity is the result of life network inter-
actions at various levels with propagation of obesity by social networks. “The Spread
of Obesity in a Large Social Network over 32 Years” (Christakis et al. 2007), which was
published together with the paper that is mentioned above, claimed that obesity can be
propagated socially, after analyzing the social network data of 12,067 people from 1971
to 2003. That is, the increment of a person’s weight is related to the weight increments of
their friends, brothers and sisters, spouse, and neighbors. From now on, network analysis
can be expected to be applied in various fields such as life’s structural and evolutionary
mechanisms, new medical development, and disease gene discoveries.
Take Home Message

55 The sequence analysis methods and knowledge-based data interpretation.
Bibliography
1. Ashburner M et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Con-
sortium. Nat Genet 25(1):25–29
2. Bader GD et al (2006) Pathguide: a pathway resource list. Nucleic Acids Res 34(Database issue):D504–
D506
3. Barabási AL (2007) Network medicine – from obesity to the “diseasome”. N Engl J Med 357(4):404–407.
Epub 2007 Jul 25
4. Barabási AL et al (2000) Scale-free characteristics of random networks: the topology of the world-
wide web. Physica A Stat Mech Appl 281(1–4):69–77
Bibliography
187 10
5. Ciccarelli FD et al (2006) Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science
311(5765):1283–1287
6. Crooks GE et al (2004) WebLogo: a sequence logo generator. Genome Res 14(6):1188–1190
7. Johnson M et al (2008) NCBI BLAST: a better web interface. Nucleic Acids Res 36(Web Server issue):
W5–W9
8. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res
28(1):27–30
9. Karolchik D et al (2012) The UCSC Genome Browser. Curr Protoc Bioinformatics Chapter 1:Unit1.4
10. Karp PD et al (2005) Expansion of the BioCyc collection of pathway/genome databases to 160
genomes. Nucleic Acids Res 33(19):6083–6089. Print 2005
11. Krull M et al (2006) TRANSPATH: an information resource for storing and visualizing signaling path-
ways and their pathological aberrations. Nucleic Acids Res 34(Database issue):D546–D551
12. Larkin MA et al (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23(21):2947–2948. Epub
2007 Sep 10
13. Letunic I, Bork P (2007) Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and
annotation. Bioinformatics 23(1):127–128. Epub 2006 Oct 18
14. Matthews L et al (2009) Reactome knowledgebase of human biological pathways and processes.
Nucleic Acids Res 37(Database issue):D619–D622
15. McDonagh EM et al (2011) From pharmacogenomic knowledge acquisition to clinical applications:
the PharmGKB as a clinical pharmacogenomic biomarker resource. Biomark Med 5(6):795–806
16. McWilliam H et al (2013) Analysis Tool Web Services from the EMBL-EBI. Nucleic Acids Res 41(Web
Server issue):W597–W600
17. Schaefer CF et al (2009) PID: the Pathway Interaction Database. Nucleic Acids Res 37(Database
issue):D674–D679
18. Tatusova TA, Madden TL (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide
sequences. FEMS Microbiol Lett 174(2):247–250
19. Wolfsberg TG (2011) Using the NCBI Map Viewer to browse genomic sequence data. Curr Protoc Hum
Genet Chapter 18:Unit18.5
189 11
Motif and Regulatory
Sequence Analysis
11.2 Sequence Data Structure – 190
11.3 Sequence Alignment and Phylogenetic

Tree Analysis – 193
11.3.1 Sequence Alignment – 193
11.3.2 Sequence Alignment Practice – 196
11.3.3 Phylogenetic Tree Analysis – 199
11.4 Transcription Regulatory Site Prediction

Using Sequence Alignments – 202
11.4.1 Transcription Factors and Sequence Motifs – 202
11.4.2 Conserved Sequence Region Detection – 202
11.5 UCSC Genome Browser – 206
11.6 Prediction of Targeting microRNAs – 209


https://doi.org/10.1007/978-981-13-1942-6_11
190 Chapter 11 · Motif and Regulatory Sequence Analysis

In this chapter, we will learn and practice (1) type and method of sequence alignment (2)
sequence motif searching, sequence logo creation, and phylogenetic tree analysis (3) pre-
diction of transcription factor and microRNA (miRNA) binding sites involved in gene regula-
tion (4) visualization and exploration of sequence annotations using a genome browser.
11.1 Introduction
‘Gene finding’ and ‘transcription regulatory site finding’ through analysis of a given
genomic base sequence are the most fundamental informatics practices. There are still
many unsolved problems in gene finding, but transcription regulatory site finding is an
even harder challenge due to the complex patterns and short sequence lengths. Sequence
analysis algorithms are also used in evolutionary phylogenetic correlation analysis, and
such correlation leads to building phylogenetic trees through the comparison of gene,
protein, and RNA sequences. A peptide sequence comparison is used to find evolution-
arily conserved sequences of related proteins. These conserved sequences often provide an
essential function of the protein. Therefore, one strategy for identifying functional non-
coding regions like transcription regulatory sites is to find consensus sequences of bases
conserved across multiple species. Further analysis of the consensus sequence can infer
the function of that sequence region. ‘Phylogenetic analysis’ is a term used when building
and analyzing relationships of sequence similarities.
In this exercise, we will repeat the process of producing a major result1 reported by
Cliften et al. We use the Cluster Omega software, which applies a multiple sequence
alignment technique, to search for transcription factor binding sites (TFBS) in the yeast
11 genome. Also, we will learn to use various tracks in the UCSC Genome Browser to visual-
ize related information that you find. Finally, we will practice making miRNA target gene
predictions.
11.2 Sequence Data Structure
This practice uses sequence data from the yeast genome. This data is composed of four
DNA bases, adenine (A), guanine (G), cytosine (C), and thymine (T).
In order to do sequence analysis, sequence data acquired from the experiment needs
to be converted into a FASTA format. Sequence alignment algorithms such as BLAST
and Clustal Omega, used for practice, widely support input and output of FASTA format.
FASTA consists of two parts; the sequence and the annotation about the sequence. An
annotation line must start with the symbol “>”, and the actual sequence data must start
right below the annotation line. The annotation line can include Entry Name, Molecule
Type, Gene Name, and Sequence Length. Entry Name is required, while the others are
optional. Molecule Type, Gene Name, and Sequence Length are also recommended to be
included. Sequence data may be DNA bases or amino acids. . Figure 11.1 shows an exam-
ple FASTA sequence with the gene name of ‘YDR374C’ and a sequence length of ‘362’.
1 Cliften P et al (2003) Finding functional features in Saccharomyces genome by phylogenetic

footprinting. Science 301:71–76.
11.2 · Sequence Data Structure
191 11
.. Fig. 11.1 A sample FASTA
format
In the practice below, we will convert sequence data into FASTA format. The sequence
data that will be used in this practice is from Cliften et al., and DNA base sequence of the
gene YDR374C from four types of yeast.
* Create FASTA files.
1. Assume that the following sequence data is obtained from an experiment.
GGAAGAATGTTAGGAACTGTTGCTATTGTTGTACTTTGGTTATACGACAGTAAGTAAC
GTTGACTTGGTGACCGAAAATAGACACGAAATCGCTACCCGTTTCCCCAGAATATCACT
CCTCACGATGTACCTCGGCGGCTAATCTTTTTGGTAGCCTTTTGTGATATATATATAAA
TAAATAAGTATACATACATATATATATATATATATTTATACAGCTACATTGTTTTCCTC
CAAAATTTTCTGTTGGTTATGAATCGCAAAAGAAGTTTTCAGATTGTGTCCTCTGTTAC
TATTTCGTTAAGAAAGGAAGATATCGTCTACGGCTGGTGTGACGTAAGTATTGCGTTGT
GCTCTAAAA
2. Open the notepad of text editor.

3. In order to annotate the sequence with its species, type in the first row of the editor:
“>S.cerevisiae”.
4. Enter the sequence data on the next row.
5. Select [File] -> [Save As …] in Notepad to open the save window.
6. Set ‘File Format’ to “All files (*.*)” and the filename as “cerevisiae.fasta”, then save.
7. Similarly, save the following sequence as “bayanus.fasta” with the annotation line
“>S.bayanus” following the steps 3 to 6 above.
TAAAACCCTCAAGAACTCTTGACACTACTGTGCTCTGTCTTCTTATTAAATGTAGAAGC
ATTTGCCTAAAGTAAACAAGAATAAATATACTGCATGGGGCTACCCGTTCCATATGATA
TCATCGGTCACGAAGTGTCGGCGGCTAATTTAGAGTACGCCTTTTGTGATATATATATA
TATATATATATACATAGAATGAACTACCGCTATTTTAAAACTCTTTTTGGTGGCTATGA
TTGCAGAAAAAGTGTCTAATAATAAGTGTGTTCTGTCACTTTGAGAAAGAATATTGCA
TATACGGTAAACAGTGGTGTGAGCTTTCTATTTTTTATTTTAAGAAAT
8. You can also put many different sequences in one FASTA file as below. Using
sequence data from all four species of yeast, create a new FASTA file called
“saccaharomyces.fasta”.
>S.cerevisiae
GGAAGAATGTTAGGAACTGTTGCTATTGTTGTACTTTGGTTATACGACAGTAAGTAAC
GTTGACTTGGTGACCGAAAATAGACACGAAATCGCTACCCGTTTCCCCAGAATATCACT
CCTCACGATGTACCTCGGCGGCTAATCTTTTTGGTAGCCTTTTGTGATATATATATAAA
TAAATAAGTATACATACATATATATATATATATATTTATACAGCTACATTGTTTTCCTC
CAAAATTTTCTGTTGGTTATGAATCGCAAAAGAAGTTTTCAGATTGTGTCCTCTGTTAC
TATTTCGTTAAGAAAGGAAGATATCGTCTACGGCTGGTGTGACGTAAGTATTGCGTTGT
GCTCTAAAA
>S.bayanus
TAAAACCCTCAAGAACTCTTGACACTACTGTGCTCTGTCTTCTTATTAAATGTAGAAGC
ATTTGCCTAAAGTAAACAAGAATAAATATACTGCATGGGGCTACCCGTTCCATATGATA
11 TCATCGGTCACGAAGTGTCGGCGGCTAATTTAGAGTACGCCTTTTGTGATATATATATA
TATATATATATACATAGAATGAACTACCGCTATTTTAAAACTCTTTTTGGTGGCTATGA
TTGCAGAAAAAGTGTCTAATAATAAGTGTGTTCTGTCACTTTGAGAAAGAATATTGCA
TATACGGTAAACAGTGGTGTGAGCTTTCTATTTTTTATTTTAAGAAAT
>S.mikate
GGACGACTCTAAAAAATGTTGTCACTGCAGCATTTTGGTTTAAGCGAGAGTTAATTATG
TTGGTCTGAGCAACCAAAAATAAACAGTTCAAGTGTTGCTACCCGTTTTTGCAGTTAAG
ATCACTTACCACGGATAAGTATCGGCGGCTAATCCTCATGGGACGCCTTTTGTGATATA
TAAATACATGCATCTAGTGAAACCTTTTCTTCAAAATTCACTCGCTGACTATAAGCCCC
AAACAGAAGCTTTAAAACTACGTATTCTACTACTAATTGATTAGAAAATATCACTTCAT
ACACGGTTGAAGTGGCTTAAGCATTGTTTGTGCTTGAAAAAT
11.3 · Sequence Alignment and Phylogenetic Tree Analysis
193 11
>S.kudriazevii
GAGATTATTTAGTAACTTTGTTGCTACACTACCTCTTTATACGAGAATTGATAGGATTG
ACCAAAGCATCTAGGATAAATAAGATGTGAATGTATTACCCGTTTTGTATTCAAGATCA
CCTCTCACGGAGGGGTTTCGGCGGCTAATCGTTATTAGCGCCTTTTGTGATATGCGTAT
AAATAAAGTGACTACTTCTAGCTTCAAAAAATTGCTTACTGCTATACCCCTCGCTCTAA
GCGCGAAGTTTCAAAATTGTCTGTTCTACCATTCCTTGGTTAAGAAAATACTGCTAGGG
TGGTGTGAACATTGTCTTGTGCTTGAGAAAT
11.3 Sequence Alignment and Phylogenetic Tree Analysis
11.3.1 Sequence Alignment
In this chapter, we will explain how to perform homology analysis using sequence align-
ments. Sequence homology refers to the similarity of DNA base or amino acid sequences
between individuals or between species. A sequence alignment quantitatively and visually
represents correlation of bases, which can be used to infer a degree of functional, evolu-
tionary, and structural relationships among sequences. Sequence alignment is similar to
aligning heads, legs and tails to compare organs. Finding sequences that are high homolo-
gous with your sequence allows prediction of evolutionary correlations or the inference of
function for your sequence.
First, we introduce sequence alignment techniques and practice a homology analysis
using the sequence alignment algorithms BLAST and Clustal Omega. BLAST and Clustal
Omega are available on NCBI and EBI websites. Refer to 7 Chap. 1 7 Sect. 1.2 for funda-

mental principles of sequence alignment explanation.
11.3.1.1 Pairwise Alignment

A pairwise alignment is the most fundamental method to compare homology comparison
(. Fig. 11.2). BLAST is a widely used program for determining the homology between one

pair of sequences. It has the disadvantage of not considering sequence gaps, which has
overcome with BLAST2.
.. Fig. 11.2 Example of

pairwise alignment
Input the sequence data
Calculate the distance between all pairs of sequences
Plot the phylogenetic tree
Sort sequences in the order of the closeness to each other
Yes
Any sequences left to sort?
No
Output the sorted sequences and the phylogenetic tree
.. Fig. 11.3 Clustal Omega calculation algorithm
11.3.1.2 Multiple Alignment

A multiple alignment compares the homology of three or more sequences of similar
11 lengths at the same time. A representative multiple alignment program is Clustal Omega,
freely available for Unix/Linux, Mac, and Windows. The Unix version is appropriate for
users who must deal with massive sequence data, and the Windows version is appropri-
ate for users who only need to deal with a few genes. Clustal Omega can be downloaded
from 7 http://www.clustal.org/omega/#Download. The algorithm it uses is outlined in

. Fig. 11.3.

When you run Clustal Omega, the following five result files are generated:
– *.input input sequences

– *.output comparison analysis result file
– *.clustal input sequences in Clustal format
– *.ph phylogenetic tree
– *.pim Percent Identity Matrix, which contains the correlation between sequences
< Workflow >

“*.input” denotes the input files or the sequence information. “*.output” shows the results
of comparing various sequences, and “*.clustal” shows the results of multiple alignment.
. Figure 11.4 shows the Clustal Omega Web service provided by EBI, and . Fig. 11.5 explains

the “*.output” result file. The “*.ph” file is a phylogenetic tree file, and a detailed explanation
will be provided in 7 Sect. 11.3.3. Multiple Alignment practice also will be performed.

This utility provides a colored representation of the aligned sequences (. Fig. 11.5).
. Tables 11.1 and 11.2 shows color coordination for aligned sequences and symbols denot-
ing conserved sequence.
195 11
.. Fig. 11.4 Clustal Omega web service from EBI (7 http://www.ebi.ac.uk/Tools/msa/clustalo)

.. Fig. 11.5 Explanation of multiple sequence alignment results obtained from Clustal Omega
.. Table 11.1 Key for color coding of clustal omega sequence alignment
AVFPMILW Red Small (smlaa + hydrophobic (incl. aromatic-Y))
DE Blue Acidic
RK Magenta Basic-H
STYHCNGQ Green Hydroxyl + sulfhydryl + amine + G
Others Grey Unusual amino/imino acids etc.
.. Table 11.2 Key for clustal omega conserved sequence information
* Matching, conserved sequence
: Conserved substitution with similar properties
. Semi-conserved substitution
(Space) Not conserved
11.3.1.3 Global Alignment and Local Alignment

Sequence alignments can be divided into two different types depending on the compari-
son being performed.
55 Global Alignment: Alignment on the entire sequence
11 55 Local Alignment: Alignment on a portion of a sequence
There are many different algorithms for each type, such as the Needle-Wunsch (global)
and Smith-Waterman (local) algorithms. Refer to their Wikipedia pages2 for detailed
explanations. Next, you will perform both global and local alignments of sequence pairs
and learn the practical differences between the two methods. This will be done with
EMBOSS_Needle and EMBOSS_Water web services provided by the EMBOSS-ALIGN
alignment tool3 from EBI.
55 EMBOSS_Needle: Performs global alignments using the Needle-Wunsch algorithm
55 EMBOSS_Water: Performs local alignments using the Smith-Waterman algorithm
11.3.2 Sequence Alignment Practice
11.3.2.1 BLAST
* Understand the use of global and local pairwise sequence alignment
1. Go to “Tools” > “Pairwise Sequence Alignment” Pairwise Sequence Alignment of EBI
webpage in order to do pairwise sequence alignments (. Fig. 11.6). (http://www.ebi.

ac.uk/Tools/psa)
2 7 http://en.wikipedia.org/wiki/Sequence-alignment
3 EMBOSS (European Molecular Biology Open Software Suite).
197 11
.. Fig. 11.6 Website of pairwise sequence alignments
2. Under “Global Alignment” > “Needle Tools”, select “Nucleotide”

3. In STEP1, you can input two sequences that you want to compare. You can either
upload selected FASTA files or paste FASTA format sequences directly into the input
fields. For this exercise, upload the cervisiae.fasta (top) and bayanus.fasta (bottom)
files. If you have difficulty uploading the files, simply past the FASTA format data
directly.
4. Select “Submit” to run the alignment.
5. Check the result (. Fig. 11.7).

6. After opening a new window, run the method from 1) to 5), but instead select “Local
Alignment” > “Water Tools” > “Nucleotide” in part 2) (. Fig. 11.8).

7. Check differences between the results of global alignment and local alignment
(. Figs. 11.7 and 11.8).

* Practice multiple sequence alignment using Clustal Omega

1. Go to the EBI Clustal Omega webpage. 7 http://www.ebi.ac.uk/Tools/msa/clustalo

2. Select a set of type and upload the saccharomyces.fasta file in STEP1 (Upload the
sequences to be compared as a set).
3. Click “Submit”. If you want the results to be sent to your email, then check the
appropriate box in STEP3 and input your email address.
4. If you select “Show Colors” in the result tab “Alignments”, then you will see color-
coded values. Refer to . Table 11.1 for the color key (. Fig. 11.9).

11
.. Fig. 11.7 Result of sequence global pairwise alignment
5. Check the four result files given in “Result Summary” (. Fig. 11.10). The Percent

Identity Matrix gives a comparison score between each sequence pair. The homology
of two sequences is greater for higher scores.
6. Click “Start Jalview” under “Jalview” in the “Result Summary” tab. Jalview4 visualizes
the degree of consensus across sequences (. Fig. 11.11). It is a multiple sequence

4 7 http://www.jalview.org, Development of Jalview is supported by BBSRC from 2009 to 2014, and it

is administrated by Geoff Barton from University of Dundee.
199 11
.. Fig. 11.8 Result of local pairwise alignment
alignment editor provided through a JAVA applet, and it requires that the Java
Virtual Machine (JVM)5 is installed on the user’s computer.
11.3.3 Phylogenetic Tree Analysis
There are three types of phylogenetic tree analysis:

55 Cladoram
55 Phylogram
55 Ultrametric Tree
A cladoram does not limit the lengths of branches in the tree. Cladorams only return
approximate classifications. In phylograms, the length of a branch is based upon the
genetic relationship between taxa (e.g., among various biological species or entities), where
taxa on shorter branches are genetically closer to one another. Finally, an ultrametric tree
5 7 http://www.java.com/en/download/index.jsp
11
.. Fig. 11.9 Screenshot of alignment results from clustal omega
focuses more on the time-based relation of taxa than their evolutionary relationships, with
the length of each branch indicating when the two taxa diverged.
* Creating a Phylogenetic Tree
1. Open the Clustal Omega webpage given above.
2. Select “DNA” under Step 1 and upload or paste the results file from [Example 11.3]
into the input field.
3. Click “Submit”. If you want the results to be sent to your email, then check the
appropriate box in STEP 3 and input your email address.
4. Results will be shown as both graphic cladogram and as text (. Fig. 11.12). In the text

notation, parentheses group taxa together by their relavance. For instance, S. mikate
and S. kudriazevii are in same parentheses set, which means they are most closely
related.
201 11
.. Fig. 11.10 Screenshot of clustal omega “Result Summary” tab
.. Fig. 11.11 Visualization of clustal omega results using Jalview
5. Clicking on the “Cladogram” and “Real” option buttons will change the graphical tree
from cladogram to phylogram. As shown in . Fig. 11.12, S. mikate and S. kudriazevii

have high homology, compared to the others.

.. Fig. 11.12 Phylogenetic tree

results
11.4 Transcription Regulatory Site Prediction

Using Sequence Alignments
11 11.4.1 Transcription Factors and Sequence Motifs
A transcription factor plays an important role in controlling gene expression. It binds to a

short base sequence called a transcription factor binding sites (TFBS).
In this exercise, you will find a conserved sequence region (motif) using sequence
alignment and identify transcription factors binding to a particular motif using a motif
database.
11.4.2 Conserved Sequence Region Detection
* Predict Transcription Factor Binding Site (TFBS)

55 Programs:
Clustal Omega (7 http://www.ebi.ac.uk/Tools/msa/clustalo/)

Saccharomyces Genome Database (SGD) (7 http://www.yeastgenome.org)

Saccharomyces Cervisiae PRomoter Database (SCPRD)

(7 http://rulai.cshl.edu/SCPD/searchmotif.html)

1. Go to the multiple sequence alignment results using Clustal Omega

2. Check for conserved sequence regions (. Fig. 11.13), which are coded yellow, and

include sequences like “TACCCG,” “TCGGCGGCTAAT,” and “GCCTTTGTGATAT.”

11.4 · Transcription Regulatory Site Prediction Using Sequence Alignments
203 11
.. Fig. 11.13 Multiple sequence alignment results and conserved sequence regions
.. Fig. 11.14 Motif search input

screen
3. Copy and paste a selected conserved region into the Motif Database of SCPD.6 For
example, enter “TACCCG” in the motif field. Click “Submit” after entering 0 for
“Allowed mismatches” (. Fig. 11.14).

4. A list of transcription factors which bind the motif “TACCCG” is returned

(. Fig. 11.15).

6 SCPD (Saccharomyces cerevisiae PRomoter Database).

.. Fig. 11.15 Motif search

results
11
.. Fig. 11.16 Results of motif

search with “TCGGCGGCTAAT”
and “GCCTTTGTGATAT”
5. Repeat the motif search with “TCGGCGGCTAAT,” and “GCCTTTGTGATAT”

(. Fig. 11.16). There should be one found for “TCGGCGGCTAAT” and no result for

“GCCTTTGTGATAT” (. Fig. 11.16).
6. Finally, perform the motif search again for “TACCCG”. Now, let’s analyze the search
results.
“Factor” means the transcription factor, and “gene” is the gene controlled by that tran-
scription factor. Text in black, such as “TTATTACCCG,” shows the full transcription fac-
tor binding site, while below it in red is the aligned input motif, “TACCCG”.
7. Among 15 search results, there are three having only one match (MCM1, PHO4, and
GRF2). Among these three, GRF2 has one TFBS (TTATTACCG) also listed under
REB1. Since REB1 and GRF2 have the same TFBS, they are assumed synonyms.
8. Go to the SGD website (7 http://www.yestgenome.org) and search for GRF2

(. Fig. 11.17). You can see that GRF2 is listed as an alias of REB1

11.4 · Transcription Regulatory Site Prediction Using Sequence Alignments
205 11
.. Fig. 11.17 GRF2 search results from the Saccharomyces genome database
.. Fig. 11.18 The consensus

motif “YYACCCG” of REB1
9. Back in the motif search results, click REB1 to bring up its page, then “Get consensus” to
confirm that the “TACCCG” motif is the transcription factor’s binding site (. Fig. 11.18)

10. The consensus is “YYACCG”, where “Y” is the IUPAC7 notation for any pyrimidine
(C or T). If you look back to the multiple alignment corresponding to the first
conserved region, you see that S. cerevisiae, S. bayanus, and S. mikate species have
“CT” while and S. kudriazevii has “TT”. We can conclude that this conserved
sequence is related to REB1 binding.
7 IUPAC (International Union of Pure and Applied Chemistry).

.. Fig. 11.19 Searching for REB1 in the UCSC genome browser
11
.. Fig. 11.20 UCSC genome browser gene search results
11.5 UCSC Genome Browser
The greatest advantage of the UCSC Genome Browser8 is providing detailed annotation
information and other relevant data, which are centered around the reference genome
sequence. The NCBI Map Viewer9 also provides genome sequence-based information,
but the UCSC Genome Browser is simpler and generally more useful. The UCSC Genome
browser provides each various annotations as tracks. In addition to the graphic output
screen, ASCII text data and HTML format reports are freely downloadable. The UCSC
Genome Browser can provide a large amount of information required for genome data
analysis.
1. Go to the UCSC Genome Browser website given above.
2. Click on “Genome Browser”, then select Yeast under “Popular Species” at the top
3. Select “June 2008 (SGD/sacCer2)” as the assembly and enter “REB1” as the search
term, then click “Submit” (. Fig. 11.19).

8 7 http://genome.ucsc.edu/
9 7 http://www.ncbi.nlm.nih.gov/projects/mapview
11.5 · UCSC Genome Browser
207 11
.. Fig. 11.21 UCSC genome browser genome view
.. Fig. 11.22 All tracks except backbone collapsed in genome browser
A list of search results is returned (. Fig. 11.20). In some cases, a lot of candidates are

shown. Check the first gene’s name and explanation, then select it.
4. Look at the layout of the genome browser page (. Fig. 11.21).

The genome base position numbers are at the top of the view. There are several tracks
below, each providing relevant information. Each track type has a control panel (dark
blue horizontal bands in . Fig. 11.21), and when you click the “refresh” button, changed

options for tracks are applied.

5. Tracks currently visible in the genome view are determined by default parameters of
the UCSC Browser. If you click “collapse all” below the genome viewer, all tracks will
disappear except the genome backbone (. Fig. 11.22).

.. Fig. 11.23 Open ‘SGD Genes’ track to visualize REB1 gene in genome browser
11
.. Fig. 11.24 Description page for REB1 in UCSC genome browser
.. Fig. 11.25 REB1 DNA binding motif from CHIP/CHIP experiments
6. Under “Genes and Gene Predictions”, select “pack” in the menu for “SGD Genes”.
Click any “refresh” button.
7. REB1 information is added back into the genome view (. Fig. 11.23). Track control

panels have up to five options (hide, dense, squish, pack, full), and more detailed
information is provided in the viewer as you go from dense to full. It is really conve-
nient for controlling display of EST or SNP data. The “pack” option is generally used.
11.6 · Prediction of Targeting microRNAs
209 11
8. UCSC Genome Browser also contains information about a transcription factor’s bind-
ing sites. First, open the REB1 information page by clicking REB1 on the side of the
genome view (. Fig. 11.24).

9. In the section titled “DNA Binding Motif from CHIP/CHIP Experiments, the motif
“CGGGTAA” is shown. This is complementary to the sequence motif “TACCCG” that
was used in [Example 11.5] (. Fig. 11.25).

11.6 Prediction of Targeting microRNAs
* predicting microRNAs that target a gene

1. Go to the DIANA-microT Webserver (7 http://diana.imis.athena-innovation.gr/

DianaTools/index.php?r=MicroT_CDS/index)
2. Search for ‘ENSG00000110092”. Select one of miRNAs, which has a green box in the
“Also Predicted” column, indicating it has been verified experimentally (. Fig. 11.26).
3. The details page provides gene name, miRNA details, and methods information.
Click the circled “i” colored in grey, next to the gene name for more details.
4. Review the detailed information (. Fig. 11.27).

5. Similarly, details for the targeting miRNA can be obtained from the grey circle next
to an miRNA name (. Fig. 11.28).

.. Fig. 11.26 Results from searching ‘ENSG00000110092’ in the DIANA-microT Webserver

.. Fig. 11.27 Gene details for ‘ENSG00000110092’ from the DIANA-microT Webserver
11
.. Fig. 11.28 Detailed miRNA info for has-let-7b in the DIANA-microT Webserver
Bibliography
211 11
Exercises
[Exercise 1] - Using the myosin protein sequence of human, chicken, and other provided species, work
through the following problems: (sequences are in file /gda/ch11/myosin.fasta).
.1 Perform multiple sequence alignment using Clustal Omega
1
1.2 Find the species whose MYH9 myosin protein is closest to human using the hierarchical tree dia-
gram
1.3 Search for MYH9 in the UCSC Genome Browser
1.4 Hide all tracks, then check if there are myosin ESTs from brain tissue by using the “Human ESTs”
track in the “mRNA and EST” group
55 Select pack for “Display mode,” red for “Filter color”, and brain for “tissue”
1.5 Hide the Human ESTs track again, and show SNPs in MYH9
55 Use the SNPs track under “Variation and Repeats”
Take Home Message

55 How and why to perform sequence alignment and phylogenetic tree.
55 How to predict transcription regulatory site using sequence alignments.
Bibliography
1. Cliften P et al (2003) Finding functional features in Saccharomyces genome by phylogenetic footprint-
ing. Science 301:71–76
2. Clustal Omega – http://www.ebi.ac.uk/Tools/msa/clustalo
3. DIANA-microT – http://diana.imis.athena-innovation.gr/DianaTools/index.php?r=MicroT_CDS/inde
4. http://en.wikipedia.org/wiki/Sequence-alignment
5. Leung W (2008) Identifying regulatory regions using multiple sequence alignments. http://www.nslc.
wustl.edu/courses/archives/Bio5924/elgin/MSA_Intro_rv1.pdf
6. Maragkakis M et al (2009) DIANA-microT web server: elucidating microRNA functions through target
prediction. Nucleic Acids Res 37:W273–W276
7. Saccharomyces Cervisiae PRomoter Database (SCPRD) – http://rulai.cshl.edu/SCPD/searchmotif.html
8. Saccharomyces Genome Database (SGD) – http://www.yeastgenome.org
9. UCSC Genome Browser – http://genome.ucsc.edu/
213 12
Molecular Pathways
and Gene Ontology
12.3 Gene Ontology – 214

12.3.1 Search GO Annotation for a Single Gene – 214
12.3.2 Search GO Annotations for a Gene List – 216
12.3.3 Calculation of the Semantic Distance Between Genes – 217
12.3.4 GO Annotation Analysis Using R – 218
12.4 Biological Pathway Analysis – 224

12.4.1 Search a Biological Pathway of a Specific Gene – 224
12.4.2 Search Significant Biological Pathways of a Gene List – 224
12.4.3 Data Analysis Using Gene Expression Data – 225
12.4.4 Biological Pathway Analysis Using R – 226
12.5 Biological Text Mining – COREMINE – 229


https://doi.org/10.1007/978-981-13-1942-6_12
214 Chapter 12 · Molecular Pathways and Gene Ontology

This chapter covers several topics: (1) understanding biomedical data and knowledge
resources, such as gene ontology and biological pathways, (2) the practical use of these
systems, and (3) biological text mining based on biomedical resources.
12.1 Introduction
With the development of genomic technology, large-scale genomic data accumulated and
various bioinformatic analysis methodologies were developed. The analytical approach
patterns or classifies genomic data, and semantically interprets of such data. A variety of
additional data such as life science data are needed. Gene ontology (GO), biological path-
ways, and literature information are used for analysis of large-scale genomic data. Various
algorithms and biological text mining techniques are developed for the analysis of the
large-scale genomic data.
After conducting biological text data mining, we will try to understand the semantic
interpretation process of genomic data based on the biomedical knowledge in 7 Chap. 12.
12.2 Prerequisites
This section will use acute lymphoblastic leukemia (ALL) data by Chiaretti S et al. (Blood
2004). We will use the 20 selected genes in the paper instead of using microarray data of
ALL. Twenty genes were selected: (1) IL-8; (2) CHC1L; (3) AHNAK; (4) SEC31B-1; (5)
MEF2A; (6) CAT; (7) MYC; (8) HNRPH1; (9) DEK; (10) CDC7L1; (11) BUB1B; (12)
H2AFX; (13) CENPF; (14) KIAA0175; (15) HEC; (16) CD2; (17) USP1; (18) TTK; (19)
CD8; and (20) LRMP.
12 * Gene Ontology Tools
AmiGo (7 http://amigo.geneontology.org/amigo)

G-SESAME (7 http://bioinformatics.clemson.edu/G-SESAME/)

* Pathway Tools
Reactome (7 http://www.reactome.org/)

* Text Mining Tools

COREMINE (7 https://www.coremine.com/)

12.3 Gene Ontology
12.3.1 Search GO Annotation for a Single Gene
AmiGO1 is a tool to retrieve and visualize information related to GO and GO annotations

provided by the GO consortium. In this section, we will use AmiGO to practice GO struc-
ture and GO annotation mechanisms. The gene symbol IL-8 should be entered in the
search box and then click “genes and gene products” (. Fig. 12.1).

Currently, GO provides annotation information for 319 species of genes, yielding search
results for IL-related genes or proteins across a variety of species. When an icon such as “Protein
1 7 http://amigo.geneontology.org/amigo

12.3 · Gene Ontology
215 12
.. Fig. 12.1 AmiGo input screen (IL-8)
.. Fig. 12.2 A list of GO terms related to IL-8 in AmiGO
from Homo sapiens” on the right side of each row or “Interleukin-8” is selected, we can obtain
information on human Interleukin-8, including name, type, source database, and sequences.
To search GO annotation for a given gene, select “# associations” on the left of “Protein from
Homo sapiens.” You will find annotated GO terms for IL-8 as shown in . Fig. 12.2. You can

confirm IL-8 is annotated with various GO terms such as, angiogenesis, calcium-medicated
signaling, cell cycle arrest, cellular component movements, cellular response to lipopolysac-
charides, chemotaxis, and embryonic digestive track development. When a GO term is anno-
tated to a gene, the evidence code is marked differently according to the methodology.
.. Fig. 12.3 Tree view and graph view of search results of IL-8 in AmiGO
Therefore, the quality of the GO annotation is determined by the evidence code. The
top left filter allows you to filter annotations based on the basis code and GO type.
If you select “GO: 000001524 angiogenesis” in the annotation, a detailed description of
the corresponding GO term is displayed. If you select the “Inferred Tree View” tab in the
upper tab menu, the relationship between the other terms is shown in a tree diagram
(. Fig. 12.3). “Graph View” shows the relationship and hierarchical structure between

each term in the GO DAG structure (. Fig. 12.3).
12
12.3.2 Search GO Annotations for a Gene List
The previous exercise examined GO annotations in one gene and the characteristics of
individual GO terms. The strength of GO is that it is useful to find semantic similarities in
gene lists obtained from differential expression or cluster analysis. In this section, we will
practice GO annotation search for a gene list (or set) composed of multiple genes. Select
“Search” from the first screen menu of AmiGO and the “Advanced Search” screen appears.
You can search GO terms of a gene list here. In this exercise, we will perform a GO anno-
tation search for IL-8, AHNAK, CD2, and TTK (. Fig. 12.4). Go to the setting at the

bottom of the page and select genes of proteins for “Search Type.” Select “symbol” since
input data is composed of gene symbols. In “Filter results”, select Homo sapiens for
“Species” and then select “send query.” Non redundant GO annotation will be searched in
order to find the input genes. As explained in 7 Chap. 7, GO annotation results for gene

clusters are important information that can be used in further analysis to infer the bio-
logical significance of the clusters.
217 12
.. Fig. 12.4 GO annotation search for gene list (IL-8, AHNAK, CD2, TTK)
12.3.3 Calculation of the Semantic Distance Between Genes
It can be assumed that genes with similar GO annotations are genes that code for similar
functions. Therefore, the annotated GO term in the gene can be used to assess the seman-
tic similarities between genes or between gene clusters. Such approach is very important
to infer biological meanings in genome data. This section uses tools to support semantic
similarities or distance measurements between genes. G-SESAME2 is a tool to calculate
the similarities between genes based on GO terms. Enter the two gene names and set the
ontology type, species, database, and evidence code, and then click “Submit” to calculate
the semantic distance between the two genes. In this exercise, we will calculate the
2 7 http://bioinformatics.clemson.edu/G-SESAME/

.. Fig. 12.5 G-SESAME input screen with CD2 and TTK
s emantic similarity between two different human genes, CD2 and TTK, using the GO
12 Molecular Function category (. Fig. 12.5).

The distance between the two genes is 0.746. The closer the distance is to 1, the closer
the semantic distance of the annotated GO terms of the two genes (. Fig. 12.6). We then

should review the relationship between the annotated GO terms of both genes.
. Figure 12.7 shows the relationship between the two genes and the annotated GO terms

as a DAG. The cyan nodes denote CD2 terms, orange nodes for TTK, and grey nodes for
both. In addition, the correlation of each term on the GO DAG can also be analyzed.
12.3.4 GO Annotation Analysis Using R
Web-based tools have limits for handling large volumes of data or performing GO annota-
tion searches in conjunction with other R-based statistical analysis results. This section
provides an R script for advanced users. If you run the R scripts provided in this section,
you can do the same exercises as you performed on the web.
* Package installation
219 12
Functional similarity of two genes

Semantic similarity between CD2 and TTK is
0.746
Associated GO term information:

CD2 from Homo sapiens TTK from Homo sapieas
Evidence Evidence
GO Name Data Source
Code
GO Name Data Source
Code
ASPGD, CGD, dictyBase, lntAct, ASPGD, CGD, dictyBase, lntAct,
EcoCyc, EcoliWiki, UniProtKB, EcoCyc, EcoliWiki, UniProtKB,
FlyBase, GeneDB_Spombe, FlyBase, GeneDB_Spombe,
GeneDB_Tbrucei, AgBase, GeneDB_Tbrucei, AgBase,
protein Roslin_Institute, HGNC, BHF-UCL, IPI, IEA, protein Roslin_Institute, HGNC, BHF-UCL, IPI, IEA,
GO:0005515 GO:0005515
binding MTBBASE, InterPro, ENSEMBL, IDA, ISS binding MTBBASE, InterPro, ENSEMBL, IDA, ISS
PAMGO_GAT, WormBase, MGI, PAMGO_GAT, WormBase, MGI,
GR, PAMGO, PAMGO_VMD, RGD, GR, PAMGO, PAMGO_VMD,
SGD, TAIR, WB, ZFIN RGD, SGD, TAIR, WB, ZFIN
ISS, dictyBase, EcoliWiki, lntAct, ISS, IPI,

CGD, GOA, dictyBase, EcoliWiki, IDA, EcoCyc, FlyBase, UniProtKB, IDA,
identical
FlyBase, UniProtKB, IEA, GeneDB_Spombe, ENSEMBL, IEA,
GO:00042802 protein
GeneDB_Pfalciparum, NAS, AgBase, InterPro, BHF-UCL, MGI, TAS,
receptor binding
GO:0004872 GeneDB_Spombe, AgBase, InterPro, TAS, HGNC, RGD, PAMGO_MGG, NAS,
activity IMP, IPI, SGD, TAIR, WB, ZFIN ISO, IGI
ENSEMBL, PINC, HGNC, MGI, GR,
TIGR, JCVI, RGD, SGD, SGN, IC, RCA,
TAIR, WB, ZFIN IGI, ISO,
ISA
Similarities of the associated GO terms:

GO:0005515 GO:0042802
GO:0005515 1 0.776
GO:0004872 0.205 0.149
.. Fig. 12.6 G-SESAME results screen 1 for CD2 and TTK genes
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("GOSim")
> biocLite("topGO")
> biocLite("sigPathway")
> library(GOSim)
> InputGenes <- read.delim(file.choose(), head=T) #Read 314 GenesFromALL.txt.
#If the working directory is set, you can open it directly as shown below.
#InputGenes <- read.delim("C: \gda\ch12\314GenesFromALL.txt", head=T)
> head(InputGenes)
All Molecular function 33295
molecular_function
GO:0003674
binding molecular transducer activity

GO:0005488 GO:0060089
protein binding signal transducer activity

GO:0005515 GO:0004871
identical protein binding receptor activity

GO:0042802 GO:0004872
• Cyan nodes are GO terms annotating the gene CD2.

• Orange nodes are GO terms annotating the gene TTK.
• Grey nodes are GO terms annotating both genes.
12
.. Fig. 12.7 G-SESAME results screen 2 for CD2 and TTK genes
> head(InputGenes)
ProbeID EntrezGene GeneSymbole
1 31491_s_at 841 CASP8
2 31506_s_at 1668 DEFA3
3 31696_at 5912 RAP2B
4 31737_at 58 ACTA1
5 32393_s_at 9987 HNRPDL
6 34098_f_at 9270 ICAP-1A
Some of the genes stored in InputGenes have an “NA” value in the second column, which
is the EntrezGene column. The next step is to remove the gene where the second column
value is ‘NA’ and extract only 283 of 314 possible analyzes. 238 genes with EntrezGene
present in the second column are included in the next process.
221 12
3.0
2.5
2.0 Cluster Dendrogram
Height
1.5
1.0
9270
0.5
3205
0.0
5912
3045
5723
2921
10399
9262
8329
6504
3820
3592
5791
3087
8970
6421
10859
841
909
910
58
1668
1791
6095
9987
7253
6320
as.dist(1 - sim)
hclust (*, "ward.D")
.. Fig. 12.8 Cluster dendrograms constructed with semantic distance of genes
> GenesOfInterest <- InputGenes[2]
> GenesOfInterest <- GenesOfInterest[!is.na(GenesOfInterest)]
> length(GenesOfInterest)
We will calculate the distance between the first 30 genes because it takes a longer time to
calculate the total number of gene distances.
> mGOI <- GenesOfInterest[1:30]
> sim <- getGeneSim(mGOI, verbose = FALSE)
A hierarchical cluster analysis generates a dendrogram (. Fig. 12.8).

n = 27 3 clusters Cj
j : nj | avei∈Cj si
2
18
19
15
16 1 : 10 | 0.48
9
20
21
13
1
6
14
24 2 : 5 | −0.15
3
17
5
7
22
12
10
11 3 : 12 | 0.46
4
26
25
8
27
23
−0.2 0.0 0.2 0.4 0.6 0.8 1.0

Silhouette width si
Average silhouette width : 0.35
.. Fig. 12.9 Cluster silhouette constructed with semantic distance of genes
12 > hc = hclust(as.dist(1 - sim), method = "ward.D")
> plot(hc)
Evaluate the clustering and plot the cluster silhouettes (. Fig. 12.9).
> cl = cutree(hc, k = 3)
> if(require(cluster))
ev = evaluateClustering(cl, sim) # evaluate the clustering
print(ev$clusterstats) # print out some statistics
plot(ev$clustersil, main=" ") # plot the cluster silhouettes
}
223 12
Next, using the GO enrichment function, we should perform GO enrichment of the genes
belong to the cluster. Since GO enrichment only accepts character type variables, we
should convert the data types of variables to character types using as.character () function.
> GenesOfInterest <- as.character(GenesOfInterest)
> mGOI <- as.character(mGOI)
> typeof(GenesOfInterest)
> typeof(mGOI)
Using the GO enrichment function, we can perform a GO enrichment analysis for cluster
1 of the clusters in . Fig. 12.8 compared by the Fisher’s exact test. The cutoff value is a

parameter in the GO enrichment function. In this exercise, only a value <0.05 based on
the p-value is extracted, and the resulting value is assigned to the variable of GOEResult.
> if(require(topGO))
GOEResult <- GOenrichment(mGOI[cl == 1], GenesOfInterest, cutoff = 0.05)
Building most specific GOs ..... ( 1711 GO terms found. )

Build GO DAG topology .......... ( 4499 GO terms and 10307 relations. )
Annotating nodes ............... ( 241 genes annotated to the GO terms. )
-- Elim Algorithm --
the algorithm is scoring 758 nontrivial nodes
parameters:
test statistic: fisher
cutOff: 0.01
Level 16: 1 nodes to be scored (0 eliminated genes)
GOEResult consists of enriched GO terms, p-values, and a list of corresponding genes.
> names(GOEResult)
> names(GOEResult)
[1] "GOTerms" "p.values" "genes"
Print the items we want to see. For example, the following example returns the genes
enriched in GO terms.
> GOEResult$genes
12.4 Biological Pathway Analysis
The biological pathway refers to a well-organized summary of the key life science research
findings. A pathway database is a necessary resource for biological data analysis using
biological pathways. However, there are very few available pathway databases. Exercise in
this section briefly introduce a simple web-based tool, Reactome. Reactome provides a
variety of well-organized information from basic pathways to complex biological path-
ways such as GPCRs and APOPTOSIS. A user guide can be found on the website. (7 http://
wiki.reactome.org/index.php/Usersguide).
12.4.1 Search a Biological Pathway of a Specific Gene

12
The “Analyze Data” is the most basic function that finds pathways of input genes. For this
exercise, we will use all 20 genes in the ALL database: IL8, CHC1L, AHNAK, SEC31B-1,
MEF2A, CAT, MYC, HNRPH1, DEK, CDC7L1, BUB1B, H2AFX, KIAA0175, HEC, CD2,
USP1, TTK, CD8, and LRMP. Clicking “Analyze Data” on the main page of the Reactome
homepage opens the Analysis tools window. Enter the above 20 gene symbols and click
“Continue”. In the following window, select the option and click “Analyze!” to get a path-
way diagram.
12.4.2 Search Significant Biological Pathways of a Gene List
Selecting the “Overrepresentation analysis” option at the bottom of the previous exercise
gives a list of pathways in which input genes are significantly enriched. (Refer to 7 Chap. 4,

7 Sect. 4.2 to use the overrepresentation analysis). The statistical significance of the GO

annotation is analyzed by using hypergeometric distribution applying the same principle

of the previous overrepresentation analysis. The differences in significance levels are
shown in different colors.
225 12
12.4.3 Data Analysis Using Gene Expression Data
By selecting “Analyze Expression Data” on the left side of the Reactome homepage a win-
dow of “Upload expression data” opens (. Fig. 12.10). When the genome expression pro-

file data from microarray or RNA-Seq is entered, the input gene is mapped to the Reactome
Pathway as described above, and as a result, the number of mapped genes and mapping
ratio (%) are shown (. Fig. 12.11). The gene expression level is shown in a color-coded

diagram of the pathway. In this pathway analysis, the “Extrinsic Pathway for Apoptosis”
was mapped to all genes (100% mapping). When “Extrinsic Pathway for Apoptosis” is
clicked, gene expression profiles of this pathway are displayed color coded. Rich and dull
gene expressions are represented as red to blue gradient (. Fig. 12.12).

.. Fig. 12.10 Reactome results screen 1. Expression analysis
.. Fig. 12.11 Reactome results screen 2. Expression analysis

.. Fig. 12.12 Reactome results screen of extrinsic pathway for apoptosis
12.4.4 Biological Pathway Analysis Using R
The section provides R scripts for advanced R users. We use the sigPathway package to
perform biological pathway overrepresentation analysis with the MuscleExample data.
MuscleExample data is a microarray data for patients with z muscular system disease
called inclusion body myositis, which is composed of 5000 genes in columns, and 15
experiments (number of samples) in rows.
12
> library(sigPathway)
> data(MuscleExample)
> dim(tab)
[1] 5000 15
> print(tab[501:504, 1:3])
GEIM1.IBM.S GEIM7.IBM.S GEIM20.IBM.S

217466_x_at 3203 4085 23736
211939_x_at 28250 32293 36890
203932_at 6452 3596 13392
200715_x_at 20792 12647 18865
Fifteen samples are composed of eight patients and seven healthy subjects as a control
group phenotype.
227 12
> table(phenotype)
0_NORM 1_IBM
7 8
Using the T-test, the differentially expressed genes between cases and controls were identi-
fied. The results are plotted in the histograms (. Fig. 12.13).

> statList <- calcTStatFast(tab, phenotype, ngroups = 2)
> hist(statList$pval, breaks = seq(0, 1, 0.025), xlab = "p -value", ylab = "Frequency", main = " ")
As shown in . Fig. 12.13, we found that a large number of genes were differentially

expressed between case and control groups. Next, using the runSigPathway function, we
searched for pathways differing between the two groups.
> set.seed(1234)
> res.muscle <- runSigPathway(G, 20, 500, tab, phenotype, nsim = 1000,
weightType = "constant", ngroups = 2, npath = 25, verbose = FALSE,
allpathways = FALSE, annotpkg = "hgu133a.db", alwaysUseRandomPerm = FALSE)
This search returns a list of pathways that show statistically significant differences between
case and control groups.
.. Fig. 12.13 Frequency of

differentially expressed genes
2500
between case and control

groups
2000
1500
Frequency
1000 500
0
0.0 0.2 0.4 0.6 0.8 1.0

p-value
> print(res.muscle$df.pathways[1:10,])
IndexG Gene Set Category Pathway

1 234 GO:0019883 antigen presentation, endogenous antigen
2 292 GO:0042611 MHC protein complex
3 293 GO:0042612 MHC class I protein complex
4 233 GO:0019882 antigen presentation
5 84 GO:0030333 antigen processing
6 237 GO:0019885 antigen processing, endogenous antigen via MHC class I
7 117 GO:0030106 MHC class I receptor activity
8 92 GO:0001772 immunological synapse
9 613 humanpaths Interferon a,b Response
10 601 humanpaths Dendritic / Antigen Presenting Cell
Set Size Percent Up NTk Stat NTk q-value NTk Rank NEk Stat NEk q-value NEk Rank
1 22 0.00 18.97 0 3 9.33 0 2
2 20 0.00 17.83 0 6 9.36 0 1
3 20 0.00 17.83 0 6 9.36 0 1
4 45 0.00 19.41 0 1 7.24 0 7
5 44 0.00 19.03 0 2 7.26 0 6
6 23 0.00 18.44 0 4 9.11 0 4
7 22 0.00 18.37 0 5 9.28 0 3
8 26 0.00 16.95 0 7 8.27 0 5
9 71 12.68 10.79 0 8 4.83 0 9
10 105 5.71 10.66 0 9 3.62 0 11
> names(res.muscle$df.pathways[1:10,])
12 [1] "IndexG"
[5] "Percent Up"
"Gene Set Category" "Pathway"
"NTk Stat" "NTk q-value"
"Set Size"
"NTk Rank"
[9] "NEk Stat" "NEk q-value" "NEk Rank"
.. Fig. 12.14 Screenshot of the leukemia keyword entered into the COREMINE main screen
12.5 · Biological Text Mining – COREMINE
229 12
> print(res.muscle$df.pathways[c("IndexG","Gene Set Category", "Pathway",
"NTk Rank")])
IndexG Gene Set Category Pathway NTk Rank

1 234 GO:0019883 antigen presentation, endogenous antigen 3.0
2 292 GO:0042611 MHC protein complex 6.0
3 293 GO:0042612 MHC class I protein complex 6.0
4 233 GO:0019882 antigen presentation 1.0
5 84 GO:0030333 antigen processing 2.0
6 237 GO:0019885 antigen processing, endogenous antigen via MHC class I 4.0
7 117 GO:0030106 MHC class I receptor activity 5.0
8 92 GO:0001772 immunological synapse 7.0
9 613 humanpaths Interferon a,b Response 8.0
10 601 humanpaths Dendritic / Antigen Presenting Cell 9.0
11 19 GO:0045012 MHC class II receptor activity 21.0
12 236 GO:0019884 antigen presentation, exogenous an tigen 21.0
13 238 GO:0019886 antigen processing, exogenous antigen via MHC class II 21.0
14 481 GO:0009615 response to virus 75.0
15 576 KEGG Jak-STAT_signaling_pathway 84.0
16 40 GO:0006968 cellular defense response 93.0
17 42 GO:0006959 humoral immune response 86.0
18 612 humanpaths Th1-Th2-Th3 115.0
19 575 KEGG Toll-like_receptor_signaling_pathway 118.0
20 625 humanpaths Asthma 139.0
21 470 GO:0043085 positive regulation of enzyme activity 149.0
22 89 GO:0045333 cellular respiration 22.0
23 526 BioCarta p38 MAPK Signaling Pathway 193.5
24 18 GO:0005884 actin filament 225.5
25 529 BioCarta Activation of Csk by cAMP-dependent Protein Kinase Inhibits Signaling through the T Cell Receptor 238.0
26 226 GO:0019866 inner membrane 19.0
27 2 GO:0005740 mitochondrial membrane 20.0
28 507 GO:0015980 energy derivation by oxidation of organic compounds 17.0
29 142 GO:0015078 hydrogen ion transporter activity 11.0
30 141 GO:0015077 monovalent inorganic cation transporter activity 12.0
31 199 GO:0015399 primary active transporter activity 10.0
32 371 GO:0051239 regulation of organismal physiological process 338.0
33 546 KEGG Oxidative_phosphorylation 15.0
34 418 GO:0005386 carrier activity 13.0
35 563 KEGG Ribosome 16.0
36 450 GO:0003735 structural constituent of ribosome 14.0
37 25 GO:0005840 ribosome 18.0
12.5 Biological Text Mining – COREMINE
COREMINE Medical, a product of PubGene, is a domain-specific search engine for med-

ical information. Through text analysis, this tool extracts meaningful sections from text
information such as Medline. When keywords were searched, general information on
search topics, such as related articles, researchers, disease, drug, symptom, procedure,
anatomy, food, gene/protein, MeSH terms, chemical, GO, and traditional Chinese medi-
cine through text mining were shown in the graphic network.
For exercise, go to the following website. 7 http://www.coremine.com/.

When you enter a search term on the COREMINE main screen, similar terms with
different concepts are displayed. The appropriate category can then be selected.
(. Fig. 12.14).

In COREMINE, when you enter multiple keywords, related information is displayed.

As an example, use this exercise to investigate the relationship between Leukemia and IL-8
gene. When you type a search word, several concepts of a word are shown and you select
the concept of interest (. Fig. 12.15).

. Figure 12.16 shows the results of a two keyword search, leukemia and IL-8.

Wikipedia, ClinicalTrials, and PubMed articles were used for text mining. As a result,
.. Fig. 12.15 Screenshot
showing concurrent input of
leukemia (disease) and IL-8
(gene/protein)
12
.. Fig. 12.16 Screenshot of the results of a two keyword search, leukemia (disease) and IL-8 (gene/
protein)
there were 13 concepts and keywords related to the input. On the first screen, highly
related keywords are nodes and the user can see more to less relevant keywords by clicking
the ‘+’ or ‘−’ buttons on the left. Clicking each node presents basic information and dis-
plays extracted associations on the right side. Connection degree between nodes is shown
in a bar graph. Briefly, leukemia and IL-8 are closely related to CSF3, LIF, CSF2, IL6, and
recombinant Interleukin-6 (. Fig. 12.16).

Clicking on the link between two nodes shows the number of PubMed Articles that
link their relationship. In the results page, a list of 243 papers that mention the relationship
of two keywords is displayed on the right side. The title of the paper serves as the link to
the PubMed site (. Fig. 12.17).

Bibliography
231 12
.. Fig. 12.17 References for the relationship between leukemia (disease) and IL-8 (gene/protein)
Take Home Message

55 Learn the biological interpretation methods of a set of genes at the system level
using GO and biological pathway.
Exercises
[Exercise 1] - Download the MSigDB gene set and extract GO annotations from AmiGO for oncogene
gene sets.
[Exercise 2] - Calculate the similarity of CD74 and MYC genes among the genes entered in [Exercise 1].
Select Homo sapiens and biological process for GO.
[Exercise 3] - Perform pathway assignment and over representation analysis for the genes entered in
[Exercise 1].
[Exercise 4] - Perform a literature network with the CD74 and MYC genes used in [Exercise 1] and use
carcinoma as a keyword.
[Exercise 5] - Configure the literature network with the genes entered in [Exercise 1] and carcinoma as a
keyword.
Bibliography
1. Alexa A, Rahnenfuhrer J (2016) topGO: enrichment analysis for Gene Ontology. R package version
2.26.0
2. AmiGo – http://amigo.geneontology.org/amigo
3. Carbon S et al (2009) AmiGO: online access to ontology and annotation data. Bioinformatics
25(2):288–289
4. Chen H, Sharp BM (2004) Content-rich biological network constructed by mining PubMed abstracts.
BMC Bioinformatics 5:147
distinct subsets of patients with different response to therapy and survival. Blood 103:2771–2778
6. COREMINE – https://www.coremine.com
7. Croft D et al (2011) Reactome: a database of reactions, pathways and biological processes. Nucleic
Acids Res 39(Database):D691–D697
8. G-SESAME – http://bioinformatics.clemson.edu/G-SESAME
9. Harris MA et al (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res
32(Database issue):D258–D261
10. Holger F, Speer N, Poustka A, Beissbarth T (2007) GOSim – An R-package for computation of informa-
tion theoretic GO similarities between terms and gene products. BMC Bioinformatics 8:166
11. Lai W, Tian L, Park P (2008) sigPathway: pathway analysis. http://www.pnas.org/cgi/doi/10.1073/pnas.
0506577102, http://www.chip.org/~ppark/Supplements/PNAS05.html
12. Reactome – http://www.reactome.org
12
233 13
Biological Network Analysis

13.2 Preparations – 234
13.3 The Network Analysis Tools – 234

13.3.1 Major Network Analysis Tools – 234
13.3.2 Introduction to igraph and an Example of Its Use – 235
13.4 Introduction to Data and Publication

Used for Analysis – 240
13.4.1 Introduction to Publication and Dataset – 240
13.4.2 Introduction to Data and Preprocessing – 240
13.5 Analysis of Protein Interaction Networks – 241

13.5.1 Visualization – 241
13.5.2 Distribution of Connections – 242
13.5.3 Evolutionary Analysis – 244


https://doi.org/10.1007/978-981-13-1942-6_13
234 Chapter 13 · Biological Network Analysis

In this chapter, we will learn how to use the igraph package in R (a network analysis tool) and
will analyze a property of protein interaction networking based on existing publications. We
will analyze evolutionary distance and connectivity so that we need to confirm that the pro-
tein interaction network is a scale-free network and hub genes are evolutionarily old proteins.
13.1 Introduction
The human body is composed of more than 60 trillion cells; each cell consists of many
genes and proteins. Also, there are a number of environmental factors and various organ-
isms surrounding human beings. Each organism is not only independent but also consis-
tently interacts to create a huge and complex network. Therefore, to understand life
phenomena, it is important to understand function and structure of interactions between
various elements composing the life system.
Protein-protein interaction network (PPI), transcriptional regulatory, metabolic, and
perturbation networks constitute biological network studies. Gene-disease, disease-dis-
ease network, and drug interaction networks have been actively studied.
It is expected that the bio network will change the current research paradigm in life
science research and provide clues for understanding genome structure and evolutionary
mechanisms. The development of extensively parallel technology, including NGS and bio-
informatics interpretation techniques, have produced advances in new fields such as net-
work biology and network medicine. The study about the biological network is expected
to provide a clue to the structural and evolutionary mechanisms of the genome.
In this chapter, we learn how to use the “igraph” package, which is one of R packages
for biological network analysis.
13.2 Preparations
13 First, install R program and “igraph” package, and install “rgl” package for visualization of
network in a three-dimensional space. We can practice the analysis on a Windows system.
After running R, enter the following command.
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("rgl")
> biocLite("igraph")
13.3 The Network Analysis Tools
13.3.1 Major Network Analysis Tools
A variety of software has been developed to analyze and visualize biological networks.
Such software may be distributed for free by developers, and it is also sold for commercial
purposes by companies related to bioinformatics. . Figure 13.1 shows the tools for analyz-

ing and visualizing major biological networks.

13.3 · The Network Analysis Tools
235 13
.. Fig. 13.1 Network analysis and visualization software
Software, including Cytoscape and Pajek, are stand-alone modes after they are installed
on your computer, while NetworkX or igraph are available in R or Python packages.
Stand-alone software provides an easy user interface, but it has limited use in various and
sophisticated functions. On the other hand, Python or R allows flexible and sophisticated
analysis, but it requires programming skills. In this chapter, we learn how to use igraph
packages in R, which is mostly used in biological analysis like microarray and with this
package, we practice network analysis and visualization of protein interaction.
13.3.2 Introduction to igraph and an Example of Its Use
“igraph” is a network analysis software package based on graph theory. The core software
library has been written in C/C++ language. igraph provides an interface and program-
ming language such as GNU R, Python, and Ruby (. Fig. 13.2). igraph allows for direc-

tional and non-directional network, visualization, and multi-dimensional network

analyses. In addition, three dimensional visualizations are possible by linking to the “rgl”
package. We can practice several exercises to learn the general usage of igraph before
performing real network data analysis.
Create and visualize igraph objects.
1. Load the igraph library.
> library(igraph)
.. Fig. 13.2 igraph
2. Create a directional object with seven nodes and check the connectivity between the
nodes.
> g <- graph(c(1, 2, 2, 3, 4, 5, 6, 7), directed = TRUE)
>g
IGRAPH D--- 7 4--

+ edges:
[1] 1->2 2->3 4->5 6->7
> are.connected(g, 1, 2)
> are.connected(g, 1, 3)
13 [1] TRUE: connected.

[2] FALSE: not connected.
3. You can enter data by calling the edit function. Create the igraph object and enter the
nodes’ name and the weighted value of connectivity in the edit window (. Fig. 13.3).

> data <- edit(data.frame())
> g <- graph.data.frame(data, directed = TRUE)
Confirm the nodes and the weighted value of connectivity.
> V(g)$name
> E(g)$weight
[1] "A" "B" "C" "D" "E"

[1] 1-1 1 1-1 1 1 1 1 –1
237 13
.. Fig. 13.3 Create a graph
using edit function
4. Network visualization exercise: Build a star-like network, and set the color of each
connection line according to the weight. Practice functions: (1) plot, (2) tkplot, and
(3) rglplot (. Fig. 13.4).

> E(g)$color <- “blue”
> E(g)[weight == -1]$color <- “red”
> plot(g, layout = layout.kamada.kawai, vertex.label = V(g)$name,
edge.color = E(g)$color)
> tkplot(g, layout = layout.kamada.kawai, vertex.label = V(g)$name,
> library(rgl)
> rglplot(g, layout = layout.kamada.kawai, vertex.label = V(g)$name,
.. Fig. 13.4 Network visualization
0.500
degree.distribution(g)
0.005 0.020 0.0010.100
1 2 5 10 20 50
Index
13
.. Fig. 13.5 Scale-free network
Create a network model, then perform visualization and analysis.

1. Create and visualize a scale-free network that is a Barabasi network, and display the
distribution of the number of nodes in a graph.
> g <- barabasi.game(1000, directed = FALSE)
> plot(g, vertex.size = 2, vertex.label = NA, vertex.shape = “circle”,
layout = layout.fruchterman.reingold, edge.color = “black”)
> plot(degree.distribution(g), log = “xy”)
Note: igraph provides a variety of layout options for visualization. Let’s draw a network by
changing the actual layout. We can see various types of network graphs (. Fig. 13.5 and
. Table 13.1).

239 13
.. Table 13.1 Layout options in igraph
layout.random layout.circle
layout.sphere layout.fruchterman.reingold
layout.kamada.kawai layout.spring
layout.reingold. layout.fruchterman.reingold.grid
tilford
layout.lgl layout.graphopt
layout.mds layout.svd
0.200
0.020 0.050
0.005
0.001
1 2 5
Index
.. Fig. 13.6 Erdos-Renyi Network
2. Create and visualize an Erdos Renyi network model, and draw a graph of the
distribution of the number of nodes (. Fig. 13.6).
> g <- erdos.renyi.game(1000, 1000, type = “gnm”, directed = FALSE)
> plot(g, vertex.size = 2, vertex.label = NA, vertex.shape = “circle”,

13.4 Introduction to Data and Publication Used for Analysis
13.4.1 Introduction to Publication and Dataset
In the chapter, we analyze a protein interaction network in yeast, which is the mostly
studied protein interaction network. In practice, we reproduce the network of “Evolutionary
rate in the Protein Interaction Network” (Fraser et al. [4]) and “Emergency of Scaling in
Random Networks” (Barabási and Albert [2]) published in Science.
13.4.2 Introduction to Data and Preprocessing
Protein interaction information is available at various databases. Major protein interac-

tion databases are shown in . Table 13.2.

Protein interaction data is downloaded from the Saccharomyces Genome Database,

which is a yeast genome database. The data format is shown in the left upper panel of
. Fig. 13.7. This includes not only PPI but also protein-tRNA interactions. Protein-tRNA

interactions should be removed for the analysis in this chapter. The lower two panels in
. Fig. 13.7 shows a re-formatted input file that is readable in igraph. In total, 5543 genes

(nodes) and 54,484 connection (links) exist in this data. Download the evolutionary rate
(non-synonymous substitution rate, dN) data, published in PNAS in 2005 “Functional
genomic analysis of the rates of protein evolution” (upper and right panel in . Fig. 13.7).
Among the 5543 genes, if gene has dN in the file, assign the dN value to the gene, other-
wise, assign −1 since the gene evolution rate has a value >0.
.. Table 13.2 List of protein interaction databases
Database Feature
13
HPRD (Human Reference Construction of human PPI database through literature
Database) collection and verification.
41,327 interactions, 30,047 proteins, 470 domains.
MINT (Molecular Interaction homoMINT: homology-inferred human networks

Database) 125,464 interaction database for various species
IntAct (Protein Interaction PPI database based on EBI literature

Database) 653,574 interactions, 94,146 proteins, 38,446 experiments,
3581 ontology terms
DIP (Database of Interacting Experimentally confirmed PPI data

Protein) PPI database based on 28,764 proteins, 81,627 interactions,
826 organisms
STRING Establishment of PPI database from experiments, literature

collection and forecasts
Information from a variety of sources including fusion,
co-expression, HT-experiments, and text mining.
9,643,763 proteins for 2031 species
BioGRID Literature collection and verification

61 species, 65,192 genes, 1,079,789 interactions
13.5 · Analysis of Protein Interaction Networks
241 13
.. Fig. 13.7 Pre-processing of datasets (PPI)
13.5 Analysis of Protein Interaction Networks
13.5.1 Visualization
In R, read the protein interaction network files provided in the CD. (“vertexes.txt” and
“relations.txt”), create graph object, and visualize with Fruchterman.reingold (. Fig. 13.8).

# Confirm the current working directory using “getwd()” and move the working directory
to “C: \gda\ch13” using “setwd”.
> traits <- read.delim(“C: \gda\ch13\vertexes.txt", head = F)
> relations <- read.delim(“C: \gda\ch13\relations.txt”, head = F)
> colnames(traits) <- c(“gene”, “orf”, “ppi”, “dn”)
> colnames(relations) <- c(“from”, “to”)
> library(igraph)
> g <- graph.data.frame(relations, vertices = traits, directed = FALSE)
> plot(g, vertex.size = 1, vertex.label = NA, asp = FALSE, vertex.shape = “circle”,
※ Caution: Depending on your computer’s specifications, visualizations may have dif-

ferent processing times. If there is a time constraint, we recommend omitting the plot
command.
.. Fig. 13.8 Protein interaction

networks
1e-01
2e-02
5e-03
1e-03
13
2e-04
1 5 10 50 500
Index
.. Fig. 13.9 The result showing the power function distribution
13.5.2 Distribution of Connections
igraph provides a “degree.distribution” function, which is used to calculate the connection

distribution (degree). We confirm whether distribution of connections in protein interac-
tion network represents distribution of power law, which is a characteristic of a scale-free
network.
Draw the distribution of connections in the protein interaction network in log scale
(. Figs. 13.9 and 13.10).

243 13
.. Fig. 13.10 Comparison with
the data from publication
10−1
A
10−2
10−3
P(k)
10−4
10−5
10−6
100 101 102 103
.. Fig. 13.11 Evolution analysis

of protein interaction network dN to PPI
1.2
1.0
0.8
0.6
dN
0.4
0.2
0.0
0 100 200 300 400 500

Degree
> g <- graph.data.frame(relations, vertices = traits, directed = FALSE)

13.5.3 Evolutionary Analysis
In this chapter, we confirm that hub proteins in scale-free network are evolutionarily old.
Draw a small dot for the evolution rate by degree and add the regression line. Please
confirm that hub genes are evolutionarily older (. Fig. 13.11).

> traits <- read.delim ("C:\gda\ch13\vertexes.txt", head = F)
> range(traits[ ,3]) # Check the distribution of connections
> k_degree <- c(min(traits[ , 3]):max(traits[ , 3]))
> y_value <- rep(0, max(traits[ , 3])) # Replace 2569 with 0
> ev_table <- matrix(0, 2569, 1) # Make matrix to calculate evolutionary rate
> plot(k_degree, y_value, type = “n”, ylim = c(0, 1.2), xlab = “Degree”, ylab = “dN”,
main = “dN to PPI”, xlim = c(0, 500)) # Draw an empty plot
> for (i in k_degree){
temp <- traits[ , 3] == i & traits[ , 4] >= 0 # Search data matching the ppi value
each_degree_data = traits[temp, ]
m_t = rep(i, length(each_degree_data[ ,1])) # Create data as much as the assigned
data is
lines(m_t, each_degree_data[ , 4], type = “p”, col = “blue”) # Draw a plot

13 }
> temp <- traits[ , 4] >= 0
> each_degree_data = traits[temp, ]
> lm_result <- lm(each_degree_data[ , 4] ~ each_degree_data[ , 3])
> abline(lm_result, col = “red”)
Using data obtained from the above practice, calculate the Pearson correlation coefficient
with p-value or Spearman correlation coefficient with p-value.
> cor.test(each_degree_data$ppi, each_degree_data$dn, method = “pearson”)
> cor.test(each_degree_data$ppi, each_degree_data$dn, method = “spearman”)

245 13
Exercises
[Exercise 1] - Extract proteins which interact with p53 using protein interaction databases in
. Table 13.2.

[Exercise 2] - Visualize the interaction network using data from [Exercise 1].
[Exercise 3] - Find proteins in the OMIM database. Draw proteins associated with disease in red and
proteins not related to disease in blue.
[Exercise 4] - Draw proteins associated with metabolism in red and proteins not associated with
metabolism in blue using physical interaction data from the yeast genome database.
[Exercise 5] - From [Exercise 4], extract the proteins related to metabolism and plot the distribution of
connections between them.
Take Home Message

55 How to analyze the primary network data using R package, “igraph”.
55 What visualization methods are used.
Bibliography
1. Adler D, Murdoch D, et al (2017) rgl: 3D visualization using OpenGL. R package version 098.1. https://
CRAN.R-project.org/package=rgl
2. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
3. Csardi G, Nepusz T (2006) The igraph software package for complex network research, InterJournal,
Complex Systems 1695. http://igraph.org
4. Fraser HB et al (2002) Evolutionary rate in the protein interaction network. Science 296:750–752
5. Jeong H, Mason SP, Barabasi AL et al (2001) Lethality and centrality in protein networks. Nature
411:41–42
6. Saccharomyces Genome Database – http://www.yeastgenome.org
7. Shannon P et al (2003 Nov) Cytoscape: a software environment for integrated models of biomolecular
interaction networks. Genome Res 13(11):2498–2504
8. Wall DP, Hirsh AE, Frase HB et al (2005) Functional genomic analysis of the rates of protein evolution.
PNAS 102(15):5483–5488
9. Wuchty S, Almaas E (2005) Peeling the yeast protein network. Proteomics 5(2):444–449
13
247 IV
SNPS, GWAS and

CNVS, Informatics for
Genome Variants
Contents
Chapter 14 SNPs, GWAS, CNVs: Informatics for Human

Genome Variations – 249
Chapter 15 SNP Data Analysis – 261
Chapter 16 GWAS Data Analysis – 281
Chapter 17 CNV Analysis – 299

249 14
SNPs, GWAS, CNVs:

Informatics for Human
Genome Variations
14.2 dbSNP – 252
14.3 International HapMap Project – 253
14.4 PharmGKB – 253
14.5 Genome-Wide Association Studies (GWAS) – 254

14.5.1 Allelic Test – 254
14.5.2 Genotypic Test – 255
14.5.3 Multiple Testing Correction – 256
14.6 Definition and Importance of Copy-Number

Variation – 256
14.7 Analysis Methods of CNV – 257

14.7.1 Chip Method to Detect CNV – 258
14.7.2 Sequencing Method to Detect CNV – 258
14.8 Conclusion – 258

https://doi.org/10.1007/978-981-13-1942-6_14
250 Chapter 14 · SNPs, GWAS, CNVs: Informatics for Human Genome V
ariations

In this chapter, we will discuss modifications in gene base sequencing, diverse characteris-
tics or variation, different gene variations, and different gene lengths and their distribu-
tions. We investigate single base diversity and databases providing genome diversity-related
information including dbSNP, International HapMap Project, and PharmGKB. We will study
the hypothesis of common and rare disease gene variations. Finally, this chapter will also go
over the assumptions and method of analysis used by GWAS research and copy-number
variation research and survey the future of genome diversity.
14.1 Introduction
When the first blueprint of The Human Genome Project was finished at the end of the
twentieth century, we expected that the mystery of human genetics would be solved. News
that claimed that the human genome blueprint was finished went viral. It also stated that
we acquired the entire chimpanzee genome and that humans and chimpanzees share
99% of the same genome. It was said that we are so close to conquering all the diseases.
However, in reality, there were not enough results that could explain the molecular mech-
anism of life.
Investigating the relationship between genotype and phenotype is one of the oldest
research topics. The results of the human genome project verified that there are more
variations among individuals than initial speculation. Gene variation can be largely
separated as those that have <1000 base pair base variations and those that have >1000
base pair variations; these differences are called structural variations. Examples of base
variations include single nucleotide polymorphism (SNP), insertion and deletion (InDel),
and structure variations including copy number variation (CNV), segmental duplica-
tion, translocation, and inversion (. Figs. 14.1 and 14.2). Copy number variation can be

branched into amplification and deletion. Although we have arbitrarily chosen structure
variation as 1000 base pairs, there have been copy number variations of 800~1000 base
pairs. Therefore, it is important for us to view 1000 base pairs as a convenient standard.
. Figure 14.2 outlines the various genome variation distributions and their criteria.

SNPs among many genome variations drew the attention of early scientists. DNA is
14 made up of four base pairs A, G, T, and C and encodes crucial information of life. It is well
known that each base can have variations such as mutations, insertions, and deletions.
SNP and mutation are characteristically the same in that it is a single base variation. SNP
and mutation differ in that mutations pinpoint to an exceptional (or diseased) condition,
whereas SNP is more of a diverse universal condition. The difference between SNP and
mutation is solely for our convenience; if the allele at a single base is rare and the fre-
quency is >1%, it is defined as SNP, and if it is <1%, it is defined as mutation. Therefore,
mutations are understood as exceptional or what we commonly call a ‘disease’, and SNP
is understood as the factor that can explain human diversity. This distinction is bound to
cause controversy.
In data analysis, SNPs have two main characteristics: (1) a dense marker; (2) disease-
related. Although SNPs only exist in pairs of alleles and therefore it lacks the ability to
segregate alleles, SNP has a high emergence frequency. If several SNPs are used together,
it is possible to distinct the alleles in multiples of two. They can be used as dense markers
in dynamic and genealogy research. The second characteristic –disease related- could be
intended as a dense marker as well. However, the hypothesis states that a specific SNP or
14.1 · Introduction
251 14
Single Nucleotide
Polymorphism (SNP)
Base Variation
Insertion & Deletion

(Short InDel)
Copy Number
Variation (CNV) Copy Number Gain
Genome Variation
Duplication Copy Number Loss
Insertion & Deletion

(InDel)
Structural Variation
Translocation
Inversion
.. Fig. 14.1 Classification of genome variations
Sequence variation Structural variation
Microscopic to Whole chromosomal to

Single nucleotide 2 bp to 1,000 bp 1 kb to submicroscopic
subchro mosomal whole genome
• Base change – substitu • Microsatellites, → Copy number variants → Segmental aneusomy • Interchromosomal
tion – point mutation minisatellites (CNVs) • Chromosomal translocations
→ Indels → Segmental duplications deletions – losses • Ring chromosomes, is
→ Insertion-deletions • Inversions • Chromosomal ochromosomes
(“indels”) • Di-, tri-, tetranucleotide • Inversions, insertions – gains • Marker chromosomes
• SNPs-tagSNPs repeats translocations • Chromosomal → Aneuploidy
• VNTRs → CNV regions(CNVRs) inversions → Aneusomy
• Microdeletions, micro • Intrachromosomal
duplications translocations
→ Heteromorphisms
• Fragile sites
Molecular genetic detection Cytogenetic detection
.. Fig. 14.2 Related criterion of different genome variations
a combination of specific SNPs is related to disease. This is the common disease-common

variant hypothesis. With the fact that mutation frequency is low when considering rare
diseases, it is hypothesized that rare (mutation) variations are related to rare diseases
and SNP or common variations are related to common diseases. Although the “common
disease-common variant hypothesis” is the basis of the Genome-Wide Association Study
(GWAS) research, there has been a lot of controversy and examination of research, and
this hypothesis does not have much credibility.
ariations
Contrary to the attention and the spotlight it has been having, there has not been
much progress with SNP research regarding diabetes or complex diseases stemming
from complex causes. It may be necessary to switch gears and to readjust the research
direction. Regardless, genomic variation research continues, and we continue to supple-
ment the limits of GWAS by NGS technology and The 1000 Genome Project. This is an
effort to make a list of all genomic variations and defects in order to understand human
diseases. In this exercise, we analyze SNP data by using GWAS and copy-number varia-
tion data. Other types of diversity include InDels, microsatellites, and cancer genomic
markers, which are important targets for research; they will, however, be explained
elsewhere.
14.2 dbSNP
dbSNP provided by NCBI is a comparatively significant resource that is high reliable and
contains inclusively large genomic diversity data. Database that can be trusted is crucial
to genome diversity research. There is the need to consistently upgrade informatics for
highly reliability data analysis. Users of SNP resources need to know when the resource
was created, if it has been renewed to the most recent data, and if the data version and the
analysis software and the database version are being used correctly.
Each distinct SNP needs to be given a distinct identifier. Reference SNP identifiers or
rsIDs are known identifiers provided by dbSNP and InDel, and repeated identifiers are
also included in dbSNPs. The constant increase in SNPs makes it difficult to manage, and
many rsID identify the same SNP. In this case, the alias rsID and combined overlap refer-
ence are given. Therefore, if the rsID does not consider the alias rsID when searching for
dbSNP in articles and software, the search may yield to incorrect results. dbSNP search
results show the SNP history. For example, the web-based analysis tool SNAP (7 http://

www.broad.mit.edu/mpg/snap/) considers past and present aliases in dbSNP version com-

pared to the rsID list and also aids in analysis.
Some SNP bioinformatics analysis tools and databases only look for gene ID. Gene ID
and the same alias can cause version problems. The necessary information can be found
in the HGNC (HUGO gene nomenclature committee).
14 There are over 800 databases regarding human genome diversity; however, only few
are commonly used. The types of databases can be largely divided into common or rare
variations of databases containing variations with additional functions or information.
The largest common variation database is dbSNP from NCBI. dbSNP is a genome varia-
tion database that was created at the beginning of the discovery of common variations in
the human genome project. The discovery of common variations continues to increase
exponentially and as of 2016, 154,000,000 human genome variations and 782,000,000
variations in 53 different species have been listed. dbSNP provides resources for several
purposes as described subsequently:
55 Connects known variations and human genes
55 Provides sorting of known or new variations
55 Provides variation function and gene variations known in the vicinity
55 Designs of specific variation identification
55 Assessment of the validity and possibility of variation existence
55 Assessment of allele frequency in diverse populations
14.4 · PharmGKB
253 14
dbSNP combines variation information and gene information and related function,
comments in a browser. dbSNP provides search services of haploid prediction function,
sequence search such as snpBLAST including dbSNP, dbMHC, dbLRC, and dbRBC.
dbSNP also allows downloading and searching of large amounts of SNP data.
14.3 International HapMap Project
In order to comprehensively investigate common variations in the human genome, The

International HapMap Project collected allele frequency, linkage disequilibrium (LD),
and haploid genetic data. The HapMap Project was carried out in three stages. Stage 3 data
in 2009 contained 11 races, 1115 samples, and about 1,600,000 SNPs. This type of gene
data can be downloaded and can be viewed using the HapMap browser. HapMap can be
used in the following ways with no limitations:
55 Assessment of SNP allele frequency within the gene
55 Assessment and technology of Haplotype block
55 Assessment of recombination ratio and hotspot
55 Basic information regarding genotype imputation
55 Selection of SNPs for GWAS
55 Reference model for designing genotype identification and in vitro experiment
The goal of the 1000 genome project was to analyze the genome of 1000 people including
697 samples used in the HapMap. This project was first announced in 2009 and continues
to provide data on human genomic diversity of common and rare variations. Some SNPs
in ThaiSNP2, Taiwan Biobank, HGDP-CEPH Database, and ALFRED can be obtained
from dbSNP or HapMap. ALFRED is known to provide information on 664,292 variations
across 724 populations. Diverse specimens can be used to calculate population control
SNP frequencies or race origin tracking assessments of a SNP of interest.
Biomart, SPSmart, and others can be used to quickly provide SNP results correspond-
ing to user preference and can also arrange searches. Databases that provide useful infor-
mation regarding a specific variation combines literature and several different databases.
OMIM is a database that was manually put together by experts to search allele variation
and SNP ID. Similarly, the National Institutes of Health Genetic Association Database
(NIH GAD) provides race research, specific variation statistics, and research conclusions
from 40,000 search resources gathered from several different paths.
The GWAS catalog contains a large analyzed data set. Portions of GWAS results are
incomplete. There are efforts to increase GWAS data usage, and the NCBI Database of
Genotype and Phenotype (dbGAP) is playing a huge role. The results of GWAS are main-
tained by the National Human Genome Research Institute (NHGRI).
14.4 PharmGKB
PharmGKB was created at Stanford University in 2000. At that time, there was no file
format technology to save phenotypic and genotypic data from pharmacogenetics
research. Appropriate file format and search format were developed for the increase of
data and the results of the research. Not only gene-drug relationships but also genomic
ariations
variation-drug-disease relationships were informed through the PharmGKB information

flow system. The PharmGKB information flow system provides flow of one reference drug
and its effect or flow within a pathway. It also provides network with other genes, varia-
tions, diseases pertinent to the pathway, and the target of the drug. Targeting in addition
to genes, variations, and diseases related to the biological drug pathways ultimately pairs
it to the clinical outcome. In 2016, PharmGKB provides a database of 114 drug biological
pathways and its associations with 3634 drugs.
14.5 Genome-Wide Association Studies (GWAS)
GWAS is genome-wide scanning of genetic variants in many individuals to find com-

mon variations and genes involved in human disease [3, 4]. Previous research on GWAS
focused on linkage study1 78 and candidate genes of interest to reveal the correlation of
diseases and their genetic factor. A linkage study uses family history to vertically identify
a specific marker that has been inherited. However, detection resolution and the power
are very low to identify genetic risk factors [10]. Candidate gene research supplements this
issue by using biological backgrounds to identify candidates in order to investigate the
relationship between the variation and the corresponding disease. However, since candi-
date gene research ignores all gene regions aside from the chosen gene of interest, there is
a high chance of missing the region or identifying a false positive correlation [1, 5].
Because of the large population that GWAS analyzes, small or moderate size effect
detection using GWAS compared to linkage study has a higher statistical power.
Additionally, GWAS investigates entire genome markers; therefore, there is no reason to
select a marker in advance to research the gene of interest. Ultimately, we were able to
identify many diseases linked to new SNPs. For example, prostate cancer research could
not be easily reproduced in linkage or gene of interest studies. In GWAS, we identified >12
genetic variations that have high reproducibility [13].
However, despite active GWAS research, many diseases affect gene size with compara-
tively low heritability [7]. These so-called missing heritability issues are being settled by
the changes in perspective from a common disease common variant (CDCV) theory to a
common disease rare variant (CDRV) [6] as well as the application of NGS and other new
14 technologies and bioinformatics approaches of next generation GWAS research [14]. In
this chapter, CDCV-based GWAS is used only.
14.5.1 Allelic Test
The main idea of an association study is to investigate the differential allele frequency of
SNPs between case and control groups. When there is statistically, a significant difference in
the allele frequency, the given allele then has a significant relationship with the p
henotype.
1 In this chapter, an association study is considered separate from a linkage study. A linkage study is
used in order to explore trait-associated loci within families, and an association study is used in
order to explore an association between specific genotype and phenotype at the population level.
Therefore, they are co-operative. In other words, a linkage analysis can be used to identify related
gene or loci and an association study can be used for estimating a causal correlation level of specific
variants with a trait.
14.5 · Genome-Wide Association Studies (GWAS)
255 14
Using a chi square test, we can summarize the difference between expected allele and
observed allele frequency; this type of frequency analysis is called an allelic test, and the
test statistic is the allele. The genotypes aa, aA, and AA are paired into a-a, a-A, and A-A
alleles, and the relationship of a single allele and its phenotype will ultimately be tested.
The interaction or relationship between the alleles are ignored. For example, if there are
100 people the expected contingency table is as below.
allele a allele A
Case (aff ) 50 [Ea, aff ] 50 [EA, aff ]
Control (unaff ) 50 [Ea, unaff ] 50 [EA, unaff ]
The actual genotype test results are as reported below.
allele a allele A
Case (aff ) 25 [Oa, aff ] 75 [OA, aff ]
Control (unaff ) 75 [Oa, unaff ] 25 [OA, unaff ]
We use an independent chi square test to investigate the statistical difference between
the two distributions above.
n
( Oi - Ei )2
x2 = å
i =1 Ei
χ2 = χ2 distribution of testing statistic

Oi = observed frequency
Ei = (theoretically) expected frequency
n = the number of results possibly occurring in each event
n = 4 (allele combination between case and control group is a total of 4)
The chi square value is 50 and the associated p-value is <0.0001.2 Therefore, the marker
can be seen as statistically significant.
14.5.2 Genotypic Test
For a genotype test, a 2×3 contingency table is used instead of a 2×2 genetic contingency
table (isomorphic aa, heteromorphic Aa, isomorphic AA) used for an allelic test. Broadly,
there are three types of models as describe subsequently:
2 In an online chi-squared calculator (7 http://www.graphpad.com/quickcalcs/contingency1.cfm),

select “Chi-square without Yates’ correction, two-tailed”. X2 = (Oa,aff – Ea,aff )2/Ea,aff + (Oa,unaff –

Ea,unaff )2/Ea,unaff + (OA,aff – EA,aff 22/ EA,aff + (Oa,unaff – Ea,unaff )2 / EA,unaff = (25–50)2/50
+ (75–50)2/50 + (75–50)2/50 + (25–50)2/50 = 50.
ariations
.. Table 14.1 Multi hypothesis correction method used by PLINK
Field Description
UNADJ Unadjusted p-value
BONF Bonferroni single-step adjusted p-values
HOLM Holm (1979) step-down adjusted p-values
SIDAK_SS Sidak single-step adjusted p-values
SIDAK_SD Sidak step-down adjusted p-values
FDR_BH Benjamini & Hochberg (1995) step-up FDR control
FDR_BY Benjamini & Yekutieli (2001) step-up FDR control
55 Additive Model: Assume Two copy minor allele (ex, AA genotype) are twice as
effective as the single copy minor allele (such as Aa genotype). It is also called a
Cochran-Armitage tendency test.
55 Dominant Model: Assume there is a small phenotypic effect with at least one copy of
the minor allele (such as Aa or AA).
55 Recessive Model: Assume a phenotypic effect only happens with two copies of the
minor allele.
14.5.3 Multiple Testing Correction
In 7 Chap. 16, the example works with 83,534 SNPs. We test allele and four genotypes

for a total of 417,650 times (=5 × 83,534). Setting the significance level at p < 0.05, each
hypothesis test has a 5% false positive rate. As the number of comparisons increase, the
false positive rate also increases; this is known as the ‘multiple comparisons problem.3 To
decrease these false positive relationships, we can either adjust the p-value threshold or
14 adjust the p-value entirely, and this can be done using the PLINK program. PLINK uses
seven methods. . Table 14.1 explains each method and corresponding symbols.

14.6 Definition and Importance of Copy-Number Variation
The human genome has 23 pairs of chromosomes that are composed of paternal and
maternal genomes. This type of pair format is called diploid (2n) genetic information.
However, there are haploid (1n) or triploid (3n) states. Polyploidy or aneuploidy are also
quite common in cancer cell change and is called copy number alteration (CNA). However,
during the development of genomics, simple partial region copy-number variation has
been found to be commonly detected. This is called copy number variation (CNV).
3 Refer to 7 Chap. 5 7 Sect. 5.4 and 7 Chap. 6 7 Sect. 6.3 for the multiple comparisons problem in

microarray analysis.
14.7 · Analysis Methods of CNV
257 14
More recently, in addition to variation, copy-number variation is perceived to be poly-
morphic similar to that of SNP. This is called copy number polymorphism (CNP). It is
interesting that the initial theory has shifted from normal (diploid) to modified (abnor-
mality, CNA), to variation (exceptional diversity, CNV), and finally polymorphic (com-
mon polymorphism, CNP).
In early research concerning copy-number variation, about 100 variations per an indi-
vidual were detected. However, recent technological advancements have detected an aver-
age of about 1000 variations per individual, and that number is expected to increase. Since
it is known that copy-number variation can be inherited from parental genes, there have
been studies on how CNV can explain genetic diseases and the relationship to specific
phenotypes. Active genes could be involved in copy-number variation regions. In this
case, gene dose changes can be induced, and the chances of inevitable molecular biological
or pathological changes are high. In that aspect, there are high expectations that CNV can
supplement the biggest weakness of SNP. In a real example, research on childhood autism
related to copy-number variation is being done, and to date, there have been many causal
relationships in CNV etiology.
Early interest in SNP can be seen as a technological limit in that only a base or short
base sequence could be detected. Long structure variation detection was difficult and
meaningful interpretations were also difficult in the beginning. However, recent technol-
ogy has allowed us to detect variations of all lengths and detection accuracy continues
to improve. Compared to the 1% that SNP affects, CNV is known to affect over 10% of
the entire human genome and is vastly distributed. We expect that CNV may be more
important than SNP.
To date, there are not many disease cases that we can explain through CNV. Mutations
arising from reduction division have been known to cause disease. During mutation forma-
tion, genes with regions containing CNV may be severely damaged or fused to cause gene
function defects. Furthermore, in cases when the variation is large, many genes may be
affected and different symptoms may be observed. This is known as continuous gene syn-
drome. Known diseases in which CNV is known to affect a single gene in several diseases
include Duchenne/Becker muscular dystrophy, Type 1 fibrocystic cancer, tuberous sclerosis,
sotos syndrome, charge syndrome, Pelizaeus-Merzbacher disease, early onset Alzheimer’s
disease, autism, psoriasis, Crohn’s disease, and Parkinson’s disease. Diseases known to be
affected by continuous gene syndromes include DiGeorge and Williams syndromes.
In the beginning, using single-base polymorphisms to investigate cause of disease was
not fruitful, but we continued to find explanations to missing heritability. Within missing
heritability, we focused on CNVs due to the fact that it causes more gene sequence changes
than that of single base polymorphisms, which in turn explains diseases or specific phe-
notypes. We expect that about 10% of the genome contains variations and the other 1%
difference can be explained solely by the diversity of humans. Copy-number variation data
registered in the Database of Genome Variants has been significantly increasing since 2005.
14.7 Analysis Methods of CNV
There are two main methods in detecting CNVs: (1) using microarray chips (2) NGS
technology. Chip technology is more cost efficient than sequencing. It is also suitable for
detecting known variations. However the accuracy of chip technology is lower and detec-
tion of new variations is impossible. In repeating sequence regions, it is very difficult to
ariations
detect CNV using sequencing technology. If detecting CNV is successful, the resolution
is very low. Recently, many researchers have been using chip analysis first and sequencing
technology afterwards for result verification.
14.7.1 Chip Method to Detect CNV
Initial structure variation was detected via microarray chip aCGH (Array CGH) using
comparative genome hybridization technology: (1) the sample and the control genomes
are first dyed in a fluorescent substance (2) the two base sequences of samples and controls
bind competitively to the complementary sequence of a probe planted on the aCGH chip;
(3) the combination amount is quantified by the fluorescence intensity of fluorescence.
The weakness of this method, however, is its inability to accurately detect the start and end
point of the structural variation’s breakpoint. Recently, short sequence probe chips with
high density and high resolution have been used.
14.7.2 Sequencing Method to Detect CNV
There are two methods used in CNV detection: (1) paired-end mapping (PEM) (2) depth
of coverage (DOC) analysis.
A paired-end fragment sequence contains the distance information of a fragment
sequence pair from both ends of the genome of interest. We can use this genome informa-
tion in both mapping-linked and base sequence distance differences to calculate structural
variation. Additionally, we can accurately identify the beginning and end of the structural
variation since information on the linked region is already defined. When the fragment
sequence is mapped with the reference genome considering forward and reverse directions
similar to when a specific region is inverted or copied on the wrong side, we can find struc-
tural variation not available in chip based analysis. DOC base analysis uses mapping of
ratios onto the reference genome to detect CNVs. Assuming that mapping multiples are
equal in regions where copy-number is high, there will be a specific ratio in which the map-
ping multiple will increase and deletion sections will decrease; this difference can be quanti-
14 fied. The downfall to this type of analysis is that it is difficult to detect structure variation
in short sequences and to identify start and endpoints. Therefore, we use both methods
mentioned above to supplement the weakness of each method to detect structural variation.
14.8 Conclusion
In the last 10 years, genome expression analysis using large genome data, GWAS, and
copy-number variation analysis has allowed rapid development of our understanding of
gene to gene domains involved in human diseases. It has been a major starting point for
the start of genomic medicine and human health enhancement. GWAS yielded disap-
pointing experimental results of missing heritability, low effect size, and disagreeing phe-
notype. Furthermore, GWAS analysis was based on was the CDCV hypothesis but has
progressed to the CDRV hypothesis in which a few rare alleles affect diseases (. Fig. 14.3).

We hope that the NGS technology will allow us to examine and find variation causes,
including rare mutations, and ultimately redefine genome research.
Bibliography
259 14
High
Penetrance
Highly unusual
Rare variants common variants
causing Mendelian Influencing common
disease disease
Intermediate
Less common
variants
with intermediate
penetrance
Modest
Common variants
Private variants
Influencing common
hard to identify by genetic disease identified by GWA
means studies
Low
Private 0.1% Rare 1% Less common 5% Common

Variant class, Allele frequency
.. Fig. 14.3 Dangerous allele distribution frequency and level of penetrance
Take Home Message

55 The various types of genome variants, SNPs, CNV, and detection methods.
Bibliography
1. Hirschhorn JN et al (2002) A comprehensive review of genetic asso-ciation studies. Genet Med 4:
45–61
2. International HapMap Consortium (2003) The international HapMap project. Nature 426(6968):
789–796
3. Lander ES (1996) The new genomics: global views of biology. Science 274:536–539
4. Lander ES et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
5. Lohmueller KE et al (2003) Meta-analysis of genetic associa-tion studies supports a contribution of
common variants to susceptibility to common disease. Nat Genet 33:177–182
6. Manolio TA et al (2009) Finding the missing heritability of complex diseases. Nature 461:747–753
7. McCarthy MI et al (2008) Genome-wide association studies for complex traits: consensus, uncertainty
and challenges. Nat Rev Genet 9:356–369
8. Osier MV et al (2001) ALFRED: an allele frequency database for diverse populations and DNA polymor-
phisms—an update. Nucleic Acids Res 29(1):317–319
9. Purcell S et al (2007) PLINK: a tool set for whole-genome association and population-based linkage
analyses. Am J Hum Genet 81(3):559–575. Epub 2007 Jul 25
10. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science
273:1516–1517
ariations
1. Sherry ST et al (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311
1
12. Thorn CF et al (2013) PharmGKB: the pharmacogenomics knowledge base. Methods Mol Biol
1015:311–320. https://doi.org/10.1007/978-1-62703-435-7_20
13. Witte JS (2009) Prostate cancer genomics: towards a new understanding. Nat Rev Genet 10:77–82
14. Witte JS (2010) Genome-wide association studies and beyond. Annu Rev Public Health 31:9–20. 4 p
following 20
14
261 15
SNP Data Analysis

15.1.1 dbSNP – 262
15.1.2 International HapMap Project – 262
15.1.3 PharmGKB – 262
15.2 dbSNP – 263

15.2.1 rsID, ssID Search – 263
15.2.2 Entrez SNP Search – 264
15.3 1000 Genomes Project – 266

15.3.1 Data – 266
15.3.2 Browser – 268
15.4 PharmGKB – 270

15.4.1 Search PharmGKB – 270
15.4.2 Clinical Annotations – 270
15.5 Variant Analysis Using NGS – 271

15.5.1 Preparation – 271
15.5.2 Extraction of Disease-Associated Variants from Variants
Identified Through NGS – 272


https://doi.org/10.1007/978-981-13-1942-6_15
262 Chapter 15 · SNP Data Analysis

In this chapter, resources, including dbSNP, HapMap, PharmGKB and ANNOVAR that are
required for SNP data analysis and interpretation, are explained. We establish a strategy of
how to query these resources in order to answer our questions and obtain information.
Actual practice examples for each system are used.
15.1 Introduction
15.1.1 dbSNP
dbSNP is a basic SNP (single nucleotide polymorphism) search database. dbSNP was con-
jointly created by NHGRI and NCBI since December 2001. The 101 build version was the
first database. Currently, it is being run by the NCBI Entrez system (. Table 15.1). General

SNP analysis is used to search for existing SNPs registered in the dbSNP database. The
current version of dbSNP includes over ten million SNPs, each with a given rs number.
15.1.2 International HapMap Project
The International HapMap Project officially started in 2002 in Canada, China, Japan,
Nigeria, England, and the United States to establish the haploid human genome map.
Stage 1 was announced in 2005, followed by Stage 2 in 2007, and Stage 3 in 2009 (. Tables
15.2 and 15.3).
15.1.3 PharmGKB
PharmGKB is funded by the NIH PGRN (Pharmacogenomics Research Network) estab-

lished by Stanford University (. Table 15.4). Since the establishment of PharmGKB in

April 2000 online, it has become one of the most-used drug-related databases, along with
DrugBank. PharmGKB provides curated data for the relationships between genes, drugs,
and diseases; various information about the drug itself; and information on important
genes in drug-related pathways.
15
.. Table 15.1 Summary of dbSNP database
Organism Genome No. of #RefSNP #SNPs in #SNPs #SNPs with

build submissions clusters gene with frequency
genotype
Homo 38.3 557,939,960 154,206,854 89,404,961 73,917,935 130,169,906

sapiens
Total: 53 149 582,072,379 174,266,789 102,128,612 86,763,411 130,169,943

organisms genomes
15.2 · dbSNP
263 15
.. Table 15.2 Summary of the international HapMap project data
Phase 1 Phase 2 Phase 3
Sample & POP panels 269 samples 270 samples 1115 samples (11 panels)
(4 panels) (4 panels)
Genotyping centers HapMap international Perlegen Broad & Sanger

consortium
Unique QC + SNPs 1.1 M 3.8 M (phase I + II) 1.6 M (Affy 6.0 & Illumina 1 M)
.. Table 15.3 International HapMap project
Samples Genotyped QC + SNPs
71 ASW 1,543,731
162 CEU 1,398,396
82 CHB 1,342,348
70 CHD 1,312,343
83 GIH 1,409,510
82 JPT 1,294,974
83 LWK 1,527,403
71 MEX 1,453,659
171 MKK 1,532,587
77 TSI 1,420,526
163 YRI 1,494,330
.. Table 15.4 Summary of PharmGKB
Attribute #(number)
Gene 27,007
Chemicals 3634
Diseases 3518
Pathways 114
15.2 dbSNP
15.2.1 rsID, ssID Search
dbSNP is composed of two types of data: (1) data that users submitted (ssID) (2) data that
is maintained or adjusted by the system (rsID). These IDs are the basic ID values and the
format used when referring to SNP (. Fig. 15.1).

.. Fig. 15.1 The result page of rs380390 search
55 Submitted data: original observations of sequence variation (ex: ss5586300)

55 Computed/curated data: Reference SNP Clusters (Ref SNP) (ex: rs4986582)
1. Access dbSNP site: 7 http://www.ncbi.nlm.nih.gov/projects/SNP using web

browser.
2. In the “Reference cluster ID (rs#)” search box type rs380390 and click “Search”.
3. You can verify CFH gene related information on the results page.
15.2.2 Entrez SNP Search

15
dbSNP is integrated in the NCBI Entrez system. Therefore, it is possible to search using
queries such as PubMed or GenBank. It is easier to get results because you can get infor-
mation faster compared to when you search directly one by one.
1. On the left side of the search menu of the dbSNP screen click Entrez SNP.
2. You can search your desired gene variation by entering your desired gene.
3. Enter CFH gene.
4. For advanced search, click Advanced in the search box and click SNP Advanced
Search Builder.
5. Reference the chart below and select the field to search (. Table 15.5).

15.2 · dbSNP
265 15
.. Table 15.5 dbSNP Specialized fields
Field Tag Type Notes
Allele [ALLELE], [VARIATION], [VARI] IUPAC Observed allele(s)
Chromosome [CHR] Textnum Mapped chromosome number
Base position [CHRPOS], [BPOS] Integer Mapped chromosome position
Create build ID [CREATE_BUILD], [CBID] Integer SNP create build ID
Publication date [CREATE_DATE], [CDAT], Date SNP create/publication date

[PDAT], [PUBDATE]
Function class [FXN_CLASS], [FUNC] Text
Gene name [GENE], [GENE_SYMBOL] Textnum Locus link symbol
Genotype [GENOTYPE], [GTYPE] Boolean
Heterozygosity [HET] Integer
Local SNP ID [LOC_SNP_ID] Textnum Submitter local SNP ID
LocusLink ID [LOCUS_ID], [LID] Integer LocusLink ID number
Map weight [WEIGHT], [MPWT], [HIT] Integer SNP map weight info – the
number of times a SNP map to
the genome contig (range 1–10)
Method class [METHOD_CLASS], Text Assay method used to identify

[METHD], [MCLS] SNP
Accession version [ACC] Textnum Search by nucleotide or protein

accession and version number
Contig position [CTPOS] Mapped contig position
Reference SNP ID [RS] Integer
Submitter SNP ID [SS] Integer
SNP class [SNP_CLASS], [SCLS] Text
Submitter Handel [HANDLE] Text
Success rate [SUCCESS_RATE], [SRATE] Integer
Organism [ORGN], [TAX_ID] Text Organism name or taxonomy

ID number
Update build ID [UPD_BUILD], [UBID], Integer

[ORGN], [TAX_ID]
Modification date [UDATE], [UDAT], Date

[MODDATE]
Validation [VALIDATION] Text

15.3 1000 Genomes Project
The 1000 Genomes Project was planned during a meeting at The Wellcome Genome
Campus in September 2007. Taking advantage of developments in sequencing technology
and the reduced cost of sequencing, this project produced the largest public catalogue of
human variation and genotype data for various populations. The 1000 Genome Project
includes the initial HapMap data and the HapMap site has disappeared. In 2013, the 1000
Genomes Project released phase 3 and the recently version GRCh38 data was released.
15.3.1 Data
The 1000 Genomes Project data can be downloaded using the ftp of EBI and NCBI. Aspera
and Globus software can be used for faster and more reliable downloads. Additional portals
are available to categorize data according to sample, race, and release version (. Fig. 15.2)

and (. Table 15.6).

NCBI FTP Site route (7 ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp) is available on

the data site. Move the directory to look at data provided by 1000 Genomes Project. The
release directory is organized by version. You can access the latest phase 3 data through
20,130,502 directory version hg19. The data is organized by chromosome.
15
.. Fig. 15.2 Population and data information of each sample

15.3 · 1000 Genomes Project
267 15
.. Table 15.6 Population category in 1000 Genomes project
Popu- Description Super Sam- GRCh38 Phase Phase Platinum Structural

lation popu- ples 3 1 pedigree variation
lation
MXL Mexican ancestry AMR 107 O O O

in Los Angeles,
California
CLM Colombian in AMR 148 O O O

Medellin,
Colombia
PEL Peruvian in Lima, AMR 130 O O

Peru
GWD Gambian in AFR 180 O O

Western Division,
the Gambia
MSL Mende in Sierra AFR 128 O O

Leone
ESN Esan in Nigeria AFR 173 O O
TSI Toscani in Italy EUR 112 O O O
IBS Iberian popula- EUR 162 O O O

tions in Spain
PJL Punjabi in Lahore, SAS 158 O O

Pakistan
ITU Indian Telugu in SAS 118 O O

the UK
GBR British in England EUR 107 O O O

and Scotland
STU Sri Lankan Tamil SAS 128 O O

in the UK
YRI Yoruba in Ibadan, AFR 186 O O O O

Nigeria
LWK Luhya in Webuye, AFR 116 O O O

Kenya
CDX Chinese Dai in EAS 109 O O

Xishuangbanna,
China
CEU Utah residents EUR 183 O O O O

(CEPH) with
Northern and
Western
European
ancestry
CHB Han Chinese in EAS 108 O O O

Beijing, China
(continued)
Popu- Description Super Sam- GRCh38 Phase Phase Platinum Structural

lation popu- ples 3 1 pedigree variation
lation
JPT Japanese in EAS 105 O O O

Tokyo, Japan
GIH Gujarati Indian in SAS 113 O O

Houston, TX
ASW African ancestry AFR 112 O O O

in Southwest US
ACB African Caribbean AFR 123 O O

in Barbados
PUR Puerto Rican in AMR 150 O O O O

Peurto Rico
BEB Bengali in SAS 144 O O

Bangladesh
CHS Han Chinese EAS 171 O O O O

south
KHV Kinh in Ho Chi EAS 124 O O

Minh City,
Vietnam
CHD Chinese in EAS 0

Denver, Colorado
FIN Finnish in Finland EUR 105 O O O
15.3.2 Browser
In the browser page, data can be searched using Ensembl or 1000 Genomes Browser.
Ensembl is linked to the site and data can be retrieved by Human genome build version
15 GRCh37 and GRCh38. Project release version pilot and phases 1 and 3 can be seen in the
1000 Genome browser. The current main version is phase 3 (7 http://phase3browser.

1000genomes.org) (. Fig. 15.3).

Enter previous search ‘rs380390’ in the search bar. A simple summary regarding the
variation in addition to genomic context, genes and regulation, population genetics, indi-
vidual genotypes, linkage disequilibrium, phenotype data, citations, and etc. can be viewed
in the results page (. Fig. 15.4).

For example, for population genetics, information on allele frequencies of 2504 indi-
viduals from the 1000 Genome Project Phase 3 is available. We can verify that the average
MAF is 0.25 (G), EAS is 0.05(G), EUR is 0.4(G), and SAS is 0.29(G).
15.3 · 1000 Genomes Project
269 15
.. Fig. 15.3 1000 Genomes browsers by phases
.. Fig. 15.4 The result page of rs380390 search and detailed allele frequencies
15.4 PharmGKB
PharmGKB is a database that contains pharmacogenomic data. The database contains

information on gene-drug-disease relationships, drug information, pathways closely
related to drugs, and other important genes.
Go to the PharmGKB site. 7 www.pharmgkb.org

15.4.1 Search PharmGKB
The main page provides PharmGKB knowledge pyramid and news. In the upper search
PharmGKB search bar, we can search for gene, rsID, drug name, and disease name. In this
exercise, let’s search for the drug warfarin (. Figs. 15.5 and 15.6).

15.4.2 Clinical Annotations
Clinical information related to warfarin is categorized by tab. In the search screen, click
dosing guidelines. As shown in the figure below, the important points regarding warfarin
dosing are organized using published literature. We can also identify how genotype affects
dosing.
In the drug labels tab, information regarding drug labels from the Food and Drug
Administration (FDA), and Health Care Service Corporation (HCSC) are shared. In the
clinical annotations tab, information on drug-related genes and variations are listed. Level
15
.. Fig. 15.5 The main page of the pharmacogenomic knowledgebase

15.5 · Variant Analysis Using NGS
271 15
.. Fig. 15.6 The prescribing information of search drug ‘warfarin’
of evidence of PharmGKB is used to rate the correlation level from 1 to 4. To understand

a variation of interest, click ‘rs1057910’ among variations with high level of evidence
(. Fig. 15.7).

These are the search results of gene CYP2C9 variation rs105710. After logging in, we
can see additional detailed information of all clinical annotation including race, genotype,
and other factors (. Fig. 15.8).

In this database, VIP stands for very important pharmacogene and shows information
about the gene. The pathway tab shows diagrams of pharmacodynamic and pharmacoki-
netic effects of the drugs of interest (. Fig. 15.9).

15.5 Variant Analysis Using NGS
15.5.1 Preparation
In this exercise, you will practice SNP analysis in Linux with one single sample of the IBS
population from the 1000 Genomes Project. You will need to install ANNOVAR and
VCFtools. The download pathway is shown below; refer to Appendix B to see detailed
installation instructions.
55 ANNOVAR (7 http://annovar.openbioinformatics.org/en/latest/)

55 VCFtools (7 http://vcftools.sourceforge.net/index.html)

55 Data (. Table 15.7)

.. Fig. 15.7 The clinical annotation by genotypes
15.5.2 Extraction of Disease-Associated Variants from Variants

Identified Through NGS
15 15.5.2.1 Separation of Samples from a Multi-sample VCF File

One single file download from the 1000 Genomes Project contains 2504 patients’ data. In
this exercise, you will learn to extract the genetic information for one sample from a file
containing 107 IBS samples.
> zcat IBS.chr19.vcf.gz | more
..
#CHROM POS ID REF ALT QUAL FILTERINFO FORMAT
EUR_IBS_HG01500_M EUR_IBS_HG01501_F
19 60842 . A G 100 PASS AC=2;AF=0.000399361;AN=5
008;NS=2504;DP=19533;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.0
01;AA=.||| GT 0|0 0|0
273 15
.. Fig. 15.8 The significant related variant with high evidence level
Extract information for the sample “EUR_IBS_HG01612_F” using the command vcf-
subset from VCFtools, and verify the file.
> vcf-subset –c EUR_IBS_HG01612_F IBS.chr19.vcf.gz >
EUR_IBS_HG01612_F.vcf &
..
#CHROM POS ID REF ALT QUAL FILTERINFO FORMAT
EUR_IBS_HG01612_F
19 60842 . A . 100 PASS AA=.|||;AC=0;AF=0.000399
361;AFR_AF=0;AMR_AF=0.0014;AN=2;DP=19533;EAS_AF=0;EUR_AF=0;NS=2504;SAS_A
F=0.001 GT 0|0
Cell membrane
R-Warfarin S-Warfarin
NADH NAD+
EPHX1 VKORC1 Hydroxy-
Vitamin K1
Liver cell Vitamin K Vitamin K

(reduced) CYP4F2
(epoxidized)
GGCX
Functional Hypofunctional
F2 F2
F10 F10
F9 F9
GAS6 GAS6
F7 CALU F7
PROZ PROZ
BGLAP BGLAP
PROC PROC
PROS1 PROS1
MGP MGP
Clotting Bone Apoptosis

metabolism
& connective
Tissue calcification
.. Fig. 15.9 Warfarin pathway, pharmacodynamics
15
.. Table 15.7 List of data for practice
IBS.chr19.vcf.gz A VCF format file that contains gene information from 107 IBS samples
IBS_chr19.exonic.txt An annotation file including gene and mutation type for all variants
located on chromosome 19.
EUR_IBS_HG01612_F. A file for one single IBS samples “EUR_IBS_HG01612_F” out of 107, containing
chr19.vcf gene and mutation type for all variants located on chromosome 19.
RVIS_v3_12MAR16.txt A file containing RVIS scores, which was provided by the publication.
code A directory containing code files that will be used in the practice.
GDI A directory containing the code for extracting GDI scores.

275 15
15.5.2.2 Conversion to ANNOVAR Input Format
Since the VCF file is not an input format compatible with ANNOVAR, create an input file
using the following code:
> convert2annovar.pl EUR_IBS_HG01612_F.chr19.vcf –format vcf4 >
EUR_IBS_HG01612_F.step1
15.5.2.3 Extract Variants with a Minor Allele Frequency (MAF) >0.1%

For individual variants, you can confirm the MAF of each population based on reference
data. Using variants from the sample “EUR_IBS_HG01612_F”, extract those that also
have an MAF >0.1% in the European 1000 Genomes Project data.
> annotate_variation.pl –filter –dbtype 1000g2015aug_eur –build hg19 –out
EUR_IBS_HG01612_F.step1 EUR_IBS_HG01612_F.step1 $ANNOVAR_HUMANDB
–maf 0.0 01
The output files ‘output.hg19_EUR.sites.12015_08_dropped’ and ‘output.hg19_EUR.

sites.2015_08_filtered’ are generated. Variants with MAF >0.1% in the EUR population
are saved in the _dropped file, and rare variants with MAF <0.1% are saved in the _filted
file. The second column of each file is the MAF value. You can defind a MAF threshold and
use the –reverse option to extract variants with MAF <0.1%.
> head EUR_I BS_HG01612_F.step1.hg19_EUR.sites.2015_08_dropped
1000g2015aug_eur 0.96 19 91106 91106 A C hom 100

30569
1000g2015aug_eur 0.58 19 226776 226776 C T hom 100
22137
1000g2015aug_eur 0.47 19 240867 240867 A C hom 100
7512
1000g2015aug_eur 0.65 19 244421 244421 A G hom 100
20146
1000g2015aug_eur 0.65 19 244426 244426 C T hom 100
20139
1000g2015aug_eur 0.70 19 245631 245631 C T hom 100
8925
Run the following command to generate the input file “EUR_IBS_HG01612_F.step2” for
the next step.
> next_step.py EUR_IBS_HG01612_F.step1.hg19_EUR.sites.2015_08_dropped

15.5.2.4 Extraction of Variants in Splice Sites and Exonic Regions

This step will annotate the functionality of variants that are identified in an individual.
ANNOVAR categorizes variants according to protein sequence changes into ‘nonsynony-
mous’, ‘synonymous’, ‘frameshift insertion’, ‘frameshift deletion’, ‘stopgain’, and ‘non frame-
shift deletion’. Using the RVIS score, variants belonging to the categories of ‘missense’, ‘stop
gained’, ‘missense near splice’, ‘stop lost’, ‘splice-5’, ‘splice-3’, ‘stop gained near splice’, or
‘stop lost near splice’are considered functional variants.
> annotate_variation.pl –geneanno –dbtype refgene –outfile

–buildver hg19
Functional variants in splice site and exonic regions are saved in the file “EUR_IBS_
HG01612_F.step2.exonic_variant_function”.
> more EUR_IBS_HG01612_F.step2.exonic_variant_function
In this practice, you will learn to calculate the gene core using whole-genome sequencing
data, since many research institutes (including the 1000 Genomes Project) provide the
this data. Run the following command to generate the file “EUR_IBS_HG01612_F.step3”,
which will be the input for the next step.
> next_step.py EUR_IBS_HG01612_F.step2.exonic_variant_function
15.5.2.5 Adding Informational Annotations to Variants (PolyPhen2)

Let’s annotate variants with PolyPhen2 scores using ANNOVAR. The PolyPhen2 score is a
quantitative numerical value between 0 and 1. If the score of a variant is <0.15, the variant
is defined as benign; if the score is between 0.15 and 0.85, it is defined as possibly damag-
ing; and if the score is >0.85, the variant is defined as probably damaging. PolyPhen2
scores can be downloaded; refer to Appendix B for details.
15
> annotate_variation.pl –filter –dbtype ljb_pp2 –outfile
–buildver hg19
The output results are saved in two files, which “ EUR_IBS_HG01612_F.step3.hg19_ljb_

pp2_dropped” provides the list of variants annotated with PolyPhen2 scores. The score in
the 2nd row is the prediction score.
> head EUR_IBS_HG01612_F.step3.hg19_ljb_pp2_dropped

277 15
15.5.2.6 Perform the Entire Process Simultaneously
You can run all steps at once with one ANNOVAR command as follows:
> table_annovar.pl EUR_IBS_HG01612_F.step1 $ANNOVAR_HUMANDB
–buildver hg19 –out table –protocol refGene, 1000g2015aug, ljb_pp2
–operationa g,f,f
15.5.2.7 Calculation Over All Variants by Gene

The annotation process we performed above for calculating gene score uses only variants
identified in one individual sample. However, RVIS is a population-level score, based on
data from the whole population. Therefore, to calculate the RVIS score, we need to per-
form the same analysis process with any variants that are identified in at least one among
the 107 IBS samples. This analysis has already been done and all information files are
provided. The results file includes variants identified on chromosome 19.
> more IBS_chr19_exonic.txt
RVIS generates a regression line using the total number of variants that are identified in a
given gene relative to the number of common variants identified in the given gene.
Therefore, first you need to count total and common functional variants in each gene’s
coding region. In this practice, common variants are defined as having an MAF >0.1%,
and functional variants are defined as those classified with functional annotations of ‘non-
synonymous’, ‘frameshift insertion’, ‘frameshift deletion’, or ‘stopgain’.Count the number of
variants pergene using the following R code (. Fig. 15.10):

> R CMD BATCH code/01_count_var.R

> more IBS_variant_cnt.txt
Gene No_all_var No_functional_var

GRIN3B 34 22
COL5A3 25 13
C19orf6 10 3
15.5.2.8 RVIS (Residual Variation Intolerance Score) Calculation

The main purpose of RVIS is to quantify differences between observed variants and
expected variants. Methodologically, RVIS assigns the residual value to the given gene,
which is the distance from the gene to the regression line. Calculate RVIS scores using the
following R code:
> R CMD BATCH code/02_calculate_rvis.R

> more IBS_RVIS_score.txt
30
Sum of all common functional variants in a gene (Y)
25
KIAA 1683
20
15
10
RYR1
5
0
0 10 20 30 40 50 60
Sum of all variant sites in a gene (X)
.. Fig. 15.10 Distribution of tolerant/ intolerant variants
Gene RVIS_score
GRIN3B 1.97729539195145
COL5A3 -1.58064893054413
15 C19orf6 -2.51055613470339
A negative (−) RVIS score means the gene is intolerant of variation, having fewer observed
variants than expected. Apositive (+) RVIS score means the gene is tolerant of variation,
having a higher number of observed variants than expected. An intolerant gene is a poten-
tially pathogenic locus that may cause Mendelian disease, as the gene is highly conserved
and has a relatively lower chance of mutation.
15.5.2.9 RVIS Plotting

Use the scatterplot function in R to plot RVIS scores.
> R CMD BATCH code/03_plot_RVIS.R

Bibliography
279 15
Check the plot in the result file “Rplots.pdf ”. Different input data was used from the RVIS
publication, with differential distributions in the population, and with only limited chro-
mosomal regions was analyzed. The plot shows a distribution very similar to the published
results. The red dots represent intolerant genes (bottom 2%) and the blue dots represent
tolerant genes (top 2%).
15.5.2.10 Variant Prioritization Using the Hot-Zone Approach

(RVIS < 25%, PolyPhen2 > 0.95)
The hot-zone approach is prioritizes the variant level based on a gene score. It prioritizes
pathogenic variants by identifying predicted deleterious variants that have PolyPhen2
scores >0.95 which also belong to genes intolerant of variation that have RVIS scores in
the bottom 25% of genes. Let’s extract the list of hot-zone variants among those identified
in the EUR_IBS_HG01612_F sample we used above.
> python hot_zone.py

> more EUR_IBS_HG01612_F.hotzone
Exercises
[Exercise 1] - Confirm SNP number in Human Gene CYP2D6 and search information regarding SNP
rs1081003 at the dbSNP website.
[Exercise 2] - Set SNP_Class of Human chromosome 5 and 6 as SNP and filter Ventor SNP data on the
dbSNP website. Print the result.
[Exercise 3] - Prepare Problem 2 using Entrez Gene query on the dbSNP website.
[Exercise 4] - Search all drugs related to the drug Warfarin, if one patient’s rs2292566 variation allele is
A allele, search what phenotype will be present on the PharmGKB website.
[Exercise 5] - Investigate distribution of allele frequency of rs1799853 among different races, and
investigate what related phenotypes exist on the all websites.
Take Home Message

55 Learn the utility and feature of existing public resources (i.e., dbSNP, HapMap,
PharmGKB).
Bibliography
1. 1000 Genome Project – http://www.internationalgenome.org
scale sequencing. Nature 467(7319):1061–1073
3. Adzhubei I et al (2013) Predicting functional effect of human missense mutations using PolyPhen-2.
Curr Protoc Hum Genet Chapter 7:Unit7.20. https://doi.org/10.1002/0471142905.hg0720s76
4. Danecek P et al (2011) The variant call format and VCFtools. Bioinformatics 27(15):2156–2158. https://
doi.org/10.1093/bioinformatics/btr330. Epub 2011 Jun 7
5. dbSNP site – http://www.ncbi.nlm.nih.gov/projects/SNP
6. Hewett M et al (2002) PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res
30(1):163–165
7. Petrovski S et al (2013) Genic intolerance to functional variation and the interpretation of personal
genomes. PLoS Genet 9(8):e1003709
8. Petrovski S et al (2015) The intolerance of regulatory sequence to genetic variation predicts gene
dosage sensitivity. PLoS Genet 11(9):e1005492
9. PharmGKB – www.pharmgkb.org
10. Sherry ST et al (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311
11. The International HapMap Consortium (2005) A haplotype map of the human genome. Nature
437:1299–1320
12. The International HapMap Consortium (2010) The international HapMap consortium. Integrating
common and rare genetic variation in diverse human populations. Nature 467:52–58
13. Wang K et al (2010) ANNOVAR: functional annotation of genetic variants from high-throughput
sequencing data. Nucleic Acids Res 38(16):e164
15
281 16
GWAS Data Analysis

16.3 *.ped and *.map Files of the PLINK – 283

16.3.1 The genotypes.ped File – 283
16.3.2 The genotypes.map File – 284
16.4 Configuration of gPLINK and Other Programs

that Work with gPLINK – 285
16.5 Validation of the GWAS Data Set and Summary

of Statistics – 285
16.6 Filtering Out Data with a Threshold – 288
16.7 Basic Association Test – 291
16.8 Additive Genotypic Test – 292
16.9 Manhattan Plot – 294


https://doi.org/10.1007/978-981-13-1942-6_16
282 Chapter 16 · GWAS Data Analysis

In this chapter, we will analyze basic GWAS using gPLINK and HaploView, which are visual
interfaces of the PLINK software. We will perform appropriate GWAS data set selection,
obtain a summary of the statistics, and filter out data with a proper threshold. We will per-
form a GWAS and visualize the results with a Manhattan plot.
16.1 Introduction
We use PLINK1 software and HaploView and gPLINK in this practice. PLINK is an open
source tool for GWAS developed by the Broad Institute. HaploView is a tool to analyze
and visualize genetic information, especially haplotypes. gPLINK2 is a GUI (Graphic User
Interface) tool for visualization using PLINK and HaploView. gPLINK helps to easily
carry out command lines in the graphic interface.
16.2 Prerequisites
55 Install PLINK program. See the installation details in Appendix B. This chapter
explains simple installation based on Windows OS. JAVA is a prerequisite. The Java
download page is 7 https://java.com/en/download/

gPLINK could be downloaded using the URL below:

7 http://pngu.mgh.harvard.edu/~purcell/plink/gplink.shtml#down

55 Extract the PLINK.zip file and move to the C:\gda\ch16 directory.

Select [Start] - > [Program] - > [Accessories] - > [Command Prompt] and
execute the commands:
cd C:\gda\ch16
java –jar gPLINK.jar
55 Check whether gPLINK properly runs through the java virtual machine.
55 Haploview.jar could be downloaded using the URL below:
7 https://www.broadinstitute.org/haploview/downloads#JAR

The following procedure uses the menu option to set data directory (. Fig. 16.1).
“Project” -> “Option” -> “Browse”-> select C:\gda\ch16 directory.

Do not click this part of “SSH link ….”
Click ok to see the file list in the present directory in the left.
16 Check if the genotypes.ped and genotypes.map files exist.
1 Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW,
Daly MJ & Sham PC(2007) PLINK: a toolset for whole-genome association and population-based
linkage analysis. American Journal of Human Genetics, 81.
Package: PLINK (including version number)
Author: Shaun Purcell
URL: 7 http://pngu.mgh.harvard.edu/purcell/plink/

2 7 http://pngu.mgh.harvard.edu/~purcell/PLINK/gPLINK.shtml

16.3 · *.ped and *.map Files of the PLINK
283 16
.. Fig. 16.1 Execute gPLINK, and move to the practice directory
16.3 *.ped and *.map Files of the PLINK
genotypes.ped and genotypes.map are data consisted of 83,534 SNPs from 89 (44 case and
45 control) individuals with no familial relations, created for the purpose of this exercise.
The purpose of this exercise is to test the association of phenotype with genotype. The
GWAS data is saved in two text files.
16.3.1 The genotypes.ped File
genotypes.ped file is consisted of 89 rows by 167,077 columns, which contains pedigree

data, genotype data, and phenotype data. The description of columns is listed below.
Column 1 = Family ID
Column 2 = Individual ID
Column 3 = Paternal ID (zero for missing)
Column 4 = Maternal ID (zero for missing)
Column 5 = Sex
Column 6 = Phenotype (1=unaffected, 2=affected, and 0=missing)
Column 7, 8 = genotype pair of the first SNP1 (zero for missing)
Column 9, 10 = genotype pair of the second SNP2 (zero for means missing)
…
Column 457393, 457394 = genotype pair of the last SNP228694
In GWAS, there is a case in which paternal and maternal IDs are zero, but the individual
ID is 1 (. Fig. 16.2).

16.3.2 The genotypes.map File
The locus information of each SNP genotype is tabulated in the genotypes.map file
(. Table 16.1).

Maternal ID
Paternal ID Genotype Pair at SNP1 Genotype Pair at SNP6
JA19012 NA19012 0 0 1 2 G A G G G A T T C C T T G G 0 0 G G 0 0
Family ID Individual ID Sex Genotype Pair at SNP2 Genotype missing

Phenotype
.. Fig. 16.2 File format of genotypes.ped
.. Table 16.1 The detail of the genotypes.ped file
16 genotypes.ped
CH18526 NA18526 0 0 2 1 A A G G A A T T C C T T T G G G G G T .......
CH18524 NA18524 0 0 1 1 A A G G A A T T C C T T T G G G G G C .......
CH18529 NA18529 0 0 2 1 G A G G G A T T C C T T T G G G G G T .......
CH18558 NA18558 0 0 1 1 A A G G A A T T C C T T T G G G G G T .......
......
JA19012 NA19012 0 0 1 2 G A G G G A T T C C T T G G 0 0 G G 0 0 ........

16.5 · Validation of the GWAS Data Set and Summary of Statistics
285 16
genotype.ped
JA19012 NA19012 0 0 1 2 G A G G G A T T C C T T G G 0 0 G G 0 0
1 rs6681049 0 789870
1 rs4074137 0 1016570
1 rs7540009 0 1050098
1 rs1891905 0 1090080 genotype.map
1 rs9729550 0 1125105
1 rs3813196 0 1159244
1 rs6704013 0 1187454
.. Fig. 16.3 Relation between genotypes.ped and genotypes.map files
Column 1 = chromosome number

Column 2 = SNP ID
Column 3 = Genetic Distance (morgans)
Column 4 = physical base-pair position (bp)
. Figure 16.3 illustrates the relationship between genotypes.ped and genotypes.map.

16.4 Configuration of gPLINK and Other Programs

that Work with gPLINK
As described above, gPLINK is the graphic interface that enables users to use PLINK and
HaploView software. Thus, gPLINK options need to be configured in order to use PLINK
and HaploView.
1. “Project” -> “Configure”: it shows the set up options for PLINK and HaploView.
2. Select “Browse” of “Haploview path” and select*.jar file in C:\GWAS\PLINK (for
Linux OS).
3. In Windows, select “Browse” of “PLINK path” and, select PLINK1.07_windows.exe
under c:\GWAS\PLINK\. In Mac OSX, select PLINK1.07_mac_intel (. Fig. 16.4).
4. Complete configuration by clicking “OK”.
16.5 Validation of the GWAS Data Set and Summary of Statistics
Before analysis, check if the input data is formatted for the PLINK program
1. Select “PLINK” – > “Summary Statistics” – > “Validate Fileset” (. Fig. 16.5)

2. Select the “Standard Input” tab.

3. Select “genotypes”. The *.ped and *.map files in the C:\gda\ch16 directory are
automatically selected.
4. Enter “summary” for output file name
5. Clicking the “OK” button shows a window, a screenshot in . Fig. 16.6.
.. Fig. 16.4 The configuration page of gPLINK
16
.. Fig. 16.5 Set-up for validate fileset

16.5 · Validation of the GWAS Data Set and Summary of Statistics
287 16
.. Fig. 16.6 The command to execute gPLINK options in the PLINK software
.. Fig. 16.7 The result of validate fileset
6. When the window shows the PLINK command line as shown in . Fig. 16.6, click
“OK to run PLINK.”

7. The blue colored “R” indicates “calculation running.” When the calculation is
complete, a green colored “√” appears. The calculation takes about 5 min, but that
may vary depending on computers’ specifications.
8. See the log history in the “Log View” by selecting “summary.log” of Output files in
the “Operations Viewer” (. Fig. 16.7)
9. . Table 16.2 shows the summary.log file

GENO >1 means Genotype data Missingness test. There is no result because the current
option is there is no genotype data yet.
MAF <0: there is no SNPs filtered out as no SNP has MAF lower than zero. As men-
tioned above, check whether the data is correct.
.. Table 16.2 The detail of genotypes.map file.
1 rs3094315 0.792429 792429

1 rs6672353 0.817376 817376
1 rs4040617 0.819185 819185
1 rs2905036 0.832343 832343
1 rs4245756 0.839326 839326
1 rs4075116 1.04355 1043552
1 rs9442385 1.13726 1137258
........... ( skip )
16.6 Filtering Out Data with a Threshold
.. Fig. 16.8 Threshold option
Filtering options is required according to noise or missing values that can exist in the real
16 data (. Fig. 16.8).

1. Select “PLINK” –> “Summary Statistics” -> “Missingness”

2. Select “Standard input” tab
3. Select “Threshold” in the button. Enter a threshold for minor allele frequency,
maximum SNP missing rate, maximum individual missing rate, and Hardy Weinberg
equilibrium as shown in . Fig. 16.8 and click “OK.”

4. Enter “filter” as an output file name and click “OK.”

5. Select “filter.log” in “Operation Viewer” -> “filter” -> “Output files.” You can see the
result of filtering (. Fig. 16.9, . Table 16.3)

16.6 · Filtering Out Data with a Threshold
289 16
.. Fig. 16.9 The result of filtering
# filter.log file contents
.... (skip)
Before frequency and genotyping pruning, there are 83534 SNPs
89 founders and 0 non-founders found
0 of 89 individuals removed for low genotyping ( MIND > 0.1 )
19 markers to be excluded based on HWE test ( p <= 0.001 )
35 markers failed HWE test in cases
19 markers failed HWE test in controls
Writing individual missingness information to [ C:\gda\ch16\filter.imiss ]
Writing locus missingness information to [ C:\gda\ch16\filter.lmiss ]
Total genotyping rate in remaining individuals is 0.994427

858 SNPs failed missingness test ( GENO > 0.1 )
16993 SNPs failed frequency test ( MAF < 0.01 )
After frequency and genotyping pruning, there are 65787 SNPs
After filtering, 44 cases, 45 controls and 0 missing
After filtering, 89 males, 0 females, and 0 of unspecified sex
.. Table 16.3 The summary.log file
# summary.log file contents
.... (skip)
83534 (of 83534) markers to be included from [ C:\gda\ch16\genotype.map ]
89 individuals read from [ C:\gda\ch16\genotype.ped ]
89 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
44 cases, 45 controls and 0 missing
89 males, 0 females, and 0 of unspecified sex
Before frequency and genotyping pruning, there are 83534 SNPs
89 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 0.994427
16 0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 83534 SNPs
After filtering, 44 cases, 45 controls and 0 missing
After filtering, 89 males, 0 females, and 0 of unspecified sex

16.7 · Basic Association Test
291 16
55 mind (MIND): It returns removed sample that have missing genotypes > threshold of
10%. There is no sample removed in this exercise.
55 geno (GENO): it returns SNPs that have a missing genotype greater than a threshold
of 10%.
55 mal (MAF): this option is to filter out the SNPs that have a MAF lower than a
threshold of 0.01 (=1%)
16.7 Basic Association Test
We will perform basic association to identify alleles associated with phenotypes. This pro-
cess filters out SNP and samples that have low MAFs and high missing values.
1. Select “PLINK” - > “Association” -> “Allelic Association Tests”
2. Select “Standard Input” tab.
3. Select “Threshold” and enter the same filtering option of the previous one as shown
in . Fig. 16.8.

4. Enter “genotypes_allelictest” as an output file and click the “OK” button

(. Fig. 16.10).

5. Right click at “Operations Viewer” -> “genotypes_allelictest” -> “Output files” ->
genotypes_allelictest.assoc and select “Open in Haploview” (. Fig. 16.11).

.. Fig. 16.10 Allelic Association Tests

.. Fig. 16.11 The result of Allelic tests
CHROM: the number of chromosomes

MARKER: the SNP id
POSITION: the physical locus of SNP
A1: the haplotype of Allele 1
F_A: the personal frequency when phenotype is “Affected”
F_U: the personal frequency when phenotype is “Unaffected”
A2: the haplotype of Allele 2
CHISQ: chi-square test statistic (1 df)
P: p-value
OR: the odds ratio.
※ the total data are ordered by the selected row.
16
16.8 Additive Genotypic Test
We perform the additive genotypic test using the Cochran-Amitage trend test (. Fig. 16.12).

1. Select “PLINK” - > “Association” -> “Genotypic C/C association tests”.

2. Select “Standard Input” tab.
3. Select “Permute trend test” and “Adjusted p-value”
4. Click “Threshold” and enter the same options that were assigned above as shown
in . Fig. 16.13

5. Enter “genotypes_trendtest” as an output file and click the “OK” button.

16.8 · Additive Genotypic Test
293 16
.. Fig. 16.12 Cochran-Armitage trend test
6. After completing the test, click the right button of the mouse at “Operation Viewer”
-> “genotypes_trendtest” -> “Output files” -> “genotypes_trendtest.model.trend.
adjusted” and click the “Open in.”
7. Right click on “Operation Viewer” - > “genotypes_trendtest” - > “Output files” - >
“genotypes_trendtest.model.trend.adjusted”, and then, click the “Open in Haploview.”
8. Read the result table of the genotypes_trendtest.model.trend.adjusted file as shown in
. Fig. 16.14.
9. Open the genotypes_trendtest.model to check the result of each genetic model for
SNP rs11830226, which is the SNP in the first row. Enter “rs11830226” in the “Specify
Marker” text box and click the “Prune Table” button (. Fig. 16.15).

ALLELIC, TREND (=addictive test) are shown to be significantly associated, while GENO
(=basic genotypic), DOM (=dominant model), REC (=recessive model) are not.
.. Fig. 16.13 Open the trend test result
16.9 Manhattan Plot
The Manhattan plot is a scatter plot used to visualize data with a lot of variables, none of
which are zero. This plot is useful to visualize a few SNPs with low P-value from GWAS
concurrently. The chromosomes are plotted on the X-axis, and the –log10 (P-values) are
plotted on the Y-axis.
1. Move to the genotypes_trendtest.model.trend.adjusted window.
2. Click the “plot” at bottom.
16 3. Enter “AssocTest” for Title, select “Chromosomes” for X-Axis, “”FDR_BH “for
Y-Axis, and “-log10 “for Scale (. Fig. 16.16).

4. Click the “OK” button and visualize the results as a Manhattan plot (. Fig. 16.17).

16.9 · Manhattan Plot
295 16
.. Fig. 16.14 The result file ‘genotypes_trendtest.model.trend.adjusted’
.. Fig. 16.15 The result for each genetic model

.. Fig. 16.16 Set up the plot options
16
.. Fig. 16.17 Manhattan plot
Exercises
[Exercise 1] - How many markers are filtered out when “Hardy-Weinberg” option is selected in “PLINK” -
> “Summary Statistics” menu?
[Exercise 2] - Perform association tests using diverse genetic models such as –model-gen, −model-
dom, and -model-rec from the “PLINK” - > “Association” -> “Genotypic C/C association tests” menu.
Bibliography
297 16
Take Home Message
55 How to analyze and use GWAS data with gPLNK and visualize with HaploView.
Bibliography
1. Barrett JC et al (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics
21(2):263–265. Epub 2004 Aug 5
2. Purcell S et al (2007) PLINK: a toolset for whole-genome association and population-based linkage
analysis. Am J Hum Genet 81(3):559–575. Epub 2007 Jul 25
299 17
CNV Analysis
17.3 Data – 301

17.3.1 Normalization – 301
17.4 Genomic Alteration Detection – 302

17.4.1 Single & Multi-sample Segmentation and Allele-specific
Segmentation – 302
17.4.2 Identification of Gain and Loss at Genomic Position – 304
17.5 Visualization – 308

17.5.1 Analysis of the Differences Between Samples
Using the Heatmap – 308
17.6 Obtaining Genomic Regions – 309

https://doi.org/10.1007/978-981-13-1942-6_17
300 Chapter 17 · CNV Analysis

Recent studies have shown important results in cancer in which somatic copy-number
alteration(s) activates oncogenes, inactivates tumor suppressors, and have been implicated
in cancer diagnostics and therapy. In this chapter, you will understand and analyze copy-
number alteration data using R packages.
17.1 Introduction
Humans are generically different, and they have various genome mutations. The charac-
teristic genome mutation of each person affects not only various traits such as blood type,
height, and weight, but also the pathogenesis of complex diseases such as malignant
tumors, hypertension, and diabetes. SNPs, CNVs, and occurrence in a wider genomic
region contribute to a genetic diversity. CNV is expected to demonstrate a high possibility
of being related to various disease sensibility. CNV is a modification that has hundreds ~
millions base pairs, which are deleted or amplified and accordingly results in changing
gene copy number existing within the CNV region.
17.2 Prerequisites
55 R programing installation: Install the R version of 3.0X or higher. Refer to the

Appendix B Sect. B.1.
55 Bioconductor installation: Install analysis package of Bioconductor. Refer to the
In this practice, you will conduct a basic CNV analysis using the R package: Copynumber, which
was published in BMC genomics, 2012. . Figure 17.1 briefly describes the analysis pipeline.

Data preprocessing Segmentation Visualization

Copy number Outlier handling Individual segmentation Whole-genome plots
data (+ allele winsorize(...) of one or more samples plotHeatmap(...)
frequencies) pcf(...) plotGenome(...)
Missing value plotCircle(...)
Imputation Joint segmentation plotFreq(...)
ImputeMissing(...) of multiple samples
multipcf(...) Chromosome plots of
data and segments
Segmentation of plotSample(...)
17 SNP-array data plotChrom(...)
aspcf(...) plotAllele(...)
Diagnosis plot
plotGamma(...)
Analysis pipeline
.. Fig. 17.1 Outline of analysis pipeline

17.3 · Data
301 17
17.3 Data
In this chapter, we use a arrayCGH data for eight lymphoma patients and 21 biopsy samples.
> source(“www.bioconductor.org/biocLite.R”)
> biocLite(“copynumber”)
> library(copynumber)
You can confirm several points: (1) the data consists of chromosome, median base pair,
and the value of each patient and (2) the data has 3091 rows and 23 columns. The first two
columns of the data file indicates chromosome and median base pair, respectively, and the
subsequent columns indicate the copy number measurements for the 21 samples.
> data(lymphoma) # Load lymphoma patient data set
> head(lymphoma) # Check first 6 rows
> str(lymphoma) # Check the overall data structure
17.3.1 Normalization
> lymph.wins <- winsorize(data = lymphoma, verbose = FALSE) # Check the outlier of
each value and assign the data to the new variable in order to modify a data
> wins.res <- winsorize(data = lymphoma, return.outliers = TRUE, verbose = FALSE)
# Process the data with TRUE option
> head(wins.res) # Confirm the first 6 rows of the data
Winsorize function provide a return outlier as a parameter. Using this parameter, you can
check position and status of patients in the outliers. After assigning return, the outlier with
“TRUE” parameter to wins.res variable, wins.res variable has two data frame, wins.data
and wins.outlier, wins.outlier returns samples and probes in the outliers.
> lymphoma[1:10, 1:8]
> wins.res$wins.outliers[1:10, 1:8]
> wins.res$wins.data[1:10, 1:8]

These commands return positions in outliers by each sample. 1 value is higher outlier and
−1 value is lower outlier. 0 value is another value, not outlier.
17.4 Genomic Alteration Detection
17.4.1 Single & Multi-sample Segmentation

and Allele-specific Segmentation
Now, let’s practice data segmentation of 21 samples. The ‘pcf ’ function is used for segmen-
tation of single sample data and ‘multipcf ’ function is used for segmentation of multi
patient data.
Here is a segmentation of single sample data.
> single.lymphoma <- subsetData(data = lymphoma, sample = 1) # Load only a data of
one sample X01.B01 from the total sample data.
> head(single.lymphoma)
‘pcf ’ function is performed after setting the gamma value, which is a penalty parameter
that will be applied to CNV as 12 to use in the single segmentation with ‘lymph.wins’ data
that is produced through the above described normalization.
CNV single segmentation plot can be confirmed with ‘plogGenome’ function in that
order (. Fig. 17.2).

> single.seg <- pcf(data = lymph.wins, gamma = 12, verbose = FALSE)
> plotGenome(data = single.lymphoma, segments = single.seg, sample = 1, cex = 3)
> multi.seg <- multipcf(data = lymph.wins, verbose = FALSE) # Use ‘multipcf’ function
for multiple segmentation of the total 21 samples
> head(multi.seg)
> plotChrom(data = lymph.wins, segments = multi.seg, layout = c(3, 1), chrom = 1)
17
Using ‘multipcf ’ function, you can confirm the p- and q-arms and new dataset generated
including probe information. You can see the distribution of 21 patients by chromosome
position. Assign value of gamma variable in multipcf function. Gamma is a penalty vari-
able by assigning a number to segmentation. If the gamma value is high, the segmentation
value becomes low. If there is no value for gamma, you can see distribution of segmenta-
tion at an interval of 10, which is the division of 100 by gamma value = 10. In addition, as
17.4 · Genomic Alteration Detection
303 17
X01.B2
2 4 6 8 10 12 14 16 18 20 22
0.5
Log R
0.0
− 0.5
1 3 5 7 9 11 13 15 17 19 21 23
.. Fig. 17.2 CNV single segmentation plot
assigned variable, chromosomes 1–23, you can draw a plot showing a distribution of
patients by chromosome (. Fig. 17.3).

Now, let’s draw a plot of allele-specific segmentation using logR and BAF data. Gener-
ally, allele-specific segmentation cannot be confirmed through array CGH. Therefore,
SNP array data can be used in the analysis.
Load refreshed data using the data command: (1) logR is data with log-scaled value of
the total copy-number (In general, copy number analysis supersedes log-scaled value) and
(2) B-allele frequency (BAF) provides information for confirmation of an allelic imbal-
ance for each individual patient (B/(A + B)).
When two data sets have the same structure, the first and second columns are chromo-
some and probe position and S1 or S2 indicates the designated samples.
> data(logR) # Load logR data
> head(logR) # Return the first 6 rows of logR data
> data(BAF)
> head(BAF)
You perform the data normalization process and use the aspcf function for allele specific
segmentation.
Chromosome 1
X01.B1
0.5
Log R
0.0
− 0.5
X01.B2
0.5
Log R
0.0
− 0.5
X01.B3
0.5
Log R
0.0
− 0.5
.. Fig. 17.3 plotChrom using the gamma variable
> logR.wins <- winsorize(logR, verbose = FALSE)
> allele.seg <- aspcf(logR.wins, BAF, verbose = FALSE)
> head(allele.seg)
Likewise, use ‘plotAllele’ to draw an allele specific plot in order to present logR of chromo-
somes in individual patients and segmentation including BAF information (. Fig. 17.4).

> plotAllele(logR.wins, BAF, allele.seg, sample = 1, chrom = c(1:4), layout = c(2, 2))
# Set the option to draw multiple plots for four chromosomes (1, 2, 3, and 4) in a page.
17
17.4.2 Identification of Gain and Loss at Genomic Position
Using the same data used above, practice plotting a frequency graph by designating gain
and loss conditions. At this point, it is necessary to designate the threshold of gain and
loss. In this practice, set the threshold of gain as 0.2 and loss as 0.1 (. Fig. 17.5).

305 17
S1
Chromosome 1 Chromosome 2
0.5 *** ***** 0.5
logR
logR
0.0 0.0
**** * * * ** **
0.8 0.8
0.6 0.6
BAF
BAF
0.4 0.4
0.2 0.2
Chromosome 3 Chromosome 4
0.5 0.5
** *
*
logR
logR
0.0 * 0.0
*
***
0.8 0.8
* *
0.6 0.6
BAF
BAF
*
0.4 * 0.4
0.2 0.2
.. Fig. 17.4 Four allele specific segmentation plots in four chromosomes, for one individual
> lymphoma.res <- pcf(data = lymph.wins, gamma = 12, verbose = FALSE)
# Do segmentation on lymphoma data with gamma value of 12
> plotFreq(segments = lymphoma.res, thres.gain = 0.2, thres.loss = -0.1)
# Draw a frequency plot

Thresholds = [–0.1,0.2]
2 4 6 8 10 12 14 16 18 20 22
% with gain or loss
75
50
25
0
25
50
75
1 3 5 7 9 11 13 15 17 19 21 23
.. Fig. 17.5 Frequency plots with gain value of 0.2 and loss value of −0.1
Draw a cirrus plot, showing chromosomes in a circular form
> chr.from <- c(2, 12, 4)
> chr.to <- c(14, 21, 17)
> pos.from <- c(168754669, 847879349, 121809306)
> pos.to <- c(6147539, 301955563, 12364465)
> cl <- c(1, 1, 2) # color
> arcs <- cbind(chr.from, pos.from, chr.to, pos.to, cl)
> plotCircle(segments = lymphoma.res, thres.gain = 0.15, arcs = arcs)
The above described variables described above are used to set the default value of the line that
shows correlation in the cirrus plot, including the start and end positions of chromosome
and color, and use plotCircle to visualize the data with the default parameter (. Fig. 17.6).
Apply this to all chromosomes. Set a threshold of correlation and draw correlation
graphs showing any correlations among chromosomes.
> multiseg <- multipcf(lymphoma)
> nseg = nrow(multiseg)
> cormat = cor(t(multiseg[ , -c(1:5)]))
17 > chr.from <- c()
> pos.from <- c()
> chr.to <- c()
> pos.to <- c()
> cl <- c()
> thresh = 0.7 # Threshold value

307 17
Y 1
X
22
21 2
20
19
18 3
17
16
4
15
14 5
13
6
12
7
11
10 8
9
.. Fig. 17.6 Cirrus graph showing correlation of copy number aberration within chromosomes
> for (i in 1:(nseg -1)) {
for (j in (i + 1):nseg) {
if (abs(cormat[i, j]) > thresh && multiseg$chrom[i] !=
multiseg$chrom[j]) {
chr.from = c(chr.from, multiseg$chrom[i])
chr.to = c(chr.to, multiseg$chrom[j])
pos.from = c(pos.from, (multiseg$start.pos[i] + multiseg$end.pos[i])/2)
pos.to = c(pos.to, (multiseg$start.pos[j] + multiseg$end.pos[j])/2)

if(cormat[i, j] > thresh){
cl < -c(cl, 1) #class 1 for those with positive correlation
}else{
cl <- c(cl, 2) #class 2 for those with negative correlation
> arcs <- cbind(chr.from, pos.from, chr.to, pos.to, cl)
> plotCircle(segments = lymphoma.res, thres.gain = 0.15, arcs = arcs, d = 0)
After running the code, we can see the orange line for the positive correlation and the blue
line for the negative correlation as shown below (. Fig. 17.7).
17.5 Visualization
17.5.1 Analysis of the Differences Between Samples

Using the Heatmap
Use a heatmap, plotHeatmap function, to show differential copy number alteration

between each patient. In this function, input is the segmentation data of 21 patients, and
high and low copy number values can be set by ‘upper.lim’. The default set colors are red
and blue, for copy number values >0.3, the color is coded as dark red and for copy number
values <−0.3, the color is coded as dark blue (. Fig. 17.8).
> plotHeatmap(segments = lymphoma.res, upper.lim = 0.3)
This heatmap shows the tendency of copy number alteration of 21 patients. Draw an aber-
17 ration plot using gain and loss values set above (. Fig. 17.9).
> plotAberration(segments = lymphoma.res, thres.gain = 0.2)

17.6 · Obtaining Genomic Regions
309 17
Y 1
X
22
21 2
20
19
18 3
17
16
4
15
14 5
13
6
12
7
11
10 8
9
.. Fig. 17.7 plotCircle with the threshold value of 0.7
17.6 Obtaining Genomic Regions
In the plot shown above, define an interesting region and extract genes belonging in this
region (. Fig. 17.10).

> minimum.mean <- min(lymphoma.res[lymphoma.res$chrom ==
14, ]$mean)
> min.14.lymphoma.res <- lymphoma.res[(lymphoma.res$chrom == 14) &
(lymphoma.res$mean == minimum.mean), ]
> plotFreq(segments = min.14.lymphoma.res,thres.gain = 0.2, thres.loss =
-0.1, chrom=c(14))
Limits = [–0.3,0.3]
2 4 6 8 10 12 14 16 18 20 22
X09.B3
X09.B2
X09.B1
X08.B3
X08.B2
X08.B1
X07.B3
X07.B2
X07.B1
X06.B2
X06.B1
X05.B3
X05.B2
X05.B1
X04.B2
X04.B1
X03.B2
X03.B1
X01.B3
X01.B2
X01.B1
1 3 5 7 9 11 13 15 17 19 21 23
.. Fig. 17.8 Heatmap showing difference of CNV
Limits = [–0.2,0.2]
2 4 6 8 10 12 14 16 18 20 22
X09.B3
X09.B2
X09.B1
X08.B3
X08.B2
X08.B1
X07.B3
X07.B2
X07.B1
X06.B2
X06.B1
X05.B3
X05.B2
X05.B1
X04.B2
17 X04.B1
X03.B2
X03.B1
X01.B3
X01.B2
X01.B1
1 3 5 7 9 11 13 15 17 19 21 23
.. Fig. 17.9 Abberation plot using predefined loss and gain threshold
17.6 · Obtaining Genomic Regions
311 17
Thresholds = [–0.1,0.2]
Chromosome 14
100
% with gain or loss
75
50
25
0
25
50
75
100
.. Fig. 17.10 A plot showing the region that had the lowest copy number value in chr14
For example, the above code draws a plot of the region that has lowest copy number value
in chromosome 14, which was plotted in 7 Chap. 5. The result is as follows.

> min.14.lymphoma.res #Confirm the location of corresponding region.
> min.14.lympoma.res
sampleID chrom arm start.pos end.pos n.probes mean
879 X06.B1 14 q 102399698 106227119 5 -0.406
After checking the corresponding region, from 102,399,698 to 106,227,119 of chromo-

some 14, you can investigate genes belonging to this specific region using various meth-
ods.
One of the easiest methods to use is the UCSC genome browser. Type the region,
“chr14:102,399,698–106,227,119”, in the Genome Browser and then you will see all the
genes in this region (. Fig. 17.11).
For example, the TRAF3 gene in this region, is translated into a key protein in the
hymphtoxin-bera receptor signaling complex.
Take Home Message

55 How to detect genomic alteration in cancer
55 How to visualize differences between samples
.. Fig. 17.11 Screen shot of genome browser search result
Bibliography
1. Komura D, Shen F, Ishikawa S et al (2006) Genome-wide detection of human copy number variations
using high density DNA oligonucleotide arrays. Genome Res 16(12):1575–1584
2. Nilsen G, Liestol K, Lingjaerde OC (2013) Copynumber: segmentation of single- and multi-track copy
number data by penalized least squares regression. R package version 1.14.0
3. UCSC Genome Browser – https://genome.ucsc.edu
17
313 V
Metagenome and
Epigenome, Basic
Data Analysis
Contents
Chapter 18 Metagenome and Epigenome Data Analysis – 315
Chapter 19 Metagenome Data Analysis – 325
Chapter 20 Epigenome Database and Analysis Tools – 339
Chapter 21 Epigenome Data Analysis – 353

315 18
Metagenome
and Epigenome
Data Analysis
18.1 Metagenome – 316
18.2 Epigenome – 318

18.2.1 DNA Methylation – 318
18.2.2 Histone Modification – 319
18.2.3 Non-coding RNA (ncRNA) – 320
18.3 Epigenome Databases and Analysis Tools – 320

18.3.1 DNA Methylation Databases and Analysis Tools – 320
18.3.2 Histone Modification Databases and Analysis Tools – 321
18.3.3 Noncoding RNA Databases and Analysis Tools – 321
18.4 Epigenome Analysis – 322

https://doi.org/10.1007/978-981-13-1942-6_18
316 Chapter 18 · Metagenome and Epigenome Data Analysis

The development of NGS technology has fostered the innovative development of
metagenomics and epigenomics methodology. This chapter introduces metagenomics
and epigenomics and examines their extensive academic application possibilities.
Various bioinformatics databases and analysis tools for metagenomics and epigenomics
will be introduced.
18.1 Metagenome
A microorganism lives and adapts to various environments such as 1000 m deep sea water
or hypoxic high-level altitude. Microorganisms are considered essential components for
normal functioning of Earth’s ecosystems. Microorganisms were first reported by the
Dutch scientist Leeuwenhoek in 1683. He observed the shape of bacteria with a micro-
scope for the first time. Since then, research with pure cultures was done by the German
scientist, Schrader.
Previously, microbiologists were only able to do research with the cultivable micro-
organisms in the lab since it is impossible to do research without cultures. Therefore,
a known microorganism species is estimated as only the tip of an iceberg compared to
the total species of unknown microorganisms. An actual microorganism lives through
the interaction between the same or different kinds of species by colonizing in various
inhabitable environments, but the pure culture environment does not mimic this actual
environment. Thus, with the variety of microorganisms existing in the natural world and
proper understanding of the ecological aspect, a different macroscopic approach was
required in order to effectively understand these parameters.
Metagenome research has emerged as an innovative method to overcome the funda-
mental limitation of microorganism research. These limitations include extracting the total
DNA, which is the genetic material of microorganism existing in the natural world, con-
structing the library, and investigating the new genetic material using the NGS technique.
In this case, the origin of genetic material, which is bacterial, cannot be identified, but all
genetic sources can be inversely obtained and the genome sequence of origin bacteria can
also be reconstructed with the latest bioinformatics technique. Metagenome is defined
as all microorganisms’ genome sets existing in the specific environment, and it is a field
that researches the collective genome unit of functional and structural diverse of exist-
ing microorganisms in the natural world. Metagenomic research subjects can be obtained
from every environment such as the soil, ocean, river, and swamp that exists in the natural
world and it also can be obtained from the various organs of the animal and humans.
The first metagenomic study is represented by the creation of a 10 Kb DNA from the
genome of all the microorganisms that exist in the Pacific Ocean, cloned into E. coli, and
constructed into a library by Schmidt in 1991. NGS technology used in the metagenomic
analysis constructs a library by reading the longer sequences rather than mapping to the
18 reference genome that is done by reading the short sequence. That is a reason for fully
preserving the gene sequence of various microorganisms without the information about
each species’ genome.
In . Fig. 18.1, the generalized process of metagenome research is schematized. Sample

collection, as the first step, should collect enough of the sample from the proper place
for the proper purpose. At present, metadata for metagenomics is required, and it has to
describe the detailed information that is related to the collecting environment that could
18.1 · Metagenome
317 18
.. Fig. 18.1 Metagenome
research process
Sample collecting
DNA/RNA Extraction
Library Construction
Screening
be required when the useful candidate matter is discovered through the metagenomic
analysis. Extracting DNA/RNA, which is the second step, should eliminate the impuri-
ties from the obtained sample. This step uses either the chemical method, using various
enzymes to eliminate the impurities, or a physical method. The physical method is more
useful for the actual meta genomic analysis, but it has its disadvantage of causing genome
segmentation.
RNA extraction is harder than DNA extraction due to the structural instability of
RNA. The following step is a library construction step, which constructs a library by using
the amplification of various cloning vectors such as Bacteriophage, cosmid, forsmid, and
BAC. Metagenomics creates a library that has a big insert, and it is advantageous for pro-
ducing novel products through a phylogenetic analysis or an antibiotic selection. The final
screening step varies depending on the functional metagenomic or base sequence analyses.
Metagenomics is a field in which technical advances have appeared earlier than the
complete scientific understanding of the field. It has a shorter history compared to classi-
cal molecular biology. Craig Venter, who is well-known for The Human Genome Project,
has initiated an advanced research concept called ‘Ecosystem Sequencing.’ It applies whole
genome shotgun sequencing into a metagenomic library of samples collected from the
waters around Bermuda in order to determine the hundreds of millions base sequence.
The report has been entitled “Environmental Genome Shotgun Sequencing of the Sargasso
Sea” and was published in Science in 2004. This publication also reported that 1800 spe-
cies of microorganisms have been identified, more than 150 new species and more than
1.2 million novel genes have been discovered.
Brady, Gillespie, Diaz-Torres, Beja et al., have extracted various new antibiotics through
the metagenome research of the ocean, soil, oral cavity, and inner intestine. They have
discovered the new gene showing the resistance to the specific antibiotic, and we are now
even in the step to apply several useful enzyme resources to medical industry. Turnbaugh
et al. have claimed that the cause of obesity is the result from various microorganism
group that coexists in the intestine, indicating that obesity not only originates as a result
of indigenous genetic predisposition but also by the various microorganism infections
that exist around us. Such obesity-related microorganisms may increase the infectious
probability between people who often come into contact with these microorganisms. This
will eventually increase the probability that the obesity-related microorganisms would
live in the intestines of that individual his/her family members. Through this process,
the evidence indicating that specific microorganisms cause obesity, can be discovered.
Furthermore, Barabasi et al. claimed that obesity is propagated by the network structure
of various level in the study entitled “Network Medicine–From Obesity to the Diseasome”
reported in NEJM in 2007.
18.2 Epigenome
Epigenome is a combination word between ‘epi-‘, which means ‘above’ in Greek, and
‘genome’. In other words, it means ‘something out of the genome’, and without changing
DNA base sequence itself, it can cause alterations in genomic expression patterns through
DNA or chromatin reconstruction. This property can be delivered to the next generation.
Epigenome means that DNA and genes obtained from the parents when they are born
can be expressed differently depending on the environment and life patterns. The same
DNA can be differently expressed depending on the environment in identical twins. The
function of the epigenome is to regulate gene expression. For example, there are three
types of epigenetic modifications affecting gene expression: (1) DNA methylation; (2) his-
tone modification; and (3) non-coding RNAs. The first two are known as molecular-based
mechanisms, and the last one is known as a transcriptional regulatory mechanism. These
mechanisms affect the whole part of one individual, but it can also affect a specific group
of cells. Therefore, it one of major interests in various diseases, including cancer.
Epigenetic alteration is an unlikely genetic mutation caused by DNA sequence altera-
tion. It is affected more by the environment. The epigenetic profile of twins who were
raised in the same environment showed more similarity than that of twins who were
raised in different environments. In another example, DNA methylation status between
identical twins who were around 3 years old showed very similar patterns, but the status
between the 50 year old twins was very different. Such epigenetic changes can be factors
that lead to differential gene expression during one’s life span as one gets older with the
same genetic information. It might explain how many environmental factors affect genes.
18.2.1 DNA Methylation
DNA methylation is the attachment of a methyl group to the DNA itself at the C5 position
in the pyrimidine ring of cytosine. This reaction occurs when guanine follows cytosine
(CpG) via a methyl group of S-adenosyl-methionine mediated by DNA methyllationsfer-
ase (DNMT). Mammals have high cytosine methylation tendencies showing that 3–5%
of cytosine exists as 5-methylcytosine. Among that, around 70% exists in CpG regions.
DNA methylation is known to specifically inhibit gene expressions that are involved in
formation of heterochromatin that does not transcribe due to the condensation of base
strands in X chromosome deactivation and in genome imprinting. The methylation rate
18 and pattern of CpG varies across the mammal species and tissues. Most CpGs are known
to be methylated (60–90%). However, CpG island in mammal genome exist where CpG
is concentrated. CpG islands have been shown to be less methylated (typically about
300–3000 base pairs in length, located at promoter region). If CpG islands in the promo-
tor regions of a specific gene are methylated, transcription factors binding to the pro-
moter region is inhibited, resulting in repressed expression of a given gene. Methylation
in a promoter region is a well-known epigenetic regulation of gene expression. Three
18.2 · Epigenome
319 18
.. Table 18.1 Comparison of epigenomic mechanisms
Mechanisms DNA methylation Histone modification
Mechanisms Addition of a methyl group to Acetylation, phosphorylation, demethyl-

cytosines within CpG ation of histones in chromatin
(cytosine/guanine) pairs
Enzymes DNMT, MeCP, MBD DHAT, HDAC, HMT, HDM
Gene expressions Inhibited gene expression Altered chromatin structure and

without a change of DNA activated gene expression without a
sequence change of DNA sequence
DNMT DNA methyltransferase, MeCP methyl-CpG binding protein, MBD methyl-CpG binding
domain, HAT histone acetyltransferase, HDAC histone deacetyltransferase, HMT histone methyl-
transferase, HDM histone demethylase
known mammalian methyl transferases exist. DNMT1 is involved in retaining the methyl
group in DNA where DNA is synthesized during cell division. DNMT3a and DNMT3b
might catalyze new methylation in DNA. In addition, methylated DNA allows the CpG-
bindingdomain (MBD) to bind to HDAC, (a chromatin modification protein leading to
the chromatin condensation), which leads to transcription repression by blocking binding
of transcription factors to DNA sequences.
DNA methylation is involved in the genomic imprinting phenomenon which is
involved in gene expression from just one chromosome among the two chromosomes
that come from each parent. In the early stages of egg fertilization, the CpG island of
the corresponding gene are selectively methylated to repress the given gene expression.
Methylation of the CpG island occurs in a tissue-specific manner and are also regulated
in a tissue-specific manner. In cancer cells, genome-wide DNA methylation is generally
decreased, but some specific promoter regions are highly methylated. For example, some
CpG islands are highly methylated in various cancers including colorectal, lung cancer,
and breast cancers. Such abnormal methylation can occur at the beginning of cancer
development; therefore, many studies have focused on marker discovery for early diag-
nosis (. Table 18.1).

18.2.2 Histone Modification
Eukaryotic chromatin is the structural unit of a chromosome. Total DNA, which consists
of 46 chromosomes existing in the nucleus of human somatic cell, is very long (2 m).
DNA exists as a highly condensed nuclear chromatin structure with a size of 10–100 in
diameter. DNA is condensed with a nucleosome, which is the base unit of chromatin. A
nucleosome forms the core particle wrapped with DNA strand of 146 base pairs, and the
core particle is composed of two copies of H2A, H2B, H3, and H4. In a nucleosome, the
N-terminal histone proteins have been shown to be modified. It is called histone modifica-
tion and six types of modification have been discovered to date:
55 Acetylation
55 Methylation of lysine and arginine
55 Ubiquitination
55 Phosphorylation
55 Sumoylation
55 ADP rybosylation
55 Deamination, proline isomerization
Such chromatin remodeling is involved in processes such as gene expression regulation,

apoptosis regulation, DNA replication and repair, and chromosome condensation and
division. Abnormal chromatin remodeling has been known to increase the risk of cancer.
Acetylation and methylation of histone proteins are the major modifications. Acetylation
occurs at the lysine residue of the histone tail, neutralizing the positive charge of the lysine
residue, and then loosening the bond between negatively charged DNA and the histone
protein (opened chromatin structure), which facilitates the binding of RNA polymerase.
Histone methylation can turn on and off gene expression according to the location and
degree of methylation in the histone. For example, while addition of three methyl groups
to the 3rd lysine or 36th lysine of H3 histone switches on gene expression, addition of three
methyl groups to the 27th lysine or 9th lysine of H3 histone switches off gene expression.
18.2.3 Non-coding RNA (ncRNA)
Among transcribed RNAs, several RNAs are not translated into proteins (1) tRNA
(transfer RNA); (2) rRNA (ribosomal RNA); (3) miRNA (microRNA); and (4) siRNA
(short-interfering RNA). Listed RNAs are called non-coding RNA (ncRNA), and some
ncRNA has been shown to control gene expression. miRNA that consists of around 22
bases, represses target mRNA. In animals, miRNA binds to mRNA 3’UTR and inhibits
gene expression by degrading mRNA or inhibiting translation into protein. miRNA has
been shown to be involved in ontogenesis, apoptosis, proliferation, hematopoiesis, insulin
secretion, and immune responses.
18.3 Epigenome Databases and Analysis Tools
18.3.1 DNA Methylation Databases and Analysis Tools
55 MethylomeDB (7 http://www.neuroepigenomics.org/methylomedb)

Database providing genome-wide DNA methylation profile in human and mouse

brain
55 NGSmethDB (7 http://bioinfo2.ugr.es/NGSmethDB)

A collection of NGS data used for DNA bisulfide conversion. It includes high quality
of chromosome methylation information from various tissues, pathological condi-
18 tions, and species.
55 MethBase (7 http://smithlabresearch.org/software/methbase)

A collection of methylation in six species from public Bisulfite sequencing (BS)-seq

datasets. It provides meta data, methylation information at individual sites, and
regions of allele specific methylation
55 MethHC (7 http://methhc.mbc.nctu.edu.tw)

A web-based resource visualizing DNA methylation of Pan-Cancer. It provides cor-

relation among DNA methylation, gene expression, and miRNA methylation.
18.3 · Epigenome Databases and Analysis Tools
321 18
55 PubMeth (7 http://www.pubmeth.org)

A database of cancer-related methylation information extracted from PubMed by

text mining technique.
18.3.2 Histone Modification Databases and Analysis Tools
55 4DGenome (7 http://4dgenome.research.chop.edu)

Database of chromatin interaction information in five species including humans. It

includes major experimental methods and computational techniques used to find
chromatin interactions such as UCSC Genome Browser, which displays interaction
results.
55 3CDB (7 http://3cdb.big.ac.cn)

Database of chromsome conformation captured (3C) for 17 species identified in

PubMed and Google Scholar (5000 publications)
55 Human Histone Modification Database (HHMD) (7 http://bioinfo.hrbmu.edu.cn/

hhmd)
A collection of human histone modification data obtained by experimental methods.
55 PeakSeq (7 http://info.gersteinlab.org/PeakSeq)

A program to search for the peak regions in ChIP-Seq data and rank them. The input
is the sequence fragment mapped by Chip-Seq and the output is a result ranked by
Q-value.
55 ChIP-Seq Analysis Server (7 http://ccg.vital-it.ch/chipseq)

Collection of tools for various genome annotation related to ChIP-Seq data

ChIP-Cor: Feature Correlation Tool
ChIP-Extract: Feature Correlation Tool
ChIP-Peak: Signal Peaks Location Tool
ChIP-Part: Partitioning Tool
ChIP-Center: Tag Centering Tool
ChIP-Convert: Format Converter Tool
18.3.3 Noncoding RNA Databases and Analysis Tools
55 miRBase (7 http://www.mirbase.org)

One of widely used tools. It provides currently known miRNA sequences and annotation.
55 TarBase (7 http://www.microrna.gr/tarbase)

The biggest miRNA target database providing miRNA-gene interactions consisting of

more than 63,000 in 21 species.
55 HaploReg (7 http://archive.broadinstitute.org/mammals/haploreg/haploreg.php).

Tools providing annotation of noncoding genom. It provides information from the

1000 Genome Project and Encode projects and non-coding variants from clinical
phenotypes and normal variations. Tools provide annotations of the noncoding
genome at variants on haplotype blocks. It is designed to investigate the impact of
non-coding variants on clinical phenotypes and normal variation using information
from 1000 Genomes Project and ENCODE projects.
55 NONCODE (7 http://www.noncode.org)
Database of noncoding RNA excluding tRNA and rRNA.

55 Long Noncoding RNA Database (lncrnadb) (7 http://www.lncrnadb.org)

Database of eukaryotic long non-coding RNAs. It provides a text search tool and
BLAST search tool
18.4 Epigenome Analysis
Currently, more than 20 different techniques exist for DNA methylation detection. DNA
methylation profiling techniques are categorized into three methylation status detection
methods:
55 Bisulfite conversion method
55 DNA Lysis with restriction enzyme sensitive to methylation status
55 Obtaining a DNA fragment methylated with Recombinant MBD (methyl-DNA bind-
ing protein domain) or monoclonal anti-5-methyl-cytosine antibody
Recently, Bisulphite-seq and MeDIP-seq using NGS technique have been widely used to
analyze genome-wide DNA methylation.
The ChIP-Seq technique is used to analyze histone modifications. Chip-Seq is a com-
bined technique of chromatin immuno-precipitation (CHIP) and sequencing techniques.
A ChIP-Seq is generally used to search for transcription factor-binding sites (TFBS) and
has recently been used for epignenomic profile analysis. A Chip-Seq can be used to search
for a short sequence fragment, which is mapped to a specific region of the genome, and
evaluate how much these fragments are enriched compared to other regions.
Take Home Message

55 Types and characteristics of metagenome and epigenome data (i.e., DNA meth-
ylation, histone modification, and ncRNA).
55 What epigenome database and analysis tools are used.
Bibliography
1. Amaral PP et al (2011) lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res
39(Database issue):D146–D151
2. Berger SL et al (2009) An operational definition of epigenetics. Genes Dev 23(7):781–783
3. Goldberg AD et al (2007) Epigenetics: a landscape takes shape. Cell 128(4):635–638
4. Hackenberg M et al (2011) NGSmethDB: a database for next-generation sequencing single-cytosine-
resolution DNA methylation data. Nucleic Acids Res 39(Database issue):D75–D79
5. Huang WY et al (2015) MethHC: a database of DNA methylation and gene expression in human can-
cer. Nucleic Acids Res 43(Database issue):D856–D861
6. Kharchenko PV et al (2008) Design and analysis of chIP experiments for DNA binding proteins. Nat
18 Biotechnol 26:1351–1359. Park PJ (2009). ChIP-seq: advantages and challenges of a maturing technol-
ogy. Nat Rev Genet 10(10):669–680
7. Schones DE, Zhao K (2008) Genome-wide approaches to studying chromatin modifications. Nat Rev
Genet 9(3):179–191
8. Ongenaert M et al (2008) PubMeth: a cancer methylation database combining text-mining and expert
annotation. Nucleic Acids Res 36(Database issue):D842–D846 Epub 2007 Oct 11
9. Rozowsky J et al (2009) PeakSeq enables systematic scoring of ChIP-seq experiments relative to con-
trols. Nat Biotechnol 27(1):66–75
Bibliography
323 18
10. Sethupathy P et al (2006) TarBase: a comprehensive database of experimentally supported animal
microRNA targets. RNA 12(2):192–197 Epub 2005 Dec 22
11. Song Q et al (2013) A reference methylome database and analysis pipeline to facilitate integrative and
comparative epigenomics. PLoS One 8(12):e81148
12. Teng L et al (2015) 4DGenome: a comprehensive database of chromatin interactions. Bioinformatics
31(15):2560–2564
13. Ward LD, Kellis M (2012) HaploReg: a resource for exploring chromatin states, conservation, and regu-
latory motif alterations within sets of genetically linked variants. Nucleic Acids Res 40(Database
issue):D930–D934
14. Xin Y et al (2012) MethylomeDB: a database of DNA methylation profiles of the brain. Nucleic Acids
Res 40(Database issue):D1245–D1249
15. Yun X et al (2016) 3CDB: a manually curated database of chromosome conformation capture data.
Database (Oxford). 2016. pii:baw044. https://doi.org/10.1093/database/baw044. Print 2016
16. Zhang Y et al (2010) HHMD: the human histone modification database. Nucleic Acids Res 38(Database
issue):D149–D154
325 19
Metagenome Data Analysis

19.3 Metagenome Analysis Tool – 326

19.3.1 Major Metagenome Analysis Tools – 326
19.3.2 Introduction to metagenomeSeq – 327
19.4 Metagenome Analysis – 327

19.4.1 Dataset – 327
19.4.2 Basic Example of metagenomeSeq – 327
19.4.3 Statistical Testing – 329
19.4.4 Aggregating Counts – 329
19.5 Visualization – 330

https://doi.org/10.1007/978-981-13-1942-6_19
326 Chapter 19 · Metagenome Data Analysis

In this chapter, we learn how to use the metagenomeSeq in the R package for both meta-
data and functional analyses of metagenomes using published data. It includes preprocess-
ing and annotation methods such as gene-centered, pathway-centered, and functional
diversity analyses.
19.1 Introduction
A metagenome is a set of the genomes of all microorganisms that exist in certain environ-
ments. NGS techniques have recently advanced the metagenome field. Therefore, in this
chapter, we will use metagenomeSeq, a functional analysis technique of a microorganism’s
genome that is one of major metagenome analysis tools. It includes the basic use of
metagenomeSeq (R-package) and gene-based metagenome analysis using the mouseDB
dataset provided in the R-packages.
19.2 Prerequisites
First, install the R program and metagenomeSeq R-package. As metagenomeSeq is depen-

dent on the interactiveDisplay and vegan packages in Bioconductor, these two packages
should be installed too. In this practice, the MS Windows system is used as the working
environment but Linux users can follow the installation step.
Run R and type the following commands in the console.
> source(“http://www.bioconductor.org/biocLite.R”)
> biocLite(“interactiveDisplay”)
> biocLite(“vegan”)
> biocLite(“metagenomeSeq”)
19.3 Metagenome Analysis Tool
19.3.1 Major Metagenome Analysis Tools
. Figure 19.1 shows the typical software developed for metagenome analysis. Among

these software, MEGAN,1 as a standalone program, maps the sequence fragments resulted
from the metagenome analysis to the NCBI taxonomy database, and perform phyloge-
netic analysis, diversity assessment, and functional analysis. MG-RAST,2 a web-based
analysis tool, analyzes and visualizes raw data in FASTQ format. R-packages for metage-
19 nomic analysis exist. Web-based or standalone software provides various functions and
1 7 http://ab.inf.uni-tuebingen.de/software/megan6/

2 7 http://metagenomics.anl.gov

19.4 · Metagenome Analysis
327 19
.. Fig. 19.1 Metagenome analysis and visualization program
easier user interface, but they have limitations for some types of analyses. In contrast, the
R-package allows various analytical approaches, but it requires a certain level of program-
ming skill.
19.3.2 Introduction to metagenomeSeq
MetagenomeSeq is an R-package supporting statistical analysis for sparse high-throughput

sequencing. It was developed by Joseph N. Paulson et al. in 2013. The main function of
this package consists of several points: (1) to assess differential abundance between groups
of various samples; (2) to standardize the microbial community; (3) to detect association
with disease; (4) to evaluate sampling effects by applying a correlation test; and (5) to
visualize data with an interactive display using shiny apps.
19.4 Metagenome Analysis
19.4.1 Dataset
We will use the dataset published in Science, 2009, entitled with “The effect of diet on the
human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice.” This
dataset includes two metagenomes from an intestinal microorganism obtained from
12 germ-free adult male mouse treated under two conditions: (1) six mice under low-fat
and plant-polysaccharide-rich dietary conditions and (2) six mice under high-fat and
high-sugar dietary conditions.
19.4.2 Basic Example of metagenomeSeq
In this exercise, we will learn the basic usage of metagenomeSeq as a preprocessing tech-
nique for metagenome data
1. Import a library
> library(metagnomeSeq)
> library(vegan)
> library(interactiveDisplay)
2. List functions and variables in metagenomeSeq
> ls(“package:metagenomeSeq”)
3. mouseData dataset process
Using the command lines below, pring the phenotypic data and feature data of the mouse-
Data and examine their properties. featureData(phenoData) returns an object containing
information on both variable values and variable meta-data, whereas fData(pData) shows
a data frame with features as rows, and variables as columns. You can use the
fvarLabels(varLabels) function to return a character vector of measured variable names.
> data(mouseData) # Load dataset from mouseData
> mouseData # Return mouseData of MRexperiment3 type
> phenoData(mouseData) # Return phenotypic data in experiments, variable value and
variable meta-data
> head(pData(mouseData), 3) # Return the top three rows in the data having phenotypinc
data and sample variable
> featureData(mouseData) # Return feature data, variable value of feature meta-data,
and variable meta-data
> head(fData(mouseData)[ , -c(1, 7)], 3) # Return the top three rows excluding the data in
the first and 7th columns having feature data,variable value of feature meta-data,
and variable meta-data
> summary(pData(mouseData)) # Return summary of a data using summary function

19
> summary(fData(mouseData))
3 a modified eSet object for the data from high-throughput sequencing experiments
19.4 · Metagenome Analysis
329 19
19.4.3 Statistical Testing
> head(MRcounts(mouseData[, 1:2])) # Return top six rows in the calculation of the
combination numbers between the first and second samples and features.
> filterData(mouseData, present = 10, depth = 1000) # Filter samples by number of depth
of coverage and filter again by the number of present features.
For group comparisons, perform presence-absence test using 2 × 2 contingency tables.
Calculate p-values, odd’s ratios, and confidence intervals. The fitPA function provided in
metagenomeSeq calculates p-values, odds ratios, and lower and upper confidence limits
for all rows.
> classes = pData(mouseData)$diet # Save a value of diet column in classes variable.
> res = fitPA(mouseData[1:5, ], cl = classes) # Divide data (1st column to 5th columns)
into two groups by cl variables and run fitPA.
> head(res) # Return top six rows
To calculate the correlation between the samples or the features, use the correlationTest
function in metagenomeSeq to calculate basic Pearson, Spearman, and Kendall correla-
tion statistics, and corresponding p-values.
> cors = correlationTest(mouseData[55:60, ], norm = FALSE, log = FALSE)
# Calculates the (pairwise) correlation statistics and associated P-values of a matrix or the
correlation of each row with a vector.
19.4.4 Aggregating Counts
aggTax, which stands for aggregateByTaxonomy, counts data of each taxonomy using fea-
ture information.
> obj = aggTax(mouseData, lvl = ‘phylum’, out = ‘matrix’)
> head(obj[1:5, 1:5])

aggSamp, which stands for aggregateBySample, counts each samples using phenoData
information.
> obj = aggSamp(mouseData, fct = ‘mouseID’, out = ‘matrix’) # Using the phenoData
information in the MRexperiment, calling aggregateBySample on a MRexperiment and
a particular phenoData column (i.e., ‘diet’) will aggregate counts using the aggfun
function (default row Means). Possible aggfun alternative include rowMeans and rowMedians.
> head(obj[1:5, 1:5])
19.5 Visualization
metagenomeSeq provides functions for drawing various plots. Draw several plots using
mouseData. classIndex variable is assigned 54 Western values and 85 BK values.
>
classIndex = list(Western = which(pData(mouseData)$diet == “Western”))
> classIndex$BK = which(pData(mouseData)$diet == “BK”)
> otuIndex = 8770
> classIndex
$Western
[1] 14 15 17 19 20 21 22 23 24 38 39 40 42 43 44 45 46 47 84 85 87 89 90
91 92 93 94 96 97 98 100 101 102 103 104 105 117 118 120 122 123 124 125 126
127 129 130 132 134 135 136 137 138 139
$BK
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 16 18 25 26 27 28 29 30 31 32 33 34
35 36 37 41 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 86 88 95 99 106 107 108
109 110 111 112 113 114 115 116 119 121 128 131 133
19
19.5 · Visualization
331 19
> par(mfrow = c(1, 2)) # Show multiple plots
> dates = pData(mouseData)$date # Save mouseData in dates
> plotFeature(mouseData, norm = FALSE, log = FALSE, otuIndex, classIndex,
col = dates, sortby = dates, ylab = “Raw reads”) # Divide data into two groups
by diet and draw with a color by dates.
This function plots the abundance of a particular OTU by class. The function is the typical
manhattan plot of the abundances (. Fig. 19.2).

Specify heatmapColColors according to value of pData(mouseData)$diet. “#8DD3C7”

(yellow) represents ‘BK’ and “#FFFFB3” (green) reprsents ‘Western’. Then draw a heatmap
of abundance estimates with 50 colors in palette (. Fig. 19.3).
700
700
600
600
500
500
400
400
Raw reads
Raw reads
300
300
200
200
100
100
0
0 10 20 30 40 50 0 20 40 60 80
Western BK
.. Fig. 19.2 Plot of raw abundances between Western and BK

Color Key
and Histogram
Count
6000
0
0 2 4 6 8 12
Value
Lachnospiraceae:3796
Ruminococcaceae:357
Ruminococcaceae:642
Clostridia:36
Holdemania:23
LachnospiraceaeIncertaeS
Bacteroides:1084
Bacteroides:1048
Betaproteobacteria:10
Proteobacteria:26
Parabacteroides:745
Erysipelotrichaceae:26
Anaerostipes:38
Ruminococcaceae:387
ErysipelotrichaceaeIncerta
IncertaeSedisXIII:13
Lachnospiraceae:313
Coprobacillus:85
RuminococcaceaeIncertae
Clostridiales:1075
Ruminococcaceae:639
Ruminococcaceae:547
RuminococcaceaeIncertae
Akkermansia:40
Coprobacillus:67
Ruminococcaceae:526
Ruminococcaceae:469
Enterococcaceae:28
Lactococcus:42
Lactococcus:48
Enterococcus:162
Enterococcus:182
Clostridiales:313
Ruminococcaceae:634
Bacteroides:1058
Akkermansia:54
Bilophila:38
Bilophila:37
Ruminococcaceae:374
Coriobacteriaceae:42
Lachnospiraceae:209
Lachnospiraceae:796
Ruminococcaceae:575
Enterococcus:43
Ruminococcaceae:541
Ruminococcaceae:274
Firmicutes:278
Coprobacillus:126
Enterobacter:1
Enterobacteriaceae:18
Collinsella:28
Ruminococcaceae:635
Eubacterium:44
Ruminococcaceae:80
Enterococcus:39
Enterococcus:153
Prevotella:83
Prevotella:81
Prevotella:76
Prevotella:74
Bacteria:5
Bacteria:4
Prevotellaceae:334
Prevotellaceae:433
Bacteria:2
Prevotellaceae:143
Bacteroides:901
Sutterella:13
Faecalibacterium:275
Dorea:58
Bacteroides:265
Ruminococcaceae:453
Bacteroides:854
Bacteroides:667
Bacteroides:820
Firmicutes:340
Bacteroidales:72
Bacteroides:965
Bacteroidales:105
Bacteroides:978
Bacteroides:1146
Bacteroides:768
Lachnospiraceae:685
Bacteroides:1050
Clostridiales:1117
Ruminococcaceae:593
Ruminococcaceae:612
Bacteroides:868
Clostridiales:1114
Bacteroides:1060
Bacteroides:1166
Bacteroides:1194
Bacteroides:1151
Bacteroides:1150
Alistipes:156
Alistipes:161
Bacteroides:1153
Bacteroides:956
Bacteroides:1081
Bacteroides:1111
Alistipes:151
Bacteroides:891
Bacteroides:1055
Coprobacillus:49
Ruminococcus:15
Clostridiales:1116
Bryantella:94
Veillonellaceae:54
Bacteroides:921
Bacteroides:811
Coprobacillus:38
Prevotella:86
Prevotella:85
Bacteroides:1132
Parabacteroides:743
Veillonellaceae:60
Bacteroides:369
Bacteroides:925
Bacteroides:890
Prevotella:84
PM12:20080128
PM12:20080204
PM10:20080108
PM12:20080108
PM10:20080114
PM12:20080218
PM12:20080225
PM12:20080303
PM12:20080114
PM12:20080121
PM12:20080211
PM10:20080211
PM10:20080225
PM10:20080218
PM10:20080303
PM10:20080128
PM10:20080204
PM10:20080121
PM11:20071211
PM10:20071211
PM12:20071217
PM10:20071217
PM11:20071217
PM11:20080303
PM11:20080225
PM11:20080211
PM11:20080204
PM11:20080218
PM11:20080128
PM10:20080107
PM12:20080107
PM11:20080121
PM11:20080107
PM11:20080114
PM11:20080108
PM5:20080303
PM9:20080108
PM6:20080108
PM5:20080108
PM8:20080108
PM5:20080114
PM6:20080114
PM8:20080114
PM9:20080211
PM6:20080121
PM9:20080114
PM6:20080211
PM5:20080121
PM5:20080225
PM5:20080218
PM5:20080211
PM8:20080121
PM6:20080128
PM6:20080204
PM9:20080121
PM9:20080128
PM9:20080204
PM9:20080218
PM9:20080225
PM6:20080303
PM6:20080225
PM6:20080218
PM8:20080211
PM8:20080204
PM8:20080128
PM9:20080303
PM8:20080303
PM8:20080225
PM8:20080218
PM5:20080204
PM5:20080128
PM7:20071211
PM8:20071211
PM2:20071211
PM3:20071211
PM5:20071211
PM9:20071211
PM1:20071211
PM7:20071217
PM7:20080303
PM7:20080121
PM7:20080114
PM7:20080204
PM7:20080128
PM7:20080211
PM7:20080218
PM7:20080225
PM9:20071217
PM2:20071217
PM5:20071217
PM8:20071217
PM4:20071217
PM6:20071217
PM1:20071217
PM3:20071217
PM3:20080218
PM3:20080303
PM1:20080303
PM4:20080303
PM2:20080303
PM2:20080225
PM2:20080218
PM4:20080114
PM4:20080204
PM4:20080211
PM4:20080225
PM4:20080218
PM1:20080218
PM1:20080225
PM1:20080114
PM1:20080128
PM4:20080128
PM1:20080211
PM1:20080121
PM3:20080121
PM9:20080107
PM4:20080121
PM3:20080211
PM3:20080128
PM3:20080225
PM2:20080204
PM2:20080211
PM3:20080204
PM1:20080204
PM1:20080107
PM3:20080107
PM4:20080107
PM4:20080108
PM2:20080108
PM6:20080107
PM8:20080107
PM2:20080107
PM5:20080107
PM3:20080114
PM3:20080108
PM1:20080108
PM2:20080121
PM2:20080128
PM2:20080114
.. Fig. 19.3 Heatmap of abundance estimates
> trials = pData(mouseData)$diet # Save diet value of mouseData in trials
> heatmapColColors = brewer.pal(12, “Set3”)[as.integer(factor(trials))]
# Define a color by a trial value
> heatmapCols = colorRampPalette(brewer.pal(9, “RdBu”))(50) # Define 50 colors in palette
> plotMRheatmap(obj = mouseData, n = 200, cexRow = 0.4, trace = “none”,
col = heatmapCols, ColSideColors = heatmapColColors)

19
# Draw heatmap showing diet value of mouse Data
333 19
Color Key
and Histogram
1000
Count
0
–1 –0.5 0 0.5 1
Value
LachnospiraceaeIncertaeSedis:977
Bryantella:94
Clostridiales:1116
Clostridiales:1117
Ruminococcaceae:453
Clostridiales:1114
Bacteroides:868
Bacteroides:978
Ruminococcaceae:593
Ruminococcaceae:612
Bacteroides:1146
Lachnospiraceae:685
Bacteroides:854
Bacteroides:667
Bacteroides:820
Bacteroides:925
Firmicutes:340
Ruminococcus:15
Bacteroides:890
Bacteroides:768
Bacteroides:265
Bacteroides:901
Bacteroides:1132
Bacteroides:1194
Bacteroides:1151
Bacteroides:1150
Bacteroides:1084
Bacteroides:1050
Bacteroides:1048
Bacteroides:1055
Veillonellaceae:60
Dorea:58
Bacteroides:811
Prevotella:74
Prevotella:84
Prevotella:86
Bacteria:5
Prevotella:76
Bacteroidales:105
Bacteroidales:72
Bacteroides:369
Bacteroides:965
Sutterella:13
Prevotellaceae:433
Prevotellaceae:143
Bacteria:2
Bacteria:4
Prevotellaceae:334
Prevotella:85
Prevotella:81
Prevotella:83
Ruminococcaceae:80
Eubacterium:44
RuminococcaceaeIncertaeSedis:55
Ruminococcaceae:635
ErysipelotrichaceaeIncertaeSedis:315
Ruminococcaceae:547
Clostridiales:1075
Ruminococcaceae:639
Bacteroides:1153
Bacteroides:956
Ruminococcaceae:357
Bilophila:37
Bilophila:38
Bacteroides:1081
Bacteroides:1058
Bacteroides:1111
Alistipes:151
Clostridiales:313
Coprobacillus:126
Ruminococcaceae:374
Firmicutes:278
Ruminococcaceae:469
Enterococcus:153
Ruminococcaceae:274
Enterococcaceae:28
Ruminococcaceae:634
Enterococcus:162
Enterococcus:182
Ruminococcaceae:575
Ruminococcaceae:526
Lactococcus:48
Lactococcus:42
Enterococcus:39
Enterococcus:43
Ruminococcaceae:541
Alistipes:156
Alistipes:161
Bacteroides:891
Parabacteroides:743
Bacteroides:1060
Bacteroides:1166
Lachnospiraceae:313
Anaerostipes:38
Ruminococcaceae:387
Coprobacillus:38
Coprobacillus:67
Proteobacteria:26
Parabacteroides:745
Bacteroides:921
Veillonellaceae:54
Ruminococcaceae:642
Holdemania:23
Clostridia:36
Coprobacillus:49
Akkermansia:40
Akkermansia:54
Coprobacillus:85
Lachnospiraceae:796
Lachnospiraceae:209
Collinsella:28
Enterobacter:1
Lachnospiraceae:796
Veillonellaceae:54
Ruminococcaceae:526
Ruminococcaceae:575
Coprobacillus:126
Ruminococcaceae:80
Veillonellaceae:60
Enterobacter:1
Collinsella:28
Lachnospiraceae:209
Coprobacillus:85
Akkermansia:54
Akkermansia:40
Ruminococcaceae:642
Coprobacillus:49
Clostridia:36
Holdemania:23
Parabacteroides:745
Bacteroides:921
Proteobacteria:26
Coprobacillus:67
Coprobacillus:38
Ruminococcaceae:387
Ruminococcaceae:541
Anaerostipes:38
Lachnospiraceae:313
Bacteroides:1166
Bacteroides:1060
Parabacteroides:743
Bacteroides:891
Alistipes:161
Alistipes:156
Enterococcus:43
Enterococcus:39
Lactococcus:42
Lactococcus:48
Enterococcus:182
Enterococcus:162
Ruminococcaceae:634
Enterococcaceae:28
Ruminococcaceae:274
Enterococcus:153
Ruminococcaceae:469
Firmicutes:278
Ruminococcaceae:374
Clostridiales:313
Alistipes:151
Bacteroides:1111
Bacteroides:1058
Bacteroides:1081
Bilophila:38
Bilophila:37
Ruminococcaceae:357
Ruminococcaceae:639
Ruminococcaceae:547
Ruminococcaceae:635
Ruminococcaceae:593
Bacteroides:956
Bacteroides:1153
Clostridiales:1075
Eubacterium:44
Prevotella:83
Prevotella:81
Prevotella:85
Prevotellaceae:334
Bacteria:4
Bacteria:2
Prevotellaceae:143
Prevotellaceae:433
Sutterella:13
Bacteroides:965
Bacteroides:369
Bacteroidales:72
Bacteroidales:105
Prevotella:76
Bacteria:5
Prevotella:86
Prevotella:84
Prevotella:74
Bacteroides:811
Dorea:58
Bacteroides:1055
Bacteroides:1048
Bacteroides:1050
Bacteroides:1084
Bacteroides:1150
Bacteroides:1151
Bacteroides:1194
Bacteroides:1132
Bacteroides:901
Bacteroides:265
Lachnospiraceae:685
Bacteroides:1146
Ruminococcaceae:612
Ruminococcaceae:453
Bacteroides:768
Bacteroides:890
Ruminococcus:15
Firmicutes:340
Bacteroides:925
Bacteroides:820
Bacteroides:667
Bacteroides:854
Bacteroides:978
Bacteroides:868
Clostridiales:1114
Clostridiales:1117
Clostridiales:1116
Bryantella:94
.. Fig. 19.4 Heatmap of pairwise correlations
Calculate the correlation using the heatmapCols variable set in the above color and express
it as a heatmap. The higher the correlation value, the darker in blue color, and the lower
the correlation value, the darker in red color (. Fig. 19.4).

> plotCorr(obj = mouseData, n = 200, cexRow = 0.25, cexCol = 0.25, trace = “none”,
dendrogram = “none”, col = heatmapCols) # Draw heatmap showing correlation by
diet value of mouseData

20
10
0
MDS component: 2
–10
–20
–30
–40
–20 –10 0 10 20 30
MDS component: 1
.. Fig. 19.5 Plot of either PCA or MDS coordinates for the distances
This function plots the PCA / MDS coordinates for the diet values. Potentially relation-
ships of two diet groups uncovers their distances (. Fig. 19.5).
> cl = factor(pData(mouseData)$diet) # Save factored diet value in cl
> plotOrd(mouseData, tran = TRUE, usePCA = FALSE, useDist = TRUE, bg = cl, pch = 21)
# the PCA/MDS coordinates
19
335 19
Diet1-BK
Diet2-Western
600
Number of detected features
500
400
300
1000 2000 3000 4000 5000 6000

Depth of coverage
.. Fig. 19.6 The number of observed features vs. the depth of coverage following diet values
plotRare visualizes the number of obeserved features vs. the depth of coverage (. Fig. 19.6).
> cl = factor(pData(mouseData)$diet) # Save factored diet value in cl
> res = plotRare(mouseData, cl = cl, pch = 21, bg = cl) # This function plots the number
of observed features vs. the depth of coverage.
> tmp = lapply(levels(cl), function(lv)lm(res[ , “ident”] ~res[ , “libSize”] – 1, subset = cl == lv))
> for (i in 1:length(levels(cl))){
abline(tmp[[i]], col = i)
} # Draw abline with cl variables
> legend(“topright”, c(“Diet1 – BK”, “Diet2 – Western”), text.col = c(1, 2), b ox.col = NA)
# Add legend on the topright.

.. Fig. 19.7 A web browser opens where the graph can be examined interactively by adjusting the
available options
Shiny is an interactive web application generated by R. All plots generated in above can be
actively visualized. Shiny is an extended application with reactive binding between input
and output. metagenomeSeq and interactiveDisplay R-pacage provide a visualization tool
through shiny (. Fig. 19.7)

> display(mouseData) # Visualization of mouse data in web browser
Exercises
[Exercise 1] - Perform the same analysis with the lungData provided in metagenomeSeq.
[Exercise 2] - Understand characteristic of data used in the analysis.
[Exercise 3] - Apply fitDO function for discovery odds ratio test.
[Exercise 4] - Find a combination of samples with high correlation value.
[Exercise 5] - Draw plot with several options of display (mouseData).
Take Home Message

55 How to analyze metagenome data using R package “metegenomeSeq”
19
Bibliography
337 19
Bibliography
1. DeLong EF et al (2006) Community genomics among stratified microbial assemblages in the ocean’s
interior. Science 311(5760):496–503
2. Huson DH et al (2007) MEGAN analysis of metagenomic data. Genome Res 17(3):377–386. Epub 2007
Jan 25
3. Meyer F et al (2008) The metagenomics RAST server – a public resource for the automatic phyloge-
netic and functional analysis of metagenomes. BMC Bioinforma 9:386. https://doi.org/10.1186/1471-
2105-9-386
4. Paulson JN et al (2013) Differential abundance analysis for microbial marker-gene surveys. Nat
Methods 10(12):1200–1202. https://doi.org/10.1038/nmeth.2658 Epub 2013 Sep 29
5. Paulson JN, Talukder H, Pop M, Bravo HC metagenomeSeq: statistical analysis for sparse high-
throughput sequencing. Bioconductor package: 1.16.0. http://cbcb.umd.edu/software/metageno-
meSeq
6. Turnbaugh PJ et al (2006) An obesity-associated gut microbiome with increased capacity for energy
harvest. Nature 444(7122):1027–1031
7. Turnbaugh PJ et al (2009) The effect of diet on the human gut microbiome: a metagenomic analysis
in humanized gnotobiotic mice. Sci Transl Med 1(6):6ra14. https://doi.org/10.1126/scitrans-
lmed.3000322
339 20
Epigenome Database
and Analysis Tools
20.3 Epigenome Database – 340

20.3.1 dbEM: A Database of Epigenetic Modifiers – 340
20.3.2 EpiFactors – 341
20.3.3 MethylomeDB – 343
20.3.4 NGSmethDB: High-Quality Methylomes
and Differential Methylation – 345
20.4 Epigenome Analysis Tool – 346

20.4.1 Bsseq: Analyzing WGBS with the Bsseq Package – 346
20.4.2 MethPipe: a Computational Pipeline for Analyzing
Bisulfite Sequencing Data – 351
20.4.3 PAVIS: Peak Annotation and Visualization – 351

https://doi.org/10.1007/978-981-13-1942-6_20
340 Chapter 20 · Epigenome Database and Analysis Tools

This chapter introduces the types and properties of epigenome databases. We will learn
about the main functions and properties of epigenome analysis tools and understand the
overall process of epigenome data analysis by practicing with the example file.
20.1 Introduction
The epigenome is a means by which regulation of gene expression is affected by various

other factors beyond the genome sequence. While the same genome exists in every cell of
a human, the epigenome is composed differently in each cell and disease condition, and
thus it plays an important role in controlling cells and in the cause of disease. There are
various types of epigenetic mechanisms, including DNA methylation, histone modifica-
tion, and noncoding RNAs. These factors are inherited directly during cell division with-
out changes in base sequence, but the epigenome is changed by age and exposures from
the external environment. Understanding how the epigenome contributes to gene regula-
tion will provide deep insights into human disease.
20.2 Prerequisites
55 This practice requires a web browser and R.

55 R programing installation: Install an R program with a version higher than
3.0X. Refer to the Appendix B Sect. B.1.
55 Bioconductor installation: Install the Bioconductor analysis package. Refer to the
> biocLite(“bsseq”)
> biocLite(“bsseqData”)
> library(bsseq)
> library(bsseqData)
20.3 Epigenome Database
20.3.1 dbEM: A Database of Epigenetic Modifiers

(7 http://crdd.osdd.net/raghava/dbem/)

dbEM is a database that has information relating to epigenetic modifiers. It includes data
such as mutations, CNVs, and gene expression in thousands of tumor samples, cancer cell
lines, and healthy individuals obtained from COSMIC, CCLE, and the 1000 Genomes
20 Project. It includes both cancerous and normal genomes, and it aims to study the roles of
epigenetic proteins in oncogenesis and cancer drug resistance.
20.3 · Epigenome Database
341 20
.. Fig. 20.1 Profile based prediction query and result pages
20.3.1.1 Searching dbEM

dbEM supports various query types: (1) Simple search, which takes keyword input such as
a Gene, Gene Synonym, Location, Role, PFAM Domain, or Mutation. (2) Composite
search takes multiple specific inputs such as Cellular Location, modifier Class, and modi-
fier Subclass. (3) The Similarity search takes protein sequences in FASTA format. (4)
Alignment with Modifiers aligns an input protein sequence (FASTA format) against a
reference database. (5) Finally, Profile Based Prediction takes a protein sequence (FASTA
format) containing variants and predicts whether they are cancer causing mutations.
The figure below shows an example of Tools > Profile Based Prediction (. Fig. 20.1).

Both of the sequences that we input are predicted to be cancer sensitive. Prediction is
based on the HMM score from EMBL-EBI HAMMER. If the original HMM score is
higher, then it is considered a “Normal Variant”, while if the mutant HMM score is higher,
then it is predicted to be “Cancer Sensitive.”
20.3.1.2 Browse
From the Browse menu, you can view all data in a given database. (1) The Epigenetic
Modifiers database is sorted by each protein, and shows annotated IDs from Uniprot,
Homologues, PDB, and PubChem. (2) Selecting Chromosomes lists epigenetic proteins
by their position on each of the 24 human chromosomes. (3) The Frequency of Mutation
page shows the number of cancer mutant and normal variants per gene, referenced from
CCLE, COSMIC, and the 1000 Genomes Project. (4) Browsing by Genomic Features
allows searching for epigenetic proteins based on Mutation Frequency, Expression level,
and CNV copy number change. (5) Finally, under Drugs/Inhibitors, it shows a table of
drugs affecting epigenetic modifiers, listed by DrugBank or PubChem ID (. Fig. 20.2).

20.3.2 EpiFactors (7 http://epifactors.autosome.ru)
EpiFactor is a database containing epigenetic factors, their genes, and their products. In
this database, “epigenetic factors” means a protein that causes chromatin remodeling.
These can include: (1) proteins that act upon post-translational modifications of histones
(histone modification read, write, and erase); (2) proteins that move, eject, or restructure
nucleosomes (ATP-dependent chromatin remodelers); and (3) proteins that incorporate
histone variants into the nucleosomes.
.. Fig. 20.2 Frequency of mutation table showing DNMT3A
.. Fig. 20.3 Genes page from the EpiFactors database
The Epifactors menu has options for viewing Genes, Complexes, Histones and
Protamines, Expression, and Docs and Downloads. Let’s check Genes first. This page
shows information for 815 genes, including annotations for the HGNC symbol, HGNC
name, UniProt ID, Pfam domains ID, UniProt mouse ID, gene function, modification,
associated protein complexes, target entity, and Product (. Fig. 20.3).

Let’s search for a gene of interest. For the demonstration, let’s put “DNMT3A” in the
search field under “HGNC approved Symbol.” When the search results come up, click
‘details’ to go to the detail screen. Here, when you click on any database ID, you will go to
the information page associated with that database (. Fig. 20.4).

If you click on the Complexes, Histones and Protamines, or Expression menu, you can
perform a similar simple query. EpiFactor includes information for 69 Protein Complexes
and 95 Histones and Protamines (. Fig. 20.5).

20 The Expression menu includes 255 cell lines, 12 fractionations, 458 primary cells, 29
time courses, and 135 tissue types. After performing a query for ‘brain’ samples, the result
343 20
.. Fig. 20.4 Search results for the HGNC symbol ‘DNMT3A’ in EpiFactor
.. Fig. 20.5 Screenshot of the EpiFactor detail page for ‘DNMT3A’
shows 3 primary cell and 3 tissue samples. The links in each row lead to pages with expres-
sion results for the relevant genes (. Fig. 20.6).

20.3.3 MethylomeDB (7 http://www.neuroepigenomics.org/

methylomedb/)
MethylomeDB provides human and mouse DNA methylation profiles. Methylation profiles
include over 80% of CpG dinucleotides in the brains of humans and mice at single CpG reso-
.. Fig. 20.6 Expression page from the EpiFactors database
.. Fig. 20.7 Browser and download pages for MethylomeDB
lution. Profiles are generated through Methylation Mapping Analysis by Paired- end
Sequencing (Methyl-MAPS), and analyzed with the Methyl-Analyzer software package. In
this database, the integrated genome browser, an edited version of the UCSC genome browser,
is used to search for DNA methylation profiles at specific genomic loci, to search for specific
methylation patterns, and to compare methylation patterns between samples.
MethylomeDB provides Browse/Search and Download functions. Clicking on
20 “Browse” takes you to the MethylomeDB Browser. It uses the UCSC Genome Browser to
access the Brain Methylation Database, and you can query through either the Genome
Browser or the Table Browser interfaces. From the Download page, you can download
345 20
methylation data for each sample. Data are arranged by sample, tissue, organism, gender,
age, and PMI (hour), with links on the right for downloading (. Fig. 20.7).
20.3.4 NGSmethDB: High-Quality Methylomes and Differential

Methylation (7 http://bioinfo2.ugr.es/NGSmethDB/)

NGSmeth DB is a database that provides whole-genome methylation maps(methylomes).

These are produced by the high-throughput (formerly “next-generation”) sequencing
(NGS) of sodium-bisulfite treated DNA. Also, NGSmethDB includes 2 additional datas-
ets: (1) methylation segments, genome regions of homogeneous methylation, and (2) dif-
ferentially methylated single-cytosines. From this database, you can download a dump file
of whole methylome data, or obtain data via web-form by querying a data sample, or view
data with the UCSC Genome Browser. Data is available for many different tissues, patho-
logical conditions, and species (. Fig. 20.8).

Short-read MethFlow
datasets from pipeline
WGBS projects
High-quality Differential Methylation

methylation methylation segments
maps maps (DMCs) maps
NGSmethDB
(MongoDB)
NGSmethDB Database UCSC Track

API server dumps Hubs
HTTP RESTful NGSmethDB Web form Genome Data

Table Browser Integrator
access API client access Browser
Standalone NGSmethDB
client API client VM
.. Fig. 20.8 The complete workflow of NGSmethDB

20.4 Epigenome Analysis Tool
20.4.1 Bsseq: Analyzing WGBS with the Bsseq Package
Bsseq is an R package that uses the BSmooth algorithm to analyze and visualize whole-
genome shotgun bisulfite sequencing (WGBS) datasets. The BSmooth algorithm uses
smoothing to obtain reliable semi-local methylation estimates in low-coverage regions.
After smoothing, BSmooth estimates biological variation and confirms differentially
methylated regions (DMRs) using biological replicates. We will also need to consider the
variation between individuals, which may relate to disease.
For practice, we will use the data from Hansen KD et al., (2011) Nature Genetics. This
is a primary dataset focused on chromosomes 21 and 22, which was obtained from colon
cancer patients, and it contains both normal and cancer tissue. The data consists of 50 bp
single-end reads and was generated using ABI SOLiD sequencing.
20.4.1.1 Preparation
Download, install, and load the packages bsseq and bsseqData from Bioconductor using
biocLite.R
> biocLite(“bsseq”)
> biocLite(“bsseqData”)
> library(bsseq)
> library(bsseqData)
20.4.1.2 Practice Data Setup

Load the newest version of the “BS.cancer.ex” dataset, which is included in the BsseqData
package. Check the dataset’s structure to confirm numbers of samples and loci. It should
contain 6 samples and 958,541 methylation loci.
> data(BS.cancer.ex)
> BS.cancer.ex <- updateObject(BS.cancer.ex)
> BS.cancer.ex
> BS.cancer.ex
An object of type 'BSseq' with
958541 methylation loci
6 samples
has not been smoothed
20
20.4 · Epigenome Analysis Tool
347 20
Also check the phenotypic information in the dataset; it should be composed of three
cancer tissues and three normal tissues.
> pData(BS.cancer.ex)
> pData(BS.cancer.ex)
DataFrame with 6 rows and 2 columns
Type Pair
<character> <character>
C1 cancer pair1
C2 cancer pair2
C3 cancer pair3
N1 normal pair1
N2 normal pair2
N3 normal pair3
Execute BSmooth on the BS.cancer.ex data. The run will take around 2 min for each sam-
ple. The command option mc.cores defines the number of CPU cores to use, and can be
optimized to speed the analysis. (Please note that on Windows, the number of usable cores
is limited to one.)
> BS.cancer.ex.fit <- BSmooth(BS.cancer.ex, mc.cores = 1, verbose = TRUE)
Also check that the dataset BS.cancer.ex.fit is loaded. This data, composed of 958,541
methylation loci and 6 samples, has already been processed with “smoothing.”
> data(BS.cancer.ex.fit)
> BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit)
> BS.cancer.ex.fit
20.4.1.3 Poisson
Calculate the (log) distribution function of the Poisson distribution using the lambda
parameter.
Since CpG coverage is approximately 4x per sample, many zero coverage CpGs would
be expected by chance, but the same CpG must not be a zero coverage in all six samples.
> round(colMeans(getCoverage(BS.cancer.ex)), 1)
> round(colMeans(getCoverage(BS.cancer.ex)), 1)
[1] 3.5 4.2 3.7 4.0 4.3 3.9
With the assumption that coverage genome-wide follows a Poisson distribution with a
parameter (lambda) of 4, it is expected that 0.105 of CpGs have a zero coverage in at least
one sample.
> logp <- ppois(0, lambda = 4, lower.tail = FALSE, log.p = TRUE)
> round(1 –exp(6 * logp), 3)
20.4.1.4 BSmooth.tstat
Before analyzing differences between DMRs in the two tissue groups, eliminate low cover-
age CpGs to reduce false positive results. To do this, calculate coverages of the BS.cancer.
ex.fit dataset, which is composed of 597,371 methylation loci and 6 samples. Save those
coverages in the variable BS.cov.
> BS.cov <- getCoverage(BS.cancer.ex.fit)
In the BS.cancer.ex data frame, columns 1–3 are cancer tissues and columns 4–6 are nor-
mal tissues. To perform t-statistics using BSmooth.tstat, first select loci from BS.cov that
have coverages equal to or greater than two in each group, then apply rowSum to the
selected data and save the resulting value in the variable keepLoci.ex.
> keepLoci.ex <- which(rowSums(BS.cov[ , BS.cancer.ex$Type == “cancer”] >
2) >= 2 & rowSums(BS.cov[ , BS.cancer.ex$Type == “normal] >= 2) >= 2)
Next, save the rows from BS.cancer.ex.fit that are listed in keepLoci.ex to the variable BS.
cancer.ex.fit. The number of methylation loci in the dataset will decrease from 958,541 to
597,371.
> BS.cancer.ex.fit <- BS.cancer.ex.fit[keepLoci.ex, ]
Assign the cancer values as Group1 and the normal values as Group 2, then and perform
t-statistics for the reduced BS.cancer.ex.fit dataset. Save the calculated value to the variable
BS.cancer.ex.tstat.
> BS.cancer.ex.tstat <- BSmooth.tstat(BS.cancer.ex.fit, group1 = c(“C1”, “C2”, “C3”),
group2 = c(“N1”, “N2”, “N3”), estimate.var = “group2”, local.correct = TRUE,
verbose = TRUE)
Now plot the values of BS.cancer.ex.tstat. The blue line shows uncorrelated values while
the black line shows correlated values. The option estimate.var. can be used to define the
sample for measuring variability. In a cancer dataset, the normal sample is used since
cancer tissue has higher variability than normal tissue. The option local.correct provides
20 large-scale mean correlation, which we use for this dataset to discover large-scale
methylation differences between cancer and normal tissues.
349 20
0.30 uncorrected
corrected
0.25
0.20
Density
0.15
0.10
0.05
0.00
–10 –5 0 5 10
N = 538851 Bandwidth = 0.1625
.. Fig. 20.9 The marginal distribution of the t-statistic
20.4.1.5 plotTstat
This function displays the distribution of the t-statistics.
> plot(BS.cancer.ex.tstat)
Hypomethylated regions can be seen clearly in uncorrected t-statistics (. Fig. 20.9).
20.4.1.6 Differentially Methylated Regions (DMRs)

Once t-statistics have been calculated, differentially methylated regions (DMRs) are auto-
matically determined for the whole region. In the BS.cancer.ex.tstat dataset, find differen-
tially methylated regions, which are all genomic regions that have methylated loci with
t-statistics outside the range of −4.6–4.6 (low and high cutoffs respectively).
> dmr <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))

chr21: 28,215,373 - 28,219,673 (width = 4,301, extended = 5,000)
0.8
Methylation
0.5
0.2
.. Fig. 20.10 Plots for the top 200 individual DMRs with plotRegions
Select DMRs that matches the following criteria: (1) the number of affected CpGs is equal
to or greater than three, and (2) the difference in methylation average between cancer and
normal tissue is at least 0.1. Save this subset to the variable BS.cancer.ex.dmr.
> BS.cancer.ex.dmr <- subset(dmr, n >= 3, abs(meanDiff) >= 0.1)
20.4.1.7 plotSetup
> pData <- pData(BS.cancer.ex.fit)
For clear visualization, we will color data according to the associated phenotypes in
BS.cancer.ex.fit. Create a column named “col” in pData, assigning “red” to cancer samples
and “blue” to normal samples.
> pData$col <- rep(c(“red”, “blue”), each = 3)
> pData(BS.cancer.ex.fit) <- pData
Now plot the BS.cancer.ex.fit data. First, draw a plot for just the first row of BS.cancer.ex.
dmr (. Fig. 20.10).
> plotRegion(BS.cancer.ex.fit, BS.cancer.ex.dmr[1, ], extend = 5000,
addRegions = BS.cancer.ex.dmr)
Next, draw a plot for the first 200 rows of BS.cancer.ex.fit. Save the plots as a pdf file named
“dmrs_200_plots.pdf,” then close out the plot with dev.off().
20
351 20
> pdf(file = “dmrs_200_plots.pdf”, width = 10, height = 5)
> plotManyRegions(BS.cancer.ex.fit, BS.cancer.ex.dmr[1:200, ], extend = 5000,
addRegions = dmr)
> dev.off()
20.4.2 MethPipe: a Computational Pipeline for Analyzing

Bisulfite Sequencing Data (7 http://smithlabresearch.org/

software/methpipe/)
MethBase is a central methylome reference database generated from public BS-seq datas-
ets. MethBase includes hundreds of methylomes from various organisms including
Arabidopsis, human, mouse, zebrafish, chimp, and dog. For each methylome, MethBase
provides methylation levels at individual sites, regions of allele-specific methylation,
hypo- and hyper-methylated regions, and partially methylated regions. MethBase also
provides and detailed metadata and summary statistics for each methylome. These results
are created using the MethPipe software package, which is an independent and compre-
hensive standalone pipeline for analyzing WGBS and RRBS data.
20.4.3 PAVIS: Peak Annotation and Visualization

(7 https://manticore.neihs.nih.gov/pavis2)

PAVIS (Peak Annotation and Visualization) is a tool for annotating and visualizing ChIP-
seq and BS-seq data, and is useful for hypothesis generation and data analysis. The annota-
tion function provides the relative locations between query peaks and genes, and with
other comparison peaks in a genome, and reports the relative enrichment levels of peaks
in different genomic regions. This visualization tool offers simultaneous viewing of mul-
tiple peaks in the context of genomic features and nearby comparison peaks. PAVIS
accepts peak location data, which is created by a peak-calling tool, as input data. Accepted
inputs include files in the UCSC BED format, GFF3 format, and peak data files output by
most ChIP-seq data analysis tools.
Exercises
[Exercise 1] - Analyze Rhesus macaque data (Trung J et al., (2012) PNAS) using the bsseq R package.
[Exercise 2] - Check your data of interest against the MethBase database. Analyze it with MethPipe and
visualize the data through the Genome Browser.
Take Home Message

55 The characteristics of epigenome databases.
55 How to use existing tools to analyze epigenome data.
Bibliography
1. dbEM – http://crdd.osdd.net/raghava/dbem/
2. EpiFactors – http://epifactors.autosome.ru
3. Hansen KD et al (2012) BSmooth: from whole genome bisulfite sequencing reads to differentially
methylated regions. Genome Biol 13(10):R83. https://doi.org/10.1186/gb-2012-13-10-r83
4. Hansen KD, Langmead B, Irizarry RA (2012) BSmooth: from whole genome bisulfite sequencing reads
to differentially methylated regions. Genome Biol 13(10):R83. https://doi.org/10.1186/gb-2012-
13-10-r83
5. Hansen KD et al (2016) bsseqData: example whole genome bisulfite data for the bsseq package. R
package version 0.12.0
6. Huang W et al (2013) PAVIS: a tool for peak annotation and visualization. Bioinformatics 29(23):3097–
3099. https://doi.org/10.1093/bioinformatics/btt520. Epub 2013 Sep 4
7. Lebrón R et al (2017) NGSmethDB 2017: enhanced methylomes and differential methylation. Nucleic
Acids Res 45(D1):D97–D103. https://doi.org/10.1093/nar/gkw996. Epub 2016 Oct 27
8. Medvedeva YA et al (2015) EpiFactors: a comprehensive database of human epigenetic factors and
complexes. Database (Oxford) 2015:bav067. https://doi.org/10.1093/database/bav067. Print 2015
9. Plongthongkum N et al (2014) Advances in the profiling of DNA modifications: cytosine methylation
and beyond. Nat Rev Genet 15(10):647–661. https://doi.org/10.1038/nrg3772. Epub 2014 Aug 27
10. Singh Nanda J et al (2016) dbEM: A database of epigenetic modifiers curated from cancerous and
normal genomes. Sci Rep 6:19340. https://doi.org/10.1038/srep19340
11. Song Q et al (2013) A reference methylome database and analysis pipeline to facilitate integrative and
comparative epigenomics. PLoS One 8(12):e81148. https://doi.org/10.1371/journal.pone.0081148.
eCollection 2013
12. Xin Y et al (2012) MethylomeDB: a database of DNA methylation profiles of the brain. Nucleic Acids
Res 40(Database issue):D1245–D1249. https://doi.org/10.1093/nar/gkr1193. Epub 2011 Dec 2
20
353 21
Epigenome Data Analysis

21.2 Preparations – 354
21.3 Input Sequence Data – 354

21.3.1 Download Sequence Data – 354
21.3.2 Specify Input Sequence Data – 355
21.4 Data Filtering – 355

21.4.1 Alignment Control (AC) – 355
21.4.2 Quality Control (QC) – 356
21.5 Analysis of Methylation Status – 356

21.5.1 Sequence Name – 357
21.5.2 Alignment – 357
21.5.3 Sequence Start and End Positions – 358
21.5.4 Length of the Reference Sequence – 358
21.5.5 Methylated Positions in the Sample Sequence – 359
21.5.6 Methylated Positions in the Reference Sequence – 360
21.6 Exploratory Analysis and Visualization – 360

21.6.1 Plotting Methylation Status – 360
21.6.2 Lollipop Figure: Methylation Status – 360
21.6.3 Neighboring Co-occurrence Display – 362
21.6.4 Distant Co-occurrence Display – 364
21.7 Statistical Tests – 364

21.7.1 Fisher’s Exact Test – 364
21.7.2 Clustering Analysis – 365
21.7.3 Correspondence Analysis – 366

https://doi.org/10.1007/978-981-13-1942-6_21
354 Chapter 21 · Epigenome Data Analysis

21 In this chapter, we understand the structure of DNA methylation data obtained by bisulfite
sequencing and learn how to deal with misalignments and perform quality control. We
practice DNA methylation data extraction, exploratory data analysis, visualization, and vari-
ous statistical analysis.
21.1 Introduction
DNA methylation is an epigenetic mechanism involving the chemical modification of

DNA at CpG sites. Epigenetic modifications are involved in many functions such as X
chromosome inactivation, gene expression regulation, repetitive sequence silencing, and
genomic imprinting. Bisufite sequencing is a method that allows DNA methylation profil-
ing, as bisulfite treatment converts unmethylated Cytosines (C) to Uracils (U).
We will use methVisual R package for DNA methylation analysis. With this package,
we will perform (1) alignment of DNA sequences obtained from bisulfite sequencing, (2)
quality control to remove misaligned reads, and (3) visualization and statistical analysis
for data interpretation.
21.2 Preparations
* Environment
55 Software: methVisual Bioconductor package (R version 2.11.0 or higher required)
55 Data: Sample data provided with the BiQ Analyzer program (7 http://biq-analyzer.

bioinf.mpi-inf.mpg.de/)
55 Practice: Bisulfite sequencing samples from the mouse Gm9 region (Oda et al. [1])
> source(“http://bioconductor.org/biocLite.R”)
> biocLite(“methVisual”)
> library(methVisual)
※ Internet connection is needed to download methVisual packages

※ methVisual is case-sensitive.
※ When downloading packages, select “Yes” to the prompt that asks if the packages are to
be saved in a personal library
21.3 Input Sequence Data
21.3.1 Download Sequence Data
The input file for this analysis must be in FASTA format. FASTA format sequence data is
composed of (1) a line starting with “>” which contains summary information and (2)
sequences following. For more details, refer to 7 Chap. 11 Sect. 7 11.3.2. The directory

“C:\gda\ch21” includes files Master_sequence.txt, PathFileTab.txt, and file_A~seq_J.fasta.

The file PathFileTab.txt contains the path to additional data files.
21.4 · Data Filtering
355 21
21.3.2 Specify Input Sequence Data
We will use the MethDataInput function to load the sample sequences into the workspace.
The MethDataInput function takes as input a text file that contains a list of file names and
paths, which are tab-delimited. Assign the loaded sequences to the methData variable.
> methData <- MethDataInput(file.path(“C:\gda\ch21\PathFileTab.txt”))
> methData
> methData
FILE PATH
1 seq_A.fasta C:/gda/ch21/
2 seq_B.fasta C:/gda/ch21/
3 seq_C.fasta C:/gda/ch21/
4 seq_D.fasta C:/gda/ch21/
5 seq_E.fasta C:/gda/ch21/
6 seq_F.fasta C:/gda/ch21/
7 seq_G.fasta C:/gda/ch21/
8 seq_H.fasta C:/gda/ch21/
9 seq_I.fasta C:/gda/ch21/
10 seq_J.fasta C:/gda/ch21/
The reference sequence must be loaded as well. Using the selectRefSeq function, assign
“Master_Sequence.txt” file, which has the reference sequence, to the variable refseq.
> refseq <- selectRefSeq(file.path(“C:\gda\ch21\Master_Sequence.txt”))
> refseq
> refseq
[1]
CCCGGGATCGCTCTCCCAGCAGGTGAAGCCTCGCCATGGACCCTCCCCGTCGGGGCCCCGCGCT
G$
21.4 Data Filtering
21.4.1 Alignment Control (AC)
For methylation analysis, alignment control (AC) is the process of reducing misalign-
ments that can occur while comparing sample sequences with reference sequences. The
cases that can lead to misalignment are as follows: First, misalignment can occur with a
reversed sequence. Second, it can also misalign to a complement or reverse-complemented
sequence. In the AC process, the Needleman-Wunsch algorithm is used to score align-
ments and evaluate their reliability.
21.4.2 Quality Control (QC)

21
Typically, errors may occur during bisulfite conversion of DNA; the conversion process is
not perfect in the experiment. For example, unmethylated Cs may not be converted to U
during bisulfite treatment. Therefore, quality control (QC) is required to reduce errors
that may be included in the sample sequence. In vertebrates, as DNA methylation is lim-
ited to CpG sites, unconverted Cs found at non-CpG sites are considered to be errors.
Therefore, methVisual estimates the error rate by calculating the conversion rate of a
number of Cs in an area outside CpG sites.
In the methVisual package, one function performs AC and QC together. The com-
mand is as follows, and the result screen shows the degree of agreement between the
sample sequence and reference sequence. The minimum thresholds for sequence agree-
ment and conversion can be adjusted to meet the researcher’s requirements.
> QCdata <- MethylQC(refseq, methData, identity = 80, conversion = 85)
Note: If you do as follows, the minimum agreement rate will automatically be set to 80%
and the minimum conversion will be set to 85%.
> QCdata <- MethylQC(refseq, methData)
21.5 Analysis of Methylation Status
Using sample sequences passing alignment control (AC) and quality control (QC), we
extract their methylation status. We will use the MethyAlignNW function to extract this
information and interpret the results returned by this function.
> methQCData <- MethAlignNW(refseq, QCdata)
> methQCData <- MethAlignNW(refseq, QCdata)

Alignment with QC_seq_B.fasta done
Alignment with QC_seq_C.fasta done
Alignment with QC_seq_D.fasta done
Alignment with QC_seq_E.fasta done
Alignment with QC_seq_F.fasta done
Alignment with QC_seq_G.fasta done
Alignment with QC_seq_H.fasta done
Alignment with QC_seq_I.fasta done
As above, save information from 8 sequences into the object methQCData. Use the names
function to find the variable names stored in the object. There are six pre-named variables,
as shown below.
> names(methQCData)
21.5 · Analysis of Methylation Status
357 21
> names(methQCData)
[1] "seqName" "alignment" "methPos" "positionCGIRef" "startEnd" "lengthRef"
Display each variable above and understand its significance.

As shown above, object “methQCData” contain six different information types, which
we will devide into, individually.
21.5.1 Sequence Name
Entering the following returns eight filenames containing methylation information.
> methQCData$seqName
> methQCData$seqName
[1] "QC_seq_B.fasta" "QC_seq_C.fasta" "QC_seq_D.fasta" "QC_seq_E.fasta" "QC_seq_
F.fasta"
[6] "QC_seq_G.fasta" "QC_seq_H.fasta" "QC_seq_I.fasta"
21.5.2 Alignment
The following returns alignment results for sample sequences.
> methQCData$alignment
> methQCData$alignment
[1]"TTTGGGATTGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTTGTTGGGGTTTT
GTGTTGTTTT$
[2]"TTCGGGATCGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTCGTTGGGGTTTCG
TGTTGTTTT$
[3]"TCCGGGATCGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTCTTTGTTGGGGTTTTG
TGTTGTTTT$
[4]"TTTGGGATTGTTTTTTTTAGCAGGTGAAGTTTTGTTATGGATTTTTTTTGTTGGGGTTTT
GATGTTGTT$
[5]"TTCGGGATCGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTCGTTGGGGTTTCG
TGTTGCCCT$
GTGTTGTTTT$
[7]"TTCGGGATTGTTTTTTTAGTAGGTGAAGTTTTGTTATGGATTTTTTTTGTTGGGGTTTCG
TGTTGTTTT$
GTGTTGTTTT$
21.5.3 Sequence Start and End Positions

21
The following returns the start and end positions of a sequence. You can calculate the
length of the aligned sample sequences, which should be similar to the length of the refer-
ence sequence.
> methQCData$startEnd
> methQCData$startEnd
[,1] [,2]
[1,] 1 233
[2,] 1 233
[3,] 1 233
[4,] 1 233
[5,] 1 233
[6,] 1 233
[7,] 1 233
[8,] 1 233
21.5.4 Length of the Reference Sequence
The following returns the length of the reference sequence.
> methQCData$methPos
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 0 1 0 1 0 0 1 1 0 1 1 1
[3,] 1 1 0 0 0 0 0 0 0 1 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 1 0 0 0
[5,] 1 1 0 1 0 1 0 0 1 1 0 1 1 1
[6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[7,] 1 0 0 0 0 1 0 0 1 0 1 1 1 1
[8,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
[1,] 0 0 0 0 1 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 0 0 0 0
[3,] 0 1 0 0 1 1 0 0 0 1 0 0
[4,] 1 1 1 1 1 0 0 1 1 1 0 0
[5,] 1 1 1 1 1 1 1 1 0 0 0 0
[6,] 0 0 0 0 1 0 0 0 0 0 0 0
[7,] 1 1 1 1 1 1 0 1 1 1 1 0
[8,] 0 0 0 0 0 0 0 0 0 0 0 0
[,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
21.5 · Analysis of Methylation Status
359 21
[1,] 0 0 0 0 0 0 0 0 0
[2,] 1 1 0 1 1 0 0 1 0
[3,] 1 0 1 0 1 1 0 1 1
[4,] 0 1 0 1 0 0 0 0 0
[5,] 1 1 0 1 1 0 0 1 0
[6,] 0 0 0 0 0 0 0 0 0
[7,] 1 1 1 0 0 0 1 1 1
[8,] 0 0 0 0 0 0 0 0 0
21.5.5 Methylated Positions in the Sample Sequence
In the methAlignNW function, variable methPos returns the methylated position. The
methylated positions are indicated with 0 and 1 according to its agreement with the refer-
ence sequence.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 0 1 0 1 0 0 1 1 0 1 1 1
[3,] 1 1 0 0 0 0 0 0 0 1 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 1 0 0 0
[5,] 1 1 0 1 0 1 0 0 1 1 0 1 1 1
[6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[7,] 1 0 0 0 0 1 0 0 1 0 1 1 1 1
[8,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
[1,] 0 0 0 0 1 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 0 0 0 0
[3,] 0 1 0 0 1 1 0 0 0 1 0 0
[4,] 1 1 1 1 1 0 0 1 1 1 0 0
[5,] 1 1 1 1 1 1 1 1 0 0 0 0
[6,] 0 0 0 0 1 0 0 0 0 0 0 0
[7,] 1 1 1 1 1 1 0 1 1 1 1 0
[8,] 0 0 0 0 0 0 0 0 0 0 0 0
[,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
[1,] 0 0 0 0 0 0 0 0 0
[2,] 1 1 0 1 1 0 0 1 0
[3,] 1 0 1 0 1 1 0 1 1
[4,] 0 1 0 1 0 0 0 0 0
[5,] 1 1 0 1 1 0 0 1 0
[6,] 0 0 0 0 0 0 0 0 0
[7,] 1 1 1 0 0 0 1 1 1
[8,] 0 0 0 0 0 0 0 0 0
21.5.6 Methylated Positions in the Reference Sequence

21
As above, determine the locations of CpG sites in the reference sequence, which are
returned in the positionCGIRef variable by the methAlignNW function. Thirty-five
methylated positions are shown below.
> methQCData$positionCGIRef
> methQCData$positionCGIRef
[1] 3 9 32 48 51 59 61 69 73 83 96 98 102 104 110 118 120 125 128
[20] 131 136 145 148 153 160 163 165 167 175 177 198 200 209 217 232
21.6 Exploratory Analysis and Visualization
So far, we have simply determined methylated positions. It is not easy to understand what
these results mean. Therefore, we will visualize the results to perform more meaningful
interpretation. Below are several examples of commonly used visualization methods.
21.6.1 Plotting Methylation Status
The plotAbsMethyl function returns the total number of CpG sites for each sample
sequence by sequence position. You can then plot methylation level by sequence position
(. Fig. 21.1).

> plotAbsMethyl(methQCData, real = TRUE)
> plotAbsMethyl(methQCData, real=TRUE)

[1] 4 3 0 2 0 3 0 0 3 3 2 3 3 3 4 5 4 4 7 4 2 4 2 3 1 0 4 4 2 3 3 1 1 4 2
21.6.2 Lollipop Figure: Methylation Status
The MethLollipops function returns the methylation status of the sample sequence and
the reference sequence at the indexed CpG sites. The graph visually compares methylation
status per position between the sample and the reference (. Fig. 21.2).
> MethLollipops(methQCData)
21.6 · Exploratory Analysis and Visualization
361 21
5
4 Absolute number of methylated CpGs
absolute number of methylated CpG
3
2
1
0
0 50 100 150 200

position of CpG methylation on genomic sequence
.. Fig. 21.1 Visualization of DNA methylation using plotAbsMethyl functions
> MethLollipops(methQCData)
LABEL_Y_AXIS Experiment
1 1 QC_seq_B.fasta
2 2 QC_seq_C.fasta
3 3 QC_seq_D.fasta
4 4 QC_seq_E.fasta
5 5 QC_seq_F.fasta
6 6 QC_seq_G.fasta
7 7 QC_seq_H.fasta
8 8 QC_seq_I.fasta
9 refSeq refernceSequence
21 refSeq
5
index of clone sequences
0 5 10 15 20 25 30 35
index of CpG methylation
.. Fig. 21.2 Visualization of DNA methylation using MethLollipops functions
21.6.3 Neighboring Co-occurrence Display
This plot shows the correlation of methylation status between two neighboring CpG sites
(. Fig. 21.3).

> filepath <- file.path(“C:\gda\ch21\Cooccurrence.pdf”)
> Cooccurrence(methQCData, file = filepath)
> filepath <- file.path("C:/gda/ch21/Cooccurrence.pdf")

> Cooccurrence(methQCData, file=filepath)
LABEL_Y_AXIS Experiment
1 1 QC_seq_B.fasta
2 2 QC_seq_C.fasta
3 3 QC_seq_D.fasta
4 4 QC_seq_E.fasta
5 5 QC_seq_F.fasta
6 6 QC_seq_G.fasta
7 7 QC_seq_H.fasta
8 8 QC_seq_I.fasta
9 coocureceLolipop coocureceLolipop
Coocurrence Plot
0.33 0.17 0 0.17 0 0.33 0 0 0.33 0.17 0.33 0.33 0.33 0.33 0.5 0.5 0.5 0.5 0.33 0.33 0.17 0.5 0.33 0.33 0.17 0 0.33 0.5 0.17 0.33 0.17 0 0.17 0.33 0.17
0.53 NA NA NA NA NA NA NA 0.83 –0.32 0.25 1 1 0.71 1 1 1 0.45 0.32 0.83 0.45 0.71 1 0.83 NA NA 0.71 0.45 –0.32 0.83 NA NA 0.83 0.83
6
21.6 · Exploratory Analysis and Visualization
5
4
3
2
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
genomic position of CpG site
.. Fig. 21.3 Comparison of methylation at neighboring CpG sites from Cooccurrence.pdf

21 363
21 Distant cooccurrence plot 217

1
232
1
0.63
209
1 0.63 1
200
NA NA NA NA
198
1 NA -0.2 0.63 -0.2
177
1 0.63 NA -0.32 0.25 -0.32
175
1 -0.32 -0.2 NA 1 0.63 1
167
1 0.45 0.71 0.45 NA 0.45 0.71 0.45
165
1 0.71 0.63 0.25 0.63 NA 0.63 1 0.63
cor 1 163
NA NA NA NA NA NA NA NA NA NA
160
1 NA 0.63 0.45 1 -0.32 -0.2 NA 1 0.63 1
153
1 0.63 NA 0.25 0.71 0.63 0.25 -0.32 NA 0.63 0.25 0.63
148
1 1 0.63 NA 0.25 0.71 0.63 0.25 -0.32 NA 0.63 0.25 0.63
145
1 0.71 0.71 0.45 NA 0.71 1 0.45 0.71 0.45 NA 0.45 0.71 0.45
136
1 0.45 -0.32 -0.32 -0.2 NA 0.63 0.45 -0.2 0.63 1 NA -0.2 0.63 -0.2
131
1 0.63 0.71 0.25 0.25 0.63 NA 1 0.71 0.63 0.25 0.63 NA 0.63 1 0.63
128
1 0.32 0.2 0.45 0.32 0.32 0.2 NA 0.32 0.45 0.2 0.32 0.2 NA 0.2 0.32 0.2
–0.32 125
1 0.45 0.71 0.45 1 0.71 0.71 0.45 NA 0.71 1 0.45 0.71 0.45 NA 0.45 0.71 0.45
120
1 1 0.45 0.71 0.45 1 0.71 0.71 0.45 NA 0.71 1 0.45 0.71 0.45 NA 0.45 0.71 0.45
118
1 1 1 0.45 0.71 0.45 1 0.71 0.71 0.45 NA 0.71 1 0.45 0.71 0.45 NA 0.45 0.71 0.45
110
1 1 1 1 0.45 0.71 0.45 1 0.71 0.71 0.45 NA 0.71 1 0.45 0.71 0.45 NA 0.45 0.71 0.45
104
1 0.71 0.71 0.71 0.71 0.32 1 0.63 0.71 0.25 0.25 0.63 NA 1 0.71 0.63 0.25 0.63 NA 0.63 1 0.63
102
1 1 0.71 0.71 0.71 0.71 0.32 1 0.63 0.71 0.25 0.25 0.63 NA 1 0.71 0.63 0.25 0.63 NA 0.63 1 0.63
98
1 1 1 0.71 0.71 0.71 0.71 0.32 1 0.63 0.71 0.25 0.25 0.63 NA 1 0.71 0.63 0.25 0.63 NA 0.63 1 0.63
96
1 0.25 0.25 0.25 0.71 0.71 0.71 0.71 0.32 0.25 -0.32 0.71 1 1 0.63 NA 0.25 0.71 0.63 0.25 -0.32 NA 0.63 0.25 0.63
83
1 -0.32 0.63 0.63 0.63 0.45 0.45 0.45 0.45 0.2 0.63 1 0.45 -0.32 -0.32 -0.2 NA 0.63 0.45 -0.2 0.63 1 NA -0.2 0.63 -0.2
73
1 0.63 0.25 1 1 1 0.71 0.71 0.71 0.71 0.32 1 0.63 0.71 0.25 0.25 0.63 NA 1 0.71 0.63 0.25 0.63 NA 0.63 1 0.63
69
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
61
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
59
1 NA NA 1 0.63 0.25 1 1 1 0.71 0.71 0.71 0.71 0.32 1 0.63 0.71 0.25 0.25 0.63 NA 1 0.71 0.63 0.25 0.63 NA 0.63 1 0.63
51
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
48
1 NA 0.63 NA NA 0.63 1 -0.32 0.63 0.63 0.63 0.45 0.45 0.45 0.45 0.2 0.63 1 0.45 -0.32 -0.32 -0.2 NA 0.63 0.45 -0.2 0.63 1 NA -0.2 0.63 -0.2
32
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
91 NA 1 NA 0.63 NA NA 0.63 1 -0.32 0.63 0.63 0.63 0.45 0.45 0.45 0.45 0.2 0.63 1 0.45 -0.32 -0.32 -0.2 NA 0.63 0.45 -0.2 0.63 1 NA -0.2 0.63 -0.2
31 0.63 NA 0.63 NA 1 NA NA 1 0.63 0.25 1 1 1 0.71 0.71 0.71 0.71 0.32 1 0.63 0.71 0.25 0.25 0.63 NA 1 0.71 0.63 0.25 0.63 NA 0.63 1 0.63
.. Fig. 21.4 Plot of co-occurrence between distant CpG sites
21.6.4 Distant Co-occurrence Display
This visualization estimates the co-occurrence between A and neighboring base positions
as well as between A and distant base positions (. Fig. 21.4).
> distantCooccur <- matrixSNP(methQCData)
> plotMatrixSNP(distantCooccur, methQCdata)
21.7 Statistical Tests
21.7.1 Fisher’s Exact Test
We can perform a Fisher’s exact test to determine the significance of differences between
two groups for each CpG site. Run a Fisher’s exact test for 35 CpG sites (. Fig. 21.5).
> methFisherTest(methQCData, c(1, 4, 6, 8), c(2, 3, 5, 7))

21.7 · Statistical Tests
365 21
Fisher’s exact test
1.0
0.8
0.6
p-value
0.4
0.2
0.0
0
10
15
20
25
30
CpG index 35
.. Fig. 21.5 Fisher’s exact test for comparison of two groups
> methFisherTest(methQCData, c(1,4,6,8), c(2,3,5,7))

[1] 0.0286 0.1429 1.0000 0.4286 1.0000 0.1429 1.0000 1.0000 0.1429 0.1429 1.0000
0.1429
[13] 0.1429 0.1429 0.4857 0.1429 0.4857 0.4857 1.0000 0.0286 0.4286 0.4857 1.0000
1.0000
[25] 1.0000 1.0000 0.0286 0.4857 0.4286 1.0000 0.1429 1.0000 1.0000 0.0286 0.4286
21.7.2 Clustering Analysis
Clustering analysis is used to search for similar patterns of methylation state across sam-
ples. We can draw a heatmap using hierarchical clustering and infer a group that shows
similar patterns of methylation (. Fig. 21.6).

> heatMapMeth(methQCData)
21
6
35
33
25
29
24
11
23
32
26
8
7
3
5
30
31
21
10
2
4
19
34
27
20
14
13
12
9
1
6
28
22
18
17
15
16
.. Fig. 21.6 Heatmap for methylated CpGs
21.7.3 Correspondence Analysis
We can also perform correspondence analysis to cluster methylation patterns among

sequences and among sites, and better understand whether they are related (. Fig. 21.7).
> methCA(methQCData)
Exercise
[Exercise 1] - Load sample and reference sequences from the bisulfite sequenced samples of the Mouse
Gm9 region (Oda et al. [1]). Return the results of the following questions.
1.1. Select a sequence with an agreement rate of 70% and a conversion rate of 80% or less.
1.2. Perform alignment and extract CpG sites.
1.3. Visualize the methylation pattern using a Lollipop plot.
1.4. Perform statistical analysis to determine the difference between the following two groups: Group
1 comprising sequences 1,2,5,7, and 8 vs. Group 2 comprising sequences 3,4,6,9, and 10.
1.5. Draw a heatmap of the sequence and methylation sites using cluster analysis.
Bibliography
367 21
0.6 CpG30
clone3
CpG19
0.4
CpG31
CpG21
CpG10
CpG2
CpG4
CpG32
CpG26
CpG3
CpG5
CpG8
CpG7
clone4
clone1
0.2
Dimension 2 (29.3%)
clone2 CpG22
CpG15
CpG18
CpG17
CpG16
CpG28
CpG24
CpG23
CpG11
clone6
0.0
–0.2
CpG34
CpG20
CpG12
CpG14
CpG13
CpG27
CpG6
CpG9
CpG1
–0.4
clone5
–0.6
CpG33
CpG29
CpG25
–0.5 0.0 0.5

Dimension 1 (41.7%)
.. Fig. 21.7 Clustering of methylation patterns in correspondence analysis
Take Home Message

55 How to analyze the bisulfite sequencing-driven DNA methylation data.
55 How to visualize the epigenome data.
55 What statistical methods are used.
Bibliography
1. Oda M et al (2006) DNA methylation regulates long-range gene silencing of an X-linked homeobox
gene cluster in a lineage-specific manner. Genes Dev 20:3382–3394
2. Zackay A, Steinhoff C (2010) MethVisual - visualization and exploratory statistical analysis of DNA
methylation profiles from bisulfite sequencing. BMC Res Notes 3:337
3. Zackay A, Steinhoff C (2016) methVisual: Methods for visualization and statistics on DNA methylation
data. R package version 1.26.0

10.1007@978 981 13 1942 6

Uploaded by

Copyright:

Available Formats

You might also like

10.1007@978 981 13 1942 6

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

10.1007@978 981 13 1942 6

Uploaded by

Copyright:

Available Formats

Learning Materials in Biosciences

More information about this series at http://www.springer.com/series/15430

Additional material to this book can be downloaded from http://extras.springer.com.

ISSN 2509-6125 ISSN 2509-6133 (electronic)

Library of Congress Control Number: 2018966388

© Springer Nature Singapore Pte Ltd. 2019

In the era of experimental biology, which is represented by traditional experiments in

Each exercise is comprised of one chapter of theoretical background, which is relevant to

At the Hamchunmoon Gate, in May 2015

I Bioinformatics for Life and Personal Genome

1 Bioinformatics for Life...................................................................................................................... 3

2 Next-Generation Sequencing Technology and Personal

3 Personal Genome Data ­Analysis................................................................................................. 33

4 Personal Genome Interpretation and Disease Risk ­Prediction.................................... 47

II Advanced Microarray Data Analysis

6 Gene Expression Data A

6.4 Clustering Analysis.............................................................................................................................. 111

7 Gene Ontology and Biological Pathway-Based Analysis................................................ 121

8 Gene Set Approaches and Prognostic Subgroup Prediction........................................ 135

8.7.5 Extraction Samples with BRCA1 and BRCA2 Methylation........................................................ 154

9 MicroRNA Data Analysis.................................................................................................................. 159

III Network Biology, Sequence, Pathway

10 Network Biology, Sequence, Pathway and Ontology ­Informatics............................. 175

11 Motif and Regulatory Sequence Analysis............................................................................... 189

11.4 Transcription Regulatory Site Prediction Using Sequence Alignments.......................... 202

12 Molecular Pathways and Gene Ontology............................................................................... 213

13 Biological Network Analysis.......................................................................................................... 233

IV SNPS, GWAS and CNVS, Informatics for Genome Variants

14.5 Genome-Wide Association Studies (GWAS)............................................................................... 254

15 SNP Data Analysis.............................................................................................................................. 261

16 GWAS Data Analysis.......................................................................................................................... 281

17 CNV Analysis......................................................................................................................................... 299

17.3 Data............................................................................................................................................................ 301

V Metagenome and Epigenome, Basic Data Analysis

19 Metagenome Data Analysis.......................................................................................................... 325

20 Epigenome Database and Analysis Tools............................................................................... 339

20.4 Epigenome Analysis Tool................................................................................................................... 346

21 Epigenome Data Analysis............................................................................................................... 353

Bioinformatics for Life

Chapter 1 Bioinformatics for Life – 3

Chapter 2 Next-Generation Sequencing Technology and

Chapter 3 Personal Genome Data A

Chapter 4 Personal Genome Interpretation and Disease

1.2 Life and Information – 5

1.3 The Human Genome Project – 8

1.4 Development of Microarray Technology and

1.5 Explosion of Data from Next-Generation

1.6 The Era of Systems Biology and

© Springer Nature Singapore Pte Ltd. 2019

What You Will Learn in This Chapter

1.2 Life and Information

routes in our example, determined by dynamic programming. Vertical and horizontal

Input: two strings S1 and S2

S2(1, ..., [j-1])

1.3 The Human Genome Project

1.4 Development of Microarray Technology

10 Microarray In Node negative Disease may Avoid Chemo Therapy.

The first application of genomic polymorphisms is personalized medicine.

3 Personal Genome Data Analysis................................................................................................. 33

4 Personal Genome Interpretation and Disease Risk Prediction.................................... 47

6.4 Clustering Analysis.............................................................................................................................. 111

8.7.5 Extraction Samples with BRCA1 and BRCA2 Methylation........................................................ 154

10 Network Biology, Sequence, Pathway and Ontology Informatics............................. 175

11.4 Transcription Regulatory Site Prediction Using Sequence Alignments.......................... 202

14.5 Genome-Wide Association Studies (GWAS)............................................................................... 254

17.3 Data............................................................................................................................................................ 301

20.4 Epigenome Analysis Tool................................................................................................................... 346

1.2 Life and Information – 5

1.3 The Human Genome Project – 8

1.4 Development of Microarray Technology and

1.5 Explosion of Data from Next-Generation

1.6 The Era of Systems Biology and

1.2 Life and Information

1.3 The Human Genome Project

1.4 Development of Microarray Technology

1.5 Explosion of Data from Next-Generation Sequencing

1.6 The Era of Systems Biology and Biomedical Informatics

2.2 Data Format – 20

2.3 Alignment of NGS Sequencing Reads – 22

2.4 SNP and InDel Detection – 24

2.5 Annotation of Sequence Variation and Function

2.2 Data Format

2.2.1 FASTQ Format

2.2.2 CSFASTA Format

2.2.3 Concept of QV

2.3 Alignment of NGS Sequencing Reads

Maq Li H et al. 7 http://maq.sourceforge.net/

Bowtie Langmead B 7 http://bowtie-bio.

SSAHA2 Ning Z et al. 7 http://www.sanger.ac.uk/

BWA Li H and 7 http://bio-bwa.sourceforge.

SOAP2 Li R et al. 7 http://www.sanger.ac.uk/

2.3.1 Sequence Alignment with a Hash Table

2.3.2 Sequence Alignment with a Suffix Tree

2.4 SNP and InDel Detection

2.5 Annotation of Sequence Variation and Function Prediction

2.5.1 nnotation and Clinical Interpretation

2.5.2 Annotation and Clinical Interpretation of Rare Variants

2.5.2.1 SIFT (Sorting Intolerant from Tolerant)

2.5.2.2 PolyPhen (Polymorphism Phenotype)

2.5.2.3 hD-SNP (Prediction of Human Deleterious Single

2.5.2.4 VAAST (the Variant Annotation, Analysis and Search Tool)

2.5.3 KEGG Disease Pathways Mapping

2.5.5 Distribution of Genetic Variants in the Population

7 ftp://ftp.1000genomics.ebi.ac.uk/vol1/ftp/release/20130502/). The most current data is

Super Population, Description Phase 1 Final Sub-