
[ Supplement An Overview of Study Design and Statistical Considerations ]

A Bioinformatics Crash Course for Interpreting Genomics Data
Daniel M. Rotroff, PhD

Reductions in genotyping costs and improvements in computational power have made conducting genome-wide association studies (GWAS) standard practice for many complex diseases. GWAS is the assessment of genetic variants across the genome of many individuals to determine which, if any, genetic variants are associated with a specific trait. As with any analysis, there are evolving best practices that should be followed to ensure scientific rigor and reliability in the conclusions. This article presents a brief summary for many of the key bioinformatics considerations when either planning or evaluating GWAS. This review is meant to serve as a guide to those without deep expertise in bioinformatics and GWAS and give them tools to critically evaluate this popular approach to investigating complex diseases. In addition, a checklist is provided that can be used by investigators to evaluate whether a GWAS has appropriately accounted for the many potential sources of bias and generally followed current best practices.

CHEST 2020; 158(1S):S113-S123

KEY WORDS: bioinformatics; genomics; statistics

General Overview of Study Design

Recent technologic advancements have led to unprecedented volumes of data, bringing renewed hope for medical breakthroughs. However, the generation of large volumes of data has also brought new challenges for analysis and interpretation. Omics has recently become a popular term in biological research and refers to technologies that aim to comprehensively evaluate a group of molecular or biochemical constituents. Unlike previous approaches (eg, candidate gene studies), a major benefit to omics technologies is that one does not need a priori knowledge of what targets may be involved. The ability to agnostically evaluate potential targets opens the door to discovering novel targets involved in disease pathogenesis. Genome-wide association studies (GWAS) have become increasingly popular and involve the assessment of genetic variants across the genome of many individuals to

ABBREVIATIONS: GWAS = genome-wide association studies; LD = linkage disequilibrium; MAF = minor allele frequency; PRS = polygenic risk score; SKAT = sequence kernel association test; SNP = single nucleotide polymorphism; QQ = quantile-quantile
AFFILIATIONS: From the Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH.
FUNDING/SUPPORT: This study was supported in part by grants provided by the Clinical and Translational Science Collaborative of Cleveland (KL2TR002547 to D. M. R.) from the National Center for Advancing Translational Sciences.
CORRESPONDENCE TO: Daniel Rotroff, PhD, Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Ave, Cleveland, OH 44195; e-mail: rotrofd@ccf.org
Copyright © 2020 American College of Chest Physicians. Published by Elsevier Inc. All rights reserved.
DOI: https://doi.org/10.1016/j.chest.2020.03.004

chestjournal.org S113
determine which, if any, genetic variants are associated with a specific trait.

Precision pulmonology will require a deep understanding of disease heterogeneity. Whereas some hereditary pulmonary diseases (eg, cystic fibrosis, alpha1-antitrypsin deficiency) have relatively well-understood pathologies, other lung diseases (eg, COPD, asthma) are highly complex and heterogeneous. Furthermore, because the respiratory system is integrally tied to our physiology, many diseases can result in pulmonary comorbidities (eg, pulmonary hypertension in patients with valvular heart disease, pulmonary fibrosis in patients with rheumatoid arthritis). There is heterogeneity in the presentation of these secondary disorders due to complex genetic and environmental factors. As with many complex diseases, many primary and secondary pulmonary diseases are likely to be influenced by multiple molecular and biochemical effects with modest effect sizes, making them difficult to detect. Clever bioinformatics approaches will be needed to identify the many contributing factors for the diseases that fall within this category.

The aim of the current review was to describe key bioinformatics considerations when analyzing and interpreting genomics data. As a companion to this review, it is recommended that readers use the glossary of common genomics terms provided by the Journal of the American Medical Association,1 which is available at https://jamanetwork.com/journals/jama/fullarticle/1677346. Although this review is not meant to serve as a detailed guide for performing the analyses, it is meant to provide clinicians and investigators without bioinformatics expertise with the knowledge to identify potential sources of bias, critically evaluate genomics research, and better understand how to effectively use genomics to advance chest medicine in their domain. To this end, Table 1 contains a checklist for researchers and statisticians to use during planning and performance of GWAS. The short list of questions for reviewers contains guiding questions that can be used by clinical reviewers to critically evaluate GWAS. In addition, a table representing key considerations for various stages of genomics study development is included (Table 2).

Considerations Prior to Initiating a Genomics Study

There is no one-size-fits-all analysis plan, and each analysis plan should consider potential impacts from the cohort selection (eg, sex, race, socioeconomic status, age, medication usage), study design (eg, cross-sectional, longitudinal), technical considerations (eg, batch effects, biological and technical replicates), and traits (eg, quantitative, dichotomous). The way these factors are incorporated into the analysis can affect the success of the study. For instance, is it appropriate to include sex as a model covariate or should the cohort be stratified according to sex? Stratifying according to sex requires performing tests separately on each group, reducing the sample size but providing separate effect size estimates for each group. It is easy to imagine other study factors that may need to be considered in the same way.

Genomics Studies Involve Many Statistical Tests, Requiring Special Attention: Due to the nature of genomics technologies, a large number of features (eg, single nucleotide polymorphisms [SNPs]) are evaluated, resulting in potentially millions of individual statistical tests. It is essential that adjustments for multiple testing are applied to limit the likelihood of observing large numbers of false-positive results. Bonferroni corrections use a family-wise error rate approach, which aims to minimize the probability of having any false-positive findings (type I error). It is easy to calculate: just divide the desired statistical threshold (eg, P < .05) by the number of independent statistical tests to determine the adjusted significance threshold. However, this approach is conservative and is prone to producing false-negative findings.2 A false discovery rate approach can also be used; it aims to control the expected proportion of false discoveries among the findings declared significant. It is less conservative and allows the investigator to select the proportion of false-positive findings that will be tolerated in order to reduce false-negative findings.3 The choice between Bonferroni and false discovery rate approaches will depend on how conservative the investigators wish to be in minimizing false-positive findings.

Although performing multiple testing adjustment is necessary in GWAS, performing the adjustment based on the total number of SNPs is inappropriate because the SNPs are not independent due to linkage disequilibrium (LD). LD is the nonrandom arrangement of alleles across the genome and results in SNPs being correlated with other nearby SNPs. LD is a consequence of

S114 Supplement [ 158#1S CHEST JULY 2020 ]


TABLE 1 ] Checklist for Researchers and Statisticians to Evaluate the Performance of Genome-Wide Association Studies

Ensure the following:
- The sample size is appropriate for the study
- The phenotype is well defined
- There is evidence that the phenotype is heritable
- The statistical model is appropriate for the phenotype
  - Dichotomous trait, no covariates: Fisher exact test, χ² test
  - Dichotomous trait: logistic regression model
  - Quantitative trait: linear regression model
  - Multivariate trait: MANOVA/MANCOVA
  - Categorical/ordinal trait: multinomial logistic regression
  - Other ______________________________
- Sex was appropriately accounted for
  - Incorporated as a covariate
  - Cohort stratified according to sex
- Population stratification was appropriately accounted for
  - Principal component analysis
  - Genetic relationship matrix incorporated into statistical model
  - Cohort stratified
  - Other ____________________________
- Linkage disequilibrium is accounted for in the population stratification analysis, if applicable
- If imputation was performed
  - An appropriate reference population was used for imputation, if applicable
- Study-specific features were accounted for (eg, batch effects, medications)
- An appropriate genetic model was selected
  - Additive
  - Recessive
  - Dominant
  - Multiple models considered
- Study covariates were appropriately selected
  - Expert opinion
  - Selection procedure
- An appropriate minor allele frequency was used for the common variant analysis
- The results of the covariate selection were adequately described
- A quantile-quantile plot was presented, and minimal inflation was observed
- A multiple test correction was applied
  - Standard genome-wide threshold of P < 5 × 10⁻⁸
  - Bonferroni adjustment
  - False discovery rate approach
  - Other ______________________
- If the goal was to create a prediction model
  - Cross-validation was performed to evaluate overfitting and tested in an independent cohort
- If the goal was to perform an association analysis
  - The associations were replicated in an independent cohort or were functionally validated in an appropriate model system

MANCOVA = multivariate analysis of covariance; MANOVA = multivariate analysis of variance.

genetic distance, history of mutation events, and changes in population dynamics. Studies investigating the number of unique LD “blocks” across the genome, an estimated one million independent regions, have determined that 5 × 10⁻⁸ (ie, 0.05 divided by 1,000,000) is a reasonable Bonferroni-adjusted threshold for a type I error rate of 0.05, and this threshold has since become the standard convention for GWAS.4 Multiple testing has garnered attention because it has been identified as a likely contributor to why many published studies have failed to be reproducible.5,6 Due to the aforementioned challenges, it has now become expected that replication be performed to verify associations observed in genomics studies. This is often done through replication in a similar but independent cohort, or functional validation in an in vivo or in vitro model system. It is important to consider options for validating findings prior to initiating the study because additional funds and expertise may be needed to confirm discoveries.

The Phenotype Should Be Well Defined: Phenotypic heterogeneity is increasingly being realized as a challenge for omics studies. A trait with multiple underlying etiologies can result in a loss of statistical power. In fact, Manchia et al7 determined that 50% phenotypic heterogeneity in GWAS increases the necessary sample size by three times, suggesting that appropriate phenotype classification may be more important than increased sample sizes. Phenotypic heterogeneity can also pose a challenge for replicating findings; the cohort and phenotype used for replication must be appropriately similar to the discovery cohort to be successful.

TABLE 2 ] Key Stages of Genomics Study Development and Analysis

Stage: Important Considerations

Study design and preparation
- Determine if evidence exists that the phenotype is heritable: genetics contributes to phenotypic variation
- Consider whether the phenotype is clearly defined and whether the phenotype definition will be applied consistently across the entire cohort
- Analysis plan must be appropriate for the study design (eg, cross-sectional, longitudinal)
- Consider cohort factors that may affect results and ensure that information will be collected to evaluate or adjust for these impacts (eg, sex, race, socioeconomic status, age, medications)
- Consider technical factors that may affect results (eg, batch effects, technical and/or biological replicates)
- Many statistical tests will be conducted; consider impacts on statistical power after correcting for multiple testing

Data collection
- Collect data uniformly across the study
- Evaluate phenotypes and/or survey responses periodically for any systematic bias or incomplete information

Data processing
- Consider excluding SNPs and individuals with extensive missingness or poor quality-control metrics
- Test for relatedness if not performing a family-based study
- Account for population stratification
- Select variables that will be adjusted in models (eg, sex, batch, age, medications)
- Identify a minor allele frequency threshold for common and rare variants
- Perform genomic imputation, if desired

Postprocessing data analysis: common variant analysis
- Select an appropriate genetic model (ie, additive, recessive, dominant)
- Select appropriate statistical model based on trait (eg, linear regression, logistic regression) and add selected variables to model
- Test each SNP that is above the minor allele frequency threshold in the statistical model
- Create a quantile-quantile plot to assess whether type I error rate inflation is observed in the calculated P values
- Perform multiple test correction and create a Manhattan plot to visualize results

Optional follow-up analyses
- GWAS replication in an external cohort
- Rare variant analysis
- Pathway-based or network-based analysis
- Polygenic risk scores
- Integration with other omics platforms
- Functional follow-up in in vitro or in vivo models

GWAS = genome-wide association studies; SNP = single nucleotide polymorphism.
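
The SNP-level quality-control steps listed under data processing in Table 2 can be sketched in a few lines of Python. This is only a toy illustration: genotypes are assumed to be coded as counts of the B allele (0, 1, or 2) with None for a missing call, and the function names and the 5% missingness cutoff are illustrative assumptions rather than recommendations from this review. The 1% minor allele frequency cutoff is taken from the lower end of the range commonly used to separate common from rare variants.

```python
def missing_rate(genotypes):
    """Fraction of individuals with a missing genotype call (None)."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0/1/2 copies of the B allele; missing calls are skipped."""
    called = [g for g in genotypes if g is not None]
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1 - freq_b)


def passes_qc(genotypes, max_missing=0.05, min_maf=0.01):
    """Keep a SNP for the common variant analysis only if it passes both filters."""
    if missing_rate(genotypes) > max_missing:
        return False
    return minor_allele_frequency(genotypes) >= min_maf


# Hypothetical genotype vectors for three SNPs.
snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, 0, 0, 1, 2],
    "snp_rare": [0] * 99 + [1],
    "snp_missing": [0, 1, None, None, 2, 1, 0, None, 1, 0],
}
kept = [name for name, g in snps.items() if passes_qc(g)]
```

In this example only the first SNP is retained: the second falls below the minor allele frequency cutoff, and the third has too many missing calls.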



TABLE 3 ] Example of SNP Dichotomization for Classical Genetic Models(a)

Model | Unaffected Genotype | Partially Affected Genotype | Fully Affected Genotype | Unaffected SNP Encoding | Partially Affected SNP Encoding | Fully Affected SNP Encoding
Recessive | AA or AB | . | BB | 0 or 1 | . | 2
Dominant | AA | . | AB or BB | 0 | . | 1 or 2
Additive | AA | AB | BB | 0 | 1 | 2

See Table 2 legend for expansion of abbreviation.
(a) In this example, B is the “risk allele” (the allele increasing the trait effect). 0 encoding = AA genotype; 1 encoding = AB genotype; 2 encoding = BB genotype.
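
The groupings in Table 3 can also be expressed as a small lookup. The sketch below assumes, as in the table footnote, that B is the risk allele; the function name `encode_genotype` is hypothetical. For the recessive and dominant models, the grouped genotypes are collapsed here into a single 0/1 indicator (affected group versus unaffected group), which is one common way to enter these models into a regression and is an assumption beyond what the table itself shows.

```python
def encode_genotype(genotype, model):
    """Map a genotype string ('AA', 'AB', 'BB') to a numeric predictor.

    B is assumed to be the risk allele. The additive model counts B alleles;
    the dominant and recessive models return 1 for the affected group, else 0.
    """
    b_count = genotype.count("B")      # 0 for AA, 1 for AB, 2 for BB
    if model == "additive":
        return b_count
    if model == "dominant":            # AB or BB is affected
        return 1 if b_count >= 1 else 0
    if model == "recessive":           # only BB is affected
        return 1 if b_count == 2 else 0
    raise ValueError(f"unknown genetic model: {model}")


genotypes = ["AA", "AB", "BB"]
encoded = {
    model: [encode_genotype(g, model) for g in genotypes]
    for model in ("additive", "dominant", "recessive")
}
```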

There Should Be Evidence of a Genetic Contribution to the Phenotype: Heritability is the extent of trait variation due to genetic factors. Pividori et al8 estimated the heritability of childhood-onset and adult-onset asthma to be 0.33 and 0.10, respectively. The remaining fraction of the disease variation is expected to be due to environmental factors. To observe a genetic association with a trait, the trait must be heritable. Although having a heritable trait does not guarantee that genetic associations will be observed, it is prudent to establish that the trait is heritable prior to performing GWAS. For many complex diseases, SNPs explain a relatively small portion of the overall disease heritability, the remainder of which is often referred to as “missing heritability” and is well described by Manolio et al.9 As larger GWAS are conducted, additional variants with small effect sizes are being discovered, suggesting that a lack of statistical power may be the source of missing heritability, which has been supported by other studies as well.10,11 Although it may be surprising that studies with > 100,000 individuals would be underpowered to find all the contributing genetic variants, this is possible due to the potential for thousands of variants to contribute very small effects on the phenotype and the substantial multiple testing burden due to millions of statistical tests being performed. This may be further affected by phenotype heterogeneity, which can significantly influence statistical power, as described earlier. Lastly, with the exception of whole-genome sequencing, genotyping technologies do not genotype every SNP in the genome, potentially missing variants contributing to disease. Another possibility is that heritability methods overestimate genetic contribution. New bioinformatics approaches are being developed to try to gain more precise and accurate measures of heritability.12,13

Important Considerations When Analyzing and Evaluating Genomics Data

GWAS are now performed routinely and have led to a substantially improved understanding of disease etiology as well as a new appreciation for the complexity of many diseases. There are now many publicly available resources for genomics data. The National Institutes of Health database of Genotypes and Phenotypes is a searchable repository of downloadable genotype data for a wide range of diseases (https://www.ncbi.nlm.nih.gov/gap/). The Cancer Genome Atlas is a cancer-focused resource that contains genomic data, in addition to multiple other omics data (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). Lastly, the UK Biobank Resource,14 which recruited 500,000 people with genetic data, biospecimens, and many other clinical and demographic measures, has provided a significant resource for conducting large GWAS. Information regarding the UK Biobank can be found at https://www.ukbiobank.ac.uk/.

A genome-wide association study of 35,735 patients with COPD and 222,076 control subjects from the UK Biobank identified > 80 genome-wide significant loci, including 35 loci not previously known to be associated with COPD.15 A recent study of 376,358 individuals without asthma and those with childhood-onset asthma, young adults with asthma, or adults with asthma from the UK Biobank discovered 61 independent asthma loci, with some loci shared across disease groups and some specific for childhood-onset or age of onset.8 In addition, approximately 27 identified loci seem to be novel. Although these studies represent some of the largest GWAS conducted to date for disease risk, many others have been conducted for treatment response.16-18 Several software tools are available with the capability to perform analyses like those mentioned earlier. PLINK is a widely used software tool that uses standard file formats and has the functionality to perform many of the analyses described here.19 Additional tools for analyzing sequencing data include the Broad Institute's GATK pipeline (https://software.broadinstitute.org/gatk/) and Illumina’s DRAGEN Bio-IT Platform.20

Impacts of Racial and Ethnic Heterogeneity

Although the UK Biobank8,15 is an incredibly valuable resource, it is composed mostly of individuals of European

ancestry and therefore does not adequately represent all ethnicities. SNPs have different allele frequencies in different populations. The importance of this can be seen in the study by Hirota et al,21 which found SNPs associated with asthma in a Japanese cohort, but only a single SNP they investigated was associated with asthma in a non-Japanese cohort. Further information about the challenges of genomics in multi-ethnic cohorts can be found in the article by Medina-Gomez et al.22 If a population of individuals has increased disease risk, SNPs overrepresented in that population, but unrelated to the disease, may appear enriched for the disease. One solution is to stratify the cohort according to racial or ethnic groups; however, it can also be advantageous to analyze an entire cohort to leverage a larger sample size. Importantly, even a cohort with the same self-reported race or ethnicity can be admixed due to nonrandom ancestral mating.23 Studies have been conducted to identify “ancestry informative markers,” or SNPs that display substantial frequency differences for specific populations, and these markers have been proposed as a way to better characterize populations included in a cohort.24-26

Two common approaches are used to account for population stratification: (1) principal component analysis27; and (2) mixed models. It is important to perform SNP pruning, which will filter out highly correlated SNPs due to LD. Principal component analysis clusters individuals, and the first few principal components can be incorporated into regression models (described later in the text) to adjust for population stratification.27 Mixed model approaches rely on the incorporation of a genetic relationship matrix into the regression model.28 The genetic relationship matrix consists of pairwise genetic relationship coefficients of each subject in the analysis. Inadequate control of population stratification can manifest in type I error rate inflation. Quantile-quantile (QQ) plots (described later in the text) can provide critical information regarding inflation and can highlight potential problems such as inadequate adjustment for population stratification.

Description of Subtypes of Study Design

SNPs occur at different allele frequencies in various populations, and SNPs with rare minor allele frequencies (MAF) pose a statistical challenge because there are often too few subjects carrying these alleles to be well powered. For this reason, different approaches are required to analyze these SNPs, resulting in two categories of genome-wide association study analyses: common variant analyses and rare variant analyses.

The Genome-Wide Association Model: Common Variants

The association of commonly occurring genetic variants with a phenotype is the heart of GWAS and is the typical analysis presented in the literature regarding GWAS. Examples of GWAS relevant for chest medicine have been published previously.8,15-17,21 Building the appropriate statistical model for GWAS depends on three main factors: (1) the type of dependent variable (ie, outcome or trait of interest); (2) the inclusion of covariates to limit confounding; and (3) the selection of an appropriate genetic model. If the outcome or trait is continuous, then a linear regression model is likely the best choice. However, for binary traits or for time-to-event data, logistic regression and Cox proportional hazards models, respectively, may be more appropriate. Detailed criteria for selecting a statistical model have been reported by Long and Freese.29

Software is also available to perform GWAS if you have a multivariate trait (eg, concentration response data).30 Many study designs contain aspects that could confound results if not appropriately accounted for in the statistical model. Examples may include sex, population stratification, medication usage, socioeconomic status, and many others. Including these in the model as covariates can be effective at preventing confounding; however, including unimportant or correlated variables can result in a loss of statistical power and unstable regression coefficients. For this reason, it is necessary to select an optimal set of variables, while minimizing model complexity. Multiple approaches exist for this task, and a common strategy is to exclude one variable from pairs of correlated variables. A more sophisticated approach that has gained favor is the least absolute shrinkage and selection operator,31 which aims to minimize the mean squared error (or some other measure of fit) while penalizing the absolute size of the model coefficients, which shrinks some of them to exactly zero. Selected features are those with nonzero coefficients.

Once the proper statistical model has been identified and features have been selected, the genetic model must be specified. The three classic genetic models are dominant, recessive, and additive, although there is no established mode of inheritance for complex diseases. Simulation studies indicate that the optimal approach may be to evaluate all three models together, but special consideration must be paid to account for multiple testing with correlated models.32 Table 3 shows the appropriate SNP encoding based on each of the three models.
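
For a quantitative trait under an additive genetic model, a single-SNP test amounts to regressing the trait on the allele count. The sketch below is a deliberately minimal, covariate-free version written with the Python standard library. It uses a normal approximation for the P value (reasonable only for large samples), and the function name and toy data are hypothetical. A real analysis would include covariates such as sex and principal components and would normally rely on dedicated software such as PLINK.

```python
import math
from statistics import NormalDist, mean


def single_snp_linear_test(dosages, trait):
    """Simple linear regression of a quantitative trait on allele count (0/1/2).

    Returns (beta, standard_error, two_sided_p). The P value uses a normal
    approximation to the t distribution, which is adequate only for large n.
    """
    n = len(dosages)
    x_bar, y_bar = mean(dosages), mean(trait)
    sxx = sum((x - x_bar) ** 2 for x in dosages)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, trait))
    beta = sxy / sxx                          # estimated effect per risk allele
    intercept = y_bar - beta * x_bar
    rss = sum((y - (intercept + beta * x)) ** 2 for x, y in zip(dosages, trait))
    se = math.sqrt(rss / (n - 2) / sxx)       # standard error of the slope
    z = beta / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return beta, se, p


# Hypothetical data: six individuals, additive encoding of one SNP.
dosages = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 1.6, 1.4, 2.1, 1.9]
beta, se, p_value = single_snp_linear_test(dosages, trait)
```

In a genome-wide analysis this test would be repeated for every SNP passing quality control, and each resulting P value would be compared against the genome-wide significance threshold.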



A model must then be created for each individual SNP, and the statistical test will be to evaluate the significance of the SNP with the trait while accounting for the other covariates in the model. SNPs with rare MAF should be excluded from this analysis due to a lack of statistical power and their ability to produce false-positive findings. MAF thresholds typically range from 1% to 5%, and some approaches have used inflation factors to determine the optimal thresholds for filtering rare variants.33 Due to the many statistical tests being performed, it is critical that multiple test corrections are applied, and as described earlier, the standard convention is to use a threshold of P < 5 × 10⁻⁸. Lastly, QQ plots should be developed to investigate the extent of the type I error rate inflation present in the results, which could point to potential confounding. QQ plots provide information regarding the distribution of observed test results relative to what would be expected by chance. Additional information about QQ plots has been provided by Voorman et al.34 Genomic control is often an effective approach to quantifying and adjusting results for inflation.35 The qqman R package is a popular tool to easily create QQ plots and Manhattan plots.36

The Genome-Wide Association Model: Rare Variants

Missing heritability may be partly due to large effects from rare disease-causing variants.9,37 However, discovering rare disease-causing variants is challenging because the rarity of the risk allele results in a small sample size that limits statistical power. Most studies use an MAF threshold of 1% to 5% to identify rare SNPs. When testing across the entire genome, it is likely that for some rare SNPs, the few individuals with the risk allele may end up all being in the disease group by chance, resulting in a false-positive result. Many approaches have been developed attempting to address this challenge, and two popular classes of methods are burden and variance component tests. Both of these involve collapsing rare variants from a region of interest (eg, a gene) into a single variable. This single variable can then be tested for association with a phenotype. Simple burden methods use an a priori selected indicator and a proportion or weighted approach to collapsing variants. These are often easily computed from the genotype data.38-40

A major limitation of simple burden tests, however, is that they cannot account for the direction (positive or negative association) of a rare variant effect.41 In other words, two variants in the same region with opposing effects will not cancel each other out. More sophisticated approaches, such as sequence kernel association tests (SKATs), allow for different effect directions and magnitudes for each variant.41 When variants have different directions of effects, SKATs are more powerful than burden approaches; however, they are less powerful when all variant effects are in the same direction. This led to the development of SKAT-O, an approach aiming to balance SKATs and burden tests.42 Additional approaches have been developed for performing SKAT in the context of family-based study designs (famSKAT),43 combining rare and common variants (SKAT-RC),44 and for multiple phenotypes (Multi-SKAT).45 Methods for rare variant analysis are an active area of research, and other approaches are also being considered, such as the Protein Structure Guided Local Test,46 which aims to improve associations by incorporating information from protein structure.

What Types of Analyses Can Be Done After Obtaining the Genome-Wide Association Study Results?

What Is the Difference Between Association and Prediction?

Although routinely conflated, association and prediction models have different requirements and serve different purposes. Association models determine which factors explain the most variation in a trait and test for statistical significance to determine causal associations. Prediction models try to predict a future event. In genomics, association models are often used to determine etiology, whereas prediction models do not need to inform etiology and are sometimes lacking interpretability. A strong association between a feature and a trait does not necessarily indicate that feature will be able to accurately discriminate individuals, which is needed for prediction.47 Associations should be replicated in an independent dataset. Prediction models should be built within a cross-validation framework to prevent overfitting and then tested in an independent test set. Metrics to evaluate predictive performance include sensitivity, specificity, positive predictive value, and negative predictive value. Model performance across various thresholds is usually visualized by using a receiver-operating characteristic curve. An optimal threshold might then be selected by maximizing the sensitivity and/or specificity, based on the clinical need and an understanding of the cost of misclassification. The review by Shmueli48 provides additional information about this concept.

Polygenic Risk Scores Can Be Effective Approaches for Prediction

Complex diseases and traits are influenced by many genetic factors. A polygenic risk score (PRS) is designed to predict an outcome by incorporating information from multiple genetic variants. Constructing a PRS consists of taking SNPs above a certain statistical threshold in a genome-wide association study, and weighting them by their coefficient from the genome-wide association study regression models. Each subject receives a score based on the weighted sum of the risk alleles.49 It is also important that a PRS is evaluated in a cross-validation framework to evaluate predictive performance. A guide to performing PRS has been developed by Choi et al.50

Pathway Analysis Can Help Put Genomics Studies Into Biological Context

Results generated from genomics studies can be notoriously difficult to put into biological context. Pathway analysis aims to reduce complexity and improve interpretability of results. Pathway analysis has been used to identify potential pathways putatively involved in asthma susceptibility.51,52 Pathway analysis is an expansive topic, and the following is a brief overview of the major concepts and considerations with commonly used approaches. Early pathway analysis methods focused on gene expression. Pathways, or gene sets, consist of groups of genes with a coordinated effect on a biological function. Now pathways can consist of groups of metabolites, proteins, microRNAs, and other molecular or biochemical features. Numerous open source knowledge bases are available that contain pathway information, such as Gene Ontology53 and Kyoto Encyclopedia of Genes and Genomes,54 among many others.

Khatri et al55 describes three generations of pathway analyses. First-generation approaches take features such as statistically significant genes and then test for enrichment in a given pathway. A limitation of these approaches is that they rely on an arbitrary statistical threshold to determine which features should be evaluated. In addition, this approach favors large effects being tested for pathway impacts and has limited ability to detect multiple small coordinated effects on a pathway. Second-generation approaches (eg, gene set enrichment analysis56) are designed to address this limitation by ranking all features

an expansion of these approaches, and a detailed review and comparison can be found in Ihnatova et al.57 Simulated datasets have been created to properly evaluate which methods have the best statistical power.58

Genomics data pose an additional challenge for pathway analysis: knowledge bases contain information at the gene level, and optimal methods for integrating SNP-level data with gene sets are still evolving. Methods for aggregating causal SNPs while minimizing the influence of noncausal SNPs are needed to effectively determine the impacts on biological pathways. Several approaches for aggregating SNP-level data have been developed, such as gene set analysis-single nucleotide polymorphisms,59 adaptive sum of powered score,60 and hybrid set-based test.61 More recently, open-source software, eXploring Genomic Relations offers a user-friendly suite of tools to leverage prior biological knowledge and improve interpretability of genomics data.62 eXploring Genomic Relations does this by accepting differential expression, SNP, or expression quantitative trait loci and mapping to multiple knowledge bases, including gene ontologies, gene networks, gene/SNP annotations, and genomic annotations.

Additional Considerations Exist

The aim of the current article was to provide a brief overview of important bioinformatics considerations for genomics studies. However, this is not a fully comprehensive evaluation, and there are additional considerations that should be accounted for. For example, data-processing strategies for sequencing-based platforms differ from those of array-based platforms. These are highly platform specific and depend on the technology used for the study. Imputation may be necessary to gain information for variants not included on the genotyping platform. Various considerations such as the reference population ethnicity and quality metrics are important to obtain reliable imputed variants.63 In addition, family-based designs have not been discussed here, but it is important that family structure is accounted for when a cohort consists of related individuals, and additional software tools are available to incorporate pedigree information.64 It is likely that epistasis plays an important role in complex diseases, and approaches are also available that search for gene-gene interactions. These methods continue to evolve and have been reviewed by Niel
and testing if the ranks are enriched in a pathway.55 et al65 and Chatelain et al.66 To obtain additional
Third-generation methods are still evolving and use statistical power, genomics studies are routinely being
topologic information by incorporating the correlation meta-analyzed; a review on meta-analysis approaches
structure of the features into the pathway. There has been can be found in Evangelou and Ioannidis.67 As

S120 Supplement [ 158#1S CHEST JULY 2020 ]


technology continues to advance, analyzing multiple omics platforms in the same
individuals (multi-omics) is shedding new light on disease mechanisms.
Approaches to simultaneously analyzing multiple omics can be found by using
tools such as Data Integration Analysis for Biomarker Discovery Using Latent
Components.68,69

As the cost of data generation continues to decline, the importance of
bioinformatics is growing. Although many automated tools and pipelines make data
processing and analysis easily accessible to a wide range of investigators, it
remains important to incorporate bioinformatics expertise into the study design
and analysis to ensure best practices.

Short List of Questions to Guide the Reviewer

When reviewing a GWAS, consider commenting on the following:

1. Sample selection, description, and analysis. What considerations were made
   when selecting individuals to be included in the study? Was the sample size
   appropriate? Was the race and ethnicity of participants well described and
   justified? Were study covariates appropriately selected and described? Was
   sex appropriately accounted for?
2. The phenotype included in the study. Was the phenotype well-defined and
   described? Is there evidence that the phenotype is heritable? Could the
   phenotype represent multiple underlying disease etiologies? If so, was this
   addressed by the authors?
3. The statistical approach. Was the statistical model appropriate for the study
   design? Did the authors account for possible sources of confounding (eg, sex,
   demographics, age, batch effects)? Was it clearly defined whether the purpose
   of the analysis was for association (ie, determining causality) or prediction
   of an outcome? Was multiple test correction applied?
4. The quality control and processing of genomic data. Was quality control and
   processing of genomic data properly described? Were rare variants handled
   appropriately?
5. Validation of the findings. If the goal was to create a prediction model, was
   cross-validation performed to evaluate overfitting, and was the model tested
   in an independent cohort? If the goal was to perform an association analysis,
   were the associations replicated in an independent cohort or were they
   functionally validated in an appropriate model system?

Acknowledgments

Financial/nonfinancial disclosures: None declared.
Role of sponsors: The sponsor had no role in the design of the study, the
collection and analysis of the data, or the preparation of the manuscript.

References

1. Glossary of Genomics Terms. JAMA. 2013;309(14):1533-1535.
2. Andrade C. Multiple testing and protection against a type 1 (false positive) error using the Bonferroni and Hochberg corrections. Indian J Psychol Med. 2019;41(1):99.
3. Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B Stat Methodol. 2002;64(3):479-498.
4. Fadista J, Manning AK, Florez JC, Groop L. The (in) famous GWAS P-value threshold revisited and updated for low-frequency variants. Eur J Hum Genet. 2016;24(8):1202.
5. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124.
6. Moonesinghe R, Khoury MJ, Janssens ACJ. Most published research findings are false-but a little replication goes a long way. PLoS Med. 2007;4(2):e28.
7. Manchia M, Cullis J, Turecki G, Rouleau GA, Uher R, Alda M. The impact of phenotypic and genetic heterogeneity on results of genome wide association studies of complex diseases. PLoS One. 2013;8(10):e76295.
8. Pividori M, Schoettler N, Nicolae DL, Ober C, Im HK. Shared and distinct genetic risk factors for childhood-onset and adult-onset asthma: genome-wide and transcriptome-wide studies. Lancet Respir Med. 2019;7(6):509-522.
9. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747.
10. Kruglyak L. Quantitative genetics and the missing heritability problem. Bulletin of the Am Phys Soc. 2018;7:63.
11. Shirali M, Knott SA, Pong-Wong R, Navarro P, Haley CS. Haplotype heritability mapping method uncovers missing heritability of complex traits. Sci Rep. 2018;8(1):4982.
12. Young AI. Solving the missing heritability problem. PLoS Genet. 2019;15(6):e1008222.
13. Hou K, Burch KS, Majumdar A, et al. Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture. bioRxiv. 2019:526855.
14. Sudlow C, Gallacher J, Allen N, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779.
15. Sakornsakolpat P, Prokopenko D, Lamontagne M, et al. Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. Nat Genet. 2019;51(3):494-505.
16. Hernandez-Pacheco N, Farzan N, Francis B, et al. Genome-wide association study of inhaled corticosteroid response in admixed children with asthma. Clin Exp Allergy. 2019;49(6):789-798.
17. Dahlin A, Litonjua A, Irvin CG, et al. Genome-wide association study of leukotriene modifier response in asthma. Pharmacogenomics J. 2016;16(2):151.
18. Dahlin A, Litonjua A, Lima JJ, et al. Genome-wide association study identifies novel pharmacogenomic loci for therapeutic response to montelukast in asthma. PLoS One. 2015;10(6):e0129385.
19. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4(1):7.
20. Goyal A, Kwon HJ, Lee K, et al. Ultra-fast next generation human genome sequencing data processing using DRAGEN™ bio-IT processor for precision medicine. Open J Genet. 2017;7(1):9-19.

21. Hirota T, Takahashi A, Kubo M, et al. Genome-wide association study identifies three new susceptibility loci for adult asthma in the Japanese population. Nat Genet. 2011;43(9):893.
22. Medina-Gomez C, Felix JF, Estrada K, et al. Challenges in conducting genome-wide association studies in highly admixed multi-ethnic populations: the Generation R Study. Eur J Epidemiol. 2015;30(4):317-330.
23. Hellwege JN, Keaton JM, Giri A, Gao X, Velez Edwards DR, Edwards TL. Population stratification in genetic association studies. Curr Protocols Human Genet. 2017;95(1):1-22.
24. Halder I, Shriver M, Thomas M, Fernandez JR, Frudakis T. A panel of ancestry informative markers for estimating individual biogeographical ancestry and admixture from four continents: utility and applications. Hum Mutation. 2008;29(5):648-658.
25. Enoch MA, Shen PH, Xu K, Hodgkinson C, Goldman D. Using ancestry-informative markers to define populations and detect population stratification. J Psychopharmacol. 2006;20(suppl 4):19-26.
26. Galanter JM, Fernandez-Lopez JC, Gignoux CR, et al. Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas. PLoS Genet. 2012;8(3):e1002554.
27. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904-909.
28. Shin J, Lee C. A mixed model reduces spurious genetic associations produced by population stratification in genome-wide association studies. Genomics. 2015;105(4):191-196.
29. Long JS, Freese J. Regression Models for Categorical Dependent Variables Using Stata. College Station, TX: Stata Press; 2006.
30. Brown CC, Havener TM, Medina MW, et al. Multivariate methods and software for association mapping in dose-response genome-wide association studies. BioData Mining. 2012;5(1).
31. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267-288.
32. Lettre G, Lange C, Hirschhorn JN. Genetic model testing and statistical power in population-based association studies of quantitative traits. Genet Epidemiol. 2007;31(4):358-362.
33. Marvel SW, Rotroff DM, Wagner MJ, et al. Common and rare genetic markers of lipid variation in subjects with type 2 diabetes from the ACCORD clinical trial. PeerJ. 2017;5:e3187.
34. Voorman A, Lumley T, McKnight B, Rice K. Behavior of QQ-plots and genomic control in studies of gene-environment interaction. PLoS One. 2011;6(5):e19416.
35. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997-1004.
36. Turner SD. qqman: an R package for visualizing GWAS results using QQ and Manhattan plots. bioRxiv. 2014:005165.
37. Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2012;13(2):135.
38. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34(2):188.
39. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615(1):28-56.
40. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311-321.
41. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82-93.
42. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762-775.
43. Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol. 2013;37(2):196-204.
44. Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet. 2013;92(6):841-853.
45. Dutta D, Scott L, Boehnke M, Lee S. Multi-SKAT: general framework to test for rare-variant association with multiple phenotypes. Genet Epidemiol. 2019;43(1):4-23.
46. Marceau West R, Lu W, Rotroff DM, et al. Identifying individual risk rare variants using protein structure guided local tests (POINT). PLoS Comput Biol. 2019;15(2).
47. McLeod HL, Mariam A, Schveder KA, Rotroff DM. Assessment of adverse events and their ability to discriminate response to anti-PD-1/PD-L1 antibody immunotherapy. J Clin Oncol. 2019:JCO1901712.
48. Shmueli G. To explain or to predict? Statistical Sci. 2010;25(3):289-310.
49. Simonson MA, Wills AG, Keller MC, McQueen MB. Recent methods for polygenic analysis of genome-wide data implicate an important effect of common variants on cardiovascular disease risk. BMC Medical Genet. 2011;12(1):146.
50. Choi SW, Mak TSH, O’Reilly P. A guide to performing polygenic risk score analyses. bioRxiv. 2018:416545.
51. Song GG, Lee YH. Pathway analysis of genome-wide association study on asthma. Hum Immunol. 2013;74(2):256-260.
52. Ding L, Abebe T, Beyene J, et al. Rank-based genome-wide analysis reveals the association of ryanodine receptor-2 gene variants with childhood asthma among human populations. Hum Genomics. 2013;7(1):16.
53. Gene Ontology Consortium. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 2018;47(D1):D330-D338.
54. Kanehisa M, Sato Y, Furumichi M, Morishima K, Tanabe M. New approach for understanding genome variations in KEGG. Nucleic Acids Res. 2018;47(D1):D590-D595.
55. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.
56. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545-15550.
57. Ihnatova I, Popovici V, Budinska E. A critical comparison of topology-based pathway analysis methods. PLoS One. 2018;13(1):e0191154.
58. Mathur R, Rotroff D, Ma J, Shojaie A, Motsinger-Reif A. Gene set analysis methods: a systematic comparison. BioData Mining. 2018;11(1):8.
59. Nam D, Kim J, Kim SY, Kim S. GSA-SNP: a general approach for gene set analysis of polymorphisms. Nucleic Acids Res. 2010;38(suppl 2):W749-W754.
60. Pan W, Kwak IY, Wei P. A powerful pathway-based adaptive test for genetic association with common or rare variants. Am J Hum Genet. 2015;97(1):86-98.
61. Li MX, Kwan JS, Sham PC. HYST: a hybrid set-based test for genome-wide association studies, with application to protein-protein interaction-based association analysis. Am J Hum Genet. 2012;91(3):478-488.
62. Fang H, Knezevic B, Burnham KL, Knight JC. XGR software for enhanced interpretation of genomic summary data, illustrated by application to immunological traits. Genome Med. 2016;8(1):129.
63. Shriner D, Adeyemo A, Chen G, Rotimi CN. Practical considerations for imputation of untyped markers in admixed populations. Genet Epidemiol. 2010;34(3):258-265.
64. Ott J, Kamatani Y, Lathrop M. Family-based designs for genome-wide association studies. Nat Rev Genet. 2011;12(7):465-474.
65. Niel C, Sinoquet C, Dina C, Rocheleau G. A survey about methods dedicated to epistasis detection. Frontiers Genet. 2015;6:285.

66. Chatelain C, Durand G, Thuillier V, Augé F. Performance of epistasis detection methods in semi-simulated GWAS. BMC Bioinformatics. 2018;19(1):231.
67. Evangelou E, Ioannidis JPA. Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet. 2013;14(6):379-389.
68. Rohart F, Gautier B, Singh A, Lê Cao KA. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13(11):e1005752.
69. Singh A, Shannon CP, Gautier B, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35(17):3055-3062.
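
As a hedged illustration of the polygenic risk score (PRS) calculation described in the main text, the following Python sketch keeps the SNPs whose GWAS P value passes a threshold and scores each subject as the weighted sum of risk-allele counts, using the GWAS regression coefficients as weights. The SNP identifiers, coefficients, P values, genotypes, the 5e-8 threshold, and the function names are all hypothetical and chosen only for illustration. This is a minimal sketch, not the author's implementation, and a real analysis would still need the quality control, cross-validation, and independent-cohort validation steps discussed in the article.

```python
from typing import Dict, Tuple


def select_weights(
    gwas: Dict[str, Tuple[float, float]], p_threshold: float
) -> Dict[str, float]:
    """Keep SNPs whose GWAS P value is below the threshold; map SNP -> coefficient."""
    return {snp: beta for snp, (beta, p) in gwas.items() if p < p_threshold}


def polygenic_risk_score(
    allele_counts: Dict[str, int], weights: Dict[str, float]
) -> float:
    """Weighted sum of risk-allele counts (0, 1, or 2) over the selected SNPs.

    Simplifying assumption: a SNP missing from allele_counts contributes 0.
    """
    return sum(beta * allele_counts.get(snp, 0) for snp, beta in weights.items())


# Hypothetical GWAS summary statistics: SNP -> (regression coefficient, P value).
gwas = {"rs1": (0.20, 1e-9), "rs2": (0.10, 3e-8), "rs3": (0.50, 0.2)}
weights = select_weights(gwas, p_threshold=5e-8)  # rs3 does not pass

# Hypothetical genotypes: risk-allele counts per subject.
subjects = {
    "subject_A": {"rs1": 2, "rs2": 1, "rs3": 0},
    "subject_B": {"rs1": 0, "rs2": 2, "rs3": 2},
}
scores = {sid: polygenic_risk_score(g, weights) for sid, g in subjects.items()}
for sid, score in scores.items():
    print(f"{sid}: PRS = {score:.2f}")
```

In this toy example rs3 is excluded because its P value does not pass the threshold, so the two rs3 risk alleles carried by subject_B do not contribute to that subject's score.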
