Young2019 - Deconstructing The Sources of Genotype-Phenotype Associations in Humans
1 Big Data Institute, Li Ka Shing Centre for Health Information Discovery, University of Oxford, Oxford, UK. 2 Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK. 3 Department of Biological Sciences, Columbia University, New York, NY, USA. 4 Department of Systems Biology, Columbia University, New York, NY, USA.
*Corresponding author. Email: alextisyoung@gmail.com (A.I.Y.); mp3284@columbia.edu (M.P.); augustine.kong@bdi.ox.ac.uk (A.K.)

‘For many complex traits, GWAS has changed the landscape of genetic investigations and our understanding of genetic architectures’

Not long ago, genetic analyses were performed using trait values (phenotypes) in families, without genetic data. The discovery of readily measurable genomic markers enabled the identification of disease genes by linkage analysis, without prior knowledge of the underlying mechanisms (1). This led to the identification of the gene responsible for the X-linked phagocytic disorder chronic granulomatous disease in 1986, followed by those for other Mendelian diseases such as cystic fibrosis (2) and Huntington’s disease (3) as well as the breast cancer genes (4, 5). This approach was also applied to the study of common complex diseases, including type 2 diabetes, but failed to provide replicable findings.

The second major development came with high-throughput single-nucleotide polymorphism (SNP) arrays, which allowed for the genotyping of hundreds of thousands of SNPs simultaneously, giving rise to the genome-wide association study (GWAS) (6). A GWAS tests each SNP for association with the phenotype, without family data. The success of GWAS started with the discovery that CFH contributes to age-related macular degeneration; that analysis was based on 96 cases and 50 controls (7). Subsequent increases in sample size, with some now more than 2 million (8), have led to the discovery of thousands of genetic variants affecting hundreds of human traits. Results from GWAS hold the promise of identifying novel drug targets (9, 10), among other applications.

The power of a GWAS to identify a trait-affecting SNP depends on the fraction of trait variation explained by the SNP, which increases in proportion to the square of the effect size and the heterozygosity. Because heterozygosity is higher for more common variants, initial successes were mainly for susceptibility variants with a minor allele frequency above 5%. Even if a common variant is not directly analyzed, it is likely to be strongly correlated with a genotyped SNP nearby, because of the lack of ancestral recombination events between them. This correlation is termed “local” linkage disequilibrium (LD). Non-local LD (correlations between variants that are not physically close) can result from nonrandom mating. As a result of local LD, GWAS usually does not directly identify the specific causal variant, but only localizes its approximate genomic position. Fine-scale mapping, which often requires functional analysis and experimentation, is needed to identify causal variants (11).

The majority of common variants found by GWAS to affect disease risk have low to modest effects (increasing the odds of disease by less than a factor of 1.5 per risk allele) (12, 13). Application of GWAS to whole-exome and whole-genome sequencing, along with statistical imputation of sequence-level variants into samples genotyped by SNP arrays, has led to the discovery of some rarer variants with large effects (14). Although the trait variance explained by genome-wide significant (GWS) loci has increased, for most complex traits, the variance explained by GWS loci is only a fraction of the estimated heritability. This gap,

mean that bias is eliminated, nor the nature of genotype-phenotype associations properly characterized. We aim to lay out here the different contributions to genotype-phenotype association, explain the difficulties they introduce, and propose possible solutions.

Effects captured by GWAS associations
The association between a genetic variant and a phenotype can be decomposed into the direct effect of the variant, the indirect genetic effect of the variant, and confounding effects (Fig. 1). An example would be a variant that has a direct effect on educational attainment (EA) when inherited, as well as an indirect effect through parental behavior/nurture (20). The same variant could also have an indirect effect on health through parental nurture, but little to no direct effect. Direct effects incorporate a wide range of causal pathways, some neither simple nor “direct”; for example, variants in CHRNA5 affect lung cancer risk through their association with smoking quantity (21). Furthermore, the direct effect here can include effects of other variants in local LD. Note that typical GWAS conducted without family data can only estimate the sum of the direct and indirect effects (combined effect), not the two separately.

Under an additive model for the joint effects of variants, we define a genetic component as a linear combination of the genotypes of all the causal variants with weights proportional to the true (direct, indirect, or combined) effects (Fig. 2). The genetic components for direct effect and indirect effect are distinct, but they can be correlated with a strength that depends on the genetic correlation between the proband phenotype of interest and the phenotypes of the relatives through which the indirect effects are mediated. As an example, this correlation is probably strong for EA and weak for body mass index (BMI) (20). The relative strengths of these two genetic components and their correlation determine the correlations with the combined-effect genetic
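
The definition of a genetic component given in this section (a linear combination of causal-variant genotypes weighted by direct, indirect, or combined effects) can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration: the sample size, the number of variants, the allele frequencies, the hypothetical effect vectors `direct` and `indirect`, and their correlation `rho`. Variants are simulated as independent, which ignores LD.

```python
import numpy as np

def genetic_components(genotypes, direct, indirect):
    """Return the direct, indirect and combined genetic components.

    genotypes: (n_individuals, n_variants) array of standardized genotypes.
    direct, indirect: per-variant effect vectors (hypothetical true effects).
    """
    g_direct = genotypes @ direct
    g_indirect = genotypes @ indirect
    # Under the additive model the combined effect is the sum of the two.
    g_combined = genotypes @ (direct + indirect)
    return g_direct, g_indirect, g_combined

rng = np.random.default_rng(0)
n, m = 5000, 200                      # assumed sample size and variant count
freqs = rng.uniform(0.05, 0.5, size=m)
raw = rng.binomial(2, freqs, size=(n, m)).astype(float)
genotypes = (raw - raw.mean(axis=0)) / raw.std(axis=0)

rho = 0.6                             # assumed correlation between effect vectors
z1, z2 = rng.normal(size=m), rng.normal(size=m)
direct = z1
indirect = 0.5 * (rho * z1 + np.sqrt(1 - rho**2) * z2)

g_direct, g_indirect, g_combined = genetic_components(genotypes, direct, indirect)

corr_components = np.corrcoef(g_direct, g_indirect)[0, 1]
corr_effects = direct @ indirect / (np.linalg.norm(direct) * np.linalg.norm(indirect))
corr_direct_combined = np.corrcoef(g_direct, g_combined)[0, 1]

print(f"corr(direct, indirect components): {corr_components:.3f}")
print(f"cosine of effect vectors:          {corr_effects:.3f}")
print(f"corr(direct, combined components): {corr_direct_combined:.3f}")
```

With independent standardized variants, the correlation between the direct and indirect components tracks the similarity of the two effect vectors, and the correlation of each with the combined component depends on both their relative magnitudes and that similarity.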
mated for many pairs of traits using GWAS data (22). By separating out direct- and indirect-effect components, the model (Fig. 2, bottom) has 10 parameters, including the magnitudes of four direct- and indirect-effect genetic components, and six correlations. The full model cannot be estimated using standard GWAS, so we currently have little understanding of the extent to which

in GWAS: (i) environmental confounding, where allele frequencies and environmental effects vary in a correlated way across different geographic regions or subpopulations; (ii) genetic confounding, when allele frequency differences between subpopulations correlate with frequency differences of other alleles with causal effects; or (iii) assortative-mating confounding, which occurs

stratification by modeling the effects of (nearly) all measured SNPs, capturing both real genetic effects and stratification effects (27). Furthermore, LMM methods can lead to improved estimation of SNP effects and their sampling errors over linear regression in the presence of sample relatedness (27). LMMs can also reduce bias in SNP effect estimates due to assortative mating (30).
However, current LMM GWAS methods do not remove the contribution of indirect genetic effects.

Using family genotype data
Given the parental genotypes, an offspring’s genotype is determined by random segregation of genetic material during meiosis. This random segregation is uncorrelated with indirect genetic effects from relatives and other confounding effects. Parental genotypes can thus be used as controls to obtain unbiased estimates of direct genetic effects (20, 31) (Fig. 1). Similarly, genetic differences between siblings are a result of random Mendelian segregation in the parents during meiosis. The genetic differences between siblings are therefore not confounded with indirect genetic effects from parents, population stratification, and assortative mating. However, methods using the differences in sibling genotypes estimate the direct effect minus the indirect effect from the sibling, and hence only provide unbiased direct-effect estimates when

cannot give the full picture (32). Ideally, GWAS should be performed with parental or sibling genotypes as controls and using models with indirect genetic effects. However, the power of this approach is currently limited because large samples with genotyped siblings and/or parents are uncommon. Furthermore, as only around half of the genetic variation in a population is within-family, substantially larger samples of families are required to obtain the same study power as standard GWAS analysis. Therefore, methods combining information from standard GWAS and from analysis of families are needed.

Heritability
Traditionally, heritability has been estimated from comparing correlations between identical and non-identical twins. In addition to identifying specific causal loci, it is possible to use GWAS data to estimate the phenotypic variation explained by the genetic variation captured by the SNPs (and

Box 1. Glossary.
Assortative mating: When couples that produce offspring select one another on the basis of particular phenotypes.
Fine-scale mapping: Refers to approaches that aim to identify which variant or variants are likely to be causal among the set of associated variants identified in a GWAS.
Heritability: Measures the proportion of phenotypic variation explained by the direct effects of all genetic variants in a population at a given time.
Heterozygosity: The probability that two alleles at a site differ; assuming Hardy-Weinberg equilibrium and
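
The idea under "Using family genotype data", that parental genotypes can serve as controls, can be sketched with a small simulation. This is an illustrative toy, not the method of the cited studies. It assumes a single variant, random mating, no population stratification, equal indirect effects from both parents, and hypothetical values for the allele frequency `p`, the direct effect `delta`, and the indirect effect `eta`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000                 # number of simulated trios (assumed)
p = 0.3                     # allele frequency (assumed)
delta, eta = 0.10, 0.05     # hypothetical direct and indirect effects

def simulate_parent():
    """Return a parent's genotype and the allele transmitted at meiosis."""
    alleles = rng.binomial(1, p, size=(n, 2))
    pick = rng.integers(0, 2, size=n)          # random segregation
    transmitted = alleles[np.arange(n), pick]
    return alleles.sum(axis=1), transmitted

g_father, t_father = simulate_parent()
g_mother, t_mother = simulate_parent()
g_child = t_father + t_mother
g_parents = g_father + g_mother

# Phenotype: direct effect of the child's own genotype plus an indirect
# effect of the parental genotypes (for example through nurture).
y = delta * g_child + eta * g_parents + rng.normal(size=n)

def ols_slopes(columns, outcome):
    """Least-squares slopes (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(outcome))] + list(columns))
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1:]

naive = ols_slopes([g_child], y)[0]                  # standard GWAS-style estimate
controlled = ols_slopes([g_child, g_parents], y)[0]  # parental genotypes as controls

print(f"without family data: {naive:.3f} (direct + indirect = {delta + eta:.2f})")
print(f"with parental controls: {controlled:.3f} (direct = {delta:.2f})")
```

Under these assumptions the regression on the offspring genotype alone recovers the combined effect, while adding the parental genotypes isolates the direct effect. The controlled estimate is noisier because only the within-family part of the genotype variation contributes to it.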
the genetic variants affect trait B only through their effect on trait A, and that the genetic variants are not correlated with any confounding factors. MR has proven successful in refuting false causal hypotheses derived from observational data, such as the association between HDL cholesterol levels and cardiovascular disease (43) and the reduced risk of cardiovascular disease in moderate drinkers in Western societies (44).

MR usually relies on SNP effect estimates from GWAS without families, which can be biased by population stratification, indirect genetic effects from relatives, and assortative mating (45). Within-family MR methods have been proposed to address these concerns and have shown that previous MR estimates of causal effects of height and BMI on EA were spurious (45).

A further challenge for MR analyses is widespread pleiotropy: If a SNP affects trait B through a trait other than trait A, then it is not a valid instrument for inference of the causal effect of trait A on trait B. Although methods have been developed to address this problem, their effectiveness can depend on prior knowledge about the confounding pathway (46).

Gene-by-environment interactions
A gene-environment (GxE) interaction occurs when a genetic variant’s effect on a trait differs in

[Figure 3 plot. x axis: Principal components (0 to 40); y axis ticks from 0 to 15; legend entry: Replication (between chromosomes).]

Fig. 3. Behavior of principal components of 272,519 UK Biobank samples. We investigate the degree to which principal components are capturing real population structure by examining whether the genetic variance (eigenvalues) explained by the top 40 principal components inferred from 146,082 SNPs in 272,519 UK Biobank White British (WB) samples replicates in an independent sample of WB. A replication eigenvalue above 1 indicates that the inferred principal component is capturing replicable correlations between SNPs, either local LD (within a chromosome) or population structure (mostly between chromosomes). Original (black circles): eigenvalues of the principal components in the original set of 272,519 WB individuals. Replication (blue triangles): eigenvalue equivalents, i.e., variances of the linear combinations of SNP genotypes using weights inferred from the original set and standardized genotypes in the replication set of 64,969 WB individuals. Replication (between chromosomes only) (red crosses): using the same replication set, but eigenvalue equivalents computed by ignoring the covariances of SNP pairs within the same chromosomes, and counting only the covariances of SNP pairs on different chromosomes, which includes 94.8% of all SNP pairs. The average eigenvalue for the last 32 PCs decreases from 4.37 for the original set to 2.61 for the replication set and further to 1.03 for the between-chromosome set, indicating those PCs are mostly capturing noise and local LD rather than population structure.
Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of
Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.
Copyright © 2019 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of
Science. No claim to original U.S. Government Works