Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

BMC Supplements

Region-Based Interaction Detection in Genome-Wide Case-Control Studies


--Manuscript Draft--

Manuscript Number: SUPP-D-18-00064

Full Title: Region-Based Interaction Detection in Genome-Wide Case-Control Studies

Article Type: BMC Supplements Reviewed

Abstract: In genome-wide association study (GWAS), conventional interaction detection


methods such as BOOST are mostly based on SNP-SNP interactions. Although single
nucleotides are the building blocks of human genome, single nucleotide
polymorphisms (SNPs) are not necessarily the smallest functional unit for complex
phenotypes. Region-based strategies have been proved to be successful in studies
aiming at marginal effects. In this paper, we propose a novel region-region interaction
detection method named RRIntCC (region-region interaction detection for case-control
studies). RRIntCC uses the correlations between individual SNP-SNP interactions
based on linkage disequilibrium (LD) contrast test. Simulation experiments showed that
our method can achieve a higher power than conventional SNP-based methods with
similar type-I-error rates. When applied to two real datasets, RRIntCC was able to find
several significant regions, while BOOST failed to identify any significant results. The
source code and the sample data of RRIntCC are available at
http://bioinformatics.ust.hk/RRIntCC.html.

Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation
Pdf of Manuscript Click here to access/download;Manuscript;RRIntCC_revised.pdf

Click here to view linked References


1
2
3
4
5
6
7
8 Region-Based Interaction Detection in
9
10 Genome-Wide Case-Control Studies
11
12
13 Sen Zhang1 , Wei Jiang2 , Ronald CW Ma3 , and Weichuan Yu2( )
14 1
15 Department of Chemical and Biological Engineering, The Hong Kong University of
16 Science and Technology, Kowloon, Hong Kong, China
szhangat@connect.ust.hk
17 2
Department of Electronic and Computer Engineering, The Hong Kong University
18
of Science and Technology, Kowloon, Hong Kong, China
19
wjiangaa@connect.ust.hk
20
eeyu@ust.hk
21 3
Department of Medicine and Therapeutics, The Chinese University of Hong Kong,
22 Shatin, Hong Kong, China
23 rcwma@cuhk.edu.hk
24
25
26
27 Abstract. In genome-wide association study (GWAS), conventional in-
28 teraction detection methods such as BOOST are mostly based on SNP-
29 SNP interactions. Although single nucleotides are the building blocks of
30 human genome, single nucleotide polymorphisms (SNPs) are not neces-
31 sarily the smallest functional unit for complex phenotypes. Region-based
32 strategies have been proved to be successful in studies aiming at marginal
33 effects. In this paper, we propose a novel region-region interaction de-
34 tection method named RRIntCC (region-region interaction detection for
case-control studies). RRIntCC uses the correlations between individ-
35
ual SNP-SNP interactions based on linkage disequilibrium (LD) con-
36
trast test. Simulation experiments showed that our method can achieve
37
a higher power than conventional SNP-based methods with similar type-
38
I-error rates. When applied to two real datasets, RRIntCC was able to
39 find several significant regions, while BOOST failed to identify any sig-
40 nificant results. The source code and the sample data of RRIntCC are
41 available at http://bioinformatics.ust.hk/RRIntCC.html.
42
43 Keywords: GWAS · Statistical interaction detection · Region-based
44 method · LD contrast test
45
46
47 1 Introduction
48
49 Genome-wide association study (GWAS) has served as an important tool to in-
50 vestigate the relationship between genomic variants and human traits [1]. The
51 genetic variants investigated in GWAS are mainly single nucleotide polymor-
52 phisms (SNPs). SNPs are single nucleotide variants whose genotypes are not
53
fixed in the population and exhibit diversities among different individuals. Most
54
GWAS analysis protocols follow the single-locus test procedures aimed at de-
55
56 tecting the marginal effects of SNPs [2][3]. However, it’s well recognized that
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 2 Zhang et al.
7
8 genetic variants work synergistically through certain pathogenic pathways [4].
9 The interactions among SNPs are not guaranteed to be discovered by marginal
10 effect detection, especially for SNPs with weak marginal effects but strong inter-
11 action effects [5]. Many methods have been developed to address this problem
12 [4][6], including PLINK [7], BOOST [5], MDR [8], ReliefF [9], BEAM [10], and
13
LD contrast test [11].
14
An important problem in SNP-SNP interaction detection is the stringent
15
16 threshold when considering multiple testing correction. For marginal effect de-
17 tection, a SNP can only be considered as significant when its corresponding
18 p-value is at the order of 10−8 (assuming we use Bonferroni correction). In SNP-
19 SNP interaction detection, the threshold has to go down further to the order of
20 10−14 . As a result, interactions with weak or moderate effect sizes might remain
21 undiscovered.
22 In this paper, we proposed a region-based interaction detection method to
23 address this problem. Region-based methods have been successful in marginal
24 effect detection [12][13]. The basic idea is to group the effects of nearby SNPs
25 together and test their aggregation rather than investigating the elements sepa-
26 rately. The benefit is two-folds: Firstly, the size and the number of regions are
27 controllable. We can achieve the balance between the resolution of the results
28 and the statistical significance threshold after Bonferroni correction. Secondly,
29 the effect size might be enhanced by taking the whole region into account. SNPs
30 are the basic genomic units. Neverthless, they are not necessarily the functional
31 units of diseases. Different SNP mutations in a gene can all lead to changes of
32
protein functions. Therefore, grouping different SNPs together provides a possi-
33
ble alternative.
34
35 To group different SNP-SNP pairs together, the key is to quantitatively mea-
36 sure and account for the relationships between different SNP-SNP pairs. To the
37 best of our knowledge, no existing method is available to test region-region in-
38 teractions for case-control studies, where we only have two groups of people:
39 healthy people (controls) and people with the investigated disease (cases). Al-
40 though Ma et al. [14] proposed a region-based interaction detection method to
41 analyze continuous traits based on the linear regression model, it is not easy to
42 extend their method to the case-control setting due to the difficulty of deriv-
43 ing the covariances of test statistics under the logistic regression model that is
44 commonly used in case-control studies. In this paper, we use the LD contrast
45 test method instead of the logistic regression in interaction detection. We derive
46 the correlation coeffcients of the correpsonding SNP-SNP interaction test statis-
47 tics. Then we further extend region-based methods to the case-control setting
48 by accounting for the covariances between SNP-based test statistics. We name
49 this method RRIntCC (region-region interaction detection for case-control stud-
50 ies). Experiment results illustrate that RRIntCC achieves a higher power than
51
conventional SNP-SNP interaction detection methods at the same type-I-error
52
rate.
53
54
55
56
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 Region-Based Interaction Detection in Case-Control Studies 3
7
8 2 Methodology
9
10 Here we propose a novel region-based interaction detection method for genome-
11
wide case-control studies that utilizes SNP-based interaction test statistics and
12
their covariances. LD contrast test is adopted to measure SNP-based interaction
13
14 effects. We derive the covariance of LD contrast test statistics, which enables a
15 robust aggregation of SNP-SNP interactions within a region pair. The determi-
16 nation of regions comes from gene definitions or BOOST results.
17
18
2.1 Genomic Data Formats
19
20 There are two alleles for almost every base pair (bp) position in the human
21
genome, one from the maternal chromosome and the other from the paternal
22
chromosome. A combination of the two alleles is denoted as a genotype of this
23
24 bp position. SNPs are defined as the base pairs that could exhibit different geno-
25 type values in different individuals. Normally a SNP only has two possible allele
26 values in the population, one major allele with a higher probability (denoted
27 as B), and one minor allele (denoted as b). Correspondingly, there exist three
28 genoytpes for a typical SNP, i.e., BB, Bb and bb, where Bb is called a hetero-
29 geneous genotype and the rest two are called homogeneous genotypes. GWAS
30 uses microarrays to generate SNP genotype data. In SNP data analysis, we use
31 0/1/2, 0/1/1, and 0/0/1 for BB/Bb/bb as the encoding scheme for additive,
32 dominant, and recessive genetic models, respectively. A more flexible strategy is
33 to estimate the effects of three genotypes independently, at the price of an in-
34 creased degree of freedom. Allele data could also be used for analysis, with 0/1 as
35 the numerical values of major/minor alleles. However, statistical inference needs
36 to be performed in advance to retrieve allele information from original genotype
37 data, which is called haplotype phasing in the GWAS community. In this paper,
38 we focus on the analysis of genotype data.
39
40
41 2.2 LD Contrast Test for SNP Interaction Detection
42
43 Current interaction detection methods are mainly based on the deviation from
44 additive effect by assuming a linear or logistic regression model. Nevertheless,
45 this approach is not necessarily the most powerful method due to the uncertainty
46 of underpinning biochemical mechanisms. Linkage disequilibrium (LD) contrast
47 test provides another valuable perspective to investigate this problem. Empirical
48 studies have shown that LD contrast test can achieve higher power than logistic
49 regression under certain disease models for case-control studies [6]. In this paper,
50 LD contrast test is adopted to generate SNP-based interaction test statistics
51
because of its clear statistical meaning and mathematical simplicity.
52
LD represents the statistical association between two genetic loci with allele
53
54 values, defined as the deviation from the independence of two SNPs (A and B)
55
56 LD = p(A, B) − p(A)p(B). (1)
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 4 Zhang et al.
7
8 To avoid the ambiguity caused by haplotype phasing, composite LD (CLD)
9 which only requires genotype data is commonly used to approximate LD. CLD
10 is defined as [15]:
11
12 CLD = pAB + p0AB − 2p(A)p(B)
13  AB
pAB = PAB + 12 (PAb
AB AB
+ PaB AB
+ Pab )
14 with 0 AB 1 AB AB aB , (2)
pAB = PAB + 2 (PAb + PaB + PAb )
15
16 where the subscript and the superscript represent two gametes that are passed
17 to offsprings and P denotes the probability of the specific gamete combination.
18 CLD could be regarded as a simplified version of phasing to facilitate the analysis
19 based on genotype data. The statistical properties of CLD have been well studied
20 [16][17]. One important fact is that CLD corresponds to the sample correlation
21 coefficient r̂ of genotype values under the additive model,
22
23 CLD CLD
r̂genotype = p p ≈p p , (3)
24 p(1 − p) + DA q(1 − q) + DB p(1 − p) q(1 − q)
25
26 where p = p(A), q = p(B), DA and DB represent Hardy-Weinberg disequilibri-
27 ums, i.e. DA = pAA − p2 (A), DB = pBB − p2 (B). DA and DB are nearly 0 in
28 GWAS datasets after quality control.
29 A similar result holds for the original LD and allele values,
30
31 LD
r̂allele = p p . (4)
32 p(1 − p) q(1 − q)
33
34 Therefore, CLD could also be viewed as an approximation of LD by using the
35 correlation coefficient of 0/1/2 genotype data under the addtive model to replace
36 that of 0/1 allele values, at the price of implicitly conducting phasing with equal
37 probabilities for two-allele combinations.
38 Suppose two SNPs work synergistically to contribute to the same pathways,
39 they are less likely to be separated during recombination and will be inherited
40 together to offsprings in the case group. As a result, the SNP-SNP pattern
41 should be different between patients and healthy people. Therefore, checking
42 the difference of LD patterns between cases and controls provides an alternative
43 way to detect interaction. LD contrast test was proposed to statistically test this
44 difference [11]. The test statistic based on CLD has the following form:
45
46 ˆ case
(CLD ˆ control )2
AB − CLD AB
47 χ2 = , (5)
ˆ case
V ar(CLD ) + V ar( ˆ
CLD
control
)
48 AB AB
49 which follows a 1-df χ2 distribution under the null hypothesis that there is no
50
LD difference between cases and controls.
51
52
53 2.3 Covariance between SNP Interactions
54
The key issue in the aggregation of individual SNP-SNP interaction effects is
55
56 the correction of inflated effect sizes caused by the correlations among individual
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 Region-Based Interaction Detection in Case-Control Studies 5
7
8 test statistics. The fact that LD is actually the sample covariance of two SNPs
9 is leveraged to derive the correlation coefficients of LD contrast test statistics.
10 Suppose two SNP pairs (X, Y ) and (U, V ) have interactions with contrast
11 LDs 
12 ∆LDXY = cov(X,
ˆ Y |case) − cov(X,
ˆ Y |control)
13 . (6)
∆LDU V = cov(U,
ˆ V |case) − cov(U,
ˆ V |control)
14
The corresponding LD contrast test statistics read:
15
16 ∆LDˆXY ∆LDˆU V
17 TXY = q and TU V = q . (7)
18 V ar(∆LDˆXY ) V ar(∆LDˆU V )
19 The covariance of the two test statistics reads:
20
21 cov(∆LDˆXY , ∆LDˆU V )
cov(TXY , TU V ) ≈ q . (8)
22 cov(∆LDˆXY , ∆LDˆXY )cov(∆LDˆU V , ∆LDˆU V )
23
24 In GWAS, it’s commonly assumed that population samples are independent.
25 Under this assumption, we can derive the following theorems.
26 Theorem 1. The covariance of contrast LDs can be decomposed into compo-
27
nents from cases and controls separately,
28
29 cov(∆LDXY , ∆LDU V ) = cov[cov(X,
ˆ Y |case), cov(U,
ˆ V |case)]
30 + cov[cov(X,
ˆ Y |control), cov(U,
ˆ V |control)]. (9)
31
32 Proof. ∆LD is the difference of the two sample covariances in cases and controls.
33 By the linear property of covariance, cov(∆LDXY , ∆LDU V ) can be decomposed
34 into four covariances of two sample covariances. Because individuals are assumed
35 to be independent, the two terms with one sample covariance from cases and the
36 other from controls are 0. Therefore, Theorem 1 holds. t
u
37
Theorem 2. The covariance of sample covariances reads
38
39 1 σ2 + τ2
cov[cov(X,
ˆ Y ), cov(U,
ˆ V )] = (δ4 − δ2 + ), (10)
40 n n−1
41 where
42
43 δ4 = E[(X − EX)(Y − EY )(U − EU )(V − EV )],
44 δ2 = cov(X, Y )cov(U, V ),
45 σ2 = cov(X, U )cov(Y, V ),
46
47 τ2 = cov(X, V )cov(Y, U ).
48 Proof. The covariance of sample covariances can be rewritten as
49
50 cov[cov(X,
ˆ Y ), cov(U,
ˆ V )]
n X n n X
n
51 1
X
1
X
52 = cov[ 2n(n−1) (Xi − Xj )(Yi − Yj ), 2n(n−1) (Ui − Uj )(Vi − Vj )]
53 i=1 j=1 i=1 j=1
n n X n X
n
54 1
XX
55 = 4n2 (n−1)2 cov[(Xi − Xj )(Yi − Yj ), (Uk − Ul )(Vk − Vl )]. (11)
56 j=1 j=1 k=1 l=1
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 6 Zhang et al.
7
8 We consider the following four conditions. (1) i = j or k = l. (2) i 6= j, i 6=
9 k, i 6= l, j 6= k, j 6= l and k 6= l. (3) i 6= j and {i = k, j = l or i = l, j = k}. (4)
10 i 6= j, k 6= l, and {i = k or i = l or j = k or j = l}. The basic covariance unit in
11 (11) can be rewriten as
12
13 cov{ [(Xi − EX) − (Xj − EX)][(Yi − EY ) − (Yj − EY )],
14 [(Uk − EU ) − (Ul − EU )][(Vk − EV ) − (Vl − EV )]}. (12)
15
16 There are 2n3 − n2 , n(n − 1)(n − 2)(n − 3), 2n(n − 1) and 4n(n − 1)(n − 2)
17 items for the four conditions respectively. We can further separate (12) into 16
18 components and calculate their values under different conditions. The derivation
19 is straightforward. Our conclusion thus holds. t
u
20
21 Theorem 3. The sample mean of (X − EX)(Y − EY )(U − EU )(V − EV ) is
22 an asympototically unbiased estimator of δ4 ,
23 n n n n n
24 1X 1X 1X 1X 1X
E[ (Xi − Xj )(Yi − Yj )(Ui − Uj )(Vi − Vj )]
25 n i=1 n j=1 n j=1 n j=1 n j=1
26
27 4 6 3 2(n − 1) 3(n − 1) n→∞
= (1 − + 2 − 3 )δ4 + [ − ](δ2 + σ2 + τ2 ) −−−−→ δ4 . (13)
28 n n n n2 n3
29 Proof. Equation (13) can be rewritten as
30
n n n
31 X 1X 1X
32 E{ [(Xi − EX) − ( Xj − EX)][(Yi − EY ) − ( Yj − EY )]
i=1
n j=1 n j=1
33
34 n n
1X 1X
35 [(Ui − EU ) − ( Uj − EU )][(Vi − EV ) − ( Vj − EV )]}. (14)
n j=1 n j=1
36
37
Again (14) can be separated into 16 components which are solvable under the
38
39 independence assumption. The rest of the proof is omitted due to page limit. t
u
40 By integrating (8-13), the covariance of the LD contrast test statistics can be
41 estimated. Note that the variance of the standardized LD contrast test statistic
42 is approximately 1,
43
44 ∆LDˆXY V ar(∆LDˆXY )
45 V ar(TXY ) = V ar[ q ]≈ = 1. (15)
ˆ V ar(∆ LDˆXY )
46 V ar(∆LDXY )
47
48 Therefore, the covariance of TXY and TU V can be reduced to the corresponding
49 correlation coefficients,
50 corr(TXY , TU V ) ≈ cov(TXY , TU V ). (16)
51
52
53 2.4 The Test Statistic for Region-Based Interactions
54
To aggregate SNP-SNP interaction test statistics, a minimum p-value based
55
56 method is adopted. In detail, we assume a multivariate normal distribution
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 Region-Based Interaction Detection in Case-Control Studies 7
7
8 M V N (0, Σ) for the observed test statistics zi , i = 1, 2, ..., k1 k2 , where k1 and
9 k2 are the number of SNPs in the two regions. The covariance matrix Σ is
10 estimated using (8-13).
11 Then the region-based p-value is defined as the probability that we observe
12 a value that is larger than the largest absolute value of SNP-SNP interaction
13
test statistics under M V N (0, Σ). Denote the absolute value of the test statistic
14
related to the minimum p-value as T :
15
16 min(pi , i = 1, 2, .., k1 k2
17 T = |Φ−1 ( )|. (17)
2
18
19 Then the p-value for this region-region interaction reads,
20
21 pregion-region = P r[max(|zi |, i = 1, 2, .., k1 k2 ) ≥ T | zi ∼ M V N (0, Σ)]
22
= 1 − P r[max(|zi |, i = 1, 2, .., k1 k2 ) < T | zi ∼ M V N (0, Σ)]
23
24 = 1 − P r[{|zi | < T, i = 1, 2, .., k1 k2 }| zi ∼ M V N (0, Σ)]. (18)
25
26 In this paper, We use the results of GBOOST [18], the GPU version of
27 BOOST, to specify candidate regions. The regions could also be selected by
28 checking potential pathogenic pathways or protein-protein interaction networks.
29
30
31 3 Experiment
32
33 We conducted simulations under various settings to examine whether the pro-
34 posed method can correctly control type-I-error rates and outperform SNP-based
35 methods in terms of statistical power. To mimic real LD patterns, we picked all
36 genotyped SNPs from two genomic regions (A and B) with intensive LD patterns
37 in the dataset from Myocardial Infarction Genetics Consortium (MIGen) [19].
38 Region A is of size 157.874 kbp, located in chromosome 1, with 34 genotyped
39 SNPs inside and 9 tag SNPs selected by haploview. Region B is of size 267.528
40 kbp, located in chromosome 3, with 50 genotyped SNPs and 10 tag SNPs.
41
We developed the software RRIntCC in C++. The source code of RRIntCC is
42
available at http://bioinformatics.ust.hk/RRIntCC.html. The results of RRIntCC
43
44 and SNP-based methods were compared for empirical power experiments. We
45 further applied RRIntCC to MIGen and a renal complication dataset of type 2
46 diabetes (T2D) patients. RRIntCC reported several significant region pairs in
47 both datasets while conventional SNP-based interaction detection tools failed to
48 identify any SNP pairs.
49
50
51 3.1 Type-I-Error Rate Control
52 For type-I-error rate evaluation, we randomly selected 1000, 2000, 3000, 4000,
53
and 5000 samples from MIGen dataset and maintained their genotype values to
54
preserve the LD patterns. Phenotype values for the randomly picked samples
55
56 were assigned using a Bernoulli distribution with equal probabilities for case
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 8 Zhang et al.
7
8 and control disease status. 1000 simulations were run for each sample size to de-
9 termine the empirical type-I-error rates under two commonly used significance
10 levels, i.e. 0.05 and 0.01. We repeat the experiment 20 times to examine the
11 robustness of empirical type-I-error rates. As shown in Fig. 1, simulations of em-
12 pirical type-I-error rates indicated that the results of RRIntCC are not inflated
13
at given significance levels.
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33 Fig. 1. The boxplots of empirical type-I-error rates at the significant levels of 0.05
34 (black) and 0.01 (blue).
35
36
37
38
39 3.2 Empirical Statistical Power
40
41 For power evaluation, phenotype values were generated using the public software
42 GWASimulator [20], which uses haplotype information to simulate LD structure
43 and produces phenotype values according to preset disease prevalence, causal
44 SNPs and interactions with certain effect sizes. In total, 12084 haplotypes of
45 these two regions were generated by PLINK [7]. We performed 1000 simulations
46 for 1000, 2000, 3000, 4000, and 5000 samples, respectively. Results of original
47 LD contrast test (LDCont) and GBOOST were also given for comparison.
48 GWASimulator simulated genotypes of all SNPs in the two regions, while
49 only the tag SNPs were analyzed. Even though non-tag SNPs could be selected
50 as causal SNPs, we can still observe interaction effects between tag SNPs due to
51 LD between tag SNPs and non-tag SNPs. We designed six experimental settings
52 with different tag status and allele frequencies for the causal interacted SNP
53
pair. The effect sizes were determined by the relative risk ratio.√The increment
54
of relative risk ratio by observing one disease allele was set as √2, so
√ that the
55
56 ratios for genotype combinations 1/1, 1/2, 2/1, and 2/2 were 2, 2 2, 2 2, and 4,
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 Region-Based Interaction Detection in Case-Control Studies 9
7
8 respectively. The results are summarized in Table 1. Under all settings, RRIntCC
9 achieves a higher power than LDCont. GBOOST outperforms RRIntCC and
10 LDCont when the MAFs of both causal SNPs are large. However, when the MAF
11 of even one causal SNP goes down, the power of GBOOST drops dramatically
12 and RRIntCC is the most powerful method under such settings. Even in the
13
cases where both MAFs are large, RRIntCC is still valuable when sample size
14
is small. The results support the use of our region-based interaction detection
15
16 method in GWAS studies, especially considering that GWAS datasets usually
17 have quite limited sample sizes compared to the huge number of SNPs.
18
19 Table 1. Empirical statistical power results. The indices are the order of SNPs in their
20 corresponding regions, * means this SNP is a tag SNP, and the values in the brackets
21 denote minor allele frequencies (MAFs).
22
23 1000 2000 3000 4000 5000
24
25 RRIntCC 0.247 0.564 0.806 0.913 0.975
26 19(0.424) ∼ 28(0.414) LDCont 0.205 0.524 0.778 0.888 0.968
27
GBOOST 0.240 0.624 0.872 0.961 0.994
28
29 RRIntCC 0.255 0.545 0.814 0.924 0.979
30 19(0.424) ∼ 22*(0.413) LDCont 0.214 0.489 0.793 0.905 0.969
31
GBOOST 0.218 0.609 0.885 0.968 0.998
32
33 RRIntCC 0.244 0.548 0.772 0.880 0.964
34 15(0.067) ∼ 22*(0.413) LDCont 0.188 0.496 0.724 0.849 0.953
35
36 GBOOST 0.058 0.211 0.411 0.559 0.731
37 RRIntCC 0.307 0.631 0.882 0.954 0.986
38 23*(0.067) ∼ 22*(0.413) LDCont 0.264 0.574 0.856 0.942 0.975
39
40 GBOOST 0.088 0.272 0.548 0.713 0.838
41 RRIntCC 0.116 0.266 0.398 0.551 0.667
42
15(0.067) ∼ 25(0.094) LDCont 0.072 0.204 0.323 0.480 0.612
43
44 GBOOST 0.012 0.060 0.108 0.224 0.285
45 RRIntCC 0.110 0.282 0.502 0.638 0.790
46
23*(0.067) ∼ 25(0.094) LDCont 0.081 0.220 0.428 0.576 0.729
47
48 GBOOST 0.025 0.064 0.161 0.259 0.397
49
50
51
52
53 3.3 Experiment Using Real Datasets
54
We applied our method to the dataset of Myocardial Infarction Genetics Consor-
55
56 tium (MIGen) with 649370 genotyped SNPs and 2967/3075 cases/controls, and
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 10 Zhang et al.
7
8 the renal complication dataset collected in Hong Kong with 1257031 SNPs and
9 882/2231 cases/controls. Current computation capability cannot support whole-
10 genome interaction analysis using LD contrast test. Instead, GBOOST [18] was
11 first used as probes to generate region-pairs for region-based interaction analysis.
12 We adopted 5 × 10−10 as a suggestive p-value threshold to screen out SNP pairs
13
that are unlikely to be associated. The remaining SNP pairs were clumped into
14
regions with size 200 kbp, which is roughly the size of typical genes. After identi-
15
16 fying the ranges of clumped regions, all genotyped SNPs in MIGen dataset were
17 mapped into these regions. For computation efficiency, the maxmium number
18 of SNPs in each region was set to be 31, so that the total number of SNP-SNP
19 interactions within each region pair was controlled below 1000. The choice of
20 this number is arbitrary. In case that the real number of SNPs inside a region is
21 larger than this limit, we randomly choose 31 SNPs to represent this region.
22
23
Table 2. Top four SNP pairs found by GBOOST in the MIGen dataset.
24
25 SNP pairs p-value cFWER
26 rs4678428 (chr3) ∼ rs9961565 (chr18) 2.588 × 10−11 > 1
27 rs17626606 (chr5) ∼ rs11190346 (chr10) 2.679 × 10−11 > 1
28 rs11925209 (chr3) ∼ rs1501909 (chr5) 3.006 × 10−11 > 1
29 rs6930292 (chr6) ∼ rs114313 (chr6) 3.026 × 10−11 > 1
30
31
32 Table 2 lists the top four SNP pairs found by GBOOST in the MIGen dataset
33 and their corrected family-wise error rates (cFWER). None of them can pass the
34
Bonferroni-corrected p-value threshold. Moreover, even the smallest p-value is
35
100 times larger than the threshold. Table 3 lists the top four region pairs found
36
37 by RRIntCC. One region pair, chr3: [177577480, 177777480] ∼ chr7: [81695481,
38 81895481], passes the Bonferroni-corrected p-value threshold. The second and
39 third region pairs share the same region in chr3 and overlap in the region in chr20,
40 which indicates that these two region pairs actually refer to only one region pair
41 with size larger than the preset 200 kbp. Therefore, we further analyze the region
42 interaction between chr3: [187498383, 187698383] with size 200 kbp and chr20:
43 [39109460, 39444799] with size 335.339 kbp, leading to a cFWER of 0.0536.
44 Multiple genes, including CACNA2D1, DGKG, AK057298, TOP1, BC035080,
45 PLCG1, ZHX3, LPIN3, and EMILIN3, are located in these two region pairs.
46 CACNA2D1 has been found to be involved in cardiomyopathy pathway [21][22].
47 Besides, ZHX3 is reported to be associated with left ventricle wall thickness [23].
48 Both ZH3 and EMILIN3 are reported to be associated with resting heart rate
49 [24]. The regions identified by RRIntCC might provide clues for factors affecting
50 myocardial infarction risks.
51 We also applied GBOOST and RRIntCC to the renal complication dataset.
52 GBOOST has no significant finding, while RRIntCC found one region pair,
53
chr12: [103040398, 103240398] and chr15: [33102602, 33302602], with a cFWER
54
of 0.00382. Two genes, PAH and FMN1, are involved in this region pair. Both
55
56 PAH and FMN1 were reported to be related to kidney disorders [25][26], which
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 Region-Based Interaction Detection in Case-Control Studies 11
7
8 Table 3. Top four region pairs found by RRIntCC in the MIGen dataset.
9
10 region pairs p-value cFWER
11 chr3: [177577480, 177777480] ∼ chr7: [81695481, 81895481] 1.652 × 10−10 0.0186
12 chr3: [187498383, 187698383] ∼ chr20: [39244799, 39444799] 5.363 × 10−10 0.0603
13 chr3: [187498383, 187698383] ∼ chr20: [39109460, 39309460] 7.497 × 10−10 0.0843
chr2: [184236258, 184436258] ∼ chr13: [29010198, 29210198] 7.835 × 10−9 0.8814
14
15
16
17 implies a potentially target pathway for the study of renal complications in pa-
18 tients with T2D.
19
20
4 Conclusion
21
22
In this paper, we proposed a region-based interaction detection method named
23
RRIntCC. We derived the correlation coefficients between SNP-SNP interaction
24
25 test statistics by using LD contrast test. We aggregated SNP-SNP interaction
26 test statistics by assuming a multi-variate normal distribution with the estimated
27 covariance matrix to account for the potential intensive LD pattern within the
28 regions. By using region-based strategy, we reduced the total number of tests and
29 were therefore able to use a less stringent Bonferroni-corrected p-value threshold.
30 Simulation results support that our region-based strategy outperforms SNP-
31 based method in terms of statistical power at similar type-I-error rates.
32 There still remain several issues that could be improved in the future. First,
33 the computation complexity of calculating the covariance matrix is O(n2 ), which
34 is unacceptable for whole genome analysis. Second, the genomic resolution has
35 been sacrificed by replacing SNPs with regions. One potential remedy is to ex-
36 tend statistical fine mapping methods for interaction detection to determine the
37 leading SNP pairs within the significant region pairs.
38
39
40 List of abbreviations
41
42 GWAS: Genome-Wide Association Study; SNP: Single Nucleotide Polymorphism;
43 LD: Linkage Disequilibrium; CLD: Composite Linkage Disequilibrium; RRIntCC:
44 Region-Region Interaction Detection for Case-Control Studies; bp: Base Pair;
45 MIGen: Myocardial Infarction Genetics Consortium; MAF: Minor Allele Fre-
46 quency; cFWER: Corrected Family-Wise Error Rate; T2D: Type II Diabetes.
47
48
49 Ethics Approval and Consent to Participate
50
51 Not applicable.
52
53
54 Consent for publication
55
56 Not applicable.
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 12 Zhang et al.
7
8 Availability of data and material
9
10 The data of Myocardial Infarction Genetics Consortium are publicly available
11
from The database of Genotypes and Phenotypes (dbGaP), accession number:
12
phs000294.v1.p1. The data of renal complication in T2D patients are are avail-
13
14 able from the Theme-based Research Scheme (T12-402/13N) of the Hong Kong
15 Research Grant Council (RGC) but restrictions apply to the availability of these
16 data, which were used under license for the current study, and so are not publicly
17 available. Data are however available from the authors upon reasonable request
18 and with permission of the Theme-based Research Scheme (T12-402/13N).
19
20
21 Competing interests
22
23 The authors declare that they have no competing interests.
24
25
26 Funding
27
28 This work is supported by the Theme-based Research Scheme (project T12-
29 402/13N) of the Hong Kong Research Grant Council (RGC).
30
31
32 Authors’ contributions
33
34 SZ designed the method, implemented the C++ package, performed statisitcal
35 experiments and drafted the manuscript. WJ contributed to the design of the
36 study and the design of the method. RCWM provided the data and contributed
37
to the design of the study. WY conceived and supervised the study, and helped
38
to draft the manuscript. All authors read and approved the final manuscript.
39
40
41 Acknowledgements
42
43
44 We would like to thank the anonymous reviewers for their helpful suggestions.
45
46
47 References
48
49 1. Welter, D., et al.: The NHGRI GWAS Catalog, a curated resource of SNP-trait
50 associations. Nucleic Acids Research 42(D1) (2014) D1001–D1006
51 2. Burton, P.R., et al.: Genome-wide association study of 14,000 cases of seven com-
52 mon diseases and 3,000 shared controls. Nature 447(7145) (2007) 661–678
53 3. Visscher, P.M., Brown, M.A., McCarthy, M.I., Yang, J.: Five years of GWAS
54 discovery. The American Journal of Human Genetics 90(1) (2012) 7–24
55 4. Cordell, H.J.: Detecting gene–gene interactions that underlie human diseases. Na-
ture Reviews Genetics 10(6) (2009) 392–404
56
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 Region-Based Interaction Detection in Case-Control Studies 13
7
8 5. Wan, X., et al.: BOOST: A fast approach to detecting gene-gene interactions in
9 genome-wide case-control studies. The American Journal of Human Genetics 87(3)
10 (2010) 325–340
11 6. Hu, J.K., Wang, X., Wang, P.: Testing gene–gene interactions in genome wide
12 association studies. Genetic Epidemiology 38(2) (2014) 123–134
13 7. Purcell, S., et al.: PLINK: a tool set for whole-genome association and population-
14 based linkage analyses. The American Journal of Human Genetics 81(3) (2007)
15 559–575
16 8. Ritchie, M.D., et al.: Multifactor-dimensionality reduction reveals high-order inter-
17 actions among estrogen-metabolism genes in sporadic breast cancer. The American
18 Journal of Human Genetics 69(1) (2001) 138–147
19 9. Moore, J.H., White, B.C.: Tuning relieff for genome-wide genetic analysis. In:
20 European Conference on Evolutionary Computation, Machine Learning and Data
21 Mining in Bioinformatics, Springer (2007) 166–175
22 10. Zhang, Y., Liu, J.S.: Bayesian inference of epistatic interactions in case-control
23 studies. Nature Genetics 39(9) (2007) 1167–1173
24 11. Zaykin, D.V., Meng, Z., Ehm, M.G.: Contrasting linkage-disequilibrium patterns
25 between cases and controls as a novel association-mapping method. The American
Journal of Human Genetics 78(5) (2006) 737–746
26
12. Li, M.X., Gui, H.S., Kwan, J.S., Sham, P.C.: GATES: a rapid and powerful gene-
27
based association test using extended Simes procedure. The American Journal of
28
Human Genetics 88(3) (2011) 283–293
29
13. Liu, J.Z., et al.: A versatile gene-based test for genome-wide association studies.
30 The American Journal of Human Genetics 87(1) (2010) 139–145
31 14. Ma, L., Clark, A.G., Keinan, A.: Gene-based testing of interactions in association
32 studies of quantitative traits. PLoS Genetics 9(2) (2013) e1003321
33 15. Hill, W.G.: Estimation of linkage disequilibrium in randomly mating populations.
34 Heredity 33(2) (1974) 229–239
35 16. Wu, X., Jin, L., Xiong, M.: Composite measure of linkage disequilibrium for testing
36 interaction between unlinked loci. European Journal of Human Genetics 16(5)
37 (2008) 644–651
38 17. Schaid, D.J.: Linkage disequilibrium testing when linkage phase is unknown. Ge-
39 netics 166(1) (2004) 505–512
40 18. Yung, L.S., Yang, C., Wan, X., Yu, W.: GBOOST: a GPU-based tool for detecting
41 gene–gene interactions in genome–wide case control studies. Bioinformatics 27(9)
42 (2011) 1309–1310
43 19. Kathiresan, S., et al.: Genome-wide association of early-onset myocardial infarction
44 with single nucleotide polymorphisms and copy number variants. Nature Genetics
45 41(3) (2009) 334–341
46 20. Li, C., Li, M.: GWAsimulator: a rapid whole-genome simulation program. Bioin-
47 formatics 24(1) (2007) 140–142
48 21. Cataldo, S., Annoni, G.A., Marziliano, N.: The perfect storm? Histiocytoid car-
49 diomyopathy and compound CACNA2D1 and RANGRF mutation in a baby. Car-
50 diology in the Young 25(1) (2015) 174–176
51 22. Bourdin, B., et al.: Functional characterization of CaVα2δ mutations associated
52 with sudden cardiac death. Journal of Biological Chemistry 290(5) (2015) 2854–
53 2869
54 23. Wild, P.S., et al.: Large-scale genome-wide analysis identifies genetic variants as-
55 sociated with cardiac structure and function. The Journal of Clinical Investigation
127(5) (2017) 1798–1812
56
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6 14 Zhang et al.
7
8 24. Eppinga, R.N., et al.: Identification of genomic loci associated with resting heart
9 rate and shared genetic predictors with all-cause mortality. Nature Genetics 48(12)
10 (2016) 1557
11 25. Lichter-Konecki, U., Hipke, C.M., Konecki, D.S.: Human phenylalanine hydroxy-
12 lase gene expression in kidney and other nonhepatic tissues. Molecular Genetics
13 and Metabolism 67(4) (1999) 308–316
14 26. Dimitrov, B.I., et al.: Genomic rearrangements of the GREM1-FMN1 locus cause
15 oligosyndactyly, radio-ulnar synostosis, hearing loss, renal defects syndrome and
16 Cenani–Lenz-like non-syndromic oligosyndactyly. Journal of Medical Genetics
17 47(8) (2010) 569–574
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
Response letter

Click here to access/download


Supplementary Material
Response_Letter_RRIntCC.docx
Pdf of manuscript with changes highlighted

Click here to access/download


Supplementary Material
RRIntCC_revised_highlighted.pdf
Latex source file of manuscript

Click here to access/download


Supplementary Material
RRIntCC_revised.tex
Latex reference file of manuscript

Click here to access/download


Supplementary Material
reference.bib
Figure in the manuscript

Click here to access/download


Supplementary Material
t1e_both.png

You might also like