Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024.

The copyright holder for this preprint


(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

1 Prioritizing disease-related rare variants by integrating gene

2 expression data
3
4 Hanmin Guo1,2, Alexander Eckehart Urban2,3,#, Wing Hung Wong1,4,#
5 1Department of Statistics, Stanford University, Stanford, California 94305, USA

6 2Department of Psychiatry and Behavioral Sciences, Stanford University School of

7 Medicine, Stanford, California 94305, USA


8 3Department of Genetics, Stanford University School of Medicine, Stanford, California

9 94305, USA
10 4Department of Biomedical Data Science, Stanford University School of Medicine, Stanford,

11 California 94305, USA


12
13 #To whom correspondence should be addressed:
14 Wing Hung Wong (whwong@stanford.edu)
15 Alexander Eckehart Urban (aeurban@stanford.edu)
16
17

Page 1 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

18 Abstract
19 Rare variants, comprising a vast majority of human genetic variations, are likely to have
20 more deleterious impact on human diseases compared to common variants. Here we
21 present carrier statistic, a statistical framework to prioritize disease-related rare variants by
22 integrating gene expression data. By quantifying the impact of rare variants on gene
23 expression, carrier statistic can prioritize those rare variants that have large functional
24 consequence in the diseased patients. Through simulation studies and analyzing real
25 multi-omics dataset, we demonstrated that carrier statistic is applicable in studies with
26 limited sample size (a few hundreds) and achieves substantially higher sensitivity than
27 existing rare variants association methods. Application to Alzheimer's disease reveals 16
28 rare variants within 15 genes with extreme carrier statistics. The carrier statistic method
29 can be applied to various rare variant types and is adaptable to other omics data modalities,
30 offering a powerful tool for investigating the molecular mechanisms underlying complex
31 diseases.
32

Page 2 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

33 Introduction
34 Rare variants (minor allele frequency (MAF) < 1%) constitute a vast majority of human
35 genetic variations1,2. They are on average more deleterious compared with common
36 variants, and thus undergoes stronger selection and remains a low frequency in the
37 general population. By analyzing large cohorts of whole genome sequencing/whole exome
38 sequencing (WGS/WES) data, researchers have identified a handful of rare variants-trait
39 association3-6 and shown that rare variants contribute to a large proportion of missing
40 heritability that cannot be explained by common variants7.
41 Rare variants on average confer larger effects on gene expression and complex
42 diseases and are easier to map to causal genes than common variants8. However,
43 statistical power to identify disease-associated rare variants, especially for ultra-rare
44 variants or even singletons, is limited, given that the sample size is not too high or the
45 effect size is not too large. Variants collapsing methods (burden test, variance component
46 test, omnibus test) are proposed to circumvent this obstacle6,9-11, which evaluate
47 association for multiple variants in a biologically relevant region, such as a gene, instead
48 of testing the effect of single variant. These methods work well under the assumption that
49 multiple variants in a gene cumulatively contribute to the disease risk with each individual
50 allele explaining only a small fraction of the cases, but resolution to pinpoint the risk
51 variants may be diluted when the assumption is violated.
52 Complementary to genome sequencing assay, RNA-seq can quantitatively measure
53 gene expression level and provide molecular cause of complex diseases, especially rare
54 diseases. Previous studies have shown that rare variants are enriched near genes with
55 aberrant gene expression12-14. We posit that those rare variants will be more prone to
56 disease pathology. Specifically, for those rare variants with large functional impact, it will
57 be likely to trigger some feedback in the regulatory network and lead to extreme expression
58 change among disease-related genes. In this work, we propose carrier statistic, a statistical
59 framework for prioritizing those rare variants with large functional consequence in diseased
60 patients by integrating gene expression data. We demonstrate superior performance of our
61 method through extensive simulations and application study to the Alzheimer’s disease,
62 where given a limited sample size, existing rare variants association methods without
63 functional gene expression data cannot provide positive findings.
64

65 Results
66 Method Overview

Page 3 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

67

68 Figure 1. Carrier statistics for disease-related rare variant-gene pairs will tend to be more
69 extreme compared to those for non-related ones in the case group.
70
71 Our method stems from the expectation that diseased population shows enrichment in rare
72 variants that have large impact on expression for disease-related genes. Suppose we have
73 genotypes (e.g. variants call from WGS data) and gene expression measurements (e.g.
74 reads count from RNA-seq data) for both diseased patients and healthy controls. For each
75 rare variant-gene pair, we calculate the expression association z-score for rare variant
76 carriers by using the gene expression from individuals without the variant as the null
77 distribution (Figure 1a). The expression association z-score, which we term as carrier
78 statistic, is calculated separately within the case and the control group. We only consider
79 rare variant-gene pair wherein the variant is located within the exon of the gene throughout
80 this study. The carrier statistic quantifies the degree to which the rare variant impacts gene
81 expression level. We assume that most rare variants do not have a large impact on the
82 gene function, so the distribution of carrier statistic will be centered around 0. Now, if a
83 gene is relevant for the disease, then conditioning on having the disease will bias the
84 sampling towards people carrying rare variants with large functional impact on the gene
85 expression. Thus, the carrier statistics for disease-related rare variant-gene pairs will tend
86 to be more extreme compared to those for non-related pairs in the case group (Figure 1b).
87 We prioritize those rare variants and genes with outlier carrier statistic in the case group.
88 False discovery rate (FDR) can be computed as the ratio of tail probability for carrier
89 statistic between two groups (Methods).
90
91 Simulation Results
92 We first carried out simulations to assess whether the carrier statistic-based method would
93 produce false positive findings. We simulated genotypes based on whole exome
94 sequencing (WES) data from Genome Aggregation Database (gnomAD)2 and simulated
95 gene expression profiles based on RNA-seq data in whole blood tissue from the Genotype-

Page 4 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

96 Tissue Expression (GTEx) project (Methods). We perturbed the expression level of the
97 causal genes for causal variants carriers by assuming that causal rare variants have large
98 functional impact on disease-related genes. FDR for carrier statistic was well-calibrated in
99 all simulation settings with varying penetrance of causal variant, prevalence in causal
100 variant noncarriers, and number of causal variants per causal gene (Supplementary
101 Figure 1). We also checked if using gene expression from noncarriers in two groups
102 together rather than separately as null distribution will produce false positive findings,
103 particularly when diseased patients have systematic change in gene expression profile
104 from healthy controls (Methods). In this case, FDR showed substantial inflation for using
105 gene expression from all noncarriers as null distribution, while using gene expression from
106 noncarriers in the same group of carriers as null distribution consistently gave well-
107 calibrated error rates (Supplementary Figure 2).
108
109 Next, we benchmarked the performance of carrier statistic with three existing rare variants
110 association methods: burden test9, Sequence Kernel Association Test (SKAT)10, and
111 SKAT-O11. Burden test counts the number of rare variants within a gene followed by the
112 association test with the disease. SKAT computes a gene-level variance component score
113 statistic which allows bidirectional effect of different variants. The unified test SKAT-O
114 implements a linear combination of burden test statistic and variance component test
115 statistic, which is preferred when the underlying genetic architecture of the disease is not
116 known. While all four methods controlled FDR at the nominal level (Supplementary Figure
117 1), carrier statistic consistently achieved higher sensitivity than the three variants collapsing
118 methods under various settings (Figure 2). It is worth noting that only a small proportion
119 of case samples can be attributed to the causal variants in our simulated disease model,
120 thus the enrichment of rare variants burden within causal genes in the case group over
121 that in the control group was moderate and did not attain statistical significance
122 (Supplementary Tables 1-3). However, carrier statistic could prioritize those causal rare
123 variants with large functional impact by integrating information from gene expression data.
124

125

126 Figure 2. Carrier statistic achieves higher sensitivity than variants collapsing methods
127 in simulations with varying (a) penetrance of causal variant, (b) prevalence in causal
128 variant noncarriers, and (c) number of causal variants per causal gene.
129

Page 5 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

130 Application to Alzheimer’s disease


131 Alzheimer’s disease is highly heritable, with estimated heritability up to ~60%-80% based
132 on twin studies15. Large scale GWASs have identified multiple loci contributing to
133 Alzheimer’s disease, but the genetic variance explained by the significant loci is far from
134 the level suggested by the disease heritability16. Additionally, there is limited understanding
135 regarding the molecular mechanism through which GWAS significant variants affect the
136 disease, with the exception of the well-known APOE locus. To investigate whether rare
137 variants (single nucleotide variants [SNVs] and short indels) confer functional
138 consequences in Alzheimer’s disease, we applied carrier statistic to a harmonized multi-
139 omics dataset (𝑛𝑛𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 = 444, 𝑛𝑛𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 = 234) consisting of WGS and RNA-seq from four
140 aging cohort studies: the Religious Orders Study (ROS) and Memory and Aging Project
141 (MAP), the Mount Sinai Brain Bank (MSBB), and the Mayo Clinic (Methods). We found
142 significant excess of large carrier statistic in the diseased patients (Figure 3). Controlling
143 FDR with a cutoff of 0.2, we prioritized 16 rare variants within 15 genes with large carrier
144 statistic in the case group (Table 1), implicating those variants may contribute to
145 Alzheimer’s disease through up-regulating gene expression in the brain.
146
147 Table 1. 16 rare variants within 15 genes with large carrier statistic in the Alzheimer’s
148 disease patients.
CHR POS REF ALT SNP Gene Carrier statistic
14 31344166 G A rs200935305 COCH 5.78
10 71716481 G A rs537156091 COL13A1 5.09
3 147108829 T TGTAGCCC rs765482543 ZIC4 5.02
15 63845640 C T rs971269728 USP3 5.00
15 81565527 G T rs188062582 IL16 4.77
10 115991333 A G rs867202477 TDRD1 4.74
6 37614984 C T rs755089010 MDGA1 4.71
1 29445783 A AT rs1210817818 EPB41 4.69
5 161113972 A G rs966622465 GABRA6 4.69
15 32907368 GCCTTCAGC G rs1377672532 ARHGAP11A 4.62
7 31747403 G A rs766182885 PPP1R17 4.61
18 22804990 G A rs770368967 ZNF521 4.61
6 34028294 C T rs556027021 GRM4 4.58
1 231356659 G A rs528099262 TRIM67 4.56
6 37665148 A G rs545009624 MDGA1 4.55
149 6 123046921 C T rs1016716974 PKIB 4.53

150
151 To benchmark the performance of carrier statistic with existing methods, we also applied
152 burden test, SKAT, SKAT-O to the same Alzheimer’s disease dataset. These three variants
153 collapsing methods did not identify any significant genes (FDR < 0.2), possibly due to
154 insufficient sample size. To further evaluate the performance of carrier statistic, we
155 assessed the enrichment of rare variants burden within the top prioritized genes in case
156 group compared to that in controls. Among the top 100 genes with largest carrier statistic,
157 67 genes have fold enrichment larger than 1, 32 genes have fold enrichment larger than
158 4/3, while only 6 genes have fold enrichment smaller than 3/4 (Supplementary Figure 3).
159 As demonstrated in the simulations, enrichment of rare variants burden within those genes

Page 6 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

160 was moderate and did not pass significance threshold by the variants collapsing methods.
161

162

163 Figure 3. Alzheimer’s disease patients show significant excess of large carrier
164 statistic.
165
166 The carrier statistic method prioritized a handful of significant genes that may shed light on
167 the genetic etiology of Alzheimer’s disease (Figure 4). COCH has the largest carrier
168 statistic of 5.78. Missense mutations within this gene were found to cause the late-onset
169 DFNA9 deafness disorder17,18. Furthermore, deposits of Cochlin encoded by COCH is
170 associated with age-related glaucomatous trabecular meshwork but absent in healthy
171 controls. Additionally, SNPs inside COCH are associated with cortical thickness19, changes
172 in which through neuroimaging techniques is commonly used in early detection and
173 monitoring of Alzheimer's disease progression20,21. Gene ARHGAP11A has a carrier
174 statistic of 4.62. Transcribed mRNAs of the gene subcellularly localize and are locally
175 translated in radial glia cells of human cerebral cortex and further regulate cortical
176 development22. More importantly, ARHGAP11A may contribute to Alzheimer’s disease
177 pathology by mediating Amyloid-β generation and Amyloid-β oligomer neurotoxicity23.
178 PPP1R17 (carrier statistic = 4.61) functions as a suppressor of phosphatase complexes 1
179 (PP1) and 2A (PP2A). A recent mice study suggests that hypothalamus neurons with
180 PPP1R17 expression are associated with synaptic dysfunction and WAT impairment,
181 which can further delay aging24. Interestingly, PPP1R17 is involved in human-specific
182 cortical neurodevelopment regulated by enhancers in human accelerated regions25. ZIC4
183 (carrier statistic = 5.02), also known as zinc finger protein of the cerebellum 4, plays an
184 important role in the embryonal development of the cerebellum. Heterozygous deletions
185 encompassing the ZIC4 locus are associated with a rare congenital cerebellar
186 malformation known as the Dandy–Walker malformation26. Notably, mutations in proximity
187 to the ZIC4 loci are implicated in multiple system atrophy, a rare neurodegenerative
188 disease27. Large scale GWAS study of brain morphology has also identified associations
189 with ZIC4, underscoring its significance in diverse neurological processes28. MDGA1

Page 7 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

190 (carrier statistic = 4.71) encodes a glycosylphosphatidylinositol (GPI)-anchored cell surface


191 glycoprotein. It has been reported that MDGA1 can contribute to cognitive deficits through
192 altering inhibitory synapse development and transmission in the hippocampus29. Of note,
193 MDGA 1 is one of the 96 genes from the Olink neurology panel with established links to
194 neurobiological processes and neurological diseases.
195

196

197 Figure 4. Rare variant carriers show outlier expression level for (a) COCH, (b)
198 ARHGAP11A, and (c) PPP1R17. Pseudo count 1 was added to the RNA reads count for
199 visualization purpose. Y-axis is on log scale.
200

201 Discussion
202 We presented carrier statistic, a statistical framework to perform multi-omics data analysis,
203 for prioritization of disease-related rare variants and their regulated genes. Through
204 simulations and analyses of real multi-omics dataset, we demonstrated that carrier statistic
205 overcomes sample size limitation and achieves substantial gain in statistical power
206 compared to existing variants collapsing methods. The superior performance of carrier
207 statistic can be attributed to incorporation of functional gene expression data, which allows
208 quantitatively measuring the impact of rare variants that cannot be determined by looking
209 at the variants alone. We applied carrier statistic to Alzheimer’s disease and highlighted
210 several novel risk genes, providing insights into the molecular etiology of the complex
211 disease.
212
213 Carrier statistic serves as a general approach, for the first time, to study how rare variants
214 affect complex disease through mediating gene expression. There exist several methods
215 such as transcriptome-wide association study (TWAS)30,31 or colocalisation32,33 that can
216 also perform integrative analysis across multiple data modalities (genotype, gene
217 expression, and phenotype). However, those methods focus exclusively on effects of
218 common SNPs and will have limited power for rare variants. In addition to SNVs and short
219 indels that we included in this study, the statistical framework can be also applied to other
220 types of rare variants (simple structural variants [SVs], complex SVs, mobile element
221 insertions, tandem repeat expansions), which in general have larger effect size than
222 SNVs34. Finally, carrier statistic can be adapted to other omics data, such as epigenomics

Page 8 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

223 and proteomics. As multi-omics data accumulates complementary to genome sequencing


224 assay, we anticipate that carrier statistic will be an effective approach to dissect the
225 molecular mechanism of complex diseases.
226

Page 9 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

227 Methods
228 Carrier statistic
229 For each rare variant-gene pair (the variant is located within the exon of the gene), we
230 used the expression of that gene in the rare variant noncarriers as the null distribution and
231 computed a z-score for each rare variant carrier, then average over carriers of that variant.
232 The carrier statistic was computed separately within the case group and the control group.
233 Rare variants were defined as SNVs and short indels whose allele count was no larger
234 than 5 within the case group or within the control group. Therefore, the rare variants and
235 thus the number of carrier statistics are not the same between two groups. The carrier
236 statistic quantifies the degree to which the rare variant impacts gene expression level. We
237 assume that diseased population shows enrichment in rare variants that have large impact
238 on expression for disease-related genes, thus there will also be enrichment of extreme
239 carrier statistic for disease-related rare variant-gene pairs in the case group. We prioritize
240 those rare variants and genes with outlier carrier statistic in the case group. For rare
241 variant-gene pairs with positive carrier statistic, false discovery rate at a given threshold of
Pr[𝑧𝑧𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 ≥ 𝑧𝑧0 ]
242 carrier statistic, denoted by 𝑧𝑧0 , can be computed as Pr[𝑧𝑧𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 ≥ 𝑧𝑧0 ]
, where 𝑧𝑧𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 and 𝑧𝑧𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐

243 denote the carrier statistic in the case group and control group, respectively. Duplicative
244 carrier statistics were removed (i.e. multiple rare variants occurring in the same individuals
245 are counted as the same rare variant). Similarly, for rare variant-gene pairs with negative
Pr[𝑧𝑧𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 ≤ 𝑧𝑧0 ]
246 carrier statistic, false discovery rate at threshold of 𝑧𝑧0 can be computed as .
Pr[𝑧𝑧𝑐𝑐𝑐𝑐𝑠𝑠𝑒𝑒 ≤ 𝑧𝑧0 ]

247
248 Simulations
249 We simulated genotypes for a large population consisting of 125,748 individuals based on
250 the alternative allele count from 125,748 exomes in the gnomAD v2.1.1 dataset2. Only
251 exonic variants that passed all variant filters in the gnomAD dataset were retained. Then
252 we simulated gene expression data for the large population as follows. We first simulated
253 background gene expression profile for these 125,748 individuals while matching the mean
254 and standard deviation of normalized gene expression in the reference expression dataset.
255 In this study, we used log 2 (reads count + 1) as normalized gene expression and RNA-
256 seq in the whole blood tissue from GTEx project v8 as the reference expression dataset35.
257 Genes whose median number of reads count in the large population < 10 were removed.
258 We randomly selected 𝑚𝑚 causal genes and 𝑙𝑙 causal variants for each causal gene,
259 where 𝑚𝑚 was set as 10 and 𝑙𝑙 was set to vary from 1 to 10. For each causal gene, we
260 perturbed the gene expression for causal variant carriers by 𝑧𝑧 ∗ 𝑠𝑠𝑠𝑠 fold, where 𝑧𝑧 was set
261 to vary from 1 to 5 and 𝑠𝑠𝑠𝑠 was the standard deviation of normalized gene expression in
262 the reference dataset. Next, we simulated the disease status for the large population by
263 assuming penetrance of causal variant as 𝑝𝑝𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 and prevalence in causal variant
264 noncarrier as 𝑝𝑝𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 . Here 𝑝𝑝𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 varied from 0.5 to 0.9 and 𝑝𝑝𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 varied from
265 0.005 to 0.02. Finally, we randomly sampled 500 cases and 500 controls from the affected
266 and nonaffected population respectively to mimic the sample recruitment procedure in the
267 disease study. Each simulation setting was repeated for 100 times.

Page 10 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

268
269 We evaluated the performance of different methods using two metrics: FDR and sensitivity.
270 FDR was defined as the proportion of falsely identified genes among all identified ones. If
271 no gene was identified then FDR was set as 0. Sensitivity was defined as the proportion of
272 truly identified genes among all underlying causal genes.
273
274 Note that carrier statistic was computed by using gene expression from rare variant
275 noncarriers in the same group (i.e. case or control) as the carriers as null distribution. We
276 also checked if using gene expression from noncarriers in both case group and control
277 group as the null distribution will produce false positive findings. We perturbed expression
278 level for all genes in the case group. In this case, FDR showed substantial inflation for
279 using gene expression from all individuals in two groups as null distribution, especially
280 when there is large systematic difference in the transcriptome between two groups
281 (Supplementary Figure 1b). On the contrary, using gene expression from rare variant
282 noncarriers in the same group of carriers as null distribution consistently controlled FDR at
283 the nominal level (Supplementary Figure 1a).
284
285 Implementation of different methods
286 Variants collapsing methods were performed using the R package SKAT v.2.2.5. All the
287 parameters were set as default value. Both common and rare variants were included in the
288 analysis.
289
290 Multi-omics data analysis for Alzheimer’s disease
291 WGS data for the four aging cohorts (ROS/MAP, MSBB, and the Mayo Clinic) were
292 obtained from the Whole Genome Sequence Harmonization Study (Synapse ID:
293 syn22264775). RNA-seq data for the same four cohorts were downloaded from the
294 RNAseq Harmonization Study (syn21241740). Only white people with both WGS and
295 RNA-seq data were included in the analysis.
296
297 We determined disease status following description in previous publications36,37. For the
298 ROS/MAP cohorts, individuals with a Braak neurofibrillary tangle score ≥ 4, a CERAD
299 neuritic and cortical plaque score ≤ 2, and a cognitive diagnosis of probable Alzheimer’s
300 disease with no other causes (cogdx = 4) were classified as cases, while individuals with
301 a Braak score ≤ 3, a CERAD score ≥ 3, and a cognitive diagnosis of no cognitive
302 impairment (cogdx = 1) were classified as controls. For MSBB, individuals with a Braak
303 score ≥ 4, a CERAD score ≥ 2, and a Clinical Dementia Rating (CDR) score ≥ 1 were
304 classified as cases, while individuals with a Braak score ≤ 3, a CERAD score ≤ 1, and
305 a CDR score ≤ 0.5 were classified as controls. For the Mayo Clinic cohort, individuals
306 with a Braak score ≥ 4 and a CERAD score ≥ 2 were classified as cases, while
307 individuals with a Braak score ≤ 3 and a CERAD score ≤ 1 were classified as controls.
308 Of note, definition of CERAD score in the ROS/MAP cohort is different from that in the
309 MSBB and the Mayo Clinic cohorts. After harmonization across cohorts, 444 cases and
310 234 controls in total were identified and used for downstream analysis.
311

Page 11 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

312 Next, we performed quality control on the WGS and RNA-seq data. For WGS data, only
313 exonic variants with missing genotypes < 10% were retained. For RNA-seq data, we
314 selected prefrontal cortex as the target brain tissue. If a donor does not have RNA-seq in
315 the prefrontal cortex, then RNA-seq in other tissues will be used based on the following
316 order: dorsolateral prefrontal cortex > posterior cingulate cortex > head of caudate nucleus
317 in the ROS/MAP cohort, prefrontal cortex > frontal pole > superior temporal gyrus > inferior
318 frontal gyrus > parahippocampal gyrus in the MSBB cohort, and temporal cortex >
319 cerebellum in the Mayo Clinic cohort. Genes with zero reads count in more than 10% of
320 samples or with median reads count < 10 across samples were excluded.
321 log 2 (reads count + 1) was used as normalized gene expression. Then we applied carrier
322 statistic to perform downstream analysis.
323
324 Data availability
325 The WGS data in the Whole Genome Sequence Harmonization Study
326 (https://www.synapse.org/#!Synapse:syn22264775) and the RNA-seq data in the RNAseq
327 Harmonization Study (https://www.synapse.org/#!Synapse:syn21241740) are publicly
328 available on the AD Knowledge Portal platform through completion of a data use certificate.
329 The gnomAD v2.1.1 data consisting of 125,748 exomes were downloaded from
330 https://gnomad.broadinstitute.org/. The gene expression data (gene reads count) from
331 GTEx project version 8 were downloaded from the GTEx Portal,
332 https://gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression.
333
334 Acknowledgements
335 We thank Dr. Bo Zhou and Dr. Hua Tang for discussion and providing feedbacks.
336 This work was partially supported by NIH grants R01 HG010359, R01 MH116529, and P50
337 HG007735.
338
339 Author contributions
340 W.H.W. and A.E.U. supervised the study. H.G. and W.H.W. designed the idea. H.G.
341 performed data analysis. H.G., A.E.U., and W.H.W. interpreted the results. H.G. and W.H.W.
342 wrote the manuscript.
343
344 References
345 1. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature
346 536, 285-291 (2016).
347 2. Karczewski, K.J. et al. The mutational constraint spectrum quantified from variation in
348 141,456 humans. Nature 581, 434-443 (2020).
349 3. Jurgens, S.J. et al. Analysis of rare genetic variation underlying cardiometabolic
350 diseases and traits among 200,000 individuals in the UK Biobank. Nature genetics 54,
351 240-250 (2022).
352 4. Wang, Q. et al. Rare variant contribution to human disease in 281,104 UK Biobank
353 exomes. Nature 597, 527-532 (2021).
354 5. Flannick, J. et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440
355 controls. Nature 570, 71-76 (2019).

Page 12 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

356 6. Li, X. et al. Powerful, scalable and resource-efficient meta-analysis of rare variant
357 associations in large whole genome sequencing studies. Nature genetics 55, 154-164
358 (2023).
359 7. Wainschtein, P. et al. Assessing the contribution of rare variants to complex trait
360 heritability from whole-genome sequence data. Nature Genetics 54, 263-273 (2022).
361 8. Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577, 179-189
362 (2020).
363 9. Li, B. & Leal, S.M. Methods for detecting associations with rare variants for common
364 diseases: application to analysis of sequence data. The American Journal of Human
365 Genetics 83, 311-321 (2008).
366 10. Wu, M.C. et al. Rare-variant association testing for sequencing data with the sequence
367 kernel association test. The American Journal of Human Genetics 89, 82-93 (2011).
368 11. Lee, S. et al. Optimal unified approach for rare-variant association testing with
369 application to small-sample case-control whole-exome sequencing studies. The
370 American Journal of Human Genetics 91, 224-237 (2012).
371 12. Li, X. et al. The impact of rare variation on gene expression across tissues. Nature 550,
372 239-243 (2017).
373 13. Zeng, Y. et al. Aberrant gene expression in humans. PLoS genetics 11, e1004942
374 (2015).
375 14. Brechtmann, F. et al. OUTRIDER: a statistical method for detecting aberrantly
376 expressed genes in RNA sequencing data. The American Journal of Human Genetics
377 103, 907-917 (2018).
378 15. Gatz, M. et al. Role of genes and environments for explaining Alzheimer disease.
379 Archives of general psychiatry 63, 168-174 (2006).
380 16. Wightman, D.P. et al. A genome-wide association study with 1,126,563 individuals
381 identifies new risk loci for Alzheimer’s disease. Nature genetics 53, 1276-1282 (2021).
382 17. Bhattacharya, S.K. Focus on molecules: cochlin. Experimental eye research 82, 355
383 (2006).
384 18. Robertson, N.G. et al. Cochlin immunostaining of inner ear pathologic deposits and
385 proteomic analysis in DFNA9 deafness and vestibular dysfunction. Human Molecular
386 Genetics 15, 1071-1085 (2006).
387 19. Van Der Meer, D. et al. The genetic architecture of human cortical folding. Science
388 advances 7, eabj9446 (2021).
389 20. Querbes, O. et al. Early diagnosis of Alzheimer's disease using cortical thickness:
390 impact of cognitive reserve. Brain 132, 2036-2047 (2009).
391 21. Schwarz, C.G. et al. A large-scale comparison of cortical thickness and volume
392 methods for measuring Alzheimer's disease severity. NeuroImage: Clinical 11, 802-812
393 (2016).
394 22. Pilaz, L.-J. et al. Subcellular mRNA localization and local translation of Arhgap11a in
395 radial glial cells regulates cortical development. bioRxiv, 2020.07. 30.229724 (2020).
396 23. Huang, Y.-r. et al. ArhGAP11A mediates amyloid-β generation and neuropathology in
397 an Alzheimer’s disease-like mouse model. Cell Reports 42(2023).
398 24. Tokizane, K., Brace, C.S. & Imai, S.-i. DMHPpp1r17 neurons regulate aging and
399 lifespan in mice through hypothalamic-adipose inter-tissue communication. Cell

Page 13 of 14
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.19.585836; this version posted March 21, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under aCC-BY-NC-ND 4.0 International license.

400 Metabolism 36, 377-392. e11 (2024).


401 25. Girskis, K.M. et al. Rewiring of human neurodevelopmental gene regulatory programs
402 by human accelerated regions. Neuron 109, 3239-3251. e7 (2021).
403 26. Grinberg, I. et al. Heterozygous deletion of the linked genes ZIC1 and ZIC4 is involved
404 in Dandy-Walker malformation. Nature genetics 36, 1053-1055 (2004).
405 27. Hopfner, F. et al. Common Variants Near ZIC1 and ZIC4 in Autopsy ‐Confirmed
406 Multiple System Atrophy. Movement disorders 37, 2110-2121 (2022).
407 28. Zhao, B. et al. Genome-wide association analysis of 19,629 individuals identifies
408 variants influencing regional brain volumes and refines their genetic co-architecture
409 with cognitive and mental health traits. Nature genetics 51, 1637-1644 (2019).
410 29. Connor, S.A. et al. Loss of synapse repressor MDGA1 enhances perisomatic inhibition,
411 confers resistance to network excitation, and impairs cognitive function. Cell reports 21,
412 3637-3645 (2017).
413 30. Gamazon, E.R. et al. A gene-based association method for mapping traits using
414 reference transcriptome data. Nature genetics 47, 1091-1098 (2015).
415 31. Wainberg, M. et al. Opportunities and challenges for transcriptome-wide association
416 studies. Nature genetics 51, 592-599 (2019).
417 32. Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic
418 association studies using summary statistics. PLoS genetics 10, e1004383 (2014).
419 33. Hormozdiari, F. et al. Colocalization of GWAS and eQTL signals detects target genes.
420 The American Journal of Human Genetics 99, 1245-1260 (2016).
421 34. Chiang, C. et al. The impact of structural variation on human gene expression. Nature
422 genetics 49, 692-699 (2017).
423 35. Consortium, G. The GTEx Consortium atlas of genetic regulatory effects across human
424 tissues. Science 369, 1318-1330 (2020).
425 36. Wan, Y.-W. et al. Meta-analysis of the Alzheimer’s disease human brain transcriptome
426 and functional dissection in mouse models. Cell reports 32(2020).
427 37. Vialle, R.A., de Paiva Lopes, K., Bennett, D.A., Crary, J.F. & Raj, T. Integrating whole-
428 genome sequencing with multi-omic data reveals the impact of structural variants on
429 gene regulation in the human brain. Nature neuroscience 25, 504-514 (2022).
430

Page 14 of 14

You might also like