journal homepage: www.elsevier.com/locate/cmrp
Research Methods
article info
Keywords:
Genomic studies
GWAS
Linear models
abstract
With the help of molecular markers, genome-wide association studies (GWAS) are conducted to identify genes associated with diseases. Association mapping uses unrelated individuals from the same population that has undergone recombination over many generations since the inception of the mutant gene, and this is the basis for detection of causal genes. The data that form the basis for computational detection of causal genes are of three kinds: phenotypic values (single trait or several traits), genotypes of hundreds of thousands of SNP markers, and data on gene expression, a sort of intermediate phenotype used to associate genes with disease phenotypes. Most studies, however, consider a single trait at a time and take either phenotypes and marker genotypes only, or phenotypes, genotypes and gene expression all together. In actual situations, on the other hand, the problem is multivariate since many complex disease syndromes consist of a large number of highly related clinical or molecular phenotypes. For instance, asthma is influenced by as many as 53 clinical traits that can be represented as a quantitative trait network (QTN). The methodological issue is then to conduct an association analysis that jointly takes into account all the relevant traits instead of a single trait only. Linear models in which a dependent variable (expression of a disease trait) is related to a set of independent variables (for instance, SNPs) provide a very versatile tool that can be used for association analysis of both single and multiple traits. To this end, we systematically discuss the sparse regression methodology of Ridge Regression, Lasso, and GFLasso with illustrations from published literature.
Copyright 2014, Sir Ganga Ram Hospital. Published by Reed Elsevier India Pvt. Ltd. All
rights reserved.
1. Introduction
Genome-wide association studies (GWAS) have identified genes associated with diseases in several cases, but they explain only a small fraction of the variability, leading to the frequently asked question of missing heritability.1 In the pre-genomic era, statistical considerations were predominant in dissecting such complex traits into estimable components.2 Heritability
* Presented as an invited paper in the Mathematical Sciences Section of the 99th Indian Science Congress held at KIIT University, Bhubaneswar (3–7 January 2012).
E-mail address: Narainprem@Hotmail.Com.
http://dx.doi.org/10.1016/j.cmrp.2014.10.004
2352-0817/Copyright 2014, Sir Ganga Ram Hospital. Published by Reed Elsevier India Pvt. Ltd. All rights reserved.
Current Medicine Research and Practice 4 (2014) 225–229
2.
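$$Y = Xb + e \tag{1}$$

$$\hat{b} = (X'X)^{-1} X'Y \tag{2}$$

$$\hat{b}_{ridge} = (X'X + kI)^{-1} X'Y \tag{3}$$

$$\min_{b} \sum_{i} \Big( y_i - \sum_{j} b_j X_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j} |b_j| \le t, \; t \ge 0 \tag{4}$$

$$\sum_{i} \Big( y_i - \sum_{j} b_j X_{ij} \Big)^2 + \lambda \sum_{j} |b_j| \tag{5}$$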
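As a minimal sketch of how the least squares estimate (2) and the ridge estimate (3) differ, the following pure-Python example solves the corresponding normal equations for a toy data set. The genotype matrix, the phenotype values, the value of k, and the helper names (`transpose`, `solve`, `ridge`) are illustrative assumptions and are not taken from the cited studies; in practice a numerical library would replace the hand-written solver.

```python
# Toy illustration of estimators (2) and (3); all data and names are hypothetical.

def transpose(M):
    return [list(col) for col in zip(*M)]

def solve(A, rhs):
    """Solve A z = rhs by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(A)
    M = [list(A[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        # partial pivoting: bring the largest entry of column c to the diagonal
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * v for x, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge(X, y, k=0.0):
    """Return (X'X + kI)^{-1} X'y; k = 0 gives the least squares estimate."""
    Xt = transpose(X)
    p = len(Xt)
    A = [[sum(a * b for a, b in zip(Xt[i], Xt[j])) + (k if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    Xty = [sum(x * v for x, v in zip(col, y)) for col in Xt]
    return solve(A, Xty)

# Two centred, correlated marker columns (mimicking SNPs in LD) and a centred trait.
X = [[-1, -1], [-1, 0], [0, 0], [1, 0], [1, 1]]
y = [-2.1, -0.9, 0.1, 1.0, 1.9]

b_ols = ridge(X, y, k=0.0)
b_ridge = ridge(X, y, k=1.0)
print("least squares:", b_ols)
print("ridge (k = 1):", b_ridge)
```

With k > 0 the coefficient vector is pulled towards zero, which is the shrinkage that stabilises estimates when marker columns are strongly correlated.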
2.1.
The computation of Lasso is a quadratic programming problem for which various algorithms exist, the most useful being Least Angle Regression (LAR) given by Efron et al5. In LAR, the coefficient b_j of the predictor x_j most correlated with y is increased in the direction of the sign of its correlation with y, with the residuals r updated along the way, until some other predictor x_k has as much correlation with r as x_j has. Then (b_j, b_k) are increased in their joint least squares direction until some other predictor x_m has as much correlation with r. The process continues until all predictors are in the model.
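The stepwise idea described above can be illustrated with incremental forward stagewise regression, a closely related small-step procedure. Note that this is a swapped-in technique and not the LAR algorithm itself: instead of moving along the joint least squares direction, it repeatedly nudges the coefficient of whichever predictor is currently most correlated with the residual. The function name, step size, and toy data below are assumptions made for illustration only.

```python
# Incremental forward stagewise regression (a small-step relative of LAR, not LAR itself).
# Assumes the columns of X are standardized to equal norm, so that inner products
# with the residual rank predictors in the same way correlations do.

def forward_stagewise(X, y, step=0.01, max_steps=10000, tol=1e-8):
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)        # residuals equal y while all coefficients are zero
    entered = []       # order in which predictors first receive a nonzero coefficient
    for _ in range(max_steps):
        # inner product of each predictor with the current residual
        c = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda m: abs(c[m]))
        if abs(c[j]) < tol:
            break      # no predictor is correlated with the residual any more
        if j not in entered:
            entered.append(j)
        # move b_j a small step in the direction of the sign of its correlation
        delta = step if c[j] > 0 else -step
        b[j] += delta
        r = [r[i] - delta * X[i][j] for i in range(n)]
    return b, entered

# Hypothetical toy data: two orthogonal predictors, true coefficients (2, 1).
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3, 1, -1, -3]

b, entered = forward_stagewise(X, y)
print("coefficients:", b)
print("entry order:", entered)
```

In this toy run the predictor with the larger initial correlation enters first, and the second one joins once its correlation with the residual has caught up, mirroring the sequence of events in the description of LAR.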
2.2.
level with a Bonferroni correction as against 14 in the 26 single-locus analyses. Among these, 11 SNPs in the single-locus analyses and 5 SNPs in the ridge regression analysis were significant in one method but not the other. Accounting for LD among the SNPs radically changed which of the SNPs are likely to be causally associated with the CHI3L2 phenotype.
3. Multiple traits
$$Y_i = Xb_i + e_i \tag{6}$$

for the i-th trait. The phenotypic data can then be represented by an n × k matrix Y with Y_i as the i-th column; b_i = (b_{1i}, b_{2i}, …, b_{pi})^T is a column vector of regression coefficients for the i-th trait, and e_i is a column vector of n independent errors with mean zero and common variance.
Then an estimate of B = [b_1, b_2, …, b_k] is obtained by minimizing the residual sum of squares:

$$\hat{B} = \arg\min \sum_{i} (Y_i - Xb_i)^T (Y_i - Xb_i) \tag{7}$$

$$\sum_{i} (Y_i - Xb_i)^T (Y_i - Xb_i) + \lambda \sum_{i} \sum_{j} |b_{ji}| \tag{8}$$
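Because the penalty in (8) is a plain sum over traits, the criterion separates into k single-trait lasso problems that can be solved one trait at a time. The sketch below does this with cyclic coordinate descent and soft-thresholding, which is one common way to solve the lasso and is an assumption here, not the algorithm used in the cited work. The function names, the toy data, and the value of the penalty are likewise hypothetical.

```python
# Multi-trait lasso as k independent single-trait lasso fits (criterion (8)).
# Solver: cyclic coordinate descent, chosen for illustration only.

def soft_threshold(z, g):
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - sum_j b_j X_ij)^2 + lam * sum_j |b_j|.
    Assumes every column of X has a nonzero sum of squares."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)                                  # current residuals
    for _ in range(n_sweeps):
        for j in range(p):
            xj = [X[i][j] for i in range(n)]
            z = sum(v * v for v in xj)
            # inner product of x_j with the residual that excludes predictor j
            rho = sum(xj[i] * r[i] for i in range(n)) + z * b[j]
            # threshold is lam / 2 because the squared error is not halved
            new = soft_threshold(rho, lam / 2.0) / z
            r = [r[i] - (new - b[j]) * xj[i] for i in range(n)]
            b[j] = new
    return b

def multi_trait_lasso(X, Y, lam):
    """Y is n x k (rows = individuals); returns B as a p x k list of lists."""
    k = len(Y[0])
    columns = [lasso_cd(X, [row[i] for row in Y], lam) for i in range(k)]
    return [list(row) for row in zip(*columns)]

# Hypothetical toy data: 4 individuals, 2 markers, 2 traits.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
Y = [[3, 2], [1, -2], [-1, 2], [-3, -2]]

B = multi_trait_lasso(X, Y, lam=8.0)
print(B)
```

Each column of B is estimated without reference to the other traits, so this penalty alone cannot borrow strength across correlated phenotypes.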
4. Graph-guided fused lasso (GFLasso) for multiple correlated traits
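The additional penalty term encourages a fusion of the two regression coefficients b_jm and b_jl for each SNP marker j when the traits m and l are connected by an edge in the QTN, with a weight that depends on the correlation coefficient r_ml. This modifies (8) to

$$\hat{B}^{GFlasso} = \arg\min \sum_{i} (Y_i - Xb_i)^T (Y_i - Xb_i) + \lambda \sum_{i} \sum_{j} |b_{ji}| + \gamma \sum_{(m,l) \in E} f(r_{ml}) \sum_{j} \big| b_{jm} - \mathrm{sign}(r_{ml})\, b_{jl} \big| \tag{9}$$

where E denotes the set of edges of the QTN.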
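To make the three terms of the GFLasso criterion (9) concrete, the following sketch evaluates the objective for a given coefficient matrix. It only computes the value of the criterion and does not perform the minimization. The weight function f is taken to be the absolute value, which is one possible choice and an assumption here; the function names, edge list format, and toy numbers are also hypothetical.

```python
# Evaluate the GFLasso objective (9) for a given B; no optimization is done here.
# Assumption: f(r) = |r| is used as the edge weight function.

def sign(r):
    return 1.0 if r >= 0 else -1.0

def gflasso_objective(X, Y, B, edges, lam, gamma, f=abs):
    """X: n x p, Y: n x k, B: p x k, edges: list of (m, l, r_ml) for the QTN."""
    n, p, k = len(X), len(B), len(B[0])
    # residual sum of squares over all traits
    rss = 0.0
    for i in range(k):
        for s in range(n):
            fit = sum(X[s][j] * B[j][i] for j in range(p))
            rss += (Y[s][i] - fit) ** 2
    # lasso penalty on every coefficient
    l1 = sum(abs(B[j][i]) for j in range(p) for i in range(k))
    # fusion penalty over edges of the trait network
    fusion = sum(
        f(r) * sum(abs(B[j][m] - sign(r) * B[j][l]) for j in range(p))
        for (m, l, r) in edges
    )
    return rss + lam * l1 + gamma * fusion

# Hypothetical toy data: 2 individuals, 1 marker, 2 positively correlated traits.
X = [[1], [-1]]
Y = [[1, 1], [-1, -1]]
edges = [(0, 1, 0.8)]

obj_fused = gflasso_objective(X, Y, [[1.0, 1.0]], edges, lam=1.0, gamma=2.0)
obj_unfused = gflasso_objective(X, Y, [[1.0, 0.5]], edges, lam=1.0, gamma=2.0)
print(obj_fused, obj_unfused)
```

For a negatively correlated pair of traits, the sign term means that coefficients of equal size and opposite sign incur no fusion penalty.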
4.1. Application to polymorphism in IL-4R gene and asthma-traits
The asthma dataset used by Kim and Xing7 involved 543 asthma patients as a part of the Severe Asthma Research Program (SARP). The genotype data were obtained for 34 SNPs within or near the IL-4R gene, which spans a 40 kb region on chromosome 16. The phenotype data included 53 clinical traits related to severe asthma, such as age of onset, family history, and severity of various symptoms. To investigate whether any of the SNPs in the IL-4R gene were associated with a subnetwork of correlated traits rather than an individual trait, the measurements for each phenotype were first standardized to have zero mean and unit standard deviation, and the pairwise correlations between the traits were then computed and thresholded at r = 0.7 to obtain the QTN. A comprehensive comparison of QTL mapping using GFLasso and other methods, such as single-marker single-trait association, ridge regression, and lasso, showed that the GFLasso method gives greater power in detecting causal variants than the other methods, taking the absolute values of the estimated regression coefficients as a measure of association strength.
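The construction of a QTN by standardizing traits and thresholding their pairwise correlations can be sketched as follows. Whether the threshold is applied to the correlation itself or to its absolute value is not stated above; the sketch assumes the absolute value, so that strongly negatively correlated traits are also linked. The function names and the four toy traits are hypothetical.

```python
# Build a quantitative trait network by thresholding pairwise trait correlations.
# Assumption: an edge is kept when |r| >= threshold.
from math import sqrt

def standardize(values):
    """Rescale a trait to zero mean and unit standard deviation (trait must vary)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def build_qtn(traits, threshold=0.7):
    """traits: list of k trait columns, each of length n.
    Returns a list of edges (m, l, r_ml) with m < l."""
    Z = [standardize(t) for t in traits]
    n = len(Z[0])
    edges = []
    for m in range(len(Z)):
        for l in range(m + 1, len(Z)):
            # correlation = mean product of standardized values
            r = sum(a * b for a, b in zip(Z[m], Z[l])) / n
            if abs(r) >= threshold:
                edges.append((m, l, r))
    return edges

# Hypothetical toy traits measured on 4 individuals.
traits = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, -1, 1, -1],
    [4, 3, 2, 1],
]
qtn = build_qtn(traits, threshold=0.7)
print(qtn)
```

The resulting edge list, with the signed correlation attached to each edge, is the form of input that a graph-guided fusion penalty needs.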
All the methods found a SNP, named Q551R, to be significantly associated with a block of correlated phenotypes related to lung physiology, comprising 10 clinical phenotypes, with p-values for the single-marker analysis being 2.0 × 10⁻⁴. This analysis also showed that upstream of Q551R there was a set of adjacent SNPs (rows 24–27) having a high level of association with the same subset of lung physiology traits, with p-values ranging from 2.0 × 10⁻⁴ to 7.6 × 10⁻³. Lasso, on the other hand, set the regression coefficients for most of this block of SNPs to zero, thus ignoring the possibly irrelevant markers (rows 26 and 27), which were merely in strong LD with the causal SNP Q551R. By contrast, the other two SNPs (rows 24 and 25) in the same block were in weak LD with
5. Conclusions
Conflicts of interest
The author has none to declare.
Acknowledgments
The author is grateful to Dr. R. L. Sapra, Biostatistician and
Editorial Board Member, CMRP for making useful suggestions
and to the Indian National Science Academy, New Delhi for
support under their program INSA Honorary Scientist.
references
1. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753.
2. Narain P. Statistical Genetics. New Delhi: John Wiley, New York & Wiley Eastern Ltd.; 1990:599.
3. Fisher RA. On the correlation between relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edin. 1918;52:399–433.
4. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc Ser B. 1996;58:267–288.
5. Efron B, Hastie T, Johnstone I, Tibshirani R. Least Angle Regression. Technical Report. CA, USA: Stanford University; 2002.
6. Malo N, Libiger O, Schork NJ. Accommodating linkage disequilibrium in genetic association analyses via ridge regression. Am J Hum Genet. 2008;82:375–385.
7. Kim S, Xing EP. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 2009;5:e1000587. http://dx.doi.org/10.1371/journal.pgen.1000587.
8. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J Roy Stat Soc Ser B. 2005;67:91–108.