
Current Medicine Research and Practice 4 (2014) 225–229

Available online at www.sciencedirect.com

ScienceDirect
journal homepage: www.elsevier.com/locate/cmrp

Research Methods

Linear models in genomic studies*


P. Narain
INSA Honorary Scientist, Former Director, IASRI, New Delhi, India

article info

Article history:
Received 27 August 2014
Accepted 1 October 2014
Available online 31 October 2014

Keywords:
Genomic studies
GWAS
Linear models

abstract

With the help of molecular markers, genome-wide association studies (GWAS) are conducted to identify genes associated with diseases. Association mapping uses unrelated individuals from the same population that has undergone recombination in many generations since the inception of the mutant gene and is the basis for detection of causal genes. The data that form the basis for computational detection of causal genes are of three kinds: phenotypic values (single trait or several traits), genotypes of hundreds of thousands of SNP markers, and data on gene expression, a sort of intermediate phenotype that is used to associate genes with disease phenotypes. Most studies, however, consider a single trait at a time and take either phenotypes and marker genotypes only or phenotypes, genotypes and gene expression all together. In actual situations, on the other hand, the problem is multivariate since many complex disease syndromes consist of a large number of highly related clinical or molecular phenotypes. For instance, asthma is influenced by as many as 53 clinical traits that can be represented as a quantitative trait network (QTN). The methodological issue is then to conduct association analysis that takes into account jointly all the relevant traits instead of a single trait only. Linear models in which a dependent variable (expression of a disease trait) is related to a set of independent variables (for instance, SNPs) provide a very versatile tool that can be used for association analysis both for a single trait and for multiple traits. To this end, we systematically discuss the sparse regression methodology of Ridge Regression, Lasso, and GFLasso with illustrations from published literature.
Copyright 2014, Sir Ganga Ram Hospital. Published by Reed Elsevier India Pvt. Ltd. All rights reserved.

* Presented as an invited paper in Mathematical Sciences Section of the 99th Indian Science Congress held at KIIT University, Bhubaneswar (3–7 January 2012).
E-mail address: Narainprem@Hotmail.Com.
http://dx.doi.org/10.1016/j.cmrp.2014.10.004
2352-0817/Copyright 2014, Sir Ganga Ram Hospital. Published by Reed Elsevier India Pvt. Ltd. All rights reserved.

1. Introduction

Many of the common diseases in humans exhibit quantitative variation, the genetics of which is complex in that it involves multiple genes and is subject to modification by environmental factors. With the help of molecular markers, genome-wide association studies (GWAS) have identified genes associated with diseases in several cases, but they explain only a small fraction of the variability, leading to the frequently asked question of missing heritability.1 In the pre-genomic era statistical considerations were predominant in dissecting such complex traits into estimable components.2 Heritability of a trait, the proportion of phenotypic variation that is attributed to genetic causes, has been a prime indicator since the introduction of the infinitesimal model.3 The relationship between the phenotype and the genotype was like a black box wherein one had to adopt an inferential approach to look into it. This scenario is now changing with the advent of modern technologies such as high-throughput single nucleotide polymorphism (SNP) chips, gene and protein sequencing, and microarray experiments, and with the enormous advances taking place in the medical field in attempts to understand gene and protein expression within the cell of an organism. Gene sequence data allow one to identify individual nucleotide-level variations associated with the disease, whereas gene expression data enable one to identify genes whose products may be implicated in a disease pathway.
The information on molecular markers has been extremely
helpful in identifying the regions on chromosomes termed as
quantitative trait loci (QTL) that bring about variation in the
trait. Saturated genetic maps of markers, giving their order
along a chromosome and relative distances between them
have been developed. The gene transcript data from microarray experiments are integrated with molecular marker information to map expression traits (eQTL) that lead to causal
networks. These developments have proceeded along broadly
two lines. In the linkage-based analysis the segregating
markers that can predict the organismal phenotype and are
close to causal genes are identified using a mapping population derived from a cross between genetically divergent
strains, an approach mostly adopted in plants, animals, and
model organisms like the fruit fly Drosophila, mice and yeast. The
other approach, association mapping, uses unrelated
individuals from the same population that has undergone
recombination in many generations since the inception of the
mutant gene and is the basis for detection of causal genes in
humans. In either strategy, the object of the mapping population is to supply the genotypic variation through which
variation in the organismal phenotype can be explained and
involves computational detection of causal genes. An extensive
literature exists on methodological issues as well as applications to plants, animals, humans and laboratory model organisms. We consider here issues that are basically
methodological, are applicable to disease traits in humans, and
take into account whole genome association (WGA) on a
large scale. The data that form the basis for computational
detection of causal genes are of three kinds, phenotypic
values (single trait or several traits), genotypes of hundreds of
thousands of SNP markers, and data on gene expression, a
sort of intermediate phenotypes that are used to associate
genes with disease phenotypes. Most studies, however, consider a
single trait at a time and take either
phenotypes and marker genotypes only or phenotypes, genotypes and gene expression all together. In actual
situations, on the other hand, the problem is multivariate
since many complex disease syndromes consist of a large
number of highly related clinical or molecular phenotypes. For
instance, asthma is influenced by as many as 53 clinical traits
that can be represented as a quantitative trait network (QTN).
The methodological issue is then to conduct association
analysis that takes into account jointly all the relevant traits
instead of a single trait only.

2. Sparse or regularized regression methods

With hundreds of thousands of SNPs genotyped on a large number of subjects, the association strength between multiple loci within particular genomic regions and specific phenotypes is usually studied with the linear regression model which, in matrix notation, is given by

$$Y = X\beta + e \tag{1}$$

where X is an n × p matrix, with p the number of SNPs or other forms of genetic variation genotyped on a set of n subjects, giving the values of the dummy variables corresponding to SNP genotypes AA, Aa, aa, coded as 0.0, 0.5, 1.0 respectively; β is the set of regression coefficients expressed as a p × 1 vector; e is an n × 1 vector of residuals with common variance; and Y is an n × 1 vector of phenotypic values of the individuals. The regression coefficients, estimated by minimizing the residual sum of squares, are then given by

$$\hat{\beta} = (X'X)^{-1} X'Y \tag{2}$$

Regularized regression methods, on the other hand, are those in which the effect of a redundant variable is shrunk towards zero by imposing a penalty on its size, and thus help in separating the causal variations from those that are simply in LD with causal loci within a genomic region. In other words, when among the multiple SNPs in strong LD some subset is causally associated with the phenotype, the usual multiple regression method is not able to discern them owing to the high correlation produced by strong LD. Therefore, to select a relatively small subset of SNPs as associated with the trait while setting the regression coefficients for the rest of the markers to zero, the residual sum of squares is penalized with an Lq norm (q > 0). Ridge regression is one such method, in which the L2 norm is used. It leads to the estimates

$$\hat{\beta}_{ridge} = (X'X + kI)^{-1} X'Y \tag{3}$$

where k > 0 is the ridge parameter representing the degree of shrinkage. The term kI in the multiplier matrix prevents it from being singular, so that estimates of the regression coefficients can be obtained. However, ridge regression only shrinks the regression coefficients for irrelevant markers towards zero and does not set them exactly at zero.
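
As a minimal sketch of Eqs. (2) and (3), assume a toy data set of two SNPs in perfect LD, so that their genotype columns are identical. All values and function names below are hypothetical and are not taken from any of the cited studies. The example shows that X'X is then singular, so that Eq. (2) cannot be computed, whereas adding kI makes the system solvable and splits the effect equally between the two SNPs.

```python
def gram_2x2(X):
    """Return the three distinct entries of X'X for an n x 2 design matrix."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    return a, b, d  # X'X = [[a, b], [b, d]]


def ridge_2snp(X, y, k):
    """Solve (X'X + kI) beta = X'y for two predictors; k = 0 gives Eq. (2)."""
    a, b, d = gram_2x2(X)
    c0 = sum(r[0] * yi for r, yi in zip(X, y))
    c1 = sum(r[1] * yi for r, yi in zip(X, y))
    a, d = a + k, d + k
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ZeroDivisionError("X'X + kI is singular")
    # Explicit inverse of a symmetric 2 x 2 matrix applied to X'y.
    return ((d * c0 - b * c1) / det, (a * c1 - b * c0) / det)


# Hypothetical genotypes coded 0.0 / 0.5 / 1.0; the two SNPs are in perfect LD.
X = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
y = [0.1, 1.0, 2.1, 0.9, 1.9]

try:
    ridge_2snp(X, y, k=0.0)   # ordinary least squares, Eq. (2)
    ols_failed = False
except ZeroDivisionError:
    ols_failed = True         # X'X cannot be inverted

beta_ridge = ridge_2snp(X, y, k=0.1)  # Eq. (3) with a small ridge parameter
```

Neither coefficient is set to zero here, which illustrates the remark that ridge regression shrinks but does not select.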
Lasso (Least Absolute Shrinkage and Selection Operator), on the other hand, uses an L1 norm as a penalty that sets the parameters for irrelevant markers exactly to zero.4 Here we minimize the usual sum of squared errors with a bound on the sum of the absolute values of the coefficients, i.e. minimize

$$\sum_{i} \Big( y_i - \sum_{j} \beta_j X_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j} |\beta_j| \le t, \; t \ge 0 \tag{4}$$

or, equivalently, minimize, for non-negative λ,

$$\sum_{i} \Big( y_i - \sum_{j} \beta_j X_{ij} \Big)^2 + \lambda \sum_{j} |\beta_j|$$

Such estimates are obtained by solving an L1-regularized linear regression, giving the estimates as the minimizer in

$$\hat{\beta}_{lasso} = \arg\min \Big[ \beta^T X^T X \beta - 2 y^T X \beta + \lambda \sum_{j} |\beta_j| \Big] \tag{5}$$


where λ is a regularization parameter that controls the amount of sparsity in the estimated regression coefficients. Setting λ to a larger value increases the amount of penalization, setting more regression coefficients to zero. Several efficient algorithms, for instance LAR (Least Angle Regression), are available for solving the optimization problem defined by Eq. (5).
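
The effect of λ in Eq. (5) can be illustrated with a short sketch. It uses cyclic coordinate descent with soft-thresholding, which is swapped in here for simplicity and is not the LAR algorithm named above. The data are hypothetical and no intercept is fitted, matching Eq. (4). Because the loss carries no factor of 1/2, the threshold applied to each coordinate is λ/2.

```python
def soft_threshold(z, g):
    """Shrink z toward zero by g, returning exactly 0.0 when |z| <= g."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise sum_i (y_i - sum_j b_j X_ij)^2 + lam * sum_j |b_j| by cyclic
    coordinate descent. X is a list of rows; no intercept is fitted."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of predictor j with the residual that leaves out its own fit.
            rho = 0.0
            for i in range(n):
                fit_others = sum(beta[m] * X[i][m] for m in range(p) if m != j)
                rho += X[i][j] * (y[i] - fit_others)
            # lam / 2 because the loss above carries no factor of 1/2.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    return beta


# Hypothetical toy data: the trait depends on SNP 0 only.
X = [[0.0, 1.0, 0.5], [0.5, 0.0, 1.0], [1.0, 0.5, 0.0],
     [1.0, 1.0, 1.0], [0.0, 0.0, 0.5], [0.5, 1.0, 0.0]]
y = [2.0 * row[0] for row in X]

beta_small = lasso_cd(X, y, lam=0.0)    # no penalty: ordinary least squares
beta_mid = lasso_cd(X, y, lam=2.0)      # SNP 0 shrunk, the other two exactly zero
beta_large = lasso_cd(X, y, lam=100.0)  # heavy penalty: every coefficient is zero
```

With these toy values the coefficient of SNP 0 moves from 2.0 to about 1.6 and then to 0 as λ grows, while the irrelevant SNPs stay at exactly zero throughout.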
Owing to the nature of the L1 penalty, lasso does both continuous shrinkage and automatic variable selection simultaneously. It has a sparse representation, selecting at most n variables out of p candidates before it saturates. However, if there is a group of variables among which the pairwise correlations are very high, lasso tends to select only one variable from the group, without regard to which one is selected.
Small values of the tuning parameter t lead to shrunken estimates, which are often exactly zero, whereas very large values result in estimates as in ordinary least squares (OLS). Choosing t is like choosing the number of predictors in a regression model. Cross-validation, in which the data set is divided into training and validation sets, is a good tool for estimating the best value of t or, equivalently, of λ. The regression coefficients are estimated on the training set for a number of values of λ, and the λ that gives the lowest squared error on the validation set is selected. With the parameter thus chosen, the combined data set is used to obtain the final estimate of the regression coefficients.
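
The train, validate and refit procedure just described can be sketched as follows. To keep the example short it assumes a single SNP, for which the lasso has a closed form; the data, the grid of λ values and the function names are all hypothetical.

```python
def lasso_one_snp(x, y, lam):
    """Closed-form lasso for a single predictor without intercept:
    minimises sum_i (y_i - b * x_i)^2 + lam * |b|."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


def validation_sse(b, x, y):
    """Sum of squared prediction errors of coefficient b on held-out data."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))


def choose_lambda(x_tr, y_tr, x_va, y_va, grid):
    """Fit on the training set for each lambda and keep the value
    with the lowest squared error on the validation set."""
    scored = [(validation_sse(lasso_one_snp(x_tr, y_tr, lam), x_va, y_va), lam)
              for lam in grid]
    return min(scored)[1]


# Hypothetical genotype codes and trait values.
x_train, y_train = [0.0, 0.5, 1.0, 1.0], [0.2, 1.3, 2.4, 2.2]
x_valid, y_valid = [0.5, 1.0, 0.0], [1.0, 2.0, 0.0]
grid = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]

best_lam = choose_lambda(x_train, y_train, x_valid, y_valid, grid)
# Final estimate: refit on the combined data with the chosen lambda.
beta_final = lasso_one_snp(x_train + x_valid, y_train + y_valid, best_lam)
```

A single split is used here for brevity; in practice the split is usually repeated over several folds and the validation errors averaged.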

2.1. Least Angle Regression (LAR)

The computation of Lasso is a quadratic programming problem for which various algorithms exist, the most useful being the Least Angle Regression (LAR) given by Efron et al.5 Starting with all coefficients at zero, the coefficient βj corresponding to the predictor xj most correlated with y is increased in the direction of the sign of its correlation with y. Taking the residuals r along the way, this continues until some other predictor xk has as much correlation with r as xj has. Then (βj, βk) are increased in their joint least squares direction until some other predictor xm has as much correlation with r. This continues until all predictors are in the model.
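
The order in which predictors enter the model can be illustrated with incremental forward stagewise regression. This is a small-step relative of LAR that Efron et al. also discuss; it is not LAR itself, which moves along the joint least squares direction in exact steps. The design matrix, the step size and the function name below are hypothetical choices for illustration.

```python
def forward_stagewise(X, y, step=0.01, n_steps=5000):
    """Incremental forward stagewise regression: at each step, nudge the
    coefficient of the predictor most correlated with the current residual."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    entry_order = []
    for _ in range(n_steps):
        # Inner products with the residual stand in for correlations
        # (the columns are assumed to be on a comparable scale).
        corr = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda m: abs(corr[m]))
        if abs(corr[j]) < 1e-8:
            break  # residual is uncorrelated with every predictor
        delta = step if corr[j] > 0 else -step
        beta[j] += delta
        for i in range(n):
            resid[i] -= delta * X[i][j]
        if j not in entry_order:
            entry_order.append(j)
    return beta, entry_order


# Hypothetical orthogonal design; the trait is 3 * x0 + 1 * x1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [4.0, 2.0, -2.0, -4.0]

beta, order = forward_stagewise(X, y)
```

Here x0 enters first because it is the more correlated with y, and x1 joins once its correlation with the residual has caught up, which mirrors the description above.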

2.2. Application to data on CHI3L2 region

Malo et al,6 while advocating the use of ridge regression, analyzed CEPH Family Gene Expression data involving 26 SNPs collected on 57 unrelated individuals from the International HapMap Project database for studying association within the CHI3L2 gene region. The multiple regression model (1) was found to fit only 11 of the 26 SNPs and resulted in several missing regression coefficient values. It is well known in the statistical literature that if the independent variables in the multiple regression model are highly correlated, the matrix X'X in (2) tends to be singular and cannot be inverted to give the estimates of the regression coefficients. The ridge regression method obviates this difficulty. The analysis showed that the ridge regression method performed better than multiple regression as well as single-locus based analyses in distinguishing the variations that are causally related to the phenotype from those that are merely in LD with these variants. This method gave 8 SNPs as significant at an overall 5% level with a Bonferroni correction, as against 14 in the 26 single-locus analyses. Among these, 11 SNPs in the single-locus based analyses and 5 SNPs in the ridge regression analysis were significant in one method and not the other. Accounting for LD among the SNPs radically changed which of the SNPs are likely to be causally associated with the CHI3L2 phenotype.

3. Multiple traits

When we have several traits to consider, as for instance in a disease like asthma that cannot be characterized by a single phenotype but involves as many as 53 clinical traits, the same SNP can have pleiotropic effects, affecting several of the traits and leading to correlations between them. The lasso procedure discussed above therefore needs to be extended to discover association strengths jointly for multiple correlated traits while maintaining sparsity. This is best done by first considering the complex correlation pattern among the traits expressed as a quantitative trait network (QTN). In the asthma dataset collected under the Severe Asthma Research Program (SARP), the QTN consists of several subnetworks corresponding to different clinical aspects of asthma such as quality of life, asthma symptoms, and lung physiology. Highly correlated traits in a subnetwork may have some common genetic basis, and the goal is then to identify SNPs that are associated with a subnetwork of clinical traits rather than individual traits. It is assumed that we have, by initial processing steps, a QTN denoted by G with a set of nodes V and a set of edges E. Each edge (m, l) ∈ E in a QTN G is associated with a weight based on the correlation between the two nodes m and l connected by the edge. Using pairwise correlation coefficients between the traits, we connect two nodes by an edge if the correlation coefficient is above a threshold, say ρ. The weight is then the absolute value of the correlation coefficient, |r_ml|.
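
The construction of the edge set E can be sketched as below. The trait names and values are hypothetical, and the sketch assumes that the absolute value of the correlation is what is compared with the threshold ρ, which is consistent with the edge weight being |r_ml|. The signed correlation is stored because its sign is needed by the fusion penalty introduced later.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length trait vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def build_qtn(traits, rho):
    """Return the edge set E as {(m, l): r_ml} for trait pairs with |r_ml| > rho.
    `traits` maps a trait name to its values over the same individuals."""
    names = sorted(traits)
    edges = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            r = pearson(traits[names[a]], traits[names[b]])
            if abs(r) > rho:
                edges[(names[a], names[b])] = r  # edge weight is |r|; sign is kept
    return edges


# Hypothetical trait values for five individuals (names are illustrative only).
traits = {
    "fev1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "fvc": [1.1, 1.9, 3.2, 3.9, 5.1],
    "symptom_score": [5.0, 4.1, 2.9, 2.0, 1.2],
    "age_of_onset": [3.0, 1.0, 4.0, 1.0, 5.0],
}
qtn = build_qtn(traits, rho=0.7)
```

In this toy example the three strongly correlated traits form one subnetwork, including a negatively correlated pair, while the fourth trait remains an isolated node.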
For the linear regression of k traits treated independently, we modify (1) as

$$Y_i = X\beta_i + e_i, \quad i = 1, 2, \ldots, k \tag{6}$$

for the i-th trait. The phenotypic data can then be represented by an n × k matrix Y with Y_i as the i-th column; β_i = (β_1i, β_2i, …, β_pi)^T is a column vector of regression coefficients for the i-th trait and e_i is a column vector of n independent errors with mean zero and common variance.
Then an estimate of B = [β_1, β_2, …, β_k] is obtained by minimizing the residual sum of squares:

$$\hat{B} = \arg\min \sum_{i} (Y_i - X\beta_i)^T (Y_i - X\beta_i) \tag{7}$$

The lasso estimates in such a case are obtained by solving the following L1-regularized linear regression:

$$\hat{B}_{lasso} = \arg\min \Big[ \sum_{i} (Y_i - X\beta_i)^T (Y_i - X\beta_i) + \lambda \sum_{i} \sum_{j} |\beta_{ji}| \Big] \tag{8}$$

where λ is a regularization parameter that controls the amount of sparsity in the estimated regression coefficients. This is equivalent to solving a set of k independent regressions, one for each trait with its own L1 penalty. It does not combine information across multiple traits, so that the estimates do not give information on the relatedness in the regression coefficients for those correlated traits in the QTN that are affected by common SNPs. Kim and Xing7 therefore extended the standard lasso to include a fusion penalty as first described by Tibshirani et al8 in classical linear regression problems with a time element. This is described below.

4. Graph-guided fused lasso (GFLasso) for multiple correlated traits

The additional penalty term encourages a fusion of two regression coefficients β_jm and β_jl for each SNP marker j when the traits m and l are connected by an edge in the QTN with weight based on the correlation coefficient r_ml. This modifies (8) to

$$\hat{B}_{GFlasso} = \arg\min \Big[ \sum_{i} (Y_i - X\beta_i)^T (Y_i - X\beta_i) + \lambda \sum_{i} \sum_{j} |\beta_{ji}| + \gamma \sum_{(m,l) \in E} f(r_{ml}) \sum_{j} |\beta_{jm} - \mathrm{sign}(r_{ml})\, \beta_{jl}| \Big] \tag{9}$$

where γ is the regularization parameter for the fusion penalty. This fusion penalty encourages β_jm and sign(r_ml)β_jl to take the same value by shrinking the difference between them toward zero but not exactly at zero. This tends to flatten the values of the coefficients for each marker across multiple highly correlated traits so that the effect of each marker becomes similar across those traits. Also, since the weight f(r_ml) attached to each term in the penalty is a function of the correlation coefficient r_ml, the amount of correlation between the two traits controls the amount of fusion for each edge in the QTN. When the correlation between two traits is high, the weight is relatively large and the fusion effect more intensive, so that the difference between the two corresponding regression coefficients is penalized more than for other pairs of traits with weaker correlations. In Kim and Xing7 the function f(r) was taken as |r| and r² to give two variations of GFLasso. They also considered, as a special case, the function set to just 1, giving a GFLasso that exploits only the graph topology of a QTN without any edge weights attached to the terms of the fusion penalty.
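
To make the three terms of Eq. (9) concrete, the following sketch evaluates the bracketed objective for a given coefficient matrix B. It is only an evaluator, not the optimization procedure of Kim and Xing; the data layout, the argument names and the toy values are assumptions made for illustration. The weight function f defaults to |r| and can be replaced by r² or by the constant 1 to obtain the variants mentioned above.

```python
def gflasso_objective(X, Y, B, edges, lam, gamma, f=abs):
    """Evaluate the bracketed expression in Eq. (9).
    X: n x p rows, Y: n x k rows, B: p x k rows with B[j][i] = beta_ji,
    edges: {(m, l): r_ml} with m and l trait column indices."""
    n, p, k = len(X), len(X[0]), len(Y[0])
    # Residual sum of squares summed over the k traits.
    rss = 0.0
    for i in range(k):
        for s in range(n):
            fit = sum(X[s][j] * B[j][i] for j in range(p))
            rss += (Y[s][i] - fit) ** 2
    # Lasso penalty over every SNP and trait.
    l1 = sum(abs(B[j][i]) for j in range(p) for i in range(k))
    # Fusion penalty over the edges of the QTN.
    fusion = 0.0
    for (m, l), r in edges.items():
        sign = 1.0 if r >= 0 else -1.0
        fusion += f(r) * sum(abs(B[j][m] - sign * B[j][l]) for j in range(p))
    return rss + lam * l1 + gamma * fusion


# Hypothetical toy problem: 2 individuals, 2 SNPs, 2 positively correlated traits.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 1.0], [0.0, 0.0]]
edges = {(0, 1): 0.9}

B_fused = [[1.0, 1.0], [0.0, 0.0]]    # SNP 0 has the same effect on both traits
B_unfused = [[1.0, 0.5], [0.0, 0.0]]  # SNP 0 has unequal effects

obj_fused = gflasso_objective(X, Y, B_fused, edges, lam=0.1, gamma=1.0)
obj_unfused = gflasso_objective(X, Y, B_unfused, edges, lam=0.1, gamma=1.0)
obj_unfused_sq = gflasso_objective(X, Y, B_unfused, edges, lam=0.1, gamma=1.0,
                                   f=lambda r: r * r)
```

In this toy case the fused coefficients incur no fusion penalty, whereas the unequal effects are penalized in proportion to the edge weight.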
When two groups of highly correlated traits show a relatively weaker correlation across the two subnetworks, the
GFLasso can handle the hierarchical subgroup structure and
adjust the amount of fusion accordingly by weighting each
fusion term. In addition, when the association strength of an
SNP with pleiotropic effect varies over traits in a subnetwork,
GFLasso can use different levels of correlations for different
pairs of traits to adjust the amount of fusion in GFLasso. Then
this method tends to identify multiple blocks of fused
regression coefficients within a subnetwork. In the unweighted version of GFLasso, on the other hand, the set of
non-zero regression coefficients tend to show a single block
structure with the same or similar values across correlated
traits for each genotype marker. It may be noted that the
fusion penalty combines information across the two correlated traits for the given edge to potentially increase power for
detecting true associations while reducing false positives. For
example, if two traits are only weakly affected by a common
SNP, the fusion penalty for the corresponding edge can
combine the two weak signals and detect the associations that
might have been missed under a single trait analysis. Similarly, when the information of an SNP being irrelevant is
combined across two correlated traits connected with an edge
and both of the two regression coefficients for the irrelevant
SNP are fused to zero, it results in fewer false positives. When
the edge level fusion penalty is applied to all of the edges in
the QTN, the overall effect is to discover associations between
an SNP and the subnetwork of densely connected traits that
form a phenome as well as the associations between an SNP
and a single phenotype.
The optimization problem in Eq. (9) being convex is
formulated as a quadratic programming problem for which a
number of software packages are available. A more efficient
way, given in Kim and Xing,7 is to first transform (9), containing a non-smooth function of L1 norm, to an equivalent
form that involves only smooth functions and then use a fast
coordinate-descent algorithm to find the estimates of regression coefficients. Details can be found in the paper already
referred to above.
In the simulation experiments carried out by Kim and
Xing7 based on datasets of HapMap Consortium, the GFLasso
method was compared with methods based on single-marker
analysis and regression-based methods that do not use any of
the relational information in the traits. It was found that
borrowing information across correlated traits in a QTN, as in
GFLasso, significantly increases power of discovering true
causal SNPs with possibly pleiotropic effects.

4.1. Application to polymorphism in IL-4R gene and asthma-traits

The asthma dataset used by Kim and Xing7 involved 543 asthma patients as a part of the SARP. The genotype data were obtained for 34 SNPs within or near the IL-4R gene that spans a 40 kb region on chromosome 16. The phenotype data included 53 clinical traits related to severe asthma such as age of onset, family history, and severity of various symptoms. To investigate whether any of the SNPs in the IL-4R gene were associated with a subnetwork of correlated traits rather than an individual trait, pairwise correlations between these traits, after standardizing the measurements for each phenotype to have zero mean and unit standard deviation, were first computed and thresholded at ρ = 0.7 to obtain the QTN. A comprehensive comparison of QTL mapping using GFLasso and other methods such as single marker-single trait association, ridge regression, and lasso showed that the GFLasso method gives greater power in detecting causal variants than the other methods, taking the absolute values of the estimated regression coefficients as a measure of association strength.
All the methods found an SNP, named Q551R, to be significantly associated with a block of correlated phenotypes related to lung physiology, comprising 10 clinical phenotypes, with p-values for the single-marker analysis being 2.0 × 10⁻⁴. This analysis also showed that upstream of Q551R there was a set (rows 24–27) of adjacent SNPs having a high level of association with the same subset of traits for lung physiology, with p-values ranging from 2.0 × 10⁻⁴ to 7.6 × 10⁻³. On the other hand, lasso set the regression coefficients for most of this block of SNPs to zero, thus ignoring the possibly irrelevant markers (rows 26 and 27), which were merely in strong LD with the causal SNP Q551R. The other two SNPs (rows 24 and 25) in the same block were in weak LD with Q551R and were probably missed by lasso due to sparsity, although these might be unknown causal SNPs. Results of the ridge regression method did not, as expected, show a sparse structure. In the case of GFLasso, owing to the L1 norm term, the results showed the same property of sparsity as lasso in their estimates, so that the regression coefficients corresponding to the SNPs in rows 24–27 and the lung physiology traits were all set to zero.
The fusion penalty in GFLasso locally fused two regression
coefficients for a pair of correlated phenotypes. This effect
propagated through the edges of the QTN effectively applying
fusion to all of the phenotypes within each subgraph. In the
asthma dataset analyzed, because of this fusion penalty, the
estimated regression coefficients formed a block structure
where each block corresponds to an SNP associated with
several correlated traits. However, the regression coefficients
were not fused to the same value, but often contained several
small blocks of fused values. Also, the SNPs in rows 18 and 22
upstream of Q551R were associated with the same
block of traits as SNP Q551R, generating a new hypothesis that
these two SNPs and Q551R might be jointly associated with
the same subset of traits for lung physiology.
An important issue in the results obtained above is the role of the threshold ρ that determines the QTN and that was taken as 0.7. What would happen if this threshold were changed from 0.7? The association analysis was therefore performed for ρ = 0.3, 0.5, and 0.9 and compared with the results for ρ = 0.7. The numbers of non-zero regression coefficients for GFLasso were respectively 105, 108, and 125, as against 105 for ρ = 0.7. This showed that at the higher threshold of 0.9, where only a small number of edges were included in the QTN, the number of non-zero regression coefficients, viz. 125, was the same as that for lasso, which does not have a fusion penalty. On lowering the threshold to 0.7, this number decreased significantly to 105 and remained practically the same with further decrease in the threshold. This is because adding more edges by lowering ρ did not add any significant correlation information to the QTN, and the results of GFLasso were not sensitive to these additional edges with relatively little information.

5. Conclusions

To understand the genetic basis of a disease, particularly the complex ones like diabetes, hypertension, cardiovascular disorders, cancer, etc., millions of molecular markers like SNPs spread along the entire genome are genotyped on randomly selected individuals on which traits are also measured. The analysis of the resulting data requires relating the SNPs to the traits by linear regressions. Because a very large number of markers are irrelevant, though in LD with the traits, sparse regression methodology is often required so that the regression coefficients pertaining to such markers either tend to zero or are exactly zero. This has been systematically discussed in this paper, first for a single trait and then for a set of traits. The methodology is demonstrated from the published literature, particularly for asthma, where as many as 53 clinical traits on age of onset, family history, and severity of various symptoms are involved. In this case the analysis was done to find the association between the traits, expressed as a QTN, and 34 SNPs within or near the IL-4R gene on chromosome 16. The penalized regression of GFLasso applied to the asthma dataset was found to substantially reduce the number of non-zero regression coefficients and identified the causal SNP Q551R as significantly associated with the lung physiology traits.
It is concluded that sparse linear regression models could be very effective for association analysis on a whole genome basis involving single or multiple traits.

Conflicts of interest
The author has none to declare.

Acknowledgments
The author is grateful to Dr. R. L. Sapra, Biostatistician and
Editorial Board Member, CMRP for making useful suggestions
and to the Indian National Science Academy, New Delhi for
support under their program INSA Honorary Scientist.

References

1. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753.
2. Narain P. Statistical Genetics. New Delhi: John Wiley, New York & Wiley Eastern Ltd.; 1990:599.
3. Fisher RA. On the correlation between relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edin. 1918;52:399–433.
4. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc Ser B. 1996;58:267–288.
5. Efron B, Hastie T, Johnstone I, Tibshirani R. Least Angle Regression. Technical Report. CA, USA: Stanford University; 2002.
6. Malo N, Libiger O, Schork NJ. Accommodating linkage disequilibrium in genetic association analyses via ridge regression. Am J Hum Genet. 2008;82:375–385.
7. Kim S, Xing EP. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 2009;5:e1000587. http://dx.doi.org/10.1371/journal.pgen.1000587.
8. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J Roy Stat Soc Ser B. 2005;67:91–108.
