
A multivariate probit model for inferring missing haplotype/genotype data

David J. Lunn
Department of Epidemiology and Public Health, Imperial College London, St. Mary's Campus, Norfolk Place, London W2 1PG, UK
Tel: +44 (0)20 7594 3321; Fax: +44 (0)20 7402 2150
d.lunn@imperial.ac.uk

Carolina Osorio
Formerly: Department of Statistics, University College London
Currently: Institute of Mathematics, Swiss Institute of Technology

John C. Whittaker
Formerly: Department of Epidemiology and Public Health, Imperial College London
Currently: Department of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine

Summary
We present a Bayesian model that measures the extent of correlation between a set of genetic markers. The data are categorical and so (continuous-valued) auxiliary variables are introduced to allow arbitrary correlation structures. The model allows us both to predict the values of missing genotypes and to infer which of the underlying alleles reside on the same strand of DNA. Such inferences are crucial in genetic epidemiology as they form vital input to models designed to identify genetic factors contributing to complex diseases. We analyse our model using Markov chain Monte Carlo in WinBUGS. Care must be taken, however, to avoid reducibility of the Markov chain and we present an efficient block-update for circumventing this problem. We compare our method with two other packages, one of which represents the gold standard for this type of analysis, and conclude that our approach offers comparable performance. The main advantage of our method over the others considered is that it can easily be formulated as a graphical model, which means that it may be combined straightforwardly with other models, thus facilitating fully Bayesian (simultaneous) analyses.

Key words: Auxiliary variables; Haplotype reconstruction; Markov chain Monte Carlo; Missing genotypes; WinBUGS.

1 Introduction

The identification of genetic variants contributing to human disease is currently a very active field of research. The basic idea is to locate on the genome the vicinity of disease-predisposing mutations by exploiting the fact that, in general, the probability that a given section of genetic material is transmitted intact, from one generation to the next, is inversely related to the length of that section. Hence, genetic markers in close proximity to causal mutations may appear correlated with disease status and can therefore be identified, thus mapping the locations of those mutations (see Balding et al., 2003, for example).

From a statistical perspective, the basic ingredients of such an analysis are the observable traits of interest (known as phenotypes) for some relevant group of individuals and the values of the various genetic markers for those individuals. Observation of the former is fairly straightforward whereas genotyping relies on complex biochemical assays that can lead to missing data. As each individual is genotyped at multiple marker sites, it is usually (prohibitively) wasteful to simply discard individuals for whom a complete set of genotype data is not available. Hence a model for predicting the missing information is crucial. In this paper we present such a model and compare its performance to two well established alternatives.

In recent years, single nucleotide polymorphisms (SNPs, pronounced 'snips') have become the genetic markers of choice. These are physical locations on the genome (known as loci) of single nucleotide bases that are known to exhibit non-negligible variation (> 5%, say) across the population. Current SNP assays are designed to detect the amount of a specific base (A, C, G or T) at a given locus, and so, if the assay is successful, we observe how many copies of that base are present on the relevant individual's two chromosomes: 0, 1 or 2. Unfortunately, when two or more markers are genotyped together we do not typically observe which bases at one locus belong to the same chromosome as those observed at another locus. Although such phase information is of great potential value, it is very costly to obtain (in the lab) and so we usually attempt to infer it instead. When we have the phase information for all markers then we know the two haplotypes underlying the observed genotypes: in this context a haplotype is a binary string showing which of the bases of interest at the various loci are present on a single chromosome.

Many important models in genetic epidemiology for mapping complex disease genes are formulated directly in terms of haplotypes (Molitor et al., 2003) whereas others are formulated in terms of unphased genotypes (Clayton et al., 2004). It is difficult to say which of these approaches is preferable, but in terms of predicting the missing input data we tend to focus on haplotypes, since missing genotype data are inferred trivially given the haplotypes. Approaches to predicting missing genetic information thus typically have the primary aim of reconstructing the unobserved haplotypes. In cases where genotypes are observed equal to either 0 or 2, the underlying haplotype elements can be inferred with certainty: in the former case, neither chromosome carries the base of interest whereas in the latter, both chromosomes carry the base of interest.
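The phase ambiguity described above can be made concrete with a minimal sketch. The function name `compatible_haplotype_pairs` and the toy genotype are illustrative assumptions, not part of the paper; the code simply enumerates every unordered pair of binary haplotypes whose locus-wise sum equals an unphased genotype.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of 0, 1 or 2 (copies of the base of interest per locus).
    Returns a sorted list of (hap_a, hap_b) tuples of 0/1 tuples.
    """
    het = [j for j, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        # Loci with genotype 0 or 2 are determined: 0 -> (0, 0), 2 -> (1, 1).
        hap_a = [g // 2 for g in genotype]
        hap_b = [g // 2 for g in genotype]
        # Loci with genotype 1 carry the base on exactly one chromosome.
        for j, b in zip(het, bits):
            hap_a[j], hap_b[j] = b, 1 - b
        pairs.add(tuple(sorted((tuple(hap_a), tuple(hap_b)))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in compatible_haplotype_pairs([1, 1, 2]):
        print(pair)
```

A genotype with no loci equal to one has a single compatible pair, and each additional locus equal to one (beyond the first) doubles the number of candidate pairs, which is why phase must usually be inferred statistically.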
The fact that we essentially observe a proportion of each haplotype means that we can, in principle, measure the correlation between the various loci and use this to infer the values of uncertain haplotype elements, conditional on those observed. However, the quantities of interest are binary and modelling correlation among them is therefore difficult. Our approach to circumventing this is to introduce auxiliary variables to transform the problem onto a continuous scale.

Albert and Chib (1993) first introduced auxiliary variables for facilitating binary (and polychotomous) regression using Markov chain Monte Carlo (MCMC); this was based on the data augmentation idea of Tanner and Wong (1987). Multivariate extensions of this work have tended to model correlation among multiple responses by relating the underlying auxiliary variables to some latent (unobserved) response, or set of latent responses (Shi and Lee, 2000; Dunson, 2000). For the purposes of this paper, however, we require a greater level of flexibility in the correlation model and so we adopt a multivariate probit approach (Chib and Greenberg, 1998; Amemiya, 1972; Ashford and Sowden, 1970) instead.

The main advantage of our approach over existing methods for haplotype inference is that it is formulated as a graphical model (Lauritzen et al., 1990; Spiegelhalter, 1998). Other approaches (Stephens et al., 2001; Stephens and Donnelly, 2003; Clayton, 2002; Clark, 1990; Qin et al., 2002) are more procedural (algorithmic) and it is difficult to see how they might be expressed graphically. This is all very well if the missing genotype/haplotype data are the main objects of inference, but, as mentioned above, the genetic data often form vital input to other models. Ideally, we would like to analyse all sub-models simultaneously so that if, for example, relationships are posited between observed phenotypes and genetic data, then feedback from the phenotypes can influence prediction of the missing genotypes/haplotypes. Such fully Bayesian analysis is facilitated for graphical models as the elegant theory underlying these constructs allows arbitrary graphs to be combined without significantly affecting the mechanism of inference.

The structure of this paper is as follows. In Section 2 we describe the (real) data that we use to assess the performance of our haplotype reconstruction method. Section 3 provides a detailed discussion of our approach as well as brief descriptions of two other (well established) methods that we also assess. Section 3 concludes with a brief discussion of our approach to performance assessment while the assessments themselves are presented in Section 4. We provide a concluding discussion in Section 5, and annotated code for running our model within the widely used WinBUGS software (Spiegelhalter et al., 2003; Lunn et al., 2000) is given in the appendix.

2 Data

The data we consider here are the cystic fibrosis data of Kerem et al. (1989); numerous analyses of these can be found in the literature (McPeek and Strahs, 1999; Morris et al., 2002). The data set is useful for evaluating haplotype reconstruction techniques as the underlying haplotypes are known. Ninety-three individuals (47 cases and 46 controls) were genotyped at 23 distinct loci [1] on the long arm of chromosome 7. The data set comprises the resulting 186 observed haplotypes, each flagged with case-control status. Note that there is no information regarding which haplotypes belong to the same individual. For the purposes of this paper, we take advantage of the fact that cystic fibrosis is known to be a rare, recessive trait and form new (imaginary) individuals by randomly pairing two case or two control haplotypes together. Case-control status is used only to pair haplotypes together; in what follows we will ignore it and assume all haplotypes to be exchangeable. This is slightly unrealistic, and haplotype inference would be improved by acknowledging case-control status in some way. This happens automatically when our haplotype model is combined with a model for phenotypes in a fully Bayesian analysis (Lunn et al., 2005), but the optimal way to incorporate phenotypic information into other algorithms is less obvious. Common practice (see, for example: Hodgkinson et al., 2004; Carlson et al., 2005) is to ignore phenotypic information, which we have accordingly done here.

[1] These loci are not actually SNP loci but they may be treated as such.

We form the input data for each method by adding together the haplotype elements at each locus for each individual; thus the phase information is discarded. In the cystic fibrosis data set, however, there are 169 from a total of 2 × 93 × 23 = 4278 haplotype elements that are missing. These give rise to 157 out of 2139 missing genotypes (there are twelve genotypes for which both of the underlying haplotype elements are missing). These do not present a problem in terms of running the various methods, or indeed in terms of assessing their performance reconstructing haplotypes. However, we are also interested in assessing the various methods' abilities to predict missing genotypes and we cannot achieve this if we do not know the true values of those genotypes. Hence, we derive a reduced data set in which all genotypes are observed. An efficient way of doing this for the cystic fibrosis data is to delete two particularly sparse loci, loci 10 and 18, and then 24 individuals for whom a fully observed genotype sequence is still not available (hence the numbers of individuals and genetic loci are now 69 and 21 respectively). We now randomly delete five per cent of the reduced data set so that we may assess the ability of each method to predict missing genotypes (with knowledge of the underlying true values) as well as to reconstruct haplotypes. We analyse both the full and reduced data sets in what follows.
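The data preparation described above might be sketched as follows. This is only an illustration under stated assumptions: the function names `make_genotypes` and `delete_at_random`, the use of `None` for missing values, and the toy haplotypes are hypothetical, and the pairing function is meant to be called separately on the case haplotypes and on the control haplotypes.

```python
import random


def make_genotypes(haplotypes, rng):
    """Randomly pair haplotypes (within one group) and sum them locus by locus.

    None marks a missing haplotype element; a genotype is treated as missing
    if either of its two underlying elements is missing.
    """
    order = list(range(len(haplotypes)))
    rng.shuffle(order)
    genotypes = []
    for a, b in zip(order[0::2], order[1::2]):
        genotypes.append([
            None if x is None or y is None else x + y
            for x, y in zip(haplotypes[a], haplotypes[b])
        ])
    return genotypes


def delete_at_random(genotypes, fraction, rng):
    """Mask a fraction of the observed genotypes, keeping the true values."""
    cells = [(i, j) for i, row in enumerate(genotypes)
             for j, g in enumerate(row) if g is not None]
    n_delete = round(fraction * len(cells))
    masked = [row[:] for row in genotypes]
    truth = {}
    for i, j in rng.sample(cells, n_delete):
        truth[(i, j)] = masked[i][j]
        masked[i][j] = None
    return masked, truth


if __name__ == "__main__":
    rng = random.Random(2005)
    toy = [[0, 1, 1], [1, 1, 0], [0, 0, None], [1, 0, 1]]
    genos = make_genotypes(toy, rng)
    print(genos)
    print(delete_at_random(genos, 0.2, rng))
```

Keeping the deleted true values (the `truth` dictionary in this sketch) is what allows predictions of missing genotypes to be scored later.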

3 Methods

3.1 Model

Suppose for each of $N$ individuals (indexed by $i$) we have observed genotypes at $Q$ distinct SNP loci (indexed by $j$). We denote the $1 \times Q$ row-vector of genotypes for individual $i$ by $G_i$, $i = 1, ..., N$, and the individual elements of this vector by $g_{ij}$, $j = 1, ..., Q$. Let $h_{1ij}$ and $h_{2ij}$ denote the values of the two haplotypes for individual $i$ at locus $j$ such that $g_{ij} = h_{1ij} + h_{2ij}$. Further, denote the two full haplotypes (i.e. spanning all $Q$ loci) for individual $i$ by $H_{1i}$ and $H_{2i}$; these are binary strings of length $Q$.

We wish to exploit the correlation that exists between genetic loci in order to predict the missing data for a given individual based on his/her observed genotypes. However, both genotypes and haplotypes are discrete and there does not exist a natural discrete-multivariate distribution that allows for arbitrary correlation structures. We therefore transform to a continuous scale and define two $1 \times Q$ pseudo-haplotypes $\lambda_{1i}$ and $\lambda_{2i}$ for each individual. These have elements $\lambda_{1ij}$ and $\lambda_{2ij}$ respectively ($j = 1, ..., Q$) such that $h_{1ij} = I(\lambda_{1ij} \geq 0)$ and $h_{2ij} = I(\lambda_{2ij} \geq 0)$, where $I(\cdot)$ denotes the indicator function. Thus the sign of a pseudo-haplotype at a specific locus is exactly equivalent to the value of the corresponding (binary) haplotype (at the same locus), i.e. they contain the same information. Importantly though, the pseudo-haplotype is defined on a continuous scale and as such can be assumed to arise from a $Q$-dimensional multivariate normal distribution, with unknown mean $\mu$ and covariance matrix $\Sigma$:

$$\lambda_{1i} \sim \mathrm{MVN}_Q(\lambda_{1i} \mid \mu, \Sigma); \qquad \lambda_{2i} \sim \mathrm{MVN}_Q(\lambda_{2i} \mid \mu, \Sigma)$$

This assumption allows us now to model arbitrary correlation structures via $\mu$ and $\Sigma$. As our approach is Bayesian we require prior distributions for the unknown parameters $\mu$ and $\Sigma$. We specify vague but proper priors via

$$\mu \sim \mathrm{MVN}_Q(\mu \mid \mu_0, \rho^{-1} I_Q); \qquad T = \Sigma^{-1} \sim \mathrm{Wishart}(T \mid Q\Sigma_0, Q)$$

where $\rho$ is small, e.g. $10^{-4}$, $I_Q$ denotes the $Q \times Q$ identity matrix, and $\mu_0$ and $\Sigma_0$ are initial guesses for the values of $\mu$ and $\Sigma$ respectively. In the absence of any prior knowledge we typically set each element of $\mu_0$ equal to zero and $\Sigma_0 = I_Q$. Chib and Greenberg (1998) point out that $\Sigma$ should be constrained to correlation form as $\mu$ and $\Sigma$ are not likelihood-identified otherwise. However, in this setting we have no interest in making inferences about the mean and/or covariance and so identifiability is not a major issue. Moreover, $\mu$ and $\Sigma$ are posterior-identified with the proper priors specified above. Hence we prefer to retain the convenience of a conjugate prior for $\Sigma$. The graphical model corresponding to our pseudo-haplotype approach is shown in Figure 1.
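As a rough illustration of the generative side of this model, the sketch below draws two pseudo-haplotypes from a multivariate normal distribution, thresholds them at zero to obtain binary haplotypes, and sums these to give genotypes. The helper names (`cholesky`, `simulate_individual`) and the example mean and covariance values are assumptions made for illustration; the sketch covers forward simulation only, not the priors or posterior inference.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = math.sqrt(s) if r == c else s / low[c][c]
    return low


def simulate_individual(mu, chol, rng):
    """Draw two pseudo-haplotypes from MVN(mu, Sigma) and threshold at zero.

    chol is the Cholesky factor of Sigma. Returns (h1, h2, genotype).
    """
    haps = []
    for _ in range(2):
        z = [rng.gauss(0.0, 1.0) for _ in mu]
        pseudo = [m + sum(chol[j][k] * z[k] for k in range(j + 1))
                  for j, m in enumerate(mu)]
        haps.append([1 if x >= 0 else 0 for x in pseudo])  # h = I(pseudo >= 0)
    genotype = [a + b for a, b in zip(haps[0], haps[1])]
    return haps[0], haps[1], genotype


if __name__ == "__main__":
    rng = random.Random(1)
    sigma = [[1.0, 0.9], [0.9, 1.0]]  # illustrative: two strongly correlated loci
    chol = cholesky(sigma)
    for _ in range(5):
        print(simulate_individual([0.0, 0.0], chol, rng))
```

With a strong positive correlation between the two loci, simulated haplotypes tend to carry the same value at both loci, which is the kind of dependence the model exploits to predict uncertain haplotype elements.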

3.2 Implementation

We first examine the situation in which the pseudo-haplotype model described above is analysed alone, that is, the graph shown in Figure 1 represents the entirety of our statistical model. The model is analysed via the Gibbs sampler (Gelfand and Smith, 1990; Geman and Geman, 1984), which proceeds by iterative simulation from the full conditional distributions of all unknown stochastic quantities conditional on the most recent values of all other quantities. The multivariate normal and Wishart priors for $\mu$ and $T$ respectively are conjugate and so the full conditional distributions for these parameters are available in closed form; sampling is therefore straightforward.

For the remaining stochastic parameters, i.e. $\lambda_{1i}$ and $\lambda_{2i}$, $i = 1, ..., N$, we must take care to avoid reducibility of the Markov chain. To see why this is an issue, consider a single genotype $g_{ij}$ observed equal to one. One of the corresponding haplotypes at locus $j$ must then be equal to one and the other must be equal to zero. Consequently, one of the pseudo-haplotypes (at locus $j$) must be negative and the other must be positive. If we were to update each pseudo-haplotype separately then this configuration would never change, i.e. it would be the same pseudo-haplotype element that would be positive throughout the entire simulation. Hence, for each locus $j \in \{1, ..., Q\}$ we sample $\lambda_{1ij}$ and $\lambda_{2ij}$ simultaneously from their joint full conditional distribution:

$$p(\lambda_{1ij}, \lambda_{2ij} \mid \cdot) = N(\lambda_{1ij} \mid m_{1ij}, T_{jj}^{-1}) \, N(\lambda_{2ij} \mid m_{2ij}, T_{jj}^{-1}) \, I(g_{ij}; \lambda_{1ij}, \lambda_{2ij}) \qquad (1)$$

where

$$m_{1ij} = \mu_j - \sum_{l \neq j} \frac{T_{jl}}{T_{jj}} (\lambda_{1il} - \mu_l), \qquad m_{2ij} = \mu_j - \sum_{l \neq j} \frac{T_{jl}}{T_{jj}} (\lambda_{2il} - \mu_l)$$

and $I(\cdot)$ represents an appropriate truncation of $\mathbb{R}^2$. If $g_{ij}$ is unobserved then there is no truncation and $I \equiv 1$; otherwise $I(\cdot)$ is defined as follows. Let $\pi_{1ij}$ and $\pi_{2ij}$ denote the probabilities associated with negative values according to $N(\lambda_{1ij} \mid m_{1ij}, T_{jj}^{-1})$ and $N(\lambda_{2ij} \mid m_{2ij}, T_{jj}^{-1})$ respectively:

$$\pi_{1ij} = \Phi\left(-m_{1ij} \sqrt{T_{jj}}\right); \qquad \pi_{2ij} = \Phi\left(-m_{2ij} \sqrt{T_{jj}}\right)$$

Then

$$I(g_{ij}; \lambda_{1ij}, \lambda_{2ij}) = \begin{cases} I(\lambda_{1ij} < 0, \lambda_{2ij} < 0) / (\pi_{1ij} \pi_{2ij}) & \text{if } g_{ij} = 0 \\ I(\lambda_{1ij} \lambda_{2ij} < 0) / (\pi_{1ij} + \pi_{2ij} - 2 \pi_{1ij} \pi_{2ij}) & \text{if } g_{ij} = 1 \\ I(\lambda_{1ij} \geq 0, \lambda_{2ij} \geq 0) / \{(1 - \pi_{1ij})(1 - \pi_{2ij})\} & \text{if } g_{ij} = 2 \end{cases}$$

Given an efficient algorithm for drawing from (univariate) truncated normal distributions we sample from (1) as follows. In cases where $g_{ij}$ is missing, equal to zero, or equal to two, we sample $\lambda_{1ij}$ and $\lambda_{2ij}$ independently from standard, right-truncated (at zero), and left-truncated (at zero) versions of $N(\lambda_{1ij} \mid m_{1ij}, T_{jj}^{-1})$ and $N(\lambda_{2ij} \mid m_{2ij}, T_{jj}^{-1})$, respectively.

Where $g_{ij} = 1$, however, we first choose which of the two compatible quadrants of $\mathbb{R}^2$ to sample from. We choose $[\lambda_{1ij} \geq 0, \lambda_{2ij} < 0]$ with probability $\pi_{2ij}(1 - \pi_{1ij}) / (\pi_{1ij} + \pi_{2ij} - 2 \pi_{1ij} \pi_{2ij})$, and $[\lambda_{1ij} < 0, \lambda_{2ij} \geq 0]$ otherwise. In the former case we then take $\lambda_{1ij} \sim N^{+}(\lambda_{1ij} \mid m_{1ij}, T_{jj}^{-1})$ and $\lambda_{2ij} \sim N^{-}(\lambda_{2ij} \mid m_{2ij}, T_{jj}^{-1})$, where the $+$ and $-$ superscripts denote left- and right-truncation, at zero, respectively; in the latter case we simply switch the truncations.
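The block update just described can be sketched for a single individual and locus as follows. This is a hedged illustration rather than the WinBUGS implementation: the names `truncated_normal` and `update_locus` are hypothetical, the conditional means and the precision `t_jj` are assumed to have been computed already, and truncated draws use a simple inverse-CDF method (with a crude guard against rounding) instead of a dedicated truncated-normal algorithm.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for its cdf and inverse cdf


def truncated_normal(mean, sd, side, rng):
    """Inverse-CDF draw from N(mean, sd^2), optionally truncated at zero.

    side '-' keeps (-inf, 0), side '+' keeps [0, inf), side None is untruncated.
    """
    p_neg = STD.cdf(-mean / sd)  # probability of a negative value
    if side == '-':
        lo, hi = 0.0, p_neg
    elif side == '+':
        lo, hi = p_neg, 1.0
    else:
        lo, hi = 0.0, 1.0
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < u < 1
    x = mean + sd * STD.inv_cdf(u)
    # Crude guard: rounding in the far tails must not flip the required sign.
    if side == '-' and x >= 0:
        x = -1e-12
    if side == '+' and x < 0:
        x = 0.0
    return x


def update_locus(g, m1, m2, t_jj, rng):
    """Joint draw of the two pseudo-haplotype elements at one locus.

    g is the genotype (0, 1, 2, or None if missing); m1 and m2 are the
    conditional means and t_jj the conditional precision.
    """
    sd = t_jj ** -0.5
    pi1 = STD.cdf(-m1 / sd)
    pi2 = STD.cdf(-m2 / sd)
    if g is None:
        sides = (None, None)
    elif g == 0:
        sides = ('-', '-')
    elif g == 2:
        sides = ('+', '+')
    else:
        # g == 1: choose one of the two compatible quadrants, then truncate.
        p_first_positive = pi2 * (1 - pi1) / (pi1 + pi2 - 2 * pi1 * pi2)
        sides = ('+', '-') if rng.random() < p_first_positive else ('-', '+')
    return (truncated_normal(m1, sd, sides[0], rng),
            truncated_normal(m2, sd, sides[1], rng))


if __name__ == "__main__":
    rng = random.Random(42)
    print([update_locus(1, 0.5, -0.5, 1.0, rng) for _ in range(3)])
```

Because the quadrant is re-chosen at every update when the genotype equals one, either pseudo-haplotype can become the positive one, which is how the joint update avoids the reducibility problem.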

3.2.1 Combining models

In cases where the genotypes form input to some other sub-model, for example, when the association between observed phenotypes and genotypes is being modelled (Lunn et al., 2005; Verzilli et al., 2005), sampling of the $\lambda_{1ij}$s and $\lambda_{2ij}$s may be complicated by additional factors arising in (1) due to observed values from the other model. When $g_{ij}$ is observed no further likelihood can be transmitted to the corresponding pseudo-haplotypes, but when $g_{ij}$ is missing and has observed descendants in the graph, these descendants contribute to the joint full conditional for $\lambda_{1ij}$ and $\lambda_{2ij}$, rendering it less straightforward to sample from. However, we can retain the sampling scheme described above so long as we update each missing genotype (with observed descendants) as follows. With $\pi_{1ij}$ and $\pi_{2ij}$ the probabilities of negative values defined above, we define prior probabilities

$$\Pr(g_{ij} = 0) = \pi_{1ij} \pi_{2ij}; \qquad \Pr(g_{ij} = 1) = \pi_{1ij} + \pi_{2ij} - 2 \pi_{1ij} \pi_{2ij}; \qquad \Pr(g_{ij} = 2) = (1 - \pi_{1ij})(1 - \pi_{2ij})$$

and multiply these by the likelihood contributions from the observed descendants. After trivial normalization of the resulting probabilities, by dividing each by their sum, we obtain full conditional posterior probabilities for the missing genotype. If we sample the missing genotype from this distribution then it can be shown that using (1) for drawing $\{\lambda_{1ij}, \lambda_{2ij}\}$ conditional on the sampled value of $g_{ij}$ is equivalent to using the correct joint full conditional (without treating deterministic genotype nodes as stochastic parameters).

A similar approach can be used when it is haplotypes, as opposed to genotypes, that form the input to another sub-model. Note that it is then the $h_{1ij}$s and $h_{2ij}$s that are treated (jointly) as stochastic parameters rather than deterministic functions when they are missing. Also note that $h_{1ij}$ and $h_{2ij}$ are missing when $g_{ij} = 1$ as well as when $g_{ij}$ is missing, but that updating $h_{1ij}$ and $h_{2ij}$ together means that $\lambda_{1ij}$ and $\lambda_{2ij}$ can be sampled independently.

We have implemented the above model and sampling scheme in the widely used WinBUGS software. Annotated WinBUGS code for fitting the pseudo-haplotype model alone to the cystic fibrosis data described above is given in the appendix.
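The missing-genotype update described in this subsection amounts to a three-point discrete full conditional. A minimal sketch, assuming the two negative-value probabilities and the likelihood contributions from the observed descendants (one per candidate genotype value) are supplied by the caller, might look like this; the function name and argument layout are hypothetical.

```python
def missing_genotype_posterior(pi1, pi2, likelihood):
    """Full conditional probabilities for a missing genotype g in {0, 1, 2}.

    pi1, pi2: probabilities that each pseudo-haplotype element is negative.
    likelihood: three non-negative contributions from observed descendants,
    evaluated at g = 0, 1 and 2 (assumed to be computed elsewhere).
    """
    prior = (
        pi1 * pi2,                   # both elements negative  -> g = 0
        pi1 + pi2 - 2 * pi1 * pi2,   # exactly one negative    -> g = 1
        (1 - pi1) * (1 - pi2),       # neither negative        -> g = 2
    )
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


if __name__ == "__main__":
    # A descendant that rules out g = 1 leaves mass only on g = 0 and g = 2.
    print(missing_genotype_posterior(0.5, 0.5, (1.0, 0.0, 1.0)))
```

With a flat likelihood the function returns the prior probabilities unchanged, which corresponds to a missing genotype that has no observed descendants.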

3.3 Other approaches

Here we provide brief descriptions of the two other haplotype reconstruction methods with which we compare our approach in subsequent sections: the PHASE software (Stephens et al., 2001; Stephens and Donnelly, 2003) and SNPHAP (Clayton, 2002). Most haplotype reconstruction techniques, including PHASE and SNPHAP, treat haplotypes as essentially univariate quantities; that is, instead of viewing haplotypes as strings of ones and zeros and modelling the individual elements (as we do), they are simply viewed as 'words'. Each possible haplotype occurs in the population of interest with a specific frequency (often close to or equal to zero) and it is these frequencies that are the objects of inference. The likelihood is then the distribution of observed genotypes conditional on the population haplotype frequencies, which is straightforward to define under fairly standard assumptions (Excoffier and Slatkin, 1995).

3.3.1 PHASE

The PHASE algorithm's prior approximates the coalescent distribution, which models the genealogy of a set of haplotypes by tracing similar haplotypes back to their most recent common ancestors (see Hudson, 1991, for a review). The software makes use of MCMC for posterior inference but the posterior is not defined in the conventional way, by combining prior and likelihood together. Instead, it is defined implicitly, as the stationary distribution of a particular Markov chain corresponding to a set of inconsistent conditional distributions. Though not entirely satisfactory from a theoretical point of view, this works extremely well (Stephens and Donnelly, 2003) and PHASE is generally regarded as the gold standard for haplotype reconstruction (e.g. Xu et al., 2004). At the time when most of our analyses were conducted, the latest version of the PHASE software was 2.0.2. Recently, however, version 2.11 has been released. For completeness we have compared both of these versions with our pseudo-haplotype approach.

3.3.2 SNPHAP

For the population haplotype frequencies SNPHAP specifies a Dirichlet prior distribution with constant degrees of freedom for all possible haplotypes. The posterior is sampled using the IP (Imputation/Posterior sampling) algorithm (Tanner and Wong, 1987), a form of Gibbs sampling. In principle, this approach can be represented as a graphical model. In practice, however, the number of possible haplotypes is simply too large to handle (except when the number of loci is small, < 8, say) and some form of algorithmic intervention is required. SNPHAP tackles this problem by first fitting two-locus haplotypes and then extending the solution by one locus at a time. The SNPHAP analyses presented in this paper were conducted using version 1.3 of the software.
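To give a sense of scale for the remark about the number of possible haplotypes, the short sketch below counts the distinct binary haplotypes over a given number of loci. The function name is an illustrative assumption; the locus counts 21 and 23 are those of the reduced and full data sets described in Section 2.

```python
def n_possible_haplotypes(n_loci):
    """Number of distinct binary haplotypes spanning n_loci loci."""
    return 2 ** n_loci


if __name__ == "__main__":
    for q in (8, 21, 23):
        print(q, n_possible_haplotypes(q))
```

A frequency-based model needs, in principle, one parameter per possible haplotype, so the parameter count doubles with each additional locus.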

3.4 Performance

Assessing the performance of haplotype reconstruction methods is non-trivial, as discussed below. Let $H_i^{*}$ denote the row-vector obtained by joining together the two true (observed) haplotypes for individual $i$ in some arbitrary order (e.g. the order in which they appear in the original data set). We denote the elements of $H_i^{*}$ by $h^{*}_{ij}$, $j = 1, ..., 2Q$. For all of the methods considered, the posterior sample for each individual's haplotypes can be summarised as a list of distinct haplotype pairs denoted $\{H_{1i}^{(k)}, H_{2i}^{(k)}\}$, $k = 1, ..., C_i$, with associated posterior probabilities $q_i^{(1)}, ..., q_i^{(C_i)}$; the elements of $H_{1i}^{(k)}$ and $H_{2i}^{(k)}$ are denoted $h_{1ij}^{(k)}$ and $h_{2ij}^{(k)}$, $j = 1, ..., Q$, respectively. To facilitate a comparison between these haplotype pairs and the true values we order them so that they are aligned with the $H_i^{*}$ terms defined above. We define

$$\epsilon_{1i}^{(k)} = \sum_{j=1}^{Q} \left( |h^{*}_{ij} - h_{1ij}^{(k)}| + |h^{*}_{i(j+Q)} - h_{2ij}^{(k)}| \right) \quad \text{and} \quad \epsilon_{2i}^{(k)} = \sum_{j=1}^{Q} \left( |h^{*}_{ij} - h_{2ij}^{(k)}| + |h^{*}_{i(j+Q)} - h_{1ij}^{(k)}| \right),$$

which count the number of errors associated with each of the two possible orderings for each haplotype pair. (Note that we define $|\cdot|$ such that it equals zero for all $h^{*}_{ij}$s with missing values.) We also define weights $w_{1i}^{(k)}$ and $w_{2i}^{(k)}$ to be applied to each ordering:

$$w_{1i}^{(k)} = \begin{cases} 0 & \text{if } \epsilon_{1i}^{(k)} > \epsilon_{2i}^{(k)} \\ \tfrac{1}{2} & \text{if } \epsilon_{1i}^{(k)} = \epsilon_{2i}^{(k)} \\ 1 & \text{if } \epsilon_{1i}^{(k)} < \epsilon_{2i}^{(k)} \end{cases} \qquad w_{2i}^{(k)} = \begin{cases} 0 & \text{if } \epsilon_{2i}^{(k)} > \epsilon_{1i}^{(k)} \\ \tfrac{1}{2} & \text{if } \epsilon_{2i}^{(k)} = \epsilon_{1i}^{(k)} \\ 1 & \text{if } \epsilon_{2i}^{(k)} < \epsilon_{1i}^{(k)} \end{cases}$$

We then transform our posterior sample for unordered haplotype pairs to a sample that can be directly compared with $H_i^{*}$:

$$\{H_{1i}^{(k)}, H_{2i}^{(k)}\} \rightarrow \begin{cases} H_i^{(2k-1)} = (H_{1i}^{(k)}, H_{2i}^{(k)}) & \text{posterior probability } p_i^{(2k-1)} = w_{1i}^{(k)} q_i^{(k)} \\ H_i^{(2k)} = (H_{2i}^{(k)}, H_{1i}^{(k)}) & \text{posterior probability } p_i^{(2k)} = w_{2i}^{(k)} q_i^{(k)} \end{cases}$$

Finally, we collapse this new sample so that elements with zero probability are removed and non-unique elements (due to $H_{1i}^{(k)} = H_{2i}^{(k)}$) are combined. We also relabel the elements so that they are ranked in terms of posterior probability. Thus our sample now comprises ordered haplotype pairs $H_i^{(k)}$, $k = 1, ..., K_i \geq C_i$, with associated posterior probabilities $p_i^{(1)} > p_i^{(2)} > ... > p_i^{(K_i)} > 0$; the elements of each $H_i^{(k)}$ are denoted $h_{ij}^{(k)}$, $j = 1, ..., 2Q$. In short, where possible we have chosen for each haplotype pair the ordering that best matches the underlying true values. In cases where there is no unique best match we split the posterior probability equally between the two alternatives. The errors associated with these matched haplotype pairs are given by

$$\epsilon_i^{(k)} = \sum_{j=1}^{2Q} |h^{*}_{ij} - h_{ij}^{(k)}|, \qquad k = 1, ..., K_i.$$
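The alignment procedure of this section can be sketched compactly. In the sketch below the function name `align_pairs`, the use of `None` for missing true values, and the toy inputs are illustrative assumptions; the logic follows the description above: count errors under both orderings, weight the better ordering (splitting ties equally), collapse duplicates, and rank by posterior probability.

```python
def align_pairs(true_pair, pairs, probs):
    """Align unordered haplotype pairs with the true haplotypes.

    true_pair: (h_a, h_b), two lists of 0/1 values with None where missing.
    pairs: list of (H1, H2) candidate haplotype pairs; probs: their
    posterior probabilities.
    Returns a list of (ordered_pair, probability, errors), sorted by
    decreasing probability.
    """
    truth = list(true_pair[0]) + list(true_pair[1])

    def errors(candidate):
        # Missing true values contribute zero error.
        return sum(abs(t - c) for t, c in zip(truth, candidate) if t is not None)

    collapsed = {}
    for (h1, h2), q in zip(pairs, probs):
        first = tuple(h1) + tuple(h2)
        second = tuple(h2) + tuple(h1)
        e1, e2 = errors(first), errors(second)
        w1 = 1.0 if e1 < e2 else 0.5 if e1 == e2 else 0.0
        for cand, w in ((first, w1), (second, 1.0 - w1)):
            if w > 0:  # drop zero-probability orderings, merge duplicates
                collapsed[cand] = collapsed.get(cand, 0.0) + w * q
    ranked = sorted(collapsed.items(), key=lambda kv: -kv[1])
    return [(cand, p, errors(cand)) for cand, p in ranked]


if __name__ == "__main__":
    truth = ([0, 1], [1, 0])
    candidates = [([1, 0], [0, 1]), ([0, 0], [1, 1])]
    for row in align_pairs(truth, candidates, [0.7, 0.3]):
        print(row)
```

In the toy example the first candidate matches the truth exactly once its two haplotypes are swapped, while the second is equally wrong under both orderings, so its probability of 0.3 is split into two ordered pairs of 0.15 each.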

4 Results

4.1 Full data set

For the pseudo-haplotype model, 700000 iterations in WinBUGS took around four hours on a 1.8GHz laptop machine. The first 200000 samples were discarded and every 25th sample from the remaining 500000 was retained for posterior inference (giving a sample size of 20000). Note that this represents a very conservative run-length, chosen to minimize the chances of our posterior sample being unduly affected by a lack of convergence and/or efficiency; much shorter run-times are achievable in practice. PHASE analyses were conducted on the same machine and took around 10 minutes and 25 minutes, respectively, using versions 2.02 and 2.11 of the software with their default settings. SNPHAP analysis was performed, again using the default settings, on a 1.8GHz Linux machine; this took around 10 seconds.

In order to first gain a simple overview of the performance of the various methods under investigation, we begin by assessing only the 'best guesses' (posterior modes) from each method for each individual. That is, we examine the distribution across individuals of $\epsilon_i^{(1)} = \sum_{j=1}^{2Q} |h^{*}_{ij} - h_{ij}^{(1)}|$, the error associated with the most probable (ordered/aligned) haplotype pair under the relevant method's posterior. Table 1 shows percentages of the 93 individuals constituting the full data set for which various accuracy thresholds were surpassed in each analysis. For example, the table shows that for 42% of individuals the pseudo-haplotype approach implemented in WinBUGS reconstructs both haplotypes (46 SNPs) exactly; also, for all four methods, there are no more than 16 errors for any individual. There is no notable difference between WinBUGS and the two PHASE algorithms but SNPHAP appears to have performed relatively poorly, reconstructing both haplotypes exactly for only 34% of individuals.

It is interesting to consider the extent to which reconstruction errors are shared across methods, that is, are some haplotypes particularly difficult to reconstruct, or do different methods make mistakes in different areas? In the performance assessment described above, the SNPHAP method generated 326 errors whereas WinBUGS and the two PHASE methods all generated around 290. The maximum number of errors possible is 1707 and so this represents an error rate of 17% for the latter three methods. Nearly half of the errors generated by WinBUGS are reproduced by the two versions of PHASE, and SNPHAP reproduces a slightly smaller fraction. Hence, there is substantial sharing of errors across the four methods, indicating that to some extent the degree to which a haplotype can be accurately reconstructed is a characteristic of the haplotype.

Of course examining only the posterior mode for each haplotype pair represents an overly simplistic assessment of the various methods' performance and, moreover, does not reflect the way in which the posterior output should be used in practice. Hence, we now examine the posterior probabilities from each method for the true haplotype pairs for each individual. That is,

$$\Pr(H_i^{*} \mid G^{(obs)}) = \sum_{k=1}^{K_i} p_i^{(k)} I(\epsilon_i^{(k)} = 0), \qquad i = 1, ..., N,$$

where $G^{(obs)}$ denotes the observed genotype data. The summation is required here since each $H_i^{*}$ may contain missing values and there may thus be more than one $H_i^{(k)}$ giving rise to zero error. Figure 2(a) shows posterior probabilities for the true haplotype pairs from PHASE 2.02 and 2.11 plotted against those obtained from WinBUGS. Again, there is little to choose between the three methods although there is a small cloud of points in the bottom-left where the posterior probability from WinBUGS is less than 0.1 and that from PHASE is considerably higher. The degree of correlation in this plot is also indicative of some error sharing across methods.

We are also interested in the ability of these methods to predict the missing genotypes. However, it is impossible to assess this with the full data set as we do not know the underlying true values. To circumvent this problem we now analyse the reduced data set described in Section 2.
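The two summaries used in this subsection, the error of the posterior mode and the posterior probability of the true haplotype pair, reduce to a few lines once each individual's aligned sample is available as parallel lists of probabilities and error counts. The function names and the toy numbers below are illustrative assumptions, not results from the paper.

```python
def modal_error(probs, errors):
    """Error count of the most probable aligned haplotype pair."""
    k = max(range(len(probs)), key=probs.__getitem__)
    return errors[k]


def prob_true_pair(probs, errors):
    """Posterior probability of the true pair: total mass with zero error."""
    return sum(p for p, e in zip(probs, errors) if e == 0)


if __name__ == "__main__":
    probs = [0.6, 0.3, 0.1]   # hypothetical aligned posterior probabilities
    errors = [0, 2, 0]        # hypothetical error counts for each pair
    print(modal_error(probs, errors), prob_true_pair(probs, errors))
```

More than one aligned pair can have zero error when the true haplotypes contain missing values, which is why the second function sums over all zero-error pairs rather than looking up a single one.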

4.2 Reduced data set

Again, 700000 iterations were performed in WinBUGS for the pseudo-haplotype model, with the first 200000 samples discarded and 20000 equally spaced samples from the final 500000 retained for inference. This took around 2½ hours on our 1.8GHz laptop. The PHASE 2.02, PHASE 2.11 and SNPHAP analyses were again conducted using default settings and took 3.5 minutes, 11 minutes and 10 seconds, respectively.

Table 2 shows the same information as Table 1 but for the reduced data set, i.e. percentages of individuals for which each method surpassed various accuracy thresholds. Now WinBUGS, PHASE 2.02 and SNPHAP all perform similarly but PHASE 2.11 is somewhat superior. All methods perform well in our opinion, however. Figure 2(b) shows posterior probabilities for each individual's true haplotypes from PHASE 2.02 and PHASE 2.11 plotted against those from the WinBUGS analysis. There is good agreement between the methods but both of the PHASE algorithms seem to perform slightly better for the lower probabilities.

We now turn our attention towards missing genotypes as opposed to haplotype reconstruction. As the majority (and sometimes all) of each genotype sequence is observed, it makes little sense to consider posterior probabilities for whole sequences. Instead we examine the marginal posterior probability of predicting the true value for each missing genotype:

$$\Pr(g_{ij} = g^{*}_{ij} \mid G^{(obs)}) = \sum_{k=1}^{K_i} p_i^{(k)} I(h_{ij}^{(k)} + h_{i(j+Q)}^{(k)} = g^{*}_{ij}), \qquad g_{ij} \in G^{(miss)}$$

where $G^{(miss)}$ denotes the set of all missing genotype data. Figure 3 shows marginal posterior probabilities for the true genotypes from PHASE 2.02 and 2.11 plotted against those obtained from WinBUGS. Yet again, there is good agreement between the methods; however, both versions of PHASE perform slightly better when the posterior probability from WinBUGS lies in the region 0.7 to 0.9. The figure shows that most of the estimated probabilities are large, which demonstrates excellent performance of all three methods for predicting missing genotypes. The median probabilities across individuals for each method are 0.968, 0.964 and 0.890 for PHASE 2.02, PHASE 2.11 and WinBUGS, respectively.

We may also calculate marginal probabilities for each of the individual haplotype elements whose value is uncertain, that is, when the corresponding $g_{ij}$ is either missing or equal to one. These are given by $\sum_{k=1}^{K_i} p_i^{(k)} I(h_{ij}^{(k)} = h^{*}_{ij})$ for all $i$ and all $j \in \{1, ..., 2Q\}$ such that $g_{i(j \bmod Q)} = 1$ or $g_{i(j \bmod Q)} \in G^{(miss)}$. There is no longer a strong correlation between the probabilities generated by each method but the distributions across uncertain haplotype elements are similar. PHASE performs slightly better when the genotype is missing whereas WinBUGS performs better when the phase information is missing, i.e. when the genotype is observed to be one. Bar charts showing the distributions of these posterior probabilities are presented in Figure 4. For cases in which only the phase information is missing we can also obtain marginal probabilities for individual haplotype elements for the full data set (less than half of the true haplotype values for missing genotypes are known). In this case the distributions are even more similar.
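The marginal probabilities used in this subsection, for a missing genotype and for a single uncertain haplotype element, can be sketched as below. The sketch assumes each aligned haplotype pair is stored as one sequence of length 2Q with the first haplotype in positions 0 to Q-1 and the second in positions Q to 2Q-1 (zero-based indexing, unlike the one-based notation in the text); the function names and toy values are hypothetical.

```python
def prob_true_genotype(ordered_pairs, probs, j, n_loci, true_genotype):
    """Marginal posterior probability that the genotype at locus j is correct."""
    return sum(p for h, p in zip(ordered_pairs, probs)
               if h[j] + h[j + n_loci] == true_genotype)


def prob_true_element(ordered_pairs, probs, j, true_value):
    """Marginal posterior probability that haplotype element j is correct."""
    return sum(p for h, p in zip(ordered_pairs, probs) if h[j] == true_value)


if __name__ == "__main__":
    # Two loci (Q = 2): each aligned pair is (h1_0, h1_1, h2_0, h2_1).
    pairs = [(0, 1, 1, 0), (0, 0, 1, 1), (1, 1, 1, 0)]
    probs = [0.6, 0.3, 0.1]
    print(prob_true_genotype(pairs, probs, 0, 2, 1))
    print(prob_true_element(pairs, probs, 1, 1))
```

Both quantities are sums of posterior probabilities over the aligned pairs that agree with the truth at the position of interest, so they can be computed for any method whose output is summarised as haplotype pairs with probabilities.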

5 Discussion

We have adapted the auxiliary variable techniques introduced by Albert and Chib (1993) and extended by Chib and Greenberg (1998) to solve a fundamental problem in statistical genetics, that of predicting missing genotype and/or haplotype data. Such data constitute a crucial input to many important models for mapping genes contributing to complex diseases. Indeed, the analysis of such models cannot proceed without some strategy for imputing the missing covariates. For the cystic fibrosis data considered in this paper our approach (implemented in WinBUGS) offers performance comparable to that of the PHASE software, which represents the current gold standard for this type of analysis. For the full data set, there is virtually no difference in the accuracies of WinBUGS and PHASE, but SNPHAP does not perform quite as well. For the reduced data set, PHASE (version 2.11 in particular) appears to slightly out-perform both our method and SNPHAP, but the differences are relatively small.

It is important to note, however, that our goal is not to propose a method superior to those currently available, but instead to propose a good method that can be formulated as a graphical model. Many approaches to haplotype reconstruction, including PHASE, are not explicitly formulated as statistical models but have some algorithmic flavour, which makes it difficult to see how they might be expressed graphically. This somewhat inhibits combining them for simultaneous (fully Bayesian) analysis with other models, such as the gene mapping models mentioned above, which underpin much of genetic epidemiology. Implementing our pseudo-haplotype approach as a graphical model within WinBUGS allows it to be combined almost arbitrarily with other models, thus facilitating proper accounting for uncertainty in drawing substantive inferences. The WinBUGS software is available from http://www.mrc-bsu.cam.ac.uk/bugs/ and a plug-in for the pseudo-haplotype model can be obtained from the first author.

Another advantage of our approach is that it reflects the retrospective way in which the data are typically collected, i.e. individuals are genotyped after being identified as cases or controls. Thus by modifying our approach to allow $\mu$ (and possibly $\Sigma$) to depend on case-control status, we may model the distribution of genotypes (or haplotypes) conditional on phenotypes. The vast majority of genetic models specify a prospective likelihood instead, i.e. they model the distribution of phenotypes conditional on genotypes, but this can lead to bias in a Bayesian setting (Seaman and Richardson, 2001, 2004). Of course, expressing $\mu$ in terms of phenotypes means that we now wish to make inferences about $\mu$, and identifiability becomes a major issue. In this case, however, we could adopt the sampling scheme of Chib and Greenberg (1998) for updating $\Sigma$ in correlation form.
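The retrospective extension described above can be written compactly. As a sketch only, let $y_i \in \{0, 1\}$ denote the case-control status of individual $i$, let $\psi_{1i}$ and $\psi_{2i}$ denote that individual's two pseudo-haplotype vectors, and let $\mu$ and $\Sigma$ denote the mean and covariance of their multivariate normal distribution. Indexing these parameters by $y_i$ is an assumed notation introduced here for illustration, not one taken from the paper:

```latex
\psi_{ki} \mid y_i \sim \mathrm{N}_Q\!\left(\mu_{y_i}, \Sigma_{y_i}\right), \qquad k = 1, 2,
\qquad
g_{ij} = I(\psi_{1ij} \geq 0) + I(\psi_{2ij} \geq 0), \qquad j = 1, \ldots, Q.
```

Under this form the genotypes are modelled conditional on phenotype, $p(g_i \mid y_i)$, whereas a prospective likelihood would instead specify $p(y_i \mid g_i)$.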

Acknowledgements
The support of the Medical Research Council (award G90/82) is gratefully acknowledged. We would also like to thank Nicky Best and Paul Burton for helpful discussions.


References
Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data, Journal of the American Statistical Association 88: 669-679.
Amemiya, T. (1972). Bivariate probit analysis: minimum chi-square methods, Journal of the American Statistical Association 69: 940-944.
Ashford, J. R. and Sowden, R. R. (1970). Multivariate probit analysis, Biometrics 26: 535-546.
Balding, D. J., Bishop, M. and Cannings, C. (2003). Handbook of Statistical Genetics, 2nd Edition, John Wiley & Sons, Chichester.
Carlson, C. S., Aldred, S. F., Lee, P. K., Tracy, R. P., Schwartz, S. M., Rieder, M., Liu, K., Williams, O. D., Iribarren, C., Lewis, E. C., Fornage, M., Boerwinkle, E., Gross, M., Jaquish, C., Nickerson, D. A., Myers, R. M., Siscovick, D. S. and Reiner, A. P. (2005). Polymorphisms within the C-Reactive Protein (CRP) promoter region are associated with plasma CRP levels, American Journal of Human Genetics 77: 64-77.
Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models, Biometrika 85: 347-361.
Clark, A. G. (1990). Inference of haplotypes from PCR-amplified samples of diploid populations, Molecular Biology and Evolution 7: 111-122.
Clayton, D. (2002). SNPHAP - A program for estimating frequencies of large haplotypes of SNPs, Department of Medical Genetics, Cambridge Institute for Medical Research, Cambridge.
Clayton, D., Chapman, C. and Cooper, C. (2004). The use of unphased multilocus genotype data in indirect association studies, Genetic Epidemiology 27: 415-428.
Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcomes, Journal of the Royal Statistical Society, Series B 62: 355-366.
Excoffier, L. and Slatkin, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population, Molecular Biology and Evolution 12: 921-927.
Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association 85: 398-409.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721-741.
Hodgkinson, C. A., Goldman, D., Jaeger, J., Persaud, S., Kane, J. M., Lipsky, R. H. and Malhotra, A. K. (2004). Disrupted in Schizophrenia 1 (DISC1): Association with schizophrenia, schizoaffective disorder, and bipolar disorder, American Journal of Human Genetics 75: 862-872.
Hudson, R. R. (1991). Gene genealogies and the coalescent process, in D. Futuyma and J. Antonovics (eds), Oxford Surveys in Evolutionary Biology, Volume 7, Oxford University Press, pp. 1-44.
Kerem, B., Rommens, J. M., Buchanan, J. A., Markiewicz, D., Cox, T. K., Chakravarti, A., Buchwald, M. and Tsui, L. C. (1989). Identification of the cystic fibrosis gene: genetic analysis, Science 245: 1073-1080.
Lauritzen, S. L., Dawid, A. P., Larsen, B. N. and Leimer, H. G. (1990). Independence properties of directed Markov fields, Networks 20: 491-505.
Lunn, D. J., Thomas, A., Best, N. and Spiegelhalter, D. (2000). WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility, Statistics and Computing 10: 325-337.
Lunn, D. J., Whittaker, J. C. and Best, N. (2005). A Bayesian toolkit for genetic association studies, Genetic Epidemiology. In press.
McPeek, M. S. and Strahs, A. (1999). Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping, American Journal of Human Genetics 65: 858-875.
Molitor, J., Marjoram, P. and Thomas, D. (2003). Fine-scale mapping of disease genes with multiple mutations via spatial clustering techniques, American Journal of Human Genetics 73: 1368-1384.
Morris, A. P., Whittaker, J. C. and Balding, D. J. (2002). Fine scale mapping of disease loci via shattered coalescent modelling of genealogies, American Journal of Human Genetics 70: 686-707.
Qin, Z. S., Niu, T. and Liu, J. S. (2002). Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms, American Journal of Human Genetics 71: 1242-1247.
Seaman, S. R. and Richardson, S. (2001). Bayesian analysis of case-control studies with categorical covariates, Biometrika 88: 1073-1088.
Seaman, S. R. and Richardson, S. (2004). Equivalence of prospective and retrospective models in the Bayesian analysis of case-control studies, Biometrika 91: 15-25.
Shi, J.-Q. and Lee, S.-Y. (2000). Latent variable models with mixed continuous and polytomous data, Journal of the Royal Statistical Society, Series B 62: 77-87.
Spiegelhalter, D. J. (1998). Bayesian graphical modelling: a case-study in monitoring health outcomes, Applied Statistics 47: 115-133.
Spiegelhalter, D., Thomas, A., Best, N. and Lunn, D. (2003). WinBUGS User Manual, Version 1.4, Medical Research Council Biostatistics Unit, Cambridge.
Stephens, M. and Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data, American Journal of Human Genetics 73: 1162-1169.
Stephens, M., Smith, N. J. and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data, American Journal of Human Genetics 68: 978-989.
Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation, Journal of the American Statistical Association 82: 528-549.
Verzilli, C. J., Stallard, N. and Whittaker, J. C. (2005). Bayesian modelling of multivariate quantitative traits using seemingly unrelated regressions, Genetic Epidemiology 28: 313-325.
Xu, H., Wu, X., Spitz, M. R. and Shete, S. (2004). Comparison of haplotype inference methods using genotypic data from unrelated individuals, Human Heredity 58: 63-68.

Appendix: WinBUGS code


Here we present annotated WinBUGS code for fitting our pseudo-haplotype model to a set of genotypes measured on N individuals at Q SNP loci. The genotypes themselves, denoted G[i, j], i = 1, ..., N, j = 1, ..., Q, are specified in a data set (along with the values of N and Q) that is loaded into WinBUGS separately; missing values in this data set are denoted by NA. The code is reasonably self-explanatory but some notes, pertaining to the line numbers given in the right-hand margin, are provided for clarity.

   model {
      for (i in 1:N) {                                      #1
         for (j in 1:Q) {                                   #2
            G[i, j] ~ dgene.aux(Psi1[i, j], Psi2[i, j])     #3
            H1[i, j] <- step(Psi1[i, j])                    #4
            H2[i, j] <- step(Psi2[i, j])                    #5
         }                                                  #6
         Psi1[i, 1:Q] ~ dmnorm(mu[1:Q], T[1:Q, 1:Q])        #7
         Psi2[i, 1:Q] ~ dmnorm(mu[1:Q], T[1:Q, 1:Q])        #8
      }                                                     #9
      mu[1:Q] ~ dmnorm(mu.0[1:Q], mu.prec[1:Q, 1:Q])        #10
      T[1:Q, 1:Q] ~ dwish(T.matrix[1:Q, 1:Q], Q)            #11
   }

Line 3: In the mathematical description of our model in Section 3.1, genotypes are expressed as deterministic functions of pseudo-haplotypes: $g_{ij} = I(\psi_{1ij} \geq 0) + I(\psi_{2ij} \geq 0)$. However, WinBUGS will not allow any node in the specified graph to be both deterministic and observed. Thus we construct a degenerate distribution dgene.aux(arg1, arg2) to trick WinBUGS into believing that genotypes are stochastic variables. This distribution has support {0, 1, 2} and the probabilities associated with each value are I(arg1 < 0, arg2 < 0), I(arg1 × arg2 < 0) and I(arg1 ≥ 0, arg2 ≥ 0) for 0, 1 and 2, respectively.

Lines 4 & 5: The step(.) function in WinBUGS returns the value one if and only if its argument is greater than or equal to zero; the value zero is returned otherwise. These two lines are not an essential feature of the model; they are presented merely to demonstrate how to define haplotypes for input into another model.

Lines 7 & 8: The multivariate normal distribution in WinBUGS, denoted dmnorm(.,.), is parameterized in terms of precision as opposed to covariance. Hence the T[.,.] matrix represents $T = \Sigma^{-1}$.

Lines 10 & 11: The parameters of these prior distributions, multivariate normal and Wishart respectively, are fixed and specified as part of the data set. The values of mu.0, mu.prec and T.matrix are given by 0 , IQ and Q0 , respectively.
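The roles of step(.) and dgene.aux in the notes above can be illustrated outside WinBUGS. The following is a minimal Python sketch under stated assumptions, not the authors' plug-in: the function names `step`, `haplotypes`, `genotype` and `dgene_aux_pmf` are hypothetical and chosen only to mirror the WinBUGS code, and the numerical values are made up.

```python
def step(x):
    """Mimic the WinBUGS step(.) function: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0


def haplotypes(psi1, psi2):
    """Map two pseudo-haplotype vectors (one per strand) to 0/1 haplotypes."""
    return [step(p) for p in psi1], [step(p) for p in psi2]


def genotype(psi1_j, psi2_j):
    """Genotype at one locus: g = I(psi1 >= 0) + I(psi2 >= 0)."""
    return step(psi1_j) + step(psi2_j)


def dgene_aux_pmf(g, arg1, arg2):
    """Degenerate probability mass on {0, 1, 2}, in the spirit of dgene.aux.

    All mass sits on the genotype implied by the signs of the two
    arguments, so the result is 1.0 if g matches and 0.0 otherwise.
    Assumption: an argument exactly equal to zero is treated as
    non-negative, matching step(.); under a continuous model this
    boundary case has probability zero.
    """
    return 1.0 if g == genotype(arg1, arg2) else 0.0


# Illustrative pseudo-haplotypes at Q = 3 loci (not data from the paper).
psi1 = [0.7, -1.2, 0.0]
psi2 = [-0.3, -0.4, 2.1]
h1, h2 = haplotypes(psi1, psi2)
g = [genotype(a, b) for a, b in zip(psi1, psi2)]
```

Because the mass function is an indicator, an observed genotype simply restricts the signs of the two pseudo-haplotype values at that locus; a missing genotype (NA) imposes no restriction.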


              Number of errors
Method        0    1    2    3    4    6    8   10   12   16
WinBUGS      42   44   53   56   75   84   91   96   99  100
PHASE 2.02   41   46   52   52   69   87   92   97   99  100
PHASE 2.11   39   43   47   49   71   89   95   98   99  100
SNPHAP       34   39   45   47   67   82   91   99   99  100

Table 1: Percentages of individuals in full data set for which each method's posterior mode haplotype pair contains less than or equal to the specified number of errors.
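The cumulative percentages reported in a table of this kind can be computed from per-individual error counts. The Python sketch below is illustrative only: the helper names `pair_errors` and `cumulative_percentages` are hypothetical, the haplotype pairs are made up, and the convention of trying both strand labellings and taking the smaller mismatch count is an assumption, since the exact error-counting rule is not stated here.

```python
def pair_errors(est, true):
    """Count mismatched haplotype elements between two unordered pairs.

    Assumption: strand labels are arbitrary, so both pairings are tried
    and the smaller mismatch count is returned.
    """
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    (e1, e2), (t1, t2) = est, true
    return min(hamming(e1, t1) + hamming(e2, t2),
               hamming(e1, t2) + hamming(e2, t1))


def cumulative_percentages(error_counts, thresholds):
    """Percentage of individuals with at most k errors, for each k."""
    n = len(error_counts)
    return [100.0 * sum(e <= k for e in error_counts) / n for k in thresholds]


# Toy example with made-up haplotype pairs (not the cystic fibrosis data).
true_pairs = [([1, 0, 1], [0, 0, 1]),
              ([1, 1, 0], [0, 1, 0]),
              ([0, 0, 0], [1, 1, 1])]
est_pairs = [([0, 0, 1], [1, 0, 1]),
             ([1, 1, 0], [0, 0, 0]),
             ([1, 0, 0], [0, 1, 1])]
errors = [pair_errors(e, t) for e, t in zip(est_pairs, true_pairs)]
table_row = cumulative_percentages(errors, [0, 1, 2])
```

Each row of such a table is non-decreasing from left to right by construction, which provides a simple consistency check on the tabulated values.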

              Number of errors
Method        0    1    2    3    4    6    8   10   12   16
WinBUGS      42   45   54   57   83   88   93   99   99  100
PHASE 2.02   45   46   52   57   77   90   96   96   99  100
PHASE 2.11   51   57   61   62   81   90   94   94   99  100
SNPHAP       39   42   55   55   80   88   91   96   99  100

Table 2: Percentages of individuals in reduced data set for which each method's posterior mode haplotype pair contains less than or equal to the specified number of errors.

[Figure 1 diagram: nodes $\psi_{1i}$, $\psi_{2i}$, $H_{1i}$, $H_{2i}$ and $G_i$ within a plate over i = 1, ..., N.]

Figure 1: Graphical model depicting pseudo-haplotype approach to haplotype reconstruction. The notation is defined as follows. Circular nodes are used to represent all quantities in the underlying statistical model. Directed links show the nature (and direction) of dependence between nodes. Solid links denote stochastic dependence whereas dashed links represent logical (or deterministic) dependence. Repetitive structures, such as the loop from i = 1 to i = N, are denoted by rectangular plates.

[Figure 2 plots, panels (a) and (b): x-axis, posterior probability (pseudo-haplotype); y-axis, posterior probability (PHASE).]

Figure 2: Posterior probabilities for true haplotype pairs from PHASE 2.02 and PHASE 2.11 (distinguished by plotting symbol) plotted against those obtained via pseudo-haplotype approach: (a) full data set; (b) reduced data set. The diagonal line is the identity, y = x.

[Figure 3 plot: x-axis, posterior probability (pseudo-haplotype); y-axis, posterior probability (PHASE).]

Figure 3: Reduced data set. Marginal posterior probabilities for true values of missing genotypes from PHASE 2.02 and PHASE 2.11 (distinguished by plotting symbol) plotted against those obtained via pseudo-haplotype approach. The diagonal line is the identity, y = x.

[Figure 4 bar charts, panels (a) and (b): x-axis, posterior probability in bins 0-.1, .1-.2, ..., .9-1; y-axis, relative frequency.]

Figure 4: Reduced data set. Bar charts showing the relative frequencies of marginal posterior probabilities, for the true values of uncertain haplotype elements, lying in the ranges 0-0.1, 0.1-0.2, ..., 0.9-1: (a) haplotypes uncertain due to corresponding genotypes missing (n = 144); (b) haplotypes uncertain due to corresponding genotypes equal to one (n = 1108). Dark, grey and white bars represent PHASE 2.11, PHASE 2.02 and WinBUGS, respectively.
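The binning behind bar charts of this kind is straightforward to reproduce. The following Python sketch is a hypothetical illustration rather than the authors' code: the function name `relative_frequencies` is an assumption, the probabilities are made up, and placing a probability of exactly one in the last bin is a convention chosen here.

```python
def relative_frequencies(probs, n_bins=10):
    """Relative frequency of probabilities in equal-width bins on [0, 1].

    Bin k covers [k / n_bins, (k + 1) / n_bins); a probability of
    exactly 1 is placed in the last bin (an assumed convention).
    """
    counts = [0] * n_bins
    for p in probs:
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(probs)
    return [c / total for c in counts]


# Made-up marginal posterior probabilities for true haplotype elements.
probs = [0.05, 0.95, 0.97, 1.0, 0.55, 0.92, 0.99, 0.45]
freqs = relative_frequencies(probs)
```

A method that assigns high posterior probability to the true values concentrates its relative frequency in the right-most bins.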
