Title: Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering
Abstract: Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available. Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available. Multilocus analysis can provide improved power to detect associations between complex traits and densely spaced genetic markers, compared with that of single-marker methods.1Browning BL Browning SR Efficient multilocus association mapping for whole genome association studies using localized haplotype clustering.Genet Epidemiol. 2007; 31: 365-375Crossref PubMed Scopus (110) Google Scholar Most methods for multilocus analysis that are suitable for whole-genome association data require phased haplotypes because methods that allow for uncertainty in haplotype phase typically use small sliding windows of markers, which cannot make full use of the correlation structure in the data. Inference of haplotype phase for whole-genome association data can be performed with a high degree of accuracy, as we demonstrate, but is computationally challenging and requires methods that scale well to thousands of individuals, as well as to hundreds of thousands or millions of genetic markers. It has been common to compare haplotype-inference methods with HapMap data2The International HapMap Consortium A haplotype map of the human genome.Nature. 2005; 437: 1299-1320Crossref PubMed Scopus (4545) Google Scholar, 3Marchini J Cutler D Patterson N Stephens M Eskin E Halperin E Lin S Qin ZS Munro HM Abecasis GR et al.A comparison of phasing algorithms for trios and unrelated individuals.Am J Hum Genet. 2006; 78: 437-450Abstract Full Text Full Text PDF PubMed Scopus (236) Google Scholar; however, because each of the ethnicities in the HapMap study has at most 60 unrelated individuals, results from such comparisons are not necessarily good predictors of haplotype-phasing performance for data sets with thousands of individuals. We propose a new haplotype-inference method and show that our method outperforms existing methods in terms of both computational speed and measures of accuracy for large whole-genome data sets with thousands of individuals and hundreds of thousands of or even a million genetic markers. The method is one or two orders of magnitude faster than the most accurate competing methods, enabling accurate haplotype phasing of data from whole-genome association studies in a few days of computing time, instead of months or years. An efficient software implementation of our method is freely available in version 2.1 of the Beagle genetic analysis software package,1Browning BL Browning SR Efficient multilocus association mapping for whole genome association studies using localized haplotype clustering.Genet Epidemiol. 2007; 31: 365-375Crossref PubMed Scopus (110) Google Scholar which is written in Java and includes software for haplotype and missing-data inference, single-marker and multilocus association analysis, and permutation testing. Early methods for haplotype inference were based on a multinomial model for haplotype frequencies that used no prior information about the haplotype frequency distribution.4Excoffier L Slatkin M Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population.Mol Biol Evol. 1995; 12: 921-927PubMed Google Scholar, 5Long JC Williams RC Urbanek M An E-M algorithm and testing strategy for multiple-locus haplotypes.Am J Hum Genet. 1995; 56: 799-810PubMed Google Scholar, 6Hawley ME Kidd KK HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes.J Hered. 1995; 86: 409-411PubMed Google Scholar Such methods include those commonly referred to as “expectation-maximization (EM) methods,” reflecting the algorithm that is used to maximize the likelihood. These methods should be referred to as “multinomial model methods,” because EM algorithms are also used in methods employing more-complex statistical models, such as fastPHASE7Scheet P Stephens M A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.Am J Hum Genet. 2006; 78: 629-644Abstract Full Text Full Text PDF PubMed Scopus (1343) Google Scholar and HaploRec.8Eronen L Geerts F Toivonen H HaploRec: efficient and accurate large-scale reconstruction of haplotypes.BMC Bioinformatics. 2006; 7: 542Crossref PubMed Scopus (43) Google Scholar The multinomial model methods work fairly well on a small handful of markers but break down with larger numbers of markers. When the number of markers increases, so does the number of observable haplotypes, and the frequencies of these haplotypes become too small to estimate directly. Also, the computational time quickly becomes intractable, since all feasible haplotypes must be considered. Extensions such as partition-ligation (PL)–EM9Qin ZS Niu T Liu JS Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms.Am J Hum Genet. 2002; 71: 1242-1247Abstract Full Text Full Text PDF PubMed Scopus (422) Google Scholar have extended the usefulness of the multinomial model to a somewhat larger number of markers, but multinomial methods are still less accurate than are other types of models.8Eronen L Geerts F Toivonen H HaploRec: efficient and accurate large-scale reconstruction of haplotypes.BMC Bioinformatics. 2006; 7: 542Crossref PubMed Scopus (43) Google Scholar PHASE10Stephens M Smith NJ Donnelly P A new statistical method for haplotype reconstruction from population data.Am J Hum Genet. 2001; 68: 978-989Abstract Full Text Full Text PDF PubMed Scopus (6195) Google Scholar and later related methods applied a coalescent model to haplotype frequencies. This model implies that haplotypes similar to ones we have already seen are more likely to be seen than are completely different haplotypes, since changes to haplotypes occur through recombination and mutation. Although it produces very accurate results, application of the coalescent model is computationally intensive, limiting its applicability for large data sets. A variety of other approaches has been proposed, some of which make use of haplotype blocks.11Halperin E Eskin E Haplotype reconstruction from genotype data using imperfect phylogeny.Bioinformatics. 2004; 20: 1842-1849Crossref PubMed Scopus (170) Google Scholar Although the concept of haplotype blocks is useful, haplotype blocks do not adequately explain all the correlation structure between markers, because linkage disequilibrium (LD) can extend beyond block boundaries and can have complex patterns within blocks.2The International HapMap Consortium A haplotype map of the human genome.Nature. 2005; 437: 1299-1320Crossref PubMed Scopus (4545) Google Scholar Another haplotype-phasing method is needed because all existing methods for haplotype inference either are too slow for routine application to whole-genome association studies or have severely suboptimal accuracy. We address these two issues with the method that we propose here. Our method is a novel application of the recently proposed localized haplotype-cluster model that has been used for association testing.1Browning BL Browning SR Efficient multilocus association mapping for whole genome association studies using localized haplotype clustering.Genet Epidemiol. 2007; 31: 365-375Crossref PubMed Scopus (110) Google Scholar, 12Browning SR Multilocus association mapping using variable-length Markov chains.Am J Hum Genet. 2006; 78: 903-913Abstract Full Text Full Text PDF PubMed Scopus (103) Google Scholar The localized haplotype-cluster model is an empirical LD model that adapts to the local structure in the data. Relative to other methods, it does particularly well with large sample sizes, where the data are, in a sense, allowed to speak for themselves. The model can be fit to haplotypes by use of an algorithm that is very fast, and we apply an iterative approach to haplotype phasing in which an initial guess of haplotype phase is made, the model is fit, improved estimates of haplotype phase are obtained, the model is refit, and so forth. This is essentially an EM approach, as are most other non-Bayesian haplotype-inference methods. (Bayesian methods also employ iteration, by means of Markov chain–Monte Carlo techniques.) We compared the speed and accuracy of our proposed method with those of a selection of alternative haplotype-inference methods. We excluded any method that is excessively slow or whose current implementation does not allow it to be applied to at least a few hundred densely spaced genetic markers. We also excluded methods that showed significantly lower accuracy than that of competing methods in the only previous study that considered large numbers of individuals.8Eronen L Geerts F Toivonen H HaploRec: efficient and accurate large-scale reconstruction of haplotypes.BMC Bioinformatics. 2006; 7: 542Crossref PubMed Scopus (43) Google Scholar In particular, we excluded Gerbil13Kimmel G Shamir R GERBIL: genotype resolution and block identification using likelihood.Proc Natl Acad Sci USA. 2005; 102: 158-162Crossref PubMed Scopus (107) Google Scholar (version 1.0) and PL-EM9Qin ZS Niu T Liu JS Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms.Am J Hum Genet. 2002; 71: 1242-1247Abstract Full Text Full Text PDF PubMed Scopus (422) Google Scholar (version 1.5), which had low accuracy relative to that of HaploRec.8Eronen L Geerts F Toivonen H HaploRec: efficient and accurate large-scale reconstruction of haplotypes.BMC Bioinformatics. 2006; 7: 542Crossref PubMed Scopus (43) Google Scholar The methods we chose for comparison, on the basis of these criteria, were fastPHASE (version 1.2.3), HaploRec (version 2.1), HAP (version 2.9), and 2SNP (version 1.5.1, for 64-bit architecture). All methods were used with the default settings except as noted. fastPHASE7Scheet P Stephens M A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.Am J Hum Genet. 2006; 78: 629-644Abstract Full Text Full Text PDF PubMed Scopus (1343) Google Scholar uses a haplotype-clustering model with a fixed number of clusters. For data sets with small numbers of individuals, such as the HapMap data, fastPHASE is almost as accurate as PHASE but is much faster.7Scheet P Stephens M A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.Am J Hum Genet. 2006; 78: 629-644Abstract Full Text Full Text PDF PubMed Scopus (1343) Google Scholar Although an earlier study found that fastPHASE performed poorly compared with HaploRec for large data sets,8Eronen L Geerts F Toivonen H HaploRec: efficient and accurate large-scale reconstruction of haplotypes.BMC Bioinformatics. 2006; 7: 542Crossref PubMed Scopus (43) Google Scholar we included it because we knew that we could decrease the running time by turning off the default cross-validation procedure (as recommended for data sets with only several hundred markers) and because we felt that the default choices of 5, 10, and 15 for the number of clusters, K, used in the previously published comparison of fastPHASE version 1.1.3 and HaploRec were too low for accurate phasing of large numbers of individuals. Because the default numbers of clusters that are considered by fastPHASE version 1.2.3 in the cross-validation procedure (which was omitted to reduce running times) are K=10 and K=20, we ran fastPHASE with each of these values. HaploRec8Eronen L Geerts F Toivonen H HaploRec: efficient and accurate large-scale reconstruction of haplotypes.BMC Bioinformatics. 2006; 7: 542Crossref PubMed Scopus (43) Google Scholar uses frequencies of haplotype fragments in a segmentation model (HaploRec-S) or in a variable-order Markov-chain model (HaploRec-VMM). HaploRec was run with increased memory allocation (as was necessary for its operation) and with both the segmentation model and the variable-order Markov model. The segmentation model is generally slower but more accurate than is the variable-order Markov model.8Eronen L Geerts F Toivonen H HaploRec: efficient and accurate large-scale reconstruction of haplotypes.BMC Bioinformatics. 2006; 7: 542Crossref PubMed Scopus (43) Google Scholar One limitation of HaploRec is that it does not currently have the capability to impute missing data, which limits its usefulness for genetic association studies. HAP11Halperin E Eskin E Haplotype reconstruction from genotype data using imperfect phylogeny.Bioinformatics. 2004; 20: 1842-1849Crossref PubMed Scopus (170) Google Scholar uses haplotype blocks with imperfect phylogeny constraints that limit inference of haplotypes that would imply back-mutations or recombinations within blocks. 2SNP14Brinza D Zelikovsky A 2SNP: scalable phasing based on 2-SNP haplotypes.Bioinformatics. 2006; 22: 371-373Crossref PubMed Scopus (31) Google Scholar uses a very fast algorithm based on consideration of two-marker haplotypes. We first describe the localized haplotype-cluster model, which is a special class of directed acyclic graph. We show how the localized haplotype-cluster model defines a hidden Markov model (HMM) that can be used to sample haplotype pairs or to find the most likely haplotype pair for each individual conditional on the individual’s genotypes. We then describe the phasing algorithm, which involves iteratively sampling haplotype pairs and building the localized haplotype-cluster model from the sampled haplotype pairs. We conclude with a description of the real and simulated data sets and the metrics we used to compare haplotype-inference algorithms. Correlation between markers is a localized phenomenon, since LD decays with distance. If this localization is not considered when genotype data are phased over an extended region, noise will be introduced by sampling variation, resulting in apparent correlations observed between distant markers, which reduces the accuracy of the haplotype inference. Existing approaches to this problem include explicitly modeling the recombination in the coalescent model,15Stephens M Scheet P Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation.Am J Hum Genet. 2005; 76: 449-462Abstract Full Text Full Text PDF PubMed Scopus (1068) Google Scholar using haplotype blocks,11Halperin E Eskin E Haplotype reconstruction from genotype data using imperfect phylogeny.Bioinformatics. 2004; 20: 1842-1849Crossref PubMed Scopus (170) Google Scholar, 13Kimmel G Shamir R GERBIL: genotype resolution and block identification using likelihood.Proc Natl Acad Sci USA. 2005; 102: 158-162Crossref PubMed Scopus (107) Google Scholar taking a sliding-window approach with window size varying with LD structure,16Excoffier L Laval G Balding D Gametic phase estimation over large genomic regions using an adaptive window approach.Hum Genomics. 2003; 1: 7-19Crossref PubMed Scopus (84) Google Scholar employing an HMM,7Scheet P Stephens M A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.Am J Hum Genet. 2006; 78: 629-644Abstract Full Text Full Text PDF PubMed Scopus (1343) Google Scholar and using frequent haplotype fragments.8Eronen L Geerts F Toivonen H HaploRec: efficient and accurate large-scale reconstruction of haplotypes.BMC Bioinformatics. 2006; 7: 542Crossref PubMed Scopus (43) Google Scholar Our approach to making use of the localized LD structure is a localized haplotype-cluster model,1Browning BL Browning SR Efficient multilocus association mapping for whole genome association studies using localized haplotype clustering.Genet Epidemiol. 2007; 31: 365-375Crossref PubMed Scopus (110) Google Scholar, 12Browning SR Multilocus association mapping using variable-length Markov chains.Am J Hum Genet. 2006; 78: 903-913Abstract Full Text Full Text PDF PubMed Scopus (103) Google Scholar which empirically models haplotype frequencies on a local scale. The localized haplotype-cluster model clusters haplotypes at each marker to improve prediction of alleles at markers t+1, t+2, t+3,…, given alleles at markers t, t-1, t-2,… on a haplotype. This is achieved by defining clusters according to a Markov property—given cluster membership at position t, the sequence of alleles at markers t, t-1, t-2,… is irrelevant for predicting the sequence of alleles at markers t+1, t+2, t+3,…. The clustering is localized, so that haplotypes in the same cluster at position t are likely to be in the same cluster at position t+1 but need not be. This model has a number of important advantages. By clustering the haplotypes, a parsimonious model is obtained, which is important for obtaining good estimates of haplotype frequencies. By allowing the number of clusters and the relationships between clusters at different positions to be determined largely by the data rather than by a restrictive model, the model adapts to the data, which is particularly useful when the number of individuals in the sample is large, as it will be for well-powered association studies. Finally, the model can be fit using a computationally efficient algorithm1Browning BL Browning SR Efficient multilocus association mapping for whole genome association studies using localized haplotype clustering.Genet Epidemiol. 2007; 31: 365-375Crossref PubMed Scopus (110) Google Scholar, 12Browning SR Multilocus association mapping using variable-length Markov chains.Am J Hum Genet. 2006; 78: 903-913Abstract Full Text Full Text PDF PubMed Scopus (103) Google Scholar, 17Ron D Singer Y Tishby N On the learnability and usage of acyclic probabilistic finite automata.J Comput Systems Sci. 1998; 56: 133-152Crossref Scopus (67) Google Scholar that is extremely fast. Suppose that we have a sample of haplotypes for M markers and that the haplotypes have no missing alleles. A localized haplotype-cluster model for this sample is a directed acyclic graph with the following four properties:1. The graph has one root (initial) node with no incoming edges and has one terminal node with no outgoing edges. The root node represents all haplotypes before any markers are processed, whereas the terminal node represents all haplotypes after all markers are processed.2. The graph is leveled with M+1 levels. Each node A has a level, m. All incoming edges to A have the parent (originating) node at level m-1, and all outgoing edges from A have the child (destination) node at level m+1. The root node has level 0, and the terminal node has level M.3. For each m=1,2,…,M, each edge with the child node at level m is labeled with an allele for the mth marker. Two edges originating from the same parent node cannot be labeled with the same allele.4. For each haplotype in the sample, there is a path from the root node to the terminal node, such that the mth allele of the haplotype is the label of the mth edge of the path. Each edge of the graph has at least one haplotype in the sample whose path traverses the edge. Each edge, e, of the graph represents a cluster of haplotypes consisting of all haplotypes whose path from the initial node to the terminal node of the graph traverses e. Haplotypes are defined over the whole chromosome, but haplotypes within a cluster corresponding to an edge at level m will tend to have similar patterns of alleles at markers immediately to the right of marker m. Thus, each edge defines a localized haplotype cluster that is determined by local LD patterns. For each edge, e, of a localized haplotype-cluster model, we define the edge count, n(e), to be the number of haplotypes in the sample whose path traverses the edge, and we define the parent node count, np(e), to be the number of haplotypes in the sample whose path traverses the parent node of the edge. Table 1 and figure 1 illustrate these concepts. The data in table 1 do not correspond to any real data set but are merely for illustration. The bold-line edges from the root node to the terminal node in figure 1 represent the haplotype 2112. Edge eF includes haplotypes 1111, 1112, 2111, and 2112, with total count n(eF)=21+79+25+112=237 (counts taken from table 1). The node marked by an asterisk (*) is the parent node for edge eF. This node includes the four haplotypes whose paths traverse edge eF and, additionally, haplotypes 1122 and 2122, whose paths traverse edge eG, and it has countnp(eF)=n(eF)+n(eG)=237+(95+152)=484 .Table 1Example Haplotype Counts for Figure 1HaplotypeCount111121111279112295122111621112521121122122152 Open table in a new tab Localized haplotype-cluster models are especially well suited for modeling LD structure. Recombination between haplotypes is modeled as merging edges (i.e., edges with the same child node).12Browning SR Multilocus association mapping using variable-length Markov chains.Am J Hum Genet. 2006; 78: 903-913Abstract Full Text Full Text PDF PubMed Scopus (103) Google Scholar Unlike haplotype-block–based models, which permit recombination only between haplotype blocks, localized haplotype-cluster models can model the complex recombination patterns found in real data. Unlike models with fixed numbers of clusters at each locus, the localized haplotype-cluster model is flexible and can vary the number of clusters at each locus to model the data. A computationally efficient algorithm for fitting a localized haplotype model to haplotype data has been described in detail elsewhere.1Browning BL Browning SR Efficient multilocus association mapping for whole genome association studies using localized haplotype clustering.Genet Epidemiol. 2007; 31: 365-375Crossref PubMed Scopus (110) Google Scholar, 12Browning SR Multilocus association mapping using variable-length Markov chains.Am J Hum Genet. 2006; 78: 903-913Abstract Full Text Full Text PDF PubMed Scopus (103) Google Scholar The algorithm scales linearly in the number of markers and slightly more than linearly (less than quadratically) in the number of individuals,1Browning BL Browning SR Efficient multilocus association mapping for whole genome association studies using localized haplotype clustering.Genet Epidemiol. 2007; 31: 365-375Crossref PubMed Scopus (110) Google Scholar so that it can quickly process large-scale data sets. Since the model-fitting algorithm is a published algorithm, we do not repeat the details of the model-fitting algorithm in this work but refer the reader to our earlier work,12Browning SR Multilocus association mapping using variable-length Markov chains.Am J Hum Genet. 2006; 78: 903-913Abstract Full Text Full Text PDF PubMed Scopus (103) Google Scholar which contains a detailed worked example of fitting a localized haplotype-cluster model to the haplotype data in table 1. We show below that localized haplotype-cluster models can be interpreted as a special class of HMMs. By viewing localized haplotype-cluster models as HMMs, we are able to extend the model to diplotypes and to use efficient HMM sampling algorithms. In “The Beagle Phasing Algorithm” section, we describe an iterative process for haplotype phasing that involves fitting a localized haplotype model to estimated haplotype data and sampling haplotype estimates conditional on the fitted localized haplotype model and the genotype data. A localized haplotype-cluster model determines an HMM for which the states of the HMM are the edges of the localized haplotype-cluster model, and the emitted symbol for each state is the allele that labels the edge of the localized haplotype-cluster model. To specify the HMM, we must specify the emission probabilities, the initial-state probabilities, and the transmission probabilities.18Rabiner LR A tutorial on hidden Markov models and selected applications in speech recognition.Proc IEEE. 1989; 77: 257-286Crossref Scopus (15047) Google Scholar Each state (i.e., edge) emits with probability 1 the allele that labels the edge. Thus, the state uniquely determines the observed allele, but the observed allele for a marker does not generally determine the state, because edges with distinct parent nodes at the same level of the graph can be labeled with the same allele. The initial-state probabilities and the transition probabilities are computed from the edge counts. The initial-state probabilities are P(e)=n(e)/np(e) if the parent node of edge e is the root node and are P(e)=0 otherwise. The transition probabilities are P(e1|e2)=n(e1)/np(e1) if the parent node of edge e1 is the child node of edge e2 and are P(e1|e2)=0 otherwise. Note that, if edges e2 and e3 both have the same child node, then P(e1|e3)=P(e1|e2). For example, by considering the haplotype counts in table 1 and the corresponding graph in figure 1, P(eF|eC)=P(eF|eE)=n(eF)/np(eF)=237/484=0.49, whereas P(eC|eE)=0 because the parent node of eC is not the child node of eE. We have described a haploid HMM; however, a diploid HMM is needed because we observe diploid data (i.e., genotypes) rather than haploid data. We create a diploid HMM from ordered pairs of edges in each level of the graph. Since the graph is leveled, the states of the haploid HMM (the edges of the graph) can be partitioned into classes Lm, where m=1,2,…,M is the level of the edge’s child node in the graph. For example, in figure 1, L1={eA,eB}, L2={eC,eD,eE}, L3={eF,eG,eH}, and L4={eI,eJ,eK,eL}. For the diploid HMM, the state space is the union over m of (Lm×Lm), and the emitted symbol for each state is the unordered pair of alleles that label the state’s ordered pair of edges. If (e1,e2) is a state of the diploid HMM, then the unordered genotype determined by the two alleles that label edges e1 and e2 is emitted with probability 1. We assume Hardy-Weinberg equilibrium, so that the diploid initial and transition probabilities are the product of the corresponding haploid probabilities: P(e1,e2)=P(e1)P(e2) and P[(e1,e2)|(e3,e4)]=P(e1|e3)P(e2|e4). Note that the edge pairs are ordered, so that factors of 2 are not needed when the ordered edge pairs represent heterozygote genotypes. Our phasing algorithm samples from the diploid HMM conditional on the observed data by use of a forwards-backwards algorithm (see the work of Rabiner18Rabiner LR A tutorial on hidden Markov models and selected applications in speech recognition.Proc IEEE. 1989; 77: 257-286Crossref Scopus (15047) Google Scholar for a review of the forwards-backwards and Viterbi algorithms for HMMs, and see section 7.1 in the work of Thompson19Thompson EA Statistical inference from genetic data on pedigrees. Vol 6. Institute of Mathematical Statistics, Beachwood, OH2000Google Scholar for an example of conditional HMM sampling). For a given individual, let gm be the observed unordered genotype at marker m, and let the state sm=(e1,e2) be an ordered pair of edges in Lm×Lm. For any sm in the diploid HMM and with the individual’s genotype {g1,g2,…,gM}, define the forward variables as αm(sm)=P(g1,g2,…,gm,sm). The αm(sm) for m=1,2,…,M can be computed inductively by the forward algorithm as follows.1. Initiation: α1(s1)=P(g1,s1)=P(s1)P(g1|s1).2. Induction:αm+1(sm+1)=P(g1,g2,…,gm+1,sm+1)=∑smP(g1,g2,…,gm,gm+1,sm,sm+1)=∑smP(g1,g2,…,gm,sm)P(gm+1|sm+1)P(sm+1|sm)=P(gm+1|sm+1)∑smαm(sm)P(sm+1|sm) . For our purposes, the αm need be known only up to a constant of proportionality for each m.Appendix A gives a worked example of forward calculation. The probabilities P(gm+1|sm+1) are always either zero or one, depending on whether the genotype is consistent with the labels on the ordered pair of edges. Our model could be extended to incorporate genotype error by allowing these probabilities to take intermediate values; however, for low error rates, we expect that genotype-error modeling will not add significant value. Our results (see the “Results” section)