Haplotype phase inference software tools | Population genetics data analysis
Two categories of computational methods exist for determining haplotypes: haplotype phasing and haplotype assembly. Given the genotypes of a sample of individuals from a population, haplotype phasing attempts to infer the haplotypes of the sample using haplotype sharing information within the sample. In the related problem of genotype imputation, a phased reference panel is used to infer missing markers and haplotype phase of the sample. Methods for haplotype phasing and imputation are based on a range of computational and statistical inference techniques, but they all rely on the fact that closely spaced markers tend to be in linkage disequilibrium and that small haplotype blocks are often shared in a population of seemingly unrelated individuals.
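To make the phasing problem concrete, the following minimal sketch enumerates every haplotype pair compatible with one unphased genotype. It assumes biallelic markers coded as 0, 1 or 2 copies of the alternate allele; the function name compatible_haplotype_pairs is illustrative and does not come from any tool listed here.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: genotype 0 -> allele 0, genotype 2 -> allele 1.
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = base[:], base[:]
        for site, b in zip(het_sites, bits):
            # At a heterozygous site the two haplotypes carry opposite alleles.
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


# Two heterozygous sites give two possible phasings.
for h1, h2 in compatible_haplotype_pairs([1, 0, 1, 2]):
    print(h1, h2)
```

A genotype with k heterozygous sites has 2^(k-1) compatible pairs for k >= 1, which is why phasing methods rely on haplotype sharing across the sample rather than on enumeration alone.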
Estimates haplotype phase either within a genotyped cohort or using a phased reference panel. Eagle2 is a phasing algorithm that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels using a new data structure based on the positional Burrows-Wheeler transform. Eagle2 differs from other hidden Markov model-based (HMM-based) methods in two key ways: (i) it efficiently represents the full haplotype structure in a way that losslessly condenses locally matching haplotypes, and (ii) it selectively explores the space of diplotypes in a way that only expends computation on the most likely phase paths.
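As background on the positional Burrows-Wheeler transform mentioned above, the sketch below computes only the positional prefix ordering of a small binary haplotype panel. It is the generic textbook construction, not Eagle2's own data structure, and the function name pbwt_prefix_order is an assumption made for illustration.

```python
def pbwt_prefix_order(haplotypes):
    """Order haplotype indices by their reversed prefixes, one site at a time.

    haplotypes: list of equal-length lists of 0/1 alleles.
    """
    order = list(range(len(haplotypes)))
    for k in range(len(haplotypes[0])):
        # Stable partition on the allele at site k: zeros first, then ones.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
    return order


panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
print(pbwt_prefix_order(panel))  # [4, 0, 3, 1, 2]
```

After each site, haplotypes that share a long match ending at that site sit next to each other in the ordering, which is what allows locally matching haplotypes to be stored compactly.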
Performs genotype calling, genotype phasing, imputation of ungenotyped markers, and identity-by-descent segment detection. Beagle can be applied to thousands of samples across genome-wide single nucleotide polymorphism (SNP) data. It can retrieve short tracts of identity by descent (IBD). This tool utilizes composite reference haplotypes to model large genomic regions with a parsimonious statistical model.
Assists users in assembling noisy single-molecule sequences. Canu introduces several features including computational resource discovery, adaptive k-mer weighting, automated error rate estimation, sparse graph construction, and graphical fragment assembly (GFA) outputs. This pipeline consists of three stages: correction, trimming, and assembly. Moreover, this tool can auto-detect available resources and configure itself to maximize resource utilization.
An algorithm for haplotype resolution and block partitioning. The algorithm uses a stochastic model for genotype generation, based on the biological finding that genotypes can be partitioned into blocks of low recombination rate, and in each block, a small number of common haplotypes is found. Our model uses the notion of a probabilistic common haplotype, which can have different forms in different genotypes, thereby accommodating errors, rare recombination events, and mutations. GERBIL was shown to be quick and accurate even when applied to many hundreds of individuals.
Resolves long haplotypes or infers missing genotypes in samples of unrelated individuals. Specifically, MACH can estimate haplotypes; impute missing genotypes in a variety of populations, using the HapMap sample or another set of densely genotyped individuals as a reference; analyze shotgun re-sequencing data from high-throughput technologies now being developed; and carry out simple tests of association.
Offers a platform for performing genome-wide association studies (GWAS) based on haplotypes. ParaHaplo is an application that relies on data parallelism to speed up the assessment of both haplotypes and P values. The application can be used in conjunction with other software for running: (i) genotype imputation and haplotype reconstruction; (ii) haplotype estimation; and (iii) haplotype-based GWAS.
Provides a method for efficiently phasing large data sets. HAPI-UR is an application developed for large genotype data sets of unrelated and/or trio and duo samples. Because the number of states that HAPI-UR uses in any window depends on the haplotype structure and diversity in an individual, the method adapts to the nature of the data set.
Determines alleles and haplotypes in a targeted copy number variation (CNV) region. CNVphaser is a standalone software enabling the processing of data from high-throughput experimental platforms. The application includes features for: (i) choosing among three types of initial values in the expectation-maximization (EM) algorithm; (ii) handling missing calls at several SNVC sites; and (iii) managing any number of variant base types.
Predicts haplotypes as well as the copy number and segregational origin of those haplotypes across the genome of a single cell. HiVA eases the discovery of genuine DNA copy-number changes and of their parental and mechanistic origin in single cells. It is integrated into software that allows single-cell genome-wide haplotyping and copy-number typing of the haplotypes in a cell. This tool can be applied to single cells from human cleavage-stage embryos.
Phases genotype data. AlphaPhase consists of a heuristic method that combines long-range phasing (LRP) and haplotype library imputation (HLI). The software implements methods that avoid the need for all-against-all searches by performing long-range phasing on subsets of individuals and then combining the results. It enables the phasing of millions of individuals genotyped on multiple single nucleotide polymorphism (SNP) arrays.
A statistical haplotype reconstruction algorithm targeted for large-scale disease association studies. HaploRec is especially suitable for data sets with a large number of subjects and a large number of possibly sparsely located markers.
Provides a method for haplotype inference in a pedigree. HAPLORE first assists users in determining haplotypes from a sample of individuals by applying a fixed set of rules. Then, the application uses a haplotype elimination algorithm both to estimate all possible configurations and to exclude irrelevant ones. Finally, a PL-EM algorithm evaluates haplotype frequencies using the selected configurations.
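The frequency-estimation step can be illustrated with a plain expectation-maximization (EM) loop. This is a minimal sketch of the standard EM for haplotype frequencies in unrelated individuals, assuming Hardy-Weinberg proportions and 0/1/2 genotype coding; it is not the partition-ligation (PL-EM) variant, it ignores pedigree information, and the names phasings and em_haplotype_frequencies are illustrative.

```python
from collections import defaultdict
from itertools import product


def phasings(genotype):
    """All unordered haplotype pairs compatible with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]
    seen = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = base[:], base[:]
        for s, b in zip(het, bits):
            h1[s], h2[s] = b, 1 - b
        seen.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(seen)


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    options = [phasings(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for opts in options:
            # E-step: posterior weight of each phasing under Hardy-Weinberg.
            weights = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in opts]
            total = sum(weights)
            for (a, b), w in zip(opts, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts become the new frequencies.
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


sample = [[0, 0], [2, 2], [1, 1], [0, 0], [2, 2]]
for hap, f in em_haplotype_frequencies(sample).items():
    print(hap, round(f, 3))
```

In this toy sample the double heterozygote is resolved toward the two haplotypes already seen in the homozygous individuals, which is the behaviour that makes EM-based frequency estimates useful for ranking configurations.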
An efficient method that combines multi-SNP read information with reference panels of haplotypes for improved genotype and haplotype inference in sequencing data. Unlike previous phasing methods that use read counts at each SNP as input, our method takes into account the information from reads spanning multiple SNPs. HARSH is able to efficiently find the likely haplotypes in terms of the marginal probability over the genotype data. Using simulations from HapMap and 1000 Genomes data, we show that our method achieves higher accuracy than existing approaches with decreased computational requirements.
Deduces population frequencies of a combination of allelic copy numbers and single nucleotide polymorphism (SNP) alleles. MOCSphaser is a standalone software which aims to assist users in population-genetic studies. This program is also able to consider ambiguous phenotypic copy numbers and to investigate copy number variation (CNV)-SNP haplotypes derived from a merging of phenotypic copy numbers at CNV loci and genotypes at SNP loci.
A program to improve haplotype reconstruction using paired-end short reads. Assuming that the users have run an existing phaser, HI processes the paired-end information in the raw data to form blocks of haplotypes and compares them with the output of a phasing tool (currently HI supports PHASE and fastPHASE). When inconsistencies are found, HI will decide whether or not, and at which loci, to change the haplotype reconstructions according to its calculations.
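The comparison between read evidence and a statistical phasing can be sketched as a simple tally. The example below is an illustration only, not HI's actual calculation: it assumes heterozygous biallelic sites, a phased haplotype given as a mapping from position to allele, and read pairs given as mappings from position to observed allele. The function name phase_support is hypothetical.

```python
def phase_support(h1, reads):
    """Count read pairs that agree or disagree with a proposed phasing.

    h1:    dict position -> allele (0/1) on the first haplotype, at
           heterozygous sites only (the second haplotype is the complement).
    reads: list of dicts position -> observed allele, one per read pair.
    Returns dict (pos_a, pos_b) -> [agree, disagree] for adjacent covered sites.
    """
    support = {}
    for read in reads:
        sites = sorted(s for s in read if s in h1)
        for a, b in zip(sites, sites[1:]):
            # A read pair comes from one chromosome, so it shows whether the
            # alleles at a and b are the same (cis) or different (trans).
            read_same = read[a] == read[b]
            phased_same = h1[a] == h1[b]
            tally = support.setdefault((a, b), [0, 0])
            tally[0 if read_same == phased_same else 1] += 1
    return support


phased = {10: 0, 20: 1, 30: 1}
read_pairs = [{10: 0, 20: 1}, {10: 1, 20: 0}, {20: 1, 30: 0}, {10: 1, 20: 1}]
print(phase_support(phased, read_pairs))
# {(10, 20): [2, 1], (20, 30): [0, 1]}
```

Site pairs where the disagreeing count dominates are candidates for a switch in the statistical phasing; how such conflicts are actually weighed is specific to each tool.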
Provides a fast program for haplotype-aware consequence calling which can take into account known phase. BCFtools/csq is part of the BCFtools package. Applied to the 1000 Genomes Project data, haplotype-aware consequence calling modifies the predictions for 501 of 5019 compound variants. Predictions match existing tools when run in localized mode, but the program is an order of magnitude faster and requires an order of magnitude less memory.
Provides a solution for polynomial arithmetic operations and for the haplotype inference problem. PTG is a software based on a parsimonious tree-grow method that finds the minimum number of distinct haplotypes under the criterion that all genotypes remain resolved during the tree-grow process. It has a low computational cost even for large-scale genomic data. It can also handle the case in which every genotype has more than one heterozygous site.
An algorithm that models linkage disequilibrium using a simple multivariate Gaussian distribution. The key feature of our algorithm is its speed. MarViN is hundreds of times faster than other methods on the same data set and its scaling behaviour is linear in the number of samples. We demonstrate the performance of the method on both low- and high-coverage samples.
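To show how a multivariate Gaussian model of linkage disequilibrium can fill in a missing genotype, the sketch below computes the standard Gaussian conditional mean of unobserved sites given observed ones. It treats genotype dosages as continuous values and takes the mean vector and covariance matrix as given; it is a generic illustration, not MarViN's implementation, and the helper names solve and conditional_mean are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def conditional_mean(mu, cov, observed):
    """Expected value of unobserved sites given observed dosages.

    mu: list of per-site means; cov: covariance matrix (list of lists);
    observed: dict site index -> observed dosage.
    """
    obs = sorted(observed)
    missing = [i for i in range(len(mu)) if i not in observed]
    cov_oo = [[cov[i][j] for j in obs] for i in obs]
    resid = [observed[i] - mu[i] for i in obs]
    w = solve(cov_oo, resid)  # cov_oo^{-1} (x_obs - mu_obs)
    return {i: mu[i] + sum(cov[i][j] * wj for j, wj in zip(obs, w))
            for i in missing}


mu = [1.0, 1.0, 1.0]
cov = [[0.5, 0.4, 0.1],
       [0.4, 0.5, 0.2],
       [0.1, 0.2, 0.5]]
print(conditional_mean(mu, cov, {0: 2.0, 2: 1.0}))
```

With these toy values the missing middle site is pulled toward the strongly correlated first site, giving an expected dosage of about 1.75.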
Reconstructs local tree topologies for a set of population single-nucleotide polymorphism (SNP) haplotypes undergoing recombination. Due to recombination, tree topologies change as one moves across the genome. The main idea of this tool is to jointly refine a set of local trees at the SNP sites by several justifiable rules. RENT+ extends the previous program RENT, which uses a novel search method to infer the local trees, one for each genomic region near a SNP site. The key benefit of using RENT+ is that it allows the inference to utilize the joint information contained in multiple nearby SNPs (i.e. the so-called linkage disequilibrium).
Estimates haplotypes in polyploid parent-offspring trios. TriPoly is an approach that uses next generation sequencing (NGS) data while taking haplotype transmission from the parents to the progeny into account. It reconstructs the phasing of all single-nucleotide polymorphisms (SNPs) over a genomic region simultaneously. This method provides an option to include all of the SNPs, including those homozygous or missing for an individual, in the output.