Computational protocol: Direct maximum parsimony phylogeny reconstruction from genotype data

Protocol publication

[…] To generate simulated data, we created coalescent trees using Hudson's ms program []. The only parameter required to generate tree topologies is the number of haploid chromosomes nh. The ms program can also use this tree to produce haplotype sequences, but it does so under the infinite-sites model (without any recurrent mutations). We therefore instead used the seq-gen program of Rambaut and Grassly [] to generate nh haplotypes from the ms coalescent tree.

We varied the number of SNPs m and the mutation rate parameter θ = 4N0μ, where μ is the probability of mutation of any of the simulated SNPs in one generation and N0 is the effective population size. We relate the simulation parameter μ to the per-site mutation rate by assuming an effective population size N0 = 10,000 (a reasonable estimate for humans []). For instance, θ = 0.2 gives μ = θ/(4N0) = 5 × 10^-6 for the whole window, so for 5 sites we obtain a per-site mutation probability of 10^-6.

We ran seq-gen under the GTR model, a general time-reversible Markov model. Mutation rates between A and C and between G and T were defined using the same parameter θ. For the other four pairs we set the mutation rate to 0 in order to produce biallelic data. The exact command line used to execute seq-gen for a given mutation rate parameter θ and SNP number m was the following:

    seq-gen -mGTR -r θ, 0, 0, 0, 0, θ -l m

Each data point was generated from 200 independently simulated data sets, with the reported error rates summed over the 200 replicates. In our first set of simulations, designed to test the effect of mutation rate on accuracy, we varied θ over the range 0.2–0.6 in increments of 0.05 for windows of 5 and 10 SNPs and for sample sizes of 30 and 60 input haplotypes. Our second set of experiments, designed to test the effect of sample size on accuracy, fixed θ at 0.5 and varied the number of haplotypes from 30 to 120 in increments of 10 for windows of 5 and 10 SNPs.
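The parameter arithmetic and the two simulation grids described above can be sketched in Python as follows. This is a minimal illustration, not the authors' scripts: the function and variable names are hypothetical, the command string simply mirrors the form printed in the text (including its comma-and-space separators), and the tree file that seq-gen would read from ms is deliberately not shown because the text does not give it.

```python
N0 = 10_000  # effective population size assumed in the text


def per_site_mutation_probability(theta, n_sites, n0=N0):
    """Invert theta = 4 * N0 * mu, then spread mu evenly over the window.

    mu is the per-generation mutation probability for the whole window of
    simulated SNPs, so the per-site value is mu / n_sites.
    """
    mu = theta / (4 * n0)
    return mu / n_sites


def seq_gen_command(theta, m):
    """Format the seq-gen call exactly as it is written in the text.

    Only A<->C and G<->T (first and last GTR rates) are non-zero.
    """
    rates = ", ".join([f"{theta:g}", "0", "0", "0", "0", f"{theta:g}"])
    return f"seq-gen -mGTR -r {rates} -l {m}"


def theta_grid(start=0.2, stop=0.6, step=0.05):
    """Values of theta from start to stop inclusive, rounded to 2 decimals."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(n_steps + 1)]


# First experiment: vary theta; windows of 5 and 10 SNPs; 30 and 60 haplotypes.
first_experiment = [
    (theta, m, nh)
    for theta in theta_grid()
    for m in (5, 10)
    for nh in (30, 60)
]

# Second experiment: theta fixed at 0.5; 30 to 120 haplotypes in steps of 10.
second_experiment = [
    (0.5, m, nh)
    for m in (5, 10)
    for nh in range(30, 121, 10)
]

REPLICATES = 200  # independent data sets per (theta, m, nh) data point
```

Under these assumptions, `per_site_mutation_probability(0.2, 5)` reproduces the worked value of about 10^-6, and each tuple in the two grids corresponds to one plotted data point built from 200 replicates.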
Data points plotted represent summed errors over the 200 replicates per parameter value.

Mitochondrial data was extracted from a set of 63 complete mitochondrial DNA sequences of 16,569 bases each, produced by Fraumene et al. []. We produced artificial diploids from the data by randomly selecting 60 of the sequences and randomly grouping them into 30 pairs. We computationally inferred haplotypes from all of the genotypes using fastPHASE, and we constructed phylogenies for all sliding windows of 50 bases across the data set by each of three methods: maximum parsimony from the true haplotypes, from the inferred haplotypes, and directly from the genotypes.

Autosomal DNA data was extracted from a lipoprotein lipase (LPL) data set due to Nickerson et al. []. Because the pairing of haplotypes into genotypes was not published, we duplicated the first haplotype to obtain 78 distinct sequences and then randomly paired them to produce 39 artificial genotypes from the true haplotypes. As in the previous case, we ran fastPHASE and haplotyper on all of the SNPs together to obtain inferred haplotypes. To reduce the possibility of recombination events confounding our results, we used the HAP webserver [] to break the 86 SNPs into blocks. HAP was also used to infer missing data. We then evaluated phylogeny sizes for each block by our direct method, from the true haplotypes, and from the inferred haplotypes. […]
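The construction of artificial diploids and the enumeration of sliding windows can be sketched as below. This is only an illustration under stated assumptions: haplotypes are taken to be sequences of 0/1 alleles, a genotype is coded as the per-site allele sum (0, 1, or 2), and windows are assumed to advance one base at a time, none of which is specified in the text. All function names are hypothetical, and the haplotype inference step (fastPHASE) is not modelled.

```python
import random


def make_artificial_diploids(haplotypes, n_pairs, rng):
    """Pick 2 * n_pairs haplotypes without replacement and pair them up."""
    chosen = rng.sample(haplotypes, 2 * n_pairs)
    return [(chosen[2 * i], chosen[2 * i + 1]) for i in range(n_pairs)]


def genotype(pair):
    """Collapse a haplotype pair into an unphased genotype.

    Assumed coding: 0 and 2 are homozygous sites, 1 is heterozygous.
    """
    h1, h2 = pair
    return [a + b for a, b in zip(h1, h2)]


def sliding_windows(n_sites, width):
    """Half-open (start, end) index ranges for every window, step of 1."""
    return [(start, start + width) for start in range(n_sites - width + 1)]


# Toy usage with the sizes quoted for the mitochondrial data:
# 63 sequences, 60 selected, 30 artificial diploids.
rng = random.Random(0)  # fixed seed only so the toy run is repeatable
toy_haplotypes = [
    tuple(rng.randint(0, 1) for _ in range(100)) for _ in range(63)
]
diploids = make_artificial_diploids(toy_haplotypes, 30, rng)
genotypes = [genotype(pair) for pair in diploids]
windows = sliding_windows(16_569, 50)
```

With a step of one base, a 16,569-base sequence yields 16,569 - 50 + 1 = 16,520 windows of 50 bases; a different step size would change that count.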

Pipeline specifications

Software tools: Seq-Gen, fastPHASE
Applications: Phylogenetics, GWAS