## Protocol publication

[…] The Mitomap “mtDNA Tree” is a navigable mutational mitochondrial DNA (mtDNA) phylogenetic tree containing approximately 3,000 mtDNA coding region sequences. The core topology of the tree was generated with the **MEGA** neighbor-joining program, using 1,060 human mtDNA sequences. Separate neighbor-joining trees were also built for each major haplogroup. This phylogeny was used to identify 1,173 haplotypes classified within the R super-haplogroup and characterized by the 12705 polymorphism (ND5). We distributed every haplotype among the 4 major monophyletic lineages: R0 (formerly preHV), R2JT, U, and B. These 4 major lineages contained over 100 haplotypes. The “OTHER” category contains the remaining haplotypes, which belong to less-represented lineages: R* (1, 5–9, 30, 31), F, and P. [...]

To confirm the Mitomap results, we performed the same analysis on **PhyloTree** (http://www.PhyloTree.org/), which was built using the parsimony method from a data set similar to that in Mitomap. For the 709 haplotypes belonging to the R lineage, we listed (i) the synonymous polymorphisms present in the coding sequences of the mitochondrial proteins and (ii) the polymorphisms present in HVS1 between nucleotides 16090 and 16383. The phylogenetic distance between each haplotype and MRCA-R was calculated as the number of mutations separating it from MRCA-R. We plotted a general histogram of the haplotype distribution according to these distances.

To estimate the global quality of the sequences in this dataset, we analyzed the hyper-variable region in these complete genomes, using the methodology described by Bandelt et al. We started by applying the “weighty filter”, and then computed the cube and incompatibility spectra using the SPECTRA software.

We used an adaptation of the Bayesian method proposed by Wilcox et al. to test for rate heterogeneity across the tree.
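
The mutation-count distance to MRCA-R described above can be illustrated with a minimal Python sketch. It assumes the tree is available as a mapping from each node to its parent and to the mutations on the branch leading to it. The node names and mutation labels are hypothetical placeholders, not data from Mitomap or PhyloTree, and this is not the authors' implementation.

```python
from collections import Counter

# Hypothetical toy tree: node -> (parent, mutations on the branch leading to the node).
# Names and mutation labels are placeholders for illustration only.
TOY_TREE = {
    "MRCA-R": (None, []),
    "n1": ("MRCA-R", ["m1", "m2"]),
    "hapA": ("n1", ["m3"]),
    "hapB": ("n1", ["m4", "m5"]),
    "hapC": ("MRCA-R", ["m6", "m7", "m8"]),
}


def distance_to_root(node, tree):
    """Count the mutations on the path from `node` up to the root (MRCA-R)."""
    count = 0
    while node is not None:
        parent, mutations = tree[node]
        count += len(mutations)
        node = parent
    return count


def distance_histogram(haplotypes, tree):
    """Map each distance value to the number of haplotypes at that distance."""
    return Counter(distance_to_root(h, tree) for h in haplotypes)


if __name__ == "__main__":
    for distance, n in sorted(distance_histogram(["hapA", "hapB", "hapC"], TOY_TREE).items()):
        print(distance, n)
```

Counting mutations along the tree path, rather than comparing each sequence directly with the root sequence, keeps recurrent mutations on different branches separate.
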
Due to computer limitations and a possible ascertainment bias, we generated 10 independent sets of 36 randomly sampled individuals within each major haplogroup (using R software). For each selected individual, we generated a collapsed sequence containing only the nucleotides at the third codon position, in order to study the synonymous mutation rate. Independent phylogenetic analyses were performed on the 10 sets using **MrBayes** (GTR + Γ + PINVAR, 2 chains, chain temperature parameter: 0.2). Each run comprised 1×10⁷ generations, with sampling every 1,000 generations and a burn-in period of 1×10⁶ generations. Each tree was rooted using the R ancestor sequence. We then applied the Wilcox method to obtain the posterior probability distribution of the distance between each individual and the MRCA, by saving branch lengths for each tree sampled during the Bayesian tree search. For each set, we then used a paired t-test to compare the average distance to the MRCA for individuals belonging to the R0 and J clusters with that for the rest of the R individuals.

The PAML 3.15 package was used to investigate signatures of positive selection among specific lineages. Due to computer limitations, the model-based codeml analysis was performed on only 52 individuals, randomly selected from all the major clusters while respecting the PhyloTree topology. To investigate possible rate heterogeneity among lineages, we also compared a one-ratio omega (dN/dS) model (M0) with several other models (M1: free-ratio model, in which rates may vary freely among the branches; M2: three omega ratios specified, one for the J, J1, and J2 stems, one for the R0 lineage, and one for the remaining branches of the tree).

We investigated the potential effect of ascertainment bias on the PAML analyses by simulating populations of sequences and comparing the PAML results for randomly sampled sets of sequences with those for samples carrying an ascertainment bias.
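
The resampling, third-codon collapsing, and paired comparison used for the MrBayes analysis above can be sketched as follows. The source states that the sampling was done in R; this Python version is only an illustration. The function names and the fixed seed are assumptions, and the sketch computes only the paired t statistic, not its p-value.

```python
import math
import random
import statistics


def third_codon_positions(cds):
    """Collapse an in-frame coding sequence to its third codon positions."""
    return cds[2::3]


def draw_sets(individuals, n_sets=10, set_size=36, seed=0):
    """Draw `n_sets` independent random samples (without replacement within a set)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    return [rng.sample(individuals, set_size) for _ in range(n_sets)]


def paired_t_statistic(a, b):
    """Paired t statistic: mean of the differences over their standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


if __name__ == "__main__":
    print(third_codon_positions("ATGGCCTAA"))
    sets = draw_sets([f"ind{i}" for i in range(100)])
    print(len(sets), len(sets[0]))
```

Here `draw_sets` would be called once per major haplogroup, and `paired_t_statistic` would receive one average distance per set for each of the two groups being compared.
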
Due to computer limitations, it was not possible to build a realistic model of the evolution of human mitochondrial DNA (the populations, sequences, and even samples would be excessively large). Consequently, we focused on a small model directly addressing whether the PAML omega ratio was influenced by the fact that the DNA in the database was not sequenced at random but on the basis of prior knowledge of the haplogroups (based on HV1 sequences and/or some coding SNPs). This test used Re-codon to generate sequence populations and a Python script to obtain biased and unbiased samples, then used PAML to compare the two. The analysis was repeated independently 10 times:

- A) Re-codon: generate 500 haploid sequences of 3,000 nucleotides, with no recombination, mutation rate = +1.0e-04, and omega = 1, using default values for the other parameters (exponential growth rate = +1.0e-03, effective population size = 1000).
- B) Python: study the variance at each of the 3,000 nucleotide sites. For example, a site where 50% of the population had one allele and the other 50% another allele was considered to have high variance, whereas a site where 95% of the population shared the same allele had low variance. (This step simulated the first studies of Torroni and collaborators, based on RFLP diversity in human populations.)
- C) Python: the polymorphisms were then sorted by variance, and the most variable sites were used in turn to split the population into groups, stopping when the population was divided into at least 5 groups with a minimum of 20 individuals in each. (These groups of haplotypes corresponded to haplogroups, as defined by Torroni and collaborators.)
- D) Python: a sample of around 40 individuals was taken from the 500 individuals on the basis of this pseudo-haplogrouping. The 500 individuals were divided into pseudo-haplogroups, and the script randomly picked from each pseudo-haplogroup a number of individuals proportional to its share of the total population.
- E) Finally, this biased sample was subjected to PAML analysis (M1: free-ratio model). The same analysis was performed using a non-biased sample from the same population.

We compared the whole-tree omega ratio (dN/dS ratio) with the distribution of the branch omega values obtained by both sampling methods. These two analyses were performed on 10 independent sets generated by Re-codon.

The possible functional and potentially damaging effects of every non-synonymous mutation present in the JT, J, J1, and J2 lineages were studied using the **PolyPhen** server. […]
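
Steps B to D of the sampling simulation above can be sketched in Python as follows. The authors' script is not reproduced in this excerpt, so everything here is an assumption made for illustration: the variance score (one minus the sum of squared allele frequencies), the reading of the stopping rule as "at least 5 groups of at least 20 individuals", the rounding used for proportional allocation, and all function names.

```python
import random
from collections import Counter, defaultdict


def site_variance(column):
    """Assumed variability score: 1 - sum of squared allele frequencies at a site."""
    n = len(column)
    return 1.0 - sum((count / n) ** 2 for count in Counter(column).values())


def pseudo_haplogroups(seqs, min_groups=5, min_size=20):
    """Split sequences on the most variable sites until enough large groups exist."""
    ranked = sorted(
        range(len(seqs[0])),
        key=lambda i: site_variance([s[i] for s in seqs]),
        reverse=True,
    )
    used = []
    groups = {(): list(range(len(seqs)))}
    for site in ranked:
        used.append(site)
        groups = defaultdict(list)
        for idx, seq in enumerate(seqs):
            # Individuals sharing the same alleles at all used sites form one group.
            groups[tuple(seq[j] for j in used)].append(idx)
        large = [g for g in groups.values() if len(g) >= min_size]
        if len(large) >= min_groups:
            break
    return list(groups.values())


def proportional_sample(groups, total, sample_size=40, seed=0):
    """Pick from each group a number of individuals proportional to its size."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    picked = []
    for members in groups:
        k = round(sample_size * len(members) / total)
        picked.extend(rng.sample(members, min(k, len(members))))
    return picked
```

Because each group's allocation is rounded independently, the sample size is only approximately `sample_size`, which matches the "around 40 individuals" wording of step D.
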

## Pipeline specifications

| Software tools | MEGA, MrBayes, PolyPhen |
|---|---|
| Databases | MITOMAP, PhyloTree.org, nextstrain |
| Application | Phylogenetics |