Computational protocol: Gene Tree Affects Inference of Sites Under Selection by the Branch-Site Test of Positive Selection

Protocol publication

[…] We simulate multiple sequence alignments (MSAs) given a phylogenetic tree using INDELible (version 1.03), a flexible simulation tool implementing a variety of substitution and indel models. It uses a Markov chain approach that handles the dependency among sites introduced when simulating indels (see the book by Yang, pp. 302–304). Here, we simulate genes consisting of 522 codons with no indels along the tree depicted in , starting from a random sequence at the root. Simulation parameters are the transition/transversion ratio κ = 2.1, chosen to match the average reported for the human genome (see DePristo et al.), and a background scheme of dN/dS ratios (1, 1, 0.8, 0.8, 0.5, 0.5, 0.2, 0.2, 0, 0), with every class making up 10% of the sites (the same as background scheme X from the studies by Zhang et al. and Fletcher and Yang). Furthermore, we use two foreground selection schemes: (0.5, 1, 4, 0.8, 4, 0.5, 4, 0.2, 0.8, 0.5) (referred to as W) and (1.0, 0.7, 4.0, 0.8, 2.0, 0.5, 0.3, 0.2, 0.1, 0.0) (the same as foreground selection scheme V in the references mentioned above). The simulated MSAs and the control file with all parameters are attached as .

The sequences simulated in the previous step are analyzed with PAML (version 4.6), a package of programs for the phylogenetic analysis of molecular sequences in a maximum likelihood (ML) statistical framework. It provides a rich repertoire of evolutionary models for testing biological hypotheses, for example, positive Darwinian selection, as the branch-site test of positive selection (BSPS) does. We label as foreground the branches that were simulated as such. Branch lengths are estimated by PAML (“runmode = 0”; refer for the basic control files with and without selection that we used for all simulations).
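
The simulation schemes quoted above can be tabulated to see which site classes are expected to carry a positive-selection signal. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the criterion used (foreground dN/dS > 1 with background dN/dS ≤ 1) is an assumption about what counts as a truly selected class. Only the scheme values, the 10% class weights, and the 522-codon length come from the text.

```python
# dN/dS (omega) values per site class, copied from the text.
# Each of the 10 classes makes up 10% of the sites.
BACKGROUND_X = (1, 1, 0.8, 0.8, 0.5, 0.5, 0.2, 0.2, 0, 0)
FOREGROUND_W = (0.5, 1, 4, 0.8, 4, 0.5, 4, 0.2, 0.8, 0.5)
FOREGROUND_V = (1.0, 0.7, 4.0, 0.8, 2.0, 0.5, 0.3, 0.2, 0.1, 0.0)

N_CODONS = 522        # simulated gene length, from the text
CLASS_FRACTION = 0.1  # every class covers 10% of the sites


def positively_selected_classes(foreground, background=BACKGROUND_X):
    """Return zero-based indices of site classes that are under positive
    selection on the foreground only (assumed criterion: omega_fg > 1
    and omega_bg <= 1)."""
    return [
        i
        for i, (fg, bg) in enumerate(zip(foreground, background))
        if fg > 1 and bg <= 1
    ]


def expected_selected_codons(foreground, n_codons=N_CODONS):
    """Expected number of codons falling into positively selected classes."""
    n_classes = len(positively_selected_classes(foreground))
    return n_classes * CLASS_FRACTION * n_codons


if __name__ == "__main__":
    for name, scheme in (("W", FOREGROUND_W), ("V", FOREGROUND_V)):
        print(name, positively_selected_classes(scheme),
              round(expected_selected_codons(scheme), 1))
```

Under this assumed criterion, scheme W has three positively selected classes and scheme V has two, so W places a larger expected share of the 522 codons under positive selection on the foreground branches.
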
Sites under selection in the foreground branches are identified by Bayes empirical Bayes (BEB) at site-specific posterior probabilities >0.95 and >0.99.

We define the age of a foreground branch spanning nodes n1 to nm (ie, for m > 2, additional internal nodes are present) as the average distance of the nodes n1 to nm to human (the leaf at the end of fg6 in ).

Simple and multiple linear regressions are compared using the Bayesian information criterion (BIC), which allows selection among a set of models. It compares models based on their likelihood while penalizing the number of model parameters. Models with the lowest BIC are preferred, with a difference ΔBIC = BIC(H0) − BIC(H1) above 6 indicating strong evidence against the null model H0.

We summarize the elements of the confusion matrix (ie, true positives [TP], false positives [FP], true negatives [TN], and false negatives [FN]) by computing sensitivity and specificity according to their standard definitions, TP/(TP + FN) and TN/(TN + FP), respectively. Tree manipulations are done in Python using Biopython and the ETE library.

The codon MSA for the reanalysis of the sequences from the study by Voordeckers et al. is generated from the protein MSA provided in their Supporting Information. After retrieving the corresponding cDNA sequences from the NCBI and the Sanger Institute, we use Pal2Nal (version 14) to convert the protein alignment into a codon alignment. Pal2Nal automates this conversion and is robust to mismatches, UTRs, and polyA tails in the input DNA sequences, as well as to frameshifts and in-frame stop codons in the input alignment. Results shown in and are generated with PAML version 4.4. We use PAML version 4.4 here to rule out differences between PAML versions as a reason for different sets of sites being inferred to be under positive selection. […]
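
The evaluation steps described above (confusion-matrix summary and ΔBIC comparison) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, sites are assumed to be identified by integer alignment positions, and the BIC is computed from its standard definition k·ln(n) − 2·ln(L), which the text does not spell out.

```python
import math


def confusion_counts(true_sites, inferred_sites, n_sites):
    """Count TP, FP, TN, FN for sites inferred to be under selection.

    true_sites / inferred_sites: iterables of site positions (assumed integers);
    n_sites: total number of sites in the alignment.
    """
    true_sites, inferred_sites = set(true_sites), set(inferred_sites)
    tp = len(true_sites & inferred_sites)   # selected and detected
    fp = len(inferred_sites - true_sites)   # detected but not selected
    fn = len(true_sites - inferred_sites)   # selected but missed
    tn = n_sites - tp - fp - fn             # everything else
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """TP / (TP + FN); NaN if there are no truly selected sites."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """TN / (TN + FP); NaN if there are no truly unselected sites."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


def bic(log_likelihood, n_params, n_obs):
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def delta_bic(bic_h0, bic_h1):
    """Delta BIC = BIC(H0) - BIC(H1); values above 6 are read in the text
    as strong evidence against the null model H0."""
    return bic_h0 - bic_h1


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    tp, fp, tn, fn = confusion_counts({1, 2, 3, 4}, {3, 4, 5}, n_sites=10)
    print(tp, fp, tn, fn, sensitivity(tp, fn), specificity(tn, fp))
```

Returning NaN for an empty denominator is a design choice of this sketch; a replicate with no truly selected sites would otherwise raise a division error.
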

Pipeline specifications