Computational protocol: A systematic search for positive selection in higher plants (Embryophytes)

Similar protocols

Protocol publication

[…] The TAED database was built as described in Roth et al. []. 138,662 embryophyte sequences longer than 10 amino acids from 4,568 species were extracted from GenBank release 136 []. After an all-against- all BLAST search [] global PAM distances were calculated for each hit using Darwin []. Families were built by single linkage clustering from sequences annotated as complete, with pairwise PAM distances of 100 PAM units above a relative length threshold. For each family, a multiple sequence alignment was calculated using POA []. Phylogenetic trees were estimated by Bayesian inference using MrBayes []. A novel soft parsimony approach [] was used to simultaneously root the trees and map them onto the NCBI taxonomy. Distant families were split based on alignment quality as measured in the percentage of gaps and where the oldest node in the tree was identified as a duplication event rather than a speciation event. This served to increase alignment quality, a possible source of false positive positive selection. For each branch of every phylogenetic tree, Ka/Ks ratios were calculated using a previously established ancestral sequence reconstruction and counting- based approach (see [-]) with a Ks lower cut- off of 0.05. Branches were preliminarily considered significant if Ka- V(Ka)>Ks+V(Ks) and at least two non- synonymous substitutions occurred along the branch. This statistical assessment was refined through evaluation with a neutral null hypothesis, as described below. The procedure described above automatically generated the TAED database and hits were manually curated for this study. Manual evaluation of positively selected genes involved examination of the alignments, where in families with branches with Ka/Ks>1 and alignments where some sequences had a high percentage of gaps, the effect of these sequences on the alignment and subsequent calculation of Ka/Ks ratios was evaluated.PDB- structures were assigned to families at a distance of 70 PAM units. For each family with a structure, a new multiple sequence alignment was calculated included the structure. From this, sequences were threaded through the structure and contact maps with a radius of 10Å were calculated. Then Ka/Ks ratios were determined within these 10Å- windows for each sphere along each branch of every phylogenetic tree []. Significant branches were selected as previously described [].To compare the results with a null (actually a combination of neutral and negative selection in the absence of positive selection) model, we used Evolver from the PAML package []. Families were generated with non-Ka/Ks parameters sampled proportionately from the distribution obtained by codeml (PAML) analysis of all TAED families and with the amino acid distribution derived from TAED as a whole. Trees and branch lengths were randomly selected from the distribution in the database, and Ka/Ks was set to the mean value of the database (0.21). Ka/Ks values were then calculated for these simulated families as described above. The simulated dataset was used to calculate both p- values and an expected false positive rate, given multiple tests (equivalent to the Bonferroni correction multiplied by the number of tests) to assess statistical significance. Many of these whole gene calculations will likely be unambiguous true positives when more sensitive methods that look at sequence subsets (like []) are employed. Employing a null model that is based upon positive selection free parameters estimated from comparative genomic data rather than a null model based upon pseudogenes (which these are not) is expected to give a more realistic (less overly conservative) picture of statistical significance, even though this has not traditionally been employed in evolutionary studies.Saturation of synonymous sites was calculated as the fraction of identical four- fold degenerate codons in a sequence pair if at least five shared codons occur. These pairwise values were projected onto the node of the species tree that joined them from the mapping of the gene family tree to the species tree. From these pairwise values, average gene saturation at every node in a gene family tree was calculated. Then these nodes were projected onto the corresponding node in the embryophyte tree of life. This is different from the along- branch calculation used in TAED to discard saturated datapoints, but is given as an estimate of how far back (on average), pairwise analysis enables estimation of Ks.367 families in TAED with close structural homologs in PDB were used to categorize general structural properties of sequence evolution. Using DSSP, both the secondary structural elements and solvent accessibility of each residue site were identified []. After partitioning residues into secondary structural categories and solvent accessibility categories, Ka/Ks calculation was performed in the 367 families over each branch for each partition and averaged values compared.To determine which GO terms [] were over- represented among the gene lineages having undergone positive selection, the functional annotations of the gene family dataset as a whole and the global and tertiary windowing high Ka/Ks datasets were compared to GO, and gene families assigned to a small number of selected GO terms. Over-representation was determined by examining the ratio of GO term frequency per positively selected branch to GO term frequency in the total set of test branches in the TAED database. […]

Pipeline specifications

Software tools POA, MrBayes, PAML
Databases TAED
Applications Phylogenetics, Nucleotide sequence alignment