Computational protocol: Paralogous Radiations of PIN Proteins with Multiple Origins of Noncanonical PIN Structure

Protocol publication

[…] PIN cDNA sequences were identified from previously published analyses together with BLAST searches of four primary sources: Phytozome (last accessed May 1, 2014), the Ancestral Angiosperm Genome Project (last accessed May 1, 2014), NCBI BLAST, and the 1KP project (last accessed May 1, 2014). Accession numbers are listed in supplementary data set S3, Supplementary Material online. Arabidopsis PIN1 or PIN5 sequences were judged at the outset to represent the broad diversity of PIN genes and were used as query sequences with the tBLASTx option. For nonannotated sequences derived from EST data sets, translations across all six reading frames were searched for significant open reading frames (ORFs), and the longest ORF was extracted for alignment. Very short sequences (<100 amino acids) were generally discarded. Where two or more partial sequences from the same species were independently assigned to the same subgroup by initial phylogenetic analyses and exhibited significant sequence overlap, they were scaffolded into a single consensus sequence to reduce the overall number of sequences in the data set. These scaffolded sequences are clearly marked in supplementary data set S3, Supplementary Material online. A total of 370 unique PIN sequences from the 1KP database were deposited in GenBank (accession numbers: KJ664232–KJ664532). [...]

All alignments were performed at the amino acid level. Initially, only full-length protein sequences from completed genomes were used to build a preliminary alignment. A total of 96 PIN protein sequences from the genomes of A. thaliana, Populus trichocarpa, Vitis vinifera, Solanum lycopersicum, Oryza sativa, Sorghum bicolor, Zea mays, Brachypodium distachyon (all angiosperms), Se. moellendorffii (lycophyte), and Ph. patens (moss) were identified for this purpose (supplementary data set S3, Supplementary Material online). These sequences were aligned with MAFFT using the E-INS-i alignment strategy (last accessed May 1, 2014).
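
The six-frame ORF search described above for nonannotated EST sequences can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the standard genetic code is assumed, and an "ORF" is taken here to be the longest stop-free stretch in any frame (no start codon is required, on the assumption that partial ESTs may lack one). The 100-residue cutoff mirrors the length filter mentioned in the text.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(nt_seq):
    """Return the reverse complement of a DNA sequence."""
    return nt_seq.upper().translate(COMPLEMENT)[::-1]


def translate(nt_seq):
    """Translate from position 0; codons with unknown bases become 'X'."""
    nt_seq = nt_seq.upper()
    return "".join(
        CODON_TABLE.get(nt_seq[i:i + 3], "X") for i in range(0, len(nt_seq) - 2, 3)
    )


def six_frame_translations(nt_seq):
    """Translate three forward and three reverse-complement reading frames."""
    strands = (nt_seq.upper(), reverse_complement(nt_seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


def longest_orf(nt_seq, min_length=100):
    """Longest stop-free peptide over all six frames, or None if too short.

    Assumption: min_length=100 mirrors the '<100 amino acids' filter in the text.
    """
    segments = (
        seg for frame in six_frame_translations(nt_seq) for seg in frame.split("*")
    )
    best = max(segments, key=len, default="")
    return best if len(best) >= min_length else None
```

Calling `longest_orf(est_sequence)` on an EST returns the peptide to carry forward into the alignment, or `None` when the sequence would be discarded as too short.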
The alignment was further manually refined using the software program Se-Al (last accessed May 1, 2014). This initial alignment was subsequently expanded through the addition of partial cDNA sequences from a variety of EST databases, in particular from the 1KP project (last accessed May 1, 2014) (supplementary data set S3, Supplementary Material online). These sequences vary in length and coverage, but our initial full-length alignment provided a scaffold that allowed them to be incorporated into the alignment. The new sequences were added by hand to produce an optimal alignment (supplementary data set S5A, Supplementary Material online).

Although the sequences had been aligned by codon, phylogenetic analyses were performed at the nucleotide level. The N- and C-termini of the proteins were well aligned, but an overall lack of conservation in the center of the proteins resulted in generally poorer alignment there. Regions that could not be confidently aligned at the amino acid level were excluded from the analysis. To determine the final exclusion parameters, the alignment was subjected to iterative preliminary analyses to explore the effect of including different regions. Trees derived from these preliminary analyses were examined to determine: 1) the extent to which the tree topology is robust to variation in the alignment; 2) the extent to which different alignments generate the same topology regardless of the tree-building optimality criterion; and 3) the degree to which the topology tracks organismal angiosperm phylogeny within paralogous clades. By evaluating these exploratory analyses, we converged on a robust, optimized alignment, which was selected for the final analyses. Gaps and missing ends of partial sequences and incomplete ESTs were coded as missing data. The final alignment included 473 taxa and 1,809 nt, of which 1,690 were parsimony informative, with 45% missing data.
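
Summary statistics of the kind reported for the final alignment (number of taxa, alignment length, parsimony-informative sites, proportion of missing data) can be computed as sketched below. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, gaps (`-`), `?`, and `N` are all treated as missing data, and other ambiguity codes are simply counted as ordinary states. A site is taken to be parsimony informative when at least two states each occur in at least two taxa.

```python
from collections import Counter

# Assumption: these characters are all coded as missing data.
MISSING = set("-?N")


def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(ch for ch in column if ch not in MISSING)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def alignment_summary(alignment):
    """Summarize an alignment given as {taxon_name: aligned_sequence}."""
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        raise ValueError("alignment is empty")
    n_sites = len(seqs[0])
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("all sequences must have the same aligned length")

    # zip(*seqs) walks the alignment column by column.
    informative = sum(is_parsimony_informative(col) for col in zip(*seqs))
    missing_cells = sum(ch in MISSING for s in seqs for ch in s)
    return {
        "taxa": len(seqs),
        "sites": n_sites,
        "parsimony_informative": informative,
        "missing_fraction": missing_cells / (len(seqs) * n_sites),
    }
```

The input dictionary would typically be filled from a FASTA or NEXUS file with whatever parser is at hand; only the column-wise counting is shown here.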
The analyzed alignment is provided in supplementary data set S5B, Supplementary Material online. [...]

The final alignment was analyzed using PartitionFinder to select the best-fit partitioning schemes and to choose among models of molecular evolution at both the nucleotide and amino acid level. Explored partition schemes included the three separate codon positions, and the N-terminus, C-terminus, and intracellular loop sections. In all instances, PartitionFinder suggested analyzing the partitions separately under a GTR + I + G model for nucleotides according to the AIC and BIC selection criteria. The protein analyses were run as a single partition under the JTT + I + G substitution model, chosen according to the AIC and BIC selection criteria. All partition schemes were then further analyzed by “fast ML” analysis to explore the effect of partitioning on tree topology.

For nucleotide-level ML analyses, we employed the program GARLI (Genetic Algorithm for Rapid Likelihood Inference; version 2.0). Analyses were run with default options, except that the “significanttopochange” parameter was reduced to 0.01 to make searches more stringent. ML bootstrap analyses were conducted with the default parameters and 500 replicates. We performed 100 replicate GARLI analyses and selected the topology with the highest likelihood score. Similarly, codon-level analyses were performed in GARLI using an empirical + F, 6-rate model, with 12 replicate analyses and 100 bootstrap repetitions. For the protein-level analyses, we employed the program RAxML with 1,000 fast bootstrap replicates. Bayesian analyses were implemented in MrBayes 3.2.2 with a GTR + I + G model of evolution and 5 million generations, with two hot and two cold chains and a burn-in of 25%. Convergence was assessed at a standard deviation of 0.01. Posterior probabilities were derived from a majority-rule consensus over the final 1 million generations of post-burn-in trees. […]
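
The AIC and BIC criteria used for model and partition-scheme selection can be written out explicitly. The sketch below is not PartitionFinder's implementation; it only applies the standard definitions, AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters, lnL the maximized log-likelihood, and n the number of alignment sites. The function names, scheme names, and numerical values in the usage example are hypothetical placeholders, not results from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def rank_schemes(candidates, n_sites):
    """Rank candidate schemes by BIC.

    candidates maps a scheme name to (log_likelihood, n_params).
    Returns a list of (name, AIC, BIC) tuples, best BIC first.
    """
    scored = [
        (name, aic(lnl, k), bic(lnl, k, n_sites))
        for name, (lnl, k) in candidates.items()
    ]
    return sorted(scored, key=lambda row: row[2])


if __name__ == "__main__":
    # Hypothetical log-likelihoods and parameter counts, for illustration only.
    example = {
        "unpartitioned": (-1050.0, 10),
        "by_codon_position": (-1000.0, 30),
    }
    for name, a, b in rank_schemes(example, n_sites=1809):
        print(f"{name}: AIC={a:.1f} BIC={b:.1f}")
```

Because BIC penalizes extra parameters more heavily than AIC for alignments of this length, the two criteria can disagree; the text reports that they agreed for the schemes explored here.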

Pipeline specifications