Computational protocol: Plant Proteins Are Smaller Because They Are Encoded by Fewer Exons than Animal Proteins

Protocol publication

[…] The evolutionary analysis was based on the established and best-curated taxonomy of eukaryotes. The results were validated by an independent reconstruction of the phylogenies using the small subunit rRNA sequences of 233 representative eukaryotic species. We consulted the Protist Ribosomal Reference database (PR2) and the SILVA database for the phylogenetic regression analysis. Small subunit rRNA sequences were aligned with SINA. Gaps in the multiple sequence alignments were removed using trimAl with the “automated1” option. The best-fit model was estimated and the phylogenetic tree reconstructed with jModelTest version 2.1.7, using maximum likelihood through PhyML. The resulting tree was rooted using the “phangorn” R package. Phylogenetic independent contrast regression analysis was conducted using the “ape” R package. The linear model was forced through the origin and adjusted as recommended by Garland et al. The response variable was protein length, and the explanatory variables were exon number and exon length. [...]

Sequences of nuclear-encoded proteins from the whole genomes of 4 archaebacteria (Pyrococcus furiosus, Methanobacterium AL, Methanococcus maripaludis, and Archaeoglobus fulgidus), 3 Gram-positive bacteria (Mycoplasma genitalium, Bacillus subtilis, and Mycobacterium sp. JDM601), 3 cyanobacteria (Nostoc sp. PCC7107, P. marinus, and Synechocystis sp. PCC6803), 4 eubacteria (Borrelia afzelii, Treponema azotonutricium, Chlamydia pecorum, and Aquifex aeolicus), and 4 proteobacteria (Rickettsia akari, Helicobacter acinonychis, Haemophilus ducreyi, and Escherichia coli) were obtained from the NCBI Genome database in August 2015. A. thaliana and Saccharomyces cerevisiae nuclear proteomes were downloaded from The Arabidopsis Information Resource (TAIR) and the Saccharomyces Genome Database (SGD), respectively. A non-redundant set of Arabidopsis sequences was obtained with the Cluster Database at High Identity with Tolerance (CD-HIT) program using default parameters. BLAST tables were constructed from reciprocal best BLAST hits by comparing the Arabidopsis proteome with all other proteomes, with thresholds of e-value < 10^-10 and amino acid sequence identity > 30%. Multiple sequence alignments (MSAs) of proteins were obtained with MUSCLE (multiple sequence comparison by log-expectation) using default parameters. Gaps were removed using trimAl with the “gappyout” option. Phylogenetic trees were reconstructed with PhyML using a maximum likelihood approach. The best-fit model was inferred with ProtTest. All of the procedures above were conducted using the Environment for Tree Exploration (ETE) pipeline for phylogenetic analysis. To identify genes of endosymbiotic origin that migrated from the chloroplast to the nucleus in Arabidopsis, we searched for phylogenetic trees in which cyanobacterial protein sequences branch together with Arabidopsis nuclear protein sequences. […]
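
The reciprocal best BLAST hit step described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Hit` record, the function names, and the choice of bit score to rank competing hits are assumptions made for illustration, as is applying the thresholds before picking the best hit. Only the two cutoffs (e-value below 10^-10 and amino acid identity above 30%) come from the protocol text.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One parsed BLAST hit (hypothetical record, not a BLAST API type)."""
    query: str
    subject: str
    identity: float   # percent amino acid identity
    evalue: float
    bitscore: float


def best_hits(hits: Iterable[Hit],
              max_evalue: float = 1e-10,
              min_identity: float = 30.0) -> Dict[str, Hit]:
    """Keep, for each query, the highest-scoring hit that passes both cutoffs.

    Assumption: ties and ranking are resolved by bit score; the protocol
    text does not say which score was used.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        # Protocol thresholds: e-value < 1e-10 and identity > 30%.
        if hit.evalue >= max_evalue or hit.identity <= min_identity:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


def reciprocal_best_hits(forward: Iterable[Hit],
                         reverse: Iterable[Hit],
                         max_evalue: float = 1e-10,
                         min_identity: float = 30.0) -> List[Tuple[str, str]]:
    """Return sorted (query, subject) pairs that are each other's best hit.

    `forward` holds hits of proteome A against proteome B, and `reverse`
    holds hits of proteome B against proteome A.
    """
    fwd = best_hits(forward, max_evalue, min_identity)
    rev = best_hits(reverse, max_evalue, min_identity)
    pairs = []
    for query, hit in fwd.items():
        back = rev.get(hit.subject)
        # A pair is reciprocal only if the subject's best hit is the query.
        if back is not None and back.subject == query:
            pairs.append((query, hit.subject))
    return sorted(pairs)
```

In practice, the two hit lists would be parsed from tabular BLAST output for Arabidopsis against one other proteome and for the reverse search, and the function would be called once per proteome pair.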

Pipeline specifications

Software tools: trimAl, jModelTest, PhyML, phangorn, ape, CD-HIT, MUSCLE, ProtTest, ETE
Databases: SGD, TAIR
Applications: Phylogenetics, Nucleotide sequence alignment
Chemicals: Amino Acids, Nucleotides