Similar protocols

Protocol publication

[…] Repetitive elements were identified following the standard Maker-P recipe (weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced, accessed October 2015) as described on the GMOD site: (i) RepeatModeler with Repeatscout v1.0.5, Recon v1.08, RepeatMasker version open4.0.5, using RepBase version 20140131 () and TandemRepeatFinder; (ii) GenomeTools LTRharvest and LTRdigest (); (iii) MITEhunter with default parameters (). We generated species-specific repeat libraries for P. andersonii and T. orientalis separately and combined these into a single repeat library, filtering out sequences that are >98% similar. We masked both genomes by using RepeatMasker with this shared repeat library.To aid the structural annotation, we used 11 P. andersonii and 6 T. orientalis RNA-sequencing (RNA-seq) datasets (SI Appendix, Table S8). All RNA-seq samples were assembled de novo by using genome-guided Trinity (), resulting in one combined transcriptome assembly per species. In addition, all samples were mapped to their respective reference genomes by using BWA-MEM and processed into putative transcripts by using cufflinks () and transdecoder (). As protein homology evidence, only UniProt () entries filtered for plant proteins were used. This way we included only manually verified protein sequences and prevented the incorporation of erroneous predictions. Finally, four gene-predictor tracks were used: (i) SNAP () trained on P. andersonii transdecoder transcript annotations; (ii) SNAP trained on T. orientalis transdecoder transcript annotations; (iii) Augustus (), as used in the BRAKER pipeline, trained on RNA-seq alignments (); and (iv) GeneMark-ET, as used in the BRAKER pipeline, trained on RNA-seq alignments ().First, all evidence tracks were processed by Maker-P (). The results were refined with EVidenceModeler (EVM) (), which was used with all of the same tracks as Maker-P, except for the Maker-P blast tracks and with the addition of the Maker-P consensus track as additional evidence. Ultimately, EVM gene models were preferred over Maker-P gene models except when there was no overlapping EVM gene model. Where possible, evidence of both species was used to annotate each genome (i.e., de novo RNA-seq assemblies of both species were aligned to both genomes).To take maximum advantage of annotating two highly similar genomes simultaneously, we developed a custom reconciliation procedure involving whole-genome alignments. The consensus annotations from merging the EVM and Maker-P annotations were transferred to their respective partner genome by using nucmer () and RATT revision 18 () (i.e., the P. andersonii annotation was transferred to T. orientalis and vice versa) based on nucmer whole-genome alignments (SI Appendix, Fig. S10). Through this reciprocal transfer, both genomes had two candidate annotation tracks. This allowed for validation of annotation differences between P. andersonii and T. orientalis, reduced technical variation, and consequently improved all downstream analyses. After automatic annotation and reconciliation, 1,693 P. andersonii genes and 1,788 T. orientalis genes were manually curated. These were mainly homologs of legume symbiosis genes and genes that were selected based on initial data exploration.To assign putative product names to the predicted genes, we combined BLAST results against UniProt, TrEMBL, and nr with InterProScan results (custom script). To annotate Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) enzyme codes we used Blast2GO based on the nr BLAST results and InterProScan results. Finally, we filtered all gene models with hits to InterPro domains that are specific to repetitive elements. [...] To determine relationships between P. andersonii and T. orientalis genes, as well as with other plant species, we inferred orthogroups with OrthoFinder version 0.4.0 (). As orthogroups are defined as the set of genes that are descended from a single gene in the last common ancestor of all of the species being considered, they can comprise orthologous as well as paralogous genes. Our analysis included proteomes of selected species from the Eurosid clade: A. thaliana TAIR10 (Brassicaceae, Brassicales) () and Eucalyptus grandis v2.0 (Myrtaceae, Myrtales) from the Malvid clade (); Populus trichocarpa v3.0 (Salicaeae, Malpighiales) (), legumes M. truncatula Mt4.0v1 () and G. max Wm82.a2.v1 (Fabaceae, Fabales) (), F. vesca v1.1 (Rosaceae, Rosales) (), and P. andersonii and T. orientalis (Cannabaceae, Rosales) from the Fabid clade (Dataset S2). Sequences were retrieved from phytozome (www.phytozome.net). [...] To assess orthologous and paralogous relationships between Parasponia and Trema genes, we inferred phylogenetic gene trees for all 21,959 orthogroups comprising Parasponia and/or Trema genes by using the neighbor-joining clustering algorithm (). Based on these gene trees, for each Parasponia gene, its relationship to other Parasponia and Trema genes was defined as follows: (i) orthologous pair indicates that the sister lineage is a single gene from the Trema genome, suggesting that they are the result of a speciation event; (ii) inparalog indicates that the sister lineage is a gene from the Parasponia genome, suggesting that they are the result of a gene duplication event; (iii) singleton indicates that the sister lineage is a gene from a species other than Trema, suggesting that the Trema gene was lost; and (iv) multiortholog indicates that the sister lineage comprises multiple genes from the Trema genome, suggesting that the latter are inparalogs. For each Trema gene, the relationship was defined in the same way but with respect to the Parasponia genome (SI Appendix, Table S6). Because phylogenetic analysis relies on homology, we assessed the level of conservation in the multiple-sequence alignments by calculating the trident score using MstatX (https://github.com/gcollet/MstatX) (). Orthogroups with a score below 0.1 were excluded from the analysis. Examination of orthogroups comprising >20 inparalogs revealed that some represented repetitive elements; these were also excluded. Finally, orthologous pairs were validated based on the whole-genome alignments used in the annotation reconciliation. [...] To assess gene expression in Parasponia nodules, RNA was sequenced from the three nodule stages described earlier as well as uninoculated roots (SI Appendix, Table S8). RNA-seq reads were mapped to the Parasponia reference genome with HISAT2 version 2.02 () using an index that includes exon and splice site information in the RNA-seq alignments. Mapped reads were assigned to transcripts with featureCounts version 1.5.0 (). Normalization and differential gene expression were performed with DESeq2. Nodule enhanced genes were selected based on >2.0-fold change and P ≤ 0.05 in any nodule stage compared with uninoculated root controls. Genes without functional annotation or orthogroup membership or from orthogroups with low alignment scores (<0.1 trident score, as detailed earlier) or representing repetitive elements were excluded from further analysis. To assess expression of Parasponia genes in the hybrid nodules, RNA was sequenced from nodules and uninoculated roots. Here, RNA-seq reads were mapped to a combined reference comprising two parent genomes from P. andersonii and T. tomentosa. To assess which genes are nodule-enhanced in medicago, we reanalyzed published RNA-seq read data from Roux et al. () [archived at the National Center for Biotechnology Information (NCBI) under sequence read archive (SRA) study ID code SRP028599]. To assess which of these genes may be coopted from the ancient and widespread arbuscular mycorrhizal symbiosis, we generated a set of 575 medicago genes induced upon mycorrhization in medicago by reanalyzing published RNA-seq read data from Afkhami and Stinchcombe (archived at the NCBI under SRA study ID code SRP078249) () Both medicago data sets were analyzed as described earlier for Parasponia but by using the medicago genome and annotation version 4.0v2 as reference ().To assess common recruitment of genes in nodules from Parasponia and medicago, we counted orthogroups comprising P. andersonii and medicago nodule-enhanced genes. To assess whether this number is higher than expected by chance, we performed the hypergeometric test as well as three different permutation tests in which we randomized the Parasponia gene set, the medicago gene set, or both sets with 10,000 permutations. We then determined putative orthology between the Parasponia and medicago genes within the common orthogroups based on phylogenetic analysis. Parasponia and medicago genes were considered putative orthogroups if they occurred in the same subclade with more than 50% bootstrap support; otherwise, they were considered close homologs. […]

Pipeline specifications