Protocol publication

[…] RepeatScout was used to identify de novo repetitive elements in the A. rabiei genome. With an l-mer size of 15, it generated a library of 278 repetitive families, which included transposable elements (TEs) and dispersed duplicated sequences. This library was then filtered using the following criteria: 1) predicted repeats were aligned to the genome assembly with BLASTN, and hits with an alignment length <50 bp were discarded; 2) repeats with a frequency <5 in the genome were removed; and 3) repeats with significant hits to known proteins in UniProt were discarded, except those with hits to known TEs. The resulting 155 consensus sequences were classified using TEclass. These repetitive families were also annotated by TBLASTX search against RepBase (http://www.girinst.org/repbase/index.html). The A. rabiei genome assembly was then masked with the 155 repetitive families using RepeatMasker. A high-throughput SSR search to identify mono- to hexanucleotide SSR motifs was performed using the MIcroSAtellite identification tool (MISA) (http://pgrc.ipk-gatersleben.de/misa/download/misa.pl) with default parameters: a minimum SSR motif length of 10 bp; minimum repeat counts of 10 for mono-, 6 for di-, and 5 for tri-, tetra-, penta-, and hexanucleotide motifs; and a maximum interruption of 100 bp between two different SSRs in a compound sequence. [...] Protein-coding genes in the masked A. rabiei genome were predicted using three different gene prediction programs: GeneMark-ES, Fgenesh and AUGUSTUS. Fgenesh, trained with S. nodorum, predicted a total of 7,707 protein-coding genes, while the unsupervised training program GeneMark-ES predicted 11,299 genes. For AUGUSTUS, A. rabiei ESTs were used as the hints file, and S. nodorum, C. sativus and P. tritici-repentis (all of which belong to the order Pleosporales) were each selected as the default gene model. These runs predicted 10,708, 11,293 and 10,843 protein-coding genes, respectively.
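The SSR thresholds quoted above (mono-10, di-6, tri-5, tetra-5, penta-5, hexa-5) can be illustrated with a small regular-expression sketch. This is not MISA itself, only a simplified stand-in: it reports perfect repeats, does not merge compound SSRs within the 100 bp interruption window, and may miss a repeat that directly abuts a skipped lower-order run. The function name find_ssrs and the test sequence are hypothetical.

```python
import re

# Minimum repeat counts per motif length, taken from the thresholds in the text
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Start positions are 0-based. Simplified sketch, not a MISA replacement.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT")
            if any(
                motif == motif[:d] * (unit_len // d)
                for d in range(1, unit_len)
                if unit_len % d == 0
            ):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)


# Hypothetical test sequence: A x12, AG x7 and ACG x5 separated by short spacers
example = "GG" + "A" * 12 + "CC" + "AG" * 7 + "TTT" + "ACG" * 5 + "C"
ssrs = find_ssrs(example)
```

Here ssrs holds one entry per repeat that meets its threshold; a run such as nine consecutive A bases falls below the mono-10 cut-off and is not reported.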
Altogether, the 51,850 genes predicted by the three programs were used to retrain AUGUSTUS (with the C. sativus parameters as the default gene model), and new genes were then predicted. Additionally, annotated proteins from S. nodorum, C. sativus and P. tritici-repentis were mapped onto the A. rabiei genome using Exonerate (protein2genome model). The genes mapped by Exonerate were then mapped back to the genes predicted by the retrained AUGUSTUS, and only the genes that could be mapped were retained. To evaluate genome completeness, highly conserved single- or low-copy genes were searched for among the predicted proteins of A. rabiei. A BLASTP search was carried out against the single-copy families in FUNYBASE, which comprise 246 single-copy genes from all 21 species available in the database. Additionally, 248 core eukaryotic genes (CEGs) were searched by BLASTP. For both completeness assessments, a cut-off E-value of ≤ 1e-5 was applied. [...] For functional annotation of the predicted A. rabiei genes, a BLASTX search against the NCBI non-redundant database was performed with cut-offs of E-value ≤ 1e-5 and identity ≥40%. Gene ontology (GO) analysis was carried out using BLAST2GO. For pathway analysis, the 10,596 protein sequences were annotated against the Kyoto Encyclopedia of Genes and Genomes (KEGG) using BlastKOALA. A total of 3,423 predicted protein sequences were assigned KO identifiers, which were then mapped to KEGG pathways using KEGG Mapper. Pfam analysis was done by batch sequence search against the Pfam database (http://pfam.xfam.org/) with E-value ≤ 1e-5. For CAZyme prediction, the CAZymes Analysis Toolkit (CAT) was used. To identify potential pathogenicity-related proteins, a BLASTP search was performed against the Pathogen-Host Interaction database (PHI-base) with a threshold E-value of ≤ 1e-5. The tRNA genes were predicted using a combination of tRNAscan-SE and ARAGORN.
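The E-value and identity cut-offs used in the searches above can be applied to tabular BLAST results with a short filter. The sketch below assumes the hits were exported in the standard 12-column tabular layout of BLAST+ (-outfmt 6: query, subject, percent identity, ..., E-value in column 11, bit score), which the protocol does not state. The function name filter_blast_hits and all sequence identifiers are hypothetical.

```python
def filter_blast_hits(lines, max_evalue=1e-5, min_identity=40.0):
    """Keep hits with E-value <= max_evalue and percent identity >= min_identity.

    Assumes tab-separated BLAST+ "-outfmt 6" columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and identity >= min_identity:
            kept.append((query, subject, identity, evalue))
    return kept


# Hypothetical hits, for illustration only
example_hits = [
    "gene_001\tsubj_A\t85.2\t300\t40\t2\t1\t300\t5\t304\t1e-20\t250",
    "gene_002\tsubj_B\t35.0\t120\t70\t4\t1\t120\t9\t128\t1e-8\t60",
    "gene_003\tsubj_C\t62.1\t90\t30\t1\t1\t90\t3\t92\t2e-3\t35",
]
passing = filter_blast_hits(example_hits)
```

Only gene_001 passes both cut-offs in this example. For the searches that use an E-value threshold alone (for instance the completeness and PHI-base searches), the same function could be called with min_identity=0.0.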
The nucleotide sequences of the assembled genome were used for prediction with default parameters and a eukaryotic gene model. [...] Phylogenetic analysis was performed using the amino acid sequences of actin (ACT), beta-tubulin (BTUB), translation elongation factor-1 alpha (TEF1) and NAD-dependent glycerol-3-phosphate dehydrogenase (GPD). Protein sequences were downloaded from GenBank and aligned in T-REX using MAFFT as the sequence alignment tool. ProtTest 3.2.1 was used to estimate the best-fit model of protein evolution for maximum likelihood (ML) analysis. The species tree was generated in T-REX using RAxML with the LG model of evolution. The phylogenetic tree was visualized using FigTree (v1.4.) (http://tree.bio.ed.ac.uk/software/figtree/). […]

Pipeline specifications