Computational protocol: Comparative genome analysis of Pseudogymnoascus spp. reveals primarily clonal evolution with small genome fragments exchanged between lineages

Protocol publication

[…] Gene prediction for the 14 P. spp. strains was performed as follows. The assembly file of each strain was masked using RepeatMasker 3.3.0. To locate exons and introns, the RNA-seq data available for strains F-3808 and F-4515 were mapped onto the masked scaffolds of each strain using TopHat2 [] (version 2.0.8), and the results were used to generate intron hints for the AUGUSTUS gene predictor (with the bam2hints and filterBam programs included in the AUGUSTUS distribution, and the samtools package for sorting and filtering). The AUGUSTUS extrinsic.cfg file was adjusted to take into account potential intron boundaries from the RNA-seq data (a larger bonus for introns confirmed by RNA mapping, a tiny penalty otherwise). Final gene prediction was done with AUGUSTUS [] (version 2.7), using the intron hints and with the species parameter set to “botrytis_cinerea”. [...]

Whole-genome alignment of the assembled contigs was performed in two steps. First, we used LASTZ [], a program that identifies regions of local similarity, to match the contigs from different samples. Single_cov2 from the TBA package [] was used to filter out the lower-scoring alignments in regions with more than one significant alignment. Then, to increase the length of the alignment blocks, we performed global alignment of the contig groups obtained in the first step using CLUSTAL []. For the analysis of genomic regions with conflicting phylogenetic configurations, we used only alignment blocks longer than 20 kbp. The total length of such blocks is 5.6 Mbp. [...]

We considered a nucleotide site to support the phylogenetic configuration (strain A, (strain B, strain C)) if the nucleotides in strain B and strain C are identical and differ from the nucleotide in strain A; in addition, we required the nucleotide in strain A to be carried by at least 6 of the remaining 11 sequenced P. spp. strains.
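The site-support rule stated above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the representation of an alignment as equal-length strings, and the decision to skip columns in which a focal strain carries anything other than A, C, G or T are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

VALID = set("ACGT")  # assumption: gaps and ambiguity codes are not scored


def supported_outgroup(focal: Sequence[str],
                       others: Sequence[str],
                       min_support: int = 6) -> Optional[int]:
    """Classify one alignment column.

    focal  : nucleotides of the three focal strains, in a fixed order
    others : nucleotides of the remaining sequenced strains (11 in the text)

    Returns the index (0, 1 or 2) of the focal strain that plays the role of
    "strain A" in (A, (B, C)), or None if the column supports no configuration.
    """
    if any(nt not in VALID for nt in focal):
        return None
    for a in range(3):
        b, c = [i for i in range(3) if i != a]
        # B and C identical, and distinct from A
        if focal[b] == focal[c] and focal[b] != focal[a]:
            # A's nucleotide must be carried by enough of the other strains
            support = sum(1 for nt in others if nt == focal[a])
            return a if support >= min_support else None
    return None


def classify_columns(focal_seqs: Sequence[str],
                     other_seqs: Sequence[str]) -> List[Optional[int]]:
    """Apply supported_outgroup to every column of an ungapped-index alignment."""
    length = len(focal_seqs[0])
    return [
        supported_outgroup([s[i] for s in focal_seqs],
                           [s[i] for s in other_seqs])
        for i in range(length)
    ]


if __name__ == "__main__":
    # Toy data only; focal order is (F-3808, F-3557, F-4514)
    focal = ["AACA", "GACC", "GAGA"]
    others = ["AAGT"] * 6 + ["TTTT"] * 5
    print(classify_columns(focal, others))
```

With the focal strains ordered as (VKM F-3808, VKM F-3557, VKM F-4514), a return value of 0 corresponds to a configuration with VKM F-3808 as strain A, and 1 or 2 to the two alternatives.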
The phylogenetic configuration (VKM F-3808, (VKM F-3557, VKM F-4514)) was named canonical because it holds for the vast majority of the genome, whereas the configurations (VKM F-3557, (VKM F-3808, VKM F-4514)) and (VKM F-4514, (VKM F-3808, VKM F-3557)) were named noncanonical. The frequency of nucleotide sites with a noncanonical phylogenetic configuration is 0.002.

We considered a window of 200 nt to have a noncanonical phylogenetic configuration if the number of nucleotide sites supporting a noncanonical configuration exceeds the number of sites supporting the canonical configuration by at least 8. The threshold of 8 guarantees that fewer than 0.01 such windows would be found at random. Overlapping windows were merged into regions, with the boundaries set at nucleotide sites supporting the noncanonical configuration. The PAML implementation of the Kishino-Hasegawa test was run to compare phylogenetic configurations and calculate bootstrap values []; the pRELL threshold was set at 0.95.

To ensure that the regions with altered phylogenetic configuration are not assembly artifacts, we mapped the original sequence reads with the bwa [] program onto the regions with noncanonical phylogenetic configuration, including the sequence overlapping the region boundaries. Regions with noncanonical phylogenetic configuration show coverage similar to the rest of the genome. [...]

To identify gene orthologs, we searched for bidirectional best hits between each pair of P. spp. strains. We obtained 7524 groups of homologous genes that are present in each of the 14 strains. Each group of homologous genes was then aligned with MACSE []. Finally, the concatenated alignments were used to calculate synonymous and nonsynonymous distances with the codeml program from the PAML package. Only codon columns present in all 14 strains were used in the analysis. Dendroscope (v. 3.2.10) [] was used to visualize the phylogenies. We estimated the number of genes lost on each branch from sets of orthologs that have no BLAST hits to exon sequences in certain lineages. A lost gene is considered a pseudogene if a significant BLAST hit to the genome is observed but the gene structure is disrupted; it is considered deleted if there is no significant BLAST hit to the genome.

Gene orthologs were also used to estimate synteny across P. spp. strains. A pair of orthologous genes was considered syntenic if the genes were adjacent in each strain, and nonsyntenic if they were adjacent in only one strain. The total numbers of syntenic orthologous pairs out of all orthologous pairs are shown in Additional file: Table S1. […]
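The synteny criterion for a pair of strains can be sketched in Python as follows. This is a hypothetical illustration rather than the published pipeline: it assumes each strain is given as a list of scaffolds, each scaffold is an ordered list of ortholog-group identifiers (genes without an ortholog already removed), and gene orientation is ignored. The function names are invented for this example.

```python
from typing import FrozenSet, List, Set, Tuple

Scaffolds = List[List[str]]  # ordered ortholog-group IDs, one list per scaffold


def adjacent_pairs(scaffolds: Scaffolds) -> Set[FrozenSet[str]]:
    """Collect unordered pairs of genes that are neighbours on a scaffold."""
    pairs: Set[FrozenSet[str]] = set()
    for genes in scaffolds:
        for left, right in zip(genes, genes[1:]):
            pairs.add(frozenset((left, right)))
    return pairs


def synteny_counts(strain_a: Scaffolds, strain_b: Scaffolds) -> Tuple[int, int]:
    """Return (syntenic, nonsyntenic) counts of orthologous gene pairs.

    syntenic    : pairs adjacent in both strains
    nonsyntenic : pairs adjacent in exactly one of the two strains
    """
    pairs_a = adjacent_pairs(strain_a)
    pairs_b = adjacent_pairs(strain_b)
    return len(pairs_a & pairs_b), len(pairs_a ^ pairs_b)


if __name__ == "__main__":
    # Toy gene orders, not real data
    strain_a = [["g1", "g2", "g3"], ["g4", "g5"]]
    strain_b = [["g2", "g1"], ["g3", "g4", "g5"]]
    syntenic, nonsyntenic = synteny_counts(strain_a, strain_b)
    print(syntenic, nonsyntenic)
```

Treating a pair as unordered means an inversion of two neighbouring genes still counts as syntenic; a stricter definition that tracks strand and order would need a different pair representation.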

Pipeline specifications

Software tools: RepeatMasker, TopHat, AUGUSTUS, SAMtools, LASTZ, PAML, BWA, MACSE, Dendroscope
Applications: Phylogenetics, RNA-seq analysis, Nucleotide sequence alignment
Organisms: Pseudogymnoascus destructans, Bacillus phage SPP1
Diseases: Leukemia, Lymphoid