Computational protocol: Comparison of environmental and isolate Sulfobacillus genomes reveals diverse carbon, sulfur, nitrogen, and hydrogen metabolisms

Similar protocols

Protocol publication

[…] DNA was collected from the 5way site at Iron Mountain Mine [] and extracted as previously described []. This sample is referred to as ‘5way fungal streamer,’ and comprised the majority of sequencing information used in the genome assembly. Briefly, 4–5 ml of frozen biofilm was washed and thawed with cold acidified 0.9% NaCl (pH 1.0, H2SO4), homogenized with a 16 G needle in cold pH 7.0 phosphate buffered saline, and finally ground in a mortar and pestle in liquid nitrogen. The powdered biofilm pellets were then extracted with a PowerSoil DNA isolation kit (MoBio Laboratories, Carlsbad, CA, USA). Illumina library preparation sequencing was carried out according to JGI protocols [, ].Three flow cells of Illumina HiSeq were used to obtain ~90 million paired-end 76 bp reads (8.46 Gb). An additional overlapping paired-end library was sequenced to generate longer sequence reads. The initial paired-end reads were merged using SeqPrep [] into ~32 million long reads (6.17 Gb) with a median length of 194 bp. Reads from 17 additional previously published metagenomic datasets were used to assist binning and assembly []. [...] A total of 854 million paired-end (PE) Illumina reads (98 Gb) pooled from the 18 samples described above were assembled de novo using the Iterative De-Bruijn Assembler optimized for single-cell and metagenomic assemblies (idba-ud) []. A kmer-range of 19–99 was used with the pre-correction option enabled. All PE-reads were quality-trimmed using Sickle (sickle pe options -q 15 and -l 40) prior to assembly. PE-reads were mapped back to the resulting scaffolds using bwa []. All scaffolds larger than 1500 bp were binned using emergent self-organizing maps (ESOM) trained on tetranucleotide frequencies as previously described []. Briefly, larger sequences were subdivided into 10 kb fragments, and trained for 100 epochs using the k-batch training method, along with any leftover fragments exceeding 3 kb in length. To minimize noise, sequences in the range 1500–3000 bp were not included during training, but instead projected onto the trained weight vectors generated with the larger fragments. In order to separate individual Sulfobacillus bins, a discrete bin containing putative members of the order Clostridiales, was isolated and further binned using log-normalized sample abundance patterns from 18 different samples, this time including fragments down to 1500 bp. Five distinct sub-bins (AMDSBA1, AMDSBA2, AMDSBA3, AMDSBA4 and AMDSBA5) were individually re-assembled de novo using every read-pair that mapped within these respective bins. Additionally, because of its high similarity to S. thermosulfidooxidans DSM9293, all PE-reads mapping to this organism were included in the AMDSBA5 sub-assembly as well. Due to its considerably lower abundance, AMDSBA2 was sub-assembled using PE-reads as described above, but with the inclusion of additional reads from the overlapping HiSeq PE library that mapped within AMDSBA2 bin (median read length 194 bp). Scaffolds belonging to the Clostridiales bin, but not to the 5 putative genome bins were not re-assembled. In order to identify areas of inconsistent coverage possibly indicative of chimeric assemblies, reads were mapped back to reassembled genomes and scaffolds binned a second time again using time-series abundance data. The resulting ESOM map was used to identify scaffold fragments that binned inconsistently, suggestive of chimeric assemblies. These scaffolds were manually checked with stringent paired read mapping (100% identity) to either confirm the assembly or identify points where chimeras might occur. [...] Open reading frames on assembled contigs for Iron Mountain Sulfobacillus and S. thermosulfidooxidans strain ‘Cutipay’ were predicted using Prodigal [], and tRNAs predicted with tRNAscan []. Proteins were then compared with BLAST [] to UniRef90 [] and KEGG [] with matches greater than 60 bits being reported. Reverse BLASTs were also used to identify reciprocal best BLAST hits. Proteins were further analyzed with InterProScan [] to identify conserved domains. Protein sequences and corresponding annotations for S. thermosulfidooxidans AT-1 (DSM9293), S. acidophilus TPY, and S. acidophilus NAL (DSM10332) were downloaded from the Integrated Microbial Genomes database available at []. While S. thermosulfidooxidans strain ‘Cutipay’ and S. acidophilus TPY were included in most analyses here, results focus on the high quality genomes of the type strains S. thermosulfidooxidans AT-I (DSM9293) and S. acidophilus NAL (DSM10332). Unless otherwise specified, all references to S. thermosulfidooxidans and S. acidophilus in the text refer specifically to these type strains.Orthologs shared between species were identified with amino acid sequence searches between all pair-wise combinations of genomes using USEARCH []. Reciprocal best hits between each pair of genomes were considered orthologs if the alignments of their sequences had an E-value less than 0.01, a Bit score greater than 40, and covered at least 65% of each amino acid sequence. [...] All protein trees were made by first aligning protein sequences with MUSCLE []. Next, alignments were trimmed using Gblocks [, ], which stringently curates the alignment to phylogenetically informative sites. For each alignment, the optimum amino acid substitution model was estimated using ProtTest []. All trees were generated using RAxML [] using the PROTCAT rate model and the ProtTest-determined amino acid substitution model. Support was evaluated using 500 bootstrap replications in RAxML. Trees were visualized in iTOL [].The concatenated ribosomal protein tree was made using 16 core ribosomal proteins (rpL2, 3, 4, 5, 6, 14, 15, 16, 18, 22, 24, rpS3, 8, 10, 17, 19), selected based on low frequencies of lateral gene transfer [, ]. These proteins were aligned with MUSCLE [], manually trimmed, then concatenated, resulting in an 2,052 residue alignment. Phylogenetic relationships were analyzed using RAxML using the LG + α + γ substitution model, and nodal support determined with 500 bootstrap replicates. Trees were visualized in FigTree (available: [...] Near full-length 16S rRNA sequences for Sulfobacillus species were generated using EMIRGE—an iterative template-guided assembler that probabilistically generates 16S rRNA gene sequences using a 16S rRNA database []. First, potential 16S rRNA gene regions were found by a BLAST of all assembled contigs against SILVA db v. 108 []. Reads that mapped to these regions were extracted and trimmed with Sickle (available, allowing only paired-end reads with length >60 and quality scores >20. For the reference database, 186 sequences were downloaded from the SILVA SSU database representing the 174 sequences of ‘Family XVII Incertae Sedis’ (the family to which the Sulfobacillus genus belong) as well as twelve representative sequences of other organisms known to be represent the majority of the Richmond Mine microbial communities, including Leptospirillum species and Archaea of the ARMAN and Thermoplasmatales lineages. Potential chimeric sequences in this database were removed with DECIPHER [] and UCHIME [] [searched against the 2011 Greengenes database []]. EMIRGE was run in 50 iterations with a “join_threshold” parameter of 0.99.Using the QIIME software suite [], the EMIRGE-generated sequences and representative sequences of Sulfobacillus isolates were aligned using PyNAST with default parameters [], and filtered with in QIIME. Edges were trimmed leaving 1,154 unambiguously aligned positions. A maximum likelihood tree was generated with RAxML using the GTRCAT model. Nodal support was evaluated with 500 bootstrap replicates. Trees were visualized in FigTree. […]

Pipeline specifications