Computational protocol: Reconstruction of Ribosomal RNA Genes from Metagenomic Data

Similar protocols

Protocol publication

[…] Ninety completed genomes were selected as references, including 76 bacteria and 14 archaea and combined using established profiles of community diversity with high- (HC), median- (MC), and low- (LC) complexity (). Genomic sequences, 16S rRNA gene sequences and gene copy number per genome were obtained from the Integrated Microbial Genomes website ( Heterogenous 16S rRNA genes within a genome were considered separately. For each metagenome complexity, three read data set (1,000,000 reads each, 350 nt) were simulated using empirically derived and context-based error models (GemSIM software ).Three environmental DNA samples for each of the two sponges Cymbastelaconcentrica and C. coralliophila were obtained as described in ref. . Shotgun pyrosequencing (454 Titanium) was conducted at the J. Craig Venter Institute, Rockville, USA and the resulting average read length corresponded to the simulated datasets above. The shotgun sequencing is available through the Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis website ( under project accession ‘CAM_PROJ_BotanyBay’. [...] The metagenomic reads of the simulated communities and the sponge microbial communities were pre-processed with PrinSeq using the settings ‘(“minlen”:“60”,“maxlen”:“700”,“minqualm”:“20”,“nsmaxp”:“1”,“complval”:“50”, “noniupac”:“true”,“derep0”:“true”,“derep1”:“true”,“complmethod”:“2”,“trimtails”:“6”,“trimns”:“1”,“trimscore”:“15”,“trimwindow”:“2”,“trimstep”:“1”,“tailsite”:“1”,“trimsite”:“3”,“trimtype”:“2”,“trimrule”:“1”)’. Metaxa (version 1.0.2) was then used to identify reads containing 16S rRNA sequences. Reads (>300 nt) from triplicates were then pooled and assembled with the GS De Novo Assembler 2.3 (454 Life Sciences, Branford, CT) using the ‘cDNA’ option, which is optimized for the uneven and high coverage typically expected in RNA assemblies. Default settings were used except ‘overlap identity’, which was set to 99%. Additionally, ‘reads limited to one contig’ and ‘extending low depth overlaps’ were selected. The 99% cut-off was chosen to allow overlap of reads with a 1% error, which is typical seen towards the end of pyrosequencing reads . Lower stringency (e.g. 97% as used by Radaxet al. during the assembly of 16S rRNA gene ) resulted in unacceptable rates of chimera formation (data not shown). After aligning contigs to the SILVA 1.08 database by SINA , flanking regions that were not part of the 16S rRNA gene sequences were removed. Resulting contigs were then examined for chimerism. If a contig constituted reads from more than one strain and any of these strains was less than 99% sequence identity to the other strains, it was considered a chimera. [...] Amplification of the 16S rRNA gene was performed on the same DNA sample as used for shotgun sequencing. Primers 28F ‘GAGTTTGATCNTGGCTCAG’ and 519R ‘GTNTTACNGCGGCKGCTG’ were used for amplification of the variable regions V1-3. PCR and subsequent sequencing are described in Dowd et al. 2008 and were performed at the Research and Testing Laboratory (Lubbock, USA). Trace data was deposited at the NCBI Sequencing Read Archive database with the project accession SRP011939.Analysis of the 16S rRNA tag-sequencing data was performed using Mothur v1.23.1 . Specifically, ‘shhh.flows’ was used for de-noising, ‘trim.seqs (pdiffs = 2, bdiffs = 1, maxhomop = 8, minlength = 200)’ was used for barcode removal and quality filtering, SINA was used for sequence alignment with the SILVA 1.08 database , ‘screen.seqs(start = 1048, minlength = 245)’ and ‘filter.seqs (vertical = T, trump = .)’ were used for alignment quality filtering, ‘pre.cluster(diffs = 2)’ was used for further error reduction, ‘chimera.uchime’ was used for de novo removal of chimeric reads, and Metaxa (version 1.0.2) was used to remove mitochondrial and chloroplast sequences. [...] For simulated data, filtered 16S rRNAcontigs (with coverage of more than 10 reads and length greater than 700 nt; see below) and 16S rRNA reads not in contigs were pooled with the 16S rRNA sequences of the reference genomes used for simulation. Redundancy within these pools was removed with CD-hit (99% identify cut-off). PhylOTU was then used to generate OTUs with 0.01, 0.03 and 0.05 phylogenetic distance cut-off. OTUs containing both reference sequences and simulated shotgun sequences (filtered contigs or reads) were assigned as ‘recovered’. OTUs containing only reference sequences were termed as ‘missed’, while those containing only shotgun sequences were assigned as ‘artificial’. OTU coverage was defined as the number of reads contained in each OTU. For the sponge samples, filtered 16S rRNAcontigs (with coverage of more than 10 reads and length greater than 700 nt) and 16S RNA reads not in contigs were pooled with PCR-amplified tag-sequences and then processed as above to generate OTUs. Diversity analysis was performed with QIIME and phylogenetic distance-based rarefaction was based on the tree of non-redundant sequences generated during the PhylOTU process. [...] 16S rRNA classification was performed with the RDP Classifier 2.3 , except for the classification of the abundant OTUs in sponge samples, which was performed with the Greengenes Classifier (March 6, 2012) followed by manual examination. Single-copy gene based analysis was performed using MLTreeMap (version 2.05, ‘minimal sequence length after Gblocks’ set to 35) . For phylogenetic analysis, Maximum-Likelihood trees of the 16S rRNA gene contigs were constructed using RAxML after alignment by SINA and removal of ambiguous positions by Gblocks (−t = d −b4 = 5 −b5 = h) . […]

Pipeline specifications