Computational protocol: Functional genes to assess nitrogen cycling and aromatic hydrocarbon degradation: primers and processing matter

Protocol publication

[…] When pyrosequencing data are used to compare gene profiles or gene diversities among samples, it is necessary to first bin the sequences by one of two general methods. Either sequences can be clustered into OTUs at a specified distance (the unsupervised method) or sequences may be classified directly using a reference database [the supervised method, as in Wang et al. ()]. The choice of method depends on the specific goals and, to some extent, the current knowledge of the target gene. Clustering better preserves information on diversity and better enables the discovery of novel gene sequences, while the supervised method yields more immediately interpretable results and better enables comparisons between different experiments. The supervised method is expected to fail, however, in instances where the reference database captures little of the existing gene diversity.

To contrast the performance of the supervised and unsupervised methods, soil samples were chosen from an investigation of the effects of various cropping systems on soil microbial diversity. These samples came from soils under corn, switchgrass, and prairie species and represent the range of soil types in central to southern Michigan and Wisconsin. NifH sequence libraries were produced by PCR from DNA extracted from these soil samples, per the protocol described by Wang et al. (), and analyzed for differences in gene diversities and gene profiles after binning sequences by each method.

Primer design is critical to capturing diversity of any gene (Iwai et al., ). For nitrogen fixation, primers for nifH have recently been evaluated in silico (Gaby and Buckley, ), and the Zf/Zr (Zehr and McReynolds, ) primer combination was found to have high theoretical performance, matching 92% of all reference sequences including all nifH groups I, II, and III, versus 25% for the PolF/PolR (Poly et al., ) primers.
However, the Zf/Zr combination proved impractical in use, giving non-specific products and smeared bands on gels when used to amplify DNA extracted from soil (Gaby and Buckley, ). Better performing primer combinations, such as those identified by Gaby and Buckley, should be evaluated for future pyrosequencing studies, taking into consideration coverage of groups important to the habitat studied.

Because they more reliably amplify DNA extracted from soil, primers PolF and PolR (Poly et al., ) were used in this study. These primers target an approximately 320 bp region of the nifH gene. The forward primer consisted of the 25 bp 454 A Adapter, a 10 bp barcode, and the 20 bp primer PolF (5′-CGT ATC GCC TCC CTC GCG CCA TCA G-barcode-TGC GAY CCS AAR GCB GAC TC-3′). The reverse primer consisted of the 25 bp 454 B Adapter and the 20 bp primer PolR (5′-CTA TGC GCC TTG CCA GCC CGC TCA GAT SGC CAT CAT YTC RCC GGA-3′). PolF and PolR are similar to Zf and Zr (Zehr and McReynolds, ), which we also considered using, but were modified to be less degenerate while maintaining broad coverage of nifH cluster I. When originally tested, they captured all 19 test strains, but these were limited to α-, β-, and γ-Proteobacteria, Actinobacteria, and Firmicutes (Poly et al., ). When tested with DNA extracted from pasture and cornfield soils, these primers produced bands of the expected size that hybridized to a nifH probe from Azospirillum and did not produce non-specific products.

Initial processing of the pyrosequencing reads was performed using tools available on the Ribosomal Database Project's (RDP) FunGene pipeline web site. After reads were quality filtered and barcode sorted, FrameBot was used for translation and frame shift correction by comparing sequences to those in a reference data set containing 782 unique sequences trimmed to cover the nifH amplicon region.
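
The fusion-primer layout and barcode sorting described above can be illustrated with a short Python sketch. It assembles the forward primer from the adapter and PolF sequences given in the text and turns the degenerate IUPAC codes into a regular expression for matching reads. The barcodes, the function names, and the assumption that each read begins at the barcode (adapter already trimmed) are illustrative only; this is not the RDP FunGene pipeline implementation, which also performs quality filtering.

```python
import re

# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Sequences as given in the text, with spaces removed.
ADAPTER_A = "CGTATCGCCTCCCTCGCGCCATCAG"
ADAPTER_B = "CTATGCGCCTTGCCAGCCCGCTCAG"
POLF = "TGCGAYCCSAARGCBGACTC"
POLR = "ATSGCCATCATYTCRCCGGA"


def degeneracy(primer):
    """Number of distinct sequences a degenerate primer represents."""
    n = 1
    for base in primer:
        n *= len(IUPAC[base])
    return n


def primer_regex(primer):
    """Translate a degenerate primer into a regular expression."""
    parts = []
    for base in primer:
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else "[" + options + "]")
    return "".join(parts)


def fusion_forward(barcode):
    """Adapter A + barcode + PolF, with degenerate codes left in place."""
    return ADAPTER_A + barcode + POLF


def demultiplex(reads, barcodes):
    """Assign reads to samples by exact barcode followed by a PolF match.

    Assumption: each read starts at the barcode (adapter already removed).
    Returns {sample: [sequence after the primer]}; unmatched reads are dropped.
    """
    patterns = {
        sample: re.compile("^" + bc + primer_regex(POLF))
        for sample, bc in barcodes.items()
    }
    binned = {sample: [] for sample in barcodes}
    for read in reads:
        for sample, pattern in patterns.items():
            match = pattern.match(read)
            if match:
                binned[sample].append(read[match.end():])
                break
    return binned
```

Written this way, PolF expands to 24 distinct sequences and PolR to 8. A real pipeline would typically also tolerate a small number of primer mismatches, which this exact-match sketch does not.
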
Sequences were deposited in the European Nucleotide Archive under accession numbers ERS329752-ERS329769.

Sequencing data were processed by closest match analyses and by clustering at a 5% distance, and analyzed using the packages vegan (Oksanen et al., ) and phyloseq (McMurdie and Holmes, ) in R (R Core Team, ). In both cases, the number of sequences was rarefied to the minimum number of sequences per sample and empty OTUs were removed. In the case of closest match, this left 3,693 sequences per sample in 160 OTUs representing 83 genera. In the case of clustering, this left 3,750 sequences per sample in 1,706 OTUs representing 81 genera.

By far, the majority of sequences were identified as Proteobacteria, further classified to α-, β-, γ-, and δ-Proteobacteria. The primers were originally designed to amplify nifH sequences from Proteobacteria, Firmicutes, and Actinobacteria, but a significant number of Verrucomicrobia sequences were obtained as well (Table ). Approximately 4% of the sequences were similar to environmental sequences that could not be classified to the phylum level, and may therefore represent novel sequences.

Unsurprisingly, a greater number of OTUs are observed and estimated when sequences are clustered (Figure ). Comparisons among treatments, however, are similar. Clustering better resolves samples by estimated number of species; that is, standard errors are relatively smaller. Ordinations of data resulting from closest match and clustering are generally similar, with the Michigan prairie and Michigan switchgrass sites separated from each other and from the other sites using both methods (Figure ). In this case, the clustering-based analysis provides greater resolution, as it also separates Wisconsin prairie sites from the others.

Multiple F-tests were performed for difference in taxa abundance among treatments for data processed by both means.
For the closest match method, 12 OTUs were found with an unadjusted p < 0.05, but none were significant after correcting for false discovery rate. For the clustering method, 46 OTUs were found with an unadjusted p < 0.05, and one of these was significant with adjusted p < 0.05. Clustr0103, genus Methylosinus, occurred exclusively in corn samples and was more abundant in those from Michigan.

In the case of nifH presented here, the supervised and unsupervised methods provide similar results. This is because the database was tailored to nifH sequences amplifiable by the primer combination PolF/PolR and does capture most of the gene diversity in the amplicon libraries. For that reason, relatively few sequences are distant from their closest match in the database used for their identification. When this is not the case, sequences identified by closest match may be further divided into separate bins: those more than 90% similar to their closest match, those 75–90% similar, those 50–75% similar, and those less than 50% similar. This binning by distance minimizes the grouping of disparate sequences in a single bin and is to be preferred for that reason. As an aid to interpretation, taxonomy may be assigned to clusters using a similar scheme.

Even though the difference in performance between the two methods, supervised and unsupervised, was minimal for this data set, clustering provided better estimates of total diversity, and proved more powerful in resolving differences in structure between treatments and in finding significantly different OTUs among treatments. For these reasons, it is recommended as the preferred method, and especially so when the reference database is less comprehensive than the one for nifH, which is currently the case for virtually all ecofunctional genes. […]
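
The gap between unadjusted and adjusted p-values reported above can be illustrated with a short Python sketch of a false discovery rate correction. The text does not name the procedure used; the Benjamini-Hochberg method shown here is an assumption, chosen because it is a common default, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the source does not state which FDR procedure was used;
    Benjamini-Hochberg is shown as one common choice.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An OTU would then be called significant when its adjusted value falls below 0.05. Because each p-value is scaled by the number of tests divided by its rank, a handful of OTUs with unadjusted p slightly below 0.05 among hundreds of tests will usually not survive the correction.
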
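
The similarity bins described above (more than 90%, 75–90%, 50–75%, and less than 50% similar to closest match) can be sketched in Python as follows. The input format, the function names, and the handling of values that fall exactly on a boundary are assumptions made for illustration; the text does not specify them.

```python
from collections import defaultdict


def similarity_bin(percent_identity):
    """Label a read by how similar it is to its closest reference match.

    Assumption: boundary values (exactly 90, 75, or 50) go to the bin
    whose range starts at or includes that value from above.
    """
    if percent_identity > 90:
        return ">90%"
    if percent_identity >= 75:
        return "75-90%"
    if percent_identity >= 50:
        return "50-75%"
    return "<50%"


def bin_by_closest_match(hits):
    """Count reads per (reference, similarity bin).

    hits: iterable of (read_id, reference_name, percent_identity) tuples,
    a hypothetical format standing in for closest-match output.
    """
    counts = defaultdict(int)
    for _read_id, reference, identity in hits:
        counts[(reference, similarity_bin(identity))] += 1
    return dict(counts)
```

Keeping the similarity bin as part of the key means that reads only distantly related to a reference are counted separately from close matches to the same reference, which is the purpose of the scheme.
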

Pipeline specifications

Software tools: FGP, RDP FrameBot, phyloseq
Application: Genome annotation
Chemicals: Nitrogen