Computational protocol: Implications of Pyrosequencing Error Correction for Biological Data Interpretation

Protocol publication

[…] Soil sampling was performed at the Cedar Creek Ecosystem Science Reserve (CCESR; part of the National Science Foundation Long-Term Ecological Research network) in July 2009, from a long-term plant richness manipulation. We targeted soil under the dominant influence of each of four plant species (two C4 grasses: Andropogon gerardii, Schizachyrium scoparium; two legumes: Lespedeza capitata, Lupinus perennis) by collecting soil cores from the base of individual plants. Each sample consisted of four bulked soil cores (5 cm diameter, 30 cm depth) collected from different individuals within the same plot and homogenized by hand. Each plant species was sampled in five plant richness treatments (monoculture and assemblages of 4, 8, 16 or 32 species). There were three plot-level replicates for each combination of host plant and richness treatment.
The PowerSoil DNA kit (MO BIO; Carlsbad, CA USA) was used to extract DNA from soil. The manufacturer’s protocol was modified with extended bead beating and sonication to enhance recovery of DNA from Actinobacterial spores. Selective primers were used to amplify a portion of the 16S rRNA gene. We used StrepB as the forward primer and the reverse complement of Act283 as the reverse primer, each at a final concentration of 200 nM. Both primers are selective for Actinobacteria and together amplify a fragment of approximately 165 nucleotides encompassing the V2 variable region of the 16S rRNA gene. Primers were modified to contain one of 30 different 10-mer identifying barcodes. PCRs consisted of 10 ng of template DNA in a 50 µL reaction volume using PCR Supermix High Fidelity (Invitrogen; Carlsbad, CA USA). PCR conditions consisted of an initial denaturation step of 30 s at 94 °C, followed by 30 cycles of 30 s at 94 °C, 30 s at 57 °C, and 60 s at 70 °C.
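As a quick consistency check on the sampling design described above, the sketch below enumerates every combination of host species, richness treatment, and plot-level replicate. Species names and richness levels come from the text; the constant and function names are illustrative, and the assignment of individual samples to the two sequencing pools is not described in the source, so only the pool count is derived.

```python
from itertools import product

# Host species and richness levels as listed in the protocol text.
HOST_SPECIES = [
    "Andropogon gerardii",
    "Schizachyrium scoparium",
    "Lespedeza capitata",
    "Lupinus perennis",
]
RICHNESS_LEVELS = [1, 4, 8, 16, 32]  # 1 = monoculture
REPLICATES = 3                       # plot-level replicates per combination
BARCODES_PER_POOL = 30               # distinct 10-mer barcodes available


def enumerate_samples():
    """Return one (species, richness, replicate) tuple per soil sample."""
    return list(
        product(HOST_SPECIES, RICHNESS_LEVELS, range(1, REPLICATES + 1))
    )


samples = enumerate_samples()
# Each pool can hold at most one sample per barcode.
n_pools = len(samples) // BARCODES_PER_POOL
print(len(samples), "samples,", n_pools, "pools of", BARCODES_PER_POOL)
```

The count of 60 samples agrees with the two pooled amplicon samples of thirty barcoded samples each.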
PCR products were passed through the QIAquick PCR Purification Kit (Qiagen; Valencia, CA USA), quantified by spectrophotometry, diluted with elution buffer to approximately 15 ng/µL, and then quantified by fluorometry (Quant-iT dsDNA HS assay kit; Invitrogen). Thirty samples, each with a unique primer barcode, were combined in equimolar amounts to form each of two pooled amplicon samples. Emulsion PCR and sequencing were performed using a GS FLX emPCR amplicon kit according to the manufacturer’s protocols (454 Life Sciences; Branford, CT USA). Each pooled sample was run on one region of a picotitre plate on the GS FLX sequencing system at the University of Minnesota BioMedical Genomics Center. Resulting sequence data have been submitted to the NCBI Sequence Read Archive as accession SRA019985.3.
Sequence data were processed with the program AmpliconNoise (version 1.24) to detect and correct probable errors. The dataset was processed on a per-sample basis, with the raw flowgram signals as the input to AmpliconNoise. Initial processing tested for a perfect match to the forward primer, truncated flowgrams at 225 flows, and discarded any reads that did not reach this length threshold. The PyroNoise algorithm was run with parameters set at s = 1/60, c = 0.01. The SeqNoise algorithm was run with parameters set at s = 1/30.3, c = 0.08.
Subsequent processing, and all processing for the standard (not-denoised) pipeline, was done with the program Mothur 1.20.1. Quality screening criteria and the number of reads culled are shown in . Sequences were aligned to the Silva reference database using k-mer searching (k-mer size of 8) to find the best template sequence, followed by the Gotoh alignment method with a reward of +1 for a match and penalties of −1 for a mismatch, −2 for opening a gap, and −1 for extending a gap. Aligned sequences were truncated to a length of 150 nucleotides and screened for chimeras using the UChime method.
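To make the alignment scoring scheme above concrete, here is a minimal sketch of a Gotoh-style global alignment score with affine gap penalties, using the stated values (+1 match, −1 mismatch, −2 gap opening, −1 gap extension). It is not Mothur's implementation: the function name `gotoh_score` is illustrative, it returns only the score rather than the alignment, it penalizes end gaps like internal gaps, and it assumes that a gap of length k scores gap_open + (k − 1) × gap_extend, a convention the source does not specify.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Global alignment score with affine gaps (three-matrix Gotoh DP).

    Assumption: a gap of length k scores gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap (gap in b)
    # Y: b[j-1] aligned to a gap (gap in a)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a new gap from M or the other gap state, or extend one.
            X[i][j] = max(
                M[i - 1][j] + gap_open,
                X[i - 1][j] + gap_extend,
                Y[i - 1][j] + gap_open,
            )
            Y[i][j] = max(
                M[i][j - 1] + gap_open,
                Y[i][j - 1] + gap_extend,
                X[i][j - 1] + gap_open,
            )
    return max(M[n][m], X[n][m], Y[n][m])


print(gotoh_score("ACGT", "ACGT"))        # four matches
print(gotoh_score("AACCGGTT", "AAGGTT"))  # six matches, one two-base gap
```

Under these assumptions, a single two-base gap costs −3, whereas two separate one-base gaps would cost −4, which is why affine penalties favor fewer, longer gaps.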
In the standard processing pipeline, sequences that differed by only a single base pair were pre-clustered. Sequences passing these quality criteria were clustered into operational taxonomic units (OTUs) using a 3% sequence dissimilarity criterion and the average neighbor clustering method. We rarefied to a consistent sampling effort of 3,000 reads per sample prior to calculating diversity statistics. Five of the 60 samples were excluded from diversity analysis because they contained fewer than 3,000 reads. For phylogenetic diversity analysis, dendrograms were generated from sequence distance matrices using Clearcut, as implemented in Mothur.
The per-nucleotide error rate implied by AmpliconNoise processing was calculated as the pairwise distance between the input and output sequences. Distance is defined as the number of base differences between the two sequences divided by the length of the shorter sequence; terminal gaps are ignored and each internal gap contributes a length of one. Pairwise alignments were made with ClustalW and distance was calculated using Mothur. We generated a PostgreSQL database to map reads from the raw data through AmpliconNoise processing steps and associated accession number changes, in order to contrast OTU composition between processing pipelines. […]
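The distance definition used for the per-nucleotide error rate can be sketched as follows. This is one reading of the text rather than Mothur's code: the function name `pairwise_distance` is illustrative, "each internal gap contributes a length of one" is assumed to mean that a run of consecutive gap columns counts as a single difference, and the "length of the shortest sequence" is assumed to be the ungapped base count of the shorter sequence. Both inputs are assumed to be rows of the same pairwise alignment in the same letter case.

```python
def pairwise_distance(aligned_a, aligned_b, gap="-"):
    """Differences divided by the ungapped length of the shorter sequence.

    Assumptions: terminal gaps are skipped, and each run of internal gap
    columns counts as one difference.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    len_a = len(aligned_a) - aligned_a.count(gap)
    len_b = len(aligned_b) - aligned_b.count(gap)
    shortest = min(len_a, len_b)
    if shortest == 0:
        raise ValueError("each sequence needs at least one base")

    def base_bounds(seq):
        """Indices of the first and last non-gap characters."""
        positions = [i for i, ch in enumerate(seq) if ch != gap]
        return positions[0], positions[-1]

    start_a, end_a = base_bounds(aligned_a)
    start_b, end_b = base_bounds(aligned_b)
    start, end = max(start_a, start_b), min(end_a, end_b)

    differences = 0
    previous_gap_in = None  # sequence that held a gap in the previous column
    for x, y in zip(aligned_a[start:end + 1], aligned_b[start:end + 1]):
        if x == gap and y == gap:
            continue  # column with gaps in both rows carries no information
        if x == gap or y == gap:
            gap_in = "a" if x == gap else "b"
            if gap_in != previous_gap_in:
                differences += 1  # count each gap run once
            previous_gap_in = gap_in
        else:
            previous_gap_in = None
            if x != y:
                differences += 1  # substitution
    return differences / shortest


# Input read versus its denoised output: one substitution over 8 bases.
print(pairwise_distance("ACGTACGT", "ACGAACGT"))
```

If the intended reading is instead that every gap column counts separately, only the gap-run check in the loop would need to change.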

Pipeline specifications