Computational protocol: Systematic Bias Introduced by Genomic DNA Template Dilution in 16S rRNA Gene-Targeted Microbiota Profiling in Human Stool Homogenates

Similar protocols

Protocol publication

[…] The 16S rRNA gene sequencing was performed using a next-generation Illumina MiSeq 300-bp read sequencing platform with custom sequencing primers read1 (R1; 5′-TATGGTAATTGTCCTACGGGAGGCAGCAG-3′), read2 (R2; 5′-AGTCAGTCAGCCCCGTCAATTCMTTTRAGT-3′), and index (5′-ACTYAAAKGAATTGACGGGGCTGACTGACT-3′). R1s and R2s were first demultiplexed by the use of Illumina BCL2Fastq software. Paired R1-R2 reads were analyzed using hybrid-denovo (), which was specifically developed for nonoverlapping paired-end reads, with default parameter settings. Briefly, quality filtering was performed using Trimmomatic (). Surviving read pairs were further trimmed, concatenated, and sorted by cluster size. Afterward, de novo operational taxonomic unit (OTU) picking (97% similarity) was conducted via the use of the UPARSE algorithm (). UCHIME was used to perform additional reference-based chimera removal against the gold database (). High-quality single-end R1s were further mapped to the R1 end of the OTU representatives using USEARCH (). The remaining unmapped R1s were clustered into new OTUs via the use of the UPARSE algorithm and added to the list of OTUs generated by the paired-end reads. Thus, the OTU representatives consisted of a mixture of single-end and paired-end reads. We then aligned all the OTU representatives using the structure alignment algorithm Infernal trained on the Ribosomal Database Project (RDP) database (). OTU representatives that were not aligned or that had negative alignment scores were removed since they hypothetically represented nonbacteria. A phylogenetic tree was built from the aligned OTU representatives using FastTree (). FastTree has a low penalty level with respect to end gaps, which is favorable when processing a mixture of single-end and paired-end reads. Finally, R1 and R2 reads were stitched together with ambiguous nucleotides between them and then assigned taxonomy data by the RDP classifier () trained on the Greengenes database v13.5 (). OTUs not classified as Bacteria and singleton OTUs were removed as they were presumed to be contaminants. Samples with fewer than 2,000 reads were removed. [...] We investigated the effects of technical variables on the overall microbiota structure as described previously (). We first summarized the microbiota data by using α-diversity and β-diversity measures. Three α-diversity metrics were used: the observed OTU number, the Shannon index, and the inverse Simpson index. The observed OTU number reflects species richness, whereas the Shannon and inverse Simpson index values take into account both species richness and species evenness. Inverse Simpson index values place more weight on dominant taxa than Shannon index values. β-Diversity values, in contrast, indicate the levels of diversity shared by bacterial populations in terms of ecological distance. Different distance metrics provide distinctive views of community structure. Three β-diversity measures, namely, Bray-Curtis, unweighted UniFrac, and weighted UniFrac distances, were calculated by using the OTU table and a phylogenetic tree (“vegdist” and “GUniFrac” function in the R packages “vegan” and “GUniFrac”). Analyses that determine weighted UniFrac and unweighted UniFrac distance values are designed to detect phylogenetic signals, i.e., clusters of OTUs on the phylogenetic tree, while Bray-Curtis distance values are more efficient in catching scattered signals. Weighted UniFrac distance values place more weight on the significance of abundant lineages and are more efficient at detecting changes in abundant lineages, while unweighted UniFrac distance values use presence and absence information and are powerful for detecting community membership change or changes in abundance in rare lineages (if the abundance is below the detection limit, it will appear to represent absence) (). Significant values achieved using any of the statistical metrics described above are reported in the manuscript. Sequencing data were rarefied to a common depth of 11,328 reads before α-diversity and β-diversity analyses were performed.To assess the association between a technical variable and α-diversity, we fitted a linear regression model to the α-diversity metrics after rarefaction. gDNA concentration and 16S rRNA gene yield data were log-transformed before analysis. To assess the association with β-diversity measures, we used permutational multivariate analysis of variance (PERMANOVA; “Adonis” function in the R “vegan” package). Significance was assessed with 1,000 permutations. A distance-based R2 value was used to quantify the percentage of variability of the microbiota that was explained by various technical factors (). Ordination plots based on Bray-Curtis, unweighted UniFrac, and weighted UniFrac distances were generated with principal-coordinate analysis as implemented in R (“cmdscale” function in the R “stats” package). To identify bacterial genera associated with gDNA concentration, we conducted differential abundance analysis at the genus level by using the Wilcoxon rank sum test. We filtered out genera with a prevalence of less than 10%. We normalized the count data into relative abundances (proportions) by dividing by the total read count. Taxa with a maximum proportion of less than 0.2% were excluded from testing to reduce the number of the tests. False-discovery-rate control (B-H procedure; “padjust” in standard R packages) was used to correct for multiple testing, and false-discovery-rate-adjusted P values (q values) of less than 0.01 were considered significant. […]

Pipeline specifications