Computational protocol: Assessing gut microbiota perturbations during the early phase of infectious diarrhea in Vietnamese children

Protocol publication

[…] EMIRGE (Expectation Maximization Iterative Reconstruction of Genes from the Environment) is a database-dependent assembler built on an iterative expectation-maximization (EM) algorithm. It reconstructs full-length 16S rRNA small subunit gene sequences (by simultaneously mapping and clustering reads) and estimates their relative abundances. The SILVA small subunit (SSU) rRNA database version 119 was filtered to remove potential large subunit (LSU) rRNA sequences, and closely related sequences were clustered at 97% identity by USEARCH; both steps were performed with PhyloFlash v2.0. The resulting database was used as the reference template for EMIRGE. To limit the computational effort of using the complete data set, one million paired-end reads were randomly subsampled without replacement from each sample library using seqtk. Reads were trimmed using Sickle to remove those with quality <30 and length <60. The trimmed reads from each subsampled library were passed to the amplicon-optimized version of EMIRGE, run with 40 iterations and a 97% joining threshold. For each sample, the EMIRGE output is a set of assembled and clustered sequences representing that sample's OTUs, together with their estimated abundances. Assembled sequences with sample-wise normalized relative abundances below 0.01% were removed from further analysis. A pseudo-count for each sequence in each subsampled library was calculated by scaling the number of successfully mapped reads by the EMIRGE-estimated relative abundance of that sequence. All filtered sequences, and a count table detailing their respective abundances, were pooled and imported into the 16S rRNA processing platform mothur v.1.36.0. To minimize length differences among the assembled sequences and to facilitate more accurate OTU clustering, sequences were aligned to a version of the SILVA reference database trimmed to include only the amplified region (338F-1061R).
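The abundance filter and pseudo-count scaling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `emirge_pseudo_counts` and its inputs are hypothetical, pseudo-counts are rounded to whole reads, and abundances are not re-normalized after filtering (the text does not specify either choice).

```python
def emirge_pseudo_counts(rel_abundance, mapped_reads, min_abundance=1e-4):
    """Convert relative abundances for one library into pseudo-counts.

    rel_abundance: dict mapping sequence ID -> estimated relative abundance
    mapped_reads:  number of reads successfully mapped in this library
    min_abundance: sequences below this normalized abundance are dropped
                   (1e-4 corresponds to the 0.01% cutoff in the text)
    """
    # Normalize within the sample so that abundances sum to 1.
    total = sum(rel_abundance.values())
    normalized = {seq: a / total for seq, a in rel_abundance.items()}

    # Drop low-abundance sequences.
    kept = {seq: a for seq, a in normalized.items() if a >= min_abundance}

    # Scale mapped reads by relative abundance.
    # ASSUMPTION: round to whole reads, no re-normalization after filtering.
    return {seq: round(mapped_reads * a) for seq, a in kept.items()}
```

For example, in a library with 100,000 mapped reads, a sequence with a relative abundance of 0.6 would receive a pseudo-count of about 60,000, while a sequence at 0.005% would be discarded.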
Gap-only columns were filtered from this alignment, and sequences with at most 7 (∼1%) ambiguous sites were retained for downstream analyses. Sequences were dereplicated, and each ambiguous site was replaced randomly with one of the 4 nucleotides (A, T, C, G). These unique sequences and their respective counts served as input for the UPARSE clustering algorithm, which clustered all pooled sequences at a 97% similarity threshold. Chimeric sequences were stringently predicted and removed by UCHIME v6.0, run both in de novo mode and against the ChimeraSlayer-informed reference database (the ‘gold’ database). A total of 7,479 OTUs were reconstructed for 199 of the successfully sequenced samples. An alignment of the most abundant representative sequence from each OTU was used to construct a phylogenetic tree using FastTree 2 under default parameters. Taxonomic assignment of OTU representatives up to the genus level was performed using the mothur-implemented Ribosomal Database Project (RDP) classifier, with a minimum support threshold of 80%.

[...] All analyses were conducted in R (ref. 52) using multiple packages, including ‘phyloseq’, ‘cluster’, ‘randomForestSRC’, ‘ggplot2’, ‘nnet’, ‘lmtest’, ‘vegan’, ‘DESeq2’ and ‘SpiecEasi’. An OTU count table, a taxonomy classification table, the related clinical and demographic data, and the OTU phylogenetic tree were imported and analyzed as a ‘phyloseq’ object, allowing a unified and interactive analysis approach.

[...] To examine the relationship between microbiome structure and the associated explanatory variables for the 195 samples with complete metadata, we applied constrained analysis of principal coordinates (CAP) to the weighted UniFrac dissimilarity matrix, using a set of demographic variables (age, sex, WAZ score, feeding pattern, income, rural residence) as well as diarrheal status. This was performed using the ‘capscale’ function in the ‘vegan’ package.
Significant variables were identified and included in the final model by stepwise model selection based on the Akaike information criterion (AIC). The α diversity of these 195 samples was estimated by the Shannon diversity index. Analysis of variance (ANOVA) with a post-hoc Tukey test and Bonferroni correction for multiple comparisons was used to compare control and diarrheal α diversity among and within each CST. Multinomial logistic regression modeling was applied using the ‘multinom’ function in the ‘nnet’ package to evaluate the association of the aforementioned demographic factors and of clinical features (vomiting, dysentery, and infection type) with the CST membership of diarrheal cases, with the Bacteroides-rich CST (CST2) serving as the reference group. These predictors were included because they had limited missing data and can be assessed by clinicians upon a patient's admission. Five samples with WAZ scores >3 or <-3, and one CST1 sample with age exceeding 2 standard deviations, were considered outliers and removed, leaving 136 diarrheal samples with full metadata for regression modeling.

[...] A correlation network was constructed for all 199 control and diarrheal samples to characterize the potential interactions among the 92 most representative OTUs, defined as OTUs detected in at least 10 samples. This filtering step did not substantially affect the representativeness of the data set: the median sample retention rate was 93% (IQR: 87%-97%). The network was constructed using the SparCC wrapper in the ‘SpiecEasi’ package. The statistical significance of each interaction was assessed with 100 bootstrap iterations, with p values adjusted for multiple comparisons. To avoid spurious correlations, only correlations with adjusted p values no greater than 0.05 and an absolute magnitude of at least 0.25 were considered significant and represented in the final plots. […]
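The two thresholding steps in the network analysis above (keeping OTUs detected in at least 10 samples, then keeping only correlations with adjusted p ≤ 0.05 and absolute magnitude ≥ 0.25) can be illustrated with a minimal Python sketch. The original analysis was run in R with ‘SpiecEasi’; the function names and data layouts below are hypothetical, "detected" is assumed to mean a count greater than zero, and the correlations and adjusted p values are assumed to have been computed beforehand.

```python
def filter_otus_by_prevalence(counts, min_samples=10):
    """Keep OTUs detected in at least `min_samples` samples.

    counts: dict mapping OTU ID -> list of per-sample counts.
    ASSUMPTION: an OTU is "detected" in a sample when its count is > 0.
    """
    return {
        otu: row
        for otu, row in counts.items()
        if sum(1 for c in row if c > 0) >= min_samples
    }


def significant_edges(edges, max_adj_p=0.05, min_abs_corr=0.25):
    """Keep edges that pass both the p-value and the magnitude threshold.

    edges: iterable of (otu_a, otu_b, correlation, adjusted_p) tuples,
           where adjusted_p has already been corrected for multiple testing.
    """
    return [
        (a, b, r, p)
        for a, b, r, p in edges
        if p <= max_adj_p and abs(r) >= min_abs_corr
    ]
```

Both thresholds are inclusive, matching the wording "no greater than 0.05" and "at least 0.25"; negative correlations are retained when their absolute value passes the cutoff.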

Pipeline specifications

Software tools: EMIRGE, USEARCH, seqtk, mothur, UPARSE, UCHIME, ChimeraSlayer, FastTree, RDP Classifier, phyloseq, ggplot2, vegan, DESeq2, UniFrac, SparCC
Databases: HMP, HOMD
Organisms: Homo sapiens, Bacteria, Bifidobacterium pseudocatenulatum
Diseases: Diarrhea, Dysentery, Escherichia coli Infections, Dysbiosis