Computational protocol: The Inuit gut microbiome is dynamic over time and shaped by traditional foods

Similar protocols

Protocol publication

[…] Using the default parameters of the SmileTrain pipeline (https://github.com/almlab/SmileTrain/wiki), preclustering, quality filtering, primer removal, merging of raw sequences, and postclustering dereplication were performed on raw sequences using USEARCH v. 7.0.1090 []. Operational taxonomic units (OTUs) were called from the filtered reads, using a de novo distribution-based clustering method, with otu_caller.py script from SmileTrain which performs a custom algorithm using USEARCH. This clustering method takes into account the distribution of DNA sequences across samples and as well as the genetic distance between sequences [, ]. Subsequent analyses were performed using QIIME software version 1.8.0 []. Using the assign_taxonomy.py script, taxonomy was assigned to samples at a 97% identity level with the GreenGenes database version 13_8 []. To produce a filtered OTU table, used for the majority of the subsequent analysis, OTUs with fewer than 10 observations across all samples were removed from the OTU table using the filter_otus_from_otu_table.py script. In parallel, an unfiltered table including rare OTUs was kept to perform alpha diversity analyses. Each sample was then rarefied to 10,000 reads using the single_rarefaction.py script, yielding two rarefied OTU tables, one filtered for rare OTUs and one unfiltered. Twelve samples were eliminated at this stage (< 10,000 reads), including one PCR negative control and five extraction negatives, leaving a total of 172 samples remaining for all downstream analyses. Three extraction and one PCR negatives clustered with microbiome samples in a PCoA ordination, suggesting possible cross-contamination. To check for potential contamination, samples were compared across rows, columns and quadrant of the plate, according to their location in the 96 wells-plates used for library preparation, using the permanova function [] of the vegan package []. No correspondence between negatives and their proximate samples (same row, column, or quadrant) were observed (p > 0.05). This suggests that cross-contamination occurred randomly, and likely impacted all samples equally. Because most negative controls contained few reads (< 10,000), we also conclude that cross-contamination had a negligible influence on most samples, and that most reads were not due to contamination. [...] Except where noted, R software [] and packages were used for all statistical analyses. A threshold of α = 0.05 was considered statistically significant, and p values were adjusted with a Bonferroni correction in cases of multiple tests.To assess alpha diversity within each sample, four metrics were computed from the unfiltered OTU table (i.e., containing rare OTUs with fewer than 10 observations across all samples) using the phyloseq package []. Observed OTUs is a metric that counts the number of distinct OTUs in every sample. Chao1 is a non-parametric community richness estimator [], whereas Shannon and Simpson indices are diversity estimators considering both community richness and evenness. To compare alpha diversity estimates obtained from paired stool and toilet paper samples from the same individual, a paired Student’s t test was performed for each metric. To compare alpha diversity estimates across locations, and across samples preservation conditions, Mann–Whitney–Wilcoxon tests were performed for each metric. The Kruskal–Wallis rank sum test for non-parametric data was used to compare alpha diversity across seasons.To perform beta diversity analysis between samples, distance matrices were computed from the filtered OTU tables to avoid any biases caused by the presence of rare OTUs. Jenson–Shannon divergence (JSD), Bray–Curtis (BC) dissimilarity, unweighted UniFrac, and weighted UniFrac distances were calculated using the phyloseq package. JSD is a metric based on Shannon entropy [], whereas BC measures the compositional dissimilarity between two samples based on the relative abundance of OTUs. UniFrac uses distances between samples on a phylogenetic tree, either accounting for the relative abundance of OTUs (weighted) or not (unweighted) []. Principal coordinate analyses (PCoA) were then performed on the resulting distance matrices using the ggplot2 package []. Before computing the ordinations, the Euclidean nature of the distances was verified with the ade4 package []. A square-root transformation was applied when the Euclidean property was not respected (which was true for all cases, except for Unweighted UniFrac in Additional file  Figures S8A., S9A., and S12A.), ensuring an accurate representation of distances. Sample clustering hypotheses were tested using a permutational multivariate analysis of variance (permanova), and homogeneity of dispersion among sample groups was assessed using the betadisper function []. Permanova was selected because it has been shown to be more powerful than other tests to detect differences in community structure, even when group dispersions are heterogeneous []. To compare distance matrices computed from toilet paper and stool samples from the same individual, a Mantel [] test was computed using the ade4 package. All permutations tests were based on 9999 iterations.Information about food consumption, collected from dietary questionnaires, was compiled in a tabular matrix. Similar food categories were grouped to limit the number of variables and to avoid highly correlated variables (Additional file  Table S3). The mean frequency per day (number of times eaten ± standard deviation) of the different food categories was calculated for each sample, from 48 h dietary recall information provided in the questionnaires. Centered and scaled frequencies of consumption of food categories were then used as an explanatory variables in redundancy analyses (RDA) computed on squared-root-transformed unweighted UniFrac distances, using the vegan package. A partial RDA model was then used to estimate the amount of variation in microbiome community composition uniquely explained by diet after controlling for geography (Nunavut or Montréal), or by geography after controlling for diet. Adjusted R 2 statistics were computed to produce unbiased estimators of explained variation, accounting for the number of predictors in the models.To compare frequency of consumption of the different food categories across geography, the mean values per individual were calculated for each food categories. Then, we used a permuted Student’s t test on each food category to compare geographic locations, with Bonferroni correction for multiple tests. To compare frequency of consumption of the different food categories across season, the mean values per individual were calculated for each category for each season separately. The mean values within each season were compared with a Kruskal–Wallis rank sum test for non-parametric data with Bonferroni correction.We used linear discriminant analysis (LDA) effect size (LEfSe) [] to identify differentially abundant taxa between groups of samples []. This analysis combines a Kruskal–Wallis test, followed by an LDA step. The subset of OTUs violating the null hypothesis of the Kruskal–Wallis test serves to build the LDA model. From this model, each OTU is assigned an LDA score to assess its association with the categorical variables of interest (e.g., geography, sex).To investigate the stability of microbiome diversity within individuals over time, we computed the Shannon alpha diversity metric for each sample. Mean alpha diversity values were calculated for each individual. We then defined dispersion as the absolute difference in alpha diversity between each sample from an individual to that individual’s mean alpha diversity value. To compare the two populations (Nunavut vs. Montréal), we pooled the dispersion values separately for individuals from Montréal or Nunavut and used a permutational Student’s t test with 9999 permutations to assess the statistical significance of the difference in dispersion. To assess beta-diversity dispersion within each participant, we calculated the distance from each sample to the centroid for that person from unweighted UniFrac distances, using the betadisper function. We pooled the dispersion values separately for individuals from Montréal or Nunavut. To compare the two populations, a permutational Student’s t test with 9999 permutations was used to assess the statistical significance of the difference in dispersion. […]

Pipeline specifications