Computational protocol: The development of lower respiratory tract microbiome in mice

Similar protocols

Protocol publication

[…] The primary goal of statistical analyses was to examine, in parallel, the relative abundance and diversity of the lung microbiome during developmental stages. Abundance levels, deemed to be proportional to the number of reads of a taxonomic unit per week, were generated using QIIME. Briefly, the raw reads were demultiplexed, filtered, quality-checked, and analyzed using QIIME 1.8.0 []. Clustering into operational taxonomic units (OTUs) was done at 97% similarity levels. The reads from the 10 negative controls were first demultiplexed using the respective barcodes and then analyzed along with the previously demultiplexed week data points. As the sequence reads from week points had been already analyzed without the 10 negative controls, the new analysis allowed us to do this analysis with the 10 negative controls. We did not find any evidence for bias or skew arising from the 10 negative control sequences. Greengenes [] and RDP datasets [] were employed to assign taxonomy. ICC values were calculated by using the ICCest() method of R library “ICC.” This method estimates the ICC values using the variance components from a one-way ANOVA. In order to account for uneven sample counts and low-depth samples as an artifact of sequencing, we employed standard rarefaction protocols provided in QIIME. However, it has been shown before that rarefaction may not always be the appropriate methodology to standardize all the samples []. Therefore, we also log10-transformed the sample data for statistical analysis of OTU data. Both alpha and beta diversity indices were calculated after following standard rarefactions steps for each week points. For these indices, rarefaction was done looking at the graphs of diversity vs. sequencing depth. In order to select the most appropriate sequencing depth, the first quantile value of the number of ordered reads was taken to be the threshold value. The number reads at the first quantile was 15,451 with 34,268 being the median. As the number of reads from all the 10 negative control samples were extremely low (relative to the threshold of 15451 taken for the specific week samples), this threshold value of 15451 resulted in majority of the reads being discarded. Only ~18% of the reads passed the threshold value of 15451 for PBS control. Principal coordinates analysis (PCoA) was undertaken thereafter using weighted UniFrac distances []. In order to better visualize PCoA’s binning, we calculated the “sum of squares” distance measures from raw abundance measures, before employing the principal component analysis and k-means clustering. The number of clusters was determined by employing within-group sum of squares (WSS). The cluster number (K) was chosen by first plotting the number of clusters vs. the WSS and then visually looking for the break point (“elbow”) in the plot. A value of k = 6 was chosen for further analysis. The R method clara (package: cluster) using “Euclidean” distances were used to define the clusters for the two-dimensional PCA plot. The OTU table generated by QIIME was further used for statistical analysis using in-house R (https://cran.r-project.org/) scripts. Diversity was evaluated using Simpson’s Diversity Index [] and UniFrac distances (measure of beta diversity []). Both “inverse” (1/λ) and “complement” (1-λ) SDI were calculated. Higher SDI values depict greater microbial diversity. Microbial abundance levels were log10-transformed, and then hierarchical clustering using the “Ward.D2” [] method and maximum linkage were used to generate the heatmap (Fig. ). The R function prop.test() based on Pearson’s chi-squared test was used for equal proportions. The success and failure values were calculated from the percentage abundance values for each genus. For time series analysis, average of all the samples per week was taken. A time series was created for each of the 138 genera (week 1 to week 8). Genera were selected for further analysis by visually inspecting their time series profiles as well as lag (=1) plots. ACF (autocorrelation function) and PACF (partial correlation function) plots were generated for each of the genera to inspect whether the time series was stationary. Whenever required, differencing (for removing trend) and logs (in case of unequal variance) were taken in order to stationarize the time series. Augmented Dickey-Fuller test for stationarity was also undertaken for all the time series. As we followed the simplest autoregressive model or order 1, it was feasible to undertake linear regression analysis for the time series. The p value was generated from the regression model and represents the probability that the coefficient estimate is significantly different from 0.RPMM, a model-based hierarchical clustering methodology that has been previously employed to analyze high-dimensional microbial abundance datasets [], was used to cluster the log10-transformed abundances. The OTU network files generated by QIIME were input into Cytoscape (http://www.cytoscape.org/) and Gephi (https://gephi.org/). Modularity analysis, betweenness centrality, and degree indices were used to format and color the layout of the network. […]

Pipeline specifications

Software tools QIIME, UniFrac, Gephi
Databases Greengenes
Application 16S rRNA-seq analysis
Organisms Mus musculus