Computational protocol: A Lachnospiraceae-dominated bacterial signature in the fecal microbiota of HIV-infected individuals from Colombia, South America

Protocol publication

[…] 16S rRNA gene compositional analysis provides a summary of the composition and structure of the bacterial component of the microbiome. Genomic bacterial DNA extraction methods were optimized to maximize the yield of bacterial DNA while keeping background amplification to a minimum. 16S rRNA gene sequencing methods were adapted from the methods developed for the NIH Human Microbiome Project. Briefly, bacterial genomic DNA was extracted using the MO BIO PowerSoil DNA Isolation Kit (MO BIO Laboratories). The 16S rDNA V4 region was amplified by PCR and sequenced on the MiSeq platform (Illumina) using the 2 × 250 bp paired-end protocol, yielding paired-end reads that overlap almost completely. The primers used for amplification contain adapters for MiSeq sequencing and dual-index barcodes so that the PCR products may be pooled and sequenced directly, targeting at least 10,000 reads per sample.

Our standard pipeline for processing and analyzing the 16S rRNA gene data incorporated phylogenetic and alignment-based approaches to maximize data resolution. The read pairs were demultiplexed based on the unique molecular barcodes, and reads were merged using USEARCH v7.0.1001, allowing zero mismatches and a minimum overlap of 50 bases. Merged reads were trimmed at the first base with a Q5 quality score. In addition, a quality filter was applied to the resulting merged reads, and reads with more than 0.05 expected errors were discarded.

We used an in-house pipeline for 16S analysis as well as several other tools, including custom analytic packages developed at the CMMR. These provide summary statistics and quality control measurements for each sequencing run, as well as multi-run reports and data-merging capabilities for validating built-in controls and characterizing microbial communities across large numbers of samples or sample groups.

16S rRNA gene sequences were assigned to Operational Taxonomic Units (OTUs) or phylotypes at a similarity cutoff value of 97% using the UPARSE algorithm.
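The trimming and expected-error filter described above can be illustrated with a minimal Python sketch. It is not the USEARCH implementation: the function names are hypothetical, quality scores are assumed to be already decoded to integer Phred values, and "first base with Q5" is read here as the first base with Q ≤ 5, which is an assumption. The expected number of errors is the sum of the per-base error probabilities, with p = 10^(-Q/10).

```python
from typing import List, Optional

MAX_EXPECTED_ERRORS = 0.05  # threshold stated in the protocol
TRIM_QUALITY = 5            # trim at the first base with Q <= 5 (assumed reading of "Q5")


def expected_errors(quals: List[int]) -> float:
    """Sum of per-base error probabilities, p = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in quals)


def trim_and_filter(seq: str, quals: List[int]) -> Optional[str]:
    """Trim a merged read at the first low-quality base, then apply the
    expected-error filter. Returns the trimmed read, or None if discarded."""
    # Position of the first base at or below the trimming threshold;
    # if there is none, the whole read is kept.
    cut = next((i for i, q in enumerate(quals) if q <= TRIM_QUALITY), len(seq))
    seq, quals = seq[:cut], quals[:cut]
    # Discard empty reads and reads above the expected-error threshold.
    if not seq or expected_errors(quals) > MAX_EXPECTED_ERRORS:
        return None
    return seq


if __name__ == "__main__":
    print(trim_and_filter("ACGTACGT", [40, 40, 40, 40, 38, 38, 5, 30]))
```

Because each Q20 base contributes 0.01 expected errors, a threshold of 0.05 tolerates only a handful of such bases per read, which is why this filter is considered stringent.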
OTUs were then mapped to an optimized version (v.111) of the SILVA Database, containing only the 16S V4 region, to determine taxonomies. Abundances were recovered by mapping the demultiplexed reads to the UPARSE OTUs. A custom script constructed an OTU table from the output files generated in the previous two steps, which was then used to calculate α-diversity and β-diversity and to provide taxonomic summaries that were leveraged for all subsequent analyses discussed below. [...]

In order to construct DOC curves, the dissimilarity and overlap measures must first be defined. Here, we used the measures proposed by Bashan et al. The measures were then plotted as points in a scatterplot. Subsequently, the DOC area was calculated using the LOWESS method. The cutoff point, $O_c$, from which the slope of the curve is always negative was then calculated (i.e., $\partial D / \partial O < 0$ for $O > O_c$). The fraction $f_{ns}$ is calculated as follows:

$$f_{ns} = \frac{\text{number of sample pairs with } O > O_c}{\text{total number of sample pairs}}$$

To calculate the p-values related to the DOC plot slope, we carried out the procedure described by Bashan et al. We retained only data points showing an overlap value greater than the median for both the control and disease groups. A bootstrap procedure repeating the aforementioned steps was performed, and the slopes of the DOC plots were calculated at every bootstrap realization. The p-values were finally calculated as the fraction of bootstrap realizations resulting in non-negative slopes. The number of bootstrap runs was named n_boot.

The compositional nature of the data restricts the data analysis to a simplex space. As such, we applied Aitchison's centered log-ratio (CLR) transformation to carry the data to a Euclidean space. In this way, each OTU is properly compared between the different samples. The logratio.transfo function of the mixOmics package was used for the CLR transformation.
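As an illustration of the CLR transformation mentioned above, the following is a minimal NumPy sketch. It is not the mixOmics implementation: the function name `clr` and the `pseudocount` parameter are hypothetical, and zeros are handled here with a simple pseudocount, which is an assumption made only so that the logarithm is defined. Each sample's log abundances are centered on their mean, which is equivalent to dividing by the geometric mean before taking logs.

```python
import numpy as np


def clr(counts: np.ndarray, pseudocount: float = 0.5) -> np.ndarray:
    """Centered log-ratio transform of a samples x OTUs matrix.

    The pseudocount is an illustrative stand-in for a proper zero
    replacement step; pass pseudocount=0.0 for strictly positive data.
    """
    x = np.asarray(counts, dtype=float) + pseudocount  # avoid log(0)
    log_x = np.log(x)
    # Subtracting the row mean of the logs divides each value by the
    # geometric mean of its sample.
    return log_x - log_x.mean(axis=1, keepdims=True)


if __name__ == "__main__":
    otu_table = np.array([[120, 30, 0, 50],
                          [10, 400, 25, 5]])
    print(clr(otu_table))
```

Each transformed row sums to zero, and multiplying a sample by a constant (for example, a different sequencing depth) leaves its CLR values unchanged when no pseudocount is added.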
Zeros were replaced with a Bayesian-multiplicative strategy using the cmultRepl function of the zCompositions package.

We performed a Jennrich test to calculate the p-value for the difference between the correlation structures of the HIV-infected patients and the control group.

SPIEC-EASI was employed using the Meinshausen-Bühlmann strategy for graph estimation of the network, which is a modified precision matrix built from β coefficients calculated from the average nearest neighbor. To reduce the number of false positives during the construction of networks, we eliminated from the analysis any OTU that was not present in at least 50% of the samples.

Sparse partial least squares discriminant analysis (sPLS-DA) and Random Forest were used for feature selection. sPLS-DA identifies the features (OTUs) that separate the HIV-infected group from the control group through a discriminant analysis based on partial least squares. Random Forest, on the other hand, is a machine learning technique that also identifies features (OTUs) by constructing a collection of decision trees with controlled variance. Variables capable of differentiating between the groups were verified by sPLS-DA. The most important bacteria were grouped in the first component. The Random Forest technique was also applied in order to determine the importance of bacteria in marking differences between the different patient groups. The performance of each bacterial feature was measured on CLR-normalized data using the mean decrease in Gini.

To analyze the differences in microbial communities between the distinct patient groups, we used metagenomeSeq. For this, data were normalized by cumulative sum scaling using the cumNorm function, and analysis was performed with the zero-inflated log-normal fitZigModel function. […]
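The idea behind cumulative sum scaling can be sketched in a few lines of NumPy. This is a simplified illustration and not the metagenomeSeq implementation: the function name `css_normalize`, the fixed `quantile` of 0.5, and the `scale` of 1000 are assumptions chosen for the example. For each sample, counts are divided by the sum of the nonzero counts that fall at or below a chosen quantile of that sample's nonzero counts, so that a few highly abundant OTUs do not dominate the scaling factor.

```python
import numpy as np


def css_normalize(counts: np.ndarray, quantile: float = 0.5,
                  scale: float = 1000.0) -> np.ndarray:
    """Cumulative sum scaling of an OTUs x samples count matrix.

    quantile and scale are illustrative defaults (assumptions), not
    values taken from the protocol.
    """
    counts = np.asarray(counts, dtype=float)
    normalized = np.zeros_like(counts)
    for j in range(counts.shape[1]):
        col = counts[:, j]
        nonzero = col[col > 0]
        if nonzero.size == 0:
            continue  # empty sample: leave the column as zeros
        # Quantile of the sample's nonzero counts.
        q = np.quantile(nonzero, quantile)
        # Scaling factor: sum of the counts at or below that quantile.
        factor = nonzero[nonzero <= q].sum()
        normalized[:, j] = col / factor * scale
    return normalized


if __name__ == "__main__":
    # Second sample has the same composition at twice the depth.
    table = np.array([[1, 2], [2, 4], [3, 6], [0, 0]])
    print(css_normalize(table))
```

In the toy table, the two samples differ only in sequencing depth, so their normalized columns coincide. In practice the quantile is usually chosen from the data rather than fixed in advance.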

Pipeline specifications

Software tools: USEARCH, UPARSE, mixOmics, SPIEC-EASI, metagenomeSeq
Databases: HMP
Application: 16S rRNA-seq analysis
Organisms: Human immunodeficiency virus 2, Lachnospiraceae, Bacteria, Homo sapiens, Escherichia coli