Computational protocol: Alterations of the human gut microbiome in multiple sclerosis

Similar protocols

Protocol publication

[…] Sequencing of the 105 human stool samples on the Roche 454 GS FLX+ instrument generated 1,426,326 total raw reads. Raw reads were processed using the mothur software package (v.1.34.2)14 (ref. ), which performs demultiplexing and denoising, quality filtering, alignment against the ARB Silva reference database of 16S rRNA gene sequences, and clustering into OTUs (at 97% identity; ). In total, 4317 OTUs were generated. After filtering OTUs that failed to have a mean number of reads per sample ≥1 in at least one cohort, 426 OTUs were available for further analysis.Sequencing of the samples on the Illumina MiSeq instrument generated 11,498,168 total forward raw reads. Preprocessing was performed using the mothur software package. The mothur MiSeq SOP was followed, except we used a custom Python script to perform base quality trimming (sliding window size of 50 nt with average quality score >35); this modification was employed to improve the quality of sequencing reads used for OTU clustering, since the standard mothur SOP assumes longer sequencing reads than generated by our MiSeq protocol. 10,620 OTUs were generated, with 1,191 OTUs available for further analysis after filtering using the same procedure as described for the Roche 454 data (). [...] A measure of alpha diversity, the Shannon entropy was calculated for the samples, and the nonparametric Wilcoxon rank-sum test was applied for hypothesis testing. To visualize differences in overall microbial community structure, the unweighted and weighted UniFrac measures were calculated between all pairs of samples, and Principal Coordinates Analysis plots were generated using custom R scripts. Analysis of molecular variance was used for statistical hypothesis testing of differences in overall microbial community structure between cohorts as assessed with the UniFrac measures.To statistically test for differences in the relative abundances of microbial taxa (phyla or genera) the DESeq2 software package was employed. To control for covariates (gender, age and BMI) of interest, we used the multi-factorial model in DESeq2. The DESeq2 multi-factorial model requires categorical values if >1 covariate is included, so we discretized age and BMI values. For BMI, we used standard World Health Organization (WHO) categories of normal, overweight, and obese. We reclassified one subject with a BMI of 18.1 as normal, as this is at the borderline of the WHO underweight classification (<18.5) and would have resulted in only one subject in the underweight category. For age, we divided the range of study subjects' ages (27 to 63 years) into four categories, with each category spanning a 9-year increment. Eight subjects were missing the clinical data. These subjects were excluded from analyses controlling for covariates.For all statistical testing for 16S rRNA data analysis, P values were adjusted for multiple hypothesis testing using the method of Benjamini and Hochberg.To more accurately identify the microorganisms present in samples and their phylogenetic relationships to known species, the pplacer software package was used to perform phylogenetic placement. Pplacer uses a likelihood-based methodology to place short sequencing reads of 16S rRNA amplicons on a reference tree, and also generates taxonomic classifications of the short sequencing reads using a least common ancestor-based algorithm. The reference tree required for phylogenetic placement was generated using full-length or near full-length (>1,200 nt) 16S rRNA sequences of type strains from the Ribosomal Database Project (RDP). 9,563 16S rRNA sequences were downloaded from RDP-11-1, representing 8,719 type strains with 375 sequences from Archaea and 9,188 sequences from Bacteria strains. Filtering was performed using custom Python scripts with the following criteria: (1) only one sequence per species (if the species had more than one sequence, then the longest sequence was chosen); (2) non-environmental species (using key words); (3) no sequences with unclear taxonomic lineage information; (4) no redundant sequences. 7,890 16S rRNA sequences were retained for constructing the phylogenetic tree. Multiple sequence alignment was performed using Muscle. Per the pplacer manual (http://matsen.github.io/pplacer/generated_rst/pplacer.html), the 16S rRNA reference tree was constructed using FastTree with parameter ‘-nt –gtr' (ref. ). FastTree infers phylogenetic trees using an approximate maximum likelihood-based approach, and generates local support values to estimate the reliability of each split in the tree by using the Shimodaira–Hasegawa test on the three alternate topologies (NNIs) around that split, counting the fraction of 1,000 resamples that support a split over the two potential NNIs around that node. Local support values of >0.95 are considered to strongly support splits, and those >0.70 are considered to moderately support splits. To generate the phylogenetic placement figures in the main manuscript (), representative sequences for each species within the genus of interest were chosen (the most abundant unique sequence assigned to that species by pplacer). […]

Pipeline specifications

Software tools mothur, UniFrac, DESeq2, Pplacer, MUSCLE, FastTree
Applications 16S rRNA-seq analysis, Nucleotide sequence alignment
Organisms Homo sapiens
Diseases Autoimmune Diseases, Multiple Sclerosis
Chemicals Methane