Computational protocol: Subspecies in the global human gut microbiome

Similar protocols

Protocol publication

[…] All samples were processed with the same computational protocol. Reads were quality‐filtered and screened against the human genome sequence for removing contamination as previously described (Zeller et al, ). Species abundance was calculated using established MOCAT (Kultima et al, ) protocols for specI clusters (Mende et al, ). Throughout the manuscript, we used specI clusters at the species level related via the NCBI taxonomy database as a taxonomic reference. Additionally, mOTU abundances were also determined using standard MOCAT procedures (Sunagawa et al, ), but exclusively used to estimate species diversity ().For calling genomic variants, all metagenomic sequencing reads were additionally mapped to a reference set consisting of 1,753 genomes (each representative of one specI cluster) (Mende et al, ), using MOCAT (Kultima et al, ) with default parameters. Specifically, reads were mapped at 97% identity and multiple mappers were discarded. Computation of genome coverage for each specI cluster was performed using qaCompute (https://github.com/CosteaPaul/qaTools), resulting in estimations of both horizontal and vertical coverage per sample, per genome. [...] Population SNPs were called using metaSNV (Costea et al, ), which resulted in 19,221,237 positions over 1,753 genomes. [...] To survey genomic (dis)similarity between conspecific strains carried by each individual, we used a modified Manhattan distance to assign a divergence of 0 to a pair of samples with an identical variation profile and 1 to completely different samples: ∑i=1n|S1i−S2i|/n, where S1i and S2i are the frequencies of SNV “i” in one and the other sample, respectively, and “n” is the total number of compared positions. Positions only covered in one of the samples were not considered in this distance computation, and at least 1,000 positions are required to compute a valid distance (i.e., n > 1,000). We used partitioning around medoids to determine clustering for any given number of clusters k, between 2 and 10. In this clustering step, we only kept one time‐point sample for individuals which have been sampled more than once. Using the prediction strength (PS) (Tibshirani & Walther, ), we determined the support of the clustering for each k. The highest number of clusters that have a PS above 0.8 was considered to be the number of subspecies, as recommended by Tibshirani and Walther for determining high‐quality clusters (Tibshirani & Walther, ). If there were no values above, we conservatively assumed this to be insufficient evidence for the existence of distinct subspecies. [...] Phylogenetic trees (Fig A and B) were based on concatenated sequence alignments of 40 previously described universal marker genes (Creevey et al, ). They were constructed using the “sptree_raxml_all” workflow as implemented in ETE3 v3.0.0b36 (Huerta‐Cepas et al, ). In brief, multiple sequence alignments were built for the 40 marker proteins using Clustal Omega v1.2.1 (Sievers et al, ), translated into codon alignments, concatenated, and used to infer a phylogeny by RAxML v8.1.20 (Stamatakis, ). […]

Pipeline specifications

Software tools MOCAT, mOTU, qaTools, metaSNV, ConStrains, RAxML, Clustal Omega
Databases NCBI Taxonomy Database
Organisms Homo sapiens, Bacteria, Firmicutes, [Eubacterium] rectale
Chemicals Nucleotides