Computational protocol: Reconstructing the Genomic Content of Microbiome Taxa through Shotgun Metagenomic Deconvolution

Protocol publication

[…] The simple model presented above allowed us to explore the metagenomic deconvolution framework in ideal settings, where reads are assumed to be error free and to map unambiguously to genes. We next set out to examine the application of our framework to synthetic metagenomic samples that incorporate both next-generation sequencing error and a typical metagenomic functional annotation pipeline. To this end, we simulated metagenomic sampling of microbial communities composed of three reference genomes (). We specifically focused on strains that represent the most abundant phyla in the human gut, as determined by the MetaHIT project, and for which full genome sequences were available. Furthermore, these strains represented different levels of coverage by the KEGG database (which we used for annotation), ranging from a strain for which another strain of the same species exists in the database to a strain with no member of the same genus in the database ().

Ten communities with random relative strain abundances were simulated. The relative abundances in each community were assumed to be known through targeted 16S sequencing. For the analysis below, relative abundances spanned a thousand-fold range, but using markedly different relative abundance ratios had little effect on the results (see Supporting ). Shotgun metagenomic sequencing was simulated using MetaSim, with 1M 80-base reads for each sample and an Illumina sequencing error model (). The abundances of genes in each metagenomic sample were then determined using an annotation pipeline modeled after the HMP protocol, with reads annotated through a translated BLAST search against the KEGG database. To assess the accuracy of this annotation process and its potential impact on downstream deconvolution analysis, we first compared the obtained annotations to the actual genes from which the reads were derived. Overall, obtained annotation counts were strongly correlated with expected counts (r = 0.83, P < 10^-324; Pearson correlation test; ). Of the reads that were annotated with a KO, 82% were annotated correctly. Notably, however, only 62% of the reads originating from genes associated with KOs were correctly identified, and consequently the read count for most KOs was attenuated. Highly conserved genes, such as the 16S rRNA gene, were easily recognized and had relatively accurate read counts (). Full details of this synthetic community model and of the sequencing simulations are provided in .

We deconvolved each KO using the obtained abundances to predict the length of each KO in each genome. We found that the predicted lengths were strongly correlated with the actual lengths (rho = 0.84, P < 10^-324; Pearson correlation test), although for most KOs predicted lengths were shorter than expected (). This under-prediction of KO lengths can be attributed to the normalization process: as noted above, the detected abundances of the conserved genes used for normalization tended to be less attenuated by the annotation pipeline than the abundances of other genes, which were therefore computed to be shorter than they actually are. Notably, some KOs that are in fact entirely absent from the genomes under study were erroneously detected by the annotation pipeline and consequently predicted to have non-negligible lengths in the reconstructed genomes (). To discriminate the error stemming from the annotation pipeline from error stemming directly from the deconvolution process, we reanalyzed the data assuming that each read was correctly annotated.
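To make the deconvolution step concrete, the following minimal Python sketch illustrates the per-KO regression under the linear model described above, in which the (normalized) KO abundance observed in a sample is the taxon-abundance-weighted sum of the KO's per-genome lengths. The sketch is illustrative only: the community size, the noise model, and the use of non-negative least squares are assumptions for demonstration, not the exact published pipeline.

# Illustrative per-KO deconvolution under the linear model described above.
# All names and parameter choices are hypothetical.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

n_samples, n_taxa = 10, 3                            # ten communities, three strains
A = rng.dirichlet(np.ones(n_taxa), size=n_samples)   # known (16S-derived) relative abundances

true_lengths = np.array([1200.0, 0.0, 900.0])        # actual KO length in each strain (bp)
noise = rng.normal(0.0, 10.0, size=n_samples)        # stand-in for sequencing/annotation error
y = A @ true_lengths + noise                         # observed, normalized KO abundance per sample

# Solve y ~= A x subject to x >= 0: one small, independent regression per KO.
x_hat, _ = nnls(A, y)
print("predicted per-strain KO lengths:", np.round(x_hat, 1))

Because each KO is solved independently, the same few-line regression is simply repeated for every KO in the annotation table.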
We found that with the correct annotations, predicted KO lengths accurately reflected the actual length of each KO in each genome (rho = 0.997, P < 10^-324; Pearson correlation test; ).

Importantly, while the error introduced by the annotation pipeline significantly affects the accuracy of predicted KO lengths, the presence (or absence) of each KO in each genome can still be successfully predicted by the threshold approach described above (). Specifically, using a threshold of 0.1 of the average length of each KO, metagenomic deconvolution reached an accuracy of 89% (correctly predicting both KO presence and absence) and a recall of 98% across the various genomes. The actual and predicted genomic content of each strain further demonstrate that the method can accurately predict the presence of the same KO in multiple strains, highlighting the difference between the metagenomic deconvolution framework and existing binning methods (see also ). We compared these predictions to a naïve ‘convoluted’ prediction (see ), confirming that deconvolution-based predictions were significantly more accurate than such a convoluted null model regardless of the threshold used (P < 10^-324, bootstrap; ). For example, using a threshold of 0.1 as above, convoluted genomes were only 54% accurate. Considering the determinants of prediction accuracy described above, we further confirmed that prediction accuracy markedly increased for highly variable and taxa-specific genes (Supporting ).

Given the noisy annotation process, we again set out to quantify the contribution of annotation inaccuracies to erroneous presence/absence predictions in the reconstructed genomes. As demonstrated in , most KO prediction errors were false positives: KOs wrongly predicted to be present in a strain from which they were in fact absent. Examining such KOs and the annotation of reads in each genome, we found that 99% of the false-positive KOs were associated with mis-annotated reads, suggesting that deconvolution inaccuracies in these settings could be attributed almost entirely to erroneous annotation rather than to the deconvolution process itself. We again confirmed that when correct annotations are assumed, both accuracy and recall increase to more than 99%.

The analysis above was used to evaluate the impact of sequencing and annotation error on the metagenomic deconvolution framework using simulated metagenomic datasets generated from simple 3-strain communities. In Supporting , we further present a similar analysis, using simulated metagenomic samples generated from 20-strain communities and based on the HMP Mock Communities. We show that our framework obtains similar reconstruction accuracies for these more complex communities ().

[...] Finally, we considered human-associated metagenomic samples to demonstrate the application of the metagenomic deconvolution framework to real metagenomic data from highly complex microbial communities. These datasets also provide an opportunity to evaluate the genome reconstructions obtained by our framework, owing to the high coverage of the human microbiome by sequenced reference genomes that can be used for evaluation. The Human Microbiome Project has recently released a collection of targeted 16S and shotgun metagenomic samples from 242 individuals, taken from 18 different body sites, in an effort to comprehensively characterize the healthy human microbiome.
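Before turning to the human-associated samples in detail, the threshold-based presence/absence calls and the accuracy/recall evaluation described above can be summarized in a brief sketch. The 0.1 threshold follows the text; the function names and the toy inputs are hypothetical.

# Sketch of the presence/absence call (threshold of 0.1 of the average KO
# length) and its evaluation; inputs are illustrative.
import numpy as np

def call_presence(pred_lengths, avg_ko_length, threshold=0.1):
    """pred_lengths: (n_kos, n_taxa) deconvolved lengths; avg_ko_length: (n_kos,)."""
    return pred_lengths > threshold * avg_ko_length[:, None]

def accuracy_recall(pred, actual):
    """pred, actual: boolean (n_kos, n_taxa) presence matrices."""
    accuracy = np.mean(pred == actual)               # correct presence AND absence calls
    recall = np.sum(pred & actual) / np.sum(actual)  # fraction of truly present KOs recovered
    return accuracy, recall

pred = call_presence(np.array([[1100.0, 40.0], [5.0, 950.0]]), np.array([1000.0, 900.0]))
actual = np.array([[True, False], [False, True]])
print(accuracy_recall(pred, actual))                 # (1.0, 1.0) on this toy input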
These human-associated microbial communities are diverse, with several hundred to several thousand 16S-based OTUs (operational taxonomic units, clustered at 97% similarity) per sample and a total of more than 45,000 unique OTUs across all HMP samples. These OTUs represent bacteria and archaea from across the tree of life, including many novel taxa, and their diversity is in agreement with shotgun metagenomics-based measures. Clearly, the high number of unique OTUs in each sample does not permit deconvolution and genome reconstruction at the OTU level. Moreover, these OTUs do not represent individual species, but rather distinct sequences accurate only to a genus-level phylogenetic classification. Examining the phylogenetic distribution of the taxa comprising the microbiome suggests that certain body sites, such as the tongue dorsum, are dominated by relatively few genera. This allows us to apply metagenomic deconvolution at the genus level, predicting the most likely genomic content of the various genera found in the microbiome. A reconstructed genus-level genome can be viewed as the average genomic content across all strains of that genus present in the community, providing insight into the capacities of the various genera. Moreover, while many species inhabiting the human microbiome have not yet been characterized or sequenced, most human-associated genera include at least a few fully sequenced genomes, allowing us to assess the success of our framework and the accuracy with which reconstructed genera capture known genus-level properties. Notably, however, microbial communities from other environments or from other mammalian hosts often harbor many uncharacterized taxa, even at taxonomic levels above the genus, making genus-level deconvolution a biologically relevant goal in its own right.

We accordingly applied our deconvolution framework to HMP tongue dorsum metagenomic samples (). OTU abundances and taxonomic classification were obtained from the HMP QIIME 16S pipeline. KO abundances were obtained from the HMP HUMAnN shotgun pipeline. In total, 97 tongue dorsum samples had both OTU and KO data available. OTUs were pooled to calculate the relative abundance of each genus in each sample. After pooling, we identified 14 genera that dominated the tongue dorsum. We deconvolved these samples to obtain reconstructed genera and computed KO presence/absence in each reconstructed genus using a threshold of 0.25 copies.

To evaluate our predictions, we calculated the similarity between the 14 reconstructed genera and every sequenced genome from these genera (). We found that 12 of the 14 reconstructed genera were most similar to genomes from the correct genus (). Interestingly, Capnocytophaga, one of the two reconstructed genera that did not most closely resemble genomes from its own genus, was the least abundant genus and appeared to be most similar to genomes from the Fusobacterium genus, with which it significantly co-occurs in the tongue dorsum. This potentially reflects the sensitivity of deconvolution to highly correlated taxonomic abundances (see ). Furthermore, the observed similarities between each reconstructed genus and sequenced genomes from other genera () largely reflect inter-genus similarities between the genomes of the various genera ().
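As an illustration of the genus-level pooling and evaluation steps, the sketch below pools OTU relative abundances to the genus level and scores a reconstructed genus against a reference genome. The table layout and the use of a Jaccard similarity over KO presence vectors are simplifying assumptions, not the exact HMP data format or the similarity measure used in the analysis.

# Pool OTU abundances to the genus level and compare KO presence vectors.
# Data structures are hypothetical stand-ins for the HMP tables.
import numpy as np
import pandas as pd

def pool_to_genus(otu_table: pd.DataFrame, otu_to_genus: dict) -> pd.DataFrame:
    """otu_table: samples x OTUs relative abundances; otu_to_genus: OTU id -> genus."""
    pooled = otu_table.T.groupby(otu_table.columns.map(otu_to_genus)).sum().T
    return pooled.div(pooled.sum(axis=1), axis=0)   # renormalize within each sample

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two boolean KO presence vectors."""
    return np.sum(a & b) / np.sum(a | b)

otus = pd.DataFrame([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]],
                    index=["sample1", "sample2"], columns=["otu1", "otu2", "otu3"])
print(pool_to_genus(otus, {"otu1": "Prevotella", "otu2": "Prevotella", "otu3": "Veillonella"}))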
For example, although the reconstructed Prevotella is most similar to sequenced genomes from the Prevotella genus, it also exhibits high similarity to genomes from Porphyromonas and Capnocytophaga, two other genera from the Bacteroidetes phylum with relatively similar genomic content. These findings suggest that our deconvolution framework was able to accurately capture the similarities and differences between the various genera based solely on variation in KO and OTU abundances across samples.

To further study the capacity of genus-level deconvolution to reconstruct and characterize the various genera in the microbiome, we next focused on the set of genes that best distinguish each genus from the others. Clearly, even within a genus, the set of genes present in a genome varies greatly from species to species and from strain to strain. Yet, for each genus, a small number of genes can be found that are present in almost every genome from that genus and absent from most other genomes. These genus-specific genes best typify the genus, potentially encoding unique genus-specific capacities. Moreover, since such genes are consistently present or consistently absent within each genus, their genus-level deconvolution is not complicated by the genus-level pooling of genomes. We defined genus-specific KOs as those present in at least 80% of the genomes from a given genus and in less than 20% of all other HMP reference genomes. We found in total 99 such KOs across 4 genera. Examining the reconstructed genera, we found that our framework successfully predicted the presence or absence of these genus-specific KOs (90% accuracy and 82% recall; ). Increasing the stringency of our definition and focusing on the 63 KOs that appeared in at least 90% of the genomes from a given genus and in less than 10% of all other genomes further increased the accuracy (92%) and recall (94%) of our reconstructed genera. Predictions obtained using alternative regression methods were similarly accurate (see Supporting ; , ).

[...] The metagenomic deconvolution framework introduced in this manuscript is a technique for associating genomic elements found in shotgun metagenomic samples with their taxa of origin and for reconstructing the genomic content of the various taxa comprising the community. Many different approaches have been developed to create such groupings of metagenomic features. Broadly, these methods fall into one of two categories, “binning” or “deconvolution”, depending on whether each genomic element is assigned to a single group or can be distributed among several. As demonstrated in Supporting (and see also ), the differences between the metagenomic deconvolution framework and these existing methods originate primarily from the different mathematical frameworks they employ.

Binning methods, such as metagenomic linkage analysis, metagenomic clustering analysis, and MetaBin, are designed to cluster genomic elements that can exist in only one taxon (or group). Specifically, metagenomic linkage analysis clusters genes into groups based on their abundances and phylogeny across sets of metagenomic samples using the CHAMELEON algorithm. Similarly, metagenomic clustering analysis clusters genes into groups based on their abundances across sets of metagenomic samples using the Markov clustering algorithm. MetaBin, on the other hand, clusters individual reads based on their sequence similarities and abundances across sets of metagenomic samples using k-medoids clustering.
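The limitation discussed next can be seen in a minimal sketch: a hard-clustering method must assign each gene's abundance profile to exactly one cluster, so a gene shared by two taxa cannot be split between them. Here k-means is used as a simple stand-in for the cited clustering algorithms, and the community sizes are illustrative.

# Toy demonstration: hard clustering of gene abundance profiles assigns each
# gene to a single group, even when the gene is present in multiple taxa.
# k-means is a stand-in for the cited clustering algorithms.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
abund = rng.dirichlet(np.ones(3), size=8)          # three taxa across eight samples

profiles = np.vstack([
    abund[:, 0],                  # gene unique to taxon 0
    abund[:, 1],                  # gene unique to taxon 1
    abund[:, 0] + abund[:, 1],    # gene shared by taxa 0 and 1
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
print(labels)   # the shared gene is forced into one of the two clusters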
As these methods all cluster genomic elements into distinct groups, they cannot correctly distribute elements that exist in multiple taxa (or groups), making them less appropriate for addressing questions of core vs. shared genome content (and see, for example, refs , , ). As we demonstrate in Supporting , these methods accordingly could not reconstruct the genomic content of the three strains present in the simulated metagenomic samples that incorporate sequencing and annotation error, in terms of the gene orthology groups identified in the samples.

In contrast, deconvolution methods, such as non-negative matrix factorization (NMF) and the proposed metagenomic deconvolution framework, are designed to assign genomic elements to multiple taxa. Specifically, NMF is a data discovery and compressed sensing tool designed to create a set number of groupings of elements that best fits the observed samples by factoring the feature matrix (here, the genomic elements found across a set of metagenomic samples) into two matrices: one representing the abundance of each group in each sample, and the other representing the distribution of genomic elements among the groups. The optimal number of groups can be determined from the fit of the matrix factorization to the original matrix, or from the stability of the solutions for a given number of groups. Importantly, while NMF utilizes a mathematically similar approach to the metagenomic deconvolution framework, and can thus theoretically obtain comparable accuracies (see also Supporting ), the two represent fundamentally different techniques. First, the groups identified by NMF are unlabeled, while those used by the metagenomic deconvolution framework by definition have a distinct taxonomic identity. Furthermore, the optimal number of groups detected in a set of samples by NMF does not necessarily correspond to any phylogenetic grouping present in the samples. Indeed, NMF did not group the gene orthology groups present in the simulated metagenomic samples incorporating sequencing and annotation error into strain-specific groupings (Supporting ). Second, in the metagenomic deconvolution framework, the calculated quantities of genomic elements in each group have a direct physical interpretation (i.e., gene length or copy number), while NMF calculates coefficients without a clearly interpretable meaning. Lastly, NMF operates on the entire set of genomic elements present in a set of samples (the feature matrix) as a whole, whereas the metagenomic deconvolution framework solves for the distribution of each genomic element among the various groups independently. This separability allows a custom regression technique to be used for each genomic element (for example, a regularized regression such as the lasso can be used for genomic elements that are sparsely distributed) and makes it possible to target only those genomic elements of interest. […]
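This contrast can be made concrete in a short sketch: NMF factors the full sample-by-feature matrix jointly into a set number of unlabeled components, whereas deconvolution fits each genomic element independently against the known taxon abundances, which is what allows a per-element estimator such as the lasso to be substituted. The shapes, parameter values, and estimator choices below are illustrative assumptions, not the published configuration.

# NMF (joint, unlabeled factorization) vs. per-element deconvolution against
# known taxon abundances; all settings here are illustrative.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
A = rng.dirichlet(np.ones(3), size=20)                    # known taxon abundances (samples x taxa)
X = np.clip(rng.normal(800, 300, size=(3, 50)), 0, None)  # true lengths/copy numbers (taxa x KOs)
Y = A @ X                                                 # observed feature matrix (samples x KOs)

# NMF: the number of unlabeled components is a user choice; Y is factored as a whole.
model = NMF(n_components=3, init="nndsvda", max_iter=500)
W = model.fit_transform(Y)          # per-sample component abundances
H = model.components_               # distribution of elements among components

# Deconvolution: one labeled, independent regression per KO (lasso shown here,
# a choice suited to sparsely distributed elements).
X_hat = np.column_stack(
    [Lasso(alpha=0.1, positive=True).fit(A, Y[:, k]).coef_ for k in range(Y.shape[1])]
)
print(X_hat.shape)   # (3 taxa, 50 KOs), directly comparable to X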

Pipeline specifications

Software tools MetaSim, TBLASTN, QIIME, HUMAnN, MetaBin
Databases KEGG, HMP
Applications Phylogenetics, Metagenomic sequencing analysis, 16S rRNA-seq analysis, Amino acid sequence alignment
Organisms Homo sapiens
Chemicals Nucleotides