Computational protocol: Phylogenetic Analysis Suggests That Habitat Filtering Is Structuring Marine Bacterial Communities Across the Globe

Similar protocols

Protocol publication

[…] Fourteen thousand three hundred ninety complementary DNA sequences from 7,195 16S rRNA gene clones were included in the data set. This data set was quality-checked and treated by Pommier et al. [] which resulted in a set of 4,250 sequences that were used for further analysis. To reduce potential remaining errors from the previous analyses, we did an extensive reanalysed sequence data by using state-of-the-art methods for quality check, alignment, data base matching and phylogenetic analysis.We matched each individual sequence to the Greengenes database which consist of curated, chimera-checked, 16S rRNA gene sequences and aligned them with the NAST algorithm at []. After reviewing the alignments manually, the data were shown to contain some short-length sequences, and to avoid errors, such as short branch attractions in later phylogenetic analysis, we removed all sequences <100 bp. The threshold length of 100 bp was chosen in concordance with the data handling of Pommier et al. [], and hence, the results are comparative between the studies. In contrast to Pommier et al. [], who taxonomically assigned operational taxonomic units (OTUs) after clustering, here each sequence in the alignments was taxonomically assigned to the level of phylum with the classification tool of Greengenes. This classification method is based on finding near-neighbours in the Greengenes data base using the Simrank algorithm. Greengenes currently supports three different taxonomies: RDP, NCBI and Hugenholz. If two of these assigned the same taxonomic lineage at the phylum level, the sequence was kept for further analysis. After data base matching, alignment, classification and removal of short sequences, 3,617 sequences were considered high enough quality for further analysis. Note that this is a reduction of about 500 sequences compared to the ones used in Pommier et al. [] (see Table S for details). [...] For community composition and phylogenetic structure, one phylogenetic tree per phylum from all localities was constructed. Further, trees were also constructed for each of the localities, including all phyla. For analysis of phylogenetic distance between communities one tree was constructed containing the whole data set. For those purposes, the maximum likelihood method in the RAxML 7.0.4 software using default settings and GTRMIX-model [] was used. This model constructs a tree by approximation with optimization of individual per site substitution rates and classifications of those individual rates into 25 (default) rate categories. The topology found is then evaluated such that it yields stable likelihood values []. The phylogenetic trees were used as input to the analyses done on sequence level, e.g. calculations of phylogenetic difference between samples. Other analyses, e.g. Morisita’s index of similarity, require species and species abundance as input. We defined “species” or more appropriate, OTUs, by clustering the sequence-level phylogenetic trees with the RAMI software []. This clustering method uses nodes and branch lengths in the phylogenetic tree to construct cluster of sequences, defined by a RAMI threshold of patristic (branch length) distance. As we know little about the correct threshold defining an OTU and the potential effects of the threshold on phylogenetically based results, we did the clustering procedure with threshold values ranging from 0.01 to 0.08. The results were robust over the full range of threshold values; hence, we only report the 0.01 level analysis here. Note also that this approach is different from Pommier et al. [] who clustered sequences based on nucleotide similarity, not on patristic distances. For each of the clusters, a consensus sequence was created with the RAMI tool rami_consesus. The consensus sequences, belonging to a specific data subset, were aligned and new phylogenetic trees were created with the RAxML software as above. Each consensus sequence in the consensus tree defines an OTU, and the number of sequences within each cluster was treated as abundance in the OTU-level analyses. [...] As in Pommier et al. [], we compared localities by calculating the Morisita’s index of similarity of community composition. In contrast to Pommier et al. [], where SEQMANII (LASERGENE v.5) was used on sequence similarity, we used our RAMI clusters as in-data for the analysis. Further, we extended the analysis to involve extensive phylogenetic analysis of community similarity and structure. The Unifrac distance matrix tool [] was used to calculate the pairwise phylogenetic difference between all localities, using one large tree containing all sequences from all localities. Consequently, this analysis is based on sequence level data, not on RAMI cluster as for the Morisita’s index analysis above. This tool calculates a distance metric by comparing common nodes and branch lengths of the sample phylogenies []. Two identical communities will have all nodes and branch lengths in common, and the Unifrac distance metric will be zero (0%). In contrast, if two samples differ already in the first node of the phylogenetic tree (e.g. one sample is clustered in one part and the other sample in another part of the total tree), no common nodes or branches exist and the Unifrac metric will be 1 (100% difference). In contrast to the Morisita’s analysis, which compare common species and their abundances, the Unifrac approach takes the full phylogeny, hence the evolutionary history, into account when calculating the distance metrics. As a consequence, the Morisita’s and the Unifrac analyses do not necessarily have to give the same result. For consistency in results, we converted the Unifrac metrics to show similarity instead of distance by subtracting each Unifrac metric value from one (1 − (Unifrac value)). We calculated the significance of the phylogenetic distance between localities using the pairwise Unifrac significance test and P test significance tool. Both tests compare the sample phylogenetic tree to randomly constructed null model trees (1,000 random permutation of environment labels across the sequences). The reported significance value of phylogenetic distances is the fraction of null model trees that have a Unifrac metric value greater or equal to the sampled tree. The P test compares the number of parsimony changes required to explain the distribution of sequences between samples. The reported P value is the fraction of null model trees that require fewer parsimony changes to explain distributions than does the sample tree. Notably, the Unifrac significance test and Unifrac P test are tools primarily suited for analysis of two or a few environments []. Consequently, few of the P values reported in the outputs from these tests were significant after doing the Bonferroni correction of multiple comparisons. When showing the relative difference between localities in the full data (one phylogenetic tree containing all sequences from all libraries), we used the Unifrac jackknife environment clustering tool []. This tool performs hierarchical clustering analysis of the localities based on the pairwise Unifrac distance metric (pairwise comparisons of libraries) and outputs a dendrogram with coloured nodes denoting the fraction of the random samples that they were recovered in. [...] We used Phylocom [] to analyse clustering or overdispersion patterns in all localities. The net relatedness index (NRI) quantifies the structure of a sample phylogeny derived from the mean phylogenetic distance, consequently capturing the degree of clustering of the phylogeny from root to terminal leaves. The nearest taxa index (NTI) quantifies the terminal structure of the sample phylogeny, hence only captures the clustering of the terminal nodes in the tree. NRI and NTI are defined as:12where MPD = mean phylogenetic distance, MNTD = mean nearest phylogenetic taxon distance and sd = standard deviation. The subscript “sample” denotes values derived from the phylogenetic tree analysed and “rndsample” denotes values derived from 999 random phylogenies constructed according to a null model. The null model used here (# 2) constructs a random phylogenetic tree by assigning sequences/OTUs to the localities (localities found in the focal phylogeny) by random draws from the phylogeny pool (all OTUs available in the input data). This maintains the sequence richness of each sample, but the identities of the species occurring in each sample are randomised. Positive NRI/NTI values indicate a clustered phylogeny where coexisting taxa are more related to each other than expected by chance. A negative NRI/NTI value indicates an overdispersed phylogeny where coexisting taxa are less related to each other than would be expected by chance. We calculated the NRI and NTI, using the comstruct Phylocom tool at both community level (including all phyla) and the phylum level (the structure of phylum X in each locality, respectively). As we have little knowledge of the ecological significant scale for bacterial community, the Phylocom analysis was done on both single sequences and sequence clusters (OTUs) as terminal leaves in the phylogenetic trees. Finally, we used a two-tailed significance test for the NRI/NTI results, using the rank high and rank low values in the Phylocom output []. The rank values are the number of runs showing the null model to have lower or higher NRI/NTI values than the focal phylogeny. Consequently, positive NRI/NTI values associated with rank low >975 and rank high <25 and negative NRI/NTI associated with rank low <25 and rank high >975 are significant at P < 0.05. The analyses on phylum level aim to investigate both specific phylum/locality combinations and marine bacterial communities in general. Consequently, we Bonferroni-corrected the output from Phylocom by multiplying the P values with 1/n, where n is the number of unique phylum/locality combinations (n = 77). […]

Pipeline specifications