Computational protocol: Oceanographic structure drives the assembly processes of microbial eukaryotic communities

Similar protocols

Protocol publication

[…] Care was given to avoid generation of spurious operational taxonomic units (OTUs), leading to inflation in microbial biodiversity. To this end, we combined QIIME v.1.6 () and USEARCH v.6 (, ) sequence preprocessing pipelines. Sequences shorter than 150 nucleotides were removed and remaining sequences were ‘denoized' using the QIIME built-in ‘denoizer' program (). The resulting reads were then aligned using pyNAST () with the Silva 18S alignment () as a template; aligned sequences were manually inspected and 5′/3′-trimmed. Additional sequence quality controls were performed through the USEARCH quality-filtering pipeline (); putative chimeras were detected with UCHIME () using the Silva 18S database (release 108) as reference. Remaining sequences were clustered in OTUs using a 98% sequence identity cutoff with UCLUST () and singleton OTUs (i.e., cluster composed of a single sequence) were discarded from this study. OTU sequence representatives were selected as the most abundant sequence within the cluster. OTU representatives were taxonomically classified via the mothur Bayesian classifier () with a custom-curated 18S sequence database as the reference data set, which was an updated version of as in . Sequences from OTUs classified as multicellular organisms (Opisthonkonta or Streptophyta) were discarded from the study as we chose to focus on microbial diversity. [...] All ecological analyses based on phylogenetic information (UniFrac, phylogenetic diversity (PD) and net relatedness index (NRI)) were based on a phylogenetic tree reconstructed with an approximate maximum-likelihood approach as in FastTree v.2.1 () using the ‘accurate' mode (i.e., -slownni –mlacc 4; with pseudocounts). The phylogenetic reconstruction was based on an alignment of final OTUs, aligned with pyNAST and manually curated.UniFrac dW5000 (weighted) and dUW5000 (unweighted) distances () were computed on data sets subsampled at an even depth of 5000 sequences using generalized UniFrac procedure () implemented in the R package ‘GUniFrac' v.1. Clusters based on dW5000 were generated using hierarchical clustering. One thousand jackknife replicates (with 3750 sequences, i.e., 75% of the subsampled data sets) were used to assess clustering robustness.PD rarefactions analyses were also conducted on samples rarefied to 5000 sequences; PD values were measured following Faith's definition (i.e., total branch length of the OTU phylogenetic tree) and computed every 20 sequences using 100 subsampling iterations. PD values were computed using the R package ‘picante' v.1.6 (). Other α-diversity measures, Chao1 estimator and Shannon index, were computed with QIIME. NRI values were computed using ‘picante', through the commands ses.mpd (−1 × NRI) on abundance weighted data, and tips of the original phylogenic tree were shuffled to generate randomized phylogenetic relationships (null.model=‘taxa.labels'). We also used an alternative null model, based on the independent swap algorithm (null.model=‘independentswap' ; ), to compute NRI indexes. Results similar to those obtained with the null model based on taxa label shuffling were observed; for example weak SCM (wSCM) communities exhibited higher NRI indexes than SCM ones (analyses of variance (ANOVA) and Tukey's honestly significant difference (HSD) test; P=0.004).Phylogenetic placements of Picozoa-like OTUs were performed using the RAxML v.7.2 () evolutionary placement algorithm (). We first built the Picozoa reference tree with sequences and clade information from ) and . Picozoa reference sequences (‘long' sequences from clone libraries) were clustered at 99% using CD-HIT v.4.6 () to avoid redundant sequence information that could affect phylogenetic placements. Resulting sequence cluster representatives were aligned with MAFFT v.7 (in accurate mode with global homology and 1PAM scoring matrix; ). The resulting alignment was then manually inspected, curated and trimmed. The maximum-likelihood phylogenetic tree was reconstructed with RAxML using GTR+G model, which was selected by ModelGenerator v.0.85 () as the nucleotide substitution model best-fitting our data. Node support was computed from 100 bootstrap replicates. Reads representing Picozoa-like OTUs were aligned to our reference Picozoa alignment with MAFFT v.7 using 1PAM matrix and —addfragments mode (). These sequences were then placed on the Picozoa reference phylogenetic tree using RAxML evolutionary placement algorithm. [...] β-Diversity significance was assessed using UniFrac Monte Carlo significance test on dW5000 with 10 000 randomizations, as implemented in QIIME. All subsequent statistical analyses were computed in the R environment v.3 ( Mantel tests for assessing the correlation between dW5000 and depth or geographical distances were computed with the Pearson's correlation method and 10 000 permutations using the R package ‘vegan' v.2 (). To determine the contextual variables driving the microbial community compositions, we first followed a methodology similar to . We removed redundant environmental and biological variables () by using varclus (R package ‘Hmisc' v.3.12; ) with Spearman's correlation (ρ2 cutoff: 0.8; PO4 and salinity had ρ2>0.8 but were retained for subsequent analyses). Non-redundant variables best explaining community dissimilarity matrix (dW5000) were then identified using the BEST procedure (i.e., best subset of environmental variables with maximum rank correlation with community dissimilarities; ). The resulting nine variables best explaining the dW5000 matrix were fitted as vectors onto the nonmetric multidimensional scaling ordination (NMDS) ordination and used for permutational analyses of variance on distance matrix analysis, which were computed using ‘vegan' (envfit and adonis functions, respectively, both using 10 000 permutations). Overall significance of the model composed of the nine BEST-selected variables in explaining dW5000 was assessed using multiple regressions on matrices (‘ecodist' R package v.1.2; ; 10 000 permutations and Spearman's correlations). Differences in PD, NR and taxonomic composition mean among community clusters were tested using ANOVA and pairwise differences were evaluated with Tukey's HSD test. Statistical analyses of sample taxonomic enrichments were computed with Metastats () using 1000 bootstrap permutations and a 0.05 significance threshold. […]

Pipeline specifications