Computational protocol: Taxonomic Characterization of Honey Bee (Apis mellifera) Pollen Foraging Based on Non-Overlapping Paired-End Sequencing of Nuclear Ribosomal Loci

Similar protocols

Protocol publication

[…] Raw reads were imported into CLC Genomics Workbench v. 7 (Qiagen) for adapter screening and quality trimming. We imposed a maximum error probability of 0.05 for individual bases and a total score of 10 (+1/-2 for matches/mismatches and indels) to trim the forward and reverse fusion primers. Degenerate positions in the primers were accommodated by providing multiple explicit sequence variants as search models. The required minimum read length was 150 bp and at most three ambiguous positions were allowed. Only intact pairs were retained after trimming, and any that overlapped based on the CLC Genomics ‘merge reads’ function were discarded as improper fragments (minimum merge score of 25 with a +1/-2/-3 scoring scheme for matches, mismatches, and indels, respectively). Quality-checked reads were converted to FASTA format for further use.To evaluate the ability of our approach to identify taxa without using prior knowledge of plant biogeography, phenology, or bee foraging behavior, we used de novo operational taxonomic unit (OTU) identification followed by taxonomic assignment with BLAST and the NCBI Taxonomy resource. This strategy was chosen because while considerable prior knowledge exists it is incomplete and not all plausible species are represented by complete ribosomal ITS sequences in GenBank (see below). Nonetheless, we do evaluate our results for consistency with distribution and phenology data from North Dakota floras [–], as well as consistency of samples by date and site using similarity scores.Selection of de novo OTUs involves clustering sequence reads into sufficiently similar sets such that they can all be represented by a single member, which is then used as input for downstream processes under the assumption that important taxonomic information is not lost by this grouping. Rather than performing a computationally expensive clustering of all reads together, we used a divide-and-conquer strategy to reduce the total number of pairwise alignments made, whereby reads were binned based on an initial mapping to an ITS reference database drawn from GenBank accessions. The custom database included all entries from the GenBank nucleotide database assigned to the “Angiosperm” taxonomic node that contained “transcribed spacer” or “ITS” in the header, further clustered at 95%. Read pairs were mapped to this database using bowtie2 [] with the “very-sensitive” and “end-to-end” quick switches and a maximum insert size set to 1100 bp. The resulting alignment file was parsed to seed initial clusters and was not used for any subsequent analysis. At this stage, unmapped reads were not used in OTU generation but were recaptured in a later step, see below.Because clustering programs typically include a length sorting step and select the longest representative sequence in a cluster, retaining short sequences in each bin is an unnecessary computational cost assuming that all sampled taxa will be represented by at least one high-quality read pair. We therefore only included read pairs in the clustering step that had a combined length ≥ 520 bp after trimming. To format read pairs for clustering, we created scaffolds by reverse-complementing the second read and joining it with the first, with a linking sequence of 30 “N”. (This length of inserted sequence prevented single reads from erroneously mapping across the scaffold gap). We then ran CD-HIT-est [] on each scaffold pool sequentially with a wrapper script using a 97% identity threshold and otherwise default parameters. While a 97% threshold is used throughout the workflow (see below), this value was chosen operationally and is not an assumption of species-level divergence for these ITS sequences. As evolutionary rates vary, no single value is likely to recapitulate taxonomy, and indeed in our results most species are represented by multiple OTUs but some closely related species cannot be distinguished at this threshold (see below).A second mapping was performed with bowtie2 of all passed read pairs against the selected cluster representatives. The resulting alignment file was parsed to identify read pairs not mapping to an existing cluster representative at the 97% level; these were re-captured and clustered as a single pool of scaffolds using the CD-HIT-est parameters described above. A third mapping was performed in order to generate a consensus sequence for the final set of OTU clusters, using SAMtools [] and only those read pairs that mapped at 97% identity with at most five alignment positions represented by indels. Final OTUs were checked for chimeras with UCHIME [] using the GenBank-sourced ITS database described above as a reference. The scoring parameters were beta = 5, minimum score = 1, and minimum divergence ratio = 1. The choice of a threshold score for chimera removal depends on the properties of the data and reference set, with high-confidence chimeras appearing as strong outliers in the score distribution. Inspection of the data led us to select a relatively high threshold score of 30 for removing chimeric scaffolds, as lower scoring candidates were often proposed chimeras of closely related taxa or were ambiguous when analyzed with BLAST. However, our paired-read assignment criteria (see below and ) constituted a much stronger filter of potential chimeric sequences, as read pairs with conflicting taxonomic matches are deferred to the next most inclusive taxonomic level or left unassigned (see below).We observed that G homopolymers were sometimes present at the 3’ end of scaffolds, which we presume to be sequencing artifacts. To prevent these regions from seeding alignments, scaffolds were screened for low-complexity sequence using mdust [] with default scoring parameters.We used the metagenomics package Megan v. 5.10.2 [] extensively for preliminary taxonomic assignments and to analyze read counts, however we did not use the paired-end mode of Megan for final OTU assignment. This was because only a small fraction of Megan assignments actually used the paired BLAST scores from both mates, for undetermined reasons. Instead, we implemented the lowest common ancestor (LCA) approach directly using BLASTN and the NCBI taxonomy resource. To apply this approach to paired reads, we split OTU scaffolds into their component reads and searched the nt database with BLASTN (using default settings for the ‘discontinuous megablast’ option). We then used perl scripts to parse the top 25 matched accessions from the BLAST output, generating a matrix of high-scoring pairs (HSPs) accessions, bit scores, and percent identities. We used the NCBI Taxonomy text dumps ( to add the taxonomic classification for each matched accession. To apply the ‘naïve’ LCA algorithm to each OTU, at each of five specified ranks (species, genus, tribe, family, and order) we summed the best score of each read in a pair for each taxon associated with these top 25 accessions. If only one taxon had a combined score > 600 and was within 3% of the best score, the OTU was assigned to that taxon, otherwise the OTU was considered undetermined at that taxonomic level. A 3% cutoff corresponds to a minimum bit-score difference of 18 and an average difference greater than 20 (see ). We believe a larger percentage cutoff would be overly conservative given the already stringent requirement of concordant 97% mapping of two reads of at least 150 bp each, however, it should be noted that neither OTU assignments nor read mappings are associated with an explicit error probability. We calculated summary statistics for OTU assignments, including mean percent identity and bit scores of HSPs, as well as whether assignments fell within the target group (the green plants, or kingdom Viridiplantae) or were non-target taxa.A fourth and final mapping generated counts for taxa by sample, summing across OTUs with the same taxon assignment. Again, only pairs that mapped to the same OTU at 97% and 5 or fewer indel positions were counted. We chose to consider invalid read pairs that mapped discordantly across different OTUs even if given the same taxonomic assignment. Their inclusion would have increased the total mapped reads by less than 5%, at a cost of greater uncertainty (OTU assignments vary in strength) and a more burdensome workflow for retrospective reassignment as new reference sequences become available. Read counts are reported as counts per million (cpm) to account for variations in library size.Read-mapping is an approximate process that results in a nonzero level of suboptimal assignments []. For this reason as well as simple heteroscedasticity, low counts are not expected to be robust. As even genuine low-abundance taxa lack biological relevance for our research objectives, we excluded taxa with less than a total of 50 counts summed across all samples. This threshold was chosen subjectively based on the long tail of low-count taxa in the total distribution. Threshold criteria have been included as options in other metabarcoding workflows (e.g., [,]) as a standard noise-reduction measure.Taxonomic diversity of read counts was assessed using the bcDiversity function in the entropart R package [] with ChaoShen bias correction and exponent q = 1. As the diversity value is bias-corrected based on actual sample sizes, raw values rather than library-size normalized cpm were used. Rarefaction curves were obtained with the ‘rarefy’ function of the vegan R package []. Two samples, A091 and A092 (see ), were suspected of being mislabeled, a conclusion supported by a later sequencing run. These were retained in analyses of assignment strength, biological replication, and rarefaction but excluded from those relating to phenology and overall abundance of taxa, where their inclusion would have been misleading. […]

Pipeline specifications