Computational protocol: Large-Scale Biomonitoring of Remote and Threatened Ecosystems via High-Throughput Sequencing

[…] For all 24 samples, a total of 11.64 million Illumina reads were generated from both COI fragments. For each sample, the forward and reverse raw reads for the BE fragment and the F230 fragment were merged with SEQPREP software, requiring a minimum overlap of 25 bp and no mismatches, resulting in 5.8 million total paired-end reads (mean: 261,930 reads/sample). All Illumina paired-end reads were filtered for quality using PRINSEQ software with a minimum Phred score of 20, a window of 10, a step of 5, and a minimum length of 100 bp. A total of 1.02 million paired BE reads (mean: 59,820 reads/sample) and a total of 4.37 million paired F230 reads (mean: 182,120 reads/sample) were retained for further processing. USEARCH v6.0.307 with the UCLUST algorithm was used to de-replicate and cluster the remaining sequences using a 99% sequence similarity cutoff. This was done to denoise the data by collapsing potential sequencing errors prior to further processing. Chimera filtering was performed using USEARCH with the ‘de novo UCHIME’ algorithm. At each step, cluster sizes were retained, singletons were retained, and only putatively non-chimeric reads were retained for further processing. All filtered, non-chimeric reads from all 24 samples were pooled and clustered at 98% similarity using USEARCH. For those clusters including at least 100 sequences, membership in each cluster for each sample was recorded as an OTU sequence abundance matrix (DNA-OTU).
Both BE and F230 sequences were pooled for each sample and identified using the MEGABLAST algorithm against a reference library. This reference library contained all verified COI sequences downloaded from the GenBank database on September 5th, 2014 with a minimum length of 100 bp (N = 985,210 sequences). All MEGABLAST searches were conducted with a minimum alignment length percentage of 85% and a minimum similarity of 90%. Taxonomic identifications were recovered based on unambiguous top matches.
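The match-filtering step described above can be illustrated with a minimal Python sketch. It is not the code used in the study. The `Hit` record, its field names, and the function `unambiguous_top_matches` are hypothetical. The sketch assumes that "alignment length percentage" means alignment length divided by query length, and that a top match is "unambiguous" when all hits sharing the best bit score point to a single taxon; the protocol does not spell out either definition.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment between a query read and a reference sequence (hypothetical record)."""
    query: str       # read or cluster identifier
    taxon: str       # taxon name attached to the reference sequence
    pident: float    # percent identity of the alignment
    aln_len: int     # alignment length in bp
    query_len: int   # query length in bp
    bitscore: float  # alignment bit score


MIN_IDENT = 90.0    # minimum similarity (%), from the protocol
MIN_ALN_PCT = 85.0  # minimum alignment length percentage, from the protocol


def unambiguous_top_matches(hits, min_ident=MIN_IDENT, min_aln_pct=MIN_ALN_PCT):
    """Return {query: taxon} for queries whose best-scoring hits agree on one taxon."""
    by_query = defaultdict(list)
    for h in hits:
        # Assumption: alignment length percentage is relative to the query length.
        aln_pct = 100.0 * h.aln_len / h.query_len
        if h.pident >= min_ident and aln_pct >= min_aln_pct:
            by_query[h.query].append(h)

    identified = {}
    for query, qhits in by_query.items():
        best = max(h.bitscore for h in qhits)
        top_taxa = {h.taxon for h in qhits if h.bitscore == best}
        # Assumption: ties between different taxa make the match ambiguous.
        if len(top_taxa) == 1:
            identified[query] = top_taxa.pop()
    return identified


example_hits = [
    Hit("r1", "Baetis", 99.0, 200, 210, 350.0),
    Hit("r1", "Ephemerella", 92.0, 200, 210, 300.0),
    Hit("r2", "Taxon A", 95.0, 200, 210, 320.0),  # tied with Taxon B: ambiguous
    Hit("r2", "Taxon B", 95.0, 200, 210, 320.0),
    Hit("r3", "Taxon C", 88.0, 200, 210, 300.0),  # below 90% similarity
    Hit("r4", "Taxon D", 99.0, 100, 210, 180.0),  # alignment covers under 85% of the query
]
identified = unambiguous_top_matches(example_hits)
```

In this toy input only `r1` receives an identification; the other queries are dropped because of a tie, low similarity, or a short alignment.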
Genus, family, and order matrices for taxa with a minimum of ten sequences per sample were generated for each sample based on these matches (hereafter referred to as DNA-order, DNA-family, and DNA-genus). Only taxon names within benthic metazoan phyla (i.e., Annelida, Arthropoda, Mollusca, Chordata, Cnidaria) were included in the analysis. A subset of matches with a minimum similarity of 98% was used to generate a species matrix (DNA-species). For all identification levels, except DNA-OTU, a subset of the matrix including only representatives of Ephemeroptera, Trichoptera, and Odonata (ETO) was generated. […]
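As an illustration of how such taxon matrices could be assembled, the following Python sketch counts identified sequences per sample at a chosen rank, keeps only taxa with at least ten sequences in a sample, and optionally restricts the result to the ETO orders. It is a sketch under stated assumptions rather than the authors' implementation. The function `taxon_matrix`, the lineage dictionaries, and the reading of the threshold as a per-sample, per-taxon cutoff are all assumptions. The phylum filter and the 98% species-level subset are omitted for brevity.

```python
from collections import Counter

ETO_ORDERS = {"Ephemeroptera", "Trichoptera", "Odonata"}
MIN_READS = 10  # minimum sequences per taxon per sample, from the protocol


def taxon_matrix(assignments, rank, min_reads=MIN_READS, orders=None):
    """Build {sample: {taxon: count}} at the given rank.

    assignments: iterable of (sample, lineage) pairs, one per identified sequence,
                 where lineage is a dict with 'order', 'family' and 'genus' keys
                 (a hypothetical input layout).
    orders:      optional set of order names to keep, e.g. ETO_ORDERS.
    """
    counts = Counter()
    for sample, lineage in assignments:
        if orders is not None and lineage["order"] not in orders:
            continue
        counts[(sample, lineage[rank])] += 1

    matrix = {}
    for (sample, taxon), n in counts.items():
        # Assumption: the threshold applies to each taxon within each sample.
        if n >= min_reads:
            matrix.setdefault(sample, {})[taxon] = n
    return matrix


baetis = {"order": "Ephemeroptera", "family": "Baetidae", "genus": "Baetis"}
chironomus = {"order": "Diptera", "family": "Chironomidae", "genus": "Chironomus"}
assignments = [("S1", baetis)] * 12 + [("S1", chironomus)] * 15 + [("S2", baetis)] * 3

dna_family = taxon_matrix(assignments, "family")
dna_genus_eto = taxon_matrix(assignments, "genus", orders=ETO_ORDERS)
```

Here sample `S2` drops out of both matrices because its only taxon has three sequences, and the ETO subset excludes the Diptera records.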

