Computational protocol: Reference datasets of tufA and UPA markers to identify algae in metabarcoding surveys

Protocol publication

[…] We produced reference datasets that can be used with the Naive Bayesian Classifier (RDP classifier) implemented in the QIIME pipeline. Each dataset consists of (1) a FASTA file containing the reference DNA sequences and short sequence identifiers and (2) a text file matching the sequence identifiers to their taxonomic metadata. To produce these datasets, we first mined sequences from GenBank by querying the marker name and downloading all matching items as full GenBank records.
We added endolithic (limestone-boring) green algal lineages discovered with the tufA marker in our study “Multi-marker metabarcoding of coral skeletons reveals a rich microbiome and diverse evolutionary origins of endolithic algae”. We identified these algal lineages in a phylogenetic context and included representatives of the main endolithic clades in the tufA reference dataset. We also retrieved a large diversity of algae with the UPA marker, but these lineages did not receive the same nomenclature as the tufA lineages because the correspondence between the tufA and UPA algal clades was unknown. To match tufA and UPA clades, we used chloroplast genome data. The complete chloroplast genomes of two endolithic algal strains, Ostreobium HV05042 and SAG699, were sequenced and added to the UPA reference dataset. Phylogenetically, these strains belong to Ostreobium Clade 3 and Clade 4, respectively. Since there are no reference sequences for Ostreobium Clade 1 and Clade 2, OTUs belonging to these two clades may be classified as Clade 3 or Clade 4, or may be classified only at higher taxonomic levels.
The reference datasets were equalized so that they contain neither identical sequences nor a disproportionate number of closely related species, which benefits downstream taxonomic assignment.
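The two-file layout described above (a FASTA file plus an identifier-to-taxonomy text file) lends itself to a quick consistency check before the files are handed to a classifier. The following is a minimal sketch, not part of the published protocol. It assumes that the taxonomy file holds one record per line, with the identifier and a semicolon-delimited lineage separated by a tab; the function names are hypothetical.

```python
def read_fasta_ids(fasta_text):
    """Return the identifier (first word after '>') of every FASTA record."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            ids.append(header[0] if header else "")
    return ids


def read_taxonomy_map(taxonomy_text):
    """Parse 'identifier<TAB>rank1;rank2;...' lines into {identifier: [ranks]}.

    Assumption: tab-separated identifier and semicolon-delimited lineage.
    """
    mapping = {}
    for line in taxonomy_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        seq_id, _, lineage = line.partition("\t")
        ranks = [rank.strip() for rank in lineage.split(";") if rank.strip()]
        mapping[seq_id.strip()] = ranks
    return mapping


def check_dataset(fasta_text, taxonomy_text):
    """Report identifiers that would break the pairing between the two files."""
    fasta_ids = read_fasta_ids(fasta_text)
    taxonomy_ids = set(read_taxonomy_map(taxonomy_text))
    seen, duplicates = set(), set()
    for seq_id in fasta_ids:
        if seq_id in seen:
            duplicates.add(seq_id)
        seen.add(seq_id)
    return {
        "missing_taxonomy": sorted(seen - taxonomy_ids),  # in FASTA only
        "unused_taxonomy": sorted(taxonomy_ids - seen),   # in taxonomy only
        "duplicate_ids": sorted(duplicates),              # repeated in FASTA
    }
```

Calling `check_dataset` on the contents of the two files returns three lists; all three being empty means every reference sequence has exactly one taxonomy entry and vice versa.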
To equalize the datasets and exclude closely related or identical reference sequences, we built a UPGMA tree of the sequences under the JC69 model. We sliced this tree at 0.001 branch length units from the tips, which yielded several clades of closely related sequences. From each of these clades we kept one reference sequence, chosen on the basis of quality (i.e. length and number of undefined bases). For the tufA OTUs obtained in Marcelino and Verbruggen, we used a threshold of 0.1 branch length units (1–3 OTUs per family) so as not to add a disproportionately high number of endolithic algal lineages to the reference dataset. The reference datasets were converted to a QIIME-friendly format with the gb_2_RDP.py script, which uses the metadata contained in GenBank files to produce the taxonomic metadata required by RDP. The gb_2_RDP.py script is also available at: https://github.com/vrmarcelino/Make_Ref_Dataset/blob/master/gb_2_RDP.py […]
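The equalization step described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation, and the tool they used to build and slice the tree is not stated here. The sketch assumes pre-aligned sequences of equal length, skips sites with gaps or ambiguous bases when computing JC69 distances, and treats "quality" as the count of unambiguous bases (one possible way to combine length and number of undefined bases). Because UPGMA trees are ultrametric, a node sits at a height of half the average distance between the clusters it joins, so slicing at a given height from the tips is equivalent to stopping the clustering once the next merge would exceed that height. All function names are hypothetical.

```python
import math
from itertools import combinations


def jc69_distance(seq_a, seq_b):
    """JC69 distance between two aligned sequences (non-ACGT sites skipped)."""
    compared = differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            differing += x != y
    if compared == 0:
        return math.inf
    p = differing / compared
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # distance is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def upgma_clusters(seqs, cut_height):
    """Group sequence ids into the clades found below `cut_height` from the tips."""
    names = list(seqs)
    clusters = {i: [name] for i, name in enumerate(names)}
    dist = {(i, j): jc69_distance(seqs[names[i]], seqs[names[j]])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), d = min(dist.items(), key=lambda item: item[1])
        if d / 2.0 > cut_height:  # node height is half the cluster distance
            break
        size_i, size_j = len(clusters[i]), len(clusters[j])
        merged = clusters.pop(i) + clusters.pop(j)
        # Average-linkage update: size-weighted mean of the two old distances.
        updated = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            updated[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {pair: v for pair, v in dist.items()
                if i not in pair and j not in pair}
        dist.update(updated)
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())


def pick_representative(ids, seqs):
    """Assumed quality rule: most unambiguous bases; ties go to the first id alphabetically."""
    def defined_bases(name):
        return sum(base in "ACGT" for base in seqs[name].upper())
    return max(sorted(ids), key=defined_bases)


def equalize(seqs, cut_height=0.001):
    """Return one representative id per clade, sorted alphabetically."""
    return sorted(pick_representative(clade, seqs)
                  for clade in upgma_clusters(seqs, cut_height))
```

Passing a dictionary of identifier-to-aligned-sequence pairs to `equalize` returns the identifiers to keep. Raising `cut_height` (for example to 0.1, as used for the tufA OTUs) collapses more distant relatives into a single representative.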

Pipeline specifications