Computational protocol: Resolving Prokaryotic Taxonomy without rRNA: Longer Oligonucleotide Word Lengths Improve Genome and Metagenome Taxonomic Classification

Similar protocols

Protocol publication

[…] A complete set of non-draft sequenced microbial genomes (including 1,315 bacteria and 109 archaea) () were downloaded from the National Center for Biotechnology Information (NCBI) website ( on June 21st, 2012. The genomes were filtered to remove plasmids to allow for an analysis of only chromosomal DNA. Additionally, 16S rRNA sequences for all included genomes were downloaded from the Ribosomal Database Project (RDP) ( All genomes included in this study contain taxonomic information obtained from the NCBI taxonomic database ( and parsed to include: species, genus, family, order, phylum and domain annotations. An analysis of the phylums included in this study shows that while the NCBI genomes database contains good diversity (30 phylums) it also includes a strong bias towards proteobacteria (43% of total genomes) and firmicutes (19% of total genomes).All chromosomes were analyzed to determine their mono- through nona- oligonucleotide signatures using a “sliding window” to find the count of each possible oligonucleotide combination . These counts were converted into percentages where the ratio of all percentages represents the genetic signature. The Euclidean distance between chromosomes was determined using the following formula:Where each (p, q) set represents a bin from the oligonucleotide signature. The Euclidean distances between all organism pairs were converted into a distance matrix for analysis using the neighbor program within the Phylip software package . This resulted in cladograms representing the Euclidean distances (i.e. oligonucleotide signatures) between all members of the 1,424 organism dataset. Corresponding bacterial and archaeal 16S rRNA sequences were combined and aligned using the RDP’s on-line tools. The aligned 16S rRNA sequences where converted into a phylogenetic tree using the dnadist and neighbor tools within Phylip. The oligonucleotide signature based cladograms and the 16S rRNA based phylogenetic tree were analyzed using Bioperl’s TreeIO tools to extract the distance between all leaf nodes. Results were filtered to generate a list of the nearest neighbor for all leaf nodes in all cladograms. Using taxonomy data for all leafs and their nearest neighbor we determined the percentage of occurrences when a nearest neighbor is from the same taxonomic group (i.e. same domain, same phylum, etc). Additionally, the taxonomic data between nearest neighbors allowed for the color-coding of cladogram nodes based on taxonomy.Euclidean distance verses 16S identity plots were generated by plotting the Euclidean distance between all organism pairs in our 1,424 member dataset verses the 16S identity between the pair. The identity between aligned 16S rRNA sequences was determined using the dnadist program within Phylip. Taxonomy data was also included to color-code the plot. The genus normalized Euclidean distance normalization metric was derived from dividing all Euclidean distances by the largest genus-genus Euclidean distance for all oligonucleotide lengths ().To generate the leave-one out analysis we calculated the Euclidean distance between all organisms in our 1,424 member dataset, not including self-comparisons. We then organized all resulting distances into thirty equally sized bins and calculated the taxonomical relationships for all organism pairs in each bin. Each bin was then analyzed for the percentage of times the same taxonomic identity was seen (i.e. how often a bin contained organism pairs from the same genus).To determine random divergence we constructed a random one million base pair DNA sequence. The sequence was subjected to one million iterations where we randomly selected a single base and mutated it to a randomly selected base. The mutated DNA sequence was written to disk every one hundred iterations and each of these sequences were compared to the original by calculating the tetranucleotide and heptanucleotide signature and calculating the Euclidean distance from the original sequence. The results were plotted as iteration number verses Euclidean distance from the original sequence.To analyze metagenomically relevant fragments from 1,424 completed genomes we randomly pulled 1,000, 2,500, 5,000, 10,000, 15,000, 25,000 and 50,000 base pair fragments from each completed genome and calculated tetranucleotide/heptanucleotide signatures for all fragments. The Euclidean distance was calculated between each fragment for all fragment lengths and each set of Euclidean distances was converted into a distance matrix. Distance matrices were analyzed using the neighbor application in Phylip to generate cladograms. BioPerl’s TreeIO was used to calculate the nearest neighbor for all nodes and the NCBI taxonomy was used to pull genus of all sequenced genomes included. The percentage of nearest neighbors having the same genus was calculated and plotted verses fragment length for both tetranucleotide and heptanucleotide signatures.To determine the Euclidean distances based on fragment length the organism’s chromosome was broken in chunks with lengths of: 50,000, 20,000, 15,000, 10,000, 5,000, 2,500, 1,000 and 500 base pairs. The tetranucleotide and heptanucleotide signatures were calculated for all chunks along with the Euclidean distances between all chunks. The average Euclidean distance between all chunks was then calculated between all organism pairs over all chunk lengths. These average Euclidean distances were then plotted verses chunk length for each organism pair.To determine nt database matches for the Bison Pool metagenome dataset using oligonucleotide signatures both the nt database and the Bison Pool dataset were parsed to DNA sequences in excess of 10,000 base pairs. Next, the tetranucleotide and heptanucleotide signatures were calculated for all sequences in the nt database as well as the Bison Pool dataset. The Euclidean distance was calculated between all members of the Bison Pool dataset and nt using both tetranucleotide and heptanucleotide signatures, with the pairing with the lowest Euclidean distance designated as the best match. NCBI BLAST was used between the over 10,000 base pair nt database and the over 10,000 base pair Bison Pool datasets to determine the “correct” best match between the metagenomes and the nt database. Results were then analyzed for how often the best BLAST hit and shortest Euclidean distance hit agreed.The twenty-five Yellowstone National Park metagenomes were combined into a single file and parsed so that every 700th sequence was pulled out for analysis, giving us 242 scaffolds. NCBI BLAST was used to find all related sequences for the 242 metagenomic scaffolds within the NCBI nt database. The tetra- and hepta- nucleotide Euclidean distance was calculated between all metagenomic scaffolds and all their related hits in the nt database. The calculated Euclidean distances and the scaffold lengths were plotted using R.Perl scripts developed for the determination of oligonucleotide signatures from DNA sequences as well as for calculating the Euclidean distance between oligonucleotide signatures are available for download from the Raymond ground website at […]

Pipeline specifications