Computational protocol: Transporting Ocean Viromes: Invasion of the Aquatic Biosphere

Protocol publication

[…] We performed quality control by removing (i) reads homologous to a 17-bp sequence (GTTTCCCAGTCACGATC) used as a primer for random transcription/amplification (allowing up to 3 mismatches per read) and (ii) low-quality reads (defined as reads < 30 bp in length, reads in which 50% of the bases had a quality score < Q30, and/or reads containing degenerate bases (‘N’s)). Finally, we obtained 13.3–174.7 million high-quality reads for the 72 samples, with an average of 47 million reads per sample.
Sequence reads were assembled into contiguous sequences (contigs) using IDBA-UD []. Reads were then aligned to the contigs with Bowtie 2 []. A total of 7.0 million contigs were produced for the 72 samples, and on average 79.7 ± 8.1% of reads mapped to a unique position on the contigs. We carried out taxonomic assignment of contigs by performing BLASTX searches (E < 10^-5) against sequences in the National Center for Biotechnology Information (NCBI) viral database (downloaded in September 2014), and then summarizing the results with MEGAN (Min Score = 50.0, Max Expected = 1.0E-5, Top Percent = 10.0, Min Support Percent = 0.0, Min Support = 1, and LCA Percent = 100.0) []. Of the assigned contigs (2.17 million), we removed those that lacked any taxonomic information (e.g., unclassified phages) from the data sets.
The abundance of a viral taxonomic group was determined by R_i = Σ (N_i / L_i), where R_i is the relative abundance of viral family i, N_i is the number of reads aligned to a contig in viral family i, and L_i is the length (kbp) of a contig in viral family i. To compare a particular group of viruses in one virome with the other viromes, and to normalize for differences in sequencing depth between viromes, the relative abundance of a phylogenetic group was expressed as a percentage within its virome rather than as a raw value. The relative abundances of viral taxonomic groups were compiled in a matrix with viromes as rows and taxonomic groups as columns.
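The abundance calculation and per-virome normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the sum in R_i runs over all contigs assigned to family i, and the function names, input layout, and example numbers are hypothetical.

```python
from collections import defaultdict


def family_abundance(contigs):
    """Sum reads-per-kbp over the contigs of each family.

    contigs: iterable of (family, reads_aligned, contig_length_bp) tuples.
    Assumption: R_i is summed over every contig assigned to family i.
    """
    totals = defaultdict(float)
    for family, n_reads, length_bp in contigs:
        length_kbp = length_bp / 1000.0
        totals[family] += n_reads / length_kbp
    return dict(totals)


def to_percent(abundance):
    """Rescale raw abundances so that they sum to 100 within one virome."""
    total = sum(abundance.values())
    if total == 0:
        return {family: 0.0 for family in abundance}
    return {family: 100.0 * value / total for family, value in abundance.items()}


def abundance_matrix(viromes):
    """Build a viromes-by-families table (rows: viromes, columns: families)."""
    per_virome = {
        name: to_percent(family_abundance(contigs))
        for name, contigs in viromes.items()
    }
    families = sorted({f for percents in per_virome.values() for f in percents})
    rows = {
        name: [percents.get(f, 0.0) for f in families]
        for name, percents in per_virome.items()
    }
    return families, rows


# Hypothetical toy input: (family, reads aligned, contig length in bp).
example = {
    "virome_A": [("Myoviridae", 300, 3000), ("Podoviridae", 100, 2000)],
    "virome_B": [("Myoviridae", 50, 1000), ("Siphoviridae", 150, 1000)],
}
families, rows = abundance_matrix(example)
```

Dividing by contig length keeps long contigs from dominating simply because they recruit more reads, and the percentage step makes viromes of different sequencing depth comparable.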
Similarity Percentages (SIMPER) analysis was performed to identify discriminating taxonomic groups by comparing relative abundances of viral families between geographic origins using the PAST statistical package []. Spearman's correlation coefficient was computed to examine relationships between discriminating viral families and geographical locations using the R Statistics Environment [].
A subset of contigs most similar to viruses infecting human, fish, and shrimp was extracted from the data sets. These contigs were again BLASTX-searched (E < 10^-3) against the inclusive NCBI non-redundant (nr) database (downloaded in April 2014), and any contigs more similar to non-viral proteins were excluded. Genome coverage plots were computed for the selected viral pathogens to examine predicted genes similar to each gene on the reference genomes from the NCBI viral database using Metavir 2 [].
We used two approaches to estimate the total number of distinct viral species (viral richness) present in each of our viromes. First, we defined virus richness as the total number of identified viral families in the data sets. Because relying on assigned taxonomic groups to determine viral richness excludes unassigned viral groups, tools specifically designed to calculate viral richness (of both known and unknown viruses) were used as our second approach. Briefly, 2,500,000 quality-trimmed reads were randomly sampled from each virome data set. Contig spectra were calculated with Circonspect [] using the Minimo assembler with default parameters (98% sequence identity over an overlap of at least 35 bp) on all reads. Then, CatchAll [] was employed with its default parameters and produced viral richness estimates under the best parametric model according to statistical and heuristic criteria.
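The text does not say how the 2,500,000 reads were drawn from each virome. One common way to take a fixed-size random sample from a read file too large to hold in memory is reservoir sampling, sketched below under that assumption. The function name, the seed, and the toy read identifiers are hypothetical, and sampling without replacement is assumed.

```python
import random


def subsample_reads(reads, k, seed=0):
    """Draw a uniform random sample of k reads from an iterable (Algorithm R).

    The input is streamed once, so it never has to fit in memory.
    If fewer than k reads are available, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for index, read in enumerate(reads):
        if index < k:
            # Fill the reservoir with the first k reads.
            reservoir.append(read)
        else:
            # Replace a stored read with probability k / (index + 1).
            j = rng.randint(0, index)  # both ends inclusive
            if j < k:
                reservoir[j] = read
    return reservoir


# Toy usage: sample 25 identifiers from a stream of 10,000.
reads = (f"read_{i}" for i in range(10_000))
subset = subsample_reads(reads, k=25)
```

Sampling the same number of reads from every virome keeps the downstream richness estimates from being driven by differences in sequencing depth.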
Spearman's correlation coefficient was computed to examine relationships between virus richness and variables using the R Statistics Environment [].
To take all sequences into account in virome comparison, rather than only the small known fraction identifiable with publicly available sequence databases, sequence similarity was computed using TBLASTX comparison as implemented in Metavir 2 []. Briefly, a subset of 2,500,000 quality-trimmed reads from each virome was uploaded to Metavir 2. Assembled contigs were not used for virome-to-virome comparison, as the assembly step introduces bias in the relative abundance of each sequence. The average of the best TBLASTX hit scores between virome A reads and virome B reads was computed to represent the sequence similarity between viromes. The resulting similarity matrix (ranging from 0 for no similarity to 100 for a perfect match) for all virome pairs was converted to a dissimilarity matrix by subtracting each value from 100. A heatmap was generated by hierarchical cluster analysis using the complete linkage algorithm in the R Statistics Environment []. To test for statistically significant differences between groupings of the samples made according to geographic origin, analysis of similarity (ANOSIM) (9999 permutations) was carried out on the previously generated dissimilarity matrix using the PAST statistical package []. […]
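The conversion from similarity to dissimilarity and the complete-linkage clustering described above were done in R. The pure-Python sketch below only illustrates the logic and is not the authors' code. It assumes a symmetric similarity matrix, and the labels and scores are invented for the example.

```python
def to_dissimilarity(similarity):
    """Convert 0-100 similarity scores to dissimilarities (100 - score)."""
    return [[100.0 - score for score in row] for row in similarity]


def complete_linkage(labels, dist):
    """Naive agglomerative clustering with complete linkage.

    The distance between two clusters is the largest pairwise distance
    between their members. Returns the merges in order, each as
    (cluster_a, cluster_b, merge_height). Assumes dist is symmetric.
    """
    position = {label: i for i, label in enumerate(labels)}
    clusters = [(label,) for label in labels]

    def cluster_distance(a, b):
        return max(dist[position[x]][position[y]] for x in a for y in b)

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical similarity scores for three viromes.
labels = ["A", "B", "C"]
similarity = [
    [100.0, 80.0, 30.0],
    [80.0, 100.0, 40.0],
    [30.0, 40.0, 100.0],
]
dissimilarity = to_dissimilarity(similarity)
merges = complete_linkage(labels, dissimilarity)
```

In this toy case A and B merge first at height 20, and C joins at height 70, the larger of its distances to A and B. Complete linkage tends to produce compact clusters, which suits a heatmap ordered by overall virome similarity.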

Pipeline specifications

Software tools: IDBA-UD, Bowtie, BLASTX, MEGAN, METAVIR, CatchAll, TBLASTX
Applications: Genome annotation, Phylogenetics, Nucleotide sequence alignment
Organisms: Viruses
Diseases: Pulmonary Fibrosis, HIV Infections