Protocol publication

[…] All command line programs for data analysis were run on the bioinformatics cluster of CGR (University of Liverpool) in a Debian 5 or 7 environment.

Raw fastq files were trimmed to remove Illumina adapters using Cutadapt version 1.2.1 with option -O 3 and Sickle version 1.200 with a minimum quality score of 20. Further quality control was performed with Prinseq-lite with the following parameters: minimum read length of 35, GC percentage between 5 and 95%, minimum mean quality of 25, dereplication (removal of identical reads, leaving 1 copy), and removal of tails of a minimum of 5 poly(N) sequences from 3′ and 5′ ends of reads.

The positive- and negative-control libraries described earlier were used for contaminant removal. The reads of the control samples were analyzed using Diamond blastx against the nonredundant protein database of NCBI (nr, November 2015 version). The blast results were visualized using MEGAN6 Community Edition. An extra contaminant file was created with the complete genomes of species present at over 1,000 reads in the positive- and negative-control samples. Then, bowtie2 was used for each sample to subtract the reads that mapped to the positive-control, negative-control, or contaminant file. The unmapped reads were used for assembly with SPAdes version 3.9.0, with k-mer values of 21, 31, 41, 51, 61, and 71 and the options --careful and a minimum coverage of 5 reads per contig. The contig files of each sample were compared with the contigs of the controls (assembled using the same parameters) using blastn of the BLAST+ suite. Contigs that showed significant similarity with control contigs were manually removed, creating a curated contig data set.
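The read-level quality criteria listed for Prinseq-lite can be illustrated with a minimal Python sketch. This is not Prinseq-lite's implementation: the `Read` type, the function names, and the order in which the steps are applied are assumptions made for illustration, and quality scores are assumed to be already decoded to Phred integers.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical in-memory read record (not a Prinseq-lite type)."""
    name: str
    seq: str          # bases, assumed upper case
    qual: List[int]   # per-base Phred scores, already decoded


def trim_n_tails(read: Read, min_run: int = 5) -> Read:
    """Remove runs of at least `min_run` N bases from the 5' and 3' ends."""
    left = re.match(rf"N{{{min_run},}}", read.seq)
    start = left.end() if left else 0
    right = re.search(rf"N{{{min_run},}}$", read.seq[start:])
    end = start + right.start() if right else len(read.seq)
    return Read(read.name, read.seq[start:end], read.qual[start:end])


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def passes_filters(read: Read, min_len: int = 35, gc_min: float = 5.0,
                   gc_max: float = 95.0, min_mean_qual: float = 25.0) -> bool:
    """Apply the length, GC-content, and mean-quality thresholds from the text."""
    if len(read.seq) < min_len:
        return False
    if not gc_min <= gc_percent(read.seq) <= gc_max:
        return False
    return sum(read.qual) / len(read.qual) >= min_mean_qual


def quality_control(reads: Iterable[Read]) -> Iterator[Read]:
    """Trim poly(N) tails, filter, and dereplicate (keep the first copy)."""
    seen = set()
    for read in reads:
        read = trim_n_tails(read)
        if not passes_filters(read):
            continue
        if read.seq in seen:
            continue
        seen.add(read.seq)
        yield read
```

In this sketch, dereplication is applied last so that only reads passing the other thresholds are remembered; whether Prinseq-lite uses the same ordering is not stated in the text.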
The unmapped read data sets were then mapped against this curated contig data set with bowtie2, and only the reads that mapped were retained, resulting in a curated read data set.

The curated contig and read data sets were compared to the RefSeq viral (January 2017 release) and nonredundant protein (nr, May 2017 release) reference databases using Diamond blastx at an e value of 1e−5 for significant hits. Taxon assignments were made with MEGAN6 Community Edition according to the lowest-common-ancestor algorithm with default settings. We chose the family level taxon assignments to represent the overall viral diversity because there is generally little amino acid identity between viral families. The taxon abundance data were extracted from MEGAN6 and imported into RStudio for visualization. Genes on the assembled contigs were predicted with Prokka using the settings --kingdom Viruses and an e value of 1e−5. Multiple alignments of genes and genomes were made in MEGA7 using the MUSCLE algorithm with default settings. The alignments were manually trimmed, and phylogenetic trees were built using the maximum-likelihood method in MEGA7 with the default settings. Sequences upstream from potential CDSs of Prokka-annotated picobirnaviruses were extracted using extractUpStreamDNA (https://github.com/ajvilleg/extractUpStreamDNA), and all 5′ UTRs and transcription start sites were manually verified in UGene. These extracted sequences were then subjected to a motif search using the MEME Suite. […]
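The idea behind lowest-common-ancestor taxon assignment can be shown with a small Python sketch: a read whose significant hits fall on several taxa is assigned to the deepest taxonomic node shared by all of them. This is a naive illustration only; MEGAN6 applies additional filtering parameters that are not modelled here, and the toy taxonomy and all names below are hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical toy taxonomy as a child -> parent map (placeholder names).
PARENT: Dict[str, str] = {
    "family_A": "Viruses",
    "family_B": "Viruses",
    "genus_A1": "family_A",
    "genus_B1": "family_B",
    "species_A1a": "genus_A1",
    "species_A1b": "genus_A1",
    "species_B1a": "genus_B1",
}


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from the root down to `taxon` (root first)."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa: Iterable[str],
                           parent: Dict[str, str]) -> Optional[str]:
    """Deepest node shared by the lineages of all hit taxa, or None if no hits."""
    lineages = [lineage(t, parent) for t in taxa]
    if not lineages:
        return None
    lca = None
    # Walk the lineages rank by rank from the root until they diverge.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca
```

With this toy tree, a read hitting two species of the same genus is assigned to that genus, while a read hitting species from two different families is assigned only to the shared root. Reads with hits spread across families therefore end up above the family level and do not contribute to family-level counts.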

Pipeline specifications