Computational protocol: A diverse group of small circular ssDNA viral genomes in human and non-human primate stools

Protocol publication

[…] A total of 70 human stool samples from 25 US outbreaks of unexplained diarrheal disease with typical viral gastroenteritis epidemiology were analyzed by viral metagenomics. Human samples were submitted to the Centers for Disease Control and Prevention (CDC) with the corresponding dates. Fifty-five samples from 21 French outbreaks were similarly pooled and analyzed by metagenomics. Stools from non-human primates were collected in December 2009 from the San Francisco Zoo, including 4 samples from aye-aye (Daubentonia madagascariensis), 3 from bare-face tamarin (Saguinus bicolor), 3 from black howler monkey (Alouatta caraya), 4 from chimpanzee (Pan troglodytes), 3 from emperor tamarin (Saguinus imperator), 5 from gorilla (Gorilla gorilla), 7 from ring-tailed lemur (Lemur catta), 2 from lion-tailed macaques (Macaca silenus), 3 from mandrills (Mandrillus sphinx), 4 from patas monkeys (Erythrocebus patas), 2 from siamang (Symphalangus syndactylus), and 3 from squirrel monkey (Saimiri sp.). Deep sequencing using the Illumina MiSeq platform (human samples) and the 454 Genome Sequencer FLX platform (primate samples) was performed on enriched viral particles according to previously described protocols. Sequence data were analyzed by a customized NGS pipeline as described previously. Specifically, human host reads and bacterial reads were subtracted by mapping the reads to the human reference genome hg19 and bacterial RefSeq genomes (release 66) using Bowtie2. Remaining reads were considered duplicates if positions 5 to 55 from the 5′ end were identical; one randomly chosen copy of each set of duplicates was kept. Low-quality tails were trimmed using a Phred quality score of 10 as the threshold. Adaptor and primer sequences were trimmed using the default parameters of VecScreen (NCBI). The cleaned reads were de novo assembled using EnsembleAssembler.
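The duplicate-removal and quality-trimming rules described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline code. The function names, the in-memory `(name, sequence, qualities)` read representation, and the fixed random seed are hypothetical. The sketch also assumes that "position 5 to 55" is 1-based and inclusive, and that tail trimming removes bases from the 3′ end until a base with Phred quality of at least 10 is reached.

```python
import random
from collections import defaultdict


def dedup_reads(reads, start=5, end=55, seed=0):
    """Keep one randomly chosen read per group of duplicates.

    reads: list of (name, sequence, qualities) tuples (hypothetical layout).
    Two reads are treated as duplicates when bases start..end
    (assumed 1-based, inclusive) are identical.
    """
    groups = defaultdict(list)
    for read in reads:
        sequence = read[1]
        # 1-based inclusive window [start, end] -> Python slice [start-1:end].
        # Reads shorter than `end` simply yield a shorter key.
        key = sequence[start - 1:end]
        groups[key].append(read)
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [rng.choice(members) for members in groups.values()]


def trim_low_quality_tail(sequence, qualities, threshold=10):
    """Trim the 3' tail while the last base has Phred quality < threshold."""
    keep = len(sequence)
    while keep > 0 and qualities[keep - 1] < threshold:
        keep -= 1
    return sequence[:keep], qualities[:keep]
```

In this sketch, two reads that differ only in their first four bases or beyond base 55 fall into the same group, so only one of them survives. Trimming stops at the first base from the 3′ end that meets the threshold, so low-quality bases in the interior of a read are left in place.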
The assembled contigs, along with singlets, were aligned to an in-house viral proteome database using BLASTx with an E-value cutoff of 0.01. The matches to viral sequences were then aligned to an in-house non-virus-non-redundant (NVNR) universal proteome database using BLASTx. Hits with a more significant adjusted E-value to NVNR than to virus were removed. To digitally screen for smacovirus-related sequences in our in-house virome, the available viral DNA genomes were compared with 1.04 billion sequences using BLASTn with an E-value cutoff of 0.0001. Resulting hits were analyzed manually by sequence alignment and phylogenetic analysis. […] To generate phylogenetic trees, the protein sequences were aligned using MAFFT with the E-INS-i alignment strategy. Bayesian inference trees were constructed using MrBayes. The Markov chain was run for a maximum of 1 million generations, sampling every 50 generations, and the first 25 percent of Markov chain Monte Carlo samples were discarded as burn-in. Rolling circle replication (RCR) motifs were analyzed from the sequence alignment with reference genomes of circoviruses and geminiviruses. Pairwise protein identities of replicase and capsid protein sequences were calculated using the Species Demarcation Tool software. Stem–loop structure was analyzed using mfold with default settings. […]
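The two-stage BLASTx filter described above (keep viral matches at the E-value cutoff, then drop those that match NVNR more significantly) can be sketched as follows. This is an illustrative sketch, not the authors' implementation. The function name and the dictionary inputs are hypothetical, the E-values are assumed to be already adjusted for database size (the text does not say how the adjustment is done), and the cutoff is assumed to be inclusive.

```python
def filter_viral_hits(viral_evalues, nvnr_evalues, viral_cutoff=0.01):
    """Return query ids whose best viral hit survives the NVNR comparison.

    viral_evalues: dict mapping query id -> best adjusted E-value against
        the viral proteome database (hypothetical input layout).
    nvnr_evalues: dict mapping query id -> best adjusted E-value against
        the NVNR database; queries with no NVNR hit are absent.
    """
    kept = []
    for query, viral_e in viral_evalues.items():
        # Stage 1: require a viral match at or below the cutoff (assumed inclusive).
        if viral_e > viral_cutoff:
            continue
        # Stage 2: drop the query if its NVNR hit is more significant
        # (a smaller E-value) than its viral hit.
        nvnr_e = nvnr_evalues.get(query)
        if nvnr_e is not None and nvnr_e < viral_e:
            continue
        kept.append(query)
    return kept
```

A query with equal E-values in both databases is retained here, because the text removes only hits that are more significant to NVNR than to virus.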

Pipeline specifications

Software tools: Bowtie2, VecScreen, BLASTX, BLASTN, MAFFT, MrBayes, Mfold
Applications: Phylogenetics, Metagenomic sequencing analysis
Organisms: Homo sapiens, Viruses, Pan troglodytes, Gorilla gorilla, Alouatta caraya, Mus musculus