Protocol publication

[…] We removed low-quality reads from the TARA Oceans dataset using ‘iu-filter-quality-minoche’, a program in illumina-utils v1.4.1 (available from […]) that implements the noise filtering parameters described by […]. After simplifying the header lines of the 31 FASTA files for Prochlorococcus isolate genomes using the anvi’o script ‘reformat-fasta’, we concatenated all FASTA files into a single file and used Bowtie2 with default parameters and the additional ‘--no-unal’ flag to recruit quality-filtered short metagenomic reads onto the Prochlorococcus isolate genomes (‘read recruitment’ is analogous to ‘mapping’ or ‘short read alignment’). We used samtools to convert the resulting SAM files into sorted and indexed BAM files. [...] We used Phylosift v1.0.1 with default parameters to quantify evolutionary distances between genomes. Briefly, Phylosift (1) identifies a set of 37 marker gene families in each genome, (2) concatenates the alignment of each marker gene family across genomes, and (3) computes a phylogenomic tree from the concatenated alignment using FastTree 2.1. We finalized the phylogenomic tree by setting a midpoint root with FigTree v1.4.3. [...] We used anvi’o v3 (available from […]) to profile the read recruitment results following the workflow outlined by […]. Briefly, we first used the program ‘anvi-gen-contigs-database’ to profile Prochlorococcus genomes, during which Prodigal v2.6.3 with default settings identified open reading frames. We used the program ‘anvi-import-functions’ to import functional annotations for our genes from InterProScan v5.17-56 and eggNOG-mapper v0.12.6 outputs, which draw on other databases including PFAM and eggNOG. We then used the program ‘anvi-run-ncbi-cogs’ to annotate genes with functions by searching them against the December 2014 release of the Clusters of Orthologous Groups (COGs) database using blastp v2.3.0+.
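The read recruitment steps described above (Bowtie2 with ‘--no-unal’, followed by samtools conversion to sorted and indexed BAM files) can be sketched as a list of command lines. This is a minimal illustration, not the authors' script: the file names, the thread count, the separate ‘bowtie2-build’ indexing step, the paired-end input layout, and the ‘samtools sort -o’ syntax (samtools 1.x) are all assumptions, since the excerpt does not give them. The function only assembles the commands and does not execute anything.

```python
import shlex
from pathlib import Path


def read_recruitment_commands(reference_fasta, r1_fastq, r2_fastq, sample, threads=4):
    """Build (but do not run) the commands for one metagenome.

    All file names and the thread count are hypothetical placeholders.
    """
    # Assumption: the Bowtie2 index prefix is the FASTA name without its extension.
    index_prefix = str(Path(reference_fasta).with_suffix(""))
    sam = f"{sample}.sam"
    raw_bam = f"{sample}-RAW.bam"
    bam = f"{sample}.bam"
    return [
        # Index the concatenated isolate genomes (needed once per reference).
        ["bowtie2-build", reference_fasta, index_prefix],
        # Default parameters plus --no-unal, which omits unaligned reads from the SAM.
        ["bowtie2", "--threads", str(threads), "-x", index_prefix,
         "-1", r1_fastq, "-2", r2_fastq, "--no-unal", "-S", sam],
        # SAM -> BAM, then sort and index.
        ["samtools", "view", "-b", "-o", raw_bam, sam],
        ["samtools", "sort", "-o", bam, raw_bam],
        ["samtools", "index", bam],
    ]


if __name__ == "__main__":
    for cmd in read_recruitment_commands(
        "prochlorococcus-isolates.fa", "sample01-R1.fastq", "sample01-R2.fastq", "sample01"
    ):
        print(shlex.join(cmd))
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.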
Next, we used the program ‘anvi-profile’ to process the BAM file and generate an anvi’o profile database, which stored the coverage and detection statistics of each Prochlorococcus genome in the TARA Oceans data. We used ‘anvi-import-collection’ to link contigs to the genomes from which they originate. Finally, the program ‘anvi-summarize’ generated a static HTML output that gave access to the mean coverage values of each genome (and of individual genes within them) across metagenomes. [...] The anvi’o pangenomic workflow developed for this study consists of three major steps: (1) generating an anvi’o genomes storage database (‘anvi-gen-genomes-storage’) to store DNA and amino acid sequences, as well as functional annotations of each gene in the genomes under consideration, (2) computing the pangenome (‘anvi-pan-genome’) from the genomes storage database by identifying ‘gene clusters’, and (3) displaying the pangenome (‘anvi-display-pan’) to visualize the distribution of gene clusters across genomes, interactively bin gene clusters into logical groups, and inspect the alignment of genes in a given cluster. In our study, a ‘gene cluster’ represents sequences of one or more predicted open reading frames grouped together based on their homology at the translated DNA sequence level. Gene clusters with more than one sequence may contain orthologous or paralogous sequences, or both, from one or more genomes analyzed in the pangenome. To compute the Prochlorococcus pangenome, we first generated an ‘anvi’o genomes storage database’ from the FASTA files of 31 Prochlorococcus isolate genomes using the ‘--internal-genomes’ flag. We then used the program ‘anvi-pan-genome’ with the genomes storage database, the flag ‘--use-ncbi-blast’, and the parameters ‘--minbit 0.5’ and ‘--mcl-inflation 10’.
This program (1) calculates similarities of each amino acid sequence in every genome against every other amino acid sequence using blastp, (2) removes weak hits using the ‘minbit heuristic’, originally described in ITEP, which filters hits based on the aligned fraction between the two sequences, (3) uses the MCL algorithm to identify gene clusters in the remaining blastp search results, (4) computes the occurrence of gene clusters across genomes and the total number of genes they contain, (5) performs hierarchical clustering analyses for gene clusters (based on their distribution across genomes) and for genomes (based on the gene clusters they share) using Euclidean distance and Ward clustering by default, and finally (6) generates an anvi’o pan database that stores all results for downstream analyses and can be visualized with the program ‘anvi-display-pan’. [...] We used the ggplot2 library for R to visualize the relative distribution of genomic groups on the world map. Anvi’o performed all other visualizations, and we finalized our figures for publication using Inkscape, an open-source vector graphics editor (available from […])
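Step (2) above can be illustrated with a small sketch of the minbit filter. It assumes the usual definition of the heuristic, in which the bit score of a hit between sequences A and B is divided by the smaller of the two self-hit bit scores, and it assumes a hit is kept when that ratio is at least the threshold (0.5 in this study). The function names, the tuple layout, and the example scores are hypothetical, and this is not the anvi’o implementation.

```python
def minbit(bitscore_ab, self_bitscore_a, self_bitscore_b):
    """Ratio of a hit's bit score to the smaller of the two self-hit bit scores."""
    return bitscore_ab / min(self_bitscore_a, self_bitscore_b)


def filter_weak_hits(hits, threshold=0.5):
    """Keep hits whose minbit ratio reaches the threshold.

    `hits` is a list of (query, target, bitscore) tuples that must include a
    self hit (query == target) for every sequence; this layout is an assumption.
    """
    # Self-hit bit scores serve as the normalizing reference for each sequence.
    self_scores = {q: score for q, t, score in hits if q == t}
    kept = []
    for query, target, score in hits:
        ratio = minbit(score, self_scores[query], self_scores[target])
        if ratio >= threshold:
            kept.append((query, target, round(ratio, 3)))
    return kept


if __name__ == "__main__":
    # Made-up bit scores for three hypothetical genes.
    example_hits = [
        ("g1", "g1", 200.0),
        ("g2", "g2", 100.0),
        ("g3", "g3", 300.0),
        ("g1", "g2", 80.0),   # 80 / min(200, 100) = 0.80, kept
        ("g1", "g3", 90.0),   # 90 / min(200, 300) = 0.45, removed
    ]
    for hit in filter_weak_hits(example_hits):
        print(hit)
```

Self hits always pass with a ratio of 1.0. Only the surviving hits would be handed to the clustering step, where MCL groups them into gene clusters.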

Pipeline specifications