Computational protocol: Community genomic analyses constrain the distribution of metabolic traits across the Chloroflexi phylum and indicate roles in sediment carbon cycling

Protocol publication

[…] Four lanes of Illumina HiSeq paired-end sequencing were conducted by the Joint Genome Institute. The 4 m sample comprised 360,739,614 reads, the 5 m sample 497,853,726 reads, and the 6 m sample 140,430,174 reads. The read length was 150 bp. Reads were preprocessed with Sickle (https://github.com/najoshi/sickle) using default settings. Only paired-end reads were used in the assemblies. Most of our analyses relied upon IDBA_UD assemblies, run with default parameters, of the sequence data for the 4 m, 5 m, and 6 m samples separately. One barcoded lane of Illumina HiSeq paired-end sequence containing all three depth samples was co-assembled with the IDBA_UD assembler using two different parameter settings: mink 40, maxk 100 (Combo1 assembly) and mink 40, maxk 100, min_count 2 (Combo3 assembly). Combo1 was used only for binning; Combo3 was used during genome curation.
Emergent self-organizing map (ESOM) clustering based on tetranucleotide frequencies of scaffolds produced in the Combo1 assembly was used to identify segregated clusters of scaffolds corresponding to individual genomes. Chloroflexi scaffolds were identified from the taxonomic affiliation of genes predicted with Prodigal (meta-Prodigal option), assigned by best BLAST match; at least 40% of a scaffold’s genes were required to match Chloroflexi sequences for the scaffold to be included. GC content, abundance profile in the 4, 5, and 6 m depth metagenomes, and the taxonomic affiliations of the genes encoded on scaffolds were used to curate a consistent set of scaffolds with a high proportion of Chloroflexi-affiliated predicted genes.
Three Chloroflexi genomes were selected for curation and characterization based on their positions as relatively phylogenetically novel organisms within the Chloroflexi, the presence of a clearly defined ‘genome’ bin within the ESOM analysis, and the taxonomic predictions for the genes within the genomes.
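The two computational ideas in the binning step, tetranucleotide frequency vectors as ESOM input and the 40% Chloroflexi gene threshold, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names are hypothetical, and several details are assumptions not stated in the text: each scaffold is treated as a single window, reverse complements are not merged, and genes without any BLAST hit count toward the denominator of the 40% rule.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order, so every scaffold yields a
# vector with the same layout.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetranucleotide_frequencies(seq):
    """Return a 256-element list of tetranucleotide frequencies for one sequence.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped. Assumption: no reverse-complement merging.
    """
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def is_chloroflexi_scaffold(gene_taxa, min_fraction=0.40):
    """Apply the 40% rule to one scaffold.

    gene_taxa holds one best-hit phylum label per predicted gene, with None
    for genes that had no hit (assumed to count toward the denominator).
    """
    if not gene_taxa:
        return False
    hits = sum(1 for taxon in gene_taxa if taxon == "Chloroflexi")
    return hits / len(gene_taxa) >= min_fraction
```

The frequency vectors would then be passed to a clustering tool; the ESOM training itself is outside the scope of this sketch.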
For each genome bin, the paired reads from all depth samples that mapped to that genome’s scaffolds were reassembled. RBG-2 reads (RBG: Rifle BackGround organism #) were assembled using IDBA_UD under default parameters. The RBG-9 and RBG-1351 reads were assembled using Velvet. For each genome, mini-assemblies using all reads mapping to the ends of scaffolds were conducted until no further connections between scaffolds could be made. Genome completion was examined with a suite of 76 genes selected from a set of single-copy phylogenetic marker genes that show no evidence of lateral gene transfer.
A functional prediction was conducted on open reading frames on scaffolds of interest. This involved amino-acid similarity searches against UniRef90 and KEGG. Additionally, UniRef90 and KEGG were searched back against the translated sequences to identify reciprocal best-BLAST matches. Reciprocal best-BLAST matches were filtered with a minimum bit score of 300. One-way BLAST matches were filtered with a minimum bit score of 60. The translated sequences were also submitted to motif analysis using InterProScan. tRNA sequences were predicted using tRNAscan-SE. Finally, the annotation summaries were ranked: reciprocal best-BLAST matches ranked highest, followed by one-way matches, then InterProScan matches, then gene predictions with no supporting match (annotated as hypothetical proteins). [...]
For specific functional genes of interest, reference datasets were generated from sequences mined from NCBI databases. In all cases, the nearest homolog within the Chloroflexi was determined and included in the reference set; absence of Chloroflexi within these datasets indicates there were no identifiable homologs. Alignments were generated using MUSCLE v. 3.8.31 and curated manually, and phylogenies were inferred using PhyML with 100 bootstrap resamplings. [...]
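The annotation ranking described above (reciprocal best-BLAST match, then one-way match, then InterProScan match, then hypothetical protein) amounts to a small decision rule. The sketch below is a hypothetical illustration in Python, not the authors' pipeline. The bit-score thresholds of 300 and 60 come from the text; the function name, its arguments, and the fallback of a weak reciprocal hit to the one-way tier are assumptions.

```python
RECIPROCAL_MIN_BITS = 300  # minimum bit score for reciprocal best-BLAST matches
ONE_WAY_MIN_BITS = 60      # minimum bit score for one-way BLAST matches


def annotation_tier(reciprocal_bits=None, one_way_bits=None, has_interpro=False):
    """Return (rank, label) for one predicted gene; rank 1 is the strongest evidence.

    reciprocal_bits: bit score of a reciprocal best hit, or None if there is none.
    one_way_bits:    bit score of the best one-way hit, or None if there is none.
    has_interpro:    True if InterProScan reported at least one match.
    """
    if reciprocal_bits is not None and reciprocal_bits >= RECIPROCAL_MIN_BITS:
        return 1, "reciprocal best-BLAST match"
    if one_way_bits is not None and one_way_bits >= ONE_WAY_MIN_BITS:
        return 2, "one-way BLAST match"
    if has_interpro:
        return 3, "InterProScan match"
    return 4, "hypothetical protein"
```

For example, a gene whose reciprocal hit scores 250 bits would be passed as `annotation_tier(reciprocal_bits=250, one_way_bits=250)` and would fall to the one-way tier under these assumptions.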
Existing reference datasets for the 16 ribosomal proteins chosen as single-copy phylogenetic marker genes (RpL2, 3, 4, 5, 6, 14, 15, 16, 18, 22, and 24, and RpS3, 8, 10, 17, and 19) were augmented with sequences mined from recently sequenced genomes from the Chloroflexi, Nitrospirae, and TM7 phyla, among others, from the NCBI and JGI IMG databases. Each individual gene set was aligned using MUSCLE version 3.8.31 and then manually curated to remove end gaps and ambiguously aligned regions. The evolutionary model for each single-gene alignment was selected using ProtTest3. The curated alignments were concatenated to form a 16-gene, 930-taxon, 2,456-position alignment. A maximum likelihood phylogeny for the concatenated alignment was inferred using PhyML under the LG + α + γ model of evolution with 100 bootstrap replicates. [...]
RpS3 sequences were mined from the JGI IMG-M site from all available metagenome sequences, excluding human microbiome samples, using the gene name search tool. In cases where multiple assemblies or samples were available for the same environmental site, a subset of representative metagenome assemblies was selected. A total of 7,707 RpS3 sequences were identified. After removing protein sequences shorter than 200 aa, 1,152 partial and full-length RpS3 sequences were searched against the NCBI nr protein database using BLASTp. The sequences were aligned with the RpS3 reference set as described above, and the Chloroflexi-affiliated sequences were identified using a combination of a Neighbor-Joining Jukes-Cantor tree and MEGAN on the BLASTp data. The final dataset of 794 sequences was aligned and masked, and the best-fitting evolutionary model was determined as described above. A maximum likelihood phylogeny was inferred using PhyML under the LG + α + γ model of evolution with 100 bootstrap replicates.
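Two of the data-handling steps above, concatenating per-gene alignments and discarding RpS3 sequences shorter than 200 aa, can be sketched in a few lines of Python. This is an illustration only. The function names are hypothetical, and padding taxa that lack a gene with gap characters is an assumption; the text does not say how missing genes were handled.

```python
def concatenate_alignments(alignments, gap="-"):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts mapping taxon name -> aligned sequence, one dict
    per gene. Taxa missing from a gene are padded with gaps (an assumption).
    Returns a dict mapping taxon name -> concatenated sequence.
    """
    taxa = sorted(set().union(*alignments))
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must have the same length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def filter_by_length(sequences, min_length=200):
    """Keep protein sequences with at least min_length residues.

    sequences: dict mapping sequence ID -> unaligned amino-acid sequence.
    """
    return {sid: seq for sid, seq in sequences.items() if len(seq) >= min_length}
```

With real data, each dict would be read from a curated alignment file, and the concatenated result would be written out in the format expected by the tree-building program.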
For coverage calculations, Bowtie2 was used to map all reads from each depth (as singletons) to a dataset comprising all of the RBG scaffolds containing the rpS3 genes. Coverage levels were normalized across the three datasets for total number of reads, and the relative ratios and abundances were determined. […]
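The normalization for total read count can be illustrated with a short Python sketch. The text does not give the formula, so the approach here is an assumption: per-sample coverage is scaled to the smallest library so that samples sequenced to different depths become comparable. The function names and the example numbers in the comments are hypothetical.

```python
def mean_coverage(mapped_reads, read_length, scaffold_length):
    """Approximate mean coverage: total mapped bases divided by scaffold length."""
    return mapped_reads * read_length / scaffold_length


def normalize_coverage(raw_coverage, total_reads):
    """Scale each sample's coverage to the smallest library size.

    raw_coverage: dict mapping sample -> coverage of one scaffold in that sample.
    total_reads:  dict mapping sample -> total number of reads in that sample.
    Scaling to the smallest library is an assumption, not the stated method.
    """
    reference = min(total_reads.values())
    return {s: raw_coverage[s] * reference / total_reads[s] for s in raw_coverage}


def relative_abundance(normalized):
    """Express normalized coverages as fractions of their sum."""
    total = sum(normalized.values())
    if total == 0:
        return {s: 0.0 for s in normalized}
    return {s: value / total for s, value in normalized.items()}
```

As a usage example, the read totals reported for the 4 m, 5 m, and 6 m samples (360,739,614; 497,853,726; 140,430,174) could be supplied as `total_reads`, with `raw_coverage` computed per sample from the Bowtie2 mappings.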

Pipeline specifications

Software tools: IDBA-UD, Prodigal, Velvet, InterProScan, tRNAscan-SE, MUSCLE, PhyML, ProtTest, BLASTP, MEGAN, Bowtie2
Databases: UniRef, KEGG
Applications: Genome annotation, Phylogenetics, Nucleotide sequence alignment
Chemicals: Adenosine Triphosphate, Carbohydrates, Carbon