Computational protocol: Photosynthetic functions of Synechococcus in the ocean microbiomes of diverse salinity and seasons

Protocol publication

[…] To analyze the relative abundance of bacterial taxa in the marine microbiome, MetaPhlAn2 was run with default parameters, restricted to bacteria (i.e., --ignore_archaea, --ignore_eukaryotes, and --ignore_viruses). To assess conservation relative to putative Synechococcus strains in the microbiome, a homology search was conducted with BLAST against six complete Parasynechococcus genomes (CC9902, CC9311, CC9605, RCC307, WH7803, and WH8102) from the NCBI repository. In each sample, assembled contigs longer than 1 kbp were used as queries with a stringent threshold (E-value ≤ 1E-10 and sequence identity ≥ 40%), and the best hits were retained. Scatter plots were generated with Plotly.

[...] The shotgun metagenome reads were filtered and assembled with MEGAHIT using default k-mer options. Subsequently, all contigs longer than 1,000 bp were used to predict proteins in the microbiome with FragGeneScan. The predicted proteins in each sample were clustered with the proteins of the 13 complete reference Synechococcus genomes. To compare clustering at different stringencies, CD-HIT clustering was conducted with three combinations of protein sequence similarity and shorter-sequence coverage: 70 over 70 (i.e., 70% sequence similarity over 70% of the shorter sequence), 90 over 90, and 95 over 95. The proportion of clustered proteins for each Synechococcus strain was then calculated and presented in a heatmap.

[...] All 13 complete genomes of Synechococcus strains were obtained from the NCBI FTP repository (April 2015). Detailed genomic information on the Synechococcus strains is summarized in . A total of 35,769 Synechococcus protein sequences were clustered using CD-HIT (with 40% protein similarity and 40% shorter-sequence coverage) to find the orthologous genes in each cluster.
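
The grouping of CD-HIT clusters into core, dispensable, and strain-specific genes can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the standard CD-HIT `.clstr` text layout, a hypothetical naming convention in which each sequence ID is prefixed with its strain name and a `|` separator, and the common pan-genome definitions (core: present in all strains; strain-specific: present in exactly one strain; dispensable: everything in between), none of which the protocol states explicitly. The function names and the example data are illustrative only.

```python
import re
from collections import Counter

# Matches the sequence ID in a .clstr member line such as
# "0\t350aa, >A|p1... *" (the ID sits between ">" and "...").
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Return a list of clusters, each a list of sequence IDs."""
    clusters = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([])  # a header line starts a new cluster
        else:
            match = MEMBER_RE.search(line)
            if match and clusters:
                clusters[-1].append(match.group(1))
    return clusters


def classify_cluster(members, n_strains):
    """Classify one cluster by the number of distinct strains it contains.

    Assumption: IDs look like "<strain>|<protein>" (hypothetical convention).
    """
    strains = {seq_id.split("|", 1)[0] for seq_id in members}
    if len(strains) == n_strains:
        return "core"
    if len(strains) == 1:
        return "strain-specific"
    return "dispensable"


def summarize(clusters, n_strains):
    """Count clusters per category."""
    return Counter(classify_cluster(c, n_strains) for c in clusters)


# Made-up toy input with three strains (A, B, C); not data from the study.
EXAMPLE_CLSTR = """>Cluster 0
0\t350aa, >A|p1... *
1\t348aa, >B|p4... at 85.00%
2\t351aa, >C|p9... at 80.00%
>Cluster 1
0\t120aa, >A|p2... *
1\t118aa, >B|p5... at 72.50%
>Cluster 2
0\t90aa, >C|p3... *
""".splitlines()

example_clusters = parse_clstr(EXAMPLE_CLSTR)
example_counts = summarize(example_clusters, n_strains=3)
print(example_counts)
```

For the 13-strain analysis described here, `n_strains` would be 13, and the strain prefix would have to be adapted to however the protein IDs were actually labeled.
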
As a result, 566 clusters were identified as core genes, 5,206 clusters as dispensable genes, and 4,030 clusters as strain-specific genes. Jaccard indices were calculated using the 5,206 dispensable proteins of the 13 Synechococcus strains. Bi-clustering was conducted with average linkage and Euclidean distances. The phylogeny of Synechococcus was constructed using all 566 identified core proteins. Multiple sequence alignment was conducted separately on the proteins in each core cluster using MUSCLE; the alignments for each strain were concatenated in the same order. Neighbor-joining trees were then constructed using MEGA (ver. 7.0.14) with the following options: 100 bootstrap replicates, Poisson substitution model, uniform rates, and pairwise deletion for gap/missing-data treatment.

[...] Functional analyses of Synechococcus were conducted using the KEGG database v54. Synechococcus proteins were aligned against the protein sequences in the KEGG database using BLASTp (E-value < 1.0E-10). The best hit for each query was assigned based on E-value. When multiple best hits shared the same E-value, the alignment with the highest sequence similarity was selected. Most KEGG genes can be mapped to a KEGG ortholog, but a few cannot; in those cases, we chose the second-best alignment based on E-value and sequence similarity. The same KEGG orthologs of the phycobilisome complex were used both for the comparative genomics of Synechococcus strains and for the functional analysis of the microbiomes. BLASTP searches of microbiome reads were performed against proteins in the KEGG database (E-value < 1E-10; sequence identity > 90%). […]
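
The best-hit rule described above for KEGG ortholog assignment (rank by E-value, break ties by sequence similarity, fall back to the second alignment when the top hit has no ortholog) can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the `Hit` record, the `assign_ortholog` function, the `gene_to_ko` lookup table, and all gene and ortholog identifiers in the example are hypothetical, and percent identity is used as a stand-in for "sequence similarity".

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """One BLASTp alignment for a single query protein (illustrative record)."""
    query: str
    subject: str     # KEGG gene identifier of the aligned database protein
    identity: float  # percent identity, used here as the similarity measure
    evalue: float


def assign_ortholog(hits, gene_to_ko, max_evalue=1e-10, max_rank=2) -> Optional[str]:
    """Pick a KEGG ortholog for one query from its BLASTp hits.

    Hits with E-value >= max_evalue are discarded. The remaining hits are
    ranked by ascending E-value, then by descending identity. The top hit is
    used if its gene maps to an ortholog; otherwise the next-ranked hit is
    tried, up to max_rank hits in total (2 mirrors "the second alignment").
    """
    kept = [h for h in hits if h.evalue < max_evalue]
    ranked = sorted(kept, key=lambda h: (h.evalue, -h.identity))
    for hit in ranked[:max_rank]:
        ko = gene_to_ko.get(hit.subject)
        if ko is not None:
            return ko
    return None  # no usable hit among the top-ranked alignments


# Made-up example: two hits tie on E-value, the more similar one (gene_b)
# has no ortholog mapping, so the second-ranked hit (gene_a) is used.
example_hits = [
    Hit("query_1", "gene_a", 95.0, 1e-50),
    Hit("query_1", "gene_b", 98.0, 1e-50),
    Hit("query_1", "gene_c", 99.0, 1e-5),  # fails the E-value cutoff
]
example_map = {"gene_a": "KO_1", "gene_c": "KO_3"}  # hypothetical mapping

print(assign_ortholog(example_hits, example_map))
```

The `max_rank` parameter is a design choice of this sketch: setting it to 1 disables the fallback, while the protocol as written only mentions falling back to the second alignment.
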

Pipeline specifications