Computational protocol: Genomes of Stigonematalean Cyanobacteria (Subsection V) and the Evolution of Oxygenic Photosynthesis from Prokaryotes to Plastids

Protocol publication

[…] Prior to genome sequencing, the identity of the gDNA was verified by sequencing the 16S rDNA with primers 101F (ACTGGCGGACGGGTGAGTAA) and 1047R (GACGACAGCCATGCAGCACC) and comparing the result against cyanobacterial sequences available in NCBI. Genome sequencing was performed on the Genome Sequencer FLX using Titanium chemistry (Roche Applied Science, Penzberg, Germany), yielding 10- to 32-fold coverage. Genome scaffolding was achieved by 3 kbp paired-end standard runs. The sequencing libraries were prepared from 4 μg of gDNA for whole-genome shotgun sequencing and 5 μg of gDNA for paired-end sequencing, according to the supplier’s instructions. Additionally, a fosmid library was constructed with the Copy Control Fosmid Library Production Kit (Biozym Scientific, Hess. Oldendorf, Germany). Terminal DNA sequences of cloned genomic inserts were determined with an ABI 3730xl DNA Analyzer (Life Technologies, Darmstadt, Germany). Furthermore, Sanger reads were generated from fosmid clones to cover the gaps between contigs for each of the five genomes. Sequence data were assembled with the GS De Novo Assembler Software (ver., 2.3, and 2.5.3). For each genome, large (>500 bp) and small (<500 bp) contigs were obtained, including numerous repetitive elements and insertion segments. For finishing, all DNA sequences were uploaded into the Consed program. The final annotation of the genome sequences, including COGs, was accomplished with the GenDB software. Gene prediction was performed by combining the results of the software tools GLIMMER, CRITICA, and GISMO. [...] Fully sequenced cyanobacterial proteomes were downloaded from NCBI (version March 2011). For the reconstruction of cyanobacterial gene families, we conducted an all-against-all BLAST search (ver. 2.2.17) using the protein sequences. Reciprocal best BLAST hits (rBBH) were identified using thresholds of E-value ≤ 10^-10 and amino acid identity ≥ 30%.
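The rBBH step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, the function names, and the choice of bit score to rank competing hits are assumptions made for illustration, and the sketch handles a single pair of proteomes (A against B and B against A). Only the two thresholds (E-value ≤ 10^-10, identity ≥ 30%) come from the text.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record; field names are illustrative)."""
    query: str
    subject: str
    identity: float  # percent identical amino acids
    evalue: float
    bitscore: float


def best_hits(hits: Iterable[Hit],
              max_evalue: float = 1e-10,
              min_identity: float = 30.0) -> Dict[str, Hit]:
    """Keep, per query, the highest-scoring hit that passes both thresholds."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue or hit.identity < min_identity:
            continue  # fails the E-value or identity cutoff
        current = best.get(hit.query)
        # Assumption: competing hits are ranked by bit score.
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit],
                         max_evalue: float = 1e-10,
                         min_identity: float = 30.0) -> Set[Tuple[str, str]]:
    """Return (protein_in_A, protein_in_B) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue, min_identity)
    reverse = best_hits(b_vs_a, max_evalue, min_identity)
    return {
        (query, hit.subject)
        for query, hit in forward.items()
        if hit.subject in reverse and reverse[hit.subject].subject == query
    }
```

With the identity cutoff passed as a parameter, the same function would also cover the 25% threshold used for the algal/plant and cyanobacterial matrices.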
For the clustering analysis, the overall protein sequence similarity between rBBH proteins, calculated as the percentage of identical amino acids, was multiplied by the length ratio of the two proteins. Clusters of gene families were inferred from the rBBH similarity matrix using the MCL clustering procedure (ver. 1.008), with the inflation parameter (I) set to 2.0. For the reconstruction of a consensus tree phylogeny, 324 gene families present as single copies in all cyanobacterial genomes analyzed were aligned with MAFFT (ver. 6.717b). Phylogenetic trees were reconstructed using the neighbor-joining (NJ) approach: protein sequence distances were calculated with PROTDIST under the JTT substitution model, trees were built with NEIGHBOR, and the consensus phylogeny was computed with CONSENSE. A concatenated alignment was built from the aligned protein sequences, with all genes weighted equally (supplementary fig. S1, Supplementary Material online). A phylogenetic tree was reconstructed from the concatenated alignment using the NJ approach and the software described earlier. A phylogenetic network was reconstructed with SplitsTree (ver. 4.10) using the default parameters. A minimal lateral network (MLN) was reconstructed from the gene families described earlier, using the consensus phylogeny as the reference tree and following a previously described approach. A maximum likelihood phylogeny was reconstructed using PhyML with the LG model + I (estimation of invariant sites) + G (gamma distribution with 4 rate categories). Tree topology (SPR), branch lengths, and rate parameters were optimized. [...]
Sequences of nuclear-encoded proteins from the whole genomes of Arabidopsis thaliana, Oryza sativa subsp. japonica, Physcomitrella patens, Chlamydomonas reinhardtii, Entamoeba histolytica, Dictyostelium discoideum, Filobasidiella neoformans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, Ciona intestinalis, Danio rerio, Gallus gallus, Canis lupus familiaris, and Homo sapiens were obtained from the RefSeq database (release November 2009). Nuclear proteomes of Cyanidioschyzon merolae (version February 2005), Ostreococcus tauri (version 2.0), and Xenopus tropicalis (release 4.1, August 2005) were downloaded from the respective genome project websites. Additionally, 650 fully sequenced genomes of prokaryotes, including those of 46 cyanobacterial representatives, were downloaded from the NCBI RefSeq database (release November 2009). To avoid clustering artifacts from distantly related eukaryotic and prokaryotic sequences, the sequences of cyanobacteria and photosynthetic eukaryotes were first clustered into separate sets of protein families. Matrices of algal/plant and cyanobacterial sequences were constructed from reciprocal best BLAST hits using an all-against-all BLAST with thresholds of E-value ≤ 10^-10 and amino acid sequence identity ≥ 25%. Clusters of homologous protein sequences were reconstructed from each of the matrices using MCL (ver. 08-312, 1.008) with scheme = 7 and I = 2.0. Protein sequences of noncyanobacterial prokaryotes and nonphotosynthetic eukaryotes were added to the plant/algal clusters according to their sequence similarity at the above thresholds, with a limit of three sequences per phylum. Overlapping plant/algal and cyanobacterial clusters were joined. The sequences of protein families were aligned using MAFFT (ver. 6.717b, 2009/12/03). Multiple sequence alignment quality was assessed using the HoT method. Plant/algal protein sequences with a sum-of-pairs score <80% were excluded from the cluster.
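The similarity score that feeds the MCL clustering, percent identity multiplied by the length ratio of the two proteins, can be sketched as follows. The text does not say how the length ratio is oriented, so this sketch assumes shorter over longer, which keeps the score between 0 and 100. The function names and the tab-separated "label label weight" output are illustrative choices, not taken from the protocol.

```python
from typing import Dict, Iterable, List, Tuple


def weighted_similarity(identity_pct: float, len_a: int, len_b: int) -> float:
    """Percent identity scaled by the length ratio of the two proteins.

    Assumption: the ratio is shorter/longer, so the score never exceeds
    the raw percent identity.
    """
    if len_a <= 0 or len_b <= 0:
        raise ValueError("protein lengths must be positive")
    return identity_pct * min(len_a, len_b) / max(len_a, len_b)


def to_abc_lines(pairs: Iterable[Tuple[str, str, float]],
                 lengths: Dict[str, int]) -> List[str]:
    """Format rBBH pairs as 'label<TAB>label<TAB>weight' lines.

    `pairs` holds (protein_a, protein_b, percent_identity) tuples and
    `lengths` maps each protein ID to its length in amino acids.
    """
    lines = []
    for prot_a, prot_b, identity_pct in pairs:
        score = weighted_similarity(identity_pct, lengths[prot_a], lengths[prot_b])
        lines.append(f"{prot_a}\t{prot_b}\t{score:.2f}")
    return lines
```

Lines in this form match the label-pair input that MCL reads with its `--abc` option, so a file written from them could plausibly be clustered with an invocation along the lines of `mcl pairs.abc --abc -I 2.0 -o clusters.txt`; the file names here are placeholders.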
Phylogenetic trees were reconstructed using a maximum likelihood approach with PhyML and the best-fit model as inferred with ProtTest. The ProtTest search for a best-fit model was restricted to nuclear gene substitution models, including the JTT and WAG matrices. These were tested with all combinations of the +I (estimation of invariant sites), +G (gamma distribution with 4 rate categories), and +F (amino acid frequencies taken from the alignment) parameters. Branch lengths, model, and topology were optimized. Among 35,862 trees in total, the WAG model was the best fit for 89% of the trees, with WAG + I + G the most prevalent choice (34%). Genes of endosymbiotic origin in algal and plant genomes were inferred from the phylogenetic trees by searching for a sister-group relationship between cyanobacterial protein sequences and their counterparts encoded by the nuclear genes of the photosynthetic eukaryotes. Protein families in these phototrophs were counted as having resulted from endosymbiotic gene transfer (EGT) if at least one of their members had a cyanobacterial sequence as its nearest neighbor. Concatenated alignments were analyzed and used for tree construction by the same methods as described earlier. […]
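The sister-group test used to flag candidate EGT families can be illustrated with a small topology-only sketch. This is a simplification under stated assumptions, not the published procedure: trees are given as rooted nested tuples of leaf labels, a "nearest neighbor" is read as a sister clade made up entirely of cyanobacterial sequences, and the leaf prefixes in the usage example are hypothetical.

```python
from typing import Callable, List, Union

# A tree is either a leaf label or a tuple of subtrees (assumed rooted).
Tree = Union[str, tuple]


def leaves(tree: Tree) -> List[str]:
    """Collect all leaf labels below a node."""
    if isinstance(tree, str):
        return [tree]
    collected: List[str] = []
    for child in tree:
        collected.extend(leaves(child))
    return collected


def has_cyano_sister(tree: Tree,
                     is_plant: Callable[[str], bool],
                     is_cyano: Callable[[str], bool]) -> bool:
    """True if some all-plant clade has an all-cyanobacterial sister group."""
    if isinstance(tree, str):
        return False
    for index, child in enumerate(tree):
        child_leaves = leaves(child)
        if child_leaves and all(is_plant(name) for name in child_leaves):
            # The sister group is everything else under the same parent.
            sister_leaves = [
                name
                for other_index, other in enumerate(tree)
                if other_index != index
                for name in leaves(other)
            ]
            if sister_leaves and all(is_cyano(name) for name in sister_leaves):
                return True
    # No match at this node: look deeper in the tree.
    return any(has_cyano_sister(child, is_plant, is_cyano) for child in tree)


# Usage with hypothetical label prefixes for plant and cyanobacterial leaves.
def plant(name: str) -> bool:
    return name.startswith(("ath_", "osa_"))


def cyano(name: str) -> bool:
    return name.startswith(("syn_", "nos_"))


egt_like = ((("ath_1", "osa_1"), ("syn_1", "nos_1")), "eco_1")
not_egt = (("ath_1", "sce_1"), ("syn_1", "eco_1"))
print(has_cyano_sister(egt_like, plant, cyano))
print(has_cyano_sister(not_egt, plant, cyano))
```

Maximum likelihood trees from PhyML are unrooted, so a real analysis would need either a rooting step or a nearest-neighbor definition that does not depend on the root; this sketch sidesteps that by assuming a rooted input.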

Pipeline specifications

Software tools: Newbler, Consed, GenDB, Glimmer, CRITICA, GISMO, MAFFT, PHYLIP, SplitsTree, PhyML, ProtTest
Applications: Genome annotation, Phylogenetics
Organisms: Stigonematalean cyanobacteria (subsection V)