Computational protocol: Analysis of synonymous codon usage patterns in sixty-four different bivalve species

Protocol publication

[…] We considered two bivalve mollusk species with a fully sequenced genome (C. gigas and P. fucata) and 62 other species whose transcriptomes had been sequenced with next-generation sequencing technologies and deposited in public sequence databases. When both 454 Life Sciences and Illumina sequencing reads were available for the same species, the Illumina reads were chosen because of their higher throughput and lower sequencing error rate. Namely, Illumina reads were used for Anadara trapezia, Arctica islandica, Argopecten irradians, Astarte sulcata, Atrina rigida, Azumapecten farreri, Bathymodiolus platifrons, Cardites antiquata, Cerastoderma edule, Corbicula fluminea, Crassostrea angulata, Crassostrea corteziensis, Crassostrea hongkongensis, Crassostrea virginica, Cycladicama cumingii, Cyrenoida floridana, Diplodonta sp. VG-2014, Donacilla cornea, Elliptio complanata, Ennucula tenuis, Eucrassatella cumingii, Galeomma turtoni, Glossus humanus, Hiatella arctica, Lampsilis cardium, Lamychaena hians, Mactra chinensis, Margaritifera margaritifera, Mercenaria campechiensis, Meretrix meretrix, Mizuhopecten yessoensis, Mya arenaria, Myochama anomioides, Mytilus californianus, Mytilus edulis, Mytilus galloprovincialis, Mytilus trossulus, Neotrigonia margaritacea, Ostrea chilensis, Ostrea edulis, Ostrea lurida, Ostreola stentina, Pecten maximus, Perna viridis, Pinctada martensii, Placopecten magellanicus, Polymesoda caroliniana, Pyganodon grandis, Ruditapes decussatus, Ruditapes philippinarum, Sinonovacula constricta, Solemya velum, Sphaerium nucleus, Uniomerus tetralasmus and Villosa lienosa. The 454 Life Sciences sequences were used for Bathymodiolus azoricus, Geukensia demissa, Laternula elliptica, Mimachlamys nobilis, Pinctada maxima, Saccostrea glomerata and Tegillarca granosa.
Details about the data used for the different species are provided in .
Sequence data were processed as follows: predicted CDS from the fully sequenced genomes of C. gigas (release 9) and P. fucata were retrieved from http://oysterdb.cn and http://marinegenomics.oist.jp/pinctada_fucata, respectively. De novo transcriptome assemblies were performed for the other 62 bivalve species with the CLC Genomics Workbench (v.7.5, CLC Bio, Aarhus, Denmark), using the de novo assembly tool with the “automatic word size” and “automatic bubble size” parameters selected and the minimum allowed contig length set to 300 bp.
In all transcriptomes, ORFs (Open Reading Frames) longer than 100 codons were predicted with TransDecoder (http://transdecoder.sourceforge.net). We selected the predicted CDS of C. gigas and of one representative species each for the Imparidentia (R. decussatus), Protobranchia (S. velum) and Palaeoheterodonta (P. grandis) lineages to identify a subset of evolutionarily conserved protein-coding genes with a 1:1 orthology ratio across Bivalvia. This was achieved by performing reciprocal tBLASTx searches (the e-value threshold was set at 1 × 10⁻¹⁰ and only hits displaying sequence identity >50% were considered). This procedure resulted in a selection of 2,846 conserved protein-coding genes, whose orthologous sequences were then retrieved in the remaining 60 species.
Because the transcriptomes differed in tissue and developmental-stage origin, sequencing platform and sequencing depth, several of these evolutionarily conserved sequences could not be identified, or were fragmented, in some transcriptomes. To ensure a minimum level of quality, each selected species had to display at least 25% of the sequences included in the dataset of evolutionarily conserved genes, with an average length >500 nucleotides. A number of additional transcriptomes derived from publicly available data did not meet these criteria and were therefore not included in our analyses. [...]
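The two filters described above, the thresholds on reciprocal tBLASTx hits and the per-species inclusion criteria, can be sketched in Python as follows. This is only an illustrative sketch, not the authors' pipeline: the function names and the tuple layout of a hit are hypothetical, picking the best hit per query by bit score is an assumption (the protocol does not say how the best hit was chosen), and the e-value threshold is assumed to be inclusive.

```python
from statistics import mean

EVALUE_MAX = 1e-10      # e-value threshold stated in the protocol (assumed inclusive)
IDENTITY_MIN = 50.0     # percent identity; hits must exceed this value
N_CONSERVED = 2846      # size of the conserved gene set
MIN_FRACTION = 0.25     # at least 25% of the conserved genes must be found
MIN_MEAN_LENGTH = 500   # average length must exceed 500 nucleotides


def best_hits(hits):
    """Map each query to its best passing subject.

    `hits` is an iterable of hypothetical tuples:
    (query, subject, percent_identity, evalue, bitscore).
    The best hit is taken as the highest bit score (an assumption).
    """
    best = {}
    for query, subject, identity, evalue, bitscore in hits:
        if evalue > EVALUE_MAX or identity <= IDENTITY_MIN:
            continue  # hit fails the protocol's thresholds
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return the (a, b) pairs that are each other's best passing hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


def passes_quality(ortholog_lengths):
    """Apply the species inclusion criteria to a list of ortholog lengths (nt)."""
    if not ortholog_lengths:
        return False
    fraction = len(ortholog_lengths) / N_CONSERVED
    return fraction >= MIN_FRACTION and mean(ortholog_lengths) > MIN_MEAN_LENGTH
```

In this sketch a species with, say, 712 recovered orthologs averaging 600 nucleotides would be retained, whereas one with 700 orthologs of the same length would fall just below the 25% cutoff.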
The sets of evolutionarily conserved genes retrieved for each species were individually processed with the cusp tool of the EMBOSS package, obtaining codon frequencies and GC composition for each codon position. Relative synonymous codon usage (RSCU) values for each individual codon were calculated for each species as described by Sharp and colleagues. The effective number of codons (ENC) for each species was calculated with the EMBOSS chips tool, summing codons over all sequences. The sENC-X values were determined for every amino acid for each species and scaled to a range of values between 0 and 1. EMBOSS chips was also used to calculate ENC for individual genes whenever necessary. We identified a reference set of 50 highly expressed genes for the calculation of the Codon Adaptation Index (CAI), based on their average expression in C. gigas digestive gland (SRA:SRX093412), gills (SRA:SRX093414) and hemocytes (SRA:SRX093417) RNA-seq libraries and on their inclusion in the above-mentioned set of 2,846 genes conserved across bivalves. Gene expression was calculated as TPM (Transcripts Per Million) with the RNA-seq mapping tool included in the CLC Genomics Workbench 8.5 (Aarhus, Denmark), setting the length and similarity fraction parameters to 0.75 and 0.98, respectively, and the insertion/deletion/mismatch penalties to 3. Orthologous genes were used for CAI calculation in the other species. CAI values were computed with CAI calculator 2. The gene expression levels of M. galloprovincialis transcripts were calculated using the digestive gland (SRA:SRX126945-8), gills (SRA:SRX389466) and hemocytes (SRA:SRX389338) RNA-seq libraries.
Scatter plots were generated between ENC and the average GC content at the third codon position (GC3) for each species, between ENC and sENC-X, and between ENC and CAI; Pearson correlation coefficients and linear regression analyses were computed with R 3.1.0 (http://www.r-project.org). […]
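As an illustration of two of the quantities used above, the following Python sketch computes RSCU and GC3 from a set of coding sequences. It is not the EMBOSS implementation: the function names are hypothetical, the standard genetic code is assumed, and RSCU is taken in its usual form, the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. Stop codons are left out of RSCU but counted in GC3, which is a choice made here for simplicity.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumed table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(cds_list):
    """Sum codon counts over all in-frame coding sequences."""
    counts = Counter()
    for seq in cds_list:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    return counts


def rscu(counts):
    """RSCU per sense codon: observed count / mean count of its synonymous family."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent; RSCU is undefined for its codons
        expected = total / len(codons)
        for c in codons:
            values[c] = counts.get(c, 0) / expected
    return values


def gc3(counts):
    """Fraction of counted codons with G or C at the third position."""
    total = sum(counts.values())
    gc = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc / total if total else float("nan")
```

Under these assumptions, `rscu(count_codons(sequences))` gives one value per sense codon, where a value above 1 means the codon is used more often than expected under equal usage of its synonyms.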

Pipeline specifications

Software tools CLC Genomics Workbench, TransDecoder, TBLASTX, EMBOSS
Applications RNA-seq analysis, Nucleotide sequence alignment
Diseases Ataxia Telangiectasia
Chemicals Amino Acids