Computational protocol: Exploring the larval transcriptome of the common sole (Solea solea L.)

Protocol publication

[…] Sequencing was performed with GS FLX Titanium series reagents and a single region on a Genome Sequencer FLX instrument. Bases were called with 454 software by processing the pyroluminescence intensity for each bead-containing well at each nucleotide incorporation. A total of 909,466 sequence reads were produced from the normalised cDNA library constructed using a mixture of larval and adult tissues (see above). All Roche 454 FLX reads were trimmed to remove adapter sequences and have been deposited in the NCBI Sequence Read Archive (SRA) under accession number SRA058691. An additional set of 314,486 reads was available from a second cDNA library of skeletal muscle (L. Bargelloni, unpublished data). In addition, 21 mRNA sequences for S. solea were available in NCBI (as of 1st September 2011). All 454 sequence reads and all mRNAs were then assembled with Newbler 2.6 software using default settings. Newbler produces “contigs”, “isotigs” and “isogroups”. An isogroup is a collection of contigs containing reads that imply connections between them. An isotig is meant to be analogous to an individual transcript; different isotigs from a given isogroup can be inferred to be splice variants. Ideally, isogroups are transcripts, isotigs are splice variants of one transcript and contigs are separate exons. [...]
The Basic Local Alignment Search Tool (BLAST) was used to annotate S. solea isotigs and contigs. Blast2GO software was used to perform BLASTN searches (E-value cut-off < 1.0e-7) against the NCBI nucleotide nr database, as well as BLASTX searches (E-value cut-off < 1.0e-5) against the NCBI protein nr database and the SWISSPROT database. With this approach, Gene Ontology (GO) term associations for “Biological process”, “Molecular function” and “Cellular component” were also obtained for transcripts with a significant match to a known protein.
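The E-value cut-offs above can be illustrated with a minimal Python sketch. It assumes the BLAST hits are available as standard 12-column tabular output (E-value in the eleventh column); the protocol itself ran the searches through Blast2GO, so the file format, the function names and the "best hit per query" choice are illustrative assumptions, not the authors' pipeline.

```python
from typing import Dict, Iterable, List, NamedTuple

# E-value cut-offs quoted in the text (strictly "less than").
BLASTN_MAX_EVALUE = 1.0e-7
BLASTX_MAX_EVALUE = 1.0e-5


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column BLAST tabular lines; column 11 holds the E-value."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[10])))
    return hits


def best_significant_hits(hits: Iterable[Hit], max_evalue: float) -> Dict[str, Hit]:
    """Keep, for each query, the lowest-E-value hit below the cut-off.

    Retaining only the best hit is an assumption made for this sketch.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue >= max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return best
```

Calling `best_significant_hits(parse_tabular(open_file), BLASTN_MAX_EVALUE)` on a nucleotide search, and the same with `BLASTX_MAX_EVALUE` on a protein search, would reproduce the two thresholds; any query absent from the returned dictionary has no significant match.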
To increase the number of annotated transcripts, two additional approaches were attempted: i) BLASTX (E-value cut-off < 1.0e-5) and BLASTN (E-value cut-off < 1.0e-7) searches against proteins and high-quality draft transcriptomes of Danio rerio, Gasterosteus aculeatus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Homo sapiens, and Mus musculus available on the Ensembl Genome Browser (release 56); ii) BLASTN searches (E-value cut-off < 1.0e-7) against the D. rerio, O. latipes, Gadus morhua, G. aculeatus, Ictalurus furcatus, I. punctatus, Salmo salar, Oncorhynchus mykiss, Oreochromis niloticus, Pimephales promelas, H. sapiens, and M. musculus databases stored in the NCBI UniGene database.
Annotated transcripts were then further clustered by mapping against a single-species proteome, i.e. by looking for independent isotigs or contigs that putatively encoded the same protein (named “redundant” for brevity). Two or more isotigs/contigs were clustered together when they displayed the same annotation, in terms of Ensembl Gene IDs, in at least three of five fish species (D. rerio, G. aculeatus, O. latipes, T. nigroviridis, and T. rubripes). When two transcripts encoded the same protein, only the longer one was used for microarray design. [...]
Gene expression analyses were performed with the Agilent-036353 Solea solea oligo microarray (GEO accession: GPL16124). All unique annotated transcripts (15,385; see Results), excluding those annotated only with UniGene (2,549), were employed for microarray design.
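The redundancy rule described above (same Ensembl Gene ID in at least three of the five fish species, keep the longest transcript) can be sketched as follows. The data layout, the function names and the transitive (single-linkage) merging through a union-find structure are assumptions made for illustration; the text does not say how chains of pairwise matches were resolved.

```python
from itertools import combinations
from typing import Dict, List, Optional

SPECIES = ("D. rerio", "G. aculeatus", "O. latipes", "T. nigroviridis", "T. rubripes")
MIN_SHARED = 3  # "at least three of five fish species"

Annotation = Dict[str, Optional[str]]  # species -> Ensembl Gene ID (or missing)


def shared_annotations(a: Annotation, b: Annotation) -> int:
    """Count species in which both transcripts hit the same gene ID."""
    return sum(1 for sp in SPECIES if a.get(sp) is not None and a.get(sp) == b.get(sp))


def cluster_redundant(annotations: Dict[str, Annotation]) -> List[List[str]]:
    """Group transcripts that share >= MIN_SHARED annotations (assumed transitive)."""
    parent = {t: t for t in annotations}

    def find(t: str) -> str:
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    # Pairwise comparison is quadratic; adequate for a sketch only.
    for a, b in combinations(annotations, 2):
        if shared_annotations(annotations[a], annotations[b]) >= MIN_SHARED:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for t in annotations:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())


def pick_representatives(clusters: List[List[str]], lengths: Dict[str, int]) -> List[str]:
    """Keep only the longest transcript of each cluster."""
    return [max(cluster, key=lambda t: lengths[t]) for cluster in clusters]
```

With hypothetical transcripts `t1` and `t2` sharing gene IDs in three species and `t3` sharing only two, `t1` and `t2` fall into one cluster and the longer of the two is retained alongside `t3`.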
Transcript matches with Ensembl protein or transcript databases were then exploited to infer the orientation of sole sequences by identifying i) transcripts with unequivocal orientation (sequence frame concordant across all matches), ii) transcripts with ambiguous orientation (sequence frame not concordant across matches), and iii) transcripts with unknown orientation (transcripts whose match was against the NCBI nr nucleotide database). A single probe was designed for each annotated sequence with unequivocal orientation (10,987), while, whenever possible, two probes, one in each orientation (sense and antisense), were designed for isotigs with ambiguous or unknown orientation (1,849). A total of 14,674 oligonucleotide probes (60 nt) representing 12,836 transcripts were synthesised in situ onto the array using Agilent non-contact ink-jet technology (8 × 15 K format, including default positive and negative controls).
A single-dye (Cy3) labelling scheme was implemented, and a mixture of 10 different viral poly-adenylated RNAs (Agilent Spike-In Mix) was added to each RNA sample to monitor labelling and hybridisation quality as well as the microarray analysis work-flow. Sample labelling and hybridisation were performed as reported in Ferraresso et al. with slight modifications. Processed slides were scanned at 5 μm resolution with an Agilent G2565BA DNA microarray scanner. Default settings were modified to scan the same slide twice at two different sensitivity levels (XDR Hi 100% and XDR Lo 10%). The two linked images generated were analysed together, and data were extracted and background subtracted using the standard procedures contained in Agilent Feature Extraction (FE) Software version 9.5.1. [...]
The normalisation procedure was performed using R statistical software. Microarray data were quantile normalised across all arrays.
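Quantile normalisation forces every array to share the same intensity distribution: each value is replaced by the mean, across arrays, of the values holding the same rank. The authors performed this step in R; the Python function below is only a minimal sketch of the general technique, with a hypothetical function name and with ties broken by probe order rather than averaged, which is a simplification.

```python
from typing import List, Sequence


def quantile_normalise(arrays: Sequence[Sequence[float]]) -> List[List[float]]:
    """Quantile-normalise arrays that list the same probes in the same order.

    arrays[i][j] is the intensity of probe j on microarray i.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")

    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]

    normalised = []
    for a in arrays:
        # Probe indices ordered by increasing intensity on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalised.append(out)
    return normalised
```

After the call, every array contains exactly the same set of values, and only their assignment to probes differs from array to array.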
To exclude poor-quality probes from statistical analyses, hybridisation success and mean fluorescence for each probe were evaluated across a total of 31 experiments (four biological replicates for each developmental stage, with the exception of 13 dph, for which one biological replicate was discarded). A probe was considered unreliable when it showed successful hybridisation (“glsFound” equal to 1) in fewer than 50% of the experiments and a mean fluorescence below 10. Using this approach, 753 probes were filtered out, leaving 13,921 probes for all further analyses. Of the 753 excluded probes, 546 (72.5%) were sense or antisense oligos designed for transcripts with unknown orientation for which the second probe (antisense or sense, respectively) showed good performance.
Cluster analyses were performed on the entire dataset using the AutoSOME strategy, modifying the default settings to increase the number of ensemble runs to 500 while maintaining the p-value threshold at 0.05. A fuzzy cluster network illustrating the AutoSOME results was generated with the visualisation tool Cytoscape. Bidirectional Hierarchical Clustering (HCL) and Principal Component Analysis (PCA), as implemented in TIGR MultiExperiment Viewer (MeV, version 4.5.1), were also performed on the entire gene expression dataset. Expression profile comparisons between developmental stages were performed using Significance Analysis of Microarrays (SAM) software. Two-class comparisons (FDR 1%, minimal Fold-Change (FC) ≥ 2) were performed by considering each time point as independent. SAM quantitative correlation analyses (FDR 0%) were also performed to reveal genes whose expression was positively or negatively correlated with either developmental stage or sample projection on the PCA Y-axis. A non-parametric Spearman rank-correlation test was used to assess the correlation between the expression values measured by real-time RT-PCR and by microarray for a set of 10 candidate genes.
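As an illustration of the statistic behind the RT-PCR versus microarray comparison, Spearman's rho is the Pearson correlation computed on ranks, with tied values receiving their average rank. The sketch below is a plain-Python rendering of that definition; the function names are hypothetical, it computes only the coefficient (no p-value), and it is not the software the authors used.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between paired measurements."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

For one candidate gene, `x` would hold the RT-PCR values and `y` the microarray values for the same samples; a rho close to 1 indicates that the two platforms rank the samples in the same order.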
Spearman correlation tests were implemented using SPSS 12.0. [...]
Functional annotation analysis of differentially expressed genes was performed using the DAVID (Database for Annotation, Visualisation and Integrated Discovery) web-server. “Biological process”, “Molecular function” and “Cellular component” annotations were performed by setting gene count = 4 and EASE = 0.05. KEGG pathway analysis was also performed with gene count = 4 and EASE = 0.05. Because DAVID contains functional annotation data for a limited number of species, it was necessary to link sole transcripts with sequence identifiers that could be recognised in DAVID. This was done using S. solea matches with zebrafish proteins and transcripts (see “Transcriptome annotation” section). Finally, D. rerio Ensembl Gene IDs were obtained from the corresponding Ensembl protein and transcript entries using the BIOMART data mining tool. [...]
All annotated isotigs and contigs were used for microsatellite repeat searches using MISA software. A sequence was considered to contain a microsatellite if it possessed any of the following repeated motifs: at least 6 repeated dinucleotides or at least 5 repeated tri-, tetra-, penta- or hexanucleotide motifs. […]
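The microsatellite criteria above (at least 6 dinucleotide repeats, or at least 5 repeats of a 3 to 6 nt motif) can be expressed with back-referencing regular expressions. The sketch below is not MISA and makes no claim to reproduce its output; the function names are hypothetical, and skipping motifs that are themselves repeats of a shorter unit (such as "AA" or "ATAT") is an assumption added to avoid reporting homopolymers as microsatellites.

```python
import re
from typing import List, Tuple

# Motif length -> minimum number of tandem repeats, as stated in the text.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif: str) -> bool:
    """True when the motif is not a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_microsatellites(sequence: str) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for every qualifying tandem repeat."""
    seq = sequence.upper()
    found = []
    for unit_len, min_repeats in MIN_REPEATS.items():
        # One captured unit followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                found.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(found)


def contains_microsatellite(sequence: str) -> bool:
    """Apply the inclusion rule to a single isotig or contig sequence."""
    return bool(find_microsatellites(sequence))
```

A sequence with six consecutive "CA" units is reported, whereas five "CA" units or a run of a single nucleotide is not.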

Pipeline specifications

Software tools: BLASTN, Blast2GO, BLASTX, Agilent Feature Extraction, SAM
Applications: Gene expression microarray analysis, Genome data visualization
Chemicals: Folic Acid, Vitamin A