Computational protocol: Alternative Splicing within and between Drosophila Species, Sexes, Tissues, and Developmental Stages

Similar protocols

Protocol publication

[…] Within-species analyses are analyses where only samples within a species were directly compared (Figs –, , – Figs, , and ). We used seqtk (https://aur.archlinux.org/packages/seqtk-git) to randomly choose paired-end RNA-seq reads so that within each species, each sample used had the same number of reads (). The pooled data from all four species had a total of 957,749,296 reads. For samples with read lengths > 50 bp, we used read quality data from FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc) to determine cutoffs and trimmed the reads to 50 bp.We used bowtie2-build and TopHat v2.0.13 [] to map the reads to each genome (D. pseudoobscura Release 3.1 downloaded from http://www.flybase.org, D. miranda assembly [], D. albomicans assembly [], D. nasuta assembly, unpublished, ) using the parameters —b2-sensitive,—coverage-search, and—microexon-search and then ran Cufflinks v2.1.1 [] without a reference annotation on each sample. For each species, we used cuffmerge to combine all of the cufflinks-generated annotations to create a master annotation. We used cuffdiff [] with this master annotation to obtain expression data for each sample. We use the “commonly adopted arbitrary inclusion threshold” of FPKM = 1 [] that is used in other Drosophila gene expression studies [].We used picard v1.106 (http://picard.sourceforge.net) to obtain insert size mean and standard deviation for each sample. We used MATS [, ] to on one hand detect and annotate putative AS events from our aligned RNA-seq data (bam files) and cuffmerge annotations. MATS was also used to get Ψ (“PSI”, Percent Spliced In) values for each alternatively spliced exon by running pairwise comparisons between all samples within each species and taking an average Ψ value for each exon calculated using reads on target and junction counts, excluding samples for which the exon was not classified as alternatively spliced. The program was run using default parameters except setting the “-analysis” parameter to “P” for our paired-end sequencing reads. The anchor/overhang length was the tophat2 default of 8 bp, so at least eight nucleotides had to map to each end of a given junction. The annotation of AS events by MATS included intron retention events that overlapped with splice sites of other annotated AS events. To validate our pipeline, we also tested a few tissues using an alternative pipeline. We used AltEventFinder [] to annotate skipped exon alternative splicing events and MISO [, ] to compute Ψ values. shows correlations of the results of both pipelines for post-embryonic tissues in D. miranda computed in R using the corr function.heatmap.2 was used in R to generate heatmaps, which uses hclust to cluster samples. The R function prcomp was used to perform the PCAs. The R function wilcox.test was used to perform Wilcoxon rank sum tests ().Note that for analyses within species, we directly compare datasets subsampled down to the same number of reads () that are mapped to the same genome assembly. Therefore, differences in genome assembly quality among the four genomes and the differences in the numbers of reads used among species are not expected to impact our within-species inferences. [...] Interspecific comparisons are analyses in which samples among different species were compared directly (, , and and Figs). Pairwise whole genome alignments for each pair of species were done using the software Mercator (https://www.biostat.wisc.edu/~cdewey/mercator/) [] and MAVID []. First we used Mercator to build an orthology map for each pair of species. Then MAVID was used to perform global whole genome alignments. Finally, for each exon/gene in each species, the coordinates of the corresponding ortholog in the other species were determined using the sliceAlignment program (Mercator distribution).Using pairwise alignments of D. pseudoobscura, D. miranda, D. albomicans, and D. nasuta to D. melanogaster, we kept genes/exons aligned with >0.5 overlap between all pairs. We kept all genes/exons with 1:1 orthology. If there were multiple genes/exons from one species that aligned to the same D. melanogaster gene/exon, we kept the pair with the highest overlap score. If the highest overlap score was shared between the D. melanogaster gene/exon and more than one gene/exon from the other species, we did not use the gene(s)/exon(s)in our analysis. If there were multiple D. melanogaster genes/exons that aligned to the same gene/exon from the other species, we kept the pair with the highest overlap score. If the highest overlap score was shared between one gene/exon from the other species and more than one D. melanogaster gene/exon, we did not use the gene(s)/exon(s) in our analysis. For comparisons over embryonic development, this left us 6,707 genes with 1:1 orthologous relationships between D. pseudoobscura and D. miranda and 1,122 exons with 1:1 orthologous relationships between D. pseudoobscura and D. miranda that were also annotated as alternatively spliced in at least one sample and expressed with an FPKM > 1 in at least one sample per species. For comparisons of post-embryonic tissues, we recovered 3,005 genes with 1:1 orthologous relationships among D. pseudoobscura, D. miranda, D. albomicans, and D. nasuta and 472 exons with 1:1 orthologous relationships among the four species that were also annotated as alternatively spliced in at least one sample and expressed with an FPKM > 1 in at least one sample per species.We used orthology information to create annotations for each species containing only genes orthologous among all species (D. pseudoobscura, D. miranda, D. albomicans, and D. nasuta for comparisons of post-embryonic samples; D. pseudoobscura and D. miranda for comparisons of embryonic samples) and ran kallisto [] using these annotations. We used TPM (Transcripts Per Million) values from each sample for each gene for our interspecies gene expression analysis. We used sleuth [] to normalize TPM values between samples and to compute Jensen-Shannon divergence between samples.We used bowtie2-build and TopHat v2.0.13 [] to map the reads to each genome (D. pseudoobscura Release 3.1 downloaded from http://www.flybase.org, D. miranda assembly [], D. albomicans assembly [], D. nasuta assembly (unpublished, ) using the parameters —b2-sensitive,—coverage-search, and—microexon-search and then ran Cufflinks v2.1.1 [] using the–G parameter and the reference annotations used for orthology analysis on each sample.For interspecies splicing analyses, we ran MATS [, ] to on one hand detect and annotate putative AS events from our aligned RNA-seq data (bam files) and orthology annotations. MATS was also used to get Ψ (“PSI”, Percent Spliced In) values for each alternatively spliced exon. The program was run using the same parameters as for the “within-species” analysis. MATS compares splicing in samples pairwise, and we compared all samples within species that shared the same read lengths. Spermatheca reads from D. nasuta and D. albomicans were trimmed to 76 base pairs (the size of all other reads in these species). In each sample, we took an average Ψ value for each exon calculated using reads on target and junction counts, excluding samples for which the exon was not classified as alternatively spliced. For exons alternatively spliced in some samples but not others, we looked at the expression calculated for that exon in cufflinks. If the exon had an FPKM value < 1 and the upstream or downstream exons had an FPKM value > 1, the exon was assigned a Ψ value of 0. If the exon had an FPKM value > 1, the exon was assigned a Ψ value of 1. If the exon had an FPKM value < 1 and the upstream and downstream exons had an FPKM value < 1, the exon was not assigned a Ψ value.heatmap.2 was used in R to generate heatmaps, which uses hclust to cluster samples. We computed Spearman (, ) and Pearson () correlations for GE and AS in adult tissues and embryonic stages in R using the corr function. Jensen-Shannon divergence of GE () was computed using sleuth[].For analyses among species, we directly compare only orthologous genes and exons recovered in all four species. Therefore, our inferences are limited by the least complete genome assembly and annotation. We did not subsample datasets for our interspecies analyses (), but instead required that each orthologous gene or exon be expressed at FPKM > 1 in at least one sample per species to be included in our analyses. We consider our interspecies analysis conservative; we may be underestimating the number of orthologous genes and comparable splicing events. However, due to our approach, we do not expect a considerable number of splicing events or expressed genes to be erroneously identified as species-specific. […]

Pipeline specifications

Software tools Seqtk, FastQC, Bowtie, TopHat, Cufflinks, Picard, MISO, Hclust, MAVID, kallisto
Applications RNA-seq analysis, Nucleotide sequence alignment
Organisms Drosophila melanogaster, Homo sapiens, Caenorhabditis elegans