Computational protocol: The bench scientist's guide to statistical analysis of RNA-Seq data

Similar protocols

Protocol publication

[…] As a method for characterizing global changes in transcription, RNA-Seq is an attractive option because of the ability to quantify differences in mRNA abundance in response to various treatments and diseases, as well as to detect alternative splice variants and novel transcripts []. Compared to microarray techniques, RNA-Seq eliminates the need for prior species-specific sequence information and overcomes the limitation of detecting low abundance transcripts. In addition, early studies have demonstrated that RNA-Seq is very reliable in terms of technical reproducibility []. As a result, biologists studying an array of model and non-model organisms are beginning to utilize RNA-Seq analysis with ever growing frequency [-]. However, without experience using bioinformatics methods, the large number of choices available to analyze differential expression can be overwhelming for the bench scientist (see Table one in []).Essentially, RNA-Seq consists of five distinct phases, 1) RNA isolation, 2) library preparation, 3) sequencing-by-synthesis, 4) mapping of raw reads to a reference transcriptome or genome and 5) determining significance for differential gene expression (for review see []). In an effort to familiarize the bench scientist with the post-sequencing analysis of RNA-Seq data (phase 5), we have developed an analysis strategy based on currently available bioinformatics tools. Here, we compare three statistical tools used to analyze differential gene expression: edgeR, DESeq and Limma [-]. Based on their performance, we present an analysis strategy that combines these tools in order to generate an optimized list of genes that are differentially expressed. Finally, we highlight several aspects of RNA-Seq analysis that have the potential to lead to misleading conclusions and discuss options to minimize these pitfalls. [...] The raw reads described in Table  were aligned to the soybean reference transcriptome [] using the mapping tool Novoalign, a short read aligner demonstrated to be highly accurate [,]. When differential expression was analysed subsequently, the total number of genes with significantly altered transcript abundance in plants exposed to elevated ozone was 11,995 for edgeR, 11,317 for DESeq and 9,131 for Limma. Since RNA-Seq generates count data, it is more appropriate to use a discrete probability distribution to analyze differential gene expression []. Therefore, edgeR and DESeq, which are based on the negative binomial distribution, are compatible with the data generated by RNA-Seq [,]. In contrast, Limma [] was adapted to analyze RPKM values using a method previously developed for continuous data from microarray studies (fluorescence values) and is based on the t-distribution []. The Limma method was clearly very different from the two negative binomial distribution methods, but even between edgeR and DEseq there were 678 additional genes identified by edgeR as differentially expressed, representing approximately 6% of the significant genes. [...] A first potential limitation of this approach is that it may be too conservative, as evidenced by the 2,242 marginally significant genes that were removed from the final optimized list (Figure  , Step C). The behavior of these genes was analysed in the context of changes to transcripts with broadly similar functions, using the MapMan expression tool [] to analyze functional category significance for each of the lists of marginally significant genes (Table  ). This tool first identified 11 functional categories from the optimized list of differentially expressed genes consisting of a subset of genes that collectively responded to elevated ozone in a similar manner; i.e., the expression profile of each significant functional category was different from the expression profile of all other categories. When the lists of marginally significant genes were analyzed subsequently, most of these categories were found not to be significantly different, indicating that the eliminated genes did not respond in a manner similar to the optimized list of genes. However, statistical significance was achieved for several categories. Despite having an expression profile consistent with the remaining genes included in the optimized list, 320 RNA, 70 stress, 36 hormone metabolism, 19 DNA, and 10 mitochondrial electron transport-related genes were eliminated based on a non-significant determination by one of the two statistical tools.An additional limitation was uncovered by further investigation of the final list of optimized genes. After a cursory examination of several genes that were previously characterized to be regulated by growth in elevated ozone, we identified a potential issue with the statistical analysis that preferentially impacted the high abundance genes. It is well-documented that plants grown in elevated ozone exhibit reduced photosynthesis, increased antioxidant capacity and increased protein turnover []. Four high abundance genes (Glyma05g25810, Glyma20g27950, Glyma17g37280 and Glyma11g11460) involved with these processes were not found to be differentially expressed by at least one of the statistical tools used in this analysis, despite RPKM values with obvious differences and analysis of variance (ANOVA) results that indicated significance (Table  ). A more detailed examination across a range of RPKM values support the finding of an increase in type II error for high abundance genes. Four out of 10 randomly selected genes with RPKM values near 1000 that were determined not to be differentially regulated by both edgeR and DESeq did, in fact, have significantly altered transcript abundance when analyzed using ANOVA (Figure  A). In contrast, none of the genes with RPKM values near 10 were identified as false negatives (Figure  C). [...] Illumina sequences from each of the samples from three biological replicates of control and treatment (elevated ozone) were cleaned using the FASTX toolkit, with a minimum quality score of 20 and minimum length of 75 nt. Soybean genome (Gmax_109) and gff file (Gmax_109.gff3) were downloaded from phytozome ( http://www.phytozome.net/soybean). Soybean transcripts were extracted from the genome sequences based on the.gff file. These soybean transcripts (46,367 transcripts) were considered as reference transcriptome for RNA-Seq analysis.Mapping of Illumina sequences with Novoalign was done with –H (for hard clipping the reads), –l 65, -rA10 (to allow 10 multiple alignments). With these parameters at least 90% of the each read's length should map to the reference to consider it as a mapped read. After mapping with Novoalign, read counts for each gene were generated using PERL scripts. These reads counts were used for statistical analysis using DESeq and edgeR packages of ‘R’ to determine differential expression at the gene level. Since approximately 92% of the mapped reads aligned to the transcriptome uniquely, multireads were not considered. All biological replicates demonstrated a >0.93 correlation when RPKM values were compared, indicating high reproducibility of replicates. See online user guides for more information about performing alignments with Novoalign ( http://www.novocraft.com/wiki/tiki-index.php). [...] First-strand cDNA synthesis was performed using 1 μg of DNase treated RNA and was reverse transcribed in a 20 μl reaction with Superscript II (Life Technologies, Grand Island, NY) and oligo(dT) primers according to the manufacturer's instructions. Quantitative PCR was performed on an Applied Biosystems 7900HT Fast Real-Time PCR System (Life Technologies, Grand Island, NY) using Power SYBR Green PCR master mix (Life Technologies, Grand Island, NY) and 400nM of each primer in a 10 μl reaction. Primers were aliquoted onto a 384-well PCR plate using a JANUS automated liquid handling system (Perkin Elmer, Waltham, MA). The following are the primer sequences for each of the target genes: Rubisco (Glyma19g06340), primer A- GCACAATTGGCAAAGGAAGT, primer B- GAGAAGCATCAGTGCAACCA; LHCA5 (Glyma06g04280), primer A- GTGGAGCATCTTTCCAATCC, primer B- TGGATAAGCTCAAGCCCAAG; SBPase (Glyma11g34900), primer A- ATAAGTTGACCGGCATCACC, primer B- GGGTTGTCAGATGTGGCTCT; starch synthase (Glyma13g27480), primer A- GACCCTCTCGATGTTCAAGC, primer B- ATTCTCTGAGGTGGCAATGG; glutaredoxin (Glyma13g30770), primer A- AATCCAATGGCACCTATCCA, primer B- AGGGTTCACTCCCAGACCTT. Target gene expression was normalized to cons14 []. Each PCR amplification curve was analyzed with LinRegPCR software [] to calculate the PCR efficiency and threshold value from the baseline-corrected delta-Rn values in the log-linear phase. The normalized expression level for each gene was determined as reported in []. […]

Pipeline specifications

Software tools edgeR, DESeq, limma, NovoAlign, MapMan, FASTX-Toolkit, LinRegPCR
Databases Phytozome
Applications RNA-seq analysis, qPCR
Organisms Glycine max