Computational protocol: Getting the most out of RNA-seq data analysis

Similar protocols

Protocol publication

[…] A large number of DEG call methods have been proposed (), with the majority of them being parametric methods that make distributional assumption about the read count data. An exhaustive comparison of all available methods for the present study was not feasible, nor necessary, since the relative performance of various subsets of these methods have been investigated in several studies (; ; ; ). As a result, less promising methods can be omitted from comparison.Comparisons involving methods specifically designed for unreplicated experiments received little attention, despite the abundance of such type of RNA-seq data. For this reason, we included ASC, GFOLD and NOISeq. For replicated experiments, we focused on methods that have received the most attention from the scientific community (as reflected by their relatively high citations per year), such as edgeR, DESeq and its new version, DESeq2. These are parametric methods that explicitly model the distribution of count data using the negative binomial distribution. Initially designed for standard experiments with biological replicates, these methods were later modified to accommodate analysis of unreplicated experiments as well, but their performance relative to ASC, GFOLD and NOISeq remains unclear. We did not include two methods with high citations per year: Cuffdiff2 and DEGSeq, based on conclusions from recent method comparative analyses. For example, Cuffdiff2 was found to have very low precision when replicate size increased in the analysis of two large RNA-seq data sets from mouse and human (). Furthermore, showed that edgeR had slightly superior performance in the receiver operating characteristic curve compared to DESeq and Cuffdiff2. Another comparative study involving DESeq, DEGseq, edgeR, NBPSeq, TSPM and baySeq showed that DEGseq had the largest false positive rate among them ().The inclusion of the popular non-parametric method NOISeq provides a contrast between performance of parametric and non-parametric methods. We included voom, which connects log-transformed read count data to the mature limma analysis pipeline (; ) that has been used so successfully for detecting DEG in microarray data analysis. Finally, the Z-test for equality of two proportions was included to set upper bounds in the tested performance metrics that are attainable by naive application of a common textbook statistical method. Let Nij be the pooled normalized read counts of the ith gene in the jth phenotype class (j = 1, 2), N+j=∑iNij the total normalized read counts in the jth phenotype class, and Ni+=∑j=1,2Nij the total normalized read counts of the ith gene in all phenotype classes. Specifically, the Z-test statistic for the ith gene is given by Zi=pˆi1−pˆi2pˆi1−pˆi/N, where pˆij=Nij/N+j, pˆi=Ni+/N, and N is the total number of normalized counts. provides a description of the core modeling approaches of the eight methods considered in the present study. [...] To set up our benchmarking exercise, we needed two RNA-seq data sets whereby variation in their phenotype classes produced mild and strong biological effect sizes in the tissue of interest, respectively. We further required the RNA-seq data sets to have fairly large replicate sizes to enable the simulation of different replicate size scenarios. To this end, we identified two suitable RNA-seq data sets in the Recount database (). The latter contains unnormalized RNA-seq count data sets from 18 major studies that have been assembled from raw reads using the Myrna () pipeline.The Bottomly data set () consists of gene expression data (22 million Illumina reads per sample, read length of ∼30 bases) obtained from the brain striatum tissues of two mice strains: C57BL/6J (n = 10) and DBA/2J (n = 11). Both mice strains are known to show large, strain-specific variation in neurological response when subjected to opiate drug treatment (; ; ).The Cheung data set () consists of gene expression data (40 million Illumina reads per sample, read length of 50 bases) from immortalized human B-cells of 24 males and 17 females. Sex hormones are known to modulate B cell function (; ). For example, estrogen modulates B cell apoptosis and activation (), while testosterone suppresses immunoglobulin production by B cells (). In the absence of antigenic challenge, however, it seems reasonable to expect only a modest number of DEG in male and female B cells.After removal of transcripts with zero counts in all samples, the Bottomly count table contained 13,932 transcripts, down from an initial 36,536 transcripts, whereas the Cheung count table contained 12,410 transcripts, down from 52,580. Prior to analysis, the count data were normalized using DESeq normalization (), which has been shown to be robust to library size and composition variation (). However, raw counts were used for DESeq2 analysis since the method explicitly requires such type of data as input. [...] The construction of a reliable reference DEG set from which performance metrics for each method is evaluated is a non-trivial problem, if one eschews a simulation-based approach. To avoid circular reasoning, this reference set needs external validation from independent evidence such as confirmation from qPCR results.Here, we chose voom (; ) as the method of choice for setting the reference DEG set. Unlike other DEG methods that primarily model mean–variance relationships in the count data using discrete distributions such as the Poisson or negative binomial distributions, voom log-transforms count data into a microarray-like data type suitable for analysis using the robust limma pipeline (; ). Because of this, using voom to set the reference DEG set can avoid biasing results of the called DEG due to algorithmic similarities. A gene was defined as differentially expressed using the same joint filtering criteria for edgeR, DESeq, DESeq2 and Z-test. We found the nonparametric SAMSeq (), which has also been reported to have strong DEG mining performance, unsuitable for setting the reference DEG set as it returned different DEG sets for different random seeds and number of permutation parameters ().The validity of voom as a tool for constructing reasonable in silico reference DEG sets for the Bottomly and Cheung data set requires justification. To this end, we compared its performance with other DEG call methods on an RNA-seq data set in which qPCR validation results for sufficiently large numbers of genes are available. Briefly, the Rajkumar data set () consists of gene expression count data (26,119 genes; minimum of 10 million Illumina reads per sample, read length of ∼50 bases) from the amygdala tissues of C57BL/6NTac strain mice. There are two phenotype classes: wild type (n = 8), and heterozygotes for the Brd1 gene deletion (n = 8). A total of 115 genes were selected for qPCR validation (additional Table 5 in ); differential expression was observed in 60 of them, and not in the remaining 55. Each DEG call method returns Ng differentially expressed gene candidates. We considered a method to be sound for setting the reference DEG set if it did not return too few (tens) or too many (thousands) candidates. Among methods that satisfied this criterion, the one that had relatively higher positive predictive value (PPV; the complement of the false discovery rate) would be preferred. Let NTP be the number of true positives, and NFP the number of false positives. Then the number of DEG that lack validation result is U = Ng − NFP − NTP. If U is not too large or too small, then the expected number of true positives can be estimated as NTP∗=NTP+NTPNTP+NFPU. The expected PPV is therefore given by PPV∗=NTP+NTP∗Ng. […]

Pipeline specifications

Software tools GFOLD, NOISeq, edgeR, DESeq, DESeq2, Cufflinks, DEGseq, NBPSeq, baySeq, voom, limma, Myrna, SAMSeq
Application RNA-seq analysis
Organisms Plum pox virus