Improves power while modeling and accounting for the over-dispersion inherent in RNA-seq data. EAGLE provides a flexible framework for modeling the influence of both technical and biological factors while accounting for extra-binomial variation in sequencing data. This R package implements a method to test for gene-environment (GxE) interactions using allele-specific expression (ASE). It uses a binomial generalized linear mixed model (GLMM) to predict the relative number of RNA-seq reads from each allele at exonic, heterozygous loci under different environmental conditions.
Reconstructs a consensus transcriptome from a collection of individual assemblies. TACO is an algorithm that employs change point detection to break apart complex loci and correctly delineate transcript start and end sites, and a dynamic programming approach to assemble transcripts from a network of splicing patterns. It also contains an easy to use companion tool for comparing meta-assemblies to reference transcriptomes, assessing overlap with reference and also protein coding potential.
Provides internal controls that can assess almost all stages of the RNA-seq workflow. Sequins supports assessment of library preparation, sequencing, split-read alignment, transcript assembly, gene expression and alternative splicing. This software is appropriate for evaluating downstream bioinformatic steps and improving parameter choice, and the controls can be used as normalization factors to compare multiple samples.
Identifies both protein-coding and non-coding indicators. TROM performs a comprehensive transcriptome mapping for diverse tissues and cell types within and across four mammalian species. It also provides a useful resource of conserved cell-state associated transcription factors, RNA-binding proteins and lncRNAs, which characterize transcriptomes of various cell states and enable researchers to explore new hypotheses in developmental biology.
A Java program for the automated detection and classification of transcription start sites (TSS) from RNA-seq data. TSSpredator reads RNA-seq data in the form of simple wiggle files and performs a genome wide comparative prediction of TSS, for example between different growth conditions.
A fast and accurate approach for phasing variants that are overlapped by sequencing reads, including those from RNA-sequencing (RNA-seq), which often span multiple exons due to splicing. phASER provides 1) dramatically more accurate phasing of rare and de novo variants compared to population-based phasing; 2) phasing of variants in the same gene up to hundreds of kilobases away which cannot be obtained from DNA-sequencing reads; 3) high confidence measures of haplotypic expression, greatly improving power for allelic expression studies.
Identifies transcription units (TUs) from RNA-seq data of any bacterium using a machine-learning approach. SeqTU predictions can serve as baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria.
A computational method to assess a quantitative measure of mRNA integrity. This is done by quantitatively modeling the 3' bias of read coverage profiles along each mRNA transcript. A per-sample summary mRIN is then derived as an indicator of mRNA degradation. This method has been used for systematic analysis of large-scale RNA-Seq data of postmortem tissues, in which RNA degradation during tissue collection is a particular issue.
Offers a statistical model for counts of RNA-Seq data. mseq is an R package that gathers an iterative glm procedure for the Poisson linear model, a training procedure of the multiple additive regression trees (MART) model and cross-validation for both methods. It can model non-uniformity in short-read rates with the aim of improving the estimation of gene and isoform expressions for both Illumina and Applied Biosystems data.
Identifies branch points in complex genomes. LaSSO is an algorithm that provides an approach to detect lariat intermediates and to map branch points on a genomic scale. This tool can identify additional cryptic or alternative splice sites by analyzing an intronic sequence together with its corresponding upstream or downstream exon sequence. Moreover, it can be applied to spot novel splicing events by partitioning the genome with a sliding window while ignoring known annotations.
Infers relative poly(A) site usage in terminal exons from RNA sequencing data. It complements KAPAC. PAQR is composed of three modules: (1) a script to deduce transcript integrity values, (2) a script to create the coverage profiles for all considered terminal exons, and (3) a script to obtain the relative usage together with the estimated expression of poly(A) sites with sufficient evidence of usage. The software enables evaluation of 3′ end processing in data sets such as those from The Cancer Genome Atlas (TCGA).
Infers sequence motifs that are associated with the processing of poly(A) sites in specific samples. KAPAC is an approach that deduces position-dependent activities of sequence motifs on 3′ end processing from changes in poly(A) site usage between conditions. Analysis of TCGA data with the software reveals pyrimidine-rich elements associated with the use of poly(A) sites in cancer and implicates the polypyrimidine tract-binding protein 1 (PTBP1) in the regulation of 3′ end processing in glioblastoma.
Captures all k-mer variation in an input set of RNA-seq libraries. DE-kupl is a k-mer-based computational protocol that has four main components: (1) indexing, (2) filtering and masking, (3) differential expression (DE) and (4) extending and annotating. The software directly analyzes the contents of the raw FASTQ files, displacing mapping to the final stage of the procedure. It is able to detect a wide range of differential transcription and RNA processing events.
Allows identification of multiple gene sets that play a role in the characterization, clinical application, or functional relevance of a disease phenotype. GISPA is designed to characterize the molecular tumor profile of a single sample relative to other, comparison samples based on changes (increasing/decreasing) among several diverse, genome-wide data types. A user-friendly interface, shinyGISPA, was also developed to combine and compare multiple levels of genomic to proteomic data.
Allows users to analyze U-indel RNA editing in non-model species with no prior data available. T-Aligner is a read mapping and assembly tool that fits multiple potential edited open reading frames (ORFs) from shotgun reads mapped to each cryptogene. The application enables the read mapping and visualization of the totality of the editing states and their coverage as well as the assembly of canonical and alternative translatable mRNAs.
A tool designed to simultaneously uncover patterns of focal copy number alteration and coordinated expression change, thus combining both principles. FocalScan outputs a ranking of tentative cancer drivers or suppressors. FocalScan works with RNA-seq data, and unlike other tools it can scan the genome unaided by a gene annotation, enabling identification of novel putatively functional elements including lncRNAs. Application on a breast cancer data set suggests considerably better performance than other DNA/RNA integration tools.
Improves the predictive performances of ordinary logistic ridge regression and the group lasso. GRridge allows the use of multiple sources of co-data (e.g. external p-values, gene lists, annotation) to improve prediction of binary, continuous and survival response using (logistic, linear or Cox) group-regularized ridge regression. It also facilitates post-hoc variable selection and prediction diagnostics by cross-validation using ROC curves and AUC.
Resolves conflicts due to repeated sequences in RNA. Barnacle is a pipeline for detecting and characterizing chimeric transcripts from long RNA sequences, such as those generated by de novo transcriptome assembly. It identifies sequences with a variety of anomalous alignment topologies, predicts partial tandem duplications (PTDs), internal tandem duplications (ITDs), and fusions from these sequences, and measures the coverage of the inferred chimeric transcripts relative to corresponding wild-type transcripts.
Calculates a filtering threshold for replicated RNA sequencing data. HTSFilter provides an intuitive data-driven way to filter RNA-seq data and to effectively remove genes with low constant expression levels. HTSFilter may be useful in a variety of applications for RNA-seq data, including differential expression analyses, clustering and co-expression analyses, and network inference.
An algorithm for finding candidate pairs for clustering gene expression data. The idea is that two sequences are candidates for comparison if they share α many common k-words, where the leftmost and rightmost words start at least β away from each other.
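The candidate-pair rule in the entry above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, "share α many" is read as "at least α distinct shared k-words", and the β distance is checked in both sequences, which the description does not specify.

```python
def kword_starts(seq, k):
    """Map each k-word in seq to the list of its start positions."""
    starts = {}
    for i in range(len(seq) - k + 1):
        starts.setdefault(seq[i:i + k], []).append(i)
    return starts


def is_candidate_pair(s, t, k, alpha, beta):
    """Return True if s and t share at least alpha distinct k-words and, in
    each sequence, the leftmost and rightmost shared-word starts are at
    least beta positions apart (both readings are assumptions)."""
    starts_s = kword_starts(s, k)
    starts_t = kword_starts(t, k)
    shared = starts_s.keys() & starts_t.keys()
    if not shared or len(shared) < alpha:
        return False

    def span(starts):
        # Distance between the leftmost and rightmost shared-word starts.
        hits = [p for word in shared for p in starts[word]]
        return max(hits) - min(hits)

    return span(starts_s) >= beta and span(starts_t) >= beta
```

Only pairs that pass this cheap filter would then go on to a full, more expensive comparison.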
Reconstructs transcription landscape from RNA-Seq read counts. Parseq is a statistical approach for analyzing the RNA-Seq read count profiles along the genome. The software estimates the model parameters, reconstructs local transcription levels, calls transcribed regions and identifies coverage breakpoints based on this framework.
Rapid and quantitative metrics for evaluating structure probing data quality. SPEQC uses metrics to rapidly and quantitatively evaluate data quality from structure probing experiments, demonstrating their efficacy on both small synthetic libraries and transcriptome-wide datasets. A signal-to-noise ratio concept evaluates replicate agreement, which has the capacity to identify high-quality data. The developed metrics and tools will be useful in summarizing large-scale datasets and will help standardize quality control in the field.
Accesses the results of a systematic, continually updated and growing analysis of public RNA-seq data in the European Nucleotide Archive (ENA). RNASeq-er enables ontology-powered search for and retrieval of CRAM, bigwig and bedGraph files, gene and exon expression quantification matrices as well as sample attributes annotated with ontology terms. It provides access to baseline gene expression quantifications, aggregated across all runs in each of over 4000 normal tissue, cell type, developmental stage, sex, and strain conditions in 61 species.
An R script for processing MiTCR-derived CDR3 data from Peripheral T Cell Lymphoma (PTCL) RNA-seq. TcellClonality removes non-productive CDR3 sequences, calculates the relative abundance of each CDR3, resolves ambiguity in CDR3 chain assignment, and classifies CDR3 clonotypes as dominant or background (using control samples to determine the background level). It then calculates Shannon Entropy and estimates Tumor Purity for each sample. Finally, it includes code for analyzing T Cell Receptor (TCR) C gene expression. For analysis, RSEM v1.2.29 was used to calculate gene expression levels for all transcripts, and transcripts from TCR C genes were extracted.
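The relative-abundance and Shannon entropy steps mentioned in the entry above can be illustrated with a short Python sketch. It is not taken from the TcellClonality script (which is written in R): the function names are hypothetical, the input is assumed to be a mapping from CDR3 clonotype to read count, and the entropy is reported in bits (log base 2), which the description does not state.

```python
import math


def relative_abundance(counts):
    """Convert clonotype read counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return {clonotype: n / total for clonotype, n in counts.items()}


def shannon_entropy(counts):
    """Shannon entropy, in bits, of the clonotype abundance distribution."""
    freqs = relative_abundance(counts).values()
    # Zero-frequency clonotypes contribute nothing and are skipped.
    return -sum(p * math.log2(p) for p in freqs if p > 0)
```

A sample dominated by a single clonotype gives an entropy near 0, while a sample with many equally abundant clonotypes gives a high value.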
Recognizes chimeras corresponding to potential L1 Chimeric Transcripts (LCTs) in RNA-seq data from one or several samples. CLIFinder aims to analyze stranded paired-end RNA-seq data. It can be useful for genome-wide analyses of LCT expression in different tissues, in normal or pathological conditions. This tool is customizable and can be adapted for investigation of transcription from other repeat types.
Provides a fast and simple way to analyze and evaluate multiple normalization methods via visualization and representation of correlation values, based on a user-defined set of uniformly expressed genes. NVT generates publication-ready figures and also provides correlation measures. The package thereby facilitates the documentation of methodological decisions for RNA-seq experiments.
A software for inference of B-cell receptor (BCR) repertoires using short-read RNA sequencing data. V'DJer uses customized read extraction, assembly and V(D)J rearrangement detection and filtering to produce contigs representing the most abundant portions of the BCR. V’DJer allows for full inference of repertoire characteristics including variable and joining gene segment usage, population diversity, sequence sharing between populations, antigen binding region amino acid properties and motifs, clonal structure and somatic hypermutation in BCR repertoires.
Helps in the visualization, interpretation, and experimental validation of both classical and complex splicing variations. MAJIQ-SPEL offers researchers fast and accurate primer design for de novo splicing variations not in the annotated transcriptome. It can define and quantify alternative splicing through Local Splicing Variations (LSVs). The tool accepts as input LSVs quantified from RNA-Seq data.
Analyses the chloroplast transcriptome using RNA-Seq. ChloroSeq uses RNA-Seq alignment data to deliver detailed analyses of organelle transcriptomes, which can be fed into statistical software for further analysis and for generating graphical representations of the data. It can also examine splicing efficiency and RNA editing profiles.
Predicts DNase I hypersensitivity. BIRD handles the prediction problem where both predictors and responses are ultra-high-dimensional. It groups correlated predictors into clusters and transforms each cluster into one predictor. The tool not only offers the computational efficiency suitable for big data regression, but also achieved the best prediction performance among the compared methods.
Allows users to construct individualized diploid genomes and transcriptomes for multiparent populations. Seqnature performs two main functions: (1) for inbred strains, it incorporates single nucleotide polymorphisms (SNPs) into a reference genome to create a strain-specific genome sequence; and (2) for multiparent populations, it uses inferred founder haplotypes to construct a pair of individualized haploid genome sequences incorporating known SNPs.
Stores code and certain raw materials for a detailed RNA-seq tutorial. Informatics for RNA-seq is an educational tutorial and working demonstration pipeline for RNA-seq analysis including an introduction to: cloud computing, next generation sequence file formats, reference genomes, gene annotation, expression analysis, differential expression analysis, alternative splicing analysis, data visualization, and interpretation.
Allows users to correct read counts from various sequencing data. FIXSEQ is an R package that is able to decrease noise while increasing stability in subsequent inference procedures as well as complementing existing literature on applications of heavy-tailed distributions. The program can be applied on many types of sequencing assays and downstream processing algorithms, without additional assumptions.
Constructs ortholog annotations for comparative transcriptome analysis between closely related species. XSAnno is built as a pipeline that includes whole genome alignment and local alignment methods. This software exploits several filters to reduce the number of false positives caused by differences in mappability. The pipeline has been tested on human and non-human primate brain transcriptome data, but can be applied to other species.
Predicts transcription units. TrUC uses oriented data to make its predictions, and it also predicts introns from un-oriented RNA-Seq data. It improves Paramecium gene models not only by predicting Transcription Start Sites (TSS) and Transcription Termination Sites (TTS) when Cap-Seq and oriented mRNA-Seq data are available, but also through its intron predictions. The tool can be useful when gene density is high, as in Paramecium.
Allows study of extracellular vesicle (EV) mediated mRNA transfers between cells. EVtransfer also investigates the role of exosomes as a vehicle in mediating the exchange. The software enables quality control, alignment, mapping, and base call recalibration of raw SNP array and RNA sequencing read data, evaluation of the significance of genotypic variation of a cell line under in vitro co-culture, and estimation of the rate of falsely discovered loci involved in the transfer process.
Identifies enriched sites from differential RNA-seq experiments comprising enriched and unenriched libraries. ToNER uses a global distribution model to report statistics of enrichment for all nucleotides. It calculates position-wise normalized read depth ratio between two libraries for all mapped genome positions. The tool is able to identify transcription start site (TSS) from Cappable-seq data in prokaryotes. It can locate enriched positions in complex data of eukaryotes such as m6A-seq.
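A position-wise normalized read depth ratio of the kind described in the entry above might look like the following Python sketch. It is not ToNER's actual model: the function name is hypothetical, and both the library-size scaling and the pseudocount are assumptions introduced here to make the ratio well defined.

```python
def normalized_depth_ratio(enriched, unenriched, pseudocount=1.0):
    """Per-position ratio of enriched to unenriched read depth.

    The enriched depths are first scaled so that both libraries have the
    same total depth (an assumed normalization), and a pseudocount avoids
    division by zero at uncovered positions (also an assumption).
    """
    if len(enriched) != len(unenriched):
        raise ValueError("both libraries must cover the same positions")
    total_e, total_u = sum(enriched), sum(unenriched)
    if total_e == 0 or total_u == 0:
        raise ValueError("each library needs at least one mapped read")
    scale = total_u / total_e
    return [
        (e * scale + pseudocount) / (u + pseudocount)
        for e, u in zip(enriched, unenriched)
    ]
```

Positions with a ratio well above 1 would be candidates for enrichment; deciding which ratios are significant is the job of the tool's global distribution model, which this sketch does not attempt.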
Reconstructs the haplotype-specific isoforms from long single-molecule reads. HapIso is a comprehensive method for the accurate reconstruction of the haplotype-specific isoforms of a diploid cell that uses the splice mapping of the long single-molecule reads and partitions the reads into parental haplotypes. To overcome gapped coverage and splicing structures of the gene, the haplotype reconstruction procedure is applied independently for regions of contiguous coverage defined as transcribed segments.
Allows genome-scale simulation of transcription and translation at individual molecule and single base-pair resolution. TABASCO is a single-molecule stochastic simulator optimized to handle molecular events specific to gene expression such as the initiation, elongation and termination of transcription and translation as well as interactions among protein-DNA complexes. The software provides the descriptions of gene expression dynamics while allowing analysis of phenomena such as how intermolecular events between DNA-protein complexes affect system-wide gene expression.
A data-adaptive approach was developed to estimate the lower bound of high expression for RNA-seq data. The Kolmogorov-Smirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value for separating transcripts with high and low expression. Results from the proposed method were compared to results obtained by estimating the theoretical cutoff of a fitted two-component mixture distribution. The robustness of the proposed method was demonstrated by analyzing different RNA-seq datasets that varied by sequencing depth, species, scale of measurement, and empirical density shape.
Enables query and combination of complex data sources to facilitate research and discovery. Sidekick is a web-based biological decision-making framework that focuses on bottom-up discovery and organization of belief. The software allows scoring and manipulation of gene pair lists as well as management of user beliefs. The software supports three queries, four filters, six genomes, several ways to combine results, and methods for saving and restoring workflows and data.
Predicts transcripts using long RNA (lRNA)-seq reads. Traphlor is composed of two steps: constructing a splicing graph and solving the assembly problem on top of this graph. It models long reads as subpath constraints. This tool transforms paired-end reads into long reads by a local assembly method which fills the gap between them. It is sensitive to high levels of noise, e.g. alignment errors near splice sites.
A powerful test for finding eQTL effects based upon RNA-seq data. globalSeq can be computed efficiently and is able to handle sets of highly correlated covariates. The proposed test can detect genetic and epigenetic alterations that affect gene expression. It can examine complex regulatory mechanisms of gene expression.
Permits flexible priors such as parametric mixture priors and non-parametric priors. ShrinkSeq is implemented in the context of integrated nested Laplace approximation (INLA). It consists of a hybrid full Bayes-empirical Bayes method. This tool combines several aspects relevant to the analysis of RNA sequencing data: large applicability, enhanced power and reproducibility and multiplicity-corrected inference.
Offers an adaptive ridge algorithm for approximating L0-penalized generalized linear models (GLMs). L0 ADRIDGE is a GLM-based method, distributed as a MATLAB package, that provides various methods such as penalized Poisson or logistic regression. It is intended to assist in (i) determining disease status, (ii) supporting physicians in clinical decision making, and (iii) easing feature selection and prediction with big omics data.
Filters the protein-encoding transcripts assembled by RNA-Seq according to homology search. dbHT-Trans can generate metadata for each processing step and store them into a MySQL database. It has been used to filter out the falsely assembled protein-encoding transcripts based on their potential biological implications. This tool may return false positives if the reference protein databases are not very representative.
A global framework relying on three principles: (i) the statistical universe is a single patient; (ii) significance is derived from geneset/biomodules powered by paired samples from the same patient; and (iii) similarity between genesets/biomodules assesses commonality and differences, within-study and cross-studies. Thus, patient gene-level profiles are transformed into deregulated pathways.
A visualization and analysis method to interrogate putative transcription start sites (TSSs) in relation to various features that are indicative of transcriptional activity. The Zipper plot enables researchers to evaluate whether a set of putative long non-coding RNAs (lncRNAs) have the characteristics of independent transcriptional units by integrating information on the 5’ boundary of the transcripts and the chromatin state at the TSSs.
A computational protocol aimed to discover the source of all reads, which originate from complex RNA molecules, recombinant antibodies and microbial communities. The ROP accounts for 98.8% of all reads across poly(A) and ribo-depletion protocols, compared to 83.8% by conventional reference-based protocols. ROP profiles repeats, circRNAs, gene fusions, trans-splicing events, recombined B/T-cell receptor sequences and microbial communities. The ‘dumpster diving’ profile of unmapped reads output by our method is not limited to RNA-seq technology and may be applied to whole-exome and whole-genome sequencing.