Improves power by modeling and accounting for the over-dispersion inherent in RNA-seq data. EAGLE provides a flexible framework for modeling the influence of both technical and biological factors while accounting for extra-binomial variation in sequencing data. This R package tests for gene-environment (GxE) interactions using allele-specific expression (ASE). It uses a binomial generalized linear mixed model (GLMM) to predict the relative number of RNA-seq reads from each allele at exonic, heterozygous loci under different environmental conditions.
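The notion of extra-binomial variation in allelic read counts can be illustrated with a small sketch. It does not reproduce EAGLE's GLMM; as a stand-in, it compares a plain binomial likelihood with a beta-binomial one, which is a common way to describe over-dispersed counts. The function names and the `concentration` parameter are assumptions made for this illustration.

```python
import math

def log_binom_coeff(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def binomial_logpmf(k, n, p):
    """Log-probability of k reference-allele reads out of n under a binomial."""
    return log_binom_coeff(n, k) + k * math.log(p) + (n - k) * math.log(1 - p)

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_logpmf(k, n, p, concentration):
    """Beta-binomial log-probability with mean allelic ratio p.

    Assumption for this sketch: over-dispersion is described by a single
    `concentration` parameter; smaller values mean more extra-binomial noise.
    """
    a = p * concentration
    b = (1 - p) * concentration
    return log_binom_coeff(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)

# Hypothetical heterozygous site: 18 of 20 reads carry the reference allele.
plain = binomial_logpmf(18, 20, 0.5)
dispersed = beta_binomial_logpmf(18, 20, 0.5, concentration=10.0)
```

With these hypothetical counts, the over-dispersed model assigns the skewed observation a much higher probability than the plain binomial does, which is why ignoring over-dispersion tends to overstate evidence for allelic imbalance.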
Reconstructs a consensus transcriptome from a collection of individual assemblies. TACO is an algorithm that employs change point detection to break apart complex loci and correctly delineate transcript start and end sites, and a dynamic programming approach to assemble transcripts from a network of splicing patterns. It also contains an easy to use companion tool for comparing meta-assemblies to reference transcriptomes, assessing overlap with reference and also protein coding potential.
Provides internal controls that can assess almost all stages of the RNA-seq workflow. Sequins supports library preparation, sequencing, split-read alignment, transcript assembly, gene expression and alternative splicing. It is suited to evaluating downstream bioinformatic steps, helps optimize parameter choices, and can supply normalization factors for comparing multiple samples.
Identifies both protein-coding and non-coding indicators. TROM performs a comprehensive transcriptome mapping for diverse tissues and cell types within and across four mammalian species. It also provides a useful resource of conserved cell-state associated transcription factors, RNA-binding proteins and lncRNAs, which characterize transcriptomes of various cell states and enable researchers to explore new hypotheses in developmental biology.
A Java program for the automated detection and classification of transcription start sites (TSS) from RNA-seq data. TSSpredator reads RNA-seq data in the form of simple wiggle files and performs a genome wide comparative prediction of TSS, for example between different growth conditions.
A fast and accurate approach for phasing variants that are overlapped by sequencing reads, including those from RNA-sequencing (RNA-seq), which often span multiple exons due to splicing. phASER provides 1) dramatically more accurate phasing of rare and de novo variants compared to population-based phasing; 2) phasing of variants in the same gene up to hundreds of kilobases away which cannot be obtained from DNA-sequencing reads; 3) high confidence measures of haplotypic expression, greatly improving power for allelic expression studies.
Identifies transcription units (TUs) from RNA-seq data of any bacterium using a machine-learning approach. SeqTU can serve as the baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria.
A computational method to assess a quantitative measure of mRNA integrity. This is done by quantitatively modeling the 3' bias of read coverage profiles along each mRNA transcript. A per-sample summary mRIN is then derived as an indicator of mRNA degradation. This method has been used for systematic analysis of large-scale RNA-Seq data of postmortem tissues, in which RNA degradation during tissue collection is a particular concern.
Offers a statistical model for counts of RNA-Seq data. mseq is an R package that gathers an iterative glm procedure for the Poisson linear model, a training procedure of the multiple additive regression trees (MART) model and cross-validation for both methods. It can model non-uniformity in short-read rates with the aim of improving the estimation of gene and isoform expressions for both Illumina and Applied Biosystems data.
Identifies branch points in complex genomes. LaSSO is an algorithm that provides an approach to detect lariat intermediates and to map branch points on a genomic scale. This tool can identify additional cryptic or alternative splice sites by analyzing an intronic sequence together with its corresponding upstream or downstream exon sequence. Moreover, it can be applied to spot novel splicing events by partitioning the genome with a sliding window while ignoring known annotations.
Provides a data-driven solution to test the assumptions of global normalization methods. Group level information about each sample (such as tumor/normal status) must be provided because the test assesses if there are global differences in the distributions between the user-defined groups.
Infers relative poly(A) site usage in terminal exons from RNA sequencing data, for downstream analysis with KAPAC. PAQR is composed of three modules: (1) a script to deduce transcript integrity values, (2) a script to create the coverage profiles for all considered terminal exons, and (3) a script to obtain the relative usage together with the estimated expression of poly(A) sites with sufficient evidence of usage. The software enables evaluation of 3′ end processing in data sets such as those from The Cancer Genome Atlas (TCGA).
Infers sequence motifs that are associated with the processing of poly(A) sites in specific samples. KAPAC is an approach that deduces position-dependent activities of sequence motifs on 3′ end processing from changes in poly(A) site usage between conditions. The software analysis of TCGA data reveals pyrimidine-rich elements associated with the use of poly(A) sites in cancer and implicates the polypyrimidine tract-binding protein 1 (PTBP1) in the regulation of 3′ end processing in glioblastoma.
Captures all k-mer variation in an input set of RNA-seq libraries. DE-kupl is a k-mer-based computational protocol that has four main components: (1) indexing, (2) filtering and masking, (3) differential expression (DE) and (4) extending and annotating. The software directly analyzes the contents of the raw FASTQ files, displacing mapping to the final stage of the procedure. It is able to detect a wide range of differential transcription and RNA processing events.
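The indexing idea behind a k-mer-based protocol can be sketched as a count table with one row per k-mer and one column per library. This is only an illustration and not DE-kupl's implementation; the function names are hypothetical, and the sketch assumes plain (non-canonical) k-mers and reads already held in memory as strings.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_table(libraries, k):
    """Build {kmer: [count in library 0, count in library 1, ...]}.

    Assumption for this sketch: each library is a list of read strings.
    """
    per_library = [count_kmers(reads, k) for reads in libraries]
    all_kmers = set().union(*per_library)
    return {kmer: [c[kmer] for c in per_library] for kmer in sorted(all_kmers)}

# Two tiny hypothetical libraries; real input would come from FASTQ files.
table = kmer_table([["ACGT"], ["ACGA", "ACGA"]], k=3)
```

A table of this shape is what a later differential step would compare across conditions, without needing a reference genome until the annotation stage.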
Allows identification of multiple gene sets that play a role in the characterization, clinical application, or functional relevance of a disease phenotype. GISPA is designed to characterize the molecular tumor profile of a single sample relative to other, comparison samples based on changes (increasing/decreasing) among several diverse, genome-wide data types. A user-friendly interface, shinyGISPA, was also developed to combine and compare multiple levels of genomic to proteomic data.
Allows users to analyze U-indel RNA editing in non-model species with no prior data available. T-Aligner is a read mapping and assembly tool that fits multiple potential edited open reading frames (ORFs) from shotgun reads mapped to each cryptogene. The application enables the read mapping and visualization of the totality of the editing states and their coverage as well as the assembly of canonical and alternative translatable mRNAs.
A tool designed to simultaneously uncover patterns of focal copy number alteration and coordinated expression change, thus combining both principles. FocalScan outputs a ranking of tentative cancer drivers or suppressors. FocalScan works with RNA-seq data, and unlike other tools it can scan the genome unaided by a gene annotation, enabling identification of novel putatively functional elements including lncRNAs. Application on a breast cancer data set suggests considerably better performance than other DNA/RNA integration tools.
Improves the predictive performances of ordinary logistic ridge regression and the group lasso. GRridge allows the use of multiple sources of co-data (e.g. external p-values, gene lists, annotation) to improve prediction of binary, continuous and survival response using (logistic, linear or Cox) group-regularized ridge regression. It also facilitates post-hoc variable selection and prediction diagnostics by cross-validation using ROC curves and AUC.
Resolves conflicts due to repeated sequences in RNA. Barnacle is a pipeline for detecting and characterizing chimeric transcripts from long RNA sequences, such as those generated by de novo transcriptome assembly. It identifies sequences with a variety of anomalous alignment topologies, predicts partial tandem duplications (PTDs), internal tandem duplications (ITDs), and fusions from these sequences, and measures the coverage of the inferred chimeric transcripts relative to corresponding wild-type transcripts.
Calculates a filtering threshold for replicated RNA sequencing data. HTSFilter provides an intuitive data-driven way to filter RNA-seq data and to effectively remove genes with low constant expression levels. HTSFilter may be useful in a variety of applications for RNA-seq data, including differential expression analyses, clustering and co-expression analyses, and network inference.
Rapid and quantitative metrics for evaluating structure probing data quality. SPEQC uses metrics to rapidly and quantitatively evaluate data quality from structure probing experiments, demonstrating their efficacy on both small synthetic libraries and transcriptome-wide datasets. A signal-to-noise ratio concept evaluates replicate agreement, which has the capacity to identify high-quality data. The developed metrics and tools will be useful in summarizing large-scale datasets and will help standardize quality control in the field.
An algorithm for finding candidate pairs for clustering gene expression data. The idea is that two sequences are candidates for comparison if they share α many common k-words, where the leftmost and rightmost words start at least β away from each other.
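The candidate-pair criterion above can be sketched as follows. This is one possible reading of the description, not the published algorithm: it assumes that "common k-words" means distinct k-words present in both sequences, and that the β spacing must hold in each of the two sequences. The function names are hypothetical.

```python
def kword_positions(seq, k):
    """Map each k-word in seq to the list of its start positions."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions

def is_candidate_pair(s, t, k, alpha, beta):
    """Return True if s and t share at least alpha distinct k-words whose
    leftmost and rightmost starts are at least beta apart in both sequences.

    Assumption for this sketch: the spacing test is applied to s and t
    separately, using all occurrences of the shared k-words.
    """
    pos_s = kword_positions(s, k)
    pos_t = kword_positions(t, k)
    common = pos_s.keys() & pos_t.keys()
    if not common or len(common) < alpha:
        return False
    for positions in (pos_s, pos_t):
        starts = [i for word in common for i in positions[word]]
        if max(starts) - min(starts) < beta:
            return False
    return True

# Hypothetical sequences sharing the 3-words "ACG" (start 0) and "GCA" (start 6).
is_candidate_pair("ACGTTTGCA", "ACGAAAGCA", k=3, alpha=2, beta=5)
```

The spacing condition filters out pairs whose shared words all fall in one short stretch, so only pairs with matches spread along the sequences go on to the more expensive comparison.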
Software for inference of B-cell receptor (BCR) repertoires using short-read RNA sequencing data. V'DJer uses customized read extraction, assembly and V(D)J rearrangement detection and filtering to produce contigs representing the most abundant portions of the BCR. V'DJer allows for full inference of repertoire characteristics including variable and joining gene segment usage, population diversity, sequence sharing between populations, antigen binding region amino acid properties and motifs, clonal structure and somatic hypermutation in BCR repertoires.
Accesses the results of a systematic, continually updated and growing analysis of public RNA-seq data in the European Nucleotide Archive (ENA). RNASeq-er enables ontology-powered search for and retrieval of CRAM, bigwig and bedGraph files, gene and exon expression quantification matrices as well as sample attributes annotated with ontology terms. It provides access to baseline gene expression quantifications, aggregated across all runs in each of over 4000 normal tissue, cell type, developmental stage, sex, and strain conditions in 61 species.
Reconstructs transcription landscape from RNA-Seq read counts. Parseq is a statistical approach for analyzing the RNA-Seq read count profiles along the genome. The software estimates the model parameters, reconstructs local transcription levels, calls transcribed regions and identifies coverage breakpoints based on this framework.
An R script for processing MiTCR-derived CDR3 data from Peripheral T Cell Lymphoma (PTCL) RNA-seq. TcellClonality removes non-productive CDR3 sequences, calculates the relative abundance of each CDR3, resolves ambiguity in CDR3 chain assignment, and classifies CDR3 clonotypes as dominant or background (using control samples to determine the background level). It then calculates Shannon Entropy and estimates Tumor Purity for each sample. Finally, it includes code for analyzing T Cell Receptor (TCR) C gene expression. For analysis, RSEM v1.2.29 was used to calculate gene expression levels for all transcripts, and transcripts from TCR C genes were extracted.
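The relative-abundance and Shannon entropy steps can be illustrated with a short sketch. It is written in Python rather than R and is not taken from TcellClonality; the function names, the example CDR3 sequences, and the choice of base-2 logarithms are assumptions made for this illustration.

```python
import math

def relative_abundance(counts):
    """Convert {clonotype: read count} into {clonotype: fraction of reads}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no CDR3 reads to normalize")
    return {clonotype: n / total for clonotype, n in counts.items()}

def shannon_entropy(counts):
    """Shannon entropy (in bits; the log base is an assumption) of clonotypes."""
    freqs = relative_abundance(counts).values()
    return -sum(p * math.log2(p) for p in freqs if p > 0)

# Hypothetical CDR3 read counts for one sample.
sample = {"CASSLGQGYEQYF": 90, "CASSPDRGNTEAFF": 5, "CASRTGESYEQYF": 5}
entropy = shannon_entropy(sample)
```

A sample dominated by one clonotype gives an entropy near zero, while an evenly mixed repertoire gives an entropy near the log of the number of clonotypes, which is what makes the measure useful for spotting clonal expansions.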
Provides a fast and simple way to analyze and evaluate multiple normalization methods via visualization and representation of correlation values, based on a user-defined set of uniformly expressed genes. NVT generates publication-ready figures and also provides correlation measures. The package thereby facilitates the documentation of methodological decisions for RNA-seq experiments.
Furnishes a method for controlling false discovery rate (FDR) using the peFDR procedure. PairedFB is a method designed for paired RNA-seq data that uses a modified beta-binomial likelihood. It permits identification of differentially expressed genes (DEGs) in the presence of heterogeneous treatment effects among pairs. Moreover, this method is able to achieve the desired FDR levels regardless of the nominal levels and the sample sizes.
Promotes sparse variable selection via regularization governed by the covariance and inverse covariance matrices of explanatory variables. Precision Lasso provides methods to handle instability and inconsistency separately and then combine these two regularization schemes to derive the final model. This tool deals with correlated variables for instability and linear dependencies for inconsistency.
Recognizes chimeras corresponding to potential L1 Chimeric Transcripts (LCTs) in RNA-seq data from one or several samples. CLIFinder aims to analyze stranded paired-end RNA-seq data. It can be useful for genome-wide analyses of LCT expression in different tissues, in normal or pathological conditions. This tool is customizable and can be adapted for investigation of transcription from other repeat types.
Consists of a dimension reduction technique designed for complex data sets with multiple overlaid signals observed in noisy conditions. SMSSVD is a parameter-free unsupervised dimension reduction technique primarily designed to reduce noise, adaptively for each low-rank-signal in a data matrix. The software represents the data in a way that enables unbiased exploratory analysis and reconstruction of the multiple overlaid signals, including finding the variables that drive the different signals. It was evaluated on several gene expression and synthetic data sets.
Helps in the visualization, interpretation, and experimental validation of both classical and complex splicing variations. MAJIQ-SPEL offers researchers fast and accurate primer design for de novo splicing variations not in the annotated transcriptome. It can define and quantify alternative splicing through Local Splicing Variations (LSVs). The tool accepts as input LSVs quantified from RNA-Seq data.
Applies generalized linear mixed models (GLMMs) to RNA sequencing and bisulfite sequencing data. PQLseq is a package that handles binary and continuous predictor variables as well as multiple covariates as fixed effects, supports parallel computing, and can generate unbiased heritability estimates for sequencing count data. This application can be used for differential analysis in large sequencing studies.
Analyses the chloroplast transcriptome using RNA-Seq. ChloroSeq uses RNA-Seq alignment data to deliver detailed analyses of organelle transcriptomes, which can be fed into statistical software for further analysis and for generating graphical representations of the data. It can also examine splicing efficiency and RNA editing profiles.
Predicts DNase I hypersensitivity. BIRD handles the prediction problem where both predictors and responses are ultra-high-dimensional. It groups correlated predictors into clusters and transforms each cluster into one predictor. The tool not only offers the computational efficiency needed for big data regression, but also achieved the best prediction performance among the methods compared on this problem.
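The step of grouping correlated predictors and replacing each group with a single predictor can be sketched as below. This is not BIRD's procedure: the greedy correlation-threshold grouping, the use of the cluster mean as the summary, the `threshold` value, and all function names are assumptions chosen to keep the illustration short.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def cluster_predictors(predictors, threshold=0.9):
    """Greedily group predictor indices (assumed rule, not BIRD's): a predictor
    joins the first cluster whose founding member it correlates with."""
    clusters = []
    for j, values in enumerate(predictors):
        for cluster in clusters:
            if pearson(predictors[cluster[0]], values) >= threshold:
                cluster.append(j)
                break
        else:
            clusters.append([j])
    return clusters

def collapse_clusters(predictors, clusters):
    """Replace each cluster by the per-sample mean of its members."""
    n_samples = len(predictors[0])
    return [
        [sum(predictors[j][i] for j in cluster) / len(cluster) for i in range(n_samples)]
        for cluster in clusters
    ]

# Three hypothetical predictors measured on four samples.
predictors = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
clusters = cluster_predictors(predictors)
reduced = collapse_clusters(predictors, clusters)
```

Collapsing clusters in this way shrinks the predictor matrix before regression, which is where the computational savings for very high-dimensional data would come from.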
Allows users to construct individualized diploid genomes and transcriptomes for multiparent populations. Seqnature performs two main functions: (1) for inbred strains, it incorporates single nucleotide polymorphisms (SNPs) into a reference genome to create a strain-specific genome sequence; and (2) for multiparent populations, it uses inferred founder haplotypes to construct a pair of individualized haploid genome sequences incorporating known SNPs.
Stores code and certain raw materials for a detailed RNA-seq tutorial. Informatics for RNA-seq is an educational tutorial and working demonstration pipeline for RNA-seq analysis including an introduction to: cloud computing, next generation sequence file formats, reference genomes, gene annotation, expression analysis, differential expression analysis, alternative splicing analysis, data visualization, and interpretation.
Allows users to correct read counts from various sequencing data. FIXSEQ is an R package that is able to decrease noise while increasing stability in subsequent inference procedures as well as complementing existing literature on applications of heavy-tailed distributions. The program can be applied on many types of sequencing assays and downstream processing algorithms, without additional assumptions.
Constructs ortholog annotations for comparative transcriptome analysis between closely-related species. XSAnno is built as a pipeline that includes whole genome alignment and local alignment methods. This software exploits several filters to reduce the number of false positives caused by differences in mappability. The pipeline has been tested on human and non-human primate brain transcriptome data, but can be applied to other species.
Predicts transcription units. TrUC uses oriented data to make its predictions, and it predicts introns from un-oriented RNA-Seq data. It helps improve Paramecium gene models not only by predicting Transcription Start Sites (TSS) and Transcription Termination Sites (TTS) when Cap-Seq and oriented mRNA-Seq data are available, but also through its intron predictions. The tool can be useful where gene density is high, as in Paramecium.
Allows study of extracellular vesicle (EV) mediated mRNA transfers between cells. EVtransfer also investigates the role of exosomes as a vehicle in mediating the exchange. The software enables quality control, alignment, mapping, and base call recalibration on raw SNP array and RNA sequencing read data, evaluation of the significance of genotypic variation of a cell line under in vitro co-culture, and estimation of the rate of falsely discovered loci involved in the transfer process.
Identifies enriched sites from differential RNA-seq experiments comprising enriched and unenriched libraries. ToNER uses a global distribution model to report statistics of enrichment for all nucleotides. It calculates position-wise normalized read depth ratio between two libraries for all mapped genome positions. The tool is able to identify transcription start site (TSS) from Cappable-seq data in prokaryotes. It can locate enriched positions in complex data of eukaryotes such as m6A-seq.
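The position-wise normalized read depth ratio can be sketched as follows. This is a simplified illustration rather than ToNER's actual computation: normalizing by total library depth, the pseudocount, and the function name are all assumptions, and the global distribution model used for the enrichment statistics is not shown.

```python
def normalized_depth_ratio(enriched, unenriched, pseudocount=1.0):
    """Per-position ratio of depth-normalized coverage, enriched over unenriched.

    Assumptions for this sketch: each library is normalized by its total
    depth, and a positive pseudocount avoids division by zero.
    """
    if len(enriched) != len(unenriched):
        raise ValueError("libraries must cover the same positions")
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    n = len(enriched)
    total_e = sum(enriched) + pseudocount * n
    total_u = sum(unenriched) + pseudocount * n
    return [
        ((e + pseudocount) / total_e) / ((u + pseudocount) / total_u)
        for e, u in zip(enriched, unenriched)
    ]

# Hypothetical read depths at four genome positions.
ratios = normalized_depth_ratio([0, 10, 0, 0], [5, 5, 5, 5])
best_position = max(range(len(ratios)), key=ratios.__getitem__)
```

Positions with a ratio well above the bulk of the genome are the candidates that a distribution-based test would then score for enrichment.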
Reconstructs the haplotype-specific isoforms from long single-molecule reads. HapIso is a comprehensive method for the accurate reconstruction of the haplotype-specific isoforms of a diploid cell that uses the splice mapping of the long single-molecule reads and partitions the reads into parental haplotypes. To overcome gapped coverage and splicing structures of the gene, the haplotype reconstruction procedure is applied independently for regions of contiguous coverage defined as transcribed segments.
Allows genome-scale simulation of transcription and translation at individual molecule and single base-pair resolution. TABASCO is a single-molecule stochastic simulator optimized to handle molecular events specific to gene expression such as the initiation, elongation and termination of transcription and translation as well as interactions among protein-DNA complexes. The software provides the descriptions of gene expression dynamics while allowing analysis of phenomena such as how intermolecular events between DNA-protein complexes affect system-wide gene expression.
A data-adaptive approach was developed to estimate the lower bound of high expression for RNA-seq data. The Kolmogorov-Smirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value for separating transcripts with high and low expression. Results from the proposed method were compared to results obtained by estimating the theoretical cutoff of a fitted two-component mixture distribution. The robustness of the proposed method was demonstrated by analyzing different RNA-seq datasets that varied by sequencing depth, species, scale of measurement, and empirical density shape.
Enables query and combination of complex data sources to facilitate research and discovery. Sidekick is a web-based biological decision-making framework that focuses on bottom-up discovery and organization of belief. The software allows scoring and manipulation of gene pair lists as well as management of user beliefs. The software supports three queries, four filters, six genomes, several ways to combine results, and methods for saving and restoring workflows and data.
Predicts transcripts using long RNA (lRNA)-seq reads. Traphlor is composed of two steps: constructing a splicing graph and solving the assembly problem on top of this graph. It models long reads as subpath constraints. This tool transforms paired-end reads into long reads by a local assembly method which fills the gap between them. It is sensitive to high levels of noise, e.g. alignment errors near splice sites.
Allows generation of connected sub-graphs from datasets of RNA-Seq reads. MapReduce-Inchworm permits management of massively distributed queries, including for genome analysis, and is useful for processing high-throughput sequencing datasets more efficiently. It can cluster k-mers into multiple groups, each of which should contain k-mers from the same gene. Its main functions are map, collate, and reduce.
A powerful test for finding eQTL effects based upon RNA-seq data. globalSeq can be computed efficiently and is able to handle sets of highly correlated covariates. The proposed test can detect genetic and epigenetic alterations that affect gene expression. It can examine complex regulatory mechanisms of gene expression.