Coding and noncoding region discrimination software tools | Transcription data analysis
With the availability of genome-wide transcription data and massive comparative sequencing, the discrimination of coding from noncoding RNAs and the assessment of coding potential in evolutionarily conserved regions arose as a core analysis task.
Identifies functional protein-coding transcripts with greater sensitivity by detecting the termination of translation at the end of an open reading frame. The ribosome release score (RRS) is well suited to distinguishing real translation from non-ribosomal contamination: protection of RNA by non-ribosomal proteins should show no bias for the presence of a stop codon, so the score is robust to it. It provides a valuable metric for prioritizing candidates for more in-depth characterization.
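The idea behind RRS, that translating ribosomes are released at the stop codon, can be illustrated with a minimal sketch. It assumes a per-base ribosome footprint coverage vector for a single transcript and simply compares footprint density inside the ORF with density downstream of it. The function names are hypothetical, and this simplified ratio is not the published RRS definition, which may include further normalization.

```python
from typing import Sequence


def mean_coverage(coverage: Sequence[float], start: int, end: int) -> float:
    """Mean per-base footprint coverage over the half-open interval [start, end)."""
    if not 0 <= start < end <= len(coverage):
        raise ValueError("interval must be non-empty and lie inside the transcript")
    return sum(coverage[start:end]) / (end - start)


def release_ratio(coverage: Sequence[float], orf_start: int, orf_end: int) -> float:
    """Footprint density inside the ORF divided by density downstream of it.

    A translated ORF is expected to show a sharp drop in footprints after the
    stop codon (high ratio); protection by non-ribosomal proteins is not.
    """
    inside = mean_coverage(coverage, orf_start, orf_end)
    downstream = mean_coverage(coverage, orf_end, len(coverage))
    if downstream == 0:
        # No downstream footprints at all: infinite ratio if the ORF is covered.
        return float("inf") if inside > 0 else 0.0
    return inside / downstream


# Illustrative (made-up) coverage: a 30-nt ORF followed by a 20-nt 3' region.
example_coverage = [10] * 30 + [1] * 20
example_ratio = release_ratio(example_coverage, orf_start=0, orf_end=30)
```

A high ratio is consistent with ribosome release at the stop codon, while a ratio near 1 suggests that coverage does not respond to the end of the ORF.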
Distinguishes protein-coding from non-coding RNAs. CPC employs a discriminative model based on four sequence-intrinsic features. The CPC model is species-neutral, making it useful for the ever-growing number of non-model organism transcriptomes, including those of organisms that are poorly annotated or lack a genome assembly. The web server is mobile-friendly and can be used on devices such as the iPad.
A signature-based tool that profiles adjoining nucleotide triplets to distinguish protein-coding from non-coding sequences independently of known annotations. CNCI is effective for classifying incomplete transcripts and sense-antisense pairs. Applied across species, CNCI classified transcripts assembled from whole-transcriptome sequencing data with high accuracy, demonstrated evolutionary divergence of genes between vertebrates and invertebrates, or between plants, and provided a long non-coding RNA catalog for orangutan.
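As a rough illustration of what profiling adjoining nucleotide triplets can mean, the sketch below counts pairs of neighbouring codons in a chosen reading frame. The function name is hypothetical and this is only the counting step; it does not reproduce how CNCI scores or classifies sequences.

```python
from collections import Counter


def adjoining_triplet_counts(seq: str, frame: int = 0) -> Counter:
    """Count pairs of adjacent triplets (codon i, codon i+1) in one reading frame."""
    seq = seq.upper()
    # Split the sequence into complete triplets starting at the given frame offset.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Pair each triplet with its immediate downstream neighbour.
    return Counter(zip(codons, codons[1:]))


counts = adjoining_triplet_counts("ATGGCCATGGCC")
# counts[("ATG", "GCC")] is 2 and counts[("GCC", "ATG")] is 1
```

Comparing such pair counts between known coding and non-coding sequences is one way to derive a frame-sensitive signal without relying on annotation.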
An alignment-free program that annotates lncRNAs using a Random Forest model trained on general features such as multi k-mer frequencies and relaxed open reading frames. Benchmarking against five state-of-the-art tools shows that FEELnc achieves similar or better classification performance on GENCODE and NONCODE datasets. FEELnc also provides several specific modules that make it possible to fine-tune classification accuracy, to formalize the annotation of lncRNA classes, and to annotate lncRNAs even in the absence of a training set of noncoding RNAs.
Establishes a central, redistributable workbench for scientists and programmers working with RNA-related data. The RNA workbench builds a sustainable community around the platform. It is unique in combining available tools, workflows and training material, as well as providing easy access for experimentalists. It serves as a central hub for programmers, who can easily integrate and deploy their existing or novel tools and workflows.
Discovers long non-coding RNAs (lncRNAs) in plants by classifying transcripts as coding or long non-coding. PLncPRO predicts lncRNAs in plants from various sequence features extracted from the training data, using a random forest (RF) algorithm. The software was used to predict lncRNAs in two crop plants, chickpea and rice, under abiotic stress conditions; the lncRNAs identified provide a resource for future studies of their exact function in abiotic stress responses.
Provides a platform for identifying the coding or noncoding nature of conserved sequence tags (CSTs). CSTminer generates a coding potential score (CPS) for CSTs previously identified in a pairwise genome comparison and uses it to determine the probable nature of the investigated genomic region. The application lets users submit their own data or access Ensembl genomes through a search by chromosome coordinates or gene name.
Predicts the coding potential for a given transcript. COME integrates multiple sequence-derived and experiment-based features using a decompose-compose method, which makes COME’s performance more accurate and robust than other well-known tools. First, COME composes the feature matrix for the given transcripts from the pre-calculated feature vectors. Second, COME predicts the coding potential with its pre-trained models, using the feature matrix generated in the first step.
Classifies biological sequences based on features extracted from complex network measurements. BASiNET is a feature extraction method for classifying biological sequences (RNAs) based on complex networks and their topological measures. The software requires neither prior annotation of the genome nor alignment of the sequences against a database. New sequences can be classified for organisms on which a model has already been trained.
Predicts protein-coding potential on transcripts. mRNN is capable of learning true defining features of mRNAs, including trinucleotide patterns and depletion of in-frame stop codons after the start of an open-reading frame. It is able to leverage long-range information dependencies for classification, as evidenced by pairwise mutation analysis. This tool integrates human knowledge of mRNA structure into its learning process.
A method to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region. It does not rely on homology to known protein sequences; instead, it examines evolutionary signatures characteristic of alignments of conserved coding regions, such as high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and low frequencies of other missense and nonsense substitutions (CSF = Codon Substitution Frequencies).
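The substitution categories that CSF relies on can be made concrete with a small sketch. It assumes two in-frame, gap-aligned sequences of equal length and the standard genetic code, and it only tallies identical, synonymous, missense, and nonsense codon pairs. The function names are hypothetical; the actual CSF score is built from substitution frequencies estimated on training data, which this sketch does not attempt, and it does not separate conservative from other missense changes.

```python
from collections import Counter

# Standard genetic code in the conventional TCAG ordering; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_pair(ref: str, other: str) -> str:
    """Label one aligned codon pair by the kind of substitution it represents."""
    if ref not in CODON_TABLE or other not in CODON_TABLE:
        return "skipped"  # gaps or ambiguous bases
    if ref == other:
        return "identical"
    aa_ref, aa_other = CODON_TABLE[ref], CODON_TABLE[other]
    if aa_ref == aa_other:
        return "synonymous"
    if "*" in (aa_ref, aa_other):
        return "nonsense"
    return "missense"


def tally_substitutions(ref_seq: str, other_seq: str) -> Counter:
    """Tally substitution categories over all complete codons in frame 0."""
    if len(ref_seq) != len(other_seq):
        raise ValueError("aligned sequences must have equal length")
    ref_seq, other_seq = ref_seq.upper(), other_seq.upper()
    return Counter(
        classify_codon_pair(ref_seq[i:i + 3], other_seq[i:i + 3])
        for i in range(0, len(ref_seq) - 2, 3)
    )


tally = tally_substitutions("ATGGCTAAATGG", "ATGGCCAGATGA")
```

An alignment dominated by identical and synonymous pairs with few nonsense changes is the kind of pattern the description above associates with conserved coding regions.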
Predicts coding sequences (CDSs) in transcripts. FrameDP can adapt itself to different levels of sequence quality. It provides an automatic protein description based on InterPro domain content. The functional annotation capabilities rely on BioMoby web services and on the REMORA workflow manager. FrameDP can use multiple Markov models and can handle degenerate sequences both for signals and inside Markov models.
Classifies ncRNAs using primary information from a deep sequencing experiment, i.e. the relative positions and lengths of reads. Because ALPS is not designed to recognize one particular class of ncRNA, it can be used to detect novel ncRNA classes, as long as these unknown ncRNAs have a characteristic pattern of deep sequencing read lengths and positions. The software is able to classify known ncRNAs with high sensitivity and specificity.
Calculates the coding potential of a transcript using a machine learning model (random forest) based on multiple features including sequence characteristics of putative open reading frames, translation scores based on ribosomal coverage, and conservation against characterized protein families. LncRNA-ID competes favorably with existing coding potential computation tools in lncRNA identification.
Allows the extraction of any genomic region surrounding an annotated coding sequence and its upload to a MySQL database. RRE is a parser suitable for users wishing to generate genome-wide, specific sequence-feature datasets, such as putative promoters, first non-coding exons, or all introns of a specific chromosome or contig. The RRE database is used to retrieve annotated putative regulatory regions as well as the non-coding regions linked to orthologue annotations.
A program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene-finding software. It is open source software and available for all major platforms.
Identifies long non-coding RNAs (lncRNAs) among assembled novel transcripts. lncScore can also be used to calculate coding potential. This alignment-free tool uses a logistic regression model with 11 carefully selected features. lncScore accurately distinguishes lncRNAs from mRNAs, especially partial-length mRNAs, in the human and mouse datasets. In addition, lncScore performed well on transcripts from five other species (zebrafish, fly, C. elegans, rat, and sheep). To speed up prediction, multithreading is implemented within lncScore.
Reduces the number of features calculated from nucleotides of transcripts. longdist is a Support Vector Machine (SVM) based method to distinguish long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs), using features from the nucleotide patterns (frequencies of di-, tri- and tetranucleotides) of transcripts, chosen with the support of Principal Component Analysis (PCA), together with Open Reading Frame (ORF) length and ORF relative length.
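The kinds of features named for longdist (di-, tri- and tetranucleotide frequencies, ORF length, and ORF relative length) can be computed with a short sketch. The function names are hypothetical, the ORF definition (ATG to the first in-frame stop on the forward strand, stop codon included) is an assumption, and the PCA-based feature selection and SVM training steps are left out.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of every possible k-mer over the alphabet ACGT."""
    seq = seq.upper()
    total = len(seq) - k + 1  # number of overlapping windows of width k
    counts = Counter(seq[i:i + k] for i in range(total))
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total > 0 else 0.0) for kmer in kmers}


def longest_orf_length(seq: str) -> int:
    """Length in nt of the longest forward-strand ORF (ATG through stop codon)."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # look for the next ORF in this frame
    return best


def feature_vector(seq: str) -> list:
    """Concatenate 2-, 3- and 4-mer frequencies with ORF length and relative length."""
    features = []
    for k in (2, 3, 4):
        features.extend(kmer_frequencies(seq, k).values())
    orf_len = longest_orf_length(seq)
    features.append(orf_len)
    features.append(orf_len / len(seq) if seq else 0.0)
    return features


vector = feature_vector("CCATGAAATAGGG")
```

A vector like this has 16 + 64 + 256 frequency entries plus the two ORF features, which is why a dimensionality-reduction step such as PCA is attractive before training a classifier.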
Provides a prototype noncoding RNA genefinder, based on comparative genome sequence analysis. QRNA detects conserved RNA secondary structures, including both ncRNA genes and cis-regulatory RNA structures. It uses three different probabilistic models (for RNA-structure-constrained, coding-constrained, and position-independent evolution) to examine the pattern of mutations in a pairwise sequence alignment. The alignment is classified as RNA, coding, or other, according to the Bayesian posterior probability of each model. This program is freely available for download.
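The final classification step described for QRNA, choosing among three models by Bayesian posterior probability, can be sketched as follows. It assumes the log-likelihood of the alignment under each model has already been computed elsewhere; the function names, the uniform default prior, and the numeric values are illustrative assumptions, not QRNA's actual interface or parameters.

```python
import math


def posterior_probabilities(log_likelihoods: dict, priors: dict = None) -> dict:
    """Posterior P(model | alignment) from per-model log-likelihoods and priors."""
    models = list(log_likelihoods)
    if priors is None:
        # Assumption for this sketch: a uniform prior over the models.
        priors = {m: 1.0 / len(models) for m in models}
    log_joint = {m: log_likelihoods[m] + math.log(priors[m]) for m in models}
    # Log-sum-exp normalization avoids underflow for very negative log-likelihoods.
    peak = max(log_joint.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_joint.values()))
    return {m: math.exp(v - log_norm) for m, v in log_joint.items()}


def classify(log_likelihoods: dict, priors: dict = None) -> str:
    """Return the model with the highest posterior probability."""
    posterior = posterior_probabilities(log_likelihoods, priors)
    return max(posterior, key=posterior.get)


# Made-up log-likelihoods for one pairwise alignment under the three models.
scores = {"RNA": -1040.0, "coding": -1046.5, "other": -1052.0}
label = classify(scores)
```

Working in log space matters here because alignment likelihoods are typically far too small to represent directly as floating-point probabilities.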