Gene detection software tools | RNA sequencing data analysis
Locating protein-coding genes in novel genomes is essential to understanding and exploiting genomic information, but accurately predicting every gene remains difficult. The recent availability of detailed transcript-structure information from high-throughput sequencing of messenger RNA (RNA-Seq) delineates many expressed genes and promises more accurate gene prediction.
Searches a protein database using a translated nucleotide query. BLASTX is a BLAST search application that compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. This application can also work in Blast2Sequences mode and can send BLAST searches over the network to the public NCBI server if desired.
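The six-frame conceptual translation mentioned above can be illustrated with a minimal Python sketch. It is not BLASTX code: the function names (`six_frame_translations`, `translate`, `reverse_complement`) are hypothetical, the standard genetic code is assumed, and unknown codons are written as "X" as a simplification.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; incomplete trailing codons are dropped."""
    # Codons with ambiguous bases (e.g. N) are written as "X" (an assumption).
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For example, `six_frame_translations("ATGGCCTAA")["+1"]` gives `"MA*"`, where `*` marks a stop codon. Each of the six products would then be compared against the protein database.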
Builds transcriptomes from RNA-seq data. Trinity is standalone software composed of three main components: (i) Inchworm, which first generates transcript contigs; (ii) Chrysalis, which clusters them and constructs a complete de Bruijn graph for each cluster; and (iii) Butterfly, which processes the individual graphs in parallel to reconstruct the transcript sequences.
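To make the de Bruijn graph idea concrete, here is a minimal Python sketch of how such a graph can be built from reads. It is a generic textbook construction, not Trinity's implementation; the function name `build_de_bruijn_graph` and the choice of k are assumptions for illustration.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns a dict mapping each (k-1)-mer prefix to the set of
    (k-1)-mer suffixes that follow it in at least one read.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix -> suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

Overlapping reads such as "ATGG" and "TGGC" with k = 3 yield the same path (AT, TG, GG, GC) as the single sequence "ATGGC". Walking such paths is how transcript sequences can be reconstructed from a graph.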
A highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. This approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci.
Aims for high sensitivity in identifying existing genes. Prodigal is a gene-finding program for annotating draft or finished microbial genome sequences. It was developed to predict translation initiation sites more accurately and to minimize the number of false-positive predictions. This method can be useful for automated microbial annotation pipelines.
Identifies possible linkages between protein-coding portions derived from a single genomic locus (mostly corresponding to exons) that are split across unassembled contigs. It requires that potential split protein-coding sequences do not overlap substantially, and that the two potential fragments have similar evolutionary distances to their full-length counterpart in reference genomes.
A web server designed for identifying protein-coding regions in expressed sequence tag (EST)-derived sequences. For query sequences with a hit in BLASTX, the program predicts the coding regions based on the translation reading frames identified in the BLASTX alignments; otherwise, it predicts the most probable coding region based on the intrinsic signals of the query sequences. The output is the predicted peptide sequences in FASTA format, with a definition line that includes the query ID, the translation reading frame and the nucleotide positions where the coding region begins and ends. The predicted protein sequences can then be used as input for additional annotation tools, such as InterProScan for identifying protein families, domains and functional sites, the Conserved Domain Search service for detecting structural and functional domains, and SignalP for locating potential signal peptides.
A tool for ab initio identification of protein-coding regions in RNA transcripts. The algorithm parameters are estimated by unsupervised training, which removes the need to prepare manually curated training sets.
A pipeline for unsupervised RNA-seq-based genome annotation that combines the advantages of GeneMark-ET and AUGUSTUS. As input, BRAKER1 requires a genome assembly file and a file in BAM format with spliced alignments of RNA-seq reads to the genome. First, GeneMark-ET performs iterative training and generates initial gene structures. Second, AUGUSTUS uses the predicted genes for training and then integrates RNA-seq read information into the final gene predictions. In our experiments, we observed that BRAKER1 was more accurate than MAKER2 when RNA-seq was the sole source for training and prediction. BRAKER1 does not require pre-trained parameters or a separate expert-prepared training step.
A gene prediction method that identifies potential coding regions based exclusively on the mapping of reads from an RNA-Seq experiment. GIIRA was designed primarily for prokaryotic gene prediction and is able to resolve genes within the expressed region of an operon. However, it is also applicable to eukaryotes and predicts exon-intron structures as well as alternative isoforms.
An efficient and fast genome scaffolding method that uses proteins to scaffold genomes. The pipeline aims to recover protein-coding gene structures. We tested the method on human contigs; using human UniProt proteins as guides, N50 size increased by 17% with an accuracy of ∼97%. PEP_scaffolder improved the proportion of fully covered proteins among all proteins, which was close to the proportion in the finished genome. The method provided a high accuracy of 91% using orthologs of distant species. Tested on simulated fly contigs, PEP_scaffolder outperformed other scaffolders, with the shortest running time and the highest accuracy.
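N50, the statistic quoted above, is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal Python sketch of that standard definition follows; it is not taken from PEP_scaffolder, and the function name `n50` is an assumption.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

Comparing `n50` of the contig lengths before scaffolding with `n50` of the scaffold lengths afterwards gives the kind of relative improvement reported for a scaffolder.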
A multi-level bioinformatics protocol and pipeline. RNAMiner includes five steps: mapping RNA-Seq reads to a reference genome, calculating gene expression values, identifying differentially expressed genes, predicting gene functions, and constructing gene regulatory networks.
A gene prediction pipeline that uses RNA-Seq data to train and provide hints for the generation of Hidden Markov Model (HMM)-based gene predictions and to evaluate the resulting models. The pipeline has been developed and streamlined by comparing its predictions to manually curated gene models in three fungal genomes and validated against the high-quality gene annotation of Neurospora crassa.
Combines Genewise with our own homology-based method, AlignFS, to identify protein-coding regions and correct erroneous frame-shifts, producing sequences suitable for subsequent phylogenetic analysis. We compared AlignWise against other open reading frame finding software and demonstrated that the AlignFS algorithm is more accurate than Genewise at correcting frame-shifts within an order. We show that AlignWise provides the greatest accuracy at higher evolutionary distances, out-performing both AlignFS and Genewise individually. AlignWise produces a single ORF per transcript and identifies and corrects frame-shifts with high accuracy. It is therefore well suited for analysing novel transcriptome assemblies and EST sequences in the absence of a reference genome.
Computes the number of RNA sequences that code user-specified peptides in one to six overlapping reading frames. RNAsampleCDS can sample a user-specified number of messenger RNAs that code such peptides and compute the position-specific scoring matrix and codon usage bias for all such RNA sequences. RNAsampleCDS runs in linear time and space, although if guanine-cytosine content is optionally controlled, then time and space requirements are quadratic.
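The simplest case of this counting problem, a peptide coded in a single reading frame, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard genetic code, ignores overlapping reading frames and GC-content control, and is not the RNAsampleCDS algorithm. The function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1) over the RNA alphabet.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_FOR = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR[aa].append("".join(codon))


def count_coding_sequences(peptide):
    """Count RNA sequences that code the peptide in one reading frame."""
    # Positions are independent, so the count is a product of codon choices.
    return prod(len(CODONS_FOR[aa]) for aa in peptide)


def sample_coding_sequence(peptide, rng=random):
    """Sample one coding RNA uniformly by choosing a codon per residue."""
    return "".join(rng.choice(CODONS_FOR[aa]) for aa in peptide)
```

With one frame the count is a plain product, for instance 1 × 2 × 6 = 12 sequences for the peptide "MKR". Requiring several overlapping frames to code peptides at once couples neighbouring positions, and that is the harder case the tool addresses.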
Allows users to detect candidate coding regions within transcript sequences. TransDecoder is standalone software that starts from either a FASTA or a GFF file. It can also scan open reading frames (ORFs) for homology to known proteins using a BlastP or Pfam search, retain those with hits, and incorporate the results into the final selection. Predictions can then be visualized in a genome browser such as IGV.
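As a rough illustration of what scanning a transcript for open reading frames involves, the Python sketch below reports ATG-to-stop stretches in the three forward frames. It is a deliberately naive assumption-laden example, not the scoring method used by the tool described above; the function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Find forward-strand ORFs as (start, end, frame) tuples.

    start and end are 0-based, end-exclusive, and include the stop codon.
    min_codons is the minimum number of codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

A real pipeline would also scan the reverse strand, handle ORFs that run off the transcript end, and rank candidates, for instance using homology evidence.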
Helps users automatically train a eukaryotic ab initio gene-finding algorithm. GeneMark-ET improves on GeneMark-ES by integrating RNA-Seq read alignments into the self-training procedure, and it employs semi-supervised training to estimate the parameters of the hidden semi-Markov model. This method accelerates the annotation process in large genomes while improving the accuracy of gene identification.
Simplifies the execution of heterogeneous biological sequence analysis tools and the integration of their data. Pegasys allows users to create workflows from any combination of its available programs by dragging, dropping and linking graphical icons that represent sequence analysis tools. It can execute and integrate results from ab initio gene prediction, pair-wise and multiple sequence alignments, RNA gene detection and masking of repetitive sequences.
Provides a method for gene identification in prokaryotic genome sequences. REGANOR is an online pipeline that uses the gene finders GLIMMER and CRITICA and returns a set of reliable predictions based on a number of different parameters. This web server includes a search for ribosomal RNA (rRNA) and transfer RNA (tRNA) genes. The results can also be used to identify missing genes in genome re-annotation efforts.