
CESAR / Coding Exon Structure Aware Realigner

Avoids spurious mutations while being able to report real mutations, both on simulated and real data. CESAR is a hidden Markov model (HMM)-based method that enhances the utility of genome alignments for comparative gene annotation by (i) being significantly faster and more memory efficient, which allows routine application without large computer resources, (ii) improving the ability to identify distal splice site shifts, which increases the accuracy of gene annotation, and (iii) providing a new gene mode that is able to detect complete intron deletions and can be used to annotate entire genes instead of individual exons.


A gene finder based on a Generalized Hidden Markov Model (GHMM). Although the gene finder conforms to the overall mathematical framework of a GHMM, additionally it incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. It also utilizes Interpolated Markov Models for the coding and noncoding models. Currently, GlimmerHMM's GHMM structure includes introns of each phase, intergenic regions, and four types of exons (initial, internal, final, and single).

DOGMA / DOmain-based General Measure for transcriptome and proteome quality Assessment

A program for fast and easy quality assessment of transcriptome and proteome data based on conserved protein domains. DOGMA measures the completeness of a given transcriptome or proteome and provides information about domain content for further analysis. A quality assessment with DOGMA takes seconds. It achieves completeness scores similar to those of existing programs, but is able to run much faster when it is used in combination with a fast domain annotation tool such as UProC.


Predicts genes in anonymous genomic sequences. The program is designed with a hierarchical structure. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons.
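The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program's actual implementation: the function names (`pwm_score`, `exon_score`, `best_gene`), the uniform background frequency, and the exon tuples are all hypothetical, and the assembly step ignores the frame and site-compatibility rules a real gene finder has to enforce.

```python
import math
from bisect import bisect_right

def pwm_score(site, pwm, background=0.25):
    """Step 1: log-likelihood ratio of a candidate site under a PWM.

    pwm is a list of dicts, one per position, mapping base -> probability.
    A uniform background is assumed for simplicity.
    """
    return sum(math.log(col[base] / background) for base, col in zip(site, pwm))

def exon_score(left_site_score, right_site_score, coding_llr):
    """Step 2: sum of the defining site scores plus the coding-model score."""
    return left_site_score + right_site_score + coding_llr

def best_gene(exons):
    """Step 3: choose non-overlapping exons maximizing the summed score.

    exons: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chosen_exons).
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # Count earlier exons that end strictly before this one starts.
        j = bisect_right(ends, start - 1, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

if __name__ == "__main__":
    candidates = [(1, 10, 5.0), (5, 20, 6.0), (11, 30, 4.0)]
    print(best_gene(candidates))
```

The assembly step here is a weighted interval scheduling recurrence: each exon is either skipped or appended to the best chain that ends before it starts, so the work after sorting is one binary search per exon.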


A web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. mGene.web offers pre-trained models for the recognition of gene structures, including untranslated regions, in an increasing number of organisms. With mGene.web, users can also train the system with their own data for other organisms at the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes.


An automated pipeline that performs gene prediction using self-trained HMM models and transcriptomic data. The program processes the genome and transcriptome sequences of a target species through a GlimmerHMM, SNAP, and AUGUSTUS training pipeline that ends with MAKER2 combining the predictions from the three models with the transcriptomic evidence. The pipeline generates species-specific HMMs and is able to predict genes that are not biased toward other model organisms. Our evaluation of the program revealed that it performed better than a standalone program using the HMM of the most closely related species.

NPACT / N-Profile Analysis Computational Tool

A computational and graphical representation tool for gene identification and sequence annotation. NPACT identifies sequence segments of any length that show statistically significant 3-base compositional periodicities and are associated with ORF structures. NPACT produces graphical representations that allow genome-wide uninterrupted visual comparison of compositional profiles, pre-annotated genes and sequence segments of three-base periodicity with ‘Newly Identified ORFs’, enabling frame analysis on a genomic scale.
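One simple way to see what a 3-base compositional periodicity looks like is to measure base composition separately at each of the three positions of a reading frame. The sketch below computes G+C content per position; the function name `gc_by_codon_position` is hypothetical, and the statistic NPACT actually uses to test significance is not described here, so this only illustrates the underlying signal.

```python
def gc_by_codon_position(seq):
    """Return the G+C fraction at positions 1, 2 and 3 of successive triplets.

    In a segment with three-base periodicity the three fractions differ
    markedly; in aperiodic sequence they tend to be similar.
    """
    seq = seq.upper()
    fractions = []
    for phase in range(3):
        bases = seq[phase::3]  # every third base, starting at this phase
        gc = sum(1 for base in bases if base in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions

if __name__ == "__main__":
    print(gc_by_codon_position("ATGATGATG"))
```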


Predicts coding exons on a target human sequence (may also work for other mammalian sequences as targets), based on comparison with a homologous sequence from a different species. Rosetta identifies coding exons in both species based on coincidence of genomic structure (splice sites, exon number, exon length, coding frame, and sequence similarity). Rosetta performed well in identifying coding exons, showing 95% sensitivity and 97% specificity at the nucleotide level. The program is available online.
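Nucleotide-level sensitivity and specificity can be computed as sketched below. This assumes the convention common in gene-prediction benchmarks, where sensitivity is the fraction of true coding nucleotides that were predicted and specificity is the fraction of predicted coding nucleotides that are truly coding; whether the figures above were computed exactly this way is not stated. The function name and the exon coordinates are hypothetical.

```python
def nucleotide_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    Exons are (start, end) pairs with inclusive coordinates.
    """
    def covered(exons):
        # Set of all nucleotide positions covered by at least one exon.
        positions = set()
        for start, end in exons:
            positions.update(range(start, end + 1))
        return positions

    true_pos = covered(true_exons)
    pred_pos = covered(predicted_exons)
    shared = len(true_pos & pred_pos)
    sensitivity = shared / len(true_pos) if true_pos else 0.0
    specificity = shared / len(pred_pos) if pred_pos else 0.0
    return sensitivity, specificity

if __name__ == "__main__":
    print(nucleotide_accuracy([(1, 10)], [(6, 15)]))
```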

AnABlast / Ancestral-patterns search through A BLAST-based strategy

Generates profiles of accumulated alignments in query amino acid sequences using a low-stringency BLAST strategy. To validate this approach, all six-frame translations of DNA sequences between every two annotated exons of the fission yeast genome were analysed with AnABlast. AnABlast-generated profiles identified three new copies of known genes, and four new genes supported by experimental evidence. New pseudogenes, ancestral carboxyl- and amino-terminal subtractions, complex gene rearrangements, and ancient fragments of mitDNA and of bacterial origin, were also inferred. Thus, this novel in silico approach provides a powerful tool to uncover new genes, as well as fossil-coding sequences, thus providing insight into the evolutionary history of annotated genomes.
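The idea of a profile of accumulated alignments can be illustrated by counting, for each query position, how many alignments cover it. The sketch below assumes the alignment coordinates have already been extracted from the search output; the function name `accumulation_profile` is hypothetical, and each alignment counts once, whereas the tool itself may weight alignments differently.

```python
def accumulation_profile(query_length, hits):
    """Count how many alignments cover each position of the query.

    hits: iterable of (start, end) pairs, 1-based inclusive query coordinates.
    Returns a list of length query_length.
    """
    # Difference array: +1 where an alignment starts, -1 just past its end.
    diff = [0] * (query_length + 1)
    for start, end in hits:
        diff[start - 1] += 1
        diff[end] -= 1
    profile = []
    running = 0
    for delta in diff[:query_length]:
        running += delta
        profile.append(running)
    return profile

if __name__ == "__main__":
    print(accumulation_profile(6, [(1, 3), (2, 5), (3, 3)]))
```

Peaks in such a profile mark regions where many low-stringency alignments pile up, which is the kind of signal the description above attributes to candidate coding regions.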


Leverages information from multiple related species to identify those genes whose existence can be verified through comparison with known gene families, but which have not been predicted. OrthoFiller is software designed to address the problem of finding and adding such missing genes to genome annotations. When applied to existing “complete” genome annotations, OrthoFiller can identify and correct substantial numbers of erroneously missing genes.


A state-of-the-art gene finder based on the Generalized Hidden Markov Model framework, similar to Genscan and Genie. GeneZilla is highly reconfigurable and includes software for retraining by the end-user. It is written in highly optimized C++. The run time and memory requirements are linear in the sequence length, and are in general much better than those of competing systems, due to GeneZilla's novel decoding algorithm. Graph-theoretic representations of the high scoring open reading frames are provided, allowing for exploration of sub-optimal gene models. It utilizes Interpolated Markov Models (IMMs), Maximal Dependence Decomposition (MDD), and includes states for signal peptides, branch points, TATA boxes, CAP sites, and will soon model CpG islands as well.


A fully automatic high-accuracy identification system that provides consensus prokaryotic CDS information. CONSORF first predicts the CDSs supported by consensus alignments. The alignments are derived from multiple genome-to-proteome comparisons with other prokaryotes using the FASTX program. Then, it fills the empty genomic regions with the CDSs supported by consensus ab initio predictions. From those consensus results, CONSORF provides prediction reliability scores, predicted frame-shifts, alternative start sites and best pair-wise match information against other prokaryotes. CONSORF features a genome browser that provides comprehensive and consistent annotation information, including reliability scores, predicted frame-shifts, candidate start sites and the best match.


A program for comparative ab initio prediction of protein coding genes in mouse and human DNA. Doublescan takes two input DNA sequences (one from mouse, one from human) which are known to be or which seem to be similar to each other and simultaneously predicts the genes of both sequences as well as the alignment of the two sequences. Doublescan can model partial, complete and multiple genes (as well as no genes at all) and can also align pairs of genes which are related by events of exon-fusion or exon-splitting. The mathematical method underlying Doublescan is a pair hidden Markov model.

Exogean / EXpert On GEne Annotation

Predicts transcripts from human mRNA and mouse protein sequence alignments. Exogean enables prediction of several alternative transcripts per gene. It can be useful for annotation of eukaryote protein coding genes based on alignments with proteins from a different species and/or mRNAs from the same species. This tool produces information on each predicted gene and transcript that summarizes their structure, the evidence used, the problems and conflicts encountered and the solutions applied.