Absent word identification software tools | Genome annotation
The search for short words that are absent in the genome of one or more organisms (neverwords, also known as nullomers) is attracting growing interest because of the impact they may have in recent molecular biology applications.
Classical notions for sequence comparison are increasingly being replaced by other similarity measures that refer to the composition of sequences in terms of their constituent patterns. One such measure is based on minimal absent words. MAW is a linear-time and linear-space algorithm that computes minimal absent words using the suffix array.
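To make the notion concrete: a word is a minimal absent word of a sequence if it does not occur in the sequence while all of its proper factors do. The following is a minimal brute-force sketch of that definition in Python. It is not MAW's suffix-array algorithm; it takes quadratic space and is only meant for tiny inputs. The function name and the `alphabet` parameter are illustrative assumptions.

```python
def minimal_absent_words(seq, alphabet="ACGT"):
    """Brute-force minimal absent words of seq (illustrative, not linear-time)."""
    n = len(seq)
    # All factors (substrings) of seq, including the empty word.
    factors = {seq[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    maws = set()
    # Length 1: a letter that never occurs is minimal by definition.
    for a in alphabet:
        if a not in factors:
            maws.add(a)
    # Length >= 2: w = left + b is minimal absent when w is absent,
    # but both its longest proper prefix (left) and suffix (left[1:] + b) occur.
    for left in factors:
        if not left:
            continue
        for b in alphabet:
            w = left + b
            if w not in factors and (left[1:] + b) in factors:
                maws.add(w)
    return sorted(maws)


if __name__ == "__main__":
    print(minimal_absent_words("ACAAC", alphabet="AC"))
```

Restricting the alphabet to the letters that occur in the input, as in the usage line, keeps absent single letters out of the result.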
A software tool for the computation of absent words. Unwords is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays.
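The idea of finding the shortest absent words without supplying a length estimate can be illustrated by increasing the word length until some word is missing. The sketch below is a naive Python illustration under that assumption. It builds explicit k-mer sets and is not the method Unwords uses; the function name is hypothetical.

```python
from itertools import product


def shortest_absent_words(seq, alphabet="ACGT"):
    """Return (k, words): the smallest k with an absent word, and those words."""
    k = 1
    while True:
        # Every k-mer that occurs in seq.
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        # Some word of length k is absent when fewer than |alphabet|^k occur.
        if len(present) < len(alphabet) ** k:
            absent = sorted(
                "".join(p)
                for p in product(alphabet, repeat=k)
                if "".join(p) not in present
            )
            return k, absent
        k += 1


if __name__ == "__main__":
    print(shortest_absent_words("AACCA", alphabet="AC"))
```

The loop always terminates, because a sequence of length n contains at most n - k + 1 distinct k-mers while the number of possible words grows exponentially in k.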
An alignment-free method and associated program for computing relative absent words (RAW) in genomic sequences with respect to a reference sequence. EAGLE currently runs in a Linux command-line environment. It builds an SVG image showing the regions where absent words occur and writes the associated positions to a file. EAGLE includes scripts to run on the genomes of the current outbreak and the other existing Ebola virus genomes (using the human genome as a reference), covering the download, filtering and processing of the entire data set.
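As a rough illustration of the concept, the Python sketch below treats a relative absent word as a k-mer that occurs in a target sequence but not in the reference, and records where it occurs in the target. This reading of the definition, the fixed `k`, and the function names are assumptions made for the example. It does not reproduce EAGLE's implementation or its output format.

```python
def kmers(seq, k):
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def relative_absent_words(target, reference, k):
    """Map each k-mer present in target but absent from reference to its positions."""
    ref_words = kmers(reference, k)
    hits = {}
    for i in range(len(target) - k + 1):
        word = target[i:i + k]
        if word not in ref_words:
            # 0-based start positions in the target sequence.
            hits.setdefault(word, []).append(i)
    return hits


if __name__ == "__main__":
    print(relative_absent_words("ACGTAC", "ACGGTA", 3))
```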
Computes all minimal absent words of a word of length N using the suffix array (SA) and the longest common prefix (LCP) array. pMAW processes synthetic DNA sequences to compute minimal absent words. This tool is a MAW module.
An external-memory algorithm for computing minimal absent words. By making use of external memory, EmMAW allows minimal absent words to be computed on far larger data sets. The computation is carried out on a sequence of length n together with its suffix array (SA), longest common prefix (LCP) array, and Burrows-Wheeler transform (BWT), all held in external memory. On a standard workstation, the implementation requires around 3 hours to process the full human genome when as little as 1 GB of RAM is made available.
A command-line tool for generating, for a given reference genome, a set of k-mers absent from that genome. The main differences with respect to previously developed tools for neverword generation are (i) calculation of the distance from the reference genome, in terms of number of mismatches, and selection of the most distant sequences, which have a low probability of annealing nonspecifically; (ii) application of a series of filters to discard candidates not suitable for use as PCR primers.
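Step (i) can be sketched as follows: enumerate the k-mers absent from the reference, score each by its minimum number of mismatches (Hamming distance) to any k-mer that does occur, and rank the most distant first. This is a brute-force Python illustration for small k only. The function names are hypothetical, and the PCR primer filters of step (ii) are not modelled.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def rank_absent_kmers(genome, k, alphabet="ACGT"):
    """Absent k-mers as (distance, word), most distant from the genome first."""
    if len(genome) < k:
        raise ValueError("genome must be at least k characters long")
    present = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    scored = []
    for p in product(alphabet, repeat=k):
        word = "".join(p)
        if word not in present:
            # Distance to the closest k-mer that occurs in the genome.
            scored.append((min(hamming(word, q) for q in present), word))
    # Largest distance first; ties broken alphabetically.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored


if __name__ == "__main__":
    print(rank_absent_kmers("AAAA", 2, alphabet="AC"))
```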