An estimated 25% of all eukaryotic proteins contain repeats, which underlines the importance of duplication for evolving new protein functions. Internal repeats often correspond to structural or functional units in proteins. Methods capable of identifying diverged repeated segments or domains at the sequence level can therefore assist in predicting domain structures, inferring hypotheses about function and mechanism, and investigating the evolution of proteins from smaller fragments.

(Biegert and Söding 2008) De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics.

SIMPLE
An algorithm for identifying simple sequences in proteins, RNA or DNA. SIMPLE reports a general relative simplicity factor for the sequence, which measures how often short motifs (1-4) are repeated relative to random sequences of the same composition. The measure is not restricted to tandem repeats; cryptic repeats are also taken into account. The program also lists the short motifs that recur significantly and their positions within the sequence.
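The idea of scoring a sequence against random sequences of the same composition can be sketched as follows. This is only an illustration and not the SIMPLE algorithm itself: the function names, the window size, the scoring rule (one point per later copy of a motif inside the window) and the number of shuffles are all assumptions made for the example.

```python
import random


def repeat_score(seq, max_motif=4, window=32):
    """Hypothetical score: for every motif of length 1..max_motif, add one
    point for each later copy of that motif starting within `window`
    positions. Copies need not be adjacent, so non-tandem (cryptic)
    repeats also contribute."""
    score = 0
    n = len(seq)
    for k in range(1, max_motif + 1):
        for i in range(n - k + 1):
            motif = seq[i:i + k]
            last_start = min(n - k, i + window)
            for j in range(i + 1, last_start + 1):
                if seq[j:j + k] == motif:
                    score += 1
    return score


def relative_simplicity(seq, n_shuffles=50, seed=0, **kwargs):
    """Observed score divided by the mean score of shuffled copies of the
    sequence. Shuffling keeps the composition but destroys the order."""
    rng = random.Random(seed)
    observed = repeat_score(seq, **kwargs)
    letters = list(seq)
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        total += repeat_score("".join(letters), **kwargs)
    mean_random = total / n_shuffles
    if mean_random == 0:
        return float("nan")  # ratio is undefined when nothing recurs
    return observed / mean_random
```

Under these assumptions, a value well above 1 suggests that the sequence contains more short-motif repetition than its composition alone would explain.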
aPPRove
An HMM-based method for accurate prediction of binding between RNA and pentatricopeptide repeat (PPR) proteins. aPPRove takes as input a PPR protein (primary structure) and one or more RNA transcripts or binding footprints, and outputs the binding sites with the highest statistical significance, together with how the nucleotides in the RNA align to the amino acid pairs (defined by positions 6 and 1') in the PPR sequence for each binding event. Statistical significance is assessed by comparing the alignment with random alignments of the PPR to a database of transcripts.
LSTM_protein / Long Short-Term Memory
A fast model-based recurrent neural network for protein homology detection. LSTM automatically extracts indicative patterns for the positive class but, in contrast to profile methods, it also extracts negative patterns and uses correlations between all detected patterns for classification. LSTM can automatically extract useful local and global sequence statistics such as hydrophobicity, polarity, volume and polarizability, and combine them with a pattern. These properties make LSTM complementary to alignment-based approaches, as it does not use predefined similarity measures such as BLOSUM or PAM matrices.
Dover Analyzer
A wizard-like application that takes collections of databases annotated in FASTA format and guides the user through a few steps to compute the overlap, diversity, and redundant or non-redundant sets of peptide sequences. It is implemented in Java for platform independence and was initially designed to analyze the publicly available antimicrobial peptide databases. However, further analyses can be run simply by adding new databases or replacing the existing ones.
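A minimal sketch of the overlap and redundancy computation, assuming that two entries count as the same peptide only when their sequences match exactly after upper-casing. The function names and the definition of "redundant" (present in more than one database) are assumptions for illustration; the tool itself is a Java application and may define these sets differently.

```python
from collections import Counter
from itertools import combinations


def parse_fasta(text):
    """Return the set of upper-cased sequences found in FASTA-formatted text."""
    seqs, current = set(), []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                seqs.add("".join(current).upper())
            current = []
        elif line:
            current.append(line)
    if current:
        seqs.add("".join(current).upper())
    return seqs


def compare_databases(databases):
    """databases maps a database name to its FASTA text.

    Returns the pairwise overlap counts, the non-redundant union of all
    sequences, and the sequences found in more than one database."""
    sets = {name: parse_fasta(text) for name, text in databases.items()}
    overlap = {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }
    counts = Counter(seq for s in sets.values() for seq in s)
    non_redundant = set(counts)
    redundant = {seq for seq, c in counts.items() if c > 1}
    return overlap, non_redundant, redundant
```

Adding another database to the comparison then amounts to adding one more entry to the `databases` mapping.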
Tally
A scoring tool based on a machine learning approach. Tally achieves better separation between sequences with structural tandem repeats (TRs) and sequences of aperiodic structures than existing scoring procedures. It reaches 81% sensitivity and 74% specificity, with an area under the receiver operating characteristic curve of 86%. Tally can be used to select a set of structurally and functionally meaningful TRs from all TRs detected in proteomes.
TPRpred
A profile-based method that uses a P-value-dependent score offset to include divergent repeat units and exploits the tendency of repeats to occur in tandem. TPRpred detects not only tetratricopeptide repeat (TPR)-like repeats, but also the related pentatricopeptide repeats (PPRs) and SEL1-like repeats. The corresponding profiles were generated through iterative searches in which the threshold parameters for including repeat units in the profiles were varied, and the best profiles were selected based on their performance on proteins of known structure. TPRpred performs significantly better in detecting divergent repeats in TPR-containing proteins, and finds more individual repeats than the existing methods.
XSTREAM
A powerful genome data-mining tool designed to efficiently identify tandem repeat (TR) patterns in biological sequence data. XSTREAM uses a seed-extension strategy coupled with several post-processing algorithms to analyze FASTA-formatted protein or nucleotide sequences. It uses a number of user-defined parameters to identify non-redundant TR sequences with diverse periods and domain sizes, and varied levels of degeneracy. Additionally, XSTREAM effectively merges discontinuous TRs into larger TR domains, clusters similar TR sequences, models TR domain architectures, and detects hierarchical TR patterns.
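A seed-extension search for tandem repeats can be illustrated with a much simpler, exact-match version. The sketch below is not the XSTREAM implementation: it tolerates no degeneracy, does not merge or cluster repeats, and the function name and parameters are assumptions chosen for the example.

```python
def find_tandem_repeats(seq, min_period=1, max_period=10, min_copies=3):
    """Return (start, period, copies, unit) for exact tandem repeats.

    Seed: a position whose residue equals the one `period` positions
    downstream. Extension: keep advancing while that equality holds."""
    hits = []
    n = len(seq)
    for period in range(min_period, max_period + 1):
        i = 0
        while i + period < n:
            j = i
            while j + period < n and seq[j] == seq[j + period]:
                j += 1
            # seq[i : j + period] is periodic with the current period.
            span = j - i + period
            copies = span // period
            unit = seq[i:i + period]
            # Skip units that are repeats of a shorter unit (e.g. "AA"),
            # so that each repeat is reported once, at its shortest period.
            primitive = unit not in (unit + unit)[1:-1]
            if j > i and copies >= min_copies and primitive:
                hits.append((i, period, copies, unit))
            i = j + 1
    return hits
```

For example, `find_tandem_repeats("MKTGSGSGSGSLLP", min_period=2, max_period=4)` reports a single hit: four copies of the unit `GS` starting at index 3.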
REPPER / REPeats and their PERiodicities
Detects and analyzes regions with short gapless repeats in protein sequences or alignments. REPPER is a web server whose programs use a sliding window to show the boundaries of periodic regions and to allow the detection of multiple regions with different periodicities in the same protein. Users can supply a multiple sequence alignment as input, or have a profile calculated for a single input sequence using PSI-BLAST with two iterations and an E-value cutoff of 0.001.
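The sliding-window idea can be sketched with a simple self-match score. This is an assumption-laden illustration and not the method used by REPPER: for each window it reports the shift (period) at which the largest fraction of residues equals the residue that many positions downstream. The function name, window size, step and scoring rule are all hypothetical.

```python
def window_periodicity(seq, window=20, step=5, max_period=10):
    """Return (window_start, best_period, score) for each window.

    score is the fraction of positions in the window whose residue is
    identical to the residue `best_period` positions further on.
    best_period is None when no shift produces any match."""
    results = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        best_period, best_score = None, 0.0
        for p in range(1, min(max_period, len(chunk) - 1) + 1):
            pairs = len(chunk) - p
            matches = sum(chunk[i] == chunk[i + p] for i in range(pairs))
            score = matches / pairs
            # Strict comparison keeps the shortest period among ties.
            if score > best_score:
                best_period, best_score = p, score
        results.append((start, best_period, score if best_period is None else best_score))
    return results
```

Because every window is scored separately, a change in the reported period from one window to the next gives a rough indication of where a periodic region begins or ends, and different regions of the same protein can show different periods.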
