Computational protocol: Profile-based short linear protein motif discovery

Protocol publication

[…] A set of protein sequences is used as the query set. These sequences, or a subset of them, are believed to contain a common motif responsible for a functional activity. The motif is likely to be relatively conserved between orthologues of the proteins in related species, in contrast to the surrounding disordered regions of the protein, which are generally unconserved. We used the relative local conservation scoring system described in Davey et al. 2009 [] to mask out unconserved residues of the query sequences before submitting them to the MEME program to discover over-represented SLiM profiles []. In addition, we used the SLiMBuild algorithm from SLiMFinder to weight the query sequences by their relatedness to each other [].

The query sequences were additionally masked to remove transmembrane regions and domains, taken from UniProt annotation []. Eliminating these sources of high-scoring false positives increases the likelihood of identifying linear motifs in the query sequences.

Previous work by Fuxreiter et al. has shown that disordered regions are enriched for short linear motifs []. This has been confirmed by a separate analysis of experimentally validated motifs from the ELM database []. Both indicate that the residues that comprise a motif are likely to have high disorder propensities compared with the flanking regions. A disorder score cutoff of 0.3 removes regions of the protein known to be ordered without reducing the search space excessively. In Davey et al. [], 82% of known motifs have a disorder score over 0.3, the cutoff used in this analysis. [...]

To generate alignments for the proteins in the benchmark dataset, we used the series of metazoan Ensembl whole genomes downloaded in March 2010 []. We followed the method used in the Gopher orthologous protein identification and alignment algorithm described in Edwards et al. [].
Each query sequence in the set was searched using BLAST (with low-complexity regions masked) against the metazoan proteome at an expectation threshold of e = 10⁻⁴. The hits from this search were then searched against the database again at a relaxed threshold of e = 10, this time without complexity filtering. Sequences at this stage had to have at least 40% global similarity to the original query to be included. The most similar sequence for each species was retained for the alignment. Multiple sequence alignments were then generated using the MUSCLE program [].

We adopted the treatment of evolutionary information previously developed and evaluated for SLiM discovery [,], since the problem of treating evolutionary information is likely to be very similar for profile-based and regular-expression-based discovery of linear motifs. As public orthology resources such as those of Ensembl [] improve, they may prove useful in future implementations of the method by accelerating calculations. […]
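
The final selection step described above (a 40% global similarity threshold, then the most similar sequence per species) can be illustrated with a short Python sketch. It is not the Gopher implementation: the `Hit` class, the function name and the example identifiers are hypothetical, and the similarity values are assumed to have been computed beforehand from the BLAST results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate sequence recovered by the database searches (hypothetical type)."""
    seq_id: str
    species: str
    similarity: float  # global similarity to the original query, in percent


def select_best_per_species(hits, min_similarity=40.0):
    """Keep hits at or above the similarity threshold, one per species.

    For each species, the hit with the highest global similarity to the
    query is retained. Ties keep the first hit encountered (an assumption).
    """
    best = {}
    for hit in hits:
        if hit.similarity < min_similarity:
            continue  # below the 40% global similarity requirement
        current = best.get(hit.species)
        if current is None or hit.similarity > current.similarity:
            best[hit.species] = hit
    return best


# Made-up example data; identifiers and values are illustrative only.
hits = [
    Hit("mouse_a", "Mus musculus", 62.0),
    Hit("mouse_b", "Mus musculus", 81.5),
    Hit("fish_a", "Danio rerio", 45.0),
    Hit("fly_a", "Drosophila melanogaster", 33.0),
]
selected = select_best_per_species(hits)
```

The sequences in `selected` would then be passed, together with the query, to the alignment program.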

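The disorder-based masking described earlier in this protocol can be sketched in the same way. The example assumes that a per-residue disorder score is already available for each query sequence from whichever predictor is in use. The function name, the use of "X" as the masking character and the treatment of a score exactly equal to the cutoff (masked here) are assumptions made for illustration, not details taken from the protocol.

```python
def mask_ordered_residues(sequence, disorder_scores, cutoff=0.3, mask_char="X"):
    """Replace residues whose disorder score is not over the cutoff.

    Residues with a score strictly greater than `cutoff` are kept;
    all others are replaced with `mask_char`.
    """
    if len(sequence) != len(disorder_scores):
        raise ValueError("sequence and disorder_scores must have the same length")
    return "".join(
        residue if score > cutoff else mask_char
        for residue, score in zip(sequence, disorder_scores)
    )


# Made-up sequence and scores, for illustration only.
sequence = "MKTPLSAE"
scores = [0.10, 0.20, 0.35, 0.50, 0.60, 0.30, 0.25, 0.90]
masked = mask_ordered_residues(sequence, scores)
```

Masking rather than deleting residues keeps the sequence length and residue positions unchanged, so any motif found later can be mapped back to the original protein.
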
Pipeline specifications