Computational protocol: Protein disorder prediction at multiple levels of sensitivity and specificity

Protocol publication

[…] The protein sequences used for testing were acquired from the Protein Data Bank (PDB) []. This dataset consisted of 3131 sequences, with a disorder residue frequency of 5.4% (54,364). As similarly noted [], the majority of disordered regions were located at the N- and C-terminal ends of the protein sequences.

To strictly evaluate the performance of DISpro, we removed all sequences previously used in the training or testing of the original DISpro. Thus, classification accuracy is based solely on data previously unseen by the network.

The remaining 2408 sequences were then analyzed for disorder residue frequency and region length. Out of a total of 799,153 residues in the PDB dataset, 5.1% (40,455) were marked as disordered. Of those 40,455 disordered residues, 18.7% (7552) were located in regions longer than 30 amino acids. The overall frequency of region lengths can be seen in Figure . [...]

The overall neural network system remains unchanged from the original DISpro, but it is discussed here briefly for clarity. As in [], DISpro uses a 1-dimensional recursive neural network, which we will refer to as 1D-RNN []. Please see Baldi and Pollastri (2003) for a detailed explanation of the 1D-RNN's rolling "wheel" system [].

In the 1D-RNN architecture, the network is designed to accept an entire sequence at once, rather than relying on the more common sliding-window technique, thereby allowing for variable input size. As an example, let us use a sequence of arbitrary length I. In this case, I represents the total number of residues in the example sequence, and I_i is a vector containing the 25 values used to represent residue i.
Of these values, 20 represent the frequencies of the 20 amino acids from a PSI-BLAST profile [], and the other five are binary values denoting secondary structure and solvent accessibility predictions [,,].

As output, the 1D-RNN produces a vector of real numbers O, where O_i is the probability that residue i is disordered. DISpro then uses these probabilities to classify each residue as disordered or ordered, based on a decision threshold of 0.5 []. However, by varying this threshold (as discussed in the next subsection), we are able to investigate the relationship between the specificity and sensitivity of disorder predictions. […]
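The thresholding step described above can be illustrated with a minimal Python sketch. This is not DISpro's code: the function names (`classify`, `sensitivity_specificity`, `sweep`) and the toy probabilities and labels are hypothetical, and the strict `>` comparison at the threshold is an assumption. The sketch also assumes the common definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the excerpt does not state the definitions used in the paper, which may differ.

```python
from typing import List, Sequence, Tuple


def classify(probs: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Mark residue i as disordered when O_i exceeds the threshold.

    Assumption: a strict '>' comparison; the excerpt does not say how ties are handled.
    """
    return [p > threshold for p in probs]


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) using the common definitions (assumed here)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def sweep(
    probs: Sequence[float], actual: Sequence[bool], thresholds: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Evaluate each threshold and return (threshold, sensitivity, specificity) rows."""
    rows = []
    for t in thresholds:
        sens, spec = sensitivity_specificity(classify(probs, t), actual)
        rows.append((t, sens, spec))
    return rows


# Hypothetical per-residue outputs O_i and true labels (True = disordered).
toy_probs = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]
toy_actual = [True, True, True, False, False, False]

if __name__ == "__main__":
    for t, sens, spec in sweep(toy_probs, toy_actual, [0.3, 0.5, 0.8]):
        print(f"threshold={t:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

On this toy data, lowering the threshold labels more residues as disordered, so sensitivity does not decrease while specificity does not increase; raising it has the opposite effect. This is the trade-off that varying the decision threshold is meant to expose.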

Pipeline specifications

Software tools: DISpro, BLASTP
Application: Protein structure analysis
Diseases: Genetic Diseases, Inborn