Computational protocol: A New Method for Predicting the Subcellular Localization of Eukaryotic Proteins with Both Single and Multiple Sites: Euk-mPLoc 2.0

Protocol publication

[…] Protein sequences were collected from the Swiss-Prot database at http://www.ebi.ac.uk/swissprot/. The detailed procedures are essentially the same as described previously; the only difference is that, to establish a more up-to-date benchmark dataset, version 55.3 of the Swiss-Prot database (released on 29-Apr-2008) was adopted instead of version 50.7 (released on 9-Sept-2006). By strictly following those procedures, we finally obtained a benchmark dataset $\mathbb{S}$ containing 7,766 different protein sequences distributed among 22 subcellular locations, i.e.,

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \mathbb{S}_3 \cup \cdots \cup \mathbb{S}_{22} \tag{1}$$

where $\mathbb{S}_1$ represents the subset for the subcellular location of “acrosome”, $\mathbb{S}_2$ for “cell membrane”, $\mathbb{S}_3$ for “cell wall”, and so forth, while $\cup$ is the symbol for “union” in set theory. A breakdown of the 7,766 eukaryotic proteins in the benchmark dataset according to their 22 location sites is given in the original publication. To avoid redundancy and homology bias, none of the proteins in $\mathbb{S}$ has pairwise sequence identity above the adopted cutoff to any other protein in the same subset. The corresponding accession numbers and protein sequences are given in the online supporting information.

Because the system investigated now contains both single-location and multiple-location proteins, some of the proteins in $\mathbb{S}$ may occur in two or more location sites. It is therefore instructive to introduce the concept of a “virtual sample”, illustrated as follows. A protein coexisting at two different location sites is counted as 2 virtual samples even though both have an identical sequence; a protein coexisting at three different sites is counted as 3 virtual samples; and so forth. Accordingly, the total number of different virtual protein samples is generally greater than the total number of different sequence samples.
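The virtual-sample bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `virtual_sample_stats`, the protein identifiers, and the toy annotations are hypothetical and are not taken from the benchmark dataset.

```python
from collections import Counter

def virtual_sample_stats(locations_by_protein):
    """Return (n_sequences, n_virtual, histogram) for a {protein_id: set_of_sites} mapping."""
    n_sequences = len(locations_by_protein)
    # Each protein contributes one virtual sample per location site it occupies.
    n_virtual = sum(len(sites) for sites in locations_by_protein.values())
    # histogram[m] = number of proteins observed at exactly m location sites
    histogram = Counter(len(sites) for sites in locations_by_protein.values())
    return n_sequences, n_virtual, histogram

# Hypothetical toy annotations (NOT from the benchmark dataset)
toy = {
    "protA": {"nucleus"},
    "protB": {"cytoplasm", "nucleus"},
    "protC": {"cell membrane", "cytoplasm", "endosome"},
}
n_seq, n_vir, hist = virtual_sample_stats(toy)
print(n_seq, n_vir, dict(hist))  # 3 6 {1: 1, 2: 1, 3: 1}
```

Three sequences yield six virtual samples here, which shows why the virtual-sample total exceeds the sequence total whenever any protein has more than one location.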
The number of virtual samples and the number of distinct sequences are related by

$$N(\text{vir}) = N(\text{seq}) + \sum_{m=1}^{M} (m-1)\,N(m) \tag{2}$$

where $N(\text{vir})$ is the total number of different virtual protein samples in the benchmark dataset, $N(\text{seq})$ the total number of different protein sequences, $N(1)$ the number of proteins with one location, $N(2)$ the number of proteins with two locations, and so forth, while $M$ is the total number of subcellular location sites (for the current case, $M = 22$).

For the current 7,766 different protein sequences, 6,687 occur in one subcellular location, 1,029 in two locations, 48 in three locations, 2 in four locations, and none in five or more locations. Substituting these data into Eq.2, we have

$$N(\text{vir}) = 7766 + (1-1)\times 6687 + (2-1)\times 1029 + (3-1)\times 48 + (4-1)\times 2 = 8897 \tag{3}$$

which is fully consistent with the published breakdown by location site and with the data in the online supporting information.

As stated in a recent comprehensive review, one of the most important steps in developing a powerful method for statistically predicting protein subcellular localization is to formulate the protein sample with core features that are intrinsically correlated with its localization in a cell. Since the concept of pseudo amino acid composition (PseAAC) was proposed, it has provided a very flexible mathematical framework for investigators to incorporate their desired information into the representation of protein samples. According to its original definition, PseAAC is formulated as a set of discrete numbers that differs from the classical amino acid composition (AAC) and is derived from a protein sequence so as to harbor some of its sequence-order and pattern information, or to reflect some physicochemical and biochemical properties of the constituent amino acids. PseAAC has since been widely used to deal with many protein-related problems and sequence-related systems (see, e.g., the long list of PseAAC-related references cited in a recent review).
As summarized elsewhere, 16 different PseAAC modes have so far been used to represent protein samples for predicting their attributes. Each of these modes has its own advantages and disadvantages. In this study, we formulate the protein samples by hybridizing the following three different modes of PseAAC. [...]

The GO database was established according to molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space would cluster in a way that better reflects their subcellular locations. However, in the original Euk-mPLoc predictor, the GO representation of a protein sample was derived from the GO database through its accession number. Thus, when using Euk-mPLoc to perform a prediction, the accession number of the query protein was indispensable. To avoid such a requirement, the following different procedures are proposed to derive the GO representation mode. [...]

The functional domain (FunD) is the core of a protein and plays the major role in its function. That is why, in determining the 3-D (dimensional) structure of a protein by experiment or by computational modeling, the first priority has always been its FunD. Using FunD information to formulate protein samples for statistical prediction was proposed in earlier work, and quite encouraging results were achieved. At that time, the 2,005 FunDs in the SBASE-A database were used as bases to formulate the protein samples. Since then, a series of follow-up protein FunD databases have been established, such as COG, KOG, SMART, Pfam, and CDD. Of these databases, CDD contains the domains imported from COG, Pfam, and SMART, and hence is comparatively more complete. Version 2.11 of CDD contains 17,402 characteristic domains. Using each of these domains as a base vector, we can define a FunD space with 17,402 dimensions.
Thus, by following similar procedures, a protein sample can be uniquely defined through the steps described below: [...]

Since biology is a natural science with a historic dimension, all biological species have developed continuously from a very limited number of ancestral species. This is quite typical for protein sequences. Their evolution involves changes of single residues, insertions and deletions of several residues, gene doubling, and gene fusion. As such changes accumulate over a long period of time, many similarities between the initial and resultant amino acid sequences are eliminated, but the corresponding proteins may still share many common attributes, such as their location site in a cell. Therefore, to capture the core features and intrinsic relationships among a huge number of complicated protein sequences, it is particularly important to take the effects of evolution into account. To this end, we incorporate the evolution information through the “Position-Specific Scoring Matrix” or “PSSM”, i.e., we express the protein $\mathbf{P}$ by a matrix formulated as

$$\mathbf{P}_{\text{PSSM}} = \begin{bmatrix} E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\ E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20} \end{bmatrix} \tag{8}$$

where $L$ is the length of $\mathbf{P}$ (counted as the total number of its constituent amino acids), and $E_{i\to j}$ represents the score of the amino acid residue in the $i$-th position of the protein sequence being changed to amino acid type $j$ during the evolutionary process. Here, the numerical codes 1, 2, …, 20 denote the 20 native amino acid types according to the alphabetical order of their single-character codes.
The scores $E_{i\to j}$ in Eq.8 were generated by using PSI-BLAST to search the Swiss-Prot database (version 55.3 released on 29-Apr-2008) through three iterations with 0.001 as the E-value cutoff for multiple sequence alignment against the sequence of the protein $\mathbf{P}$, followed by a standard conversion given below:

$$E_{i\to j} = \frac{E^{0}_{i\to j} - \overline{E^{0}_{i}}}{\mathrm{SD}\left(E^{0}_{i}\right)} \qquad (i = 1, 2, \ldots, L;\ j = 1, 2, \ldots, 20) \tag{9}$$

where $E^{0}_{i\to j}$ represents the original scores directly created by PSI-BLAST, which are generally positive or negative integers (a positive score means that the corresponding mutation occurs more frequently than expected by chance, while a negative score means just the opposite); $\overline{E^{0}_{i}}$ is the average of $E^{0}_{i\to j}$ over $j = 1, 2, \ldots, 20$, and $\mathrm{SD}(E^{0}_{i})$ is the corresponding standard deviation. The converted values obtained by Eq.9 have a zero mean over the 20 amino acids and remain unchanged if they go through the same conversion procedure again.

However, according to Eq.8, a protein of length $L$ corresponds to a matrix of $L$ rows. Hence, proteins with different lengths correspond to matrices of different dimensions. This is a hurdle for developing a predictor able to uniformly cover proteins of any length. One possible way to overcome it is to represent a protein sample by

$$\bar{\mathbf{P}}_{\text{PSSM}} = \begin{bmatrix} \bar{E}_{1} & \bar{E}_{2} & \cdots & \bar{E}_{20} \end{bmatrix}^{\mathbf{T}} \tag{10}$$

where

$$\bar{E}_{j} = \frac{1}{L} \sum_{i=1}^{L} E_{i\to j} \qquad (j = 1, 2, \ldots, 20) \tag{11}$$

and $\bar{E}_{j}$ represents the average score of the amino acid residues in the protein $\mathbf{P}$ being changed to amino acid type $j$ during the evolutionary process. However, if $\bar{\mathbf{P}}_{\text{PSSM}}$ of Eq.10 were used to represent the protein $\mathbf{P}$, all the sequence-order information during the evolutionary process would be erased.
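As a minimal sketch of the conversion in Eq.9 and the averaging in Eqs.10 and 11, the following Python assumes that Eq.9 is a row-wise z-score over the 20 amino acid columns (which gives the zero mean stated in the text). The function names, the use of the population standard deviation, the handling of constant rows, and the toy matrix are all assumptions made for illustration; real input would be the raw PSI-BLAST scores.

```python
from statistics import mean, pstdev

def standardize_rows(raw_pssm):
    """Eq.9-style conversion: z-score each row over its 20 amino-acid columns."""
    converted = []
    for row in raw_pssm:
        mu = mean(row)
        sd = pstdev(row)  # assumption: population SD (divide by 20)
        if sd == 0:
            # assumption: a constant row carries no signal, so map it to zeros
            converted.append([0.0] * len(row))
        else:
            converted.append([(x - mu) / sd for x in row])
    return converted

def column_means(pssm):
    """Eqs.10-11-style descriptor: average score per amino-acid type over all positions."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(len(pssm[0]))]

# Hypothetical 3-residue "protein": made-up integer scores, 20 columns per position
raw = [[(7 * i + 3 * j) % 11 - 5 for j in range(20)] for i in range(3)]
converted = standardize_rows(raw)
descriptor = column_means(converted)
print(len(descriptor))  # 20
```

Whatever the sequence length, the descriptor always has 20 components, which is what makes it usable as a fixed-size input; the cost is that positional order is averaged away.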
To avoid completely erasing the sequence-order information, the concept of PseAAC as originally proposed was utilized; i.e., instead of Eq.10, let us use the pseudo position-specific scoring matrix given by

$$\mathbf{P}^{\xi}_{\text{PsePSSM}} = \begin{bmatrix} \bar{E}_{1} & \bar{E}_{2} & \cdots & \bar{E}_{20} & G^{\xi}_{1} & G^{\xi}_{2} & \cdots & G^{\xi}_{20} \end{bmatrix}^{\mathbf{T}} \tag{12}$$

to represent the protein $\mathbf{P}$, where $\bar{E}_{j}$ is the average score for amino acid type $j$ as in Eq.11 and

$$G^{\xi}_{j} = \frac{1}{L-\xi} \sum_{i=1}^{L-\xi} \left[ E_{i\to j} - E_{(i+\xi)\to j} \right]^{2} \qquad (j = 1, 2, \ldots, 20;\ \xi < L) \tag{13}$$

meaning that $G^{1}_{j}$ is the correlation factor obtained by coupling the most contiguous position-specific scoring matrix scores along the protein chain for amino acid type $j$; $G^{2}_{j}$ is that obtained by coupling the second-most contiguous scores; and so forth. Note that, as mentioned in the Materials section, the length of the shortest protein sequence in the benchmark dataset is $L = 50$, and hence the value allowed for $\xi$ in Eq.13 must be smaller than 50. When $\xi = 0$, $G^{\xi}_{j}$ becomes a naught element and Eq.12 degenerates to Eq.10.

A hybridization of the above three different PseAAC modes, i.e., Eq.4, Eq.6, and Eq.12, will be used to represent protein samples for establishing a new classifier for predicting eukaryotic protein subcellular localization, as described below. [...]

To make Eqs.15–16 capable of handling proteins with multiple locations as well, the ensemble classifier needed to be modified to include a threshold parameter that controls the count of multiple location sites and optimizes the predicted results, as formulated by Eqs.39–48 of the cited work, which also elaborates how to evaluate the overall success rate when the modified classifier is applied to a benchmark dataset containing both single- and multiple-location proteins.

The entire ensemble classifier thus established is called “Euk-mPLoc 2.0”, where “2.0” refers to an updated version evolved from Euk-mPLoc. To provide an intuitive picture, a flowchart illustrating the prediction process of Euk-mPLoc 2.0 is given in the original publication. […]
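Returning to the pseudo position-specific scoring matrix of Eqs.12 and 13, the sketch below builds the 40-component vector for a given lag. It assumes the correlation factor is the mean squared difference between scores that are a fixed number of positions apart (a form that vanishes at lag 0, as the text requires). The function name `pse_pssm`, the parameter name `xi`, and the toy matrix are hypothetical; the toy matrix has not been passed through the Eq.9 conversion.

```python
def pse_pssm(pssm, xi):
    """Return [mean_1..mean_20, G_1..G_20] for an L x 20 score matrix and lag xi."""
    length = len(pssm)
    if not 0 <= xi < length:
        raise ValueError("xi must satisfy 0 <= xi < sequence length")
    n_types = len(pssm[0])
    # First 20 components: per-column averages over all positions
    means = [sum(row[j] for row in pssm) / length for j in range(n_types)]
    # Last 20 components: mean squared difference between scores xi positions apart
    # (assumed form of the correlation factor)
    corr = [
        sum((pssm[i][j] - pssm[i + xi][j]) ** 2 for i in range(length - xi)) / (length - xi)
        for j in range(n_types)
    ]
    return means + corr

# Hypothetical 4-residue matrix: entry (i, j) = i * (j + 1), chosen for easy hand checks
toy = [[i * (j + 1) for j in range(20)] for i in range(4)]
vec = pse_pssm(toy, 1)
print(len(vec), vec[0], vec[20])  # 40 1.5 1.0
```

With lag 0 every correlation term is zero, so only the 20 averaged scores carry information, mirroring the degenerate case noted in the text.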

Pipeline specifications

Software tools: PseAAC, Euk-mPLoc, SBASE
Databases: Pfam, UniProt
Application: Protein sequence analysis