Computational protocol: Method for Rapid Protein Identification in a Large Database

Protocol publication

[…] Database searching is frequently employed to identify unknown amino acid sequences of peptides and proteins with high throughput. The main idea of this approach is shown in . In this approach, proteins in the sample are digested into a peptide mixture. A mass spectrometer is then used to produce tandem mass (MS/MS) spectra, which are queried against a known protein database. In parallel, theoretical MS/MS spectra are predicted from the peptide sequences obtained by applying enzymatic digestion rules to the known protein database, that is, by simulated digestion. The most common method is to use a search algorithm that identifies peptides by correlating the experimental MS/MS data with the theoretical MS/MS data generated from the candidate peptides.

Simulated digestion refers to theoretical digestion based on the known protein sequences and enzyme specificity. It generally includes three types: specific, semispecific, and nonspecific. In specific digestion, hydrolysis of the protein sequence occurs only at specific amino acids. For example, trypsin cuts the polypeptide chain after lysine (K) or arginine (R), provided that proline (P) is not the next residue. In semispecific digestion, one terminus of the peptide must result from cleavage at a specific amino acid, whereas the other terminus may result from cleavage at any amino acid. In nonspecific digestion, cleavage may occur at any amino acid at both termini, so the resulting peptides are equivalent to all substrings of the amino acid sequence. Nonspecific digestion is usually avoided, especially when searching large databases, because of its long running time and high memory demand.

Protein posttranslational modification in eukaryotic cells is of great significance for understanding protein structure and function and for explaining the mechanisms of major diseases. Over 1000 kinds of modifications are currently available in the database.
Searching for an excessive number of modification types is thus unrealistic. Therefore, current mature search engines, such as SEQUEST and Mascot, allow no more than 10 types of variable modification to be assigned, which obviously cannot meet actual needs.

The types of digestion and modification are generally restricted. As the mainstream approach to database searching, the restricted method has one major advantage: it reduces the number of candidate peptides, because some search parameters are fixed in advance on the basis of experience. However, individual experience is not always accurate. The open method, which supports a large database, any type of digestion, and any type of modification, appears to be a perfect solution, but its large search space limits the search speed, which has restricted the development of this approach.

Meanwhile, the exponential growth of protein databases, the rapid generation of mass spectrometry data, and the requirement for nonspecific digestion and posttranslational modifications in complex-sample identification also pose a significant challenge to the scale and speed of identification. The size of genomic and protein sequence databases grows exponentially, exceeding even Moore's Law in terms of the requirements for computing hardware. As shown in , the growth trend of the protein database UniProtKB/TrEMBL is a representative case.

The opinion of Patterson in [] is curt and to the point: “…our ability to generate data now outstrips our ability to analyze it.” [...] shows the typical identification workflow, which usually includes (A.) spectrum preprocessing, (B.) index building, and (C.) index searching for identification. Based on an analysis of the protein identification process, three main methods are currently available to accelerate search engines. First, the protein database can be preprocessed so that a more efficient index structure can be constructed. This design is a high-performance solution for a small-scale protein database.
However, for protein databases with a scale of tens of MB, the index created by this method requires several GB of storage space. Moreover, building an index for a large database is time consuming. Second, efficient search algorithms or technologies, such as an inverted index, can be used to accelerate the search engine. Third, a parallel search can be conducted on clusters to improve query efficiency.

With the popularity of cluster applications, parallel versions of several mainstream protein identification tools have been introduced. Most of these versions are based on simple task partitioning among spectra. Compared with a stand-alone version, the identification speed can be increased several times. However, online digestion and fragmentation cannot be avoided for each retrieval.

To prepare for large-scale protein identification, identification on a large-scale protein database with any type of digestion and modification must be supported. Based on pFind, we designed a scalable and efficient system to meet the need for rapid identification.

pFind is one of the first protein identification software packages designed and developed in China. In terms of accuracy and speed, pFind has reached the level of mainstream international commercial software, such as SEQUEST and Mascot. As early as 2008, pFind participated in the international evaluation on protein identification organized by the Association of Biomolecular Resource Facilities and demonstrated strong performance in identification accuracy and false positive rate control. pFind is currently the only protein search engine devoted to first-line research that was developed in China, and it is used by hundreds of groups around the world, including Duke University and MIT.

Search engines usually need to digest protein sequences online and to filter peptides according to the mass error, which may add unnecessary overhead.
When the spectrum data set is large, the overhead for online digestion increases unnecessarily because the process has to be repeated for each batch. If the index space can be guaranteed, nonspecific digestion of the protein database would significantly improve efficiency.

In nonspecific digestion, the protein sequence may be cleaved at any amino acid to form peptide fragments, which means that a resulting peptide can be any substring of the protein sequence. For each protein sequence, all subsequences within a specified length and mass range are generated. This optimization is step B.opt., as shown in . Step B.2, as shown in , can be handled in one way by nonspecific digestion for all enzymes, and even offline. This not only lays the foundation for acceleration but also reduces the dependence on expertise.

In this work, we built an inverted index, keyed by mass, of the peptide fragments generated by nonspecific digestion, prior to spectrum queries. The index generation process eliminates the overhead of simulated digestion during a search while naturally supporting the retrieval of nonspecifically digested peptides. All subsequences generated from the protein database are sorted by mass in ascending order, and an index table is constructed in which the key is the peptide mass and each peptide is represented by three integers: protein ID, start position, and length in amino acids. Therefore, all index entries have equal length. Given an explicit mass or mass range to be queried, the time complexity of finding the first valid position in the index table is O(1). Range searches derived from spectrum peaks with a mass tolerance can then quickly retrieve the matching index entries. This approach simplifies the identification process and saves both index build time and search time.

Modification identification is another time-consuming process, and unrestricted posttranslational modification identification remains inadequate.
InsPecT describes an unrestrictive PTM search algorithm that searches for any possible type of modification at once in a “blind search” mode, which does not depend on any given modification list. Such ideas can be used to identify more types of modifications, but the search speed is affected to a certain extent. By contrast, the list of modification types in Unimod is almost complete; at least in the vast majority of mass spectrometry experiments, Unimod is sufficient. As of Mascot version 2.3, the Error Tolerant search, an earlier proposed open identification method, is supported as an iterative search on top of conventional identification. However, this approach supports only semispecific digestion or all modifications in the Unimod database.

Using pFind with the DeltAMT algorithm [], the Beijing Proteome Research Center identified core fucosylation (CF) modification. Over 100 CF glycoproteins and CF modification sites were identified from plasma samples of human liver cancer, the greatest number among all reports. The scale of the identification results indicates significant progress in finding potential biomarkers. The discovery of a large number of modifications is of great significance for follow-up research and would aid in the early discovery of cancer markers []. Therefore, we have reason to believe that accelerating pFind is an efficient way to accelerate posttranslational modification prediction. The two-stage method can be used to first determine one terminus of the peptide and thereby obtain a smaller number of candidates. The other terminus can then be determined for this smaller set of candidates by taking the mass difference as the modification mass and looking it up in the inverted index of modifications, which is built beforehand from Unimod. The complexity of determining one modification based on mass is O(1). To focus on acceleration, we will concentrate on dealing with nonspecific digestion. […]
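
The mass-sorted index of nonspecifically digested peptides described above can be illustrated with a minimal Python sketch. It is not pFind's implementation: the function names (`peptide_mass`, `build_index`, `query`), the length limits, and the partial residue-mass table are assumptions made for illustration. The lookup also uses binary search, which is O(log n), rather than the O(1) lookup the text reports.

```python
from bisect import bisect_left, bisect_right

# Monoisotopic residue masses in Da for a few amino acids only
# (illustrative subset; a real table would cover all residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056  # one water is added per peptide


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def build_index(proteins, min_len=2, max_len=4):
    """Nonspecific digestion: enumerate every substring within the
    length limits, then sort the entries by mass in ascending order.

    Each peptide is stored as three integers (protein ID, start, length),
    so every index entry has the same size.
    """
    entries = []
    for pid, seq in enumerate(proteins):
        for start in range(len(seq)):
            for length in range(min_len, max_len + 1):
                if start + length > len(seq):
                    break
                mass = peptide_mass(seq[start:start + length])
                entries.append((mass, pid, start, length))
    entries.sort()
    masses = [e[0] for e in entries]   # sorted keys
    records = [e[1:] for e in entries]  # (protein ID, start, length)
    return masses, records


def query(masses, records, target, tol):
    """Return all peptides whose mass lies in [target - tol, target + tol].

    Binary search locates both ends of the range (an assumption of this
    sketch; the text describes a constant-time lookup instead).
    """
    lo = bisect_left(masses, target - tol)
    hi = bisect_right(masses, target + tol)
    return records[lo:hi]


# Toy database of two short "proteins" (hypothetical sequences).
proteins = ["GASP", "KGAR"]
masses, records = build_index(proteins)
for pid, start, length in query(masses, records, peptide_mass("GA"), 0.01):
    print(pid, start, proteins[pid][start:start + length])
```

Because the index is built once, before any spectrum is queried, no digestion is repeated at search time; each precursor mass with its tolerance becomes a single range lookup over the sorted keys.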

Pipeline specifications