Computational protocol: Spectrum-to-Spectrum Searching Using a Proteome-wide Spectral Library

Protocol publication

[…] Spectrum-to-spectrum search applications typically consist of three main components: a spectral preprocessor, which performs ion filtering and intensity scaling; a spectral library; and a scoring method. Current software, such as X!Hunter, BiblioSpec, and SpectraST, does not allow each of these components to be optimized independently. For example, X!Hunter can efficiently search large libraries, but at the expense of reduced ion representation in library MS/MS spectra. To address this need, we designed a cross-platform spectral library search application, Spec2spec, written in Java with a flexible object-oriented architecture that allows each component to be optimized independently. In this architecture, spectral filters and scoring methods are predefined as abstract classes, which simplifies the development and testing of new filters and scoring methods. To enable efficient searches of large simulated libraries, we prefiltered the libraries, partitioned them by m/z and charge, and searched the partitions in multiple threads. This sacrificed the flexibility to customize filtering methods, but significantly reduced the loading time to an average of 1 min per library partition. The search times for Spec2spec were on the same order as those for spectrum-to-sequence searching: searches of the UPS1 database required 26 min on average with Spec2spec, whereas Mascot required 21 min. The overall workflow for spectral library generation and spectrum-to-spectrum searching is shown in […].

[…] We also developed probabilistic scores that use a hypergeometric distribution to model the frequency of random matching of fragment ions between experimental and library spectra.
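The component architecture and the partitioned, multithreaded search described above can be illustrated with a minimal sketch. Spec2spec itself is written in Java and its actual class names are not given in this excerpt, so every name here (`Spectrum`, `SpectralFilter`, `ScoringMethod`, `TopNFilter`, `SharedPeakCount`, `partition_library`, `search`) is hypothetical, as are the 10 m/z partition width and the tolerances. Python is used only for brevity.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Spectrum:
    precursor_mz: float
    charge: int
    peaks: tuple  # ((mz, intensity), ...) sorted by m/z


class SpectralFilter(ABC):
    """Hypothetical base class for one preprocessing step."""
    @abstractmethod
    def apply(self, spectrum: Spectrum) -> Spectrum: ...


class ScoringMethod(ABC):
    """Hypothetical base class for a query-vs-candidate score."""
    @abstractmethod
    def score(self, query: Spectrum, candidate: Spectrum) -> float: ...


class TopNFilter(SpectralFilter):
    """Example filter: keep the n most intense peaks."""
    def __init__(self, n):
        self.n = n

    def apply(self, spectrum):
        kept = sorted(spectrum.peaks, key=lambda p: p[1], reverse=True)[: self.n]
        return Spectrum(spectrum.precursor_mz, spectrum.charge, tuple(sorted(kept)))


class SharedPeakCount(ScoringMethod):
    """Example score: number of query peaks with a candidate peak within tol."""
    def __init__(self, tol=0.5):
        self.tol = tol

    def score(self, query, candidate):
        return float(sum(
            any(abs(q_mz - c_mz) <= self.tol for c_mz, _ in candidate.peaks)
            for q_mz, _ in query.peaks
        ))


def partition_library(library, width=10.0):
    """Group library spectra by (charge, precursor m/z bin of the given width)."""
    parts = defaultdict(list)
    for s in library:
        parts[(s.charge, int(s.precursor_mz // width))].append(s)
    return dict(parts)


def search_partition(key, spectra, queries, scorer, width, tol):
    """Score every query whose precursor window overlaps this partition."""
    charge, mz_bin = key
    hits = []
    for qi, q in enumerate(queries):
        lo = int((q.precursor_mz - tol) // width)
        hi = int((q.precursor_mz + tol) // width)
        if q.charge != charge or not lo <= mz_bin <= hi:
            continue
        for s in spectra:
            if abs(s.precursor_mz - q.precursor_mz) <= tol:
                hits.append((qi, scorer.score(q, s), s))
    return hits


def search(queries, partitions, scorer, width=10.0, tol=2.0, workers=4):
    """Search partitions in worker threads; keep the best hit per query index."""
    best = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(search_partition, key, spectra, queries, scorer, width, tol)
            for key, spectra in partitions.items()
        ]
        for future in futures:
            for qi, score, s in future.result():
                if qi not in best or score > best[qi][0]:
                    best[qi] = (score, s)
    return best
```

Because filters and scores sit behind abstract base classes, a new scoring method can be swapped into `search` without touching the partitioning code. In CPython, threads mainly mirror the structure of the design; a process pool would be needed for CPU-bound speedups.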
In spectrum-to-sequence searching, a hypergeometric probability distribution closely approximates the frequency of randomly matching MS/MS fragments to those predicted from a sequence database, and scoring functions based on this model have shown higher performance than other probabilistic methods in database searching. Probabilistic scores typically consider only the m/z of fragment ion matches and ignore peak intensity. Therefore, we developed a scoring function in which peaks from the library and experimental spectra are prefiltered by intensity before matching and probability calculations.

The hypergeometric probability score by multi-candidate consideration (MHP) uses a hypergeometric distribution to model the frequency of random matches between fragment ions in an experimental spectrum and the set of all fragment ions found in library spectra within a certain precursor mass tolerance: […] The terms in parentheses are binomial coefficients. N is the number of all fragment ions from the candidate library spectra, i.e. the library spectra whose precursor masses fall within tolerance of the precursor mass of the experimental spectrum. K is the number of those N peaks that match ions in the experimental spectrum within tolerance. N1 is the number of fragment ions in a single candidate library spectrum, and K1 is the number of those N1 peaks that match ions in the experimental MS/MS spectrum. Natural logarithms of the binomial coefficients are used to simplify the calculation of the final score.

MHP is adapted from a hypergeometric score described by Sadygov et al., which was used to model random matching to predicted fragment ions in a sequence database rather than a spectral library. By considering random matches to the global background of all candidate fragment ions in a spectral library, MHP should correct for mass- and size-dependent biases that arise with other scores, such as Sequest's XCorr.
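As an illustration of the MHP definitions above, the following sketch evaluates a hypergeometric point probability from N, K, N1 and K1 using log-binomial coefficients. The exact published formula is not reproduced in this excerpt, so the form used here, P = C(K, K1) · C(N − K, N1 − K1) / C(N, N1) with the score taken as −ln P, is an assumption based on the standard hypergeometric distribution. The function names, the 0.5 m/z fragment tolerance, and the simple matching rule are likewise hypothetical, and intensity prefiltering is assumed to have been done already.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    if k < 0 or k > n:
        return float("-inf")
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def count_matches(library_mzs, experimental_mzs, tol):
    """Number of library peaks with an experimental peak within tol."""
    return sum(
        any(abs(lib_mz - exp_mz) <= tol for exp_mz in experimental_mzs)
        for lib_mz in library_mzs
    )


def mhp_scores(experimental_mzs, candidates, tol=0.5):
    """Return an assumed MHP-style score (-ln P) for each candidate.

    candidates: list of fragment m/z lists, one per library spectrum whose
    precursor mass is already within tolerance of the experimental precursor.
    """
    k1_values = [count_matches(c, experimental_mzs, tol) for c in candidates]
    N = sum(len(c) for c in candidates)  # all candidate fragment ions
    K = sum(k1_values)                   # those that match the experiment
    scores = []
    for candidate, K1 in zip(candidates, k1_values):
        N1 = len(candidate)
        log_p = log_binom(K, K1) + log_binom(N - K, N1 - K1) - log_binom(N, N1)
        scores.append(-log_p)
    return scores
```

Because this sketch uses a point probability, a candidate with far fewer matches than expected by chance also receives a small probability; whether the published score guards against this is not stated in the excerpt.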
Consistent with this expected correction for mass-dependent bias, the hypergeometric score described for spectrum-to-sequence searching was shown to be largely independent of peptide charge state and thus of peptide mass.

The hypergeometric probability score by single-candidate consideration (SHP) considers matches between experimental and candidate library spectra, without considering background matches within the library. The experimental spectrum is first divided into 1-m/z bins. […] In this equation, N is the total number of these bins between the lowest-m/z peak and the highest-m/z peak, m is the number of ions in the experimental spectrum, k is the number of ions in the experimental spectrum that match the library spectrum, and n is the number of ions in the library spectrum. SHP was adapted from a hypergeometric score described by Tabb et al., except that SHP uses a univariate, rather than a multivariate, hypergeometric distribution, and library spectra are used in place of predicted fragment m/z ladders from a protein sequence database. […]
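The SHP definitions above can be sketched in the same way. The published equation is not reproduced in this excerpt, so this sketch assumes the standard univariate hypergeometric form P = C(m, k) · C(N − m, n − k) / C(N, n) and reports −ln P. Several details are assumptions made for illustration only: ions are counted per occupied 1-m/z bin, a match means that both spectra have an ion in the same bin, and library ions outside the experimental m/z range are ignored. The function name `shp_score` is hypothetical.

```python
import math


def shp_score(experimental_mzs, library_mzs):
    """Assumed SHP-style score: -ln of a univariate hypergeometric probability."""
    if not experimental_mzs:
        raise ValueError("experimental spectrum has no peaks")

    # Divide the experimental spectrum into 1-m/z bins.
    exp_bins = {math.floor(mz) for mz in experimental_mzs}
    lo, hi = min(exp_bins), max(exp_bins)

    # Keep only library ions that fall inside the experimental m/z range.
    lib_bins = {
        b for b in (math.floor(mz) for mz in library_mzs) if lo <= b <= hi
    }

    N = hi - lo + 1               # bins between lowest and highest peak
    m = len(exp_bins)             # occupied experimental bins
    n = len(lib_bins)             # occupied library bins
    k = len(exp_bins & lib_bins)  # bins occupied in both spectra

    log_p = (
        math.log(math.comb(m, k))
        + math.log(math.comb(N - m, n - k))
        - math.log(math.comb(N, n))
    )
    return -log_p
```

Unlike the multi-candidate score, this calculation needs only one experimental spectrum and one library spectrum, so each candidate can be scored without first collecting all other candidates.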

Pipeline specifications

Software tools: BiblioSpec, SpectraST, Comet
Application: MS-based untargeted proteomics
Organisms: Homo sapiens
Diseases: Multiple Sclerosis