Computational protocol: Protein disulfide topology determination through the fusion of mass spectrometric analysis and sequence-based prediction using Dempster-Shafer theory

Protocol publication

[…] Disulfide (S-S) bonds constitute one of the main cross-linkages present in proteins and can be broadly characterized as structural, catalytic, or allosteric []. Structural S-S bonds play an important role in the folding and stabilization of proteins and are involved in the formation of structural motifs such as the cysteine knot and the CXXC motif. Catalytic S-S bonds mediate thiol-disulfide interchange reactions in substrate proteins and play an important role in the regulation of enzymatic activity [,]. Finally, allosteric S-S bonds regulate protein function in non-enzymatic ways by triggering a conformational change when the bond breaks and/or forms. Thus, identification of the S-S bond topology is essential for understanding and reasoning about both protein structure and function [].

At present, several methods can be used to determine S-S bonds, including Edman degradation, NMR, crystallography, and algorithmic methods that are based either on analysis of sequence information (hereafter termed sequence-based methods) or on analysis of information from mass spectrometry (hereafter called MS-based methods). Recent introductions and reviews of these methods can be found in [,,]. It is important to note that each of the above classes of methods has advantages as well as shortcomings. For instance, the use of Edman degradation can be limited by its requirement for ultra-pure samples. Similarly, NMR and crystallography, while highly accurate, require relatively large amounts (10 to 100 mg) of pure protein in a particular solution or crystalline state.
Both of these methods can also be limited by protein size and are fundamentally low-throughput.

Amongst approaches that involve algorithmic analysis, sequence-based methods utilize global features, such as the statistical frequency of amino acid residues [] and cysteine state sequences [], or local features that encode the characteristics of the sequence environment around the cysteines [,]. The process of developing a model for determining the S-S connectivity from such features can be based on: (1) Characteristics of nearest neighbour(s). Techniques in this category identify disulfide bonds based on the closest training sample(s) in the feature space [-]. From a machine learning perspective, this class of methods constitutes an example of instance-based learning. (2) Supervised learning of the classification function. Methods in this class have employed approaches such as neural networks, support vector machines, and logistic regression [,-]. (3) Physics-based modelling. This class of methods has primarily modelled the problem as a graph, where cysteines constitute the vertices and the edges are weighted using some measure that is indicative of physical-chemical interactions, such as contact potential or evolutionary information [,]. Determining the disulfide connectivity is then cast as a graph-theoretic optimization problem.

An advantage of sequence-based methods is that, once a model has been developed, its application does not require significant data preparation and it can be run in high-throughput settings, as it requires only the protein sequence information. A critical disadvantage, however, lies in the fact that it may not always be possible to obtain an accurate mapping between local or global features and the presence of specific disulfide bonds.
For supervised methods, difficulties can also arise if the test samples have high sequence homology with the training set but weaker structural homology.

MS-based methods [-] involve a combination of experimental and algorithmic processing and can be applied under conditions of either partial reduction or non-reduction of the protein. The basic idea behind MS-based methods lies in: (1) generating the theoretical spectra in terms of the fragmentation model used by a specific method and (2) matching the theoretical spectra to the experimental spectra obtained from the MS or MS/MS step. While MS-based methods are generally more accurate than sequence-based methods, as shown by the direct comparisons in [], they too have limitations. For instance, ambiguous results can occur under conditions of partial reduction if the S-S bonds have similar reduction rates. Under non-reduction conditions, on the other hand, S-S bonds can be missed for molecules that have multiple S-S bonds or a large number of cysteines []. Furthermore, the fragmentation model used in the algorithms for interpreting MS data can also have limitations; commonly used fragmentation models often consider only a small number of ion types to avoid a combinatorial explosion in the number of theoretical fragments that have to be generated and matched []. However, other ion types do occur and should, ideally, be accounted for. Finally, under certain bond arrangements, the fragmentation process from mass spectrometry may itself lack sufficient information to identify specific bonds. This can happen, for example, when (1) the precursor ion fragmentation produces different fragments only at the outside boundaries of the intra-disulfide bond, (2) the presence of cross-linked or circular disulfide bonds prevents the fragmentation of precursor ions, or (3) the energy used to fragment complex molecules is not sufficient to break strong intra-chain and inter-chain bonds present in the molecule's structure.
All the above conditions can cause too few product ions to be generated.

An illustration of the variable success of established S-S bond detection methods, as applied to a set of nine eukaryotic glycosyltransferases, is shown in Table . While not exhaustive in terms of available methods, the table demonstrates that no single class of method performs accurately in all cases. For instance, the mass spectrometry-based method MassMatrix fails to identify the C24-C145 bond in the molecule C2GnT-I (Swiss-Prot:Q09324). This bond is found by both DISULFIND and DiANNA 1.1, which are sequence-based methods. However, not all sequence-based methods find this bond. The table also highlights the fact that methods (and underlying models) which work well in some cases do not work equally well in others. For instance, DisLocate, which utilizes protein subcellular localization to determine the S-S bonds, can find only one bond. However, on the SPxx data sets [], this method has been shown to outperform other sequence-based methods [].

Given the aforementioned context, we propose a novel theoretical framework, as well as a concrete method, for S-S bond determination based on aggregation and fusion of evidence from different methods. This framework is based on the Dempster-Shafer theory of evidence combination. As part of our proposed method, we specifically focus on combining evidence from MS-based and sequence-based methods and show that this approach significantly improves upon each of its constituents in terms of the ability to detect S-S bonds. […]
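
The idea of fusing evidence from two sources can be illustrated with Dempster's rule of combination, the standard combination operator in Dempster-Shafer theory. The Python sketch below is not the authors' implementation. The function name combine_masses, the two-hypothesis frame (bond / no_bond) for a single candidate cysteine pair, and every mass value are assumptions made purely for illustration; the excerpt does not specify how masses are derived from MS-based or sequence-based scores.

```python
from itertools import product

# Hypothetical frame of discernment for ONE candidate cysteine pair.
BOND = frozenset({"bond"})
NO_BOND = frozenset({"no_bond"})
THETA = BOND | NO_BOND  # the whole frame, i.e. "undecided"


def combine_masses(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses (a focal element)
    to a non-negative mass; the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence accumulates on the intersection.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to contradictory focal elements is conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0 - 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalise by the non-conflicting mass.
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


# Illustrative (invented) masses: an MS-based source that supports the bond
# but leaves some mass undecided, and a sequence-based source that is split.
ms_evidence = {BOND: 0.6, THETA: 0.4}
seq_evidence = {BOND: 0.5, NO_BOND: 0.2, THETA: 0.3}

fused = combine_masses(ms_evidence, seq_evidence)

if __name__ == "__main__":
    for focal, mass in fused.items():
        print(sorted(focal), round(mass, 3))
```

With these invented inputs, the fused mass on the bond hypothesis is higher than the mass either source assigns on its own, because the undecided mass of each source is redistributed according to the other source's evidence. A real system would need a principled mapping from method scores to masses and a frame covering whole bond topologies, neither of which is attempted here.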

Pipeline specifications

Software tools: DISULFIND, DisLocate
Databases: UniProt
Application: Protein structure analysis