Computational protocol: Predicting and improving the protein sequence alignment quality by support vector regression

Protocol publication

[…] Conventionally, alignment quality is calculated by comparing the sequence alignment with structural alignments generated by structure alignment programs such as SARF [], CE, and MAMMOTH, assuming that the structure alignments are the gold standard. A problem with this approach is that, depending on the specific choice of structure alignment program, the structure alignments can vary significantly, especially for distantly homologous pairs. A different approach is to first build a quick structure model of the query protein by directly copying the C-α positions of all aligned residues of the template protein according to the sequence alignment, and then compute a protein structure model quality measure such as MaxSub [] or TM-score [] and use it as the alignment quality score. The second approach is more relevant to the present study, because the main focus of this work is how to generate good sequence alignments that eventually lead to better structure models. Specifically, we use MaxSub [], a popular model quality measure that finds the largest subset of Cα atoms of a model that superimposes well on the experimental structure. At the training stage, each alignment is converted into a structure model of the query protein. The MaxSub score is then calculated between the model derived from the alignment and the correct structure, with the d parameter set to 3.5 Å, which has been found to be a good choice for evaluating fold-recognition models []. We also considered using TM-score [], another popular model quality measure, as the alignment quality measure. However, the correlation between MaxSub scores and TM-scores turned out to be as high as 0.95. We therefore expect that our specific choice of MaxSub as the alignment quality measure does not affect the performance of our method or the main conclusions of this work. [...] 
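The scoring idea above can be sketched in code. The following is a simplified, hypothetical MaxSub-style routine, not the published MaxSub program: it iteratively superimposes the model's Cα atoms on the native structure (Kabsch algorithm), keeps only residues within the d = 3.5 Å cutoff, and returns the normalized sum of 1/(1 + (d_i/d)^2) over the retained subset. The iteration scheme and scoring function are assumptions for illustration.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t minimizing ||P @ R.T + t - Q|| (Kabsch)."""
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - pc).T @ (Q - qc)                      # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    sign = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    return R, qc - R @ pc

def maxsub_like(model, native, d=3.5, iters=5):
    """Simplified MaxSub-style score between model and native Ca coordinates
    (N x 3 arrays). Iteratively refits on residues within d Angstroms."""
    n = len(native)
    subset = np.arange(n)
    for _ in range(iters):
        R, t = kabsch(model[subset], native[subset])
        dist = np.linalg.norm(model @ R.T + t - native, axis=1)
        new = np.where(dist <= d)[0]
        if len(new) < 3 or np.array_equal(new, subset):
            break
        subset = new
    R, t = kabsch(model[subset], native[subset])
    dist = np.linalg.norm(model[subset] @ R.T + t - native[subset], axis=1)
    return float(np.sum(1.0 / (1.0 + (dist / d) ** 2)) / n)
```

A model whose copied Cα positions superimpose perfectly on the correct structure scores 1.0; unalignable residues reduce the score toward 0.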
To train SVR models for all templates in the training set, the feature vector scheme developed in previous work [] is adopted with slight modification. We first generate all-against-all alignments within each set sharing the same fold, using a profile-profile alignment scheme with 48 different combinations of alignment parameters (gap open penalty, gap extension penalty, baseline score, and weight of the predicted secondary structure). The profile-profile alignment score for aligning position i of a query q and position j of a template t is given by

m_{ij} = \sum_{k=1}^{20} \left[ f_{ik}^{q} S_{jk}^{t} + S_{ik}^{q} f_{jk}^{t} \right] + s_{ij} + b

where f_{ik}^{q}, f_{jk}^{t}, S_{ik}^{q}, and S_{jk}^{t} are the frequencies and the position-specific scoring matrix (PSSM) scores of amino acid k at position i of the query q and position j of the template t, respectively. For the secondary structure score (s_{ij}), a positive score is added (subtracted) if the predicted secondary structure of the query protein at position i is the same (a different) type as the secondary structure of the template protein at position j. Finally, the constant baseline score (b) is added to the alignment score. The frequency matrices and PSSMs are generated by running PSI-BLAST [] with default parameters except for the number of iterations (j = 11) and the E-value cutoff (h = 0.001). For each template of length n in the training set, alignments with the other templates in the training set are generated. These alignments are then transformed into (n + 1)-dimensional feature vectors, (sa_1, sa_2, ..., sa_i, ..., sa_n, query_length), where sa_i is the profile-profile alignment score at position i of the given template [] and query_length is the length of the query protein (Figure ). If gaps occur, arbitrarily chosen fixed negative scores are assigned. This is a modified version of []; the difference is that we use query_length instead of the total alignment score. Since the size of the feature vector depends on the length n of the template protein, a separate SVR model is built for each template, so the number of SVRs equals the number of templates. [...] 
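The per-position score m_ij and the (n + 1)-dimensional feature vector can be sketched as follows. The array shapes, the dictionary representation of an alignment, and the exact gap score value are assumptions for illustration; the frequency matrices f and PSSMs S are n × 20 per protein, as implied by the sum over the 20 amino acid types.

```python
import numpy as np

GAP_SCORE = -1.0  # fixed negative score for gapped positions (value assumed)

def profile_score(f_q, S_q, f_t, S_t, i, j, ss_score=0.0, baseline=0.0):
    """m_ij = sum_k [f_q[i,k]*S_t[j,k] + S_q[i,k]*f_t[j,k]] + s_ij + b,
    with f the frequency matrices and S the PSSMs (each length x 20)."""
    return float(f_q[i] @ S_t[j] + S_q[i] @ f_t[j]) + ss_score + baseline

def feature_vector(alignment, f_q, S_q, f_t, S_t, query_length, n_template):
    """Build (sa_1, ..., sa_n, query_length) for one alignment, where
    `alignment` maps template position j -> query position i (None = gap)."""
    sa = np.full(n_template, GAP_SCORE)     # gaps keep the fixed negative score
    for j, i in alignment.items():
        if i is not None:
            sa[j] = profile_score(f_q, S_q, f_t, S_t, i, j)
    return np.append(sa, float(query_length))
```

Each of the 48 alignments of a query against a given template yields one such vector, which becomes one SVR training input for that template.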
Only templates sharing at least the same fold with a target template are used for training. To learn from as many alignment examples as possible, 48 alignments are made for each query-template pair (Table ). The gap open penalty ranges from 5 to 13; the gap extension penalty is one or two; the baseline value is zero or one. The parameter weighting the predicted secondary structure information is also varied. The input and target of the SVR are derived as described in the previous two sections. We would like to emphasize that there is no single correct alignment example; regression is fundamentally a real-value prediction. For each input-target pair in the training samples, SVR models are trained with a radial basis function (RBF) kernel using SVMlight version 6.01 with the gamma parameter set to 0.001, without attempting serious performance optimization []. […]
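The training step for one template can be sketched as below. This is a minimal illustration using scikit-learn's SVR as a stand-in for SVMlight 6.01 (the two are different implementations; only the RBF kernel and gamma = 0.001 come from the text). The template length, number of alignments, and the random data standing in for feature vectors and MaxSub targets are all assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# One SVR per template: inputs are the (n + 1)-dim feature vectors from the
# 48-parameter alignment ensemble, targets are the MaxSub scores of the
# models built from those alignments.
rng = np.random.default_rng(0)
n_template = 120                 # template length n (assumed)
n_alignments = 48 * 10           # 48 parameter sets x 10 queries (assumed)

X = rng.normal(size=(n_alignments, n_template + 1))   # stand-in feature vectors
y = rng.random(n_alignments)                          # stand-in MaxSub scores in [0, 1]

svr = SVR(kernel="rbf", gamma=0.001)  # no further parameter tuning, as in the text
svr.fit(X, y)
pred = svr.predict(X[:5])             # predicted alignment quality scores
```

At prediction time, the trained SVR for a template maps a new alignment's feature vector to an estimated MaxSub score, so the best-scoring of the 48 candidate alignments can be selected without knowing the query structure.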

Pipeline specifications