Computational protocol: Presep: Predicting the Propensity of a Protein Being Secreted into the Supernatant when Expressed in Pichia pastoris

Protocol publication

[…] To train the models used for Presep, we constructed the Secreprot dataset, which contains 1093 proteins experimentally validated in P. pastoris. To find a sequence representation that could accurately identify proteins secreted into the supernatant, we investigated the prediction performance of Type I and Type II PseAAC. Type I PseAAC is a parallel-correlation type analysis that generates 20 + λ discrete numbers to represent a protein. Type II PseAAC is a series-correlation type analysis that generates 20 + i × λ discrete numbers to represent a protein, where i is the number of amino acid attributes selected. The parameter λ is a non-negative integer that denotes the correlation rank of amino acids along a protein sequence. Type I and Type II PseAAC models were generated using PseAAC-Builder with different parameters selected for each analysis; the prediction performance for each of these methods is shown in . Using 20-fold cross-validation, both strategies achieved a high degree of accuracy, with MCC and overall accuracy (Q2) scores of 0.78 and 95%, respectively. However, the two parameters used in these analyses, the weight factor w and λ, had remarkably different effects on model performance depending on the encoding method. With the Type I encoding strategy, w had a much weaker effect on model performance than λ. This was not the case with the Type II encoding method, where w greatly affected model performance. These results highlight the need to optimize parameter settings for the encoding method used. The top 10 parameter settings identified in this analysis are shown in . […]
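To make the roles of λ and w concrete, the Type I (parallel-correlation) encoding described above can be sketched in a few lines of Python. This is a minimal illustration of the general PseAAC formulation, not the PseAAC-Builder implementation: the function name type1_pseaac, the example sequence, and the use of a single property (the Kyte-Doolittle hydropathy scale) are assumptions made here for brevity, and the property sets actually used for Presep are not stated in this excerpt.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values. Using this single scale is an assumption
# for illustration; PseAAC-Builder offers its own choice of properties.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (scale[a] - mean) / sd for a in AMINO_ACIDS}


def type1_pseaac(seq, lam, w, scales=(KYTE_DOOLITTLE,)):
    """Return a Type I PseAAC vector with 20 + lam components (hypothetical helper)."""
    seq = seq.upper()
    if any(res not in AMINO_ACIDS for res in seq):
        raise ValueError("sequence contains non-standard residues")
    if not 0 <= lam < len(seq):
        raise ValueError("lam must satisfy 0 <= lam < len(seq)")

    props = [standardize(s) for s in scales]
    n = len(seq)

    # First 20 components: plain amino acid composition.
    freqs = [seq.count(a) / n for a in AMINO_ACIDS]

    # theta_k: mean squared property difference between residues k positions
    # apart, averaged over the selected properties (rank-k correlation factor).
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(n - k):
            total += sum((p[seq[i]] - p[seq[i + k]]) ** 2 for p in props) / len(props)
        thetas.append(total / (n - k))

    # w weights the sequence-order terms against the composition terms.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


if __name__ == "__main__":
    # Made-up example sequence, not taken from the Secreprot dataset.
    vec = type1_pseaac("MKTAYIAKQRQISFVKSHFSRQ", lam=5, w=0.05)
    print(len(vec))
```

In this formulation the vector always sums to 1, so raising w shifts weight from the 20 composition terms to the λ correlation terms. Setting λ = 0 reduces the encoding to plain amino acid composition.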

Pipeline specifications

Software tools: PseAAC, PseAAC-Builder
Application: Protein sequence analysis
Organisms: Komagataella pastoris, Komagataella phaffii GS115