Computational protocol: Effective Identification of Bacterial Type III Secretion Signals Using Joint Element Features

Protocol publication

[…] Previous studies suggested that individual secondary structure (Sse) or solvent accessibility (Acc) features make almost no contribution to the specific recognition of T3S proteins. In these studies, however, the authors assumed that the Sse and Acc variables were independent of amino acid composition (Aac). We instead treat Sse, Acc and Aac as mutually dependent co-variables, and examined the joint profiles of these three features at each position of the signal sequences of T3S and non-T3S proteins.
T3S proteins exhibit a more pronounced joint element preference than non-T3S proteins. Specifically, fewer distinct elements occur at each position of T3S N-terminal sequences. For most positions, the cumulative occurrence frequencies of the top 10 and top 20 elements are both higher for T3S proteins. ‘SCe’ (‘serine-coil-exposed’) is the element most frequently preferred by T3S proteins at most positions, followed by ‘TCe’ (‘threonine-coil-exposed’), ‘PCe’ (‘proline-coil-exposed’), ‘NCe’ (‘asparagine-coil-exposed’), ‘GCe’ (‘glycine-coil-exposed’), etc. The difference remains striking when the numbers of non-T3S and T3S proteins are equal, indicating that the general joint element preference in T3S proteins is not caused by the smaller data size. Non-T3S proteins also show a preference for certain elements, especially within the first 25 positions, but the preferred elements are clearly different. For example, ‘LHb’ (‘leucine-helix-buried’), ‘AHb’ (‘alanine-helix-buried’), and ‘VHb’ (‘valine-helix-buried’) are more frequently found in non-T3S proteins.
[...] The position-specific joint element features were extracted using the Bi-profile Bayes (BPB) model and then used to train a Support Vector Machine (SVM). The parameters were then optimized.
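The joint element idea and the bi-profile encoding described above can be illustrated with a minimal sketch. It is not the T3SEpre implementation: the function names (`joint_elements`, `position_profile`, `bpb_encode`), the one-letter codes for structure (`C`, `H`) and accessibility (`e`, `b`), and the four-residue toy sequences are all assumptions made for illustration. The sketch assumes that the BPB feature vector holds, for each position, the frequency of the observed element in the T3S profile followed by its frequency in the non-T3S profile; the resulting vectors would then be passed to an SVM, which is omitted here.

```python
from collections import Counter

def joint_elements(aa, ss, acc):
    """Combine aligned residue, secondary-structure and accessibility strings
    into one joint element per position, e.g. 'S' + 'C' + 'e' -> 'SCe'."""
    if not (len(aa) == len(ss) == len(acc)):
        raise ValueError("the three annotation strings must have equal length")
    return [a + s + x for a, s, x in zip(aa, ss, acc)]

def position_profile(element_seqs, length):
    """Relative frequency of every joint element at each of the first `length` positions."""
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in element_seqs if len(seq) > pos)
        total = sum(counts.values())
        profile.append({el: n / total for el, n in counts.items()})
    return profile

def bpb_encode(elements, pos_profile, neg_profile):
    """Bi-profile vector: frequency of each observed element in the positive
    profile (first half) and in the negative profile (second half)."""
    length = len(pos_profile)

    def lookup(profile):
        return [profile[i].get(elements[i], 0.0) if i < len(elements) else 0.0
                for i in range(length)]

    return lookup(pos_profile) + lookup(neg_profile)

# Hypothetical toy data: two "T3S" and two "non-T3S" N-terminal fragments.
t3s = [joint_elements("MSTP", "CCCC", "eeee"),
       joint_elements("MSNP", "CCCC", "eeee")]
non_t3s = [joint_elements("MLAV", "CHHH", "ebbb"),
           joint_elements("MLLA", "CHHH", "ebbb")]

pos_profile = position_profile(t3s, 4)
neg_profile = position_profile(non_t3s, 4)

query = joint_elements("MSTG", "CCCC", "eeee")
vector = bpb_encode(query, pos_profile, neg_profile)
print(vector)
```

Because each joint element is looked up as a single symbol, the secondary structure and accessibility states only contribute in combination with the residue at that position, which mirrors the Aac-dependent treatment described in the text.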
The new classifier, named T3SEpre, achieved excellent classification performance, with a sensitivity of 95.9% at a high specificity of 97.7% in a 5-fold cross-validation.
We found that the Sse and Acc features make an important contribution to the specificity of T3S signals. BPBAac, which adopts the position-specific Aac feature only, is one of the best T3S protein classification programs. A direct comparison showed that T3SEpre outperformed BPBAac on the same training dataset. A BPBAll model was also trained on the current datasets, based on a simple linear combination of Aac, Sse and Acc features. Consistent with previous results, the discriminative performance of BPBAac was slightly better than that of BPBAll. This indicates that the Sse and Acc features do not contribute to T3S specificity independently, but rather in an Aac-dependent manner. Furthermore, T3SEpre was compared with SSE-ACC, a T3S classifier that uses an SVM trained on sequence-based but not position-specific features. T3SEpre also outperformed SSE-ACC in terms of sensitivity, specificity, accuracy, MCC and area under the ROC curve (AUC). Position-based features therefore prove more effective in distinguishing T3S proteins.
To make a thorough comparison, independent datasets were also tested. First, two large-scale T3S protein datasets, Arnold 2009 and Lower 2009, were used. Arnold 2009 contains 109 high-quality validated T3S effectors from different species. Lower 2009 contains 533 partially validated T3S effectors. For both datasets, T3SEpre performed clearly better than BPBAac, especially in terms of sensitivity, accuracy and MCC. T3SEpre also outperformed the earlier software Effective T3. In addition, two other new datasets (Mukaihara 2010 and Baltrus 2011), each containing validated T3S effectors from an individual bacterial species or genus, were also adopted.
Mukaihara 2010 contains a group of validated Ralstonia T3S effectors, while Baltrus 2011 is a comprehensive set of validated Pseudomonas T3S effectors. For Mukaihara 2010, T3SEpre correctly recalled 32 of the 35 non-homologous effectors (91.4%), whereas BPBAac and Effective T3 recalled only ∼60% of them. T3SEpre also recalled many more of the known Baltrus 2011 effectors.
The robustness of T3SEpre was further examined using two strategies: (1) sub-datasets of different sizes were randomly selected from the training data to re-train the model and classify the remaining data; (2) a Leave-One-Out strategy was adopted, in which the T3S and non-T3S proteins from one bacterial genus or subgroup were classified by a model trained on the remaining training data. The results showed that models trained on different sub-datasets performed equally well, and that performance was still fairly good even when only 30% of the original training data were used. In the Leave-One-Out assessment, most of the effectors (93.4±5.4%) were recalled and a consistently high specificity (98.0±2.2%) was obtained. A comparison was also made between T3SEpre and BPBAac. Except for a few genera or subgroups (e.g., Yersinia and Citrobacter), T3SEpre recalled more effectors (or an identical number) at a similarly high specificity. Chlamydiae is a genus phylogenetically distant from other bacteria with a functional T3SS. Using effectors and non-effectors of other bacteria as training sequences, BPBAac recalled 73.7% (14/19) of Chlamydiae effectors, whereas the T3SEpre model trained on the same dataset recalled 94.7% (18/19). Results for T3S effectors of animal and plant pathogens/symbionts also demonstrated the high efficacy of T3SEpre. […]
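The evaluation measures quoted in this section and the genus-wise Leave-One-Out split can be sketched as follows. This is an illustrative outline only: the helper names (`metrics`, `recall_percent`, `leave_one_group_out`), the record layout `(group, features, label)` and the confusion-matrix counts in the usage example are assumptions, and the classifier itself is left out. Only the recall fractions 32/35, 14/19 and 18/19 come from the text.

```python
import math

def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient
    from confusion-matrix counts (effectors are the positive class)."""
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sens, "specificity": spec, "accuracy": acc, "mcc": mcc}

def recall_percent(recalled, total):
    """Recall as a percentage rounded to one decimal place."""
    return round(100.0 * recalled / total, 1)

def leave_one_group_out(records):
    """Yield (held_out_group, train, test), holding out one genus/subgroup at a time.
    Each record is a (group, features, label) tuple."""
    for group in sorted({r[0] for r in records}):
        train = [r for r in records if r[0] != group]
        test = [r for r in records if r[0] == group]
        yield group, train, test

# Recall figures reported in the text, recomputed from the stated fractions.
print(recall_percent(32, 35))  # Mukaihara 2010, T3SEpre
print(recall_percent(14, 19))  # Chlamydiae, BPBAac
print(recall_percent(18, 19))  # Chlamydiae, T3SEpre

# Hypothetical counts, used only to show the metric calculation.
m = metrics(tp=9, tn=8, fp=2, fn=1)
print(m)

# Hypothetical records; features are placeholders, labels are 1 (T3S) or 0 (non-T3S).
records = [("Yersinia", [0.1], 1), ("Yersinia", [0.7], 0),
           ("Salmonella", [0.2], 1), ("Chlamydiae", [0.3], 1)]
splits = list(leave_one_group_out(records))
for group, train, test in splits:
    print(group, len(train), len(test))
```

Holding out a whole genus, rather than random sequences, tests whether the model generalises to organisms that contributed nothing to training, which is the point of the Chlamydiae comparison above.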

Pipeline specifications

Software tools: SSE-ACC, T3SEpre, BPBAac
Application: Protein sequence analysis
Organisms: Saccharomyces cerevisiae