Computational protocol: Cytoskeletal Components of an Invasion Machine—The Apical Complex of Toxoplasma gondii

Protocol publication

[…] Three replicates of conoid-enriched and conoid-depleted fractions were prepared for MS/MS analysis using proteinase K digestion as described in []. For seven additional replicates of conoid-enriched and nine additional replicates of conoid-depleted fractions, protein solutions were brought to 400 μl in 100 mM Tris-HCl pH 8.5, precipitated with a final TCA concentration of 20% on ice overnight, washed twice with cold acetone, and dried with a SpeedVac. The protein pellets were then prepared for MS/MS analysis with endoproteinase LysC and trypsin digestion [] as described. MudPIT was performed as described in [,]. MS/MS datasets were searched with a modified SEQUEST algorithm [] against a database combining all ORFs ≥ 15 aa predicted from a six-frame translation of the latest release of the T. gondii 10× genome shotgun sequences, as well as ORFs ≥ 50 aa from a six-frame translation of Toxoplasma EST sequences (both available at []), or against a database combining host (human sequences from NCBI RefSeq) with the Toxoplasma sequences.

The PEP_PROBE algorithm [], a modified version of SEQUEST [], was used to match MS/MS spectra to peptides. When applied to complex protein mixtures, the raw output from PEP_PROBE includes a large proportion of incorrect peptide assignments that must be removed by selection criteria based on the match parameters. The PEP_PROBE outputs were parsed and filtered using DTASelect []. Every peptide hit was required to be at least 12 amino acids long. In the initial filtering, peptide assignments to spectra were retained only if they had a minimum cross-correlation score (XCorr) of 1.8 for singly-charged spectra, 2.5 for doubly-charged spectra, and 3.5 for triply-charged spectra, and a normalized difference in correlation score (ΔCN) of at least 0.08.
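The initial selection criteria can be written as a simple predicate. The following is a minimal sketch of the stated thresholds only; it is not DTASelect's implementation, and the `PeptideHit` record, its field names, and the function name are hypothetical. Dropping charge states other than 1+ to 3+ is an assumption, since the text gives no thresholds for them.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MIN_XCORR_BY_CHARGE = {1: 1.8, 2: 2.5, 3: 3.5}
MIN_DELTA_CN = 0.08
MIN_PEPTIDE_LENGTH = 12


@dataclass
class PeptideHit:
    """Hypothetical record for one peptide-spectrum assignment."""
    sequence: str
    charge: int
    xcorr: float
    delta_cn: float


def passes_initial_filter(hit: PeptideHit) -> bool:
    """Return True if the hit meets the length, XCorr and deltaCN cutoffs."""
    min_xcorr = MIN_XCORR_BY_CHARGE.get(hit.charge)
    if min_xcorr is None:
        # Assumption: charge states without a stated threshold are discarded.
        return False
    return (
        len(hit.sequence) >= MIN_PEPTIDE_LENGTH
        and hit.xcorr >= min_xcorr
        and hit.delta_cn >= MIN_DELTA_CN
    )
```

A list of hits would then be reduced with `[h for h in hits if passes_initial_filter(h)]` before any further filtering.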
These thresholds were deliberately set low enough to be sure of retaining any bona fide spectra–peptide matches, but inevitably passed a significant number of spectra with incorrectly assigned peptides.

Further filtering was accomplished by setting up a discriminant function [] specific for this dataset. To provide a training set, a pseudogenome was constructed, composed of 2 million random-sequence ORFs, ranging in length from 15 aa to 3000 aa (1.13 × 10⁸ total amino acids) with exactly the same length distribution and amino acid composition as the ORFs of the T. gondii genome. The truly random nature of this pseudogenome was confirmed by whole-genome BLAST against both human and T. gondii genomes. Against the human genome there were 34 perfectly matched 11 aa strings and three perfectly matched 12 aa strings within the pseudogenome; against Toxoplasma there were 121 and 20, respectively. In neither case was there a match longer than 12 aa, as predicted for a random set of this size. The complete set of MS/MS spectra from the conoid-enriched and conoid-depleted fractions was assigned to peptides within this pseudogenome using PEP_PROBE with the same selection parameters as used for the assignments to Toxoplasma peptides. This yielded a set of ~14,000 peptide–spectra assignments that are known with complete certainty to be incorrect. An empirical discriminant function capable of distinguishing correct from incorrect assignments was then constructed in an iterative fashion as follows.

The incorrect peptide–spectra assignments against the random pseudogenome were combined with the observed experimental peptide assignments (an unknown mixture of correct and incorrect) against T. gondii ORFs. Initially all of the T. gondii genome peptide identifications were designated “correct.” The statistical software package STATA (StataCorp, College Station, Texas, United States) was used to generate a discriminant function for separating correct from incorrect peptide–spectra matches based on the parameters returned by PEP_PROBE [] for each peptide identification. Of the eight parameters characterizing the peptide assignment to a spectrum (XCorr, ΔCN, M_H_, CalcM_H_, SpRank, SpScore, Ion Proportion, peptide length), three (M_H_, SpScore, SpRank) were found to contribute insignificantly to the discriminant function and were eliminated. Of the remaining five, XCorr and especially ΔCN were by far the most effective. The initial discriminant function was then used to calculate the probability that each experimental peptide identification was in fact correct. Assignments of T. gondii peptides to experimental spectra classified as having less than 20% probability of being correct in this first pass of the discriminant analysis were relabeled as “incorrect” and a new set of discriminant function coefficients was then computed. This process, which converged rapidly, was iterated until a stable distribution among “correct” and “incorrect” was obtained (typically seven to ten cycles, with very small changes beyond cycle 3). The final discriminant function correctly recognized >99% of the PEP_PROBE assignments of spectra to peptides within the pseudogenome as being incorrect.

To investigate the stability of the discriminant function, cross-validation was performed by splitting the input data (Toxoplasma plus pseudogenome spectra–peptide assignments) into 15 randomly selected subsets each comprising 30% of the total, and 15 separate sets of discriminant function coefficients were then iteratively calculated as above. When applied to the unused 70% of the input data, all of these functions correctly classified >99% of the incorrect peptide–spectra matches. The concordance among the classifications of individual spectra–peptide matches produced using these 15 discriminant functions averaged better than 98%. Ten independent preps of the conoid-enriched fraction and 12 preps of the conoid-depleted fraction were processed and analyzed as described above. […]
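The iterative relabeling procedure described above can be illustrated with a small, self-contained sketch. It is not the STATA analysis used in the protocol: it uses only two of the five retained parameters (XCorr and ΔCN), implements a two-class linear discriminant by hand, and assumes that a hit relabeled “incorrect” stays so in later cycles, which the text does not specify. All function names are hypothetical.

```python
import math


def _mean(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def fit_discriminant(correct, incorrect):
    """Fit a linear discriminant on (XCorr, deltaCN) pairs.

    Returns (w, b) such that sigmoid(w . x + b) estimates P(correct | x).
    """
    m1, m0 = _mean(correct), _mean(incorrect)
    # Pooled within-class covariance matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((correct, m1), (incorrect, m0)):
        for r in rows:
            d = (r[0] - m[0], r[1] - m[1])
            for a in range(2):
                for c in range(2):
                    s[a][c] += d[a] * d[c]
    dof = len(correct) + len(incorrect) - 2
    s = [[v / dof for v in row] for row in s]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = ((m1[0] + m0[0]) / 2, (m1[1] + m0[1]) / 2)
    # Boundary through the class midpoint, shifted by the log prior ratio.
    b = -(w[0] * mid[0] + w[1] * mid[1]) + math.log(len(correct) / len(incorrect))
    return w, b


def prob_correct(model, x):
    """Posterior probability that assignment x is correct under the model."""
    w, b = model
    z = w[0] * x[0] + w[1] * x[1] + b
    z = max(-700.0, min(700.0, z))  # keep math.exp within float range
    return 1.0 / (1.0 + math.exp(-z))


def iterate_discriminant(target_hits, decoy_hits, cutoff=0.20, max_cycles=10):
    """Refit the discriminant, relabeling low-probability target hits each cycle.

    target_hits: (XCorr, deltaCN) pairs matched to the real ORF database.
    decoy_hits:  pairs matched to the random pseudogenome (all incorrect).
    """
    labels = [True] * len(target_hits)  # initially every target hit is "correct"
    model = None
    for _ in range(max_cycles):
        correct = [x for x, ok in zip(target_hits, labels) if ok]
        demoted = [x for x, ok in zip(target_hits, labels) if not ok]
        if len(correct) < 2:
            raise ValueError("too few hits remain labeled correct")
        model = fit_discriminant(correct, list(decoy_hits) + demoted)
        # Assumption: relabeling is one-way (a demoted hit is never restored).
        new_labels = [ok and prob_correct(model, x) >= cutoff
                      for x, ok in zip(target_hits, labels)]
        if new_labels == labels:
            break  # labels are stable, so the coefficients are too
        labels = new_labels
    return model, labels
```

Calling `iterate_discriminant(target_hits, decoy_hits)` returns the final coefficients together with one boolean label per target hit; the fraction of decoy hits with `prob_correct(model, x) < 0.5` then gives a check comparable to the >99% figure reported above. A cross-validation along the lines described would repeat this call on random 30% subsets and compare the resulting labels on the held-out data.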

Pipeline specifications

Software tools: Comet, DTASelect
Application: MS-based untargeted proteomics
Organisms: Toxoplasma gondii, Homo sapiens