Computational protocol: Immunoinformatics Comes of Age

Protocol publication

[…] The most thoroughly studied step of T cell epitope generation is peptide binding to MHC molecules, and the Web-based databases that include peptide-MHC data enable binding predictions. The MHCPEP database [], for example, contains 13,000 MHC-binding peptides. Each entry contains the peptide sequence, its MHC specificity and, when available, experimental methods, observed activity, binding affinity, source protein, anchor positions, and references. This database, however, has been static since 1998. MHCBN [] includes 18,790 MHC-binding peptides, 3,227 MHC-nonbinding peptides, 1,053 TAP binders and nonbinders, and 6,548 T cell epitopes. A beta version of the new Immune Epitope Database and Analysis Resource (IEDB), which will focus on epitopes in potential bioterrorism agents or emerging infectious diseases, has recently come online []. More databases are available, and some are discussed below together with relevant prediction tools.

Peptide-MHC binding is the most predictable aspect of T cell epitope generation. MHC class I and class II genes are highly polymorphic, and the majority of their variable positions are located in binding pockets that restrict peptide interactions to those with particular amino acids at characteristic positions (); the sets of amino acids that are well tolerated in these binding pockets are called anchor motifs. The search for epitopes in full-length proteins or within the context of a reactive peptide can be narrowed through a search for MHC-appropriate anchor motifs. Primary HLA class I anchor positions are generally located at the C terminus and a middle position of a peptide; as optimal epitope lengths vary between 8 and 12 amino acids, the spacing between these two positions varies [,]. The first MHC allele-specific motifs were defined for murine class II molecules [].
Tracking anchor motif patterns alone was soon found to be of limited predictive value [], while including more extensive binding patterns using quantitative matrices representing the frequency and weight of every amino acid in every position enabled the prediction of epitope locations in protein sequences with somewhat greater [,–], although still limited [], accuracy.

For many MHC alleles, both simple and extended motifs are characterized and used to predict potential epitopes. For example, the SYFPEITHI database [] contains extensive information on MHC class I and class II anchor motifs and binding specificity, and includes more than 4,500 entries of MHC proteins and aligned sequences of their epitopes and natural ligands, with source proteins, organisms, and publication references for each peptide. The SYFPEITHI epitope prediction server [] uses a frequency-based scoring system for every amino acid position within a peptide. By examining the aligned peptides known to bind the HLA molecules, users of the SYFPEITHI database can appreciate the relative level of conservation of anchor motifs, as well as the number of peptides that bind despite imperfect motifs.

The Los Alamos HIV/HCV databases offer a simple tool (MotifScan) for identifying HLA anchor-binding motifs in query proteins, highlighting them on a protein or protein alignment [,]. This tool is based on motif libraries included at the SYFPEITHI site, assembled by S. Marsh and colleagues [,], and motifs extracted from the primary literature. The more sophisticated MHC-peptide binding prediction approaches have generally been applied to limited numbers of MHC proteins, so MotifScan provides a more comprehensive, but less reliable, exploration of potential HLA-binding peptides. The input protein sequences can be automatically uploaded from predefined sets of HIV or HCV proteins, or the user can input any protein sequence or sequence alignment.
MotifScan is taken one step further for HIV and HCV through the Epitope Location Finder (ELF) [], where HLA anchor motifs are mapped onto proteins or peptides in conjunction with known epitopes taken from extensive database listings of class I HIV and HCV T cell epitopes and their presenting HLAs [,]. Currently the HIV CD8+ T cell epitope database contains 3,150 entries describing 1,600 distinct MHC class I–epitope combinations (a single epitope can have multiple entries); the HCV database contains 510 entries describing 250 distinct MHC class I–epitope combinations. These databases include detailed biological information regarding the response to the epitope, including its impact on long-term survival, common escape mutations, and whether an epitope is recognized in early infection; links to the primary literature; and curated alignments summarizing the epitope's global variability.

A central assumption of the traditional prediction methods based on motif frequencies is that each position contributes independently to binding. Interactions at one site, however, can affect interactions at another site [,]. Statistical classifiers such as Hidden Markov Models have better success rates at MHC-binding predictions, and machine learning methods such as artificial neural networks and support vector machines can recognize nonlinear sequence-dependent correlated effects in MHC binding. Machine learning methods as well as statistical methods are also useful for defining characteristic sequences related to TAP binding, and for addressing the complexity of proteasome cleavage [–]. These methods, however, require large numbers of well-characterized peptides as training sets []. One comparative analysis suggested that motifs gave the most accurate MHC-binding predictions with limited data, but that as the amount of data increases, machine learning methods become more reliable predictors []. In another comparative study, a support vector machine outperformed other methods [].
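The two traditional approaches discussed above, anchor motif matching and quantitative matrix scoring, can be illustrated with a minimal Python sketch. The motif, the weights, and all function names below are hypothetical placeholders chosen for illustration; they are not taken from SYFPEITHI, MotifScan, or any published matrix. Note that the matrix score is a plain sum over positions, which is exactly the independence assumption described above.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical anchor motif for 9-mers (0-based positions): residues allowed
# at position 1 (P2) and position 8 (the C terminus). Illustrative only.
ANCHORS: Dict[int, str] = {1: "LM", 8: "VL"}

# Hypothetical position-specific weights; any residue not listed scores 0.
WEIGHTS: Dict[int, Dict[str, float]] = {
    1: {"L": 3.0, "M": 2.0},
    2: {"A": 0.5},
    8: {"V": 3.0, "L": 2.5},
}


def windows(protein: str, length: int = 9) -> Iterator[Tuple[int, str]]:
    """Yield (start, peptide) for every window of the given length."""
    for start in range(len(protein) - length + 1):
        yield start, protein[start:start + length]


def matches_motif(peptide: str) -> bool:
    """True if every anchor position holds one of its allowed residues."""
    return all(peptide[pos] in allowed for pos, allowed in ANCHORS.items())


def matrix_score(peptide: str) -> float:
    """Additive score: each position contributes independently."""
    return sum(WEIGHTS.get(pos, {}).get(res, 0.0)
               for pos, res in enumerate(peptide))


def scan(protein: str, length: int = 9) -> List[Tuple[int, str, bool, float]]:
    """Return (start, peptide, motif_match, score), best score first."""
    hits = [(s, p, matches_motif(p), matrix_score(p))
            for s, p in windows(protein, length)]
    return sorted(hits, key=lambda h: h[3], reverse=True)


if __name__ == "__main__":
    for start, peptide, is_match, score in scan("SLAKKKKKVGGG"):
        print(start, peptide, is_match, score)
```

A motif match is a yes/no filter, whereas the matrix score ranks every window, including peptides with imperfect motifs. Capturing interactions between positions would require terms that this additive form cannot express, which is where the machine learning methods come in.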
Both motif-based and machine learning methods for prediction of different steps of T cell epitope generation are available () [], often offered in combination with databases of MHC-ligand interactions (). Below we discuss some of the Web sites that are particularly helpful for T cell epitope prediction, many of which incorporate all three elements: immunoproteasome cleavage, TAP binding, and MHC binding.

The Edward Jenner Institute for Vaccine Research maintains the AntiJen database, which contains quantitative experimental binding data for peptides that bind to MHC, TAP, TCR-MHC complexes, T cell epitopes, and B cell epitopes; it also offers data on immunological protein-protein interactions. It includes more than 24,000 entries. The MHCPred [,] tool predicts the energetics of protein-ligand interactions related to the free energy of binding, and takes into account individual amino acids and contributions from side chain-side chain interactions, allowing peptide-MHC and peptide-TAP binding predictions. This site also allows the prediction of high-affinity peptides by comparing the predicted binding affinities of the original and the mutated peptides. PREDEPP [,] relies on the structural conservation and interactions observed in crystal structures of peptide-MHC complexes. A peptide's compatibility for binding is evaluated statistically by pairwise potentials. The Web site also predicts proteasomal cleavage sites [].

The BIMAS tool [,] ranks potential peptides by their predicted half-time of dissociation from HLA class I molecules, using coefficient tables deduced from the published literature. The Max Planck Institute for Infection Biology offers MAPPP software [] that combines either BIMAS or SYFPEITHI MHC-binding prediction with the proteasome cleavage software FRAGPREDICT []. FRAGPREDICT predicts potential proteasomal cleavage sites based on a combination of two algorithms.
A statistical analysis of cleavage-determining amino acid patterns is performed [], followed by predictions of major proteolytic fragments based on a kinetic model of the 20S proteasome describing the time-dependent digestion of smaller (up to 40 residues long) peptide substrates [].

The following three suites of tools allow MHC class I epitope prediction through a combination of cleavage prediction, TAP binding, and MHC binding. The Center for Biological Sequence Analysis offers the NetChop tool [,] for predicting proteasomal or immunoproteasomal cleavage using a nonlinear neural network, trained on in vitro experimental cleavage data or MHC class I ligand data, respectively. NetMHC [–] predicts binding of peptides to HLA supertypes (groups of HLA proteins that are likely to cross-present epitopes because of similarity in allowed binding motifs) or to 120 individual HLA alleles, using artificial neural networks. NetCTL [,] predicts epitopes by combining predictions of peptide-HLA-supertype binding (NetMHC), proteasomal C-terminal cleavage (NetChop), and TAP transport efficiency using a weight-matrix-based method []. The Bioinformatics Centre, Institute of Microbial Technology, has also developed a suite of servers [,–,,] designed for predicting immunologically interesting features in antigen sequences. ProPred1 and ProPred, along with a series of related programs using different strategies, predict specific MHC-binding peptides in proteins [,]. Promiscuous binders can be predicted using a support vector machine by MHC2Pred for MHC class II, or quantitative matrices by MMBPred for MHC class I []. Pcleavage uses a support vector machine to predict proteasomal cleavage based on in vitro data, or immunoproteasomal cleavage based on MHC class I ligand data []. TAPPred predicts binding to TAP []. CTLpred predicts CTL epitopes in an antigen sequence by combining the processing and binding prediction methods [].
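The idea of integrating the three processing steps into one ranking can be sketched as follows. This is only an illustration under stated assumptions: the weighted sum, the weight values, and the names `StepScores`, `combined_score`, and `rank` are hypothetical, and the per-step scores are assumed to have been obtained beforehand from separate predictors. It does not reproduce the scoring actually used by NetCTL, CTLpred, or any other tool named above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class StepScores:
    """Per-peptide scores, assumed to come from three separate predictors."""
    peptide: str
    mhc_binding: float   # higher = stronger predicted MHC class I binding
    cleavage: float      # higher = more likely C-terminal proteasomal cleavage
    tap: float           # higher = more efficient predicted TAP transport


def combined_score(s: StepScores,
                   w_cleavage: float = 0.1,
                   w_tap: float = 0.05) -> float:
    """Weighted sum of the three steps (weights are placeholders)."""
    return s.mhc_binding + w_cleavage * s.cleavage + w_tap * s.tap


def rank(candidates: Iterable[StepScores], **weights: float) -> List[StepScores]:
    """Sort candidates from highest to lowest combined score."""
    return sorted(candidates,
                  key=lambda s: combined_score(s, **weights),
                  reverse=True)


if __name__ == "__main__":
    candidates = [
        StepScores("PEPTIDEAA", mhc_binding=0.80, cleavage=0.2, tap=0.0),
        StepScores("PEPTIDEBB", mhc_binding=0.75, cleavage=0.9, tap=1.0),
    ]
    for s in rank(candidates):
        print(s.peptide, round(combined_score(s), 3))
```

With these placeholder weights, the second peptide overtakes the first despite weaker predicted MHC binding, because its processing scores are higher. Setting both weights to zero recovers a ranking by MHC binding alone.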
IEDB also offers a suite of tools for T cell epitope prediction. Their peptide-MHC class I binding prediction tool offers a choice of an artificial neural network, average relative binding [], or a stabilized matrix method []. The IEDB team is currently comparing the accuracy of these methods. These three methods also use the average binding method for the prediction of MHC class II peptide binding []. Their MHC class I-peptide binding prediction can be combined with immunoproteasome cleavage [] and TAP transport predictions [] to predict MHC class I epitopes.

Many of the sites listed are convenient for large-scale calculations. Some, for example SYFPEITHI and MHCPred, allow one to incorporate multiple HLA alleles for epitope prediction, while others, such as NetChop, NetMHC, NetCTL, FRAGPREDICT, and the IEDB tools, allow one to upload protein alignments. MotifScan, MAPPP, and the ProPred series allow both. These methods are currently being applied to peptide vaccine design and can be used to identify epitopes that have the desirable properties of promiscuous presentation by many HLAs and relative conservation [,,]. We have recently taken a very different approach to T cell vaccine design and developed a computational method for designing polyvalent protein cocktails that provide maximum peptide coverage (where peptides are set to a user-specified length, for example nine amino acids) in a population of diverse proteins []. The mosaic proteins we create resemble real proteins, as they are assembled using a genetic algorithm by in silico homologous recombination of natural strains, and sets of mosaics are created by optimizing their combined population coverage. While no Web interface has yet been built for this code, the two related programs are freely available.
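The peptide coverage criterion mentioned above can be made concrete with a short Python sketch. It assumes one particular definition, namely the fraction of all length-k peptides in the population sequences (counted with multiplicity) that have an exact match somewhere in the vaccine cocktail; the definition used by the actual programs may differ, and the function names are hypothetical. The genetic algorithm that assembles the mosaics is not shown.

```python
from typing import Iterable, List


def kmers(seq: str, k: int = 9) -> List[str]:
    """All overlapping peptides of length k in a protein sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def coverage(cocktail: Iterable[str],
             population: Iterable[str],
             k: int = 9) -> float:
    """Fraction of population k-mers exactly matched by the cocktail.

    Assumption for this sketch: every k-mer occurrence in every population
    sequence counts once, and only perfect matches count as covered.
    """
    vaccine_kmers = {p for seq in cocktail for p in kmers(seq, k)}
    total = 0
    covered = 0
    for seq in population:
        for peptide in kmers(seq, k):
            total += 1
            if peptide in vaccine_kmers:
                covered += 1
    return covered / total if total else 0.0


if __name__ == "__main__":
    population = ["ACDEF", "ACDWF"]   # toy sequences, k = 3 for readability
    print(coverage(["ACDEF"], population, k=3))
    print(coverage(["ACDEF", "ACDWF"], population, k=3))
```

A function of this kind can serve both purposes described in the text: scoring an existing strain or combination of strains against a population, and acting as the objective that a design procedure tries to maximize.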
One program enables an exploration of the peptide coverage in any set of natural proteins by a prototype vaccine strain or combinations of strains, while the other designs sets of mosaic proteins for a polyvalent vaccine that will maximize population coverage. These tools could be applied to any variable pathogen for vaccine design, or used to design sets of reagents to probe the immune response. [...] The conformational aspects of antibody binding complicate the problem of B cell epitope prediction, making it less tractable than T cell epitope prediction. Indeed, Blythe and Flower [] recently undertook an exhaustive assessment of amino acid propensity scales using the AntiJen B cell epitope database, and even the best combinations performed only marginally better than random []. If one wishes to explore antigenic propensity using traditional methods, however, IEDB provides tools for predicting five features that have been proposed to relate to B cell antigenicity, including beta turn prediction [], surface accessibility [], flexibility [], and hydrophilicity []; it also includes an antigenicity predictor based on amino acid frequencies in antigenic domains and chemistry []. An alternative strategy for predicting linear B cell epitopes, ABCpred, uses a neural network trained and tested on the BCIPEP B cell epitope database [].

Although antibody epitope prediction is difficult, many other antibody-specific resources are available on the Web (). If the variable region sequence of a monoclonal antibody is obtained, ABcheck [] enables a rapid crosscheck against the Kabat antibody database to identify unusual residues that might be sequencing artifacts. (As a historical aside, the Kabat database was an early immunological database compiled to provide researchers with a comprehensive comparison of antibody sequences.
It was available as a book long before the Internet enabled Web-searchable molecular databases, at a time when GenBank, a resource that originated at Los Alamos National Laboratory, was still in its early, groundbreaking stages. GenBank eventually moved to the National Library of Medicine. Similarly, the Los Alamos HIV database, the first pathogen-specific sequence database, was initially available only as a book of aligned viral sequences.) The sequence could then be submitted to DNAPLOT, alignment software that enables rearranged V genes to be reliably assigned to their closest V, D, and J segment germline counterparts. The most comprehensive data for crystallographic structures can be found at the molecular modeling database (MMDB) [], summaries of antibody crystal structures are maintained at SACS [], and both structures and alignments are available through the antibody group (ABG). The ImMunoGeneTics (IMGT) database provides annotated listings and alignments of both immunoglobulins and TCR binding regions [,].

We maintain comprehensive Web-searchable databases of pathogen-specific HIV [] and HCV antibodies []. These are listings of monoclonal and polyclonal responses to the proteomes of these pathogens, including information regarding epitope location and variation, escape mutations, structure, biological impact of antibody responses, keywords, and links to PubMed. The HIV database currently contains 1,273, and the HCV database 120, unique antibody entries. Antibody entries are associated with multiple publications; for some of the more intensively studied HIV neutralizing monoclonal antibodies, more than 130 papers are cited, each with a brief summary of what was learned about the specific antibody in that paper. It is difficult to track a given monoclonal antibody in the literature by other means, because many antibodies are often used in a single study and so are not named in the abstract.
To compound the problem, the name of a monoclonal antibody often “mutates” as it is exchanged between different labs, so it is not readily searchable by traditional means. […]

Pipeline specifications

Software tools MHCPred, MAPPP, NetChop, NetMHC, NetCTL, ProPred1, MHC2Pred, MMBPred, Pcleavage, TAPPred, CTLPred, ARB, ProPred, ABCpred, DNAPLOT
Databases MMDB, SYFPEITHI
Application Immune system analysis
Diseases Autoimmune Diseases