Computational protocol: Conserved amino acid markers from past influenza pandemic strains

Protocol publication

[…] A pattern classification approach [] is used with heuristic feature selection [,] to predict the candidate markers. The input is a multiple sequence alignment (built with MUSCLE []) of a collection of influenza genomes in which the 11 proteins of each genome are concatenated. Each position in the alignment is converted to a bit vector of length 21, in which an entry of 1 indicates the presence of one of the 20 amino acids or an insertion symbol. For an input alignment of length x (and a bit vector of length 21 × x), finding all mutation subsets of size n requires checking x choose n combinations, which is time-prohibitive even for small n when x is large. A heuristic exploits the information obtained from the linear support vector machine (LSVM) to reduce x to 60 and limit n to 10. Even this search space (60 choose 10, ~7 × 10^10 combinations) could in theory be too large to process efficiently. In practice, smaller combination sizes were found, so the search space was reduced enough to compute a solution. The LSVM computes a weight for each position in the alignment that reflects its relative influence on the classifier. These weights are used to select the x most heavily weighted mutations from which to consider combinations. A similar approach was used in document classification [], and a related approach was taken to classify 70 antibody light chain proteins []. The LSVM code was developed by modifying the software package LIBSVM [].

The expected classification accuracy is defined as the accuracy of the LSVM with the aligned proteome as input, estimated by 5-fold cross validation. Similar to the approach taken by [] for human-specific markers, sequences in the multiple sequence alignment used to train the classifier were labeled either human or avian according to host. The avian-to-human crossover samples (H5N1, H9N2, H7N7 and H7N3) were excluded from training and testing.
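The encoding and feature-selection steps described above can be illustrated with a minimal Python sketch. It is not the authors' modified LIBSVM code. The alphabet ordering, the use of "-" as the insertion symbol, the all-zero treatment of ambiguous residue codes, ranking by absolute weight, and all function names are assumptions made for illustration.

```python
from math import comb

# Assumed alphabet: the 20 amino acids plus "-" as the insertion symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def encode_sequence(aligned_seq):
    """Encode one aligned sequence as a flat 0/1 list of length 21 * len(aligned_seq)."""
    bits = []
    for residue in aligned_seq:
        column = [0] * len(ALPHABET)
        idx = ALPHABET.find(residue)
        if idx >= 0:  # assumption: ambiguous codes such as "X" stay all-zero
            column[idx] = 1
        bits.extend(column)
    return bits


def top_weighted_features(weights, x):
    """Return the indices of the x features with the largest absolute weight.

    `weights` stands in for the weight vector of a trained linear SVM.
    """
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return ranked[:x]


def feature_to_mutation(feature_index):
    """Map a flat feature index back to (alignment position, residue)."""
    position, offset = divmod(feature_index, len(ALPHABET))
    return position, ALPHABET[offset]


# Size of the reduced search space for x = 60 candidates and subsets of size n = 10.
reduced_search_space = comb(60, 10)
```

Here `reduced_search_space` is on the order of 7 × 10^10, matching the figure quoted in the text, which shows why the search remains expensive even after the reduction to 60 candidates.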
The 2,026 human persistent strains and 1,018 avian strains were grouped by time, location and subtype, and representative samples were chosen at random to yield 281 distinct human strains and 560 distinct avian strains. Classifier accuracy was estimated by randomly dividing the data set into 5 non-overlapping partitions. The classifier was trained on 4 of the partitions, and accuracy was measured as the percentage of correct classifications on the fifth. This percentage was calculated separately for each class to account for the difference in class size. Averaging over the 5 held-out partitions gave two accuracy values (one per class), and the final accuracy measure was the average of these two values. The 34 pandemic conserved markers given in this report were required to be positively identified in every sequenced strain in each of the three pandemic outbreaks, without deviation from the majority consensus. As a result, three markers reported in [] were excluded from this report because they were not conserved or could not be positively identified (an ambiguous sequence code was present) in one of the sequenced strains associated with the pandemic outbreaks.

The host specificity classifier misclassified 2 human and 2 avian strains, for a classification accuracy of 99.5%. The classification errors appeared to be due to recent reassortment events, which suggest the presence of influenza genomes that mix human and avian strains [].

The high mortality rate data set was constructed using the same procedure as the host type data set, and the same 5-fold cross validation procedure was used to estimate accuracy. A total of 111 influenza genomes were classified as high mortality rate strains and 2,001 as low mortality rate strains, with a non-redundant subset taken for training (35 high mortality rate and 255 low mortality rate).
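The class-balanced cross validation estimate used for both data sets can be sketched as follows. This is an illustrative reading of the procedure, not the original implementation. The `fit_predict` callable, the fold assignment scheme, and the function name are hypothetical; any classifier that trains on one set and returns predicted labels for another could be plugged in.

```python
import random


def balanced_cv_accuracy(samples, labels, fit_predict, k=5, seed=0):
    """Estimate accuracy with k non-overlapping partitions, averaged per class.

    fit_predict(train_samples, train_labels, test_samples) is a hypothetical
    callable that returns one predicted label per test sample.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint partitions
    classes = sorted(set(labels))
    per_class = {c: [] for c in classes}

    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predictions = fit_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in held_out],
        )
        # Score each class separately so the larger class does not dominate.
        for c in classes:
            hits = [p == c for p, i in zip(predictions, held_out) if labels[i] == c]
            if hits:
                per_class[c].append(sum(hits) / len(hits))

    # One mean accuracy per class, then the mean of those class accuracies.
    class_means = {c: sum(v) / len(v) for c, v in per_class.items()}
    overall = sum(class_means.values()) / len(class_means)
    return overall, class_means
```

With two classes, `class_means` holds the two per-class accuracy values described in the text and `overall` is their average. A classifier that always predicts the majority class therefore scores 50% under this measure rather than the majority fraction.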
The percentages of high and low mortality rate strains that were correctly classified were 96.2% and 96.9%, respectively (an average of 96.6%). The lower accuracy of the high mortality rate classifier compared with the host type classifier likely reflects the genetic complexity associated with high mortality rate and the influence of other important factors such as host interaction.

Newly generated classifiers that used only a small subset of the aligned proteomes as input were required to match the original classifier accuracy (99.5% for host type and 96.6% for high mortality rate type) within a margin of error defined by a confidence threshold. The confidence thresholds were defined by confidence intervals assuming one-sided t-test comparisons, using the standard deviation in the cross validation tests. Lowering the classification accuracy threshold allowed for the possibility of undetected reassortment events and other potential strain labeling errors (such as host interaction factors) that preclude perfect separation of class types.

The genotype analysis shown in Figures and includes 193 non-human non-avian influenza strains. All data were downloaded from the NCBI influenza whole genome database []. […]
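One plausible way to turn the cross validation results into a confidence threshold is a one-sided lower confidence bound on the mean fold accuracy. The sketch below rests on several assumptions: the text does not give the confidence level or the exact formula, so the 95% level, the critical value for 4 degrees of freedom (5 folds), the standard-error form of the bound, and the function names are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

# Assumed: one-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_ONE_SIDED_95_DF4 = 2.132


def accuracy_threshold(fold_accuracies, t_crit=T_CRIT_ONE_SIDED_95_DF4):
    """Lower one-sided confidence bound on the mean cross validation accuracy."""
    k = len(fold_accuracies)
    standard_error = stdev(fold_accuracies) / sqrt(k)
    return mean(fold_accuracies) - t_crit * standard_error


def matches_reference(candidate_accuracy, fold_accuracies,
                      t_crit=T_CRIT_ONE_SIDED_95_DF4):
    """True if a reduced-feature classifier is within the margin of error."""
    return candidate_accuracy >= accuracy_threshold(fold_accuracies, t_crit)
```

Under these assumptions, a classifier built from a small marker subset would be accepted when its accuracy is at or above the bound computed from the fold accuracies of the full-proteome classifier. A different number of folds would need a different critical value.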

Pipeline specifications

Software tools: MUSCLE, LIBSVM
Applications: Miscellaneous, Nucleotide sequence alignment
Organisms: Sus scrofa, Homo sapiens
Diseases: Infection; Influenza, Human