Computational protocol: Lineage structure of Streptococcus pneumoniae may be driven by immune selection on the groEL heat-shock protein

Similar protocols

Protocol publication

[…] We used a dataset sequenced by Croucher et al., comprising 616 carriage S. pneumonaie genomes isolated in 2001, 2004 and 2007 from Massachusetts (USA). The data included 133, 203, 280 samples from 2001, 2004, 2007, respectively; and is stratified into 16 samples of serotype 10A, 50 of 11A, 7 of 14, 24 of 15A, 60 of 15BC, 8 of 16F, 5 of 17F, 6 of 18C, 73 of 19A, 33 of 19F, 1 of 21, 21 of 22F, 33 of 23A, 23 of 23B, 17 of 23F, 11 of 3, 4 of 31, 5 of 33F, 6 of 34, 49 of 35B, 18 of 35F, 2 of 37, 9 of 38, 47 of 6A, 17 of 6B, 33 of 6C, 3 of 7C, 11 of 7F, 4 of 9N, 6 of 9V and 14 of NT (see ref. for collection details). Sequence reads were taken from the project ERP000889 on the European Nucleotide Archive (http://www.ebi.ac.uk/) and assembled using an automated pipeline with the Velvet algorithm. In summary, we performed a whole genome multi-locus sequence typing (wgMLST) allelic notation using the BIGSdb software with an automated BLAST process and the Genome Comparator tool (with ATCC700669 serotype 23F, accession number FM211187, as the reference genome). This wgMLST approach resulted in the identification of 2135 genes in common between the reference and all the samples in the dataset. Alleles identical to the reference were classified as ‘1’, with subsequent sequences, differing at least by one base, labelled in increasing order. Genes were further classified as allele ‘0’ when genetic data present had no match to the genome of interest, or were found to be truncated or non-coding. For a visual representation of the allelic annotation and diversity please refer to S1 dataset of Watkins et al.. Functional characterization of genes and gene families was done by literature search and access to the Kyotto Encyclopedia of Genes and Genomes (KEGG) database (www.genome.jpkeggpathway.html).The allelic matrix as obtained by this approach and used in the RFA analysis (see below) is herein made available in supplementary Table , which also includes the Accession Numbers, gene name, gene product, gene position in reference genome, and year of collection, Sequence Cluster and serotype of each sample. [...] We implement a machine learning approach based on a Random Forest Algorithm (RFA) to predict particular features (serotype or Sequence Cluster) of each pneumococci isolate from information on the wgMLST allelic composition of the 2135 genes. In summary, the RFA process takes the following pseudo-steps: (I) the response variable and predictor variables are chosen by the user; (II) a predefined number of independent bootstrap samples are drawn from the dataset with replacement, and a classification tree is fit to each sample containing roughly 2/3 of the data, for which predictor variable selection on each node split in the tree is conducted using only a small random subset of predictor variables; (III) the complete set of trees, one for each bootstrap sample, composes the random forest (RF), from which the status (classification) of the response variable is predicted as an average (majority vote) of the predictions of all trees. Compared to single classification trees, RFA increases prediction accuracy, since the ensemble of slight different classification results adjusts for the instability of the individual trees and avoids data overfitting.Here we use randomForest: Breiman and Cutler’s Random Forests for Classification and Regression, a software package for the R-statistical environment. Predictor variables are set to be each gene in our genome samples and the response variable is set to the serotype or Sequence Cluster classification of each genome (as per ref. ). We use the Mean Decrease Accuracy (MDA), or Breiman-Cutler importance, as a measure of predictor variable importance, for which classification accuracy after data permutation of a predictor variable is subtracted from the accuracy without permutation, and averaged over all trees in the RF to give an importance value. The strategy herein employed is not of quantitative nature, as the absolute scale of scores produced by the RFA is dependent on the dataset being analyzed. Instead, we focus on the 2.5% of top RFA scores as presented by the resulting MDA distribution for all genes, thus selecting the subset of genes which present combinations of alleles that appear statistically more informative than expected at the genome level (i.e. we assume that 95% of the scores should fall between the 2.5th and 97.5th percentiles). With this assumption and the approach detailed below, we effectively select the genes which present a p-value <0.05 given an intrinsic distribution of scores generated by data permutation (a null distribution of scores).For the results presented in the main text, we assume the predictor variables to be numerical (as opposed to categorical). This assumption is known to introduce RF biases, as classification is effectively made by regression and artificial correlations between allele numbering and the features being selected (serotype and Sequence Cluster) may be present. The assumption is herein necessary since the RFA R-based implementation (version 3.6.12) has an upper limit of 53 categories per predictor variable and we find some genes to present up to 6 times this limit in allele diversity. The categorical constraint is a common feature of RFA implementations, as predictor variables with N categories imply 2N possible (binary) combinations for an internal node split, making the RFA method computationally impractical. Given this inherent RFA limitation, we implemented an input randomization strategy (random reassignment of values to alleles) to minimize potential bias. For this, M random permutations of each gene’s variant allelic numbering in the original dataset is performed, effectively creating M independent input matrices. The RFA is run over the input matrices and in the main results we present each gene’s average MDA score. Sensitivity analyses were performed by comparing RFA results between two independent sets of M = 50 input matrices (effectively comparing 100 independent runs) (Figs  and ). Results suggest that the existing biases in independent runs of the RFA due to the assumption of numerical predictors are virtually mitigated with our input randomization strategy approach, specially for the experiments presented in the main results (i.e. using a 1% cutoff of gene mismatches, Fig. ). […]

Pipeline specifications

Software tools Velvet, BIGSdb, randomforest
Databases ENA KEGG
Applications Miscellaneous, De novo sequencing analysis
Organisms Streptococcus pneumoniae