Computational protocol: Pherotype Polymorphism in Streptococcus pneumoniae Has No Obvious Effects on Population Structure and Recombination

Similar protocols

Protocol publication

[…] We obtained 4,089 S. pneumoniae genomes from five publicly available sets (, online): 288 genomes from GenBank, which include 121 genomes of pathogenic strains from Atlanta, Georgia, the United States (); 3,017 genomes of carriage strains from Myanmar refugees in Maela, Thailand (); 616 genomes of carriage strains from Massachusetts, the United States (); 142 genomes of carriage strains from Rotterdam, the Netherlands (; ); and 26 PMEN (Pneumococcal Molecular Epidemiology Network) genomes (; ). These genomes were previously assembled and underwent quality control (). We located comC and comD within these genomes using an iterative DNA reciprocal BLAST search as previously described (). We excluded Rotterdam strain 724 and Maela strain 6983_6#45 from all pherotype analyses because the genomes of both strains carry two complete comC alleles encoding for both Csp-1 and Csp-2. We determined sequence types of genomes using MLSTcheck (, online; ). [...] To reconstruct the phylogenetic relationships of comD, we aligned full-length alleles using MUSCLE 3.8.425 () and Geneious 7.1.5 (). After deleting sites with 5% or more gaps, we inferred the GTR + I+ γ substitution model using jModelTest 2.1.7 (). We used MrBayes 3.2 (; ) for phylogenetic reconstruction across four independent runs; we report clades with posterior probabilities ≥ 0.90 across these runs.ComC in S. pneumoniae consists of a 41-amino acid peptide that is cleaved intracellularly into a 24-amino acid leader sequence and the secreted peptide signal called CSP. The short length of comC prevented us from reconstructing the phylogenetic history with confidence, as the mature CSP sequences were unable to be aligned. To indicate similarity between mature CSP signals, we used MUSCLE 3.8.425 () and Geneious 7.1.5 () to align amino acid variants and produced a UPGMA phylogram of amino acid identity.In order to characterize the overall genetic population structure in S. pneumoniae, we used data from a previously reported full genome phylogenetic tree () based on alignment to strain R6_uid57859 (). Using a single reference strain restricted us to examining genetic diversity only in genes found in strain R6_uid57859; however the scale of computational analyses in this paper prevented us from repeating analyses with a range of diverse reference strains. In summary, we used genome sites found in at least 99.5% of all the above S. pneumoniae genomes (1,444,122 sites) and used the single best maximum likelihood tree [ln(likelihood) of − 3,2145,034.7] starting from 15 unique random trees and from 15 unique parsimonious trees, as calculated by RAxML v8.2.4 () and ExaML v3.0 (). This tree included 82 genomes from pneumococcus Complex 3 () and 240 PMEN-1 genomes (); we excluded these strains from all of our further analyses because they were originally sampled based on membership to clonal complexes and are therefore biased with respect to population structure and pherotype. About 43 Streptococcus sp. viridans genomes were used as an outgroup for this tree, as previously reported ().For interspecific phylogenies, we identified comC and comD sequences from the available NCBI genomes for S. infantis (5 genomes), S. mitis (45 genomes), S. oralis (18 genomes), and S. pseudopneumoniae (40 genomes). The extensive sequence diversity in the mature CSP meant that we lacked confidence in aligning comC. Therefore, we used BAli-Phy v.2.3.6 () for phylogenetic analysis because it estimates both the alignment and the tree simultaneously. We used the HKY+ γ model of substitution and constrained the leader sequence of comC to align together. BAli-Phy then estimated the full comC alignment and tree across eight independent runs using the S07 model of indel mutations.Using MUSCLE 3.8.425 () and Geneious 7.1.5 (), we aligned the full-length comD sequences and deleted sites with more than 5% gaps. For further analysis, we used the filtered polymorphic sites from Gubbins (); any regions in individual sequences with evidence of recombination from Gubbins were replaced with N’s. We used a GTR + γ model of nucleotide substitution as determined by jModelTest 2.1.7 () to reconstruct the phylogeny using MrBayes 3.2 across four independent runs. [...] We utilized three separate approaches to assess whether there was more or less recombination within/between pherotypes than expected, based on the null expectation that recombination rates are proportional to the rates that strains encounter each other. The first two methods are based on pairwise tests between strains using GeneConv 1.81a (), whereas the third is based on results from Gubbins v2.1.0 (). Despite different limitations associated with each approach, it is reassuring that we found broad concordance in their associated output.We used GeneConv 1.81a () to detect recombination events of at least 100 bp between all pairwise comparisons of the 4,089 genomes. Briefly, this program detects continuous sections of DNA in pairwise alignments that have higher identity than the surrounding regions after accounting for monomorphic sites. We analyzed the genomes as “circular,” used Holm–Bonferroni correction for multiple testing (), and used a Gscale constant of 2, which scales to the number of mismatches allowed while detecting recombination events. The number of total recombination events detected by pherotype, by population, and by population/pherotype is reported in in online, respectively.We calculated the expected frequency of within- and between-pherotype recombination by assuming the null hypothesis that pherotype frequencies are indicative of how often pherotypes encounter each other; this encounter frequency is then an opportunity for genetic exchange that is proportional to actual recombination rates. Accordingly, if i represents the frequency of strains producing Csp-1, we expect two Csp-1 strains to encounter each other with frequency i2. To find the proportion of pairwise encounters that occur between strains carrying the same pherotype, we scaled the encounter frequency by the total encounter frequency of all strains, given that one must be a Csp-1 strain: i2+2i1-i(encounter freq. of two Csp-1 strains) + (encounter freq. of a Csp-1 strain with a non-Csp-1 strain)The proportion of pairwise encounters between strains expressing Csp-1 is then: i2i2+2i1-i which reduces to: i22i-i2.We used this estimate to calculate the proportional expectation that recombination that occurs between strains carrying the Csp-1 pherotype. We compared this expected proportion with the observed proportion of within-Csp-1 strain recombination events, in which the number of detected pairwise recombination events between Csp-1 strains was divided by the total number of detected pairwise recombination events involving at least one Csp-1 strain. We repeated these calculations for each pherotype. The same approach was used to examine the expected frequency of between-pherotype recombination, except here the expected encounter rate between pherotypes is 2ij, where i and j represent the frequencies of the focal pherotypes. As above, this leads to an expected proportion of between pherotype recombination as 2ij2i-i2 while focusing on pherotype i and as 2ij2j-j2 while focusing on pherotype j. This reflects that assumption that pherotypes i and j encounter other strains at rates proportional to their respective frequencies.By dividing observed by expected recombination, we could assess if there was more or less recombination than expected by chance: values >1 indicate that there is more recombination than expected by chance, whereas values <1 indicate the opposite. We used the PropCIs package in R () to estimate 95% confidence intervals for this ratio, and we tested if the observed recombination proportion differed significantly from the expected recombination proportion using Pearson’s χ2 statistic using R (). We additionally calculated observed and expected recombination proportions by classifying strains by populations in place of pherotype.To examine within and between pherotype recombination in more detail, we next characterized the unique recombination events occurring between strains carrying the two dominant pherotypes (Csp-1 and Csp-2), which together comprise 95.5% of strains. By highlighting unique recombination events, the aim of this analysis was to remove the potentially biasing influence of vertical transmission, which could cause more ancient recombination events to be overrepresented compared with newer events. In order to identify unique recombination events, we grouped events that shared an identical start and end position in the full genome alignment. Next, we calculated the average proportion of these unique recombination events taking place between strains with Csp-1 and Csp-2 for each group (, online). We examined a range of cut-off points for the minimum proportion of strains that must be involved in each grouped recombination event, including: 0.05% of strains (≥2 strains, 1,663,312 events); 0.1% of strains (≥4 strains, 1,156,428 events); 0.5% of strains (≥20 strains, 381,084 events), 1.0% of strains (≥41 strains, 197,740 events), 2.5% of strains (≥102 strains, 59,980 events), and 5.0% of strains (≥204 strains, 19,428 events); this gradually focused the analysis on increasingly ancient recombination events. The expected proportion of between Csp-1 and Csp-2 recombination events was calculated as the expected encounter frequencies when only considering these two pherotypes, that is: 2* ii+j*j(i+j), where i and j are the frequencies of Csp-1 and Csp-2, respectively.To determine if our results were robust to the method chosen to detect recombination, we estimated within/between pherotype recombination using Gubbins v2.1.0 () to estimate both the phylogeny and recombination break points for each of the 13 populations that contained more than one strain of both Csp-1 and Csp-2 pherotypes; we excluded populations that contained a single pherotype to minimize population-based recombination bias that would unduly influence a single pherotype. We first realigned each strain to the strain in each of the 13 populations with the highest N50 (, online) using Stampy v1.0.23 (). For the four populations aligned to complete, reference genomes, we inferred a single phylogeny per population. For each of the other nine populations, we inferred a phylogeny for each contig of the population’s reference genome. We only used Gubbins recombination events that were inferred to occur on the terminal branch leading to a strain’s sequence, so that only the strain within the population has the recombined sequence. This strain was designated as the recipient of the recombined sequence and we assumed that the strain had not changed pherotype since the recombination event. Gubbins can falsely label all but one strain as recipients of a recombination event in the case of an unrooted tree; we found no such cases in our analyses. To find the closest related strain to the recombination donor strain, we used BLASTn (; ) on the full 4,089 strain genome database using the recombined sequence as the query to find the strains with highest bitscores. After removing the recombination recipient strain, we only considered comparisons in which all strains with the highest bitscore had the same pherotype, to best estimate the pherotype of the recombination donor strain. We used a binomial test using the frequency of the pherotype across all strains, and we used Holm–Bonferroni to correct for multiple testing (). To prevent diluting our power to find differences, we removed all population/pherotype combinations with less than ten comparisons (, online). […]

Pipeline specifications

Software tools MLSTcheck, MUSCLE, Geneious, jModelTest, MrBayes, RAxML, ExaML, BAli-Phy, Gubbins, Stampy, BLASTN
Applications Phylogenetics, WGS analysis, Nucleotide sequence alignment
Organisms Streptococcus pneumoniae