Computational protocol: Sequential Acquisition of Virulence and Fluoroquinolone Resistance Has Shaped the Evolution of Escherichia coli ST131

Similar protocols

Protocol publication

[…] Quality control (QC) was performed for all raw read data sets. Briefly, raw reads were analyzed using PRINSEQ v0.20.3 () and trimmed with a mean base pair quality score (Q) of ≥20 and a read length of ≥70% of the original read length. Additionally, it was necessary to correct 35 sets of raw read data from data set_2 that had heterogeneous Illumina encoding and/or erroneous paired-end length encoding (see in the supplemental material). QC and assembly metrics for data set_1 have been previously reported by Petty et al. (). Lastly, contaminant searches were performed for each sample using Kraken on a subset of 100,000 randomly chosen reads ().Quality-filtered Illumina paired-end reads were assembled de novo using Velvet v1.2.07 () with k-mer ranges of 45 to 85 for 101-bp reads, 29 to 61 for 76-bp reads, and 29 to 47 for 50-bp reads. An optimal k-mer value for each assembly was selected on the basis of best assembly metrics, including N50 (i.e., 50% of bases are incorporated in contigs of this length or above), number and size of contigs, number and continuity of uncalled bases, and peak coverage. Contigs that were ≥200 bp at an optimal k-mer were then ordered against E. coli EC958 () using Mauve v2.3.1 (). QC and assembly statistics for data set_2 are summarized in the supplemental material.Quality-filtered Illumina reads for data set_1 and data set_2, as well as error-free simulated reads of complete genomes, were mapped on the reference strain EC958 using SHRIMP v2.0 (). Nesoni v0.108 () was used to call and annotate substitution-only SNPs, with a consensus cutoff and majority cutoff of 0.90 and 0.70, respectively. SNPs were also determined in parallel using the reference-free k-mer-based approach developed in kSNP v2.0 (). Default parameters as well as a k-mer value of 19 selected as the optimal value predicted by the kSNP-associated Kchooser script were applied. [...] To avoid distortion of the phylogenetic signal caused by SNPs acquired through recombination, we used the Bayesian clustering algorithm BRATNextGen () to detect recombinant regions among the combined data set. Similar to our previous work (), we used as an input an SNP-based multiple-genome alignment composed of each strain-specific pseudogenome built by integrating the SNPs predicted for each strain to the reference genome of EC958. To help with the identification of underlying clusters of strains, BRATNextGen initially computes a hierarchical clustering tree relative to the proportion of ancestral sequences shared between all strains. A segregation cutoff of 0.12 was then specified to separate each previously identified ST131 clade (clades A, B, and C) and non-ST131 strains into distinct clusters. Recombination was then evaluated within and between each cluster with the convergence approximated using 20 iterations of the learning algorithm. Significance was estimated using 100 permutations with a statistical significance threshold of 0.05. Using the same initial data set, recombination analysis was also carried out using Gubbins (), an independent method of recombination detection. [...] SNPs identified through reference-based mapping for the 188 ST131 strains were used to build phylogenies using maximum likelihood (ML), prior to and after removal of SNPs associated with recombinant regions. Phylogenetic trees were generated with RAxML v7.2.8 () using the general time-reversible (GTR) GAMMA model of among-site rate variation (ASRV), and validated using 1,000 bootstrap repetitions to assess nodal support. Additionally, reference-free k-mer-based phylogenetic trees were constructed using kSNP v2.0 with default parameters () and genome assemblies as an input. A k-mer value of 19 was selected as the optimal value predicted by the kSNP-associated Kchooser script. All trees were then viewed using Figtree v1.4.0 () or EvolView (), and further compared using the Tanglegram algorithm of Dendroscope v3.2.10 (), which generates two rectangular phylograms to allow comparison of bifurcating trees. [...] Preliminary estimation of the underlying temporal signal of our data was obtained by performing a regression analysis between the root-to-tip genetic distance extracted from the recombination-free maximum likelihood tree, the isolation year, and lineage information for each sequence, as implemented in Path-O-Gen v1.4 (). To further investigate the divergence of clade C from clade B, we performed a temporal analysis on the 3,779-bp nonrecombinant SNPs of the 172 clade B and C strains using BEAST 2.0 (), a Bayesian phylogenetic inference software, which can estimate the dating of emergence of distinct lineages. We compared multiple combinations of the molecular clock model (strict, constant relaxed log normal, and exponential relaxed log normal), substitution model (Hasegawa, Kishino, and Yano [HKY] model and GTR), and population size change model (coalescent constant, exponential growth, Bayesian skyline, and extended Bayesian skyline). Markov chain Monte Carlo (MCMC) generations for each analysis were conducted in triplicate for 100 million steps, sampling every 1,000 steps, to ensure convergence. Replicate analyses were then combined with LogCombiner, with a 10% burn-in. The GTR nucleotide substitution model was preferred over the HKY model, and was used with four discrete gamma-distributed rate categories and a default gamma prior distribution of 1. The uncorrelated log normal clock model consistently gave better support based on the Bayes factor and Akaike’s information criterion-based (AICM) analyses, compared to a strict clock model. The Bayesian skyline population tree model was chosen as the best-fitting tree model. Maximum clade credibility (MCC) trees reporting mean values with a posterior probability limit set at 0.5 were then created using TreeAnnotator.In order to adequately investigate the biogeographical history of our ST131 collection, we evaluated potential bias in the geographical origin of strains, which could negatively impact our predictions. Statistical significance of the geographical origin distribution in clade B, C1, and C2 was assessed by chi-square test with Bonferroni correction for multiple comparisons. Overrepresented countries were randomly subsampled down to 15 representative sequences, while countries with fewer than 5 representatives had to be excluded from the analysis (South Korea and Portugal). Overall, we constructed 10 independent randomly subsampled data sets with 85 isolates representing 7 countries, each with 5 to 15 representative sequences. Reconstruction of possible ancestral geographical states was then performed using BEAST 1.8.2 on each subsampled data set. In addition to the previous parameters selected for the temporal analysis, a symmetric substitution model, a Bayesian stochastic search variable selection (BSSVS) model, and a strict clock for discrete locations were chosen for the phylogeographical analysis. MCMC generations were conducted for 100,000,000 steps, sampling every 10,000 steps. MCC trees were then generated using TreeAnnotator for each run with a posterior probability limit set at 0.5. Location posterior probabilities of the most recent common ancestor (MRCA) were then collated for clades B and C and for clade C only. [...] Comparative genomic analyses were conducted using a combination of tools, namely, Artemis, Artemis Comparison Tool (), and Mauve (). Graphical representations showing the presence, absence, or variation of mobile genetic elements (MGEs) or other regions of interest, virulence factor genes, and antibiotic resistance genes were carried out using BLASTn and read-mapping information as implemented in the SeqFindR visualization tool (). Regions of interest previously described in the genome of ST131 reference strain EC958 (, ) and virulence factors, including autotransporters, fimbriae, iron uptake, toxins, UPEC-specific genes, and other virulence genes, were screened in all ST131 strains with SeqFindR using a cutoff of ≥95% nucleotide identity over the whole length compared to the assembly or the consensus generated from mapping. Additionally, the prevalence of antibiotic resistance-associated genes was also investigated using Srst2 () against the ARGannot database, with a minimum depth of 15× read coverage. […]

Pipeline specifications

Software tools PRINSEQ, Kraken, Velvet, Mauve, Nesoni, kSNP, Gubbins, RAxML, FigTree, EvolView, Dendroscope, TempEst, BEAST, ACT, BLASTN, SRST
Applications Phylogenetics, WGS analysis, Nucleotide sequence alignment
Organisms Escherichia coli, Escherichia coli O25b:H4-ST131, Homo sapiens
Diseases Infection
Chemicals Fluoroquinolones