Computational protocol: Genetic characterization of clinical and environmental Vibrio parahaemolyticus from the Northeast USA reveals emerging resident and non-indigenous pathogen lineages

Protocol publication

[…] Phylogenetic analysis was performed on concatenated sequences obtained by PCR amplification of multiple housekeeping loci. The amplicons were generated using Master Taq (5 PRIME, MD, US) and sequenced by the Sanger method at the UNH Hubbard Center for Genome Studies or by Functional Biosciences (WI, US). To infer multi-locus phylogeny, we used seven loci (see Supplementary Figure) from two schemes as previously described (Ellis et al.): three loci adopted to determine relationships broadly among Vibrio spp. (Sawabe et al.) (gyrB, pyrH, and recA) and four loci adopted to closely examine within-species relationships (González-Escalona et al.) (dnaE, dtdS, pntA, and tnaA). Because these four are the only sequenced loci that overlap with those from strains in the public database, the phylogenetic relationships of a larger collection of isolates in this study with those of a global distribution were inferred using only these four loci (Figure). The primer sequences (Supplementary Table) and corresponding cycling parameters were used exactly as in published protocols (Sawabe et al.; González-Escalona et al.; Jolley). For phylogenies inferred from all seven loci, the forward and reverse raw sequences for each of 25 clinical isolates from 2010 to 2012 were assembled, and the contiguous sequences were then aligned and trimmed to match the length of the corresponding sequences of 192 Great Bay Estuary environmental isolates (Ellis et al.), only two of which harbor hemolysin genes. An additional eight isolates collected during 2013 were also included in some analyses. The sequences for individual isolates were then concatenated in alphabetical order of locus name. For phylogenies inferred from four loci (dnaE, dtdS, pntA, and tnaA), each raw sequence was assembled, aligned, and trimmed to match the exact corresponding amplicon sequence from the public database.
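The concatenation step described above can be illustrated with a minimal Python sketch. It assumes that each locus has already been assembled, aligned, and trimmed, and that the data are held as a nested dictionary (locus name to isolate name to sequence). The function name, the data layout, and the toy sequences are hypothetical and are not taken from the published protocol.

```python
def concatenate_loci(aligned_by_locus):
    """Join trimmed per-locus sequences into one sequence per isolate.

    aligned_by_locus: {locus_name: {isolate_name: aligned_sequence}}
    Loci are joined in alphabetical order of locus name (case-insensitive).
    Only isolates present at every locus are returned.
    """
    if not aligned_by_locus:
        raise ValueError("no loci supplied")
    loci = sorted(aligned_by_locus, key=str.lower)

    # After trimming, every sequence of a given locus should have equal length.
    for locus in loci:
        lengths = {len(seq) for seq in aligned_by_locus[locus].values()}
        if len(lengths) > 1:
            raise ValueError(f"sequences for {locus} differ in length: {lengths}")

    # Keep only isolates that were sequenced at all loci.
    shared = set.intersection(*(set(aligned_by_locus[locus]) for locus in loci))
    return {
        isolate: "".join(aligned_by_locus[locus][isolate] for locus in loci)
        for isolate in sorted(shared)
    }


# Toy data (invented): four loci, two isolates, very short "amplicons".
toy = {
    "tnaA": {"isolate_1": "GGA", "isolate_2": "GGC"},
    "dnaE": {"isolate_1": "ATG", "isolate_2": "ATA"},
    "pntA": {"isolate_1": "CC", "isolate_2": "CT"},
    "dtdS": {"isolate_1": "TTTT", "isolate_2": "TTTA"},
}
concatenated = concatenate_loci(toy)
for isolate, sequence in concatenated.items():
    print(isolate, sequence)
```

Fixing the locus order before joining keeps every isolate's concatenated sequence column-compatible, which is what a distance-based tree method requires.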
Neighbor-joining trees for concatenated sequences of either four loci (1868 bp) or seven loci (2988 bp) were constructed with MEGA 5.0 (Tamura et al.) using the Jukes-Cantor model. Statistical support was assessed with 1000 bootstrap replicates.
Comparisons with the published MLST database to identify sequence types (STs) were performed on 12 clinical and 16 environmental isolates for which the sequencing of three additional loci (gyrB, pyrC, and recA) was completed as described (González-Escalona et al.). Raw sequences were assembled, aligned, and trimmed as described above. Allele numbers and ST numbers were determined by matching against the public database. The STs of sequenced strains were determined from raw short-read sequences using the short read sequence typing (SRST2) pipeline (Inouye et al.).
The extent of recombination and mutation within the population was visualized and analyzed by several approaches. The contribution of recombination to phylogeny was evaluated visually using SplitsTree v4 neighbor-net analysis of four loci, and the Phi test module was applied to determine statistical support (Huson and Bryant). The standardized index of association (I_A^S) was determined from a non-redundant allele database for the collection of 90 clinical and 16 environmental strains using the LIAN 3.5 linkage analysis program (Haubold and Hudson). This statistic describes linkage disequilibrium in a multilocus data set, in which a low rate of recombination relative to mutation produces linkage disequilibrium (I_A > 1). The null hypothesis that the variance of the observed data (VD) does not differ from that predicted for a population in linkage equilibrium (Ve), i.e., one experiencing a high rate of recombination relative to mutation, was tested by a non-parametric Monte Carlo simulation, using the 5% critical value (L) to determine significant linkage.
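As an illustration of the linkage statistic, the following minimal Python sketch computes VD, Ve, and the standardized index of association from a set of allelic profiles, and approximates the Monte Carlo test by shuffling alleles within each locus. It is not the LIAN 3.5 implementation. The formulas used (population variance of pairwise mismatch counts for VD; Ve as the sum of h(1 - h) over loci with the unbiased gene diversity h; I_A^S = (VD/Ve - 1)/(l - 1)) are assumed here to follow the usual definitions of this statistic, and all function names, the p-value convention, and the toy profiles are hypothetical.

```python
import itertools
import random
from collections import Counter


def observed_variance(profiles):
    """VD: variance of the number of loci at which each pair of profiles differs."""
    distances = [
        sum(a != b for a, b in zip(p, q))
        for p, q in itertools.combinations(profiles, 2)
    ]
    mean = sum(distances) / len(distances)
    return sum((d - mean) ** 2 for d in distances) / len(distances)


def expected_variance(profiles):
    """Ve: variance expected under linkage equilibrium, sum of h_j * (1 - h_j)."""
    n = len(profiles)
    total = 0.0
    for locus_alleles in zip(*profiles):
        counts = Counter(locus_alleles)
        # Unbiased gene diversity at this locus (assumed definition).
        h = (n / (n - 1)) * (1 - sum((c / n) ** 2 for c in counts.values()))
        total += h * (1 - h)
    return total


def standardized_index_of_association(profiles):
    """I_A^S = (VD / Ve - 1) / (number of loci - 1)."""
    n_loci = len(profiles[0])
    ve = expected_variance(profiles)
    if ve == 0:
        raise ValueError("Ve is zero; the statistic is undefined for these profiles")
    return (observed_variance(profiles) / ve - 1) / (n_loci - 1)


def monte_carlo_p_value(profiles, n_resamples=1000, seed=0):
    """Fraction of within-locus shuffles whose VD is at least the observed VD."""
    rng = random.Random(seed)
    vd_observed = observed_variance(profiles)
    columns = [list(col) for col in zip(*profiles)]
    at_least_as_large = 0
    for _ in range(n_resamples):
        shuffled = []
        for col in columns:
            col = col[:]
            rng.shuffle(col)  # breaks associations between loci, keeps allele counts
            shuffled.append(col)
        if observed_variance(list(zip(*shuffled))) >= vd_observed:
            at_least_as_large += 1
    # The +1 terms are a common convention, not necessarily what LIAN reports.
    return (at_least_as_large + 1) / (n_resamples + 1)


# Toy allelic profiles (invented): three loci, alleles completely linked.
toy_profiles = [(1, 1, 1), (1, 1, 1), (2, 2, 2), (2, 2, 2)]
ias = standardized_index_of_association(toy_profiles)
p_value = monte_carlo_p_value(toy_profiles, n_resamples=200)
print(ias, p_value)
```

Shuffling alleles within each locus preserves allele frequencies while removing associations between loci, so the resampled VD values approximate the distribution expected under linkage equilibrium.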
ClonalFrame 1.1 was used to determine the relative influence of recombination compared to mutation (r/m) on nucleotide variation (Didelot and Falush). [...] Representative strains within the species V. parahaemolyticus were selected from among the 25 NCBI genome groups (defined as such by ~90% genome identity) in the NCBI genomes phylogeny that had accompanying information on geographic isolation, year, and sample source (environmental or clinical, including wound, stool, and ear). The raw sequences from MAVP-E, MAVP-26, MAVP-36, MAVP-45, MAVP-V, MAVP-M, and CT4287 (see Table and Supplementary Table for a description of these isolates) were processed and de novo assembled using the A5 pipeline (Tritt et al.). The assembled contigs of all isolates were analyzed using REALPHY v. 1.09 (Bertels et al.). Sequences were analyzed in three separate alignments, each with a different reference strain: 10290 (GCA_000454205.1), BB22OP (NC_019955.1, NC_019971.1), and RIMD 2210633 (NC_004605.1, NC_004603.1) for phylogenies across a broad distribution of strains; and 10290, 10329 (NZ_AFBW01000001.1 - 33.1), and 10296 (GCA_000500105.1) for analysis of strains within the ST36 clonal complex clade. From these alignments, multiple alignment positions were extracted and then merged into a single alignment. Phylogenies were reconstructed using the maximum likelihood method in PhyML, with a GTR substitution matrix and a gamma-distributed rate heterogeneity model (Guindon et al.). Phylogenies were visualized as trees using FigTree 1.4.2 (Rambaut). Branch lengths reflect the number of nucleotide changes divided by the total number of nucleotides in the sequence. […]
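The extraction and merging of alignment positions from several reference-based alignments can be sketched as follows. This is a simplified illustration and not REALPHY's implementation. It assumes that each alignment is a dictionary of equal-length sequences keyed by strain name and that a position is kept only when every strain has an unambiguous base (A, C, G, or T) there; the function names, that filtering rule, and the toy alignments are assumptions made for the example.

```python
def extract_positions(alignment):
    """Keep alignment columns in which every strain has an unambiguous base.

    alignment: {strain_name: aligned_sequence}, all sequences of equal length.
    Returns {strain_name: string built from the retained columns}.
    """
    names = sorted(alignment)
    sequences = [alignment[name].upper() for name in names]
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("aligned sequences must have equal length")

    kept_columns = [
        column for column in zip(*sequences) if set(column) <= set("ACGT")
    ]
    return {
        name: "".join(column[i] for column in kept_columns)
        for i, name in enumerate(names)
    }


def merge_alignments(alignments):
    """Concatenate the retained positions of several alignments, strain by strain."""
    # Only strains present in every alignment can be merged.
    names = sorted(set.intersection(*(set(aln) for aln in alignments)))
    merged = {name: "" for name in names}
    for aln in alignments:
        extracted = extract_positions({name: aln[name] for name in names})
        for name in names:
            merged[name] += extracted[name]
    return merged


# Toy alignments (invented), standing in for alignments to two different references.
alignment_ref_a = {"s1": "ACG-T", "s2": "ACGAT", "s3": "ATGAT"}
alignment_ref_b = {"s1": "GGN", "s2": "GGA", "s3": "GCA"}
merged = merge_alignments([alignment_ref_a, alignment_ref_b])
for name, sequence in merged.items():
    print(name, sequence)
```

The merged sequences all have the same length, so they can be written out as a single alignment and passed to a tree-building program.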

Pipeline specifications

Software tools: MEGA, SRST, SplitsTree, LIAN, ClonalFrame, REALPHY, PhyML, FigTree
Databases: PubMLST
Applications: Phylogenetics, WGS analysis
Organisms: Vibrio parahaemolyticus
Diseases: Infection, Stomach Neoplasms