Computational protocol: Genomic Analysis of Salmonella enterica Serovar Typhimurium DT160 Associated with a 14-Year Outbreak, New Zealand, 1998–2012

Protocol publication

[…] Each isolate’s genome was assembled de novo. We used an in-house Perl script to trim reads at an error probability of 0.01 and to generate random subsets of 750,000 to 1.2 million paired reads, in increments of 150,000, thereby varying the average coverage. We assembled each random subset with the de novo assembler Velvet version 1.1 () at a range of k-mer lengths (55 to 245, in increments of 10). De novo assembly therefore produced multiple genome assemblies for each isolate. We ranked each assembly on 4 metrics (longest genome length, fewest contigs, largest N50 value, and longest contig length) and summed the 4 equally weighted ranks into an overall score for each assembly. We used the assemblies with the lowest total rank for further analyses. We used QUAST (), a quality assessment tool for evaluating and comparing genome assemblies, to analyze the DT160 de novo assemblies and determine their GC content (i.e., the percentage of a DNA sequence made up of guanine and cytosine bases). [...] We used Snippy version 2.6 () and kSNP version 3.0 () to identify core single-nucleotide polymorphisms (SNPs). Snippy is a pipeline that uses the Burrows-Wheeler Aligner () and SAMtools version 1.3.1 () to align reads from different isolates to a reference sequence and FreeBayes () to identify variants among the alignments. We used kSNP to analyze the de novo assembled genomes, along with the reference genome, S. enterica serovar Typhimurium 14028S (GenBank accession no. NC_016856). We used an in-house Python script to determine the read coverage of all SNPs identified via kSNP. We used Snippy to align reads from each isolate to the reference genome (GenBank accession no. NC_016856) before identifying SNPs. SNPs were accepted if they had a read depth >10 and a consensus >90% in each isolate. The position of each SNP on the reference genome was used to determine whether both methods identified the SNP or whether it was unique to one method ().
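The rank-and-sum assembly selection described above can be sketched as follows (a minimal illustration; the metric values and dictionary keys are hypothetical, and the original step was an in-house Perl script over Velvet outputs):

```python
# Sketch of the assembly-ranking step: rank each assembly on 4 metrics,
# sum the equally weighted ranks, and keep the assembly with the lowest total.
def rank_assemblies(assemblies):
    """assemblies: list of dicts with hypothetical keys
    'genome_length', 'num_contigs', 'n50', 'longest_contig'.
    Returns (scores, index_of_best)."""
    def ranks(values, reverse):
        # Rank 1 = best; reverse=True means larger is better.
        order = sorted(range(len(values)), key=lambda i: values[i],
                       reverse=reverse)
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    length_r = ranks([a["genome_length"] for a in assemblies], reverse=True)
    contig_r = ranks([a["num_contigs"] for a in assemblies], reverse=False)
    n50_r = ranks([a["n50"] for a in assemblies], reverse=True)
    longest_r = ranks([a["longest_contig"] for a in assemblies], reverse=True)

    scores = [sum(t) for t in zip(length_r, contig_r, n50_r, longest_r)]
    best = min(range(len(assemblies)), key=lambda i: scores[i])
    return scores, best
```

In practice the metrics would be parsed from the QUAST report for each Velvet assembly, and the selection repeated per isolate.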
This method identified 793 core SNPs shared by the 109 New Zealand DT160 isolates. [...] We used RAxML version 8.2.4 () to construct a maximum-likelihood tree based on the 793 core SNPs of the 109 DT160 isolates; we used EvolView version 2 () to visualize and edit the tree. We used SplitsTree () to form a NeighborNet tree of the 109 New Zealand DT160 isolates based on the 793 core SNPs that they share and to compare the New Zealand and UK isolates based on the 1,521 core SNPs that they share. We used MEGA6 () and the maximum composite likelihood model () to estimate the pairwise distances between the 109 New Zealand DT160 isolates, based on the 793 core SNPs they share, and between the 109 New Zealand and 2 UK isolates, based on the 1,521 core SNPs that they share. [...] We used an in-house Perl script to split the 793 SNPs into 5 groups: those at first, second, or third codon positions; those within overlapping coding regions; and those in intergenic regions. We also used the in-house Perl script to determine whether the SNPs were synonymous or nonsynonymous. We then exported the partitioned SNPs into BEAUti to create an XML file for BEAST 1.8.3 (). To allow for variation in base substitution among codon positions, we fitted separate Hasegawa-Kishino-Yano models to the 5 SNP groups (); to allow for and estimate changes in effective population size, we used the Gaussian Markov random field Bayesian skyride model (); and to allow for variation in mutation rates among lineages, we used an uncorrelated relaxed molecular clock (), calibrated by the tip dates. We ran the XML file in BEAST for 40 million steps a total of 3 times with different starting seeds before using LogCombiner () to combine the runs with a 10% burn-in.
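The 5-way SNP partition above can be sketched in Python (the original used an in-house Perl script; the CDS coordinates, strand handling, and the assumption that each CDS starts in frame are illustrative only, and real coordinates would come from the NC_016856 annotation):

```python
# Sketch of SNP partitioning by reference position. cds_list holds
# hypothetical (start, end, strand) tuples with 1-based inclusive
# coordinates; a SNP inside more than one CDS is 'overlapping'.
def classify_snp(pos, cds_list):
    """Return one of: 'codon1', 'codon2', 'codon3',
    'overlapping', or 'intergenic'."""
    hits = [(s, e, strand) for s, e, strand in cds_list if s <= pos <= e]
    if not hits:
        return "intergenic"
    if len(hits) > 1:
        return "overlapping"
    start, end, strand = hits[0]
    # Offset from the first base of the CDS, reading in coding direction.
    offset = pos - start if strand == "+" else end - pos
    return "codon%d" % (offset % 3 + 1)
```

Grouping the 793 SNPs by this label would yield the 5 partitions that were exported to BEAUti.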
To visualize the results and the relative change in effective population size, we used Tracer version 1.6 (). To determine the mutation rate for the DT160 genome, we multiplied the mutation rate estimated by BEAST by the number of analyzed core SNPs (793 bp) and divided the product by the mean genome size of the analyzed isolates (4,884,485 bp). We used the discrete phylogeographic model () to predict ancestral migrations between host groups over the course of the outbreak. [...] We used Prokka () to annotate the de novo assembled genomes, and we used Roary () to cluster proteins and identify those found only in a subset of isolates and those that differed in length between isolates. We used ClustalW version 2.1 () to align amino acid sequences, and we used an in-house Perl script to determine whether these alignments contained mismatches. The nucleotide sequences of all proteins that differed were extracted from the assembled genomes, along with 500-bp flanks on either side, by using an in-house Perl script. We could not obtain 500-bp flanks for some genes because they were located at the ends of contigs; for those genes, the flanks were cut short and their lengths recorded. We extracted flanks to aid read alignment. This extraction left, for every protein that potentially differed in sequence, a pool of nucleotide sequences from each isolate. For each protein, we extracted all nucleotide variants from the pool by using an in-house Perl script. We used SRST2 version 2, a read mapping–based tool (), to align reads from each isolate to the sequence variants, and we used SAMtools version 1.3.1 () to form a consensus sequence from the aligned reads. We set the consensus cutoff at a read depth of >8 and a consensus of >80%. The flanks were removed from the consensus sequences, and the sequence variants were translated into amino acid sequences by using an in-house Perl script.
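The genome-wide rate conversion at the start of this step is simple arithmetic and can be made explicit (the input rate below is a placeholder for illustration, not the published BEAST estimate):

```python
# Rescale a per-site rate estimated on the 793-SNP alignment to a
# per-site rate over the whole genome, using the figures given above.
CORE_SNPS = 793            # sites in the analyzed core-SNP alignment
MEAN_GENOME_BP = 4_884_485  # mean genome size of the analyzed isolates

def genome_wide_rate(snp_alignment_rate):
    """snp_alignment_rate: substitutions/site/year over the SNP alignment.
    Returns substitutions/site/year over the whole genome."""
    return snp_alignment_rate * CORE_SNPS / MEAN_GENOME_BP
```

Because only variable sites enter the BEAST alignment, the per-site rate on those 793 sites overstates the genomic rate; multiplying by 793/4,884,485 corrects for the invariant remainder of the genome.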
We identified protein differences by comparing the amino acid sequences from each isolate and combined these differences with the nonsynonymous SNPs identified by SNP analysis. The positions of nonsynonymous SNPs within proteins were used to prevent double-counting. We used the Clusters of Orthologous Groups of proteins (COGs) database () to predict protein functions. For each functional group, we calculated the proportion of proteins that differed in sequence, and we used a Fisher exact test, computed via Markov chain Monte Carlo with ≈10⁹ iterations, to determine whether these proportions differed. We used an in-house Perl script to form a presence–absence matrix of all the protein differences. We used Primer-E version 6 () to calculate the Euclidean distances between isolates based on the presence–absence matrix. The centroid is the arithmetic mean of a group of data points in an n-dimensional space. To assess differences in centroids among isolates collected from different sources or time periods, we applied PERMANOVA (). To assess differences in dispersion between groups, we computed dispersions (z-values) by using PermDisp () and then modeled them with a regression model with date of collection and source as the explanatory variables. […]
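The presence–absence matrix and the pairwise Euclidean distances (computed in the original with an in-house Perl script and Primer-E) can be sketched as follows (the protein-difference names are invented for illustration):

```python
import math

# Sketch: build a 0/1 presence-absence matrix of protein differences,
# then compute Euclidean distances between isolates from its rows.
def presence_absence_matrix(isolate_diffs):
    """isolate_diffs: dict mapping isolate name -> set of protein
    differences observed in that isolate.
    Returns (matrix, columns): matrix maps isolate -> 0/1 row vector
    ordered by the sorted union of all differences."""
    columns = sorted(set().union(*isolate_diffs.values()))
    matrix = {iso: [1 if d in diffs else 0 for d in columns]
              for iso, diffs in isolate_diffs.items()}
    return matrix, columns

def euclidean(a, b):
    """Euclidean distance between two equal-length row vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

On 0/1 rows, the squared Euclidean distance is simply the count of protein differences present in one isolate but not the other, which is what makes it a natural input for the PERMANOVA and PermDisp comparisons described above.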

Pipeline specifications

Software tools Velvet, QUAST, Snippy, kSNP, BWA, SAMtools, FreeBayes, RAxML, EvolView, SplitsTree, NeighborNet, MEGA, BEAST, Tracer, Prokka, Roary, ClustalW, SRST2, Primer-E
Applications Phylogenetics, De novo sequencing analysis, Nucleotide sequence alignment
Organisms Salmonella enterica subsp. enterica serovar Typhimurium, Homo sapiens