Computational protocol: Identification of Salmonella for public health surveillance using whole genome sequencing

[…] FASTQ reads were quality trimmed using Trimmomatic, with bases removed from the trailing end that fell below a PHRED score of 30. If the read length after trimming was less than 50 bp, the read and its pair were discarded. The PHE KmerID pipeline was used to compare the sequenced reads with 1,769 published genomes to identify the bacterial species (and Salmonella subspecies) and to detect cultures submitted by the local and regional hospital laboratories that contained more than one bacterial species (mixed cultures). KmerID determines a similarity index between the FASTQ reads and each of the 1,769 published reference genomes by calculating the percentage of 18-mers in the reference that are also present in the FASTQ reads. Only 18-mers that occur at least twice in the FASTQ reads are considered present. Mixed cultures are detected by comparing the list of similarities between the sample and the references with the similarities of the references to each other, and filtering this comparison for inconsistencies.
Sequence type (ST) assignment was performed using the Metric Orientated Sequence Typer (MOST), a modified version of SRST, available from […]. The primary difference between SRST and MOST is in the metrics provided around the result: while SRST gives a single score, MOST provides a larger array of metrics to give users more read-level detail about their result. Preliminary analysis was undertaken using the MLST database described in […]. It takes approximately 10–15 min to run MOST using a single core on the PHE infrastructure, which consists of Intel Xeon CPU E5-2680 @ 2.70 GHz, with 16 cores sharing 125 GB of memory.
For isolates that had novel STs, or an ST but no associated serovar in the Achtman MLST database, the serovar was determined by phenotypic serotyping at PHE. STs and corresponding serovars of isolates serotyped and sequenced during this study were added to a modified version of the Achtman MLST database, held and curated at PHE.
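The KmerID similarity index described above can be illustrated with a minimal Python sketch. This is not the PHE KmerID code: the function names `kmers` and `similarity` are hypothetical, and the sketch assumes that distinct reference k-mers are counted once, that only the forward strand is considered, and that k-mers containing non-ACGT characters are ignored. None of these details is stated in the protocol.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq in upper case, skipping k-mers with non-ACGT bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def similarity(reads, reference, k=18, min_count=2):
    """Percentage of distinct reference k-mers seen at least min_count times in the reads.

    k=18 and min_count=2 follow the protocol text; everything else is an
    illustrative assumption (forward strand only, distinct k-mers).
    """
    # Count how often each k-mer occurs across all reads.
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    ref_kmers = set(kmers(reference, k))
    if not ref_kmers:
        return 0.0
    # A reference k-mer is "present" only if it occurs at least min_count times.
    present = sum(1 for km in ref_kmers if read_counts[km] >= min_count)
    return 100.0 * present / len(ref_kmers)
```

Requiring at least two occurrences means that a k-mer produced by a single sequencing error is unlikely to be counted as present, which is presumably the motivation for the threshold.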
Novel STs were assigned a preliminary ST (PST), and an inferred serovar was determined. The PHE MLST database currently holds 7,000 strains and 1,200 serovars and is updated every three months.
For some STs that contained two serotypes, whole genome SNP phylogenetic analysis was carried out by mapping the strains of interest against a reference genome from within the same sequence type (H145100685 for ST909; H143720759 for ST49) using BWA-MEM. SNPs were called using GATK2 in UnifiedGenotyper mode. Core genome positions that had a high-quality SNP (>90% consensus, minimum depth 10×, GQ ≥ 30, MQ ≥ 30) in at least one strain were extracted, and phylogenies were determined using RAxML v8.1.17 with the gamma model of rate heterogeneity and 100 bootstraps. […]
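The SNP quality thresholds above can be expressed as a simple filter. The following Python sketch is illustrative only and is not the authors' pipeline: the `Call` record, its field names, and the functions `is_high_quality_snp` and `variable_positions` are hypothetical, and the sketch assumes per-position values have already been parsed from the variant caller output. It does not model how the core genome itself is defined.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical per-strain, per-position record parsed from variant calls."""
    depth: int          # read depth at the position
    consensus: float    # fraction of reads supporting the called base (0 to 1)
    gq: float           # genotype quality
    mq: float           # mapping quality
    is_variant: bool    # True if the called base differs from the reference


def is_high_quality_snp(call, min_consensus=0.9, min_depth=10, min_gq=30, min_mq=30):
    """Apply the thresholds from the text: >90% consensus, depth >= 10, GQ >= 30, MQ >= 30."""
    return (
        call.is_variant
        and call.consensus > min_consensus
        and call.depth >= min_depth
        and call.gq >= min_gq
        and call.mq >= min_mq
    )


def variable_positions(calls_by_strain):
    """Return sorted positions with a high-quality SNP in at least one strain.

    calls_by_strain maps strain name -> {position: Call}.
    """
    positions = set()
    for calls in calls_by_strain.values():
        for pos, call in calls.items():
            if is_high_quality_snp(call):
                positions.add(pos)
    return sorted(positions)
```

The positions returned by such a filter would then be extracted from every strain to build the alignment passed to the phylogenetic step.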

Pipeline specifications

Software tools: SRST, BWA, GATK, RAxML
Applications: Phylogenetics, WGS analysis
Organisms: Salmonella enterica subsp. enterica, Salmonella enterica subsp. salamae, Salmonella bongori
Diseases: Salmonella Infections