Computational protocol: Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

Protocol publication

[…] We present one empirical dataset for each of four major foodborne bacterial pathogens (L. monocytogenes, S. enterica ser. Bareilly, E. coli, and C. jejuni) and one simulated dataset generated from the S. enterica ser. Bareilly tree using the pipeline TreeToReads, for which both the true tree and the SNP positions are known. In addition, we propose a standard spreadsheet format for describing these and future benchmark datasets. That format can be readily applied to any other bacterial organism and supports automated data analyses. Finally, we present Gen-FS Gopher, a script for easily downloading these benchmark datasets. All of these materials are freely available for download from GitHub.
Each of the four empirical datasets represents either a food recall event, in which food was determined to be contaminated with a specific bacterial pathogen, or an outbreak, in which at least three people were infected with the same pathogen. In all four datasets, the results of the epidemiological investigation and the phylogenomic analyses are in concordance. In other words, all isolates implicated in a given event share a common ancestor, or cluster together, in the phylogeny. Although it might be tempting to place these four datasets in the context of a transmission network, that would not be an appropriate use. A phylogeny (with clinical and environmental isolates at the tips and inferred ancestors at internal nodes) is more appropriate because of the nature of foodborne outbreaks: they are point-source events that usually originate from food vehicles, whereas a transmission network more appropriately models person-to-person transmission events. Although our four datasets are not intended for transmission network analysis, this does not preclude future datasets intended for that use. On the contrary, we have included a field, “intendedUse”, that addresses this issue and helps future-proof the proposed dataset format.
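The concordance criterion described above (all isolates implicated in an event cluster together in the phylogeny) can be illustrated with a minimal sketch. It assumes a rooted tree written as nested Python tuples; the isolate names and the functions `clade_leaf_sets` and `is_monophyletic` are hypothetical and are not part of the published datasets or tools. For real Newick files, a phylogenetics library would be the more practical choice.

```python
def clade_leaf_sets(tree):
    """Return the set of tip names under every node of a nested-tuple tree."""
    sets = []

    def visit(node):
        if isinstance(node, str):
            # A tip: its clade contains only itself.
            leaves = frozenset([node])
        else:
            # An internal node: its clade is the union of its children's clades.
            leaves = frozenset().union(*(visit(child) for child in node))
        sets.append(leaves)
        return leaves

    visit(tree)
    return sets


def is_monophyletic(tree, isolates):
    """True if some clade contains exactly the given isolates and no others."""
    return frozenset(isolates) in clade_leaf_sets(tree)


# Hypothetical toy tree: three outbreak isolates and two background isolates.
toy_tree = (
    (("outbreak_1", "outbreak_2"), "outbreak_3"),
    ("background_1", "background_2"),
)

print(is_monophyletic(toy_tree, {"outbreak_1", "outbreak_2", "outbreak_3"}))  # True
print(is_monophyletic(toy_tree, {"outbreak_1", "background_1"}))              # False
```

The check depends on where the tree is rooted, so an unrooted maximum likelihood tree would first need to be rooted (for example, with an outgroup) before this test is meaningful.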
All isolates listed in these benchmark datasets were sequenced at our federal or state-partner facilities, using either an Illumina MiSeq (San Diego, CA, USA) or a Pacific Biosciences (PacBio) instrument (Menlo Park, CA, USA).
The simulated dataset was created using TreeToReads v0.0.5, which takes as input a tree file (the true phylogeny), an anchor genome, and a set of user-defined parameter values. We used the S. enterica ser. Bareilly tree as our “true” phylogeny and the closed reference genome (CFSAN000189, GenBank: GCA_000439415.1) as our anchor. The parameter values were set as follows: number_of_variable_sites = 150; base_genome_name = CFSAN000189; rate_matrix = 0.38, 3.83, 0.51, 0.01, 4.45, 1; freq_matrix = 0.19, 0.30, 0.29, 0.22; coverage = 40; mutation_clustering = ON; percent_clustered = 0.25; exponential_mean = 125; read_length = 250; fragment_size = 500; stdev_frag_size = 120. The output is a pair of raw MiSeq fastq files for each tip (simulated isolate) in the input tree and a VCF file of known SNP locations.
The maximum likelihood phylogenies included with each dataset were inferred by first gathering SNPs with the SNP Pipeline and then using GARLI version 2.01 for phylogenetic reconstruction on each resulting SNP matrix. […]
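The TreeToReads parameter values listed above can be collected in one place and sanity-checked before a simulation is run. The following is a minimal sketch, not the authors' actual configuration file: the `key = value` layout mirrors how the parameters are written in the text, and the helpers `check_params` and `render_config` are hypothetical. A real run would also need settings not given here (for example, paths to the tree file and anchor genome), so the TreeToReads documentation should be consulted for the exact file format.

```python
# Parameter values as reported in the text for the simulated dataset.
params = {
    "number_of_variable_sites": 150,
    "base_genome_name": "CFSAN000189",
    "rate_matrix": [0.38, 3.83, 0.51, 0.01, 4.45, 1],
    "freq_matrix": [0.19, 0.30, 0.29, 0.22],
    "coverage": 40,
    "mutation_clustering": "ON",
    "percent_clustered": 0.25,
    "exponential_mean": 125,
    "read_length": 250,
    "fragment_size": 500,
    "stdev_frag_size": 120,
}


def check_params(p):
    """Basic consistency checks (assumed constraints, not TreeToReads' own validation)."""
    if len(p["rate_matrix"]) != 6:
        raise ValueError("rate_matrix should hold six substitution rates")
    if len(p["freq_matrix"]) != 4:
        raise ValueError("freq_matrix should hold four base frequencies")
    if abs(sum(p["freq_matrix"]) - 1.0) > 1e-6:
        raise ValueError("base frequencies should sum to 1")
    if not 0.0 <= p["percent_clustered"] <= 1.0:
        raise ValueError("percent_clustered should be a fraction between 0 and 1")


def render_config(p):
    """Render the parameters as 'key = value' lines, joining lists with commas."""
    lines = []
    for key, value in p.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{key} = {value}")
    return "\n".join(lines) + "\n"


check_params(params)
print(render_config(params))
```

Keeping the values in a dictionary makes it straightforward to vary a single parameter, such as coverage or read length, when generating further simulated benchmark datasets.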

Pipeline specifications

Software tools: TreeToReads, GARLI
Applications: Phylogenetics, WGS analysis
Organisms: Bacteria, Listeria monocytogenes, Salmonella enterica, Escherichia coli, Campylobacter jejuni