Computational protocol: A Framework for Assessing the Concordance of Molecular Typing Methods and the True Strain Phylogeny of Campylobacter jejuni and C. coli Using Draft Genome Sequence Data

Protocol publication

[…] Illumina traces from 80 of the C. jejuni and C. coli genomes sequenced by Lefébure et al. were assembled using Velvet (version 1.1.06; Zerbino and Birney) with a hash length of 25, as this was found to give optimal assemblies. The order of the contigs was inferred by comparison with the C. jejuni NCTC 11168 reference genome using ABACAS (Assefa et al.). Coding sequences were predicted and annotated using Rapid Annotation using Subsystem Technology (RAST; Aziz et al.).

[...] To generate a measure of the quality of the whole genome sequence (WGS) data, we examined the C. jejuni genomes [closed reference sequence (RefSeq) genomes (n = 9), draft RefSeq genomes (n = 8), draft 454 genomes (n = 3), and draft Illumina genomes (n = 41)] using a two-step process to identify truncations in the core genes predicted in each genome. In the first step, a set of “core genes” for C. jejuni was identified based on a preliminary comparative genomic survey of a subset of RefSeq annotated genomes. Whole genome pair-wise homology searching was performed at the ORF level with BLASTP from BLAST+ (version 2.2.25; Camacho et al.), using strain NCTC 11168 as the reference. Genes were considered “core” if conserved across all of the genomes analyzed, yielding a set of 1,314 genes. In the second step, the 1,314 genes from strain NCTC 11168 were queried against the predicted ORFs of the set of 61 C. jejuni genomes using BLASTP. A gene was scored as truncated if its alignment length was shorter than the length of the reference sequence. A one-tailed unpaired t-test was performed using GraphPad Prism version 5.04 for Windows (GraphPad Software, San Diego) to determine the statistical significance of the increase in the number of truncations observed in draft-quality genome sequences compared with closed RefSeq genomes.

[...] A semi-automated approach was developed to rapidly infer a core genome phylogeny for the dataset. In the first step, a robust set of “highly conserved core” (HCC) genes for C. jejuni and C. coli was identified based on a preliminary comparative genomic survey of a subset of RefSeq annotated genomes. Whole genome pair-wise homology searching was performed at the ORF level with BLASTP from BLAST+ (version 2.2.25; Camacho et al.). Genes were considered “core” if conserved across all of the genomes analyzed. A 90% sequence identity cut-off was used to identify HCC genes, yielding a set of 389 genes (Table S3 in Supplementary Material). In the second step, CONCATENATOR (Kruczkiewicz et al.), a program written in C# using the .NET Framework 4.0, was used to: (1) identify the homologous sequences for the set of 389 HCC genes in each genome in the dataset using BLASTN; (2) perform individual alignments for each gene using MUSCLE (Edgar); and (3) concatenate the alignments to produce a single alignment (i.e., a “concatenome”). The reference core genome phylogeny for the dataset was then estimated from the concatenome in SeaView (Gouy et al.) using uncorrected distances.

[...] The program “microbial in silico typer” (MIST) was used to generate in silico molecular typing results from whole genome sequence data (Kruczkiewicz et al.). MIST derives several kinds of in silico typing data from “raw” genome sequences (i.e., contig assemblies), including MLST (Dingle et al.), porA typing (Clark et al.), flaA typing (Meinersmann et al.), and CGF (Taboada et al.). The full implementation of MIST, which was written in the C# programming language using the .NET Framework 4.0, will be described in detail elsewhere; the functionalities used in this study are briefly described here. Sequence Typing: the sequence for each of the target genes (i.e., MLST genes: aspA, glnA, gltA, glyA, pgm, tkt, uncA; the porA gene; and the flaA gene) was identified in each of the contig assemblies through homology searching using BLAST+ (version 2.2.25; Camacho et al.).
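As a rough illustration of the concatenation and uncorrected-distance steps described earlier in this section, the sketch below joins per-gene alignments into a single “concatenome” and computes the proportion of differing sites between each pair of strains. It is not the CONCATENATOR or SeaView code: the function names, the strain labels, and the toy sequences are hypothetical, and skipping columns that contain a gap in either sequence is an assumption rather than something the protocol specifies.

```python
from itertools import combinations


def concatenate_alignments(gene_alignments):
    """Join per-gene alignments (dicts of strain -> aligned sequence) end to end."""
    strains = sorted(gene_alignments[0])
    parts = {strain: [] for strain in strains}
    for alignment in gene_alignments:
        if sorted(alignment) != strains:
            raise ValueError("every gene alignment must contain the same strains")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        for strain in strains:
            parts[strain].append(alignment[strain])
    return {strain: "".join(chunks) for strain, chunks in parts.items()}


def uncorrected_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Assumption: columns with a gap ("-") in either sequence are skipped.
    """
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else float("nan")


# Toy alignments for two hypothetical genes in three hypothetical strains.
gene_alignments = [
    {"strainA": "ATGC", "strainB": "ATGA", "strainC": "A-GC"},
    {"strainA": "GGT", "strainB": "GGT", "strainC": "GCT"},
]
concatenome = concatenate_alignments(gene_alignments)

# Print the pair-wise uncorrected distance matrix entries.
for x, y in combinations(sorted(concatenome), 2):
    print(x, y, round(uncorrected_distance(concatenome[x], concatenome[y]), 4))
```

A distance matrix of this kind is the usual input to a distance-based tree-building method; the excerpt does not state which tree-building algorithm was applied, so none is shown here.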
Alleles were inferred for each gene by comparing these sequences against allelic sequences obtained from the C. jejuni PubMLST database. MLST allelic profiles were used to determine the sequence type (ST) and clonal complex (CC) for each strain.

Comparative Genomic Fingerprinting: the presence of targets in the CGF40 scheme (Taboada et al.) was determined by performing a homology search for each target using BLASTN against each whole genome sequence, with a sequence identity cut-off of 95% used to score the presence/absence of each target. To generate CGF40 clusters, pair-wise profile similarities were computed using the simple matching coefficient and clustered using the unweighted-pair group method using average linkages (UPGMA) in BioNumerics (v.5.1; Applied Maths, Austin, TX, USA), using 100, 95, and 90% fingerprint similarities for cluster definition. […]
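The clustering of binary presence/absence fingerprints described above can be sketched in a few lines. This is a minimal illustration, not the BioNumerics implementation: the function names, the strain labels, and the 20-locus toy profiles (the real CGF40 scheme has 40 targets) are hypothetical. The simple matching coefficient is taken as the fraction of loci at which two profiles agree, and clusters are grown by average linkage until no pair of clusters reaches the chosen similarity threshold.

```python
from itertools import combinations


def simple_matching(profile_a, profile_b):
    """Fraction of loci with the same presence/absence call in both profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same number of loci")
    return sum(a == b for a, b in zip(profile_a, profile_b)) / len(profile_a)


def upgma_clusters(profiles, threshold):
    """Average-linkage clustering, merging while average similarity >= threshold."""
    names = sorted(profiles)
    sim = {
        frozenset((a, b)): simple_matching(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def average_similarity(c1, c2):
        # UPGMA: every pair of members contributes with equal weight.
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), best = max(
            (
                ((i, j), average_similarity(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return sorted(sorted(cluster) for cluster in clusters)


# Toy 20-locus fingerprints ("1" = target present, "0" = target absent).
profiles = {
    "s1": "11111111110000000000",
    "s2": "11111111110000000000",
    "s3": "11111111110000000001",
    "s4": "00000000001111111111",
}

for threshold in (1.00, 0.90):
    print(threshold, upgma_clusters(profiles, threshold))
```

Lowering the threshold can only merge clusters, never split them, which is why cluster definitions at 100, 95, and 90% similarity form a nested series.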

Pipeline specifications

Software tools: BLASTP, BLASTN, MUSCLE, BioNumerics
Applications: Phylogenetics, WGS analysis, Nucleotide sequence alignment
Organisms: Campylobacter jejuni, Campylobacter coli