Computational protocol: Development of a comparative genomic fingerprinting assay for rapid and high resolution genotyping of Arcobacter butzleri

Protocol publication

[…] For WGS analysis, DNA was extracted using a DNeasy Blood and Tissue Kit (Qiagen Inc, Toronto, ON). To minimize possible genetic bias amongst strains selected for WGS, A. butzleri isolates from diverse sources were genotyped using Amplified Fragment Length Polymorphism (AFLP) analysis as described previously [,], and eight strains representing highly diverse AFLP profiles were chosen for sequencing (Table ). The identity of isolate DNA was tested by sequencing approximately 1000 bp of the 16S rRNA gene and comparing the results with A. butzleri sequences within the National Center for Biotechnology Information (NCBI) genetic database [,]. The DNA for isolates to be sequenced was quantified by spectrophotometry (A600) (Ultrospec 3100 pro, GE Healthcare Life Sciences, Baie d’Urfe, QC). Isolates were sequenced as paired-end, 100 bp reads on a HiSeq platform (Illumina Inc., San Diego, CA) with Phred30 (99.9%) base-calling accuracy [], and reads were de novo assembled into contigs using ABySS [] with specifications for short paired-end reads. Sequencing data for the A. butzleri isolates were accessioned in the NCBI genetic sequence database as a single bioproject (PRJNA233527). [...]

Rapid Annotation Using Subsystem Technology [] was used to identify open reading frames (ORFs) for the eight sequenced A. butzleri genomes, as well as three previously available genome assemblies (RM4018 - PRJNA58557, ED1 - PRJNA158699, JV22 - PRJNA61483). The genome assembly for a fourth strain, 7h1h (PRJNA200766), was not available at the time that the comparative genomic analysis was performed; however, we were able to use all four published WGS strains in all subsequent in silico CGF analyses.

To identify core and accessory genes, the ORFs from each genome were searched against the eleven genome assemblies using the program BLASTP from the Basic Local Alignment Search Tool [,], with filtering to remove redundant results from likely orthologous genes.
ORFs present in all assemblies were identified as core, and all non-redundant ORFs absent from one or more strains were designated as accessory. [...]

To simplify CGF assay design, accessory genes with limited genotypic potential due to a highly biased population distribution (i.e. present in greater than 80% of strains or in fewer than 20% of strains) were eliminated from further consideration as candidate markers. Moreover, for groups of accessory genes that presented redundant patterns of presence and absence in the dataset (i.e. genes that are typically linked and provide limited additional discrimination), only one representative gene from each unique pattern was considered as a candidate marker for CGF development. Short genes (i.e. <300 bp) and/or those containing nucleotide gaps or polymorphisms that might affect PCR primer design were also discarded. Accessory genes meeting the above criteria were identified and used to design an expanded CGF assay (i.e. the reference assay) to examine the population structure of a diverse collection of A. butzleri isolates (n=152) based on accessory genome variability. Data from these isolates, which were recovered from river water, raw and treated sewage, diarrheic and non-diarrheic human beings, and non-human animals in southwestern Alberta, were used in conjunction with in silico-derived [] CGF data from four published genome-sequenced strains (RM4018 - PRJNA58557, ED1 - PRJNA158699, JV22 - PRJNA61483, 7h1h - PRJNA200766). CGF profiles were also generated in silico using the program MIST [] for the eight isolates sequenced de novo to allow for comparison with PCR-derived CGF data, thus facilitating assessment of marker performance. A dendrogram representing an estimate for a ‘reference phylogeny’ was constructed from the binary (i.e. presence and absence) data for those genes that generated data fully concordant with in silico-predicted CGF profiles (n=72).
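
The marker-selection criteria described above (core versus accessory classification, the 20% to 80% carriage window, one representative per presence/absence pattern, and the 300 bp length cut-off) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the function names, the gene names, the toy presence/absence matrix and the choice to keep the first gene encountered for each pattern are all assumptions, and the screen for nucleotide gaps or polymorphisms is not modelled.

```python
def classify_orfs(presence):
    """Split genes into core (present in every strain) and accessory."""
    core = [g for g, calls in presence.items() if all(calls)]
    accessory = [g for g, calls in presence.items() if not all(calls)]
    return core, accessory


def select_candidate_markers(presence, lengths_bp,
                             min_pct=20, max_pct=80, min_length_bp=300):
    """Apply the carriage, length and redundancy filters to accessory genes."""
    _, accessory = classify_orfs(presence)
    seen_patterns = set()
    selected = []
    for gene in accessory:
        calls = presence[gene]
        n_present, n_strains = sum(calls), len(calls)
        # Integer arithmetic avoids floating-point trouble at the 20%/80% edges.
        if 100 * n_present > max_pct * n_strains:
            continue  # present in more than 80% of strains
        if 100 * n_present < min_pct * n_strains:
            continue  # present in fewer than 20% of strains
        if lengths_bp[gene] < min_length_bp:
            continue  # too short for PCR primer design
        pattern = tuple(calls)
        if pattern in seen_patterns:
            continue  # same pattern as an already selected gene
        seen_patterns.add(pattern)
        selected.append(gene)
    return selected


# Hypothetical toy data: 7 genes scored as present (1) or absent (0) in 10 strains.
presence = {
    "geneA": [1] * 10,
    "geneB": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "geneC": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "geneD": [1] * 9 + [0],
    "geneE": [1] + [0] * 9,
    "geneF": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "geneG": [1] * 8 + [0, 0],
}
lengths_bp = {"geneA": 1500, "geneB": 900, "geneC": 1200, "geneD": 800,
              "geneE": 700, "geneF": 240, "geneG": 600}

core, accessory = classify_orfs(presence)
markers = select_candidate_markers(presence, lengths_bp)
```

In this toy dataset geneA is core, geneC duplicates the pattern of geneB, geneD and geneE fall outside the carriage window, and geneF is shorter than 300 bp, so only geneB and geneG remain as candidates.
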
Hierarchical clustering was performed by the unweighted pair group method with arithmetic mean (UPGMA) using the hclust function in R [] and the simple matching coefficient of genetic similarity. [...]

PCR data for the reference and CGF40 assays were generated for the 152 A. butzleri isolates. The CGF profiles of four previously published genome-sequenced strains (RM4018, ED1, JV22, and 7h1h) were also obtained in silico []. To verify concordance between the expanded CGF and CGF40 assays, binary data from each assay were subjected to hierarchical clustering by UPGMA using the hclust function in R [] and the simple matching coefficient of genetic similarity. The online ‘Comparing Partitions’ tool [] was used to calculate the discriminatory power of each assay and the concordance between assays. The discriminatory power of each CGF assay was calculated using Simpson’s index of diversity (ID) [], and the concordance was calculated as the adjusted Wallace coefficient (AWC) between the CGF40 assay and the reference phylogeny. A “tanglegram” was generated using a custom R script to compare dendrograms for the CGF40 assay and the reference phylogeny. This script is available online at https://gist.github.com/peterk87/d92f81ae475063792f49. Briefly, the script generates the dendrograms from binary CGF40 and reference phylogeny data and rearranges the CGF40 dendrogram with respect to the reference phylogeny to maximize structural concordance (i.e. minimize entanglement of branches) using the “untangle_step_rotate_1side” function from the R package dendextend (https://github.com/talgalili/dendextend). It then uses the reference phylogeny to create color-coded linkage groups at a 90% cluster similarity level and plots the color-coded tanglegram. […]
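
The clustering step can be illustrated with a small, dependency-free sketch. The protocol itself used R's hclust; the Python below is a stand-in written for illustration. It computes a distance of 1 minus the simple matching coefficient between binary profiles and performs average-linkage (UPGMA) merging, optionally stopping at a similarity threshold such as the 90% level used for linkage groups. The function names, strain labels and profiles are hypothetical, and ties between equally close cluster pairs are broken by list order, which may differ from hclust.

```python
from itertools import combinations


def simple_matching_distance(a, b):
    """Return 1 - simple matching coefficient for two equal-length binary profiles."""
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / len(a)


def upgma(profiles, min_similarity=0.0):
    """Average-linkage (UPGMA) clustering of binary profiles.

    Merging stops once the closest pair of clusters is less similar than
    min_similarity. Returns (clusters, merges); merges lists
    (cluster_a, cluster_b, distance) in the order the joins were made.
    """
    names = list(profiles)
    leaf_dist = {
        frozenset(pair): simple_matching_distance(profiles[pair[0]], profiles[pair[1]])
        for pair in combinations(names, 2)
    }

    def average_distance(c1, c2):
        # UPGMA distance: mean of all leaf-to-leaf distances between two clusters.
        total = sum(leaf_dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    max_distance = 1.0 - min_similarity + 1e-12  # small tolerance for rounding
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]),
        )
        height = average_distance(clusters[i], clusters[j])
        if height > max_distance:
            break
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


# Hypothetical 8-marker CGF profiles for four strains.
profiles = {
    "s1": [1, 1, 1, 1, 0, 0, 0, 0],
    "s2": [1, 1, 1, 1, 0, 0, 0, 0],
    "s3": [1, 1, 1, 1, 0, 0, 1, 1],
    "s4": [0, 0, 0, 0, 1, 1, 1, 1],
}

groups_90, _ = upgma(profiles, min_similarity=0.90)  # linkage groups at 90% similarity
_, full_tree = upgma(profiles)                       # all merges, for a dendrogram
```

Each entry of the merge list records the two clusters joined and the height of the join, which is enough information to draw a dendrogram with any plotting tool.
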

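The two statistics named in the protocol, Simpson's index of diversity and the adjusted Wallace coefficient, were obtained from the online Comparing Partitions tool. As a rough illustration of what those numbers measure, the sketch below implements the standard textbook formulas in plain Python. It is not the tool's code: the function names and cluster assignments are hypothetical, and confidence intervals are omitted. The adjusted Wallace coefficient is directional, and the source does not state which direction was reported, so both are computed here.

```python
from collections import Counter
from itertools import combinations


def simpsons_id(labels):
    """Simpson's index of diversity: 1 - sum(n_j * (n_j - 1)) / (N * (N - 1))."""
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two isolates")
    counts = Counter(labels).values()
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))


def wallace(a, b):
    """Wallace coefficient A -> B: of the pairs grouped together by A,
    the fraction that are also grouped together by B."""
    same_a = same_both = 0
    for i, j in combinations(range(len(a)), 2):
        if a[i] == a[j]:
            same_a += 1
            if b[i] == b[j]:
                same_both += 1
    if same_a == 0:
        raise ValueError("partition A has no co-clustered pairs")
    return same_both / same_a


def adjusted_wallace(a, b):
    """Adjusted Wallace A -> B: Wallace corrected for chance agreement,
    where the expected value under independence is 1 - Simpson's ID of B."""
    expected = 1.0 - simpsons_id(b)
    if expected == 1.0:
        raise ValueError("partition B places all isolates in one cluster")
    return (wallace(a, b) - expected) / (1.0 - expected)


# Hypothetical cluster assignments for six isolates under two typing schemes.
fine = [1, 1, 2, 2, 3, 3]     # higher-resolution scheme
coarse = [1, 1, 1, 1, 2, 2]   # lower-resolution scheme

sid_fine = simpsons_id(fine)
sid_coarse = simpsons_id(coarse)
awc_fine_to_coarse = adjusted_wallace(fine, coarse)
awc_coarse_to_fine = adjusted_wallace(coarse, fine)
```

In this toy case every pair grouped by the fine scheme is also grouped by the coarse scheme, so the fine-to-coarse coefficient is 1, while the reverse direction is much lower because the coarse clusters are split by the fine scheme.
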
Pipeline specifications

Software tools: Hclust, dendextend
Application: Phylogenetics
Organisms: Arcobacter butzleri, Homo sapiens