Computational protocol: A phylogeny-based sampling strategy and power calculator informs genome-wide associations study design for microbial pathogens

[…] MTB: The phylogeny was constructed from the whole-genome multiple alignment. Because MTB populations are considered predominantly clonal, most of the genome is thought to support a single consensus phylogeny that is not significantly affected by recombination []. A superset of SNPs relative to the reference strain H37Rv [] was compiled across the clinical isolates from the variant caller's SNP reports. SNPs in repetitive elements, including transposases, PE/PPE/PGRS genes, and phiRV1 members (273 genes, 10% of the genome; genes listed in reference []), were excluded to avoid inaccuracies in read alignment in those portions of the genome. SNPs in an additional 39 genes previously associated with drug resistance [] were also removed to exclude the possibility that homoplasy of drug resistance mutations would significantly alter the phylogeny. After these filters were applied, the remaining SNPs were concatenated and used to construct a parsimony phylogenetic tree with the PHYLIP dnapars algorithm v3.68 [], with the KZN-DS [] strain as the outgroup root. We also constructed the phylogeny by two other methods: first, Bayesian Markov chain Monte Carlo (MCMC) methods as implemented in MrBayes v3.2 [] with the GTR model; second, a maximum likelihood tree with PhyML v3.0 [] using the GTR model with eight categories for the gamma model. The results were consistent with the PHYLIP phylogeny.

[...] Using multilocus sequence typing data, a phylogeny was estimated using ClonalFrame [], a model-based approach to determining microevolution in bacteria. This program differentiates mutation and recombination events on each branch of the tree based on the density of polymorphisms. ClonalFrame was run with 50,000 burn-in iterations and 50,000 sampling iterations. The consensus tree represents combined data from three independent runs, with 75% consensus required for inference of relatedness.
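The SNP filtering and concatenation step described for MTB can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the input layout (a per-isolate dictionary of SNP calls), and the example coordinates are hypothetical. It also assumes that an isolate without a call at a site carries the reference base, which ignores missing coverage.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: 1-based inclusive (start, end) reference coordinates.
Region = Tuple[int, int]


def in_excluded_region(pos: int, regions: List[Region]) -> bool:
    """Return True if a reference position falls inside any excluded region."""
    return any(start <= pos <= end for start, end in regions)


def build_snp_alignment(
    snp_calls: Dict[str, Dict[int, str]],
    ref_bases: Dict[int, str],
    excluded: List[Region],
) -> Dict[str, str]:
    """Concatenate filtered SNP sites into one sequence per isolate.

    snp_calls maps isolate -> {position: alternate base}. Positions absent
    from an isolate's calls are ASSUMED to carry the reference base.
    """
    # Superset of SNP positions observed in any isolate.
    sites = sorted({pos for calls in snp_calls.values() for pos in calls})
    # Drop sites in excluded genes (e.g. repetitive or drug-resistance genes).
    kept = [p for p in sites if not in_excluded_region(p, excluded)]
    return {
        isolate: "".join(calls.get(p, ref_bases[p]) for p in kept)
        for isolate, calls in snp_calls.items()
    }


# Toy example with made-up positions and bases.
snp_calls = {"iso1": {100: "A", 250: "T"}, "iso2": {250: "T", 900: "G"}}
ref_bases = {100: "C", 250: "C", 900: "A"}
excluded = [(200, 300)]  # stands in for one excluded gene
alignment = build_snp_alignment(snp_calls, ref_bases, excluded)
print(alignment)  # {'iso1': 'AA', 'iso2': 'CG'}
```

The resulting per-isolate strings have equal length, so they could be written out in whatever alignment format the tree-building program expects.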
Recombination events were defined as sequences longer than 50 bp with a probability of recombination ≥75% over their entire length, reaching 95% at one or more sites. […]
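The recombination-event criteria can be expressed as a simple scan over per-site probabilities. The sketch below is illustrative only: it assumes the per-site recombination probabilities for one branch are already available as a list (the ClonalFrame output format is not modeled here), it measures length in consecutive sites, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def call_recombination_events(
    probs: Sequence[float],
    min_len: int = 50,
    run_threshold: float = 0.75,
    peak_threshold: float = 0.95,
) -> List[Tuple[int, int]]:
    """Return 0-based half-open (start, end) intervals that meet the criteria.

    An interval qualifies if every site has probability >= run_threshold,
    it spans more than min_len sites, and at least one site reaches
    peak_threshold.
    """
    values = list(probs)
    events: List[Tuple[int, int]] = []
    start = None
    # A trailing sentinel below the threshold closes a run ending at the last site.
    for i, p in enumerate(values + [0.0]):
        if p >= run_threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > min_len and max(values[start:i]) >= peak_threshold:
                events.append((start, i))
            start = None
    return events


# Toy input: one qualifying run (61 sites, peak 0.97) and one long run
# that never reaches 0.95 and is therefore rejected.
probs = [0.1] * 10 + [0.8] * 30 + [0.97] + [0.8] * 30 + [0.1] * 10 + [0.8] * 60
events = call_recombination_events(probs)
print(events)  # [(10, 71)]
```

Using maximal runs above the 75% threshold is one reasonable reading of the definition; other readings, such as averaging the probability over a window, would need a different implementation.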

Pipeline specifications

Software tools: PHYLIP, MrBayes, PhyML, ClonalFrame
Application: Phylogenetics
Organisms: Escherichia coli
Diseases: Tuberculosis