Computational protocol: Analysis of nuclear and organellar genomes of Plasmodium knowlesi in humans reveals ancient population structure and recent recombination among host-specific subpopulations

Protocol publication

[…] Raw sequence data were downloaded for 48 isolates from Kapit and Betong in Malaysian Borneo, 6 isolates from Sarikei in Malaysian Borneo, and 6 long-time isolated lines maintained in rhesus monkeys and originally sourced from Peninsular Malaysia and the Philippines. The sequence data accession numbers are listed in the original publication. The samples were aligned against the new reference for the human-adapted line A1-H.1 (pathogenseq.lshtm.ac.uk/knowlesi_1, accession number ERZ389239) using bwa-mem. SNPs were called using the SAMtools suite and filtered to retain high-quality SNPs using previously described methods. The SNP calling pipeline generated a total of 2,020,452 SNP positions, which were reduced to 1,632,024 high-quality SNPs after removing those in non-unique regions and at positions with low quality or low coverage.
Each sample was assessed for multiplicity of infection (MOI) using (i) the estMOI software and (ii) the number of positions with mixed genotypes (positions where more than one allele was found in at least 20% of the reads). The two measures gave correlated results (r² = 0.8), which highlighted the robustness of the two methods. Samples were classified into three categories: (i) single infections (≥98% of the genome showing no evidence of MOI and ≤1/10,000 SNP positions with mixed genotypes); (ii) low MOI (>85% of the genome showing no evidence of MOI and ≤4/10,000 SNP positions with mixed genotypes); (iii) high MOI (<85% of the genome showing no evidence of MOI and >4/10,000 SNP positions with mixed genotypes). Samples with high MOI were removed from subsequent analyses.
[...] For comparisons between populations, we first applied principal component analysis (PCA) and neighbour-joining tree clustering based on a matrix of pairwise identity-by-state values calculated from the SNPs.
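
The MOI classification rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the names `MoiSummary`, `is_mixed_call` and `classify_moi` are hypothetical, the 20% rule is read as "at least two alleles are each supported by at least 20% of the reads", and samples that meet neither the single-infection nor the low-MOI criteria are treated as high MOI, because the text does not say how borderline combinations were handled.

```python
from dataclasses import dataclass


@dataclass
class MoiSummary:
    """Per-sample summary (hypothetical field names, not from the source)."""
    clean_genome_fraction: float  # fraction of genome with no evidence of MOI
    mixed_per_10k: float          # mixed-genotype calls per 10,000 SNP positions


def is_mixed_call(allele_read_counts, min_fraction=0.20):
    """Return True if two or more alleles each reach min_fraction of the reads.

    Assumption: this is one reading of the 20% rule in the text.
    """
    total = sum(allele_read_counts)
    if total == 0:
        return False
    supported = [c for c in allele_read_counts if c / total >= min_fraction]
    return len(supported) > 1


def classify_moi(summary):
    """Assign 'single', 'low' or 'high' using the thresholds quoted in the text.

    Assumption: anything failing the 'single' and 'low' criteria is 'high'.
    """
    if summary.clean_genome_fraction >= 0.98 and summary.mixed_per_10k <= 1:
        return "single"
    if summary.clean_genome_fraction > 0.85 and summary.mixed_per_10k <= 4:
        return "low"
    return "high"
```

Under this sketch, a sample would be carried forward to later analyses only if `classify_moi` returns `"single"` or `"low"`.
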
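
A pairwise identity-by-state (IBS) matrix of the kind mentioned above could be computed roughly as follows. The sketch assumes haploid genotype calls (one allele code per sample per SNP, with `None` for missing calls) and defines IBS as the proportion of jointly called SNPs at which two samples carry the same allele. The function name `pairwise_ibs` and the input layout are hypothetical, and the source does not state how missing data were treated.

```python
def pairwise_ibs(genotypes):
    """Compute pairwise IBS between samples.

    genotypes: dict mapping sample name -> list of allele codes, one per SNP,
               with None marking a missing call (assumed layout).
    Returns a dict mapping (sample_a, sample_b) -> IBS proportion.
    """
    names = sorted(genotypes)
    ibs = {}
    for i, a in enumerate(names):
        for b in names[i:]:
            shared = 0   # SNPs called in both samples
            matches = 0  # SNPs where both samples carry the same allele
            for x, y in zip(genotypes[a], genotypes[b]):
                if x is None or y is None:
                    continue
                shared += 1
                matches += (x == y)
            value = matches / shared if shared else float("nan")
            ibs[(a, b)] = ibs[(b, a)] = value
    return ibs
```

A distance of `1 - IBS` for each pair is one common way to feed such a matrix into neighbour-joining or PCA software.
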
We used ranked FST statistics to identify the informative polymorphisms driving the clustering observed in the PCA. Finally, we created haplotype plots using only SNP positions with a minor allele frequency (MAF) > 0.05 across all populations, displaying each sample as a row to allow closer inspection of the chromosome regions where recombination events of interest were observed. The XP-EHH metric, as implemented in the rehh R package, was used to assess evidence of recent relative positive selection between the regional clusters from Kapit and Betong. The results were smoothed by calculating means in 1 kbp windows that overlapped by 250 bp. RAxML (v.8.0.3, 1000 bootstrap samples) was used to construct robust phylogenetic trees (90% bootstrap values > 95) for nuclear and organellar SNPs. Estimates of divergence times for subpopulations were based on a Bayesian Markov chain Monte Carlo (MCMC) approach (BEAST, v.1.8.1) applied to mitochondrial sequences, with parameter settings identical to those described elsewhere. The Shimodaira-Hasegawa and Templeton tests were used to detect incongruence between tree topologies. […]
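
The window-based smoothing of per-SNP scores (for example XP-EHH values exported from R) can be illustrated with the sketch below. It assumes that "1 kbp windows overlapping by 250 bp" means consecutive windows start 750 bp apart, that coordinates are 1-based, and that windows without any SNP are skipped. The function name `windowed_means` and its parameters are hypothetical.

```python
def windowed_means(positions, scores, window=1000, overlap=250):
    """Average per-SNP scores in overlapping windows along one chromosome.

    positions: 1-based SNP coordinates; scores: one value per SNP.
    Returns a list of (window_start, window_end, mean_score) tuples,
    omitting windows that contain no SNP (an assumption of this sketch).
    """
    step = window - overlap  # 750 bp between window starts by default
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    pairs = sorted(zip(positions, scores))
    if not pairs:
        return []
    last_position = pairs[-1][0]
    results = []
    start = 1
    while start <= last_position:
        end = start + window - 1
        in_window = [s for p, s in pairs if start <= p <= end]
        if in_window:
            results.append((start, end, sum(in_window) / len(in_window)))
        start += step
    return results
```

The simple rescan of all SNPs for every window keeps the example short; for genome-scale data a sorted two-pointer sweep would be the more efficient choice.
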

Pipeline specifications

Software tools: BWA, SAMtools, estMOI, RAxML
Applications: Population genetic analysis, GWAS
Organisms: Macaca fascicularis, Plasmodium knowlesi, Homo sapiens, Toxoplasma gondii, Macaca nemestrina