Computational protocol: Multilocus haplotypes reveal variable levels of diversity and population structure of Plasmodium falciparum in Papua New Guinea, a region of intense perennial transmission

Protocol publication

[…] Allele frequencies for the 13 villages and overall were determined using CONVERT version 1.31 software, which was also used to generate input files for the other population genetic programs. Genetic diversity was assessed using ARLEQUIN version 3.11 software by determining the number of haplotypes (h), the number of alleles per locus (A) and the expected heterozygosity, calculated as $H_e = \frac{n}{n-1}\left(1 - \sum_{i=1}^{n} p_i^2\right)$, where $p_i$ is the frequency of the ith allele and n is the number of alleles in the sample. Because A is strongly influenced by sample size, it is reliable only for large samples (e.g. catchments); we therefore also calculated the allelic richness (Rs), which is normalized to the smallest sample size using the rarefaction method developed by Hurlbert and implemented in FSTAT version 2.9.3 software. Associations between the latter two diversity indices and correlates of transmission intensity were measured by Spearman's rank correlation test using SPSS version 17.

To measure multilocus LD (non-random associations among loci), the standardized index of association ($I_A^S$) was calculated using the program LIAN version 3.5 for the whole dataset and for a curtailed dataset containing haplotypes only from confirmed single infections, as a precaution against the bias that may result from the presence of any false dominant haplotypes. Because LIAN version 3.5 can analyse only complete haplotypes, this analysis included only eight loci (TA1 and TAA42 were excluded) to maximize sample size. Due to the small size of the dataset within some villages, LD was calculated only at the scale of each catchment.

Population differentiation was estimated using two pairwise distance measurements: FST (θ), which estimates the weighted average F statistics over all loci based on the number of different alleles between haplotypes; and RST, which calculates F statistics from the sum of the squared size differences (i.e.
number of repeat units) between haplotypes, using only the seven microsatellite loci that follow the simple step-wise mutation model (TA87, ARAII, Pfg377, 2490, TA81, PfPK2 and TA60). Significance for both FST and RST was tested by comparison with 95% confidence intervals from 1023 permutations. As RST considers the distances between alleles, it is the more sensitive of the two statistics. Correlations between genetic differentiation and geographic distance (the shortest distance in km, as defined by the exact distance between geographic co-ordinates) were measured using the Mantel test in FSTAT version 2.9.3. As small sample size may result in a biased estimate of genetic differentiation, the Mantel tests included only villages with n ≥ 22.

To confirm the population structure identified by F statistics, Structure v. 2.3 software was also used to test whether each haplotype clustered according to geographic origin. Structure assigns individual multilocus haplotypes probabilistically to one of a number of clusters (K), or jointly to multiple clusters (admixture), based on the allele frequencies at each locus. The analysis was run 20 times for K = 1-20 for 10,000 Monte Carlo Markov Chain (MCMC) iterations after a burn-in period of 10,000, using the admixture model and correlated allele frequencies. The most likely K was defined by calculating ΔK, which is based on the rate of change of the log probability of the data between successive values of K, according to the method of Evanno et al., and geographic population structure was determined by assessing whether the ancestry coefficients were asymmetric among sampling locations.

To further visualize the complex relationships among haplotypes that might result from recombination, a weighted network approach was used that connects haplotypes if they share at least three alleles. Network analysis was done using the free software Cytoscape.
Each node within the network represents an individual haplotype, and edges between nodes represent alleles shared between haplotypes. For visual clarity, a threshold was set such that nodes were joined by an edge only if they shared alleles at more than three loci. Modifying this threshold value did not qualitatively change the structure of the network. Above this threshold, the edges in the network were weighted according to the number of shared alleles. A missing data point at a locus was treated as a difference between haplotypes at that locus. An edge-weighted spring-embedded algorithm was used to construct the network. Based on Kamada and Kawai's notion of "force-directed" networks, the algorithm treats nodes as objects that repel each other depending on a spring force between them, which is modified by the weight of the edge. […]
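The edge rule described for the haplotype network can be illustrated with a short Python sketch. It is not the Cytoscape workflow used in the protocol: the function names, the `None` marker for missing data and the example haplotypes are all hypothetical. Because the text gives the threshold both as "at least three" and as "more than three" shared alleles, the cut-off is left as a parameter (`min_shared`).

```python
from itertools import combinations

MISSING = None  # hypothetical marker for a missing genotype call


def shared_alleles(hap_a, hap_b):
    """Count loci where both haplotypes carry the same, non-missing allele.

    A missing call at a locus is counted as a difference, as in the text.
    """
    return sum(
        1
        for a, b in zip(hap_a, hap_b)
        if a is not MISSING and b is not MISSING and a == b
    )


def build_network(haplotypes, min_shared=3):
    """Return {(id_a, id_b): weight} for pairs sharing >= min_shared alleles.

    The weight of each edge is the number of shared alleles.
    """
    edges = {}
    for (id_a, hap_a), (id_b, hap_b) in combinations(sorted(haplotypes.items()), 2):
        weight = shared_alleles(hap_a, hap_b)
        if weight >= min_shared:
            edges[(id_a, id_b)] = weight
    return edges


# Invented four-locus haplotypes (allele sizes), for illustration only.
haplotypes = {
    "h1": (120, 95, 200, 310),
    "h2": (120, 95, 200, 305),
    "h3": (118, 95, None, 310),
    "h4": (125, 90, 210, 300),
}
edges = build_network(haplotypes, min_shared=3)
print(edges)
```

The resulting weighted edge list could then be laid out with any force-directed algorithm; setting `min_shared=4` would correspond to the "more than three" reading of the threshold.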
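The expected heterozygosity formula given earlier for the diversity analysis can be checked by hand with a few lines of Python. This is a minimal sketch, not ARLEQUIN's implementation; it assumes that n is the number of non-missing allele calls sampled at the locus, and the function name and example allele sizes are hypothetical.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """He = n / (n - 1) * (1 - sum(p_i ** 2)) for one locus.

    `alleles` holds one allele call per sampled haplotype; None marks a
    missing call and is ignored (an assumption of this sketch).
    """
    observed = [a for a in alleles if a is not None]
    n = len(observed)
    if n < 2:
        raise ValueError("at least two allele calls are needed")
    counts = Counter(observed)
    sum_sq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Invented allele sizes at a single microsatellite locus.
he = expected_heterozygosity([120, 120, 124, 128])
print(round(he, 4))
```

With allele frequencies of 0.5, 0.25 and 0.25 in a sample of four, the sum of squared frequencies is 0.375, so He = (4/3) × 0.625, or about 0.83.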
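The Mantel test used for isolation by distance compares two distance matrices by permuting the rows and columns of one of them together. The following is a generic one-sided permutation sketch in Python, not the FSTAT implementation; the function names, the number of permutations, the random seed and the example distances are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(genetic, geographic, permutations=999, seed=0):
    """Return (r, p) for a one-sided Mantel test on two square matrices."""
    n = len(genetic)
    pairs = list(combinations(range(n), 2))  # upper-triangle index pairs
    geo = [geographic[i][j] for i, j in pairs]
    observed = pearson([genetic[i][j] for i, j in pairs], geo)
    rng = random.Random(seed)
    order = list(range(n))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(order)  # relabel populations, keeping rows and columns paired
        permuted = [genetic[order[i]][order[j]] for i, j in pairs]
        if pearson(permuted, geo) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Invented example: four villages on a line, with genetic distance
# exactly proportional to geographic distance (km).
positions = [0.0, 10.0, 25.0, 60.0]
geographic = [[abs(a - b) for b in positions] for a in positions]
genetic = [[0.001 * d for d in row] for row in geographic]
r, p = mantel(genetic, geographic)
print(r, p)
```

With only four populations there are just 24 distinct relabelings, so the p-value cannot become very small; this is one reason the number of populations matters for such tests.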
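The ΔK criterion of Evanno et al. used to choose K from the Structure runs can be sketched as follows. This is one common reading of the statistic (the absolute second-order difference of the mean log probability across replicate runs, divided by the standard deviation at K), not the code used in the protocol; the use of the sample standard deviation, the function name and the log-probability values are assumptions.

```python
from statistics import mean, stdev


def delta_k(log_probs):
    """Compute ΔK from {K: [ln P(D|K) for each replicate run]}.

    ΔK(K) = |mean L(K+1) - 2 * mean L(K) + mean L(K-1)| / sd(L(K)).
    K values without both neighbours, or with zero spread, are skipped.
    """
    mean_l = {k: mean(values) for k, values in log_probs.items()}
    sd_l = {k: stdev(values) for k, values in log_probs.items()}
    result = {}
    for k in sorted(log_probs):
        if k - 1 in mean_l and k + 1 in mean_l and sd_l[k] > 0:
            second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
            result[k] = second_diff / sd_l[k]
    return result


# Invented log probabilities for two replicate runs at K = 1..4.
log_probs = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -902.0],
    3: [-890.0, -894.0],
    4: [-888.0, -896.0],
}
dk = delta_k(log_probs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

Because the statistic needs both neighbouring values of K, it is undefined at the smallest and largest K tested, so it cannot by itself select K = 1.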

Pipeline specifications

Software tools: Arlequin, LIAN
Application: Population genetic analysis
Organisms: Plasmodium falciparum, Toxoplasma gondii, Homo sapiens
Diseases: Infection; Malaria; Malaria, Falciparum