Computational protocol: Population Genomic Analysis of Strain Variation in Leptospirillum Group II Bacteria Involved in Acid Mine Drainage Formation

Similar protocols

Protocol publication

[…] For analysis of sequence variation, the final Phrap contigs and aligned reads were imported from Consed into the program Strainer []. Substrains consisting of linked polymorphisms were defined for all large Phrap contigs. A substrain was defined heuristically to consist of two or more linked polymorphisms with Phred scores greater than 20. These were considered to be separate subpopulations from the dominant strain due to the improbability of multiple linked polymorphisms arising simultaneously due to mutation alone. For strain grouping analyses, the composite sequence was corrected to reflect the dominant polymorphism pattern so long as these changes retained SNP grouping defined by reads and their mate pairs. Sequences were grouped based on the presence of SNPs in more than one read (considering only high-quality base calls, with Phred scores > 20).Additionally, custom Perl scripts were developed to identify and classify every polymorphic site in the assembly as synonymous, nonsynonymous, intergenic, or indel. SNPs were classified as replicated or nonreplicated. These counts are summarized in and were used to generate for a Phred quality score cutoff of 25. The effect of different quality score cutoffs on the distribution of these classes was also examined ( and ). In a region spanning approximately one quarter of the genome, indel SNPs were examined in detail for their effect on frameshifts and gene splits ( and ). The location of divergent UBA-type sequences was also mapped for this region. [...] For each substrain consisting of at least four reads, we counted the synonymous and nonsynonymous (replacement) substitutions both within the substrain and between the substrain and the dominant composite sequence (). A custom Perl pipeline was developed to extract aligned reads and a consensus sequence for every gene in the Phrap assembly. Bases with Phrap quality scores less than 25 were masked. This threshold approach treats all bases with higher quality scores as “true” and does not take into account the error probabilities associated with these scores, resulting in a somewhat inflated estimate of the true number of SNPs in the population []. Because substitutions at roughly two-thirds of all sites will result in nonsynonymous substitutions, we would expect a bias towards nonsynonymous SNPs due to sequencing errors. To address this question, we repeated the analysis with different levels of sequence quality score cutoffs (). Additionally, we examined the effect of read depth on all analyses (see and ). Due to common sequencing errors at read ends and the presence of inserted genes, reads diverging from the consensus by more than 2% were eliminated from the analysis. This cutoff also eliminated reads derived from the UBA-type Leptospirillum group II. The two short, deeply sampled regions containing multiple UBA-type reads were analyzed separately in comparison with syntenous CG-type reads.Additional custom Perl scripts were used to calculate synonymous and nonsynonymous polymorphisms within each strain and substrain. Substitutions occurring immediately adjacent to masked bases were not counted due to the high probability of alignment or sequencing errors in these regions. The consensus sequence determined by Phrap was used to determine synonymous and nonsynonymous fixed differences between each substrain and the main strain. The number of synonymous and nonsynonymous sites in each strain was calculated using the program codeml in the PAML package [].The numbers of polymorphisms and fixed substitutions were used to test the hypothesis that substrains are maintained in the population through selection for adaptively significant variants. Two polymorphism-divergence tests were used: the MK test [] and the MKPRF test []. For the purpose of these tests, each substrain is considered to be a distinct population (note that this differs from the terminology used elsewhere in this paper). The rationale behind the MK test is that in the absence of selection, the number of synonymous and nonsynonymous fixed differences between two populations, and the number of synonymous and nonsynonymous polymorphisms within each population, depend only on the mutation rate and time since divergence. The ratio of nonsynonymous to synonymous polymorphisms within populations should therefore be the same as the ratio of nonsynonymous to synonymous fixed differences between populations. Positive selection for nonsynonymous mutations will produce an excess of nonsynonymous fixed substitutions relative to nonsynonymous polymorphisms. The MKPRF test refines this idea, using a population genetic model to explain the distribution of within-population variation via maximum likelihood estimation of mutation rate, divergence time, and a selection parameter. Polymorphism-divergence data were combined for all genes within a strain for the MK and MKPRF tests, whereas individual genes were also analyzed using the MKPRF test. The MKPRF tests were carried out using an online implementation (Computational Biology Service Unit, Cornell University).In addition to the MK test, we also calculated several summary statistics for each strain and substrain as well as the entire assembly. Because of variable coverage depth across the assembly, it was necessary to weight these parameters by the fraction of total sites at a given depth []. The parameter θ = 2N e u, an estimator of the product of population size and mutation rate for a neutrally evolving population, was calculated with Watterson′s infinite sites estimator [] and a finite sites modification []. Estimates for each coverage class were combined to yield a single value. The program “piim” [], which corrects for sequencing error in the estimation of population genetic parameters from metagenomic data, was also used to calculate θ for the entire assembly. Pairwise heterozygosity (π) was calculated as described for a variable coverage assembly of Drosophila simulans [] using custom scripts. Under a neutral model of evolution for a Wright-Fisher population, π and θ are expected to be equal. Tajima′s D measures the discrepancy between these statistics: D T = (π – θ)/C, where C is a normalizing constant calculated from the data []. The statistical significance of D depends on the sample size, but roughly, for sample sizes greater than 7, D is considered significant at the 95% level if it is smaller than −2 or larger than 2 []. D T for strains, substrains, and interstrain regions was calculated with BioPerl population genetics modules []. […]

Pipeline specifications

Software tools Phrap, Consed, Strainer
Application Metagenomic sequencing analysis
Organisms Leptospirillum ferriphilum
Chemicals Nucleotides