Computational protocol: Parallel Tagged Next-Generation Sequencing on Pooled Samples – A New Approach for Population Genetics in Ecology and Conservation

Protocol publication

[…] The obtained NGS reads were initially examined using the Genome Sequencer FLX System Software version 2.3 and its SFF Tools commands (Roche Diagnostics). Only reads that contained complete population-specific barcodes with no mismatches were extracted for later analysis. Extracted reads were analyzed on a per-population basis and assessed for Phred quality scores and read length. The 20 Standard Flowgram Format (sff) files (one for each population) were further processed using CLC Genomics Workbench 4.9 (CLC bio). Reads were mapped to the 16S and Cyt b reference sequences, both of which were generated as consensus sequences from the L. hochstetteri 16S and Cyt b sequences available in GenBank. The CLC mapping parameters were set as follows: mismatch, insertion and deletion cost = 2; length fraction = 0.75; similarity fraction = 0.9. With these settings, the final mappings included only reads that had greater than 90% similarity with the reference over at least 75% of their read length. Note that the CLC Genomics Workbench mapping parameters do not filter reads on read length per se, so reads of various lengths may be included in the final mappings that are then used for SNP detection. The criteria for SNP detection were a minimum central and neighboring Phred quality score of 20, a minimum coverage of 20 reads, and a minimum variant frequency adjusted for each population according to the number of pooled individuals. For example, if a population pool consists of seven individuals, the expected frequency of a single-copy variant/allele for a mitochondrial haplotype is approximately 14% (1 in 7).
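As a rough illustration of the two numeric rules above, the following Python sketch computes the expected frequency of a single-copy variant in a pool and applies a simplified version of the length-fraction and similarity-fraction filter. The function names and the per-read inputs (`aligned_length`, `matches`) are hypothetical, and the thresholds are treated as inclusive by assumption. This is one reading of the description, not CLC Genomics Workbench's actual mapping algorithm, which scores alignments using the mismatch and indel costs.

```python
def expected_single_copy_frequency(n_individuals: int) -> float:
    """Expected frequency of a variant carried by exactly one individual
    in an equimolar pool of haploid (mtDNA) genomes."""
    if n_individuals < 1:
        raise ValueError("pool must contain at least one individual")
    return 1.0 / n_individuals


def passes_mapping_filter(read_length: int, aligned_length: int, matches: int,
                          length_fraction: float = 0.75,
                          similarity_fraction: float = 0.9) -> bool:
    """Simplified filter (assumption): keep a read if at least
    `length_fraction` of it is aligned and the aligned part is at least
    `similarity_fraction` identical to the reference."""
    if read_length <= 0 or aligned_length <= 0:
        return False
    aligned_ok = aligned_length / read_length >= length_fraction
    similar_ok = matches / aligned_length >= similarity_fraction
    return aligned_ok and similar_ok


if __name__ == "__main__":
    # Pool of seven individuals: one copy is expected at about 1 in 7.
    print(f"{expected_single_copy_frequency(7):.1%}")  # 14.3%
    # 400-base read, 320 bases aligned, 300 of them matching the reference.
    print(passes_mapping_filter(400, 320, 300))        # True
```

Note that the filter depends only on fractions of the read, which is why reads of very different absolute lengths can pass it.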
Low-frequency variants/alleles may result from sequencing errors; to be conservative, any SNP variants with a frequency below 5%, approximately our lowest expected level of real haplotypes, were excluded from all populations.

One limitation of the current CLC Genomics Workbench analyses is that they provide information only on the frequencies of variants at individual SNPs among the reads, not on the phase (combination) of these variants across multiple SNPs within the reads. Furthermore, because the final mappings do not typically include only full-length reads, population haplotypes for both the 16S and Cyt b mitochondrial genes were reconstructed by inference, as follows. Each population pool contained equimolar amounts of DNA from a known number of individuals, so the SNP frequencies identified and calculated from our NGS data using CLC Genomics Workbench should approximate the frequencies of those SNP variants in the given population. For example, if, in a population pool of seven individuals, the detected frequency of variants C/T at a given SNP position of a haploid mtDNA gene is about 0.7/0.3, then five individuals likely carry variant C at this position and two carry variant T. Such inference allows reconstruction of population haplotypes; however, given the current limitations of the analyses, the phasing of variants at multiple polymorphic sites (SNPs) within an individual haplotype could not be automated and was therefore resolved by eye using longer reads in the final mappings. The haplotypes were then used in further population genetic analyses.

Estimates of population nucleotide diversity (π) and pairwise population differentiation (FST) were calculated using DnaSP 5.10.01 for 16S haplotypes and Cyt b haplotypes separately, to assess the possibility that differences in the accuracy of our approach are influenced by amplicon length. The pairwise FST values were calculated according to Hudson et al., based on the mean number of differences between sequences sampled from the same population and between sequences sampled from two different populations. As such, the calculated pairwise FST values can be considered estimates of similarity/dissimilarity between populations. Population genetic structure inferred from 16S haplotypes and Cyt b haplotypes was also assessed using an analysis of molecular variance (AMOVA) with 1000 permutations, as implemented in the software Arlequin 3.5. AMOVA calculates analogs of Wright's hierarchical F-statistics, designated as Φ-statistics. […]
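Two of the calculations described above lend themselves to a short Python sketch: converting pooled SNP frequencies into whole numbers of individuals, and a pairwise FST built from within- and between-population mean differences. The function names and toy sequences are hypothetical. The FST form used here, 1 − Hw/Hb with Hw taken as the unweighted mean of the two within-population values, is an assumption about how the Hudson et al. estimator is applied; the text does not state the exact weighting used by DnaSP.

```python
from itertools import combinations, product


def pooled_counts(frequencies: dict, n_individuals: int) -> dict:
    """Round each variant frequency to the nearest whole number of
    individuals in a pool of known size (haploid locus)."""
    counts = {v: round(f * n_individuals) for v, f in frequencies.items()}
    if sum(counts.values()) != n_individuals:
        # Ambiguous rounding: leave the decision to manual inspection.
        raise ValueError("rounded counts do not sum to the pool size")
    return counts


def n_differences(a: str, b: str) -> int:
    """Number of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def mean_within(pop: list) -> float:
    """Mean pairwise differences among sequences of one population."""
    pairs = list(combinations(pop, 2))
    if not pairs:
        raise ValueError("need at least two sequences per population")
    return sum(n_differences(a, b) for a, b in pairs) / len(pairs)


def mean_between(pop1: list, pop2: list) -> float:
    """Mean differences between sequences drawn from two populations."""
    total = sum(n_differences(a, b) for a, b in product(pop1, pop2))
    return total / (len(pop1) * len(pop2))


def hudson_fst(pop1: list, pop2: list) -> float:
    """FST = 1 - Hw/Hb (assumed form; Hw is an unweighted mean)."""
    hw = (mean_within(pop1) + mean_within(pop2)) / 2
    hb = mean_between(pop1, pop2)
    if hb == 0:
        raise ValueError("no differences between populations")
    return 1 - hw / hb


if __name__ == "__main__":
    # Example from the text: C/T at about 0.7/0.3 in a pool of seven.
    print(pooled_counts({"C": 0.7, "T": 0.3}, 7))  # {'C': 5, 'T': 2}
    # Toy haplotypes, one string per individual (made-up data).
    pop_a = ["AAAA", "AAAA", "AAAT"]
    pop_b = ["TTTT", "TTTT", "TTTA"]
    print(hudson_fst(pop_a, pop_b))
```

The count conversion only gives per-site numbers; as the text notes, combining sites into haplotypes still requires phase information from longer reads.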

Pipeline specifications

Software tools: CLC Genomics Workbench, DnaSP, Arlequin
Application: Population genetic analysis
Organisms: Homo sapiens, Leiopelma hochstetteri