Computational protocol: Patterns of Genome Wide Variation in Glossina fuscipes fuscipes Tsetse Flies from Uganda

Protocol publication

[…] Whole genomes of 16 individuals from the NB population were sequenced at the Genome Institute of the Washington University School of Medicine (St Louis, MO) as 100-bp reads on the Illumina HiSeq2000 platform at ∼40× coverage (Table S1). All sequence data for the 16 samples were trimmed with flexbar version 2.4 prior to alignment. To identify synonymous sites for the genetic diversity calculations, the variant effect predictor (VEP) tool within Ensembl was run against the GfusI1.1 gene-build to determine the genomic context and potential function of all identified variants. The software annotates each SNP as belonging to one of several functional categories: stop-gained, stop-lost, frameshift-coding, nonsynonymous-coding, splice-site, promoter, 5′ UTR, 3′ UTR, upstream, downstream, intronic, synonymous-coding, or intergenic. Subsequently, only variants with synonymous effects were retained for each sample in order to compile a list of synonymous variant sites across the genome assembly. Genetic diversity (π) and Tajima’s D values (5000-bp nonoverlapping windows) were calculated using vcftools v. 0.1.12b, and average values were obtained in R. [...]
The raw sequence reads of the ddRAD library were demultiplexed, quality filtered, and filtered for unambiguous barcodes using “process_radtags” from the Stacks software. This dataset was then used to call SNPs against the G. fuscipes reference assembly as described in the next section. To improve coverage for individuals that yielded too few reads in the original ddRAD run because of low sample quality, we combined the ddRAD reads with the WGS data from the 16 individuals of the NB population. Combined, our dataset included 58 individuals, as six of them were sequenced with both technologies (Table S1). The software PGDSpider was used to convert between file formats for downstream analyses. [...]
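The windowed diversity calculation described above for the WGS data can be illustrated with a minimal Python sketch. It is not vcftools' implementation: it assumes biallelic sites with known alternate-allele counts, ignores missing genotypes, and divides by the full window length. The function names and the input tuple layout are hypothetical.

```python
from collections import defaultdict


def site_pi(alt_count, n_chrom):
    """Mean pairwise difference at one biallelic site.

    alt_count: copies of the alternate allele among the sampled chromosomes
    n_chrom:   number of sampled chromosomes (2 x diploid individuals)
    """
    if n_chrom < 2:
        return 0.0
    ref_count = n_chrom - alt_count
    return 2.0 * alt_count * ref_count / (n_chrom * (n_chrom - 1))


def windowed_pi(sites, window_size=5000):
    """Per-bp pi in nonoverlapping windows along one contig.

    sites: iterable of (position, alt_count, n_chrom), positions 1-based.
    Returns {window_start: pi_per_bp}; windows without variants are omitted.
    """
    totals = defaultdict(float)
    for pos, alt_count, n_chrom in sites:
        # First base of the window that contains this position
        start = ((pos - 1) // window_size) * window_size + 1
        totals[start] += site_pi(alt_count, n_chrom)
    return {start: total / window_size for start, total in sorted(totals.items())}


if __name__ == "__main__":
    # Hypothetical toy data: three synonymous sites on one supercontig
    toy_sites = [(10, 1, 4), (4999, 2, 4), (5001, 1, 4)]
    for start, pi in windowed_pi(toy_sites).items():
        print(start, pi)
```

Averaging the per-window values, as done in R in the protocol, then gives a genome-wide estimate; whether empty windows count as zero in that average is a choice this sketch leaves open.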
Polymorphic loci were identified from the combined reads of the 58 individuals (ddRAD and WGS) by mapping them against the 2395 supercontigs of the G. fuscipes GfusI1 reference assembly using Bowtie2 v. 2.1.0 with the “very sensitive” option, and Samtools v. 0.1.19. Variants were called using the bcftools utility from Samtools, and the data were filtered in vcftools v. 0.1.10 based on genotype depth of coverage (DP > 7) and percentage of missing data allowed (< 30%). Only loci genotyped in at least 80% of the samples were included in subsequent analyses. Our final variant call format (VCF) file contained only biallelic SNPs and no indels. Five individuals genotyped at fewer than 80% of SNPs were removed from the analysis, and a second round of filtering was performed on the remaining 53 individuals, including a minor allele frequency filter (MAF > 0.05). All summary statistics, including depth of coverage, SNP density, Hardy-Weinberg equilibrium tests, Tajima’s D, and between-population Fst, were computed using vcftools v. 0.1.12b and processed in R. Tajima’s D values were estimated using a nonoverlapping window size of 1000 bp. Tajima’s D values are indicative of the presence and direction of selection in a region. In general, absolute values above 2 are considered significant, with positive values suggesting balancing selection and negative values suggesting a selective sweep or a bottleneck. […]
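For readers who want to see how a single window's Tajima's D arises, the following is a minimal Python sketch of the standard formula. It is an illustration under stated assumptions, not the code vcftools runs. The inputs are the number of sampled chromosomes, the number of segregating sites in the window, and the summed mean pairwise difference. The function names are hypothetical, and the threshold of 2 is the rule of thumb quoted above.

```python
import math


def tajimas_d(n_chrom, seg_sites, pi):
    """Tajima's D for one window.

    n_chrom:   number of sampled chromosomes
    seg_sites: number of segregating (polymorphic) sites in the window
    pi:        mean number of pairwise differences, summed over the window
    Returns None when D is undefined (no variants, or fewer than 4
    chromosomes, for which the variance term is zero).
    """
    if n_chrom < 4 or seg_sites == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n_chrom))
    a2 = sum(1.0 / i ** 2 for i in range(1, n_chrom))
    b1 = (n_chrom + 1) / (3.0 * (n_chrom - 1))
    b2 = 2.0 * (n_chrom ** 2 + n_chrom + 3) / (9.0 * n_chrom * (n_chrom - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n_chrom + 2) / (a1 * n_chrom) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    # Difference between the two estimators of theta, scaled by its
    # expected standard deviation
    return (pi - seg_sites / a1) / math.sqrt(variance)


def exceeds_threshold(d, threshold=2.0):
    """True when |D| is larger than the rule-of-thumb cutoff."""
    return d is not None and abs(d) > threshold


if __name__ == "__main__":
    # Toy numbers, not taken from the study
    d = tajimas_d(n_chrom=10, seg_sites=16, pi=3.888889)
    print(d, exceeds_threshold(d))
```

With these toy inputs D is about -1.45, which is negative but does not pass the cutoff of 2 in absolute value, so the window would not be flagged.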

Pipeline specifications

Software tools: Flexbar, VCFtools, PGDSpider, Bowtie2, SAMtools, bcftools
Application: WGS analysis