Pipeline publication

[…] y of the species was established by looking at the taxonomical traits of the males produced from isofemale lines (genital arches, number of sex combs) and the female mating choice (i.e., whether they chose D. simulans or D. sechellia in two-male mating trials).

We obtained sequence data from 20 D. simulans inbred lines from NCBI’s Short Read Archive (BioProject number PRJNA215932). We also sequenced wild-caught outbred D. sechellia females (see above) from Praslin (n = 7 diploid genomes), La Digue (n = 7), Marianne (n = 2), and Mahé (n = 7). These new D. sechellia genomes are available on the Short Read Archive (BioProject number PRJNA395473).

For each line we then mapped all reads with bwa 0.7.15, using the BWA-MEM algorithm, to the March 2012 release of the D. simulans assembly produced by Hu et al., and used the accompanying annotation based on mapped FlyBase release 5.33 gene models. Next, we removed duplicate fragments using Picard, before using GATK’s HaplotypeCaller (version 3.7) in discovery mode with a minimum Phred-scaled variant call quality threshold (-stand_call_conf) of 30. We then used this set of high-quality variants to perform base quality recalibration (with GATK’s BaseRecalibrator program), before again using the HaplotypeCaller in discovery mode on the recalibrated alignments. For this second iteration of variant calling we used the --emitRefConfidence GVCF option in order to obtain confidence scores for each site in the genome, whether polymorphic or invariant. Finally, we used GATK’s GenotypeGVCFs program to synthesize variant calls and confidences across all individuals and, by setting the --includeNonVariantSites flag, to produce genotype calls for each site, before inferring the most probable haplotypic phase using SHAPEIT v2.r837. The genotyping and phasing steps were performed separately for our D. simulans and D. sechellia data, and for each step in the pipeline outlined above we used default parameters unless otherwise noted.

In order to remove potentially erroneous genotypes (at either polymorphic or invariant sites), we considered genotypes as missing data if they had a quality score lower than 20, or were heterozygous in D. simulans. After discarding low-confidence genotypes, we masked all sites in the genome missing genotypes for more than 10% of individuals in either species’ population sample, as well as those lying within repetitive elements as predicted by RepeatMasker. Only SNP calls were included in our downstream analyses (i.e., indels of any size were ignored). These phased and masked data are available at […].

Having obtained genotype data for our two population samples, we used ∂a∂i to model their shared demographic history on the basis of the folded joint site frequency spectrum (downsampled to n = 18 and n = 12 in D. simulans and D. sechellia, respectively); using the folded spectrum allowed us to circumvent the step of producing whole genome alignments to outgroup species in D. simulans coordinate space in order to attempt to infer ancestral states. We used an isolation-with-migration (IM) model t […]

Pipeline specifications

Software tools: BWA, Picard, GATK, SHAPEIT, RepeatMasker
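As a rough illustration of how the mapping and variant-calling tools listed here fit together, the sketch below builds (but does not run) command lines for the steps whose options are named in the publication excerpt. It assumes GATK 3.x command-line syntax (`-T <tool>`), paired-end reads, and hypothetical file names and jar path. The Picard, BaseRecalibrator, SHAPEIT and RepeatMasker steps are left out because the excerpt gives no options for them.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"  # hypothetical path to a GATK 3.7 jar


def bwa_mem(reference: str, reads_1: str, reads_2: str) -> List[str]:
    """Map reads with the BWA-MEM algorithm (paired-end input is an assumption)."""
    return ["bwa", "mem", reference, reads_1, reads_2]


def haplotype_caller(reference: str, bam: str, out: str,
                     gvcf: bool = False) -> List[str]:
    """First pass: call quality threshold of 30. Second pass: per-site GVCF output."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "HaplotypeCaller",
           "-R", reference, "-I", bam, "-o", out]
    if gvcf:
        cmd += ["--emitRefConfidence", "GVCF"]
    else:
        cmd += ["-stand_call_conf", "30"]
    return cmd


def genotype_gvcfs(reference: str, gvcfs: List[str], out: str) -> List[str]:
    """Joint genotyping across individuals, emitting invariant sites as well."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["--includeNonVariantSites", "-o", out]
    return cmd


# Example with hypothetical file names; print the commands instead of running them.
commands = [
    bwa_mem("dsim.fa", "line1_R1.fq", "line1_R2.fq"),
    haplotype_caller("dsim.fa", "line1.dedup.bam", "line1.raw.vcf"),
    haplotype_caller("dsim.fa", "line1.recal.bam", "line1.g.vcf", gvcf=True),
    genotype_gvcfs("dsim.fa", ["line1.g.vcf", "line2.g.vcf"], "simulans.all_sites.vcf"),
]
for command in commands:
    print(" ".join(command))
```

Keeping each command as a list of arguments means it can be passed directly to `subprocess.run` without shell quoting issues, should one want to execute the steps.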