Computational protocol: Integrating restriction site-associated DNA sequencing (RAD-seq) with morphological cladistic analysis clarifies evolutionary relationships among major species groups of bee orchids

Similar protocols

Protocol publication

[…] We optimized a bioinformatics pipeline for maximizing the recovery of loci across the coverage variation and phylogenetic depth present within our data, making use of the paired-end reads available. This was a dynamic process, and in the following account we present only the final analytical pipeline, optimized primarily on the number of variable sequence positions in the data and on the bootstrap support in the resulting trees. Given the vast number of parsimony-informative single nucleotide polymorphisms (SNPs) generated via RAD-seq, the bootstrap support and even the topologies resulting from various analytical approaches proved to be surprisingly sensitive to experimental parameters, most notably the permitted levels of missing individuals per site.The raw Illumina paired-end reads were demultiplexed and quality filtered using the program ‘process_radtags’ from the suite STACKS v.1.35 (), rescuing barcodes and RAD tags with a maximum of one mismatch. Next, the overlapping pairs of reads (i.e. originating from DNA fragments of a length smaller than twice the read length) were merged using FLASH v.1.2.11 () under default settings. Only a subset of the pairs could be overlapped through this approach. We then employed the resulting contigs of variable length (i.e. between 96 and 180 bp) to build a catalogue of loci using pyRAD v.3.0.63 (). This software package allows for indels (expected in the phylogenetic framework of our study) and for unequal length of input sequences (resulting from overlapping read pairs) when clustering orthologous loci. The extended contigs were clustered with pyRAD using an 85 % similarity threshold to create RAD tags, by retaining only those loci with a minimum depth of coverage of six at each site and that were present in at least 60 % of the samples.The longest individual contig was selected from each cluster for promotion to the final reference, which was encoded as a genome with as many ‘chromosomes’ as contigs. This construct has been further used to call variants based on all pairs of reads (i.e. overlapping or non-overlapping) generally following the Best Practices recommendations for DNA sequencing (; ) for Genome Analysis Toolkit (GATK) v.3 (), but without marking and removing PCR duplicates. These are impossible to distinguish in RAD-seq data sets, given the consistent start of reads at the restriction cut site and, specific for our analysis, the frequent mapping stopping point at the 3’ end of the reference contigs.After the reads were mapped with the MEM algorithm of BWA v.0.7.15-r1140 (), the BAM files have been processes by sorting them by queryname and adding read groups with Picard tools v.2.1.0 (available from The IndelRealigner module from GATK v3.6-0-g89b7209 was used to improve local alignments around indels. Variants were further called for each sample in the GVCF mode of the GATK HaplotypeCaller to generate an intermediate gVCF. Next, we processed all samples in the cohort in a joint genotyping analysis with the GenotypeGVCFs module of GATK, employing the minimum phred-scaled confidence threshold of ten at which variants should be called. Thus, we followed GATK best practices recommendations for DNAseq. After retaining only SNPs with the SelectVariants module of GATK, the variants were further filtered out if any one of the following three criteria was fulfilled: (1) the quality normalized by the coverage (QD) was <2; (2) the Phred-scaled P-value for the Fisher’s exact test to detect strand bias (FS) was >60; or (3) the root mean square of mapping quality across all samples (MQ) was <40. [...] The resulting vcf file was further filtered using vcftools v.0.1.14 (), set to retain only those SNPs that are covered in at least 60 % of the individuals and show a minimum minor allele frequency of 0.065. Two vcf files were produced employing these settings: the first including all 34 individuals and the second omitting the two outgroup accessions (Steveniella and Serapias). The filtered vcf files were converted to phylip format by concatenating the SNP positions with PGDSpider v. (), summarizing heterozygosities as IUPAC ambiguities.The ingroup-only file was further used to produce a phylogenetic network in SPLITSTREE v.4.3.11 (). Splits were created from uncorrected-p distances and visualized as a neighbour net within which each end node represents an individual accession.A maximum likelihood (ML) phylogenetic tree with 1000 rapid bootstrap inferences, a GTR substitution matrix and GAMMA model of rate heterogeneity was inferred based on the all-individuals data set using RAxML v.8.2.9 (). The analysis was run using ascertainment bias correction (ASC), given that our data set contained only concatenated informative SNP positions. Prior to the analysis, RAxML demanded the removal of 90 ‘invariable’ sites from the data set that represented exclusively heterozygote polymorphism. The RAxML results were graphically visualized and edited in FigTree v.1.4.2 (available from The same matrix was subsequently subjected to Bayesian tree-building using MrBayes (analytical details are provided in the legend to ). […]

Pipeline specifications

Software tools VCFtools, PHYLIP, PGDSpider, SplitsTree, RAxML, FigTree, MrBayes
Application Phylogenetics
Organisms Apis mellifera