Computational protocol: A genome-wide phylogeny of jumping spiders (Araneae, Salticidae), using anchored hybrid enrichment

Protocol publication

[…] Specimens were preserved in 95% ethanol and stored for between two months and 10 years before use. DNA extractions were done using the Qiagen DNeasy Blood & Tissue Kit, following the protocol for <10 mg samples. The second through fourth pairs of legs were used if they provided sufficient sample volume; otherwise, the carapace and sometimes the distal part of the abdomen were added.

Library preparation, enrichment, and sequencing were conducted at the Center for Anchored Phylogenomics at Florida State University. After extraction, up to 500 ng of each DNA sample was sonicated to a fragment size of ~300–800 bp using a Covaris E220 ultrasonicator. Indexed libraries were then prepared following Meyer and Kircher (2010), but with modifications for automation on a Beckman-Coulter Biomek FXp liquid-handling robot (see for details). Size selection was performed after blunt-end repair using SPRIselect beads (Beckman-Coulter Inc.; 0.9x ratio of bead to sample volume). Indexed samples were pooled in equal quantities (16 samples per pool), and each pool was then enriched using the AHE Spider Probe kit v1 developed by and a modified v2 (Hamilton et al. unpublished), which has been refined to yield greater enrichment within araneomorph spiders than the original version. After enrichment, the two enrichment reactions were pooled in equal quantities and sequenced on one PE150 Illumina HiSeq 2500 lane at the Florida State University Translational Science Laboratory in the College of Medicine.

Prior to assembly, overlapping paired reads were merged following . For each read pair, the probability of obtaining the observed number of matches by chance was evaluated for each possible degree of overlap. The overlap with the lowest probability was chosen if its p-value was less than 10^-10, a stringent threshold that helps avoid chance matches in repetitive regions (see for details).
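The overlap test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binomial model in which two unrelated bases match with probability 0.25, and the function names and the `min_overlap` parameter are hypothetical.

```python
from math import comb

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def match_tail_prob(n_overlap, n_matches, p_match=0.25):
    """P(X >= n_matches) for X ~ Binomial(n_overlap, p_match).

    p_match = 0.25 assumes uniform base composition (an assumption).
    """
    return sum(
        comb(n_overlap, k) * p_match**k * (1 - p_match) ** (n_overlap - k)
        for k in range(n_matches, n_overlap + 1)
    )


def best_overlap(read1, read2, min_overlap=10, threshold=1e-10):
    """Return (overlap_length, p_value) for the most significant overlap.

    Returns None when no overlap has a p-value below the threshold,
    in which case the pair would be left unmerged.
    """
    r2 = reverse_complement(read2)  # put read 2 on the same strand as read 1
    best = None
    for n in range(min_overlap, min(len(read1), len(r2)) + 1):
        # Compare the last n bases of read 1 with the first n bases of read 2.
        matches = sum(a == b for a, b in zip(read1[-n:], r2[:n]))
        p = match_tail_prob(n, matches)
        if best is None or p < best[1]:
            best = (n, p)
    if best is not None and best[1] < threshold:
        return best
    return None
```

Under this model a perfect 20-base overlap has a chance probability of 0.25^20 (about 9e-13) and passes the 10^-10 threshold, whereas a perfect 16-base overlap (about 2e-10) does not.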
Read pairs failing to merge were still used in the assembly but left unmerged.

Divergent reference assembly was used to map reads to the probe regions and extend the assembly into the flanking regions (see and for details). For this analysis, the Aphonopelma, Aliatypus, Ixodes, and Hypochilus sequences were used as references. Preliminary matches were called if at least 17 of 20 spaced-kmer bases matched, and were confirmed if at least 55 of 100 consecutive bases matched. Assembly contigs derived from fewer than 23 reads were removed to reduce the effects of cross-contamination and rare sequencing errors in index reads.

Orthology was determined among the homologous consensus sequences at each locus following and . Pairwise distances among homologs were computed for each locus based on the percentage of shared continuous and spaced 20-mers. Sequences were clustered by distance using a neighbor-joining algorithm, allowing at most one sequence per species in a given cluster. To reduce the effects of missing data, clusters containing fewer than 50% of the species were removed from downstream processing. This assessment yielded 492 orthologous clusters (loci).

For all samples except Tisaniba, the nHomologs statistic presented in the Supplementary Table shows a value near 1, indicating that approximately one homolog was recovered by the assembler at each locus. This indicates that recent gene duplication and loss is very low in this group, and that our results are not compromised by the deep arachnid whole-genome duplication. It also indicates that the individuals whose DNA was pooled for each species were quite similar (the assembler interpreted any differences as allelic differences).
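As a small worked illustration of how a statistic like nHomologs can be read, the sketch below assumes it is the mean number of consensus sequences recovered per locus for a sample. The text does not define the statistic precisely, so this definition and the function names are assumptions.

```python
def n_homologs(homologs_per_locus):
    """Mean number of homologous consensus sequences per locus (assumed definition)."""
    counts = list(homologs_per_locus)
    if not counts:
        raise ValueError("no loci supplied")
    return sum(counts) / len(counts)


def fraction_of_loci_with_two(mean_value):
    """Fraction of loci with two homologs.

    Only valid under the assumption that every locus has either
    one or two homologs, so the excess over 1 is that fraction.
    """
    return mean_value - 1.0


# Hypothetical sample: 10 loci, 3 of which yielded two homologs.
example_counts = [1, 1, 2, 1, 1, 2, 1, 1, 2, 1]
example_mean = n_homologs(example_counts)
```

Under these assumptions, a value near 1 means nearly every locus yielded a single homolog, and the excess over 1 is the share of loci at which a second homolog was separated out.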
Tisaniba is the exception: it had an elevated nHomologs value of 1.71, meaning that at 71% of the loci, two homologs were identified and separated into different consensus sequences. For these loci, the orthology method would choose the consensus sequence most similar to those of the most similar relatives, and likely removed the other consensus from downstream analysis.

Sequences in each orthologous cluster were aligned using MAFFT v7.023b, using the --genafpair and --maxiterate 1000 flags. The alignment for each locus was then trimmed/masked using the steps described in . First, each alignment site was identified as “conserved” if the most commonly observed character was present in > 50% of the sequences. Second, each sequence was scanned for regions that did not contain at least 10 of 20 characters matching the common base at the corresponding conserved site. Characters from regions not meeting this requirement were masked. Third, sites with fewer than 23 unmasked bases were removed from the alignment. Geneious version 7 was used to visually inspect each masked alignment and to remove regions of sequences identified as obviously misaligned or paralogous. Trimming resulted in some loci being deleted, yielding a final total of 447 loci. This represents a higher success rate than Hamilton et al. (2016), whose study had greater breadth, across all spiders, and used an older probe set.

In preparation for phylogenetic analyses, the 447 trimmed AHE loci were re-aligned individually with MAFFT version 7.058b using the L-INS-i option (--localpair --maxiterate 1000). Although assigning codon positions could have allowed better model partitioning in the phylogenetic analysis, we were unable to do so because the loci are often relatively short (average about 560 bases; see Supplementary Table) and we lack a well-annotated reference transcriptome.
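The three trimming/masking steps described above (conserved-site identification, the 10-of-20 window scan, and removal of sparse sites) can be sketched as follows. This is a hedged illustration rather than the authors' code. It assumes that windows slide over 20 consecutive alignment columns, that gaps count as non-matching, and that masked characters are written as N. All function and parameter names are hypothetical.

```python
from collections import Counter

MISSING = set("-?N")  # gap, unknown, and masked characters


def conserved_consensus(alignment, min_fraction=0.5):
    """Per column: the most common base if present in > min_fraction of sequences, else None."""
    n_seqs = len(alignment)
    consensus = []
    for column in zip(*alignment):
        bases = [c for c in column if c not in MISSING]
        if not bases:
            consensus.append(None)
            continue
        base, count = Counter(bases).most_common(1)[0]
        consensus.append(base if count > min_fraction * n_seqs else None)
    return consensus


def mask_sequence(seq, consensus, window=20, min_match=10, mask_char="N"):
    """Mask every window with fewer than min_match matches to conserved sites."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        matches = sum(
            consensus[i] is not None and seq[i] == consensus[i]
            for i in range(start, start + window)
        )
        if matches < min_match:
            for i in range(start, start + window):
                if seq[i] not in MISSING:
                    masked[i] = mask_char
    return "".join(masked)


def drop_sparse_columns(alignment, min_unmasked=23):
    """Remove columns with fewer than min_unmasked unmasked bases."""
    kept = [
        col for col in zip(*alignment)
        if sum(c not in MISSING for c in col) >= min_unmasked
    ]
    if not kept:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*kept)]


def mask_alignment(alignment, window=20, min_match=10, min_unmasked=23):
    """Apply the three steps in order to a list of equal-length aligned strings."""
    consensus = conserved_consensus(alignment)
    masked = [mask_sequence(s, consensus, window, min_match) for s in alignment]
    return drop_sparse_columns(masked, min_unmasked)
```

The defaults mirror the thresholds quoted in the text (more than 50%, 10 of 20, and 23). They are exposed as parameters so the logic can be exercised on small toy alignments.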
Our attempts to assign codon positions via TransDecoder version 3.0.1 yielded unrealistic results for many loci, and so we left codon positions unassigned. [...]

We inferred the phylogeny for the 46 taxa using Maximum Likelihood, parsimony, and SVDQuartets applied to a concatenated supermatrix of the 447 aligned loci, and using ASTRAL (a coalescent-based approach, like SVDQuartets) applied to ML-reconstructed gene trees of the 447 separate loci.

Two Maximum Likelihood (ML) analyses on the concatenated matrix were performed using RAxML version 8.2.8. One left the matrix unpartitioned. The other used partitions chosen by PartitionFinder version 1.1.1, based on an initial partition by locus. PartitionFinder grouped the loci via a relaxed clustering algorithm, assuming linked branch lengths and evaluating 10% of schemes at each step according to BIC score. We used relaxed clustering because, for large datasets such as ours, it has been demonstrated to produce results consistently comparable to those of a greedy algorithm but with much greater computational efficiency. The best scheme according to our PartitionFinder analyses grouped loci into 21 partitions. Both maximum likelihood analyses assumed the GTR+gamma+I model.

We present as our primary result the best-scoring ML tree from the partitioned supermatrix and 200 search replicates. Robustness of clade support was explored by a bootstrap analysis with 1000 replicates, in each of which 5 search replicates were done.

Parsimony bootstrap analysis was performed with PAUP* version 4.0a151, with 1000 replicates, for each of which we used TBR branch rearrangement, multrees, maxtrees = 100, and 2 search replicates.

We also used two methods based on the multi-species coalescent model to infer the species phylogeny: SVDQuartets and ASTRAL II. SVDQuartets was performed with PAUP* version 4.0a150 using exhaustive quartet sampling and 1000 bootstrap replicates.
The ASTRAL analysis was performed with ASTRAL version 4.7.12 using default settings, based on the 447 gene trees, one from each locus, obtained by RAxML version 8.2.8 from a simple ML search (model GTRGAMMA, unpartitioned). […]
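The concatenated supermatrix and the initial partition by locus mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It assumes that taxa absent from a locus are filled with gap characters, and the function name and input layout are hypothetical. The partition lines follow the `DNA, name = start-end` style used in RAxML partition files.

```python
def concatenate_loci(loci, missing="-"):
    """Concatenate per-locus alignments into a supermatrix.

    loci: dict mapping locus name -> dict mapping taxon -> aligned sequence.
    Returns (supermatrix, partitions), where supermatrix maps each taxon to
    its concatenated sequence and partitions is a list of partition lines.
    """
    taxa = sorted({taxon for aln in loci.values() for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1  # partition coordinates are 1-based and inclusive
    for name, aln in loci.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {name!r} is empty or not aligned")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa missing from this locus are padded with the missing character.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append(f"DNA, {name} = {start}-{start + length - 1}")
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical toy input with two short loci and three taxa.
toy_loci = {
    "L1": {"taxonA": "ACGT", "taxonB": "ACGA"},
    "L2": {"taxonA": "GG", "taxonC": "GT"},
}
toy_matrix, toy_partitions = concatenate_loci(toy_loci)
```

A by-locus partition list of this kind is the sort of starting scheme that a tool such as PartitionFinder could then merge into fewer partitions.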

Pipeline specifications

Software tools: MAFFT, Geneious, TransDecoder, SVDquartets, RAxML, PartitionFinder
Applications: Phylogenetics, Population genetic analysis
Chemicals: Nucleotides