Computational protocol: Resolving Cypriniformes relationships using an anchored enrichment approach

Similar protocols

Protocol publication

[…] Although the Anchored Hybrid Enrichment kit developed for vertebrates by Lemmon et al. [] contains a fish reference (Danio) and has been utilized in teleosts with moderate success [], we desired an enrichment tool more efficient and appropriate for phylogenomics in teleosts. Because of the complex nature of teleost genome evolution, which involved multiple whole-genome duplications and lineage-specific gene losses [], it is impractical to identify a set of loci that are truly single-copy across all of Teleostei. Previous studies claiming to have identified single-copy loci in teleosts (e.g. []) likely only identified loci that were single-copy in the species they considered; evaluation of those loci in additional teleost lineages suggests that these loci are not universally single-copy (see below). Consequently, we aimed to target loci containing up to four gene copies in each of three diverse lineages of teleosts: zebrafish, platyfish, and cichlids.Candidate target regions for Teleostei were derived by combining the 394 Vertebrate Anchor (v2) loci of Prum et al. [] and the 135 loci identified as Fugu-Danio single-copy orthologs by Li []. For the vertebrate anchor loci, teleost orthologs were obtained for Danio rerio (danRer7) using the human (hg19) coordinates and the USCS genome browser batch-coordinate (liftover) tool []. For the Fugu-Danio orthologs, orthologous human (hg19) and chicken (galGal3) coordinates were obtained using the USCS liftover tool and the Danio coordinates identified by Li []. Once the coordinates for Danio, Homo, and Gallus were obtained for all 529 candidate target regions, sequences corresponding to those regions [plus sufficient flanking region to obtain up to 3000 base pairs (bp) total] were extracted from the genomes and aligned by locus using MAFFT [], v7.023b with “–genafpair” and “–maxiterate 1000” flags. The alignments were then used to generate a Danio-specific reference database containing spaced 20-mers. The Danio reference was then used to identify homologous regions in the genomes of zebrafish (Cypriniformes: Cyprinidae: Danio rerio; danRer7), platyfish (Cyprinodontiformes: Poeciliidae: Xiphophorus maculatus [], and cichlid (Perciformes, Cichlidae: Maylandia zebra; []).As expected, we obtained multiple homologs for many of the candidate loci (only 64 loci were single copy in all three species). Consequently, only 277 loci had fewer than five homologs per species and were considered further. We aligned with MAFFT [], v7.023b with “–genafpair” and “–maxiterate 1000” flags) all homolog sequences (up to 12 per locus) for each of the 277 candidates together with the homologous human probe region sequence from the Vertebrate Anchor (v2) design. Alignments were then manually inspected for misplaced and grossly misaligned sequences, which were removed. Finally, alignments were trimmed to include regions best suited for Anchored Hybrid Enrichment (conserved, low-gap, high taxon representation), taking care that the chosen region contained the human probe region. A total of 260 loci were retained.Finally, in order to ensure efficient enrichment, we checked for high-copy regions (e.g. microsatellites and transposable elements) in each of the three teleost references as follows. First, a database was constructed for each species using all 15-mers found in the trimmed alignments for that species. We also added to the database all 15-mers that were 1 bp removed from the observed 15-mers. The genome for the species was then exhaustively scanned for the presence of these 15-mers and matches were tallied at the alignment positions at which the 15-mer was found. Alignment regions containing > 100,000 counts in any of the three species were masked to prevent probe tiling across these regions. Probes of 120 bp were tiled uniformly at 5.5× tiling density. [...] Reads were quality filtered using Illumina’s Casava software with the chastity filter set to high. In order to increase read length and accuracy overlapping reads were then merged following Rokyta et al. []. Non-overlapping read pairs were kept separate but still used in the assembly. All reads were then assembled into contigs following Prum et al. [] using mapping references derived from the zebrafish, platyfish, and cichlid sequences used for probe design. This assembler produces separate contigs for gene copies differing by more than 5 % sequence divergence. To reduce errors caused by low-level indexing errors during sequencing, contigs were then filtered by removing those derived from fewer than 50 reads. Additional file : Table S2 provides a summary of the sequence data collected and assemblies that resulted.Sets of homologs were produced by grouping by target locus (across individuals) and the filtered consensus sequences. Orthology was then determined for each target locus as follows. First, a pairwise distance measure was computed for pairs of homologs, with distance being computed as the percentage of 20-mers observed in the two sequences that were found in both sequences. A neighbor-joining clustering algorithm was then used to cluster the consensus sequences in to orthologous sets, with at most one sequence per species in each orthologous set (see [] for details). In order to minimize the effects of missing data, clusters containing fewer than 130 (72 %) of the species were removed from downstream processing.Sequences in each orthologous set were aligned using MAFFT v7.023b [] with --genafpair and --maxiterate 1000 flags. In order to remove poorly aligned regions raw alignments were then trimmed and masked following Prum et al. [], with the following adjustments: sites with > 50 % similarity were identified as good, 20 bp regions containing < 14 good sites were masked, and sites with fewer than 30 unmasked bases were removed from the alignment.For all phylogenetic analyses, sequences from the gymnotiform, siluriform, and characiform species were used as the outgroup. For the concatenated dataset, the alignment was partitioned by locus and the phylogeny estimated using RAxML using GTR+ Γ model with 500 bootstrap replicates. For the species tree analysis, a maximum likelihood phylogeny was estimated with 100 bootstrap replicates for each of the separate loci using RAxML with GTR+ Γ model assumed. We then used the RAxML bootstrap trees to estimate a species tree using STAR [] with default parameters using STRAW []. ASTRAL-II (v4.10.2) [] was also used for species tree inference using the gene trees and their 100 bootstrap replicates. We performed 100 replicates of multi-locus bootstrapping.To test our analyses against previous morphological hypotheses, we re-examined the datasets in Conway [] and Britz et al. [] by running 1000 replicates of a heuristic search in PAUP* []. We traced the characters in Mesquite v.3.04 []. We also performed Bayesian analyses on these morphological datasets under the Mk + Γ model in mrBayes 3.2 [], which has been demonstrated to perform better than parsimony due to rate heterogeneity in character evolution []. Estimating rate heterogeneity can be biased by sampling only variable or parsimony-informative characters, so we analyzed the data with correction for parsimony-informative characters for the Conway [] dataset and variable characters for the Britz et al. [] datasets (one character in these datasets was not parsimony-informative). For each dataset, we ran MCMC with two runs of four chains for 1,000,000 generations, sampling every 1,000. We assessed convergence using Tracer v1.5 []. […]

Pipeline specifications

Software tools liftoveR, MAFFT, BaseSpace, RAxML, ASTRAL, Mesquite, MrBayes
Applications Phylogenetics, Genome data visualization
Organisms Cyprinus carpio