Computational protocol: Pseudoscorpion mitochondria show rearranged genes and genome-wide reductions of RNA gene sizes and inferred structures, yet typical nucleotide composition bias

Protocol publication

[…] We used two taxa for this study, Pseudogarypus banksi from the pseudoscorpion superfamily Feaelloidea and Paratemnoides elongatus from the superfamily Cheliferoidea. Phylogenetic analyses of molecular data place the superfamily Feaelloidea as sister to the other pseudoscorpions []; therefore these taxa are from substantially divergent lineages. The specimens were a gift from Jeff Shultz, who first identified them; they were subsequently identified by Mark Harvey, an expert on pseudoscorpions.

Total genomic DNA was extracted from the legs of one adult of each species using the Qiagen™ DNeasy kit. A region of the mtDNA cytochrome oxidase 1 gene was amplified with the primers HCO2198 and LCO1490 [], and a region of cytochrome b was amplified with the primers CytbF and CytbR []. The PCR products were cleaned and sequenced, and primers were designed that faced outward from these regions. The Paratemnoides-specific primers we designed were Paret3CO1-UF (5'-CTC TGT TTG TAT GGT CCG TG-3') paired with ParetCytb2-LR (5'-GTT TGA TAC TGC AAA GTT TCC TC-3'), and ParetND4L-LR (5'-ACA TAG AAA TTA ATA AAC CAA CCA C-3') paired with ParetCO2-LR (5'-GTA AAA CTA TAT TAT TAA ATG TGT G-3'). The Pseudogarypus-specific primers we designed were PseCO1-UF (5'-CTG TAT TAG CGG GAG CAA TCA CCA T-3') paired with PseCob-LR (5'-GGG GGT GAG TAT AGG GTT GGC TTC-3'), and PseCO1-LR (5'-GTC CAC CCT GTT CCA CAT CCT ATC TC-3') paired with Pse2Cob-UF (5'-ACT CAC CCC CAC CCA TAT TAA ACC-3').

Long PCR amplification of the two halves of the genomes was performed using the Takara™ LA Taq DNA polymerase kit. A 100-μl reaction contained final concentrations of 0.16 mM of each dNTP, 0.4 mM of each primer, 1× Takara™ polymerase buffer, 1 μl of mitochondrial DNA (concentration not determined) and 2.5 Units of Takara™ polymerase. The reactions were cycled at 92°C for 30 sec, 50-58°C (depending on the primers) for 25 sec, and 68°C for 12 min, for 37 cycles, followed by a final extension at 72°C for 15 min.
The PCR products were electrophoresed in a 0.8% agarose gel to estimate size and concentration, cleaned, and resuspended in water.

The long PCR products were processed and sequenced at the DOE Joint Genome Institute, using methods previously described []. Sequences were processed using Phred, trimmed for quality, and assembled using Phrap. Quality scores were assigned automatically, and the electropherograms and assembly were viewed and verified for accuracy using Sequencher™ (GeneCodes).

[...] Protein-encoding genes were identified by similarity of inferred amino acid sequences to those of other arthropod mtDNAs. Once the protein-coding gene boundaries had been determined, the remaining regions were searched for tRNAs with the use of the program tRNAscan-SE 1.21 []. Any regions still not identified as coding for a gene were searched for conserved tRNA anticodon motifs. Potential tRNA genes were compared to the tRNA gene sequences from other chelicerates, using the methods outlined by Masta and Boore [].

Ribosomal RNA gene locations were inferred based on sequence similarity to other chelicerates, and by inferring regions of conserved secondary structures of the SSU rRNA and LSU rRNA. The entire secondary structures of the LSU rRNAs for both pseudoscorpions were inferred by comparisons with conserved regions in Archaea, Bacteria, and Eucarya [] and by using the mt LSU rRNA structure from the harvestman Phalangium opilio []. The rRNA helices were numbered following the scheme of Wuyts et al. [].

[...] We added the new pseudoscorpion sequences to our existing alignments of mt genome sequences of chelicerates. Additionally, sequences from other arthropod mt genomes were downloaded from GenBank. These included sequences from Myriapoda and Pancrustacea, which were used to root the phylogenetic trees.
A full list of taxa used in this study, along with their mt genome GenBank numbers, is provided in Additional file : Table S1.

Using the annotated gene boundary information, protein-coding genes were individually excised from the genomic sequence and put into an individual file for each gene. The program SeaView [] was used to view the sequences as amino acids as they would be translated with the invertebrate mitochondrial genetic code. For three of the taxa available on GenBank, some genes likely contained sequencing errors, as stop codons were present within the genes. We corrected this by adding extra "N" characters to place the nucleotide sequences into the correct reading frame. For these taxa and genes (Stylochyrus rarior CO3 and ND4, Bothropolys sp. Cytb, Daphnia pulex CO1), the reading frame was considered correct when no internal stop codons were found and the amino acid sequences appeared relatively similar to the others in the alignment. CLUSTAL W version 2.0.12 [] was used to align each of these 13 genes, using the Gonnet series protein matrix, with a gap opening penalty of 10 and a gap extension penalty of 0.2. The nucleotide sequences were then aligned using the amino acid alignment information, using a scripted pipeline.

In an effort to ensure that only homologous regions of the sequence alignments were used in phylogenetic analyses, the program Gblocks [] was used to remove regions that were ambiguously aligned or had poorly conserved amino acids. This method has been shown to improve phylogenetic signal when used in conjunction with maximum likelihood methods []. The aligned amino acids were trimmed with Gblocks version 0.91b, using default parameters, with the exceptions of "type of sequence", which was set to "codons", and "allowed gap positions", which was set to "with half". This latter setting allowed sites that are without gaps in at least half the taxa to be retained.
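The alignment-preparation steps in this passage (translation with the invertebrate mitochondrial code, the internal stop codon check, codon-level alignment guided by the protein alignment, and the half-gap column rule) can be illustrated with a minimal Python sketch. This is not the authors' scripted pipeline, which is not described in the source. All function names are hypothetical, the codon table is NCBI translation table 5, and the column filter reproduces only the gap criterion as worded above, not the full Gblocks block-selection procedure.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 5 (invertebrate mitochondrial), codons in the
# order TTT, TTC, TTA, TTG, TCT, ... ; "*" marks a stop codon.
TABLE5 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), TABLE5)}


def translate(nt):
    """Translate in frame 1; codons containing N (or other symbols) become X."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3  # ignore a trailing partial codon
    return "".join(CODON_TO_AA.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def has_internal_stop(nt):
    """True if a stop codon occurs anywhere before the final codon."""
    return "*" in translate(nt)[:-1]


def thread_codons(aligned_protein, nt):
    """Place unaligned codons onto a gapped protein alignment row."""
    residues = len(aligned_protein.replace("-", ""))
    if len(nt) != 3 * residues:
        raise ValueError("nucleotide length does not match protein length")
    codons = iter(nt[i:i + 3] for i in range(0, len(nt), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


def keep_half_gap_columns(alignment):
    """Keep columns that are ungapped in at least half of the sequences."""
    n = len(alignment)
    keep = [j for j in range(len(alignment[0]))
            if 2 * sum(seq[j] != "-" for seq in alignment) >= n]
    return ["".join(seq[j] for j in keep) for seq in alignment]


if __name__ == "__main__":
    print(translate("ATGTGAAGATAA"))             # TGA reads as Trp, AGA as Ser
    print(thread_codons("M-WS", "ATGTGAAGA"))
    print(keep_half_gap_columns(["A-C-", "AG--", "A---", "ATC-"]))
```

A sequence whose translation fails `has_internal_stop` in every frame is a candidate for the "N" padding described above. The sketch only detects the problem and does not choose where to insert the padding.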
After Gblocks trimming of each of the 13 protein-coding sequence alignments, the 13 datasets were concatenated into a single alignment file that was used in subsequent phylogenetic analyses.

[...] Phylogenetic analyses using maximum likelihood were performed on the Gblocks-reduced amino acid sequence alignments, as implemented in the program RAxML 7.2.8 []. We employed several different models of evolution in different analyses. In one set of analyses, the general time-reversible (GTR) model was used, with the gamma-distributed model for rate heterogeneity. Other analyses used the mtART [] or the mtREV models of evolution with the gamma-distributed model for rate heterogeneity. These models were selected because they had previously been found to perform well with arthropod mitogenomic data [,]. For each of these different models of evolution, we performed an additional likelihood analysis that also estimated the proportion of invariable sites (I). For 5 of these different models of evolution, 10 replicate runs were performed. Due to time constraints, only 2 runs were performed for the GTR + G + I model of evolution. After each analysis, a majority-rule consensus of the 10 best trees was constructed using SumTrees 3.0 [].

For each model of evolution employed, 1000 bootstrap replicates were performed in 10 separate runs, with 100 bootstrap replicates in each run. Each bootstrap analysis was conducted with random seed values.

Bayesian analyses were performed using PhyloBayes 3.2f []. A site-homogeneous model was used for site-specific frequencies in one set of analyses. The same three models of evolution (GTR, mtREV, and mtART) as used in likelihood analyses were used for Bayesian inference. However, in PhyloBayes analyses, site-specific rates across sites were modeled using a Dirichlet process. Each run consisted of two separate chains of at least 3 million generations, with trees sampled every 100 generations.
Burn-in was calculated after one-fourth of the trees were produced, with the remaining trees used to produce a consensus tree and calculate posterior probabilities. We also performed analyses using the CAT site-heterogeneous mixture model, as suggested by Lartillot et al. []. We ran 34 chains in PhyloBayes, applying the CAT model for both frequency and rate site-heterogeneity. Each chain was run for 300,000 to 500,000 generations, until the chains stabilized.

During exploratory analyses to determine whether certain taxa on long branches influenced our phylogenetic results, we tried removing or adding specific taxa to our analyses. Each time a taxon was added or removed from the dataset, the entire alignment and Gblocks trimming process as described above was repeated, followed by phylogenetic analyses.

To overcome the potential problems that nucleotide skew can have in creating misleading phylogenetic inferences, multiple analyses were undertaken to counter the effect at both the nucleotide and amino acid levels. Four recoded datasets were generated. We recoded mitochondrial nucleotides as either purines or pyrimidines (RY recoding), following the suggestions of Phillips and Penny []. For one dataset we RY recoded at 3rd codon positions only, and for another dataset, we RY recoded 1st and 3rd codon positions. We also implemented a variation on RY recoding as proposed by Hassanin et al. [], termed the Neutral Transitions Excluded (NTE) method, whereby selected codons are RY recoded. For all RY recoding, maximum likelihood analyses were conducted in RAxML using a GTR plus gamma model of evolution.

In a fourth recoded dataset, we aimed to overcome possible saturation and effects of amino acid skew by recoding amino acids into physicochemical groups.
We categorized amino acids into the following six functional groups: hydrophobic (valine, leucine, isoleucine, and methionine); aromatic (phenylalanine, tyrosine, and tryptophan); small/neutral (serine, threonine, alanine, proline, and glycine); acidic/amides (aspartate, glutamate, asparagine, and glutamine); basic (histidine, arginine, and lysine); and sulfhydryl (cysteine). Both Bayesian and maximum likelihood analyses were performed on this recoded dataset.

[...] To evaluate whether alternative phylogenetic hypotheses were compatible with our data, we used Approximately Unbiased (AU) tests [] to assess alternative tree topologies. To evaluate the possibility of arachnid monophyly, we constrained all arachnids to be monophyletic, but excluded pycnogonids and xiphosurans from this clade. We did not enforce any further constraints on that tree topology. We also evaluated the hypotheses of a monophyletic Acari, and of a sister-group relationship between Solifugae and Pseudoscorpiones. Constraint trees were generated in Newick format, and likelihood analyses on the constraint trees were run in RAxML, using the mtART + gamma model of evolution. RAxML was used to generate per-site log likelihood scores, which were passed to the program CONSEL [] to determine the statistical support for the alternative topologies. […]
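The recoding schemes described earlier in this section (RY recoding at selected codon positions, and recoding of amino acids into six groups) can be illustrated with a minimal Python sketch. It is not the authors' code. The function names and the digit symbols 0 to 5 used for the six groups are assumptions made for illustration, and the NTE variant is not covered because the source does not state which codons it selects.

```python
# Purines (A, G) become R and pyrimidines (C, T) become Y.
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def ry_recode(codon_alignment_row, positions=(3,)):
    """RY-recode the given codon positions (1-based) of an in-frame sequence.

    Gaps and ambiguity symbols are left unchanged.
    """
    out = []
    for i, base in enumerate(codon_alignment_row.upper()):
        codon_position = i % 3 + 1
        out.append(RY.get(base, base) if codon_position in positions else base)
    return "".join(out)


# The six groups listed in the text; the digit labels are hypothetical.
GROUPS = {
    "0": "VLIM",   # hydrophobic
    "1": "FYW",    # aromatic
    "2": "STAPG",  # small/neutral
    "3": "DENQ",   # acidic/amides
    "4": "HRK",    # basic
    "5": "C",      # sulfhydryl
}
AA_TO_GROUP = {aa: label for label, members in GROUPS.items() for aa in members}


def recode_amino_acids(protein_row):
    """Replace each amino acid by its group label; keep gaps and X as-is."""
    return "".join(AA_TO_GROUP.get(aa, aa) for aa in protein_row.upper())


if __name__ == "__main__":
    print(ry_recode("ATGCCA"))                    # 3rd positions only
    print(ry_recode("ATGCCA", positions=(1, 3)))  # 1st and 3rd positions
    print(recode_amino_acids("MFSDKC-X"))
```

The `positions` argument distinguishes the two RY datasets: `(3,)` corresponds to recoding 3rd codon positions only, and `(1, 3)` to recoding 1st and 3rd positions.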

Pipeline specifications

Software tools: Phrap, Sequencher, tRNAscan-SE, SeaView, Clustal W, Gblocks, RAxML, DendroPy, PhyloBayes, CONSEL
Applications: Genome annotation, Phylogenetics