Computational protocol: Phylogenomic Perspective on the Relationships and Evolutionary History of the Major Otocephalan Lineages

Similar protocols

Protocol publication

[…] For each live species, liver tissues of 3–5 individuals were extracted, immediately immersed in RNAlater (Life Technologies, Carlsbad, CA, USA), and frozen on liquid nitrogen until assay. RNAiso Plus reagent (Takara Biotechnology, Dalian, China) was used to isolate total RNA following recommendations of the manufacturer. The crude extract was purified using an RNeasy mini kit (Qiagen, Valencia, CA, USA) to exclude genomic DNA, and a Bioanalyzer 2100 (Agilent) was used to determine the integrity of the sample. The RNA-seq libraries were constructed using the Illumina Gene Expression Preparation Kit (Illumina, San Diego, CA, USA). Briefly, the mRNA was enriched from total RNA using Magnetic Oligo (dT) Beads (Illumina) and fragmented into pieces using the RNA fragmentation kit (Ambion, Austin, TX, USA). Reverse transcriptase (Invitrogen) and random hexamer-primers were used to synthesize the first-strand cDNA, and then DNA polymerase I (NEB) and RNaseH (Invitrogen) were used to synthesize the second-strand cDNA. The double-stranded cDNA was end-repaired by T4 DNA polymerase (NEB), Klenow enzyme (NEB) and T4 polynucleotide kinase (NEB). A single A-base addition was used to prepare the DNA fragments for ligation to the adapters using DNA ligase (NEB). Suitable fragments were selected using a Gel Extraction Kit (Qiagen) and amplified by PCR. These purified products represented the designated cDNA library. The library was paired-end sequenced on an Illumina HiSeqTM 2500 platform.Low-quality sequences with ambiguous ‘N’ bases and known adapters were filtered to remove reads in which more than 10% of the bases had Q-values < 20. Sequences shorter than 60 bp as well as rRNA sequences that aligned with the SILVA database were discarded to avoid sequencing artifacts. Trinity was then used to separately assemble the left reads into the resulting contigs for each sample, and the contigs were joined into transcripts. Transcripts longer than 200 bp were selected to construct the sample contig set for further analysis. The SRA data for Engraulis encrasicolus were converted into FASTQ data using SRA Toolkit ( and processed following the standard protocol as above. [...] In efforts to obtain phylogenetic inferences that would not be affected by misleading history signals, such as ‘hidden paralogs’, we generated 8-species-genome COGs (core-ortholog groups) that were used to search potential orthologs instead of the universal InParanoid database. The HaMStR pipeline was performed for orthology assignment. Fish-specific genome duplication in teleosts, which may result in “one-to-two” or “one-to-many” rather than “one-to-one” orthology relationships, were considered, and the amino acid sequences of eight model fish species and the corresponding “one-to-one” relationships from Ensembl by BioMart, were constructed as the COGs for the putative ortholog search following the procedure for the “Generation of new core-ortholog sets” from the hamstrsearch_local package in HaMStR. We set “5” as the minimum number of sequences for one core-ortholog unit. The sample contig sets of each species were assigned to the COGs using a BLASTX analysis. To acquire similar sequences, BL2SEQ was used to align each translated contig sequence to the best hit from the output of the BLASTX search, and the sequence whose translated format had the lowest E-value was chosen as the optimal candidate. After more than one contig sequence was screened out from the COGs as the ortholog, the shorter sequences were cut off, and then the putative single-copy orthologs were obtained. Using this approach, a total of 1,452 nucleotide and amino acid orthologs among 22 species were extracted from the newly generated COGs representing the most conservative regions (the COGs data on 4 species were excluded from phylogenetical analysis). Each collected locus of the COGs represented an ortholog cluster.MAFFT v7.222 was used to align each protein ortholog cluster with the parameter “–ep 0, –genafpair, –maxiterate 1,000, –thread 90”, and then PAL2NAL was applied to align each nucleotide ortholog cluster from the corresponding aligned protein sequences. When mismatches occurred, MACSE was used to finish the alignment instead. After all of the ambiguous “N” bases were replaced as the gaps, Gblocks with parameter “-t = c, Allowed Gap Positions = None/with half” were used to trim both ortholog clusters. Ultimately, 1,110 ortholog clusters longer than 60 bp were retained and concatenated to supermatrices by a Python script. To visualize the degree of distribution homogeneity for each locus of each species, a heat-map analysis was created using the R package. [...] To ensure that the optimal outgroup was selected, we performed a ML inference for the protein supermatrices with half gaps of 22 species by running RAxML 7.2.6 for 100 bootstrap replicates under the PROTGAMMAJTTF model. The LB score for each taxon was then calculated using TreSpEx v1.1 based on the ML tree with PDs. By considering the position of the nodes, which were broadly accepted (available at, we retained in the COGs data on 4 species: Danio rerio, Oreochromis niloticus, Oryzias latipes and Takifugu rubripes. With the addition of the remaining 14 species screened out from the transcriptome sequences, 18 species used to infer the otocephalan phylogeny included Clupeiformes (1), Gonorynchiformes (1), Gymnotiformes (1), Cypriniformes (5), Siluriformes (4), Characoidei (1), Citharinoidei (1), Osteoglossiformes (1), Perciformes (1), Beloniformes (1) and Tetraodontiformes (1); the latter four orders were used as outgroups. To more clearly illustrate the data, we graphically compared the number of raw reads and mapped reads. Four datasets were finally assembled from both ortholog clusters that represented the nucleotide and protein supermatrices with half gaps and without gaps of 18 species for the phylogenetic analysis. For the nucleotide supermatrices with half gaps, we constructed ML trees of data that were (1) unpartitioned; (2) unpartitioned excluding third codon positions (1,000 bootstrap replicates); (3) partitioned by codon position (designated as 12n + 3n, where 1, 2, and 3 represent the 1st, 2nd and 3rd codon positions, respectively, and the subscript “n” represents nucleotides); (4) partitioned by genes; and (5) partitioned by genes excluding 3n under the best-fit GTRGAMMAI model tested by Modeltest with 100 bootstrap replicates. The ML analysis was also applied to the nucleotide and protein supermatrices without gaps (500 and 1,000 bootstrap replicates, respectively; GTRGAMMAI model) as well as to the protein supermatrices with half gaps (unpartitioned and partitioned by genes, 100 bootstrap replicates; PROTGAMMAJTTF model). Nucleotide supermatrices without gaps were also implemented for a Bayesian Inference (BI) under the GTRGAMMAI model with two independent Monte Carlo Markov chain (MCMC) runs for a total length of 56,000 cycles in PhyloBayes version 4.1. The bpcomp program (maxdiff < 0.1) was then used to determine any discrepancies between the two chains following the burn-in of 5,000 cycles and sub-sampling every 100 trees. [...] After determining the AUQ, SD, SL and R2, four datasets were generated from the concatenated dataset. We implemented ML analyses for the four datasets with RAxML 7.2.6 under the best fit model for 500 and 1,000 bootstrap replicates. To evaluate the confidence of all topology hypotheses, CONSEL was used to implement the AU test, the Shimodaira-Hasegawa (WSH) test, the Kishino-Hasegawa (KH) test and the Bootstrap Probability (BP) test after the per site log-likelihoods of each topology were calculated using RAxML 7.2.6 and PAML 4.8. Eight datasets comprised of four 1110-gene datasets that represented the nucleotide and protein supermatrices with half gaps and without gaps and four datasets without bias screened from 129-gene datasets after sequence bias detection. [...] Beast v1.8.3 was used to estimate a time-calibrated tree with a node-dating strategy. A BEAST XML file was generated by BEAUTi v1.8.3 using an uncorrelated log-normal-distribution relaxed-clock model and a Yule speciation process as the tree prior. The descriptions of 7 fossil calibrations of the MRCA are presented in the Supplementary Text. The GTR model was used as the substitution model, Gamma + Invariant Sites were used for the site heterogeneity categories, and the Yule tree prior was used for all BEAST runs. As for the prior parameter, ucld.stdev and ucld.mean were set as the uniform distributions. The MCMCs were run in BEAST for 90 million generations with sampling every 1,000 cycles for each dataset. The effective sample sizes of all parameters were > 200. Tracer v1.5 was used to check the stationarity of the MCMC parameter sampling, and TreeAnnotator v1.6.1 ( was then used to inspect the posterior set of trees, with the first 20% of the sampled trees discarded as burn-in. […]

Pipeline specifications