Computational protocol: Teleost Fish-Specific Preferential Retention of Pigmentation Gene-Containing Families After Whole Genome Duplications in Vertebrates

Protocol publication

[…] We define a gene family here as the set of genes derived by duplication and speciation from a single-copy gene of the pre-1R/2R deuterostome ancestor. After 1R/2R but before Ts3R, a gene family can be either multigenic, if duplicates of the ancestral gene were retained after 1R/2R, or monogenic, if no duplicate was maintained. We define a pigmentation gene-containing family (PGCF) as a gene family containing at least one pigmentation gene. This implies that a PGCF can also contain genes that are not involved in pigmentation (experimental evidence), or for which no pigmentation function has been documented so far (lack of data) (see Figure S2 for a graphical representation of these definitions).

To assess the evolutionary history of each PGCF, and particularly how it was shaped by the four rounds of WGD in vertebrates studied here, sequences from 22 vertebrate genomes representing the major vertebrate lineages were analyzed. Eight sarcopterygian genomes, including seven tetrapods, were studied (Latimeria chalumnae, Xenopus tropicalis, Anolis carolinensis, Gallus gallus, and the mammals Ornithorhynchus anatinus, Monodelphis domestica, Mus musculus and Homo sapiens). Thirteen actinopterygian (ray-finned fish) genomes were included, encompassing a wide diversity of clades, including salmonids to study Ss4R: the non-teleost spotted gar Lepisosteus oculatus (the closest outgroup to teleosts with a sequenced genome that did not experience Ts3R) and the teleost species Danio rerio, Astyanax mexicanus, Gadus morhua, Gasterosteus aculeatus, Tetraodon nigroviridis, Takifugu rubripes, Poecilia formosa, Xiphophorus maculatus, Oryzias latipes, Oreochromis niloticus, Oncorhynchus mykiss and Salmo salar. To encompass the broad diversity of vertebrates, we additionally included the genome of a chondrichthyan (cartilaginous fish), the elephant shark Callorhinchus milii.
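The family definitions given earlier in this section (monogenic versus multigenic families, and PGCFs) can be illustrated with a minimal Python sketch. The class name, field names and example gene identifiers are hypothetical and are not taken from the study; the copy number after 1R/2R is assumed to have been inferred beforehand from the phylogenies.

```python
from dataclasses import dataclass, field


@dataclass
class GeneFamily:
    """Genes descended from one pre-1R/2R ancestral single-copy gene."""
    name: str
    # Number of copies inferred after 1R/2R but before Ts3R (assumed known).
    copies_after_2r: int
    # All member genes, and the genes with a documented pigmentation function.
    members: set = field(default_factory=set)
    pigmentation_genes: set = field(default_factory=set)

    def kind(self) -> str:
        """'multigenic' if 1R/2R duplicates were kept, else 'monogenic'."""
        return "multigenic" if self.copies_after_2r > 1 else "monogenic"

    def is_pgcf(self) -> bool:
        """A PGCF contains at least one pigmentation gene."""
        return len(self.members & self.pigmentation_genes) > 0


# Hypothetical example data, for illustration only.
fam_a = GeneFamily("famA", copies_after_2r=2,
                   members={"geneA1", "geneA2"},
                   pigmentation_genes={"geneA1"})
fam_b = GeneFamily("famB", copies_after_2r=1, members={"geneB"})

for fam in (fam_a, fam_b):
    print(fam.name, fam.kind(), fam.is_pgcf())
```

Note that `famA` counts as a PGCF even though `geneA2` has no documented pigmentation role, which mirrors the definition above.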
At the time of analysis, data from the lamprey genome were too incomplete to be included in the study. Sequences from five non-vertebrate deuterostomes that diverged from the vertebrate lineage before the 1R/2R WGDs were used as outgroups in the molecular phylogenies: the two urochordates Ciona intestinalis and C. savignyi, the cephalochordate Branchiostoma floridae (amphioxus) and the more distantly related ambulacrarians Strongylocentrotus purpuratus (echinoderm) and Saccoglossus kowalevskii (hemichordate).

For each species, protein sequences of the longest isoform were retrieved from Ensembl v86 (www.ensembl.org; Oct. 2016) and NCBI (https://www.ncbi.nlm.nih.gov/, last accessed March 8, 2017). When a sequence was missing from the Ensembl annotation, we manually tested on the genome assembly (using the Ensembl BLASTn tool with the closest ortholog as query) whether this was due to gene loss or to an annotation error. In the latter case, we used the FGENESH+ program of the Softberry suite (http://www.softberry.com/) to extract the gene sequence and predict the protein sequence.

Sequence alignments were performed using Clustal Omega v1.2.1 with default parameters and were manually curated individually for each PGCF. Each alignment was then analyzed with ProtTest3 v3.4.2, and the evolutionary model for tree building was selected based on the Bayesian information criterion. Trees were built with the best-fitting model using the maximum likelihood method implemented in PhyML v3.1 with SH-aLRT support.
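As a rough illustration of model selection by the Bayesian information criterion, the sketch below computes BIC = k ln(n) − 2 ln(L) for a few candidate substitution models and keeps the lowest score. It is not ProtTest's code: the log-likelihoods and parameter counts are invented, and taking the sample size n to be the alignment length is an assumption.

```python
import math


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k*ln(n) - 2*ln(L); lower is better."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def best_model(candidates: dict, n_sites: int) -> str:
    """Return the name of the candidate model with the lowest BIC.

    `candidates` maps model name -> (log-likelihood, number of free parameters).
    """
    return min(candidates, key=lambda m: bic(*candidates[m], n_sites))


# Hypothetical log-likelihoods and parameter counts for one alignment.
candidates = {
    "JTT": (-10250.0, 41),
    "LG": (-10200.0, 41),
    "LG+G": (-10050.0, 42),
}
# n_sites: alignment length, assumed here to be the BIC sample size.
chosen = best_model(candidates, n_sites=500)
print(chosen)
```

With these made-up numbers, the extra parameter of `LG+G` is outweighed by its gain in likelihood, so it is the model selected.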
The best-fitting model for each alignment is given in Table S2.

To assess whether a duplication event was likely due to Ts3R rather than to a small-scale duplication (SSD), and to identify large paralogons supporting Ts3R, we conducted a synteny analysis for each gene using the Genomicus web browser (v86; http://genomicus.biologie.ens.fr/genomicus-86.01/cgi-bin/search.pl).

Finally, because genes can be described under different names in different organisms, we used the unified nomenclature system for humans provided by the HGNC Database to avoid nomenclature ambiguities. […]
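The reasoning behind the synteny check can be sketched in a few lines of Python: if the regions around two paralogs share several neighboring gene families, the duplicates probably arose as part of a large duplicated block (a paralogon) rather than through an isolated small-scale duplication. This is only a toy illustration of the idea, not the Genomicus algorithm; in the protocol the analysis was carried out in the Genomicus web browser. The function names, family identifiers and threshold are all hypothetical.

```python
def shared_neighbor_families(region_a, region_b, exclude=()):
    """Return gene families found in both genomic neighborhoods.

    Each region is a list of family identifiers for the genes flanking one
    paralog; `exclude` holds the family of the paralogs themselves.
    """
    return (set(region_a) & set(region_b)) - set(exclude)


def supports_wgd(region_a, region_b, exclude=(), min_shared=3):
    """Heuristic: enough shared neighbors suggests a duplicated block.

    `min_shared` is an arbitrary illustrative threshold, not a published value.
    """
    shared = shared_neighbor_families(region_a, region_b, exclude)
    return len(shared) >= min_shared


# Hypothetical neighborhoods around two paralogs (family IDs are made up).
around_copy_a = ["f1", "f2", "f3", "target", "f4", "f5"]
around_copy_b = ["f9", "f2", "target", "f4", "f5", "f7"]

shared = shared_neighbor_families(around_copy_a, around_copy_b,
                                  exclude=["target"])
print(sorted(shared))
print(supports_wgd(around_copy_a, around_copy_b, exclude=["target"]))
```

A real analysis would also need to account for gene order, lineage-specific gene losses and the position of the duplication in the gene tree, none of which this sketch models.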

Pipeline specifications

Software tools: BLASTN, Clustal Omega, ProtTest, PhyML
Databases: Genomicus
Application: Phylogenetics
Organisms: Danio rerio