Computational protocol: Comparative genomics reveals contraction in olfactory receptor genes in bats

Similar protocols

Protocol publication

[…] To investigate the evolutionary dynamics of gene gain and loss in bats, we first identifed homologous genes among bats and other closely related laurasiatherian mammals using the OMA standalone software package. The OMA pipeline clusters homologous sequences from complete genomes in order to identify orthologous pairs of sequences and infer hierarchical orthologous groups (“HOGs”) of genes that have descended for a common ancestral gene in a specified taxonomic range. OMA has been shown to be among the most reliable orthology inference methods—notably outperforming many tree-based methods (e.g. refs –).We augmented the two bat species contained in the May 2016 OMA database (M. lucifugus and P. vampyrus) with six bat proteomes retrieved from the GenBank (E. fuscus, M. natalensis, M. brandtii, M. davidii, P. alecto and R. aegyptiacus). As outgroups, we also included 12 other laurasiatherian mammals: Ailuropoda melanoleuca (giant panda), Bos taurus (cow), Canis familiaris (dog), Equus caballus (dog), Erinaceus europaeus (European hedgehog), Felis catus (cat), Mustela furo (ferret), Ovis aries (sheep), Sorex araneus (common shrew), Sus scrofa (pig), Tursiops truncatus (bottlenose dolphin) and Vicugna pacos (alpaca). Although additional genome and transcriptome data were available for bats at the time of the analysis (e.g. refs , –), we decided to focus only on species whose genome was either Sanger sequenced or sequenced using high-throughput technologies at very high depth of coverage (>75X). Our reasoning was based upon previous findings showing that gene prediction and/or annotation errors could inflate estimates of gene turnover for taxa with low coverage draft genomes, a type of bias that we wanted to alleviate -or at least minimize- in downstream analyses. To this end, we also assessed gene annotation completeness of our 20 proteomic datasets using the Benchmarking Universal Single-Copy Orthologs (BUSCO) software, based on a 3,300 gene set conserved across vertebrates.If alternative protein isoforms were present for a given locus in any of the above proteomes, we only kept the first splicing variant (usually the longest). We predefined a species tree topology based on the most recent phylogenomic study of bats by Tsagkogeorga et al., and ran the OMA pipeline under default parameters. To control that the inference of gene gains and losses was not biased by the predefined input phylogeny, we repeated our analysis using a tree topology estimated de novo directly from the proteomic data using OMA (parameter “SpeciesTree” set to “estimate” in the parameter file), as well as six proposed alternative species tree topologies for the diversification of laurasiatherian orders (Supplementary Figure S1). To map events of gene contraction and expansion at specific branches of the tree, the OMA hierarchical group output was parsed with HAM, a Python program developed by the authors available at http://lab.dessimoz.org/ham, inferring the placement of gene gains, duplications and losses using a parsimony criterion (with all types of events equally weighted). [...] To estimate the rate of HOG expansion and contraction in bats, we analysed the HOG data inferred from the OMA pipeline using the software CAFE 3. For each HOG, we counted the number of genes present for each sampled genome in a given group of homologues, and converted these HOG counts at the tips of our tree into a CAFE formatted dataset. Again we used the species tree topology of Tsagkogeorga et al. (consistent with the de novo tree inferred by OMA), together with divergence times taken from Meredith et al. and from www.timetree.org to build the following reference species tree: (((R. aegyptiacus: 24, (P. alecto: 12, P. vampyrus: 12): 12): 42, (M. natalensis: 45, (E. fuscus: 25, (M. davidii: 14, (M. lucifugus: 9, M. brandtii: 9): 5): 11): 20): 21): 14, ((F. catus: 55, (C. l. familiaris: 46, (M. p. furo: 40, A. melanoleuca: 40): 6): 9): 24, (E. caballus: 78, (V. pacos: 65, (S. scrofa: 64, (T. truncatus: 56, (O. aries: 26, B. taurus: 26): 30): 8): 1): 13): 1): 1).To accommodate errors potentially present in our HOGs dataset, we used the caferror.py script of the CAFE 3 software package, which estimates the error in a dataset without a priori information on the error distribution. We ran CAFE with the error estimation on HOGs with at least one gene present in the most recent common ancestor of Scrotifera (the clade uniting bats, carnivores and ungulates). We also used the option –s in caferror.py in order to obtain estimates of error for each of our 18 mammalian genomes separately (excluding insectivores used as outgroups) in addition to the average global error estimate across all species of our phylogeny. [...] To annotate HOGs in terms of proteins and their putative role in species biology, we first interrogated protein information available in UniProtKB database. We selected one representative protein for each HOG, preferably from the cow when present, else from dog, cat or horse, as these species appeared to have better annotation records compared to the rest of our sampled mammals. Then, for a given list of HOGs of interest, we used the collected protein identifiers as queries in UniProtKB to infer functional groups based on GO terms, as well as based on keywords from protein annotations.Second, we tested for enrichment in GO terms of HOGs showing gene gains and losses in the respective MRCAs of all bats, Old World fruit bats, echolocating bats, as well as in the MRCAs of Ferungulata, carnivores, and even-toed ungulates and cetaceans (Cetartiodactyla). Again, we sampled one cow sequence from each HOG, such as each HOG was linked to a cow gene and its associated GO terms as a proxy for the functional role of its members. In addition, based on the same set of identifiers we used Ensembl Biomart to retrieve human gene homologues, which are better annotated than other mammalian genes and thus could potentially offer better insights into the functional profile of our gene sets.The most current associations of cow and human genes with GO terms were download from the Gene Ontology website (October 2016), and GO enrichment analyses were performed based on Fisher’s exact test, using the Python package GOATOOLS (https://github.com/tanghaibao/Goatools). The background population of HOGs for the GO analyses was defined separately in each test, and included all HOGs present in the parental node of the branch tested for enrichment. Finally, resulted p-values were adjusted for multiple testing using Bonferroni, Sidak, and Holm corrections, also implemented in GOATOOLS. [...] The genome size (C-value) for each of our sampled species was obtained from the Animal Genome Size Database. If the exact species did not have an entry in the database, we used an expected C-value calculated from the average C-value from other species of the same genus. Nonparametric Spearman correlation tests between genome size values, and estimates of average gene expansion, gene gain and gene loss were performed in R. To account for the phylogenetic history of the compared species, we used the comparative method phylogenetic independent contrasts (PIC) implemented in the PHYLIP package. […]

Pipeline specifications