Computational protocol: Faster Evolving Primate Genes Are More Likely to Duplicate

Protocol publication

[…] We defined singleton and duplicable genes within the great apes as individual ancestral genes that had either remained unduplicated or been duplicated at least once, respectively. To uncouple duplication status from evolutionary rate measurement, we also required that all genes in this study remain singletons in gibbon and macaque, where evolutionary rates are measured. An initial list of 2,961 candidate singleton gibbon genes was defined as genes with only self-hits in an intraspecific all-against-all BLASTp (E value threshold = 0.1) using the longest protein for each gene. Gene trees were obtained from Ensembl Compara (their generation is described elsewhere) and were pruned using ETE3, a Python framework for phylogenetic analysis, to include only the species of interest (gibbon, macaque, orangutan, gorilla, human, and chimpanzee). We applied several quality control filters in Python to verify singleton status at the base of each pruned tree and to exclude ancestral duplication followed by loss in gibbon or macaque (a graphical summary is provided online). Of the 2,961 putative singleton gibbon genes, 692 were excluded because they lacked a gene tree, lacked a macaque ortholog, or lacked identifiable orthologs in any other genome (these could be annotation artifacts, novel genes, or very rapidly evolving genes); 1,926 gene trees had a single gibbon and a single macaque gene; and 343 trees had more than one homolog in gibbon and/or macaque (potentially ancestrally duplicated).
The 343 gene trees with multiple macaque or gibbon homologs needed further examination to assess whether they could be included in this analysis.
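The candidate-singleton step (genes with only self-hits in the all-against-all BLASTp) can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the BLAST results are in the standard 12-column tabular format with the E value in column 11, and that sequence identifiers are gene identifiers (one longest protein per gene). The function name, the `_row` helper, and the example rows are hypothetical.

```python
import csv
import io

E_VALUE_THRESHOLD = 0.1  # threshold stated in the protocol


def candidate_singletons(blast_tabular, all_genes, evalue_max=E_VALUE_THRESHOLD):
    """Return the genes that have no non-self hit at or below evalue_max.

    blast_tabular: file-like object of tab-separated BLAST rows
    (assumed layout: query in column 1, subject in column 2, E value in column 11).
    all_genes: every gene used as a query, so genes with no hits at all are kept.
    """
    genes_with_other_hits = set()
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if query != subject and evalue <= evalue_max:
            genes_with_other_hits.add(query)
    return sorted(g for g in all_genes if g not in genes_with_other_hits)


def _row(query, subject, evalue):
    # Hypothetical helper: columns this sketch ignores get placeholder zeros.
    return "\t".join([query, subject] + ["0"] * 8 + [evalue, "0"])


# Made-up example: geneB and geneC hit each other, geneA only hits itself
# (its hit to geneC is above the E value threshold and is ignored).
EXAMPLE_BLAST = "\n".join([
    _row("geneA", "geneA", "0.0"),
    _row("geneA", "geneC", "5.0"),
    _row("geneB", "geneB", "0.0"),
    _row("geneB", "geneC", "1e-20"),
    _row("geneC", "geneC", "0.0"),
    _row("geneC", "geneB", "1e-20"),
]) + "\n"

if __name__ == "__main__":
    genes = ["geneA", "geneB", "geneC"]
    print(candidate_singletons(io.StringIO(EXAMPLE_BLAST), genes))
```

Passing the full query list separately means a gene that produced no reported hits at all is still counted as a candidate singleton, rather than being dropped because it never appears in the BLAST output.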
Where the tree topology indicated that the duplication predated the primate lineage, such that subtrees consisted of a single macaque and a single gibbon gene with orthologs in the great apes, these subtrees were split and retained as distinct gene family trees (shown online).
We identified 1,478 gene trees containing a single macaque gene, a single gibbon gene, and a single homolog in each of orangutan, gorilla, chimpanzee, and human; these were considered singleton gene trees. In the set of gene trees where at least one of the four great ape species had more than one gene, the observed gene counts could have arisen via a gene duplication event in the great apes or via gene loss events in macaque and gibbon. As this study is specifically interested in identifying great-ape-specific gene duplications, it was necessary to rule out gene loss as an explanation for the gene counts. For the 125 nonsingleton gene trees (more than one copy in at least one of the great apes), we evaluated whether the gene copy number was due to a recent great-ape-specific duplication or to an ancestral duplication with loss in some lineages. We used the genetic distance between paralogs to distinguish ancestral from recent duplication events, following a protocol similar to a previously described one in which paralogs are considered to have been created by a recent duplication event when the genetic distance between them is less than the genetic distance from either one to the gibbon gene. If the distance between, say, two chimpanzee paralogs (A) is less than the distances between the gibbon ortholog and each of the sister chimpanzee genes (denoted B and C, respectively), we rule out ancestral duplication with lineage-specific loss and infer that the duplication event occurred more recently than the speciation event (that is, within the time period of interest here). Genetic distances were obtained from the Ensembl gene trees.
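The distance rule above (A < B and A < C) can be written as a short check. This is a sketch under stated assumptions: the pairwise distances are supplied as a dictionary keyed by unordered gene pairs, whereas the protocol reads them from the Ensembl gene trees. The function names, gene labels, and distance values are hypothetical.

```python
def pair_distance(distances, gene_1, gene_2):
    """Look up a symmetric pairwise distance stored under an unordered pair."""
    return distances[frozenset((gene_1, gene_2))]


def classify_duplication(distances, paralog_1, paralog_2, gibbon_gene):
    """Apply the rule from the text to one pair of great ape paralogs.

    A = distance between the two paralogs
    B = distance from the gibbon gene to the first paralog
    C = distance from the gibbon gene to the second paralog
    The duplication is called recent only when A < B and A < C.
    """
    a = pair_distance(distances, paralog_1, paralog_2)
    b = pair_distance(distances, gibbon_gene, paralog_1)
    c = pair_distance(distances, gibbon_gene, paralog_2)
    if a < b and a < c:
        return "recent"
    return "ancestral_not_excluded"


# Made-up distances for two chimpanzee paralogs and the gibbon ortholog.
EXAMPLE_DISTANCES = {
    frozenset(("chimp_1", "chimp_2")): 0.02,
    frozenset(("gibbon", "chimp_1")): 0.05,
    frozenset(("gibbon", "chimp_2")): 0.06,
}

if __name__ == "__main__":
    print(classify_duplication(EXAMPLE_DISTANCES, "chimp_1", "chimp_2", "gibbon"))
```

The second label is deliberately cautious: failing the test does not show that the duplication was ancestral, only that ancestral duplication followed by loss cannot be ruled out.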
If more than two paralogs per species were present, the two most closely related paralogs were considered. The method is potentially confounded by interparalog gene conversion. However, we do not think this is a substantial issue; moreover, under these circumstances the macaque-gibbon distance would appear as a very obvious outlier, and we see no evidence of this.
Of the 125 gene trees with only one ortholog in macaque and gibbon and multiple orthologs in some great ape genomes, we could not rule out ancestral duplication followed by gene loss in macaque and gibbon for 53 gene trees. The remaining 72 gene trees include great ape gene duplication events. [...] For each of the 1,478 singleton gene trees and the 72 duplicate gene trees, the protein sequences of the gibbon, macaque, orangutan, gorilla, chimpanzee, and human genes were aligned using MUSCLE and then converted to nucleotide alignments using Translator-x. Lists of duplicate and singleton families are provided online. To test whether duplicate genes and singleton genes evolve at different rates, as has been described previously, we extracted the human and macaque sequences from the multiple sequence alignments and estimated dN/dS using the codeml module of PAML 4.8, set to runmode = -2 for pairwise rate calculation and CodonFreq = 2, with all other parameters as default. For each duplicate gene within a gene tree, we calculated the pairwise rate with the macaque sequence and then took the mean rate for the gene tree. A graphical summary of PAML usage is provided online. [...] Unless otherwise stated, statistical tests were performed and plots were created using R, sm, and ggplot2. The kernel density estimation (KDE) test is a nonparametric test of whether two 2D data sets are drawn from the same distribution. It was run using the ks (kernel-smoothing) package for R. […]
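The codeml settings named in the text, and the per-tree averaging for duplicate genes, can be sketched as below. This is an illustration only. It writes the two stated options (runmode = -2, CodonFreq = 2) plus seqfile, outfile, and seqtype = 1 (codon sequences), which are assumptions added here so that the control file is usable. The file names, function names, and dN/dS values are hypothetical, and running codeml and parsing its output are not shown.

```python
from statistics import mean


def codeml_pairwise_ctl(seqfile, outfile):
    """Build the text of a codeml control file for pairwise dN/dS estimation.

    runmode = -2 and CodonFreq = 2 come from the protocol; seqtype = 1
    (codon sequences) is an assumption of this sketch. Options not listed
    here are left to codeml's defaults.
    """
    settings = {
        "seqfile": seqfile,
        "outfile": outfile,
        "runmode": -2,   # pairwise comparison
        "seqtype": 1,    # codon sequences (assumed)
        "CodonFreq": 2,  # codon frequency model stated in the protocol
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


def gene_tree_rate(pairwise_dnds):
    """Mean of the per-duplicate pairwise dN/dS values for one gene tree."""
    return mean(pairwise_dnds)


if __name__ == "__main__":
    # Hypothetical file names for one human-macaque codon alignment.
    print(codeml_pairwise_ctl("family_0001.human_macaque.phy", "family_0001.codeml.txt"))
    # Made-up dN/dS values for two duplicates, each compared with macaque.
    print(gene_tree_rate([0.25, 0.75]))
```

Keeping the control file as generated text makes it straightforward to write one file per gene tree and keep the settings identical across the singleton and duplicate sets.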

Pipeline specifications