Computational protocol: Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires

Protocol publication

[…] Protein predictions for 114 completely sequenced eukaryotic genomes were obtained from a variety of sources; for details, as well as the number of protein predictions per genome, see Additional file.

The domain repertoire for each genome was determined by hmmscan (with default options, except for an E-value cutoff of 2.0 and 'nobias') from the HMMER 3.0b2 package [] using hidden Markov models from Pfam 24.0 []. In a second step, the hmmscan results were filtered by the domain-specific 'gathering' (GA) cutoff scores provided by Pfam, followed by removal of domains of obvious viral, phage, or transposon origin (such as Pfam domain 'Viral_helicase1', a viral superfamily 1 RNA helicase). Where domains overlapped, only the domain with the lowest E-value was retained.

Based on these preprocessing steps, a list of domains was created for each of the 114 genomes and, together with each of the three eukaryotic evolutionary trees described in the text, used for a Dollo parsimony [] based inference of ancestral domain repertoires. The results of this step are lists of gained, lost, and present domains for each ancestral species.

To assess the robustness of our results with respect to the preprocessing steps, we also performed our analyses with a variety of different parameter combinations, such as uniform E-value-based cutoffs ranging from 10^-4 to 10^-18, as well as domain-specific 'noise' (NC) and 'trusted' (TC) cutoff values from Pfam, with or without overlap and/or viral domain removal. We were unable to find a combination of these settings that would significantly change the numbers presented here and invalidate our conclusions. For example, Additional file shows select domain counts for a variety of cutoff values.
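The GA filtering, exclusion of viral domains, and overlap resolution described above can be illustrated with a minimal Python sketch. It is not the authors' software. The `Hit` record, its field names, and the `filter_hits` function are hypothetical, and the sketch assumes the hmmscan output has already been parsed into such records. It also makes three assumptions that the text does not specify: two hits overlap if they share at least one residue, overlaps are resolved greedily from the lowest E-value upward, and domains without a GA cutoff are dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One domain hit on one protein (hypothetical record layout)."""
    protein: str
    domain: str
    start: int      # first residue covered, inclusive
    end: int        # last residue covered, inclusive
    evalue: float
    score: float    # bit score, compared against the GA cutoff


def filter_hits(hits, ga_cutoffs, excluded_domains=frozenset()):
    """Apply GA cutoffs, drop excluded domains, keep the lowest E-value on overlap."""
    kept = []
    # Visit hits from lowest to highest E-value so better hits win conflicts.
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.domain in excluded_domains:
            continue  # e.g. domains of viral, phage, or transposon origin
        # Assumption: a domain with no known GA cutoff is discarded.
        if hit.score < ga_cutoffs.get(hit.domain, float("inf")):
            continue
        overlaps = any(
            k.protein == hit.protein and hit.start <= k.end and k.start <= hit.end
            for k in kept
        )
        if not overlaps:
            kept.append(hit)
    return kept
```

Calling `filter_hits(hits, ga_cutoffs, {"Viral_helicase1"})` for each genome and collecting the `domain` fields of the returned hits would yield the per-genome domain list used in the next step.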
While, as expected, the absolute domain counts depend on the cutoff value(s) used, the overall tendencies (such as the LECA having an inferred domainome similar in size to that of extant mammals, and significant domain losses at the roots of the deuterostome and ecdysozoan subtrees) are independent of the cutoff values used. Additional file shows detailed gain and loss numbers under a uniform E-value-based cutoff of 10^-8.

Pfam domains (lost, gained, and present) were mapped to GO terms by using the 'pfam2go' mapping (dated 2009/10/01) provided by the GO consortium []. GO term enrichment analysis for gained and lost domains was performed using the Ontologizer 2.0 software [] with the Topology-Elim algorithm [], which integrates the graph structure of the GO in testing for group enrichment. Enrichments are calculated relative to the union of all Pfam domains (with GO annotations) present in all genomes analyzed in this work.

As summarized in Additional file, we tested whether different calculation methods in the Ontologizer 2.0 software (such as 'Topology-Weighted', 'Parent-Child-Union' or 'Parent-Child-Intersection' instead of 'Topology-Elim' []), as well as different approaches for multiple testing correction, would lead to noticeably different conclusions regarding enriched GO categories at various points during animal evolution. While the level of detail depends on the calculation method used (for example, the 'Parent-Child-Union' and 'Parent-Child-Intersection' methods generally lead to very broad terms, whereas the other methods give more specific results), the results for each setting show predominantly gains in regulatory functions and losses in metabolic processes during animal evolution.

The preprocessing steps, the Dollo parsimony approach, and basic ancestral GO term analyses were performed by software of our own design []. […]
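To make the Dollo parsimony step more concrete, the following Python sketch infers present, gained, and lost domains for every node of a rooted tree. It is an illustration under stated assumptions, not the authors' implementation. The function name `dollo` and the input layout (`children` mapping each internal node to its child names, `leaf_domains` mapping each extant genome to its domain set) are hypothetical, and node names are assumed to be unique. The sketch uses the standard Dollo assumption that each domain is gained exactly once, at the last common ancestor of all genomes that carry it, and may be lost any number of times afterwards.

```python
from collections import Counter


def dollo(children, root, leaf_domains):
    """Return (present, gained, lost): dicts mapping node name -> set of domains."""
    below = {}  # node -> domains found in at least one leaf beneath it

    def collect(node):
        kids = children.get(node, [])
        if not kids:
            below[node] = set(leaf_domains[node])
        else:
            below[node] = set().union(*(collect(k) for k in kids))
        return below[node]

    collect(root)
    present, gained, lost = {}, {}, {}

    def assign(node, inherited):
        kids = children.get(node, [])
        if not kids:
            # Extant genomes keep their observed repertoire.
            present[node] = set(below[node])
        else:
            # A domain seen under two or more children must exist at this node.
            counts = Counter(d for k in kids for d in below[k])
            spanning = {d for d, n in counts.items() if n >= 2}
            # Inherited domains persist only if some descendant still has them.
            present[node] = (inherited & below[node]) | spanning
        gained[node] = present[node] - inherited
        lost[node] = inherited - present[node]
        for k in kids:
            assign(k, present[node])

    assign(root, set())
    return present, gained, lost


# Toy example with three genomes (A, B, C) and one internal node X.
children = {"root": ["A", "X"], "X": ["B", "C"]}
leaf_domains = {"A": {"d1", "d2"}, "B": {"d1", "d3"}, "C": {"d3", "d4"}}
present, gained, lost = dollo(children, "root", leaf_domains)
```

In this toy tree, domain `d1` is placed at the root because it occurs on both sides of the root split, and it is recorded as lost on the branch leading to genome `C`. The recursive traversal is adequate for trees of roughly one hundred genomes, as used here, but would need an explicit stack for very deep trees.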

Pipeline specifications

Software tools: HMMER hmmscan, HMMER, Ontologizer
Databases: Pfam
Application: Amino acid sequence alignment