Computational protocol: Inferring Hierarchical Orthologous Groups from Orthologous Gene Pairs

Protocol publication

[…] We define an orthology graph as a graph whose nodes are the present-day genes and whose edges represent pairwise orthology relations between genes as defined by Fitch, i.e. the relations are symmetric but non-transitive. We further require that every present-day gene be part of at least one orthologous relation, so that the graph contains no singleton. As mentioned in the introduction, pairwise orthologs can be inferred using well-established methods, many of which do not require gene tree reconstruction or gene/species tree reconciliation. Here, we consider two cases: perfect data, where we assume that the pairwise orthologs have been correctly and exhaustively identified, and “real data”, where they have been imperfectly identified using OMA pairwise (Sect. “Orthology graph inference”). To restrict the orthology graph to a chosen taxonomic range, we take the orthology subgraph induced by the corresponding vertex subset, again without singleton genes. Finally, we consider the set of connected components of this subgraph. A connected component is a maximal subgraph in which a path exists between every pair of nodes. [...]

COCO-CL requires initial homologous clusters and refines them into a hierarchy by applying single-linkage clustering to the pairwise distance estimates induced by each cluster’s multiple alignment. As suggested by the authors, we built the initial clusters using the COG algorithm. The COG parameters (an E-value cutoff and a hit coverage threshold of 0.5) were chosen according to the software documentation. We applied COCO-CL to both the simulated and the real datasets. On simulated data, in order to assess the COCO-CL gene family refinement procedure independently of the COG clustering step, we also used the true simulated homologous gene families as input clusters. To conform to the definition of hierarchical groups, we fixed COCO-CL’s paralogy threshold so that two sub-clusters sharing genes from the same species have to be related by a duplication. For the analysis on simulated data, we varied the bootstrap threshold between 0 and 0.95. For the analysis on empirical data, we set the bootstrap threshold to the default value (0.75). [...]

LOFT is a tree-based orthology inference method. It computes Neighbour-Joining gene trees from model-based pairwise distances and then labels the internal nodes with evolutionary events using a species overlap criterion. Like COCO-CL, LOFT requires initial gene families to work on. Again, we used the COG clusters inferred with the parameters described above, on both the simulated and the real datasets. On the simulated dataset, as an additional control, we repeated the analyses using the true and complete homologous gene families as input. […]
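
The orthology graph definitions above (nodes are present-day genes, edges are symmetric pairwise orthology relations, no singletons, restriction to a taxonomic range, connected components) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the gene identifiers and the toy species assignment are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_orthology_graph(pairs):
    """Build adjacency sets from symmetric orthologous gene pairs.

    Genes that appear in no pair never enter the graph, so it has no singletons.
    """
    adj = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # ignore self-pairs; a gene is not its own ortholog
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def induced_subgraph(adj, genes):
    """Restrict the graph to `genes`, then drop genes left without a neighbour."""
    keep = set(genes)
    sub = {g: adj[g] & keep for g in adj if g in keep}
    return {g: nbrs for g, nbrs in sub.items() if nbrs}


def connected_components(adj):
    """Return the connected components as a list of gene sets (depth-first search)."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            gene = stack.pop()
            component.add(gene)
            for neighbour in adj[gene]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


# Hypothetical toy data: gene -> species, plus pairwise orthologs.
species_of = {"h1": "human", "h2": "human", "h3": "human",
              "m1": "mouse", "m2": "mouse", "y1": "yeast", "y3": "yeast"}
pairs = [("h1", "m1"), ("h1", "y1"), ("m1", "y1"), ("h2", "m2"), ("h3", "y3")]

graph = build_orthology_graph(pairs)
# Taxonomic range assumed here: human and mouse only.
in_range = [g for g, s in species_of.items() if s in {"human", "mouse"}]
subgraph = induced_subgraph(graph, in_range)
components = connected_components(subgraph)
```

In this toy example, gene h3 drops out of the restricted subgraph because its only ortholog lies outside the chosen range, which mirrors the "without singleton genes" requirement.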

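The species overlap criterion mentioned for LOFT can be sketched as follows, assuming the commonly used rule that an internal node is labelled a duplication when its two child subtrees share at least one species, and a speciation otherwise. This is an illustrative sketch, not LOFT's code: the nested-tuple tree encoding, the function name `label_events` and the toy data are assumptions.

```python
def label_events(tree, species_of):
    """Label the internal nodes of a binary gene tree by species overlap.

    `tree` is either a gene name (leaf) or a pair (left, right).
    Returns (species_set, labelled_tree), where each internal node of the
    labelled tree is a triple (event, left, right).
    """
    if isinstance(tree, str):
        return {species_of[tree]}, tree
    left, right = tree
    left_species, left_labelled = label_events(left, species_of)
    right_species, right_labelled = label_events(right, species_of)
    # Shared species between the two subtrees imply a duplication (assumed rule).
    event = "duplication" if left_species & right_species else "speciation"
    return left_species | right_species, (event, left_labelled, right_labelled)


# Hypothetical toy gene family with one duplication before the human/mouse split.
toy_species = {"h1": "human", "m1": "mouse", "h2": "human",
               "m2": "mouse", "y1": "yeast"}
toy_tree = ((("h1", "m1"), ("h2", "m2")), "y1")
all_species, labelled = label_events(toy_tree, toy_species)
```

Here the node joining the two human/mouse subtrees is labelled a duplication, while the root, which separates yeast from the mammalian genes, is labelled a speciation.
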
Pipeline specifications

Software tools: OMA, COCO-CL, LOFT
Applications: Genome annotation, Phylogenetics