Computational protocol: A Phylogeny-Based Benchmarking Test for Orthology Inference Reveals the Limitations of Function-Based Validation

Similar protocols

Protocol publication

[…] Starting with orthologous groups at the level of last universal ancestor (COGs) in the eggNOG database (Table S1 in ), we manually recovered orthologous groups for 238 gamma-proteobacterial species, which we refer to as Reference Orthologous Groups (RefOGs). To define the boundaries of each RefOG based on the last common ancestor (LCA) of gamma-proteobacteria, we used 147 genomes as outgroups (including species from beta- and alpha-proteobacteria as well as more distant species). Initially, the homologous sequences were aligned using MUSCLE and the alignments were used to build aLTR-supported phylogenetic trees using PhyML version 3.0 . For trees (i.e. groups) that represent multiple gamma-proteobacteria-specific families, we curated the family that was best resolved by the presence of the outgroup(s) taking into account well-accepted species trees , , . However, in several cases no clear outgroup could be defined, hampering the resolution at the desired level. For these families, RefOGs were defined based on i) manual inspection of the alignments and ii) previously published descriptions of the families.After the initial curation of the families, all sequences determined to be members of a particular gamma-proteobacteria orthologous group, as well as the corresponding outgroup sequences, were aligned using MUSCLE . Alignments were cut based on the first and last well-aligned columns as estimated using GBLOCKS with the following parameters: (Minimum Length Of A Block: 10, Allowed Gap Positions: With Half, Use Similarity Matrices: Yes). This was done in order to remove highly divergent N- and C-terminal parts of each alignment where misalignment is assumed to be common. Alignments were further manually cleaned to remove large parts where all sequences but one had gaps or short sequences that did not align within a conserved “block”. Based on the refined alignments, Hidden Markov Models (HMMs) were built using the HMMER3 package . At a second refinement step, the HMM models were used to identify related sequences that were previously unidentified from the 385 aforementioned genomes. We did not define a global cut-off for sequence recruitment, but instead treated each family uniquely by adding all sequences with a bit score within the range of bit scores of already known members. After the addition of those sequences, phylogenetic trees were calculated using PhyML version 3.0 with the following settings: 100 bootstrap replicates, optimization of tree topology, branch lengths and rate parameters, four substitution rate categories and the NNI topology search option. RefOG identifiers, alignments, HMM models and trees are available at http://eggnog.embl.de/orthobench2. [...] eggNOG generates OGs for different taxonomic levels; thus, in the current comparison we used OGs generated for gamma-proteobacteria species only (i.e. gproNOGs). The predictions are publicly available at: http://eggnog.embl.de/version_3.0/downloads.html. We mapped the reference orthologs to the gproNOG orthologous groups (). For the following analysis, each RefOG was associated with the gproNOG for which it would maximize the number of true orthology assignments. For every protein in the dataset, three distinct functional features were retrieved, using the following approach:Domain content conservation analysis: Protein domain were assigned to the proteins by scanning them against the Pfam database (Pfam 26.0) using HMMER3 with curated family-specific score thresholds taken from Pfam. We then defined “family-specific conserved domains” for each RefOG by identifying protein domains that are present in more than 75% of its members. This cutoff was chosen conservatively to reflect the variability of domain content as found within gene families where most of the sequence is orthologous but where potentially a few domains can have different histories as a result of domain shuffling. Content conservation of a protein relative to its RefOG was then defined as the number of these “family-specific conserved domains” that it was found to contain. Since the analysis considers domain content rather than domain architecture (the set of unique domains rather than the vector of ordered domains), no special handling of short repeat-type domains was considered necessary.Gene order conservation analysis: For each gene in the RefOGs, we inspected 10 adjacent genes (five before, five after, approximately matching the size of many bacterial operons). Although this number is significantly larger from the average operon size (3 genes) , it is more appropriate for larger phylogenetic distances that important genomic rearrangements may have taken place. This is also in agreement with previous studies that have identified important regulatory and functional genomic structures, named as gene clusters or uber-operons , which are composed from larger genomic regions. Gene families were defined among the resulting total set of genes by linking together (through single linkage) all genes that are identified as homologs by NCBI BLAST at an E-value threshold of 10e-5 and additionally required the aligned region to be at least 90% of the length of the shorter gene. The resulting clusters define (operationally) gene families in the immediate neighborhood of each RefOG member. These gene families were considered conserved (“family-specific conserved neighbors”) for a RefOG if they were present in the genomic neighborhoods of at least 75% of the genes in each RefOG. Gene order conservation for a particular protein was then defined as the number of these conserved families also found in its 10-gene genomic neighborhood.Enzymatic activity conservation analysis: For each protein, its closest BLAST hit in version 66.0 of the KEGG database was determined, in 99% of all cases this was to an identical sequence. Based on the KEGG pathway membership of these matches, and parsing the descriptions of the corresponding KEGG pathways for Enzyme Commission (EC) numbers, proteins were assigned EC number functional annotations. Given the hierarchical nature of EC numbers, annotations were expanded to include both more and less specific descriptions (e.g. EC:1.2.3.4 expanded to EC:1.2.3.4, EC:1.2.3.-, EC:1.2.-.-). We considered each of these descriptions to be conserved for a RefOG (“family-specific conserved enzymatic activity”) if assigned to more than 75% of its members. The functional conservation score for each protein was then defined as the number of these conserved functional terms assigned. […]

Pipeline specifications