## Protocol publication

[…] We compared the retention of homologs in six core eudicot species (three rosids: peach, cacao and grape; three asterids: tomato, Utricularia and Mimulus) and four monocot species (rice, Setaria, sorghum and Brachypodium), forming three independent data sets.

The data preparation in our approach, illustrated in Figure , starts with applying the **SynMap** program in the CoGe platform [,] to selected pairs of genomes stored on the CoGe site. This produces synteny blocks of genes (five or more, in the present work) that are likely to be orthologs because they have high sequence similarity and lie in the same syntenic context. These blocks include paralogous genes that map syntenically to the same ortholog(s). Additional syntenic paralogs derived from polyploidy can be detected through SynMap self-comparisons of genomes. All the genes sharing the orthologies and paralogies thus detected, among all the species in each data set, are then grouped together, yielding "homology sets" representing ancestral pre-WGD (whole-genome duplication) genes [].

The homology sets were first examined to see whether they contained at least one gene from each species in the group. In the first analysis, every set lacking a gene from any one of these species was excluded. (In a later analysis, described below, other homology sets were used.) The remaining sets were classified according to the number of species in which there was more than one copy, so that in the three-species comparisons a set could score 0, 1, 2 or 3, and in the four-species data set a score of 4 was also possible. We call this number the fractionation score.

For each homology set, each of its genes was annotated by submitting it to **Blast2GO** []. Then all the annotations from all the genes in this set were considered as annotations for the set as a whole. No account was taken of the multiplicity of "hits" of a single annotation within the set.
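
The filtering, scoring and annotation-pooling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data layout (a list of `(species, gene_id)` pairs per homology set, and a dict of GO terms per gene), the function names and the gene identifiers are all hypothetical and do not come from SynMap, CoGe or Blast2GO.

```python
from collections import Counter


def fractionation_score(members, species):
    """Score one homology set.

    members: iterable of (species_name, gene_id) pairs (hypothetical layout).
    species: names of the species in the data set.
    Returns None if some species has no gene in the set (such sets were
    excluded in the first analysis); otherwise returns the number of
    species that have more than one copy.
    """
    copies = Counter(sp for sp, _gene in members)
    if any(copies[sp] == 0 for sp in species):
        return None
    return sum(1 for sp in species if copies[sp] > 1)


def set_annotations(gene_terms):
    """Pool the GO terms of every gene in a set, ignoring multiplicity.

    gene_terms: dict mapping gene_id to an iterable of GO term names.
    """
    pooled = set()
    for terms in gene_terms.values():
        pooled.update(terms)
    return pooled


rosids = ["peach", "cacao", "grape"]

# Two copies in peach and grape, one in cacao.
complete_set = [("peach", "p1"), ("peach", "p2"),
                ("cacao", "c1"),
                ("grape", "g1"), ("grape", "g2")]

# No grape gene, so this set would be dropped in the first analysis.
incomplete_set = [("peach", "p3"), ("cacao", "c2")]

score_complete = fractionation_score(complete_set, rosids)
score_incomplete = fractionation_score(incomplete_set, rosids)

# The same term found on two genes is counted once for the set.
pooled_terms = set_annotations({"p1": ["organelle", "membrane"],
                                "p2": ["organelle"]})
```

Here `score_complete` is 2, because two of the three rosids carry more than one copy, and `score_incomplete` is `None`. Using a set for the pooled annotations is one simple way to disregard repeated hits of the same term.
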
For every annotation, each of the higher-level terms of each hit was also counted as an annotation.

Among all the homology sets we constructed, approximately 90% hit at least one GO term, resulting in 10,688 monocot, 6,360 rosid and 4,638 asterid homology sets for further analysis.

The GO terms are divided at the highest level into "Biological Process", "Molecular Function" and "Cellular Component", and there are a further 67 terms at the next level, which we call "high-level terms". Homology sets with large fractionation scores, i.e., those containing more than one paralog in all or most genomes, tend to have a higher total number of annotations simply by virtue of containing more genes. This leads to the artifactual observation that almost all functional categories are favored by homology sets with high fractionation scores. To correct for this bias, we use a normalized proportion of hits for each term at each fractionation score. This is calculated as the number of hits of the term over all homology sets with this fractionation score, divided by the total number of sets with hits for any term within the appropriate highest-level term. Thus, if "organelle" received 100 hits in all homology sets with fractionation score 3, and if the number of sets hitting any "Cellular Component" term is 300, the normalized "proportion" is 33.3%.

These normalized proportions could then be plotted against fractionation score, as in Figure . By treating every combination of homology set and functional category as a data point whose X-coordinate is its fractionation score and whose Y-coordinate is 1 or 0, depending on whether the homology set was a hit (1) or not (0) for that category, we could then calculate a regression slope for the functional category. In Figure , the functional categories with significant negative slopes are black and those with significant positive slopes are red or orange. […]
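
As a rough illustration of the normalization and the regression described above, the following Python sketch computes both from a list of `(fractionation_score, set_of_terms)` pairs. Several things here are assumptions rather than details given in the text: the input layout and function names are hypothetical, the denominator is taken to be restricted to sets with the same fractionation score, the regression is an ordinary least-squares fit over all sets passed in, and no significance test is included because the text does not say which one was used.

```python
from collections import defaultdict


def normalized_proportions(sets, term, root_terms):
    """Normalized proportion of hits for `term` at each fractionation score.

    sets: list of (score, terms) pairs, where terms is a set of GO term names.
    root_terms: all terms under the same highest-level term as `term`.
    Assumption: the denominator counts only sets with the same score.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for score, terms in sets:
        if terms & root_terms:          # set hits some term under the root
            totals[score] += 1
            if term in terms:
                hits[score] += 1
    return {score: hits[score] / totals[score] for score in totals}


def regression_slope(sets, term):
    """Least-squares slope of hit (1) or miss (0) against fractionation score."""
    xs = [score for score, _terms in sets]
    ys = [1 if term in terms else 0 for _score, terms in sets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all sets have the same fractionation score")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data mirroring the worked example in the text: 300 sets with score 3
# hit some "Cellular Component" term, and 100 of them hit "organelle".
cellular_component = {"organelle", "membrane"}
toy_sets = [(3, {"organelle"})] * 100 + [(3, {"membrane"})] * 200
proportions = normalized_proportions(toy_sets, "organelle", cellular_component)

# A category hit mostly by low-score sets gives a negative slope.
slope_sets = [(0, {"organelle"}), (0, {"organelle"}), (1, {"organelle"}),
              (1, set()), (2, set()), (2, set())]
slope = regression_slope(slope_sets, "organelle")
```

With the toy data, `proportions[3]` is 100/300, matching the 33.3% in the text, and `slope` is negative (-0.5), the kind of category that would be drawn in black if the slope were significant.
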

## Pipeline specifications

Software tools | SynMap, Blast2GO |
---|---|
Databases | CoGe |
Application | Genome annotation |