Computational protocol: Functional bias in molecular evolution rate of Arabidopsis thaliana

Protocol publication

[…] We used the Bowers et al. [] dataset for the A. thaliana whole-genome α-duplication event, which is estimated to have occurred before the divergence of Arabidopsis thaliana from Brassica but after its divergence from the Malvaceae.

To ensure the quality of the analysis, we screened GO annotations from the TAIR dataset based on their evidence codes. A gene pair was excluded from the analysis set if at least one gene in the pair lacked an annotation that was curated or experimentally assigned. To make a direct link between function and the rate of molecular evolution, we labeled each pair with only its most specific shared functions from GO terms. We also required that these shared functions be at a depth of one or greater (assuming the nodes "Biological Process", "Cellular Component", and "Molecular Function" have a depth of zero). An example is shown in Figure . Here we define the depth of a term to be the minimum distance over all paths from the root to that term's node. GO annotations were obtained from TAIR on June 4, 2009.

To assess the overall effect of gene function on sequence divergence, we placed genes into functional groups. The objective in creating these groups was that, for any two gene pairs, the more specific their shared role in the cell, the tighter their subsequent grouping. To do this we used the GOSim package [] to cluster genes based on their functional profiles. The GOSim package "provides the researcher with various information theoretic similarity concepts for GO terms." Within the GOSim package we selected the Resnik method [] to create term-term similarities for all pairs of terms in the Gene Ontology. This is defined to be: $\mathrm{sim}(t, t') = \max_{\hat{t} \in \mathrm{Pa}(t, t')} \mathrm{IC}(\hat{t})$, where Pa(t, t') denotes the set of all common ancestors of GO terms t and t', and IC(t) is the information content of term t as defined by Lord et al. [].
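The depth and Resnik definitions above can be illustrated with a minimal sketch on a toy ontology. Everything in it is an assumption made for illustration: the term names, the parent links, and the annotation probabilities are hypothetical, the information content is taken as IC(t) = -log p(t), and a term is counted among its own ancestors. It is not the GOSim implementation.

```python
import math
from collections import deque

# Hypothetical toy ontology: term -> list of parent terms (a DAG, like GO).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["C", "B"],  # two paths to the root, of lengths 3 and 2
    "E": ["D"],
}

# Hypothetical probabilities p(t): fraction of genes annotated with t or
# one of its descendants, so p never increases from parent to child.
PROB = {"root": 1.0, "A": 0.5, "B": 0.4, "C": 0.2, "D": 0.1, "E": 0.05}


def ancestors(term):
    """Return the term itself plus all of its ancestors (assumed convention)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term):
    """Minimum number of edges over all paths from the root to the term."""
    queue = deque([(term, 0)])
    seen = {term}
    while queue:
        node, dist = queue.popleft()
        if not PARENTS[node]:  # a node without parents is the root
            return dist
        for parent in PARENTS[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    raise ValueError(f"no path from {term!r} to a root")


def information_content(term):
    """IC(t) = -log p(t): rarer terms carry more information."""
    return -math.log(PROB[term])


def resnik(t1, t2):
    """Resnik similarity: the largest IC among the common ancestors."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


if __name__ == "__main__":
    print("depth(D) =", depth("D"))
    print("resnik(C, D) =", resnik("C", "D"))
    print("resnik(A, B) =", resnik("A", "B"))
```

In this toy graph, terms A and B share only the root, whose information content is zero, so their similarity is zero; terms C and D share the more specific ancestor C and score higher.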
These similarities were combined using the "optimal assignment" method of Fröhlich et al. [] to give coefficients, for all pairwise combinations of gene pairs, that indicate the functional similarity of the selected pairs. The optimal assignment method assigns each term of the gene with fewer annotations to exactly one term of the other gene, such that the sum of term-term similarities is maximized. Based on these coefficients, Ward's hierarchical clustering algorithm [] was used to group together genes with similar functional profiles. The resulting hierarchical tree was cut using a bottom-up approach such that each group meets a minimum size of 20. To maintain good functional specificity for shared functions without reducing the population of the resulting groups to a trivial number, groups were defined by the lowest internal node that achieved the minimum size threshold. For Ward's clustering algorithm, the height corresponds to the analysis of variance (ANOVA) sum-of-squares difference between two clusters, added up over all the variables within those clusters.

The protein and DNA sequences were obtained from The Arabidopsis Information Resource (TAIR) database on February 20, 2008 []. We aligned the protein sequences for duplicate gene pairs using the needle program with default parameters; the program implements the Needleman-Wunsch global alignment algorithm []. We then aligned the DNA sequences for duplicated gene pairs according to the aligned protein sequences using the PAL2NAL program []. Finally, to calculate dN/dS for duplicate gene pairs, we used yn00 with default parameters from the Phylogenetic Analysis by Maximum Likelihood (PAML) package []. yn00 implements the method of Yang and Nielsen [], which calculates dN/dS taking into account transition/transversion rate biases and base/codon frequency biases.

To evaluate statistical correlation between dN/dS and functional groups, we used the analysis of variance (ANOVA) method [].
ANOVA uses Fisher's F-test to determine whether the variance among group means is significantly larger than the variance within groups. The resulting p-value determines whether the null hypothesis, that mean dN/dS values are equal for all functional groups, should be rejected. To identify the specific groups that are significantly different, we used Tukey's honestly significant difference (HSD) criterion, which uses the Studentized range distribution to determine critical values [].

To characterize specific functional groups, we examined each group's relative enrichment of GO annotations for genes in the group compared to all the genes in the α-duplication event. For this purpose we used the Ontologizer software's parent-child union method with Bonferroni correction []. […]
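
The F-test behind the ANOVA step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `one_way_anova` and the dN/dS values are hypothetical. It computes only the F statistic and its degrees of freedom, because converting F to a p-value needs the F distribution, which is not in the Python standard library (SciPy's `scipy.stats.f_oneway` is one option, and `scipy.stats.tukey_hsd` covers the follow-up pairwise comparison).

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups.

    F is the between-group mean square divided by the within-group
    mean square; a large F suggests the group means are not all equal.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical dN/dS values for three functional groups.
    dnds_by_group = [
        [0.10, 0.12, 0.14],
        [0.20, 0.22, 0.24],
        [0.30, 0.32, 0.34],
    ]
    f_stat, df_b, df_w = one_way_anova(dnds_by_group)
    print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

The null hypothesis of equal mean dN/dS across groups would be rejected when F exceeds the critical value of the F distribution with these degrees of freedom at the chosen significance level.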

Pipeline specifications

Software tools: GOSim, PAL2NAL, PAML, Ontologizer
Databases: TAIR
Applications: Phylogenetics, Nucleotide sequence alignment
Organisms: Arabidopsis thaliana