Computational protocol: To the beat of a different drum: determinants implicated in the asymmetric sequence divergence of Caenorhabditis elegans paralogs

Protocol publication

[…] Lynch and Conery had initially identified paralogs in the C. elegans genome by downloading the complete set of available putative amino acid sequences, filtering out all possible nonfunctional protein sequences and conducting all-against-all BLASTP searches with E = 10^-10 as a cutoff to identify duplicate pairs. To protect against the inclusion of large multigene families, only small gene families with ≤ 5 duplicate pairs were retained. From this dataset, 290 gene duplicate pairs with low synonymous divergence were further analyzed to determine their genomic and structural characteristics.

Of these 290 gene duplicate pairs with low synonymous site divergence (KS ≤ 0.10) in the C. elegans genomic data set of Katju and Lynch, 63 were no longer considered valid paralogs in WormBase WS214 owing to one of the following conditions: (i) alterations to the ORF(s) of one or both paralogs such that they no longer appeared homologous in their coding regions, or (ii) one or both paralogs had been retired, killed or superseded (e.g. following identification as a transposon). These 63 duplicate pairs were removed from the dataset, leaving 227. For both paralogs of each remaining pair, the spliced and unspliced nucleotide sequences, 2 kb of flanking sequence on each side (5′ and 3′) and the amino acid sequences were retrieved from WormBase release WS214 (http://www.wormbase.org/). Paralogous sequences were aligned using ClustalW2 at the EMBL-EBI site and checked manually in the sequence alignment editor Se-Al (http://tree.bio.ed.ac.uk/software/seal/). For the nucleotide alignments, the 2 kb of upstream and downstream flanking sequence was aligned together with the spliced and unspliced ORF sequences. Where homology between paralogs extended beyond 2 kb of the flanking region(s), an additional 1 kb of flanking sequence was retrieved from the database and aligned.
The addition and alignment of flanking sequences were iterated until no homology was apparent between the paralogs for a continuous stretch of 1 kb in both the 5′ and 3′ directions. [...] Measures of synonymous sequence divergence in coding regions (KS) were recalculated using the codeml program in the PAML package via PAL2NAL (http://www.bork.embl.de/pal2nal/). For each duplicate pair, I attempted to identify an outgroup gene: either a single-copy ortholog in a closely related genome (C. brenneri, C. briggsae, C. japonica or C. remanei) or a more evolutionarily distant paralog in the same multigene family within C. elegans. The 97 duplicate pairs with neither an identifiable ortholog in the four congeneric outgroup genomes nor a more distantly related gene family member in C. elegans were excluded from further analysis. An outgroup sequence was successfully identified for the remaining 130 of the 227 duplicate pairs, which comprised the final data set. Synonymous divergence between paralogs within this set of 130 duplicate pairs ranged from 0 to 13.6% (0 ≤ KS ≤ 0.1363).

Tajima proposed a relative rate test to determine whether two protein or nucleotide sequences have evolved at similar rates. The two sequences may be orthologs from two organisms or paralogs within the same organism. In other words, the relative rate test determines statistically whether two sequences follow the molecular clock hypothesis of approximately constant rates of nucleotide or amino acid substitution over evolutionary time. In the relative rate test, sequences A and B share a common ancestor O, and the sequence of an outgroup (C) is known. By measuring the substitution rates AB, AC and BC, it is possible to infer the rates OA and OB and to perform a χ² test to determine whether these rates are comparable (the null hypothesis) or whether one lineage has evolved at a relatively accelerated or decelerated rate, thus violating the behavior of a molecular clock.
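To make the test concrete, the following Python sketch counts the sites unique to each paralog in an aligned triplet and computes the χ² statistic with one degree of freedom, using the usual formulation of Tajima's test, χ² = (m1 − m2)² / (m1 + m2). It is an illustration only, not the MEGA implementation: the function names are hypothetical, and skipping columns that contain '-' or '?' is an assumed way of handling gaps and missing data.

```python
import math


def count_unique_sites(seq_a, seq_b, outgroup, missing="-?"):
    """Return (m1, m2): sites unique to paralog A and sites unique to paralog B.

    A site is unique to A when A differs from B while B matches the outgroup,
    and unique to B in the mirror-image case. Columns where all three
    sequences differ, or where any sequence has a gap, are ignored.
    """
    if not (len(seq_a) == len(seq_b) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, c in zip(seq_a.upper(), seq_b.upper(), outgroup.upper()):
        if a in missing or b in missing or c in missing:
            continue  # assumption: complete deletion of gapped columns
        if a != b and b == c:
            m1 += 1
        elif a != b and a == c:
            m2 += 1
    return m1, m2


def relative_rate_test(m1, m2):
    """Return (chi2, p) for the null hypothesis of equal rates (1 d.f.)."""
    total = m1 + m2
    if total == 0:
        return 0.0, 1.0  # no informative sites: treated here as no evidence
    chi2 = (m1 - m2) ** 2 / total
    # Survival function of the chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


if __name__ == "__main__":
    # Toy alignment (made-up sequences, not taken from the data set).
    paralog_a = "ATGACCGTTAAC"
    paralog_b = "ATGGCCGTTAAT"
    outgroup = "ATGGCCGTTAAC"
    m1, m2 = count_unique_sites(paralog_a, paralog_b, outgroup)
    chi2, p = relative_rate_test(m1, m2)
    print(f"m1={m1} m2={m2} chi2={chi2:.3f} p={p:.3f}")
```

The same counting applies to protein alignments, since the comparison is made column by column on whatever alphabet the aligned sequences use.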
Within the final set of 130 C. elegans duplicate pairs, each sequence triplet, comprising the homologous coding sequences of two focal C. elegans paralogs and an outgroup sequence, was aligned at the protein and nucleotide levels and analyzed via the relative rate test using the program MEGA 4.0 (http://www.megasoftware.net/). For gene duplicate pairs displaying structural heterogeneity in their coding regions (partial and chimeric structure, discussed in the subsequent section), all measures of sequence divergence (synonymous divergence KS and degree of sequence asymmetry/symmetry via Tajima’s relative rate test) were calculated using only the regions homologous between the focal C. elegans paralogs.

Conservative tests such as Tajima’s relative rate test have extremely low statistical power for detecting rate asymmetry between paralogs that have accumulated few mutations, as is the case for evolutionarily recent duplicates. For example, Lynch and Katju calculated that if each of two paralogs had accrued ten mutations since the duplication event, an absolute difference of at least nine mutations between the two copies would be required to reject the null hypothesis of equal rates. The power to detect asymmetric sequence divergence is also compromised in shorter tracts of paralogous sequence. Furthermore, the earliest-incurred mutations may be paramount in dictating altered evolutionary trajectories for individual paralogs. To circumvent the low power of Tajima’s relative rate test, a new continuous variable was created to measure the extent of asymmetry between duplicates. It is based on the number of unique sites in each paralog, which Tajima’s relative rate test determines relative to an outgroup sequence.
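The arithmetic behind this loss of power can be sketched in a few lines of Python. The sketch assumes the one-degree-of-freedom statistic χ² = (m1 − m2)² / (m1 + m2) and a significance level of 0.05 (critical value about 3.841). The function name is hypothetical, and the calculation is an illustration rather than a reproduction of the cited analysis.

```python
import math

# Critical value of the chi-square distribution with 1 d.f. at alpha = 0.05.
CHI2_CRIT_1DF_05 = 3.841458820694124


def min_difference_to_reject(total_unique_sites, critical=CHI2_CRIT_1DF_05):
    """Return the |m1 - m2| at which (m1 - m2)^2 / (m1 + m2) equals `critical`.

    `total_unique_sites` is m1 + m2, the number of lineage-specific changes
    summed over both paralogs.
    """
    if total_unique_sites < 0:
        raise ValueError("total_unique_sites must be non-negative")
    return math.sqrt(critical * total_unique_sites)


if __name__ == "__main__":
    for total in (10, 20, 50, 100):
        d = min_difference_to_reject(total)
        print(f"total={total:4d}  required |m1 - m2| >= {d:.2f} "
              f"({d / total:.0%} of all changes)")
```

For a total of 20 changes the threshold is about 8.77, so an integer difference of nine or more is needed, in line with the figure quoted above. The required difference grows only with the square root of the total, which is why the relative imbalance needed for significance is so large when few mutations have accumulated.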
Rather than restricting the final data set to the extremely small number of duplicate pairs that Tajima’s relative rate test detected as showing significant rate asymmetry, all 130 duplicate pairs were used in the analyses. This variable, “asymmetry/site”, was quantified at the level of both the nucleotide and amino acid sequences. It was calculated as the absolute difference between the numbers of unique sites in the two paralogs, as determined by Tajima’s relative rate test, and then scaled to a per-site level to standardize for gene length. An asymmetry/site value of 0 indicates equal rates of molecular evolution in the two paralogs. This method also effectively excludes nonhomologous sites present in one paralog but not the other (such as those found in partial and chimeric duplicate pairs). These measures of asymmetry were calculated for each of the 130 duplicate pairs (Additional file : Table S1 and Additional file : Table S2) and lent far greater power to the study, enabling a more comprehensive analysis of the determinants of rate asymmetry than would have been possible had only duplicate pairs demonstrating significant rate asymmetry in Tajima’s relative rate test been used. […]
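As a worked illustration of the asymmetry/site variable, the Python sketch below takes the two unique-site counts from the relative rate test and the number of sites compared, and returns the scaled absolute difference. The function name and the use of the number of aligned homologous sites as the denominator are assumptions; the text states only that the value is scaled to a per-site level.

```python
def asymmetry_per_site(unique_a, unique_b, n_sites):
    """Return |unique_a - unique_b| / n_sites.

    unique_a, unique_b: unique-site counts for the two paralogs.
    n_sites: number of aligned homologous sites compared (assumed denominator).
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    if unique_a < 0 or unique_b < 0:
        raise ValueError("unique-site counts must be non-negative")
    if unique_a + unique_b > n_sites:
        raise ValueError("unique sites cannot exceed the number of sites")
    return abs(unique_a - unique_b) / n_sites


if __name__ == "__main__":
    # Hypothetical pairs: (unique sites in A, unique sites in B, sites compared).
    pairs = {
        "pair_1": (12, 2, 1000),
        "pair_2": (5, 5, 300),
    }
    for name, (ua, ub, n) in pairs.items():
        print(name, asymmetry_per_site(ua, ub, n))
```

Because the value is continuous, every pair contributes to downstream analyses, including pairs whose counts are too small for the relative rate test itself to reach significance.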

Pipeline specifications

Software tools: BLASTP, Clustal W, Se-Al, PAML, PAL2NAL, MEGA
Databases: WormBase
Application: Nucleotide sequence alignment
Organisms: Caenorhabditis elegans