Computational protocol: Genome-Wide Analysis of Syntenic Gene Deletion in the Grasses

Protocol publication

[…] Lists of syntenic gene pairs were initially generated for all pairwise comparisons, including self–self comparisons, using the SynMap utility of CoGe with the parameters described in supplementary table S3 (Supplementary Material online) of this paper. Individual stretches of syntenic genes were merged into larger syntenic blocks using a previously described method.
Within SynMap, synonymous substitution rates were calculated for individual syntenic gene pairs from coding-sequence alignments that were guided by nwalign alignments of the translated coding sequences (http://pypi.python.org/pypi/nwalign/). The rates for these aligned sequences were calculated by a customized version of CODEML.
Syntenic blocks containing 12 or more gene pairs were assigned to an evolutionary event, either speciation (orthologous) or whole-genome duplication (WGD; homeologous), based on a unified synonymous substitution rate (Ks) for the genes within the block. This unified rate is defined as the average synonymous substitution rate among gene pairs in the syntenic block after discarding the most diverged two-thirds of gene pairs. The calculation of synonymous substitution rates is very sensitive to errors in gene model annotation or sequence alignment. Examining only the lowest one-third of Ks values provides a sufficient data set to differentiate syntenic blocks while eliminating distortion from the very high substitution rates calculated between incorrectly aligned coding sequences.
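The unified Ks described above can be illustrated with a minimal sketch. This is not the authors' code: the function name, the use of None for gene pairs whose Ks could not be calculated, and rounding the one-third cutoff down are all assumptions made for illustration.

```python
from statistics import mean

# From the text: only blocks with 12 or more gene pairs are assigned to an event.
MIN_BLOCK_PAIRS = 12


def unified_ks(ks_values):
    """Mean Ks of the least-diverged third of gene pairs in one syntenic block.

    `ks_values` holds one Ks per gene pair; None marks a pair with no usable
    Ks (an assumption of this sketch). Returns None for blocks that are too
    small or that have no usable values.
    """
    if len(ks_values) < MIN_BLOCK_PAIRS:
        return None
    usable = sorted(ks for ks in ks_values if ks is not None)
    if not usable:
        return None
    # Keep the lowest one-third (rounded down, at least one value) and
    # discard the most diverged two-thirds.
    keep = max(1, len(usable) // 3)
    return mean(usable[:keep])


block = [0.1, 0.2, 0.3, 0.4, 5.0, 6.0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(unified_ks(block))
```

In this toy block the two inflated values (5.0 and 6.0) fall in the discarded two-thirds, so they do not affect the result, which is the behavior the text argues for.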
Grass genomes also include a class of genes with high GC content at the third codon position that generate unreliable synonymous substitution rate calculations.
The unified Ks calculations produced two fully distinct peaks for synonymous substitution rates of syntenic gene blocks in interspecies comparisons: one corresponding to orthologous syntenic blocks created by speciation and the other to homeologous syntenic blocks resulting from the pregrass tetraploidy. Intraspecies comparisons identified a single fully distinct peak of homeologous syntenic blocks resulting from the pregrass duplication in sorghum, rice, and brachypodium, and from the more recent lineage-specific tetraploidy in maize (supplementary fig. S4, Supplementary Material online). [...]
Homeologous and orthologous pairs of genes defined by inter- and intraspecies comparison were merged using in-house Python scripts to produce lists of pan-grass syntenic genes. When no ortholog of a syntenic group of genes was identified in a species, a predicted orthologous location was identified using the first orthologously conserved genes upstream and downstream of the missing gene in that genome. If these conserved genes were separated by more than 1 Mb or were located on different chromosomes, the group of genes was considered to have no syntenic coverage in that species.
When a predicted orthologous region was identified, a three-step process was used to confirm the absence of a syntenic ortholog. First, all annotated genes within the predicted orthologous region were compared using LASTZ with all members of the group of syntenically conserved genes in other species. Any gene with sequence similarity to the existing group of conserved syntenic genes was considered a conserved ortholog and added to the syntenic group.
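The way a predicted orthologous location is derived from flanking conserved genes can be sketched as follows. The data layout (a list of target-genome loci ordered along the syntenic block, with None where no ortholog was found), the names Locus and predicted_region, and measuring the separation as the distance between the two flanking genes are assumptions of this sketch rather than details given in the text.

```python
from typing import NamedTuple, Optional, Sequence

# From the text: flanking genes more than 1 Mb apart mean no syntenic coverage.
MAX_SPAN_BP = 1_000_000


class Locus(NamedTuple):
    chrom: str
    start: int
    end: int


def predicted_region(orthologs: Sequence[Optional[Locus]],
                     missing: int) -> Optional[Locus]:
    """Interval between the nearest conserved orthologs flanking a missing gene.

    `orthologs[i]` is the target-genome locus of the ortholog of the i-th gene
    along the block, or None if no ortholog was identified (hypothetical layout).
    Returns None when the group has no syntenic coverage in the target genome.
    """
    upstream = next((o for o in reversed(orthologs[:missing]) if o is not None), None)
    downstream = next((o for o in orthologs[missing + 1:] if o is not None), None)
    if upstream is None or downstream is None:
        return None
    if upstream.chrom != downstream.chrom:
        return None
    # The block may be inverted in the target genome, so order the two flanks.
    left, right = sorted((upstream, downstream), key=lambda locus: locus.start)
    if right.start - left.end > MAX_SPAN_BP:
        return None
    # Guard against overlapping flanking genes producing a negative interval.
    return Locus(left.chrom, left.end, max(left.end, right.start))


block = [Locus("chr1", 100, 200), None, None, Locus("chr1", 5000, 6000)]
print(predicted_region(block, 1))
```

Both missing positions in the toy block (indices 1 and 2) map to the same predicted region, because they share the same nearest conserved neighbors.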
If no gene within the predicted region produced a hit, the sequence of the entire predicted region was extracted and compared with the existing group of conserved syntenic genes using LASTZ with default settings. Any hit with a score of 3,000 or greater within the region was considered an unannotated conserved gene or gene fragment. Gaps with no syntenic matches to either annotated genes or unannotated sequences were further subdivided into those where a run of 50 or more Ns was present at the predicted location and those where there were no annotated gaps within the predicted location.
If the same gene was included in multiple syntenic groupings, the group with fewer identified orthologous and homeologous genes was removed from our comparison. Syntenic groups in which three or more genes not classified as local duplicates of each other were all identified as orthologs within the same species were also removed from the data set. These predominantly consisted of sequences that were treated as separate genes in some species but merged into a single gene in others.
Putative homeologous gene pairs identified only in a single species, where neither copy of the gene was supported by evidence of syntenic orthologs in any other grass species, were omitted from our analysis.
Local duplicate genes were defined as a series of homologous genes interrupted by fewer than 20 intervening genes (40 genes in maize, given the greater gene density of the maize working gene set). Homology was defined using the same parameters used by SynMap. […]
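The check applied to a predicted orthologous region, from annotated-gene hits through whole-region hits to the assembly-gap test, can be summarized as a small decision function. This is a sketch under stated assumptions: LASTZ is assumed to have been run separately, the inputs are already-parsed results, and the names Outcome and classify_region are hypothetical. Only the score cutoff of 3,000 and the run length of 50 Ns come from the text.

```python
import re
from enum import Enum

MIN_REGION_SCORE = 3000                # LASTZ score cutoff stated in the text
ASSEMBLY_GAP = re.compile(r"N{50,}")   # run of 50 or more Ns (uppercase only here)


class Outcome(Enum):
    ANNOTATED_ORTHOLOG = "annotated gene added to the syntenic group"
    UNANNOTATED_SEQUENCE = "unannotated conserved gene or gene fragment"
    NO_MATCH_ASSEMBLY_GAP = "no match, run of Ns at the predicted location"
    NO_MATCH_NO_GAP = "no match, no annotated gap at the predicted location"


def classify_region(annotated_hits, region_hit_scores, region_sequence):
    """Classify one predicted orthologous region.

    annotated_hits: IDs of annotated genes in the region with similarity to
        the syntenic group (hypothetical, pre-parsed LASTZ output).
    region_hit_scores: scores of hits from aligning the whole region sequence.
    region_sequence: nucleotide sequence of the predicted region.
    """
    if annotated_hits:
        return Outcome.ANNOTATED_ORTHOLOG
    if any(score >= MIN_REGION_SCORE for score in region_hit_scores):
        return Outcome.UNANNOTATED_SEQUENCE
    if ASSEMBLY_GAP.search(region_sequence):
        return Outcome.NO_MATCH_ASSEMBLY_GAP
    return Outcome.NO_MATCH_NO_GAP


print(classify_region([], [1200, 2999], "ACGT" + "N" * 60 + "ACGT"))
```

Keeping the four outcomes distinct matters downstream: a region with no match and no run of Ns is a stronger candidate for genuine gene loss than one where the sequence may simply be missing from the assembly.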

Pipeline specifications

Software tools: SynMap, PAML, LASTZ
Databases: CoGe
Applications: Genome annotation, Nucleotide sequence alignment