Computational protocol: High Occurrence of Functional New Chimeric Genes in Survey of Rice Chromosome 3 Short Arm Genome Sequences

Protocol publication

[…] Sequence data of Chr3s in O. glaberrima, O. punctata, O. barthii, O. nivara, and O. rufipogon were downloaded from Gramene (http://www.gramene.org/). Chr3s sequences of O. sativa ssp. indica were downloaded from the BGI release of 2003/10/7 (ftp://ftp.genomics.org.cn/pub/ricedb/rice_update_data/genome/9311). The whole-genome sequences of O. glaberrima were downloaded from http://www.iplantcollaborative.org/. We performed pairwise genome comparisons between the Chr3s coding sequences (CDSs) of O. sativa ssp. japonica and the Chr3s genome sequences of the other five species. The annotation and CDSs of O. sativa ssp. japonica were downloaded from the Michigan State University (MSU) Rice Genome Annotation Project (RGAP, MSU V7) (http://rice.plantbiology.msu.edu/downloads.shtml).
To search for O. sativa-specific new genes, the first step was to identify the Chr3s orthologous genes among the six species. We used two criteria to define orthologous genes. First, we conducted a BLAT search for Chr3s orthologous genes by aligning the genome sequences of O. glaberrima, O. sativa ssp. indica, O. barthii, O. punctata, O. nivara, and O. rufipogon against the CDSs of O. sativa ssp. japonica. We imposed two requirements: the alignment of the orthologous sequence had to cover over 95% of the length of the O. sativa ssp. japonica CDS, and it had to be located in the syntenic region in all the genomes. An O. sativa ssp. japonica gene was considered to lie in a syntenic region if at least two flanking genes were present in the 30-kb DNA fragment containing the gene hit in the other genomes. Second, orthologous sequences were defined as two sequences that are reciprocal best hits of each other. We conducted the reciprocal searches using BLAT and defined a pair of sequences from two genomes, each having the other as its best hit, as “reciprocal” best hits.
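The coverage requirement and the reciprocal best-hit criterion can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the `Hit` record, its field names, and the function names are hypothetical, the alignment score and identity are assumed to have been computed already from the BLAT output, and the synteny check is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAT hit (hypothetical record; fields are assumptions)."""
    query: str       # sequence used as the query
    target: str      # sequence or locus it aligned to
    score: float     # BLAT alignment score, assumed precomputed
    identity: float  # BLAT identity, assumed precomputed
    q_aligned: int   # number of query bases covered by the alignment
    q_size: int      # full query length in bp


MIN_COVERAGE = 0.95  # alignment must cover over 95% of the japonica CDS


def coverage(hit):
    """Fraction of the query covered by the alignment."""
    return hit.q_aligned / hit.q_size


def best_hits(hits):
    """Map each query to its top hit, ranked by score and then identity."""
    best = {}
    for h in hits:
        cur = best.get(h.query)
        if cur is None or (h.score, h.identity) > (cur.score, cur.identity):
            best[h.query] = h
    return best


def reciprocal_best_hits(forward, reverse):
    """Return (cds, locus) pairs that are each other's best hit.

    forward: hits of japonica CDSs against the other genome
    reverse: hits of the other genome's loci against japonica CDSs
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    pairs = []
    for cds, h in fwd.items():
        back = rev.get(h.target)
        if back is not None and back.target == cds and coverage(h) > MIN_COVERAGE:
            pairs.append((cds, h.target))
    return sorted(pairs)
```

A pair is reported only when both directions agree, so a locus whose own best hit is a different CDS does not count as an ortholog of the CDS that hit it.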
We sorted the hits in descending order by BLAT alignment score and then by BLAT identity score (see http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 for how these two scores are computed) and defined the top-ranked hit as the “best” hit. After identifying the orthologous genes, we filtered them out and kept the remaining annotated genes, which are present only in O. sativa ssp. japonica and/or the other three Asian rice species (O. sativa ssp. indica, O. rufipogon, O. nivara) but are absent in all the African rice species O. glaberrima, O. barthii, and O. punctata. We further aligned the CDSs of the O. sativa ssp. japonica-specific genes to the entire O. glaberrima genome with BLAT and identified their homologous regions in O. glaberrima. These regions were then aligned back to all CDSs of O. sativa ssp. japonica with BLAT. We retained as O. sativa ssp. japonica new gene candidates only those genes that had no reciprocal BLAT best hit in the O. glaberrima genome. These genes likely originated after the divergence between Asian and African rice species about 1 Ma. We further estimated the average rate of synonymous substitution (Ks) using the gKaKs pipeline with the Yn00 method for all Chr3s orthologous genes identified earlier between O. sativa ssp. japonica and O. glaberrima.
To determine the origination pattern of these recently evolved new genes in O. sativa ssp. japonica, we searched for their paralogs in the O. sativa ssp. japonica genome. To identify paralogous gene pairs, we aligned the CDSs of the candidate genes against all the CDSs of O. sativa ssp. japonica with BLAT, requiring a match length of more than 100 bp and a mismatch length/(mismatch length + match length) of less than 0.1. We retained only the paralogous gene pairs with Ks less than 0.0192, which is the average Ks of the orthologous gene pairs between O. sativa ssp. japonica and O. glaberrima and corresponds to a divergence time of 1 Myr.
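The three thresholds for paralogous pairs (match length, mismatch fraction, and Ks) can be expressed as a single filter. The sketch below is illustrative only: the `ParalogPair` record and its field names are hypothetical, Ks is assumed to have been estimated beforehand, and the exclusion of self-hits is an added assumption that the protocol does not spell out.

```python
from dataclasses import dataclass

MIN_MATCH_BP = 100       # match length must exceed 100 bp
MAX_MISMATCH_FRAC = 0.1  # mismatch / (mismatch + match) must be below 0.1
KS_CUTOFF = 0.0192       # average japonica-glaberrima ortholog Ks (~1 Myr)


@dataclass(frozen=True)
class ParalogPair:
    """A candidate gene and one BLAT hit among japonica CDSs (hypothetical)."""
    candidate: str
    paralog: str
    match: int     # matched bases in the alignment
    mismatch: int  # mismatched bases in the alignment
    ks: float      # synonymous substitution rate, assumed precomputed


def mismatch_fraction(pair):
    """mismatch length / (mismatch length + match length)."""
    return pair.mismatch / (pair.mismatch + pair.match)


def is_recent_paralog(pair):
    """Apply the match-length, mismatch-fraction, and Ks thresholds."""
    return (
        pair.candidate != pair.paralog          # assumption: ignore self-hits
        and pair.match > MIN_MATCH_BP           # also keeps the denominator > 0
        and mismatch_fraction(pair) < MAX_MISMATCH_FRAC
        and pair.ks < KS_CUTOFF
    )


def recent_paralogs(pairs):
    """Keep only the pairs passing all three thresholds."""
    return [p for p in pairs if is_recent_paralog(p)]
```

Checking the match length first means the mismatch fraction is never evaluated for an empty alignment.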
We further removed genes whose annotations contained the terms “retrotransposon protein” or “transposon protein” to define the list of O. sativa ssp. japonica new gene candidates. Next, to test whether these O. sativa lineage-specific new genes were ancient duplicate genes that were lost in African Oryza species, we applied reciprocal BLASTP searches to determine whether the new gene candidates have orthologous copies in other, distantly related species. We searched the protein sequences of these new gene candidates with BLASTP against all proteins in UniProt (http://www.uniprot.org/), which includes SwissProt and TrEMBL data. If a new gene candidate had hits in other species, we searched these hits back against all O. sativa ssp. japonica proteins with BLASTP (http://www.gramene.org/Multi/blastview). If the best hit of this reverse search was the new gene candidate itself, we removed the candidate. We also used RepeatMasker (RepeatMasker libraries version: rm-20120418) to scan for transposons in the CDSs of new gene candidates. [...]
We calculated the ratio of nonsynonymous to synonymous substitution rates (Ka/Ks, denoted as “ω”) using the maximum likelihood algorithm (codeml) implemented in the PAML package. The significance of the deviation of ω from neutrality (ω = 1) was tested using the likelihood ratio test (LRT). We aligned the sequences of paralogous/orthologous gene pairs using bl2seq. We used codeml to calculate the ω value between the two sequences. We then ran codeml with two models (ω fixed at 1 and ω varying freely) to test whether any of the identified new genes were significantly under natural selection. Phylogenetic analysis of the gene tree was performed using the Neighbor Joining algorithm implemented in PAUP*. The CDSs of the gene family were aligned using ClustalW.
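The likelihood ratio test comparing the model with ω fixed at 1 against the model with ω free can be sketched as below. The protocol does not state the reference distribution; the sketch assumes the usual chi-square distribution with 1 degree of freedom, because the two models differ by one free parameter. The function names and the log-likelihood values in the usage example are hypothetical.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """2 * (lnL_alt - lnL_null), floored at 0 to absorb optimizer noise."""
    return max(0.0, 2.0 * (lnl_alt - lnl_null))


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)), so no external library is needed.
    """
    return math.erfc(math.sqrt(x / 2.0))


def omega_lrt(lnl_null, lnl_alt):
    """Return (statistic, p-value) for H0: omega = 1 vs. H1: omega free.

    Assumes a chi-square reference distribution with 1 degree of freedom.
    """
    stat = lrt_statistic(lnl_null, lnl_alt)
    return stat, chi2_sf_df1(stat)


if __name__ == "__main__":
    # Hypothetical log-likelihoods as reported by two codeml runs.
    stat, p = omega_lrt(lnl_null=-1005.0, lnl_alt=-1000.0)
    print(f"2*deltaLnL = {stat:.3f}, p = {p:.4g}")
```

A small p-value indicates that ω differs from 1; whether that reflects purifying or positive selection depends on whether the estimated ω is below or above 1.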
Bootstrap analysis with 1,000 replicates was used to assess the robustness of the branches.
To test whether ω < 1 arises because the parental gene is under strong purifying selection while the new gene is a pseudogene evolving neutrally, we applied the PAML branch model to calculate ω values for the branch leading to the new genes. We first downloaded the recently completed whole-genome sequences of O. glaberrima, O. barthii, and O. punctata from http://www.iplantcollaborative.org. We identified the orthologous sequences of the parental genes in the three outgroup species using the ortholog search approach described earlier. We aligned only the homologous region of all sequences using MAFFT and Perl scripts. We estimated ω for the foreground branch leading to the O. sativa ssp. japonica lineage-specific new gene and for the background branches leading to the parental genes and their orthologous genes in the outgroup species (O. glaberrima, O. barthii, and O. punctata). We used a two-ratio model allowing different ω values in foreground and background branches with PAML codeml. The significance of the foreground branch ω was tested using an LRT against the null model in which the foreground ω is fixed at 1 and the background ω varies freely. […]
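One way to set up the two codeml runs for the branch test is to generate a pair of control files that differ only in whether the foreground ω is fixed. The sketch below is an assumption-laden illustration: the file names, the tree string, and the settings for `runmode`, `seqtype`, `CodonFreq`, and `NSsites` are not given in the protocol, and the behaviour that `fix_omega = 1` with `model = 2` fixes the ω of the last labelled branch class should be checked against the PAML manual for the version in use.

```python
# Hypothetical tree: the branch leading to the new gene is labelled #1
# (foreground); all unlabelled branches form the background class.
EXAMPLE_TREE = "((new_gene #1, parental_gene), O_glaberrima, O_barthii, O_punctata);"


def codeml_ctl(seqfile, treefile, outfile, fix_foreground_omega):
    """Build the text of a codeml control file for a two-ratio branch model.

    fix_foreground_omega=False: foreground omega estimated freely.
    fix_foreground_omega=True:  null model, foreground omega fixed at 1.
    """
    options = {
        "seqfile": seqfile,    # codon alignment (hypothetical file name)
        "treefile": treefile,  # tree with the #1 foreground label
        "outfile": outfile,
        "runmode": 0,          # user-supplied tree (assumed setting)
        "seqtype": 1,          # codon sequences
        "CodonFreq": 2,        # F3x4 codon frequencies (assumed setting)
        "model": 2,            # separate omega per labelled branch class
        "NSsites": 0,          # one omega across sites
        "fix_omega": 1 if fix_foreground_omega else 0,
        "omega": 1,            # fixed value if fix_omega = 1, else start value
    }
    return "\n".join(f"{key:>10} = {value}" for key, value in options.items()) + "\n"


if __name__ == "__main__":
    alt = codeml_ctl("aln.phy", "tree.nwk", "two_ratio.out", False)
    null = codeml_ctl("aln.phy", "tree.nwk", "two_ratio_fixed.out", True)
    print(alt)
    print(null)
```

The log-likelihoods reported by the two runs then feed the LRT, with one degree of freedom because the models differ by a single free parameter.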

Pipeline specifications

Software tools: BLAT, BLASTP, RepeatMasker, PAML, PAUP*, Clustal W, MAFFT
Databases: Gramene, Ricedb, RGAP
Applications: Phylogenetics, WGS analysis
Organisms: Oryza sativa, Oryza sativa f. spontanea, Oryza rufipogon, Oryza glaberrima, Oryza barthii, Oryza punctata