Computational protocol: Conserved but Attenuated Parental Gene Expression in Allopolyploids: Constitutive Zinc Hyperaccumulation in the Allotetraploid Arabidopsis kamchatica

Similar protocols

Protocol publication

[…] The diploid parental species A. halleri and A. lyrata each possess eight chromosomes (n = 8, 2n = 2x = 16) and the allopolyploid hybrid has 2n = 4x = 32 chromosomes. Because A. kamchatica is a self-fertilizing species (), the A. halleri- and A. lyrata-derived chromosomes are generally homozygous. This effectively allows us to treat the eight A. halleri- and eight A. lyrata-derived chromosomes as haploid genomes. We created homeologous A. kamchatica (A. halleri-derived = H-origin) and (A. lyrata-derived = L-origin) diploid-guided reference genomes for a Japanese (MUR) and an Alaskan (PAK) accession using de novo assemblies of A. halleri subsp. gemmifera and A. lyrata subsp. petraea from . Genomic DNA from both accessions of A. kamchatica (accessions MUR: Murodo, Japan and PAK: Potter, Alaska) was extracted from leaf tissue using the DNeasy Plant Kit (Qiagen) and the total DNA was sequenced using Illumina HiSeq 2000 with insert size of 200 bp. DNA reads from A. kamchatica were mapped using BWA-MEM version 0.7.5a on the two diploid genomes independently. We classified the reads to each parental (H-origin or L-origin) using HomeoRoq (http://seselab.org/homeoroq, last accessed July 14, 2016). In this method, reads are first mapped to both parental genomes, and then classified as H-origin, L-origin, common, or unclassified (see in for schematic diagram). If a read was mapped to only one of the parental genomes, it was discarded. After mapping to the A. halleri genome, we detected A. kamchatica halleri-origin (H-origin) reads and identified single-nucleotide polymorphisms (SNPs) and short insertions and deletions using Samtools 0.1.18. Then, the nucleotides were replaced on the detected variant position with the alternative nucleotides if the positions were covered by at least five reads. At least 80% of the reads were required to have the alternative nucleotide from the reference to be called a SNP. The A. kamchatica lyrata-origin (L-origin) genome was updated in a similar manner. These processes, mapping, read classification, and reference modification, were repeated ten times. We used only origin reads the first five times and both origin and common reads the last five times for the reference modification. Then, using the updated genomes, we performed the SNP replacement iteration three more times. For the final reference of each A. kamchatica accession, we assumed that both imputed references derived from parental species were homeologous genomes. The de novo diploid genome assemblies and the polyploid SNP replacement strategy decreases mapping bias that would come from mapping only to a single common genome, which improves our estimates of homeolog expression levels for RNA-seq experiments. [...] We extracted total RNA using RNeasy (Qiagen) from three replicates of both leaf and root tissues from two A. kamchatica (MUR: Murodo, Japan and PAK: Potter, Alaska), A. halleri subsp. gemmifera (Tada mine), and A. lyrata subsp. petraea (Siberia) sampled at time zero (no Zn treatment) and 48 h after 500 μM Zn treatment. In total, 48 cDNA libraries were constructed at the Functional Genomics Center, University of Zurich, and then sequenced using Illumina HiSeq 2000 generating 2× paired-end 100 bp reads. Low-quality regions of the sequenced reads were trimmed using Trimmomatic-0.30 () before mapping. We used STAR-2.3.0 () to map RNA-seq reads in A. kamchatica with default parameter settings to map reads and the HomeoRoq pipeline to classify reads to H-origin or L-origin genomes. Mapping and read classification were done using the reference genomes with SNP replacement for two A. kamchatica accessions as described above for the genome assembly step. Therefore, the reference genomes of each A. kamchatica genotype match those used in the RNA-seq experiments. If a read was mapped to multiple positions, the aligned position with a “primary” flag (the position with the best alignment score) was adopted to quantify expression levels to avoid ambiguous results caused by multiply mapped reads. Reads that only mapped to one parental genome were discarded. Using this type of read to quantify homeolog expression levels may cause a biased estimation of expression levels attributed to the difference in quality of the assembled genomes of two parental species, even though part of them would truly represent species-specific gene fragments. We mapped A. halleri and A. lyrata RNA-seq reads directly using STAR-2.3.0 without the read classification step.We identified 25,509 homeologous contigs and scaffolds in the two A. kamchatica genotypes using a reciprocal blast hit (best-to-best) strategy using a reciprocal best hit (best-to-best) based on BLAST E values of 10−15 for each A. kamchatica genotype. Among these, 19,820 are orthologous with A. thaliana also using a reciprocal blast hit strategy. We also detected orthologous genes between the predicted genes on diploid A. halleri and A. lyrata contigs with A. thaliana genes using TAIR 10 gene annotations () using reciprocal best hits based on BLAST E values of 10−15. We limited our dataset by TAIR gene IDs that are common to both A. halleri- and A. lyrata-derived copies in both diploids and polyploid genomes to compare gene expression between homeologs and between species. The use of reciprocal best hits and limiting the gene dataset to items containing A. thaliana orthologous gene models is somewhat conservative, but we are primarily concerned with homologous gene pairs with annotated function. We also used one-way blast of A. lyrata-derived scaffolds as the query against the A. halleri-derived reference genome to confirm that our reciprocal blast results blasted to the same contig if there were multiple hits in the one-way blast. In addition, when gene duplications occur in a particular species, the sum of the expression of those duplicates can drive the function underlying the phenotype of interest (e.g., HMA4 and hyperaccumulation). The highly similar duplicates are likely to be assembled as a single copy in the current assembly (see the section regarding pyrosequencing validation of homolog expression) because of their short overall contig length. [...] Differentially regulated (up- and down-regulated) homeologs in control and Zn treatments were estimated using edgeR () with only H-origin or L-origin read count data. We first performed differential expression analyses for each homeolog (H-origin or L-origin), tissue (leaf or root), and genotype (Alaskan or Japanese) separately. Diploid gene regulation was estimated similarly using orthologous read count data to detect up- and downregulation from control to Zn-treated samples. Significance was determined using a false discovery rate (FDR; BH test) cutoff of <0.05. If the RPKM was <0.1 in both control and Zn-treated samples, the gene was considered unexpressed and was not counted as differentially regulated if significant. We conducted GO analysis on differentially regulated genes using agriGO () and GO categories with ≥30 query genes for MF and a FDR P value threshold of <0.05. Changes in homeolog ratios from control to Zn-treated conditions were estimated using the statistical test described by to control for overdispersion and reduce false-positive ratio-change estimates.RPKM reported as total levels of gene expression were estimated in the allopolyploids including both origin and common reads counted using HTSeq 0.5.4 in union mode after the read classification was bundled in the HomeoRoq pipeline as described in the following : RPKMx,i=109Cx,iGiT, where x∈{lyr, hal} (1)Chal,i=Hi+ Mi HiHi+Li Clyr,i=Li+MiLiHi+Li where Cx,i is the number of reads that are mapped on gene i in species s, Gi is the length of gene i in the A. halleri genome, and T=Chal,i+Clyr,i is the total number of mapped reads. In our procedure, there are three types of count values produced after read classification: 1) Hi: the number of reads mapped to A. halleri-derived homeolog i, 2) Li: the number of reads mapped to A. lyrata-derived homeolog i, and 3) Mi: the number of reads that cannot be classified in the parental species because the reads are identical to the genomes of both species (i.e., common reads). The Mi reads were allocated to one of the species at a rate proportional to the H-origin and L-origin ratio of each homeolog for total RPKM comparisons. That is, two RPKM values were estimated corresponding to each gene in the A. kamchatica genome, one for the A. halleri-derived homeolog (H-origin), and one for the A. lyrata-derived homeolog (L-origin). In the diploid, RPKM values were calculated normally after counting mapped reads to the reference without the read classification. The RPKM values of tetraploids and diploids would be comparable despite minor methodological differences.Complete datasets of A. kamchatica (MUR and PAK genotype leaf and root) transcriptomics can be found in supplementary files 5 and 6, Supplementary Material online, and the A. halleri and A. lyrata transcriptomics datasets can be found in supplementary file 7, Supplementary Material online. Supplementary file 2, Supplementary Material online shows candidate gene expression ratios and for both polyploid genotypes and parental diploid species. [...] Using the approach described above for diploid-guided reference assembly for 20 A. kamchatica genotypes, we generated consensus contigs for H-origin and L-origin genes for each genotype that align to the BAC sequences from A. halleri subsp. halleri (GenBank BAC accession numbers: EU382072.1 and EU382073.1) containing the HMA4 locus (; ). Coding sequences from the 20 A. kamchatica H-origin and L-origin HMA4 regions were aligned using Muscle and Geneious (v. 6.06). We also aligned our H-origin and L-origin contigs to a corresponding region on JGI’s A. lyrata () chromosome 3 (Alyr_JGI_scaffold_3.23427106-23532642) containing the orthologous HMA4 region. This showed the same physical order of 11 H-origin and L-origin contigs surrounding and including the HMA4 coding sequences on both A. halleri (BAC) and A. lyrata (JGI) genomes (supplementary fig. S8 A and B, supplementary material online), and we assume the same orientation and synteny in A. kamchatica for our diversity comparisons.The A. halleri and A. lyrata genomes have some important structural differences surrounding the HMA4 locus, where the published A. halleri BAC sequence is nearly 2.5 times longer (including intergenic space and introns) than that of the corresponding region containing these 11 genes on chromosome 3 in the JGI A. lyrata assembly. This is largely because of the triplication of the HMA4 gene on the BAC assembly. Our de novo assemblies of A. halleri and A. lyrata each contain only a single contig with the HMA4 gene (homologous to A. thaliana AT2G19110.1) identified as homeologs by both reciprocal blast hit and one-way blast of the A. lyrata genome to the A. halleri genome. Moreover, many of the genes identified as surrounding the HMA4 locus by aligning to the BAC sequence in A. halleri are also on separate contigs in our diploid reference assemblies described in supplementary file 3, Supplementary Material online.The 11 gene alignments are orthologous to A. thaliana TAIR10 annotated gene models: AT2G19010.1, AT2G19045.1, AT2G19050.1, AT2G19060.1, AT2G19070.1, AT2G19080.1, AT2G19090.1, AT2G19110.1, AT2G19130.1, AT2G19150.1, AT2G19160.1, and AT2G19170.1. We also included two contigs corresponding to two genes outside of the A. halleri BAC sequence and A. lyrata scaffold corresponding to the S1 and S13 genes used by to estimate background diversity between homeologs (supplementary file 4, Supplementary Material online). The genes outside of the BAC sequence are: AT2G18750.1 and AT2G19490.1. Eight of these genes correspond to the gene regions S1, S2, S3, S4 (including HMA4-1, AT2G19110.1), S11, S12, and S13, with S1 (AT2G18750.1) and S13 (AT2G19490.1) not linked (“background genes”) and outside of the BAC sequence. We then aligned the 13 H-origin and L-origin coding sequences from 20 A. kamchatica genotypes to A. thaliana orthologs to use as a common outgroup for each homeolog. Summary and diversity statistics, including divergence from A. thaliana, were estimated using libsequence () packages and custom R and Perl shell scripts. The HKA test was performed using DNASP (). […]

Pipeline specifications

Software tools BWA, HomeoRoq, SAMtools, Trimmomatic, STAR, edgeR, agriGO, HTSeq
Application RNA-seq analysis
Organisms Arabidopsis thaliana, Arabidopsis lyrata, Homo sapiens
Chemicals Zinc