Computational protocol: Transcript Polymorphism Rates in Soybean Seed Tissue Are Increased in a Single Transformant of Glycine max

Protocol publication

[…] In this study, we used datasets obtained from whole-transcriptome sequencing of transgenic soybean seed tissues expressing three different transgenes to detect exomic polymorphisms. Each experimental group consisted of nine seeds: one group for each of the three transgenic events (designated ST111, ST77, and 764) and one for the wild type (WT) controls. All transgenic plants were generated using Agrobacterium transformation and taken to homozygosity over multiple generations prior to sequence acquisition. The transgenes and recombinant proteins expressed and accumulated in the seed tissues have been described previously by our lab. Quality control filtering and assembly were conducted as part of the previous study prior to polymorphism detection and annotation with samtools and bcftools. [...]

Soybean cultivation, RNA extraction, and differential gene expression analyses were carried out using RNA-seq as previously described in Lambirth et al. To facilitate exome SNP calls, the soybean reference genome version 2.75 sequence file obtained from Phytozome was amended to contain scaffolds of all three T-DNA sequences, allowing alignment of nonreference transgene transcripts, and was then indexed with the faidx command of samtools version 1.2. Alignment files in .bam format previously generated by TopHat for all previously reported samples were converted to .bcf files using the samtools mpileup command with the -g and -f parameters, which specify the output format and the indexed reference FASTA file, respectively. The bcftools call command was subsequently used on the indexed .bcf files with the -c parameter to invoke the original consensus calling method, enabling SNP and INDEL identification. The bcftools stats and plot-vcfstats commands were used to generate statistical summaries for each sample.
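The command sequence described above can be sketched as a small Python helper that only builds the shell commands for one sample, so they can be reviewed before anything is run. This is an illustrative sketch, not the authors' script: the function name, the file names, the `calls` output directory, and the `-O v -o` output options passed to `bcftools call` are assumptions that do not appear in the original text.

```python
import shlex
from pathlib import Path


def variant_calling_commands(reference_fasta, bam_path, out_dir="calls"):
    """Build the shell commands for one sample, in execution order.

    Nothing is executed here; the strings are returned for inspection.
    All paths and the output directory are hypothetical placeholders.
    """
    sample = Path(bam_path).stem  # e.g. "WT_1" from "WT_1.bam"
    out = Path(out_dir)
    ref = shlex.quote(str(reference_fasta))
    bam = shlex.quote(str(bam_path))
    bcf = shlex.quote(str(out / (sample + ".bcf")))
    vcf = shlex.quote(str(out / (sample + ".vcf")))
    stats = shlex.quote(str(out / (sample + ".stats")))
    plots = shlex.quote(str(out / (sample + "_plots")))
    return [
        # Index the T-DNA-amended reference (only needed once per reference).
        f"samtools faidx {ref}",
        # -g: write BCF output; -f: faidx-indexed reference FASTA.
        f"samtools mpileup -g -f {ref} {bam} > {bcf}",
        # -c: original consensus caller. "-O v -o" (plain VCF output file)
        # is an assumption added for this sketch.
        f"bcftools call -c -O v -o {vcf} {bcf}",
        # Per-sample summary statistics and the corresponding plots.
        f"bcftools stats {vcf} > {stats}",
        f"plot-vcfstats -p {plots} {stats}",
    ]


if __name__ == "__main__":
    for command in variant_calling_commands("Gmax_with_TDNA.fa", "WT_1.bam"):
        print(command)
```

Returning strings instead of running the tools keeps the sketch independent of any particular samtools or bcftools installation; option behavior can differ between releases, so the flags should be checked against the installed version.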
Total SNP counts per sample were averaged within each experimental group of wild type and transgenic seeds, the standard deviation and standard error were calculated, and unpaired two-tailed t-tests were used to compare each transgenic group with wild type. Individual nucleotide base changes, transition and transversion rates, INDELS, and single- and multiallele SNPs were also recorded and compared between groups. To identify SNPs present in each experimental group, the intersection of the variant files for all nine individual replicates was taken within each of the ST77, ST111, 764, and WT groups. SNPs in these combined files that overlapped between all three transgenic groups, as well as those unique to each event when compared to WT, were pulled from the .vcf files using the vcf-isec command in vcftools. SNPs shared between the transgenics and the WT control group were removed to generate .vcf files containing SNPs found exclusively in each transformation event.

Visualization of SNPs and INDELS was conducted using Circos version 0.69-2. Variant files in .bcf format from the samtools/snpEff output were converted to .vcf files with the bcftools view -o command; then, using the cut -f 1,2,6 command, the chromosome, position, and quality score of each call (VCF columns 1, 2, and 6: CHROM, POS, and QUAL) were extracted. The formatted tab-delimited files were converted to the input text file for Circos using an in-house script written in Python (version 2.7.11), which filtered the calls to exclude all polymorphisms with quality scores below 15. The script is included in the Additional Files of this manuscript (in the Supplementary Material available online). Plots were then generated using the circos -conf circus.conf command to display SNP distributions of the transgenic and wild type samples for all 20 soybean chromosomes.
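The quality filter and reformatting step described above might look roughly like the following. This is not the authors' supplementary script, which is not reproduced here; it is a minimal sketch assuming the input is the three-column output of `cut -f 1,2,6` on a VCF file and that Circos should receive rows in its usual `chromosome start end value` layout. The function name and the `chrom_prefix` parameter are hypothetical, and the chromosome names must match whatever the Circos karyotype file uses.

```python
MIN_QUALITY = 15.0  # calls scoring below this threshold are dropped


def to_circos_rows(lines, min_quality=MIN_QUALITY, chrom_prefix=""):
    """Yield 'chrom start end value' rows for calls that pass the filter.

    `lines` is any iterable of tab-delimited CHROM, POS, QUAL records.
    `chrom_prefix` (hypothetical) lets names be adapted to the karyotype.
    """
    for line in lines:
        line = line.strip()
        # VCF header lines pass through cut unchanged and start with "#".
        if not line or line.startswith("#"):
            continue
        chrom, pos, qual = line.split("\t")[:3]
        if qual == ".":  # a missing quality score cannot be compared
            continue
        if float(qual) < min_quality:
            continue
        # A SNP occupies one position, so start and end are identical.
        yield f"{chrom_prefix}{chrom} {pos} {pos} {qual}"


if __name__ == "__main__":
    example = [
        "#CHROM\tPOS\tQUAL",
        "Chr01\t100\t14.9",
        "Chr01\t250\t15",
        "Chr02\t42\t221.5",
    ]
    for row in to_circos_rows(example):
        print(row)
```

In the example, the call with quality 14.9 is excluded and the one scoring exactly 15 is kept, which matches the "below 15" wording of the text.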
SNPs were spaced in the plots every 20 bases and layered vertically.

To predict any possible translational effects resulting from detected SNPs and INDELS, snpEff version 4.1i was used on the variant call files generated by bcftools. A custom database for snpEff was constructed using the build command from the soybean genome FASTA reference file described previously, which contained our T-DNA sequences, together with the gene model .gff3 file (obtained from Phytozome) from the Cufflinks output of our previous RNA-seq data. The .gff3 file provided a reference index for gene positions and identifiers, as well as intron/exon models and untranslated regions. No codon table configuration was necessary because Glycine max uses the standard codon table, which allowed snpEff to run with the default parameters on the SNP/INDEL call .vcf file from bcftools. Multithreaded processing with the -t option was not used, as it disables the statistical calculations and omits the resulting reports from the output. Statistical comparisons of SNP rates and effects between groups were conducted using two-tailed unpaired t-tests in Microsoft Excel. Functional annotation of genes containing variants was accomplished by loading the gene output list from snpEff into the agricultural gene ontology (GO) enrichment tool agriGO, using the integrated singular enrichment analysis tool with the Glycine max Wm82.a2.v1 background reference. Significant terms were identified using Fisher's exact test with a false discovery rate corrected p value threshold of 0.05. All computations were conducted on an Apple MacBook Pro with a 2.7 GHz quad-core Intel i7 processor and 16 GB of DDR3 RAM. […]
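The enrichment test mentioned above can be illustrated with a one-sided Fisher's exact test computed from the hypergeometric distribution. This is a sketch of the general technique only: it is not agriGO's implementation, the function and argument names are hypothetical, the gene list is assumed to be a subset of the background, and the false discovery rate correction applied across GO terms is not shown.

```python
from math import comb


def enrichment_p_value(hits_in_list, list_size, hits_in_background, background_size):
    """One-sided Fisher's exact test for over-representation of a GO term.

    Returns the probability of drawing at least `hits_in_list` annotated
    genes when `list_size` genes are sampled without replacement from a
    background of `background_size` genes, of which `hits_in_background`
    carry the term.
    """
    upper = min(list_size, hits_in_background)
    total = comb(background_size, list_size)
    # Sum the hypergeometric probabilities of the observed count and of
    # every more extreme (larger) count.
    tail = sum(
        comb(hits_in_background, i)
        * comb(background_size - hits_in_background, list_size - i)
        for i in range(hits_in_list, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 4 of 5 listed genes carry a term held by 5 of 10 genes.
    print(enrichment_p_value(4, 5, 5, 10))
```

Exact integer arithmetic via `math.comb` (Python 3.8 or later) avoids rounding error in the tail sum; the p values from all tested terms would then be adjusted for multiple testing before applying the 0.05 threshold.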

Pipeline specifications

Software tools: SAMtools, bcftools, TopHat, VCFtools, Circos, SnpEff, Cufflinks, agriGO
Databases: Phytozome
Applications: RNA-seq analysis, Genome data visualization
Organisms: Glycine max
Chemicals: Nucleotides