Computational protocol: De novo assembly of the pennycress (Thlaspi arvense) transcriptome provides tools for the development of a winter cover crop and biodiesel feedstock

Protocol publication

[…] A pooled sample containing equal amounts of purified total RNA from each of the five tissue samples was submitted to the University of Minnesota Biomedical Genomics Center for sequencing. RNA was subjected to quality control using the Invitrogen RiboGreen RNA assay (Life Technologies), and RNA integrity was analyzed by capillary electrophoresis on an Agilent BioAnalyzer 2100 (Agilent Technologies, http://www.agilent.com). Polyadenylated RNA was selected using oligo(dT) purification and reverse-transcribed to cDNA. cDNA was fragmented, blunt-ended, and ligated to the Illumina TruSeq Adaptor Index 3 (Illumina Inc., http://www.illumina.com). The library was size-selected for an insert size of 200 bp and quantified using the Invitrogen PicoGreen dsDNA assay (Life Technologies).

The pooled RNA sample was sequenced on the Illumina HiSeq 2000 platform with 100 bp paired-end reads, producing 374 million reads above Q30. Read pairs had a mean insert size of 200 bp.

Duplicate reads were removed, and the first 10 nucleotides were trimmed from the 5′ end of each read using the tools in the CLC Genomics Workbench 5.5 (CLC Bio, http://www.clcbio.com). The full set of trimming parameters was: removal of low-quality sequence, limit = 0.05; removal of ambiguous nucleotides, maximum two nucleotides allowed; removal of terminal nucleotides, 10 nucleotides from the 5′ end; removal of Illumina TruSeq Indexed Adaptor 3 and Universal Adapter sequences.

Reads were de novo assembled into contigs using the CLC Genomics Workbench 5.5 de novo assembly tool. A series of independent assemblies was performed to analyze the effects of varying the de novo assembly parameters. Assemblies were performed with each combination of word size (18, 24, 30, 36, 40, 46, 52, 58 and 64) and length fraction (match length; 0.7 and 0.95), giving 18 assemblies. An additional 23 assemblies were performed using values outside these parameters, for a total of 41 assemblies.
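The parameter sweep described above can be sketched as a simple grid enumeration. This is only an illustration of how the assembly count adds up, not the CLC Genomics Workbench interface: the dictionary keys `word_size` and `length_fraction` and the function name `assembly_grid` are hypothetical names chosen here, and the 23 additional assemblies are represented only as a count because their parameter values are not listed in the protocol.

```python
from itertools import product

# Values taken from the protocol text.
WORD_SIZES = [18, 24, 30, 36, 40, 46, 52, 58, 64]
LENGTH_FRACTIONS = [0.7, 0.95]

# Count of extra assemblies reported in the protocol; their settings are not given.
N_ADDITIONAL_ASSEMBLIES = 23


def assembly_grid(word_sizes=WORD_SIZES, length_fractions=LENGTH_FRACTIONS):
    """Return one parameter set per (word size, length fraction) combination.

    The key names are illustrative and are not CLC option names.
    """
    return [
        {"word_size": w, "length_fraction": f}
        for w, f in product(word_sizes, length_fractions)
    ]


grid = assembly_grid()
total_assemblies = len(grid) + N_ADDITIONAL_ASSEMBLIES

print(f"grid assemblies: {len(grid)}")
print(f"total assemblies: {total_assemblies}")
```

Nine word sizes crossed with two length fractions give the 18 grid assemblies, which together with the 23 additional runs accounts for the 41 assemblies reported.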
The remaining assembly parameters were: auto bubble size, yes; minimum contig length, 300 bp; perform scaffolding, yes; mismatch cost, 3; insertion cost, 3; deletion cost, 3; update contigs, yes.

Functional annotations and gene ontologies were assigned to each assembled contig from the final assembly using Blast2GO with the following parameters: BLASTx against the NCBI non-redundant protein database, BLAST E-value = 0.001, and reporting the top 20 hits. Comparative BLAST searches against Arabidopsis were performed using the CLC Genomics Workbench BLAST function, using sequences obtained from the TAIR10 release of the Arabidopsis transcriptome and proteome (http://www.arabidopsis.org). Sequences for Arabidopsis lyrata, Capsella rubella, B. rapa and T. halophila were obtained from Phytozome v9.1 (http://www.phytozome.net). Further statistical analysis and figures were prepared using R.

The final assembly described here has been submitted to DDBJ/EMBL/GenBank under the accession GAKE01000000. The complete, annotated FASTA file is available at http://www.cbs.umn.edu/lab/marks/pennycress/transcriptome. […]
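The hit-selection settings given for the annotation step (E-value of 0.001, top 20 hits per contig) can be illustrated with a minimal sketch. Blast2GO applies these settings internally, so the code below is not the authors' method. It assumes, purely for illustration, that hits are available in the standard 12-column BLAST tabular format, with the E-value in column 11 and the bit score in column 12. The function name `filter_hits` and the example records are hypothetical.

```python
from collections import defaultdict

MAX_EVALUE = 1e-3  # E-value cutoff stated in the protocol
MAX_HITS = 20      # number of top hits reported per contig


def filter_hits(lines, max_evalue=MAX_EVALUE, max_hits=MAX_HITS):
    """Keep, per query, the best hits that pass the E-value cutoff.

    Assumes tab-separated BLAST tabular records:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            hits[query].append((subject, evalue, bitscore))
    # Rank by ascending E-value, breaking ties by descending bit score.
    return {
        query: sorted(found, key=lambda h: (h[1], -h[2]))[:max_hits]
        for query, found in hits.items()
    }


# Hypothetical records, for illustration only.
example = [
    "contig_1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "contig_1\tprotB\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30",
]
kept = filter_hits(example)
print(kept)
```

In this example the second record is discarded because its E-value of 0.5 exceeds the 0.001 cutoff.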

Pipeline specifications

Software tools: CLC Genomics Workbench, CLC Assembly Cell, Blast2GO, BLASTX
Application: RNA-seq analysis
Organisms: Arabidopsis thaliana
Diseases: Metabolic Diseases