Computational protocol: Draft De Novo Transcriptome of the Rat Kangaroo Potorous tridactylus as a Tool for Cell Biology

Protocol publication

[…] To sequence the rat kangaroo transcriptome, we extracted total RNA from unsynchronized cultured rat kangaroo PtK2 cells. Thus, this transcriptome reflects the transcripts present in these cultured PtK2 kidney epithelial cells. We enriched for mRNA using poly(A) tail selection and constructed a cDNA sequencing library with an average insert size of 275 bp. We performed next-generation sequencing via a paired-end 150-cycle rapid run on the Illumina HiSeq2500, generating 679,303,792 raw reads, corresponding to very high coverage depth. We sequenced over 99 billion nucleotides, of which 98.4% had a quality score of at least Q20 (i.e. a per-base sequencing error probability of at most 1%); the GC content was 49.9%.

We assembled the transcriptome de novo using the Trinity software package. This software was specifically designed for reconstructing a full-length transcriptome from RNA sequencing (RNA-Seq) data when a genome sequence is not available. From this point on, we will refer to our assembled transcript isoforms as “Trinity transcripts” and to inferred loci emitting one or more related isoforms as “Unigenes”. The breakdown of Trinity transcripts and Unigenes with respect to coding potential and isoform multiplicity is given in . We assembled 347,323 different Trinity transcripts, and these had a mean length of 1,197 nt and an N50 of 3,405 nt (i.e. 50% of the assembled bases were incorporated in Trinity transcripts of ≥3,405 nt). We analyzed the relative abundance of each Trinity transcript and Unigene, reported as TPM (transcripts per million), using RSEM (RNA-Seq by Expectation Maximization). There was a relatively high number of non-coding Unigenes with predominantly low abundance and low isoform multiplicity.
In contrast, the 20,079 protein-coding Unigenes had an average of 3.7 isoforms each and displayed a bimodal abundance distribution, with about 10,000 Unigenes at a low abundance similar to that of non-coding Unigenes, and a second population of about 10,000 higher-abundance Unigenes.

We annotated the translated Trinity transcripts using i) BLASTP similarity search against the SwissProt protein database, ii) protein family classification based on the Pfam database, iii) gene ontology (GO) mapping, and iv) orthologous group classification (eggNOG, evolutionary genealogy of genes: Non-supervised Orthologous Groups). In total, the 75,290 Trinity transcripts that were identified with open reading frames correspond to 20,079 Unigenes (unique genes), of which 7,846 have transcripts in a distinct cluster and 12,233 have a single transcript not in a cluster. As an initial test of transcriptome quality, we searched the transcriptome for the mitotic gene KIF2C/MCAK (gene marked with a star), whose full transcript sequence was previously known from PtK1 cells and is available on NCBI. There was only one nucleotide mismatch between the two protein-coding sequences, consistent with accurate transcriptome sequencing, assembly and annotation. [...]

We have made the sequencing, assembly and annotation data publicly available as a resource for the community. First, all raw sequencing reads have been submitted to the NCBI Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra) under accession number SRP055986. Second, as noted above, we have included processed data as Supporting Information files: Trinity assembled transcript sequences, isoform abundances, Unigene abundances and protein annotations. Third, we created a web portal (http://dumontlab.ucsf.edu/ratkangaroo.htm) for the rat kangaroo transcriptome to allow researchers across fields to more easily access the assembled and annotated data.
There, the first browser allows the user to browse the different Trinity transcripts using a custom UCSC Genome Browser interface running in the Amazon Web Services (AWS) cloud and maintained by Maverix Biomics, Inc., in which mRNA transcripts have been substituted for chromosomes. Trinity transcripts are assembled and numbered, and a BLAT (BLAST-like alignment tool) search tool enables mapping of any input protein or nucleotide sequence to the rat kangaroo transcriptome. The second, custom-designed transcript browser contains Trinotate-annotated transcript information and allows the user to search by gene description or by transcript ID (from the first browser). This browser contains the above annotation analyses, with links to external databases, and abundance statistics (TPM and FPKM, fragments per kilobase of transcript per million fragments mapped) for each Trinity isoform. Both browsers were designed to be used by a broad range of biologists and do not require specialized knowledge. […]
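
The two summary metrics quoted in the excerpt above, N50 and TPM, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: Trinity and RSEM ship their own tools for these statistics, and RSEM first estimates expected read counts by expectation maximization before normalizing. The function names `n50` and `tpm`, the use of per-transcript effective lengths, and the toy input values are all assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a set of transcript lengths.

    N50 is the length L such that transcripts of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest transcript down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def tpm(counts, effective_lengths):
    """Convert per-transcript read counts to transcripts per million.

    Each count is divided by the transcript's effective length to get
    a per-base rate; rates are then scaled so that they sum to 1e6.
    (Hypothetical helper: it shows only the final normalization, not
    RSEM's expectation-maximization step.)
    """
    if len(counts) != len(effective_lengths):
        raise ValueError("counts and effective_lengths differ in size")
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total_rate = sum(rates)
    if total_rate == 0:
        raise ValueError("all counts are zero")
    return [r / total_rate * 1e6 for r in rates]


if __name__ == "__main__":
    # Toy values, not data from the rat kangaroo assembly.
    toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
    print("N50:", n50(toy_lengths))
    print("TPM:", tpm([100, 300], [1000, 1000]))
```

Because TPM values always sum to one million within a sample, they describe relative abundance only; comparing absolute expression between samples needs additional normalization.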

Pipeline specifications

Software tools: Trinity, RSEM, BLASTP, AWS, BLAT, Trinotate
Databases: Pfam, UCSC Genome Browser
Applications: Miscellaneous, RNA-seq analysis, Transcription analysis, Genome data visualization
Organisms: Rattus norvegicus, Potorous tridactylus, Homo sapiens