Computational protocol: De novo derivation of proteomes from transcriptomes for transcript and protein identification

Protocol publication

[…] Before further processing for RNA-seq, the three samples were used as substrates for a PCR-based test to confirm that virus transcripts (adenovirus DBP gene) were present in both virus-infected samples and absent from the uninfected sample (Primer list in ). The three samples were labelled UN (uninfected control), T8 (8 hours post infection) and T24 (24 hours post infection). Next, the Trizol-extracted RNA was extracted again using RNeasy (Qiagen) prior to quantitation, poly A+ selection and 56 bp paired-end sequencing on the University of Bristol Illumina GAIIx using the manufacturer's reagents and protocols. The sequencing data were then uploaded for analysis to a local instance of the Galaxy suite, hosted on the University of Bristol High Performance Computing resource, BlueCrystal.
The raw sequence reads have been deposited with ArrayExpress at the European Bioinformatics Institute under accession number E-MTAB-1277.
The paired-end sequence data for each time point were initially mapped to a female hg19 genome (i.e. without the Y chromosome) using TopHat.
The following parameters were set: mean inner distance = 80; standard deviation = 15; maximum mismatches in anchor region = 0; minimum intron length = 70; maximum intron length = 500000; allow indel search = yes; maximum insertion length = 3; maximum deletion length = 3; maximum alignments allowed = 40; minimum intron length that may be found during split-segment search = 50; maximum intron length that may be found during split-segment search = 500000; number of mismatches allowed in the initial read mapping = 2; number of mismatches allowed in each segment alignment for reads mapped independently = 2; minimum length of read segments = 2; own junctions = no; closure search = yes; exonic hops in splice graph minimum = 50; maximum intron length found by closure search = 5000; minimum intron length found by closure search = 50; coverage search = yes; minimum intron by coverage search = 50; maximum intron by coverage search = 20000.
Mapped reads were then filtered to retain only those that mapped in a proper pair, and reads that mapped to one location were separated from those that mapped to more than one location. Gene expression quantitation on uniquely mapping reads was performed using Cufflinks, supplied with the Ensembl GTF (v64) as a reference throughout the analysis.
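The proper-pair filter and the split into uniquely and multiply mapped reads can be illustrated with a minimal Python sketch. This is not the Galaxy tool chain used in the protocol, which the text does not name. It assumes SAM-formatted alignments and that the aligner writes the optional NH tag (number of reported hits), as TopHat output typically does; the function names and example records are hypothetical.

```python
from typing import Iterable, List, Tuple

PROPER_PAIR = 0x2  # SAM flag bit: each segment properly aligned
UNMAPPED = 0x4     # SAM flag bit: segment unmapped


def hit_count(fields: List[str]) -> int:
    """Return the NH tag value of a SAM record, or 1 if the tag is absent (assumption)."""
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[5:])
    return 1


def split_proper_pairs(sam_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Keep properly paired alignments and split them into (unique, multi-mapped)."""
    unique, multi = [], []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        record = line.rstrip("\n")
        fields = record.split("\t")
        flag = int(fields[1])
        if (flag & UNMAPPED) or not (flag & PROPER_PAIR):
            continue  # discard unmapped reads and reads not in a proper pair
        (unique if hit_count(fields) == 1 else multi).append(record)
    return unique, multi


# Hypothetical records: r1 is a unique proper pair, r2 maps to four
# locations, r3 is paired but not properly paired (flag 65).
example = [
    "@HD\tVN:1.0",
    "r1\t99\tchr1\t100\t50\t56M\t=\t236\t192\t*\t*\tNH:i:1",
    "r2\t99\tchr1\t500\t3\t56M\t=\t636\t192\t*\t*\tNH:i:4",
    "r3\t65\tchr1\t900\t50\t56M\tchr2\t100\t0\t*\t*\tNH:i:1",
]
unique_reads, multi_reads = split_proper_pairs(example)
```

Only the uniquely mapped set would then be passed on for expression quantitation, matching the description above.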
The following parameters were set for Cufflinks: maximum intron length = 500000; minimum isoform fraction = 0.05; pre-mRNA fraction = 0.05; quartile normalisation = yes; use reference annotation = yes; perform bias correction = yes; set parameters for paired end reads = no.
In addition to mapping to the human genome, TopHat was used to map reads to the adenovirus type 5 genome (AC_000008.1) and to the human papillomavirus serotype 18 genome (NC_001357.1), with the same parameters listed above except for the following changes: minimum intron length = 30; maximum intron length = 34000 (7000 for papillomavirus); minimum intron length that may be found during split-segment search = 10; maximum intron length that may be found during split-segment search = 34000 (7000 for papillomavirus).
We also used the Trinity de novo assembly software, installed on our local copy of the Galaxy suite, with default parameters. For this analysis we combined the data from all three time points into one large data set comprising ~82 million paired-end reads. The assembled transcripts (~102,000 entries) were then translated (forward and reverse) into proteins using the EMBOSS tool “getorf” with a minimum nucleotide length of 200 bp between the start and stop codons. Duplicate protein sequences were amalgamated to produce ~80,000 different protein sequences (PIT proteins list), which were then used for the MS/MS analysis. We analysed this list to obtain data on size distribution () and used BLAST on this file to analyse its relationship to the human proteome (). […]
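The translation and de-duplication step can be sketched in plain Python to show the idea. This is a simplified stand-in rather than a reimplementation of EMBOSS getorf: it assumes the standard genetic code, reports only ORFs that run from an ATG to an in-frame stop codon, and measures length from the start codon up to (not including) the stop. All function names and the example transcripts are hypothetical.

```python
from typing import Dict, Iterable, List

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, min_nt: int) -> List[str]:
    """Translate ATG-to-stop ORFs of at least min_nt nucleotides, reading from position 0."""
    proteins = []
    current = None  # residues of the ORF being extended, or None outside an ORF
    for pos in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[pos:pos + 3], "X")  # ambiguous codons become X
        if current is None:
            if aa == "M":
                current = ["M"]
        elif aa == "*":
            if 3 * len(current) >= min_nt:
                proteins.append("".join(current))
            current = None
        else:
            current.append(aa)
    return proteins  # ORFs with no stop codon before the sequence end are dropped


def unique_proteins(transcripts: Iterable[str], min_nt: int = 200) -> List[str]:
    """Six-frame translate each transcript and merge identical protein sequences."""
    seen: Dict[str, None] = {}  # dict keeps first-seen order while de-duplicating
    for transcript in transcripts:
        seq = transcript.upper()
        for strand in (seq, reverse_complement(seq)):
            for offset in range(3):
                for protein in orfs_in_frame(strand[offset:], min_nt):
                    seen.setdefault(protein, None)
    return list(seen)


# Hypothetical input: the same 213-nt ORF twice, plus one ORF that is too short.
long_orf = "ATG" + "GCT" * 70 + "TAA"
proteins = unique_proteins([long_orf, long_orf.lower(), "ATGGCTTAA"])
```

In this toy input the two identical transcripts collapse to a single protein entry and the short ORF is discarded, mirroring how the ~102,000 assembled transcripts were reduced to a non-redundant protein list.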

Pipeline specifications

Software tools: TopHat, Cufflinks, Trinity, EMBOSS
Application: RNA-seq analysis
Organisms: Homo sapiens, unidentified adenovirus