Computational protocol: Exploring the stability of long intergenic non-coding RNA in K562 cells by comparative studies of RNA-Seq datasets

Protocol publication

[…] The transcriptomes of the PH and ENCODE RNA-Seq datasets of K562 cells were reconstructed separately from a stringently filtered read set: Trimmomatic [] was run with a sliding window to trim reads where the average quality within the window fell below 20, and reads shorter than 35 bp were discarded. After quality control, we obtained 90.7 million 2×100-base paired-end reads generated by Illumina HiSeq 2000 sequencing of polyadenylated (Poly-A+) RNAs. The ENCODE RNA-Seq dataset, by contrast, comprised 112.3 million 2×76-base paired-end reads generated by Illumina GAIIx sequencing of Poly-A+ RNAs, obtained from the NCBI Gene Expression Omnibus (GEO) database under accession number GSM765405. [...] Quality-filtered reads were aligned with TopHat (v2.0.7) [], and the transcripts of the PH and ENCODE datasets were each reconstructed with Cufflinks (v2.0.2) [] using the Ensembl annotation. Because an RSeQC script [] indicated that the ENCODE library is strand-specific, the fr-firststrand library type was specified for the ENCODE dataset in Cufflinks. To eliminate all annotated non-lincRNA transcripts, the assembled transcripts were compared with Cuffcompare against the annotations (protein-coding genes, microRNAs, rRNAs, tRNAs and pseudogenes) of four public databases in turn: Ensembl (Homo_sapiens.GRCh37.70.gtf), UCSC (hg19), Gencode (gencode.v15.annotation.gtf.gz) and RefSeq (ref_GRCh37.p10_top_level.gff3). The intersection of the transcripts assigned to the ‘u’ category (unknown, intergenic transcript in Cufflinks) across the four comparisons was retained; that is, only transcripts that were intergenic with respect to every annotated non-lincRNA feature in the four databases were kept. Annotated lincRNAs were then identified by running Cuffcompare again on this intergenic intersection against the Gencode lncRNA annotation (gencode.v18.long_noncoding_RNAs.gtf.gz). The remaining transcripts were considered possible novel lincRNAs.
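The intersection of ‘u’ category transcripts across the four annotation comparisons can be illustrated with a minimal Python sketch. It is not the authors' script: the function name `intergenic_in_all`, the in-memory input layout (database name mapped to transcript ID mapped to class code) and the toy transcript IDs are all assumptions made for illustration, and reading the class codes from the Cuffcompare output is left out.

```python
from typing import Dict, Set


def intergenic_in_all(class_codes_by_db: Dict[str, Dict[str, str]]) -> Set[str]:
    """Return the transcript IDs that carry class code 'u' in every comparison.

    class_codes_by_db maps a database name to a {transcript_id: class_code}
    dictionary (hypothetical layout; loading it from files is not shown).
    """
    per_db = [
        {tid for tid, code in codes.items() if code == "u"}
        for codes in class_codes_by_db.values()
    ]
    # An empty input yields an empty set rather than an error.
    return set.intersection(*per_db) if per_db else set()


# Toy class codes with made-up transcript IDs (not real data).
toy = {
    "Ensembl": {"T1": "u", "T2": "=", "T3": "u"},
    "UCSC":    {"T1": "u", "T2": "u", "T3": "u"},
    "Gencode": {"T1": "u", "T2": "j", "T3": "u"},
    "RefSeq":  {"T1": "u", "T2": "u", "T3": "x"},
}
candidates = intergenic_in_all(toy)
print(sorted(candidates))
```

In this toy input only `T1` is intergenic in all four comparisons, so it is the only transcript carried forward; `T2` and `T3` are each excluded by a single database.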
Possible novel lincRNAs were then filtered on four characteristics: FPKM value (ENCODE, FPKM ≥ 0.1; PH, FPKM ≥ 0.01; because lincRNAs are expressed at low levels, these thresholds were chosen from the density distribution of lincRNA expression so as to retain more lincRNAs), length (≥ 200 nt), ORF (< 100 codons) and number of exons (≥ 2). Putative novel lincRNAs were then selected for non-coding potential by integrating the results of four tools: iseeRNA (noncoding), CPAT (no), CPC (noncoding) and PhyloCSF (score < 100) (Figure ). iseeRNA, a lightweight SVM-based program, is designed for computational identification of lincRNAs from high-throughput transcriptome sequencing data. CPAT, which overcomes several intrinsic limitations of pairwise and multiple alignments, uses a logistic regression model based on ORF size, ORF coverage, Fickett TESTCODE and hexamer bias. CPC relies on pairwise alignment to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features. PhyloCSF uses a multi-species nucleotide sequence alignment to calculate a phylogenetic conservation score that indicates whether a region is likely to be protein-coding. […]
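The filtering criteria above can be summarized in a short Python sketch. The thresholds and tool labels are taken from the text. Everything else is an assumption for illustration: the `Transcript` record and its field names, the function names, and in particular the reading of "integrating" as requiring all four tools to agree. The per-transcript values are presumed to have been computed already by the respective tools.

```python
from dataclasses import dataclass

# Dataset-specific expression thresholds stated in the text.
FPKM_MIN = {"ENCODE": 0.1, "PH": 0.01}


@dataclass
class Transcript:
    # Hypothetical record; field names are illustrative only.
    dataset: str             # "ENCODE" or "PH"
    fpkm: float
    length_nt: int
    longest_orf_codons: int
    exon_count: int
    iseerna: str             # label reported by iseeRNA
    cpat: str                # label reported by CPAT
    cpc: str                 # label reported by CPC
    phylocsf_score: float


def passes_basic_filters(t: Transcript) -> bool:
    """FPKM, length, ORF and exon-number criteria."""
    return (
        t.fpkm >= FPKM_MIN[t.dataset]
        and t.length_nt >= 200
        and t.longest_orf_codons < 100
        and t.exon_count >= 2
    )


def is_noncoding_by_all_tools(t: Transcript) -> bool:
    """Assumed rule: every tool must report the non-coding outcome."""
    return (
        t.iseerna == "noncoding"
        and t.cpat == "no"
        and t.cpc == "noncoding"
        and t.phylocsf_score < 100
    )


def is_putative_novel_lincrna(t: Transcript) -> bool:
    return passes_basic_filters(t) and is_noncoding_by_all_tools(t)


# Toy records with made-up values (not real data).
keep = Transcript("PH", 0.05, 850, 40, 3, "noncoding", "no", "noncoding", -12.0)
low_expr = Transcript("ENCODE", 0.05, 850, 40, 3, "noncoding", "no", "noncoding", -12.0)
coding = Transcript("PH", 2.0, 1500, 60, 2, "noncoding", "no", "coding", 20.0)

print([is_putative_novel_lincrna(t) for t in (keep, low_expr, coding)])
```

The same FPKM of 0.05 passes for PH but fails for ENCODE, which shows why the expression threshold is looked up per dataset rather than fixed.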
