Computational protocol: RNA-Seq of Human Neurons Derived from iPS Cells Reveals Candidate Long Non-Coding RNAs Involved in Neurogenesis and Neuropsychiatric Disorders

Protocol publication

[…] Total RNA was isolated from cells using the miRNeasy Kit (Qiagen) according to the manufacturer's protocol. An additional DNase I digestion step was performed to ensure that the samples were not contaminated with genomic DNA. RNA purity was assessed using the ND-1000 Nanodrop. Each RNA sample had an A260∶A280 ratio above 1.8 and an A260∶A230 ratio above 2.2. Briefly, total RNA (25 ng) was converted to cDNA using the NuGEN Ovation RNA-Seq System according to the manufacturer's protocol (NuGEN, San Carlos, CA, USA). The protocol employs a single primer isothermal amplification (SPIA) method to amplify the RNA target into double-stranded cDNA under standardized conditions that markedly deplete rRNA without preselecting mRNA. The cDNA was then used for Illumina sequencing library preparation with the Encore NGS Library System I. NuGEN-amplified double-stranded cDNA was fragmented into ∼300 base pair (bp) fragments using a Covaris-S2 system. DNA fragments (200 ng) were then end-repaired to generate blunt ends with 5′ phosphates and 3′ hydroxyls, and adapters were ligated for paired-end sequencing on the Illumina HiSeq 2000. The purified cDNA library products were evaluated using the Agilent Bioanalyzer and diluted to 10 nM for in situ cluster generation on the HiSeq paired-end flow cell using the cBot automated cluster generation system, followed by massively parallel sequencing (2×100 bp) on the HiSeq 2000. We obtained 104-bp mate-paired reads from DNA fragments with an average length of 250 bp (the standard deviation of the distribution of inner distances between mate pairs is approximately 50 bp). iPSC and neuron RNA-Seq reads were aligned separately to the human genome (GRCh37/hg19) using TopHat (version 1.1.4). Splice junctions were determined automatically by TopHat, guided by annotated gene models (GTF file) obtained mainly from Ensembl (http://www.ensembl.org). In our analysis, all three splice-site types (“GT-AG”, “GC-AG” and “AT-AC”) were considered.
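Two details of the alignment setup above can be illustrated with a minimal Python sketch. It is not TopHat code: the function names, the 0-based half-open coordinate convention, and the toy sequence are assumptions made for illustration. The first part checks whether an intron's boundary dinucleotides match one of the three motifs named in the text; the second computes the expected inner distance between mates, assuming the 250-bp figure is the fragment length without adapters.

```python
# Illustrative helpers only; names and coordinate conventions are assumptions,
# not part of TopHat.

# The three donor/acceptor motif pairs named in the protocol.
ACCEPTED_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def intron_motif(genome_seq, intron_start, intron_end, strand="+"):
    """Return the (donor, acceptor) dinucleotides of an intron.

    Coordinates are 0-based, half-open, on the forward strand.
    """
    intron = genome_seq[intron_start:intron_end].upper()
    if strand == "-":
        # For minus-strand transcripts, read the intron in transcript orientation.
        intron = reverse_complement(intron)
    return intron[:2], intron[-2:]


def is_accepted_junction(genome_seq, intron_start, intron_end, strand="+"):
    """True if the intron boundaries match one of the three motifs."""
    motif = intron_motif(genome_seq, intron_start, intron_end, strand)
    return motif in ACCEPTED_MOTIFS


def expected_inner_distance(fragment_length, read_length):
    """Mean gap between the two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length


if __name__ == "__main__":
    toy = "AAAGTCCCCAGTTT"  # hypothetical sequence with an intron at [3, 11)
    print(intron_motif(toy, 3, 11))
    print(is_accepted_junction(toy, 3, 11))
    print(expected_inner_distance(250, 104))
```

With 104-bp reads and 250-bp fragments, the expected inner distance under this assumption is 250 − 2 × 104 = 42 bp.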
All splice junctions supported by at least one high-quality mapped read were kept. The TopHat option for searching for novel splice variants was left on. The resulting alignment data from TopHat were then passed to the assembler Cufflinks (version 0.9.3) to assemble the aligned RNA-Seq reads into transcripts. Annotated transcripts were obtained from the UCSC Genome Browser (http://genome.ucsc.edu) and the Ensembl database; the transcript categories are described at http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html. The number of transcripts in each category is listed in . Transcript abundances were measured in Fragments Per Kilobase of exon per Million fragments mapped (FPKM), a measure derived from RPKM (Reads Per Kilobase per Million). A common issue in assembly is that a read may align to multiple isoforms of the same gene or to multiple transcripts within the same genetic locus; to address it, Cufflinks calculated FPKM by maximum likelihood estimation based on a numerical optimization algorithm. Finally, the program Cuffdiff was used to define differential expression. Instead of using transcript abundances computed separately by Cufflinks for each condition, Cuffdiff took alignment data from both iPSCs and day 10 neurons, together with a list of human genome annotations (the same GTF file as used above, including both coding and non-coding transcripts), to infer expression differences at the level of transcript isoforms, primary transcripts, or genes. Because we had only one replicate, the variance of FPKM was estimated directly from read counts using a Poisson distribution, as described in detail at the Cufflinks website (http://cufflinks.cbcb.umd.edu/howitworks.html#hdif). In brief, the counts of reads that mapped to all nucleotides within a gene/transcript were assumed to follow a Poisson distribution, and their mean was used as the variance.
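The FPKM definition and the single-replicate Poisson variance described above can be sketched in a few lines of Python. This is a simplified illustration, not Cufflinks' implementation: it assumes every fragment is assigned unambiguously to one transcript (so it skips the maximum likelihood step for multi-isoform loci), and the function names and example numbers are hypothetical.

```python
# Simplified sketch; assumes unambiguous fragment assignment (no isoform
# deconvolution), unlike Cufflinks' maximum likelihood estimate.


def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    # 1e9 = 1e3 (bp -> kb) * 1e6 (fragments -> millions of fragments)
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def poisson_fpkm_variance(fragments, exon_length_bp, total_mapped_fragments):
    """Variance of FPKM when the fragment count is Poisson distributed.

    For a Poisson count the variance equals the mean, estimated here by the
    observed count. FPKM is the count times a constant, so its variance is
    the count times that constant squared.
    """
    scale = 1e9 / (exon_length_bp * total_mapped_fragments)
    return fragments * scale ** 2


if __name__ == "__main__":
    # Hypothetical transcript: 500 fragments, 2,000 bp of exon,
    # 20 million mapped fragments in the library.
    print(fpkm(500, 2000, 20_000_000))
    print(poisson_fpkm_variance(500, 2000, 20_000_000))
```

For the hypothetical transcript in the usage lines, the FPKM works out to 500 × 10⁹ / (2,000 × 20,000,000) = 12.5.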
A Student's t-test was then used to find significantly differentially expressed transcripts, with the test statistic derived from the log ratio of FPKM values in our two samples (see the Cufflinks website for further details). To overcome the known bias in data normalization arising from a small number of highly expressed genes, we normalized our iPSC and neuronal data using the total number of fragments mapped to the upper quartile of highly expressed genes/transcripts rather than the total number of mapped fragments. […]
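The two ideas in this step, a test statistic built on the log ratio of FPKM values and upper-quartile normalization, can be sketched as follows. Both functions rest on assumptions rather than on Cuffdiff's actual code: the variance of the log ratio is approximated with the delta method (Var[log X] ≈ Var[X] / X²), and the normalization divides by the 75th percentile of the non-zero per-gene fragment counts, which is one common reading of upper-quartile normalization. All names and numbers are illustrative.

```python
import math
import statistics


def upper_quartile(counts):
    """75th percentile of the non-zero per-gene fragment counts."""
    nonzero = sorted(c for c in counts if c > 0)
    if len(nonzero) < 2:
        raise ValueError("need at least two expressed genes")
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_scaled(counts):
    """Scale counts by the sample's upper quartile instead of its total."""
    uq = upper_quartile(counts)
    return [c / uq for c in counts]


def log_ratio_statistic(fpkm_a, var_a, fpkm_b, var_b):
    """Log ratio of two FPKM values divided by its approximate std. error.

    Assumes independent samples and the delta-method approximation
    Var[log X] ~= Var[X] / X**2.
    """
    log_ratio = math.log(fpkm_a / fpkm_b)
    var_log_ratio = var_a / fpkm_a ** 2 + var_b / fpkm_b ** 2
    return log_ratio / math.sqrt(var_log_ratio)


if __name__ == "__main__":
    sample = [0, 10, 20, 30, 40, 50]  # hypothetical per-gene counts
    print(upper_quartile(sample))
    print(upper_quartile_scaled(sample))
    print(log_ratio_statistic(20.0, 4.0, 10.0, 1.0))
```

Scaling by the upper quartile rather than the library total means that a handful of extremely abundant transcripts, which can dominate the total, has little influence on the normalization factor.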

Pipeline specifications

Software tools: TopHat, Cufflinks
Databases: UCSC Genome Browser
Application: RNA-seq analysis
Organisms: Homo sapiens