Computational protocol: Systematic Identification and Characterization of RNA Editing in Prostate Tumors

Similar protocols

Protocol publication

[…] DNA-seq reads were aligned onto the human reference genome (hg19/GRCh37) using BWA (0.5.9-r16) allowing 1nt mismatch at most in a 24nt seed. For RNA-seq, reads were mapped onto the hg19 genome and exon-exon junctions by splice-aware aligner Tophat (v1.4.1) , using the known gene model annotation from Ensembl release 62. Reads with an unmapped mate or multi-mapped location were filtered out using Bamtools (1.0.2) and PCR or sequencing optical duplicates were marked and removed by Picard (1.55) (http://picard.sourceforge.net). Using NCBI dbSNP build 132, multiple sequence local realignment around InDels and base quality recalibration was performed by GATK (1.4) (The Genome Analysis Toolkit) to correct likely misalignments. Integrating DNA/RNA sequencing data of all specimens, SNVs/InDels were identified and filtered by GATK to achieve high-confidence sites (strand bias, base quality, mapping quality and position bias were taken into account). Additionally for RNA-seq data, we used samtools (0.1.18) to call SNVs/InDels, and retained as high-confidence only those sites which were concordant between both GATK and samtools results. All variants were annotated with genic regions and potential consequences on protein-coding sequences using the tool AnnoVar . The effect of non-synonymous SNVs on protein function was assessed using Condel , a method which integrates several predictive tools (e.g. SIFT, Polyphen2, MutationAssessor).Based on the alignment of RNA-seq reads, gene expression profiles for each sample were calculated based on the gene annotation (Ensembl release 62). Only reads which were unique to one gene and exactly corresponded to gene structure were assigned to the corresponding genes. Raw read counts were normalized by R package DESeq (1.10.1) , which was designed for gene expression analysis of RNA-seq data across all samples (Table S17 in ). [...] To minimize false positives we applied the following filters:To rule out the possibility that RDD sites could be genuine polymorphic sites or mutations we excluded variants present in dbSNP build 132 (except SNPs with molecular type “cDNA”), which includes variants from the 1000 Genome Project. Given the potential for SNPs not present in dbSNP and the low coverage of our DNA-seq, we rigorously removed sites which were observed in any DNA-seq reads of any specimen. Additionally, we downloaded the COSMIC database and filtered out any sites previously reported as mutations.To exclude false positives resulting from poor mapping quality around splicing sites, we filtered out all sites located within 8bp intronic flanking region of all splicing sites.To exclude potential contamination from the mouse genome , we retrieved 61bp of the genomic sequences flanking RDD sites and substituted the RDD site with the edited base. Then we applied BLAT(V3.4) alignment against mouse genome (MM10). If the substituted flanking sequence had a better hit on MM10 than the original flanking sequence and the hit covered the RDD site with identity greater than 90%, then we excluded the corresponding RDD site as potential contamination.To exclude false positives due to mismapping reads from paralogous genes or repeat regions, we retrieved 61 and 101bp of flanking genomic sequences and substituted all covered RDD sites with edited bases, then aligned them onto human genome (hg19/GRCh37) by BLAT. If the substituted flanking sequence was able to be aligned better or equally well on other regions in genome, we discarded those RDD sites as potential false positives. [...] All RDD sites were annotated by genic regions according to Ensembl release 62 (see Table S18 in for all RDD sites) and illustrated using Circos (http://mkweb.bcgsc.ca/circos). We defined recurrent RDDs as those present in at least 2 samples, and clusters of RDDs were defined as consecutive RDD sites within a 50bp distance or at least 3 RDD sites within a 100bp window. The DARNED database for hg19, which contains 40,485 A→G and 3 C→T RNA editing sites collected from human ESTs studies, was downloaded from http://darned.ucc.ie/download. Conserved elements were predicted using the phastCons algorithm , where elements are derived from comparative genome sequence alignment of 46 species. miRNA target regions were predicted by miRanda (downloaded from http://www.microrna.org), and only predictions with a good score were retained. To evaluate transcript expression with and without RDDs affecting miRNA target regions we did the following. A matrix C(i,k) was created to store the number of miRNA target RDDs in gene k in sample i. Then we assigned gene expression data to miRNA target RDDs genes from the above matrix and produced a new matrix C'(i,k) storing RDD counts and expression values in nodes for 1,196 genes in 16 samples. Expression data of 135 (10.12%) genes could not be assigned because of different version of gene annotation between miRanda and Ensembl release 62 (Table S12 in ). To estimate whether miRNA target RDDs affected gene expression, we classified samples into two groups: with and without RDDs. To minimize the bias from gene expression on detection of RDDs, we considered only genes with approximate (RNA) sequencing coverage greater than 10X. Furthermore, we only evaluated protein coding genes since our negative control was RDDs affecting coding regions. Genes with both miRNA target RDDs and coding regions RDDs were removed from comparisons. Network analysis was performed using Ingenuity (IPA) Knowledge Base 9 (Ingenuity Systems, www.ingenuity.com). […]

Pipeline specifications