Computational protocol: Luminal breast cancer-specific circular RNAs uncovered by a novel tool for data analysis

Similar protocols

Protocol publication

[…] The starting RNA-seq datasets were obtained from libraries generated with the TruSeq stranded library preparation kit (Illumina) using RNA depleted of both poly(A+) and ribosomal RNAs fractions. Libraries were analyzed with the DNA 1000 chip (Agilent) using Agilent 2100 Bioanalyzer and quantified using the Qubit DNA HS kit (Lifetechnologies). Pool of 12 libraries (pooled at equimolar concentration) was generated, quantified and run on the HiSeq2000 (Illumina) sequencer in 50 nt paired-end sequencing mode following manufacturer instruction. A total of 12 datasets, with an average depth from 30.7 to 116.1 million paired-end reads were obtained, composed of triplicates of four MCF-7 culture conditions: i) hormone-deprived (HD) media ii) HD+ 17β-estradiol (6h) iii) medium added of FBS 10% iv) double-stranded RNA complementary to ESR1 mRNA (siRNA) (48h). Raw data are deposited at GSE101410. CIRI v. 1.2 [] circRNA prediction analysis was performed aligning RNA-Seq reads with BWA v. 0.6.1 [] with option bwasw and -T = 15. Gencode v19 was used as reference transcriptome dataset while hg19 as human reference genome assembly. CIRI algorithm applied in default settings with -P and -low option.CircRNA prediction with CIRCexplorer v. 1.0.6 was performed as proposed in []. RNA-Seq reads were aligned using Tophat v. 2.0.0 [] with options -bowtie1, -a = 6, -m = 2 -microexon-search -no-novel-juncs. Unmapped reads were analyzed with Tophat-Fusion with options -fusion-search -keep-fasta-order -bowtie1 -no-coverage-search. CircRNA prediction analysis with find_circ v. 1.2 [] was performed by aligning reads using Bowtie v. 2.0 [] with options -very-sensitive -phred33 -mm -D = 20 -score-min = C,-15,0. Unmapped reads were used as input for the find_circ pipeline following the procedure proposed in []. For each analysis, the number of BS reads reported by each algorithm was normalized using DESeq2 v.1.14.1 R package []. On each set of circRNA predicted by the three algorithms, the circRNAs predicted in at least two out of the three biological replicates in each culture condition and associated with an average number of BS supporting read > 2 were selected. Using this threshold 3,271, 1,811, and 2,797 circRNAs were predicted with CIRI, find_circ and CircExplorer, respectively. CIRI algorithm was applied with the same settings to predict circRNA from ENCODE MCF-7 RNA-Seq experiments performed using total RNA (GSM2072571, GSM2072572), poly(A)+ (GSM767851), and poly(A)- (GSM765388) RNA selection protocols. Only circRNAs identified in both the biological replicates of the experiments were considered for the analysis. [...] CircHunter is a new tool designed for the post-discovery analysis of circRNA predictions. CircHunter is composed by three modules (i) circRNA classification, (ii) BS sequence reconstruction, and (iii) BS sequence quantification in deep sequencing datasets.In the circRNA classification module the algorithm considers the annotation from a reference transcriptome, which in our analysis was Ensembl v85. Initially, genomic coordinates of each Ensembl exon are overlapped against circRNA genomic coordinates using bedtools [] intersect function. Then, the genomic coordinates of each exon annotation are tested for the overlap against circRNA BS site position. Each overlap is classified based on the number and the position of the BS within the transcript annotations. Each circRNA/transcript overlap is classified to five possible criteria ():multiexonic, when two exons are mapped to each splice site of the circRNA;monoexonic, when a single exon spans the entire region involved in the circularization;putative exonic, when there is no precise match between the circRNA BS sites and the exon boundaries but BS sites is mapped within exon genomic coordinates;vintronic; when at least one intron is mapped to a circRNA BS site;intergenic; when at least one circRNA BS sites exceeds the boundaries of the associated gene.This analysis provides a transcript-level classification of circRNA-overlapping transcript. Then, a single circRNA can be associated with multiple classifications when overlapped on multiple transcripts. To obtain the univocal classification of each circRNA, the main isoform of the circRNA host genes is considered by selecting the Ensembl transcript identified with the suffix “001”. If none main isoform is overlapped with a circRNA, the other isoforms are evaluated following the order provided by Ensembl.The circRNA nomenclature applied in this work was based on the isoform considered for the univocal classification. Specifically, each circRNA name was composed by the prefix “Circ” followed by the host gene symbol and ended with the rank of 5ʹ and 3ʹ exons involved in the circularization. Intergenic circRNAs were named based reporting their genomic coordinates while intronic circRNAs were distinct by the “I” suffix. The univocal classification was considered in the analysis of the number and rank of the exons involved in the BS event.The circRNA BS sequence reconstruction module applied a python script which select two set of genomic coordinates starting from the BS sites and involving a portion of BS exon selected by the user (default length is 35 bp). An R script is then applied to convert the genomic coordinates in R GRanges objects. Then, the BS sequence is reconstructed using the function getSeq and xscat. These functions were applied respectively to extract and to concatenate properly the two sequences composing the BS junction.The BS sequence quantification in deep sequencing datasets module is performed by HashCirc. HashCirc is organized on three steps: in the first and second steps an alignment-free prediction method is exploited to identify the set of putative sequencing reads mapped on the sequences of interest; while in the third step the selected putative reads are aligned against the sequences of interest (i.e. circRNA BS junction sequences) to generate the corresponding counting table (i.e. the counting of the number of reads aligned with each sequence).Step 1: Significant k-mer generation. In this step, the entire set of sequences is scanned and a set of substrings with length k, namely k-mers, is generated using a sliding window approach.For instance, given a string ATCCCGTC the following k-mers with length three are generated: ATC, TCC, CCC, CCG, CGT and GTC.Then, a hashing is exploited to build the function isPresent: {A,C,G,T}k → [0,1] which, for each k-mer, returns one if it appears in any sequence otherwise 0.A k-mer α is considered significant and therefore selected if isPresent(α) = 1. These selected k-mers will be used to identify the putative reads in the next step.Step 2: read selection. In this step hashing is still used to build function check:{A,C,G,T}k →{0,1}, which for each k-mer returns 1 if it is a selected k-mer otherwise 0. Then, the function check is applied on all the k-mers of a read so that a read is selected as putative one if it contains more than N k-mers for which check function returns 1.Step 3: read counting. The derived set of putative reads are hence aligned w.r.t the sequences. For each read, its best alignment with respect to all the sequences is identified and used to generate the sequence counting table. In the CircHunter tool suite, the HashCirc module, is composed of two C++ applications for each step of the data processing:The first step takes as input a set of sample reads, the set of sequences and the threshold N, and returns the corresponding set of putative reads which contain at least N k-mer shared with the set of sequences. The k-mers generated by the sequences are stored in RAM exploring an ad-hoc C++ hash table class implementation to optimize the trade-off between the memory utilization and the execution time.The second step takes as input the set of putative reads for each sample, it counts the frequency of a set of reference sequences (i.e. pre-defined BS sequences). For this step, the Smith-Waterman algorithm provided by SIMD Smith-Waterman C++ library is used.To perform the HashCirc analysis, sequences of 70 bp representing the hypothetical circRNA BS junctions were extracted from the circRNA predictions using CircHunter. Two set of genomic coordinates spanning +35/−35 bp from the junction point respectively were generated using this algorithm and circRNAs shorter than 70 bp were splitted in two halves used for the junction reconstruction. The efficiency of HashCirc in circRNA quantification was evaluated by Pearson correlation analysis between the number of reads counted by HashCirc with the reads reported by CIRI. ENCODE MCF-7 RNA-Seq experiments were also considered for analysis of the algorithm sensitivity in circRNA detection. For this analysis data from Poly(A)+ (GSM767851), Poly(A)- (GSM765388), and total RNA-Seq datasets (GSM2072571, GSM2072572) were analyzed by setting the k-mer length (k) to 26, the minimum number of matched k-mer (N) to 21, and the minimum number of matches (M) equal 40. A set of 75-bp simulated paired-end reads from [] was considered as circRNA negative set since it was generated from linear mRNA annotations. A ratio between the reads counted by HashCirc in the PolyA- and the simulated read datasets was computed to evaluate the rate of false positive count. The results of these testing analyses are reported in .To make easy the use of this new pipeline a Graphical User Interface (GUI) based on Java Swing Class was developed too. Moreover, the pipeline was integrated into a Docker container to facilitate its distribution and installation. It can be downloaded at https://github.com/carlo-deintinis/circhunter [...] The Control gene set was defined by selecting genes lacking circRNA predictions considering the union between circBase [], circRNADb [] annotations and circRNAs predicted in this study. To select control genes expressed in MCF-7, a public Poly(A)+ MCF-7 RNA-Seq experiment performed in full medium (GSE48213, []) was re-analyzed and genes with a FPKM > 1 were considered as expressed. Using these criteria 5,583 control genes were selected. The not significant difference between host and control genes expression was confirmed by Wilcoxon Rank-Sum test (p-value = 0.796). Another control set was defined by selecting a subset of control genes paired with circRNA host genes by considering the first intron length. Specifically, a python script was designed to pair each circRNA host gene with a randomly selected control gene with similar intron length. The pairing was performed 1,000 times. This control set was called Control-I. The Random gene set was composed by 1,000 sets of 1,761 genes randomly selected from Gencode v19 annotations.Enrichment analysis of circRNA host gene and the Control gene set was performed using Enrichr web tools []. The comparison of gene/transcript length, the number of exons and isoforms between circRNA host genes and control set was performed using the Ensembl annotations. Data of the main gene isoforms (reported with suffix “001” by Ensembl) were used in this analysis. Analysis of candidate circRNA intronic retention events was performed by considering reads paired with each BS spanning read. The genomic coordinates of these reads were retrieved from BWA alignment outputs using BS read identifiers provided by CIRI. Then, the reads genomic coordinates were mapped against Ensembl exon genomic coordinates. Only perfectly matched reads were considered for this analysis.Alu annotations were downloaded from UCSC using the RepeatMasker track and their coordinates were overlapped with an intronic region of 500 bp flanking the circRNA BS junctions as previously performed []. [...] The epigenetic status at the genomic region involved in the BS events was evaluated by overlapping the BS genomic coordinates with 15 chromatin states defined for the MCF-7 epigenome in full medium culture condition []. Data of ChIP-Seq experiment against H3K27ac (GSM1383859), H3K36me3 (GSM970217), H3K4me3 (GSM1383862) and RNAPII (GSM1276019) were also analyzed to measure the ChIP-Seq genomic signal profile around 5ʹ or 3ʹ BS sites. Specifically, ChIP-Seq reads were aligned against hg19 human genome using Bowtie2 in default settings. The genomic signal profile was then computed using seqMINER algorithm v1.3 [] considering a genomic region of +/− 1 kbp centered on the BS sites. The genomic signal normalization was then normalized using NormChIP algorithm []. The number of H3K36me3 read covering the first five exons of circRNA host genes, the control set, or the control-I set were counted using the coverageBed function of bedtools. [...] Raw Ago-HITS-CLIP data and Ago-HITS-CLIP peaks were retrieved from GSE57855. The raw sequencing reads were aligned using Bowtie2 algorithm in default settings and reads aligned within a genomic region of +/− 1 kbp centered on circRNA BS sites were counted. As control the corresponding splicing sites of exon 2 and exon 3 from genes were analyzed.MRE prediction was performed on the reconstructed sequence of exonic circRNAs composed of one two, or three exons. MRE prediction was performed using the Miranda algorithm [] in default settings and considering mirBase v20 annotations. Overlap with circRNA exons and Ago-HITS-CLIP was performed using coverageBed function of bedtools. To analyze miRNAs expressed in MCF-7 cells, processed data from small RNA-Seq experiments from GSE78168 were considered. Only miRNAs associated with an average Read Per Million Mapped reads (RPPM) > than 100 in E2 untreated cells were considered.HashCirc was applied on Ago-HITS-CLIP to identify Ago-RNA binding generated by BS events. Given the smaller length of Ago-HITS-CLIP reads the algorithm was applied with settings k = 21, N = 17, and M = 30. Read count normalization was performed using DESeq2 algorithm. MRE prediction was performed on the reconstructed BS sequence of circRNA subset associated with an Ago-HITS-CLIP signal as defined by HashCirc analysis. Different control sets were defined for this analysis: 1) 100 sets of 3,271 sequences generated by randomly permuting the 3,271 CM7 BS sequences; 2) 100 sets of sequences generated by randomly selecting and permuting 127 CM7 BS sequences; 3) 100 sets generated by randomly permuting the 127 BS sequences overlapped with the Ago-HITS-CLIP datasets; 4) 100 set of sequences generated by randomly shuffling the 3,271 CM7 BS sequences.Considering the CM7 BS sequences overlapped with more than six averaged read of the Ago-HITS-CLIP, the overlap with the splicing junction of the linear mRNA sequences was evaluated. The overlap was performed by selecting the splicing junction sequence in between of the exons involved in the circularization and also by considering the junction formed between the circularizing exons and the external 5ʹ and 3ʹ flanking exons. Direct sequence alignment between Ago-HITS-CLIP reads and the splicing junctions was performed using Bowtie2 algorithm with local option. The read coverage was computed using Samtools pileup function. […]

Pipeline specifications

Software tools Bowtie2, BEDTools, DESeq2, SAMtools
Application sRNA-seq analysis
Diseases Breast Neoplasms, Neoplasms