Computational protocol: Global identification of hnRNP A1 binding sites for SSO-based splicing modulation

Similar protocols

Protocol publication

[…] FASTQ reads were first demultiplexed according to the barcode sequence RRRIIIIRR, in which R is a random nucleotide and I is a part of the identifier sequence. The two random segments were then stitched together into a 5-bp random tag used for identifying PCR duplicates and saved within the read name. Following demultiplexing, the reads were quality trimmed and adapter trimmed such that any adapter sequence at the 3′ end of the reads was removed, allowing up to one mismatch. Reads with a length of 20 bp or more were retained. For these steps, we used custom Perl scripts. Next, reads were mapped to the human genome (hg19) using the Burrows-Wheeler Aligner (BWA) [] (bwa aln parameters: -l 20 -n 2, bwa samse parameters: -n 10) and, using mapping positions together with the random tags, PCR duplicates were removed using a custom Perl script. Tags with up to one mismatch between them were considered identical. This method allows detection of PCR duplicates that contain sequencing errors, either in the random tag or within the fragment sequence. Finally, we removed non-uniquely aligned reads prior to downstream analyses. [...] Crosslinking sites were defined as the base immediately preceding the read start for reads without deletions, while the crosslinking site in reads containing deletions was defined as the site of the deletion. The mapping of crosslinking sites to the beginning of fragments was validated using iCLIPro [] to estimate the overall crosslinking profile of hnRNP A1 in our iCLIP experiments. The program was run on aligned fragments of lengths 20–35 bp compared to fragments of length 41 bp in windows of 300 bp containing at least 20 aligned fragments. To identify binding sites, we extended these crosslinking sites by 10 bp to either side, or in reads with deletions it was defined as ±10 bases from the deletion site. Peak detection was performed on these regions using CLIPper []. We used the “superlocal” algorithm to account for local sequencing bias in 1-kb flanking regions, and the “random” algorithm to estimate p values within these windows. We set a threshold at FDR = 10 % to identify significant binding sites with full parameters: -s hg19 --premRNA --disable_global_cutoff --FDR = 0.1 --superlocal --threshold-method = random. To identify binding motifs, we used the HOMER analysis program [] run in RNA mode on the identified binding sites with the exact sizes of the binding sites as the targets, and a random similarly sized set of regions as the background (full parameters: -rna -size given -len 6,7,8 -mis 3 -depth high –basic). To generate the scoring matrix we used the frequency matrix to compute log2 scores of the frequencies relative to a background frequency of 0.25. We downloaded raw fastq files for the hnRNP A1 samples from the study of Huelga et al. [] and mapped them with BWA as described for our own data. We removed PCR duplicates based on mapping position and then removed non-uniquely aligned reads. To identify binding sites, we performed peak detection using CLIPper directly on the mapped reads. We used BEDTools to identify overlapping binding sites. FDR values were calculated by adjusting p values using the Benjamini-Hochberg method. [...] We used the RSeQC [] package to calculate genomic distributions and ngs.plot [] to plot them in localized windows. Overall densities across genomic regions (Ensembl version 75) were calculated by normalizing to the total combined length of the region in question (reads per kb), while plotted densities were normalized to the total number of reads (reads per million). The alignment of iCLIP reads to the human genome was visualized using the Integrative Genomics Viewer (IGV) [, ]. Coverage profiles of iCLIP tags and peaks were created using ngs.plot []. [...] Three biological replicates of hnRNP A1 knockdown in HeLa cells were prepared for RNA sequencing using the TruSeq Stranded Total RNA library Prep Kit, Human/mouse/rat (Illumina). Samples were paired-end sequenced using Illumina’s HiSeq 1500. Reads were subsequently mapped with Spliced Transcripts Alignment to a Reference (STAR) [], and gene expression analysis was carried out using HTSeq [] and DESeq2 [] with conditional quantile normalization []. We used SAJR [] to detect alternative splicing and select significant events as those with FDR <0.1. FDR values for all analyses were calculated by adjusting p values using the Benjamini-Hochberg method. To compare significant splicing changes of cassette exons, we extracted cassette exons that overlapped the intronic regions with cassette exon missplicing indicated in the Huelga et al. data made public at http://rnabind.ucsd.edu and with identical cassette exon lengths. Log2 fold-change estimates from SAJR analysis were computed by taking the log2 of the ratio between conditions of the average inclusion ratios with 1 % pseudoinclusion added to each condition. […]

Pipeline specifications

Software tools BWA, CLIPper, HOMER, BEDTools, RSeQC, ngs.plot, IGV
Application CLIP-seq analysis
Chemicals Nucleotides