Computational protocol: Alternative splicing regulation at tandem 3′ splice sites

Protocol publication

[…] The presence of cis-regulatory elements in the exon and intron regions close to the NAGNAG acceptor sites, excluding the NAGNAG site itself, was studied using the following procedure. Overlapping words of length 4 and 5 were counted in 100 bp of pre-mRNA sequence in the upstream intron and in the downstream exon flanking the NAGNAG motif. An additional search was applied to the subset of these sequences that is conserved between human and mouse in the 30 bp flanking the NAGNAG acceptor on each side. Because the conserved regions were shorter, only words of length 4 were counted in this set, to avoid the statistical bias that can arise from sparse data. Where the flanking exonic or intronic sequence was shorter than 100 bp or 30 bp in the respective dataset, the available sequence was analyzed. Note that where the downstream exon or upstream intron was shorter than 100 bp, the analysis was not extended into the following exon or intron. Conservation values were extracted from human/mouse (mm5) pairwise alignments downloaded from the UCSC website. Words were counted separately in the flanking regions of each of the following groups: (i) 215 EST-confirmed NAGNAG 3′AS sites, (ii) a subset of 78 sequences from group (i) with the strong CAGCAG tandem acceptor, (iii) 5050 ‘Proximal’ NAGNAGs, (iv) 584 ‘Distal’ NAGNAGs, (v) 984 skipped exons, and (vi) 102 461 constitutively spliced (CS) NAGNAG acceptors. All word counts were normalized to the size of the dataset. In addition, each of the above datasets was randomly shuffled, and words were counted in the shuffled sets as described above. For each of groups (i)–(v), the counts in the equivalent shuffled set were subtracted. The CS set, group (vi), was used as the background control group for calculating the log2 ratio (lr) of each word, as given in Equation 1.
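The counting and shuffling steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the shuffle is assumed to be a per-sequence permutation of bases (the text says only that each dataset was "randomly shuffled"), and "normalized to the size of the dataset" is assumed to mean division by the number of sequences.

```python
import random
from collections import Counter


def count_words(sequences, k):
    """Count overlapping words of length k across all sequences.

    Sequences shorter than k contribute no words, so short flanks are
    simply analyzed as far as they go.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


def shuffle_sequences(sequences, seed=0):
    """Return a shuffled copy of the dataset.

    Assumption: each sequence has its own bases permuted, which keeps
    its length and base composition unchanged.
    """
    rng = random.Random(seed)
    shuffled = []
    for seq in sequences:
        bases = list(seq)
        rng.shuffle(bases)
        shuffled.append("".join(bases))
    return shuffled


def normalize(counts, n_sequences):
    """Divide each word count by the dataset size (assumed: sequence count)."""
    return {word: n / n_sequences for word, n in counts.items()}


# Toy flanking sequences, made up for illustration only.
flanks = ["ACGTACGTAC", "TTTCAGCAGG", "ACG"]
observed = count_words(flanks, k=4)
background = count_words(shuffle_sequences(flanks, seed=1), k=4)
```

Fixing the seed makes the shuffled control reproducible; in practice one might average over several shuffles, which the source does not specify.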
$$\mathrm{lr} = \log_2 \frac{\left(N(i) - N(si)\right) / T(i)}{\left(N(cs) - N(scs)\right) / T(cs)} \qquad (1)$$

where N(i) is the number of counts in set i (one of the five groups compared against the CS set), N(si) is the number of counts in set i after random shuffling, T(i) is the total number of words in set i, N(cs) is the number of counts in the CS set, N(scs) is the number of counts in the CS set after random shuffling, and T(cs) is the total number of words in the CS set.

To ensure that significant results did not arise from low counts, only words found in the top 1% in each set were considered. Words with lr > 1 detected in the exon regions were screened against the matrices in ESEfinder and the RESCUE-ESE database to search for known ESEs and ESSs and their putative binding proteins. In addition, a manual literature search was carried out to examine the possible function of the overabundant words in the intronic regions. […]
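As a worked illustration of the log2 ratio and the low-count filter, the sketch below computes lr for a single word and selects the top-ranked words of a set. The function names are hypothetical. Two choices are assumptions rather than details given in the text: words whose shuffle-corrected frequency is not positive are reported as undefined (None), and "top 1%" is read as the top 1% of words ranked by count.

```python
import math


def log2_ratio(n_i, n_si, t_i, n_cs, n_scs, t_cs):
    """Log2 ratio of shuffle-corrected word frequencies, test set vs. CS set.

    Returns None when either corrected frequency is not positive, since
    the logarithm is undefined there (handling assumed, not from the text).
    """
    numerator = (n_i - n_si) / t_i
    denominator = (n_cs - n_scs) / t_cs
    if numerator <= 0 or denominator <= 0:
        return None
    return math.log2(numerator / denominator)


def top_fraction(counts, fraction=0.01):
    """Return the words ranked in the top `fraction` by count.

    At least one word is always kept, so tiny toy inputs still work.
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return set(ranked[:keep])


# Made-up counts for one word, for illustration only.
lr = log2_ratio(n_i=48, n_si=16, t_i=1024, n_cs=24, n_scs=16, t_cs=1024)
enriched = lr is not None and lr > 1
```

With these definitions, a word would be passed on to the ESE/ESS screen only if it is in the top-ranked set and its lr exceeds 1.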

Pipeline specifications

Software tools: ESEfinder, RESCUE-ESE
Application: WGS analysis
Organisms: Mus musculus, Homo sapiens