Computational protocol: Prediction of alternatively skipped exons and splicing enhancers from exon junction arrays

Protocol publication

[…] We searched for over-represented kmers in our exonic and intronic sequences using the RESCUE-ESE method as described in []. Briefly, using the Perl programming language, counts for all kmers in two sets of sequences were tallied and evaluated for significant over-representation between the sets using a Z-test; all kmers with a Bonferroni-adjusted p-value < 0.01 were then clustered and aligned into groups of motifs. The distance between two motifs was defined as the absolute deviation between the nucleotide frequencies of two aligned positions in the best local alignment, and motif pairs with a distance less than 2 were deemed a match. For kmer calculations with exons, the comparisons were 1) exons versus introns and 2) constitutive versus alternative exons. We also considered two comparisons for each of the 5' and 3' splice sites (SS) based on splice site strength, scored by position-specific weight matrices, with the 1st and 3rd quartiles of all known splice sites as the cutoffs for weak and strong splice sites: 3) weak versus strong splice site exons and 4) strong versus weak splice site exons. For kmer calculations with introns, the comparisons were 1) introns versus exons, 2) flanking introns of constitutive versus alternative exons, 3) flanking introns of weak versus strong splice site exons and 4) flanking introns of strong versus weak splice site exons. All sequence analyses used intronic/exonic sequences up to 200 bp from the SS and excluded the splice site regions, which cover the first 5 bp at each end of exons and the first 20 bp and 10 bp at the 3' and 5' ends of introns, respectively.

In addition, we searched for novel sequence motifs using the whole data set, without the need to dichotomize exons into alternative exons (AE) and constitutive exons (CE), by applying the correlation-based method REDUCE [] to the splice site proximal sequences and our exon-level score for measuring exon skipping (n = 83789).
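The kmer over-representation test used for the RESCUE-ESE-style comparisons above can be illustrated with a minimal Python sketch. It is not the original Perl implementation: the one-sided two-proportion Z-test, the Bonferroni factor of 4^k, and the function names `count_kmers` and `enriched_kmers` are assumptions made for illustration, and the clustering and alignment of significant kmers into motifs is not shown.

```python
import math
from collections import Counter
from itertools import product


def count_kmers(seqs, k):
    """Tally every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(foreground, background, k=6, alpha=0.01):
    """Return k-mers over-represented in foreground versus background.

    Assumption: a one-sided two-proportion Z-test with a Bonferroni
    correction over all 4**k possible k-mers.
    """
    fg, bg = count_kmers(foreground, k), count_kmers(background, k)
    n_fg, n_bg = sum(fg.values()), sum(bg.values())
    if n_fg == 0 or n_bg == 0:
        return []
    n_tests = 4 ** k
    hits = []
    for kmer in map("".join, product("ACGT", repeat=k)):
        x1, x2 = fg[kmer], bg[kmer]
        pooled = (x1 + x2) / (n_fg + n_bg)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_fg + 1 / n_bg))
        if se == 0:
            continue  # k-mer absent from both sets (or the only k-mer present)
        z = (x1 / n_fg - x2 / n_bg) / se
        p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal p-value
        p_adj = min(1.0, p * n_tests)          # Bonferroni adjustment
        if p_adj < alpha:
            hits.append((kmer, z, p_adj))
    # Strongest enrichment first.
    return sorted(hits, key=lambda h: -h[1])
```

Under these assumptions, comparison 1) for exons would correspond to a call such as `enriched_kmers(exon_seqs, intron_seqs, k=6)`, where `exon_seqs` and `intron_seqs` are hypothetical lists of sequences already trimmed to exclude the splice site regions.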
REDUCE enumerates all possible kmers of 5–6 bp and finds those that show significant correlation with expression values from a single microarray experiment. Multiple kmers are identified by iteratively removing the significant kmers from the sequences and re-evaluating the correlation between the remaining kmer frequencies and the residuals of the linear regression fit from the previous run (i.e., subtracting the contribution from the previously selected kmer(s)). We masked the splice sites by removing 5 bp at the 5' and 3' end of each exon. The output lists all significant kmers with p-values adjusted using the Bonferroni correction.

To look for motifs in the development-related alternative splicing (AS) exon sets, we examined the genes that had appropriate keyword hits (see below). Within these genes, sequences were extracted for exons predicted to be AE based on our exon-skipping score. The de novo motif finders MEME [] and BioProspector [] were applied to these selected exon sequences after removing the first 5 bp at either end of each exon. BioProspector was run with the following options: examine only the forward strand (-d), assume every sequence has a motif (-a 1) and motif width range from 6 to 18 (-w 6,8,10,12,14,16). MEME was run with the following options: the default of examining only the forward strand, DNA sequence (-dna), several choices for the expected number of motif occurrences (zero or one in each sequence, -zoops; one in each sequence, -oops; variable number, -tcm) and several choices for the motif width (-w 6, 8, 10 or 12, or the range -minw 6 -maxw 20). We aligned the consensus sequences of the final predicted motifs from each of the 15 MEME runs and of the top three motifs from the 6 BioProspector runs with ClustalW []. All motifs and alignments are displayed in the Figures using WebLogo [] […]
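The iterative kmer selection described for REDUCE above can be sketched in Python as a greedy regression on residuals. This is an illustration under stated assumptions, not the REDUCE program itself: significance is approximated with a Fisher z-transform of the Pearson correlation, the Bonferroni factor is the number of kmers actually observed, a selected kmer is simply dropped from the candidate list rather than masked in the sequences, and the names `greedy_kmer_regression`, `kmer_counts` and `pearson` are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def greedy_kmer_regression(seqs, scores, k=5, alpha=0.01, max_motifs=10):
    """Select k-mers whose counts correlate with the (residual) scores.

    Assumes more than three sequences (needed by the Fisher z-transform).
    """
    n = len(seqs)
    per_seq = [kmer_counts(s.upper(), k) for s in seqs]
    candidates = sorted(set().union(*per_seq))
    n_tests = len(candidates)  # Bonferroni factor (assumption)
    residual = list(scores)
    selected = []
    while candidates and len(selected) < max_motifs:
        # Find the k-mer most correlated with the current residuals.
        best, best_r = None, 0.0
        for kmer in candidates:
            r = pearson([c[kmer] for c in per_seq], residual)
            if abs(r) > abs(best_r):
                best, best_r = kmer, r
        if best is None:
            break
        # Approximate two-sided p-value via the Fisher z-transform.
        r_clip = max(min(best_r, 0.999999), -0.999999)
        z = math.atanh(r_clip) * math.sqrt(n - 3)
        p_adj = min(1.0, math.erfc(abs(z) / math.sqrt(2)) * n_tests)
        if p_adj >= alpha:
            break
        # Fit residual ~ intercept + slope * count, then subtract the fit.
        x = [c[best] for c in per_seq]
        mx, my = sum(x) / n, sum(residual) / n
        sxx = sum((a - mx) ** 2 for a in x)
        slope = sum((a - mx) * (y - my) for a, y in zip(x, residual)) / sxx
        residual = [y - (my + slope * (a - mx)) for a, y in zip(x, residual)]
        selected.append((best, slope, p_adj))
        candidates.remove(best)
    return selected
```

Each returned tuple holds the selected kmer, its fitted slope and its adjusted p-value. In the setting above, `seqs` would be the splice site proximal sequences and `scores` the exon-level exon-skipping scores.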

Pipeline specifications

Software tools: RESCUE-ESE, BioProspector, Clustal W, WebLogo
Applications: WGS analysis, Genome data visualization
Organisms: Homo sapiens