Computational protocol: Splice site identification using probabilistic parameters and SVM classification

Protocol publication

[…] We conducted several simulations to evaluate the performance of the proposed algorithm using two standard, publicly available splice site datasets.

The first dataset is known as NN269 [], which consists of 1324 confirmed true acceptor sites, 1324 confirmed true donor sites, 5552 false acceptor sites and 4922 false donor sites collected from 269 human genes. Each of the pseudo acceptor/donor sites also has AG/GT at the splicing junction but is not a real splice site according to the annotation. The window size for an acceptor is 90 nucleotides {-70 to +20} with consensus AG at positions -69 and -70. This includes the last 70 nucleotides of the intron and the first 20 nucleotides of the succeeding exon. The donor splice sites have a window of 15 nucleotides {-7 to +8} with consensus GT at positions +1 and +2. This includes the last 7 bases of the exon and the first 8 bases of the succeeding intron. The dataset is available at []. This dataset is split into a training set and a test set. The training set contains 1116 true acceptor, 1116 true donor, 4672 false acceptor, and 4140 false donor sites. The test set contains 208 true acceptor sites, 208 true donor sites, 881 false acceptor sites, and 782 false donor sites. The accompanying figures show the two-sample logos [] of the NN269 acceptor and donor sites, which represent the residues enriched and depleted in the sample. In the NN269 acceptor dataset, AG is conserved at positions 69 and 70 of the sequences; for donor splice sites, GT is conserved at positions 8 and 9 of the sequences.

We also used a second dataset named DGSplicer []. The DGSplicer true dataset was created by extracting a collection of 2381 real acceptor sites and 2381 real donor sites from 462 annotated multiple-exon human genes from []. Two of the donor splice sites and one acceptor splice site contained symbols other than A, C, G, and T and were excluded, leaving a set of 2380 real acceptor sites and 2379 real donor sites. In addition, a large collection of 400314 pseudo acceptor sites and 283062 pseudo donor sites was collected from 462 annotated human genes and used as the false dataset []. The window size for the acceptor is 36 nucleotides {-27 to +9} with consensus AG at positions -26 and -27, which includes the last 27 nucleotides of the intron and the first 9 nucleotides of the succeeding exon. The donor splice sites have a window of 18 nucleotides {-9 to +9} with consensus GT at positions +1 and +2, which includes the last 9 bases of the exon and the first 9 bases of the succeeding intron. The dataset is available at [].

[...] Each model was trained in two stages: estimation of the MM1 (first-order Markov model) parameters, followed by training of an SVM with a second-order polynomial kernel. The training sequences were aligned with respect to the consensus dinucleotides prior to stage one. The MM1 estimates are the ratios of the frequencies of each dinucleotide at each sequence position, as shown in (14). Only the true splice site training sequences were used to create the Markov model. The desired output level was set to +1 or -1 according to the true or false splice site class label. We used the LIBSVM [] implementation of the support vector machine, which is freely available at []. […]
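The two training stages described above can be illustrated with a minimal sketch. It is not the authors' implementation, and equation (14) is not reproduced in this excerpt, so the exact estimator is an assumption: here each parameter is taken to be the count of a dinucleotide at positions (i-1, i) divided by the count of its first nucleotide at position i-1, computed from aligned true sites only. The function names (estimate_mm1, encode, poly2_kernel), the pseudocount, the use of the estimated probabilities as SVM input features, and the gamma and coef0 defaults are all hypothetical choices made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_mm1(true_seqs, pseudocount=1.0):
    """Estimate position-specific first-order Markov parameters.

    true_seqs: aligned true splice site windows of equal length.
    Returns (first, trans) where first[b] estimates P(s_0 = b) and
    trans[i][(a, b)] estimates P(s_i = b | s_{i-1} = a) for i >= 1.
    The pseudocount is an assumption added to avoid zero denominators.
    """
    length = len(true_seqs[0])
    if any(len(s) != length for s in true_seqs):
        raise ValueError("sequences must be aligned to the same length")
    n, k = len(true_seqs), len(ALPHABET)

    first_counts = Counter(s[0] for s in true_seqs)
    first = {b: (first_counts[b] + pseudocount) / (n + k * pseudocount)
             for b in ALPHABET}

    trans = [None]  # no transition into position 0
    for i in range(1, length):
        pair_counts = Counter((s[i - 1], s[i]) for s in true_seqs)
        prev_counts = Counter(s[i - 1] for s in true_seqs)
        trans.append({
            (a, b): (pair_counts[(a, b)] + pseudocount)
                    / (prev_counts[a] + k * pseudocount)
            for a in ALPHABET for b in ALPHABET
        })
    return first, trans


def encode(seq, first, trans):
    """Map one window to a vector of its per-position MM1 probabilities."""
    return [first[seq[0]]] + [trans[i][(seq[i - 1], seq[i])]
                              for i in range(1, len(seq))]


def poly2_kernel(x, y, gamma=1.0, coef0=1.0):
    """Second-order polynomial kernel (gamma * <x, y> + coef0) ** 2."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** 2


# Toy aligned windows (made-up data, not taken from NN269 or DGSplicer).
true_sites = ["CAGGT", "AAGGT", "CAGGT", "TCGGT"]
false_sites = ["TTGGT", "GCGGT"]

first, trans = estimate_mm1(true_sites)  # stage one: true sites only
features = [encode(s, first, trans) for s in true_sites + false_sites]
labels = [+1] * len(true_sites) + [-1] * len(false_sites)  # SVM targets
```

In stage two, the feature vectors and their +1/-1 labels would be handed to an SVM trainer such as LIBSVM, whose polynomial kernel has the form (gamma*u'*v + coef0)^degree; poly2_kernel above only shows what that kernel computes when the degree is 2.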

Pipeline specifications

Software tools: Two Sample Logo, LIBSVM
Applications: Miscellaneous, Genome data visualization
Diseases: Substance-Related Disorders
Chemicals: Nucleotides