Computational protocol: Predicting siRNA efficacy based on multiple selective siRNA representations and their combination at score level

Protocol publication

[…] Mounting evidence indicates that siRNA activity is influenced by the thermodynamic stability of the siRNA ends and by the energy gained from hybridization at the siRNA binding site, which together determine how accessible the target is for siRNA-mRNA interaction. We therefore incorporate these thermodynamic parameters into FQt in our prediction model. To the best of our knowledge, this is the first work to introduce the thermodynamic parameters of siRNA-mRNA binding into siRNA efficacy prediction.
The thermodynamics of siRNA-mRNA interaction consists of two components: the energy required to make a potential binding region accessible, and the energy gained from base pairing between the two interaction partners. The first component is recorded as two real numbers: the free energy for exposing the binding site in the siRNA (ΔGs) and in the mRNA (ΔGm). The second component is the energy gained by the siRNA-mRNA interaction (ΔGh). All three parameters can be obtained with RNAup, a web server tool developed by U. Mückstein at the University of Vienna. The tool requires only the sequences of the siRNA and the targeted mRNA, and returns the three thermodynamic parameters. We use RNAup to calculate these parameters for the siRNAs in Huesken’s dataset, and compute the Pearson correlation coefficient (PCC) between each parameter and the observed inhibition.
In the same comparison, we also collect the PCCs between the observed inhibition and some of the main features in the other groups of FQt. ΔGh achieves the highest PCC among the three thermodynamic parameters, and the PCCs of all three are comparable to those of the high-PCC features from the nucleotide frequency and thermodynamic stability groups. These results reveal a strong correlation between the thermodynamics of siRNA-mRNA interaction and siRNA efficacy.
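The correlation analysis above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the feature names `dG_s`, `dG_m`, `dG_h` and every number below are hypothetical placeholders, standing in for RNAup output and measured inhibitions that would have to be obtained separately.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical placeholder values (kcal/mol), NOT taken from Huesken's dataset.
features = {
    "dG_s": [-1.2, -0.4, -2.1, -0.9, -1.7],
    "dG_m": [-6.5, -9.8, -4.2, -7.7, -5.1],
    "dG_h": [-30.1, -25.4, -35.2, -28.0, -33.3],
}
# Hypothetical observed inhibitions on an assumed 0-1 scale.
inhibition = [0.62, 0.41, 0.88, 0.55, 0.79]

# One PCC per thermodynamic parameter, then rank by absolute correlation.
pcc_by_feature = {name: pearson(vals, inhibition) for name, vals in features.items()}
ranked = sorted(pcc_by_feature, key=lambda k: abs(pcc_by_feature[k]), reverse=True)
```

Ranking by absolute value is a choice made here for illustration, since a strongly negative correlation is as informative as a strongly positive one; the source reports the PCC values themselves.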
We further investigate how well these parameters discriminate active siRNAs from inactive ones. We divide the siRNAs in Huesken’s dataset into two classes using a threshold of 70% inhibition of the targeted mRNA, and draw box plots of the three thermodynamic parameters to show their distributions in active and inactive siRNAs. The box plots show that the three thermodynamic parameters discriminate between active and inactive siRNAs. We therefore believe that they are effective and meaningful for siRNA efficacy prediction. [...] The four groups of features introduced above are combined into a mixed feature vector that forms the quantitative representation FQt of an siRNA. They quantitatively characterize the siRNA in terms of sequence frequencies, thermodynamic stability profile, thermodynamics of siRNA-mRNA interaction, and the targeted mRNA. However, because direct experimental evidence linking these quantitative features to siRNA activity is lacking, we investigate the contribution of each feature in FQt with a feature selection method.
The F-score is a straightforward measure of how well a feature discriminates between two sets, and is frequently used for feature selection in two-class classification problems. The F-score of the i-th feature is defined as:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
where $\bar{x}_i$, $\bar{x}_i^{(+)}$, and $\bar{x}_i^{(-)}$ are the averages of the i-th feature over the whole, positive, and negative samples, respectively; $x_{k,i}^{(+)}$ is the i-th feature of the k-th positive sample, and $x_{k,i}^{(-)}$ is the i-th feature of the k-th negative sample; $n_+$ and $n_-$ are the numbers of positive and negative samples. The larger the F-score, the more discriminative the feature. It can therefore serve as a criterion for selecting the most important subset of features. In our algorithm, we label the siRNAs in Huesken’s dataset into two categories according to the 70% threshold mentioned above.
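The F-score defined above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the LIBSVM tool itself: the function names are hypothetical, inhibition is assumed to lie on a 0-1 scale, and treating a value of exactly 70% as active is an assumption the source does not settle.

```python
def label_by_threshold(inhibition, threshold=0.70):
    """Label siRNAs as active (1) or inactive (0); '>=' at the cut is an assumption."""
    return [1 if value >= threshold else 0 for value in inhibition]


def f_score(pos, neg):
    """F-score of a single feature from its values in positive and negative samples."""
    n_pos, n_neg = len(pos), len(neg)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("each class needs at least two samples")
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Numerator: squared distance of each class mean from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the two within-class sample variances.
    within = (sum((v - mean_pos) ** 2 for v in pos) / (n_pos - 1)
              + sum((v - mean_neg) ** 2 for v in neg) / (n_neg - 1))
    if within == 0:
        raise ValueError("F-score is undefined when both classes are constant")
    return between / within


def feature_f_scores(rows, labels):
    """F-score of every column in a row-major feature matrix."""
    scores = []
    for j in range(len(rows[0])):
        pos = [row[j] for row, lab in zip(rows, labels) if lab == 1]
        neg = [row[j] for row, lab in zip(rows, labels) if lab == 0]
        scores.append(f_score(pos, neg))
    return scores


# Hypothetical toy data: column 0 separates the classes, column 1 does not.
rows = [[1, 10.0], [2, 10.5], [3, 9.5], [4, 10.2], [5, 9.8], [6, 10.0]]
labels = label_by_threshold([0.20, 0.35, 0.50, 0.75, 0.80, 0.95])
scores = feature_f_scores(rows, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

A ranking like this is what a subset search would then consume, keeping the top-ranked features and evaluating each candidate subset with the downstream model.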
We then calculate the F-score of each feature in FQt using the tool provided by LIBSVM, and conduct a binary search to choose the best feature subset. The selected features are deemed strongly relevant to siRNA efficacy, while the excluded features are considered weakly relevant. From the experiments (details in the “Results of feature selection” section), we obtained 68 selected features, which form the optimal quantitative representation. [...] In this article, we adopt the Pearson correlation coefficient (PCC), the most commonly used measure for regression systems, to quantify the correlation between the predicted efficacy and the observed inhibition. It is defined as:
$$\mathrm{PCC} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma_X}\right)\left(\frac{Y_i - \bar{Y}}{\sigma_Y}\right)$$
where X and Y are the predicted values and the observed labels, and n is their common size. $\bar{X}$ and $\sigma_X$ denote the mean and (sample) standard deviation of X; likewise, $\bar{Y}$ and $\sigma_Y$ denote the mean and (sample) standard deviation of Y.
As mentioned above, some studies also treat siRNA efficacy prediction as a classification problem. We therefore also employ classification indicators, including sensitivity and specificity, calculated as follows:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}$$
where TN, FN, TP, and FP are the numbers of true negatives, false negatives, true positives, and false positives, respectively.
In addition, the Receiver Operating Characteristic (ROC) curve is used to show the overall performance of the algorithms. The ROC curve plots the true positive rate (i.e. sensitivity) against the false positive rate (i.e. 1 − specificity) at different thresholds. The area under the ROC curve (AUC) serves as a further measure of the reliability of a classification system: a perfect classifier attains the maximum AUC of 1, while an AUC of 0.5 corresponds to random classification. […]
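The classification indicators above can be illustrated with a short Python sketch. It assumes labels are encoded as 1 (active) and 0 (inactive) and that a sample is called positive when its score is at or above the threshold. The function names are hypothetical, and the trapezoidal threshold sweep is one common way to compute the AUC, not necessarily the procedure the authors used.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(y_true, scores):
    """ROC points (FPR, TPR) from sweeping the threshold over every distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in y_true")
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points  # the last point is (1.0, 1.0): everything called positive


def auc(points):
    """Area under a piecewise-linear ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy example: true classes and predicted efficacy scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.1]
y_pred = [1 if s >= 0.70 else 0 for s in scores]  # assumed 70% decision cut
sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(roc_points(y_true, scores))
```

Sweeping only the distinct score values handles tied scores consistently, since all samples sharing a score cross the threshold together.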

Pipeline specifications

Software tools: LIBSVM, siPRED
Applications: Miscellaneous, Non-coding RNA analysis