Computational protocol: Automatic learning of pre-miRNAs from different species

Protocol publication

[…] Probabilistic properties refer to parameters that measure the stability (or variation) of certain features when computed from a sequence and from its randomized (shuffled) versions []. Since real pre-miRNAs are energetically more stable than pseudo-hairpins, the difference between the MFE of a real pre-miRNA and the mean MFE of its shuffled sequences (MFEshuf) is not expected to be negligible. Two formulas have been proposed to capture this energy stability. The first, zG [], is the difference between MFE and MFEshuf in units of the standard deviation of the shuffled values (SDshuf), i.e., zG = (MFE - MFEshuf)/SDshuf. The second, p (randfold) [, ], is the relative frequency with which MFEshuf was lower than MFE. Analogously to zG, the z-variants of dP, dQ, dD and dF were also assessed, represented as zP, zQ, zD and zF [].

[...] The learning algorithms used in this work were Support Vector Machines (SVMs), Random Forest (RF) and J48. These algorithms have different learning biases, which is important for the present work because a learning bias may favor one feature set over others. SVMs and RFs are the algorithms most frequently used for pre-miRNA classification, and J48 was chosen for its simplicity and interpretability.

J48 implements the well-known C4.5 algorithm []. One of the most popular algorithms based on the divide-and-conquer paradigm, C4.5 recursively divides the training set into two or more smaller subsets so as to maximize the information gain of each split. The J48 implementation builds pruned or unpruned decision trees from a set of labeled training data. We used RWeka [], an R interface to Weka [], with the default parameter values, under which RWeka induces pruned decision trees.

To train SVMs, we used a Python interface to the library LIBSVM 3.12 []. This interface implements the C-SVM algorithm with the RBF kernel. The parameters C and γ were tuned by 5-fold cross-validation (CV) over the grid (C, γ) ∈ {2^-5, 2^-3, …, 2^15} × {2^-15, 2^-13, …, 2^3}. The pair (C, γ) that led to the highest CV predictive accuracy on the training subsets was used to train an SVM on the whole training set, and the resulting classifier was applied to the instances of the corresponding test set.

RF ensembles were induced over the grid (30, 40, 50, 60, 70, 80, 90, 100, 150, 250, 350, 450) × [(0.5, 0.75, 1, 1.25, 1.5) * d], representing, respectively, the number of trees and the number of features tried at each node split. The value d is the default number of features tried at each split, which is derived from the dimension of the feature space (the number of features in the feature set). We chose the ensemble with the lowest generalization error over the grid, estimated on the training set, and applied it to classify the instances of the corresponding test set. The ensembles were obtained with the randomForest R package [] in an in-house R script. […]
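As an illustration of these probabilistic properties, the sketch below computes zG and a randfold-style p for a single sequence. It assumes ViennaRNA 2.x is installed so that the RNAfold command is on the PATH; the mfe() helper, the output parsing, and the plain mononucleotide shuffle are illustrative assumptions (randfold itself uses dinucleotide-preserving shuffling).

```r
# Hedged sketch: zG and an empirical p from shuffled-sequence MFEs.
# Assumes ViennaRNA's RNAfold is on the PATH (an assumption, not part
# of the protocol text). Mononucleotide shuffling is used for brevity.
mfe <- function(seq) {
  out <- system2("RNAfold", args = "--noPS", input = seq, stdout = TRUE)
  # The second output line ends with the energy, e.g. "(((...))) ( -23.40)"
  as.numeric(sub(".*\\(\\s*(-?[0-9.]+)\\)$", "\\1", out[2]))
}

shuffle_seq <- function(seq) {
  paste(sample(strsplit(seq, "")[[1]]), collapse = "")
}

zG_and_p <- function(seq, n = 1000) {
  e      <- mfe(seq)
  e_shuf <- vapply(seq_len(n), function(i) mfe(shuffle_seq(seq)), numeric(1))
  list(zG = (e - mean(e_shuf)) / sd(e_shuf),  # (MFE - MFEshuf) / SDshuf
       p  = mean(e_shuf < e))                 # frequency of shuffles more stable than the original
}
```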
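A minimal J48 run with RWeka under the default parameter values, which induce a pruned C4.5 tree, might look like this; the file names and the class column are placeholders, not part of the protocol.

```r
# Hedged sketch: J48 (C4.5) via RWeka with default parameters.
library(RWeka)

train <- read.csv("train.csv")          # placeholder training set
test  <- read.csv("test.csv")           # placeholder test set
train$class <- as.factor(train$class)   # J48 requires a factor response

model <- J48(class ~ ., data = train)   # defaults yield a pruned tree
print(model)                            # the tree itself is interpretable
pred  <- predict(model, newdata = test) # classify the test instances
```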
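The protocol ran the C-SVM grid search through LIBSVM's Python interface; since the e1071 R package wraps the same LIBSVM code, the 5-fold CV over the stated grid can be sketched in R as follows (train and test as above; e1071 is a substitution for illustration, not the protocol's tool).

```r
# Hedged sketch: RBF-kernel C-SVM tuned by 5-fold CV over the log2 grid.
library(e1071)

tuned <- tune.svm(class ~ ., data = train,
                  kernel = "radial",
                  cost   = 2^seq(-5, 15, by = 2),   # C     = 2^-5, 2^-3, ..., 2^15
                  gamma  = 2^seq(-15, 3, by = 2),   # gamma = 2^-15, 2^-13, ..., 2^3
                  tunecontrol = tune.control(cross = 5))

best <- tuned$best.model              # refit with the best (C, gamma) on the whole training set
pred <- predict(best, newdata = test) # classify the test instances
```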
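Finally, a sketch of the RF grid search with the randomForest package. The protocol does not name its error estimator, so the out-of-bag (OOB) error on the training set is assumed here as the generalization-error estimate.

```r
# Hedged sketch: grid search over ntree and mtry for randomForest.
library(randomForest)

p <- ncol(train) - 1     # number of features
d <- floor(sqrt(p))      # randomForest's default mtry for classification
grid <- expand.grid(
  ntree = c(30, 40, 50, 60, 70, 80, 90, 100, 150, 250, 350, 450),
  mtry  = unique(pmax(1, round(c(0.5, 0.75, 1, 1.25, 1.5) * d))))

# OOB error of the full ensemble at each grid point (assumed estimator)
oob <- apply(grid, 1, function(g) {
  rf <- randomForest(class ~ ., data = train,
                     ntree = g["ntree"], mtry = g["mtry"])
  rf$err.rate[g["ntree"], "OOB"]
})

best  <- grid[which.min(oob), ]
model <- randomForest(class ~ ., data = train,
                      ntree = best$ntree, mtry = best$mtry)
pred  <- predict(model, newdata = test)   # classify the test instances
```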

Pipeline specifications