Computational protocol: Distinguishing mirtrons from canonical miRNAs with data exploration and machine learning methods

Protocol publication

[…] We used two datasets in this study. The first, the miRBase set (Supplementary Table), consisted of mirtrons and canonical miRNAs deposited in miRBase (Release 21, 06/14). Because Wen et al. have provided the most comprehensive yet stringent mirtron/canonical miRNA annotation to date, we adopted it. From the database we extracted the hairpin sequences and the mature miRNA sequences from both arms. We restricted the set to pre-miRNAs that yield functional mature miRNAs from both hairpin arms. The resulting set contained 216 mirtrons and 707 canonical miRNAs. The second set, called the putative mirtrons set (Supplementary Table), consisted of 201 novel mirtron loci annotated in the study by Wen et al. Their sequences were retrieved with the UCSC browser, using the hairpin coordinates provided in the supplementary tables of Wen et al. Hairpin secondary structures and free energies for both sets were calculated using RNAfold (version 2.3.3) from the ViennaRNA Package with default options. [...]

For data visualization we performed Principal Component Analysis (PCA). Linearly dependent features had to be excluded from the PCA calculations, so we arbitrarily dropped the uracil compositions in all investigated hairpin regions, i.e. hairpin_U, mature5p_U, mature3p_U and interarm_U. The calculations were performed using the R prcomp function with prior data normalization. The ggplot2 package was used for plotting. The first two PCs explained 37.6% of the total variance, and the first three explained 46.8%. [...]
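The linear dependence among composition features arises because the four nucleotide fractions of a region always sum to one. The following is a minimal Python sketch of that point, not the authors' R code; the function name `composition`, the `prefix_N` naming of the features, and the toy sequence are assumptions made for illustration.

```python
from collections import Counter


def composition(seq, prefix):
    """Return the fractions of A, C, G and U in an RNA sequence.

    Feature names follow the prefix_N pattern seen in the text
    (e.g. hairpin_U); the exact naming scheme is an assumption.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = Counter(seq)
    return {f"{prefix}_{nt}": counts[nt] / len(seq) for nt in "ACGU"}


# Hypothetical toy sequence, not taken from miRBase.
features = composition("GUGAGUCCAUUGCAGGCACAG", "hairpin")

# The four fractions sum to 1, so any one of them is fully determined by
# the other three. Dropping hairpin_U removes that redundancy.
total = sum(features.values())
u_from_others = 1.0 - sum(v for k, v in features.items() if k != "hairpin_U")
```

Here `u_from_others` reproduces `features["hairpin_U"]` up to floating-point error, which is why one composition per region can be dropped without losing information. Which nucleotide to drop is arbitrary, as the text notes.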
We implemented six commonly used, methodologically diverse classifiers:
- Logistic Regression, calculated using the glm function
- Linear Discriminant Analysis, using the lda function from the MASS package with default parameters
- Support Vector Machine, using the svm function from the e1071 package with the default radial kernel and default parameters
- Naïve Bayes without smoothing, using the naiveBayes function from the e1071 package
- Decision Tree without pruning, using the tree package
- Random Forest, using the randomForest package with default parameters (500 trees)

Classifier performance was measured using 5-fold cross-validation. For each classifier we calculated the following performance measures:
- Sensitivity:
  $$\mathrm{Sens} = \frac{TP}{TP + FN} \tag{1}$$
- Specificity:
  $$\mathrm{Spec} = \frac{TN}{TN + FP} \tag{2}$$
- Area under the curve (AUC): the area under the ROC curve
- F1-score:
  $$F1_{\mathrm{score}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{3}$$
- Matthews Correlation Coefficient (MCC):
  $$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$

[…]
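The performance measures above can be computed directly from the confusion-matrix counts. The sketch below is a plain-Python illustration rather than the authors' R implementation; the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions. It expects both classes to be present in the data. AUC is computed through its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def performance(tp, fp, tn, fn):
    """Sensitivity, specificity, F1-score and MCC from the four counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sens": tp / (tp + fn),
        "spec": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        # Convention assumed here: MCC is 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = mirtron, 0 = canonical miRNA.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.45, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
result = performance(*counts)
result["auc"] = auc(y_true, scores)
```

In a 5-fold cross-validation setting, these functions would be applied to the held-out predictions of each fold and the results averaged or pooled; which of the two the authors did is not stated in the excerpt.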

Pipeline specifications

Software tools: RNAfold, ViennaRNA, ggplot2, randomForest
Databases: miRBase
Applications: Miscellaneous, RNA structure analysis
Chemicals: Guanine