Computational protocol: Transcriptome-Wide Annotation of m5C RNA Modifications Using Machine Learning

Protocol publication

[…] In this study, we constructed four m5C datasets: DatasetCV (cross-validation dataset), DatasetHT (hold-out test dataset), DatasetIT1 (independent test dataset for samples from the Arabidopsis silique tissue) and DatasetIT2 (independent test dataset for samples from the Arabidopsis shoot tissue). DatasetCV and DatasetHT were constructed based on m5C modifications in transcripts expressed in the Arabidopsis root tissue at single-nucleotide resolution using RNA bisulfite sequencing technology (David et al., ). During bisulfite conversion, unmethylated cytosines were converted into uracils, while methylated cytosines were not. Bisulfite-treated RNA samples were sequenced to generate 100-nt paired-end reads on the Illumina HiSeq 2500. Low-quality reads were processed using Trimmomatic (Bolger et al., ), and the remaining clean reads were globally mapped to in silico bisulfite-converted Arabidopsis reference genome sequences using the RNA mode of B-Solana (Kreck et al., ). For each cytosine site in the Arabidopsis reference genome, the methylation level was calculated using a proportion statistic: P = (C+Ψ)/(T+C), where C and T represent the numbers of cytosines and thymines in aligned reads at the cytosine site under analysis, respectively, and Ψ is an added pseudocount (1/8). The false discovery rate (FDR) was calculated using the R package qvalue (Storey, ). Cytosines were regarded as positive samples (m5C modifications) if they satisfied the following criteria: methylation level ≥1% and FDR ≤0.3. After the removal of sequence redundancy, we finally obtained 1,296 m5C modifications in 885 transcripts (Table ). In these 885 transcripts, cytosines were regarded as negative samples (non-m5C modifications) if they were not annotated as m5C modifications. To avoid over-fitting and GC bias in the training process, we limited the number of negative samples to 10 times that of the positive samples.
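The per-site methylation statistic and the positive-sample criteria above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are ours, and the FDR computation (R package qvalue) is not reproduced, so the FDR value is taken as an input.

```python
PSEUDO = 1.0 / 8.0  # pseudocount added to the cytosine count, per the text

def methylation_level(c_count: int, t_count: int) -> float:
    """Proportion statistic P = (C + psi) / (T + C), where C and T are the
    cytosine and thymine counts in reads aligned over the cytosine site."""
    return (c_count + PSEUDO) / (t_count + c_count)

def is_positive(c_count: int, t_count: int, fdr: float) -> bool:
    """A cytosine is called an m5C site if level >= 1% and FDR <= 0.3."""
    return methylation_level(c_count, t_count) >= 0.01 and fdr <= 0.3
```

For example, a site covered by 10 cytosine and 90 thymine reads has a methylation level of (10 + 0.125)/100 ≈ 0.101 and would be called positive given an acceptable FDR.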
Thus, for each positive sample, 10 negative samples whose GC content differed from it by no more than 5% were selected from the 200-nt region around the positive sample. This yields similar distributions for the positive and GC-matched negative samples, both markedly different from the background distribution of all cytosines in these 885 transcripts (Figure ). Note that some of the negative samples may in fact be true m5C modifications that have not yet been discovered. We randomly divided these 1,296 positive samples and 12,960 negative samples into two parts to construct DatasetCV and DatasetHT, respectively. DatasetCV comprises 1,196 positive samples and 11,960 negative samples, while DatasetHT has a balanced number (100) of positive and negative samples (Table ). Using the same criteria mentioned above, another two datasets (DatasetIT1: 79 positive and negative samples; DatasetIT2: 73 positive and negative samples) were constructed for the Arabidopsis silique and shoot tissues, respectively (Table ). Of note, the positive and negative samples in DatasetIT1 and DatasetIT2 did not overlap with those in DatasetCV and DatasetHT. Each sample in these four datasets was represented by a sequence window of 43 nucleotides centered on the respective cytosine site. For samples near the borders of the available RNA sequence, positions missing from the 43-nt window were filled with "N," the symbol for an unknown nucleotide. The Arabidopsis reference genome sequences (TAIR10) and annotated transcripts used in this study were downloaded from the Araport11 database (https://www.araport.org/data/araport11). [...] Ten-fold cross-validation experiments were performed to optimize the m5C prediction models in PEA-m5C by iteratively varying the window size and feature number. Cross-validation is a standard method for estimating the generalization accuracy of machine learning systems.
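The GC-matched negative sampling and the 43-nt window padding described above might look like this in outline. The helper names and the exact candidate scan are illustrative assumptions, not the authors' implementation.

```python
import random

def gc_content(seq: str) -> float:
    """Fraction of G/C characters in a sequence (N counts as non-GC)."""
    return sum(1 for b in seq.upper() if b in "GC") / len(seq)

def pad_window(transcript: str, center: int, flank: int = 21) -> str:
    """43-nt window (flank of 21 on each side) centered on the cytosine;
    positions beyond the transcript borders are filled with 'N'."""
    return "".join(
        transcript[i] if 0 <= i < len(transcript) else "N"
        for i in range(center - flank, center + flank + 1)
    )

def sample_negatives(transcript, pos_idx, positives,
                     n=10, region=100, max_gc_diff=0.05, flank=21):
    """Pick up to n non-m5C cytosines within the 200-nt region (±100 nt)
    around a positive site whose window GC content differs by <= 5%."""
    target_gc = gc_content(pad_window(transcript, pos_idx, flank))
    lo, hi = max(0, pos_idx - region), min(len(transcript), pos_idx + region + 1)
    candidates = [
        i for i in range(lo, hi)
        if transcript[i].upper() == "C" and i not in positives
        and abs(gc_content(pad_window(transcript, i, flank)) - target_gc) <= max_gc_diff
    ]
    return random.sample(candidates, min(n, len(candidates)))
```

Here GC matching is computed over the same padded window used for feature extraction; whether the published pipeline matched on the window or on a fixed-length flank is not stated in the excerpt.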
In ten-fold cross-validation, DatasetCV was randomly divided into 10 equal subsets, and each subset was iteratively selected as the testing set for evaluating the model trained with the other nine subsets. In each fold, considering the high imbalance between positive and negative samples (1:10), the negative samples were randomly divided into 10 parts, each of which was coupled with the full set of positive samples to train an RF-based m5C prediction model. Ten RF-based m5C prediction models were therefore constructed in the training process. In the testing process, each sample was scored with these ten models, and the corresponding ten prediction scores were averaged as the final prediction score of the sample under analysis. Once the testing process was completed, the prediction accuracy of PEA-m5C (an ensemble of ten RF-based m5C prediction models) was evaluated using receiver operating characteristic (ROC) analysis, which plots the true positive rate (TPR) against the false positive rate (FPR) at varying score thresholds. The area under the ROC curve (AUC) was used to quantitatively score the prediction performance of PEA-m5C. AUC ranges from 0 to 1; the higher the value, the better the prediction performance. After all 10 subsets had successively served as the testing set, the corresponding 10 AUC values were averaged as the overall prediction performance of PEA-m5C. PEA-m5C was optimized to maximize the AUC by iteratively varying the window size L from 5- to 43-nt and the feature number from 2 to 4*L+106. The feature subset was selected according to the feature importance estimated with the information gain approach implemented in the R package "FSelector" (Cheng et al., ). The detailed process of model optimization is given in Figure . We initialize the AUC matrix ("AUCMatrix") and the feature matrix ("FMatrix") as two empty sets (Lines 1-2).
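The per-fold ensemble logic (negative set split into ten parts, ten models, score averaging) and a rank-based AUC can be sketched as below. This is a simplified stand-in: `models` represents the ten trained RF scorers, which are not reproduced here.

```python
def split_negatives(negatives, k=10):
    """Split the 1:10 negative set into k parts; each part plus all
    positives trains one RF model, giving a k-model ensemble."""
    return [negatives[i::k] for i in range(k)]

def ensemble_score(sample, models):
    """Final prediction score: the average of the per-model scores."""
    return sum(m(sample) for m in models) / len(models)

def auc(scores_pos, scores_neg):
    """Rank-based AUC: the probability that a positive sample scores
    above a negative one (ties count as half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(scores_pos) * len(scores_neg))
```

The rank-based formulation is equivalent to integrating the ROC curve and avoids choosing explicit thresholds; a library routine such as scikit-learn's `roc_auc_score` would serve the same purpose.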
Then, for a given window size L (5-nt ≤ L ≤ 43-nt) (Line 3), we varied the upstream sequence length (Lu) from 1-nt to (L-2)-nt and the size of the feature subset from 2 to 4*L+106 (Lines 4-7). Subsequently, for each feature subset, we performed a 10-fold cross-validation experiment and stored the corresponding AUC value in a vector ("AUCVector") (Lines 8-9). After all possible feature subsets have been examined in 10-fold cross-validation experiments, the maximum AUC in "AUCVector" is stored in "AUCMatrix" (Lines 11-12), and the corresponding feature subset with the maximum AUC is stored in "FMatrix" (Lines 13-15). Finally, after all possible window sizes have been examined, the optimized Lu and Ld can be obtained by searching for the maximum value in "AUCMatrix" (Lines 18-19), and the optimized feature subset can be obtained by searching "FMatrix" with Lu and Ld (Lines 20-21). [...] iRNAm5C-PseDNC is the only available m5C predictor that aims to accurately predict m5C modifications in mammalian genomes. It was constructed using the RF algorithm with only PseDNC features and was trained with mammalian m5C modifications (window size: 41-nt) (Sun et al., ). To fairly compare the prediction performance of iRNAm5C-PseDNC with that of our proposed model PEA-m5C, we also re-trained iRNAm5C-PseDNC with the positive and negative samples of 41-nt in DatasetCV; this re-trained model was named iRNAm5C-PseDNC*. The prediction performance of iRNAm5C-PseDNC, iRNAm5C-PseDNC* and PEA-m5C was estimated on DatasetHT, DatasetIT1 and DatasetIT2 using six widely used measures: sensitivity (Sn, also known as recall), specificity (Sp), precision (Pr), accuracy (Acc), F1-score (F1), and Matthews correlation coefficient (MCC). These measures are defined as follows: Sn = TP/(TP+FN), Sp = TN/(TN+FP), Pr = TP/(TP+FP), Acc = (TP+TN)/(TP+TN+FP+FN), F1 = 2×Pr×Sn/(Pr+Sn), and MCC = (TP×TN − FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)), where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively. F1 is the harmonic mean of Pr and Sn.
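The optimization loop above (Lines 1-21) amounts to an exhaustive grid search over the window split and feature-subset size. A compact sketch, with a hypothetical `cv_auc(lu, ld, n_feat)` standing in for a full 10-fold cross-validation run, might be:

```python
def optimize(window_sizes, feature_counts_for, cv_auc):
    """Grid search over window size L, upstream length Lu (downstream
    length Ld = L - 1 - Lu, the center nucleotide taking one position),
    and feature-subset size; returns the best AUC and its settings.
    `cv_auc` is an assumed callback that runs 10-fold cross-validation."""
    best = (-1.0, None)  # (AUC, (lu, ld, n_feat))
    for L in window_sizes:                      # e.g. range(5, 44)
        for lu in range(1, L - 1):              # Lu from 1 to L-2
            ld = L - 1 - lu
            for n_feat in feature_counts_for(L):  # e.g. range(2, 4*L + 107)
                score = cv_auc(lu, ld, n_feat)
                if score > best[0]:
                    best = (score, (lu, ld, n_feat))
    return best
```

Keeping only the running maximum is equivalent to filling "AUCMatrix" and "FMatrix" and then searching them, as the figure describes; the matrices are useful when the full AUC landscape is to be inspected afterwards.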
Compared with Sn, Sp, Pr, and F1, Acc and MCC are two more informative measures that combine all of the prediction outcomes (TP, TN, FP, and FN) into a single score. Acc, which ranges from 0 to 1, measures the proportion of correct predictions. MCC, also known as the phi coefficient, measures the correlation between the observations and the predictions. It is generally regarded as a balanced measure that can be used even when the two classes are of very different sizes. The value of MCC ranges from −1 to 1, where 1 represents a perfect prediction, 0 indicates performance no better than random prediction, and −1 means total disagreement between observations and predictions. […]
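For concreteness, Acc and MCC can be computed from the confusion counts as follows. This is a small helper of our own; setting MCC to 0 when a marginal sum is zero is the usual convention, not something stated in the excerpt.

```python
import math

def acc_and_mcc(tp, tn, fp, fn):
    """Accuracy and Matthews correlation coefficient from the four
    confusion counts; MCC falls back to 0.0 on a zero denominator."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc
```

On a balanced test set such as DatasetHT, a perfect predictor gives (1.0, 1.0), and one that inverts every label gives (0.0, −1.0), matching the ranges described above.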

Pipeline specifications

Software tools FSelector, iRNAm5C-PseDNC
Applications Miscellaneous, RNA modification analysis
Organisms Arabidopsis thaliana
Chemicals 5-Methylcytosine