Computational protocol: Improving Protein Expression Prediction Using Extra Features and Ensemble Averaging

Protocol publication

[…] The models for protein expression prediction require input features. Four were used: codon bias, codon identification number, codon count and minimum free energy of folding. The first three features are illustrated in . Although the scope of the present article is limited to these four features, many others, such as the presence of rare codons [] or the location of G-quadruplexes [], could also have been used.

Codon bias refers to differences in the frequency with which a given codon appears in the mRNA of a protein, relative to the total number of codons encoding the same amino acid. It is estimated either by the codon frequency or by the Codon Adaptation Index (CAI), which measures the bias toward codons frequently used in highly expressed proteins of a given genome []. The present article uses the former. Throughout the article, "selected codon bias" means that the calculations used the bias of only a subset of codons, the same subset as in the article by Welch et al. "Codon bias" alone means that the bias of all codons was used.

Codon identification is a number from 1 to 64 that identifies a particular sequence of three nucleotides that encodes an amino acid. Standard dummy variables were not considered a good option for codon identification. With such variables, each codon would be encoded by a vector with 64 entries, 63 of which would be zero, with a one only at the position designated for that codon. Consequently, the vector encodings would be orthogonal []. Dummy variables were rejected because the dimensionality of the instances would be unnecessarily large, which could undermine the ability of the regression models to provide good results for new inputs, i.e. to generalize.
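The codon bias and codon identification features described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' MATLAB scripts. It assumes that codon bias is the frequency of each codon among the codons encoding the same amino acid (standard genetic code), and that identification numbers follow a plain lexicographic ordering over the bases A, C, G, T; the article's own numbering is defined in its codon table and may differ. The names `CODON_ID`, `codon_ids` and `codon_bias` are hypothetical.

```python
from collections import Counter
from itertools import product

# ASSUMPTION: identification numbers 1..64 in lexicographic A, C, G, T order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_ID = {codon: i + 1 for i, codon in enumerate(CODONS)}

# Standard genetic code, written in the conventional T, C, A, G table order.
_AA = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
       "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TO_AA = {"".join(p): aa for p, aa in zip(product("TCAG", repeat=3), _AA)}


def split_codons(seq):
    """Split a coding sequence (DNA or RNA alphabet) into codons."""
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def codon_ids(seq):
    """One integer per codon position (length depends on the sequence)."""
    return [CODON_ID[c] for c in split_codons(seq)]


def codon_bias(seq):
    """64 values: frequency of each codon among its synonymous codons."""
    counts = Counter(split_codons(seq))
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TO_AA[codon]] += n
    return [
        counts[c] / aa_totals[CODON_TO_AA[c]] if aa_totals[CODON_TO_AA[c]] else 0.0
        for c in CODONS
    ]


if __name__ == "__main__":
    seq = "ATGAAAAAAAAGTAA"  # toy sequence: M K K K stop
    print(codon_ids(seq))
    print(codon_bias(seq)[CODON_ID["AAA"] - 1])
```

With this encoding each codon position contributes a single integer to the input, whereas dummy variables would contribute 64 entries per position.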
In addition, the codon identification numbering (see ) is done in such a way that AAA is closer to AAC than to ACC, because they are more similar, and AAA is closer to CCC than to GGG or TTT, following a rule employed in the construction of a 4-ary tree that reveals hidden symmetries in the genetic code []. The rule consists of constructing the tree starting with the sequence A, C, G, T []. In fact, the sorting from is strongly related to the codon table generated from the 4-ary tree, which might explain why this sorting provided better results than random identifications.

Codon count is the number of times that each codon appears in the RNA sequence.

Free energy, also known as Gibbs free energy or minimum free energy (MFE), is minimised as a molecule folds, so that it reaches a configuration that is as stable as possible. MFE values for the whole mRNA sequence were determined using the Vienna RNA package (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi) for both Welch et al. [] and Supek and Smuk []. MFE values measured between nucleotides -4 and 37 or -4 and 38, for Welch et al. and Supek and Smuk, respectively, were obtained together with the datasets from these two works. The nucleotides are counted from the first nucleotide of the first open reading frame (ORF). Whether the models use the free energy of the whole sequence or of this window of nucleotides depends on which one provides the best validation results for each dataset.

The features described have a dimensionality of one for the MFE and 64 for the codon bias or count. In Welch et al., since three codons are missing, the codon bias dimensionality used is only 61. Finally, the codon identification feature has a dimensionality that depends on the protein size: 575 for Welch et al. and 288 for Supek and Smuk. Employing all the features simultaneously generates inputs of high dimensionality, which poses a risk to good generalization. In total, the features for the Welch et al. and Supek and Smuk datasets have a dimensionality of 701 and 417, respectively. Good generalization is ensured by the use of rigorous nested n-fold cross-validation together with partial least squares (PLS) regression or support vector machines, which are designed to handle large dimensionalities. This is explained in more detail in the next sections.

The rationale for choosing codon bias and minimum free energy as input features is the good results they produced in Welch et al. [] and Supek and Smuk []. Codon identification number and codon count were selected because the former provides an exact description of the sequence of codons, and the latter adds the importance of each codon with respect to the remaining ones in a simpler way than codon identification does.

The features codon bias, codon identification and codon count can be calculated using scripts included in the provided software (please see section ""). The scripts are inside the directory named “features” of , which provides all the code used to generate the results presented here. The MFE must be calculated using the Vienna RNA package. [...] The necessary software code was written in MATLAB. The PLS implementation used was from MATLAB toolboxes and the support vector regression (SVR) algorithm was from LIBSVM []. All the code and data are freely available at "http://sels.tecnico.ulisboa.pt/software/" under “Protein expression prediction” and in . An example of how to use the created ensembles is given in the folder “FinalCodonModel”. […]
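The total dimensionalities quoted in the text can be checked with a short calculation, shown here together with a sketch of the codon count feature. This Python sketch is independent of the authors' MATLAB code. The decomposition (one MFE value, plus codon bias, plus codon count, plus one identification number per codon position) is inferred from the numbers given in the text, and the function names `codon_count` and `total_dimensionality` are hypothetical.

```python
from collections import Counter
from itertools import product

# Fixed codon order used for the 64-entry count vector (an assumption).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_count(seq):
    """64 values: number of times each codon appears in the sequence."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    return [counts[c] for c in CODONS]


def total_dimensionality(n_positions, n_bias=64, n_count=64, n_mfe=1):
    """Input size when all four features are concatenated."""
    return n_mfe + n_bias + n_count + n_positions


# Welch et al.: 575 codon positions, codon bias restricted to 61 codons.
welch = total_dimensionality(575, n_bias=61)
# Supek and Smuk: 288 codon positions, all 64 codons used for the bias.
supek = total_dimensionality(288)

if __name__ == "__main__":
    print(welch, supek)  # 701 417
```

Both totals match the values reported in the text (1 + 61 + 64 + 575 = 701 and 1 + 64 + 64 + 288 = 417).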

Pipeline specifications

Software tools: RNAfold, LIBSVM
Applications: Miscellaneous, RNA structure analysis
Diseases: Epilepsies, Partial