Computational protocol: Enhanced Missing Proteins Detection in NCI60 Cell Lines Using an Integrative Search Engine Approach

Protocol publication

[…] The BAM files corresponding to the cell lines available in both the NCI60 data set and the CCLE project were downloaded from the GDC Data Portal (https://portal.gdc.cancer.gov). The reads were aligned against the hg19 reference genome. The annotation of the transcript structures of the human transcriptome was derived from MiTranscriptome. This assembly, based on 7256 RNA-Seq experiments from normal human tissues and cancer samples, contains 384 066 predicted transcripts, 165 020 of which correspond to protein-coding genes of Gencode version 19. The ab initio transcriptome assembly was performed using Cufflinks. These transcripts were quantified in each RNA-Seq experiment with featureCounts to obtain the matrix of expression levels for the 43 cell lines. Finally, a global normalization based on the mean library size was applied to make the samples comparable.

A multiomic bioinformatic analysis was used to highlight the samples in which missing proteins were most likely to be detected. For this purpose, we used the expression profiles of all the gene structures in the 43 NCI60 cell lines for which RNA-Seq experiments were available in the CCLE project. We considered a gene to be expressed when at least one of its transcripts was expressed. The thresholds for expressed and highly expressed genes were defined from the histogram of the normalized counts for all the gene structures in all the cell lines: a gene was considered expressed in a cell line when its expression value was greater than the first quartile (Q1) and highly expressed when it exceeded the third quartile (Q3). Using these thresholds as a reference, it was possible to identify which of the analyzed samples had an overrepresentation of missing proteins at the transcript level.
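The normalization and thresholding steps described above can be illustrated with a minimal Python sketch. It is not the authors' R code (which is in the Supporting Information). The function names, the dictionary layout, and the exact scaling rule (each library rescaled to the mean library size) are assumptions made for illustration, as is the quartile interpolation method, which is simply the default of Python's `statistics.quantiles`.

```python
from statistics import mean, quantiles


def normalize_by_mean_library_size(counts):
    """Rescale each sample so that its library size equals the mean library size.

    counts: dict mapping sample name -> dict mapping gene -> raw count.
    Assumption: "global normalization using the mean size of the libraries"
    is read here as count * mean_library_size / library_size.
    """
    lib_sizes = {sample: sum(genes.values()) for sample, genes in counts.items()}
    target = mean(list(lib_sizes.values()))
    return {
        sample: {gene: c * target / lib_sizes[sample] for gene, c in genes.items()}
        for sample, genes in counts.items()
    }


def classify_expression(normalized):
    """Label every gene in every sample using quartiles of the pooled values.

    Q1 and Q3 are computed once over all genes in all samples, mirroring the
    single histogram of normalized counts described in the text.
    """
    pooled = [v for genes in normalized.values() for v in genes.values()]
    q1, _median, q3 = quantiles(pooled, n=4)

    def label(value):
        if value > q3:
            return "highly_expressed"
        if value > q1:
            return "expressed"
        return "not_expressed"

    return {
        sample: {gene: label(v) for gene, v in genes.items()}
        for sample, genes in normalized.items()
    }
```

With a count matrix in this layout, calling `classify_expression(normalize_by_mean_library_size(counts))` returns one label per gene and cell line. Counting the missing-protein genes labelled `expressed` or `highly_expressed` in each cell line would then give the kind of per-sample ranking the text refers to.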
These cell lines would be considered good candidates for the validation of missing proteins, especially those that expressed a higher number of the one-hit wonders detected in the shotgun experiments (see the Supporting Information for more details and R code). [...] In the peptide detectability study, the input data were all the tryptic peptides of the human proteome and their detection frequency in proteomic experiments. Tryptic peptides were obtained from the neXtProt database using the Proteogest software, and detection frequencies for each peptide were downloaded from the GPMDB database (http://peptides.thegpm.org/~/peptides_by_species/). The total number of observations for each peptide was computed over all observations, regardless of the parent ion charge. More than 550 physicochemical and biochemical properties were then calculated for each tryptic peptide using the seqinr R package. These properties were: peptide length, peptide molecular weight, theoretical isoelectric point, percentage of different classes of amino acids (tiny, small, aliphatic, aromatic, nonpolar, polar, charged, positive, or negative amino acids), and the mean value of the characteristics stored in the AAindex database (release 9.1).

We sorted the tryptic peptides by their number of observations in proteomic experiments and compared the properties of the most observed peptides with those of the least observed ones. Five hundred times, we randomly sampled 5000 peptides from the 50 000 most observed peptides and 5000 peptides from the 50 000 least observed peptides. In this way, we performed 500 t tests for each feature, and we corrected the resulting p-values using FDR. There were 302 properties with FDR < 0.05 in all 500 tests, but some of them were redundant.
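The repeated sampling and testing scheme can be sketched as follows, in Python rather than the authors' R. Several choices here are assumptions and not taken from the source: the p-value uses a Welch statistic with a normal approximation (reasonable for samples of 5000, but not an exact t test), FDR is taken to mean the Benjamini-Hochberg procedure applied across features within each round, and a feature is kept only if it passes in every round. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means.

    Uses the Welch statistic with a normal approximation instead of the
    t distribution (an assumption that is adequate for large samples).
    """
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    if se == 0:
        return 1.0 if mean(a) == mean(b) else 0.0
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def consistently_different(top, bottom, sample_size, rounds, alpha=0.05, seed=0):
    """Return the features with adjusted p < alpha in every sampling round.

    top, bottom: dict mapping feature name -> list of values, one per peptide,
    for the most observed and least observed peptides respectively.
    """
    rng = random.Random(seed)
    names = sorted(top)
    n_top = len(top[names[0]])
    n_bottom = len(bottom[names[0]])
    passed = set(names)
    for _ in range(rounds):
        top_idx = rng.sample(range(n_top), sample_size)
        bottom_idx = rng.sample(range(n_bottom), sample_size)
        p = [
            welch_p_value([top[f][i] for i in top_idx],
                          [bottom[f][i] for i in bottom_idx])
            for f in names
        ]
        q = benjamini_hochberg(p)
        passed &= {f for f, q_value in zip(names, q) if q_value < alpha}
    return sorted(passed)
```

In the setting described in the text, `sample_size` would be 5000, `rounds` would be 500, and each of `top` and `bottom` would hold the property values of 50 000 peptides.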
For each group of correlated properties described by the AAindex database, we chose the feature with the best mean FDR.

A final selection of 106 nonredundant properties was used to distinguish between the most and the least observed peptides in the GPMDB database. For this purpose, the 100 000 tryptic peptides used for the selection of the differential peptide properties were divided into a training set (75% of the peptides) and a testing set (the remaining 25%). Different classification methods were trained, and their performance was evaluated using receiver operating characteristic (ROC) curve analysis. Some methods included built-in feature selection (RPART, C5, JRIP, Random Forest (RF), and PART), while others did not (Partial Least Squares (PLS), Generalized Linear Model (GLM), Naïve Bayes (NB), Neural Network (NNET), and Support Vector Machine (SVM.R)). This machine learning approach was performed with the caret R package, and the RF classifier proved to be the best option for the prediction of detectable peptides (see the Supporting Information for more details and R code). […]
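The evaluation scheme above (a 75%/25% split followed by ROC analysis) can be sketched in plain Python. This is an illustration under stated assumptions, not the caret workflow used by the authors: a simple nearest-centroid scorer stands in for the classifiers listed in the text, the split is a plain random shuffle, and all function names are hypothetical.

```python
import random


def split_train_test(items, train_fraction=0.75, seed=0):
    """Shuffle the items and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def fit_centroid_scorer(train):
    """Stand-in classifier: score = closeness to the class-1 centroid.

    train: list of (features, label) pairs with label 1 (most observed)
    or 0 (least observed). Higher scores mean "more likely detectable".
    """
    pos = centroid([x for x, y in train if y == 1])
    neg = centroid([x for x, y in train if y == 0])

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return d_neg - d_pos

    return score


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum identity (ties count 0.5).

    Quadratic in the number of examples, which is fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

To compare several classifiers as in the text, each one would be fitted on the same training set and ranked by the area under the ROC curve computed on the held-out testing set.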

Pipeline specifications

Software tools: Cufflinks, Subread, seqinr
Databases: GPMDB, neXtProt, CCLE, GENCODE, AAindex, MiTranscriptome
Applications: RNA-seq analysis, Genome data visualization
Organisms: Homo sapiens
Diseases: HIV Infections