Computational protocol: Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling

Protocol publication

[…] The first step in the microarray methodology is to pool all the selected series, regardless of their technology (Affymetrix or Illumina). A quality assessment was then performed across the series in order to detect and remove any outliers. Outlier detection and removal were performed with the arrayQualityMetrics R package [], which computes the Kolmogorov-Smirnov statistic K_a between the distribution of each array and the distribution of the pooled data. Next, the samples were normalized using the normalizeBetweenArrays function of the limma R package [], in order to remove dynamic expression variability between samples. Once the samples were normalized, the gene expression values were obtained. Figure 1 outlines the microarray data analysis pipeline. [...]

The pipeline proposed by Anders et al. [] was followed for the extraction of RNA-Seq data, as shown in Fig. 2. Starting from the original SRA files, several tools such as sra-toolkit [], tophat2 [], bowtie2 [], samtools [] and htseq [] were used to obtain the read count for each gene. Once the read count files were obtained, the expression values were calculated using the cqn and NOISeq R packages []. [...]

This work proposes a new data processing pipeline that extends the classical gene expression data analysis pipeline in two ways. First, the pipeline integrates data from both microarray and RNA-Seq technologies. Second, once the integration has been carried out, a gene selection process and an assessment through a classification process are performed using separate training and test datasets. The workflow of the entire pipeline is shown in Fig. 3. In a first step, the samples from the microarray and RNA-Seq technologies were integrated using the merge function from base R.
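The outlier screen described above for the microarray series can be illustrated with a minimal Python sketch. It is not the arrayQualityMetrics implementation: it only computes a two-sample Kolmogorov-Smirnov statistic between each array and the pooled data, and the cutoff (upper boxplot whisker, Q3 + 1.5 × IQR of the K_a values) is an assumption made here, since the text does not state the threshold. The function names and the toy data are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def ks_statistic(sample, pooled):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample), sorted(pooled)
    # Both empirical CDFs are step functions, so the largest gap is reached
    # at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def flag_outlier_arrays(arrays):
    """arrays: dict mapping array name -> list of log-intensities."""
    pooled = [v for values in arrays.values() for v in values]
    k_a = {name: ks_statistic(values, pooled) for name, values in arrays.items()}
    # Assumed cutoff (not stated in the text): upper boxplot whisker.
    q1, _, q3 = quantiles(list(k_a.values()), n=4)
    cutoff = q3 + 1.5 * (q3 - q1)
    flagged = sorted(name for name, k in k_a.items() if k > cutoff)
    return flagged, k_a


# Toy data: nine similar arrays and one shifted by five log-units.
base = [x / 10 for x in range(100)]
arrays = {f"good{j}": [v + 0.01 * j for v in base] for j in range(9)}
arrays["bad"] = [v + 5 for v in base]
flagged, k_a = flag_outlier_arrays(arrays)
print(flagged)
```

In this toy setting the shifted array has by far the largest K_a, because its intensity distribution barely overlaps the pooled one. Real series would be screened on their actual intensity distributions before normalization.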
Once the gene expression values had been obtained for each technology separately, all technologies were normalized jointly by applying the normalizeBetweenArrays function mentioned above to all available datasets (see Table ). These tasks are essential to ensure that the biological data are correctly normalized before subsequent processing [, ]. Note that each of the series in Table  was originally quantified differently, depending on the respective technology and manufacturer. The next steps in the pipeline, the calculation of gene expression levels and the extraction of differentially expressed genes (DEGs), were carried out only on the training dataset, leaving the test dataset for later assessment. Gene extraction was performed with the limma R package at different levels: at the individual level (microarray data and RNA-Seq data separately) and at the integrated level (joined microarray and RNA-Seq data). […]
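The integration and joint normalization steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the join key is assumed to be a shared gene identifier (R's merge performs an inner join by default), and quantile normalization is assumed as the between-array method, since the text does not say which method of normalizeBetweenArrays was used. Ties are broken by position here, which is a simplification. The function names, gene symbols and values are hypothetical.

```python
def merge_on_genes(micro, rnaseq):
    """Inner join: keep genes measured on both platforms, concatenate samples."""
    shared = sorted(micro.keys() & rnaseq.keys())
    return {gene: micro[gene] + rnaseq[gene] for gene in shared}


def quantile_normalize(matrix):
    """matrix: dict gene -> list of values, one per sample (no missing values)."""
    genes = list(matrix)
    n_samples = len(matrix[genes[0]])
    columns = [[matrix[g][j] for g in genes] for j in range(n_samples)]
    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(vals) / n_samples for vals in zip(*sorted_cols)]
    normalized = {g: [0.0] * n_samples for g in genes}
    for j, col in enumerate(columns):
        # Replace each value by the reference value of the same rank.
        order = sorted(range(len(genes)), key=col.__getitem__)
        for rank, i in enumerate(order):
            normalized[genes[i]][j] = reference[rank]
    return normalized


# Toy data: two microarray samples and one RNA-Seq sample (made-up values).
micro = {"BRCA1": [5.0, 6.0], "ESR1": [4.0, 1.0], "TP53": [2.0, 3.0], "GATA3": [7.0, 7.5]}
rnaseq = {"BRCA1": [9.0], "ESR1": [12.0], "TP53": [7.0], "MYC": [8.0]}

merged = merge_on_genes(micro, rnaseq)
normalized = quantile_normalize(merged)
print(sorted(merged))
```

After normalization every sample shares the same set of values and only the gene ranks differ between samples. In the workflow described above, differential expression analysis would then run on the training portion of such a matrix only.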

Pipeline specifications